.. _slep_025:

==============================================
SLEP025: Losing Accuracy in Scikit-Learn Score
==============================================

:Author: Christian Lorentzen
:Status: Rejected
:Type: Standards Track
:Created: 2025-12-07
:Resolution: https://github.com/scikit-learn/enhancement_proposals/pull/98

Abstract
--------

This SLEP proposes to rectify the default ``score`` method for scikit-learn
classifiers. Currently, the ease of ``classifier.score(X, y)`` favors the use of
*accuracy*, which has many well-known deficiencies. This SLEP changes the default
scoring method.

Motivation
----------

As it stands, *accuracy* is the most used metric for classifiers in scikit-learn.
This is manifest in ``classifier.score(..)``, which applies accuracy.
While the original goal might have been to provide a score method that works for all
classifiers, the actual implication has been the blind usage, without critical
thinking, of the accuracy score.
This has misled many researchers and users because accuracy is well known for its
severe deficiencies: to the point, it is not a *strictly proper scoring rule*, and
scikit-learn's implementation hard-codes a probability threshold of 50% into it by
relying on ``predict``.

This situation calls for a correction.
Ideally, scikit-learn provides good defaults or fosters a conscious decision by
users, e.g. by forcing engagement with the subject, see [2]_ subsection
"Which scoring function should I use?".

Solution
--------

The solution is a multi-step approach:

1. Introduce the new keyword ``scoring`` to the ``score`` method. The default for
   classifiers is ``scoring="accuracy"``, for regressors ``scoring="r2"``.
2. Deprecate the default ``"accuracy"`` for classifiers.
3. After the deprecation period, set a new default for classifiers:
   ``"d2_brier_score"``.

There are two main questions with this approach:

a. The time frame of the deprecation period. Should it be longer than the usual 2
   minor releases?
   Should steps 1 and 2 happen in the same minor release?
b. What is the new default scoring parameter in step 3?
   The fact that different scoring metrics focus on different things, i.e.
   ``predict`` vs. ``predict_proba``, and that not all classifiers provide
   ``predict_proba`` complicates a unified choice.
   Possibilities are:

   - D2 Brier score, ``"d2_brier_score"``, which is basically the same as R2 for
     regressors,
   - the objective function of the estimator, i.e. the penalized log loss for
     ``LogisticRegression``.

Proposals:

a. Use a deprecation period of 4 instead of 2 minor releases, which amounts to
   2 years, and do steps 1 and 2 at the same time (in the same release).
   Reasoning: It is a deprecation that is doable within the current deprecation
   habit of minor releases. It should be longer than the usual 2 minor releases
   because of its big impact. A major release just because of such a deprecation is
   not very attractive (or marketable).
b. Use D2 Brier score.
   Reasoning: Scores will be compared among different models. Therefore, the model
   specific loss is not suitable.
   Note that the Brier score and hence also the D2 Brier score are strictly proper
   scoring rules (or strictly consistent scoring functions) for the probability
   predictions with ``predict_proba``.
   At the same time, Brier score returns a valid score even for ``predict`` (in case
   a classifier has no ``predict_proba``), in contrast to log loss (which returns
   infinity for false certainty).
   On top of that, this would result in classifiers and regressors having the same
   score (it's just a different name), returning values in the range (-inf, 1].
   Note that the D2 Brier score as a skill score (a score relative to a baseline)
   is invariant under a multiplicative factor, e.g. specified by ``scale_by_half``.
   It is given by ``1 - MSE(model predictions) / MSE(mean of data)``.

Backward compatibility
----------------------

The outlined solution would be feasible within the usual deprecation strategy of
scikit-learn releases.
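
To give a rough idea of what the change of default would mean numerically for code
that calls ``score``, the following minimal sketch compares accuracy with the D2
Brier score on the same predictions. It is an illustration only, not scikit-learn's
implementation: the helpers ``d2_brier_sketch`` and ``accuracy_sketch`` and the toy
data are hypothetical, and the D2 Brier score is computed for the binary case
directly from the formula ``1 - MSE(model predictions) / MSE(mean of data)``.

.. code-block:: python

    def d2_brier_sketch(y_true, proba):
        """Hypothetical helper: 1 - MSE(model predictions) / MSE(mean of data)."""
        n = len(y_true)
        # Baseline: always predict the observed frequency of the positive class.
        base_rate = sum(y_true) / n
        mse_model = sum((p - y) ** 2 for p, y in zip(proba, y_true)) / n
        mse_baseline = sum((base_rate - y) ** 2 for y in y_true) / n
        return 1.0 - mse_model / mse_baseline


    def accuracy_sketch(y_true, proba, threshold=0.5):
        """Hypothetical helper: accuracy of hard labels cut at a 50% threshold."""
        hits = sum((p > threshold) == bool(y) for p, y in zip(proba, y_true))
        return hits / len(y_true)


    y_true = [0, 0, 0, 1, 1, 0, 1, 0]
    # Two made-up models with identical hard predictions but different confidence.
    proba_confident = [0.1, 0.1, 0.2, 0.9, 0.8, 0.1, 0.9, 0.2]
    proba_hesitant = [0.4, 0.4, 0.45, 0.6, 0.55, 0.4, 0.6, 0.45]

    acc_confident = accuracy_sketch(y_true, proba_confident)
    acc_hesitant = accuracy_sketch(y_true, proba_hesitant)
    d2_confident = d2_brier_sketch(y_true, proba_confident)
    d2_hesitant = d2_brier_sketch(y_true, proba_hesitant)

    # Hard 0/1 labels (as from ``predict``) are also valid inputs.
    d2_hard_labels = d2_brier_sketch(y_true, y_true)

In this toy example both models reach an accuracy of 1.0, whereas the D2 Brier score
separates them (roughly 0.91 versus 0.25). Existing code that compares ``score``
values against fixed thresholds would therefore see different numbers after the
default changes.
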

Alternatives
------------

Removing
^^^^^^^^

An alternative is to remove the ``score`` method altogether.
Scoring metrics are readily available in scikit-learn, see the ``sklearn.metrics``
module and [2]_.
The advantages of removing ``score`` are:

- An active choice by the user is triggered as there is no more default.
- Defaults for ``score`` are tricky anyway. Different estimators estimate different
  things and the output of their ``score`` method most likely is not comparable,
  e.g. consider a hinge loss based SVM vs. log loss based logistic regression.

Disadvantages:

- Disruption of the API.
- Very likely a major release for something not very marketable.
- More imports required and slightly longer code compared to just
  ``my_estimator.score(X, y)``.

Keep status quo
^^^^^^^^^^^^^^^

Advantages:

- No change or breakage for users
- No resources tied up

Disadvantages:

- No change for users
- Bad practice is continued
- Bad signal: the scikit-learn community is unable to rectify a serious grievance

Discussion
----------

The following issues contain discussions on this subject:

- https://github.com/scikit-learn/scikit-learn/issues/28995

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the
   `Open Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

.. [2] Scikit-Learn User Guide on "Metrics and Scoring"
   https://scikit-learn.org/stable/modules/model_evaluation.html

Copyright
---------

This document has been placed in the public domain. [1]_