This SLEP proposes the introduction of a public
for most estimators (where relevant).
Knowing the number of features that an estimator expects is useful for
inspection purposes. This is also useful for implementing the feature names
propagation (SLEP 8) . For
example any of the scaler can easily create feature names if they know
The proposed solution is to replace most calls to
check_X_y() by calls to a newly created private method:
def _validate_data(self, X, y=None, reset=True, **check_array_params) ...
_validate_data() method will call
check_X_y() function depending on the
reset parameter is True (default), the method will set the
n_feature_in_ attribute of the estimator, regardless of its potential
previous value. This should typically be used in
fit(), or in the first
partial_fit() call. Passing
reset=False will not set the attribute but
instead check against it, and potentially raise an error. This should typically
be used in
transform(), or on subsequent calls to
In most cases, the
n_features_in_ attribute exists only once
been called, but there are exceptions (see below).
A new common check is added: it makes sure that for most estimators, the
n_features_in_ attribute does not exist until
fit is called, and
that its value is correct. Instead of raising an exception, this check will
raise a warning for the next two releases. This will give downstream
packages some time to adjust (see considerations below).
Since the introduced method is private, third party libraries are recommended not to rely on it.
The logic that is proposed here (calling a stateful method instead of a
stateless function) is a pre-requisite to fixing the dataframe column
ordering issue: with a stateless
check_array, there is no way to raise
an error if the column ordering of a dataframe was changed between
predict. This is however out os scope for this SLEP, which only focuses
on the introduction of the
The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless the estimators also have the
The newly introduced checks will only raise a warning instead of an exception for the next 2 releases, so this will give more time for downstream packages to adjust.
There are other minor considerations:
- In most meta-estimators, the input validation is handled by the
n_features_in_attribute of the meta-estimator is thus explicitly set to that of the sub-estimator, either via a
@property, or directly in
- Some estimators like the dummy estimators do not validate the input
(the ‘no_validation’ tag should be True). The
n_features_in_attribute should be set to None, though this is not enforced in the common check.
- Some estimators expect a non-rectangular input: the vectorizers. These
estimators expect dicts or lists, not a
n_samples * n_featuresmatrix.
n_features_in_makes no sense here and these estimators just don’t have the attribute.
- Some estimators may know the number of input features before
fitis called: typically the
n_feature_in_is known at
dictionaryparameter. In this case the attribute is a property and is available right after object instantiation.
References and Footnotes¶
|||Each SLEP must either be explicitly labeled as placed in the public domain (see this SLEP as an example) or licensed under the Open Publication License.|