SLEP015: Feature Names Propagation
- Author: Thomas J Fan
- Status: Rejected
- Type: Standards Track
- Created: 2020-10-03
Abstract
This SLEP proposes adding the get_feature_names_out method to all transformers and the feature_names_in_ attribute to all estimators. The feature_names_in_ attribute is set during fit if the input, X, contains feature names.
Motivation
scikit-learn is commonly used as a part of a larger data processing pipeline. When this pipeline is used to transform data, the result is a NumPy array, which discards the column names. The current workflow for extracting the feature names requires calling get_feature_names on the transformer that created each feature. This interface can be cumbersome when used with a pipeline that processes several groups of columns:
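import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                  'pet': ['dog', 'snake', 'dog'],
                  'distance': [1, 2, 3]})
y = [0, 0, 1]
orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']
ct = ColumnTransformer(
    [('cat', OneHotEncoder(), orig_cat_cols),
     ('num', StandardScaler(), orig_num_cols)])
pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
# The encoder is registered under the name 'cat' in the ColumnTransformer.
cat_names = (pipe['columntransformer']
             .named_transformers_['cat']
             .get_feature_names(orig_cat_cols))
feature_names = np.r_[cat_names, orig_num_cols]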
The feature_names extracted above correspond to the features passed directly into LogisticRegression. As demonstrated above, extracting feature_names requires knowing the order of the selected categories in the ColumnTransformer. Furthermore, if there is feature selection in the pipeline, such as SelectKBest, the get_support method would need to be used to infer the column names that were selected.
Solution
This SLEP proposes adding the feature_names_in_ attribute to all estimators, which will store the feature names of X during fit. The attribute will also be used for validation during non-fit methods such as transform or predict. If X is not a recognized container with columns, then feature_names_in_ can be undefined. If feature_names_in_ is undefined, then it will not be validated.
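The behaviour described above can be sketched in plain Python. This is only an illustration, not scikit-learn code: ToyCenterer and _get_feature_names are hypothetical names, a dict mapping column names to lists stands in for a DataFrame, and skipping validation when the input to transform carries no names is an assumption that the SLEP text does not spell out.

```python
def _get_feature_names(X):
    """Return column names if X is a named container, else None.

    Assumption: a dict of column name -> list of values stands in for a
    DataFrame; any other input is treated as unnamed data.
    """
    if isinstance(X, dict):
        return list(X.keys())
    return None


def _as_columns(X):
    """Return the data as a list of columns for both input kinds."""
    if isinstance(X, dict):
        return [list(col) for col in X.values()]
    return [list(col) for col in zip(*X)]  # X is a list of rows


class ToyCenterer:
    """Hypothetical transformer that subtracts each column's mean."""

    def fit(self, X):
        names = _get_feature_names(X)
        if names is not None:
            # Only defined when the input carries feature names.
            self.feature_names_in_ = names
        self.means_ = [sum(col) / len(col) for col in _as_columns(X)]
        return self

    def _validate_names(self, X):
        expected = getattr(self, "feature_names_in_", None)
        seen = _get_feature_names(X)
        # Assumption: nothing is checked if either side has no names.
        if expected is None or seen is None:
            return
        if list(seen) != list(expected):
            raise ValueError(
                f"Feature names {seen} do not match those seen in fit: {expected}"
            )

    def transform(self, X):
        self._validate_names(X)
        centered = [
            [value - mean for value in col]
            for col, mean in zip(_as_columns(X), self.means_)
        ]
        # Return the result as unnamed rows, mirroring a plain array output.
        return [list(row) for row in zip(*centered)]


named = ToyCenterer().fit({"a": [1, 3], "b": [10, 20]})
unnamed = ToyCenterer().fit([[1, 10], [3, 20]])
```

Here named.feature_names_in_ holds the two column names, while unnamed has no such attribute and therefore never validates names. Calling named.transform with the columns in a different order raises a ValueError.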
Secondly, this SLEP proposes adding get_feature_names_out(input_names=None) to all transformers. By default, the input features will be determined by the feature_names_in_ attribute. The feature names of a pipeline can then be easily extracted as follows:
pipe[:-1].get_feature_names_out()
# ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
#  'cat__pet_dog', 'cat__pet_snake', 'num__distance']
Note that get_feature_names_out does not require input_names because the feature names were stored in the pipeline itself. These feature names will be passed to each step’s get_feature_names_out method to obtain the output feature names of the Pipeline itself.
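The step-by-step propagation can be illustrated with a small plain-Python sketch. The classes ToyOneHot, ToySelector and ToyPipeline are hypothetical stand-ins rather than scikit-learn classes, a dict of columns stands in for a DataFrame, the selector's mask is supplied by hand instead of being learned, and the naming scheme is simplified (no transformer-name prefix). The parameter is called input_names to match the signature given in this section.

```python
class ToyOneHot:
    """Hypothetical encoder: one output feature per (column, category)."""

    def fit(self, X):
        # X is assumed to be a dict of column name -> list of values.
        self.feature_names_in_ = list(X)
        self.categories_ = [sorted(set(values)) for values in X.values()]
        return self

    def get_feature_names_out(self, input_names=None):
        if input_names is None:
            # Default to the names recorded during fit.
            input_names = self.feature_names_in_
        return [
            f"{name}_{category}"
            for name, categories in zip(input_names, self.categories_)
            for category in categories
        ]


class ToySelector:
    """Hypothetical feature selector with a fixed boolean mask."""

    def __init__(self, support):
        self.support_ = list(support)

    def get_feature_names_out(self, input_names=None):
        if input_names is None:
            input_names = getattr(self, "feature_names_in_", None)
        if input_names is None:
            raise ValueError("input_names is required: no names seen in fit")
        return [n for n, keep in zip(input_names, self.support_) if keep]


class ToyPipeline:
    """Hypothetical pipeline that only chains feature names."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def get_feature_names_out(self, input_names=None):
        names = input_names
        for step in self.steps:
            # Each step maps its input names to its output names.
            names = step.get_feature_names_out(names)
        return names


X = {"letter": ["a", "b", "c"], "pet": ["dog", "snake", "dog"]}
encoder = ToyOneHot().fit(X)
selector = ToySelector([True, False, True, True, False])
pipe = ToyPipeline(encoder, selector)
print(pipe.get_feature_names_out())  # ['letter_a', 'letter_c', 'pet_dog']
```

The selector never saw named input, so it has no feature_names_in_ of its own. It still reports output names because the pipeline hands it the names produced by the preceding step.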
Enabling Functionality
The following enhancements are not a part of this SLEP. These features are made possible if this SLEP gets accepted.
- This SLEP enables us to implement an array_out keyword argument to all transform methods to specify the array container outputted by transform. An implementation of array_out requires feature_names_in_ to validate that the names in fit and transform are consistent. An implementation of array_out needs a way to map from the input feature names to output feature names, which is provided by get_feature_names_out.
- An alternative to array_out: Transformers in a pipeline may wish to have feature names passed in as X. This can be enabled by adding an array_input parameter to Pipeline:

  pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(), array_input='pandas')

  In this case, the pipeline will construct a pandas DataFrame to be inputted into MyTransformer and LogisticRegression. The feature names will be constructed by calling get_feature_names_out as data is passed through the Pipeline. This feature implies that Pipeline is doing the construction of the DataFrame.
Considerations and Limitations
- The output of get_feature_names_out will be constructed using the name generation specification from SLEP007: Feature names, their generation and the API.
- For a Pipeline with only one estimator, slicing will not work and one would need to access the feature names directly:

  pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
  pipe1[:-1].feature_names_in_  # Works
  pipe2 = make_pipeline(LogisticRegression())
  pipe2[:-1].feature_names_in_  # Does not work

  This is because pipe2[:-1] raises an error, since it would result in a pipeline with no steps. We can work around this by allowing pipelines with no steps.
- feature_names_in_ can be any 1-D Sequence, such as a list or an ndarray.
- Meta-estimators will delegate the setting and validation of feature_names_in_ to their inner estimators. The meta-estimator will define feature_names_in_ by referencing its inner estimators. For example, the Pipeline can use steps[0].feature_names_in_ as the input feature names. If the inner estimators do not define feature_names_in_, then the meta-estimator will not define feature_names_in_ either.
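One way a meta-estimator could defer to its inner estimators is a read-only property, sketched below in plain Python. ToyStep and ToyMetaPipeline are hypothetical names, steps is assumed to be a plain list of estimators, and the property approach is only one possible design, not something this SLEP prescribes.

```python
class ToyStep:
    """Hypothetical inner estimator that records names only when present."""

    def fit(self, X):
        # Assumption: a dict of column name -> values stands in for a DataFrame.
        if isinstance(X, dict):
            self.feature_names_in_ = list(X)
        return self


class ToyMetaPipeline:
    """Hypothetical meta-estimator that stores no feature names itself."""

    def __init__(self, *steps):
        self.steps = list(steps)

    @property
    def feature_names_in_(self):
        # Delegate to the first step. If that step never defined the
        # attribute, the AttributeError propagates, so hasattr() on the
        # meta-estimator is False as well.
        return self.steps[0].feature_names_in_


with_names = ToyMetaPipeline(ToyStep().fit({"a": [1], "b": [2]}))
without_names = ToyMetaPipeline(ToyStep().fit([[1, 2]]))
```

With this design, with_names.feature_names_in_ returns the first step's names, and hasattr(without_names, "feature_names_in_") is False because the inner estimator was fitted on unnamed data.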
Backward compatibility
- This SLEP is fully backward compatible with previous versions. With the introduction of get_feature_names_out, get_feature_names will be deprecated. Note that get_feature_names_out’s signature will always contain input_features, which can be used or ignored. This helps standardize the interface for the get feature names method.
- The inclusion of a get_feature_names_out method will not introduce any overhead to estimators.
- The inclusion of a feature_names_in_ attribute will increase the size of estimators because they would store the feature names. Users can call del est.feature_names_in_ if they want to discard the stored names and disable validation.
Alternatives
There have been many attempts to address this issue:
- array_out keyword parameter in transform: This approach requires third party estimators to unwrap and wrap array containers in transform, which introduces more burden for third party estimator maintainers. Furthermore, array_out with sparse data will introduce an overhead when being passed along in a Pipeline. This overhead comes from the construction of the sparse data container that has the feature names.
- SLEP007: Feature names, their generation and the API: SLEP007 introduces a feature_names_out_ attribute while this SLEP proposes a get_feature_names_out method to accomplish the same task. The benefit of the get_feature_names_out method is that it can be used even if the feature names were not passed in fit with a dataframe. For example, in a Pipeline the feature names are not passed through to each step, and a get_feature_names_out method can be used to get the names of each step with slicing.
- SLEP012: InputArray: The InputArray was developed to work around the overhead of using a pandas DataFrame or an xarray DataArray. The introduction of another data structure into the Python Data Ecosystem would lead to more burden for third party estimator maintainers.
References and Footnotes
Copyright
This document has been placed in the public domain. [1]