SLEP015: Feature Names Propagation
- Author: Thomas J Fan
- Status: Rejected
- Type: Standards Track
- Created: 2020-10-03
Abstract
This SLEP proposes adding the get_feature_names_out method to all
transformers and the feature_names_in_ attribute for all estimators.
The feature_names_in_ attribute is set during fit if the input, X,
contains the feature names.
Motivation
scikit-learn is commonly used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling get_feature_names on the
transformer that created the feature. This interface can be cumbersome when
used in a pipeline that handles several groups of columns:
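import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                  'pet': ['dog', 'snake', 'dog'],
                  'distance': [1, 2, 3]})
y = [0, 0, 1]
orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']
ct = ColumnTransformer(
    [('cat', OneHotEncoder(), orig_cat_cols),
     ('num', StandardScaler(), orig_num_cols)])
pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
# Look up the fitted encoder by the name it was given in ct ('cat').
cat_names = (pipe['columntransformer']
             .named_transformers_['cat']
             .get_feature_names(orig_cat_cols))
# The concatenation order must match the transformer order in ct.
feature_names = np.r_[cat_names, orig_num_cols]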
The feature_names extracted above correspond to the features directly
passed into LogisticRegression. As demonstrated above, the process of
extracting feature_names requires knowing the order of the selected
categories in the ColumnTransformer. Furthermore, if there is feature
selection in the pipeline, such as SelectKBest, the get_support method
would need to be used to infer the column names that were selected.
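The feature-selection case mentioned above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and NumPy are installed; the toy data and the variables feature_names and selected_names are made up for the example. get_support returns only a boolean mask over the input columns, so the caller has to keep the names separately and index them with that mask.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative data: column "a" separates the classes, "c" carries no signal.
feature_names = np.array(["a", "b", "c"])
X = np.array([[0.0, 1.0, 5.0],
              [0.1, 2.0, 5.1],
              [1.0, 1.5, 5.0],
              [1.1, 2.5, 5.1]])
y = [0, 0, 1, 1]

selector = SelectKBest(f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns; the names
# themselves have to be tracked by the caller and indexed with that mask.
mask = selector.get_support()
selected_names = feature_names[mask]
```

With a ColumnTransformer in front of the selector, the names to index would first have to be assembled by hand as in the example above.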
Solution
This SLEP proposes adding the feature_names_in_ attribute to all estimators
that will extract the feature names of X during fit. This will also
be used for validation during non-fit methods such as transform or
predict. If X is not a recognized container with columns, then
feature_names_in_ can be undefined. If feature_names_in_ is undefined,
then it will not be validated.
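The behaviour described above can be sketched with a toy class. This is not scikit-learn code: ToyEstimator and its _validate_names helper are hypothetical, a plain dict mapping column names to values stands in for a "recognized container with columns", and the choice to skip validation when either side is unnamed is an assumption made for the sketch.

```python
class ToyEstimator:
    """Toy estimator illustrating feature_names_in_; not scikit-learn code.

    Assumption: a dict mapping column name -> list of values stands in for
    a "recognized container with columns"; anything else is unnamed data.
    """

    def fit(self, X, y=None):
        if isinstance(X, dict):
            self.feature_names_in_ = list(X)
        elif hasattr(self, "feature_names_in_"):
            # Refit on unnamed data: leave the attribute undefined.
            del self.feature_names_in_
        return self

    def _validate_names(self, X):
        expected = getattr(self, "feature_names_in_", None)
        if expected is None or not isinstance(X, dict):
            # Assumption: validation is skipped when either side is unnamed.
            return
        if list(X) != expected:
            raise ValueError(
                f"feature names {list(X)} do not match those seen in fit: "
                f"{expected}"
            )

    def predict(self, X):
        self._validate_names(X)
        n_rows = len(next(iter(X.values()))) if isinstance(X, dict) else len(X)
        return [0] * n_rows  # placeholder prediction


train = {"letter": ["a", "b"], "distance": [1, 2]}
est = ToyEstimator().fit(train)
names_after_fit = est.feature_names_in_

try:
    est.predict({"distance": [3], "letter": ["c"]})  # same columns, wrong order
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Fitting on unnamed data leaves feature_names_in_ undefined.
unnamed = ToyEstimator().fit([[1, 2], [3, 4]])
```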
Secondly, this SLEP proposes adding get_feature_names_out(input_features=None)
to all transformers. By default, the input features will be determined by the
feature_names_in_ attribute. The feature names of a pipeline can then be
easily extracted as follows:
pipe[:-1].get_feature_names_out()
# ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
#  'cat__pet_dog', 'cat__pet_snake', 'num__distance']
Note that get_feature_names_out does not require input_features
because the feature names were stored in the pipeline itself. These
feature names will be passed to each step’s get_feature_names_out method
to obtain the output feature names of the Pipeline itself.
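The step-by-step propagation described above can be sketched in plain Python. Everything here is hypothetical and only illustrates the chaining: ToyOneHot, ToyKeepFirst and pipeline_feature_names_out are made-up names, a dict of columns stands in for a DataFrame, and the "name_category" scheme is a placeholder rather than the SLEP007 naming specification.

```python
class ToyOneHot:
    """Toy one-hot encoder over a dict of column name -> list of values."""

    def fit(self, X):
        self.feature_names_in_ = list(X)
        self.categories_ = [sorted(set(X[name]))
                            for name in self.feature_names_in_]
        return self

    def get_feature_names_out(self, input_features=None):
        # Fall back to the names seen in fit when none are passed in.
        names = (self.feature_names_in_ if input_features is None
                 else list(input_features))
        return [f"{name}_{cat}"
                for name, cats in zip(names, self.categories_)
                for cat in cats]


class ToyKeepFirst:
    """Toy feature selector that keeps the first k features."""

    def __init__(self, k):
        self.k = k

    def get_feature_names_out(self, input_features):
        return list(input_features)[: self.k]


def pipeline_feature_names_out(steps, input_features=None):
    """Thread names through each step's get_feature_names_out in order."""
    names = input_features
    for step in steps:
        names = step.get_feature_names_out(names)
    return names


X = {"letter": ["a", "b", "c"], "pet": ["dog", "snake", "dog"]}
steps = [ToyOneHot().fit(X), ToyKeepFirst(k=4)]
names_out = pipeline_feature_names_out(steps)
# names_out == ['letter_a', 'letter_b', 'letter_c', 'pet_dog']
```

Only the first step needs the names stored during fit; every later step receives the names produced by the step before it.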
Enabling Functionality
The following enhancements are not a part of this SLEP. These features are made possible if this SLEP gets accepted.
- This SLEP enables us to implement an array_out keyword argument to all
  transform methods to specify the array container returned by transform.
  An implementation of array_out requires feature_names_in_ to validate
  that the names in fit and transform are consistent. An implementation of
  array_out needs a way to map from the input feature names to output
  feature names, which is provided by get_feature_names_out.
- An alternative to array_out: Transformers in a pipeline may wish to have
  feature names passed in as X. This can be enabled by adding an
  array_input parameter to Pipeline:

  pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(), array_input='pandas')

  In this case, the pipeline will construct a pandas DataFrame to be
  inputted into MyTransformer and LogisticRegression. The feature names
  will be constructed by calling get_feature_names_out as data is passed
  through the Pipeline. This feature implies that Pipeline is doing the
  construction of the DataFrame.
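The idea of the pipeline re-attaching names between steps can be sketched as follows. This is not an implementation of array_out or array_input: ToyDoubler and transform_with_names are hypothetical, and a dict of columns stands in for the pandas DataFrame the pipeline would build.

```python
class ToyDoubler:
    """Toy transformer: takes a dict of columns, returns unnamed rows."""

    def fit(self, table):
        self.feature_names_in_ = list(table)
        return self

    def transform(self, table):
        columns = [table[name] for name in self.feature_names_in_]
        # The output is a plain list of rows; the column names are lost.
        return [[2 * value for value in row] for row in zip(*columns)]

    def get_feature_names_out(self, input_features=None):
        names = (self.feature_names_in_ if input_features is None
                 else input_features)
        return [f"double_{name}" for name in names]


def transform_with_names(step, table):
    """Hypothetical helper: re-attach names to a step's unnamed output."""
    rows = step.transform(table)
    names = step.get_feature_names_out()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


table = {"distance": [1, 2, 3], "height": [10, 20, 30]}
step = ToyDoubler().fit(table)
named_output = transform_with_names(step, table)
# named_output == {'double_distance': [2, 4, 6], 'double_height': [20, 40, 60]}
```

The transformer itself never wraps its output; the caller does, which is the division of labour the array_input alternative implies.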
Considerations and Limitations
- The names returned by get_feature_names_out will be constructed using
  the name generation specification from SLEP007: Feature names, their
  generation and the API.
- For a Pipeline with only one estimator, slicing will not work and one
  would need to access the feature names directly:

  pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
  pipe1[:-1].feature_names_in_  # Works
  pipe2 = make_pipeline(LogisticRegression())
  pipe2[:-1].feature_names_in_  # Does not work

  This is because pipe2[:-1] would result in a pipeline with no steps,
  which raises an error. We can work around this by allowing pipelines
  with no steps.
- feature_names_in_ can be any 1-D Sequence, such as a list or an ndarray.
- Meta-estimators will delegate the setting and validation of
  feature_names_in_ to their inner estimators. The meta-estimator will
  define feature_names_in_ by referencing its inner estimators. For
  example, the Pipeline can use steps[0].feature_names_in_ as the input
  feature names. If the inner estimators do not define feature_names_in_,
  then the meta-estimator will not define feature_names_in_ either.
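The delegation rule for meta-estimators can be sketched with a read-only property. ToyInner and ToyMeta are hypothetical classes written for this illustration, and a dict of columns again stands in for named input; the sketch only shows how the attribute can be defined or undefined depending on the inner estimator.

```python
class ToyInner:
    """Toy inner estimator; a dict of columns stands in for named input."""

    def fit(self, X):
        if isinstance(X, dict):
            self.feature_names_in_ = list(X)
        return self


class ToyMeta:
    """Toy meta-estimator that exposes the names of its first inner step."""

    def __init__(self, *steps):
        self.steps = list(steps)

    def fit(self, X):
        self.steps[0].fit(X)  # later steps omitted for brevity
        return self

    @property
    def feature_names_in_(self):
        # Raises AttributeError when the inner estimator never set the
        # attribute, so hasattr(meta, "feature_names_in_") is then False.
        return self.steps[0].feature_names_in_


named = ToyMeta(ToyInner()).fit({"letter": ["a"], "distance": [1]})
unnamed = ToyMeta(ToyInner()).fit([[1, 2]])
```

Because the property simply forwards to the inner estimator, the meta-estimator stores no names of its own.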
Backward compatibility
- This SLEP is fully backward compatible with previous versions. With the
  introduction of get_feature_names_out, get_feature_names will be
  deprecated. Note that get_feature_names_out’s signature will always
  contain input_features, which can be used or ignored. This helps
  standardize the interface for the get feature names method.
- The inclusion of a get_feature_names_out method will not introduce any
  overhead to estimators.
- The inclusion of a feature_names_in_ attribute will increase the size of
  estimators because they would store the feature names. Users can remove
  the attribute by calling del est.feature_names_in_ if they want to
  remove the feature and disable validation.
Alternatives
There have been many attempts to address this issue:
- array_out keyword parameter in transform: This approach requires third
  party estimators to unwrap and wrap array containers in transform, which
  introduces more burden for third party estimator maintainers.
  Furthermore, array_out with sparse data will introduce an overhead when
  being passed along in a Pipeline. This overhead comes from the
  construction of the sparse data container that has the feature names.
- SLEP007: Feature names, their generation and the API: SLEP007 introduces
  a feature_names_out_ attribute while this SLEP proposes a
  get_feature_names_out method to accomplish the same task. The benefit of
  the get_feature_names_out method is that it can be used even if the
  feature names were not passed in fit with a dataframe. For example, in a
  Pipeline the feature names are not passed through to each step and a
  get_feature_names_out method can be used to get the names of each step
  with slicing.
- SLEP012: InputArray: The InputArray was developed to work around the
  overhead of using a pandas DataFrame or an xarray DataArray. The
  introduction of another data structure into the Python Data Ecosystem
  would lead to more burden for third party estimator maintainers.
References and Footnotes
Copyright
This document has been placed in the public domain. [1]