This proposal presents a solution for propagating feature names through
transformers, pipelines, and the ``ColumnTransformer``. Ideally, we would
have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())
    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_
The feature names are propagated throughout the pipeline and the user can investigate them at each step of the pipeline.
This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this
proposal we assume the feature names (and other potential meta-data) are
attached to the data when passed to an estimator. Alternative solutions are
discussed later in this document.

A main constraint of this data structure is that it should be backward
compatible, i.e. code which expects a ``numpy.ndarray`` as the output of a
transformer would not break. This SLEP focuses on feature names as the only
meta-data attached to the data; support for other meta-data can be added
later.

Since transformers currently return a ``numpy`` or a ``scipy`` array,
backward compatibility in this context means the operations which are valid
on those arrays should also be valid on the new data structure.
All operations are delegated to the data part of the container, and the
meta-data is lost immediately after each operation: operations result in a
plain ``numpy.ndarray``. This includes indexing and slicing, i.e.
``__getitem__`` is not overloaded, and if the user wishes to preserve the
meta-data, they shall do so by explicitly calling a method such as
``select()``. Operations between two ``InputArray`` objects will not try to
align the rows and/or columns of the two given objects.
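The behaviour above can be sketched with a minimal ``InputArray`` subclassing
``numpy.ndarray``. This is a hypothetical illustration of the proposed
semantics, not an existing scikit-learn implementation; the class name, the
``feature_names`` attribute, and the ``select`` method are the names this
proposal suggests.

```python
import numpy as np

class InputArray(np.ndarray):
    """Sketch: an ndarray carrying feature names as meta-data.

    Any operation (arithmetic, ufuncs, indexing) drops the meta-data and
    yields a plain numpy.ndarray; only ``select`` preserves it.
    """

    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = np.asarray(feature_names, dtype=object)
        return obj

    def __array_wrap__(self, out_arr, *args, **kwargs):
        # Results of operations degrade to a plain ndarray: meta-data is lost.
        return np.asarray(out_arr)

    def __getitem__(self, key):
        # Indexing/slicing is not overloaded to preserve meta-data either.
        return np.asarray(np.ndarray.__getitem__(self, key))

    def select(self, columns):
        # The explicit way to keep the meta-data: column selection.
        data = np.asarray(self)[:, columns]
        return InputArray(data, self.feature_names[columns])
```

Note that no row/column alignment happens anywhere: operations rely purely on
positional semantics, as with plain ``numpy`` arrays.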
``pandas`` compatibility would ideally come from ``pandas`` itself (e.g.
``pd.DataFrame(X)`` understanding the attached meta-data), but ``pandas``
does not provide a clean API for this at the moment. Alternatively,
``inputarray.todataframe()`` would return a ``pandas.DataFrame`` with the
relevant meta-data attached.
Feature names are an object ``ndarray`` of strings aligned with the columns.
They can be ``None``.
Estimators understand the ``InputArray`` and extract the feature names from
the given data before applying the operations and transformations on the
data. All transformers return an ``InputArray`` with feature names attached
to it.
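As an illustration of this contract, consider a toy centering transformer.
The ``InputArray`` container is re-defined minimally here to keep the sketch
self-contained, and the transformer logic is purely illustrative; none of
these names are an existing scikit-learn API.

```python
import numpy as np

class InputArray(np.ndarray):
    # Minimal stand-in for the proposed container: data + feature names.
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        return obj

class ToyCenterer:
    """Toy transformer that centers columns and propagates feature names."""

    def fit(self, X, y=None):
        # Extract the feature names from the incoming container.
        self.input_feature_names_ = getattr(X, "feature_names", None)
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        data = np.asarray(X, dtype=float) - self.mean_
        # Return an InputArray with the feature names re-attached.
        return InputArray(data, self.input_feature_names_)
```

Here centering does not rename features, so the input names are attached to
the output unchanged; a transformer generating new features would attach the
generated names instead.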
The way feature names are generated is discussed in SLEP007.
Ideally sparse arrays follow the same pattern, but since ``scipy.sparse``
does not provide the kind of API provided by ``numpy``, we may need to find
an alternative solution.
There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame``, an ``xarray.DataArray``, or simply an
``sp.SparseMatrix`` and a given set of feature names.
``InputArray`` can also be converted to a ``pandas.DataFrame``. With ``X``
being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API
With ``X`` being an ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray``::

    >>> make_inputarray(X, feature_names)
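A hedged sketch of such a factory is given below. The dispatch is
duck-typed, and both ``InputArray`` and ``make_inputarray`` are hypothetical
names from this proposal; sparse input is densified here only to keep the
sketch short, which a real implementation would avoid.

```python
import numpy as np

class InputArray(np.ndarray):
    # Minimal stand-in for the proposed container: data + feature names.
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = None if feature_names is None else list(feature_names)
        return obj

def make_inputarray(X, feature_names=None):
    """Build an InputArray from a DataFrame, sparse matrix, or ndarray."""
    if hasattr(X, "columns"):      # pandas.DataFrame-like input
        return InputArray(X.to_numpy(), feature_names or list(X.columns))
    if hasattr(X, "toarray"):      # scipy sparse-like input (densified here)
        return InputArray(X.toarray(), feature_names)
    return InputArray(np.asarray(X), feature_names)
```

When the input is a ``pandas.DataFrame`` and no names are given, the frame's
own column labels are used as the feature names.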
Since we expect the feature names to be attached to the data given to an estimator, there are a few potential approaches we can take:
- ``pandas`` in, ``pandas`` out: this means we expect the user to give the
  data as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names.
  This is not a feasible solution since ``pandas`` plans to move to a
  per-column representation, which means ``pd.DataFrame(np.asarray(df))``
  has two guaranteed memory copies.
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature
  names. However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels; it also aligns rows when an
  operation between two ``xarray.DataArray`` objects is done, which can be
  time consuming and is not the semantic expected in scikit-learn: we only
  expect the number of rows to be equal, and that the rows always correspond
  to one another in the same order.
As a result, we need to have another data structure which we’ll use to transfer data related information (such as feature names), which is lightweight and doesn’t interfere with existing user code.
Another alternative to the problem of passing meta-data around is to pass it
as a parameter to ``fit``. This would heavily involve modifying
meta-estimators, since they would need to pass that information to each
step, and extract the relevant information from the estimators to pass it
along to the next estimator. Our prototype implementations showed
significant challenges compared to when the meta-data is attached to the
data.
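To illustrate why this routing is invasive, consider a toy pipeline that
must both forward the names into every step's ``fit`` and harvest the
generated names back out before calling the next step. All class and
attribute names here are hypothetical, for illustration only.

```python
import numpy as np

class AddPrefix:
    """Toy transformer that renames its input features with a prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def fit(self, X, y=None, feature_names=None):
        # The names must arrive as an extra fit parameter.
        self.output_feature_names_ = [self.prefix + n for n in feature_names]
        return self

    def transform(self, X):
        return np.asarray(X)

class ToyPipeline:
    """Toy meta-estimator threading feature names through its steps."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None, feature_names=None):
        names = feature_names
        for step in self.steps:
            # The meta-estimator must pass the current names in ...
            step.fit(X, y, feature_names=names)
            X = step.transform(X)
            # ... and extract the generated names for the next step.
            names = step.output_feature_names_
        self.output_feature_names_ = names
        return self
```

Every meta-estimator (pipelines, column transformers, search estimators,
...) would need this kind of plumbing, whereas attaching the meta-data to
the data makes it flow through unchanged code paths.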