SLEP018: Pandas Output for Transformers with set_output

Author: Thomas J. Fan

Status: Accepted

Type: Standards Track

Created: 2022-06-22

Abstract

This SLEP proposes a set_output method to configure the output data container of scikit-learn transformers.

Detailed description

Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices. This SLEP proposes adding a set_output method to configure a transformer to output pandas DataFrames:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")
scaler.fit(X_df)

# X_trans_df is a pandas DataFrame
X_trans_df = scaler.transform(X_df)

The index of the output DataFrame must match the index of the input. If the transformer does not support transform="pandas", then it must raise a ValueError stating that it does not support the feature.

This SLEP focuses only on dense data. If a transformer returns sparse data, e.g. OneHotEncoder(sparse=True), then transform raises a ValueError when set_output(transform="pandas") has been configured. Handling sparse output may be the subject of a future SLEP.

For a pipeline, calling set_output configures all inner transformers and leaves non-transformers untouched. This enables the following workflow:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

log_reg = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
log_reg.set_output(transform="pandas")

# All transformers return DataFrames during fit
log_reg.fit(X_df, y)

# X_trans_df is a pandas DataFrame
X_trans_df = log_reg[:-1].transform(X_df)

# X_trans_df is again a pandas DataFrame
X_trans_df = log_reg[0].transform(X_df)

# The classifier stores the input feature names in feature_names_in_
log_reg[-1].feature_names_in_

Meta-estimators that support set_output are required to configure all inner transformers by calling set_output. Specifically, all fitted and unfitted inner transformers must be configured with set_output. This ensures that transform returns a DataFrame both before and after the meta-estimator is fitted. If an inner transformer does not define set_output, then an error is raised.
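The propagation rule can be sketched in plain Python. All class names below (ToyTransformer, LegacyTransformer, ToyClassifier, ToyPipeline) are hypothetical and are not part of scikit-learn. The sketch also assumes that a "transformer" is any step with a transform method and that the error raised is a ValueError; this SLEP does not specify the error type.

```python
class ToyTransformer:
    """Hypothetical transformer that supports set_output."""

    def __init__(self):
        self.output_container = "default"

    def set_output(self, *, transform=None):
        if transform is not None:
            self.output_container = transform
        return self

    def transform(self, X):
        return X


class LegacyTransformer:
    """Hypothetical transformer that does not define set_output."""

    def transform(self, X):
        return X


class ToyClassifier:
    """Hypothetical non-transformer: no transform, no set_output."""

    def fit(self, X, y):
        return self


class ToyPipeline:
    """Hypothetical meta-estimator that forwards set_output to inner steps."""

    def __init__(self, steps):
        self.steps = list(steps)

    def set_output(self, *, transform=None):
        for step in self.steps:
            if not hasattr(step, "transform"):
                continue  # non-transformers are left unconfigured
            if not hasattr(step, "set_output"):
                # Assumption: ValueError; the SLEP only says "an error"
                raise ValueError(
                    f"{type(step).__name__} does not define set_output"
                )
            step.set_output(transform=transform)
        return self


pipe = ToyPipeline([ToyTransformer(), ToyTransformer(), ToyClassifier()])
pipe.set_output(transform="pandas")
configured = [step.output_container for step in pipe.steps[:-1]]
print(configured)  # ['pandas', 'pandas']
```

Because the inner steps are configured when set_output is called rather than during fit, the setting applies whether or not the steps have been fitted yet.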

Global Configuration

For ease of use, this SLEP proposes a global configuration flag that sets the output for all transformers:

import sklearn
sklearn.set_config(transform_output="pandas")

The global default configuration is "default", in which case the transformer itself determines the output container.

The configuration can also be set locally using the config_context context manager:

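from sklearn import config_context
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

with config_context(transform_output="pandas"):
    num_prep = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
    num_prep.fit_transform(X_df)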

The following lists the three ways to configure the output container, from highest to lowest precedence:

  1. Locally configure a transformer: transformer.set_output

  2. Context manager: config_context

  3. Global configuration: set_config
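The precedence order above can be illustrated with a toy model. This is only a sketch of the resolution logic, not scikit-learn's implementation: toy_set_config, toy_config_context, and ToyTransformer are hypothetical stand-ins for set_config, config_context, and a transformer.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the global configuration
_toy_config = {"transform_output": "default"}


def toy_set_config(transform_output):
    _toy_config["transform_output"] = transform_output


@contextmanager
def toy_config_context(transform_output):
    # Temporarily override the global setting, then restore it
    previous = _toy_config["transform_output"]
    _toy_config["transform_output"] = transform_output
    try:
        yield
    finally:
        _toy_config["transform_output"] = previous


class ToyTransformer:
    def __init__(self):
        self._local_output = None  # None means "not configured locally"

    def set_output(self, *, transform=None):
        if transform is not None:
            self._local_output = transform
        return self

    def resolved_output(self):
        # A local setting wins; otherwise fall back to the active config
        if self._local_output is not None:
            return self._local_output
        return _toy_config["transform_output"]


local = ToyTransformer().set_output(transform="default")
unset = ToyTransformer()

toy_set_config("default")
with toy_config_context("pandas"):
    inside = (local.resolved_output(), unset.resolved_output())
outside = unset.resolved_output()

print(inside)   # ('default', 'pandas')
print(outside)  # default
```

The locally configured transformer keeps its own setting inside the context manager, while the unconfigured one follows the context and reverts to the global value once the block exits.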

Implementation

A possible implementation of this SLEP is worked out in #23734.

Backward compatibility

There are no backward compatibility concerns, because the set_output method is a new API. Third-party transformers can opt in to the API by defining set_output.
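As a rough illustration of opting in, a third-party transformer might define set_output along the following lines. LogTransformer is a hypothetical class, and wrapping the result by hand in transform is an assumption made for this sketch; it is not how scikit-learn's own transformers are necessarily implemented.

```python
import numpy as np
import pandas as pd


class LogTransformer:
    """Hypothetical third-party transformer that opts in to set_output."""

    def __init__(self):
        self._transform_output = "default"

    def set_output(self, *, transform=None):
        # None leaves the current configuration unchanged
        if transform is None:
            return self
        if transform not in ("default", "pandas"):
            raise ValueError(f"Unsupported transform output: {transform!r}")
        self._transform_output = transform
        return self

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        data = np.log1p(np.asarray(X, dtype=float))
        if self._transform_output == "pandas":
            # Reuse the input's index and columns when it is a DataFrame
            if isinstance(X, pd.DataFrame):
                return pd.DataFrame(data, index=X.index, columns=X.columns)
            return pd.DataFrame(data)
        return data


X_df = pd.DataFrame({"a": [0.0, 1.0], "b": [2.0, 3.0]}, index=["r1", "r2"])
out_df = LogTransformer().set_output(transform="pandas").fit(X_df).transform(X_df)
out_np = LogTransformer().fit(X_df).transform(X_df)
```

Returning self from set_output keeps the method chainable, matching the usage shown earlier in this SLEP.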

Alternatives

Alternatives to this SLEP include:

  1. SLEP014 proposes that if the input is a DataFrame, then the output is a DataFrame.

  2. Prototype #20100 showcases array_out="pandas" in transform. This API is limited because it does not directly support fitting a pipeline whose steps require DataFrame input.

Discussion

Issues discussing pandas output include #14315, #20100, and #23001. This SLEP proposes configuring the output to be pandas because it is the DataFrame library that is most widely used and requested by users. The set_output API can be extended to support additional DataFrame libraries and sparse data formats in the future.

References and Footnotes