4. SLEP004: Data information¶
This is a specification to introduce data information (such as sample_weights) into the computation of estimator methods (fit, score, …), based on the discussions in the following issues and PRs:
- Consistent API for attaching properties to samples #4497
- Acceptance of sample_weights in pipeline.score #7723
- Establish global error state like np.seterr #4660
- Should cross-validation scoring take sample-weights into account? #4632
- Sample properties #4696
Probably related PRs:
- Add feature_extraction.ColumnTransformer #3886
- Categorical split for decision tree #3346
Google doc of the sample_prop discussion held during the scikit-learn day in Paris on 7 June 2017: https://docs.google.com/document/d/1k8d4vyw87gWODiyAyQTz91Z1KOnYr6runx-N074qIBY/edit
4.1. 1. Requirements¶
These requirements are drawn from the discussions in the issues and PRs above:
- User can attach information to samples.
- Must be a DataFrame-like object.
- Can be given to fit, score, split, and any other method to which the user gives X.
- Must work with every meta-estimator (Pipeline, GridSearchCV, cross_val_score).
- Can specify which sample property is used by each part of the meta-estimator.
- Must raise an error if unnecessary extra information is given to an estimator. Meta-estimators do not raise these errors.
Requirement proposed but not used by this specification:
- User can attach feature properties to samples.
4.2. 2. Definition¶
Some estimators in sklearn can change their behavior when an attribute sample_props is provided. sample_props is a dictionary (pandas.DataFrame compatible) defining sample properties. The example below shows how a sample_props can be provided to LogisticRegression to weight the samples:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
digits = datasets.load_digits()
X = digits.data
y = digits.target
# Define weights used by sample_props
weights_fit = np.random.rand(X.shape[0])
weights_fit /= np.sum(weights_fit)
weights_score = np.random.rand(X.shape[0])
weights_score /= np.sum(weights_score)
logreg = LogisticRegression()
# Fit and score a LogisticRegression without sample weights
logreg = logreg.fit(X, y)
score = logreg.score(X, y)
print("Score obtained without applying weights: %f" % score)
# Fit LogisticRegression without sample weights and score with sample weights
logreg = logreg.fit(X, y)
score = logreg.score(X, y, sample_props={'weight': weights_score})
print("Score obtained by applying weights only to score: %f" % score)
# Fit and score a LogisticRegression with sample weights
logreg = logreg.fit(X, y, sample_props={'weight': weights_fit})
score = logreg.score(X, y, sample_props={'weight': weights_score})
print("Score obtained by applying weights to both"
" score and fit: %f" % score)
When an estimator expects a mandatory sample_props, an error is raised for each property that is not provided. Moreover, if an unintended property is given through sample_props, a warning is raised to signal that the result may differ from the one expected. For example, the following code:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
digits = datasets.load_digits()
X = digits.data
y = digits.target
weights = np.random.rand(X.shape[0])
logreg = LogisticRegression()
# This instruction will raise the warning
logreg = logreg.fit(X, y, sample_props={'bad_property': weights})
will raise the warning message: "sample_props['bad_property'] is not used by LogisticRegression.fit. The results obtained may differ from the ones expected."
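A minimal sketch of how an estimator could implement these checks; the helper name _check_sample_props and its arguments are hypothetical, not part of this specification:
import warnings

def _check_sample_props(sample_props, used_props, mandatory_props, caller):
    # Hypothetical helper: raise on missing mandatory properties,
    # warn on properties the caller does not use.
    sample_props = sample_props or {}
    for prop in mandatory_props:
        if prop not in sample_props:
            raise ValueError(
                "sample_props[%r] is mandatory for %s" % (prop, caller))
    for prop in sample_props:
        if prop not in used_props:
            warnings.warn(
                "sample_props[%r] is not used by %s. The results obtained "
                "may differ from the ones expected." % (prop, caller))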
We provide the function sklearn.seterr in case you want to change the behavior of these messages. Even though they are treated as warnings by default, we recommend changing the behavior to raise them as errors. You can do so by adding the following code:
sklearn.seterr(sample_props="raise")
Please refer to the documentation of np.seterr
for more information.
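sklearn.seterr does not exist today; a minimal sketch of how such a global switch could work, mirroring np.seterr (the module-level state and helper names below are assumptions):
import warnings

# Hypothetical module-level error state, like np.seterr's.
_err_state = {'sample_props': 'warn'}

def seterr(sample_props=None):
    # Return the previous state and install the new one.
    old = dict(_err_state)
    if sample_props is not None:
        if sample_props not in ('ignore', 'warn', 'raise'):
            raise ValueError(
                "sample_props must be 'ignore', 'warn' or 'raise'")
        _err_state['sample_props'] = sample_props
    return old

def _handle_unused_prop(message):
    # Estimators would call this when they detect an unused property.
    mode = _err_state['sample_props']
    if mode == 'raise':
        raise ValueError(message)
    elif mode == 'warn':
        warnings.warn(message)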
4.3. 3. Behavior of sample_props for meta-estimators¶
4.3.1. 3.1 Common routing scheme¶
Meta-estimators can also change their behavior when an attribute sample_props is provided. In that case, sample_props is sent to every internal estimator and function supporting the sample_props attribute. In other words, all the properties defined by sample_props are transmitted to each internal function or class supporting sample_props. In the following example, the property weights is sent through sample_props to pca.fit_transform and logistic.fit:
import numpy as np
from sklearn import decomposition, datasets, linear_model
from sklearn.pipeline import Pipeline
digits = datasets.load_digits()
X = digits.data
y = digits.target
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),])
# Define weights
weights = np.random.rand(X.shape[0])
weights /= np.sum(weights)
# weights is sent to pca.fit_transform and logistic.fit
pipe.fit(X, y, sample_props={"weights": weights})
In contrast with plain estimators, a meta-estimator raises no warning if an extra property is sent through sample_props. However, errors are still raised if a mandatory property is not provided.
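A minimal sketch of this pass-through behavior for a simplified pipeline fit over (name, estimator) steps; the function below is hypothetical, and the sample_props keyword is the API proposed by this specification:
def _pipeline_fit(steps, X, y=None, sample_props=None):
    # Forward the full sample_props to every step that supports it;
    # meta-estimators do not warn on extra properties.
    Xt = X
    for name, transformer in steps[:-1]:
        Xt = transformer.fit_transform(Xt, y, sample_props=sample_props)
    steps[-1][1].fit(Xt, y, sample_props=sample_props)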
4.3.2. 3.2 Override common routing scheme¶
You can override the common routing scheme of sample_props for nested objects by defining sample properties of the form <component>__<property>. You can also override it by defining your own routes through the routing attribute of a meta-estimator.
A route defines a way to override the value of a key of sample_props with the value of another key in the same sample_props. This modification is done every time a method compatible with sample_props is called.
To illustrate how it works, if you want to send weights only to pca, you can define a sample_props with a property pca__weights:
import numpy as np
from sklearn import decomposition, datasets, linear_model
from sklearn.pipeline import Pipeline
digits = datasets.load_digits()
X = digits.data
y = digits.target
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
# Create the pipeline; routes are given through sample_props below
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),])
# Define weights
weights = np.random.rand(X.shape[0])
weights /= np.sum(weights)
pca_weights = np.random.rand(X.shape[0])
pca_weights /= np.sum(pca_weights)
# Only pca will receive pca_weights as weights
pipe.fit(X, y, sample_props={'pca__weights': pca_weights})
# pca will receive pca_weights and logistic will receive weights as weights
pipe.fit(X, y, sample_props={'pca__weights': pca_weights,
                             'weights': weights})
By defining pca__weights, we have overridden the property weights for pca. In all cases, the property pca__weights itself will still be sent to both pca and logistic.
Overriding the routing scheme can be subtle, and you must remember the priority of each route type, as illustrated by the sketch after this list:
- Routes applied specifically to a function/estimator: {'pca__weights': weights}
- Routes defined globally: {'weights': weights}
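A minimal sketch of that priority rule; _resolve_sample_props is hypothetical and computes the sample_props one component would see, with specific routes taking precedence over global ones while prefixed keys are still forwarded verbatim:
def _resolve_sample_props(sample_props, component):
    # Every key is forwarded unchanged; in addition, a
    # '<component>__<property>' key overrides the global '<property>'.
    resolved = dict(sample_props)
    prefix = component + '__'
    for key, value in sample_props.items():
        if key.startswith(prefix):
            resolved[key[len(prefix):]] = value
    return resolved

# For 'pca', {'weights': w, 'pca__weights': pw} resolves to
# {'weights': pw, 'pca__weights': pw}; for 'logistic' it stays
# {'weights': w, 'pca__weights': pw}.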
Let's consider the following code to familiarize yourself with the different route definitions:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, LeaveOneLabelOut
digits = datasets.load_digits()
X = digits.data
y = digits.target
# Define the groups used by cross_val_score
cv_groups = np.random.randint(3, size=y.shape)
# Define the groups used by GridSearchCV
gs_groups = np.random.randint(3, size=y.shape)
# Define weights used by cross_val_score
weights = np.random.rand(X.shape[0])
weights /= np.sum(weights)
# We define the GridSearchCV used by cross_val_score
params = {'alpha': [1e-4, 1e-3]}  # parameter grid for SGDClassifier
grid = GridSearchCV(SGDClassifier(), params, cv=LeaveOneLabelOut())
# When cross_val_score is called, we send all the properties used internally
cross_val_score(grid, X, y, cv=LeaveOneLabelOut(),
                sample_props={'cv__groups': cv_groups,
                              'split__groups': gs_groups,
                              'weights': weights})
With this code, the sample_props sent to each function of GridSearchCV and cross_val_score will be:

function | sample_props
---|---
grid.fit | {'weights': weights, 'cv__groups': cv_groups, 'split__groups': gs_groups}
grid.score | {'weights': weights, 'cv__groups': cv_groups, 'split__groups': gs_groups}
grid.split | {'weights': weights, 'groups': gs_groups, 'cv__groups': cv_groups, 'split__groups': gs_groups}
cross_val_score | {'weights': weights, 'groups': cv_groups, 'cv__groups': cv_groups, 'split__groups': gs_groups}
Thus, these functions receive the following weights and groups properties:

function | weights | groups
---|---|---
grid.fit | weights | None
grid.score | weights | None
grid.split | weights | gs_groups
cross_val_score | weights | cv_groups
4.4. 4. Alternative proposals for sample_props (06.17.17)¶
The meta-estimator says which columns of sample_props it wants to use:
p = make_pipeline(
    PCA(n_components=10),
    SVC(C=10).with(<method>_<thing_the_method_knows>=<column_name>)
)
p.fit(X, y, sample_props={<column_name>: <value>})
For example:
p = make_pipeline(
    PCA(n_components=10),
    SVC(C=10).with(fit_weights='weights', score_weights='weights')
)
p.fit(X, y, sample_props={"weights": w})
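Note that with is a reserved keyword in Python, so a real implementation would need another spelling. A minimal sketch of the idea, using a hypothetical with_ method that records the mapping:
class WithPropsMixin(object):
    # Hypothetical mixin: record which sample_props column feeds
    # which method, e.g. fit_weights='weights'.
    def with_(self, **mapping):
        self._props_mapping = dict(mapping)
        return self

    def _get_prop(self, method, name, sample_props):
        # Look up the column mapped to '<method>_<name>s', if any.
        column = getattr(self, '_props_mapping', {}).get(
            '%s_%ss' % (method, name))
        if column is None or sample_props is None:
            return None
        return sample_props.get(column)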
Other proposals:
- Olivier suggests replacing .with(...) with .sample_props_mapping(...).
- Gaël suggests replacing .with(...) with a parameter with_props=..., like:
p = make_pipeline(
    PCA(n_components=10),
    SVC(C=10),
    with_props={
        'svc': {<method>_<thing_the_method_knows>: <column_name>}}
)
4.4.1. 4.1 GridSearch + Pipeline case¶
Let's consider the case of a GridSearch working with a Pipeline. How do we define the sample_props in that case?
4.4.1.1. Alternative 1¶
Pass everything through in GridSearchCV:
pipe = make_pipeline(
    PCA(), SVC(),
    with_props={'pca__fit_weight': 'my_weights'})
GridSearchCV(
    pipe, cv=my_cv,
    with_props={'cv__groups': "my_groups", '*': '*'})
A more complex example with this solution:
pipe = make_pipeline(
    make_union(
        CountVectorizer(analyzer='word').with(fit_weight='my_weight'),
        CountVectorizer(analyzer='char').with(fit_weight='my_weight')),
    SVC())
GridSearchCV(
    pipe,
    cv=my_cv.with(groups='my_groups'), score_weight='my_weight')
4.4.1.2. Alternative 2¶
Grid search manages the sample_props of all internal objects:
pipe = make_pipeline(PCA(), SVC())
GridSearchCV(
    pipe, cv=my_cv,
    with_props={
        'cv__groups': "my_groups",
        'estimator__pca__fit_weight': "my_weights",
    })
A more complex example with this solution:
pipe = make_pipeline(
    make_union(
        CountVectorizer(analyzer='word'),
        CountVectorizer(analyzer='char')),
    SVC())
GridSearchCV(
    pipe, cv=my_cv,
    with_props={
        'cv__groups': "my_groups",
        'estimator__featureunion__countvectorizer-1__fit_weight': "my_weights",
        'estimator__featureunion__countvectorizer-2__fit_weight': "my_weights",
        'score_weight': "my_weights",
    }
)
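A minimal sketch of how Alternative 2 could dispatch such keys; _props_for_child is hypothetical and simply strips one routing prefix before delegating to the matching sub-object:
def _props_for_child(with_props, child):
    # Keep the keys addressed to one child ('cv', 'estimator', ...),
    # stripping the '<child>__' prefix.
    prefix = child + '__'
    return {key[len(prefix):]: value
            for key, value in with_props.items()
            if key.startswith(prefix)}

with_props = {
    'cv__groups': "my_groups",
    'estimator__featureunion__countvectorizer-1__fit_weight': "my_weights",
}
# _props_for_child(with_props, 'estimator') returns
# {'featureunion__countvectorizer-1__fit_weight': "my_weights"},
# which the pipeline would in turn dispatch to the feature union.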