2. SLEP002: Dynamic pipelines
2.1. Goals
Being backward-compatible
Allow interactive pipeline construction (for example in IPython)
Support adding and replacing parts of pipeline
Support using steps as label (y’s) transformers
2.2. Design
2.2.1. Imports
In addition to Pipeline <sklearn.pipeline.Pipeline>
class some additional
wrappers are proposed as part of public API:
from sklearn.pipeline import (Pipeline, fitted, transformer, predictor
label_transformer, label_predictor,
ignore_transform, ignore_predict)
2.2.2. Pipeline creation
2.2.2.1. Backward-compatible
Of course, old syntax should be supported:
pipe = Pipeline(steps=[('name1', estimator1), ('name2', 'estimator2)]
2.2.2.2. Proposed default constructor
It is not backward-compatible, but it shouldn’t break most of old code:
pipe = Pipeline()
It is not yet configured, so trying to use it should fail:
>>> pipe.predict(...)
Traceback (most recent call last):
...
NotFittedError: This Pipeline instance is not fitted yet
>>> pipe.fit(...)
Traceback (most recent call last):
...
NotConfiguredError: This Pipeline instance is not configured yet
2.2.2.3. Proposed construction from iterable of dicts
Dictionaries emphasize structure:
pipe = Pipeline(
steps=[
{'name1': Estimator1()},
{'name2': Estimator2()},
]
)
Every dict should be of length 1:
>>> pipe = Pipeline(
... steps=(
... {'name1': Estimator1(),
... 'name2': Estimator2()},
... {},
... ),
... )
Traceback (most recent call last):
...
TypeError: Wrong step definition
2.2.2.4. Proposed construction from collections.OrderedDict
It is probably the most natural way to create a pipeline:
pipe = Pipeline(
collections.OrderedDict([
('name1', Estimator1()),
('name2', Estimator2()),
]),
)
2.2.3. Backward-compatibility notice
As user can provide object of any type as steps
argument to constructor,
there is no way to be 100% compatible, if we are going to maintain our oun
type for Pipeline.steps
.
But in most cases people provide list
object as steps
parameter, so
being backward-compatible with list
API should be fine.
2.2.4. Adding estimators
2.2.4.1. Backward-compatible
Although not documented, but popular method of modifying (not fitted) pipelines should be supported:
pipe.steps.append(['name', estimator])
The only difference is that special handler is returned instead of None
.
2.2.4.2. Enhanced: by indexing
Using dict-like syntax if very user-friendly:
pipe.steps['name'] = estimator
2.2.4.3. Enhanced: add
function
Alias to previous two calls:
pipe.steps.add('name', estimator)
And also:
pipe.add_estimator('name', estimator)
2.2.4.4. Adding estimators with type specification
Estimator types will be discussed later, but some functions belong to this section:
pipe.add_estimator('name0', estimator0).mark_fitted()
pipe.add_transformer('name1', estimator1) # never calls .fit (x, y -> x)
pipe.add_predictor('name2', estimator2) # never calls .trasform (x -> y)
pipe.add_label_transformer('name3', estimator3) # (y -> y)
pipe.add_label_predictor('name4', estimator4) # (y -> y)
2.2.5. Steps (subestimators) access
2.2.5.1. Backward-compatible
Indexing by number should return (step, estimator)
pair:
>>> pipe.steps[0]
('name', SomeEstimator(...))
2.2.5.2. Enhanced access via indexing
One should be able to retrieve any estimator with indexing by step’s name:
>>> pipe.steps['mame']
SomeEstimator(param1=value1, param2=value2)
2.2.5.3. Enhanced access via attributes
Dotted access should also work if name of step is valid python name literal and there is no inference with internal methods:
>>> pipe.steps.name
SomeEstimator(param1=value1, param2=value2)
>>> pipe.steps.get
<bound method index of <StepsOrderedDict object at ...>>
>>> pipe.add_transformer('my transformer', estimator)
>>> pipe.steps.my transformer
File ...
pipe.steps_.my transformer
^
SyntaxError: invalid syntax
2.2.6. Replacing estimators
2.2.6.1. Backward-compatible
Replacing should only be supported via access to .steps
attribute. This way
there is no ambiguity with new/old subestimator subtype:
pipe = Pipeline(steps=[('name', SomeEstimator())])
pipe.steps[0] = ('name', AnotherEstimator())
2.2.6.2. Replace via indexing by step name
Dict-like behavior can be used too:
pipe = Pipeline(steps=[('name', SomeEstimator())])
pipe.steps['name'] = AnotherEstimator()
2.2.6.3. Replace via replace()
function
This way one can obtain special handler:
pipe.steps.replace('old_step_name', 'new_step_name', NewEstimator())
pipe.steps.replace('step_name', 'new_name',
SomeEstimator()).mark_transformer()
2.2.6.4. Rename step via rename()
function
Simple way to change step’s name (doesn’t affect anything except object representation):
pipe.steps.rename('old_name', 'new_name')
2.2.7. Modifying estimators
Changing estimator params should only be performed via
pipeline.set_params()
. If somebody calls subestimator.set_params()
directly, pipeline object will have no idea about changed state. There is no
easy way to control it, so docs should just warm users about it.
On the other hand, there exist not-so-easy way to at least warm users during
runtime: pipeline will have to keep params of all its children and compare them
with actual params during fit
or predict
routines and raise a warning
if they do not match. This functionality may be implemented as part of some
kind of debugging mode.
2.2.8. Deleting estimators
2.2.8.1. Backward-compatible
Backward-compatible way to delete a step is to del
it via index number:
del pipe.steps[2]
2.2.8.2. Enhanced indexing
A little more user-friendly way to remove a step can be achieved using enhanced indexing:
pipe = Pipeline()
est1 = Estimator1()
est2 = Estimator2()
pipe.steps.add('name1', est1)
pipe.steps.add('name2', est2)
del pipe.steps['name1']
del pipe.steps[pipe.steps.index(est2)]
2.2.8.3. Using dict/list-like pop()
functions
Last estimator in a chain can be deleted with any of these calls:
>>> pipe.steps.pop()
SomeEstimator()
>>> pipe.steps.popitem()
('some_name', SomeEstimator())
Likewise, first estimator in the pipeline can be removed with any of these calls:
>>> pipe.steps.popfront()
BeginEstimator()
>>> pipe.steps.popitemfront()
('begin', BeginEstimator)
Any step can be removed with pop(step_name)
or popitem(step_name)
.
2.2.9. Fitted flag reset
Internally Pipeline
object should keep track on whatever it is fitted or
not. It should consider itself fitted if it wasn’t modified after:
successful call to
.fit
:pipe.fit(...) # Got fitted pipeline if no exception was raised
construction with list of estimators, all marked as fitted via
fitted
function:pipe = pipeline.Pipeline(steps=[ ('name1', fitted(estimator1)), ('name2', fitted(estimator2)(, ... ])
adding fitted estimator to fitted pipeline:
pipe.steps.append(fitted(estimator1)) pipe.steps['new_step'] = fitted(estimator2) pipe.add_transformer('some_key', estimator3).set_fitted()
renaming step in fitted pipeline
removing first or last step from fitted pipeline
2.2.10. Subestimator types
Subestimator type contains information about the way a pipeline should process a step with that subestimator.
Subestimator type can be specified:
- By wrapping estimator with subtype constructor call:
when creating pipeline:
Pipeline([ ('name1', transformer(estimator)), ('name2', predictor(estimator)), ('name3', label_transformer(estimator)), ('name4', label_predictor(estimator)), ])
when adding or replacing a step:
pipe.steps.append(['name', label_predictor(estimator]) pipe.steps.add('name', label_transformer(estimator)) pipe.add_estimator('name', predictor(estimator)) pipe.steps.replace('name', transformer(fitted(estimator))) pipe.steps['name'] = fitted(predictor(estimator))
Using
pipe.add_*
methods:pipe.add_transformer('transformer', Transformer()) pipe.add_predictor('predictor', Predictor()) pipe.add_label_transformer('l_transformer', LabelTransformer()) pipe.add_label_predictor('l_predictor', LabelPredictor())
Using special handler methods:
pipe.add_estimator('name1', EstimatorA()).mark_transformer() pipe.steps.add('name2', EstimatorB()).mark_predictor() pipe.steps.append(['name3', EstimatorC()]).mark_label_transformer() pipe.steps.replace('name4', EstimatorD()).mark_label_predictor() pipe.steps.replace('name4', EstimatorE()).mark('label_transformer')
2.2.10.1. Transformer
Is a default type.
It is processed like this:
y_new = y
if fiting:
X_new = step_estimator.fit_transform(X, y)
else:
X_new = step.transform(X, y)
2.2.10.2. Predictor
It is processed like this:
X_new = X
if fitting:
y_new = step_estimator.fit_predict(X, y)
else:
y_new = step_estimator.predict(X, y)
2.2.10.3. Label transformer
Processing pseudocode:
X_new = X
if fitting:
y_new = step_estimator.fit_transform(y)
else:
y_new = step_estimator.transform(y)
2.2.10.4. Label predictor
Processing pseudocode:
X_new = X
if fitting:
y_new = step_estimator.fit_predict(y)
else:
y_new = step_estimator.predict(y)
2.2.11. Special handlers and wrapper functions
2.2.11.1. Assuming estimator is already fitted
to add estimator, that was already fitted to a pipline one can use fitted function:
est = SomeEstimator().fit(some_data)
pipe.steps.add('prefitted', fitted(est))
or special hanlder method:
pipe.steps.add('prefitted', est).mark_fitted()
# or
pipe.steps.add('prefitted', est).mark('fitted')
2.2.11.2. Ignoring estimator during prediction
In some cases we only need to apply estimator only during fit-phase:
pipe.add_estimator('sampler', ignore_transform(Sampler()))
# or
pipe.add_estimator('sampler', Sampler()).mark_ignore_transform()
# or
pipe.add_estimator('sampler', Sampler()).mark('ignore_transform')
If it is predictor
or label_predictor
, then one should use
ignore_predict
:
pipe.add_estimator('cluster', ignore_predict(predictor(ClusteringEstimator())))
# or
pipe.add_estimator('cluster', predictor(ClusteringEstimator())).mark_ignore_predict()
# or
pipe.add_estimator('cluster', predictor(ClusteringEstimator())).mark('ignore_predict')
2.2.11.3. Setting subestimator type
As specified above setting subestimator type can be performed with special handler or special function call.
2.2.11.4. Combining multiple flags
All sorts of syntax combinations should be supported:
pipe.steps.add('step', fitted(predictor(Estimator())))
pipe.steps.add('step', predictor(fitted(Estimator())))
pipe.steps.add('step', predictor(Estimator())).mark_fitted()
pipe.steps.add('step', fitted(Estimator())).mark_predictor()
pipe.steps.add('step', Estimator()).mark_predictor().mark_fitted()
pipe.steps.add('step', Estimator()).mark_fitted().mark_predictor()
pipe.steps.add('step', Estimator()).mark('fitted').mark_predictor()
pipe.steps.add('step', Estimator()).mark('predictor').mark_fitted()
pipe.steps.add('step', Estimator()).mark('predictor').mark('fitted')
pipe.steps.add('step', Estimator()).mark('fitted').mark('predictor')
pipe.steps.add('step', Estimator()).mark('fitted', 'predictor')
pipe.steps.add('step', Estimator()).mark('predictor', 'fitted')
2.2.12. Type of steps object
This is internal type, users shouldn’r usualy mess with that. But public methods should be considered as part of pipeline API.
2.2.12.1. Attributes and methods with standard behavior
Special methods:
__contains__()
,__getitem__()
,__setitem__()
,__delitem__()
__len__()
,__iter__()
__add__()
,__iadd__()
Methods:
get()
,index()
extend()
,insert()
keys()
,items()
,values()
clear()
,pop()
,popitem()
,popfront()
,popitemfront()
2.2.12.2. Non-standard methods
replace()
rename()
2.2.12.3. Not supported arguments and methods
This type provides dict-like and list-like interfaces, but following methods and attributes are not supported:
fromkeys()
setdefault()
sort()
__mul__()
,__rmul__()
,__imul__()
Any attempt to use them should fail with AttributeError
or
NotImplementedError
Thease methods may be not supported:
__ge__()
,__gt__()
__le__()
,__lt__()
2.2.13. Serialization
Support loading/unpickling pipelines from old scikit-learn versions
Keep track of API version in
__getstate__
/picklier
: all future versions should support unpickling all previous versions of enhanced pipelineSerialization of
.steps
attribute (without master pipeline) may be not supported.
2.3. Examples
2.3.1. Example: remove outliers
Proposed design allows to do many things, but some of them have to be done in two steps. But it shouldn’t be a problem, as one can make a pipeline with those steps:
def make_outlier_remover(bad_value=-1):
outlier_remover = Pipeline()
outlier_remover.steps.add(
'data',
DropLinesOfXCorrespondingLabel(remove_if=bad_value),
)
outlier_remover.steps.add(
'labels',
DropLabelsIf(remove_if=bad_value),
).mark_label_transformer()
return outlier_remover
2.3.2. Example: sample dataset
We can use previous example function for this:
def make_sampler(percent=75):
sentinel = object()
sampler = Pipeline()
sampler.steps.add(
'sample',
LabelSomeRowsAs(percent=percent, label=sentinel),
).mark('predictor', 'ignore_predict')
sampler.steps.add(
'down',
make_outlier_remover(bad_value=sentinel),
)
return sampler
2.4. Benefits
Users can use old code with new pipeline: usual
__init__
,set_params
,get_params
,fit
,transform
andpredict
are the only requirements of subestimators.Users can use new pipeline with their old code: pipeline is stil usual estimator, that supports usual set of methods.
We finally can transform
y
in a pipeline.
2.5. Drawbacks
Well, it’s a lot of code to write and support…