2. SLEP002: Dynamic pipelines

Summary

Create and manipulate pipelines with ease.

2.1. Goals

  • Be backward-compatible
  • Allow interactive pipeline construction (for example, in IPython)
  • Support adding and replacing parts of a pipeline
  • Support using steps as label (y) transformers

2.2. Design

2.2.1. Imports

In addition to the Pipeline class, some additional wrappers are proposed as part of the public API:

from sklearn.pipeline import (Pipeline, fitted, transformer, predictor,
                              label_transformer, label_predictor,
                              ignore_transform, ignore_predict)

2.2.2. Pipeline creation

2.2.2.1. Backward-compatible

Of course, the old syntax should be supported:

pipe = Pipeline(steps=[('name1', estimator1), ('name2', estimator2)])

2.2.2.2. Proposed default constructor

It is not backward-compatible, but it shouldn't break most existing code:

pipe = Pipeline()

It is not yet configured, so trying to use it should fail:

>>> pipe.predict(...)
Traceback (most recent call last):
...
NotFittedError: This Pipeline instance is not fitted yet

>>> pipe.fit(...)
Traceback (most recent call last):
...
NotConfiguredError: This Pipeline instance is not configured yet

2.2.2.3. Proposed construction from iterable of dicts

Dictionaries emphasize structure:

pipe = Pipeline(
    steps=[
        {'name1': Estimator1()},
        {'name2': Estimator2()},
    ]
)

Each dict must contain exactly one item:

>>> pipe = Pipeline(
...     steps=(
...         {'name1': Estimator1(),
...          'name2': Estimator2()},
...         {},
...     ),
... )
Traceback (most recent call last):
...
TypeError: Wrong step definition

2.2.2.4. Proposed construction from collections.OrderedDict

It is probably the most natural way to create a pipeline:

pipe = Pipeline(
    collections.OrderedDict([
        ('name1', Estimator1()),
        ('name2', Estimator2()),
    ]),
)

2.2.3. Backward-compatibility notice

As the user can provide an object of any type as the steps argument to the constructor, there is no way to be 100% compatible if we are going to maintain our own type for Pipeline.steps. But in most cases people pass a list object as the steps parameter, so being backward-compatible with the list API should be fine.
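
As an illustration of how list and dict access could coexist, the hybrid container below accepts integer indices (returning (name, estimator) pairs, as the old list API did) and string keys (returning the estimator itself). The class name and details are assumptions for this sketch, not the proposed implementation:

```python
from collections import OrderedDict


class Steps:
    """Hypothetical hybrid list/dict steps container (illustration only)."""

    def __init__(self, items=()):
        self._items = OrderedDict(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer index -> (name, estimator) pair, like the old list API.
            return list(self._items.items())[key]
        # String key -> the estimator itself, dict-style.
        return self._items[key]

    def __setitem__(self, key, value):
        if isinstance(key, int):
            # Replace a (name, estimator) pair in place, preserving order.
            items = list(self._items.items())
            items[key] = tuple(value)
            self._items = OrderedDict(items)
        else:
            self._items[key] = value

    def append(self, step):
        # Accepts the familiar (name, estimator) pair.
        name, estimator = step
        self._items[name] = estimator
```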

2.2.4. Adding estimators

2.2.4.1. Backward-compatible

Although not documented, this popular method of modifying a (not yet fitted) pipeline should be supported:

pipe.steps.append(['name', estimator])

The only difference is that a special handler is returned instead of None.

2.2.4.2. Enhanced: by indexing

Using dict-like syntax is very user-friendly:

pipe.steps['name'] = estimator

2.2.4.3. Enhanced: add function

An alias for the previous two calls:

pipe.steps.add('name', estimator)

And also:

pipe.add_estimator('name', estimator)

2.2.4.4. Adding estimators with type specification

Estimator types will be discussed later, but some functions belong in this section:

pipe.add_estimator('name0', estimator0).mark_fitted()
pipe.add_transformer('name1', estimator1)  # never calls .fit (x, y -> x)
pipe.add_predictor('name2', estimator2)  # never calls .transform (x -> y)
pipe.add_label_transformer('name3', estimator3)  # (y -> y)
pipe.add_label_predictor('name4', estimator4)  # (y -> y)

2.2.5. Steps (subestimators) access

2.2.5.1. Backward-compatible

Indexing by number should return a (name, estimator) pair:

>>> pipe.steps[0]
('name', SomeEstimator(...))

2.2.5.2. Enhanced access via indexing

One should be able to retrieve any estimator by indexing with the step's name:

>>> pipe.steps['name']
SomeEstimator(param1=value1, param2=value2)

2.2.5.3. Enhanced access via attributes

Dotted access should also work if the name of the step is a valid Python identifier and does not clash with internal methods:

>>> pipe.steps.name
SomeEstimator(param1=value1, param2=value2)

>>> pipe.steps.get
<bound method get of <StepsOrderedDict object at ...>>

>>> pipe.add_transformer('my transformer', estimator)
>>> pipe.steps.my transformer
  File ...
    pipe.steps.my transformer
                  ^
SyntaxError: invalid syntax
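
Dotted access can be sketched with __getattr__, which Python consults only after normal attribute lookup fails, so real methods automatically shadow same-named steps. All names here are illustrative assumptions:

```python
class StepsAttrAccess:
    """Sketch of attribute-style step access (illustration only)."""

    def __init__(self):
        self._steps = {}

    def add(self, name, estimator):
        self._steps[name] = estimator

    def __getitem__(self, name):
        return self._steps[name]

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so methods like .add
        # and .get take precedence over steps with the same name.
        try:
            return self._steps[name]
        except KeyError:
            raise AttributeError(name)
```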

2.2.6. Replacing estimators

2.2.6.1. Backward-compatible

Replacing should only be supported via access to the .steps attribute. This way there is no ambiguity about the new/old subestimator subtype:

pipe = Pipeline(steps=[('name', SomeEstimator())])
pipe.steps[0] = ('name', AnotherEstimator())

2.2.6.2. Replace via indexing by step name

Dict-like behavior can be used too:

pipe = Pipeline(steps=[('name', SomeEstimator())])
pipe.steps['name'] = AnotherEstimator()

2.2.6.3. Replace via replace() function

This way one can obtain the special handler:

pipe.steps.replace('old_step_name', 'new_step_name', NewEstimator())
pipe.steps.replace('step_name', 'new_name',
                   SomeEstimator()).mark_transformer()

2.2.6.4. Rename step via rename() function

A simple way to change a step's name (it doesn't affect anything except the object representation):

pipe.steps.rename('old_name', 'new_name')

2.2.7. Modifying estimators

Changing estimator params should only be performed via pipeline.set_params(). If somebody calls subestimator.set_params() directly, the pipeline object will have no idea about the changed state. There is no easy way to control this, so the docs should just warn users about it.

On the other hand, there exists a not-so-easy way to at least warn users at runtime: the pipeline would have to keep the params of all its children and compare them with the actual params during fit or predict routines, raising a warning if they do not match. This functionality may be implemented as part of some kind of debugging mode.
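
A minimal sketch of that debugging mode, assuming sub-estimators expose the usual get_params(); the class and method names are hypothetical:

```python
import warnings


class ParamDriftChecker:
    """Sketch: snapshot sub-estimator params, warn if they drift later."""

    def __init__(self, estimators):
        # estimators: mapping of step name -> estimator with get_params()
        self._estimators = dict(estimators)
        self._snapshot = {name: est.get_params()
                          for name, est in self._estimators.items()}

    def check(self):
        """Return names of steps whose params changed behind our back."""
        drifted = [name for name, est in self._estimators.items()
                   if est.get_params() != self._snapshot[name]]
        for name in drifted:
            warnings.warn(
                "params of step %r changed outside the pipeline; "
                "use pipeline.set_params() instead" % name)
        return drifted
```

The pipeline would call check() at the start of its fit/predict routines.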

2.2.8. Deleting estimators

2.2.8.1. Backward-compatible

The backward-compatible way to delete a step is to del it by index number:

del pipe.steps[2]

2.2.8.2. Enhanced indexing

A slightly more user-friendly way to remove a step is via enhanced indexing:

pipe = Pipeline()
est1 = Estimator1()
est2 = Estimator2()

pipe.steps.add('name1', est1)
pipe.steps.add('name2', est2)

del pipe.steps['name1']
del pipe.steps[pipe.steps.index(est2)]

2.2.8.3. Using dict/list-like pop() functions

The last estimator in the chain can be deleted with either of these calls:

>>> pipe.steps.pop()
SomeEstimator()

>>> pipe.steps.popitem()
('some_name', SomeEstimator())

Likewise, the first estimator in the pipeline can be removed with either of these calls:

>>> pipe.steps.popfront()
BeginEstimator()

>>> pipe.steps.popitemfront()
('begin', BeginEstimator())

Any step can be removed with pop(step_name) or popitem(step_name).
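
The four variants map cleanly onto OrderedDict.popitem(last=...); the sketch below assumes the semantics described above (pop-style calls return the estimator, popitem-style calls return the pair):

```python
from collections import OrderedDict

_MISSING = object()


class PoppableSteps(OrderedDict):
    """Sketch of the proposed pop variants (semantics are assumptions)."""

    def pop(self, name=_MISSING):
        # No argument: drop the last step and return its estimator.
        if name is _MISSING:
            return self.popitem(last=True)[1]
        # With a name: dict-like removal of the named step.
        return super().pop(name)

    def popfront(self):
        # Drop the first step and return its estimator.
        return self.popitem(last=False)[1]

    def popitemfront(self):
        # Drop the first step and return the (name, estimator) pair.
        return self.popitem(last=False)
```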

2.2.9. Fitted flag reset

Internally, the Pipeline object should keep track of whether it is fitted or not. It should consider itself fitted if it wasn't modified after any of the following:

  • successful call to .fit:

    pipe.fit(...)  # pipeline is fitted if no exception was raised
    
  • construction with a list of estimators, all marked as fitted via the fitted function:

    pipe = pipeline.Pipeline(steps=[
        ('name1', fitted(estimator1)),
        ('name2', fitted(estimator2)),
        ...
    ])
    
  • adding a fitted estimator to a fitted pipeline:

    pipe.steps.append(['name1', fitted(estimator1)])
    pipe.steps['new_step'] = fitted(estimator2)
    pipe.add_transformer('some_key', estimator3).mark_fitted()
    
  • renaming a step in a fitted pipeline

  • removing the first or last step from a fitted pipeline
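
The bookkeeping above can be sketched as a small state tracker; all method names are assumptions for illustration:

```python
class FittedTracker:
    """Sketch of when the fitted flag flips, per the rules above."""

    def __init__(self):
        self.fitted = False

    def on_fit_success(self):
        self.fitted = True

    def on_add_step(self, step_is_fitted):
        # Adding a fitted step to a fitted pipeline keeps the flag;
        # adding an unfitted step resets it.
        self.fitted = self.fitted and step_is_fitted

    def on_rename_step(self):
        pass  # renaming never resets the flag

    def on_remove_edge_step(self):
        pass  # removing the first/last step keeps the remaining steps fitted

    def on_modify_params(self):
        self.fitted = False
```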

2.2.10. Subestimator types

The subestimator type carries information about how a pipeline should process the step with that subestimator.

Subestimator type can be specified:

  • By wrapping estimator with subtype constructor call:
    • when creating pipeline:

      Pipeline([
          ('name1', transformer(estimator)),
          ('name2', predictor(estimator)),
          ('name3', label_transformer(estimator)),
          ('name4', label_predictor(estimator)),
      ])
      
    • when adding or replacing a step:

      pipe.steps.append(['name', label_predictor(estimator)])
      pipe.steps.add('name', label_transformer(estimator))
      pipe.add_estimator('name', predictor(estimator))
      pipe.steps.replace('name', transformer(fitted(estimator)))
      pipe.steps['name'] = fitted(predictor(estimator))
      
  • Using pipe.add_* methods:

    pipe.add_transformer('transformer', Transformer())
    pipe.add_predictor('predictor', Predictor())
    pipe.add_label_transformer('l_transformer', LabelTransformer())
    pipe.add_label_predictor('l_predictor', LabelPredictor())
    
  • Using special handler methods:

    pipe.add_estimator('name1', EstimatorA()).mark_transformer()
    pipe.steps.add('name2', EstimatorB()).mark_predictor()
    pipe.steps.append(['name3', EstimatorC()]).mark_label_transformer()
    pipe.steps.replace('name4', EstimatorD()).mark_label_predictor()
    pipe.steps.replace('name4', EstimatorE()).mark('label_transformer')
    

2.2.10.1. Transformer

It is the default type.

It is processed like this:

y_new = y
if fitting:
    X_new = step_estimator.fit_transform(X, y)
else:
    X_new = step_estimator.transform(X)

2.2.10.2. Predictor

It is processed like this:

X_new = X
if fitting:
    y_new = step_estimator.fit_predict(X, y)
else:
    y_new = step_estimator.predict(X)

2.2.10.3. Label transformer

Processing pseudocode:

X_new = X
if fitting:
    y_new = step_estimator.fit_transform(y)
else:
    y_new = step_estimator.transform(y)

2.2.10.4. Label predictor

Processing pseudocode:

X_new = X
if fitting:
    y_new = step_estimator.fit_predict(y)
else:
    y_new = step_estimator.predict(y)
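
The four processing rules can be folded into one dispatch function; this is an illustrative sketch, not the proposed implementation:

```python
def process_step(estimator, step_type, X, y, fitting):
    """Apply one pipeline step according to its subestimator type."""
    if step_type == 'transformer':
        X = (estimator.fit_transform(X, y) if fitting
             else estimator.transform(X))
    elif step_type == 'predictor':
        y = (estimator.fit_predict(X, y) if fitting
             else estimator.predict(X))
    elif step_type == 'label_transformer':
        y = (estimator.fit_transform(y) if fitting
             else estimator.transform(y))
    elif step_type == 'label_predictor':
        y = (estimator.fit_predict(y) if fitting
             else estimator.predict(y))
    else:
        raise ValueError("unknown step type: %r" % step_type)
    return X, y
```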

2.2.11. Special handlers and wrapper functions

2.2.11.1. Assuming estimator is already fitted

To add an estimator that was already fitted to a pipeline, one can use the fitted function:

est = SomeEstimator().fit(some_data)
pipe.steps.add('prefitted', fitted(est))

or the special handler method:

pipe.steps.add('prefitted', est).mark_fitted()
# or
pipe.steps.add('prefitted', est).mark('fitted')

2.2.11.2. Ignoring estimator during prediction

In some cases we need to apply an estimator only during the fit phase:

pipe.add_estimator('sampler', ignore_transform(Sampler()))
# or
pipe.add_estimator('sampler', Sampler()).mark_ignore_transform()
# or
pipe.add_estimator('sampler', Sampler()).mark('ignore_transform')

If it is a predictor or label_predictor, then one should use ignore_predict:

pipe.add_estimator('cluster', ignore_predict(predictor(ClusteringEstimator())))
# or
pipe.add_estimator('cluster', predictor(ClusteringEstimator())).mark_ignore_predict()
# or
pipe.add_estimator('cluster', predictor(ClusteringEstimator())).mark('ignore_predict')

2.2.11.3. Setting subestimator type

As specified above, setting the subestimator type can be performed with the special handler or with a special function call.

2.2.11.4. Combining multiple flags

All sorts of syntax combinations should be supported:

pipe.steps.add('step', fitted(predictor(Estimator())))
pipe.steps.add('step', predictor(fitted(Estimator())))
pipe.steps.add('step', predictor(Estimator())).mark_fitted()
pipe.steps.add('step', fitted(Estimator())).mark_predictor()
pipe.steps.add('step', Estimator()).mark_predictor().mark_fitted()
pipe.steps.add('step', Estimator()).mark_fitted().mark_predictor()
pipe.steps.add('step', Estimator()).mark('fitted').mark_predictor()
pipe.steps.add('step', Estimator()).mark('predictor').mark_fitted()
pipe.steps.add('step', Estimator()).mark('predictor').mark('fitted')
pipe.steps.add('step', Estimator()).mark('fitted').mark('predictor')
pipe.steps.add('step', Estimator()).mark('fitted', 'predictor')
pipe.steps.add('step', Estimator()).mark('predictor', 'fitted')
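
Chaining works if every mark_* call records its flag and returns the handler itself; __getattr__ can synthesize the mark_* aliases from mark(). A hypothetical sketch:

```python
class StepHandler:
    """Sketch of the special handler returned by add/replace calls."""

    def __init__(self):
        self.flags = set()

    def mark(self, *names):
        # Record one or more flags and return self so calls chain.
        self.flags.update(names)
        return self

    def __getattr__(self, name):
        # Synthesize mark_fitted, mark_predictor, ... from mark().
        if name.startswith('mark_'):
            flag = name[len('mark_'):]
            return lambda: self.mark(flag)
        raise AttributeError(name)
```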

2.2.12. Type of steps object

This is an internal type; users shouldn't usually mess with it. But its public methods should be considered part of the pipeline API.

2.2.12.1. Attributes and methods with standard behavior

Special methods:

  • __contains__(), __getitem__(), __setitem__(), __delitem__()
  • __len__(), __iter__()
  • __add__(), __iadd__()

Methods:

  • get(), index()
  • extend(), insert()
  • keys(), items(), values()
  • clear(), pop(), popitem(), popfront(), popitemfront()

2.2.12.2. Non-standard methods

  • replace()
  • rename()

2.2.12.3. Not supported arguments and methods

This type provides dict-like and list-like interfaces, but the following methods and attributes are not supported:

  • fromkeys()
  • setdefault()
  • sort()
  • __mul__(), __rmul__(), __imul__()

Any attempt to use them should fail with AttributeError or NotImplementedError.

These methods may be left unsupported:

  • __ge__(), __gt__()
  • __le__(), __lt__()

2.2.13. Serialization

  • Support loading/unpickling pipelines from old scikit-learn versions
  • Keep track of the API version in __getstate__ / the pickler: all future versions should support unpickling all previous versions of the enhanced pipeline
  • Serialization of the .steps attribute (without the master pipeline) may be unsupported.
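
Version tracking can be sketched with a __getstate__/__setstate__ pair that stamps and strips an API-version field; the version numbers and upgrade hook are assumptions:

```python
class VersionedPickleMixin:
    """Sketch: stamp pickled state with an API version for future upgrades."""

    _API_VERSION = 2

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_api_version'] = self._API_VERSION
        return state

    def __setstate__(self, state):
        version = state.pop('_api_version', 1)  # pre-versioning payloads -> 1
        if version < 2:
            # Upgrade hook for payloads written by older versions.
            state = self._upgrade_from_v1(state)
        self.__dict__.update(state)

    @staticmethod
    def _upgrade_from_v1(state):
        return state  # no-op placeholder for real migration logic
```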

2.3. Examples

2.3.1. Example: remove outliers

The proposed design allows many things to be done, though some of them require two steps. That shouldn't be a problem, as one can make a pipeline out of those steps:

def make_outlier_remover(bad_value=-1):
    outlier_remover = Pipeline()
    outlier_remover.steps.add(
        'data',
        DropLinesOfXCorrespondingLabel(remove_if=bad_value),
    )
    outlier_remover.steps.add(
        'labels',
        DropLabelsIf(remove_if=bad_value),
    ).mark_label_transformer()
    return outlier_remover

2.3.2. Example: sample dataset

We can reuse the function from the previous example:

def make_sampler(percent=75):
    sentinel = object()
    sampler = Pipeline()
    sampler.steps.add(
        'sample',
        LabelSomeRowsAs(percent=percent, label=sentinel),
    ).mark('predictor', 'ignore_predict')
    sampler.steps.add(
        'down',
        make_outlier_remover(bad_value=sentinel),
    )
    return sampler

2.4. Benefits

  • Users can use old code with the new pipeline: the usual __init__, set_params, get_params, fit, transform and predict are the only requirements for subestimators.
  • Users can use the new pipeline with their old code: the pipeline is still a usual estimator that supports the usual set of methods.
  • We finally can transform y in a pipeline.

2.5. Drawbacks

Well, it’s a lot of code to write and support…