Configs
Generally, configs defines all necessary parameter values, which are needed to run a pipeline and tasks in it.
In TaskChain, config is also entry point for creating a chain and describes how pipelines are connected to a chain. Therefore, a config has to define:
- description of tasks which should be part of a chain (e.g. pipeline)
- parameter values needed by these tasks
- dependencies on other configs (pipelines)
Usual setup is one config in one file which defines one pipeline. This allows effective reuse of the pipeline in multiple chains without need of repeating parameter values.
What actually is config?
Config in TaskChain has dual meaning.
First, config as YAML file containing information described above.
Second, config as an instance of taskchain.Config
which is usually based
on the config file and adds other information necessary for computation.
Config definition
Simple example of config
tasks: my_project.tasks.pipeline.*
string_parameter: value
int_parameter: 123
complex_parameter:
key1: value1
key2:
- v1
- v2
from taskchain import Config
config = Config(
'/path/to/task_data',
'/path/to/config.yaml',
)
Config is basically map from strings to arbitrary values.
In addition to YAML files, you can define your config also in JSON file or directly in code
by passing dict like object to Config()
in parameter data
.
More examples of configs can be found in example project.
You can access values of config object by attribute config.string_parameter
or in dict like fashion config['string_parameter']
. Actually, config is descendant of dict
.
However, direct access to the values is rarely needed because parameters are handled by TaskChain.
Tasks
Each config should define which tasks are configured. This is a way how a config defines a pipeline.
The special parameter tasks
is string or list of strings describing tasks.
Task is described by a path to task's class: 'my_project.tasks.pipeline.ProcessTask'
.
This description corresponds exactly to python imports.
To import all tasks from pipeline (defined in single file) at once you can use wildcard *
in last
place of description: 'my_project.tasks.pipeline.*'
.
Task exclusion
For more flexibility, you can also exclude tasks with special parameter excluded_tasks
with same syntax as tasks
parameter.
Config dependencies
More complicated chains are split to multiple pipelines with corresponding configs.
Parameter uses
defines how configs of these pipelines are connected together.
For example, project is split to data preparation pipeline and model pipeline.
Config of data preparation have no prerequisites, and thus it doesn't need uses
.
Some tasks of the model pipeline depends on tasks of data pipeline and therefore
model config has to depend on data config.
Example
tasks: my_project.tasks.model.*
uses: "/path/to/data_config.yaml"
model: ...
tasks: my_project.tasks.data.*
source_file: ...
uses
is string or list of strings if there are multiple dependency configs.
Placeholders & Global vars
Placeholders and global variables is a mechanism which allows TaskChain projects work in multiple environments. The same project can be moved to different directory or machine or can be run by multiple peoples with different setups. This is especially useful for handle paths in configs.
To make configs independent on environment it is possible to use placeholders in them
which are later replaced by values provided in Config
object in instantiation in parameter global_vars
.
Basic usage of global vars
tasks: my_project.tasks.pipeline.*
uses: "{CONFIGS_DIR}/dependency_config.yaml"
source_data: {DATA_DIR}/data.csv
from pathlib import Path
from taskchain import Config
CONFIGS_DIR = Path('/path/to/configs')
config = Config(
'/path/to/task_data',
CONFIGS_DIR / 'config.yaml',
global_vars={
'CONFIGS_DIR': CONFIGS_DIR,
'DATA_DIR': '/path/to/data'
}
)
Parameter global_var
can be a dict
with placeholders as keys or an object with placeholders as attributes.
This allows following typical construction:
Typical usage of global vars
from pathlib import Path
from taskchain import Config
from project import project_config
config = Config(
project_config.TASKS_DIR,
project_config.CONFIGS_DIR / 'config.yaml',
global_vars=project_config
)
from pathlib import Path
REPO_DIR = Path(__file__).resolve().parent.parent.parent
DATA_DIR = Path('/path/to/project_data/source_data')
TASKS_DIR = Path('/path/to/project_data/task_data')
CONFIGS_DIR = REPO_DIR / 'configs'
tasks: my_project.tasks.pipeline.*
uses: "{CONFIGS_DIR}/dependency_config.yaml"
source_data: {DATA_DIR}/data.csv
Parameter objects
Sometimes configuration using only json-like parameter values is not enough, or it is not practical. For these cases, you can include definition of a object instance as parameter value to your config.
Object instance is defined by class and args
and kwargs
passed to constructor.
Class has to be derived class of taskchain.paramater.ParameterObject
, i.e. has
to define repr
method which should return unique string representation of object.
This value is then used by taskchain to keep track of changes in configs,
and it is necessary for correct function of data persistence.
Common pattern is that there is base class which defines interface which is used by tasks
and parameter objects are instances of child classes.
Good example is Model
class which define abstract methods
train
, predict
, save
andload
. Children of this base class
(e.g. NeuralModel
, LinearRegressionModel
, ...) implement these methods, are configurable
by their constructor and are used in configs. Here is example of this pattern.
Definition of object instance in config is dict containing key class
with fully-qualified name of class as value.
Additionally, dict can contain args
with a list and kwargs
with a dict.
Definition of object in config
model:
class: my_project.models.LinearRegressionModel
kwargs:
normalize: True
regularization: 1
In last example, config provide parameter model
with value LinearRegressionModel(normalize=True, regularization=1)
.
In config, objects can be defined inside other structures, such as list, dict or definitions of other objects. I.e. you can define parameter which is list of objects.
AutoParameterObject
Writing of repr
method for parameter object can repetitive and omitting
a argument can lead to mistakes.
Therefore, there is AutoParameterObject
which defines repr
for you
and it is based on class name and arguments of __init__
method.
To make it work, all arguments values of constructor has to be saved in object attributes.
For argument named my_argument
, AutoParameterObject
is looking for its value at
self._my_argument
or self.my_argument
.
To allow more flexibility and ease adding new arguments, you can also define
ignore_persistence_args
or dont_persist_default_value_args
which return
list of string names of arguments and have similar meaning as
Parameter
arguments.
ChainObject
In case that your parameter object need to access the chain directly
(e.g. take a task's data), you can inherit also from taskchain.chain.ChainObject
and implement init_chain(self, chain)
method which is called after chain creation
and pass chain itself.
Namespaces
If you need to use one pipeline with different configs in one chain, or you just make your larger chains more structured, you can use namespaces.
You can put part of your chain to a namespace and all tasks in that part
will be referenced not but their name task_name
but by namespace and task name
namespace_name::task_name
.
Creating namespaces is really simple, in referencing other config in config definition
(uses
clause) just add as namespace_name
.
Example
tasks: my_project.tasks.model.*
uses:
- "/path/to/data_configs/train_data.yaml as train"
- "/path/to/data_configs/valid_data.yaml as valid"
- "/path/to/data_configs/test_data.yaml as test"
model: ...
...
class TrainModel(Task):
class Meta:
parameters = [Parameter('model')]
input_tasks = ['train::features', 'valid::features']
def run(self, model) -> ...:
train_data = self.input_tasks['train::features'].value
...
class EvalModel(Task):
class Meta:
parameters = [Parameter('model')]
input_tasks = [TrainModel, 'test::features]
def run(self, model, train_model, features) -> dict:
...
...
Notes
- if a config is in a namespace, also configs used by this config are in the same namespace
- namespaces can be nested, e.g. task
features
can be in nested namespacemain_model::train
- you can still reference task without namespace as long as there is only one task of that name
- this is the case of
input_tasks
ofEvalmodel
task but notinput_tasks
ofTrainModel
task in example above - this applies to referencing tasks in chain, in
input_tasks
andrun
method arguments.
- this is the case of
Advanced topics
Multi-config files
It is possible to save multiple configs to one file. This can be useful, when chain has multiple pipelines, and you need one file configuration.
Example of multi-config file
configs:
data_config:
tasks: ...
input_data_file: ...
...
model_config:
main_part: True
tasks: ...
uses: "#data_config as data"
model: ...
...
other_config_name:
...
Single config can be taken from this file using part
argument:
config = Config('/path/to/task_data', 'multi_config.yaml', part='data_config')
part
argument if one of defined configs specify main_part: True
.
To access a config from another config (in uses
) use
/path/to/multi_config.yaml#data_config
or if you refer to a part of the same multi-config,
you can use only #data_donfig
as shown in example.
Contexts
Context is mechanism which allows rewrite parameter values in configs in the time of their instantiation.
Under the hood Context
is special case of Config
which is used in specific way.
Example
config = Config(
'/path/to/task_data',
CONFIGS_DIR / 'config.yaml',
context={'verbose': True, 'batch_size': 32}
)
This can be useful for
- ad-hock experiment
- hyper-parameter optimization
- tuning parameters, which are not used in persistence, e.g.
batch_size
- consider long data processing chain consisting of multiple dependent pipelines each with own config file.
When we get new input data, it usually leads to recreating all configs which are very similar
(only
input_data
parameter is changed and config paths inuses
). Other approach is omitinput_data
parameter value in config and provide it context, which allows run pipeline with same configuration on multiple inputs easily.
What can be context
- dict of parameters and their values
- file path with json on yaml file - this is analogous to context files
- Context object
- list of previous - in that case context are merged together. In case of parameter conflict later has higher priority.
Warning
Parameters in context are applied globaly, i.e. on all configs in chain. Be cearful with parameters of the same name in different pipelines.
Namespaces
In the case of more complicated chains which uses namespaces you can run to problems,
when one pipeline is in chain multiple times with different configuration (under different namespaces).
For these cases, context can have for_namespaces
field. It's valued should be dict with namespaces as keys
and parameters to overwrite in this namespace as value.
Context YAML file using for_namespace
for_namespaces:
train:
input_data: '/path/to/data1'
other_param: 42
test:
input_data: '/path/to/data2'
batch_size: 32
Uses
It is possible to join multiple context in one file with uses
field.
Syntax is the same as in configs, but meaning slight different.
In contexts files in uses
are just merged to the main context.
If ... as namespace
format is used, loaded context is applied only for given namespace.
Following example is equivalent to the previous one.
Context YAML files using uses
uses:
- /path/to/train.context.yaml as train
- /path/to/test.context.yaml as test
batch_size: 32
input_data: '/path/to/data1'
other_param: 42
input_data: '/path/to/data2'