SRC (SAMPO Run Configuration) Specification¶
Overview¶
A SAMPO Run Configuration (SRC) describes a run configuration of a learning/prediction process.
The SRC can be prepared in two formats: Python object or file.
Example:
learn_1:
  type: learn
  data_sources:
    dl1:
      path: data.csv
      attr_schema: schema.asd
      filters:
        - slice(1, 50, 2)
Format¶
The SRC can be prepared either as:
a Python object (usable in SAMPO API)
a text file (usable in SAMPO API and SAMPO Command)
SRC Object¶
The SRC object is an instance of SAMPO API RunConfiguration and can only be used via SAMPO API.
Generating RunConfiguration objects requires one of the following:
an SRC file
a string object that follows the SRC base syntax
an SRC file or string object that follows the SRC template syntax, together with a parameter dictionary for rendering
SRC File¶
The SRC file is a text file that follows the SRC base syntax only. SAMPO Command supports SRC files only.
The SRC file must fit the following constraints:
Property | Constraint
---|---
File name | ASCII characters, with the .src extension
Character code | Python 3: UTF-8 (ASCII and Japanese characters); Python 2: ASCII
Newline code | CRLF (recommended), LF (not recommended)
SRC Syntax¶
SRCs can be written in either of two syntaxes:
- Base Syntax
Provides complete SRC information.
- Template Syntax
Provides a renderable template that allows dynamic value changes in SAMPO API.
Both syntaxes are based on YAML version 1.2.
Base Syntax¶
The base syntax contains complete SRC information, so the SRC can be used directly in SAMPO. SRC files for SAMPO Command must strictly follow the base syntax. In SAMPO API gen_src(), the params parameter is ignored when loading an SRC written in base syntax.
Base Syntax Example:
learn_1:
  type: learn
  data_sources:
    dl1:
      path: sample1.csv
      attr_schema: sample1.asd
      filters:
        - slice(5, 100, 2)
Template Syntax¶
SAMPO API gen_src() also accepts templates, in both file and string-object form, for creating SRC objects. Templates must be Jinja2 compliant and are rendered before being used in SAMPO.
Template Example:
learn_{{ proc_num }}:
  type: learn
  data_sources:
    dl1:
      path: {{ csv_file_path }}
      attr_schema: {{ asd }}
      filters:
        - {{ filter_1 }}
        - {{ filter_2 }}
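To illustrate how template rendering works, the sketch below substitutes parameter values into a template like the one above. It is a minimal stand-in for Jinja2 variable substitution so the example runs without SAMPO or Jinja2 installed; in SAMPO itself, rendering is performed by gen_src() with a parameter dictionary, and this helper is not part of the SAMPO API.

```python
import re

# Template from the example above, truncated to the CSV fields.
TEMPLATE = """\
learn_{{ proc_num }}:
  type: learn
  data_sources:
    dl1:
      path: {{ csv_file_path }}
      attr_schema: {{ asd }}
"""

def render(template, params):
    """Minimal stand-in for Jinja2 variable substitution."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(params[m.group(1)]), template)

src_text = render(TEMPLATE, {"proc_num": 1,
                             "csv_file_path": "sample1.csv",
                             "asd": "sample1.asd"})
print(src_text)
```

The rendered result is a complete base-syntax SRC, which is what SAMPO ultimately consumes.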
SRC Parameters¶
SRCs have different parameter combination configurations depending on the type of data source to be used.
For CSV file data_sources:
<process_name>:
  type: <learn|predict>
  data_sources:
    <cid>:
      path: <file_path>
      attr_schema: <asd_file_path_or_asd_object>
      filters:
        - <filter_name>
        - ...
    ...
  model_process: <process_name>
  hotstart:
    <cid>:
      base_model: <process_name>
  <parameter_key1>: <parameter_value1>
  <parameter_key2>: <parameter_value2>
For database data_sources:
<process_name>:
  type: <learn|predict>
  data_sources:
    <cid>:
      sql: <sql_query_or_database_table_or_view_name>
      connection_uri: <connection_uri>
      attr_schema: <asd_file_path_or_asd_object>
      filters:
        - <filter_name>
        - ...
    ...
  model_process: <process_name>
  hotstart:
    base_model: <process_name>
  <parameter_key1>: <parameter_value1>
  <parameter_key2>: <parameter_value2>
For Pandas DataFrame data_sources:
<process_name>:
  type: <learn|predict>
  data_sources:
    <cid>:
      df: <pandas_dataframe_object>
      attr_schema: <asd_file_path_or_asd_object>
      filters:
        - <filter_name>
        - ...
    ...
  model_process: <process_name>
  hotstart:
    <cid>:
      base_model: <process_name>
  <parameter_key1>: <parameter_value1>
  <parameter_key2>: <parameter_value2>
Note
The Pandas DataFrame data_sources format is valid only via the gen_src() function.
For ARFF file data_sources:
<process_name>:
  type: <learn|predict>
  data_sources:
    <cid>:
      <path|data_source>: <arff_file_path>
      filters:
        - <filter_name>
        - ...
    ...
  model_process: <process_name>
Warning
The ARFF file data_sources format is deprecated. Use the CSV file data_sources format instead.
Parameters common to all data source patterns¶
- <process_name>
Only alphanumeric characters and underscores can be used.
The first character must be alphabetic.
The process name must be unique.
- type
Specifies the process type: learn or predict.
- model_process (prediction process only)
Specifies, by process name, the model to use for prediction; the model must have been learned in a learn process.
- hotstart (learning process with hot-start only)
Hot-start learning means learning with an initial solution generated from a previous model (base_model).
Specifies, by process name at base_model, the model to hot-start from; the model must have been learned in a learn process.
The ASD specified in the attr_schema of the data_sources parameter must be the same as the one used in the learning process of the base_model.
The attributes of the input data for each component learned by hot-start must be the same as those of the base_model.
Other parameter_keys and parameter_values are described in the individual Component Specifications.
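The <process_name> constraints above can be checked mechanically. The sketch below is an illustration only (it is not part of the SAMPO API, and it assumes "alphabetic" means ASCII letters):

```python
import re

# Validation rule from the spec: alphanumeric characters and underscores
# only, with an alphabetic first character (assumed to mean ASCII letters).
PROCESS_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_process_name(name):
    """Return True if name satisfies the <process_name> character rules."""
    return bool(PROCESS_NAME_RE.fullmatch(name))

print(is_valid_process_name("learn_1"))   # valid
print(is_valid_process_name("1learn"))    # invalid: starts with a digit
print(is_valid_process_name("learn-1"))   # invalid: hyphen not allowed
```

Uniqueness across the SRC must still be checked separately, since it depends on the other process names in the same configuration.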
Parameters specific to each data source¶
- data_sources
Defines the data source for each data loader component.
- CSV file
path: Specifies the input file path.
attr_schema: Specifies an ASD file path or ASD object.
- SQL Query, database table, or view
sql: Specifies the input SELECT SQL query, database table name, or view name. Table or column names containing spaces in SQL queries must be enclosed in double quotation marks. When working with time-series data, it is recommended to use ORDER BY in the query.
table_name: Same as sql. If specified together with sql, this parameter’s value is ignored.
Warning
table_name is deprecated. sql should be used instead of table_name.
connection_uri: Specifies a database connection URI in the following format:
schema://[user[:password]@][host][:port][/database]
schema: postgresql is supported.
The PostgreSQL password file (.pgpass) can be used to hold parts of the database connection URI. A sample URI relying on .pgpass:
postgresql://aapfuser@dbhost:5432/testdb
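For reference, a matching .pgpass entry supplying the password omitted from the sample URI above would look like the following (standard PostgreSQL format, one connection per line, fields hostname:port:database:username:password; the password value here is a placeholder):

```
dbhost:5432:testdb:aapfuser:aapfpass
```

PostgreSQL reads this file automatically when no password is given in the connection URI.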
attr_schema: Specifies an ASD file path or ASD object.
- Pandas DataFrame
df: Specifies the input pandas.DataFrame object.
attr_schema: Specifies an ASD file path or ASD object.
- ARFF file
path: Specifies the input file path.
data_source: Same as path. If specified together with path, this parameter’s value is ignored.
Warning
The ARFF file format is deprecated. Use CSV file data_sources instead.
data_source is deprecated. Use path instead.
Filters¶
The data_sources section supports the following filters, which select samples from the data:
- slice(start=0, stop, step=1)
Slices the data from start to stop at intervals of step.
If only one argument is given, it is treated as stop, and the remaining parameters take their defaults.
If two arguments are given, they are treated as start and stop, and step takes its default.
- k_split(k, pos=0, complementary=False)
Splits the data into k parts and returns the pos-th part.
If complementary is True, returns the complementary set of the specified part instead of the part itself.
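The filter semantics above can be sketched in standalone Python. This is an illustration only, not SAMPO's implementation; in particular, the exact chunk boundaries used by k_split for sizes not divisible by k are an assumption.

```python
def slice_filter(data, start=0, stop=None, step=1):
    """Select samples from start to stop at intervals of step."""
    return data[start:stop:step]

def k_split(data, k, pos=0, complementary=False):
    """Split data into k contiguous parts and return the pos-th part,
    or its complement when complementary=True."""
    n = len(data)
    bounds = [round(i * n / k) for i in range(k + 1)]  # assumed partitioning
    if complementary:
        return data[:bounds[pos]] + data[bounds[pos + 1]:]
    return data[bounds[pos]:bounds[pos + 1]]

samples = list(range(10))
print(slice_filter(samples, 0, 10, 2))              # [0, 2, 4, 6, 8]
print(k_split(samples, 5, pos=1))                   # [2, 3]
print(k_split(samples, 5, pos=1, complementary=True))  # everything except [2, 3]
```

Together these correspond to the filter entries written in an SRC, e.g. slice(0, 10, 2) or k_split(5, 1).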
Examples¶
CSV File
The learning process:
learn_1:
  type: learn
  data_sources:
    dl1:
      path: sample1.csv
      attr_schema: sample1.asd
      filters:
        - slice(0, 1800, 2)
The prediction process:
predict_1:
  type: predict
  data_sources:
    dl1:
      path: sample1.csv
      attr_schema: sample1.asd
      filters:
        - slice(1800, 2000, 2)
  model_process: learn_1
Database table
The learning process:
learn_1:
  type: learn
  data_sources:
    dl1:
      sql: SELECT * FROM table_a ORDER BY _datetime ASC
      connection_uri: postgresql://aapfuser:aapfpass@localhost:5432/testdb
      attr_schema: table_a.asd
      filters:
        - slice(0, 1800, 2)
The prediction process:
predict_1:
  type: predict
  data_sources:
    dl1:
      sql: SELECT * FROM table_a ORDER BY _datetime ASC
      connection_uri: postgresql://aapfuser:aapfpass@localhost:5432/testdb
      attr_schema: table_a.asd
      filters:
        - slice(1800, 2000, 2)
  model_process: learn_1
ARFF File
The learning process:
learn_1:
  type: learn
  data_sources:
    dl1:
      path: sample1.arff
      filters:
        - slice(0, 1800, 2)
The prediction process:
predict_1:
  type: predict
  data_sources:
    dl1:
      path: sample1.arff
      filters:
        - slice(1800, 2000, 2)
  model_process: learn_1