SRC (SAMPO Run Configuration) Specification

Overview

A SAMPO Run Configuration (SRC) describes a run configuration of a learning/prediction process.

The SRC can be prepared in two formats: Python object or file.

Example:

learn_1:
    type: learn

    data_sources:
        dl1:
            path: data.csv
            attr_schema: schema.asd
            filters:
                - slice(1, 50, 2)

Format

The SRC can be prepared either as:

  1. a Python object (usable in SAMPO API)

  2. a text file (usable in SAMPO API and SAMPO Command)

SRC Object

The SRC object is an instance of the SAMPO API RunConfiguration class and can be used only via the SAMPO API.

RunConfiguration objects are generated with the SAMPO API gen_src() function, from an SRC file or string written in either base or template syntax.

SRC File

The SRC file is a text file written in the SRC base syntax. SAMPO Command supports only SRC files.

The SRC file must satisfy the following constraints:

    Property          Constraint
    ----------------  ----------------------------------------------
    File name         ASCII characters, with the ".src" extension
    Character code    Python 3: UTF-8 (ASCII + Japanese characters)
                      Python 2: ASCII
    Newline code      CRLF (recommended), LF (not recommended)
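The file-name and character-code constraints above can be checked programmatically. A minimal sketch (check_src_filename is an illustrative helper, not part of the SAMPO API):

```python
import os

def check_src_filename(path):
    """Check an SRC file name against the spec:
    ASCII characters only, with the ".src" extension."""
    name = os.path.basename(path)
    if not name.isascii():
        return False
    return name.endswith(".src")

print(check_src_filename("my_run.src"))   # True
print(check_src_filename("my_run.yaml"))  # False (wrong extension)
```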


SRC Syntax

SRCs can be written in two syntaxes:

  1. Base Syntax

    Provides complete SRC information.

  2. Template Syntax

    Provides a renderable template that allows dynamic value changes in SAMPO API.

SRCs follow a syntax based on YAML.

Base Syntax

The base syntax contains complete SRC information, so an SRC written in it can be used directly in SAMPO. SRC files for SAMPO Command must strictly follow the base syntax.

In the SAMPO API gen_src(), the params parameter is ignored when loading an SRC written in base syntax.

Base Syntax Example:

learn_1:
    type: learn

    data_sources:
        dl1:
            path: sample1.csv
            attr_schema: sample1.asd
            filters:
                - slice(5, 100, 2)

Template Syntax

Since the SAMPO API gen_src() accepts templates for creating SRC objects, templates in both file and string form follow the same syntax. Templates must be Jinja2 compliant and are rendered before being used in SAMPO.

Template Example:

learn_{{ proc_num }}:
    type: learn
    data_sources:
        dl1:
            path: {{ csv_file_path }}
            attr_schema: {{ asd }}
            filters:
                - {{ filter_1 }}
                - {{ filter_2 }}
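Rendering replaces each {{ name }} placeholder with a concrete value, producing a base-syntax SRC. In SAMPO this is done with Jinja2 (via gen_src() and its params parameter); the regex-based stand-in below only illustrates the effect of that step:

```python
import re

def render_template(template, params):
    """Substitute {{ name }} placeholders with concrete values,
    approximating what Jinja2 rendering produces for simple templates."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(params[m.group(1)]),
                  template)

template = "learn_{{ proc_num }}:\n    type: learn\n    data_sources:\n        dl1:\n            path: {{ csv_file_path }}"
print(render_template(template, {"proc_num": 1, "csv_file_path": "sample1.csv"}))
# learn_1:
#     type: learn
#     data_sources:
#         dl1:
#             path: sample1.csv
```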

SRC Parameters

SRCs require different parameter combinations depending on the type of data source to be used.

For CSV file data_sources:

<process_name>:
    type: <learn|predict>
    data_sources:
        <cid>:
            path: <file_path>
            attr_schema: <asd_file_path_or_asd_object>
            filters:
                - <filter_name>
                - ...

        ...
    model_process: <process_name>

    hotstart:
        <cid>:
            base_model: <process_name>
            <parameter_key1>: <parameter_value1>
            <parameter_key2>: <parameter_value2>

For database data_sources:

<process_name>:
    type: <learn|predict>
    data_sources:
        <cid>:
            sql: <sql_query_or_database_table_or_view_name>
            connection_uri: <connection_uri>
            attr_schema: <asd_file_path_or_asd_object>
            filters:
                - <filter_name>
                - ...

        ...
    model_process: <process_name>

    hotstart:
        <cid>:
            base_model: <process_name>
            <parameter_key1>: <parameter_value1>
            <parameter_key2>: <parameter_value2>

For Pandas DataFrame data_sources:

<process_name>:
    type: <learn|predict>
    data_sources:
        <cid>:
            df: <pandas_dataframe_object>
            attr_schema: <asd_file_path_or_asd_object>
            filters:
                - <filter_name>
                - ...

        ...
    model_process: <process_name>

    hotstart:
        <cid>:
            base_model: <process_name>
            <parameter_key1>: <parameter_value1>
            <parameter_key2>: <parameter_value2>

Note

The Pandas DataFrame data_sources format is valid only via the gen_src() function.


For ARFF file data_sources:

<process_name>:
    type: <learn|predict>
    data_sources:
        <cid>:
            <path|data_source>: <arff_file_path>
            filters:
                - <filter_name>
                - ...

        ...
    model_process: <process_name>

Warning

The ARFF file data_sources format is deprecated. Use the CSV file data_sources format instead.

Parameters common to all data source patterns

  • <process_name>
    • Only alphanumeric characters and underscores can be used.

    • The first character must be an alphabetic character.

    • The process name must be unique.

  • type
    • Specifies the process type: learn or predict.

  • model_process (prediction process only)
    • Specifies, by process name, a model that has been learned in a learn process and is to be used for the prediction.

  • hotstart (learning process with hot-start only)
    • Hot-start learning means learning from an initial solution generated by a previously learned model (base_model).

    • Specifies, at base_model, the process name of a model that has been learned in a learn process and is to be used for hot-start learning.

    • The ASD specified in attr_schema of the data_sources parameter must be the same as the one used in the learning process of the base_model.

    • The attributes of the input data for each component learned by hot-start must be the same as those of the base_model.

    • Other parameter_keys and parameter_values are described in each component's specification.
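The <process_name> rules above (alphanumeric characters and underscores, first character alphabetic) can be expressed as a single regular expression. A sketch, where validate_process_name is an illustrative helper rather than part of the SAMPO API:

```python
import re

def validate_process_name(name):
    """Allow only alphanumeric characters and underscores,
    with an alphabetic first character."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name) is not None

print(validate_process_name("learn_1"))  # True
print(validate_process_name("1_learn"))  # False (starts with a digit)
print(validate_process_name("learn-1"))  # False (hyphen not allowed)
```

Uniqueness of the process name cannot be checked per-name; it has to be verified across all processes in the SRC.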

Parameters specific to each data source

  • data_sources

    Defines the data source for each data loader component.

    • CSV file
      • path: Specifies the input file path.

      • attr_schema: Specifies an ASD file path or ASD object.

    • SQL Query, database table, or view
      • sql: Specifies the input SELECT SQL query, database table, or view name. Table or column names containing spaces in SQL queries must be enclosed in double quotation marks. When working with time-series data, it is recommended to use ORDER BY in the query.

      • table_name: Same as sql. If specified together with sql, this parameter’s value is ignored.

      Warning

      table_name is deprecated. sql should be used instead of table_name.

      • connection_uri: Specifies a database connection URI in the following format:

        schema://[user[:password]@][host][:port][/database]
        
        • schema: postgresql is supported.

        • The password file (.pgpass) of PostgreSQL can be used to hold parts of the database connection URI, such as the password. A sample URI relying on .pgpass:

          postgresql://aapfuser@dbhost:5432/testdb
          
      • attr_schema: Specifies an ASD file path or ASD object.

    • Pandas DataFrame
      • df: Specifies the input pandas.DataFrame object.

      • attr_schema: Specifies an ASD file path or ASD object.

    • ARFF file
      • path: Specifies the input file path.

      • data_source: Same as path. If specified together with path, this parameter’s value is ignored.

      Warning

      • The ARFF file format is deprecated. Use the CSV file data_sources format instead.

      • data_source is deprecated. path should be used instead of data_source.
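Since the connection URI follows standard URL structure, Python's urllib.parse can decompose it into the components named above. A sketch using a sample URI (credentials and database name are illustrative):

```python
from urllib.parse import urlsplit

# Decompose a connection URI of the form
# schema://[user[:password]@][host][:port][/database]
uri = "postgresql://aapfuser:aapfpass@localhost:5432/testdb"
parts = urlsplit(uri)

print(parts.scheme)            # postgresql
print(parts.username)          # aapfuser
print(parts.password)          # aapfpass
print(parts.hostname)          # localhost
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # testdb
```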

Filters

The data_sources section supports the following filters, which select samples from the data:

slice(start=0, stop, step=1)

Slices the data from start to stop at intervals of step. If only one argument is given, it is treated as stop and the other parameters take their defaults. If two arguments are given, they are treated as start and stop, and step takes its default.
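The argument-defaulting rule mirrors Python's built-in slice. A sketch of how the filter arguments map to (start, stop, step), with parse_slice_args as an illustrative helper:

```python
def parse_slice_args(*args):
    """Map slice-filter arguments to (start, stop, step)
    using the documented defaults (start=0, step=1)."""
    if len(args) == 1:
        return (0, args[0], 1)        # single argument is stop
    if len(args) == 2:
        return (args[0], args[1], 1)  # start and stop; step defaults to 1
    return args                       # start, stop, step all given

print(parse_slice_args(100))        # (0, 100, 1)
print(parse_slice_args(5, 100))     # (5, 100, 1)
print(parse_slice_args(5, 100, 2))  # (5, 100, 2)
```

The resulting triple selects samples the same way Python's rows[start:stop:step] would.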


k_split(k, pos=0, complementary=False)

Splits the data into k parts and returns the pos-th part. If complementary is True, returns the complement of the specified part instead of the part itself.
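The semantics can be sketched on a plain Python list. The helper below is illustrative only; in particular, the assumption that the k parts are near-equal contiguous chunks is not stated by the spec:

```python
def k_split(data, k, pos=0, complementary=False):
    """Split data into k contiguous parts; return the pos-th part,
    or its complement if complementary=True."""
    n = len(data)
    # boundaries of the k parts (assumed: near-equal contiguous chunks)
    bounds = [round(i * n / k) for i in range(k + 1)]
    if complementary:
        return data[:bounds[pos]] + data[bounds[pos + 1]:]
    return data[bounds[pos]:bounds[pos + 1]]

samples = list(range(10))
print(k_split(samples, k=5, pos=1))                      # [2, 3]
print(k_split(samples, k=5, pos=1, complementary=True))  # [0, 1, 4, 5, 6, 7, 8, 9]
```

This split/complement pair is the usual building block for cross-validation: the pos-th part serves as a hold-out set and its complement as the training set.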


Examples

  • CSV File

    The learning process:

    learn_1:
        type: learn
        data_sources:
            dl1:
                path: sample1.csv
                attr_schema: sample1.asd
                filters:
                    - slice(0, 1800, 2)
    

    The prediction process:

    predict_1:
        type: predict
        data_sources:
            dl1:
                path: sample1.csv
                attr_schema: sample1.asd
                filters:
                    - slice(1800, 2000, 2)
        model_process: learn_1
  • Database table

    The learning process:

    learn_1:
        type: learn
        data_sources:
            dl1:
                sql: SELECT * FROM table_a ORDER BY _datetime ASC
                connection_uri: postgresql://aapfuser:aapfpass@localhost:5432/testdb
                attr_schema: table_a.asd
                filters:
                    - slice(0, 1800, 2)
    

    The prediction process:

    predict_1:
        type: predict
        data_sources:
            dl1:
                sql: SELECT * FROM table_a ORDER BY _datetime ASC
                connection_uri: postgresql://aapfuser:aapfpass@localhost:5432/testdb
                attr_schema: table_a.asd
                filters:
                    - slice(1800, 2000, 2)
        model_process: learn_1
  • ARFF File

    The learning process:

    learn_1:
        type: learn
        data_sources:
            dl1:
                path: sample1.arff
                filters:
                    - slice(0, 1800, 2)
    

    The prediction process:

    predict_1:
        type: predict
        data_sources:
            dl1:
                path: sample1.arff
                filters:
                    - slice(1800, 2000, 2)
        model_process: learn_1