===========================================
SRC (SAMPO Run Configuration) Specification
===========================================

.. contents:: Contents
    :local:

Overview
========
A SAMPO Run Configuration (SRC) describes the configuration of a learning or prediction process.

The SRC can be prepared in two formats: as a Python object or as a text file.

**Example**::

    learn_1:
        type: learn

        data_sources:
            dl1:
                path: data.csv
                attr_schema: schema.asd
                filters:
                    - slice(1, 50, 2)

|

Format
======
The SRC can be prepared either as:

#. a Python object (usable in SAMPO API)
#. a text file (usable in SAMPO API and SAMPO Command)

SRC Object
----------
The SRC object is an instance of the SAMPO API RunConfiguration class and can be used only via the SAMPO API.

.. seealso::

    `SAMPO API RunConfiguration <../sampo/api/run_configuration.html>`_

Generating RunConfiguration objects requires one of the following:

* an SRC file
* a string object that follows the :ref:`SRC base syntax<src-base-syntax>`
* an SRC file or string object that follows the :ref:`SRC template syntax<src-template-syntax>`,
  together with a parameter dictionary for rendering

.. seealso::

    `SAMPO API gen_src() <../sampo/api/gen_src.html>`_

.. _src:

SRC File
--------
The SRC file is a text file written in the :ref:`SRC base syntax<src-base-syntax>` only.
SAMPO Command supports only SRC files.

The SRC file must fit the following constraints:

+----------------+------------------------------------------------------+
| Property       | Constraint                                           |
+================+======================================================+
| File name      | *ASCII characters*.src                               |
+----------------+------------------------------------------------------+
| Character code | | Python 3: UTF-8 (ASCII + Japanese Characters)      |
|                | | Python 2: ASCII                                    |
+----------------+------------------------------------------------------+
| Newline code   | CRLF (Recommended),  LF (Not Recommended)            |
+----------------+------------------------------------------------------+

|

.. _src-syntax:

SRC Syntax
==========
An SRC can be written in either of two syntaxes:

#. :ref:`Base Syntax<src-base-syntax>`
    Provides complete SRC information.
#. :ref:`Template Syntax<src-template-syntax>`
    Provides a renderable template that allows dynamic value changes in SAMPO API.

SRCs follow a syntax based on YAML.

* YAML Version 1.2
    https://yaml.org/spec/1.2/spec.html

.. _src-base-syntax:

Base Syntax
-----------
The base syntax describes a complete SRC that can be used directly in SAMPO.
SRC files for SAMPO Command must strictly follow the base syntax.

In `SAMPO API gen_src() <../sampo/api/gen_src.html>`_, the ``params`` parameter
is ignored when loading an SRC written in the base syntax.


**Base Syntax Example**::

    learn_1:
        type: learn

        data_sources:
            dl1:
                path: sample1.csv
                attr_schema: sample1.asd
                filters:
                    - slice(5, 100, 2)

.. _src-template-syntax:

Template Syntax
---------------
`SAMPO API gen_src() <../sampo/api/gen_src.html>`_ accepts templates, as either files
or string objects, for creating SRC objects; both cases follow the same syntax.
Templates must be Jinja2 compliant and are rendered before being used in SAMPO.

* Jinja2
    http://jinja.pocoo.org

**Template Example**::

      learn_{{ proc_num }}:
          type: learn
          data_sources:
              dl1:
                  path: {{ csv_file_path }}
                  attr_schema: {{ asd }}
                  filters:
                      - {{ filter_1 }}
                      - {{ filter_2 }}
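
Rendering substitutes values from the parameter dictionary into the ``{{ name }}``
placeholders. As an illustration outside of SAMPO, here is a minimal stand-in
renderer using only the Python standard library; it handles simple placeholders
only, whereas SAMPO itself uses a full Jinja2 renderer::

```python
import re


def render(template: str, params: dict) -> str:
    """Substitute {{ name }} placeholders from params.

    Illustration only: SAMPO renders templates with Jinja2, which also
    supports expressions, loops, and conditionals.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(params[m.group(1)]),
        template,
    )


template = "learn_{{ proc_num }}:\n    type: learn"
print(render(template, {"proc_num": 1}))  # learn_1:\n    type: learn
```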


SRC Parameters
==============
An SRC's parameter combination depends on the type of data source to be used.

For CSV file data_sources::

    <process_name>:
        type: <learn|predict>
        data_sources:
            <cid>:
                path: <file_path>
                attr_schema: <asd_file_path_or_asd_object>
                filters:
                    - <filter_name>
                    - ...

            ...
        model_process: <process_name>

        hotstart:
            <cid>:
                base_model: <process_name>
                <parameter_key1>: <parameter_value1>
                <parameter_key2>: <parameter_value2>

|

For database data_sources::

    <process_name>:
        type: <learn|predict>
        data_sources:
            <cid>:
                sql: <sql_query_or_database_table_or_view_name>
                connection_uri: <connection_uri>
                attr_schema: <asd_file_path_or_asd_object>
                filters:
                    - <filter_name>
                    - ...

            ...
        model_process: <process_name>

        hotstart:
            base_model: <process_name>
            <parameter_key1>: <parameter_value1>
            <parameter_key2>: <parameter_value2>

|

For Pandas DataFrame data_sources::

    <process_name>:
        type: <learn|predict>
        data_sources:
            <cid>:
                df: <pandas_dataframe_object>
                attr_schema: <asd_file_path_or_asd_object>
                filters:
                    - <filter_name>
                    - ...

            ...
        model_process: <process_name>

        hotstart:
            <cid>:
                base_model: <process_name>
                <parameter_key1>: <parameter_value1>
                <parameter_key2>: <parameter_value2>

.. note::

   The Pandas DataFrame data_sources format is valid only via the `gen_src() <../sampo/api/gen_src.html>`_ function.

|

For ARFF file data_sources::

    <process_name>:
        type: <learn|predict>
        data_sources:
            <cid>:
                <path|data_source>: <arff_file_path>
                filters:
                    - <filter_name>
                    - ...

            ...
        model_process: <process_name>

.. warning::

   The ARFF file data_sources format is deprecated. Use the CSV file data_sources format instead.

Parameters common to all data source patterns
---------------------------------------------

* <process_name>
    * Only alphanumeric characters and underscores can be used.
    * The first character must be an alphabetic character.
    * The process name must be unique.

* type
    * Specifies the process type: **learn** or **predict**.

* model_process (prediction process only)
    * Specifies, by **process name**, the model to use for prediction. The model must have been learned in a **learn** process.

* hotstart (learning process with hot-start only)
    * Hot-start learning means learning from an initial solution generated by a previous model (base_model).
    * Specifies, at base_model, the **process name** of a model that has been learned
      in a **learn** process.
    * The ASD specified in the attr_schema of the data_sources parameter must be the same as the one
      used in the learning process of the base_model.
    * The attributes of the input data for each component learned by hot-start must be the same as those of the base_model.
    * Other parameter keys and parameter values are described in the respective **Component Specifications**.
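
The <process_name> rules above can be expressed as a regular expression. The
following check is an illustration only, not part of the SAMPO API::

```python
import re

# Alphanumeric characters and underscores only; first character alphabetic.
PROCESS_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_process_name(name: str) -> bool:
    """Return True if name satisfies the <process_name> constraints
    (uniqueness must still be checked across the whole SRC)."""
    return PROCESS_NAME_RE.fullmatch(name) is not None


print(is_valid_process_name("learn_1"))  # True
print(is_valid_process_name("1_learn"))  # False: starts with a digit
print(is_valid_process_name("learn-1"))  # False: hyphen not allowed
```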

Parameters specific to each data source
---------------------------------------

* data_sources
    Defines the data source for each data loader component.

    * CSV file
        * path: Specifies the input file path.
        * attr_schema: Specifies an ASD file path or ASD object.

    * SQL Query, database table, or view
        * sql: Specifies the input SELECT SQL query, or a database table or view name.
          Table or column names containing spaces in SQL queries must be enclosed in double quotes.
          When working with time-series data, using ORDER BY in the query is recommended.

        * table_name: Same as sql. If specified together with sql, this parameter's value is ignored.

        .. warning::

           table_name is deprecated. sql should be used instead of table_name.

        * connection_uri: Specifies a database connection URI in the following format:

          ::

            schema://[user[:password]@][host][:port][/database]

          * schema: postgresql is supported.

          * The PostgreSQL password file (.pgpass) can be used to hold parts of
            the database connection URI, such as the password.
            A sample URI relying on .pgpass:

            .. code-block:: none

               postgresql://aapfuser@dbhost:5432/testdb

        * attr_schema: Specifies an ASD file path or ASD object.

    * Pandas DataFrame
        * df: Specifies the input pandas.DataFrame object.
        * attr_schema: Specifies an ASD file path or ASD object.

    * ARFF file
        * path: Specifies the input file path.
        * data_source: Same as path. If specified together with path, this parameter's value is ignored.

        .. warning::

           * The ARFF file format is deprecated. Use CSV file data_sources instead of ARFF files.
           * data_source is deprecated. path should be used instead of data_source.
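
The connection URI described above follows the common URI layout, so its parts
can be inspected with the Python standard library. This is an illustration of
the URI structure only, not of how SAMPO parses it::

```python
from urllib.parse import urlsplit

# Sample URI matching the schema://[user[:password]@][host][:port][/database] format.
uri = "postgresql://aapfuser:aapfpass@localhost:5432/testdb"
parts = urlsplit(uri)

print(parts.scheme)            # postgresql
print(parts.username)          # aapfuser
print(parts.hostname)          # localhost
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # testdb
```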

Filters
-------
The **data_sources** section supports the following filters, which select samples from the data:

.. method:: slice(start=0, stop, step=1)
    :noindex:

Slices the data from ``start`` to ``stop`` at intervals of ``step``.
If only one argument is given, it is interpreted as ``stop``, and the remaining parameters take their defaults.
If two arguments are given, they are interpreted as ``start`` and ``stop``, and ``step`` takes its default.
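
The argument handling mirrors Python's built-in slicing. As an illustration,
assuming the filter selects rows by index (how SAMPO applies the filter
internally is not specified here)::

```python
# A small stand-in dataset of ten row indices.
rows = list(range(10))

# slice(5): the single argument is stop; start and step take their defaults.
print(rows[slice(0, 5, 1)])  # [0, 1, 2, 3, 4]

# slice(1, 10, 2): start=1, stop=10, step=2.
print(rows[1:10:2])          # [1, 3, 5, 7, 9]
```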

|

.. method:: k_split(k, pos=0, complementary=False)
    :noindex:

Splits the data into ``k`` parts and returns the ``pos``-th part.
If ``complementary`` is **True**, returns the complement of the
specified part instead of the part itself.
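
The behavior can be sketched as follows. This is an illustrative sketch that
assumes the data is split into contiguous, roughly equal parts; the exact
partitioning SAMPO uses is not specified here::

```python
def k_split(data, k, pos=0, complementary=False):
    """Split data into k contiguous parts; return the pos-th part,
    or its complement if complementary is True (sketch only)."""
    n = len(data)
    bounds = [round(i * n / k) for i in range(k + 1)]
    if complementary:
        return data[:bounds[pos]] + data[bounds[pos + 1]:]
    return data[bounds[pos]:bounds[pos + 1]]


rows = list(range(10))
print(k_split(rows, k=5, pos=1))                      # [2, 3]
print(k_split(rows, k=5, pos=1, complementary=True))  # [0, 1, 4, 5, 6, 7, 8, 9]
```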

|

Examples
========

* CSV File

  The learning process::

      learn_1:
          type: learn
          data_sources:
              dl1:
                  path: sample1.csv
                  attr_schema: sample1.asd
                  filters:
                      - slice(0, 1800, 2)

|

  The prediction process::

      predict_1:
          type: predict
          data_sources:
              dl1:
                  path: sample1.csv
                  attr_schema: sample1.asd
                  filters:
                      - slice(1800, 2000, 2)
          model_process: learn_1


* Database table

  The learning process::

      learn_1:
          type: learn
          data_sources:
              dl1:
                  sql: SELECT * FROM table_a ORDER BY _datetime ASC
                  connection_uri: postgresql://aapfuser:aapfpass@localhost:5432/testdb
                  attr_schema: table_a.asd
                  filters:
                      - slice(0, 1800, 2)

|

  The prediction process::

      predict_1:
          type: predict
          data_sources:
              dl1:
                  sql: SELECT * FROM table_a ORDER BY _datetime ASC
                  connection_uri: postgresql://aapfuser:aapfpass@localhost:5432/testdb
                  attr_schema: table_a.asd
                  filters:
                      - slice(1800, 2000, 2)
          model_process: learn_1


* ARFF File

  The learning process::

      learn_1:
          type: learn
          data_sources:
              dl1:
                  path: sample1.arff
                  filters:
                      - slice(0, 1800, 2)

|

  The prediction process::

      predict_1:
          type: predict
          data_sources:
              dl1:
                  path: sample1.arff
                  filters:
                      - slice(1800, 2000, 2)
          model_process: learn_1
