=============================================
SPD (SAMPO Process Description) Specification
=============================================

.. contents:: Contents
    :local:

Overview
========
A SAMPO Process Description (SPD) describes the directed relationships among components and the parameters of each component.

The SPD can be prepared in two formats: Python object or file.

Each SPD has two main sections, separated by a line of three or more ``-`` characters (for example ``---``). The sections are written in the following order:
    #. :ref:`Data Flow Section<data-flow-section>`:
        Describing connections among components.
    #. :ref:`Parameters Section<parameters-section>`:
        Describing parameters of each component and global settings.


**Example**::

    dl1 -> std1 -> rg1

    ---

    components:
        dl1:
            component: DataLoader

        std1:
            component: StandardizeFDComponent
            features: scale == 'real' or scale == 'integer'

        rg1:
            component: FABHMEBernGateLinearRgComponent
            features: scale == 'real' or scale == 'integer'
            target: name == 'target'

    global_settings:
        keep_attributes:
            - a

        feature_exclude:
            - c

|
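
The two-section layout can be illustrated with a short Python sketch. The helper name ``split_spd`` is hypothetical and is not part of SAMPO. It assumes that the separator sits on a line of its own starting at the first column, and that the first such line is the separator:

.. code-block:: python

    import re

    # A separator line: three or more '-' characters, optionally followed by blanks.
    SEPARATOR = re.compile(r"^-{3,}[ \t\r]*$", re.MULTILINE)


    def split_spd(text):
        """Split SPD text into (data flow section, parameters section)."""
        parts = SEPARATOR.split(text, maxsplit=1)
        if len(parts) != 2:
            raise ValueError("no separator line of three or more '-' found")
        data_flow, parameters = parts
        return data_flow.strip(), parameters.strip()

Indented YAML list items such as ``- a`` do not match the pattern, because they do not start at the first column and contain only one ``-``.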

Format
======
The SPD can be prepared either as:

#. a Python object (usable in SAMPO API)
#. a text file (usable in SAMPO API and SAMPO Command)

SPD Object
----------
The SPD object is an instance of SAMPO API ProcessDescription and can only be used via SAMPO API.

.. seealso::

    `SAMPO API ProcessDescription <../sampo/api/process_description.html>`_

Generating ProcessDescription objects requires any of the following:

* SPD file
* String object that follows the :ref:`SPD base syntax<spd-base-syntax>`.
* Text file or string object that follows the :ref:`SPD template syntax<spd-template-syntax>`
  with a parameter dictionary for rendering.

.. seealso::

    `SAMPO API gen_spd() <../sampo/api/gen_spd.html>`_

.. _spd:

SPD File
--------
The SPD file is a text file that follows only the :ref:`SPD base syntax<spd-base-syntax>`.
SAMPO Command supports only SPD files.

An SPD file must fit the following constraints:

+----------------+------------------------------------------------------+
| Property       | Constraint                                           |
+================+======================================================+
| File name      | *ASCII characters*.spd                               |
+----------------+------------------------------------------------------+
| Character code | | Python 3: UTF-8 (ASCII + Japanese Characters)      |
|                | | Python 2: ASCII                                    |
+----------------+------------------------------------------------------+
| Newline code   | CRLF (Recommended),  LF (Not Recommended)            |
+----------------+------------------------------------------------------+
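
As an illustration of the file name constraint, the following sketch checks a path before it is handed to SAMPO. The function name is hypothetical. The sketch assumes that the constraint applies to the file name only (not to the directory) and that the extension is case-sensitive:

.. code-block:: python

    import os


    def is_valid_spd_file_name(path):
        """Return True if the file name is ASCII-only and ends with '.spd'."""
        file_name = os.path.basename(path)
        stem, extension = os.path.splitext(file_name)
        # str.isascii() requires Python 3.7 or later.
        return bool(stem) and extension == ".spd" and file_name.isascii()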

|

.. _spd-syntax:

SPD Syntax
==========
SPDs follow either of the two possible syntaxes:

#. :ref:`Base Syntax<spd-base-syntax>`
    Provides complete SPD information.
#. :ref:`Template Syntax<spd-template-syntax>`
    Provides a renderable template that allows dynamic value changes in SAMPO API.

.. _spd-base-syntax:

Base Syntax
-----------
The base syntax contains complete SPD information, so an SPD written in it can be used
directly in SAMPO. SPD files used with SAMPO Command must follow the base syntax.

In `SAMPO API gen_spd() <../sampo/api/gen_spd.html>`_, the ``params`` parameter
is ignored whenever loading an SPD written in base syntax.

**Base Syntax Example**::

    dl1 -> fd1 -> rg1
        -> fd2 -> rg1

    ---

    components:
        dl1:
            component: DataLoader

        fd1:
            component: StandardizeFDComponent
            features: scale == 'real' or scale == 'integer'

        fd2:
            component: BinaryExpandFDComponent
            features: scale == 'nominal'
            disable_feature_exclude: True

        rg1:
            component: FABHMEBernGateLinearRgComponent
            features: scale == 'real' and re_match('^a', name)
            target: name == 'target'

    global_settings:
        keep_attributes:
            - a
            - b

        feature_exclude:
            - c

.. _spd-template-syntax:

Template Syntax
---------------
`SAMPO API gen_spd() <../sampo/api/gen_spd.html>`_ can create SPD objects from templates.
A template, whether supplied as a file or as a string object, must follow the same syntax:
it must be Jinja2 compliant, and it must be rendered with a parameter dictionary before
it is used in SAMPO.

* Jinja2
    http://jinja.pocoo.org


**Template Example**::

    dl1 -> fd1 -> rg1

    ---

    components:
        dl1:
            component: DataLoader

        fd1:
            component: StandardizeFDComponent
            features: scale == 'real' or scale == 'integer'

        rg1:
            component: FABHMEBernGateLinearRgComponent
            features: scale == 'real' and re_match('^a', name)
            target: name == {{ target_attr }}
            tree_depth: {{ tree_depth }}

    global_settings:
        keep_attributes:
            - a
            - b

        feature_exclude:
            - c

|
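
To show what rendering does to a template, the sketch below renders a fragment of the example above directly with Jinja2. It does not call ``gen_spd()``. The parameter values are made up for illustration, and the choice to put the quotes inside the ``target_attr`` value is an assumption:

.. code-block:: python

    from jinja2 import Template

    TEMPLATE_TEXT = """\
    rg1:
        component: FABHMEBernGateLinearRgComponent
        target: name == {{ target_attr }}
        tree_depth: {{ tree_depth }}
    """

    # Hypothetical parameter dictionary; the quotes are part of the value.
    params = {"target_attr": "'target'", "tree_depth": 3}

    rendered = Template(TEMPLATE_TEXT).render(**params)
    print(rendered)

After rendering, the two placeholders are replaced by ``'target'`` and ``3``, so the result contains no ``{{ ... }}`` expressions and follows the base syntax.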

.. _data-flow-section:

Data Flow Section
=================
In the data flow section, connections between each component are described with component IDs and arrows (``->``)::

    dl1 -> fd1 -> rg1

Component ID
------------
You can use only alphanumeric characters and underscores for the component ID.
The first character must be an alphabetic character::

    aaa_2 -> bbb_2    # OK

    _a -> b           # NG: the first character is an underscore

    3a -> b3          # NG: the first character is a number

    あいう -> かきく  # NG: non-ASCII characters
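
The rule above can be expressed as a regular expression. The following is a minimal sketch with a hypothetical helper name, not SAMPO's own validation code:

.. code-block:: python

    import re

    # First character: ASCII letter. Remaining characters: ASCII letters,
    # digits, or underscores.
    COMPONENT_ID = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


    def is_valid_component_id(cid):
        """Return True if cid is an acceptable component ID."""
        return COMPONENT_ID.fullmatch(cid) is not None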

Spaces
------
There can be zero or more spaces between a component and an arrow.
Tab characters are not allowed::

    a->b->c      # OK

    a  ->  b->c  # OK

    a□->b       # NG: □ is a tab

Comments
--------
In the data flow section, only single-line comments are available.
Single-line comments begin with the hash character **#** and are terminated by the end of line::

    a -> b -> c   # this is a comment

    #a -> b -> c  # this line is commented out

Branching
---------
When a line begins with spaces followed by an arrow that is aligned with an arrow on the previous line, the line represents a branch from the component that precedes the aligned arrow on the previous line::

    a -> b -> c
      -> d -> c
           -> e -> c

|

The above configuration is the same as the following::

    a -> b -> c
    a -> d -> c
    d -> e -> c

|

The above configuration creates the following relationship:

.. graphviz::

  digraph foo {
      graph [rankdir = LR];
      "a" -> "b";
      "a" -> "d";
      "b" -> "c";
      "d" -> "c";
      "d" -> "e";
      "e" -> "c";
  }
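
The branching rule can be sketched as a small parser that turns a data flow section into a list of edges. This is an illustration under stated assumptions, not SAMPO's parser: the function name is hypothetical, comment and blank lines are skipped without resetting the alignment information, and a ``ValueError`` is raised for tabs, misaligned arrows, and invalid component IDs:

.. code-block:: python

    import re

    COMPONENT_ID = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


    def parse_data_flow(text):
        """Return the (parent, child) edges described by a data flow section."""
        edges = []
        previous_sources = {}  # arrow column -> component left of that arrow
        for raw_line in text.splitlines():
            line = raw_line.split("#", 1)[0].rstrip()  # drop comments
            if not line:
                continue
            if "\t" in line:
                raise ValueError("tab characters are not allowed: %r" % raw_line)
            columns = [match.start() for match in re.finditer("->", line)]
            names = [piece.strip() for piece in line.split("->")]
            if columns and not names[0]:
                # The line starts with an arrow: branch from the component that
                # precedes the aligned arrow on the previous line.
                if columns[0] not in previous_sources:
                    raise ValueError("arrow is not aligned: %r" % raw_line)
                names[0] = previous_sources[columns[0]]
            for name in names:
                if not COMPONENT_ID.fullmatch(name):
                    raise ValueError("invalid component ID: %r" % name)
            edges.extend(zip(names[:-1], names[1:]))
            previous_sources = dict(zip(columns, names[:-1]))
        return edges

Applied to the three-line branching example, the sketch yields the six edges shown in the graph above.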

Duplicate and Join
------------------
**Duplicate**
    If multiple components have the same parent (preceding component), every child component receives the same data.

    Below is an example.

    **Data flow**::

        dl1 -> fd1
        dl1 -> fd2

    In the above data flow, both **fd1** and **fd2** receive the same data from **dl1**.

    |

**Join**
    If a component has multiple parents, the component receives the outputs of all parents, joined on the sample indices (``_sid``).

    Below is an example.

    **Data flow**::

        dl1 -> rg1
        dl2 -> rg1

    **Output of dl1**:

    +------+--------+
    | _sid | dl1[0] |
    +======+========+
    |    0 |      1 |
    +------+--------+
    |    1 |      2 |
    +------+--------+
    |    2 |      3 |
    +------+--------+
    |    3 |      4 |
    +------+--------+

    **Output of dl2**:
    
    +------+--------+
    | _sid | dl2[0] |
    +======+========+
    |    3 |      1 |
    +------+--------+
    |    4 |      2 |
    +------+--------+
    |    5 |      3 |
    +------+--------+
    |    6 |      4 |
    +------+--------+

    |

    **Input to rg1**:

    +------+--------+--------+
    | _sid | dl1[0] | dl2[0] |
    +======+========+========+
    |    0 |      1 |    NaN |
    +------+--------+--------+
    |    1 |      2 |    NaN |
    +------+--------+--------+
    |    2 |      3 |    NaN |
    +------+--------+--------+
    | **3**|   **4**|   **1**|
    +------+--------+--------+
    |    4 |    NaN |      2 |
    +------+--------+--------+
    |    5 |    NaN |      3 |
    +------+--------+--------+
    |    6 |    NaN |      4 |
    +------+--------+--------+

|
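
The tables above suggest that the join keeps every ``_sid`` that appears in any parent and fills missing values with NaN, that is, an outer join. The following pure-Python sketch reproduces the example under that assumption. The function name and the dictionary-based data layout are chosen for illustration only:

.. code-block:: python

    import math


    def join_by_sid(outputs):
        """Outer-join component outputs on _sid.

        outputs maps a column name to a {_sid: value} dictionary.
        The result maps each _sid to a {column name: value} dictionary.
        """
        sids = sorted(set().union(*outputs.values()))
        return {
            sid: {column: values.get(sid, math.nan)
                  for column, values in outputs.items()}
            for sid in sids
        }


    joined = join_by_sid({
        "dl1[0]": {0: 1, 1: 2, 2: 3, 3: 4},
        "dl2[0]": {3: 1, 4: 2, 5: 3, 6: 4},
    })

Only ``_sid`` 3 appears in both outputs, so it is the only row without a NaN value.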

.. _parameters-section:

Parameters Section
==================
The parameters section has the following two sub-sections:

#. Components sub-section:
    Describing parameters of each component.
#. Global settings sub-section:
    Describing process-common parameters.

The basic syntax of the parameters section is YAML.

YAML Version 1.2
    https://yaml.org/spec/1.2/spec.html

**Format**::

    components:
        <cid>:
            component: <component_class_name>
            <parameter_key>: <parameter_value>
        ...

    global_settings:
        keep_attributes:
            - <attr_name>
            - ...
        feature_exclude:
            - <attr_name>
            - ...
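
Once the parameters section has been parsed by a YAML library into a dictionary, its shape can be checked against the format above. The sketch below is a hypothetical helper, not a SAMPO function. It checks only what the format states: every component entry has a ``component`` key, and the two documented global settings are lists:

.. code-block:: python

    def check_parameters(parameters):
        """Return a list of problems found in a parsed parameters section."""
        problems = []
        for cid, settings in parameters.get("components", {}).items():
            if "component" not in settings:
                problems.append("component '%s' has no 'component' key" % cid)
        global_settings = parameters.get("global_settings", {})
        for key in ("keep_attributes", "feature_exclude"):
            if key in global_settings and not isinstance(global_settings[key], list):
                problems.append("global setting '%s' must be a list" % key)
        return problems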

Components Sub-section
----------------------
The components sub-section describes the parameters of each component, as shown below::

    rg1:
        component: FABHMEBernGateLinearRgComponent
        features: scale == 'real' and re_match('^a', name)
        target: name == 'target'

|

``features`` and ``target`` are special parameters to select feature/target attributes.

Component parameters other than ``features`` and ``target`` are described in the **Component Specifications** of each component.

Attribute Selection by ``features`` and ``target``
**************************************************
Attribute selection by ``features`` and ``target`` parameters can be written in Python syntax as shown below::

    features: scale == 'real' and not name == 'velocity'
    target: name == 'target'

|

The following operators, matching functions, and defined variables are available in attribute selection.

Operators
^^^^^^^^^
- Comparison operators

  - ==
  - !=
  - is

- Logical operators

  - and
  - or
  - not

Matching Functions
^^^^^^^^^^^^^^^^^^
.. method:: re_match(pattern, variable)
    :noindex:

Matches all attributes whose *variable* matches the *pattern* (regular expression)::

  features: re_match('^a', name)

In the above example, attributes whose ``name`` begins with '**a**' are selected as features.

|

.. method:: derived_from(attr_name)
    :noindex:

Matches all attributes that are derived from the specified attribute name::

  features: derived_from('a')

In the above example, attributes which are derived from the attribute **'a'** are selected as features.

.. warning::

  derived_from() is deprecated.

|

.. method:: generated_by(cid)
    :noindex:

Matches all attributes that are generated by the specified component::

  features: generated_by('fd1')

In the above example, attributes which are generated by the component **'fd1'** are selected as features.

|

.. method:: all()
    :noindex:

Matches all attributes::
    
  features: all()

In the above example, all the attributes are selected as features.

|

.. method:: empty()
    :noindex:

Matches no attributes::
    
  features: empty()

In the above example, no attributes are selected as features.

Defined Variables
^^^^^^^^^^^^^^^^^
**name**
  Attribute name that is assigned by a component or a user.

**scale**
  Scale of the attribute as follows:

  * integer
  * real
  * date
  * nominal
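
Because selection conditions use Python syntax, their behavior can be imitated by evaluating the condition once per attribute. The sketch below is an illustration only and is not how SAMPO is known to work: it uses ``eval`` (suitable for trusted input only), it assumes that ``re_match`` behaves like ``re.search``, and it omits ``derived_from`` and ``generated_by`` because they need provenance information that this specification does not describe:

.. code-block:: python

    import re


    def select_attributes(condition, attributes):
        """Return the names of the attributes that satisfy the condition.

        attributes is a list of dictionaries with 'name' and 'scale' keys.
        """
        selected = []
        for attribute in attributes:
            namespace = {
                "__builtins__": {},  # expose nothing but the names below
                "name": attribute["name"],
                "scale": attribute["scale"],
                # Assumption: search semantics, so '^' anchors explicitly.
                "re_match": lambda pattern, variable:
                    re.search(pattern, variable) is not None,
                "all": lambda: True,
                "empty": lambda: False,
            }
            if eval(condition, namespace):
                selected.append(attribute["name"])
        return selected


    attributes = [
        {"name": "a1", "scale": "real"},
        {"name": "b1", "scale": "real"},
        {"name": "a2", "scale": "nominal"},
        {"name": "target", "scale": "real"},
    ]
    features = select_attributes("scale == 'real' and re_match('^a', name)", attributes)

With these made-up attributes, only ``a1`` is both real-valued and named with a leading ``a``.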

``disable_feature_exclude``
***************************
When the global settings parameter ``feature_exclude`` is specified, the attributes
listed in it are excluded from the selected features (see the **Global Settings Sub-section**).

However, when the component parameter ``disable_feature_exclude`` is **True**, the feature exclusion for the component will be disabled::

    components:
        ...
        log1:
            ...
            features: re_match('a', name)
            disable_feature_exclude: True
        std1:
            ...
            features: re_match('a', name)

    global_settings:
        ...
        feature_exclude:
            - a1

In the above example, the component **log1** selects the attribute **a1** as a feature, whereas the component **std1** doesn't because of ``feature_exclude``.
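
The interaction between the two parameters can be summarized in a few lines of Python. The function below is a hypothetical sketch of the rule as described in this section, not SAMPO code:

.. code-block:: python

    def effective_features(selected, feature_exclude, disable_feature_exclude=False):
        """Apply the global feature_exclude list to a component's selected features."""
        if disable_feature_exclude:
            # The component opted out of the global exclusion.
            return list(selected)
        excluded = set(feature_exclude)
        return [name for name in selected if name not in excluded]


    # log1 sets disable_feature_exclude, std1 does not.
    log1_features = effective_features(["a1", "a2"], ["a1"], disable_feature_exclude=True)
    std1_features = effective_features(["a1", "a2"], ["a1"])

Here the attribute ``a2`` is made up to show that non-excluded features are unaffected in both cases.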

Global Settings Sub-section
---------------------------
The global settings sub-section describes the process-common parameters ``keep_attributes`` and ``feature_exclude``::

    global_settings:
        keep_attributes:
            - a
            - b

        feature_exclude:
            - c

``keep_attributes``
*******************
The ``keep_attributes`` parameter lists the names of the attributes that are kept even after every component has run.

Attributes which are specified in this parameter can go through all components, be utilized by any component within the process, and get carried over to the succeeding components. These attributes can be utilized even though they are not generated by a preceding component.

``feature_exclude``
*******************
The ``feature_exclude`` parameter lists the names of the attributes that are not selected as features even if they match a feature selection condition.
