SPD (SAMPO Process Description) Specification

Overview

A SAMPO Process Description (SPD) describes directed relationships of components and the parameters of each component.

The SPD can be prepared in two formats: Python object or file.

Each SPD has two main sections, separated by a line of three or more '-' characters (---). The sections are written in the following order:
  1. Data Flow Section:

    Describing connections among components.

  2. Parameters Section:

    Describing parameters of each component and global settings.

Example:

dl1 -> std1 -> rg1

---

components:
    dl1:
        component: DataLoader

    std1:
        component: StandardizeFDComponent
        features: scale == 'real' or scale == 'integer'

    rg1:
        component: FABHMEBernGateLinearRgComponent
        features: scale == 'real' or scale == 'integer'
        target: name == 'target'

global_settings:
    keep_attributes:
        - a

    feature_exclude:
        - c
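
The two-section layout can be illustrated with a short Python sketch that splits SPD text at the separator line. This is not SAMPO's own parser: the function name split_spd and the separator regex are assumptions made for illustration only.

```python
import re

# A separator line holds three or more '-' characters and nothing else.
# A trailing '\r' is tolerated so that CRLF text also splits correctly.
SEPARATOR = re.compile(r"^-{3,}[ \r]*$", re.MULTILINE)


def split_spd(text):
    """Split SPD text into (data_flow, parameters) at the first separator."""
    parts = SEPARATOR.split(text, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("SPD text must contain a '---' separator line")
    data_flow, parameters = parts
    return data_flow.strip(), parameters.strip()


spd_text = """\
dl1 -> std1 -> rg1

---

components:
    dl1:
        component: DataLoader
"""

data_flow, parameters = split_spd(spd_text)
print(data_flow)    # the data flow section
print(parameters)   # the parameters section (YAML text)
```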

Format

The SPD can be prepared either as:

  1. a Python object (usable in SAMPO API)

  2. a text file (usable in SAMPO API and SAMPO Command)

SPD Object

The SPD object is an instance of SAMPO API ProcessDescription and can only be used via SAMPO API.

Generating ProcessDescription objects requires any of the following:

SPD File

The SPD file is a text file that follows the SPD base syntax only. SAMPO Command supports SPD files only.

An SPD file must fit the following constraints:

Property        Constraint
--------------  ---------------------------------------------
File name       <ASCII characters>.spd
Character code  Python 3: UTF-8 (ASCII + Japanese Characters)
                Python 2: ASCII
Newline code    CRLF (Recommended), LF (Not Recommended)


SPD Syntax

SPDs follow either of the two possible syntaxes:

  1. Base Syntax

    Provides complete SPD information.

  2. Template Syntax

    Provides a renderable template that allows dynamic value changes in SAMPO API.

Base Syntax

An SPD written in the base syntax contains complete SPD information, so it can be used directly in SAMPO. SPD files passed to SAMPO Command must follow the base syntax.

In SAMPO API gen_spd(), the params parameter is ignored whenever loading an SPD written in base syntax.

Base Syntax Example:

dl1 -> fd1 -> rg1
    -> fd2 -> rg1

---

components:
    dl1:
        component: DataLoader

    fd1:
        component: StandardizeFDComponent
        features: scale == 'real' or scale == 'integer'

    fd2:
        component: BinaryExpandFDComponent
        features: scale == 'nominal'
        disable_feature_exclude: True

    rg1:
        component: FABHMEBernGateLinearRgComponent
        features: scale == 'real' and re_match('^a', name)
        target: name == 'target'

global_settings:
    keep_attributes:
        - a
        - b

    feature_exclude:
        - c

Template Syntax

SAMPO API gen_spd() can create SPD objects from templates. Templates supplied as a file and templates supplied as a string object follow the same syntax. Templates must be Jinja2 compliant and must be rendered before they are used in SAMPO.

Template Example:

dl1 -> fd1 -> rg1

---

components:
    dl1:
        component: DataLoader

    fd1:
        component: StandardizeFDComponent
        features: scale == 'real' or scale == 'integer'

    rg1:
        component: FABHMEBernGateLinearRgComponent
        features: scale == 'real' and re_match('^a', name)
        target: name == {{ target_attr }}
        tree_depth: {{ tree_depth }}

global_settings:
    keep_attributes:
        - a
        - b

    feature_exclude:
        - c
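
To show what rendering does to a template, the sketch below renders an abbreviated copy of the rg1 entry with the Jinja2 library directly. It does not call SAMPO API. The values in params are hypothetical, and the assumption that target_attr must carry its own quotes follows from the base syntax examples, where the target is written as name == 'target'.

```python
from jinja2 import Template

# Abbreviated copy of the rg1 entry from the template example.
template_text = """\
rg1:
    component: FABHMEBernGateLinearRgComponent
    target: name == {{ target_attr }}
    tree_depth: {{ tree_depth }}
"""

# Hypothetical values. The quotes are part of target_attr so that the
# rendered expression compares name against a string literal.
params = {"target_attr": "'target'", "tree_depth": 3}

rendered = Template(template_text).render(**params)
print(rendered)
```

After rendering, no {{ ... }} placeholders remain and the text has the same shape as the base syntax.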

Data Flow Section

In the data flow section, connections between each component are described with component IDs and arrows (->):

dl1 -> fd1 -> rg1

Component ID

You can use only alphanumeric characters and underscores for the component ID. The first character must be an alphabetic character:

aaa_2 -> bbb_2    # OK

_a -> b           # NG: the first character is an underscore

3a -> b3          # NG: the first character is a number

あいう -> かきく  # NG: non-ASCII characters
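
The component ID rule can be expressed as a regular expression. The following is a minimal sketch: the function name is_valid_component_id is hypothetical and is not part of SAMPO.

```python
import re

# First character: an ASCII letter. Remaining characters: ASCII letters,
# digits, or underscores. An explicit class is used instead of \w because
# \w also matches non-ASCII characters in Python 3.
COMPONENT_ID = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_component_id(cid):
    """Return True if cid satisfies the component ID rule."""
    return COMPONENT_ID.fullmatch(cid) is not None


for cid in ["aaa_2", "_a", "3a", "あいう"]:
    print(cid, is_valid_component_id(cid))
```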

Spaces

There can be zero or more spaces between a component and an arrow. Tab characters are not allowed:

a->b->c      # OK

a  ->  b->c  # OK

a□->b       # NG: □ is a tab

Comments

In the data flow section, only single-line comments are available. Single-line comments begin with the hash character # and are terminated by the end of line:

a -> b -> c   # this is a comment

#a -> b -> c  # this line is commented out

Branching

When a line begins with spaces followed by an arrow that is aligned with an arrow in the previous line, it represents a branch from the component that precedes that arrow in the previous line:

a -> b -> c
  -> d -> c
       -> e -> c

The above configuration is the same as the following:

a -> b -> c
a -> d -> c
d -> e -> c

The above configuration creates the following relationship:

digraph foo {
    graph [rankdir = LR];
    "a" -> "b";
    "a" -> "d";
    "b" -> "c";
    "d" -> "c";
    "d" -> "e";
    "e" -> "c";
}
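
The branching rule can be sketched in Python as a function that expands the aligned-arrow notation into explicit parent/child pairs. This is an illustration of the rule as described here, not SAMPO's parser. The name expand_edges is hypothetical, and the sketch checks only arrow alignment, not every syntax constraint.

```python
import re

# A token is either an arrow or a component ID.
TOKEN = re.compile(r"->|[A-Za-z][A-Za-z0-9_]*")


def expand_edges(data_flow):
    """Return the (parent, child) pairs described by a data flow section."""
    edges = []
    prev_arrows = {}  # arrow column -> component preceding that arrow
    for raw in data_flow.splitlines():
        line = raw.split("#", 1)[0]  # drop a trailing comment
        arrows = {}
        current = None  # component to the left of the next arrow
        for match in TOKEN.finditer(line):
            token = match.group()
            if token == "->":
                if current is None:
                    # Leading arrow: branch from the component that
                    # precedes the aligned arrow in the previous line.
                    if match.start() not in prev_arrows:
                        raise ValueError("misaligned branch: %r" % raw)
                    current = prev_arrows[match.start()]
                arrows[match.start()] = current
            else:
                if current is not None:
                    edges.append((current, token))
                current = token
        if arrows:
            prev_arrows = arrows
    return edges


example = (
    "a -> b -> c\n"
    "  -> d -> c\n"
    "       -> e -> c\n"
)
print(expand_edges(example))
```

For the example above, the function yields the same six edges as the graph: a to b, b to c, a to d, d to c, d to e, and e to c.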

Duplicate and Join

Duplicate

If multiple components have the same parent (preceding component), every child component receives the same data.

Below is an example.

Data flow:

dl1 -> fd1
dl1 -> fd2

In the above data flow, both fd1 and fd2 receive the same data from dl1.


Join

If a component has multiple parents, the component receives the output of all parents joined by the sample indices (_sid).

Below is an example.

Data flow:

dl1 -> rg1
dl2 -> rg1

Output of dl1:

_sid  dl1[0]
----  ------
0     1
1     2
2     3
3     4

Output of dl2:

_sid  dl2[0]
----  ------
3     1
4     2
5     3
6     4


Input to rg1:

_sid  dl1[0]  dl2[0]
----  ------  ------
0     1       NaN
1     2       NaN
2     3       NaN
3     4       1
4     NaN     2
5     NaN     3
6     NaN     4
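
The join example is consistent with an outer join on _sid, where a value missing from one parent becomes NaN. The sketch below reproduces the example in plain Python under that assumption. The function join_by_sid and the dictionary layout are hypothetical and do not reflect SAMPO's internal data structures.

```python
import math

NAN = float("nan")


def join_by_sid(parent_outputs):
    """Outer-join parent outputs on _sid.

    Each parent output maps a column name to a {_sid: value} dictionary.
    The result maps each _sid to a {column name: value} row.
    """
    columns = {}
    for output in parent_outputs:
        columns.update(output)
    # Every _sid that appears in at least one parent gets a row.
    sids = sorted({sid for col in columns.values() for sid in col})
    return {
        sid: {name: col.get(sid, NAN) for name, col in columns.items()}
        for sid in sids
    }


dl1_output = {"dl1[0]": {0: 1, 1: 2, 2: 3, 3: 4}}
dl2_output = {"dl2[0]": {3: 1, 4: 2, 5: 3, 6: 4}}

rg1_input = join_by_sid([dl1_output, dl2_output])
for sid, row in rg1_input.items():
    print(sid, row)
```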


Parameters Section

The parameters section has the following two sub-sections:

  1. Components sub-section:

    Describing parameters of each component.

  2. Global settings sub-section:

    Describing process-common parameters.

The basic syntax of the parameters section is YAML.

YAML Version 1.2

https://yaml.org/spec/1.2/spec.html

Format:

components:
    <cid>:
        component: <component_class_name>
        <parameter_key>: <parameter_value>
    ...

global_settings:
    keep_attributes:
        - <attr_name>
        - ...
    feature_exclude:
        - <attr_name>
        - ...

Components Sub-section

The components sub-section describes the parameters of each component as shown below:

rg1:
    component: FABHMEBernGateLinearRgComponent
    features: scale == 'real' and re_match('^a', name)
    target: name == 'target'

features and target are special parameters to select feature/target attributes.

Component parameters other than features and target are described in the specification of each component.

Attribute Selection by features and target

Attribute selection by features and target parameters can be written in Python syntax as shown below:

features: scale == 'real' and not name == 'velocity'
target: name == 'target'

The following operators, matching functions, and defined variables are available in attribute selection.

Operators
  • Comparison operator

    • ==

    • !=

    • is

  • Logical operator

    • and

    • or

    • not

Matching Functions
re_match(pattern, variable)

Matches all attributes whose variable matches the pattern (regular expression):

features: re_match('^a', name)

In the above example, attributes whose name begins with ‘a’ are selected as features.


derived_from(attr_name)

Matches all attributes that are derived from the specified attribute name:

features: derived_from('a')

In the above example, attributes which are derived from the attribute ‘a’ are selected as features.

Warning

derived_from() is deprecated.


generated_by(cid)

Matches all attributes that are generated by the specified component:

features: generated_by('fd1')

In the above example, attributes which are generated by the component ‘fd1’ are selected as features.


all()

Matches all attributes:

features: all()

In the above example, all the attributes are selected as features.


empty()

Matches no attributes:

features: empty()

In the above example, no attributes are selected as features.

Defined Variables
name

Attribute name that is assigned by a component or a user.

scale

Scale of the attribute as follows:

  • integer

  • real

  • date

  • nominal
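
Because selection expressions use Python syntax, their behavior can be sketched by evaluating an expression once per attribute with name and scale bound to that attribute's values. The sketch below is illustrative only and is not how SAMPO is known to evaluate expressions. The select function and the sample attributes are hypothetical, and re_match is assumed to behave like Python's re.match.

```python
import re


def re_match(pattern, variable):
    # Assumption: behaves like re.match, which anchors at the string start.
    return re.match(pattern, variable) is not None


def select(expression, attributes):
    """Return the names of the attributes that satisfy the expression."""
    selected = []
    for attr in attributes:
        namespace = {
            "name": attr["name"],
            "scale": attr["scale"],
            "re_match": re_match,
        }
        # eval() is for illustration only; never use it on untrusted input.
        if eval(expression, {"__builtins__": {}}, namespace):
            selected.append(attr["name"])
    return selected


# Hypothetical attributes.
attributes = [
    {"name": "a1", "scale": "real"},
    {"name": "velocity", "scale": "real"},
    {"name": "target", "scale": "integer"},
    {"name": "city", "scale": "nominal"},
]

print(select("scale == 'real' and not name == 'velocity'", attributes))
print(select("scale == 'real' and re_match('^a', name)", attributes))
```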

disable_feature_exclude

When the global settings parameter feature_exclude is specified, the attributes listed in it are excluded from the selected features (see the Global Settings Sub-section).

However, when the component parameter disable_feature_exclude is True, the feature exclusion for the component will be disabled:

components:
    ...
    log1:
        ...
        features: re_match('a', name)
        disable_feature_exclude: True
    std1:
        ...
        features: re_match('a', name)

global_settings:
    ...
    feature_exclude:
        - a1

In the above example, the component log1 selects the attribute a1 as a feature, whereas the component std1 doesn’t because of feature_exclude.
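
The interaction between feature_exclude and disable_feature_exclude can be sketched as a filter applied after the selection condition. The function select_features and the attribute names a2 and b1 are hypothetical, and re_match is approximated with Python's re.match.

```python
import re


def select_features(names, condition, feature_exclude,
                    disable_feature_exclude=False):
    """Apply the selection condition, then the global exclusion list."""
    selected = [n for n in names if condition(n)]
    if disable_feature_exclude:
        # The component opts out of the global exclusion.
        return selected
    return [n for n in selected if n not in feature_exclude]


def name_matches_a(name):
    # Stands in for the expression re_match('a', name).
    return re.match("a", name) is not None


names = ["a1", "a2", "b1"]
feature_exclude = ["a1"]

log1_features = select_features(names, name_matches_a, feature_exclude,
                                disable_feature_exclude=True)
std1_features = select_features(names, name_matches_a, feature_exclude)

print(log1_features)  # a1 is kept for log1
print(std1_features)  # a1 is excluded for std1
```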

Global Settings Sub-section

The global settings sub-section describes the process-common parameters keep_attributes and feature_exclude:

global_settings:
    keep_attributes:
        - a
        - b

    feature_exclude:
        - c

keep_attributes

The keep_attributes parameter lists the names of attributes that are kept even after every component has run.

Attributes specified in this parameter pass through all components, can be used by any component within the process, and are carried over to the succeeding components. They can be used even if they are not generated by a preceding component.

feature_exclude

The feature_exclude parameter lists the names of attributes that will not be selected as features even if they match the feature selection condition.