{
 "cells": [
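  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Analysis Procedure\n",
    "\n",
    "## Table of Contents\n",
    "\n",
    "- [1. Introduction](#1.-Introduction)\n",
    "- [2. Data Preparation](#2.-Data-Preparation)\n",
    "    - [2.1. Adding a Sample ID (_sid) to the Analysis Data](#2.1.-Adding-a-Sample-ID-(_sid)-to-the-Analysis-Data)\n",
    "    - [2.2. Creating the ASD](#2.2.-Creating-the-ASD)\n",
    "    - [2.3. Splitting the Analysis Data into Learning and Prediction Sets](#2.3.-Splitting-the-Analysis-Data-into-Learning-and-Prediction-Sets)\n",
    "- [3. How to Run Learning](#3.-How-to-Run-Learning)\n",
    "    - [3.1. Defining the SPD](#3.1.-Defining-the-SPD)\n",
    "    - [3.2. Defining the Learning SRC](#3.2.-Defining-the-Learning-SRC)\n",
    "    - [3.3. Running Learning](#3.3.-Running-Learning)\n",
    "- [4. How to Run Prediction](#4.-How-to-Run-Prediction)\n",
    "    - [4.1. Defining the Prediction SRC](#4.1.-Defining-the-Prediction-SRC)\n",
    "    - [4.2. Running Prediction](#4.2.-Running-Prediction)"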
   ]
  },
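  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction\n",
    "\n",
    "By working through this chapter, you will be able to perform a simple data analysis using heterogeneous mixture learning.\n",
    "\n",
    "The specific goal is as follows:\n",
    "\n",
    "- **Understand how to run learning and prediction with SAMPO/FAB, and be able to run them on data you have prepared yourself.**\n",
    "\n",
    "This chapter walks through learning and prediction using automobile fuel consumption prediction as the example, covering how to create a model and how to check the prediction results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data Preparation\n",
    "\n",
    "This section describes the following data preparation steps:\n",
    "\n",
    "1. Add a sample ID (`_sid`) to the analysis data\n",
    "2. Create the ASD (attribute schema)\n",
    "3. Split the analysis data into a learning set and a prediction set\n",
    "\n",
    "This chapter uses the analysis data for automobile fuel consumption prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The data is the Auto MPG Data Set (https://archive.ics.uci.edu/ml/datasets/auto+mpg), an open dataset from UCI, with the `car_name` attribute removed."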
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The attributes and data types for automobile fuel consumption prediction are listed in the table below. The target variable is `mpg`; all other attributes are explanatory variables.\n",
    "\n",
    "| Attribute     | Data type | Description |\n",
    "|:------------- | :-------- | :---------- |\n",
    "| mpg           | REAL      | Distance traveled per gallon (miles per gallon) |\n",
    "| cylinders     | INTEGER   | Number of engine cylinders |\n",
    "| displacement  | REAL      | Engine displacement |\n",
    "| horsepower    | INTEGER   | Horsepower |\n",
    "| weight        | REAL      | Weight |\n",
    "| acceleration  | REAL      | Acceleration |\n",
    "| model_year    | INTEGER   | Model year (last two digits of the year) |\n",
    "| origin        | NOMINAL   | Country or region of origin (1: USA, 2: Europe, 3: Japan) |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1. Adding a Sample ID (_sid) to the Analysis Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **sample ID (`_sid`)** is an integer attribute that SAMPO/FAB uses to uniquely identify each sample."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run an analysis with SAMPO/FAB, the analysis data must contain the `_sid` attribute.  \n",
    "Before running an analysis, you therefore need to do the following:\n",
    "\n",
    " 1. Check whether the analysis data has an attribute named `_sid`\n",
    " 2. If it does not, add `_sid` to the analysis data\n",
    "\n",
    "The analysis data for automobile fuel consumption prediction is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mpg</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>displacement</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>weight</th>\n",
       "      <th>acceleration</th>\n",
       "      <th>model_year</th>\n",
       "      <th>origin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>307.0</td>\n",
       "      <td>130</td>\n",
       "      <td>3504</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15.0</td>\n",
       "      <td>8</td>\n",
       "      <td>350.0</td>\n",
       "      <td>165</td>\n",
       "      <td>3693</td>\n",
       "      <td>11.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>318.0</td>\n",
       "      <td>150</td>\n",
       "      <td>3436</td>\n",
       "      <td>11.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16.0</td>\n",
       "      <td>8</td>\n",
       "      <td>304.0</td>\n",
       "      <td>150</td>\n",
       "      <td>3433</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>17.0</td>\n",
       "      <td>8</td>\n",
       "      <td>302.0</td>\n",
       "      <td>140</td>\n",
       "      <td>3449</td>\n",
       "      <td>10.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    mpg  cylinders  displacement  horsepower  weight  acceleration  \\\n",
       "0  18.0          8         307.0         130    3504          12.0   \n",
       "1  15.0          8         350.0         165    3693          11.5   \n",
       "2  18.0          8         318.0         150    3436          11.0   \n",
       "3  16.0          8         304.0         150    3433          12.0   \n",
       "4  17.0          8         302.0         140    3449          10.5   \n",
       "\n",
       "   model_year  origin  \n",
       "0          70       1  \n",
       "1          70       1  \n",
       "2          70       1  \n",
       "3          70       1  \n",
       "4          70       1  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the analysis data, treating '?' entries as missing values\n",
    "input_data = pd.read_csv('./data/auto-mpg.csv', na_values='?')\n",
    "\n",
    "input_data.head()"
   ]
  },
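  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table above confirms that the analysis data does not contain `_sid`.\n",
    "\n",
    "In a SAMPO/FAB analysis, a sample that contains even one missing value is not used for learning or prediction.\n",
    "We therefore delete the samples that contain missing values and add a sequentially numbered `_sid` as the first column of the analysis data."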
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_sid</th>\n",
       "      <th>mpg</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>displacement</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>weight</th>\n",
       "      <th>acceleration</th>\n",
       "      <th>model_year</th>\n",
       "      <th>origin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>307.0</td>\n",
       "      <td>130</td>\n",
       "      <td>3504</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>15.0</td>\n",
       "      <td>8</td>\n",
       "      <td>350.0</td>\n",
       "      <td>165</td>\n",
       "      <td>3693</td>\n",
       "      <td>11.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>318.0</td>\n",
       "      <td>150</td>\n",
       "      <td>3436</td>\n",
       "      <td>11.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16.0</td>\n",
       "      <td>8</td>\n",
       "      <td>304.0</td>\n",
       "      <td>150</td>\n",
       "      <td>3433</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>17.0</td>\n",
       "      <td>8</td>\n",
       "      <td>302.0</td>\n",
       "      <td>140</td>\n",
       "      <td>3449</td>\n",
       "      <td>10.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   _sid   mpg  cylinders  displacement  horsepower  weight  acceleration  \\\n",
       "0     0  18.0          8         307.0         130    3504          12.0   \n",
       "1     1  15.0          8         350.0         165    3693          11.5   \n",
       "2     2  18.0          8         318.0         150    3436          11.0   \n",
       "3     3  16.0          8         304.0         150    3433          12.0   \n",
       "4     4  17.0          8         302.0         140    3449          10.5   \n",
       "\n",
       "   model_year  origin  \n",
       "0          70       1  \n",
       "1          70       1  \n",
       "2          70       1  \n",
       "3          70       1  \n",
       "4          70       1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
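   ],
   "source": [
    "# Drop samples that contain missing values\n",
    "input_data.dropna(inplace=True)\n",
    "\n",
    "# Insert a sequential _sid (0, 1, 2, ...) as the first column\n",
    "input_data.insert(0, '_sid', list(range(input_data.shape[0])))\n",
    "\n",
    "input_data.head()"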
   ]
  },
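  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The check-then-add procedure described in this subsection can also be sketched as a small reusable helper. This is only an illustrative sketch that uses pandas alone: the function name `ensure_sid` and the toy DataFrame are assumptions made for this example and are not part of SAMPO/FAB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def ensure_sid(df):\n",
    "    # Hypothetical helper: returns a copy without missing rows and with a unique _sid\n",
    "    out = df.dropna().copy()\n",
    "    if '_sid' not in out.columns:\n",
    "        # Sequential IDs 0..n-1 as the first column\n",
    "        out.insert(0, '_sid', list(range(len(out))))\n",
    "    elif not out['_sid'].is_unique:\n",
    "        raise ValueError('_sid must uniquely identify each sample')\n",
    "    return out\n",
    "\n",
    "\n",
    "# Toy data (hypothetical values) with one missing horsepower entry\n",
    "toy = pd.DataFrame({'mpg': [18.0, 15.0, 25.0], 'horsepower': [130.0, None, 95.0]})\n",
    "toy_with_sid = ensure_sid(toy)\n",
    "toy_with_sid"
   ]
  },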
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2. Creating the ASD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section creates the ASD, which SAMPO/FAB reads to obtain the attribute schema of the analysis data.  \n",
    "\n",
    "Run the code below to create an ASD from the pandas DataFrame that holds the analysis data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scale</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>_sid</th>\n",
       "      <td>INTEGER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mpg</th>\n",
       "      <td>REAL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cylinders</th>\n",
       "      <td>INTEGER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>displacement</th>\n",
       "      <td>REAL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>horsepower</th>\n",
       "      <td>INTEGER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>weight</th>\n",
       "      <td>INTEGER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>acceleration</th>\n",
       "      <td>REAL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>model_year</th>\n",
       "      <td>INTEGER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>origin</th>\n",
       "      <td>INTEGER</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                scale\n",
       "_sid          INTEGER\n",
       "mpg              REAL\n",
       "cylinders     INTEGER\n",
       "displacement     REAL\n",
       "horsepower    INTEGER\n",
       "weight        INTEGER\n",
       "acceleration     REAL\n",
       "model_year    INTEGER\n",
       "origin        INTEGER"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sampotools.api import gen_asd_from_pandas_df\n",
    "\n",
    "asd = gen_asd_from_pandas_df(input_data)\n",
    "pd.DataFrame(asd).T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check whether the types in the ASD generated above match the attribute data types listed in the table in Section 2 (Data Preparation)."
   ]
  },
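  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This comparison can also be done programmatically. The sketch below is an illustration under stated assumptions: `generated_scales` is a plain dict copied by hand from the ASD output above (it stands in for the `asd` object), `expected_scales` is copied from the attribute table in Section 2, and the helper name `find_scale_mismatches` is hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scales reported in the generated ASD output above\n",
    "generated_scales = {\n",
    "    '_sid': 'INTEGER', 'mpg': 'REAL', 'cylinders': 'INTEGER',\n",
    "    'displacement': 'REAL', 'horsepower': 'INTEGER', 'weight': 'INTEGER',\n",
    "    'acceleration': 'REAL', 'model_year': 'INTEGER', 'origin': 'INTEGER',\n",
    "}\n",
    "\n",
    "# Data types listed in the attribute table (_sid is not part of that table)\n",
    "expected_scales = {\n",
    "    'mpg': 'REAL', 'cylinders': 'INTEGER', 'displacement': 'REAL',\n",
    "    'horsepower': 'INTEGER', 'weight': 'REAL', 'acceleration': 'REAL',\n",
    "    'model_year': 'INTEGER', 'origin': 'NOMINAL',\n",
    "}\n",
    "\n",
    "\n",
    "def find_scale_mismatches(generated, expected):\n",
    "    # Map attribute name -> (generated scale, expected scale) where they differ\n",
    "    return {\n",
    "        name: (generated.get(name), scale)\n",
    "        for name, scale in expected.items()\n",
    "        if generated.get(name) != scale\n",
    "    }\n",
    "\n",
    "\n",
    "mismatches = find_scale_mismatches(generated_scales, expected_scales)\n",
    "mismatches"
   ]
  },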
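  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The types of `weight` and `origin` do not match.\n",
    "\n",
    "Run the code below to correct the types of `weight` and `origin`.\n",
    "After the correction, the ASD gains a `domain` column that lists the categories of `origin`."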
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scale</th>\n",
       "      <th>domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>_sid</th>\n",
       "      <td>INTEGER</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mpg</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cylinders</th>\n",
       "      <td>INTEGER</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>displacement</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>horsepower</th>\n",
       "      <td>INTEGER</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>weight</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>acceleration</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>model_year</th>\n",
       "      <td>INTEGER</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>origin</th>\n",
       "      <td>NOMINAL</td>\n",
       "      <td>[1, 2, 3]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                scale     domain\n",
       "_sid          INTEGER        NaN\n",
       "mpg              REAL        NaN\n",
       "cylinders     INTEGER        NaN\n",
       "displacement     REAL        NaN\n",
       "horsepower    INTEGER        NaN\n",
       "weight           REAL        NaN\n",
       "acceleration     REAL        NaN\n",
       "model_year    INTEGER        NaN\n",
       "origin        NOMINAL  [1, 2, 3]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
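   ],
   "source": [
    "# weight is a real-valued attribute, not an integer\n",
    "asd['weight'] = {'scale': 'REAL'}\n",
    "# origin is categorical; a NOMINAL attribute lists its categories in 'domain'\n",
    "asd['origin'] = {'scale': 'NOMINAL', 'domain': ['1', '2', '3']}\n",
    "pd.DataFrame(asd).T[['scale', 'domain']]"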
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below writes the corrected ASD to a file in ASD format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sampotools.api import save_asd\n",
    "\n",
    "save_asd(asd_object=asd, file_path='./data/auto-mpg.asd')"
   ]
  },
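  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3. Splitting the Analysis Data into Learning and Prediction Sets\n",
    "\n",
    "Split the analysis data into a learning set and a prediction set, and write each to a CSV file.\n",
    "\n",
    "The code below takes the first 90% of the samples (in row order) for learning and the remaining 10% for prediction, and writes each set to its own CSV file.\n"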
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "n_all = len(input_data)\n",
    "n_predict = n_all // 10  # 10% of the samples, rounded down\n",
    "n_learn = n_all - n_predict\n",
    "\n",
    "# Split by row position: the first n_learn rows for learning, the rest for prediction\n",
    "learn_data = input_data.iloc[:n_learn]\n",
    "predict_data = input_data.iloc[n_learn:]\n",
    "\n",
    "learn_data.to_csv('./data/auto-mpg_learn.csv', index=False)\n",
    "predict_data.to_csv('./data/auto-mpg_predict.csv', index=False)"
   ]
  },
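  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The split in this subsection is positional, so the prediction set consists of the last rows of the file. If the rows follow some order (for example, by model year), a shuffled split may be preferable. The following is a minimal sketch using pandas only; the helper name `shuffled_split`, its parameters, and the toy DataFrame are assumptions for illustration and are not part of SAMPO/FAB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def shuffled_split(df, predict_ratio=0.1, seed=0):\n",
    "    # Shuffle the rows reproducibly, then cut off the prediction share\n",
    "    shuffled = df.sample(frac=1.0, random_state=seed)\n",
    "    n_predict = int(len(shuffled) * predict_ratio)\n",
    "    n_learn = len(shuffled) - n_predict\n",
    "    return shuffled.iloc[:n_learn], shuffled.iloc[n_learn:]\n",
    "\n",
    "\n",
    "# Toy data (hypothetical values) with 20 samples\n",
    "toy = pd.DataFrame({'_sid': list(range(20)), 'mpg': [20.0 + i for i in range(20)]})\n",
    "toy_learn, toy_predict = shuffled_split(toy)\n",
    "len(toy_learn), len(toy_predict)"
   ]
  },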
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## 3. How to Run Learning\n",
    "\n",
    "This section shows how to create the SPD and the learning SRC, and how to run learning.\n",
    "\n",
    "### 3.1. Defining the SPD\n",
    "\n",
    "This subsection explains how to define the SPD in two steps.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 1: Describing the Execution Order and Components\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example SPD:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "spd_content = '''\n",
    "dl -> rg\n",
    "\n",
    "---\n",
    "\n",
    "components:\n",
    "    dl:\n",
    "        component: DataLoader\n",
    "\n",
    "    rg:\n",
    "        component: FABHMEBernGateLinearRgComponent\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An SPD consists of a data flow section and a parameter section, divided by a separator.\n",
    "\n",
    "  - Data flow section: describes the analysis process by connecting component IDs such as `dl` and `rg` with `->`\n",
    "\n",
    "  - Parameter section: lists the component IDs as subsections of `components`\n",
    "    - The subsection for each component ID gives the name of the component to use\n",
    "\n",
    "  - Separator: written as `---`\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Notes**\n",
    "  - The separator cannot be shorter than `---`, but it can be longer.\n",
    "  \n",
    "  - The parameter section is written in YAML format.\n",
    "Indentation must therefore always use spaces: four (or two) per level.\n",
    "When writing a component parameter, always put one space after the `:` (colon) that follows the parameter name, and then write the value.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### Step 2: Describing the Component Parameters\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example SPD:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "spd_content = '''\n",
    "dl -> rg\n",
    "\n",
    "---\n",
    "\n",
    "components:\n",
    "    dl:\n",
    "        component: DataLoader\n",
    "\n",
    "    rg:\n",
    "        component: FABHMEBernGateLinearRgComponent\n",
    "        features: name != 'mpg' and scale != 'nominal'\n",
    "        target: name == 'mpg'\n",
    "        standardize_target: True\n",
    "        tree_depth: 3\n",
    "'''"
   ]
  },
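  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we extend the SPD example from Step 1 by adding the component parameters of `rg` under the `components` subsection.\n",
    "\n",
    "| Component parameter | What it specifies |\n",
    "|:------------------- | :---------------- |\n",
    "| features            | Attribute selection condition for the explanatory variables |\n",
    "| target              | Attribute selection condition for the target variable |\n",
    "| standardize_target  | Whether to standardize the target variable |\n",
    "| tree_depth          | Initial depth of the gate tree |\n",
    "\n",
    "Remarks:\n",
    " - The data loader component selected for `dl` has no component parameters.\n",
    " - The predictor component selected for `rg` returns an error when it is given a NOMINAL attribute. The `features` condition is therefore written so that NOMINAL attributes are not selected."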
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For details on the SPD, see `SPD (SAMPO Process Description) Specification` in the `Analytics Reference`.\n",
    "\n",
    "For the list of components that can be written in an SPD and the parameters of each component, see `Component Specification` in the `Analytics Reference`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2. Defining the Learning SRC\n",
    "\n",
    "Example learning SRC:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn_src_templ = '''\n",
    "fabhmerg_learn:\n",
    "    type: learn\n",
    "    data_sources:\n",
    "        dl:\n",
    "            path: ./data/auto-mpg_learn.csv\n",
    "            attr_schema: ./data/auto-mpg.asd\n",
    "\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the learning SRC, `type` and `data_sources` are written as subsections of the process name, `fabhmerg_learn`. Because this is a learning SRC, `type` is set to `learn`.  \n",
    "\n",
    "Under `data_sources`, specify the locations of the analysis data and the ASD for each `DataLoader` component ID defined in the SPD.\n",
    "The example above covers the case where the analysis data is a CSV file; for examples that use a database or a pandas DataFrame, see the SRC specification.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For details on the SRC, see `SRC (SAMPO Run Configuration) Specification` in the `Analytics Reference`."
   ]
  },
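  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3. Running Learning\n",
    "\n",
    "This subsection shows how to run learning using the SPD and the learning SRC."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running learning requires a process store, so we create one first.\n",
    "\n",
    "A process store is a repository that holds executed learning (or prediction) processes.\n",
    "You can specify the path of the process store, as with `pstore_url` below."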
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from sampo.api import process_store\n",
    "\n",
    "pstore_url = './pstore'\n",
    "# Create the process store only if it does not exist yet\n",
    "if not os.path.isdir(pstore_url):\n",
    "    process_store.create(pstore_url)"
   ]
  },
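  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below generates the SPD with the `gen_spd()` function and the learning SRC with the `gen_src()` function so that SAMPO/FAB can read them.\n",
    "\n",
    "Learning is then run with the generated SPD and learning SRC, and the executed process is stored in the process store."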
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "fabhmerg_learn.f9a76202-07c4-404f-a5ed-b74c1de1e989"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sampo.api import gen_spd, gen_src, process_runner\n",
    "\n",
    "spd = gen_spd(template=spd_content)\n",
    "learn_src = gen_src(template=learn_src_templ)\n",
    "process_runner.run(src=learn_src, spd=spd, pstore_url=pstore_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For details on the `gen_spd()` and `gen_src()` functions, see `API Specification` in the `Analytics Reference`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. How to Run Prediction\n",
    "\n",
    "This section shows how to create the prediction SRC and run prediction.\n",
    "\n",
    "### 4.1. Defining the Prediction SRC\n",
    "\n",
    "Example prediction SRC:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "predict_src_templ = '''\n",
    "fabhmerg_predict:\n",
    "    type: predict\n",
    "    data_sources:\n",
    "        dl:\n",
    "            path: ./data/auto-mpg_predict.csv\n",
    "            attr_schema: ./data/auto-mpg.asd\n",
    "\n",
    "    model_process: fabhmerg_learn\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the prediction SRC, `type`, `data_sources`, and `model_process` are written as subsections of the process name, `fabhmerg_predict`. Because this is a prediction SRC, `type` is set to `predict`.\n",
    "\n",
    "Under `data_sources`, specify the locations of the prediction data and the ASD for each `DataLoader` component ID defined in the SPD.\n",
    "\n",
    "`model_process` specifies the process name of the executed process that contains the model to use for prediction.  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 4.2. Running Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This subsection runs the prediction analysis process using the executed process and the prediction SRC."
   ]
  },
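  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before running prediction, generate the prediction SRC with the `gen_src()` function so that SAMPO/FAB can read it.\n",
    "\n",
    "Prediction is then run with the prediction SRC, and the executed process is stored in the process store.\n",
    "No SPD needs to be specified for prediction, because the SPD contained in the executed process named by `model_process` in the SRC is used.\n"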
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "fabhmerg_predict.b444e9e2-8ec6-49b0-b3b9-b39892e7b85d"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict_src = gen_src(template=predict_src_templ)\n",
    "process_runner.run(src=predict_src, pstore_url=pstore_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Open the executed process that contains the prediction results from the process store, and display the actual and predicted values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rg_actual</th>\n",
       "      <th>rg_predict</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_sid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>353</th>\n",
       "      <td>31.6</td>\n",
       "      <td>29.131691</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>354</th>\n",
       "      <td>28.1</td>\n",
       "      <td>19.736074</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>355</th>\n",
       "      <td>30.7</td>\n",
       "      <td>20.046524</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>356</th>\n",
       "      <td>25.4</td>\n",
       "      <td>26.353198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>357</th>\n",
       "      <td>24.2</td>\n",
       "      <td>19.152903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>358</th>\n",
       "      <td>22.4</td>\n",
       "      <td>18.264947</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>359</th>\n",
       "      <td>26.6</td>\n",
       "      <td>17.650778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>360</th>\n",
       "      <td>20.2</td>\n",
       "      <td>19.896787</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>361</th>\n",
       "      <td>17.6</td>\n",
       "      <td>18.976112</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>362</th>\n",
       "      <td>28.0</td>\n",
       "      <td>30.362683</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      rg_actual  rg_predict\n",
       "_sid                       \n",
       "353        31.6   29.131691\n",
       "354        28.1   19.736074\n",
       "355        30.7   20.046524\n",
       "356        25.4   26.353198\n",
       "357        24.2   19.152903\n",
       "358        22.4   18.264947\n",
       "359        26.6   17.650778\n",
       "360        20.2   19.896787\n",
       "361        17.6   18.976112\n",
       "362        28.0   30.362683"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sampo.api import process_store\n",
    "\n",
    "with process_store.open_process(pstore_url, 'fabhmerg_predict') as prl:\n",
    "    df = prl.load_comp_output('rg')\n",
    "\n",
    "df[['rg_actual', 'rg_predict']].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the table above, `rg_actual` is the actual value and `rg_predict` is the predicted value."
   ]
  },
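  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To summarize how close the predictions are to the actual values, error metrics can be computed from these two columns. The sketch below is illustrative and uses pandas only: it re-enters the ten displayed rows by hand as a small DataFrame so that it runs on its own (in this notebook the same helper could be applied to `df`), and the helper name `error_summary` is hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# The ten rows displayed above, re-entered by hand for a self-contained example\n",
    "result = pd.DataFrame({\n",
    "    'rg_actual': [31.6, 28.1, 30.7, 25.4, 24.2, 22.4, 26.6, 20.2, 17.6, 28.0],\n",
    "    'rg_predict': [29.131691, 19.736074, 20.046524, 26.353198, 19.152903,\n",
    "                   18.264947, 17.650778, 19.896787, 18.976112, 30.362683],\n",
    "}, index=pd.Index(list(range(353, 363)), name='_sid'))\n",
    "\n",
    "\n",
    "def error_summary(frame):\n",
    "    # Mean absolute error and root mean squared error of the predictions\n",
    "    err = frame['rg_predict'] - frame['rg_actual']\n",
    "    return {'mae': err.abs().mean(), 'rmse': (err ** 2).mean() ** 0.5}\n",
    "\n",
    "\n",
    "summary = error_summary(result)\n",
    "summary"
   ]
  },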
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Back to top](#top)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
