{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SAMPO/FAB Operation and Inputs/Outputs\n",
    "\n",
    "## Table of Contents\n",
    "\n",
    "- [1. Introduction](#1.-Introduction)\n",
    "- [2. How SAMPO/FAB Works](#2.-How-SAMPO/FAB-Works)\n",
    "- [3. Inputs](#3.-Inputs)\n",
    "    - [3.1. Data to Analyze](#3.1.-Data-to-Analyze)\n",
    "    - [3.2. ASD (Attribute Schema)](#3.2.-ASD-(Attribute-Schema))\n",
    "    - [3.3. SPD (SAMPO Process Description)](#3.3.-SPD-(SAMPO-Process-Description))\n",
    "    - [3.4. SRC (SAMPO Run Configuration)](#3.4.-SRC-(SAMPO-Run-Configuration))\n",
    "- [4. Outputs](#4.-Outputs)\n",
    "    - [4.1. Executed Processes](#4.1.-Executed-Processes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction\n",
    "\n",
    "This chapter gives an overview of SAMPO/FAB, covering how it works and what it takes as input and produces as output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
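    "## 2. How SAMPO/FAB Works\n",
    "\n",
    "Given an SPD, which describes the design of the analysis process, and an SRC, which holds the run configuration, **SAMPO/FAB** runs learning (or prediction).\n",
    "\n",
    "It stores the result of each learning or prediction run as an executed process in a repository called the process store."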
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Overview of how SAMPO/FAB is run](../_static/sampo_fab_overview.PNG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1. Data to Analyze\n",
    "\n",
    "SAMPO/FAB works with data in tabular (matrix) form.\n",
    "It can read this data from a CSV file, a PostgreSQL table, or a pandas DataFrame.\n",
    "\n",
    "The data below is used for discriminant analysis of iris species.\n",
    "It has five attributes: the length and width of the sepals and petals, plus the iris species."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data comes from the Iris Data Set (https://archive.ics.uci.edu/ml/datasets/iris), an open dataset published by UCI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length in cm</th>\n",
       "      <th>sepal width in cm</th>\n",
       "      <th>petal length in cm</th>\n",
       "      <th>petal width in cm</th>\n",
       "      <th>kind of iris</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "      <td>Iris-setosa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal length in cm  sepal width in cm  petal length in cm  \\\n",
       "0                 5.1                3.5                 1.4   \n",
       "1                 4.9                3.0                 1.4   \n",
       "2                 4.7                3.2                 1.3   \n",
       "3                 4.6                3.1                 1.5   \n",
       "4                 5.0                3.6                 1.4   \n",
       "\n",
       "   petal width in cm kind of iris  \n",
       "0                0.2  Iris-setosa  \n",
       "1                0.2  Iris-setosa  \n",
       "2                0.2  Iris-setosa  \n",
       "3                0.2  Iris-setosa  \n",
       "4                0.2  Iris-setosa  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('./data/iris.csv', na_values='?')\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
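    "### 3.2. ASD (Attribute Schema)\n",
    "\n",
    "An **ASD (Attribute Schema Description)** defines the name and data type of each attribute in the data to analyze.\n",
    "The data types are INTEGER (integer), REAL (real number), NOMINAL (categorical), and DATE (date).\n",
    "\n",
    "As shown below, when the data type given by `scale` is NOMINAL, `domain` also lists the category values."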
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scale</th>\n",
       "      <th>domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>sepal length in cm</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sepal width in cm</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>petal length in cm</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>petal width in cm</th>\n",
       "      <td>REAL</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>kind of iris</th>\n",
       "      <td>NOMINAL</td>\n",
       "      <td>[Iris-versicolor, Iris-virginica, Iris-setosa]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      scale                                          domain\n",
       "sepal length in cm     REAL                                             NaN\n",
       "sepal width in cm      REAL                                             NaN\n",
       "petal length in cm     REAL                                             NaN\n",
       "petal width in cm      REAL                                             NaN\n",
       "kind of iris        NOMINAL  [Iris-versicolor, Iris-virginica, Iris-setosa]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sampotools.api import gen_asd_from_pandas_df\n",
    "\n",
    "asd = gen_asd_from_pandas_df(df)\n",
    "pd.DataFrame(asd).T[['scale', 'domain']]"
   ]
  },
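  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the `scale` and `domain` columns above more concrete, here is a minimal standard-library sketch of how such a schema could be inferred from raw CSV values. It is an illustration only and not how `gen_asd_from_pandas_df` is implemented: the helpers `infer_scale` and `sketch_schema` are hypothetical names, the sample rows are illustrative, and date detection is left out.\n",
    "\n",
    "```python\n",
    "import csv\n",
    "\n",
    "\n",
    "def infer_scale(values):\n",
    "    # Hypothetical helper: try the strictest numeric type first, then fall back.\n",
    "    for scale, cast in (('INTEGER', int), ('REAL', float)):\n",
    "        try:\n",
    "            for value in values:\n",
    "                cast(value)\n",
    "            return scale\n",
    "        except ValueError:\n",
    "            continue\n",
    "    # Anything that is not numeric is treated as categorical in this sketch.\n",
    "    return 'NOMINAL'\n",
    "\n",
    "\n",
    "def sketch_schema(lines):\n",
    "    # Hypothetical helper: build {attribute: {'scale': ..., 'domain': ...}}\n",
    "    # from CSV lines whose first line is the header.\n",
    "    reader = csv.DictReader(lines)\n",
    "    rows = list(reader)\n",
    "    schema = {}\n",
    "    for name in reader.fieldnames:\n",
    "        column = [row[name] for row in rows]\n",
    "        scale = infer_scale(column)\n",
    "        # Only NOMINAL attributes carry a list of category values.\n",
    "        domain = sorted(set(column)) if scale == 'NOMINAL' else None\n",
    "        schema[name] = {'scale': scale, 'domain': domain}\n",
    "    return schema\n",
    "\n",
    "\n",
    "# Illustrative rows with two of the five iris attributes.\n",
    "sample_lines = [\n",
    "    'sepal length in cm,kind of iris',\n",
    "    '5.1,Iris-setosa',\n",
    "    '7.0,Iris-versicolor',\n",
    "    '6.3,Iris-virginica',\n",
    "]\n",
    "\n",
    "schema = sketch_schema(sample_lines)\n",
    "for name, entry in schema.items():\n",
    "    print(name, entry)\n",
    "```\n",
    "\n",
    "In this sketch only the NOMINAL attribute gets a `domain`, mirroring the table above, where the REAL attributes have no category list."
   ]
  },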
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
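    "### 3.3. SPD (SAMPO Process Description)\n",
    "\n",
    "An **SPD** describes the analysis process that performs learning (or prediction) by combining components for feature generation and predictive analysis. Concretely, it lists the components to use and the parameters of each component.  \n",
    "The next chapter covers the SPD in detail, including how to write one.\n",
    "\n",
    "There are three main kinds of components:\n",
    " - Data loader components: read data from sources such as CSV files or PostgreSQL\n",
    " - Feature generation components (FD): apply a specific operation to attribute data to generate new attributes\n",
    " - Predictor components: build a model through learning, or make predictions with a model\n",
    "\n",
    "\n",
    "Example SPD: the data loader component `DataLoader` reads the data, and the predictor component `FABHMEBernGateLinearRgComponent` performs regression analysis.\n",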
    "\n",
    "```python\n",
    "dl -> rg\n",
    "\n",
    "---\n",
    "\n",
    "components:\n",
    "    dl:\n",
    "        component: DataLoader\n",
    "\n",
    "    rg:\n",
    "        component: FABHMEBernGateLinearRgComponent\n",
    "        features: name != 'target'\n",
    "        target: name == 'target'\n",
    "        tree_depth: 3\n",
    "```\n",
    "\n",
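    "Note:\n",
    "  - Some components are used for automated feature design, such as the following.\n",
    "\n",
    "    - Feature learning components (FL): automatically estimate the parameters for feature generation, then generate the features\n",
    "    - Feature selection components (FS): automatically estimate which attributes of the input data should be used, then select them"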
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
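    "### 3.4. SRC (SAMPO Run Configuration)\n",
    "\n",
    "An **SRC** describes the run configuration for learning or prediction.\n",
    "It contains a process name, such as `learn_example`, and a `type` that indicates the process type.\n",
    "Under `data_sources`, `path` gives the location of the data to analyze and `attr_schema` gives the location of the ASD.  \n",
    "The next chapter covers the SRC in detail, including how to write one.\n",
    "\n",
    "Example SRC for learning: run learning on training data in CSV format.\n",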
    "\n",
    "```python\n",
    "learn_example:\n",
    "    type: learn\n",
    "    data_sources:\n",
    "        dl:\n",
    "            path: data/fabhmerg_learn.csv\n",
    "            attr_schema: data/fabhmerg.asd\n",
    "```\n",
    "\n",
    "Example SRC for prediction: run prediction using prediction data in CSV format and the model from the executed process named `learn_example`.\n",
    "\n",
    "```python\n",
    "predict_example:\n",
    "    type: predict\n",
    "    data_sources:\n",
    "        dl:\n",
    "            path: data/fabhmerg_predict.csv\n",
    "            attr_schema: data/fabhmerg.asd\n",
    "\n",
    "    model_process: learn_example\n",
    "```\n",
    "To run prediction, set `model_process` to the name of the executed process that contains the model you want to use, as shown above."
   ]
  },
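  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The link between a prediction run and an earlier learning run can be sketched in plain Python. The dictionary below simply mirrors the two SRC examples above. The function `missing_models` and the set of executed process names are hypothetical stand-ins for illustration, not part of the SAMPO/FAB API or of the real process store.\n",
    "\n",
    "```python\n",
    "# The two SRC examples above, written as a plain Python dict with the same keys.\n",
    "src = {\n",
    "    'learn_example': {\n",
    "        'type': 'learn',\n",
    "        'data_sources': {\n",
    "            'dl': {\n",
    "                'path': 'data/fabhmerg_learn.csv',\n",
    "                'attr_schema': 'data/fabhmerg.asd',\n",
    "            },\n",
    "        },\n",
    "    },\n",
    "    'predict_example': {\n",
    "        'type': 'predict',\n",
    "        'data_sources': {\n",
    "            'dl': {\n",
    "                'path': 'data/fabhmerg_predict.csv',\n",
    "                'attr_schema': 'data/fabhmerg.asd',\n",
    "            },\n",
    "        },\n",
    "        'model_process': 'learn_example',\n",
    "    },\n",
    "}\n",
    "\n",
    "\n",
    "def missing_models(src, executed_processes):\n",
    "    # Hypothetical check: list the predict processes whose model_process\n",
    "    # does not name an already executed process.\n",
    "    missing = []\n",
    "    for name, conf in src.items():\n",
    "        if conf['type'] != 'predict':\n",
    "            continue\n",
    "        if conf.get('model_process') not in executed_processes:\n",
    "            missing.append(name)\n",
    "    return missing\n",
    "\n",
    "\n",
    "# Before learning has run, the prediction has no model to use.\n",
    "print(missing_models(src, set()))\n",
    "# Once learn_example has been executed, the reference resolves.\n",
    "print(missing_models(src, {'learn_example'}))\n",
    "```\n",
    "\n",
    "The point of the sketch is the ordering: the learning process has to exist as an executed process before a prediction process can refer to its model."
   ]
  },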
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
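    "### 4.1. Executed Processes\n",
    "\n",
    "An **executed process** is an analysis process whose learning (or prediction) run has completed.\n",
    "It holds the model after a learning run, or the prediction results after a prediction run."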
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "[Back to top](#top)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
