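Best model selection
====================
This section shows how best model selection is carried out in SAMPO/FAB and in sklearn-fab, and how the scripts differ.
Best model selection builds multiple models, for example by grid search, and picks the one with the best accuracy.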

**SAMPO/FAB**

* Multiple models are created with the SAMPO API ``session_run`` function and sklearn.model_selection.ParameterGrid_.
* The best model is selected manually by comparing the accuracy of each model.
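
As a rough illustration of how ``ParameterGrid`` expands a dictionary of candidate values into individual parameter sets, consider the following sketch. It uses only scikit-learn, and the parameter names and values mirror the ones used on this page. The position of each parameter set in the resulting list is what the SAMPO/FAB script uses as the grid ID.

.. code-block:: python

    from sklearn.model_selection import ParameterGrid

    # Candidate values for each hyperparameter (same values as used on this page).
    param_combination = {
        'tree_depth': [3, 4, 5],
        'shrink_threshold': [3.0, 5.0],
    }

    # ParameterGrid yields one dict per combination. Keys are processed in
    # sorted order, so 'tree_depth' varies fastest here.
    params = list(ParameterGrid(param_combination))

    print(len(params))  # 6
    print(params[0])    # {'shrink_threshold': 3.0, 'tree_depth': 3}

Because the order is deterministic, a grid ID can be mapped back to its parameter set by indexing into the list.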

**sklearn-fab**

* Model creation and best model selection are both performed with sklearn.model_selection.GridSearchCV_.
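
The following is a minimal sketch of what ``GridSearchCV`` does, written so that it runs without sklearn-fab installed. The synthetic data and the ``DecisionTreeRegressor`` stand-in (with its ``max_depth`` and ``min_samples_leaf`` parameters) are assumptions made for illustration; with sklearn-fab, the FAB estimator and its own parameters would take their place.

.. code-block:: python

    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic regression data (an assumption, used only to keep the sketch runnable).
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    # Stand-in estimator and grid (not the FAB estimator).
    estimator = DecisionTreeRegressor(random_state=0)
    param_grid = {'max_depth': [3, 4, 5], 'min_samples_leaf': [1, 5]}

    # Each candidate is scored by 5-fold cross-validation on negated RMSE.
    # scikit-learn maximizes scores, so a larger value means a smaller error.
    search = GridSearchCV(estimator, param_grid,
                          scoring='neg_root_mean_squared_error', cv=5)
    search.fit(X, y)

    print(search.best_params_)   # parameter set with the best mean CV score
    print(-search.best_score_)   # mean cross-validated RMSE of that set

    # With the default refit=True, the best parameter set is refitted on all of X, y.
    best_model = search.best_estimator_

Unlike the manual SAMPO/FAB procedure, the candidates here are ranked by cross-validation on the training data rather than by the error on a separate prediction data set, so the two approaches need not select the same parameters.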

.. csv-table::
  :header: "SAMPO/FAB", "sklearn-fab"

  "
  .. code-block:: python
    :caption: **Data preparation**

    import pandas as pd

    # Training data (a sequential '_sid' column is inserted as the first column)
    train_data = pd.read_csv('../data/train_data.csv')
    train_data.insert(0, '_sid', list(range(train_data.shape[0])))
    # Prediction data
    predict_data = pd.read_csv('../data/predict_data.csv')
    predict_data.insert(0, '_sid', list(range(predict_data.shape[0])))
  ", "
  .. code-block:: python
    :caption: **Data preparation**

    import pandas as pd

    # Training data
    train_data = pd.read_csv('../data/train_data.csv')

    # The first column is used as the target; the remaining columns are features
    X_train = train_data.iloc[:, 1:]
    y_train = train_data.iloc[:, 0]
  "
  "
  .. code-block:: python
    :caption: **ASD**

    from sampotools.api import gen_asd_from_pandas_df

    asd = gen_asd_from_pandas_df(train_data)
  ", ""
  "
  .. code-block:: python
    :caption: **SPD**

    spd_content = '''
    dl -> rg

    ---
    components:
        dl:
            component: DataLoader

        rg:
            component: FABHMEBernGateLinearRgComponent
            features: name != 'price'
            target: name == 'price'
            standardize_target: True
            tree_depth: {{ tree_depth }}
            shrink_threshold: {{ shrink_threshold }}

    global_settings:
        keep_attributes:
            - price
        feature_exclude:
            - price
    '''
  ", "
  .. code-block:: python

    from sklearn.model_selection import GridSearchCV
    from sklearn_fab import SklearnFABBernGateLinearRegressor

    # Define the parameter combinations
    estimator_param_grid = {'tree_depth': [3, 4, 5], 'shrink_threshold': [3.0, 5.0]}

    # Create the grid search instance
    estimator = SklearnFABBernGateLinearRegressor()
    search = GridSearchCV(estimator, estimator_param_grid)

    # Run the grid search
    search.fit(X_train, y_train)

    # Retrieve the best model
    search.best_estimator_

  .. code-block:: python

    SklearnFABBernGateLinearRegressor(...
                                      random_seed=3040280244, ...
                                      shrink_threshold=3.0, ...
                                      tree_depth=3, ...
                                      )
  "
  "
  .. code-block:: python
    :caption: **Parameter combination definition**

    from sklearn.model_selection import ParameterGrid

    spd_param_combination = {
        'tree_depth': [3, 4, 5],
        'shrink_threshold': [3.0, 5.0]
    }

    spd_params = list(ParameterGrid(spd_param_combination))
  ", ""
  "
  .. code-block:: python
    :caption: **SRC**

    train_src_temp = '''
    train_{{ parameter_pattern }}:
        type: learn
        data_sources:
            dl:
                df: {{ train_data }}
                attr_schema: {{ asd }}
    '''

    predict_src_temp = '''
    predict_{{ parameter_pattern }}:
        type: predict
        data_sources:
            dl:
                df: {{ predict_data }}
                attr_schema: {{ asd }}

        model_process: train_{{ parameter_pattern }}
    '''
  ", ""
  "
  .. code-block:: python
    :caption: **Process list creation**

    from sampo.api import gen_spd, gen_src

    process_list = []

    for grid_id, spd_param in enumerate(spd_params):
        spd = gen_spd(template=spd_content, params=spd_param)
        src_param = {'parameter_pattern': grid_id, 'train_data': train_data,
                     'predict_data': predict_data, 'asd': asd}
        train_src = gen_src(template=train_src_temp, params=src_param)
        predict_src = gen_src(template=predict_src_temp, params=src_param)
        process_list.append((train_src, spd))
        process_list.append((predict_src, None))
  ", ""
  "
  .. code-block:: python
    :caption: **Grid search execution**

    from sampo.api import process_runner, process_store

    pstore_url = 'pstore_rg'
    process_store.create(pstore_url)

    process_runner.session_run(process_list, pstore_url=pstore_url)
  ", ""
  "
  .. code-block:: python
    :caption: **Retrieving RMSE**

    import re

    result = []
    # Keep only the prediction processes (named predict_<grid ID>)
    predict_proc_names = [src.name for src, _ in process_list if re.match(r'predict_', src.name)]
    for predict_proc_name in predict_proc_names:
        row = {}
        with process_store.open_process(pstore_url, predict_proc_name) as prl:
            evaluation = prl.load_comp_output_evaluation('rg')
            row['process_name'] = predict_proc_name
            row['rmse'] = evaluation['root_mean_squared_error'][0]
            result.append(row)

    # Display the results in ascending order of RMSE
    pd.DataFrame(result).sort_values(by='rmse')
  ", ""
  "
  .. csv-table::
    :header-rows: 1

    , process_name, rmse
    2, predict_2, 5.029161
    4, predict_4, 5.134111
    5, predict_5, 5.533593
    0, predict_0, 5.641256
    1, predict_1, 7.055014
    3, predict_3, 8.862625
  ", ""
  "
  .. code-block:: python
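    :caption: **Retrieving the best model**

    # Pick the process with the smallest RMSE, then recover its grid ID
    best_model = min(result, key=lambda row: row['rmse'])['process_name']
    num_best_model = int(best_model.rsplit('_', 1)[1])
    spd_params[num_best_model]
  ", ""
  "
  .. code-block:: python

    {'shrink_threshold': 3.0, 'tree_depth': 5}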
  "
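
The manual selection step on the SAMPO/FAB side can also be expressed as a single ranking table. The following is a minimal sketch, not part of the SAMPO API: it copies the RMSE values from the example output above, assumes that the numeric suffix of each process name is the grid ID assigned when the process list was built, and uses pandas to attach the parameters to each row. The names ``ranking`` and ``params_df`` are introduced here for illustration only.

.. code-block:: python

    import pandas as pd
    from sklearn.model_selection import ParameterGrid

    # Grid definition and RMSE values copied from the example on this page.
    spd_params = list(ParameterGrid({'tree_depth': [3, 4, 5],
                                     'shrink_threshold': [3.0, 5.0]}))
    result = [
        {'process_name': 'predict_0', 'rmse': 5.641256},
        {'process_name': 'predict_1', 'rmse': 7.055014},
        {'process_name': 'predict_2', 'rmse': 5.029161},
        {'process_name': 'predict_3', 'rmse': 8.862625},
        {'process_name': 'predict_4', 'rmse': 5.134111},
        {'process_name': 'predict_5', 'rmse': 5.533593},
    ]

    ranking = pd.DataFrame(result)
    # Recover the grid ID from the process name suffix (also works for multi-digit IDs).
    ranking['grid_id'] = ranking['process_name'].str.rsplit('_', n=1).str[1].astype(int)

    # Attach the hyperparameters of each grid ID, then sort by ascending RMSE.
    params_df = pd.DataFrame(spd_params)
    ranking = ranking.join(params_df, on='grid_id').sort_values(by='rmse')

    best = ranking.iloc[0]
    print(best['process_name'])  # predict_2

Keeping the parameters next to the scores avoids having to index back into ``spd_params`` by hand when reading the results.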

.. _sklearn.model_selection.ParameterGrid: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html
.. _sklearn.model_selection.GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
