
Walkthrough: Trying Out Machine Learning Pipelines with SAS Viya

Posted at 2018-08-07

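SAS Viya provides an AI platform. A demo is available, but because everything is in English, many people may hesitate before trying it. This article therefore walks through the steps needed to start using a machine learning pipeline from Python.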

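Getting started

First, go to the SAS developer site, SAS for Developers.

Then click Get started with your SAS Viya Developer Trial.

To try it out you need a SAS Profile. Start by clicking Create one.

Click the button for registering a SAS Profile (labeled SASプロファイルを登録する on the Japanese page).

Fill in your information.

After you submit the form, you will receive an email.

Click the activation link in the email (labeled SASプロファイルをアクティベートする).

Then set a password.

Your SAS Profile is now activated. Click Continue (続ける).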

You are returned to the SAS® Viya® Developer Trial page from before, so this time click Sign In.

Log in with the ID and password of the SAS Profile you just registered.

After logging in you are returned to the SAS® Viya® Developer Trial page again, but this time the form should be filled in with your profile details. Click the Sign me up button at the bottom.

That completes the application for the SAS® Viya® Developer Trial. An email arrives a few minutes later.

Click the Log in to your Trial Portal link in that email.

The terms of use are displayed. If you have no objections, click the Accept button.

A login form then appears. Log in with the ID and password you registered earlier.

After you log in, Getting Started: SAS® Viya™ Developer Trial is displayed.

Click Get started with Python, R, and SAS API via Jupyter Notebook near the bottom.

Then click Start My Trial Now.

The familiar Jupyter Notebook screen appears. Click TitanicPipefitter.ipynb inside the Python folder.

The rest of this article follows the contents of TitanicPipefitter.ipynb.

Machine Learning Pipeline: Using Pipefitter with SAS 9 or SAS Viya

The goal is to supply a uniform set of high-level APIs built on top of SAS 9.4 and SAS Viya that help users create and tune practical machine learning pipelines. The same idea is used in the open source frameworks scikit-learn and spark.ml.

The three primary components used in the workflow are:

Transformer
Estimator
Pipeline

First, we set up paths to development versions of packages. This is just for development purposes. Customers would install these packages in the normal fashion.

Create Connections and Load Data

The PipelineConnection object is a thin wrapper around the saspy and SWAT connection objects. While these objects use very different connection parameters, methods such as read_csv can be normalized between them using this object.
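As a rough illustration of what "a thin wrapper that normalizes read_csv" can mean, here is a minimal sketch in plain Python. It is not the PipelineConnection implementation: `ListBackend`, `DictBackend`, and `UniformConnection` are hypothetical names invented for this example, and the two backends simply expose differently named loaders.

```python
import csv
import io


class ListBackend:
    """Hypothetical backend whose loader is called load_rows(text)."""

    def load_rows(self, text):
        return list(csv.reader(io.StringIO(text)))


class DictBackend:
    """Hypothetical backend whose loader is called parse(text=...)."""

    def parse(self, *, text):
        return list(csv.DictReader(io.StringIO(text)))


class UniformConnection:
    """Thin wrapper exposing a single read_csv method over either backend."""

    def __init__(self, backend):
        self._backend = backend

    def read_csv(self, text):
        # Forward to whichever loader the wrapped backend provides.
        if isinstance(self._backend, ListBackend):
            return self._backend.load_rows(text)
        return self._backend.parse(text=text)


sample = "Sex,Age\nmale,22\nfemale,38\n"
rows_a = UniformConnection(ListBackend()).read_csv(sample)
rows_b = UniformConnection(DictBackend()).read_csv(sample)
print(rows_a[1])
print(rows_b[0]["Age"])
```

The caller only ever sees `read_csv`, even though each backend is driven through a different method and argument style.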

SAS Viya Version

import swat as sw

cashost='localhost'
casport=5570
casauth='~/.authinfo'
conn = sw.CAS(cashost, casport, authinfo=casauth, caslib="casuser")
data = conn.read_csv('../data/train.csv')

data.head()

data.info()

Pipefitter Classes

These classes are agnostic across connection types. They simply proxy all method calls to the appropriate classes in the registered backends. New backends can be registered using the name of the data type of the table object as the key. Currently, Pipefitter supports SASdata from the saspy package and CASTable from the SWAT package.
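The idea of registering backends under the name of the table object's type can be sketched in a few lines. This is only an illustration under assumptions: `ToyTableA`, `ToyTableB`, `BACKENDS`, and `register_backend` are hypothetical and do not reflect how Pipefitter stores or looks up its backends internally.

```python
class ToyTableA:
    """Hypothetical stand-in for one backend's table type."""


class ToyTableB:
    """Hypothetical stand-in for another backend's table type."""


# Registry mapping the table type's name to a backend fit function.
BACKENDS = {}


def register_backend(type_name, fit_func):
    BACKENDS[type_name] = fit_func


def fit(table, **params):
    # Look up the backend by the name of the table object's type.
    type_name = type(table).__name__
    try:
        backend_fit = BACKENDS[type_name]
    except KeyError:
        raise TypeError(f"no backend registered for {type_name}") from None
    return backend_fit(table, **params)


register_backend("ToyTableA", lambda table, **p: ("backend A", p))
register_backend("ToyTableB", lambda table, **p: ("backend B", p))

print(fit(ToyTableA(), max_depth=6))
print(fit(ToyTableB(), max_depth=6))
```

Because the front-end `fit` never names a concrete backend, supporting a new table type only requires one more `register_backend` call.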

The SAS Pipefitter project provides a Python API for developing machine learning pipelines (roughly, process flows). The pipelines are built from stages that perform variable transformation, parameter estimation, and hyperparameter tuning.

Estimator

from pipefitter.estimator import DecisionTree, DecisionForest, GBTree

Create a DecisionTree object. This is the high-level object, and it has no knowledge of CAS or SAS.

params = dict(target='Survived', 
              inputs=['Sex','Age','Fare'], 
              nominals=['Sex', 'Survived'])

dtree = DecisionTree(max_depth=6, **params)
dtree

Decision Tree Fit and Score of CAS Table

Run the fit method against the data set. This fit method does some parameter validation, then looks up the appropriate sub-package for the given table object to locate the correct backend implementation to call. The fit method of that object is then called.

Using the DecisionTree instance, we'll first run the fit method on the data set. This will return a model object.

model = dtree.fit(data)
model

vars(model)

The score method can then be called on the resulting model object, using a different data set for scoring if desired.

score = model.score(data)
score

Fields from the output can be selected using standard DataFrame techniques. There is a common set of fields for all backends, but other fields may be available depending on what the backend model can produce.

score.loc['MisClassificationRate']
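The notebook does not spell out how MisClassificationRate is computed by the backend. Assuming it follows the usual definition (the fraction of observations whose predicted class differs from the actual class), a minimal sketch looks like this. The function name and the sample labels are made up for illustration.

```python
def misclassification_rate(actual, predicted):
    """Return the fraction of predictions that differ from the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Made-up labels: 1 = survived, 0 = did not survive.
actual = [1, 0, 0, 1, 1, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 1, 0]
rate = misclassification_rate(actual, predicted)
print(rate)
```

A lower value means fewer wrong predictions, which is why this field is a natural target for the tuning shown later in the notebook.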

Other estimator objects can be created using the same set of parameters.

rf = DecisionForest(**params)
rfmodel = rf.fit(data)
rfmodel

rfmodel.score(data)

You can even take it down to a one-liner.

GBTree(**params).fit(data).score(data)

Generic Imputer

Imputing can also be done using classes that work with multiple backends.

from pipefitter.transformer import Imputer

imp = Imputer(value=Imputer.MODE)
imp

data.info()

datamode = imp.transform(data)
datamode.info()

Impute the continuous (interval) variables using mean values.

datamean = Imputer(value=Imputer.MEAN).transform(datamode)
datamean.info()

Here is a sample of the data set before imputing missing values.

data.head(10)

Here is a sample of the data set after imputing missing values.

datamean.head(10)
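To make the two imputation strategies concrete without a CAS server, the following sketch fills missing values in a small list of dictionaries. It is an assumption-laden illustration, not the Imputer class: the `impute` helper and the sample rows are hypothetical, the mode is applied to a nominal column, and the mean is applied to a numeric column.

```python
from statistics import mean, mode


def impute(rows, column, strategy):
    """Return copies of rows with None in `column` replaced by the mode or mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mode(observed) if strategy == "mode" else mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Made-up rows with one missing value per column.
rows = [
    {"Embarked": "S", "Age": 22.0},
    {"Embarked": None, "Age": 38.0},
    {"Embarked": "S", "Age": None},
    {"Embarked": "C", "Age": 30.0},
]
rows = impute(rows, "Embarked", "mode")  # most frequent value
rows = impute(rows, "Age", "mean")       # average of observed values
for row in rows:
    print(row)
```

Each call returns a new list, mirroring how the notebook keeps `data`, `datamode`, and `datamean` as separate tables.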

Pipeline

Pipelines can be constructed using multiple transformers and estimators. The flow of a sample pipeline is shown in the diagram below.

from pipefitter.pipeline import Pipeline

Defining the Pipeline

Each component of a pipeline is a tuple that specifies a label for the component and a class instance for that step in the pipeline.

First we define the Imputer objects.

from pipefitter.transformer import Imputer

mode_imputer = Imputer(value=Imputer.MODE)
mean_imputer = Imputer(value=Imputer.MEAN)

Now we define the estimators. In this case, we'll use the same parameters for all of them.

from pipefitter.estimator import DecisionTree, DecisionForest, GBTree

params = dict(max_depth=6, 
              target='Survived', 
              nominals='Survived', 
              inputs=['Age', 'Fare', 'Sex'])

tree = DecisionTree(**params)
tree2 = DecisionTree(**params)
rf = DecisionForest(**params)

A pipeline is constructed using the imputer and estimator objects. Note that we can use one or more of any of them. In the case of imputers, the transform method is called at each step and the resulting table is used as the input for the next step. In the case of estimators, the fit method is called to get a model, then the transform method on the model object is called to get the output table for the next step. By default, the table that was passed in is just sent to the next step.

pipeline = Pipeline([mode_imputer, tree, mean_imputer, tree2, rf])
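The stage-by-stage behavior described above can be mimicked with toy classes. This is a minimal sketch under assumptions, not Pipefitter's Pipeline class: `FillMissing`, `MeanEstimator`, `MeanModel`, and `run_pipeline` are hypothetical, and a "table" is just a list of numbers.

```python
class FillMissing:
    """Toy transformer: replaces None with a fixed value."""

    def __init__(self, value):
        self.value = value

    def transform(self, table):
        return [self.value if x is None else x for x in table]


class MeanModel:
    """Toy model returned by MeanEstimator.fit."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, table):
        # Default behavior described in the text: pass the table through.
        return table


class MeanEstimator:
    """Toy estimator: 'fits' by computing the mean of the table."""

    def fit(self, table):
        return MeanModel(sum(table) / len(table))


def run_pipeline(stages, table):
    """Apply each stage in order and collect the fitted models."""
    models = []
    for stage in stages:
        if hasattr(stage, "fit"):
            # Estimator: fit, then let the model hand a table to the next stage.
            model = stage.fit(table)
            models.append(model)
            table = model.transform(table)
        else:
            # Transformer: its output becomes the next stage's input.
            table = stage.transform(table)
    return models, table


stages = [FillMissing(0), MeanEstimator(), MeanEstimator()]
models, final_table = run_pipeline(stages, [4, None, 8])
print([m.mean for m in models])
print(final_table)
```

Both toy estimators see the imputed table, which is the same reason the estimators in the notebook's pipeline can train on data cleaned by the preceding imputers.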

Call the `fit` Method using the Pipeline

model = pipeline.fit(data)
model

We can now score the data using the model produced by the pipeline. The data passed in goes through the same transformation steps as the training data. Note that we get a separate row in the resulting data set for each estimator in the pipeline.

model.score(data)

It is also possible to pass in parameter overrides to the fit method. These parameters are only used for this fit invocation and are not preserved.

model = pipeline.fit(data, {tree2.max_depth: 6, 'max_depth': 1, 'inputs': ['Age', 'Sex']})

Here we are scoring the data and retrieving just two columns from the output.

model.score(data).loc[:, ['NObsUsed', 'MisClassificationRate']]

You can set new parameters in the pipeline object using the set_params method.

pipeline.set_params({tree2.max_depth: 6, 'max_depth': 1, 'inputs': ['Age', 'Sex']})

These parameters will now be used in any call to the fit and score methods.

model = pipeline.fit(data)

model.score(data).loc[:, ['NObsUsed', 'MisClassificationRate']]
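The difference between one-off overrides passed to fit and persistent changes made with set_params can be shown with a toy class. This sketch only handles plain string keys and is not how Pipefitter resolves parameters. `ToyEstimator` is a hypothetical name, and its `fit` returns the effective parameters instead of a model so the result is easy to inspect.

```python
class ToyEstimator:
    """Hypothetical estimator that reports the parameters used by each fit."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, overrides):
        # Persistent change: later fit calls see the new values.
        self.params.update(overrides)

    def fit(self, data, overrides=None):
        # Temporary change: merge into a copy, leave self.params untouched.
        return {**self.params, **(overrides or {})}


est = ToyEstimator(max_depth=6, inputs=["Age", "Fare", "Sex"])

used_once = est.fit([], {"max_depth": 1})  # override applies to this call only
used_next = est.fit([])                    # back to the stored value

est.set_params({"max_depth": 1})
used_after_set = est.fit([])               # stored value has changed

print(used_once["max_depth"], used_next["max_depth"], used_after_set["max_depth"])
```

Merging into a copy is what keeps the override from leaking into later calls.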

HyperParameter Tuning

The hyper-parameter tuning classes allow you to test multiple sets of parameters across a set of estimators. These can be used both with individual estimator classes and with full pipelines.

from pipefitter.estimator import DecisionTree
from pipefitter.model_selection import HyperParameterTuning

First we set up the parameter grid. This can be done as a single dictionary where each of the fields is a list. Or, it can be done as a list of dictionaries where each element in the list is a complete set of parameters.

param_grid = dict(
    max_depth=[6, 10],
    leaf_size=[3, 5],
)
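The two ways of writing a parameter grid are related: a dictionary of lists can be expanded into a list of complete parameter dictionaries. The sketch below assumes the grid covers every combination (a Cartesian product), which the notebook does not state explicitly. `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(param_grid):
    """Expand a dict of lists into a list of complete parameter dicts."""
    names = list(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]


param_grid = dict(max_depth=[6, 10], leaf_size=[3, 5])
grid_points = expand_grid(param_grid)
for point in grid_points:
    print(point)
```

With two values for each of two parameters, the expansion yields four grid points.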

Here we are using a DecisionTree with the parameter grid.

hpt = HyperParameterTuning(
          estimator=DecisionTree(target='Survived',
                                 inputs=['Age', 'Fare', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin'],
                                 nominals=['Survived', 'Cabin', 'Sex']),
          param_grid=param_grid, 
          score_type='MisClassificationRate', 
          cv=3)

hpt.gridsearch(data)
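The notebook does not explain the cv=3 argument. Assuming it means 3-fold cross-validation, the sketch below shows the general technique: split the rows into k folds, hold each fold out in turn, and average the misclassification rate. Everything here is hypothetical and simplified. The "model" is a majority-class predictor, and the folds are contiguous rather than shuffled.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(labels, k):
    """Average misclassification rate of a majority-class predictor over k folds."""
    rates = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        # "Fit": predict the most common label in the training rows.
        majority = max(set(train), key=train.count)
        wrong = sum(1 for i in held_out if labels[i] != majority)
        rates.append(wrong / len(held_out))
    return sum(rates) / k


# Made-up labels: 1 = survived, 0 = did not survive.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0]
folds = k_fold_indices(len(labels), 3)
cv_score = cross_validate(labels, 3)
print(folds)
print(cv_score)
```

A grid search would repeat this averaging once per grid point and compare the resulting scores.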

Here we are using the same pipeline as defined in the previous section with this parameter grid.

hpt = HyperParameterTuning(
          estimator=pipeline,
          param_grid=param_grid, 
          score_type='MisClassificationRate', 
          cv=3)

In this case, we get a line for each of the estimators in the pipeline.

hpt.gridsearch(data)

It is also possible to use a separate thread for each grid point in the parameter grid. This will only work if the table is globally available in CAS. Currently, saspy does not allow multiple threads.

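try:
    # Promote the table so that it is globally available in CAS.
    data.table.promote()
except Exception:
    # Promotion can fail (for example, if the table is already promoted); ignore it.
    pass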

hpt.gridsearch(data, n_jobs=4)
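Evaluating grid points in separate threads can be illustrated with the standard library. This is a sketch of the general idea only, assuming n_jobs=4 means up to four worker threads. The `evaluate` function is a hypothetical stand-in for fitting and scoring one grid point, and its formula is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(point):
    """Toy scorer: stands in for fitting and scoring one grid point."""
    return point["max_depth"] * 0.01 + point["leaf_size"] * 0.001


grid_points = [
    {"max_depth": 6, "leaf_size": 3},
    {"max_depth": 6, "leaf_size": 5},
    {"max_depth": 10, "leaf_size": 3},
    {"max_depth": 10, "leaf_size": 5},
]

# Up to four grid points are evaluated at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid_points))

# pool.map preserves input order, so scores line up with grid_points.
best_score, best_point = min(zip(scores, grid_points), key=lambda pair: pair[0])
print(best_point)
```

Threads help here because each worker mostly waits on a remote server. That is also why every thread must be able to see the same table, as the text notes for CAS.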

Clean up

conn.terminate()

That covers the demo in TitanicPipefitter.ipynb. Because it is a Jupyter Notebook, you can run the code in your web browser and check the results there. Give it a try on SAS Viya.

SAS for Developers | SAS
