
Amazon SageMaker: A Quick First Try

Posted at 2018-12-20

All resources are created in the US East (N. Virginia) region.

This is just a quick hands-on first look.

1. Create an S3 bucket

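This walkthrough produces two kinds of data:
・The training data for the model
・The model artifacts generated while the model is being trained

Create an Amazon S3 (hereafter S3) bucket to store this data.

Make sure the bucket name contains the string "sagemaker".

The bucket creation steps are omitted here.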
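Since the steps are omitted, here is a minimal sketch of creating the bucket from Python instead of the console. It assumes `boto3` is installed and AWS credentials are configured. The helper names `is_valid_bucket_name` and `create_sagemaker_bucket` are hypothetical, and the name check covers only the basic S3 naming rules.

```python
import re

# Basic S3 naming rules only: 3-63 characters, lowercase letters, digits,
# dots and hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if the name passes the basic rules and contains 'sagemaker'."""
    return bool(_BUCKET_RE.fullmatch(name)) and "sagemaker" in name


def create_sagemaker_bucket(name):
    """Create the bucket in us-east-1 (hypothetical helper, needs AWS credentials)."""
    import boto3  # imported lazily so the name check works without boto3

    if not is_valid_bucket_name(name):
        raise ValueError("bucket name must be valid and contain 'sagemaker'")
    # us-east-1 is the default location, so no CreateBucketConfiguration is passed
    return boto3.client("s3", region_name="us-east-1").create_bucket(Bucket=name)
```

Calling `create_sagemaker_bucket('sagemaker-leonmaron')` would create the bucket used later in this post, provided that name is still free.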

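2. Create a SageMaker notebook instance

What is a SageMaker notebook instance?
It is a fully managed machine learning EC2 compute instance with Jupyter Notebook preinstalled.

Enter any value for [Notebook instance] and leave everything else at its default.

When setting up the IAM role, choose "None" for the S3 bucket option.
This is why the bucket created in "1. Create an S3 bucket" has "sagemaker" in its name: the role can reach buckets whose names contain that string without any extra setting.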
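To illustrate why the bucket name matters, the sketch below mimics wildcard matching of S3 ARNs against a role's allowed resources. The patterns are an assumption modelled on the S3 resource patterns of the AWS managed policy AmazonSageMakerFullAccess, so check the policy attached to your own role. `fnmatchcase` only approximates IAM matching, and `role_can_reach` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Assumed resource patterns (modelled on AmazonSageMakerFullAccess);
# verify them against the policy attached to your own role.
ASSUMED_PATTERNS = [
    "arn:aws:s3:::*SageMaker*",
    "arn:aws:s3:::*Sagemaker*",
    "arn:aws:s3:::*sagemaker*",
]


def role_can_reach(bucket, patterns=ASSUMED_PATTERNS):
    """Case-sensitive wildcard match of the bucket ARN against the patterns."""
    arn = "arn:aws:s3:::{}".format(bucket)
    return any(fnmatchcase(arn, pattern) for pattern in patterns)


print(role_can_reach("sagemaker-leonmaron"))
print(role_can_reach("my-data"))
```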

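3. Train and deploy a model with a built-in algorithm

We take MNIST, a dataset of images of single handwritten digits,
and group it into clusters with K-Means, one of SageMaker's built-in algorithms.

1. Create a Jupyter notebook

Open Jupyter and create a new notebook with the "conda_python3" kernel.

Click "Untitled", enter any name you like, and rename the notebook.

Edit the following Python code (set your own bucket name), paste it into a cell, and click "Run".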

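from sagemaker import get_execution_role

# IAM role attached to this notebook instance
role = get_execution_role()
# Replace with the name of the bucket created in step 1
bucket = 'sagemaker-leonmaron'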

2. Download and inspect the training data

Download the MNIST dataset

%%time
import pickle, gzip, numpy, urllib.request, json
 
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
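The unpacking on the last line works because the pickle holds a tuple of three `(images, labels)` pairs. The sketch below round-trips a tiny stand-in through the same `gzip` and `pickle` calls. The file name `tiny_mnist.pkl.gz`, the helper names, and the data are made up for illustration, and plain lists stand in for the NumPy arrays of the real file.

```python
import gzip
import os
import pickle
import tempfile


def make_split(n_images, n_pixels=784):
    """Build a fake (images, labels) pair shaped like one MNIST split."""
    images = [[0.0] * n_pixels for _ in range(n_images)]
    labels = [i % 10 for i in range(n_images)]
    return images, labels


def round_trip(splits):
    """Write the splits in the same layout as mnist.pkl.gz, then read them back."""
    path = os.path.join(tempfile.mkdtemp(), "tiny_mnist.pkl.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(splits, f)
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


train_set, valid_set, test_set = round_trip(
    (make_split(5), make_split(2), make_split(2))
)
# Element 0 of each split holds the images, element 1 the labels
print(len(train_set[0]), len(train_set[0][0]), train_set[1])
```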

Inspect the training dataset

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (2,10)
 
 
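def show_digit(img, caption='', subplot=None):
    # Create a new axes when the caller does not pass one in
    if subplot is None:
        _, subplot = plt.subplots(1, 1)
    # Each MNIST image is a flat vector of 784 values; reshape it to 28x28
    imgr = img.reshape((28, 28))
    subplot.axis('off')
    subplot.imshow(imgr, cmap='gray')
    # Set the title on this axes rather than on whichever axes is current
    subplot.set_title(caption)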
 
show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))

The 31st image of the MNIST training set (index 30) is displayed together with its label (3).
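Index 30 is the 31st image because Python indexing starts at 0, and each image is stored as a flat vector of 784 values that `reshape((28,28))` turns into 28 rows. A minimal pure-Python sketch of the same reshape follows, with hypothetical `to_rows` and `render` helpers, a made-up image, and text output in place of matplotlib.

```python
def to_rows(flat, width=28):
    """Split a flat pixel vector into rows of `width` values (row-major)."""
    if len(flat) % width != 0:
        raise ValueError("length must be a multiple of width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def render(rows, threshold=0.5):
    """Render pixels at or above the threshold as '#' and the rest as '.'."""
    return "\n".join(
        "".join("#" if p >= threshold else "." for p in row) for row in rows
    )


# Made-up 784-pixel image: a single bright vertical stroke in column 14
flat = [1.0 if i % 28 == 14 else 0.0 for i in range(784)]
rows = to_rows(flat)
print(render(rows))
```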

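3. Train the model

Choosing a training algorithm
In machine learning you normally need an evaluation process to find an algorithm that suits your model,
but here we have already decided to use k-means, one of SageMaker's built-in algorithms, so that evaluation step is skipped.

Create the training job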
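For readers new to k-means, here is a minimal pure-Python sketch of the textbook procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not the SageMaker implementation, which is built for much larger data. The function names and the toy points are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Textbook Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

With `k=10` on MNIST the idea is the same, except that each point has 784 dimensions and the cluster ids have no built-in relationship to the digit labels.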

from sagemaker import KMeans
 
data_location = 's3://{}/kmeans_highlevel_example/data'.format(bucket)
output_location = 's3://{}/kmeans_example/output'.format(bucket)
 
print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))
 
kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)
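The constructor call above uses the argument names of version 1 of the SageMaker Python SDK, which was current when this post was written. In version 2 of the SDK, `train_instance_count` and `train_instance_type` were renamed to `instance_count` and `instance_type`. Below is a small sketch, with a hypothetical `kmeans_kwargs` helper, that builds the keyword arguments for either version.

```python
def kmeans_kwargs(role, bucket, sdk_major_version):
    """Build KMeans keyword arguments for SDK v1 or v2+ (hypothetical helper)."""
    # v1 prefixes the instance settings with 'train_'; v2 drops the prefix
    prefix = "" if sdk_major_version >= 2 else "train_"
    return {
        "role": role,
        prefix + "instance_count": 2,
        prefix + "instance_type": "ml.c4.8xlarge",
        "output_path": "s3://{}/kmeans_example/output".format(bucket),
        "k": 10,
        "data_location": "s3://{}/kmeans_highlevel_example/data".format(bucket),
    }


# Usage in the notebook (assumes the sagemaker package is installed):
#   import sagemaker
#   major = int(sagemaker.__version__.split(".")[0])
#   kmeans = sagemaker.KMeans(**kmeans_kwargs(role, bucket, major))
```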

Run the training

%%time
 
kmeans.fit(kmeans.record_set(train_set[0]))

After training finishes, check the S3 bucket prepared earlier: it now holds the model artifacts generated during training.
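To check from the notebook rather than the console, the sketch below lists the objects under the output location. It assumes `boto3` is available and that the artifact lands at `<output_path>/<training job name>/output/model.tar.gz`, the usual layout for SageMaker training jobs. `split_s3_uri` and `list_artifacts` are hypothetical helper names.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: {}".format(uri))
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def list_artifacts(output_location):
    """Return the keys stored under the training output location."""
    import boto3  # imported lazily; requires AWS credentials at call time

    bucket, prefix = split_s3_uri(output_location)
    response = boto3.client("s3").list_objects_v2(Bucket=bucket, Prefix=prefix)
    # 'Contents' is absent when nothing matches the prefix
    return [obj["Key"] for obj in response.get("Contents", [])]
```

In the notebook, `list_artifacts(output_location)` would then return the keys written by the training job.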

Deploy the model to SageMaker hosting services

%%time
 
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

Validate the model

result = kmeans_predictor.predict(valid_set[0][30:31])
print(result)

This returns the inference result for the image at index 30 of the validation set.

Next, get the inference results for 100 images.
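The slice `[30:31]` rather than the index `[30]` keeps the batch dimension, so the predictor receives a batch of one image (shape `(1, 784)` for the NumPy array) instead of a single flat vector. The same distinction is shown here with plain lists standing in for the array. The data is made up.

```python
# Made-up stand-in for valid_set[0]: 100 "images" of 784 pixels each
images = [[float(i)] * 784 for i in range(100)]

single = images[30]     # one image: a flat sequence of 784 values
batch = images[30:31]   # a batch holding just that image

print(len(single), len(batch), len(batch[0]))
```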

%%time 
 
result = kmeans_predictor.predict(valid_set[0][0:100])
clusters = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]

for cluster in range(10):
    print('\n\n\nCluster {}:'.format(int(cluster)))
    digits = [ img for l, img in zip(clusters, valid_set[0]) if int(l) == cluster ]
    height = ((len(digits)-1)//5) + 1
    width = 5
    plt.rcParams["figure.figsize"] = (width,height)
    _, subplots = plt.subplots(height, width)
    subplots = numpy.ndarray.flatten(subplots)
    for subplot, image in zip(subplots, digits):
        show_digit(image, subplot=subplot)
    for subplot in subplots[len(digits):]:
        subplot.axis('off')
 
    plt.show()
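The loop above relies on two small ideas: `zip(clusters, valid_set[0])` pairs the 100 cluster ids with the first 100 images because `zip` stops at the shorter input, and `((len(digits)-1)//5) + 1` is a ceiling division that gives the number of grid rows. Here is a pure-Python sketch of both, using hypothetical helper names and made-up data.

```python
def group_by_cluster(clusters, images, n_clusters=10):
    """Collect the images assigned to each cluster id."""
    groups = {c: [] for c in range(n_clusters)}
    for label, img in zip(clusters, images):
        groups[int(label)].append(img)
    return groups


def grid_rows(n_items, width=5):
    """Rows needed to show n_items in a grid `width` wide (ceiling division)."""
    return (n_items - 1) // width + 1


# Cluster ids arrive as floats, as in the predictor output above
groups = group_by_cluster(
    [0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0], list("abcdefgh"), n_clusters=3
)
print({c: (len(g), grid_rows(len(g))) for c, g in groups.items()})
```

Note that an empty cluster yields zero rows, a case the plotting loop above does not handle explicitly.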

This is what the 0s look like in the cluster that the image at index 30 was assigned to.
