[Beginner] Trying out the Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm

Posted at 2023-01-01

Background and purpose

This post tries out Latent Dirichlet Allocation (hereafter, LDA) on Amazon SageMaker.

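Overview

What is Latent Dirichlet Allocation (LDA)?

LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
It is most commonly used to discover a user-specified number of topics shared by the documents in a text corpus.
- Here, each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics.
  - Each topic is learned as a probability distribution over the words that occur in the documents.
  - Each document, in turn, is described as a mixture of topics.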
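
To make "documents as mixtures of topics" concrete, here is a minimal sketch of the generative story LDA assumes, using only the Python standard library. The function names, the toy topics, and the Dirichlet parameters are illustrative assumptions and are not part of the SageMaker example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, alpha, num_words, rng):
    """Generate (topic_mixture, word_counts) for a single document."""
    vocab_size = len(topic_word_probs[0])
    # 1. Each document gets its own mixture over topics.
    topic_mixture = sample_dirichlet(alpha, rng)
    counts = [0] * vocab_size
    for _ in range(num_words):
        # 2. Pick a topic for this word according to the mixture.
        topic = rng.choices(range(len(topic_mixture)), weights=topic_mixture)[0]
        # 3. Pick a word according to that topic's word distribution.
        word = rng.choices(range(vocab_size), weights=topic_word_probs[topic])[0]
        counts[word] += 1
    return topic_mixture, counts


rng = random.Random(0)
# Two toy topics over a 4-word vocabulary (illustrative values only).
topics = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
mixture, counts = generate_document(topics, alpha=[0.5, 0.5], num_words=20, rng=rng)
print("topic mixture:", mixture)
print("word counts:  ", counts)
```

Training LDA goes in the opposite direction: given only the word counts, it estimates the topics and each document's mixture.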

Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)

Topic models are commonly used to produce topics from a corpus that:

(1) coherently encapsulate semantic meaning
(2) describe the documents well

Topic models therefore aim to maximize topic coherence.

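Perplexity is an intrinsic language-modeling evaluation metric that measures the inverse of the geometric mean per-word likelihood on the test data.
- A lower perplexity score indicates better generalization performance.
  - According to research, the likelihood computed per word often does not align with human judgment and can be entirely uncorrelated with it, which is why topic coherence was introduced.
- Each topic inferred by the model consists of words, and topic coherence is computed over the top N words of a given topic.
- It is often defined as the average or median of the pairwise word-similarity scores of the words in that topic (for example, Pointwise Mutual Information (PMI)).
- A promising model generates coherent topics, that is, topics with high topic coherence scores.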
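
The two metrics can be sketched in a few lines of standard-library Python. This is only an illustration under simplifying assumptions: `perplexity` takes per-word likelihoods that some model has already produced, and `pmi_coherence` estimates probabilities from document-level co-occurrence and assumes every top word appears in at least one document. Both function names and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def perplexity(word_probs):
    """Inverse of the geometric mean of the per-word likelihoods."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))


def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    Probabilities are estimated as the fraction of documents that contain
    the word (or both words). eps avoids log(0) for pairs that never co-occur.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = [
        math.log((prob(w1, w2) + eps) / (prob(w1) * prob(w2)))
        for w1, w2 in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)


# A model that assigns probability 0.25 to every test word has perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

docs = [["cat", "dog"], ["cat", "dog"], ["stock", "bond"], ["stock", "bond"]]
coherent = pmi_coherence(["cat", "dog"], docs)      # words that co-occur
incoherent = pmi_coherence(["cat", "stock"], docs)  # words that never co-occur
print(coherent, incoherent)
```

Lower perplexity and higher coherence are better, but as noted above the two do not always agree.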

Input/Output Interface for the LDA Algorithm

Training supports both the recordIO-wrapped-protobuf and CSV file formats.
- With recordIO-wrapped-protobuf, LDA can be trained in File mode or Pipe mode.
- With CSV, LDA can be trained only in File mode. The data must be dense, with dimension equal to number of records * vocabulary size.

Inference supports the following content types:
- text/csv
- application/json
- application/x-recordio-protobuf

Sparse data can also be passed for application/json and application/x-recordio-protobuf.
LDA inference returns application/json or application/x-recordio-protobuf predictions, which include the topic_mixture vector for each observation.
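
As a rough illustration of these formats, the sketch below builds a dense text/csv payload (one row per document, one column per vocabulary word) and reads topic_mixture vectors out of a JSON response. The helper names are hypothetical, and the response string is hand-written to mimic the shape of the output shown later in this article; it does not come from a real endpoint.

```python
import csv
import io
import json


def to_dense_csv(documents):
    """Serialize word-count vectors as dense CSV, one document per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(documents)
    return buffer.getvalue()


def parse_topic_mixtures(response_body):
    """Return the topic_mixture vector of every prediction in a JSON body."""
    payload = json.loads(response_body)
    return [prediction["topic_mixture"] for prediction in payload["predictions"]]


# Two documents over a 3-word vocabulary (toy values).
docs = [[1, 0, 2], [0, 3, 0]]
print(to_dense_csv(docs))

# Hand-written stand-in for an application/json response.
fake_response = (
    '{"predictions": [{"topic_mixture": [0.25, 0.75]}, {"topic_mixture": [1.0, 0.0]}]}'
)
print(parse_topic_mixtures(fake_response))
```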

EC2 Instance Recommendation for the LDA Algorithm

LDA currently supports only single-instance CPU training.
CPU instances are recommended for hosting/inference.

Hands-on

Let's run the algorithm using the "An Introduction to SageMaker LDA" example notebook.

Setup

Download the generate_example_data module beforehand.
Import the libraries.

%matplotlib inline

import os, re

import boto3
import matplotlib.pyplot as plt
import numpy as np

np.set_printoptions(precision=3, suppress=True)

# some helpful utility functions are defined in the Python module
# "generate_example_data" located in the same directory as this
# notebook
from generate_example_data import generate_griffiths_data, plot_lda, match_estimated_topics

# accessing the SageMaker Python SDK
import sagemaker
from sagemaker.amazon.common import RecordSerializer
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

Setup AWS Credentials

Check the S3 bucket and the IAM role.

import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
bucket = session.default_bucket()
prefix = "sagemaker/DEMO-lda-introduction"

print("Training input/output will be stored in {}/{}".format(bucket, prefix))
print("\nIAM Role: {}".format(role))

Obtain Example Data

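Generate some example synthetic document data.
- In this particular example, the "vocabulary" contains 25 words in total.

Split the data into training and test sets and check their shapes.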

print("Generating example data...")
num_documents = 6000
num_topics = 5
known_alpha, known_beta, documents, topic_mixtures = generate_griffiths_data(
    num_documents=num_documents, num_topics=num_topics
)
vocabulary_size = len(documents[0])

# separate the generated data into training and tests subsets
num_documents_training = int(0.9 * num_documents)
num_documents_test = num_documents - num_documents_training

documents_training = documents[:num_documents_training]
documents_test = documents[num_documents_training:]

topic_mixtures_training = topic_mixtures[:num_documents_training]
topic_mixtures_test = topic_mixtures[num_documents_training:]

print("documents_training.shape = {}".format(documents_training.shape))
print("documents_test.shape = {}".format(documents_test.shape))

Inspect Example Data

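Below, we inspect the first training document and its known topic mixture.
- Each document is a vector of word counts over the 25-word vocabulary, and its topic mixture is a probability distribution across the 5 topics used to generate the sample dataset.
- Later, when we run inference, we will compare the inferred topic mixtures with these known ones.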
```
print("First training document =\n{}".format(documents[0]))
print("\nVocabulary size = {}".format(vocabulary_size))
```
```
print("Known topic mixture of first document =\n{}".format(topic_mixtures_training[0]))
print("\nNumber of topics = {}".format(num_topics))
print("Sum of elements = {}".format(topic_mixtures_training[0].sum()))
```

Plot the documents to check them.

%matplotlib inline

fig = plot_lda(documents_training, nrows=3, ncols=4, cmap="gray_r", with_colorbar=True)
fig.suptitle("Example Document Word Counts")
fig.set_dpi(160)

Store Data on S3

A SageMaker training job needs access to training data stored in an S3 bucket.
Training accepts data in several formats; here, before uploading to the S3 bucket, we convert the data to MXNet RecordIO Protobuf format with the SageMaker SDK's RecordSerializer.

The code below converts the data with recordio_protobuf_serializer.serialize and then uploads it to S3 as a file named lda.data.

# convert documents_training to Protobuf RecordIO format
recordio_protobuf_serializer = RecordSerializer()
fbuffer = recordio_protobuf_serializer.serialize(documents_training)

# upload to S3 in bucket/prefix/train
fname = "lda.data"
s3_object = os.path.join(prefix, "train", fname)
boto3.Session().resource("s3").Bucket(bucket).Object(s3_object).upload_fileobj(fbuffer)

s3_train_data = "s3://{}/{}".format(bucket, s3_object)
print("Uploaded data to S3: {}".format(s3_train_data))

Training

Specify the Docker container that contains the SageMaker LDA algorithm.

from sagemaker import image_uris

# select the algorithm container based on this notebook's current location
# (image_uris.retrieve replaces get_image_uri, which is deprecated in SageMaker Python SDK v2)

region_name = boto3.Session().region_name
container = image_uris.retrieve("lda", region_name)

print("Using SageMaker LDA container: {} ({})".format(container, region_name))

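Run the training.

The execution environment and hyperparameters are as follows.

Execution environment

| Setting | Description | Value |
| --- | --- | --- |
| instance_count (formerly train_instance_count) | Number of instances | 1 |
| instance_type (formerly train_instance_type) | Instance type | ml.c4.2xlarge |

Hyperparameters

| Setting | Description | Value |
| --- | --- | --- |
| num_topics | The number of topics or categories in the LDA model. | 5 |
| feature_dim | The size of the "vocabulary", in LDA terms. | 25 |
| mini_batch_size | The number of input training documents. | 5400 (6000 * 0.9) |
| alpha0 | (Optional) A measure of how "mixed" the topic mixtures are. When alpha0 is small, the data tends to be represented by one or a few topics. When alpha0 is large, the data tends to be an even combination of several or many topics. The default is alpha0 = 1.0. | 1.0 |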
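
The effect of alpha0 can be previewed without SageMaker by sampling topic mixtures from a Dirichlet distribution. The sketch below is only an illustration: it assumes a symmetric prior in which alpha0 is split evenly across the topics, and the function names are hypothetical. It reports the average weight of the dominant topic, which is high when mixtures are sparse and low when they are evenly mixed.

```python
import random


def sample_topic_mixture(alpha0, num_topics, rng):
    """Sample from a symmetric Dirichlet whose parameters sum to alpha0 (assumption)."""
    alpha = alpha0 / num_topics
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_weight(alpha0, num_topics=5, num_samples=2000, seed=0):
    """Average weight of the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(sample_topic_mixture(alpha0, num_topics, rng))
    return total / num_samples


for alpha0 in (0.5, 1.0, 100.0):
    # Smaller alpha0 -> one topic dominates; larger alpha0 -> weights even out.
    print(alpha0, round(mean_largest_weight(alpha0), 3))
```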

# specify general training job information
lda = sagemaker.estimator.Estimator(
    container,
    role,
    output_path="s3://{}/{}/output".format(bucket, prefix),
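    instance_count=1,  # named train_instance_count before SageMaker Python SDK v2
    instance_type="ml.c4.2xlarge",  # named train_instance_type before SageMaker Python SDK v2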
    sagemaker_session=session,
)

# set algorithm-specific hyperparameters
lda.set_hyperparameters(
    num_topics=num_topics,
    feature_dim=vocabulary_size,
    mini_batch_size=num_documents_training,
    alpha0=1.0,
)

# run the training job on input data stored in S3
lda.fit({"train": s3_train_data})

===
2023-01-01 10:45:57 Uploading - Uploading generated training model
2023-01-01 10:45:57 Completed - Training job completed
Training seconds: 97
Billable seconds: 97

Check the job name.

print("Training job name: {}".format(lda.latest_training_job.job_name))

===
Training job name: lda-2023-01-01-10-43-03-021

The job could also be confirmed in the AWS Management Console.

Inference

Create an inference endpoint with the deploy method.

- Instance count: 1
- Instance type: ml.m4.xlarge

lda_inference = lda.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",  # LDA inference may work better at scale on ml.c4 instances
)

Check the endpoint name.

print("Endpoint name: {}".format(lda_inference.endpoint_name))

Specify the serializer and deserializer.

lda_inference.serializer = CSVSerializer()
lda_inference.deserializer = JSONDeserializer()

Pass the test data to the inference endpoint.

results = lda_inference.predict(documents_test[:12])

print(results)

===
{'predictions': [{'topic_mixture': [0.2907695770263672, 0.7092304825782776, 0.0, 0.0, 0.0]}, {'topic_mixture': [0.5184915065765381, 0.0, 0.0, 0.10321851074695587, 0.37828999757766724]}, {'topic_mixture': [0.0, 0.0, 0.6388910412788391, 0.3611089289188385, 0.0]}, {'topic_mixture': [0.0, 0.13980618119239807, 0.0, 0.0, 0.8601938486099243]}, {'topic_mixture': [0.54769366979599, 0.4523063600063324, 0.0, 0.0, 0.0]}, {'topic_mixture': [0.43263429403305054, 0.0, 0.0, 0.5673657655715942, 0.0]}, {'topic_mixture': [0.0, 0.4107711613178253, 0.0, 0.05404851585626602, 0.535180389881134]}, {'topic_mixture': [0.05667471885681152, 0.6765181422233582, 0.0, 0.2668071687221527, 0.0]}, {'topic_mixture': [0.0, 0.0, 0.0, 0.047570109367370605, 0.9524299502372742]}, {'topic_mixture': [0.0, 0.30110329389572144, 0.6506293416023254, 0.0, 0.048267461359500885]}, {'topic_mixture': [0.0, 0.0, 0.1609993427991867, 0.02742798626422882, 0.8115726709365845]}, {'topic_mixture': [0.0, 0.8455498814582825, 0.0, 0.1544501632452011, 0.0]}]}

Extract the topic mixture for each input document.

computed_topic_mixtures = np.array(
    [prediction["topic_mixture"] for prediction in results["predictions"]]
)

print(computed_topic_mixtures)

===
[[0.291 0.709 0.    0.    0.   ]
 [0.518 0.    0.    0.103 0.378]
 [0.    0.    0.639 0.361 0.   ]
 [0.    0.14  0.    0.    0.86 ]
 [0.548 0.452 0.    0.    0.   ]
 [0.433 0.    0.    0.567 0.   ]
 [0.    0.411 0.    0.054 0.535]
 [0.057 0.677 0.    0.267 0.   ]
 [0.    0.    0.    0.048 0.952]
 [0.    0.301 0.651 0.    0.048]
 [0.    0.    0.161 0.027 0.812]
 [0.    0.846 0.    0.154 0.   ]]

Display the known topic mixture and the computed topic mixture.

print(topic_mixtures_test[0])  # known test topic mixture
print(computed_topic_mixtures[0])  # computed topic mixture (topics permuted)

===
[0.67  0.327 0.002 0.002 0.   ]
[0.291 0.709 0.    0.    0.   ]
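
The two vectors look different partly because the learned topics come out in an arbitrary order. As a rough sketch of how the ordering could be aligned, the code below brute-forces every topic permutation and keeps the one with the smallest total L1 distance between known and computed mixtures. This is a simplification for illustration: it is not the notebook's match_estimated_topics helper, the function names and toy vectors are hypothetical, and trying all permutations is only practical for a handful of topics.

```python
from itertools import permutations


def l1_distance(a, b):
    """Sum of absolute differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_topic_permutation(known, computed):
    """Find perm where perm[i] is the computed topic index matched to known topic i."""
    num_topics = len(known[0])
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(num_topics)):
        cost = sum(
            l1_distance(k, [c[j] for j in perm]) for k, c in zip(known, computed)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Toy mixtures in which the first two topics are swapped (illustrative values).
known = [[0.67, 0.33, 0.0], [0.0, 0.1, 0.9]]
computed = [[0.3, 0.7, 0.0], [0.1, 0.0, 0.9]]

perm, cost = best_topic_permutation(known, computed)
print("best permutation:", perm)
print("total L1 distance after reordering:", round(cost, 3))
```

With real results, `known` and `computed` would be topic_mixtures_test[:12] and computed_topic_mixtures converted to lists, and the remaining distance after reordering gives a rough sense of how well the mixtures were recovered.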

Stop / Close the Endpoint

Delete the endpoint.

sagemaker.Session().delete_endpoint(lda_inference.endpoint_name)
