More than 5 years have passed since last update.

[機械学習] IBM Watsonを用いて（日本語）自然言語のクラス分類を行ってみた

Last updated at 2015-09-18Posted at 2015-08-27

IBM Watsonとは

IBMによると：

Watson(ワトソン)は、コンピューターでありながら、人と同じように情報から学び、経験から学習するコグニティブ・テクノロジーです

今回は IBM Watson APIシリーズのNatural Language Classifier(NLC)を用いて、自然言語処理（クラス分類）を行ってみた

NLCとは

下図のような、16種類あるWatsonAPIのうちの１つ。Deep Learningを用いた自然言語のクラス分類を行う。
文章を入力すると、その文章のカテゴリ（class）が返ってくる。

たとえばユーザーからの問い合わせ電話を認識し、カテゴライズすることでカテゴリに対して適切な担当者につなげることができる
短い文章、フレーズに対して独自に事前定義されたキーワードをスコアをつけて返す
IBM Watson は、事前に用意された応答を使用するのではなく、これまでに習得した知識に基づいて、応答とその応答に関連する信頼スコアを決定する、自然言語質問応答システムである。したがって、Watson で実行されるアプリケーションは、単純なデータ処理の枠を超え、相関関係を検出し、仮説を作成し、結果から学習することができる

公式デモ

どんなものかまず動かしてみたいなら、IBM WatsonDeveloper Cloudにsign upし、公式デモを実行すればよい。

NLCを使う準備

NLC登録

kIBM Bluemixに入る

カタログからNLCを選択

スペース、アプリ、サービス名、資格情報名を決め、使用ボタンをクリック
APIに接続する際に使用するので、サービス資格情報(json）をコピーして持っておく

j### データセット
サンプルcsvをダウンロード
下記curlコマンドでトレーニングデータを登録し、classifierを作成する

curl -i -u "<username>":"<password>" \
-F training_data=@<path_to_file>/weather_data_train.csv \
-F training_metadata="{\"language\":\"en\",\"name\":\"TutorialClassifier\"}" \
"https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"

サンプルレスポンスは下記のようになる

{
  "name": "TutorialClassifier",
  "language": "en",
  "status": "Training",
  "url": "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/10D41B-nlc-1",
  "classifier_id": "10D41B-nlc-1",
  "created": "2015-05-28T18:01:57.393Z",
  "status_description": "The classifier instance is in its training phase, not yet ready to accept classify requests"
}

上記コマンドを投げたらすぐに学習はスタートする。学習終了後にclassifierに向けてクエリを打てるようになる
学習が終了したかの確認は下記curlコマンドで可能。

curl -u "<username>":"<password>" \
"https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/<classifier_id>"

statusが下記のようになれば準備完了。今回用いたcsvファイルを学習するのにはだいたい１分ほど要した。

{
  "status":"Available",
  "status_description":"The classifier instance is now available and is ready to take classifier requests."
}

training中は下記のようなレスポンスが返ってくる

{
  "status": "Training",
  "status_description": "The classifier instance is in its training phase, not yet ready to accept classify requests",
}

テキストの自動分類

早速、トレーニングデータにを学習したwatsonに、問題を投げてみる
"How hot will it be today?" と聞いてみて、"気温"とカテゴリ分類されることを祈る

curl -G -u "<username>":"<password>" \
"https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/<classifier_id>/classify" \
--data-urlencode "text=How hot will it be today?"

responseは以下の通り

{
  "classifier_id": "10D41B-nlc-1",
  "url": "https://gateway.watsonplatform.net/natural-language-classifier/api/v1",
  "text": "How hot will it be today?",
  "top_class": "temperature",
  "classes": [
    {
      "class_name": "temperature",
      "confidence": 0.9998201258549781
    },
    {
      "class_name": "conditions",
      "confidence": 0.00017987414502176904
    }
  ]
}

狙い通り、しっかりカテゴリ分類が行われた。なお、クラス分類の上限は10クラスである
ここで嬉しいことに、公式ドキュメントに以下の文章が載ってあった

One of the sample questions includes a term that the classifier isn't trained on: "foggy." The classifier can score well with these terms without your having to do extra work to identify them. Try other questions that include words that are not in the training data, for example, "sleet" or "storm."

つまり "foggy" を学習させなくても、近しい言葉が学習されていればクラス分類が可能らしい。実際に試してみたレスポンスは以下のとおり

{
  "text": "Is it foggy today?",
  "top_class": "conditions",
  "classes":[
    {
      "class_name": "conditions",
      "confidence": 0.981897175307704
    },
    {
      "class_name": "temperature",
      "confidence": 0.018102824692296068
    }
  ]
}

たしかに、学習させなくともしっかりと推測できている。すでにwatsonの中間層でキーワード同士の関係性が十分に学習されていると推測できる

自分で学習データを作るとき

注意事項

UTF-8でencodeする
コンマ, \t, \n, \r, などはダブルクオーテーションで囲む
英語なら[A-Z, a-z, 0-9]、アンダースコア、ダッシュのみ値として入れられる
学習データは5行以上、10,000行未満
テキストは最大1024文字

”良い”学習のため

文字量は60文字未満が良い
クラス数は数百までとしている（今後update予定）
1クラスに対して、最低でも5-10行のテキストは必要
1テキストに対して多クラスを設定することは可能。特にテキストが曖昧な場合1つのクラスを特定することは困難であるため、多クラスを登録することは悪くない。ただし、多くのテキストが多クラスを持つと逆に分類が困難になることにも留意が必要

GA版とbeta版の違いについて

2015/08/27現在で、NCLはGAとなっている。beta版からGA版にアップロードしたことで、主に以下の変更が行われた。

学習データのJSONサポートは終了しCSVのみ受け付けるようになった。
wordあたり45-characterの制限がなくなった
テキストの最大量は1024となった

API routing

POST /v1/classifiers
GET  /v1/classifiers
POST /v1/classifiers/{classifier_id}/classify
GET  /v1/classifiers/{classifier_id}/classify
DELETE /v1/classifiers/{classifier_id}
GET  /v1/classifiers/{classifier_id}

API example request

csv登録

curl -u "<username>":"<password>"
    -F training_data=@train.csv 
    -F training_metadata="{\"language\":\"ja\",\"name\":\"My Classifier\"}"
    "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"

status確認

curl -u "<username>":"<password>" "https://gateway.watsonplatform.net/natural-language-classifier
    /api/v1/classifiers/6C76AF-nlc-43"

classify

curl -X POST -u "<username>":"<password>" -H "Content-Type:application/json" -d "{\"text\":\"分類したいテキスト入力\"}" 
    "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/6C76AF-nlc-43/classify"

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up