
Machine Learning Advent Calendar 2014

For Natural Language Processing in English: Stanford CoreNLP

Posted at 2014-12-10, last updated at 2014-12-10

I'm 中路 from SmartNews. This is the day-10 entry of the [Machine Learning Advent Calendar](http://qiita.com/advent-calendar/2014/machinelearning).

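Introduction

Natural language processing is one of the most important applications of machine learning, and we use NLP techniques throughout our systems at SmartNews. If your aim is to study natural language processing, then learning the algorithms from scratch and preparing your own training data is a fine approach. But if you want to build an application and just try things out quickly, preparing everything from scratch is a chore.

That is why libraries that are easy to try are such a blessing.

Our server-side systems, like many systems in the world, are written in Java, whereas the NLP world has plenty of tools written in C++ and Python. The moment you think "oh well, I'll just call one as an external program from Java",

Have you properly measured the overhead of calling an external program?
You have the lovely JVM, and you are going to install an environment-dependent external library on your servers? Is your Chef setup ready?
You are not seriously thinking of using JNI, are you?

a barrage of missiles like these comes flying out of your own head. So you want everything to stay inside Java; that is the earnest wish of anyone who builds services in Java.

For that reason, this entry narrows its theme to "natural language processing in Java".

As for Japanese, plenty of entries written in Japanese already exist, so I will only leave a pointer to the morphological analysis library kuromoji and focus on Stanford CoreNLP, a library for natural language processing in English.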

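Goal of this entry

Below is a sentence taken from a yahoo.com article, tagged with person names, place names, and so on. Red marks person names and blue marks place names.

This is a well-known NLP task called named entity recognition, and the goal of this entry is to perform it with Stanford CoreNLP.

Just try it

Step1 Download the library

Happily, it is on Maven Central, so if you use Maven you only need to add the following to your pom file. Change the version as appropriate.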

<dependencies>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.0</version>
    <classifier>models</classifier>
</dependency>
</dependencies>

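The models artifact is about 200 MB, large enough to make you wonder what on earth is going on, but it contains the trained classifiers, so it can't be helped.

Step2 Use it

Part-of-speech tagging

First, let's try part-of-speech tagging, which labels each word of a text with a tag such as "noun" or "verb".

Code to copy and paste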


import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

/**
 * POS tagging sample
 *
 */
public class PosTaggingSample {
	public static void main(String[] args) {
		String text = "I ask everyone to not rush to judgment and allow the investigation to be completed,"
				+ " Orange County Sheriff Jerry Demings said at a news conference in Orlando.";

		// Run only the annotators needed for POS tagging.
		Properties properties = new Properties();
		properties.setProperty("annotators", "tokenize, ssplit, pos");
		StanfordCoreNLP coreNLP = new StanfordCoreNLP(properties);
		Annotation annotation = new Annotation(text);
		coreNLP.annotate(annotation);
		List<CoreLabel> labels = annotation.get(TokensAnnotation.class);
		for (CoreLabel label : labels) {
			// Print each token with its part-of-speech tag.
			System.out.println(label.get(TextAnnotation.class) + "\t"
					+ label.get(PartOfSpeechAnnotation.class));
		}
	}
}

Result

I	PRP
ask	VBP
everyone	NN
to	TO
not	RB
rush	VB
to	TO
judgment	NN
and	CC
allow	VB
the	DT
investigation	NN
to	TO
be	VB
completed	VBN
,	,
Orange	NNP
County	NNP
Sheriff	NNP
Jerry	NNP
Demings	NNP
said	VBD
at	IN
a	DT
news	NN
conference	NN
in	IN
Orlando	NNP
.	.

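PRP? VBP? You might well ask, but each one is a part-of-speech tag name. CoreNLP's English tagger outputs Penn Treebank tags. The Part-of-speech tagging section of Wikipedia's Brown Corpus page carries a related tag list, in which basic tags such as NN and VB have the same meaning.

Here is a short excerpt.

POS tag	Description	Roughly speaking
NN	singular or mass noun	singular noun
VB	verb, base form	base-form verb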
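When the fine-grained tags are more detail than an application needs, they can be collapsed into coarse groups. The following is a minimal sketch, assuming Penn Treebank tags, where noun tags begin with NN, verb tags with VB, adjective tags with JJ, and adverb tags with RB. The class `CoarseTags` and its methods are hypothetical helpers for illustration and are not part of CoreNLP.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of CoreNLP): groups Penn Treebank tags by prefix. */
final class CoarseTags {

    /** Maps a fine-grained tag such as NNP or VBD to a coarse category name. */
    static String coarse(String tag) {
        if (tag.startsWith("NN")) return "noun";       // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return "verb";       // VB, VBD, VBG, VBN, VBP, VBZ
        if (tag.startsWith("JJ")) return "adjective";  // JJ, JJR, JJS
        if (tag.startsWith("RB")) return "adverb";     // RB, RBR, RBS
        return "other";
    }

    /** Counts how many tags fall into each coarse category, in first-seen order. */
    static Map<String, Integer> countByCoarse(List<String> tags) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tag : tags) {
            counts.merge(coarse(tag), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few tags copied from the result above.
        List<String> tags = Arrays.asList("PRP", "VBP", "NN", "TO", "RB", "VB", "NNP", "VBD");
        System.out.println(countByCoarse(tags));
    }
}
```

In a real program the tag strings would come from `label.get(PartOfSpeechAnnotation.class)` rather than from a hard-coded list.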

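Named entity recognition

Now for the goal of this entry: named entity recognition, which picks out person names, place names, and the like.

Code to copy and paste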

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

/**
 * NER sample
 *
 */
public class NERSample {
	public static void main(String[] args) {
		String text = "I ask everyone to not rush to judgment and allow the investigation to be completed,"
				+ " Orange County Sheriff Jerry Demings said at a news conference in Orlando.";

		Properties properties = new Properties();
		properties.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
		StanfordCoreNLP coreNLP = new StanfordCoreNLP(properties);
		Annotation annotation = new Annotation(text);
		coreNLP.annotate(annotation);
		List<CoreLabel> labels = annotation.get(TokensAnnotation.class);
		for (CoreLabel label : labels) {
			System.out.println(label.get(TextAnnotation.class) + "\t"
					+ label.get(NamedEntityTagAnnotation.class));
		}
	}
}

Result

I	O
ask	O
everyone	O
to	O
not	O
rush	O
to	O
judgment	O
and	O
allow	O
the	O
investigation	O
to	O
be	O
completed	O
,	O
Orange	ORGANIZATION
County	ORGANIZATION
Sheriff	O
Jerry	PERSON
Demings	PERSON
said	O
at	O
a	O
news	O
conference	O
in	O
Orlando	LOCATION
.	O

It worked! "Jerry Demings" is tagged "PERSON" and "Orlando" is tagged "LOCATION".
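The output carries one tag per token, so a multi-word name such as "Jerry Demings" arrives as two separate rows. A minimal sketch of joining adjacent tokens that share a tag is shown below. `EntitySpans` and `Span` are hypothetical names introduced for illustration, not CoreNLP classes, and the input lists are hard-coded from the result above instead of being read from a pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of CoreNLP): merges adjacent tokens that share an NER tag. */
final class EntitySpans {

    /** One merged entity: the joined token text and its tag. */
    static final class Span {
        final String text;
        final String tag;

        Span(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return text + "/" + tag;
        }
    }

    /** Walks the tokens left to right and joins runs of identical tags other than "O". */
    static List<Span> merge(List<String> tokens, List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<Span> spans = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed: close the run in progress, if there is one.
                if (!"O".equals(currentTag)) {
                    spans.add(new Span(current.toString(), currentTag));
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens.get(i));
            }
        }
        // Close a run that reaches the end of the input.
        if (!"O".equals(currentTag)) {
            spans.add(new Span(current.toString(), currentTag));
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "Orange", "County", "Sheriff", "Jerry", "Demings", "said", "in", "Orlando", ".");
        List<String> tags = Arrays.asList(
                "ORGANIZATION", "ORGANIZATION", "O", "PERSON", "PERSON", "O", "O", "LOCATION", "O");
        System.out.println(merge(tokens, tags));
    }
}
```

One limitation of this simple approach: two different entities of the same type that sit directly next to each other are merged into one span, because the per-token tags alone do not mark where one entity ends and the next begins.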

A look at how CoreNLP is implemented

Here is a brief overview of how Stanford CoreNLP is implemented.

To sum it up in one sentence, the line

Annotation annotation = new Annotation(text);

initializes "the text turned into a Map", and then

coreNLP.annotate(annotation);

puts Objects into it one after another. That is the whole mechanism.

The so-called "annotators", which are responsible for putting the Objects, are specified in

Properties properties = new Properties();
properties.setProperty("annotators", "tokenize, ssplit, pos");
StanfordCoreNLP coreNLP = new StanfordCoreNLP(properties);

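(in this example: tokenize, ssplit, pos). For the available annotators, see the official Stanford CoreNLP page.

Example: Part of Speech Tagging

Here is an example of Objects being put into "the text turned into a Map". I say Map, but if annotation were an ordinary Map you would end up with a Map<String, Object>, an object that makes enemies of the whole world. CoreNLP neatly avoids this by using CoreMap, a special "Map-like thing".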

step1

Annotation annotation = new Annotation(text);

Right after this initialization, annotation is a Map holding
(key, value) = (TextAnnotation.class, "I love you")

step2

Once tokenize has finished, annotation gains a new entry like
(key, value) = (TokensAnnotation.class, [Map, Map, Map, Map])
whose value is a list of Maps. Each Map in the list has TextAnnotation as a key.

step3

Once pos has finished, each Map in the TokensAnnotation list gains an entry like
(key, value) = (PartOfSpeechAnnotation.class, POS tag name)
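The three steps can be imitated with a toy model. This is only a sketch of the data flow, not CoreNLP code: plain string keys stand in for the annotation classes to keep it short, a whitespace split stands in for the real tokenizer, and a lookup table (`lexicon`, a hypothetical parameter) stands in for the trained tagger.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of step1 to step3; every name here is hypothetical. */
final class ToyPipeline {

    /** step1: the document starts out holding only its text. */
    static Map<String, Object> newDocument(String text) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("Text", text);
        return doc;
    }

    /** step2: split on whitespace and add one small map per token. */
    static void tokenize(Map<String, Object> doc) {
        List<Map<String, Object>> tokens = new ArrayList<>();
        for (String word : ((String) doc.get("Text")).trim().split("\\s+")) {
            Map<String, Object> token = new LinkedHashMap<>();
            token.put("Text", word);
            tokens.add(token);
        }
        doc.put("Tokens", tokens);
    }

    /** step3: add a tag to every token map, defaulting to NN for unknown words. */
    @SuppressWarnings("unchecked")
    static void tagPartsOfSpeech(Map<String, Object> doc, Map<String, String> lexicon) {
        List<Map<String, Object>> tokens = (List<Map<String, Object>>) doc.get("Tokens");
        if (tokens == null) {
            // Mirrors the ordering constraint: tagging needs the tokenizer's output.
            throw new IllegalStateException("tokenize must run before tagging");
        }
        for (Map<String, Object> token : tokens) {
            String word = (String) token.get("Text");
            token.put("PartOfSpeech", lexicon.getOrDefault(word, "NN"));
        }
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("I", "PRP");
        lexicon.put("love", "VBP");
        lexicon.put("you", "PRP");

        Map<String, Object> doc = newDocument("I love you");
        System.out.println(doc);   // after step1
        tokenize(doc);
        System.out.println(doc);   // after step2
        tagPartsOfSpeech(doc, lexicon);
        System.out.println(doc);   // after step3
    }
}
```

The check in `tagPartsOfSpeech` illustrates why the order of names in the "annotators" property matters: a later stage reads what an earlier stage has put.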

Getting the values out

An annotation instance that has been through annotate is a set of nested Maps, so you can get the part-of-speech tag you want by calling get step by step, as follows.

List<CoreLabel> labels = annotation.get(TokensAnnotation.class);
for (CoreLabel label : labels) {
	System.out.println(label.get(TextAnnotation.class) + "\t"
			+ label.get(PartOfSpeechAnnotation.class));
}
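Notice that the get calls return a List<CoreLabel> or a String without any cast, even though the values live in one Map-like object. The idea that makes this possible is using a class as the key and letting the key's type parameter record the value type. The sketch below is a toy version of that idea, loosely modeled on CoreMap and not CoreNLP's actual code; `TypedMap`, `Key`, `TextKey`, and `TokenCountKey` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy typed map illustrating class-as-key lookups. Not CoreNLP code. */
final class TypedMapSketch {

    /** A key type; its type parameter V records the type of the value stored under it. */
    interface Key<V> { }

    // Hypothetical keys for illustration only.
    static final class TextKey implements Key<String> { }
    static final class TokenCountKey implements Key<Integer> { }

    static final class TypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        /** The compiler rejects a value whose type does not match the key. */
        <V> void set(Class<? extends Key<V>> key, V value) {
            values.put(key, value);
        }

        /** The single unchecked cast is safe because set only stores matching values. */
        @SuppressWarnings("unchecked")
        <V> V get(Class<? extends Key<V>> key) {
            return (V) values.get(key);
        }
    }

    public static void main(String[] args) {
        TypedMap map = new TypedMap();
        map.set(TextKey.class, "I love you");
        map.set(TokenCountKey.class, 3);

        // No casts at the call site: the key class determines the result type.
        String text = map.get(TextKey.class);
        Integer count = map.get(TokenCountKey.class);
        System.out.println(text + " has " + count + " tokens");
    }
}
```

With this design, a call such as `map.set(TokenCountKey.class, "three")` fails to compile, which is the kind of mistake a Map<String, Object> would only reveal at run time.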

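Summary

I wrote about Stanford CoreNLP, a Java library that handles a variety of NLP tasks that would take a great deal of effort if you implemented everything from scratch, prepared the training data, and trained the models yourself. This library alone covers a good share of what you are likely to need, so if you want to do Java x English x NLP, I suggest trying it first and implementing things yourself only where it falls short.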
