
NLP4J [007]: Creating an Annotator that uses kuromoji

Posted at 2020-01-15, last updated at 2020-01-20

Switching between morphological analysis modules

By default (nlp4j-core), NLP4J uses the morphological analysis service of the Yahoo! Developer Network.

Text Analysis: Japanese Morphological Analysis - Yahoo! Developer Network
https://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html

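The Yahoo! Developer Network API is convenient because it can be called over HTTP, but it has a drawback: the number of requests is limited.
So I decided to create a library that uses kuromoji, which also runs locally.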

Creating the Annotator

This time I created nlp4j-kuromoji as a sub module of the nlp4j project.

nlp4j-kuromoji
https://github.com/oyahiroki/nlp4j/tree/master/nlp4j/nlp4j-kuromoji

The Maven configuration adds the dependencies needed to use kuromoji.

<!-- https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji -->
<dependency>
 <groupId>com.atilika.kuromoji</groupId>
 <artifactId>kuromoji</artifactId>
 <version>0.9.0</version>
 <type>pom</type>
</dependency>
<dependency>
 <groupId>com.atilika.kuromoji</groupId>
 <artifactId>kuromoji-ipadic</artifactId>
 <version>0.9.0</version>
</dependency>

Class Diagram

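The class diagram looks like this.
Both classes do much the same job as morphological analysis engines, so they are siblings.
Once the implementation is done, you no longer need to be aware of the differences, so this is probably the only time the kuromoji implementation details matter.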

@startuml
nlp4j.DocumentAnnotator <|-- YJpMaAnnotator
nlp4j.DocumentAnnotator <|-- KuromojiAnnotator 
@enduml
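
The sibling relationship in the diagram can be sketched without any NLP library. The following is only an illustration under stated assumptions: `SketchAnnotator`, `WhitespaceAnnotator`, and `CharacterAnnotator` are hypothetical stand-ins invented for this example, not NLP4J or kuromoji types. The point is that the calling code depends only on the shared interface, so the engine behind it can be swapped.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: caller code depends only on an interface, so the engine can be swapped. */
class AnnotatorSwapSketch {

    /** Hypothetical stand-in for nlp4j.DocumentAnnotator (NOT the real NLP4J type). */
    interface SketchAnnotator {
        List<String> annotate(String text);
    }

    /** Toy engine A (hypothetical): one keyword per whitespace-separated word. */
    static class WhitespaceAnnotator implements SketchAnnotator {
        @Override
        public List<String> annotate(String text) {
            List<String> result = new ArrayList<>();
            for (String part : text.trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    result.add(part);
                }
            }
            return result;
        }
    }

    /** Toy engine B (hypothetical): one keyword per non-whitespace character. */
    static class CharacterAnnotator implements SketchAnnotator {
        @Override
        public List<String> annotate(String text) {
            List<String> result = new ArrayList<>();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (!Character.isWhitespace(c)) {
                    result.add(String.valueOf(c));
                }
            }
            return result;
        }
    }

    /** Caller code: it knows the interface only, never the concrete engine. */
    static int countKeywords(SketchAnnotator annotator, String text) {
        return annotator.annotate(text).size();
    }

    public static void main(String[] args) {
        String text = "I went to school";
        // Only the constructor call differs between the two engines.
        System.out.println(countKeywords(new WhitespaceAnnotator(), text));
        System.out.println(countKeywords(new CharacterAnnotator(), text));
    }
}
```

In the real project the two siblings are YJpMaAnnotator and KuromojiAnnotator, and the shared type is nlp4j.DocumentAnnotator; the toy tokenizers here merely play their roles.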

Code

The class implements the nlp4j.DocumentAnnotator interface provided by NLP4J.
Each keyword (token) extracted by kuromoji is mapped to the keyword type that NLP4J provides.


package nlp4j.krmj.annotator;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import nlp4j.AbstractDocumentAnnotator;
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.impl.DefaultKeyword;

/**
 * Kuromoji Annotator
 * @author Hiroki Oya
 * @since 1.2
 */
public class KuromojiAnnotator extends AbstractDocumentAnnotator implements DocumentAnnotator {
	private static final Logger logger = LogManager.getLogger(KuromojiAnnotator.class);
	@Override
	public void annotate(Document doc) throws Exception {
		Tokenizer tokenizer = new Tokenizer(); // kuromoji tokenizer instance
		for (String target : targets) {
			Object obj = doc.getAttribute(target);
			if (!(obj instanceof String)) { // skips null and non-String attributes
				continue;
			}
			String text = (String) obj;
			List<Token> tokens = tokenizer.tokenize(text);
			int sequence = 1;
			for (Token token : tokens) {
				logger.debug(token.getAllFeatures());
				DefaultKeyword kwd = new DefaultKeyword(); // new keyword
				kwd.setLex(token.getBaseForm());
				kwd.setStr(token.getSurface());
				kwd.setReading(token.getReading());
				kwd.setBegin(token.getPosition());
				kwd.setEnd(token.getPosition() + token.getSurface().length());
				kwd.setFacet(token.getPartOfSpeechLevel1());
				kwd.setSequence(sequence);
				doc.addKeyword(kwd);
				sequence++;
			}
		}
	}
}

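Even for the same concept, the "base form" of a word, kuromoji calls it baseForm while NLP4J calls it lex, so you can see that the terminology differs slightly between the two.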

Usage

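Apart from changing the Annotator class, usage is the same as with the Yahoo! Developer Network annotator.
In effect, NLP4J wraps two separate natural language processing engines: kuromoji and the Yahoo! Developer Network service.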

	public void testAnnotateDocument001() throws Exception {
		// Natural-language text
		String text = "私は学校に行きました。";
		Document doc = new DefaultDocument();
		doc.putAttribute("text", text);
		KuromojiAnnotator annotator = new KuromojiAnnotator(); // change only this line to swap the module
		annotator.setProperty("target", "text");
		annotator.annotate(doc); // throws Exception
		System.err.println("Finished : annotation");
		for (Keyword kwd : doc.getKeywords()) {
			System.err.println(kwd);
		}
	}

Result

The result is shown below.
I was able to use the library without being aware of which natural language processing implementation sits underneath.

Finished : annotation
私 [sequence=1, facet=名詞, lex=私, str=私, reading=ワタシ, count=-1, begin=0, end=1, correlation=0.0]
は [sequence=2, facet=助詞, lex=は, str=は, reading=ハ, count=-1, begin=1, end=2, correlation=0.0]
学校 [sequence=3, facet=名詞, lex=学校, str=学校, reading=ガッコウ, count=-1, begin=2, end=4, correlation=0.0]
に [sequence=4, facet=助詞, lex=に, str=に, reading=ニ, count=-1, begin=4, end=5, correlation=0.0]
行く [sequence=5, facet=動詞, lex=行く, str=行き, reading=イキ, count=-1, begin=5, end=7, correlation=0.0]
ます [sequence=6, facet=助動詞, lex=ます, str=まし, reading=マシ, count=-1, begin=7, end=9, correlation=0.0]
た [sequence=7, facet=助動詞, lex=た, str=た, reading=タ, count=-1, begin=9, end=10, correlation=0.0]
。 [sequence=8, facet=記号, lex=。, str=。, reading=。, count=-1, begin=10, end=11, correlation=0.0]
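
One way to read the begin and end values in this output is as character offsets into the original text, with end exclusive (the annotator sets end to the token position plus the surface length). The following is a minimal sketch that checks this reading against the numbers printed above, using only the Java standard library. The class `KeywordSpanCheck` and its method `spanMatches` are hypothetical names introduced for this example and are not part of NLP4J.

```java
/** Sketch: checks that [begin, end) offsets reproduce each keyword's surface string. */
class KeywordSpanCheck {

    /** Returns true if text.substring(begin, end) equals the surface string str. */
    static boolean spanMatches(String text, int begin, int end, String str) {
        return begin >= 0
                && begin <= end
                && end <= text.length()
                && text.substring(begin, end).equals(str);
    }

    public static void main(String[] args) {
        String text = "私は学校に行きました。";
        // begin/end pairs and str values copied from the printed result
        int[][] spans = { {0, 1}, {1, 2}, {2, 4}, {4, 5}, {5, 7}, {7, 9}, {9, 10}, {10, 11} };
        String[] strs = { "私", "は", "学校", "に", "行き", "まし", "た", "。" };
        for (int i = 0; i < strs.length; i++) {
            boolean ok = spanMatches(text, spans[i][0], spans[i][1], strs[i]);
            System.out.println(strs[i] + " [" + spans[i][0] + ", " + spans[i][1] + ") matches: " + ok);
        }
    }
}
```

Note that the span corresponds to str (the surface form), not lex: for the verb, the text at offsets 5 to 7 is 行き, while lex holds the base form 行く.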

Summary

With NLP4J, natural language processing in Java is easy!

Project URL

https://www.nlp4j.org/
