
NLP4J [006-034b] 100 Language Processing Knocks #34 with NLP4J: writing an Annotator for "AのB"

Posted at 2020-01-14, last updated at 2020-01-20

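What needs improving

In NLP4J [006-034] (100 Language Processing Knocks #34, "AのB", with NLP4J), the code that extracts the "AのB" pattern (a noun, the particle の, then another noun) was written inline. This is the part in question: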

// Look for "AのB"
String meishi_a = null;
String no = null;

for (Keyword kwd : kwds) {
	if (meishi_a == null && kwd.getFacet().equals("名詞")) {
		meishi_a = kwd.getLex();
	} //
	else if (meishi_a != null && no == null && kwd.getLex().equals("の")) {
		no = kwd.getLex();
	} //
	else if (meishi_a != null && no != null && kwd.getFacet().equals("名詞")) {
		System.err.println(meishi_a + no + kwd.getLex());
		meishi_a = null;
		no = null;
	} //
	else {
		meishi_a = null;
		no = null;
	}
}

With keyword extraction (annotation) written inline like this, the logic cannot be reused.

Annotator

NLP4J therefore provides Annotators, a mechanism for adding your own annotations.
The mechanism is simple: you implement the interface nlp4j.DocumentAnnotator.

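Rewritten as an Annotator, the logic above looks like the following.
Rather than printing "AのB" as a string and stopping there, it adds each match as a new keyword.
So that this kind of keyword can be told apart from the others, it is given the identifier (that is, the facet) "word_nn_no_nn".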

package nlp4j.annotator;
import java.util.ArrayList;
import nlp4j.AbstractDocumentAnnotator;
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.Keyword;
import nlp4j.impl.DefaultKeyword;

/**
 * Extracts "noun の noun" sequences as keywords with the facet "word_nn_no_nn".
 * @author Hiroki Oya
 */
public class Nokku34Annotator extends AbstractDocumentAnnotator implements DocumentAnnotator {
	@Override
	public void annotate(Document doc) throws Exception {
		ArrayList<Keyword> newkwds = new ArrayList<>();
		Keyword meishi_a = null;
		Keyword no = null;
		for (Keyword kwd : doc.getKeywords()) {
			if (meishi_a == null && kwd.getFacet().equals("名詞")) {
				meishi_a = kwd;
			} //
			else if (meishi_a != null && no == null && kwd.getLex().equals("の")) {
				no = kwd;
			} //
			else if (meishi_a != null && no != null && kwd.getFacet().equals("名詞")) {
				// Build a new keyword for the match; the original tokens stay untouched
				Keyword kw = new DefaultKeyword();
				kw.setLex(meishi_a.getLex() + no.getLex() + kwd.getLex());
				kw.setFacet("word_nn_no_nn");
				kw.setBegin(meishi_a.getBegin());
				kw.setEnd(kwd.getEnd());
				kw.setStr(meishi_a.getStr() + no.getStr() + kwd.getStr());
				kw.setReading(meishi_a.getReading() + no.getReading() + kwd.getReading());
				newkwds.add(kw);
				meishi_a = null;
				no = null;
			} //
			else {
				meishi_a = null;
				no = null;
			}
		}
		doc.addKeywords(newkwds);
	}
}

The logic that extracts "AのB" keywords is now defined separately, in a class of its own.
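
To check the scan on its own, without NLP4J or the Yahoo! Japan API, the same three-state logic can be run over hand-written tokens. The following is a minimal sketch under stated assumptions: `Token` and `extract` are hypothetical stand-ins invented for illustration, not NLP4J classes, and the part-of-speech labels are typed in by hand rather than produced by a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NounNoNounSketch {

    /** Hypothetical stand-in for a token: surface form plus part-of-speech label. */
    static final class Token {
        final String lex;
        final String facet;

        Token(String lex, String facet) {
            this.lex = lex;
            this.facet = facet;
        }
    }

    /** Same scan as the annotator: a noun, then "の", then a noun. */
    static List<String> extract(List<Token> tokens) {
        List<String> found = new ArrayList<>();
        Token a = null;  // first noun, once seen
        Token no = null; // the particle "の", once seen after a noun
        for (Token t : tokens) {
            if (a == null && t.facet.equals("名詞")) {
                a = t;
            } else if (a != null && no == null && t.lex.equals("の")) {
                no = t;
            } else if (a != null && no != null && t.facet.equals("名詞")) {
                found.add(a.lex + no.lex + t.lex);
                a = null;
                no = null;
            } else {
                // Anything else breaks the pattern, so start over
                a = null;
                no = null;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hand-labelled tokens; a real run would get these from a morphological analyzer
        List<Token> tokens = Arrays.asList(
                new Token("書生", "名詞"),
                new Token("の", "助詞"),
                new Token("顔", "名詞"),
                new Token("を", "助詞"),
                new Token("見", "動詞"),
                new Token("た", "助動詞"));
        System.out.println(extract(tokens)); // prints [書生の顔]
    }
}
```

Because the state is reset after every match, a chain such as 彼 / の / 掌 / の / 上 yields only 彼の掌 in this sketch; the second noun is not reused as the start of the next match.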

Using the Annotator

package nlp4j.nokku.chap4;

import java.util.List;
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.DocumentAnnotatorPipeline;
import nlp4j.Keyword;
import nlp4j.crawler.Crawler;
import nlp4j.crawler.TextFileLineSeparatedCrawler;
import nlp4j.impl.DefaultDocumentAnnotatorPipeline;
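import nlp4j.annotator.Nokku34Annotator;
// Assumed package for YJpMaAnnotator; the original listing omits this import
import nlp4j.yhoo_jp.YJpMaAnnotator;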

public class Nokku34b {

	public static void main(String[] args) throws Exception {

		// Use the text-file crawler that NLP4J provides
		Crawler crawler = new TextFileLineSeparatedCrawler();
		crawler.setProperty("file", "src/test/resources/nlp4j.crawler/neko_short_utf8.txt");
		crawler.setProperty("encoding", "UTF-8");
		crawler.setProperty("target", "text");

		// Crawl the documents
		List<Document> docs = crawler.crawlDocuments();

		// Define the NLP pipeline (several processing steps chained together and run in order)
		DocumentAnnotatorPipeline pipeline = new DefaultDocumentAnnotatorPipeline();
		{
			// Annotator that uses the Yahoo! Japan morphological analysis API
			DocumentAnnotator annotator = new YJpMaAnnotator();
			pipeline.add(annotator);
		}
		{
			// Extract "noun の noun" sequences as "word_nn_no_nn" keywords
			Nokku34Annotator annotator = new Nokku34Annotator(); // <- this is all of exercise 34
			pipeline.add(annotator); // <- this is all of exercise 34
		}
		// Run the annotation
		pipeline.annotate(docs);

		for (Document doc : docs) {
			for (Keyword kwd : doc.getKeywords("word_nn_no_nn")) {
				System.err.println(kwd.getStr());
			}
		}
	}
}

Extracting "AのB" now takes just two lines!

 // Extract "noun の noun" sequences as "word_nn_no_nn" keywords
 Nokku34Annotator annotator = new Nokku34Annotator();
 pipeline.add(annotator);

Done this way, you can define as many Annotators of your own as you like and keep extending your natural language processing.
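
This composes because each step reads the document and appends keywords that later steps can build on. A minimal sketch of that idea follows. It uses hypothetical stand-in types (`Doc`, `Kw`, `Annotator`, `Pipeline`) invented for illustration rather than the real NLP4J classes, whose internals are not shown in this article, and the two steps are toy examples working on English text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PipelineSketch {

    /** Hypothetical stand-in for a keyword: a facet (type label) and a surface string. */
    static final class Kw {
        final String facet;
        final String str;

        Kw(String facet, String str) {
            this.facet = facet;
            this.str = str;
        }
    }

    /** Hypothetical stand-in for a document: raw text plus the keywords added so far. */
    static final class Doc {
        final String text;
        final List<Kw> keywords = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        /** Returns a new list holding only the keywords with the given facet. */
        List<Kw> byFacet(String facet) {
            List<Kw> out = new ArrayList<>();
            for (Kw k : keywords) {
                if (k.facet.equals(facet)) {
                    out.add(k);
                }
            }
            return out;
        }
    }

    /** Stand-in for the annotator contract: read the document, add keywords to it. */
    interface Annotator {
        void annotate(Doc doc);
    }

    /** Runs every annotator on every document, in the order the annotators were added. */
    static final class Pipeline {
        private final List<Annotator> steps = new ArrayList<>();

        void add(Annotator a) {
            steps.add(a);
        }

        void annotate(List<Doc> docs) {
            for (Doc d : docs) {
                for (Annotator a : steps) {
                    a.annotate(d);
                }
            }
        }
    }

    static Pipeline buildPipeline() {
        Pipeline pipeline = new Pipeline();
        // Step 1: split on whitespace and record each piece as a "token" keyword
        pipeline.add(doc -> {
            for (String s : doc.text.trim().split("\\s+")) {
                if (!s.isEmpty()) {
                    doc.keywords.add(new Kw("token", s));
                }
            }
        });
        // Step 2: build on step 1's output and tag tokens that start with an upper-case letter
        pipeline.add(doc -> {
            List<Kw> added = new ArrayList<>();
            for (Kw k : doc.byFacet("token")) {
                if (Character.isUpperCase(k.str.charAt(0))) {
                    added.add(new Kw("capitalized", k.str));
                }
            }
            doc.keywords.addAll(added);
        });
        return pipeline;
    }

    public static void main(String[] args) {
        Doc doc = new Doc("I am a Cat");
        buildPipeline().annotate(Collections.singletonList(doc));
        for (Kw k : doc.byFacet("capitalized")) {
            System.out.println(k.str); // prints I, then Cat
        }
    }
}
```

The second step never looks at the raw text. It depends only on the "token" keywords, in the same way that Nokku34Annotator depends only on the keywords produced by the morphological analyzer that runs before it.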

Result

彼の掌
掌の上
書生の顔
はずの顔
顔の真中
穴の中

Summary

With NLP4J, natural language processing in Java is easy!

Project URL

https://www.nlp4j.org/

Back to the index
