
Morphological analysis in Java with Lucene Kuromoji + mecab-ipadic-neologd

Last updated at 2016-04-29, posted at 2016-03-29

If you want to use the neologd dictionary from Java, the quickest route seems to be the build that is published in a Maven repository.

build.gradle

dependencies {
    compile("org.codelibs:lucene-analyzers-kuromoji-ipadic-neologd:5.4.1-20160218")
}

repositories {
    mavenCentral()
    maven {
        url "http://maven.codelibs.org"
    }
}

KuromojiNeologdTest.java

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseAnalyzer;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.dict.UserDictionary;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.ReadingAttribute;

public class KuromojiNeologdTest {

	public static void main(String[] args) {
		String[] sentences = new String[] {
				"すもももももももものうち",
				"きゃりーぱみゅぱみゅもゲスの極み乙女。もモーニング娘。も問題なく分割できます。"
		};
		for (String sentence : sentences) {
			System.out.println("* " + sentence);
			tokenize(sentence);
			System.out.println("");
		}
	}

	private static void tokenize(Reader reader) {
		UserDictionary userDict = null;
		Mode mode = JapaneseTokenizer.Mode.NORMAL;
		CharArraySet stopSet = JapaneseAnalyzer.getDefaultStopSet();
		Set<String> stopTags = JapaneseAnalyzer.getDefaultStopTags();

		try (JapaneseAnalyzer analyzer = new JapaneseAnalyzer(userDict, mode, stopSet, stopTags);
			TokenStream tokenStream = analyzer.tokenStream("", reader) ) {
			
			BaseFormAttribute baseAttr = tokenStream.addAttribute(BaseFormAttribute.class);
			CharTermAttribute charAttr = tokenStream.addAttribute(CharTermAttribute.class);
			PartOfSpeechAttribute posAttr = tokenStream.addAttribute(PartOfSpeechAttribute.class);
			ReadingAttribute readAttr = tokenStream.addAttribute(ReadingAttribute.class);

			tokenStream.reset();
			while (tokenStream.incrementToken()) {
				String text = charAttr.toString();               // term text (word)
				String baseForm = baseAttr.getBaseForm();        // base form (null if the token is not inflected)
				String reading = readAttr.getReading();          // reading (katakana)
				String partOfSpeech = posAttr.getPartOfSpeech(); // part of speech

				System.out.println(text + "\t|\t" + baseForm + "\t|\t" + reading + "\t|\t" + partOfSpeech);
			}
			tokenStream.end(); // complete end-of-stream processing before the stream is closed
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	private static void tokenize(String src) {
		tokenize(new StringReader(src));
	}
}
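
A side note on the listing above: `getBaseForm()` returns `null` for tokens that are not inflected, so the base-form column is frequently the literal text `null`. If you would rather fall back to the term text in that case, something along the following lines might work. This is only a sketch using the standard library; the class and method names (`BaseFormFallback`, `baseFormOrSurface`) are made up for illustration and are not part of Lucene or the CodeLibs artifact, and the second sample value is hypothetical.

```java
import java.util.Objects;

public class BaseFormFallback {

    /**
     * Returns the base form when one is available, otherwise the surface form.
     * Hypothetical helper: in the tokenize loop the arguments would come from
     * charAttr.toString() and baseAttr.getBaseForm().
     */
    public static String baseFormOrSurface(String surface, String baseForm) {
        Objects.requireNonNull(surface, "surface");
        return baseForm != null ? baseForm : surface;
    }

    public static void main(String[] args) {
        // A noun has no inflection, so the base form is null and the surface form is used.
        System.out.println(baseFormOrSurface("問題", null));
        // Hypothetical inflected token, for illustration only: the base form wins.
        System.out.println(baseFormOrSurface("でき", "できる"));
    }
}
```

In the loop, the call would be `baseFormOrSurface(text, baseForm)` right before printing.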

Output

* すもももももももものうち
すもももももももものうち	|	null	|	スモモモモモモモモノウチ	|	名詞-固有名詞-一般

* きゃりーぱみゅぱみゅもゲスの極み乙女。もモーニング娘。も問題なく分割できます。
きゃりーぱみゅぱみゅ	|	null	|	キャリーパミュパミュ	|	名詞-固有名詞-一般
ゲスの極み乙女。	|	null	|	ゲスノキワミオトメ	|	名詞-固有名詞-一般
モーニング娘。	|	null	|	モーニングムスメ	|	名詞-固有名詞-一般
問題	|	null	|	モンダイ	|	名詞-ナイ形容詞語幹
分割	|	null	|	ブンカツ	|	名詞-サ変接続

If you want to use the latest version of the dictionary, build it yourself with the following script.

kazuhira-r/kuromoji-with-mecab-neologd-buildscript: This script to build a Lucene and Atilika Kuromoji with bundled mecab-xxxxx-neologd

References

Using CodeLibs Lucene Kuromoji + mecab-ipadic-NEologd
