Overview
Several morphological analyzers exist for Korean; of the ones I looked at, "open-korean-text" seemed like a good choice.
Maven
<dependency>
    <groupId>org.openkoreantext</groupId>
    <artifactId>open-korean-text</artifactId>
    <version>2.1.0</version>
</dependency>
Sample code
package hello;

import java.util.List;

import org.openkoreantext.processor.KoreanTokenJava;
import org.openkoreantext.processor.OpenKoreanTextProcessorJava;
import org.openkoreantext.processor.tokenizer.KoreanTokenizer;

import scala.collection.Seq;

public class HelloOpenKoreanTextMain2 {
    public static void main(String[] args) {
        // "The weather was nice today, so I walked to school."
        String text = "오늘은 날씨가 좋아서 걸어서 학교에 갔다.";

        // Normalize the input before tokenizing
        CharSequence normalized = OpenKoreanTextProcessorJava.normalize(text);

        // Tokenize (returns a Scala Seq), then convert to a Java list
        Seq<KoreanTokenizer.KoreanToken> tokens = OpenKoreanTextProcessorJava.tokenize(normalized);
        List<KoreanTokenJava> kk = OpenKoreanTextProcessorJava.tokensToJavaKoreanTokenList(tokens);

        for (KoreanTokenJava k : kk) {
            System.out.println("begin: " + k.getOffset());
            System.out.println("end: " + (k.getOffset() + k.getLength())); // exclusive end offset
            System.out.println("length: " + k.getLength());
            System.out.println("lex: " + k.getStem()); // base (dictionary) form
            System.out.println("str: " + k.getText()); // surface form
            // Part of speech, e.g. Josa (particle), Noun, Adjective, Verb, Punctuation
            System.out.println("pos: " + k.getPos().name());
            System.out.println("isUnknown: " + k.isUnknown());
            System.out.println("---");
        }
    }
}
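The begin and end values printed by the sample are derived from getOffset() and getLength(). The sketch below illustrates how such a pair can be used to cut a token back out of the text, assuming the offsets count Java chars in the normalized string that was passed to tokenize (not necessarily the raw input, if normalization changed it). It uses only the standard library: the class OffsetSliceSketch and the helper slice are hypothetical names, and the offset is looked up with indexOf as a stand-in for a value that would normally come from a token.

```java
class OffsetSliceSketch {
    // Hypothetical helper: returns the text covered by a token, given its
    // offset and length. The end offset is exclusive, matching the
    // begin/end arithmetic in the sample code.
    static String slice(CharSequence normalized, int offset, int length) {
        int begin = offset;
        int end = offset + length;
        return normalized.subSequence(begin, end).toString();
    }

    public static void main(String[] args) {
        String text = "오늘은 날씨가 좋아서 걸어서 학교에 갔다.";

        // Stand-in values: in real use these would come from
        // KoreanTokenJava.getOffset() and getLength().
        int offset = text.indexOf("좋아서");
        int length = "좋아서".length();

        System.out.println("begin: " + offset);
        System.out.println("end: " + (offset + length));
        System.out.println("str: " + slice(text, offset, length));
    }
}
```

Each Hangul syllable in this sentence is a single Java char, so the offsets line up with what a reader would count by eye; text containing characters outside the Basic Multilingual Plane would need more care.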
Results
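The output of the morphological analysis is shown below. Like Japanese, Korean conjugates its verbs and adjectives, and the analyzer returns base forms for the conjugated words: 갔다 ("went") yields 가다 ("to go"), and 좋아서 yields 좋다. One caveat: 걸어서 is lemmatized as 걸다, although in this sentence it is a form of the ㄷ-irregular verb 걷다 ("to walk"). The two verbs share the surface form 걸어서, so base forms of ambiguous conjugations are worth checking. In this output, lex is empty for nouns, particles, and punctuation.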
begin: 0
end: 2
length: 2
lex:
str: 오늘
pos: Noun
isUnknown: false
---
begin: 2
end: 3
length: 1
lex:
str: 은
pos: Josa
isUnknown: false
---
begin: 4
end: 6
length: 2
lex:
str: 날씨
pos: Noun
isUnknown: false
---
begin: 6
end: 7
length: 1
lex:
str: 가
pos: Josa
isUnknown: false
---
begin: 8
end: 11
length: 3
lex: 좋다
str: 좋아서
pos: Adjective
isUnknown: false
---
begin: 12
end: 15
length: 3
lex: 걸다
str: 걸어서
pos: Verb
isUnknown: false
---
begin: 16
end: 18
length: 2
lex:
str: 학교
pos: Noun
isUnknown: false
---
begin: 18
end: 19
length: 1
lex:
str: 에
pos: Josa
isUnknown: false
---
begin: 20
end: 22
length: 2
lex: 가다
str: 갔다
pos: Verb
isUnknown: false
---
begin: 22
end: 23
length: 1
lex:
str: .
pos: Punctuation
isUnknown: false
---
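Because lex is empty for the nouns in the output above, code that wants one base form per word probably needs to fall back to the surface form. The following is a minimal sketch of that idea, not part of open-korean-text: Token is a hypothetical stand-in for the three fields the sample prints (str, lex, pos), and baseForm and contentWords are illustrative helper names. The token values in main are copied from the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class BaseFormSketch {
    // Hypothetical stand-in for the str / lex / pos fields printed by the sample.
    static final class Token {
        final String text; // surface form (str)
        final String stem; // base form (lex), empty when none was reported
        final String pos;  // part-of-speech name

        Token(String text, String stem, String pos) {
            this.text = text;
            this.stem = stem;
            this.pos = pos;
        }
    }

    // Parts of speech treated as content words in this sketch (an assumption).
    private static final Set<String> CONTENT_POS = Set.of("Noun", "Verb", "Adjective");

    // Use the base form when present, otherwise fall back to the surface form.
    static String baseForm(Token t) {
        return (t.stem == null || t.stem.isEmpty()) ? t.text : t.stem;
    }

    // Keep only content words and map each to its base form.
    static List<String> contentWords(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (CONTENT_POS.contains(t.pos)) {
                out.add(baseForm(t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("오늘", "", "Noun"),
            new Token("은", "", "Josa"),
            new Token("날씨", "", "Noun"),
            new Token("가", "", "Josa"),
            new Token("좋아서", "좋다", "Adjective"),
            new Token("걸어서", "걸다", "Verb"),
            new Token("학교", "", "Noun"),
            new Token("에", "", "Josa"),
            new Token("갔다", "가다", "Verb"),
            new Token(".", "", "Punctuation"));
        System.out.println(contentWords(tokens));
    }
}
```

With real tokens, the same fallback could be applied to getStem() and getText() on each KoreanTokenJava, filtering on getPos().name().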
That's all.