
Using Stanford CoreNLP from Python

Last updated at 2017-06-03. Posted at 2014-05-03.

Introduction

Stanford CoreNLP is an all-in-one library for natural language processing of English text.
This article shows how to use CoreNLP from Python.

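Downloading and extracting Stanford CoreNLP

Download

Download Version 3.2.0 (released 2013-06-20), not the latest version, from the link below.
The reason for not using the latest version is given in the corenlp-python installation section below.
http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip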

$ curl -L -O http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip

Extraction

In my case, I place it under /usr/local/lib.

$ unzip ./stanford-corenlp-full-2013-06-20.zip -d /usr/local/lib/
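
Before moving on, it may help to confirm that the archive was extracted where expected. The following is a minimal sketch, assuming the extracted directory keeps its .jar files at the top level. list_jars is a hypothetical helper, and example.jar is a placeholder name used only for the demonstration.

```python
import glob
import os
import tempfile


def list_jars(corenlp_dir):
    """Return the sorted file names of the .jar files directly under corenlp_dir."""
    pattern = os.path.join(corenlp_dir, "*.jar")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


# Demonstration on a throwaway directory (example.jar is a placeholder name).
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "example.jar"), "w").close()
jars = list_jars(demo_dir)
print(jars)
```

In practice the function would be called with the real directory, for example list_jars("/usr/local/lib/stanford-corenlp-full-2013-06-20/"). An empty list suggests the path is wrong.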

Installing corenlp-python

corenlp-python was developed by Torotoki, based on dasmith's original, and is also registered on PyPI.
However, the corenlp-python registered on PyPI supports only CoreNLP Version 3.2.0 (as of this writing).

Installation

$ pip install corenlp-python

Basic usage

Create a parser by specifying the path where CoreNLP was extracted, then parse some text; the result is returned as JSON.

corenlp_example.py

import pprint
import json
import corenlp

# Create the parser
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
parser = corenlp.StanfordCoreNLP(corenlp_path=corenlp_dir)

# Parse the text and pretty-print the result
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)

Output:

{u'coref': [[[[u'I', 0, 0, 0, 1], [u'Alice', 0, 2, 2, 3]]]],
 u'sentences': [{u'dependencies': [[u'nsubj', u'Alice', u'I'],
                                   [u'cop', u'Alice', u'am'],
                                   [u'root', u'ROOT', u'Alice']],
                 u'parsetree': u'(ROOT (S (NP (PRP I)) (VP (VBP am) (NP (NNP Alice))) (. .)))',
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1',
                              u'Lemma': u'I',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBP'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'Alice',
                              u'NamedEntityTag': u'PERSON',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}]}
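
The nested structure above is easier to work with once it is flattened. The following is a rough sketch of one way to pull per-token information out of the result. The result dictionary is an abridged copy of the output above, and extract_tokens and extract_entities are hypothetical helper names, not part of corenlp-python.

```python
# Abridged copy of the parse result shown above (only the fields used here).
result = {
    "sentences": [
        {
            "text": "I am Alice.",
            "words": [
                ["I", {"Lemma": "I", "NamedEntityTag": "O", "PartOfSpeech": "PRP"}],
                ["am", {"Lemma": "be", "NamedEntityTag": "O", "PartOfSpeech": "VBP"}],
                ["Alice", {"Lemma": "Alice", "NamedEntityTag": "PERSON", "PartOfSpeech": "NNP"}],
                [".", {"Lemma": ".", "NamedEntityTag": "O", "PartOfSpeech": "."}],
            ],
        }
    ]
}


def extract_tokens(result):
    """Return (word, lemma, pos, ner) tuples for every token in every sentence."""
    tokens = []
    for sentence in result["sentences"]:
        for word, attrs in sentence["words"]:
            # .get() yields None for attributes whose annotator did not run.
            tokens.append((word,
                           attrs.get("Lemma"),
                           attrs.get("PartOfSpeech"),
                           attrs.get("NamedEntityTag")))
    return tokens


def extract_entities(result):
    """Return (word, tag) pairs for tokens tagged as a named entity."""
    return [(word, ner) for word, _, _, ner in extract_tokens(result)
            if ner not in (None, "O")]


tokens = extract_tokens(result)
entities = extract_entities(result)
print(tokens)
print(entities)
```

With a live parser, the same functions would be applied to json.loads(parser.parse(...)) as in corenlp_example.py.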

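Restricting the annotators

By default, CoreNLP runs everything from parsing and part-of-speech tagging through named entity recognition. To use only some of these features, specify properties.
Restricting the annotators also makes processing faster (ner in particular is heavy).

For example, to stop after tokenization and sentence splitting, create a user.properties file like the following.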

user.properties

annotators = tokenize, ssplit

When creating the parser, pass the path of this file as the properties parameter.

corenlp_example2.py

import pprint
import json
import corenlp

# Create the parser
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
properties_file = "./user.properties"
parser = corenlp.StanfordCoreNLP(
    corenlp_path=corenlp_dir,
    properties=properties_file)  # set the properties file

# Parse the text and pretty-print the result
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)

Output:

{u'sentences': [{u'dependencies': [],
                 u'parsetree': [],
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11'}]]}]}
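
When experimenting with different annotator sets, the properties file can also be generated from Python rather than edited by hand. The following is a minimal sketch that writes the same single-line format as user.properties above. write_properties is a hypothetical helper, and the temporary directory is used only so the example does not touch the working directory.

```python
import os
import tempfile


def write_properties(path, annotators):
    """Write a properties file containing a single 'annotators' line."""
    with open(path, "w") as f:
        f.write("annotators = " + ", ".join(annotators) + "\n")
    return path


# Write to a throwaway location for demonstration purposes.
props_path = os.path.join(tempfile.mkdtemp(), "user.properties")
write_properties(props_path, ["tokenize", "ssplit"])

with open(props_path) as f:
    content = f.read()
print(content)
```

The resulting path could then be passed as the properties parameter, in the same way as in corenlp_example2.py.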

List of annotators

Only tokenize and ssplit were used above, but many other annotators exist, so here is a brief summary.

annotator	Function	Depends on
tokenize	Tokenization	(none)
cleanxml	XML tag removal	tokenize
ssplit	Sentence splitting	tokenize
pos	Part-of-speech tagging	tokenize, ssplit
lemma	Lemmatization	tokenize, ssplit, pos
ner	Named entity recognition	tokenize, ssplit, pos, lemma
regexner	Regular-expression-based named entity recognition	tokenize, ssplit
sentiment	Sentiment analysis	(unknown)
truecase	Case normalization (truecasing)	tokenize, ssplit, pos, lemma
parse	Syntactic parsing	tokenize, ssplit
dcoref	Coreference resolution	tokenize, ssplit, pos, lemma, ner, parse
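
The dependency column can be turned into a small lookup that expands a desired annotator into the full list to write in user.properties. The following is a sketch under stated assumptions: the DEPENDS dictionary is copied from the table above (sentiment is omitted because its dependencies are listed as unknown), resolve is a hypothetical helper, and listing dependencies before the annotators that need them is assumed to be the order CoreNLP expects.

```python
# Dependencies copied from the table above (sentiment omitted: unknown).
DEPENDS = {
    "tokenize": [],
    "cleanxml": ["tokenize"],
    "ssplit": ["tokenize"],
    "pos": ["tokenize", "ssplit"],
    "lemma": ["tokenize", "ssplit", "pos"],
    "ner": ["tokenize", "ssplit", "pos", "lemma"],
    "regexner": ["tokenize", "ssplit"],
    "truecase": ["tokenize", "ssplit", "pos", "lemma"],
    "parse": ["tokenize", "ssplit"],
    "dcoref": ["tokenize", "ssplit", "pos", "lemma", "ner", "parse"],
}


def resolve(wanted):
    """Return the wanted annotators plus their dependencies, dependencies first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for dep in DEPENDS[name]:  # raises KeyError for an unknown annotator
            visit(dep)
        ordered.append(name)

    for name in wanted:
        visit(name)
    return ordered


ner_line = "annotators = " + ", ".join(resolve(["ner"]))
print(ner_line)
print(resolve(["dcoref"]))
```

For example, asking for ner alone expands to the five annotators it needs, which is the line that would go into the properties file.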
