
A Comparison of Methods for Japanese Sentiment Analysis

Last updated at 2018-07-05, posted at 2018-06-29

Survey

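Representative sentiment analysis datasets and tools usable from Python

The representative approaches seem to rely on one of the following three resources.
When reading the references, also pay attention to how the results are visualized.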

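dataset/tool	type	data/scoring	cost	merit/demerit	references
Prof. Inui's polarity dictionary	dictionary set	discrete scoring: positive words +1, negative words -1	Free. Commercial use is also allowed if credit is given.	With small amounts of text such as tweets the scores become coarse, so it is poorly suited to comparing sentiment between short texts.	ref_1, ref_2
Prof. Takamura's polarity correspondence table	dictionary set	continuous scoring: a real number from 1 (positive) to -1 (negative)	Free, but redistribution is prohibited.	Some sources report that its accuracy is mediocre.	ref_1, ref_2, ref_3
Google Cloud Natural Language	Cloud API	unknown	Free up to 5,000 units/month, billed beyond that.	Gives access to Google's various APIs, but costs money.	-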
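
The remark about coarse scores on short texts can be made concrete. The sketch below assumes the score is the mean of ±1 word polarities (as in the script later in this article) and lists every value that mean can take for a given number of matched words. `possible_scores` is a hypothetical helper written for this illustration, not part of any library.

```python
from fractions import Fraction
from itertools import product


def possible_scores(n_polar_words):
    """Return every mean reachable when n words are each scored +1 or -1."""
    scores = set()
    for combo in product((1, -1), repeat=n_polar_words):
        scores.add(Fraction(sum(combo), n_polar_words))
    return sorted(scores)


for n in (1, 2, 4):
    # With n matched words there are only n + 1 distinct scores.
    print(n, [float(s) for s in possible_scores(n)])
```

A tweet that matches only two dictionary words can therefore only score -1, 0, or 1, which is why comparing short texts with a discrete dictionary is difficult.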

Try

Mac OS X 10.11 El Capitan
oh-my-zsh
Anaconda ver1.5.1 (anaconda3-4.2.0)
pip 10.0.1
python 3.5.2

Install mecab-python3

$ pip3 install mecab-python3

If you get an error, try brew update, apt-get update, pip3 install --upgrade, and so on.
On Ubuntu, when pip failed to install the package for some reason, running sudo apt-get install g++ made the install work.
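
To confirm that the install succeeded before running anything else, a quick check along these lines may help. It only tests whether a module named `MeCab` can be found by the interpreter; `mecab_available` is a hypothetical helper name, and the check says nothing about whether a dictionary is installed.

```python
import importlib.util


def mecab_available():
    """Return True if a module named MeCab is importable in this environment."""
    return importlib.util.find_spec("MeCab") is not None


print("MeCab binding found" if mecab_available() else "MeCab binding missing")
```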

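Install the neologd dictionary

I tried Prof. Inui's polarity dictionary.

Preparation

Download the polarity dictionaries from the link above
- Declinable-word polarity dictionary: wago.121808.pn
- Noun polarity dictionary: pn.csv.m3.120408.trim
Prepare the natural-language text to analyze
Then put the Python file, the polarity dictionaries, and the input file all in the same directory.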
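
Before reading the full script, it may help to see how a single line of each dictionary becomes a (word, score) pair. The sketch below mirrors the parsing logic of the script in the next section. The two sample lines are made up for illustration and only assume the tab-separated layout that the script relies on; `parse_wago_line` and `parse_noun_line` are hypothetical helper names.

```python
def parse_wago_line(line):
    """Parse one line of the declinable-word dictionary: '<label>\\t<expression>'."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0].split("（")[0]      # "ネガ" or "ポジ" before the full-width bracket
    headword = fields[1].split(" ")[0]    # first word of a space-separated expression
    score = {"ネガ": -1, "ポジ": 1}.get(label, 0)
    return headword, score


def parse_noun_line(line):
    """Parse one line of the noun dictionary: '<word>\\t<p|n|other>\\t...'."""
    fields = line.rstrip("\n").split("\t")
    score = {"p": 1, "n": -1}.get(fields[1], 0)
    return fields[0], score


# Made-up sample lines in the assumed layout (not copied from the dictionaries).
sample_wago = "ネガ（経験）\tあきらめる\n"
sample_noun = "笑顔\tp\t〜である・になる（状態）客観\n"

print(parse_wago_line(sample_wago))
print(parse_noun_line(sample_noun))
```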

Source code

sentiment_analysis.py

import csv
import sys
from statistics import mean

import MeCab

try:
	# Optional third-party package that prints more readable tracebacks.
	import better_exceptions
except ImportError:
	pass

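# Path of the text file to analyze, given as the first command-line argument.
input_file_name = sys.argv[1]

def read_file_into_lines(file_name):
	# The dictionaries and the input file are assumed to be UTF-8 encoded.
	with open(file_name, "r", encoding="utf-8") as f:
		return f.readlines()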

#########

# Maps a word to its polarity: 1 (positive), -1 (negative), 0 (anything else).
polar_dict = {}

def read_dict_of_declinable_words(wago_file):
	wago_lines = read_file_into_lines(wago_file)

	for l in wago_lines:
		# Each line is "<label>\t<space-separated expression>".
		line_list = l.split("\t")
		# Only the first word of a multi-word expression is used as the key.
		headword = line_list[1].split(" ")[0].strip()
		# The label starts with ネガ or ポジ, followed by a bracketed category.
		label = line_list[0].split("（")[0]

		if label == "ネガ":
			polar_dict[headword] = -1
		elif label == "ポジ":
			polar_dict[headword] = 1
		else:
			polar_dict[headword] = 0

#########

def read_dict_of_substantive_words(trim_file):
	trim_lines = read_file_into_lines(trim_file)

	for l in trim_lines:
		# Each line is "<word>\t<polarity label>\t..."; p = positive, n = negative.
		line_list = l.split("\t")
		if line_list[1] == "p":
			polar_dict[line_list[0]] = 1
		elif line_list[1] == "n":
			polar_dict[line_list[0]] = -1
		else:
			# Any other label is treated as neutral.
			polar_dict[line_list[0]] = 0

##############

read_dict_of_declinable_words("./wago.121808.pn")

read_dict_of_substantive_words("./pn.csv.m3.120408.trim")

### Tune up the polar dictionary.
# pop() with a default avoids a KeyError when the key is absent.
polar_dict.pop("", None)
polar_dict.pop("だ", None)
polar_dict["ない"] = -1
### End of tune-up.

polar_words_list = [d for d in polar_dict]

#####################

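def create_mecab_list(text):
	mecab_list = []
	# The dictionary path depends on where mecab-ipadic-neologd was installed.
	mecab = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
	# Parsing an empty string first is a commonly used workaround for broken node.surface values.
	mecab.parse("")
	pos_name_list = ["形容詞", "動詞", "形容動詞", "助動詞"]
	node = mecab.parseToNode(text)
	while node:
		feature = node.feature.split(",")
		# feature: [POS, POS subtype 1, POS subtype 2, POS subtype 3, conjugation type, conjugation form, base form, reading, pronunciation]
		# e.g. 忙しく  形容詞,自立,*,*,形容詞・イ段,連用テ接続,忙しい,イソガシク,イソガシク
		if feature[0] in pos_name_list and len(feature) > 6 and feature[6] != "*":
			# Inflected words are looked up by their base form.
			mecab_list.append({feature[6]:feature[0]})
		else:
			# Everything else (and words without a base form) uses the surface form.
			mecab_list.append({node.surface:feature[0]})
		node = node.next
	return mecab_list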

with open(input_file_name, "r", encoding="utf-8") as file:
	input_string = file.read()

morphemes_dict_list = create_mecab_list(input_string)

#########

def polar_score(mecab_list, polar_dict):
	# Flatten the list of {word: POS} dicts into a plain list of words.
	morphemes_list = [k for mecab_dict in mecab_list for k in mecab_dict]

	polar_score_list = []
	polar_score_words = []
	for morpheme in morphemes_list:
		# Look each word up in the dictionary passed as an argument.
		if morpheme in polar_dict:
			polar_score_words.append(morpheme)
			polar_score_list.append(polar_dict[morpheme])

	if len(polar_score_list) == 0:
		score = 0
		print("no polar words.")
	else:
		# The overall score is the mean of the matched words' polarities.
		score = mean(polar_score_list)
		print("calculating...")
	return polar_score_words, polar_score_list, [score]

###########

def csv_writer(result_tuple, output_file):
	# newline="" is what the csv module expects and avoids blank rows on Windows.
	with open(output_file, "w", encoding="utf-8", newline="") as f:
		w = csv.writer(f)
		# Rows: matched words, their scores, and the overall score.
		for e in result_tuple:
			w.writerow(e)

csv_writer(polar_score(morphemes_dict_list, polar_dict), "./output.csv")
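
The base-form lookup inside create_mecab_list can be tried without MeCab installed. The sketch below takes a surface form and a comma-separated feature string in the layout shown in the script's comment, and returns the key that would be looked up in the polarity dictionary. `lookup_key` is a hypothetical helper, and the feature strings other than the 忙しく example are assumptions written in the same layout.

```python
# POS tags whose words are looked up by base form rather than surface form.
INFLECTED_POS = {"形容詞", "動詞", "形容動詞", "助動詞"}


def lookup_key(surface, feature_csv):
    """Return the dictionary key for one morpheme: base form if inflected, else surface."""
    feature = feature_csv.split(",")
    base_form = feature[6] if len(feature) > 6 else "*"
    if feature[0] in INFLECTED_POS and base_form != "*":
        return base_form
    return surface


print(lookup_key("忙しく", "形容詞,自立,*,*,形容詞・イ段,連用テ接続,忙しい,イソガシク,イソガシク"))
print(lookup_key("天気", "名詞,一般,*,*,*,*,天気,テンキ,テンキ"))
```

Falling back to the surface form when the base form is "*" keeps unknown words from all collapsing onto the same "*" key.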

Running it

$ less input.txt
今日の天気は荒れていたけど、明日には晴れるのではないかと思う。
$ python3 sentiment_analysis.py input.txt
$ less output.csv
荒れる,明日,晴れる,ない
-1,1,1,-1
0
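
The final 0 in output.csv is simply the mean of the four matched scores. The snippet below reproduces that arithmetic from the words and scores shown in the output above; nothing else is assumed.

```python
from statistics import mean

# Values copied from the first two rows of output.csv above.
matched_words = ["荒れる", "明日", "晴れる", "ない"]
matched_scores = [-1, 1, 1, -1]

# Two negative and two positive words cancel out.
overall = mean(matched_scores)
print(dict(zip(matched_words, matched_scores)), overall)
```

Note that the ない counted here appears to come from のではないか, which is not a real negation of anything in the sentence, yet it still pulls the score down by one word's worth.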

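Discussion

Negation needs to be handled, for example by parsing the sentence and reversing the polarity when a negation word is present.
- cf. https://qiita.com/matsu0228/items/0323f299d03f5b07efdc
The polarity dictionary's accuracy is sometimes mediocre.
- The dictionary is tuned around the middle of the Python code.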

### Tune up the polar dictionary.
# pop() with a default avoids a KeyError when the key is absent.
polar_dict.pop("", None)
polar_dict.pop("だ", None)
polar_dict["ない"] = -1
### End of tune-up.
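
As a rough illustration of the first point in this discussion, the sketch below flips the polarity of a word when a negation word follows it closely. This is a simple token-window heuristic, not the syntactic parsing mentioned above, and it is not what the script in this article does. `scores_with_negation`, the `NEGATORS` set, the `window` parameter, and the toy dictionary are all assumptions made for the example.

```python
# Base forms treated as negation markers (an assumed, minimal set).
NEGATORS = {"ない"}


def scores_with_negation(lemmas, polar_dict, window=2):
    """Score base forms, flipping a word's polarity if a negator follows within `window` tokens."""
    scores = []
    last_polar_pos = None  # position of the most recent scored word not yet negated
    for pos, lemma in enumerate(lemmas):
        if lemma in NEGATORS:
            if last_polar_pos is not None and pos - last_polar_pos <= window:
                scores[-1] = -scores[-1]
                last_polar_pos = None  # each scored word is flipped at most once
            continue  # the negator itself is not scored
        if lemma in polar_dict:
            scores.append(polar_dict[lemma])
            last_polar_pos = pos
    return scores


toy_dict = {"楽しい": 1, "つらい": -1}
print(scores_with_negation(["楽しい", "ない"], toy_dict))
print(scores_with_negation(["つらい", "ない"], toy_dict))
```

With this approach ない no longer needs its own -1 entry in the dictionary, because it only acts on a neighboring word. A rhetorical のではないか far from any polar word is then ignored instead of being counted as negative.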

Other points will be added as they come up.
