
Summarizing Momotaro with the COTOHA API and Emoji

Last updated at 2020-02-27, posted at 2020-02-24

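Introduction

The articles using the COTOHA API looked fun, so I wanted to try something myself and combined it with emoji to summarize a text. (WordCloud visualization added on 2/25.)

"Momotaro"¹ was summarized like this.

A properly "NLP-style" mapping from text to emoji was beyond me. Instead, I take the words returned by COTOHA's parsing API, look up their English synonyms in the WordNet database, and replace them with emoji.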

Environment

Windows 10
Python 3.6.5
Jupyter notebook

The notebook used for this post is uploaded to Gist.

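Processing overview

Besides the COTOHA API, I used WordNet for the synonym lookup and WordCloud to visualize the named entities.

Preprocessing: [Filler removal API]
Cleans up the input text.
Sentiment summary: [Sentiment analysis API]
Maps the returned labels to emoji and displays more emoji the higher the score.
Emoji summary: [Parsing API] + [WordNet]
Extracts nouns and verbs with the parsing API and looks up their English synonyms in WordNet. The synonyms are then searched for in Python's emoji module. Because this only matches English synonyms against emoji names, the resulting emoji are hit-and-miss.
Note: the idea of combining [Parsing API] + [WordNet] comes from the article 「「募ってはいるが、募集はしていない」人たちへ」.
Named entity summary: [Named entity extraction API] + [WordCloud]
Visualizes the words extracted as named entities with WordCloud. Interestingly, 「ドンブラコ」 (donburako) is extracted as a named entity.
Note: the WordCloud visualization imitates the article 「COTOHAを利用して、物語の舞台を抽出・図示してみた」.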
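
The sentiment step above repeats an emoji according to the score. The code later in this post computes the count as `int(score//0.2 + 1)`; as a side note, float floor division can be surprising at bucket edges (in Python, `0.6 // 0.2` evaluates to `2.0`). The following is a minimal sketch of an alternative that compares against explicit boundaries instead. The names `LEVEL_BOUNDS`, `score_to_level`, and `repeat_emoji` are hypothetical and not part of the original notebook, and the score is assumed to lie in [0.0, 1.0].

```python
from bisect import bisect_right

# Assumed bucket edges: [0, 0.2) -> 1, [0.2, 0.4) -> 2, ..., [0.8, 1.0] -> 5
LEVEL_BOUNDS = [0.2, 0.4, 0.6, 0.8]


def score_to_level(score):
    """Map a sentiment score in [0.0, 1.0] to an emoji count from 1 to 5."""
    # bisect_right counts how many boundaries are <= score, using only
    # comparisons, so no floating-point division is involved.
    return bisect_right(LEVEL_BOUNDS, score) + 1


def repeat_emoji(symbol, score):
    """Repeat the symbol once per level (hypothetical helper)."""
    return symbol * score_to_level(score)


print(repeat_emoji("👍", 0.65))
```

With a score of 0.65 the sketch prints four thumbs-up characters, since 0.65 falls into the fourth bucket.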

COTOHA API code

Note: the code is based on the article 「自然言語処理を簡単に扱えると噂のCOTOHA APIをPythonで使ってみた」.


import urllib.request
import urllib.error  # HTTPError is referenced explicitly in request()
import json
from pprint import pprint
import traceback

class CotohaApi:
    def __init__(self, client_id, client_secret, developer_api_base_url, access_token_publish_url):
        self.client_id = client_id
        self.client_secret = client_secret
        self.developer_api_base_url = developer_api_base_url
        self.access_token_publish_url = access_token_publish_url
        self.request_count = 0
        self.get_access_token()
        
    def request(self, url, data):
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        req = urllib.request.Request(url, json.dumps(data).encode(), headers)  
        try:
            with urllib.request.urlopen(req) as res:
                res_body = json.loads(res.read())
        except urllib.error.HTTPError as e:
            if e.code == 401:
                print(e, ": retrieving an access token....\n")
                self.get_access_token()
                headers={
                    "Authorization": "Bearer " + self.access_token,
                    "Content-Type": "application/json;charset=UTF-8",
                }
                req = urllib.request.Request(url, json.dumps(data).encode(), headers) 
                with urllib.request.urlopen(req) as res:
                    res_body = json.loads(res.read())
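            else:
                # Re-raise other HTTP errors: callers unpack three return
                # values, so returning None here would fail with a TypeError.
                print(e)
                traceback.print_exc()
                raise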
        self.request_count+=1
        return res_body, res.status, res.reason      
        
    def get_access_token(self):
        url = self.access_token_publish_url
        headers={
            "Content-Type": "application/json;charset=UTF-8",
        }
        data = {
            "grantType": "client_credentials",
            "clientId": self.client_id,
            "clientSecret": self.client_secret
        }
        req = urllib.request.Request(url, json.dumps(data).encode(), headers)
        with urllib.request.urlopen(req) as res:
                res_body = json.loads(res.read())
        self.request_count+=1
        self.access_token = res_body["access_token"]

    # Parsing
    def parse(self, sentence):
        url = self.developer_api_base_url + "v1/parse"
        data = {
            "sentence": sentence
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Named entity extraction
    def named_entry(self, sentence):
        url = self.developer_api_base_url + "v1/ne"
        data = {
            "sentence": sentence
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Coreference resolution
    def coreference(self, document):
        url = self.developer_api_base_url + "v1/coreference"
        data = {
            "document": document,
            "type": "default",
            "do_segment":True,
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Keyword extraction
    def keyword(self, document):
        url = self.developer_api_base_url + "v1/keyword"
        data = {
            "document": document,
            "type": "default",
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Similarity calculation
    def similarity(self, s1, s2):
        url = self.developer_api_base_url + "v1/similarity"
        data = {
            "s1": s1,
            "s2": s2
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Sentence type classification
    def sentence_type(self, sentence):
        url = self.developer_api_base_url + "v1/sentence_type"
        data = {
            "sentence": sentence
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # User attribute estimation
    def user_attribute(self, document):
        url = self.developer_api_base_url + "beta/user_attribute"
        data = {
            "document": document
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Sentiment analysis
    def sentiment(self, sentence):
        url = self.developer_api_base_url + "v1/sentiment"
        data = {
            "sentence": sentence
        }
        res_body, status, reason = self.request(url, data)
        return res_body

    # Filler removal (beta); the method name remove_filter is kept as in the original
    def remove_filter(self, text):
        url = self.developer_api_base_url + "beta/remove_filler"
        data = {
            "text": text,
            "do_segment":True,
        }
        res_body, status, reason = self.request(url, data)
        return res_body
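
The `request` method above retries once with a fresh access token when the server answers 401. The following is a minimal, network-free sketch of that refresh-and-retry pattern. `Unauthorized`, `call_with_refresh`, and `fake_send` are hypothetical stand-ins for illustration only; they are not part of the COTOHA API or of the class above.

```python
class Unauthorized(Exception):
    """Stand-in for an HTTP 401 response (hypothetical, for illustration)."""


def call_with_refresh(send, refresh_token, token):
    """Call send(token); on Unauthorized, fetch a new token and retry once.

    Returns the response together with the token that finally worked,
    so the caller can keep using the refreshed token.
    """
    try:
        return send(token), token
    except Unauthorized:
        new_token = refresh_token()
        # A second failure is not caught and propagates to the caller.
        return send(new_token), new_token


def fake_send(token):
    """Fake transport that accepts only the token 'fresh'."""
    if token != "fresh":
        raise Unauthorized(token)
    return {"status": 0}


result, token = call_with_refresh(fake_send, lambda: "fresh", "stale")
print(result, token)
```

Retrying exactly once keeps the logic simple: if the refreshed token is rejected as well, the problem is probably the credentials rather than an expired token.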

Getting synonyms from WordNet

Download the database wnjpn.db from WordNet and place it in the current directory.

Query that fetches the synonyms

query = """
SELECT
    word.wordid,
    COUNT(word.wordid) AS COUNT,
    word.lemma,
    word.lang,
    sense.synset
FROM
    sense
    JOIN word ON word.wordid = sense.wordid
WHERE
    1 = 1
    AND sense.synset IN(
        SELECT 
            synset
        FROM
            sense
        WHERE
            wordid IN (
                SELECT
                    wordid
                FROM 
                    word
                WHERE 
                    1 = 1
                    AND lemma = ?
            )
    )
    AND word.lang = ?
GROUP BY word.wordid
ORDER BY COUNT DESC
"""
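
To see what the query does, here is a minimal sketch that runs a shortened version of it against an in-memory SQLite database. The two toy tables contain only the columns the query touches (the real wnjpn.db has more), and the rows and synset ids are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema: only the columns used by the query (an assumption for this sketch).
conn.executescript("""
CREATE TABLE word  (wordid INTEGER, lang TEXT, lemma TEXT);
CREATE TABLE sense (synset TEXT, wordid INTEGER);
""")
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    (1, "jpn", "桃"), (2, "eng", "peach"), (3, "eng", "pink"), (4, "eng", "dog"),
])
# Made-up synsets: 桃 shares one synset with "peach" and one with "pink".
conn.executemany("INSERT INTO sense VALUES (?, ?)", [
    ("s-fruit", 1), ("s-fruit", 2),
    ("s-color", 1), ("s-color", 3),
    ("s-animal", 4),
])

toy_query = """
SELECT word.lemma
FROM sense JOIN word ON word.wordid = sense.wordid
WHERE sense.synset IN (
        SELECT synset FROM sense
        WHERE wordid IN (SELECT wordid FROM word WHERE lemma = ?))
  AND word.lang = ?
GROUP BY word.wordid
ORDER BY word.lemma
"""
# Step 1: find the synsets of the input lemma.
# Step 2: return every word of the requested language in those synsets.
synonyms = [row[0] for row in conn.execute(toy_query, ("桃", "eng"))]
conn.close()
print(synonyms)
```

The word "dog" is not returned because it shares no synset with 「桃」, which is the whole idea of the lookup: synonyms are words that sit in the same synset.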

Code that fetches the synonyms

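# ref:
# WordNet Ja: http://compling.hss.ntu.edu.sg/wnja/
import re
import sqlite3
from pprint import pprint

# The code below calls emj.emojize(), so the emoji package is assumed to be
# imported under this alias.
import emoji as emj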

def get_synonyms(word, lang='eng'):
    """
    Search synonyms of the input word in the Japanese WordNet database.
    http://compling.hss.ntu.edu.sg/wnja/
    """
    conn = sqlite3.connect("./wnjpn.db")
    try:
        c = conn.cursor()
        # row[2] is word.lemma in the SELECT list of `query`
        synonyms = [row[2] for row in c.execute(query, (word, lang))]
        c.close()
    finally:
        # Close the connection too; closing only the cursor leaves it open.
        conn.close()
    return synonyms

def get_emoji(words, count):
    """
    words: list of words you want to emojize
    count: count of the emoji to display depending on the sentiment api score
    """
    assert(type(words) is list)
    emojis = []   
    for word in words:
        if re.search(':.*:', word):
            emoji = emj.emojize("{}".format(word)*count, use_aliases=True)
        else:
            emoji = emj.emojize(":{}:".format(word)*count, use_aliases=True)
        if not re.search(':.*:', emoji):
            emojis.append("{}([{})".format(emoji,word))
        #else:
            #print(":{}: could not be emojized".format(word))
    return emojis

Synonyms and emoji are looked up like this.

word="桃"
synonyms = get_synonyms(word)
pprint(synonyms)
get_emoji(synonyms, 1)

# Output
['pink', 'peach']
['🍑([peach)']

Code that displays the summary

Input

sentence1= """
むかしむかし、あるところに、おじいさんとおばあさんが住んでいました。
おじいさんは山へしばかりに、おばあさんは川へせんたくに行きました。
おばあさんが川でせんたくをしていると、ドンブラコ、ドンブラコと、大きな桃が流れてきました。
「おや、これは良いおみやげになるわ」
おばあさんは大きな桃をひろいあげて、家に持ち帰りました。
そして、おじいさんとおばあさんが桃を食べようと桃を切ってみると、なんと中から元気の良い男の赤ちゃんが飛び出してきました。
「これはきっと、神さまがくださったにちがいない」
子どものいなかったおじいさんとおばあさんは、大喜びです。
桃から生まれた男の子を、おじいさんとおばあさんは桃太郎と名付けました。
"""

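Displaying the summary

Set CLIENT_SECRET and CLIENT_ID as environment variables and read them from Python. Jupyter notebook works the same way if you start it with the environment variables set.
The Japanese font ipagp.ttf, which WordCloud needs to render Japanese text, can be downloaded from the IPA font download page and placed in the current directory.

Functions for displaying the summary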

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint

emotional_label = {'Positive': ':thumbsup:',
                   'Negative': ':thumbsdown:',
                   'Neutral': ':hand:'}
emotion_label = {'喜ぶ': ':smile:',
                 '怒る': ':rage:',
                 '悲しい': ':cry:',
                 '不安': ':worried:',
                 '恥ずかしい': ':flushed:',
                 '好ましい': ':expressionless:',
                 '嫌': ':stuck_out_tongue_closed_eyes:',
                 '興奮': ':laughing:',
                 '安心': ':relieved:',
                 '驚く': ':astonished:',
                 '切ない': ':disappointed:',
                 '願望': ':wink:',
                 'P': ':smile:',
                 'N': ':pensive:',
                 'PN': ':neutral_face:'}


def preprocess(sentence):
    # Filler removal returns one entry per sentence; keep only the cleaned text.
    result = cotoha_api.remove_filter(sentence)
    return [item['fixed_sentence'] for item in result['result']]


def get_words_by_class(parsed_result, word_classes):
    # Note: only the first token of each chunk is inspected.
    words = []
    for chunk in parsed_result['result']:
        head = chunk['tokens'][0]
        if head['pos'] in word_classes:
            words.append(head['lemma'])
    return words


def emotion_score(sentence, verbose=False):
    # initialize emotion label; 'PN' appears to mean neutral (positive-negative), not confirmed
    emotion = 'PN'
    emotions = {}
    result = cotoha_api.sentiment(sentence)
    if verbose:
        pprint(result)

    sentiment = result['result']['sentiment']
    score = result['result']['score']
    emotional_phrase = result['result']['emotional_phrase']

    # emotional score: 5 levels (low -> high: 1 to 5)
    # score: 0 < 0.2 < 0.4 < 0.6 < 0.8 < 1.0 -> level 1, 2, 3, 4, 5
    emotional = get_emoji([emotional_label[sentiment]], int(score//0.2 + 1))

    if emotional_phrase:
        for i in range(len(result['result']['emotional_phrase'])):
            emotion = result['result']['emotional_phrase'][i]['emotion']
            emotions[str(emotion)] = get_emoji([emotion_label[emotion]], 1)

    return emotional[0], sentiment, score, emotions


def show_emoji_summary(words):
    synonyms = {}
    emojis_summary = {}

    for word in words:
        synonyms[str(word)] = get_synonyms(word, lang='eng')

    for key, word_list in synonyms.items():
        if len(word_list) > 0:
            emoji_list = get_emoji(word_list, 1)
            if len(emoji_list) > 0:
                emojis_summary[str(key)] = emoji_list
    return emojis_summary


def show_named_entry(words):
    named_entry = []
    for word in words:
        result = cotoha_api.named_entry(word)
        if len(result['result']) > 0:
            named_entry.append(result['result'][0])
    return named_entry


def get_wordcrowd_mask(text):
    """ref:
    https://amueller.github.io/word_cloud/auto_examples/single_word.html
    """
    font_path = './ipagp.ttf'
    x, y = np.ogrid[:300, :300]
    mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
    mask = 255 * mask.astype(int)
    wc = WordCloud(font_path=font_path, random_state=1, 
                   mask=mask, background_color="white").generate(text)
    return wc
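
`get_words_by_class` above looks only at the first token of each chunk, so a verb that follows another token in the same chunk is skipped. The following is a minimal sketch that makes this choice explicit. The `sample` structure is a hypothetical, heavily reduced imitation of a parse result (only `tokens`, `lemma`, and `pos`), and `words_by_class` with its `first_token_only` flag is not part of the original notebook.

```python
def words_by_class(parsed_result, word_classes, first_token_only=True):
    """Collect lemmas whose part of speech is in word_classes.

    first_token_only=True mirrors the behaviour of get_words_by_class;
    False scans every token of every chunk.
    """
    words = []
    for chunk in parsed_result["result"]:
        tokens = chunk["tokens"][:1] if first_token_only else chunk["tokens"]
        words.extend(t["lemma"] for t in tokens if t["pos"] in word_classes)
    return words


# Hypothetical, reduced parse result with two chunks.
sample = {"result": [
    {"tokens": [{"lemma": "桃", "pos": "名詞"},
                {"lemma": "を", "pos": "格助詞"}]},
    {"tokens": [{"lemma": "持つ", "pos": "動詞語幹"},
                {"lemma": "帰る", "pos": "動詞語幹"}]},
]}

classes = ["名詞", "動詞語幹"]
print(words_by_class(sample, classes))
print(words_by_class(sample, classes, first_token_only=False))
```

Scanning all tokens yields more candidate words for the synonym lookup, at the cost of more WordNet queries and possibly noisier emoji.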

Main part of the summary display

%matplotlib inline
import os

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
ACCESS_TOKEN_PUBLISH_URL="https://api.ce-cotoha.com/v1/oauth/accesstokens"
DEVELOPER_API_BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp/"

cotoha_api = CotohaApi(CLIENT_ID, CLIENT_SECRET, 
                       DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)

word_classes=["名詞", "動詞語幹"]

def get_summary(sentence):
    
    # preprocess: remove fillers
    fixed_sentences_list = preprocess(sentence)

    # concatenate the fixed sentences produced by the filler removal API
    concat_sentence = "".join(fixed_sentences_list)

    # get the words produced by the parse API (their synonyms are looked up later)
    parsed_result = cotoha_api.parse(concat_sentence)
    words = get_words_by_class(parsed_result, word_classes)

    # Display summary
    print("\nInput:\n","-"*40,"\n",concat_sentence)

    # show emotion summary
    print("\nOutput:\n","-"*40)
    emotional, sentiment, score, emotions = emotion_score(concat_sentence, verbose=False)
    print("*** emotion summary ***")
    print("{}:{} score:{:.2f}".format(emotional, sentiment, score))
    for key, values in emotions.items():
        print("  {}:{}".format(values, key))
              
    # show emoji summary of sentence 
    emojis_summary = show_emoji_summary(words)
    if len(emojis_summary) > 0:
        print("\n*** emoji summary ***")
        for key, value in emojis_summary.items():
            values = ""
            for v in value:
                values+=v
            print("{:5}{}".format(key, values))
    
    # show named entry summary of sentence 
    named_entries = cotoha_api.named_entry(concat_sentence)
    if len(named_entries['result']) > 0:
        print("\n*** named entry summary ***")
        named_entry_summary = ""
        for named_entry in named_entries['result']:
            named_entry_summary+=named_entry['form'] + " "
        #print("words:", named_entry_summary)
        wc = get_wordcrowd_mask(named_entry_summary)
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")

# main
get_summary(sentence1)

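Off to defeat the ogres

I'm happy that the ogre, the monkey, and the dog showed up (no pheasant, though...). It's a pity that the emoji fail to express the ogre extermination, but the WordCloud looks good.

Summary

This was my first time learning about and using the COTOHA API, and it is very easy to use. It offers a rich set of features and the API responds quickly. I thought it was great that you can use it right after registering and that it is strong in Japanese.

My original motivation was to run sentiment analysis on reports at work and view them over time. Sentiment analysis, which I had thought of as a high hurdle, is possible with the COTOHA API, so I tried hard to express the result with emoji, but a clever mapping to emoji was beyond me.

What I really wanted was a conversion from "context", for example turning **"I went for a drink on the way home from work."** into a fitting emoji. To infer an emoji from the features "work", "finish", and "drink", perhaps you would map emoji onto a corpus... I don't know.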

¹ I borrowed the text of "Momotaro" from this site.
