Let's play with MeCab, the Japanese morphological analysis engine!
We will fetch a given user's tweets from the Twitter API and find out which words they use most often.

Environment and language

Ubuntu 14.04
Python 3.4.3

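Installation

First install MeCab itself and mecab-ipadic-NEologd, a dictionary that handles newly coined words well.
I followed the article below.
Ubuntu 14.04 に Mecab と mecab-python3 をインストール (Installing Mecab and mecab-python3 on Ubuntu 14.04)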

Installing MeCab itself

$ sudo apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

Installing mecab-ipadic-NEologd

$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ ./bin/install-mecab-ipadic-neologd -n -a

If the installer asks the following question along the way, type yes and press Enter:
Do you want to install mecab-ipadic-NEologd? Type yes or no.

Edit /etc/mecabrc and change dicdir as shown below, so that MeCab uses the NEologd dictionary by default.

/etc/mecabrc

;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
; dicdir = /var/lib/mecab/dic/debian
dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
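
To double-check which dictionary is active after the edit, a few lines of Python can scan the file for the uncommented dicdir entry. This is only a minimal sketch: the helper name active_dicdir is made up for illustration, it assumes the plain "key = value" layout with ";" comments shown above, and if several dicdir lines are present it simply reports the last one, which may not match how MeCab itself resolves duplicates.

```python
def active_dicdir(rc_text):
    """Return the last uncommented dicdir value in mecabrc-style text, or None."""
    dicdir = None
    for raw_line in rc_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and lines commented out with ';'
        if not line or line.startswith(";"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "dicdir":
            dicdir = value.strip()
    return dicdir


# Sample text mirroring the relevant lines of the edited /etc/mecabrc
SAMPLE_RC = """\
; dicdir = /var/lib/mecab/dic/debian
dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd

; userdic = /home/foo/bar/user.dic
"""

print(active_dicdir(SAMPLE_RC))
# prints: /usr/lib/mecab/dic/mecab-ipadic-neologd
```

To check the real file, pass the contents of /etc/mecabrc to the function instead of the sample text.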

mecab-python3

This makes MeCab available from Python 3.

$ sudo pip install mecab-python3

requests_oauthlib

Install an OAuth library for talking to Twitter.

$ sudo pip install requests_oauthlib

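Twitter developer application & app registration

At some point, using the Twitter API started to require an application...
Reference: 【第1回】Twitter APIを使うためにdeveloper accountの申請をしよう！ (Part 1: Applying for a developer account to use the Twitter API)

It is a hassle, but I answered each question as asked, in clumsy English, and the application was approved more easily than I expected.
Once you can sign in to the developer site, register an application and issue its API keys.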

Python code

The program consists of the following two Python files.

config.py
tweet-mecab.py

config.py

This file holds the Twitter API keys and tokens.

config.py

CONSUMER_KEY = "Consumer API key"
CONSUMER_SECRET = "Consumer API secret key"
ACCESS_TOKEN = "Access token"
ACCESS_TOKEN_SECRET = "Access token secret"
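
If you would rather not keep the secrets in a source file, one alternative is to read them from environment variables. The following is a minimal sketch of that idea, not part of the original setup: the function name load_credentials and the TWITTER_ prefixed variable names are assumptions chosen for illustration.

```python
import os

# The four values that config.py is expected to provide
CREDENTIAL_NAMES = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Read the credentials from TWITTER_* environment variables (assumed names)."""
    if env is None:
        env = os.environ
    missing = ["TWITTER_" + name for name in CREDENTIAL_NAMES
               if not env.get("TWITTER_" + name)]
    if missing:
        # Fail early with a clear message instead of sending empty keys to the API
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env["TWITTER_" + name] for name in CREDENTIAL_NAMES}
```

With this approach, config.py could call load_credentials() once and assign the four values to CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN and ACCESS_TOKEN_SECRET, so the main script stays unchanged.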

tweet-mecab.py

This script fetches tweets from the Twitter API and analyzes them with MeCab.
It should work even if you just copy and paste it.

tweet-mecab.py

# coding:utf-8
import sys
import re
import json
import collections
import warnings

import MeCab
from requests_oauthlib import OAuth1Session

import config

warnings.filterwarnings('ignore')

CK = config.CONSUMER_KEY
CS = config.CONSUMER_SECRET
AT = config.ACCESS_TOKEN
ATS = config.ACCESS_TOKEN_SECRET
twitter = OAuth1Session(CK, CS, AT, ATS)

# The user name is given as the first command-line argument
if len(sys.argv) < 2:
    sys.exit("Usage: python tweet-mecab.py <screen_name>")
twitter_name = sys.argv[1]

# Fetch the latest 100 tweets of that user
url = "https://api.twitter.com/1.1/statuses/user_timeline.json"
twitter_params = {'screen_name': twitter_name, 'count': 100}
res = twitter.get(url, params=twitter_params)

# Part-of-speech prefixes of the nouns that are likely to carry meaning
NOUN_PREFIXES = ("名詞,一般", "名詞,固有名詞", "名詞,形容動詞", "名詞,サ変接続")

text = ""

if res.status_code == 200:
    timelines = json.loads(res.text)
    for line in timelines:
        # Keep only the tweet body from the API result, with URLs removed.
        # re.ASCII stops \w from also matching Japanese text right after a URL.
        text += re.sub(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+", "", line['text'], flags=re.ASCII) + "\n"

    words = []
    m = MeCab.Tagger("-Ochasen")
    node = m.parseToNode(text)
    while node:
        # Count only the nouns that are likely to carry meaning
        if node.feature.startswith(NOUN_PREFIXES):
            words.append(node.surface)
        node = node.next

    counter = collections.Counter(words)

    # Heading: "Top 10 words @<name> used in recent tweets"
    print("＊＊＊＊　@" + twitter_name + "さんが最近のTweetでよく使った単語TOP10　＊＊＊＊")
    # most_common(10) yields the ten most frequent words, most frequent first;
    # each line is printed as "<word>・・・<count>回" ("回" means "times")
    for k, v in counter.most_common(10):
        print(k + "・・・" + str(v) + "回")

else:
    print("Failed: %d" % res.status_code)
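
The noun filter and the counting step do not depend on MeCab itself, so they can be tried in isolation before any API keys are set up. The sketch below does that with hand-written (surface, feature) pairs. The function name count_target_nouns is made up for illustration, and the sample feature strings only imitate the IPA dictionary format; they are not real analyzer output.

```python
import collections

# Part-of-speech prefixes treated as meaningful nouns in the script
TARGET_PREFIXES = ("名詞,一般", "名詞,固有名詞", "名詞,形容動詞", "名詞,サ変接続")


def count_target_nouns(pairs, top_n=10):
    """Count surfaces whose feature starts with a target prefix; return the top_n."""
    counter = collections.Counter(
        surface
        for surface, feature in pairs
        # str.startswith accepts a tuple and matches any of its prefixes
        if feature.startswith(TARGET_PREFIXES)
    )
    return counter.most_common(top_n)


# Hand-written sample in the shape of (node.surface, node.feature)
sample = [
    ("宮崎", "名詞,固有名詞,地域,一般,*,*,宮崎,ミヤザキ,ミヤザキ"),
    ("で", "助詞,格助詞,一般,*,*,*,で,デ,デ"),
    ("勉強", "名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー"),
    ("宮崎", "名詞,固有名詞,地域,一般,*,*,宮崎,ミヤザキ,ミヤザキ"),
]

print(count_target_nouns(sample))
# prints: [('宮崎', 2), ('勉強', 1)]
```

Keeping this step in its own function also makes it easy to adjust the list of prefixes and see the effect without calling the Twitter API again.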

Trying it out

$ python tweet-mecab.py tomoeine    # the argument is the Twitter user name
＊＊＊＊　@tomoeineさんが最近のTweetでよく使った単語TOP10　＊＊＊＊
宮崎・・・15回
xxxxxx・・・10回
Web・・・9回
自分・・・6回
w・・・6回
xxxxxx・・・5回
xxxxxx・・・5回
他・・・5回
人・・・5回
Laravel・・・4回
勉強・・・4回

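The "xxxxxx" entries were the user names of acquaintances I often reply to.
Quite a few meaningless words such as "w" and "他" ("other") made it into the list as well.
There seems to be room for improvement.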
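
One possible direction, sketched here under stated assumptions rather than as a tested fix: strip @mentions from each tweet before analysis, and drop very short or known filler words before counting. The names clean_tweet and is_meaningful, the stop-word list, and the minimum length of two characters are all illustrative choices. Note that the length rule also discards single-kanji nouns that may be meaningful.

```python
import re

# Twitter screen names consist of ASCII letters, digits and underscores
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")

# Example filler words taken from the result above (an assumed, incomplete list)
STOP_WORDS = {"w", "ｗ", "他", "人"}


def clean_tweet(text):
    """Remove @mentions so that user names are not counted as words."""
    return MENTION_RE.sub("", text).strip()


def is_meaningful(word, min_length=2):
    """Reject stop words and words shorter than min_length characters."""
    return len(word) >= min_length and word not in STOP_WORDS


print(clean_tweet("@foo こんにちは @bar_1"))
# prints: こんにちは
print([w for w in ["宮崎", "w", "他", "Laravel"] if is_meaningful(w)])
# prints: ['宮崎', 'Laravel']
```

In the script, clean_tweet would be applied to each tweet body before it is appended to the text, and is_meaningful would be checked before a word is added to the list that gets counted.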
