
Generating and Posting Tweets with a Markov Chain in Python 3 (Personal Notes)

Last updated at 2019-06-05. Posted at 2019-05-28.

These are notes to myself, written as a PC beginner.
## Environment and preparation

Docker (running Ubuntu:latest)
  Create a new container beforehand. The image I happened to pull was Ubuntu 18.04.
  Start it and attach to it from the CLI rather than from a GUI.
Twitter API keys
  Obtain these in advance from Twitter Dev.

## Setting up the environment inside Ubuntu

Installing Python 3

Install the build tools and libraries with the following commands.

 apt update
 apt install build-essential libbz2-dev libdb-dev \
  libreadline-dev libffi-dev libgdbm-dev liblzma-dev \
  libncursesw5-dev libsqlite3-dev libssl-dev \
  zlib1g-dev uuid-dev tk-dev

Next, run the following (this time I install 3.6.6).

cd /usr/local/src
wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
tar xvzf Python-3.6.6.tgz
cd Python-3.6.6
./configure --with-ensurepip
make
make install

Python can now be started with the python3 command. Check that it works, and upgrade pip while you are at it.

python3 --version
pip3 install --upgrade pip

Install the Python libraries as well.

pip3 install requests_oauthlib
pip3 install pandas

Installing Vim

I cannot live without Vim, so I install it as a kind of tranquilizer.
apt install vim

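Installing MeCab

The official site describes the procedure, but following it downloads the archive to the local machine, so here I download it directly onto the instance instead.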

cd
wget -O mecab-0.996.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE"

Extract the archive and install it.

tar zxvf mecab-0.996.tar.gz
cd mecab-0.996
./configure
make
make install

Next, install the dictionary.

cd
wget -O mecab-ipadic-2.7.0-20070801.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM"

Again, extract the archive, this time stopping after configure.

tar zxvf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-charset=utf8

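At this point, edit /etc/ld.so.conf (this avoids an error later).
Open the file in Vim and append the following line at the end.
/usr/local/lib
Once that is done, run the install. If the MeCab shared library is still not found afterwards, running ldconfig refreshes the linker cache.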

make
make install

Finally, install the Python binding.
pip3 install mecab-python3
MeCab should now be usable.
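
To check that the binding responds, a minimal sketch such as the following may help. The helper name `wakati` is hypothetical (it is not part of MeCab). It accepts any object with a `parse` method so that it can be tried without MeCab, and when no tagger is passed it assumes `mecab-python3` and a dictionary are installed as described above.

```python
def wakati(text, tagger=None):
    """Split text into words using MeCab's wakati (space-separated) output."""
    if tagger is None:
        # Imported lazily so the helper can be exercised without MeCab installed.
        import MeCab
        tagger = MeCab.Tagger("-Owakati")
    # parse() returns one string of space-separated words ending in a newline,
    # so split() with no argument yields the individual words.
    return tagger.parse(text).split()
```

Calling `wakati("すもももももももものうち")` from the python3 prompt should return a list of words if the installation succeeded. An exception at this point usually indicates a problem with the library path or the dictionary.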

## Preparing the code

Storing the tokens

Create a place to keep the access tokens in advance. You could use dotenv, but this time I put them in a separate Python file.

config.py

CONSUMER_KEY = "**************"
CONSUMER_SECRET = "**************"
ACCESS_TOKEN = "**************"
ACCESS_TOKEN_SECRET = "**************"
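
If you would rather keep the secrets out of a source file, one option (not what this article does) is to read them from environment variables. The sketch below uses only the standard library. The function name `load_tokens` and the variable names are assumptions that simply mirror the constants in config.py.

```python
import os

# Hypothetical variable names; they mirror the constants in config.py.
TOKEN_NAMES = ("CONSUMER_KEY", "CONSUMER_SECRET",
               "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")


def load_tokens(env=os.environ):
    """Return the four tokens from the environment, failing early if any is missing."""
    missing = [name for name in TOKEN_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in TOKEN_NAMES}
```

Failing at start-up with the list of missing names tends to be easier to debug than a 401 response from the API later on.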

Fetching and storing the timeline

timeline.py

# -*- coding:utf-8 -*-
from requests_oauthlib import  OAuth1Session
import json, config, time
import pandas as pd

CK = config.CONSUMER_KEY
CS = config.CONSUMER_SECRET
AT = config.ACCESS_TOKEN
ATS = config.ACCESS_TOKEN_SECRET

url = "https://api.twitter.com/1.1/statuses/home_timeline.json"

params = {'count': 200}

TweetList = []

twitter = OAuth1Session(CK, CS, AT, ATS)

for i in range(10):
    req = twitter.get(url, params=params)
    if req.status_code == 200:
        timeline = json.loads(req.text)
        for tweet in timeline:
            TweetList.append(tweet["text"])
        # Write the CSV once per batch rather than once per tweet.
        df = pd.DataFrame(TweetList)
        df.to_csv("tweet.csv", encoding="UTF-8")
    else:
        print("Error: %d" % req.status_code)
    time.sleep(300)  # wait 5 minutes before the next request

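This program fetches up to 200 tweets every 5 minutes and repeats 10 times, so it collects up to 2,000 tweets. They are stored in tweet.csv. Because every request asks for the latest 200 tweets, consecutive batches can overlap when the timeline is quiet.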
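
Since each request returns the most recent tweets, the same tweet can be collected more than once. One way to avoid that, sketched here as an assumption rather than as part of the original script, is to remember the `id` field of each tweet object. The helper name `add_new_tweets` is hypothetical.

```python
def add_new_tweets(timeline, seen_ids, texts):
    """Append the text of tweets not seen before; return how many were added."""
    added = 0
    for tweet in timeline:
        if tweet["id"] in seen_ids:
            continue  # already collected in an earlier poll
        seen_ids.add(tweet["id"])
        texts.append(tweet["text"])
        added += 1
    return added


# Two polls that overlap on tweet 2.
seen, texts = set(), []
add_new_tweets([{"id": 1, "text": "a"}, {"id": 2, "text": "b"}], seen, texts)
add_new_tweets([{"id": 2, "text": "b"}, {"id": 3, "text": "c"}], seen, texts)
print(texts)
```

In timeline.py this would replace the inner `for tweet in timeline` loop, with `seen` created once before the outer loop.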

Preprocessing

removeurl.py

# -*- coding: utf-8 -*-
import pandas as pd
import re

df = pd.read_csv('tweet.csv')

tweets = df['0']  # the column header is 0 because the CSV was written from a plain list

replypattern = r'@[\w]+'
urlpattern = r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+'

processedtweets = []

for tweet in tweets:
    if not isinstance(tweet, str):
        continue  # skip empty cells, which pandas reads as NaN
    i = re.sub(replypattern, '', tweet)
    i = re.sub(urlpattern, '', i)
    if i.split():  # keep only tweets that still contain text
        processedtweets.append(i)

processedtweetsDataFrame = pd.Series(processedtweets)
newDF = pd.DataFrame({'text': processedtweetsDataFrame})

newDF.to_csv('processedtweets.csv')

This script strips out URLs and @mentions. What remains is mostly plain text, although tweets from fossil-era residents who still use unofficial RTs may survive.
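
To see what the two patterns do to a single tweet, here is a small sketch that reuses them. The function name `clean_tweet` and the sample strings are made up for illustration, and unlike removeurl.py it also trims surrounding whitespace.

```python
import re

REPLY_PATTERN = re.compile(r'@[\w]+')
URL_PATTERN = re.compile(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+')


def clean_tweet(text):
    """Remove @mentions and URLs, then trim surrounding whitespace."""
    text = REPLY_PATTERN.sub('', text)
    text = URL_PATTERN.sub('', text)
    return text.strip()


samples = [
    "@foo こんにちは https://t.co/abc",
    "RT @bar: 今日は晴れ",
]
for s in samples:
    print(repr(clean_tweet(s)))
```

The first sample is reduced to the greeting alone, while the second keeps its `RT :` prefix, which is the unofficial-RT leftover mentioned above. Note that in Python 3 `\w` also matches Japanese characters, so a mention written without a following space would take the adjacent text with it.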

Generating and posting a tweet from the Markov chain

tweet.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from requests_oauthlib import OAuth1Session
import json
import sys
import MeCab
import random
import re
import config

CK = config.CONSUMER_KEY
CS = config.CONSUMER_SECRET
AT = config.ACCESS_TOKEN
ATS = config.ACCESS_TOKEN_SECRET

def Mecab_file():   
        f = open("processedtweets.csv","rb")
        data = f.read()
        f.close()
 
        mt = MeCab.Tagger("-Owakati")
 
        wordlist = mt.parse(data.decode('utf-8'))  # decode the bytes to str before parsing
        wordlist = wordlist.rstrip(" \n").split(" ")
 
        markov = {}
        w = ""
 
        for x in wordlist:
            if w:
                if w in markov:  # dict.has_key() was removed in Python 3, so use "in"
                    new_list = markov[w]
                else:
                    new_list =[]
 
                new_list.append(x)
                markov[w] = new_list
            w = x
 
        choice_words = wordlist[0]
        sentence = ""
        count = 0
 
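        # Walk the chain for up to 90 words, stopping early at a dead end.
        while count < 90:
            sentence += choice_words
            if choice_words not in markov:  # the word has no recorded successor
                break
            choice_words = random.choice(markov[choice_words])
            count += 1

        sentence = sentence.split(" ", 1)[0]
        # Strip ASCII punctuation, then any remaining ASCII characters.
        sus = re.sub(r"[!-/:-@\[-`{-~]", "", sentence)
        words = re.sub(r"[!-~]", "", sus)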
        twits = words + " 【tweet from BOT】"
 
        url = "https://api.twitter.com/1.1/statuses/update.json"
        params = {"status": twits,"lang": "ja"}
        tw = OAuth1Session(CK,CS,AT,ATS)
        req = tw.post(url, params = params)
        if req.status_code == 200:
            print ("Success! Your Tweet")
        else:
            print (req.status_code)
if __name__ == '__main__':
    Mecab_file()

I adjusted the code where needed so that it runs on Python 3, but the basic behavior follows the reference site.
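
The Markov-chain part of tweet.py can be looked at in isolation. The following is a minimal sketch, assuming the text has already been split into words. The names `build_chain` and `generate` and the tiny corpus are illustrative and do not come from the reference site.

```python
import random


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain


def generate(chain, start, max_words, rng=random):
    """Walk the chain from `start`, stopping at max_words or at a dead end."""
    word = start
    out = []
    for _ in range(max_words):
        out.append(word)
        if word not in chain:
            break  # no recorded successor for this word
        word = rng.choice(chain[word])
    return "".join(out)


words = "今日 は 晴れ 。 明日 は 雨 。".split()
chain = build_chain(words)
print(chain["は"])  # the two words that were observed after "は"
print(generate(chain, words[0], 10, random.Random(0)))
```

Successors are stored with repetition, so `random.choice` picks each next word in proportion to how often it followed the current one. Passing a seeded `random.Random` makes a run reproducible, which helps when debugging.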

Fixing a bug

If the locale command reports POSIX, you may get errors caused by the ASCII-only encoding. In that case, switch to C.UTF-8.

export LANG=C.UTF-8
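
The effect of the locale can also be checked from Python itself. This small sketch, using only the standard library, prints the encodings Python has picked up. Under a POSIX locale on Python 3.6 I would expect an ASCII-based value, but the exact strings depend on the environment, so treat that as an assumption.

```python
import locale
import sys

# Encoding Python uses by default when opening text files.
print(locale.getpreferredencoding(False))
# Encoding used when printing to the terminal.
print(sys.stdout.encoding)
```

If both lines report UTF-8 after the export, the Japanese text should print without encoding errors.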

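Sites I referred to

LLC JIRIKI
The core of the program is based on this site.

QuzeeBlog
The program for collecting and storing the timeline.

grachro
This method may be the better way to install MeCab.

アシアト
Goes through the errors you can hit when using MeCab from Python.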
