
Generating random sentences from your own tweets with a trigram model

Python

Posted at 2019-12-13

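This is a personal record, so the explanations are brief.

How to do it

Preparation

First, download your tweet history. On Twitter, go to "Settings and privacy" → "Account" → "Twitter data" → "Download Twitter data" and submit a request. After a while a download link arrives at your email address, and you download the archive from there.
Apparently the format of the downloaded data changed after the summer of 2019, and tweets.csv became tweets.js. Handling that myself would be a hassle, so I use a tool someone else wrote to create tweets.csv. (https://17number.github.io/tweet-js-loader/ )

Create a text folder in your working directory and put tweets.csv in it. That completes the preparation.

Next, the contents of tweet.py.
The first part below creates tweets.txt. It extracts the tweet text from tweets.csv and writes it to a txt file.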

tweet.py

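import csv
import re

rawfile = "text/tweets.csv"         # CSV produced by the conversion tool
infile = "text/tweets.txt"          # extracted tweet text
outfile = "text/tweets_wakati.txt"  # space-separated tokens

# Copy the third column (the tweet text) of every row into tweets.txt.
# The files are assumed to be UTF-8; newline='' lets csv handle quoted line breaks.
with open(rawfile, 'r', encoding='utf-8', newline='') as fin, \
        open(infile, 'w', encoding='utf-8') as fout:
    for row in csv.reader(fin):
        if len(row) > 2:
            fout.write(row[2])
        fout.write('\n')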
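
As a small sanity check of the extraction step, here is a minimal sketch that runs the same "take the third column" logic on an in-memory CSV string. The helper name `extract_column` and the sample rows and column names are made up for illustration; the real tweets.csv may be laid out differently. Unlike the listing above, this sketch simply skips rows that are too short.

```python
import csv
import io


def extract_column(csv_text, index=2):
    """Return the field at `index` from every row that has enough fields."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    return [row[index] for row in reader if len(row) > index]


# Hypothetical sample data, not the real export format.
sample = (
    'id,date,text\r\n'
    '1,2019-12-01,"hello, world"\r\n'
    '2,2019-12-02,"two\nlines"\r\n'
    '3,2019-12-03\r\n'
)

texts = extract_column(sample)
print(texts)
```

Two things stand out: a header row, if there is one, is extracted like any other row, and a tweet containing a line break stays a single CSV field but ends up spread over several lines once written to a text file.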

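Next, janome splits the text into space-separated words (wakati-gaki), and the model is trained on the result. If you don't have janome installed, run pip install janome first. The model itself comes from nltk, so that needs to be installed as well (pip install nltk).
Note that I filter out alphabetic tokens and lines containing certain symbols or question-box (質問箱) posts, so that the generated sentences are, as far as possible, meaningful Japanese.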
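
To make the filtering concrete, here is a minimal sketch of how the two regular expressions used in tweet.py, `[a-z]+` and `[:/.@#質問●]+`, behave on a few inputs. The function names `keep_line` and `keep_token` are hypothetical and only exist for this illustration.

```python
import re

lowercase = re.compile('[a-z]+')         # a run of lowercase ASCII letters
skip_line = re.compile('[:/.@#質問●]+')  # character class: any ONE of these characters


def keep_line(line):
    """A line is kept only if it contains none of the listed characters."""
    return skip_line.search(line) is None


def keep_token(surface):
    """A token is kept only if it contains no lowercase ASCII letter."""
    return lowercase.search(surface) is None


print(keep_line('今日は寒い'))            # True
print(keep_line('https://example.com'))   # False (contains ':', '/' and '.')
print(keep_line('問題が解けた'))          # False (contains '問' on its own)
print(keep_token('python'))               # False
print(keep_token('NLP'))                  # True (uppercase only)
```

Because the second pattern is a character class, it matches 質 or 問 individually rather than the word 質問, so it drops more lines than just question-box posts. Likewise, the first pattern only looks at lowercase letters, so all-uppercase tokens pass through.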

tweet.py

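from janome.tokenizer import Tokenizer
t = Tokenizer()


with open(infile, 'r', encoding='utf-8') as f:
    data = f.readlines()

# Tokens containing lowercase ASCII letters are dropped.
p = re.compile('[a-z]+')
# Lines containing any one of these characters are skipped entirely.
p2 = re.compile('[:/.@#質問●]+')

with open(outfile, 'w', encoding='utf-8') as f:
    for line in data:
        if p2.search(line):
            continue
        # Write the surviving tokens separated by spaces, one input line per output line.
        for token in t.tokenize(line):
            if not p.search(token.surface):
                f.write(token.surface + ' ')
        f.write('\n')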
        


# Pad each tokenized line with two start markers and one end marker.
words = []
with open(outfile, 'r', encoding='utf-8') as f:
    for l in f:
        if len(l) > 1:
            words.append(('<BOP> <BOP> ' + l + ' <EOP>').split())


from nltk.lm import Vocabulary
from nltk.lm.models import MLE
from nltk.util import ngrams

vocab = Vocabulary([item for sublist in words for item in sublist])

print('Vocabulary size: ' + str(len(vocab)))

# One trigram generator per padded sentence.
text_trigrams = [ngrams(word, 3) for word in words]

n = 3
lm = MLE(order=n, vocabulary=vocab)
lm.fit(text_trigrams)
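
To show what a maximum-likelihood trigram model boils down to, here is a minimal standard-library sketch. It is not NLTK's implementation: it just counts which word follows each pair of words and divides by the total for that pair. The function names and the toy sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def count_trigrams(sentences):
    """Map each two-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ['<BOP>', '<BOP>'] + tokens + ['<EOP>']
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def mle_score(counts, word, context):
    """Estimate P(word | context) as count(context, word) / count(context)."""
    followers = counts.get(context, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


# Hypothetical, already tokenized toy corpus.
toy = [['今日', 'は', '寒い'], ['今日', 'は', '眠い'], ['明日', 'は', '休み']]
counts = count_trigrams(toy)

print(mle_score(counts, '今日', ('<BOP>', '<BOP>')))  # 2 of 3 sentences start with 今日
print(mle_score(counts, '寒い', ('今日', 'は')))      # 0.5
```

The two `<BOP>` markers are what give the first real word of a sentence a full two-word context, and `<EOP>` lets the model learn where sentences end.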

Finally, generate random sentences.

tweet.py

for j in range(10):
    # Every sentence starts from the two start markers.
    context = ['<BOP>', '<BOP>']
    sentence = ''
    for i in range(100):
        # Sample the next word, weighted by its probability given the last two words of context
        w = lm.generate(text_seed=context)

        # Stop at the end marker.
        if w == '<EOP>' or w == '\n':
            break

        context.append(w)
        sentence += w

    print(sentence + '\n')
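
For intuition about the generation loop, here is a minimal standard-library sketch of the same idea: start from two `<BOP>` markers, repeatedly sample the next word in proportion to how often it followed the last two words, and stop at `<EOP>`. This is a simplified stand-in, not what `lm.generate` does internally, and the function names and toy corpus are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_counts(sentences):
    """Count followers of every two-word context in the padded sentences."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ['<BOP>', '<BOP>'] + tokens + ['<EOP>']
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def generate_sentence(counts, rng, max_words=100):
    """Sample words one at a time until <EOP> or max_words is reached."""
    context = ('<BOP>', '<BOP>')
    out = []
    for _ in range(max_words):
        followers = counts.get(context)
        if not followers:
            break
        candidates = sorted(followers)
        weights = [followers[w] for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == '<EOP>':
            break
        out.append(word)
        context = (context[1], word)  # slide the two-word window
    return ''.join(out)


# Hypothetical, already tokenized toy corpus.
toy = [['私', 'は', '猫', 'が', '好き'], ['彼', 'は', '猫', 'が', '嫌い']]
counts = build_counts(toy)
rng = random.Random(0)
sentences = [generate_sentence(counts, rng) for _ in range(10)]
for s in sentences:
    print(s)
```

Even with just two training sentences, the shared context (猫, が) lets the model recombine them, so sentences such as 私は猫が嫌い can appear although they were never in the input. With a real tweet history the same effect produces the odd mixtures that make the output fun.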

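Ten random sentences are printed.
If you copy and paste, either combine the code above into a single file or split it into cells and run it in a Jupyter notebook.
Training the model can take a little while, so I recommend the latter.

Results

You should get ten sentences printed, as described above.
It's fun enough that you can keep trying it endlessly. Please give it a try yourself.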
