
MYJLab Advent Calendar 2019

@sh05_sh05

Visualizing the bookmarks posted to our Slack with Doc2Vec and PCA

Last updated at 2019-12-12, posted at 2019-12-11

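This is the day 11 article of the Advent Calendar.

What this is

I learned distributed representations of the titles of articles that our group (4 people) bookmarked, and visualized them.

Whenever someone bookmarks an article, IFTTT picks it up and posts it to Slack, so the processing starts from those Slack messages.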

Prerequisites

Environment
- Link to the Dockerfile

References / things used

What you get

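Predictions before starting

R
- Mostly gadgets and security
- 79 bookmarks
Y
- The widest range of topics among the four
- In fact, this is the only user who posts a curated selection with the aim of sharing it with everyone
- 864 bookmarks
M
- Web, machine learning, and so on
- 240 bookmarks
S (me)
- Web, machine learning, and gadgets, plus things like "the saury catch is poor this year"
- 896 bookmarks

Result

The spread and the amount of overlap are intuitively close to the predictions.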

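Preparation

Having IFTTT post Hatena Bookmark entries to Slack

Steps

I'll skip the details, but following the flow in the figure below is enough to set this up. Between panels 4 and 5 you need to enter the URL of the RSS feed to subscribe to; since this is Hatena Bookmark, the URL is http://b.hatena.ne.jp/<username>/rss

It looks like this

Why

Why not just follow each other as favorite users on Hatena?

That is not a bad option (you can even do both), but inside our own community we can casually comment on each other's bookmarks like this.

Why not use Slack's /feed command?

With IFTTT the post text can be customized, which makes it usable for experiments like this one. Posts made through the Slack command also take up a lot of screen space, which is a nuisance.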

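Fetching the posted messages from Slack

These two approaches look like the easiest ones.

The Slack API
- https://api.slack.com/methods/channels.history
A tool written in Go (the one used here)
- https://github.com/joefitzgerald/slack-dump
- Binaries are here
  - https://github.com/PyYoshi/slack-dump/releases

Either way a token is required, so obtain one here.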

$ wget https://github.com/PyYoshi/slack-dump/releases/download/v1.1.3/slack-dump-v1.1.3-linux-386.tar.gz
$ tar -zxvf slack-dump-v1.1.3-linux-386.tar.gz
$ linux-386/slack-dump -t=<token> <channel>

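The dump also includes DMs and other data that get in the way, so only the target channel's files are moved to a separate directory.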

python

import zipfile, os

# Extract only the target channel's files; exist_ok avoids an error on reruns
os.makedirs('dumps', exist_ok=True)
with zipfile.ZipFile('./zipfile_name') as z:
    for n in z.namelist():
        if 'channel_name' in n:
            z.extract(n, './dumps')

Open the files and read their contents. There is one file per date, so merge them all into a single list.

python

import json, glob

posts = []
# One JSON file per day; replace <channel_name> with the actual channel name
files = sorted(glob.glob('./dumps/channel/<channel_name>/*.json'))
for file in files:
    with open(file, encoding='utf-8') as f:
        posts += json.load(f)
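
The merge step can be tried in isolation with the small sketch below. It assumes each per-day file holds a JSON list of message objects and is named by date (the `YYYY-MM-DD.json` names here are an assumption, as are the helper name `load_posts` and the sample messages).

```python
import glob
import json
import os
import tempfile

def load_posts(channel_dir):
    """Concatenate the message lists of every per-day JSON file in a directory."""
    posts = []
    # sorted() keeps the days in chronological order for YYYY-MM-DD file names
    for path in sorted(glob.glob(os.path.join(channel_dir, '*.json'))):
        with open(path, encoding='utf-8') as f:
            posts += json.load(f)
    return posts

# Write two hypothetical per-day files, deliberately out of order, then merge them
with tempfile.TemporaryDirectory() as d:
    for day, text in [('2019-12-02', 'second'), ('2019-12-01', 'first')]:
        with open(os.path.join(d, day + '.json'), 'w', encoding='utf-8') as f:
            json.dump([{'text': text}], f)
    merged = load_posts(d)

print([p['text'] for p in merged])
```

Sorting the paths is optional for the later steps, but it makes the merged list reproducible between runs.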

Pull out the messages and associate each article title with a user name (the details here depend on how IFTTT is configured).

python

user_post_dic = {
    'Y': [],
    'S': [],
    'M': [],
    'R': [],
}

for p in posts:
    # Only messages posted by IFTTT carry bookmarks
    if p.get("username") != "IFTTT":
        continue
    for a in p.get("attachments", []):
        # Crude guard: skip attachments with an unknown user or missing fields
        try:
            user_post_dic[a["text"]].append(a["title"])
        except KeyError:
            pass

users = user_post_dic.keys()
print([[u, len(user_post_dic[u])] for u in users])

Output

[['Y', 864], ['S', 896], ['M', 240], ['R', 79]]
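
As a rough illustration of what the loop above expects, the sketch below runs the same filtering on three hand-written messages. The message shape (a `username` of `IFTTT`, the bookmarking user in the attachment `text`, the article title in the attachment `title`) is inferred from the code above and depends on the IFTTT applet settings; the helper name `collect_titles` and the sample data are made up for this example.

```python
# Hypothetical sample messages; real dumps depend on the IFTTT applet settings
sample_posts = [
    {"username": "IFTTT",
     "attachments": [{"text": "S", "title": "matplotlibで日本語"}]},
    # Skipped: an ordinary chat message, not posted by IFTTT
    {"user": "U0000000", "text": "a normal chat message"},
    # Skipped: the attachment text is not one of the known users
    {"username": "IFTTT",
     "attachments": [{"text": "unknown", "title": "ignored"}]},
]

def collect_titles(posts, users):
    """Group attachment titles by the user name stored in the attachment text."""
    titles = {u: [] for u in users}
    for p in posts:
        if p.get("username") != "IFTTT":
            continue
        for a in p.get("attachments", []):
            user = a.get("text")
            if user in titles and "title" in a:
                titles[user].append(a["title"])
    return titles

result = collect_titles(sample_posts, ["Y", "S", "M", "R"])
print(result)
```

Checking membership explicitly does the same job as the try/except guard, while making it visible which messages are dropped.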

Main part

Preprocessing

Cleansing and word segmentation

The posted messages look like the following. The site name and URL are not needed, so they are removed.

ブラウザのテキストエリアでNeovimを使う  <http://Developers.IO|Developers.IO>

フロントエンドエンジニアのためのセキュリティ対策 / #frontkansai 2019 - Speaker Deck

matplotlibで日本語

モダンJavaScript再入門 / Re-introduction to Modern JavaScript - Speaker Deck

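I wasn't sure what the idiomatic way to use re is, so I brute-forced it.
In addition, MeCab is used for word segmentation. The environment also has sudachipy and others installed, but I went with the tool I'm most used to. It's fast, too.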

python

import MeCab, re
m = MeCab.Tagger("-Owakati")

_tag = re.compile(r'<.*?>')
_url = re.compile(r'(http|https)://([-\w]+\.)+[-\w]+(/[-\w./?%&=]*)?')
_title = re.compile(r'( - ).*$')
_par = re.compile(r'\(.*?\)')
_sla = re.compile(r'/.*$')
_qt = re.compile(r'"')
_sep = re.compile(r'\|.*$')
_twi = re.compile(r'(.*)on Twitter: ')
_lab = re.compile(r'(.*) ⇒ \(')
_last_par = re.compile(r'\)$')

# Full-width ASCII (U+FF01..U+FF5E) -> half-width (U+0021..U+007E)
_zen2han = str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})

def clean_text(text):
    text = text.translate(_zen2han)
    text = re.sub(_lab, '', text)
    text = re.sub(_tag, '', text)
    text = re.sub(_url, '', text)
    text = re.sub(_title, '', text)
    text = re.sub(_sla,  '', text)
    text = re.sub(_qt,  '', text)
    text = re.sub(_sep, '', text)
    text = re.sub(_twi, '', text)
    text = re.sub(_par, '', text)
    text = re.sub(_last_par, '', text)
    return text

p_all = []
m_all = []
for u in users:
    user_post_dic[u] = list(map(clean_text, user_post_dic[u]))
    # MeCab returns space-separated tokens; Doc2Vec wants a list of tokens
    m_all += [m.parse(p).split() for p in user_post_dic[u]]
    p_all += [u + '**' + p for p in user_post_dic[u]]

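Each element of p_all is prefixed with the user name because preprocessing can wipe out a title entirely, which shifts the list indices. Gluing the user name onto the title is a makeshift way to keep the two tied together.
(This happens, for example, when a URL was bookmarked as the article title.)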
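
The binding trick can be sketched on its own as follows. The `**` separator is the one used above; the helper names `make_tag` and `split_tag` and the sample pairs are hypothetical. The point is that each tag still names its user even after empty titles are dropped or the order changes.

```python
SEP = "**"  # separator used in the tags above

def make_tag(user, title):
    """Glue the user name onto the cleaned title."""
    return user + SEP + title

def split_tag(tag):
    """Recover (user, title); maxsplit=1 keeps a title containing '**' intact."""
    user, title = tag.split(SEP, 1)
    return user, title

# Hypothetical cleaned titles; the second one was wiped out by preprocessing
pairs = [("Y", "matplotlibで日本語"), ("S", ""), ("S", "モダンJavaScript再入門")]
tags = [make_tag(u, t) for u, t in pairs]

# Drop the empty title and reorder: positions no longer match `pairs`
surviving = sorted(t for t in tags if split_tag(t)[1])
recovered = [split_tag(t) for t in surviving]
print(recovered)
```

A parallel list of user names would go out of sync in the same situation, which is why the tag carries the user itself.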

The titles are now reasonably clean.

ブラウザのテキストエリアでNeovimを使う 
 
フロントエンドエンジニアのためのセキュリティ対策
 
matplotlibで日本語

モダンJavaScript再入門
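
The effect of the cleaning can be checked without MeCab using the reduced sketch below. It applies only three of the patterns from the code above, under names of my own (`TAG`, `SITE`, `SLASH`, `clean_title`), and adds a final `strip()` that the original does not have, so treat it as an illustration rather than the exact pipeline.

```python
import re

# Map full-width ASCII (U+FF01..U+FF5E) to half-width (U+0021..U+007E)
FULL_TO_HALF = str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})

TAG = re.compile(r'<.*?>')       # Slack-style <url|label> markup
SITE = re.compile(r'( - ).*$')   # trailing " - Site Name"
SLASH = re.compile(r'/.*$')      # everything from the first slash onward

def clean_title(text):
    """Normalize width, apply the three patterns in order, trim whitespace."""
    text = text.translate(FULL_TO_HALF)
    for pattern in (TAG, SITE, SLASH):
        text = pattern.sub('', text)
    return text.strip()  # strip() is an addition for this sketch

samples = [
    "ブラウザのテキストエリアでNeovimを使う  <http://Developers.IO|Developers.IO>",
    "モダンJavaScript再入門 / Re-introduction to Modern JavaScript - Speaker Deck",
    "ＡＢＣ１２３",
]
cleaned = [clean_title(s) for s in samples]
print(cleaned)
```

Note that the slash pattern is aggressive: it also truncates any title that legitimately contains a slash, which is acceptable here because only a rough topic signal is needed.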

Doc2Vec

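m_all is the actual text from which the distributed representations are learned.
p_all is nothing more than the label for each document.

I did not tune the parameters carefully.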

python

from gensim import models

# Reference article: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
class LabeledListSentence(object):
    def __init__(self, words_list, labels):
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield models.doc2vec.TaggedDocument(words, ['%s' % self.labels[i]])

sentences = LabeledListSentence(m_all, p_all)
model = models.Doc2Vec(
    alpha=0.025,
    min_count=5,
    vector_size=100,
    epochs=20,
    workers=4
)
# Build the vocabulary from the sentences we have
model.build_vocab(sentences)
model.train(
    sentences,
    total_examples=len(m_all),
    epochs=model.epochs
)

# The order can change, so fetch the tags back from the model
# (gensim 3.x API; in gensim 4 this is model.dv.index_to_key)
tags = model.docvecs.offset2doctag

PCA and plotting

This was my first time using a PCA library. After learning PCA through so many steps, being able to use it in two lines is impressive.

python

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import japanize_matplotlib  # lets matplotlib render Japanese text

vecs = [model.docvecs[p] for p in tags]

# Undo the "user**title" binding
tag_users = [p.split('**', 1)[0] for p in tags]
tag_docs = [p.split('**', 1)[1] for p in tags]

# Finding four colors of similar intensity was hard
cols = ["#0072c2", "#fc6993", "#ffaa1c", "#8bd276"]

# One fixed color per user; sorting keeps the assignment stable across runs
user_cols = dict(zip(sorted(set(tag_users)), cols))

# The plot is a plane, so reduce to 2 dimensions
pca = PCA(n_components=2)
coords = pca.fit_transform(vecs)

fig, ax = plt.subplots(figsize=(16, 12))
x = [v[0] for v in coords]
y = [v[1] for v in coords]

# Plot user by user so that each user gets a legend entry
for u, col in user_cols.items():
    x_of_u = [v for j, v in enumerate(x) if tag_users[j] == u]
    y_of_u = [v for j, v in enumerate(y) if tag_users[j] == u]
    ax.scatter(
        x_of_u,
        y_of_u,
        label=u,
        c=col,
        s=30,
        alpha=1,
        linewidth=0.2,
        edgecolors='#777777'
    )

plt.legend(
    loc='upper right',
    prop={'size': 18}
)
plt.show()
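
For comparison with the "many steps" mentioned above, here is a minimal NumPy sketch of what a 2-component PCA does: center the data, take the SVD, and project onto the top two axes. The function name `pca_2d` and the random stand-in vectors are assumptions for illustration; scikit-learn's implementation is more careful, and its output may differ from this one in the sign of each axis.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # 1. center each dimension
    # 2. SVD of the centered data; the rows of Vt are the principal axes,
    #    ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T     # 3. keep the top two axes

rng = np.random.default_rng(0)
toy_vecs = rng.normal(size=(50, 100))  # stand-in for 50 document vectors
toy_coords = pca_2d(toy_vecs)
print(toy_coords.shape)
```

Because the axes are ordered by singular value, the first output column always carries at least as much variance as the second.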

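The result (shown again)

Predictions before starting

R
- Mostly gadgets and security
- 79 bookmarks
Y
- The widest range of topics among the four
- In fact, this is the only user who posts a curated selection with the aim of sharing it with everyone
- 864 bookmarks
M
- Web, machine learning, and so on
- 240 bookmarks
S (me)
- Web, machine learning, and gadgets, plus things like "the saury catch is poor this year"
- 896 bookmarks

Result

The spread and the amount of overlap are intuitively close to the predictions.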

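Closing

The bookmarks overlap a lot to begin with, so unfortunately the users did not separate cleanly.
With a bit more data, I'd like to try things like inferring the user and building recommendations.

Sorry for posting late (12/11 21:00).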
