
Fetching thread messages from Slack chats and extracting topics

Posted at 2022-03-09, last updated at 2022-03-09

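Goal

Chats on Slack are so frequent that I cannot keep up, so I want to extract the topics being discussed.
Topics are ranked by the sum of the number of reactions and the number of messages in which the topic was discussed.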

Prerequisites

A Slack App has already been created.

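Steps

Data extraction
Preprocessing
Morphological analysis
Document feature computation
Similarity computation
Per-cluster column aggregation
Ranking and sorting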

Import the required libraries

import requests
import numpy as np
import pandas as pd
import time
import json
import re

Data extraction

Connecting to Slack

slack_token = "SLACK_TOKEN"

headersAuth = {
    'Authorization': 'Bearer '+ slack_token
}

For SLACK_TOKEN, use the token shown when you created the app.

Fetch all channel IDs and channel names

def get_all_channels():
    method="https://slack.com/api/conversations.list"
    return requests.get(method, headers=headersAuth).json()['channels']

Fetch all messages

def get_all_messages(channel_id):
    method = "https://slack.com/api/conversations.history"
    payload = {
        "channel": channel_id
    }
    response = requests.get(method, headers=headersAuth, params=payload)
    return response.json()['messages']
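
A single call to conversations.history returns only one page of messages, so a long channel needs cursor-based pagination to be read in full. The following is a minimal sketch of that loop, assuming the response carries the next cursor in response_metadata.next_cursor. The names fetch_all_pages and fake_fetch_page, and the canned pages, are hypothetical stand-ins so that the sketch runs without network access.

```python
from typing import Callable, Dict, List, Optional


def fetch_all_pages(fetch_page: Callable[[Optional[str]], Dict]) -> List[Dict]:
    """Collect the messages from every page of a cursor-paginated response."""
    messages: List[Dict] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        messages.extend(page.get("messages", []))
        # An empty or missing next_cursor means there are no more pages.
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            return messages


# Hypothetical canned pages that stand in for the HTTP responses.
_FAKE_PAGES = {
    None: {"messages": [{"ts": "3.0"}, {"ts": "2.0"}],
           "response_metadata": {"next_cursor": "abc"}},
    "abc": {"messages": [{"ts": "1.0"}],
            "response_metadata": {"next_cursor": ""}},
}


def fake_fetch_page(cursor: Optional[str]) -> Dict:
    """Return the canned page for the given cursor (None means first page)."""
    return _FAKE_PAGES[cursor]


all_messages = fetch_all_pages(fake_fetch_page)
```

In the real pipeline, fetch_page would wrap the requests.get call above and add a "cursor" entry to the payload whenever the cursor is not None.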

Fetch the messages of all channels

channels = get_all_channels()
message_df = pd.DataFrame()
for channel in channels:
    temp_message_df = pd.DataFrame(get_all_messages(channel['id']))
    temp_message_df['channel_id'] = channel['id']
    temp_message_df['channel_name'] = channel['name']
  
    message_df = pd.concat([message_df, temp_message_df], ignore_index=True)

Preprocessing

Create the text data

sentence_df = message_df

sentence_df = sentence_df.rename(columns={'text': 'raw_text'})
# Clean up a message text
def wrangle_text(raw_text: str)-> str:
    # Remove emoji shortcodes such as :raised_hands:
    text=re.sub(':[a-z_]+:', '', raw_text)
    # Remove symbols and punctuation
    text=re.sub('[　「」『』【】\r\n*。、….？！!?-]', '', text)
    # Lowercase alphabetic characters
    text=text.lower()
    # Strip sentence-final filler (〜, ー, and the laughter markers 草, 笑, w)
    text=re.sub('[〜ー草笑w]+$', '', text)
    return text

sentence_df['text'] = sentence_df['raw_text'].apply(wrangle_text)
sentence_df['url'] = sentence_df[['channel_id', 'ts']].apply(lambda rec: f"https://tonkotsu-tonkotsu.slack.com/archives/{rec['channel_id']}/p{str(rec['ts']).replace('.','')}", axis=1)

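The emoji-removal step exists because a chat message such as 「頑張れ！！🙌応援します！」 ("Hang in there!! 🙌 I'm rooting for you!") is recorded as

頑張れ！！:raised_hands:応援します！

so the emoji shortcode is stripped out.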
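
To make the effect of the cleaning steps concrete, here is a small self-contained sketch that repeats the same regular expressions and applies them to the sample above. The extra inputs are illustrative assumptions, not messages from the original data.

```python
import re


def wrangle_text(raw_text: str) -> str:
    """Apply the same cleaning steps as in the preprocessing code."""
    # Remove emoji shortcodes made of lowercase letters and underscores.
    text = re.sub(':[a-z_]+:', '', raw_text)
    # Remove symbols and punctuation.
    text = re.sub('[　「」『』【】\r\n*。、….？！!?-]', '', text)
    # Lowercase alphabetic characters.
    text = text.lower()
    # Strip sentence-final filler and laughter markers.
    text = re.sub('[〜ー草笑w]+$', '', text)
    return text


cleaned = wrangle_text('頑張れ！！:raised_hands:応援します！')
laughter_removed = wrangle_text('それなwww')
# A shortcode containing characters outside [a-z_] is left in place.
not_matched = wrangle_text('OK:+1:')
```

Note that the pattern only covers shortcodes made of lowercase letters and underscores, so a shortcode such as :+1: survives the cleaning.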

Reaction count

# Sum the counts of all reactions on each message (0 when there are none)
sentence_df['reaction_count'] = [sum(reaction['count'] for reaction in li) if isinstance(li, list) else 0 for li in sentence_df["reactions"].values]
sentence_df=sentence_df.iloc[::-1]

The API returns the newest messages first, so [::-1] reverses them into chronological order.
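
As a sketch of what the reaction count works on, assume each message may carry a reactions list whose entries have a name and a count, and that messages without reactions lack the field (which becomes NaN in a DataFrame). The helper name count_reactions and the sample messages are hypothetical.

```python
from typing import Any


def count_reactions(reactions: Any) -> int:
    """Sum the per-emoji counts; anything that is not a list counts as 0."""
    if not isinstance(reactions, list):
        return 0
    return sum(reaction["count"] for reaction in reactions)


# Hypothetical messages, newest first as the API returns them.
sample_messages = [
    {"text": "newer", "reactions": [{"name": "raised_hands", "count": 2},
                                    {"name": "tada", "count": 1}]},
    {"text": "older"},
]

counts = [count_reactions(m.get("reactions")) for m in sample_messages]
# Reverse into chronological order, like iloc[::-1] on the DataFrame.
chronological = sample_messages[::-1]
```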

Remove unneeded data

filtered_df = sentence_df
# Drop unneeded rows
filtered_df=filtered_df.replace('', np.nan)
filtered_df=filtered_df.dropna(subset=['text'])
# Keep only ordinary messages (no subtype); copy so the assignment below does not write to a view
filtered_df=filtered_df[filtered_df['subtype'].isnull()].copy()
filtered_df["ts"]=filtered_df["ts"].apply(lambda x: int(x.replace(".","")))
# Drop unneeded columns
filtered_df = filtered_df[['client_msg_id', 'text', 'ts', 'reaction_count', 'url']]
# Reset the index
filtered_df=filtered_df.reset_index(drop=True)

Morphological analysis


import MeCab
import ipadic

mecab = MeCab.Tagger(ipadic.MECAB_ARGS)

def filter_word(word)->bool:
    # word is [surface, part of speech, sub-category 1, sub-category 2].
    # Keep only general nouns (一般) and sahen-setsuzoku nouns (サ変接続).
    # In the IPAdic layout サ変接続 is a sub-category of 名詞, so it is checked in word[2].
    return len(word) > 2 and word[1] == "名詞" and word[2] in ("一般", "サ変接続")

def analysis_text(t: str) -> list:
    mecab_results = mecab.parse(t).splitlines()
    mecab_results.remove('EOS')
    return list(map(lambda result: result.replace('\t', ',').split(',')[:4], mecab_results))


words_df = filtered_df.copy()
mecab_results = filtered_df['text'].apply(lambda text: analysis_text(text))
words_df['mecab_results'] = mecab_results.apply(lambda words: list(filter(filter_word, words)))
words_df['words'] = words_df['mecab_results'].apply(lambda words: list(map(lambda word: word[0], words)))
words_df['wakachi_word'] = words_df['words'].apply(lambda words: ' '.join(words)) # space-separated words; their vectors get averaged
words_df = words_df[words_df['wakachi_word'].str.len() > 0].copy() # drop rows left empty by the filter

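I felt that an output topic needs to be a noun, so only sentences containing nouns are extracted.
Even among nouns, formal nouns such as 「こと」, temporal nouns such as 「今日」 (today) and 「前」 (before), and interjections such as 「お疲れ様」 turned out to be noise, so only sahen-setsuzoku nouns (サ変接続, nouns that can take する) and general nouns (名詞・一般) are kept.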
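
The filtering rule can be illustrated without installing MeCab. The sketch below parses a few hand-written lines in the IPAdic output layout (surface, a tab, then comma-separated features); they are an assumed illustration of the format, not output captured from the tagger. The names parse_lines and is_topic_noun are hypothetical.

```python
# Hand-written lines in the IPAdic layout: surface<TAB>pos,sub1,sub2,...
SAMPLE_MECAB_OUTPUT = "\n".join([
    "今日\t名詞,副詞可能,*,*,*,*,今日,キョウ,キョー",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ",
    "課題\t名詞,一般,*,*,*,*,課題,カダイ,カダイ",
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
    "勉強\t名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー",
    "こと\t名詞,非自立,一般,*,*,*,こと,コト,コト",
    "EOS",
])


def parse_lines(output: str) -> list:
    """Turn tagger output into [surface, pos, sub1, sub2] rows."""
    rows = []
    for line in output.splitlines():
        if line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        rows.append([surface] + features.split(",")[:3])
    return rows


def is_topic_noun(row: list) -> bool:
    """Keep general nouns and sahen-setsuzoku nouns only."""
    return row[1] == "名詞" and row[2] in ("一般", "サ変接続")


topic_words = [row[0] for row in parse_lines(SAMPLE_MECAB_OUTPUT)
               if is_topic_noun(row)]
```

The temporal noun 「今日」 and the formal noun 「こと」 are both tagged 名詞 but fall into other sub-categories, so they are filtered out.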

Document feature computation


import fasttext
import fasttext.util

path="cc.ja.300.bin"
ft = fasttext.load_model(path)

words_df["vector"]=[np.array([ft[word] for word in lis]).sum(0)/len(lis) for lis in words_df['words']]

I use a pre-trained fastText model (cc.ja.300.bin), which was trained on text data including Wikipedia.
I downloaded the model file in advance.
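
The document vector is simply the component-wise mean of the word vectors. A minimal sketch with made-up 3-dimensional embeddings (the real model returns 300-dimensional vectors; toy_embeddings and mean_vector are hypothetical names):

```python
def mean_vector(vectors: list) -> list:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Made-up 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "課題": [1.0, 0.0, 2.0],
    "mac": [3.0, 2.0, 0.0],
}

words = ["課題", "mac"]
doc_vector = mean_vector([toy_embeddings[w] for w in words])
```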

Similarity computation

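import scipy.spatial.distance as dis

group_df=words_df.copy()

# Vector of the first message
word1=words_df["vector"].iloc[0]
# Cluster (group) number assigned to each message
group=[0]
group_num=0
# Cosine distance to the previous message (0 for the first one)
dist_lis=[0]

for i in range(1,len(words_df)):
    # scipy's cosine() returns the cosine distance, i.e. 1 - cosine similarity
    dist=dis.cosine(word1,words_df["vector"].iloc[i])
    dist_lis.append(dist)
    # A distance above 0.6 from the previous message starts a new topic
    if dist >0.6:
        group_num+=1
    group.append(group_num)
    word1= words_df["vector"].iloc[i]
group_df["clu"]=dist_lis
group_df["group"]=group
group_df["id"]=group_df.index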

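The mean of the noun vectors of each message is treated as the vector of that message, and the similarity between consecutive messages is computed.
The measure is based on cosine similarity: the code uses SciPy's cosine distance (1 minus the cosine similarity) and starts a new topic when the distance from the previous message exceeds 0.6.

Messages without nouns turned out to be mostly expressions of agreement such as 「それなー」 ("so true") and 「たしかに」 ("indeed"), so they are judged to belong to the same topic as the preceding message.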
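
The sequential grouping can be sketched in plain Python on toy 2-dimensional vectors. This is an illustration under assumed inputs, with the 0.6 distance threshold taken from the code above; cosine_distance and assign_groups are hypothetical helper names.

```python
import math


def cosine_distance(u: list, v: list) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def assign_groups(vectors: list, threshold: float = 0.6) -> list:
    """Start a new group whenever a message is far from the previous one."""
    if not vectors:
        return []
    groups = [0]
    for prev, cur in zip(vectors, vectors[1:]):
        step = 1 if cosine_distance(prev, cur) > threshold else 0
        groups.append(groups[-1] + step)
    return groups


# Toy vectors: two messages pointing one way, then two pointing another.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = assign_groups(toy_vectors)
```

Because each message is compared only with its immediate predecessor, a topic that drifts gradually stays in one group, while a conversation that returns to an earlier topic gets a new group number.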

Aggregating columns per cluster

clustered_df = group_df.groupby('group').agg({
    'text': list,
    'ts': min,
    'url': list,
    'words': sum,
    'reaction_count': sum,
    'client_msg_id': 'count'
})
clustered_df = clustered_df.rename({'client_msg_id': 'message_count'}, axis=1)

def count_with_sort(values):
    # Return the two most frequent words
    values, count = np.unique(values, return_counts=True)
    return values[np.argsort(-count)][:2]
clustered_df['top_words'] = clustered_df['words'].apply(count_with_sort)
# Message count: the span from the first to the last id in each group,
# so that the noun-less messages dropped earlier are counted as well
ddf = group_df.groupby('group')
message_counts=ddf["id"].max()-ddf["id"].min()+1


clustered_df["message_count"]=message_counts
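
Two parts of the aggregation can be illustrated in isolation: the message count is taken as the span of original row ids so that dropped noun-less messages still count, and the topic label comes from the most frequent words. The sketch below uses collections.Counter instead of np.unique, so its tie-breaking order may differ from the code above; the ids and words are made up.

```python
from collections import Counter


def span_message_count(ids: list) -> int:
    """Count every message between the first and last id of a group."""
    return max(ids) - min(ids) + 1


def top_words(words: list, n: int = 2) -> list:
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(words).most_common(n)]


# Rows 2 and 3 had no nouns and were dropped, but still belong to the group.
group_ids = [0, 1, 4]
message_count = span_message_count(group_ids)

group_words = ["鍋", "スタミナ", "鍋", "肉", "スタミナ", "鍋"]
label_words = top_words(group_words)
```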

Ranking and sorting

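Only topics with at least 3 messages are kept (unless the same topic continues for 3 or more messages, it hardly counts as a conversation). The data frame above is then sorted by the sum of

the number of reactions
the number of messages

and the top 5 are kept.

Each output topic is the combination of the two most frequent nouns in the conversation.
I figured this conveys the content better than a single word.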


ranked_df = clustered_df[clustered_df['message_count']>2].copy()
ranked_df['score'] = ranked_df['message_count'] + ranked_df['reaction_count']
# Use the URL of the first message in each cluster
ranked_df['url']=[i[0] for i in ranked_df['url']]
df_csv=ranked_df.sort_values(by='score', ascending=False)[:5].copy()
# Rank starting at 1 (there may be fewer than 5 rows)
df_csv["id"]=np.arange(1,len(df_csv)+1)
df_csv["word"]=["".join(i) for i in df_csv["top_words"]]
df_csv=df_csv.reset_index()[["id","url","word"]]
df_csv
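
The ranking rule (at least 3 messages, score = messages + reactions, top 5) can be sketched without pandas as follows. The cluster records are invented for illustration and rank_topics is a hypothetical helper name.

```python
def rank_topics(clusters: list, min_messages: int = 3, top_n: int = 5) -> list:
    """Filter short clusters, score the rest, and keep the best top_n."""
    eligible = [c for c in clusters if c["message_count"] >= min_messages]
    scored = [dict(c, score=c["message_count"] + c["reaction_count"])
              for c in eligible]
    return sorted(scored, key=lambda c: c["score"], reverse=True)[:top_n]


# Invented clusters: topic_b has many reactions but too few messages.
clusters = [
    {"word": "topic_a", "message_count": 5, "reaction_count": 1},
    {"word": "topic_b", "message_count": 2, "reaction_count": 9},
    {"word": "topic_c", "message_count": 3, "reaction_count": 4},
]

ranked = rank_topics(clusters)
ranked_words = [c["word"] for c in ranked]
```

With this rule a heavily reacted-to but short exchange is dropped before scoring, which matches the intent that a topic should span an actual conversation.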

Results

With the data I used, the results were as follows.

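課題mac ("assignment" + "mac")
鍋スタミナ ("hot pot" + "stamina")
jstぼけ ("JST" + "boke")
バイト寝坊 ("part-time job" + "oversleeping")
バターお菓子 ("butter" + "sweets")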

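Honestly, I would have liked 「jstぼけ」 to come out as 「時差ボケ」 (jet lag) instead, but the topics seem to have been captured reasonably well.
This time I used the conversation data for the entire period. To narrow the data down, the ts property in the JSON returned by the API represents the time (Unix time), so filtering on it should be enough.
However, I only had 243 messages, and with too little data the accuracy drops, so I used the entire period.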
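
Filtering on ts could look like the sketch below, which assumes ts is the string form of a Unix timestamp in seconds with a fractional part, as in the raw API response (before it is converted to an integer in the preprocessing step). The messages and the helper name filter_by_time are made up for illustration.

```python
from datetime import datetime, timezone


def filter_by_time(messages: list, oldest: datetime, latest: datetime) -> list:
    """Keep messages whose ts falls inside [oldest, latest]."""
    lower, upper = oldest.timestamp(), latest.timestamp()
    return [m for m in messages if lower <= float(m["ts"]) <= upper]


# Made-up messages at 2022-03-08, 2022-03-09 and 2022-03-10, 00:00 UTC.
messages = [
    {"ts": "1646697600.000200", "text": "a"},
    {"ts": "1646784000.000100", "text": "b"},
    {"ts": "1646870400.000300", "text": "c"},
]

window_start = datetime(2022, 3, 8, 12, 0, tzinfo=timezone.utc)
window_end = datetime(2022, 3, 9, 12, 0, tzinfo=timezone.utc)
recent = filter_by_time(messages, window_start, window_end)
recent_texts = [m["text"] for m in recent]
```

The conversations.history method also accepts oldest and latest parameters, so the same window could be requested from the API directly instead of filtering afterwards.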
