
Visualizing how words move between topics as the number of topics changes in an NMF topic model

Last updated at 2018-02-28, posted at 2018-02-27

Background: classifying (clustering) documents with a topic model

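A topic model based on NMF (non-negative matrix factorization) groups natural-language documents into a specified number of topics (clusters).
For background on NMF topic models, see https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
It groups not only the documents but also the words they contain into topics, so looking at the words in a topic gives a rough picture of what that topic is about.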
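To make the idea concrete, here is a minimal NumPy sketch of NMF using the classic multiplicative update rules. It is not the scikit-learn solver used in the linked article, and the function name `nmf`, the toy matrix `V`, and the iteration count are assumptions made for illustration. `V` is a document-term matrix, `W` holds document-topic weights, and `H` holds topic-word weights.

```python
import numpy as np

def nmf(V, n_topics, n_iter=300, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (docs x terms) as W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics)) + eps   # document-topic weights
    H = rng.random((n_topics, V.shape[1])) + eps   # topic-term weights
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term counts (hypothetical): two groups of documents
V = np.array([[2, 2, 0, 0],
              [3, 3, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 4, 4]], dtype=float)
W, H = nmf(V, n_topics=2)
doc_topics = W.argmax(axis=1)   # strongest topic for each document
```

Each row of `H` ranks the vocabulary for one topic, which is where the per-topic word lists used in the rest of this article come from.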

Problem: the motivation for this article

In practice you try several topic counts and look for one where the topics show clear characteristics. I was curious how the word clusters change when the topic count changes, so I visualized it.

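With hierarchical clustering, when the number of clusters increases you can immediately tell which cluster a new cluster split from, and clusters that did not split keep the same contents. NMF, on the other hand, runs the clustering separately for each specified number of clusters, so there is no inherent relationship between the clusters obtained with 5 clusters and those obtained with 6. Here I compute how many words each cluster at n clusters shares with each cluster at n+1 clusters, and visualize the result.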

Implementation: build the data in Python, draw a sankey diagram with d3.js

For the implementation, I first do a quick calculation of how words move between clusters in Python, then build the data, and finally use d3.js to draw a sankey diagram.

Original data format

The words clustered into n clusters are stored in the format below, one row per cluster. Each cluster has 30 words.

1	2	...	30
word_1_1	word_1_2	...	word_1_30
word_2_1	word_2_2	...	word_2_30
...	...	...	...
word_n_1	word_n_2	...	word_n_30
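The article does not say how this table was produced. A plausible sketch, assuming each row lists the highest-weighted terms of one topic in a topic-term weight matrix, is shown below. The names `top_terms`, `H`, and `vocab` and all of the values are hypothetical.

```python
import numpy as np

def top_terms(H, vocab, k):
    """Return, for each topic (row of H), its k highest-weighted terms."""
    order = np.argsort(-H, axis=1)[:, :k]   # term indices, largest weight first
    return [[vocab[j] for j in row] for row in order]

# Hypothetical topic-term weights: 2 topics over a 5-word vocabulary
vocab = ["apple", "banana", "car", "train", "bus"]
H = np.array([[0.9, 0.7, 0.0, 0.1, 0.2],
              [0.0, 0.1, 0.8, 0.6, 0.5]])
rows = top_terms(H, vocab, k=3)
```

With k=30 and a real vocabulary, each entry of `rows` would correspond to one line of the table above.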

Computing the word transitions between clusters in Python

Before visualizing, I compute how many words the clusters share with a Python script like the following.

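import numpy as np
import pandas as pd

def in_np_array(f0):
    "decorator: reshape the flat result of f0 into a (topics0 x topics1) array"
    def f(terms0, terms1):
        res = f0(terms0, terms1)
        return np.array(res).reshape(terms0.shape[0], terms1.shape[0])
    return f

@in_np_array
def term_matrix(terms0, terms1):
    "return the number of words shared by each pair of topics in terms0 and terms1"
    # each DataFrame row holds the words of one topic
    res = [len(set(terms0.iloc[row0]).intersection(set(terms1.iloc[row1])))
           for row0 in range(terms0.shape[0])
           for row1 in range(terms1.shape[0])]
    return res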

>>> term5 = pd.read_csv('topic_terms_5topics.tsv', sep='\t', header=None)
>>> term6 = pd.read_csv('topic_terms_6topics.tsv', sep='\t', header=None)
>>> term_matrix(term5, term6)
array([[23,  3,  3,  3,  1,  7],
       [ 1, 29,  1,  4,  2,  1],
       [ 1,  1, 25,  0,  1,  1],
       [ 2,  5,  0, 30,  0,  1],
       [ 0,  2,  1,  0, 29,  0]])

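term5 and term6 are the data for 5 and 6 topics respectively. In the resulting 5x6 matrix, the large values sit on the diagonal, at positions (n, n). In other words, topic n of the 5-topic model and topic n of the 6-topic model share many words and have similar characteristics. The transition looks fairly clean, so let's visualize it.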
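Reading the diagonal by eye works here, but the same check can be done programmatically. The following sketch takes the matrix printed above and finds, for each topic of the 6-topic model, the topic of the 5-topic model that shares the most words with it. The variable names are assumptions for illustration.

```python
import numpy as np

# Overlap matrix copied from the output above (rows: 5 topics, columns: 6 topics)
overlap = np.array([[23,  3,  3,  3,  1,  7],
                    [ 1, 29,  1,  4,  2,  1],
                    [ 1,  1, 25,  0,  1,  1],
                    [ 2,  5,  0, 30,  0,  1],
                    [ 0,  2,  1,  0, 29,  0]])

# For each column, the row with the largest overlap and the size of that overlap
closest_parent = overlap.argmax(axis=0)
shared = overlap.max(axis=0)
for new, (old, n) in enumerate(zip(closest_parent, shared)):
    print("6-topic #%d <- 5-topic #%d (%d shared words)" % (new, old, n))
```

The first five columns map to the topic with the same index. The extra sixth column is closest to topic 0, but it shares at most 7 words with any topic of the 5-topic model.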

Visualizing with d3.js: sankey diagram

For the visualization I use d3-sankey, a d3.js plugin, to draw a sankey diagram. (It seems plotly can do this too: https://medium.com/@plotlygraphs/4-interactive-sankey-diagram-made-in-python-3057b9ee8616 )

To use d3-sankey, you need to prepare data like the following.

{
  "nodes": [
    {"name": "foo"},
    {"name": "bar"}
  ],
  "links": [
    {"source": 0, "target": 1, "value": 5}
  ]
}

You could read each tsv file on the JS side and build the data there, but this time I also create the data in Python.


def _calc_links(terms0, terms1):
    "return list of dict of links between two consecutive topic counts"
    lt0 = terms0.shape[0]
    lt1 = terms1.shape[0]

    def link(t0, r0, t1, r1):
        # value = number of words shared by topic r0 of t0 and topic r1 of t1
        value = len(set(t0.iloc[r0]).intersection(set(t1.iloc[r1])))
        # nodes are named "<number of topics>_<topic index>", e.g. "5_0"
        return {"source": "%d_%d" % (lt0, r0), "target": "%d_%d" % (lt1, r1), "value": value}

    res = [link(terms0, row0, terms1, row1) for row0 in range(lt0)
                                            for row1 in range(lt1)]
    return res

def calc_links(*terms_lst):
    "return list of dict of links"
    # concatenate the links of every consecutive pair (5 -> 6, 6 -> 7, ...)
    return sum([_calc_links(t0, t1) for t0, t1 in zip(terms_lst, terms_lst[1:])], [])

def calc_nodes(*terms_lst):
    "return list of dict of nodes"
    return [{"name": "%d_%d" % (terms.shape[0], row)} for terms in terms_lst for row in range(terms.shape[0])]

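>>> import json
>>> # term7 ... term10 are assumed to be loaded the same way as term5 and term6
>>> data = {
...     "nodes": calc_nodes(term5, term6, term7, term8, term9, term10),
...     "links": calc_links(term5, term6, term7, term8, term9, term10)
... }
>>> with open('data.json', 'w') as f: json.dump(data, f)
...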
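Note that the sample format shown earlier identifies nodes by index, while the links built by calc_links identify them by name such as "5_0". One way to reconcile the two is d3-sankey's nodeId accessor on the JS side. Another, sketched below, is to convert the names to indices in Python before dumping. The helper name `links_to_indices`, the choice to drop zero-valued links, and the example values are assumptions, not part of the original script.

```python
def links_to_indices(data):
    """Return a copy of data whose links refer to nodes by position in data["nodes"]."""
    index_of = {node["name"]: i for i, node in enumerate(data["nodes"])}
    links = [{"source": index_of[link["source"]],
              "target": index_of[link["target"]],
              "value": link["value"]}
             for link in data["links"]
             if link["value"] > 0]   # assumption: skip pairs that share no words
    return {"nodes": data["nodes"], "links": links}

# Tiny hand-made example in the same shape as `data` (values are hypothetical)
example = {
    "nodes": [{"name": "2_0"}, {"name": "2_1"}, {"name": "3_0"}],
    "links": [{"source": "2_0", "target": "3_0", "value": 12},
              {"source": "2_1", "target": "3_0", "value": 0}],
}
converted = links_to_indices(example)
```

Dropping zero-valued links is optional. It only keeps the JSON smaller and avoids asking the layout to place links that carry no words.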

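Now that the data is ready, let's visualize it with d3.js. This time I wrote all the JS in a single index.html.
d3-sankey does not seem to be included in the core d3.js, so load it separately alongside d3.js.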

 <script src="https://d3js.org/d3.v4.min.js"></script>
 <script src="https://unpkg.com/d3-sankey@0"></script>

The rest is fairly simple: define d3.sankey with some settings,


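// The links in data.json refer to nodes by name ("5_0" etc.),
// so nodes are looked up by name instead of by the default numeric index.
var sankey = d3.sankey()
               .nodeId(function(d) { return d.name; })
               .nodeWidth(15)
               .nodePadding(20)
               .extent([[1,1], [width - 15, height - 10]]);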

then pass it the data loaded with d3.json or similar:

sankey(data);

and it computes all the node and link positions you need. All that remains is rendering with d3.js, but the basic code is the same as the d3-sankey example, so I omit it here.

Summary: the resulting visualization and what I want to do next

Here is the resulting figure.

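From left to right, it shows how the words in each cluster move as the number of topics goes from 5 to 10.
Seen this way, the transitions are quite clean, which is a nice result for an NMF topic model.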

The node colors are currently arbitrary, so next I would like to give the same color to the most similar topics between n and n+1 topics.
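One possible way to do this, sketched under the assumption that a greedy one-to-one matching on shared-word counts is good enough, is to let each topic inherit the color id of its most similar predecessor and give unmatched topics a fresh id. The function name `propagate_colors` is hypothetical, and this is only one of several reasonable matching strategies.

```python
def propagate_colors(overlaps):
    """Return one list of color ids per level, reusing ids for the most similar topics.

    overlaps[k][i][j] is the number of words shared by topic i at level k
    and topic j at level k + 1.
    """
    n_first = len(overlaps[0])
    levels = [list(range(n_first))]   # first level: one fresh color per topic
    next_id = n_first
    for matrix in overlaps:
        prev = levels[-1]
        colors = [None] * len(matrix[0])
        # Greedy one-to-one matching: the strongest overlaps are matched first.
        pairs = sorted(((v, i, j) for i, row in enumerate(matrix)
                        for j, v in enumerate(row)), key=lambda p: -p[0])
        used_old = set()
        for v, i, j in pairs:
            if v > 0 and i not in used_old and colors[j] is None:
                colors[j] = prev[i]
                used_old.add(i)
        for j in range(len(colors)):  # topics without a match get a new color
            if colors[j] is None:
                colors[j] = next_id
                next_id += 1
        levels.append(colors)
    return levels

# The 5 -> 6 overlap matrix computed earlier in this article
overlap_5_to_6 = [[23, 3, 3, 3, 1, 7],
                  [1, 29, 1, 4, 2, 1],
                  [1, 1, 25, 0, 1, 1],
                  [2, 5, 0, 30, 0, 1],
                  [0, 2, 1, 0, 29, 0]]
levels = propagate_colors([overlap_5_to_6])
```

Passing the matrices for 5 to 6, 6 to 7, and so on in order yields a color id for every node, which could then be written into each node dict and mapped to a d3 color scale when rendering.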
