
Exploring Connections Between Friends with a Social Graph

Posted at 2018-02-04

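I first wanted to do this with Twitter followers, but dealing with the API limits was a hassle and would have taken too long, so I dropped that idea. Instead, I'll visualize a social graph built from the kind of data format that shows up in competitive programming problems.

The data looks like this.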

user_id	follower_id
1	2
1	3
4	5
6	10
7	4

The table above is based on the sample input of an Aizu problem: Aizu Online Judge - Party (パーティー)

To make it clear who is connected to whom, each row gets a group ID.

For the example above:

user_id	follower_id	GroupID
1	2	1
1	3	1
4	5	2
6	10	3
7	4	2

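This is the result, for the following reasons.

The first row, 1 and 2, gets a freshly issued ID and becomes Group 1.
3 is linked to 1, so it is also Group 1.
4 and 5 are both new users, so they get a new ID, Group 2. The same applies to 6 and 10, which become Group 3.
7 is new, but it is linked to 4, so it joins Group 2.

Now suppose user 5 turns out to be friends with user 1. The table then becomes: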

user_id	follower_id	GroupID
1	2	1
1	3	1
4	5	1
6	10	3
7	4	1
5	1	1

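Since 5 and 1 are now known to be friends, 5 is updated to GroupID 1.
5's friend 4 then becomes 1 as well, and so does 4's friend 7.

The implementation that assigns these group IDs is below.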
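The walk-through above can be sketched in a few lines of plain Python. This is only an illustration of the steps as described, not the implementation used in this post: the function name `assign_group_ids` and the counter `next_id` are made up for the example, and merging is done by naively relabelling every member of the larger-numbered group.

```python
def assign_group_ids(edges):
    """Assign a group ID to every user while reading the edges in order."""
    group_of = {}  # user id -> current group id
    next_id = 1    # next unused group id (hypothetical counter)
    for user, follower in edges:
        gu, gf = group_of.get(user), group_of.get(follower)
        if gu is None and gf is None:
            # Both users are new: issue a fresh group id.
            group_of[user] = group_of[follower] = next_id
            next_id += 1
        elif gu is None:
            # Only the follower is known: the user joins its group.
            group_of[user] = gf
        elif gf is None:
            # Only the user is known: the follower joins its group.
            group_of[follower] = gu
        elif gu != gf:
            # Two existing groups turn out to be connected:
            # relabel the larger id to the smaller one.
            keep, drop = min(gu, gf), max(gu, gf)
            for member, gid in group_of.items():
                if gid == drop:
                    group_of[member] = keep
    return group_of


edges = [(1, 2), (1, 3), (4, 5), (6, 10), (7, 4)]
before = assign_group_ids(edges)             # first table
after = assign_group_ids(edges + [(5, 1)])   # table after 5 befriends 1

for user, follower in edges + [(5, 1)]:
    print(user, follower, after[user])
```

Adding the edge `(5, 1)` is what pulls users 4, 5, and 7 from group 2 into group 1, while 6 and 10 keep group 3.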

import numpy as np
import pandas as pd
from collections import Counter


def get_group_id():
    """Run one pass over the edges; return True when no label changed."""
    before = Counter(friends_dict.values())
    for row in df.itertuples():
        user_known = row.user in friends_dict
        follower_known = row.follower in friends_dict
        if user_known and follower_known:
            # Both already labelled: move both onto the smaller group ID.
            smaller = min(friends_dict[row.user], friends_dict[row.follower])
            friends_dict[row.user] = smaller
            friends_dict[row.follower] = smaller
        elif follower_known:
            # Only the follower is labelled: the user joins its group.
            friends_dict[row.user] = friends_dict[row.follower]
        elif user_known:
            # Only the user is labelled: the follower joins its group.
            friends_dict[row.follower] = friends_dict[row.user]
        else:
            # Neither has been seen yet: start a group from this row's ID.
            friends_dict[row.user] = row.GroupID
            friends_dict[row.follower] = row.GroupID
    after = Counter(friends_dict.values())
    # Labels only ever decrease, so identical counts mean nothing moved.
    return before == after


if __name__ == "__main__":
    # 30,000 rows of random IDs below 30,000, the sizes quoted in the text.
    df = pd.DataFrame(np.random.randint(0, 30000, size=(30000, 2)),
                      columns=['user', 'follower'])

    df['GroupID'] = df.index + 1  # provisional ID: 1-based row number
    friends_dict = {}             # user/follower ID -> group ID
    # Repeat full passes until the labels stop changing.
    while not get_group_id():
        pass

    df['GroupID'] = df['user'].map(friends_dict)

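There is probably a better algorithm, but for now this does assign the GroupIDs.
If anyone knows one, please let me know m(_ _)m.

Running it on 30,000 records filled with random integers below 30,000 gave the following.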
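One commonly used alternative for this kind of grouping is a union-find (disjoint-set) structure, which handles the edges in a single pass instead of repeating passes until nothing changes. The sketch below is a minimal version with path compression; the function names `find` and `connected_groups` are chosen for this example, and the group label here is simply the smallest ID in each group, which differs from the row-number labels used above.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second walk: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def connected_groups(edges):
    """Map every node to the smallest node id in its connected group."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Keep the smaller id as the root so labels are deterministic.
            parent[max(ra, rb)] = min(ra, rb)
    return {node: find(parent, node) for node in parent}


edges = [(1, 2), (1, 3), (4, 5), (6, 10), (7, 4), (5, 1)]
groups = connected_groups(edges)
print(groups)
```

With a DataFrame like the one in this post, the edges could be taken from `df[['user', 'follower']].itertuples(index=False)` and the labels written back with `df['user'].map(groups)`.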

df['GroupID'].value_counts()[:10]

1        28786
738         13
1160        12
81          10
953          9
476          9
5362         8
1149         8
446          8
4256         8

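So about 96% of the rows (28,786 of 30,000), and the users in them, are connected in some way.

Visualization

Let's check whether that is really the case.
I'll visualize the data above as a social graph, borrowing a handy tool called Cytoscape.

Cytoscape can be installed from its official site.

First, export the data above as a CSV.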
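As a rough sanity check on that figure, the result can be compared with the standard giant-component estimate for a sparse random graph, which is the percolation picture listed in the references. The sketch below assumes the data behaves like such a graph with mean degree c = 2 × rows / number of possible IDs; the function name `giant_component_share` is made up for this example. It solves S = 1 − exp(−cS) by fixed-point iteration and estimates the share of rows inside the giant component as 1 − (1 − S)².

```python
import math


def giant_component_share(mean_degree, iterations=200):
    """Solve S = 1 - exp(-c * S) by fixed-point iteration."""
    s = 1.0
    for _ in range(iterations):
        s = 1.0 - math.exp(-mean_degree * s)
    return s


rows = 30_000           # number of records (edges), from the text
id_range = 30_000       # number of possible user ids, from the text
largest_group = 28_786  # rows in GroupID 1, from the output above

observed = largest_group / rows
mean_degree = 2 * rows / id_range           # assumption: random graph model
node_share = giant_component_share(mean_degree)
edge_share = 1 - (1 - node_share) ** 2      # a row is outside only if both ends are

print(f"observed share of rows : {observed:.3f}")
print(f"predicted share of rows: {edge_share:.3f}")
```

Under these assumptions both numbers come out at roughly 96%, so one dominant group is what randomly generated data of this density would be expected to produce.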

df.to_csv('Social_DataSample.csv', index=False)
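Since Cytoscape only accepted UTF-8 in my case, it may help to make the encoding explicit. As far as I know, pandas' `to_csv` already writes UTF-8 by default; the sketch below shows the same export with only the standard library, where the encoding is spelled out. The helper name `write_edges_csv` and the file name `Social_DataSample_small.csv` are made up for this example.

```python
import csv


def write_edges_csv(path, edges):
    """Write (user, follower) pairs as a UTF-8 CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "follower"])
        writer.writerows(edges)


path = "Social_DataSample_small.csv"
write_edges_csv(path, [(1, 2), (1, 3), (4, 5)])

# Read the file back to confirm the layout Cytoscape will see.
with open(path, newline="", encoding="utf-8") as f:
    rows_read = list(csv.reader(f))
print(rows_read)
```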

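Once Cytoscape is open, click the button for importing a network from a file and select the CSV exported above. (The file would not load for me unless its encoding was UTF-8.)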

Set user as the source node and follower as the target node, then click OK.
With around 30,000 rows this takes quite a while.

Zooming in shows which other IDs each individual ID is linked to.

Pressing the default button on the Style tab lets you switch to other visual styles.
I like Marquee. It also draws arrows, so you can confirm the source → target direction chosen earlier.

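Nodes can also be given colors, shapes, or images, so this should make for good-looking visuals when sending a report to a client.
Cytoscape also ships with sample data, which may be worth using as a reference.

Wrapping up

Once social data like this is visualized, the relationships between entities become apparent at a glance, which makes it easy to work with.
Cytoscape seems to have been used originally for gene network analysis, but as in this post it also looks applicable to users, IP traffic, or correlations between financial products.
When I have the time, I'd like to try it on Spotify or Twitter data.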

References

Summary of the current state of Cytoscape, a network visualization platform
Six degrees of separation
Percolation
