
Network Analysis with the Web's Link Structure ①

Last updated at 2019-11-09, posted at 2019-11-09

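Introduction

The web's link structure is a large-scale network that is easy to experiment with.
This article repeatedly fetches HTML with urllib, extracts the link targets with BeautifulSoup, and builds an adjacency matrix of web pages from the result.
Be warned that a run can easily take more than 12 hours.
The source code and related files are available on the author's GitHub.
The analysis with NetworkX is covered in "Network Analysis with the Web's Link Structure ②".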

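Program overview

1. Specify the start page from which links are followed.
2. Specify how many times links are followed (that is, how many hops from the start page, by the shortest route, are considered).
3. Follow links that many times.
4. Collect every link, restricted to the set of URLs obtained in those rounds.
5. Build the adjacency matrix.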
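
The steps above can be sketched on a toy in-memory graph. This is a simplified illustration and not the article's crawler: `TOY_WEB` and `crawl` are hypothetical names, dictionary lookups stand in for HTTP requests, and a plain nested list stands in for the NumPy matrix.

```python
# Hypothetical in-memory "web": page -> outgoing links (replaces real HTTP fetches).
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": ["E"],
    "E": ["A"],
}

def crawl(start, explore_num):
    """Follow links explore_num times from start, then build the adjacency
    matrix restricted to the pages discovered during those rounds."""
    known = [start]      # every page seen so far (row/column order of the matrix)
    frontier = [start]   # pages to expand in the current round
    for _ in range(explore_num):
        next_frontier = []
        for page in frontier:
            for target in TOY_WEB.get(page, []):
                if target not in known:
                    known.append(target)
                    next_frontier.append(target)
        frontier = next_frontier

    # Keep only links whose source and target are both known pages.
    index = {page: i for i, page in enumerate(known)}
    adj = [[0] * len(known) for _ in known]
    for page in known:
        for target in TOY_WEB.get(page, []):
            if target in index:
                adj[index[page]][index[target]] = 1
    return known, adj

pages, adj = crawl("A", explore_num=2)
print(pages)
for row in adj:
    print(row)
```

With `explore_num=2`, page `E` is three hops away and is never added, so the link `D -> E` does not appear in the matrix.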

Preparation

import.py

from urllib.request import urlopen
from bs4 import BeautifulSoup

import networkx as nx

from tqdm import tqdm_notebook as tqdm
import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 500

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import re

`urllib.request`

A standard-library module for fetching data from web sites.
(I could not find a good reference site for it...)
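
As a minimal sketch of how `urlopen` is used in this article, the snippet below reads a response and decodes it the same way as the functions later on. To run without network access it opens a `data:` URL, which `urllib.request` supports; `sample_html` and `sample_url` are made-up names for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical page content, percent-encoded into a data: URL so no network is needed.
sample_html = '<html><body><a href="/shop/">shop</a></body></html>'
sample_url = "data:text/html;charset=utf-8," + quote(sample_html)

# urlopen returns a response object that can be used as a context manager.
with urlopen(sample_url, timeout=10) as res:
    raw = res.read()                       # the body as bytes
    html = raw.decode("utf-8", "ignore")   # undecodable bytes are dropped

print(html)
```

For a real page, replace `sample_url` with an `https://` URL; the `timeout` argument then limits how long a slow server can block the crawl.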

`BeautifulSoup`

A module that parses HTML files based on their tags.
Reference: Qiita: 10分で理解する Beautiful Soup (Understanding Beautiful Soup in 10 Minutes)
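
A small sketch of the only BeautifulSoup feature this article relies on, namely collecting the `href` of every `a` tag. The HTML string is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented HTML with an absolute link, a root-relative link, and an anchor without href.
sample_html = """
<html><body>
  <a href="https://example.com/">absolute</a>
  <a href="/about/">root-relative</a>
  <a>no href</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# a.get("href") returns None when the attribute is missing.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['https://example.com/', '/about/', None]
```

The `None` entry is why the crawler functions check the value before looking at its first character.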

`networkx`

A module for network analysis.
It will be explained in the next article.

`tqdm`

Displays a progress bar for a for loop.
Note that in Jupyter Notebook you import tqdm_notebook instead.
Reference: Qiita: Jupyter Notebook でプログレスバーを出す (Showing a Progress Bar in Jupyter Notebook)

`pd.options.display.max_colwidth = 500`

Widens the maximum display width of each column in pandas,
so that very long URLs are not truncated.

url_prepare.py

start_url = "https://zozo.jp/"
# the page to begin with

explore_num = 2
# how many times do you explore new links

url_list = [start_url]
# list of the URL of all the pages. The components will be added.
link_list=[]
# list of lists [out_node, in_node]. The components will be added.

# prepare a file name to save figures and csv files
fname = re.split('[/.]', start_url)
if fname[2]=="www":
    fname = fname[3]
else:
    fname = fname[2]

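`start_url`

Specifies the page from which link following starts.

`explore_num`

Specifies how many times links are followed (that is, how many hops from the start page, by the shortest route, are considered).

`url_list`

A list that stores the URLs of all visited web pages.
Its order corresponds to the indices of the adjacency matrix built later.

`link_list`

A list that stores every link as a URL pair.
Each element is a list [URL the link leaves from, URL the link points to], so this is a list of lists.
Each pair corresponds to one element of the adjacency matrix built later.

`fname`

The file name used later when saving seaborn figures and pandas tables.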
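
To make the `fname` logic concrete, here is the same `re.split` rule wrapped in a helper and applied to a few URLs. `site_name` is a hypothetical name used only for this sketch, and the `example.com` URLs are placeholders.

```python
import re

def site_name(url):
    """Same rule as the fname code: split on '/' and '.', then skip a leading 'www'."""
    parts = re.split("[/.]", url)
    # For "https://zozo.jp/" parts is ['https:', '', 'zozo', 'jp', ''].
    return parts[3] if parts[2] == "www" else parts[2]

print(site_name("https://zozo.jp/"))           # zozo
print(site_name("https://www.example.com/"))   # example
print(site_name("https://shop.example.com/"))  # shop
```

As the last line shows, any subdomain other than `www` becomes the file name, which may or may not be what you want.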

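Functions for link-structure analysis

The following functions actually follow the links.
link_explore searches every link target. It takes the list of URLs to search as an argument.
link_cruise searches only for links that point to pages in the given set. It takes the adjacency matrix as an argument.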

link_explore.py

def link_explore(link_list, url_list, now_url_list):
    # link_list: list of lists [out_node, in_node]
    # url_list: list of the URL of all the pages
    # now_url_list: list of the URL to explore in this function
    print(f"starting exploring {len(now_url_list)} pages")
    next_url_list=[]
    
    for url in now_url_list:
        
        try:
            with urlopen(url, timeout=10) as res:
                html = res.read().decode('utf-8', 'ignore')
                soup = BeautifulSoup(html, "html.parser")

        except Exception:  # a bare except would also swallow Ctrl-C during a long crawl
            print("x", end="")
            continue
            #print(f"\n{url}")
            
        else:
            for a in soup.find_all("a"):
                link = a.get("href")

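                if link is not None and len(link)>0:
                    if link[0]=="/":
                        # resolve a root-relative link against the site root, not the current page
                        link = "/".join(url.split("/")[:3]) + link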

                    if link[0:4] == "http":
                        if link[-1]=="/":
                            next_url_list.append(link)
                            link_list.append([url,link])
                            
            print("o", end="")
        
    next_url_list = list(set(next_url_list))
        
    url_list += next_url_list
    url_list = list(set(url_list))
        
    return link_list, url_list, next_url_list
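
The link filtering inside the loop can also be written with `urllib.parse.urljoin`, which resolves an `href` against the page it appears on. The sketch below is an alternative formulation, not the article's code: `normalize_link` is a hypothetical helper, and unlike the function above it also resolves relative links that do not start with `/`. The two filters (`http` prefix, trailing `/`) are the ones used in this article.

```python
from urllib.parse import urljoin

def normalize_link(page_url, href):
    """Return an absolute http(s) URL ending in '/', or None if the link is skipped."""
    if not href:
        return None                        # missing or empty href
    absolute = urljoin(page_url, href)     # resolves '/x/', 'x/', '../x/' and so on
    if not absolute.startswith("http"):
        return None                        # mailto:, javascript:, etc.
    if not absolute.endswith("/"):
        return None                        # same trailing-slash filter as the article
    return absolute

page = "https://zozo.jp/brand/"
print(normalize_link(page, "/shop/"))         # https://zozo.jp/shop/
print(normalize_link(page, "item/"))          # https://zozo.jp/brand/item/
print(normalize_link(page, "mailto:a@b.c"))   # None
```

Keeping the rule in one helper would let both crawler functions share it, at the cost of a slightly different set of collected URLs.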

link_cruise.py

def link_cruise(adj, url_list, now_url_list):
    # adj: adjacency matrix
    # url_list: list of the URL of all the pages (row/column order of adj)
    # now_url_list: list of the URL to explore in this function
    #print(f"starting cruising {len(now_url_list)} pages")
    next_url_list=[]
    
    for url in tqdm(now_url_list):
        
        try:
            with urlopen(url, timeout=10) as res:
                html = res.read().decode('utf-8', 'ignore')
                soup = BeautifulSoup(html, "html.parser")
                
        except Exception:  # a bare except would also swallow Ctrl-C during a long crawl
            continue
            
        else:
            for a in soup.find_all("a"):
                link = a.get("href")

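                if link is not None and len(link)>0:
                    if link[0]=="/":
                        # resolve a root-relative link against the site root, not the current page
                        link = "/".join(url.split("/")[:3]) + link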

                    if link[0:4] == "http":
                        if link[-1]=="/":
                            if link in url_list:
                                if adj[url_list.index(url),url_list.index(link)] == 0:
                                    next_url_list.append(link)
                                    adj[url_list.index(url),url_list.index(link)] = 1
            #print("o", end="")
        
    next_url_list = list(set(next_url_list))
        
    #print("")
    return adj, next_url_list

Execution

Follow links the number of times given by explore_num.
An o is printed when a page's HTML is fetched and decoded successfully, and an x when it fails.

explore_exe.py

next_url_list = url_list
for i in range(explore_num):
    print(f"\nNo.{i+1} starting")
    link_list, url_list, next_url_list = link_explore(link_list, url_list, next_url_list)
    print(f"\nNo.{i+1} completed\n")

The console output then shows a run of o and x characters for each round.

Create the adjacency matrix.

make_adj.py

adj = np.zeros((len(url_list),len(url_list)))

for link in tqdm(link_list):
    try:
        adj[url_list.index(link[0]),url_list.index(link[1])] = 1
    except ValueError:  # raised by list.index when a URL is not in url_list
        pass
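
Because `url_list.index` scans the list on every call, building the matrix can be slow for large crawls. A possible variant, shown here on made-up data, precomputes a URL-to-index dictionary; the `example` URLs and the `index` name are assumptions for this sketch.

```python
import numpy as np

# Made-up data in the same shape as url_list and link_list in this article.
url_list = ["https://a.example/", "https://b.example/", "https://c.example/"]
link_list = [
    ["https://a.example/", "https://b.example/"],
    ["https://b.example/", "https://c.example/"],
    ["https://a.example/", "https://b.example/"],  # duplicate link, still a single 1
    ["https://c.example/", "https://x.example/"],  # target not in url_list, skipped
]

# Map each URL to its row/column number once, instead of calling list.index repeatedly.
index = {url: i for i, url in enumerate(url_list)}

adj = np.zeros((len(url_list), len(url_list)))
for src, dst in link_list:
    if src in index and dst in index:
        adj[index[src], index[dst]] = 1

print(adj)
```

Row `i`, column `j` is 1 when page `i` links to page `j`, which matches the [out_node, in_node] order of `link_list`.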

After the explore_num rounds, the search is restricted to pages that have already been visited. It is repeated until every page has been visited.

cruise_exe.py

while (len(next_url_list)>0):
    adj, next_url_list = link_cruise(adj, url_list, next_url_list)
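
Why this loop stops can be seen on a toy example: a page is queued again only when a new edge to it is recorded, and the number of possible edges is finite. In the sketch below `TOY_LINKS` and `cruise_once` are hypothetical stand-ins for real page fetches and for `link_cruise`.

```python
import numpy as np

# Hypothetical link table replacing real page fetches; "Z" is outside the known set.
TOY_LINKS = {"A": ["B"], "B": ["C", "Z"], "C": ["A"]}
url_list = ["A", "B", "C"]
index = {u: i for i, u in enumerate(url_list)}

adj = np.zeros((3, 3), dtype=int)
adj[index["A"], index["B"]] = 1   # edge already recorded during the explore phase

def cruise_once(adj, frontier):
    """Record unseen edges between known pages; return the pages they point to."""
    next_frontier = set()
    for url in frontier:
        for link in TOY_LINKS.get(url, []):
            if link in index and adj[index[url], index[link]] == 0:
                adj[index[url], index[link]] = 1
                next_frontier.add(link)
    return sorted(next_frontier)

frontier = ["B"]
rounds = 0
while frontier:
    frontier = cruise_once(adj, frontier)
    rounds += 1

print(rounds)
print(adj)
```

The loop ends in the round that finds no new edge, here the one that revisits `A` and sees that `A -> B` is already set.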

Done!

Sequel

Network Analysis with the Web's Link Structure ②
