
How to get a list of links from a Wikipedia page

Python

Last updated at 2016-08-21. Posted at 2016-08-21.

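I am currently studying web scraping with Python. One of the examples in the book I am working through fetches the links contained in a Wikipedia article. The sample in the book was written for English pages, so I modified it slightly for the Japanese Wikipedia. The script below decodes the percent-encoded hrefs with unquote so that Japanese titles print in readable form.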

Execution environment

OS: OS X El Capitan (10.11.5)
Python: 3.5.1

# coding: utf-8

import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import unquote

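# Japanese Wikipedia article "テイルズ オブ イノセンス" (the title is percent-encoded in the URL)
url = "https://ja.wikipedia.org/wiki/%E3%83%86%E3%82%A4%E3%83%AB%E3%82%BA_%E3%82%AA%E3%83%96_%E3%82%A4%E3%83%8E%E3%82%BB%E3%83%B3%E3%82%B9"

html = urlopen(url)
bsObj = BeautifulSoup(html, 'html.parser')

# Match internal article links: paths that start with /wiki/ and contain no colon
pattern = re.compile("^(/wiki/)((?!:).)*$")

# Search only the article body and print each href with its percent-encoding decoded
for link in bsObj.find('div', {'id': 'bodyContent'}).find_all('a', href=pattern):
    if 'href' in link.attrs:
        print(unquote(link.attrs['href']))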
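
To see what the pattern and unquote do without fetching a page, here is a minimal offline sketch. It reuses the same regular expression as the script above. The href values in sample_hrefs and the helper name article_titles are made up for illustration and do not come from the book or from a real page.

```python
import re
from urllib.parse import unquote

# Same pattern as the script above: paths that start with /wiki/ and contain no colon.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values, invented for this illustration.
sample_hrefs = [
    "/wiki/%E6%97%A5%E6%9C%AC",                     # article link, percent-encoded
    "/wiki/Python",                                 # article link, plain ASCII
    "/wiki/Category:%E3%82%B2%E3%83%BC%E3%83%A0",   # contains a colon, so it is skipped
    "/w/index.php?title=Python&action=edit",        # does not start with /wiki/
    "https://example.com/",                         # external link
]


def article_titles(hrefs):
    """Return the decoded paths of hrefs that match the article-link pattern."""
    return [unquote(h) for h in hrefs if ARTICLE_LINK.match(h)]


print(article_titles(sample_hrefs))
# → ['/wiki/日本', '/wiki/Python']
```

The negative lookahead (?!:) rejects any path containing a colon, which is how links such as Category: pages get filtered out. unquote then turns the percent-encoded UTF-8 bytes back into Japanese characters.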
