
Extracting links that contain a given string from an HTML file with BeautifulSoup

Python

Last updated at 2014-09-20 / Posted at 2014-09-19

・Extracts the links that contain "mp4" from an HTML file.
・Uses BeautifulSoup (the code below targets Python 2 and BeautifulSoup 3).


from BeautifulSoup import BeautifulSoup

open_name = raw_input('Open html file: ')
save_name = raw_input('Save file name: ')

# Read the whole HTML file; the with block closes it automatically
with open(open_name) as f:
    html = f.read()

f2 = open(save_name, 'w')

soup = BeautifulSoup(html)

for link in soup.findAll("a"):
    href = link.get("href")  # None when the <a> tag has no href attribute
    if href and "mp4" in href:  # keep only links whose href contains "mp4"
        f2.write(href + '\n')

f2.close()
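
On Python 3, raw_input and the BeautifulSoup 3 package used above are not available. As a rough sketch of the same idea, the filtering step can be written with only the standard library's html.parser module instead of BeautifulSoup. The names LinkCollector and extract_links are hypothetical helpers made up for this illustration, not part of any library, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that contain a keyword (hypothetical helper)."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without a value (<a href>).
            if name == "href" and value and self.keyword in value:
                self.links.append(value)


def extract_links(html, keyword="mp4"):
    """Return the hrefs in html that contain keyword, in document order."""
    parser = LinkCollector(keyword)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Made-up sample input for illustration only.
    sample = '<a href="a.mp4">x</a><a name="top">y</a><a href="b.html">z</a>'
    for href in extract_links(sample):
        print(href)
```

The returned list can then be written to a file one link per line, in the same way as the script above. Anchors without an href are skipped rather than raising an error.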

I based this on the Stack Overflow answer linked below.
If there is a better way to do this, please let me know.

Reference:
python - how can I get href links from html code - Stack Overflow
https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-code/3075568#3075568
