
Extracting just the article from a web page with Python

Posted at 2019-12-31

A library that makes it easy to extract the body text from a web page

When you pull data out of pages scraped with Python, you often end up with HTML tags
and other extraneous information that is of no use for later analysis.

In that situation, readability-lxml does the job. This article explains how to use it.

First, install it

(env)$ pip install readability-lxml

Create a utility module like the following

utils.py

# -*- coding:utf8 -*-
import lxml.html
import readability


def get_content(html):
    """
    Return a (title, body text) tuple extracted from an HTML string.
    """
    document = readability.Document(html)
    # summary() returns the main article content as cleaned-up HTML.
    content_html = document.summary()
    # Strip the HTML tags and keep only the body text.
    content_text = lxml.html.fromstring(content_html).text_content().strip()
    short_title = document.short_title()
    return short_title, content_text

Use the utility to test that the title and content are actually retrieved
(I used a Yahoo News article)

import utils
import requests
obj = requests.get('https://headlines.yahoo.co.jp/hl?a=20191230-00000310-oric-ent')
title, content = utils.get_content(obj.content)
print(title)
print(content)

Run it and confirm that the article's title and body text are printed.

Change history

2019/12/31 Initial version
