Extract News Articles Easily with Python's Newspaper3k

Posted at 2024-09-16

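Chapter 1: Overview of Newspaper3k

Newspaper3k is a news article extraction library written in Python. It fetches articles from news websites and extracts their text, images, and metadata with only a few lines of code.

import newspaper

# Build a news source for CNN (this fetches the site and discovers article URLs)
cnn_paper = newspaper.build('http://cnn.com')

# Print the number of articles found.
# build() caches URLs it has already seen, so a rerun may report fewer;
# pass memoize_articles=False to count every article each time.
print(f"Number of CNN articles: {len(cnn_paper.articles)}")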

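Chapter 2: Installation and Basic Setup

Install Newspaper3k, import it, and download the NLTK tokenizer data that its NLP features (Chapter 5) rely on.

# Install (the leading "!" is notebook syntax; omit it in a regular shell)
!pip install newspaper3k

# Import the required libraries
import newspaper
from newspaper import Article
import nltk

# Download the tokenizer data required for natural language processing
nltk.download('punkt')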

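Chapter 3: Extracting a Single Article

This chapter shows how to extract the title and body text from one article. The URL below is a placeholder; replace it with a real article URL.

# Specify the article URL
url = 'http://example.com/news/article'

# Create the article object
article = Article(url)

# Download and parse the article
article.download()
article.parse()

# Print the title and body text
print(f"Title: {article.title}")
print(f"Text: {article.text[:200]}...")  # show the first 200 characters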
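Slicing with `[:200]` can cut a word in half and keeps any line breaks in the body. As a small alternative sketch, the standard library's `textwrap.shorten` collapses whitespace and truncates at a word boundary. The helper name `preview` is hypothetical, and this approach assumes space-separated text; it is not suitable for languages written without spaces, such as Japanese.

```python
import textwrap


def preview(text, width=200):
    """Collapse whitespace and cut at a word boundary, adding '...' if truncated."""
    return textwrap.shorten(text, width=width, placeholder='...')


print(preview("Hello world", width=10))
```

With a parsed article this would be called as `preview(article.text)`.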

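Chapter 4: Processing Multiple Articles in Bulk

This chapter shows how to process many articles from a news site in one pass. Each article is a separate HTTP request, so a large site can take a long time to process in full.

# Build the news source
source = newspaper.build('http://example.com')

# Process every article
for article in source.articles:
    article.download()
    article.parse()
    print(f"Title: {article.title}")
    print(f"URL: {article.url}")
    print("---")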
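In practice some downloads fail, and a source may list hundreds of articles, so a bulk loop usually benefits from a cap and per-article error handling. The following is a minimal sketch of that idea, not part of Newspaper3k itself: `process_articles`, `limit`, and `delay` are hypothetical names, and the function only assumes that each item has `download()` and `parse()` methods and a `url` attribute, as `Article` objects do.

```python
import time


def process_articles(articles, limit=10, delay=0.0):
    """Download and parse up to `limit` articles, skipping those that fail.

    Returns (parsed, failed): the parsed articles, and (url, error message) pairs.
    """
    parsed, failed = [], []
    for article in list(articles)[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # e.g. newspaper's ArticleException
            failed.append((article.url, str(exc)))
        else:
            parsed.append(article)
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the site
    return parsed, failed
```

A call such as `parsed, failed = process_articles(source.articles, limit=20, delay=1.0)` would then handle at most 20 articles and report the failures separately instead of stopping at the first one.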

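Chapter 5: Using the Natural Language Processing (NLP) Features

Newspaper3k's NLP features extract keywords and a summary from an article. Call nlp() only after download() and parse(); it also needs the NLTK data downloaded in Chapter 2.

# Download and parse the article
article.download()
article.parse()

# Run NLP processing
article.nlp()

# Print the keywords and summary
print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")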

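Chapter 6: Multilingual Support

Newspaper3k supports multiple languages, including Japanese. Pass the language code through the language argument when creating the article.

# Process a Japanese article
japanese_article = Article('http://example.jp/news', language='ja')
japanese_article.download()
japanese_article.parse()

print(f"Japanese article title: {japanese_article.title}")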

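Chapter 7: Custom Configuration

A Config object customizes how Newspaper3k behaves, for example the User-Agent header it sends and how long it waits for a response.

from newspaper import Config

# Create a custom configuration
config = Config()
config.browser_user_agent = 'Mozilla/5.0'
config.request_timeout = 10  # seconds

# Process an article with the custom configuration
custom_article = Article(url, config=config)
custom_article.download()
custom_article.parse()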

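Chapter 8: Error Handling

This chapter shows how to handle errors that can occur during extraction, such as a failed download.

from newspaper import ArticleException

try:
    article = Article(url)
    article.download()
    article.parse()
except ArticleException as e:
    print(f"An error occurred while processing the article: {e}")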
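Network errors are often temporary, so retrying a few times with a growing delay is a common complement to the try/except above. This is a hedged sketch rather than a Newspaper3k feature: `fetch_with_retry`, `make_article`, `retries`, `base_delay`, and `sleep` are hypothetical names, and the broad `except Exception` is there only to keep the example independent of the library.

```python
import time


def fetch_with_retry(make_article, retries=3, base_delay=1.0, sleep=time.sleep):
    """Create, download, and parse an article, retrying on failure.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last error if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            article = make_article()  # fresh object for each attempt
            article.download()
            article.parse()
            return article
        except Exception as exc:  # narrow to ArticleException in real use
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With Newspaper3k this could be called as `article = fetch_with_retry(lambda: Article(url))`. Passing a factory instead of an existing object means each attempt starts from a clean state.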

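Chapter 9: Extracting Images

This chapter shows how to get the URLs of the images in an article: the single top image and the full set found on the page.

article.download()
article.parse()

# Print the URL of the top image
print(f"Top image: {article.top_image}")

# Print every image URL
print("All images in the article:")
for img in article.images:
    print(img)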
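The full image list of a page often includes tracking pixels, inline data URIs, and duplicates. One possible way to narrow it down, sketched here with only the standard library, is to keep http(s) URLs whose path ends in a known image extension. `filter_image_urls` and `IMAGE_EXTENSIONS` are hypothetical names, and an extension check will miss images that are served without one.

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.webp')


def filter_image_urls(urls, extensions=IMAGE_EXTENSIONS):
    """Return sorted, de-duplicated http(s) URLs whose path has an image extension."""
    kept = set()
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ('http', 'https'):
            continue  # skips data: URIs and relative paths
        if parts.path.lower().endswith(extensions):
            kept.add(url)
    return sorted(kept)
```

For a parsed article the call would be `filter_image_urls(article.images)`.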

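Chapter 10: Extracting Metadata

This chapter shows how to read an article's metadata, such as its authors and publication date. Not every page exposes this information, so authors can be empty and publish_date can be None.

article.download()
article.parse()

print(f"Authors: {article.authors}")
print(f"Publish date: {article.publish_date}")
print(f"Tags: {article.tags}")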
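Because these fields can be missing, it can help to normalize them into one predictable structure before using them elsewhere. The sketch below is an illustration under assumptions: `metadata_to_dict` is a hypothetical helper, it assumes `publish_date` is either a `datetime` or `None`, and the `SimpleNamespace` object merely stands in for a parsed `Article`.

```python
from datetime import datetime
from types import SimpleNamespace


def metadata_to_dict(article):
    """Collect author, date, and tag metadata into a JSON-friendly dict."""
    published = article.publish_date
    return {
        'authors': list(article.authors or []),
        # ISO 8601 string when a date was detected, otherwise None
        'publish_date': published.isoformat() if published else None,
        'tags': sorted(article.tags or []),
    }


# Stand-in object with the same attribute names as a parsed Article
sample = SimpleNamespace(
    authors=['Jane Doe'],
    publish_date=datetime(2024, 9, 16, 8, 30),
    tags={'python', 'news'},
)
print(metadata_to_dict(sample))
```

Sorting the tags gives a stable order, which makes the output easier to compare between runs.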

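Chapter 11: Building and Managing Sources

This chapter shows how to build several news sources and inspect them together. Each build() call fetches the site, so building many sources takes time.

# Build multiple sources
sources = [
    newspaper.build('http://cnn.com'),
    newspaper.build('http://bbc.com'),
    newspaper.build('http://reuters.com')
]

# Print the number of articles for each source
for source in sources:
    print(f"{source.brand}: {len(source.articles)} articles")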
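When several sources are combined, the same article can show up under slightly different URLs (trailing slash, tracking parameters, fragments). The following standard-library sketch shows one way to de-duplicate them. `normalize_url` and `unique_urls` are hypothetical helpers, and dropping the query string is an assumption that would wrongly merge articles on sites that identify them by a query parameter.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host, and drop the query string and fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip('/') or '/'
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, '', ''))


def unique_urls(urls):
    """Return URLs in first-seen order, keeping one per normalized form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Applied to the sources above, this could look like `unique_urls(a.url for s in sources for a in s.articles)`.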

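Chapter 12: Parallel Processing

Downloading is dominated by network wait time, so processing several articles in parallel threads improves throughput.

import concurrent.futures

def process_article(url):
    # Download and parse one article, returning its title
    article = Article(url)
    article.download()
    article.parse()
    return article.title

urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']

# executor.map returns the results in the same order as urls
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(process_article, urls))

for title in results:
    print(f"Title: {title}")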
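With `executor.map`, an exception raised for one URL surfaces while the results are being collected, and the remaining results are lost. A variant that keeps going is sketched below using `submit` and `as_completed`. `fetch_all` and its `fetch` parameter are hypothetical names; `fetch` can be any function that takes a URL, such as `process_article`.

```python
import concurrent.futures


def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) in a thread pool; return ({url: result}, {url: error})."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which URL each future belongs to
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = str(exc)
    return results, errors
```

A call such as `titles, errors = fetch_all(urls, process_article)` returns the titles keyed by URL together with a separate record of what failed. Results arrive in completion order, which is why they are stored in dictionaries rather than a list.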

Chapter 13: Saving and Loading Data

This chapter shows how to save extracted article data to a JSON file and load it back later.

import json

# Save the data
def save_article(article, filename):
    data = {
        'title': article.title,
        'text': article.text,
        'url': article.url
    }
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable in the file
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Load the data
def load_article(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)

# Usage example
article = Article(url)
article.download()
article.parse()
save_article(article, 'article.json')

loaded_data = load_article('article.json')
print(f"Loaded title: {loaded_data['title']}")
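One file per article becomes unwieldy once many articles are collected. A common alternative, sketched here as an assumption rather than anything Newspaper3k provides, is the JSON Lines layout: one JSON object per line, appended as articles arrive. `append_record` and `read_records` are hypothetical helper names.

```python
import json


def append_record(record, filename):
    """Append one article record as a single JSON line."""
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(filename):
    """Read every record back from a JSON Lines file, skipping blank lines."""
    with open(filename, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each record would be a dict like the one built in `save_article`, for example `append_record({'title': article.title, 'url': article.url}, 'articles.jsonl')`. Appending means an interrupted run keeps everything written so far.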

Chapter 14: Analyzing News Trends

Counting the keywords that Newspaper3k extracts across several articles gives a rough picture of what a source is currently covering.

from collections import Counter

def analyze_trends(source):
    keywords = []
    for article in source.articles[:10]:  # analyze the first 10 articles
        article.download()
        article.parse()
        article.nlp()
        keywords.extend(article.keywords)

    # Return the 5 most frequent keywords with their counts
    return Counter(keywords).most_common(5)

cnn_paper = newspaper.build('http://cnn.com')
trends = analyze_trends(cnn_paper)

print("Top 5 keywords:")
for keyword, count in trends:
    print(f"{keyword}: {count} times")
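A raw count can be skewed by one article that repeats a keyword in different capitalizations, and by generic words. As a possible refinement, the sketch below counts in how many articles each keyword appears (its document frequency) and lets the caller ignore chosen words. `keyword_document_frequency`, `top_n`, and `ignore` are hypothetical names; the input is simply a collection of keyword lists, one per article.

```python
from collections import Counter


def keyword_document_frequency(keyword_lists, top_n=5, ignore=()):
    """Count in how many articles each keyword appears (at most once per article)."""
    ignored = {word.lower() for word in ignore}
    counts = Counter()
    for keywords in keyword_lists:
        # Lower-case and de-duplicate within one article before counting
        unique = {keyword.lower() for keyword in keywords} - ignored
        counts.update(unique)
    return counts.most_common(top_n)
```

With already processed articles this could be called as `keyword_document_frequency((a.keywords for a in articles), ignore=['said'])`. The order of keywords with equal counts is not guaranteed.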

Chapter 15: Integrating with Other APIs

Newspaper3k output can be passed on to other web APIs. The example below translates an article title with the Google Translate API.

import requests

def translate_title(title, target_lang='ja'):
    # Uses the Google Translate API (requires an API key)
    endpoint = "https://translation.googleapis.com/language/translate/v2"
    params = {
        'q': title,
        'target': target_lang,
        'key': 'YOUR_API_KEY'
    }
    response = requests.get(endpoint, params=params, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors such as an invalid key
    return response.json()['data']['translations'][0]['translatedText']

article = Article('http://example.com/english_article')
article.download()
article.parse()

original_title = article.title
translated_title = translate_title(original_title)

print(f"Original title: {original_title}")
print(f"Translated title: {translated_title}")
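Indexing straight into the JSON response raises an exception when the API returns an error body, and translated text may come back with HTML entities such as `&#39;`. The sketch below separates the parsing step so it can handle both cases. `extract_translation` is a hypothetical helper, and the response shape is assumed to be the one used in the code above.

```python
import html


def extract_translation(payload):
    """Return the first translated string from the response dict, or None."""
    try:
        text = payload['data']['translations'][0]['translatedText']
    except (KeyError, IndexError, TypeError):
        return None  # unexpected shape, e.g. an error body without 'data'
    return html.unescape(text)  # turn entities like &#39; back into characters


sample = {'data': {'translations': [{'translatedText': 'It&#39;s news'}]}}
print(extract_translation(sample))
```

Inside `translate_title`, the last line could then become `return extract_translation(response.json())`, leaving the caller to decide what to do with `None`.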

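That concludes this walkthrough of Newspaper3k. Each chapter pairs a short explanation with a concrete code example, so try running them yourself. Newspaper3k makes it straightforward to extract and analyze news articles, and it is well worth trying in your own projects!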
