🔍 [Generative AI × Web Search × Vector DB] Building a News-Summarization RAG System in Python

Posted at 2025-04-20, last updated at 2025-04-20

※English Follows Japanese

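🧠 Introduction: What Is the "Next Application" of LLMs?

"ChatGPT is a tool for writing text." Isn't it time to move one step beyond that view?

This article presents a case study of using an LLM (large language model) as an "autonomous news research agent."

Rather than plain summarization, it automates the following flow with Python + generative AI + a vector DB:

Extract keywords from the user's question (LLM)
Fetch the latest news via the GDELT API
Clean the articles and extract their key elements
Generate embedding vectors and store them in ChromaDB
Re-retrieve related news in a RAG setup
Generate a context-grounded answer to the question with the LLM

This article is the final deliverable for the Gen AI Intensive Course hosted by Google.
(GPT summary below)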

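⚡ The Big Picture: What Is a News RAG System?

🔧 Tech Stack

Tool / Technology	Purpose
Google Generative AI	Keyword extraction and final answer (Gemini)
GDELT API	Fetching news from around the world
BeautifulSoup / Readability	HTML cleaning and article body extraction
SentenceTransformer	Vectorizing text
ChromaDB	Vector search engine (RAG setup)
Python / Jupyter	Execution environment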

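🔍 Code Examples: From User Question to News Research

① Extract keywords from the question (LLM)

import json
from typing import List

def extract_keywords(query: str) -> List[str]:
    # The Japanese prompt asks for search keywords as a bare JSON array so json.loads can parse the reply.
    prompt = f'この質問に関するニュース検索用のキーワードをJSON配列（例: ["a", "b"]）のみで出力してください: {query}'
    response = genai_client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )
    # Strip a Markdown code fence if the model adds one despite the instruction.
    text = response.text.strip().strip("`").removeprefix("json").strip()
    return json.loads(text)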

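② Search and save articles with the GDELT API

from gdeltdoc import GdeltDoc, Filters

def fetch_and_save_gdelt_articles(keyword: str) -> bool:
    filters = Filters(keyword=keyword, ...)  # other filter arguments are elided in the original
    articles = GdeltDoc().article_search(filters)  # pandas DataFrame
    # orient="records" writes a list of row dicts, the shape process_news_json expects.
    articles.head(10).to_json("gdelt_events_24h.json", orient="records")
    return not articles.empty  # lets the caller skip keywords with no hits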

③ Clean the articles and store them in Chroma

def process_news_json(json_path):
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    for url in [d["url"] for d in data]:
        result = fetch_and_clean_article(url)
        embedding = model.encode(result["content"])
        # upsert with the URL as the ID: "news_{i}" restarts at 0 for every keyword and would collide.
        chroma_collection.upsert(
            documents=[result["content"]],
            metadatas=[{"title": result["title"], "url": result["url"]}],
            ids=[url],
            embeddings=[embedding.tolist()]
        )

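④ Retrieve similar articles to pass to the LLM (RAG)

def search_similar_documents(query):
    # Embed the query with the same SentenceTransformer used at indexing time;
    # query_texts would use the collection's own embedding function, which may be a different model.
    query_embedding = model.encode(query).tolist()
    result = chroma_collection.query(query_embeddings=[query_embedding], n_results=3)
    return "\n\n".join(result["documents"][0])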

⑤ Generate a context-based answer (LLM)

def ask_llm(query, context):
    prompt = f"質問: {query}\n\n参考ニュース:\n{context}"
    response = genai_client.models.generate_content(
        model="gemini-2.0-flash", contents=prompt
    )
    return response.text

✅ The final function

def rag_answer(query: str):
    keywords = extract_keywords(query)
    for kw in keywords:
        if fetch_and_save_gdelt_articles(kw):
            process_news_json("gdelt_events_24h.json")
    context = search_similar_documents(query)
    return ask_llm(query, context)

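💡 Possible Use Cases

✅ Situation analysis: news summaries for a specific region or topic
✅ Automated research report generation
✅ Chatbots that can handle current events
✅ Monitoring shifts in trends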

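⚠ Limitations and Future Possibilities

Item	Details
Limited Japanese coverage in GDELT	Mostly English-language news. Other sources (e.g. NewsAPI) could be integrated in the future
LLM hallucination risk	The supporting documents should be cited explicitly and the model prompted to give fact-based answers
RAG accuracy	Varies greatly with the choice of embedding model and search engine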

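✨ Conclusion

This article presented a hands-on example of **a RAG setup that uses an LLM to automate news search end to end**.

ChatGPT and Gemini are not just chatbots
Building a pipeline of API integration, DB storage, and context-aware answering turns an LLM into one that is useful for real work
You can build an assistant that handles everything from information gathering to interpretation

🙌 Feedback welcome!

If you have thoughts like "I'd like to use this at work" or "I'd like to try it with another data source," please leave a comment!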


English

🔍 [GenAI × News × Vector DB] Building a RAG-Based News Summarizer Agent with Python

🧠 Introduction: LLMs Are More Than Just Chatbots

Large Language Models (LLMs) like ChatGPT or Gemini are not only capable of generating fluent text—they can act as autonomous agents for data-driven tasks.

In this post, we’ll build an intelligent agent that can:

Understand a user’s natural-language query
Fetch recent news articles from the GDELT database
Clean and embed them
Store and search via vector DB (Chroma)
Generate context-aware answers using Gemini

✅ Yes — this is a fully functioning RAG (Retrieval-Augmented Generation) system tailored for news analysis.

Note that this article is the final deliverable for the Gen AI Intensive Course organised by Google.
(GPT summary below)

⚡ Overview: What Are We Building?

🛠️ Tech Stack

Tool / Library	Purpose
Google Generative AI	For keyword extraction and final answer
GDELT API	Worldwide news source
BeautifulSoup + `readability`	For article parsing
SentenceTransformers	For embedding text into vectors
ChromaDB	Vector search engine
Python + Jupyter	Development environment

🔍 Key Implementation Steps

1. Extract Keywords from User Query (LLM)

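import json
from typing import List

def extract_keywords(query: str) -> List[str]:
    # Request a bare JSON array so the reply can be parsed with json.loads.
    prompt = f'Extract news-search keywords from this question and reply with only a JSON array of strings, e.g. ["a", "b"]: {query}'
    response = genai_client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )
    # Strip a Markdown code fence if the model adds one despite the instruction.
    text = response.text.strip().strip("`").removeprefix("json").strip()
    return json.loads(text)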
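
The `json.loads` call above assumes the model replies with clean JSON, which LLMs do not always do. The following is a minimal sketch of a more forgiving parser; `parse_keyword_reply` is a hypothetical helper name, not part of the original code, and the fallback rules (splitting on commas and newlines) are an assumption about what a non-JSON reply might look like.

```python
import json
import re
from typing import List


def parse_keyword_reply(reply: str) -> List[str]:
    """Parse an LLM reply into a list of keyword strings.

    Accepts a bare JSON array, a JSON array wrapped in a Markdown code
    fence or surrounding chatter, or (as a last resort) a comma- or
    newline-separated list.
    """
    text = reply.strip()
    # Pull out the first [...] span, which also skips code fences and chatter.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, list):
                return [str(item).strip() for item in parsed if str(item).strip()]
        except json.JSONDecodeError:
            pass  # fall through to the plain-text fallback
    # Fallback: split on commas and newlines, dropping list markers and quotes.
    parts = re.split(r"[,\n]", text)
    cleaned = [p.strip().strip("-*•\"'` ").strip() for p in parts]
    return [c for c in cleaned if c]
```

With a helper like this, `extract_keywords` could return `parse_keyword_reply(response.text)` so that a stray code fence or a plain comma-separated reply does not raise an exception.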

2. Search News Articles via GDELT API

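from gdeltdoc import GdeltDoc, Filters

def fetch_and_save_gdelt_articles(keyword: str) -> bool:
    filters = Filters(keyword=keyword, ...)  # other filter arguments are elided in the original
    articles = GdeltDoc().article_search(filters)  # pandas DataFrame
    # orient="records" writes a list of row dicts, the shape process_news_json expects.
    articles.head(10).to_json("gdelt_events_24h.json", orient="records")
    return not articles.empty  # lets the caller skip keywords with no hits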

3. Clean HTML and Embed Articles

def process_news_json(json_path):
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    for url in [d["url"] for d in data]:
        result = fetch_and_clean_article(url)
        embedding = model.encode(result["content"])
        # upsert with the URL as the ID: "news_{i}" restarts at 0 for every keyword and would collide.
        chroma_collection.upsert(
            documents=[result["content"]],
            metadatas=[{"title": result["title"], "url": result["url"]}],
            ids=[url],
            embeddings=[embedding.tolist()]
        )
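
Because the pipeline runs once per keyword, the same article can come back from several GDELT searches, and a vector store needs an ID that stays the same each time. The sketch below shows one way to derive such IDs and drop repeats before embedding. `article_id` and `dedupe_by_url` are hypothetical helpers, and treating the whitespace-trimmed URL as the article's identity is an assumption (two URLs can point to the same story).

```python
import hashlib
from typing import Dict, Iterable, List


def article_id(url: str) -> str:
    """Derive a deterministic, fixed-length ID from the article URL."""
    digest = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()
    return "news_" + digest[:16]


def dedupe_by_url(articles: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first article for each distinct URL, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        key = article_id(article["url"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

A hashed ID keeps IDs short and uniform even when URLs are long, and running `dedupe_by_url` first avoids fetching and embedding the same page twice.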

4. Perform RAG-Style Search

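def search_similar_documents(query):
    # Embed the query with the same SentenceTransformer used at indexing time;
    # query_texts would use the collection's own embedding function, which may be a different model.
    query_embedding = model.encode(query).tolist()
    result = chroma_collection.query(query_embeddings=[query_embedding], n_results=3)
    return "\n\n".join(result["documents"][0])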

5. Generate Contextual Answer (LLM)

def ask_llm(query, context):
    prompt = f"Question: {query}\n\nRelevant News:\n{context}"
    response = genai_client.models.generate_content(
        model="gemini-2.0-flash", contents=prompt
    )
    return response.text
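
The prompt above passes the retrieved text as one undifferentiated block. Since each article is stored with its title and URL as metadata, a variant can number the sources and ask the model to cite them. This is a minimal sketch under stated assumptions: `build_grounded_prompt` is a hypothetical helper, the instruction wording and the `max_chars` cutoff are illustrative choices, and `result` is assumed to be the dict a Chroma query returns for a single query (with `documents` and `metadatas` included).

```python
from typing import Any, Dict


def build_grounded_prompt(query: str, result: Dict[str, Any], max_chars: int = 1500) -> str:
    """Format retrieved articles as numbered sources and ask for cited answers."""
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    sources = []
    for n, (doc, meta) in enumerate(zip(documents, metadatas), start=1):
        meta = meta or {}  # tolerate entries stored without metadata
        title = meta.get("title", "untitled")
        url = meta.get("url", "unknown URL")
        # Truncate long articles so the prompt stays a manageable size.
        sources.append(f"[{n}] {title} ({url})\n{doc[:max_chars]}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the numbered news sources below. "
        "Cite sources as [1], [2], ... and say so if they do not contain the answer.\n\n"
        f"Question: {query}\n\nRelevant News:\n{context}"
    )
```

The returned string can be passed as `contents` to the same `generate_content` call, which makes it easier to check each claim in the answer against a specific article.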

6. Final Orchestration

def rag_answer(query: str):
    keywords = extract_keywords(query)
    for kw in keywords:
        if fetch_and_save_gdelt_articles(kw):
            process_news_json("gdelt_events_24h.json")
    context = search_similar_documents(query)
    return ask_llm(query, context)

💡 Potential Use Cases

✅ Global trend analysis for specific topics or regions
✅ Autonomous news summarization and reporting
✅ Context-aware chatbot with real-time knowledge
✅ Media monitoring tools for journalists and researchers

⚠️ Limitations & Future Directions

Limitation	Detail
GDELT is mostly English	Japanese or other local-language news needs additional sources (e.g. NewsAPI, RSS)
Hallucination risk	Answers should be grounded in the retrieved context, with sources cited explicitly
Search quality	Depends on the embedding model and vector DB setup, both of which can be tuned
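
One part of the embedding setup that often affects search quality is how much text goes into each vector. SentenceTransformer models truncate input beyond their maximum sequence length, so a long article embedded as a single document is represented mostly by its opening paragraphs. The sketch below splits text into overlapping word windows before embedding. `chunk_text` is a hypothetical helper, and the default window and overlap sizes are illustrative assumptions, not values from the original project.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and stored as its own document, with the article URL kept in the metadata so that results can still be traced back to the source.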

✨ Conclusion

This project demonstrates a practical application of LLMs beyond simple Q&A — building a domain-specific, autonomous RAG system for news understanding.

With a few APIs and tools, we can delegate entire workflows like:

"Tell me what's going on with electric vehicles in China today"
→ (Search + Embed + Filter + Summarize) ✅

💬 Feedback Welcome!

Want to connect this to other APIs? Try multilingual support?
Let me know in the comments — happy to explore it with you!
