Vectorize with LlamaIndex → better translation accuracy!

Posted at 2025-12-03

This time I'm using the OpenAI API!
First, to use the API safely, I export the API key from .zshrc, so fire up the vi editor!
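
Once the key is exported, it helps to fail early if the script cannot see it. Below is a minimal sketch of such a check. It assumes the variable is named `OPENAI_API_KEY` (the name the OpenAI SDK reads by default; the post does not say which name was used), and `require_api_key` is a hypothetical helper, not part of any library.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `name` is an assumption: adjust it if your .zshrc exports a different variable.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add `export {name}=...` to ~/.zshrc "
            "and run `source ~/.zshrc`"
        )
    return key
```

Calling `require_api_key()` at the top of the script turns a missing key into one readable error instead of a failure deep inside the first API call.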

I used this as a reference, thank you!!

Now, let's build a summary from structured data!!

from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from bs4 import BeautifulSoup
import glob
import os

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

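def parse_grobid_tei(file_path):
    """Parse a GROBID TEI file into per-section Documents plus paper-level metadata."""
    with open(file_path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'xml')  # the 'xml' parser requires lxml

    # Title (guard against a missing <titleStmt>)
    title_stmt = soup.find('titleStmt')
    title_tag = title_stmt.find('title', level='a', type='main') if title_stmt else None
    title_text = title_tag.text.strip() if title_tag else "Unknown"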

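    # Authors (handles middle names). Search only the header so that
    # authors of cited references are not picked up as paper authors.
    header = soup.find('teiHeader') or soup
    authors = []
    for persName in header.find_all('persName'):
        forenames = [fn.text for fn in persName.find_all('forename') if fn.text]
        surname = persName.find('surname')
        author_name = f"{' '.join(forenames)} {surname.text if surname else ''}".strip()
        if author_name:
            authors.append(author_name)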

    documents = []
    # Body text (walk every <div>, one Document per section)
    for div in soup.find_all('div'):
        head = div.find('head')
        section_title = head.text.strip() if head else "Untitled Section"
    
        text = '\n\n'.join(p.get_text(strip=True) for p in div.find_all('p'))
        if text.strip():
            documents.append(Document(
                text=text,
                metadata={
                    'title': title_text,
                    'authors': authors, 
                    'section': section_title
                }
            ))

    return documents, {'title': title_text, 'authors': authors}

# Main script
tei_files = glob.glob(os.path.expanduser("~/Downloads/*.tei.xml"))
if tei_files:
    file_path = max(tei_files, key=os.path.getctime)  # newest file by ctime
    print(f"Loading file: {file_path}")
else:
    # Exit with a non-zero status; the message goes to stderr
    raise SystemExit("No TEI file found")

docs, metadata = parse_grobid_tei(file_path)

index = VectorStoreIndex.from_documents(
    docs,
    embed_model=embed_model,
    transformations=[SentenceSplitter(chunk_size=2048, chunk_overlap=100)]
)

query_engine = index.as_query_engine()
# Query (in Japanese): "What is the main claim of this paper? Answer within 200 characters."
response = query_engine.query("この論文の主要な主張は何ですか？200文字以内で答えてください。")
print(response)

There it is! ...wait, that's over 200 characters!!

The main argument of this paper is that the interaction between humans and AI chatbots can lead to emergent risk profiles that neither humans nor chatbots would create independently. The paper discusses how chatbot biases, vulnerabilities from the opaque nature of chatbots, and the adaptability of personalized chatbots can intertwine with human tendencies like anthropomorphism. It highlights how chatbots' adaptability to individual users, commercial pressures driving adaptability, and increasing agentic capabilities can lead to users forming personal relationships with chatbots, making them more susceptible to influence. The paper emphasizes the potential psychological risks of chatbot interactions and the need to consider the bidirectional amplification of beliefs between humans and chatbots.
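
The model ignored the 200-character limit here, so a check in code is more dependable than trusting the prompt. A minimal sketch, assuming the limit counts Unicode characters after trimming whitespace; `check_length` and `build_retry_prompt` are hypothetical helpers, and the retry wording is just one possible phrasing.

```python
def check_length(text, limit=200):
    """Return (ok, length), counting Unicode characters after trimming whitespace."""
    length = len(text.strip())
    return length <= limit, length


def build_retry_prompt(summary, limit=200):
    """Build a follow-up prompt asking for a shorter rewrite (hypothetical wording)."""
    return (
        f"Rewrite the following summary in at most {limit} characters, "
        f"keeping the technical terms unchanged:\n\n{summary}"
    )
```

With the script above, `ok, n = check_length(str(response))` would flag this answer as too long, and the retry prompt could then be sent to the model for a shorter version.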

However, with cloud/browser-based AIs such as GPT or Gemini,

vulnerabilities
opaque
agentic capabilities
emphasizes
bidirectional amplification
words like these usually get swapped for more generic terms or vanish entirely, but here they all survived!! Yes!! 😄
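
Checking by eye works for one summary, but the same check can be sketched in code. The snippet below is only an illustration: `surviving_terms` is a hypothetical helper that does a case-insensitive whole-word match, and `TERMS` is simply the list above.

```python
import re

# The terms listed above
TERMS = [
    "vulnerabilities",
    "opaque",
    "agentic capabilities",
    "emphasizes",
    "bidirectional amplification",
]


def surviving_terms(summary, terms=TERMS):
    """Map each term to True if it appears in the summary as a whole word or phrase."""
    result = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        result[term] = re.search(pattern, summary, flags=re.IGNORECASE) is not None
    return result
```

Running `surviving_terms(str(response))` on summaries from different tools would show at a glance which one drops or paraphrases the terms.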

Because the English source keeps its edge, even plain old Google Translate holds this level!!

本論文の主な主張は、人間とAIチャットボットの相互作用が、人間もチャットボットも単独では生み出さないような新たなリスクプロファイルを生み出す可能性があるというものです。本論文では、チャットボットのバイアス、チャットボットの不透明な性質に起因する脆弱性、そしてパーソナライズされたチャットボットの適応性が、擬人化といった人間の傾向※とどのように絡み合うかについて論じています。チャットボットの個々のユーザーへの適応性、適応性を促す商業的圧力、そしてエージェント能力の向上が、ユーザーとチャットボットの間に個人的な関係を築き、影響を受けやすくする可能性があることを強調しています。本論文は、チャットボットとの相互作用における潜在的な心理的リスクと、人間とチャットボット間の双方向の信念増幅を考慮する必要性を強調しています。

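With an ordinary LLM summary → Japanese translation,

everything gets rounded off into phrases like "new risks emerge" (新しいリスクが生まれる), "hard-to-understand parts" (わかりにくい部分), "autonomy increases" (自律性が高まる), and "reinforcing each other's beliefs" (お互いに信念を強め合う) 😄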

There's just one spot that bothers me!

※ Rather than just 傾向 ("tendencies"), 認知傾向 ("cognitive tendencies") would be more appropriate.

For reference: here's a human-written summary!

All right! Looking good!
If I scale this up, will it become my ideal setup...?
