Hands-on: Entity Extraction with Few-shot Examples Using LangExtract

Last updated at 2025-08-03, posted at 2025-08-03

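1. Introduction

Google has released LangExtract, a Gemini-powered library for structuring text, so I ran a hands-on session to understand what it does.

2. About LangExtract

2.1. What kind of library is it?

A Python library that uses an LLM to extract entities from source text. The user supplies a handful of few-shot examples that show which categories to look for. The examples are passed to the model as part of the prompt (no model training takes place), and the model then labels the matching spans in the source text.

2.2. Rough picture

For a given source text, the response lists each extracted span together with the user-defined entity class (category) it was assigned to. The following is the output produced by the sample in this article.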

{
  "extractions": [
    {
      "extraction_class": "仏教・哲学概念",
      "extraction_text": "祇園精舎の鐘の聲、諸行無常の響あり",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 1,
      "group_index": 0,
      "description": null,
      "attributes": {}
    },
<< omitted >>
  "text": "祇園精舎の鐘の聲(omitted)も及ばれね。",
  "document_id": "doc_b533bb78"
}

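2.3. What's nice about it

No.	Item	Details
1	Precise source grounding	Every extracted entity is mapped back to the exact location in the source text it came from.
2	Consistent structured output	Output follows a uniform format based on the user's few-shot examples, so results are reliably structured.
3	Optimized for long documents	Text chunking, parallel processing, and multiple extraction passes keep recall high even on large documents.
4	Interactive visualization	Extracted entities can be reviewed in their original context (a self-contained interactive HTML file is generated).
5	Flexible LLM support	Works with Google Gemini as well as local open-source models through the built-in Ollama interface.
6	Works in any domain	An extraction task for any domain can be defined from just a few examples.
7	Uses the LLM's world knowledge	Precise prompt wording and few-shot examples control how the LLM behaves in the extraction task.

2.4. Scope of this article

This article tests extraction driven by few-shot examples. The visualization feature (HTML file generation) is out of scope.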

3. Hands-on

3.1. Prerequisites

3.1.1. Execution environment

Item	Setting
Environment	AWS CloudShell

3.1.2. Preparation

An API key has already been obtained from Google AI Studio.

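3.2. Environment setup

1. Set up the environment

# Create a dedicated folder
mkdir langextract_demo
cd langextract_demo

# Create a virtual environment
python -m venv langextract_env

# Activate the virtual environment
source langextract_env/bin/activate

# Install LangExtract
pip install langextract

# Library for encoding detection
pip install chardet

2. Set the API key

# Set it as an environment variable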
export LANGEXTRACT_API_KEY="your-api-key-here"
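
Before running anything that calls the API, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `api_key_is_set` is hypothetical and is not part of LangExtract.

```python
import os


def api_key_is_set(var_name: str = "LANGEXTRACT_API_KEY") -> bool:
    """Return True when the environment variable exists and is non-empty."""
    return bool(os.environ.get(var_name, "").strip())


if __name__ == "__main__":
    if api_key_is_set():
        print("LANGEXTRACT_API_KEY is set")
    else:
        print("LANGEXTRACT_API_KEY is missing; run the export command first")
```

Running this inside the activated virtual environment, in the same shell session as the `export`, should catch the common case where the key was set in a different session.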

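3.3. Preparing the files

1. Few-shot examples (config file)

The config file supplies few-shot examples for classifying text into the following categories.

No.	Category	Example extraction
1	人物名 (person name)	源頼朝
2	地名 (place name)	鎌倉
3	勢力・氏族 (faction / clan)	平氏
4	仏教・哲学概念 (Buddhist / philosophical concept)	無常の理
5	文学的表現 (literary expression)	桜の花のように散りゆく
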
cat > config.json << EOF
{
  "prompt_description": "古典文学から人物名、場所、仏教・哲学概念、歴史的事件、文学的表現を抽出してください",
  "examples": [
    {
      "text": "源頼朝は鎌倉で平氏を討ち、無常の理を悟った。桜の花のように散りゆくものかな。",
      "extractions": [
        {
          "extraction_class": "人物名",
          "extraction_text": "源頼朝"
        },
        {
          "extraction_class": "地名",
          "extraction_text": "鎌倉"
        },
        {
          "extraction_class": "勢力・氏族",
          "extraction_text": "平氏"
        },
        {
          "extraction_class": "仏教・哲学概念",
          "extraction_text": "無常の理"
        },
        {
          "extraction_class": "文学的表現",
          "extraction_text": "桜の花のように散りゆく"
        }
      ]
    }
  ]
}
EOF
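
In the config above, every `extraction_text` appears verbatim in its example `text`. The following sketch checks that property for a config of this shape. The function name `find_non_verbatim` is hypothetical, and the config is embedded as a shortened dict so the block runs on its own.

```python
from typing import Any, Dict, List, Tuple


def find_non_verbatim(config: Dict[str, Any]) -> List[Tuple[int, str]]:
    """Return (example index, extraction_text) pairs not found in the example text."""
    problems = []
    for i, example in enumerate(config["examples"]):
        for extraction in example["extractions"]:
            if extraction["extraction_text"] not in example["text"]:
                problems.append((i, extraction["extraction_text"]))
    return problems


# Same example text as config.json, shortened to two extractions for brevity.
sample_config = {
    "prompt_description": "(omitted in this sketch)",
    "examples": [
        {
            "text": "源頼朝は鎌倉で平氏を討ち、無常の理を悟った。桜の花のように散りゆくものかな。",
            "extractions": [
                {"extraction_class": "人物名", "extraction_text": "源頼朝"},
                {"extraction_class": "地名", "extraction_text": "鎌倉"},
            ],
        }
    ],
}

print(find_non_verbatim(sample_config))  # → []
```

To check the real file, the dict could be replaced by `json.load(open("config.json", encoding="utf-8"))`.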

2. Input file containing the source text to extract from

The opening of The Tale of the Heike (平家物語) is used as the extraction target.

cat > input.txt << EOF
祇園精舎の鐘の聲、諸行無常の響あり。娑羅（しやら）雙樹の花の色、盛者（じやうしや）必衰のことはりをあらはす。おごれる人も久しからず、只春の夜（よ）の夢のごとし。たけき者も遂にはほろびぬ、偏に風の前の塵に同じ。遠く異朝をとぶらへば、秦の趙高（てうかう）、漢の王莽（わうまう）、梁の朱异（しうい）、唐の禄山（ろくさん）、是等（これら）は皆舊主先皇（せんくわう）の政（まつりごと）にもしたがはず、樂しみをきはめ、諌（いさめ）をもおもひいれず、天下（てんが）のみだれむ事をさとらずして、民間の愁（うれふ）る所をしらざ（ツ）しかば、久しからずして、亡（ばう）じにし者どもなり。近く本朝をうかゞふに、承平（せうへい）の將門、天慶（てんぎやう）の純友（すみとも）、康和の義親（ぎしん）、平治の信頼（しんらい）、おごれる心もたけき事も、皆とりどりにこそありしかども、まぢかくは、六波羅の入道前（さき）の太政大臣（だいじやうだいじん）平の朝臣（あ（ツ）そん）淸盛公と申（まうし）し人のありさま、傳（つたへ）承るこそ心も詞（ことば）も及ばれね。
EOF

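3. Model settings file (model_config.json)

Parameter	Value	Description
model_id	gemini-2.0-flash-001	AI model to use (Gemini 2.0 Flash)
temperature	0.1	Sampling diversity (0.0 = conservative to 2.0 = creative)
max_workers	10	Degree of parallelism (number of workers running at once)
max_char_buffer	100	Number of characters handled per request (chunk size)
extraction_passes	1	How many times extraction is repeated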

cat > model_config.json << EOF
{
  "model_id": "gemini-2.0-flash-001",
  "temperature": 0.1,
  "max_workers": 10,
  "max_char_buffer": 100,
  "extraction_passes": 1
}
EOF
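
As rough intuition for `max_char_buffer`, the sketch below splits a text into fixed-width pieces. This naive split is an assumption for illustration only; LangExtract's own chunker may cut at different points. The function name `naive_chunks` is hypothetical. For a 462-character text and a buffer of 100 it yields five pieces, which is consistent with the chunk count reported in the run log in section 4.

```python
from typing import List


def naive_chunks(text: str, max_char_buffer: int) -> List[str]:
    """Split text into consecutive pieces of at most max_char_buffer characters."""
    if max_char_buffer <= 0:
        raise ValueError("max_char_buffer must be positive")
    return [text[i:i + max_char_buffer] for i in range(0, len(text), max_char_buffer)]


# Placeholder text with the same length the run log reports (462 characters).
dummy_text = "あ" * 462
chunks = naive_chunks(dummy_text, 100)
print(len(chunks), [len(c) for c in chunks])  # → 5 [100, 100, 100, 100, 62]
```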

3.4. Preparing the script

1. Create the script

cat > langextract_file.py << 'EOF'
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
LangExtract runner driven by external files.
Loads the prompt settings, few-shot data, and input text from external files.
"""

import argparse
import json
import sys
from pathlib import Path
from typing import Dict, Any, List
import chardet
import langextract as lx


class FileHandler:
    """Handles file reading and encoding detection."""

    @staticmethod
    def detect_encoding(file_path: Path) -> str:
        """Detect the file's encoding automatically."""
        with open(file_path, 'rb') as f:
            raw_data = f.read()

        # Check for a byte order mark first (no message is printed)
        if raw_data.startswith(b'\xef\xbb\xbf'):
            return 'utf-8-sig'
        elif raw_data.startswith(b'\xff\xfe'):
            return 'utf-16-le'
        elif raw_data.startswith(b'\xfe\xff'):
            return 'utf-16-be'

        # Otherwise let chardet guess
        detected = chardet.detect(raw_data)
        encoding = detected['encoding']
        confidence = detected['confidence']

        # Fall back to UTF-8 when chardet has no answer or low confidence
        if not encoding or confidence < 0.7:
            return 'utf-8'

        return encoding
    
    @staticmethod
    def read_text_file(file_path: Path) -> str:
        """Read a text file and return its contents."""
        if not file_path.exists():
            raise FileNotFoundError(f"ファイルが見つかりません: {file_path}")

        encoding = FileHandler.detect_encoding(file_path)

        try:
            with open(file_path, 'r', encoding=encoding) as f:
                content = f.read()
        except (UnicodeDecodeError, LookupError):
            # Fall back to UTF-8, replacing undecodable bytes
            with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
                content = f.read()

        return content.strip()  # strip leading/trailing whitespace, including the final newline

    @staticmethod
    def read_json_file(file_path: Path) -> Dict[str, Any]:
        """Read and parse a JSON file."""
        content = FileHandler.read_text_file(file_path)

        try:
            data = json.loads(content)
            return data
        except json.JSONDecodeError as e:
            raise ValueError(f"JSON解析失敗: {file_path} - {e}") from e


class ConfigValidator:
    """Validates the configuration files."""

    @staticmethod
    def validate_config(config: Dict[str, Any]) -> bool:
        """Validate the structure of the extraction config file."""
        required_fields = ['prompt_description', 'examples']

        for field in required_fields:
            if field not in config:
                raise Exception(f"必須フィールド不足: {field}")

        if not isinstance(config['examples'], list):
            raise Exception("examplesは配列である必要があります")

        for i, example in enumerate(config['examples']):
            ConfigValidator._validate_example(example, i)

        return True

    @staticmethod
    def validate_model_config(config: Dict[str, Any]) -> bool:
        """Minimal validation of the model config (only checks that model_id exists)."""
        if 'model_id' not in config:
            raise Exception("model_idが必要です")

        return True

    @staticmethod
    def _validate_example(example: Dict[str, Any], index: int) -> bool:
        """Validate a single few-shot example."""
        if 'text' not in example:
            raise Exception(f"サンプル{index}: textフィールドが必要")

        if 'extractions' not in example:
            raise Exception(f"サンプル{index}: extractionsフィールドが必要")

        if not isinstance(example['extractions'], list):
            raise Exception(f"サンプル{index}: extractionsは配列である必要があります")

        for j, extraction in enumerate(example['extractions']):
            ConfigValidator._validate_extraction(extraction, index, j)

        return True

    @staticmethod
    def _validate_extraction(extraction: Dict[str, Any], example_idx: int, extraction_idx: int) -> bool:
        """Validate a single extraction entry."""
        required = ['extraction_class', 'extraction_text']

        for field in required:
            if field not in extraction:
                raise Exception(f"サンプル{example_idx}の抽出{extraction_idx}: {field}フィールドが必要")

        return True


def create_langextract_examples(config_data: Dict[str, Any]) -> List[lx.data.ExampleData]:
    """Build LangExtract ExampleData objects from the config data."""
    examples = []

    for example_config in config_data['examples']:
        extractions = []

        for ext_config in example_config['extractions']:
            extraction = lx.data.Extraction(
                extraction_class=ext_config['extraction_class'],
                extraction_text=ext_config['extraction_text'],
                attributes=ext_config.get('attributes', {})
            )
            extractions.append(extraction)

        example_data = lx.data.ExampleData(
            text=example_config['text'],
            extractions=extractions
        )
        examples.append(example_data)

    return examples


def get_default_model_config() -> Dict[str, Any]:
    """Return the default model settings (used when model_config.json is absent)."""
    return {
        "model_id": "gemini-2.0-flash-001",
        "temperature": 0.5,
        "max_workers": 10,
        "max_char_buffer": 1000,
        "extraction_passes": 1,
        "language_model_params": {}
    }


def main():
    """Entry point."""
    parser = argparse.ArgumentParser(description='LangExtract実行')
    parser.add_argument('config', type=str, help='設定ファイル(JSON)')
    parser.add_argument('input', type=str, help='入力テキストファイル')
    parser.add_argument('--output', '-o', type=str, default='extraction_results.jsonl',
                      help='出力ファイル名')

    args = parser.parse_args()

    # File paths
    config_path = Path(args.config)
    input_path = Path(args.input)
    model_config_path = Path("model_config.json")  # fixed file name
    output_path = args.output

    try:
        # Load and validate the extraction config
        config_data = FileHandler.read_json_file(config_path)
        ConfigValidator.validate_config(config_data)

        # Load the model config (fixed file); use the defaults if it is absent
        if model_config_path.exists():
            model_config = FileHandler.read_json_file(model_config_path)
            ConfigValidator.validate_model_config(model_config)
        else:
            model_config = get_default_model_config()

        # Load the input text
        input_text = FileHandler.read_text_file(input_path)

        # Build the few-shot examples and the arguments for lx.extract
        examples = create_langextract_examples(config_data)

        extract_params = {
            'text_or_documents': input_text,
            'prompt_description': config_data['prompt_description'],
            'examples': examples,
            'model_id': model_config['model_id'],
            'temperature': model_config.get('temperature', 0.5),
            'max_workers': model_config.get('max_workers', 10),
            'max_char_buffer': model_config.get('max_char_buffer', 1000),
            'extraction_passes': model_config.get('extraction_passes', 1)
        }

        if 'language_model_params' in model_config and model_config['language_model_params']:
            extract_params['language_model_params'] = model_config['language_model_params']

        # Run LangExtract and add context to any failure
        try:
            result = lx.extract(**extract_params)
        except Exception as e:
            raise Exception(f"LangExtract実行失敗: {e}") from e

        # Show the result
        print(f"抽出完了: {len(result.extractions)}項目")

        # Tally extracted texts by category
        categories = {}
        for extraction in result.extractions:
            categories.setdefault(extraction.extraction_class, []).append(extraction.extraction_text)

        for category, items in categories.items():
            print(f"  {category}: {len(items)}個")

        # Save the result (in this run the file was written under test_output/)
        lx.io.save_annotated_documents([result], output_name=output_path)
        print(f"保存完了: {output_path}")

    except Exception as e:
        print(f"エラー: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
EOF

4. Running the script

1. Command

# Command
python langextract_file.py config.json input.txt

2. Response

# Output
LangExtract: model=gemini-2.0-flash-001, current=462 chars, processed=462 chars:  [00:04]
✓ Extraction processing complete
✓ Extracted 25 entities (4 unique types)
  • Time: 4.96s
  • Speed: 93 chars/sec
  • Chunks: 5
抽出完了: 25項目
  仏教・哲学概念: 2個
  文学的表現: 2個
  人物名: 9個
  地名: 12個
LangExtract: Saving to extraction_results.jsonl: 1 docs [00:00, 808.31 docs/s]
✓ Saved 1 documents to extraction_results.jsonl
保存完了: extraction_results.jsonl

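3. Contents of the response file (line breaks added for readability)

The result can be checked in the output file `test_output/extraction_results.jsonl`. Based on the few-shot examples in the config file, spans of the source text are assigned to the defined categories. Five categories are defined, and the run above returned four of them (output abbreviated).

▫️ LangExtract output fields

No.	Field	Description	Value
1	extraction_class	Category name (as defined in the few-shot examples)	仏教・哲学概念
2	extraction_text	Text that was actually extracted	祇園精舎の鐘の聲、諸行無常の響あり
3	char_interval	Character position in the source text	null (* the position could not be determined)
4	alignment_status	How the extracted text aligns with the source text	null (* null because No. 3 is null)
5	extraction_index	Order in which the item was extracted	1
6	group_index	Index of the group the extraction belongs to	0
7	description	Description of the extraction	null (* not used in this run)
8	attributes	Custom attributes	{} (* empty because not used in this run)

# Command
cat test_output/extraction_results.jsonl

# File contents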
{
  "extractions": [
    {
      "extraction_class": "仏教・哲学概念",
      "extraction_text": "祇園精舎の鐘の聲、諸行無常の響あり",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 1,
      "group_index": 0,
      "description": null,
      "attributes": {}
    },
<< omitted >>
  "text": "祇園精舎の鐘の聲(omitted)も及ばれね。",
  "document_id": "doc_b533bb78"
}
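
Because the saved file is JSONL (one JSON document per line), it can be post-processed with the standard library alone. The sketch below tallies extractions per `extraction_class`. The function name `count_by_class` is hypothetical, and the sample line is hand-written in the same shape as the output above; it is not a real result.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_class(jsonl_lines: Iterable[str]) -> Counter:
    """Count extractions per extraction_class across JSONL lines."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        document = json.loads(line)
        for extraction in document.get("extractions", []):
            counts[extraction["extraction_class"]] += 1
    return counts


# One hand-written line in the same shape as the output above (not real results).
sample_line = json.dumps({
    "extractions": [
        {"extraction_class": "人物名", "extraction_text": "源頼朝"},
        {"extraction_class": "地名", "extraction_text": "鎌倉"},
        {"extraction_class": "人物名", "extraction_text": "平氏"},
    ],
    "text": "(omitted)",
    "document_id": "doc_example",
}, ensure_ascii=False)

print(count_by_class([sample_line]))
```

For the real output, an open file object can be passed directly, for example `count_by_class(open("test_output/extraction_results.jsonl", encoding="utf-8"))`.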

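5. Conclusion

5.1. What I learned

- A few few-shot examples are enough to steer the model toward a category classification pattern.
- Label-based classification also works on material such as classical literature.

5.2. Open issues

- `char_interval` comes back as null, so I need to look into how to obtain position information.
  - I suspected the text was being split into chunks that were too small (chunk size currently 100), but raising it to 1000 produced the same result.
  - I will rerun with a different model, with English text, and so on to investigate.
- I would like to enrich the few-shot examples to improve accuracy.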
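
Until the cause of the null `char_interval` is understood, one stopgap is to locate each `extraction_text` in the source with a plain exact-match search. This is not LangExtract's alignment logic; it swaps in Python's `str.find`, and it returns `None` whenever the model's text differs from the source by even one character. The function name `locate_exact` is hypothetical.

```python
from typing import Optional, Tuple


def locate_exact(source: str, extraction_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first exact occurrence, or None if absent."""
    if not extraction_text:
        return None
    start = source.find(extraction_text)
    if start == -1:
        return None
    return start, start + len(extraction_text)


# First two sentences of input.txt, copied as-is.
source = "祇園精舎の鐘の聲、諸行無常の響あり。娑羅（しやら）雙樹の花の色、盛者（じやうしや）必衰のことはりをあらはす。"
span = locate_exact(source, "祇園精舎の鐘の聲、諸行無常の響あり")
print(span)  # → (0, 17)
```

The returned pair can be used as a slice, `source[start:end]`, to confirm that the span reproduces the extracted text.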
