
Fixing the LlamaIndex sample code in a DevelopersIO article so that it runs on LlamaIndex 0.6.37

Posted at 2023-07-02, last updated at 2023-07-23

The original articles below have each been updated so that their content works with llama-index v0.7.9.

Introduction

In the article below, I modified the sample code from the DevelopersIO article "[LlamaIndex] How to get the passages (references) used for the answer when querying an Index" so that it runs on LlamaIndex 0.6.32.

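After that, while writing code based on the posts in the LlamaIndex article list, I noticed that the code shown under "Visualizing how nodes are split" in the article below raises an error on LlamaIndex 0.6.37. (The same applies to the "Document Store details" part of "A Tutorial for Fully Understanding LlamaIndex, Part 1: Basics for Understanding the Processing Concepts and Flow (for v0.6.8)".)
The behavior appears to have changed in LlamaIndex 0.6.34.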

for doc_id, node in list_index.storage_context.docstore.docs.items():
    node_dict = node.to_dict()
    print(f'{doc_id=}, len={len(node_dict["text"])}, start={node_dict["node_info"]["start"]}, end={node_dict["node_info"]["end"]}')

I modified this part of the code so that it runs on LlamaIndex 0.6.37.

References

Environment setup

Environment

$ python3 -V
Python 3.11.4

Installing the required Python packages

$ pip install llama-index pypdf

The main libraries that were installed are as follows.

$ pip freeze | grep -e "openai" -e "llama-index" -e "pypdf" -e "langchain"
langchain==0.0.220
langchainplus-sdk==0.0.19
llama-index==0.6.37
openai==0.27.8
pypdf==3.11.1
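
Since the behavior described in this article depends on the installed version, a small check like the following may help before running the samples. This is only a sketch: `parse_version` and `uses_text_node_schema` are hypothetical helper names, and it assumes plain dotted version strings such as 0.6.37 (no pre-release suffixes). The 0.6.34 boundary is the one observed in this article.

```python
from importlib import metadata


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '0.6.37' into (0, 6, 37)."""
    return tuple(int(part) for part in version.split(".")[:3])


def uses_text_node_schema(version: str) -> bool:
    """True for 0.6.34 and later, where this article observed the TextNode layout."""
    return parse_version(version) >= (0, 6, 34)


if __name__ == "__main__":
    try:
        # "llama-index" is the distribution name used with pip above.
        installed = metadata.version("llama-index")
    except metadata.PackageNotFoundError:
        print("llama-index is not installed")
    else:
        print(installed, uses_text_node_schema(installed))
```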

The error

Running the code from the original article on LlamaIndex 0.6.34 or later raises the following error.

Traceback (most recent call last):
  File "/home/foo/llamaindex-tutorial-002-text-splitter.py", line 25, in <module>
    node_dict = node.to_dict()
                ^^^^^^^^^^^^
AttributeError: 'TextNode' object has no attribute 'to_dict'

CHANGELOG.md states that Node was renamed to TextNode. It lists other changes as well.
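
If the same script has to run against both older and newer installs, one option is to pick the serialization method at runtime. The following is a minimal sketch, not part of the original article: `node_to_dict` is a hypothetical helper, and it relies only on the two method names that appear in this article (`to_dict()` on the old Node, `dict()` on TextNode).

```python
from typing import Any


def node_to_dict(node: Any) -> dict:
    """Serialize a docstore node with whichever method the object provides."""
    if hasattr(node, "to_dict"):
        # Node objects (0.6.33 and earlier in this article) expose to_dict().
        return node.to_dict()
    if hasattr(node, "dict"):
        # TextNode objects (0.6.34 and later in this article) expose dict().
        return node.dict()
    raise TypeError(f"cannot serialize {type(node).__name__}")
```

Note that this only hides the method name. The keys of the returned dict still differ between the two versions.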

Contents of list_index.storage_context.docstore.docs.items()

Inspecting the contents of list_index.storage_context.docstore.docs.items() gave the following.
In v0.6.34, besides Node becoming TextNode, the node_info property is gone, and start/end have become start_char_idx/end_char_idx and moved to a different level of the hierarchy.

Excerpt (v0.6.33 and earlier)

dict_items(
    [
        (
        '3caa54fc-5485-4ee8-9a2e-11619b60a811',
            Node(
                text='Contoso Electronics \nPlan and Benefit Packages\n',
                doc_id='3caa54fc-5485-4ee8-9a2e-11619b60a811',
                embedding=None,
                doc_hash='07baf7ce8f9d146792515b3eab972d9d4e3283a1f62375ee94a6ba1c08a20712',
                extra_info={
                    'page_label': '1',
                    'file_name': 'Benefit_Options.pdf'
                },
                node_info={
                    'start': 0,
                    'end': 47,
                    '_node_type': <NodeType.TEXT: '1'>
                },
                relationships={
                    <DocumentRelationship.SOURCE: '1'>: '6b0f0f21-261e-4dd5-aa08-1eea2d17d983'
                }
            )
        ),
    ]
)

Excerpt (v0.6.34 and later)

dict_items(
    [
        (
        'e2cfd446-104a-40fc-b562-a002e9d1d5de',
            TextNode(
                id_='e2cfd446-104a-40fc-b562-a002e9d1d5de',
                embedding=None,
                metadata={
                    'page_label': '1',
                    'file_name': 'Benefit_Options.pdf'
                },
                excluded_embed_metadata_keys=[],
                excluded_llm_metadata_keys=[],
                relationships={
                    <NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(
                        node_id='726c13d0-aaa7-43a8-affa-f3dbb04b790a',
                        node_type=None,
                        metadata={
                            'page_label': '1',
                            'file_name': 'Benefit_Options.pdf'
                        },
                        hash='07baf7ce8f9d146792515b3eab972d9d4e3283a1f62375ee94a6ba1c08a20712'
                        )
                },
                hash='bf6aea8c7ea133655178abef2f5aa6c9eb47b52df6b1770170e1d49466b0f7d0',
                text='Contoso Electronics \nPlan and Benefit Packages',
                start_char_idx=0,
                end_char_idx=46,
                text_template='{metadata_str}\n\n{content}',
                metadata_template='{key}: {value}',
                metadata_seperator='\n'
            )
        ),
    ]
)
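
To make the difference between the two excerpts concrete, the character offsets can be read from either layout with a small function. This is a sketch on plain dicts, assuming the serialized form mirrors the fields shown in the excerpts. `char_span` is a hypothetical name, not a LlamaIndex API.

```python
from typing import Optional, Tuple


def char_span(node_dict: dict) -> Tuple[Optional[int], Optional[int]]:
    """Return (start, end) character offsets from either serialized layout."""
    info = node_dict.get("node_info")
    if info:
        # Old layout: offsets are nested under node_info as start/end.
        return info.get("start"), info.get("end")
    # New layout: offsets sit at the top level as start_char_idx/end_char_idx.
    return node_dict.get("start_char_idx"), node_dict.get("end_char_idx")


# Values taken from the two excerpts above.
old_style = {"node_info": {"start": 0, "end": 47}}
new_style = {"start_char_idx": 0, "end_char_idx": 46}
print(char_span(old_style), char_span(new_style))
```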

Python code

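I modified the code as follows to match the changes in LlamaIndex 0.6.34. In ./data I placed the PDF used in "A Tutorial for Fully Understanding LlamaIndex, Part 1: Basics for Understanding the Processing Concepts and Flow (for v0.6.8)".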

from llama_index import SimpleDirectoryReader
from llama_index import Document
from llama_index import GPTListIndex
from llama_index import ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.constants import DEFAULT_CHUNK_OVERLAP, DEFAULT_CHUNK_SIZE
import tiktoken

documents = SimpleDirectoryReader(input_dir="./data").load_data()

text_splitter = TokenTextSplitter(separator=" ", chunk_size=DEFAULT_CHUNK_SIZE
    , chunk_overlap=DEFAULT_CHUNK_OVERLAP
    , tokenizer=tiktoken.get_encoding("gpt2").encode)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

service_context = ServiceContext.from_defaults(
    node_parser=node_parser
)

list_index = GPTListIndex.from_documents(documents
    , service_context=service_context)

for doc_id, node in list_index.storage_context.docstore.docs.items():
    node_dict = node.dict()
    print(f'{doc_id=}, len={len(tiktoken.get_encoding("cl100k_base").encode(node_dict["text"]))}, start={node_dict["start_char_idx"]}, end={node_dict["end_char_idx"]}')

# query_engine = list_index.as_query_engine()

# response = query_engine.query("機械学習に関するアップデートについて300字前後で要約してください。")

# for i in response.response.split("。"):
#     print(i + "。")

Diff from the original code

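The output of node.dict() includes metadata, so it is also possible to print which page of which file each node was split from.
It is interesting to add something like page_label={node_dict["metadata"]["page_label"]}, file_name={node_dict["metadata"]["file_name"]} to the f-string and look over the output.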
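
The metadata idea can be sketched as a small formatting function that works on the dict returned by node.dict(). `describe_node` is a hypothetical helper, and the "?" fallback for missing keys is an assumption of this sketch (for example, for files whose loader does not set page_label).

```python
def describe_node(doc_id: str, node_dict: dict) -> str:
    """Format one line showing where a node came from and its character span."""
    node_metadata = node_dict.get("metadata") or {}
    # Fall back to "?" when the loader did not record these keys.
    file_name = node_metadata.get("file_name", "?")
    page_label = node_metadata.get("page_label", "?")
    return (
        f"{doc_id=}, file_name={file_name}, page_label={page_label}, "
        f"start={node_dict.get('start_char_idx')}, end={node_dict.get('end_char_idx')}"
    )


# Stand-in for node.dict(), using values from the v0.6.34 excerpt.
sample = {
    "metadata": {"page_label": "1", "file_name": "Benefit_Options.pdf"},
    "start_char_idx": 0,
    "end_char_idx": 46,
}
print(describe_node("e2cfd446-104a-40fc-b562-a002e9d1d5de", sample))
```

Inside the loop from the Python code section, this would be called as describe_node(doc_id, node.dict()).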

25,26c25,26
<     node_dict = node.to_dict()
<     print(f'{doc_id=}, len={len(tiktoken.get_encoding("cl100k_base").encode(node_dict["text"]))}, start={node_dict["node_info"]["start"]}, end={node_dict["node_info"]["end"]}')
---
>     node_dict = node.dict()
>     print(f'{doc_id=}, len={len(tiktoken.get_encoding("cl100k_base").encode(node_dict["text"]))}, start={node_dict["start_char_idx"]}, end={node_dict["end_char_idx"]}')
