Trying Out LlamaIndex on Databricks

Posted at 2024-01-16

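I had not realized something like this was released last year; I had been too absorbed in LangChain.

According to the LlamaIndex documentation, it is a data framework for LLM-based applications that ingest, structure, and access private or domain-specific data.

It is also explained in this linked article.

In this article, I start by walking through the tutorial.

I use a cluster running the latest ML runtime.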

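Install the library. Note that the imports in this article use the top-level llama_index package, which matches the releases available at the time of posting; as of version 0.10, the same names are imported from llama_index.core.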

%pip install -U llama-index
dbutils.library.restartPython()

Prepare the document. The tutorial uses this essay, so save it locally as a text file. Since the OpenAI API is used, also store your API key in a Databricks secret.

The text is stored in a Unity Catalog volume. Here, after creating a volume named takaakiyayoi_catalog.llama_index.data, run the following.

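import os

# Read the OpenAI API key from a Databricks secret scope
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get("demo-token-takaaki.yayoi", "openai_api_key")

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Folder inside the Unity Catalog volume that will hold the text files
TEXT_DIR = "/Volumes/takaakiyayoi_catalog/llama_index/data/corpus"
os.makedirs(TEXT_DIR, exist_ok=True)  # safe to re-run if the folder already exists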

This creates a folder named corpus; upload the text file you saved earlier into it.
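
Before building the index, it can help to confirm that the upload landed where expected. The following is a minimal sketch and not part of the original tutorial: `list_corpus_files` is a hypothetical helper name, and it assumes the volume is reachable as an ordinary directory path, as the directory-creation call above suggests.

```python
import os


def list_corpus_files(text_dir):
    """Return the sorted names of regular files directly under text_dir.

    Subdirectories are skipped; this is only a quick sanity check.
    """
    return sorted(
        name
        for name in os.listdir(text_dir)
        if os.path.isfile(os.path.join(text_dir, name))
    )


# Possible usage in the notebook (TEXT_DIR is defined in the cell above):
# files = list_corpus_files(TEXT_DIR)
# if not files:
#     raise FileNotFoundError(f"No files found in {TEXT_DIR}; upload the essay first")
```

If the returned list is empty, the upload has not completed or went to a different folder.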

Run the following to build the vector index.

documents = SimpleDirectoryReader("/Volumes/takaakiyayoi_catalog/llama_index/data/corpus").load_data()
index = VectorStoreIndex.from_documents(documents)

Submit a query.

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

The result is similar to the one shown in the tutorial.

The author mentioned that before college, the two main things they worked on outside of school were writing and programming. They wrote short stories and also tried writing programs on the IBM 1401 computer. They later got a microcomputer and started programming more extensively, writing simple games and a word processor.
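
Roughly speaking, a vector index answers a query by embedding it, comparing it against the embedded chunks of the documents, and passing the closest chunks to the LLM. The sketch below illustrates only the retrieval step, under loud assumptions: the word-count `embed` function is a toy stand-in for a real embedding model, the sample chunks paraphrase the response above rather than quoting the essay, and none of this is LlamaIndex's actual implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical sample chunks, paraphrased from the response shown above.
chunks = [
    "They wrote short stories outside of school.",
    "They tried writing programs on the IBM 1401 computer.",
    "They later got a microcomputer and wrote simple games and a word processor.",
]

top = retrieve("Which computer did they write programs on?", chunks, top_k=2)
```

Here the chunk about the IBM 1401 ranks first because it shares the most words with the query; a real embedding model would match on meaning rather than exact words.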

Persist the vector index.

PERSIST_DIR = "/Volumes/takaakiyayoi_catalog/llama_index/data/indexes/tutorial"

# Load the documents and build the index
documents = SimpleDirectoryReader(TEXT_DIR).load_data()
index = VectorStoreIndex.from_documents(documents)
# Save it so it can be reused later
index.storage_context.persist(persist_dir=PERSIST_DIR)

The index has been saved.

The persisted vector index can then be loaded and queried.

from llama_index import StorageContext, load_index_from_storage

# Rebuild the storage context from the persisted files and load the index
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

The author mentioned that before college, the two main things they worked on outside of school were writing and programming. They wrote short stories and also tried writing programs on the IBM 1401 computer. They later got a microcomputer and started programming more extensively, writing simple games and a word processor.
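
The persist and load steps can be combined into a build-once pattern so that embeddings are not recomputed on every run. The following is a minimal sketch of that idea: `load_or_build` is a hypothetical helper, and the `build` and `load` callables are supplied by the caller. It treats any non-empty directory as a saved index, which is an assumption of this sketch.

```python
import os


def load_or_build(persist_dir, build, load):
    """Return load(persist_dir) if something is already saved there.

    Otherwise call build(persist_dir), which is expected to create the index,
    persist it under persist_dir, and return it.
    """
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


# Possible wiring with the calls used in this article (not executed here):
#
# def build(persist_dir):
#     documents = SimpleDirectoryReader(TEXT_DIR).load_data()
#     index = VectorStoreIndex.from_documents(documents)
#     index.storage_context.persist(persist_dir=persist_dir)
#     return index
#
# def load(persist_dir):
#     storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
#     return load_index_from_storage(storage_context)
#
# index = load_or_build(PERSIST_DIR, build, load)
```

Passing the two steps in as callables keeps the helper independent of the library, so the existence check can be tried out without an API key.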

Since an OpenAI model is doing the work behind the scenes, queries in Japanese work as well.

query_engine = index.as_query_engine()
response = query_engine.query("幼少期、著者は何をしましたか？")
print(response)

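幼少期、著者は短編小説を書いたり、IBM 1401というコンピュータでプログラムを試したりしました。また、友人と一緒にヒースキットのキットでマイクロコンピュータを組み立てたりもしました。その後、父親にTRS-80というコンピュータを買ってもらい、プログラミングを本格的に始めました。

(In English: As a child, the author wrote short stories and tried out programs on a computer called the IBM 1401. They also built a microcomputer from a Heathkit kit together with a friend. Later, their father bought them a computer called the TRS-80, and they started programming in earnest.)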

This is handy. It makes it easy to get a RAG (Retrieval-Augmented Generation) application up and running.
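
For readers new to the term, the "augmented" part of RAG means that retrieved text is placed into the prompt alongside the question before the LLM is called. The sketch below shows that assembly step only. `build_prompt` is a hypothetical helper, and the prompt wording is illustrative; it is not claimed to be the template LlamaIndex uses.

```python
def build_prompt(question, context_chunks):
    """Assemble a single prompt string from retrieved chunks and a question."""
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the query.\n"
        f"Query: {question}\n"
        "Answer: "
    )


# Example with placeholder chunks; in a real pipeline these come from retrieval.
prompt = build_prompt(
    "What did the author do growing up?",
    ["<retrieved chunk 1>", "<retrieved chunk 2>"],
)
```

The resulting string is what would be sent to the LLM; the query engine used in this article handles retrieval, prompt construction, and the LLM call in one step.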
