Running the English SDK for Apache Spark with a Local LLM on Databricks

Last updated at 2023-11-11. Posted at 2023-11-11.

At last... it works...? (Not really.)

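Introduction

I was inspired after seeing it run in this article.

In the article above, the English SDK for Apache Spark is used with AWS Bedrock (Claude).
As another variation, this post is an attempt to see what happens with a local LLM.

I never wrote about it, but I had actually tried several times to run the English SDK for Apache Spark with a local LLM. Every one of those attempts failed.

I ended up falling back to the OpenAI API, but when I took another shot at it after a long break, I got it to work a little, so here is a write-up.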

What is the English SDK for Apache Spark?

Please read the article above, and this one as well.

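The model used this time

I use this one.

The model before quantization is here.

DeepSeek Coder is a model specialized for code generation. In the benchmark results published on the official site below, it outperforms GPT-3.5-turbo.

Surely this one... surely this one will pull it off! With that hope, I gave it a try.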

Setup

Install the required modules.

%pip install -U -qq pyspark-ai transformers accelerate langchain python-dateutil autoawq=="0.1.5" sqlalchemy

dbutils.library.restartPython()

Load the DeepSeek Coder 33B Instruct AWQ model, which I downloaded beforehand.
A post in the model's Community tab on Hugging Face reported that transformers does not generate results properly, so I use AutoAWQ directly.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

UC_VOLUME = "/Volumes/training/llm/model_snapshots"
MODEL_DIR = "models--TheBloke--deepseek-coder-33B-instruct-AWQ"
model_path = f"{UC_VOLUME}/{MODEL_DIR}"

generator = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

Create the langchain Chat Model.
ChatAutoAWQ is a custom class that I created here.

from autoawq_chat import ChatAutoAWQ

human_message_template = """You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
### Instruction:
{}
### Response:"""

chat_model = ChatAutoAWQ(
    generator=generator,
    tokenizer=tokenizer,
    human_message_template=human_message_template,
    ai_message_template="{}",
    temperature=0.1,
    max_new_tokens=2048,
)
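The implementation of ChatAutoAWQ is not shown in this post, so as a rough illustration of what the two template arguments are for, here is a minimal sketch of how a chat wrapper might render messages into a single prompt string. The function name `build_prompt`, the `(role, text)` message format, and the shortened template are all hypothetical and are not taken from the actual class.

```python
from typing import List, Tuple

# Shortened stand-in for the template above (the system preamble is omitted).
HUMAN_TEMPLATE = "### Instruction:\n{}\n### Response:"
AI_TEMPLATE = "{}"


def build_prompt(messages: List[Tuple[str, str]]) -> str:
    """Render (role, text) pairs into one prompt string.

    Hypothetical helper: the real ChatAutoAWQ may build prompts differently.
    """
    parts = []
    for role, text in messages:
        if role == "human":
            parts.append(HUMAN_TEMPLATE.format(text))
        elif role == "ai":
            parts.append(AI_TEMPLATE.format(text))
        else:
            raise ValueError(f"unsupported role: {role}")
    return "\n".join(parts)


prompt = build_prompt([("human", "Write a SQL query that counts rows.")])
print(prompt)
```

Because the message text is passed as an argument to `str.format`, braces inside the message itself are safe. Literal braces inside the template would need to be escaped as `{{` and `}}`.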

Setup is now complete.

Trying it out

Activate SparkAI. When creating SparkAI, I pass in the chat_model created above.
I also set verbose to True so that the execution progress is visible.

from pyspark_ai import SparkAI

spark_ai = SparkAI(llm=chat_model, verbose=True)
spark_ai.activate()  # activate partial functions for Spark DataFrame

Now I run transformations and plots in a way similar to this one.

Transformation 1. What period does the data cover?

# Load the sample data
taxi_df = spark.read.table("samples.nyctaxi.trips")

# Process the data with the LLM
# Prompt (Japanese): "What period does the data cover?"
answer_df = spark_ai.transform_df(taxi_df, "いつからいつまでのデータがありますか？")
display(answer_df)

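It worked without any problems!
Previously the model could not generate code even at this level, so I'm happy!

Transformation 2. Aggregate the total fare by day.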

# Prompt (Japanese): "Aggregate the total fare by day."
answer_df = spark_ai.transform_df(taxi_df, "乗車料金の合計を日別に集計してください。")
display(answer_df)

Cleared without any problems!

Next, I have it explain the data.

Having it explain the data

I have it explain the transformed data.

answer_df.ai.explain()

Output

'In summary, this dataframe is retrieving the total fare amount for each day of the trips. It presents the results sorted by the date of pickup.\n<|EOT|>'

It gives a proper explanation, and the content is accurate too!
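One detail in the output above is that the returned string still ends with the raw `<|EOT|>` marker. As a minimal sketch, assuming you want to clean this up after generation, a small helper like the following could cut the text at that marker. The name `strip_eot` is hypothetical, and where to hook it in (for example inside the custom chat class) is left open.

```python
EOT_TOKEN = "<|EOT|>"  # end-of-turn marker seen in the output above


def strip_eot(text: str, eot: str = EOT_TOKEN) -> str:
    """Cut the text at the first end-of-turn marker and trim whitespace."""
    head, _, _ = text.partition(eot)
    return head.strip()


raw = "It presents the results sorted by the date of pickup.\n<|EOT|>"
print(strip_eot(raw))
```

Text without the marker is returned unchanged apart from the whitespace trimming.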

Finally, plotting.

Plot. No specific instructions

I have it plot without giving any prompt.

# This fails...
answer_df.ai.plot()

Output

INFO: 
Here is a Python code snippet that meets all the requirements:

from pyspark.sql import SparkSession
import pandas as pd
import plotly.express as px

# Start Spark session
spark = SparkSession.builder.getOrCreate()

# Assuming df is your DataFrame
df = spark.sql("SELECT pickup_date, total_fare FROM your_table")

(middle omitted)

Please replace `'your_table'` with your actual table name.

This code will create a line plot of the total fare over time (pickup_date). If you want to aggregate the data in a different way, you can modify the `groupby` and `sum` lines accordingly.
<|EOT|>
WARN: Getting the following error: 
[TABLE_OR_VIEW_NOT_FOUND] The table or view `your_table` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 36;
'Project ['pickup_date, 'total_fare]
+- 'UnresolvedRelation [your_table], [], false

(rest omitted)

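Hmm, no good.
It retries automatically a few times, but every time it fails to generate a proper table name (it comes out as something like your_table), so it never arrived at error-free code.

Plot (after a workaround) 1. No prompt

So, I apply a small trick.
This model tends to reference a table named your_table in the code it generates, so I register the DataFrame in advance as a view named your_table.
Then I ran it again.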

# Workaround
answer_df.createOrReplaceTempView("your_table")
answer_df.ai.plot(cache=False)

It worked!
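The workaround leaves a temp view named your_table registered in the session. As a minimal sketch, assuming you would rather not keep that view around, the registration could be wrapped in a context manager that drops the view afterwards. The helper name `placeholder_view` is hypothetical. It only relies on `DataFrame.createOrReplaceTempView` and `spark.catalog.dropTempView`, and I have not run it against pyspark-ai itself.

```python
from contextlib import contextmanager


@contextmanager
def placeholder_view(spark, df, name="your_table"):
    """Register df under the placeholder name, then drop the view on exit.

    `spark` is the active SparkSession and `df` a Spark DataFrame.
    """
    df.createOrReplaceTempView(name)
    try:
        yield name
    finally:
        # Runs even if the plotting call raises.
        spark.catalog.dropTempView(name)


# Usage in a notebook where `spark` and `answer_df` already exist:
# with placeholder_view(spark, answer_df):
#     answer_df.ai.plot(cache=False)
```

If the generated code happens to use a different placeholder name, pass it via the `name` argument.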

Plot (after the workaround) 2. Display the data as a bar chart.

I instruct it to display a bar chart and have it plot.

answer_df.createOrReplaceTempView("your_table")
# Prompt (Japanese): "Display the data as a bar chart."
answer_df.ai.plot("データを棒グラフで表示してください。")

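It was properly displayed as a bar chart.

Summary

I tried various other things as well. Some cases worked and some did not, so it is not yet at a stage where it can simply be put to use. That said, being able to do this with a local LLM is a good thing in the sense that it widens the options.
From a practical standpoint it still has a long way to go, but the future looks promising.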

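For reference, the AWS instance type I used for the cluster is g5.4xlarge.
It is a single-GPU instance, but DeepSeek Coder 33B just barely fits if it is the quantized model.
(It does run into VRAM OOM when a long context is fed in, though...)
Smaller models such as the 6.7B one might work surprisingly well too, but I have not verified that.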
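To get a rough feel for why the quantized 33B model "just barely fits", here is a back-of-the-envelope estimate of the weight memory alone. The figures are assumptions rather than measurements: nominal parameter counts (33e9 and 6.7e9), 4 bits per weight for AWQ, 16 bits for an unquantized fp16 model, and a nominal 24 GB of GPU memory for the A10G in g5.4xlarge. Quantization scales, activations, and the KV cache are ignored, which is why long contexts can still run out of memory.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


GPU_GIB = 24  # assumed nominal GPU memory of the single A10G

for name, n_params in [("33B", 33e9), ("6.7B", 6.7e9)]:
    for label, bits in [("4-bit", 4), ("fp16", 16)]:
        gib = weight_gib(n_params, bits)
        verdict = "fits" if gib < GPU_GIB else "does not fit"
        print(f"{name} {label}: {gib:.1f} GiB of weights ({verdict} in {GPU_GIB} GiB)")
```

Under these assumptions the 4-bit 33B weights take roughly two thirds of the GPU memory, which leaves limited headroom for everything else.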

Code generation tools such as GitHub Copilot Workspace are advancing remarkably fast.
As an engineer, I'm very excited to see where this goes.
