Running Microsoft's GraphRAG on Databricks

Posted 2026-01-11, last updated 2026-01-12

Update (2026/1/12)

This article originally stated that FMAPI is incompatible with GraphRAG, but I have since confirmed that the OpenAI models served through FMAPI (databricks-gpt-5-2, etc.) work correctly. The Llama models still fail because of the JSON mode constraint, but by choosing an OpenAI model you can run GraphRAG on FMAPI alone. Note that FMAPI also requires max_tokens to be specified explicitly.

What is GraphRAG?

GraphRAG is a knowledge-graph-based RAG (Retrieval-Augmented Generation) technique developed by Microsoft. Its distinguishing feature is that it can answer questions about a dataset as a whole, something conventional vector-search-based RAG struggles with.

For example, it can handle questions that require synthesizing multiple pieces of information, such as "What are the main themes of this document collection?" or "Explain the relationships between the characters."

Differences from the conventional vector DB approach

Let's compare conventional RAG with GraphRAG.

The conventional vector DB approach

Conventional RAG works roughly as follows (a minimal sketch follows the list):

  1. Split documents into chunks
  2. Embed each chunk and store it in a vector DB
  3. Embed the query and run a similarity search
  4. Pass the top-k chunks to the LLM as context
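To make the flow concrete, here is a minimal, self-contained sketch of steps 1-4. The embed function is a toy stand-in; in practice you would call an embedding endpoint such as databricks-gte-large-en.

```python
import numpy as np

# Toy stand-in for a real embedding model (e.g., an FMAPI embedding endpoint):
# hash characters into a fixed-size vector, purely for illustration.
def embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# 1. Split documents into chunks (here: one chunk per document)
chunks = ["Databricks is a unified platform.", "MLflow manages the ML lifecycle."]

# 2. Embed each chunk; the stacked matrix is our stand-in "vector DB"
index = np.stack([embed(c) for c in chunks])

# 3. Embed the query and score by cosine similarity (vectors are normalized)
query = "What does MLflow do?"
scores = index @ embed(query)

# 4. Take the top-k chunks as the context that would be passed to the LLM
top_k = [chunks[i] for i in np.argsort(scores)[::-1][:1]]
print(top_k)
```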

This approach works well for questions that look up a specific fact, but it has the following limitations:

  • Local retrieval only: it can only fetch chunks with high similarity to the query
  • Lost context: relationships between chunks are discarded
  • Poor grasp of the big picture: it is weak at questions like "summarize the whole corpus"

The GraphRAG approach

GraphRAG builds a knowledge graph through the following steps (a toy sketch follows the list):

  1. Entity extraction: extract people, organizations, technologies, concepts, etc. from the text
  2. Relationship extraction: extract the relationships between entities
  3. Graph construction: build a graph with entities as nodes and relationships as edges
  4. Community detection: cluster related entities with the Leiden algorithm
  5. Community report generation: have an LLM summarize each community
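A toy sketch of steps 3 and 4, using networkx with greedy modularity as a simple stand-in for the Leiden algorithm (GraphRAG itself uses a hierarchical Leiden implementation):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# 3. Build a graph: entities as nodes, relationships as edges
# (tiny hand-made sample standing in for LLM-extracted output)
edges = [
    ("DATABRICKS", "APACHE SPARK"),
    ("DATABRICKS", "UNITY CATALOG"),
    ("DATABRICKS", "DELTA LAKE"),
    ("DELTA LAKE", "ACID TRANSACTIONS"),
    ("DELTA LAKE", "TIME TRAVEL"),
]
graph = nx.Graph()
graph.add_edges_from(edges)

# 4. Community detection (stand-in for Leiden)
for i, community in enumerate(greedy_modularity_communities(graph)):
    print(f"community {i}: {sorted(community)}")

# 5. In GraphRAG, each detected community is then summarized by an LLM
#    into a community report used at query time.
```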

At query time there are two search methods:

| Search method | Use case | Mechanism |
|---|---|---|
| Global Search | Questions about the entire dataset | Aggregates community reports into an answer |
| Local Search | Questions about specific entities | Vector search + graph traversal |

Comparison

| Item | Conventional vector DB RAG | GraphRAG |
|---|---|---|
| Data structure | Flat chunks | Hierarchical graph structure |
| Search method | Vector similarity only | Vector + graph traversal |
| Whole-dataset summaries | Weak | Strong (Global Search) |
| Detailed lookups | Strong | Strong (Local Search) |
| Indexing cost | Low (embeddings only) | High (many LLM calls) |
| Storage | Vector DB only | Vector DB + graph store |

Caveats when running on Databricks

There are a few constraints to be aware of when running GraphRAG on Databricks.

1. Choosing an LLM

When using the Databricks Foundation Model API (FMAPI), the choice of model matters.

| Model | Indexing | Query (Global/Local) |
|---|---|---|
| FMAPI Llama models | ○ | × |
| FMAPI OpenAI models | ○ | ○ |
| OpenAI API (direct) | ○ | ○ |

Running Global Search with an FMAPI Llama model (databricks-meta-llama-3-3-70b-instruct, etc.) fails with the following error:

"messages" must contain the word "json" in some form, 
to use "response_format" of type "json_object"

The FMAPI OpenAI models (databricks-gpt-5-2, etc.), on the other hand, do not hit this error and work correctly.
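The error reflects the OpenAI JSON-mode contract: when response_format is {"type": "json_object"}, the word "json" must appear somewhere in the messages. The Llama endpoints enforce this strictly, and GraphRAG's internal prompts do not always satisfy it. A minimal illustration against FMAPI, assuming the openai package and the serving-endpoints base URL used in the config below:

```python
import os
from openai import OpenAI

# FMAPI exposes an OpenAI-compatible API under /serving-endpoints
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=os.environ["DATABRICKS_HOST"] + "/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    # JSON mode contract: with response_format json_object, the word "json"
    # must appear in the messages. Remove it from the prompt below and this
    # endpoint rejects the request with the error quoted above.
    messages=[{"role": "user", "content": "Answer in JSON: who founded Databricks?"}],
    response_format={"type": "json_object"},
    max_tokens=512,  # FMAPI rejects a null max_tokens
)
print(response.choices[0].message.content)
```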

Key points for using GraphRAG with FMAPI:

  • Use an OpenAI model (databricks-gpt-5-2, etc.)
  • Specify max_tokens explicitly (FMAPI does not accept null)

The corresponding settings.yaml excerpt:
models:
  default_chat_model:
    type: openai_chat
    api_key: ${DATABRICKS_TOKEN}
    api_base: ${DATABRICKS_HOST}/serving-endpoints
    model: databricks-gpt-5-2
    model_supports_json: true
    max_tokens: 4096  # required: FMAPI does not accept null
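The ${DATABRICKS_TOKEN} and ${DATABRICKS_HOST} references are expanded from environment variables (GraphRAG also reads a .env file in the project root). A minimal setup cell might look like this; the secret scope and key names are placeholders:

```python
import os

# Hypothetical setup: populate the variables referenced in settings.yaml.
os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
# A personal access token string also works; here we pull one from a secret scope.
os.environ["DATABRICKS_TOKEN"] = dbutils.secrets.get(scope="graphrag", key="pat")
```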

2. Storage: use local disk

GraphRAG stores its vector data in LanceDB, and LanceDB performs file rename operations internally.

Compatibility with each Databricks storage type is as follows:

| Storage | rename support | Persistence | Serverless support |
|---|---|---|---|
| Local disk (/tmp) | ○ | × | × |
| Workspace files | × | ○ | ○ |
| Unity Catalog volumes | × | ○ | ○ |

Unity Catalog volumes and the workspace filesystem are FUSE mounts, which do not support rename operations.

Recommended setup (a persistence sketch follows the list):

  • During processing: use local disk (/tmp)
  • Persistence: save to Delta Tables once processing completes
  • Compute: a classic cluster (not serverless)
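As a sketch of the persistence step, assuming a single-node cluster (so GraphRAG's output sits on the driver's /tmp, matching the path in the log below) and a hypothetical main.graphrag schema:

```python
# file:/ tells Spark to read the driver's local disk rather than DBFS.
output_dir = "file:/tmp/graphrag_work/output"

for name in ["entities", "relationships", "communities",
             "community_reports", "text_units"]:
    df = spark.read.parquet(f"{output_dir}/{name}.parquet")
    # Hypothetical catalog/schema; adjust to your environment.
    df.write.mode("overwrite").saveAsTable(f"main.graphrag.{name}")
```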

3. Indexing cost

GraphRAG's indexing calls the LLM for the following steps:

  • Entity and relationship extraction
  • Summarizing entity descriptions
  • Community report generation

Even a small document triggers dozens of API calls. Watch your costs when processing large document sets.

Architecture

The GraphRAG architecture on Databricks looks like this:

[Input text]
     ↓
[GraphRAG Index] ← FMAPI OpenAI model or OpenAI API (entity extraction, report generation)
     ↓
[Local disk (/tmp)]
  - entities.parquet
  - relationships.parquet
  - communities.parquet
  - community_reports.parquet
  - text_units.parquet
  - lancedb/ (vector data)
     ↓
[Delta Table] ← persistence
     ↓
[GraphRAG Query] ← FMAPI OpenAI model or OpenAI API (answer generation)
     ↓
[Answer]

Implementation

From here on, I show the results of actually running GraphRAG on Databricks. The notebook I used is linked here.

Environment

  • Databricks Runtime: 14.3 LTS ML
  • Cluster: single node
  • LLM: Databricks FMAPI (databricks-gpt-5-2)
  • Embedding: Databricks FMAPI (databricks-gte-large-en)

Sample data

I used the following text about the Databricks ecosystem:

Databricks is a unified platform for data engineering, data science, and machine learning.
It was founded by the creators of Apache Spark and advocates the lakehouse architecture.

Unity Catalog is Databricks' data governance solution.
It centrally manages data, ML models, and AI assets, and provides fine-grained access control.
...

Indexing

Create the index with the graphrag index command.
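In a notebook cell this is run as a shell escape:

```python
!graphrag index --root .
```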

Model config based on fnllm is deprecated and will be removed in GraphRAG v3, please use ModelType.Chat or ModelType.Embedding instead to switch to LiteLLM config.
Starting pipeline with workflows: load_input_documents, create_base_text_units, create_final_documents, extract_graph, finalize_graph, extract_covariates, create_communities, create_final_text_units, create_community_reports, generate_text_embeddings
Starting workflow: load_input_documents

Workflow complete: load_input_documents
Starting workflow: create_base_text_units
  1 / 1 ............................................................................................
Workflow complete: create_base_text_units
Starting workflow: create_final_documents

Workflow complete: create_final_documents
Starting workflow: extract_graph
  72 / 72 ..........................................................................................
Workflow complete: extract_graph
Starting workflow: finalize_graph

Workflow complete: finalize_graph
Starting workflow: extract_covariates

Workflow complete: extract_covariates
Starting workflow: create_communities

Workflow complete: create_communities
Starting workflow: create_final_text_units

Workflow complete: create_final_text_units
Starting workflow: create_community_reports
  5 / 5 ............................................................................................
Workflow complete: create_community_reports
Starting workflow: generate_text_embeddings
[2026-01-11T23:10:31Z WARN  lance::dataset::write::insert] No existing dataset at /tmp/graphrag_work/output/lancedb/default-entity-description.lance, it will be created
[2026-01-11T23:10:32Z WARN  lance::dataset::write::insert] No existing dataset at /tmp/graphrag_work/output/lancedb/default-community-full_content.lance, it will be created
[2026-01-11T23:10:33Z WARN  lance::dataset::write::insert] No existing dataset at /tmp/graphrag_work/output/lancedb/default-text_unit-text.lance, it will be created

Workflow complete: generate_text_embeddings
Pipeline complete

Extracted entities

These are the entities (people, organizations, technologies, concepts) extracted from the text.

Number of entities: 37

| title | type | description |
|---|---|---|
| DATABRICKS | ORGANIZATION | Databricks is an integrated platform for data engineering, data science, and machine learning. It was founded by the creators of Apache Spark and promotes the Lakehouse architecture. Within Databricks, products such as Unity Catalog, Delta Lake (as the default table format), and Mosaic AI are positioned as key capabilities/solutions. |
| APACHE SPARK | ORGANIZATION | Apache Spark is an open-source data processing engine whose creators founded Databricks. It is referenced as the originating technology/community behind Databricks’ founding. |
| LAKEHOUSE ARCHITECTURE | EVENT | Lakehouse architecture is an architectural approach advocated by Databricks, positioned as a unifying paradigm for data engineering/analytics and machine learning workloads on a shared data foundation. |
| UNITY CATALOG | ORGANIZATION | Unity Catalog is Databricks’ data governance solution. It centrally manages data, ML models, and AI assets, provides fine-grained access control, and enables metadata sharing across Databricks workspaces. |
| MLFLOW | ORGANIZATION | MLflow is an open-source platform for managing the machine learning lifecycle, providing capabilities such as experiment tracking, model registry, and deployment. It is described as a platform with versioned evolution including MLflow 3. |
| MLFLOW 3 | EVENT | MLflow 3 is a major version release of MLflow in which a new concept called LoggedModel was introduced, strengthening management of GenAI applications. |
| LOGGEDMODEL | EVENT | LoggedModel is a new concept introduced in MLflow 3 intended to enhance management of GenAI applications within the MLflow lifecycle management framework. |
| DELTA LAKE | ORGANIZATION | Delta Lake is an open-source storage layer for data lakes that adds reliability features such as ACID transactions, schema enforcement, and time travel. In Databricks, Delta Lake is used as the default table format. |
| MOSAIC AI | ORGANIZATION | Mosaic AI is Databricks’ generative AI solution. It supports foundation model fine-tuning, building RAG applications, and developing AI agents, including production-oriented agent development via an Agent Framework. |
| AGENT FRAMEWORK | ORGANIZATION | Agent Framework is a framework referenced under Mosaic AI that enables building production-ready AI agents. |
| DATA SCIENCE | EVENT | Data science is a workload/domain supported by Databricks, involving exploratory analysis, statistical modeling, and deriving insights from data using the platform’s unified environment. |
| MACHINE LEARNING | EVENT | Machine learning is a core workload supported by Databricks and MLflow, covering model development, training, evaluation, and operationalization within a unified data/compute environment. |
| DATA GOVERNANCE | EVENT | Data governance is the discipline of managing data assets, access, policies, and metadata; in the text it is specifically addressed by Unity Catalog as Databricks’ governance solution. |
| DATASBRICKS WORKSPACES | ORGANIZATION | Databricks workspaces are the workspace environments within the Databricks platform across which Unity Catalog can share metadata and apply governance/access controls consistently. |
| METADATA | EVENT | Metadata is descriptive information about data/assets (schemas, ownership, lineage, etc.). The text states Unity Catalog can share metadata across Databricks workspaces to enable centralized governance. |
| ACCESS CONTROL | EVENT | Access control refers to mechanisms for restricting and granting permissions. Unity Catalog is described as providing fine-grained access control over data, ML models, and AI assets. |
| ML MODELS | EVENT | ML models are machine learning model artifacts that Unity Catalog can centrally manage as governed assets, and that MLflow manages through lifecycle functions such as registration and deployment. |
| AI ASSETS | EVENT | AI assets are AI-related artifacts (e.g., models, prompts, agents, applications) that Unity Catalog can centrally manage and govern alongside data and ML models. |
| EXPERIMENT TRACKING | EVENT | Experiment tracking is an MLflow capability for recording runs, parameters, metrics, and artifacts to support reproducibility and comparison of machine learning experiments. |
| MODEL REGISTRY | EVENT | Model registry is an MLflow capability for registering, versioning, and managing models for promotion through stages and deployment workflows. |
| DEPLOYMENT | EVENT | Deployment is an MLflow capability and general ML lifecycle activity for packaging and serving models/applications into target environments. |
| MACHINE LEARNING LIFECYCLE MANAGEMENT | EVENT | Machine learning lifecycle management is the end-to-end practice of managing ML work from experimentation through registration and deployment; MLflow is described as an open-source platform for this purpose. |
| OPEN SOURCE | ORGANIZATION | Open source is referenced as the development/distribution model for MLflow and Delta Lake, indicating they are community-available projects rather than proprietary-only software. |
| DATA LAKE | GEO | A data lake is the storage paradigm referenced in the text; Delta Lake is described as a storage layer that brings reliability to a data lake. |
| STORAGE LAYER | EVENT | A storage layer is the architectural component that manages how data is stored and accessed; Delta Lake is described as an open-source storage layer for data lakes. |
| ACID TRANSACTIONS | EVENT | ACID transactions are reliability guarantees (atomicity, consistency, isolation, durability) provided by Delta Lake to make data lake operations more dependable. |
| SCHEMA ENFORCEMENT | EVENT | Schema enforcement is a Delta Lake feature that ensures data written to tables conforms to expected schemas, improving data quality and reliability. |
| TIME TRAVEL | EVENT | Time travel is a Delta Lake feature enabling querying or restoring previous versions of data, supporting auditing and reproducibility. |
| TABLE FORMAT | EVENT | Table format refers to how tabular data is represented/stored; the text states Delta Lake is used as the default table format in Databricks. |
| GENERATIVE AI | EVENT | Generative AI is the AI category addressed by Mosaic AI, involving models and applications that generate text/code/other outputs and can be operationalized via RAG and agent patterns. |
| FOUNDATION MODEL | EVENT | A foundation model is a large pre-trained model; Mosaic AI supports foundation model fine-tuning as part of building GenAI solutions. |
| FINE-TUNING | EVENT | Fine-tuning is the process of adapting a pre-trained foundation model to a specific task or domain; Mosaic AI supports this capability. |
| RAG APPLICATIONS | EVENT | RAG (Retrieval-Augmented Generation) applications are GenAI applications that combine retrieval of external knowledge with generation; Mosaic AI supports building RAG applications. |
| AI AGENTS | EVENT | AI agents are autonomous or semi-autonomous systems that can plan and act; Mosaic AI (via Agent Framework) supports development of AI agents for production environments. |
| PRODUCTION ENVIRONMENT | EVENT | Production environment refers to the operational setting where systems are deployed for real users; the text notes Agent Framework enables building AI agents intended for production use. |
| GENAI APPLICATIONS | EVENT | GenAI applications are applications built using generative AI; the text states MLflow 3 strengthens management of GenAI applications via the LoggedModel concept. |
| DATA ENGINEERING | | |

Extracted relationships

These are the relationships between the entities.

Number of relationships: 35

| source | target | description |
|---|---|---|
| DATABRICKS | APACHE SPARK | Databricks was founded by the creators of Apache Spark, linking the company/platform’s origin to the Spark project and its founders. |
| DATABRICKS | LAKEHOUSE ARCHITECTURE | Databricks advocates (promotes) the Lakehouse architecture as a core architectural approach associated with its platform. |
| DATABRICKS | UNITY CATALOG | Unity Catalog is described as Databricks’ data governance solution and operates across Databricks workspaces to share metadata and enforce access control. |
| DATABRICKS | DELTA LAKE | Delta Lake is used within Databricks as the default table format, indicating a strong product/platform integration relationship. |
| DATABRICKS | MOSAIC AI | Mosaic AI is described as Databricks’ generative AI solution, making it a first-party solution within the Databricks platform ecosystem. |
| DATABRICKS | DATA ENGINEERING | Databricks is described as an integrated platform for data engineering workloads. |
| DATABRICKS | DATA SCIENCE | Databricks is described as an integrated platform for data science workloads. |
| DATABRICKS | MACHINE LEARNING | Databricks is described as an integrated platform for machine learning workloads. |
| UNITY CATALOG | DATA GOVERNANCE | Unity Catalog is explicitly described as Databricks’ data governance solution. |
| UNITY CATALOG | ACCESS CONTROL | Unity Catalog provides fine-grained access control over governed assets. |
| UNITY CATALOG | METADATA | Unity Catalog can share metadata across Databricks workspaces. |
| UNITY CATALOG | DATASBRICKS WORKSPACES | Unity Catalog operates across Databricks workspaces to share metadata and apply governance consistently. |
| UNITY CATALOG | ML MODELS | Unity Catalog centrally manages ML models as governed assets. |
| UNITY CATALOG | AI ASSETS | Unity Catalog centrally manages AI assets as governed assets. |
| MLFLOW | MLFLOW 3 | MLflow 3 is a version/release of MLflow, representing an evolution of the MLflow platform. |
| MLFLOW | MACHINE LEARNING LIFECYCLE MANAGEMENT | MLflow is described as an open-source platform for machine learning lifecycle management. |
| MLFLOW | EXPERIMENT TRACKING | MLflow provides experiment tracking functionality. |
| MLFLOW | MODEL REGISTRY | MLflow provides a model registry capability. |
| MLFLOW | DEPLOYMENT | MLflow provides deployment-related functionality for models/applications. |
| MLFLOW 3 | LOGGEDMODEL | LoggedModel is explicitly introduced as a new concept in MLflow 3. |
| MLFLOW 3 | GENAI APPLICATIONS | The text states MLflow 3 strengthens management of GenAI applications. |
| LOGGEDMODEL | GENAI APPLICATIONS | LoggedModel is introduced to enhance management of GenAI applications. |
| DELTA LAKE | DATA LAKE | Delta Lake is described as a storage layer that brings reliability to a data lake. |
| DELTA LAKE | STORAGE LAYER | Delta Lake is explicitly described as an open-source storage layer. |
| DELTA LAKE | ACID TRANSACTIONS | Delta Lake provides ACID transactions as a core reliability feature. |
| DELTA LAKE | SCHEMA ENFORCEMENT | Delta Lake provides schema enforcement as a core reliability feature. |
| DELTA LAKE | TIME TRAVEL | Delta Lake provides time travel as a core reliability feature. |
| DELTA LAKE | TABLE FORMAT | Delta Lake is used as the default table format in Databricks. |
| MOSAIC AI | AGENT FRAMEWORK | Agent Framework is used as part of Mosaic AI to build production-ready AI agents, indicating it is a component/capability within the Mosaic AI solution. |
| MOSAIC AI | GENERATIVE AI | Mosaic AI is described as Databricks’ generative AI solution. |
| MOSAIC AI | FOUNDATION MODEL | Mosaic AI supports foundation model fine-tuning. |
| MOSAIC AI | FINE-TUNING | Mosaic AI supports fine-tuning as a capability for adapting foundation models. |
| MOSAIC AI | RAG APPLICATIONS | Mosaic AI supports building RAG applications. |
| AGENT FRAMEWORK | AI AGENTS | Agent Framework is used to build AI agents intended for production use. |
| AGENT FRAMEWORK | PRODUCTION ENVIRONMENT | The text states Agent Framework enables building AI agents for production environments. |

Community reports

These are the clusters (communities) of related entities and their summaries.

Number of community reports: 5

=== Databricks Unity Catalog Governance: Access Control, Metadata, and Governed AI/ML Assets ===
This community centers on Unity Catalog, described as Databricks’ data governance solution, and its role as a hub that connects governance functions to multiple governed asset types and environments. Unity Catalog is directly linked to data governance, fine-grained access control, metadata sharing, and operation across Databricks workspaces, indicating a centralized control plane for policy and visibility. It also centrally manages ML models and broader AI assets as governed assets, suggesting the community’s focus is on consistent governance and control across both data and AI/ML artifacts within the Databricks platform. [Data: Entities (3, 12, 15, 14, 13); Relationships (8, 9, 10, 11, 12, +more)]

=== Delta Lake Reliability Features for Data Lakes (ACID, Schema Enforcement, Time Travel) ===
This community centers on Delta Lake, an open-source storage layer positioned as a reliability-enhancing component for data lakes. The network describes Delta Lake’s functional relationships to core reliability and governance-like capabilities—ACID transactions, schema enforcement, and time travel—and also notes its role as the default table format in Databricks. Overall, the community is technical and product/architecture-focused, with Delta Lake acting as the hub entity connecting storage-layer concepts to specific reliability features and a table-format role in a platform context. [Data: Entities (7, 23, 24, 25, 26); Relationships (22, 23, 24, 25, 26, +more)]

=== Agent Framework for Production AI Agents ===
This community centers on the Agent Framework, described as a framework referenced under Mosaic AI that enables building production-ready AI agents. The framework is directly connected to two key concepts: AI agents (the systems it is used to build) and the production environment (the operational setting those agents are intended to run in). Overall, the network depicts a straightforward capability chain: Agent Framework → AI agents → deployment in production environments, emphasizing operationalization rather than research-only experimentation. [Data: Entities (9, 33, 34); Relationships (33, 34)]

=== Mosaic AI (Databricks) Generative AI Stack: Fine-Tuning, RAG, and Agent Framework ===
This community centers on Mosaic AI, described as Databricks’ generative AI solution, and the core capabilities it supports: foundation model fine-tuning, building RAG (Retrieval-Augmented Generation) applications, and developing AI agents via an Agent Framework. The relationships indicate a hub-and-spoke structure where Mosaic AI is the primary entity connected to the broader category of Generative AI and to specific implementation patterns and components (foundation models, fine-tuning, RAG applications, and an agent framework). Overall, the community represents an applied GenAI platform/tooling ecosystem rather than a set of individuals or a multi-organization network, with emphasis on production-oriented agent development and operational GenAI application building. [Data: Entities (8, 29, 30, 31, 32); Relationships (28, 29, 30, 31, 32)]

=== Databricks, Apache Spark, and Lakehouse Architecture ===
This community centers on Databricks as the primary entity, positioned as an integrated platform spanning data engineering, data science, and machine learning workloads. Databricks’ origin is explicitly linked to Apache Spark through its founding by Spark’s creators, and its architectural positioning is tied to its promotion of the Lakehouse architecture as a unifying paradigm for analytics and ML on a shared data foundation. Overall, the network is a hub-and-spoke structure with Databricks as the hub and Spark, Lakehouse architecture, and the three workload domains as the spokes, indicating a cohesive product/technology ecosystem rather than a fragmented set of actors. [Data: Entities (0, 1, 2, 36, 10); Relationships (0, 1, 5, 6, 7)]

Visualizing the knowledge graph

I used PyVis to visualize the extracted entities and relationships as a graph (a sketch of this step follows).

(Screenshot: PyVis rendering of the knowledge graph)
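A minimal sketch of the visualization step, assuming the parquet outputs on local disk and the column names shown in the tables above:

```python
import pandas as pd
from pyvis.network import Network

entities = pd.read_parquet("/tmp/graphrag_work/output/entities.parquet")
relationships = pd.read_parquet("/tmp/graphrag_work/output/relationships.parquet")

net = Network(height="600px", width="100%")
for _, e in entities.iterrows():
    # node id = entity title; hover text = description; color group = type
    net.add_node(e["title"], title=str(e["description"]), group=e["type"])
for _, r in relationships.iterrows():
    # skip edges whose endpoints were not added as nodes
    if r["source"] in net.get_nodes() and r["target"] in net.get_nodes():
        net.add_edge(r["source"], r["target"], title=str(r["description"]))

net.save_graph("/tmp/graph.html")
displayHTML(open("/tmp/graph.html").read())  # render inline in the Databricks notebook
```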

Global Search

I ran a Global Search with a question about the dataset as a whole (the invocation sketch is below).
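Assuming the standard GraphRAG CLI, the invocation looks like this (the query text is translated here):

```python
!graphrag query --root . --method global --query "Summarize the main technologies described in this dataset and how they relate to each other"
```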

Question: Summarize the main technologies described in this dataset and how they relate to each other

=== Question ===
Summarize the main technologies described in this dataset and how they relate to each other

=== Answer ===
## Overall picture (where the main technologies sit)
In this dataset, **Databricks** is positioned at the center as a unified platform, handling the **data engineering / data science / machine learning** workloads on a single foundation. Its origin is tied to Apache Spark, having been founded by Spark's creators, and by promoting the **Lakehouse architecture** it unifies analytics and ML on a shared data foundation [Data: Reports (4)].

## Data foundation: Delta Lake (storage layer) and reliability features
**Delta Lake** sits at the core as an open-source storage layer for data lakes, directly tied to the reliability features **ACID transactions / schema enforcement / time travel (querying and restoring past versions)**. It is also used as the **default table format** in Databricks, positioning it as the platform's standard data representation [Data: Reports (1)].

## Governance: Unity Catalog (the governance hub)
**Unity Catalog** acts as the hub for governance as Databricks' data governance solution, linked to **data governance / fine-grained access control / metadata sharing / operation across multiple Databricks workspaces**. It also centrally manages ML models and broader AI assets (e.g., **models, prompts, agents, applications**) as governed assets, showing that **data and AI/ML artifacts are handled on a single control plane** [Data: Reports (0)].

## Generative AI: Mosaic AI (the GenAI hub)
**Mosaic AI** is the hub for generative AI as Databricks' GenAI solution, with a capability map covering (1) **foundation model fine-tuning**, (2) **building RAG (Retrieval-Augmented Generation) applications**, and (3) **developing AI agents with the Agent Framework** [Data: Reports (3)].

## Agents: Agent Framework (the bridge to production)
**Agent Framework** is positioned as the framework for building "production-ready AI agents," with a direct chain from development to operations: **Agent Framework → AI agents → production environments** [Data: Reports (2)].

## Summary (key relationships)
- **Databricks** is the unified foundation, covering everything from data to ML to generative AI on one platform [Data: Reports (4)].  
- **Delta Lake** is the core of reliable data storage and table representation, underpinning analytics and ML [Data: Reports (1)].  
- **Unity Catalog** governs not only data but AI assets as well, spanning multiple workspaces [Data: Reports (0)].  
- **Mosaic AI** bundles the GenAI capabilities (fine-tuning, RAG, agents), and the **Agent Framework** connects them to building and operating production-ready agents [Data: Reports (3, 2)].

Local Search

I ran a Local Search with a question about a specific entity (invocation sketch below).
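The corresponding Local Search invocation, under the same CLI assumption:

```python
!graphrag query --root . --method local --query "What is MLflow? What features does it provide?"
```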

Question: What is MLflow? What features does it provide?

=== Question ===
What is MLflow? What features does it provide?

=== Answer ===
## What is MLflow

MLflow is an open-source platform for managing the machine learning (ML) lifecycle. Concretely, it provides mechanisms for managing the entire flow from model development to operations (deployment) [Data: Sources (0)].

## What features it provides

Within the scope of the provided information, MLflow's main features are:

- **Experiment tracking**: tracks and records training-run results, parameters, and the like [Data: Sources (0)].  
- **Model registry**: registers and manages trained models [Data: Sources (0)].  
- **Deployment**: rolls registered, managed models out to production environments [Data: Sources (0)].

## Additions in MLflow 3 (per the provided data)

MLflow 3 is said to introduce a new concept called **LoggedModel**, which **strengthens management of generative AI (GenAI) applications** [Data: Sources (0)].

## Caveats (what this data does not cover)

The data says nothing about how to operate each MLflow feature, which execution environments are supported, or detailed integration steps with other Databricks products (e.g., Unity Catalog), so no claims beyond the scope above can be made here.

Summary

GraphRAG makes it possible to grasp the big picture of a corpus, something conventional vector-search-based RAG struggles with.

Key points for running it on Databricks:

  1. LLM: FMAPI OpenAI models (databricks-gpt-5-2, etc.), the OpenAI API, or Azure OpenAI can be used
    • FMAPI Llama models are unsupported due to the JSON mode constraint
    • FMAPI requires max_tokens to be specified explicitly
  2. Storage: process on local disk, persist to a Delta Table
  3. Compute: use a classic cluster (serverless is not supported)
  4. Cost: watch the LLM call volume during indexing

GraphRAG shines in use cases such as:

  • Getting the big picture from a large set of documents
  • Analyzing relationships between entities
  • Answering questions that conventional RAG would answer with "I don't know"

Reference links

はじめてのDatabricks (Getting Started with Databricks)

Databricks無料トライアル (Databricks free trial)
