A Handy Script for Saving an Entire Repository as Split JSON Files to Feed into an LLM

Posted at 2025-01-03

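Introduction

When you use a large language model (LLM) to analyze code and want to store or share the contents of an unfamiliar repository efficiently, it helps to export every file in a single format. It also matters that the files and directory structure are written out as JSON that an LLM can follow easily.

This article introduces a Python script that saves a repository's contents as JSON files, split so that each one stays under a specified size. The output puts the whole repository in a form an LLM can analyze efficiently.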

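Script features

Key features

Recursive processing of the whole repository:
- Processes every file and subdirectory under the specified directory.
JSON output:
- Stores each entry's path and type (file or directory), plus the content of each file, in a simple JSON structure.
Size limit:
- Splits the output so that each JSON file stays within the specified limit (for example, 10 MB); a single source file larger than the limit is still written whole.
Error handling:
- If a file cannot be read (for example, a binary file that is not valid UTF-8), the error message is recorded in the JSON in place of the content, and processing continues.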
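The size-limit feature boils down to greedy packing: measure each entry as serialized JSON, and start a new part once the running total would pass the limit. Here is a minimal sketch of that idea on its own, separate from the file-reading logic. The name `split_by_size` and the `max_bytes` parameter are hypothetical and used only for illustration.

```python
import json


def split_by_size(entries, max_bytes):
    """Greedily pack entries into chunks whose serialized size stays within max_bytes.

    An entry that is larger than max_bytes on its own still gets a chunk
    to itself, so no data is dropped.
    """
    chunks, current, current_size = [], [], 0
    for entry in entries:
        # Measure the entry as UTF-8 JSON without ASCII escaping.
        size = len(json.dumps(entry, ensure_ascii=False).encode("utf-8"))
        # Close the current chunk only if it already holds something.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [{"path": f"file_{i}.txt", "type": "file", "content": "x" * 40} for i in range(5)]
    for n, chunk in enumerate(split_by_size(demo, max_bytes=200), start=1):
        print(n, [e["path"] for e in chunk])
```

The total ignores the surrounding JSON wrapper and indentation, so treat the limit as approximate rather than exact.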

The script

The full script is below.

import os
import json

def write_code_to_json_split(repo_path, output_dir, max_size_mb=10):
    """
    Reads all files in a repository and writes their contents to multiple JSON files,
    each under the specified maximum size (in MB).
    Formats the JSON to make it easier for an LLM to understand.
    """
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    repo_name = os.path.basename(os.path.normpath(repo_path))
    current_file_size = 0
    file_counter = 1
    current_json = {
        "repository": repo_name,
        "files": []
    }

    def save_current_json():
        """
        Saves the current JSON data to a file.
        """
        nonlocal file_counter, current_json, current_file_size
        file_path = os.path.join(output_dir, f"{repo_name}_code_part_{file_counter}.json")
        with open(file_path, "w", encoding="utf-8") as json_file:
            json.dump(current_json, json_file, ensure_ascii=False, indent=4)
        current_file_size = 0
        file_counter += 1
        current_json["files"] = []

    def add_file_to_json(file_path, relative_path):
        """
        Adds the content of a file to the current JSON object.
        """
        nonlocal current_file_size, current_json

        try:
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()
                file_data = {
                    "path": relative_path,
                    "type": "file",
                    "content": content
                }

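                # Estimate the entry's size using the same settings as the final dump
                entry_size = len(json.dumps(file_data, ensure_ascii=False, indent=4).encode("utf-8"))

                # Start a new part if this entry would exceed the maximum size
                # (skipped while the current part is empty, to avoid writing an empty file)
                if current_json["files"] and current_file_size + entry_size > max_size_mb * 1024 * 1024:
                    save_current_json()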

                # Add the file content to the current JSON object
                current_json["files"].append(file_data)
                current_file_size += entry_size
        except Exception as e:
            current_json["files"].append({
                "path": relative_path,
                "type": "file",
                "error": f"Error reading file: {e}"
            })

    def add_directory_to_json(relative_path):
        """
        Adds a directory entry to the current JSON object.
        """
        nonlocal current_json
        directory_data = {
            "path": relative_path,
            "type": "directory"
        }
        current_json["files"].append(directory_data)

    def process_directory(dir_path, relative_path):
        """
        Recursively processes a directory.
        """
        add_directory_to_json(relative_path)
        for item in sorted(os.listdir(dir_path)):
            item_path = os.path.join(dir_path, item)
            item_relative_path = os.path.join(relative_path, item)
            if os.path.isfile(item_path):
                add_file_to_json(item_path, item_relative_path)
            elif os.path.isdir(item_path):
                process_directory(item_path, item_relative_path)

    # Start processing the repository
    process_directory(repo_path, "")

    # Save the last JSON file if it has any data
    if current_json["files"]:
        save_current_json()

# Usage example
if __name__ == "__main__":
    repository_path = "./create-python-server"  # Replace with your repository path
    output_directory = "./json_output"  # Directory where JSON files will be saved
    max_file_size_mb = 5  # Maximum size per file in MB
    write_code_to_json_split(repository_path, output_directory, max_file_size_mb)
    print(f"JSON files created in: {output_directory}")

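Usage

Set the repository path and output directory
- Set repository_path to the path of the repository you want to analyze.
- Set output_directory to the directory where the JSON files should be saved.
Set the size limit
- Use max_file_size_mb to set the maximum size of each JSON file in MB.
Run the script
- After it runs, the split JSON files appear in output_directory, named <repository name>_code_part_<n>.json.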
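One thing to watch when pointing repository_path at a real Git checkout: the script as written walks every directory, including `.git`, so the output fills up with repository metadata and read errors for binary objects. Below is a rough sketch of a file iterator that skips such directories. `iter_repo_files` and `EXCLUDED_DIRS` are hypothetical names that do not appear in the script above, and the exclusion list is only an assumed starting point.

```python
import os

# Hypothetical defaults; adjust the set for your own project.
EXCLUDED_DIRS = {".git", "node_modules", "__pycache__", ".venv"}


def iter_repo_files(repo_path, excluded_dirs=EXCLUDED_DIRS):
    """Yield (absolute_path, relative_path) for each file, skipping excluded directories."""
    for root, dirs, files in os.walk(repo_path):
        # Editing dirs in place tells os.walk not to descend into the removed names.
        dirs[:] = sorted(d for d in dirs if d not in excluded_dirs)
        for name in sorted(files):
            full_path = os.path.join(root, name)
            yield full_path, os.path.relpath(full_path, repo_path)


if __name__ == "__main__":
    for _, relative_path in iter_repo_files("."):
        print(relative_path)
```

A loop over this iterator could stand in for the recursive `process_directory` call if directory entries are not needed in the output.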

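Example output

A generated JSON file looks like the following. The example is abridged: the actual output also begins with a directory entry for the repository root, whose path is an empty string.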

{
    "repository": "create-python-server",
    "files": [
        {
            "path": "README.md",
            "type": "file",
            "content": "# Create Python Server\nThis is a sample project."
        },
        {
            "path": "src",
            "type": "directory"
        }
    ]
}
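To check the result, or to work with the parts as a single dataset, the files can be read back and merged. This is a minimal sketch that assumes the `<repository name>_code_part_<n>.json` naming used by the script. `load_parts` is a hypothetical helper, not part of the original script.

```python
import glob
import json
import os
import re


def load_parts(output_dir):
    """Load every *_code_part_<n>.json file in output_dir, ordered by n, and merge the entries."""

    def part_number(path):
        match = re.search(r"_code_part_(\d+)\.json$", path)
        return int(match.group(1)) if match else 0

    # Sort numerically so that part 10 comes after part 2.
    paths = sorted(glob.glob(os.path.join(output_dir, "*_code_part_*.json")), key=part_number)
    merged = {"repository": None, "files": []}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            part = json.load(f)
        merged["repository"] = part.get("repository")
        merged["files"].extend(part.get("files", []))
    return merged


if __name__ == "__main__":
    data = load_parts("./json_output")
    print(data["repository"], len(data["files"]), "entries")
```

Sorting by the numeric suffix matters because a plain alphabetical sort would place part 10 before part 2.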

Applications

LLM analysis: load the saved JSON into an LLM for code analysis or review.
Data sharing: share a repository's contents within a team in a compact form.
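For the LLM analysis case, one option is to flatten a part into plain text before sending it to a model. The sketch below assumes the entry layout shown in the example output (`path`, `type`, and either `content` or `error`). `render_prompt` and its delimiter format are illustrative choices, and the call to an actual LLM API is deliberately left out.

```python
def render_prompt(part, question):
    """Turn one part (the dict loaded from a part file) into a plain-text prompt."""
    lines = [f"Repository: {part['repository']}", ""]
    for entry in part["files"]:
        if entry["type"] != "file":
            continue  # directory entries carry no content
        lines.append(f"--- {entry['path']} ---")
        # Unreadable files have an "error" key instead of "content".
        lines.append(entry.get("content", entry.get("error", "")))
        lines.append("")
    lines.append(question)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "repository": "create-python-server",
        "files": [
            {"path": "README.md", "type": "file", "content": "# Create Python Server"},
            {"path": "src", "type": "directory"},
        ],
    }
    print(render_prompt(sample, "Summarize what this repository does."))
```

Passing the JSON file to the model unchanged also works; the flattened form simply avoids spending tokens on JSON escaping and indentation.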

Conclusion

This script packages a large repository into size-limited JSON files, which makes it easier to analyze with an LLM and to share. Give it a try!
