
I built my own Git in Python

Last updated at 2024-02-18, posted at 2024-02-18

Introduction

While studying Git's internals I wanted to implement it myself, so I reproduced it in Python.

Features to implement

The following features are implemented:

git add
git commit
git branch
git checkout
git log

Directory structure

The directory structure is as follows:

.
├─ git
│   ├─ HEAD
│   ├─ index.pkl
│   ├─ object
│   └─ refs
│       └─ heads
│            └─ master
└─ src
    └─ *.py

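git: the directory that stores the files used for version management (the counterpart of .git)
- HEAD: a file that indicates the commit object currently being worked on
- index.pkl: a file that holds information about the staged files
- object: the directory that holds Git objects
- refs: the directory that manages branches and tags
  - heads: the directory that manages branches
    - master: a file that records what the master branch points to
src: the directory that holds the Python files implementing git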

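Implementation

git add

Running git add should carry out the following steps:

Create a blob object for the specified file
Update the index file

Creating the blob object

First, read the contents of the specified file and prepend a header to form the contents of the blob object.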

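with open(filepath, "r") as f:  # open for reading; "w" would truncate the file
    file_content = f.read()

# The header is "blob <length>" followed by a NUL character, then the file contents.
# Note: len() counts characters here, whereas real Git records the size in bytes.
blob_content = f"blob {len(file_content)}\0{file_content}"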

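Next, compute the SHA-1 checksum of that content and compress the content with zlib. The checksum becomes the name of the object file (its first two characters are used as a subdirectory name and the rest as the file name), the compressed data becomes the file's contents, and the file is saved under object/heads.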

import hashlib
import os
import zlib

def make_object(object_content):
    encode_content = object_content.encode()
    comp_object = zlib.compress(encode_content, level=1) # compress with zlib

    h = hashlib.new("sha1")
    h.update(encode_content)
    object_hash = h.hexdigest() # SHA-1 checksum of the uncompressed content

    return object_hash, comp_object

def save_object(object_hash, any_object):
    path = os.path.join("git", "object", "heads")
    object_dir = os.path.join(path, object_hash[:2])
    os.makedirs(object_dir, exist_ok=True) # the two-character subdirectory may not exist yet
    object_path = os.path.join(object_dir, object_hash[2:])

    with open(object_path, "wb") as f:
        f.write(any_object)

blob_hash, blob_object = make_object(blob_content)
save_object(blob_hash, blob_object)
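
Later sections call fetch_object() to read an object back, but the article never shows it. The following is a minimal sketch of what it might look like, assuming the same layout as save_object() above. The root parameter and the roundtrip() helper are not part of the original code; they are added here only so the example can run inside a temporary directory.

```python
import hashlib
import os
import tempfile
import zlib


def make_object(object_content):
    """Return (SHA-1 hex digest, zlib-compressed bytes) for a text object."""
    encoded = object_content.encode()
    return hashlib.sha1(encoded).hexdigest(), zlib.compress(encoded, level=1)


def save_object(object_hash, any_object, root):
    """Store the compressed object under root/<first 2 hex chars>/<remaining chars>."""
    object_dir = os.path.join(root, object_hash[:2])
    os.makedirs(object_dir, exist_ok=True)
    with open(os.path.join(object_dir, object_hash[2:]), "wb") as f:
        f.write(any_object)


def fetch_object(object_hash, root):
    """Hypothetical inverse of save_object: read, decompress, decode to str."""
    with open(os.path.join(root, object_hash[:2], object_hash[2:]), "rb") as f:
        return zlib.decompress(f.read()).decode()


def roundtrip(text):
    """Save a blob for `text` in a temporary directory and read it back."""
    blob_content = f"blob {len(text)}\0{text}"
    with tempfile.TemporaryDirectory() as root:
        blob_hash, blob_object = make_object(blob_content)
        save_object(blob_hash, blob_object, root)
        return blob_hash, fetch_object(blob_hash, root)


if __name__ == "__main__":
    blob_hash, restored = roundtrip("hello\n")
    print(blob_hash)
    print(repr(restored))
```

Because the hash is computed from the uncompressed content, reading an object back and hashing it again should give the same file name, which makes for a simple integrity check.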

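Updating the index file

Update the entry for the target file. The real index file is a binary file that holds information such as file permissions, file types, and flags, but for simplicity this implementation uses a pickle file that holds only file names and blob hashes.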

import pickle

with open(index_path, "rb") as f:  # read the current index
    index_dict = pickle.load(f)

index_dict[filename] = blob_hash

with open(index_path, "wb") as f:  # write the updated index back
    pickle.dump(index_dict, f)
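
The snippet above assumes that index.pkl already exists and contains a pickled dict. As a sketch of one way to handle the very first git add, the helpers below treat a missing or empty index file as an empty dict. load_index(), stage(), and demo() are hypothetical names that do not appear in the original code.

```python
import os
import pickle
import tempfile


def load_index(index_path):
    """Return the staged {filename: blob_hash} dict, or {} if nothing is staged yet."""
    if not os.path.exists(index_path) or os.path.getsize(index_path) == 0:
        return {}
    with open(index_path, "rb") as f:
        return pickle.load(f)


def stage(index_path, filename, blob_hash):
    """Record (or overwrite) one entry and write the index back to disk."""
    index_dict = load_index(index_path)
    index_dict[filename] = blob_hash
    with open(index_path, "wb") as f:
        pickle.dump(index_dict, f)
    return index_dict


def demo():
    """Stage three times in a temporary directory and return the final index."""
    with tempfile.TemporaryDirectory() as tmp:
        index_path = os.path.join(tmp, "index.pkl")
        stage(index_path, "hoge.txt", "a" * 40)
        stage(index_path, "foo/bar.txt", "b" * 40)
        stage(index_path, "hoge.txt", "c" * 40)  # re-adding replaces the old hash
        return load_index(index_path)


if __name__ == "__main__":
    print(demo())
```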

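git commit

Running git commit should carry out the following steps:

Create tree objects from the index file
Create a commit object
Update HEAD

Creating the tree objects

Based on the index file, create tree objects for the staged files.
The work differs depending on whether a file sits directly in the working directory or inside a subdirectory.

Consider the following directory structure as an example.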

.
├─ hoge.txt
└─ foo
    └─ bar.txt

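The command to add these files is git add hoge.txt foo/bar.txt.

In this case, the top-level tree object contains the file name of hoge.txt (hoge.txt) with the hash of its blob object, and the directory name of foo (foo) with the hash of foo's tree object.

Therefore, the tree object for foo must be created before the top-level tree object.
Note that foo's tree object contains the file name of bar.txt and the hash of its blob object. (The blob object for bar.txt was created when git add foo/bar.txt ran.)

The following function implements this.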

def make_tree_object(index):
    already_check_dir = []
    tree_content = ""
    for filename in index:
        if "/" in filename: # the file lives in a subdirectory
            dirname = filename.split("/")[0]
            if dirname in already_check_dir:
                continue # this directory's tree is already built; move on to the next entry
            child_path = filename[len(dirname)+1:]
            child_index = {child_path: index[filename]}

            for name in index: # collect the other files under the same directory
                if name == filename:
                    continue
                if name.startswith(dirname + "/"):
                    child_path = name[len(dirname)+1:] # path below "dirname/"
                    hash_tmp = index[name]
                    child_index[child_path] = hash_tmp

            tree_hash = make_tree_object(child_index)
            tree_content += f"\0{tree_hash} {dirname}"

            already_check_dir.append(dirname)

        else:
            blob_hash = index[filename]
            tree_content += f"\0{blob_hash} {filename}"

    header = f"tree {len(tree_content)}"
    tree = header + tree_content
    tree_hash, tree_object = make_object(tree)

    save_object(tree_hash, tree_object)
    return tree_hash

Passing the information stored in index.pkl to this function returns the hash of the top-level tree object.
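
To make the resulting layout concrete, here is a small in-memory sketch for the hoge.txt and foo/bar.txt example. It is not the function above: it groups paths by their first component in one pass, and it writes into a plain dict (store) instead of the object directory so that it needs no disk access. build_tree, store, and the placeholder hashes are all assumptions made for illustration; the entry format ("\0<hash> <name>") follows the article.

```python
import hashlib


def build_tree(index, store):
    """Build tree objects for `index` ({path: blob_hash}) and return the top-level hash.

    `store` is a plain dict standing in for the object directory:
    hash -> uncompressed object text.
    """
    entries = {}  # name -> blob hash (file) or child index dict (directory)
    for path, blob_hash in index.items():
        if "/" in path:
            dirname, child_path = path.split("/", 1)
            entries.setdefault(dirname, {})[child_path] = blob_hash
        else:
            entries[path] = blob_hash

    tree_content = ""
    for name, value in entries.items():
        # A directory entry points at the hash of its own, recursively built tree.
        entry_hash = build_tree(value, store) if isinstance(value, dict) else value
        tree_content += f"\0{entry_hash} {name}"

    tree = f"tree {len(tree_content)}" + tree_content
    tree_hash = hashlib.sha1(tree.encode()).hexdigest()
    store[tree_hash] = tree
    return tree_hash


if __name__ == "__main__":
    store = {}
    index = {"hoge.txt": "1" * 40, "foo/bar.txt": "2" * 40}  # placeholder blob hashes
    top_hash = build_tree(index, store)
    for entry in store[top_hash].split("\0")[1:]:
        print(entry)
```

Two tree objects end up in store: one for foo, holding the entry for bar.txt, and the top-level one, holding the entry for hoge.txt and the entry that points at foo's tree.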

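Creating the commit object

Create a commit object from the top-level tree object that was just generated.
If the new commit object has a parent commit object, the parent's hash is also added to the object. (Every commit except the first has a parent.)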

toptree_hash = make_tree_object(index_dict)
commit_content = f"tree {toptree_hash}\n" # hash of the top-level tree object

parent_hash = fetch_head_commit_hash() # hash of the commit object HEAD refers to, if there is one
if len(parent_hash) > 0:
    commit_content += f"parent {parent_hash}\n"

commit_content += f"\n{commit_message}\n" # commit message

header = f"commit {len(commit_content)}\0" # header

commit_hash, commit_object = make_object(header + commit_content)
save_object(commit_hash, commit_object)

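Updating HEAD

Finally, update the commit object that HEAD refers to.
In the detached HEAD state, the HEAD file contains the hash of a commit object directly; otherwise it contains a reference to the branch file of the current branch.

If the HEAD file contains a hash, overwrite the HEAD file with the new hash.
If the HEAD file contains a reference, overwrite the referenced branch file with the new hash.

When a reference is stored, the HEAD file always begins with the string refs:, so this prefix is used to tell the two cases apart. (Real Git uses the prefix ref: instead.)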

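with open(os.path.join("git", "HEAD"), "r") as f:
    head = f.read()

if head.startswith("refs:"):
    # HEAD holds "refs: <path to branch file>", so update that branch file
    dirpath = os.path.join("git", head.split()[1])
else:
    # detached HEAD: the HEAD file itself holds the hash
    dirpath = os.path.join("git", "HEAD")

with open(dirpath, "w") as f:
    f.write(commit_hash)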
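
The commit code calls fetch_head_commit_hash(), which the article does not show. Below is a minimal sketch that follows the two HEAD formats just described. It assumes that the path after refs: is relative to the git directory (for example refs/heads/master), and that a missing branch file means no commit exists yet. The git_dir parameter and demo() are additions so the example can run in a temporary directory.

```python
import os
import tempfile


def fetch_head_commit_hash(git_dir):
    """Return the commit hash HEAD points at, or "" before the first commit."""
    with open(os.path.join(git_dir, "HEAD"), "r") as f:
        head = f.read().strip()

    if not head.startswith("refs:"):
        return head  # detached HEAD: the file holds the hash itself

    branch_file = os.path.join(git_dir, head.split()[1])
    if not os.path.exists(branch_file):
        return ""  # the branch has no commits yet
    with open(branch_file, "r") as f:
        return f.read().strip()


def demo():
    """Exercise the three cases: no commit yet, on a branch, detached HEAD."""
    with tempfile.TemporaryDirectory() as git_dir:
        os.makedirs(os.path.join(git_dir, "refs", "heads"))
        with open(os.path.join(git_dir, "HEAD"), "w") as f:
            f.write("refs: refs/heads/master")
        before = fetch_head_commit_hash(git_dir)

        with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
            f.write("a" * 40)
        on_branch = fetch_head_commit_hash(git_dir)

        with open(os.path.join(git_dir, "HEAD"), "w") as f:
            f.write("b" * 40)
        detached = fetch_head_commit_hash(git_dir)
        return before, on_branch, detached


if __name__ == "__main__":
    print(demo())
```

Returning an empty string for the no-commit case matches the len(parent_hash) > 0 check used when the commit object is built.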

git branch

Creating a branch only requires creating a branch file with the specified branch name.
The branch file records the hash of the commit object that HEAD refers to at that moment.

head_hash = fetch_head_commit_hash() # the hash written in HEAD, or in the branch file HEAD refers to

path = os.path.join("git", "refs", "heads", branch_name)
with open(path, "w") as f:
    f.write(head_hash) # the new branch starts at the commit HEAD currently points to
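
Since every branch is just a file under refs/heads, listing branches amounts to listing that directory. The command dispatcher later in the article calls print_branch_name() without showing it; the sketch below is one possible shape for the underlying lookup. list_branches(), its git_dir parameter, and demo() are hypothetical, and the code assumes HEAD stores refs: refs/heads/<name> when a branch is checked out.

```python
import os
import tempfile


def list_branches(git_dir):
    """Return [(branch_name, is_current)] for every file under refs/heads."""
    with open(os.path.join(git_dir, "HEAD"), "r") as f:
        head = f.read().strip()
    # In the detached HEAD state no branch is current.
    current = head.split()[1] if head.startswith("refs:") else None

    names = sorted(os.listdir(os.path.join(git_dir, "refs", "heads")))
    return [(name, current == f"refs/heads/{name}") for name in names]


def demo():
    """Create two branches in a temporary directory with master checked out."""
    with tempfile.TemporaryDirectory() as git_dir:
        heads = os.path.join(git_dir, "refs", "heads")
        os.makedirs(heads)
        for name in ("master", "feature"):
            with open(os.path.join(heads, name), "w") as f:
                f.write("a" * 40)  # both branches point at the same commit
        with open(os.path.join(git_dir, "HEAD"), "w") as f:
            f.write("refs: refs/heads/master")
        return list_branches(git_dir)


if __name__ == "__main__":
    for name, is_current in demo():
        print(("* " if is_current else "  ") + name)
```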

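git checkout

Running git checkout should carry out the following steps:

Get the hash of the target commit object
Change what HEAD refers to
Update the managed files based on the commit object
Update the index

Getting the commit object hash

git checkout accepts either a branch name or the hash of a commit object directly.
When a branch name is given, the hash of the target commit object is obtained by reading the hash written in the branch file.

Changing what HEAD refers to

When a branch name is given as the checkout target, change HEAD to refer to the specified branch file.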

with open(head_path, "w") as f:
    header = "refs: "
    content = header + branch_path
    f.write(content)

When the hash of a commit object is given, write that hash to HEAD instead.

with open(head_path, "w") as f:
    f.write(commit_hash)

Updating the managed files

To update the managed files, the tree object held by the commit object must be expanded to obtain the blob object of each file.

def deployment_tree(object_hash, filename, blob_objects_array):
    # fetch_object() decompresses with zlib and decodes, returning the object as a string
    separated_content = fetch_object(object_hash).split("\0")

    if "blob" in separated_content[0]: # separated_content[0] is the object's header
        content = separated_content[-1]
        blob_objects_array.append([filename, object_hash, content])
        return blob_objects_array
    else:
        for content in separated_content[1:]: # separated_content[1:] holds the tree entries
            try:
                tmp_hash = content.split()[0]
                tmp_name = os.path.join(filename, content.split()[1])
            except IndexError: # malformed entry without "<hash> <name>"
                return blob_objects_array
            blob_objects_array = deployment_tree(tmp_hash, tmp_name, blob_objects_array)
        return blob_objects_array

blob_objects_array = deployment_tree(toptree_hash, "", [])

After running the code above, blob_objects_array holds the file name, blob object hash, and file contents of every committed file.
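
The call to deployment_tree() needs toptree_hash, the tree hash stored in the commit being checked out, and the article does not show how it is read. The following is a sketch of one way to pull it out of a decoded commit object, based on the layout produced in the git commit section (a header, a NUL character, tree and optional parent lines, a blank line, then the message). parse_commit() and example_commit() are hypothetical helpers.

```python
def parse_commit(commit_text):
    """Split a decoded commit object into (tree_hash, parent_hash, message)."""
    _, body = commit_text.split("\0", 1)         # drop the "commit <size>" header
    fields, _, message = body.partition("\n\n")  # a blank line separates fields from message
    tree_hash, parent_hash = "", ""
    for line in fields.split("\n"):
        key, _, value = line.partition(" ")
        if key == "tree":
            tree_hash = value
        elif key == "parent":
            parent_hash = value
    return tree_hash, parent_hash, message.rstrip("\n")


def example_commit(tree_hash, parent_hash, message):
    """Build commit text the same way the git commit section does."""
    content = f"tree {tree_hash}\n"
    if parent_hash:
        content += f"parent {parent_hash}\n"
    content += f"\n{message}\n"
    return f"commit {len(content)}\0" + content


if __name__ == "__main__":
    text = example_commit("1" * 40, "2" * 40, "second commit")
    print(parse_commit(text))
```

Only the lines before the blank line are inspected, so a commit message that happens to contain the word tree or parent is not mistaken for a field.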

Then overwrite each file with the contents that were retrieved.

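for blob_object in blob_objects_array:
    file_path = blob_object[0]
    blob_content = blob_object[2]
    # create the parent directory first for files such as foo/bar.txt
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "w") as f:
        f.write(blob_content)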

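Updating the index

If the index is not updated, the staging records from the branch before the checkout remain. Then, if you commit on the new branch without staging some file, that file's entry still refers to the blob object from before the checkout, which is not the intended behavior.

The information for the new index is taken from the checked-out commit object.
Just as when updating the files, expand the top-level tree object and register each file's blob object in the index.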

blob_objects_array = deployment_tree(toptree_hash, "", [])

index_dict = {} # start from an empty index so entries staged before the checkout do not survive
for blob_object in blob_objects_array:
    file_name = blob_object[0]
    blob_hash = blob_object[1]
    index_dict[file_name] = blob_hash

with open(index_path, "wb") as f:
    pickle.dump(index_dict, f)

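git log

git log is the command that shows every commit object reachable from the currently referenced commit object.

All it has to do is follow the commit object named in each commit object's parent line and keep printing until it reaches a commit with no parent, that is, the first commit object.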

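def print_commit_log(commit_hash):
    commit_object = fetch_object(commit_hash)
    _, content = commit_object.split("\0", 1) # split only at the header's NUL
    print_message = f"commit {commit_hash}"
    print_message += f"\n{content}"
    print(print_message)

    # The parent line, if present, is the second line. Checking that line only
    # avoids a false match when the commit message contains the word "parent".
    lines = content.split("\n")
    if len(lines) > 1 and lines[1].startswith("parent "):
        parent_hash = lines[1].split()[-1]
        print_commit_log(parent_hash)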
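
The same walk can also be written as a loop, which avoids Python's recursion limit on long histories. This is an alternative sketch rather than the article's implementation: walk_history() takes any callable that maps a hash to decoded object text (the article's fetch_object would fit), and make_commit() with its dict-based store exists only to build test data in memory.

```python
import hashlib


def walk_history(commit_hash, fetch):
    """Yield (hash, message) from `commit_hash` back to the first commit."""
    while commit_hash:
        _, content = fetch(commit_hash).split("\0", 1)
        fields, _, message = content.partition("\n\n")
        yield commit_hash, message.strip()
        parent_lines = [line for line in fields.split("\n") if line.startswith("parent ")]
        # No parent line means this was the first commit, so the loop ends.
        commit_hash = parent_lines[0].split()[1] if parent_lines else ""


def make_commit(store, tree_hash, parent_hash, message):
    """Create commit text in the article's layout and keep it in the `store` dict."""
    content = f"tree {tree_hash}\n"
    if parent_hash:
        content += f"parent {parent_hash}\n"
    content += f"\n{message}\n"
    text = f"commit {len(content)}\0" + content
    commit_hash = hashlib.sha1(text.encode()).hexdigest()
    store[commit_hash] = text
    return commit_hash


def demo():
    """Build a chain of three commits and return their messages, newest first."""
    store = {}
    first = make_commit(store, "1" * 40, "", "first commit")
    second = make_commit(store, "2" * 40, first, "second commit")
    third = make_commit(store, "3" * 40, second, "third commit")
    return [message for _, message in walk_history(third, store.__getitem__)]


if __name__ == "__main__":
    for message in demo():
        print(message)
```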

Putting it together

Finally, make the commands that invoke the implemented features resemble those of real git.

import sys

if __name__ == "__main__":
    args = sys.argv
    command = args[1] if len(args) > 1 else "" # subcommand such as "add"

    if command == "add":
        try:
            argument = args[2]
        except IndexError:
            print("the add command requires a file name")
            sys.exit()
        add_file(argument) # run git add
        print(f"add {argument}")

    elif command == "commit":
        try:
            argument = args[2]
        except IndexError:
            print("you need to set the commit message")
            sys.exit()
        commit(argument) # run git commit

    elif command == "branch":
        if len(args) > 2:
            argument = args[2]
            make_branch(argument) # run git branch
            print(f"create {argument}")
        else:
            print_branch_name() # with no branch name given, list all branches

    elif command == "checkout":
        try:
            argument = args[2]
        except IndexError:
            print("you need to set the branch name")
            sys.exit()
        checkout_branch(argument) # run git checkout (by branch name)
        print(f"check out to {argument}")

    elif command == "checkout-hash":
        try:
            argument = args[2]
        except IndexError:
            print("you need to set the commit hash")
            sys.exit()
        checkout_hash(argument) # run git checkout (by hash)
        print(f"check out to {argument}")

    elif command == "log":
        check_log()

    else:
        print("you typed the wrong command")

If this file is saved as git.py, each feature can be run with commands like the following.

python src/git.py add hoge.txt
python src/git.py commit "commit message"
python src/git.py branch feature
python src/git.py checkout master
python src/git.py log

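Closing thoughts

Implementing it myself made the Git operations I had been performing without really understanding them feel much clearer.
My one regret is that I could not implement merge.
Running a merge apparently requires something called diff (a feature? a computation?), and since my understanding of it has not caught up yet, I set merge aside for now.
I hope to implement merge as well once I understand diff better.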
