Word-level diff detection with RapidFuzz in Python

Posted at 2026-06-02

Goal

When diffing two English sentences, compute the differences word by word rather than character by character.

RapidFuzz

RapidFuzz can detect character-level differences between two strings, and it can also detect differences at the word level.

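Diffing by word (string)

If you pass lists of strings to Levenshtein.opcodes(), each list element is treated as one unit, so differences are detected per string rather than per character.

Example program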

test1.py

from rapidfuzz.distance import Levenshtein

words1 = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
words2 = ['aaa', 'BBB', 'CCC', 'ddd', 'eee', 'fff']

opcodes = Levenshtein.opcodes(words1, words2)

matches1 = []
matches2 = []
diffs1 = []
diffs2 = []

for oc in opcodes:
    print(oc)

    # equal: the two ranges match
    if oc.tag == 'equal':
        m1 = (oc.src_start, oc.src_end, words1[oc.src_start:oc.src_end])
        matches1.append(m1)
        m2 = (oc.dest_start, oc.dest_end, words2[oc.dest_start:oc.dest_end])
        matches2.append(m2)
        continue

    # For any tag other than equal, a non-empty range (start != end) is a difference
    if oc.src_start != oc.src_end:
        d1 = (oc.src_start, oc.src_end, words1[oc.src_start:oc.src_end])
        diffs1.append(d1)

    if oc.dest_start != oc.dest_end:
        d2 = (oc.dest_start, oc.dest_end, words2[oc.dest_start:oc.dest_end])
        diffs2.append(d2)

print(f"matches1: {matches1}")
print(f"matches2: {matches2}")
print(f"diffs1: {diffs1}")
print(f"diffs2: {diffs2}")

As shown below, both the matching parts and the differences are detected per string.

Output

Opcode(tag='equal', src_start=0, src_end=1, dest_start=0, dest_end=1)
Opcode(tag='replace', src_start=1, src_end=3, dest_start=1, dest_end=3)
Opcode(tag='equal', src_start=3, src_end=5, dest_start=3, dest_end=5)
Opcode(tag='insert', src_start=5, src_end=5, dest_start=5, dest_end=6)
matches1: [(0, 1, ['aaa']), (3, 5, ['ddd', 'eee'])]
matches2: [(0, 1, ['aaa']), (3, 5, ['ddd', 'eee'])]
diffs1: [(1, 3, ['bbb', 'ccc'])]
diffs2: [(1, 3, ['BBB', 'CCC']), (5, 6, ['fff'])]
