Word-level diffing with RapidFuzz in Python


Goal

When diffing two English sentences, compare them word by word rather than character by character.

RapidFuzz

RapidFuzz can detect the differences between two strings at the character level, but it can also detect them at the word level.

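Diffing by word (string)

If you pass lists of strings to Levenshtein.opcodes(), each list element is compared as a single unit, so differences are detected per string rather than per character.

Example program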

test1.py
from rapidfuzz.distance import Levenshtein

words1 = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
words2 = ['aaa', 'BBB', 'CCC', 'ddd', 'eee', 'fff']

opcodes = Levenshtein.opcodes(words1, words2)

matches1 = []
matches2 = []
diffs1 = []
diffs2 = []

for oc in opcodes:
    print(oc)

    # equal: the two ranges match
    if oc.tag == 'equal':
        m1 = (oc.src_start, oc.src_end, words1[oc.src_start:oc.src_end])
        matches1.append(m1)
        m2 = (oc.dest_start, oc.dest_end, words2[oc.dest_start:oc.dest_end])
        matches2.append(m2)
        continue

    # For any tag other than equal, a non-empty range (start != end) is a difference
    if oc.src_start != oc.src_end:
        d1 = (oc.src_start, oc.src_end, words1[oc.src_start:oc.src_end])
        diffs1.append(d1)

    if oc.dest_start != oc.dest_end:
        d2 = (oc.dest_start, oc.dest_end, words2[oc.dest_start:oc.dest_end])
        diffs2.append(d2)

print(f"matches1: {matches1}")
print(f"matches2: {matches2}")
print(f"diffs1: {diffs1}")
print(f"diffs2: {diffs2}")

As the output below shows, both the matching parts and the differences are detected in units of whole strings.

Output
Opcode(tag='equal', src_start=0, src_end=1, dest_start=0, dest_end=1)
Opcode(tag='replace', src_start=1, src_end=3, dest_start=1, dest_end=3)
Opcode(tag='equal', src_start=3, src_end=5, dest_start=3, dest_end=5)
Opcode(tag='insert', src_start=5, src_end=5, dest_start=5, dest_end=6)
matches1: [(0, 1, ['aaa']), (3, 5, ['ddd', 'eee'])]
matches2: [(0, 1, ['aaa']), (3, 5, ['ddd', 'eee'])]
diffs1: [(1, 3, ['bbb', 'ccc'])]
diffs2: [(1, 3, ['BBB', 'CCC']), (5, 6, ['fff'])]