Measuring the degree of partial string matching in Python

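Introduction

What is the surface-level similarity between 'PC' and 'ノートPC' ("notebook PC", i.e. a laptop), as a percentage?

40% (2 of the 5 characters match)? 100% (all of 'PC' appears inside the longer string)?

This article explains how to compute the partial similarity (the 100% answer).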

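What is difflib?

difflib is a handy standard-library module for computing various kinds of differences.
Here we use its SequenceMatcher class, which measures the similarity between two strings.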

from difflib import SequenceMatcher

src, trg = 'PC', 'ノートPC'
# ratio() returns a similarity score as a float between 0.0 and 1.0
r = SequenceMatcher(None, src, trg).ratio()

However, this approach measures the similarity between the two whole strings, so even a string that is fully contained in the other one gets a low score (r ≈ 0.57).
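To see where the 0.57 comes from, the sketch below recomputes the score by hand. It relies on the documented definition of ratio() as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. The variable names (matcher, matched, total, manual_ratio) are only illustrative.

```python
from difflib import SequenceMatcher

src, trg = 'PC', 'ノートPC'
matcher = SequenceMatcher(None, src, trg)

# M: sum the sizes of all matching blocks
# (the last block returned is a zero-length sentinel, so it adds nothing).
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of characters in both strings.
total = len(src) + len(trg)

manual_ratio = 2.0 * matched / total
print(matched, total, round(manual_ratio, 2))  # 2 7 0.57
```

Because T includes the three unmatched characters of 'ノート', the score drops well below 1.0 even though every character of 'PC' is matched.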

Computing the partial similarity

Here is the code.

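from difflib import SequenceMatcher

src, trg = 'PC', 'ノートPC'
# Make src the shorter string; otherwise the range below is empty and max() raises ValueError
if len(src) > len(trg):
    src, trg = trg, src
s_len, t_len = len(src), len(trg)

# Slide a window of length s_len over trg and keep the best score
r = max(SequenceMatcher(None, src, trg[i:i+s_len]).ratio() for i in range(t_len - s_len + 1))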

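Using the shorter string as the reference, the code takes every substring of the longer string that has the same length as the shorter one, compares each of them with the shorter string, and outputs the maximum score. In this example the result is unambiguously r = 1.0.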
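For reuse, the same sliding-window idea can be wrapped in a function. This is a minimal sketch rather than a definitive implementation: the name partial_ratio is hypothetical, and returning 0.0 when either string is empty is an assumption made here, not something the original code specifies.

```python
from difflib import SequenceMatcher


def partial_ratio(a: str, b: str) -> float:
    """Best ratio() of the shorter string against equal-length windows of the longer one."""
    # Accept the arguments in either order.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        # Assumption: an empty string is treated as matching nothing.
        return 0.0
    window = len(short)
    return max(
        SequenceMatcher(None, short, long_[i:i + window]).ratio()
        for i in range(len(long_) - window + 1)
    )


print(partial_ratio('PC', 'ノートPC'))  # 1.0
print(partial_ratio('ノートPC', 'PC'))  # 1.0
print(partial_ratio('PC', 'ノートPD'))  # 0.5
```

The number of comparisons grows with the length difference between the two strings, so this sketch is best suited to short strings such as product names.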

References

difflib: Helpers for computing deltas (Python 3.7.3 documentation)
