[Error] Strings that look identical compare as different

Posted at 2024-12-13

Introduction

I ran into the following problem while training a model on strings extracted from PDFs.

Environment

Python 3.9.13

The problem

As shown below, two strings that look the same on screen return False when compared.

'ランプ' == 'ランプ'
-->  False

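Cause

The culprit was the 「プ」.
The 「プ」 on the left-hand side turned out to be split into two code points: 「フ」 followed by 「\u309a」 (the combining handakuten, i.e. the semi-voiced sound mark). The same split also showed up for characters with a dakuten (voiced sound mark), such as 「ブ」.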

l = 'プ'
print(f'char:{l}, len:{len(l)}, code:{ascii(l)}')
-->  char:プ, len:2, code:'\u30d5\u309a'

r = 'プ'
print(f'char:{r}, len:{len(r)}, code:{ascii(r)}')
-->  char:プ, len:1, code:'\u30d7'
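
One way to make such strings compare equal is to apply Unicode normalization before comparing. The following is a minimal sketch, not something from the original investigation: it uses the standard library's `unicodedata.normalize` with the NFC form, which composes a base character and a combining mark into a single precomposed code point where one exists. The helper name `nfc_equal` is hypothetical and introduced only for illustration.

```python
import unicodedata

# Decomposed form: フ (U+30D5) + combining handakuten (U+309A)
decomposed = "\u30d5\u309a"
# Precomposed form: プ (U+30D7)
composed = "\u30d7"


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == composed)           # False: the code points differ
print(nfc_equal(decomposed, composed))  # True: both normalize to U+30D7
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
```

Normalizing the extracted text once, right after PDF extraction, is probably simpler than normalizing at every comparison. Whether NFC or NFKC is the better choice depends on the data; NFKC also folds compatibility characters such as half-width katakana, which may or may not be desirable.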

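Conclusion

This was a mistake I could not have noticed just by looking at the data.
It drove home how important it is to verify data properly, for example with assert statements.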
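
As an illustration of that kind of check, the sketch below flags strings that are not already in NFC form, using `unicodedata.is_normalized` (available in Python 3.8 and later). The function name `find_non_nfc` and the sample data are assumptions made up for this example, not part of the original pipeline.

```python
import unicodedata


def find_non_nfc(strings):
    """Return the strings that are not in NFC form (hypothetical helper)."""
    return [s for s in strings if not unicodedata.is_normalized("NFC", s)]


# Sample data (assumed): the second entry uses フ + combining handakuten.
samples = ["\u30e9\u30f3\u30d7", "\u30e9\u30f3\u30d5\u309a"]

# Show the offending entries as escaped code points so the difference is visible.
for s in find_non_nfc(samples):
    print(ascii(s))

# After normalization, the check passes and can guard later processing.
cleaned = [unicodedata.normalize("NFC", s) for s in samples]
assert not find_non_nfc(cleaned), "non-NFC strings remain in the dataset"
```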
