Fixing half-width/full-width notation inconsistencies in Python

Posted at 2020-07-16

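Purpose of this article

This article builds a function that fixes inconsistencies between half-width and full-width characters in a single call.

Preparation

First, prepare the characters before and after conversion.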

# Each *_half / *_full pair lists the same characters in the same order.
abc_half = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
abc_full = "ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ"

digit_half = "0123456789"
digit_full = "０１２３４５６７８９"

katakana_half = "ｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜｦﾝ"
katakana_full = "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン"

punc_half = "!#$%&¥()*+,-./:;<=>?@[]^_`{|}~"
punc_full = "！＃＄％＆￥（）＊＋，－．／：；＜＝＞？＠［］＾＿｀｛｜｝～"
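The full-width letters, digits, and most of the punctuation above do not have to be typed by hand. A minimal sketch, assuming only the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above ASCII U+0021 to U+007E. The names `FULLWIDTH_OFFSET` and `to_fullwidth` are hypothetical and not part of the original article.

```python
# Full-width ASCII variants (U+FF01..U+FF5E) are the ASCII characters
# U+0021..U+007E shifted up by a fixed offset.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(half: str) -> str:
    """Shift printable ASCII characters to their full-width forms.

    Characters outside '!'..'~' (for example the yen sign or katakana)
    are returned unchanged, because they do not follow this offset.
    """
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "!" <= ch <= "~" else ch
        for ch in half
    )


digits_full = to_fullwidth("0123456789")
letters_full = to_fullwidth("abcXYZ")
print(digits_full)
print(letters_full)
```

The yen sign and the katakana do not follow this offset, so those tables still need to be written out explicitly.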

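Half-width katakana with a voiced or semi-voiced sound mark (dakuten or handakuten) use two characters to represent one full-width character, so their conversion table is built separately from the others.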


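tmp01 = "ｶﾞｷﾞｸﾞｹﾞｺﾞｻﾞｼﾞｽﾞｾﾞｿﾞﾀﾞﾁﾞﾂﾞﾃﾞﾄﾞﾊﾞﾋﾞﾌﾞﾍﾞﾎﾞﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ"
tmp02 = "ガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ"

# Map each two-character half-width sequence to its one-character full-width form.
transtable02 = {}
for i in range(len(tmp02)):
    be = tmp01[i*2:i*2+2]
    af = tmp02[i]
    transtable02[be] = af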
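To see why these kana need their own table, it can help to inspect the code points directly. The sketch below is only an illustration with hypothetical variable names. It uses escape sequences so that the half-width characters are unambiguous.

```python
# Half-width "ga": base kana KA (U+FF76) followed by the half-width
# voiced sound mark (U+FF9E). Two code points in total.
half_ga = "\uff76\uff9e"

# Full-width "ga": a single precomposed character (U+30AC).
full_ga = "\u30ac"

half_points = [f"U+{ord(ch):04X}" for ch in half_ga]
full_points = [f"U+{ord(ch):04X}" for ch in full_ga]

print(len(half_ga), half_points)
print(len(full_ga), full_points)
```

Because the half-width form is two characters long, a one-to-one character table cannot express this mapping.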

Inside the function clean_text, transtable01 = str.maketrans(before, after) builds the translation table, and text = text.translate(transtable01) applies it.


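def clean_text(text):
    text = str(text).replace("\u3000", " ")  # full-width space to half-width

    # str.translate looks up one character at a time, so the two-character
    # keys in transtable02 would never match. Replace them with str.replace,
    # and do it first, before the base kana are converted below.
    for be, af in transtable02.items():
        text = text.replace(be, af)

    before = abc_full + digit_full + katakana_half + punc_full
    after = abc_half + digit_half + katakana_full + punc_half

    transtable01 = str.maketrans(before, after)
    text = text.translate(transtable01)

    return text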
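The difference between single-character and two-character mappings can be checked in isolation. The following is a minimal sketch with hypothetical miniature tables (`single_table`, `pair_table`), not the article's full ones. It assumes only the documented behavior of `str.translate`, which looks up each character by its integer code point.

```python
# Hypothetical miniature tables for illustration only.
single_table = str.maketrans("ｱ", "ア")  # one code point to one code point
pair_table = {"ｶﾞ": "ガ"}                 # two code points to one character

source = "ｱｶﾞ"

# translate() indexes the table with ord(ch), so a two-character string
# key is never found and the text comes back unchanged.
untouched = source.translate(pair_table)

# Replace the two-character sequences first, then translate the rest.
converted = source
for before, after in pair_table.items():
    converted = converted.replace(before, after)
converted = converted.translate(single_table)

print(untouched == source)
print(converted)
```

The order matters: if the single-character table ran first and already covered the base kana, the sound mark would be left stranded and the two-character keys would no longer match.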

Usage


text = "ﾒﾓﾔﾕﾖﾗﾘﾙﾚ，－．／：；ｑｒｹﾞｺﾞｻﾞｼﾞｽﾞｾﾞｿﾞﾀﾞﾁﾞ"
clean_text(text)

>>> メモヤユヨラリルレ,-./:;qrゲゴザジズゼゾダヂ

That's all!
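As a cross-check, the standard library's `unicodedata.normalize` with the NFKC form performs a similar width folding. This is a different technique from the hand-built tables in this article, shown here only as a sketch for comparison. The sample string is an assumed mixed-width input.

```python
import unicodedata

# Assumed sample: half-width katakana, full-width punctuation and letters.
mixed = "ﾒﾓﾔﾕﾖﾗﾘﾙﾚ，－．／：；ｑｒｹﾞｺﾞｻﾞｼﾞｽﾞｾﾞｿﾞﾀﾞﾁﾞ"

# NFKC applies compatibility decomposition, then canonical composition,
# which folds width variants and recombines kana with their sound marks.
normalized = unicodedata.normalize("NFKC", mixed)
print(normalized)
```

NFKC is broader than the tables above: it also rewrites other compatibility characters, such as circled digits like ①, so the hand-built tables remain useful when only specific character classes should change.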

Afterword

Japanese has other kinds of notation inconsistency, such as okurigana and kanji numerals, so I hope to add support for them over time.

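References

[Full-width ⇔ half-width] Recommended libraries for normalizing Japanese notation inconsistencies in Python
[python] Creating lists of various character types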
