
Processing UTF-8 text in Python

Python

Last updated at 2012-12-26. Posted at 2012-11-23.

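In Python 2.x, str objects and unicode objects are separate types, which makes text handling confusing.
After some digging, I ended up with the code below.
In Python 3.x, text is handled as Unicode by default, so this is apparently much simpler.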
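
To make the str/unicode distinction concrete, here is a minimal sketch of the two kinds of string. It uses only built-in methods, and the variable names are chosen just for illustration. The u'' prefix is accepted by Python 2 and by Python 3.3 or later.

```python
# coding: utf-8

# Text: a sequence of characters (unicode in Python 2, str in Python 3).
text = u'テキスト'

# Bytes: the UTF-8 encoded form (str in Python 2, bytes in Python 3).
raw = text.encode('utf-8')

char_count = len(text)  # counts characters
byte_count = len(raw)   # counts bytes; each of these katakana takes 3 bytes in UTF-8

# Decoding the bytes with the same codec gives the original text back.
round_trip = raw.decode('utf-8')
```

Here char_count is 4 while byte_count is 12, which is why byte strings and text have to be kept apart when replacing or measuring Japanese text.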

MacOS X 10.6.8
Python 2.6.1

# coding: UTF-8

import codecs
import string
import re

f_in  = codecs.open('test.txt', 'r', 'utf-8')
f_out = codecs.open('test_out.txt', 'w', 'utf-8')

lines = f_in.readlines()  # read every line as unicode
lines2 = []
for line in lines:
    line = string.replace(line, u'テキスト', u'text')  # plain text replacement
    line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line)  # regex replacement: comma every 3 digits
    lines2.append(line)  # collect into a separate list

f_out.write(string.join(lines2, ''))  # write out
f_in.close()
f_out.close()

test.txt

これはサンプルテキストです。
3桁ごとにカンマを入れます。
iPad mini 36800円

test_out.txt

これはサンプルtextです。
3桁ごとにカンマを入れます。
iPad mini 36,800円
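
The regular expression is the least obvious part of the script, so here it is in isolation as a small sketch. The helper name add_commas is hypothetical and not part of the original script. The pattern matches a digit only when it is followed by one or more complete groups of three digits and then by a non-digit or the end of the string, and the replacement writes a comma after that digit.

```python
# coding: utf-8
import re

# (\d)          the digit that will be followed by a comma
# (?=(\d{3})+   lookahead: one or more full groups of three digits come next
# (?!\d))       ...and the run of digits ends right after those groups
THOUSANDS = re.compile(r'(\d)(?=(\d{3})+(?!\d))')


def add_commas(text):
    """Hypothetical helper: insert a comma every three digits, counting from the right."""
    return THOUSANDS.sub(r'\1,', text)


price = add_commas(u'iPad mini 36800円')   # u'iPad mini 36,800円'
large = add_commas(u'1234567')             # u'1,234,567'
short = add_commas(u'800')                 # unchanged, fewer than four digits
```

Because the lookahead consumes nothing, every digit in a long number is tested independently, which is how a single substitution call can insert several commas.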

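Addendum:
I wrote a version that also runs on Python 3.3.
The codecs module still works in Python 3 (though the built-in open() also accepts an encoding argument there),
so the only real differences seem to be calling replace as a method of the str object and dropping the u'' literals.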

from __future__ import unicode_literals

Adding this import makes every string literal unicode even without the u'' prefix,
so the same code also runs as-is on Python 2.6. For now, that may be the best approach.

# coding: UTF-8
from __future__ import unicode_literals  # treat every string literal as unicode; not needed on Python 3
import codecs
import re

f_in  = codecs.open('test.txt', 'r', 'utf-8')
f_out = codecs.open('test_out.txt', 'w', 'utf-8')

lines = f_in.readlines()  # read
lines2 = []
for line in lines:
    line = line.replace('テキスト', 'text')  # plain text replacement
    line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line)  # regex replacement
    lines2.append(line)  # collect into a separate list

f_out.write(''.join(lines2))  # write out
f_in.close()
f_out.close()  # close the output as well so buffered text is flushed
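
As a possible alternative to codecs.open, the sketch below uses io.open, which is available from Python 2.6 and is the same function as the built-in open() in Python 3. The function name convert_file and its parameters are assumptions made for illustration. It keeps the u'' literals so that it does not depend on unicode_literals, and it should therefore run on Python 2.6 or later and on Python 3.3 or later.

```python
# coding: utf-8
import io
import re

# Same pattern as in the script above: a digit followed by full groups of three digits.
DIGIT_GROUPS = re.compile(r'(\d)(?=(\d{3})+(?!\d))')


def convert_file(src_path, dst_path):
    """Hypothetical helper: copy a UTF-8 file, applying both replacements per line."""
    # io.open decodes on read, so every line is already text, not bytes.
    with io.open(src_path, 'r', encoding='utf-8') as f_in:
        lines = f_in.readlines()

    # io.open encodes on write; the with block closes the file even on error.
    with io.open(dst_path, 'w', encoding='utf-8') as f_out:
        for line in lines:
            line = line.replace(u'テキスト', u'text')   # plain text replacement
            line = DIGIT_GROUPS.sub(r'\1,', line)       # comma every 3 digits
            f_out.write(line)
```

Calling convert_file('test.txt', 'test_out.txt') would process the same sample file as above. The with blocks replace the explicit close() calls, so neither file can be left open by accident.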
