str and unicode #Python

This article is about Python 2.x.

The code was tried in the Python console, started by running the python command from Terminal on a Mac or from the Command Prompt on Windows. The version is Python 2.7.9.

What is str?

str is what is commonly called a multibyte string, that is, a sequence of bytes. It is written like 'string'.

len(string) returns the number of bytes. For 'いろは' encoded as utf-8, that is 9.

>>> len('いろは')

9
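
The byte count depends on the encoding that produced the str. The following is a minimal sketch: the helper name byte_length is hypothetical, the text is written with \u escapes so the result does not depend on the console, and it relies on the unicode literal and encode(), both of which are covered later in this article. It is written so that it should also run on Python 3, where bytes plays the role of Python 2's str.

```python
def byte_length(text, encoding):
    """Return how many bytes `text` occupies once encoded with `encoding`."""
    return len(text.encode(encoding))

iroha = u'\u3044\u308d\u306f'  # the three characters used throughout this article

utf8_length = byte_length(iroha, 'utf-8')    # utf-8 uses 3 bytes for each of these
cp932_length = byte_length(iroha, 'cp932')   # cp932 uses 2 bytes for each of these
```

The same three characters give 9 bytes in utf-8 and 6 bytes in cp932, so len() on a str only makes sense once the encoding is known.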

Iterating with for ... in processes the str one byte at a time.

>>> for c in 'abc':
...    print c

a
b
c

>>> for c in 'いろは':
...    print c

�
�
�
�
�
�
�
�
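
The garbled output above appears because each byte of a multibyte character is not a valid character on its own. A small sketch that makes the individual bytes visible (variable names are illustrative; bytearray is used because it yields integers on both Python 2 and Python 3):

```python
data = u'\u3044\u308d\u306f'.encode('utf-8')  # the 9 utf-8 bytes of the sample text

# Show every byte as two hex digits instead of printing it raw.
byte_values = ['%02x' % b for b in bytearray(data)]

# A single byte taken out of a 3-byte sequence cannot be decoded by itself.
try:
    data[:1].decode('utf-8')
    first_byte_is_a_character = True
except UnicodeDecodeError:
    first_byte_is_a_character = False
```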
�

unicode.encode() converts a unicode to a str.

>>> u'abc'.encode()

'abc'

There is the notion of an encoding, which determines the bytes a given piece of text becomes.

>>> u'いろは'.encode('utf-8')

'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'

>>> u'いろは'.encode('cp932')

'\x82\xa2\x82\xeb\x82\xcd'

The bytes of a str literal typed at the console depend on the encoding of the execution environment. In Terminal on a Mac it is utf-8.

>>> 'いろは'

'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'

From the Command Prompt on Windows 7 it is cp932.

>>> 'いろは'

'\x82\xa2\x82\xeb\x82\xcd'

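A str written directly in a script file follows the encoding of the file.
However, if the file's encoding does not match the # coding: (encoding name) declaration, an error occurs when the script is run.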

#!/usr/bin/env python
# coding: utf-8

print 'いろは'

The script above prints utf-8 bytes, so the output is garbled if the CUI environment uses a different encoding.
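
The garbling can be reproduced without a second console by decoding the same bytes in two ways. This is only an illustrative sketch: the variable names are made up, and errors='replace' is passed so that the demo does not raise on byte sequences that cp932 cannot interpret.

```python
original = u'\u3044\u308d\u306f'
utf8_bytes = original.encode('utf-8')   # what the script above writes out

# A utf-8 console reads the bytes back as the intended text.
seen_on_utf8_console = utf8_bytes.decode('utf-8')

# A cp932 console interprets the very same bytes differently.
seen_on_cp932_console = utf8_bytes.decode('cp932', 'replace')
```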

What is unicode?

unicode handles a string in units of characters rather than bytes. It is written with a leading u, like u'string'.

len(string) returns the number of characters.

>>> len(u'いろは')

3

Iterating with for ... in processes the unicode one character at a time.

>>> for c in u'abc':
...    print c

a
b
c

>>> for c in u'いろは':
...    print c

い
ろ
は

str.decode() converts a str to a unicode.

>>> 'いろは'.decode('utf-8')

u'\u3044\u308d\u306f'
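
decode() only works when the encoding passed to it matches the bytes. A short sketch of the round trip and of a mismatch (the bytes are the utf-8 form shown earlier in this article; the variable names are illustrative):

```python
utf8_bytes = b'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'

text = utf8_bytes.decode('utf-8')   # bytes -> characters
round_trip = text.encode('utf-8')   # characters -> the same bytes again

# Decoding with an encoding that cannot represent these bytes raises.
try:
    utf8_bytes.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```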

unicode values all share one unified representation, so developers do not need to think about encodings while working with unicode.

>>> u'いろはに' + u'ほへと'

u'\u3044\u308d\u306f\u306b\u307b\u3078\u3068'

However, text that is sent outside the script always ends up being converted to str.

>>> print u'いろは'.encode('utf-8')

いろは

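If you do not convert a unicode to str yourself, the Python runtime converts it automatically, but the encoding it uses depends on the execution environment.
A common case is that a unicode containing Japanese gets encoded with the ascii codec and a UnicodeEncodeError exception is raised...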

>>> print u'いろは'

Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Whether things work without an explicit conversion depends on the execution environment.
When sending text outside the script, it is best to convert unicode --> str deliberately.
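
One way to make that conversion explicit is to encode right at the point of output. The sketch below is an assumption-laden illustration, not a fixed recipe: write_text is a hypothetical helper, utf-8 is an assumed default, and io.BytesIO stands in for a real binary output stream.

```python
import io

def write_text(stream, text, encoding='utf-8', errors='strict'):
    """Encode `text` explicitly and write the resulting bytes to `stream`."""
    stream.write(text.encode(encoding, errors))

text = u'\u3044\u308d\u306f'

# Encoding with the ascii codec fails in the same way as the traceback above.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

buffer = io.BytesIO()                          # stand-in for an output stream
write_text(buffer, text)                       # explicit utf-8
write_text(buffer, text, 'ascii', 'replace')   # lossy, but does not raise
written = buffer.getvalue()
```

With errors='replace', characters that the target encoding cannot represent are written as '?', which loses information but avoids the exception.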

Choosing between unicode and str

Inside the script, I think it is best to standardize on unicode.

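A str obtained from a library, such as sys.argv, is converted to unicode right away.
Conversely, when a string goes outside the script (for example with print), it is converted unicode --> str just before it is output.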
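
The "decode as soon as it arrives" half of this rule could look like the following sketch. decode_args is a hypothetical helper and utf-8 is an assumed encoding; in a real Python 2 script the list would be sys.argv, and its encoding depends on the environment.

```python
def decode_args(args, encoding='utf-8'):
    """Return `args` with every byte string decoded; text items pass through."""
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]

# Stand-in for sys.argv: byte strings, plus one item that is already text.
raw_args = [b'script.py', b'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf', u'already text']
args = decode_args(raw_args)
```

After this step the rest of the script only ever sees unicode, and the encoding is named in exactly one place.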

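If you mix str and unicode, the Python runtime tries to convert the str to unicode, using the ascii codec as the traceback below shows.
This often raises UnicodeDecodeError, which is a frequent source of trouble.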

>>> 'いろはに' + u'ほへと'

Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
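
The error goes away once the str is decoded explicitly before the two values are joined. A minimal sketch, assuming the bytes are utf-8 (the variable names are illustrative):

```python
head_bytes = u'\u3044\u308d\u306f\u306b'.encode('utf-8')  # a str in Python 2
tail = u'\u307b\u3078\u3068'                              # a unicode

# Decode first, naming the encoding, then concatenate unicode with unicode.
joined = head_bytes.decode('utf-8') + tail
```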

When the encoding of a str is unknown...

When I do not know the encoding of a str obtained from outside the script,
I convert it to unicode with the following code.

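def toUnicode(encodedStr):
    '''
    Decode a str of unknown encoding by trying candidate encodings in order.

    :return: a unicode string, or None if no candidate encoding fits.
    '''
    if isinstance(encodedStr, unicode):
        return encodedStr

    # The first encoding that decodes without error wins, so the order matters.
    for charset in ['cp932', 'utf-8', 'euc-jp', 'shift-jis', 'iso2022-jp']:
        try:
            return encodedStr.decode(charset)
        except UnicodeDecodeError:
            pass

    return None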
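
Because a wrong encoding can sometimes decode without raising, it can help to know which candidate matched. The following is a variation on the idea above rather than the function itself: guess_decode is a hypothetical name, it tries utf-8 first because utf-8 tends to reject bytes that are not utf-8, and it raises instead of returning nothing when every candidate fails.

```python
def guess_decode(data, candidates=('utf-8', 'cp932', 'euc-jp')):
    """Return (text, encoding) for the first candidate that decodes `data`."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')

sample = u'\u3044\u308d\u306f'
utf8_result = guess_decode(sample.encode('utf-8'))
cp932_result = guess_decode(sample.encode('cp932'))
```

This is still a guess: short inputs can be valid in several encodings at once, so the order of the candidates decides the outcome.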