
String types in Python 2

Python

Posted at 2013-01-22

Python 2 has two string types, str and unicode.
In general you should use the unicode type.

The str type

str is more accurately described as a byte sequence than as a string (I think).

aiueo = 'あいうえお'
# aiueo is now of type str

len(aiueo)
# The result depends on the encoding of the source file.
# For example, it is 15 with utf-8 and 10 with shift_jis.
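
To make the "byte sequence" point concrete, the bytes behind a single character can be listed for each encoding. This is only a minimal sketch, written to run on both Python 2 and Python 3; the variable names are illustrative and it encodes explicitly rather than relying on how the source file is saved.

```python
# -*- coding: utf-8 -*-
char = u'あ'  # U+3042

# bytearray yields integer byte values on both Python 2 and Python 3.
utf8_values = [hex(b) for b in bytearray(char.encode('utf-8'))]
sjis_values = [hex(b) for b in bytearray(char.encode('shift_jis'))]

print(utf8_values)  # ['0xe3', '0x81', '0x82']  (3 bytes)
print(sjis_values)  # ['0x82', '0xa0']          (2 bytes)
```

Three bytes per character in utf-8 and two in shift_jis is where the lengths 15 and 10 for five characters come from.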

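The unicode type

The unicode type stores characters internally as UCS-2 (or UCS-4).
Which of the two is used is fixed when Python itself is compiled: on a UCS-2 (narrow) build, characters outside the Basic Multilingual Plane are stored as surrogate pairs, so handling them as single characters requires a UCS-4 (wide) build.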

aiueo = u'あいうえお'
# aiueo is now of type unicode

len(aiueo)
# This is 5 in every environment
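
Which kind of build an interpreter is can be checked at runtime with sys.maxunicode. The following is a rough sketch; build_kind and rare_kanji are illustrative names. On Python 3.3 and later the full range is always reported, so the distinction only matters on older interpreters.

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow (UCS-2) build
# and 0x10FFFF on a wide (UCS-4) build.
build_kind = 'wide' if sys.maxunicode > 0xFFFF else 'narrow'
print(build_kind)

# A character outside the Basic Multilingual Plane (U+20BB7).
# A narrow build stores it as a surrogate pair, so len() is 2;
# a wide build stores it as one code point, so len() is 1.
rare_kanji = u'\U00020BB7'
print(len(rare_kanji))
```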

encode and decode

Calling the decode method of a str converts it to unicode.
Conversely, calling the encode method of a unicode converts it to str.

aiueo = u'あいうえお'

aiueo_utf8 = aiueo.encode('utf-8')
aiueo_shiftjis = aiueo.encode('shift_jis')

print isinstance(aiueo_utf8, str) # True
print isinstance(aiueo_shiftjis, str) # True
print len(aiueo_utf8) # 15
print len(aiueo_shiftjis) # 10

print len(aiueo_utf8.decode('utf-8')) # 5
print len(aiueo_shiftjis.decode('shift_jis')) # 5
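
One common way to follow the advice of working in unicode is to decode bytes as soon as they enter the program and to encode only when writing them out. Below is a minimal sketch of that idea; to_text and to_bytes are hypothetical helper names, not standard library functions, and the default of utf-8 is an assumption.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it if it is text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


raw = u'あいうえお'.encode('shift_jis')  # stands in for bytes read from a file
text = to_text(raw, 'shift_jis')          # work with text inside the program
print(len(text))                          # 5
print(len(to_bytes(text, 'utf-8')))       # 15
```

Keeping the conversions at the edges means the rest of the code only ever sees one string type.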

Passing an encoding that does not match the bytes to the decode method typically raises UnicodeDecodeError.

aiueo_shiftjis.decode('utf-8') # raises UnicodeDecodeError
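
When the input encoding is uncertain, the error can be caught, or a different error handler can be passed as the second argument of decode. The sketch below is illustrative only and runs on both Python 2 and Python 3. It also shows that a wrong codec does not always raise: some codecs accept any byte, in which case the result is silently garbled.

```python
# -*- coding: utf-8 -*-
sjis_bytes = u'あいうえお'.encode('shift_jis')

# Strict decoding (the default) raises on invalid input.
try:
    sjis_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True

# 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
replaced = sjis_bytes.decode('utf-8', 'replace')
print(u'\ufffd' in replaced)  # True

# latin-1 maps every byte value to a character, so decoding "succeeds"
# but yields ten meaningless characters instead of five.
mojibake = sjis_bytes.decode('latin-1')
print(len(mojibake))  # 10
```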

Python 3

Because this was confusing, in Python 3 the byte-oriented str type became bytes and the unicode type became str.
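
For comparison, here is a short sketch of the same five characters under Python 3 (it is not meant to run on Python 2). It also illustrates that Python 3 refuses to mix the two types, whereas Python 2 would attempt an implicit ASCII conversion.

```python
# Python 3 only
text = 'あいうえお'           # str: a sequence of Unicode code points
data = text.encode('utf-8')   # bytes: a sequence of bytes

print(type(text).__name__, len(text))  # str 5
print(type(data).__name__, len(data))  # bytes 15

# Concatenating str and bytes raises TypeError instead of converting.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```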

