
Why files exported from business systems as "Shift-JIS" are probably better handled as cp932 in Python


Overview

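When Python processes a file that a business system exported as "Shift-JIS", using cp932 as the encoding is recommended. Python's shift_jis and cp932 codecs differ slightly: cp932, Microsoft's extension of Shift_JIS, contains additional characters (certain symbols and vendor-extension kanji, for example) that shift_jis cannot represent. To avoid errors, it is therefore preferable to handle these files as cp932 in Python.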
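
As a small aside on naming, the sketch below checks which codec Python actually selects for a few common labels. The helper name `canonical_codec` is made up for this illustration; `codecs.lookup` is the standard-library call it wraps. The point is that labels such as `sjis` resolve to the strict shift_jis codec rather than to cp932.

```python
import codecs


def canonical_codec(label: str) -> str:
    """Return the canonical name of the codec Python uses for a label."""
    return codecs.lookup(label).name


# Labels that look similar can resolve to different codecs.
for label in ("shift_jis", "sjis", "cp932", "ms932"):
    print(f"{label} -> {canonical_codec(label)}")
```

So a file described as "SJIS" is not treated as cp932 unless cp932 (or one of its aliases) is requested explicitly.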

For example, trying to process the special character ㈱ with the shift_jis encoding raises an error, because shift_jis does not support this character. To avoid this kind of problem, consider using the cp932 encoding.

import pandas as pd

data = [
    {"col_1": "㈱"},  # U+3231, the character named in the error below
]

df = pd.DataFrame(data)
df['col_1'] = df['col_1'].str.encode('shift_jis')
df
UnicodeEncodeError: 'shift_jis' codec can't encode character '\u3231' in position 0: illegal multibyte sequence

This article shows code samples that raise the error and lists the characters that cause the same error.

Code samples that raise the error

Writing to a file

With shift_jis, the write fails with an error.

# Data to be written to test.csv
data = """
col_1
㈱
""".strip()

# Write the data to /dbfs/test.csv; encoding ㈱ as shift_jis raises UnicodeEncodeError
with open("/dbfs/test.csv", "w", encoding='shift_jis') as file:
    file.write(data)

Changing the encoding to cp932 lets the write finish normally.

# Data to be written to test.csv
data = """
col_1
㈱
""".strip()

# Write the data to /dbfs/test.csv; cp932 can represent ㈱, so this succeeds
with open("/dbfs/test.csv", "w", encoding='cp932') as file:
    file.write(data)
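
The same behavior can be reproduced without the `/dbfs` path. The following is a minimal sketch that assumes only the standard library and writes to a temporary directory; the variable names are chosen for this illustration.

```python
import os
import tempfile

text = "col_1\n㈱"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.csv")

    # cp932 can represent the character, so writing and reading back succeed.
    with open(path, "w", encoding="cp932", newline="") as f:
        f.write(text)
    with open(path, "r", encoding="cp932", newline="") as f:
        round_tripped = f.read()

    # shift_jis cannot represent it, so the write raises UnicodeEncodeError.
    try:
        with open(path, "w", encoding="shift_jis", newline="") as f:
            f.write(text)
        shift_jis_error = None
    except UnicodeEncodeError as exc:
        shift_jis_error = exc

print(round_tripped == text)
print(type(shift_jis_error).__name__)
```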

Creating a pandas DataFrame

Reading the file written above with shift_jis raises an error.

import pandas as pd

df = pd.read_csv("/dbfs/test.csv", encoding='shift_jis')
df

Changing the encoding to cp932 lets the read finish normally.

import pandas as pd

df = pd.read_csv("/dbfs/test.csv", encoding='cp932')
df
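
The read-side failure comes from decoding the bytes, not from pandas itself. The sketch below illustrates this with the standard library only: it builds the bytes an exporting system might produce (an assumption for this example) and parses them with the `csv` module. `parse_rows` is a hypothetical helper name.

```python
import csv
import io

# Bytes as an exporting system might produce them (assumed for this sketch).
raw = "col_1\n㈱\n".encode("cp932")


def parse_rows(raw_bytes: bytes, encoding: str) -> list:
    """Decode the bytes with the given encoding and parse them as CSV."""
    text = raw_bytes.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


# Decoding as cp932 works.
rows = parse_rows(raw, "cp932")

# Decoding the same bytes as shift_jis raises UnicodeDecodeError.
try:
    parse_rows(raw, "shift_jis")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(rows)
print(type(decode_error).__name__)
```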

Encoding a column of a pandas DataFrame

With shift_jis, an error is raised.

import pandas as pd

data = [
    {"col_1": "㈱"},
]

df = pd.DataFrame(data)
df['col_1'] = df['col_1'].str.encode('shift_jis')
df

Changing the encoding to cp932 lets the encode finish normally.

import pandas as pd

data = [
    {"col_1": "㈱"},
]

df = pd.DataFrame(data)
df['col_1'] = df['col_1'].str.encode('cp932')
df
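
For comparison, the sketch below shows what happens if shift_jis is kept and an `errors` handler is used instead of switching codecs. It uses plain `str.encode`; `Series.str.encode` also accepts an `errors` argument. Every handler shown loses or rewrites the original character, which is one reason cp932 may be the better choice when the data must round-trip. The variable names are illustrative.

```python
value = "㈱"

# Keeping shift_jis and suppressing the error changes the data.
lossy = {
    handler: value.encode("shift_jis", errors=handler)
    for handler in ("replace", "xmlcharrefreplace", "backslashreplace")
}

# cp932 encodes the character itself, so nothing is lost.
exact = value.encode("cp932")

for handler, encoded in lossy.items():
    print(handler, encoded)
print("cp932", exact, exact.decode("cp932") == value)
```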

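Characters that raise the same error

If a file contains any of the characters below that have no mapping in Python's shift_jis codec, handling it as shift_jis raises an error. The col_1_encoded column shows each character's cp932 bytes encoded as Base64 (for example, h0A= is the byte pair 0x87 0x40). Note that a few entries, such as 亜 and ≒, are part of JIS X 0208 and also encode without error under shift_jis.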

col_1 col_1_encoded
① h0A=
② h0E=
③ h0I=
④ h0M=
⑤ h0Q=
⑥ h0U=
⑦ h0Y=
⑧ h0c=
⑨ h0g=
⑩ h0k=
⑪ h0o=
⑫ h0s=
⑬ h0w=
⑭ h00=
⑮ h04=
⑯ h08=
⑰ h1A=
⑱ h1E=
⑲ h1I=
⑳ h1M=
Ⅰ h1Q=
Ⅱ h1U=
Ⅲ h1Y=
Ⅳ h1c=
Ⅴ h1g=
Ⅵ h1k=
Ⅶ h1o=
Ⅷ h1s=
Ⅸ h1w=
Ⅹ h10=
ⅰ 7u8=
ⅱ 7vA=
ⅲ 7vE=
ⅳ 7vI=
ⅴ 7vM=
ⅵ 7vQ=
ⅶ 7vU=
ⅷ 7vY=
ⅸ 7vc=
ⅹ 7vg=
・ gUU=
㍉ h18=
㌔ h2A=
㌢ h2E=
㍍ h2I=
㌘ h2M=
㌧ h2Q=
㌃ h2U=
㌶ h2Y=
㍑ h2c=
㍗ h2g=
㌍ h2k=
㌦ h2o=
㌣ h2s=
㌫ h2w=
㍊ h20=
㌻ h24=
㎜ h28=
㎝ h3A=
㎞ h3E=
㎎ h3I=
㎏ h3M=
㏄ h3Q=
㎡ h3U=
㍻ h34=
〝 h4A=
〟 h4E=
№ h4I=
㏍ h4M=
℡ h4Q=
㊤ h4U=
㊥ h4Y=
㊦ h4c=
㊧ h4g=
㊨ h4k=
㈱ h4o=
㈲ h4s=
㈹ h4w=
㍾ h40=
㍽ h44=
㍼ h48=
≒ geA=
≡ gd8=
∫ gec=
∮ h5M=
∑ h5Q=
√ geM=
⊥ gds=
∠ gdo=
⊿ h5k=
∵ geY=
∩ gb8=
∪ gb4=
亜 iJ8=
唖 iKA=
娃 iKE=
阿 iKI=
哀 iKM=
愛 iKQ=
挨 iKU=
姶 iKY=
逢 iKc=
葵 iKg=
茜 iKk=
穐 iKo=
悪 iKs=
握 iKw=
渥 iK0=
旭 iK4=
葦 iK8=
￢ gco=
￤ 7vo=
＇ 7vs=
＂ 7vw=
㈱ h4o=
№ h4I=
℡ h4Q=
∵ geY=
纊 7UA=
褜 7UE=
鍈 7UI=
銈 7UM=
蓜 7UQ=
俉 7UU=
炻 7UY=
昱 7Uc=
棈 7Ug=
鋹 7Uk=
曻 7Uo=
彅 7Us=
丨 7Uw=
仡 7U0=
仼 7U4=
伀 7U8=
伃 7VA=
伹 7VE=
佖 7VI=
侒 7VM=
侊 7VQ=
侚 7VU=
侔 7VY=
俍 7Vc=
偀 7Vg=
倢 7Vk=
俿 7Vo=
倞 7Vs=
偆 7Vw=
偰 7V0=
偂 7V4=
傔 7V8=
僴 7WA=
僘 7WE=
兊 7WI=
兤 7WM=
冝 7WQ=
冾 7WU=
凬 7WY=
刕 7Wc=
劜 7Wg=
劦 7Wk=
勀 7Wo=
勛 7Ws=
匀 7Ww=
匇 7W0=
匤 7W4=
卲 7W8=
厓 7XA=
厲 7XE=
叝 7XI=
﨎 7XM=
咜 7XQ=
咊 7XU=
咩 7XY=
哿 7Xc=
喆 7Xg=
坙 7Xk=
坥 7Xo=
垬 7Xs=
埈 7Xw=
埇 7X0=
﨏 7X4=
塚 7YA=
增 7YE=
墲 7YI=
夋 7YM=
奓 7YQ=
奛 7YU=
奝 7YY=
奣 7Yc=
妤 7Yg=
妺 7Yk=
孖 7Yo=
寀 7Ys=
甯 7Yw=
寘 7Y0=
寬 7Y4=
尞 7Y8=
岦 7ZA=
岺 7ZE=
峵 7ZI=
崧 7ZM=
嵓 7ZQ=
﨑 7ZU=
嵂 7ZY=
嵭 7Zc=
嶸 7Zg=
嶹 7Zk=
巐 7Zo=
弡 7Zs=
弴 7Zw=
彧 7Z0=
德 7Z4=
忞 7Z8=
恝 7aA=
悅 7aE=
悊 7aI=
惞 7aM=
惕 7aQ=
愠 7aU=
惲 7aY=
愑 7ac=
愷 7ag=
愰 7ak=
憘 7ao=
戓 7as=
抦 7aw=
揵 7a0=
摠 7a4=
撝 7a8=
擎 7bA=
敎 7bE=
昀 7bI=
昕 7bM=
昻 7bQ=
昉 7bU=
昮 7bY=
昞 7bc=
昤 7bg=
晥 7bk=
晗 7bo=
晙 7bs=
晴 7bw=
晳 7b0=
暙 7b4=
暠 7b8=
暲 7cA=
暿 7cE=
曺 7cI=
朎 7cM=
朗 7cQ=
杦 7cU=
枻 7cY=
桒 7cc=
柀 7cg=
栁 7ck=
桄 7co=
棏 7cs=
﨓 7cw=
楨 7c0=
﨔 7c4=
榘 7c8=
槢 7dA=
樰 7dE=
橫 7dI=
橆 7dM=
橳 7dQ=
橾 7dU=
櫢 7dY=
櫤 7dc=
毖 7dg=
氿 7dk=
汜 7do=
沆 7ds=
汯 7dw=
泚 7d0=
洄 7d4=
涇 7d8=
浯 7eA=
涖 7eE=
涬 7eI=
淏 7eM=
淸 7eQ=
淲 7eU=
淼 7eY=
渹 7ec=
湜 7eg=
渧 7ek=
渼 7eo=
溿 7es=
澈 7ew=
澵 7e0=
濵 7e4=
瀅 7e8=
瀇 7fA=
瀨 7fE=
炅 7fI=
炫 7fM=
焏 7fQ=
焄 7fU=
煜 7fY=
煆 7fc=
煇 7fg=
凞 7fk=
燁 7fo=
燾 7fs=
犱 7fw=
犾 7kA=
猤 7kE=
猪 7kI=
獷 7kM=
玽 7kQ=
珉 7kU=
珖 7kY=
珣 7kc=
珒 7kg=
琇 7kk=
珵 7ko=
琦 7ks=
琪 7kw=
琩 7k0=
琮 7k4=
瑢 7k8=
璉 7lA=
璟 7lE=
甁 7lI=
畯 7lM=
皂 7lQ=
皜 7lU=
皞 7lY=
皛 7lc=
皦 7lg=
益 7lk=
睆 7lo=
劯 7ls=
砡 7lw=
硎 7l0=
硤 7l4=
硺 7l8=
礰 7mA=
礼 7mE=
神 7mI=
祥 7mM=
禔 7mQ=
福 7mU=
禛 7mY=
竑 7mc=
竧 7mg=
靖 7mk=
竫 7mo=
箞 7ms=
精 7mw=
絈 7m0=
絜 7m4=
綷 7m8=
綠 7nA=
緖 7nE=
繒 7nI=
罇 7nM=
羡 7nQ=
羽 7nU=
茁 7nY=
荢 7nc=
荿 7ng=
菇 7nk=
菶 7no=
葈 7ns=
蒴 7nw=
蕓 7n0=
蕙 7n4=
蕫 7oA=
﨟 7oE=
薰 7oI=
蘒 7oM=
﨡 7oQ=
蠇 7oU=
裵 7oY=
訒 7oc=
訷 7og=
詹 7ok=
誧 7oo=
誾 7os=
諟 7ow=
諸 7o0=
諶 7o4=
譓 7o8=
譿 7pA=
賰 7pE=
賴 7pI=
贒 7pM=
赶 7pQ=
﨣 7pU=
軏 7pY=
﨤 7pc=
逸 7pg=
遧 7pk=
郞 7po=
都 7ps=
鄕 7pw=
鄧 7p0=
釚 7p4=
釗 7p8=
釞 7qA=
釭 7qE=
釮 7qI=
釤 7qM=
釥 7qQ=
鈆 7qU=
鈐 7qY=
鈊 7qc=
鈺 7qg=
鉀 7qk=
鈼 7qo=
鉎 7qs=
鉙 7qw=
鉑 7q0=
鈹 7q4=
鉧 7q8=
銧 7rA=
鉷 7rE=
鉸 7rI=
鋧 7rM=
鋗 7rQ=
鋙 7rU=
鋐 7rY=
﨧 7rc=
鋕 7rg=
鋠 7rk=
鋓 7ro=
錥 7rs=
錡 7rw=
鋻 7r0=
﨨 7r4=
錞 7r8=
鋿 7sA=
錝 7sE=
錂 7sI=
鍰 7sM=
鍗 7sQ=
鎤 7sU=
鏆 7sY=
鏞 7sc=
鏸 7sg=
鐱 7sk=
鑅 7so=
鑈 7ss=
閒 7sw=
隆 7s0=
﨩 7s4=
隝 7s8=
隯 7tA=
霳 7tE=
霻 7tI=
靃 7tM=
靍 7tQ=
靏 7tU=
靑 7tY=
靕 7tc=
顗 7tg=
顥 7tk=
飯 7to=
飼 7ts=
餧 7tw=
館 7t0=
馞 7t4=
驎 7t8=
髙 7uA=
髜 7uE=
魵 7uI=
魲 7uM=
鮏 7uQ=
鮱 7uU=
鮻 7uY=
鰀 7uc=
鵰 7ug=
鵫 7uk=
鶴 7uo=
鸙 7us=
黑 7uw=
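
To read a col_1_encoded value back, it can be Base64-decoded and then decoded as cp932. This is a minimal sketch; `decode_cell` is a hypothetical helper name, and the assumption is that the column holds Base64 text of cp932 bytes, as the values above suggest.

```python
import base64


def decode_cell(cell: str) -> str:
    """Turn a Base64 cell into the character its cp932 bytes stand for."""
    return base64.b64decode(cell).decode("cp932")


print(decode_cell("h0A="))  # bytes 0x87 0x40
print(decode_cell("h4o="))  # bytes 0x87 0x8A
```
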

Code that builds a DataFrame with the col_1 and col_1_encoded columns:

import pandas as pd

data = [
    {"col_1": "①"},
    {"col_1": "②"},
    {"col_1": "③"},
    {"col_1": "④"},
    {"col_1": "⑤"},
    {"col_1": "⑥"},
    {"col_1": "⑦"},
    {"col_1": "⑧"},
    {"col_1": "⑨"},
    {"col_1": "⑩"},
    {"col_1": "⑪"},
    {"col_1": "⑫"},
    {"col_1": "⑬"},
    {"col_1": "⑭"},
    {"col_1": "⑮"},
    {"col_1": "⑯"},
    {"col_1": "⑰"},
    {"col_1": "⑱"},
    {"col_1": "⑲"},
    {"col_1": "⑳"},
    {"col_1": "Ⅰ"},
    {"col_1": "Ⅱ"},
    {"col_1": "Ⅲ"},
    {"col_1": "Ⅳ"},
    {"col_1": "Ⅴ"},
    {"col_1": "Ⅵ"},
    {"col_1": "Ⅶ"},
    {"col_1": "Ⅷ"},
    {"col_1": "Ⅸ"},
    {"col_1": "Ⅹ"},
    {"col_1": "ⅰ"},
    {"col_1": "ⅱ"},
    {"col_1": "ⅲ"},
    {"col_1": "ⅳ"},
    {"col_1": "ⅴ"},
    {"col_1": "ⅵ"},
    {"col_1": "ⅶ"},
    {"col_1": "ⅷ"},
    {"col_1": "ⅸ"},
    {"col_1": "ⅹ"},
    {"col_1": "・"},
    {"col_1": "㍉"},
    {"col_1": "㌔"},
    {"col_1": "㌢"},
    {"col_1": "㍍"},
    {"col_1": "㌘"},
    {"col_1": "㌧"},
    {"col_1": "㌃"},
    {"col_1": "㌶"},
    {"col_1": "㍑"},
    {"col_1": "㍗"},
    {"col_1": "㌍"},
    {"col_1": "㌦"},
    {"col_1": "㌣"},
    {"col_1": "㌫"},
    {"col_1": "㍊"},
    {"col_1": "㌻"},
    {"col_1": "㎜"},
    {"col_1": "㎝"},
    {"col_1": "㎞"},
    {"col_1": "㎎"},
    {"col_1": "㎏"},
    {"col_1": "㏄"},
    {"col_1": "㎡"},
    {"col_1": "㍻"},
    {"col_1": "〝"},
    {"col_1": "〟"},
    {"col_1": "№"},
    {"col_1": "㏍"},
    {"col_1": "℡"},
    {"col_1": "㊤"},
    {"col_1": "㊥"},
    {"col_1": "㊦"},
    {"col_1": "㊧"},
    {"col_1": "㊨"},
    {"col_1": "㈱"},
    {"col_1": "㈲"},
    {"col_1": "㈹"},
    {"col_1": "㍾"},
    {"col_1": "㍽"},
    {"col_1": "㍼"},
    {"col_1": "≒"},
    {"col_1": "≡"},
    {"col_1": "∫"},
    {"col_1": "∮"},
    {"col_1": "∑"},
    {"col_1": "√"},
    {"col_1": "⊥"},
    {"col_1": "∠"},
    {"col_1": "⊿"},
    {"col_1": "∵"},
    {"col_1": "∩"},
    {"col_1": "∪"},
    {"col_1": "亜"},
    {"col_1": "唖"},
    {"col_1": "娃"},
    {"col_1": "阿"},
    {"col_1": "哀"},
    {"col_1": "愛"},
    {"col_1": "挨"},
    {"col_1": "姶"},
    {"col_1": "逢"},
    {"col_1": "葵"},
    {"col_1": "茜"},
    {"col_1": "穐"},
    {"col_1": "悪"},
    {"col_1": "握"},
    {"col_1": "渥"},
    {"col_1": "旭"},
    {"col_1": "葦"},
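    {"col_1": "\uffe2"},  # fullwidth not sign
    {"col_1": "\uffe4"},  # fullwidth broken bar
    {"col_1": "\uff07"},  # fullwidth apostrophe
    {"col_1": "\uff02"},  # fullwidth quotation mark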
    {"col_1": "㈱"},
    {"col_1": "№"},
    {"col_1": "℡"},
    {"col_1": "∵"},
    {"col_1": "纊"},
    {"col_1": "褜"},
    {"col_1": "鍈"},
    {"col_1": "銈"},
    {"col_1": "蓜"},
    {"col_1": "俉"},
    {"col_1": "炻"},
    {"col_1": "昱"},
    {"col_1": "棈"},
    {"col_1": "鋹"},
    {"col_1": "曻"},
    {"col_1": "彅"},
    {"col_1": "丨"},
    {"col_1": "仡"},
    {"col_1": "仼"},
    {"col_1": "伀"},
    {"col_1": "伃"},
    {"col_1": "伹"},
    {"col_1": "佖"},
    {"col_1": "侒"},
    {"col_1": "侊"},
    {"col_1": "侚"},
    {"col_1": "侔"},
    {"col_1": "俍"},
    {"col_1": "偀"},
    {"col_1": "倢"},
    {"col_1": "俿"},
    {"col_1": "倞"},
    {"col_1": "偆"},
    {"col_1": "偰"},
    {"col_1": "偂"},
    {"col_1": "傔"},
    {"col_1": "僴"},
    {"col_1": "僘"},
    {"col_1": "兊"},
    {"col_1": "兤"},
    {"col_1": "冝"},
    {"col_1": "冾"},
    {"col_1": "凬"},
    {"col_1": "刕"},
    {"col_1": "劜"},
    {"col_1": "劦"},
    {"col_1": "勀"},
    {"col_1": "勛"},
    {"col_1": "匀"},
    {"col_1": "匇"},
    {"col_1": "匤"},
    {"col_1": "卲"},
    {"col_1": "厓"},
    {"col_1": "厲"},
    {"col_1": "叝"},
    {"col_1": "﨎"},
    {"col_1": "咜"},
    {"col_1": "咊"},
    {"col_1": "咩"},
    {"col_1": "哿"},
    {"col_1": "喆"},
    {"col_1": "坙"},
    {"col_1": "坥"},
    {"col_1": "垬"},
    {"col_1": "埈"},
    {"col_1": "埇"},
    {"col_1": "﨏"},
    {"col_1": "\ufa10"},  # compatibility form of 塚
    {"col_1": "增"},
    {"col_1": "墲"},
    {"col_1": "夋"},
    {"col_1": "奓"},
    {"col_1": "奛"},
    {"col_1": "奝"},
    {"col_1": "奣"},
    {"col_1": "妤"},
    {"col_1": "妺"},
    {"col_1": "孖"},
    {"col_1": "寀"},
    {"col_1": "甯"},
    {"col_1": "寘"},
    {"col_1": "寬"},
    {"col_1": "尞"},
    {"col_1": "岦"},
    {"col_1": "岺"},
    {"col_1": "峵"},
    {"col_1": "崧"},
    {"col_1": "嵓"},
    {"col_1": "﨑"},
    {"col_1": "嵂"},
    {"col_1": "嵭"},
    {"col_1": "嶸"},
    {"col_1": "嶹"},
    {"col_1": "巐"},
    {"col_1": "弡"},
    {"col_1": "弴"},
    {"col_1": "彧"},
    {"col_1": "德"},
    {"col_1": "忞"},
    {"col_1": "恝"},
    {"col_1": "悅"},
    {"col_1": "悊"},
    {"col_1": "惞"},
    {"col_1": "惕"},
    {"col_1": "愠"},
    {"col_1": "惲"},
    {"col_1": "愑"},
    {"col_1": "愷"},
    {"col_1": "愰"},
    {"col_1": "憘"},
    {"col_1": "戓"},
    {"col_1": "抦"},
    {"col_1": "揵"},
    {"col_1": "摠"},
    {"col_1": "撝"},
    {"col_1": "擎"},
    {"col_1": "敎"},
    {"col_1": "昀"},
    {"col_1": "昕"},
    {"col_1": "昻"},
    {"col_1": "昉"},
    {"col_1": "昮"},
    {"col_1": "昞"},
    {"col_1": "昤"},
    {"col_1": "晥"},
    {"col_1": "晗"},
    {"col_1": "晙"},
    {"col_1": "\ufa12"},  # compatibility form of 晴
    {"col_1": "晳"},
    {"col_1": "暙"},
    {"col_1": "暠"},
    {"col_1": "暲"},
    {"col_1": "暿"},
    {"col_1": "曺"},
    {"col_1": "朎"},
    {"col_1": "\uf929"},  # compatibility form of 朗
    {"col_1": "杦"},
    {"col_1": "枻"},
    {"col_1": "桒"},
    {"col_1": "柀"},
    {"col_1": "栁"},
    {"col_1": "桄"},
    {"col_1": "棏"},
    {"col_1": "﨓"},
    {"col_1": "楨"},
    {"col_1": "﨔"},
    {"col_1": "榘"},
    {"col_1": "槢"},
    {"col_1": "樰"},
    {"col_1": "橫"},
    {"col_1": "橆"},
    {"col_1": "橳"},
    {"col_1": "橾"},
    {"col_1": "櫢"},
    {"col_1": "櫤"},
    {"col_1": "毖"},
    {"col_1": "氿"},
    {"col_1": "汜"},
    {"col_1": "沆"},
    {"col_1": "汯"},
    {"col_1": "泚"},
    {"col_1": "洄"},
    {"col_1": "涇"},
    {"col_1": "浯"},
    {"col_1": "涖"},
    {"col_1": "涬"},
    {"col_1": "淏"},
    {"col_1": "淸"},
    {"col_1": "淲"},
    {"col_1": "淼"},
    {"col_1": "渹"},
    {"col_1": "湜"},
    {"col_1": "渧"},
    {"col_1": "渼"},
    {"col_1": "溿"},
    {"col_1": "澈"},
    {"col_1": "澵"},
    {"col_1": "濵"},
    {"col_1": "瀅"},
    {"col_1": "瀇"},
    {"col_1": "瀨"},
    {"col_1": "炅"},
    {"col_1": "炫"},
    {"col_1": "焏"},
    {"col_1": "焄"},
    {"col_1": "煜"},
    {"col_1": "煆"},
    {"col_1": "煇"},
    {"col_1": "\ufa15"},  # compatibility form of 凞
    {"col_1": "燁"},
    {"col_1": "燾"},
    {"col_1": "犱"},
    {"col_1": "犾"},
    {"col_1": "猤"},
    {"col_1": "\ufa16"},  # compatibility form of 猪
    {"col_1": "獷"},
    {"col_1": "玽"},
    {"col_1": "珉"},
    {"col_1": "珖"},
    {"col_1": "珣"},
    {"col_1": "珒"},
    {"col_1": "琇"},
    {"col_1": "珵"},
    {"col_1": "琦"},
    {"col_1": "琪"},
    {"col_1": "琩"},
    {"col_1": "琮"},
    {"col_1": "瑢"},
    {"col_1": "璉"},
    {"col_1": "璟"},
    {"col_1": "甁"},
    {"col_1": "畯"},
    {"col_1": "皂"},
    {"col_1": "皜"},
    {"col_1": "皞"},
    {"col_1": "皛"},
    {"col_1": "皦"},
    {"col_1": "\ufa17"},  # compatibility form of 益
    {"col_1": "睆"},
    {"col_1": "劯"},
    {"col_1": "砡"},
    {"col_1": "硎"},
    {"col_1": "硤"},
    {"col_1": "硺"},
    {"col_1": "礰"},
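    {"col_1": "\ufa18"},  # compatibility form of 礼
    {"col_1": "\ufa19"},  # compatibility form of 神
    {"col_1": "\ufa1a"},  # compatibility form of 祥
    {"col_1": "禔"},
    {"col_1": "\ufa1b"},  # compatibility form of 福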
    {"col_1": "禛"},
    {"col_1": "竑"},
    {"col_1": "竧"},
    {"col_1": "\ufa1c"},  # compatibility form of 靖
    {"col_1": "竫"},
    {"col_1": "箞"},
    {"col_1": "\ufa1d"},  # compatibility form of 精
    {"col_1": "絈"},
    {"col_1": "絜"},
    {"col_1": "綷"},
    {"col_1": "綠"},
    {"col_1": "緖"},
    {"col_1": "繒"},
    {"col_1": "罇"},
    {"col_1": "羡"},
    {"col_1": "\ufa1e"},  # compatibility form of 羽
    {"col_1": "茁"},
    {"col_1": "荢"},
    {"col_1": "荿"},
    {"col_1": "菇"},
    {"col_1": "菶"},
    {"col_1": "葈"},
    {"col_1": "蒴"},
    {"col_1": "蕓"},
    {"col_1": "蕙"},
    {"col_1": "蕫"},
    {"col_1": "﨟"},
    {"col_1": "薰"},
    {"col_1": "\ufa20"},  # compatibility form of 蘒
    {"col_1": "﨡"},
    {"col_1": "蠇"},
    {"col_1": "裵"},
    {"col_1": "訒"},
    {"col_1": "訷"},
    {"col_1": "詹"},
    {"col_1": "誧"},
    {"col_1": "誾"},
    {"col_1": "諟"},
    {"col_1": "\ufa22"},  # compatibility form of 諸
    {"col_1": "諶"},
    {"col_1": "譓"},
    {"col_1": "譿"},
    {"col_1": "賰"},
    {"col_1": "賴"},
    {"col_1": "贒"},
    {"col_1": "赶"},
    {"col_1": "﨣"},
    {"col_1": "軏"},
    {"col_1": "﨤"},
    {"col_1": "\ufa25"},  # compatibility form of 逸
    {"col_1": "遧"},
    {"col_1": "郞"},
    {"col_1": "\ufa26"},  # compatibility form of 都
    {"col_1": "鄕"},
    {"col_1": "鄧"},
    {"col_1": "釚"},
    {"col_1": "釗"},
    {"col_1": "釞"},
    {"col_1": "釭"},
    {"col_1": "釮"},
    {"col_1": "釤"},
    {"col_1": "釥"},
    {"col_1": "鈆"},
    {"col_1": "鈐"},
    {"col_1": "鈊"},
    {"col_1": "鈺"},
    {"col_1": "鉀"},
    {"col_1": "鈼"},
    {"col_1": "鉎"},
    {"col_1": "鉙"},
    {"col_1": "鉑"},
    {"col_1": "鈹"},
    {"col_1": "鉧"},
    {"col_1": "銧"},
    {"col_1": "鉷"},
    {"col_1": "鉸"},
    {"col_1": "鋧"},
    {"col_1": "鋗"},
    {"col_1": "鋙"},
    {"col_1": "鋐"},
    {"col_1": "﨧"},
    {"col_1": "鋕"},
    {"col_1": "鋠"},
    {"col_1": "鋓"},
    {"col_1": "錥"},
    {"col_1": "錡"},
    {"col_1": "鋻"},
    {"col_1": "﨨"},
    {"col_1": "錞"},
    {"col_1": "鋿"},
    {"col_1": "錝"},
    {"col_1": "錂"},
    {"col_1": "鍰"},
    {"col_1": "鍗"},
    {"col_1": "鎤"},
    {"col_1": "鏆"},
    {"col_1": "鏞"},
    {"col_1": "鏸"},
    {"col_1": "鐱"},
    {"col_1": "鑅"},
    {"col_1": "鑈"},
    {"col_1": "閒"},
    {"col_1": "\uf9dc"},  # compatibility form of 隆
    {"col_1": "﨩"},
    {"col_1": "隝"},
    {"col_1": "隯"},
    {"col_1": "霳"},
    {"col_1": "霻"},
    {"col_1": "靃"},
    {"col_1": "靍"},
    {"col_1": "靏"},
    {"col_1": "靑"},
    {"col_1": "靕"},
    {"col_1": "顗"},
    {"col_1": "顥"},
    {"col_1": "\ufa2a"},  # compatibility form of 飯
    {"col_1": "\ufa2b"},  # compatibility form of 飼
    {"col_1": "餧"},
    {"col_1": "\ufa2c"},  # compatibility form of 館
    {"col_1": "馞"},
    {"col_1": "驎"},
    {"col_1": "髙"},
    {"col_1": "髜"},
    {"col_1": "魵"},
    {"col_1": "魲"},
    {"col_1": "鮏"},
    {"col_1": "鮱"},
    {"col_1": "鮻"},
    {"col_1": "鰀"},
    {"col_1": "鵰"},
    {"col_1": "鵫"},
    {"col_1": "\ufa2d"},  # compatibility form of 鶴
    {"col_1": "鸙"},
    {"col_1": "黑"},
]

df = pd.DataFrame(data)
df['col_1_encoded'] = df['col_1'].str.encode('cp932')
df
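
To check which characters in a list like the one above actually fail under a given codec, each one can be encoded individually. This is a minimal sketch; `split_by_codec` and the short `sample` list are made up for this illustration and are not part of the original code.

```python
def split_by_codec(chars, codec="shift_jis"):
    """Split characters into those the codec can encode and those it cannot."""
    encodable, failing = [], []
    for ch in chars:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            failing.append(ch)
        else:
            encodable.append(ch)
    return encodable, failing


sample = ["①", "㈱", "髙", "亜", "≒"]
encodable, failing = split_by_codec(sample, "shift_jis")
print("encodable under shift_jis:", encodable)
print("failing under shift_jis:", failing)
```

Running the same helper with `codec="cp932"` on the full list is one way to confirm that cp932 covers every entry.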