
Notes on capturing output that contains Japanese with subprocess.run(..., stdout=subprocess.PIPE) (detecting the character encoding)

Python

Posted at 2017-06-22, last updated at 2018-10-27

17.5.1. Using the subprocess Module
Example:

>>> subprocess.run(["ls", "-l", "/dev/null"], stdout=subprocess.PIPE)
CompletedProcess(args=['ls', '-l', '/dev/null'], returncode=0,
stdout=b'crw-rw-rw- 1 root root 1, 3 Jan 23 16:23 /dev/null\n')


When the output captured with subprocess.run(..., stdout=subprocess.PIPE) contains Japanese (*1)(*2), this is what you get on Windows:

```pycon:stdout=subprocess.PIPE
In [16]: ping = subprocess.run(["ping", "192.168.1.1", "-n", "1"], stdout=subprocess.PIPE)

In [17]: print(ping.stdout)
b'\r\n192.168.1.1 \x82\xc9 ping \x82\xf0\x91\x97\x90M\x82\xb5\x82\xc4\x82\xa2\x82\xdc\x82\xb7 32 \x83o\x83C\x83g\x82\xcc\x83f\x81[\x83^:\r\n192.168.1.1 \x82\xa9\x82\xe7\x82\xcc\x89\x9e\x93\x9a: \x83o\x83C\x83g\x90\x94 =32 \x8e\x9e\x8a\xd4 <1ms TTL=64\r\n\r\n192.168.1.1 \x82\xcc ping \x93\x9d\x8cv:\r\n    \x83p\x83P\x83b\x83g\x90\x94: \x91\x97\x90M = 1\x81A\x8e\xf3\x90M = 1\x81A\x91\xb9\x8e\xb8 = 0 (0% \x82\xcc\x91\xb9\x8e\xb8)\x81A\r\n\x83\x89\x83E\x83\x93\x83h \x83g\x83\x8a\x83b\x83v\x82\xcc\x8aT\x8eZ\x8e\x9e\x8a\xd4 (\x83~\x83\x8a\x95b):\r\n    \x8d\xc5\x8f\xac = 0ms\x81A\x8d\xc5\x91\xe5 = 0ms\x81A\x95\xbd\x8b\xcf = 0ms\r\n'

decode()

Appending .decode('cp932') makes the output readable.

decode('cp932')


In [25]: ping = subprocess.run(["ping", "192.168.1.1", "-n", "1"], stdout=subprocess.PIPE)

In [26]: print(ping.stdout.decode('cp932'))

192.168.1.1 に ping を送信しています 32 バイトのデータ:
192.168.1.1 からの応答: バイト数 =32 時間 =1ms TTL=64

192.168.1.1 の ping 統計:
    パケット数: 送信 = 1、受信 = 1、損失 = 0 (0% の損失)、
ラウンド トリップの概算時間 (ミリ秒):
    最小 = 1ms、最大 = 1ms、平均 = 1ms

chardet.detect()

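Hard-coding .decode('cp932') feels brittle, so I use chardet to detect the encoding instead.
chardet is very handy, but be aware that it sometimes misdetects the encoding as Windows-1252 or Windows-1254. (It does not misdetect in this ping example.)
See "incorrect detection of windows-1254 instead of utf-8". I believe I have seen the same misdetection with Shift_JIS input, not only UTF-8.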

ping.py

# -*- coding: utf-8 -*-
import subprocess
import chardet

ping = subprocess.run(["ping", "192.168.1.1", "-n", "1"], stdout=subprocess.PIPE)
print(ping.stdout.decode(chardet.detect(ping.stdout)["encoding"]))

Run from the console

Run in IPython

Checking the return value of chardet.detect()

In [13]: import subprocess
In [14]: ping = subprocess.run(["ping", "192.168.1.1", "-n", "1"], stdout=subprocess.PIPE)
In [15]: b=ping.stdout
In [16]: b
Out[16]: b'\r\n192.168.1.1 \x82\xc9 ping \x82\xf0\x91\x97\x90M\x82\xb5\x82\xc4\x82\xa2\x82\xdc\x82\xb7 32 \x83o\x83C\x83g\x82\xcc\x83f\x81[\x83^:\r\n192.168.1.1 \x82\xa9\x82\xe7\x82\xcc\x89\x9e\x93\x9a: \x83o\x83C\x83g\x90\x94 =32 \x8e\x9e\x8a\xd4 =1ms TTL=64\r\n\r\n192.168.1.1 \x82\xcc ping \x93\x9d\x8cv:\r\n    \x83p\x83P\x83b\x83g\x90\x94: \x91\x97\x90M = 1\x81A\x8e\xf3\x90M = 1\x81A\x91\xb9\x8e\xb8 = 0 (0% \x82\xcc\x91\xb9\x8e\xb8)\x81A\r\n\x83\x89\x83E\x83\x93\x83h \x83g\x83\x8a\x83b\x83v\x82\xcc\x8aT\x8eZ\x8e\x9e\x8a\xd4 (\x83~\x83\x8a\x95b):\r\n    \x8d\xc5\x8f\xac = 1ms\x81A\x8d\xc5\x91\xe5 = 1ms\x81A\x95\xbd\x8b\xcf = 1ms\r\n'

In [20]: import chardet
In [21]: chardet.detect(b)
Out[21]: {'confidence': 0.99, 'encoding': 'SHIFT_JIS', 'language': 'Japanese'}

In [24]: chardet.detect(b)["encoding"]
Out[24]: 'SHIFT_JIS'

In [26]: print(b.decode(chardet.detect(b)["encoding"]))

192.168.1.1 に ping を送信しています 32 バイトのデータ:
192.168.1.1 からの応答: バイト数 =32 時間 =1ms TTL=64

192.168.1.1 の ping 統計:
    パケット数: 送信 = 1、受信 = 1、損失 = 0 (0% の損失)、
ラウンド トリップの概算時間 (ミリ秒):
    最小 = 1ms、最大 = 1ms、平均 = 1ms

chardet.universaldetector

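Detection reportedly takes a long time on large files.

Quoted from "Python chardet でテキストファイルの文字コードを検出する" (Detecting the character encoding of a text file with Python chardet), CUBE SUGAR CONTAINER:

Handling large files

Apparently, running the detect() function above on a very large file takes a long time to produce an estimate.
...
The following sample code uses that API. An instance of the UniversalDetector class has a feed() method, which can be given the bytes in several separate calls. Once detection has finished with sufficient confidence, the instance attribute done becomes true, so the computation can be stopped at that point.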

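I wrote a detection routine using chardet.universaldetector and a generator.
Note: for input as small as this example, neither chardet.universaldetector nor the generator offers any benefit.

Like chardet.detect(), this approach can also misdetect the encoding as Windows-1252 or Windows-1254. (It does not misdetect in this ping example.)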

ping3.py

# -*- coding: utf-8 -*-
import subprocess
from chardet.universaldetector import UniversalDetector

# Generator that yields the bytes one chunk at a time (no real benefit for input this small)
def get_chunk(binary, chunk_size=1024):
    # range() stops at the end of the data, so no empty chunk is ever yielded
    for start in range(0, len(binary), chunk_size):
        yield binary[start:start + chunk_size]

# Feed the bytes to UniversalDetector and stop early once the encoding has been determined
def encoding_detect(binary):
    detector = UniversalDetector()
    try:
        for chunk in get_chunk(binary):
            detector.feed(chunk)
            if detector.done:
                break
    finally:
        detector.close()
    return detector.result["encoding"]



# Run ping 10 times. This took about 9.0 sec here, though it depends on the environment.
ping = subprocess.run(["ping", "192.168.1.1", "-n", "10"], stdout=subprocess.PIPE)

# Detect the encoding. This took about 0.015 sec.
enc = encoding_detect(ping.stdout)

print(ping.stdout.decode(enc))

For reference: notes from before I switched to chardet

Instead of hard-coding decode('cp932'), I tried sys.stdout.encoding, that is, .decode(sys.stdout.encoding).

ping.py

# -*- coding: utf-8 -*-
import subprocess
import sys

print("sys.stdout.encoding = " + sys.stdout.encoding)
ping = subprocess.run(["ping", "192.168.1.1", "-n", "1"], stdout=subprocess.PIPE)
print(ping.stdout.decode(sys.stdout.encoding))

When run from the Command Prompt, sys.stdout.encoding = cp932, so it works.

Run from the Command Prompt

> python ping.py
sys.stdout.encoding = cp932

192.168.1.1 に ping を送信しています 32 バイトのデータ:
192.168.1.1 からの応答: バイト数 =32 時間 <1ms TTL=64

192.168.1.1 の ping 統計:
    パケット数: 送信 = 1、受信 = 1、損失 = 0 (0% の損失)、
ラウンド トリップの概算時間 (ミリ秒):
    最小 = 0ms、最大 = 0ms、平均 = 0ms

When run in IPython, sys.stdout.encoding = utf-8, so it fails.

Run in IPython

In [35]: %run ping.py
sys.stdout.encoding = utf-8
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 14: invalid start byte

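Forcing a change with chcp restarts the Command Prompt and, oddly, also changes the font and switches the output to English. See the references for details.
Probably because Python itself is not restarted, sys.stdout.encoding does not follow the change: it is still cp932 after running chcp 65001.
Running chcp 932 in the Command Prompt restores the original state, but the Command Prompt is restarted once more.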

chcp_ping.py

# -*- coding: utf-8 -*-
import subprocess
import sys

def ping():
    ping = subprocess.run(["ping", "192.168.1.1", "-n", "1"], stdout=subprocess.PIPE)
    return ping.stdout

with open("encoding.txt", "w") as f:
    f.write(" -------- before chcp 65001 --------\r\n")
    f.write("sys.stdout.encoding = " + sys.stdout.encoding + "\r\n")
    f.write("sys.getdefaultencoding() = " + sys.getdefaultencoding() + "\r\n")
    ping_str = ping()
    f.write(ping_str.decode('cp932').strip() + "\r\n")
    

    chcp = subprocess.run("chcp.com 65001")
    f.write(" -------- after chcp 65001 --------" + "\r\n")
    f.write("sys.stdout.encoding = " + sys.stdout.encoding + "\r\n")
    f.write("sys.getdefaultencoding() = " + sys.getdefaultencoding() + "\r\n")
    ping_str = ping()
    f.write(ping_str.decode('cp932').strip() + "\r\n")

encoding.txt

 -------- before chcp 65001 --------

sys.stdout.encoding = cp932

sys.getdefaultencoding() = utf-8

192.168.1.1 に ping を送信しています 32 バイトのデータ:

192.168.1.1 からの応答: バイト数 =32 時間 =1ms TTL=64



192.168.1.1 の ping 統計:

    パケット数: 送信 = 1、受信 = 1、損失 = 0 (0% の損失)、

ラウンド トリップの概算時間 (ミリ秒):

    最小 = 1ms、最大 = 1ms、平均 = 1ms

 -------- after chcp 65001 --------

sys.stdout.encoding = cp932

sys.getdefaultencoding() = utf-8

Pinging 192.168.1.1 with 32 bytes of data:

Reply from 192.168.1.1: bytes=32 time<1ms TTL=64



Ping statistics for 192.168.1.1:

    Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),

Approximate round trip times in milli-seconds:

(*1)
Even when the output contains no Japanese, line breaks show up as \r\n.
Running an empty file (empty.vbs) gives the following:

empty.vbs

In [31]: cscript = subprocess.run('cscript empty.vbs', stdout=subprocess.PIPE)

In [32]: print(cscript.stdout)
b'Microsoft (R) Windows Script Host Version 5.812\r\nCopyright (C) Microsoft Corporation. All rights reserved.\r\n\r\n'
```
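
The \r\n in (*1) is simply what the child process wrote: in bytes mode nothing translates line endings. As a minimal sketch (the helper name `to_lines` and the default `cp932` are assumptions for illustration, and the sample bytes are copied from the cscript output above instead of starting a real process), decoding first and then calling `splitlines()` removes the line endings:

```python
# Sample bytes copied from the cscript output above (no process is started here).
raw = (b'Microsoft (R) Windows Script Host Version 5.812\r\n'
       b'Copyright (C) Microsoft Corporation. All rights reserved.\r\n\r\n')


def to_lines(data, encoding="cp932"):
    # Hypothetical helper: decode the captured bytes, then split on any
    # line ending (CRLF, LF or CR) without keeping the line endings.
    return data.decode(encoding).splitlines()


lines = to_lines(raw)
for line in lines:
    print(repr(line))
```

Another route is to pass `universal_newlines=True` to `subprocess.run()`, which returns `str` with line endings normalized to `\n`. In that mode the decoding uses the locale's preferred encoding, so it only helps when that encoding matches what the child process actually emits.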

(*2)
If the output is not captured and printed (stdout=None), there is no problem.

stdout=None

In [5]: subprocess.run(["ping", "192.168.1.1", "-n", "1"])

192.168.1.1 に ping を送信しています 32 バイトのデータ:
192.168.1.1 からの応答: バイト数 =32 時間 =2ms TTL=64

192.168.1.1 の ping 統計:
    パケット数: 送信 = 1、受信 = 1、損失 = 0 (0% の損失)、
ラウンド トリップの概算時間 (ミリ秒):
    最小 = 2ms、最大 = 2ms、平均 = 2ms
Out[5]: CompletedProcess(args=['ping', '192.168.1.1', '-n', '1'], returncode=0)
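
The approaches in this article each fail in a different way: a hard-coded cp932 is brittle, sys.stdout.encoding is wrong under IPython, and chardet can misdetect. One possible middle ground, sketched below with the standard library only, is to try a short ordered list of candidate encodings with strict decoding and take the first that succeeds. The function name `decode_output` and the candidate order are assumptions made for illustration, not something taken from the original notes, and the sample bytes are a fragment of the ping output shown earlier.

```python
import locale


def decode_output(data, candidates=None):
    # Hypothetical helper: return (text, encoding) for the first candidate
    # encoding that decodes the bytes without error.
    if candidates is None:
        # Assumed order: strict UTF-8 first, then the locale's preferred
        # encoding, then cp932 as the usual Japanese Windows code page.
        candidates = ["utf-8", locale.getpreferredencoding(False), "cp932"]
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode with replacement characters instead of raising.
    return data.decode(candidates[-1], errors="replace"), candidates[-1]


# Fragment of the ping output above: "192.168.1.1 に ping" encoded in cp932.
sample = b'192.168.1.1 \x82\xc9 ping'
text, used = decode_output(sample, ["utf-8", "cp932"])
print(used, text)
```

The order matters: strict UTF-8 rejects these bytes (0x82 is not a valid start byte, the same error IPython reported above), so cp932 is reached. Single-byte encodings such as Windows-1252 accept almost any byte sequence and would mask the real encoding, so they belong at the end of the list if they are included at all. When the encoding is known in advance, `subprocess.run()` in Python 3.6 and later also accepts an `encoding` argument that makes it return `str` directly.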

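References

"コマンドプロンプトの文字化けが鬱陶しいのでCHCPラッパー書いた" (I wrote a CHCP wrapper because garbled Command Prompt text was annoying) - Qiita
Code Page Identifiers (Windows) - Developer Network
"サポートするコードページ" (Supported code pages) - Developer Network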
