
[Introducing Python 3, Day 17] Chapter 8: Data Has to Go Somewhere (8.1-8.2.5)

Posted at 2020-01-26

8.1 File Input/Output

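fileobj=open(filename,mode) opens a file.
fileobj is the file object returned by open().
filename is the name of the file.
mode is a string that says what you want to do with the file and what type of file it is.
The first letter of mode is the operation: r reads, w writes (creating the file if it does not exist and overwriting it if it does), and x writes only if the file does not already exist.
The second letter of mode is the file type: t means text and b means binary.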
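
As a rough illustration of the mode letters described above, the following sketch writes, overwrites, reads, and then attempts an exclusive create. The temporary directory and the file name modes_demo.txt are assumptions made for the example, not part of the original notes.

```python
import os
import tempfile

# Work inside a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "modes_demo.txt")  # hypothetical file name

    # "wt": write text; creates the file, or truncates it if it exists.
    with open(path, "wt") as f:
        f.write("first")

    # Opening with "wt" again overwrites the previous contents.
    with open(path, "wt") as f:
        f.write("second")

    # "rt": read text.
    with open(path, "rt") as f:
        content = f.read()

    # "xt": exclusive creation; fails here because the file already exists.
    try:
        open(path, "xt")
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(content)           # second
print(exclusive_failed)  # True
```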

8.1.1 Writing a Text File with write()


>>> poem = """There was a young lady named Bright,
... Whose speed was far faster than light,
... She started one day,
... In a relative way,
... And returned on the previous night."""
>>> len(poem)
151

# write() returns the number of characters written (in text mode).
>>> f=open("relatibity","wt")
>>> f.write(poem)
151
>>> f.close()

# print() can also write to a text file.
# print() inserts a space between arguments and appends a newline at the end.
>>> f=open("relatibity","wt")
>>> print(poem,file=f)
>>> f.close()

# To make print() behave like write(), use sep and end.
# sep: the separator; defaults to a space (" ").
# end: the string appended at the end; defaults to a newline ("\n").
>>> f=open("relatibity","wt")
>>> print(poem,file=f,sep="",end="")
>>> f.close()
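
To see the effect of sep and end without touching the file system, here is a small sketch that prints into in-memory buffers. The io.StringIO buffers and the short stand-in string are assumptions chosen for the example.

```python
import io

poem = "first line\nsecond line"  # short stand-in for the poem

# Default behaviour: a space between arguments and a trailing newline.
default_out = io.StringIO()
print(poem, "extra", file=default_out)

# With sep="" and end="", print() writes exactly what it is given.
plain_out = io.StringIO()
print(poem, "extra", file=plain_out, sep="", end="")

print(repr(default_out.getvalue()))  # 'first line\nsecond line extra\n'
print(repr(plain_out.getvalue()))    # 'first line\nsecond lineextra'
```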

# If the source string is very large, you can split it into chunks and write them one at a time.
>>> f=open("relatibity","wt")
>>> size=len(poem)
>>> offset=0
>>> chunk=100
>>> while True:
...     if offset>size:
...         break
...     f.write(poem[offset:offset+chunk])
...     offset+=chunk
... 
100
51
>>> f.close()
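
The chunked write can also be wrapped in a small helper. This is only a sketch: the function name write_in_chunks and the temporary file are hypothetical, and range() is used instead of a while loop so that an empty slice is never written.

```python
import os
import tempfile

def write_in_chunks(path, text, chunk=100):
    """Write text to path in slices of at most `chunk` characters."""
    written = []
    with open(path, "wt") as f:
        # range() stops before len(text), so every slice is non-empty.
        for offset in range(0, len(text), chunk):
            written.append(f.write(text[offset:offset + chunk]))
    return written

sample = "x" * 151  # stand-in for a 151-character string such as the poem

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "chunks.txt")
    counts = write_in_chunks(path, sample)
    with open(path, "rt") as f:
        round_trip = f.read()

print(counts)                # [100, 51]
print(round_trip == sample)  # True
```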

# Opening with x mode refuses to overwrite an existing file, which protects it from being destroyed.
>>> f=open("relatibity","xt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileExistsError: [Errno 17] File exists: 'relatibity'

# You can also catch this with an exception handler.
>>> try:
...     f=open("relatibity","xt")
...     f.write("stomp stomp stomp")
... except FileExistsError:
...     print("relativity already exists!. That was a close one.")
... 
relativity already exists!. That was a close one.

8.1.2 Reading a Text File with read(), readline(), and readlines()


# You can read the entire file at once.
>>> fin=open("relatibity","rt")
>>> poem=fin.read()
>>> fin.close()
>>> len(poem)
151

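# Passing a character count to read() limits how much data is returned at a time.
# Calling read() again after the whole file has been read returns an empty string (""), which is falsy, so `if not f:` ends the loop.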
>>> poem=""
>>> fin=open("relatibity","rt")
>>> chunk=100
>>> while True:
...     f=fin.read(chunk)
...     if not f:
...         break
...     poem+=f
... 
>>> fin.close()
>>> len(poem)
151

# readline() reads the file one line at a time.
# Once the whole file has been read, it returns an empty string just like read(), which is falsy.
>>> poem=""
>>> fin=open("relatibity","rt")
>>> while True:
...     line=fin.readline()
...     if not line:
...         break
...     poem+=line
... 
>>> fin.close()
>>> len(poem)
151

# The simplest way to read the file is to use the file object as an iterator.
>>> poem=""
>>> fin=open("relatibity","rt")
>>> for line in fin:
...     poem+=line
... 
>>> fin.close()
>>> len(poem)
151

# readlines() reads the file a line at a time and returns a list of one-line strings.
>>> fin=open("relatibity","rt")
>>> lines=fin.readlines()
>>> fin.close()
>>> for line in lines:
...      print(line,end="")
... 
There was a young lady named Bright,
Whose speed was far faster than light,
She started one day,
In a relative way,
And returned on the previous night.>>>
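
The three reading styles above can be compared side by side. The sketch below assumes a small three-line sample and a temporary file, neither of which comes from the original notes.

```python
import os
import tempfile

text = "line one\nline two\nline three"  # sample data; last line has no newline

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "lines.txt")
    with open(path, "wt") as f:
        f.write(text)

    with open(path, "rt") as f:
        whole = f.read()                 # entire file as one string
    with open(path, "rt") as f:
        as_list = f.readlines()          # list of lines, newlines kept
    with open(path, "rt") as f:
        iterated = [line for line in f]  # the file object iterates over lines

print(whole == text)        # True
print(as_list)              # ['line one\n', 'line two\n', 'line three']
print(iterated == as_list)  # True
```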

8.1.3 Writing a Binary File with write()

Adding "b" to the mode string opens the file in binary mode. In that case you read and write bytes instead of strings.


# Generate the 256 byte values from 0 to 255.
>>> bdata=bytes(range(0,256))
>>> len(bdata)
256
>>> f=open("bfile","wb")
>>> f.write(bdata)
256
>>> f.close()

# As with text, you can also write in chunks.
>>> f=open("bfile","wb")
>>> size=len(bdata)
>>> offset=0
>>> chunk=100
>>> while True:
...     if offset>size:
...         break
...     f.write(bdata[offset:offset+chunk])
...     offset+=chunk
... 
100
100
56
>>> f.close()

8.1.4 Reading a Binary File with read()


# Just open the file with "rb".
>>> f=open("bfile","rb")
>>> bdata=f.read()
>>> len(bdata)
256
>>> f.close()
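
A short round trip makes the bytes-versus-str distinction concrete. In this sketch the temporary file name is an assumption; it also shows that a binary-mode file rejects a str argument.

```python
import os
import tempfile

bdata = bytes(range(256))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bfile_demo")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(bdata)
        try:
            f.write("text")   # a str is rejected in binary mode
            str_rejected = False
        except TypeError:
            str_rejected = True
    with open(path, "rb") as f:
        restored = f.read()

print(restored == bdata)  # True
print(str_rejected)       # True
```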

8.1.5 Closing Files Automatically with with

When the context block under with ends, the file is closed automatically.
The form is with expression as variable.


>>> with open("relatibity","wt") as f:
...     f.write(poem)
... 
151
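
One way to check that with really closes the file is to inspect the closed attribute afterwards, including when the block exits through an exception. The file name and the simulated ValueError below are assumptions for the sake of the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "with_demo.txt")  # hypothetical file name

    with open(path, "wt") as f:
        f.write("hello")
    closed_after_block = f.closed   # the block ended normally

    try:
        with open(path, "rt") as g:
            raise ValueError("simulated failure")
    except ValueError:
        pass
    closed_after_error = g.closed   # the block ended with an exception

print(closed_after_block)  # True
print(closed_after_error)  # True
```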

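8.1.6 Changing Position with seek()

tell() returns the current offset from the beginning of the file, in bytes.
seek() moves the file object to another position, using the form f.seek(offset, whence).
The new position is computed by adding offset to a reference point.
The reference point is selected by whence: 0 moves to offset bytes from the beginning, 1 moves offset bytes from the current position, and 2 moves to offset bytes relative to the end.
whence is optional and defaults to 0, that is, the beginning of the file. The same values are also defined as constants in the standard os module.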


>>> f=open("bfile","rb")
>>> f.tell()
0

# Use seek() to move to one byte before the end of the file.
# seek() also returns the offset after the move.
>>> f.seek(255)
255
# The position is now between bytes 254 and 255, so read() returns only the last byte (255).
>>> b=f.read()
>>> len(b)
1
>>> b[0]
255


>>> import os
>>> os.SEEK_SET
0
>>> os.SEEK_CUR
1
>>> os.SEEK_END
2


>>> f=open("bfile","rb")
# Move to 1 byte before the end of the file (offset -1 from the end).
>>> f.seek(-1,2)
255
# tell() returns the offset from the beginning of the file, in bytes.
>>> f.tell()
255
>>> b=f.read()
>>> len(b)
1
>>> b[0]
255

# Move from the beginning of the file to 2 bytes before the end.
# The position is now between bytes 253 and 254.
>>> f=open("bfile","rb")
>>> f.seek(254,0)
254
>>> f.tell()
254
# Move from 2 bytes before the end to 1 byte before the end.
# The position is now between bytes 254 and 255.
>>> f.seek(1,1)
255
>>> f.tell()
255
>>> b=f.read()
>>> len(b)
1
>>> b[0]
255
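
The numeric whence values can be replaced by the named constants from the os module, which tends to read more clearly. The sketch below assumes a temporary 256-byte file like the one used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "seek_demo")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(bytes(range(256)))

    with open(path, "rb") as f:
        pos_end = f.seek(-1, os.SEEK_END)    # 1 byte before the end
        last = f.read()                      # only the final byte remains
        pos_start = f.seek(10, os.SEEK_SET)  # absolute offset 10
        pos_cur = f.seek(5, os.SEEK_CUR)     # 5 bytes forward from offset 10
        byte_at_15 = f.read(1)[0]            # value of the byte at offset 15

print(pos_end, last)               # 255 b'\xff'
print(pos_start, pos_cur)          # 10 15
print(byte_at_15)                  # 15
```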

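8.2 Structured Text Files

Structured data can be expressed in several ways:
A separator, or delimiter: fields are separated by a tab ("\t"), comma (","), or vertical bar ("|"). CSV is an example.
Tags delimited by "<" and ">": XML and HTML are examples.
Punctuation: JSON is an example.
Indentation: YAML is an example.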

8.2.1 CSV

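Delimited files are often used as an exchange format with spreadsheets and databases.
Some files use escape sequences. If the delimiter character can occur inside a field, either the whole field is enclosed in quote characters or the delimiter is preceded by an escape character.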
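
The quoting rule can be observed with the csv module's default settings. This is a sketch with made-up sample rows written to an in-memory buffer; a field containing a comma or a quote character is wrapped in quotes, and embedded quotes are doubled.

```python
import csv
import io

# Sample rows (invented for the example): one field contains a comma,
# another contains double quotes.
rows = [["name", "note"], ["Goldfinger, Auric", 'says "hello"']]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
encoded = buffer.getvalue()
print(encoded)
# name,note
# "Goldfinger, Auric","says ""hello"""

# Reading the text back restores the original fields.
decoded = list(csv.reader(io.StringIO(encoded)))
print(decoded == rows)  # True
```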


>>> import csv
>>> villains=[
...     ["Doctor","No"],
...     ["R","K"],
...     ["Mister","Big"],
...     ["Auric","Goldfinger"],
...     ["E","B"],
...     ]

# Write the rows with writer(); this creates a CSV file named villains.
>>> with open("villains","wt") as fout:
...     csvout=csv.writer(fout)
...     csvout.writerows(villains)

Result


Doctor,No
R,K
Mister,Big
Auric,Goldfinger
E,B


# Read the file back with reader().
>>> import csv
>>> with open("villains","rt") as fin:
...     cin=csv.reader(fin)
...     villains=[row for row in cin]
... 
>>> print(villains)
[['Doctor', 'No'], ['R', 'K'], ['Mister', 'Big'], ['Auric', 'Goldfinger'], ['E', 'B']]


# Use DictReader() to specify the column names.
>>> import csv
>>> with open("villains","rt") as fin:
...     cin=csv.DictReader(fin,fieldnames=["first","last"])
...     villains=[row for row in cin]
... 
>>> print(villains)
[{'first': 'Doctor', 'last': 'No'}, {'first': 'R', 'last': 'K'}, {'first': 'Mister', 'last': 'Big'}, {'first': 'Auric', 'last': 'Goldfinger'}, {'first': 'E', 'last': 'B'}]


>>> import csv
>>> villains= [
...     {"first":"Doctor","last":"No"},
...     {"first":"R","last":"K"},
...     {"first":"Mister","last":"Big"},
...     {"first":"Auric","last":"Goldfinger"},
...     {"first":"E","last":"B"},
...     ]
# writeheader() also writes the column names as the first line of the CSV file.
>>> with open("villains","wt") as fout:
...     cout=csv.DictWriter(fout,["first","last"])
...     cout.writeheader()
...     cout.writerows(villains)
...

Result


first,last
Doctor,No
R,K
Mister,Big
Auric,Goldfinger
E,B


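# Read the data back from the file.
# Omitting the fieldnames argument in the DictReader() call tells it to use the values in the first line of the file (first,last) as the column labels, that is, the dictionary keys.
# In Python 3.6 and 3.7 each row is an OrderedDict, as shown here; since Python 3.8 it is a plain dict.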
>>> import csv
>>> with open("villains","rt") as fin:
...     cin=csv.DictReader(fin)
...     villains=[row for row in cin]
... 
>>> print(villains)
[OrderedDict([('first', 'Doctor'), ('last', 'No')]), OrderedDict([('first', 'R'), ('last', 'K')]), OrderedDict([('first', 'Mister'), ('last', 'Big')]), OrderedDict([('first', 'Auric'), ('last', 'Goldfinger')]), OrderedDict([('first', 'E'), ('last', 'B')])]
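
A compact round trip with DictWriter and DictReader might look like the sketch below. It opens the file with newline="", which the csv module documentation recommends so that line endings are handled by the module itself; the temporary path and the two-row sample are assumptions.

```python
import csv
import os
import tempfile

villains = [
    {"first": "Doctor", "last": "No"},
    {"first": "Auric", "last": "Goldfinger"},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "villains.csv")  # hypothetical file name

    # newline="" lets the csv module control the line endings.
    with open(path, "wt", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=["first", "last"])
        writer.writeheader()
        writer.writerows(villains)

    with open(path, "rt", newline="") as fin:
        # dict(row) normalizes the row type across Python versions.
        restored = [dict(row) for row in csv.DictReader(fin)]

print(restored == villains)  # True
```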

8.2.2 XML

ElementTree is a simple way to read XML.

menu.xml


<?xml version="1.0"?>
<menu>
    <!-- A start tag can carry optional attributes. -->
    <breakfast hours="7-11">
        <item price="$6.00">breakfast burritos</item>
        <item price="$4.00">pancakes</item>
    </breakfast>
    <lunch hours="11-3">
        <item price="$5.00">hamburger</item>
    </lunch>
    <dinner hours="3-10">
        <item price="$8.00">spaghetti</item>
    </dinner>
</menu>


>>> import xml.etree.ElementTree as et
>>> tree=et.ElementTree(file="menu.xml")
>>> root=tree.getroot()
>>> root.tag
'menu'
# tag is the tag string, and attrib is a dictionary of its attributes.
>>> for child in root:
...     print("tag:",child.tag,"attributes:",child.attrib)
...     for grandchild in child:
...         print("\ttag:",grandchild.tag,"attributes:",grandchild.attrib)
... 
tag: breakfast attributes: {'hours': '7-11'}
	tag: item attributes: {'price': '$6.00'}
	tag: item attributes: {'price': '$4.00'}
tag: lunch attributes: {'hours': '11-3'}
	tag: item attributes: {'price': '$5.00'}
tag: dinner attributes: {'hours': '3-10'}
	tag: item attributes: {'price': '$8.00'}
# Number of sections in menu
>>> len(root)
3
# Number of items in breakfast
>>> len(root[0])
2
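
Besides looping over children, ElementTree can look elements up by tag name. The sketch below parses a shortened copy of the menu from a string rather than from menu.xml (an assumption made so the example is self-contained) and uses find(), get(), and iter().

```python
import xml.etree.ElementTree as et

# Shortened copy of the menu, embedded as a string for the example.
xml_text = """<?xml version="1.0"?>
<menu>
    <breakfast hours="7-11">
        <item price="$6.00">breakfast burritos</item>
        <item price="$4.00">pancakes</item>
    </breakfast>
    <lunch hours="11-3">
        <item price="$5.00">hamburger</item>
    </lunch>
</menu>"""

root = et.fromstring(xml_text)

# find() returns the first direct child with the given tag.
breakfast = root.find("breakfast")
hours = breakfast.get("hours")

# iter() walks the whole tree; text is the content between the tags.
prices = {item.text: item.get("price") for item in root.iter("item")}

print(hours)   # 7-11
print(prices)  # {'breakfast burritos': '$6.00', 'pancakes': '$4.00', 'hamburger': '$5.00'}
```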

8.2.3 JSON

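JSON has grown beyond its JavaScript origins into a very widely used data interchange format.
The JSON format is a subset of JavaScript, and it is often used from Python as well.
The json module encodes (dumps) Python data to JSON strings and decodes (loads) JSON strings back into Python data.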


# Build a data structure.
>>> menu=\
... {
... "breakfast":{
...     "hours":"7-11",
...     "items":{
...             "breakfast burritos":"$6.00",
...             "pancakes":"$4.00"
...             }
...         },
... "lunch":{
...         "hours":"11-3",
...         "items":{
...             "hamburger":"$5.00"
...                 }
...         },
... "dinner":{
...     "hours":"3-10",
...     "items":{
...             "spaghetti":"$8.00"
...             }
...     }
... }

# Use dumps() to encode this data structure (menu) as a JSON string (menu_json).
>>> import json
>>> menu_json=json.dumps(menu)
>>> menu_json
'{"breakfast": {"hours": "7-11", "items": {"breakfast burritos": "$6.00", "pancakes": "$4.00"}}, "lunch": {"hours": "11-3", "items": {"hamburger": "$5.00"}}, "dinner": {"hours": "3-10", "items": {"spaghetti": "$8.00"}}}'

# Use loads() to turn the JSON string menu_json back into a Python data structure, menu2.
>>> menu2=json.loads(menu_json)
>>> menu2
{'breakfast': {'hours': '7-11', 'items': {'breakfast burritos': '$6.00', 'pancakes': '$4.00'}}, 'lunch': {'hours': '11-3', 'items': {'hamburger': '$5.00'}}, 'dinner': {'hours': '3-10', 'items': {'spaghetti': '$8.00'}}}

# Trying to encode some objects, such as datetime, raises the exception below.
# This is because the JSON standard does not define date or time types.
>>> import datetime
>>> now=datetime.datetime.utcnow()
>>> now
datetime.datetime(2020, 1, 23, 1, 59, 51, 106364)
>>> json.dumps(now)
# ... (omitted)
TypeError: Object of type datetime is not JSON serializable

# Convert the datetime to a string or to Unix time instead.
>>> now_str=str(now)
>>> json.dumps(now_str)
'"2020-01-23 01:59:51.106364"'
>>> from time import mktime
>>> now_epoch=int(mktime(now.timetuple()))
>>> json.dumps(now_epoch)
'1579712391'

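# When datetime values are mixed into data that otherwise converts normally, converting each one by hand is tedious.
# Instead, create a class that inherits from json.JSONEncoder
# and override its default method.
# isinstance() checks whether obj is an object of the datetime.datetime class.
>>> class DTEncoder(json.JSONEncoder):
...     def default(self,obj):
...         # isinstance() checks the type of obj.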
...         if isinstance(obj,datetime.datetime):
...             return int(mktime(obj.timetuple()))
...         return json.JSONEncoder.default(self,obj)
... 
# now was defined as datetime.datetime.utcnow(), so the isinstance() check is True and now is converted.
>>> json.dumps(now,cls=DTEncoder)
'1579712391'

>>> type(now)
<class 'datetime.datetime'>
>>> isinstance(now,datetime.datetime)
True
>>> type(234)
<class 'int'>
>>> type("hey")
<class 'str'>
>>> isinstance("hey",str)
True
>>> isinstance(234,int)
True
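
An alternative to subclassing JSONEncoder is to pass a function through the default parameter of json.dumps(). The sketch below is one possible approach, not the method used in these notes: the helper name encode_extra and the sample record are hypothetical, and datetimes are stored as ISO 8601 strings so they can be parsed back with fromisoformat() (Python 3.7+).

```python
import datetime
import json

def encode_extra(obj):
    """Called by json.dumps() for objects it cannot serialize itself."""
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Sample record (invented for the example) mixing plain values and a datetime.
record = {"event": "lunch", "at": datetime.datetime(2020, 1, 23, 1, 59, 51)}

encoded = json.dumps(record, default=encode_extra)
print(encoded)  # {"event": "lunch", "at": "2020-01-23T01:59:51"}

# Decoding yields a string; convert it back to a datetime explicitly.
decoded = json.loads(encoded)
restored_at = datetime.datetime.fromisoformat(decoded["at"])
print(restored_at == record["at"])  # True
```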

8.2.4 YAML

Like JSON, YAML has keys and values, but it handles more data types than JSON, including dates and times.
To process YAML you need to install a library called PyYAML.

mcintyre.yaml


name:
  first: James
  last: McIntyre
dates:
  birth: 1828-05-25
  death: 1906-03-31
details:
  bearded: true
  themes: [cheese,Canada]
books:
  url: http://www.gutenberg.org/files/36068/36068-h/36068-h.htm
poems:
  - title: "Motto" # the space after the colon is required; without it an error occurs
    text: |
        Politeness, perseverance and pluck,
        To their possessor will bring good luck.
  - title: "Canadian Charms" # the space after the colon is required; without it an error occurs
    text: |
        Here industry is not in vain,
        For we have bounteous crops of grain,
        And you behold on every field
        Of grass and roots abundant yield,
        But after all the greatest charm
        Is the snug home upon the farm,
        And stone walls now keep cattle warm.


>>> import yaml
>>> with open("mcintyre.yaml","rt") as fin:
...     text=fin.read()
... 
# safe_load() parses the text without constructing arbitrary Python objects.
>>> data=yaml.safe_load(text)
>>> data["details"]
{'bearded': True, 'themes': ['cheese', 'Canada']}
>>> len(data["poems"])
2

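8.2.5 Serializing with pickle

Taking a Python data hierarchy and converting it to a string representation is called serialization. Rebuilding the data from that string representation is called deserialization.
Between serialization and deserialization, the string representation of the object can be saved to a file or other data store, or sent over a network to a remote machine.
Python provides the pickle module, which saves and restores almost any object in a special binary format.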


>>> import pickle
>>> import datetime
>>> now1=datetime.datetime.utcnow()
>>> pickled=pickle.dumps(now1)
>>> now2=pickle.loads(pickled)
>>> now1
datetime.datetime(2020, 1, 23, 5, 30, 56, 648873)
>>> now2
datetime.datetime(2020, 1, 23, 5, 30, 56, 648873)

# pickle also handles classes and objects defined in your own program.
>>> import pickle
>>> class Tiny():
...     def __str__(self):
...         return "tiny"
... 
>>> obj1=Tiny()
>>> obj1
<__main__.Tiny object at 0x10af86910>
>>> str(obj1)
'tiny'
# pickled is the binary sequence produced by serializing obj1 with pickle.
# dumps() serializes to a bytes object; dump() serializes to a file.
>>> pickled=pickle.dumps(obj1)
>>> pickled
b'\x80\x03c__main__\nTiny\nq\x00)\x81q\x01.'
# Convert it back into obj2, making a copy of obj1.
# loads() deserializes an object from a bytes object; load() deserializes from a file.
>>> obj2=pickle.loads(pickled)
>>> obj2
<__main__.Tiny object at 0x10b21cdd0>
>>> str(obj2)
'tiny'
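
The file-based counterparts of dumps() and loads() are dump() and load(). The sketch below pickles a small dictionary (sample data invented for the example) to a temporary file and reads it back. Note that unpickling can execute arbitrary code, so load() should only be used on data from a trusted source.

```python
import datetime
import os
import pickle
import tempfile

# Sample data containing types that JSON cannot represent directly.
data = {
    "name": "tiny",
    "created": datetime.datetime(2020, 1, 23, 5, 30, 56),
    "tags": {"a", "b"},
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.pickle")  # hypothetical file name

    # pickle works with bytes, so the file is opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)  # True
print(restored is data)  # False: the restored object is a copy
```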

Thoughts

At last, RDBMS is next.

References

Bill Lubanovic, 入門 Python3 (Introducing Python), O'Reilly Japan.

Python Tutorial, Python 3.8.1 documentation, "7. Input and Output"
https://docs.python.org/ja/3/tutorial/inputoutput.html#old-string-formatting
