
[Introduction to Python 3, Day 16] Chapter 7: Strings (7.1.3-7.3)

Posted at 2020-01-26

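7.1.3 Matching with regular expressions

Regular expression support is provided by the standard re module; import it before use.
You define a pattern (the string to look for) and a source (the string to match it against).
match() checks whether the source begins with the pattern.
search() returns the first match, if any.
findall() returns a list of all non-overlapping matches, if any.
split() splits the source at each match of the pattern and returns a list of the substrings.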

7.1.3.1 Exact matches with match()


>>> import re
>>> source = "Young Frankenstein"
# "You" is the pattern and source is the source.
# match() returns a match object, or None if there is no match.
>>> m=re.match("You",source)
>>> if m:
...     print(m.group())
... 
You

# Putting ^ at the start of the pattern means the same thing.
>>> m=re.match("^You",source)
>>> if m:
...     print(m.group())
... 
You

# match() succeeds only if the pattern is at the beginning of the source.
>>> m=re.match("Frank",source)
>>> if m:
...     print(m.group())
... 

# search() matches the pattern wherever it appears.
>>> m=re.search("Frank",source)
>>> if m:
...     print(m.group())
... 
Frank

# . means any single character.
# * means zero or more repetitions of the preceding item.
# Together, .* means any number of any characters.
>>> m=re.match(".*Frank",source)
>>> if m:
...     print(m.group())
... 
Young Frank

7.1.3.2 First match with search()


>>> m=re.search("Frank",source)
>>> if m:
...     print(m.group())
... 
Frank
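
The match object carries more than the matched text. As a small sketch (the variable name no_match is only for illustration), its start(), end(), and span() methods report where the match sits in the source, and a failed match is simply None:

```python
import re

source = "Young Frankenstein"

m = re.search("Frank", source)
if m:
    # span() gives the (start, end) indexes of the match within source.
    print(m.group(), m.start(), m.end(), m.span())

# Both match() and search() return None when nothing matches,
# which is why the examples guard with `if m:`.
no_match = re.match("Frank", source)
print(no_match)
```

Slicing source[m.start():m.end()] gives back the same text as m.group().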

7.1.3.3 Finding all matches with findall()


# We want to know how many "n" characters the string contains.
# findall() returns the matches as a list.
>>> m=re.findall("n",source)
>>> m
['n', 'n', 'n', 'n']
>>> print("Found",len(m),"matches")
Found 4 matches
# "n." requires a character after the "n", so the final "n" is not matched.
>>> m=re.findall("n.",source)
>>> m
['ng', 'nk', 'ns']
# . means any single character, and ? makes it optional.
>>> m=re.findall("n.?",source)
>>> m
['ng', 'nk', 'ns', 'n']

7.1.3.4 Splitting at matches with split()

split() splits the source at each match of the pattern and builds a list of the substrings.


>>> m=re.split("n",source)
>>> m
['You', 'g Fra', 'ke', 'stei', '']
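
The trailing '' appears because the source ends with a match. A minimal sketch of two common adjustments, assuming the same source string (the names first_split and non_empty are only for illustration):

```python
import re

source = "Young Frankenstein"

# maxsplit=1 stops after the first match and leaves the rest intact.
first_split = re.split("n", source, maxsplit=1)
print(first_split)  # ['You', 'g Frankenstein']

# Filtering out empty strings drops the trailing '' from the full split.
non_empty = [part for part in re.split("n", source) if part]
print(non_empty)  # ['You', 'g Fra', 'ke', 'stei']
```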

7.1.3.5 Replacing matches with sub()

sub() replaces every match; the text to be replaced is specified as a pattern rather than a literal string.


>>> m=re.sub("n","?",source)
>>> m
'You?g Fra?ke?stei?'
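
sub() also accepts a count argument and a function as the replacement, and subn() reports how many replacements were made. A short sketch with the same source (the result names are illustrative):

```python
import re

source = "Young Frankenstein"

# count limits how many matches are replaced.
limited = re.sub("n", "?", source, count=2)
print(limited)  # You?g Fra?kenstein

# subn() returns the new string together with the number of replacements.
replaced, n_replaced = re.subn("n", "?", source)
print(replaced, n_replaced)  # You?g Fra?ke?stei? 4

# The replacement can be a function that receives each match object.
upper = re.sub("n", lambda m: m.group().upper(), source)
print(upper)  # YouNg FraNkeNsteiN
```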

7.1.3.6 Special characters in patterns


# The string module predefines string constants that are handy for testing.
>>> import string
>>> printable=string.printable
>>> len(printable)
100
>>> printable[0:50]
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN'
>>> printable[50:]
'OPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

# Which characters in printable are digits, letters, or underscores?
>>> re.findall(r"\w",printable)
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_']

# Which characters in printable are whitespace?
>>> re.findall(r"\s",printable)
[' ', '\t', '\n', '\r', '\x0b', '\x0c']

# Letters with accents also match.
>>> x="abc" + "-/*" + "\u00ea" + "\u0115"
>>> re.findall(r"\w",x)
['a', 'b', 'c', 'ê', 'ĕ']
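
Accented letters match because str patterns are Unicode-aware by default. As a sketch, passing the re.ASCII flag restricts \w to ASCII letters, digits, and the underscore (the result names are illustrative):

```python
import re

x = "abc" + "-/*" + "\u00ea" + "\u0115"

# Default behaviour for str patterns: \w follows Unicode rules.
unicode_words = re.findall(r"\w", x)
print(unicode_words)  # ['a', 'b', 'c', 'ê', 'ĕ']

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w", x, re.ASCII)
print(ascii_words)  # ['a', 'b', 'c']
```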

7.1.3.7 Patterns: metacharacters


>>> source="""I wish I may,I wish I might
... Have a dish of fish tonight."""

# Find wish anywhere.
>>> re.findall("wish",source)
['wish', 'wish']

# Find wish or fish anywhere.
>>> re.findall("wish|fish",source)
['wish', 'wish', 'fish']

# ^ means the beginning of the source.
>>> re.findall("^wish",source)
[]
>>> re.findall("^I wish",source)
['I wish']
# $ means the end of the source.
>>> re.findall("fish$",source)
[]
>>> re.findall("fish tonight$",source)
[]
# .$ matches any single character at the end.
>>> re.findall("fish tonight.$",source)
['fish tonight.']
# To match a literal dot precisely, put a \ before the dot.
>>> re.findall(r"fish tonight\.$",source)
['fish tonight.']
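
By default ^ and $ anchor to the whole source, not to each line of it. A minimal sketch of the re.MULTILINE flag, which makes them match at every line start and end (the result names are illustrative):

```python
import re

source = """I wish I may,I wish I might
Have a dish of fish tonight."""

# Without the flag, ^ matches only at the very start of the source.
default_start = re.findall("^Have", source)
print(default_start)  # []

# With re.MULTILINE, ^ also matches right after each newline.
multiline_start = re.findall("^Have", source, re.MULTILINE)
print(multiline_start)  # ['Have']

# Likewise, $ then matches just before each newline.
multiline_end = re.findall("might$", source, re.MULTILINE)
print(multiline_end)  # ['might']
```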

# Find w or f followed by ish.
>>> re.findall("[wf]ish",source)
['wish', 'wish', 'fish']

# Find runs of one or more of w, s, and h.
>>> re.findall("[wsh]+",source)
['w', 'sh', 'w', 'sh', 'h', 'sh', 'sh', 'h']

# Find ght followed by a non-word character.
>>> re.findall(r"ght\W",source)
['ght\n', 'ght.']

# Find I and a space followed by wish.
# (?=wish) is a lookahead: wish must come next, but it is not included in the match.
>>> re.findall("I (?=wish)",source)
['I ', 'I ']

# Find wish preceded by I.
# (?<=I) is a lookbehind: I must come just before, but it is not included in the match.
>>> re.findall("(?<=I) wish",source)
[' wish', ' wish']

# \b means backspace in a Python string, but a word boundary in a regular expression.
# Prefix the pattern string with r to make it a raw string, which disables Python's escape processing.
>>> re.findall("\bfish",source)
[]
>>> re.findall(r"\bfish",source)
['fish']
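
The difference between the two patterns can be seen by inspecting the strings themselves. A small sketch (the names plain and raw are illustrative):

```python
# In a normal string literal, \b is a single backspace character (0x08).
plain = "\bfish"
# In a raw string, the backslash and the b stay as two separate characters.
raw = r"\bfish"

print(len(plain), repr(plain))  # 5 '\x08fish'
print(len(raw), repr(raw))      # 6 '\\bfish'
```

Only the raw version hands the two characters \b to the regular expression engine, which then reads them as a word boundary.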

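7.1.3.8 Patterns: specifying match output

When you use match() or search(), the entire match is available from the result object m as m.group().
If you enclose parts of the pattern in (), each part is saved in its own group, and m.groups() returns those groups as a tuple.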


# Prefix the pattern with r so that it is a raw string.

>>> m=re.search(r"(.dish\b).*(\bfish)",source)
>>> m.group()
' dish of fish'
# m.groups() returns the groups as a tuple.
>>> m.groups()
(' dish', 'fish')

# With the form (?P<name>expr), the part that matches expr is saved in a group called name.
>>> m = re.search(r"(?P<DISH>. dish\b).*(?P<FISH>\bfish)",source)
>>> m.group()
'a dish of fish'
>>> m.group("DISH")
'a dish'
>>> m.group("FISH")
'fish'
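
Named groups can also be read all at once. As a sketch with the same pattern, groupdict() returns them as a dictionary, and the groups stay reachable by position:

```python
import re

source = """I wish I may,I wish I might
Have a dish of fish tonight."""

m = re.search(r"(?P<DISH>. dish\b).*(?P<FISH>\bfish)", source)

# groupdict() maps each group name to the text it captured.
named = m.groupdict()
print(named)  # {'DISH': 'a dish', 'FISH': 'fish'}

# Named groups are still numbered, starting from 1.
print(m.group(1), m.group(2))
```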

7.2 Binary data

Endianness: the order in which the bytes of a value that takes two or more bytes are stored in memory.
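
A minimal sketch of what byte order means in practice, using the built-in int.to_bytes() and int.from_bytes() (the value 154 and the variable names are chosen only for illustration):

```python
value = 154  # 0x9a

# Big-endian: the most significant byte comes first.
big = value.to_bytes(4, byteorder="big")
print(big)     # b'\x00\x00\x00\x9a'

# Little-endian: the least significant byte comes first.
little = value.to_bytes(4, byteorder="little")
print(little)  # b'\x9a\x00\x00\x00'

# Reading the bytes back with the matching byte order recovers the value.
print(int.from_bytes(big, byteorder="big"))        # 154
print(int.from_bytes(little, byteorder="little"))  # 154
```

Reading bytes with the wrong byte order gives a different number, which is why binary formats state their byte order explicitly.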

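7.2.1 Bytes and byte arrays

bytes is immutable, like a tuple of bytes.
bytearray is mutable, like a list of bytes.
A bytes value is written as a b, then an opening quote character, then the contents, then the matching closing quote. Bytes that correspond to printable ASCII characters, such as letters, are shown as those characters; other bytes are shown as escapes such as \x01.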


>>> b=[1,2,3,255]
>>> the_bytes=bytes(b)
>>> the_bytes
b'\x01\x02\x03\xff'
>>> the_byte_array=bytearray(b)
>>> the_byte_array
bytearray(b'\x01\x02\x03\xff')

# A bytes value cannot be modified (it is immutable).
>>> the_bytes[1]=123
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment

# A bytearray value can be modified (it is mutable).
>>> the_byte_array=bytearray(b)
>>> the_byte_array
bytearray(b'\x01\x02\x03\xff')
>>> the_byte_array[1]=127
>>> the_byte_array
bytearray(b'\x01\x7f\x03\xff')
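
A few more properties follow from bytes being a sequence of small integers. A short sketch, assuming the same starting list (the variable names are illustrative):

```python
the_bytes = bytes([1, 2, 3, 255])

# Indexing returns an int; slicing returns another bytes object.
last_byte = the_bytes[3]
middle = the_bytes[1:3]
print(last_byte, middle)  # 255 b'\x02\x03'

# A bytearray supports list-like mutation such as append().
the_byte_array = bytearray(the_bytes)
the_byte_array.append(0)
print(the_byte_array)  # bytearray(b'\x01\x02\x03\xff\x00')

# Every element must fit in one byte (0 to 255).
try:
    bytes([256])
except ValueError as err:
    print("ValueError:", err)
```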

# Create objects holding the 256 values from 0 to 255.
>>> the_bytes=bytes(range(0,256))
>>> the_byte_array=bytearray(range(0,256))
# Python displays non-printable bytes as \xNN escapes and printable bytes as the corresponding ASCII characters.
>>> the_bytes
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

7.2.2 Converting binary data with struct

The struct module converts binary data to and from Python data structures.
> means that integers are stored in big-endian format.
Each L specifies a 4-byte unsigned long integer.


>>> import struct
>>> png_header=b"\x89PNG\r\n\x1a\n"
>>> data=b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR" + \
...     b"\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0"
# width is stored in bytes 16 through 19, and height in bytes 20 through 23.
# >LL is the format string that tells unpack() how to interpret the input byte sequence and assemble it into Python data types.
>>> if data[:8]==png_header:
...     width,height=struct.unpack(">LL",data[16:24])
...     print("Valid PNG,width",width,"height",height)
... else:
...     print("Not a valid PNG")
... 
Valid PNG,width 154 height 141

# Each 4-byte value can be checked directly.
>>> data[16:20]
b'\x00\x00\x00\x9a'
>>> data[20:24]
b'\x00\x00\x00\x8d'

# In a big-endian integer, the most significant byte is on the left.
# A little-endian integer is stored starting from the least significant byte.
>>> 0x9a
154
>>> 0x8d
141
>>> 0x0
0 

# To convert Python data to bytes, use struct's pack() function.
>>> import struct
>>> struct.pack(">L",154)
b'\x00\x00\x00\x9a'
>>> struct.pack(">L",141)
b'\x00\x00\x00\x8d'

# Type specifiers follow the endianness character. Any specifier may be preceded by a number, which is a repeat count.
## "LL" is equivalent to "2L".
>>> struct.unpack(">2L",data[16:24])
(154, 141)

# The x specifier means skip one byte.
# Skip the first 16 bytes, read two 4-byte unsigned long integers, and skip the last 6 bytes.
>>> struct.unpack(">16x2L6x",data)
(154, 141)
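
The byte counts in a format string can be checked with struct.calcsize(), and struct.unpack_from() reads at an offset without slicing or skip bytes. A sketch using the same data (the names total_size and dimensions are illustrative):

```python
import struct

data = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR" + \
    b"\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0"

# calcsize() reports how many bytes a format string covers:
# 16 skipped + 2 * 4 read + 6 skipped = 30, the full length of data.
total_size = struct.calcsize(">16x2L6x")
print(total_size, len(data))  # 30 30

# unpack_from() starts reading at the given offset.
dimensions = struct.unpack_from(">2L", data, offset=16)
print(dimensions)  # (154, 141)
```

unpack() requires the buffer length to equal the format size exactly, which is why the x padding or a slice is needed there.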

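7.2.3 Other binary data tools

This section simply notes that third-party packages for binary data exist.

7.2.4 Converting bytes and strings with binascii

The standard binascii module has functions for converting between binary data and various string representations.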


>>> import binascii
# Define the 8-byte PNG header; the bytes literal mixes \xNN escapes and ASCII characters.
>>> png_header=b"\x89PNG\r\n\x1a\n"
# Convert to hexadecimal.
>>> print(binascii.hexlify(png_header))
b'89504e470d0a1a0a'
# The reverse direction also works.
>>> print(binascii.unhexlify(b'89504e470d0a1a0a'))
b'\x89PNG\r\n\x1a\n'
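
The same conversion is also available as built-in methods on bytes. A sketch for comparison; note that bytes.hex() returns a str, whereas binascii.hexlify() returns bytes (the names hex_text and round_trip are illustrative):

```python
import binascii

png_header = b"\x89PNG\r\n\x1a\n"

# bytes.hex() produces a str of hexadecimal digits.
hex_text = png_header.hex()
print(hex_text)  # 89504e470d0a1a0a

# bytes.fromhex() converts such a str back to bytes.
round_trip = bytes.fromhex(hex_text)
print(round_trip == png_header)  # True

# binascii.hexlify() gives the same digits, but as a bytes object.
print(binascii.hexlify(png_header) == hex_text.encode("ascii"))  # True
```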

7.3 Review exercises

7-1 Create a Unicode string called mystery and assign it the value "\U0001f4a9". Print mystery, then look up its Unicode name.


>>> import unicodedata
>>> mystery="\U0001f4a9"
>>> mystery
'💩'
>>> unicodedata.name(mystery)
'PILE OF POO'

7-2 Encode mystery with UTF-8 into a bytes variable called pop_bytes, then print pop_bytes.


>>> pop_bytes=mystery.encode("utf-8")
>>> pop_bytes
b'\xf0\x9f\x92\xa9'

7-3 Decode pop_bytes with UTF-8 into a string variable called pop_string, then print pop_string. Is pop_string equal to mystery?


>>> pop_string=pop_bytes.decode("utf-8")
>>> pop_string
'💩'
>>> pop_string==mystery
True
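
The string and its encoded form have different lengths: one character versus four bytes. A small sketch of that relationship (the name byte_values is illustrative):

```python
mystery = "\U0001f4a9"
pop_bytes = mystery.encode("utf-8")

# len() counts characters for str and bytes for bytes.
print(len(mystery), len(pop_bytes))  # 1 4

# The four UTF-8 bytes as integers.
byte_values = list(pop_bytes)
print(byte_values)  # [240, 159, 146, 169]

# Decoding with the same encoding restores the original string.
pop_string = pop_bytes.decode("utf-8")
print(pop_string == mystery)  # True
```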

7-4 Using old-style formatting, print the following poem, inserting "roast beef", "ham", "head", and "claim" into the placeholders.


>>> poem="""My kitty cat likes %s,
... My kitty cat likes %s,
... My kitty cat fell on his %s,
... And now thinks he is a %s."""
# Pass the values as a tuple; they are inserted in order.
>>> args=("roast beef","ham","head","claim")
>>> print(poem % args)
My kitty cat likes roast beef,
My kitty cat likes ham,
My kitty cat fell on his head,
And now thinks he is a claim.

7-5 We want to write a form letter using new-style formatting. Save the following string in a variable named letters.


>>> letters="""
... Dear {salutation} {name},
... Thank you for your letter. We are sorry that our {product} {verbed} in your
... {room}.Please note that it should never be used in a {room},especially near
... any {animals}.
... 
... Send us your receipt and {amount} for shipping and handling.We will send you
... another {product} that,in our tests, is {percent}% less likely to have {verbed}.
... 
... Thank you for your support.
... 
... Sincerely,
... {spokesman}
... {job_title}
... 
... 
... """

7-6 Create a dictionary called dictionary with values for the string keys "salutation", "name", "product", "verbed", "room", "animals", "amount", "percent", "spokesman", and "job_title". Then print letters using the values from dictionary.


>>> dictionary={
... "salutation":"A",
... "name":"B",
... "product":"C",
... "verbed":"D",
... "room":"E",
... "animals":"F",
... "amount":"G",
... "percent":"H",
... "spokesman":"I",
... "job_title":"K"
... }

>>> print(letters.format(**dictionary))

Dear A B,
Thank you for your letter. We are sorry that our C D in your
E.Please note that it should never be used in a E,especially near
any F.

Send us your receipt and G for shipping and handling.We will send you
another C that,in our tests, is H% less likely to have D.

Thank you for your support.

Sincerely,
I
K

7-7 Create the following text and name it mammoth.


>>> mammoth="""We have seen thee, queen of cheese,
...     Lying quietly at your ease,
...     Gently fanned by evening breeze,
...     Thy fair form no flies dare seize.
... 
...     All gaily dressed soon you'll go
...     To the great Provincial show,
...     To be admired by many a beau
...     In the city of Toronto.
... 
...     Cows numerous as a swarm of bees,
...     Or as the leaves upon the trees,
...     It did require to make thee please,
...     And stand unrivalled, queen of cheese.
... 
...     May you not receive a scar as
...     We have heard that Mr. Harris
...     Intends to send you off as far as
...     The great world's show at Paris.
... 
...     Of the youth beware of these,
...     For some of them might rudely squeeze
...     And bite your cheek, then songs or glees
...     We could not sing, oh! queen of cheese.
... 
...     We'rt thou suspended from balloon,
...     You'd cast a shade even at noon,
...     Folks would think it was the moon
...     About to fall and crush them soon.
... """

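7-8 Import the re module. Then use re.findall() to print all the words that begin with c.


>>> import re
# The r prefix makes the pattern a raw string, so Python passes the backslashes through to the regular expression engine.
# \b anchors the match at a boundary between a word and a non-word character.
# \w means any word character.
# * means zero or more of the preceding item.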
>>> pat=r"\bc\w*"
>>> re.findall(pat,mammoth)
['cheese', 'city', 'cheese', 'cheek', 'could', 'cheese', 'cast', 'crush']

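7-9 Find all the four-letter words that begin with c.


# prev{m} means m consecutive occurrences of prev.
# The final \b is required. Without it, the first four letters of every word that begins with c and has at least four letters are returned.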
>>> pat2=r"\bc\w{3}\b"
>>> re.findall(pat2,mammoth)
['city', 'cast']

7-10 Find all the words that end with r.


>>> pat3=r"\b\w*r\b"
>>> re.findall(pat3,mammoth)
['your', 'fair', 'Or', 'scar', 'Mr', 'far', 'For', 'your', 'or']

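7-11 Find all the words that contain three consecutive vowels.


# Start at a word boundary, then zero or more word characters, then three vowels, then one non-vowel, then word characters up to the end of the word.
# [^abc] means any character other than a, b, or c.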
>>> import re
>>> pat4=r"\b\w*[aeiou]{3}[^aeiou]\w*\b"
>>> re.findall(pat4,mammoth)
['queen', 'quietly', 'queen', 'squeeze', 'queen']

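# Final version
# Add \s, which matches whitespace including \n, to the excluded characters.
# Also allow zero or more consecutive non-vowels, so that a word ending in the three vowels (beau) matches.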
>>> pat4=r"\b\w*[aeiou]{3}[^aeiou\s]*\w*\b"
>>> re.findall(pat4,mammoth)
['queen', 'quietly', 'beau', 'queen', 'squeeze', 'queen']
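
As an alternative sketch, not the pattern used above, the non-vowel class can be dropped entirely: \w* after the three vowels already allows zero or more trailing word characters. The short text below is a made-up sample rather than the full poem, and pat5 is an illustrative name:

```python
import re

# A small sample containing words with and without three consecutive vowels.
text = "queen of cheese, quietly by many a beau\nIn the city"

# Word boundary, any word characters, three vowels, any word characters, word boundary.
pat5 = r"\b\w*[aeiou]{3}\w*\b"
found = re.findall(pat5, text)
print(found)  # ['queen', 'quietly', 'beau']
```

Because \w never matches punctuation or whitespace, this version cannot run past the end of a word.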

7-12 Use unhexlify to convert this hex string into a bytes variable called gif.


>>> import binascii
>>> a="47494638396101000100800000000000ffffff21f9" + \
...     "0401000000002c000000000100010000020144003b"
>>> gif=binascii.unhexlify(a)
>>> len(gif)
42

7-13 A valid GIF file begins with the bytes "GIF89a". Does gif match?


# gif holds a bytes value, so compare it against a bytes literal with a b prefix.
>>> gif[:6]==b"GIF89a"
True

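7-14 The width of a GIF file is a 16-bit little-endian integer starting at byte offset 6, and the height is an integer of the same size starting at offset 8. Extract these values from gif and print them. Are they both 1?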


>>> import struct
>>> width,height=struct.unpack("<HH",gif[6:10])
>>> width,height
(1, 1)
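
The slice bounds and the choice of H can be checked with struct.calcsize(). This is a sketch: with the < prefix, struct uses its standard sizes, in which H is 2 bytes, I and L are 4 bytes each, and Q is 8 bytes. A 16-bit value is 2 bytes, so each field needs H, and two of them occupy offsets 6 through 9, which is the slice gif[6:10].

```python
import binascii
import struct

gif = binascii.unhexlify(
    "47494638396101000100800000000000ffffff21f9"
    "0401000000002c000000000100010000020144003b"
)

# Standard sizes (in bytes) of some unsigned integer formats.
sizes = {fmt: struct.calcsize("<" + fmt) for fmt in ("H", "I", "L", "Q")}
print(sizes)  # {'H': 2, 'I': 4, 'L': 4, 'Q': 8}

# Two H fields need 4 bytes: offsets 6, 7 (width) and 8, 9 (height).
print(struct.calcsize("<HH"))  # 4

# Reading each field separately gives the same values as "<HH" on gif[6:10].
(width,) = struct.unpack("<H", gif[6:8])
(height,) = struct.unpack("<H", gif[8:10])
print(width, height)  # 1 1
```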

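Impressions

In exercise 7-14, I did not understand where the 10 in gif[6:10] comes from or why H is used.

I do understand that the width is a 16-bit little-endian value starting at offset 6 and that the height has the same size and starts at offset 8.

Is it correct to say that the 2-byte H is used, rather than a 4-byte (I) or 8-byte (L) unsigned integer, because gif[6:8] holds the width and gif[8:10] holds the height?

If anyone knows, I would appreciate an explanation in the comments.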

References

Bill Lubanovic, 入門 Python3 (O'Reilly Japan)

Python Tutorial, 3.8.1 documentation
https://docs.python.org/ja/3/howto/unicode.html

C language: endianness
https://monozukuri-c.com/langc-endian/

Explanation of binary data
http://deutschina.hatenablog.com/entry/2016/01/24/013000
