
Study diary #2 with 「つくってマスターPython」

Posted at 2022-03-05

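[Source] つくってマスターPython
This is a continuation of the previous post.

Chapter 4: Processing documents

4-1 Regular expressions

A regular expression describes strings as a pattern built from fixed rules, and searching and replacing are then done against that pattern. In Python, the standard library module re provides this.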

Replacing with a regular expression



import re  # imported to use regular expressions

# A string enclosed in ''' ... ''' may span multiple lines
s = '''One Little, two little, three little Indians, Four little, five LITTLE, six liTTle, Seven Little, eight little, nine LittLe, Ten Little Indian boys'''

# Replace using a regular expression
result = re.sub('little', 'BIG', s, flags=re.IGNORECASE)
print(result)

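Table 4-1: Arguments of sub and an example of their use

Argument	Value in the example	Description
Pattern (search string)	'little'	The pattern to search for
Replacement string	'BIG'	The string to substitute. Every part that matches the pattern is replaced with it
Target string	s	The string to be processed (the string that is searched). Here it holds the text assigned above
Flags (optional)	re.IGNORECASE	Option settings. IGNORECASE ignores the difference between upper and lower case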
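
To see what the flag in Table 4-1 actually changes, here is a minimal sketch that runs sub with and without re.IGNORECASE. The short sample string and the variable names are my own, not taken from the book.

```python
import re

# Hypothetical sample text (shorter than the one in the book example)
s = 'One Little, two little, three LITTLE Indians'

# Without a flag the match is case-sensitive: only 'little' is replaced
case_sensitive = re.sub('little', 'BIG', s)

# With re.IGNORECASE every spelling of 'little' is replaced
case_insensitive = re.sub('little', 'BIG', s, flags=re.IGNORECASE)

# count limits how many matches are replaced (0, the default, means all)
first_only = re.sub('little', 'BIG', s, count=1, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
print(first_only)
```

Passing flags and count by keyword keeps the call readable, since both are optional arguments that come after the target string.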

Extracting amounts from a string and totaling them

import re

data = '''
40インチ TV 98000円
ノートPC　113000円
スマホ 58700円
スマホ　58700円

タブレット 49500円
'''

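# Note: in a replacement string, \\ is an escaped backslash, so r'\\1-' inserts
# the literal text \1- instead of the captured digits
result = re.sub(r'(\d+)円', r'\\1-', data)
print(result)

# r'(\d+)円': \d matches a decimal digit (for str patterns this also covers
# full-width digits unless re.ASCII is set); + means one or more in a row
res = re.findall(r'(\d+)円', data)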

total = 0

for item in res:
    print(item)
    total += int(item)
    
print('total: ' + str(total) + '円')

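# Change the notation by substitution. The space between the two backslashes keeps
# the first from escaping the second, so \1 is read as group 1 (output: "\ 98000-" and so on)
result = re.sub(r'(\d+)円', r'\ \1-', data)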
print(result)
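
The extract-and-total step can also be written more compactly. This is only a sketch of the same idea: because the pattern has a single capture group, findall returns just the digits, and sum adds them up. The shortened data and the names prices and total are my own.

```python
import re

# Same kind of data as above; each amount ends with the 円 character
data = '''
40インチ TV 98000円
ノートPC　113000円
スマホ 58700円
タブレット 49500円
'''

# With one capture group, findall returns only the captured digits as strings.
# The 40 in "40インチ" is not followed by 円, so it is not picked up.
prices = re.findall(r'(\d+)円', data)

# Convert each string to int and add them up
total = sum(int(p) for p in prices)

print(prices)
print(f'total: {total}円')
```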


import re

data = '''
40インチ TV 98000円
ノートPC　113000円
スマホ 58700円
スマホ　58700円

タブレット 49500円
'''

result = re.sub(r'(\d+)円', r'\ \1-', data)
print(result)
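
The backslashes in the replacement string are easy to get wrong, so here is a small sketch of how the replacement template is read. I am assuming the goal is to put a currency mark in front of the captured digits; the ¥ prefix and the one-line sample are my own choices, not the book's.

```python
import re

text = 'スマホ 58700円'  # hypothetical one-line sample

# r'\\1-' is an escaped backslash followed by the literal text '1-',
# so the captured digits are lost
literal = re.sub(r'(\d+)円', r'\\1-', text)

# r'\1' refers to capture group 1, so a plain prefix can be written before it
with_yen = re.sub(r'(\d+)円', r'¥\1-', text)

# \g<1> is the unambiguous spelling of the same group reference
with_g = re.sub(r'(\d+)円', r'¥\g<1>-', text)

print(literal)
print(with_yen)
print(with_g)
```

The \g<1> form is handy when the group reference is directly followed by a digit, where \1 followed by another digit would be read as a different group number.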

Finding phone numbers and e-mail addresses

import re

data ='''
太郎 090-(999)-999 taro@yamada.san
花子 080-(888)-888 hanako@flower.shop

'''

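# Hyphens are placed last inside [...] so that they are literal;
# in [\w.-_] the sequence .-_ would be read as a range from '.' to '_'
result = re.findall(r'(\S+)\s+([()\d-]+)\s+([\w._-]+@[\w._-]+)', data)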

print('* Name')

for item in result:
    print(item[0])

print('\n* Phone number')

for item in result:
    print(item[1])

print('\n* E-mail address')
for item in result:
    print(item[2])
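
One thing worth checking in character classes like the ones in the pattern above is where the hyphen sits. Between two characters it defines a range; at the end of the class it is a literal hyphen. A minimal sketch of the difference, using a made-up address:

```python
import re

# Inside [...], '.-_' is a range from '.' to '_' in code-point order.
# That range includes '@' and the digits, but not the hyphen itself.
range_class = re.compile(r'[.-_]')

# With the hyphen placed last, it is taken literally.
literal_class = re.compile(r'[._-]')

print(bool(range_class.fullmatch('-')))    # hyphen is outside the range
print(bool(range_class.fullmatch('@')))    # '@' is inside the range
print(bool(literal_class.fullmatch('-')))  # hyphen matches literally

# Hypothetical address with a hyphen in the domain
mail = 'hanako@flower-shop.example'
cut_short = re.findall(r'[\w.-_]+@[\w.-_]+', mail)
complete = re.findall(r'[\w._-]+@[\w._-]+', mail)
print(cut_short)
print(complete)
```

With the range form the match stops at the hyphen, so only the part before it is returned; with the hyphen last, the whole address is matched.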

Extracting links from HTML

import re

data = '''
<html><head></head>
<body>
<a href="http://www.google.co.jp/">Google</a>
<a href="https://www.google.com/webhp?tab=ww&hl=ja">Google</a>
<img src = "https://www.python.org/static/img/python-logo@2x.png">
</body>
</html>
'''

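result = re.findall(r'(https?://)([\w.-]+)(/?)([\w\?/:%#\$&~\.=+\-@]*)', data)
"""
(https?://) ... matches http or https
([\w.-]+) ... the domain part that follows (no spaces inside the class, so a space cannot match)
(/?) ... if there is a /, split there and capture it
([\w\?/:%#\$&~\.=+\-@]*) ... captures the part after the /
"""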

print('* Domain')
for item in result:
    print(item[1])

print('\n* Full address')
for item in result:
    print(''.join(item))
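
As a cross-check for the domain part, the standard library can also split a URL once it has been found. The sketch below uses a deliberately simple pattern of my own (a URL runs until the next quote, whitespace, or angle bracket) together with urllib.parse.urlsplit. It is an alternative to the four-group pattern above, not the book's method.

```python
import re
from urllib.parse import urlsplit

html = '''
<a href="http://www.google.co.jp/">Google</a>
<a href="https://www.google.com/webhp?tab=ww&hl=ja">Google</a>
<img src = "https://www.python.org/static/img/python-logo@2x.png">
'''

# Assumed simplification: a URL ends at the next quote, whitespace, or angle bracket
urls = re.findall(r'https?://[^\s"\'<>]+', html)

# urlsplit separates scheme, host (netloc), path, and query
domains = [urlsplit(u).netloc for u in urls]

for url, domain in zip(urls, domains):
    print(domain, url)
```

Letting urlsplit do the splitting means the regular expression only has to find where a URL starts and ends, which keeps the pattern short.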

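At first I planned to work through the book in order from the beginning, but I want to prioritize finishing a program, so from the next post on I will be writing scraping code.

Whenever I get stuck, I will go back, reread the information I need, and summarize it here.

That's all for today. Thank you for reading.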
