Python/pandas: extracting everything after a specific string in a DataFrame column

Posted at 2021-08-20

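When I looked into this, I was surprised to find nothing that quite fit, so here is a memo.

Suppose a pandas DataFrame has a column of URLs like the one below. The goal is to extract everything from the '?' onward.
'https://dummy.com/index2.html?abc=1'

▼Sample data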

import pandas as pd

my_dict =  {'A': [1, 2, 3, 4, 5], 'url': ['https://dummy.com/index.html', 'https://dummy.com/index2.html?abc=1', 'https://dummy.com/index2.html?def=2', '', 'https://dummy.com/index2.html?abc=1']}
df = pd.DataFrame.from_dict(my_dict)

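The first thing that came to mind was a regular-expression replace.

df['parms']=''
# Replace everything up to and including the first '?' with '?'
df['parms']=df['url'].replace(r'(.*?)\?','?',regex=True)

I expected this to do the job in one shot, but a URL with no parameter part has nothing to replace (naturally), so the full URL ends up in 'parms' unchanged.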

idx	A	url	parms
0	1	https://dummy.com/index.html	https://dummy.com/index.html
1	2	https://dummy.com/index2.html?abc=1	?abc=1
2	3	https://dummy.com/index2.html?def=2	?def=2
3	4
4	5	https://dummy.com/index2.html?abc=1	?abc=1

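That much is expected behavior. Next attempt:

df['parms']=''
df['parms']=df['url'].str.replace(r'(?:.*?)\?(.*?)',r'?\1',regex=True)

I thought this would pick up only the part after the '?', but the result was the same. A regex replace rewrites only the text that matched, so a URL without a '?' comes back unchanged.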
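
The behavior of the two attempts can be checked outside pandas with plain `re.sub`, which treats non-matching strings the same way. This is only a minimal sketch using two of the sample URLs. The variable names `urls`, `first` and `second` are made up for illustration.

```python
import re

urls = [
    'https://dummy.com/index.html',         # no '?' in this one
    'https://dummy.com/index2.html?abc=1',
]

# First attempt: replace everything up to and including the first '?' with '?'
first = [re.sub(r'(.*?)\?', '?', u) for u in urls]

# Second attempt: the trailing lazy group (.*?) matches the empty string,
# so the substitution again only rewrites the text up to the first '?'
second = [re.sub(r'(?:.*?)\?(.*?)', r'?\1', u) for u in urls]

print(first)
print(second)
```

Both lists come out identical. The URL without a '?' is returned as is, which is why a separate step is needed to blank out those rows.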

▼The method that worked

df['parms']=''
# Only rows whose url contains '?' receive a value; the others keep ''
df.loc[df['url'].astype(str).str.contains(r'\?'), 'parms'] = df['url'].astype(str).str.replace(r'(.*?)\?','?',regex=True)

idx	A	url	parms
0	1	https://dummy.com/index.html
1	2	https://dummy.com/index2.html?abc=1	?abc=1
2	3	https://dummy.com/index2.html?def=2	?def=2
3	4
4	5	https://dummy.com/index2.html?abc=1	?abc=1

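=> That looks right.

(Explanation)
Select only the rows whose 'url' contains '?', and set 'parms' for those rows to the value with everything before the '?' removed.

There may well be better ways to do this.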
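
One possible alternative is `Series.str.extract`, which returns NaN for rows where the pattern does not match. It is offered here as a sketch, not as the definitive answer. The pattern `(\?.*)` and the `fillna('')` step are my own assumptions about what fits this data, not part of the method above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'url': [
        'https://dummy.com/index.html',
        'https://dummy.com/index2.html?abc=1',
        'https://dummy.com/index2.html?def=2',
        '',
        'https://dummy.com/index2.html?abc=1',
    ],
})

# Capture the first '?' and everything after it.
# Rows without a '?' yield NaN, which is then replaced by an empty string.
df['parms'] = df['url'].str.extract(r'(\?.*)', expand=False).fillna('')

print(df)
```

This avoids the separate `contains` mask, because extraction yields a missing value for non-matching rows and never the original string.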
