Extracting dates in various formats from text with regular expressions in pandas

Posted at 2017-11-30

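Text often contains date information written in many different formats.
This article focuses on English-language date formats and shows how to extract them with regular expressions.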

For example, dates can appear in the following formats. The goal is to pull out all of them.

  • 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  • Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  • 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  • Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  • Feb 2009; Sep 2009; Oct 2010
  • 6/2008; 12/2009
  • 2009; 2010

We use pandas for this.
Depending on the format, the same month may be written as 04 or as 4; these differences are standardized at the end.

Setup: importing pandas

import pandas as pd

String operations in pandas go through the str accessor, and regular-expression extraction uses its extractall() method.
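As a small illustration of what extractall() returns, the sketch below runs it on a made-up Series (the sample strings and the names `s` and `m` are assumptions for this example only). Each named group becomes a column, and the result is indexed by the original row number plus a `match` level that counts matches within that row. Rows without a match do not appear at all.

```python
import pandas as pd

# Hypothetical sample data: the first row holds two dates, the second holds none.
s = pd.Series(['due 04/20/2009, paid 05/01/2009', 'no date here'])

# Named groups (?P<name>...) become the column names of the result.
m = s.str.extractall(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})')

# The index is a MultiIndex: (original row, match number within that row).
print(m)
print(m.index.tolist())
```

This is why the outputs in the following sections show two index columns, with the second one labeled `match`.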

04/20/2009; 04/20/09; 4/20/09; 4/3/09

text = ['04/20/2009', '04/20/09', '4/20/09', '4/3/09']
df = pd.DataFrame(text, columns=['text'])
a1_1 = df.text.str.extractall(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2})\b')
a1_2 = df.text.str.extractall(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{4})\b')

a1 = pd.concat([a1_1, a1_2])
a1


         month  day  year
   match
1  0     04     20   09
2  0     4      20   09
3  0     4      3    09
0  0     04     20   2009
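The two patterns above differ only in the width of the year. As an alternative sketch (not the approach used in this article), both cases can be handled in one pass by letting the year group try four digits first and fall back to two. The name `a1_single` is made up for this example.

```python
import pandas as pd

text = ['04/20/2009', '04/20/09', '4/20/09', '4/3/09']
df = pd.DataFrame(text, columns=['text'])

# \d{4}|\d{2}: prefer a four-digit year, otherwise accept a two-digit one.
# The trailing \b keeps a two-digit match from stopping in the middle of "2009".
a1_single = df.text.str.extractall(
    r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{4}|\d{2})\b'
)
print(a1_single)
```

With a single pattern the rows come back in their original order, so no concat step is needed for this format.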

Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;

text = ['Mar-20-2009', 'Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009']
df = pd.DataFrame(text, columns=['text'])
a2 = df.text.str.extractall(r'(?:(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[.\s-]*(?P<day>\d{1,2})[,\s\-]*(?P<year>\d{4})')
a2

         month  day  year
   match
0  0     Mar    20   2009
1  0     Mar    20   2009
2  0     Mar    20   2009
3  0     Mar    20   2009
4  0     Mar    20   2009

20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Feb 2009; Sep 2009; Oct 2010

text = ['20 Mar 2009', '20 March 2009', '20 Mar. 2009', '20 March, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
df = pd.DataFrame(text, columns=['text'])
a3 = df.text.str.extractall(r'(?:(?P<day>\d{1,2})\s){0,1}(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,\.\s-]*(?P<year>\d{4})')
a3

         day  month  year
   match
0  0     20   Mar    2009
1  0     20   March  2009
2  0     20   Mar    2009
3  0     20   March  2009
4  0     NaN  Feb    2009
5  0     NaN  Sep    2009
6  0     NaN  Oct    2010

Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009

text = ['Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009']
df = pd.DataFrame(text, columns=['text'])

# "rd" is included in the suffix list so that dates such as "Mar 23rd, 2009" also match
a4 = df.text.str.extractall(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)\s{1}(?P<day>\d{1,2}(?:st|nd|rd|th){1})(?:,\s){1}(?P<year>\d{4})')
a4


         month  day   year
   match
0  0     Mar    20th  2009
1  0     Mar    21st  2009
2  0     Mar    22nd  2009
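The format list at the top also includes month/year pairs such as 6/2008 and bare years such as 2009, which the patterns so far do not cover. The following is one possible sketch for them. The names `a5` and `a6` and the lookbehind `(?<![\d/])` are assumptions of this example, not part of the original article. The lookbehind rejects a match that starts right after a digit or a slash, so the `20/2009` tail of `04/20/2009` is not mistaken for a month/year pair.

```python
import pandas as pd

# The last entry is a full date, included to show that it is NOT picked up here.
text = ['6/2008', '12/2009', '2009', '2010', '04/20/2009']
df = pd.DataFrame(text, columns=['text'])

# Month/year: must not be preceded by a digit or "/" (i.e. not part of m/d/y).
a5 = df.text.str.extractall(r'(?<![\d/])(?P<month>\d{1,2})/(?P<year>\d{4})\b')

# Year only: four digits that are not the tail of a slash-separated date.
a6 = df.text.str.extractall(r'(?<![\d/])(?P<year>\d{4})\b')

print(a5)
print(a6)
```

Note that the year-only pattern would still match the year inside text such as "Mar 20, 2009", so on mixed input it probably makes sense to keep its matches only for rows that no more specific pattern has already matched.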

Finally, merge the results into a single table.
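The article does not show the merge itself, so the following is only a minimal sketch of how the merge and the standardization mentioned earlier might look. It rests on several assumptions that are not from the original: the frames `numeric` and `named` are hand-written stand-ins for the extracted results, two-digit years are treated as 20xx, a missing day or month defaults to 1, and `normalize` and `MONTH_TO_NUM` are hypothetical helper names.

```python
import pandas as pd

# Hypothetical stand-ins for the per-format results (a1 ... a4) above.
# extractall() returns every captured group as a string (or a missing value).
numeric = pd.DataFrame(
    {'month': ['04', '4'], 'day': ['20', '3'], 'year': ['2009', '09']},
    index=[0, 1],
)
named = pd.DataFrame(
    {'month': ['Mar', 'March', 'Feb'], 'day': ['20th', '20', None],
     'year': ['2009', '2009', '2009']},
    index=[2, 3, 4],
)

# Merge the per-format tables and restore the original row order.
merged = pd.concat([numeric, named]).sort_index()

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
MONTH_TO_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}


def normalize(df):
    """Convert string month/day/year columns to integers."""
    out = pd.DataFrame(index=df.index)
    # Month: digits are used as-is, names are looked up by their first 3 letters.
    month = df['month'].fillna('1')
    out['month'] = [int(m) if m.isdigit() else MONTH_TO_NUM[m[:3]] for m in month]
    # Day: drop an ordinal suffix (st/nd/rd/th); a missing day becomes 1.
    day = df['day'].fillna('1').str.replace(r'(?:st|nd|rd|th)$', '', regex=True)
    out['day'] = day.astype(int)
    # Year: ASSUMPTION: two-digit years belong to the 2000s.
    year = df['year'].astype(int)
    out['year'] = year.where(year >= 100, year + 2000)
    return out


# to_datetime() assembles dates from columns named year, month and day.
dates = pd.to_datetime(normalize(merged))
print(dates)
```

Once every row is a proper datetime, "04" and "4", or "Mar" and "March", no longer differ, and the rows can be sorted or compared directly.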
