Extracting Various Date Formats from Text with Regular Expressions in pandas

Posted at 2017-11-30

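Free text contains date and time information in many different formats.
This article focuses on English-style formats and shows how to extract them with regular expressions.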

For example, dates can appear in formats like the following. The goal is to extract all of them.

  • 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  • Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  • 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  • Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  • Feb 2009; Sep 2009; Oct 2010
  • 6/2008; 12/2009
  • 2009; 2010

We use pandas for this.
Depending on the format, the same month can appear as 04 or as 4, for example, so the values are normalized at the end.

Preparation: importing pandas

import pandas as pd

String operations on a pandas column go through the str accessor, and regular-expression extraction uses its extractall() method.
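As a minimal sketch of how extractall() behaves, the snippet below runs one pattern over a made-up Series (the sample sentences and the names `s` and `found` are assumptions for illustration). It shows that named groups become columns and that every match gets its own row, indexed by the original row label and a match counter.

```python
import pandas as pd

# Hypothetical sample: the first row holds two dates, the second holds none.
s = pd.Series(['Seen 04/20/2009 and again 4/3/09', 'no date here'])

# Named groups (?P<name>...) become column names; each match becomes one row.
found = s.str.extractall(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{2,4})')

# The index has two levels: the original row label and the match number
# within that row. Rows without any match do not appear at all.
print(found)
print(found.index.tolist())  # [(0, 0), (0, 1)]
```

This is why the outputs in the following sections carry a second index level named match.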

04/20/2009; 04/20/09; 4/20/09; 4/3/09

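text = ['04/20/2009', '04/20/09', '4/20/09', '4/3/09']
df = pd.DataFrame(text, columns=['text'])
# 2-digit years: the trailing \b stops this from matching the "20" in "2009"
a1_1 = df.text.str.extractall(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2})\b')
# 4-digit years
a1_2 = df.text.str.extractall(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{4})\b')

# Stack both results; rows are grouped by pattern, not by original position
a1 = pd.concat([a1_1, a1_2])
a1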


         month  day  year
   match
1  0     04     20   09
2  0     4      20   09
3  0     4      3    09
0  0     04     20   2009
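As an alternative sketch (not what the article does), both year widths can be handled by a single pattern with an alternation, which also keeps the rows in their original order. The name `a1_single` is made up for this example.

```python
import pandas as pd

text = ['04/20/2009', '04/20/09', '4/20/09', '4/3/09']
df = pd.DataFrame(text, columns=['text'])

# Try a 4-digit year first, then fall back to a 2-digit year.
# The trailing \b still prevents a partial match inside a longer number.
pattern = r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{4}|\d{2})\b'
a1_single = df.text.str.extractall(pattern)
print(a1_single)
```

With this variant no pd.concat() is needed for the numeric formats, at the cost of a slightly denser pattern.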

Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;

text = ['Mar-20-2009', 'Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009']
df = pd.DataFrame(text, columns=['text'])
# [a-z]* sits outside the month group, so only the 3-letter abbreviation is captured
a2 = df.text.str.extractall(r'(?:(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[.\s-]*(?P<day>\d{1,2})[,\s\-]*(?P<year>\d{4})')
a2

         month  day  year
   match
0  0     Mar    20   2009
1  0     Mar    20   2009
2  0     Mar    20   2009
3  0     Mar    20   2009
4  0     Mar    20   2009

20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Feb 2009; Sep 2009; Oct 2010

text = ['20 Mar 2009', '20 March 2009', '20 Mar. 2009', '20 March, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
df = pd.DataFrame(text, columns=['text'])
# The day is optional, so month-year forms match with day = NaN.
# Here [a-z]* is inside the month group, so full month names are kept as written.
a3 = df.text.str.extractall(r'(?:(?P<day>\d{1,2})\s){0,1}(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,\.\s-]*(?P<year>\d{4})')
a3

         day  month  year
   match
0  0     20   Mar    2009
1  0     20   March  2009
2  0     20   Mar    2009
3  0     20   March  2009
4  0     NaN  Feb    2009
5  0     NaN  Sep    2009
6  0     NaN  Oct    2010

Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009

text = ['Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009']
df = pd.DataFrame(text, columns=['text'])

# The day group keeps the ordinal suffix; "rd" is included so that days such as 23rd also match
a4 = df.text.str.extractall(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)\s{1}(?P<day>\d{1,2}(?:nd|rd|th|st){1})(?:,\s){1}(?P<year>\d{4})')
a4


         month  day   year
   match
0  0     Mar    20th  2009
1  0     Mar    21st  2009
2  0     Mar    22nd  2009
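The list at the top of the article also includes month/year forms (6/2008; 12/2009) and year-only forms (2009; 2010), which have no snippet of their own here. The following is one possible sketch in the same style; the patterns and the names `a5` and `a6` are assumptions, not the author's code.

```python
import pandas as pd

text = ['6/2008', '12/2009', '2009', '2010']
df = pd.DataFrame(text, columns=['text'])

# Month/year: a 1-2 digit month, a slash, then a 4-digit year.
# The lookbehind rejects a "month" that is really the tail of a longer
# date such as 04/20/2009.
a5 = df.text.str.extractall(r'(?<![\d/])(?P<month>\d{1,2})/(?P<year>\d{4})\b')

# Year only: a 4-digit number that is not attached to other digits or slashes.
a6 = df.text.str.extractall(r'(?<![\d/])(?P<year>\d{4})(?![\d/])')

print(a5)
print(a6)
```

On free text, the year-only pattern would also match the year inside forms such as Mar 20, 2009 or Mar-20-2009, so it is probably safest to apply it only to rows that the more specific patterns did not already cover.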

Finally, merge the results into one table and normalize them.
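The article stops short of showing this step, so here is a rough sketch of one way to do it. The frame `parts` is a small hand-written stand-in for what `pd.concat([a1, a2, a3, a4])` would contain; the helper `to_month_number`, the `MONTHS` table, the choice of day 1 for missing days, and the reading of 2-digit years as 20xx are all assumptions to adjust for your data.

```python
import pandas as pd

# Stand-in for pd.concat([a1, a2, a3, a4]): string columns as extractall() returns them.
parts = pd.DataFrame({
    'month': ['04', '4', 'Mar', 'March', 'Feb', 'Mar'],
    'day':   ['20', '3', '20', '20', None, '21st'],
    'year':  ['2009', '09', '2009', '2009', '2009', '2009'],
})

MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

def to_month_number(value):
    """Map '04', '4', 'Mar', or 'March' to an integer month."""
    if value.isdigit():
        return int(value)
    return MONTHS[value[:3].lower()]

month = parts['month'].map(to_month_number)

# Keep only the digits of the day (21st -> 21).
# Assumption: a missing day is treated as the 1st of the month.
day = parts['day'].str.extract(r'(\d+)', expand=False).fillna('1').astype(int)

# Assumption: 2-digit years belong to the 2000s (09 -> 2009).
year = parts['year'].astype(int)
year = year.where(year >= 100, year + 2000)

# to_datetime() assembles dates from columns named year, month, and day.
dates = pd.to_datetime(pd.DataFrame({'year': year, 'month': month, 'day': day}))
print(dates.dt.strftime('%Y-%m-%d').tolist())
```

pd.concat() aligns frames by column name, so the differing column order of a1 through a4 does not matter. If several patterns are run over the same text, the same date may be captured more than once and would need de-duplicating before this step.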
