@marcopagot (Asuka) posted at 2021-11-15

I want to extract text from a DataFrame using regular expressions.

Q&A

What I want to solve

I want to extract strings from a DataFrame using regular expressions.

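Thank you for reading. I am a Python beginner.
I am currently trying to predict rents for rental properties using SUUMO real estate data.
As a first step I am trying to build a feature for the building age (years since construction),
but I am getting an error at that stage.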

Data description

train

Data

    title   category    address age floor   fee management_fee  deposit gratuity    madori  menseki
0   ドゥ・トゥールWEST   賃貸マンション   東京都中央区晴海３ \n築7年\n42階建\n   \r\n\t\t\t\t\t\t\t\t\t\t\t33階 22万円    15000円    22万円    22万円    1LDK    44.67m2
1   ドゥ・トゥールWEST   賃貸マンション   東京都中央区晴海３ \n築7年\n42階建\n   \r\n\t\t\t\t\t\t\t\t\t\t\t31階 46万円    -   46万円    46万円    3LDK    100.9m2
2   ドゥ・トゥールWEST   賃貸マンション   東京都中央区晴海３ \n築7年\n42階建\n   \r\n\t\t\t\t\t\t\t\t\t\t\t42階 57万円    30000円    57万円    57万円    4LDK    100.9m2
3   ドゥ・トゥールWEST   賃貸マンション   東京都中央区晴海３ \n築7年\n42階建\n   \r\n\t\t\t\t\t\t\t\t\t\t\t43階 58万円    20000円    58万円    58万円    3LDK    100.9m2
4   ドゥ・トゥールWEST   賃貸マンション   東京都中央区晴海３ \n築7年\n42階建\n   \r\n\t\t\t\t\t\t\t\t\t\t\t52階 70万円    20000円    140万円   70万円    2LDK    80.66m2

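From train['age'] in the data above, I want to pick out only the number of years since construction
and add it as a new ['age'] column. (In the raw values, 築7年 means "built 7 years ago" and 42階建 means "42 stories".)
Note: later, I would also like to extract the number of stories from the same ['age'] column and add it as a ['max_floor'] column.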

First, I extracted the first 5 characters of the pandas ['age'] column as a Series.

Input

age = train['age'].str[0:5:1]
age.head()

Output

0    \n築7年\n
1    \n築7年\n
2    \n築7年\n
3    \n築7年\n
4    \n築7年\n

Next, I intend to use a regular expression to pull only the digits out of age.

What I tried

age_single = age.str.extract(r'\d', expand = False)

Output

ValueError: pattern contains no capture groups

This resulted in the error shown above.
If anyone knows how to solve this,
I would greatly appreciate your advice.
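
For reference, the error message itself points at the likely cause: `str.extract` returns only what is matched inside parentheses (a capture group), and the pattern `r'\d'` has none. Below is a minimal sketch of one possible fix, not a definitive answer. It assumes the raw values always look like `\n築7年\n42階建\n`; the two-row DataFrame and the column name `age_years` are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the `age` column, in the raw format shown in the question.
train = pd.DataFrame({"age": ["\n築7年\n42階建\n", "\n築15年\n3階建\n"]})

# The parentheses form a capture group; str.extract returns what they match.
# \d+ matches one or more digits, so two-digit ages such as 15 are kept whole.
age_years = train["age"].str.extract(r"築(\d+)年", expand=False)
max_floor = train["age"].str.extract(r"(\d+)階建", expand=False)

# Extracted values are strings; convert them to numbers.
# Rows that do not match the pattern become NaN.
train["age_years"] = pd.to_numeric(age_years)
train["max_floor"] = pd.to_numeric(max_floor)

print(train[["age_years", "max_floor"]])
```

Because the pattern anchors on the surrounding characters (築 ... 年), slicing the first 5 characters beforehand is not needed. If the result is written back to `train['age']` itself, the number of stories should be extracted first, since overwriting the column discards the raw text.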
