
Data Science 100 Knocks: Problems 11 to 20

Posted at 2023-01-08

Data Science 100 Knocks (Structured Data Processing)

https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
A Colaboratory notebook is provided in the GitHub repository, so you can get started simply by copying it to your own Google Drive.

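Note: Some of my answers are written differently from the solutions published by the society, but they produce the same output. See the GitHub repository for the society's solutions. A companion book with explanations is also on sale.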

Answers

P011

From the customer data (df_customer), extract all columns for rows whose customer ID (customer_id) ends with 1, and display 10 rows.

P011

df_customer.query('customer_id.str.endswith("1")', engine='python').head(10)
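For comparison, here is a minimal sketch showing that the query() form should select the same rows as plain boolean indexing. The DataFrame df_toy and its customer IDs are made up for illustration and are not taken from the real df_customer.

```python
import pandas as pd

# Hypothetical stand-in for df_customer with only the column used here.
df_toy = pd.DataFrame({
    "customer_id": [
        "CS021313000114",
        "CS037613000071",
        "CS031415000172",
        "CS028811000001",
    ]
})

# Same filter written two ways: a query string and a boolean mask.
via_query = df_toy.query('customer_id.str.endswith("1")', engine="python")
via_mask = df_toy[df_toy["customer_id"].str.endswith("1")]

print(via_mask["customer_id"].tolist())  # ['CS037613000071', 'CS028811000001']
print(via_query.equals(via_mask))        # True
```

The boolean-mask form is a little longer, but it does not need the engine argument.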

P012

From the store data (df_store), display all columns for rows whose address (address) contains "横浜市" (Yokohama City).

P012

df_store.query('address.str.contains("横浜市")', engine='python')

P013

From the customer data (df_customer), extract all columns for rows whose status code (status_cd) starts with a letter from A to F, and display 10 rows.

P013

boolindex = df_customer['status_cd'].str.startswith(("A","B","C","D","E","F")) # passing a tuple of prefixes returns a boolean Series
result = df_customer[boolindex]
result.head(10)

P013 (regular expression version)

df_customer.query('status_cd.str.contains(r"^[A-F]")', engine='python').head(10)
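As a quick check that the tuple version and the regular expression version of P013 agree, here is a small sketch on made-up status codes. The values in codes are hypothetical and only imitate the general look of status_cd.

```python
import pandas as pd

# Hypothetical status codes, not taken from df_customer.
codes = pd.Series(["A-20100421-9", "0-00000000-0", "F-20101029-F", "G-12345678-1"])

# Tuple of prefixes: True where the string starts with any of them.
mask_tuple = codes.str.startswith(("A", "B", "C", "D", "E", "F"))

# Regular expression: ^ anchors the character class [A-F] to the first character.
mask_regex = codes.str.contains(r"^[A-F]")

print(mask_tuple.tolist())            # [True, False, True, False]
print(mask_tuple.equals(mask_regex))  # True
```

The regular expression is shorter, and the difference grows once the range of allowed characters gets wider.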

P014

From the customer data (df_customer), extract all columns for rows whose status code (status_cd) ends with a digit from 1 to 9, and display 10 rows.

P014

boolindex = df_customer['status_cd'].str.endswith(("1","2","3","4","5","6","7","8","9")) # passing a tuple of suffixes returns a boolean Series
result = df_customer[boolindex]
result.head(10)

P014 (regular expression version)

df_customer.query('status_cd.str.contains(r"[1-9]$")',engine='python').head(10)

P015

From the customer data (df_customer), extract all columns for rows whose status code (status_cd) starts with a letter from A to F and ends with a digit from 1 to 9, and display 10 rows.

P015

df_customer.query('status_cd.str.contains(r"^[A-F].*[1-9]$")',engine='python').head(10)

P016

From the store data (df_store), display all columns for rows whose phone number (tel_no) has the form 3 digits-3 digits-4 digits.

P016

df_store.query('tel_no.str.contains(r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$")', engine='python')
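To see what the ^ and $ anchors in this pattern contribute, here is a small sketch on made-up phone numbers. The values in tel_no are hypothetical and not taken from df_store.

```python
import pandas as pd

# Hypothetical phone numbers, not taken from df_store.
tel_no = pd.Series(["045-123-4567", "03-1234-5678", "0467-12-3456", "045-123-45678"])

# Anchored pattern: the whole string must be 3 digits, 3 digits, 4 digits.
anchored = tel_no.str.contains(r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$")
print(anchored.tolist())    # [True, False, False, False]

# Without the anchors, a string that merely contains a 3-3-4 run also matches,
# so the last value (which has 5 trailing digits) slips through.
unanchored = tel_no.str.contains(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
print(unanchored.tolist())  # [True, False, False, True]
```

Since str.contains() looks for a match anywhere in the string, the anchors are what turn it into a whole-string format check.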

P017

Sort the customer data (df_customer) by date of birth (birth_day) from oldest to youngest, and display all columns for the first 10 rows.

P017

df_customer.sort_values('birth_day').head(10)

P018

Sort the customer data (df_customer) by date of birth (birth_day) from youngest to oldest, and display all columns for the first 10 rows.

P018

df_customer.sort_values('birth_day',ascending = False).head(10)
# Pass ascending=False to sort in descending order
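A small sketch of why ascending order means "oldest first" here: an earlier date of birth sorts first. The DataFrame df_toy and its values are made up, and it assumes birth_day holds YYYY-MM-DD strings, which sort chronologically even when compared as text.

```python
import pandas as pd

# Hypothetical stand-in for df_customer; birth_day is assumed to be a YYYY-MM-DD string.
df_toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "birth_day": ["1981-04-29", "1929-01-12", "2007-11-25"],
})

# Ascending (the default): earliest birth date first, i.e. oldest customer first.
oldest_first = df_toy.sort_values("birth_day")
print(oldest_first["customer_id"].tolist())    # ['c2', 'c1', 'c3']

# Descending: latest birth date first, i.e. youngest customer first.
youngest_first = df_toy.sort_values("birth_day", ascending=False)
print(youngest_first["customer_id"].tolist())  # ['c3', 'c1', 'c2']
```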

P019

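For the receipt detail data (df_receipt), assign ranks in descending order of sales amount per record (amount) and display the first 10 rows. Show the customer ID (customer_id), the sales amount (amount), and the assigned rank. Rows with equal sales amounts (amount) must receive the same rank.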

P019

df_receipt_sorted = df_receipt.sort_values('amount',ascending = False)
df_receipt_sorted['rank'] = df_receipt_sorted['amount'].rank(method='min',ascending=False)

df_receipt_sorted[['customer_id','amount','rank']].head(10)

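The method argument of rank() specifies how tied values are handled:
・average: the default. Tied values receive the average of the ranks they span.
・min: tied values receive the lowest rank in the group (the convention familiar from sports rankings).
・max: tied values receive the highest rank in the group.
・first: tied values are ranked in the order they appear in the data.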
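The four options are easier to compare side by side. Below is a minimal sketch on a made-up Series in which two values tie for the highest amount; the data is hypothetical and not taken from df_receipt.

```python
import pandas as pd

# Hypothetical amounts: the two 500s tie for first place.
amounts = pd.Series([500, 300, 500, 100], name="amount")

# Rank in descending order with each tie-handling method.
ranks = pd.DataFrame({
    "amount": amounts,
    "average": amounts.rank(method="average", ascending=False),
    "min": amounts.rank(method="min", ascending=False),
    "max": amounts.rank(method="max", ascending=False),
    "first": amounts.rank(method="first", ascending=False),
})
print(ranks)

# average -> [1.5, 3.0, 1.5, 4.0]  (ranks 1 and 2 are averaged)
# min     -> [1.0, 3.0, 1.0, 4.0]  (both tied rows get rank 1)
# max     -> [2.0, 3.0, 2.0, 4.0]  (both tied rows get rank 2)
# first   -> [1.0, 3.0, 2.0, 4.0]  (the earlier row wins the tie)
```

P019 corresponds to the min column and P020 to the first column.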

P020

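For the receipt detail data (df_receipt), assign ranks in descending order of sales amount per record (amount) and display the first 10 rows. Show the customer ID (customer_id), the sales amount (amount), and the assigned rank. Rows with equal sales amounts (amount) must still receive different ranks.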

P020

df_receipt_sorted = df_receipt.sort_values('amount',ascending = False)
df_receipt_sorted['rank'] = df_receipt_sorted['amount'].rank(method='first',ascending=False)

df_receipt_sorted[['customer_id','amount','rank']].head(10)

Impressions

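I'm starting to get stuck a little more often, so each problem will probably take longer from here on.
I had hardly ever used regular expressions before, so I looked things up as I wrote. P016, the phone number problem, in particular really drove home how convenient regular expressions are.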
