
Data Science 100 Knocks Explained (P021~040)

Posted at 2020-08-21

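1. Introduction

Continuing from the previous article, this post walks through the solutions to Data Science 100 Knocks.
For setup, please refer to this article (note: it uses Docker on a Mac).

The content is mainly commentary on the model answers, but alternative solutions are included as well.

2. Walkthrough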

P-021: Count the number of rows in the receipt details data frame (df_receipt).

P-021

# Use the built-in len() function to get the row count.
len(df_receipt)

P-022: Count the number of unique values of customer ID (customer_id) in the receipt details data frame (df_receipt).

P-022

# Use the unique() method to get the distinct values.
# df_receipt['customer_id'].unique() returns the distinct values as a NumPy ndarray
len(df_receipt['customer_id'].unique())

Reference: https://note.nkmk.me/python-pandas-value-counts/
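
As a small aside, pandas also offers nunique(), which returns the count in one call. The sketch below uses a made-up toy frame (df_toy is hypothetical, not the exercise data) to show one difference worth knowing: unique() keeps a missing value as one of the entries, while nunique() skips missing values by default.

```python
import pandas as pd

# Hypothetical toy data standing in for df_receipt (one missing ID on purpose)
df_toy = pd.DataFrame({'customer_id': ['A', 'B', 'A', None, 'C']})

# unique() keeps the missing value as one of the distinct entries
n_via_unique = len(df_toy['customer_id'].unique())

# nunique() ignores missing values unless dropna=False is passed
n_via_nunique = df_toy['customer_id'].nunique()
n_with_missing = df_toy['customer_id'].nunique(dropna=False)

print(n_via_unique, n_via_nunique, n_with_missing)
```

If the column has no missing values, both approaches should agree.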

P-023: For the receipt details data frame (df_receipt), sum the sales amount (amount) and sales quantity (quantity) for each store code (store_cd).

P-023

# Group the rows by 'store_cd' with groupby()
# Aggregate with agg(): the dict maps amount and quantity each to 'sum'
# reset_index() turns store_cd back into a regular column
df_receipt.groupby('store_cd').agg({'amount': 'sum', 'quantity': 'sum'}).reset_index()

# (Alternative)
df_receipt[['amount', 'quantity', 'store_cd']].groupby('store_cd', as_index=False).sum()

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/
Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/
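
To check that the two forms behave the same way, here is a minimal sketch on a made-up three-row frame (df_toy and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data standing in for df_receipt
df_toy = pd.DataFrame({
    'store_cd': ['S1', 'S1', 'S2'],
    'amount': [100, 200, 50],
    'quantity': [1, 2, 1],
})

# Form 1: agg() with a dict of column -> aggregation name
by_agg = df_toy.groupby('store_cd').agg({'amount': 'sum', 'quantity': 'sum'}).reset_index()

# Form 2: select the columns first, keep the key as a column with as_index=False
by_sum = df_toy[['amount', 'quantity', 'store_cd']].groupby('store_cd', as_index=False).sum()

print(by_agg)
print(by_sum)
```

The dict form is handy when each column needs a different aggregation; the second form is shorter when every column gets the same one.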

P-024: For the receipt details data frame (df_receipt), find the most recent sales date (sales_ymd) for each customer ID (customer_id) and display 10 rows.

P-024

# Group by customer ID (customer_id).
# Take the most recent sales date (sales_ymd) with max()
df_receipt.groupby('customer_id').sales_ymd.max().reset_index().head(10)

# (Alternative)
# idxmax() returns, for each customer, the row label of the most recent sales date; loc then selects those rows
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmax()].head(10)

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/
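
The two answers return the same dates but differ in shape. The sketch below, on a hypothetical toy frame, shows that max() produces one fresh row per group, whereas the idxmax() route picks existing rows and therefore keeps their original index labels.

```python
import pandas as pd

# Hypothetical toy data standing in for df_receipt
df_toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'B'],
    'sales_ymd': [20190101, 20190305, 20180720],
})

# max(): aggregated values, index rebuilt by reset_index()
latest = df_toy.groupby('customer_id').sales_ymd.max().reset_index()

# idxmax(): labels of the rows holding each group's maximum, then row selection
latest_rows = df_toy.loc[df_toy.groupby('customer_id').sales_ymd.idxmax()]

print(latest)
print(latest_rows)
```

The idxmax() form is useful when other columns of the matching row are needed as well.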

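P-025: For the receipt details data frame (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 rows.

P-025

# Same pattern as P-023. The oldest sales date can be expressed as agg({'sales_ymd': 'min'})
# (the string 'min' is used instead of the built-in min, which recent pandas versions warn about)
df_receipt.groupby('customer_id').agg({'sales_ymd': 'min'}).reset_index().head(10)

# (Alternative)
# idxmin() returns, for each customer, the row label of the oldest sales date; loc then selects those rows
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmin()].head(10)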

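P-026: For the receipt details data frame (df_receipt), find the most recent and the oldest sales date (sales_ymd) for each customer ID (customer_id), and display 10 rows where the two differ.

P-026

# Build a data frame holding the most recent (max) and oldest (min) sales date (sales_ymd) per customer ID (customer_id)
df_tmp = df_receipt.groupby('customer_id').agg({'sales_ymd':['max','min']}).reset_index()

# Flatten the column names (explained below)
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]

# Use query() to find the rows where the two dates differ.
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)

After the aggregation, df_tmp.columns is MultiIndex([('customer_id', ''), ('sales_ymd', 'max'), ('sales_ymd', 'min')]), so the list comprehension takes each tuple in turn and "_".join() concatenates its elements. The resulting column names are customer_id_, sales_ymd_max, and sales_ymd_min.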
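
The flattening step can be tried in isolation. This is a minimal sketch on a hypothetical toy frame; the extra rstrip('_') step is my own addition (not part of the original answer) to tidy the trailing underscore that the join leaves on the key column.

```python
import pandas as pd

# Hypothetical toy data standing in for df_receipt
df_toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'B'],
    'sales_ymd': [20190101, 20190305, 20180720],
})

df_tmp = df_toy.groupby('customer_id').agg({'sales_ymd': ['max', 'min']}).reset_index()

# Each column label is a tuple such as ('sales_ymd', 'max'); join its parts with "_"
joined = ["_".join(pair) for pair in df_tmp.columns]

# Optional tidy-up (assumption, not in the original): drop the trailing underscore
cleaned = [name.rstrip('_') for name in joined]
df_tmp.columns = cleaned

# Only customer A has two different dates
differing = df_tmp.query('sales_ymd_max != sales_ymd_min')
print(joined)
print(differing)
```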

P-027: For the receipt details data frame (df_receipt), compute the mean sales amount (amount) for each store code (store_cd) and display the top 5 in descending order.

P-027

# Group the receipt details data frame (df_receipt) by store code (store_cd).
# agg({'amount':'mean'}) computes the mean sales amount (amount)
# reset_index() turns store_cd back into a column; sort_values('amount', ascending=False) sorts by amount in descending order
df_receipt.groupby('store_cd').agg({'amount':'mean'}) \
      .reset_index().sort_values('amount', ascending=False).head(5)

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/
Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/

P-028: For the receipt details data frame (df_receipt), compute the median sales amount (amount) for each store code (store_cd) and display the top 5 in descending order.

P-028

# Same pattern as P-027
# The median is computed with 'median'
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)

Reference (median): https://note.nkmk.me/python-statistics-mean-median-mode-var-stdev/

P-029: For the receipt details data frame (df_receipt), find the mode of product code (product_cd) for each store code (store_cd).

P-029

# The first half follows P-027
# Apply a function (a lambda expression) to product code (product_cd).
df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
# Pattern: df.groupby('grouping key').target_column.apply(lambda x: x.mode())

# (Incorrect answer)
df_receipt.groupby('store_cd').agg({'product_cd':'mode'}).reset_index()
# >>>AttributeError: 'SeriesGroupBy' object has no attribute 'mode'

Reference (lambda expressions): https://note.nkmk.me/python-lambda-usage/
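
One detail of this answer is that mode() can return more than one value when there is a tie, so a store may occupy several rows in the result. The sketch below illustrates this on a hypothetical toy frame in which store S2 has two equally frequent products.

```python
import pandas as pd

# Hypothetical toy data: S1 has a clear mode (P1), S2 has a tie (P3 and P4)
df_toy = pd.DataFrame({
    'store_cd': ['S1', 'S1', 'S1', 'S2', 'S2'],
    'product_cd': ['P1', 'P1', 'P2', 'P3', 'P4'],
})

# mode() returns a Series per group, so tied values each become their own row
modes = df_toy.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()

print(modes)
```

The extra column produced by reset_index() numbers the modes within each store.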

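P-030: For the receipt details data frame (df_receipt), compute the sample variance of the sales amount (amount) for each store code (store_cd) and display the top 5 in descending order.

P-030

# "Sample variance" here means the variance divided by n, i.e. var(ddof=0), per store code (store_cd)
# reset_index() turns store_cd back into a column; sort_values with ascending=False sorts in descending order
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)

Reference: https://deepage.net/features/numpy-var.html
Reference (sample variance vs. unbiased variance): https://bellcurve.jp/statistics/course/8614.html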
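
The role of ddof is easy to miss because pandas and NumPy use different defaults: pandas var() divides by n - 1 unless told otherwise, while np.var() divides by n. A minimal sketch with four made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values; mean is 2.5 and the squared deviations sum to 5
amount = pd.Series([1, 2, 3, 4])

var_n = amount.var(ddof=0)            # divide by n     -> 5 / 4
var_n_minus_1 = amount.var()          # pandas default: divide by n - 1 -> 5 / 3
var_numpy = np.var(amount.to_numpy()) # NumPy default: divide by n

print(var_n, var_n_minus_1, var_numpy)
```

The same ddof argument applies to std(), which is simply the square root of the corresponding variance.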

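P-031: For the receipt details data frame (df_receipt), compute the sample standard deviation of the sales amount (amount) for each store code (store_cd) and display the top 5 in descending order.

P-031

# "Sample standard deviation" here means std(ddof=0) of the sales amount (amount) per store code (store_cd)
# reset_index() turns store_cd back into a column; sort_values with ascending=False sorts in descending order
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)

Reference: https://deepage.net/features/numpy-var.html
Reference (sample variance vs. unbiased variance): https://bellcurve.jp/statistics/course/8614.html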

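P-032: For the sales amount (amount) in the receipt details data frame (df_receipt), find the percentile values in 25% steps.

P-032

# Compute the percentile values of the sales amount (amount) with quantile()
# np.arange(5)/4 >>> [0, 0.25, 0.5, 0.75, 1]
df_receipt.amount.quantile(q=np.arange(5)/4)

# (Alternative)
# np.percentile takes percentages (0 to 100); 0 is included so the result matches the answer above
np.percentile(df_receipt['amount'], q=[0, 25, 50, 75, 100])

Reference (quantile): https://note.nkmk.me/python-pandas-quantile/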
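
The two functions take their cut points on different scales: quantile() expects fractions between 0 and 1, while np.percentile() expects percentages between 0 and 100. The sketch below, using five made-up values, passes the same five points to both and compares the results.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for df_receipt['amount']
amount = pd.Series([10, 20, 30, 40, 50])

# Fractions 0, 0.25, 0.5, 0.75, 1 for pandas
via_quantile = amount.quantile(q=np.arange(5) / 4)

# The same points expressed as percentages for NumPy
via_percentile = np.percentile(amount, q=[0, 25, 50, 75, 100])

print(via_quantile)
print(via_percentile)
```

quantile() returns a Series indexed by the fractions, which makes the output easier to read than the bare array from np.percentile().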

P-033: For the receipt details data frame (df_receipt), compute the mean sales amount (amount) for each store code (store_cd) and extract those that are 330 or more.

P-033

# Group by store code (store_cd).
# Compute the mean of the sales amount (amount) and turn store_cd back into a column with reset_index()
# Use query() to keep the rows whose mean amount is 330 or more
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')

# (Alternative)
df_receipt.groupby('store_cd').amount.mean()[df_receipt.groupby('store_cd').amount.mean() >= 330]
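
The alternative answer evaluates the same groupby twice. A possible tidy-up, sketched here on a hypothetical toy frame, is to store the per-store means in a variable (mean_by_store is a name I introduce for illustration) and filter that.

```python
import pandas as pd

# Hypothetical toy data: S1 averages 350, S2 averages 150
df_toy = pd.DataFrame({
    'store_cd': ['S1', 'S1', 'S2', 'S2'],
    'amount': [300, 400, 100, 200],
})

# Compute the per-store mean once and reuse it for the boolean mask
mean_by_store = df_toy.groupby('store_cd').amount.mean()
high = mean_by_store[mean_by_store >= 330]

print(high)
```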

P-034: For the receipt details data frame (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) and find the mean over all customers. Customer IDs starting with "Z" represent non-members, so exclude them from the calculation.

P-034

# Version without query()
# ~ means NOT; str.startswith("Z") is True for customer IDs (customer_id) starting with Z
# Sum the sales amount (amount) per customer ID (customer_id), then take the mean over all customers
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

# Version with query()
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()
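
The order of operations matters here: the totals are computed per customer first, and only then averaged. The sketch below uses a hypothetical toy frame to contrast this with a plain mean over receipt rows, which would answer a different question.

```python
import pandas as pd

# Hypothetical toy data: A buys twice, B once, Z1 is a non-member
df_toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'B', 'Z1'],
    'amount': [100, 200, 500, 1000],
})

# Drop non-members (IDs starting with "Z")
members = df_toy[~df_toy['customer_id'].str.startswith('Z')]

# Per-customer totals: A -> 300, B -> 500; their mean is 400
mean_of_totals = members.groupby('customer_id').amount.sum().mean()

# For contrast: the mean over individual rows is 800 / 3
mean_of_rows = members['amount'].mean()

print(mean_of_totals, mean_of_rows)
```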

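P-035: For the receipt details data frame (df_receipt), sum the sales amount (amount) for each customer ID (customer_id), find the mean over all customers, and extract the customers whose purchases are at or above that mean. Customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying 10 rows is sufficient.

P-035

# Exclude customer IDs starting with "Z", group by 'customer_id', and take the mean of the per-customer totals (2547.742234529256)
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

# Sum the sales amount (amount) per customer ID (customer_id), as a data frame
# Note: this frame still contains non-member IDs; filter them here as well if they must not appear in the result
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()

# Show 10 rows whose total is at least amount_mean (the mean over all customers)
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)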
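
If non-members should be absent from the extracted rows as well as from the mean, one option is to filter once and reuse the filtered frame for both steps. This is a sketch under that reading of the task, on a hypothetical toy frame; it is not the original answer.

```python
import pandas as pd

# Hypothetical toy data: Z1 is a non-member with a very large total
df_toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'B', 'C', 'Z1'],
    'amount': [100, 200, 500, 100, 10000],
})

# Filter non-members once, then reuse the result
members = df_toy[~df_toy['customer_id'].str.startswith('Z')]

# Per-customer totals: A -> 300, B -> 500, C -> 100
totals = members.groupby('customer_id').amount.sum().reset_index()

# Mean of the totals is 300; keep customers at or above it
amount_mean = totals['amount'].mean()
above_average = totals[totals['amount'] >= amount_mean]

print(above_average)
```

With this approach Z1 cannot appear in the output even though its total exceeds the mean.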

P-036: Inner-join the receipt details data frame (df_receipt) with the store data frame (df_store), and display 10 rows containing all columns of the receipt details data frame plus the store name (store_name) from the store data frame.

P-036

# merge(A (df), B (df), how='inner' (inner join), on='shared column')
pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)

Reference: https://note.nkmk.me/python-pandas-merge-join/
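
Selecting only the key and the wanted column on the right-hand side keeps the other store columns out of the result. A minimal sketch with hypothetical toy frames (receipts and stores are made-up stand-ins) also shows that an inner join drops rows whose key has no match.

```python
import pandas as pd

# Hypothetical toy data: store S9 has no entry in the store master
receipts = pd.DataFrame({'store_cd': ['S1', 'S2', 'S9'], 'amount': [100, 200, 300]})
stores = pd.DataFrame({
    'store_cd': ['S1', 'S2', 'S3'],
    'store_name': ['Alpha', 'Beta', 'Gamma'],
    'address': ['addr1', 'addr2', 'addr3'],
})

# Only store_cd (the key) and store_name are taken from the right-hand frame
joined = pd.merge(receipts, stores[['store_cd', 'store_name']], how='inner', on='store_cd')

print(joined)
```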

P-037: Inner-join the product data frame (df_product) with the category data frame (df_category), and display 10 rows containing all columns of the product data frame plus the small category name (category_small_name) from the category data frame.

P-037

# merge(A (df), B (df), how='inner' (inner join), on='shared column')
pd.merge(df_product
         , df_category[['category_small_cd','category_small_name']]
         , how='inner', on='category_small_cd').head(10)

Reference: https://note.nkmk.me/python-pandas-merge-join/

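P-038: From the customer data frame (df_customer) and the receipt details data frame (df_receipt), find the total sales amount for each customer. For customers with no purchase history, display the sales amount as 0. Target only customers whose gender code (gender_cd) is female (1), and exclude non-members (customer IDs starting with 'Z'). Displaying 10 rows is sufficient.

P-038

# Total sales amount per customer.
# Group by customer ID and sum the sales amount (amount)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()

# Keep customers whose gender code (gender_cd) is female (1) and exclude non-members (customer IDs starting with 'Z')
df_tmp = df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')

# merge(A (df), B (df), how='left' (left outer join), on='shared column'); the left join keeps every customer in df_tmp
# fillna(0) sets the sales amount to 0 for customers with no purchase history
pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id').fillna(0).head(10)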
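
The left join is what keeps customers without receipts in the result; their amount comes back as NaN and fillna(0) then replaces it. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy data: customer C2 has no purchases
customers = pd.DataFrame({'customer_id': ['C1', 'C2', 'C3']})
amount_sum = pd.DataFrame({'customer_id': ['C1', 'C3'], 'amount': [100, 50]})

# how='left' keeps all rows of customers; unmatched rows get NaN in amount
merged = pd.merge(customers, amount_sum, how='left', on='customer_id')
n_missing = int(merged['amount'].isna().sum())

# Replace the NaN with 0
filled = merged.fillna(0)

print(filled)
```

Because NaN forces the column to a floating-point type, the amounts come back as floats even after filling.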

P-039: From the receipt details data frame (df_receipt), extract the top 20 customers by number of sales days and the top 20 customers by total sales amount, and full-outer-join the two. Exclude non-members (customer IDs starting with 'Z').

P-039

# Group by customer ID ('customer_id') and sum the sales amount (amount)
df_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
# Exclude customer IDs starting with Z (customer_id.str.startswith("Z"))
df_sum = df_sum.query('not customer_id.str.startswith("Z")', engine='python')
# Sort by total sales amount (amount) in descending order and take the top 20
df_sum = df_sum.sort_values('amount', ascending=False).head(20)

# Drop duplicate (customer_id, sales_ymd) pairs so that each sales day is counted once per customer.
df_cnt = df_receipt[~df_receipt.duplicated(subset=['customer_id', 'sales_ymd'])]
# Exclude customer IDs starting with Z (customer_id.str.startswith("Z"))
df_cnt = df_cnt.query('not customer_id.str.startswith("Z")', engine='python')
# Group by customer ID ('customer_id') and count the sales days (sales_ymd)
df_cnt = df_cnt.groupby('customer_id').sales_ymd.count().reset_index()
# Sort by the number of sales days in descending order (ascending=False) and take the top 20.
df_cnt = df_cnt.sort_values('sales_ymd', ascending=False).head(20)

# Outer-join the two with merge
pd.merge(df_sum, df_cnt, how='outer', on='customer_id')

Reference (duplicated): https://note.nkmk.me/python-pandas-duplicated-drop-duplicates/
Reference (outer join): https://note.nkmk.me/python-pandas-merge-join/
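
The "remove duplicates, then count" step counts distinct sales days per customer. As a cross-check, the sketch below compares it with nunique() on a hypothetical toy frame; using nunique() is my own suggestion, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: customer A has two receipts on the same day
df_toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'A', 'B'],
    'sales_ymd': [20190101, 20190101, 20190102, 20190101],
})

# Approach from the answer: drop duplicate (customer, day) pairs, then count rows
deduped = df_toy[~df_toy.duplicated(subset=['customer_id', 'sales_ymd'])]
days_via_count = deduped.groupby('customer_id').sales_ymd.count()

# Shorter variant: count distinct days directly
days_via_nunique = df_toy.groupby('customer_id').sales_ymd.nunique()

print(days_via_count)
print(days_via_nunique)
```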

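P-040: We want to find out how many rows result from combining every store with every product. Compute the number of rows in the Cartesian product of stores (df_store) and products (df_product).

P-040

# Make a copy of the stores (df_store)
df_store_tmp = df_store.copy()
# Make a copy of the products (df_product)
df_product_tmp = df_product.copy()

# A shared key column is needed for the join, so add one with the same value to each frame.
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0

# Outer-join on the key and check the row count with the len() function
len(pd.merge(df_store_tmp, df_product_tmp, on='key', how='outer'))

Reference: https://note.nkmk.me/python-pandas-merge-join/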
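
Since every row carries the same key value, the join pairs each store with each product, so the row count equals the product of the two lengths. The sketch below checks this on hypothetical toy frames and also tries how='cross', which pandas 1.2 and later provide for the same purpose without a dummy key.

```python
import pandas as pd

# Hypothetical toy data: 2 stores and 3 products
stores = pd.DataFrame({'store_cd': ['S1', 'S2']})
products = pd.DataFrame({'product_cd': ['P1', 'P2', 'P3']})

# Dummy-key approach from the answer (assign() leaves the originals untouched)
n_via_key = len(pd.merge(stores.assign(key=0), products.assign(key=0), on='key', how='outer'))

# Built-in cross join, available from pandas 1.2
n_via_cross = len(pd.merge(stores, products, how='cross'))

# Expected size of a Cartesian product
n_expected = len(stores) * len(products)

print(n_via_key, n_via_cross, n_expected)
```

If only the count is needed, multiplying the two lengths avoids building the large joined frame at all.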

3. References

Data Science 100 Knocks
How to run Data Science 100 Knocks on a Mac
