
Data Science 100 Knocks: Problems 1 to 10

Posted at 2022-12-18

Data Science 100 Knocks (Structured Data Processing)

I started working through the 100 Knocks in Python for practice.
I will keep a record of my answers here.

Started: December 2022

A Colaboratory notebook is provided on GitHub, so you can get started simply by copying it to your own Drive.

Answers

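Note: these answers are not written the same way as the ones published by the association.
Note: there are many ways to write the code, so the answers here are written to produce the same output.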

P001

From the receipt detail data (df_receipt), display the first 10 rows with all columns and visually check what kind of data it holds.

P001

df_receipt.head(10)

P002

From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), in that order, and display 10 rows.

P002

df_receipt.head(10)[['sales_ymd','customer_id','product_cd','amount']]
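As a small illustration of the earlier note that several spellings give the same output, the sketch below checks that taking the first rows and then selecting columns matches selecting columns and then taking the first rows. The toy DataFrame and its values are hypothetical stand-ins for df_receipt, not the real data.

```python
import pandas as pd

# Hypothetical toy stand-in for df_receipt (values are made up for illustration).
df_receipt = pd.DataFrame({
    'sales_ymd': [20181103, 20181118, 20170712],
    'store_cd': ['S14006', 'S13008', 'S14028'],
    'customer_id': ['CS006214000001', 'CS008415000097', 'CS028414000014'],
    'product_cd': ['P070305012', 'P070701017', 'P060101005'],
    'amount': [158, 81, 170],
})

cols = ['sales_ymd', 'customer_id', 'product_cd', 'amount']

# Take the first rows, then pick the columns (the order used in the answer).
head_then_cols = df_receipt.head(2)[cols]
# Pick the columns first, then take the first rows.
cols_then_head = df_receipt[cols].head(2)

print(head_then_cols.equals(cols_then_head))  # True
```

On a large table, either order returns the same rows; which one reads better is mostly a matter of taste.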

P003

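From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), in that order, and display 10 rows. While extracting, rename the column sales_ymd to sales_date.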

P003

df_receipt.head(10)[['sales_ymd','customer_id','product_cd','amount']].rename(columns = {'sales_ymd' : 'sales_date'})

P004

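From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), in that order, and extract the rows that satisfy the following condition.
Customer ID (customer_id) is "CS018205000001"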

P004

df_receipt.query('customer_id == "CS018205000001"')[['sales_ymd','customer_id','product_cd','amount']]

P005

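From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), in that order, and extract the rows that satisfy all of the following conditions.
Customer ID (customer_id) is "CS018205000001"
Sales amount (amount) is 1,000 or more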

P005

df_receipt.query('customer_id == "CS018205000001" and amount >= 1000')[['sales_ymd','customer_id','product_cd','amount']]

P006

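From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), sales quantity (quantity), and sales amount (amount), in that order, and extract the rows that satisfy all of the following conditions.
Customer ID (customer_id) is "CS018205000001"
Sales amount (amount) is 1,000 or more, or sales quantity (quantity) is 5 or more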

P006

df_receipt.query('customer_id == "CS018205000001" and (amount >= 1000 or quantity >= 5)')[['sales_ymd','customer_id','product_cd','quantity','amount']]
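The parentheses around the `or` part matter, because `and` binds more tightly than `or` inside a query string. The following is a minimal sketch of the difference, assuming a hypothetical four-row table in place of df_receipt.

```python
import pandas as pd

# Hypothetical toy data; the second customer ID is made up for illustration.
df_receipt = pd.DataFrame({
    'customer_id': ['CS018205000001', 'CS018205000001',
                    'CS018205000001', 'CS999999999999'],
    'quantity': [1, 6, 1, 7],
    'amount': [2000, 100, 100, 50],
})

# With parentheses: the customer condition applies to both alternatives.
with_parens = df_receipt.query(
    'customer_id == "CS018205000001" and (amount >= 1000 or quantity >= 5)'
)

# Without parentheses: parsed as (customer and amount) or quantity,
# so rows from other customers with quantity >= 5 slip through.
without_parens = df_receipt.query(
    'customer_id == "CS018205000001" and amount >= 1000 or quantity >= 5'
)

print(with_parens.index.tolist())     # [0, 1]
print(without_parens.index.tolist())  # [0, 1, 3]
```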

P007

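From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), in that order, and extract the rows that satisfy all of the following conditions.
Customer ID (customer_id) is "CS018205000001"
Sales amount (amount) is between 1,000 and 2,000 inclusive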

P007

df_receipt[['sales_ymd','customer_id','product_cd','amount']].query('customer_id == "CS018205000001" and (2000 >= amount >= 1000)')
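The range condition can be written in more than one way. The sketch below, on a hypothetical toy column, compares a chained comparison inside `query` with `Series.between`, which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical toy data covering values below, on, and above the bounds.
df = pd.DataFrame({'amount': [500, 1000, 1500, 2000, 2500]})

# Chained comparison inside a query string.
chained = df.query('1000 <= amount <= 2000')

# Series.between is inclusive on both ends by default.
with_between = df[df['amount'].between(1000, 2000)]

print(chained['amount'].tolist())    # [1000, 1500, 2000]
print(chained.equals(with_between))  # True
```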

P008

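From the receipt detail data (df_receipt), select the columns sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), in that order, and extract the rows that satisfy all of the following conditions.
Customer ID (customer_id) is "CS018205000001"
Product code (product_cd) is anything other than "P071401019"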

P008

df_receipt[['sales_ymd','customer_id','product_cd','amount']].query('customer_id == "CS018205000001" and product_cd != "P071401019"')

P009

Rewrite the following expression so that it uses AND instead of OR, without changing the output.
df_store.query('not(prefecture_cd == "13" | floor_area > 900)')

P009

df_store.query('prefecture_cd != "13" and floor_area <= 900')

This is an application of De Morgan's laws: not (A or B) is equivalent to (not A) and (not B).
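The equivalence can be checked on a small example. This is only a sketch: the toy df_store below and its values are hypothetical, and it assumes floor_area has no missing values.

```python
import pandas as pd

# Hypothetical toy stand-in for df_store (values are made up for illustration).
df_store = pd.DataFrame({
    'store_cd': ['S13002', 'S14006', 'S13008', 'S12014'],
    'prefecture_cd': ['13', '14', '13', '12'],
    'floor_area': [800.0, 1000.0, 950.0, 700.0],
})

# Original form: NOT (A OR B).
original = df_store.query('not(prefecture_cd == "13" | floor_area > 900)')

# Rewritten form: (NOT A) AND (NOT B).
rewritten = df_store.query('prefecture_cd != "13" and floor_area <= 900')

print(original.index.tolist())     # [3]
print(original.equals(rewritten))  # True
```

If floor_area contained NaN, the two forms would differ: comparisons with NaN are always False, so `not(floor_area > 900)` keeps such a row while `floor_area <= 900` drops it.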

P010

From the store data (df_store), extract all columns for the rows whose store code (store_cd) starts with "S14", and display 10 rows.

P010

df_store[df_store['store_cd'].str.startswith('S14')].head(10)
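The same prefix filter can also be expressed with `query`. The sketch below uses a hypothetical toy df_store and passes `engine='python'` so that the `.str` accessor is evaluated by plain Python; whether it works without that argument may depend on the installed pandas version.

```python
import pandas as pd

# Hypothetical toy stand-in for df_store (values are made up for illustration).
df_store = pd.DataFrame({
    'store_cd': ['S12014', 'S14006', 'S14010', 'S13002'],
    'floor_area': [1735.0, 1679.0, 1732.0, 1553.0],
})

# Boolean-mask form, as in the answer.
by_mask = df_store[df_store['store_cd'].str.startswith('S14')].head(10)

# query form; engine='python' is chosen here to allow string methods.
by_query = df_store.query(
    "store_cd.str.startswith('S14')", engine='python'
).head(10)

print(by_mask['store_cd'].tolist())  # ['S14006', 'S14010']
print(by_mask.equals(by_query))      # True
```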

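Thoughts

Up to this point I could write the answers without much trouble.
For extracting the rows of a DataFrame that match a condition, I recommend getting used to the query method.
Searching online mostly turns up the bracket-indexing ([[]]) approach, and it made me realize that I am not very used to query myself.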
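To compare the two styles mentioned above, here is a minimal sketch that applies the same filter once with `query` and once with a boolean mask in brackets. The toy data is hypothetical, and the second customer ID is made up.

```python
import pandas as pd

# Hypothetical toy stand-in for df_receipt.
df_receipt = pd.DataFrame({
    'customer_id': ['CS018205000001', 'CS018205000001', 'CS999999999999'],
    'amount': [1500, 500, 3000],
})

# query style: the condition is a single string.
via_query = df_receipt.query(
    'customer_id == "CS018205000001" and amount >= 1000'
)

# Bracket style: each condition needs parentheses and is combined with &.
mask = (df_receipt['customer_id'] == 'CS018205000001') & (df_receipt['amount'] >= 1000)
via_mask = df_receipt[mask]

print(via_query.index.tolist())    # [0]
print(via_query.equals(via_mask))  # True
```

The query string tends to stay readable as conditions pile up, while the mask form avoids quoting issues when values come from Python variables.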
