More than 1 year has passed since last update.

データサイエンス100本ノック（構造化データ加工編） - Julia/Base/Query

Last updated at 2023-02-09Posted at 2023-02-06

データサイエンス100本ノック（構造化データ加工編） - Julia/BaseR/Query

別稿では DataFramesMeta を使う場合の解答例を示した。
https://qiita.com/WolfMoon/items/387d18e4ea3f65768997
本稿では BaseR の場合と Query を使う場合の解答例を示す。

ファイル等の準備

https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
に書かれているように進めればよいが，デスクトップマシンに Python の実行環境がある場合には，そのページの中程の Requirement > Cloud Service > Colaboratory > Open in Colab をクリックし，最初のセルの

import os

import pandas as pd

:

df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)

を実行すれば，作業ディレクトリの下に 100knocks-preprocess/docker/work/data/ というディレクトリができ，これをベースディレクトリとして以下のように 6 個のデータファイルを読み込めば，始めることができる。

なお，それぞれのファイルの列を数値として読み込むか文字列として読み込むかはユーザが指定しなければならないことがある。入力結果を確認し，想定と違う場合には読み込み方を指示しなければならない。Julia では CSV.read() の types 引数で指定する。

using DataFrames
using CSV
using Query

basedirectory = "../Python/100knocks-preprocess/docker/work/data/"
df_customer = CSV.read(basedirectory * "customer.csv", DataFrame, types=Dict(3 => String, 10 => String));
df_category = CSV.read(basedirectory * "category.csv", DataFrame, types=String);
df_product = CSV.read(basedirectory * "product.csv", DataFrame, types=Dict(2 => String, 3 => String, 4 => String));
df_receipt = CSV.read(basedirectory * "receipt.csv", DataFrame);
df_store = CSV.read(basedirectory * "store.csv", DataFrame, types=Dict(3 => String));
df_geocode = CSV.read(basedirectory * "geocode.csv", DataFrame);

はじめに

~~初めに以下のセルを実行してください~~
~~必要なライブラリのインポートとデータベース（PostgreSQL）からのデータ読み込みを行います~~
~~pandas等、利用が想定されるライブラリは以下セルでインポートしています~~
~~その他利用したいライブラリがあれば適宜インストールしてください（"!pip install ライブラリ名"でインストールも可能）~~
処理は複数回に分けても構いません
名前、住所等はダミーデータであり、実在するものではありません

入力したデータフレームの確認

入力後のデータフレームの確認は，ここでは先頭 10 件を表示させ，目視で確認するということであるが，それぞれの列がどのように読み取られているかを確認すべきである。

実際には，各項目の度数分布を求めたり，ヒストグラムを描いたりして不適切なデータがあるかないかを確認すべきである。また，欠損値の有無と欠損値に特別の値が与えられているかどうかなどを確認しておく必要がある。

1.　単純な目視確認

J-001: レシート明細データ（df_receipt）から全項目の先頭10件を表示し、どのようなデータを保有しているか目視で確認せよ。

df_receipt[1:10, :]  # 範囲を指定できる

10×9 DataFrame

Row	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount
	Int64	Int64	String7	Int64	Int64	String15	String15	Int64	Int64
1	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158
2	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81
3	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170
4	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25
5	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90
6	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138
7	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30
8	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128
9	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770
10	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680

# 別解
first(df_receipt, 10)  # 特定の処理に特化したの関数が用意されていることもある

10×9 DataFrame

Row	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount
	Int64	Int64	String7	Int64	Int64	String15	String15	Int64	Int64
1	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158
2	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81
3	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170
4	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25
5	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90
6	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138
7	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30
8	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128
9	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770
10	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680

Query，パイプを使うことを好む人もいるが，多くの場合面倒くさい。以下しばらくの間，パイプと組み合わせて Query を使う（R の tidyverse のような）やり方を別解として記しておく。

# 別解2
df_receipt |>
    @take(10) |>
    DataFrame

10×9 DataFrame

Row	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount
	Int64	Int64	String7	Int64	Int64	String15	String15	Int64	Int64
1	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158
2	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81
3	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170
4	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25
5	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90
6	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138
7	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30
8	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128
9	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770
10	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680

2.　必要な項目（列）のみを表示する

J-002: レシート明細データ（df_receipt）から売上年月日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上金額（amount）の順に列を指定し、10件表示せよ。

df_receipt[1:10, [:sales_ymd, :customer_id, :product_cd, :amount]]  # 行と列を指定できる（順序も）

10×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

# 別解
df = df_receipt[!, [:sales_ymd, :customer_id, :product_cd, :amount]]
first(df, 10)

10×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

df[] の中の最初の次元の ! または : はすべての行を意味する。! は直接参照するが，: はコピーしたものを参照する。そのため，! を使うほうが処理時間は短い。

@time df = df_receipt[!, [:sales_ymd, :customer_id, :product_cd, :amount]];

  0.000022 seconds (24 allocations: 1.875 KiB)

@time df = df_receipt[:, [:sales_ymd, :customer_id, :product_cd, :amount]];

  0.004656 seconds (11.01 k allocations: 5.386 MiB, 88.53% compilation time)

# 別解2
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :amount) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

J-003: レシート明細データ（df_receipt）から売上年月日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上金額（amount）の順に列を指定し、10件表示せよ。ただし、sales_ymdをsales_dateに項目名を変更しながら抽出すること。

df = df_receipt[!, [:sales_ymd, :customer_id, :product_cd, :amount]]
df = rename(df, :sales_ymd => :sales_date)  # rename!() を使うときには注意
first(df, 10)

10×4 DataFrame

Row	sales_date	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

# 別解
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :amount) |>
    @rename(:sales_ymd => :sales_date) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	sales_date	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

行と列の選択

3.　論理演算式による行の選択

J-004: レシート明細データ（df_receipt）から売上日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上金額（amount）の順に列を指定し、以下の条件を満たすデータを抽出せよ。

顧客ID（customer_id）が"CS018205000001"

列の選択はシンボルベクトル（:foo など）で行う。
- 指定した順で選択される。
- 1 列だけを指定するときもベクトルで指定する（[:foo] のように）。
行を選択するための条件式（論理式）を書くことができる。
- 条件式中の演算子は通常は先行するドット（.）が必要である。
- 条件式中の列 :foo の指定は df.foo のようにする。

df_receipt[df_receipt.customer_id .== "CS018205000001", [:sales_ymd, :customer_id, :product_cd, :amount]]

12×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20180414	CS018205000001	P060104007	600
3	20170614	CS018205000001	P050206001	990
4	20170614	CS018205000001	P060702015	108
5	20190216	CS018205000001	P071005024	102
6	20180414	CS018205000001	P071101002	278
7	20190226	CS018205000001	P070902035	168
8	20190924	CS018205000001	P060805001	495
9	20190226	CS018205000001	P071401020	2200
10	20180911	CS018205000001	P071401005	1100
11	20190216	CS018205000001	P040101002	218
12	20190924	CS018205000001	P091503001	280

# 別解
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :amount) |>
    @filter(_.customer_id == "CS018205000001") |>
    DataFrame

12×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20180414	CS018205000001	P060104007	600
3	20170614	CS018205000001	P050206001	990
4	20170614	CS018205000001	P060702015	108
5	20190216	CS018205000001	P071005024	102
6	20180414	CS018205000001	P071101002	278
7	20190226	CS018205000001	P070902035	168
8	20190924	CS018205000001	P060805001	495
9	20190226	CS018205000001	P071401020	2200
10	20180911	CS018205000001	P071401005	1100
11	20190216	CS018205000001	P040101002	218
12	20190924	CS018205000001	P091503001	280

J-005: レシート明細データ（df_receipt）から売上日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上金額（amount）の順に列を指定し、以下の全ての条件を満たすデータを抽出せよ。

顧客ID（customer_id）が"CS018205000001"

売上金額（amount）が1,000以上

複雑な論理式を書くこともできる。

df_receipt[df_receipt.customer_id .== "CS018205000001" .&& df_receipt.amount .>= 1000, [:sales_ymd, :customer_id, :product_cd, :amount]]

3×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20190226	CS018205000001	P071401020	2200
3	20180911	CS018205000001	P071401005	1100

# 別解
df = select(df_receipt, [:sales_ymd, :customer_id, :product_cd, :amount])
filter([:customer_id, :amount] => (x, y) -> x == "CS018205000001" && y >= 1000, df)

3×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20190226	CS018205000001	P071401020	2200
3	20180911	CS018205000001	P071401005	1100

# 別解2
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :amount) |>
    @filter(_.customer_id == "CS018205000001" && _.amount >= 1000) |>
    DataFrame

3×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20190226	CS018205000001	P071401020	2200
3	20180911	CS018205000001	P071401005	1100

J-006: レシート明細データ（df_receipt）から売上日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上数量（quantity）、売上金額（amount）の順に列を指定し、以下の全ての条件を満たすデータを抽出せよ。

顧客ID（customer_id）が"CS018205000001"

売上金額（amount）が1,000以上または売上数量（quantity）が5以上

df_receipt[df_receipt.customer_id .== "CS018205000001" .&& (df_receipt.amount .>= 1000 .|| df_receipt.quantity .>= 5), [:sales_ymd, :customer_id, :product_cd, :quantity, :amount]]

5×5 DataFrame

Row	sales_ymd	customer_id	product_cd	quantity	amount
	Int64	String15	String15	Int64	Int64
1	20180911	CS018205000001	P071401012	1	2200
2	20180414	CS018205000001	P060104007	6	600
3	20170614	CS018205000001	P050206001	5	990
4	20190226	CS018205000001	P071401020	1	2200
5	20180911	CS018205000001	P071401005	1	1100

# 別解
df = select(df_receipt, [:sales_ymd, :customer_id, :product_cd, :quantity, :amount])
filter([:customer_id, :amount, :quantity] => (x, y, z) -> x == "CS018205000001" && (y >= 1000 || z >= 5), df)

5×5 DataFrame

Row	sales_ymd	customer_id	product_cd	quantity	amount
	Int64	String15	String15	Int64	Int64
1	20180911	CS018205000001	P071401012	1	2200
2	20180414	CS018205000001	P060104007	6	600
3	20170614	CS018205000001	P050206001	5	990
4	20190226	CS018205000001	P071401020	1	2200
5	20180911	CS018205000001	P071401005	1	1100

# 別解2
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :quantity, :amount) |>
    @filter(_.customer_id == "CS018205000001" && (_.amount >= 1000 || _.quantity >= 5)) |>
    DataFrame

5×5 DataFrame

Row	sales_ymd	customer_id	product_cd	quantity	amount
	Int64	String15	String15	Int64	Int64
1	20180911	CS018205000001	P071401012	1	2200
2	20180414	CS018205000001	P060104007	6	600
3	20170614	CS018205000001	P050206001	5	990
4	20190226	CS018205000001	P071401020	1	2200
5	20180911	CS018205000001	P071401005	1	1100

J-007: レシート明細データ（df_receipt）から売上日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上金額（amount）の順に列を指定し、以下の全ての条件を満たすデータを抽出せよ。

顧客ID（customer_id）が"CS018205000001"

売上金額（amount）が1,000以上2,000以下

df_receipt[df_receipt.customer_id .== "CS018205000001" .&& 1000 .<= df_receipt.amount .<= 2000, [:sales_ymd, :customer_id, :product_cd, :amount]]

1×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401005	1100

# 別解
df = select(df_receipt, [:sales_ymd, :customer_id, :product_cd, :amount])
filter([:customer_id, :amount] => (x, y) -> x == "CS018205000001" && y >= 1000 <= y <= 2000, df)

1×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401005	1100

# 別解2
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :amount) |>
    @filter(_.customer_id == "CS018205000001" && 1000 <= _.amount <= 2000) |>
    DataFrame

1×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401005	1100

J-008: レシート明細データ（df_receipt）から売上日（sales_ymd）、顧客ID（customer_id）、商品コード（product_cd）、売上金額（amount）の順に列を指定し、以下の全ての条件を満たすデータを抽出せよ。

顧客ID（customer_id）が"CS018205000001"

商品コード（product_cd）が"P071401019"以外

df_receipt[df_receipt.customer_id .== "CS018205000001" .&& df_receipt.product_cd .!= "P071401019", [:sales_ymd, :customer_id, :product_cd, :amount]]

12×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20180414	CS018205000001	P060104007	600
3	20170614	CS018205000001	P050206001	990
4	20170614	CS018205000001	P060702015	108
5	20190216	CS018205000001	P071005024	102
6	20180414	CS018205000001	P071101002	278
7	20190226	CS018205000001	P070902035	168
8	20190924	CS018205000001	P060805001	495
9	20190226	CS018205000001	P071401020	2200
10	20180911	CS018205000001	P071401005	1100
11	20190216	CS018205000001	P040101002	218
12	20190924	CS018205000001	P091503001	280

# 別解
df = select(df_receipt, [:sales_ymd, :customer_id, :product_cd, :amount])
filter([:customer_id, :product_cd] => (x, y) -> x == "CS018205000001" && y != "P071401019", df)

12×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20180414	CS018205000001	P060104007	600
3	20170614	CS018205000001	P050206001	990
4	20170614	CS018205000001	P060702015	108
5	20190216	CS018205000001	P071005024	102
6	20180414	CS018205000001	P071101002	278
7	20190226	CS018205000001	P070902035	168
8	20190924	CS018205000001	P060805001	495
9	20190226	CS018205000001	P071401020	2200
10	20180911	CS018205000001	P071401005	1100
11	20190216	CS018205000001	P040101002	218
12	20190924	CS018205000001	P091503001	280

# 別解2
df_receipt |>
    @select(:sales_ymd, :customer_id, :product_cd, :amount) |>
    @filter(_.customer_id == "CS018205000001" && _.product_cd != "P071401019") |>
    DataFrame

12×4 DataFrame

Row	sales_ymd	customer_id	product_cd	amount
	Int64	String15	String15	Int64
1	20180911	CS018205000001	P071401012	2200
2	20180414	CS018205000001	P060104007	600
3	20170614	CS018205000001	P050206001	990
4	20170614	CS018205000001	P060702015	108
5	20190216	CS018205000001	P071005024	102
6	20180414	CS018205000001	P071101002	278
7	20190226	CS018205000001	P070902035	168
8	20190924	CS018205000001	P060805001	495
9	20190226	CS018205000001	P071401020	2200
10	20180911	CS018205000001	P071401005	1100
11	20190216	CS018205000001	P040101002	218
12	20190924	CS018205000001	P091503001	280

J-009: 以下の処理において、出力結果を変えずにORをANDに書き換えよ。

df_store.query('not(prefecture_cd == "13" | floor_area > 900)')

df_store[df_store.prefecture_cd .!= "13" .&& df_store.floor_area .<= 900, :]

3×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
2	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0
3	S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.022	35.6612	808.0

# 別解
df_store |>
    @filter(_.prefecture_cd != "13" && _.floor_area <= 900) |>
    DataFrame

3×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
2	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0
3	S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.022	35.6612	808.0

4.　部分文字列の有無による選択（正規表現）

正規表現に一致する行を選択することができる（これも論理式なので）。
- Julia では occursin() を使う。

J-010: 店舗データ（df_store）から、店舗コード（store_cd）が"S14"で始まるものだけ全項目抽出し、10件表示せよ。

df = df_store[occursin.(r"^S14", df_store.store_cd), :]  # occursin のあとに . が必要
first(df, 10)

10×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.633	35.5005	1732.0
2	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.496	35.4592	1495.0
3	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.372	35.5733	1679.0
4	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.499	35.524	1548.0
5	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.496	35.4592	1830.0
6	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.496	35.463	1574.0
7	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.658	35.4216	1341.0
8	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
9	S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.579	35.2964	1838.0
10	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0

# 別解
df_store |>
    @filter(occursin(r"^S14", _.store_cd)) |>
    @take(10) |>
    DataFrame

10×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.633	35.5005	1732.0
2	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.496	35.4592	1495.0
3	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.372	35.5733	1679.0
4	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.499	35.524	1548.0
5	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.496	35.4592	1830.0
6	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.496	35.463	1574.0
7	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.658	35.4216	1341.0
8	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
9	S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.579	35.2964	1838.0
10	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0

J-011: 顧客データ（df_customer）から顧客ID（customer_id）の末尾が1のものだけ全項目抽出し、10件表示せよ。

df = df_customer[occursin.(r"1$", df_customer.customer_id), :]
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS037613000071	六角雅彦	9	不明	1952-04-01	66	136-0076	東京都江東区南砂**********	S13037	20150414	0-00000000-0
2	CS028811000001	堀井かおり	1	女性	1933-03-27	86	245-0016	神奈川県横浜市泉区和泉町**********	S14028	20160115	0-00000000-0
3	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
4	CS028314000011	小菅あおい	1	女性	1983-11-26	35	246-0038	神奈川県横浜市瀬谷区宮沢**********	S14028	20151123	1-20080426-5
5	CS039212000051	藤島恵梨香	1	女性	1997-02-03	22	166-0001	東京都杉並区阿佐谷北**********	S13039	20171121	1-20100215-4
6	CS015412000111	松居奈月	1	女性	1972-10-04	46	136-0071	東京都江東区亀戸**********	S13015	20150629	0-00000000-0
7	CS004702000041	野島洋	0	男性	1943-08-24	75	176-0022	東京都練馬区向山**********	S13004	20170218	0-00000000-0
8	CS041515000001	栗田千夏	1	女性	1967-01-02	52	206-0001	東京都多摩市和田**********	S13041	20160422	E-20100803-F
9	CS029313000221	北条ひかり	1	女性	1987-06-19	31	279-0011	千葉県浦安市美浜**********	S12029	20180810	0-00000000-0
10	CS034312000071	望月奈央	1	女性	1980-09-20	38	213-0026	神奈川県川崎市高津区久末**********	S14034	20160106	0-00000000-0

# 別解2
df_customer |>
    @filter(occursin(r"1$", _.customer_id)) |>
    @take(10) |>
    DataFrame

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS037613000071	六角雅彦	9	不明	1952-04-01	66	136-0076	東京都江東区南砂**********	S13037	20150414	0-00000000-0
2	CS028811000001	堀井かおり	1	女性	1933-03-27	86	245-0016	神奈川県横浜市泉区和泉町**********	S14028	20160115	0-00000000-0
3	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
4	CS028314000011	小菅あおい	1	女性	1983-11-26	35	246-0038	神奈川県横浜市瀬谷区宮沢**********	S14028	20151123	1-20080426-5
5	CS039212000051	藤島恵梨香	1	女性	1997-02-03	22	166-0001	東京都杉並区阿佐谷北**********	S13039	20171121	1-20100215-4
6	CS015412000111	松居奈月	1	女性	1972-10-04	46	136-0071	東京都江東区亀戸**********	S13015	20150629	0-00000000-0
7	CS004702000041	野島洋	0	男性	1943-08-24	75	176-0022	東京都練馬区向山**********	S13004	20170218	0-00000000-0
8	CS041515000001	栗田千夏	1	女性	1967-01-02	52	206-0001	東京都多摩市和田**********	S13041	20160422	E-20100803-F
9	CS029313000221	北条ひかり	1	女性	1987-06-19	31	279-0011	千葉県浦安市美浜**********	S12029	20180810	0-00000000-0
10	CS034312000071	望月奈央	1	女性	1980-09-20	38	213-0026	神奈川県川崎市高津区久末**********	S14034	20160106	0-00000000-0

J-012: 店舗データ（df_store）から、住所 (address) に"横浜市"が含まれるものだけ全項目表示せよ。

df = df_store[occursin.(r"横浜市", df_store.address), :]
df

11×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.633	35.5005	1732.0
2	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.496	35.4592	1495.0
3	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.499	35.524	1548.0
4	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.496	35.4592	1830.0
5	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.496	35.463	1574.0
6	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.658	35.4216	1341.0
7	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
8	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0
9	S14048	中川中央店	14	神奈川県	神奈川県横浜市都筑区中川中央二丁目	カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ	045-123-4051	139.576	35.5491	1657.0
10	S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.659	35.4389	1044.0
11	S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.563	35.5357	1886.0

# 別解
df_store |>
    @filter(occursin(r"横浜市", _.address)) |>
    DataFrame

11×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.633	35.5005	1732.0
2	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.496	35.4592	1495.0
3	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.499	35.524	1548.0
4	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.496	35.4592	1830.0
5	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.496	35.463	1574.0
6	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.658	35.4216	1341.0
7	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
8	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0
9	S14048	中川中央店	14	神奈川県	神奈川県横浜市都筑区中川中央二丁目	カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ	045-123-4051	139.576	35.5491	1657.0
10	S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.659	35.4389	1044.0
11	S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.563	35.5357	1886.0

J-013: 顧客データ（df_customer）から、ステータスコード（status_cd）の先頭がアルファベットのA〜Fで始まるデータを全項目抽出し、10件表示せよ。

df = df_customer[occursin.(r"^[A-F]", df_customer.status_cd), :]
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C
2	CS015414000103	奥野陽子	1	女性	1977-08-09	41	136-0073	東京都江東区北砂**********	S13015	20150722	B-20100609-B
3	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
4	CS029415000023	梅田里穂	1	女性	1976-01-17	43	279-0043	千葉県浦安市富士見**********	S12029	20150610	D-20100918-E
5	CS035415000029	寺沢真希	9	不明	1977-09-27	41	158-0096	東京都世田谷区玉川台**********	S13035	20141220	F-20101029-F
6	CS031415000106	宇野由美子	1	女性	1970-02-26	49	151-0053	東京都渋谷区代々木**********	S13031	20150201	F-20100511-E
7	CS029215000025	石倉美帆	1	女性	1993-09-28	25	279-0022	千葉県浦安市今川**********	S12029	20150708	B-20100820-C
8	CS033605000005	猪股雄太	0	男性	1955-12-05	63	246-0031	神奈川県横浜市瀬谷区瀬谷**********	S14033	20150425	F-20100917-E
9	CS033415000229	板垣菜々美	1	女性	1977-11-07	41	246-0021	神奈川県横浜市瀬谷区二ツ橋町**********	S14033	20150712	F-20100326-E
10	CS008415000145	黒谷麻緒	1	女性	1977-06-27	41	157-0067	東京都世田谷区喜多見**********	S13008	20150829	F-20100622-F

# 別解
df_customer |>
    @filter(occursin(r"^[A-F]", _.status_cd)) |>
    @take(10) |>
    DataFrame

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C
2	CS015414000103	奥野陽子	1	女性	1977-08-09	41	136-0073	東京都江東区北砂**********	S13015	20150722	B-20100609-B
3	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
4	CS029415000023	梅田里穂	1	女性	1976-01-17	43	279-0043	千葉県浦安市富士見**********	S12029	20150610	D-20100918-E
5	CS035415000029	寺沢真希	9	不明	1977-09-27	41	158-0096	東京都世田谷区玉川台**********	S13035	20141220	F-20101029-F
6	CS031415000106	宇野由美子	1	女性	1970-02-26	49	151-0053	東京都渋谷区代々木**********	S13031	20150201	F-20100511-E
7	CS029215000025	石倉美帆	1	女性	1993-09-28	25	279-0022	千葉県浦安市今川**********	S12029	20150708	B-20100820-C
8	CS033605000005	猪股雄太	0	男性	1955-12-05	63	246-0031	神奈川県横浜市瀬谷区瀬谷**********	S14033	20150425	F-20100917-E
9	CS033415000229	板垣菜々美	1	女性	1977-11-07	41	246-0021	神奈川県横浜市瀬谷区二ツ橋町**********	S14033	20150712	F-20100326-E
10	CS008415000145	黒谷麻緒	1	女性	1977-06-27	41	157-0067	東京都世田谷区喜多見**********	S13008	20150829	F-20100622-F

J-014: 顧客データ（df_customer）から、ステータスコード（status_cd）の末尾が数字の1〜9で終わるデータを全項目抽出し、10件表示せよ。

df = df_customer[occursin.(r"[1-9]$", df_customer.status_cd), :]
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS001215000145	田崎美紀	1	女性	1995-03-29	24	144-0055	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2
2	CS033513000180	安斎遥	1	女性	1962-07-11	56	241-0823	神奈川県横浜市旭区善部町**********	S14033	20150728	6-20080506-5
3	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
4	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
5	CS009315000023	皆川文世	1	女性	1980-04-15	38	154-0012	東京都世田谷区駒沢**********	S13009	20150319	5-20080322-1
6	CS015315000033	福士璃奈子	1	女性	1983-03-17	36	135-0043	東京都江東区塩浜**********	S13015	20141024	4-20080219-3
7	CS023513000066	神戸そら	1	女性	1961-12-17	57	210-0005	神奈川県川崎市川崎区東田町**********	S14023	20150915	5-20100524-9
8	CS035513000134	市川美帆	1	女性	1960-03-27	59	156-0053	東京都世田谷区桜**********	S13035	20150227	8-20100711-9
9	CS001515000263	高松夏空	1	女性	1962-11-09	56	144-0051	東京都大田区西蒲田**********	S13001	20160812	1-20100804-1
10	CS040314000027	鶴田きみまろ	9	不明	1986-03-26	33	226-0027	神奈川県横浜市緑区長津田**********	S14040	20150122	2-20080426-4

# 別解
df_customer |>
    @filter(occursin(r"[1-9]$", _.status_cd)) |>
    @take(10) |>
    DataFrame

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS001215000145	田崎美紀	1	女性	1995-03-29	24	144-0055	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2
2	CS033513000180	安斎遥	1	女性	1962-07-11	56	241-0823	神奈川県横浜市旭区善部町**********	S14033	20150728	6-20080506-5
3	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
4	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
5	CS009315000023	皆川文世	1	女性	1980-04-15	38	154-0012	東京都世田谷区駒沢**********	S13009	20150319	5-20080322-1
6	CS015315000033	福士璃奈子	1	女性	1983-03-17	36	135-0043	東京都江東区塩浜**********	S13015	20141024	4-20080219-3
7	CS023513000066	神戸そら	1	女性	1961-12-17	57	210-0005	神奈川県川崎市川崎区東田町**********	S14023	20150915	5-20100524-9
8	CS035513000134	市川美帆	1	女性	1960-03-27	59	156-0053	東京都世田谷区桜**********	S13035	20150227	8-20100711-9
9	CS001515000263	高松夏空	1	女性	1962-11-09	56	144-0051	東京都大田区西蒲田**********	S13001	20160812	1-20100804-1
10	CS040314000027	鶴田きみまろ	9	不明	1986-03-26	33	226-0027	神奈川県横浜市緑区長津田**********	S14040	20150122	2-20080426-4

J-015: 顧客データ（df_customer）から、ステータスコード（status_cd）の先頭がアルファベットのA〜Fで始まり、末尾が数字の1〜9で終わるデータを全項目抽出し、10件表示せよ。

df = df_customer[occursin.(r"^[A-F].*[0-9]$", df_customer.status_cd), :]
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
2	CS022513000105	島村貴美子	1	女性	1962-03-12	57	249-0002	神奈川県逗子市山の根**********	S14022	20150320	A-20091115-7
3	CS001515000096	水野陽子	9	不明	1960-11-29	58	144-0053	東京都大田区蒲田本町**********	S13001	20150614	A-20100724-7
4	CS013615000053	西脇季衣	1	女性	1953-10-18	65	261-0026	千葉県千葉市美浜区幕張西**********	S12013	20150128	B-20100329-6
5	CS020412000161	小宮薫	1	女性	1974-05-21	44	174-0042	東京都板橋区東坂下**********	S13020	20150822	B-20081021-3
6	CS001215000097	竹中あさみ	1	女性	1990-07-25	28	146-0095	東京都大田区多摩川**********	S13001	20170315	A-20100211-2
7	CS035212000007	内村恵梨香	1	女性	1990-12-04	28	152-0023	東京都目黒区八雲**********	S13035	20151013	B-20101018-6
8	CS002515000386	野田コウ	1	女性	1963-05-30	55	185-0013	東京都国分寺市西恋ケ窪**********	S13002	20160410	C-20100127-8
9	CS001615000372	稲垣寿々花	1	女性	1956-10-29	62	144-0035	東京都大田区南蒲田**********	S13001	20170403	A-20100104-1
10	CS032512000121	松井知世	1	女性	1962-09-04	56	210-0011	神奈川県川崎市川崎区富士見**********	S13032	20150727	A-20100103-5

# 別解
df_customer |>
    @filter(occursin(r"^[A-F].*[0-9]$", _.status_cd)) |>
    @take(10) |>
    DataFrame

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
2	CS022513000105	島村貴美子	1	女性	1962-03-12	57	249-0002	神奈川県逗子市山の根**********	S14022	20150320	A-20091115-7
3	CS001515000096	水野陽子	9	不明	1960-11-29	58	144-0053	東京都大田区蒲田本町**********	S13001	20150614	A-20100724-7
4	CS013615000053	西脇季衣	1	女性	1953-10-18	65	261-0026	千葉県千葉市美浜区幕張西**********	S12013	20150128	B-20100329-6
5	CS020412000161	小宮薫	1	女性	1974-05-21	44	174-0042	東京都板橋区東坂下**********	S13020	20150822	B-20081021-3
6	CS001215000097	竹中あさみ	1	女性	1990-07-25	28	146-0095	東京都大田区多摩川**********	S13001	20170315	A-20100211-2
7	CS035212000007	内村恵梨香	1	女性	1990-12-04	28	152-0023	東京都目黒区八雲**********	S13035	20151013	B-20101018-6
8	CS002515000386	野田コウ	1	女性	1963-05-30	55	185-0013	東京都国分寺市西恋ケ窪**********	S13002	20160410	C-20100127-8
9	CS001615000372	稲垣寿々花	1	女性	1956-10-29	62	144-0035	東京都大田区南蒲田**********	S13001	20170403	A-20100104-1
10	CS032512000121	松井知世	1	女性	1962-09-04	56	210-0011	神奈川県川崎市川崎区富士見**********	S13032	20150727	A-20100103-5

J-016: 店舗データ（df_store）から、電話番号（tel_no）が3桁-3桁-4桁のデータを全項目表示せよ。

df = df_store[occursin.(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", df_store.tel_no), :]
first(df, 10)

10×10 DataFrame

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S12014	千草台店	12	千葉県	千葉県千葉市稲毛区千草台一丁目	チバケンチバシイナゲクチグサダイイッチョウメ	043-123-4003	140.118	35.6356	1698.0
2	S13002	国分寺店	13	東京都	東京都国分寺市本多二丁目	トウキョウトコクブンジシホンダニチョウメ	042-123-4008	139.48	35.7057	1735.0
3	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.633	35.5005	1732.0
4	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.496	35.4592	1495.0
5	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.372	35.5733	1679.0
6	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.499	35.524	1548.0
7	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.496	35.4592	1830.0
8	S13052	森野店	13	東京都	東京都町田市森野三丁目	トウキョウトマチダシモリノサンチョウメ	042-123-4030	139.438	35.5529	1087.0
9	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.496	35.463	1574.0
10	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.658	35.4216	1341.0

# 別解
df_store |>
    @filter(occursin(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", _.tel_no)) |>
    DataFrame

34×10 DataFrame

9 rows omitted

Row	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	String7	String31	String	String15	String	String	String15	Float64	Float64	Float64
1	S12014	千草台店	12	千葉県	千葉県千葉市稲毛区千草台一丁目	チバケンチバシイナゲクチグサダイイッチョウメ	043-123-4003	140.118	35.6356	1698.0
2	S13002	国分寺店	13	東京都	東京都国分寺市本多二丁目	トウキョウトコクブンジシホンダニチョウメ	042-123-4008	139.48	35.7057	1735.0
3	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.633	35.5005	1732.0
4	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.496	35.4592	1495.0
5	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.372	35.5733	1679.0
6	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.499	35.524	1548.0
7	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.496	35.4592	1830.0
8	S13052	森野店	13	東京都	東京都町田市森野三丁目	トウキョウトマチダシモリノサンチョウメ	042-123-4030	139.438	35.5529	1087.0
9	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.496	35.463	1574.0
10	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.658	35.4216	1341.0
11	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.592	35.5619	831.0
12	S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.579	35.2964	1838.0
13	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.632	35.5466	890.0
&vellip;	&vellip;	&vellip;	&vellip;	&vellip;	&vellip;	&vellip;	&vellip;	&vellip;	&vellip;	&vellip;
23	S14027	南藤沢店	14	神奈川県	神奈川県藤沢市南藤沢	カナガワケンフジサワシミナミフジサワ	046-123-4041	139.49	35.3376	1521.0
24	S14021	伊勢原店	14	神奈川県	神奈川県伊勢原市伊勢原四丁目	カナガワケンイセハラシイセハラヨンチョウメ	046-123-4035	139.313	35.4017	962.0
25	S14047	相模原店	14	神奈川県	神奈川県相模原市千代田六丁目	カナガワケンサガミハラシチヨダロクチョウメ	042-123-4050	139.375	35.5596	1047.0
26	S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.022	35.6612	808.0
27	S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.659	35.4389	1044.0
28	S12030	八幡店	12	千葉県	千葉県市川市八幡三丁目	チバケンイチカワシヤワタサンチョウメ	047-123-4005	139.924	35.7232	1162.0
29	S14025	大和店	14	神奈川県	神奈川県大和市下和田	カナガワケンヤマトシシモワダ	046-123-4039	139.468	35.4341	1011.0
30	S14045	厚木店	14	神奈川県	神奈川県厚木市中町二丁目	カナガワケンアツギシナカチョウニチョウメ	046-123-4048	139.365	35.4418	980.0
31	S12029	東野店	12	千葉県	千葉県浦安市東野一丁目	チバケンウラヤスシヒガシノイッチョウメ	047-123-4004	139.897	35.6509	1101.0
32	S12053	高洲店	12	千葉県	千葉県浦安市高洲五丁目	チバケンウラヤスシタカスゴチョウメ	047-123-4006	139.918	35.6375	1555.0
33	S14024	三田店	14	神奈川県	神奈川県川崎市多摩区三田四丁目	カナガワケンカワサキシタマクミタヨンチョウメ	044-123-4038	139.542	35.6077	972.0
34	S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.563	35.5357	1886.0

データフレームの並べ替え(sort)

5.　列でソート

J-017: 顧客データ（df_customer）を生年月日（birth_day）で高齢順にソートし、先頭から全項目を10件表示せよ。

df = sort(df_customer, :birth_day)
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS003813000014	村山菜々美	1	女性	1928-11-26	90	182-0007	東京都調布市菊野台**********	S13003	20160214	0-00000000-0
2	CS026813000004	吉村朝陽	1	女性	1928-12-14	90	251-0043	神奈川県藤沢市辻堂元町**********	S14026	20150723	0-00000000-0
3	CS018811000003	熊沢美里	1	女性	1929-01-07	90	204-0004	東京都清瀬市野塩**********	S13018	20150403	0-00000000-0
4	CS027803000004	内村拓郎	0	男性	1929-01-12	90	251-0031	神奈川県藤沢市鵠沼藤が谷**********	S14027	20151227	0-00000000-0
5	CS013801000003	天野拓郎	0	男性	1929-01-15	90	274-0824	千葉県船橋市前原東**********	S12013	20160120	0-00000000-0
6	CS001814000022	鶴田里穂	1	女性	1929-01-28	90	144-0045	東京都大田区南六郷**********	S13001	20161012	A-20090415-7
7	CS016815000002	山元美紀	1	女性	1929-02-22	90	184-0005	東京都小金井市桜町**********	S13016	20150629	C-20090923-C
8	CS009815000003	中田里穂	1	女性	1929-04-08	89	154-0014	東京都世田谷区新町**********	S13009	20150421	D-20091021-E
9	CS012813000013	宇野南朋	1	女性	1929-04-09	89	231-0806	神奈川県横浜市中区本牧町**********	S14012	20150712	0-00000000-0
10	CS005813000015	金谷恵梨香	1	女性	1929-04-09	89	165-0032	東京都中野区鷺宮**********	S13005	20150506	0-00000000-0

# 別解
df_customer |>
    @orderby(_.birth_day) |>
    @take(10) |>
    DataFrame

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS003813000014	村山菜々美	1	女性	1928-11-26	90	182-0007	東京都調布市菊野台**********	S13003	20160214	0-00000000-0
2	CS026813000004	吉村朝陽	1	女性	1928-12-14	90	251-0043	神奈川県藤沢市辻堂元町**********	S14026	20150723	0-00000000-0
3	CS018811000003	熊沢美里	1	女性	1929-01-07	90	204-0004	東京都清瀬市野塩**********	S13018	20150403	0-00000000-0
4	CS027803000004	内村拓郎	0	男性	1929-01-12	90	251-0031	神奈川県藤沢市鵠沼藤が谷**********	S14027	20151227	0-00000000-0
5	CS013801000003	天野拓郎	0	男性	1929-01-15	90	274-0824	千葉県船橋市前原東**********	S12013	20160120	0-00000000-0
6	CS001814000022	鶴田里穂	1	女性	1929-01-28	90	144-0045	東京都大田区南六郷**********	S13001	20161012	A-20090415-7
7	CS016815000002	山元美紀	1	女性	1929-02-22	90	184-0005	東京都小金井市桜町**********	S13016	20150629	C-20090923-C
8	CS009815000003	中田里穂	1	女性	1929-04-08	89	154-0014	東京都世田谷区新町**********	S13009	20150421	D-20091021-E
9	CS012813000013	宇野南朋	1	女性	1929-04-09	89	231-0806	神奈川県横浜市中区本牧町**********	S14012	20150712	0-00000000-0
10	CS005813000015	金谷恵梨香	1	女性	1929-04-09	89	165-0032	東京都中野区鷺宮**********	S13005	20150506	0-00000000-0

J-018: 顧客データ（df_customer）を生年月日（birth_day）で若い順にソートし、先頭から全項目を10件表示せよ。

降順にソートするには ref=true を指定する。

df = sort(df_customer, :birth_day, rev=true)
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS035114000004	大村美里	1	女性	2007-11-25	11	156-0053	東京都世田谷区桜**********	S13035	20150619	6-20091205-6
2	CS022103000002	福山はじめ	9	不明	2007-10-02	11	249-0006	神奈川県逗子市逗子**********	S14022	20160909	0-00000000-0
3	CS002113000009	柴田真悠子	1	女性	2007-09-17	11	184-0014	東京都小金井市貫井南町**********	S13002	20160304	0-00000000-0
4	CS004115000014	松井京子	1	女性	2007-08-09	11	165-0031	東京都中野区上鷺宮**********	S13004	20161120	1-20081231-1
5	CS002114000010	山内遥	1	女性	2007-06-03	11	184-0015	東京都小金井市貫井北町**********	S13002	20160920	6-20100510-1
6	CS025115000002	小柳夏希	1	女性	2007-04-18	11	245-0018	神奈川県横浜市泉区上飯田町**********	S14025	20160116	D-20100913-D
7	CS002113000025	広末まなみ	1	女性	2007-03-30	12	184-0015	東京都小金井市貫井北町**********	S13002	20171030	0-00000000-0
8	CS033112000003	長野美紀	1	女性	2007-03-22	12	245-0051	神奈川県横浜市戸塚区名瀬町**********	S14033	20150606	0-00000000-0
9	CS007115000006	福岡瞬	1	女性	2007-03-10	12	285-0845	千葉県佐倉市西志津**********	S12007	20151118	F-20101016-F
10	CS014113000008	矢口莉緒	1	女性	2007-03-05	12	260-0041	千葉県千葉市中央区東千葉**********	S12014	20150622	3-20091108-6

# 別解
df_customer |>
    @orderby_descending(_.birth_day) |>
    @take(10) |>
    DataFrame

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS035114000004	大村美里	1	女性	2007-11-25	11	156-0053	東京都世田谷区桜**********	S13035	20150619	6-20091205-6
2	CS022103000002	福山はじめ	9	不明	2007-10-02	11	249-0006	神奈川県逗子市逗子**********	S14022	20160909	0-00000000-0
3	CS002113000009	柴田真悠子	1	女性	2007-09-17	11	184-0014	東京都小金井市貫井南町**********	S13002	20160304	0-00000000-0
4	CS004115000014	松井京子	1	女性	2007-08-09	11	165-0031	東京都中野区上鷺宮**********	S13004	20161120	1-20081231-1
5	CS002114000010	山内遥	1	女性	2007-06-03	11	184-0015	東京都小金井市貫井北町**********	S13002	20160920	6-20100510-1
6	CS025115000002	小柳夏希	1	女性	2007-04-18	11	245-0018	神奈川県横浜市泉区上飯田町**********	S14025	20160116	D-20100913-D
7	CS002113000025	広末まなみ	1	女性	2007-03-30	12	184-0015	東京都小金井市貫井北町**********	S13002	20171030	0-00000000-0
8	CS033112000003	長野美紀	1	女性	2007-03-22	12	245-0051	神奈川県横浜市戸塚区名瀬町**********	S14033	20150606	0-00000000-0
9	CS007115000006	福岡瞬	1	女性	2007-03-10	12	285-0845	千葉県佐倉市西志津**********	S12007	20151118	F-20101016-F
10	CS014113000008	矢口莉緒	1	女性	2007-03-05	12	260-0041	千葉県千葉市中央区東千葉**********	S12014	20150622	3-20091108-6

6.　順位付けをしてソート

順位付けは StatsBase パッケージに tiedrank(), competerank(), ordinalrank() がある。

J-019: レシート明細データ（df_receipt）に対し、1件あたりの売上金額（amount）が高い順にランクを付与し、先頭から10件表示せよ。項目は顧客ID（customer_id）、売上金額（amount）、付与したランクを表示させること。なお、売上金額（amount）が等しい場合は同一順位を付与するものとする。

using StatsBase
df = select(df_receipt, [:customer_id, :amount])
df.rank = competerank(df_receipt.amount, rev=true)
df = sort(df, :rank)
first(df, 10)

10×3 DataFrame

Row	customer_id	amount	rank
	String15	Int64	Int64
1	CS011415000006	10925	1
2	ZZ000000000000	6800	2
3	CS028605000002	5780	3
4	CS015515000034	5480	4
5	ZZ000000000000	5480	4
6	ZZ000000000000	5480	4
7	ZZ000000000000	5440	7
8	CS021515000089	5440	7
9	CS015515000083	5280	9
10	CS017414000114	5280	9

# 別解
using StatsBase

df_receipt |>
    x -> hcat(x, DataFrame(rank = competerank(df_receipt[:, :amount], rev=true))) |>
    @select(:customer_id, :amount, :rank) |>
    @orderby(_.rank) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	amount	rank
	String15	Int64	Int64
1	CS011415000006	10925	1
2	ZZ000000000000	6800	2
3	CS028605000002	5780	3
4	CS015515000034	5480	4
5	ZZ000000000000	5480	4
6	ZZ000000000000	5480	4
7	ZZ000000000000	5440	7
8	CS021515000089	5440	7
9	CS015515000083	5280	9
10	CS017414000114	5280	9

J-020: レシート明細データ（df_receipt）に対し、1件あたりの売上金額（amount）が高い順にランクを付与し、先頭から10件表示せよ。項目は顧客ID（customer_id）、売上金額（amount）、付与したランクを表示させること。なお、売上金額（amount）が等しい場合でも別順位を付与すること。

using StatsBase
df = select(df_receipt, [:customer_id, :amount])
df.rank = ordinalrank(df_receipt.amount, rev=true)
df = sort(df, :rank)
first(df, 10)

10×3 DataFrame

Row	customer_id	amount	rank
	String15	Int64	Int64
1	CS011415000006	10925	1
2	ZZ000000000000	6800	2
3	CS028605000002	5780	3
4	CS015515000034	5480	4
5	ZZ000000000000	5480	5
6	ZZ000000000000	5480	6
7	ZZ000000000000	5440	7
8	CS021515000089	5440	8
9	CS015515000083	5280	9
10	CS017414000114	5280	10

# 別解
df_receipt |>
    x -> hcat(x, DataFrame(rank = ordinalrank(df_receipt[:, :amount], rev=true))) |>
    @select(:customer_id, :amount, :rank) |>
    @orderby(_.rank) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	amount	rank
	String15	Int64	Int64
1	CS011415000006	10925	1
2	ZZ000000000000	6800	2
3	CS028605000002	5780	3
4	CS015515000034	5480	4
5	ZZ000000000000	5480	5
6	ZZ000000000000	5480	6
7	ZZ000000000000	5440	7
8	CS021515000089	5440	8
9	CS015515000083	5280	9
10	CS017414000114	5280	10

件数（行数）の確認

J-021: レシート明細データ（df_receipt）に対し、件数をカウントせよ。

nrow(df_receipt)

# 別解
df_receipt |> nrow

J-022: レシート明細データ（df_receipt）の顧客ID（customer_id）に対し、ユニーク件数をカウントせよ。

ユニークなデータ，データフレームを作るには unique() を使う。

length(unique(df_receipt.customer_id))

# 別解
df_receipt |>
    @unique(_.customer_id) |>
    collect |>
    length

# 別解2
df_receipt |>
    @select(:customer_id) |>
    @unique() |>
    DataFrame |>
    nrow

# 別解3
df_receipt |>
    @select(:customer_id) |>
    @unique() |>
    collect |>
    length

グループ化して処理

groupby() によってグループ化して，それぞれのグループ化データフレームを combine() によって処理して，その結果をデータフレームとして生成する。

7.　合計

J-023: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに売上金額（amount）と売上数量（quantity）を合計せよ。

gd = groupby(df_receipt, :store_cd);
df = combine(gd, :amount => sum => :amount, :quantity => sum => :quantity)
sort(df, :store_cd)

52×3 DataFrame

27 rows omitted

Row	store_cd	amount	quantity
	String7	Int64	Int64
1	S12007	638761	2099
2	S12013	787513	2425
3	S12014	725167	2358
4	S12029	794741	2555
5	S12030	684402	2403
6	S13001	811936	2347
7	S13002	727821	2340
8	S13003	764294	2197
9	S13004	779373	2390
10	S13005	629876	2004
11	S13008	809288	2491
12	S13009	808870	2486
13	S13015	780873	2248
&vellip;	&vellip;	&vellip;	&vellip;
41	S14028	786145	2458
42	S14033	725318	2282
43	S14034	653681	2024
44	S14036	203694	635
45	S14040	701858	2233
46	S14042	534689	1935
47	S14045	458484	1398
48	S14046	412646	1354
49	S14047	338329	1041
50	S14048	234276	769
51	S14049	230808	788
52	S14050	167090	580

# 別解
df_receipt |>
    @groupby(_.store_cd) |>
    @map({store_cd=key(_), amount=sum(_.amount), quantity=sum(_.quantity)}) |>
    @orderby(_.store_cd) |>
    DataFrame

52×3 DataFrame

27 rows omitted

Row	store_cd	amount	quantity
	String7	Int64	Int64
1	S12007	638761	2099
2	S12013	787513	2425
3	S12014	725167	2358
4	S12029	794741	2555
5	S12030	684402	2403
6	S13001	811936	2347
7	S13002	727821	2340
8	S13003	764294	2197
9	S13004	779373	2390
10	S13005	629876	2004
11	S13008	809288	2491
12	S13009	808870	2486
13	S13015	780873	2248
&vellip;	&vellip;	&vellip;	&vellip;
41	S14028	786145	2458
42	S14033	725318	2282
43	S14034	653681	2024
44	S14036	203694	635
45	S14040	701858	2233
46	S14042	534689	1935
47	S14045	458484	1398
48	S14046	412646	1354
49	S14047	338329	1041
50	S14048	234276	769
51	S14049	230808	788
52	S14050	167090	580

8.　最大値

J-024: レシート明細データ（df_receipt）に対し、顧客ID（customer_id）ごとに最も新しい売上年月日（sales_ymd）を求め、10件表示せよ。

gd = groupby(df_receipt, :customer_id);
df = combine(gd, :sales_ymd => maximum)
df = sort(df, :customer_id)  # ソートする必要がある場合
first(df, 10)

10×2 DataFrame

Row	customer_id	sales_ymd_maximum
	String15	Int64
1	CS001113000004	20190308
2	CS001114000005	20190731
3	CS001115000010	20190405
4	CS001205000004	20190625
5	CS001205000006	20190224
6	CS001211000025	20190322
7	CS001212000027	20170127
8	CS001212000031	20180906
9	CS001212000046	20170811
10	CS001212000070	20191018

# 別解
df_receipt |>
    @groupby(_.customer_id) |>
    @map({customer_id=key(_), sales_ymd_maximum=maximum(_.sales_ymd)}) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	sales_ymd_maximum
	String15	Int64
1	CS001113000004	20190308
2	CS001114000005	20190731
3	CS001115000010	20190405
4	CS001205000004	20190625
5	CS001205000006	20190224
6	CS001211000025	20190322
7	CS001212000027	20170127
8	CS001212000031	20180906
9	CS001212000046	20170811
10	CS001212000070	20191018

9.　最小値

J-025: レシート明細データ（df_receipt）に対し、顧客ID（customer_id）ごとに最も古い売上年月日（sales_ymd）を求め、10件表示せよ。

gd = groupby(df_receipt, :customer_id);
df = combine(gd, :sales_ymd => minimum)
df2 = sort(df, :customer_id)
first(df2, 10)

10×2 DataFrame

Row	customer_id	sales_ymd_minimum
	String15	Int64
1	CS001113000004	20190308
2	CS001114000005	20180503
3	CS001115000010	20171228
4	CS001205000004	20170914
5	CS001205000006	20180207
6	CS001211000025	20190322
7	CS001212000027	20170127
8	CS001212000031	20180906
9	CS001212000046	20170811
10	CS001212000070	20191018

# 別解
df_receipt |>
    @groupby(_.customer_id) |>
    @map({customer_id=key(_), sales_ymd_minimum=minimum(_.sales_ymd)}) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	sales_ymd_minimum
	String15	Int64
1	CS001113000004	20190308
2	CS001114000005	20180503
3	CS001115000010	20171228
4	CS001205000004	20170914
5	CS001205000006	20180207
6	CS001211000025	20190322
7	CS001212000027	20170127
8	CS001212000031	20180906
9	CS001212000046	20170811
10	CS001212000070	20191018

10.　比較演算，論理演算

J-026: レシート明細データ（df_receipt）に対し、顧客ID（customer_id）ごとに最も新しい売上年月日（sales_ymd）と古い売上年月日を求め、両者が異なるデータを10件表示せよ。

gd = groupby(df_receipt, :customer_id);
df = combine(gd, :sales_ymd => maximum => :max_ymd, :sales_ymd => minimum => :min_ymd) |>
    @filter(_.min_ymd != _.max_ymd) |>
    @orderby(_.customer_id) |>
    DataFrame
first(df, 10)

10×3 DataFrame

Row	customer_id	max_ymd	min_ymd
	String15	Int64	Int64
1	CS001114000005	20190731	20180503
2	CS001115000010	20190405	20171228
3	CS001205000004	20190625	20170914
4	CS001205000006	20190224	20180207
5	CS001214000009	20190902	20170306
6	CS001214000017	20191006	20180828
7	CS001214000048	20190929	20171109
8	CS001214000052	20190617	20180208
9	CS001215000005	20181021	20170206
10	CS001215000040	20171022	20170214

# 別解
df_receipt |>
    @groupby(_.customer_id) |>
    @map({customer_id=key(_), max_ymd=maximum(_.sales_ymd), min_ymd=minimum(_.sales_ymd)}) |>
    @filter(_.min_ymd != _.max_ymd) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	max_ymd	min_ymd
	String15	Int64	Int64
1	CS001114000005	20190731	20180503
2	CS001115000010	20190405	20171228
3	CS001205000004	20190625	20170914
4	CS001205000006	20190224	20180207
5	CS001214000009	20190902	20170306
6	CS001214000017	20191006	20180828
7	CS001214000048	20190929	20171109
8	CS001214000052	20190617	20180208
9	CS001215000005	20181021	20170206
10	CS001215000040	20171022	20170214

11.　平均値

J-027: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに売上金額（amount）の平均を計算し、降順でTOP5を表示せよ。

using StatsBase

gd = groupby(df_receipt, :store_cd);
combine(gd, :amount => mean => :mean_amount) |>
    @orderby_descending(_.mean_amount) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	mean_amount
	String7	Float64
1	S13052	402.867
2	S13015	351.112
3	S13003	350.916
4	S14010	348.791
5	S13001	348.47

# 別解
using StatsBase

df_receipt |>
    @groupby(_.store_cd) |>
    @map({store_cd=key(_), mean_amount=mean(_.amount)}) |>
    @orderby_descending(_.mean_amount) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	mean_amount
	String7	Float64
1	S13052	402.867
2	S13015	351.112
3	S13003	350.916
4	S14010	348.791
5	S13001	348.47

12.　中央値

J-028: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに売上金額（amount）の中央値を計算し、降順でTOP5を表示せよ。

gd = groupby(df_receipt, :store_cd);
combine(gd, :amount => median => :median_amount) |>
    @orderby_descending(_.median_amount) |>
    @thenby(_.store_cd) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	median_amount
	String7	Float64
1	S13052	190.0
2	S14010	188.0
3	S14050	185.0
4	S13003	180.0
5	S13018	180.0

13.　最頻値

J-029: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに商品コード（product_cd）の最頻値を求め、10件表示させよ。

gd = groupby(df_receipt, :store_cd);
combine(gd, :product_cd => mode) |>
    @orderby(_.store_cd) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	store_cd	product_cd_mode
	String7	String15
1	S12007	P060303001
2	S12013	P060303001
3	S12014	P060303001
4	S12029	P060303001
5	S12030	P060303001
6	S13001	P060303001
7	S13002	P060303001
8	S13003	P071401001
9	S13004	P060303001
10	S13005	P040503001

# 別解
using StatsBase

df_receipt |>
    @groupby(_.store_cd) |>
    @map({store_cd=key(_), product_cd_mode=mode(_.product_cd)}) |>
    @orderby(_.store_cd) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	store_cd	product_cd_mode
	String7	String15
1	S12007	P060303001
2	S12013	P060303001
3	S12014	P060303001
4	S12029	P060303001
5	S12030	P060303001
6	S13001	P060303001
7	S13002	P060303001
8	S13003	P071401001
9	S13004	P060303001
10	S13005	P040503001

14.　分散

J-030: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに売上金額（amount）の分散を計算し、降順で5件表示せよ。

gd = groupby(df_receipt, :store_cd);
combine(gd, :amount => (x -> var(x, corrected=false)) => :var_amount) |>
    @orderby_descending(_.var_amount) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	var_amount
	String7	Float64
1	S13052	4.40089e5
2	S14011	3.06315e5
3	S14034	2.9692e5
4	S13001	295432.0
5	S13015	2.95294e5

# 別解
df_receipt |>
    @groupby(_.store_cd) |>
    @map({store_cd=key(_), var_amount=var(_.amount, corrected=false)}) |>
    @orderby_descending(_.var_amount) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	var_amount
	String7	Float64
1	S13052	4.40089e5
2	S14011	3.06315e5
3	S14034	2.9692e5
4	S13001	295432.0
5	S13015	2.95294e5

15.　標準偏差

J-031: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに売上金額（amount）の標準偏差を計算し、降順で5件表示せよ。

gd = groupby(df_receipt, :store_cd);
combine(gd, :amount => (x -> std(x, corrected=false)) => :std_amount) |>
    @orderby_descending(_.std_amount) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	std_amount
	String7	Float64
1	S13052	663.392
2	S14011	553.457
3	S14034	544.904
4	S13001	543.537
5	S13015	543.41

# 別解
df_receipt |>
    @groupby(_.store_cd) |>
    @map({store_cd=key(_), std_amount=std(_.amount, corrected=false)}) |>
    @orderby_descending(_.std_amount) |>
    @take(5) |>
    DataFrame

5×2 DataFrame

Row	store_cd	std_amount
	String7	Float64
1	S13052	663.392
2	S14011	553.457
3	S14034	544.904
4	S13001	543.537
5	S13015	543.41

16.　パーセンタイル値（第1四分位数，中央値，第3四分位数）

J-032: レシート明細データ（df_receipt）の売上金額（amount）について、25％刻みでパーセンタイル値を求めよ。

quantile(df_receipt.amount, 0.25:0.25:1)

4-element Vector{Float64}:
   102.0
   170.0
   288.0
 10925.0

# 別解
combine(df_receipt, :amount => x -> quantile(x, 0.25:0.25:1))

4×1 DataFrame

Row	amount_function
	Float64
1	102.0
2	170.0
3	288.0
4	10925.0

17.　比較演算，論理演算

J-033: レシート明細データ（df_receipt）に対し、店舗コード（store_cd）ごとに売上金額（amount）の平均を計算し、330以上のものを抽出せよ。

gd = groupby(df_receipt, :store_cd);
combine(gd, :amount => mean => :mean_amount) |>
    @filter(_.mean_amount >= 330) |>
    @orderby(_.store_cd) |>
    DataFrame

13×2 DataFrame

Row	store_cd	mean_amount
	String7	Float64
1	S12013	330.194
2	S13001	348.47
3	S13003	350.916
4	S13004	330.944
5	S13015	351.112
6	S13019	330.209
7	S13020	337.88
8	S13052	402.867
9	S14010	348.791
10	S14011	335.718
11	S14026	332.341
12	S14045	330.082
13	S14047	330.077

# 別解
df_receipt |>
    @groupby(_.store_cd) |>
    @map({store_cd=key(_), mean_amount=mean(_.amount)}) |>
    @filter(_.mean_amount >= 330) |>
    @orderby(_.store_cd) |>
    DataFrame

13×2 DataFrame

Row	store_cd	mean_amount
	String7	Float64
1	S12013	330.194
2	S13001	348.47
3	S13003	350.916
4	S13004	330.944
5	S13015	351.112
6	S13019	330.209
7	S13020	337.88
8	S13052	402.867
9	S14010	348.791
10	S14011	335.718
11	S14026	332.341
12	S14045	330.082
13	S14047	330.077

18.　グループ化されたデータフレームごとの集計結果との比較

J-034: レシート明細データ（df_receipt）に対し、顧客ID（customer_id）ごとに売上金額（amount）を合計して全顧客の平均を求めよ。ただし、顧客IDが"Z"から始まるものは非会員を表すため、除外して計算すること。

using StatsBase

df2 = df_receipt |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame;

gd = groupby(df2, :customer_id);

df3 = combine(gd, :amount => sum => :sum_amount)

using StatsBase
mean(df3.sum_amount)

2547.742234529256

# 別解
df2 = df_receipt |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    @groupby(_.customer_id) |>
    @map({sum=sum(_.amount)}) |>
    DataFrame |>
    x -> mean(x.sum)

2547.742234529256

J-035: レシート明細データ（df_receipt）に対し、顧客ID（customer_id）ごとに売上金額（amount）を合計して全顧客の平均を求め、平均以上に買い物をしている顧客を抽出し、10件表示せよ。ただし、顧客IDが"Z"から始まるものは非会員を表すため、除外して計算すること。

df2 = df_receipt |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame;

gd = groupby(df2, :customer_id);

df3 = combine(gd, :amount => sum => :sum_amount)

using StatsBase
mean_amount = mean(df3.sum_amount)

2547.742234529256

df3 |>
    @filter(_.sum_amount > mean_amount) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	sum_amount
	String15	Int64
1	CS001115000010	3044
2	CS001205000006	3337
3	CS001214000009	4685
4	CS001214000017	4132
5	CS001214000052	5639
6	CS001215000040	3496
7	CS001304000006	3726
8	CS001305000005	3485
9	CS001305000011	4370
10	CS001315000180	3300

# 別解
df2 = df_receipt |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    @groupby(_.customer_id) |>
    @map({customer_id=_.customer_id[1], sum_amount=sum(_.amount)}) |>
    DataFrame

grandmean = mean(df2.sum_amount)

df2 |>
    @filter(_.sum_amount > grandmean) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	sum_amount
	String15	Int64
1	CS001115000010	3044
2	CS001205000006	3337
3	CS001214000009	4685
4	CS001214000017	4132
5	CS001214000052	5639
6	CS001215000040	3496
7	CS001304000006	3726
8	CS001305000005	3485
9	CS001305000011	4370
10	CS001315000180	3300

データフレームの結合(join)

19.　`innnerjoin()`

J-036: レシート明細データ（df_receipt）と店舗データ（df_store）を内部結合し、レシート明細データの全項目と店舗データの店舗名（store_name）を10件表示せよ。

innerjoin(df_receipt, df_store[!, [:store_cd, :store_name]], on=:store_cd) |> (x -> first(x, 10))

10×10 DataFrame

Row	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount	store_name
	Int64	Int64	String7	Int64	Int64	String15	String15	Int64	Int64	String31
1	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158	葛が谷店
2	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81	成城店
3	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170	二ツ橋店
4	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25	新山下店
5	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90	大和店
6	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138	狛江店
7	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30	三田店
8	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128	長津田店
9	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770	十条仲原店
10	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680	南藤沢店

# 別解
df_receipt |>
    @join(df_store, _.store_cd, _.store_cd,
        {_.sales_ymd, _.sales_epoch, _.store_cd, _.receipt_no,
        _.receipt_sub_no, _.customer_id, _.product_cd, _.quantity,
        _.amount, __.store_name}) |>
    @take(10) |>
    DataFrame

10×10 DataFrame

Row	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount	store_name
	Int64	Int64	String7	Int64	Int64	String15	String15	Int64	Int64	String31
1	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158	葛が谷店
2	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81	成城店
3	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170	二ツ橋店
4	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25	新山下店
5	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90	大和店
6	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138	狛江店
7	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30	三田店
8	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128	長津田店
9	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770	十条仲原店
10	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680	南藤沢店

J-037: 商品データ（df_product）とカテゴリデータ（df_category）を内部結合し、商品データの全項目とカテゴリデータのカテゴリ小区分名（category_small_name）を10件表示せよ。

first(innerjoin(df_product, df_category[!, [:category_small_name, :category_small_cd]], on=:category_small_cd, makeunique=true), 10)

10×7 DataFrame

Row	product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	category_small_name
	String15	String	String	String	Int64?	Int64?	String
1	P040101001	04	0401	040101	198	149	弁当類
2	P040101002	04	0401	040101	218	164	弁当類
3	P040101003	04	0401	040101	230	173	弁当類
4	P040101004	04	0401	040101	248	186	弁当類
5	P040101005	04	0401	040101	268	201	弁当類
6	P040101006	04	0401	040101	298	224	弁当類
7	P040101007	04	0401	040101	338	254	弁当類
8	P040101008	04	0401	040101	420	315	弁当類
9	P040101009	04	0401	040101	498	374	弁当類
10	P040101010	04	0401	040101	580	435	弁当類

# 別解
df_product |>
    @join(df_category, _.category_small_cd, _.category_small_cd,
        {_.product_cd, _.category_major_cd, _.category_medium_cd,
        _.category_small_cd, _.unit_price, _.unit_cost,
        __.category_small_name}) |>
    @take(10) |>
    DataFrame

10×7 DataFrame

Row	product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	category_small_name
	String15	String	String	String	Int64?	Int64?	String
1	P040101001	04	0401	040101	198	149	弁当類
2	P040101002	04	0401	040101	218	164	弁当類
3	P040101003	04	0401	040101	230	173	弁当類
4	P040101004	04	0401	040101	248	186	弁当類
5	P040101005	04	0401	040101	268	201	弁当類
6	P040101006	04	0401	040101	298	224	弁当類
7	P040101007	04	0401	040101	338	254	弁当類
8	P040101008	04	0401	040101	420	315	弁当類
9	P040101009	04	0401	040101	498	374	弁当類
10	P040101010	04	0401	040101	580	435	弁当類

20.　`leftjoin()`

J-038: 顧客データ（df_customer）とレシート明細データ（df_receipt）から、顧客ごとの売上金額合計を求め、10件表示せよ。ただし、売上実績がない顧客については売上金額を0として表示させること。また、顧客は性別コード（gender_cd）が女性（1）であるものを対象とし、非会員（顧客IDが"Z"から始まるもの）は除外すること。

df = leftjoin(df_customer[!, [:customer_id, :gender_cd]], df_receipt[!, [:customer_id, :amount]], on=:customer_id)
filter!(:customer_id => x -> !occursin(r"^Z", x), df)
filter!(:gender_cd => x -> x == "1", df);
df2 = groupby(df, :customer_id)
df3 = combine(df2, :amount => sum => :sum_amount);
df3.sum_amount .= coalesce.(df3.sum_amount, 0);  # coalesce() は missing の置換に使える
first(sort(df3, :customer_id), 10)

10×2 DataFrame

Row	customer_id	sum_amount
	String15	Int64
1	CS001112000009	0
2	CS001112000019	0
3	CS001112000021	0
4	CS001112000023	0
5	CS001112000024	0
6	CS001112000029	0
7	CS001112000030	0
8	CS001113000004	1298
9	CS001113000010	0
10	CS001114000005	626

# 別解
leftjoin(df_customer[!, [:customer_id, :gender_cd]], df_receipt[!, [:customer_id, :amount]], on=:customer_id) |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    @filter(_.gender_cd == "1") |>
    @groupby(_.customer_id) |>
    @map({customer_id=key(_), sum_amount=sum(_.amount)}) |>
    @replacena(:sum_amount => 0) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	sum_amount
	String15	Int64
1	CS001112000009	0
2	CS001112000019	0
3	CS001112000021	0
4	CS001112000023	0
5	CS001112000024	0
6	CS001112000029	0
7	CS001112000030	0
8	CS001113000004	1298
9	CS001113000010	0
10	CS001114000005	626

21.　`outerjoin()`

J-039: レシート明細データ（df_receipt）から、売上日数の多い顧客の上位20件を抽出したデータと、売上金額合計の多い顧客の上位20件を抽出したデータをそれぞれ作成し、さらにその2つを完全外部結合せよ。ただし、非会員（顧客IDが"Z"から始まるもの）は除外すること。

df2 = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df3 = sort(combine(df2, :sales_ymd => (x -> length(unique(x))) => :come_days), :come_days, rev=true)[1:20, :];
df4 = sort(combine(df2, :amount => sum => :sum_amount), :sum_amount, rev=true)[1:20, :];

io = IOContext(stdout, :limit => false)
println(outerjoin(df3, df4, on=:customer_id, makeunique=true))
io = IOContext(stdout, :limit => true)

[1m34×3 DataFrame[0m
[1m Row [0m│[1m customer_id    [0m[1m come_days [0m[1m sum_amount [0m
     │[90m String15       [0m[90m Int64?    [0m[90m Int64?     [0m
─────┼───────────────────────────────────────
   1 │ CS017415000097         20       23086
   2 │ CS015415000185         22       20153
   3 │ CS031414000051         19       19202
   4 │ CS028415000007         21       19127
   5 │ CS010214000010         22       18585
   6 │ CS016415000141         20       18372
   7 │ CS040214000008         23 [90m    missing [0m
   8 │ CS010214000002         21 [90m    missing [0m
   9 │ CS039414000052         19 [90m    missing [0m
  10 │ CS021514000045         19 [90m    missing [0m
  11 │ CS022515000226         19 [90m    missing [0m
  12 │ CS014214000023         19 [90m    missing [0m
  13 │ CS021515000172         19 [90m    missing [0m
  14 │ CS014415000077         18 [90m    missing [0m
  15 │ CS030214000008         18 [90m    missing [0m
  16 │ CS021515000056         18 [90m    missing [0m
  17 │ CS031414000073         18 [90m    missing [0m
  18 │ CS032415000209         18 [90m    missing [0m
  19 │ CS007515000107         18 [90m    missing [0m
  20 │ CS021515000211         18 [90m    missing [0m
  21 │ CS001605000009 [90m   missing [0m      18925
  22 │ CS006515000023 [90m   missing [0m      18372
  23 │ CS011414000106 [90m   missing [0m      18338
  24 │ CS038415000104 [90m   missing [0m      17847
  25 │ CS035414000024 [90m   missing [0m      17615
  26 │ CS021515000089 [90m   missing [0m      17580
  27 │ CS032414000072 [90m   missing [0m      16563
  28 │ CS016415000101 [90m   missing [0m      16348
  29 │ CS011415000006 [90m   missing [0m      16094
  30 │ CS034415000047 [90m   missing [0m      16083
  31 │ CS007514000094 [90m   missing [0m      15735
  32 │ CS009414000059 [90m   missing [0m      15492
  33 │ CS030415000034 [90m   missing [0m      15468
  34 │ CS015515000034 [90m   missing [0m      15300

IOContext(Base.PipeEndpoint(RawFD(40) open, 0 bytes waiting))

# 別解
df2 = df_receipt |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame
gd = groupby(df2, :customer_id)
df3 = combine(gd, :sales_ymd => (x -> length(unique(x))) => :come_days) |>
    @orderby_descending(_.come_days) |>
    @take(20) |>
    DataFrame
df4 = combine(gd, :amount => sum) |>
    @orderby_descending(_.amount_sum) |>
    @take(20) |>
    DataFrame;

io = IOContext(stdout, :limit => false)
println(outerjoin(df3, df4, on=:customer_id))
io = IOContext(stdout, :limit => true)

[1m34×3 DataFrame[0m
[1m Row [0m│[1m customer_id    [0m[1m come_days [0m[1m amount_sum [0m
     │[90m String15       [0m[90m Int64?    [0m[90m Int64?     [0m
─────┼───────────────────────────────────────
   1 │ CS017415000097         20       23086
   2 │ CS015415000185         22       20153
   3 │ CS031414000051         19       19202
   4 │ CS028415000007         21       19127
   5 │ CS010214000010         22       18585
   6 │ CS016415000141         20       18372
   7 │ CS040214000008         23 [90m    missing [0m
   8 │ CS010214000002         21 [90m    missing [0m
   9 │ CS039414000052         19 [90m    missing [0m
  10 │ CS021514000045         19 [90m    missing [0m
  11 │ CS022515000226         19 [90m    missing [0m
  12 │ CS014214000023         19 [90m    missing [0m
  13 │ CS021515000172         19 [90m    missing [0m
  14 │ CS014415000077         18 [90m    missing [0m
  15 │ CS030214000008         18 [90m    missing [0m
  16 │ CS021515000056         18 [90m    missing [0m
  17 │ CS031414000073         18 [90m    missing [0m
  18 │ CS032415000209         18 [90m    missing [0m
  19 │ CS007515000107         18 [90m    missing [0m
  20 │ CS021515000211         18 [90m    missing [0m
  21 │ CS001605000009 [90m   missing [0m      18925
  22 │ CS006515000023 [90m   missing [0m      18372
  23 │ CS011414000106 [90m   missing [0m      18338
  24 │ CS038415000104 [90m   missing [0m      17847
  25 │ CS035414000024 [90m   missing [0m      17615
  26 │ CS021515000089 [90m   missing [0m      17580
  27 │ CS032414000072 [90m   missing [0m      16563
  28 │ CS016415000101 [90m   missing [0m      16348
  29 │ CS011415000006 [90m   missing [0m      16094
  30 │ CS034415000047 [90m   missing [0m      16083
  31 │ CS007514000094 [90m   missing [0m      15735
  32 │ CS009414000059 [90m   missing [0m      15492
  33 │ CS030415000034 [90m   missing [0m      15468
  34 │ CS015515000034 [90m   missing [0m      15300

IOContext(Base.PipeEndpoint(RawFD(40) open, 0 bytes waiting))

J-040: 全ての店舗と全ての商品を組み合わせたデータを作成したい。店舗データ（df_store）と商品データ（df_product）を直積し、件数を計算せよ。

件数を計算するだけなら，それぞれの件数の積である。

nrow(df_store) * nrow(df_product)

# 別解
df_store2 = copy(df_store);
insertcols!(df_store2, 1, :key => 0);

df_product2 = copy(df_product);
insertcols!(df_product2, 1, :key => 0);

outerjoin(df_store2, df_product2, on=:key) |>
    nrow

差分

J-041: レシート明細データ（df_receipt）の売上金額（amount）を日付（sales_ymd）ごとに集計し、前回売上があった日からの売上金額増減を計算せよ。そして結果を10件表示せよ。

gd = groupby(df_receipt, :sales_ymd)
df = combine(gd, :amount => sum) |>
    @orderby(_.sales_ymd) |>
    DataFrame;
df2 = df[1:end-1, :] |>
    @rename(:sales_ymd => :lag_sales_ymd, :amount_sum => :lag_amount_sum) |>
    DataFrame;

df4 = vcat(DataFrame(lag_sales_ymd = missing, lag_amount_sum = missing), df2);

hcat(df, df4) |>
    @mutate(diff_amount = _.amount_sum - _.lag_amount_sum) |>
    @take(10) |>
    DataFrame

10×5 DataFrame

Row	sales_ymd	amount_sum	lag_sales_ymd	lag_amount_sum	diff_amount
	Int64	Int64	Int64?	Int64?	Int64?
1	20170101	33723	missing	missing	missing
2	20170102	24165	20170101	33723	-9558
3	20170103	27503	20170102	24165	3338
4	20170104	36165	20170103	27503	8662
5	20170105	37830	20170104	36165	1665
6	20170106	32387	20170105	37830	-5443
7	20170107	23415	20170106	32387	-8972
8	20170108	24737	20170107	23415	1322
9	20170109	26718	20170108	24737	1981
10	20170110	20143	20170109	26718	-6575

# 別解
gd = groupby(df_receipt, :sales_ymd);
df = combine(gd, :amount => sum) |>
    @orderby(_.sales_ymd) |>
    DataFrame
d = Vector{Union{Int64, Missing}}(missing, nrow(df))
d[2:end] = diff(df[:, :amount_sum]);
DataFrame(sales_ymd = df.sales_ymd, diff_amount = d) |>
    @take(10)

sales_ymd	diff_amount
20170101	#NA
20170102	-9558
20170103	3338
20170104	8662
20170105	1665
20170106	-5443
20170107	-8972
20170108	1322
20170109	1981
20170110	-6575

1 行目（amount_sum が missing）が必要なければ，以下よりもさらに簡単になる。

# 別解2
gd = groupby(df_receipt, :sales_ymd);
df = combine(gd, :amount => sum) |>
    @orderby(_.sales_ymd) |>
    DataFrame |>
    allowmissing!;  # これをしないと missing を代入できない
df[2:end, :amount_sum] = diff(df[:, :amount_sum]);
df[1, :amount_sum] = missing  # 代入の順番も考慮する
df |>
    @rename(:amount_sum => :diff_amount) |>
    @take(10)

sales_ymd	diff_amount
20170101	#NA
20170102	-9558
20170103	3338
20170104	8662
20170105	1665
20170106	-5443
20170107	-8972
20170108	1322
20170109	1981
20170110	-6575

# 別解3
gd = groupby(df_receipt, :sales_ymd);
df = combine(gd, :amount => sum) |>
    @orderby(_.sales_ymd) |>
    DataFrame;
DataFrame(sales_ymd = df[2:end, 1], diff_amount = diff(df[:, :amount_sum])) |>
    @take(10)

sales_ymd	diff_amount
20170102	-9558
20170103	3338
20170104	8662
20170105	1665
20170106	-5443
20170107	-8972
20170108	1322
20170109	1981
20170110	-6575
20170111	4144

J-042: レシート明細データ（df_receipt）の売上金額（amount）を日付（sales_ymd）ごとに集計し、各日付のデータに対し、前回、前々回、3回前に売上があった日のデータを結合せよ。そして結果を10件表示せよ。

gd = groupby(df_receipt, :sales_ymd);
df = combine(gd, :amount => sum) |>
    @orderby(_.sales_ymd) |>
    DataFrame;

df2 = DataFrame(sales_ymd = 0, amount = 0, lag_ymd = 0, lag_amount = 0)
for i = 2:nrow(df)
    i >= 4 && append!(df2, DataFrame(sales_ymd = df[i, 1], amount = df[i, 2], lag_ymd = df[i-3, 1], lag_amount = df[i-3, 2]))
    i >= 3 && append!(df2, DataFrame(sales_ymd = df[i, 1], amount = df[i, 2], lag_ymd = df[i-2, 1], lag_amount = df[i-2, 2]))
              append!(df2, DataFrame(sales_ymd = df[i, 1], amount = df[i, 2], lag_ymd = df[i-1, 1], lag_amount = df[i-1, 2]))
end
delete!(df2, 1)
first(df2, 10)

10×4 DataFrame

Row	sales_ymd	amount	lag_ymd	lag_amount
	Int64	Int64	Int64	Int64
1	20170102	24165	20170101	33723
2	20170103	27503	20170101	33723
3	20170103	27503	20170102	24165
4	20170104	36165	20170101	33723
5	20170104	36165	20170102	24165
6	20170104	36165	20170103	27503
7	20170105	37830	20170102	24165
8	20170105	37830	20170103	27503
9	20170105	37830	20170104	36165
10	20170106	32387	20170103	27503

# 別解
gd = groupby(df_receipt, :sales_ymd);
df = combine(gd, :amount => sum) |>
    @orderby(_.sales_ymd) |>
    DataFrame;

df2 = DataFrame(sales_ymd = 0, amount = 0,
        lag_ymd_1 = 0, lag_amount_1 = 0,
        lag_ymd_2 = 0, lag_amount_2 = 0,
        lag_ymd_3 = 0, lag_amount_3 = 0)
for i = 4:nrow(df)
    append!(df2,
        DataFrame(sales_ymd = df[i, 1], amount = df[i, 2],
            lag_ymd_1 = df[i-1, 1], lag_amount_1 = df[i-1, 2],
            lag_ymd_2 = df[i-2, 1], lag_amount_2 = df[i-2, 2],
            lag_ymd_3 = df[i-3, 1], lag_amount_3 = df[i-3, 2]
            ))
end
delete!(df2, 1)
first(df2, 10)

10×8 DataFrame

Row	sales_ymd	amount	lag_ymd_1	lag_amount_1	lag_ymd_2	lag_amount_2	lag_ymd_3	lag_amount_3
	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64
1	20170104	36165	20170103	27503	20170102	24165	20170101	33723
2	20170105	37830	20170104	36165	20170103	27503	20170102	24165
3	20170106	32387	20170105	37830	20170104	36165	20170103	27503
4	20170107	23415	20170106	32387	20170105	37830	20170104	36165
5	20170108	24737	20170107	23415	20170106	32387	20170105	37830
6	20170109	26718	20170108	24737	20170107	23415	20170106	32387
7	20170110	20143	20170109	26718	20170108	24737	20170107	23415
8	20170111	24287	20170110	20143	20170109	26718	20170108	24737
9	20170112	23526	20170111	24287	20170110	20143	20170109	26718
10	20170113	28004	20170112	23526	20170111	24287	20170110	20143

クロス集計

J-043：レシート明細データ（df_receipt）と顧客データ（df_customer）を結合し、性別コード（gender_cd）と年代（ageから計算）ごとに売上金額（amount）を合計した売上サマリデータを作成せよ。性別コードは0が男性、1が女性、9が不明を表すものとする。

ただし、項目構成は年代、女性の売上金額、男性の売上金額、性別不明の売上金額の4項目とすること（縦に年代、横に性別のクロス集計）。また、年代は10歳ごとの階級とすること。

df = innerjoin(df_receipt[!, [:customer_id, :amount]],
        df_customer[!, [:customer_id, :gender_cd, :age]],
        on=:customer_id, makeunique=true)
df.age = div.(df.age, 10) .* 10
gd = groupby(df, [:age, :gender_cd])
df2 = combine(gd, :amount => sum);
using NamedArrays
using FreqTables
table = freqtable(df2.age, df2.gender_cd, weights=df2.amount_sum)
df043 = hcat(10:10:90, DataFrame(Matrix(table), :auto), makeunique=true)
rename!(df043, [:era, :male, :female, :unknown])

9×4 DataFrame

Row	era	male	female	unknown
	Int64	Int64	Int64	Int64
1	10	1591	149836	4317
2	20	72940	1363724	44328
3	30	177322	693047	50441
4	40	19355	9320791	483512
5	50	54320	6685192	342923
6	60	272469	987741	71418
7	70	13435	29764	2427
8	80	46360	262923	5111
9	90	0	6260	0

縦長フォーマット

J-044： 043で作成した売上サマリデータ（df_sales_summary）は性別の売上を横持ちさせたものであった。このデータから性別を縦持ちさせ、年代、性別コード、売上金額の3項目に変換せよ。ただし、性別コードは男性を"00"、女性を"01"、不明を"99"とする。

df2 = stack(df043, 2:4)
df2[df2[:, 2] .== "male", 2] .= "00"
df2[df2[:, 2] .== "female", 2] .= "01"
df2[df2[:, 2] .== "unknown", 2] .= "99";

df2

27×3 DataFrame

2 rows omitted

Row	era	variable	value
	Int64	String	Int64
1	10	00	1591
2	20	00	72940
3	30	00	177322
4	40	00	19355
5	50	00	54320
6	60	00	272469
7	70	00	13435
8	80	00	46360
9	90	00	0
10	10	01	149836
11	20	01	1363724
12	30	01	693047
13	40	01	9320791
&vellip;	&vellip;	&vellip;	&vellip;
16	70	01	29764
17	80	01	262923
18	90	01	6260
19	10	99	4317
20	20	99	44328
21	30	99	50441
22	40	99	483512
23	50	99	342923
24	60	99	71418
25	70	99	2427
26	80	99	5111
27	90	99	0

# 別解
df2 |>
    @rename(:variable => :gender_cd, :value => :amount) |>
    @orderby(_.gender_cd) |>
    @thenby(_.era) |>
    # @filter(_.amount > 0) |>
    DataFrame

27×3 DataFrame

2 rows omitted

Row	era	gender_cd	amount
	Int64	String	Int64
1	10	00	1591
2	20	00	72940
3	30	00	177322
4	40	00	19355
5	50	00	54320
6	60	00	272469
7	70	00	13435
8	80	00	46360
9	90	00	0
10	10	01	149836
11	20	01	1363724
12	30	01	693047
13	40	01	9320791
&vellip;	&vellip;	&vellip;	&vellip;
16	70	01	29764
17	80	01	262923
18	90	01	6260
19	10	99	4317
20	20	99	44328
21	30	99	50441
22	40	99	483512
23	50	99	342923
24	60	99	71418
25	70	99	2427
26	80	99	5111
27	90	99	0

日付データ

22.　日付型を文字列に変換

J-045: 顧客データ（df_customer）の生年月日（birth_day）は日付型でデータを保有している。これをYYYYMMDD形式の文字列に変換し、顧客ID（customer_id）とともに10件表示せよ。

first(df_customer[!, [:birth_day]], 10)

10×1 DataFrame

Row	birth_day
	Date
1	1981-04-29
2	1952-04-01
3	1976-10-04
4	1933-03-27
5	1995-03-29
6	1974-09-15
7	1977-08-09
8	1973-08-17
9	1931-05-02
10	1962-07-11

using Dates
df = df_customer[!, [:customer_id]]
df.birthday = Dates.format.(df_customer.birth_day, "yyyymmdd")
first(df, 10)

10×2 DataFrame

Row	customer_id	birthday
	String15	String
1	CS021313000114	19810429
2	CS037613000071	19520401
3	CS031415000172	19761004
4	CS028811000001	19330327
5	CS001215000145	19950329
6	CS020401000016	19740915
7	CS015414000103	19770809
8	CS029403000008	19730817
9	CS015804000004	19310502
10	CS033513000180	19620711

# 別解
df_customer |>
    @mutate(birthday=Dates.format(_.birth_day, "yyyymmdd")) |>
    @select(:customer_id, :birthday) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	birthday
	String15	String
1	CS021313000114	19810429
2	CS037613000071	19520401
3	CS031415000172	19761004
4	CS028811000001	19330327
5	CS001215000145	19950329
6	CS020401000016	19740915
7	CS015414000103	19770809
8	CS029403000008	19730817
9	CS015804000004	19310502
10	CS033513000180	19620711

23.　文字列を日付型に変換

J-046: 顧客データ（df_customer）の申し込み日（application_date）はYYYYMMDD形式の文字列型でデータを保有している。これを日付型に変換し、顧客ID（customer_id）とともに10件表示せよ。

df =df_customer[!, [:customer_id]]
str = df_customer.application_date
yyyy = [parse(Int, s[1:4]) for s in str];
mm = [parse(Int, s[5:6]) for s in str];
dd = [parse(Int, s[7:8]) for s in str];
df.application_date = Date.(yyyy, mm, dd)
first(df, 10)

10×2 DataFrame

Row	customer_id	application_date
	String15	Date
1	CS021313000114	2015-09-05
2	CS037613000071	2015-04-14
3	CS031415000172	2015-05-29
4	CS028811000001	2016-01-15
5	CS001215000145	2017-06-05
6	CS020401000016	2015-02-25
7	CS015414000103	2015-07-22
8	CS029403000008	2015-05-15
9	CS015804000004	2015-06-07
10	CS033513000180	2015-07-28

# 別解
df_customer |>
    @select(:customer_id, :application_date) |>
    @mutate(yyyy = parse(Int, _.application_date[1:4]),
        mm = parse(Int, _.application_date[5:6]),
        dd = parse(Int, _.application_date[7:8])) |>
    @mutate(application_date = Date(_.yyyy, _.mm, _.dd)) |>
    @select(:customer_id, :application_date) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	application_date
	String15	Date
1	CS021313000114	2015-09-05
2	CS037613000071	2015-04-14
3	CS031415000172	2015-05-29
4	CS028811000001	2016-01-15
5	CS001215000145	2017-06-05
6	CS020401000016	2015-02-25
7	CS015414000103	2015-07-22
8	CS029403000008	2015-05-15
9	CS015804000004	2015-06-07
10	CS033513000180	2015-07-28

J-047: レシート明細データ（df_receipt）の売上日（sales_ymd）はYYYYMMDD形式の数値型でデータを保有している。これを日付型に変換し、レシート番号（receipt_no）、レシートサブ番号（receipt_sub_no）とともに10件表示せよ。

df = df_receipt[!, [:receipt_no, :receipt_sub_no]]
yyyy = @. df_receipt.sales_ymd ÷ 10000;
mm = @. df_receipt.sales_ymd % 10000 ÷ 100;
dd = @. df_receipt.sales_ymd % 100;
df.sales_ymd = Date.(yyyy, mm, dd)
first(df, 10)

10×3 DataFrame

Row	receipt_no	receipt_sub_no	sales_ymd
	Int64	Int64	Date
1	112	1	2018-11-03
2	1132	2	2018-11-18
3	1102	1	2017-07-12
4	1132	1	2019-02-05
5	1102	2	2018-08-21
6	1112	1	2019-06-05
7	1102	2	2018-12-05
8	1102	1	2019-09-22
9	1112	2	2017-05-04
10	1102	1	2019-10-10

# 別解
df_receipt |>
    @select(:receipt_no, :receipt_sub_no, :sales_ymd) |>
    @mutate(yyyy = _.sales_ymd ÷ 10000,
        mm = _.sales_ymd % 10000 ÷ 100,
        dd = _.sales_ymd % 100) |>
    @mutate(salse_ymd = Date.(_.yyyy, _.mm, _.dd)) |>
    @select(:receipt_no, :receipt_sub_no, :salse_ymd) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_ymd
	Int64	Int64	Date
1	112	1	2018-11-03
2	1132	2	2018-11-18
3	1102	1	2017-07-12
4	1132	1	2019-02-05
5	1102	2	2018-08-21
6	1112	1	2019-06-05
7	1102	2	2018-12-05
8	1102	1	2019-09-22
9	1112	2	2017-05-04
10	1102	1	2019-10-10

J-048: レシート明細データ（df_receipt）の売上エポック秒（sales_epoch）は数値型のUNIX秒でデータを保有している。これを日付型に変換し、レシート番号(receipt_no)、レシートサブ番号（receipt_sub_no）とともに10件表示せよ。

df = df_receipt[!, [:receipt_no, :receipt_sub_no]]
df.sales_ymd = Date.(unix2datetime.(df_receipt.sales_epoch))
first(df, 10)

10×3 DataFrame

Row	receipt_no	receipt_sub_no	sales_ymd
	Int64	Int64	Date
1	112	1	2018-11-03
2	1132	2	2018-11-18
3	1102	1	2017-07-12
4	1132	1	2019-02-05
5	1102	2	2018-08-21
6	1112	1	2019-06-05
7	1102	2	2018-12-05
8	1102	1	2019-09-22
9	1112	2	2017-05-04
10	1102	1	2019-10-10

# 別解
df_receipt |>
    @mutate(sales_ymd = Date(unix2datetime(_.sales_epoch))) |>
    @select(:receipt_no, :receipt_sub_no, :sales_ymd) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	receipt_no	receipt_sub_no	sales_ymd
	Int64	Int64	Date
1	112	1	2018-11-03
2	1132	2	2018-11-18
3	1102	1	2017-07-12
4	1132	1	2019-02-05
5	1102	2	2018-08-21
6	1112	1	2019-06-05
7	1102	2	2018-12-05
8	1102	1	2019-09-22
9	1112	2	2017-05-04
10	1102	1	2019-10-10

24.　エポック秒を日付型に変換

J-049: レシート明細データ（df_receipt）の売上エポック秒（sales_epoch）を日付型に変換し、「年」だけ取り出してレシート番号(receipt_no)、レシートサブ番号（receipt_sub_no）とともに10件表示せよ。

df = df_receipt[!, [:receipt_no, :receipt_sub_no]]
df.salse_year = year.(unix2datetime.(df_receipt.sales_epoch))
first(df, 10)

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_year
	Int64	Int64	Int64
1	112	1	2018
2	1132	2	2018
3	1102	1	2017
4	1132	1	2019
5	1102	2	2018
6	1112	1	2019
7	1102	2	2018
8	1102	1	2019
9	1112	2	2017
10	1102	1	2019

#別解
df_receipt |>
    @mutate(salse_year = year(unix2datetime(_.sales_epoch))) |>
    @select(:receipt_no, :receipt_sub_no, :salse_year) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_year
	Int64	Int64	Int64
1	112	1	2018
2	1132	2	2018
3	1102	1	2017
4	1132	1	2019
5	1102	2	2018
6	1112	1	2019
7	1102	2	2018
8	1102	1	2019
9	1112	2	2017
10	1102	1	2019

J-050: レシート明細データ（df_receipt）の売上エポック秒（sales_epoch）を日付型に変換し、「月」だけ取り出してレシート番号(receipt_no)、レシートサブ番号（receipt_sub_no）とともに10件表示せよ。なお、「月」は0埋め2桁で取り出すこと。

df = df_receipt[!, [:receipt_no, :receipt_sub_no]]
df.salse_month = @. string(month(unix2datetime(df_receipt.sales_epoch)), pad=2)
first(df, 10)

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_month
	Int64	Int64	String
1	112	1	11
2	1132	2	11
3	1102	1	07
4	1132	1	02
5	1102	2	08
6	1112	1	06
7	1102	2	12
8	1102	1	09
9	1112	2	05
10	1102	1	10

# 別解
df_receipt |>
    @mutate(salse_month = string(month(unix2datetime(_.sales_epoch)), pad=2)) |>
    @select(:receipt_no, :receipt_sub_no, :salse_month) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_month
	Int64	Int64	String
1	112	1	11
2	1132	2	11
3	1102	1	07
4	1132	1	02
5	1102	2	08
6	1112	1	06
7	1102	2	12
8	1102	1	09
9	1112	2	05
10	1102	1	10

J-051: レシート明細データ（df_receipt）の売上エポック秒を日付型に変換し、「日」だけ取り出してレシート番号(receipt_no)、レシートサブ番号（receipt_sub_no）とともに10件表示せよ。なお、「日」は0埋め2桁で取り出すこと。

df = df_receipt[!, [:receipt_no, :receipt_sub_no]]
df.salse_day = @. string(day(unix2datetime(df_receipt.sales_epoch)), pad=2)
first(df, 10)

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_day
	Int64	Int64	String
1	112	1	03
2	1132	2	18
3	1102	1	12
4	1132	1	05
5	1102	2	21
6	1112	1	05
7	1102	2	05
8	1102	1	22
9	1112	2	04
10	1102	1	10

# 別解
df_receipt |>
    @mutate(salse_day = string(day(unix2datetime(_.sales_epoch)), pad=2)) |>
    @select(:receipt_no, :receipt_sub_no, :salse_day) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	receipt_no	receipt_sub_no	salse_day
	Int64	Int64	String
1	112	1	03
2	1132	2	18
3	1102	1	12
4	1132	1	05
5	1102	2	21
6	1112	1	05
7	1102	2	05
8	1102	1	22
9	1112	2	04
10	1102	1	10

変数の作成と数式変形

J-052: レシート明細データ（df_receipt）の売上金額（amount）を顧客ID（customer_id）ごとに合計の上、売上金額合計に対して2,000円以下を0、2,000円より大きい金額を1に二値化し、顧客ID、売上金額合計とともに10件表示せよ。ただし、顧客IDが"Z"から始まるのものは非会員を表すため、除外して計算すること。

gd = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df2 = combine(gd, :amount => sum => :sum_amount)
df2.salse_flg = (df2.sum_amount .> 2000) .+ 0
df3 = sort(df2, :customer_id)
first(df3, 10)

10×3 DataFrame

Row	customer_id	sum_amount	salse_flg
	String15	Int64	Int64
1	CS001113000004	1298	0
2	CS001114000005	626	0
3	CS001115000010	3044	1
4	CS001205000004	1988	0
5	CS001205000006	3337	1
6	CS001211000025	456	0
7	CS001212000027	448	0
8	CS001212000031	296	0
9	CS001212000046	228	0
10	CS001212000070	456	0

# 別解
df_receipt |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    @groupby(_.customer_id) |>
    @map({customer_id = key(_), sum_amount = sum(_.amount)}) |>
    @mutate(salse_flg = (_.sum_amount > 2000) + 0) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	sum_amount	salse_flg
	String15	Int64	Int64
1	CS001113000004	1298	0
2	CS001114000005	626	0
3	CS001115000010	3044	1
4	CS001205000004	1988	0
5	CS001205000006	3337	1
6	CS001211000025	456	0
7	CS001212000027	448	0
8	CS001212000031	296	0
9	CS001212000046	228	0
10	CS001212000070	456	0

J-053: 顧客データ（df_customer）の郵便番号（postal_cd）に対し、東京（先頭3桁が100〜209のもの）を1、それ以外のものを0に二値化せよ。さらにレシート明細データ（df_receipt）と結合し、全期間において売上実績のある顧客数を、作成した二値ごとにカウントせよ。

df2 = df_customer[!, [:customer_id, :postal_cd]]
df2.postal_flg = @. (100 <= parse(Int, SubString(df2.postal_cd, 1, 3)) <= 209) + 0
df3 = unique(innerjoin(df2, df_receipt[!, [:customer_id]], on=:customer_id))
combine(groupby(df3, :postal_flg), nrow)

2×2 DataFrame

Row	postal_flg	nrow
	Int64	Int64
1	0	3906
2	1	4400

# 別解
using StatsBase
df2 = df_customer[!, [:customer_id, :postal_cd]] |>
    @mutate(postal_flg = (100 <= parse(Int, SubString(_.postal_cd, 1, 3)) <= 209) + 0) |>
    DataFrame;
df3 = innerjoin(df2, df_receipt[!, [:customer_id]], on=:customer_id) |>
    @unique() |>
    @groupby(_.postal_flg) |>
    @map({postal_flg = key(_), nrow = length(_.postal_flg)}) |>
    DataFrame

2×2 DataFrame

Row	postal_flg	nrow
	Int64	Int64
1	0	3906
2	1	4400

J-054: 顧客データ（df_customer）の住所（address）は、埼玉県、千葉県、東京都、神奈川県のいずれかとなっている。都道府県毎にコード値を作成し、顧客ID、住所とともに10件表示せよ。値は埼玉県を11、千葉県を12、東京都を13、神奈川県を14とすること。

df = df_customer[!, [:customer_id, :address]]
df.prefecture_cd = [indexin(s[1], ['埼', '千', '東', '神'])[1]+10 for s in df.address];
first(df, 10)

10×3 DataFrame

Row	customer_id	address	prefecture_cd
	String15	String	Int64
1	CS021313000114	神奈川県伊勢原市粟窪**********	14
2	CS037613000071	東京都江東区南砂**********	13
3	CS031415000172	東京都渋谷区代々木**********	13
4	CS028811000001	神奈川県横浜市泉区和泉町**********	14
5	CS001215000145	東京都大田区仲六郷**********	13
6	CS020401000016	東京都板橋区若木**********	13
7	CS015414000103	東京都江東区北砂**********	13
8	CS029403000008	千葉県浦安市海楽**********	12
9	CS015804000004	東京都江東区北砂**********	13
10	CS033513000180	神奈川県横浜市旭区善部町**********	14

# 別解
df_customer[!, [:customer_id, :address]] |>
    @mutate(prefecture_cd = occursin(r"^東京都", _.address) ? 13 :
            (occursin(r"^千葉県", _.address) ? 12 :
            (occursin(r"^埼玉県", _.address) ? 11 : 14))) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	address	prefecture_cd
	String15	String	Int64
1	CS021313000114	神奈川県伊勢原市粟窪**********	14
2	CS037613000071	東京都江東区南砂**********	13
3	CS031415000172	東京都渋谷区代々木**********	13
4	CS028811000001	神奈川県横浜市泉区和泉町**********	14
5	CS001215000145	東京都大田区仲六郷**********	13
6	CS020401000016	東京都板橋区若木**********	13
7	CS015414000103	東京都江東区北砂**********	13
8	CS029403000008	千葉県浦安市海楽**********	12
9	CS015804000004	東京都江東区北砂**********	13
10	CS033513000180	神奈川県横浜市旭区善部町**********	14

J-055: レシート明細（df_receipt）データの売上金額（amount）を顧客ID（customer_id）ごとに合計し、その合計金額の四分位点を求めよ。その上で、顧客ごとの売上金額合計に対して以下の基準でカテゴリ値を作成し、顧客ID、売上金額合計とともに10件表示せよ。カテゴリ値は順に1〜4とする。

最小値以上第1四分位未満・・・ 1を付与

第1四分位以上第2四分位未満・・・ 2を付与

第2四分位以上第3四分位未満・・・ 3を付与

第3四分位以上・・・ 4を付与

function encode(df)
    for x in df
        x < Q[1] && return 1
        x < Q[2] && return 2
        x < Q[3] && return 3
        return 4
    end
end

using StatsBase
gd = groupby(df_receipt[:, [:amount, :customer_id]], :customer_id)
df = combine(gd, :amount => sum);
Q = quantile(df.amount_sum, [0.25, 0.50, 0.75]);

df.pct_group = encode.(df.amount_sum);
df2 = sort(df, :customer_id)
first(df2, 10)

10×3 DataFrame

Row	customer_id	amount_sum	pct_group
	String15	Int64	Int64
1	CS001113000004	1298	2
2	CS001114000005	626	2
3	CS001115000010	3044	3
4	CS001205000004	1988	3
5	CS001205000006	3337	3
6	CS001211000025	456	1
7	CS001212000027	448	1
8	CS001212000031	296	1
9	CS001212000046	228	1
10	CS001212000070	456	1

3-element Vector{Float64}:
  548.5
 1478.0
 3651.0

# 別解
using StatsBase
gd = groupby(df_receipt[:, [:amount, :customer_id]], :customer_id)
df = combine(gd, :amount => sum);
Q = quantile(df.amount_sum, [0.25, 0.50, 0.75]);

df2 = df |>
    @mutate(pct_group = _.amount_sum .< Q[1] ? 1 :
        (_.amount_sum .< Q[2] ? 2 :
        (_.amount_sum .< Q[3] ? 3 : 4))) |>
    @orderby(_.customer_id) |>
    DataFrame
first(df2, 10)

10×3 DataFrame

Row	customer_id	amount_sum	pct_group
	String15	Int64	Int64
1	CS001113000004	1298	2
2	CS001114000005	626	2
3	CS001115000010	3044	3
4	CS001205000004	1988	3
5	CS001205000006	3337	3
6	CS001211000025	456	1
7	CS001212000027	448	1
8	CS001212000031	296	1
9	CS001212000046	228	1
10	CS001212000070	456	1

J-056: 顧客データ（df_customer）の年齢（age）をもとに10歳刻みで年代を算出し、顧客ID（customer_id）、生年月日（birth_day）とともに10件表示せよ。ただし、60歳以上は全て60歳代とすること。年代を表すカテゴリ名は任意とする。

df = df_customer[!, [:customer_id, :birth_day, :age]]
df.era = @. min((df.age ÷ 10)*10, 60)
first(df, 10)

10×4 DataFrame

Row	customer_id	birth_day	age	era
	String15	Date	Int64	Int64
1	CS021313000114	1981-04-29	37	30
2	CS037613000071	1952-04-01	66	60
3	CS031415000172	1976-10-04	42	40
4	CS028811000001	1933-03-27	86	60
5	CS001215000145	1995-03-29	24	20
6	CS020401000016	1974-09-15	44	40
7	CS015414000103	1977-08-09	41	40
8	CS029403000008	1973-08-17	45	40
9	CS015804000004	1931-05-02	87	60
10	CS033513000180	1962-07-11	56	50

# 別解
df_customer[!, [:customer_id, :birth_day, :age]] |>
    @mutate(era = min((_.age ÷ 10)*10, 60)) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	customer_id	birth_day	age	era
	String15	Date	Int64	Int64
1	CS021313000114	1981-04-29	37	30
2	CS037613000071	1952-04-01	66	60
3	CS031415000172	1976-10-04	42	40
4	CS028811000001	1933-03-27	86	60
5	CS001215000145	1995-03-29	24	20
6	CS020401000016	1974-09-15	44	40
7	CS015414000103	1977-08-09	41	40
8	CS029403000008	1973-08-17	45	40
9	CS015804000004	1931-05-02	87	60
10	CS033513000180	1962-07-11	56	50

J-057: 056の抽出結果と性別コード（gender_cd）により、新たに性別×年代の組み合わせを表すカテゴリデータを作成し、10件表示せよ。組み合わせを表すカテゴリの値は任意とする。

df = df_customer[!, [:customer_id, :gender_cd, :birth_day, :age]]
df.era = @. min((df.age ÷ 10)*10, 60)
df.gender_era = @. string(df.gender_cd) * string(df.era)
first(df, 10)

10×6 DataFrame

Row	customer_id	gender_cd	birth_day	age	era	gender_era
	String15	String	Date	Int64	Int64	String
1	CS021313000114	1	1981-04-29	37	30	130
2	CS037613000071	9	1952-04-01	66	60	960
3	CS031415000172	1	1976-10-04	42	40	140
4	CS028811000001	1	1933-03-27	86	60	160
5	CS001215000145	1	1995-03-29	24	20	120
6	CS020401000016	0	1974-09-15	44	40	040
7	CS015414000103	1	1977-08-09	41	40	140
8	CS029403000008	0	1973-08-17	45	40	040
9	CS015804000004	0	1931-05-02	87	60	060
10	CS033513000180	1	1962-07-11	56	50	150

# 別解
df_customer[!, [:customer_id, :gender_cd, :birth_day, :age]] |>
    @mutate(era = min((_.age ÷ 10)*10, 60)) |>
    @mutate(gender_era = string(_.gender_cd) * string(_.era)) |>
    @take(10) |>
    DataFrame

10×6 DataFrame

Row	customer_id	gender_cd	birth_day	age	era	gender_era
	String15	String	Date	Int64	Int64	String
1	CS021313000114	1	1981-04-29	37	30	130
2	CS037613000071	9	1952-04-01	66	60	960
3	CS031415000172	1	1976-10-04	42	40	140
4	CS028811000001	1	1933-03-27	86	60	160
5	CS001215000145	1	1995-03-29	24	20	120
6	CS020401000016	0	1974-09-15	44	40	040
7	CS015414000103	1	1977-08-09	41	40	140
8	CS029403000008	0	1973-08-17	45	40	040
9	CS015804000004	0	1931-05-02	87	60	060
10	CS033513000180	1	1962-07-11	56	50	150

25.　ダミー変数の作成

J-058: 顧客データ（df_customer）の性別コード（gender_cd）をダミー変数化し、顧客ID（customer_id）とともに10件表示せよ。

# トリッキー（スマート）な方法
dummydatamatrix = Matrix(permutedims(["0", "1", "9"]) .== df_customer.gender_cd) .+ 0
df = hcat(df_customer[!, [:customer_id]], DataFrame(dummydatamatrix, :auto))
rename!(df, [:customer_id, :gender_cd0, :gender_cd1, :gender_cd9])
first(df, 10)

10×4 DataFrame

Row	customer_id	gender_cd0	gender_cd1	gender_cd9
	String15	Int64	Int64	Int64
1	CS021313000114	0	1	0
2	CS037613000071	0	0	1
3	CS031415000172	0	1	0
4	CS028811000001	0	1	0
5	CS001215000145	0	1	0
6	CS020401000016	1	0	0
7	CS015414000103	0	1	0
8	CS029403000008	1	0	0
9	CS015804000004	1	0	0
10	CS033513000180	0	1	0

# 地道な方法
df = df_customer[!, [:customer_id]]
df.gender_cd0 = @. (df_customer.gender_cd == "0") + 0
df.gender_cd1 = @. (df_customer.gender_cd == "1") + 0
df.gender_cd9 = @. (df_customer.gender_cd == "9") + 0
first(df, 10)

10×4 DataFrame

Row	customer_id	gender_cd0	gender_cd1	gender_cd9
	String15	Int64	Int64	Int64
1	CS021313000114	0	1	0
2	CS037613000071	0	0	1
3	CS031415000172	0	1	0
4	CS028811000001	0	1	0
5	CS001215000145	0	1	0
6	CS020401000016	1	0	0
7	CS015414000103	0	1	0
8	CS029403000008	1	0	0
9	CS015804000004	1	0	0
10	CS033513000180	0	1	0

# 別解
df_customer[!, [:customer_id, :gender_cd]] |>
    @mutate(gender_cd0 = (_.gender_cd == "0") + 0,
            gender_cd1 = (_.gender_cd == "1") + 0,
            gender_cd9 = (_.gender_cd == "9") + 0)|>
    @select(-:gender_cd) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	customer_id	gender_cd0	gender_cd1	gender_cd9
	String15	Int64	Int64	Int64
1	CS021313000114	0	1	0
2	CS037613000071	0	0	1
3	CS031415000172	0	1	0
4	CS028811000001	0	1	0
5	CS001215000145	0	1	0
6	CS020401000016	1	0	0
7	CS015414000103	0	1	0
8	CS029403000008	1	0	0
9	CS015804000004	1	0	0
10	CS033513000180	0	1	0

26.　標準化

J-059: レシート明細データ（df_receipt）の売上金額（amount）を顧客ID（customer_id）ごとに合計し、売上金額合計を平均0、標準偏差1に標準化して顧客ID、売上金額合計とともに10件表示せよ。標準化に使用する標準偏差は、分散の平方根、もしくは不偏分散の平方根のどちらでも良いものとする。ただし、顧客IDが"Z"から始まるのものは非会員を表すため、除外して計算すること。

TIPS:
~- query()の引数engineで'python'か'numexpr'かを選択でき、デフォルトはインストールされていればnumexprが、無ければpythonが使われます。さらに、文字列メソッドはengine='python'でないとquery()内で使えません。

gd = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df = combine(gd, :amount => sum => :sum_amount)
df.standardized_amount = (df.sum_amount .- mean(df.sum_amount)) ./ std(df.sum_amount)
df2 = sort(df, :customer_id)
first(df2, 10)

10×3 DataFrame

Row	customer_id	sum_amount	standardized_amount
	String15	Int64	Float64
1	CS001113000004	1298	-0.45935
2	CS001114000005	626	-0.706348
3	CS001115000010	3044	0.182403
4	CS001205000004	1988	-0.205737
5	CS001205000006	3337	0.290096
6	CS001211000025	456	-0.768832
7	CS001212000027	448	-0.771773
8	CS001212000031	296	-0.827641
9	CS001212000046	228	-0.852635
10	CS001212000070	456	-0.768832

# 別解
df2 = df_receipt[!, [:customer_id, :amount]] |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame;

gd = groupby(df2, :customer_id)
df3 = combine(gd, :amount => sum => :sum_amount) |>
    @orderby(_.customer_id) |>
    DataFrame
using StatsBase
mean_amount = mean(df3.sum_amount)
std_amount  = std(df3.sum_amount);

df3 |>
    @mutate(standardized_amount = (_.sum_amount - mean_amount) / std_amount) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	sum_amount	standardized_amount
	Any	Any	Any
1	CS001113000004	1298	-0.45935
2	CS001114000005	626	-0.706348
3	CS001115000010	3044	0.182403
4	CS001205000004	1988	-0.205737
5	CS001205000006	3337	0.290096
6	CS001211000025	456	-0.768832
7	CS001212000027	448	-0.771773
8	CS001212000031	296	-0.827641
9	CS001212000046	228	-0.852635
10	CS001212000070	456	-0.768832

df2 を作ってからは，以下のように freqtables() を使うほうが簡単でわかりやすい。

using FreqTables
a = freqtable(df2.customer_id, weights=df2.amount)
(mean_amount, std_amount) = (mean(a), std(a))
DataFrame(customer_id=names(a)[1], sum_amount=a) |>
    @mutate(standardized_amount = (_.sum_amount - mean_amount) / std_amount) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	sum_amount	standardized_amount
	Any	Any	Any
1	CS001113000004	1298	-0.45935
2	CS001114000005	626	-0.706348
3	CS001115000010	3044	0.182403
4	CS001205000004	1988	-0.205737
5	CS001205000006	3337	0.290096
6	CS001211000025	456	-0.768832
7	CS001212000027	448	-0.771773
8	CS001212000031	296	-0.827641
9	CS001212000046	228	-0.852635
10	CS001212000070	456	-0.768832

J-060: レシート明細データ（df_receipt）の売上金額（amount）を顧客ID（customer_id）ごとに合計し、売上金額合計を最小値0、最大値1に正規化して顧客ID、売上金額合計とともに10件表示せよ。ただし、顧客IDが"Z"から始まるのものは非会員を表すため、除外して計算すること。

gd = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df = combine(gd, :amount => sum => :sum_amount)
df.standardized_amount = (df.sum_amount .- minimum(df.sum_amount)) ./ (maximum(df.sum_amount) - minimum(df.sum_amount))
df2 = sort(df, :customer_id)
first(df2, 10)

10×3 DataFrame

Row	customer_id	sum_amount	standardized_amount
	String15	Int64	Float64
1	CS001113000004	1298	0.0533542
2	CS001114000005	626	0.0241571
3	CS001115000010	3044	0.129214
4	CS001205000004	1988	0.0833333
5	CS001205000006	3337	0.141945
6	CS001211000025	456	0.0167709
7	CS001212000027	448	0.0164234
8	CS001212000031	296	0.00981926
9	CS001212000046	228	0.00686479
10	CS001212000070	456	0.0167709

# 別解
df2 = df_receipt[!, [:customer_id, :amount]] |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame;

using FreqTables
a = freqtable(df2.customer_id, weights=df2.amount)
(min_amount, max_amount) = (minimum(a), maximum(a))
DataFrame(customer_id=names(a)[1], sum_amount=a) |>
    @mutate(scaled_amount = (_.sum_amount - min_amount) / (max_amount - min_amount)) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	sum_amount	scaled_amount
	Any	Any	Any
1	CS001113000004	1298	0.0533542
2	CS001114000005	626	0.0241571
3	CS001115000010	3044	0.129214
4	CS001205000004	1988	0.0833333
5	CS001205000006	3337	0.141945
6	CS001211000025	456	0.0167709
7	CS001212000027	448	0.0164234
8	CS001212000031	296	0.00981926
9	CS001212000046	228	0.00686479
10	CS001212000070	456	0.0167709

27.　対数変換

J-061: レシート明細データ（df_receipt）の売上金額（amount）を顧客ID（customer_id）ごとに合計し、売上金額合計を常用対数化（底10）して顧客ID、売上金額合計とともに10件表示せよ。ただし、顧客IDが"Z"から始まるのものは非会員を表すため、除外して計算すること。

gd = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df = combine(gd, :amount => sum => :sum_amount)
df.log_amount = log10.(df.sum_amount)
df2 = sort(df, :customer_id)
first(df2, 10)

10×3 DataFrame

Row	customer_id	sum_amount	log_amount
	String15	Int64	Float64
1	CS001113000004	1298	3.11327
2	CS001114000005	626	2.79657
3	CS001115000010	3044	3.48344
4	CS001205000004	1988	3.29842
5	CS001205000006	3337	3.52336
6	CS001211000025	456	2.65896
7	CS001212000027	448	2.65128
8	CS001212000031	296	2.47129
9	CS001212000046	228	2.35793
10	CS001212000070	456	2.65896

# 別解
df2 = df_receipt[!, [:customer_id, :amount]] |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame;

using FreqTables
a = freqtable(df2.customer_id, weights=df2.amount)
(min_amount, max_amount) = (minimum(a), maximum(a))
DataFrame(customer_id=names(a)[1], sum_amount=a) |>
    @mutate(log_amount = log10(_.sum_amount)) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	sum_amount	log_amount
	String15	Int64	Float64
1	CS001113000004	1298	3.11327
2	CS001114000005	626	2.79657
3	CS001115000010	3044	3.48344
4	CS001205000004	1988	3.29842
5	CS001205000006	3337	3.52336
6	CS001211000025	456	2.65896
7	CS001212000027	448	2.65128
8	CS001212000031	296	2.47129
9	CS001212000046	228	2.35793
10	CS001212000070	456	2.65896

J-062: レシート明細データ（df_receipt）の売上金額（amount）を顧客ID（customer_id）ごとに合計し、売上金額合計を自然対数化（底e）して顧客ID、売上金額合計とともに10件表示せよ。ただし、顧客IDが"Z"から始まるのものは非会員を表すため、除外して計算すること。

gd = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df = combine(gd, :amount => sum => :sum_amount)
df.log_amount = log.(df.sum_amount)
df2 = sort(df, :customer_id)
first(df2, 10)

10×3 DataFrame

Row	customer_id	sum_amount	log_amount
	String15	Int64	Float64
1	CS001113000004	1298	7.16858
2	CS001114000005	626	6.43935
3	CS001115000010	3044	8.02093
4	CS001205000004	1988	7.59488
5	CS001205000006	3337	8.11283
6	CS001211000025	456	6.12249
7	CS001212000027	448	6.10479
8	CS001212000031	296	5.69036
9	CS001212000046	228	5.42935
10	CS001212000070	456	6.12249

# 別解
df2 = df_receipt[!, [:customer_id, :amount]] |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    DataFrame;

using FreqTables
a = freqtable(df2.customer_id, weights=df2.amount)
(min_amount, max_amount) = (minimum(a), maximum(a))
DataFrame(customer_id=names(a)[1], sum_amount=a) |>
    @mutate(log_amount = log(_.sum_amount)) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	customer_id	sum_amount	log_amount
	String15	Int64	Float64
1	CS001113000004	1298	7.16858
2	CS001114000005	626	6.43935
3	CS001115000010	3044	8.02093
4	CS001205000004	1988	7.59488
5	CS001205000006	3337	8.11283
6	CS001211000025	456	6.12249
7	CS001212000027	448	6.10479
8	CS001212000031	296	5.69036
9	CS001212000046	228	5.42935
10	CS001212000070	456	6.12249

欠損値処理

J-063: 商品データ（df_product）の単価（unit_price）と原価（unit_cost）から各商品の利益額を算出し、結果を10件表示せよ。

df = copy(df_product)  # copy() しないと，元のデータフレームが編集される
df.unit_profit = df.unit_price - df.unit_cost
first(df, 10)

10×7 DataFrame

Row	product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	unit_profit
	String15	String	String	String	Int64?	Int64?	Int64?
1	P040101001	04	0401	040101	198	149	49
2	P040101002	04	0401	040101	218	164	54
3	P040101003	04	0401	040101	230	173	57
4	P040101004	04	0401	040101	248	186	62
5	P040101005	04	0401	040101	268	201	67
6	P040101006	04	0401	040101	298	224	74
7	P040101007	04	0401	040101	338	254	84
8	P040101008	04	0401	040101	420	315	105
9	P040101009	04	0401	040101	498	374	124
10	P040101010	04	0401	040101	580	435	145

# 別解
df_product |>
    @mutate(unit_profit = _.unit_price - _.unit_cost) |>
    @take(10) |>
    DataFrame

10×7 DataFrame

Row	product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	unit_profit
	String15	String	String	String	Int64?	Int64?	Int64?
1	P040101001	04	0401	040101	198	149	49
2	P040101002	04	0401	040101	218	164	54
3	P040101003	04	0401	040101	230	173	57
4	P040101004	04	0401	040101	248	186	62
5	P040101005	04	0401	040101	268	201	67
6	P040101006	04	0401	040101	298	224	74
7	P040101007	04	0401	040101	338	254	84
8	P040101008	04	0401	040101	420	315	105
9	P040101009	04	0401	040101	498	374	124
10	P040101010	04	0401	040101	580	435	145

J-064: 商品データ（df_product）の単価（unit_price）と原価（unit_cost）から、各商品の利益率の全体平均を算出せよ。ただし、単価と原価には欠損が生じていることに注意せよ。

mean(skipmissing(1 .- df_product.unit_cost ./ df_product.unit_price))

0.24911389885176904

# 別解
df = df_product |>
    @mutate(unit_profit_rate = 1 - _.unit_cost / _.unit_price) |>
    DataFrame;
mean(skipmissing(df.unit_profit_rate))

0.24911389885176904

# 別解2
df = df_product |>
    @mutate(unit_profit_rate = 1 - _.unit_cost / _.unit_price) |>
    @dropna(:unit_profit_rate) |>
    DataFrame;
mean(df.unit_profit_rate)

0.24911389885177007

J-065: 商品データ（df_product）の各商品について、利益率が30%となる新たな単価を求めよ。ただし、1円未満は切り捨てること。そして結果を10件表示させ、利益率がおよそ30％付近であることを確認せよ。ただし、単価（unit_price）と原価（unit_cost）には欠損が生じていることに注意せよ。

df = dropmissing(df_product[!, [:product_cd, :unit_price, :unit_cost]])
df.new_price = @. trunc(Int, df.unit_cost / 0.7)
df.new_profit_rate = @. 1 - df.unit_cost / df.new_price
first(df, 10)

10×5 DataFrame

Row	product_cd	unit_price	unit_cost	new_price	new_profit_rate
	String15	Int64	Int64	Int64	Float64
1	P040101001	198	149	212	0.29717
2	P040101002	218	164	234	0.299145
3	P040101003	230	173	247	0.299595
4	P040101004	248	186	265	0.298113
5	P040101005	268	201	287	0.299652
6	P040101006	298	224	320	0.3
7	P040101007	338	254	362	0.298343
8	P040101008	420	315	450	0.3
9	P040101009	498	374	534	0.299625
10	P040101010	580	435	621	0.299517

# 別解
df_product[!, [:product_cd, :unit_price, :unit_cost]] |>
    @dropna() |>
    @mutate(new_price = trunc(Int, _.unit_cost / 0.7)) |>
    @mutate(new_profit_rate = 1 - _.unit_cost / _.new_price) |>
    @take(10) |>
    DataFrame

10×5 DataFrame

Row	product_cd	unit_price	unit_cost	new_price	new_profit_rate
	String15	Int64	Int64	Int64	Float64
1	P040101001	198	149	212	0.29717
2	P040101002	218	164	234	0.299145
3	P040101003	230	173	247	0.299595
4	P040101004	248	186	265	0.298113
5	P040101005	268	201	287	0.299652
6	P040101006	298	224	320	0.3
7	P040101007	338	254	362	0.298343
8	P040101008	420	315	450	0.3
9	P040101009	498	374	534	0.299625
10	P040101010	580	435	621	0.299517

J-066: 商品データ（df_product）の各商品について、利益率が30%となる新たな単価を求めよ。今回は、1円未満を丸めること（四捨五入または偶数への丸めで良い）。そして結果を10件表示させ、利益率がおよそ30％付近であることを確認せよ。ただし、単価（unit_price）と原価（unit_cost）には欠損が生じていることに注意せよ。

df = dropmissing(df_product[!, [:product_cd, :unit_price, :unit_cost]])
df.new_price = @. round(Int, df.unit_cost / 0.7)
df.new_profit_rate = @. 1 - df.unit_cost / df.new_price
first(df, 10)

10×5 DataFrame

Row	product_cd	unit_price	unit_cost	new_price	new_profit_rate
	String15	Int64	Int64	Int64	Float64
1	P040101001	198	149	213	0.300469
2	P040101002	218	164	234	0.299145
3	P040101003	230	173	247	0.299595
4	P040101004	248	186	266	0.300752
5	P040101005	268	201	287	0.299652
6	P040101006	298	224	320	0.3
7	P040101007	338	254	363	0.300275
8	P040101008	420	315	450	0.3
9	P040101009	498	374	534	0.299625
10	P040101010	580	435	621	0.299517

# 別解
df_product[!, [:product_cd, :unit_price, :unit_cost]] |>
    @dropna() |>
    @mutate(new_price = round(Int, _.unit_cost / 0.7)) |>
    @mutate(new_profit_rate = 1 - _.unit_cost / _.new_price) |>
    @take(10) |>
    DataFrame

10×5 DataFrame

Row	product_cd	unit_price	unit_cost	new_price	new_profit_rate
	String15	Int64	Int64	Int64	Float64
1	P040101001	198	149	213	0.300469
2	P040101002	218	164	234	0.299145
3	P040101003	230	173	247	0.299595
4	P040101004	248	186	266	0.300752
5	P040101005	268	201	287	0.299652
6	P040101006	298	224	320	0.3
7	P040101007	338	254	363	0.300275
8	P040101008	420	315	450	0.3
9	P040101009	498	374	534	0.299625
10	P040101010	580	435	621	0.299517

J-067: 商品データ（df_product）の各商品について、利益率が30%となる新たな単価を求めよ。今回は、1円未満を切り上げること。そして結果を10件表示させ、利益率がおよそ30％付近であることを確認せよ。ただし、単価（unit_price）と原価（unit_cost）には欠損が生じていることに注意せよ。

df = dropmissing(df_product[!, [:product_cd, :unit_price, :unit_cost]])
df.new_price = @. ceil(Int, df.unit_cost / 0.7)
df.new_profit_rate = @. 1 - df.unit_cost / df.new_price
first(df, 10)

10×5 DataFrame

Row	product_cd	unit_price	unit_cost	new_price	new_profit_rate
	String15	Int64	Int64	Int64	Float64
1	P040101001	198	149	213	0.300469
2	P040101002	218	164	235	0.302128
3	P040101003	230	173	248	0.302419
4	P040101004	248	186	266	0.300752
5	P040101005	268	201	288	0.302083
6	P040101006	298	224	320	0.3
7	P040101007	338	254	363	0.300275
8	P040101008	420	315	451	0.301552
9	P040101009	498	374	535	0.300935
10	P040101010	580	435	622	0.300643

df_product[!, [:product_cd, :unit_price, :unit_cost]] |>
    @dropna() |>
    @mutate(new_price = ceil(Int, _.unit_cost / 0.7)) |>
    @mutate(new_profit_rate = 1 - _.unit_cost / _.new_price) |>
    @take(10) |>
    DataFrame

10×5 DataFrame

Row	product_cd	unit_price	unit_cost	new_price	new_profit_rate
	String15	Int64	Int64	Int64	Float64
1	P040101001	198	149	213	0.300469
2	P040101002	218	164	235	0.302128
3	P040101003	230	173	248	0.302419
4	P040101004	248	186	266	0.300752
5	P040101005	268	201	288	0.302083
6	P040101006	298	224	320	0.3
7	P040101007	338	254	363	0.300275
8	P040101008	420	315	451	0.301552
9	P040101009	498	374	535	0.300935
10	P040101010	580	435	622	0.300643

J-068: 商品データ（df_product）の各商品について、消費税率10％の税込み金額を求めよ。1円未満の端数は切り捨てとし、結果を10件表示せよ。ただし、単価（unit_price）には欠損が生じていることに注意せよ。

df = dropmissing(df_product[!, [:product_cd, :unit_price]])
df.tax_price = @. trunc(Int, df.unit_price * 1.1)
first(df, 10)

10×3 DataFrame

Row	product_cd	unit_price	tax_price
	String15	Int64	Int64
1	P040101001	198	217
2	P040101002	218	239
3	P040101003	230	253
4	P040101004	248	272
5	P040101005	268	294
6	P040101006	298	327
7	P040101007	338	371
8	P040101008	420	462
9	P040101009	498	547
10	P040101010	580	638

# 別解
df_product[!, [:product_cd, :unit_price]] |>
    @dropna() |>
    @mutate(tax_price = trunc(Int, _.unit_price * 1.1)) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	product_cd	unit_price	tax_price
	String15	Int64	Int64
1	P040101001	198	217
2	P040101002	218	239
3	P040101003	230	253
4	P040101004	248	272
5	P040101005	268	294
6	P040101006	298	327
7	P040101007	338	371
8	P040101008	420	462
9	P040101009	498	547
10	P040101010	580	638

J-069: レシート明細データ（df_receipt）と商品データ（df_product）を結合し、顧客毎に全商品の売上金額合計と、カテゴリ大区分コード（category_major_cd）が"07"（瓶詰缶詰）の売上金額合計を計算の上、両者の比率を求めよ。抽出対象はカテゴリ大区分コード"07"（瓶詰缶詰）の売上実績がある顧客のみとし、結果を10件表示せよ。

df = innerjoin(df_receipt[!, [:customer_id, :product_cd, :amount]],
               df_product[!, [:category_major_cd, :product_cd]],
               on=:product_cd, makeunique=true)
df1 = combine(groupby(df, :customer_id), :amount => sum => :sum_all)

filter!(:category_major_cd => x -> x == "07", df)
df2 = combine(groupby(df, :customer_id), :amount => sum => :sum_07)

df3 = innerjoin(df1, df2, on=:customer_id)
df3.sales_rate = df3.sum_07 ./ df3.sum_all
sort!(df3, :customer_id)
first(df3, 10)

10×4 DataFrame

Row	customer_id	sum_all	sum_07	sales_rate
	String15	Int64	Int64	Float64
1	CS001113000004	1298	1298	1.0
2	CS001114000005	626	486	0.776358
3	CS001115000010	3044	2694	0.88502
4	CS001205000004	1988	346	0.174044
5	CS001205000006	3337	2004	0.600539
6	CS001212000027	448	200	0.446429
7	CS001212000031	296	296	1.0
8	CS001212000046	228	108	0.473684
9	CS001212000070	456	308	0.675439
10	CS001213000018	243	145	0.596708

# 別解
df1 = df_receipt |>
    @select(:customer_id, :product_cd, :amount) |> DataFrame;

df2 = df_product |>
    @select(:category_major_cd, :product_cd) |> DataFrame;

df3 = df1 |>
    @join(df2, _.product_cd, _.product_cd, {_.customer_id, _.amount, __.category_major_cd}) |>
    DataFrame;

df4 = df3 |>
    @groupby(_.customer_id) |>
    @map({customer_id = key(_), sum_all = sum(_.amount)}) |>
    DataFrame;

df5 = df3 |>
    @filter(_.category_major_cd == "07") |>
    @groupby(_.customer_id) |>
    @map({customer_id = key(_), sum_07 = sum(_.amount)}) |>
    DataFrame;

df4 |>
    @join(df5, _.customer_id, _.customer_id, {_.customer_id, _.sum_all, __.sum_07}) |>
    @mutate(sales_rate = _.sum_07 / _.sum_all) |>
    @orderby(_.customer_id) |>
    @take(10) |> DataFrame

10×4 DataFrame

Row	customer_id	sum_all	sum_07	sales_rate
	String15	Int64	Int64	Float64
1	CS001113000004	1298	1298	1.0
2	CS001114000005	626	486	0.776358
3	CS001115000010	3044	2694	0.88502
4	CS001205000004	1988	346	0.174044
5	CS001205000006	3337	2004	0.600539
6	CS001212000027	448	200	0.446429
7	CS001212000031	296	296	1.0
8	CS001212000046	228	108	0.473684
9	CS001212000070	456	308	0.675439
10	CS001213000018	243	145	0.596708

経過日数，月数，年数

J-070: レシート明細データ（df_receipt）の売上日（sales_ymd）に対し、顧客データ（df_customer）の会員申込日（application_date）からの経過日数を計算し、顧客ID（customer_id）、売上日、会員申込日とともに10件表示せよ（sales_ymdは数値、application_dateは文字列でデータを保持している点に注意）。

using Dates
date(d) = @. Date(d ÷ 10000, (d % 10000) ÷ 100, d % 100)
df = innerjoin(df_receipt[!, [:customer_id, :sales_ymd]],
          df_customer[!, [:customer_id, :application_date]],
          on=:customer_id);
df.elapsed_days = @. Dates.value(date(df.sales_ymd) - date(parse(Int, df.application_date)))
first(df, 10)

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_days
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	1371
2	CS008415000097	20181118	20150322	1337
3	CS028414000014	20170712	20150711	732
4	CS025415000050	20180821	20160131	933
5	CS003515000195	20190605	20150306	1552
6	CS024514000042	20181205	20151010	1152
7	CS040415000178	20190922	20150627	1548
8	CS027514000015	20191010	20151101	1439
9	CS025415000134	20190918	20150720	1521
10	CS021515000126	20171010	20150508	886

# 別解
using Dates
date(d) = @. Date(d ÷ 10000, (d % 10000) ÷ 100, d % 100)
df_receipt |>
    @join(df_customer, _.customer_id, _.customer_id, {_.customer_id, _.sales_ymd, __.application_date}) |>
    @mutate(elapsed_days = Dates.value(date(_.sales_ymd) - date(parse(Int, _.application_date)))) |>
    @take(10) |> DataFrame

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_days
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	1371
2	CS008415000097	20181118	20150322	1337
3	CS028414000014	20170712	20150711	732
4	CS025415000050	20180821	20160131	933
5	CS003515000195	20190605	20150306	1552
6	CS024514000042	20181205	20151010	1152
7	CS040415000178	20190922	20150627	1548
8	CS027514000015	20191010	20151101	1439
9	CS025415000134	20190918	20150720	1521
10	CS021515000126	20171010	20150508	886

J-071: レシート明細データ（df_receipt）の売上日（sales_ymd）に対し、顧客データ（df_customer）の会員申込日（application_date）からの経過月数を計算し、顧客ID（customer_id）、売上日、会員申込日とともに10件表示せよ（sales_ymdは数値、application_dateは文字列でデータを保持している点に注意）。1ヶ月未満は切り捨てること。

using Dates
date2(d) = @. (d ÷ 10000, (d % 10000) ÷ 100, d % 100)
function date3(d_begin, d_end)
    y, m, d = date2(d_end) .- date2(parse(Int, d_begin))
    12y + m + (d < 0)*(-1)
end;

df = innerjoin(df_receipt[!, [:customer_id, :sales_ymd]],
          df_customer[!, [:customer_id, :application_date]],
          on=:customer_id);
df.elapsed_months = date3.(df.application_date, df.sales_ymd)
first(df, 10)

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_months
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	45
2	CS008415000097	20181118	20150322	43
3	CS028414000014	20170712	20150711	24
4	CS025415000050	20180821	20160131	30
5	CS003515000195	20190605	20150306	50
6	CS024514000042	20181205	20151010	37
7	CS040415000178	20190922	20150627	50
8	CS027514000015	20191010	20151101	47
9	CS025415000134	20190918	20150720	49
10	CS021515000126	20171010	20150508	29

# 別解
df_receipt |>
    @join(df_customer, _.customer_id, _.customer_id, {_.customer_id, _.sales_ymd, __.application_date}) |>
    @mutate(elapsed_months = date3( _.application_date, _.sales_ymd)) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_months
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	45
2	CS008415000097	20181118	20150322	43
3	CS028414000014	20170712	20150711	24
4	CS025415000050	20180821	20160131	30
5	CS003515000195	20190605	20150306	50
6	CS024514000042	20181205	20151010	37
7	CS040415000178	20190922	20150627	50
8	CS027514000015	20191010	20151101	47
9	CS025415000134	20190918	20150720	49
10	CS021515000126	20171010	20150508	29

J-072: レシート明細データ（df_receipt）の売上日（df_customer）に対し、顧客データ（df_customer）の会員申込日（application_date）からの経過年数を計算し、顧客ID（customer_id）、売上日、会員申込日とともに10件表示せよ（sales_ymdは数値、application_dateは文字列でデータを保持している点に注意）。1年未満は切り捨てること。

year2(ymd, ymd2) = (ymd2 .- ymd) .÷ 10000
df = innerjoin(df_receipt[!, [:customer_id, :sales_ymd]],
          df_customer[!, [:customer_id, :application_date]],
          on=:customer_id)
df.elapsed_years = year2(parse.(Int, df.application_date), df.sales_ymd)
first(df, 10)

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_years
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	3
2	CS008415000097	20181118	20150322	3
3	CS028414000014	20170712	20150711	2
4	CS025415000050	20180821	20160131	2
5	CS003515000195	20190605	20150306	4
6	CS024514000042	20181205	20151010	3
7	CS040415000178	20190922	20150627	4
8	CS027514000015	20191010	20151101	3
9	CS025415000134	20190918	20150720	4
10	CS021515000126	20171010	20150508	2

# 別解
year2(ymd, ymd2) = (ymd2 - ymd) ÷ 10000
df_receipt |>
    @join(df_customer, _.customer_id, _.customer_id, {_.customer_id, _.sales_ymd, __.application_date}) |>
    @mutate(elapsed_years = year2(parse(Int, _.application_date), _.sales_ymd)) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_years
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	3
2	CS008415000097	20181118	20150322	3
3	CS028414000014	20170712	20150711	2
4	CS025415000050	20180821	20160131	2
5	CS003515000195	20190605	20150306	4
6	CS024514000042	20181205	20151010	3
7	CS040415000178	20190922	20150627	4
8	CS027514000015	20191010	20151101	3
9	CS025415000134	20190918	20150720	4
10	CS021515000126	20171010	20150508	2

J-073: レシート明細データ（df_receipt）の売上日（sales_ymd）に対し、顧客データ（df_customer）の会員申込日（application_date）からのエポック秒による経過時間を計算し、顧客ID（customer_id）、売上日、会員申込日とともに10件表示せよ（なお、sales_ymdは数値、application_dateは文字列でデータを保持している点に注意）。なお、時間情報は保有していないため各日付は0時0分0秒を表すものとする。

datetime(d) = DateTime(d ÷ 10000, (d % 10000) ÷ 100, d % 100, 0, 0, 0)

df = innerjoin(df_receipt[!, [:customer_id, :sales_ymd]],
          df_customer[!, [:customer_id, :application_date]],
          on=:customer_id)
df.elapsed_epoch = convert.(Int64, Dates.value.(datetime.(df.sales_ymd) - datetime.(parse.(Int, df.application_date)))/1000)
first(df, 10)

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_epoch
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	118454400
2	CS008415000097	20181118	20150322	115516800
3	CS028414000014	20170712	20150711	63244800
4	CS025415000050	20180821	20160131	80611200
5	CS003515000195	20190605	20150306	134092800
6	CS024514000042	20181205	20151010	99532800
7	CS040415000178	20190922	20150627	133747200
8	CS027514000015	20191010	20151101	124329600
9	CS025415000134	20190918	20150720	131414400
10	CS021515000126	20171010	20150508	76550400

# 別解
datetime(d) = DateTime(d ÷ 10000, (d % 10000) ÷ 100, d % 100, 0, 0, 0)

df_receipt |>
    @join(df_customer, _.customer_id, _.customer_id, {_.customer_id, _.sales_ymd, __.application_date}) |>
    @mutate(elapsed_epoch = convert(Int64, Dates.value(datetime(_.sales_ymd) .- datetime(parse(Int, _.application_date)))/1000)) |>
    @take(10) |>
    DataFrame

10×4 DataFrame

Row	customer_id	sales_ymd	application_date	elapsed_epoch
	String15	Int64	String	Int64
1	CS006214000001	20181103	20150201	118454400
2	CS008415000097	20181118	20150322	115516800
3	CS028414000014	20170712	20150711	63244800
4	CS025415000050	20180821	20160131	80611200
5	CS003515000195	20190605	20150306	134092800
6	CS024514000042	20181205	20151010	99532800
7	CS040415000178	20190922	20150627	133747200
8	CS027514000015	20191010	20151101	124329600
9	CS025415000134	20190918	20150720	131414400
10	CS021515000126	20171010	20150508	76550400

J-074: レシート明細データ（df_receipt）の売上日（sales_ymd）に対し、当該週の月曜日からの経過日数を計算し、売上日、直前の月曜日付とともに10件表示せよ（sales_ymdは数値でデータを保持している点に注意）。

date(d) = Date(d ÷ 10000, (d % 10000) ÷ 100, d % 100)
df = copy(df_receipt)  # または，df = df_receipt[!, [:sales_ymd]]
df.monday = firstdayofweek.(date.(df.sales_ymd))
df.elapsed_days = Dates.value.(date.(df.sales_ymd) - df.monday)
df = select(df, [:sales_ymd,  :elapsed_days, :monday])
first(df, 10)

10×3 DataFrame

Row	sales_ymd	elapsed_days	monday
	Int64	Int64	Date
1	20181103	5	2018-10-29
2	20181118	6	2018-11-12
3	20170712	2	2017-07-10
4	20190205	1	2019-02-04
5	20180821	1	2018-08-20
6	20190605	2	2019-06-03
7	20181205	2	2018-12-03
8	20190922	6	2019-09-16
9	20170504	3	2017-05-01
10	20191010	3	2019-10-07

# 別解
using Dates
date(d) = Date(d ÷ 10000, (d % 10000) ÷ 100, d % 100)
df_receipt |>
    #@select(:sales_ymd) |>
    @mutate(monday = firstdayofweek(date(_.sales_ymd))) |>
    @mutate(elapsed_days = Dates.value(date(_.sales_ymd) - _.monday)) |>
    @select(:sales_ymd,  :elapsed_days, :monday) |>
    @take(10) |>
    DataFrame

10×3 DataFrame

Row	sales_ymd	elapsed_days	monday
	Int64	Int64	Date
1	20181103	5	2018-10-29
2	20181118	6	2018-11-12
3	20170712	2	2017-07-10
4	20190205	1	2019-02-04
5	20180821	1	2018-08-20
6	20190605	2	2019-06-03
7	20181205	2	2018-12-03
8	20190922	6	2019-09-16
9	20170504	3	2017-05-01
10	20191010	3	2019-10-07

無作為抽出

J-075: 顧客データ（df_customer）からランダムに1%のデータを抽出し、先頭から10件表示せよ。

using StatsBase
n = nrow(df_customer)
df = df_customer[sample(1:n, floor(Int, n*0.01), replace=false), :]
first(df, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS001313000265	竹村遥	1	女性	1984-04-13	34	144-0035	東京都大田区南蒲田**********	S13001	20160624	0-00000000-0
2	CS001415000572	田口ヒカル	1	女性	1977-07-31	41	144-0056	東京都大田区西六郷**********	S13001	20170413	8-20100925-7
3	CS013313000160	大塚遥	9	不明	1980-09-28	38	275-0022	千葉県習志野市香澄**********	S12013	20150117	0-00000000-0
4	CS017412000008	ほしの菜摘	1	女性	1971-12-18	47	166-0004	東京都杉並区阿佐谷南**********	S13017	20150907	3-20080225-3
5	CS022403000040	佐川憲史	0	男性	1969-09-25	49	240-0113	神奈川県三浦郡葉山町長柄**********	S14022	20150804	0-00000000-0
6	CS001713000040	村山遥	1	女性	1939-12-10	79	144-0045	東京都大田区南六郷**********	S13001	20150323	0-00000000-0
7	CS023602000043	戸田和之	0	男性	1949-01-31	70	212-0054	神奈川県川崎市幸区小倉**********	S14023	20170114	0-00000000-0
8	CS003615000329	おかやま貴美子	1	女性	1956-10-15	62	201-0001	東京都狛江市西野川**********	S13003	20170405	0-00000000-0
9	CS038615000053	門脇優	1	女性	1951-01-03	68	279-0041	千葉県浦安市堀江**********	S13038	20150522	4-20090601-5
10	CS039413000205	野田季衣	1	女性	1975-02-18	44	167-0032	東京都杉並区天沼**********	S13039	20150412	1-20090216-4

以下の方法は，各行に [0,1) の一様乱数を与え，その値が希望する抽出率の数値未満の行を抽出するというものである。実際に抽出される件数が若干の幅を持つが，実用上の問題はないであろう。

# 別解
using StatsBase
df = df_customer |>
    @mutate(uniformrandom=rand(1)[1]) |>
    @filter(_.uniformrandom < 0.01) |>
    @select(-:uniformrandom) |>  # 乱数列は用済みなので削除
    DataFrame
println("df_customer の件数 = $(nrow(df_customer)), 抽出された件数 = $(nrow(df)), 抽出率 = $(100nrow(df)/nrow(df_customer)) %")
first(df, 10)

df_customer の件数 = 21971, 抽出された件数 = 216, 抽出率 = 0.9831141049565336 %

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
2	CS031415000106	宇野由美子	1	女性	1970-02-26	49	151-0053	東京都渋谷区代々木**********	S13031	20150201	F-20100511-E
3	CS029313000221	北条ひかり	1	女性	1987-06-19	31	279-0011	千葉県浦安市美浜**********	S12029	20180810	0-00000000-0
4	CS013613000043	大森瞳	1	女性	1956-04-28	62	261-0025	千葉県千葉市美浜区浜田**********	S12013	20150222	0-00000000-0
5	CS006803000002	堀獅童	0	男性	1930-01-21	89	224-0057	神奈川県横浜市都筑区川和町**********	S14006	20150927	0-00000000-0
6	CS031514000047	原口礼子	1	女性	1963-07-26	55	151-0064	東京都渋谷区上原**********	S13031	20150927	8-20090607-6
7	CS025713000029	大塚怜奈	1	女性	1942-07-31	76	242-0015	神奈川県大和市下和田**********	S14025	20151005	0-00000000-0
8	CS037704000007	吹越菊生	0	男性	1943-01-11	76	136-0076	東京都江東区南砂**********	S13037	20150329	0-00000000-0
9	CS002215000165	田村美幸	1	女性	1991-01-16	28	185-0014	東京都国分寺市東恋ケ窪**********	S13002	20170728	5-20100122-1
10	CS007415000047	土屋瞬	1	女性	1973-06-25	45	285-0855	千葉県佐倉市井野**********	S12007	20141122	0-00000000-0

層化抽出

J-076: 顧客データ（df_customer）から性別コード（gender_cd）の割合に基づきランダムに10%のデータを層化抽出し、性別コードごとに件数を集計せよ。

combine(groupby(df_customer, :gender_cd), nrow => :customer_num)

3×2 DataFrame

Row	gender_cd	customer_num
	String	Int64
1	1	17918
2	9	1072
3	0	2981

function sampling(df)
    n = nrow(df)
    return df[sample(1:n, floor(Int, n*0.01), replace=false), :]
end

gd = groupby(df_customer, :gender_cd)
df = sampling(gd[1])
for i in 2:3
    df = vcat(df, sampling(gd[i]))
end

gd = groupby(df, :gender_cd)
combine(gd, nrow)

3×2 DataFrame

Row	gender_cd	nrow
	String	Int64
1	1	179
2	9	10
3	0	29

データフレームをグループ化し，それぞれのデータフレームに対して一定の処理を行い，再度一つのデータフレームにまとめるという作業は以下のようにすれば簡単になる。

# 別解
gd = groupby(df_customer, :gender_cd)  # グループ化
df = [sampling(df) for df in gd]  # 処理（関数内で行う）
df2 = vcat(df...)  # 再度一つのデータフレームにまとめる
first(df2, 10)

10×11 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15
1	CS019415000255	寺西綾	1	女性	1975-09-24	43	173-0036	東京都板橋区向原**********	S13019	20150515	D-20100409-E
2	CS022412000091	溝口早紀	1	女性	1972-03-29	47	240-0112	神奈川県三浦郡葉山町堀内**********	S14022	20150531	0-00000000-0
3	CS033513000001	斉藤美帆	1	女性	1964-06-18	54	246-0037	神奈川県横浜市瀬谷区橋戸**********	S14033	20150918	9-20090809-6
4	CS004515000228	吉村南朋	1	女性	1961-05-15	57	167-0022	東京都杉並区下井草**********	S13004	20150214	0-00000000-0
5	CS035212000033	上村あさみ	1	女性	1993-09-25	25	157-0068	東京都世田谷区宇奈根**********	S13035	20150415	5-20100905-6
6	CS001312000231	西脇京子	1	女性	1985-10-18	33	210-0021	神奈川県川崎市川崎区元木**********	S13001	20160324	0-00000000-0
7	CS022715000051	黒谷景子	1	女性	1941-07-16	77	249-0005	神奈川県逗子市桜山**********	S14022	20150608	0-00000000-0
8	CS024312000022	三輪瞳	1	女性	1981-02-13	38	216-0015	神奈川県川崎市宮前区菅生**********	S14024	20150924	0-00000000-0
9	CS001211000024	島津一恵	1	女性	1992-02-06	27	212-0058	神奈川県川崎市幸区鹿島田**********	S13001	20160421	0-00000000-0
10	CS027315000113	武藤美佐	1	女性	1987-05-12	31	251-0016	神奈川県藤沢市弥勒寺**********	S14027	20150626	0-00000000-0

外れ値

J-077: レシート明細データ（df_receipt）の売上金額を顧客単位に合計し、合計した売上金額の外れ値を抽出せよ。なお、外れ値は売上金額合計を対数化したうえで平均と標準偏差を計算し、その平均から3σを超えて離れたものとする（自然対数と常用対数のどちらでも可）。結果は10件表示せよ。

gd = groupby(df_receipt, :customer_id)
df = combine(gd, :amount => sum => :sum_amount, :amount => (x -> log(sum(x))) => :log_sum_amount);

using StatsBase
m = mean(df.log_sum_amount)
s = std(df.log_sum_amount)
df2 = df[abs.(df.log_sum_amount .- m) ./ s .> 3, :]
first(df2, 10)

1×3 DataFrame

Row	customer_id	sum_amount	log_sum_amount
	String15	Int64	Float64
1	ZZ000000000000	12395003	16.3328

# 別解
df = df_receipt |>
    @groupby(_.customer_id) |>
    @map({customer_id=key(_), sum_amount=sum(_.amount), log_sum_amount=log(sum(_.amount))}) |>
    DataFrame

using StatsBase
m = mean(df.log_sum_amount)
s = std(df.log_sum_amount)
df2 = df |>
    @filter(abs(_.log_sum_amount - m) / s > 3) |>
    @take(10) |>
    DataFrame

1×3 DataFrame

Row	customer_id	sum_amount	log_sum_amount
	String15	Int64	Float64
1	ZZ000000000000	12395003	16.3328

J-078: レシート明細データ（df_receipt）の売上金額（amount）を顧客単位に合計し、合計した売上金額の外れ値を抽出せよ。ただし、顧客IDが"Z"から始まるのものは非会員を表すため、除外して計算すること。なお、ここでは外れ値を第1四分位と第3四分位の差であるIQRを用いて、「第1四分位数-1.5×IQR」を下回るもの、または「第3四分位数+1.5×IQR」を超えるものとする。結果は10件表示せよ。

gd = groupby(filter(:customer_id => x -> !occursin(r"^Z", x), df_receipt), :customer_id)
df = combine(gd, :amount => sum => :sum_amount)
using StatsBase
(qt1, qt3) = quantile(df.sum_amount, [0.25, 0.75])
IQR = qt3 - qt1
df2 = df[df.sum_amount .< qt1 - 1.5*IQR .|| df.sum_amount .> qt3 + 1.5*IQR, :]
sort!(df2, :customer_id)
first(df2, 10)

10×2 DataFrame

Row	customer_id	sum_amount
	String15	Int64
1	CS001414000048	8584
2	CS001605000009	18925
3	CS002415000594	9568
4	CS004414000181	9584
5	CS005415000137	8734
6	CS006414000001	9156
7	CS006414000029	9179
8	CS006415000105	10042
9	CS006415000147	12723
10	CS006415000157	10648

# 別解
df = df_receipt[!, [:customer_id, :amount]] |>
    @filter(!occursin(r"^Z", _.customer_id)) |>
    @groupby(_.customer_id) |>
    @map({customer_id=key(_), sum_amount=sum(_.amount)}) |>
    DataFrame

using StatsBase
(qt1, qt3) = quantile(df2.sum_amount, [0.25, 0.75])
IQR = qt3 - qt1

df |>
    @filter(_.sum_amount < qt1 - 1.5*IQR || _.sum_amount > qt3 + 1.5*IQR) |>
    @orderby(_.customer_id) |>
    @take(10) |>
    DataFrame

10×2 DataFrame

Row	customer_id	sum_amount
	String15	Int64
1	CS001113000004	1298
2	CS001114000005	626
3	CS001115000010	3044
4	CS001205000004	1988
5	CS001205000006	3337
6	CS001211000025	456
7	CS001212000027	448
8	CS001212000031	296
9	CS001212000046	228
10	CS001212000070	456

欠損値の補完

J-079: 商品データ（df_product）の各項目に対し、欠損数を確認せよ。

a = map(eachcol(df_product)) do col
    length(col) - length(collect(skipmissing(col)))
end;

DataFrame(variable = names(df_product), missings = a)

6×2 DataFrame

Row	variable	missings
	String	Int64
1	product_cd	0
2	category_major_cd	0
3	category_medium_cd	0
4	category_small_cd	0
5	unit_price	7
6	unit_cost	7

J-080: 商品データ（df_product）のいずれかの項目に欠損が発生しているレコードを全て削除した新たな商品データを作成せよ。なお、削除前後の件数を表示させ、079で確認した件数だけ減少していることも確認すること。

df = df_product |> dropmissing
println("欠損値削除前の件数 = $(nrow(df_product)),  欠損値削除後の件数 = $(nrow(df))")

欠損値削除前の件数 = 10030,  欠損値削除後の件数 = 10023

J-081: 単価（unit_price）と原価（unit_cost）の欠損値について、それぞれの平均値で補完した新たな商品データを作成せよ。なお、平均値については1円未満を丸めること（四捨五入または偶数への丸めで良い）。補完実施後、各項目について欠損が生じていないことも確認すること。

price_mean = round(Int, mean(skipmissing(df_product.unit_price)))
cost_mean  = round(Int, mean(skipmissing(df_product.unit_cost)))
println("price_mean = $price_mean,  cost_mean = $cost_mean")
df = copy(df_product)
println(df[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
println("nrow = $(nrow(df))")
df.unit_price = coalesce.(df.unit_price, price_mean)
df.unit_cost  = coalesce.(df.unit_cost,  cost_mean)
println(df[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
println("nrow = $(nrow(df))")

price_mean = 403,  cost_mean = 302
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64?     [0m[90m Int64?    [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329
   2 │ P040802007  04                 0408                040802            [90m    missing [0m[90m   missing [0m
   3 │ P050103021  05                 0501                050103            [90m    missing [0m[90m   missing [0m
   4 │ P050405009  05                 0504                050405            [90m    missing [0m[90m   missing [0m
   5 │ P060802026  06                 0608                060802            [90m    missing [0m[90m   missing [0m
   6 │ P070202092  07                 0702                070202            [90m    missing [0m[90m   missing [0m
   7 │ P080504027  08                 0805                080504            [90m    missing [0m[90m   missing [0m
   8 │ P090204185  09                 0902                090204            [90m    missing [0m[90m   missing [0m
nrow = 10030
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64      [0m[90m Int64     [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329
   2 │ P040802007  04                 0408                040802                    403        302
   3 │ P050103021  05                 0501                050103                    403        302
   4 │ P050405009  05                 0504                050405                    403        302
   5 │ P060802026  06                 0608                060802                    403        302
   6 │ P070202092  07                 0702                070202                    403        302
   7 │ P080504027  08                 0805                080504                    403        302
   8 │ P090204185  09                 0902                090204                    403        302
nrow = 10030

# 別解
using StatsBase
price_mean = round(Int, mean(skipmissing(df_product.unit_price)))
cost_mean  = round(Int, mean(skipmissing(df_product.unit_cost)))
println(df_product[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
println("nrow = $(nrow(df))")
df = df_product |>
    # @replacena(:unit_price => price_mean, :unit_cost => cost_mean) |> エラーになる。変数名は指定できない
    @replacena(:unit_price => 9999999999, :unit_cost => 9999999999) |>
    @mutate(unit_price = _.unit_price == 9999999999 ? price_mean : _.unit_price,
            unit_cost  = _.unit_cost  == 9999999999 ? cost_mean  : _.unit_cost) |>
    DataFrame
println("nrow = $(nrow(df))")
println(df[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])

[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64?     [0m[90m Int64?    [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329
   2 │ P040802007  04                 0408                040802            [90m    missing [0m[90m   missing [0m
   3 │ P050103021  05                 0501                050103            [90m    missing [0m[90m   missing [0m
   4 │ P050405009  05                 0504                050405            [90m    missing [0m[90m   missing [0m
   5 │ P060802026  06                 0608                060802            [90m    missing [0m[90m   missing [0m
   6 │ P070202092  07                 0702                070202            [90m    missing [0m[90m   missing [0m
   7 │ P080504027  08                 0805                080504            [90m    missing [0m[90m   missing [0m
   8 │ P090204185  09                 0902                090204            [90m    missing [0m[90m   missing [0m
nrow = 10030
nrow = 10030
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m Any        [0m[90m Any               [0m[90m Any                [0m[90m Any               [0m[90m Any        [0m[90m Any       [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802             438         329
   2 │ P040802007  04                 0408                040802             403         302
   3 │ P050103021  05                 0501                050103             403         302
   4 │ P050405009  05                 0504                050405             403         302
   5 │ P060802026  06                 0608                060802             403         302
   6 │ P070202092  07                 0702                070202             403         302
   7 │ P080504027  08                 0805                080504             403         302
   8 │ P090204185  09                 0902                090204             403         302

J-082: 単価（unit_price）と原価（unit_cost）の欠損値について、それぞれの中央値で補完した新たな商品データを作成せよ。なお、中央値については1円未満を丸めること（四捨五入または偶数への丸めで良い）。補完実施後、各項目について欠損が生じていないことも確認すること。

price_median = round(Int, median(skipmissing(df_product.unit_price)))
cost_median  = round(Int, median(skipmissing(df_product.unit_cost)))
println("price_median = $price_median,  cost_median = $cost_median")
df = copy(df_product)
println(df[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
println("nrow = $(nrow(df))")
df.unit_price = coalesce.(df.unit_price, price_median)
df.unit_cost  = coalesce.(df.unit_cost,  cost_median)
println(df[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
println("nrow = $(nrow(df))")

price_median = 252,  cost_median = 189
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64?     [0m[90m Int64?    [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329
   2 │ P040802007  04                 0408                040802            [90m    missing [0m[90m   missing [0m
   3 │ P050103021  05                 0501                050103            [90m    missing [0m[90m   missing [0m
   4 │ P050405009  05                 0504                050405            [90m    missing [0m[90m   missing [0m
   5 │ P060802026  06                 0608                060802            [90m    missing [0m[90m   missing [0m
   6 │ P070202092  07                 0702                070202            [90m    missing [0m[90m   missing [0m
   7 │ P080504027  08                 0805                080504            [90m    missing [0m[90m   missing [0m
   8 │ P090204185  09                 0902                090204            [90m    missing [0m[90m   missing [0m
nrow = 10030
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64      [0m[90m Int64     [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329
   2 │ P040802007  04                 0408                040802                    252        189
   3 │ P050103021  05                 0501                050103                    252        189
   4 │ P050405009  05                 0504                050405                    252        189
   5 │ P060802026  06                 0608                060802                    252        189
   6 │ P070202092  07                 0702                070202                    252        189
   7 │ P080504027  08                 0805                080504                    252        189
   8 │ P090204185  09                 0902                090204                    252        189
nrow = 10030

# 別解
price_median = round(Int, median(skipmissing(df_product.unit_price)))
cost_median  = round(Int, median(skipmissing(df_product.unit_cost)))
println("price_median = $price_median,  cost_median = $cost_median")
println("nrow = $(nrow(df))")
println(df_product[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
df = df_product |>
    # @replacena(:unit_price => price_median, :unit_cost => cost_median) |> エラーになる。変数名は指定できない
    @replacena(:unit_price => 9999999999, :unit_cost => 9999999999) |>
    @mutate(unit_price = _.unit_price == 9999999999 ? price_median : _.unit_price,
            unit_cost  = _.unit_cost  == 9999999999 ? cost_median  : _.unit_cost) |>
    DataFrame
println("nrow = $(nrow(df))")
println(df[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])

price_median = 252,  cost_median = 189
nrow = 10030
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64?     [0m[90m Int64?    [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329
   2 │ P040802007  04                 0408                040802            [90m    missing [0m[90m   missing [0m
   3 │ P050103021  05                 0501                050103            [90m    missing [0m[90m   missing [0m
   4 │ P050405009  05                 0504                050405            [90m    missing [0m[90m   missing [0m
   5 │ P060802026  06                 0608                060802            [90m    missing [0m[90m   missing [0m
   6 │ P070202092  07                 0702                070202            [90m    missing [0m[90m   missing [0m
   7 │ P080504027  08                 0805                080504            [90m    missing [0m[90m   missing [0m
   8 │ P090204185  09                 0902                090204            [90m    missing [0m[90m   missing [0m
nrow = 10030
[1m8×6 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m
     │[90m Any        [0m[90m Any               [0m[90m Any                [0m[90m Any               [0m[90m Any        [0m[90m Any       [0m
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802             438         329
   2 │ P040802007  04                 0408                040802             252         189
   3 │ P050103021  05                 0501                050103             252         189
   4 │ P050405009  05                 0504                050405             252         189
   5 │ P060802026  06                 0608                060802             252         189
   6 │ P070202092  07                 0702                070202             252         189
   7 │ P080504027  08                 0805                080504             252         189
   8 │ P090204185  09                 0902                090204             252         189

J-083: 単価（unit_price）と原価（unit_cost）の欠損値について、各商品のカテゴリ小区分コード（category_small_cd）ごとに算出した中央値で補完した新たな商品データを作成せよ。なお、中央値については1円未満を丸めること（四捨五入または偶数への丸めで良い）。補完実施後、各項目について欠損が生じていないことも確認すること。

using StatsBase

gd = groupby(df_product, :category_small_cd)
df = combine(gd, :unit_price => (x -> round(Int, median(skipmissing(x)))) => :unit_price_median,
                 :unit_cost  => (x -> round(Int, median(skipmissing(x)))) => :unit_cost_median);

df2 = innerjoin(df_product, df, on=:category_small_cd);

# これは動かない
# results = df2 |>
#     @mutate(unit_price2 = ifelse(ismissing(_.unit_price), _.unit_price_median, _.unit_price)) |>
#     DataFrame;

println("Before")
println(df2[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])
for i in 1:nrow(df2)
    ismissing(df2.unit_price[i]) && (df2.unit_price[i] = df2.unit_price_median[i])
    ismissing(df2.unit_cost[i]) && (df2.unit_cost[i] = df2.unit_cost_median[i])
end
println("After")
println(df2[[159, 160, 197, 497, 1532, 2013, 6297, 7076], :])

Before
[1m8×8 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m[1m unit_price_median [0m[1m unit_cost_median [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64?     [0m[90m Int64?    [0m[90m Int64             [0m[90m Int64            [0m
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329                313               235
   2 │ P040802007  04                 0408                040802            [90m    missing [0m[90m   missing [0m               313               235
   3 │ P050103021  05                 0501                050103            [90m    missing [0m[90m   missing [0m               132               100
   4 │ P050405009  05                 0504                050405            [90m    missing [0m[90m   missing [0m               178               134
   5 │ P060802026  06                 0608                060802            [90m    missing [0m[90m   missing [0m               270               200
   6 │ P070202092  07                 0702                070202            [90m    missing [0m[90m   missing [0m               238               179
   7 │ P080504027  08                 0805                080504            [90m    missing [0m[90m   missing [0m               258               196
   8 │ P090204185  09                 0902                090204            [90m    missing [0m[90m   missing [0m               694               521
After
[1m8×8 DataFrame[0m
[1m Row [0m│[1m product_cd [0m[1m category_major_cd [0m[1m category_medium_cd [0m[1m category_small_cd [0m[1m unit_price [0m[1m unit_cost [0m[1m unit_price_median [0m[1m unit_cost_median [0m
     │[90m String15   [0m[90m String            [0m[90m String             [0m[90m String            [0m[90m Int64?     [0m[90m Int64?    [0m[90m Int64             [0m[90m Int64            [0m
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ P040802006  04                 0408                040802                    438        329                313               235
   2 │ P040802007  04                 0408                040802                    313        235                313               235
   3 │ P050103021  05                 0501                050103                    132        100                132               100
   4 │ P050405009  05                 0504                050405                    178        134                178               134
   5 │ P060802026  06                 0608                060802                    270        200                270               200
   6 │ P070202092  07                 0702                070202                    238        179                238               179
   7 │ P080504027  08                 0805                080504                    258        196                258               196
   8 │ P090204185  09                 0902                090204                    694        521                694               521

J-084: 顧客データ（df_customer）の全顧客に対して全期間の売上金額に占める2019年売上金額の割合を計算し、新たなデータを作成せよ。ただし、売上実績がない場合は0として扱うこと。そして計算した割合が0超のものを抽出し、結果を10件表示せよ。また、作成したデータに欠損が存在しないことを確認せよ。

gd2019 = groupby(df_receipt[df_receipt.sales_ymd .÷ 10000 .== 2019, [:customer_id, :amount]], :customer_id)
df2019 = combine(gd2019, :amount => sum => :amount_2019);

gdall = groupby(df_receipt[!, [:customer_id, :amount]], :customer_id)
dfall = combine(gdall, :amount => sum => :amount_all);

df = leftjoin(df2019, dfall, on=:customer_id) |>
    @replacena(0) |>
    @mutate(amount_rate = _.amount_2019 / _.amount_all) |>
    @filter(_.amount_rate > 0) |>
    @orderby(_.customer_id) |>
    DataFrame
first(df, 10)

10×4 DataFrame

Row	customer_id	amount_2019	amount_all	amount_rate
	String15	Int64	Int64	Float64
1	CS001113000004	1298	1298	1.0
2	CS001114000005	188	626	0.300319
3	CS001115000010	578	3044	0.189882
4	CS001205000004	702	1988	0.353119
5	CS001205000006	486	3337	0.14564
6	CS001211000025	456	456	1.0
7	CS001212000070	456	456	1.0
8	CS001214000009	664	4685	0.141729
9	CS001214000017	2962	4132	0.716844
10	CS001214000048	1889	2374	0.795703

a = map(eachcol(df)) do col
    length(col) - length(collect(skipmissing(col)))
end;

DataFrame(variable = names(df), missings = a)

4×2 DataFrame

Row	variable	missings
	String	Int64
1	customer_id	0
2	amount_2019	0
3	amount_all	0
4	amount_rate	0

位置情報（経緯度と距離）

J-085: 顧客データ（df_customer）の全顧客に対し、郵便番号（postal_cd）を用いてジオコードデータ（df_geocode）を紐付け、新たな顧客データを作成せよ。ただし、1つの郵便番号（postal_cd）に複数の経度（longitude）、緯度（latitude）情報が紐づく場合は、経度（longitude）、緯度（latitude）の平均値を算出して使用すること。また、作成結果を確認するために結果を10件表示せよ。

using StatsBase
gd = groupby(df_geocode[!, [:postal_cd, :longitude, :latitude]], :postal_cd)
df = combine(gd, :longitude => mean => :m_longitude, :latitude => mean => :m_latitude);

df_customer2 = innerjoin(df_customer, df, on=:postal_cd)
first(df_customer2, 10)

10×13 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd	m_longitude	m_latitude
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15	Float64	Float64
1	CS020301000012	都築育二	0	男性	1985-08-20	33	332-0031	埼玉県川口市青木**********	S13020	20161018	0-00000000-0	139.727	35.8088
2	CS051412000011	奥村愛	1	女性	1969-06-04	49	332-0023	埼玉県川口市飯塚**********	S13051	20180702	0-00000000-0	139.711	35.7965
3	CS051412000012	おかやま未華子	1	女性	1977-03-13	42	332-0023	埼玉県川口市飯塚**********	S13051	20180511	0-00000000-0	139.711	35.7965
4	CS051212000001	美木瞬	1	女性	1991-10-19	27	332-0023	埼玉県川口市飯塚**********	S13051	20180509	2-20101018-4	139.711	35.7965
5	CS020112000003	荒川まなみ	1	女性	2004-05-13	14	332-0015	埼玉県川口市川口**********	S13020	20151116	0-00000000-0	139.716	35.8023
6	CS020212000016	藤沢恵梨香	1	女性	1991-08-08	27	332-0015	埼玉県川口市川口**********	S13020	20150122	4-20100207-3	139.716	35.8023
7	CS051512000001	田原夏希	1	女性	1959-11-21	59	332-0015	埼玉県川口市川口**********	S13051	20190325	0-00000000-0	139.716	35.8023
8	CS020212000004	前田美佐	1	女性	1989-12-11	29	332-0015	埼玉県川口市川口**********	S13020	20150814	C-20090906-9	139.716	35.8023
9	CS051502000001	若山哲平	0	男性	1961-10-07	57	332-0015	埼玉県川口市川口**********	S13051	20180209	0-00000000-0	139.716	35.8023
10	CS020212000008	板垣瞳	1	女性	1997-01-05	22	332-0015	埼玉県川口市川口**********	S13020	20150105	0-00000000-0	139.716	35.8023

J-086: 085で作成した緯度経度つき顧客データに対し、会員申込店舗コード（application_store_cd）をキーに店舗データ（df_store）と結合せよ。そして申込み店舗の緯度（latitude）・経度情報（longitude)と顧客住所（address）の緯度・経度を用いて申込み店舗と顧客住所の距離（単位：km）を求め、顧客ID（customer_id）、顧客住所（address）、店舗住所（address）とともに表示せよ。計算式は以下の簡易式で良いものとするが、その他精度の高い方式を利用したライブラリを利用してもかまわない。結果は10件表示せよ。

$$
\mbox{緯度（ラジアン）}：\phi \
\mbox{経度（ラジアン）}：\lambda \
\mbox{距離}L = 6371 * \arccos(\sin \phi_1 * \sin \phi_2

\cos \phi_1 * \cos \phi_2 * \cos(\lambda_1 − \lambda_2))
$$

calc_distance(x1, y1, x2, y2) = 6371 .* acos.(
                        sin.(x1 * pi / 180) 
                        * sin.(x2 .* pi / 180) 
                        + cos.(x1 .* pi / 180) 
                        * cos.(x2 .* pi / 180) 
                        * cos.((y1 .* pi / 180) - (y2 .* pi / 180 )))

calc_distance (generic function with 1 method)

result = innerjoin(df_customer2, df_store, on=:application_store_cd => :store_cd, makeunique=true);

result_df = result |>
    @mutate(distance = calc_distance(_.m_latitude, _.m_longitude, _.latitude, _.longitude)) |>
    @rename(:address => :customer_address, :address_1 => :store_address) |>
    @select(:customer_id, :customer_address, :store_address, :distance) |>
    @orderby(_.customer_id) |>
    DataFrame
first(result_df, 10)

10×4 DataFrame

Row	customer_id	customer_address	store_address	distance
	String15	String	String	Float64
1	CS001105000001	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979
2	CS001112000009	東京都大田区西馬込**********	東京都大田区仲六郷二丁目	4.02049
3	CS001112000019	東京都大田区昭和島**********	東京都大田区仲六郷二丁目	3.78301
4	CS001112000021	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979
5	CS001112000023	東京都大田区昭和島**********	東京都大田区仲六郷二丁目	3.78301
6	CS001112000024	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979
7	CS001112000029	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979
8	CS001112000030	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979
9	CS001113000004	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979
10	CS001113000010	東京都大田区西六郷**********	東京都大田区仲六郷二丁目	1.47979

名寄データ

J-087: 顧客データ（df_customer）では、異なる店舗での申込みなどにより同一顧客が複数登録されている。名前（customer_name）と郵便番号（postal_cd）が同じ顧客は同一顧客とみなして1顧客1レコードとなるように名寄せした名寄顧客データを作成し、顧客データの件数、名寄顧客データの件数、重複数を算出せよ。ただし、同一顧客に対しては売上金額合計が最も高いものを残し、売上金額合計が同一もしくは売上実績がない顧客については顧客ID（customer_id）の番号が小さいものを残すこととする。

gd = groupby(df_receipt, :customer_id)
df = leftjoin(df_customer, combine(gd, :amount => sum => :sum_amount), on=:customer_id);

gd = groupby(df, [:customer_name, :postal_cd]);

count = 0
df_customer_u = DataFrame()
for dfx in gd
    if nrow(dfx) > 1
        for i in 1:nrow(dfx)
            if ismissing(dfx.sum_amount[i])
                dfx.sum_amount[i] = 0
            end
        end
        count += 1
       dfx =  dfx |>
            @orderby_descending(_.sum_amount) |>
            @thenby(_.customer_id) |>
            @take(1) |>
            DataFrame
        #println(dfx)
    end
    append!(df_customer_u, dfx)
end
println("df_customer   の行数 = ", nrow(df_customer))
println("df_customer_u の行数 = ", nrow(df_customer_u))
println("差 = ", count)
first(df_customer_u, 5)

df_customer   の行数 = 21971
df_customer_u の行数 = 21941
差 = 30

5×12 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd	sum_amount
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15	Int64?
1	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C	5088
2	CS001215000145	田崎美紀	1	女性	1995-03-29	24	144-0055	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2	875
3	CS015414000103	奥野陽子	1	女性	1977-08-09	41	136-0073	東京都江東区北砂**********	S13015	20150722	B-20100609-B	3122
4	CS033513000180	安斎遥	1	女性	1962-07-11	56	241-0823	神奈川県横浜市旭区善部町**********	S14033	20150728	6-20080506-5	868
5	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9	3444

J-088: 087で作成したデータを元に、顧客データに統合名寄IDを付与したデータを作成せよ。ただし、統合名寄IDは以下の仕様で付与するものとする。

重複していない顧客：顧客ID（customer_id）を設定

重複している顧客：前設問で抽出したレコードの顧客IDを設定

顧客IDのユニーク件数と、統合名寄IDのユニーク件数の差も確認すること。

df_customer_n = innerjoin(df_customer, df_customer_u[!, [:customer_id, :customer_name, :postal_cd]], on=[:customer_name, :postal_cd], makeunique=true) |>
    @rename(:customer_id_1 => :integration_id) |>
    DataFrame;
println("件数の差 = $(length(unique(df_customer_n.customer_id)) - length(unique(df_customer_n.integration_id)))")
first(df_customer_n, 10)

件数の差 = 30

10×12 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd	integration_id
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15	String15
1	CS021313000114	大野あや子	1	女性	1981-04-29	37	259-1113	神奈川県伊勢原市粟窪**********	S14021	20150905	0-00000000-0	CS021313000114
2	CS037613000071	六角雅彦	9	不明	1952-04-01	66	136-0076	東京都江東区南砂**********	S13037	20150414	0-00000000-0	CS037613000071
3	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C	CS031415000172
4	CS028811000001	堀井かおり	1	女性	1933-03-27	86	245-0016	神奈川県横浜市泉区和泉町**********	S14028	20160115	0-00000000-0	CS028811000001
5	CS001215000145	田崎美紀	1	女性	1995-03-29	24	144-0055	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2	CS001215000145
6	CS020401000016	宮下達士	0	男性	1974-09-15	44	174-0065	東京都板橋区若木**********	S13020	20150225	0-00000000-0	CS020401000016
7	CS015414000103	奥野陽子	1	女性	1977-08-09	41	136-0073	東京都江東区北砂**********	S13015	20150722	B-20100609-B	CS015414000103
8	CS029403000008	釈人志	0	男性	1973-08-17	45	279-0003	千葉県浦安市海楽**********	S12029	20150515	0-00000000-0	CS029403000008
9	CS015804000004	松谷米蔵	0	男性	1931-05-02	87	136-0073	東京都江東区北砂**********	S13015	20150607	0-00000000-0	CS015804000004
10	CS033513000180	安斎遥	1	女性	1962-07-11	56	241-0823	神奈川県横浜市旭区善部町**********	S14033	20150728	6-20080506-5	CS033513000180

データの分割

J-089: 売上実績がある顧客を、予測モデル構築のため学習用データとテスト用データに分割したい。それぞれ8:2の割合でランダムにデータを分割せよ。

gd = groupby(df_receipt, :customer_id)
df_sales_customer = combine(gd, :amount => sum => :sum_amount);
df = innerjoin(df_customer, df_sales_customer, on=:customer_id)

using Random
n = nrow(df)
index = shuffle(collect(1:n))
df_customer_train = df[index[1:floor(Int, n*0.8)], :]
df_customer_test = df[index[floor(Int, n*0.8)+1:n], :]
println((nrow(df_customer_train)/n, nrow(df_customer_test)/n))
first(df_customer_train, 5)

(0.7999036840837949, 0.20009631591620516)

5×12 DataFrame

Row	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd	sum_amount
	String15	String31	String	String7	Date	Int64	String15	String	String7	String	String15	Int64
1	CS027411000001	長田遥	1	女性	1973-09-14	45	251-0861	神奈川県藤沢市大庭**********	S14027	20150720	A-20100214-A	2382
2	CS023615000111	松井幸子	1	女性	1956-06-23	62	212-0016	神奈川県川崎市幸区南幸町**********	S14023	20150313	8-20091213-6	1504
3	CS040214000008	佐々木由美子	1	女性	1994-05-30	24	194-0001	東京都町田市つくし野**********	S14040	20150811	F-20100901-F	13523
4	CS038615000134	杉山美智子	1	女性	1949-04-03	69	279-0043	千葉県浦安市富士見**********	S13038	20150723	5-20100705-4	608
5	CS011615000108	岡本めぐみ	9	不明	1950-03-06	69	223-0058	神奈川県横浜市港北区新吉田東**********	S14011	20141112	6-20081209-3	898

J-090: レシート明細データ（df_receipt）は2017年1月1日〜2019年10月31日までのデータを有している。売上金額（amount）を月次で集計し、学習用に12ヶ月、テスト用に6ヶ月の時系列モデル構築用データを3セット作成せよ。

temp = copy(df_receipt)
temp.sales_ymd .÷= 100
gd = groupby(temp, :sales_ymd)
df = combine(gd, :amount => sum => :sum_amount) |>
    @rename(:sales_ymd => :sales_ym) |>
    DataFrame
function sampling()
    start12 = rand(1:23, 1)[1]
    start6  = rand(1:29, 1)[1]
    df_train_1 = df[start12:start12+11, :];
    df_test_1  = df[start6:start6+5, :];
    return df_train_1, df_test_1
end
train, test = sampling();
println(train)
println(test)

[1m12×2 DataFrame[0m
[1m Row [0m│[1m sales_ym [0m[1m sum_amount [0m
     │[90m Int64    [0m[90m Int64      [0m
─────┼──────────────────────
   1 │   201705      884010
   2 │   201706      894242
   3 │   201707      959205
   4 │   201708      954836
   5 │   201709      902037
   6 │   201710      905739
   7 │   201711      932157
   8 │   201712      939654
   9 │   201801      944509
  10 │   201802      864128
  11 │   201803      946588
  12 │   201804      937099
[1m6×2 DataFrame[0m
[1m Row [0m│[1m sales_ym [0m[1m sum_amount [0m
     │[90m Int64    [0m[90m Int64      [0m
─────┼──────────────────────
   1 │   201702      764413
   2 │   201703      962945
   3 │   201704      847566
   4 │   201705      884010
   5 │   201706      894242
   6 │   201707      959205

J-091: 顧客データ（df_customer）の各顧客に対し、売上実績がある顧客数と売上実績がない顧客数が1:1となるようにアンダーサンプリングで抽出せよ。

gd = groupby(df_receipt, :customer_id)
df = combine(gd, :amount => sum => :sum_amount)
df2 = leftjoin(df_customer, df, on=:customer_id) |>
    @replacena(0) |>
    @mutate(flag = _.sum_amount > 0) |>
    DataFrame
gd2 = groupby(df2, :flag)
(n1, n2) = (nrow(gd2[1]), nrow(gd2[2]))

using StatsBase
index = sample(1:n1, n2, replace=false)  # 非復元抽出
df_amount_0 = gd2[2];
df_amount_1 = gd2[1][index, :]
nrow(df_amount_0), nrow(df_amount_1)

(8306, 8306)

第三正規形，非正規化

J-092: 顧客データ（df_customer）の性別について、第三正規形へと正規化せよ。

df_customer_std = df_customer |>
    @select(-:gender) |>
    @take(3) |>
    DataFrame

3×10 DataFrame

Row	customer_id	customer_name	gender_cd	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	String15	String31	String	Date	Int64	String15	String	String7	String	String15
1	CS021313000114	大野あや子	1	1981-04-29	37	259-1113	神奈川県伊勢原市粟窪**********	S14021	20150905	0-00000000-0
2	CS037613000071	六角雅彦	9	1952-04-01	66	136-0076	東京都江東区南砂**********	S13037	20150414	0-00000000-0
3	CS031415000172	宇多田貴美子	1	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C

df_gender_std = df_customer |>
    @select(:gender_cd, :gender) |>
    @unique() |>
    @take(3) |>
    DataFrame

3×2 DataFrame

Row	gender_cd	gender
	String	String7
1	1	女性
2	9	不明
3	0	男性

J-093: 商品データ（df_product）では各カテゴリのコード値だけを保有し、カテゴリ名は保有していない。カテゴリデータ（df_category）と組み合わせて非正規化し、カテゴリ名を保有した新たな商品データを作成せよ。

df_product_full = innerjoin(df_product, df_category[!, [:category_small_cd, :category_major_name, :category_medium_name, :category_small_name]], on=:category_small_cd);
first(df_product_full, 3)

3×9 DataFrame

Row	product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	category_major_name	category_medium_name	category_small_name
	String15	String	String	String	Int64?	Int64?	String	String	String
1	P040101001	04	0401	040101	198	149	惣菜	御飯類	弁当類
2	P040101002	04	0401	040101	218	164	惣菜	御飯類	弁当類
3	P040101003	04	0401	040101	230	173	惣菜	御飯類	弁当類

ファイル入出力

J-094: 093で作成したカテゴリ名付き商品データを以下の仕様でファイル出力せよ。

ファイル形式ヘッダ有無文字エンコーディング

CSV（カンマ区切り）有り UTF-8

ファイル出力先のパスは以下のようにすること

出力先

./data

CSV.write("-094.csv", df_product_full)

"-094.csv"

これ以降の問題は重要なものではないのでパスする

J-095: 093で作成したカテゴリ名付き商品データを以下の仕様でファイル出力せよ。

ファイル形式ヘッダ有無文字エンコーディング

CSV（カンマ区切り）有り CP932

ファイル出力先のパスは以下のようにすること。

出力先

./data

J-096: 093で作成したカテゴリ名付き商品データを以下の仕様でファイル出力せよ。

ファイル形式ヘッダ有無文字エンコーディング

CSV（カンマ区切り）無し UTF-8

ファイル出力先のパスは以下のようにすること。

出力先

./data

J-097: 094で作成した以下形式のファイルを読み込み、データを3件を表示させて正しく取り込まれていることを確認せよ。

ファイル形式ヘッダ有無文字エンコーディング

CSV（カンマ区切り）有り UTF-8

J-098: 096で作成した以下形式のファイルを読み込み、データを3件を表示させて正しく取り込まれていることを確認せよ。

ファイル形式ヘッダ有無文字エンコーディング

CSV（カンマ区切り）ヘッダ無し UTF-8

J-099: 093で作成したカテゴリ名付き商品データを以下の仕様でファイル出力せよ。

ファイル形式ヘッダ有無文字エンコーディング

TSV（タブ区切り）有り UTF-8

ファイル出力先のパスは以下のようにすること

出力先

./data

J-100: 099で作成した以下形式のファイルを読み込み、データを3件を表示させて正しく取り込まれていることを確認せよ。

ファイル形式ヘッダ有無文字エンコーディング

TSV（タブ区切り）有り UTF-8

~~# これで１００本終わりです。おつかれさまでした！

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

データサイエンス100本ノック（構造化データ加工編） - Julia/Base/Query

ファイル等の準備

入力したデータフレームの確認

1. 単純な目視確認

2. 必要な項目（列）のみを表示する

行と列の選択

3. 論理演算式による行の選択

4. 部分文字列の有無による選択（正規表現）

データフレームの並べ替え(sort)

5. 列でソート

6. 順位付けをしてソート

件数（行数）の確認

グループ化して処理

7. 合計

8. 最大値

9. 最小値

10. 比較演算，論理演算

11. 平均値

12. 中央値

13. 最頻値

14. 分散

15. 標準偏差

16. パーセンタイル値（第1四分位数，中央値，第3四分位数）

17. 比較演算，論理演算

18. グループ化されたデータフレームごとの集計結果との比較

データフレームの結合(join)

19. innnerjoin()

20. leftjoin()

21. outerjoin()

差分

クロス集計

縦長フォーマット

日付データ

22. 日付型を文字列に変換

23. 文字列を日付型に変換

24. エポック秒を日付型に変換

変数の作成と数式変形

25. ダミー変数の作成

26. 標準化

27. 対数変換

欠損値処理

経過日数，月数，年数

無作為抽出

層化抽出

外れ値

欠損値の補完

位置情報（経緯度と距離）

名寄データ

データの分割

第三正規形，非正規化

ファイル入出力

1.　単純な目視確認

2.　必要な項目（列）のみを表示する

3.　論理演算式による行の選択

4.　部分文字列の有無による選択（正規表現）

5.　列でソート

6.　順位付けをしてソート

7.　合計

8.　最大値

9.　最小値

10.　比較演算，論理演算

11.　平均値

12.　中央値

13.　最頻値

14.　分散

15.　標準偏差

16.　パーセンタイル値（第1四分位数，中央値，第3四分位数）

17.　比較演算，論理演算

18.　グループ化されたデータフレームごとの集計結果との比較

19.　`innnerjoin()`

20.　`leftjoin()`

21.　`outerjoin()`

22.　日付型を文字列に変換

23.　文字列を日付型に変換

24.　エポック秒を日付型に変換

25.　ダミー変数の作成

26.　標準化

27.　対数変換