
Data Science 100 Knocks: The Struggles of a Sub-Beginner, part 11

Last updated at 2020-06-30 / Posted at 2020-06-30

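This is the battle log of a data scientist in the making, working through the 100 Knocks without really understanding what is going on.
Whether I will finish is a mystery. ~~If I vanish partway, please assume I just stopped posting to Qiita.~~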

Article on the 100 Knocks
 Guide to the 100 Knocks

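This post contains spoilers, so take care if you plan to try the problems yourself.

I may not be able to update for a while. Sorry if I disappear.

If something here is hard to read, or a way of writing code is dangerous, please let me know. I will ~~take the damage to my heart and~~ learn from it.

If a solution is wrong or I have misread a problem, please leave a comment as well.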

This time: problems 57 to 62.
[Previous post] 52 to 56
[First post, with table of contents]

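Problem 57

P-057: Combine the extraction result of the previous problem with gender (gender) to create new categorical data representing each gender × age-group combination. The values of the combined categories may be chosen freely. Displaying the first 10 rows is sufficient.

My program, with only the first row of output shown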

mine56.py

df=df_customer.copy()
df_bins=pd.cut(df.age,[10,20,30,40,50,60,150],right=False,labels=[10,20,30,40,50,60])
df=pd.concat([df[['customer_id','birth_day']],df_bins],axis=1)
df.head(10)


>|customer_id|birth_day|age|
>|--:|--:|--:|
>|CS021313000114|1981-04-29|30|

```mine57.py
df=pd.concat([df_customer[['customer_id','birth_day','gender_cd']],df_bins],axis=1)
df['age_gen']=df.gender_cd.astype('str')+df.age.astype('str')
df.head(10)

'''Model answer'''
df_customer_era['era_gender'] = df_customer['gender_cd'] + df_customer_era['age'].astype('str')
df_customer_era.head(10)
```

Since I used pd.concat anyway, I honestly felt there was no real need to reuse the previous result.

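Note that the goal this time is to prefix the 30 in this age column with a digit representing gender,
so that 1 (female) + 30 (age group) = 130,
joined as strings rather than added as numbers.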

Not realizing that, what I first wrote was:

miss57.py

df=pd.concat([df_customer[['customer_id','birth_day','gender_cd']],df_bins],axis=1)
df=df.groupby(['age','gender_cd']).agg({'customer_id':'count'})
pd.pivot_table(df,index='age',columns='gender_cd')

~~I cross-tabulated without thinking~~
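As a rough illustration of the "1 + 30 = 130" idea, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical, and I am assuming `gender_cd` is stored as a string code (0, 1 or 9), as the model answer's plain `+` suggests.

```python
import pandas as pd

# Hypothetical toy data; the real df_customer has many more rows and columns.
toy = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "gender_cd": ["0", "1", "9", "1"],
    "age": [25, 30, 67, 19],
})

# Left-closed bins [10, 20), [20, 30), ..., [60, 150), labelled by decade,
# using the same pd.cut arguments as mine56.py.
decade = pd.cut(toy["age"], [10, 20, 30, 40, 50, 60, 150],
                right=False, labels=[10, 20, 30, 40, 50, 60])

# String concatenation, not numeric addition: "1" + "30" gives "130".
toy["age_gen"] = toy["gender_cd"].astype(str) + decade.astype(str)
print(toy)
```

Because both sides are converted to strings first, the result is a category label rather than a number, which is all the problem asks for.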

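Problem 58

P-058: Convert the gender code (gender_cd) of the customer data frame (df_customer) into dummy variables and extract them together with the customer ID (customer_id). Displaying 10 rows is sufficient.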

mine58.py

df=df_customer.copy()
pd.concat([df['customer_id'],pd.get_dummies(df['gender_cd'])],axis=1).head(10)

'''Model answer'''
pd.get_dummies(df_customer[['customer_id', 'gender_cd']], columns=['gender_cd']).head(10)

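I wondered what a dummy variable even is, so I looked it up.
Apparently you create one column per category value in the header row, and each cell holds true or false to show whether the row has that value.

Honestly, looking at a table is quicker: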

|Male|Female|Unknown|
|:--:|:--:|:--:|
|0|1|0|
|0|0|1|
|0|1|0|
|0|1|0|
|0|1|0|

That is the idea.
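To make the table concrete, here is a small sketch with hypothetical data (the `toy` frame is made up). It passes `dtype=int` because, as far as I know, recent pandas versions return True/False columns by default, while the table above uses 0/1.

```python
import pandas as pd

# Hypothetical toy data standing in for df_customer.
toy = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "gender_cd": ["1", "9", "0"],
})

# One new column per distinct gender_cd value; customer_id is left untouched.
dummies = pd.get_dummies(toy, columns=["gender_cd"], dtype=int)
print(dummies)

# Each row has exactly one 1 among the dummy columns.
print(dummies.drop(columns="customer_id").sum(axis=1).tolist())
```

The new columns are named from the original column name plus the category value (`gender_cd_0`, `gender_cd_1`, `gender_cd_9`).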

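Problem 59

P-059: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) per customer ID (customer_id), standardize the summed sales amount to mean 0 and standard deviation 1, and display it together with the customer ID and the total sales amount. The standard deviation used for standardization may be either the unbiased standard deviation (divide by n-1) or the sample standard deviation (divide by n). However, customer IDs starting with "Z" represent non-members and must be excluded from the calculation. Displaying 10 rows is sufficient.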

…

……

………

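What even is standardization?

I read all sorts of sites trying to understand it, but ~~having slacked off on math back in the day,~~ I could not keep up,

so I asked a senior colleague, who pointed me to a helpful website. Thinking "maybe like this?", I wrote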

df['hyou1'] =df['amount_sum'] - df.amount_sum.mean()
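* Meant as (total - mean) / 1 (standard deviation)

and got it wrong.

Trying somehow to understand, I started working backwards by looking up the preprocessing.scale used in the model answer, and found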

https://note.nkmk.me/python-list-ndarray-dataframe-normalize-standardize/ (latter half)

Normalizing and standardizing pandas.DataFrame and pandas.Series
Using pandas methods
(snip)

In the program:
print( (df.T - df.T.mean()) / df.T.std() )
# col1 col2 col3
# a -1.0 0.0 1.0
# b -1.0 0.0 1.0
# c -1.0 0.0 1.0

So this is it!!!
In other words,
(data - .mean()) / .std()
is what I need!

mine59.py

df=df_receipt.copy()
df=df.query('not customer_id.str.startswith("Z")',engine='python')
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()

df['hyou1'] =(df['amount'] - df.amount.mean()) / df.amount.std()
df.head(10)

'''Model answer'''
# Uses sklearn's preprocessing.scale, so the sample standard deviation (divide by n) is used
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_ss'] = preprocessing.scale(df_sales_amount['amount'])
df_sales_amount.head(10)

The results matched.
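One detail worth spelling out: pandas' `Series.std()` defaults to `ddof=1` (divide by n - 1), while the comment in the model answer says `preprocessing.scale` uses the divide-by-n version. The sketch below uses hypothetical totals (not values from df_receipt) to show both, and also why subtracting the mean alone, as in the first attempt, is not enough.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer totals, not values from df_receipt.
amount = pd.Series([100.0, 200.0, 300.0, 400.0])

# Subtracting the mean alone centres the data; the spread is unchanged.
centered = amount - amount.mean()

# pandas: Series.std() defaults to ddof=1 (divide by n - 1).
z_unbiased = centered / amount.std()

# ddof=0 (divide by n), the convention the model answer's comment refers to.
z_population = centered / amount.std(ddof=0)

# The two results differ only by the constant factor sqrt(n / (n - 1)).
n = len(amount)
print(np.sqrt(n / (n - 1)))
print((z_population / z_unbiased).tolist())
```

With only four rows the factor is noticeable, but it approaches 1 as n grows. With thousands of customers the two versions would presumably agree to the first few displayed digits, which may be why the outputs looked identical.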

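Problem 60

P-060: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) per customer ID (customer_id), normalize the summed sales amount to a minimum of 0 and a maximum of 1, and display it together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members and must be excluded from the calculation. Displaying 10 rows is sufficient.

On the same site,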

print((df - df.min()) / (df.max() - df.min()))
# col1 col2 col3
# a 0.0 0.0 0.0
# b 0.5 0.5 0.5
# c 1.0 1.0 1.0

is shown, so I reused it.

mine60.py

df=df_receipt.copy()
df=df.query('not customer_id.str.startswith("Z")',engine='python')
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df['minmax'] =(df['amount'] - df.amount.min()) / (df.amount.max()-df.amount.min())
df.head(10)

'''Model answer'''
# Uses sklearn's preprocessing.minmax_scale
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_mm'] = preprocessing.minmax_scale(df_sales_amount['amount'])
df_sales_amount.head(10)
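For a quick sanity check of the min-max formula, here is a tiny worked example on hypothetical totals (not values from df_receipt). The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.

```python
import pandas as pd

# Hypothetical per-customer totals.
amount = pd.Series([500, 1500, 2500, 10500])

# (x - min) / (max - min): here min = 500 and max - min = 10000.
scaled = (amount - amount.min()) / (amount.max() - amount.min())
print(scaled.tolist())
```

Note how the single large value pushes the other three close to 0. Min-max scaling keeps the shape of the distribution, so one big spender compresses everyone else.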

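Problems 61 and 62

P-061: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) per customer ID (customer_id), take the common logarithm (base 10) of the summed sales amount, and display it together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members and must be excluded from the calculation. Displaying 10 rows is sufficient.

P-062: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) per customer ID (customer_id), take the natural logarithm (base e) of the summed sales amount, and display it together with the customer ID and the total sales amount. However, customer IDs starting with "Z" represent non-members and must be excluded from the calculation. Displaying 10 rows is sufficient.

Taking the logarithm just means applying a log function, so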

mine61_62.py

import math

df=df_receipt.copy()
df=df.query('not customer_id.str.startswith("Z")',engine='python')
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()

# P-061: common logarithm (base 10)
df['jouyou']=df.amount.apply(lambda x: math.log10(x))
# P-062: natural logarithm (base e)
df['shizen']=df.amount.apply(lambda x: math.log(x))

df.head(10)

gets the result.
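As a side note, applying `math.log10` row by row and calling NumPy's vectorized `np.log10` on the whole column should give the same numbers. A minimal sketch on hypothetical values:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical totals chosen so the logs are easy to check by eye.
amount = pd.Series([1, 10, 1000])

# Row-by-row, as in mine61_62.py.
via_apply = amount.apply(math.log10)

# Vectorized: NumPy applies the function to the whole Series at once.
via_numpy = np.log10(amount)

print(via_apply.tolist())
print(via_numpy.tolist())
```

The vectorized form is usually the more idiomatic choice in pandas code, and it is what the model answer uses.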

mohan61_62.py

# P-061: common logarithm (base 10) of amount + 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_log10'] = np.log10(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)

# P-062: natural logarithm (base e) of amount + 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
    groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_loge'] = np.log(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)

………

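What is the +1 for?

That is all for this time.

Logarithms are where high school math started to stop making sense to me, so I am truly stumped.
If anyone understands this +1, please leave a comment. I really do not get it.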
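I cannot say for certain why the model answer adds 1, but a common reason for this pattern is that the logarithm of 0 is undefined (NumPy returns `-inf`), whereas log(0 + 1) = 0. The sketch below, on hypothetical values, shows the difference. `np.log1p` is NumPy's built-in for log(x + 1) with the natural log.

```python
import numpy as np

# Hypothetical amounts, including a zero to show the problem case.
amounts = np.array([0.0, 9.0, 99.0])

# With the shift, 0 maps to 0 instead of blowing up.
shifted = np.log10(amounts + 1)

# np.log1p(x) computes the natural log of (x + 1).
natural = np.log1p(amounts)

# Without the shift, log10(0) is -inf; the warning is silenced here on purpose.
with np.errstate(divide="ignore"):
    raw = np.log10(amounts)

print(shifted.tolist())
print(raw[0])
```

If every customer total is positive, the version without +1 also runs fine. The values just come out slightly different from the model answer.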
