More than 3 years have passed since last update.

【pandas】Pythonで営業データ分析をやってみた【初心者向け】

Last updated at 2020-05-14Posted at 2020-04-15

海外のyoutuberのKeith Galliという方のPythonのレクチャー動画を見てワークをやってみました。

【PythonのPandasライブラリを用いて現実のデータサイエンス課題を解いてみよう！】
Solving real world data science tasks with Python Pandas!
難易度は高くなく、平易な英語なので是非ご覧ください。

1時間半ほどの動画です。非常に丁寧な説明です。
Googleでstackoverflow(海外版プログラミングQ&Aサイト)で、
解決法を調べる過程なども見せてくれています。

CSVのエクセルデータを用いて分析するタスクになります。
データは彼のGithubからダウンロード出来ます。

Githubはこちらから

私はjupyter notebookで実行しています。

コードに補足を日本語で入れながらやっています。

動画をみて、実際に触ってみたらPandasの練習に最適かと思います。

この記事に目を通すだけでもだいたい雰囲気はつかめると思います。

データの読み込みからデータフレーム作成

pandas,osライブラリを読み込む

import pandas as pd
import os

まずどんなデータが入っているかを確認します(ファイル名一覧取得)

files = [file for file in os.listdir("Sales_Data")]
for file in files:
    print(file)

Sales_April_2019.csv
Sales_August_2019.csv
Sales_December_2019.csv
Sales_February_2019.csv
Sales_January_2019.csv
Sales_July_2019.csv
Sales_June_2019.csv
Sales_March_2019.csv
Sales_May_2019.csv
Sales_November_2019.csv
Sales_October_2019.csv
Sales_September_2019.csv

ファイル名から2019年1-12月までのデータが入っているようです、分析するためにデータをひとつにまとめたいです

まず、すべてのデータを格納するための空のデータフレームを作成します

all_months_data = pd.DataFrame()

すべての月データを合体させるために、それぞれの月のcsvファイルを空データフレームに次々にいれていきます

for file in files:
    df = pd.read_csv("Sales_Data/" + file)
    all_months_data = pd.concat([all_months_data,df])

さて、データの最初の5行はこんな感じです

all_months_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
1	NaN	NaN	NaN	NaN	NaN	NaN
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001

合体したデータを出力します (このとき通し番号はいらないので、引数にindex=Falseを加えます)

all_months_data.to_csv("all_data.csv",index=False)

出力したデータを確認出来たら、合体したデータを改めて読み込んでみます

all_data = pd.read_csv("all_data.csv")
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
1	NaN	NaN	NaN	NaN	NaN	NaN
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001

ワーク１：一番売り上げがあった月はいつで、どれぐらい稼いだでしょうか？

カラムをみてみると、Order Date＝注文日の列はありますが、月だけのデータはないですので、新しく作りましょう

注文日の列の最初の2文字が月に当たるデータになりそうです

all_data['Month'] = all_data['Order Date'].str[0:2]

文字列化したので、もう一度数値に戻しておきましょう

all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-77-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
      2 all_data.head()


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects


pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()


ValueError: cannot convert float NaN to integer

エラーが出てしまいました

エラー文を読むと、NaNを数値化することは出来ないよと言っています

is_na 関数とany関数を組み合わせると、１つでもNaNがあればTrueと返してくれます

axis = 1は行ごとに処理を返しくれる関数です

NaNを含む行をすべて返してもらいます

nan_df = all_data[all_data.isna().any(axis=1)]
nan_df

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
356	NaN	NaN	NaN	NaN	NaN	NaN	NaN
735	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1433	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1553	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...
185176	NaN	NaN	NaN	NaN	NaN	NaN	NaN
185438	NaN	NaN	NaN	NaN	NaN	NaN	NaN
186042	NaN	NaN	NaN	NaN	NaN	NaN	NaN
186548	NaN	NaN	NaN	NaN	NaN	NaN	NaN
186826	NaN	NaN	NaN	NaN	NaN	NaN	NaN

545 rows × 7 columns

545行もありました

NaNを含むデータを消しましょう

all_data = all_data.dropna(how='all')
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	04
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	04
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	04
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	04
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	04

よしでは、先ほどの月を数値に戻すもう一度実行してみましょう

all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-80-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
      2 all_data.head()


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects


pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()


ValueError: invalid literal for int() with base 10: 'Or'

次は、また新しいエラーが出ています

'Or'という文言が含まれていたっぽいですね

'Or'が含まれている行を見てみましょう

temp_df = all_data[all_data['Order Date'].str[0:2] == "Or"]
temp_df.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month
519	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Or
1149	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Or
1155	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Or
2878	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Or
2893	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Or

"Or"の正体は"Order Date"だったみたいですね。

"Order Date"という文言が含まれたデータ以外を抜き出してデータをリフレッシュしましょう

all_data = all_data[all_data['Order Date'].str[0:2] != "Or"]
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	04
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	04
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	04
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	04
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	04

では3度目の正直で、数値化を実行してみましょう

all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4

いけました

次に、Quantity Ordered と　Price Each　はあるものの肝心の売り上げがありません

PQ(売り上げ) = P(価格) × Q(数量) なので PQに当たる列を追加します

all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in na_op(x, y)
    967         try:
--> 968             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    969         except TypeError:


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    220     if use_numexpr:
--> 221         return _evaluate(op, op_str, a, b, **eval_kwargs)
    222     return _evaluate_standard(op, op_str, a, b)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    126     if result is None:
--> 127         result = _evaluate_standard(op, op_str, a, b)
    128 


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     69     with np.errstate(all="ignore"):
---> 70         return op(a, b)
     71 


TypeError: can't multiply sequence by non-int of type 'str'


During handling of the above exception, another exception occurred:


TypeError                                 Traceback (most recent call last)

<ipython-input-84-7c1e2b69cbe2> in <module>
----> 1 all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
   1046 
   1047         with np.errstate(all="ignore"):
-> 1048             result = na_op(lvalues, rvalues)
   1049         return construct_result(
   1050             left, result, index=left.index, name=res_name, dtype=None


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in na_op(x, y)
    968             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    969         except TypeError:
--> 970             result = masked_arith_op(x, y, op)
    971 
    972         return missing.dispatch_fill_zeros(op, x, y, result)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
    445         if mask.any():
    446             with np.errstate(all="ignore"):
--> 447                 result[mask] = op(xrav[mask], com.values_from_object(yrav[mask]))
    448 
    449     else:


TypeError: can't multiply sequence by non-int of type 'str'

またしても、文字列が入っていました

とりあえず計算が出来る形に書き換えてあげたいので、"to_numeric"関数を使います

all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

もう一回やってみます

all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']

all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99

出ましたね。やっと問いに対する答えが出そうです

一応、問いを再掲しておきます。

ワーク１：一番売り上げがあった月はいつで、どれぐらい稼いだでしょうか？

all_data.groupby('Month').sum()

	Quantity Ordered	Price Each	Sales
Month
1	10903	1.811768e+06	1.822257e+06
2	13449	2.188885e+06	2.202022e+06
3	17005	2.791208e+06	2.807100e+06
4	20558	3.367671e+06	3.390670e+06
5	18667	3.135125e+06	3.152607e+06
6	15253	2.562026e+06	2.577802e+06
7	16072	2.632540e+06	2.647776e+06
8	13448	2.230345e+06	2.244468e+06
9	13109	2.084992e+06	2.097560e+06
10	22703	3.715555e+06	3.736727e+06
11	19798	3.180601e+06	3.199603e+06
12	28114	4.588415e+06	4.613443e+06

パット見分からないので図示しましょう

図示する用のライブラリをインストールします

import matplotlib.pyplot as plt

X軸用の範囲も決めながらグラフ化します

months = range(1,13)
results = all_data.groupby('Month').sum()
plt.bar(months,results['Sales'])
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

以上がワーク1でした。

ワーク２：どの都市が一番の売り上げだったか

売り上げの列を作った今、簡単そうなワークですが、やってみましょう

その前に、改めて、データがどんなものだったかを確認します

all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99

Purchase Addressの部分のデータの一部を抜き取って、集計する感じでしょうかね

ではSalesの時と同様に、City列を作りましょう

Addressの一部を抜き出さないといけないですが、Adressのデータをみると、"番地,都市名,郵便番号"という感じで、２個目の都市名だけを取り出したいです。

split関数で","で分けること、とそれを同じ列のすべての列で適用するためにapply関数を使いましょう

all_data['City'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90	Dallas
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99	Boston
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles

Cityだけうまくわけられましたね！

さて、Cityで集計する際に、ここでひとつ考えないといけない問題があります。

それは、Cityといっても、同じ名前のCityだけど他の国にも同じ名前があるケースです。これを考慮しないと、合算されてしまい意図しない集計結果が出てしまいます。(例：オレゴン州とメイン州の両方にあるポートランド)

とはいえ、国名として新しく列を作る必要はありません。集計するだけなので、Cityの列のセルにState(2文字)も収めるだけでいいです。今回は、City(State)という形にします。 ex: Dallas(TX)

さきほど、lambda関数で一行で書いてしまいましたが、せっかくなので、都市名、国名をそれぞれ取得する関数を作りましょう。読み手にも親切です。

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]
# Purchaseアドレスのデータはコンマ(,)のあと空白(" ")があるので、合体したときに不自然にならないようにsplitで削る

all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)} ({get_state(x)})")
all_data.drop("State",axis=1)

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90	Dallas (TX)
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99	Boston (MA)
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)
...	...	...	...	...	...	...	...	...	...
186845	259353	AAA Batteries (4-pack)	3	2.99	09/17/19 20:56	840 Highland St, Los Angeles, CA 90001	9	8.97	Los Angeles (CA)
186846	259354	iPhone	1	700.00	09/01/19 16:00	216 Dogwood St, San Francisco, CA 94016	9	700.00	San Francisco (CA)
186847	259355	iPhone	1	700.00	09/23/19 07:39	220 12th St, San Francisco, CA 94016	9	700.00	San Francisco (CA)
186848	259356	34in Ultrawide Monitor	1	379.99	09/19/19 17:30	511 Forest St, San Francisco, CA 94016	9	379.99	San Francisco (CA)
186849	259357	USB-C Charging Cable	1	11.95	09/30/19 00:18	250 Meadow St, San Francisco, CA 94016	9	11.95	San Francisco (CA)

185950 rows × 9 columns

OKです、あとは普通に集計すればいいだけですね

all_data.groupby('City').sum()

	Quantity Ordered	Price Each	Month	Sales
City
Atlanta (GA)	16602	2.779908e+06	104794	2.795499e+06
Austin (TX)	11153	1.809874e+06	69829	1.819582e+06
Boston (MA)	22528	3.637410e+06	141112	3.661642e+06
Dallas (TX)	16730	2.752628e+06	104620	2.767975e+06
Los Angeles (CA)	33289	5.421435e+06	208325	5.452571e+06
New York City (NY)	27932	4.635371e+06	175741	4.664317e+06
Portland (ME)	2750	4.471893e+05	17144	4.497583e+05
Portland (OR)	11303	1.860558e+06	70621	1.870732e+06
San Francisco (CA)	50239	8.211462e+06	315520	8.262204e+06
Seattle (WA)	16553	2.733296e+06	104941	2.747755e+06

San Franciscoが一位です。あと、Portlandもちゃんとわけられていることがわかります。

さっきのmonthで使ったコードをアレンジして

results = all_data.groupby('City').sum()
cities = [city for city, df in all_data.groupby('City')]
# x軸ラベル用のcity：上記でgroubbyしたときと同じ並び順にしないと数字とラベルがバラバラになる
plt.bar(cities,results['Sales'])
plt.xticks(cities,rotation="vertical")
# citiesをそのまま表示したらかさばってしまうので垂直表示に
plt.ylabel('Sales')
plt.xlabel('City name')
plt.show()

できました！

ワーク３：顧客が商品を買う見込みを最大化するためにいつ・どの時間に広告を打つ(ディスプレイする)べきでしょうか？

一気にワークがデータサイエンスっぽくなりましたね (ただ、結論を先に言うとそこまで厳密なロジックではないです)

さて、いつも通り、データを振り返ってみましょう

all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	State
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	4	23.90	Dallas (TX)	TX 75001
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	4	99.99	Boston (MA)	MA 02215
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)	CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001

おそらく、Order Dataの列とSales列の関連性をみるとよさそうです

Order Dateを日付データとして扱いやすいようにデータの型を変えてみましょう

all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	State
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	4	23.90	Dallas (TX)	TX 75001
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	4	99.99	Boston (MA)	MA 02215
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)	CA 90001
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001

データの形式が変わりましたね

Hour,Minuteの列を作りましょう

MonthやSalesで列を作るときは文字列だったので、Splitなどを使っていましたが、先ほどデータの型を変えたおかげで

all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	State	Hour	Minute
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	4	23.90	Dallas (TX)	TX 75001	8	46
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	4	99.99	Boston (MA)	MA 02215	22	30
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)	CA 90001	14	38
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001	14	38
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001	9	27

一旦データを把握するために可視化しましょう

hours = [hour for hour, df in all_data.groupby('Hour')]
plt.plot(hours,all_data.groupby(['Hour']).count())
#時間をX軸に、時間ごとにOrder数(行数)を集計する
plt.xticks(hours)
plt.grid()
#出来るだけ時間ごとの明確な傾向が見たいのでGridをつける
plt.xlabel('Hour')
plt.ylabel('Orders')
plt.show()

ピークは、AM:11とPM:7ぐらいですね。したがって、最もオーダー数(客数)が多いこの時間帯に広告をディスプレイするのが良さそうです。

ワーク4: どの商品が最も一緒にセットで買われているでしょうか？

all_data.head()

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	State	Hour	Minute
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	4	23.90	Dallas (TX)	TX 75001	8	46
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	4	99.99	Boston (MA)	MA 02215	22	30
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)	CA 90001	14	38
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001	14	38
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001	9	27

OrderIDが一緒のものでグループ分けすればわかりそうですね

かぶっている値は、duplicated関数を使うと楽にソート出来ます。

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df.head(20)

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	State	Hour	Minute
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)	CA 90001	14	38
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	11.99	Los Angeles (CA)	CA 90001	14	38
18	176574	Google Phone	1	600.00	2019-04-03 19:42:00	20 Hill St, Los Angeles, CA 90001	4	600.00	Los Angeles (CA)	CA 90001	19	42
19	176574	USB-C Charging Cable	1	11.95	2019-04-03 19:42:00	20 Hill St, Los Angeles, CA 90001	4	11.95	Los Angeles (CA)	CA 90001	19	42
30	176585	Bose SoundSport Headphones	1	99.99	2019-04-07 11:31:00	823 Highland St, Boston, MA 02215	4	99.99	Boston (MA)	MA 02215	11	31
31	176585	Bose SoundSport Headphones	1	99.99	2019-04-07 11:31:00	823 Highland St, Boston, MA 02215	4	99.99	Boston (MA)	MA 02215	11	31
32	176586	AAA Batteries (4-pack)	2	2.99	2019-04-10 17:00:00	365 Center St, San Francisco, CA 94016	4	5.98	San Francisco (CA)	CA 94016	17	0
33	176586	Google Phone	1	600.00	2019-04-10 17:00:00	365 Center St, San Francisco, CA 94016	4	600.00	San Francisco (CA)	CA 94016	17	0
119	176672	Lightning Charging Cable	1	14.95	2019-04-12 11:07:00	778 Maple St, New York City, NY 10001	4	14.95	New York City (NY)	NY 10001	11	7
120	176672	USB-C Charging Cable	1	11.95	2019-04-12 11:07:00	778 Maple St, New York City, NY 10001	4	11.95	New York City (NY)	NY 10001	11	7
129	176681	Apple Airpods Headphones	1	150.00	2019-04-20 10:39:00	331 Cherry St, Seattle, WA 98101	4	150.00	Seattle (WA)	WA 98101	10	39
130	176681	ThinkPad Laptop	1	999.99	2019-04-20 10:39:00	331 Cherry St, Seattle, WA 98101	4	999.99	Seattle (WA)	WA 98101	10	39
138	176689	Bose SoundSport Headphones	1	99.99	2019-04-24 17:15:00	659 Lincoln St, New York City, NY 10001	4	99.99	New York City (NY)	NY 10001	17	15
139	176689	AAA Batteries (4-pack)	2	2.99	2019-04-24 17:15:00	659 Lincoln St, New York City, NY 10001	4	5.98	New York City (NY)	NY 10001	17	15
189	176739	34in Ultrawide Monitor	1	379.99	2019-04-05 17:38:00	730 6th St, Austin, TX 73301	4	379.99	Austin (TX)	TX 73301	17	38
190	176739	Google Phone	1	600.00	2019-04-05 17:38:00	730 6th St, Austin, TX 73301	4	600.00	Austin (TX)	TX 73301	17	38
225	176774	Lightning Charging Cable	1	14.95	2019-04-25 15:06:00	372 Church St, Los Angeles, CA 90001	4	14.95	Los Angeles (CA)	CA 90001	15	6
226	176774	USB-C Charging Cable	1	11.95	2019-04-25 15:06:00	372 Church St, Los Angeles, CA 90001	4	11.95	Los Angeles (CA)	CA 90001	15	6
233	176781	iPhone	1	700.00	2019-04-03 07:37:00	976 Hickory St, Dallas, TX 75001	4	700.00	Dallas (TX)	TX 75001	7	37
234	176781	Lightning Charging Cable	1	14.95	2019-04-03 07:37:00	976 Hickory St, Dallas, TX 75001	4	14.95	Dallas (TX)	TX 75001	7	37

同じIDのものが連続で並ぶデータになりました

上のデータセットは、IDがかぶっているオーダー(≒セット買いしたオーダー以外なし)です

IDでグループ分けしたうえで、そのIDで購入したオーダーが格納された列を作ります。(例 ID:1111 → 注文: Apple,Banana,Orange)

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
#transformはapply関数のようなものです。一旦ここでは説明省略します。
df = df[['Order ID','Grouped']].drop_duplicates()
df.head(10)

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

	Order ID	Grouped
3	176560	Google Phone,Wired Headphones
18	176574	Google Phone,USB-C Charging Cable
30	176585	Bose SoundSport Headphones,Bose SoundSport Hea...
32	176586	AAA Batteries (4-pack),Google Phone
119	176672	Lightning Charging Cable,USB-C Charging Cable
129	176681	Apple Airpods Headphones,ThinkPad Laptop
138	176689	Bose SoundSport Headphones,AAA Batteries (4-pack)
189	176739	34in Ultrawide Monitor,Google Phone
225	176774	Lightning Charging Cable,USB-C Charging Cable
233	176781	iPhone,Lightning Charging Cable

良い感じです。データがうまい具合にまとまってます。

このGrouped列で"どれが、どれと、どんな組み合わせで、何回ずつ"などを調べなければいけません

便利なライブラリを使いながら集計しましょう。ライブラリの具体的なイメージは以下のstackoverflowをご覧ください。

※inputとoutputのコードを眺めてだいたいこういうことか程度でOKです
https://stackoverflow.com/questions/56187042/python-itertools-make-combinations-with-sum/56187946#56187946

from itertools import combinations
from collections import Counter

count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list,2)))
    
count.most_common(10)

[(('iPhone', 'Lightning Charging Cable'), 1005),
 (('Google Phone', 'USB-C Charging Cable'), 987),
 (('iPhone', 'Wired Headphones'), 447),
 (('Google Phone', 'Wired Headphones'), 414),
 (('Vareebadd Phone', 'USB-C Charging Cable'), 361),
 (('iPhone', 'Apple Airpods Headphones'), 360),
 (('Google Phone', 'Bose SoundSport Headphones'), 220),
 (('USB-C Charging Cable', 'Wired Headphones'), 160),
 (('Vareebadd Phone', 'Wired Headphones'), 143),
 (('Lightning Charging Cable', 'Wired Headphones'), 92)]

ちょっと出力が汚いので、微修正します

for key,value in count.most_common(10):
    print(key,value)

('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
('iPhone', 'Apple Airpods Headphones') 360
('Google Phone', 'Bose SoundSport Headphones') 220
('USB-C Charging Cable', 'Wired Headphones') 160
('Vareebadd Phone', 'Wired Headphones') 143
('Lightning Charging Cable', 'Wired Headphones') 92

iPhoneとチャージケーブルが一番多いですね

ワーク５(Final Work)：どの商品が一番売れているでしょうか。なぜそれが一番売れていると思いますか？

最後のワークです。自分で考える(仮説の部分)が入ってきましたね。
データを可視化しましょう。

数量ベースで考えます

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']
products = [product for product,df in product_group]
plt.bar(products,quantity_ordered)
plt.xlabel('Product')
plt.ylabel('Quantity Ordered')
plt.xticks(products,rotation="vertical",size=8)
plt.show()

バッテリー・ケーブルとかが多いですかね。

仮に値段が安いからと推測します。値段と売り上げを調べます。

prices = all_data.groupby('Product').mean()['Price Each']

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
# x軸共通でy軸の値が異なるグラフ生成
ax1.bar(products,quantity_ordered)
ax2.plot(products,prices,'b-')

ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quanity Ordered', color="g")
ax2.set_ylabel('Price', color="b")
ax1.set_xticklabels(products,rotation="vertical",size=8)

plt.show()

以上で終了です。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up