【pandas】Trying out sales data analysis in Python【for beginners】

Posted 2020-04-15

I watched a Python lecture video by the YouTuber Keith Galli and worked through the exercises in it.

The video is "Solving real world data science tasks with Python Pandas!"
It isn't very difficult and the English is plain, so do give it a watch.

It's about an hour and a half long, and the explanations are very thorough. He even shows the process of Googling for solutions on Stack Overflow (the programming Q&A site).

The task is to analyze sales data provided as CSV files. The data can be downloaded from his GitHub.

The GitHub repository is linked here.

I ran everything in a Jupyter notebook.

I added supplementary notes to the code as I went along.

If you watch the video and then actually try it yourself, I think it's excellent pandas practice.

Even just skimming this article should give you a rough feel for it.

Loading the data and creating a DataFrame

Import the pandas and os libraries.

import pandas as pd
import os

First, let's check what data we have (get a list of the file names).

files = [file for file in os.listdir("Sales_Data")]
for file in files:
    print(file)
Sales_April_2019.csv
Sales_August_2019.csv
Sales_December_2019.csv
Sales_February_2019.csv
Sales_January_2019.csv
Sales_July_2019.csv
Sales_June_2019.csv
Sales_March_2019.csv
Sales_May_2019.csv
Sales_November_2019.csv
Sales_October_2019.csv
Sales_September_2019.csv
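
One small safeguard I would add here (not part of the original video): os.listdir can also pick up hidden files such as .DS_Store, so filtering for CSV files keeps the later loop robust.

files = [file for file in os.listdir("Sales_Data") if file.endswith(".csv")]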

From the file names, it looks like we have data for January through December 2019. To analyze it, we want to combine everything into a single dataset.

First, create an empty DataFrame that will hold all of the data.

all_months_data = pd.DataFrame()

To merge all the monthly data, read each month's CSV and concatenate it onto that empty DataFrame, one file after another.

for file in files:
    df = pd.read_csv("Sales_Data/" + file)
    all_months_data = pd.concat([all_months_data,df])
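
As an aside, the same merge can be written as a single concat over a list comprehension; this is just an equivalent sketch, not the code used in the video.

all_months_data = pd.concat([pd.read_csv("Sales_Data/" + file) for file in files])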

The first five rows of the combined data look like this:

all_months_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001
1 NaN NaN NaN NaN NaN NaN
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

Write the combined data out to a CSV (we don't need the running index, so pass index=False).

all_months_data.to_csv("all_data.csv",index=False)

Once the output file looks right, read the combined data back in.

all_data = pd.read_csv("all_data.csv")
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001
1 NaN NaN NaN NaN NaN NaN
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

Task 1: What was the best month for sales, and how much was earned that month?

Looking at the columns, there is an Order Date column but nothing holding just the month, so let's create one.

The first two characters of Order Date look like they correspond to the month.

all_data['Month'] = all_data['Order Date'].str[0:2]

That slice is a string, so let's convert it back to a number.

all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-77-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
      2 all_data.head()


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects


pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()


ValueError: cannot convert float NaN to integer

We got an error.

Reading the message, it says a float NaN cannot be converted to an integer.

Combining isna() with any() returns True for any row that contains even one NaN.

Passing axis=1 makes the check run row by row.

Let's pull out every row that contains a NaN.

nan_df = all_data[all_data.isna().any(axis=1)]
nan_df
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
1 NaN NaN NaN NaN NaN NaN NaN
356 NaN NaN NaN NaN NaN NaN NaN
735 NaN NaN NaN NaN NaN NaN NaN
1433 NaN NaN NaN NaN NaN NaN NaN
1553 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
185176 NaN NaN NaN NaN NaN NaN NaN
185438 NaN NaN NaN NaN NaN NaN NaN
186042 NaN NaN NaN NaN NaN NaN NaN
186548 NaN NaN NaN NaN NaN NaN NaN
186826 NaN NaN NaN NaN NaN NaN NaN

545 rows × 7 columns

There were as many as 545 such rows.

Let's drop the rows that are all NaN.

all_data = all_data.dropna(how='all')
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 04
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 04
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 04

Good. Now let's retry converting the Month column to a number.

all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-80-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
      2 all_data.head()


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects


pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()


ValueError: invalid literal for int() with base 10: 'Or'

This time a different error appears.

It seems the string 'Or' was in the data somewhere.

Let's look at the rows that contain 'Or'.

temp_df = all_data[all_data['Order Date'].str[0:2] == "Or"]
temp_df.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
519 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
1149 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
1155 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
2878 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
2893 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or

"Or"の正体は"Order Date"だったみたいですね。

"Order Date"という文言が含まれたデータ以外を抜き出してデータをリフレッシュしましょう

all_data = all_data[all_data['Order Date'].str[0:2] != "Or"]
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 04
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 04
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 04

Third time's the charm: let's run the numeric conversion again.

all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4

It worked.
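
For reference, here is how the whole clean-up could be condensed once you know what is in the data. This is my own recap assuming the same all_data.csv, not a step from the video.

all_data = pd.read_csv("all_data.csv")
all_data = all_data.dropna(how='all')                         # drop the completely empty rows
all_data = all_data[all_data['Order Date'] != 'Order Date']   # drop the stray header rows
all_data['Month'] = all_data['Order Date'].str[0:2].astype('int32')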

Next: we have Quantity Ordered and Price Each, but no column for the actual sales.

Sales (PQ) = Price (P) × Quantity (Q), so let's add a PQ column.

all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each'] 
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in na_op(x, y)
    967         try:
--> 968             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    969         except TypeError:


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    220     if use_numexpr:
--> 221         return _evaluate(op, op_str, a, b, **eval_kwargs)
    222     return _evaluate_standard(op, op_str, a, b)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    126     if result is None:
--> 127         result = _evaluate_standard(op, op_str, a, b)
    128 


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     69     with np.errstate(all="ignore"):
---> 70         return op(a, b)
     71 


TypeError: can't multiply sequence by non-int of type 'str'


During handling of the above exception, another exception occurred:


TypeError                                 Traceback (most recent call last)

<ipython-input-84-7c1e2b69cbe2> in <module>
----> 1 all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
   1046 
   1047         with np.errstate(all="ignore"):
-> 1048             result = na_op(lvalues, rvalues)
   1049         return construct_result(
   1050             left, result, index=left.index, name=res_name, dtype=None


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in na_op(x, y)
    968             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
    969         except TypeError:
--> 970             result = masked_arith_op(x, y, op)
    971 
    972         return missing.dispatch_fill_zeros(op, x, y, result)


/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
    445         if mask.any():
    446             with np.errstate(all="ignore"):
--> 447                 result[mask] = op(xrav[mask], com.values_from_object(yrav[mask]))
    448 
    449     else:


TypeError: can't multiply sequence by non-int of type 'str'

Once again there were strings in the way.

We want these columns in a form we can do arithmetic on, so use pd.to_numeric().

all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

Let's try it one more time.

all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each'] 
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99

There it is. Now we can finally answer the question.

Just to restate it:

Task 1: What was the best month for sales, and how much was earned that month?

all_data.groupby('Month').sum()
Quantity Ordered Price Each Sales
Month
1 10903 1.811768e+06 1.822257e+06
2 13449 2.188885e+06 2.202022e+06
3 17005 2.791208e+06 2.807100e+06
4 20558 3.367671e+06 3.390670e+06
5 18667 3.135125e+06 3.152607e+06
6 15253 2.562026e+06 2.577802e+06
7 16072 2.632540e+06 2.647776e+06
8 13448 2.230345e+06 2.244468e+06
9 13109 2.084992e+06 2.097560e+06
10 22703 3.715555e+06 3.736727e+06
11 19798 3.180601e+06 3.199603e+06
12 28114 4.588415e+06 4.613443e+06

It's hard to tell at a glance, so let's plot it.

Import the plotting library.

import matplotlib.pyplot as plt

Set the x-axis range and draw the chart.

months = range(1,13)
results = all_data.groupby('Month').sum()
plt.bar(months,results['Sales'])
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

(Figure: bar chart of total sales by month)

That was Task 1.
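
If you just want the answer as numbers rather than reading it off the chart, something like this works (a small addition of mine; the table above already shows December, month 12, on top):

results = all_data.groupby('Month').sum()
print(results['Sales'].idxmax())   # month with the highest sales
print(results['Sales'].max())      # the sales total for that month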

Task 2: Which city had the highest sales?

Now that we have a Sales column this looks like an easy task, but let's work through it.

Before that, let's take another look at what the data contains.

all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99

It looks like we need to pull a piece out of Purchase Address and aggregate on it.

So, just as we did for Sales, let's create a City column.

Looking at the address data, the format is "street, city, state zip", and we want the second element, the city name.

We'll split on "," with split(), and use apply() to run that on every row of the column.

all_data['City'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles

The city was extracted nicely!

Before aggregating by City, though, there is one issue to think about.

Different states can contain cities with the same name. If we don't account for this, their sales get lumped together and the totals come out wrong (for example, there is a Portland in both Oregon and Maine).

That said, we don't need a whole new column for the state. Since we only want to aggregate, it's enough to include the two-letter state code in the City cell itself. We'll use the format City (State), e.g. Dallas (TX).

Earlier we wrote this as a one-line lambda, but while we're at it, let's define small functions that extract the city and the state. It's kinder to the reader, too.

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]
# The Purchase Address data has a space after each comma, so split on " " as well so the combined string doesn't look odd

all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)} ({get_state(x)})")
all_data.drop("State",axis=1)
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX)
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA)
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA)
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
... ... ... ... ... ... ... ... ... ...
186845 259353 AAA Batteries (4-pack) 3 2.99 09/17/19 20:56 840 Highland St, Los Angeles, CA 90001 9 8.97 Los Angeles (CA)
186846 259354 iPhone 1 700.00 09/01/19 16:00 216 Dogwood St, San Francisco, CA 94016 9 700.00 San Francisco (CA)
186847 259355 iPhone 1 700.00 09/23/19 07:39 220 12th St, San Francisco, CA 94016 9 700.00 San Francisco (CA)
186848 259356 34in Ultrawide Monitor 1 379.99 09/19/19 17:30 511 Forest St, San Francisco, CA 94016 9 379.99 San Francisco (CA)
186849 259357 USB-C Charging Cable 1 11.95 09/30/19 00:18 250 Meadow St, San Francisco, CA 94016 9 11.95 San Francisco (CA)

185950 rows × 9 columns
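
One note on the drop() call above: it returns a new DataFrame and only affects what is displayed here; all_data itself is unchanged. If you actually wanted to remove that State column (which appears to have been created in a step not shown), you would reassign:

all_data = all_data.drop("State", axis=1)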

OK. Now we just need to aggregate as usual.

all_data.groupby('City').sum()
Quantity Ordered Price Each Month Sales
City
Atlanta (GA) 16602 2.779908e+06 104794 2.795499e+06
Austin (TX) 11153 1.809874e+06 69829 1.819582e+06
Boston (MA) 22528 3.637410e+06 141112 3.661642e+06
Dallas (TX) 16730 2.752628e+06 104620 2.767975e+06
Los Angeles (CA) 33289 5.421435e+06 208325 5.452571e+06
New York City (NY) 27932 4.635371e+06 175741 4.664317e+06
Portland (ME) 2750 4.471893e+05 17144 4.497583e+05
Portland (OR) 11303 1.860558e+06 70621 1.870732e+06
San Francisco (CA) 50239 8.211462e+06 315520 8.262204e+06
Seattle (WA) 16553 2.733296e+06 104941 2.747755e+06

San Francisco comes out on top. You can also see that the two Portlands are kept properly separate.
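
To rank the cities programmatically instead of scanning the table, a one-liner like this would do (my own aside):

all_data.groupby('City').sum()['Sales'].sort_values(ascending=False)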

Arranging the code we used for the monthly chart:

results = all_data.groupby('City').sum()
cities = [city for city, df in all_data.groupby('City')]
# Cities for the x-axis labels: they must be in the same order as the groupby above, or the bars and labels won't match up
plt.bar(cities,results['Sales'])
plt.xticks(cities,rotation="vertical")
# The city names would overlap if drawn horizontally, so rotate them vertically
plt.ylabel('Sales')
plt.xlabel('City name')
plt.show()

(Figure: bar chart of total sales by city)

Done!

Task 3: When (at what time of day) should we display advertisements to maximize the likelihood of customers buying products?

Suddenly this feels much more like data science (although, to give away the ending, the logic isn't all that rigorous).

As usual, let's look back at the data.

all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001

It probably makes sense to look at the relationship between the Order Date column and sales.

Let's convert Order Date to a datetime type so it's easier to handle as a date.

all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001

The data format has changed.
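
A practical aside of my own: with this many rows, to_datetime can be slow when it has to infer the date format for every value. Passing the format explicitly (the raw strings here look like "04/19/19 08:46") makes the conversion above much faster:

all_data['Order Date'] = pd.to_datetime(all_data['Order Date'], format='%m/%d/%y %H:%M')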

Let's create Hour and Minute columns.

When we built the Month and Sales columns we were dealing with strings, so we used things like split(), but now that we've converted the type we can pull the hour and minute out directly with the .dt accessor.

all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 9 27

Let's visualize the data to get a feel for it first.

hours = [hour for hour, df in all_data.groupby('Hour')]
plt.plot(hours,all_data.groupby(['Hour']).count())
#Hour on the x-axis, counting the number of orders (rows) per hour
plt.xticks(hours)
plt.grid()
#Add a grid so the hourly trend is as clear as possible
plt.xlabel('Hour')
plt.ylabel('Orders')
plt.show()

(Figure: line plot of order counts by hour)

The peaks are around 11 AM and 7 PM. So displaying ads around these times, when order counts (and thus customer traffic) are highest, looks like the best bet.
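
To confirm the peak hours numerically instead of eyeballing the plot, a quick check like this would do (my own aside):

order_counts = all_data.groupby('Hour')['Order ID'].count()
print(order_counts.nlargest(2))   # the two busiest hours; the plot suggests 11 and 19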

Task 4: Which products are most often sold together?

all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 9 27

Grouping rows that share the same Order ID should tell us.

Rows with duplicated values are easy to pick out with the duplicated() method.

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df.head(20)
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State Hour Minute
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 14 38
18 176574 Google Phone 1 600.00 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 19 42
19 176574 USB-C Charging Cable 1 11.95 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) CA 90001 19 42
30 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 11 31
31 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 11 31
32 176586 AAA Batteries (4-pack) 2 2.99 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 5.98 San Francisco (CA) CA 94016 17 0
33 176586 Google Phone 1 600.00 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 600.00 San Francisco (CA) CA 94016 17 0
119 176672 Lightning Charging Cable 1 14.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 14.95 New York City (NY) NY 10001 11 7
120 176672 USB-C Charging Cable 1 11.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 11.95 New York City (NY) NY 10001 11 7
129 176681 Apple Airpods Headphones 1 150.00 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 150.00 Seattle (WA) WA 98101 10 39
130 176681 ThinkPad Laptop 1 999.99 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 999.99 Seattle (WA) WA 98101 10 39
138 176689 Bose SoundSport Headphones 1 99.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 99.99 New York City (NY) NY 10001 17 15
139 176689 AAA Batteries (4-pack) 2 2.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 5.98 New York City (NY) NY 10001 17 15
189 176739 34in Ultrawide Monitor 1 379.99 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 379.99 Austin (TX) TX 73301 17 38
190 176739 Google Phone 1 600.00 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 600.00 Austin (TX) TX 73301 17 38
225 176774 Lightning Charging Cable 1 14.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 14.95 Los Angeles (CA) CA 90001 15 6
226 176774 USB-C Charging Cable 1 11.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) CA 90001 15 6
233 176781 iPhone 1 700.00 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 700.00 Dallas (TX) TX 75001 7 37
234 176781 Lightning Charging Cable 1 14.95 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 14.95 Dallas (TX) TX 75001 7 37

Now rows with the same Order ID sit next to each other.

The data above contains only orders with duplicated IDs (in other words, only orders where several items were bought together).

Group by Order ID and create a column holding the products bought under that ID (e.g. ID 1111 → ordered: Apple,Banana,Orange).

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
#transform is similar to apply; I'll skip the details here
df = df[['Order ID','Grouped']].drop_duplicates()
df.head(10)
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Order ID Grouped
3 176560 Google Phone,Wired Headphones
18 176574 Google Phone,USB-C Charging Cable
30 176585 Bose SoundSport Headphones,Bose SoundSport Hea...
32 176586 AAA Batteries (4-pack),Google Phone
119 176672 Lightning Charging Cable,USB-C Charging Cable
129 176681 Apple Airpods Headphones,ThinkPad Laptop
138 176689 Bose SoundSport Headphones,AAA Batteries (4-pack)
189 176739 34in Ultrawide Monitor,Google Phone
225 176774 Lightning Charging Cable,USB-C Charging Cable
233 176781 iPhone,Lightning Charging Cable

Looking good. The data is nicely consolidated.
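
About the SettingWithCopyWarning above: df was created as a filtered slice of all_data, so pandas warns when we assign a new column to it. Taking an explicit copy is a common way to silence it; this is just my own note, not a step from the video.

df = all_data[all_data['Order ID'].duplicated(keep=False)].copy()
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df = df[['Order ID','Grouped']].drop_duplicates()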

From this Grouped column we need to work out which products appear with which, in what combinations, and how many times each.

Let's aggregate using some handy standard-library tools. For a concrete picture of what they do, see the Stack Overflow answer below.

(Just skimming the input and output code there to get the general idea is fine.)
https://stackoverflow.com/questions/56187042/python-itertools-make-combinations-with-sum/56187946#56187946

from itertools import combinations
from collections import Counter

count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list,2)))
    
count.most_common(10)
[(('iPhone', 'Lightning Charging Cable'), 1005),
 (('Google Phone', 'USB-C Charging Cable'), 987),
 (('iPhone', 'Wired Headphones'), 447),
 (('Google Phone', 'Wired Headphones'), 414),
 (('Vareebadd Phone', 'USB-C Charging Cable'), 361),
 (('iPhone', 'Apple Airpods Headphones'), 360),
 (('Google Phone', 'Bose SoundSport Headphones'), 220),
 (('USB-C Charging Cable', 'Wired Headphones'), 160),
 (('Vareebadd Phone', 'Wired Headphones'), 143),
 (('Lightning Charging Cable', 'Wired Headphones'), 92)]

The output is a little messy, so let's tidy it up slightly.

for key,value in count.most_common(10):
    print(key,value)
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
('iPhone', 'Apple Airpods Headphones') 360
('Google Phone', 'Bose SoundSport Headphones') 220
('USB-C Charging Cable', 'Wired Headphones') 160
('Vareebadd Phone', 'Wired Headphones') 143
('Lightning Charging Cable', 'Wired Headphones') 92

The iPhone and Lightning Charging Cable pair is the most common.
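
One caveat I would add (not covered in the video): combinations() keeps the order in which products appear within each order, so ('iPhone', 'Lightning Charging Cable') and ('Lightning Charging Cable', 'iPhone') are counted as different pairs. Sorting each pair before counting merges them:

count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    count.update(Counter(tuple(sorted(pair)) for pair in combinations(row_list, 2)))
count.most_common(10)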

Task 5 (final task): Which product sold the most, and why do you think it sold the most?

This is the last task, and it brings in some thinking of your own (the hypothesis part).
Let's visualize the data.

We'll think in terms of quantity ordered.

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']
products = [product for product,df in product_group]
plt.bar(products,quantity_ordered)
plt.xlabel('Product')
plt.ylabel('Quantity Ordered')
plt.xticks(products,rotation="vertical",size=8)
plt.show()

(Figure: bar chart of quantity ordered by product)

Batteries and cables seem to be ordered the most.

Let's hypothesize that it's because they're cheap, and check prices against order quantities.

prices = all_data.groupby('Product').mean()['Price Each']

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
# Create a second y-axis that shares the same x-axis
ax1.bar(products,quantity_ordered)
ax2.plot(products,prices,'b-')

ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color="g")
ax2.set_ylabel('Price', color="b")
ax1.set_xticklabels(products,rotation="vertical",size=8)

plt.show()

(Figure: quantity ordered per product with average price overlaid on a second axis)

That's the end of the walkthrough.
