More than 5 years have passed since last update.

RとPythonで重回帰再追記

Last updated at 2020-05-03Posted at 2019-08-14

はじめに

R未経験にもかかわらずR対象のデータ分析系の勉強会に出席してしまったので
Pythonだとどうするの？を復習を兼ねて記録に残したいと思います

改訂2

PythonでもRのように、目的変数と説明変数を指定することができました
備忘録として追記しました

改訂

勉強会参加メンバーから疑問点について助言いただいたので文末に追記しました

python で決定係数を出力した際、scikit learnとstatsmodelsに差がありました
statsmodelsの設定により差が発生していたことがわかりました

statsmodelsのデフォルトは、切片0とのこと
A intercept is not included by default and should be added by the user.
See statsmodels.tool.add_constant.

↓公式documents
https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html

別件でscikit-learnの　intercept = False　も結果がおかしくなるとの噂も、、、
この辺は別にまとめたいと思っています

環境

windows10 Pro 64bit
R :3.6.1
RStudio :1.2.1335

Google Colaboratoryを使用
Python :3.6.8

何するの？

RのSMCRM内のcustomerAcquisitionデータセットを利用して重回帰分析を行う


library(SMCRM)

# データセットの読み出し
data("customerAcquisition")

# データセットの確認
str(customerAcquisition)

'data.frame':	500 obs. of  16 variables:
 $ customer      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ acquisition   : num  1 0 0 1 1 0 0 0 1 1 ...
 $ first_purchase: num  434 0 0 226 363 ...
 $ clv           : num  0 0 0 5.73 0 ...
 $ duration      : num  384 0 0 730 579 0 0 0 730 730 ...
 $ censor        : num  0 0 0 1 0 0 0 0 1 1 ...
 $ acq_expense   : num  760 148 253 610 672 ...
 $ acq_expense_sq: num  578147 21815 63787 371771 452068 ...
 $ industry      : num  1 1 1 1 1 0 0 0 1 1 ...
 $ revenue       : num  30.2 39.8 54.9 45.8 69 ...
 $ employees     : num  1240 166 1016 122 313 ...
 $ ret_expense   : num  2310 0 0 2193 801 ...
 $ ret_expense_sq: num  5335130 0 0 4807451 641825 ...
 $ crossbuy      : num  5 0 0 2 4 0 0 0 1 1 ...
 $ frequency     : num  2 0 0 12 7 0 0 0 11 14 ...
 $ frequency_sq  : num  4 0 0 144 49 0 0 0 121 196 ...

# acquisition が 1のデータを対象としてdf に代入
df<- customerAcquisition %>% filter(acquisition == 1)

# first_purchase を目的関数、その他すべてのデータを説明関数として重回帰
lm_CA_full <- lm(first_purchase ~ . - acquisition , data = df)

# first_purchase を目的関数、説明関数を4つに絞って重回帰
lm_CA <- lm(first_purchase ~ acq_expense + acq_expense_sq + revenue + employees, data = df)

# lm_CAの回帰計算結果表示
summary(lm_CA)

Call:
lm(formula = first_purchase ~ acq_expense + acq_expense_sq + 
    revenue + employees, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-84.734 -19.765  -0.867  19.986 100.291 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.783e+00  3.150e+01   0.247    0.805    
acq_expense     5.317e-01  9.534e-02   5.577 5.66e-08 ***
acq_expense_sq -6.544e-04  7.192e-05  -9.099  < 2e-16 ***
revenue         2.986e+00  1.086e-01  27.488  < 2e-16 ***
employees       2.483e-01  3.769e-03  65.887  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.59 on 287 degrees of freedom
Multiple R-squared:  0.9617,	Adjusted R-squared:  0.9611 
F-statistic:  1800 on 4 and 287 DF,  p-value: < 2.2e-16

説明係数の回帰係数は以下
　acq_expense 5.317e-01
　acq_expense_sq -6.544e-04
　revenue 2.986e+00
　employees 2.483e-01
切片は
　(Intercept) 7.783e+00

線形回帰モデルは以下で表せる
first_purchase = 7.783e+00
　　　　　　　+ acq_expense x 5.317e-01
　　　　　　　 + acq_expense_sq x -6.544e-04
　　　　　　　+ revenue x 2.986e+00
　　　　　　　 + employees x 2.483e-01

モデルの評価値は決定係数で表される
決定係数　Multiple R-squared: 0.9617
修正決定係数　Adjusted R-squared: 0.9611

モデル同士の比較はAIC（赤池情報量基準）にて評価


> # AICで比較
> AIC(lm_CA_full)
[1] 2842.858
> AIC(lm_CA)
[1] 2833.373

AICの値が小さいほうが理想に近いということらしいので、すべての説明変数を利用したモデルより
4つの説明変数のモデルのほうがよく説明できているモデルということになる。
ではこれがベストか？はstepAICで調べられる。（らしい、、）



> lm_CA_step <- stepAIC(lm_CA_full)
Start:  AIC=2012.2
first_purchase ~ (customer + acquisition + clv + duration + censor + 
    acq_expense + acq_expense_sq + industry + revenue + employees + 
    ret_expense + ret_expense_sq + crossbuy + frequency + frequency_sq) - 
    acquisition

                 Df Sum of Sq     RSS    AIC
- customer        1       238  259372 2010.5
- crossbuy        1       259  259394 2010.5
- duration        1       555  259690 2010.8
- frequency_sq    1       621  259756 2010.9
- ret_expense_sq  1       712  259847 2011.0
- ret_expense     1      1414  260549 2011.8
<none>                         259135 2012.2
- frequency       1      2075  261210 2012.5
- censor          1      2283  261418 2012.8
- clv             1      3111  262246 2013.7
- industry        1      3131  262266 2013.7
- acq_expense     1     23542  282677 2035.6
- acq_expense_sq  1     47453  306588 2059.3
- revenue         1    536393  795528 2337.7
- employees       1   3231713 3490848 2769.6

Step:  AIC=2010.47
first_purchase ~ clv + duration + censor + acq_expense + acq_expense_sq + 
    industry + revenue + employees + ret_expense + ret_expense_sq + 
    crossbuy + frequency + frequency_sq

                 Df Sum of Sq     RSS    AIC
- crossbuy        1       213  259586 2008.7
- duration        1       536  259909 2009.1
- frequency_sq    1       568  259941 2009.1
- ret_expense_sq  1       737  260109 2009.3
- ret_expense     1      1448  260820 2010.1
<none>                         259372 2010.5
- frequency       1      1981  261354 2010.7
- censor          1      2200  261573 2010.9
- industry        1      3009  262381 2011.8
- clv             1      3038  262411 2011.9
- acq_expense     1     23644  283017 2033.9
- acq_expense_sq  1     47614  306987 2057.7
- revenue         1    538374  797746 2336.5
- employees       1   3277066 3536439 2771.3

Step:  AIC=2008.71
first_purchase ~ clv + duration + censor + acq_expense + acq_expense_sq + 
    industry + revenue + employees + ret_expense + ret_expense_sq + 
    frequency + frequency_sq

                 Df Sum of Sq     RSS    AIC
- duration        1       394  259980 2007.2
- frequency_sq    1       496  260082 2007.3
- ret_expense_sq  1       812  260398 2007.6
- ret_expense     1      1428  261014 2008.3
<none>                         259586 2008.7
- frequency       1      1803  261388 2008.7
- censor          1      2000  261586 2009.0
- industry        1      2810  262396 2009.8
- clv             1      2890  262475 2009.9
- acq_expense     1     24239  283825 2032.8
- acq_expense_sq  1     50080  309666 2058.2
- revenue         1    554513  814098 2340.5
- employees       1   3297327 3556912 2771.0

Step:  AIC=2007.15
first_purchase ~ clv + censor + acq_expense + acq_expense_sq + 
    industry + revenue + employees + ret_expense + ret_expense_sq + 
    frequency + frequency_sq

                 Df Sum of Sq     RSS    AIC
- frequency_sq    1       495  260474 2005.7
- ret_expense_sq  1       605  260585 2005.8
- ret_expense     1      1068  261048 2006.3
- censor          1      1606  261586 2007.0
- frequency       1      1609  261589 2007.0
<none>                         259980 2007.2
- industry        1      2419  262399 2007.8
- clv             1      2502  262481 2007.9
- acq_expense     1     31416  291395 2038.5
- acq_expense_sq  1     77789  337769 2081.6
- revenue         1    594439  854419 2352.6
- employees       1   3327401 3587380 2771.5

Step:  AIC=2005.7
first_purchase ~ clv + censor + acq_expense + acq_expense_sq + 
    industry + revenue + employees + ret_expense + ret_expense_sq + 
    frequency

                 Df Sum of Sq     RSS    AIC
- ret_expense_sq  1       651  261125 2004.4
- ret_expense     1      1052  261527 2004.9
- censor          1      1382  261856 2005.2
<none>                         260474 2005.7
- clv             1      2202  262677 2006.2
- industry        1      2341  262815 2006.3
- frequency       1      5221  265695 2009.5
- acq_expense     1     31388  291862 2036.9
- acq_expense_sq  1     77902  338376 2080.1
- revenue         1    612592  873066 2356.9
- employees       1   3335399 3595873 2770.2

Step:  AIC=2004.43
first_purchase ~ clv + censor + acq_expense + acq_expense_sq + 
    industry + revenue + employees + ret_expense + frequency

                 Df Sum of Sq     RSS    AIC
- ret_expense     1       796  261921 2003.3
- censor          1      1429  262554 2004.0
<none>                         261125 2004.4
- clv             1      2225  263349 2004.9
- industry        1      2402  263527 2005.1
- frequency       1      5500  266625 2008.5
- acq_expense     1     30890  292015 2035.1
- acq_expense_sq  1     77273  338398 2078.1
- revenue         1    621020  882145 2357.9
- employees       1   3347586 3608711 2769.3

Step:  AIC=2003.32
first_purchase ~ clv + censor + acq_expense + acq_expense_sq + 
    industry + revenue + employees + frequency

                 Df Sum of Sq     RSS    AIC
- censor          1       929  262850 2002.3
- clv             1      1489  263410 2003.0
<none>                         261921 2003.3
- industry        1      2086  264008 2003.6
- frequency       1      4822  266743 2006.7
- acq_expense     1     30126  292047 2033.1
- acq_expense_sq  1     76673  338594 2076.3
- revenue         1    674215  936136 2373.2
- employees       1   3509176 3771097 2780.1

Step:  AIC=2002.35
first_purchase ~ clv + acq_expense + acq_expense_sq + industry + 
    revenue + employees + frequency

                 Df Sum of Sq     RSS    AIC
- clv             1      1250  264100 2001.7
<none>                         262850 2002.3
- industry        1      1902  264752 2002.5
- frequency       1      4941  267791 2005.8
- acq_expense     1     29200  292049 2031.1
- acq_expense_sq  1     76005  338855 2074.5
- revenue         1    679663  942512 2373.2
- employees       1   3892379 4155229 2806.4

Step:  AIC=2001.74
first_purchase ~ acq_expense + acq_expense_sq + industry + revenue + 
    employees + frequency

                 Df Sum of Sq     RSS    AIC
- industry        1      1233  265333 2001.1
<none>                         264100 2001.7
- frequency       1      3738  267838 2003.8
- acq_expense     1     28352  292452 2029.5
- acq_expense_sq  1     76467  340567 2074.0
- revenue         1    693588  957688 2375.9
- employees       1   3962665 4226765 2809.4

Step:  AIC=2001.1
first_purchase ~ acq_expense + acq_expense_sq + revenue + employees + 
    frequency

                 Df Sum of Sq     RSS    AIC
<none>                         265333 2001.1
- frequency       1      3303  268636 2002.7
- acq_expense     1     28868  294201 2029.3
- acq_expense_sq  1     77127  342460 2073.6
- revenue         1    706100  971433 2378.1
- employees       1   3973948 4239281 2808.3

StartのAICが2012.2で単独で評価した2842.858となぜ異なるのかわかりませんが、ベストなモデルは
初めに選択した4つの説明変数に、frequencyを追加したものということが分かりました。

Pythonで記述

customerAcquisitionデータセットはPythonには用意されていないので、Rからcsvで出力して利用することにします


＃RからcustomerAcquisitionをcsvで出力
write.table(customerAcquisition, "customerAcquisition.csv", sep=",", row.names=F)

Pythonでのデータ解析はライブラリーに依存しているところが多々あるため下記のようなライブラリーをインポートする必要がある
pandas 　　　データ分析用ライブラリ。DataFrameを提供
numpy　　　　多次元配列計算用ライブラリ
matplotlib　　グラフ描画ライブラリ
seaborn　　　matplotlibのラッパー、かっこいいグラフが簡単に書ける
sklearn 　　　機械学習用ライブラリー


import numpy as np
import pandas as pd

# 可視化モジュール
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

# 重回帰
from sklearn.linear_model import LinearRegression

Rから出力したcsvファイルはpandasで読み込み、データ構造も簡単に確認できます


df = pd.read_csv('customerAcquisition.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 16 columns):
customer          500 non-null int64
acquisition       500 non-null int64
first_purchase    500 non-null float64
clv               500 non-null float64
duration          500 non-null int64
censor            500 non-null int64
acq_expense       500 non-null float64
acq_expense_sq    500 non-null float64
industry          500 non-null int64
revenue           500 non-null float64
employees         500 non-null int64
ret_expense       500 non-null float64
ret_expense_sq    500 non-null float64
crossbuy          500 non-null int64
frequency         500 non-null int64
frequency_sq      500 non-null int64
dtypes: float64(7), int64(9)
memory usage: 62.6 KB

データフレームは論理式でbool値を取得し、bool値を使ってデータを抽出することができる


df['acquisition'] == 1

0       True
1      False
2      False
3       True
  ... 
497    False
498     True
499    False
Name: acquisition, Length: 500, dtype: bool

# acquisition が 1　のデータのみに抽出
df = df[df['acquisition'] == 1]

# 抽出後のデータを確認
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 292 entries, 0 to 498
Data columns (total 16 columns):
customer          292 non-null int64
acquisition       292 non-null int64
first_purchase    292 non-null float64
clv               292 non-null float64
duration          292 non-null int64
censor            292 non-null int64
acq_expense       292 non-null float64
acq_expense_sq    292 non-null float64
industry          292 non-null int64
revenue           292 non-null float64
employees         292 non-null int64
ret_expense       292 non-null float64
ret_expense_sq    292 non-null float64
crossbuy          292 non-null int64
frequency         292 non-null int64
frequency_sq      292 non-null int64
dtypes: float64(7), int64(9)
memory usage: 38.8 KB

acquisition　= 1　のみのデータ抽出でデータ数が
500 →　292 に減少

このデータフレームをベースに目的変数、説明変数を作りながら
重回帰のclassにデータを入れていきます


# 目的変数
y = df['first_purchase']

# 説明変数は元のデータフレームから目的変数を削除したもの
X_full = df.drop('first_purchase',axis=1)

# 選択した説明変数のデータフレームも作成
df_sel = df[['first_purchase','acq_expense','acq_expense_sq','revenue','employees']]
X_sel = df_sel.drop('first_purchase',axis=1)

# 重回帰のインスタンスを作成
model = LinearRegression()

# X_selで学習
model.fit(X_sel,y)

# 回帰係数
print('\n回帰係数\n{}'.format(pd.Series(model.coef_,　index=df_sel.drop('first_purchase',axis=1).columns)))

回帰係数
acq_expense       0.531689
acq_expense_sq   -0.000654
revenue           2.986234
employees         0.248334
dtype: float64

# 切片
print('切片: {:.3f}'.format(model.intercept_))

切片: 7.783

# 決定係数
print(model.score(df_sel.drop('first_purchase',axis=1),df_sel['first_purchase']))

0.962

回帰係数、切片ともにRと同様に求められたが
評価値は決定係数のみで修正決定係数は出力できない（と思う、、追記に記載しました）
Pythonの重回帰は機械学習モデルとしてLinearRegressionしか使ったことがなかったが、Rのように統計的な結出力を得たい場合は、statsmodelsで可能となるようです。


from sklearn import datasets
from statsmodels import api as sm

model = sm.OLS(y, X_sel)
result = model.fit()
print (result.summary())

                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:         first_purchase   R-squared (uncentered):                   0.994
Model:                            OLS   Adj. R-squared (uncentered):              0.994
Method:                 Least Squares   F-statistic:                          1.266e+04
Date:                Tue, 13 Aug 2019   Prob (F-statistic):                   3.21e-322
Time:                        14:44:24   Log-Likelihood:                         -1410.7
No. Observations:                 292   AIC:                                      2829.
Df Residuals:                     288   BIC:                                      2844.
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
acq_expense        0.5546      0.023     24.391      0.000       0.510       0.599
acq_expense_sq    -0.0007   2.42e-05    -27.743      0.000      -0.001      -0.001
revenue            2.9894      0.108     27.755      0.000       2.777       3.201
employees          0.2486      0.004     69.021      0.000       0.242       0.256
==============================================================================
Omnibus:                        2.317   Durbin-Watson:                   1.891
Prob(Omnibus):                  0.314   Jarque-Bera (JB):                2.013
Skew:                           0.156   Prob(JB):                        0.366
Kurtosis:                       3.260   Cond. No.                     2.97e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.97e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


print(result.aic)

2829.4353329751475

修正決定係数まで出力できたが、、、結果が異なる、、
AICも微妙に違う、、なぜ？
Pythonでは、調べた限りではRのようなstepAICはなく自分で組む必要がありそう
Qiitaでも自作している人がちらほら
データが異なる理由、stepAICはじっくり調べてみたいと思います

とりあえず、Rとpythonで同じデータセットを用いて重回帰を行ってみました
Rは初めてでしたが、さすが統計的な扱いは得意ですね、楽そうなのでサクッと計算ができそうです
必要に応じて使い分けですかね？
Rでの勉強会はまだ続くのでpythonとの違いをじっくり調べてみたいと思います。

追記

scikit-learnとstatsmodelsの決定係数差について

statsmodelsで回帰を行う場合は、Xに明示的に切片（statsmodelssでｈconstant)を追加する必要があった
デフォルトではなし

statsmodelsの結果
切片なしの決定係数　0.994
切片ありの决定係数　0.962

scikit-learnの結果
決定係数　0.962

Rの結果
決定係数　Multiple R-squared: 0.9617
修正決定係数　Adjusted R-squared: 0.9611


from sklearn import datasets
from statsmodels import api as sm

X_sel_ = sm.add_constant(X_sel)

model2 = sm.OLS(y, X_sel_)
result2 = model2.fit()
print (result2.summary())

                                                         OLS Regression Results                            
==============================================================================
Dep. Variable:         first_purchase   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.961
Method:                 Least Squares   F-statistic:                     1800.
Date:                Sat, 31 Aug 2019   Prob (F-statistic):          7.74e-202
Time:                        03:58:27   Log-Likelihood:                -1410.7
No. Observations:                 292   AIC:                             2831.
Df Residuals:                     287   BIC:                             2850.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              7.7830     31.498      0.247      0.805     -54.213      69.779
acq_expense        0.5317      0.095      5.577      0.000       0.344       0.719
acq_expense_sq    -0.0007   7.19e-05     -9.099      0.000      -0.001      -0.001
revenue            2.9862      0.109     27.488      0.000       2.772       3.200
employees          0.2483      0.004     65.887      0.000       0.241       0.256
==============================================================================
Omnibus:                        2.045   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.739
Skew:                           0.150   Prob(JB):                        0.419
Kurtosis:                       3.230   Cond. No.                     8.58e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.58e+06. This might indicate that there are
strong multicollinearity or other numerical problems.


print(result2.aic)

2831.3732200478385

scikit-learnの自由度修正決定係数の扱いについて

scikit-learnでは決定係数のみのようです
自由度調整決定係数は決定係数から計算できるので試してみました
https://www.statisticshowto.datasciencecentral.com/adjusted-r2/

R,修正後のstatsmodelsとも数値が合いました！


# 調整決系係数
y= df_sel['first_purchase']
X_sel = df_sel.drop('first_purchase',axis=1).columns

adj_r_sqr = 1 - (1-r_sqr)*(len(y)-1)/(len(y)-len(X_sel)-1)

print('adj_r_sqr = :',adj_r_sqr)

adj_r_sqr = : 0.9611309093220055

追記2

statsmodels.formula.api を用いると、R-styleのモデル式が使用できます

statsmodels.formula.api: A convenience interface for specifying models using formula strings and DataFrames. This API directly exposes the from_formula class method of models that support the formula API. Canonically imported using import statsmodels.formula.api as smf

import statsmodels.formula.api as smf

model_Rlike = smf.ols('first_purchase ~ acq_expense + acq_expense_sq + revenue + employees', data=df_sel)
result = model_Rlike.fit()
print(result.summary())

                           OLS Regression Results                            
==============================================================================
Dep. Variable:         first_purchase   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.961
Method:                 Least Squares   F-statistic:                     1800.
Date:                Sun, 03 May 2020   Prob (F-statistic):          7.74e-202
Time:                        10:19:15   Log-Likelihood:                -1410.7
No. Observations:                 292   AIC:                             2831.
Df Residuals:                     287   BIC:                             2850.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          7.7830     31.498      0.247      0.805     -54.213      69.779
acq_expense        0.5317      0.095      5.577      0.000       0.344       0.719
acq_expense_sq    -0.0007   7.19e-05     -9.099      0.000      -0.001      -0.001
revenue            2.9862      0.109     27.488      0.000       2.772       3.200
employees          0.2483      0.004     65.887      0.000       0.241       0.256
==============================================================================
Omnibus:                        2.045   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.739
Skew:                           0.150   Prob(JB):                        0.419
Kurtosis:                       3.230   Cond. No.                     8.58e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.58e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
'''

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

RとPythonで重回帰 再追記