
Computing correlation coefficients in Python [4 patterns]

Python

Posted at 2019-03-05

This article summarizes how to compute Pearson's correlation coefficient in Python, organized by use case.

Compare two lists -> pd.Series.corr()
Compare every pair of columns within one DataFrame -> pd.DataFrame.corr()
Compare the matching columns of two aligned DataFrames -> pd.DataFrame.corrwith()
Compare every pair of columns across two unaligned DataFrames -> scipy's cdist
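
All four patterns compute the same quantity, so it may help to see it written out once. The following is a minimal sketch and not part of the original article: `pearson_r` is a hypothetical helper name, and the two lists are the sample values printed in the first example.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson's r from its definition (hypothetical helper, not a pandas API)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean
    dy = y - y.mean()
    # covariance divided by the product of the standard deviations;
    # the 1/n factors cancel, so plain sums are enough
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# sample values copied from the first example
l1 = [4, 6, 0, 8, 6, 2, 0, 3, 3, 5]
l2 = [4, 6, 3, 7, 8, 4, 6, 9, 0, 0]

r_manual = pearson_r(l1, l2)
r_pandas = pd.Series(l1).corr(pd.Series(l2))
print(r_manual, r_pandas)   # both are approximately 0.2339
```

The hand-written version is only there to show what the library calls return; the pandas and scipy functions are the practical choice.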

Comparing two lists -> use pandas Series.corr()

list_corr.py

#!/usr/bin/env python3

import pandas as pd
import numpy as np

# Create test lists
l1=list(np.random.randint(0, 10, 10))
l2=list(np.random.randint(0, 10, 10))

# The generated lists
print(l1)
[4, 6, 0, 8, 6, 2, 0, 3, 3, 5]
print(l2)
[4, 6, 3, 7, 8, 4, 6, 9, 0, 0]

# Convert the lists to pd.Series
s1=pd.Series(l1)
s2=pd.Series(l2)

# Compute Pearson's r with pandas
res=s1.corr(s2)   # the result is a numpy.float64

# Result
print(res)
0.23385611715924406

Note
s1.corr(s2) is equivalent to s1.corr(s2, method='pearson').
Other methods are also available, for example:
s1.corr(s2, method='spearman')
s1.corr(s2, method='kendall')
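
To illustrate how the three methods can differ, here is a small sketch on made-up data (not from the article): `y` is a monotonic but non-linear function of `x`. It assumes SciPy is installed, as in the environment listed at the end, since pandas may rely on it for the rank-based methods.

```python
import numpy as np
import pandas as pd

# made-up data: y rises with x, but not along a straight line
x = pd.Series(np.arange(1, 11), dtype=float)
y = x ** 3

r_pearson = x.corr(y)                       # below 1, the relation is not linear
r_spearman = x.corr(y, method='spearman')   # 1.0 up to rounding, the ranks agree
r_kendall = x.corr(y, method='kendall')     # 1.0 up to rounding, every pair is concordant

print(r_pearson, r_spearman, r_kendall)
```

Pearson's r measures linear association, while Spearman and Kendall only look at ordering, which is why they reach 1 here.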

Comparing every pair of columns within one DataFrame -> use pandas DataFrame.corr()

df_corr.py

#!/usr/bin/env python3

import pandas as pd
import numpy as np

# Create a test DataFrame
df=pd.DataFrame(index=['idx'+str(i) for i in range(10)])
for i in range(3):
    df['col'+str(i)]=np.random.rand(10)
        
# The generated DataFrame
print(df)
          col0      col1      col2
idx0  0.490571  0.338749  0.683458
idx1  0.815814  0.959449  0.463660
idx2  0.396800  0.317452  0.170291
idx3  0.962362  0.662069  0.811776
idx4  0.474287  0.479441  0.307625
idx5  0.162198  0.680460  0.694463
idx6  0.551089  0.202127  0.615898
idx7  0.799246  0.155890  0.906621
idx8  0.279273  0.152200  0.879839
idx9  0.430898  0.267056  0.430798

# Compute Pearson's r with pandas
res=df.corr()   # the result is a pandas DataFrame

# Result
print(res)
          col0      col1      col2
col0  1.000000  0.300315  0.210017
col1  0.300315  1.000000 -0.185880
col2  0.210017 -0.185880  1.000000

Note
df.corr() is equivalent to df.corr(method='pearson').
Other methods are also available, for example:
df.corr(method='spearman')
df.corr(method='kendall')
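
As a cross-check of what the matrix contains, the sketch below (random data with a fixed seed, which is an assumption added here for reproducibility) compares df.corr() with NumPy's np.corrcoef and shows that the Spearman variant equals Pearson's r computed on per-column ranks.

```python
import numpy as np
import pandas as pd

np.random.seed(0)   # fixed seed so the run is reproducible (not in the article)
df = pd.DataFrame(np.random.rand(10, 3), columns=['col0', 'col1', 'col2'])

res = df.corr()

# the same numbers from NumPy; rowvar=False treats each column as a variable
res_np = np.corrcoef(df.values, rowvar=False)
print(np.allclose(res.values, res_np))

# Spearman's coefficient is Pearson's r applied to the ranks of each column
res_spearman = df.corr(method='spearman')
res_rank = df.rank().corr()
print(np.allclose(res_spearman.values, res_rank.values))
```

The result is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.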

Comparing the matching columns of two aligned DataFrames -> use pandas DataFrame.corrwith()

pd_corrwith.py

#!/usr/bin/env python3

import pandas as pd
import numpy as np

# Create test DataFrames
df1=pd.DataFrame(index=['idx'+str(i) for i in range(10)])
for i in range(3):
    df1['col'+str(i)]=np.random.rand(10)

df2=pd.DataFrame(index=['idx'+str(i) for i in range(10)])
for i in range(4):
    df2['col'+str(i)]=np.random.rand(10)

# The generated DataFrames
# Index labels must match between df1 and df2, because corrwith aligns rows by label
print(df1)
          col0      col1      col2
idx0  0.470484  0.529014  0.200872
idx1  0.036357  0.999937  0.949096
idx2  0.097277  0.152169  0.568015
idx3  0.640013  0.253285  0.365569
idx4  0.738058  0.496349  0.597689
idx5  0.230077  0.979614  0.820738
idx6  0.026953  0.301144  0.739461
idx7  0.472698  0.062897  0.833863
idx8  0.081538  0.250960  0.038582
idx9  0.196873  0.683337  0.062061

print(df2)
          col0      col1      col2      col3
idx0  0.873917  0.404390  0.427867  0.135733
idx1  0.156623  0.332094  0.779584  0.971294
idx2  0.672574  0.085956  0.030390  0.017714
idx3  0.920469  0.951883  0.484358  0.013711
idx4  0.820394  0.041568  0.070731  0.911695
idx5  0.575050  0.754205  0.146625  0.787360
idx6  0.994348  0.156208  0.040534  0.908418
idx7  0.108996  0.002158  0.609719  0.829356
idx8  0.953230  0.215288  0.296275  0.954589
idx9  0.907425  0.165094  0.756403  0.742972

# Compute Pearson's r with pandas
res=df1.corrwith(df2)   # the result is a pandas Series

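# Result
# Columns with the same name in df1 and df2 are compared with each other
# A column with no same-named counterpart, such as col3 in df2, is not compared and shows up as NaN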
print(res)
col0    0.073014
col1    0.331766
col2   -0.121577
col3         NaN
dtype: float64

Note
df1.corrwith(df2) is equivalent to df1.corrwith(df2, method='pearson').
Other methods are also available, for example:
df1.corrwith(df2, method='spearman')
df1.corrwith(df2, method='kendall')
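
The sketch below, on seeded random data (the seed and the variable names `res_drop` and `manual` are additions for illustration), shows two things about corrwith(): each entry equals the Series-level correlation of the same-named columns, and the `drop=True` argument leaves out labels that exist in only one of the frames instead of reporting NaN.

```python
import numpy as np
import pandas as pd

np.random.seed(0)   # fixed seed so the run is reproducible (not in the article)
idx = ['idx' + str(i) for i in range(10)]
df1 = pd.DataFrame(np.random.rand(10, 3), index=idx,
                   columns=['col0', 'col1', 'col2'])
df2 = pd.DataFrame(np.random.rand(10, 4), index=idx,
                   columns=['col0', 'col1', 'col2', 'col3'])

res = df1.corrwith(df2)                   # col3 is reported as NaN
res_drop = df1.corrwith(df2, drop=True)   # col3 is left out of the result

# each entry matches the correlation of the two same-named columns
manual = pd.Series({c: df1[c].corr(df2[c]) for c in df1.columns})

print(res)
print(res_drop)
print(manual)
```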

Comparing every pair of columns across two unaligned DataFrames -> use scipy's cdist

scipy_cdist.py

#!/usr/bin/env python3

import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

# Create test DataFrames
df1=pd.DataFrame(index=['df1idx'+str(i) for i in range(10)])
for i in range(2):
    df1['df1col'+str(i)]=np.random.rand(10)

df2=pd.DataFrame(index=['df2idx'+str(i) for i in range(10)])
for i in range(3):
    df2['df2col'+str(i)]=np.random.rand(10)

# The generated DataFrames
# Index and column names may differ between the two
# Of course, df1 and df2 must have the same number of rows
print(df1)
          df1col0   df1col1
df1idx0  0.024177  0.665551
df1idx1  0.658245  0.551047
df1idx2  0.273205  0.457382
df1idx3  0.379643  0.219442
df1idx4  0.148248  0.925876
df1idx5  0.384743  0.606885
df1idx6  0.191794  0.667464
df1idx7  0.413076  0.453384
df1idx8  0.135606  0.461234
df1idx9  0.211061  0.369848

print(df2)
          df2col0   df2col1   df2col2
df2idx0  0.887360  0.589831  0.472463
df2idx1  0.215978  0.236339  0.215376
df2idx2  0.134346  0.366870  0.866473
df2idx3  0.712904  0.679260  0.110819
df2idx4  0.810794  0.514622  0.359084
df2idx5  0.597531  0.080000  0.327408
df2idx6  0.753117  0.935979  0.943992
df2idx7  0.961404  0.585718  0.477759
df2idx8  0.599601  0.046453  0.908469
df2idx9  0.509900  0.457647  0.964165

# Convert each pd.DataFrame to a numpy.ndarray
# (transposed, because cdist compares rows and we want to compare columns)
ndf1=df1.T.values
ndf2=df2.T.values

# Compute Pearson's r with cdist
# cdist returns the correlation distance, i.e. (1 - correlation coefficient)
# so the correlation coefficient is (1 - cdist result)
res=(1 - cdist(ndf1, ndf2, metric='correlation'))   # the result is a numpy.ndarray

# Result
print(res)
[[-0.43182966 -0.23828196 -0.51695954]
 [ 0.25207449  0.06124553 -0.05151853]]

# (Optional) Convert the result to a pd.DataFrame
res=pd.DataFrame(res, index=df1.columns, columns=df2.columns)
print(res)
          df2col0   df2col1   df2col2
df1col0 -0.431830 -0.238282 -0.516960
df1col1  0.252074  0.061246 -0.051519
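
To check that 1 - cdist(...) really is the matrix of Pearson coefficients, the following sketch recomputes it from z-scores with plain NumPy. The arrays `a` and `b` are seeded random stand-ins for df1.values and df2.values, and the names `za`, `zb` and `res_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(0)           # fixed seed so the run is reproducible
a = np.random.rand(10, 2)   # stands in for df1.values: 10 rows, 2 columns
b = np.random.rand(10, 3)   # stands in for df2.values: 10 rows, 3 columns

# cdist compares rows, so pass the transposes to compare columns
res_cdist = 1 - cdist(a.T, b.T, metric='correlation')

# the same matrix from z-scores: standardize each column, then average the products
za = (a - a.mean(axis=0)) / a.std(axis=0)
zb = (b - b.mean(axis=0)) / b.std(axis=0)
res_manual = za.T @ zb / len(a)

print(res_cdist.shape)                      # (2, 3): columns of a by columns of b
print(np.allclose(res_cdist, res_manual))
```

Entry [i, j] of either matrix is the correlation between column i of the first array and column j of the second.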

Environment

Ubuntu 18.04
Python 3.7.2
pandas 0.24.1
numpy 1.15.4
scipy 1.1.0
