LoginSignup
3
6

More than 5 years have passed since last update.

線形回帰を本当はPythonで解きたいけど表計算で解けと言われたので

Last updated at Posted at 2018-10-29

用いるデータ

例として 分子性物質のデータ(融点・沸点) を使ってみましょう。分子量(molecular weight)から沸点(boiling point)を予測するという問題を解いてみたいと思います。

import pandas as pd
data = [['HF', 19.5, 20.0],
        ['HCl', -84.9, 36.5],
        ['HBr', -67.0, 80.9],
        ['HI', -35.1, 127.9],
        ['H2O', 100.0, 18.0],
        ['H2S', -60.7, 34.1],
        ['H2Se', -42, 81.0],
        ['H2Te', -1.8, 129.6],
        ['NH3', -33.4, 17.0],
        ['PH3', -87, 34.0],
        ['AsH3', -55, 77.9],
        ['SbH3', -17.1, 124.8],
        ['CH4', -161.49, 16.0],
        ['SiH4', -111.8, 32.1],
        ['GeH4', -90, 76.6],
        ['SnH4', -52, 122.7],
        ['He', -268.934, 4.0],
        ['Ne', -246.048, 20.2],
        ['Ar', -185.7, 39.9],
        ['Kr', -152.3, 83.8],
        ['Xe', -108.1, 131.3],
       ]
df = pd.DataFrame(data, columns = ['molecule', 'boiling point', 'molecular weight'])
df
molecule boiling point molecular weight
0 HF 19.500 20.0
1 HCl -84.900 36.5
2 HBr -67.000 80.9
3 HI -35.100 127.9
4 H2O 100.000 18.0
5 H2S -60.700 34.1
6 H2Se -42.000 81.0
7 H2Te -1.800 129.6
8 NH3 -33.400 17.0
9 PH3 -87.000 34.0
10 AsH3 -55.000 77.9
11 SbH3 -17.100 124.8
12 CH4 -161.490 16.0
13 SiH4 -111.800 32.1
14 GeH4 -90.000 76.6
15 SnH4 -52.000 122.7
16 He -268.934 4.0
17 Ne -246.048 20.2
18 Ar -185.700 39.9
19 Kr -152.300 83.8
20 Xe -108.100 131.3

回帰は y = f(x) という関係において関数 f を同定するという作業です。今回は分子量を x 沸点を y とします。そのときの f を求めてみましょう。

X = df.loc[:, ['molecular weight']].as_matrix()
X
array([[  20. ],
       [  36.5],
       [  80.9],
       [ 127.9],
       [  18. ],
       [  34.1],
       [  81. ],
       [ 129.6],
       [  17. ],
       [  34. ],
       [  77.9],
       [ 124.8],
       [  16. ],
       [  32.1],
       [  76.6],
       [ 122.7],
       [   4. ],
       [  20.2],
       [  39.9],
       [  83.8],
       [ 131.3]])
Y = df['boiling point'].as_matrix()
Y
array([  19.5  ,  -84.9  ,  -67.   ,  -35.1  ,  100.   ,  -60.7  ,
        -42.   ,   -1.8  ,  -33.4  ,  -87.   ,  -55.   ,  -17.1  ,
       -161.49 , -111.8  ,  -90.   ,  -52.   , -268.934, -246.048,
       -185.7  , -152.3  , -108.1  ])

まずは x と y の関係をプロットします。

%matplotlib inline
import matplotlib.pyplot as plt
# 散布図
plt.figure(figsize=(5,4))
plt.scatter(X, Y, alpha=0.3)
for name, x, y in zip(df.loc[:, ['molecule']].as_matrix(), X, Y):
    plt.text(x, y, name[0], size=8)
plt.xlabel('molecular weight')
plt.ylabel('boiling point')
plt.grid()
plt.show()

線形回帰を本当はPythonで解きたいけど表計算で解けと言われたので_4_0.png

これを上手に回帰する直線を求めるという問題です。

まずは一番便利な scikit-learn から

本当は scikit-learn を使うのが一番便利です。まずは線形回帰を実行するための学習器を定義します。

from sklearn import linear_model
lr = linear_model.LinearRegression()

学習(フィッティング)を行います。

lr.fit(X, Y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

傾きと切片が決まれば直線が得られますが、回帰係数というのが傾きにあたります。

# 回帰係数
lr.coef_
array([ 0.55956295])

そして切片

# 切片
lr.intercept_
-117.75943870102266

つまり、得られた回帰直線は次のようになります。

print("y = f(x) = wx + t; (w, t) = ({0}, {1})".format(lr.coef_[0], lr.intercept_))
y = f(x) = wx + t; (w, t) = (0.5595629540025041, -117.75943870102266)

決定係数は、説明変数 x が従属変数 y のどれくらいを説明できるかを表す値で、1に近いほど相対的な残差が少ないことを表します。回帰方程式のあてはまりの良さの尺度として利用されます。

# 決定係数R2
lr.score(X, Y)
0.083634989644543967

さて、この数字は驚くほど低いわけですが、図示してみましょう。

%matplotlib inline
import matplotlib.pyplot as plt
# 散布図
plt.figure(figsize=(5,4))
plt.scatter(X, Y, alpha=0.3)

# 回帰直線
plt.plot(X, lr.predict(X))
for name, x, y in zip(df.loc[:, ['molecule']].as_matrix(), X, Y):
    plt.text(x, y, name[0], size=8)
plt.xlabel('molecular weight')
plt.ylabel('boiling point')
plt.grid()
plt.show()

線形回帰を本当はPythonで解きたいけど表計算で解けと言われたので_12_0.png

散布図を描いたときから既に想像できたことと思いますが、分子量と沸点の関係を直線で表すのは無理ゲーだということがよく分かります。

それはそれとして、この得られた回帰直線を使って、様々な分子量の化合物の沸点を予測してみましょう。

lr.predict(200)
array([-5.8468479])
lr.predict([[150], [100]])
array([-33.8249956, -61.8031433])

非常に疑わしいモデルですが、いちおう直線回帰モデルが完成しました。

次は、ガチPythonで。

さて、scikit-learn という便利なツールに頼ってると先生に「お前は本当に線形回帰の計算手法が分かっとんのかゴルァ」と言われてしまいますので、scikit-learn に頼らずに実装してみましょう。

# 平均値を求める関数
def mean(list):
    sum = 0
    for x in list:
        sum += x
    return sum / len(list)
# 分散を求める関数
def variance(list):
    ave = mean(list)
    sum = 0
    for x in list:
        sum += (x - ave) ** 2
    return sum / len(list)
# 標準偏差を求める関数
import math
def standard_deviation(list):
    return math.sqrt(variance(list))
# 共分散 = 偏差積の平均
def covariance(list1, list2): 
    mean1 = mean(list1)
    mean2 = mean(list2)
    sum = 0
    for d1, d2 in zip(list1, list2):
        sum += (d1 - mean1) * (d2 - mean2)
    return sum / len(list1)
# 相関係数 = 共分散を list1, list2 の標準偏差で割ったもの
def correlation(list1, list2):
    return covariance(list1, list2) / (standard_deviation(list1) * standard_deviation(list2))
# 回帰直線の傾き=相関係数*((yの標準偏差)/(xの標準偏差))
def w_fit(xlist, ylist):
    return correlation(xlist, ylist) * standard_deviation(ylist) / standard_deviation(xlist)
# y切片=yの平均-(傾き*xの平均)
def t_fit(xlist, ylist):
    return mean(ylist) - w_fit(xlist, ylist) * mean(xlist)
# 回帰直線の式を表示
w = w_fit(X, Y)
t = t_fit(X, Y)
print("y = f(x) = wx + t; (w, t) = ({0}, {1})".format(w, t))
y = f(x) = wx + t; (w, t) = ([ 0.55956295], [-117.7594387])
# 回帰直線の式を関数として表現
def f(x):
    return w * x + t
f(200)
array([-5.8468479])
f([[150], [100]])
array([[-33.8249956],
       [-61.8031433]])
# 決定係数R2
def r2(xlist, ylist):
    wa1 = 0.
    wa2 = 0.
    for x, y in zip(xlist, ylist):
        wa1 += (y - f(x))**2
        wa2 += (y - mean(ylist))**2
    return 1. - wa1 / wa2
r2(X, Y)
array([ 0.08363499])

さて、表計算で解けと言われたので pandas で書いてみましょうか。

先生はそれでも許してくれません。「計算の途中過程を一個一個確認できるように表計算を使って解けやゴルァ」とおっしゃいますので、表計算といえば pandas ですよね先生!?

import copy
from IPython.display import display
excel = copy.deepcopy(df)
excel
molecule boiling point molecular weight
0 HF 19.500 20.0
1 HCl -84.900 36.5
2 HBr -67.000 80.9
3 HI -35.100 127.9
4 H2O 100.000 18.0
5 H2S -60.700 34.1
6 H2Se -42.000 81.0
7 H2Te -1.800 129.6
8 NH3 -33.400 17.0
9 PH3 -87.000 34.0
10 AsH3 -55.000 77.9
11 SbH3 -17.100 124.8
12 CH4 -161.490 16.0
13 SiH4 -111.800 32.1
14 GeH4 -90.000 76.6
15 SnH4 -52.000 122.7
16 He -268.934 4.0
17 Ne -246.048 20.2
18 Ar -185.700 39.9
19 Kr -152.300 83.8
20 Xe -108.100 131.3
excel['y'] = excel['boiling point']
excel['x'] = excel['molecular weight']
mean_y = mean(excel['y'])
mean_x = mean(excel['x'])
display(excel, pd.DataFrame([[mean_y, mean_x]], columns=['y','x'], index=['mean']))
molecule boiling point molecular weight y x
0 HF 19.500 20.0 19.500 20.0
1 HCl -84.900 36.5 -84.900 36.5
2 HBr -67.000 80.9 -67.000 80.9
3 HI -35.100 127.9 -35.100 127.9
4 H2O 100.000 18.0 100.000 18.0
5 H2S -60.700 34.1 -60.700 34.1
6 H2Se -42.000 81.0 -42.000 81.0
7 H2Te -1.800 129.6 -1.800 129.6
8 NH3 -33.400 17.0 -33.400 17.0
9 PH3 -87.000 34.0 -87.000 34.0
10 AsH3 -55.000 77.9 -55.000 77.9
11 SbH3 -17.100 124.8 -17.100 124.8
12 CH4 -161.490 16.0 -161.490 16.0
13 SiH4 -111.800 32.1 -111.800 32.1
14 GeH4 -90.000 76.6 -90.000 76.6
15 SnH4 -52.000 122.7 -52.000 122.7
16 He -268.934 4.0 -268.934 4.0
17 Ne -246.048 20.2 -246.048 20.2
18 Ar -185.700 39.9 -185.700 39.9
19 Kr -152.300 83.8 -152.300 83.8
20 Xe -108.100 131.3 -108.100 131.3
y x
mean -82.898667 62.3
excel['y-mean(y)'] = [y - mean_y for y in excel['y']]
excel['x-mean(x)'] = [x - mean_x for x in excel['x']]
display(excel, pd.DataFrame([[mean_y, mean_x]], columns=['y','x'], index=['mean']))
molecule boiling point molecular weight y x y-mean(y) x-mean(x)
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0
y x
mean -82.898667 62.3
excel['(y-mean(y))**2'] = [sa ** 2 for sa in excel['y-mean(y)']]
excel['(x-mean(x))**2'] = [sa ** 2 for sa in excel['x-mean(x)']]
display(excel, pd.DataFrame([[mean_y, mean_x]], columns=['y','x'], index=['mean']))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00
y x
mean -82.898667 62.3
variance_y = mean(excel['(y-mean(y))**2'])
variance_x = mean(excel['(x-mean(x))**2'])
sd_y = math.sqrt(variance_y)
sd_x = math.sqrt(variance_x)
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
excel['(y-mean(y)) * (x-mean(x))'] = excel['y-mean(y)'] * excel['x-mean(x)']
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2 (y-mean(y)) * (x-mean(x))
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29 -4331.463600
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64 51.634400
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96 295.715200
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36 3135.592533
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49 -8102.410933
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24 -626.002400
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69 764.805067
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29 5457.940267
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09 -2242.289600
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89 116.067733
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36 435.219200
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25 4112.416667
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69 3638.778733
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04 872.820267
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49 -101.549067
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16 1866.279467
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89 10845.859933
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41 6868.586933
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76 2302.749867
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25 -1492.128667
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00 -1738.892000
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
covar_xy = mean(excel['(y-mean(y)) * (x-mean(x))'])
corr_xy = covar_xy / (sd_x * sd_y)
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']),
       pd.DataFrame([covar_xy, corr_xy], index=['covariance', 'correlation'], columns=['x,y']))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2 (y-mean(y)) * (x-mean(x))
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29 -4331.463600
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64 51.634400
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96 295.715200
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36 3135.592533
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49 -8102.410933
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24 -626.002400
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69 764.805067
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29 5457.940267
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09 -2242.289600
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89 116.067733
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36 435.219200
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25 4112.416667
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69 3638.778733
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04 872.820267
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49 -101.549067
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16 1866.279467
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89 10845.859933
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41 6868.586933
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76 2302.749867
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25 -1492.128667
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00 -1738.892000
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
x,y
covariance 1053.796667
correlation 0.289197
w = corr_xy * sd_y / sd_x
t = mean_y - w * mean_x
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']),
        pd.DataFrame([covar_xy, corr_xy], index=['covariance', 'correlation'], columns=['x,y']),
        pd.DataFrame([[w, t]], columns=["w", "t"], index=["y = f(x) = wx + t"]))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2 (y-mean(y)) * (x-mean(x))
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29 -4331.463600
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64 51.634400
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96 295.715200
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36 3135.592533
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49 -8102.410933
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24 -626.002400
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69 764.805067
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29 5457.940267
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09 -2242.289600
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89 116.067733
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36 435.219200
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25 4112.416667
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69 3638.778733
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04 872.820267
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49 -101.549067
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16 1866.279467
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89 10845.859933
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41 6868.586933
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76 2302.749867
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25 -1492.128667
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00 -1738.892000
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
x,y
covariance 1053.796667
correlation 0.289197
w t
y = f(x) = wx + t 0.559563 -117.759439
# 回帰直線の式を関数として表現
def f(x):
    return w * x + t
excel['f(x)'] = f(excel['x'])
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']),
        pd.DataFrame([covar_xy, corr_xy], index=['covariance', 'correlation'], columns=['x,y']),
        pd.DataFrame([[w, t]], columns=["w", "t"], index=["y = f(x) = wx + t"]))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2 (y-mean(y)) * (x-mean(x)) f(x)
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29 -4331.463600 -106.568180
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64 51.634400 -97.335391
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96 295.715200 -72.490796
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36 3135.592533 -46.191337
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49 -8102.410933 -107.687306
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24 -626.002400 -98.678342
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69 764.805067 -72.434839
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29 5457.940267 -45.240080
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09 -2242.289600 -108.246868
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89 116.067733 -98.734298
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36 435.219200 -74.169485
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25 4112.416667 -47.925982
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69 3638.778733 -108.806431
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04 872.820267 -99.797468
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49 -101.549067 -74.896916
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16 1866.279467 -49.101064
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89 10845.859933 -115.521187
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41 6868.586933 -106.456267
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76 2302.749867 -95.432877
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25 -1492.128667 -70.868063
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00 -1738.892000 -44.288823
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
x,y
covariance 1053.796667
correlation 0.289197
w t
y = f(x) = wx + t 0.559563 -117.759439
excel['(y-f(x))**2'] = (excel['y'] - excel['f(x)'])**2
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']),
        pd.DataFrame([covar_xy, corr_xy], index=['covariance', 'correlation'], columns=['x,y']),
        pd.DataFrame([[w, t]], columns=["w", "t"], index=["y = f(x) = wx + t"]))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2 (y-mean(y)) * (x-mean(x)) f(x) (y-f(x))**2
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29 -4331.463600 -106.568180 15893.185913
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64 51.634400 -97.335391 154.638946
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96 295.715200 -72.490796 30.148838
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36 3135.592533 -46.191337 123.017754
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49 -8102.410933 -107.687306 43134.016878
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24 -626.002400 -98.678342 1442.354459
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69 764.805067 -72.434839 926.279451
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29 5457.940267 -45.240080 1887.040538
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09 -2242.289600 -108.246868 5602.053722
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89 116.067733 -98.734298 137.693756
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36 435.219200 -74.169485 367.469139
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25 4112.416667 -47.925982 950.241169
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69 3638.778733 -108.806431 2775.558397
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04 872.820267 -99.797468 144.060777
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49 -101.549067 -74.896916 228.103133
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16 1866.279467 -49.101064 8.403829
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89 10845.859933 -115.521187 23535.491228
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41 6868.586933 -106.456267 19485.851914
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76 2302.749867 -95.432877 8148.153524
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25 -1492.128667 -70.868063 6631.160338
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00 -1738.892000 -44.288823 4071.866330
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
x,y
covariance 1053.796667
correlation 0.289197
w t
y = f(x) = wx + t 0.559563 -117.759439
r2 = 1. - sum(excel['(y-f(x))**2']) / sum(excel['(y-mean(y))**2'])
display(excel, pd.DataFrame([[mean_y, mean_x], [variance_y, variance_x], [sd_y, sd_x]], 
                            columns=['y','x'], index=['mean', 'variance', 'sd']),
        pd.DataFrame([covar_xy, corr_xy], index=['covariance', 'correlation'], columns=['x,y']),
        pd.DataFrame([[w, t, r2]], columns=["w", "t", "R2"], index=["y = f(x) = wx + t"]))
molecule boiling point molecular weight y x y-mean(y) x-mean(x) (y-mean(y))**2 (x-mean(x))**2 (y-mean(y)) * (x-mean(x)) f(x) (y-f(x))**2
0 HF 19.500 20.0 19.500 20.0 102.398667 -42.3 10485.486935 1789.29 -4331.463600 -106.568180 15893.185913
1 HCl -84.900 36.5 -84.900 36.5 -2.001333 -25.8 4.005335 665.64 51.634400 -97.335391 154.638946
2 HBr -67.000 80.9 -67.000 80.9 15.898667 18.6 252.767602 345.96 295.715200 -72.490796 30.148838
3 HI -35.100 127.9 -35.100 127.9 47.798667 65.6 2284.712535 4303.36 3135.592533 -46.191337 123.017754
4 H2O 100.000 18.0 100.000 18.0 182.898667 -44.3 33451.922268 1962.49 -8102.410933 -107.687306 43134.016878
5 H2S -60.700 34.1 -60.700 34.1 22.198667 -28.2 492.780802 795.24 -626.002400 -98.678342 1442.354459
6 H2Se -42.000 81.0 -42.000 81.0 40.898667 18.7 1672.700935 349.69 764.805067 -72.434839 926.279451
7 H2Te -1.800 129.6 -1.800 129.6 81.098667 67.3 6576.993735 4529.29 5457.940267 -45.240080 1887.040538
8 NH3 -33.400 17.0 -33.400 17.0 49.498667 -45.3 2450.118002 2052.09 -2242.289600 -108.246868 5602.053722
9 PH3 -87.000 34.0 -87.000 34.0 -4.101333 -28.3 16.820935 800.89 116.067733 -98.734298 137.693756
10 AsH3 -55.000 77.9 -55.000 77.9 27.898667 15.6 778.335602 243.36 435.219200 -74.169485 367.469139
11 SbH3 -17.100 124.8 -17.100 124.8 65.798667 62.5 4329.464535 3906.25 4112.416667 -47.925982 950.241169
12 CH4 -161.490 16.0 -161.490 16.0 -78.591333 -46.3 6176.597675 2143.69 3638.778733 -108.806431 2775.558397
13 SiH4 -111.800 32.1 -111.800 32.1 -28.901333 -30.2 835.287068 912.04 872.820267 -99.797468 144.060777
14 GeH4 -90.000 76.6 -90.000 76.6 -7.101333 14.3 50.428935 204.49 -101.549067 -74.896916 228.103133
15 SnH4 -52.000 122.7 -52.000 122.7 30.898667 60.4 954.727602 3648.16 1866.279467 -49.101064 8.403829
16 He -268.934 4.0 -268.934 4.0 -186.035333 -58.3 34609.145248 3398.89 10845.859933 -115.521187 23535.491228
17 Ne -246.048 20.2 -246.048 20.2 -163.149333 -42.1 26617.704967 1772.41 6868.586933 -106.456267 19485.851914
18 Ar -185.700 39.9 -185.700 39.9 -102.801333 -22.4 10568.114135 501.76 2302.749867 -95.432877 8148.153524
19 Kr -152.300 83.8 -152.300 83.8 -69.401333 21.5 4816.545068 462.25 -1492.128667 -70.868063 6631.160338
20 Xe -108.100 131.3 -108.100 131.3 -25.201333 69.0 635.107202 4761.00 -1738.892000 -44.288823 4071.866330
y x
mean -82.898667 62.300000
variance 7050.465101 1883.249524
sd 83.967048 43.396423
x,y
covariance 1053.796667
correlation 0.289197
w t R2
y = f(x) = wx + t 0.559563 -117.759439 0.083635

先生!できました!(ドヤ)

続編として線形重回帰を本当はPythonで解きたいけど表計算で解けと言われたのでをお楽しみください。

3
6
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
3
6