More than 5 years have passed since last update.

Pystanで単回帰分析

Python3

Posted at 2018-12-01

「StanとRでベイズ統計モデリング」Chapter07より

非線形の関係

二次曲線の例

ある家庭における夏から冬にかけての一日あたりのエアコン消費電力 Y kWh と屋外での平均気温 X (C) の記録

d = pd.read_csv("input/data-aircon.txt")

sns.set_style("whitegrid")
sns.jointplot(x="X",y="Y",data=d)

モデル式

$ Y[n] \sim Normal(a+b(X[n] - x_0)^2, \sigma_Y)$　　　$ n=1,...,N $

エアコンの電力消費が少なくなる $20,^{\circ}\mathrm{C}$ 前後にこの家庭の快適と感じる温度 $ x_0 $ があって消費電力Yは $ (X-x_0)^2 $ に従って増えると仮定

stanmodel = StanModel(file="model/model7-3.stan")

Area_new = np.linspace(-3,32,60)
data_ = {'N':len(d),'X':d['X'],'Y':d['Y'],'N_new':60,'X_new':Area_new}
fit1 = stanmodel.sampling(data=data_,n_jobs=-1,seed=1234)

col =  np.linspace(-3,32,60)
print (len(col))
df2 = pd.DataFrame(fit1['y_new'])
df2.columns = col

qua = [0.1, 0.25, 0.50, 0.75, 0.9]
d_est = pd.DataFrame()

for i in np.arange(len(df2.columns)):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.1].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.9].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.2)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["X"],d["Y"],c='k',alpha=0.3)
plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_5')

時系列データの例

患者に点滴で薬剤を投入する場合に、投与からの経過時間 Time(hour) と薬の血中濃度 Y(mg/mL) のデータ

d = pd.read_csv("input/data-conc.txt")

x = d['Time'].values
y = d['Y'].values
plt.plot(x,y,marker='o')
plt.xlim(0,25)
plt.ylim(0,15)
plt.xlabel('Time(hour)')
plt.ylabel('Y')

モデル式

$ Y[n] \sim Normal(1-exp(-b Time[t])),\sigma_Y) $　　　　　 $ t=1,,,,T $
背景知識から曲線パラメータに制限を課している
$ real<lower=0, upper=100> a;$
$real<lower=0,upper=5> b; $

stanmodel = StanModel(file="model/model7-4.stan")

Time_new = np.linspace(0,24,60)
data_ = {'T':len(d),'Time':d['Time'],'Y':d['Y'],'T_new':60,'Time_new':Time_new}
fit1 = stanmodel.sampling(data=data_,n_jobs=-1,seed=1234)
ms1 = fit1.extract()

col =  np.linspace(0,24,60)
print (len(col))
df2 = pd.DataFrame(fit1['y_new'])
df2.columns = col

qua = [0.1, 0.25, 0.50, 0.75, 0.9]
d_est = pd.DataFrame()

for i in np.arange(len(df2.columns)):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.1].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.9].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.2)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["Time"],d["Y"],c='k',alpha=0.3)
plt.xlabel("Time(hour)")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_5')

時系列データの非線形関数の応用例

Model 1 : 指数関数　振り子の振動幅は空気抵抗などによって指数関数で減少する
Model 2 : S字型の増え方をする曲線
Model 3 : 初めに増加して、ある時から減少する曲線

y1 = lambda x: 2*np.exp(-x)
y2 = lambda x: 1.8/(1+50*np.exp(-2*x))
y3 = lambda x: 8*(np.exp(-x) - np.exp(-2*x))

x = np.linspace(0,5,100)
plt.plot(x,y1(x),'k-',label='1')
plt.plot(x,y2(x),'k--',label='2')
plt.plot(x,y3(x),'k.',label='3')
plt.legend()
plt.xlabel("Time")
plt.ylabel("Y")
plt.title('Fig7_7')

多重共線性

multicollinearity （マルチコ）：重回帰分析において、説明変数間の相関が非常に高い（0.8-0.95) ために、回帰係数が収束しない

説明変数の背景知識から相関が明らかな場合は片方を捨てるのが原則

交絡

confounding モデルの外側に応答変数と説明変数の両方に影響を与える変数が存在する

架空の小学生の50m走のデータ、平均秒速 $ Y(m/sec) $ 、説明変数は体重 Weight(kg)

d = pd.read_csv("input/data-50m.txt")
x = d['Weight'].values
y = d['Y'].values
c = d['Age']

plt.scatter(x,y,c=c,edgecolors='k')
plt.xlabel("Weight")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_8')

sns.pairplot(d, hue='Age',
             vars=['Weight', 'Y'])

モデル式

$ \mu_{weight}[n] = c_1 + c_2 Age[n] $
$ Weight[n] \sim Normal(\mu_{weight}[n],\sigma_W) $
$ \mu_Y [n] = b_1 + b_2 Age[n] + b_3 Weight[n] $
$ Y[n] \sim Normal (\mu_Y [n], \sigma_Y ) $

変数間の因果関係を模索する解析をパス解析 (path analysis) と呼ぶ

stanmodel = StanModel(file="model/model7-5.stan")
data_ = {'N':len(d),'Weight':d['Weight'],'Y':d['Y'],'Age':d['Age']}
fit1 = stanmodel.sampling(data=data_,n_jobs=-1,seed=1234)
ms1 = fit1.extract()

説明変数が多すぎる

説明変数にノイズを含む

d = pd.read_csv("input/data-salary.txt")

モデル式

$ X[n] \sim Normal(x_{true}[n],2.5 $
$ Y[n] \sim Normal(a + b X_{true}[n],\sigma_Y) $

stanmodel = StanModel(file="model/model7-6.stan")

data = {"N":len(d),"X":d["X"],"Y":d["Y"]}
fit = stanmodel.sampling(data=data,n_jobs=-1,seed=1234)

col = d['X'].values
print (len(col))
df2 = pd.DataFrame(fit['x_true'])
df2.columns = col

qua = [0.1, 0.25, 0.50, 0.75, 0.9]
d_est = pd.DataFrame()

for i in np.arange(len(df2.columns)):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.1].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.9].values

fig = plt.figure(figsize=matplotlib.figure.figaspect(1))

plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.1)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(x,x,c='b')
plt.xlabel("X")
plt.ylabel("X")
sns.set_style('whitegrid')
plt.title('Fig7_6')

打ち切り

モデル式

打ち切りがない場合
$ Y[n] \sim Normal(\mu,\sigma_Y) $ 　　　$ n = 1,.., N_obs $

打ち切りがある場合
$ y[n] \sim Normal(\mu,\sigma_Y) $ ただし $ y[n]<L $ 　　　$ n = 1,..., N_cens $

打ち切りがある場合の測定値の尤度 $ y < L $ である確率： $ Prob[y<L] $
$ Prob[y<L] = \int_{-\infty}^L Normal(y|\mu,\sigma_Y) $
$ = \int_{-\infty}^L \frac{1}{\sqrt{2 \pi \sigma}} exp[-\frac{1}{2}(\frac{y - \mu}{\sigma})^2] d_y $
$ = \int_{-\infty}^{\frac{L-\mu}{\sigma}} \frac{1}{\sqrt{2 \pi}} exp[-\frac{1}{2}z^2] d_z = \int_{- \infty}^{\frac{L-\mu}{\sigma}} \phi(z)d_z $
$ = \Phi ( \frac{L - \mu}{\sigma}) $
$ \phi $ は標準正規分布で、$ \Phi $ はその累積分布関数 cumulative distribution function (CDF)

d = pd.read_csv("input/data-protein.txt")

stanmodel = StanModel(file="model/model7-7.stan")

Y_obs = d[~d['Y'].str.contains('<')].astype('float')
L_ = d[d['Y'].str.contains('<')]
L = L_['Y'].str.replace('<','').astype('float')[0]
data = {"N_obs":len(Y_obs),"N_cens":len(L_),"Y_obs":Y_obs['Y'],"L":L}
fit = stanmodel.sampling(data=data,n_jobs=-1,seed=123)

外れ値 : outlier

まれに大きな値を生成するメカニズムを仮定し外れ値を含めて解析する方法

モデル式

裾の長いコーシー分布や Student の t 分布を使う

$ Y[n] \sim Cauchy(a+bX[n],\sigma) $ 　　　$ n=1,2,...,N $

混合正規分布や Zero-Inflated Poisson 分布などを使う

d = pd.read_csv("input/data-outlier.txt")

plt.scatter(x=d['X'],y=d['Y'])
plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_11')


import hashlib

def StanModel_cache(model_code, model_name=None, **kwargs):
    """Use just as you would `stan`"""
    code_hash = hashlib.md5(model_code.encode('utf-8')).hexdigest()
    if model_name is None:
        cache_fn = path + 'stan_pkl/cached-model-{}.pkl'.format(code_hash)
    else:
        cache_fn = path + 'stan_pkl/cached-{}-{}.pkl'.format(model_name, code_hash)

    try:
        sm = pickle.load(open(cache_fn, 'rb'))
    except: 
        sm = StanModel(file=model_code)
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    else: 
        print("Using cached StanModel")
    return sm

model = StanModel_cache(model_code='model/model4-4.stan')

model_code='model/model4-4.stan'
code_hash = hashlib.md5(model_code.encode('utf-8')).hexdigest()

with open(path + 'stan_pkl/cached-model-{}.pkl'.format(code_hash), 'rb') as f:
    model = pickle.load(f) 
    
data = pd.read_csv("input/data-outlier.txt")

stan_data = data.to_dict('list')

X_new = np.arange(0, 11, 1)
stan_data.update({'N':len(data), 'X_new':X_new, 'N_new':len(X_new)})

fit_nuts = model.sampling(data = stan_data, seed=1234, warmup=1000, chains=4)

col = np.arange(11)
print (len(col))
df2 = pd.DataFrame(fit_nuts['y_new'])
df2.columns = col

qua = [0.025, 0.25, 0.50, 0.75, 0.975]
d_est = pd.DataFrame()

for i in np.arange(N_X):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)
        
x = d_est.index
y1 = d_est[0.025].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.975].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.1)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["X"],d["Y"],c='b')

plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_12')

傾きが外れ値に引っ張られている


stanmodel = StanModel(file="model/model7-9.stan")

data = {"N":len(d),"X":d["X"],"Y":d["Y"],"N_new":len(np.arange(11)),"X_new":np.arange(11)}
fit = stanmodel.sampling(data=data,n_jobs=-1,seed=1234)

qua = [0.025, 0.25, 0.50, 0.75, 0.975]
d_est = pd.DataFrame()

for i in np.arange(N_X):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)
        
x = d_est.index
y1 = d_est[0.025].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.975].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.1)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["X"],d["Y"],c='b')

plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_12')

ノイズの部分にコーシー分布を使ったモデル
傾きが外れ値に引っ張られていない
ただし、どちらかが正しいわけではない

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up