Help us understand the problem. What is going on with this article?

Pystanで単回帰分析

More than 1 year has passed since last update.

「StanとRでベイズ統計モデリング」Chapter07より

非線形の関係

二次曲線の例

ある家庭における夏から冬にかけての一日あたりのエアコン消費電力 Y kWh と屋外での平均気温 X (C) の記録

d = pd.read_csv("input/data-aircon.txt")

sns.set_style("whitegrid")
sns.jointplot(x="X",y="Y",data=d)

fig_7_1_left.png

モデル式

$ Y[n] \sim Normal(a+b(X[n] - x_0)^2, \sigma_Y)$   $ n=1,...,N $

エアコンの電力消費が少なくなる $20\,^{\circ}\mathrm{C}$ 前後にこの家庭の快適と感じる温度 $ x_0 $ があって消費電力Yは $ (X-x_0)^2 $ に従って増えると仮定

stanmodel = StanModel(file="model/model7-3.stan")

Area_new = np.linspace(-3,32,60)
data_ = {'N':len(d),'X':d['X'],'Y':d['Y'],'N_new':60,'X_new':Area_new}
fit1 = stanmodel.sampling(data=data_,n_jobs=-1,seed=1234)

col =  np.linspace(-3,32,60)
print (len(col))
df2 = pd.DataFrame(fit1['y_new'])
df2.columns = col

qua = [0.1, 0.25, 0.50, 0.75, 0.9]
d_est = pd.DataFrame()

for i in np.arange(len(df2.columns)):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.1].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.9].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.2)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["X"],d["Y"],c='k',alpha=0.3)
plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_5')

fig_7_5_R.png

時系列データの例

患者に点滴で薬剤を投入する場合に、投与からの経過時間 Time(hour) と薬の血中濃度 Y(mg/mL) のデータ

d = pd.read_csv("input/data-conc.txt")

x = d['Time'].values
y = d['Y'].values
plt.plot(x,y,marker='o')
plt.xlim(0,25)
plt.ylim(0,15)
plt.xlabel('Time(hour)')
plt.ylabel('Y')

fig_7_6_left.png

モデル式

$ Y[n] \sim Normal(1-exp(-b Time[t])),\sigma_Y) $      $ t=1,,,,T $
背景知識から曲線パラメータに制限を課している
$ real<lower=0, upper=100> a;$
$real<lower=0,upper=5> b; $

stanmodel = StanModel(file="model/model7-4.stan")

Time_new = np.linspace(0,24,60)
data_ = {'T':len(d),'Time':d['Time'],'Y':d['Y'],'T_new':60,'Time_new':Time_new}
fit1 = stanmodel.sampling(data=data_,n_jobs=-1,seed=1234)
ms1 = fit1.extract()

col =  np.linspace(0,24,60)
print (len(col))
df2 = pd.DataFrame(fit1['y_new'])
df2.columns = col

qua = [0.1, 0.25, 0.50, 0.75, 0.9]
d_est = pd.DataFrame()

for i in np.arange(len(df2.columns)):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.1].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.9].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.2)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["Time"],d["Y"],c='k',alpha=0.3)
plt.xlabel("Time(hour)")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_5')

fig_7_5_L.png

時系列データの非線形関数の応用例

Model 1 : 指数関数 振り子の振動幅は空気抵抗などによって指数関数で減少する
Model 2 : S字型の増え方をする曲線
Model 3 : 初めに増加して、ある時から減少する曲線

y1 = lambda x: 2*np.exp(-x)
y2 = lambda x: 1.8/(1+50*np.exp(-2*x))
y3 = lambda x: 8*(np.exp(-x) - np.exp(-2*x))

x = np.linspace(0,5,100)
plt.plot(x,y1(x),'k-',label='1')
plt.plot(x,y2(x),'k--',label='2')
plt.plot(x,y3(x),'k.',label='3')
plt.legend()
plt.xlabel("Time")
plt.ylabel("Y")
plt.title('Fig7_7')

fig_7_7_.png

多重共線性

multicollinearity (マルチコ):重回帰分析において、説明変数間の相関が非常に高い(0.8-0.95) ために、回帰係数が収束しない

説明変数の背景知識から相関が明らかな場合は片方を捨てるのが原則

交絡

confounding モデルの外側に応答変数と説明変数の両方に影響を与える変数が存在する

架空の小学生の50m走のデータ、平均秒速 $ Y(m/sec) $ 、説明変数は体重 Weight(kg)

d = pd.read_csv("input/data-50m.txt")
x = d['Weight'].values
y = d['Y'].values
c = d['Age']

plt.scatter(x,y,c=c,edgecolors='k')
plt.xlabel("Weight")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_8')

fig_7_8_left.png

sns.pairplot(d, hue='Age',
             vars=['Weight', 'Y'])

fig_7_8_right.png

モデル式

$ \mu_{weight}[n] = c_1 + c_2 Age[n] $
$ Weight[n] \sim Normal(\mu_{weight}[n],\sigma_W) $
$ \mu_Y [n] = b_1 + b_2 Age[n] + b_3 Weight[n] $
$ Y[n] \sim Normal (\mu_Y [n], \sigma_Y ) $

変数間の因果関係を模索する解析をパス解析 (path analysis) と呼ぶ

stanmodel = StanModel(file="model/model7-5.stan")
data_ = {'N':len(d),'Weight':d['Weight'],'Y':d['Y'],'Age':d['Age']}
fit1 = stanmodel.sampling(data=data_,n_jobs=-1,seed=1234)
ms1 = fit1.extract()

説明変数が多すぎる

説明変数にノイズを含む

d = pd.read_csv("input/data-salary.txt")

モデル式

$ X[n] \sim Normal(x_{true}[n],2.5 $
$ Y[n] \sim Normal(a + b X_{true}[n],\sigma_Y) $

stanmodel = StanModel(file="model/model7-6.stan")

data = {"N":len(d),"X":d["X"],"Y":d["Y"]}
fit = stanmodel.sampling(data=data,n_jobs=-1,seed=1234)

col = d['X'].values
print (len(col))
df2 = pd.DataFrame(fit['x_true'])
df2.columns = col

qua = [0.1, 0.25, 0.50, 0.75, 0.9]
d_est = pd.DataFrame()

for i in np.arange(len(df2.columns)):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.1].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.9].values

fig = plt.figure(figsize=matplotlib.figure.figaspect(1))

plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.1)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(x,x,c='b')
plt.xlabel("X")
plt.ylabel("X")
sns.set_style('whitegrid')
plt.title('Fig7_6')

fig_7_6_model.png

打ち切り

モデル式

打ち切りがない場合
$ Y[n] \sim Normal(\mu,\sigma_Y) $    $ n = 1,.., N_obs $

打ち切りがある場合
$ y[n] \sim Normal(\mu,\sigma_Y) $ ただし $ y[n]<L $    $ n = 1,..., N_cens $

打ち切りがある場合の測定値の尤度 $ y < L $ である確率: $ Prob[y<L] $
$ Prob[y<L] = \int_{-\infty}^L Normal(y|\mu,\sigma_Y) $
$ = \int_{-\infty}^L \frac{1}{\sqrt{2 \pi \sigma}} exp[-\frac{1}{2}(\frac{y - \mu}{\sigma})^2] d_y $
$ = \int_{-\infty}^{\frac{L-\mu}{\sigma}} \frac{1}{\sqrt{2 \pi}} exp[-\frac{1}{2}z^2] d_z = \int_{- \infty}^{\frac{L-\mu}{\sigma}} \phi(z)d_z $
$ = \Phi ( \frac{L - \mu}{\sigma}) $
$ \phi $ は標準正規分布で、$ \Phi $ はその累積分布関数 cumulative distribution function (CDF)

d = pd.read_csv("input/data-protein.txt")

stanmodel = StanModel(file="model/model7-7.stan")

Y_obs = d[~d['Y'].str.contains('<')].astype('float')
L_ = d[d['Y'].str.contains('<')]
L = L_['Y'].str.replace('<','').astype('float')[0]
data = {"N_obs":len(Y_obs),"N_cens":len(L_),"Y_obs":Y_obs['Y'],"L":L}
fit = stanmodel.sampling(data=data,n_jobs=-1,seed=123)

外れ値 : outlier

まれに大きな値を生成するメカニズムを仮定し外れ値を含めて解析する方法

モデル式

裾の長いコーシー分布や Student の t 分布を使う

$ Y[n] \sim Cauchy(a+bX[n],\sigma) $    $ n=1,2,...,N $

混合正規分布や Zero-Inflated Poisson 分布などを使う

d = pd.read_csv("input/data-outlier.txt")

plt.scatter(x=d['X'],y=d['Y'])
plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_11')

fig_7_11_.png

import hashlib

def StanModel_cache(model_code, model_name=None, **kwargs):
    """Use just as you would `stan`"""
    code_hash = hashlib.md5(model_code.encode('utf-8')).hexdigest()
    if model_name is None:
        cache_fn = path + 'stan_pkl/cached-model-{}.pkl'.format(code_hash)
    else:
        cache_fn = path + 'stan_pkl/cached-{}-{}.pkl'.format(model_name, code_hash)

    try:
        sm = pickle.load(open(cache_fn, 'rb'))
    except: 
        sm = StanModel(file=model_code)
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    else: 
        print("Using cached StanModel")
    return sm

model = StanModel_cache(model_code='model/model4-4.stan')

model_code='model/model4-4.stan'
code_hash = hashlib.md5(model_code.encode('utf-8')).hexdigest()

with open(path + 'stan_pkl/cached-model-{}.pkl'.format(code_hash), 'rb') as f:
    model = pickle.load(f) 

data = pd.read_csv("input/data-outlier.txt")

stan_data = data.to_dict('list')

X_new = np.arange(0, 11, 1)
stan_data.update({'N':len(data), 'X_new':X_new, 'N_new':len(X_new)})

fit_nuts = model.sampling(data = stan_data, seed=1234, warmup=1000, chains=4)

col = np.arange(11)
print (len(col))
df2 = pd.DataFrame(fit_nuts['y_new'])
df2.columns = col

qua = [0.025, 0.25, 0.50, 0.75, 0.975]
d_est = pd.DataFrame()

for i in np.arange(N_X):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.025].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.975].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.1)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["X"],d["Y"],c='b')

plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_12')

傾きが外れ値に引っ張られている

fig_7_12_right.png

stanmodel = StanModel(file="model/model7-9.stan")

data = {"N":len(d),"X":d["X"],"Y":d["Y"],"N_new":len(np.arange(11)),"X_new":np.arange(11)}
fit = stanmodel.sampling(data=data,n_jobs=-1,seed=1234)

qua = [0.025, 0.25, 0.50, 0.75, 0.975]
d_est = pd.DataFrame()

for i in np.arange(N_X):
    for qu in qua:
        d_est[qu] = df2.quantile(qu)

x = d_est.index
y1 = d_est[0.025].values
y2 = d_est[0.25].values
y3 = d_est[0.5].values
y4 = d_est[0.75].values
y5 = d_est[0.975].values


plt.fill_between(x,y1,y5,facecolor='blue',alpha=0.1)
plt.fill_between(x,y2,y4,facecolor='blue',alpha=0.5)
plt.plot(x,y3,'k-')
plt.scatter(d["X"],d["Y"],c='b')

plt.xlabel("X")
plt.ylabel("Y")
sns.set_style('whitegrid')
plt.title('Fig7_12')

ノイズの部分にコーシー分布を使ったモデル
傾きが外れ値に引っ張られていない
ただし、どちらかが正しいわけではない

fig_7_12_left.png

Why do not you register as a user and use Qiita more conveniently?
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away
Comments
Sign up for free and join this conversation.
If you already have a Qiita account
Why do not you register as a user and use Qiita more conveniently?
You need to log in to use this function. Qiita can be used more conveniently after logging in.
You seem to be reading articles frequently this month. Qiita can be used more conveniently after logging in.
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away