More than 5 years have passed since last update.

Juliaで学ぶ統計学入門（第3章:2次元のデータ）

Last updated at 2014-09-12Posted at 2014-09-07

はじめに

Facebookグループ『東大の統計学教科書を読み進める会』で、
統計学入門（通称：赤本）の読み進めを行っています。
赤本の統計処理をJuliaでどう実装するかが気になったので、コードを書き連ねています。

統計学入門（通称：赤本）とは？

「東京大学教養学部統計学教室編」の統計学入門書です。古典です。
参考：統計学入門

Juliaとは？

科学計算向けのプログラミング言語です。ポストR・ベターMATLABと呼ばれています。
参考：Juliaの公式サイト

調査対象の章

第3章「2次元のデータ」

コード

sec03.jl

blood_mtrx = [35 114; 45 124; 55 143; 65 158; 75 166;]

# [散布図 scattergram]
Pkg.add("Gadfly")
using Gadfly
plot(x=blood_mtrx[:, 1], y=blood_mtrx[:, 2], Guide.XLabel("年齢"), Guide.YLabel("血圧"), Geom.point)

# [分割表 contigency table]
## カウント
count(x -> 150 < x, blood_mtrx[:, 2]) 
## 合計
sum(blood_mtrx, 1) #縦計
sum(blood_mtrx, 2) #横計

# [共分散 covariance]
cov(blood_mtrx[:, 1], blood_mtrx[:, 2], corrected = false)
  #sum([(blood_mtrx[i, 1] - mean(blood_mtrx[:, 1])) * (blood_mtrx[i, 2] - mean(blood_mtrx[:, 2])) for i = 1:size(blood_mtrx, 1)]) / size(blood_mtrx,1)

# [ピアソンの積率相関係数 product-moment correlation coefficient]
cor(blood_mtrx[:, 1], blood_mtrx[:, 2])
  #sum([(blood_mtrx[i, 1] - mean(blood_mtrx[:, 1])) * (blood_mtrx[i, 2] - mean(blood_mtrx[:, 2])) for i = 1:size(blood_mtrx, 1)]) / size(blood_mtrx,1) / (sqrt(sum([(x - mean(blood_mtrx[:, 1]))  ^ 2 for x = blood_mtrx[:, 1]]) / size(blood_mtrx,1))  * sqrt(sum([(y - mean(blood_mtrx[:, 2]))  ^ 2 for y = blood_mtrx[:, 2]]) / size(blood_mtrx,1)))

# [偏相関係数 partial correlation coefficient]
Pkg.add("RDatasets")
using RDatasets
iris = dataset("datasets", "iris")
pcor(x, y, z) = (cor(x, y) - (cor(x, z) * cor(y, z))) / (sqrt(1 - (cor(x, z) ^ 2)) * sqrt(1 - (cor(y, z) ^ 2)))
pcor(iris[:, :SepalLength], iris[:, :PetalWidth], iris[:, :PetalLength])

# [並び替え sort]
sort(blood_mtrx[:, 2])
blood_mtrx[:, 2][sortperm(blood_mtrx[:, 2])]

# [順位相関係数 rank correlation coefficient]
Pkg.add("DataFrames")
using DataFrames
flower_mtrx = [1 3; 2 1; 3 2; 4 5; 5 4; 6 7; 7 6; 8 8;]
## スピアマン Spearman
Pkg.add("StatsBase")
using StatsBase
corspearman(flower_mtrx[:, 1], flower_mtrx[:, 2])
  #1 - ((6 / ((size(flower_mtrx, 1) ^ 3) - size(flower_mtrx, 1))) * sum([(flower_mtrx[i, 1] - flower_mtrx[i, 2]) ^ 2 for i = 1:size(flower_mtrx, 1)]))
## ケンドール Kendall
using StatsBase
corkendall(flower_mtrx[:, 1], flower_mtrx[:, 2])
# flower_mtrx |>
#  x -> [(x[i, 1] < x[i + 1, 1]) == (x[i, 2] < x[i + 1, 2]) for i = 1:(size(x, 1) - 1)] |>
#  x -> (countnz(x) + count(x1 -> !x1, x)) / (((length(x) + 1) * length(x)) / 2)

# [時系列データのプロット]
using RDatasets
arbuthnot = dataset("HistData", "Arbuthnot")
using Gadfly
plot(arbuthnot, x = :Year, y = :Mortality, Geom.line)
# [自己相関係数 auto correlation coefficient]
using StatsBase
autocor(arbuthnot[:, :Mortality], 1)
# arbuthnot[:, :Mortality] |>
#  ts -> sum([(ts[i] - mean(ts)) * (ts[i + 1] - mean(ts)) / (length(ts) - 1) for i = 1:(size(ts, 1) - 1)]) / sum([(x - mean(ts)) ^ 2 / length(ts) for x = ts])

# [コレログラム correlogram]
using Gadfly
plot(x = 1:(size(arbuthnot, 1) - 1), y = [autocor(arbuthnot[:, :Mortality], i) for i = 1:(size(arbuthnot, 1) - 1)], Geom.line)

# [線形回帰 liner regression]
using RDatasets
iris = dataset("datasets", "iris")
sepal_length = convert(Array{Float64,1}, iris[:, :SepalLength])
petal_length = convert(Array{Float64,1}, iris[:, :PetalLength])
a, b = linreg(sepal_length, petal_length) #y = a + b * x

# [線形回帰のプロット]
plot(
  layer(x=sepal_length, y=petal_length, Geom.point),
  layer(x=ifloor(minimum(sepal_length)):iceil(maximum(sepal_length)), y=[a + b * x for x = ifloor(minimum(sepal_length)):iceil(maximum(sepal_length))], Geom.line)
  )

# [多項式回帰 polynomial regression]
Pkg.add("GLM")
using GLM
using RDatasets
LifeCycleSavings = dataset("datasets", "LifeCycleSavings")
fm2 = fit(LinearModel, SR ~ Pop15 + Pop75 + DPI + DDPI, LifeCycleSavings)


# [練習問題]
# 3.1 社会経済指標と投票行動
vhrate_mtrx = {
  41.4 52.8;
  76.3 71.2;
  59.2 72.6;
  51.8 63.7;
  52.5 81.3;
  53.2 81.8;
  62.4 70.9;
  55.0 74.0;
  57.7 73.2;
  63.2 72.9;
  37.5 66.7;
  48.5 65.7;
  32.4 43.7;
  20.5 55.5;
  47.9 79.6;
  68.9 85.7;
  68.5 75.3;
  52.5 80.5;
  63.3 73.0;
  58.8 77.0;
  59.7 77.5;
  48.4 69.2;
  40.7 60.0;
  51.0 78.2;
  50.9 79.5;
  34.3 61.8;
  25.8 49.6;
  32.1 59.6;
  34.4 72.1;
  55.1 71.0;
  60.3 76.3;
  57.0 72.8;
  45.6 71.8;
  54.2 60.7;
  55.1 67.0;
  55.7 71.8;
  70.3 71.2;
  61.8 68.3;
  47.6 68.5;
  42.5 54.8;
  71.3 76.0;
  55.2 65.8;
  65.2 69.4;
  42.9 66.9;
  54.7 69.7;
  62.0 71.2;
  48.2 59.6;
  }
plot(x = vhrate_mtrx[:, 1], y = vhrate_mtrx[:, 2], Guide.XLabel("自民得票率"), Guide.YLabel("持ち家比率"), Geom.point)
cor(vhrate_mtrx[:, 1], vhrate_mtrx[:, 2])

# 3.2 統計的な関連
# (略)

# 3.3 社会的リスクの順位づけ
riskeval_mtrx = [
  1 1 8 20;
  2 5 3 1;
  3 2 1 4;
  4 3 4 2;
  5 6 2 6;
  6 7 5 3;
  7 15 11 12;
  8 8 7 17;
  9 4 15 8;
  10 11 9 5;
  11 10 6 18;
  12 14 13 13;
  13 18 10 23;
  14 13 22 26;
  15 22 12 29;
  16 24 14 15;
  17 16 18 16;
  18 19 19 9;
  19 30 17 10;
  20 9 22 11;
  21 25 16 30;
  22 17 24 7;
  23 26 21 27;
  24 23 20 19;
  25 12 28 14;
  26 20 30 21;
  27 28 25 28;
  28 21 26 24;
  29 27 27 22;
  30 29 29 25;
  ]
## スピアマン Spearman
using StatsBase
corspearman(riskeval_mtrx[:, 1], riskeval_mtrx[:, 2])
## ケンドール Kendall
using StatsBase
corkendall(riskeval_mtrx[:, 1], riskeval_mtrx[:, 2])

## 3.4 ブートストラップ
# ⅰ)
iceil(rand()*11)
## ⅱ)
height_mtrx = [
  71 69;
  68 64;
  66 65;
  67 63;
  70 65;
  71 62;
  70 65;
  73 64;
  72 66;
  65 59;
  66 62;
  ]
bootstrap(mtrx, cnt) =
  reduce(vcat, [mtrx[iceil(rand()*11), :] for i = 1:cnt]) |>
  boot_mtrx -> cor(boot_mtrx[:, 1], boot_mtrx[:, 2])
bootstrap(height_mtrx, 11)
## ⅲ)
plot(x=[bootstrap(height_mtrx, i) for i = 1:200], Geom.histogram)

感想

~~GLM.jlのfit関数でエラーが発生する。~~
~~GithubのReadme.mdのサンプルコードを実行してエラーになっているので、~~
~~v0.3.0に伴う言語の仕様変更が原因だと思える。~~
~~v1.0.0が待たれる。~~
[2014/9/12]
Glm.jlをv0.4.2に更新した所、fit関数でエラーが起こらなくなった。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up