More than 5 years have passed since last update.

JuliaでANOVA

Posted at 2019-03-15

Rをはじめよう生命科学のためのRStudio入門のデータセット（ここにある）を使って､分散分析（ANOVA）をやるJuliaコード。データを読み込み､図で結論を予想し、検定し､図を仕上げる。コードや出力の説明などは技術書典６(2019/04/14)で頒布予定。

daphnia.jl

using CSV, Gadfly, DataFrames, DataFramesMeta, Statistics, HypothesisTests,
      Cairo, Fontconfig, GLM, Distributions, Distances, LinearAlgebra

# データ読み込み
dat = CSV.read("datasets/Daphniagrowth.csv", allowmissing=:none)

# データ読み込みと、データの形式の確認
names!(dat, [:parasite, :rep, :growth])
describe(dat)
describe(parasite) # カテゴリカル変数
std(dat[:rep])     # 自然数
std(dat[:growth])  # 浮動小数点実数

# 全体、および寄生虫ごとの箱ヒゲ図
draw(PDF("fig/daphnia_1.pdf", 8cm, 18cm),
     plot(dat, y = :growth, Geom.boxplot))
draw(PDF("fig/daphnia_2.pdf", 8cm, 18cm),
     plot(dat, x = :parasite, y = :growth, Geom.boxplot))
by(dat, :parasite, :growth => mean)             # 平均値を数字で見る

# これでも同じようにプロットできる
# plot(dat, xgroup = :parasite, y = :growth, Geom.subplot_grid(Geom.boxplot))

# 分散分析
# ANOVA を一発でやってくれる関数はないので、必要な計算をそのままやる
# そのためにデータフレームを、解析しやすい用に unstack() で整形する
dat2 = unstack(dat, :parasite, :growth)         # 寄生虫ごとに列になるように
deletecols!(dat2, :rep)                         # 不要な列を消し列名を変える
names!(dat2, [:Metschnikowia_bicuspidata,       # 寄生虫１ メチニコビア酵母
              :Pansporella_perplexa,            # 寄生虫２ アメーバの一種
              :Pasteuria_ramosa,                # 寄生虫３ パスツリア菌
              :Control])                        # 寄生虫無し
# Control が最初の列になるように並べ替え
permutecols!(dat2, [:Control, :Metschnikowia_bicuspidata, :Pansporella_perplexa,
                    :Pasteuria_ramosa])         # アルファベット順に並べる
dat3 = dat2[:, :]                               # あとにも使うので変更前に複製
nG = length(levels(dat[:, :parasite]))          # 群の数
nR = length(dat2[:, :Control])                  # 各群のサンプル数
means = by(dat, :parasite, M = :growth => mean) # 群ごとの平均
means = means[:M]                               # いるのは平均値の列だけ
for i in 1:nR                                   # 各群のサンプルからその群の
  for j in 1:nG                                 # 平均を引いて二乗する
    dat2[i, j] = (dat2[i, j] - means[j])^2
  end
end
dDoF = (nG * (nR - 1))                          # 群内の自由度
denominator = (sum(DataFrames.colwise(sum, dat2))) / dDoF # 分母
m = mean(DataFrames.colwise(mean, dat3))        # 全サンプルの平均
for i in 1:nG                                   # 各群の平均から全体の平均を
  means[i] = (means[i] - m)^2                   # 引いて二乗
end
nDoF = (nG - 1)                                 # 群間の自由度
numerator = sum(means) * nR / nDoF              # 分子
ccdf(FDist(nDoF, dDoF), numerator/denominator)  # ｐ値

# きれいなプロットを作る
# これを時計回りに 90° 回転させれば見やすい
# 回転後の横軸の数値は白い箱で隠してテキストボックスを置くなどして描き直す
means = by(dat, :parasite, M = :growth => mean) # 群ごとの平均を求め直す
draw(PDF("fig/daphnia_3.pdf", 6cm, 18cm),
     plot(layer(means, x = :parasite, y = :M, Geom.point,
                shape = [Shape.diamond],
                Theme(default_color = colorant"red", point_size = 2mm)),
          layer(dat,   x = :parasite, y = :growth, Geom.point),
          layer(dat,   x = :parasite, y = :growth, Geom.boxplot),
          Theme(key_position = :none),
          Guide.ylabel("growth rate"),
          Guide.xticks(orientation=:vertical)))

draw(PDF("fig/daphnia_4.pdf", 18cm, 8cm), # 横向きにしてみる
     plot(layer(means, x = :M, y = :parasite, Geom.point,
                shape = [Shape.diamond],
                Theme(default_color = colorant"red", point_size = 2mm)),
          layer(dat, x = :growth, y = :parasite, Geom.point),
          layer(dat, x = :growth, y = :parasite, Geom.boxplot),
          Theme(key_position = :none),
          Guide.xlabel("growth rate")))

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up