
Data Analysis in Ruby: An Introduction to Daru

Posted at 2016-10-04

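Data analysis is usually done in Python or R, but simple analyses are also possible in Ruby.

To work with data frames in Ruby (a data structure similar to an Excel spreadsheet), you can use a library called Daru. Its author, @v0dro, gave a talk about Daru at RubyKaigi 2016; the talk page is listed in the references at the end of this article.

This article introduces the basic usage of Daru.
Covering every useful feature would make it too long, so only minimal usage examples are shown.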

Environment

Ruby 2.3.1
daru 0.1.4.1

Add gem "daru" to your Gemfile and run bundle.

All examples below assume that require "daru" has already been executed.

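Data structures

A data frame is a structure resembling a two-dimensional array, implemented by the class Daru::DataFrame.
Each row or column of a data frame is a one-dimensional array represented by the class Daru::Vector. Most work with Daru involves these two classes, Vector and DataFrame.

Using Daru

Creating a DataFrame

From arrays of columns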

Daru::DataFrame.new(
  "col0" => [1,2,3,4,5],
  "col1" => [0.1,0.2,0.3,0.4,0.5],
  "col2" => [11,22,33,44,55]
)

=> #<Daru::DataFrame(5x3)>
      col0 col1 col2
    0    1  0.1   11
    1    2  0.2   22
    2    3  0.3   33
    3    4  0.4   44
    4    5  0.5   55

From arrays of rows
- Column names are specified with the :order option

df = Daru::DataFrame.rows(
  [
    [1, 0.1, 11],
    [2, 0.2, 22],
    [3, 0.3, 33],
    [4, 0.4, 44],
    [5, 0.5, 55]
  ],
  order: ["col0", "col1", "col2"]
)
=> #<Daru::DataFrame(5x3)>
      col0 col1 col2
    0    1  0.1   11
    1    2  0.2   22
    2    3  0.3   33
    3    4  0.4   44
    4    5  0.5   55
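
The two constructors above describe the same table from different directions. As a rough plain-Ruby sketch (it does not use Daru, and rows_to_columns is a hypothetical helper introduced only for illustration), the row-wise input can be turned into the column-wise form by transposing it and pairing each column with its name:

```ruby
# Plain-Ruby sketch (no Daru required): the same table row-wise and column-wise.
ROWS = [
  [1, 0.1, 11],
  [2, 0.2, 22],
  [3, 0.3, 33],
  [4, 0.4, 44],
  [5, 0.5, 55]
].freeze
ORDER = ["col0", "col1", "col2"].freeze

# Hypothetical helper: transpose the rows into columns and
# pair each column with its name from `order`.
def rows_to_columns(rows, order)
  order.zip(rows.transpose).to_h
end

COLUMNS = rows_to_columns(ROWS, ORDER)
p COLUMNS["col0"]  # => [1, 2, 3, 4, 5]
p COLUMNS["col2"]  # => [11, 22, 33, 44, 55]
```

The resulting Hash has the same shape as the argument passed to Daru::DataFrame.new in the first example.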

Accessing elements

Getting a column

df["col0"]
=> #<Daru::Vector(5)>
      col0
    0    1
    1    2
    2    3
    3    4
    4    5

Getting multiple columns

df["col0", "col2"]      # pass multiple column names
df["col1".."col2"]      # pass a range of column names

Getting a row

df.row[2]
=> #<Daru::Vector(3)>
         2
 col0    3
 col1  0.3
 col2   33

Setting elements

Setting a column

df["col3"] = [5,4,3,2,1]
df
=> #<Daru::DataFrame(5x4)>
      col0 col1 col2 col3
    0    1  0.1   11    5
    1    2  0.2   22    4
    2    3  0.3   33    3
    3    4  0.4   44    2
    4    5  0.5   55    1

Setting a row

df.row[5] = [6, 0.6, 66, 0]
df
=> #<Daru::DataFrame(6x4)>
      col0 col1 col2 col3
    0    1  0.1   11    5
    1    2  0.2   22    4
    2    3  0.3   33    3
    3    4  0.4   44    2
    4    5  0.5   55    1
    5    6  0.6   66    0

Displaying statistics

Calling describe returns the most common summary statistics in one go.

df.describe
=> #<Daru::DataFrame(5x4)>
                  col0       col1       col2       col3
      count          6          6          6          6
       mean        3.5 0.35000000       38.5        2.5
        std 1.87082869 0.18708286 20.5791156 1.87082869
        min          1        0.1         11          0
        max          6        0.6         66          5
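
To make the rows of this table concrete, here is a plain-Ruby sketch that recomputes them for col0 without Daru. The helper name summarize is hypothetical. The standard deviation uses the sample form (dividing by n - 1), which is an inference from the printed value 1.87082869 for col0 rather than something taken from Daru's documentation.

```ruby
# Plain-Ruby sketch (no Daru required) of the statistics shown by describe.
# `summarize` is a hypothetical helper, not part of Daru.
def summarize(values)
  n = values.size
  mean = values.reduce(:+).to_f / n
  # Sample variance: squared deviations from the mean, divided by n - 1.
  variance = values.map { |v| (v - mean)**2 }.reduce(:+) / (n - 1)
  {
    count: n,
    mean: mean,
    std: Math.sqrt(variance),
    min: values.min,
    max: values.max
  }
end

# col0 after the sixth row was added
p summarize([1, 2, 3, 4, 5, 6])
```

For col0 this gives a count of 6, a mean of 3.5, and a standard deviation of about 1.8708, in line with the first column of the describe output.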

Mean

df.mean
=> #<Daru::Vector(4)>
                                    mean
                col0                 3.5
                col1 0.35000000000000003
                col2                38.5
                col3                 2.5

Correlation coefficients

df.corr
=> #<Daru::DataFrame(4x4)>
                  col0       col1       col2       col3
       col0        1.0        1.0        1.0       -1.0
       col1        1.0        1.0 1.00000000       -1.0
       col2        1.0 1.00000000 1.00000000       -1.0
       col3       -1.0       -1.0       -1.0        1.0

Covariance

df.cov
=> #<Daru::DataFrame(4x4)>
                  col0       col1       col2       col3
       col0        3.5       0.35       38.5       -3.5
       col1       0.35 0.03499999       3.85      -0.35
       col2       38.5       3.85      423.5      -38.5
       col3       -3.5      -0.35      -38.5        3.5
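
The entries of the correlation and covariance matrices can be checked by hand. The following plain-Ruby sketch (no Daru; mean_of, covariance and correlation are hypothetical helpers) computes the sample covariance and the Pearson correlation for one pair of columns. The n - 1 divisor is assumed because it reproduces the values printed above.

```ruby
# Plain-Ruby sketch (no Daru required) of one cell of df.cov and df.corr.
def mean_of(xs)
  xs.reduce(:+).to_f / xs.size
end

# Sample covariance: mean-centred products divided by n - 1.
def covariance(xs, ys)
  mx = mean_of(xs)
  my = mean_of(ys)
  xs.zip(ys).map { |x, y| (x - mx) * (y - my) }.reduce(:+) / (xs.size - 1)
end

# Pearson correlation: covariance scaled by both standard deviations.
def correlation(xs, ys)
  covariance(xs, ys) / Math.sqrt(covariance(xs, xs) * covariance(ys, ys))
end

COL0 = [1, 2, 3, 4, 5, 6].freeze
COL3 = [5, 4, 3, 2, 1, 0].freeze

p covariance(COL0, COL3)   # => -3.5
p correlation(COL0, COL3)  # => -1.0
```

Because col3 decreases by exactly one whenever col0 increases by one, the two columns are perfectly anti-correlated, which is why the col0/col3 cell of the correlation matrix is -1.0.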

Iteration

Iterating over columns

df.each {|col| puts col[0] + col[3] }
# 5
# 0.5
# 55
# 7

Iterating over rows

df.each(:row) {|row| p row["col1"] }
# 0.1
# 0.2
# 0.3
# 0.4
# 0.5
# 0.6

map and map(:row) are defined in the same way; both return an Array.
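
The difference between the two axes can be illustrated with a plain-Ruby sketch over a Hash of columns. This is only an analogy for how the block receives a column or a row: map_columns and map_rows are hypothetical helpers, not Daru methods, and only two of the columns are included to keep it short.

```ruby
# Plain-Ruby sketch (no Daru required) of column-wise vs row-wise mapping.
TABLE = {
  "col0" => [1, 2, 3, 4, 5, 6],
  "col3" => [5, 4, 3, 2, 1, 0]
}.freeze

# Column-wise: the block receives one whole column at a time.
def map_columns(table)
  table.map { |_name, values| yield values }
end

# Row-wise: the block receives one row at a time as a name => value Hash.
def map_rows(table)
  names = table.keys
  table.values.transpose.map { |row| yield names.zip(row).to_h }
end

p map_columns(TABLE) { |col| col[0] + col[3] }          # => [5, 7]
p map_rows(TABLE) { |row| row["col0"] * row["col3"] }   # => [5, 8, 9, 8, 5, 0]
```

The column-wise result has one element per column, while the row-wise result has one element per row.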

Filtering

Filtering columns

df.filter {|col| col.mean > 10 }
=> #<Daru::DataFrame(6x1)>
      col2
    0   11
    1   22
    2   33
    3   44
    4   55
    5   66

Filtering rows

df.filter(:row) {|row| row["col0"]*row["col3"] < 8 }
=> #<Daru::DataFrame(3x4)>
      col0 col1 col2 col3
    0    1  0.1   11    5
    4    5  0.5   55    1
    5    6  0.6   66    0
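
Note that the surviving rows keep their original index labels (0, 4, 5) instead of being renumbered. A plain-Ruby sketch of the same selection, using a hypothetical helper filter_rows_with_index and only the two columns the condition looks at, shows which rows pass and why:

```ruby
# Plain-Ruby sketch (no Daru required) of the row filter above.
FILTER_ROWS = [
  { "col0" => 1, "col3" => 5 },
  { "col0" => 2, "col3" => 4 },
  { "col0" => 3, "col3" => 3 },
  { "col0" => 4, "col3" => 2 },
  { "col0" => 5, "col3" => 1 },
  { "col0" => 6, "col3" => 0 }
].freeze

# Hypothetical helper: keep the rows for which the block is true,
# returning [original_index, row] pairs so the index is preserved.
def filter_rows_with_index(rows)
  rows.each_with_index
      .select { |row, _index| yield row }
      .map { |row, index| [index, row] }
end

KEPT = filter_rows_with_index(FILTER_ROWS) { |row| row["col0"] * row["col3"] < 8 }
p KEPT.map(&:first)  # => [0, 4, 5]
```

Rows 1 and 3 give a product of exactly 8 and row 2 gives 9, so none of them satisfy the strict inequality.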

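Sorting

sort orders the rows. The first argument is an array of column names, and the ascending: option specifies ascending (true) or descending (false) order for each of those columns.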

df.sort(["col3", "col1"], ascending: [false, true])
=> #<Daru::DataFrame(6x4)>
      col0 col1 col2 col3
    0    1  0.1   11    5
    1    2  0.2   22    4
    2    3  0.3   33    3
    3    4  0.4   44    2
    4    5  0.5   55    1
    5    6  0.6   66    0
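
In the data above every col3 value is distinct, so the second sort key never comes into play. The following plain-Ruby sketch uses hypothetical rows with ties in col3 to show what a "first key descending, second key ascending" ordering means. It does not use Daru, and sort_desc_then_asc is an illustrative helper only.

```ruby
# Plain-Ruby sketch (no Daru required) of a two-key sort with mixed directions.
# Hypothetical data: col3 contains ties so that col1 decides the final order.
SORT_ROWS = [
  { "col1" => 0.3, "col3" => 2 },
  { "col1" => 0.1, "col3" => 5 },
  { "col1" => 0.2, "col3" => 2 },
  { "col1" => 0.4, "col3" => 5 }
].freeze

# Sort by desc_key descending, then by asc_key ascending.
# Negating the first key only works because the values are numeric.
def sort_desc_then_asc(rows, desc_key, asc_key)
  rows.sort_by { |row| [-row[desc_key], row[asc_key]] }
end

SORTED = sort_desc_then_asc(SORT_ROWS, "col3", "col1")
p SORTED.map { |row| [row["col3"], row["col1"]] }
# => [[5, 0.1], [5, 0.4], [2, 0.2], [2, 0.3]]
```

Rows are grouped by col3 from largest to smallest, and within each group they are ordered by col1 from smallest to largest.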

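Other features

Daru also offers the following features, which are not covered here. See the README for details.

pivot tables
import from CSV and Excel files
methods for time-series data
integration with the statistics gems statsample and statsample-glm for more advanced statistical processing
plotting via the Nyaplot plotting library
handling of categorical data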

References

GitHub repository
- v0dro/daru
- For details on individual methods, see the collection of samples linked from the README
Talk at RubyKaigi 2016
- http://rubykaigi.org/2016/presentations/v0dro.html
Results of running the code above in Jupyter
- https://gist.github.com/yohm/4f521aedc9b0f185fd5a38572fb5b4e3
