
Linear Regression with scikit-learn, Spark.ml, and TensorFlow (1): Introduction

Posted at 2017-05-06

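I tried out the linear regression features of three well-known machine learning libraries: scikit-learn, Spark.ml, and TensorFlow.
The language is Python (3.5). Besides the libraries above, I used numpy, matplotlib, and csv.
The versions of the machine learning libraries are: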

scikit-learn (0.18.1)
spark (2.1.0)
tensorflow (1.1.0)

The OS is macOS 10.12.

1. Linear regression

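Linear regression estimates the relationship y = ax + b from given data (x1,y1), (x2,y2), ... (see the Wikipedia article on linear regression for details), so the first step is to create the target data.
Given a and b, the function below generates random values of y within ax + b ± d, using numpy's rand() and elementwise array arithmetic.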

makeDataLR.py

import numpy as np
from numpy.random import rand

def makeDataForLR(a, b, n=100, d=0.1, xs=0, xe=10):
    x = rand(n) * (xe - xs) + xs
    r = rand(n) * 2*d - d
    y = x * a + b + r
    return x,y

numpy.random.rand(n) returns n random numbers drawn uniformly from 0.0 to 1.0 (1.0 itself is excluded). rand(n) * 3 therefore gives numbers in 0.0 to 3.0, and rand(n) + 2 gives numbers in 2.0 to 3.0.

>>> rand(10) * (10-5) + 5  # random numbers in 5.0 to 10.0
array([ 6.21226444,  6.77468084,  9.36730437,  5.11593757,  5.38383768,
        7.87395788,  9.63988158,  8.28096493,  5.0125407 ,  8.60225573])
>>> rand(10) * 2*1 - 1   # random numbers in -1.0 to 1.0
array([ 0.4029865 , -0.31802214, -0.71503869, -0.71740942,  0.05573439,
       -0.85997408, -0.91677018, -0.72540234,  0.12157467, -0.77786667])
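As a quick sanity check of the scaling described above, the following sketch regenerates the data and confirms that every x lies in the requested range and every y stays within d of ax + b. The fixed seed and the small floating-point tolerance are assumptions added here for reproducibility; they are not part of the original script.

```python
import numpy as np
from numpy.random import rand, seed

def makeDataForLR(a, b, n=100, d=0.1, xs=0, xe=10):
    x = rand(n) * (xe - xs) + xs   # uniform in [xs, xe)
    r = rand(n) * 2*d - d          # uniform noise in [-d, d)
    y = x * a + b + r
    return x, y

seed(0)  # assumption: fixed seed so the check is repeatable
a, b, d = 0.4, 0.8, 0.2
x, y = makeDataForLR(a, b, n=100, d=d, xs=0, xe=10)

# Distance of each point from the noise-free line y = a*x + b
residual = y - (a * x + b)

x_in_range = bool(np.all((x >= 0) & (x < 10)))
noise_in_band = bool(np.all(np.abs(residual) <= d + 1e-12))
print(x_in_range, noise_in_band)
```

Both flags should be true for any seed, since rand() never returns 1.0.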

Next, save the generated data to a file. With the csv library, a numpy array can be written out as is.

makeDataLR.py

import csv
def writeArrayWithCSV(dataFile, data):
    # newline='' lets the csv module control line endings itself
    with open(dataFile, 'w', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerows(data)

# Generate 100 points of ax+b±0.2 with a=0.4, b=0.8, for x in 0 to 10
x,y = makeDataForLR(0.4, 0.8, 100, 0.2, 0, 10)

# Combine x and y into two columns and save them as a csv file
dataFile = 'sampleLR.csv'
xy = np.c_[x, y]
writeArrayWithCSV(dataFile, xy)
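To confirm that the file really contains one "x,y" pair per row, it can be read back with numpy. The sketch below is a minimal round-trip check under a few assumptions: it writes to a temporary directory rather than sampleLR.csv, uses three hand-picked points instead of the random data, and reads the file with np.loadtxt, which the original article does not use.

```python
import csv
import os
import tempfile
import numpy as np

def writeArrayWithCSV(dataFile, data):
    with open(dataFile, 'w', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerows(data)

# Three illustrative points on y = 0.4x + 0.8 (hypothetical sample data)
xy = np.c_[np.array([0.0, 2.5, 10.0]), np.array([0.8, 1.8, 4.8])]

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'sampleLR_check.csv')
writeArrayWithCSV(path, xy)

# Read the file back as a 2-column float array
loaded = np.loadtxt(path, delimiter=',')
roundtrip_ok = loaded.shape == (3, 2) and bool(np.allclose(loaded, xy))
print(roundtrip_ok)
```

np.c_ stacks the two 1-D arrays as columns, so each row that csv.writer receives is one (x, y) pair.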

Now let's visualize what the generated data actually looks like, using a matplotlib scatter plot.

makeDataLR.py

import matplotlib.pyplot as plt
def plotXY(title, x, y):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(x, y)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    #fig.show()                                                                 
    imageFile = title + '.png'
    fig.savefig(imageFile)

title = 'sampleLR'
plotXY(title, x, y)
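Besides looking at the plot, a numeric check can confirm that the points follow the intended line. The sketch below fits a straight line with np.polyfit and compares the result with the a and b used for generation. np.polyfit is used here only as a quick check and is an assumption of this example, not one of the libraries covered in this series; the fixed seed is likewise added only for reproducibility.

```python
import numpy as np
from numpy.random import rand, seed

def makeDataForLR(a, b, n=100, d=0.1, xs=0, xe=10):
    x = rand(n) * (xe - xs) + xs
    r = rand(n) * 2*d - d
    y = x * a + b + r
    return x, y

seed(1)  # assumption: fixed seed so the check is repeatable
x, y = makeDataForLR(0.4, 0.8, 100, 0.2, 0, 10)

# Degree-1 least-squares fit; returns the slope first, then the intercept
a_hat, b_hat = np.polyfit(x, y, 1)
print(a_hat, b_hat)
```

The estimates should land close to 0.4 and 0.8, though not exactly on them, because of the ±0.2 noise.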

If matplotlib raises an error on macOS, creating ~/.matplotlib/matplotlibrc with the following contents should make it work.
~/.matplotlib/matplotlibrc
backend : TkAgg
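As an alternative to editing matplotlibrc, the backend can also be chosen inside the script. The sketch below assumes that only image files are needed (as with fig.savefig above), so it selects the non-interactive Agg backend instead of TkAgg. The output file name and the temporary directory are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # select the backend before pyplot is imported
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter([0, 5, 10], [0.8, 2.8, 4.8])  # three illustrative points

# Hypothetical output path inside a temporary directory
out = os.path.join(tempfile.mkdtemp(), 'backend_check.png')
fig.savefig(out)
plt.close(fig)

saved = os.path.exists(out)
print(matplotlib.get_backend(), saved)
```

Agg cannot open a window, so this approach fits scripts that only save figures; fig.show() still needs an interactive backend such as TkAgg.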

Finally, here is the complete data-generation program. The actual linear regression is covered in the next article.

makeDataLR.py

#!/usr/bin/env python

import numpy as np
from numpy.random import rand

# Generate n values of ax + b ± d for x between xs and xe
def makeDataForLR(a, b, n=100, d=0.1, xs=0, xe=10):
    x = rand(n) * (xe - xs) + xs
    r = rand(n) * 2*d - d
    y = x * a + b + r
    return x,y

# Write a numpy.array to a csv file
import csv
def writeArrayWithCSV(dataFile, data):
    # newline='' lets the csv module control line endings itself
    with open(dataFile, 'w', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerows(data)

# Scatter plot of x and y
import matplotlib.pyplot as plt
def plotXY(title, x, y):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(x, y)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    #fig.show()                                                                      
    imageFile = title + '.png'
    fig.savefig(imageFile)

# Generate 100 points of ax+b±0.2 with a=0.4, b=0.8, for x in 0 to 10
x,y = makeDataForLR(0.4, 0.8, 100, 0.2, 0, 10)

# Combine x and y into two columns and save them as a csv file
title = 'sampleLR'
dataFile = title + '.csv'
xy = np.c_[x, y]
writeArrayWithCSV(dataFile, xy)

# Plot the data
plotXY(title, x, y)
