
Python Machine Learning, Chapter 4: How to Build Good Datasets

Posted at 2019-06-02

Introduction

Dealing with missing data

Identifying missing values in tabular data



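import sys
from io import StringIO

import pandas as pd

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# Python 2 only: the CSV text has to be converted to unicode
if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)

# read_csv: load the CSV text into a pandas.DataFrame
df = pd.read_csv(StringIO(csv_data))
df

# count the missing values of each feature (per column)
df.isnull().sum()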


In short, a [pandas.DataFrame](https://machine-earning.net/article/pandas-dataframe/) is a two-dimensional array with labeled rows and columns, used for data analysis in Python.

# access the underlying NumPy array via the `values` attribute
df.values

# Eliminating samples or features with missing values
# remove rows that contain missing values
df.dropna(axis=0)

# remove columns that contain missing values
# (axis=1 drops every column that holds at least one NaN)
df.dropna(axis=1)

# only drop rows where all columns are NaN
df.dropna(how='all')

# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

# only drop rows where NaN appear in specific columns (here: 'C')
df.dropna(subset=['C'])
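To see what each `dropna` variant keeps, the sketch below rebuilds the same small table and records the resulting shapes. The names `df_demo` and `shapes` are only for this illustration and are not part of the original notebook.

```python
from io import StringIO

import pandas as pd

# Same toy table as above: row 1 lacks C, row 2 lacks D
csv_data = "A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n10.0,11.0,12.0,"
df_demo = pd.read_csv(StringIO(csv_data))

# Shape (rows, columns) that remains after each variant
shapes = {
    "rows_dropped": df_demo.dropna(axis=0).shape,        # only the complete row survives
    "cols_dropped": df_demo.dropna(axis=1).shape,        # only columns A and B survive
    "all_nan_rows": df_demo.dropna(how="all").shape,     # no row is entirely NaN
    "thresh_4": df_demo.dropna(thresh=4).shape,          # rows need 4 non-NaN values
    "subset_C": df_demo.dropna(subset=["C"]).shape,      # only NaN in column C counts
}
print(shapes)
```

Only the first row is complete, so `axis=0` and `thresh=4` both keep a single row here.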

Imputing missing values


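# mean imputation: impute missing values via the column mean
# The Imputer class replaces the missing values; strategy can also be
# 'median' or 'most_frequent'.
# Note: Imputer was removed in scikit-learn 0.22; newer releases
# provide sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data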
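On newer scikit-learn releases the same mean imputation can be sketched with `SimpleImputer`. This is an adaptation rather than the code from the book, and the array `X_missing` simply re-types the toy table from above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The toy table from above as a NumPy array, with NaN for the gaps
X_missing = np.array([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, np.nan, 8.0],
                      [10.0, 11.0, 12.0, np.nan]])

# SimpleImputer always works column-wise, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed)
```

The gap in column C is filled with the mean of 3.0 and 12.0, and the gap in column D with the mean of 4.0 and 8.0.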

Understanding the scikit-learn estimator API

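Imputer belongs to scikit-learn's transformer classes.
Transformer classes have two essential methods, fit and transform.
These differ from a classifier's fit and predict.

A transformer's fit determines parameters from the data, e.g. it computes the mean, median, or most frequent value.
A transformer's transform applies those learned parameters to the data.
(A classifier's fit learns the model parameters; instead of transform, a classifier has predict, which makes the predictions.)
Reference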
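The split between fit and transform can be sketched with a tiny hand-written transformer. `ColumnMeanImputer` is a hypothetical class made up for this illustration, not a scikit-learn API; it only mimics the idea that parameters are learned on the training data and then reused on new data.

```python
import numpy as np


class ColumnMeanImputer:
    """Hypothetical minimal transformer, for illustration only."""

    def fit(self, X):
        # "fit" learns the parameters: one mean per column, ignoring NaN
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        # "transform" applies the stored parameters to any data set
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.means_[cols]
        return X


X_train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test_demo = np.array([[np.nan, 100.0]])

imp = ColumnMeanImputer().fit(X_train_demo)   # parameters come from the training data
X_test_filled = imp.transform(X_test_demo)    # and are reused on the test data
print(imp.means_, X_test_filled)
```

The missing test value is filled with the training mean of the first column, not with anything computed from the test data.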

Handling categorical data


# Nominal and ordinal features
df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])

# color: nominal feature, size: ordinal feature, price: numerical feature
df.columns = ['color', 'size', 'price', 'classlabel']
df

Mapping ordinal features


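# define the mapping dictionary
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

# convert each size to its mapped integer
df['size'] = df['size'].map(size_mapping)
df

# restore the original labels with the inverse mapping dictionary
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

This explicitly maps the ordinal feature to integers, assuming the sizes are evenly spaced:
XL = L + 1 = M + 2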

Encoding class labels

Encoding: the word can mean encryption or coding; here it means converting labels into numeric codes.


# create a mapping dict that pairs each class label (string)
# with an integer
# np.unique returns the sorted unique values of the array
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

# to convert class labels from strings to integers
df['classlabel'] = df['classlabel'].map(class_mapping)
df

# reverse the class label mapping
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

# Label encoding with sklearn's LabelEncoder
# the LabelEncoder class can also encode the class labels
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

# reverse mapping
class_le.inverse_transform(y)

Many machine learning libraries expect class labels to be integers, so passing the class labels as integers is good practice.

Performing one-hot encoding on nominal features


X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

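The color values were successfully converted to integers, but they now read blue=0, green=1, red=2.
As a result, the colors have acquired an ordering that does not actually exist.
→ One-hot encoding avoids this by creating a dummy feature for each unique value.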


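```python
# create the OneHotEncoder
# Note: the categorical_features parameter was removed in
# scikit-learn 0.22; newer releases select the columns with
# sklearn.compose.ColumnTransformer instead.
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])

# run the one-hot encoding
ohe.fit_transform(X).toarray()
```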

> ohe = OneHotEncoder(categorical_features=[0])

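categorical_features=[0] defines, as a list, the column positions of the variables to convert.

> ohe.fit_transform(X).toarray()

A OneHotEncoder instance returns a sparse matrix when its transform method is called.
Here the toarray method is used so that the contents can be inspected.

The toarray method converts the sparse matrix into a regular (dense) NumPy array.
[The difference between dense and sparse matrices is whether the zeros are stored.](https://jp.mathworks.com/help/matlab/sparse-matrices.html) A sparse matrix stores only the nonzero elements, so it is more memory-efficient than a dense matrix when most entries are zero.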

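```python
# return dense array so that we can skip
# the toarray step
# Note: scikit-learn 1.2 renamed the sparse parameter to sparse_output.
ohe = OneHotEncoder(categorical_features=[0], sparse=False)
ohe.fit_transform(X)
```

Passing sparse=False makes it possible to omit toarray.
Note: a non-sparse matrix is a dense matrix.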
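For scikit-learn releases without `categorical_features`, the same encoding can be sketched with `ColumnTransformer`. This is an adaptation and not the code from the book; `X_demo` re-types the color, size, and price columns, and `sparse_threshold=0` is chosen here only so that the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# color (nominal), size (already mapped to integers), price
X_demo = np.array([['green', 1, 10.1],
                   ['red', 2, 13.5],
                   ['blue', 3, 15.3]], dtype=object)

# One-hot encode column 0 and pass the other columns through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,   # always return a dense array
)
X_encoded = np.asarray(ct.fit_transform(X_demo), dtype=float)
print(X_encoded)
```

The three one-hot columns come first (in the sorted order blue, green, red), followed by size and price.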


# one-hot encoding via pandas
pd.get_dummies(df[['price', 'color', 'size']])

# multicollinearity guard in get_dummies
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)

pandas' get_dummies function is a convenient way to create dummy features.
Only the columns that hold strings are converted.


# multicollinearity guard for the OneHotEncoder
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()[:, 1:]

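One-hot encoding introduces **multicollinearity**.
Multicollinearity means that features are highly correlated with one another; it makes matrices hard to invert and can lead to numerically unstable estimates.
With one-hot columns, each row sums to 1, so any one column is fully determined by the others.

For this reason, one feature column is removed from the one-hot encoded array.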
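The redundancy can be checked numerically. The sketch below uses a made-up one-hot matrix plus an intercept column (both are assumptions for illustration) and compares the matrix rank before and after dropping the first dummy column.

```python
import numpy as np

# Made-up one-hot columns (blue, green, red), one row per sample
onehot = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 1, 0]], dtype=float)
intercept = np.ones((onehot.shape[0], 1))

full = np.hstack([intercept, onehot])             # 4 columns
reduced = np.hstack([intercept, onehot[:, 1:]])   # first dummy column dropped

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```

The full matrix has four columns but only rank 3, because the intercept column equals the sum of the three dummy columns. After dropping one dummy column, the number of columns matches the rank.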


# Partitioning a dataset into a separate training and test set
```python
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data',
                      header=None)


# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

# df_wine = pd.read_csv('wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# print the class labels
print('Class labels', np.unique(df_wine['Class label']))
# show the first five rows of the Wine dataset
df_wine.head()
```

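The code even includes a local path to load from when the dataset cannot be downloaded. Very considerate.
np.unique returns the sorted unique elements of an array, with duplicates removed.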


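from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

train_test_split is a handy function that splits a dataset into training data and test data!!
Explanation of train_test_split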
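What `stratify=y` does can be sketched on made-up labels. The arrays `X_demo` and `y_demo` below are assumptions for illustration: 100 samples with a 60/30/10 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60 samples of class 0, 30 of class 1, 10 of class 2
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Class counts in each part keep the 6:3:1 ratio of the full data
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

With stratification, both parts keep the class proportions of the full data set, which matters most for small or imbalanced data.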

Bringing features onto the same scale


from sklearn.preprocessing import MinMaxScaler, StandardScaler

# normalization: min-max scaling
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

# standardization: zero mean and unit variance
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

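Decision trees and random forests apparently do not need feature scaling.
Scaling changes the threshold values at the tree nodes, but the structure of the tree stays the same.

For many machine learning algorithms, including those optimized with gradient descent, standardization is apparently more practical than min-max normalization.
Normalization squeezes the data into a fixed range, so a single outlier compresses all the other values into a narrow band.
Standardization, in contrast, keeps useful information about outliers and makes the algorithm less sensitive to them.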


# A visual example:

ex = np.array([0, 1, 2, 3, 4, 5])
print('standardized:', (ex - ex.mean()) / ex.std())

# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and the StandardScaler
# use ddof=0 (population standard deviation)

# normalize
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))

pandas apparently uses the sample standard deviation by default, while NumPy uses the population standard deviation.
Difference between the sample standard deviation and the population standard deviation
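The difference between the two defaults is easy to check on the same `ex` array. This is a small sketch; the variable names `population_std` and `sample_std` are only for illustration.

```python
import numpy as np
import pandas as pd

ex = np.array([0, 1, 2, 3, 4, 5])

population_std = ex.std()           # NumPy default: ddof=0, divides by n
sample_std = pd.Series(ex).std()    # pandas default: ddof=1, divides by n - 1

print(population_std, sample_std)
# Passing ddof=1 to NumPy reproduces the pandas value
print(np.isclose(ex.std(ddof=1), sample_std))
```

The sum of squared deviations is 17.5 in both cases; only the divisor differs (6 versus 5), so the sample standard deviation is slightly larger.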

Selecting meaningful features

L1 and L2 regularization as penalties against model complexity
4.5.2 A geometric interpretation of L2 regularization
4.5.3 Sparse solutions with L1-regularization
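Sparse means mostly empty, i.e. mostly zeros.

I don't really understand L2 and L1 regularization.
This article looks good, but I can't make sense of it.
This one explains it intuitively, and I understood it a tiny bit.
After that, reading the upper part of this one, I understood a little more.

<In very simple terms>
In short, a penalty term is added to the cost function.
→ For L1, the penalty is the sum of the absolute values of the weights (for L2, the sum of their squares).
→ So not only the fit to the data but also the size of the weights has to be considered, and the penalty has a large influence.
→ We look for the point that balances the minimum of the data cost against the minimum of the penalty.
With the first-power (L1) penalty, the penalty region is a diamond (a rotated square), and the minimum is where it touches the contours of the data cost. As in the usual figure, this tends to be a corner, so either w1 or w2 becomes zero.
With the squared (L2) penalty, the region is a circle, and the minimum is where it touches the contours of the data cost. Both w1 and w2 can survive.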
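The "one weight disappears" effect can be reproduced on a toy problem with a closed-form solution. The sketch below assumes a simple quadratic data cost 0.5*(w - a)^2 per weight, which is an assumption for illustration and not the logistic cost used in the book. For that cost, the L1-penalized minimum is given by soft-thresholding and the L2-penalized minimum by uniform shrinkage.

```python
import numpy as np


def l1_solution(a, lam):
    """argmin over w of 0.5*(w - a)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def l2_solution(a, lam):
    """argmin over w of 0.5*(w - a)**2 + 0.5*lam*w**2 (shrinkage)."""
    return a / (1.0 + lam)


a = np.array([3.0, 0.5, -0.2])   # unpenalized optimum of each weight
lam = 1.0                        # regularization strength

w_l1 = l1_solution(a, lam)
w_l2 = l2_solution(a, lam)
print(w_l1, w_l2)
```

L1 sets the two small weights exactly to zero, while L2 only shrinks every weight toward zero without eliminating any.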


# For regularized models in scikit-learn that support L1 regularization,
# we can simply set the `penalty` parameter to `'l1'` to obtain a sparse solution.
# Note: since scikit-learn 0.22 the default solver is 'lbfgs', which does not
# support L1, so a solver that does (here 'liblinear') is named explicitly.
from sklearn.linear_model import LogisticRegression

# create an L1-regularized logistic regression instance
LogisticRegression(penalty='l1', solver='liblinear')

# Applied to the standardized Wine data ...
lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')

# Note that C=1.0 is the default. You can decrease
# or increase it to make the regularization effect
# stronger or weaker, respectively.

# fit to the training data
lr.fit(X_train_std, y_train)

# print the accuracy on the training data
print('Training accuracy:', lr.score(X_train_std, y_train))

# print the accuracy on the test data
print('Test accuracy:', lr.score(X_test_std, y_test))

lr = LogisticRegression(penalty='l1', C=1.0)

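The inverse regularization parameter C is hard to grasp.
According to this article, it is simply the inverse of the regularization parameter λ (C = 1/λ). So a smaller C means stronger regularization and a larger C means weaker regularization; C=1.0 corresponds to λ=1, which is the default strength.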
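The effect of C on sparsity can be sketched by counting the nonzero weights for a few values of C. To keep the sketch self-contained it uses scikit-learn's bundled breast cancer data (a two-class data set that is not part of this post) instead of the Wine data, and the `liblinear` solver because it supports L1.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled two-class data set stands in for the Wine data
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_std = StandardScaler().fit_transform(X_bc)

nonzero_counts = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                             random_state=0)
    clf.fit(X_bc_std, y_bc)
    # number of weights that L1 regularization left nonzero
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

A small C (strong regularization) should leave only a few nonzero weights, and a large C should leave most of them nonzero.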


lr.intercept_

np.set_printoptions(8)

lr.coef_[lr.coef_!=0].shape

lr.coef_

lr.intercept_

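This returns an array with three intercept values. There are three because the model separates three classes.
More precisely: class 1 versus the rest, class 2 versus the rest, and class 3 versus the rest.

np.set_printoptions(8)

This sets the print options. The argument 8 is the precision, i.e. the number of digits printed after the decimal point.

lr.coef_[lr.coef_!=0].shape

lr.coef_ is the array of weights, with one row per class. The boolean mask keeps only the nonzero weights, so the shape gives the number of nonzero weights.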
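How the boolean mask works can be seen on a small made-up weight array. `coef_demo` is an assumption for illustration and not the fitted `lr.coef_`.

```python
import numpy as np

# Made-up weights: 2 classes x 3 features, mostly zero
coef_demo = np.array([[0.0, 1.2, 0.0],
                      [-0.5, 0.0, 0.0]])

mask = coef_demo != 0          # True where a weight is nonzero
nonzero = coef_demo[mask]      # flat array holding only those weights
print(nonzero, nonzero.shape)
```

Indexing with the mask flattens the result, so the shape is a one-element tuple holding the count of nonzero weights.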

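Varying the L1 regularization parameter λ

As it stands, there are more nonzero weights than zeros, so the solution is not really sparse.
→ Apparently the next step is to try making it sparser.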


import matplotlib.pyplot as plt

# draw with matplotlib (plt); plt.figure creates the figure to draw in
fig = plt.figure()

# plt.subplot(111) creates a 1x1 grid of axes and selects the first (only) one
ax = plt.subplot(111)
    
colors = ['blue', 'green', 'red', 'cyan', 
          'magenta', 'yellow', 'black', 
          'pink', 'lightgreen', 'lightblue', 
          'gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4., 6.):
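    # solver='liblinear' is named because the newer default solver ('lbfgs') does not support L1
    lr = LogisticRegression(penalty='l1', C=10.**c, solver='liblinear', random_state=0)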
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center', 
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
# plt.savefig('images/04_07.png', dpi=300, 
#            bbox_inches='tight', pad_inches=0.2)
plt.show()

Sequential feature selection algorithms


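from itertools import combinations

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring              # metric used to evaluate the feature subsets
        self.estimator = clone(estimator)   # estimator (cloned, so the original is untouched)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state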

    def fit(self, X, y):
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)

        # number of features and the column indices of all features
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, 
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, 
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            # the best score among the candidate subsets of this round is appended to self.scores_
            self.scores_.append(scores[best])

        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score

This part was difficult, so I am skipping the details.
Reading the explanation gives me a rough idea, but the finer points are hard.
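The core of each SBS round is the call to `combinations`, which lists every subset with exactly one feature removed. A minimal sketch with four made-up column indices:

```python
from itertools import combinations

indices = (0, 1, 2, 3)   # column indices of the current feature subset

# every way to keep all but one feature (one candidate per removed feature)
candidates = list(combinations(indices, r=len(indices) - 1))
print(candidates)
# → [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
```

SBS scores each candidate, keeps the best one, and repeats with one feature fewer until only `k_features` remain.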


from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

# selecting features
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.02])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('images/04_08.png', dpi=300)
plt.show()




k3 = list(sbs.subsets_[10])
print(df_wine.columns[1:][k3])




knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))




knn.fit(X_train_std[:, k3], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k3], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k3], y_test))



Assessing feature importance with Random Forests

from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), 
        importances[indices],
        align='center')

plt.xticks(range(X_train.shape[1]), 
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
# plt.savefig('images/04_09.png', dpi=300)
plt.show()





from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold criterion:', 
      X_selected.shape[1])


# Now, let's print the 3 features that met the threshold criterion for feature selection that we set earlier (note that this code snippet does not appear in the actual book but was added to this notebook later for illustrative purposes):



for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))
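The selected features can also be read directly from the selector's mask instead of the sorted importance list. The sketch below is an alternative to the loop above, not the notebook's code; it uses scikit-learn's bundled copy of the Wine data and a smaller forest (`n_estimators=50`) so that it runs on its own, and the names ending in `_demo` are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumption: the bundled Wine data stands in for df_wine
wine = load_wine()
forest_demo = RandomForestClassifier(n_estimators=50, random_state=1)
forest_demo.fit(wine.data, wine.target)

# prefit=True reuses the already fitted forest
sfm_demo = SelectFromModel(forest_demo, threshold=0.1, prefit=True)
mask = sfm_demo.get_support()   # True for features with importance >= 0.1
selected_names = [name for name, keep in zip(wine.feature_names, mask) if keep]
X_selected_demo = sfm_demo.transform(wine.data)

print(selected_names, X_selected_demo.shape)
```

Reading the names from the mask does not rely on the selected features being the first entries of the sorted importance list.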



Summary

...
