Feature Selection Datasets
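Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying machine learning and benchmarking methods.
It contains a great many datasets, so I ran a quick analysis to get an overview of their contents and find ones that suit my needs.
Besides downloading each dataset and inspecting its structure, I also used scikit-learn's RandomForestClassifier to gauge how hard each classification problem is.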
Code
import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
Here is the list of datasets to download. About two of the URLs were wrong, so I corrected them.
dataset_url = [
"http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
"http://featureselection.asu.edu/files/datasets/PCMAC.mat",
"http://featureselection.asu.edu/files/datasets/RELATHE.mat",
"http://featureselection.asu.edu/files/datasets/COIL20.mat",
"http://featureselection.asu.edu/files/datasets/ORL.mat",
"http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
"http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
"http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
"http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
"http://featureselection.asu.edu/files/datasets/Yale.mat",
"http://featureselection.asu.edu/files/datasets/USPS.mat",
"http://featureselection.asu.edu/files/datasets/ALLAML.mat",
"http://featureselection.asu.edu/files/datasets/Carcinom.mat",
"http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
"http://featureselection.asu.edu/files/datasets/colon.mat",
"http://featureselection.asu.edu/files/datasets/GLI_85.mat",
"http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
"http://featureselection.asu.edu/files/datasets/leukemia.mat",
"http://featureselection.asu.edu/files/datasets/lung.mat",
"http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
"http://featureselection.asu.edu/files/datasets/lymphoma.mat",
"http://featureselection.asu.edu/files/datasets/nci9.mat",
"http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
"http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
"http://featureselection.asu.edu/files/datasets/TOX_171.mat",
"http://featureselection.asu.edu/files/datasets/arcene.mat",
"http://featureselection.asu.edu/files/datasets/gisette.mat",
"http://featureselection.asu.edu/files/datasets/Isolet.mat",
"http://featureselection.asu.edu/files/datasets/madelon.mat"
]
result = {
'dataset':[],
'byte':[],
'X.shape':[],
'X_type':[],
'y.shape':[],
'n_class':[],
'RF_max':[],
'RF_mean':[],
'RF_min':[],
'sec':[],
}
for url in dataset_url:
    # Download the .mat file (overwritten on every iteration) and record its size
    result['dataset'].append(url.split("/")[-1])
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Load the feature matrix X and the label vector Y
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # nunique() counts distinct values per column; [0] takes the first column's count
    result['X_type'].append(pd.DataFrame(X).nunique()[0])
    result['y.shape'].append(y.shape)
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Fit a default random forest on 10 random train/test splits
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        # Time only the call to fit()
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))

    # Summarize accuracy and fit time over the 10 runs
    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))
    result['sec'].append(sum(times) / len(times))
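The loop above writes every download to the same dataset.mat, so each run fetches all the files again. Below is a minimal sketch of one way to keep a local copy per dataset. The cache directory name `datasets` and the helpers `cached_path` and `fetch_if_missing` are assumptions made for illustration, not part of the original code.

```python
import os
import urllib.request


def cached_path(url, cache_dir="datasets"):
    """Return the local path where the file behind `url` would be stored."""
    return os.path.join(cache_dir, url.split("/")[-1])


def fetch_if_missing(url, cache_dir="datasets",
                     retrieve=urllib.request.urlretrieve):
    """Download `url` into `cache_dir` unless a copy is already there."""
    os.makedirs(cache_dir, exist_ok=True)
    path = cached_path(url, cache_dir)
    if not os.path.exists(path):
        # Same download call as in the loop, but to a per-dataset file
        retrieve(url, path)
    return path
```

With this in place, the loop could call `filename = fetch_if_missing(url)` instead of downloading to a fixed name. The `retrieve` parameter exists only so the download step can be swapped out, for example in a test.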
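Results
- dataset : name of the dataset
- byte : size of the dataset file (bytes)
- X.shape : shape of the explanatory variables (samples, features)
- X_type : number of distinct values in the explanatory variables. A value of 2 means only two values occur, so the data can be treated as discrete. A sufficiently large value means the data can be treated as effectively continuous. Note that the code takes this count from the first feature column only.
- y.shape : shape of the target variable
- n_class : number of distinct values in the target variable, i.e. the number of classes
- RF_max, RF_mean, RF_min : maximum, mean, and minimum test accuracy of the random forest over the 10 random train/test splits
- sec : mean time (sec) taken to fit the random forest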
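To make the X_type and n_class columns concrete, here is a small sketch on made-up toy data. It shows that `pd.DataFrame(X).nunique()[0]` reports the distinct-value count of the first column only. The `np.unique` alternative for counting over the whole matrix is not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration: 3 samples, 2 features
X = np.array([[0.0, 1.5],
              [0.0, 2.5],
              [1.0, 3.5]])
y = np.array([1, 2, 1])

# What the loop records as X_type: distinct values in the first column
x_type_first_column = pd.DataFrame(X).nunique()[0]

# Alternative (not in the original): distinct values over the entire matrix
x_type_whole_matrix = np.unique(X).size

# n_class: distinct labels in y
n_class = pd.DataFrame(y).nunique()[0]

print(x_type_first_column, x_type_whole_matrix, n_class)  # 2 5 2
```

The two counts can differ a lot when the columns hold different kinds of values, so it is worth keeping in mind which one a given table reports.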
pd.DataFrame(result).sort_values("RF_max")
index | dataset | byte | X.shape | X_type | y.shape | n_class | RF_max | RF_mean | RF_min | sec |
---|---|---|---|---|---|---|---|---|---|---|
21 | nci9.mat | 169288 | (60, 9712) | 3 | (60,) | 9 | 0.666667 | 0.433333 | 0.266667 | 0.183649 |
23 | SMK_CAN_187.mat | 11861244 | (187, 19993) | 171 | (187,) | 2 | 0.723404 | 0.655319 | 0.574468 | 0.670948 |
28 | madelon.mat | 1496573 | (2600, 500) | 40 | (2600,) | 2 | 0.733846 | 0.707385 | 0.680000 | 2.456003 |
13 | CLL_SUB_111.mat | 5875157 | (111, 11340) | 111 | (111,) | 3 | 0.750000 | 0.657143 | 0.464286 | 0.326307 |
24 | TOX_171.mat | 3470586 | (171, 5748) | 169 | (171,) | 4 | 0.813953 | 0.772093 | 0.697674 | 0.405085 |
16 | GLIOMA.mat | 1462087 | (50, 4434) | 50 | (50,) | 4 | 0.846154 | 0.669231 | 0.538462 | 0.154852 |
9 | Yale.mat | 161021 | (165, 1024) | 77 | (165,) | 15 | 0.857143 | 0.769048 | 0.595238 | 0.306511 |
25 | arcene.mat | 1900005 | (200, 10000) | 82 | (200,) | 2 | 0.900000 | 0.788000 | 0.680000 | 0.417719 |
20 | lymphoma.mat | 110185 | (96, 4026) | 3 | (96,) | 9 | 0.916667 | 0.829167 | 0.708333 | 0.169875 |
2 | RELATHE.mat | 226918 | (1427, 4322) | 2 | (1427,) | 2 | 0.921569 | 0.898880 | 0.876751 | 1.218853 |
14 | colon.mat | 36319 | (62, 2000) | 3 | (62,) | 2 | 0.937500 | 0.768750 | 0.687500 | 0.135427 |
7 | warpAR10P.mat | 279711 | (130, 2400) | 63 | (130,) | 10 | 0.939394 | 0.851515 | 0.757576 | 0.274956 |
1 | PCMAC.mat | 191131 | (1943, 3289) | 4 | (1943,) | 2 | 0.944444 | 0.922634 | 0.899177 | 1.491283 |
4 | ORL.mat | 376584 | (400, 1024) | 151 | (400,) | 40 | 0.950000 | 0.921000 | 0.830000 | 1.216780 |
15 | GLI_85.mat | 8743262 | (85, 22283) | 85 | (85,) | 2 | 0.954545 | 0.863636 | 0.772727 | 0.269521 |
27 | Isolet.mat | 3652673 | (1560, 617) | 1340 | (1560,) | 26 | 0.956410 | 0.938205 | 0.905128 | 2.222803 |
18 | lung.mat | 4762671 | (203, 3312) | 203 | (203,) | 5 | 0.960784 | 0.929412 | 0.882353 | 0.380843 |
22 | Prostate_GE.mat | 1524983 | (102, 5966) | 29 | (102,) | 2 | 0.961538 | 0.900000 | 0.807692 | 0.207986 |
10 | USPS.mat | 15138167 | (9298, 256) | 1617 | (9298,) | 10 | 0.965161 | 0.960258 | 0.955699 | 9.295629 |
26 | gisette.mat | 10619742 | (7000, 5000) | 345 | (7000,) | 2 | 0.974286 | 0.968971 | 0.961714 | 9.597926 |
12 | Carcinom.mat | 6917199 | (174, 9182) | 156 | (174,) | 11 | 0.977273 | 0.868182 | 0.772727 | 0.557979 |
0 | BASEHOCK.mat | 279059 | (1993, 4862) | 2 | (1993,) | 2 | 0.985972 | 0.974349 | 0.965932 | 1.789281 |
3 | COIL20.mat | 3024549 | (1440, 1024) | 10 | (1440,) | 20 | 1.000000 | 0.998889 | 0.994444 | 1.873450 |
11 | ALLAML.mat | 3639219 | (72, 7129) | 66 | (72,) | 2 | 1.000000 | 0.938889 | 0.833333 | 0.183536 |
6 | pixraw10P.mat | 520463 | (100, 10000) | 11 | (100,) | 10 | 1.000000 | 0.972000 | 0.920000 | 0.338596 |
17 | leukemia.mat | 154743 | (72, 7070) | 3 | (72,) | 2 | 1.000000 | 0.950000 | 0.777778 | 0.155346 |
8 | warpPIE10P.mat | 458267 | (210, 2420) | 36 | (210,) | 10 | 1.000000 | 0.962264 | 0.924528 | 0.410544 |
5 | orlraws10P.mat | 951783 | (100, 10304) | 46 | (100,) | 10 | 1.000000 | 0.988000 | 0.960000 | 0.415471 |
19 | lung_discrete.mat | 7516 | (73, 325) | 3 | (73,) | 7 | 1.000000 | 0.800000 | 0.526316 | 0.131734 |
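Solving problems that are too easy is not much fun, so I sorted the table in ascending order of RF_max, which puts the hardest datasets first.
I hope this helps when choosing a dataset.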
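As one way to shortlist datasets from such a table, the sketch below filters on mean accuracy and fit time. The four rows are copied from the results above. The thresholds `max_mean_accuracy` and `max_fit_seconds` are arbitrary values chosen for illustration.

```python
import pandas as pd

# A few rows copied from the results table (RF_mean and sec only)
df = pd.DataFrame({
    "dataset": ["nci9.mat", "madelon.mat", "USPS.mat", "COIL20.mat"],
    "RF_mean": [0.433333, 0.707385, 0.960258, 0.998889],
    "sec": [0.183649, 2.456003, 9.295629, 1.873450],
})

# Hypothetical thresholds: not nearly solved, and quick to fit
max_mean_accuracy = 0.9
max_fit_seconds = 5.0

candidates = df[(df["RF_mean"] < max_mean_accuracy)
                & (df["sec"] < max_fit_seconds)]
candidates = candidates.sort_values("RF_mean")

print(candidates["dataset"].tolist())  # ['nci9.mat', 'madelon.mat']
```

The same filter could be applied directly to `pd.DataFrame(result)` once the loop has finished.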