Feature Selection Datasets

Posted at 2020-10-01

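Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying machine learning and for benchmarking methods.

The collection is large, so I ran a quick analysis to get an overview of its contents and to find datasets that suit my needs.

Besides downloading each dataset and inspecting its structure, I also used scikit-learn's RandomForestClassifier to gauge how difficult each classification problem is.

Code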

import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

The datasets to download are listed below. I corrected two URLs that were wrong.

dataset_url = [
        "http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
        "http://featureselection.asu.edu/files/datasets/PCMAC.mat",
        "http://featureselection.asu.edu/files/datasets/RELATHE.mat",
        "http://featureselection.asu.edu/files/datasets/COIL20.mat",
        "http://featureselection.asu.edu/files/datasets/ORL.mat",
        "http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
        "http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
        "http://featureselection.asu.edu/files/datasets/Yale.mat",
        "http://featureselection.asu.edu/files/datasets/USPS.mat",
        "http://featureselection.asu.edu/files/datasets/ALLAML.mat",
        "http://featureselection.asu.edu/files/datasets/Carcinom.mat",
        "http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
        "http://featureselection.asu.edu/files/datasets/colon.mat",
        "http://featureselection.asu.edu/files/datasets/GLI_85.mat",
        "http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
        "http://featureselection.asu.edu/files/datasets/leukemia.mat",
        "http://featureselection.asu.edu/files/datasets/lung.mat",
        "http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
        "http://featureselection.asu.edu/files/datasets/lymphoma.mat",
        "http://featureselection.asu.edu/files/datasets/nci9.mat",
        "http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
        "http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
        "http://featureselection.asu.edu/files/datasets/TOX_171.mat",
        "http://featureselection.asu.edu/files/datasets/arcene.mat",
        "http://featureselection.asu.edu/files/datasets/gisette.mat",
        "http://featureselection.asu.edu/files/datasets/Isolet.mat",
        "http://featureselection.asu.edu/files/datasets/madelon.mat"
]

result = {
    'dataset':[],
    'byte':[],
    'X.shape':[],
    'X_type':[],
    'y.shape':[],
    'n_class':[],
    'RF_max':[],
    'RF_mean':[],
    'RF_min':[],
    'sec':[],
    }

for url in dataset_url:
    # Use the file name at the end of the URL as the dataset name.
    result['dataset'].append(url.split("/")[-1])

    # Download to a scratch file (overwritten on every iteration).
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Each .mat file stores the features under 'X' and the labels under 'Y'.
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # Number of distinct values in the first feature column only.
    result['X_type'].append(pd.DataFrame(X).nunique()[0])

    result['y.shape'].append(y.shape)
    # Number of distinct labels, i.e. the number of classes.
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Repeat a random train/test split 10 times; record accuracy and fit time.
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test,y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))

    result['sec'].append(sum(times) / len(times))
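
The loop above writes every download to the same dataset.mat, so re-running the script fetches all 29 files again. Below is a minimal sketch of a caching variant. The directory name `datasets` and the helpers `cached_path` and `fetch` are hypothetical and not part of the original script.

```python
import os
import urllib.request


def cached_path(url, cache_dir="datasets"):
    """Map a dataset URL to a local path, e.g. .../colon.mat -> datasets/colon.mat."""
    return os.path.join(cache_dir, url.split("/")[-1])


def fetch(url, cache_dir="datasets"):
    """Download url into cache_dir unless the file already exists; return the local path."""
    path = cached_path(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path
```

Inside the loop, `filename = fetch(url)` could then stand in for the two lines that set `filename` and call `urlretrieve`.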

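Results

dataset : Name of the dataset (file name).
byte : Size of the dataset file in bytes.
X.shape : Shape of the explanatory variables (samples, features).
X_type : Number of distinct values in the explanatory variables; the code counts the first feature column only. A value of 2 means the column holds just two values and can be treated as discrete. A sufficiently large value means it can be treated as effectively continuous.
y.shape : Shape of the target variable.
n_class : Number of distinct values in the target variable, i.e. the number of classes.
RF_max, RF_mean, RF_min : Maximum, mean, and minimum accuracy of the random forest classifier over 10 random train/test splits.
sec : Mean time in seconds taken to fit the classifier.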
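
Because the script takes `nunique()[0]`, X_type reflects a single feature. Here is a rough sketch of how every column could be checked instead, written in plain Python for illustration. `distinct_per_column` is a hypothetical helper and the demo table is made up.

```python
def distinct_per_column(rows):
    """Return the number of distinct values in each column of a 2-D table."""
    return [len(set(column)) for column in zip(*rows)]


# Tiny made-up table: column 0 is binary, column 1 has three distinct values.
X_demo = [
    [0, 1.5],
    [1, 2.5],
    [0, 3.5],
    [1, 1.5],
]
counts = distinct_per_column(X_demo)
print(counts)                    # [2, 3]
print(min(counts), max(counts))  # 2 3
```

With pandas, `pd.DataFrame(X).nunique()` without the trailing `[0]` returns the same per-column counts as a Series.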

pd.DataFrame(result).sort_values("RF_max")

	dataset	byte	X.shape	X_type	y.shape	n_class	RF_max	RF_mean	RF_min	sec
21	nci9.mat	169288	(60, 9712)	3	(60,)	9	0.666667	0.433333	0.266667	0.183649
23	SMK_CAN_187.mat	11861244	(187, 19993)	171	(187,)	2	0.723404	0.655319	0.574468	0.670948
28	madelon.mat	1496573	(2600, 500)	40	(2600,)	2	0.733846	0.707385	0.680000	2.456003
13	CLL_SUB_111.mat	5875157	(111, 11340)	111	(111,)	3	0.750000	0.657143	0.464286	0.326307
24	TOX_171.mat	3470586	(171, 5748)	169	(171,)	4	0.813953	0.772093	0.697674	0.405085
16	GLIOMA.mat	1462087	(50, 4434)	50	(50,)	4	0.846154	0.669231	0.538462	0.154852
9	Yale.mat	161021	(165, 1024)	77	(165,)	15	0.857143	0.769048	0.595238	0.306511
25	arcene.mat	1900005	(200, 10000)	82	(200,)	2	0.900000	0.788000	0.680000	0.417719
20	lymphoma.mat	110185	(96, 4026)	3	(96,)	9	0.916667	0.829167	0.708333	0.169875
2	RELATHE.mat	226918	(1427, 4322)	2	(1427,)	2	0.921569	0.898880	0.876751	1.218853
14	colon.mat	36319	(62, 2000)	3	(62,)	2	0.937500	0.768750	0.687500	0.135427
7	warpAR10P.mat	279711	(130, 2400)	63	(130,)	10	0.939394	0.851515	0.757576	0.274956
1	PCMAC.mat	191131	(1943, 3289)	4	(1943,)	2	0.944444	0.922634	0.899177	1.491283
4	ORL.mat	376584	(400, 1024)	151	(400,)	40	0.950000	0.921000	0.830000	1.216780
15	GLI_85.mat	8743262	(85, 22283)	85	(85,)	2	0.954545	0.863636	0.772727	0.269521
27	Isolet.mat	3652673	(1560, 617)	1340	(1560,)	26	0.956410	0.938205	0.905128	2.222803
18	lung.mat	4762671	(203, 3312)	203	(203,)	5	0.960784	0.929412	0.882353	0.380843
22	Prostate_GE.mat	1524983	(102, 5966)	29	(102,)	2	0.961538	0.900000	0.807692	0.207986
10	USPS.mat	15138167	(9298, 256)	1617	(9298,)	10	0.965161	0.960258	0.955699	9.295629
26	gisette.mat	10619742	(7000, 5000)	345	(7000,)	2	0.974286	0.968971	0.961714	9.597926
12	Carcinom.mat	6917199	(174, 9182)	156	(174,)	11	0.977273	0.868182	0.772727	0.557979
0	BASEHOCK.mat	279059	(1993, 4862)	2	(1993,)	2	0.985972	0.974349	0.965932	1.789281
3	COIL20.mat	3024549	(1440, 1024)	10	(1440,)	20	1.000000	0.998889	0.994444	1.873450
11	ALLAML.mat	3639219	(72, 7129)	66	(72,)	2	1.000000	0.938889	0.833333	0.183536
6	pixraw10P.mat	520463	(100, 10000)	11	(100,)	10	1.000000	0.972000	0.920000	0.338596
17	leukemia.mat	154743	(72, 7070)	3	(72,)	2	1.000000	0.950000	0.777778	0.155346
8	warpPIE10P.mat	458267	(210, 2420)	36	(210,)	10	1.000000	0.962264	0.924528	0.410544
5	orlraws10P.mat	951783	(100, 10304)	46	(100,)	10	1.000000	0.988000	0.960000	0.415471
19	lung_discrete.mat	7516	(73, 325)	3	(73,)	7	1.000000	0.800000	0.526316	0.131734
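
The wide gap between RF_max and RF_min on the small datasets is partly a matter of test-set size. As far as I understand scikit-learn's defaults, `train_test_split` holds out 25% of the samples, rounded up, so a dataset with 73 samples is scored on only 19 of them. Here is a small sketch of that arithmetic; `holdout_size` and `accuracy_step` are illustrative names and the 0.25 default is an assumption.

```python
import math


def holdout_size(n_samples, test_fraction=0.25):
    """Number of held-out samples, assuming the fraction is rounded up."""
    return math.ceil(test_fraction * n_samples)


def accuracy_step(n_samples, test_fraction=0.25):
    """Smallest possible change in accuracy: one test sample more or less correct."""
    return 1 / holdout_size(n_samples, test_fraction)


# Sample counts taken from the X.shape column of the table above.
for name, n in [("lung_discrete", 73), ("nci9", 60), ("USPS", 9298)]:
    print(name, holdout_size(n), round(accuracy_step(n), 4))
```

For lung_discrete, RF_min = 0.526316 matches 10 correct out of 19 test samples, so a single sample moves the score by about 5 points.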

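Solving problems that are too easy is not much fun, so I sorted the table by RF_max in ascending order, which puts the hardest datasets at the top.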
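
One way to narrow the table down further is to drop datasets that at least one split solves perfectly and rank the rest by mean accuracy. Below is a minimal sketch in plain Python, using a handful of rows copied from the table above. The 1.0 cutoff and the helper name `hard_datasets` are my own choices for illustration.

```python
# (dataset, RF_max, RF_mean) for a few rows of the result table.
rows = [
    ("nci9.mat", 0.666667, 0.433333),
    ("SMK_CAN_187.mat", 0.723404, 0.655319),
    ("madelon.mat", 0.733846, 0.707385),
    ("COIL20.mat", 1.000000, 0.998889),
    ("lung_discrete.mat", 1.000000, 0.800000),
]


def hard_datasets(rows, max_cutoff=1.0):
    """Keep rows whose best score stays below max_cutoff, lowest mean accuracy first."""
    kept = [row for row in rows if row[1] < max_cutoff]
    return sorted(kept, key=lambda row: row[2])


for name, rf_max, rf_mean in hard_datasets(rows):
    print(name, rf_max, rf_mean)
```

With the full DataFrame (say `df = pd.DataFrame(result)`), something along the lines of `df[df["RF_max"] < 1.0].sort_values("RF_mean")` should do the same.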

I hope this helps when choosing a dataset.
