More than 3 years have passed since last update.

Feature Selection Datasets

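Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying machine learning and benchmarking methods.

It contains a large number of datasets, so I ran a light analysis to get an overview of what is in each one and to find data that suits my purposes.

Beyond downloading the data and inspecting its structure, I also used scikit-learn's RandomForestClassifier to gauge how difficult each classification problem is.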
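
As a rough illustration of what inspecting the structure of a .mat file involves, the sketch below writes a tiny synthetic file and reads it back with `scipy.io.loadmat`. The array contents and the file name `demo.mat` are made up for the example; only the `X` and `Y` key names mirror what the script in this article reads.

```python
import os
import tempfile

import numpy as np
from scipy import io

# Synthetic stand-in for a downloaded dataset (values are illustrative only).
X_demo = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features
Y_demo = np.array([[1], [2], [1], [2]])            # labels stored as a column

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.mat")
    io.savemat(path, {"X": X_demo, "Y": Y_demo})
    # squeeze_me=True drops length-1 dimensions, so the label column becomes 1-D.
    matdata = io.loadmat(path, squeeze_me=True)

# loadmat also returns metadata keys such as __header__; skip those.
data_keys = sorted(k for k in matdata if not k.startswith("__"))
X = matdata["X"]
y = np.ravel(matdata["Y"])

print(data_keys)
print(X.shape, y.shape)
```

Listing the non-metadata keys first is a cheap way to confirm that a file really uses `X` and `Y` before indexing into it.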

Code

import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Here is the list of datasets to download. I corrected two or so URLs that were wrong.

dataset_url = [
        "http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
        "http://featureselection.asu.edu/files/datasets/PCMAC.mat",
        "http://featureselection.asu.edu/files/datasets/RELATHE.mat",
        "http://featureselection.asu.edu/files/datasets/COIL20.mat",
        "http://featureselection.asu.edu/files/datasets/ORL.mat",
        "http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
        "http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
        "http://featureselection.asu.edu/files/datasets/Yale.mat",
        "http://featureselection.asu.edu/files/datasets/USPS.mat",
        "http://featureselection.asu.edu/files/datasets/ALLAML.mat",
        "http://featureselection.asu.edu/files/datasets/Carcinom.mat",
        "http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
        "http://featureselection.asu.edu/files/datasets/colon.mat",
        "http://featureselection.asu.edu/files/datasets/GLI_85.mat",
        "http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
        "http://featureselection.asu.edu/files/datasets/leukemia.mat",
        "http://featureselection.asu.edu/files/datasets/lung.mat",
        "http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
        "http://featureselection.asu.edu/files/datasets/lymphoma.mat",
        "http://featureselection.asu.edu/files/datasets/nci9.mat",
        "http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
        "http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
        "http://featureselection.asu.edu/files/datasets/TOX_171.mat",
        "http://featureselection.asu.edu/files/datasets/arcene.mat",
        "http://featureselection.asu.edu/files/datasets/gisette.mat",
        "http://featureselection.asu.edu/files/datasets/Isolet.mat",
        "http://featureselection.asu.edu/files/datasets/madelon.mat"
]
result = {
    'dataset':[],
    'byte':[],
    'X.shape':[],
    'X_type':[],
    'y.shape':[],
    'n_class':[],
    'RF_max':[],
    'RF_mean':[],
    'RF_min':[],
    'sec':[],
    }

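for url in dataset_url:
    # Use the file name at the end of the URL as the dataset name.
    result['dataset'].append(url.split("/")[-1])

    # Download to a scratch file (overwritten on each iteration) and record its size.
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Each .mat file stores the features under 'X' and the labels under 'Y'.
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # nunique()[0] counts the distinct values in the first feature column only.
    result['X_type'].append(pd.DataFrame(X).nunique()[0])

    result['y.shape'].append(y.shape)
    # y is a single column, so this is the number of classes.
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Fit a default random forest on 10 random train/test splits.
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        # Time only the fit; accuracy is measured on the held-out split.
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))

    result['sec'].append(sum(times) / len(times))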
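
The repeated-split evaluation in the loop above can also be pulled out into a small helper. The following is a minimal sketch, not the code used to produce the results in this article: the function name `evaluate_rf`, the `seed` parameter, and the synthetic data from `make_classification` are assumptions added so that the example is self-contained and repeatable.

```python
import timeit

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def evaluate_rf(X, y, n_repeats=10, seed=0):
    """Repeat random train/test splits; return accuracy stats and mean fit time."""
    scores, times = [], []
    for i in range(n_repeats):
        # Seeding the split and the model is an addition for repeatability;
        # the original loop leaves both unseeded.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, random_state=seed + i
        )
        model = RandomForestClassifier(random_state=seed + i)
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))
    return {
        "RF_max": max(scores),
        "RF_mean": float(np.mean(scores)),
        "RF_min": min(scores),
        "sec": float(np.mean(times)),
    }


# Synthetic data stands in for one of the downloaded datasets.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
stats = evaluate_rf(X_demo, y_demo, n_repeats=3)
print(stats)
```

With fixed seeds the accuracy figures repeat from run to run, which makes it easier to compare datasets; the timing still depends on the machine.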

Results

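  • dataset : name of the dataset
  • byte : size of the dataset file (bytes)
  • X.shape : shape of the explanatory variables
  • X_type : number of distinct values in the explanatory variables (as computed here, in the first feature column). A value of 2 means only two distinct values appear, so the data can be treated as discrete; if the count is large enough, it can be treated as effectively continuous.
  • y.shape : shape of the target variable
  • n_class : number of distinct values in the target variable, i.e. the number of classes
  • RF_max, RF_mean, RF_min : maximum, mean, and minimum accuracy when solving the classification problem with a random forest
  • sec : mean time (sec) needed to fit the classifier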
pd.DataFrame(result).sort_values("RF_max")
    dataset                byte  X.shape       X_type  y.shape  n_class    RF_max   RF_mean    RF_min       sec
21  nci9.mat             169288  (60, 9712)         3  (60,)          9  0.666667  0.433333  0.266667  0.183649
23  SMK_CAN_187.mat    11861244  (187, 19993)     171  (187,)         2  0.723404  0.655319  0.574468  0.670948
28  madelon.mat         1496573  (2600, 500)       40  (2600,)        2  0.733846  0.707385  0.680000  2.456003
13  CLL_SUB_111.mat     5875157  (111, 11340)     111  (111,)         3  0.750000  0.657143  0.464286  0.326307
24  TOX_171.mat         3470586  (171, 5748)      169  (171,)         4  0.813953  0.772093  0.697674  0.405085
16  GLIOMA.mat          1462087  (50, 4434)        50  (50,)          4  0.846154  0.669231  0.538462  0.154852
 9  Yale.mat             161021  (165, 1024)       77  (165,)        15  0.857143  0.769048  0.595238  0.306511
25  arcene.mat          1900005  (200, 10000)      82  (200,)         2  0.900000  0.788000  0.680000  0.417719
20  lymphoma.mat         110185  (96, 4026)         3  (96,)          9  0.916667  0.829167  0.708333  0.169875
 2  RELATHE.mat          226918  (1427, 4322)       2  (1427,)        2  0.921569  0.898880  0.876751  1.218853
14  colon.mat             36319  (62, 2000)         3  (62,)          2  0.937500  0.768750  0.687500  0.135427
 7  warpAR10P.mat        279711  (130, 2400)       63  (130,)        10  0.939394  0.851515  0.757576  0.274956
 1  PCMAC.mat            191131  (1943, 3289)       4  (1943,)        2  0.944444  0.922634  0.899177  1.491283
 4  ORL.mat              376584  (400, 1024)      151  (400,)        40  0.950000  0.921000  0.830000  1.216780
15  GLI_85.mat          8743262  (85, 22283)       85  (85,)          2  0.954545  0.863636  0.772727  0.269521
27  Isolet.mat          3652673  (1560, 617)     1340  (1560,)       26  0.956410  0.938205  0.905128  2.222803
18  lung.mat            4762671  (203, 3312)      203  (203,)         5  0.960784  0.929412  0.882353  0.380843
22  Prostate_GE.mat     1524983  (102, 5966)       29  (102,)         2  0.961538  0.900000  0.807692  0.207986
10  USPS.mat           15138167  (9298, 256)     1617  (9298,)       10  0.965161  0.960258  0.955699  9.295629
26  gisette.mat        10619742  (7000, 5000)     345  (7000,)        2  0.974286  0.968971  0.961714  9.597926
12  Carcinom.mat        6917199  (174, 9182)      156  (174,)        11  0.977273  0.868182  0.772727  0.557979
 0  BASEHOCK.mat         279059  (1993, 4862)       2  (1993,)        2  0.985972  0.974349  0.965932  1.789281
 3  COIL20.mat          3024549  (1440, 1024)      10  (1440,)       20  1.000000  0.998889  0.994444  1.873450
11  ALLAML.mat          3639219  (72, 7129)        66  (72,)          2  1.000000  0.938889  0.833333  0.183536
 6  pixraw10P.mat        520463  (100, 10000)      11  (100,)        10  1.000000  0.972000  0.920000  0.338596
17  leukemia.mat         154743  (72, 7070)         3  (72,)          2  1.000000  0.950000  0.777778  0.155346
 8  warpPIE10P.mat       458267  (210, 2420)       36  (210,)        10  1.000000  0.962264  0.924528  0.410544
 5  orlraws10P.mat       951783  (100, 10304)      46  (100,)        10  1.000000  0.988000  0.960000  0.415471
19  lung_discrete.mat      7516  (73, 325)          3  (73,)          7  1.000000  0.800000  0.526316  0.131734
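
One caveat when reading the X_type column: `pd.DataFrame(X).nunique()[0]` looks only at the first feature column. The sketch below, on a small made-up dense array, contrasts that with a per-column count and a count over the whole matrix. The variable names are illustrative, and `np.unique` as used here assumes X is a dense NumPy array.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: column 0 is binary, column 1 takes five distinct values.
X_demo = np.array([
    [0, 0.1],
    [1, 0.2],
    [0, 0.3],
    [1, 0.4],
    [0, 0.5],
])

first_column = pd.DataFrame(X_demo).nunique()[0]  # what the script records as X_type
per_column = pd.DataFrame(X_demo).nunique()       # distinct values in each column
whole_matrix = np.unique(X_demo).size             # distinct values across all of X

print(first_column, per_column.max(), whole_matrix)
```

If the first column happens to be unrepresentative, the per-column maximum or the whole-matrix count may give a better sense of whether the features are discrete or continuous.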

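Solving problems that are too easy is not much fun, so I sorted the table in ascending order of RF_max, which puts the hardest datasets first.

I hope this helps when choosing a dataset.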
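
For reference, `sort_values` sorts in ascending order by default, which is why the table starts with the lowest RF_max. A small sketch using three rows copied from the results table shows both directions; the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the results table in this article.
demo = pd.DataFrame({
    "dataset": ["COIL20.mat", "nci9.mat", "madelon.mat"],
    "RF_max": [1.000000, 0.666667, 0.733846],
})

hardest_first = demo.sort_values("RF_max")                   # ascending (default)
easiest_first = demo.sort_values("RF_max", ascending=False)  # descending

print(hardest_first["dataset"].tolist())
print(easiest_first["dataset"].tolist())
```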
