
Dealing with ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

When using scikit-learn, have you ever run into the error

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

and then struggled to track down where in your data the problem lies? (It happens all the time.)

Problem setup

Suppose we are working with the following data in pandas.

X.shape, Y.shape
((1318, 400), (1318,))

Suppose that using scikit-learn in the usual way produces the following error.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, Y)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-9-55fca59c1791> in <module>()
      2 
      3 model = RandomForestRegressor()
----> 4 model.fit(X, Y)


/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
    293         """
    294         # Validate or convert input data
--> 295         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    296         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    297         if sample_weight is not None:


/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    576         if force_all_finite:
    577             _assert_all_finite(array,
--> 578                                allow_nan=force_all_finite == 'allow-nan')
    579 
    580     if ensure_min_samples > 0:


/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     58                     msg_err.format
     59                     (type_err,
---> 60                      msg_dtype if msg_dtype is not None else X.dtype)
     61             )
     62     # for object dtype data, we only check for NaNs (GH-13254)


ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
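The message names three separate conditions, and the traceback shows that the input is converted to float32 before it is checked. As a minimal sketch (the three values below are made up for illustration), NumPy alone is enough to show why a number that is perfectly finite as float64 can still trigger the error:

```python
import numpy as np

# Three hypothetical values, one per condition named in the error message.
values = np.array([np.nan, np.inf, 1e39], dtype=np.float64)

# 1e39 is finite as float64 but exceeds the float32 maximum (about 3.4e38),
# so the cast turns it into inf. errstate silences the overflow warning.
with np.errstate(over="ignore"):
    as_float32 = values.astype(np.float32)

finite_before = np.isfinite(values).tolist()
finite_after = np.isfinite(as_float32).tolist()
print(finite_before)  # [False, False, True]
print(finite_after)   # [False, False, False]
```

So a column can look clean under a plain NaN check and still be rejected once it is cast to float32.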

Solution

Fit the model on one column at a time and collect only the columns that do not raise an error.

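success_col = []
for i in range(X.shape[1]):
    try:
        # Fit on column i alone; the list [i] keeps the input two-dimensional.
        model = RandomForestRegressor()
        model.fit(X.iloc[:, [i]], Y)
        success_col.append(i)
    except ValueError:
        # This column triggered the NaN / infinity / too-large error; skip it.
        continue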

Extract only the columns that did not raise an error.

X = X.iloc[:, success_col]

The number of columns has gone down.

X.shape
(1318, 392)
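The same columns can usually be found without fitting a model at all, which is much cheaper when there are hundreds of them. The following is a minimal sketch rather than the method used above: it assumes every column is numeric, `find_non_finite_columns` is a hypothetical helper name, and the `demo` DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd


def find_non_finite_columns(X: pd.DataFrame) -> list:
    """Return labels of columns holding NaN, inf, or values that overflow float32."""
    # Cast to float32 because that is the dtype named in the error message.
    with np.errstate(over="ignore"):
        values = X.to_numpy(dtype=np.float64).astype(np.float32)
    # A column is bad if any of its entries is not finite after the cast.
    bad_mask = ~np.isfinite(values).all(axis=0)
    return X.columns[bad_mask].tolist()


# Made-up demo data: one clean column and one column per failure mode.
demo = pd.DataFrame({
    "ok": [1.0, 2.0, 3.0],
    "has_nan": [1.0, np.nan, 3.0],
    "has_inf": [1.0, np.inf, 3.0],
    "too_large": [1.0, 1e39, 3.0],
})

bad_columns = find_non_finite_columns(demo)
clean = demo.drop(columns=bad_columns)
print(bad_columns)          # ['has_nan', 'has_inf', 'too_large']
print(list(clean.columns))  # ['ok']
```

Unlike the try/except loop, this also tells you which columns were dropped by name, so you can inspect them before deciding whether to discard or repair them.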

This time training should succeed.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, Y)
model.score(X, Y)
0.897965478410951

For that matter, it might be more efficient to keep only the columns whose prediction score is reasonably high.

success_col = []
for i in range(X.shape[1]):
    try:
        model = RandomForestRegressor()
        model.fit(X.iloc[:, [i]], Y)
        # Keep the column only if its single-feature R^2 on the training data exceeds 0.1.
        if model.score(X.iloc[:, [i]], Y) > 0.1:
            success_col.append(i)
    except ValueError:
        continue

X = X.iloc[:, success_col]

X.shape
(1318, 272)

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, Y)
model.score(X, Y)
0.8996226954522588
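One caveat: `model.score(X, Y)` here is evaluated on the same rows the model was fitted on, and a random forest can score well on its own training data even for an uninformative column. As a hedged sketch of one alternative (not what the code above does), the per-column score can be cross-validated with scikit-learn's `cross_val_score`. The data below is synthetic, and the names `X_demo`, `y_demo`, `train_r2`, and `cv_r2` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: one informative column and one pure-noise column.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=n)

train_r2 = {}
cv_r2 = {}
for col in X_demo.columns:
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    # R^2 on the rows used for fitting, as in the loop above.
    train_r2[col] = model.fit(X_demo[[col]], y_demo).score(X_demo[[col]], y_demo)
    # Mean R^2 over 5 held-out folds.
    cv_r2[col] = cross_val_score(
        model, X_demo[[col]], y_demo, cv=5, scoring="r2"
    ).mean()

# Apply the same 0.1 threshold, but to the cross-validated score.
selected = [col for col in X_demo.columns if cv_r2[col] > 0.1]
print(train_r2)
print(cv_r2)
print(selected)
```

Comparing the two dictionaries for the noise column shows how far the training score can drift from the held-out score, which is worth keeping in mind when reading the numbers above.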

That's all.
