How to split data when preprocessing for machine learning

Posted at 2018-03-07

Environment

  • macOS
  • Python 3.6

Use case

  • Split a DataFrame intended for machine learning into validation data and training data
    • Each subset is assumed to be drawn by random sampling

Trying it out

Data source

Source code

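import pandas as pd

# Load the data
# This example uses the US YouTube trending dataset
df = pd.read_csv("USvideos.csv")

# Extract the test data by random sampling (15000 rows, without replacement)
df_test = df.sample(15000)

# Extract the validation data
# Drop from the original data every row whose index appears in df_test
df_validation = df.drop(df_test.index)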
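
The same sample-then-drop pattern can be made repeatable by passing random_state to sample, and proportional by passing frac instead of a fixed row count. The following is a minimal sketch under stated assumptions: the toy DataFrame stands in for USvideos.csv, and the seed 42, the 70/30 ratio, and the names df_sample and df_rest are illustrative choices that do not come from the original article.

```python
import pandas as pd

# Toy stand-in for USvideos.csv (hypothetical data, for illustration only)
df = pd.DataFrame({"views": range(100), "likes": range(100, 200)})

# Draw 70% of the rows at random; random_state fixes the seed so the
# same rows are selected on every run (seed and ratio are assumptions)
df_sample = df.sample(frac=0.7, random_state=42)

# Everything whose index label was not drawn forms the other subset
df_rest = df.drop(df_sample.index)

print(len(df_sample), len(df_rest))
```

Note that dropping by index only yields an exact complement when the index labels are unique. If the DataFrame has duplicate index labels, drop removes every row sharing a sampled label, so calling reset_index(drop=True) beforehand may be worthwhile.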

Result

> df_test.info()

Int64Index: 15000 entries, 4985 to 1561
Data columns (total 16 columns):
video_id                  15000 non-null object
trending_date             15000 non-null object
title                     15000 non-null object
channel_title             15000 non-null object
category_id               15000 non-null int64
publish_time              15000 non-null object
tags                      15000 non-null object
views                     15000 non-null int64
likes                     15000 non-null int64
dislikes                  15000 non-null int64
comment_count             15000 non-null int64
thumbnail_link            15000 non-null object
comments_disabled         15000 non-null bool
ratings_disabled          15000 non-null bool
video_error_or_removed    15000 non-null bool
description               14717 non-null object
dtypes: bool(3), int64(5), object(8)
memory usage: 1.6+ MB


> df_validation.info()
Int64Index: 6965 entries, 0 to 21954
Data columns (total 16 columns):
video_id                  6965 non-null object
trending_date             6965 non-null object
title                     6965 non-null object
channel_title             6965 non-null object
category_id               6965 non-null int64
publish_time              6965 non-null object
tags                      6965 non-null object
views                     6965 non-null int64
likes                     6965 non-null int64
dislikes                  6965 non-null int64
comment_count             6965 non-null int64
thumbnail_link            6965 non-null object
comments_disabled         6965 non-null bool
ratings_disabled          6965 non-null bool
video_error_or_removed    6965 non-null bool
description               6829 non-null object
dtypes: bool(3), int64(5), object(8)
memory usage: 782.2+ KB
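
The two outputs report 15000 and 6965 rows, which suggests the source file holds 15000 + 6965 = 21965 rows, provided its index is unique. That kind of sanity check can also be done in code. The sketch below uses a small toy DataFrame rather than the real dataset, and check_split is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def check_split(df, part_a, part_b):
    """Return True if the two parts share no index labels and together cover df."""
    no_overlap = part_a.index.intersection(part_b.index).empty
    full_cover = len(part_a) + len(part_b) == len(df)
    return no_overlap and full_cover


# Toy data standing in for the real dataset (assumption, for illustration only)
df = pd.DataFrame({"views": range(10)})
df_test = df.sample(6, random_state=0)
df_validation = df.drop(df_test.index)

print(check_split(df, df_test, df_validation))
```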