(Failed) ETL and machine learning on the GPU!? Trying NVIDIA's newly announced open-source GPU acceleration platform "RAPIDS", Part 2 (performance testing)

Last updated at 2018-11-04, posted at 2018-11-04

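This article documents a failed attempt.
If I start over and get it working properly, I plan to write that up in a separate article (if that day ever comes...).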

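Previously

Previous article: "ETL and machine learning on the GPU!? Trying NVIDIA's open-source GPU acceleration platform "RAPIDS", Part 1 (environment setup)"
NVIDIA announced a platform called "RAPIDS" that supposedly lets the GPU handle all sorts of processing it could not handle before, which sounded pretty amazing (pardon my limited vocabulary), so I set up an environment for running RAPIDS.
This article picks up from there.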

Prerequisites

JupyterLab in the RAPIDS Docker container is running and reachable from the host, for example by following the previous article.

Environment

OS: Ubuntu 16.04 64bit
CPU: Intel Core i7-7700 CPU @ 3.60GHz × 8
GPU: NVIDIA GeForce GTX 1060 6GB
RAM: 24GB

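Extracting the sample data bundled with RAPIDS

I could have prepared my own data for the performance check, but the RAPIDS image appears to ship with sample data, so I used that instead.
It is provided as a compressed archive under /rapids/data, so the first step is to extract it.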

(gdf) root@a1b3c0415a99:/# cd /rapids/data
(gdf) root@a1b3c0415a99:/rapids/data# tar -xzvf mortgage.tar.gz 

mortgage/
mortgage/acq/
mortgage/acq/Acquisition_2000Q1.txt
mortgage/acq/Acquisition_2001Q4.txt
mortgage/acq/Acquisition_2001Q2.txt
mortgage/acq/Acquisition_2000Q4.txt
mortgage/acq/Acquisition_2000Q3.txt
mortgage/acq/Acquisition_2000Q2.txt
mortgage/acq/Acquisition_2001Q1.txt
mortgage/acq/Acquisition_2001Q3.txt
mortgage/perf/
mortgage/perf/Performance_2001Q2.txt_0
mortgage/perf/Performance_2001Q4.txt_0
mortgage/perf/Performance_2001Q4.txt_1
mortgage/perf/Performance_2001Q3.txt_1
mortgage/perf/Performance_2000Q1.txt
mortgage/perf/Performance_2001Q1.txt
mortgage/perf/Performance_2000Q4.txt
mortgage/perf/Performance_2000Q3.txt
mortgage/perf/Performance_2000Q2.txt
mortgage/perf/Performance_2001Q3.txt_0
mortgage/perf/Performance_2001Q2.txt_1
mortgage/names.csv
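
To choose a file of manageable size for the tests, it can help to list what was extracted together with each file's size. The following is a minimal sketch using only the standard library. The function name `list_files_with_sizes` is hypothetical, and the path in the example assumes the archive was extracted under /rapids/data as above.

```python
import os

def list_files_with_sizes(root):
    """Return (relative_path, size_in_bytes) for every file under root, largest first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full_path, root), os.path.getsize(full_path)))
    # Sort by size, descending, so the largest files are listed first
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

if __name__ == "__main__":
    # Assumed location of the extracted sample data inside the container
    root = "/rapids/data/mortgage"
    if os.path.isdir(root):
        for path, size in list_files_with_sizes(root):
            print("{:10.1f} MB  {}".format(size / 1024 ** 2, path))
```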

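By the way, sharp-eyed readers may have noticed by now that the RAPIDS Docker image ships not only sample data
but also sample code for ETL and machine learning (two notebooks, "E2E.ipynb" and "ETL.ipynb", under /rapids/notebooks).
Using them as-is was an option, but they turned out to be more substantial than I expected and were not easy for my feeble brain to follow, so from here on I borrow only the bare minimum from the sample code and test mainly with simple one-off operations.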

RAPIDS vs Pandas.DataFrame

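When it comes to data manipulation in Python, the pandas DataFrame is the go-to tool.
It would be handy to do the same kinds of things with the GPU running flat out, so I looked into it.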

How RAPIDS handles DataFrames

Looking through the sample programs provided in the Docker image, it seems RAPIDS includes a library called PyGDF, which lets you drive the GPU with a DataFrame-like feel.

However, NVIDIA's developer blog post "RAPIDS Accelerates Data Science End-to-End" says:

cuDF: A GPU DataFrame library with a pandas-like API. cuDF provides operations on data columns including unary and binary operations, filters, joins, and groupbys. cuDF currently comprises the Python library PyGDF, and the C++/CUDA GPU DataFrames implementation in libgdf. These two libraries are being merged into cuDF. See the documentation for more details and examples.

so PyGDF appears to be in the process of being merged into cuDF.
NVIDIA's SlideShare deck had a slide to the same effect.

(Source: https://www.slideshare.net/NVIDIAJapan/rapids-120510206)

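Reading a file

From here on I write actual Python code to test things.
First up is reading a file.
Because I borrow the sample data and part of the sample source code, the measurement strictly speaking includes some work other than the read itself. Please bear that in mind.

To start, I lifted from the sample source code only what seemed necessary to use the sample data, such as the column definitions.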

import pygdf
from collections import OrderedDict
import pandas as pd
import time
import os

def define_cols(isGpu):
    cols = [
        "loan_id", "monthly_reporting_period", "servicer", "interest_rate", "current_actual_upb",
        "loan_age", "remaining_months_to_legal_maturity", "adj_remaining_months_to_maturity",
        "maturity_date", "msa", "current_loan_delinquency_status", "mod_flag", "zero_balance_code",
        "zero_balance_effective_date", "last_paid_installment_date", "foreclosed_after",
        "disposition_date", "foreclosure_costs", "prop_preservation_and_repair_costs",
        "asset_recovery_costs", "misc_holding_expenses", "holding_taxes", "net_sale_proceeds",
        "credit_enhancement_proceeds", "repurchase_make_whole_proceeds", "other_foreclosure_proceeds",
        "non_interest_bearing_upb", "principal_forgiveness_upb", "repurchase_make_whole_proceeds_flag",
        "foreclosure_principal_write_off_amount", "servicing_activity_indicator"
    ]
    
    if isGpu:
        dtypes = OrderedDict([
            ("loan_id", "int64"),
            ("monthly_reporting_period", "date"),
            ("servicer", "category"),
            ("interest_rate", "float64"),
            ("current_actual_upb", "float64"),
            ("loan_age", "float64"),
            ("remaining_months_to_legal_maturity", "float64"),
            ("adj_remaining_months_to_maturity", "float64"),
            ("maturity_date", "date"),
            ("msa", "float64"),
            ("current_loan_delinquency_status", "int32"),
            ("mod_flag", "category"),
            ("zero_balance_code", "category"),
            ("zero_balance_effective_date", "date"),
            ("last_paid_installment_date", "date"),
            ("foreclosed_after", "date"),
            ("disposition_date", "date"),
            ("foreclosure_costs", "float64"),
            ("prop_preservation_and_repair_costs", "float64"),
            ("asset_recovery_costs", "float64"),
            ("misc_holding_expenses", "float64"),
            ("holding_taxes", "float64"),
            ("net_sale_proceeds", "float64"),
            ("credit_enhancement_proceeds", "float64"),
            ("repurchase_make_whole_proceeds", "float64"),
            ("other_foreclosure_proceeds", "float64"),
            ("non_interest_bearing_upb", "float64"),
            ("principal_forgiveness_upb", "float64"),
            ("repurchase_make_whole_proceeds_flag", "category"),
            ("foreclosure_principal_write_off_amount", "float64"),
            ("servicing_activity_indicator", "category")
        ])
    else:
        dtypes = OrderedDict([
            ("loan_id", "int64"),
            ("monthly_reporting_period", "object"),
            ("servicer", "object"),
            ("interest_rate", "float"),
            ("current_actual_upb", "float"),
            ("loan_age", "float64"),
            ("remaining_months_to_legal_maturity", "float"),
            ("adj_remaining_months_to_maturity", "float"),
            ("maturity_date", "object"),
            ("msa", "float64"),
            ("current_loan_delinquency_status", "int"),
            ("mod_flag", "object"),
            ("zero_balance_code", "float"),
            ("zero_balance_effective_date", "object"),
            ("last_paid_installment_date", "object"),
            ("foreclosed_after", "object"),
            ("disposition_date", "object"),
            ("foreclosure_costs", "float"),
            ("prop_preservation_and_repair_costs", "float"),
            ("asset_recovery_costs", "float"),
            ("misc_holding_expenses", "float"),
            ("holding_taxes", "float"),
            ("net_sale_proceeds", "float"),
            ("credit_enhancement_proceeds", "float"),
            ("repurchase_make_whole_proceeds", "float"),
            ("other_foreclosure_proceeds", "float"),
            ("non_interest_bearing_upb", "float"),
            ("principal_forgiveness_upb", "float"),
            ("repurchase_make_whole_proceeds_flag", "object"),
            ("foreclosure_principal_write_off_amount", "float"),
            ("servicing_activity_indicator", "object")
        ])


    return cols, dtypes

def gpu_load_performance_csv(performance_path, **kwargs):
    """ Loads performance data

    Returns
    -------
    GPU DataFrame
    """

    cols, dtypes = define_cols(True)

    print(performance_path)
    
    return pygdf.read_csv(performance_path, names=cols, delimiter='|', dtype=list(dtypes.values()), skiprows=1)


def cpu_load_performance_csv(performance_path, **kwargs):
    """ Loads performance data

    Returns
    -------
    Pandas DataFrame
    """

    cols, dtypes = define_cols(False)

    print(performance_path)
    
    return pd.read_csv(performance_path, names=cols, delimiter='|', dtype=dtypes, skiprows=1)

# Round a number to one decimal place and convert it to a string
def roundstr(size):
    return str(round(size, 1))

# Convert a file size in bytes to a string
# expressed in KB, MB, GB, or TB
def format_file_size(bytesize):
    if bytesize < 1024:
        return str(bytesize) + ' Byte'
    elif bytesize < 1024 ** 2:
        return roundstr(bytesize / 1024.0) + ' KB'
    elif bytesize < 1024 ** 3:
        return roundstr(bytesize / (1024.0 ** 2)) + ' MB'
    elif bytesize < 1024 ** 4:
        return roundstr(bytesize / (1024.0 ** 3)) + ' GB'
    elif bytesize < 1024 ** 5:
        return roundstr(bytesize / (1024.0 ** 4)) + ' TB'
    else:
        return str(bytesize) + ' Bytes'

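The code for displaying the file size in a readable form is based on the following article:
"I wrote a Python script that formats a byte size as KB, MB, GB, or TB" (original title: Python で Byte のサイズを KB, MB, GB, TB 表記するスクリプトを書いた)

That completes the preparation, so let's actually read one of the extracted files.
First, check the size of the target file.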

file = r'../data/mortgage/perf/Performance_2000Q1.txt'
display(format_file_size(os.path.getsize(file)))

'949.5 MB'

Nearly 1 GB, which is fairly large. Let's read it.

CPU

read_csv_cpu.py

start = time.time()
perf_df_pd = cpu_load_performance_csv(file)
elapsed_time = time.time() - start
display(elapsed_time)

../data/mortgage/perf/Performance_2000Q1.txt
19.403013229370117

GPU

read_csv_gpu.py

start = time.time()
perf_df = gpu_load_performance_csv(file)
elapsed_time = time.time() - start
display(elapsed_time)
perf_df

../data/mortgage/perf/Performance_2000Q1.txt
1.2687678337097168
<pygdf.DataFrame ncols=31 nrows=9094678 >

Wait, isn't that ridiculously fast?
For reference, the loaded data looked like this.

(excerpt)
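
The measurements in this article repeat the same start/stop pattern around each call. As a minimal sketch, that pattern can be wrapped in a small context manager. The name `stopwatch` is hypothetical, and using time.perf_counter instead of time.time is my substitution, chosen because perf_counter is a monotonic clock intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, results=None):
    """Time the enclosed block and print the elapsed seconds.

    If a dict is passed as results, the elapsed time is also stored under label.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("{}: {:.3f} s".format(label, elapsed))

# Example with a stand-in workload; in the notebook the body would be a call
# such as cpu_load_performance_csv(file) or gpu_load_performance_csv(file).
timings = {}
with stopwatch("sum of squares", timings):
    total = sum(i * i for i in range(100000))
```

A single run can also include one-time costs (for example file-system caching, or GPU initialization on the first call), so repeating each measurement a few times gives a fairer comparison.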

Sorting

CPU

sort_cpu.py

start = time.time()
perf_df_pd_sort = perf_df_pd.sort_values('loan_age')
elapsed_time = time.time() - start
display(elapsed_time)

13.352887630462646

GPU

sort_gpu.py

start = time.time()
display(perf_df.sort_values('loan_age'))
elapsed_time = time.time() - start
display(elapsed_time)

CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

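It errored out, which hurts. Please don't say something as sad as Out of Memory...
I tried several other operations as well, but every GPU operation failed with the same error and I got no results. I could cry.

Was the data simply too large for my environment?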
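
A back-of-the-envelope estimate helps judge whether the data is too large. The sketch below multiplies the row count reported above (9,094,678) by an assumed width for each column. The widths are my assumptions, not confirmed PyGDF internals: 8 bytes for int64, float64, and date columns, and 4 bytes for int32 columns and for category columns (assuming categories are stored as 32-bit codes). Validity masks and temporary buffers are ignored.

```python
# Assumed storage width in bytes for each dtype name used in define_cols(True).
# The "date" and "category" widths are assumptions, not confirmed PyGDF internals.
ASSUMED_BYTES = {"int64": 8, "float64": 8, "date": 8, "int32": 4, "category": 4}

# Number of columns of each dtype in the GPU dtype list above (31 in total).
GPU_DTYPE_COUNTS = {"int64": 1, "date": 6, "category": 5, "int32": 1, "float64": 18}

def estimate_frame_bytes(nrows, dtype_counts, widths=ASSUMED_BYTES):
    """Estimate the bytes needed for the column data alone."""
    bytes_per_row = sum(widths[dtype] * count for dtype, count in dtype_counts.items())
    return nrows * bytes_per_row

nrows = 9094678  # row count reported for the pygdf.DataFrame above
frame_bytes = estimate_frame_bytes(nrows, GPU_DTYPE_COUNTS)
print("columns: {}".format(sum(GPU_DTYPE_COUNTS.values())))
print("estimated frame size: {:.2f} GiB".format(frame_bytes / 1024 ** 3))
# A sort that materializes a sorted copy would need roughly this much again.
print("frame plus one full copy: {:.2f} GiB".format(2 * frame_bytes / 1024 ** 3))
```

Under these assumptions the frame alone is about 1.9 GiB, and one full copy brings the total to about 3.8 GiB. That is already a large share of the 6 GB on the GTX 1060 before counting temporary buffers or memory used by other processes. This makes the "data too large" hypothesis plausible, but it does not prove it.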

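What I learned this time, and next steps

read_csv was more than 10x faster with RAPIDS (about 19.4 s on the CPU versus about 1.27 s on the GPU, roughly 15x)
Next, I plan to shrink the data until Out of Memory no longer occurs and then check the performance of other DataFrame operations (though is this an error that simply reducing the data can fix in the first place?)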
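
One simple way to reduce the data for the next attempt is to copy only the first N lines of a performance file into a new, smaller file and pass that file to the same loader functions. The following is a sketch using only the standard library. The function name `write_head` and the output file name in the example are hypothetical.

```python
from itertools import islice

def write_head(src_path, dst_path, max_lines):
    """Copy at most max_lines lines from src_path to dst_path; return lines written."""
    written = 0
    # Binary mode copies the bytes unchanged, so delimiters and encoding are preserved
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in islice(src, max_lines):
            dst.write(line)
            written += 1
    return written

# Example (paths are assumptions based on the layout used in this article):
# write_head("../data/mortgage/perf/Performance_2000Q1.txt",
#            "../data/mortgage/perf/Performance_2000Q1_head.txt",
#            1000000)
```

Halving max_lines until the sort succeeds would give a rough idea of where the memory limit lies on a given GPU.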

References

RAPIDS cheat sheet
https://rapids.ai/files/cheatsheet.pdf

cudf Documentation
https://media.readthedocs.org/pdf/pygdf/latest/pygdf.pdf
