
[Houdini] Optimizing data-loading code in the Python SOP

Last updated at 2019-12-25. Posted at 2018-07-20.

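Introduction

When you use Houdini, VEX is so fast that the slowness of the Python SOP really stands out, especially in for loops.
Of course there are still things that only Python can do, so the theme here is to make those parts faster and more comfortable.
This article is mainly about speed, with a bonus section on simplifying the code.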

Environment

CentOS-7.4
Houdini-16.5.405
pandas-0.23.3
numpy-1.14.5

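Speeding things up

Pickling the data

What is pickle? → See "Python: オブジェクトを漬物 (Pickle) にする" ("Python: pickling objects").
Pickling helps in some cases and is completely unnecessary in others.
For example, when a Python SOP loads a huge dataset, it re-reads the data from scratch every time it runs, so responsiveness becomes very poor and work slows down.
In that case, pickle the data in advance, inside or outside Houdini, and have the Python SOP read the pickle instead. Loading gets faster and work becomes comfortable again.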

Comparison

A load-time comparison on data with 3 float columns and 17 million rows.

read_table

df = pd.read_table("/path/to/data.tsv")

read_pickle

df = pd.read_pickle("/path/to/data.pickle")

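Results

Note: I measured only once, so there may be some variance.

read_pickle was about 6.6 times faster.
Pickling takes a little preparation up front, but when you load the same data over and over it makes a big difference.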
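Since the figures above come from a single run, a small timing helper can reduce the variance. The following is a minimal sketch, not the harness used for the measurements in this article; the helper name best_of and the repeat count are assumptions. Repeated reads may also be served from the OS file cache, which can flatter the later runs.

```python
import time


def best_of(func, repeat=3):
    """Call func() `repeat` times and return (best_seconds, last_result)."""
    best = None
    result = None
    for _ in range(repeat):
        start = time.time()
        result = func()
        elapsed = time.time() - start
        # Keep the fastest run, which is the one least disturbed by other load.
        if best is None or elapsed < best:
            best = elapsed
    return best, result


# Hypothetical usage with the two loaders compared above:
#   t_table, _ = best_of(lambda: pd.read_table("/path/to/data.tsv"))
#   t_pickle, _ = best_of(lambda: pd.read_pickle("/path/to/data.pickle"))
#   print("speed-up: %.1fx" % (t_table / t_pickle))
```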

Writing out

With pandas, you can write a pickle easily with to_pickle.

import pandas as pd
df = pd.read_csv("data.csv")
df.to_pickle("/path/to/out.pickle")
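The read and write steps can be combined so that the pickle is created on first use and reused afterwards. This is a sketch under assumptions: the function name load_with_pickle_cache and the modification-time check are illustrative and not part of the original workflow.

```python
import os

import pandas as pd


def load_with_pickle_cache(source_path, cache_path, reader=pd.read_csv):
    """Return a DataFrame, reading the pickle cache when it is up to date."""
    # Assumption: the cache is valid if it is at least as new as the source.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        return pd.read_pickle(cache_path)
    # Slow path: parse the text file once, then store the pickle for next time.
    df = reader(source_path)
    df.to_pickle(cache_path)
    return df
```

For a TSV source, pd.read_table can be passed as reader. Only the first call pays the parsing cost; later calls go through read_pickle.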

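Creating points

When loading huge datasets (csv, tsv, xyz, xml, json, ...) into Houdini, the bottleneck is once again the for loop.
Houdini provides a wonderful method called createPoints, so combined with pandas/numpy you can easily get rid of the for loop.

Let's compare three approaches, a plain for loop, a list comprehension, and numpy.reshape, by how long each takes to turn 3 float columns by 10 million rows into the P attribute.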

for loop

A-plain for loop

import pandas as pd

node = hou.pwd()
geo = node.geometry()

df = pd.read_pickle("data.pickle")
for row in xrange(df.shape[0]):
    point = geo.createPoint()
    point.setPosition((df.iat[row, 0], df.iat[row, 1], df.iat[row, 2]))

for loop (list comprehension)

B-list comprehension

import pandas as pd

node = hou.pwd()
geo = node.geometry()

df = pd.read_pickle("data.pickle")
geo.createPoints([(df.iat[row, 0], df.iat[row, 1], df.iat[row, 2]) for row in xrange(df.shape[0])])

numpy.reshape

C-numpy.reshape

import pandas as pd

node = hou.pwd()
geo = node.geometry()

df = pd.read_pickle("data.pickle")
geo.createPoints(df.values.reshape(len(df.index), 3))

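Measurement results

Note: I measured only once, so there may be some variance.

As expected, reshape is by far the fastest.
It is about 8.2 times faster than the plain for loop, and the gap will probably widen as the row count grows.
Incidentally, there is also a createPolygons method, so with some effort polygons can be built via reshape as well.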
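As a rough illustration of the createPolygons idea, the point numbers for a batch of triangles can be produced with reshape. This is a sketch under several assumptions: every three consecutive points form one triangle, the geometry is empty before the points are created (so point numbers start at 0), and createPolygons accepts sequences of point numbers as the HOM documentation describes. The helper name triangle_point_numbers is hypothetical.

```python
import numpy as np


def triangle_point_numbers(num_points):
    """Group point numbers 0..num_points-1 into consecutive triples."""
    if num_points % 3 != 0:
        raise ValueError("num_points must be a multiple of 3")
    # Each row holds the three point numbers of one triangle.
    return np.arange(num_points).reshape(-1, 3)


# In a Python SOP, assuming `positions` is an (N, 3) array with N % 3 == 0:
#   geo.createPoints(positions)
#   geo.createPolygons(triangle_point_numbers(len(positions)).tolist())
```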

What surprised me, though, is that calling createPoints with a list comprehension is hardly any faster.

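reshape is fast, but ~~attributes other than P cannot be set without a for loop, so it can only handle up to three float values.~~
For data carrying a large number of string attributes, it may be quicker to just write a script that converts it directly into a geometry format Houdini can read.
Correction: it can be done. When I wrote this article I had overlooked methods such as setPointFloatAttribValues.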
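A bulk write with setPointFloatAttribValues might look like the sketch below. It assumes the attribute has already been created on the geometry and that the values are ordered by point number; the helper name set_point_floats and the column names in the usage comment are hypothetical.

```python
import numpy as np


def set_point_floats(geo, name, values):
    """Write one float attribute for every point in a single call.

    Assumes the attribute `name` already exists on `geo` and that `values`
    holds one row (or one scalar) per point, in point order.
    """
    # Flatten row by row: all components of point 0, then point 1, and so on.
    flat = np.asarray(values, dtype=np.float64).ravel()
    # tolist() converts NumPy scalars to plain Python floats.
    geo.setPointFloatAttribValues(name, flat.tolist())
    return len(flat)


# Inside a Python SOP (where `hou` is available) this might be used as:
#   geo.createPoints(df[["x", "y", "z"]].values)
#   geo.addAttrib(hou.attribType.Point, "temperature", 0.0)
#   set_point_floats(geo, "temperature", df["temperature"].values)
```

Taking geo as a parameter keeps the helper free of a direct hou import, so it can be exercised outside Houdini with a stand-in object.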

By the way, I once tried, on the off chance, to produce an Alembic file with pandas' to_hdf, but as expected it did not work.

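Distributed processing

Many people probably do this already, but when there are many files, as with time-series data, splitting the work up and submitting it to the farm is overwhelmingly faster in Houdini than grinding through the files in a for loop. You can run as many jobs as you have Engine licenses.
The method is simple: store the path of each file to load in a point attribute, one path per point, delete points in a Wrangle so that only the point whose number equals the current frame remains, and then submit to the farm.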
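The article does the per-frame selection with a point attribute and a Wrangle. The same "index equals current frame" rule, expressed in Python instead, could look like the following sketch; the helper name path_for_frame and the paths list are hypothetical.

```python
def path_for_frame(paths, frame):
    """Return the file path whose index matches the given frame number."""
    index = int(frame)
    if not 0 <= index < len(paths):
        raise IndexError("no data file for frame %d" % index)
    return paths[index]


# In a Python SOP the frame could come from hou.intFrame(), e.g.:
#   df = pd.read_pickle(path_for_frame(paths, hou.intFrame()))
```

Each farm task then cooks a single frame and loads only its own file.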

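Wrapping it in a class

Copying long-winded data-loading code from the previous project and tweaking it every single time is a chore.
So I made houpandas.DataFrame, a subclass of pandas.DataFrame, and added a to_geometry method to it.
By swapping pandas.DataFrame for houpandas.DataFrame inside Houdini,
a single df.to_geometry(geo) call converts a DataFrame into Houdini point attributes in a pandas-like way.
A save_geometry function would have worked too, but you really want to write to_geometry, don't you?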

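With this, I can code and test to my heart's content in an IDE or Jupyter,
then paste the code into a Python SOP and call to_geometry.
If you want finer control, you can also call add_attribs_from_columns/set_attrib_value directly from the Python SOP.

Here is the code.

Note (18/07/20 20:39): the code was out of date, so I made minor corrections.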

houpandas.py

# -*- coding: utf-8 -*-
import sys

import pandas as pd
import numpy as np
import hou


class DataFrame(pd.DataFrame):
    def add_attribs_from_columns(self, geo, includes=None, excludes=None,
                                 default_value_int=-1, default_value_float=-9999.0, default_value_str="",
                                 default_value_iarray=None, default_value_farray=None):
        if includes is None:
            includes = []
        if excludes is None:
            excludes = []

        columns = self.columns
        dtypes = self.dtypes

        attrib_dict = {}
        for column, dtype in zip(columns, dtypes):
            if column in excludes:
                continue
            if includes and column not in includes:
                continue

            attrib_name = column
            if not attrib_name[0].isalpha() and not attrib_name.startswith("_"):
                attrib_name = "_" + column

            found_attrib = geo.findPointAttrib(attrib_name)
            if found_attrib:
                attrib_dict[column] = found_attrib
                continue

            default_value = ""
            if str(dtype)[:3] == "int" or str(dtype)[:4] == "uint" or str(dtype) == "bool":
                default_value = default_value_int
            elif str(dtype)[:5] == "float":
                default_value = default_value_float
            elif str(dtype) == "object":
                default_value = default_value_str

            attrib_dict[column] = geo.addAttrib(hou.attribType.Point, attrib_name, default_value)

        return attrib_dict

    def set_attrib_value(self, points, attrib_dict, debug=False):
        for row, point in enumerate(points):
            for column in attrib_dict.keys():
                attrib_value = self.at[row, column]
                if debug:
                    print attrib_dict[column], attrib_value
                point.setAttribValue(attrib_dict[column], attrib_value)

    def to_geometry(self, geo, pos_columns=None, debug=False):
        pos_shape = (self.shape[0], 3)
        if pos_columns:
            if not isinstance(pos_columns, list):
                raise TypeError("pos_columns should be list of column names or numbers.")
            if len(pos_columns) > 3:
                raise IndexError("pos_column size should be 3 or less.")
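            # Zero-pad so that fewer than three position columns still fit (N, 3);
            # a plain reshape would fail for one or two columns.
            pos_array = np.zeros(pos_shape)
            pos_array[:, :len(pos_columns)] = self[pos_columns].values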
        else:
            pos_array = np.zeros(pos_shape)

        points = geo.createPoints(pos_array)
        attrib_dict = self.add_attribs_from_columns(geo, excludes=pos_columns)
        if attrib_dict:
            self.set_attrib_value(points, attrib_dict, debug)

PythonSOP

import pandas as pd
import houpandas as hp


node = hou.pwd()
geo = node.geometry()

df = pd.read_csv("data.csv")

"""
DataFrame processing goes here
"""

df = hp.DataFrame(df)
df.to_geometry(geo)
