
Generate data samples from math formulas: painfully slow but useful Sympy

Last updated at 2017-05-22. Posted at 2017-05-19.

(Originally written in Japanese with a parallel English translation.)

Important Update

https://github.com/osofr/simcausal
If speed matters, the package above is the better choice for generating data from causal models. You have to specify the causal order of the variables by hand, but it generates data from complex functions much faster. It also supports generating data under interventions.

Intro

When you try out a new algorithm in machine learning, you usually end up generating data!
Writing the data generation part is annoying, because you have to think about things such as the order in which the variables are generated!

Approach

Express the data generation process as a set of formulas in Sympy, then have Sympy solve them as a system of equations.
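
As a minimal sketch of this idea (the symbols `a` and `b` and their equations are hypothetical, chosen only for illustration), the equations below are listed in the "wrong" order, with `a` defined before the `b` it depends on, and `solve` still returns consistent values:

```python
from sympy import Eq, Symbol, solve

# Hypothetical toy process: a depends on b, but a is listed first
a, b = Symbol("a"), Symbol("b")
eqs = [
    Eq(a, b + 1),
    Eq(b, 2),
]

# dict=True asks solve() for a list of {symbol: value} dicts
solution = solve(eqs, [a, b], dict=True)[0]
print(solution)
```

The point is only that the order of the `Eq` entries does not matter; the solver works out the dependency for you.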

Pros

You can generate data from equations WITHOUT having to work out the dependency structure of the variables yourself.
You can use the rich set of mathematical expressions that comes with Sympy, along with its symbolic manipulation features.

Cons / Caveats

This method relies on an equation solver, so it is painfully slow.
In the example below, each sample takes about 2 seconds to generate. Be prepared for a long wait if you need a practically useful amount of simulated data.
Saving the generated data to a file is recommended.
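
One way to save the samples, sketched here with only the standard library. The helper name `save_samples` and the file name `samples.csv` are my own assumptions, and the sketch assumes each sample is a dict that maps a variable (for example a Sympy symbol) to a numeric value:

```python
import csv

def save_samples(rows, path):
    """Write a non-empty list of {variable: value} dicts to a CSV file (hypothetical helper)."""
    # str() turns Sympy symbols into column names; float() turns Sympy numbers into plain floats
    records = [{str(k): float(v) for k, v in row.items()} for row in rows]
    fieldnames = sorted(records[0])
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return fieldnames

# Usage with stand-in samples; a list of solver results could be passed in the same way
columns = save_samples([{"x1": 0.3, "x2": -0.1}, {"x1": 0.7, "x2": 0.2}], "samples.csv")
print(columns)
```

Converting to plain strings and floats before writing keeps the file readable by tools that know nothing about Sympy.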

Working Sample

Suppose we want to sample data from the following data generation process.
$$
\begin{split}
x_1 = x_2^2 + x_3 + e_1 \\
x_2 = x_3 + e_2 \\
x_3 = e_3 \\
e_1, \dots, e_{10} \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-1, 1)
\end{split}
$$

from sympy import Symbol, Eq, solve
from sympy.stats import sample, Uniform
import numpy as np
np.random.seed(42)  # Seeds NumPy only; whether sympy.stats.sample honors it depends on the Sympy version

# Setting
p = 10                                                # The number of variables
n = 10                                                # The number of samples

# Symbol initialization
xs = {i: Symbol("x{}".format(i)) for i in range(p+1)}  # Map each index to its Sympy symbol x0..xp
x = lambda i: xs[int(i)]                               # Accessor so the equations can be written as x(1), x(2), ...
es = {i: Symbol("e{}".format(i)) for i in range(p+1)}  # These expressions could probably be simplified,
e = lambda i: es[int(i)]                               # but I'll leave them as is for now (if you find a better way to write them, please leave a comment)

# Util: solve() may return a list of solutions or a single dict; take the first solution if it is a list
firstOrSelf = lambda ll: ll[0] if isinstance(ll, list) else ll

# Data generation
data = []
for k in range(n):
    if (k % 10 == 0):
        print(k)
    # Build the list of equations. Every assignment to a variable is expressed as an Eq object.
    eqs = list([
        Eq(x(1), x(2) ** 2 + x(3) + e(1)),
        Eq(x(2), x(3) + e(2)),
        Eq(x(3), e(3)),
    ]) + list([Eq(e(i), sample(Uniform(f"unif{i}", -1, 1))) for i in range(p+1)])
    # The random values assigned to the error terms could just as well be sampled with numpy. Here I used Sympy.

    # Pass the equations to the solver. This takes some time.
    sol = solve(eqs)

    # Accumulate the generated sample
    data.append(firstOrSelf(sol))

print(data)
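
If the solver time becomes a problem, one possible variation (not the method used above, and only a sketch) is to leave the noise terms symbolic, call `solve` once, and then evaluate the solved expressions on NumPy-sampled noise with `lambdify`. This assumes the system can be solved in closed form for the `x` variables, which should hold for the three equations of this example but not for every model. The names `f`, `rng`, `noise` and `samples` are my own.

```python
import numpy as np
from sympy import Eq, lambdify, solve, symbols

n = 5  # number of samples for this sketch
x1, x2, x3 = symbols("x1 x2 x3")
e1, e2, e3 = symbols("e1 e2 e3")

eqs = [
    Eq(x1, x2 ** 2 + x3 + e1),
    Eq(x2, x3 + e2),
    Eq(x3, e3),
]

# Solve once for the x variables, keeping e1..e3 as symbolic parameters
sol = solve(eqs, [x1, x2, x3], dict=True)[0]

# Turn the solved expressions into a vectorized NumPy function of the noise
f = lambdify([e1, e2, e3], [sol[x1], sol[x2], sol[x3]], "numpy")

rng = np.random.default_rng(42)
noise = rng.uniform(-1, 1, size=(3, n))  # one row per noise term
samples = np.column_stack(f(*noise))     # one row per sample, columns x1, x2, x3
print(samples.shape)
```

The trade-off is that you get back part of the bookkeeping: you have to say which symbols are unknowns and which are noise inputs.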

Explanation

Adding a new variable takes two steps:

Add the variable to the symbol initialization.
Include the variable in the equations.
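
As a sketch of these two steps, the snippet below adds a hypothetical fourth variable `x4 = x1 - x2 + e4` (this equation is my own example, not part of the model in this article) and draws one sample. The noise is drawn with Python's `random` module rather than `sympy.stats`, only to keep the sketch short.

```python
import random
from sympy import Eq, Symbol, solve

random.seed(0)

# Step 1: initialize symbols for every variable, including the new x4 and e4
p = 4
xs = {i: Symbol(f"x{i}") for i in range(1, p + 1)}
es = {i: Symbol(f"e{i}") for i in range(1, p + 1)}

# Step 2: include the new variable in the equations
eqs = [
    Eq(xs[1], xs[2] ** 2 + xs[3] + es[1]),
    Eq(xs[2], xs[3] + es[2]),
    Eq(xs[3], es[3]),
    Eq(xs[4], xs[1] - xs[2] + es[4]),  # hypothetical new equation
] + [Eq(es[i], random.uniform(-1, 1)) for i in range(1, p + 1)]

# One sample: a dict holding values for x1..x4 and e1..e4
sol = solve(eqs, dict=True)[0]
print(sol[xs[4]])
```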

References

Sympy Documentation (en)

Alternatives

SageMath (?)
