Generating data samples from math formulas: painfully slow but useful SymPy

Important Update

If speed matters, this solution is a better fit for generating data from causal models, although you have to specify the causal order of the variables by hand. The package also provides functionality for demonstrating interventions.

Intro


When you try out a new machine learning algorithm, you almost always have to generate data.
Generating it takes mental effort, because you have to work out details such as the order in which the variables must be generated, which is tedious.
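
To illustrate the ordering problem, here is a minimal sketch of generating data by hand. It assumes a hypothetical three-variable model in which x3 is pure noise, x2 depends on x3, and x1 depends on both; the function name `generate_sample` is made up for this example. The assignments only work because they are written in causal order.

```python
import random


def generate_sample(rng):
    """Draw one sample by evaluating the variables in causal order."""
    e1, e2, e3 = (rng.uniform(-1, 1) for _ in range(3))
    # The order below is chosen by hand: x3 first, then x2, then x1.
    # Swapping any two lines would reference a variable before it exists.
    x3 = e3
    x2 = x3 + e2
    x1 = x2 ** 2 + x3 + e1
    return {"x1": x1, "x2": x2, "x3": x3}


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [generate_sample(rng) for _ in range(5)]
```

With three variables this is easy, but the bookkeeping grows with the number of variables and dependencies.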

Approach

First express the data-generating process as a set of SymPy equations, then solve them as a system of equations.
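
The idea can be sketched with a tiny two-variable system. The symbols `a` and `b` and the fixed noise values are assumptions made for illustration; in real use the noise values would be random draws.

```python
from sympy import Eq, Rational, Symbol, solve

a, b = Symbol("a"), Symbol("b")

# Fixed "noise" values, chosen only so the result is easy to check by hand.
noise_a, noise_b = Rational(1, 2), Rational(1, 4)

# Note that the equation for a is listed before the equation for b,
# even though a depends on b.
equations = [Eq(a, 3 * b + noise_a), Eq(b, noise_b)]

# dict=True makes solve() return a list of {symbol: value} dicts.
solution = solve(equations, [a, b], dict=True)[0]
```

Here `solution[b]` is 1/4 and `solution[a]` is 3 * 1/4 + 1/2 = 5/4.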

Pros

  1. You can generate data just by writing down the equations, without having to think about the dependency structure among the variables.
  2. You can use SymPy's rich set of mathematical expressions and its functions for processing them.
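
The first point can be checked with a small sketch: the same equations, listed in causal order and in a shuffled order, should give the same solution. The fixed noise values are assumptions chosen for illustration.

```python
from sympy import Eq, Rational, solve, symbols

x1, x2, x3 = symbols("x1 x2 x3")
e1, e2, e3 = Rational(1, 2), Rational(-1, 4), Rational(1, 10)

# Equations listed in causal order (x3, then x2, then x1) ...
causal_order = [Eq(x3, e3), Eq(x2, x3 + e2), Eq(x1, x2 ** 2 + x3 + e1)]
# ... and the same equations in a different order.
shuffled_order = [causal_order[2], causal_order[0], causal_order[1]]

sol_a = solve(causal_order, [x1, x2, x3], dict=True)
sol_b = solve(shuffled_order, [x1, x2, x3], dict=True)

# The solver does not care about the order in which equations are listed.
same_result = sol_a == sol_b
```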

Cons / Caveats


This method relies on an equation solver, so it is very slow.
In the example below, generating each sample takes about 2 seconds, so producing a useful amount of simulated data takes a long time.
Saving the generated data to a file is recommended, so that you do not have to generate it again.
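
One way to save and reload the data is a plain CSV file. This is a minimal sketch using only the standard library; the helper names `save_samples` and `load_samples`, the file name, and the sample values are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def save_samples(rows, path):
    """Write a list of {variable_name: value} dicts to a CSV file."""
    fieldnames = sorted(rows[0])
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_samples(path):
    """Read the CSV back into a list of {variable_name: float} dicts."""
    with open(path, newline="") as f:
        return [{k: float(v) for k, v in row.items()} for row in csv.DictReader(f)]


# Hand-written stand-in for generated samples.
rows = [{"x1": 0.6225, "x2": -0.15, "x3": 0.1}]
path = os.path.join(tempfile.gettempdir(), "samples_example.csv")
save_samples(rows, path)
restored = load_samples(path)
```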

Working Sample

Suppose we want to sample data from the following data generation process.
x_1 = x_2^2 + x_3 + e_1 \\
x_2 = x_3 + e_2 \\
x_3 = e_3 \\
e_1, \dots, e_{10} \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-1, 1)

from sympy import Symbol, Eq, solve
from sympy.stats import sample, Uniform

# Setting
p = 10                                                # The number of variables
n = 10                                                # The number of samples

# Symbol initialization
xs = {i: Symbol("x{}".format(i)) for i in range(p+1)}  # These expressions may be able to be simplified,
x = lambda i: xs[int(i)]                               # but I'll leave them as they are for now
es = {i: Symbol("e{}".format(i)) for i in range(p+1)}  # (if you find a better way of writing these,
e = lambda i: es[int(i)]                               # please leave a comment).

# Util
firstOrSelf = lambda ll: ll[0] if isinstance(ll, list) else ll

# Data generation
data = []
for k in range(n):
    if k % 10 == 0:
        # Progress report (assumed body; each solve is slow)
        print("Generating sample {}".format(k))

    # Create a list of equations. All assignments to variables are expressed with the 'Eq' class.
    eqs = [
        Eq(x(1), x(2) ** 2 + x(3) + e(1)),
        Eq(x(2), x(3) + e(2)),
        Eq(x(3), e(3)),
    ] + [Eq(e(i), sample(Uniform(f"unif{i}", -1, 1))) for i in range(p+1)]
    # The random values for the error terms could just as well be sampled with numpy. Here I used SymPy.

    # Pass the equations to the solver. This takes some time.
    sol = solve(eqs)

    # Stack generated data.
    # solve() may return a list of solutions or a single dict; firstOrSelf handles both.
    data.append(firstOrSelf(sol))

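The solver returns SymPy objects keyed by Symbol, so a conversion step is usually needed before the data can be used numerically. A minimal sketch, assuming `data` is a list of such dicts; the values below are hand-written stand-ins, not real solver output, and the column selection is an assumption.

```python
from sympy import Float, Symbol

x1, x2, x3 = Symbol("x1"), Symbol("x2"), Symbol("x3")

# Stand-in for the `data` list built by the loop: one dict per sample.
data = [
    {x1: Float("0.6225"), x2: Float("-0.15"), x3: Float("0.1")},
    {x1: Float("1.5"), x2: Float("1.0"), x3: Float("0.25")},
]

# Keep only the observed variables, in a fixed column order,
# and convert each SymPy number to a plain Python float.
columns = [x1, x2, x3]
rows = [[float(sol[var]) for var in columns] for sol in data]
```

The resulting `rows` is a plain list of lists that can be passed to `numpy.array` or written to a file.
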

Explanation

When you want to add a new variable:
1. Add the variable to the symbol initialization.
2. Include the variable in the equations.
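
The two steps can be sketched as follows. The new variable x4, its relation to x1, and the fixed noise values are all hypothetical, chosen only to make the result easy to verify.

```python
from sympy import Eq, Rational, Symbol, solve

# Step 1: initialize the symbols, now including the new variable x4.
xs = {i: Symbol("x{}".format(i)) for i in range(1, 5)}

# Fixed noise values for illustration; normally these would be random draws.
es = {1: Rational(1, 2), 2: Rational(-1, 4), 3: Rational(1, 10), 4: Rational(1, 5)}

eqs = [
    Eq(xs[1], xs[2] ** 2 + xs[3] + es[1]),
    Eq(xs[2], xs[3] + es[2]),
    Eq(xs[3], es[3]),
    # Step 2: include the new variable in the equations (hypothetical relation).
    Eq(xs[4], 2 * xs[1] + es[4]),
]

sol = solve(eqs, list(xs.values()), dict=True)[0]
```

With these values, x1 = 249/400, so x4 = 2 * 249/400 + 1/5 = 289/200.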

References

Alternatives

  • SageMath (?)

