More than 1 year has passed since last update.

外れ値一つで相関係数を0.0にする

Last updated at 2023-01-12Posted at 2022-11-11

外れ値を一つ加えることによって相関係数を0.0にするPythonスクリプトを作りました。

図1：強い相関があるデータ（相関係数:0.864）
図2：図1に1点の外れ値を加えた影響で相関係数が0.0になってしまったデータ

Pythonスクリプト

まず、図1を作るところです。

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
np.random.seed(42)

def plot(x, y):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(x, y)
    ax.set_title(f'correlation coefficient: {np.corrcoef(x, y)[0, 1]:.3f}')
    plt.show()

n = 50
mean = np.array([0, 0])
cov = np.array([[1, 0.9], [0.9, 1]])
data = np.random.multivariate_normal(mean, cov, size=n)
x = data[:, 0]
y = data[:, 1]
plot(x, y)

次に、外れ値を求めます。

下のcalc_xy_add関数が外れ値を求める関数です。中では、「目標の相関係数」と「実際の相関係数」の差の二乗（calc_lossで導出）が最小となるように最適化しています。

import torch
from torch import optim
import torch.nn as nn
import torch.nn.functional as F

def calc_xy_add(x, y, target):
    """
    x, y: 外れ値を加える前のデータ
    target: ターゲットの相関係数
    x_add, y_add: 求める外れ値
    """

    x = torch.from_numpy(x.astype(np.float32))
    y = torch.from_numpy(y.astype(np.float32))
    x_add = torch.tensor([0.0], requires_grad=True)
    y_add = torch.tensor([0.0], requires_grad=True)
    optimizer = optim.Adam([x_add, y_add], lr=0.1)
    for _ in range(100000):
        optimizer.zero_grad()
        loss = calc_loss(x_add, y_add, x, y, target)
        loss.backward()
        optimizer.step()
        if loss < 1e-8:
            break
    x_add = x_add.detach().numpy()
    y_add = y_add.detach().numpy()
    return x_add, y_add


def calc_loss(x_add, y_add, x, y, target):
    """
    x, y: 外れ値を加える前のデータ
    x_add, y_add: 求める外れ値
    target: ターゲットの相関係数
    loss: 実際の相関係数 と targetの相関係数 の差の二乗
    """
    x = torch.cat((x, x_add), 0)
    y = torch.cat((y, y_add), 0)
    x_mean = torch.mean(x)
    y_mean = torch.mean(y)
    x_std = torch.std(x)
    y_std = torch.std(y)
    x = (x - x_mean) / x_std
    y = (y - y_mean) / y_std
    loss = torch.pow(torch.dot(x, y) / n - target, 2)
    return loss

最後に、上の関数を使って外れ値を導出し、図2を作ります。

target = 0.0
x_add, y_add = calc_xy_add(x, y, target)
x = np.append(x, x_add)
y = np.append(y, y_add)
plot(x, y)

注意

タイトルは「外れ値一つで相関係数を0.0にする」ですが、targetを変えると任意の相関係数にできます。（さすがに、1以上や-1以下はできないです。コードはえいやで作ったので、targetが-1～1以外であってもエラー終了しません。）
元から相関係数がtargetに近い場合、x_add, y_addは外れ値にはなりません。
上手く収束しないときもあるかもしれません

他にもやってみた

図1に1点加えて、相関係数を-0.9に
図1に1点加えて、相関係数を0.63に

結論

相関係数だけ見るのは危険。散布図も見よう。

参考

相関係数が0.63の散布図を作成する

おまけ

同じことをPytorchではなくTensorflow2でやってみた。

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
import tensorflow as tf
np.random.seed(42)

def plot(x, y):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(x, y)
    ax.set_title(f'correlation coefficient: {np.corrcoef(x, y)[0, 1]:.3f}')
    plt.show()


def calc_xy_add(x, y, target):
    """
    x, y: 外れ値を加える前のデータ
    target: ターゲットの相関係数
    x_add, y_add: 求める外れ値
    """
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    y = tf.convert_to_tensor(y, dtype=tf.float32)
    x_add = tf.Variable([0.0])
    y_add = tf.Variable([0.0])
    opt = tf.keras.optimizers.Adam(learning_rate=0.1)
    for _ in range(100000):
        with tf.GradientTape() as tape:
            loss = calc_loss(x_add, y_add, x, y, target)
        if loss < 1e-8:
            break
        grads = tape.gradient(loss, [x_add, y_add])
        opt.apply_gradients(zip(grads, [x_add, y_add]))
    return x_add.numpy(), y_add.numpy()


def calc_loss(x_add, y_add, x, y, target):
    """
    x, y: 外れ値を加える前のデータ
    x_add, y_add: 求める外れ値
    target: ターゲットの相関係数
    loss: 実際の相関係数 と targetの相関係数 の差の二乗
    """
    x = tf.concat([x, x_add], 0)
    y = tf.concat([y, y_add], 0)
    x_mean = tf.math.reduce_mean(x)
    y_mean = tf.math.reduce_mean(y)
    x_std = tf.math.reduce_std(x)
    y_std = tf.math.reduce_std(y)
    x = (x - x_mean) / x_std
    y = (y - y_mean) / y_std
    loss = tf.pow(tf.reduce_sum(x * y) / n - target, 2)
    return loss


n = 50
mean = np.array([0, 0])
cov = np.array([[1, 0.9], [0.9, 1]])
data = np.random.multivariate_normal(mean, cov, size=n)
x = data[:, 0]
y = data[:, 1]
plot(x, y)

target = 0.0
x_add, y_add = calc_xy_add(x, y, target)
x = np.append(x, x_add)
y = np.append(y, y_add)
plot(x, y)

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up