While in machine learning optimization with respect to the L1 or L2 norm (or some other norm inducing the same topology) is common, I recently found myself more than once having to perform maximization with respect to the correlation:
$$\operatorname{corr}(x,y) :=
\frac{1}{n-1}\sum_{i=1}^n\left(
\frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}
\right)
$$, where
$$s_x:= \sqrt{\sum_{i=1}^n(x_i-\bar{x})^2/(n-1)}, s_y:=\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2/(n-1)}$$
and
$\bar{x}:=\sum_{i=1}^n x_i/n, \bar{y} :=\sum_{i=1}^n y_i/n; x,y\in\mathbb{R}^n$.
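As a quick sanity check (this snippet and its helper name corr_def are mine, not part of the derivation), the definition above agrees with the off-diagonal entry of numpy's np.corrcoef:

import numpy as np

def corr_def(x, y):
    # Sample correlation exactly as defined above (standard deviations with n-1).
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    s_x = np.sqrt((xc ** 2).sum() / (n - 1))
    s_y = np.sqrt((yc ** 2).sum() / (n - 1))
    return (xc * yc).sum() / ((n - 1) * s_x * s_y)

x, y = np.random.rand(100), np.random.rand(100)
assert np.isclose(corr_def(x, y), np.corrcoef(x, y)[0, 1])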
As a toy example, suppose we have 10 randomly generated vectors $v_1, v_2, \dots, v_{10}\in \mathbb{R}^{100}$, and we want to fit them to a target vector $t\in\mathbb{R}^{100}$ so as to maximize the correlation.
import numpy as np
from scipy import optimize
import math

N = 10
# Rows 0..N-1 play the role of the vectors v_1, ..., v_10; the last row A[N] is the target t.
A = np.random.rand(N + 1, 100)
More precisely, we want to find the
$$a := \operatorname{argmax}_{a\in\mathbb{R}^{10}} \left(
\operatorname{corr}\left(\sum_{i=1}^{10} a_i\cdot v_i,\; t\right)
\right)$$
Now, for the optimization I simply use the minimize routine provided by the mighty scipy library. However, the documentation of minimize and general mathematical reasoning both suggest that the optimization performs better when it is given the derivative of the objective function.
Therefore, I naturally found myself looking for a closed-form expression for the derivative of $\operatorname{corr}(x,y)$. However, as a quick internet search did not turn up anything relevant, I attempted to derive the formula myself. Now, since the correlation is symmetric with respect to swapping $x$ and $y$, it is enough to derive formulae for the partial derivatives $\partial/\partial x_i \left(\operatorname{corr}(x,y)\right)$. Moreover, as the value of $\operatorname{corr}(x,y)$ is unaffected by a simultaneous permutation of the components of $x$ and $y$, it suffices to find $\partial/\partial x_1$.
Now, we have
$$\frac{\partial}{\partial x_1}x_i=\delta_1^i, \frac{\partial}{\partial x_1}y_i \equiv 0,$$
where $\delta_i^j$ is the Kronecker delta, defined as 1 if $i=j$ and zero otherwise.
Similarly,
$$\frac{\partial}{\partial x_1}\bar{x} = \frac{1}{n}, \frac{\partial}{\partial x_1}\bar{y}\equiv0$$
and by the rules for differentiating a square, a square root, a product and a sum,
$$\frac{\partial}{\partial x_1}s_x = \frac{\sum_{i=1}^n(x_i-\bar{x})\frac{\partial}{\partial x_1}(x_i)-\sum_{i=1}^n(x_i-\bar{x})\frac{\partial}{\partial x_1}(\bar{x})}{(n-1)s_x} =
\frac{x_1-\bar{x}}{(n-1)s_x},
$$
where the last equality uses $\sum_{i=1}^n(x_i-\bar{x})=0$.
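A quick numerical check of this formula (a sketch; the helper _std_np and the perturbation vector e1 are mine):

def _std_np(v):
    # Sample standard deviation with n-1 in the denominator, as in the definition of s_x.
    return np.sqrt(((v - v.mean()) ** 2).sum() / (len(v) - 1))

x = np.random.rand(100)
e1 = np.zeros_like(x)
e1[0] = 1e-6  # perturb only the first coordinate
numeric = (_std_np(x + e1) - _std_np(x - e1)) / (2e-6)      # central finite difference
analytic = (x[0] - x.mean()) / ((len(x) - 1) * _std_np(x))  # the formula above
assert np.isclose(numeric, analytic)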
Using these formulae, and the definition of $\operatorname{corr}$, some computations yield the final expression:
$$
\frac{\partial \operatorname{corr}(x,y)}{\partial x_i}=
\frac{1}{n-1}\left(
\frac{y_i-\bar{y}}{s_ys_x} - \frac{x_i-\bar{x}}{s^2_x}\operatorname{corr}(x,y)
\right).
$$
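Spelling out those computations for $i=1$: applying the quotient rule to the definition of $\operatorname{corr}$, and using $\frac{\partial}{\partial x_1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) = y_1-\bar{y}$ (which again relies on the centered $y$-values summing to zero), gives
$$
\frac{\partial}{\partial x_1}\operatorname{corr}(x,y)
= \frac{(y_1-\bar{y})\,s_x - \left(\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\right)\frac{\partial s_x}{\partial x_1}}{(n-1)\,s_x^2\,s_y}
= \frac{y_1-\bar{y}}{(n-1)\,s_x s_y} - \frac{x_1-\bar{x}}{(n-1)\,s_x^2}\operatorname{corr}(x,y),
$$
which is the expression above for $i=1$.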
Finally, we implement the last expression as the Python function derivative_corr.
def _mean(v):
    return sum(v) / len(v)

def _std(v):
    # Sample standard deviation (n-1 in the denominator), matching the definition of s_x above.
    m = _mean(v)
    n = len(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (n - 1))

def derivative_corr(x, y):
    # Gradient of corr(x, y) with respect to x, following the formula derived above.
    n = len(x)
    return ((y - _mean(y)) / _std(y) / _std(x)
            - (x - _mean(x)) * np.corrcoef(x, y)[0][1] / _std(x) ** 2) / (n - 1)
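Before using it, we can sanity-check derivative_corr against a numerical gradient; a minimal sketch using scipy.optimize.check_grad (the fixed vectors x0 and y0 below are arbitrary):

y0 = np.random.rand(100)
x0 = np.random.rand(100)
err = optimize.check_grad(
    lambda x: np.corrcoef(x, y0)[0][1],  # objective: corr(x, y0)
    lambda x: derivative_corr(x, y0),    # analytic gradient from above
    x0,
)
print(err)  # should be close to zero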
Now, to begin with, we solve the $\operatorname{argmax}$ problem mentioned above via a naive application of optimize.minimize (note that minimize searches for a minimum; for a genuine maximization one would pass the negated correlation, but for the timing comparison below minimizing the correlation itself serves just as well):
%%timeit -n5
optimize.minimize(fun=lambda x:np.corrcoef(A[N]+x@A[:N],A[N])[0][1],x0=np.ones(N))
42.2 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%time
res = optimize.minimize(fun=lambda x:np.corrcoef(A[N]+x@A[:N],A[N])[0][1],x0=np.ones(N))
res.fun,res.x
CPU times: user 60 ms, sys: 2.99 ms, total: 63 ms
Wall time: 64.5 ms
(-0.35877247302854237,
array([ 2964.70897213, 2166.68176736, 1474.22979309, 3220.11473378,
-3203.32687212, 1795.39405674, 5513.03586616, 3229.98957463,
-2153.00903054, 568.78334059]))
And for comparison we perform the same optimization, but this time providing optimize.minimize with the Jacobian (here, the gradient of the objective), which we compute using the derivative_corr function defined above.
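Here the objective is $f(x) = \operatorname{corr}\left(t + \sum_{j=1}^{10} x_j v_j,\; t\right)$ with $t = A[N]$ and $v_j$ the rows of $A[:N]$, so by the chain rule
$$
\frac{\partial f}{\partial x_j} = \sum_{i=1}^{100} (v_j)_i\,\left.\frac{\partial \operatorname{corr}(z, t)}{\partial z_i}\right|_{z = t + \sum_k x_k v_k},
$$
which is exactly the matrix-vector product np.dot(A[:N], derivative_corr(A[N]+x@A[:N], A[N])) passed as jac below.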
%%timeit -n5
optimize.minimize(
fun=lambda x:np.corrcoef(A[N]+x@A[:N],A[N])[0][1],
x0=np.ones(N),
jac=lambda x:np.dot(A[:N],derivative_corr(A[N]+x@A[:N],A[N])),
)
15.6 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
%%time
res = optimize.minimize(
fun=lambda x:np.corrcoef(A[N]+x@A[:N],A[N])[0][1],
x0=np.ones(N),
jac=lambda x:np.dot(A[:N],derivative_corr(A[N]+x@A[:N],A[N])),
)
res.fun,res.x
CPU times: user 28 ms, sys: 3.65 ms, total: 31.6 ms
Wall time: 29.6 ms
(-0.3587893467095761,
array([ 3495.09695573, 2554.24639091, 1737.89533372, 3796.21917945,
-3776.81052968, 2116.52916603, 6499.49772168, 3807.86027533,
-2538.53606755, 670.37410316]))
As expected, the second invocation yields a better result in a shorter time (the difference is even more pronounced when we use %%timeit to average the execution over 5 loops per run). Although the benefits are not huge for this toy example, they are nevertheless noticeable, and we expect them to be even greater for bigger, real-life problems.
Finally, one can similarly derive a formula for the Hessian (https://en.wikipedia.org/wiki/Hessian_matrix) of the correlation. This will be published in a subsequent article.