Traditional PO and SPO
Loss function in traditional prediction for optimization (PO)
Also called the Two-Stage Method
\ell_{\mathrm{pred}}(\hat{c}, c) := \|\hat{c} - c\|^2
Loss function in Smart Predict-Then-Optimize (SPO)
Also called End-to-End Predict-Then-Optimize
\ell_{\mathrm{SPO}}(\hat{c}, c) := c^\top w^*(\hat{c}) - z^*(c)
After the prediction step, the predicted cost vector is used to optimize:
w^*(\hat{c}) := \arg\min_{w \in \mathcal{S}} \hat{c}^\top w, \qquad z^*(c) := \min_{w \in \mathcal{S}} c^\top w
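As a concrete illustration, here is a minimal sketch (not from the original notes) that evaluates both losses on a toy instance of $\min_{w \in \mathcal{S}} c^\top w$, taking $\mathcal{S}$ to be the probability simplex and using scipy's LP solver; the cost vectors are made-up numbers.

```python
# Minimal sketch (assumed toy setup): compare the prediction loss and the SPO loss
# for the nominal problem  min_{w in S} c^T w  with S = {w >= 0, sum(w) = 1}.
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    """Return (w^*(cost), z^*(cost)) over the probability simplex."""
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

c_true = np.array([1.0, 2.0, 3.0])   # true cost vector c
c_hat  = np.array([2.5, 2.0, 1.5])   # predicted cost vector c_hat (made-up numbers)

w_hat, _  = solve(c_hat)             # w^*(c_hat): decision induced by the prediction
_, z_true = solve(c_true)            # z^*(c): optimal value under the true cost

loss_pred = np.sum((c_hat - c_true) ** 2)   # ell_pred(c_hat, c) = ||c_hat - c||^2
loss_spo  = c_true @ w_hat - z_true         # ell_SPO(c_hat, c)  = c^T w^*(c_hat) - z^*(c)
print(loss_pred, loss_spo)                  # here: 4.5 and 2.0
```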
Why SPO?
Geometric Insight
Even if the predicted cost vectors $\hat{c}_A$ and $\hat{c}_B$ are equally close to the true cost $c$ in Euclidean distance:
\| \hat{c}_A - c \|_2 = \| \hat{c}_B - c \|_2
they may lead to different decisions:
w^*(\hat{c}_A) \ne w^*(\hat{c}_B)
which can result in different realized costs under the true cost vector:
c^\top w^*(\hat{c}_A) \ne c^\top w^*(\hat{c}_B)
This illustrates that minimizing the prediction error (e.g., using L2 loss) does not necessarily lead to optimal decisions — a core motivation for using the SPO framework.
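A tiny numerical example (made-up numbers, simplex feasible set assumed) makes this concrete: two predictions at the same Euclidean distance from $c$ can select different vertices and therefore incur different true costs.

```python
# Toy illustration: equal prediction error, different decisions, different true cost.
import numpy as np
from scipy.optimize import linprog

def w_star(cost):
    # argmin of cost^T w over the simplex {w >= 0, w_1 + w_2 = 1}
    return linprog(c=cost, A_eq=[[1.0, 1.0]], b_eq=[1.0], bounds=[(0, None)] * 2).x

c   = np.array([1.0, 1.1])   # true cost: item 0 is slightly cheaper
c_A = np.array([0.8, 1.1])   # ||c_A - c||_2 = 0.2, still ranks item 0 first
c_B = np.array([1.0, 0.9])   # ||c_B - c||_2 = 0.2, now ranks item 1 first

for c_hat in (c_A, c_B):
    w = w_star(c_hat)
    print(np.linalg.norm(c_hat - c), w, c @ w)
# Same distance 0.2, but decisions [1, 0] vs [0, 1] and true costs 1.0 vs 1.1.
```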
Challenges in SPO
SPO Loss
\ell_{\mathrm{SPO}}(\hat{c}, c) := c^\top w^*(\hat{c}) - z^*(c)
Backward Propagation in SPO
To compute the gradient of the SPO loss $\ell$ with respect to the model parameters $\theta$, we apply the chain rule:
\frac{\partial \ell}{\partial \theta} =
\frac{\partial \ell}{\partial w^*(\hat{c})} \cdot
\frac{\partial w^*(\hat{c})}{\partial \hat{c}} \cdot
\frac{\partial \hat{c}}{\partial \theta}, \quad \hat{c} = f_\theta(x)
Here:
$\hat{c}$ is the predicted cost vector,
$f_\theta(x)$ is the prediction model parameterized by $\theta$,
$w^*(\hat{c})$ is the decision obtained by solving the optimization problem with $\hat{c}$.
Since $w^*(\hat{c})$ is the solution to an optimization problem,
its derivative $\frac{\partial w^*(\hat{c})}{\partial \hat{c}}$ is often non-smooth or undefined (for linear programs it is zero almost everywhere), making the overall gradient computation challenging.
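The sketch below (toy setup assumed) shows this for a linear program: the induced decision $w^*(\hat{c})$ stays constant under small perturbations of $\hat{c}$ and then jumps, so any finite-difference or pathwise gradient is either exactly zero or undefined.

```python
# Sketch of the differentiability issue: for an LP, w^*(c_hat) is piecewise constant
# in c_hat, so dw^*/dc_hat is zero almost everywhere and undefined where it jumps.
import numpy as np
from scipy.optimize import linprog

def w_star(cost):
    return linprog(c=cost, A_eq=[[1.0, 1.0]], b_eq=[1.0], bounds=[(0, None)] * 2).x

base = np.array([1.0, 1.001])
for eps in (0.0, 5e-4, 2e-3, 1e-2):
    print(eps, w_star(base + np.array([eps, 0.0])))
# The decision stays at [1, 0], then jumps to [0, 1] once eps exceeds ~1e-3:
# the "gradient" of w^* is either 0 or a jump, never a useful descent direction.
```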
SPO → SPO+
Using the fact that:
\hat{c}^\top w = z^*(\hat{c}) \quad \text{for all } w \in W^*(\hat{c}) := \arg\min_{w \in \mathcal{S}} \hat{c}^\top w,
we derive:
\ell_{\text{SPO}}(\hat{c}, c)
= \max_{w \in W^*(\hat{c})} \left( c^\top w - z^*(c) \right)
= \max_{w \in W^*(\hat{c})} \left( c^\top w - z^*(c) - \alpha \hat{c}^\top w + \alpha z^*(\hat{c}) \right)
= \max_{w \in W^*(\hat{c})} \left( c^\top w - \alpha \hat{c}^\top w \right) + \alpha z^*(\hat{c}) - z^*(c)
The influence of $\alpha$:
\begin{cases}
\alpha \uparrow & \Rightarrow \quad \text{the inner maximizer leans toward the predicted cost } \hat{c} \\
\alpha \downarrow & \Rightarrow \quad \text{the inner maximizer leans toward the true cost } c
\end{cases}
Surrogate Upper Bound
\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \inf_{\alpha}
\left\{
\max_{w \in S} \left( c^\top w - \alpha \hat{c}^\top w \right) + \alpha z^*(\hat{c})
\right\} - z^*(c)
- The feasible region of $w$ changes from $W^*(\hat{c})$ to $S$, which only enlarges the inner maximum (since $W^*(\hat{c}) \subseteq S$), hence the upper bound
- $\alpha$ does not influence the feasible region
- As $\alpha \to \infty$, the bound decreases monotonically and approaches the SPO loss, because the inner maximizer satisfies:
w \to w^*(\hat{c}), \quad \hat{c}^\top w \to z^*(\hat{c})
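A quick numerical check of this bound (toy numbers assumed, simplex feasible set): the surrogate $\max_{w \in S}(c^\top w - \alpha \hat{c}^\top w) + \alpha z^*(\hat{c}) - z^*(c)$ shrinks as $\alpha$ grows and settles at the SPO loss, while the inner maximizer moves toward $w^*(\hat{c})$.

```python
# Numerical sanity check of the surrogate upper bound (made-up toy numbers).
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

c     = np.array([1.0, 2.0, 3.0])
c_hat = np.array([2.0, 1.0, 2.5])

w_hat, z_hat = solve(c_hat)          # w^*(c_hat), z^*(c_hat)
_,     z_c   = solve(c)              # z^*(c)
ell_spo = c @ w_hat - z_c
print("SPO loss:", ell_spo)

for alpha in (0.1, 0.5, 1.0, 2.0, 10.0):
    # max_{w in S} (c - alpha*c_hat)^T w  ==  -min_{w in S} (alpha*c_hat - c)^T w
    w_inner, inner_min = solve(alpha * c_hat - c)
    surrogate = -inner_min + alpha * z_hat - z_c
    print(alpha, surrogate, w_inner)
# The surrogate decreases (1.85, 1.25, 1.0, 1.0, 1.0 here) toward the SPO loss of
# 1.0, and the inner maximizer switches to w^*(c_hat) as alpha grows.
```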
SPO ERM Problem
Empirical Risk Minimization (ERM):
\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \ell_{\text{SPO}} \left( f(x_i), c_i \right)
SPO-based ERM Reformulation:
\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \lim_{\alpha_i \to \infty} \left\{
\max_{w \in S} \left( c_i^\top w - \alpha_i f(x_i)^\top w \right)
+ \alpha_i z^*(f(x_i)) - z^*(c_i)
\right\}
- $f(x_i) = \hat{c}_i$
- For every $c_i$ and $\hat{c}_i$, taking $\alpha_i$ large enough drives the inner maximizer $w$ into $W^*(\hat{c}_i)$, so each term converges to the SPO loss
SPO ERM Reformulation (Step-by-Step)
\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \lim_{\alpha_i \to \infty} \left[
\max_{w \in S} \left( c_i^\top w - \alpha_i f(x_i)^\top w \right)
+ \alpha_i z^*(f(x_i)) - z^*(c_i)
\right]
= \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \lim_{\alpha_i \to \infty} \left[
\max_{w \in S} \left( c_i^\top w - \alpha_i f(x_i)^\top w \right)
+ \alpha_i f(x_i)^\top w^*(\alpha_i f(x_i)) - z^*(c_i)
\right]
= \min_{f \in \mathcal{H}} \frac{1}{n} \lim_{\alpha \to \infty}
\sum_{i=1}^n \left[
\max_{w \in S} \left( c_i^\top w - \alpha f(x_i)^\top w \right)
+ \alpha f(x_i)^\top w^*(\alpha f(x_i)) - z^*(c_i)
\right]
\leq \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left[
\max_{w \in S} \left( c_i^\top w - 2 f(x_i)^\top w \right)
+ 2 f(x_i)^\top w^*(2 f(x_i)) - z^*(c_i)
\right]
\leq \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left[
\max_{w \in S} \left( c_i^\top w - 2 f(x_i)^\top w \right)
+ 2 f(x_i)^\top w^*(c_i) - z^*(c_i)
\right]
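Grouping the $w$-dependent terms and writing $\hat{c}_i = f(x_i)$, each summand in the last bound is exactly the quantity that the next section defines as the SPO+ loss:
\max_{w \in S} \left( (c_i - 2\hat{c}_i)^\top w \right) + 2\hat{c}_i^\top w^*(c_i) - z^*(c_i) = \ell_{\mathrm{SPO+}}(\hat{c}_i, c_i)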
SPO+
Definition:
\ell_{\mathrm{SPO+}}(\hat{c}, c) := \max_{w \in \mathcal{S}} \left( (c - 2\hat{c})^\top w \right) + 2\hat{c}^\top w^*(c) - z^*(c)
Where:
- $\hat{c}$: predicted cost vector
- $c$: true cost vector
- $w \in \mathcal{S}$: any feasible decision in the constraint set $\mathcal{S}$
- $w^*(c)$: optimal solution under the true cost
- $z^*(c)$: optimal value under the true cost
- $\max_{w \in \mathcal{S}} (c - 2\hat{c})^\top w$: a linear program over $\mathcal{S}$ (when $\mathcal{S}$ is polyhedral)
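Below is a minimal sketch of this definition in code (not from the notes): the feasible set is again assumed to be the probability simplex, and the inner maximization is solved as a linear program.

```python
# Minimal sketch: evaluating the SPO+ loss for  min_{w in S} c^T w,
# with S assumed to be the probability simplex.
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    """Return (w^*(cost), z^*(cost)) over the probability simplex."""
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

def spo_plus_loss(c_hat, c):
    w_true, z_true = solve(c)                 # w^*(c) and z^*(c)
    # max_{w in S} (c - 2*c_hat)^T w  ==  -min_{w in S} (2*c_hat - c)^T w
    _, inner_min = solve(2 * c_hat - c)
    return -inner_min + 2 * c_hat @ w_true - z_true

c = np.array([1.0, 2.0, 3.0])
print(spo_plus_loss(np.array([1.1, 2.0, 2.9]), c))  # correct ranking -> loss 0.0
print(spo_plus_loss(np.array([3.0, 2.0, 1.0]), c))  # reversed ranking -> loss 6.0
```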
The Differentiability in SPO+
SPO+ Loss
The SPO+ loss, viewed as a function of $\hat{c}$ for fixed $c$:
\ell_{\mathrm{SPO+}}(\hat{c}, c) = \max_{w \in \mathcal{S}} (c - 2\hat{c})^\top w + 2\hat{c}^\top w^*(c) - z^*(c)
Let:
w^+ \in \arg\max_{w \in \mathcal{S}} (c - 2\hat{c})^\top w
Then one subgradient of the SPO+ loss with respect to $\hat{c}$ is:
-2 w^+ + 2 w^*(c) \in \partial_{\hat{c}} \, \ell_{\mathrm{SPO+}}(\hat{c}, c)
Notes
- The loss is convex in $\hat{c}$, since it is a maximum of affine functions of $\hat{c}$ plus an affine term.
- Hence, a subgradient exists at every point.
- This makes the SPO+ loss amenable to subgradient descent and compatible with automatic differentiation in deep learning frameworks.
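Putting the pieces together, here is a sketch of SPO+ subgradient descent for a linear prediction model $\hat{c} = Wx$ (the data, the simplex feasible set, and the hyperparameters are all assumptions made for illustration).

```python
# Sketch: training a linear predictor c_hat = W x with SPO+ subgradient descent.
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                        # features x_i
B_true = rng.normal(size=(3, 5))                    # ground-truth map (for simulation only)
C = X @ B_true.T + 0.1 * rng.normal(size=(50, 3))   # true cost vectors c_i

W = np.zeros((3, 5))                                # model parameters theta
lr = 0.05
for epoch in range(15):
    for x, c in zip(X, C):
        c_hat = W @ x                               # prediction f_theta(x)
        w_plus, _ = solve(2 * c_hat - c)            # w^+ = argmax_{w in S} (c - 2 c_hat)^T w
        w_true, _ = solve(c)                        # w^*(c)
        g_chat = 2 * (w_true - w_plus)              # subgradient -2 w^+ + 2 w^*(c)
        W -= lr * np.outer(g_chat, x)               # chain rule: c_hat = W x
print(W)
```

In a deep-learning framework, the same subgradient would typically be supplied through a custom backward function, with the remaining chain-rule factor $\partial \hat{c} / \partial \theta$ handled by automatic differentiation.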