Traditional PO and SPO
Loss function in traditional prediction for optimization (PO)
Also called the Two-Stage Method
\ell_{\mathrm{pred}}(\hat{c}, c) := \|\hat{c} - c\|^2
Loss function in Smart Predict-Then-Optimize (SPO)
Also called End-to-End Predict-Then-Optimize
\ell_{\mathrm{SPO}}(\hat{c}, c) := c^\top w^*(\hat{c}) - z^*(c)
After the prediction step, the predicted cost vector is used to optimize:
w^*(\hat{c}) := \arg\min_{w \in \mathcal{S}} \hat{c}^\top w, \qquad z^*(c) := \min_{w \in \mathcal{S}} c^\top w
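As a concrete illustration, here is a minimal sketch (not from the original notes) that evaluates both losses on a toy instance of $\min_{w \in \mathcal{S}} c^\top w$, taking $\mathcal{S}$ to be the probability simplex and using scipy's LP solver; the cost vectors are made-up numbers.

```python
# Minimal sketch (assumed toy setup): compare the prediction loss and the SPO loss
# for the nominal problem  min_{w in S} c^T w  with S = {w >= 0, sum(w) = 1}.
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    """Return (w^*(cost), z^*(cost)) over the probability simplex."""
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

c_true = np.array([1.0, 2.0, 3.0])   # true cost vector c
c_hat  = np.array([2.5, 2.0, 1.5])   # predicted cost vector c_hat (made-up numbers)

w_hat, _  = solve(c_hat)             # w^*(c_hat): decision induced by the prediction
_, z_true = solve(c_true)            # z^*(c): optimal value under the true cost

loss_pred = np.sum((c_hat - c_true) ** 2)   # ell_pred(c_hat, c) = ||c_hat - c||^2
loss_spo  = c_true @ w_hat - z_true         # ell_SPO(c_hat, c)  = c^T w^*(c_hat) - z^*(c)
print(loss_pred, loss_spo)                  # here: 4.5 and 2.0
```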
Why SPO?
Geometric Insight
Even if the predicted cost vectors $\hat{c}_A$ and $\hat{c}_B$ are equally close to the true cost $c$ in Euclidean distance:
\| \hat{c}_A - c \|_2 = \| \hat{c}_B - c \|_2
they may lead to different decisions:
w^*(\hat{c}_A) \ne w^*(\hat{c}_B)
which can result in different realized costs under the true cost vector:
c^\top w^*(\hat{c}_A) \ne c^\top w^*(\hat{c}_B)
This illustrates that minimizing the prediction error (e.g., using L2 loss) does not necessarily lead to optimal decisions — a core motivation for using the SPO framework.
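A tiny numerical example (made-up numbers, simplex feasible set assumed) makes this concrete: two predictions at the same Euclidean distance from $c$ can select different vertices and therefore incur different true costs.

```python
# Toy illustration: equal prediction error, different decisions, different true cost.
import numpy as np
from scipy.optimize import linprog

def w_star(cost):
    # argmin of cost^T w over the simplex {w >= 0, w_1 + w_2 = 1}
    return linprog(c=cost, A_eq=[[1.0, 1.0]], b_eq=[1.0], bounds=[(0, None)] * 2).x

c   = np.array([1.0, 1.1])   # true cost: item 0 is slightly cheaper
c_A = np.array([0.8, 1.1])   # ||c_A - c||_2 = 0.2, still ranks item 0 first
c_B = np.array([1.0, 0.9])   # ||c_B - c||_2 = 0.2, now ranks item 1 first

for c_hat in (c_A, c_B):
    w = w_star(c_hat)
    print(np.linalg.norm(c_hat - c), w, c @ w)
# Same distance 0.2, but decisions [1, 0] vs [0, 1] and true costs 1.0 vs 1.1.
```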
Challenges in SPO
SPO Loss
\ell_{\mathrm{SPO}}(\hat{c}, c) := c^\top w^*(\hat{c}) - z^*(c)
Backward Propagation in SPO
To compute the gradient of the SPO loss $\ell$ with respect to the model parameters $\theta$, we apply the chain rule:
\frac{\partial \ell}{\partial \theta} =
\frac{\partial \ell}{\partial w^*(\hat{c})} \cdot
\frac{\partial w^*(\hat{c})}{\partial \hat{c}} \cdot
\frac{\partial \hat{c}}{\partial \theta}, \quad \hat{c} = f_\theta(x)
Here:
$\hat{c}$ is the predicted cost vector,
$f_\theta(x)$ is the prediction model parameterized by $\theta$,
$w^*(\hat{c})$ is the decision obtained by solving the optimization problem with $\hat{c}$.
Since $w^*(\hat{c})$ is the solution to an optimization problem,
its derivative $\frac{\partial w^*(\hat{c})}{\partial \hat{c}}$ is often non-smooth or undefined (for linear programs it is zero almost everywhere), making the overall gradient computation challenging.
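The sketch below (toy setup assumed) shows this for a linear program: the induced decision $w^*(\hat{c})$ stays constant under small perturbations of $\hat{c}$ and then jumps, so any finite-difference or pathwise gradient is either exactly zero or undefined.

```python
# Sketch of the differentiability issue: for an LP, w^*(c_hat) is piecewise constant
# in c_hat, so dw^*/dc_hat is zero almost everywhere and undefined where it jumps.
import numpy as np
from scipy.optimize import linprog

def w_star(cost):
    return linprog(c=cost, A_eq=[[1.0, 1.0]], b_eq=[1.0], bounds=[(0, None)] * 2).x

base = np.array([1.0, 1.001])
for eps in (0.0, 5e-4, 2e-3, 1e-2):
    print(eps, w_star(base + np.array([eps, 0.0])))
# The decision stays at [1, 0], then jumps to [0, 1] once eps exceeds ~1e-3:
# the "gradient" of w^* is either 0 or a jump, never a useful descent direction.
```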
SPO → SPO+
Using the fact that:
\hat{c}^\top w = z^*(\hat{c}) \quad \text{for all } w \in W^*(\hat{c}) := \arg\min_{w \in \mathcal{S}} \hat{c}^\top w,
we derive:
\ell_{\text{SPO}}(\hat{c}, c)
= \max_{w \in W^*(\hat{c})} \left( c^\top w - z^*(c) \right)
= \max_{w \in W^*(\hat{c})} \left( c^\top w - z^*(c) - \alpha \hat{c}^\top w + \alpha z^*(\hat{c}) \right)
= \max_{w \in W^*(\hat{c})} \left( c^\top w - \alpha \hat{c}^\top w \right) + \alpha z^*(\hat{c}) - z^*(c)
The influence of $\alpha$:
\begin{cases}
\alpha \uparrow & \Rightarrow \quad \text{the inner maximizer leans toward the predicted cost } \hat{c} \\
\alpha \downarrow & \Rightarrow \quad \text{the inner maximizer leans toward the true cost } c
\end{cases}
Surrogate Upper Bound
\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \inf_{\alpha}
\left\{
\max_{w \in S} \left( c^\top w - \alpha \hat{c}^\top w \right) + \alpha z^*(\hat{c})
\right\} - z^*(c)
- The feasible region of $w$ changes from $W^*(\hat{c})$ to $S$, which only enlarges the inner maximum (since $W^*(\hat{c}) \subseteq S$), hence the upper bound
- $\alpha$ does not influence the feasible region
- As $\alpha \to \infty$, the bound decreases monotonically and approaches the SPO loss, because the inner maximizer satisfies:
w \to w^*(\hat{c}), \quad \hat{c}^\top w \to z^*(\hat{c})
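A quick numerical check of this bound (toy numbers assumed, simplex feasible set): the surrogate $\max_{w \in S}(c^\top w - \alpha \hat{c}^\top w) + \alpha z^*(\hat{c}) - z^*(c)$ shrinks as $\alpha$ grows and settles at the SPO loss, while the inner maximizer moves toward $w^*(\hat{c})$.

```python
# Numerical sanity check of the surrogate upper bound (made-up toy numbers).
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

c     = np.array([1.0, 2.0, 3.0])
c_hat = np.array([2.0, 1.0, 2.5])

w_hat, z_hat = solve(c_hat)          # w^*(c_hat), z^*(c_hat)
_,     z_c   = solve(c)              # z^*(c)
ell_spo = c @ w_hat - z_c
print("SPO loss:", ell_spo)

for alpha in (0.1, 0.5, 1.0, 2.0, 10.0):
    # max_{w in S} (c - alpha*c_hat)^T w  ==  -min_{w in S} (alpha*c_hat - c)^T w
    w_inner, inner_min = solve(alpha * c_hat - c)
    surrogate = -inner_min + alpha * z_hat - z_c
    print(alpha, surrogate, w_inner)
# The surrogate decreases (1.85, 1.25, 1.0, 1.0, 1.0 here) toward the SPO loss of
# 1.0, and the inner maximizer switches to w^*(c_hat) as alpha grows.
```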
SPO ERM Problem
Empirical Risk Minimization (ERM):
\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \ell_{\text{SPO}} \left( f(x_i), c_i \right)
SPO-based ERM Reformulation:
\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \lim_{\alpha_i \to \infty} \left\{
\max_{w \in S} \left( c_i^\top w - \alpha_i f(x_i)^\top w \right)
+ \alpha_i z^*(f(x_i)) - z^*(c_i)
\right\}
- $f(x_i) = \hat{c}_i$
- For every $c_i$ and $\hat{c}_i$, taking $\alpha_i$ large enough drives the inner maximizer $w$ into $W^*(\hat{c}_i)$, so each term converges to the SPO loss
SPO ERM Reformulation (Step-by-Step)
\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \lim_{\alpha_i \to \infty} \left[
\max_{w \in S} \left( c_i^\top w - \alpha_i f(x_i)^\top w \right)
+ \alpha_i z^*(f(x_i)) - z^*(c_i)
\right]
= \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \lim_{\alpha_i \to \infty} \left[
\max_{w \in S} \left( c_i^\top w - \alpha_i f(x_i)^\top w \right)
+ \alpha_i f(x_i)^\top w^*(\alpha_i f(x_i)) - z^*(c_i)
\right]
= \min_{f \in \mathcal{H}} \frac{1}{n} \lim_{\alpha \to \infty}
\sum_{i=1}^n \left[
\max_{w \in S} \left( c_i^\top w - \alpha f(x_i)^\top w \right)
+ \alpha f(x_i)^\top w^*(\alpha f(x_i)) - z^*(c_i)
\right]
\leq \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left[
\max_{w \in S} \left( c_i^\top w - 2 f(x_i)^\top w \right)
+ 2 f(x_i)^\top w^*(2 f(x_i)) - z^*(c_i)
\right]
\leq \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \left[
\max_{w \in S} \left( c_i^\top w - 2 f(x_i)^\top w \right)
+ 2 f(x_i)^\top w^*(c_i) - z^*(c_i)
\right]
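Grouping the $w$-dependent terms and writing $\hat{c}_i = f(x_i)$, each summand in the last bound is exactly the quantity that the next section defines as the SPO+ loss:
\max_{w \in S} \left( (c_i - 2\hat{c}_i)^\top w \right) + 2\hat{c}_i^\top w^*(c_i) - z^*(c_i) = \ell_{\mathrm{SPO+}}(\hat{c}_i, c_i)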
SPO+
Definition:
\ell_{\mathrm{SPO+}}(\hat{c}, c) := \max_{w \in \mathcal{S}} \left( (c - 2\hat{c})^\top w \right) + 2\hat{c}^\top w^*(c) - z^*(c)
Where:
- $\hat{c}$: predicted cost vector
- $c$: true cost vector
- $w \in \mathcal{S}$: any feasible decision in the constraint set $\mathcal{S}$
- $w^*(c)$: optimal solution under the true cost
- $z^*(c)$: optimal value under the true cost
- $\max_{w \in \mathcal{S}} (c - 2\hat{c})^\top w$: a linear program over $\mathcal{S}$ (when $\mathcal{S}$ is polyhedral)
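Below is a minimal sketch of this definition in code (not from the notes): the feasible set is again assumed to be the probability simplex, and the inner maximization is solved as a linear program.

```python
# Minimal sketch: evaluating the SPO+ loss for  min_{w in S} c^T w,
# with S assumed to be the probability simplex.
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    """Return (w^*(cost), z^*(cost)) over the probability simplex."""
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

def spo_plus_loss(c_hat, c):
    w_true, z_true = solve(c)                 # w^*(c) and z^*(c)
    # max_{w in S} (c - 2*c_hat)^T w  ==  -min_{w in S} (2*c_hat - c)^T w
    _, inner_min = solve(2 * c_hat - c)
    return -inner_min + 2 * c_hat @ w_true - z_true

c = np.array([1.0, 2.0, 3.0])
print(spo_plus_loss(np.array([1.1, 2.0, 2.9]), c))  # correct ranking -> loss 0.0
print(spo_plus_loss(np.array([3.0, 2.0, 1.0]), c))  # reversed ranking -> loss 6.0
```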
The Differentiability in SPO+
SPO+ Loss
The SPO+ loss, viewed as a function of $\hat{c}$ for fixed $c$:
\ell_{\mathrm{SPO+}}(\hat{c}, c) = \max_{w \in \mathcal{S}} (c - 2\hat{c})^\top w + 2\hat{c}^\top w^*(c) - z^*(c)
Let:
w^+ \in \arg\max_{w \in \mathcal{S}} (c - 2\hat{c})^\top w
Then one subgradient of the SPO+ loss with respect to $\hat{c}$ is:
-2 w^+ + 2 w^*(c) \in \partial_{\hat{c}} \, \ell_{\mathrm{SPO+}}(\hat{c}, c)
Notes
- The loss is convex in $\hat{c}$, since it is a maximum of affine functions of $\hat{c}$ plus an affine term.
- Hence, a subgradient exists at every point.
- This makes the SPO+ loss amenable to subgradient descent and compatible with automatic differentiation in deep learning frameworks.
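Putting the pieces together, here is a sketch of SPO+ subgradient descent for a linear prediction model $\hat{c} = Wx$ (the data, the simplex feasible set, and the hyperparameters are all assumptions made for illustration).

```python
# Sketch: training a linear predictor c_hat = W x with SPO+ subgradient descent.
import numpy as np
from scipy.optimize import linprog

def solve(cost):
    n = len(cost)
    res = linprog(c=cost, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x, res.fun

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                        # features x_i
B_true = rng.normal(size=(3, 5))                    # ground-truth map (for simulation only)
C = X @ B_true.T + 0.1 * rng.normal(size=(50, 3))   # true cost vectors c_i

W = np.zeros((3, 5))                                # model parameters theta
lr = 0.05
for epoch in range(15):
    for x, c in zip(X, C):
        c_hat = W @ x                               # prediction f_theta(x)
        w_plus, _ = solve(2 * c_hat - c)            # w^+ = argmax_{w in S} (c - 2 c_hat)^T w
        w_true, _ = solve(c)                        # w^*(c)
        g_chat = 2 * (w_true - w_plus)              # subgradient -2 w^+ + 2 w^*(c)
        W -= lr * np.outer(g_chat, x)               # chain rule: c_hat = W x
print(W)
```

In a deep-learning framework, the same subgradient would typically be supplied through a custom backward function, with the remaining chain-rule factor $\partial \hat{c} / \partial \theta$ handled by automatic differentiation.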