
Selective Classification for Deep Neural Networks [2 Problem Setting] [Paper, DeepL Translation]

Paper reading

Last updated at 2020-12-22, posted at 2020-11-09

This article is more or less a memo for my own use.
Most of it is DeepL translation.
I would appreciate it if you pointed out any mistakes.

Source
Selective Classification for Deep Neural Networks
Author: Yonatan Geifman, Ran El-Yaniv

Previous: [1 Introduction]
Next: [3 Selection with Guaranteed Risk Control]

2 Problem Setting

Translation

Consider a standard multi-class classification problem. Let ${\cal X}$ be a feature space (for example, raw image data) and let ${\cal Y} = \{1, 2, 3, \ldots, k\}$ be a finite label set representing $k$ classes. Let $P(X, Y)$ be a distribution over ${\cal X} \times {\cal Y}$. A classifier $f$ is a function $f:{\cal X} \rightarrow {\cal Y}$, and the true risk of $f$ with respect to $P$ is $R(f|P) \triangleq E_{P(X,Y)} [\ell(f(x), y)]$, where $\ell : {\cal Y} \times {\cal Y} \rightarrow {\mathbb R}^+ $ is a given loss function, for example the $0/1$ error. Given a labeled set $S_m = \{(x_i , y_i)\}^m_{i=1} \subseteq ({\cal X} \times {\cal Y})^m$ sampled i.i.d. from $P(X, Y)$, the empirical risk of the classifier $f$ is $\hat{r} (f|S_m) \triangleq \frac{1}{m} \sum^m_{i=1} \ell (f(x_i), y_i)$.
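As a rough illustration of the empirical risk $\hat{r}(f|S_m)$ under the $0/1$ loss, here is a minimal Python sketch. The names `zero_one_loss`, `empirical_risk`, `toy_classifier`, and `toy_sample` are hypothetical and chosen only for this example; they do not come from the paper.

```python
from typing import Callable, Sequence, Tuple


def zero_one_loss(y_pred: int, y_true: int) -> float:
    """0/1 loss: 1.0 for a wrong prediction, 0.0 for a correct one."""
    return 0.0 if y_pred == y_true else 1.0


def empirical_risk(
    f: Callable[[float], int],
    sample: Sequence[Tuple[float, int]],
    loss: Callable[[int, int], float] = zero_one_loss,
) -> float:
    """Average loss of classifier f over the labeled sample S_m."""
    if not sample:
        raise ValueError("the sample must contain at least one example")
    return sum(loss(f(x), y) for x, y in sample) / len(sample)


def toy_classifier(x: float) -> int:
    """Hypothetical classifier on a 1-D feature: class 1 below 0.5, else class 2."""
    return 1 if x < 0.5 else 2


# Hypothetical labeled sample of (x_i, y_i) pairs.
toy_sample = [(0.1, 1), (0.4, 2), (0.7, 2), (0.9, 1)]
risk = empirical_risk(toy_classifier, toy_sample)
```

In this toy sample the classifier is wrong on two of the four points, so `risk` is the fraction of misclassified examples.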
A selective classifier [5] is a pair $(f, g)$, where $f$ is a classifier and $g : {\cal X} \rightarrow \{0, 1\}$ is a selection function that acts as a binary qualifier for $f$ as follows:

$$
(f, g)(x) \triangleq \left\{
\begin{array}{ll}
f(x), & {\rm if} \ g(x) = 1;  \\
{\rm don't \ know}, & {\rm if} \ g(x) = 0.
\end{array}
\right.
$$

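Thus, the selective classifier abstains from predicting at a point $x$ if and only if $g(x) = 0$. The performance of a selective classifier is quantified by its coverage and risk. With $P$ fixed, the coverage, defined as $ \phi (f, g) \triangleq E_P[g(x)]$, is the probability mass of the region of ${\cal X}$ that is not rejected. The selective risk of $(f, g)$ is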

$$R(f, g) \triangleq \frac{E_P[\ell(f(x), y)g(x)]}{\phi (f, g)} \tag{1}$$
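The coverage and selective risk above are population quantities. As a sketch only, the Python below computes plug-in sample estimates of both, which is an assumption made for illustration since this section defines only the expectations under $P$. Representing "don't know" as `None`, and all names (`selective_predict`, `empirical_coverage`, `empirical_selective_risk`, `toy_f`, `toy_g`, `toy_sample`), are hypothetical choices for this example.

```python
from typing import Callable, Optional, Sequence, Tuple

Sample = Sequence[Tuple[float, int]]


def zero_one_loss(y_pred: int, y_true: int) -> float:
    """0/1 loss: 1.0 for a wrong prediction, 0.0 for a correct one."""
    return 0.0 if y_pred == y_true else 1.0


def selective_predict(
    f: Callable[[float], int], g: Callable[[float], int], x: float
) -> Optional[int]:
    """(f, g)(x): return f(x) when g(x) = 1, otherwise None ("don't know")."""
    return f(x) if g(x) == 1 else None


def empirical_coverage(g: Callable[[float], int], sample: Sample) -> float:
    """Fraction of sample points accepted by g (sample estimate of phi)."""
    return sum(g(x) for x, _ in sample) / len(sample)


def empirical_selective_risk(
    f: Callable[[float], int],
    g: Callable[[float], int],
    sample: Sample,
    loss: Callable[[int, int], float] = zero_one_loss,
) -> float:
    """Total loss on accepted points divided by the number of accepted points."""
    accepted = sum(g(x) for x, _ in sample)
    if accepted == 0:
        raise ValueError("coverage is zero, so the selective risk is undefined")
    return sum(loss(f(x), y) * g(x) for x, y in sample) / accepted


def toy_f(x: float) -> int:
    """Hypothetical classifier: class 1 below 0.5, else class 2."""
    return 1 if x < 0.5 else 2


def toy_g(x: float) -> int:
    """Hypothetical selection function: accept only points far from 0.5."""
    return 1 if abs(x - 0.5) >= 0.2 else 0


toy_sample = [(0.05, 1), (0.1, 1), (0.45, 2), (0.55, 2), (0.9, 2)]
coverage = empirical_coverage(toy_g, toy_sample)
selective_risk = empirical_selective_risk(toy_f, toy_g, toy_sample)
full_risk = sum(zero_one_loss(toy_f(x), y) for x, y in toy_sample) / len(toy_sample)
```

In this toy data the only misclassified point lies close to the decision threshold, where `toy_g` abstains, so the selective risk on the accepted points is lower than the risk over the full sample, at the cost of covering fewer points.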

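Clearly, the risk of a selective classifier can be traded off against coverage. The full performance profile of such a classifier is specified by its risk-coverage curve, defined as risk as a function of coverage [5].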
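One way to trace an empirical risk-coverage curve is sketched below. It assumes that each example comes with a confidence score and that the selection function accepts the $n$ most confident examples for $n = 1, \ldots, m$; neither assumption is stated in this section, and `risk_coverage_curve`, `toy_confidences`, and `toy_losses` are hypothetical names. Ties between confidence scores are not treated specially.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(
    confidences: Sequence[float], losses: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (coverage, selective risk) pairs, accepting the n most confident points."""
    if len(confidences) != len(losses):
        raise ValueError("confidences and losses must have the same length")
    m = len(confidences)
    # Indices sorted from most to least confident.
    order = sorted(range(m), key=lambda i: confidences[i], reverse=True)
    curve = []
    cumulative_loss = 0.0
    for n, i in enumerate(order, start=1):
        cumulative_loss += losses[i]
        # Coverage is n / m; selective risk is the mean loss over the n accepted points.
        curve.append((n / m, cumulative_loss / n))
    return curve


# Hypothetical per-example confidence scores and 0/1 losses of some classifier.
toy_confidences = [0.9, 0.6, 0.8, 0.55]
toy_losses = [0.0, 1.0, 0.0, 1.0]
toy_curve = risk_coverage_curve(toy_confidences, toy_losses)
```

Each entry of `toy_curve` is one point on the curve; in this toy data the errors fall on the least confident examples, so the selective risk grows as coverage increases.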
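Consider the following problem. We are given a classifier $f$, a training sample $S_m$, a confidence parameter $\delta > 0$, and a desired risk target $r^* > 0$. Our goal is to use $S_m$ to construct a selection function $g$ such that the selective risk of $(f, g)$ satisfies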

$$ Pr_{S_m} \{ R(f, g) > r^* \} < \delta, \tag{2}$$

where the probability is over training samples $S_m$ sampled i.i.d. from the unknown underlying distribution $P$. Among all classifiers satisfying (2), the best are those that maximize coverage.

Original text

We consider a standard multi-class classification problem. Let ${\cal X}$ be some feature space (e.g., raw image data) and ${\cal Y}$, a finite label set, ${\cal Y} = \{1, 2, 3, \ldots, k \} $, representing $k$ classes. Let $P(X, Y)$ be a distribution over ${\cal X} \times {\cal Y}$. A classifier $f$ is a function $f:{\cal X} \rightarrow {\cal Y}$, and the true risk of $f$ w.r.t. $P$ is $R(f|P) \triangleq E_{P(X,Y)} [\ell(f(x), y)]$, where $\ell : {\cal Y} \times {\cal Y} \rightarrow {\mathbb R}^+ $ is a given loss function, for example the $0/1$ error. Given a labeled set $S_m = \{(x_i , y_i)\}^m_{i=1} \subseteq ({\cal X} \times {\cal Y})^m$ sampled i.i.d. from $P(X, Y)$, the empirical risk of the classifier $f$ is $\hat{r} (f|S_m) \triangleq \frac{1}{m} \sum^m_{i=1} \ell (f(x_i), y_i)$.
A selective classifier [5] is a pair $(f, g)$, where $f$ is a classifier, and $g : {\cal X} \rightarrow \{0, 1\}$ is a selection function, which serves as a binary qualifier for $f$ as follows,

$$
(f, g)(x) \triangleq \left\{
\begin{array}{ll}
f(x), & {\rm if} \ g(x) = 1;  \\
{\rm don't \ know}, & {\rm if} \ g(x) = 0.
\end{array}
\right.
$$

Thus, the selective classifier abstains from prediction at a point $x$ iff $g(x) = 0$. The performance of a selective classifier is quantified using coverage and risk. Fixing $P$, coverage, defined to be $ \phi (f, g) \triangleq E_P[g(x)]$, is the probability mass of the non-rejected region in ${\cal X}$. The selective risk of $(f, g)$ is

$$R(f, g) \triangleq \frac{E_P[\ell(f(x), y)g(x)]}{\phi (f, g)} \tag{1}$$

Clearly, the risk of a selective classifier can be traded-off for coverage. The entire performance profile of such a classifier can be specified by its risk-coverage curve, defined to be risk as a function of coverage [5].
Consider the following problem. We are given a classifier $f$, a training sample $S_m$, a confidence parameter $\delta > 0$, and a desired risk target $r^* > 0$. Our goal is to use $S_m$ to create a selection function $g$ such that the selective risk of $(f, g)$ satisfies

$$ Pr_{S_m} \{ R(f, g) > r^* \} < \delta, \tag{2}$$

where the probability is over training samples, $S_m$, sampled i.i.d. from the unknown underlying distribution $P$. Among all classifiers satisfying (2), the best ones are those that maximize the coverage.
