
PyTorch Loss Functions: A Summary of Usage and Implementation

Last updated at 2022-11-13, posted at 2022-06-17

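What is a loss function?

We call it machine learning, but in the end the one doing the learning is a computer, so whatever gets evaluated numerically is all that counts. Even something like affective (kansei) data ends up being processed numerically, for example through a confusion matrix. The loss function is the formula that tells the computer how wrong it currently is. In other words, the larger the loss, the more wrong the prediction.

Roughly speaking, it can be written as the following expression. ~~Not that this expression carries much information.~~

\mathrm{Loss} = L(Y_\mathrm{prediction}, Y_\mathrm{ground\_truth})

That is, a loss function measures the difference, error, or distance between $Y_\mathrm{prediction}$ and $Y_\mathrm{ground\_truth}$ according to some definition.

Note that in the formulas below I deliberately do not use loss-specific names such as $y_{\mathrm{pred}}$ and $y_{\mathrm{true}}$; instead I treat each loss simply as a function of two inputs and write them as $x$ and $y$.

I have also left out the points I do not yet understand well enough (especially in the latter half). I will add them once I understand them. If you spot a mistake, I would appreciate it if you could point it out in the comments.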

Updates
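2022/08/04: The MSE formula originally showed the plain squared error, but on rechecking I found that PyTorch takes the mean, so I updated it.
2022/11/13: The Smooth L1 Loss description used expressions with no real content, such as "the power behind the throne", so I added a proper explanation.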

Loss functions available in the PyTorch library

Source: PyTorch nn.functional
Note: for the sake of explanation, the order differs in places from the official documentation.

No.	In-page link
1	Cross Entropy
2	Kullback-Leibler divergence Loss
3	Binary Cross Entropy
4	Binary Cross Entropy with logits
5	Negative log likelihood Loss
6	Poisson Negative log likelihood Loss
7	Gaussian Negative log likelihood Loss
8	Cosine Embedding Loss
9	The Connectionist Temporal Classification Loss
10	Hinge Embedding Loss
11	L1-Loss
12	Smooth L1-Loss
13	Huber Loss
14	Mean Squared Error
15	Soft Margin Loss
16	Multi Margin Loss
17	Multilabel Margin Loss
18	Multilabel Soft Margin Loss
19	Margin Ranking Loss
20	Triplet Margin Loss
21	Triplet Margin with Distance Loss

Loss functions

Cross Entropy

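Mostly used for multi-class and two-class classification problems. For multi-class classification it pairs well with Softmax, which turns the outputs into per-class probabilities, so the two are often used together. In the two-class case (meaning the model outputs two numbers), it is, in my view, harder to say that the numbers coming out of Softmax really represent probabilities.
I explained the path from information content to cross entropy in an earlier article, "From Information Content to KL Divergence" (情報量からKLダイバージェンスまでのお話), so have a look if you are interested.
The PyTorch documentation also has an explanation of CrossEntropyLoss (in English), so check that out too.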

Definition

Cross Entropy Loss, textbook definition

l(x, y) = - \sum_{e\in\Omega}P(e)\log Q(e)

PyTorch internal version

l(x, y) = L = \sum^{N}_{n=1}l_n \\
\begin{align}
l_n &= -w_{y_{n}} \ \log\frac{\mathrm{exp}(x_{n, y_{n}})}{\sum_{c=1}^{C}\mathrm{exp}(x_{n, c})} \\
&= -w_{y_{n}} \ \log (\mathrm{Softmax}(x_n)_{y_n}) \\
&= -w_{y_{n}} \ \mathrm{LogSoftmax}(x_n)_{y_n} \\
&= -w_{y_{n}} \ [\log \mathrm{exp}(x_{n, y_{n}}) - \log \sum_{c=1}^{C}\mathrm{exp}(x_{n, c}) ] \\
&= -w_{y_{n}} \ [x_{n, y_{n}} - \log \sum_{c=1}^{C}\mathrm{exp}(x_{n, c}) ]
\end{align}

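$c$ denotes the class and $w$ the per-class weight, if one is set. The plain sum above corresponds to reduction='sum'; with the default reduction='mean' it is further divided by $\sum_{n} w_{y_n}$. As the formula shows, PyTorch's Cross Entropy Loss applies Softmax internally before computing the cross entropy from the definition. Why keep something that important so quiet? This is what you call an unwanted favor... although, to be fair, it does mean you no longer need to apply Softmax to the network output yourself...

So, if you want a cross entropy loss that does not apply the ~~pesky~~ Softmax function for you, it seems you use NLLLoss and feed it the logarithm of the predicted probabilities. There is a discussion about this here, so have a look if you are interested.

Incidentally, PyTorch also provides LogSoftmax (see the official documentation), a module that computes Softmax of the input tensor x and then takes the logarithm.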
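The relationship described above can be checked numerically. The following is a minimal sketch, assuming the default reduction='mean' and no class weights; the tensor shapes and target values are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)         # raw network outputs, no Softmax applied
target = torch.tensor([1, 0, 4])   # class indices

# 1) cross_entropy takes the raw logits directly.
ce = F.cross_entropy(logits, target)

# 2) nll_loss expects log-probabilities, so LogSoftmax is applied first.
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)

# 3) Last line of the derivation: -(x[n, y_n] - log sum_c exp(x[n, c])), averaged over n.
picked = logits[torch.arange(3), target]
manual = -(picked - torch.logsumexp(logits, dim=1)).mean()

print(ce.item(), nll.item(), manual.item())
```

All three values should agree up to floating-point error, which is why the network itself does not need a Softmax layer when CrossEntropyLoss is used.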

Appendix

It also appears in the formula above, but I felt there was a bit of a gap, so here is the definition of Softmax as well.

y_i = \frac{\mathrm{exp}(x_i)}{\sum_{k=1}^{N}\mathrm{exp}(x_k)}

Sample code

cross_entropy.py

import torch
from torch import nn

######################
#         here       #
######################
criteria = nn.CrossEntropyLoss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
print(input, target)
loss = criteria(input, target)
print(loss)

Kullback-Leibler divergence Loss

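This is the distance-like measure known as "KL divergence" (strictly speaking it is not symmetric, so it is not a true metric). In my experience it tends to be used as a loss function for generative models. KL divergence is closely tied to information content; I covered that background in my earlier article "From Information Content to KL Divergence" (情報量からKLダイバージェンスまでのお話), so have a look if you are interested. Put simply, think of it as one measure of how similar two probability distributions are.

The official PyTorch documentation also has an explanation, so please refer to that as well.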

Definition

\begin{align}
L(x, y)
&= y \cdot \log \frac{y}{x} \\
&= y \cdot (\log y - \log x)
\end{align}
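As a rough check of this definition, the sketch below compares a hand-written version with F.kl_div. It assumes row-wise distributions and reduction="batchmean" (sum over classes, then average over the batch); note that F.kl_div expects the prediction as log-probabilities, so log x is passed rather than x.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = F.softmax(torch.randn(3, 5), dim=1)   # predicted distribution per row
y = F.softmax(torch.randn(3, 5), dim=1)   # target distribution per row

# Pointwise y * (log y - log x), summed over classes, averaged over the batch.
manual = (y * (y.log() - x.log())).sum(dim=1).mean()

# The built-in takes log-probabilities as its first argument.
builtin = F.kl_div(x.log(), y, reduction="batchmean")

print(manual.item(), builtin.item())
```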

Sample code

kl_divergence.py

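import torch
import torch.nn.functional as F
from torch import nn

######################
#         here       #
######################
# "batchmean" matches the mathematical definition of KL divergence
criteria = nn.KLDivLoss(reduction="batchmean")
_input = torch.randn(3, 5, requires_grad=True)
input = F.log_softmax(_input, dim=1)   # KLDivLoss expects log-probabilities
_target = torch.empty(3, 5).random_(2)
target = F.softmax(_target, dim=1)     # targets are plain probabilities

loss = criteria(input, target)
print(input, target)
print(loss)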

Binary Cross Entropy

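As the word Binary suggests, this is mainly used for two-class classification problems. As with cross entropy, the loss may be averaged over the samples. It is said to pair well with the Sigmoid function for two-class classification.
For example, suppose the model outputs a single feature $\hat{p}$ and we obtain $\mathrm{Sigmoid}(\hat{p})=p$. If $p$ is larger than 50%, the sample more likely belongs to class-1 (the class whose target is 1); otherwise it more likely belongs to the other class, class-0. In other words, the loss works well for two-class models that output a single number.
On the other hand, when there are two or more outputs (e.g. [0.2, 2.5, -9.3]), applying Sigmoid does not make them sum to 1; it merely squashes each value into the range (0, 1), so the result is not necessarily meaningful as a distribution over classes.
Please also see BCELoss in the official PyTorch documentation.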

Definition

l(x, y) = L = \sum^{N}_{n=1}l_n \\
l_n = -w_{n}[ y_n\ \log x_n + (1-y_n)\log(1-x_n)]

Appendix

The Sigmoid function is defined as follows.

\begin{align}
f(x) & = \frac{1}{1+e^{-x}} \\
& = \frac{e^{x}}{e^{x}+1} 
\end{align}

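By the way, ~~sharp-eyed brats like you~~ perceptive readers may have noticed that the Softmax function is a generalization of the Sigmoid function. For example, suppose a model's output is $z(x) = [z, 0]$. Since $N=2$ in the Softmax function, we substitute into

y = \frac{e^{x_1}}{e^{x_1}+e^{x_2}}

Therefore,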

\begin{align}
S(z_1)&= \frac{e^{z}}{e^{z}+e^{0}} \\
&= \frac{e^{z}}{e^{z}+1} \\
&= \sigma(z) \\
S(z_2)&= \frac{e^{0}}{e^{0}+e^{z}} \\
&= \frac{1}{1+e^{z}} \\
&= 1-\sigma(z) \\
\end{align}

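and we recover the Sigmoid function. Note, however, that this is a special two-class case used for explanation; for multi-class classification it is better to use Softmax, which normalizes the outputs so that they sum to 1.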
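The two-class special case can be confirmed in a few lines. This is only an illustrative sketch; the values of z are arbitrary.

```python
import torch

z = torch.tensor([-2.0, 0.0, 3.0])

# Build rows of the form [z, 0] and apply Softmax over the two entries.
logits = torch.stack([z, torch.zeros_like(z)], dim=1)
probs = torch.softmax(logits, dim=1)

# Column 0 should equal sigmoid(z), column 1 should equal 1 - sigmoid(z).
print(probs[:, 0], torch.sigmoid(z))
print(probs[:, 1], 1 - torch.sigmoid(z))
```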

Sample code

binary_cross_entropy.py

import torch
from torch import nn

######################
#         here       #
######################
criteria = nn.BCELoss()
m = nn.Sigmoid()

input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
print(input, target)
loss = criteria(m(input), target)
print(m(input), target)
print(loss)

Binary Cross Entropy with Logits

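This is simply BCELoss with Sigmoid applied to the input. Tracing only the formula, I could not tell at first where the difference was, but on closer inspection Sigmoid is applied to $x_n$.
Please also see BCEWithLogitsLoss in the official PyTorch documentation.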

Definition

l(x, y) = L = \sum^{N}_{n=1}l_n \\
l_n = -w_{n}[ y_n\ \log \sigma(x_n) + (1-y_n)\log(1-\sigma(x_n))]
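A quick way to see that the only difference is the built-in Sigmoid is to compare the two functional forms directly. The sketch below assumes default settings (mean reduction, no weights) and made-up inputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4)                        # raw scores, no Sigmoid applied
target = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Sigmoid is applied inside the loss.
with_logits = F.binary_cross_entropy_with_logits(logits, target)

# Sigmoid applied by hand, then plain BCE.
two_step = F.binary_cross_entropy(torch.sigmoid(logits), target)

print(with_logits.item(), two_step.item())
```

The PyTorch documentation notes that the fused version is the more numerically stable of the two, so it is usually the safer choice when the model outputs raw scores.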

Sample code

bce_with_logits.py

import torch
from torch import nn

######################
#         here       #
######################
criteria = nn.BCEWithLogitsLoss()

input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
print(input, target)
loss = criteria(input, target)
print(loss)

Negative log likelihood Loss

Definition

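I wondered what this is called in Japanese; apparently it is 負の対数尤度 (fu no taisū yūdo, "negative log likelihood"). The character 尤 in 尤度 (likelihood) is the one used in 尤もらしい ("plausible"). ~~Pointing out that it looks like 犬度 ("dog-ness") is a classic observation that nevertheless falls completely flat, so better not to say it out loud.~~
This section is supposed to be about likelihood, yet however I read the official documentation, the loss itself does not appear to compute any likelihood... The reason is that NLLLoss expects log-probabilities as its input (e.g. the output of LogSoftmax) and only picks out the entry for the target class and negates it. Anyway, the computation it performs is as follows.

l_n = -w_{y_{n}}x_{n, y_{n}}, \ w_c = \mathrm{weight}[c] \\
l(x, y) = \sum^{N}_{n=1}\frac{1}{\sum_{n=1}^{N}w_{y_{n}}}l_n

A weight $w$ can be set, but it defaults to None. Since each class can be given its own weight, it can be used for imbalanced data and the like. If that is not needed and all $w$ are equal, the result is simply the mean of the $l_n$.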
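To make the role of the weights concrete, here is a small sketch that reproduces the weighted mean by hand. The class weights are hypothetical values chosen only for illustration, and the default reduction='mean' is assumed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
log_probs = F.log_softmax(torch.randn(3, 5), dim=1)   # NLLLoss expects log-probabilities
target = torch.tensor([1, 0, 4])
weight = torch.tensor([1.0, 2.0, 1.0, 1.0, 3.0])      # hypothetical per-class weights

w_n = weight[target]                                   # w_{y_n} for each sample
l_n = -w_n * log_probs[torch.arange(3), target]        # l_n = -w_{y_n} * x[n, y_n]
manual = l_n.sum() / w_n.sum()                         # weighted mean

builtin = F.nll_loss(log_probs, target, weight=weight)
print(manual.item(), builtin.item())
```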

Sample code

nll_loss.py

import torch
from torch import nn

######################
#         here       #
######################
criteria = nn.NLLLoss()
m = nn.LogSoftmax(dim=1)

input = torch.randn(3, 5, requires_grad=True)
target = torch.tensor([1, 0, 4])
print(input, target)
# NLLLoss expects log-probabilities, so apply LogSoftmax first
loss = criteria(m(input), target)
print(loss)

Poisson Negative log likelihood Loss

Computes the negative log likelihood loss under the assumption that the target follows a Poisson distribution (apparently).

Definition

\begin{align}
y &\sim \mathrm{Poisson}(x) \\
\mathrm{loss}(x, y) &= x - y\log(x)+\log(y!)
\end{align}

As for the notation, $\sim$ means that the random variable on the left-hand side follows the probability distribution on the right-hand side. The last term, $\log(y!)$, can be omitted or approximated with Stirling's approximation (スターリングの公式 in Japanese), which states that

n! \sim \sqrt{2\pi n}(\frac{n}{e})^n

for large $n$.
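The formula can be compared against F.poisson_nll_loss as a sanity check. This sketch assumes log_input=False (the input is the rate itself) and full=False (the log(y!) term is dropped); with the default log_input=True the input is instead treated as the log of the rate and the loss becomes exp(x) - y*x. The rates and counts below are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
rate = torch.rand(5, 2) + 0.5                      # predicted Poisson rates, x > 0
target = torch.poisson(torch.full((5, 2), 2.0))    # observed counts y

eps = 1e-8  # small constant added inside the log to avoid log(0)
manual = (rate - target * torch.log(rate + eps)).mean()

builtin = F.poisson_nll_loss(rate, target, log_input=False, full=False, eps=eps)
print(manual.item(), builtin.item())
```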

Sample code

poisson_nll_loss.py

import torch
from torch import nn

######################
#         here       #
######################
criteria = nn.PoissonNLLLoss()
log_input = torch.randn(5, 2, requires_grad=True)  # log of the Poisson rate (log_input=True by default)

target = torch.randn(5, 2)
print(log_input, target)
loss = criteria(log_input, target)
print(loss)

Gaussian Negative log likelihood Loss

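This loss assumes the target follows a Gaussian distribution (also called the normal distribution, among other names...). The official PyTorch documentation covers it, so please take a look.
There was also a discussion of this loss function on StackExchange, so check that out too.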

Definition

L_{\mathrm{loss}} = \frac{1}{2}\left( \log (\max (\mathrm{var}, \mathrm{eps})) + \frac{(y-x)^2}{\max(\mathrm{var}, \mathrm{eps})} \right) + C_{\mathrm{const}}

Appendix

Wait, you can write down the Gaussian density, right..?

f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( - \frac{(x-\mu)^2}{2\sigma^2} \right)

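Here $\mu$ is the mean and $\sigma^2$ the variance. When $\mu=0$ and $\sigma^2=1$ it is called the standard normal distribution. No need to memorize that; it is just a well-known special case.

Taking the logarithm of both sides gives

\begin{align}
\log \left(f(x, \mu, \sigma)\right) &= \log\left(\frac{1}{\sqrt{2\pi \sigma^2}}\right) + \log\left( \exp \left( - \frac{(x-\mu)^2}{2\sigma^2} \right)\right)\\
&=\log\frac{1}{\sigma} + C - \frac{(x-\mu)^2}{2\sigma^2}\\
&=- \log \sigma - \frac{(x-\mu)^2}{2\sigma^2} + C\\
&=- \frac{1}{2}\left(\log \sigma^2 + \frac{(x-\mu)^2}{\sigma^2}\right) + C
\end{align}

where $C = -\frac{1}{2}\log(2\pi)$ is a constant.

Flipping the sign from minus to plus gives an expression of the same form as the loss function. The $x$ of the density corresponds to the target $y$ of the loss, $\mu$ to the input $x$, and $\sigma^2$ to $\mathrm{var}$, so the picture is that the predicted mean and variance are what get tuned during training. As for $\mathrm{eps}$, the variance sits in a denominator, so it is clamped from below to keep the computation from blowing up when the variance becomes too small; in other words, it is there for numerical stability.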
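The correspondence with the Gaussian log-density can be checked against F.gaussian_nll_loss. This sketch assumes full=False (the constant term is dropped) and the default mean reduction; the tensors are random placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 2)            # predicted mean (the "input" of the loss)
y = torch.randn(5, 2)            # target
var = torch.rand(5, 2) + 0.1     # predicted variance, must be positive

eps = 1e-6
v = var.clamp(min=eps)           # max(var, eps) for numerical stability
manual = (0.5 * (v.log() + (y - x) ** 2 / v)).mean()

builtin = F.gaussian_nll_loss(x, y, var, full=False, eps=eps)
print(manual.item(), builtin.item())
```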

Sample code

gaussian_nll_loss.py

import torch
from torch import nn

######################
#         here       #
######################
criteria = nn.GaussianNLLLoss()
input = torch.randn(5, 2, requires_grad=True)
target = torch.randn(5, 2)
var = torch.ones(5, 2, requires_grad=True)
print(input, target, var)
loss = criteria(input, target, var)
print(loss)

Cosine Embedding Loss

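I assumed this was cosine distance, then thought "wait, is it something else?", and in the end it was cosine distance after all. With "Embedding" so carefully included in the name, I wondered what was going on...
There is a page that explains this tsundere of a cosine similarity, so it is probably quicker to read that for the details.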

Definition

\begin{align}
\cos(\vec{a}, \vec{b}) &= \frac{\vec{a}\cdot\vec{b}}{|\vec{a}||\vec{b}|} \\
&= \frac{\vec{a}}{|\vec{a}|}\cdot\frac{\vec{b}}{|\vec{b}|} \\
&= \sum_{i=1}^{|V|}\frac{a_i}{|\vec{a}|}\frac{b_i}{|\vec{b}|}
\end{align}

Given this,

L = 
\begin{cases}
  1-\cos(x_1, x_2) & \mathrm{if} \ y=1 \\
  \max(0, \cos(x_1,x_2)-\mathrm{margin}) & \mathrm{if} \ y=-1
\end{cases}

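When $y=1$, i.e. the pair is labeled as similar, the loss is minimized as $\cos(x_1, x_2)$ approaches 1. When $y=-1$, i.e. the pair is labeled as dissimilar, and ignoring $\mathrm{margin}$ for the moment (margin = 0), the loss reaches its minimum of 0 anywhere in the region $\cos(x_1, x_2)\leq 0$.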
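The two branches can be written out by hand and compared with F.cosine_embedding_loss. This is a sketch with made-up embeddings and a hypothetical margin of 0.2; the built-in uses a slightly different epsilon internally, so the comparison is only up to a small tolerance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x1 = torch.randn(4, 8)
x2 = torch.randn(4, 8)
y = torch.tensor([1.0, -1.0, 1.0, -1.0])   # 1 = similar pair, -1 = dissimilar pair
margin = 0.2                                # hypothetical margin for illustration

cos = F.cosine_similarity(x1, x2, dim=1)

# y = 1:  1 - cos;   y = -1:  max(0, cos - margin);   then the default mean reduction.
manual = torch.where(y == 1, 1 - cos, (cos - margin).clamp(min=0)).mean()

builtin = F.cosine_embedding_loss(x1, x2, y, margin=margin)
print(manual.item(), builtin.item())
```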

Sample code

cosine_embedding_loss.py

import torch

######################
#         here       #
######################
criteria = torch.nn.CosineEmbeddingLoss()

x1 = torch.randn(3, 4)
x2 = torch.randn(3, 4)
y = torch.empty(3).bernoulli_().mul_(2).sub_(1)

loss = criteria(x1, x2, y).item()
print(loss)

The Connectionist Temporal Classification Loss

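Honestly, I had never heard of or seen this one. Sorry. All I can do is report what I found.
CTCLoss apparently computes a loss between a continuous (unsegmented) time series and a target sequence.

Definition

I could not find a formula for it...
Reference 1: An Intuitive Explanation of Connectionist Temporal Classification
Reference 2: PyTorch CTCLoss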

Sample code

ctc_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
# Target are to be padded
T = 50      # Input sequence length
C = 20      # Number of classes (including blank)
N = 16      # Batch size
S = 30      # Target sequence length of longest target in batch (padding length)
S_min = 10  # Minimum target length, for demonstration purposes

# Initialize random batch of input vectors, for *size = (T,N,C)
input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()

# Initialize random batch of targets (0 = blank, 1:C = classes)
target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)

input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)

print(loss)

Hinge Embedding Loss

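This corresponds to the "hinge function" that anyone who used SVM with sklearn when first learning machine learning will know. The graph looks roughly like ReLU flipped horizontally and shifted by $b$ along the $x$ axis.
For beginners, here is when you would use it: "SVM is a linear classifier, but if you draw the line too rigidly, aren't you classifying the borderline cases near the boundary too 'hard'? So how about tolerating a little error (the '$b$' of the hinge function) and evaluating them more 'softly'?" That is the situation this loss function is for.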

Definition

l_n = 
\begin{cases}
  x_n & \mathrm{if} \ y_n=1 \\
  \max(0, \Delta-x_n) & \mathrm{if} \ y_n=-1
\end{cases}

L(x, y) = 
\begin{cases}
  \mathrm{mean}(L) & \mathrm{if\ reduction} = \mathrm{'mean'} \\
  \mathrm{sum}(L) & \mathrm{if\ reduction} = \mathrm{'sum'}
\end{cases}

\mathrm{where}\ L = \{l_1, l_2,...,l_N\}^{\top}
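The piecewise definition is easy to reproduce by hand. In this sketch the inputs are made-up values (think of them as distances), the targets are 1 or -1, and the margin plays the role of $\Delta$.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.3, 1.5, -0.2, 0.8])    # e.g. distances between pairs
y = torch.tensor([1.0, -1.0, -1.0, 1.0])   # targets must be 1 or -1
margin = 1.0                                # the Delta of the definition

# y = 1: x itself;   y = -1: max(0, margin - x);   then reduction='mean'.
manual = torch.where(y == 1, x, (margin - x).clamp(min=0)).mean()

builtin = F.hinge_embedding_loss(x, y, margin=margin)
print(manual.item(), builtin.item())   # both 0.575 = (0.3 + 0 + 1.2 + 0.8) / 4
```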

Sample code

hinge_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.HingeEmbeddingLoss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).sign()  # targets must be 1 or -1
loss = criteria(input, target)
print(loss)

L1 loss

Nothing much to say here: it is the absolute error.

Definition

l(x, y) = \left\{ l_1,l_2,...,l_N \right\}^{\top}\\
l_n = |x_n-y_n|

Sample code

l1_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.L1Loss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
loss = criteria(input, target)
print(loss)

Smooth L1 Loss

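Also known as the smooth L1 distance, its shape is that of the L1 loss with the kink at 0 connected, literally, "smoothly". This makes the loss differentiable at 0 and keeps the gradient from staying large at points close to 0.
It is enough to think of it as an alternative to reach for when the L1 distance is inconvenient.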

Definition

l_n = 
\begin{cases}
  \frac{1}{2\beta} (x_n-y_n)^2 & \mathrm{if}\ |x_n-y_n|<\beta\\
  |x_n-y_n|-\frac{1}{2}\beta & \mathrm{otherwise} \\
\end{cases}

Sample code

smooth_l1_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.SmoothL1Loss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
loss = criteria(input, target)
print(loss)

Huber Loss

This generalizes Smooth L1 Loss with a threshold $\delta$. When $\delta=1$ it is the same formula as Smooth L1 Loss (with its default $\beta=1$). The idea is similar to a weighting.

Definition

l_n = 
\begin{cases}
  \frac{1}{2} (x_n-y_n)^2 & \mathrm{if}\ |x_n-y_n|<\delta\\
  \delta|x_n-y_n|-\frac{1}{2}\delta^2 & \mathrm{otherwise} \\
\end{cases}
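The link between the two losses can be checked numerically. As implemented in PyTorch, Huber loss with threshold delta equals delta times Smooth L1 loss with beta = delta, and the two coincide at the default value of 1. The sketch below uses random inputs and a hypothetical delta of 2.0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 5) * 3   # scaled up so that both branches are exercised
y = torch.randn(3, 5)
delta = 2.0                 # hypothetical threshold for illustration

huber = F.huber_loss(x, y, delta=delta)
smooth = F.smooth_l1_loss(x, y, beta=delta)
print(huber.item(), (delta * smooth).item())   # should match

# With the defaults (delta = beta = 1) the two losses are identical.
print(F.huber_loss(x, y).item(), F.smooth_l1_loss(x, y).item())
```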

Sample code

huberloss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = torch.nn.HuberLoss()  # instantiate the module before calling it

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
loss = criteria(input, target)
print(loss)

Mean Squared Loss

Nothing much to say here: it is the squared error averaged over all elements.

Definition

l_n = \frac{(x_n-y_n)^2}{\mathrm{Number\_of\_Elements}}

Sample code

mse_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.MSELoss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
loss = criteria(input, target)
print(loss)

Soft Margin Loss

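Apparently this computes the logistic loss for two-class classification (targets are 1 or -1). The $\mathrm{nelement}$ in the official documentation refers to x.nelement(), the number of elements in $x$. Above all, hardly anyone seems to use this loss... If you know a good reference, please tell me...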

Definition

L(x,y) = \sum_{i}\frac{\log(1+\exp(-y[i]*x[i]))}{x.\mathrm{nelement}()}
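Since usage examples are scarce, here is a minimal sketch that writes the logistic loss out by hand and compares it with F.soft_margin_loss. The scores and targets are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(6)                                     # raw scores
y = torch.tensor([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])    # targets are 1 or -1

# sum_i log(1 + exp(-y[i] * x[i])) divided by the number of elements.
manual = torch.log1p(torch.exp(-y * x)).sum() / x.nelement()

builtin = F.soft_margin_loss(x, y)
print(manual.item(), builtin.item())
```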

Sample code

soft_margin_loss.py

import torch

######################
#         here       #
######################
criteria = torch.nn.SoftMarginLoss()

x = torch.randn(3)
y = torch.empty(3).bernoulli_().mul_(2).sub_(1)

loss = criteria(x, y).item()
print(loss)

Multi Margin Loss

Apparently this is used for multi-class classification. (I did not know that...)

Definition

L(x,y) = \frac{\sum_{i \neq y}\max(0, \mathrm{margin}-x[y]+x[i])^p}{x.\mathrm{size}(0)}
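For a single sample the sum can be spelled out in plain Python. This sketch assumes the defaults p = 1 and margin = 1, skips the target class itself (i != y), and divides by the number of classes.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[0.1, 0.2, 0.4, 0.8]])   # one sample, four class scores
y = torch.tensor([3])                      # target class index
margin, p = 1.0, 1

row, t = x[0], y[0].item()
# max(0, margin - x[y] + x[i])^p for every class i other than the target.
terms = [max(0.0, margin - row[t].item() + row[i].item()) ** p
         for i in range(row.numel()) if i != t]
manual = sum(terms) / row.numel()          # (0.3 + 0.4 + 0.6) / 4 = 0.325

builtin = F.multi_margin_loss(x, y, p=p, margin=margin)
print(manual, builtin.item())
```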

Sample code

multi_margin_loss.py

import torch

######################
#         here       #
######################
criteria = torch.nn.MultiMarginLoss()

x = torch.tensor([[0.1, 0.2, 0.4, 0.8]])
y = torch.tensor([3])
loss = criteria(x, y)
print(loss)

Multilabel Margin Loss

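Apparently this extends the multi-class margin loss to the multi-label case, where each sample can have several target classes, given as a 2D array of class indices. (I did not know that...)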

Definition

L(x,y) = \frac{\sum_{ij}\max(0, 1-(x[y[j]]-x[i]))}{x.\mathrm{size}(0)}

Sample code

multilabel_margin_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.MultiLabelMarginLoss()

x = torch.FloatTensor([[0.1, 0.2, 0.4, 0.8]])
y = torch.LongTensor([[3, 0, -1, 1]])
loss = criteria(x, y)
print(loss)

Multilabel Soft Margin Loss

Apparently this computes an entropy-based loss between the model output and the target. (I did not know that...)

Definition

L(x,y) = -\frac{1}{C}*\sum_{i}\left[y[i]*\log\left((1+\exp(-x[i]))^{-1}\right)+(1-y[i])*\log\left(\frac{\exp(-x[i])}{1+\exp(-x[i])}\right)\right]

Sample code

I could not find a sample...
How on earth is this supposed to be used...
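For what it is worth, one plausible way to use it is sketched below, assuming raw scores of shape (N, C) and multi-hot targets of the same shape. The data are made up, and this is an illustration rather than an official sample. The hand-written version uses the identities log(1/(1+exp(-x))) = logsigmoid(x) and log(exp(-x)/(1+exp(-x))) = logsigmoid(-x).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 4)                        # raw scores: N=3 samples, C=4 labels
y = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 0.0]])     # multi-hot targets

loss = F.multilabel_soft_margin_loss(x, y)

# Average over the C labels of each sample, then over the batch.
manual = -(y * F.logsigmoid(x) + (1 - y) * F.logsigmoid(-x)).mean(dim=1).mean()

print(loss.item(), manual.item())
```

With these settings the value also matches F.binary_cross_entropy_with_logits(x, y), which suggests the loss is essentially a per-label binary cross entropy on raw scores.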

Margin Ranking Loss

Given two inputs, this loss assumes input 1 should be ranked higher when $y=1$ and input 2 should be ranked higher when $y=-1$. (I did not know that...)

Definition

L(x_1, x_2, y) = \max(0, -y*(x_1-x_2)+\mathrm{margin})
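The formula translates almost literally into tensor operations. The scores below and the margin of 0.3 are hypothetical values for illustration; the default mean reduction is assumed.

```python
import torch
import torch.nn.functional as F

x1 = torch.tensor([0.9, 0.2, 0.5])
x2 = torch.tensor([0.1, 0.7, 0.5])
y = torch.tensor([1.0, 1.0, -1.0])   # 1: x1 should rank higher, -1: x2 should rank higher
margin = 0.3                          # hypothetical margin for illustration

# max(0, -y * (x1 - x2) + margin), averaged over the batch.
manual = (-y * (x1 - x2) + margin).clamp(min=0).mean()

builtin = F.margin_ranking_loss(x1, x2, y, margin=margin)
print(manual.item(), builtin.item())
```

The first pair is already ranked correctly by more than the margin and contributes 0; the other two pairs contribute 0.8 and 0.3.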

Sample code

margin_ranking_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.MarginRankingLoss()
input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
loss = criteria(input1, input2, target)
print(loss)

Triplet Margin Loss

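As the prefix 'tri' suggests, this computes a loss from three input tensors. What is interesting about this setup is that the picture is not "a correct teacher and a student learning from it", but "a correct teacher, a wrong teacher, and a student learning from both".
As the formula below shows, training uses three inputs. The first, $a_{anchor}$, is the reference data point. Next, $p_{positive}$ is a data point in a positive relation to the reference, and finally $n_{negative}$ is a data point in a negative relation to it.
The article "Deep Metric Learning の定番⁈ Triplet Lossを徹底解説" by @tancoro explains this thoroughly, just as its title promises, so please give it a read.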

Definition

L(a_\mathrm{anchor}, p_\mathrm{positive}, n_\mathrm{negative}) = \max(d(a_i, p_i)-d(a_i, n_i)+\mathrm{margin},0)

d(x_i,y_i)=||x_i-y_i||_{p}

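The formula means that we optimize so that the distance between the anchor $a_{anchor}$ and $p_{positive}$, which we want classified into the same class, becomes small, while the distance to $n_{negative}$, which we do not want classified into the same class, becomes large.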
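The same computation can be sketched with F.pairwise_distance. The embeddings here are random placeholders, and p = 2 with margin = 1.0 are simply the values used for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
anchor = torch.randn(10, 16)
pos = torch.randn(10, 16)     # should end up close to the anchor
neg = torch.randn(10, 16)     # should end up far from the anchor
margin = 1.0

d_ap = F.pairwise_distance(anchor, pos, p=2)   # d(a_i, p_i)
d_an = F.pairwise_distance(anchor, neg, p=2)   # d(a_i, n_i)

# max(d(a, p) - d(a, n) + margin, 0), averaged over the batch.
manual = (d_ap - d_an + margin).clamp(min=0).mean()

builtin = F.triplet_margin_loss(anchor, pos, neg, margin=margin, p=2)
print(manual.item(), builtin.item())
```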

Sample code

triplet_margin_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.TripletMarginLoss(margin=1.0, p=2)

anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)
loss = criteria(anchor, positive, negative)
print(loss)

Triplet Margin With Distance Loss

Use this when you want to define a different distance function for the Anchor, Positive, and Negative relation of the Triplet Margin Loss above.

Definition

L(a_\mathrm{anchor}, p_\mathrm{positive}, n_\mathrm{negative}) = \max(d(a_i, p_i)-d(a_i, n_i)+\mathrm{margin},0)

d(x_i,y_i)= \mathrm{Decide\ by\ yourself!}
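As one example of "deciding by yourself", the sketch below plugs in a cosine distance. The function cosine_distance is a hypothetical helper written for this illustration, not part of PyTorch, and the margin of 0.5 is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_distance(u, v):
    # Hypothetical distance for illustration: 1 - cosine similarity per row.
    return 1.0 - F.cosine_similarity(u, v, dim=1)

torch.manual_seed(0)
anchor = torch.randn(10, 16)
pos = torch.randn(10, 16)
neg = torch.randn(10, 16)
margin = 0.5

criterion = nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=margin)
loss = criterion(anchor, pos, neg)

# The same formula written by hand with the custom distance.
manual = (cosine_distance(anchor, pos)
          - cosine_distance(anchor, neg) + margin).clamp(min=0).mean()
print(loss.item(), manual.item())
```

Any callable that maps two batches of tensors to a batch of distances should work in the same way.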

Sample code

triplet_margin_with_distance_loss.py

import torch
import torch.nn as nn

######################
#         here       #
######################
criteria = nn.TripletMarginWithDistanceLoss(distance_function=nn.PairwiseDistance())

embedding = nn.Embedding(1000, 128)

anchor_ids = torch.randint(0, 1000, (1,))
positive_ids = torch.randint(0, 1000, (1,))
negative_ids = torch.randint(0, 1000, (1,))

anchor = embedding(anchor_ids)
positive = embedding(positive_ids)
negative = embedding(negative_ids)

loss = criteria(anchor, positive, negative)
print(loss)
