
Deep Learning Specialization (Coursera) Self-Study Notes (C4W1)

Last updated at 2020-07-03, posted at 2020-06-28

Introduction

These are my notes on Course 4, Week 1 (C4W1) of the Deep Learning Specialization.

(C4W1L01) Computer Vision

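Contents

For a 64 $\times$ 64 image, the input has $64 \times 64 \times 3 = 12288$ values (counting the RGB channels)
For a 1000 $\times$ 1000 image, that grows to 3 million
Convolution is introduced to deal with this size problem
Convolution is (possibly) useful outside computer vision as well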
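
The input sizes above are just height × width × channels. A minimal sketch of that arithmetic (the helper name `input_size` is made up for illustration):

```python
def input_size(height, width, channels=3):
    """Number of input values for an image (channels=3 for RGB)."""
    return height * width * channels


print(input_size(64, 64))      # 12288
print(input_size(1000, 1000))  # 3000000
```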

(C4W1L02) Edge Detection Example

Contents

Computer vision models recognize things in the order edges → parts → whole objects
For vertical edge detection, use the following 3 $\times$ 3 filter

1 0 -1
1 0 -1
1 0 -1

Python ; conv_forward
TensorFlow ; tf.nn.conv2d
Keras ; Conv2D
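
To see what the vertical edge filter does, here is a minimal pure-Python sketch of the operation (no padding, stride 1). The function name `conv2d_valid` and the 6 $\times$ 6 test image (bright left half, dark right half) are assumptions for illustration, not the course's `conv_forward` implementation.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the
    element-wise products. The kernel is not flipped."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    out = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(f_h) for b in range(f_w)))
        out.append(row)
    return out


# Hypothetical 6x6 image: bright (10) left half, dark (0) right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical = [[1, 0, -1],
            [1, 0, -1],
            [1, 0, -1]]

result = conv2d_valid(image, vertical)
for row in result:
    print(row)  # every row is [0, 30, 30, 0]
```

The non-zero band in the middle of the 4 $\times$ 4 output marks where the bright-to-dark vertical edge sits.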

(C4W1L03) More edge detection

Contents

Vertical edge detection

1 0 -1
1 0 -1
1 0 -1

Horizontal edge detection

1 1 1
0 0 0
-1 -1 -1

Treat the filter elements as variables ($w_i$) and let the network learn a good filter
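
A small sketch comparing the two hand-designed filters on the same input. The test image (bright top half, dark bottom half) and the helper name `conv2d_valid` are assumptions for illustration. The horizontal filter responds to the edge, while the vertical filter gives all zeros because nothing changes from left to right.

```python
def conv2d_valid(image, kernel):
    """Unflipped sliding-window sum of products (no padding, stride 1)."""
    f = len(kernel)
    rows = len(image) - f + 1
    cols = len(image[0]) - f + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(cols)]
            for i in range(rows)]


vertical = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
horizontal = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

# Hypothetical 6x6 image: bright (10) top half, dark (0) bottom half.
image = [[10] * 6 for _ in range(3)] + [[0] * 6 for _ in range(3)]

h_out = conv2d_valid(image, horizontal)
v_out = conv2d_valid(image, vertical)
print(h_out)  # [[0, 0, 0, 0], [30, 30, 30, 30], [30, 30, 30, 30], [0, 0, 0, 0]]
print(v_out)  # all zeros
```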

(C4W1L04) Padding

Contents

If the input image is $n \times n$ and the filter is $f \times f$, the output image is $(n-f+1) \times (n-f+1)$
Drawbacks
- The image shrinks with every convolution
- A corner pixel is used only once (information is thrown away)
Padding
- Enlarge the input image by adding pixels around its border (usually zeros)
- With padding $p$, the output image is $(n+2p-f+1) \times (n+2p-f+1)$
Valid and Same convolution
- Valid ; $(n \times n)$ $\ast$ $(f \times f)$ $\rightarrow$ $(n-f+1) \times (n-f+1)$
- Same
  - Pad so that output size is the same as input size
  - $n + 2p - f + 1 = n$
  - $p = \frac{f-1}{2}$
$f$ is usually odd ($3 \times 3$ is the most common; $5 \times 5$ and $7 \times 7$ are also fine)
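
The size formulas above can be checked with a few lines of Python. This is a sketch only; the helper names `output_size`, `same_padding`, and `zero_pad` are made up for illustration.

```python
def output_size(n, f, p=0):
    """Output side length for an n x n input, f x f filter, padding p (stride 1)."""
    return n + 2 * p - f + 1


def same_padding(f):
    """Padding that keeps the output size equal to the input size."""
    if f % 2 == 0:
        raise ValueError("p = (f - 1) / 2 is only an integer for odd f")
    return (f - 1) // 2


def zero_pad(image, p):
    """Surround a 2D list with p rows and columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    padded += [[0] * p + list(row) + [0] * p for row in image]
    padded += [[0] * width for _ in range(p)]
    return padded


print(output_size(6, 3))                     # 4 (valid)
print(output_size(6, 3, p=same_padding(3)))  # 6 (same)
print([same_padding(f) for f in (3, 5, 7)])  # [1, 2, 3]
print(zero_pad([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```

An even $f$ would need asymmetric padding, which is one reason odd filter sizes are the norm.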

(C4W1L05) Strided convolution

Contents

The number of pixels the filter moves per step is called the stride
With input image ; $n \times n$, filter ; $f\times f$, padding ; $p$, stride ; $s$, the output image is $\lfloor\frac{n+2p-f}{s}+1\rfloor \times \lfloor\frac{n+2p-f}{s}+1\rfloor$ (round down when the division is not exact)
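
A quick sketch of the strided output-size formula, using Python's integer division for the floor. The function name and the example sizes are assumptions chosen for illustration.

```python
def strided_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s + 1) for an n x n input and f x f filter."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n + 2 * p - f) // s + 1


print(strided_output_size(7, 3, p=0, s=2))  # 3
print(strided_output_size(6, 3, p=0, s=2))  # 2 (3 / 2 is rounded down)
print(strided_output_size(6, 3, p=1, s=1))  # 6
```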

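Side note

In math textbooks, the convolution operation flips the filter (both vertically and horizontally) before computing. That makes associativity, $(A \ast B) \ast C = A \ast (B \ast C)$, hold
In deep learning, however, the filter is applied as is, without flipping (strictly speaking this is cross-correlation, but it is called convolution by convention)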
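
The difference between the two definitions can be sketched as follows. The function names and the test image are assumptions for illustration: `cross_correlate` is the unflipped operation used in deep learning, and `convolve` flips the filter first, as in the textbook definition.

```python
def flip(kernel):
    """Rotate the kernel by 180 degrees (flip vertically and horizontally)."""
    return [row[::-1] for row in kernel[::-1]]


def cross_correlate(image, kernel):
    """Unflipped sliding-window sum of products (no padding, stride 1)."""
    f = len(kernel)
    rows = len(image) - f + 1
    cols = len(image[0]) - f + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(cols)]
            for i in range(rows)]


def convolve(image, kernel):
    """Textbook convolution: flip the kernel, then slide it."""
    return cross_correlate(image, flip(kernel))


# Hypothetical 6x6 image: bright (10) left half, dark (0) right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

print(cross_correlate(image, vertical)[0])  # [0, 30, 30, 0]
print(convolve(image, vertical)[0])         # [0, -30, -30, 0]
```

For this filter the flip only changes the sign of the response. Since the filter values are learned anyway, skipping the flip costs nothing in practice.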

(C4W1L06) Convolutions Over Volume

Contents

Consider the convolution of an RGB image
- input ; 6 $\times$ 6 $\times$ 3 (3 ; # of channels)
- filter ; 3 $\times$ 3 $\times$ 3
- output ; 4 $\times$ 4
Multiple filters
- With 2 filters, for example, the output becomes 4 $\times$ 4 $\times$ 2
Summary
- input ; $n \times n \times n_c$
- filter ; $f \times f \times n_c$
- output ; $(n-f+1) \times (n-f+1) \times n_c^\prime$
  - $n_c^\prime$ ; # of filters
The number of channels is also called the "depth"
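
A minimal sketch of convolution over a volume with several filters, written with nested lists indexed as `[row][col][channel]`. The function name `conv_volume` and the all-ones test data are assumptions for illustration.

```python
def conv_volume(image, filters):
    """image: n x n x n_c nested list; filters: list of f x f x n_c nested lists.
    Returns an (n-f+1) x (n-f+1) x len(filters) nested list."""
    n = len(image)
    n_c = len(image[0][0])
    f = len(filters[0])
    out_n = n - f + 1
    output = []
    for i in range(out_n):
        out_row = []
        for j in range(out_n):
            # One number per filter: sum over height, width, and all channels.
            out_row.append([
                sum(image[i + a][j + b][c] * filt[a][b][c]
                    for a in range(f) for b in range(f) for c in range(n_c))
                for filt in filters
            ])
        output.append(out_row)
    return output


image = [[[1, 1, 1] for _ in range(6)] for _ in range(6)]      # 6 x 6 x 3
all_ones = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]   # 3 x 3 x 3
red_only = [[[1, 0, 0] for _ in range(3)] for _ in range(3)]   # looks at channel 0 only

out = conv_volume(image, [all_ones, red_only])
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)      # (4, 4, 2)
print(out[0][0])  # [27, 9]
```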

(C4W1L07) One layer of a convolutional net

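Contents

Example of a layer
If there are 10 filters of size $3 \times 3 \times 3$, how many parameters are there?
- $3 \times 3 \times 3 = 27$ weights per filter
- 1 bias parameter per filter
- 10 such filters
- $(27 + 1) \times 10 = 280$ (with 10 filters there are 280 parameters, no matter how large or small the input image is)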
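
The same count as a tiny function (the name `conv_layer_params` is made up for illustration). Note that the input image size does not appear among the arguments at all.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """(weights per filter + 1 bias) * number of filters."""
    return (f * f * n_c_prev + 1) * n_filters


print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))  # 280
```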
Summary of notation ; If layer $l$ is a convolutional layer,
- $f^{[l]}$ ; filter size
- $p^{[l]}$ ; padding
- $s^{[l]}$ ; stride
- $n_c^{[l]}$ ; # filters
- Input ; $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
- Output ; $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
- $n_H^{[l]} = \lfloor\frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1\rfloor$ (and likewise for $n_W^{[l]}$)
Each filter is,
- $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
Activation
- $a^{[l]} \rightarrow n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
- $A^{[l]} \rightarrow m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
Weights
- $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$
  - $n_c^{[l]}$ ; # filters
bias
- $(1, 1, 1, n_c^{[l]})$
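
The notation above can be collected into one helper that returns every shape for a layer. This is only a sketch: the function name, the dictionary keys, and the example values (a 6 $\times$ 6 $\times$ 3 input, ten 3 $\times$ 3 filters, batch size $m = 32$) are assumptions for illustration.

```python
def conv_layer_shapes(input_shape, f, p, s, n_filters, m=1):
    """Shapes for a conv layer l given the previous layer's activation shape."""
    n_h_prev, n_w_prev, n_c_prev = input_shape
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "filter": (f, f, n_c_prev),              # each filter
        "a": (n_h, n_w, n_filters),              # one example
        "A": (m, n_h, n_w, n_filters),           # batch of m examples
        "weights": (f, f, n_c_prev, n_filters),  # all filters stacked
        "bias": (1, 1, 1, n_filters),
    }


shapes = conv_layer_shapes((6, 6, 3), f=3, p=0, s=1, n_filters=10, m=32)
for name, shape in shapes.items():
    print(name, shape)
# filter (3, 3, 3)
# a (4, 4, 10)
# A (32, 4, 4, 10)
# weights (3, 3, 3, 10)
# bias (1, 1, 1, 10)
```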

(C4W1L08) A simple convolutional network example

Contents

Example of ConvNet
As the network gets deeper
- the image size gets smaller
- the number of channels gets larger
Types of layers in a convolutional network
- Convolution (CONV)
- Pooling (POOL)
- Fully connected (FC)
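
To illustrate the trend of shrinking image size and growing channel count, here is a sketch that traces activation shapes through three CONV layers. The input size and the per-layer hyperparameters are illustrative assumptions; these notes do not record the actual values used in the lecture.

```python
def conv_out(shape, f, s, n_filters, p=0):
    """Output shape (n_H, n_W, n_c) of a conv layer applied to `shape`."""
    n_h, n_w, _ = shape
    return ((n_h + 2 * p - f) // s + 1, (n_w + 2 * p - f) // s + 1, n_filters)


# Assumed example network: three CONV layers, no padding.
shape = (39, 39, 3)
layers = [
    {"f": 3, "s": 1, "n_filters": 10},
    {"f": 5, "s": 2, "n_filters": 20},
    {"f": 5, "s": 2, "n_filters": 40},
]

trace = [shape]
for layer in layers:
    shape = conv_out(shape, **layer)
    trace.append(shape)

print(trace)  # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]

# Flattening the last volume gives the input vector for an FC or softmax layer.
flattened = shape[0] * shape[1] * shape[2]
print(flattened)  # 1960
```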

(C4W1L09) Pooling Layers

Contents

Max Pooling ; take the maximum value within the filter window
Average Pooling ; take the average value within the filter window
Summary of pooling
- Hyper parameters
  - f ; filter size
  - s ; stride
  - Max or Average pooling
  - padding is usually not used
- No parameters to learn
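
A minimal pure-Python sketch of both pooling types on a single channel. The function name `pool2d` and the 4 $\times$ 4 example values are assumptions for illustration. There is nothing to train here: the only inputs are the hyperparameters `f`, `s`, and the pooling mode.

```python
def pool2d(image, f, s, mode="max"):
    """Max or average pooling over f x f windows with stride s (no padding)."""
    if mode == "max":
        agg = max
    elif mode == "average":
        agg = lambda window: sum(window) / len(window)
    else:
        raise ValueError("mode must be 'max' or 'average'")
    n_h, n_w = len(image), len(image[0])
    out = []
    for i in range(0, n_h - f + 1, s):
        row = []
        for j in range(0, n_w - f + 1, s):
            window = [image[i + a][j + b] for a in range(f) for b in range(f)]
            row.append(agg(window))
        out.append(row)
    return out


image = [[1, 3, 2, 1],
         [2, 9, 1, 1],
         [1, 3, 2, 3],
         [5, 6, 1, 2]]

print(pool2d(image, f=2, s=2, mode="max"))      # [[9, 2], [6, 3]]
print(pool2d(image, f=2, s=2, mode="average"))  # [[3.75, 1.25], [3.75, 2.0]]
```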

(C4W1L10) CNN Example

Contents

Explanation using LeNet-5
Input
Layer 1
- CONV1
- POOL1
Layer 2
- CONV2
- POOL2
FC3
FC4
softmax

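| Layer | Activation shape | Activation size | # parameters |
|---|---|---|---|
| Input | (32, 32, 3) | 3072 | 0 |
| CONV1 (f=5, s=1) | (28, 28, 8) | 6272 | 608 |
| POOL1 | (14, 14, 8) | 1568 | 0 |
| CONV2 (f=5, s=1) | (10, 10, 16) | 1600 | 3216 |
| POOL2 | (5, 5, 16) | 400 | 0 |
| FC3 | (120, 1) | 120 | 48120 |
| FC4 | (84, 1) | 84 | 10164 |
| softmax | (10, 1) | 10 | 850 |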
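
The activation shapes and sizes of the CONV and POOL layers can be reproduced with a short trace. This is a sketch under one assumption not stated in these notes: both pooling layers use f=2, s=2, which is consistent with the step from 28 to 14 in the table. The helper names are made up for illustration.

```python
def conv_shape(shape, f, s, n_filters):
    """Output shape of a conv layer without padding."""
    h, w, _ = shape
    return ((h - f) // s + 1, (w - f) // s + 1, n_filters)


def pool_shape(shape, f, s):
    """Output shape of a pooling layer (channel count is unchanged)."""
    h, w, c = shape
    return ((h - f) // s + 1, (w - f) // s + 1, c)


def size(shape):
    """Number of activation values in a volume."""
    h, w, c = shape
    return h * w * c


x = (32, 32, 3)
conv1 = conv_shape(x, f=5, s=1, n_filters=8)
pool1 = pool_shape(conv1, f=2, s=2)  # assumed pooling hyperparameters
conv2 = conv_shape(pool1, f=5, s=1, n_filters=16)
pool2 = pool_shape(conv2, f=2, s=2)  # assumed pooling hyperparameters

for name, shape in [("Input", x), ("CONV1", conv1), ("POOL1", pool1),
                    ("CONV2", conv2), ("POOL2", pool2)]:
    print(name, shape, size(shape))
# Input (32, 32, 3) 3072
# CONV1 (28, 28, 8) 6272
# POOL1 (14, 14, 8) 1568
# CONV2 (10, 10, 16) 1600
# POOL2 (5, 5, 16) 400
```

The 400 values coming out of POOL2 are what gets flattened and fed into FC3.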

(C4W1L11) Why convolutions?

Contents

Going from Input (32,32,3) to Output (28,28,6)
- With convolution, f=5 and 6 filters, the parameters number $(5 \times 5 \times 3 + 1) \times 6 = 456$
- With a fully connected layer, $3072 \times 4704 \approx 14M$
Parameter sharing ; A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image
Sparsity of connections ; In each layer, each output value depends only on a small number of inputs
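
A sketch of the comparison in numbers. The convolution count here follows the same rule as the 280-parameter example earlier, where each filter spans all input channels, so a 5 $\times$ 5 filter on an RGB input has $5 \times 5 \times 3$ weights plus one bias. The variable names are made up for illustration.

```python
in_shape = (32, 32, 3)
out_shape = (28, 28, 6)
f = 5

n_in = in_shape[0] * in_shape[1] * in_shape[2]      # 3072
n_out = out_shape[0] * out_shape[1] * out_shape[2]  # 4704

# Fully connected: one weight per (input, output) pair, biases ignored.
fc_weights = n_in * n_out

# Convolution: each of the 6 filters has f * f * 3 weights and 1 bias.
conv_params = (f * f * in_shape[2] + 1) * out_shape[2]

# Sparsity of connections: inputs feeding a single output value.
inputs_per_output = f * f * in_shape[2]

print(fc_weights)         # 14450688
print(conv_params)        # 456
print(inputs_per_output)  # 75 (out of 3072)
```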

Impressions

Every video in the second half of C4W1 runs about 10 minutes, so it was a bit of a slog

References

Deep Learning Specialization (Coursera) Self-Study Notes (Table of Contents)
