Preface
In this article, I would like to elaborate on the arithmetic behind CNNs.
I may make mistakes, so please feel free to leave a comment below.
Implementation of CNN in Python with NumPy
Visual Image of Networks
source: http://www.mdpi.com/1099-4300/19/6/242
Math in CNN
In this section, I will write about the mathematical concepts behind convolutional networks and their three key features. As you probably know, a network is composed of many neurons arranged in layers, and adjacent layers are connected to each other. First, the input data is propagated forward through the layers up to the output layer. Then, in order to strengthen its ability to represent the dataset, the network learns by tuning its parameters so as to reduce the error between the target and its prediction.
Hence, in this section we will look at the two faces of each layer: forward propagation and back-propagation.
Three Key Features in CNN
- Convolutional layer
- Activation layer
- Pooling layer (generally we apply max-pooling)
So let me describe them one by one.
1. Convolutional layer.
Let's consider the single-image case. For simplicity, I would like to define the image size and its convolved image size as above. The notation here is as follows:
$x$ : input
$a^{(k)}$ : convolved image (output of kernel $k$)
$k$ : index of kernel (weight filter)
$W$ : kernel (weight filter)
$b$ : bias
$E$ : cost function
forward prop
a^{(k)}_{ij} = \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} W^{(k)}_{st} x_{(i+s)(j+t)} + b^{(k)}
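To make the formula concrete, here is a minimal NumPy sketch of this forward pass. The function name `conv_forward` and the assumptions of stride 1 and no padding are mine, for illustration only:

```python
import numpy as np

def conv_forward(x, W, b):
    """Single-channel convolution: a[k, i, j] = sum_{s,t} W[k, s, t] * x[i+s, j+t] + b[k]."""
    M, N = x.shape        # input image size
    K, m, n = W.shape     # number of kernels and kernel size
    a = np.zeros((K, M - m + 1, N - n + 1))
    for k in range(K):
        for i in range(M - m + 1):
            for j in range(N - n + 1):
                # slide the k-th kernel over the input and add the bias
                a[k, i, j] = np.sum(W[k] * x[i:i + m, j:j + n]) + b[k]
    return a
```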
backprop to update the weights
\frac{\partial E}{\partial W^{(k)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial W^{(k)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} x_{(i+s)(j+t)}\\
\frac{\partial E}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}}
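As a sketch of these two gradient formulas, continuing the hypothetical setup above, where `delta` holds $\frac{\partial E}{\partial a^{(k)}_{ij}}$:

```python
import numpy as np

def conv_grads(x, delta, kernel_shape):
    """Gradients of the conv layer w.r.t. its weights and bias.

    delta[k, i, j] = dE/da[k, i, j]; kernel_shape = (m, n).
    """
    m, n = kernel_shape
    K, out_h, out_w = delta.shape
    dW = np.zeros((K, m, n))
    db = np.zeros(K)
    for k in range(K):
        for s in range(m):
            for t in range(n):
                # dE/dW[k, s, t] = sum_{i,j} delta[k, i, j] * x[i+s, j+t]
                dW[k, s, t] = np.sum(delta[k] * x[s:s + out_h, t:t + out_w])
        # dE/db[k] = sum_{i,j} delta[k, i, j]
        db[k] = np.sum(delta[k])
    return dW, db
```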
Bear in mind that the propagated error can be written as below.
\delta^{(k)}_{ij} = \frac{\partial E}{\partial a^{(k)}_{ij}}
backprop to previous layer
\frac{\partial E}{\partial x_{ij}} = \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} \frac{\partial a^{(k)}_{(i-s)(j-t)}}{\partial x_{ij}}
= \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} W^{(k)}_{st}
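And a sketch of this backprop to the previous layer for a single kernel $k$, where out-of-range terms of $\delta$ are simply dropped (names like `conv_backprop_input` are again mine):

```python
import numpy as np

def conv_backprop_input(delta_k, W_k, input_shape):
    """dE/dx[i, j] = sum_{s,t} delta_k[i-s, j-t] * W_k[s, t], skipping invalid indices."""
    M, N = input_shape
    m, n = W_k.shape
    out_h, out_w = delta_k.shape
    dx = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for s in range(m):
                for t in range(n):
                    # only output positions that actually used x[i, j] contribute
                    if 0 <= i - s < out_h and 0 <= j - t < out_w:
                        dx[i, j] += delta_k[i - s, j - t] * W_k[s, t]
    return dx
```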
2. Activation layer
When it comes to the selection of activation functions, we have several options, for example the sigmoid or the hyperbolic tangent. So in this section, let me first briefly show what each function looks like, and then move on to the propagations.
Activation Families
- sigmoid : $\sigma(x) = \frac{1}{1 + e^{-x}}$
- tanh : $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- ReLU :
ReLU(x) = \max(0, x)
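For reference, here is a quick NumPy sketch of these three functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)  # equivalent to (e^x - e^-x) / (e^x + e^-x)

def relu(x):
    return np.maximum(0, x)
```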
I will pick ReLU as the activation function in this article.
So let's check its forward prop and backprop.
forward prop
a_{ij} = \max(0, x_{ij})
backprop
\frac{\partial E}{\partial x_{ij}} = \frac{\partial E}{\partial a_{ij}} \frac{\partial a_{ij}}{\partial x_{ij}}
= \left\{
\begin{array}{ll}
\frac{\partial E}{\partial a_{ij}} & (x_{ij} > 0) \\
0 & (otherwise)
\end{array}
\right.
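A minimal NumPy sketch of the ReLU forward and backward passes; the backward pass simply masks the incoming error wherever the input was not positive:

```python
import numpy as np

def relu_forward(x):
    """a[i, j] = max(0, x[i, j])."""
    return np.maximum(0, x)

def relu_backward(delta, x):
    """Pass the error through only where the input was positive."""
    return delta * (x > 0)
```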
3. Pooling layer (Max Pooling)
Forward prop
a_{ij} = \max_{s, t} \; x_{(i+s)(j+t)}
where $s \in \{0, \dots, l-1\}$, $t \in \{0, \dots, l-1\}$, and $l$ is the filter size.
Backward prop
\frac{\partial E}{\partial x_{(i+s)(j+t)}} = \frac{\partial E}{\partial a_{ij}} \frac{\partial a_{ij}}{\partial x_{(i+s)(j+t)}}
= \left\{
\begin{array}{ll}
\frac{\partial E}{\partial a_{ij}} & (a_{ij} = x_{(i+s)(j+t)}) \\
0 & (otherwise)
\end{array}
\right.
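A minimal NumPy sketch of max pooling, assuming the common non-overlapping case where the stride equals the window size $l$; the backward pass routes each error value back to the position that held the maximum:

```python
import numpy as np

def maxpool_forward(x, l):
    """Non-overlapping l x l max pooling (stride assumed equal to the window size)."""
    M, N = x.shape
    out = np.zeros((M // l, N // l))
    for i in range(M // l):
        for j in range(N // l):
            out[i, j] = np.max(x[i * l:(i + 1) * l, j * l:(j + 1) * l])
    return out

def maxpool_backward(delta, x, l):
    """Send each error value to the input position that produced the maximum."""
    M, N = x.shape
    dx = np.zeros_like(x)
    for i in range(M // l):
        for j in range(N // l):
            window = x[i * l:(i + 1) * l, j * l:(j + 1) * l]
            s, t = np.unravel_index(np.argmax(window), window.shape)
            dx[i * l + s, j * l + t] = delta[i, j]
    return dx
```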
Convolutional Layer (Multi-Channel)
So far, we haven't considered multi-channel convolutional layers.
But now that we have covered the single-channel case, it's a good time to move on to more practical networks.
This is the conceptual image of multi-channel convolutional layers.
Forward-prop
a^{(k)}_{ij} = \sum_c a^{(k, c)}_{ij} = \sum_c \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} W^{(k, c)}_{st} x^{c}_{(i+s)(j+t)} + b^{(k)}
Likewise, the backward prop just needs the channel index $c$ (and the corresponding sums) added to its math.
** $c$ is the index of the channel.
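A minimal NumPy sketch of this multi-channel forward pass, where `x` is assumed to be shaped `(C, M, N)` and `W` is `(K, C, m, n)`:

```python
import numpy as np

def conv_forward_multi(x, W, b):
    """a[k, i, j] = sum_c sum_{s,t} W[k, c, s, t] * x[c, i+s, j+t] + b[k]."""
    C, M, N = x.shape
    K, _, m, n = W.shape
    a = np.zeros((K, M - m + 1, N - n + 1))
    for k in range(K):
        for i in range(M - m + 1):
            for j in range(N - n + 1):
                # the sum runs over all channels and all kernel positions at once
                a[k, i, j] = np.sum(W[k] * x[:, i:i + m, j:j + n]) + b[k]
    return a
```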
Updating the Parameters (W) and Backprop to the Previous Layer in Multi-Channel
The multi-channel convolutional layer has two aspects as well: one is updating its weights and the other is propagating the error to the previous layer.
updating parameters
\frac{\partial E}{\partial W^{(k, c)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial W^{(k, c)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} x^{c}_{(i+s)(j+t)}\\
\frac{\partial E}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}}
backprop to previous layer
\frac{\partial E}{\partial x^c_{ij}} = \sum_k \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} \frac{\partial a^{(k)}_{(i-s)(j-t)}}{\partial x^c_{ij}}
= \sum_k \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} W^{(k, c)}_{st}
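And a matching sketch of the multi-channel backward pass, computing the weight gradient, the bias gradient, and the error propagated back to each input channel (again with illustrative names and the same stride-1, no-padding assumption):

```python
import numpy as np

def conv_backward_multi(delta, x, W):
    """delta[k, i, j] = dE/da[k, i, j]; returns (dW, db, dx)."""
    C, M, N = x.shape
    K, _, m, n = W.shape
    out_h, out_w = delta.shape[1], delta.shape[2]
    dW = np.zeros_like(W)
    db = np.zeros(K)
    dx = np.zeros_like(x)
    for k in range(K):
        # dE/db[k] = sum_{i,j} delta[k, i, j]
        db[k] = np.sum(delta[k])
        for c in range(C):
            for s in range(m):
                for t in range(n):
                    # dE/dW[k, c, s, t] = sum_{i,j} delta[k, i, j] * x[c, i+s, j+t]
                    dW[k, c, s, t] = np.sum(delta[k] * x[c, s:s + out_h, t:t + out_w])
            for i in range(M):
                for j in range(N):
                    for s in range(m):
                        for t in range(n):
                            # dE/dx[c, i, j] = sum_k sum_{s,t} delta[k, i-s, j-t] * W[k, c, s, t]
                            if 0 <= i - s < out_h and 0 <= j - t < out_w:
                                dx[c, i, j] += delta[k, i - s, j - t] * W[k, c, s, t]
    return dW, db, dx
```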
Advanced Material
https://arxiv.org/pdf/1603.07285.pdf
https://github.com/vdumoulin/conv_arithmetic