Preface
In this article, I would like to elaborate on the arithmetic behind CNNs.
I may make mistakes, so please feel free to leave a comment below.
Implementation of CNN in Python with NumPy
Visual Image of Networks
source: http://www.mdpi.com/1099-4300/19/6/242
Math in CNN
In this section, I will write about the mathematical concepts behind convolutional networks and their three key features. As you probably know, a network is composed of many neurons arranged in layers, and adjacent layers are connected to each other. First, the input data is propagated forward through the layers up to the output layer. Then, in order to strengthen its ability to represent the dataset, the network learns by tuning its parameters so as to reduce the error between the target and its prediction.
Hence, in this section we will look at the two faces of each layer: forward propagation and back-propagation.
Three Key Features in CNN
- Convolutional layer
- Activation layer
- Pooling layer (generally we apply max-pooling)
So let me describe them one by one.
1. Convolutional layer.
Let's consider the single-image case. For simplicity, I would like to define the image size and its convolved image size as above. The notation here is as follows:
$x$ : input
$a^{(k)}$ : convolved image (output of kernel $k$)
$k$ : index of kernel (weight filter)
$W$ : kernel (weight filter)
$b$ : bias
$E$ : cost function
forward prop
a^{(k)}_{ij} = \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} W^{(k)}_{st} x_{(i+s)(j+t)} + b^{(k)}
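To make the formula concrete, here is a minimal NumPy sketch of this forward pass. The function name `conv_forward` and the assumptions of stride 1 and no padding are mine, for illustration only:

```python
import numpy as np

def conv_forward(x, W, b):
    """Single-channel convolution: a[k, i, j] = sum_{s,t} W[k, s, t] * x[i+s, j+t] + b[k]."""
    M, N = x.shape        # input image size
    K, m, n = W.shape     # number of kernels and kernel size
    a = np.zeros((K, M - m + 1, N - n + 1))
    for k in range(K):
        for i in range(M - m + 1):
            for j in range(N - n + 1):
                # slide the k-th kernel over the input and add the bias
                a[k, i, j] = np.sum(W[k] * x[i:i + m, j:j + n]) + b[k]
    return a
```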
backprop to update the weights
\frac{\partial E}{\partial W^{(k)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial W^{(k)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} x_{(i+s)(j+t)}\\
\frac{\partial E}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}}
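As a sketch of these two gradient formulas, continuing the hypothetical setup above, where `delta` holds $\frac{\partial E}{\partial a^{(k)}_{ij}}$:

```python
import numpy as np

def conv_grads(x, delta, kernel_shape):
    """Gradients of the conv layer w.r.t. its weights and bias.

    delta[k, i, j] = dE/da[k, i, j]; kernel_shape = (m, n).
    """
    m, n = kernel_shape
    K, out_h, out_w = delta.shape
    dW = np.zeros((K, m, n))
    db = np.zeros(K)
    for k in range(K):
        for s in range(m):
            for t in range(n):
                # dE/dW[k, s, t] = sum_{i,j} delta[k, i, j] * x[i+s, j+t]
                dW[k, s, t] = np.sum(delta[k] * x[s:s + out_h, t:t + out_w])
        # dE/db[k] = sum_{i,j} delta[k, i, j]
        db[k] = np.sum(delta[k])
    return dW, db
```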
Bear in mind that the propagated error can be written as below.
\delta^{(k)}_{ij} = \frac{\partial E}{\partial a^{(k)}_{ij}}
backprop to previous layer
\frac{\partial E}{\partial x_{ij}} = \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} \frac{\partial a^{(k)}_{(i-s)(j-t)}}{\partial x_{ij}}
= \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} W^{(k)}_{st}
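And a sketch of this backprop to the previous layer for a single kernel $k$, where out-of-range terms of $\delta$ are simply dropped (names like `conv_backprop_input` are again mine):

```python
import numpy as np

def conv_backprop_input(delta_k, W_k, input_shape):
    """dE/dx[i, j] = sum_{s,t} delta_k[i-s, j-t] * W_k[s, t], skipping invalid indices."""
    M, N = input_shape
    m, n = W_k.shape
    out_h, out_w = delta_k.shape
    dx = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for s in range(m):
                for t in range(n):
                    # only output positions that actually used x[i, j] contribute
                    if 0 <= i - s < out_h and 0 <= j - t < out_w:
                        dx[i, j] += delta_k[i - s, j - t] * W_k[s, t]
    return dx
```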
2. Activation layer
When it comes to the selection of activation functions, we have several options, for example the sigmoid or the hyperbolic tangent. So in this section, let me first briefly show what each function looks like, and then move on to the propagations.
Activation Families
- sigmoid : $\sigma(x) = \frac{1}{1 + e^{-x}}$
- tanh : $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- ReLU :
ReLU(x) = \max(0, x)
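For reference, here is a quick NumPy sketch of these three functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)  # equivalent to (e^x - e^-x) / (e^x + e^-x)

def relu(x):
    return np.maximum(0, x)
```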
I will pick ReLU as the activation function in this article.
So let's check its forward prop and backprop.
forward prop
a_{ij} = \max(0, x_{ij})
backprop
\frac{\partial E}{\partial x_{ij}} = \frac{\partial E}{\partial a_{ij}} \frac{\partial a_{ij}}{\partial x_{ij}}
= \left\{
\begin{array}{ll}
\frac{\partial E}{\partial a_{ij}} & (x_{ij} > 0) \\
0 & (otherwise)
\end{array}
\right.
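A minimal NumPy sketch of the ReLU forward and backward passes; the backward pass simply masks the incoming error wherever the input was not positive:

```python
import numpy as np

def relu_forward(x):
    """a[i, j] = max(0, x[i, j])."""
    return np.maximum(0, x)

def relu_backward(delta, x):
    """Pass the error through only where the input was positive."""
    return delta * (x > 0)
```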
3. Pooling layer (Max Pooling)
Forward prop
a_{ij} = \max_{s, t} \; x_{(i+s)(j+t)}
where $s \in \{0, \dots, l-1\}$, $t \in \{0, \dots, l-1\}$, and $l$ is the filter size.
Backward prop
\frac{\partial E}{\partial x_{(i+s)(j+t)}} = \frac{\partial E}{\partial a_{ij}} \frac{\partial a_{ij}}{\partial x_{(i+s)(j+t)}}
= \left\{
\begin{array}{ll}
\frac{\partial E}{\partial a_{ij}} & (a_{ij} = x_{(i+s)(j+t)}) \\
0 & (otherwise)
\end{array}
\right.
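A minimal NumPy sketch of max pooling, assuming the common non-overlapping case where the stride equals the window size $l$; the backward pass routes each error value back to the position that held the maximum:

```python
import numpy as np

def maxpool_forward(x, l):
    """Non-overlapping l x l max pooling (stride assumed equal to the window size)."""
    M, N = x.shape
    out = np.zeros((M // l, N // l))
    for i in range(M // l):
        for j in range(N // l):
            out[i, j] = np.max(x[i * l:(i + 1) * l, j * l:(j + 1) * l])
    return out

def maxpool_backward(delta, x, l):
    """Send each error value to the input position that produced the maximum."""
    M, N = x.shape
    dx = np.zeros_like(x)
    for i in range(M // l):
        for j in range(N // l):
            window = x[i * l:(i + 1) * l, j * l:(j + 1) * l]
            s, t = np.unravel_index(np.argmax(window), window.shape)
            dx[i * l + s, j * l + t] = delta[i, j]
    return dx
```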
Convolutional Layer (Multi-Channel)
So far, we haven't considered multi-channel convolutional layers.
But now that we have covered the single-channel case, it's a good time to move on to more practical networks.
This is the conceptual image of multi-channel convolutional layers.
Forward-prop
a^{(k)}_{ij} = \sum_c a^{(k, c)}_{ij} = \sum_c \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} W^{(k, c)}_{st} x^{c}_{(i+s)(j+t)} + b^{(k)}
Likewise, the backward prop just needs the channel index $c$ (and the corresponding sums) added to its math.
** $c$ is the index of the channel.
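A minimal NumPy sketch of this multi-channel forward pass, where `x` is assumed to be shaped `(C, M, N)` and `W` is `(K, C, m, n)`:

```python
import numpy as np

def conv_forward_multi(x, W, b):
    """a[k, i, j] = sum_c sum_{s,t} W[k, c, s, t] * x[c, i+s, j+t] + b[k]."""
    C, M, N = x.shape
    K, _, m, n = W.shape
    a = np.zeros((K, M - m + 1, N - n + 1))
    for k in range(K):
        for i in range(M - m + 1):
            for j in range(N - n + 1):
                # the sum runs over all channels and all kernel positions at once
                a[k, i, j] = np.sum(W[k] * x[:, i:i + m, j:j + n]) + b[k]
    return a
```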
Updating the Parameters (W) and Backprop to the Previous Layer in Multi-Channel
The multi-channel convolutional layer has two aspects as well: one is updating its weights and the other is propagating the error to the previous layer.
updating parameters
\frac{\partial E}{\partial W^{(k, c)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial W^{(k, c)}_{st}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} x^{c}_{(i+s)(j+t)}\\
\frac{\partial E}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}} \frac{\partial a^{(k)}_{ij}}{\partial b^{(k)}} = \sum^{M-m}_{i=0} \sum^{N-n}_{j=0} \frac{\partial E}{\partial a^{(k)}_{ij}}
backprop to previous layer
\frac{\partial E}{\partial x^c_{ij}} = \sum_k \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} \frac{\partial a^{(k)}_{(i-s)(j-t)}}{\partial x^c_{ij}}
= \sum_k \sum^{m-1}_{s=0} \sum^{n-1}_{t=0} \frac{\partial E}{\partial a^{(k)}_{(i-s)(j-t)}} W^{(k, c)}_{st}
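And a matching sketch of the multi-channel backward pass, computing the weight gradient, the bias gradient, and the error propagated back to each input channel (again with illustrative names and the same stride-1, no-padding assumption):

```python
import numpy as np

def conv_backward_multi(delta, x, W):
    """delta[k, i, j] = dE/da[k, i, j]; returns (dW, db, dx)."""
    C, M, N = x.shape
    K, _, m, n = W.shape
    out_h, out_w = delta.shape[1], delta.shape[2]
    dW = np.zeros_like(W)
    db = np.zeros(K)
    dx = np.zeros_like(x)
    for k in range(K):
        # dE/db[k] = sum_{i,j} delta[k, i, j]
        db[k] = np.sum(delta[k])
        for c in range(C):
            for s in range(m):
                for t in range(n):
                    # dE/dW[k, c, s, t] = sum_{i,j} delta[k, i, j] * x[c, i+s, j+t]
                    dW[k, c, s, t] = np.sum(delta[k] * x[c, s:s + out_h, t:t + out_w])
            for i in range(M):
                for j in range(N):
                    for s in range(m):
                        for t in range(n):
                            # dE/dx[c, i, j] = sum_k sum_{s,t} delta[k, i-s, j-t] * W[k, c, s, t]
                            if 0 <= i - s < out_h and 0 <= j - t < out_w:
                                dx[c, i, j] += delta[k, i - s, j - t] * W[k, c, s, t]
    return dW, db, dx
```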
Advanced Material
https://arxiv.org/pdf/1603.07285.pdf
https://github.com/vdumoulin/conv_arithmetic