##

Foreword

Autoencoders play a key role in deep neural network architectures for transfer learning and other tasks.

By analytically investigating the architecture of autoencoder, it leads us to certain general framework.

And in fact, learning the framework of autoencoder sheds the light on the understanding of deep architectures.

So in this article, I would like to review the research paper below.

Title: Autoencoders, Unsupervised Learning, and Deep Architectures

Author: Pierre Baldi

Publish Year: 2012

Link: http://proceedings.mlr.press/v27/baldi12a/baldi12a.pdf

##

My implementation

https://github.com/Rowing0914/autoencoders_keras

##

Introduction

Initially autoencoder was invented by the collaborative work of Hinton and PDP group(Rumelheart) in 1986.

Recently it has got huge attention again because of the new discovery of the powerful architecture, which is a variational encoder.

The aim of this paper is to derive a better theoretical understanding of autoencoders and get a good insight of the nature of deep architectures.

##

A general Autoencoders Framework

Let us define the settings related with the framework in advance and then move on to the explanation of the architecture of it.

name

notation

weight matrix: n->p

B

weight matrix: p->n

A

Input/Target

$X = ${$x_1, . . . , x_m$}

Target *optional

$Y = ${$y_1, . . . , y_m$}

Dissimilarity

$\Delta$

So the main purpose of autoencoder is, again, to obtain a meaningful representation of the dataset, so we want a model to have a strong reproducibility.

Based on the settings above, I would like to build the mathematical equations below.

```
min \space E(A,B) = min_{A,B} \space \sum^m_{t=1} E(x_t) = min_{A,B} \space \sum^m_{t=1} \Delta(A o B(x_t), x_t)
```

Case: non auto-associative

```
min \space E(A,B) = min_{A,B} \space \sum^m_{t=1} E(x_t, y_t) = min_{A,B} \space \sum^m_{t=1} \Delta(A o B(x_t), y_t)\\
```

##

Simply saying...

The basic autoencoder is describe as below.

- Input -> Hidden

$y = f(W^Tx + b)$ where f is some activation function e.g. sigmoid/ReLU/Tanh... - Hidden -> Output

$z = f(W'^Tz + b)$ where f is some activation function e.g. sigmoid/ReLU/Tanh... - Cost Function: Cross Entropy loss function

$L(x,z) = H(B_x || B_y) = -\sum^d_{k=1} y_k \log z_k$

And if we backpropagate the error from the output layer through the net, then it can learn how to reproduce the input.

So the important part is not the generated output, but the weight matrix. Because while it learns, it squashes the features into weight matrix in numerical format.

More interesting way to understand the architecture is below's image.

image source: http://curiousily.com/data-science/2017/02/02/what-to-do-when-data-is-missing-part-2.html

##

Problem in Autoencoder

So as we have seen above, the mathematically this ability of representing the dataset is proved. But there remains the problem, which is that it tends to learn **Identity Function**.

Hence, it is not robust anymore. I have prepared the explanatory image of this phenomena below.

Source: https://stats.stackexchange.com/questions/130809/autoencoders-cant-learn-meaningful-features

##

keras implementation

Following this great post

https://blog.keras.io/building-autoencoders-in-keras.html

Dataset: mnist images => Train:(60000, 784) Test:(10000, 784)

Architecture:

```
_________________________________________________________________
```

Layer (type) Output Shape Param #

=================================================================

input_1 (InputLayer) (None, 784) 0

_________________________________________________________________

encoding (Dense) (None, 32) 25120

_________________________________________________________________

decoding (Dense) (None, 784) 25872

=================================================================

Total params: 50,992

Trainable params: 50,992

Non-trainable params: 0

_________________________________________________________________

Code

```
from keras.layers import Input, Dense
```

from keras.models import Model
# building a model

encoding_dim = 32

input_img = Input(shape=(784, ))

encoded = Dense(encoding_dim, activation='relu', name='encoding')(input_img)

decoded = Dense(784, activation='sigmoid', name='decoding')(encoded)

autoencoder = Model(input_img, decoded)

encoder = Model(input_img, encoded)

encoded_input = Input(shape=(encoding_dim, ))

decoded_layer = autoencoder.layers[-1]

decoder = Model(encoded_input, decoded_layer(encoded_input))

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# summariseing the model architecture

autoencoder.summary()

# save the architecture in image

from keras.utils import plot_model

plot_model(autoencoder, to_file='model.png')

# get and prepare dataset

from keras.datasets import mnist

import numpy as np

(x_train, _), (x_test, _) = mnist.load_data()

x_train = x_train.astype('float32') / 255.

x_test = x_test.astype('float32') / 255.

x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))

x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

print(x_train.shape, x_test.shape)

# fit the model to the dataset

autoencoder.fit(x_train, x_train, epochs=2, batch_size=256, shuffle=True, validation_data=(x_test, x_test))

# saving the model in a format of serialize weights to HDF5

autoencoder.save_weights("model.h5")

print("Saved model to disk")

# presentation part: try reproducing the image to see how it worked

encoded_imgs = encoder.predict(x_test)

decoded_imgs = decoder.predict(encoded_imgs)

# showing the image

import matplotlib.pyplot as plt

n = 10

plt.figure(figsize=(20, 4))

for i in range(n):

ax = plt.subplot(2, n, i+1)

plt.imshow(x_test[i].reshape(28, 28))

plt.gray()

ax.get_xaxis().set_visible(False)

ax.get_yaxis().set_visible(False)

# display reconstruction

ax = plt.subplot(2, n, i + 1 + n)

plt.imshow(decoded_imgs[i].reshape(28, 28))

plt.gray()

ax.get_xaxis().set_visible(False)

ax.get_yaxis().set_visible(False)

plt.show()

##

Learning Result

I found that even in a few epochs, this model can achieve certain work.

Although this embeddings are really sparse matrix, the computational efficiency was retained on my machine.

##

Future Work

Since I have learned the pros and cons of autoencoders through the reviewing the paper, I would like to know how to solve the issue. In fact, there is already advanced research on this point and created the model for this as well.

So I will review that paper soon.