Neural Networks (Part 1)

This is an original article. Please cite the source if you reproduce it.

Outline

Deep learning has become very popular and has shown great promise in computer vision (CV) and natural language processing (NLP) tasks. As a student majoring in this area, I gave a series of presentations at my lab to explain some of its core concepts. I now summarize them in this specialization on my page, hoping to give graduate students a tutorial on how neural networks work.
There are four parts in this specialization, covering the following:

  1. Perceptron, Multilayer Perceptron, BackPropagation
  2. Convolutional Neural Networks
  3. Autoencoder
  4. Recurrent Neural Networks

History

Neural networks are a family of algorithms that try to mimic the human brain. There have been three waves in the development of neural networks.

  • Cybernetics, 1940s-1960s (biological learning, the perceptron)
  • Connectionism, 1980s-1990s (backpropagation)
  • Deep Learning, since 2006 (deep belief networks)

For more details, you can read Chapter 1 of [1] Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016.

Let’s see the real stuff!

Perceptron

The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be trained to solve binary classification problems.

An example of a binary classification problem

Structure

The structure of the perceptron is very simple. The input is an n-dimensional vector X, and a weight W[i] is assigned to each component X[i]. A bias unit is also needed: if you set X[0] equal to 1, then W[0] plays the role of the bias. This lets us use a single matrix operation to compute the sum of all X[i] * W[i]. Finally, the prediction is obtained by passing this sum through an activation function (sign, sigmoid, tanh, ReLU, ...).

The structure of a perceptron
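To make the structure concrete, here is a minimal NumPy sketch of this forward computation (the function name and the use of `np.sign` are illustrative choices, not part of the original demo):

```python
import numpy as np

def perceptron_predict(x, w, g=np.sign):
    """Perceptron forward pass.

    x : input vector without the bias component
    w : weight vector whose first entry w[0] is the bias
    g : activation function (sign, sigmoid, tanh, ReLU, ...)
    """
    x = np.concatenate(([1.0], x))  # prepend x0 = 1 so that w[0] acts as the bias
    return g(np.dot(x, w))          # g(X W^T)
```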

Example: AND, OR

Now, let's look at two examples: the AND and OR operations. They will help you understand how the perceptron works.
Suppose $x_1,x_2\in\{0,1\}$ and $g(\cdot)=\mathrm{sign}(\cdot)$. Then the AND operation can be represented by a perceptron as follows:

AND Operation

| $x_1$ | $x_2$ | $XW^T$ | $\hat{y}$ |
| --- | --- | --- | --- |
| 0 | 0 | -30 | 0 |
| 0 | 1 | -10 | 0 |
| 1 | 0 | -10 | 0 |
| 1 | 1 | 10 | 1 |

Similarly, the OR operation can be represented as follows:

OR Operation

| $x_1$ | $x_2$ | $XW^T$ | $\hat{y}$ |
| --- | --- | --- | --- |
| 0 | 0 | -10 | 0 |
| 0 | 1 | 10 | 1 |
| 1 | 0 | 10 | 1 |
| 1 | 1 | 30 | 1 |
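Using the weights implied by the two tables (bias first), we can check both truth tables with the illustrative `perceptron_predict` sketch from the Structure section:

```python
import numpy as np

w_and = np.array([-30.0, 20.0, 20.0])  # XW^T: -30, -10, -10, 10
w_or = np.array([-10.0, 20.0, 20.0])   # XW^T: -10, 10, 10, 30

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        y_and = perceptron_predict(x, w_and) > 0  # AND column of the table
        y_or = perceptron_predict(x, w_or) > 0    # OR column of the table
        print(x1, x2, int(y_and), int(y_or))
```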

Cost Function

The cost function returns a number that measures how well the neural network maps the training examples to the correct outputs. An easy choice is the fraction of points that are predicted incorrectly:

\[J(W)=\frac{1}{N}\sum_{i=1}^N \mathbf{1}(y_i \neq \hat{y}_i)\]

Unfortunately, this function is not continuous, so it cannot be optimized with gradient-based methods. Instead, we use the following function:

\[J(W)=\frac{1}{N}\sum_{i=1}^N\max(0,-y_i(x_iW^T))\]

When the prediction is correct, $y_i(x_iW^T)>0$, so $\max(0,-y_i(x_iW^T))=0$; when the prediction is incorrect, $y_i(x_iW^T)<0$, so $\max(0,-y_i(x_iW^T))=-y_i(x_iW^T)$. This function is continuous, and only the misclassified points contribute to it. We can therefore rewrite the function as follows:

\[J(W)=\frac{1}{N}\sum_{i\in \text{WrongSamples}}(-y_i(x_iW^T))\]
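As a quick illustration, here is how this cost could be computed with NumPy, assuming labels $y_i \in \{-1,+1\}$ (as the $-y_i(x_iW^T)$ form requires) and an input matrix that already contains the bias column:

```python
import numpy as np

def perceptron_cost(X, y, w):
    """Average perceptron loss.

    X : (N, d+1) matrix whose first column is all ones (the bias input x0)
    y : (N,) labels in {-1, +1}
    w : (d+1,) weight vector
    """
    margins = y * (X @ w)                      # y_i * (x_i W^T), positive when correct
    return np.mean(np.maximum(0.0, -margins))  # only misclassified points contribute
```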

How is the perceptron trained?

The gradient descent algorithm is used to update W. For each misclassified example i, repeat:

\[\frac{\partial J_i(W)}{\partial W_j}=-y_ix_{i,j}\]
\[W_j=W_j+\eta y_ix_{i,j}\]
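Putting the update rule into a loop gives a simple training procedure; this is only a sketch under the same $\{-1,+1\}$ label convention, not the original code:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Gradient descent on the perceptron loss.

    X : (N, d+1) inputs with a leading bias column of ones
    y : (N,) labels in {-1, +1}
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(x_i, w) <= 0:  # misclassified example
                w += eta * y_i * x_i       # W_j = W_j + eta * y_i * x_{i,j}
    return w
```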

Multilayer Perceptron

The perceptron only works for linearly separable data. In more complex cases, e.g. XOR or XNOR, the perceptron fails to separate the classes correctly.

XOR Operation

Structure

The structure of a multilayer perceptron is similar to that of the perceptron. The only difference is that there are one or more hidden layers between the input layer and the output layer. We use the following definitions:

  • $a_i^{(j)} =$ “activation” of unit i in layer j
  • $w^{(j)} =$ matrix of weights controlling the function mapping from layer j to layer j+1. If the network has $S_j$ units in layer j and $S_{j+1}$ units in layer j+1, then $w^{(j)}$ will be of dimension $S_{j+1} \times (S_{j}+1)$.

Suppose we have a multilayer perceptron as follows:

Multilayer Perceptron

Then we get the following relations:
\(a_1^{(2)} = g(w_{10}^{(1)}x_0+w_{11}^{(1)}x_1+w_{12}^{(1)}x_2+w_{13}^{(1)}x_3)\)
\(a_2^{(2)} = g(w_{20}^{(1)}x_0+w_{21}^{(1)}x_1+w_{22}^{(1)}x_2+w_{23}^{(1)}x_3)\)
\(a_3^{(2)} = g(w_{30}^{(1)}x_0+w_{31}^{(1)}x_1+w_{32}^{(1)}x_2+w_{33}^{(1)}x_3)\)
\(\hat{y} = a_1^{(3)} = g(w_{10}^{(2)}a_0^{(2)}+w_{11}^{(2)}a_1^{(2)}+w_{12}^{(2)}a_2^{(2)}+w_{13}^{(2)}a_3^{(2)})\)
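For the network in the figure (three inputs plus a bias, one hidden layer of three units, one output), these relations reduce to two matrix-vector products. A minimal sketch, with weight shapes following the $S_{j+1} \times (S_j+1)$ convention above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w1, w2, g=sigmoid):
    """x  : (3,) input vector
       w1 : (3, 4) weights mapping layer 1 to layer 2 (extra column for the bias)
       w2 : (1, 4) weights mapping layer 2 to layer 3
    """
    a1 = np.concatenate(([1.0], x))   # add x_0 = 1
    a2 = g(w1 @ a1)                   # a^{(2)} = g(w^{(1)} a^{(1)})
    a2 = np.concatenate(([1.0], a2))  # add a_0^{(2)} = 1
    return g(w2 @ a2)                 # hat{y} = a^{(3)}
```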

Example: XNOR

Suppose $x_1,x_2\in\{0,1\}$ and $y = x_1\ \mathrm{XNOR}\ x_2 = (x_1\ \mathrm{AND}\ x_2)\ \mathrm{OR}\ ((\mathrm{NOT}\ x_1)\ \mathrm{AND}\ (\mathrm{NOT}\ x_2))$. Then the XNOR operation can be represented with a multilayer perceptron as follows:

XNOR Operation

\[w^{(1)} = \begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}, \qquad w^{(2)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}\]

| $x_1$ | $x_2$ | $a_1^{(2)}$ | $a_2^{(2)}$ | $\hat{y}$ |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 |
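We can verify the table directly by wiring the two weight matrices together; a small sketch using a hard threshold in place of $g$:

```python
import numpy as np

w1 = np.array([[-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
               [10.0, -20.0, -20.0]])  # hidden unit 2: (NOT x1) AND (NOT x2)
w2 = np.array([-10.0, 20.0, 20.0])     # output unit: a1 OR a2

def step(z):
    return (z > 0).astype(float)       # hard threshold standing in for g

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = np.concatenate(([1.0], [x1, x2]))       # a^{(1)} with bias
        a2 = np.concatenate(([1.0], step(w1 @ a1)))  # a^{(2)} with bias
        print(x1, x2, step(w2 @ a2))   # prints 1 exactly when x1 == x2
```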

BackPropagation Algorithm

For a single training example (x,y), we define the cost function to be:

\[J(w,b;x,y)=\frac{1}{2}\sum_{j}(\hat{y}_j - y_j)^2\]
  • $w^{(l)} =$ matrix of weights controlling the function mapping from layer l to layer l+1.
  • $a^{(l)} = g(z^{(l)})$ (with the bias unit $a_0^{(l)}$ added), the "activation" of layer l.
  • $z^{(l)} = w^{(l-1)}a^{(l-1)}$

Step 1: Forward Propagation

In the first step, we initialize all $w^{(l)}$ and then compute all $a^{(l)}$ and $z^{(l)}$ using the initialized weights.

\(a^{(1)} = x\)
\(z^{(2)} = w^{(1)}a^{(1)}\)
\(a^{(2)} = g(z^{(2)})\) (add \(a_0^{(2)}\))
\(z^{(3)} = w^{(2)}a^{(2)}\)
\(a^{(3)} = g(z^{(3)})\) (add \(a_0^{(3)}\))
\(z^{(4)} = w^{(3)}a^{(3)}\)
\(a^{(4)} = \hat{y} = g(z^{(4)})\)
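A minimal NumPy sketch of these steps for a 4-layer network (the function names and a sigmoid activation are illustrative assumptions; weight shapes follow the $S_{l+1} \times (S_l+1)$ convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, g=sigmoid):
    """weights is a list [w1, w2, w3], one matrix per layer transition.
    Returns the activations a^{(l)} and pre-activations z^{(l)} of every layer."""
    a = [np.concatenate(([1.0], x))]  # a^{(1)} = x, with the bias unit prepended
    zs = []
    for i, w in enumerate(weights):
        z = w @ a[-1]                 # z^{(l+1)} = w^{(l)} a^{(l)}
        zs.append(z)
        act = g(z)
        if i < len(weights) - 1:      # add a_0^{(l+1)} everywhere except the output
            act = np.concatenate(([1.0], act))
        a.append(act)
    return a, zs                      # a[-1] is hat{y} = a^{(4)}
```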

Step 2: BackPropagation Algorithm

In the second step, the main idea is the gradient descent algorithm: the gradient of every parameter is calculated, and the parameters are updated as follows:

\(w_{ij}^{(l)} = w_{ij}^{(l)} - \alpha \frac{\partial J(w,b)}{\partial w_{ij}^{(l)}}\)
\(b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial J(w,b)}{\partial b_i^{(l)}}\)

The question then becomes: how do we calculate each gradient?
For each output unit (e.g., layer 4 in the picture):

\[\delta^{(4)} = (a^{(4)} - y)\odot g'(z^{(4)})\]

For each inner unit (e.g., layer 3 in the picture):

\[\delta^{(3)} = ({w^{(3)}})^T\delta^{(4)}\odot g'(z^{(3)})\]

$\delta$ is also known as the error term. For each gradient, it is more convenient to calculate $\delta$ first and then use

\[\frac{\partial J(w,b)}{\partial w_{ij}^{(l)}} = a_j^{(l)}\delta_i^{(l+1)}\]

So we can summarize the whole process as follows:
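Building on the `forward` sketch above, one backpropagation/update step might look like the following (squared-error cost and sigmoid activation assumed; this is an illustrative sketch, not the TensorFlow demo):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_step(x, y, weights, alpha=0.1):
    """One gradient-descent step on a single training example (x, y)."""
    a, zs = forward(x, weights)
    # Output layer: delta^{(4)} = (a^{(4)} - y) * g'(z^{(4)})
    delta = (a[-1] - y) * sigmoid_grad(zs[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad = np.outer(delta, a[l])  # dJ/dw^{(l)} = delta^{(l+1)} (a^{(l)})^T
        if l > 0:
            # Inner layer: delta^{(l)} = (w^{(l)T} delta^{(l+1)}) * g'(z^{(l)}),
            # dropping the bias row before the elementwise product
            delta = (weights[l].T @ delta)[1:] * sigmoid_grad(zs[l - 1])
        weights[l] = weights[l] - alpha * grad  # w^{(l)} update
    return weights
```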

There is a demo implementing a multilayer perceptron in TensorFlow. Below is a visualization of a multilayer perceptron; you can run it here.

A demo video for Perceptron