Neural Networks (Part 1)
This is an original article. Please cite the source if you reproduce it.
Outline
Deep learning has become very popular and has shown great promise in CV and NLP tasks. As a student majoring in this area, I gave a series of presentations at my lab to explain some of the core concepts. I am now summarizing them in this specialization on my page, hoping to give graduate students a tutorial on how neural networks work.
There are four parts in this specialization, which cover the following:
- Perceptron, Multilayer Perceptron, BackPropagation
- Convolutional Neural Networks
- Autoencoder
- Recurrent Neural Networks
History
Neural networks are a family of algorithms that try to mimic the human brain. There have been three waves in the development of neural networks:
- Cybernetics, 1940s-1960s (biological learning, the Perceptron)
- Connectionism, 1980s-1990s (BackPropagation)
- Deep Learning, since 2006 (Deep Belief Networks)
For more details, you can read chapter 1 of [1] Deep Learning, Yoshua Bengio, Ian Goodfellow, Aaron Courville, MIT Press (in preparation).
Let’s see the real stuff!
Perceptron
The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be trained to solve binary classification problems.
Structure
The structure of the perceptron is very simple. The input is an n-dimensional vector X, and a weight W[i] is assigned to each component X[i]. A bias unit is also necessary: if you set X[0] equal to 1, then W[0] represents the bias. The weighted sum of all X[i] * W[i] can then be computed with a single matrix operation, and the prediction is obtained by passing this sum through an activation function (e.g. sign, sigmoid, tanh, ReLU).
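To make this concrete, here is a minimal sketch of the forward computation in Python with NumPy (my own illustration, not part of the original demo); the function name `perceptron_predict` and the 0/1 step activation are assumptions made for the example:

```python
import numpy as np

def perceptron_predict(X, W):
    """Forward pass of a perceptron.

    X: array of shape (n_samples, n_features); a constant 1 is prepended
       to each sample so that W[0] acts as the bias.
    W: weight vector of shape (n_features + 1,).
    Returns predicted labels in {0, 1}.
    """
    X = np.atleast_2d(X)
    ones = np.ones((X.shape[0], 1))
    X_aug = np.hstack([ones, X])      # X[0] = 1 for the bias unit
    z = X_aug @ W                     # sum of X[i] * W[i] as a matrix product
    return (z > 0).astype(int)        # step activation g(z)
```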
Example: AND, OR
Now let's look at two examples, the AND and OR operations. This part will help you understand how the perceptron works.
Suppose $x_1,x_2\in\{0,1\}$ and $g(\cdot)=sign(\cdot)$. Then, with $x_0=1$ and weights $W=[-30,20,20]$, an AND operation can be represented by a perceptron as follows:
$x_1$ | $x_2$ | $XW^T$ | $ \hat{y} $ |
---|---|---|---|
0 | 0 | -30 | 0 |
0 | 1 | -10 | 0 |
1 | 0 | -10 | 0 |
1 | 1 | 10 | 1 |
Then, with weights $W=[-10,20,20]$, an OR operation can be represented as follows:
$x_1$ | $x_2$ | $XW^T$ | $\hat{y}$ |
---|---|---|---|
0 | 0 | -10 | 0 |
0 | 1 | 10 | 1 |
1 | 0 | 10 | 1 |
1 | 1 | 30 | 1 |
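The two tables above can be reproduced with the hypothetical `perceptron_predict` function sketched earlier, using the weight vectors implied by the $XW^T$ column (my assumption: $W_{AND}=[-30,20,20]$ and $W_{OR}=[-10,20,20]$):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

W_and = np.array([-30, 20, 20])       # reproduces the AND table above
W_or  = np.array([-10, 20, 20])       # reproduces the OR table above

print(perceptron_predict(X, W_and))   # [0 0 0 1]
print(perceptron_predict(X, W_or))    # [0 1 1 1]
```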
Cost Function
The cost function returns a number that measures how well the neural network maps the training examples to their correct outputs. An easy choice is the fraction of points that are predicted incorrectly:
\[J(W)=\frac{1}{N}\sum_{i=1}^N \mathbb{1}(y_i \neq \hat{y}_i)\]
Unfortunately, this function is not continuous, so it cannot be optimized with gradient-based methods. Instead, we use the following function:
\[J(W)=\frac{1}{N}\sum_{i=1}^N\max(0,-y_i(x_iW^T))\]
When the prediction is correct (here $y_i\in\{-1,+1\}$ and $x_i$ denotes the $i$-th example with $x_{i,0}=1$), $y_i(x_iW^T)>0$, so $\max(0,-y_i(x_iW^T))=0$; when the prediction is incorrect, $y_i(x_iW^T)<0$, so $\max(0,-y_i(x_iW^T))=-y_i(x_iW^T)$. This function is continuous, and only the misclassified points contribute to it. We can rewrite the function as follows:
\[J(W)=\frac{1}{N}\sum_{i\in \text{WrongSamples}}(-y_i(x_iW^T))\]
How is the perceptron trained
The gradient descent algorithm is used to update W. For each misclassified example i, repeat:
\[\frac{\partial J\_i(W)}{\partial W\_j}=-y\_ix\_{i,j}\]
\[W\_j=W\_j+\eta y\_ix\_{i,j}\]
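Putting the loss and the update rule together, here is a minimal training-loop sketch in Python/NumPy. It assumes labels $y_i\in\{-1,+1\}$, a fixed learning rate $\eta$ (`eta`), and the hypothetical helper name `train_perceptron`; it is an illustration, not the original demo.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule: for every misclassified sample,
    W <- W + eta * y_i * x_i   (x_i already includes the bias input 1)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input
    W = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        updated = False
        for x_i, y_i in zip(X_aug, y):
            if y_i * (x_i @ W) <= 0:        # misclassified (or on the boundary)
                W += eta * y_i * x_i        # gradient step on max(0, -y_i x_i W^T)
                updated = True
        if not updated:                     # converged: no mistakes this epoch
            break
    return W

# Example: learn the AND function with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
W = train_perceptron(X, y)
```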
Multilayer Perceptron
The perceptron only works for linearly separable data. In more complex cases, e.g. XOR or XNOR, the perceptron fails to separate the classes correctly.
Structure
The structure of the multilayer perceptron is similar to that of the perceptron. The only difference is that a multilayer perceptron has one or more hidden layers between the input layer and the output layer. We use the following definitions:
- $a_i^{(j)} =$ “activation” of unit i in layer j
- $w^{(j)} =$ matrix of weights controlling the function mapping from layer j to layer j+1. If the network has $S_j$ units in layer j and $S_{j+1}$ units in layer j+1, then $w^{(j)}$ has dimension $S_{j+1} \times (S_{j}+1)$.
Suppose we have a multilayer perceptron with three input units $x_1,x_2,x_3$ (plus a bias unit $x_0$), one hidden layer of three units, and a single output unit. Then we get the following relations:
\(a\_1^{(2)} = g(w\_{10}^{(1)}x\_0+w\_{11}^{(1)}x\_1+w\_{12}^{(1)}x\_2+w\_{13}^{(1)}x\_3)\)
\(a\_2^{(2)} = g(w\_{20}^{(1)}x\_0+w\_{21}^{(1)}x\_1+w\_{22}^{(1)}x\_2+w\_{23}^{(1)}x\_3)\)
\(a\_3^{(2)} = g(w\_{30}^{(1)}x\_0+w\_{31}^{(1)}x\_1+w\_{32}^{(1)}x\_2+w\_{33}^{(1)}x\_3)\)
\(\hat{y} = a\_1^{(3)} = g(w\_{10}^{(2)}a\_0^{(2)}+w\_{11}^{(2)}a\_1^{(2)}+w\_{12}^{(2)}a\_2^{(2)}+w\_{13}^{(2)}a\_3^{(2)})\)
Example: XNOR
Suppose $x_1,x_2\in\{0,1\}$ and $y = x_1\,XNOR\,x_2 = (x_1\,AND\,x_2)\,OR\,((NOT\,x_1)\,AND\,(NOT\,x_2))$. Then an XNOR operation can be represented using a multilayer perceptron as follows, where row i of $w^{(1)}$ contains the weights of hidden unit $a_i^{(2)}$:
$w_{i0}^{(1)}$ | $w_{i1}^{(1)}$ | $w_{i2}^{(1)}$ |
---|---|---|
-30 | 20 | 20 |
10 | -20 | -20 |

$w_{10}^{(2)}$ | $w_{11}^{(2)}$ | $w_{12}^{(2)}$ |
---|---|---|
-10 | 20 | 20 |

$x_1$ | $x_2$ | $a_1^{(2)}$ | $a_2^{(2)}$ | $\hat{y}$ |
---|---|---|---|---|
0 | 0 | 0 | 1 | 1 |
0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 1 |
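The weight tables can be checked with a short script. This is only a sketch: it substitutes a 0/1 step function for $g(\cdot)$ and hard-codes $w^{(1)}$ and $w^{(2)}$ as NumPy arrays:

```python
import numpy as np

g = lambda z: (z > 0).astype(int)    # step activation standing in for g(.)

w1 = np.array([[-30,  20,  20],      # row 1: a_1^(2), an AND unit
               [ 10, -20, -20]])     # row 2: a_2^(2), a (NOT x1) AND (NOT x2) unit
w2 = np.array([-10, 20, 20])         # output unit: OR of a_1^(2) and a_2^(2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1, x1, x2])        # prepend bias x_0 = 1
    a2 = g(w1 @ x)                   # hidden layer activations
    a2 = np.concatenate(([1], a2))   # prepend bias a_0^(2) = 1
    y_hat = int((w2 @ a2) > 0)       # output: x1 XNOR x2
    print(x1, x2, y_hat)             # prints 1, 0, 0, 1
```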
BackPropagation Algorithm
For a single training example (x,y), we define the cost function to be:
\[J(w,b;x,y)=\frac{1}{2}\sum_{j}(\hat{y\_j} - y\_j)^2\]
where the sum runs over the output units, and we use the following notation:
- $w^{(l)} =$ matrix of weights controlling the function mapping from layer l to layer l+1.
- $a^{(l)} = g(z^{(l)})$ (with the bias unit $a\_0^{(l)}$ appended) $=$ vector of "activations" in layer l.
- $z^{(l)} = w^{(l-1)}a^{(l-1)}$
Step 1: Forward Propagation
In the first step, we initialize all $w^{(l)}$ and then calculate all $z^{(l)}$ and $a^{(l)}$ using the initialized $w^{(l)}$.
\(a^{(1)} = x\)
\(z^{(2)} = w^{(1)}a^{(1)}\)
\(a^{(2)} = g(z^{(2)}) \quad (\text{append } a\_0^{(2)})\)
\(z^{(3)} = w^{(2)}a^{(2)}\)
\(a^{(3)} = g(z^{(3)}) \quad (\text{append } a\_0^{(3)})\)
\(z^{(4)} = w^{(3)}a^{(3)}\)
\(a^{(4)} = \hat{y} = g(z^{(4)})\)
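Here is a sketch of the forward pass for a four-layer network like the one described above, assuming a sigmoid for $g$ and bias units folded into each weight matrix; the layer sizes and the helper names `sigmoid` and `forward` are placeholders chosen for this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward propagation.
    weights[l] maps layer l+1 to layer l+2 and has shape (S_{l+2}, S_{l+1} + 1).
    Returns the list of activations a^(1), ..., a^(L), with bias units
    included everywhere except the output layer."""
    a = np.concatenate(([1.0], x))            # a^(1) = x with bias a_0
    activations = [a]
    for l, w in enumerate(weights):
        z = w @ a                             # z^(l+1) = w^(l) a^(l)
        a = sigmoid(z)                        # a^(l+1) = g(z^(l+1))
        if l < len(weights) - 1:
            a = np.concatenate(([1.0], a))    # append bias unit a_0^(l+1)
        activations.append(a)
    return activations

# Example with layer sizes 3 -> 5 -> 5 -> 1 (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(5, 6)), rng.normal(size=(1, 6))]
y_hat = forward(np.array([0.2, -0.5, 1.0]), weights)[-1]
```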
Step 2: BackPropagation Algorithm
In the second step, the main idea is the gradient descent algorithm. The gradient of every parameter is calculated and the parameter is updated as follows:
\(w\_{ij}^{(l)} = w\_{ij}^{(l)} - \alpha \frac{\partial J(w,b)}{\partial w\_{ij}^{(l)}}\)
\(b\_i^{(l)} = b\_i^{(l)} - \alpha \frac{\partial J(w,b)}{\partial b\_i^{(l)}}\)
The key point is then the gradient: how can we calculate every gradient?
For each output unit (e.g., layer 4 in the picture):
\[\delta\_j^{(4)} = (a\_j^{(4)} - y\_j)\,g'(z\_j^{(4)})\]
For each inner unit (e.g., layer 3 in the picture):
\[\delta\_j^{(3)} = \Big(\sum\_i w\_{ij}^{(3)}\delta\_i^{(4)}\Big)g'(z\_j^{(3)})\]
$\delta$ is also known as the error term. It is more convenient to calculate $\delta$ first and then obtain each gradient from it:
\[\frac{\partial J(w,b)}{\partial w\_{ij}^{(l)}} = a\_j^{(l)}\delta\_i^{(l+1)}\]
So we can summarize the whole process as follows: run forward propagation, compute the error terms $\delta$ backwards from the output layer, obtain the gradients from the $\delta$'s, and update the weights with gradient descent.
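Below is a compact sketch of one gradient step under the same assumptions as the forward-pass sketch (sigmoid activation, squared-error cost, biases folded into the weight matrices); it reuses the hypothetical `forward` and `sigmoid` helpers defined there and is only an illustration of the procedure:

```python
import numpy as np

def backprop_step(x, y, weights, alpha=0.1):
    """One forward/backward pass and gradient-descent update for a single
    training example (x, y), using J = 0.5 * sum_j (y_hat_j - y_j)^2."""
    activations = forward(x, weights)          # a^(1), ..., a^(L)
    y_hat = activations[-1]

    # Error term for the output layer: delta^(L) = (a^(L) - y) * g'(z^(L)),
    # with g'(z) = a * (1 - a) for the sigmoid.
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)

    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grads[l] = np.outer(delta, a_prev)     # dJ/dw^(l) = delta^(l+1) (a^(l))^T
        if l > 0:
            back = weights[l].T @ delta        # sum_i w_ij^(l) delta_i^(l+1)
            back = back[1:]                    # drop the bias unit's "error"
            a_hidden = a_prev[1:]              # activations without the bias unit
            delta = back * a_hidden * (1.0 - a_hidden)

    # Gradient-descent update
    return [w - alpha * grad for w, grad in zip(weights, grads)]
```

In practice the gradients of a whole batch are usually averaged before the update, but this single-example version matches the equations above.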
There is a demo that implements a multilayer perceptron using TensorFlow. A visualization of the multilayer perceptron follows; you can run it here.
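The demo itself is not reproduced here; a minimal sketch of a comparable MLP in TensorFlow 2.x / Keras (the layer sizes, optimizer, and number of epochs are arbitrary choices for this illustration) might look like this:

```python
import numpy as np
import tensorflow as tf

# XNOR data, matching the example above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([1, 0, 0, 1], dtype=np.float32)

# A small multilayer perceptron: one hidden layer of 4 sigmoid units
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1), loss="mse")
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(X).round())   # should approximate [[1], [0], [0], [1]] after training
```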