Maths behind training neural network using Gradient Descent

Well, recently I tried to contribute on a neural network project, I won’t lie I sucked miserably. But it is never too late to learn something and I will be pretty upset if I got another chance to contribute some thing as sexy as neural network and I suck once again. So here I go, learning neural network.

So what does a neural network model look like, it looks like a composition of bunch of layers and every layer contains bunch of neurons. The top most layer is called output layer and the down most layer is called input layer. Depending on the input the data propagates output. There are many many hidden layers resides in between these two. How many hidden layers would you love to have, it is a modeling problem, we need to consider many things to model our neural network. Every neurons are connected with every neurons of its next layer. Every node/neuron has its own activation function, every edge has its own weight, every layer has its own bias. So end of the day, this way or the other every neuron contributes to the output. Now when we are talking about training a neural network we are basically saying, we basically want to set the value of this weight, bias parameter in such a way so that for every input we get the correct result. How do we do that? Thats what this blog is all about.

So what can we do? We can go for Empirical risk minimization! So basically we are transforming the training problem to an optimization problem where we need to minimize a loss function of output and desired output. To save papers, or to impress academics we put it that way, , where f() is some function that predicts output, l() is some loss function of predicted output and real outputs. Ω() is a regularization function that takes Θ which is the weights we are predicting. Which has a use to filter on which values we don’t want to take. We will need to smooth our loss function because it is hard to optimize a non-smooth function.

From optimization literature we can use stochastic gradient descent to solve this. Θ= [w_1, b_1,…w_(i+1), b_(i+1)]. So we do what, For N iteration we will find

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5CDelta%20%3D%20-%20%5CDelta_%7B%5CTheta%7D%20l(f(X%5E%7Bt%7D%2C%20%5CTheta%7D)%2C%20Y%5E%7Bt%7D)%20%20-%20%5Clambda%20%5CDelta_%7B%5CTheta%7D%20%5COmega%20(%5CTheta))%0A%5C%5C%5CTheta%20%5Cleftarrow%20%5CTheta%20%2B%20%5Calpha%20%5CDelta" alt="\Delta = – \Delta_{\Theta} l(f(X^{t}, \Theta}), Y^{t}) – \lambda \Delta_{\Theta} \Omega (\Theta))

\\\Theta \leftarrow \Theta + \alpha \Delta” align=”absmiddle” scale=”0″>

This every iteration is known as epoch. So in every epoch we are finding a gradient, now we need to define those functions.

Lets discuss cost function first. As we see our cost function relies on another function f(). F() is basically a prediction function on probablity P(y=c|x), so it gives the probability of y being in class c, when x is given. We would like to maximize this probability, now as we are talking about we are framing this to a minimization problem, so we can define l() as a negative log likelihood problem.
l() is also known as cross entropy of information theory.

Now we will discuss about the gradients, the partial derivatives of our negative log function is

As we see, -1 was not a necessary part of the derivatives, but we are adding this as a filter, when y is not c it is 0 so it filters everything else for the term Fc.

So the gradient of the probablity function

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5CDelta%20_%7Bf(x)%7D%20%20-%20log%20f(x)_y%20%3D%20%5Cfrac%7B-1%7D%7Bf(x)_%7By%7D%7D%20%20%5Cleft%20%5B%201_%7B(y%3D0)%7D%20%5C%5C%20.%5C%5C.%5C%5C.%5C%5C%201_%7B(y%3DC-1)%7D%20%20%5Cright%20%5D%5C%5C%20%3D%20%5Cfrac%7B%20-e(y)%20%7D%7Bf(x)_y%7D&quot; alt="\Delta _{f(x)} – log f(x)_y = \frac{-1}{f(x)_{y}} \left [ 1_{(y=0)} \\ .\\.\\.\\ 1_{(y=C-1)} \right ]

\\ = \frac{ -e(y) }{f(x)_y}” align=”absmiddle” scale=”0″>

Now we are interested in partial derivative of output pre-activation function:

Now we replace f() with a softmax function that basically normalizes the exponential of activation it over the summation of other exponentials.

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B-1%7D%7Bf(x)_y%7D%20%20*%20%5Cfrac%7B%20%5Cdelta%7D%7B%5Cdelta%20a%5E%7B(L%2B1)%7D%20X_c%7D%20%20softmax%20(a%5E%7B(L%2B1)%7D%20X)_y%5C%5C%5Cfrac%7B-1%7D%7Bf(x)_y%7D%20%20*%20%5Cfrac%7B%20%5Cdelta%7D%7B%5Cdelta%20a%5E%7B(L%2B1)%7D%20X_c%7D%20%20%5Cfrac%7Bexp(a%5E%7B(L%2B1)%7D%20X_y)%20%7D%7B%5Csum%20exp(a%5E%7B(L%2B1)%20%7D%20%20x_c%7D%20%5C%5C&quot; alt="\frac{-1}{f(x)_y} * \frac{ \delta}{\delta a^{(L+1)} X_c} softmax (a^{(L+1)} X)_y\\

\frac{-1}{f(x)_y} * \frac{ \delta}{\delta a^{(L+1)} X_c} \frac{exp(a^{(L+1)} X_y) }{\sum exp(a^{(L+1) } x_c} \\” align=”absmiddle” scale=”0″>

We got this formula for partial derivative of a ratio, . If we apply this on our previous equation we get this:

But now we back on our f(x) we got:

–(i)

We will also need to find out the gradient of the hidden layers of neural network, if we calculate gradients for each neurons, we will grow old solving this. So we take this equation for chain rule, where a is the activation, p is the loss function, q is the preactivation layer above.

.

if we are k’th layer we are interested learn their gradients.

Partial derivative at j’th hidden unit and k’th hidden layer, with respect to the activation of hidden layer.

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cdelta%7D%7B%5Cdelta%20h%5E%7Bk%7D%20x_j%7D%20-log%20f(x)_y%3D%20%5Csum_i%20%5Cfrac%7B%20%5Cdelta%20-%20log%20f(x)_y%7D%7B%5Cdelta%20a%5E%7B(k%2B1)%7D%20x_i%20%7D%20%5Cfrac%7B%5Cdelta%20a%5E%7Bk%2B1%7D%20x_i%7D%7B%20%5Cdelta%20h%5Ek%20x_j%20%7D&quot; alt="\frac{\delta}{\delta h^{k} x_j} -log f(x)_y= \sum_i \frac{ \delta – log f(x)_y}{\delta a^{(k+1)} x_i } \frac{\delta a^{k+1} x_i}{ \delta h^k x_j }

” align=”absmiddle” scale=”0″>

a=b+∑W
so differentiation of a will be W, no surprise. Now if we get the gradient of the preactivation it will remain the same.

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cbigtriangledown%20h%5Ek%20(x)%20-%20log%20f(x)_y%5C%5C(W%5E%7Bk%2B1%7D)%5ET%20(%5Cbigtriangledown_%7B%20a%5E%7Bk%2B1%7D(x)%7D%20-log%20f(x)_y)&quot; alt="\bigtriangledown h^k (x) – log f(x)_y

\\(W^{k+1})^T (\bigtriangledown_{ a^{k+1}(x)} -log f(x)_y)” align=”absmiddle” scale=”0″>—-(2)

\\(\bigtriangledown_{h^k x } -log f(x)_y)^T (\bigtriangledown_{a^k x } h^k x)

\\\\(\bigtriangledown_{h^k x } -log f(x)_y)^T \bigodot […, g'(a^k x_j),…]” align=”absmiddle” scale=”0″>

Partial derivative of weights:
<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cdelta%7D%7B%5Cdelta%20W_%7Bi%2Cj%7D%5Ek%7D%20-log%20f(x)_y%5C%5C%20%3D%5Cfrac%20%7B%20%5Cdelta%20-%20log%20f(x)_y%20%7D%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%7D%20*%20%20%5Cfrac%20%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%20%7D%7B%20%5Cdelta%20W_%7Bi%2Cj%7D%5Ek%7D&quot; alt="\frac{\delta}{\delta W_{i,j}^k} -log f(x)_y

\\ =\frac { \delta – log f(x)_y }{ \delta a^{k} x_i} * \frac { \delta a^{k} x_i }{ \delta W_{i,j}^k}” align=”absmiddle” scale=”0″>

which is basically because a (x)=b(k)+∑W(k) h(k-1) (x)

Partial derivatives of Biases:

\\ =\frac { \delta – log f(x)_y }{ \delta a^{k} x_i} * \frac { \delta a^{k} x_i }{ \delta b_{i}^k}

\\=\frac { \delta – log f(x)_y }{ \delta a^{k} x_i} * 1″ align=”absmiddle” scale=”0″>

We see that somehow most of them has something in common, that is the gradient of pre-activation.

In backward propagation, we assume that we have got f(x) precomputed. So now for all the ks we will find out the gradients for layers. Theta in a 2 hidden layer neural network is defines as [W1,b1,W2,b2] so from my understanding so far, we put it in gradient descent algorithm. So for each iteration we have new value of Ws and bs, which we can put in our gradient descent algorithm.

How should we initialize our parameters? Well, looks like it is required a lot of insight to write about it in this blog, but I can say that there is a paper of Glarot and Bengio published in 2010 that suggested to have a gradient H_i,j = Uniform (-b,b). b=�?6/�?(H_k+ H_k-1)

Thanks:
1) Martin Thoma, for the inspiration
2) Hugo Larochelle for the tutorial