ssl (https) from python flask

Basically here I am in this blog I am going share a code snippet, and I am going to describe what else stupid things I tried and did not work to do that. Well, don’t miss my point, when I am sharing my stupid ordeal, it does not mean, I am proving myself stupid but I am trying to save your time, I am basically telling what did not work in last couple of hour so that it could save your time.

So using flask when I shared a static file migrated directly from django, while working with django we have figured out that we actually don’t need powerful tool like django, instead we can use something very lightweight like flask and it is much easier to switch from django to flask as for both of them default front end templating is jinja. Now after this shift, I had to face a little bit trouble with https, because when I tried https, it is completely blank, I lost my mind, what the hell is going on here? Then I realized for flask probably I need to define my ssl. They got this nice documentation at their website (http://flask.pocoo.org/snippets/111/), I follwed them, and it did not work! Why it won’t work? Alright after couple of trial and google search I realized this is an old school way of doing this, fine, I need to go to new school, obviously! I got myself a new dress and now context looks more pretty {“cert.crt”,”key.key”}. I am impressed but what the hell? it did not work as well, why it won’t work? I lost my mind! hours of fighting, and I got this error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/werkzeug/serving.py", line 602, in inner
    passthrough_errors, ssl_context).serve_forever()
  File "/usr/local/lib/python2.7/dist-packages/werkzeug/serving.py", line 506, in make_server
    passthrough_errors, ssl_context)
  File "/usr/local/lib/python2.7/dist-packages/werkzeug/serving.py", line 450, in __init__
    self.socket = ssl_context.wrap_socket(self.socket,
AttributeError: 'OpenSSL.SSL.Context' object has no attribute 'wrap_socket'

It is because I am using a python 2.7 version below 2.7.9.

What else did I try? You won’t want to know, I tried to install pyopenssl on heroku using pip, but looks like it is a ported version and failed to compile on heroku. Now I will write about what I had to do to make it work.

I have to make my cirtificate and keys:

$ openssl genrsa -des3 -passout pass:x -out server.pass.key 2048
 openssl rsa -passin pass:x -in server.pass.key -out server.key
 rm server.pass.key
 openssl req -new -key server.key -out server.csr
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt

So now we have server.key and server.csr two files in our directory

Now python script:

from flask import Flask, request, send_from_directory
from flask import render_template

#from OpenSSL import SSL
#context = SSL.Context(SSL.SSLv23_METHOD)
#context.use_privatekey_file('server.key')
#context.use_certificate_file('server.crt')

import os

ASSETS_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), '/static')
app = Flask(__name__, static_url_path='/static')

@app.route('/js/')
def send_js(path):
    return send_from_directory('js', path)

@app.route('/signup')
def signup():
    return render_template('signup.html')

if __name__ == '__main__':
    context = ('server.crt', 'server.key')
    app.run(ssl_context=context, threaded=True, debug=True)

Done! Up and running if you got a version >=2.7.9.!

RESTful API design best practices

For this new project I am currently involved in, I need to design and criticize some APIs. So I thought maybe digging up a little bit about RESTful API might be helpful for me.

When we are talking about API, we know that we are talking about Abstract Programming Interfaces, indeed but what makes a API good and bad? People having this religious debates about SOAP and REST, but U want to know, what should we keep in mind when we are designing a RESTful API.

First of we need to be clear about our goals why and what for we are creating this APIs. We create APIs for another developer so that they can offer good stuff to their customers. But the goal of api is to scale, so that our product can reach more people.
Why do we need to go for REST? Well, Rest is not a standard but a way of doing something that makes life easier. It has this following features:

i) Scalability: It has theability to make it grow faster. Say we want to add a new feature in our api, we can add it without much effort.
ii) Generality: It has to use a general protocol, like http, or https.
iii) Independence: Another good thing about REST is that, different part of apis can be independent in this architecture.
iv) Latency: caching is another big feature.
v) Security: as REST use http type protocols, it also has some good feature of security, we can use http headers for security purpose.
vi) Encapsulation: Rest architecture only shows thing that could possibly be necessary to other.

Why json?
Ubiquity: well, 50-70% of the market uses JSON.
Simplicity: It has a simpler grammar than any other competitors.
Readability: We don’t need any other prior knowledge to read jsons.
Scalability: Jsons are easy to add to.
Flexibility: It is flexible, everything can fit in json.
HATEOAS: it does not contain content type or anything, it only contains things that is necessary.

REST is easy for user but it can be hard for those who makes it. But one thing we need to keep it in mind, it is not a standard like w3 but it is a style, it is an architecture. This is easy to maintain if we keep few things in mind.

Suppose we have this application that creates user, gets user derails, change user details and so on. So one noob approach would be:

/createUser
/getUserDetail
/changePassword
…and so on.

Is it good? it is not. Not only it is bad, it also confuses the user of our API. We should put only nouns in our API request url not behaviors like doThis. We can do that with less urls to request. In fact we can do that with one using http methods, GET, PUT, POST, DElete, head, (PATCH). CRUD is not necessarily a one to one correspondent of GET, PUT, POST, DELETE. In fact PUT and POST can be used to get, create and update. Before discussing about POST or PUT, I think we can go for something simple, like getting information. I think there can be 2 types of resources need to be care about, we can get whole bunch of users, or a single instance of users.
i) collection
/users <- it is a plural
ii)instance
/users/1231 <-nested

Now lets talk about PUT. PUT has this little complicacy that, PUT is idempotent, it sends all the same state remains same does not matter if we get a 101 or what as a result. So need to take care sometime. We need to present all the field present when we do PUT. With PUT can’t support partial update. As PUT returns some data in the end a blank PUT can be used similarly as GET if implemented like that.

POST always done on parent (eg: /users) for POST 200 response is not good enough to return, as when the request is successful it also returns 200, it can be 201. but for update 200 is ok. POST is the only method not idempotent. We can make it idempotent. Some rest api has data limit, we can reduce it using partial updates. Also has impact on performence.

Now it is always a good idea to deal with Accept header of a request and send priority list which it can accept. When we send a request, it is also important to have a content-type.  We can do it easily if we use it as an extention.

applications/aln2c3.json.

BASE URLs does not mean a lot in rest api. https://api.foo.com is better looking than http://www.foo.com/dev/service/api/rest .Version should be in url, https://api.foo.com/v1, or as media type application/json;application&v=1, for those who changes api very frequently media type is better option. But it requires more work to make it done. not ideal and it is not completely fall under rest api system. About versioning, 1.2.3.4 should not be at api version of data exchange.

It is always better to use Camel cases not underscore when it comes to json. Date should be at iso 8601 and it should be UTC rather than anything else, it saves a lot of work.

There should always have ids but does the user need to know about this ids? Nope, we can use HREF in json media, instead. Hypermedia is paramount, does not have xlink like xml.

Now we talked about the fundamentals, but at the same time giving too much information on each api calls are a bad idea because it costs you money. So what can you do? You can ask user to request, what do you want? and upon his request you can extend your api response.

?expand=resource1, resource2

 

It is being considered as different resource so it will be stored in different cache. Offset and limits can also be done the same way.

?offset=5-&limit=25

It is always our dream that there won’t be any errors anywhere, but practically it never happens, but error can be tackled. So it is always better to keep user updated about the errors. Rather than just let them check the headers for error messages we should tell them in json response status about it. we should describe that error to him, give them a link to get notified about the error updates. Thats how we can help them.

About security, I would say sessions does not do much good other than over heads, sessions can be copied and you won’t see it coming, not only that session requires extra clustering at your server, it also better not to have them when you need better performance. Url changes all the time, so it is better not to hash urls, rather than urls rather use it on content.

There are also some people that believe oAuth2 better than oAuth1a, but honestly speaking oauth is not more secure than 2, oAuth2 wanted to make life easier though some people really doubt about its success. The use bearer token can be questioned.

It is always better to use api keys, rather than user passwords.
ENtropy: long in length, you can’t guess like a password.
password reset: does not affect on password change.
independence: not broken on password change or anything.
speed: If we use bcript  it may take 0.1 or 0.5 s, user may never notice, but some hacker will notice that it will take time.
limited exposure: only who downloaded it, he knows
traceablity: we can trace why it did not work.

Now for login there is this common misconception on when to use 401 and 402.

401 unauthorized means unauthenticated, maybe password was unchecked.
402 forbidden, credentials taken but found out he is not allowed

It is always good idea to add this POST method, because some user may never heard of DELETE kind of methods. And why post? Because POST is  the only method that is not idempoten.

?_method=DELETE

How to fix theano numpy wrong number of dimensions!

Recently I had to work with Theano, numpy, scipy libraries in python. had to pass hour with this this piece of error which says TypeError: (‘Bad input argument to theano function at index 1(0-based)’, ‘Wrong number of dimensions: expected 2, got 1 with shape (2,).’) So here it goes, how I fixed it.

For the sake helping other fellow programmers, I am posting the whole error log, which will help them via google:

/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
/usr/lib/python2.7/dist-packages/scipy/lib/_util.py:35: DeprecationWarning: Module scipy.linalg.blas.fblas is deprecated, use scipy.linalg.blas instead
DeprecationWarning)
False
Traceback (most recent call last):
File "bgd.py", line 44, in 
theta = gradient_descent_2(alpha, x, y, 2000)
File "bgd.py", line 24, in gradient_descent_2
hypothesis =find_hypothesis(x, theta).T
File "/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 497, in __call__
allow_downcast=s.allow_downcast)
File "/usr/local/lib/python2.7/dist-packages/Theano-0.6.0-py2.7.egg/theano/tensor/type.py", line 157, in filter
data.shape))
TypeError: ('Bad input argument to theano function at index 1(0-based)', 'Wrong number of dimensions: expected 2, got 1 with shape (2,).')

So basically it is saying that in my code there is a shape mismatch, apparently there should not be any mismatch, because I coded everything, I KNOW what I did! But apparently there is something in numpy that makes a matrix without height, which means with blank height. So it has a shape of (2, ).

Example could be:

import numpy as np
theta = np.array([1,1])
print theta.shape       #(2,)

It works pretty fine when you are working with nupy alone, but it causes tremendous problems when you are using other library like THENO.

Now let me share another snippet, from where I got this error:

from theano import tensor as T, function

T_x,T_y, T_z, T_theta, T_loss = T.dmatrices('x','y','z','theta','loss')
find_hypothesis=function([T_x,T_theta],T.dot(T_x,T_theta))

hypothesis =find_hypothesis(x, theta).T

From code like this you will get the error I previously mentioned.

Now lets talk about how did I fix it, actually it was simple, I passed hours without any reason. You can solve it just by reshapiing your numpy like this:

theta=theta.reshape(theta.shape[0],1)        #had tough time without this,

Maths behind training neural network using Gradient Descent

Well, recently I tried to contribute on a neural network project, I won’t lie I sucked miserably. But it is never too late to learn something and I will be pretty upset if I got another chance to contribute some thing as sexy as neural network and I suck once again. So here I go, learning neural network.

So what does a neural network model look like, it looks like a composition of bunch of layers and every layer contains bunch of neurons. The top most layer is called output layer and the down most layer is called input layer. Depending on the input the data propagates output. There are many many hidden layers resides in between these two. How many hidden layers would you love to have, it is a modeling problem, we need to consider many things to model our neural network. Every neurons are connected with every neurons of its next layer. Every node/neuron has its own activation function, every edge has its own weight, every layer has its own bias. So end of the day, this way or the other every neuron contributes to the output. Now when we are talking about training a neural network we are basically saying, we basically want to set the value of this weight, bias parameter in such a way so that for every input we get the correct result. How do we do that? Thats what this blog is all about.

 

So what can we do? We can go for Empirical risk minimization! So basically we are transforming the training problem to an optimization problem where we need to minimize a loss function of output and desired output. To save papers, or to impress academics we put it that way, argmin \frac{1}{T} \sum l(f(X^{t}, \theta), Y^{t})+\lambda\Omega (\theta), where f() is some function that predicts output, l() is some loss function of predicted output and real outputs. Ω() is a regularization function that takes Θ which is the weights we are predicting. Which has a use to filter on which values we don’t want to take. We will need to smooth our loss function because it is hard to optimize a non-smooth function.

From optimization literature we can use stochastic gradient descent to solve this. Θ= [w_1, b_1,…w_(i+1), b_(i+1)]. So we do what, For N iteration we will find

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5CDelta%20%3D%20-%20%5CDelta_%7B%5CTheta%7D%20l(f(X%5E%7Bt%7D%2C%20%5CTheta%7D)%2C%20Y%5E%7Bt%7D)%20%20-%20%5Clambda%20%5CDelta_%7B%5CTheta%7D%20%5COmega%20(%5CTheta))%0A%5C%5C%5CTheta%20%5Cleftarrow%20%5CTheta%20%2B%20%5Calpha%20%5CDelta" alt="\Delta = – \Delta_{\Theta} l(f(X^{t}, \Theta}), Y^{t}) – \lambda \Delta_{\Theta} \Omega (\Theta))

\\\Theta \leftarrow \Theta + \alpha \Delta” align=”absmiddle” scale=”0″>

This every iteration is known as epoch. So in every epoch we are finding a gradient, now we need to define those functions.

Lets discuss cost function first. As we see our cost function relies on another function f(). F() is basically a prediction function on probablity P(y=c|x), so it gives the probability of y being in class c, when x is given. We would like to maximize this probability, now as we are talking about we are framing this to a minimization problem, so we can define l() as a negative log likelihood problem.l(f(x), y)= - \sum _c 1_{(y=c)} log f(x)_{c} =-log f(x)_{y}
l() is also known as cross entropy of information theory.

Now we will discuss about the gradients, the partial derivatives of our negative log function is
\frac{\delta}{\delta f(x)_{c}} - log f(x)_y =\frac{-1_{(y=c)}}{f(x)_{y}}

As we see, -1 was not a necessary part of the derivatives, but we are adding this as a filter, when y is not c it is 0 so it filters everything else for the term Fc.

So the gradient of the probablity function

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5CDelta%20_%7Bf(x)%7D%20%20-%20log%20f(x)_y%20%3D%20%5Cfrac%7B-1%7D%7Bf(x)_%7By%7D%7D%20%20%5Cleft%20%5B%201_%7B(y%3D0)%7D%20%5C%5C%20.%5C%5C.%5C%5C.%5C%5C%201_%7B(y%3DC-1)%7D%20%20%5Cright%20%5D%5C%5C%20%3D%20%5Cfrac%7B%20-e(y)%20%7D%7Bf(x)_y%7D&quot; alt="\Delta _{f(x)} – log f(x)_y = \frac{-1}{f(x)_{y}} \left [ 1_{(y=0)} \\ .\\.\\.\\ 1_{(y=C-1)} \right ]

\\ = \frac{ -e(y) }{f(x)_y}” align=”absmiddle” scale=”0″>

Now we are interested in partial derivative of output pre-activation function:

\frac{\delta}{\delta a^{(L+1)} X_c} - log f(x)_y = \frac{-1}{f(x)_y}  *  \frac{\delta}{\delta a^{(L+1)} X_c} f(x)_y

Now we replace f() with a softmax function that basically normalizes the exponential of activation it over the summation of other exponentials.

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B-1%7D%7Bf(x)_y%7D%20%20*%20%5Cfrac%7B%20%5Cdelta%7D%7B%5Cdelta%20a%5E%7B(L%2B1)%7D%20X_c%7D%20%20softmax%20(a%5E%7B(L%2B1)%7D%20X)_y%5C%5C%5Cfrac%7B-1%7D%7Bf(x)_y%7D%20%20*%20%5Cfrac%7B%20%5Cdelta%7D%7B%5Cdelta%20a%5E%7B(L%2B1)%7D%20X_c%7D%20%20%5Cfrac%7Bexp(a%5E%7B(L%2B1)%7D%20X_y)%20%7D%7B%5Csum%20exp(a%5E%7B(L%2B1)%20%7D%20%20x_c%7D%20%5C%5C&quot; alt="\frac{-1}{f(x)_y} * \frac{ \delta}{\delta a^{(L+1)} X_c} softmax (a^{(L+1)} X)_y\\

\frac{-1}{f(x)_y} * \frac{ \delta}{\delta a^{(L+1)} X_c} \frac{exp(a^{(L+1)} X_y) }{\sum exp(a^{(L+1) } x_c} \\” align=”absmiddle” scale=”0″>

We got this formula for partial derivative of a ratio, \frac{ \delta \frac{g(x)}{h(x)}}{\delta x} } = \frac{ \delta g(x) }{ \delta x } \frac{1}{h(x)} - \frac {g(x)}{h(x)} \frac{\delta h(x)}{\delta x}. If we apply this on our previous equation we get this:

\frac{\delta}{\delta a^{(L+1)} X_c} - log f(x)_y = \frac{-1}{f(x)_y}  *  (1_{(y=c)} softmax(a^{(L+1)}  x_y - softmax(a^{(L+1)}  x_y softmax(a^{(L+1)}  x_c  )

But now we back on our f(x) we got:

\frac{\delta}{\delta a^{(L+1)} X_c} - log f(x)_y = \frac{-1}{f(x)_y}  *  (1_{(y=c) f(x)_y - f(x)_y f(x)_c})

\frac{\delta}{\delta a^{(L+1)} (x)_{c}}} - log f(x)_y =-(1_{(y=c) - f(x)_{c}})

So the gradient,

\Delta  a^{(L+1)}(x) - log f(x)_y = -(e(y)-f(x))–(i)

 

We will also need to find out the gradient of the hidden layers of neural network, if we calculate gradients for each neurons, we will grow old solving this. So we take this equation for chain rule, where a is the activation, p is the loss function, q is the preactivation layer above.

.\frac{\delta}{\delta a} p(a)= \sum_i  \frac{\delta p(a)}{\delta q_i(a)} *  \frac{\delta q_i(a)}{\delta a}

if we are k’th layer we are interested learn their gradients.

Partial derivative at j’th hidden unit and k’th hidden layer, with respect to the activation of hidden layer.

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cdelta%7D%7B%5Cdelta%20h%5E%7Bk%7D%20x_j%7D%20-log%20f(x)_y%3D%20%5Csum_i%20%5Cfrac%7B%20%5Cdelta%20-%20log%20f(x)_y%7D%7B%5Cdelta%20a%5E%7B(k%2B1)%7D%20x_i%20%7D%20%5Cfrac%7B%5Cdelta%20a%5E%7Bk%2B1%7D%20x_i%7D%7B%20%5Cdelta%20h%5Ek%20x_j%20%7D&quot; alt="\frac{\delta}{\delta h^{k} x_j} -log f(x)_y= \sum_i \frac{ \delta – log f(x)_y}{\delta a^{(k+1)} x_i } \frac{\delta a^{k+1} x_i}{ \delta h^k x_j }

” align=”absmiddle” scale=”0″>

a=b+∑W
so differentiation of a will be W, no surprise. Now if we get the gradient of the preactivation it will remain the same.

(W_{,j}^{(k+1)} )^T   \bigtriangledown  a^{k+1} (x) - log f(x)_y

Gradient,

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cbigtriangledown%20h%5Ek%20(x)%20-%20log%20f(x)_y%5C%5C(W%5E%7Bk%2B1%7D)%5ET%20(%5Cbigtriangledown_%7B%20a%5E%7Bk%2B1%7D(x)%7D%20-log%20f(x)_y)&quot; alt="\bigtriangledown h^k (x) – log f(x)_y

\\(W^{k+1})^T (\bigtriangledown_{ a^{k+1}(x)} -log f(x)_y)” align=”absmiddle” scale=”0″>—-(2)

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cbigtriangledown%20_%20%7Ba%5Ek%20x%20%7D%20-log%20f(x)_y%5C%5C(%5Cbigtriangledown_%7Bh%5Ek%20x%20%7D%20-log%20f(x)_y)%5ET%20(%5Cbigtriangledown_%7Ba%5Ek%20x%20%7D%20h%5Ek%20x)%5C%5C%5C%5C(%5Cbigtriangledown_%7Bh%5Ek%20x%20%7D%20-log%20f(x)_y)%5ET%20%5Cbigodot%20%5B…%2C%20g'(a%5Ek%20x_j)%2C…%5D&quot; alt="\bigtriangledown _ {a^k x } -log f(x)_y

\\(\bigtriangledown_{h^k x } -log f(x)_y)^T (\bigtriangledown_{a^k x } h^k x)

\\\\(\bigtriangledown_{h^k x } -log f(x)_y)^T \bigodot […, g'(a^k x_j),…]” align=”absmiddle” scale=”0″>

Partial derivative of weights:
<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cdelta%7D%7B%5Cdelta%20W_%7Bi%2Cj%7D%5Ek%7D%20-log%20f(x)_y%5C%5C%20%3D%5Cfrac%20%7B%20%5Cdelta%20-%20log%20f(x)_y%20%7D%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%7D%20*%20%20%5Cfrac%20%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%20%7D%7B%20%5Cdelta%20W_%7Bi%2Cj%7D%5Ek%7D&quot; alt="\frac{\delta}{\delta W_{i,j}^k} -log f(x)_y

\\ =\frac { \delta – log f(x)_y }{ \delta a^{k} x_i} * \frac { \delta a^{k} x_i }{ \delta W_{i,j}^k}” align=”absmiddle” scale=”0″>

\\ =\frac { \delta - log f(x)_y }{ \delta a^{k} x_i}  h_j^{k-1} x

which is basically because a (x)=b(k)+∑W(k) h(k-1) (x)

Partial derivatives of Biases:

<img class="mathtex-equation-editor" src="https://web.archive.org/web/20160331013249im_/http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cdelta%7D%7B%5Cdelta%20b_%7Bi%7D%5Ek%7D%20-log%20f(x)_y%5C%5C%20%3D%5Cfrac%20%7B%20%5Cdelta%20-%20log%20f(x)_y%20%7D%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%7D%20*%20%20%5Cfrac%20%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%20%7D%7B%20%5Cdelta%20b_%7Bi%7D%5Ek%7D%5C%5C%3D%5Cfrac%20%7B%20%5Cdelta%20-%20log%20f(x)_y%20%7D%7B%20%5Cdelta%20a%5E%7Bk%7D%20x_i%7D%20*%201&quot; alt="\frac{\delta}{\delta b_{i}^k} -log f(x)_y

\\ =\frac { \delta – log f(x)_y }{ \delta a^{k} x_i} * \frac { \delta a^{k} x_i }{ \delta b_{i}^k}

\\=\frac { \delta – log f(x)_y }{ \delta a^{k} x_i} * 1″ align=”absmiddle” scale=”0″>

We see that somehow most of them has something in common, that is the gradient of pre-activation.

In backward propagation, we assume that we have got f(x) precomputed. So now for all the ks we will find out the gradients for layers. Theta in a 2 hidden layer neural network is defines as [W1,b1,W2,b2] so from my understanding so far, we put it in gradient descent algorithm. So for each iteration we have new value of Ws and bs, which we can put in our gradient descent algorithm.

 

How should we initialize our parameters? Well, looks like it is required a lot of insight to write about it in this blog, but I can say that there is a paper of Glarot and Bengio published in 2010 that suggested to have a gradient H_i,j = Uniform (-b,b). b=�?6/�?(H_k+ H_k-1)

 

Thanks:
1) Martin Thoma, for the inspiration
2) Hugo Larochelle for the tutorial

Notes on Markov Random Field energy function and Reparameterization Example

Markov Random Field is basically a model of joint probability distribution. It is commonly used in many places where an output get biased depending on the output of its neighbors which including Computer Vision. In computer vision one of the major problem that we are aware of is its noise problem. This noise usually depends on its neighbors which satisfies markovian properties.

Now lets represent the neighborhood relationship in a undirected graph where the vertices represents random variables and edges defines the neighborhood. We assume that this random variables that can be picked can be denoted as a finite set L of cardinality h. If I have n random variable then I can say that I have h^n possible combination out of which one is responsible for all. So we can represent it in a probability distribution, p(x). MRF has its own way to assumes this property for p(x). It assumes a vertice conditionally independent of all other vertices other than its neighbor. Mainly because it would help us to use Hammersly Clifford theorem.
MRF1
In graph theory, there is a common term called clique. Clique is basically the vartices that are somehow connected via an edge. If we consider clique potential function of size 2 then clique potential between 2 vartice will return a non negative value. The P(x) is proportional to the multiplication of all clique potentials.
Previously we were busy with only the outputs, but now we will try to predict inputs as well. So x_n were unobserved data and d_n’s are observed data. So now we have another edge which will be subject to clique potential as well. So now the probablity of d given x, P(d|x) is proportional to the multiplication of all clique potentials between x and d.
MRF2
But the function that we get can be normalized using an exponential function, P(x)=Exp(-E(x))/z
Here we get another function, E(x) which is known as Energy function. E(x) is summation instead of product as because we have an exponential in the original function. E(x)=Summation of theta(x)
This E(x) is closely related with Clique Potential because its just analogous to -log(psi(x))
Let me give an example of this energy function. Suppose we are separating out background and foreground image. We can mark which one is backgroun and which one is foreground by marking them in different label. The pixel that has a probability denoted like this P(x_p,l_1)=2 and this P(x_p,l_0)=5 then it is more probable ffor x_p to be in foreground. Now we have to calculate total energy function to get the result. 

E(x)=Summation of theta(x) [for unary potential]
EMFpotential1 

But if we want more accurate result we would go for binary potential.  So we will get connected in all possible way.
E(x)=Summation of theta(xp) + Summation of theta(xp,xq) [for binary potential]

EMFpotential2

Now using dynamic programming we would figure out the minimum energy so that we get a higher probability in total.
Now we know that the only thing we care about is energy function. So we can change the value of potential function to make the calculation more convenient by using Reperamatrization. Reperamatrization is a trick where we change the value of potential function without leting the Energy function being changed. Few of the example of that are given below.

How to automate numpy installation in your project using setuptool?

Recently for an opensurce project, I had to work using numpy. Numpy makes life a lot easier, without any doubt, but when it comes to setup.py of setuptool. It does not work just by adding <code>’install_requires:[“numpy”]’ </code> mainly because unlike any other python packages numpy is written in C/C++, mainly because of the optimization, interpreted language like python are not fast enough for mathematical computations. They ported their code in python. So before using them we need to compile them.

To do that we need to use custom commands, for custom commands in setuptool, we need to add “cmdclass” as a key in the dictionary. “build_ext” is a cython command which helps to compile files of numpy. But before compiling is done we should not call anything else, so we need to customize some existing class. So we overwrite build_ext class with an extension of it.

So the code should look like this:

 

from setuptools import setup
from setuptools.command.build_ext import build_ext as _build_ext

class build_ext(_build_ext):
    'to install numpy'
    def finalize_options(self):
        _build_ext.finalize_options(self)
        # Prevent numpy from thinking it is still in its setup process:
        __builtins__.__NUMPY_SETUP__ = False
        import numpy
        self.include_dirs.append(numpy.get_include())

config = {
 'cmdclass':{'build_ext':build_ext}, #numpy hack
 'setup_requires':['numpy'],         #numpy hack
 #...
 'install_requires': ["numpy" ]
}

 

Now why do we write __NUMPY_SETUP__=false? I have found an interesting answer @ StackOverFlow by coldfix.

In the source code of numpy we have got this:

 

if __NUMPY_SETUP__:
    import sys as _sys
    _sys.stderr.write('Running from numpy source directory.\n')
    del _sys
else:
    # import submodules etc. (main branch)

So when NUMPY_SETUP is true it does not import submodules so we need to make it sure that it does not get called first time. Then we include it in our library path.

 

 

Implementation and mathematics behind Harris Corner Detection

Say I have this amazing friend who travels a lot, say he visits every corner of the world. If I had the job/money I would have done the same, but unfortunately I need to be passionate a little bit to make it happen. So he want to share all his photos with me. There could have been the picture of the hill top buildings of italy, the underground city of turkey, Plitvice lakes of Croatia, Prague of Czech Republic. Obviously some of the photos will be in Zambia capturing The Southern Cross. Bazaruto Archipelago of Mozambique will obviously be there, Some of them will capture Nyiragongo Volcano of Congo. And so on, not one picture in each, but instead, there would be multiples of it. At this age, we don’t just take one picture at a place, right? But as I have seen these places in only in my dream, I could not really tell, which pictures are taken from where just by looking at it. I need to take a closer look, I need to figure out something unique in that picture and correspond that unique feature with other pictures I can tell if the place is same or not. They key point of this para is “Feature detection”.

So as a human we do feature detection. We want our algorithm to do that for us, by nature computer can’t tell a man “a man”, because everything is just a bunch of number that represent pixels, how would it know if they are feature or not? For the simplicity sake lets say we know there are two images where the picture where the image has been captured from the same place. Say we have two corresponding images representing the same scene, we need to match one point of the image to the other point. If we try to correspond each pixel of one image with other image. It will be extremely costly, because even a simple can have 1000×600 pixels. So better approach would be to get few unique identical points that is very rare in the images. Say within 1000×600=100000 pixels, we choose 100 points that does not comes too often, within two images. Since both of the images represent the same scene then something that is unique feature in one image should also be unique in other. So now our new aim will be to correspond 100 points of one image to the other.

Feature pixcels such as mountain peaks, building corners, doorways, or interestingly shaped patches are kinds of localized feature are often called keypoint features, interest points or even corners. They are basically defined by the appearance of patches of pixels surrounding the point location. Another set of important features are edges, say the profile of mountains against the sky.

As we see, similarity plays a great role in it, so we can use summed square difference here between two images, mathematically,

w is the weight, say some pixcel that is close to x_i has a different weight when it is far far away. Usually, it is a gaussian filter.

Which is essentially the correlation equation.

Now we know are going to consider pixels as a series of 2D array, like I(x+u,y+v). As we know any function can be represented as a tylor series, so do our I(). So our I(x+u,y+v) can be represented as

So putting all together:


Now we will compute eigenvalues, it helps us to find least squares plane of some data. It helps us to find the surface normal of point cloud.

is a 2×2 matrix, it can have two eigenvalues, λ1 and λ2.

If λ1 and λ2 are too small then flat.
If λ1 and λ2 are large then it is corner.
If λ1 or λ2 are too small or too large from other it is edge.

Harris measured this cornerness using R=λ1 * λ2 – k (λ1 + λ2)^2 = det M – k(trace M)^2,

if R is negative then it is edge, R is small then flat, R is greater than corner.

from scipy.ndimage import *
from pylab import *
from PIL import *

def compute_harris_response(im,sigma=3):
    """ Compute the Harris corner detector response function for each pixel in a graylevel image. """

    # derivatives
    imx = zeros(im.shape)
    filters.gaussian_filter(im, (sigma,sigma), (0,1), imx)
    imy = zeros(im.shape)
    filters.gaussian_filter(im, (sigma,sigma), (1,0), imy)

    # compute components of the Harris matrix
    Wxx = filters.gaussian_filter(imx*imx,sigma)
    Wxy = filters.gaussian_filter(imx*imy,sigma)
    Wyy = filters.gaussian_filter(imy*imy,sigma)

    # determinant and trace
    Wdet = Wxx*Wyy - Wxy**2
    Wtr = Wxx + Wyy

    return Wdet / Wtr


def get_harris_points(harrisim,min_dist=10,threshold=0.1):
    """ Return corners from a Harris response image
    min_dist is the minimum number of pixels separating
    corners and image boundary. """
    # find top corner candidates above a threshold
    corner_threshold = harrisim.max() * threshold
    harrisim_t = (harrisim > corner_threshold) * 1
    # get coordinates of candidates
    coords = array(harrisim_t.nonzero()).T
    # ...and their values
    candidate_values = [harrisim[c[0],c[1]] for c in coords]
    # sort candidates
    index = argsort(candidate_values)
    # store allowed point locations in array
    allowed_locations = zeros(harrisim.shape)
    allowed_locations[min_dist:-min_dist,min_dist:-min_dist] = 1
    # select the best points taking min_distance into account
    filtered_coords = []
    for i in index:
        if allowed_locations[coords[i,0],coords[i,1]] == 1:
            filtered_coords.append(coords[i])
            allowed_locations[(coords[i,0]-min_dist):(coords[i,0]+min_dist),
            (coords[i,1]-min_dist):(coords[i,1]+min_dist)] = 0
    return filtered_coords

def plot_harris_points(image,filtered_coords):
    """ Plots corners found in image. """
    figure()
    gray()
    imshow(image)
    plot([p[1] for p in filtered_coords],[p[0] for p in filtered_coords],'*')
    axis('off')
    show()






im = array(Image.open('empire.jpg').convert('L'))
harrisim = compute_harris_response(im)

filtered_coords = get_harris_points(harrisim,10)
plot_harris_points(im, filtered_coords)

Thanks to:
1. Professor Mubarak Shah for sharing his course online.
2. Professor Szeleski, for making his book draft public
3. Jan Erik Solem for his book on programming computer vision.
4. That random guy at physics forum who tells me why it works, after full day of wondering, why it works.

Probability transformation, moment and possible uses

This is the second blog on my adventure to learn more about probability and statistics. This is one is about Probability transformation, probability momentum and their use in skewness, kurtosis, variance and so on

Sometime it is easier for us if we can represent a probability distribution in a different way, say we have a very complicated joint probability distribution of a set of random variables, we want to transform it into much simpler form using probability transformation.So, using probability transformation constructs an equivalent set of values that can reasonably be modelled from a specified distribution.

Suppose we have a distribution like this:
F(x) \int_{\sqrt{ \pi /2}}^{0} 2x cos x\^{2} dx

We want to transform it into something else, say y=x^{2}
Replace the limits x = 0 and x = \sqrt{2 / \pi} will change into y=0 and y=\pi/2
\frac{d}{dx}x\^{2}=frac{d}{dx}y =>2x dx=dy
So, F(y) \int_{ \pi /2}^{0} sin y dy will represent the same function.

If cdf F_x(x), and transformation function Y=g(x), then the relationship between F_x(x) and F_y(y) depends on whether g is increasing or decreasing, if it is increasing then F_y(x)=F_x(g\{-1}(y)) but if it is decreasing then F_y(x)=1-F_x(g\{-1}(y))

Let X have pdf fx), let Y=g(x) Suppose there exists a partition A_0, A_1,…A_k of ample saple s[ace
i) g(x)=g_i(x) for x \in A_i
ii)g_i(x) is monotone on A_i
iii) Y=

There are a lot of interesting theorem I found in the book, like say if we have Y=F(x), then P(Y

In binomial transform tation a set of number a_0, a_1, a_2, … get transformed into b_0, b_1, b_2, …, where
b_n =\sum_{k=0}^{n} (-1)^k {(n-k)}\dbinom{n}{k} a_i

A discrete random variable X has a binomial distribution if its pmf is of the form f_x(x)=P(X=x)=\dbinom{n}{k} p^{x} (1-p)^{n-x} where 0< =p=0

If g is increasing function on X, F_y(y)=F_x(g^{-1}(y))
If g is decreasing function on X, F_y(y)=1-F_x(g^{-1}(y))

Now, lets get started with Expectation, what to expect and what not to, this is one hell of a question. Don’t go for very serious issue at the beginning, but at least we can formulate what to expect when we roll a dice. It is calculated as following E(X)=\sum{n}^{n=0} x p(x) So what to expect when we roll the dice? Dice has 6 side, so for each side p(x)=1/6.
E(X)=(1/6)*1+(1/6)*2+(1/6)*3+(1/6)*4+(1/6)*5+(1/6)*6=3.5

So which is basically the mean of the number so the expection is basically the mean, E(X). Not only mean but expectation also has a relation with variance as mean has a relation with variance. Variance \sigma/^{2} can be calculated using the expectation of x-EX )\^2, so E(x-EX )\^2 is variance. So from expectation we know the average, as well as how sparse our data is, on other word variance, so standard deviation is just root of \sigma.

Probability moments are formulated using this eq E(X\^k), so its a expected power of random variable.Central moment E(X -EX)/^{k}. Now we see that anything like mean or variance can be calculated using this probability moment, when k=1 it is same as the equation of mean, when k=2 it is close to variance. For k=3 we E(x-u)\{3}/\sigma\^{3} we get the skew of the moment. skew defines the goodness of the distribution. Skewness=(mean-median)/Standard deviation. Positive distribution pick will be in the left, and vise versa. For k=4 we E(x-u)\{4}/\sigma\^{4} we get the kurtosis, Kurtosis defines whether the distribution has higher peak then the normal distribution or flatter.

If random variable with finite variance, then for any constants a and b, Var(aX+b0=a^2 Var X.

Convergence of MGF theorem states, if lim t->0 M(t)_i=M(t) for all neighbor of 0, then F_xi(x)=F_x(x)

Why Android app DOES NOT run on Java Virtual Machine!

I had a misconception for like years that as we write code in java for android application so there must be an JVM [just like on J2ME mobiles ] in underline android Operating System but actually I was actually wrong. So I have written a little bit on the differences.
Its true that for most of the android development we use java and we need to compile it like other java development. When in our IDE we hit compile button it also generates Java byte-code files which will be better understood if I say .class files. But later on these .class files get compressed and get converted into a single byte code file using DX converter which usually known as classes.dex. Then this dex filles packaged with other application resources and become a .apk file. We can install this .apk file in the devices. When we run this application, JRE does not run the .dex file instead Dalvik Virtual Machine executes that dex bit codes.
Since JVM and Delvik are turing complete, so basically end of the day they do the same things. But it has internal differences. The key difference is that JVM is stack-based, while Dalvik is register based.
I think it will be better if we remind the example of stack based addition and register based addition.
In an stack stack based addition when we need to add a set of numbers we put all of them into a stack then we would pop top two and add them up then again we would push that summation in the stack and we will pop two and sum them up and we will repeat this process until the stack is empty. So it takes minimal process to implement stack based VM because only 2 register can do it. Which is very important for JVM because it supports a wide range of devices.

On the other hand register based system it requires to tell the explicit memory location in the memory if we want it to add. But the average register instruction is larger than an average stack instruction. once we tell them the location explicitly we don’t need to go through all this over head of push, pop which helps android to run on slow CPU and less RAM. Additionally in register based virtual machine few optimization can be done which on low battery consumption. DVM has been designed so that a device can run multiple instances of the VM efficiently.