Notes on Markov Random Field energy function and Reparameterization Example

A Markov Random Field (MRF) is basically a model of a joint probability distribution. It is commonly used wherever an output is biased by the outputs of its neighbors, which includes Computer Vision. In computer vision one of the major problems we deal with is noise, and this noise usually depends on a pixel's neighbors, which satisfies the Markovian property.

Now let's represent the neighborhood relationship as an undirected graph where the vertices represent random variables and the edges define the neighborhood. We assume the values each random variable can take come from a finite label set L of cardinality h. If I have n random variables then there are h^n possible labelings, out of which one is responsible for the whole configuration. So we can represent this with a probability distribution p(x). An MRF makes its own assumption about p(x): each vertex is conditionally independent of all other vertices given its neighbors. This matters mainly because it lets us use the Hammersley-Clifford theorem.
[Figure: MRF1]
In graph theory there is a common term, clique. A clique is a set of vertices in which every pair is connected by an edge. If we consider clique potential functions of size 2, then the clique potential between two vertices returns a non-negative value. P(x) is proportional to the product of all the clique potentials.
Previously we were concerned only with the outputs, but now we bring in the observations as well. So the x_n are unobserved (hidden) variables and the d_n are observed data. Now we have additional edges which are subject to clique potentials as well, so the probability of d given x, P(d|x), is proportional to the product of all clique potentials between x and d.
[Figure: MRF2]
This distribution can be normalized using an exponential function: P(x) = exp(-E(x))/Z.
Here we get another function, E(x), known as the energy function. E(x) is a summation instead of a product because of the exponential in the original expression: it is the sum over the cliques of terms theta_c(x_c).
This E(x) is closely related to the clique potentials, because each term theta_c is just analogous to -log psi_c(x_c).
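Putting the last few statements together in one line (this is just a restatement of the formulas above, with the product taken over all cliques c):

\begin{equation*} P(x) = \frac{1}{Z}\prod_{c} \psi_c(x_c) = \frac{1}{Z}\exp\Big(-\sum_{c} \theta_c(x_c)\Big) = \frac{\exp(-E(x))}{Z}, \qquad \theta_c(x_c) = -\log \psi_c(x_c) \end{equation*}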
Let me give an example of this energy function. Suppose we are separating the background and the foreground of an image. We can mark which pixel is background and which is foreground by giving them different labels, say l_0 for background and l_1 for foreground. If the unary potentials of a pixel are theta(x_p = l_1) = 2 and theta(x_p = l_0) = 5, then, since lower energy means higher probability, it is more probable for x_p to be in the foreground. Now we have to compute the total energy function to get the result.

\begin{equation*} E(x)=\sum_{p} \theta_p(x_p) \qquad \text{[unary potentials only]} \end{equation*}
[Figure: EMFpotential1]

But if we want a more accurate result we also add pairwise (binary) potentials, so neighboring pixels get connected as well.
\begin{equation*} E(x)=\sum_{p} \theta_p(x_p) + \sum_{(p,q)} \theta_{pq}(x_p,x_q) \qquad \text{[unary plus pairwise potentials]} \end{equation*}

[Figure: EMFpotential2]

Now, using dynamic programming, we can figure out the minimum-energy labeling, which gives us the highest total probability.
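To make this concrete, here is a minimal sketch in Python of evaluating and minimizing this energy by brute force over all h^n labelings. The pixel chain, labels, and potential values are made-up assumptions for illustration (they are not the values in the figures); brute force is only feasible for toy examples, which is exactly why the dynamic programming and max-flow machinery matters.

```python
from itertools import product

# A tiny foreground/background example: 3 pixels in a row, labels 0 (background) and 1 (foreground).
# All potential values below are made up for illustration.
labels = [0, 1]
n_pixels = 3

# Unary potentials theta_p(x_p): cost of giving pixel p label l.
unary = [
    {0: 5, 1: 2},   # pixel 0 prefers foreground (lower energy for label 1)
    {0: 1, 1: 4},   # pixel 1 prefers background
    {0: 3, 1: 3},   # pixel 2 has no preference
]

# Pairwise potentials theta_pq(x_p, x_q) on neighboring pixels: penalize differing labels.
def pairwise(xp, xq):
    return 0 if xp == xq else 2

edges = [(0, 1), (1, 2)]

def energy(x):
    """E(x) = sum_p theta_p(x_p) + sum_(p,q) theta_pq(x_p, x_q)."""
    e = sum(unary[p][x[p]] for p in range(n_pixels))
    e += sum(pairwise(x[p], x[q]) for p, q in edges)
    return e

# Brute force over all h^n labelings; the minimum-energy labeling is the most probable one.
best = min(product(labels, repeat=n_pixels), key=energy)
print(best, energy(best))
```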
Now we know that the only thing we care about is the energy function. So we can change the values of the potential functions to make the calculation more convenient by using reparameterization. Reparameterization is a trick where we change the values of the potential functions without changing the energy function. A few examples of that are given below.
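Since the figures with the reparameterization examples are not reproduced here, here is a minimal sketch of one standard move, with made-up numbers: push a constant from a pairwise potential into a unary potential. The assertion checks that every labeling keeps exactly the same total energy.

```python
from itertools import product

# Two pixels p and q with labels {0, 1}; the potential values are made up for illustration.
unary_p = {0: 3, 1: 1}
unary_q = {0: 2, 1: 4}
pairwise = {(0, 0): 0, (0, 1): 5, (1, 0): 5, (1, 1): 0}

def energy(xp, xq, up, uq, pw):
    return up[xp] + uq[xq] + pw[(xp, xq)]

# Reparameterization: subtract a constant c from theta_pq(0, .) and add the same c to
# theta_p(0). Every labeling with x_p = 0 pays c less on the edge and c more on the
# unary term, so E(x) is unchanged for every labeling.
c = 2
unary_p2 = dict(unary_p)
unary_p2[0] += c
pairwise2 = {(xp, xq): v - (c if xp == 0 else 0) for (xp, xq), v in pairwise.items()}

for xp, xq in product([0, 1], repeat=2):
    assert energy(xp, xq, unary_p, unary_q, pairwise) == \
           energy(xp, xq, unary_p2, unary_q, pairwise2)
print("all labelings keep the same energy after reparameterization")
```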

Notes on ensuring Maximum Energy Flow using s-t flow/cut & residual graph

The maximum flow problem is one of the important problems in combinatorial optimization. It finds a feasible way of sending flow from a source to a sink. Here we are interested in it because it lets us find a minimum-energy labeling in an MRF.

Let's define our problem space using a directed graph. Every edge carries a flow and has a capacity (the black numbers in the diagram) that it can hold. Each time we push some flow we mark it in the graph in red, and obviously it has to be at most the capacity. The excess of a vertex is the difference between its incoming and outgoing flow, for example E(v1) = 1 - (4 + 3) = -6. We can also compute the excess of a set of vertices, such as E({v1, v2}). Then we treat v1 and v2 as one block and only count the edges entering or leaving the block: E({v1, v2}) = 10 + 1 - 3 = 8. We get the same result if we compute E(v1) and E(v2) separately and sum them up.
[Figure: flow]
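A small sketch of the excess computation, assuming the flow is stored as a dictionary of edge values. The numbers below are illustrative rather than the ones in the diagram (apart from v1, whose excess comes out to the same -6); the rule is the same: incoming minus outgoing, with edges internal to the set cancelling out.

```python
# Flow assignment: {(u, v): amount of flow pushed along edge u -> v}. Made-up values.
flow = {
    ("s", "v1"): 1,
    ("v2", "v1"): 0,   # hypothetical edge carrying no flow
    ("v1", "v2"): 4,
    ("v1", "t"): 3,
    ("s", "v2"): 6,
    ("v2", "t"): 7,
}

def excess(nodes, flow):
    """Incoming flow minus outgoing flow for a set of nodes; internal edges cancel."""
    nodes = set(nodes)
    inflow = sum(f for (u, v), f in flow.items() if v in nodes and u not in nodes)
    outflow = sum(f for (u, v), f in flow.items() if u in nodes and v not in nodes)
    return inflow - outflow

print(excess({"v1"}, flow))                         # 1 - (4 + 3) = -6
print(excess({"v1", "v2"}, flow))                   # same as excess({"v1"}) + excess({"v2"})
print(excess({"v1"}, flow) + excess({"v2"}, flow))
```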
Now let's explain the s-t flow. The main property of an s-t flow is that every vertex other than the source s and the sink t must have equal incoming and outgoing flow, i.e. zero excess.
[Figure: stflow]
An s-t cut is a way of separating the graph into two parts such that the source goes into one part and the sink into the other; the value of the cut is the total capacity of the edges going from the source side to the sink side.
[Figure: stcut]
An s-t flow can also be expressed in terms of an s-t cut. In an s-t flow all vertices other than s and t have excess 0, so the value of the flow is determined by the excess of the source alone; adding any zero-excess vertices V to the source side, E({s} ∪ V) gives the same value, and every such set defines an s-t cut. The flow across a cut can never exceed the cut's capacity.
To solve the maximum flow problem we can use a residual graph. The residual graph keeps track of how much extra flow each edge can still take: we remove every edge once it gets saturated; in other words, we keep every edge whose flow is still less than its capacity. So what is our stopping condition? Eventually we reach a state where the sink can no longer be reached from the source through residual edges, and that is where we stop.
[Figure: residual]
At that point we have found the maximum flow for those capacities. By the max-flow min-cut theorem, this also gives us the minimum cut.
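Here is a compact sketch of that procedure in Python. It goes slightly beyond the text in one respect: besides the leftover forward capacities it also keeps reverse (undo) edges in the residual graph, which is what the standard Ford-Fulkerson / Edmonds-Karp method does. BFS is used to find augmenting paths, and the graph and capacities are made up for illustration.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Repeatedly find an s-t path in the residual graph (BFS here) and saturate it.
    capacity: dict {(u, v): c}. Stops when t is no longer reachable from s."""
    # Residual capacities: forward edges start at capacity, reverse edges at 0.
    residual = dict(capacity)
    for (u, v) in list(capacity):
        residual.setdefault((v, u), 0)

    def bfs_path():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for (a, b), r in residual.items():
                if a == u and r > 0 and b not in parent:
                    parent[b] = u
                    if b == t:
                        # Rebuild the path from t back to s.
                        path = [t]
                        while parent[path[-1]] is not None:
                            path.append(parent[path[-1]])
                        return list(reversed(path))
                    queue.append(b)
        return None   # sink unreachable: we are done

    flow = 0
    path = bfs_path()
    while path is not None:
        # Bottleneck = smallest residual capacity along the path.
        bottleneck = min(residual[(u, v)] for u, v in zip(path, path[1:]))
        for u, v in zip(path, path[1:]):
            residual[(u, v)] -= bottleneck   # edge gets closer to saturation
            residual[(v, u)] += bottleneck   # pushing back along it is now allowed
        flow += bottleneck
        path = bfs_path()
    return flow

# Made-up capacities for illustration (not the ones in the diagrams).
capacity = {("s", "a"): 4, ("s", "b"): 3, ("a", "b"): 2, ("a", "t"): 3, ("b", "t"): 5}
print(max_flow(capacity, "s", "t"))   # 7
```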

TF-IDF and Cosine similarity on Document Rank Retrieval

I was interested in learning about the algorithms that try to match our queries against documents and return the documents ranked from best match to least match. In this blog I have tried to answer the question of why we use tf-idf and cosine similarity.

The Jaccard similarity coefficient is quite straightforward. For a specific search query it ranks a document by how many words they share out of the total number of unique words between them. Rather than saying it like this, it is clearer to state it formally: it is the cardinality of the intersection of the two word sets divided by the cardinality of their union.

(1) \begin{equation*} Jaccard(A,B)= \frac{ | A \cap B |}{ | A \cup B |} \end{equation*}

Example:
query=ides of march

document1=caesar died in march
document2=the long march

jaccard(query, document1)=1/6
jaccard(query, document2)=1/5
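A few lines of Python reproduce these numbers, treating each text as a set of lowercase words:

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| over the sets of words."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

query = "ides of march"
document1 = "caesar died in march"
document2 = "the long march"

print(jaccard(query, document1))   # 1/6 ~ 0.167
print(jaccard(query, document2))   # 1/5 = 0.2
```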

But Jaccard is too binary to be great. In real life, if a word appears frequently in one document and rarely in another, the query should retrieve the first document ahead of the second, but the Jaccard coefficient does not count frequency at all; it only checks whether a word appears or not, so it is Boolean. The problem with Boolean queries is that they often return too many or too few results: with AND the result set becomes too small, and with OR it becomes too large. So long queries become too choosy and short queries return everything that matches.
We have detected a problem, so now it is time to solve it. We will try to solve it using term frequency weighting. This time we move away from the Boolean approach: the mere presence of a word in a document is not enough, its count matters too. So we do the same thing as above, but this time we count how many times each word appears. For example, the sentence "in the month of march, they are going for long march" votes twice for the document when the query is "march"; in fact its term frequency is 2/11.
tf(term, document) = number of times the term appears in the document / total number of words in the document

Let me restate the scenario once more so we can catch any bug in our assumption: suppose I am looking for a word, and one document has it once while another document has it 10 times. Obviously the one with 10 occurrences is more likely the one I am looking for, but it is not 10 times more important. So we can use a logarithmic weighting here:

(2) \begin{equation*} w= 1+log(tf) \end{equation*}
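A small sketch of both pieces, assuming base-10 logarithms (which matches the idf = 6 example further down):

```python
import math

def tf(term, document):
    """Raw term frequency: occurrences of the term divided by the document length."""
    words = document.lower().replace(",", "").split()
    return words.count(term) / len(words)

def log_weight(count):
    """Sub-linear weighting: 1 + log10(count) if the term occurs at all, else 0."""
    return 1 + math.log10(count) if count > 0 else 0

doc = "in the month of march, they are going for long march"
print(tf("march", doc))                # 2/11 ~ 0.18
print(log_weight(1), log_weight(10))   # 1.0 vs 2.0: ten times more occurrences is not ten times more important
```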

But guess what, which words are used most in our conversations? I am pretty sure words like "is", "are", "it" get the highest frequencies, but they appear in all the documents, which does not help us at all. In fact we are interested in the rare words, the antique ones.
This is why we introduce the inverse document frequency model. Let me say it again: the rare terms are the important ones here. For example, words like "the" and "to" are no help in recognizing a document, because they appear in every document. So in this model, the more documents a term appears in, the less important it is; in other words, its importance gets divided. So let's do it:

(3) \begin{equation*} idf=\frac{N}{df} \end{equation*}

where N = the total number of documents we have,
df = the number of documents that contain our query term.
But again its usefulness does not grow linearly, so let's use a logarithmic scale, for the same reason as above.

(4) \begin{equation*} idf=Log \frac{N}{df} \end{equation*}

more formally:

(5) \begin{equation*} idf(t,D)= \log \frac{N}{|\{d \in D : t \in d\}|} \end{equation*}

so now a term that occurs in only 1 document out of 1M gets idf = log(1,000,000/1) = 6 (using log base 10).

TF only cares about how many times a term appears in the document, and idf cares about rare words. But we need a proper balance: the word should be rare across the collection, yet appear multiple times in the document. So how do we balance them? We simply multiply the two. tf-idf is one of the best-known weighting schemes.
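Here is a minimal sketch of the combined weight on a made-up toy corpus, again assuming base-10 logarithms. The exact tf-idf variant differs between systems, so treat this as one reasonable choice rather than the definitive formula.

```python
import math

# A made-up toy corpus for illustration.
documents = [
    "caesar died in march",
    "the long march",
    "beware the ides of march",
]

def tf_idf(term, document, documents):
    """tf-idf = (1 + log10 count) * log10(N / df); 0 if the term is absent from the document."""
    words = document.lower().split()
    count = words.count(term)
    if count == 0:
        return 0.0
    df = sum(1 for d in documents if term in d.lower().split())
    idf = math.log10(len(documents) / df)
    return (1 + math.log10(count)) * idf

print(tf_idf("march", documents[0], documents))   # 0.0: "march" is in every document, so idf = log10(3/3) = 0
print(tf_idf("ides", documents[2], documents))    # rare word, idf = log10(3/1) > 0
print(math.log10(1_000_000 / 1))                  # the "idf = 6" example above: 1 occurrence among 1M documents
```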
Let me help you a little more with the implementation. When you are done, your tf-idf representation looks like a 1xm vector, somewhat like an array:
[It, is, March]=[0.33,0.33,0]
Remember how we draw vectors? If a vector is represented as [x,y], we just plot the line that goes through [0,0] and [x,y]. TF-IDF simply makes this multidimensional, nothing else. So we have represented documents using tf-idf, which is nice, but how do we retrieve ranks? We can build tf-idf vectors for both the query and the document and then calculate the vector distance. But here we face a new problem because of the vector nature: remember that [2,2] and [4,4] point in the same direction, the second just twice as long. This is a drawback of plain distance on tf-idf vectors, because the distance between a query and a document can end up very large even when they are about the same thing. To get around this drawback we introduce cosine similarity. The idea is pretty simple: instead of the distance, we calculate the angle between the vectors.

(6) \begin{equation*} cos(\theta)=\frac{A \cdot B}{| A | | B |} \end{equation*}

So from the angle we can now measure the similarity: the smaller the angle, the more similar the query and the document.
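A short sketch of the cosine computation on plain Python lists; note how [2, 2] and [4, 4] come out with similarity 1, which is exactly the behavior we wanted:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# [2, 2] and [4, 4] point in the same direction, so the angle is 0 and the similarity is 1,
# even though their Euclidean distance is not 0.
print(cosine_similarity([2, 2], [4, 4]))          # 1.0
print(cosine_similarity([1, 0, 0.5], [0, 1, 0]))  # 0.0: orthogonal vectors, nothing in common
```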