Entropy is the fundamental measure of information in Information Theory and is extensively useful in Machine Learning. Let us introduce the concepts: Entropy, Joint Entropy, Conditional Entropy, and Mutual Information.

## Entropy

Entropy is defined as the expected number of bits needed to encode the values of a variable:

H(X) = P(x1)*#bits(x1) + P(x2)*#bits(x2) + ... + P(xn)*#bits(xn)

where X is a discrete variable that can take the values {x1, x2, …, xn}. There is a nice video by Puskar Kolhe that helps us understand the concept.

Values with higher probability should be assigned fewer bits, since that reduces the overall size of the encoded data. An optimal code satisfies

P(xi) = 1 / 2^(#bits(xi))

Which leads to,

#bits(xi) = logBase2(1/P(xi)) => #bits(xi) = -logBase2(P(xi))
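A quick numeric check of this formula, using a few example probabilities (the specific values here are just illustrations):

```python
import math

# #bits(xi) = -log2(P(xi)): higher-probability values need fewer bits
for p in [0.5, 0.25, 0.125]:
    bits = -math.log2(p)
    print(f"P = {p}: {bits:.0f} bits")
```

A value with probability 1/2 needs 1 bit, 1/4 needs 2 bits, 1/8 needs 3 bits, matching the intuition that rarer values get longer codes.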

Substituting this back, the entropy definition becomes

H(X) = SumOver_i[ -P(xi) * logBase2(P(xi)) ]
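The definition above translates directly into code. Here is a minimal sketch (the function name `entropy` and the sample distributions are my own choices for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = sum_i -P(xi) * log2(P(xi))."""
    # Terms with P(xi) = 0 contribute nothing, so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly 1 bit; a biased coin carries less.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # ~0.469
```

Note how a less uniform distribution has lower entropy: its outcomes are more predictable, so on average fewer bits are needed to describe them.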

## Joint Entropy

This measures the information contained jointly in two discrete variables X and Y:

H(X,Y) = SumOver_ij[ -P(xi,yj) * logBase2(P(xi,yj)) ]

Here, `P(xi,yj)` is the joint probability of the variables X and Y. If X and Y are independent, the joint entropy is simply the sum of the individual entropies, since neither variable tells us anything about the other:

H(X,Y) = H(X) + H(Y)
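We can verify this additivity numerically. A minimal sketch, where the joint distribution is represented as a dict mapping `(i, j)` pairs to probabilities and the independent variables are hypothetical examples:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_entropy(joint):
    """H(X,Y) = sum_ij -P(xi,yj) * log2(P(xi,yj))."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# For independent variables, P(x,y) = P(x) * P(y).
px = [0.5, 0.5]
py = [0.25, 0.75]
joint = {(i, j): px[i] * py[j] for i in range(2) for j in range(2)}

print(joint_entropy(joint))        # ~1.811
print(entropy(px) + entropy(py))   # ~1.811, same value
```

For a dependent pair the joint entropy would come out strictly smaller than H(X) + H(Y), because the overlap in information is counted only once.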

## Conditional Entropy

This measures the additional information needed to describe Y once the other variable, X, is already known. Just swap the last factor in the joint-entropy formula with `P(yj|xi)`:

H(Y|X) = SumOver_ij[ -P(xi,yj) * logBase2(P(yj|xi)) ]

If X and Y are independent, then `H(Y|X) = H(Y)`, because knowing X does not help figure out the distribution of Y.

## Mutual Information

This is probably the most useful metric for measuring shared information between variables. The formula is:

I(X,Y) = H(Y) - H(Y|X)

It tells us how much knowing X reduces our uncertainty about Y: the entropy of Y minus the entropy of Y that remains once X is known.

It is important to remember that for two independent variables X and Y, the mutual information is `I(X,Y) = 0`. If there is any dependency, then `I(X,Y) > 0`.
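Both properties can be checked directly from the definition `I(X,Y) = H(Y) - H(Y|X)`. A minimal sketch, with the two extreme cases as hypothetical examples:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint, px, py):
    """I(X,Y) = H(Y) - H(Y|X)."""
    h_y_given_x = -sum(p * math.log2(p / px[i])
                       for (i, j), p in joint.items() if p > 0)
    return entropy(py) - h_y_given_x

px = py = [0.5, 0.5]

# Independent fair bits: knowing X tells us nothing about Y.
independent = {(i, j): 0.25 for i in range(2) for j in range(2)}
print(mutual_information(independent, px, py))  # 0.0

# Perfectly dependent (Y == X): knowing X removes all uncertainty in Y.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(dependent, px, py))    # 1.0
```

Between these extremes, partial dependence yields a value strictly between 0 and min(H(X), H(Y)), which is why mutual information is so handy for, e.g., feature selection in Machine Learning.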