Entropy

What is Information

Differences between message and information

message: bits transmitted from sender to receiver, however may be redundant
information: new and useful things delivered

Information must be news to the recipient; telling something the recipient already knows conveys no information.

Entropy

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable’s possible outcomes.

Entropy is defined as:

H(X) = - \sum_{i=1}^n p_i \log_2 p_i

In short, entropy measures uncertainty:

entropy = 0: no uncertainty, provides no information
the larger the entropy, the larger uncertainty, the more information
maximized when all the messages in the message space are equiprobable $p(x)=1/n$ , i.e. most unpredictable, in which case $H(X)=\log n$ .
Usually some of the actual choices are more likely than others, and in that case $H$ will always be less than if the choices are all equally probable.

Example

Flipping a coin

if an unfair coin, entropy will be in (0, 1)
- if always tail or head, no uncertainty, entropy = 0;
if a fair coin, 50%/50% chance of head/tail, entropy = $- 0.5 \log 0.5 - 0.5 \log 0.5$ = 1, the largest;
flipping 2 fair coins: there are $2^2$ states, to use bit to represent, we need $log_2 2^2$ = 2 bits

Rolling a dice

Specifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a roll of a fair dice (six equally likely outcomes).

Day of the Week

Send a message encoding what day of the week it is: we need a message that can encode 7 values, $\log_2 7 = 2.8$ bits, i.e. we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111- unused).

Transmit 1000 bits

one extreme: if receiver already knows all the bits, no information is transmitted
another extreme: if 1000 bits are independent: 1000 bits(or shannons) of information, maximum uncertainty, maximum entropy.
in between:

H_s = \sum_{i=1}^n p_i I_e = - \sum_{i=1}^n p_i \log_2 p_i

1000-bit message can deliver at most 1000-bit information

English vs Chinese

English: 26 characters, low entropy
Chinese: thousands of characters, high entropy

Data Compression

Remove redundancy in message, increase entropy.

Cross-entropy

Cross-entropy gives us a way to express how different two probability distributions are.

It is defined as follows, it depends on both $p$ and $q$ :

H_p(q) = - \sum_{i=1}^n q_i \log_2 p_i

Note that Cross-entropy isn’t symmetric: $H_p(q) \neq Hq(p)$ .

Used as an alternative to squared error to evaluate the output of the network (comparing predictions against what is actually true). It is more useful in problems in which the targets are 0 and 1.

Normalized Entropy

Normalized Entropy: normalize the entropy by size

H_n(p) = - \sum_{i=1}^n {p_i \log_b p_i \over \log_b n}

also called efficiency
Normalizing the entropy by $\log_b n$ gives $H_n(p) \in [0, 1]$
alternatively you can set $b=n$ and drop the normalization term
the log loss normalized by entropy based on data ctr. Smaller NE means better model in general.
sensitive to calibration of the prediction.