# Entropy

Updated: 2019-01-13

## Concept

• message: the bits transmitted from sender to receiver; they may be redundant
• information: the part of the message that is new and useful to the receiver

An underlying concept is that information must be news to the recipient; in particular, telling the recipient something the recipient already knows conveys no information.

Entropy: measures uncertainty

• entropy = 0: no uncertainty; the outcome is known in advance, so observing it provides no information

A property of entropy is that it is maximized when all the messages in the message space are equiprobable, $p(x)=1/n$, i.e. most unpredictable, in which case $H(X)=\log_2 n$ bits.
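A minimal sketch of this property, assuming a message space of $n = 4$ messages. The `entropy` helper and the skewed probabilities are illustrative choices, not part of the original notes.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Terms with p == 0 contribute nothing (p * log p -> 0 as p -> 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n            # all messages equally likely
skewed = [0.7, 0.1, 0.1, 0.1]    # illustrative, made-up probabilities

print(entropy(uniform))  # equals log2(n)
print(entropy(skewed))   # strictly smaller than log2(n)
```

Any distribution over the same four messages other than the uniform one should print a value below $\log_2 4 = 2$ bits.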

Data compression: removes redundancy from a message, which increases the entropy per transmitted bit.
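One rough way to see this effect is to compare the byte-frequency entropy of a redundant text before and after compression. The sketch below uses Python's standard `zlib` module. The vocabulary, the seed and the `byte_entropy` helper are assumptions made for illustration, and byte frequencies give only a crude estimate of the true entropy.

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data):
    """Empirical entropy in bits per byte, from the byte frequencies of data."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

random.seed(0)  # fixed seed so the run is repeatable
vocabulary = ["entropy", "message", "signal", "noise", "bit", "code", "channel", "source"]
text = " ".join(random.choice(vocabulary) for _ in range(5000)).encode("ascii")
compressed = zlib.compress(text, 9)

print(len(text), byte_entropy(text))              # long, few bits per byte
print(len(compressed), byte_entropy(compressed))  # shorter, more bits per byte
```

The compressed output is shorter, and its bytes are spread much more evenly over the 256 possible values, so each transmitted byte carries more information.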

Usually some of the actual choices are more likely than others, and in that case $H$ is always less than it would be if all the choices were equally probable.

## Example

### Flipping a coin

• if the coin always lands tails (or always heads), there is no uncertainty, entropy = 0;
• if the coin is fair, 50%/50% chance of head/tail, entropy = 1 bit, the largest possible for two outcomes;
• if the coin is unfair, entropy lies in (0, 1);
• flipping 2 fair coins: there are $2^2 = 4$ states, so representing the outcome takes $\log_2 2^2 = 2$ bits

Two extremes for a single coin:

• entropy = 0: sure thing
• entropy = 1: fair coin, fully random, $- 0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
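The coin cases can be checked numerically with a small sketch. The function name `binary_entropy` and the sample probabilities are illustrative assumptions.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a sure thing carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Sweep from "always tails" through a fair coin to "always heads".
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The value is 0 at both ends, peaks at 1 bit for the fair coin, and is symmetric around $p = 0.5$.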

### Rolling a die

• specifying the outcome of a fair coin flip (two equally likely outcomes, $\log_2 2 = 1$ bit) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes, $\log_2 6 \approx 2.58$ bits).

### Day of the Week

Send a message encoding what day of the week it is: we need a message that can encode 7 values, $\log_2 7 \approx 2.81$ bits, i.e. in practice we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111 – unused).
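The encoding can be sketched as follows, assuming a fixed-length code assigned in Monday-to-Sunday order. The variable names are illustrative.

```python
import math

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

exact_bits = math.log2(len(days))   # information in one of 7 equally likely days
code_bits = math.ceil(exact_bits)   # whole bits needed for a fixed-length code

# Assign each day a fixed-width binary codeword in list order.
codes = {day: format(i, f"0{code_bits}b") for i, day in enumerate(days)}
# Codewords left over because 2**code_bits exceeds the number of days.
unused = [format(i, f"0{code_bits}b") for i in range(len(days), 2 ** code_bits)]

print(round(exact_bits, 2), code_bits)
print(codes)
print(unused)
```

The gap between the exact $\log_2 7$ and the 3 bits actually used is the cost of rounding up to whole bits, and it shows up as the one unused codeword.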

### Transmit 1000 bits

• one extreme: if the receiver already knows all the bits, no information is transmitted
• another extreme: if the 1000 bits are independent and each is equally likely to be 0 or 1: 1000 bits (or shannons) of information, maximum uncertainty, maximum entropy.
• in between: the entropy is the expected information content $I_i = -\log_2 p_i$ of the possible outcomes,
$H_s = \sum_{i=1}^n p_i I_i = - \sum_{i=1}^n p_i \log_2 p_i$
• a 1000-bit message can deliver at most 1000 bits of information
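
As a worked instance of the in-between case, assume (purely for illustration) that each of the 1000 bits is independently 1 with probability 0.9 and 0 with probability 0.1. The entropy per transmitted bit and the total for the message are then roughly:

$H = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469$

$1000 \times 0.469 \approx 469$

So under this assumption the 1000-bit message carries only about 469 bits of information, well below the 1000-bit ceiling.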

### English vs Chinese

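• English: 26 letters, so at most $\log_2 26 \approx 4.7$ bits per character, low entropy per character
• Chinese: thousands of characters, so each character can carry more bits, high entropy per character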

## Cross-entropy

http://colah.github.io/posts/2015-09-Visual-Information/

Cross-entropy is an idea from information theory that allows us to describe how bad it is to believe the predictions of the neural network, given what is actually true.
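
A minimal sketch of this idea, using the standard definition $H(p, q) = -\sum_i p_i \log_2 q_i$ with $p$ the true distribution and $q$ the predicted one. The three prediction vectors are made-up examples. Base-2 logarithms are used here to stay in bits, whereas machine learning libraries typically use the natural logarithm.

```python
import math

def cross_entropy(p_true, q_pred):
    """Cross-entropy H(p, q) = -sum_i p_i * log2(q_i), in bits."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(q) for p, q in zip(p_true, q_pred) if p > 0)

truth = [1.0, 0.0, 0.0]        # one-hot label: the first class is the true one
confident = [0.9, 0.05, 0.05]  # hypothetical prediction, mostly right
unsure = [0.4, 0.3, 0.3]       # hypothetical prediction, hedging
wrong = [0.05, 0.9, 0.05]      # hypothetical prediction, confidently wrong

for name, q in [("confident", confident), ("unsure", unsure), ("wrong", wrong)]:
    print(name, cross_entropy(truth, q))
```

The penalty grows as the prediction puts less probability on what is actually true. When the prediction equals the true distribution, the cross-entropy reduces to the entropy of that distribution.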

## Normalized Entropy

• an important metric for classification
• the average log loss per example, normalized by the entropy of the background CTR (click-through rate) of the data. A smaller NE generally means a better model.
• sensitive to the calibration of the predictions.
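
A sketch of the metric, assuming the common definition: average log loss divided by the entropy of the average empirical CTR. The function name, the labels and the predictions below are made up for illustration.

```python
import math

def normalized_entropy(labels, predictions):
    """Average log loss divided by the entropy of the background CTR.

    labels: 0/1 outcomes (1 = click); predictions: predicted click probabilities.
    Assumes every prediction is strictly between 0 and 1 and that the labels
    contain both classes, so no logarithm is taken of zero.
    """
    n = len(labels)
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, predictions)
    ) / n
    ctr = sum(labels) / n  # background (average empirical) CTR
    background_entropy = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return log_loss / background_entropy

labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # made-up data, CTR = 0.2
baseline = [0.2] * len(labels)            # always predict the background CTR
informed = [0.7, 0.1, 0.1, 0.1, 0.6, 0.1, 0.2, 0.1, 0.1, 0.1]  # hypothetical model
overshoot = [0.6] * len(labels)           # badly calibrated constant prediction

print(normalized_entropy(labels, baseline))
print(normalized_entropy(labels, informed))
print(normalized_entropy(labels, overshoot))
```

Always predicting the background CTR gives an NE of 1, a model that separates clicks from non-clicks scores below 1, and a miscalibrated prediction can score above 1, which is the calibration sensitivity noted above.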

https://en.wikipedia.org/wiki/Entropy_(information_theory)#Efficiency