# Entropy

## Concept

- message: the bits transmitted from sender to receiver; may contain redundancy
- information: the part of the message that is new and useful to the receiver

An underlying concept is that information must be news to the recipient; in particular, telling the recipient something the recipient already knows conveys no information.

Entropy measures uncertainty:

- entropy = 0: no uncertainty; the message provides no information
- the larger the entropy, the larger the uncertainty, and the more information a message can convey

A property of entropy is that it is maximized when all the messages in the message space are equiprobable, $p(x)=1/n$ (most unpredictable), in which case $H(X)=\log_2 n$ bits.
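For reference, the entropy of a discrete random variable $X$ with outcome probabilities $p(x)$ is

$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$

measured in bits when the logarithm is base 2.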

Data compression: remove redundancy from the message, which increases the entropy per transmitted bit (the same information is packed into fewer bits).

Usually some of the actual choices are more likely than others, and in that case $H$ is always less than it would be if all choices were equally probable.
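A minimal sketch in plain Python (names are illustrative) comparing a uniform distribution with a skewed one over the same four outcomes:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4), the maximum
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits, less than log2(4)
```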

## Example

### Flipping a coin

- if the coin always lands heads (or always tails), there is no uncertainty and the entropy is 0
- if it is a fair coin (50%/50% chance of heads/tails), the entropy is 1 bit, the maximum
- if it is an unfair coin, the entropy is in (0, 1)
- flipping 2 fair coins: there are $2^2$ equally likely outcomes, so representing the result takes $\log_2 2^2 = 2$ bits

The two extremes for a single coin (see the sketch below):

- entropy = 0: a sure thing
- entropy = 1 bit: a fair coin, $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
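A small sketch (plain Python, illustrative names) of the binary entropy function $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$ for a coin that lands heads with probability $p$: it is 0 at $p = 0$ or $p = 1$ and peaks at 1 bit at $p = 0.5$:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a sure thing carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))  # 0.0, 0.469, 1.0, 0.469, 0.0
```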

### Rolling a die

- specifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes)
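Concretely: the coin flip has entropy $\log_2 2 = 1$ bit, while the fair die roll has entropy $\log_2 6 \approx 2.585$ bits.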

### Day of the Week

Send a message encoding what day of the week it is: the message must distinguish 7 values, $\log_2 7 \approx 2.81$ bits, so we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111 – unused).
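The fractional $\log_2 7 \approx 2.81$ bits is achievable in the limit by encoding many days together: for example, encoding 100 independent days jointly needs only $\lceil 100 \log_2 7 \rceil = 281$ bits, i.e. about 2.81 bits per day instead of 3.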

### Transmit 1000 bits

- one extreme: if the receiver already knows all the bits, no information is transmitted
- the other extreme: if the 1000 bits are independent and equally likely to be 0 or 1, the message carries 1000 bits (shannons) of information: maximum uncertainty, maximum entropy
- in between: partial predictability or correlations among the bits reduce the information below 1000 bits
- a 1000-bit message can deliver at most 1000 bits of information
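As a concrete in-between case (an illustrative assumption, not from the source): if each of the 1000 bits is independently 1 with probability 0.1, the per-bit entropy is $-0.1 \log_2 0.1 - 0.9 \log_2 0.9 \approx 0.47$ bits, so the message carries only about 470 bits of information even though 1000 bits are transmitted.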

### English vs Chinese

- English: 26 characters, low entropy per character
- Chinese: thousands of characters, high entropy per character
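Assuming (unrealistically) that every character is equally likely: an English letter carries at most $\log_2 26 \approx 4.7$ bits, while a character drawn from, say, a 5,000-character Chinese inventory carries up to $\log_2 5000 \approx 12.3$ bits; real text has lower per-character entropy because characters are far from equiprobable.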

### Sunset/Sunrise

- the sun rising tomorrow is a near-certain event, so a message announcing it carries almost no information (entropy close to 0)

## Cross-entropy

http://colah.github.io/posts/2015-09-Visual-Information/

Cross-entropy is an idea from information theory that allows us to describe how bad it is to believe the predictions of a model (such as a neural network), given what is actually true: it is the average number of bits needed to encode outcomes from the true distribution using a code optimized for the predicted distribution.
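A minimal sketch (plain Python, illustrative names) of the cross-entropy $H(p, q) = -\sum_x p(x) \log_2 q(x)$ between a true distribution $p$ and a predicted distribution $q$; it equals the entropy of $p$ when $q = p$ and grows as the beliefs get worse:

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits between two discrete distributions."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]                        # true distribution (fair coin)
print(cross_entropy(p, p))            # 1.0 bit: same as the entropy of p
print(cross_entropy(p, [0.9, 0.1]))   # ~1.74 bits: poor beliefs cost more bits
```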

## Normalized Entropy

- an important metric for classification (e.g. click-through-rate prediction)
- the log loss of the predictions normalized by the entropy of the background CTR of the data; a smaller NE generally means a better model
- sensitive to the calibration of the predictions
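One common formulation (a paraphrase, using 0/1 labels): with labels $y_i \in \{0, 1\}$, predicted probabilities $p_i$, and background CTR $\bar{p}$ (the average empirical CTR of the data),

$$\mathrm{NE} = \frac{-\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)}{-\left( \bar{p} \log \bar{p} + (1 - \bar{p}) \log (1 - \bar{p}) \right)}$$

so a trivial model that always predicts $\bar{p}$ has $\mathrm{NE} = 1$, and better-than-trivial models have $\mathrm{NE} < 1$.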

## External Links

entropy examples: https://www.nist.gov/sites/default/files/documents/2017/11/30/nce.pdf

normalized entropy: https://math.stackexchange.com/questions/395121/how-entropy-scales-with-sample-size

https://en.wikipedia.org/wiki/Entropy_(information_theory)#Efficiency