# Entropy

## What is Information

Differences between a **message** and **information**:

- **message**: the bits transmitted from sender to receiver; these *may be redundant*
- **information**: the *new* and *useful* things delivered

Information must be news to the recipient; telling something the recipient already knows conveys no information.

## Entropy

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable’s possible outcomes.

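**Entropy** is defined as:

$H(X) = -\sum_{x} p(x) \log_2 p(x)$

where $p(x)$ is the probability of outcome $x$; with a base-2 logarithm, $H$ is measured in bits.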

In short, **entropy measures uncertainty**:

- entropy = 0: no uncertainty; the outcome provides no information
- the larger the entropy, the greater the uncertainty and the more information each outcome carries
- entropy is maximized when all the messages in the message space are equiprobable, $p(x)=1/n$, i.e. most unpredictable, in which case $H(X)=\log n$
- usually some of the actual choices are more likely than others, and in that case $H$ is always less than it would be if all choices were equally probable
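The properties above can be checked numerically. The following is a minimal sketch, assuming a distribution is passed as a plain list of probabilities; the function name `entropy` and its `base` parameter are illustrative choices, not from the original notes.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped: p * log(p) tends to 0 as p tends to 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

certain = entropy([1.0])                  # a single certain outcome
uniform = entropy([0.25] * 4)             # four equiprobable outcomes
skewed = entropy([0.7, 0.1, 0.1, 0.1])    # same four outcomes, unequal odds

print(certain, uniform, skewed)
```

The certain outcome gives zero entropy, the uniform case reaches $\log_2 4$, and the skewed distribution falls in between.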

## Example

### Flipping a coin

- if the coin is unfair but both sides can still come up, entropy lies in (0, 1)
- if the coin always lands tails (or always heads), there is no uncertainty: entropy = 0
- if the coin is fair (50%/50% chance of heads/tails), entropy = $- 0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$ bit, the largest possible value
- flipping 2 fair coins: there are $2^2$ equally likely states, so representing the outcome takes $\log_2 2^2 = 2$ bits
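The coin cases can be reproduced with a small sketch. The helper name `binary_entropy` is a hypothetical choice for illustration, and the bias 0.9 is just an example value.

```python
import math

def binary_entropy(p_heads):
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # an impossible side contributes nothing
            total -= p * math.log2(p)
    return total

for p in (0.5, 0.9, 1.0):
    print(f"P(heads)={p}: {binary_entropy(p):.3f} bits")

# Two independent fair coins: the entropies add up.
two_coins = 2 * binary_entropy(0.5)
print(two_coins)
```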

### Rolling a die

Specifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes).
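As a worked comparison, assuming base-2 logarithms and using the fact that $n$ equally likely outcomes have entropy $\log_2 n$:

$H(\text{coin}) = \log_2 2 = 1 \text{ bit}, \qquad H(\text{die}) = \log_2 6 \approx 2.585 \text{ bits}$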

### Day of the Week

Send a message encoding what day of the week it is: the message must be able to encode 7 values, which takes $\log_2 7 \approx 2.81$ bits, i.e. we need 3 bits (`000` for Monday, `001` for Tuesday, …, `110` for Sunday, with `111` unused).
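The encoding can be sketched as follows, assuming a fixed-width binary code assigned in order starting from Monday. The names `DAYS`, `codes`, and `unused` are illustrative.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Smallest whole number of bits that can distinguish 7 values.
bits_needed = math.ceil(math.log2(len(DAYS)))

# Fixed-width binary code per day, counting up from 000 for Monday.
codes = {day: format(i, f"0{bits_needed}b") for i, day in enumerate(DAYS)}

# Bit patterns left over once every day has a code.
unused = [format(i, f"0{bits_needed}b")
          for i in range(len(DAYS), 2 ** bits_needed)]

print(bits_needed, codes, unused)
```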

### Transmit 1000 bits

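- one extreme: if the receiver already knows all the bits, no information is transmitted
- the other extreme: if the 1000 bits are independent and each is equally likely to be 0 or 1, the message carries 1000 bits (or shannons) of information: maximum uncertainty, maximum entropy
- in between: when the bits are partly predictable, the message carries somewhere between 0 and 1000 bits of information

A 1000-bit message can therefore deliver at most 1000 bits of information.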

### English vs Chinese

- English: 26 letters, so lower entropy per character
- Chinese: thousands of characters, so higher entropy per character

### Data Compression

Compression removes redundancy from a message, which increases the entropy per bit of what is actually transmitted.
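A rough illustration, assuming Python's standard `zlib` module as the compressor (the notes do not name one): a highly redundant message shrinks dramatically, while bytes that already look random barely compress at all. The seed and message sizes are arbitrary choices.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

redundant = b"A" * 1000                                 # one symbol repeated
noisy = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random bytes

redundant_size = len(zlib.compress(redundant))
noisy_size = len(zlib.compress(noisy))

print(f"redundant: 1000 -> {redundant_size} bytes")
print(f"noisy:     1000 -> {noisy_size} bytes")
```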

## Cross-entropy

Cross-entropy gives us a way to express how different two probability distributions are.

It depends on both $p$ and $q$ and is defined as:

$H_p(q) = - \sum_{i=1}^n q_i \log_2 p_i$

Note that cross-entropy isn’t symmetric: in general, $H_p(q) \neq H_q(p)$.

Cross-entropy is used as an alternative to squared error for evaluating the output of a neural network (comparing predictions against what is actually true). It is more useful in problems where the targets are 0 and 1.
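A minimal sketch of the definition $H_p(q) = -\sum_i q_i \log_2 p_i$, assuming both distributions are lists of equal length and that $p_i > 0$ wherever $q_i > 0$ (otherwise the logarithm is undefined). The function name and the example distributions are illustrative.

```python
import math

def cross_entropy(q, p):
    """H_p(q): q weights each term, p appears inside the logarithm."""
    # Terms with q_i == 0 are skipped; p_i must be positive for the rest.
    return sum(-qi * math.log2(pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.25, 0.25]
p = [0.8, 0.1, 0.1]

print(cross_entropy(q, p))  # H_p(q)
print(cross_entropy(p, q))  # H_q(p), a different value
print(cross_entropy(q, q))  # equals the entropy of q
```

Swapping the arguments changes the result, which is the asymmetry noted above. Passing the same distribution twice recovers the ordinary entropy.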

## Normalized Entropy

Normalized entropy normalizes the entropy by the size $n$ of the outcome space:

$H_n(p) = - \sum_{i=1}^n {p_i \log_b p_i \over \log_b n}$

- also called *efficiency*
- normalizing the entropy by $\log_b n$ gives $H_n(p) \in [0, 1]$
- alternatively, you can set $b=n$ and drop the normalization term
- Normalized Entropy (NE) is also the name of a model metric: the log loss normalized by the entropy of the data's CTR (click-through rate); a smaller NE generally means a better model
- NE as a metric is sensitive to the calibration of the predictions
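The efficiency form, entropy divided by $\log n$, can be sketched as below. The function name is illustrative, and returning 0 for fewer than two outcomes is an assumption made here to avoid dividing by $\log 1 = 0$.

```python
import math

def normalized_entropy(probs):
    """Entropy of probs divided by its maximum possible value, log(n)."""
    n = len(probs)
    if n < 2:
        return 0.0  # assumption: a single outcome has no uncertainty
    # The log base cancels in the ratio, so natural log is fine here.
    h = sum(-p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)

print(normalized_entropy([0.25] * 4))   # uniform: the maximum
print(normalized_entropy([0.9, 0.1]))   # skewed: somewhere in (0, 1)
print(normalized_entropy([1.0, 0.0]))   # certain: the minimum
```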