Entropy
What is Information
Differences between message and information
 message: the bits transmitted from sender to receiver, which may be redundant
 information: the new and useful content delivered
Information must be news to the recipient; telling something the recipient already knows conveys no information.
Entropy
In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable’s possible outcomes.
Entropy is defined as:
$H(X) = -\sum_{i=1}^n p_i \log_2 p_i$
In short, entropy measures uncertainty:
 entropy = 0: no uncertainty, provides no information
 the larger the entropy, the larger the uncertainty, and the more information an outcome conveys
 maximized when all the messages in the message space are equiprobable, $p(x)=1/n$, i.e. most unpredictable, in which case $H(X)=\log_2 n$.
 Usually some of the actual choices are more likely than others, and in that case $H$ will always be less than if the choices are all equally probable.
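A minimal Python sketch of the definition above. The function name `entropy` and the example distributions are illustrative assumptions, not part of the original notes.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 contribute nothing, since p * log2(p) -> 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

certain = entropy([1.0, 0.0])            # one outcome is sure: no uncertainty
uniform = entropy([0.25] * 4)            # equiprobable: log2(4) bits
skewed = entropy([0.7, 0.1, 0.1, 0.1])   # some outcomes more likely: below log2(4)

print(certain, uniform, skewed)
```

The skewed distribution has the same four outcomes as the uniform one but lower entropy, matching the last bullet above.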
Example
Flipping a coin

 if an unfair coin that can still land on either side, entropy is in (0, 1);
 if always tail or head, no uncertainty, entropy = 0;
 if a fair coin, 50%/50% chance of head/tail, entropy = $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$, the largest;
 flipping 2 fair coins: there are $2^2$ states; to represent them in bits, we need $\log_2 2^2 = 2$ bits
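As a worked case for the unfair coin (the 90%/10% bias is an assumed example, not taken from the notes):
$H = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.137 + 0.332 = 0.469$ bits, which lies in (0, 1) as expected.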
Rolling a die
Specifying the outcome of a fair coin flip (two equally likely outcomes, 1 bit) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes, $\log_2 6 \approx 2.58$ bits).
Day of the Week
Send a message encoding what day of the week it is: the message must encode 7 values, which takes $\log_2 7 \approx 2.81$ bits, so in practice we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111 unused).
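The fixed-length code described above can be sketched in Python as follows. The names `DAYS`, `encode`, and `decode` are hypothetical, and the Monday-first ordering simply follows the example.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# log2(7) is not an integer, so round up to a whole number of bits.
bits_needed = math.ceil(math.log2(len(DAYS)))

# Fixed-length code: each day's index written as a zero-padded binary string.
encode = {day: format(i, f"0{bits_needed}b") for i, day in enumerate(DAYS)}
decode = {code: day for day, code in encode.items()}

print(bits_needed, encode["Monday"], encode["Sunday"])
```

With 3 bits there are 8 code words for 7 days, so one code word (111) never appears in `decode`.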
Transmit 1000 bits
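 one extreme: if the receiver already knows all the bits, no information is transmitted
 another extreme: if the 1000 bits are independent and each is equally likely to be 0 or 1: 1000 bits (or shannons) of information, maximum uncertainty, maximum entropy.
 in between: the bits are partly predictable, so the information lies between 0 and 1000 bits
 a 1000-bit message can deliver at most 1000 bits of information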
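A rough Python sketch of the two extremes and one case in between. It assumes the bits are independent and that each bit is 1 with the same probability `p`. That modelling choice, along with the names `binary_entropy` and `message_information`, is an assumption made for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a single bit that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # fully predictable bit carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def message_information(n_bits, p):
    """Expected information in n_bits independent bits, each 1 with probability p."""
    return n_bits * binary_entropy(p)

known = message_information(1000, 1.0)     # receiver can predict every bit
maximal = message_information(1000, 0.5)   # independent fair bits
between = message_information(1000, 0.9)   # biased bits, about 469 bits

print(known, maximal, between)
```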
English vs Chinese
 English: 26 letters, lower entropy per character
 Chinese: thousands of characters, higher entropy per character
Data Compression
Compression removes redundancy from a message, which increases the entropy per transmitted bit.
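One way to see this effect is to compare the empirical byte entropy of a redundant message before and after compression. The sketch below uses Python's standard `zlib` module. The helper `byte_entropy`, the two-symbol source, and the fixed seed are assumptions chosen for illustration.

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data):
    """Empirical entropy in bits per byte, estimated from byte frequencies."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A redundant source: only two symbols, one much more likely than the other.
rng = random.Random(0)
text = bytes(rng.choices(b"ab", weights=[9, 1], k=10_000))

packed = zlib.compress(text)

print(len(text), byte_entropy(text))      # long message, low entropy per byte
print(len(packed), byte_entropy(packed))  # shorter message, higher entropy per byte
```

The compressed message is shorter and carries the same content, so each of its bytes has to carry more information.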
Cross-entropy
Cross-entropy gives us a way to express how different two probability distributions are.
It depends on both $p$ and $q$ and is defined as:
$H_p(q) = -\sum_{i=1}^n q_i \log_2 p_i$
Note that cross-entropy isn't symmetric: in general $H_p(q) \neq H_q(p)$.
Cross-entropy is used as an alternative to squared error to evaluate the output of a neural network (comparing predictions against what is actually true). It is especially useful in problems in which the targets are 0 and 1.
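A small Python sketch of the formula, following the notes' notation in which $q$ weights the terms and $p$ sits inside the logarithm. The function name and the example distributions are illustrative assumptions.

```python
import math

def cross_entropy(p, q):
    """H_p(q) = -sum_i q_i * log2(p_i), in bits.

    Raises ValueError if some p_i is 0 where q_i > 0 (log2 of zero).
    """
    return -sum(qi * math.log2(pi) for pi, qi in zip(p, q) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]

h_p_q = cross_entropy(p, q)   # q weights the terms, p is inside the log
h_q_p = cross_entropy(q, p)   # roles swapped: a different value
h_q_q = cross_entropy(q, q)   # equals the entropy of q

# 0/1 targets: the true class has probability 1, so only its term survives.
target = [1.0, 0.0]
prediction = [0.9, 0.1]
loss = cross_entropy(prediction, target)   # -log2(0.9)

print(h_p_q, h_q_p, h_q_q, loss)
```

Swapping the arguments changes the result, and `cross_entropy(q, q)` is never larger than `cross_entropy(p, q)`.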
Normalized Entropy
Normalized Entropy: the entropy divided by its maximum value $\log_b n$, where $n$ is the number of outcomes
$H_n(p) = -\sum_{i=1}^n {p_i \log_b p_i \over \log_b n}$, also called efficiency
 Normalizing the entropy by $\log_b n$ gives $H_n(p) \in [0, 1]$
 alternatively you can set $b=n$ and drop the normalization term
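 Normalized Entropy (NE) also names a model metric: the log loss normalized by the entropy of the background CTR (click-through rate) in the data. A smaller NE generally means a better model.
 NE is sensitive to the calibration of the predictions.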
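Both uses of the term can be sketched in Python. The `efficiency` function follows the formula above. The `normalized_entropy` function assumes NE means the average log loss divided by the entropy of the empirical CTR, which is one reading of the last two bullets. All names and sample data are hypothetical.

```python
import math

def efficiency(probs):
    """Entropy of probs divided by log(n); needs n >= 2. Result is in [0, 1]."""
    n = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)

def normalized_entropy(labels, preds):
    """Average log loss divided by the entropy of the empirical CTR.

    Assumes labels are 0/1, predictions lie strictly between 0 and 1,
    and the data contains both clicks and non-clicks.
    """
    n = len(labels)
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, preds)
    ) / n
    ctr = sum(labels) / n
    ctr_entropy = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return log_loss / ctr_entropy

labels = [1, 0, 0, 0]
baseline = normalized_entropy(labels, [0.25] * 4)           # always predict the CTR
better = normalized_entropy(labels, [0.7, 0.1, 0.1, 0.1])   # sharper predictions

print(efficiency([0.25] * 4), baseline, better)
```

Under this reading, a model that always predicts the background CTR scores an NE of 1, and a model is useful only to the extent that it pushes NE below 1.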