Last Updated: 2021-11-19

What is Information

Differences between message and information

  • message: bits transmitted from sender to receiver, however may be redundant
  • information: new and useful things delivered

Information must be news to the recipient; telling something the recipient already knows conveys no information.


In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable’s possible outcomes.

Entropy is defined as:

H ( X ) = i = 1 n p i log 2 p i H(X) = - \sum_{i=1}^n p_i \log_2 p_i

In short, entropy measures uncertainty:

  • entropy = 0: no uncertainty, provides no information
  • the larger the entropy, the larger uncertainty, the more information
  • maximized when all the messages in the message space are equiprobable p ( x ) = 1 / n p(x)=1/n , i.e. most unpredictable, in which case H ( X ) = log n H(X)=\log n .
  • Usually some of the actual choices are more likely than others, and in that case H H will always be less than if the choices are all equally probable.


Flipping a coin

  • if an unfair coin, entropy will be in (0, 1)
    • if always tail or head, no uncertainty, entropy = 0;
  • if a fair coin, 50%/50% chance of head/tail, entropy = 0.5 log 0.5 0.5 log 0.5 - 0.5 \log 0.5 - 0.5 \log 0.5 = 1, the largest;
  • flipping 2 fair coins: there are 2 2 2^2 states, to use bit to represent, we need l o g 2 2 2 log_2 2^2 = 2 bits

Rolling a dice

Specifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a roll of a fair dice (six equally likely outcomes).

Day of the Week

Send a message encoding what day of the week it is: we need a message that can encode 7 values, log 2 7 = 2.8 \log_2 7 = 2.8 bits, i.e. we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111- unused).

Transmit 1000 bits

  • one extreme: if receiver already knows all the bits, no information is transmitted
  • another extreme: if 1000 bits are independent: 1000 bits(or shannons) of information, maximum uncertainty, maximum entropy.
  • in between:
H s = i = 1 n p i I e = i = 1 n p i log 2 p i H_s = \sum_{i=1}^n p_i I_e = - \sum_{i=1}^n p_i \log_2 p_i
  • 1000-bit message can deliver at most 1000-bit information

English vs Chinese

  • English: 26 characters, low entropy
  • Chinese: thousands of characters, high entropy

Data Compression

Remove redundancy in message, increase entropy.


Cross-entropy gives us a way to express how different two probability distributions are.

It is defined as follows, it depends on both p p and q q :

H p ( q ) = i = 1 n q i log 2 p i H_p(q) = - \sum_{i=1}^n q_i \log_2 p_i

Note that Cross-entropy isn’t symmetric: H p ( q ) H q ( p ) H_p(q) \neq Hq(p) .

Used as an alternative to squared error to evaluate the output of the network (comparing predictions against what is actually true). It is more useful in problems in which the targets are 0 and 1.

Normalized Entropy

Normalized Entropy: normalize the entropy by size

H n ( p ) = i = 1 n p i log b p i log b n H_n(p) = - \sum_{i=1}^n {p_i \log_b p_i \over \log_b n}
  • also called efficiency
  • Normalizing the entropy by log b n \log_b n gives H n ( p ) [ 0 , 1 ] H_n(p) \in [0, 1]
  • alternatively you can set b = n b=n and drop the normalization term
  • the log loss normalized by entropy based on data ctr. Smaller NE means better model in general.
  • sensitive to calibration of the prediction.