Updated: 2019-02-10


  • message: bits transmitted from sender to receiver, which may be redundant
  • information: the new and useful content delivered

An underlying concept is that information must be news to the recipient; in particular, telling the recipient something the recipient already knows conveys no information.

Entropy: measures uncertainty

  • entropy = 0: no uncertainty, a message provides no information
  • the larger the entropy, the greater the uncertainty, and the more information a message resolves

A property of entropy is that it is maximized when all the messages in the message space are equiprobable, p(x) = 1/n, i.e. most unpredictable, in which case H(X) = \log n.
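A minimal sketch of this property, computing Shannon entropy for a uniform vs. a skewed distribution over the same four outcomes (function name `entropy` is my own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n          # all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]  # same outcomes, unequal probabilities

print(entropy(uniform))  # log2(4) = 2.0 bits, the maximum for 4 outcomes
print(entropy(skewed))   # about 1.36 bits, always below the maximum
```

The uniform case hits the log n bound exactly; any skew lowers the entropy.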

Data Compression: remove redundancy from the message, increasing the entropy per transmitted bit.

Usually some of the actual choices are more likely than others, and in that case H will always be less than if the choices are all equally probable.


Flipping a coin

  • if always tails or always heads, no uncertainty, entropy = 0;
  • if a fair coin, 50%/50% chance of heads/tails, entropy = 1, the largest;
  • if an unfair coin, entropy will be in (0, 1)
  • flipping 2 fair coins: there are 2^2 = 4 states; to represent them in bits we need \log_2 2^2 = 2 bits

coins: two extremes

  • entropy = 0: sure thing
  • entropy = 1: fair coin, H = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1
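Both extremes, and everything in between, fall out of the binary entropy function (a sketch; `coin_entropy` is my own name for it):

```python
import math

def coin_entropy(p):
    """Binary entropy H(p) = -p log2(p) - (1-p) log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0  # sure thing: no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(coin_entropy(0.5))  # fair coin -> 1.0 bit, the maximum
print(coin_entropy(1.0))  # always heads -> 0.0 bits
print(coin_entropy(0.9))  # biased coin -> about 0.47 bits, strictly in (0, 1)
```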

Rolling a die

  • specifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes).
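Concretely, the entropy of a fair source with k equally likely outcomes is \log_2 k bits:

```python
import math

# Entropy of a fair source with k equally likely outcomes is log2(k) bits.
coin = math.log2(2)  # 1.0 bit per flip
die = math.log2(6)   # about 2.585 bits per roll
print(coin, die)     # the die outcome carries more information
```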

Day of the Week

Send a message encoding what day of the week it is: we need a message that can encode 7 values, \log_2 7 \approx 2.81 bits, i.e. we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111 – unused).
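A sketch of this fixed-length encoding (the round-up to whole bits is what leaves the 111 codeword unused):

```python
import math

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# log2(7) ~ 2.81 bits, so a fixed-length binary code needs ceil(log2 7) = 3 bits
bits_needed = math.ceil(math.log2(len(days)))
codes = {day: format(i, "03b") for i, day in enumerate(days)}

print(bits_needed)      # 3
print(codes["Monday"])  # 000
print(codes["Sunday"])  # 110
# 111 is the unused codeword
```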

Transmit 1000 bits

  • one extreme: if receiver already knows all the bits, no information is transmitted
  • another extreme: if the 1000 bits are independent and uniformly random: 1000 bits (or shannons) of information, maximum uncertainty, maximum entropy.
  • in between:
H_s = \sum_{i=1}^n p_i I_e = -\sum_{i=1}^n p_i \log_2 p_i
  • a 1000-bit message can deliver at most 1000 bits of information
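A sketch of the three regimes, modeling the message as 1000 i.i.d. bits each with probability p of being 1, so the total information is 1000 times the binary entropy of p:

```python
import math

def binary_entropy(p):
    """Entropy of a single bit with P(1) = p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n = 1000
print(n * binary_entropy(0.5))  # independent fair bits: 1000 bits of information
print(n * binary_entropy(1.0))  # receiver can predict every bit: 0 bits
print(n * binary_entropy(0.9))  # in between: about 469 bits
```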

English vs Chinese

  • English: 26 letters, lower entropy per character
  • Chinese: thousands of characters, higher entropy per character
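A rough upper-bound comparison, assuming (unrealistically) that every character is equally likely; 5000 is an assumed character-set size for illustration, and real per-character entropy is lower in both languages because usage is far from uniform:

```python
import math

# Upper bound on entropy per character, assuming uniform character usage.
english = math.log2(26)    # about 4.70 bits/char
chinese = math.log2(5000)  # about 12.29 bits/char (assumed 5000-character set)
print(english, chinese)
```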




Cross-entropy is an idea from information theory that allows us to describe how bad it is to believe the predictions of the neural network, given what is actually true.

Normalized Entropy

  • an important metric for classification
  • the log loss normalized by the entropy of the background CTR (average click-through rate) of the data. Smaller NE generally means a better model.
  • sensitive to calibration of the prediction.
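A sketch of NE as log loss divided by the entropy of the background CTR (the base of the logarithm cancels in the ratio; function names here are my own):

```python
import math

def log_loss(y_true, y_pred):
    """Average binary cross-entropy in bits (log base 2)."""
    n = len(y_true)
    return -sum(y * math.log2(q) + (1 - y) * math.log2(1 - q)
                for y, q in zip(y_true, y_pred)) / n

def normalized_entropy(y_true, y_pred):
    """Log loss divided by the entropy of the background CTR.

    NE < 1 means the model beats always predicting the average rate.
    """
    p = sum(y_true) / len(y_true)  # background CTR
    background = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return log_loss(y_true, y_pred) / background

y = [1, 0, 0, 1]            # clicks
q = [0.8, 0.2, 0.3, 0.7]    # predicted probabilities
print(normalized_entropy(y, q))  # < 1: better than the constant-CTR baseline
```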

entropy examples: https://www.nist.gov/sites/default/files/documents/2017/11/30/nce.pdf

normalized entropy: https://math.stackexchange.com/questions/395121/how-entropy-scales-with-sample-size