- message: bits transmitted from sender to receiver, which may be redundant
- information: new and useful things delivered
An underlying concept is that information must be news to the recipient; in particular, telling the recipient something the recipient already knows conveys no information.
Entropy: measures uncertainty
- entropy = 0: no uncertainty, provides no information
- the larger the entropy, the greater the uncertainty, and the more information the message carries
A property of entropy is that it is maximized when all the messages in the message space are equiprobable, i.e. most unpredictable, in which case H = log2(n) for n messages.
Data Compression: removes redundancy in the message, increasing the entropy per transmitted bit.
Usually some of the actual choices are more likely than others, and in that case H will always be less than if the choices are all equally probable.
- if the coin always lands heads or always tails: no uncertainty, entropy = 0;
- if a fair coin, 50%/50% chance of head/tail, entropy = 1, the largest;
- if an unfair coin, entropy will be in (0, 1)
- flipping 2 fair coins: there are 2^2 = 4 states; to represent them with bits, we need log2(4) = 2 bits
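The coin examples above can be checked numerically with a small Shannon-entropy function (a minimal sketch; the function name is illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))          # always heads: 0 bits
print(entropy([0.5, 0.5]))     # fair coin: 1 bit, the maximum for 2 outcomes
print(entropy([0.9, 0.1]))     # biased coin: strictly between 0 and 1
print(entropy([0.25] * 4))     # two fair coins, 4 equiprobable states: 2 bits
```

Note that the biased coin falls strictly inside (0, 1), matching the unfair-coin bullet.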
non-fair coins: two extremes
- entropy = 0: sure thing
- entropy = 1: completely random, p(head) = p(tail) = 1/2
- specifying the outcome of a fair coin flip (two equally likely outcomes, 1 bit) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes, log2(6) ≈ 2.58 bits).
Send a message encoding what day of the week it is: we need a message that can encode 7 values, ceil(log2(7)) = 3 bits, i.e. we need 3 bits (000 – Monday, 001 – Tuesday, …, 110 – Sunday, 111 – unused).
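The bit-count rule above is just the base-2 logarithm rounded up; a quick sketch (helper name is illustrative):

```python
import math

def bits_needed(n):
    """Minimum number of whole bits needed to encode n distinct values."""
    return math.ceil(math.log2(n))

print(bits_needed(7))   # days of the week -> 3 bits
print(bits_needed(2))   # coin flip -> 1 bit
print(bits_needed(6))   # fair die -> 3 bits (log2(6) ≈ 2.58, rounded up)
```

With 3 bits we get 8 codes for 7 values, which is why one code (111) goes unused.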
- one extreme: if receiver already knows all the bits, no information is transmitted
- another extreme: if the 1000 bits are independent and uniformly random: 1000 bits (or shannons) of information, maximum uncertainty, maximum entropy.
- in between:
- a 1000-bit message can deliver at most 1000 bits of information
- English: 26 characters, low entropy per character
- Chinese: thousands of characters, high entropy per character
Cross-entropy is an idea from information theory that allows us to describe how bad it is to believe the predictions of the neural network, given what is actually true.
- an important metric for classification
- Normalized Entropy (NE): the log loss normalized by the entropy of the background CTR of the data. Smaller NE means a better model in general.
- sensitive to the calibration of the predictions.
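A minimal sketch of log loss and normalized entropy for binary labels, assuming NE is defined as log loss divided by the entropy of the background CTR (function names are illustrative):

```python
import math

def log_loss(y_true, p_pred):
    """Average cross-entropy (log loss) in bits for binary labels."""
    n = len(y_true)
    return -sum(y * math.log2(p) + (1 - y) * math.log2(1 - p)
                for y, p in zip(y_true, p_pred)) / n

def normalized_entropy(y_true, p_pred):
    """Log loss normalized by the entropy of the background CTR."""
    ctr = sum(y_true) / len(y_true)
    background = -(ctr * math.log2(ctr) + (1 - ctr) * math.log2(1 - ctr))
    return log_loss(y_true, p_pred) / background

y = [1, 0, 0, 1]
p = [0.8, 0.2, 0.3, 0.7]
print(log_loss(y, p))            # cross-entropy of predictions vs labels
print(normalized_entropy(y, p))  # < 1: better than always predicting the CTR
```

NE below 1 means the model's predictions carry more information than the background click rate alone; a perfectly calibrated but uninformative model scores exactly 1.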
entropy examples: https://www.nist.gov/sites/default/files/documents/2017/11/30/nce.pdf