Differences between message and information
- message: the bits transmitted from sender to receiver; some of those bits may be redundant
- information: the new and useful content the message delivers
Information must be news to the recipient; telling something the recipient already knows conveys no information.
In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable’s possible outcomes.
Entropy is defined as:
H(X) = -Σ_x p(x) log_b p(x)
where b is the base of the logarithm; with b = 2 the unit is the bit (shannon).
In short, entropy measures uncertainty:
- entropy = 0: no uncertainty, provides no information
- the larger the entropy, the greater the uncertainty, and the more information an outcome conveys
- maximized when all the messages in the message space are equiprobable, i.e. most unpredictable, in which case H(X) = log_b(n) for n possible messages.
- Usually some of the actual choices are more likely than others, and in that case H(X) will always be less than log_b(n), its value when the choices are all equally probable.
- for a biased coin, entropy is in (0, 1)
- if it always lands heads (or always tails), there is no uncertainty: entropy = 0
- for a fair coin, with a 50%/50% chance of heads/tails, entropy = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit, the maximum
- flipping 2 fair coins: there are 2^2 = 4 equally likely states, so to represent the outcome in bits we need log2 4 = 2 bits
Specifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes): log2 2 = 1 bit versus log2 6 ≈ 2.58 bits.
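The coin and die examples above can be checked with a minimal sketch (an `entropy` helper written for this note, not a library function):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin → 1.0 bit, the maximum for two outcomes
print(entropy([0.9, 0.1]))   # biased coin: strictly between 0 and 1
print(entropy([1.0, 0.0]))   # always heads: no uncertainty, 0 bits
print(entropy([1/6] * 6))    # fair die: log2(6) ≈ 2.58 bits
```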
Send a message encoding what day of the week it is: we need a message that can encode 7 values, log2 7 ≈ 2.81 bits, i.e. we need 3 bits (
000 – Monday,
001 – Tuesday, …,
110 – Sunday)
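The arithmetic behind the 3-bit encoding can be sketched as (`bits_needed` is a helper written for this note):

```python
import math

def bits_needed(n):
    """Smallest whole number of bits that can distinguish n equally likely values."""
    return math.ceil(math.log2(n))

print(math.log2(7))    # ≈ 2.807: information content of one day out of seven
print(bits_needed(7))  # → 3: whole bits needed to encode all 7 days
```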
Consider sending a 1000-bit message:
- one extreme: if the receiver already knows all the bits, no information is transmitted
- the other extreme: if all 1000 bits are independent and equiprobable, the message carries 1000 bits (or shannons) of information: maximum uncertainty, maximum entropy.
- in between:
- 1000-bit message can deliver at most 1000-bit information
- English: 26 letters, so at most log2 26 ≈ 4.7 bits of entropy per character
- Chinese: thousands of characters, so a much higher per-character entropy
Removing redundancy from a message increases its entropy per transmitted symbol.
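A rough illustration of the per-symbol upper bound log2(alphabet size), reached only when every symbol is equally likely (5000 is an assumed, illustrative count of common Chinese characters, not an exact figure):

```python
import math

# Per-symbol entropy is at most log2 of the alphabet size;
# real text has lower entropy because symbols are not equiprobable.
for name, size in [("English letters", 26), ("Chinese characters", 5000)]:
    print(f"{name}: at most {math.log2(size):.2f} bits/symbol")
```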
Cross-entropy gives us a way to express how different two probability distributions are.
It is defined as follows; it depends on both p (the true distribution) and q (the estimated distribution):
H(p, q) = -Σ_x p(x) log_b q(x)
Note that cross-entropy isn’t symmetric: H(p, q) ≠ H(q, p).
It is used as an alternative to squared error to evaluate the output of a network (comparing predictions against what is actually true), and is most useful in problems in which the targets are 0 and 1.
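A minimal sketch of cross-entropy and its asymmetry (the distributions p and q below are made up for illustration):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum(p * log2(q)); p is the true distribution, q the estimate."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.6, 0.4]
print(cross_entropy(p, q))  # differs from the next line: H(p, q) ≠ H(q, p)
print(cross_entropy(q, p))
print(cross_entropy(p, p))  # equals the entropy of p, the minimum over all q
```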
Normalized Entropy: normalize the entropy by its maximum possible value for the message-space size n
- also called efficiency
- Normalizing the entropy by log_b(n) gives the efficiency η(X) = H(X) / log_b(n), which lies in [0, 1]
- alternatively you can set b = n and drop the normalization term
- in CTR prediction, Normalized Entropy (NE) is the log loss normalized by the entropy of the background CTR of the data. A smaller NE generally means a better model.
- NE is sensitive to the calibration of the predictions.
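A sketch of NE under this definition (the labels and predictions are fabricated toy data, and `normalized_entropy` is a helper written for this note, not a library function):

```python
import math

def normalized_entropy(labels, preds):
    """Average log loss of the predictions divided by the entropy of the
    background CTR (the log loss of always predicting the average rate)."""
    n = len(labels)
    log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for y, p in zip(labels, preds)) / n
    ctr = sum(labels) / n  # background click-through rate
    background = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return log_loss / background

labels = [1, 0, 0, 1, 0]
good = [0.9, 0.1, 0.2, 0.8, 0.1]   # well-calibrated predictions
flat = [0.5, 0.5, 0.5, 0.5, 0.5]   # uninformative constant predictions
print(normalized_entropy(labels, good))  # well below 1: better than background
print(normalized_entropy(labels, flat))  # near (here above) 1: no better than background
```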