A note on Shannon information

The Shannon channel coding setup
Let be an alphabet of arbitrary symbols . We create random messages by drawing symbols from with probabilities . Thus, a message of length is a sequence of symbols of the alphabet , or:

We ask in how many bits we can transmit this message over a noiseless channel, i.e. a channel without random errors. More precisely, we are allowed to perform any transformation on the message resulting in a message which uses the alphabet . The only requirement is that we maintain the ability to decode an arbitrary message. Thus, the transformation must be an invertible map.
We will write for the length of the message. Let denote the  Shannon entropy of the distribution , i.e. the number defined as follows:

One calls the quantity the information contained in the event . In general, if is any event (in the sense of probability theory) then is the amount of information contained in the fact that occurred. We note that Shannon entropy is the mean value of the informations of the individual symbols.
Shannon's theorem gives a lower bound:

The use of the logarithm with base 2 in the above calculations is intentional if informations is to be measured in bits, and bits are used to transmit or store information.
The coding algorithm
There is an algorithm to code messages arbitrarily close to the Shannon entropy bound. The method is called the Shannon-Fano method. If the probabilities are all negative powers of 2, the Huffman algorithm yields the coding algorithm which is in perfect agreement with entropy.
Let us illustrate the Huffman code for and .  We note that any efficient code must use short representations of frequently used symbols and long representations for rare symbols. The Huffman code is a code with this property. For our example:

  • Symbol 1 is translated to 1
  • Symbol 2 is translated to the sequence 00
  • Symbol 3 is translated to the sequence 01

For example:

We note that Huffman code has the prefix property which allows us to decode Huffman encoded messages. This property says that if no symbol code is the prefix of the code for another symbol. The prefix property is not necessary to be able to invert a substitution code. However, we can see that the lack of this property makes it necessary to look ahead before decoding a symbol.
We note that the information of each symbol coincides with the number of bits in the symbol's code. This observation automatically proves that the Huffman code realizes the bound in the Shannon theorem.
What are the savings (for the example above)?
The compression ratio is the ratio and reflects the savings in space or transmission time resulting from coding of the message. In our previous example, the entropy is:

Thus, the optimal coding method should issue bits per symbol of the message. We claim that the Huffman code described above is optimal. Indeed, a message of length conforming to the distribution should have 1's, 2's and 3's. Thus, the length of the code can be computed directly by counting the bits:

We note that . Thus the gain of using the bias of the distribution towards 1 vs. the unbiased distribution is:

This amounts to about 5 percent, but it proves the principle.
Indeed, much more significant savings are possible, and they are fundamentally important to such technological achievements as digital television.
We note that the bit rate of for the distribution can be realized by a technique called arithmetic coding.