top of page
  • Writer's pictureAlibek Jakupov

Information Theory: Step 3 Huffman Algorithm

Updated: Nov 19, 2021



Prefix codes

Prefix codes (sometimes called "prefix-free codes“) is a type of code system (typically a variable-length code) distinguished by its possession of the "prefix property", which requires that there is no codeword in the system that is a prefix (initial segment) of any other code word in the system.


A prefix code is a uniquely decodable code: a receiver can identify each word without requiring a special marker between words.


Examples of prefix codes:

  • Shannon-Fano code

  • Huffman code

  • Country calling codes


In this article we are going to implement our simple version of Huffman Encoding.

In 1951, David A. Huffman and his MIT Information Theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.
In doing so, the student outdid his professor, who had worked with Information Theory inventor Claude Shannon to develop a similar code.
A Huffman code is an optimal code found using the algorithm developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

Huffman code is one of the optimal codes using for lossless data compression.

  • The algorithm's output can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file).

  • Huffman's algorithm derives this table based on the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol.

  • As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols.

Step 1


Characters (or any message to be encoded) of the original alphabet are setted in descending order of probability.


Step 2

  • Sum the two minimum probabilities.

  • Characters of the original alphabet are setted again in descending order of new probabilities.


Step 3


Repeat the step 2 again and again until there is only 1 symbol with probability = 1.



Step 4


Construct a binary tree



Step 5


Set weights/bits




Step 6


Construct codewords



And our implementation of this algorithm:


from heapq import heappush, heappop, heapify
from collections import defaultdict
 
def encode(symb2freq):
 """Huffman encode the given dict mapping symbols to weights"""
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapify(heap)
 while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
 for pair in lo[1:]:
            pair[1] = '0' + pair[1]
 for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
 return sorted(heappop(heap)[1:], key=lambda p: (len(p[-1]), p))
 
txt = raw_input("Please, enter your sequence:")
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1
# in Python 3.1+:
# symb2freq = collections.Counter(txt)
huff = encode(symb2freq)
print "Symbol\tWeight\tHuffman Code"
for p in huff:
    print "%s\t%s\t%s" % (p[0], symb2freq[p[0]], p[1])

And on git hub



References:

  • Arndt C. Information Measures: Information and its Description in Science and Engineering.

  • Thomas Cover. Elements Of Information Theory.

239 views0 comments
bottom of page