Huffman code is a type of optimal prefix code that is commonly used for lossless data compression. The algorithm has been developed by David A. Huffman. The technique works by creating a binary tree of nodes. Nodes count depends on the number of symbols.
The main idea is to transform plain input into variable-length code. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. The easiest way to understand how to create Huffman tree is to analyze following steps:
- Scan text for symbols (e.g. 1-byte characters) and calculate their frequency of occurrence. Symbol value with its count of occurrences is a single leaf.
- Start a loop.
- Find two smallest probability nodes and combine them into single node.
- Remove those two nodes from list and insert combined one.
- Repeat the loop until the list has only single node.
- This last single node represent a huffman tree.
The expected result:
The full source code is available at GitHub, written using C++11. The expected output of a program for custom text with 100 000 words: