
Large media and data files are naturally harder to handle, especially when it comes to managing storage space and transferring them between drives. The simplest solution is to compress them, which can be done in many different ways. There are two kinds of compression: lossy and lossless. Here, we explain and compare these two categories to help you understand where, why, and how each can be used.
Did You Know? A good practical example of a real application of data compression is Morse code, which dates back to the 19th century. In Morse code, letters that are used more often (like vowels) are assigned shorter codewords than those that are rarely used, so as to optimize coding efficiency. For example, the letter ‘E’ is represented by a single symbol (one dot), whereas the coded representation of the letter ‘X’ contains four symbols.
Data compression is a prerequisite for communication of any kind. For more than a century, scientists, mathematicians, and software professionals have been developing and refining the techniques of data compression in use, striving for better algorithms that raise the prevailing technological standards.
Not just telecommunication, but virtually every branch of the diverse discipline of electronics thrives on data compression, including data, signal, image, audio, and video processing.
How are techniques of data compression implemented?
The function of data compression is quite literal: it aims to “condense” a large amount of data into a smaller volume when it is stored or transmitted, so that it requires less space. From the point of view of telecommunication, this translates to a lower requirement for (and hence utilization of) channel bandwidth.
However, all the data being processed is digital in nature, and thus, before any of the aforementioned tasks (transporting it over cables, or storing it on CDs, DVDs, a computer’s hard drive, or any other storage device) can be completed, it must be transformed into an appropriate format.
This means that a unique ‘codeword’ is assigned to each data symbol, to each data sample drawn from the data at regular intervals (in a process known as sampling), or to each group of symbols or samples. It is this coded data that is transmitted or stored. The original, raw data must then be extracted from the coded stream whenever it is accessed from storage, or when it is received at the end of a communication channel.
Techniques of data compression are implemented at this (data transformation) level: raw data is encoded using optimized algorithms that additionally fulfill the task of data compression.
It is worth noting that in certain situations where the amount of data in question is very small, the volume of encoded data may even exceed that of the raw data, making compression inefficient.
Data compression can essentially be divided into two branches: lossless compression and lossy compression. We will now attempt to simplify these two broad categories so that they can be compared, and their specific applications studied.
Lossless Data Compression
♦ In simple words, lossless compression transforms data in such a way that no data is discarded, quantized, or otherwise lost, yet adequate data condensation is achieved.
♦ The techniques used in this case work on the principle of breaking down large streams of data into smaller ‘codewords’, making transmission, storage, and reconstruction much simpler.
♦ The most important characteristic of lossless compression, the one that sets it apart, is that at the destination, the entire original data (or message signal), down to the last bit, can be reconstructed from the received coded data.
Techniques for Lossless Compression
**Entropy Coding:- This category covers all coding methods in which the frequency of occurrence of data symbols is taken into account while mapping them to codewords. Symbols that occur more often are assigned shorter codewords, whereas those with a lower frequency of occurrence are assigned longer ones. These processes are completely lossless.
It is also interesting to note that a quantity known as entropy, defined by mathematician Claude Shannon in 1948, gives the best possible representation of a given sequence (the minimum average number of bits per symbol, and hence the maximum lossless compression possible). Mathematically, it is written as:
H(X) = −∑[P(xi)∗log2(P(xi))]
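For readers who want to experiment, here is a minimal Python sketch of this formula (the function name and the example distributions are our own):

```python
import math

def entropy(probabilities):
    # H(X) = -sum of P(x) * log2(P(x)), in bits per symbol
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin needs a full bit per toss; a biased coin needs less.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # approximately 0.469
```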
The most common algorithms that achieve lossless compression are the Huffman Coding Algorithm, the Arithmetic Coding Method, and the Lempel-Ziv-Welch (LZW) Coding Mechanism. We have briefly explained how compression is achieved by each of these methods, with illustrative examples.
**Note: The DEFLATE algorithm is a combination of the LZ77 Algorithm, one of the earliest methods, developed by Abraham Lempel and Jacob Ziv (which later contributed to the development of the LZW Coding Mechanism), and the Huffman Coding Algorithm.
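DEFLATE is what Python's standard zlib module implements, so a brief sketch can demonstrate it in action; the sample data is our own, and the exact compressed size will vary with the input:

```python
import zlib

# Repetitive data compresses well under DEFLATE (LZ77 + Huffman coding).
data = b"ABABBABCABABBA" * 100
compressed = zlib.compress(data, 9)  # 9 = maximum compression level

print(len(data), "->", len(compressed), "bytes")
assert zlib.decompress(compressed) == data  # lossless: exact recovery
```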
Example #1: To Illustrate How the Huffman Coding Algorithm Achieves Compression
A sequence of data symbols, where each symbol is selected from the set {A, B, C, D, E, F}, has the following distribution of the frequency of occurrence of symbols.
Symbol | Frequency |
A | 15 |
B | 7 |
C | 5 |
D | 23 |
E | 17 |
F | 19 |
Encoding Without Compression
The sequence contains 86 symbols. Since there are 6 distinct symbols, the number of bits required to represent each symbol would be 3, because 2² = 4 < 6 ≤ 2³ = 8. Now, if the entire sequence is converted to binary without any compression, the number of bits in the encoded sequence would be 86 × 3 = 258.
After Encoding With Huffman Coding Algorithm
After using the Huffman Coding Algorithm to generate codewords for the given distribution, the following results can be obtained:
Symbol | Assigned Codeword | No. of Bits Per Codeword | Frequency | Total No. of Bits Per Symbol |
A | 010 | 3 | 15 | 3 ∗ 15 = 45 |
B | 0110 | 4 | 7 | 4 ∗ 7 = 28 |
C | 0111 | 4 | 5 | 4 ∗ 5 = 20 |
D | 00 | 2 | 23 | 2 ∗ 23 = 46 |
E | 11 | 2 | 17 | 2 ∗ 17 = 34 |
F | 10 | 2 | 19 | 2 ∗ 19 = 38 |
If the entire Huffman-encoded sequence is converted to binary, the total number of bits in the encoded sequence would be ∑(Total No. of Bits Per Symbol) = 211.
In comparison to encoding the raw data, the compressed encoded sequence requires 47 bits less.
For a set of symbols that includes the 26 letters of the English alphabet, the 10 digits, and a wide range of special characters, the amount of compression that can be achieved for any arbitrary sequence (such as a text file) is very significant.
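To verify the codeword lengths in the table above, here is a minimal Python sketch of the Huffman tree construction (the helper name is our own; it derives only the code lengths, which are what determine the total bit count):

```python
import heapq

def huffman_code_lengths(freqs):
    # Each heap entry: (weight, tiebreaker, {symbol: depth so far}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every leaf inside them
        # moves one level deeper, i.e. gains one bit.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

freqs = {"A": 15, "B": 7, "C": 5, "D": 23, "E": 17, "F": 19}
lengths = huffman_code_lengths(freqs)
print(lengths)  # D, E, F -> 2 bits; A -> 3 bits; B, C -> 4 bits
print(sum(lengths[s] * freqs[s] for s in freqs))  # 211
```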
Example #2: To Illustrate How the Arithmetic Coding Method Achieves Compression
A sequence of data symbols, where each symbol is selected from the set {A, B, C, D, #}, has the following distribution of the probability of occurrence of symbols.
Symbol | Probability |
A | 0.2 |
B | 0.4 |
C | 0.1 |
D | 0.2 |
# | 0.1 |
Entropy of the Distribution
As mentioned earlier, according to information theory, the entropy of a sequence (or probability distribution) represents the limit of lossless compression: the minimum average number of bits per symbol. The entropy of this distribution can be calculated as follows:
H(X) = − [ 2 ∗ 0.2 ∗ log2(0.2) + 2 ∗ 0.1 ∗ log2(0.1) + 0.4 ∗ log2(0.4) ] ≈ 2.1219 bits per symbol.
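A one-line check in Python confirms this value:

```python
import math

probs = [0.2, 0.4, 0.1, 0.2, 0.1]
print(-sum(p * math.log2(p) for p in probs))  # approximately 2.1219
```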
After Encoding With Arithmetic Coding Method
Since arithmetic coding essentially maps the probabilities of all symbols onto a single number line between 0 and 1, the following data can be obtained:
Symbol | Probability | Cumulative Probability | Intervals on the Number Line |
A | 0.2 | 0.2 | [0, 0.2) |
B | 0.4 | 0.6 | [0.2, 0.6) |
C | 0.1 | 0.7 | [0.6, 0.7) |
D | 0.2 | 0.9 | [0.7, 0.9) |
# | 0.1 | 1 | [0.9, 1.0) |
The number of bits required to represent a data symbol is the number of bits required to single out a real number belonging to the interval that its probability maps to. Mathematically, the number of bits required to represent an interval of arbitrary size S is −log2(S).
It can therefore be seen that the average number of bits per symbol needed to represent a sequence encoded using the Arithmetic Coding Method approaches its entropy, which, as mentioned earlier, corresponds to the maximum amount of lossless compression possible.
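The interval-narrowing step can be sketched in a few lines of Python (the function name and the sample sequence "BAD#" are our own; a real encoder would also pick a short binary fraction inside the final interval):

```python
import math

def narrow_interval(sequence, probs):
    # Assign each symbol a sub-interval of [0, 1) from its probability.
    intervals, start = {}, 0.0
    for sym, p in probs.items():
        intervals[sym] = (start, start + p)
        start += p
    # Each incoming symbol narrows [low, high) to its sub-interval.
    low, high = 0.0, 1.0
    for sym in sequence:
        lo, hi = intervals[sym]
        span = high - low
        low, high = low + span * lo, low + span * hi
    return low, high

probs = {"A": 0.2, "B": 0.4, "C": 0.1, "D": 0.2, "#": 0.1}
low, high = narrow_interval("BAD#", probs)
print(low, high)  # width 0.0016 = P(B) * P(A) * P(D) * P(#)
print(-math.log2(high - low))  # about 9.3 bits pin down this interval
```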
Example #3: To Illustrate How the Lempel-Ziv-Welch (LZW) Coding Mechanism Achieves Compression
The following is a sequence of data symbols, where each symbol is selected from the set {A, B, C}.
ABABBABCABABBA
Encoding Without Compression
The sequence contains 14 symbols. There are 3 distinct symbols, implying 3 distinct codewords, one per symbol. Let us adopt a coding scheme in which a 3-bit codeword is used to represent each symbol.
Without compression, converting the entire raw sequence to binary would require 14 × 3 = 42 bits.
After Encoding With Lempel-Ziv-Welch (LZW) Coding Mechanism
After using the Lempel-Ziv-Welch (LZW) Mechanism to encode the sequence, the following results can be obtained. Here, S is the current string, C is the next incoming symbol, the Output column lists the emitted codewords, and the Code and String columns show the entries added to the dictionary (the first three rows show the initial dictionary):
S | C | Output | Code | String |
| | | 1 | A |
| | | 2 | B |
| | | 3 | C |
A | B | 1 | 4 | AB |
B | A | 2 | 5 | BA |
A | B | |||
AB | B | 4 | 6 | ABB |
B | A | |||
BA | B | 5 | 7 | BAB |
B | C | 2 | 8 | BC |
C | A | 3 | 9 | CA |
A | B | |||
AB | A | 4 | 10 | ABA |
A | B | |||
AB | B | |||
ABB | A | 6 | 11 | ABBA |
A | eof | 1 | | |
It can clearly be seen that there are 9 output codewords, as opposed to the 14 that would have been required had the encoding been carried out without compression. Since the largest code actually emitted here is 6, 3-bit codewords still suffice, so this translates to 9 × 3 = 27 bits, which is 15 bits less than encoding without compression would require. (In general, the dictionary grows past the initial alphabet, so practical LZW implementations widen their codewords as needed.)
Fewer codewords need to be transmitted or stored in this case. The compression ratio achieved by this mechanism is 14/9 ≈ 1.56.
The compression achieved is even more significant for longer strings, say, those exceeding 100 symbols in length.
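The dictionary-building logic traced in the table above can be expressed as a short Python sketch (the function name is our own):

```python
def lzw_encode(sequence, alphabet):
    # Initialize the dictionary with single symbols, coded 1..n.
    dictionary = {sym: i + 1 for i, sym in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    s, output = "", []
    for c in sequence:
        if s + c in dictionary:
            s += c  # keep extending the current string
        else:
            output.append(dictionary[s])   # emit the code for S
            dictionary[s + c] = next_code  # add S+C to the dictionary
            next_code += 1
            s = c
    output.append(dictionary[s])  # flush the final string at end of input
    return output

print(lzw_encode("ABABBABCABABBA", "ABC"))
# [1, 2, 4, 5, 2, 3, 4, 6, 1] -- the 9 codewords from the table
```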
Real Applications of Lossless Compression
Techniques of lossless compression are generally used when all the encoded information needs to be recovered completely. Hence, lossless compression is used more for data files and documents than for media files like images, audio, and video.
However, in certain cases, for example, when one requires the raw camera image with all the information collected by the camera’s sensors, lossless image formats are needed too. In the case of audio, a song stored on a CD (.WAV) is coded losslessly, but the same is not true of many other audio formats.
We have compiled a list of common lossless formats used to compress data, image, and audio files.
Type of File | Formats That Implement Lossless Compression |
Data Files and Folders | .ARC:- The format that pioneered the practice of combining file archiving and compression in a single mechanism, .ARC was popular in the late 80s, though it has since fallen out of use. Its limitation was that it could not compress directory trees. It achieved compression by applying the Huffman Coding Algorithm to data already compressed with the LZW Mechanism. Other common lossless archive formats include .ZIP and .RAR. |
Audio Files | .WAV:- .WAV, which stands for Waveform Audio File Format, is actually the raw, uncompressed version of the audio file. It is not to be confused with .WAV files that carry MP3-compressed audio, which achieve compression through lossy techniques. It is the standard format adopted when audio files are stored on CDs, and it has a maximum file size of 4 GB. Other lossless audio formats include .FLAC and .ALAC. |
Image Files | .RAW:- As the name suggests, the .RAW format refers to the raw, uncompressed camera image that has undergone hardly any processing. Since this format contains every last bit of information regarding the image, and the image can be fully reconstructed from it, it can be considered analogous to a digitized photographic negative. Other image formats offering lossless compression include .TIFF, .BMP, and .PNG. |
**Note: Run-Length Encoding is a method of encoding data that is adopted when the data stream contains long runs of consecutively repeating symbols. Rather than encoding the same symbol over and over again, this technique encodes the symbol just once, followed by the number of times it repeats.
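A minimal Python sketch of Run-Length Encoding (the function names are our own):

```python
from itertools import groupby

def rle_encode(data):
    # Each run becomes one (symbol, run length) pair.
    return [(sym, len(list(run))) for sym, run in groupby(data)]

def rle_decode(pairs):
    return "".join(sym * count for sym, count in pairs)

encoded = rle_encode("AAAABBBCCDAA")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
print(rle_decode(encoded))  # 'AAAABBBCCDAA'
```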
Lossy Data Compression
♦ As the name suggests, lossy data compression involves the loss of some amount of information during the data condensation process.
♦ The information lost in the process of compression may be redundant or duplicate data, which is simply discarded to facilitate compression; alternatively, identical or similar data may be represented with a single codeword, in a process called quantization.
♦ The bad news is that data compressed using lossy techniques can never be completely reconstructed; that is, the entire original data, down to the last bit, can never be fully recovered.
♦ The good news, however, is that data compressed using lossy techniques is of a kind that can afford to lose a certain amount of information without any noticeable impact on its quality. Also, the extent to which a file can be compressed with lossy techniques is far greater than what lossless methods allow.
Techniques of Lossy Compression
**Rate-Distortion Theory:- Just as the entropy of a sequence sets a bound on the amount of lossless compression a data stream can undergo, this theory, also formulated by Claude Shannon, aims to determine the maximum amount of lossy compression a sequence can withstand without exceeding a maximum amount of signal distortion, where distortion is defined as the mean square error between the transmitted and received signals.
When loss of data is involved in the encoding process, the original set of data symbols doesn’t even have to be encoded as-is.
✏ Quantization can be used to convert the complete set of data symbols into a much smaller set, eliminating unnecessary repetition by clubbing similar or identical data symbols together (a minimal sketch follows this list).
✏ Alternatively, the entire sequence can be transformed into a completely different domain, using mathematical techniques like the Fourier Transform, the Discrete Cosine Transform, and a wide variety of others.
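Here is the quantization sketch promised above, assuming a simple uniform quantizer (the step size and sample values are our own); it also reports the mean square error, the distortion measure used by rate-distortion theory:

```python
def uniform_quantize(samples, step):
    # Snap each sample to the nearest multiple of the step size, so
    # nearby values collapse into a single representative symbol.
    return [round(x / step) * step for x in samples]

samples = [0.12, 0.14, 0.13, 0.58, 0.61, 0.60]
quantized = uniform_quantize(samples, step=0.1)

print(quantized)  # six values collapse to two symbols: ~0.1 and ~0.6
# Distortion, as rate-distortion theory defines it: mean square error.
mse = sum((x - q) ** 2 for x, q in zip(samples, quantized)) / len(samples)
print(mse)
```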
Lossy compression algorithms are tunable; in simple words, the extent to which data is compressed using lossy techniques can be varied. Criteria like rate-distortion theory are used to determine the amount of compression that is acceptable in different cases.
Real Applications of Lossy Compression
As mentioned before, lossy compression is employed under two conditions: firstly, when the file in question will not suffer in quality if a portion of the information it contains is lost, and secondly, when it needs to be compressed to a greater extent than what lossless techniques can achieve.
A text-based document, or a folder containing a number of files, cannot be compressed using lossy techniques, because all the data it contains has to be recovered in its entirety. For media files, namely image, audio, and video files, the dynamics are completely different.
Why Lossy Compression Techniques Can Be Employed For Media Files
The human eye can recognize only a certain number of different colors, and it can resolve pixels only up to a certain minimum size. Representing similar-looking colors with a single color, or combining adjacent pixels whose values are very close to each other, not only compresses the file successfully, but also leaves the reconstructed (decoded) image looking no different from the original.
A similar phenomenon, called persistence of vision, comes into play while viewing videos. Videos are nothing but a colossal number of image frames with minute differences between them, fed successively at a speed at which the human eye cannot distinguish individual frames. The loss of a handful of frames will not result in any loss of video quality, as long as the frame rate stays above this limiting value.
We have compiled a list of lossy compression formats that are commonly used nowadays, to compress the corresponding kinds of files.
Type of File | Formats That Implement Lossy Compression |
Image Files | .JPEG:- This image format, developed by the Joint Photographic Experts Group (which is also where it got its name), is the best example of lossy compression. While creating a .JPEG file, the author has the option of trading off image size against image quality: retaining more detail produces a better-looking but larger file, so to achieve more compression, the image’s quality has to be compromised. .GIF is another common image format in this category. |
Audio Files | .MP3:- The .MP3 file format gets its name from the Moving Picture Experts Group (MPEG)-1 (or MPEG-2) Audio Layer 3, and is one of the most widespread audio formats encoded using lossy techniques. It can reduce the size of an uncompressed audio file (such as a .WAV file taken from a CD) to approximately 1/11th of the original. The quality of MP3 files can be varied by encoding at different bit rates. Other lossy audio formats include .AAC and .WMA. |
Video Files | H.264:- Moving Picture Experts Group (MPEG)-4 Part 10, Advanced Video Coding, also known as H.264, is a lossy compression standard that is popularly used nowadays. Its best-known application is the encoding of Blu-ray Discs. It uses a combination of different lossy and lossless coding algorithms. |
In a nutshell, as far as sheer reduction in file size is concerned, lossy compression is an improvement over lossless compression.
However, lossy compression implemented beyond a certain extent will lead to degradation in data/signal quality. This happens because, at the receiving end, the original signal has to be reconstructed from insufficient data, since in lossy compression not every last bit is encoded.
Both lossless and lossy techniques are best suited to their respective applications, and using techniques of either category interchangeably would be a futile endeavor.
We hope that we were successful in highlighting the similarities and differences between lossy and lossless data compression.