Pop songs compressed by the Lempel-Ziv algorithm
Lempel-Ziv is a lossless compression algorithm. PNG, GIF, and gzip are all powered by variants of Lempel-Ziv. The algorithm works by exploiting repeated sequences of characters (or bytes). Pop songs happen to have lots of these, so LZ compresses them very efficiently.
For more context, and a brief explanation of Lempel-Ziv compression, check out my essay on Measuring repetitiveness of song lyrics.
Technical details
-
Size reduction is calculated by counting matches (i.e. the blue circles) as taking up the same space as 3 characters. For example, replacing a string of length 4 saves 1 character, replacing a length 10 string saves 7 characters etc. A typical implementation of LZ encodes a match as an 8-bit length and a 15-bit distance, so 3 bytes is pretty realistic. (A compressor optimized for pop songs would likely use fewer bits for distance, given that songs are usually shorter than 215 characters in total.)
-
The compressions shown here were generated by zopfli, which is similar to gzip but produces slightly more efficient compressions (at the cost of speed).
-
In real applications (like gzip) the output of the LZ algorithm is usually further compressed using Huffman coding - this combo is embodied in the DEFLATE algorithm/format. That step isn't illustrated here, and isn't taken into account when calculating the size reduction. (I wanted to use compressibility as a proxy for measuring lyrical repetitiveness. The LZ algorithm has a clear connection with repeated words and phrases, but Huffman coding does not - it exploits differences in character frequencies.) I used infgen to parse out the pre-Huffman'd data from DEFLATE files generated by zopfli/gzip.