This article is a mirrored machine translation of the original post.


LZ4 Fastest Compression Algorithm Explanation

Posted on 1/3/2022 1:54:23 PM
Ever since watching the HBO series "Silicon Valley", I have been interested in compression algorithms. Richard Hendricks and his middle-out compression algorithm are of course fictional, but after some hard Googling I found that a real compression-algorithm genius does exist.

Yann Collet invented the LZ4 compression algorithm in 2011. LZ4 is not as magical as middle-out, of course, but it is famous for its unbeatable compression speed, and even faster decompression speed, while still guaranteeing a reasonable compression ratio.

I will not go into detail here about benchmarks of its compression speed; if you are interested, you can read this comparison article (link requires login to view).

Also attached is the GitHub address of LZ4 (link requires login to view).

This article focuses on explaining the principle of the LZ4 compression algorithm. Someone wrote an explanation of the LZ4 algorithm before (link requires login to view), but when I studied it I felt it was not very friendly to novices, so I wrote another one for beginners like me.

To sum it up in one sentence: LZ4 is LZ77 with the dictionary stored in a 16 KB hash table and the search simplified.

LZ4 is a lossless compression algorithm that offers a compression speed above 500 MB/s per core, scalable with multi-core CPUs. It features an extremely fast decoder, with speeds of several GB/s per core, typically reaching RAM speed limits on multi-core systems.

The speed can be tuned dynamically by selecting an "acceleration" factor that trades compression ratio for faster speed. At the other end of the spectrum, a high-compression derivative, LZ4_HC, trades CPU time for a better compression ratio. All versions share the same decompression speed.

So what is LZ77?

LZ77 compression and principle

LZ77 is a dictionary-based compression algorithm. In plain terms, the program checks (by looking at the dictionary) whether the data it is currently reading duplicates something seen earlier; if so, it stores the distance (offset) and length of the duplicated field in place of the field itself, thereby compressing the data.

Reference (link requires login to view).

For the string A A B C B B A B C, when the second A is read, the program outputs (1,1) (offset 1 from the previous occurrence, length 1); similarly, when the second B is read, the program outputs (2,1) (offset 2, length 1).

But if the string is long and the program keeps adding data to the dictionary indefinitely, searching for duplicate fields becomes extremely time-consuming: in the worst case, the program scans the entire dictionary for every new character it reads. LZ77 solves this problem with a sliding window.

Much like the motivation for TCP's sliding window, a sliding window bounds the buffer so that too much cached data never has to be processed at once. In LZ77, the dictionary does not grow forever: once it exceeds its specified maximum size, the characters added earliest are discarded, so the dictionary always stays at that maximum size.

Suppose the dictionary length is 3 (each snapshot shows the dictionary between the bars after the output beneath it has been emitted):

| dictionary |

| A |  A  B  C  B  B  A  B  C

Output (0,0,A)

| A  A |  B  C  B  B  A  B  C

Output (1,1)

| A  A  B |  C  B  B  A  B  C

Output (0,0,B)

A | A  B  C |  B  B  A  B  C

Output (0,0,C)

A  A | B  C  B |  B  A  B  C

Output (2,1)


The other part of the sliding window is the look-ahead buffer (the term has no standard translation). The look-ahead buffer is the stretch of not-yet-compressed data immediately adjacent to the dictionary. The LZ77 algorithm matches the longest string in this buffer against the dictionary. The previous example can be viewed as having a look-ahead buffer of length 1.

Suppose the dictionary length is 5 and the look-ahead buffer size is 3

| dictionary | look-ahead buffer |

A | A  B  C  B  B |  A  B  C |

Output (5,3)

The longest matching string is A B C.

Complete compression process:

Suppose the dictionary length is 5 and the look-ahead buffer size is 3 (here each snapshot is shown before its output is emitted):

| dictionary | look-ahead buffer |

| A  A  B |  C  B  B  A  B  C

Output (0,0,A)

| A | A  B  C |  B  B  A  B  C

Output (1,1)

| A  A | B  C  B |  B  A  B  C

Output (0,0,B)

| A  A  B | C  B  B |  A  B  C

Output (0,0,C)

| A  A  B  C | B  B  A |  B  C

Output (2,1)

| A  A  B  C  B | B  A  B |  C

Output (1,1)

A | A  B  C  B  B |  A  B  C |

Output (5,3)

A  A  B  C | B  B  A  B  C |   (end of input)


There is no need to store the dictionary in the compressor's output file, because the decompressor rebuilds the dictionary as it replays the output units.
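The walkthrough above can be sketched as a toy LZ77 compressor in a few lines of Python. This is a minimal illustration under simplified assumptions, not the real algorithm: the tuple formats (0, 0, char) for literals and (offset, length) for matches, the tiny window sizes, and the brute-force search are all chosen for clarity only.

```python
def lz77_compress(data, window=5, lookahead=3):
    """Toy LZ77: emit (0, 0, char) for literals and (offset, length) for matches."""
    out = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        max_len = min(lookahead, len(data) - i)
        # brute-force search for the longest match inside the sliding window
        for off in range(1, min(i, window) + 1):
            length = 0
            while length < max_len and data[i - off + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len > 0:
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append((0, 0, data[i]))
            i += 1
    return out
```

Running it on the example string reproduces the walkthrough: lz77_compress("AABCBBABC") yields [(0,0,'A'), (1,1), (0,0,'B'), (0,0,'C'), (2,1), (1,1), (5,3)].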

Decompression process

One of the great advantages of the LZ77 algorithm is that decompression is very fast.

If the first matching unit is (0,0,A), then A is output.

The second matching unit is (1,1): copy from one character back in the output string; since the copy length is 1, output A.

...

If the last matching unit is (5,3), look back 5 characters in the output string; since the copy length is 3, output A, B, C.
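The decompression steps above can be sketched in a few lines. This assumes the toy tuple format used in the walkthrough, (0, 0, char) for literals and (offset, length) for matches, which is an illustration rather than LZ77's actual on-disk encoding:

```python
def lz77_decompress(units):
    """Replay (0, 0, char) and (offset, length) units to rebuild the input."""
    out = []
    for unit in units:
        if len(unit) == 3:                 # literal unit: (0, 0, char)
            out.append(unit[2])
        else:                              # match unit: (offset, length)
            offset, length = unit
            for _ in range(length):
                # copying one character at a time also handles matches
                # that overlap their own output
                out.append(out[-offset])
    return "".join(out)
```

lz77_decompress([(0,0,'A'), (1,1), (0,0,'B'), (0,0,'C'), (2,1), (1,1), (5,3)]) rebuilds "AABCBBABC"; note that the decoder never needs the dictionary itself, only the output produced so far.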

The most time-consuming part of LZ77 compression is finding the longest match between the look-ahead buffer and the dictionary. If the dictionary and look-ahead buffer are too short, the chance of finding a match is slim. LZ4 therefore changed LZ77's matching algorithm.

First, the dictionary of the LZ4 algorithm is a hash table. The key is a 4-byte sequence, each key maps to exactly one slot, and the value in the slot is the position of that sequence.

Instead of searching a buffer, LZ4 reads 4 bytes at a time from the input file and looks up the corresponding slot in the hash table; these 4 bytes are hereafter called the current sequence.

If the program has reached the last 12 characters, it copies them directly into the output file.

If no value has been assigned to the slot, these 4 bytes appear for the first time; the program records the 4 bytes and their position in the hash table and continues scanning.

If there is a value in the slot, a potential match has been found.

If the value in the slot plus the sliding-window size is less than the position of the current sequence, the match lies outside the block, and the program simply updates the slot to the position of the current sequence.

LZ4 then checks for a hash collision: if the 4 bytes at the recorded position differ from the current sequence, there must have been a collision. The author's own xxHash is famous for its efficiency, but collisions are inevitable. On a collision, the program likewise updates the slot to the position of the current sequence.

Finally, once the match is confirmed valid, the program keeps comparing the characters that follow the match to extend it as far as possible. It then copies all characters from the end of the previous valid match to the beginning of this one into the compressed file as literals, and appends the matching sequence for this valid match.
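The search described in the steps above can be sketched roughly as follows. This is a simplified illustration, not LZ4's real implementation: a Python dict stands in for the fixed-size hash table (a dict compares full keys, so the explicit collision check from the steps above is unnecessary here), the window parameter is an assumption, and sequences are kept as plain (literals, offset, match_len) tuples rather than packed bytes.

```python
MIN_MATCH = 4

def find_matches(data: bytes, window: int = 65536):
    """Greedy LZ4-style search: one hash-table slot per 4-byte key."""
    table = {}           # 4-byte key -> most recent position where it was seen
    sequences = []       # (literals, offset, match_len) tuples
    anchor = 0           # start of literals not yet emitted
    i = 0
    while i + MIN_MATCH <= len(data):
        key = data[i:i + MIN_MATCH]
        candidate = table.get(key)
        table[key] = i                      # always refresh the slot
        if candidate is None or i - candidate > window:
            i += 1                          # first occurrence or out of window
            continue
        # valid match: extend it as far as the following bytes agree
        length = MIN_MATCH
        while i + length < len(data) and data[candidate + length] == data[i + length]:
            length += 1
        sequences.append((data[anchor:i], i - candidate, length))
        i += length
        anchor = i
    if anchor < len(data):                  # trailing bytes stay literal
        sequences.append((data[anchor:], 0, 0))
    return sequences
```

For example, find_matches(b"abcdefghabcdefgh") returns [(b"abcdefgh", 8, 8)]: the first 8 bytes are literals and the second half is encoded as a single match 8 bytes back.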

Collet calls the matching units output by LZ4 "sequences", and they make up the LZ4 compressed file. The layout is as follows:

[Figure: LZ4 sequence layout: token (1 byte) | literal length+ | literals | offset | match length+]
The token is 1 byte long: the high 4 bits hold the literal length and the low 4 bits the match length. As mentioned earlier, all characters from the end of the previous valid match to the start of this one are copied into the compressed file uncompressed; these bytes are the literals, and their count is the literal length.

The literals are followed by the offset. As in LZ77, the offset is the distance from the current sequence back to its match, and the match length is the length of the match between the current sequence and the identical string in the dictionary. During decompression they tell the decoder where to find the string to copy and how much of it to copy. The size of the offset field is fixed.

The "literal length+" and "match length+" fields in the figure extend the token when its 4-bit fields are not enough. 4 bits can represent 0-15; for larger values, extra bytes follow at the corresponding positions, each adding up to 255 to the total, until a byte smaller than 255 terminates the count. In the match length, the value 0 represents 4 bytes.
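The variable-length encoding just described can be sketched as follows (encode_length is a hypothetical helper for illustration; packing the fields into actual output bytes is omitted):

```python
def encode_length(value: int):
    """Split a length into its 4-bit token field plus optional extra bytes."""
    if value < 15:
        return value, b""                  # fits entirely in the token nibble
    extra = bytearray()
    remaining = value - 15
    while remaining >= 255:                # each full 255 byte adds 255
        extra.append(255)
        remaining -= 255
    extra.append(remaining)                # a byte < 255 terminates the count
    return 15, bytes(extra)
```

For example, encode_length(14) fits in the token as (14, b""), encode_length(15) becomes (15, b"\x00"), and encode_length(270) becomes (15, b"\xff\x00"). For match lengths, remember that the stored value is the actual length minus 4.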

The file always ends with the last 12 bytes as literals, because they are copied directly.

Let's look at the code (a Python version is easier to follow).
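As a minimal sketch of the decompression side, here is a decoder that assumes sequences are kept as (literals, offset, match_len) tuples, a simplification of the packed byte layout described earlier:

```python
def lz4_rebuild(sequences):
    """Rebuild the original bytes from (literals, offset, match_len) tuples."""
    out = bytearray()
    for literals, offset, match_len in sequences:
        out += literals                    # copy the literals verbatim
        for _ in range(match_len):
            out.append(out[-offset])       # byte-by-byte copy handles overlaps
    return bytes(out)
```

For instance, lz4_rebuild([(b"ab", 2, 4)]) gives b"ababab": copying one byte at a time lets a match overlap its own output, which is how short repeating runs get encoded compactly.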


Summary

Having said all this, I still have not explained why LZ4 is so fast. Compare the dictionary searches of LZ77 and LZ4. Native LZ77 searches the dictionary by finding the longest match between the look-ahead buffer and the dictionary. This is a longest-common-substring problem: without dynamic programming, in the worst case we must consider every substring of one string and try to match each against the other; with dynamic programming, a table holds the running longest match, and the search still takes O(m*n). And that is for a single look-ahead buffer and dictionary pair; in the worst case, when nothing ever matches, LZ77 must solve on the order of (length of the whole file - length of the look-ahead buffer) such matching problems.

LZ4, in effect, memoizes at a much coarser granularity: it stores each 4-byte sequence and its position in a hash table, and for each new 4 bytes of data it only needs to check whether the value in the hash table is valid. After finding a valid match, it compares the bytes that follow the two matching positions to obtain the longest match. Since each key of the LZ4 hash table maps to exactly one slot, looking up, inserting, and updating the table all take O(1). When matches are found, extra comparisons are needed to extend them, but in aggregate those comparisons replace even more hash-table lookups, so the total running time of the LZ4 algorithm is only O(n). I have to admire the elegance of Collet's design!

To make the algorithm faster still, Collet set the default hash table size to 16 KB and recommends not exceeding 32 KB, so that it fits in the L1 cache of almost all (Intel) CPUs. Everyone knows how large the speed gap between L1 cache and main memory is, so it is no surprise that LZ4 flies, especially since the hash function used in LZ4's hash table is the very fast xxHash.

Of course, such a design has its drawbacks. The smaller the hash table, the fewer keys it holds, which means more hash collisions, and collisions are inevitable. More collisions mean fewer matches. A smaller hash table also implies a smaller sliding window, so more matches are discarded for being too far away. Fewer matches mean a lower compression ratio, which is why LZ4's compression ratio is unremarkable.

Looking Ahead

LZ4 has a wide range of applications. Just as middle-out was used for VR in Silicon Valley, LZ4's very fast compression can raise bandwidth utilization by requiring fewer IOs at the cost of very little latency. And a small conjecture: if a CPU came along with a larger L1 cache, could LZ4 raise its compression ratio without sacrificing speed?

Original: (link requires login to view).



