Best compression algorithm for very small data

Status
Not open for further replies.

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
I have some binary files hovering around 100 bytes that I need to make as small as possible.

I want the best, most aggressive compression algorithm available but with a lax license so I can embed it in my program.

I'm currently using zlib and it shaves about 20% off the files. Are there certain settings I can use to improve that? Is there some way I can arrange the data to get a better compression ratio?
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
Do you need to keep the files separate?

The zlib compression might work better if you combine a bunch of the files into one buffer, so that it only uses one dictionary.
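
A rough sketch of what that could look like with zlib's one-shot API (the helper name, buffer size, and error handling here are illustrative, not from the thread):

Code:
/* Sketch: concatenate several ~100-byte records and deflate them as one
   stream, so zlib builds a single dictionary across all of them.
   The 4096-byte scratch buffer is an assumption, not from the thread. */
#include <string.h>
#include <zlib.h>

int compress_combined(const unsigned char *files[], const size_t sizes[],
                      int count, unsigned char *out, unsigned long *out_len)
{
    unsigned char combined[4096];
    size_t total = 0;

    for (int i = 0; i < count; i++) {        /* glue the files together */
        memcpy(combined + total, files[i], sizes[i]);
        total += sizes[i];
    }
    /* *out_len must hold the capacity of out on entry; compress2() sets it
       to the compressed size on success. */
    return compress2(out, out_len, combined, total, Z_BEST_COMPRESSION);
}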
 

ejjpi

Member
Dec 21, 2013
58
0
0
pangoly.com
Well, there is no "best algorithm" for everything; it depends on a huge number of factors: how your binary data is structured, whether you want lossless compression or not, CPU time efficiency, system architecture, etc.
 

Sequences123

Member
Apr 24, 2013
34
0
0
Zlib works with a sliding window. I've achieved 70%-80% compression ratios with zlib. The one time I saw only 20% compression was when I used zlib with constant flushing.

If you're using API calls, try out the different flush arguments. I've had some success with Z_PARTIAL_FLUSH.
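
A minimal sketch of how the flush arguments could be compared on one of these small buffers (deflate_with_flush is a hypothetical helper; the buffer contents and the 10-byte chunk size are made up):

Code:
/* Sketch: deflate the same small buffer two ways -- once in one shot with
   a single Z_FINISH, once with Z_PARTIAL_FLUSH after every small chunk --
   and compare the output sizes. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static unsigned long deflate_with_flush(const unsigned char *in, unsigned int len,
                                        unsigned int chunk, int flush_each)
{
    unsigned char out[512];
    z_stream s;
    memset(&s, 0, sizeof s);
    if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
        return 0;
    s.next_out  = out;
    s.avail_out = sizeof out;

    for (unsigned int off = 0; off < len; off += chunk) {
        s.next_in  = (unsigned char *)in + off;
        s.avail_in = (len - off < chunk) ? len - off : chunk;
        int last   = (off + chunk >= len);
        /* flush_each forces output after every chunk, as with constant flushing */
        deflate(&s, last ? Z_FINISH : (flush_each ? Z_PARTIAL_FLUSH : Z_NO_FLUSH));
    }
    unsigned long n = s.total_out;
    deflateEnd(&s);
    return n;
}

int main(void)
{
    unsigned char data[100] = {0};   /* stand-in for one ~100-byte file */
    printf("one shot, single Z_FINISH : %lu bytes\n",
           deflate_with_flush(data, 100, 100, 0));
    printf("Z_PARTIAL_FLUSH every 10B : %lu bytes\n",
           deflate_with_flush(data, 100, 10, 1));
    return 0;
}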
 

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Well, there is no "best algorithm" for everything; it depends on a huge number of factors: how your binary data is structured, whether you want lossless compression or not, CPU time efficiency, system architecture, etc.

Well, I answered all of those questions already: it's binary data, so I want lossless; I need the most aggressive algorithm regardless of CPU and memory usage; I need something I can use in my program, so source code is preferred; and I need to know if there's an efficient way to structure the data.

Zlib works with a sliding window. I've achieved 70%-80% compression ratios with zlib. The one time I saw only 20% compression was when I used zlib with constant flushing.

If you're using API calls, try out the different flush arguments. I've had some success with Z_PARTIAL_FLUSH.

What's changed by using a different flush argument? Would that make a difference with 100 bytes of data? Also, I hear there's a way to reduce the size of the header.

Do you need to keep the files separate?

The zlib compression might work better if you combine a bunch of the files into one buffer, so that it only uses one dictionary.

The files need to be separate.
 
Last edited:

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Here's an example of one of the files. I'd expect all of those zeroes to compress extremely well.

15040e14 15151614 1c1b111c 1c141214 14160b1c 14000000 003c0000 00000000 00282800 000c1c00
00000000 00503000 00292900 00000000 00100000 00180000 00000000 00000000 004c4800 00000000
00344000 00440800 00202400 00000000 00000000 002c3800 00000000 5aa1963d 3d753e96 0ea7483e
5c9d8374 720e4e68 22432e4e 862d14e1 42ed87d6 3e2b8219 74be1000 00803f00 00004000 00803f00
00a04154 756e6e65 6c205669 73696f6e 21
 

Train

Lifer
Jun 22, 2000
13,863
68
91
www.bing.com
With such a small file size, you aren't going to gain much, if anything, by trying the standard algos.

Maybe a search for each file's existence in Pi, then just store the starting point and length, for each file. Should be able to get down to 8-10 bytes, I'd estimate.

Could take hella long (days, weeks, months) to find each byte-string in Pi, but near instant unpacking.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,284
3,905
75
zlib uses two encoding schemes: copying (LZ77) and entropy encoding (Huffman). Entropy encoding assumes that some bytes (00, 14, and 15 in your example) occur more than others (73 for instance). It then gives the more common bytes shorter representations, and gives the less common ones longer representations. (Yes, longer than the originals.) There is a better method of entropy encoding, called "arithmetic coding". It has had patent issues in the USA, but many of its patents have expired.
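
A quick byte histogram of the posted sample makes that skew visible; a minimal sketch:

Code:
/* Sketch: count how often each byte value occurs. A lopsided histogram
   (lots of 00/14/15, few of everything else) is what Huffman or
   arithmetic coding turns into savings. */
#include <stdio.h>

void histogram(const unsigned char *buf, size_t len)
{
    unsigned int counts[256] = {0};

    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
    for (int b = 0; b < 256; b++)
        if (counts[b])
            printf("%02x occurs %u times\n", b, counts[b]);
}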

Looking at your example without knowledge of its meaning, a 3- or 4-bit run-length encoding of zeroes seems like it would help, followed by Huffman for the rest. You could put the run lengths at the beginning or the end to help Huffman work only on the other bytes.

The zero RLE for the first line could look like:
15040e14 15151614 1c1b111c 1c141214 14160b1c 1400 3c00282800 0c1c00 0163

To decode, every time you encounter 00, repeat it the number of times given by the corresponding four bits at the end. Note that a lone 00 now needs another 0 to indicate not to repeat it. Compression can always create a longer file in the worst case.
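
A minimal sketch of that zero-RLE, simplified to one whole count byte per run instead of packed 4-bit counts at the end:

Code:
/* Sketch of the zero-RLE idea, simplified: every run of 00s becomes a
   single 00 followed by one count byte (the 4-bit-counts-at-the-end
   variant described above would be tighter). Returns the encoded size. */
#include <stddef.h>

size_t rle_zero_encode(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < len; ) {
        if (in[i] == 0x00) {
            size_t run = 0;
            while (i + run < len && in[i + run] == 0x00 && run < 255)
                run++;                     /* measure the run of zeroes */
            out[o++] = 0x00;
            out[o++] = (unsigned char)run; /* a lone 00 still costs 2 bytes */
            i += run;
        } else {
            out[o++] = in[i++];            /* non-zero bytes pass through */
        }
    }
    return o;
}

Decoding just reverses it: on 00, read the next byte as a repeat count.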

But if you know the reasons why certain bytes are written to this file, you might be able to compress it more. Is the beginning a "magic number", for example? Are some of these integers that never exceed a certain value?

From the department of other stupid ideas: Could you write some of the file's data into the filename? Base64-encoded, of course.

Conversely, if your filesystem uses a minimum number of bytes for each file (and most do, usually at least 512 bytes), why do you need to compress these files further?

Edit: I see you asked about arranging the files better. Putting more of the zeroes at the beginning may help zlib. This may result in better compression than my simple RLE idea.

Also, Train's idea seems unlikely to result in compression: the starting position in pi could easily take more digits to store than the original data.
 
Last edited:

Sequences123

Member
Apr 24, 2013
34
0
0
What's changed by using a different flush argument? Would that make a difference with 100 bytes of data? Also, I hear there's a way to reduce the size of the header.

The files need to be separate.

The flush argument controls how often zlib forces out what it has buffered instead of letting the sliding window keep working across the whole input. The sliding window retains knowledge of recent data (up to 32KB, iirc), which is what allows for better compression. Different types of data compress to different ratios, depending on how much of the data is repeatable.

The thing about partial flush is that when you want to decompress, the inflater must know how the bytes were flushed in order to inflate them. If your binary files must be kept separate and need to be inflated independently, then partial flush might not be what you're looking for.

If you're looking for a customized compression algorithm, you might have to write your own, especially since your use case is so specific. But I would weigh how much compression I really need against the risk of improperly implementing my own compression algorithm. Zlib has been around a long time, and I would trust it more than any compression algorithm I could write.
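
On the header question quoted above: the zlib wrapper costs 6 bytes (a 2-byte header plus a 4-byte Adler-32 trailer), which is real money at ~100-byte sizes. A minimal sketch of raw deflate, which drops both (the helper name is hypothetical):

Code:
/* Sketch: raw deflate via deflateInit2() with negative windowBits, which
   omits the 2-byte zlib header and 4-byte Adler-32 trailer. The decoder
   must then use inflateInit2() with the same -15. */
#include <string.h>
#include <zlib.h>

int deflate_raw(const unsigned char *in, unsigned int in_len,
                unsigned char *out, unsigned int out_cap,
                unsigned long *out_len)
{
    z_stream s;
    memset(&s, 0, sizeof s);
    if (deflateInit2(&s, Z_BEST_COMPRESSION, Z_DEFLATED,
                     -15,                  /* raw deflate: no header/trailer */
                     9,                    /* maximum memLevel */
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    s.next_in   = (unsigned char *)in;
    s.avail_in  = in_len;
    s.next_out  = out;
    s.avail_out = out_cap;

    int rc = deflate(&s, Z_FINISH);        /* one shot, no extra flushes */
    *out_len = s.total_out;
    deflateEnd(&s);
    return (rc == Z_STREAM_END) ? 0 : -1;
}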
 

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Zopfli claims to be 5% better than zlib.

Perfect! Thanks!

I think what I'm going to have to do is shuffle the data so all the most significant bits are read first, which should leave me with a long string of zeroes that hopefully zlib can process better.
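
A sketch of that kind of reordering, assuming (purely as an illustration; the real layout may differ) that the file is mostly 32-bit little-endian words: write the high byte of every word first, then the next plane, and so on, so the mostly-zero high bytes end up in one long run.

Code:
/* Sketch: byte-plane transpose, assuming 32-bit little-endian words.
   Groups all the most-significant bytes first so the zeroes cluster into
   a single long run before deflate sees them. The transform is invertible
   as long as the original length is known. */
#include <stddef.h>

void byteplane_transpose(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t words = len / 4, o = 0;

    for (int plane = 3; plane >= 0; plane--)       /* MSB plane first */
        for (size_t w = 0; w < words; w++)
            out[o++] = in[w * 4 + plane];
    for (size_t i = words * 4; i < len; i++)       /* leftover tail bytes */
        out[o++] = in[i];
}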
 

Cogman

Lifer
Sep 19, 2000
10,278
126
106
http://www.maximumcompression.com/

Check out the PAQ compressor. It offers a high compression ratio while being extremely slow.

It is available under the GPL license.

That said, while PAQ is one of the better general compressors, you can often do much better by analysing your own data and building your own compression system (if you have the time to invest).
 

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Okay, I actually ended up writing my own compression algorithm. The files usually contain a lot of nulls, so I added an index that keeps track of where they occur and strips them from the file. I got 177 bytes down to 125 bytes and now zlib actually makes it larger.
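
For reference, one way such a null index could be laid out (the actual format isn't shown in the thread) is a bitmap with one bit per input byte marking the zeroes, followed by the surviving non-zero bytes. For a 177-byte file that is a 23-byte bitmap plus the non-zero bytes, which happens to line up with the 177 → 125 figure above.

Code:
/* Sketch: strip the nulls and record their positions in a leading bitmap
   (one bit per input byte). This layout is an assumption, not the format
   from the post. Output = bitmap + non-zero bytes. */
#include <stddef.h>

size_t strip_nulls(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t map_len = (len + 7) / 8;     /* 23 bytes for a 177-byte file */
    size_t o = map_len;

    for (size_t i = 0; i < map_len; i++)
        out[i] = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00)
            out[i / 8] |= (unsigned char)(1u << (i % 8));  /* mark a null */
        else
            out[o++] = in[i];                              /* keep the byte */
    }
    return o;   /* bitmap + surviving bytes */
}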
 

winstongel

Junior Member
Aug 1, 2014
5
0
16
LZMA2 is faster when using 4 or more cores, and it gives better compression.

Site link and irrelevant solution to three year-old question removed -- Programming Moderator Ken g6

Winston
 
Last edited by a moderator:

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,284
3,905
75
Necroed thread locked -- Programming Moderator Ken g6
 