Best compression algorithm for very small data

Status
Not open for further replies.

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
I have some binary files hovering around 100 bytes that I need to make as small as possible.

I want the best, most aggressive compression algorithm available but with a lax license so I can embed it in my program.

I'm currently using zlib and it shaves about 20% off the files. Are there certain settings I can use to improve that? Is there some way I can arrange the data to get a better compression ratio?
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
Do you need to keep the files separate?

The zlib compression might work better if you combine a bunch of the files into one buffer, so that it only uses one dictionary.
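
A rough sketch of what that could look like with zlib's one-shot API (the helper name, buffer size, and error handling here are illustrative, not from the thread):

Code:
/* Sketch: concatenate several ~100-byte records and deflate them as one
   stream, so zlib builds a single dictionary across all of them.
   The 4096-byte scratch buffer is an assumption, not from the thread. */
#include <string.h>
#include <zlib.h>

int compress_combined(const unsigned char *files[], const size_t sizes[],
                      int count, unsigned char *out, unsigned long *out_len)
{
    unsigned char combined[4096];
    size_t total = 0;

    for (int i = 0; i < count; i++) {        /* glue the files together */
        memcpy(combined + total, files[i], sizes[i]);
        total += sizes[i];
    }
    /* *out_len must hold the capacity of out on entry; compress2() sets it
       to the compressed size on success. */
    return compress2(out, out_len, combined, total, Z_BEST_COMPRESSION);
}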
 

ejjpi

Member
Dec 21, 2013
58
0
0
pangoly.com
Well, there is no "best algorithm" for everything; it depends on a huge number of factors: how your binary data is structured, whether you want lossless compression or not, CPU time efficiency, system architecture, etc.
 

Sequences123

Member
Apr 24, 2013
34
0
0
Zlib works with a sliding window. I've achieved 70%-80% compression ratios with zlib. The one time I saw only 20% compression was when I used zlib with constant flushing.

If you're using API calls, try out the different flush arguments. I've had some success with Z_PARTIAL_FLUSH.
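
A minimal sketch of how the flush arguments could be compared on one of these small buffers (deflate_with_flush is a hypothetical helper; the buffer contents and the 10-byte chunk size are made up):

Code:
/* Sketch: deflate the same small buffer two ways -- once in one shot with
   a single Z_FINISH, once with Z_PARTIAL_FLUSH after every small chunk --
   and compare the output sizes. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static unsigned long deflate_with_flush(const unsigned char *in, unsigned int len,
                                        unsigned int chunk, int flush_each)
{
    unsigned char out[512];
    z_stream s;
    memset(&s, 0, sizeof s);
    if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
        return 0;
    s.next_out  = out;
    s.avail_out = sizeof out;

    for (unsigned int off = 0; off < len; off += chunk) {
        s.next_in  = (unsigned char *)in + off;
        s.avail_in = (len - off < chunk) ? len - off : chunk;
        int last   = (off + chunk >= len);
        /* flush_each forces output after every chunk, as with constant flushing */
        deflate(&s, last ? Z_FINISH : (flush_each ? Z_PARTIAL_FLUSH : Z_NO_FLUSH));
    }
    unsigned long n = s.total_out;
    deflateEnd(&s);
    return n;
}

int main(void)
{
    unsigned char data[100] = {0};   /* stand-in for one ~100-byte file */
    printf("one shot, single Z_FINISH : %lu bytes\n",
           deflate_with_flush(data, 100, 100, 0));
    printf("Z_PARTIAL_FLUSH every 10B : %lu bytes\n",
           deflate_with_flush(data, 100, 10, 1));
    return 0;
}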
 

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Well, there is no "best algorithm" for everything; it depends on a huge number of factors: how your binary data is structured, whether you want lossless compression or not, CPU time efficiency, system architecture, etc.

Well, I answered all of those questions already: it's binary data, so I want lossless; I need the most aggressive algorithm regardless of CPU and memory usage; I need something I can use in my program, so source code is preferred; and I need to know if there's an efficient way to structure the data.

Zlib works with a sliding window. I've achieved 70%-80% compression ratios with zlib. The one time I saw only 20% compression was when I used zlib with constant flushing.

If you're using API calls, try out the different flush arguments. I've had some success with Z_PARTIAL_FLUSH.

What's changed by using a different flush argument? Would that make a difference with 100 bytes of data? Also, I hear there's a way to reduce the size of the header.

Do you need to keep the files separate?

The zlib compression might work better if you combine a bunch of the files into one buffer, so that it only uses one dictionary.

The files need to be separate.
 
Last edited:

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Here's an example of one of the files. I'd expect all of those zeroes to compress extremely well.

15040e14 15151614 1c1b111c 1c141214 14160b1c 14000000 003c0000 00000000 00282800 000c1c00
00000000 00503000 00292900 00000000 00100000 00180000 00000000 00000000 004c4800 00000000
00344000 00440800 00202400 00000000 00000000 002c3800 00000000 5aa1963d 3d753e96 0ea7483e
5c9d8374 720e4e68 22432e4e 862d14e1 42ed87d6 3e2b8219 74be1000 00803f00 00004000 00803f00
00a04154 756e6e65 6c205669 73696f6e 21
 

Train

Lifer
Jun 22, 2000
13,863
68
91
www.bing.com
With such a small file size, you aren't going to gain much, if anything, by trying the standard algos.

Maybe a search for each file's existence in Pi, then just store the starting point and length, for each file. Should be able to get down to 8-10 bytes, I'd estimate.

Could take hella long (days, weeks, months) to find each byte-string in Pi, but near instant unpacking.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,284
3,905
75
zlib uses two encoding schemes: copying (LZ77) and entropy encoding (Huffman). Entropy encoding assumes that some bytes (00, 14, and 15 in your example) occur more than others (73 for instance). It then gives the more common bytes shorter representations, and gives the less common ones longer representations. (Yes, longer than the originals.) There is a better method of entropy encoding, called "arithmetic coding". It has had patent issues in the USA, but many of its patents have expired.
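
A quick byte histogram of the posted sample makes that skew visible; a minimal sketch:

Code:
/* Sketch: count how often each byte value occurs. A lopsided histogram
   (lots of 00/14/15, few of everything else) is what Huffman or
   arithmetic coding turns into savings. */
#include <stdio.h>

void histogram(const unsigned char *buf, size_t len)
{
    unsigned int counts[256] = {0};

    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
    for (int b = 0; b < 256; b++)
        if (counts[b])
            printf("%02x occurs %u times\n", b, counts[b]);
}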

Looking at your example without knowledge of its meaning, a 3- or 4-bit run-length encoding of zeroes seems like it would help, followed by Huffman for the rest. You could put the run lengths at the beginning or the end to help Huffman work only on the other bytes.

The zero RLE for the first line could look like:
15040e14 15151614 1c1b111c 1c141214 14160b1c 1400 3c00282800 0c1c00 0163

To decode, every time you encounter 00, repeat it the number of times given by the corresponding four bits at the end. Note that a lone 00 now needs another 0 to indicate not to repeat it. Compression can always create a longer file in the worst case.
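
A minimal sketch of that zero-RLE, simplified to one whole count byte per run instead of packed 4-bit counts at the end:

Code:
/* Sketch of the zero-RLE idea, simplified: every run of 00s becomes a
   single 00 followed by one count byte (the 4-bit-counts-at-the-end
   variant described above would be tighter). Returns the encoded size. */
#include <stddef.h>

size_t rle_zero_encode(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < len; ) {
        if (in[i] == 0x00) {
            size_t run = 0;
            while (i + run < len && in[i + run] == 0x00 && run < 255)
                run++;                     /* measure the run of zeroes */
            out[o++] = 0x00;
            out[o++] = (unsigned char)run; /* a lone 00 still costs 2 bytes */
            i += run;
        } else {
            out[o++] = in[i++];            /* non-zero bytes pass through */
        }
    }
    return o;
}

Decoding just reverses it: on 00, read the next byte as a repeat count.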

But if you know the reasons why certain bytes are written to this file, you might be able to compress it more. Is the beginning a "magic number", for example? Are some of these integers that never exceed a certain value?

From the department of other stupid ideas: Could you write some of the file's data into the filename? Base64-encoded, of course.

Conversely, if your filesystem uses a minimum number of bytes for each file (and most do, usually at least 512 bytes), why do you need to compress these files further?

Edit: I see you asked about arranging the files better. Putting more of the zeroes at the beginning may help zlib. This may result in better compression than my simple RLE idea.

Also, Train's idea seems unlikely to result in compression: the starting position in pi could easily take more digits to store than the original data.
 
Last edited:

Sequences123

Member
Apr 24, 2013
34
0
0
What's changed by using a different flush argument? Would that make a difference with 100 bytes of data? Also, I hear there's a way to reduce the size of the header.

The files need to be separate.

The flush argument controls how often zlib forces out what it has buffered instead of letting the sliding window keep working across the whole input. The sliding window retains knowledge of recent data (up to 32KB, iirc), which is what allows for better compression. Different types of data compress to different ratios, depending on how much of the data is repeatable.

The thing about partial flush is that when you want to decompress, the inflater must know how the bytes were flushed in order to inflate them. If your binary files must be kept separate and need to be inflated independently, then partial flush might not be what you're looking for.

If you're looking for a customized compression algorithm, you might have to write your own, especially since your use case is so specific. But I would weigh how much compression I really need against the risk of improperly implementing my own compression algorithm. Zlib has been around a long time, and I would trust it more than any compression algorithm I could write.
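
On the header question quoted above: the zlib wrapper costs 6 bytes (a 2-byte header plus a 4-byte Adler-32 trailer), which is real money at ~100-byte sizes. A minimal sketch of raw deflate, which drops both (the helper name is hypothetical):

Code:
/* Sketch: raw deflate via deflateInit2() with negative windowBits, which
   omits the 2-byte zlib header and 4-byte Adler-32 trailer. The decoder
   must then use inflateInit2() with the same -15. */
#include <string.h>
#include <zlib.h>

int deflate_raw(const unsigned char *in, unsigned int in_len,
                unsigned char *out, unsigned int out_cap,
                unsigned long *out_len)
{
    z_stream s;
    memset(&s, 0, sizeof s);
    if (deflateInit2(&s, Z_BEST_COMPRESSION, Z_DEFLATED,
                     -15,                  /* raw deflate: no header/trailer */
                     9,                    /* maximum memLevel */
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    s.next_in   = (unsigned char *)in;
    s.avail_in  = in_len;
    s.next_out  = out;
    s.avail_out = out_cap;

    int rc = deflate(&s, Z_FINISH);        /* one shot, no extra flushes */
    *out_len = s.total_out;
    deflateEnd(&s);
    return (rc == Z_STREAM_END) ? 0 : -1;
}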
 

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Zopfli claims to be 5% better than zlib.

Perfect! Thanks!

I think what I'm going to have to do is shuffle the data so all the most significant bits are read first, which should leave me with a long string of zeroes that hopefully zlib can process better.
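
A sketch of that kind of reordering, assuming (purely as an illustration; the real layout may differ) that the file is mostly 32-bit little-endian words: write the high byte of every word first, then the next plane, and so on, so the mostly-zero high bytes end up in one long run.

Code:
/* Sketch: byte-plane transpose, assuming 32-bit little-endian words.
   Groups all the most-significant bytes first so the zeroes cluster into
   a single long run before deflate sees them. The transform is invertible
   as long as the original length is known. */
#include <stddef.h>

void byteplane_transpose(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t words = len / 4, o = 0;

    for (int plane = 3; plane >= 0; plane--)       /* MSB plane first */
        for (size_t w = 0; w < words; w++)
            out[o++] = in[w * 4 + plane];
    for (size_t i = words * 4; i < len; i++)       /* leftover tail bytes */
        out[o++] = in[i];
}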
 

Cogman

Lifer
Sep 19, 2000
10,278
126
106
http://www.maximumcompression.com/

Check out the PAQ compressor. It offers a high compression ratio while being extremely slow.

It is available under the GPL license.

That said, while PAQ is one of the better general compressors, you can often do much better by analysing your own data and building your own compression system (if you have the time to invest).
 

Rakehellion

Lifer
Jan 15, 2013
12,182
35
91
Okay, I actually ended up writing my own compression algorithm. The files usually contain a lot of nulls, so I added an index that keeps track of where they occur and strips them from the file. I got 177 bytes down to 125 bytes and now zlib actually makes it larger.
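
For reference, one way such a null index could be laid out (the actual format isn't shown in the thread) is a bitmap with one bit per input byte marking the zeroes, followed by the surviving non-zero bytes. For a 177-byte file that is a 23-byte bitmap plus the non-zero bytes, which happens to line up with the 177 → 125 figure above.

Code:
/* Sketch: strip the nulls and record their positions in a leading bitmap
   (one bit per input byte). This layout is an assumption, not the format
   from the post. Output = bitmap + non-zero bytes. */
#include <stddef.h>

size_t strip_nulls(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t map_len = (len + 7) / 8;     /* 23 bytes for a 177-byte file */
    size_t o = map_len;

    for (size_t i = 0; i < map_len; i++)
        out[i] = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0x00)
            out[i / 8] |= (unsigned char)(1u << (i % 8));  /* mark a null */
        else
            out[o++] = in[i];                              /* keep the byte */
    }
    return o;   /* bitmap + surviving bytes */
}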
 

winstongel

Junior Member
Aug 1, 2014
5
0
16
LZMA2 is faster when using 4 or more cores, and it gives better compression.

Site link and irrelevant solution to three year-old question removed -- Programming Moderator Ken g6

Winston
 
Last edited by a moderator:

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,284
3,905
75
Necroed thread locked -- Programming Moderator Ken g6
 