Zip files

A work colleague emailed me recently, amazed that he had compressed 5.87 GB of data down to 34.02 MB. This feat of data compression sent me scurrying to Wikipedia to find out more about how lossless data compression algorithms work, and whether I could game the system to achieve an even greater compression ratio with artisanal hand-crafted boutique files.

I’ve had some good compression wins with LZW compression on TIFF images before. I know from experience that the best compression ratios occur where most of an image shares a single pixel value. This is common with satellite imagery, where the actual image data is a non-rectangular shape within the bounding rectangle defined by the file format (most image file formats only allow a rectangular image). The null cells around the actual image data are assigned a value, often zero (black).

I started out in TNT MIPS software (by Microimages), which I use regularly for image processing. Starting with an image I had at hand, a digital terrain model, I cropped out part of the model over the ocean where I knew all the values would be the same (the ocean is zero elevation in this model). I then resampled the image to a smaller pixel size in order to increase the number of pixels up to my target of roughly 180,000 x 150,000 pixels. To get this target I had to refer to some notes I’ve made over the years comparing pixel count to file size for various image file formats. I aimed for 50 GB.
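The pixel-count-to-file-size relationship for an uncompressed image is simple enough to sketch. The snippet below is only a rough estimate: the 2 bytes per pixel (16-bit elevation values) and the single band are my assumptions, not something recorded above, and the function name is made up for illustration. Headers and GeoTIFF metadata are ignored.

```python
def raster_size_bytes(width, height, bytes_per_sample=2, bands=1):
    """Uncompressed pixel payload in bytes, ignoring headers and metadata.

    bytes_per_sample=2 assumes 16-bit samples; bands=1 assumes a
    single-band image such as a terrain model. Both are assumptions.
    """
    return width * height * bytes_per_sample * bands


size = raster_size_bytes(180_000, 150_000)
print(f"{size:,} bytes")
print(f"{size / 1e9:.1f} GB (decimal), {size / 2**30:.1f} GiB (binary)")
```

Under those assumptions the target grid lands in the neighbourhood of 50 GB, depending on whether you count in decimal or binary units.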

I then exported the image as an uncompressed BigTIFF. A regular TIFF image has a size limit of about 4 GB because it stores file offsets as 32-bit values. BigTIFF uses 64-bit offsets, so its size limit is vastly larger.

I compressed the file with the standard zip compression tool found in most consumer operating systems. The file compression ratio was slightly better than 1000:1.

Why the compressed file is still 55 MB is a mystery, but I’ll assume zip compression breaks down the repetitious data into smaller and smaller chunks in some sort of hierarchical way such that the dictionary is still quite large. Or maybe the TIFF format writes pixels sequentially with line breaks. Who knows? And of course I used a GeoTIFF format for the source file which includes a bunch of metadata.
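One likely piece of the puzzle, assuming the zip tool used DEFLATE (the usual zip method): a single DEFLATE length/distance pair can copy at most 258 bytes, and it costs at least one bit for the length code plus one bit for the distance code. That caps the compression ratio at about 1032:1 no matter how repetitive the data is. The arithmetic below is a sketch under that assumption, with decimal units (1 GB = 1000 MB) assumed for the file sizes. The function names are just for illustration.

```python
MAX_MATCH_BYTES = 258    # longest run one DEFLATE length/distance pair can copy
MIN_BITS_PER_MATCH = 2   # at least 1 bit for the length code + 1 for the distance code


def deflate_max_ratio():
    """Upper bound on the DEFLATE compression ratio (output bytes per input byte)."""
    return MAX_MATCH_BYTES * 8 / MIN_BITS_PER_MATCH


def smallest_possible_mb(expanded_gb):
    """Smallest compressed size in MB for a given expanded size in GB (decimal units)."""
    return expanded_gb * 1000 / deflate_max_ratio()


print(deflate_max_ratio())                    # 1032.0
print(round(smallest_possible_mb(56.04), 1))  # 54.3
```

So a 56.04 GB file cannot shrink much below roughly 54 MB with this method, which is right about where the zip file ended up.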

Following the TIFF file, I decided to use a purer source of repetitious data: the plain text file. I opened up a text editor and wrote a string of identical characters. I then moved to the command line and repeatedly concatenated the text file with itself, eventually ending up with a text file of exactly the same size as the TIFF image from the exercise above.
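For anyone who wants to reproduce the doubling trick without a shell, here is a minimal Python sketch of the same idea. It is not the exact command I ran. The function name, the seed of 1024 "a" characters, and the temporary file name are all made up for illustration. Because every byte is identical, trimming the last doubling down to the exact target size does no harm.

```python
import os
import shutil
import tempfile


def grow_by_doubling(path, target_bytes, seed=b"a" * 1024):
    """Write seed to path, double the file until it reaches target_bytes,
    then trim it to exactly target_bytes. Returns the final size."""
    with open(path, "wb") as f:
        f.write(seed)
    while os.path.getsize(path) < target_bytes:
        tmp = path + ".tmp"
        with open(tmp, "wb") as out:
            # Two copies of the current file, back to back.
            for _ in range(2):
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        os.replace(tmp, path)
    with open(path, "r+b") as f:
        f.truncate(target_bytes)
    return os.path.getsize(path)


# Small demo: a 10 MiB file in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    print(grow_by_doubling(os.path.join(d, "same.txt"), 10 * 1024 * 1024))
```

Doubling means the file reaches tens of gigabytes in a few dozen passes, although each pass rewrites the whole file, so the last few are slow.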

Applying zip compression to this text file yielded pretty much the same compression ratio as the TIFF image (about 1000:1). This ratio doesn’t seem to change much with the file size. I tried 9 GB, 18 GB, 36 GB. I had expected better results. At this stage I’m thoroughly disappointed and slightly bored.

The two zip files: 55 MB (expands to 56.04 GB) and 54.5 MB (expands to 56.04 GB).
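The flat ratio is easy to check at a smaller scale. The sketch below streams a run of identical bytes through Python's zlib module, which implements the same DEFLATE method that zip normally uses. It measures the raw compressed stream rather than a real .zip container, and the sizes, chunk size, and function name are my own choices for illustration.

```python
import zlib


def deflate_ratio(total_bytes, chunk_bytes=1 << 20, level=9):
    """Compress total_bytes of the byte 'a' in chunks and return
    the ratio of input size to compressed size."""
    comp = zlib.compressobj(level)
    chunk = b"a" * chunk_bytes
    compressed = 0
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk_bytes, remaining)
        compressed += len(comp.compress(chunk[:n]))
        remaining -= n
    compressed += len(comp.flush())
    return total_bytes / compressed


for mib in (4, 16, 64):
    print(f"{mib} MiB -> about {deflate_ratio(mib << 20):.0f}:1")
```

If the ratio really is independent of file size, the printed figures should sit close together, in the same region as the roughly 1000:1 seen with the big files.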