Tags

, , , ,

Would you compress a file before encrypting it, or would you encrypt first then compress? It was a question I’ve encountered twice recently. Apparently it’s a ‘thinking question’, with no right or wrong answer.
But at the time I was insistent there was indeed a correct answer – compress first then encrypt. Post the question on Stack Overflow and 90% of the answers would be the same as mine was.

Here’s how I saw it: You have an encryption program (symmetric, of course), and a compression program. Both do two very different things. Symmetric encryption basically mixes and scrambles the bits that make up a file, in a reversible way – the output is theoretically a random collection of bits.
A compression program takes out redundant bits and/or replaces byte patterns with shorter signatures, replacing them again to decompress the file. It’s analogous to removing all the vowels in a novel and making it readable with fewer characters. The downside with compression is that redundancy of characters makes a language resilient against errors, and so it is with bytes.

So, in my own simple world, an encrypted file’s bytes would appear random and structureless, and therefore couldn’t be compressed. It followed that a file would have to be compressed before it’s encrypted. This is probably the reasoning an interviewer would expect.

I decided to test this. First I took a PDF (a 12.9MB Cards Aainst Humanity download) and compressed it to a .zip file before encrypting it with a GPG symmetric cipher. The result was a 12.6MB file. Next, I encrypted the original PDF before compressing it.
So I ended up with two compressed files, and both were 12.6MB. Why?

ZIP 101
Thinking this over in more depth, a file does consist of random bits before it’s encrypted anyway. That is, the bytes are quite meaningless without the program to interpret or decode them, and so wouldn’t mean anything to the compression program either.

Let’s look at what DEFLATE (used in several compression schemes) does to a file, or at least what Wikipedia says about it. DEFLATE actually does several things to reduce the number of bytes.

First it looks for repeating byte patterns, replacing them with a reference. When the file is later decompressed, the program does a lookup before exchanging the references with the original bytes. In our encrypted file, there certainly should be repeating byte patterns somewhere, so this stage is applied.

Next it replaces known byte patterns (‘symbols’) with shorter signatures that are referenced by the encoding program. Again, it’s statistically certain known patterns will occur in any file or bytestream with sufficient random bytes, as attested by running $strings against the raw kernel memory space.

A word of warning
Please don’t quote any of this verbatim in an interview. Many IT pros will disagree with what I’ve posted here, and it’s a ‘thinking’ question the interviewer can follow up in as much depth as s/he likes.

Advertisements