Checksums are Not Encryption and Other Media Fallacies

There are a lot of people that seem to have some confusion about the differences between checksums, encryption, and compression. It’s not just Grandpa and Cousin Doris that are a bit mystified though, it’s often a lot of tech journalists, and the police/judiciary.

At their most basic, the two things might seem alike. They’re ways of describing information that leads to verification of data. Yet apart from this, they’re two very different beasts.

For those wondering why I’ve included compression with encryption, the simple answer is compression is encryption. When data is transformed in any way using a method where it can be transformed back, it’s encryption. Write it backwards and it’s encrypted, write it vertically, and that’s a form of encryption, Write it in Spanish instead of English, and that’s technically encryption too. It might seem simple, but take a phrase, put it in a language you don’t know, and you won’t be able to decrypt it.

The trick is that you’ve done some sort of alteration to the original to make it a new data set. That’s encryption. Now sometimes this can compress, and sometimes it can expand. If your encryption method is to take your phrase, and spread it out, with three random letters after every one of the original, you’ve increased it in size. If, however, you remove every space and replace “THE” with “QX”, your resulting phrase will be shorter.

They love the sound of the sea

Becomes

QXyloveQXsoundofQXsea

One thing to notice though is that there were two different methods used to shorten then. One was a ‘lossy’ method, and the other was ‘lossless’. The=QX substitution is a lossless method, because when going the other way, indications of QX can be turned back into THE very easily. The spaces, however, are not recorded in any way. They have just been discarded without a care, and won’t be brought back.

This generally isn’t a big deal, because most of the time, it’s pretty obvious where the spaces should go, so noting it would be superfluous. This is Lossy compression, where things that make little difference are discarded. For instance, a photograph with large blocks of pretty much one colour can be called a single colour, and defined as an area instead. You’ve made it smaller, but you’ve lost some detail. It’s why some people claim to be able to hear the difference in MP3’s for instance (really, they can’t).

That’s compression and encryption, checksums are different.

Checksums are ways of checking for accuracy, but they are not a way of reconstructing data. It’s a way of describing data simply so that if it’s correct, it passes, and if it’s not, it fails.

A slightly different way to think about a checksum is the list of answers for a maths test. When put with the question, they’re the answers to the question, but if you just have the answers by themselves, they have no real meaning, and you can’t easily deduce the question. If the answer is “2” is the question “Alice has an apple, and bob has an apple, how many apples are there?”, or is it “4½=?” or ‘what is 2(Cos(0))’. In all three cases the answer (or checksum) is a specific value, but knowing the value on its own does not tell you what the question is.

If that’s hard to understand, let’s try an example of checksumming.

Let’s have a sample sequence and to keep it simple, we’ll use the fibbonacci sequence.

1,1,2,3,5,8,13,21,34,55

Now a checksum can be anything, as long as you define it. So, let’s make it simple, and say the checksum is 3 values; quantity, mean and standard deviation. For this data group it would be 10, 14.3, and 17.79, which for clarity we’ll express as (10|14.3|17.79)

So what if we were to change some numbers (or omit them)? Here’s a quick table.

Data

Checksum

1,1,2,3,5,8,13,21,34,55 (original)

(10|14.3|17.79)

1,1,2,3,4,8,13,27,34,55

(10|14.8|18.20)

1,2,3,3,5,8,13,21,34,55

(10|14.5|17.63)

1,1,2,3,6,9,13,21,34,55

(10|14.5|17.70)

1,1,2,3,5,8,13,21,34

(9|9.7|11.23)

1,2,3,5,8,13,21,34,55

(9|15.77|18.21)

0,1,2,3,5,8,13,21,34,56

(10|14.3|18.13)

As you can see, all the modified sequences fail the checksums, BUT, if you were given the checksums, and asked to find the original sequence, you’d find it fairly hard. That’s not their point. The point is to be a check on accuracy.

Now we come to bittorrent. When people say ‘they uploaded it to a isoHunt’, or ‘the downloaded it from the pirate bay’, they’re talking absolute crap. That applies to the judiciary, the press, and of course the various industry groups that seem to exist only to mislead people about things.

Bittorrent is all about checksums. To understand it, we need an example, and we’ll use the official torrent of No Safe Harbor as our example.

The torrent is 46.9 MB and comprises 6 files. Now, to download, you can use a magnet link (such as this one) or go to a site like the Pirate Bay (if you’re allowed to) and download it (and it’s perfectly legal to do so, I am the book’s co-author, and own the copyright, along with my esteemed colleague, Bradley Hall). Either way makes no difference.

What you have downloaded from TPB is not the book. You can’t read it. Instead you’ve got a checksum file. That’s because that’s what a bittorrent file is, a massive pile of checksums.

Put simply (and this is very simple, for more detail, you need to read the documentation) you make a bittorrent file by taking all the files and placing them end-to-end, like a bit snake, or a queue of people for a ride. You then break it up into equal sized chunks, like roller-coaster cars would break up people in line. If you get to the end of the first file, you immediately start filling the piece with the start of the next file. In the end, you have a bunch of pieces all one size, and a last one that’s almost always smaller. Then you checksum them, and that’s the majority of your bittorrent file.

Well, while it’s a checksum, because it uses a cryptographic method, it’s called a ‘cryptographic hash function’, but it’s effectively a checksum (as far as we’re concerned). So now you’ve got a lot of pieces (in the No Safe Harbor torrent, it’s 751 pieces each of 64kb, and a last piece which is smaller) and each of these pieces has a SHA1 hash, or checksum.

The torrent file contains a list of files, and their size, and which pieces they start and end on. So you end up with something like this being known by a bittorrent client.

pieces in torrents

Now, each of those pieces has a hash value, so when a piece is downloaded from other peers, a client can then use the checksum (the SHA1 cryptographic hash in the torrent file) to ensure the data is correct. If it is, it saves it; if it’s not it marks it as failed and throws it away.

Now, because each of these pieces are independent, can be verified on their own and don’t rely on any other data, they don’t have to be downloaded in order. This is part of what makes bittorrent so efficient. You can take any piece from any one, and indeed aspects of the same piece from multiple sources at once.

Now, there’s one more bit of checksum-ing, and that’s the torrent identification. Again, like each of the pieces, the torrents themselves have their own checksum. For No Safe Harbor, it is 79ADFF2965C672CC66F2AD54D67857BD3BAEEC61

This is often called the torrent hash, and it’s most commonly seen now through magnet links. It is the unique torrent identifier. Its 40 characters long, and each one can be any of 16 characters, which means 1640 possible outcomes, (or 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 If you want it written out). This is a HUGE amount (its 49 digits long) and is basically the checksum of the checksums. That’s one reason two different torrents, of the same thing, can have different hashes. Change the file order, or the piece size, or the file names (because the client won’t know to check a file with a different name) and the swarms don’t interact.

There’s also no way to easily predict what a hash will be, and how to tell what a particular torrent hash is, except by running it.

And this brings us full circle. Because no matter how many times you download a torrent file from The Pirate Bay, all you have are a bunch of checksums. There is no possible (realistic) way to take that checksum, and work back to get the data. If there were, there wouldn’t be so many people complaining about ‘dead torrents’ (where there isn’t a complete copy amongst all the peers on a torrent).

A torrent file is not encryption. It does not contain the data people want, either compressed, or otherwise. It contains only an answer, and one which is not obvious. In fact, while the number of possible outcomes is large, it’s not THAT big. The number of possible simulations in the Muon1 project is much bigger (up to 101653 compared to 1048 here) so conflicts (called Hash Collisions) are possible.

A Hash collision happens when data is deliberately (or accidentally, for that matter)  constructed in such a way as to give the same hash value. While it’s possible, it’s very computationally expensive (which is why SHA1 is used, rather than ‘quantity|mean|S.D.’) which makes it ‘safer’., insofar as they’re not likely.  Nevertheless, the possibility exists.

This is also why it’s impossible to reconstruct the data from the torrent file, and illustrates the difference between checksums/cryptographic hashes and encryption/compression.

Another way to format 1640 is as 2160 which means that the hash is 160 bits long in binary. However, that’s only 20 bytes in length. So if you’re hashing 20 bytes of data, each one CAN have a unique hash (there’s no guarantee they all will) but twice as much data will mean there are roughly two possible outcomes for each hash. 1 kilobyte (1000 bytes) would then have 50 possible outcomes on average, and so each 64 kilobyte piece in the sample torrent will have 64 times as many, or 3,200 possible outcomes. Sure, it seems like a lot, but that’s out of 3,200×1640 possible outcomes (or 4.67×1051 for those that care) there are thousands of outcomes, but that’s out of 4 thousand-trillion-trillion-trillion-trillion possible.

That also means if I had the world’s biggest supercomputers, and they had the torrent file, and nothing else, they could take that torrent file, and produce (roughly) 3,200 completely different sets of files. You would then have to check through each of those 3,200 outputs to see which one made sense, as it’s extremely unlikely that any but one would produce readable documents.

And all of this is with what’s considered a small piece-size, of only 64KB. For larger torrents, you can have larger pieces, where there are more duplicates, but also more possibilities as well. In fact, since SHA1 always gives out a 40 character hex value, the ratio between the number of possibilities of giving a particular hash, and the total number of possibilities always stays the same. 1MB piece sizes are typical (so giving you 51,200 average collisions) and for larger torrents, 4MB is common, and even 16MB is not unknown (giving 819,200 sets of data, all of which could match the hashes)

As you can see, what you can download from the pirate bay isn’t anything incriminating at all. It’s just maths. As we said right at the start, knowing the checksum is like knowing the answers to a test, yes you know it but you don’t know what the question is.

So the next time you see someone saying they ‘downloaded a film from The Pirate Bay’, or ‘got this data from a bittorrent site’, you too can laugh along at them. When a judge makes a ruling about a bittorrent site, you can cry along with us too. Because when they make those rulings about ‘downloading things from the site’, they may think they have the answers, but the reality is they don’t even know what the questions are.

  • ktetch

    Just made some tweaks, mainly maths errors (I really do know that Cos(0)=1 not Cos(1), honest!) and some typos, thanks to Jinx.

  • Pingback: DRM, the Private Flag, and BitComet | Politics & P2P()

  • Decim

    I think you’ve made some calculations in your errors but I’m struggling to get my head around it and actually check, so I’ll post anyway:

    256 = the number of permutations for a single byte

    256²⁰ = 1,4615016373309029182e48 different permutations for 20 bytes

    256⁴⁰ = 2,1359870359209100824e96 different permutations for 40 bytes

    256⁴⁰ ÷ 256²⁰ = 1,4615016373309029182e48 hash collisions (№ theoretical 40-byte inputs per 20-byte hashes)

    so if you had the 20-byte hash of 64000 bytes:

    256⁶⁴⁰⁰⁰ ÷ 256²⁰ = 1,55948302614583084578e154079 different collisions
    (yes, 1.6ish with 154079 zeroes on the end!!)

    Ergo you’d have to construct 1,55948302614583084578e154079 files on your supercomputer and check each one to check if it’s the one you were after.

    This is absolutely insane as I’m sure you can realise, whereas only 3200 different sets of files only (as in your post) is actually fairly feasible depending on what you’re after.

    Of course, it’s always possible to brute force a file which is valid and has the same hash, but is a different file, which is why you have to download the entire 64 kB file, not just 20 bytes + 2 bytes describing which of the 3200 files it was (2 bytes can have 65536 different permutations/values, so this is ‘overkill’ even)

    (if we could get away transferring 64 kB as 20+2 bytes, why would we have the 64 kB file in the first place?)

    I might not be correct on the calculations but the numbers you’ve provided seem too small, especially given my explanation in the last 2 paragraphs (representing 64 kB as 20+2 bytes)

    Cheers