If you follow tech news, you have a certain list of sites you’ll keep an eye on. Personally, I always keep an eye on TorrentFreak (but then, I am their researcher, and night-time comment moderator) but there are others as well: Wired’s Threat Level, Slyck, and of course, ArsTechnica.
The problem for all tech news sites is that there’s a deadline game. You have to be first to break the story, so you can get it passed around the social media circles: Facebook, Slashdot, etc. Often that means that stories, or more specifically the data behind them, don’t get the attention they should. ArsTechnica has fallen foul of this, repeating the conclusions of a study without noticing some glaring errors.
Here is how the study works. The researchers go to a site, grab the 10 most popular torrents, and pull the trackers from them, giving 23 trackers. Then they fully scrape those trackers, and pick the most popular torrents from the combined scrape response. They take each torrent’s filename, use it to categorise the torrent, and from that we get the results.
So they pick the most popular torrents based on a combined figure of seeds, obtained by scrapes which are notoriously easy to spoof. From that they look at the names, and categorise them based on that, which is, again, easy to spoof. Finally, a determination of the copyright status is made based on the name. And thus we end up at the 0.3 figure championed in the Ars piece.
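To make the weakness concrete, here is a minimal sketch of that ranking step. It is not the study’s code. The tracker names, torrent labels, and seed counts are all invented for illustration. All it does is add up whatever seed count each tracker reports in its scrape response and sort by the total, which is what the method described above amounts to.

```python
from collections import defaultdict

# Hypothetical scrape responses: tracker -> {torrent label: reported seeds}.
# None of these values come from the study; they are invented for illustration.
scrapes = {
    "tracker-a.example": {"linux-iso": 4200, "fake-movie": 1500},
    "tracker-b.example": {"linux-iso": 3900, "fake-movie": 1200},
    # One tracker that reports whatever it likes for the fake torrent.
    "honeypot.example": {"fake-movie": 1_000_000},
}


def combined_seeds(scrapes):
    """Sum the self-reported seed counts for each torrent across all trackers."""
    totals = defaultdict(int)
    for reported in scrapes.values():
        for torrent, seeds in reported.items():
            totals[torrent] += seeds
    return dict(totals)


def rank_by_seeds(scrapes):
    """Return torrent labels ordered from most to fewest combined seeds."""
    totals = combined_seeds(scrapes)
    return sorted(totals, key=totals.get, reverse=True)


ranking = rank_by_seeds(scrapes)
print(ranking)
```

One tracker’s unverified number is enough to put the fake at the top of the list. Summing across trackers also counts any peer that announces to several trackers more than once, so the combined figure overstates even the honest torrents.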
Problems in this method are not hard for anyone to spot. The most obvious one, for anyone that’s been around torrents for a while, is that huge seed numbers are usually a bad sign. They’re fakes, put up by Anti P2P companies to discourage people by being impossible to finish, or by not being what they claim to be. This started in the Napster era, and the most famous example was Madonna, who hit the headlines with it back in 2003. Alternatively, they can be torrents of trojans and other malware set up to infect people’s computers, deliberately mislabeled and with hyper-inflated seed/leech figures to entice people to download them. The most important thing to do in a study involving bittorrent is verify. Anyone that knows bittorrent well knows this. The highest number of seeds on a torrent that has been verified, at least that I’m aware of, wouldn’t even be on the first page of their torrents.
The second issue is the initial data collection itself. They went to a site, grabbed the most popular (which, as has just been noted, is not the way to get a ‘valid’ torrent) then used the trackers listed in them, to compile a list of trackers, noting “each torrent having at least 10 trackers associated with it”. No torrent should have more than 2 trackers, and really only needs 1. More trackers don’t add anything (no extra peers, just extra overhead for you, and for the tracker). It does mean that ‘disreputable’, or honeypot trackers (ones set up specifically to track users for purposes other than being purely a bittorrent tracker) can hide in the swarm better. Again, to anyone with knowledge of bittorrent, this is well known. Thus, when they include these trackers, they’re going to get ‘more’ fake/honeypot/trojan’d torrents, rather than the ‘real’ torrents they are after in order for their study to be accurate.
Finally, names. They categorise based on names, and as we all know, the name of the file ALWAYS matches the contents. It’s not like you can change a file name to anything you want and the contents stay the same… Oh wait, it’s EXACTLY like that. So, what they have listed as The Incredible HulkDvDrip-aXXo97065494792.4447 could quite easily be ‘randomDataGeneratedByN00b’ and not work, or could be 20 seconds of the film intro, then switch to Rick Astley for 3 minutes, and then random data. If so, then The University of Ballarat AND AFACT have both been Rickrolled, very very publicly. It’s also not like AntiP2P companies intentionally misname files, oh DAMN, yes they do, that’s exactly what MediaDefender did.
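If you want to see the filename point for yourself, here is a minimal sketch. The two file names are hypothetical and the contents are just random bytes. It hashes a file, renames it to something that looks like a movie release, and hashes it again.

```python
import hashlib
import os
import tempfile


def sha1_of(path):
    """Return the SHA-1 hex digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names, chosen only for illustration.
    original = os.path.join(folder, "randomDataGeneratedByN00b.bin")
    renamed = os.path.join(folder, "Some.Blockbuster.DVDRip.avi")

    # Fill the file with 1 KiB of random bytes.
    with open(original, "wb") as handle:
        handle.write(os.urandom(1024))

    before = sha1_of(original)
    os.rename(original, renamed)
    after = sha1_of(renamed)

# The digest depends only on the bytes, so the rename changes nothing.
same_contents = before == after
print(same_contents)
```

The name is a label anyone can set. Nothing short of actually looking at the contents tells you what a file is.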
So, we’ve got a method that uses bad data, collected by using other bad data, using bad data to make determinations about copyright. From that, AFACT makes a big deal gloating about it.
The Australian Federation Against Copyright Theft (AFACT) has welcomed the release of a research paper by the University of Ballarat into the extent of infringing content on BitTorrent networks stating it gives a clear insight into the nature of traffic on Bit Torrent network.
The academic research is the first to quantify the percentage of infringing BitTorrent (BT) traffic. Previous research only looked at the overall percentage of BT traffic across the internet, but not the legality of the traffic packets.
The key finding was that at least 89.9% of all torrents to be infringing.
The research analysed a sample of 1,000 unique torrents taken from 19 of the most popular BT trackers. The research objective was to investigate the percentage of shared files which are infringing, both by number of files and total seeders, as well as to evaluate the most popular categories of shared files. The results found that the percentage of legitimate BT traffic being shared.
A summary of key findings included:
1. 89.9% of all torrents within the sample were found to be infringing both by the number of files and total downloads. This was excluding all pornographic torrents whose legality could not be verified. If all pornographic titles were classified as infringing this overall figure would rise to 98.1%.
2. The top two categorized torrents were movie and TV shows making up 72.4% of all torrents. There were no legitimate movies or TV torrents in the sample.
3. The top two movie files were being seeded more than 1 million times each. The third most popular movie file was being seeded more than 500,000 times.
4. 9.9% of torrents were responsible for 90% of the total seed population.
5. Only 1 non-infringing torrent (an open source program) was found in the most popular 100 torrents.
Unfortunately, as I’ve now pointed out, these points are completely unsupported by the data. The highest number of seeds on a torrent that I’ve been able to verify was around 115,000, as I said a few weeks ago. A million is right out. They’ve also called bittorrent a network. It’s not; it’s a protocol. A network would imply they’re all connected. While they may be connected via DHT, that’s another protocol on top of bittorrent. Bittorrent itself doesn’t have a network, and never has.
ArsTechnica on the other hand SHOULD know this, it’s their job to. If they weren’t sure, they should ask. I know they even have my personal cellphone number, as they’ve interviewed me on it in the past. AFACT almost certainly knows it’s inaccurate (they’re not dumb, despite what people think) but it’s exactly the message they want to promote. The fact that they have to resort to effectively worthless studies to make their point should tell you everything you need to know about the validity of their point.