"You're good with computers, aren't you?" When I hear this question, I know that someone either wants their broken MacBook repaired for free or, as has become more frequent lately, wants to know how this data collection stuff on the internet works.
In the wake of the revelations about the NSA and GCHQ spy programs, more people than ever have asked me about it, and how they can protect themselves. So I decided to write a blog article about it. I will not cover much of the secret service programs, because I don't know more than the newspapers do. What GCHQ does is basically plug into an undersea cable and dump all the data that comes through onto a serious batch of hard drives.
Now, what about those data-gathering molochs such as Google and Facebook? They cannot just plug themselves into your internet cable, so how do they do it?
They use different approaches to identify you on the internet. I can cover only a few of them in a humble blog post, but I'll try.
It is important to note that identifying you doesn't automatically mean they know your name and home address. In the first step, all that matters is being able to find you again. Your personal details can be worked out later by combining the data.
The most obvious choice is the IP address of your computer. IP is short for Internet Protocol, and it's the low-level "language" computers on the internet use to talk to each other. In order to send data from one computer to another, each of them needs a name, in this case the IP address, which is a 32-bit number. For better readability, IP addresses are written as four numbers separated by dots. For example, 220.127.116.11 is the IP address of this computer. In fact, every time you type a name into the address bar of your browser, the computer translates that name into an IP address. To make things more complicated, there is another type of IP address out there, called IPv6. These addresses have become more common recently, and the reason is simple: the old system has only about 4 billion addresses (a lot fewer in practice, because many areas of the address space are reserved for special purposes). The central pool of unassigned addresses ran dry in February 2011, and in April 2011 the first regional registry, the one for the Asia-Pacific region, ran out as well. IPv6 fixes the problem simply by making more addresses available. The IPv6 address for afanen-writes.net is 2a01:4f8:130:2304::2. It looks more complicated, but it is in fact the same thing; the addresses are just a lot longer (128 bits instead of 32).
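If you want to see that an IP address really is just a number, here is a small Python sketch using the two addresses from above. The helper dotted_to_int is a name I made up for illustration; the ipaddress module is part of Python's standard library.

```python
import ipaddress


def dotted_to_int(text):
    """Convert a dotted IPv4 address into the 32-bit number it stands for."""
    value = 0
    for part in text.split("."):
        # Each of the four parts is one byte (0-255): shift what we have
        # collected so far eight bits to the left and append the new byte.
        value = (value << 8) | int(part)
    return value


v4 = ipaddress.IPv4Address("220.127.116.11")
v6 = ipaddress.IPv6Address("2a01:4f8:130:2304::2")

# The hand-rolled conversion and the library agree on the number.
print(dotted_to_int("220.127.116.11") == int(v4))

# Size of each address space: 2 to the power of the address width in bits.
print(2 ** v4.max_prefixlen)
print(2 ** v6.max_prefixlen)

# The "::" in an IPv6 address is shorthand for a run of zero groups.
print(v6.exploded)
```

The second line of output is the "about 4 billion" figure, and the third shows how much room IPv6 adds.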
Now that you have an IP address, it's easy for a service such as Google to find you. Every time you use Google's search engine, Google can identify you by your IP address and store what you searched for and which result you clicked on. Google's software also tries to predict what you are most likely to click on, and orders the results accordingly. You can try this quite easily: connect to the internet through a service like Tor, which anonymises you, from a friend's computer, and search for anything on Google. Then compare the results with what you get when you use your own computer. See the difference?
The downside of IP address tracking, from the tracker's point of view, is that you do not always get the same IP address when you go online, and you will almost certainly have a different address when you go online from a different place.
Cookies are small pieces of data that your web browser stores locally on your computer. They can be very useful. For example, if you use an online store and put something into your shopping basket, the basket is typically tied to a cookie. Technically, a cookie is a small piece of data kept in a file on your computer, and it usually stores just an identifying number. These cookies can also be used to track you. For example, a service called Google Analytics uses this information to gather data for market analysis (advertising built on data like this is where Google makes most of its money). This is done by setting a cookie under a specific name on your computer. Every time you visit a site that uses this service, a small piece of code is executed that reads that cookie and tells Google that you visited that specific site. A good first step to prevent that is to configure your browser to delete all cookies once you end your browser session. You should also use a browser add-on such as Ghostery, which prevents the tracking code from being executed on your computer.
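To make the cookie mechanism a bit more concrete, here is a rough Python sketch of the two halves: a server handing out an ID, and the same server recognising it on a later visit. The cookie name visitor_id and both function names are made up for illustration; this is not how any particular tracking service does it, just the general idea.

```python
import uuid
from http.cookies import SimpleCookie


def issue_tracking_cookie():
    """Build the header a server could send to store a random ID in the browser."""
    visitor_id = uuid.uuid4().hex  # random and meaningless, but unique to you
    cookie = SimpleCookie()
    cookie["visitor_id"] = visitor_id
    cookie["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # keep it for a year
    cookie["visitor_id"]["path"] = "/"
    return visitor_id, cookie.output(header="Set-Cookie:")


def read_visitor_id(cookie_header):
    """Pull the ID back out of the Cookie header the browser sends on later visits."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None


visitor_id, set_cookie_line = issue_tracking_cookie()
print(set_cookie_line)

# On every later request the browser sends the stored value back,
# so the server knows it is the same browser as before.
print(read_visitor_id("visitor_id=" + visitor_id + "; theme=dark") == visitor_id)
```

Note that the ID itself says nothing about you. It only becomes interesting once it shows up again and again next to the pages you visit.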
A more sophisticated approach is the Like button you find on many sites. If you are logged in to Facebook, the code behind the Like button is tied to your running Facebook session, so you're identified pretty much on the spot. You don't even have to click on it; it's enough that it's there. It will also send information back even if you are not a Facebook user at all. Luckily, Ghostery blocks these Like buttons as well.
When you click on a link in your browser, it sends a request to the server, asking for the page the link leads to. Among the information it sends is the address of the page you are coming from. This is called the Referer. And yes, it's actually "referer", not "referrer", which would be the correct spelling. In the early days there was a typo in the documentation, and it somehow stuck. Originally the referer was meant to allow a website to keep track of your path, enabling such things as breadcrumbs (like the ones you see at the top of each article on this page). Many sites also use it to track which external link brought you to them, and send that information back to a company such as Google. You can use a tool like RefControl to gain control over this.
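Here is roughly what such a request looks like on the wire, together with a few lines of Python that pick the referer out of it. The request text, the host names and the function name referer_of are invented for this example.

```python
def referer_of(raw_request):
    """Return the Referer value of a raw HTTP request, or None if there is none."""
    lines = raw_request.split("\r\n")
    for line in lines[1:]:      # skip the request line ("GET /page HTTP/1.1")
        if not line:            # an empty line ends the header section
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "referer":  # header names are case-insensitive
            return value.strip()
    return None


# A made-up request, as a browser might send it after you clicked a link
# on a search results page.
request = (
    "GET /articles/tracking HTTP/1.1\r\n"
    "Host: blog.example.net\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Referer: https://search.example.org/results?q=online+privacy\r\n"
    "\r\n"
)

print(referer_of(request))
```

In this example the address of the previous page even contains the search terms, which is exactly the kind of detail that makes the referer interesting for tracking.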
Adobe Flash Player
Flash Player is a browser plugin that allows all sorts of interactive programs to run inside your browser. Nowadays it is mostly used for video streaming, but it contains a full-blown programming language, so web designers can run practically any code on your machine. YouTube uses it as the default player for its videos. YouTube also allows using HTML5 for playing videos, and if you have a browser less than two years old, you should use that instead. There's not much to say about Flash Player except: don't use it. It is extremely bloated (playing a 720p video on YouTube uses all four CPU cores in my computer, while you hardly notice the load when using HTML5), unreliable and notoriously insecure. Since it comes with its own runtime environment, all the countermeasures above are useless against it. If you absolutely rely on it, use Flashblock. It lets you choose which Flash programs on a website you want to run by clicking on them.
Asking your browser
Most browsers send a lot of information to the server they request data from: version numbers, details of the system they are running on, stored cookies and so on. Compute a hash value from all that data, and you get a pretty unique fingerprint of your browser. You can use Panopticlick from the Electronic Frontier Foundation to find out how unique your browser is.
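The hashing step can be sketched in a few lines of Python. The attribute names and values below are made up, and real fingerprinting uses many more signals than this, so take it as an illustration of the idea rather than a description of what Panopticlick does internally.

```python
import hashlib


def fingerprint(attributes):
    """Boil a dictionary of browser details down to one short, stable ID."""
    # Sort the keys so the same details always produce the same text,
    # no matter in which order they were collected.
    canonical = "\n".join(
        "{}={}".format(key, attributes[key]) for key in sorted(attributes)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Invented example values for two browsers that differ only in one setting.
browser_a = {
    "user_agent": "ExampleBrowser/1.0 (Linux x86_64)",
    "language": "en-GB",
    "screen": "1920x1080x24",
    "timezone": "UTC+1",
}
browser_b = dict(browser_a, timezone="UTC+2")

print(fingerprint(browser_a))
print(fingerprint(browser_a) == fingerprint(browser_b))
```

A single changed detail gives a completely different hash, and the more unusual your combination of details is, the fewer people share your fingerprint.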
What can I do?
If you want to protect yourself, use the tools I have listed. Use encryption whenever possible. Try HTTPS Everywhere, a browser plugin that automatically redirects you to the encrypted version of a website if one is available. Avoid using Facebook and Google Mail. There are alternatives. Use end-to-end encryption for email and messengers. Join electronic civil rights organisations such as the Electronic Frontier Foundation. Use free and open-source software wherever possible. And last but not least: stay alert!