Collecting and analysing data, popularly referred to under terms such as Big Data or Smart Data, has undoubtedly become a powerful tool that is gaining prominence in many aspects of modern society. Sectors as diverse as the media, business and economics, and political and even military intelligence have adopted it, even where nothing obliged them to build data analysis into their operations.
Different motivations, same objective
Obtaining and using data, both from public (OSINT) sources and from other "tracking" methods, is very useful for compiling information on internet users. The motivations are many: building client profiles and analysing behaviour to improve sales and marketing strategies, customising prices and advertising according to the origin of the target, monetising the collected information by selling it to third parties, monitoring individuals or groups, compiling statistics, etc.
- Some motivations for the monitoring and identification of users -
These techniques are so common that nobody is surprised any more by, for example, the accuracy of the adverts shown while we browse, or by online shopping prices that vary depending on the target audience.
- Price setting depending on the location and/or operating system. $sheriff plugin -
To identify, classify and compile information on internet users, various web technologies and methods are used that collect enough data to build a detailed picture of a user and their patterns of behaviour.
These mechanisms are widely used and have a direct impact on the privacy of the user, who, as we will see below, has no easy way of avoiding this exposure and protecting their anonymity.
In this article, we will focus on "web tracking", which refers to tracking mechanisms aimed at identifying devices, browsers and tools employed by all of us as internet users.
Web tracking and identifiers
These are not the only ones, but the techniques most commonly used to profile users or devices can be grouped as follows:
- Client-side identifiers (session, cache, local storage)
- Hardware/software fingerprint
- Other methods: specific patterns of behaviour, local preferences, injection of HTTP headers.
- Client-side identifiers -
Within this category of identifiers we find certain elements (data, files) that are stored locally by browsers in different locations on the client computer. These data will be transmitted to web servers and used to identify the user and carry out the desired operations in accordance with their profile.
Stored data cannot always be removed through an automatic or preconfigured mechanism, which makes elimination difficult and favours persistence. The data may be held in local, session or cache storage and, in principle, each type has a clearly defined lifetime.
- Types of local storage and assumed method of elimination -
As we will see below, browser data elimination mechanisms are not always as effective as could be expected.
Session identifiers are stored temporarily while the user uses the browser and persist only while the session lasts. They are usually elements contained in the page, such as hidden fields, DOM properties or explicit authentication web forms, that only validate the user during an active session. Unlike cookies and other methods, these identifiers are not stored permanently and disappear when the session ends or the page is closed. This method is dated and rarely used, particularly when cookies or another type of storage with greater persistence is available.
The cache must also be taken into account from the privacy perspective. Web browsers implement a cache that lets them perform better when viewing previously visited sites by storing part of their content, such as images or scripts. It is a very common type of storage, and the persistence of the stored data depends directly on the browser configuration and on manual elimination.
The persistence of cached elements is generally determined by values set in HTTP headers when a website is visited. These headers include:
Expires / max-age (a Cache-Control directive). Sets an expiry date or lifetime for the data, which remain in the cache until that point is reached or the cache is cleared manually.
ETag. Labels a specific version of a resource; its value tells the browser whether the resource has changed.
Last-Modified. Indicates the date on which the web content last changed.
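One well-known way in which these cache headers can double as an identifier is to hand each new visitor a unique ETag and read it back from the If-None-Match header on later requests. The following is a minimal server-side sketch of that idea, not taken from any particular tracker; the function name `handle_request` and the choice of a random UUID as identifier are assumptions made for illustration.

```python
import uuid


def handle_request(request_headers):
    """Return (status, response_headers, visitor_id) for a tracked resource."""
    # HTTP header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in request_headers.items()}
    etag = headers.get("if-none-match")
    if etag:
        # Returning visitor: the browser echoed the ETag it had cached.
        visitor_id = etag.strip('"')
        return 304, {"ETag": etag}, visitor_id
    # First visit: mint an identifier and deliver it as the ETag value.
    visitor_id = uuid.uuid4().hex
    response_headers = {
        "ETag": f'"{visitor_id}"',
        # Ask the browser to keep the resource but revalidate it every time.
        "Cache-Control": "private, max-age=0, must-revalidate",
    }
    return 200, response_headers, visitor_id
```

Because the value lives in the cache rather than in the cookie store, deleting cookies alone does not remove it; clearing the cache does.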
Local files and data
- HTTP cookies
Cookies have been and continue to be a widely used mechanism for identifying and profiling users, so that a history of their browsing, preferences and sessions can be maintained. They can store up to 4 KB of information and are easy to delete, but deletion does not always prevent tracking information from persisting or being reconstructed, as explained below.
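As a rough illustration of how a cookie acts as an identifier, the sketch below reads a Cookie request header and, if no identifier is present, creates one and builds the matching Set-Cookie value. The cookie name `uid`, the one-year lifetime and the function name `identify` are hypothetical choices, not part of any specific product.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "uid"  # hypothetical cookie name used only in this sketch


def identify(cookie_header):
    """Return (visitor_id, set_cookie_value or None) for one request."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if COOKIE_NAME in jar:
        # Known visitor: reuse the identifier the browser sent back.
        return jar[COOKIE_NAME].value, None
    # New visitor: create an identifier and ask the browser to store it.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = 365 * 24 * 3600  # assumed lifetime: one year
    return visitor_id, out[COOKIE_NAME].OutputString()
```

On the first request the second return value would be sent in a Set-Cookie response header; on later requests it is None because the browser already carries the identifier.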
- Flash (Flash cookies)
Adobe Flash applications store data on the client side using a mechanism analogous to HTTP cookies, known as Local Shared Objects (LSO), which have a storage capacity of up to 100 KB. This content may be accessed from every installed browser, since Adobe Flash uses the same storage path for all of them.
Similarly, Microsoft Silverlight applets maintain local data storage known as Isolated Storage. This technology does not depend on the browser's cache or its browsing data management, so the files must be deleted manually. This characteristic gives it a high degree of persistence, which may be used for identifying web clients. Moreover, this storage may be shared between different requests or windows of the browser.
HTML5 brought important new features, among them the WebStorage API, which includes modules for managing data storage with different degrees of persistence, such as Local Storage (persistent) or Session Storage (per session). Similarly, IndexedDB and File are other examples of APIs for storing files and managing databases on the client, and removing their data requires manual intervention on most occasions.
Java also has an API, PersistenceService, that provides methods for storing local data on the client, even for applications outside the browser environment.
HSTS is a security mechanism whose objective is to ensure that connections to a specific domain are only made over HTTPS. For this, the browser keeps a list of sites that is populated initially and then extended on demand through HTTP headers. Once an HSTS registration is stored, it remains until it expires, and eliminating cookies, cache or temporary files has no effect on it. Elimination can only be carried out through advanced browser options in a nontrivial manner.
- HSTS registrations and configuration in Chrome -
By employing this mechanism, it is possible to generate HSTS registrations in the user's browser and thus create a set of entries that works as an identifier, which has been called, perhaps inappropriately, HSTS supercookies. A proof of concept can be tried at the following link: http://www.radicalresearch.co.uk/lab/hstssupercookies (Link currently unavailable)
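The idea behind such an identifier can be sketched as a bit encoding: each bit of the identifier is mapped to one subdomain, and the bit is 1 if the browser holds an HSTS registration for that subdomain. On a later visit the page probes every subdomain over plain HTTP, and the server notes which probes arrive over HTTPS. The code below is only an illustration of that encoding; the domain `tracker.example`, the 16-bit width and both function names are assumptions.

```python
BITS = 16                        # assumed identifier width
BASE_DOMAIN = "tracker.example"  # hypothetical tracking domain


def hosts_to_pin(identifier, bits=BITS):
    """Hosts that should answer with a Strict-Transport-Security header."""
    # One subdomain per bit; only the 1-bits receive an HSTS registration.
    return [f"b{i}.{BASE_DOMAIN}" for i in range(bits) if (identifier >> i) & 1]


def decode_identifier(https_hosts, bits=BITS):
    """Rebuild the identifier from the hosts whose HTTP probe arrived as HTTPS."""
    seen = set(https_hosts)
    return sum(1 << i for i in range(bits) if f"b{i}.{BASE_DOMAIN}" in seen)
```

For example, the identifier 5 (binary 101) would be stored by registering `b0.tracker.example` and `b2.tracker.example`, and recovered when only those two probes are upgraded to HTTPS by the browser.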
As we can see, HSTS, regardless of its original mission, has become a tool that can be used to obtain information that compromises the user's privacy. Even browsing history can be obtained, as was shown in a study presented at the San Diego ToorCon security conference in 2015. A proof of concept called Sniffly demonstrates how, by issuing HSTS requests and analysing the browser's response time (which depends on whether or not it has a stored registration), it is possible to deduce the sites visited by the user.
The Panopticlick initiative of the Electronic Frontier Foundation (EFF) implemented a proof of concept to demonstrate how distinctive a particular web client is at the moment it accesses a website. Using an algorithm that collects information through HTTP and AJAX requests on the installed plugins, the screen resolution, fonts, the time zone, cookies and Flash objects, it determines a fingerprint that allows a web client to be distinguished from millions of others. For browsers that have Java and Flash activated, they state that the reliability of identification is around 95%. This figure shows how accurately a fingerprint can be constructed to single out a specific browser or device.
- Example of Panopticlick analysis -
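A fingerprint of this kind can be approximated by hashing the collected attributes in a canonical order, and the identifying power of each attribute value can be expressed in bits of surprisal. The sketch below is a simplified illustration, not Panopticlick's actual code; the attribute names in the usage note and both function names are assumptions.

```python
import hashlib
import json
import math


def fingerprint(attributes):
    """Hash a dict of browser attributes into a stable hexadecimal fingerprint."""
    # Sorting the keys makes the result independent of collection order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def surprisal_bits(share_of_browsers):
    """Bits of identifying information in a value seen in this share of browsers."""
    return -math.log2(share_of_browsers)
```

Calling `fingerprint({"timezone": "UTC+1", "screen": "1920x1080", "fonts": ["Arial", "Verdana"]})` returns the same digest whenever the same values are observed. A value shared by one browser in 1024 contributes 10 bits, and a few such attributes combined are enough to separate one client from millions.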
HTML5 canvas fingerprint
The canvas fingerprint is one of the most recent methods used to identify devices according to the characteristics of their hardware. WebGL graphics technology is used to render an image in an HTML <canvas> element on the client. This image, as shown in the study "Pixel Perfect: Fingerprinting Canvas in HTML5", depends directly on the hardware and has a degree of entropy that is high enough to create a fingerprint of the user's computer. By analysing the pixels that make up the image generated in the web client, it is possible to obtain an identifying fingerprint with a high degree of accuracy.
- CanvasFingerprintBlock plugin for the detection and blocking of canvas tracking images -
Network fingerprints and geolocation
- Obtaining the private IP address using WebRTC (Chrome, Firefox) -. http://net.ipcalf.com
Geolocation is another item of data extracted and used for profiling and identification. For this, the IP address is looked up in public databases or the HTML5 Geolocation API is used. However, geolocation is not sufficiently accurate and is affected by circumstances such as the use of a VPN or the Tor network, which mask the real origin.
Other identification mechanisms
Markers dependent on preferences and behaviour
There are characteristics that are linked to the normal behaviour of the user and which, therefore, are not connected to a specific device. These data are useful for identifying the profile of a user of a device. This information includes:
It is worth highlighting that cache and browser history information can be obtained with some collaboration from the user, as explained in the study carried out in collaboration with Microsoft: I Still Know What You Visited Last Summer. This study describes various techniques, most notably one that exploits the different colours a browser uses to distinguish between visited and non-visited links: the links are camouflaged in images, captchas or other interactive elements of the page, and the user's interaction reveals whether or not they have been visited.
Another, simpler approach to accessing the browser history that does not need interaction with the user is the abovementioned Sniffly, where HSTS technology is employed.
Injection of HTTP headers
The injection of HTTP headers already caused a stir in 2014, when it was revealed that Verizon, an American telecommunications operator, was adding headers with markers to the HTTP traffic of its clients, with the objective of creating an identifier for each of them. These markers were dubbed "permacookies".
More recently, the initiative accessnow.org published the study The Rise of Mobile Tracking Headers: How Telcos Around the World Are Threatening Your Privacy, which explains this same identification strategy as used by telephone operators on mobile devices. Faced with this type of action little can be done, since it is impossible to control traffic once it has left our mobile terminal and is in the hands of the operator.
- Statistics extracted from the analysis of headers. Source: accessnow.org -
- Some HTTP headers used by different operators. Source: accessnow.org -
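Although injected headers cannot be removed by the user, they can at least be detected by sending a plain HTTP request to a server that echoes back the headers it received and comparing the two sets. The following sketch shows only the comparison step; the echo server is assumed to exist, and the function name `find_injected_headers` is hypothetical. X-UIDH is the header name reported in the Verizon case.

```python
def find_injected_headers(sent, received):
    """Return the headers the server saw that the client never sent."""
    # Header names are case-insensitive, so compare them in lower case.
    sent_names = {name.lower() for name in sent}
    return {
        name: value
        for name, value in received.items()
        if name.lower() not in sent_names
    }
```

If the client sent only `Host` and `User-Agent` and the server also reports an `X-UIDH` value, that header was added somewhere along the path. Legitimate proxies may add headers too (for example `Via`), so any result still needs to be interpreted by hand.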
Persistence of identifying elements
You might think that using private browsing and frequently clearing browsing data, cache and cookies would remove every element of identification or tracking. Is this the case? Not always; in fact, hardly ever. There are elements that are eliminated just by closing the browser, elements on which we cannot act, such as fingerprints or the injection of headers, and elements that persist and/or are regenerated after browsing data has been completely eliminated.
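Regeneration typically works by keeping copies of the same identifier in several storage mechanisms and restoring the missing copies from any one that survives. The sketch below models this logic with a plain dictionary standing in for the browser's stores; the store names and the function name `respawn` are assumptions made for illustration.

```python
def respawn(stores, new_id):
    """Return (identifier, refreshed_stores) given a dict of store -> value or None."""
    # Any surviving copy is enough to recover the original identifier.
    surviving = [value for value in stores.values() if value]
    identifier = surviving[0] if surviving else new_id
    # Write the identifier back into every store, including the cleared ones.
    return identifier, {name: identifier for name in stores}
```

With `{"cookie": None, "local_storage": None, "etag": "abc"}` as input, the cookie that the user deleted is recreated with the value `abc`; only when every store has been cleared does the visitor receive a new identifier.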
The following table includes the main technologies used to identify users and their persistence after elimination.
Defence against identification and tracking
- uBlock plugin. Note the high number of blocked requests and tracking elements -
Lastly, taking a stricter position in the search for more privacy, we can use browsers that are specially designed with this objective in mind, such as Tor Browser.
- Tor browser -
The involuntary loss of privacy when we use the internet is gradually leaving its mark on public awareness and, in general, is beginning to attract real interest. An example of this is the recent case in Belgium, with the legal request for Facebook to stop tracking users who visit its site.