
Web tracking of the Internet users

Posted on 12/01/2015, by Antonio López (INCIBE)

The collection and analysis of data, popularly referred to as Big Data or Smart Data, has undoubtedly become a powerful tool that is gaining prominence in many aspects of modern society. Sectors as diverse as the media, business and the economy, and even political and military intelligence, have adopted data analysis as part of their operations, even where nothing obliged them to do so.

Different motivations, same objective

Obtaining and using data, both from public OSINT sources and through other "tracking" methods, is very useful for compiling information on internet users. The motivations are many: building customer profiles and analysing behaviour to improve sales and marketing strategies, tailoring prices and advertising to the origin of the target, monetising the collected information by selling it to third parties, monitoring individuals or groups, compiling statistics, and so on.


- Some motivations for the monitoring and identification of users -

These techniques are now so common that nobody is surprised any more by, for example, how accurately adverts match our web browsing, or by online shop prices that vary depending on the target audience.


- Price setting depending on the location and/or operating system. $heriff plugin -

 

To identify, classify and compile information on internet users, various web technologies and methods are used. Together they collect enough information to build a detailed picture of a user and their patterns of behaviour.

These mechanisms are widely used and have a direct impact on the privacy of the user, who, as we will see below, has no easy way of avoiding this exposure and protecting their anonymity.

In this article, we will focus on "web tracking", which refers to tracking mechanisms aimed at identifying devices, browsers and tools employed by all of us as internet users.

 

Web tracking and identifiers

Although they are not the only ones, the techniques most commonly used to profile users or devices can be grouped as follows:

  • Client-side identifiers (session, cache, local storage)
  • Hardware/software fingerprint
  • Other methods: specific patterns of behaviour, local preferences, injection of HTTP headers.

Client-side identifiers

- Client-side identifiers -

Within this category of identifiers we find certain elements (data, files) that are stored locally by browsers in different locations on the client computer. These data will be transmitted to web servers and used to identify the user and carry out the desired operations in accordance with their profile.

Browsers do not always offer an automatic or preconfigured mechanism for removing stored data, which makes the data harder to eliminate and favours their persistence. Storage may be local, session or cache storage, and each type, in principle, has a clearly defined persistence.

 


- Types of local storage and assumed method of elimination -

As we will see below, browser data elimination mechanisms are not always as effective as could be expected.

Session identifiers

These identifiers are stored temporarily while the user uses the browser and persist only for as long as the session lasts. They are usually elements contained in the page, such as hidden fields, DOM properties or explicit authentication web forms, that validate the user only during an active session. Unlike cookies and other methods, these identifiers are not stored persistently and disappear when the session ends or the page is left. This method is dated and little used, particularly when cookies or another type of storage with greater persistence are available.

Cache storage

The cache must be taken into account from the privacy perspective. Web browsers implement a cache that improves performance when revisiting sites by storing part of their content, such as images or scripts. It is a very common type of storage, and the persistence of the stored data depends directly on the browser configuration and/or manual elimination.

The persistence of cached elements is generally determined through values that are established through HTTP headers when visiting a website. These headers include:

Expires/max-age. These set an expiry time for the data (Expires as a date, the max-age directive of Cache-Control as a number of seconds). The data will remain until that time is reached or the cache is cleared manually.

ETag. This header labels a specific version of a resource. When its value changes, the browser knows that the resource has changed.

Last-Modified. It indicates the date on which the web content was last changed.

These headers may be used to store differentiating elements in the client browser with the desired persistence and, as such, obtain a profile associated with the user. A proof of concept of this idea is Japitracing, a study carried out in 2011 as part of the Master’s in Security at the European University of Madrid. In this study, HTTP headers are employed to store JavaScript code in the browser’s cache and use it to geographically track the user.
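As an illustration of how a cache header can double as an identifier, the following is a minimal sketch of the server side, written in Python. It is not the Japitracing implementation: the function name `handle_request` and the use of a random UUID as the ETag value are assumptions made for this example. The idea is simply that the browser returns a cached ETag in the `If-None-Match` request header, so a unique ETag comes back on every revalidation.

```python
import uuid


def handle_request(request_headers):
    """Simulate a server that (ab)uses the ETag as a visitor identifier.

    Returns a tuple (status_code, response_headers, visitor_id).
    This is a hypothetical helper, not part of any real framework.
    """
    # When revalidating a cached resource, the browser sends the stored
    # ETag back in the If-None-Match header.
    returned_tag = request_headers.get("If-None-Match")
    if returned_tag:
        # Known visitor: reply 304 so the browser keeps its cached copy,
        # and with it the same ETag for the next visit.
        visitor_id = returned_tag.strip('"')
        return 304, {"ETag": returned_tag}, visitor_id

    # First visit: mint a fresh identifier and hand it out as the ETag.
    # "no-cache" lets the browser store the resource but forces it to
    # revalidate (and so to send the ETag) before each reuse.
    visitor_id = uuid.uuid4().hex
    response_headers = {
        "ETag": f'"{visitor_id}"',
        "Cache-Control": "private, no-cache",
    }
    return 200, response_headers, visitor_id
```

Clearing cookies does not affect this identifier. It disappears only when the cached resource itself is removed.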

Local files and data

  • HTTP cookies

Cookies have been, and continue to be, a widely used mechanism for identifying and profiling users so that a history of their browsing, preferences and sessions can be maintained. They can store up to 4 KB of information and are easy to delete, but deleting them does not always prevent tracking information from persisting or being reconstructed, as explained below.

  • Flash (Flash cookies)

Adobe Flash applications store data on the client side through a mechanism analogous to HTTP cookies, known as Local Shared Objects (LSO), with a storage capacity of up to 100 KB. This content may be accessed from every installed browser, since Adobe Flash uses the same file location for all of them.

The trend is for browsers to integrate Flash LSOs into cookie management, so that deleting cookies also deletes the Flash files. However, there are techniques that use JavaScript code to regenerate deleted HTTP cookies from Flash storage (Evercookies).
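The regeneration logic can be sketched as follows. This is a simplified simulation in Python rather than real browser code: each storage mechanism is modelled as a plain dictionary, and the function name `respawn_identifier` and the key `uid` are hypothetical. As long as one copy of the identifier survives, it is written back to every other store.

```python
def respawn_identifier(stores, new_id):
    """Restore an identifier from any surviving store, then copy it everywhere.

    stores: dict mapping a store name (e.g. "http_cookie", "flash_lso")
            to a dict that stands in for that storage mechanism.
    new_id: identifier to use only if no copy survived anywhere.
    """
    # Look for a copy that survived the user's clean-up.
    surviving = next(
        (store["uid"] for store in stores.values() if "uid" in store),
        None,
    )
    uid = surviving if surviving is not None else new_id

    # Write the identifier back to every store, recreating deleted copies.
    for store in stores.values():
        store["uid"] = uid
    return uid


# The user deleted the HTTP cookie, but the Flash copy is still there.
stores = {"http_cookie": {}, "flash_lso": {"uid": "abc123"}}
restored = respawn_identifier(stores, new_id="fresh-id")
```

After the call, the HTTP cookie store again holds the identifier that the user believed had been deleted.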

  • Silverlight

Similarly, Microsoft Silverlight applets maintain local data storage known as Isolated Storage. This storage is independent of the browser’s browsing-data management and cache, so the files must be deleted manually. This characteristic gives it a high degree of persistence, which may be used for identifying web clients. Moreover, this storage may be shared between different requests or windows of the browser.

  • HTML5

HTML5 brought important new features, among them an API (Web Storage) with modules for managing data storage with different degrees of persistence, such as Local Storage (persistent across sessions) or Session Storage (limited to the session). Similarly, IndexedDB and File are further examples of APIs for storing files and managing databases in the client, and clearing them will usually require manual intervention.

  • Java

Java also has an API, PersistenceService, which provides methods for storing local data in the client, even for applications outside the browser environment.

Other methods

HSTS (HTTP Strict Transport Security) is a security mechanism whose objective is to ensure that connections to a specific domain are made only over HTTPS. For this, the browser keeps a list of registered sites: an initial set, to which it adds new ones on demand as sites send the corresponding HTTP header. Once an HSTS registration is stored it remains until it expires, and deleting cookies, cache or temporary files has no effect on it. It can only be removed through advanced browser options, in a nontrivial manner.


- HSTS registrations and configuration in Chrome -

By employing this mechanism, it is possible to generate HSTS registrations in the user’s browser and thus create an identifier set, which has been called, perhaps inappropriately, HSTS supercookies. A proof of concept test can be carried out at the following link: http://www.radicalresearch.co.uk/lab/hstssupercookies (Link currently unavailable)
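A sketch may help to show how a set of HSTS registrations can encode an identifier. The Python below only simulates the bookkeeping, and every name in it (`hosts_to_pin`, `read_identifier`, the domain `tracker.example` and the 16-bit width) is an assumption chosen for illustration. Each bit of the identifier is mapped to one subdomain: a registration is created for the subdomains whose bit is 1, and the identifier is later read back by checking which subdomains the browser upgrades to HTTPS on its own.

```python
BITS = 16  # width of the identifier in this sketch (an arbitrary choice)


def hosts_to_pin(identifier, domain="tracker.example"):
    """Return the subdomains that must receive an HSTS registration.

    One subdomain per bit; only the bits set to 1 get a registration.
    """
    return {
        f"bit{i}.{domain}"
        for i in range(BITS)
        if (identifier >> i) & 1
    }


def read_identifier(upgraded_hosts, domain="tracker.example"):
    """Rebuild the identifier from the hosts the browser upgraded to HTTPS.

    upgraded_hosts: set of subdomains for which a plain HTTP request was
    rewritten to HTTPS by the browser, i.e. a registration exists.
    """
    value = 0
    for i in range(BITS):
        if f"bit{i}.{domain}" in upgraded_hosts:
            value |= 1 << i
    return value


pinned = hosts_to_pin(0xBEEF)
recovered = read_identifier(pinned)
```

Because the registrations live outside cookies and cache, the recovered value survives the usual clean-up of browsing data.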

As we can see, HSTS, regardless of its original purpose, has become a tool that can be used to obtain information that compromises the user’s privacy. Even browsing history can be obtained, as was shown in a study presented at the ToorCon security conference in San Diego in 2015. A proof of concept called Sniffly demonstrates how, by issuing requests to HSTS-enabled sites and measuring the browser’s response time (which depends on whether or not it has a stored registration), it is possible to deduce which sites the user has visited.
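The timing inference can be reduced to a very small sketch. The Python below is not Sniffly's code: it assumes the response times have already been measured, and both the function name `infer_visited` and the 10 ms threshold are hypothetical values chosen for the example. A request that the browser redirects internally, because a registration is stored, completes much faster than one that has to reach the network.

```python
def infer_visited(timings_ms, threshold_ms=10.0):
    """Guess which hosts have a stored HSTS registration from response times.

    timings_ms:   dict mapping host name to the measured time, in
                  milliseconds, until the browser reacted to the request.
    threshold_ms: cut-off below which the redirect is assumed to have
                  been handled inside the browser (assumed value).
    """
    return sorted(
        host for host, elapsed in timings_ms.items()
        if elapsed < threshold_ms
    )


measurements = {"shop.example": 1.4, "news.example": 87.0, "mail.example": 2.1}
probably_visited = infer_visited(measurements)
```

In practice the measurements are noisy, so a fixed threshold like this one would give both false positives and false negatives.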

Software/hardware fingerprint

Fingerprinting techniques rely on differentiating elements in the hardware or software employed by the user. Through JavaScript, Flash, Java and other web technologies, information is compiled to create a "fingerprint" that identifies the device rather accurately.

Browser’s fingerprint

The Panopticlick initiative of the Electronic Frontier Foundation (EFF) implemented a proof of concept to show how distinctive a particular web client is at the moment it accesses a website. Using an algorithm that collects information through HTTP and AJAX requests on installed plugins, screen resolution, fonts, time zone, cookies and Flash objects, it computes a fingerprint that allows a web client to be distinguished from millions of others. For browsers with Java and Flash enabled, the EFF states that the reliability of identification is around 95%. This shows how accurately a fingerprint can be constructed to single out a specific browser or device.
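The general idea of combining attributes into a fingerprint can be sketched in a few lines of Python. This is not Panopticlick's algorithm: the attribute names, the use of SHA-256 and the helper names `fingerprint` and `surprisal_bits` are assumptions for illustration. The second function expresses how identifying a single attribute value is, in bits, given the share of browsers that present that value.

```python
import hashlib
import json
import math


def fingerprint(attributes):
    """Hash a dictionary of browser attributes into a stable identifier.

    The attributes are serialised in a canonical form (sorted keys) so
    that the same set of values always produces the same hash.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def surprisal_bits(share_of_browsers):
    """Bits of identifying information in a value seen in the given share
    of browsers (for example, 1 browser in 1024 gives 10 bits)."""
    return -math.log2(share_of_browsers)


# Hypothetical attribute values as a tracking script might collect them.
browser = {
    "plugins": ["Flash", "Java"],
    "screen": "1920x1080x24",
    "timezone": "UTC+1",
    "fonts": ["Arial", "Calibri", "Verdana"],
}
browser_id = fingerprint(browser)
```

No data is stored on the client at any point, which is why clearing cookies or cache has no effect on this kind of identifier.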


- Example of Panopticlick analysis -

HTML5 canvas fingerprint

The canvas fingerprint is one of the most recent methods used to identify devices according to the characteristics of their hardware. WebGL graphics technology is used to draw an image in an HTML <canvas> element in the client. As shown in the study "Pixel Perfect: Fingerprinting Canvas in HTML5", this image depends directly on the hardware and has enough entropy to create a fingerprint of the user’s computer. By analysing the pixels that make up the image generated in the web client, it is possible to obtain an identifying fingerprint with a high degree of accuracy.
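The final step, turning the rendered pixels into an identifier, can be sketched in Python. In a real page the pixels would be read back from the canvas by JavaScript. Here they are simulated as raw bytes, and the function name `canvas_fingerprint` and the sample pixel values are assumptions for the example.

```python
import hashlib


def canvas_fingerprint(rgba_pixels):
    """Hash the raw RGBA bytes of a rendered image into a fingerprint."""
    return hashlib.sha256(rgba_pixels).hexdigest()


# Two simulated devices render the same drawing instructions. The second
# differs in a single colour channel of one pixel, as might happen with a
# different graphics card, driver or anti-aliasing implementation.
device_a = bytes([255, 0, 0, 255, 12, 34, 56, 255])
device_b = bytes([255, 0, 0, 255, 12, 34, 57, 255])

fingerprint_a = canvas_fingerprint(device_a)
fingerprint_b = canvas_fingerprint(device_b)
```

Even a one-value difference in the rendered image produces a completely different hash, which is what allows devices to be told apart.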


- CanvasFingerprintBlock plugin for the detection and blocking of canvas tracking images -

Network fingerprints and geolocation

The IP address of the device, or of the network it is on, is another piece of data commonly used to try to identify users through various techniques, including traffic analysis, HTTP headers or the use of Java, Flash, JavaScript or HTML5. Data from intermediate proxies can also provide information, and it may even be possible to obtain the client’s private network IP address, for example through the JavaScript API provided by WebRTC, an open project that gives browsers real-time communication features.


- Obtaining the private IP using WebRTC (Chrome, Firefox) - http://net.ipcalf.com

Geolocation is another piece of data extracted and used for profiling and identification. To obtain it, the IP address is looked up in public databases or the HTML5 Geolocation API is used. However, geolocation is not sufficiently accurate and is affected by circumstances such as the use of a VPN or the Tor network, which mask the real origin.

Other identification mechanisms

Markers dependent on preferences and behaviour

There are characteristics that are linked to the normal behaviour of the user and which, therefore, are not connected to a specific device. These data are useful for identifying the profile of a user of a device. This information includes:

It is interesting to note that cache and browser history information can be obtained with some collaboration from the user, as explained in the study I Still Know What You Visited Last Summer, carried out in collaboration with Microsoft. The study describes various techniques, notably exploiting the different colours a browser uses to distinguish visited from unvisited links, and camouflaging those links in images, captchas or other interactive elements of the page to determine whether or not the user has visited them.

Another, simpler approach to accessing the browser history that needs no interaction with the user is the above-mentioned Sniffly, which employs HSTS.

Injection of HTTP headers

HTTP header injection already made an impact in 2014, when it was reported that Verizon, an American telecommunications operator, was adding headers with markers to its customers’ HTTP traffic in order to create an identifier for each of them. These markers were dubbed "permacookies".

More recently, accessnow.org published the study The Rise of Mobile Tracking Headers: How Telcos Around the World Are Threatening Your Privacy, which describes this same identification strategy as used by telephone operators on mobile devices. Little can be done against this type of action, since it is impossible to control traffic once it has left our mobile terminal and is in the hands of the operator.


- Statistics extracted in analysis of headers - Source: accessnow.org


- Some HTTP headers used by different operators - Source: accessnow.org
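Although the user cannot remove headers added by the operator, a server that echoes back what it received can reveal whether they are present. The Python sketch below shows that check under stated assumptions: the helper name `injected_tracking_headers` is hypothetical, and the header list is only an illustrative sample of names that have been publicly associated with operator tracking (X-UIDH is the one linked to Verizon), not a complete or authoritative inventory.

```python
# Illustrative sample only; real deployments have used other names too.
KNOWN_TRACKING_HEADERS = {"x-uidh", "x-acr", "x-msisdn", "x-up-subno"}


def injected_tracking_headers(received_headers):
    """Return the received headers whose names match a known tracking header.

    received_headers: dict of header name to value, as seen by the server.
    Header names are compared case-insensitively.
    """
    return {
        name: value
        for name, value in received_headers.items()
        if name.lower() in KNOWN_TRACKING_HEADERS
    }


# Headers as they might arrive at the server after crossing the operator.
seen_by_server = {
    "Host": "www.example.org",
    "User-Agent": "ExampleBrowser/1.0",
    "X-UIDH": "OTgxNTk2NDk0ADJVquRu5NS5",
}
suspicious = injected_tracking_headers(seen_by_server)
```

This kind of check only works over plain HTTP. When the connection uses HTTPS the operator cannot modify the request, so the headers cannot be injected in the first place.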

Persistence of identifying elements

You might think that private browsing and frequently clearing browsing data, cache and cookies would remove every element of identification or tracking. Is this the case? Not always; in fact, hardly ever. Some elements are eliminated just by closing the browser, some we cannot act on at all, such as fingerprints or the injection of headers, and some persist and/or are regenerated even after browsing data is completely eliminated.

The following table includes the main technologies used to identify users and their persistence after elimination.

Table

Defence against identification and tracking

Can we do anything to avoid being classified in our daily internet use? As we said, not very much. Current web technologies use JavaScript, Flash, Java and cookies on almost every site, and their persistence capabilities usually go beyond the basic browser cleaning mechanisms. Private browsing is not a great improvement, and nor are other options such as Do Not Track, which some browsers incorporate. Even so, as a first step to protect our privacy, it is always advisable to completely clear files and browsing content each time we finish using the browser.

To go further in protecting our privacy, we can deactivate all of the technologies mentioned (Java, Flash, JavaScript, etc.) and prevent plugins and scripts from running in the browser, but this generally degrades the browsing experience considerably. As an alternative, tools such as uBlock Origin (Chrome, Firefox) can be used to block domains and/or advertising, as can JavaScript blockers such as NoScript (Firefox) or ScriptSafe (Chrome) and plugins that disable canvas fingerprinting such as CanvasFingerprintBlock. And if we regularly shop online, we can use one of the tools designed to detect price setting, such as the already mentioned $heriff.


- uBlock Origin plugin. Observe the high number of blocked requests and tracking elements -

Lastly, taking a stricter position in the search for more privacy, we can use browsers that are specially designed with this objective in mind, such as Tor Browser.


- Tor browser -

The involuntary loss of privacy when we use the internet is gradually leaving its mark on public awareness and, in general, is beginning to attract real interest. An example is the recent case in Belgium, with the legal demand that Facebook stop tracking users who visit its site.