# Whotracks.me April Update

This month brings a big update to the site. We have restructured the data we publish to make it easier to use, increased the number of entries we publish, and laid the groundwork for internationalised versions of WhoTracks.Me, so that you can see how tracking differs between countries.

Thanks to integration with Ghostery 8, we collected significantly more tracker data this month, covering 360 million page loads. The data comes from countries across the world, with Germany and the USA the most represented.

In [1]:
from plotly.offline import init_notebook_mode, iplot, offline
import plotly.graph_objs as go
from whotracksme.website.plotting.colors import cliqz_colors, palette
from whotracksme.website.plotting.utils import (
    CliqzFonts,
    div_output,
    set_margins,
    annotation,
    set_line_style,
    set_category_colors
)

import pandas as pd
init_notebook_mode()

In [3]:
def doughnut_chart(values, labels, name):
    trace = go.Pie(
        values=values,
        labels=labels,
        name=str(name),
        hoverinfo="label+percent",
        textposition="outside",
        hole=0.45,
        pull=0.07,
        textinfo="label",
        textfont=dict(
            family=CliqzFonts.regular,
            color=cliqz_colors["black"],
            size=15
        )    
    )
    data = [trace]
    layout = dict(
        showlegend=False,
        paper_bgcolor=cliqz_colors["transparent"],
        plot_bgcolor=cliqz_colors["white"],
        xaxis=dict(showgrid=False, showline=False, showticklabels=False, zeroline=False),
        yaxis=dict(showgrid=False, showline=False, showticklabels=False, zeroline=False),
        # autosize=True,
        margin=set_margins(t=30, b=30),
        annotations=[
            annotation(
                text=str(name).upper(),
                x=0.5,
                y=0.5,
                background_color=cliqz_colors["black"],
                shift_x=0,
                text_size=15
            )
        ]
    )
    fig = dict(data=data, layout=layout)
    # NB: saving the plot requires a manual step, plotly does not support it yet
    # source: https://github.com/plotly/plotly.py/issues/880
    offline.plot(fig, image='svg')

    return iplot(fig)

countries = ['Germany', 'USA', 'France', 'Other', 'Russia', 'UK', 'Poland', 'Netherlands', 'Canada', 'Ukraine', 'Austria', 'Italy', 'Spain', 'Switzerland', 'Belgium']
page_loads = [87124064, 78216572, 40282874, 32326828, 24384449, 16317893, 10554555, 10291928, 10054367, 6268086, 6261035, 6094486, 5753209, 4732324, 4048089]

doughnut_chart(values=page_loads, labels=countries, name='Data origin')

This volume of data will also enable us to publish separate rankings for individual countries, something we plan to add later this month.
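
As a rough illustration of how the data is distributed, the page-load counts from the chart above can be turned into shares. This is a minimal sketch in plain Python. The shares are relative to the sum of the listed figures only, and the variable names are introduced here for illustration.

```python
# Page loads per country, copied from the chart data above.
page_loads_by_country = {
    'Germany': 87124064,
    'USA': 78216572,
    'France': 40282874,
    'Other': 32326828,
    'Russia': 24384449,
    'UK': 16317893,
    'Poland': 10554555,
    'Netherlands': 10291928,
    'Canada': 10054367,
    'Ukraine': 6268086,
    'Austria': 6261035,
    'Italy': 6094486,
    'Spain': 5753209,
    'Switzerland': 4732324,
    'Belgium': 4048089,
}

# Sum of the listed figures; shares below are relative to this sum.
total = sum(page_loads_by_country.values())

# (country, share) pairs, largest share first.
shares = sorted(
    ((country, loads / total) for country, loads in page_loads_by_country.items()),
    key=lambda item: item[1],
    reverse=True,
)

for country, share in shares[:5]:
    print(f"{country:<12} {share:6.1%}")
```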

## Data restructure

We have updated the structure of the data which we publish in our [repository](https://github.com/cliqz-oss/whotracks.me/) to make it both easier to use and more scalable as we add more data. We now publish CSV files each month for each of the following:

 * `domains.csv`: Top third-party domains seen tracking.
 * `trackers.csv`: Top trackers - this combines domains known to be operated by the same tracker.
 * `companies.csv`: Top companies - aggregates the stats for trackers owned by the same company.
 * `sites.csv`: Stats for number of trackers seen on popular websites.
 * `site_trackers.csv`: Stats for each tracker on each site.

These files can then be loaded with popular data-analysis tools such as [Pandas](https://pandas.pydata.org/). We have also rewritten the code to render the site to take advantage of Pandas. We expose the dataframes via the `DataSource` class which loads data from all CSV files:

In [None]:
from whotracksme.data.loader import DataSource
data = DataSource()
len(data.trackers.df)
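
The CSV files can also be read with Pandas directly, without `DataSource`. The sketch below parses a small in-memory sample instead of a real file so that it runs anywhere. The rows are made up, and the column names `month`, `site` and `trackers` are assumed from the way the sites dataframe is used in the code in this post. The published files may well contain more columns.

```python
import io

import pandas as pd

# Hypothetical excerpt in the shape of sites.csv; the values are made up.
SAMPLE_SITES_CSV = """month,site,trackers
2018-02,example.com,9.5
2018-03,example.com,9.1
2018-03,example.org,12.0
"""

# For a real file, pass its path to read_csv instead of the StringIO buffer.
sites = pd.read_csv(io.StringIO(SAMPLE_SITES_CSV))

# Months stay plain strings such as '2018-03', so filtering is a string comparison.
march = sites[sites.month == '2018-03']
print(len(march), march.trackers.mean())
```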

We have also updated the criteria by which we include trackers and sites on the main site. We now 'roll over' entries: once an entry has been included, we keep publishing data for it until it completely disappears from the data. This naturally grows the number of trackers and sites we publish. We currently publish data on 868 trackers and 748 websites:

In [None]:
def plot_ts():
    df = pd.DataFrame({
        'trackers': data.trackers.df.groupby('month').count()['tracker'], 
        'sites': data.sites.df.groupby('month').count()['site']
    })
    sites_trace = go.Scatter(
        x=df.index, 
        y=df.sites, 
        name='Sites',
        line=dict(width=4, color='#9ebcda'),
    )
    trackers_trace = go.Scatter(
        x=df.index, 
        y=df.trackers, 
        name='Trackers',
        line=dict(width=4, color='#A069AB'),
    )
    
    layout=dict(
        margin=set_margins(t=0,b=30),
        legend=dict(
           x=0.05, y=1,
           bgcolor='#E2E2E2',
           orientation='h'
       )
    )
    fig = dict(data=[sites_trace, trackers_trace], layout=layout)
    offline.plot(fig, image='svg')

    iplot(fig)

plot_ts()
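
The rollover rule described above can be sketched with plain sets. This only illustrates the idea and is not the pipeline's actual code. The function name, its arguments and the toy entries are all hypothetical, and the inclusion criteria themselves are left abstract.

```python
def rollover(previously_published, meets_criteria, seen_this_month):
    """Return the set of entries to publish this month (illustrative only)."""
    # Entries published before are kept for as long as they still appear in the data.
    carried_over = previously_published & seen_this_month
    # Entries seen this month that pass the inclusion criteria are added.
    newly_included = meets_criteria & seen_this_month
    return carried_over | newly_included


# Toy input: (entries meeting the criteria, all entries seen) for three months.
months = [
    ({'a', 'b'}, {'a', 'b', 'c'}),
    ({'a'}, {'a', 'b', 'c'}),   # 'b' no longer meets the criteria but is still seen
    ({'a', 'c'}, {'a', 'c'}),   # 'b' disappears from the data entirely
]

published = set()
history = []
for meets, seen in months:
    published = rollover(published, meets, seen)
    history.append(sorted(published))

print(history)
```

In this toy run, `'b'` stays published in the second month although it no longer meets the criteria, and is dropped only once it is absent from the data.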

The per-site average number of trackers continues a slight downward trend, but remains above 9. There are several possible reasons for this; it is not necessarily the case that sites are using fewer trackers. The proportion of data from Ghostery users continues to increase, and these users are disproportionately likely to block trackers. This lowers the average number of trackers, because a blocked tracker cannot load others. The data also shows that the average incidence of blocking for trackers increased to 25% in March, up from 20% in February.

In [None]:
traces = [
    go.Box(
        y=data.sites.df[data.sites.df.month == '2018-01'].trackers, 
        name='Jan 2018',
        marker=dict(
            color='#c44e52',
            line=dict(
                color='#c44e52',
                width=3
            ),
        )
    ),
    go.Box(
        y=data.sites.df[data.sites.df.month == '2018-02'].trackers, 
        name='Feb 2018',
        marker=dict(
            color='#55a868',
            line=dict(
                color='#55a868',
                width=3
            ),
        )
    ),
    go.Box(
        y=data.sites.df[data.sites.df.month == '2018-03'].trackers, 
        name='Mar 2018',
        marker=dict(
            color='#4c72b0',
            line=dict(
                color='#4c72b0',
                width=3
            ),
        )
    )
]
fig = dict(data=traces, layout=dict(showlegend=False, margin=set_margins(t=0, b=30)))
offline.plot(fig, image='svg')
iplot(fig)

In [None]:
# Mean occurrence of Blocking per page
traces = [
    go.Bar(
        x=['Jan 2018', 'Feb 2018', 'Mar 2018'],
        y=[
            data.trackers.df[data.trackers.df.month == '2018-01'].has_blocking.mean()*100,
            data.trackers.df[data.trackers.df.month == '2018-02'].has_blocking.mean()*100,
            data.trackers.df[data.trackers.df.month == '2018-03'].has_blocking.mean()*100
        ],
        marker=dict(
            color=['#A069AB', '#9564c4', '#6564c4'],
            line=dict(
                color='#222',
                width=2
            ),
        )
    )
]
fig = dict(data=traces, layout=dict(margin=set_margins(t=0, b=30)))
offline.plot(fig, image_height=200, image_width=800, image='svg', output_type='file')
iplot(fig)
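
The monthly figures behind these charts can also be computed in one pass with `groupby`, rather than filtering month by month. The sketch below uses small made-up dataframes shaped like `data.sites.df` and `data.trackers.df`. The column names come from the cells above, while all values are hypothetical and are not the published numbers.

```python
import pandas as pd

# Hypothetical rows shaped like data.sites.df; the values are made up.
sites = pd.DataFrame({
    'month': ['2018-02', '2018-02', '2018-03', '2018-03'],
    'site': ['example.com', 'example.org', 'example.com', 'example.org'],
    'trackers': [10.0, 12.0, 9.0, 11.0],
})

# Hypothetical rows shaped like data.trackers.df; the values are made up.
trackers = pd.DataFrame({
    'month': ['2018-02', '2018-02', '2018-03', '2018-03'],
    'tracker': ['tracker_a', 'tracker_b', 'tracker_a', 'tracker_b'],
    'has_blocking': [0.1, 0.2, 0.2, 0.4],
})

# One row per month: mean trackers per site and mean blocking incidence in percent.
summary = pd.DataFrame({
    'mean_trackers_per_site': sites.groupby('month')['trackers'].mean(),
    'mean_blocking_pct': trackers.groupby('month')['has_blocking'].mean() * 100,
})
print(summary)
```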

As in previous months, we look at how the number of trackers on individual sites has changed. [fewo-direkt.de](../websites/fewo-direkt.de.html), [brigitte.de](../websites/brigitte.de.html) and [gutefrage.net](../websites/gutefrage.net.html) all had 5 fewer trackers on average per page this month. However, each of these still has over 50 trackers with some kind of presence, which suggests that the drop is more likely a side-effect of increased blocking than an active effort to reduce tracking on their sites. [klingel.de](../websites/klingel.de.html) and [informationvine.com](../websites/informationvine.com.html) see the largest increase in tracking of the sites we currently monitor.

In [None]:
mar_trackers = data.sites.get_snapshot('2018-03').set_index('site')['trackers']
feb_trackers = data.sites.get_snapshot('2018-02').set_index('site')['trackers']
site_diffs = pd.DataFrame({
    'trackers': mar_trackers,
    'change': (mar_trackers - feb_trackers)
})
site_diffs[(site_diffs.change > 5) | (site_diffs.change < -5.5)].sort_values('change')
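
One detail worth knowing about the subtraction above: Pandas aligns the two series on their index, so a site present in only one of the two months gets `NaN` as its change and is silently dropped by the threshold filter, because comparisons with `NaN` are false. A small sketch with made-up sites and values:

```python
import pandas as pd

# Hypothetical per-site tracker averages for two months; the values are made up.
feb = pd.Series({'example.com': 12.0, 'example.org': 8.0}, name='trackers')
mar = pd.Series({'example.com': 6.0, 'example.net': 20.0}, name='trackers')

# Subtraction aligns on the site index; sites missing in either month give NaN.
change = mar - feb

# Keep only sites that can actually be compared across both months.
comparable = change.dropna()

# NaN never satisfies a threshold comparison, so only real changes survive.
large_changes = change[(change > 5) | (change < -5.5)]
print(change, comparable, large_changes, sep='\n\n')
```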

A side-effect of the filtering we added in this new data pipeline is that the site reach for top trackers has increased. In the previous analysis, a long tail of very rarely visited sites reduced effective site reach. With this factor reduced, we get a real sense of the coverage of the largest trackers, with Google Analytics reaching 85% of popular sites, and Facebook almost 60%.

In [None]:
df = data.trackers.get_snapshot().sort_values(by='site_reach', ascending=False).head(10)
df['name'] = df.id.apply(func=lambda x: data.app_info[x]['name'])

traces = [
    go.Bar(
        x=df.site_reach[::-1]*100,
        y=df.name[::-1],
        orientation='h',
        marker=dict(
            color=palette('#9ebcda', '#A069AB', 10),
            line=dict(
                color='#333',
                width=2
            ),
        )
    )
]
layout=dict(margin=set_margins(l=200))
fig = dict(data=traces, layout=layout)
offline.plot(fig, image='svg')
iplot(fig)

If you want to delve deeper into our data, it is available in the [Whotracks.me GitHub repository](https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data), and as a [pip package](https://pypi.python.org/pypi/whotracksme/).