{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Making sense of the trackers on your favorite site" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting imports\n", "from plotly.offline import init_notebook_mode, iplot\n", "init_notebook_mode()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Sankey Data\n", "When building the tracker maps that you see on popular site profiles on whotracks.me, sankey diagrams seemed like a good fit to map categories of tracking to the companies that own the trackers. Each link would be a tracker, going from a category to a company. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![title](tumblr.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given we had decided to use `plotly.offline` to generate the interactive images, I wanted to use the sankey diagram supported in plotly. The function itself is pretty straightforward, as you can see in `sankey_diagram()`, but figuring out the structure of the input data took a bit. Hopefully the following example will make it easier for those reading this post, should they ever decide to try sankey diagrams.\n", "\n", "The goal here is to show a very small dataset, structured in a way that the plotly diagram (and other plotting solutions, e.g. d3.js) understand. We will be mapping cities to the countries they are part of. The value of each link will be the city population (in millions)." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "city_data = dict(\n", "    nodes = dict(\n", "        label=[\"Germany\", \"Berlin\", \"Munich\", \"Cologne\", \"France\", \"Paris\", \"Lyon\", \"Bordeaux\"],\n", "        color=[\"beige\", \"black\", \"red\", \"yellow\", \"beige\", \"blue\", \"white\", \"red\"]\n", "    ),\n", "    links = dict(\n", "        source=[0, 0, 0, 4, 4, 4],\n", "        target=[1, 2, 3, 5, 6, 7],\n", "        value=[3.5, 1.5, 1, 2.2, 0.5, 0.2],\n", "        label=[\"capital\", \"city\", \"city\", \"capital\", \"city\", \"city\"],\n", "        color=[\"black\", \"red\", \"yellow\", \"blue\", \"whitesmoke\", \"red\"]\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how there are two keys in the dictionary, `nodes` and `links`, and each has some attributes. Let's go over them. Each node has a label (e.g. `Germany`) and a corresponding color (in this case `beige`). Note that `label` and `color` are stored in lists of equal length, and the pairing is done based on the index. \n", "\n", "Links contain information about how to connect nodes. Each has a `source`, `target`, `value`, `label` and `color`. `source` contains the index of the source node in the nodes' `label` list, and `target` the index of the target node. For example, the first link has `source` 0 and `target` 1, so it runs from `Germany` to `Berlin`. `value` determines how thick the link should be (in our case, the population of the city the link points to). `label` and `color`, as the names suggest, specify the label and color of the link. Links, too, are paired based on index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting a sankey diagram\n", "\n", "Now let's write a simple function to plot this data nicely. Most of the work has already been done, given we're feeding the data in a format that's easy to parse." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def sankey_diagram(sndata, title):\n", "    # First part of a plotly plot is the `trace`\n", "    data_trace = dict(\n", "        type='sankey',\n", "        node=dict(\n", "            pad=10,\n", "            thickness=30,\n", "            # label could simply be sndata['nodes']['label']; the following is just cosmetics\n", "            label=list(map(lambda x: x.replace(\"_\", \" \").capitalize(), sndata['nodes']['label'])),\n", "            color=sndata['nodes']['color']\n", "        ),\n", "        link=sndata[\"links\"],\n", "\n", "        # configuration options for the diagram\n", "        domain=dict(\n", "            x=[0, 1],\n", "            y=[0, 1]\n", "        ),\n", "        hoverinfo=\"none\",\n", "        orientation=\"h\"\n", "    )\n", "    # Second part of a plotly plot is the `layout`\n", "    layout = dict(\n", "        title=title,\n", "        font=dict(\n", "            size=12\n", "        )\n", "    )\n", "    fig = dict(data=[data_trace], layout=layout)\n", "    return iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sankey diagram for a few German and French cities\n", "All that is left now is to feed `city_data` to the `sankey_diagram` function and we're done." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "hoverinfo": "none", "link": { "color": [ "black", "red", "yellow", "blue", "whitesmoke", "red" ], "label": [ "capital", "city", "city", "capital", "city", "city" ], "source": [ 0, 0, 0, 4, 4, 4 ], "target": [ 1, 2, 3, 5, 6, 7 ], "value": [ 3.5, 1.5, 1, 2.2, 0.5, 0.2 ] }, "node": { "color": [ "beige", "black", "red", "yellow", "beige", "blue", "white", "red" ], "label": [ "Germany", "Berlin", "Munich", "Cologne", "France", "Paris", "Lyon", "Bordeaux" ], "pad": 10, "thickness": 30 }, "orientation": "h", "type": "sankey" } ], "layout": { "font": { "size": 12 }, "title": "A few European Cities" } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sankey_diagram(city_data, \"A few European Cities\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# From Cities to Trackers\n", "\n", "Doing Sankey diagrams for cities may have been fun. I am not sure the result of doing the same for trackers on your favorite sites will be equally fun. In fact it may be terrifying. We'll be using public data from whotracks.me to map tracker categories to Companies present on a particular site. Each link will be a tracker the company owns. This gives imediate visual insights on who's watching you an why. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Utils from `whotracksme`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from whotracksme.data.loader import DataSource\n", "from whotracksme.website.plotting.colors import tracker_category_colors, cliqz_colors\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Data\n", "`DataSource` is a class that provides access to trackers, websites and companies. The functionality of `DataSource` is something we'll be constantly trying to improve and expand. Online tracking is messy enough to analyze, so the tooling should be not." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "DATA = DataSource()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available entities\n", "\n", "These entities are loaded into DataSource, but an API is provided for some common operations on each of them. For more details, have a look at `whotracksme.data.loader`. 
As far as we're concerned, we can load them like this:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "trackers = DATA.trackers\n", "sites = DATA.sites\n", "companies = DATA.companies\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking at reddit.com\n", "Most people know what reddit is. For those of you who don't, check it out - there are some great communities there. Now we'll look at the tracking landscape on reddit. To do that, we only need to know the reddit `site_id`, which is `reddit.com`. Each site has a `site_id`, most often its `url`. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['apps', 'category', 'history', 'name', 'overview', 'subdomains'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reddit_id = \"reddit.com\"\n", "reddit_data = DATA.sites.get_site(reddit_id)\n", "\n", "# reddit_data is a dictionary; a site object has the following keys:\n", "reddit_data.keys()\n", "\n", "# apps refers to trackers. Naming is hard, but it'll soon be changed to trackers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing tracker data for the sankey diagram\n", "Here we will be mapping the trackers on reddit to the category they belong to (on the left) and to the companies that own them (on the right). This means each link is a tracker, nodes on the left are categories, and nodes on the right are companies. 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def sankey_data(site_id, data=DATA):\n", "\n", " nodes = []\n", " link_source = []\n", " link_target = []\n", " link_value = []\n", " link_label = []\n", "\n", " for (tracker, category, company) in data.sites.trackers_on_site(site_id, data.trackers, data.companies):\n", "\n", " # index of this category in nodes\n", " if category in nodes:\n", " cat_idx = nodes.index(category)\n", " else:\n", " nodes.append(category)\n", " cat_idx = len(nodes) - 1 \n", " \n", " # index of this company in nodes\n", " if company in nodes:\n", " com_idx = nodes.index(company)\n", " else:\n", " nodes.append(company)\n", " com_idx = len(nodes) - 1 \n", " \n", " link_source.append(cat_idx)\n", " link_target.append(com_idx)\n", " link_label.append(tracker[\"name\"])\n", " link_value.append(100.0 * tracker[\"frequency\"])\n", "\n", " label_colors = [tracker_category_colors[l] if l in tracker_category_colors else cliqz_colors[\"purple\"] for l in nodes]\n", "\n", " return dict(\n", " nodes = dict(\n", " label=nodes,\n", " color=label_colors\n", " ),\n", " links = dict(\n", " source=link_source,\n", " target=link_target,\n", " value=link_value,\n", " label=link_label,\n", " color=[\"#dedede\"] * len(link_label)\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "hoverinfo": "none", "link": { "color": [ "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede", "#dedede" ], "label": [ "Reddit", "Google Tag Manager", "Google Analytics", "Amazon Associates", "Quantcast", "ScoreCard Research Beacon", "Google", "DoubleClick", "Quantcount", "Google AdServices", "Moat", "Google 
Syndication", "Google APIs", "OpenX", "Imgur", "Google CDN", "YouTube", "Alexa Metrics", "WikiMedia", "Amazon Web Services", "AppNexus", "InsightExpress", "Advertising.com" ], "source": [ 0, 2, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 10, 5, 12, 10, 14, 4, 10, 17, 5, 4, 5 ], "target": [ 1, 3, 3, 6, 7, 8, 3, 3, 7, 3, 9, 3, 3, 11, 13, 3, 3, 15, 16, 6, 18, 19, 20 ], "value": [ 99.58731289613593, 99.0402000255409, 98.2706125110061, 91.97209321082666, 46.30429960814889, 45.09245132106922, 44.410240554909564, 43.818095052459654, 37.652657261343855, 32.872476996390674, 20.00053770306692, 16.99612181662981, 16.054469320679388, 8.754478058354225, 5.226473810499996, 4.598705479866381, 2.9412357760735577, 2.19920554371862, 1.9437965869297826, 1.2777169127778412, 1.2004220969075352, 1.1789139742305805, 1.1110289620314422 ] }, "node": { "color": [ "#87BCEF", "#A069AB", "#FC9834", "#A069AB", "#84D7F0", "#BF90D2", "#A069AB", "#A069AB", "#A069AB", "#A069AB", "#C0BB61", "#A069AB", "#80C87D", "#A069AB", "#F86D4F", "#A069AB", "#A069AB", "#444", "#A069AB", "#A069AB", "#A069AB" ], "label": [ "Social media", "Reddit", "Essential", "Google", "Site analytics", "Advertising", "Amazon", "Quantcast", "Comscore", "Oracle", "Cdn", "Openx", "Misc", "Imgur", "Audio video player", "Alexa", "Wikimedia", "Hosting", "Appnexus", "Millward brown", "Aol" ], "pad": 10, "thickness": 30 }, "orientation": "h", "type": "sankey" } ], "layout": { "font": { "size": 12 }, "title": "reddit.com" } }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "input_data = sankey_data(reddit_id, data=DATA)\n", "sankey_diagram(input_data, reddit_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't forget to check out the article on https://whotracks.me/blog/trackers_in_your_favorite_site.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }