{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IP anonymization and its impact on visitor localization in Google Analytics\n",
"\n",
"\n",
"### Tools used\n",
"* [pandas](http://pandas.pydata.org/) and its [Google Analytics connector](http://pandas.pydata.org/pandas-docs/version/0.15.2/remote_data.html#remote-data-ga) to fetch and wrangle the data,\n",
"* [bokeh](http://bokeh.pydata.org) to visualize it.\n",
"\n",
"### Data sources\n",
"* The traffic data comes from Google Analytics and concerns [blog.liip.ch](http://blog.liip.ch). Two properties track pageviews on that 'tech' blog, on with IP anonymizing enabled, since september 2015."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploration 1: What's the impact of IP anonymizing on the user country dimension?"
]
},
{
"cell_type": "code",
"execution_count": 221,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pandas.io.ga as ga\n",
"import numpy as np\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PYTHON 3.4.3 |Anaconda 2.4.0 (x86_64)| (default, Oct 20 2015, 14:27:51) \n",
"[GCC 4.2.1 (Apple Inc. build 5577)]\n",
"PANDAS 0.17.0\n"
]
}
],
"source": [
"import sys\n",
"print (\"PYTHON \", sys.version)\n",
"print (\"PANDAS \", pd.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fetching data\n",
"\n",
"Now that our tools are loaded, let us fetch the data from the two Google Analytics properties.\n",
"\n",
"The traffic data without anonymization since September 1st, 2015."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sources = {\n",
" 'full' : {\n",
" 'property_id': \"UA-424540-4\",\n",
" 'profile_id': \"5334921\",\n",
" },\n",
" 'anon' : {\n",
" 'property_id': \"UA-424540-11\",\n",
" 'profile_id': \"107030134\",\n",
" },\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"49376\n",
"49161\n"
]
}
],
"source": [
"source_data = {}\n",
"\n",
"for key in sources: \n",
" source_data[key] = ga.read_ga(\n",
" property_id = sources[key]['property_id'],\n",
" profile_id = sources[key]['profile_id'],\n",
" metrics = \"sessions\",\n",
" dimensions = ['country','city'],\n",
" start_date = \"2015-09-01\",\n",
" index_col = ['country','city'],\n",
" )\n",
"\n",
" print(source_data[key]['sessions'].sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Less than 1% difference in volumes. Since the tracking is not forcefully simultaneous, that was expected. Let's just have a look at one of them."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" sessions | \n",
"
\n",
" \n",
" country | \n",
" city | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" (not set) | \n",
" (not set) | \n",
" 126 | \n",
"
\n",
" \n",
" Afghanistan | \n",
" (not set) | \n",
" 2 | \n",
"
\n",
" \n",
" Albania | \n",
" (not set) | \n",
" 1 | \n",
"
\n",
" \n",
" Tirana | \n",
" 12 | \n",
"
\n",
" \n",
" Algeria | \n",
" (not set) | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sessions\n",
"country city \n",
"(not set) (not set) 126\n",
"Afghanistan (not set) 2\n",
"Albania (not set) 1\n",
" Tirana 12\n",
"Algeria (not set) 15"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"source_data['full'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us now join those two dataframes based on their country/city index:"
]
},
{
"cell_type": "code",
"execution_count": 222,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" full_ip_sessions | \n",
" anon_ip_sessions | \n",
"
\n",
" \n",
" country | \n",
" city | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" (not set) | \n",
" (not set) | \n",
" 130 | \n",
" 126 | \n",
"
\n",
" \n",
" Afghanistan | \n",
" (not set) | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" Albania | \n",
" (not set) | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Tirana | \n",
" 12 | \n",
" 12 | \n",
"
\n",
" \n",
" Algeria | \n",
" (not set) | \n",
" 14 | \n",
" 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" full_ip_sessions anon_ip_sessions\n",
"country city \n",
"(not set) (not set) 130 126\n",
"Afghanistan (not set) 2 2\n",
"Albania (not set) 1 1\n",
" Tirana 12 12\n",
"Algeria (not set) 14 15"
]
},
"execution_count": 222,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.concat(source_data, axis=1, join='outer')\n",
"\n",
"# rename homonymous columns\n",
"data.columns=['full_ip_sessions', 'anon_ip_sessions']\n",
"\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's list the countries where the biggest proportional losses & wins happen."
]
},
{
"cell_type": "code",
"execution_count": 225,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# group by level 0 of the index (i.e. countries) and sum columns for groups\n",
"country_data = data.groupby(level=0).sum()\n",
"\n",
"# compute delta and its proportion\n",
"country_data['delta'] = country_data.anon_ip_sessions - country_data.full_ip_sessions\n",
"country_data['dprop'] = country_data.delta / country_data.full_ip_sessions\n",
"\n",
"# sort by prop. delta, ascending\n",
"country_data.sort_values(by='dprop', inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Countries with proportionally large losses:"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" full_ip_sessions | \n",
" anon_ip_sessions | \n",
" delta | \n",
" dprop | \n",
"
\n",
" \n",
" country | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Austria | \n",
" 387 | \n",
" 356 | \n",
" -31 | \n",
" -0.080103 | \n",
"
\n",
" \n",
" China | \n",
" 341 | \n",
" 326 | \n",
" -15 | \n",
" -0.043988 | \n",
"
\n",
" \n",
" Japan | \n",
" 443 | \n",
" 429 | \n",
" -14 | \n",
" -0.031603 | \n",
"
\n",
" \n",
" United States | \n",
" 7705 | \n",
" 7465 | \n",
" -240 | \n",
" -0.031149 | \n",
"
\n",
" \n",
" Denmark | \n",
" 242 | \n",
" 238 | \n",
" -4 | \n",
" -0.016529 | \n",
"
\n",
" \n",
" Bulgaria | \n",
" 207 | \n",
" 206 | \n",
" -1 | \n",
" -0.004831 | \n",
"
\n",
" \n",
" Finland | \n",
" 275 | \n",
" 274 | \n",
" -1 | \n",
" -0.003636 | \n",
"
\n",
" \n",
" Mexico | \n",
" 332 | \n",
" 332 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" Taiwan | \n",
" 278 | \n",
" 278 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" Czech Republic | \n",
" 420 | \n",
" 420 | \n",
" 0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" full_ip_sessions anon_ip_sessions delta dprop\n",
"country \n",
"Austria 387 356 -31 -0.080103\n",
"China 341 326 -15 -0.043988\n",
"Japan 443 429 -14 -0.031603\n",
"United States 7705 7465 -240 -0.031149\n",
"Denmark 242 238 -4 -0.016529\n",
"Bulgaria 207 206 -1 -0.004831\n",
"Finland 275 274 -1 -0.003636\n",
"Mexico 332 332 0 0.000000\n",
"Taiwan 278 278 0 0.000000\n",
"Czech Republic 420 420 0 0.000000"
]
},
"execution_count": 229,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"country_data.query('full_ip_sessions > 200').head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Countries with proportionally large gains:"
]
},
{
"cell_type": "code",
"execution_count": 231,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" full_ip_sessions | \n",
" anon_ip_sessions | \n",
" delta | \n",
" dprop | \n",
"
\n",
" \n",
" country | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Philippines | \n",
" 266 | \n",
" 272 | \n",
" 6 | \n",
" 0.022556 | \n",
"
\n",
" \n",
" Lithuania | \n",
" 217 | \n",
" 222 | \n",
" 5 | \n",
" 0.023041 | \n",
"
\n",
" \n",
" India | \n",
" 4142 | \n",
" 4243 | \n",
" 101 | \n",
" 0.024384 | \n",
"
\n",
" \n",
" Indonesia | \n",
" 430 | \n",
" 441 | \n",
" 11 | \n",
" 0.025581 | \n",
"
\n",
" \n",
" Portugal | \n",
" 230 | \n",
" 236 | \n",
" 6 | \n",
" 0.026087 | \n",
"
\n",
" \n",
" Malaysia | \n",
" 233 | \n",
" 242 | \n",
" 9 | \n",
" 0.038627 | \n",
"
\n",
" \n",
" Ukraine | \n",
" 1100 | \n",
" 1144 | \n",
" 44 | \n",
" 0.040000 | \n",
"
\n",
" \n",
" Ireland | \n",
" 202 | \n",
" 211 | \n",
" 9 | \n",
" 0.044554 | \n",
"
\n",
" \n",
" Sweden | \n",
" 478 | \n",
" 506 | \n",
" 28 | \n",
" 0.058577 | \n",
"
\n",
" \n",
" Singapore | \n",
" 311 | \n",
" 331 | \n",
" 20 | \n",
" 0.064309 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" full_ip_sessions anon_ip_sessions delta dprop\n",
"country \n",
"Philippines 266 272 6 0.022556\n",
"Lithuania 217 222 5 0.023041\n",
"India 4142 4243 101 0.024384\n",
"Indonesia 430 441 11 0.025581\n",
"Portugal 230 236 6 0.026087\n",
"Malaysia 233 242 9 0.038627\n",
"Ukraine 1100 1144 44 0.040000\n",
"Ireland 202 211 9 0.044554\n",
"Sweden 478 506 28 0.058577\n",
"Singapore 311 331 20 0.064309"
]
},
"execution_count": 231,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"country_data.query('full_ip_sessions > 200').tail(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the proportion of the fluctuation?"
]
},
{
"cell_type": "code",
"execution_count": 232,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.018429242692378105"
]
},
"execution_count": 232,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"country_data.delta.map(abs).sum()/country_data.full_ip_sessions.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The deltas are below 10%, either positive or negative. And there's globally less than 2% of country attribution mismatch.\n",
"\n",
"One can then say that __Country attribution is largely insensitive to IP anonymization__."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploration 2 : what does it mean locally in Switzerland?\n",
"\n",
"Let us dive one level deeper: at city level. We will focus on Switzerland since we have enough traffic from it."
]
},
{
"cell_type": "code",
"execution_count": 233,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" full_ip_sessions | \n",
" anon_ip_sessions | \n",
" delta | \n",
" dprop | \n",
"
\n",
" \n",
" country | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Switzerland | \n",
" 4141 | \n",
" 4156 | \n",
" 15 | \n",
" 0.003622 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" full_ip_sessions anon_ip_sessions delta dprop\n",
"country \n",
"Switzerland 4141 4156 15 0.003622"
]
},
"execution_count": 233,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"country_data.query('country == \"Switzerland\"')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Less than 1 percent loss at country level for Switzerland, rather stable. But what's happening at city level?"
]
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"full_ip_sessions 4141\n",
"anon_ip_sessions 4156\n",
"dtype: float64"
]
},
"execution_count": 234,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create a clean subset\n",
"swiss_data = data.query('country == \"Switzerland\"').copy()\n",
"swiss_data.sum()"
]
},
{
"cell_type": "code",
"execution_count": 235,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"swiss_data['delta'] = swiss_data.anon_ip_sessions - swiss_data.full_ip_sessions\n",
"swiss_data['dprop'] = swiss_data.delta / swiss_data.full_ip_sessions\n",
"\n",
"swiss_data.sort_values(by='dprop', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" full_ip_sessions | \n",
" anon_ip_sessions | \n",
" delta | \n",
" dprop | \n",
"
\n",
" \n",
" country | \n",
" city | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Switzerland | \n",
" Porrentruy | \n",
" 138 | \n",
" 1 | \n",
" -137 | \n",
" -0.992754 | \n",
"
\n",
" \n",
" Ebikon | \n",
" 81 | \n",
" 1 | \n",
" -80 | \n",
" -0.987654 | \n",
"
\n",
" \n",
" Basel | \n",
" 162 | \n",
" 111 | \n",
" -51 | \n",
" -0.314815 | \n",
"
\n",
" \n",
" Lugano | \n",
" 54 | \n",
" 42 | \n",
" -12 | \n",
" -0.222222 | \n",
"
\n",
" \n",
" Lucerne | \n",
" 71 | \n",
" 58 | \n",
" -13 | \n",
" -0.183099 | \n",
"
\n",
" \n",
" Lausanne | \n",
" 341 | \n",
" 345 | \n",
" 4 | \n",
" 0.011730 | \n",
"
\n",
" \n",
" Zurich | \n",
" 1476 | \n",
" 1504 | \n",
" 28 | \n",
" 0.018970 | \n",
"
\n",
" \n",
" Bern | \n",
" 206 | \n",
" 232 | \n",
" 26 | \n",
" 0.126214 | \n",
"
\n",
" \n",
" Saint Gallen | \n",
" 86 | \n",
" 104 | \n",
" 18 | \n",
" 0.209302 | \n",
"
\n",
" \n",
" Winterthur | \n",
" 77 | \n",
" 97 | \n",
" 20 | \n",
" 0.259740 | \n",
"
\n",
" \n",
" Geneva | \n",
" 118 | \n",
" 189 | \n",
" 71 | \n",
" 0.601695 | \n",
"
\n",
" \n",
" Fribourg | \n",
" 105 | \n",
" 300 | \n",
" 195 | \n",
" 1.857143 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" full_ip_sessions anon_ip_sessions delta dprop\n",
"country city \n",
"Switzerland Porrentruy 138 1 -137 -0.992754\n",
" Ebikon 81 1 -80 -0.987654\n",
" Basel 162 111 -51 -0.314815\n",
" Lugano 54 42 -12 -0.222222\n",
" Lucerne 71 58 -13 -0.183099\n",
" Lausanne 341 345 4 0.011730\n",
" Zurich 1476 1504 28 0.018970\n",
" Bern 206 232 26 0.126214\n",
" Saint Gallen 86 104 18 0.209302\n",
" Winterthur 77 97 20 0.259740\n",
" Geneva 118 189 71 0.601695\n",
" Fribourg 105 300 195 1.857143"
]
},
"execution_count": 237,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"swiss_data.query('full_ip_sessions > 50')"
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.2711905336875151"
]
},
"execution_count": 238,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# absolute sum of delta \n",
"swiss_data.delta.map(abs).sum()/swiss_data.full_ip_sessions.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Quite some turmoil at city level!__ For example, Fribourg gains 185% of attributions while Basel loses 30%, Something wild going on in Porrentruy and Ebikon, ...\n",
"\n",
"Overall, we see __more than 25% mismatch in city attribution__ for Switzerland."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}