{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Использование Logs API для анализа данных\n", "## Подготовка\n", "### Установка необходимых программ\n", "\n", "Для работы нам понадобится:\n", "* СУБД ClickHouse\n", "* Python 2.7 + несколько библиотек (pandas, requests, plot.ly)\n", "* Jupyter Notebook\n", "\n", "#### Установка ClickHouse\n", "ClickHouse можно установить на Linux (подробно про установку написано в документации https://clickhouse.yandex/).\n", "\n", "На Windows или Mac можно запустить ClickHouse под Docker'ом: первая команда поднимает clickhouse-server на порту 8123, а вторая - позволяет подключиться к консольному ClickHouse клиенту.\n", "```\n", "docker run -d --name clickhouse-server --publish=8123:8123 --publish=9000:9000 yandex/clickhouse-server\n", "docker run -it --rm --link clickhouse-server:9000 yandex/clickhouse-client --host clickhouse-server\n", "```\n", "\n", "#### Установка python и jupyter notebook\n", "\n", "Python можно скачать на [официальном сайте](https://www.python.org/getit/). \n", "Также в скрипте будут использованы некоторые дополнительные библиотеки, их проще всего поставить с помощью менеджера пакетов pip ([инструкция по установке pip](https://pip.pypa.io/en/stable/installing/))\n", "\n", "```\n", "pip install pandas jupyter requests plotly\n", "```\n", "\n", "Для того, чтобы запустить jupyter notebook, достаточно выполнить в консоли команду (после этого web-interface будет доступно по адресу http://localhost:8888):\n", "\n", "```jupyter notebook```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Выгрузка данных из LogsAPI в ClickHouse\n", "Проще всего загрузить данные из LogsAPI в ClickHouse с помощью [скрипта на GitHub'e](https://github.com/yndx-metrika/logs_api_integration).\n", "Прежде всего нужно указать всю информацию в конфиге.\n", "\n", "В конфиге собраны основные параметры и будет достаточно задать их один раз и затем пользоваться. Для данной задачи я использовала вот такой конфиг:\n", "\n", "```\n", "{\n", " \"token\" : \"\",\n", " \"counter_id\": \"29761725\",\n", " \"visits_fields\": [\n", " \"ym:s:counterID\",\n", " \"ym:s:dateTime\",\n", " \"ym:s:date\",\n", " \"ym:s:visitDuration\",\n", " \"ym:s:bounce\",\n", " \"ym:s:pageViews\",\n", " \"ym:s:goalsID\",\n", " \"ym:s:clientID\",\n", " \"ym:s:lastTrafficSource\",\n", " \"ym:s:lastAdvEngine\",\n", " \"ym:s:lastSearchEngineRoot\",\n", " \"ym:s:visitID\",\n", " \"ym:s:startURL\",\n", " \"ym:s:browser\" \n", " ],\n", " \"log_level\": \"INFO\",\n", " \"retries\": 1,\n", " \"retries_delay\": 60,\n", " \"clickhouse\": {\n", " \"host\": \"http://localhost:8123\",\n", " \"user\": \"\",\n", " \"password\": \"\",\n", " \"visits_table\": \"visits_all\",\n", " \"hits_table\": \"hits_all\",\n", " \"database\": \"demo\"\n", " }\n", "}\n", "```\n", "\n", "Часть параметров задается непосредственно при вызове функции в CLI options. Например, источник (хиты или визиты) и период выгрузки.\n", "У скрипта есть несколько режимов работы. 
\n", "Во-первых, можно задать конкретные период (`start_date` и `end_date`):\n", "\n", "```python metrica_logs_api.py -source visits -start_date 2016-01-01 -end_date 2016-12-31```\n", "\n", "Другая опция это использовать один из режимов:\n", "* __history__ - выгружает все данные с даты создания счетчика до позавчера \n", "* __regular__ - рекомендуемый режим для регулярных выгрузок, каждый день будет подтягивать данные за позавчера\n", "* __regular_early__ - как __regular__ только берет вчерашний день (данные за вчера могут быть неполными, так что рекомендуется все-таки использовать __regular__)\n", "\n", "```python metrica_logs_api.py -source visits -mode regular```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Полезные функции\n", "\n", "Напишем функции для интеграции с ClickHouse: первая функция просто возвращает результат из DataBase, вторая же преобразует его в pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "HOST = 'http://localhost:8123'\n", "import requests\n", "import pandas as pd\n", "import StringIO\n", "import datetime\n", "\n", "def get_clickhouse_data(query, host = HOST, connection_timeout = 1500):\n", " r = requests.post(host, params = {'query': query}, timeout = connection_timeout)\n", " if r.status_code == 200:\n", " return r.text\n", " else:\n", " raise ValueError, r.text\n", " \n", "def get_clickhouse_df(query, host = HOST, connection_timeout = 1500):\n", " data = get_clickhouse_data(query, host, connection_timeout) \n", " df = pd.read_csv(StringIO.StringIO(data), sep = '\\t')\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Расчет retentions по когортам" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Разминка или пара простых SQL запросов\n", "Для начала посмотрим, какие данные есть в выгруженной с помощью Logs API таблице `visits_all`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bounce\tUInt8\t\t\n", "Browser\tString\t\t\n", "ClientID\tUInt64\t\t\n", "CounterID\tUInt32\t\t\n", "Date\tDate\t\t\n", "DateTime\tDateTime\t\t\n", "GoalsID\tArray(UInt32)\t\t\n", "LastAdvEngine\tString\t\t\n", "LastSearchEngineRoot\tString\t\t\n", "LastTrafficSource\tString\t\t\n", "PageViews\tInt32\t\t\n", "StartURL\tString\t\t\n", "VisitDuration\tUInt32\t\t\n", "VisitID\tUInt64\t\t\n", "\n" ] } ], "source": [ "q = '''DESCRIBE demo.visits_all'''\n", "print get_clickhouse_data(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Описание и полный список полей можно найти в документации Logs API: https://tech.yandex.ru/metrika/doc/api2/logs/intro-docpage/" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BounceBrowserClientIDCounterIDDateDateTimeGoalsIDLastAdvEngineLastSearchEngineRootLastTrafficSourcePageViewsStartURLVisitDurationVisitID
01operamini1449647949482143849297617252016-01-012016-01-01 19:45:49[]ya_undefinedNaNinternal1https://yandex.ru/support/metrika/troubleshoot...05946027005326809537
10opera1450991301857057590297617252016-01-012016-01-01 17:58:51[]ya_undefinedNaNinternal1https://yandex.ru/support/metrika/reports/fals...165946000737738979198
20chrome1451627947145006808297617252016-01-012016-01-01 13:44:09[]ya_undefinedNaNreferral5https://events.yandex.ru/surveys/1705/?iframe=...15305945938122840090407
30chrome1451627947145006808297617252016-01-012016-01-01 16:14:39[]ya_undefinedNaNreferral1https://yandex.ru/support/metrika/behavior/web...155945975107921245974
41chrome1451673786489426257297617252016-01-012016-01-01 21:44:45[]ya_undefinedNaNreferral1https://yandex.ru/support/metrika/troubleshoot...05946056234508984012
\n", "
" ], "text/plain": [ " Bounce Browser ClientID CounterID Date \\\n", "0 1 operamini 1449647949482143849 29761725 2016-01-01 \n", "1 0 opera 1450991301857057590 29761725 2016-01-01 \n", "2 0 chrome 1451627947145006808 29761725 2016-01-01 \n", "3 0 chrome 1451627947145006808 29761725 2016-01-01 \n", "4 1 chrome 1451673786489426257 29761725 2016-01-01 \n", "\n", " DateTime GoalsID LastAdvEngine LastSearchEngineRoot \\\n", "0 2016-01-01 19:45:49 [] ya_undefined NaN \n", "1 2016-01-01 17:58:51 [] ya_undefined NaN \n", "2 2016-01-01 13:44:09 [] ya_undefined NaN \n", "3 2016-01-01 16:14:39 [] ya_undefined NaN \n", "4 2016-01-01 21:44:45 [] ya_undefined NaN \n", "\n", " LastTrafficSource PageViews \\\n", "0 internal 1 \n", "1 internal 1 \n", "2 referral 5 \n", "3 referral 1 \n", "4 referral 1 \n", "\n", " StartURL VisitDuration \\\n", "0 https://yandex.ru/support/metrika/troubleshoot... 0 \n", "1 https://yandex.ru/support/metrika/reports/fals... 16 \n", "2 https://events.yandex.ru/surveys/1705/?iframe=... 1530 \n", "3 https://yandex.ru/support/metrika/behavior/web... 15 \n", "4 https://yandex.ru/support/metrika/troubleshoot... 0 \n", "\n", " VisitID \n", "0 5946027005326809537 \n", "1 5946000737738979198 \n", "2 5945938122840090407 \n", "3 5945975107921245974 \n", "4 5946056234508984012 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT * FROM demo.visits_all \n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", " '''\n", "get_clickhouse_df(q)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min(Date)max(Date)
02016-01-012016-12-31
\n", "
" ], "text/plain": [ " min(Date) max(Date)\n", "0 2016-01-01 2016-12-31" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT min(Date), max(Date)\n", " FROM demo.visits_all\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Посчитаем общее число визитов и пользователей." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TotalVisitsTotalUsers
01240021851044
\n", "
" ], "text/plain": [ " TotalVisits TotalUsers\n", "0 1240021 851044" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " count() as TotalVisits,\n", " uniq(ClientID) as TotalUsers\n", " FROM demo.visits_all\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retention" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Будем считать retention по неделям, рассматривая пользователей, которые впервые пришли на сервис с 24 октября по 25 ноября.\n", "\n", "Для удобства обозначим временной период переменными." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "start_date = datetime.datetime(2016, 10, 24)\n", "end_date = datetime.datetime(2016, 11, 25)\n", "\n", "start_date_str, end_date_str = map(lambda x: x.strftime('%Y-%m-%d'), (start_date, end_date))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для начала посчитаем получим всех пользователей, которые впервые появились на сервисе в указанный период." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
client_idmin_date
014636596103476928402016-11-21
114718534935327658002016-10-24
214701508149033271512016-10-24
314799946465786652952016-11-21
414768567355183204552016-11-21
\n", "
" ], "text/plain": [ " client_id min_date\n", "0 1463659610347692840 2016-11-21\n", "1 1471853493532765800 2016-10-24\n", "2 1470150814903327151 2016-10-24\n", "3 1479994646578665295 2016-11-21\n", "4 1476856735518320455 2016-11-21" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " ClientID as client_id,\n", " min(toMonday(Date)) as min_date\n", " FROM demo.visits_all\n", " GROUP BY \n", " client_id\n", " HAVING (min_date >= '{start_date}') AND (min_date <= '{end_date}')\n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", "'''.format(\n", " start_date = start_date_str,\n", " end_date = end_date_str\n", ")\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Мы получили соответствие пользователей когортам (неделе, когда они впервые появились на сайте). Теперь, чтобы посчитать retention нам нужно сделать JOIN таблицы с самой собой по `ClientID`, чтобы понять, когда возвращались на сайт пользователи каждой из групп." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
client_idmin_datedatedaysweek_num
014636596103476928402016-11-212016-11-2100
114636596103476928402016-11-212016-11-2871
214718534935327658002016-10-242016-10-2400
314701508149033271512016-10-242016-10-2400
414799946465786652952016-11-212016-11-2100
\n", "
" ], "text/plain": [ " client_id min_date date days week_num\n", "0 1463659610347692840 2016-11-21 2016-11-21 0 0\n", "1 1463659610347692840 2016-11-21 2016-11-28 7 1\n", "2 1471853493532765800 2016-10-24 2016-10-24 0 0\n", "3 1470150814903327151 2016-10-24 2016-10-24 0 0\n", "4 1479994646578665295 2016-11-21 2016-11-21 0 0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", "SELECT\n", " client_id, \n", " min_date, \n", " date,\n", " (date - min_date) as days,\n", " days/7 as week_num\n", "FROM\n", " (SELECT \n", " ClientID as client_id,\n", " min(toMonday(Date)) as min_date\n", " FROM demo.visits_all\n", " GROUP BY \n", " client_id\n", " HAVING (min_date >= '{start_date}') AND (min_date <= '{end_date}'))\n", " ALL INNER JOIN\n", " (SELECT DISTINCT\n", " ClientID as client_id,\n", " toMonday(Date) as date\n", " FROM demo.visits_all)\n", " USING client_id\n", "LIMIT 5\n", "FORMAT TabSeparatedWithNames\n", "'''.format(\n", " start_date = start_date_str,\n", " end_date = end_date_str\n", ")\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Осталось только посчитать по когортам, сколько пользователей появлялись в каждую из недель `weeks` после первого появления." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "q = '''\n", "SELECT\n", " uniq(client_id) as clients, \n", " (date - min_date)/7 as week_num,\n", " min_date\n", "FROM\n", " (SELECT \n", " ClientID as client_id,\n", " min(toMonday(Date)) as min_date\n", " FROM demo.visits_all\n", " GROUP BY \n", " client_id\n", " HAVING (min_date >= '{start_date}') AND (min_date <= '{end_date}'))\n", " ALL INNER JOIN\n", " (SELECT DISTINCT\n", " ClientID as client_id,\n", " toMonday(Date) as date\n", " FROM demo.visits_all)\n", " USING client_id\n", "GROUP BY min_date, week_num\n", "FORMAT TabSeparatedWithNames\n", "'''.format(\n", " start_date = start_date_str,\n", " end_date = end_date_str\n", ")\n", "\n", "raw_ret_df = get_clickhouse_df(q)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ret_df = raw_ret_df.pivot_table(index = 'min_date', values = 'clients', columns = 'week_num').fillna(0).T\n", "ret_df_norm = ret_df.apply(lambda x: 100*x/ret_df.loc[0], axis = 1).applymap(lambda x: x if x!=0 else None)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min_date2016-10-242016-10-312016-11-072016-11-142016-11-21
week_num
013667.012919.013951.014097.015788.0
1956.01007.01097.01068.01070.0
2637.0576.0602.0589.0620.0
3436.0414.0492.0481.0451.0
4393.0374.0411.0351.0345.0
5353.0343.0334.0267.0224.0
6276.0249.0245.0155.00.0
7258.0199.0170.00.00.0
8191.0141.00.00.00.0
9139.00.00.00.00.0
\n", "
" ], "text/plain": [ "min_date 2016-10-24 2016-10-31 2016-11-07 2016-11-14 2016-11-21\n", "week_num \n", "0 13667.0 12919.0 13951.0 14097.0 15788.0\n", "1 956.0 1007.0 1097.0 1068.0 1070.0\n", "2 637.0 576.0 602.0 589.0 620.0\n", "3 436.0 414.0 492.0 481.0 451.0\n", "4 393.0 374.0 411.0 351.0 345.0\n", "5 353.0 343.0 334.0 267.0 224.0\n", "6 276.0 249.0 245.0 155.0 0.0\n", "7 258.0 199.0 170.0 0.0 0.0\n", "8 191.0 141.0 0.0 0.0 0.0\n", "9 139.0 0.0 0.0 0.0 0.0" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ret_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Напишем пару функций, которые будем использовать для визуализации результатов. Мы будем использовать open-source библиотеку plot.ly для построения интерактивных графиков." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.0.0\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from plotly import __version__\n", "from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n", "from plotly import graph_objs as go\n", "import requests\n", "import StringIO\n", "import pandas as pd\n", "\n", "print __version__ # need 1.9.0 or greater\n", "\n", "init_notebook_mode(connected = True)\n", "\n", "def plotly_df(df, title = ''):\n", " data = []\n", " \n", " for column in df.columns:\n", " trace = go.Scatter(\n", " x = df.index,\n", " y = df[column],\n", " mode = 'lines',\n", " name = column\n", " )\n", " data.append(trace)\n", " \n", " layout = dict(title = title)\n", " fig = dict(data = data, layout = layout)\n", " \n", " # plotly.offline.plot(fig, filename=filename, show_link = False)\n", " \n", " iplot(fig, show_link = False)\n", " \n", "def highlight_vals(val):\n", " if (val is None) or (val == ''):\n", " return ''\n", " p = 0.5\n", " if val > 90:\n", " return 'background-color: rgba(229, 0, 20, %f)' % p\n", " if val > 80:\n", " return 'background-color: rgba(231, 25, 43, %f)' % p\n", " if val > 70:\n", " return 'background-color: rgba(234, 51, 67, %f)' % p\n", " if val > 60:\n", " return 'background-color: rgba(236, 76, 90, %f)' % p\n", " if val > 50:\n", " return 'background-color: rgba(239, 102, 114, %f)' % p\n", " if val > 40:\n", " return 'background-color: rgba(242, 137, 127, %f)' % p\n", " if val > 30:\n", " return 'background-color: rgba(244, 153, 161, %f)' % p\n", " if val > 20:\n", " return 'background-color: rgba(247, 178, 184, %f)' % p\n", " if val > 10:\n", " return 'background-color: rgba(249, 204, 208, %f)' % p\n", " return 'background-color: rgba(252, 229, 231, %f)' % p" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plotly_df(ret_df_norm, title = 'Retention')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " week_num\n", " \n", " \n", " \n", " \n", " 0\n", " \n", " \n", " \n", " \n", " 1\n", " \n", " \n", " \n", " \n", " 2\n", " \n", " \n", " \n", " \n", " 3\n", " \n", " \n", " \n", " \n", " 4\n", " \n", " \n", " \n", " \n", " 5\n", " \n", " \n", " \n", " \n", " 6\n", " \n", " \n", " \n", " \n", " 7\n", " \n", " \n", " \n", " \n", " 8\n", " \n", " \n", " \n", " \n", " 9\n", " \n", " \n", "
\n", " min_date\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-10-24\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 6.99495\n", " \n", " \n", " \n", " \n", " 4.66086\n", " \n", " \n", " \n", " \n", " 3.19017\n", " \n", " \n", " \n", " \n", " 2.87554\n", " \n", " \n", " \n", " \n", " 2.58286\n", " \n", " \n", " \n", " \n", " 2.01946\n", " \n", " \n", " \n", " \n", " 1.88776\n", " \n", " \n", " \n", " \n", " 1.39753\n", " \n", " \n", " \n", " \n", " 1.01705\n", " \n", " \n", "
\n", " 2016-10-31\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 7.79472\n", " \n", " \n", " \n", " \n", " 4.45855\n", " \n", " \n", " \n", " \n", " 3.20458\n", " \n", " \n", " \n", " \n", " 2.89496\n", " \n", " \n", " \n", " \n", " 2.655\n", " \n", " \n", " \n", " \n", " 1.92739\n", " \n", " \n", " \n", " \n", " 1.54037\n", " \n", " \n", " \n", " \n", " 1.09142\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-11-07\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 7.86324\n", " \n", " \n", " \n", " \n", " 4.3151\n", " \n", " \n", " \n", " \n", " 3.52663\n", " \n", " \n", " \n", " \n", " 2.94603\n", " \n", " \n", " \n", " \n", " 2.39409\n", " \n", " \n", " \n", " \n", " 1.75615\n", " \n", " \n", " \n", " \n", " 1.21855\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-11-14\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 7.57608\n", " \n", " \n", " \n", " \n", " 4.17819\n", " \n", " \n", " \n", " \n", " 3.41207\n", " \n", " \n", " \n", " \n", " 2.48989\n", " \n", " \n", " \n", " \n", " 1.89402\n", " \n", " \n", " \n", " \n", " 1.09952\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-11-21\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 6.7773\n", " \n", " \n", " \n", " \n", " 3.92703\n", " \n", " \n", " \n", " \n", " 2.8566\n", " \n", " \n", " \n", " \n", " 2.1852\n", " \n", " \n", " \n", " \n", " 1.4188\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ret_df_norm.T.fillna('').style.applymap(highlight_vals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rolling retention\n", "Основное отличие rolling retention в том, что мы считаем, что пользователь \"жив\" до момента последнего посещения сайта. \n", "Предположим пользователь пришел на 1й неделе, а затем на 5й. При расчете retention'a мы не будем учитывать этого клиента на второй неделе, поскольку он не появлялся на сайте. При расчете же метрики rolling retention мы будем считать пользователя активным вполоть до 5й недели.\n", "\n", "В этом случай нам важно всего лишь посчитать для каждого пользователя его первое и последнее появление на сайте." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
client_idmin_datemax_dateweek_numweeks_visited
014678986591880802016-11-212016-12-265[0,1,2,3,4,5]
114718417703708062016-11-072016-12-196[0,1,2,3,4,5,6]
214722139881953442016-11-142016-11-211[0,1]
314724055821433662016-11-142016-12-195[0,1,2,3,4,5]
414725336411951522016-10-242016-10-311[0,1]
\n", "
" ], "text/plain": [ " client_id min_date max_date week_num weeks_visited\n", "0 1467898659188080 2016-11-21 2016-12-26 5 [0,1,2,3,4,5]\n", "1 1471841770370806 2016-11-07 2016-12-19 6 [0,1,2,3,4,5,6]\n", "2 1472213988195344 2016-11-14 2016-11-21 1 [0,1]\n", "3 1472405582143366 2016-11-14 2016-12-19 5 [0,1,2,3,4,5]\n", "4 1472533641195152 2016-10-24 2016-10-31 1 [0,1]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " ClientID as client_id,\n", " min(toMonday(Date)) as min_date,\n", " max(toMonday(Date)) as max_date,\n", " (max_date - min_date)/7 as week_num,\n", " range(toUInt64((max_date - min_date)/7) + 1) as weeks_visited\n", " FROM demo.visits_all\n", " GROUP BY \n", " client_id\n", " HAVING (min_date >= '{start_date}') AND (min_date <= '{end_date}') AND length(weeks_visited) > 1\n", " ORDER BY client_id\n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", "'''.format(\n", " start_date = start_date_str,\n", " end_date = end_date_str\n", ")\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В clickhouse есть функция `arrayJoin()`, которая позволяет развернуть массив, оставив все остальные колонки без изменений. Воспользуемся ей.\n", "\n", "Условие `length(weeks_visited) > 1` добавлено в запросы исключительно для целей иллюстрации работы `arrayJoin`." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
client_idmin_datemax_dateweeks_visitedweek_num
014678986591880802016-11-212016-12-26[0,1,2,3,4,5]0
114678986591880802016-11-212016-12-26[0,1,2,3,4,5]1
214678986591880802016-11-212016-12-26[0,1,2,3,4,5]2
314678986591880802016-11-212016-12-26[0,1,2,3,4,5]3
414678986591880802016-11-212016-12-26[0,1,2,3,4,5]4
514678986591880802016-11-212016-12-26[0,1,2,3,4,5]5
614718417703708062016-11-072016-12-19[0,1,2,3,4,5,6]0
714718417703708062016-11-072016-12-19[0,1,2,3,4,5,6]1
814718417703708062016-11-072016-12-19[0,1,2,3,4,5,6]2
914718417703708062016-11-072016-12-19[0,1,2,3,4,5,6]3
\n", "
" ], "text/plain": [ " client_id min_date max_date weeks_visited week_num\n", "0 1467898659188080 2016-11-21 2016-12-26 [0,1,2,3,4,5] 0\n", "1 1467898659188080 2016-11-21 2016-12-26 [0,1,2,3,4,5] 1\n", "2 1467898659188080 2016-11-21 2016-12-26 [0,1,2,3,4,5] 2\n", "3 1467898659188080 2016-11-21 2016-12-26 [0,1,2,3,4,5] 3\n", "4 1467898659188080 2016-11-21 2016-12-26 [0,1,2,3,4,5] 4\n", "5 1467898659188080 2016-11-21 2016-12-26 [0,1,2,3,4,5] 5\n", "6 1471841770370806 2016-11-07 2016-12-19 [0,1,2,3,4,5,6] 0\n", "7 1471841770370806 2016-11-07 2016-12-19 [0,1,2,3,4,5,6] 1\n", "8 1471841770370806 2016-11-07 2016-12-19 [0,1,2,3,4,5,6] 2\n", "9 1471841770370806 2016-11-07 2016-12-19 [0,1,2,3,4,5,6] 3" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " ClientID as client_id,\n", " min(toMonday(Date)) as min_date,\n", " max(toMonday(Date)) as max_date,\n", " range(toUInt64((max_date - min_date)/7) + 1) as weeks_visited,\n", " arrayJoin(weeks_visited) as week_num\n", " FROM demo.visits_all\n", " GROUP BY \n", " client_id\n", " HAVING (min_date >= '{start_date}') AND (min_date <= '{end_date}') AND length(weeks_visited) > 1\n", " ORDER BY client_id, week_num\n", " LIMIT 10\n", " FORMAT TabSeparatedWithNames\n", "'''.format(\n", " start_date = start_date_str,\n", " end_date = end_date_str\n", ")\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Осталось только посчитать число пользователей в каждую неделю по когортам." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "q = '''\n", "SELECT\n", " uniq(client_id) as clients,\n", " week_num,\n", " min_date\n", "FROM\n", " (SELECT \n", " ClientID as client_id,\n", " min(toMonday(Date)) as min_date,\n", " max(toMonday(Date)) as max_date,\n", " range(toUInt64((max_date - min_date)/7) + 1) as weeks_visited,\n", " arrayJoin(weeks_visited) as week_num\n", " FROM demo.visits_all\n", " GROUP BY \n", " client_id\n", " HAVING (min_date >= '{start_date}') AND (min_date <= '{end_date}') \n", " ORDER BY client_id, week_num)\n", "GROUP BY \n", " week_num,\n", " min_date\n", "FORMAT TabSeparatedWithNames\n", "'''.format(\n", " start_date = start_date_str,\n", " end_date = end_date_str\n", ")\n", "\n", "raw_roll_ret_df = get_clickhouse_df(q)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "roll_ret_df = raw_roll_ret_df.pivot_table(index = 'min_date', \n", " values = 'clients', \n", " columns = 'week_num').fillna(0).T\n", "roll_ret_df_norm = roll_ret_df.apply(lambda x: 100*x/roll_ret_df.loc[0], axis = 1).applymap(lambda x: x if x!=0 else None)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plotly_df(roll_ret_df_norm, title = 'Rolling retention')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " week_num\n", " \n", " \n", " \n", " \n", " 0\n", " \n", " \n", " \n", " \n", " 1\n", " \n", " \n", " \n", " \n", " 2\n", " \n", " \n", " \n", " \n", " 3\n", " \n", " \n", " \n", " \n", " 4\n", " \n", " \n", " \n", " \n", " 5\n", " \n", " \n", " \n", " \n", " 6\n", " \n", " \n", " \n", " \n", " 7\n", " \n", " \n", " \n", " \n", " 8\n", " \n", " \n", " \n", " \n", " 9\n", " \n", " \n", "
\n", " min_date\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-10-24\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 16.9679\n", " \n", " \n", " \n", " \n", " 12.768\n", " \n", " \n", " \n", " \n", " 10.0461\n", " \n", " \n", " \n", " \n", " 8.18029\n", " \n", " \n", " \n", " \n", " 6.54862\n", " \n", " \n", " \n", " \n", " 4.8072\n", " \n", " \n", " \n", " \n", " 3.35845\n", " \n", " \n", " \n", " \n", " 2.07068\n", " \n", " \n", " \n", " \n", " 1.01705\n", " \n", " \n", "
\n", " 2016-10-31\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 16.9982\n", " \n", " \n", " \n", " \n", " 11.9359\n", " \n", " \n", " \n", " \n", " 9.26542\n", " \n", " \n", " \n", " \n", " 7.33029\n", " \n", " \n", " \n", " \n", " 5.45708\n", " \n", " \n", " \n", " \n", " 3.73094\n", " \n", " \n", " \n", " \n", " 2.41505\n", " \n", " \n", " \n", " \n", " 1.09142\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-11-07\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 16.0419\n", " \n", " \n", " \n", " \n", " 10.9813\n", " \n", " \n", " \n", " \n", " 8.48685\n", " \n", " \n", " \n", " \n", " 6.3723\n", " \n", " \n", " \n", " \n", " 4.34377\n", " \n", " \n", " \n", " \n", " 2.6378\n", " \n", " \n", " \n", " \n", " 1.21855\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-11-14\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 14.4073\n", " \n", " \n", " \n", " \n", " 9.42044\n", " \n", " \n", " \n", " \n", " 6.83833\n", " \n", " \n", " \n", " \n", " 4.40519\n", " \n", " \n", " \n", " \n", " 2.5892\n", " \n", " \n", " \n", " \n", " 1.09952\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 2016-11-21\n", " \n", " \n", " \n", " \n", " 100\n", " \n", " \n", " \n", " \n", " 12.6425\n", " \n", " \n", " \n", " \n", " 7.8414\n", " \n", " \n", " \n", " \n", " 5.18115\n", " \n", " \n", " \n", " \n", " 3.15429\n", " \n", " \n", " \n", " \n", " 1.4188\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roll_ret_df_norm.T.fillna('').style.applymap(highlight_vals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Пути до покупки\n", "Для начала выгрузим визиты, в которых были совершены покупки" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LastTraficSourceClientIDVisitID
0direct14500950223471208055940262123258949328
1organic14464884249619435035940419290323804278
2referral1450380880229897725940420766409453257
3organic14502936315371145595940402531744356186
4ad14502624606050712555940275038225886929
\n", "
" ], "text/plain": [ " LastTraficSource ClientID VisitID\n", "0 direct 1450095022347120805 5940262123258949328\n", "1 organic 1446488424961943503 5940419290323804278\n", "2 referral 145038088022989772 5940420766409453257\n", "3 organic 1450293631537114559 5940402531744356186\n", "4 ad 1450262460605071255 5940275038225886929" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " LastTraficSource,\n", " ClientID,\n", " VisitID\n", " FROM visits_all\n", " WHERE (ClientID != 0) AND (length(PurchaseID) != 0)\n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для того, чтобы построить пути, т.е. для каждой покупки определить, с какими источниками пользователь появлялся на сайте ранее, нужно сделать JOIN таблицы с самой собой по идентификатору пользователя `ClientID`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClientIDVisitIDPurchaseTimeDateTimeLastTraficSource
0144649411520014812559353889318805195362015-12-02 18:19:232015-12-02 18:19:23direct
1144908273316560962259354449598770791582015-12-02 22:07:212015-12-02 22:07:21ad
2145009502234712080559402621232589493282015-12-16 12:46:482015-12-14 15:19:26internal
3145009502234712080559402621232589493282015-12-16 12:46:482015-12-16 12:46:48direct
4144648842496194350359404192903238042782015-12-16 23:27:582015-12-03 20:53:36organic
\n", "
" ], "text/plain": [ " ClientID VisitID PurchaseTime \\\n", "0 1446494115200148125 5935388931880519536 2015-12-02 18:19:23 \n", "1 1449082733165609622 5935444959877079158 2015-12-02 22:07:21 \n", "2 1450095022347120805 5940262123258949328 2015-12-16 12:46:48 \n", "3 1450095022347120805 5940262123258949328 2015-12-16 12:46:48 \n", "4 1446488424961943503 5940419290323804278 2015-12-16 23:27:58 \n", "\n", " DateTime LastTraficSource \n", "0 2015-12-02 18:19:23 direct \n", "1 2015-12-02 22:07:21 ad \n", "2 2015-12-14 15:19:26 internal \n", "3 2015-12-16 12:46:48 direct \n", "4 2015-12-03 20:53:36 organic " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " ClientID,\n", " VisitID,\n", " PurchaseTime,\n", " DateTime,\n", " LastTraficSource\n", " FROM\n", " (SELECT \n", " ClientID,\n", " DateTime as PurchaseTime, \n", " VisitID\n", " FROM visits_all\n", " WHERE (ClientID != 0) AND (length(PurchaseID) != 0))\n", " ALL LEFT JOIN\n", " (SELECT \n", " ClientID,\n", " LastTraficSource,\n", " DateTime\n", " FROM visits_all\n", " ORDER BY DateTime)\n", " USING ClientID\n", " WHERE PurchaseTime >= DateTime\n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Объединим все по `VisitID` и с помощью функции `groupArray` соберем все источника трафика в массивы" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VisitIDSources
05942036913160910687['ad','ad','ad','ad','ad']
15944887285864662889['ad','internal','ad']
25936327790758690505['ad']
35935674958795425569['social']
45937596701532043135['organic','organic','referral','ad']
\n", "
" ], "text/plain": [ " VisitID Sources\n", "0 5942036913160910687 ['ad','ad','ad','ad','ad']\n", "1 5944887285864662889 ['ad','internal','ad']\n", "2 5936327790758690505 ['ad']\n", "3 5935674958795425569 ['social']\n", "4 5937596701532043135 ['organic','organic','referral','ad']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " VisitID,\n", " groupArray(LastTraficSource) as Sources\n", " FROM\n", " (SELECT \n", " ClientID,\n", " DateTime as PurchaseTime, \n", " VisitID\n", " FROM visits_all\n", " WHERE (ClientID != 0) AND (length(PurchaseID) != 0))\n", " ALL LEFT JOIN\n", " (SELECT \n", " ClientID,\n", " LastTraficSource,\n", " DateTime\n", " FROM visits_all\n", " ORDER BY DateTime)\n", " USING ClientID\n", " WHERE PurchaseTime >= DateTime\n", " GROUP BY VisitID\n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "И, в конце концов, посчитаем для каждого пути количество покупок и построим топ." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.86 ms, sys: 4.4 ms, total: 10.3 ms\n", "Wall time: 5.45 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SourcesTotalNumPurchases
0['ad','ad']2200
1['organic','organic']2159
2['organic','ad']1701
3['direct','direct']989
4['ad','ad','ad']903
\n", "
" ], "text/plain": [ " Sources TotalNumPurchases\n", "0 ['ad','ad'] 2200\n", "1 ['organic','organic'] 2159\n", "2 ['organic','ad'] 1701\n", "3 ['direct','direct'] 989\n", "4 ['ad','ad','ad'] 903" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " Sources, \n", " sum(NumPurchases) as TotalNumPurchases\n", " FROM\n", " (SELECT \n", " any(NumPurchases) as NumPurchases,\n", " VisitID,\n", " groupArray(LastTraficSource) as Sources\n", " FROM\n", " (SELECT \n", " length(PurchaseID) as NumPurchases,\n", " ClientID,\n", " DateTime as PurchaseTime, \n", " VisitID\n", " FROM visits_all\n", " WHERE (ClientID != 0) AND (length(PurchaseID) != 0))\n", " ALL LEFT JOIN\n", " (SELECT \n", " ClientID,\n", " LastTraficSource,\n", " DateTime\n", " FROM visits_all\n", " ORDER BY DateTime)\n", " USING ClientID\n", " WHERE PurchaseTime >= DateTime\n", " GROUP BY VisitID)\n", " WHERE length(Sources) >= 2\n", " GROUP BY Sources\n", " ORDER BY TotalNumPurchases DESC\n", " LIMIT 5\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "\n", "%time get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Получившиеся результаты уже достаточно интересны. Но, на самом деле, мы хотим разделять источники на более мелкие группы, а не только на те, которые уже есть в Метрике:\n", "* __CPA__ (по условию `UTMMedium` равно `CPA` или `cpa`)\n", "* __CPC__ (по условию `UTMMedium` равно `cpc`)\n", "* __MarketPlace__ (по условию `UTMMedium` равно `marketplace` или это рекламный трафик с Яндекс.Маркета)\n", "* __Ретаргетинг__ (определяем по `DirectConditionType` или по наличию параметра `ref=ga_ref` в URL)\n", "* __Media__ - остальная реклама\n", "* Хотим объединить Прямые заходы,Переходы с сохраненных страниц и Внутренние переходы в общую группу __Direct__\n", "* Остальные источники оставим, как в Метрике" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LastTraficSourceUTMMediumStartURLLastAdvEngineLastSearchEngineSourceSourceDetailed
0adcpchttp://220-volt.ru/catalog-156860/?ref=yamar&y...marketyandexcpccpc market
1organicNaNhttp://m.220-volt.ru/catalog/svetodiodnyj-zana...NaNyandexorganicorganic yandex
2adcpchttp://220-volt.ru/catalog-210053/?ref=yamar&y...marketyandexcpccpc market
3directNaNhttp://ulyanovsk.220-volt.ru/catalog-59739/NaNNaNdirectdirect
4directNaNhttp://m.220-volt.ru/catalog/2-55/NaNNaNdirectdirect
5organicNaNhttp://omsk.220-volt.ru/catalog/NaNyandexorganicorganic yandex
6adNaNhttp://m.220-volt.ru/catalog-53213/?ref=ga_retgoogle_adwordsNaNmedia_admedia_ad google_adwords
7adNaNhttp://m.220-volt.ru/catalog-220325/?ref=ga_retgoogle_adwordsNaNmedia_admedia_ad google_adwords
8organicNaNhttp://kemerovo.220-volt.ru/catalog/svarochie-...NaNyandexorganicorganic yandex
9organicNaNhttp://220-volt.ru/catalog-136605/NaNgoogleorganicorganic google
\n", "
" ], "text/plain": [ " LastTraficSource UTMMedium \\\n", "0 ad cpc \n", "1 organic NaN \n", "2 ad cpc \n", "3 direct NaN \n", "4 direct NaN \n", "5 organic NaN \n", "6 ad NaN \n", "7 ad NaN \n", "8 organic NaN \n", "9 organic NaN \n", "\n", " StartURL LastAdvEngine \\\n", "0 http://220-volt.ru/catalog-156860/?ref=yamar&y... market \n", "1 http://m.220-volt.ru/catalog/svetodiodnyj-zana... NaN \n", "2 http://220-volt.ru/catalog-210053/?ref=yamar&y... market \n", "3 http://ulyanovsk.220-volt.ru/catalog-59739/ NaN \n", "4 http://m.220-volt.ru/catalog/2-55/ NaN \n", "5 http://omsk.220-volt.ru/catalog/ NaN \n", "6 http://m.220-volt.ru/catalog-53213/?ref=ga_ret google_adwords \n", "7 http://m.220-volt.ru/catalog-220325/?ref=ga_ret google_adwords \n", "8 http://kemerovo.220-volt.ru/catalog/svarochie-... NaN \n", "9 http://220-volt.ru/catalog-136605/ NaN \n", "\n", " LastSearchEngine Source SourceDetailed \n", "0 yandex cpc cpc market \n", "1 yandex organic organic yandex \n", "2 yandex cpc cpc market \n", "3 NaN direct direct \n", "4 NaN direct direct \n", "5 yandex organic organic yandex \n", "6 NaN media_ad media_ad google_adwords \n", "7 NaN media_ad media_ad google_adwords \n", "8 yandex organic organic yandex \n", "9 google organic organic google " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", " SELECT \n", " LastTraficSource,\n", " UTMMedium,\n", " StartURL,\n", " LastAdvEngine, \n", " LastSearchEngine,\n", " if(\n", " lower(UTMMedium) IN ('cpa', 'cpc', 'marketplace'), \n", " lower(UTMMedium),\n", " if(\n", " LastTraficSource = 'ad', \n", " if(\n", " LastAdvEngine = 'market', \n", " 'marketplace',\n", " if(\n", " (DirectConditionType = 2) OR (extractURLParameter(StartURL, 'ref') = 'ga_ref'),\n", " 'retargeting',\n", " 'media_ad')\n", " ),\n", " if(\n", " LastTraficSource IN ('direct', 'saved', 'undefined', 'internal'),\n", " 'direct',\n", " LastTraficSource\n", " ) \n", " )\n", " ) as Source,\n", " if(Source = 'organic', \n", " if(LastSearchEngine IN ('yandex', 'google'), concat('organic ', LastSearchEngine), 'organic other'),\n", " if(Source IN ('media_ad', 'retargeting', 'cpa', 'cpc'),\n", " if(\n", " LastAdvEngine IN ('market', 'google_adwords', 'yandex'), \n", " concat(Source, concat(' ', LastAdvEngine)),\n", " concat(Source, ' other')\n", " ),\n", " Source\n", " )) as SourceDetailed\n", " FROM visits_all\n", " LIMIT 10\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Добавляем в скрипт для расчета путей конверсии определение категорий и получаем пути по новым переопределенным каналам." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.8 ms, sys: 24.6 ms, total: 37.4 ms\n", "Wall time: 13.9 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SourcesDetailedPurchases
0['organic yandex']8532
1['direct']4707
2['organic google']3989
3['cpc market']3690
4['cpa other']2249
5['cpc yandex']1966
6['cpc other']1531
7['organic yandex','organic yandex']1321
8['direct','direct']1164
9['email']1044
\n", "
" ], "text/plain": [ " SourcesDetailed Purchases\n", "0 ['organic yandex'] 8532\n", "1 ['direct'] 4707\n", "2 ['organic google'] 3989\n", "3 ['cpc market'] 3690\n", "4 ['cpa other'] 2249\n", "5 ['cpc yandex'] 1966\n", "6 ['cpc other'] 1531\n", "7 ['organic yandex','organic yandex'] 1321\n", "8 ['direct','direct'] 1164\n", "9 ['email'] 1044" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = '''\n", "SELECT \n", " SourcesDetailed, \n", " sum(VisitPurchases) as Purchases\n", " FROM\n", " (SELECT \n", " groupArray(Source) as Sources,\n", " groupArray(SourceDetailed) as SourcesDetailed,\n", " VisitID,\n", " any(ClientID) as ClientID,\n", " any(DateTime) as StartTime,\n", " any(VisitRevenue) as VisitRevenue,\n", " any(VisitPurchases) as VisitPurchases\n", " FROM\n", " (SELECT\n", " ClientID,\n", " DateTime,\n", " VisitRevenue,\n", " VisitPurchases,\n", " Source,\n", " SourceDetailed,\n", " VisitID,\n", " LastSearchEngine,\n", " LastAdvEngine\n", " FROM\n", " (SELECT\n", " ClientID,\n", " DateTime as PurchaseTime,\n", " arraySum(PurchaseRevenue) as VisitRevenue,\n", " length(PurchaseID) as VisitPurchases,\n", " LastTraficSource,\n", " VisitID,\n", " LastSearchEngine,\n", " LastAdvEngine\n", " FROM visits_all \n", " WHERE ClientID != 0 AND VisitPurchases != 0)\n", " ALL LEFT JOIN\n", " (SELECT\n", " ClientID,\n", " DateTime,\n", " if(\n", " lower(UTMMedium) IN ('cpa', 'cpc', 'marketplace'), \n", " lower(UTMMedium),\n", " if(\n", " LastTraficSource = 'ad', \n", " if(\n", " LastAdvEngine = 'market', \n", " 'marketplace',\n", " if(\n", " (DirectConditionType = 2) OR (extractURLParameter(StartURL, 'ref') = 'ga_ref'),\n", " 'retargeting',\n", " 'media_ad')\n", " ),\n", " if(\n", " LastTraficSource IN ('direct', 'saved', 'undefined', 'internal'),\n", " 'direct',\n", " LastTraficSource\n", " ) \n", " )\n", " ) as Source,\n", " if(Source = 'organic', \n", " if(LastSearchEngine IN ('yandex', 'google'), concat('organic ', LastSearchEngine), 'organic other'),\n", " if(Source IN ('media_ad', 'retargeting', 'cpa', 'cpc'),\n", " if(\n", " LastAdvEngine IN ('market', 'google_adwords', 'yandex'), \n", " concat(Source, concat(' ', LastAdvEngine)),\n", " concat(Source, ' other')\n", " ),\n", " Source\n", " )) as SourceDetailed\n", " FROM visits_all \n", " ORDER BY DateTime)\n", " USING ClientID\n", " WHERE PurchaseTime >= DateTime)\n", " GROUP BY VisitID)\n", " GROUP BY SourcesDetailed\n", " ORDER BY Purchases DESC\n", " LIMIT 10\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "\n", "%time get_clickhouse_df(q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Модели аттрибуции" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Модель по последнему переходу" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.8 ms, sys: 8.91 ms, total: 23.7 ms\n", "Wall time: 4.93 s\n" ] } ], "source": [ "q = '''\n", " SELECT \n", " count() as Visits,\n", " sum(arraySum(PurchaseRevenue)) as Revenue,\n", " sum(length(PurchaseID)) as Purchases,\n", " if(\n", " lower(UTMMedium) IN ('cpa', 'cpc', 'marketplace'), \n", " lower(UTMMedium),\n", " if(\n", " LastTraficSource = 'ad', \n", " if(\n", " LastAdvEngine = 'market', \n", " 'marketplace',\n", " if(\n", " (DirectConditionType = 2) OR (extractURLParameter(StartURL, 'ref') = 'ga_ref'),\n", " 'retargeting',\n", " 'media_ad')\n", " ),\n", " if(\n", " LastTraficSource IN ('direct', 'saved', 
'undefined', 'internal'),\n", " 'direct',\n", " LastTraficSource\n", " ) \n", " )\n", " ) as Source,\n", " if(Source = 'organic', \n", " if(LastSearchEngine IN ('yandex', 'google'), concat('organic ', LastSearchEngine), 'organic other'),\n", " if(Source IN ('media_ad', 'retargeting', 'cpa', 'cpc'),\n", " if(\n", " LastAdvEngine IN ('market', 'google_adwords', 'yandex'), \n", " concat(Source, concat(' ', LastAdvEngine)),\n", " concat(Source, ' other')\n", " ),\n", " Source\n", " )) as SourceDetailed\n", " FROM visits_all\n", " WHERE ClientID != 0\n", " GROUP BY Source, SourceDetailed\n", " FORMAT TabSeparatedWithNames\n", "'''\n", "\n", "%time last_raw_df = get_clickhouse_df(q).groupby(['Source', 'SourceDetailed']).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Note:__ Значение `Revenue` были изменены" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VisitsRevenuePurchases
SourceSourceDetailed
cpacpa google_adwords814419420163179
cpa market29916369209
cpa other3245885330802715445
cpa yandex4565691638166
cpccpc google_adwords21521396399281
cpc market3436679879630797947
cpc other1903934000878493279
cpc yandex6703748126467865528
directdirect882305200577749815712
emailemail1306862520444112526
marketplacemarketplace503313124252192
media_admedia_ad google_adwords3607333016558902685
media_ad other5880460090160
media_ad yandex7148540893931461
organicorganic google6014345699598905657
organic other5671297028874728
organic yandex1214740146655288913453
referralreferral68278118097982938
retargetingretargeting yandex4318706875
socialsocial7348334991340334
\n", "
" ], "text/plain": [ " Visits Revenue Purchases\n", "Source SourceDetailed \n", "cpa cpa google_adwords 8144 19420163 179\n", " cpa market 299 1636920 9\n", " cpa other 324588 533080271 5445\n", " cpa yandex 4565 6916381 66\n", "cpc cpc google_adwords 2152 13963992 81\n", " cpc market 343667 987963079 7947\n", " cpc other 190393 400087849 3279\n", " cpc yandex 670374 812646786 5528\n", "direct direct 882305 2005777498 15712\n", "email email 130686 252044411 2526\n", "marketplace marketplace 5033 13124252 192\n", "media_ad media_ad google_adwords 360733 301655890 2685\n", " media_ad other 5880 4600901 60\n", " media_ad yandex 71485 40893931 461\n", "organic organic google 601434 569959890 5657\n", " organic other 56712 97028874 728\n", " organic yandex 1214740 1466552889 13453\n", "referral referral 68278 118097982 938\n", "retargeting retargeting yandex 431 870687 5\n", "social social 73483 34991340 334" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "last_raw_df" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "last_raw_df_norm = last_raw_df.apply(lambda x: 100.*x/last_raw_df.sum(), axis = 1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VisitsRevenuePurchases
SourceSourceDetailed
cpacpa google_adwords0.1623800.2528230.274182
cpa market0.0059620.0213100.013786
cpa other6.4718506.9399628.340354
cpa yandex0.0910200.0900420.101095
cpccpc google_adwords0.0429080.1817920.124071
cpc market6.85226012.86190212.172781
cpc other3.7961815.2085865.022593
cpc yandex13.36636010.5795288.467489
directdirect17.59198026.11242724.066784
emailemail2.6057043.2812673.869189
marketplacemarketplace0.1003510.1708590.294095
media_admedia_ad google_adwords7.1925333.9271394.112736
media_ad other0.1172390.0598970.091905
media_ad yandex1.4253150.5323820.706135
organicorganic google11.9917887.4200838.665084
organic other1.1307611.2631811.115111
organic yandex24.22028919.09247420.606571
referralreferral1.3613721.5374711.436777
retargetingretargeting yandex0.0085940.0113350.007659
socialsocial1.4651530.4555380.511603
\n", "
" ], "text/plain": [ " Visits Revenue Purchases\n", "Source SourceDetailed \n", "cpa cpa google_adwords 0.162380 0.252823 0.274182\n", " cpa market 0.005962 0.021310 0.013786\n", " cpa other 6.471850 6.939962 8.340354\n", " cpa yandex 0.091020 0.090042 0.101095\n", "cpc cpc google_adwords 0.042908 0.181792 0.124071\n", " cpc market 6.852260 12.861902 12.172781\n", " cpc other 3.796181 5.208586 5.022593\n", " cpc yandex 13.366360 10.579528 8.467489\n", "direct direct 17.591980 26.112427 24.066784\n", "email email 2.605704 3.281267 3.869189\n", "marketplace marketplace 0.100351 0.170859 0.294095\n", "media_ad media_ad google_adwords 7.192533 3.927139 4.112736\n", " media_ad other 0.117239 0.059897 0.091905\n", " media_ad yandex 1.425315 0.532382 0.706135\n", "organic organic google 11.991788 7.420083 8.665084\n", " organic other 1.130761 1.263181 1.115111\n", " organic yandex 24.220289 19.092474 20.606571\n", "referral referral 1.361372 1.537471 1.436777\n", "retargeting retargeting yandex 0.008594 0.011335 0.007659\n", "social social 1.465153 0.455538 0.511603" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "last_raw_df_norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Кастомная модель c перевзвешенными каналами и затуханием по времени" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "traffic_sources_coeffs = {\n", " 'direct': 0.1,\n", " 'referral': 0.2,\n", " 'social': 0.3,\n", " 'cpa': 0.5,\n", " 'cpc': 0.5,\n", " 'retargeting': 0.3,\n", " 'organic': 0.4,\n", " 'email': 0.4,\n", " 'media_ad': 0.5,\n", " 'marketplace': 0.5\n", "}" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_attribution_data_pos_sample(i, N, traffic_sources_coeffs, where_condition = ''):\n", " q = '''\n", " SELECT \n", " sum(VisitRevenue*SourceCoefNorm) as Revenue,\n", " sum(VisitPurchases*SourceCoefNorm) as Purchases,\n", " sum(SourceCoefNorm) as Visits,\n", " Source,\n", " SourceDetailed\n", " FROM\n", " (SELECT \n", " groupArray(Source) as Sources,\n", " groupArray(SourceDetailed) as SourcesDetailed,\n", " VisitID,\n", " any(ClientID) as ClientID,\n", " any(DateTime) as StartTime,\n", " any(VisitRevenue) as VisitRevenue,\n", " any(VisitPurchases) as VisitPurchases,\n", " arrayMap(x -> transform(x, [{traffic_sources}], [{traffic_sources_coeffs}], 0), Sources) as SourcesRawCoefs,\n", " arrayMap(x -> 1/(length(SourcesRawCoefs) - x), range(length(SourcesRawCoefs))) as PositionCoefs,\n", " arrayMap(x, y -> x * y, SourcesRawCoefs, PositionCoefs) as SourcesCoefs,\n", " arraySum(SourcesCoefs) as SourcesCoefsSum,\n", " arrayMap(x -> x/SourcesCoefsSum, SourcesCoefs) as SourcesCoefsNorm\n", " FROM\n", " (SELECT\n", " ClientID,\n", " DateTime,\n", " VisitRevenue,\n", " VisitPurchases,\n", " SourceTime,\n", " Source,\n", " SourceDetailed,\n", " VisitID,\n", " LastSearchEngine,\n", " LastAdvEngine\n", " FROM\n", " (SELECT\n", " ClientID,\n", " DateTime,\n", " arraySum(PurchaseRevenue) as VisitRevenue,\n", " length(PurchaseID) as VisitPurchases,\n", " LastTraficSource,\n", " VisitID,\n", " LastSearchEngine,\n", " LastAdvEngine\n", " FROM visits_all SAMPLE 1/{N} OFFSET {i}/{N}\n", " WHERE ClientID != 0)\n", " ALL LEFT JOIN\n", " (SELECT\n", " ClientID,\n", " DateTime as SourceTime,\n", " if(\n", " lower(UTMMedium) IN ('cpa', 'cpc', 'marketplace'), \n", " lower(UTMMedium),\n", " if(\n", " LastTraficSource = 'ad', \n", " if(\n", " 
LastAdvEngine = 'market', \n", " 'marketplace',\n", " if(\n", " (DirectConditionType = 2) OR (extractURLParameter(StartURL, 'ref') = 'ga_ref'),\n", " 'retargeting',\n", " 'media_ad')\n", " ),\n", " if(\n", " LastTraficSource IN ('direct', 'saved', 'undefined', 'internal'),\n", " 'direct',\n", " LastTraficSource\n", " ) \n", " )\n", " ) as Source,\n", " if(Source = 'organic', \n", " if(LastSearchEngine IN ('yandex', 'google'), concat('organic ', LastSearchEngine), 'organic other'),\n", " if(Source IN ('media_ad', 'retargeting', 'cpa', 'cpc'),\n", " if(\n", " LastAdvEngine IN ('market', 'google_adwords', 'yandex'), \n", " concat(Source, concat(' ', LastAdvEngine)),\n", " concat(Source, ' other')\n", " ),\n", " Source\n", " )) as SourceDetailed\n", " FROM visits_all SAMPLE 1/{N} OFFSET {i}/{N} {where_condition})\n", " USING ClientID\n", " WHERE SourceTime <= DateTime\n", " ORDER BY SourceTime)\n", " GROUP BY VisitID)\n", " ARRAY JOIN\n", " SourcesCoefsNorm as SourceCoefNorm,\n", " Sources as Source,\n", " SourcesDetailed as SourceDetailed\n", " GROUP BY Source, SourceDetailed\n", " ORDER BY Visits DESC\n", " FORMAT TabSeparatedWithNames\n", " '''.format(\n", " traffic_sources = ', '.join(map(lambda x: \"'%s'\" % x, traffic_sources_coeffs.keys())),\n", " traffic_sources_coeffs = ', '.join(map(str, traffic_sources_coeffs.values())),\n", " i = i, N = N,\n", " where_condition = where_condition\n", " )\n", " # print q\n", " return get_clickhouse_df(q)\n", "\n", "def get_attribution_data_pos(traffic_sources_coeffs, where_condition=''):\n", " tmp_dfs = []\n", " for i in range(10):\n", " tmp_dfs.append(get_attribution_data_pos_sample(i, 10, traffic_sources_coeffs, where_condition))\n", " \n", " return pd.concat(tmp_dfs).groupby(['Source', 'SourceDetailed']).sum()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 104 ms, sys: 58.8 ms, total: 163 ms\n", "Wall time: 45.7 s\n" ] } ], "source": [ "%time not_bounce_pos_df = get_attribution_data_pos(traffic_sources_coeffs, where_condition = 'WHERE IsBounce = 0')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "not_bounce_pos_df_norm = not_bounce_pos_df.apply(lambda x: 100.*x/not_bounce_pos_df.sum(), axis = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Кастомная модель с поведением" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Напишем базовую функцию для расчета модели атрибуции:\n", "* будем учитывать только источники без отказа\n", "* источники, в которых была достигнута цель \"Добавление в корзину\" будет иметь вес на 50% обычного\n", "* источники визитов, в которых были просмотры товаров будут иметь вес 10% больше обычного" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_attribution_data_sample(i, N, where_condition = ''):\n", " q = '''\n", " SELECT \n", " sum(VisitRevenue*SourceCoefNorm) as Revenue,\n", " sum(VisitPurchases*SourceCoefNorm) as Purchases,\n", " sum(SourceCoefNorm) as Visits,\n", " Source,\n", " SourceDetailed\n", " FROM\n", " (SELECT \n", " groupArray(Source) as Sources,\n", " groupArray(SourceDetailed) as SourcesDetailed,\n", " groupArray(HasCart) as HasCarts,\n", " groupArray(HasImpression) as HasImpressions,\n", " VisitID,\n", " any(ClientID) as ClientID,\n", " any(DateTime) as StartTime,\n", " any(VisitRevenue) as VisitRevenue,\n", " 
any(VisitPurchases) as VisitPurchases,\n", " arrayMap(x, y -> 1 + 0.5*x + 0.1*y, HasCarts, HasImpressions) as SourcesCoefs,\n", " arraySum(SourcesCoefs) as SourcesCoefsSum,\n", " arrayMap(x -> x/SourcesCoefsSum, SourcesCoefs) as SourcesCoefsNorm\n", " FROM\n", " (SELECT\n", " ClientID,\n", " DateTime,\n", " VisitRevenue,\n", " VisitPurchases,\n", " SourceTime,\n", " Source,\n", " HasCart,\n", " HasImpression,\n", " SourceDetailed,\n", " VisitID,\n", " LastSearchEngine,\n", " LastAdvEngine\n", " FROM\n", " (SELECT\n", " ClientID,\n", " DateTime,\n", " arraySum(PurchaseRevenue) as VisitRevenue,\n", " length(PurchaseID) as VisitPurchases,\n", " LastTraficSource,\n", " VisitID,\n", " LastSearchEngine,\n", " LastAdvEngine\n", " FROM visits_all SAMPLE 1/{N} OFFSET {i}/{N}\n", " WHERE ClientID != 0)\n", " ALL LEFT JOIN\n", " (SELECT\n", " ClientID,\n", " DateTime as SourceTime,\n", " if(\n", " lower(UTMMedium) IN ('cpa', 'cpc', 'marketplace'), \n", " lower(UTMMedium),\n", " if(\n", " LastTraficSource = 'ad', \n", " if(\n", " LastAdvEngine = 'market', \n", " 'marketplace',\n", " if(\n", " (DirectConditionType = 2) OR (extractURLParameter(StartURL, 'ref') = 'ga_ref'),\n", " 'retargeting',\n", " 'media_ad')\n", " ),\n", " if(\n", " LastTraficSource IN ('direct', 'saved', 'undefined', 'internal'),\n", " 'direct',\n", " LastTraficSource\n", " ) \n", " )\n", " ) as Source,\n", " if(Source = 'organic', \n", " if(LastSearchEngine IN ('yandex', 'google'), concat('organic ', LastSearchEngine), 'organic other'),\n", " if(Source IN ('media_ad', 'retargeting', 'cpa', 'cpc'),\n", " if(\n", " LastAdvEngine IN ('market', 'google_adwords', 'yandex'), \n", " concat(Source, concat(' ', LastAdvEngine)),\n", " concat(Source, ' other')\n", " ),\n", " Source\n", " )) as SourceDetailed,\n", " has(GoalsID, 552829) as HasCart,\n", " length(ImpressionsProductID) > 0 as HasImpression\n", " FROM visits_all SAMPLE 1/{N} OFFSET {i}/{N} {where_condition}\n", " )\n", " USING ClientID\n", " WHERE SourceTime <= DateTime\n", " ORDER BY SourceTime)\n", " GROUP BY VisitID)\n", " ARRAY JOIN\n", " SourcesCoefsNorm as SourceCoefNorm,\n", " Sources as Source,\n", " SourcesDetailed as SourceDetailed\n", " GROUP BY Source, SourceDetailed\n", " ORDER BY Visits DESC\n", " FORMAT TabSeparatedWithNames\n", " '''.format(\n", " i = i, N = N,\n", " where_condition = where_condition\n", " )\n", " # print q\n", " return get_clickhouse_df(q)\n", "\n", "def get_attribution_data(where_condition=''):\n", " tmp_dfs = []\n", " for i in range(10):\n", " tmp_dfs.append(get_attribution_data_sample(i, 10, where_condition))\n", " \n", " return pd.concat(tmp_dfs).groupby(['Source', 'SourceDetailed']).sum()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 105 ms, sys: 46.1 ms, total: 151 ms\n", "Wall time: 1min 1s\n" ] } ], "source": [ "%time not_bounce_df = get_attribution_data(where_condition = 'WHERE IsBounce = 0')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "not_bounce_df_norm = not_bounce_df.apply(lambda x: 100.*x/not_bounce_df.sum(), axis = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Сравним получившиеся модели\n", "Напишем еще одну функцию для построения bar-chart'ов." 
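,
    "\n",
    "Before plotting the aggregates, here is a tiny pure-Python sketch that contrasts the models on a single invented visit path. It only illustrates the weighting logic of the ClickHouse queries above; the channel names, the goals and the 1000 of revenue are made up for the example:\n",
    "\n",
    "```python\n",
    "# Toy path of one client before a purchase: (source, has 'add to cart' goal, has product impressions).\n",
    "# All numbers below are invented purely for illustration.\n",
    "path = [('organic yandex', 0, 1), ('cpc yandex', 1, 1), ('direct', 0, 0)]\n",
    "revenue = 1000.\n",
    "\n",
    "# Last click: all revenue goes to the final source.\n",
    "last_click = {s: (revenue if i == len(path) - 1 else 0.) for i, (s, _, _) in enumerate(path)}\n",
    "\n",
    "# Behavioural model: weight 1 + 0.5*HasCart + 0.1*HasImpression, normalized within the path.\n",
    "w_beh = [1 + 0.5*cart + 0.1*impr for _, cart, impr in path]\n",
    "behaviour = {s: revenue*w/sum(w_beh) for (s, _, _), w in zip(path, w_beh)}\n",
    "\n",
    "# Reweighted channels with time decay: channel coefficient divided by (path length - position), normalized.\n",
    "coeffs = {'organic': 0.4, 'cpc': 0.5, 'direct': 0.1}  # same spirit as traffic_sources_coeffs\n",
    "w_pos = [coeffs[s.split(' ')[0]]/(len(path) - i) for i, (s, _, _) in enumerate(path)]\n",
    "time_decay = {s: revenue*w/sum(w_pos) for (s, _, _), w in zip(path, w_pos)}\n",
    "\n",
    "# direct keeps 100% under last click, but only ~27% (behaviour) and ~21% (time decay) here.\n",
    "print last_click\n",
    "print behaviour\n",
    "print time_decay\n",
    "```"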
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def show_bar_plot(cmp_df):\n", " data = []\n", " for column in cmp_df.columns:\n", " trace = go.Bar(\n", " x = cmp_df.index.values,\n", " y = cmp_df[column].values,\n", " name = column\n", " )\n", " data.append(trace)\n", " layout = go.Layout({'xaxis': {'tickangle': 45}})\n", " fig = go.Figure(data = data, layout = layout)\n", " iplot(fig, show_link=False)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dct = {\n", " 'last click': last_raw_df_norm,\n", " 'behaviour': not_bounce_df_norm,\n", " 'time decay & weights': not_bounce_pos_df_norm,\n", "}" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_comparison(parameter, detalization, show_delta = False):\n", " cmp_df = pd.DataFrame()\n", " for item in dct:\n", " cmp_df[item] = dct[item].reset_index().groupby(detalization)[parameter].sum()\n", " cmp_df = cmp_df.sort_values('last click', ascending = False)\n", " if not show_delta:\n", " return cmp_df\n", " cmp_df_norm = cmp_df.apply(lambda x: x - cmp_df['last click']).drop('last click', axis = 1).sort_values('behaviour')\n", " return cmp_df_norm" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_bar_plot(get_comparison('Revenue', 'Source', show_delta=False))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_bar_plot(get_comparison('Revenue', 'Source', show_delta=True))" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 1 }