{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6a870412-5c24-41f7-9474-cb8db570532e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57cca796-50a5-4370-9855-c6d0754f3ae7",
   "metadata": {},
   "source": [
    "# Google Analytics Analysis + Pipeline Example\n",
    "\n",
    "In this notebook, we will be analyzing the [open bigquery google analytics dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=hacker_news&page=dataset&project=studio-ai-379107&ws=!1m5!1m4!4m3!1sbigquery-public-data!2sgoogle_analytics_sample!3sga_sessions_20170801). \n",
    "\n",
    "Objectives:\n",
    "- Demonstrate capability to process complex data sources\n",
    "- Demonstrate capability to convert the analysis into a pipeline which can be reran over time with little effort."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03880406-4d03-4eb3-9ce8-8ea3b8c8cc3e",
   "metadata": {},
   "source": [
    "## Note on Privacy & Security\n",
    "\n",
    "Privacy of your data is of primte importance. This library has been specifically designed to NOT share any part of your data with the Genie APIs. Just the metadata about your data like name and types of columns of a pandas dataframe would be shared, which help in generating high quality results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73478f0c-79d6-4481-bded-3231fde41b7a",
   "metadata": {
    "tags": []
   },
   "source": [
    "## setup "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61e6b4fd-5f6c-4b82-a23f-13a4221c0e2d",
   "metadata": {
    "tags": []
   },
   "source": [
    "### access token\n",
    "\n",
    "Don't have an access token yet? [Sign up for free!](https://www.thismlguy.com/)\n",
    "\n",
    "Store your access token in a `.env` file with the same format as shown in the [.env.example file](https://github.com/thismlguy/code-genie/blob/main/docs/notebooks/.env.example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1dad8c94-ace3-43c1-b44a-58e1b4f320f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "774eea8f-453c-4e68-9cef-173edf0a5fb0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "load_dotenv(\".env\")  # replace with path to your env file"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3e8af5b-0a47-4c92-9096-642624429490",
   "metadata": {},
   "source": [
    "We're using dotenv library here, alternatively you can simply set the environment variable as:\n",
    "```\n",
    "import os\n",
    "os.environ[\"CODE_GENIE_TOKEN\"] = \"your-access-token-here\"\n",
    "```\n",
    "\n",
    "Though it is not recommended to do so as notebooks are typically shared with others and you should keep your access token private."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61cf4264-5bd2-49fc-a539-6191470833f0",
   "metadata": {},
   "source": [
    "### genie cache\n",
    "\n",
    "By default the package will cache genie invocations in a temp file. This would mean that any genies created will be lost once you restart the kernel or rerun the notebook at another time. to keep the cached genies, you can set a custom path where genies would be stored so that you will not lose them when you rerun the notebook.\n",
    "\n",
    "You can modify the cached code if you'd like and those modifications would be loaded the next time you run the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c97d85d8-b883-418d-9522-5e762d262772",
   "metadata": {},
   "outputs": [],
   "source": [
    "CACHE_DIR = \"./_cache_google_analytics\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fe2d3a6-e9ca-4c19-8a75-40943d1af699",
   "metadata": {},
   "source": [
    "## Analysis objective\n",
    "\n",
    "We will use the [google analytics dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=hacker_news&page=dataset&project=studio-ai-379107&ws=!1m5!1m4!4m3!1sbigquery-public-data!2sgoogle_analytics_sample!3sga_sessions_20170801) and create a pipeline which can be run once every week and generate stats for the given week. We will create 2 tables containing following information:\n",
    "\n",
    "1. Page level bounce and exit rate stats\n",
    "2. User funnel stats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1799e34d-530b-4010-b5d1-5d4758dd0c69",
   "metadata": {},
   "source": [
    "## Load Data\n",
    "\n",
    "Lets load 1 week of data using bigquery from July 1 - July 7 2017"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9b23345e-8057-4563-b277-25f233bc0325",
   "metadata": {},
   "outputs": [],
   "source": [
    "start_date = \"20170701\"\n",
    "end_date = \"20170707\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3cda42e4-5e88-4098-94a0-e37e13ba114f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import bigquery\n",
    "client = bigquery.Client()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "77f7b0e7-5923-45bb-b051-bcf929c5ff35",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(15920, 16)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = f\"\"\"\n",
    "select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n",
    "where _TABLE_SUFFIX between '{start_date}' and '{end_date}'\n",
    "\"\"\".format(start_date=start_date, end_date=end_date)\n",
    "df = client.query(query).result().to_dataframe()\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ac97953c-619e-4fe6-93e0-4138b5c0b57e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visitorId</th>\n",
       "      <th>visitNumber</th>\n",
       "      <th>visitId</th>\n",
       "      <th>visitStartTime</th>\n",
       "      <th>date</th>\n",
       "      <th>totals</th>\n",
       "      <th>trafficSource</th>\n",
       "      <th>device</th>\n",
       "      <th>geoNetwork</th>\n",
       "      <th>customDimensions</th>\n",
       "      <th>hits</th>\n",
       "      <th>fullVisitorId</th>\n",
       "      <th>userId</th>\n",
       "      <th>clientId</th>\n",
       "      <th>channelGrouping</th>\n",
       "      <th>socialEngagementType</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1</td>\n",
       "      <td>1499117293</td>\n",
       "      <td>1499117293</td>\n",
       "      <td>20170703</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "      <td>{'referralPath': None, 'campaign': '(not set)'...</td>\n",
       "      <td>{'browser': 'Safari', 'browserVersion': 'not a...</td>\n",
       "      <td>{'continent': 'Americas', 'subContinent': 'Cen...</td>\n",
       "      <td>[{'index': 4, 'value': 'Central America'}]</td>\n",
       "      <td>[{'hitNumber': 1, 'time': 0, 'hour': 14, 'minu...</td>\n",
       "      <td>1018154947831642966</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>Organic Search</td>\n",
       "      <td>Not Socially Engaged</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1</td>\n",
       "      <td>1499132474</td>\n",
       "      <td>1499132474</td>\n",
       "      <td>20170703</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 10, 'ti...</td>\n",
       "      <td>{'referralPath': None, 'campaign': '(not set)'...</td>\n",
       "      <td>{'browser': 'Chrome', 'browserVersion': 'not a...</td>\n",
       "      <td>{'continent': 'Americas', 'subContinent': 'Nor...</td>\n",
       "      <td>[{'index': 4, 'value': 'North America'}]</td>\n",
       "      <td>[{'hitNumber': 1, 'time': 0, 'hour': 18, 'minu...</td>\n",
       "      <td>1770622829105680991</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>Organic Search</td>\n",
       "      <td>Not Socially Engaged</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   visitorId  visitNumber     visitId  visitStartTime      date  \\\n",
       "0       <NA>            1  1499117293      1499117293  20170703   \n",
       "1       <NA>            1  1499132474      1499132474  20170703   \n",
       "\n",
       "                                              totals  \\\n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...   \n",
       "1  {'visits': 1, 'hits': 16, 'pageviews': 10, 'ti...   \n",
       "\n",
       "                                       trafficSource  \\\n",
       "0  {'referralPath': None, 'campaign': '(not set)'...   \n",
       "1  {'referralPath': None, 'campaign': '(not set)'...   \n",
       "\n",
       "                                              device  \\\n",
       "0  {'browser': 'Safari', 'browserVersion': 'not a...   \n",
       "1  {'browser': 'Chrome', 'browserVersion': 'not a...   \n",
       "\n",
       "                                          geoNetwork  \\\n",
       "0  {'continent': 'Americas', 'subContinent': 'Cen...   \n",
       "1  {'continent': 'Americas', 'subContinent': 'Nor...   \n",
       "\n",
       "                             customDimensions  \\\n",
       "0  [{'index': 4, 'value': 'Central America'}]   \n",
       "1    [{'index': 4, 'value': 'North America'}]   \n",
       "\n",
       "                                                hits        fullVisitorId  \\\n",
       "0  [{'hitNumber': 1, 'time': 0, 'hour': 14, 'minu...  1018154947831642966   \n",
       "1  [{'hitNumber': 1, 'time': 0, 'hour': 18, 'minu...  1770622829105680991   \n",
       "\n",
       "  userId clientId channelGrouping  socialEngagementType  \n",
       "0   None     None  Organic Search  Not Socially Engaged  \n",
       "1   None     None  Organic Search  Not Socially Engaged  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79f82c1f-aa5b-4da0-abaa-43e58d16d007",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Extract information\n",
    "\n",
    "In order to perform the given analysis, we need to extract the following nested fields into their own columns:\n",
    "- hits.page.pagePath: landing_page\n",
    "- totals.bounces: bounces\n",
    "- hits.type: hit_type\n",
    "- hits.hitNumber: hit_number\n",
    "- hits.eCommerceAction.action_type: action_type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4d69b711-13a4-44f5-8f70-9befbe405bd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from code_genie import Genie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "a800a4c0-4e7f-4548-83dc-5aec2d364997",
   "metadata": {},
   "outputs": [],
   "source": [
    "genie = Genie(data=df, cache_dir=CACHE_DIR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8deab2fa-d7f4-4cb3-b9ef-bcf0e2d59065",
   "metadata": {},
   "source": [
    "The hits column contains multiple values in a single row, we need to convert that into separate rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4d4fe1e-6a8f-405f-8674-88aac07df3a2",
   "metadata": {},
   "source": [
    "### expand hits into separate rows\n",
    "\n",
    "Note that update_base_input=True means that the data in the genie is update and next genie invocations will use this data as input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a51ba449-e4a1-4cf6-80db-0f2eeab68440",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: separate_hits_70651, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_hits_expanded = genie.plz(\"\"\"\n",
    "each value in hits column contains a list. create a new dataframe where a separate row exists for each value. \n",
    "copy the totals columns to the new df, keep the same value for each item in the list. drop all other columns except hits and totals.\n",
    "\"\"\", update_base_input=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "38bc93b0-332d-41cc-81e7-dff30aa96708",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hits</th>\n",
       "      <th>totals</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'hitNumber': 1, 'time': 0, 'hour': 14, 'minut...</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'hitNumber': 2, 'time': 56142, 'hour': 14, 'm...</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                hits  \\\n",
       "0  {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut...   \n",
       "0  {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm...   \n",
       "\n",
       "                                              totals  \n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...  \n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gr_hits_expanded.result.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a86bfdfc-63a4-40a6-aabf-3ebf7711e2d4",
   "metadata": {},
   "source": [
    "### extract page path column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d8fda803-5169-4d9f-85dd-fd87851188bc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: extract_page_path_78866, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_page_path = genie.plz(\"\"\"\n",
    "each value of hits is a dictionary with page as a key. the value of page is also a dictionary with pagePath as a key. \n",
    "extract pagePath into a new column called page_path\n",
    "\"\"\", update_base_input=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "382f7d14-3377-4dc3-8641-6ea1aff58888",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hits</th>\n",
       "      <th>totals</th>\n",
       "      <th>page_path</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'hitNumber': 1, 'time': 0, 'hour': 14, 'minut...</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "      <td>/home</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'hitNumber': 2, 'time': 56142, 'hour': 14, 'm...</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "      <td>/home</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                hits  \\\n",
       "0  {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut...   \n",
       "0  {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm...   \n",
       "\n",
       "                                              totals page_path  \n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...     /home  \n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...     /home  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gr_page_path.result.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53546815-2ffc-4d5c-9794-de1113f31d8e",
   "metadata": {},
   "source": [
    "### extract hit_type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "4f48dc19-7ed6-4b2d-af2e-7bbe10293bfc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Genie cached with id: run_51794\n"
     ]
    }
   ],
   "source": [
    "gr_hit_type = genie.custom(\"\"\"\n",
    "def run(df):\n",
    "    df[\"hit_type\"] = df[\"hits\"].apply(lambda x: x[\"type\"])\n",
    "    return df\n",
    "\"\"\", update_base_input=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2499c8e7-8c22-481c-a927-d832509ecf5e",
   "metadata": {},
   "source": [
    "### extract hit_number"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "23185b2c-9437-4351-ab1b-90fd02cd3779",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: extract_hit_number_77980, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_hit_number = genie.plz(\"\"\"\n",
    "each value of hits is a dictionary with hitNumber as a key. extract type into a new column called hit_number\n",
    "\"\"\", update_base_input=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d5cab7f-94fd-4665-aaf1-0531d657a89d",
   "metadata": {},
   "source": [
    "Let's see another example of using a custom genie"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d117b49d-35be-402a-ade2-e64fb41552de",
   "metadata": {},
   "source": [
    "### extract action_type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "0966d88e-9a74-45d5-a768-99c6e52d6ebb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Genie cached with id: run_20093\n"
     ]
    }
   ],
   "source": [
    "gr_action_type = genie.custom(\"\"\"\n",
    "def run(df):\n",
    "    df[\"action_type\"] = df[\"hits\"].apply(lambda x: int(x[\"eCommerceAction\"][\"action_type\"]))\n",
    "    return df\n",
    "\"\"\", update_base_input=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "186e4e9c-ac06-4d03-8278-574f2e53a14c",
   "metadata": {},
   "source": [
    "### extract bounce"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "2817a2b5-71ec-4722-8ffd-2096c71f4327",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: add_bounces_column_93508, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_bounce = genie.plz(\"\"\"\n",
    "each value of totals is a dictionary. create a new boolean column called bounces which is True if the key bounces in totals has a value of 1\n",
    "\"\"\", update_base_input=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "47034ff0-5927-4cc9-909d-89e37e9c0585",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>hits</th>\n",
       "      <th>totals</th>\n",
       "      <th>page_path</th>\n",
       "      <th>hit_type</th>\n",
       "      <th>hit_number</th>\n",
       "      <th>action_type</th>\n",
       "      <th>bounces</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'hitNumber': 1, 'time': 0, 'hour': 14, 'minut...</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "      <td>/home</td>\n",
       "      <td>PAGE</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'hitNumber': 2, 'time': 56142, 'hour': 14, 'm...</td>\n",
       "      <td>{'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...</td>\n",
       "      <td>/home</td>\n",
       "      <td>PAGE</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                hits  \\\n",
       "0  {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut...   \n",
       "0  {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm...   \n",
       "\n",
       "                                              totals page_path hit_type  \\\n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...     /home     PAGE   \n",
       "0  {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...     /home     PAGE   \n",
       "\n",
       "   hit_number  action_type  bounces  \n",
       "0           1            0    False  \n",
       "0           2            0    False  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gr_bounce.result.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f919ea60-c817-499f-8130-65b37a3b4afc",
   "metadata": {},
   "source": [
    "## Aggregate data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51aac07d-1e67-41b3-94e2-dbc018e406df",
   "metadata": {},
   "source": [
    "### landing page bounce rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "be65cb05-db7f-4604-8538-bf979df61fed",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: get_page_bounce_rate_81411, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_landing_page = genie.plz([\n",
    "    \"filter data for hit_number as 1 and hit_type as PAGE\",\n",
    "    \"\"\"group data by page_path, create 2 aggregate columns:\n",
    "views: total number of rows in the group\n",
    "bounce_rate: fraction of rows in the group where bounces column is True\n",
    "\"\"\",\n",
    "    \"sort the data by views descending\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "fca5d54c-4ed7-484f-9e54-0cd029706060",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>views</th>\n",
       "      <th>bounce_rate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page_path</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>/home</th>\n",
       "      <td>7698</td>\n",
       "      <td>0.489738</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/google+redesign/shop+by+brand/youtube</th>\n",
       "      <td>3911</td>\n",
       "      <td>0.629507</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/google+redesign/shop+by+brand/waze+baby+on+board+window+decal.axd</th>\n",
       "      <td>681</td>\n",
       "      <td>0.654919</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/google+redesign/apparel/mens/mens+t+shirts</th>\n",
       "      <td>585</td>\n",
       "      <td>0.482051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/signin.html</th>\n",
       "      <td>225</td>\n",
       "      <td>0.342222</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    views  bounce_rate\n",
       "page_path                                                             \n",
       "/home                                                7698     0.489738\n",
       "/google+redesign/shop+by+brand/youtube               3911     0.629507\n",
       "/google+redesign/shop+by+brand/waze+baby+on+boa...    681     0.654919\n",
       "/google+redesign/apparel/mens/mens+t+shirts           585     0.482051\n",
       "/signin.html                                          225     0.342222"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gr_landing_page.result.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c93d3044-4a2c-4f9e-92cc-9fc7a0bfb42d",
   "metadata": {},
   "source": [
    "### exit rates of pages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "eef3166e-d5f4-402b-9cc6-314f3bd3f097",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: aggregate_data_92888, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_exit_rate = genie.plz([\n",
    "    \"filter data for hit_type as PAGE\",\n",
    "    \"\"\"group data by page_path, create 2 aggregate columns:\n",
    "views: total number of rows in the group\n",
    "exit_rate: fraction of rows in the group where bounces column is True\n",
    "\"\"\",\n",
    "    \"sort the data by views descending\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "3c69fa2b-4fda-4ae7-8e67-d9da4dd609f5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>views</th>\n",
       "      <th>exit_rate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page_path</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>/home</th>\n",
       "      <td>13174</td>\n",
       "      <td>0.287308</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/google+redesign/shop+by+brand/youtube</th>\n",
       "      <td>4988</td>\n",
       "      <td>0.493585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/basket.html</th>\n",
       "      <td>2426</td>\n",
       "      <td>0.027205</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/signin.html</th>\n",
       "      <td>1512</td>\n",
       "      <td>0.050926</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>/google+redesign/apparel/mens/mens+t+shirts</th>\n",
       "      <td>1468</td>\n",
       "      <td>0.192098</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             views  exit_rate\n",
       "page_path                                                    \n",
       "/home                                        13174   0.287308\n",
       "/google+redesign/shop+by+brand/youtube        4988   0.493585\n",
       "/basket.html                                  2426   0.027205\n",
       "/signin.html                                  1512   0.050926\n",
       "/google+redesign/apparel/mens/mens+t+shirts   1468   0.192098"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gr_exit_rate.result.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61dd22f5-8068-4bd1-98dd-309875cd8c79",
   "metadata": {},
   "source": [
    "### user funnel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "c868d585-afe7-4e2b-87af-bd23c2f63a27",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading cached genie id: group_action_type_27490, set override = True to rerun\n"
     ]
    }
   ],
   "source": [
    "gr_funnel = genie.plz([\n",
    "    \"group data by action_type and count number of rows in each group\",\n",
    "    \"remove action_type other than 1, 2, 5, 6\",\n",
    "    \"\"\"replace the action_type values as:\n",
    "    1: Click on product list page\n",
    "    2: Product details page\n",
    "    5: Checkout\n",
    "    6: Purchase Complete\n",
    "    \"\"\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "a992f8d0-3f8a-4b4c-99ba-72236f8601f8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>action_type</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Click on product list page</td>\n",
       "      <td>6887</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Product details page</td>\n",
       "      <td>5386</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Checkout</td>\n",
       "      <td>1115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Purchase Complete</td>\n",
       "      <td>292</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  action_type  count\n",
       "1  Click on product list page   6887\n",
       "2        Product details page   5386\n",
       "5                    Checkout   1115\n",
       "6           Purchase Complete    292"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gr_funnel.result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3f821ed-1b6a-4b08-9944-d70e73ac9db9",
   "metadata": {},
   "source": [
    "## Convert into a pipeline\n",
    "\n",
    "Now that we have done all the work in terms of analyzing data and creating what we want, we can put this into a pipeline. Links to API documentation:\n",
    "- [Pipeline](https://code-genie.readthedocs.io/en/latest/pipeline.html)\n",
    "- [Arguments](https://code-genie.readthedocs.io/en/latest/arguments.html)\n",
    "- [Sources](https://code-genie.readthedocs.io/en/latest/sources.html)\n",
    "- [Sinks](https://code-genie.readthedocs.io/en/latest/sinks.html)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df824808-966c-4c03-8473-38870615bfbb",
   "metadata": {},
   "source": [
    "### Define a source\n",
    "The first step in a pipeline is to define a source of external data which will feed the pipeline. We need to use one of the pre-defined data sources, let's use the [BQ data source here](https://code-genie.readthedocs.io/en/latest/sources.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "a71023c9-f1a6-4841-9139-e4ad989eb2b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from code_genie.io import BigQueryToDataframeSource, StringArg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "35872bb1-3a00-4760-86d1-f94e8b579e8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# difference from how we used above is that start_date and end_date need to come from inputs to the pipeline\n",
    "query = f\"\"\"\n",
    "select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n",
    "where _TABLE_SUFFIX between '{start_date}' and '{end_date}'\n",
    "\"\"\"\n",
    "source = BigQueryToDataframeSource(\n",
    "    query=query,\n",
    "    query_args={\"start_date\": StringArg(name=\"analysis-start-date\"),  # need to be set during pipeline run\n",
    "                \"end_date\": StringArg(name=\"analysis-end-date\")},  # need to be set during pipeline run\n",
    "    key_path=StringArg(env_var=\"GOOGLE_APPLICATION_CREDENTIALS\")  # to be read from env var\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99057290-eaa6-42c0-8e7a-e82ecce19f0c",
   "metadata": {},
   "source": [
    "### Define sinks\n",
    "Now that we have a source, we need to define where to export the results. In this example, let's export the 3 dataframes we created as csv files into the cache dir using the [DataFrameToCsvSink](https://code-genie.readthedocs.io/en/latest/sinks.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "83175304-d64b-4577-8c95-61a8cfb94c9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from code_genie.io import DataFrameToCsvSink"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "d18b9a18-e100-4a48-a25b-463a4d0e2219",
   "metadata": {},
   "outputs": [],
   "source": [
    "sink_bounce_rate = DataFrameToCsvSink(path=StringArg(name=\"bounce-rate-export-path\"))\n",
    "sink_exit_rate = DataFrameToCsvSink(path=StringArg(name=\"exit-rate-export-path\"))\n",
    "sink_funnel = DataFrameToCsvSink(path=StringArg(name=\"funnel-export-path\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e39af1c7-a08f-4230-90f4-b70191b6305c",
   "metadata": {},
   "source": [
    "### Define pipeline steps\n",
    "Once we have our source and sinks, we can stich them together with the genies we have already create to make a pipeline. Read the [pipeline docs](https://code-genie.readthedocs.io/en/latest/pipeline.html) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "f254d69d-28e0-4b53-a253-b20e722eeccd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from code_genie.pipeline import PipelineStep, GeniePipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "cf408477-ee54-4247-8be2-3255a7ce4d70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize pipeline step\n",
    "pipeline_steps = []"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8956574-8475-4d3b-a69f-674cdee48acb",
   "metadata": {},
   "source": [
    "#### Use source data and expand hits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "ffa24306-ba8e-4f65-a236-df7dfffca2a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline_steps.append(PipelineStep(genie_result=gr_hits_expanded, data=source))\n",
    "len(pipeline_steps)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fea79cf7-256d-4ad9-a1de-00bbfad1ae95",
   "metadata": {},
   "source": [
    "#### Add all extraction steps\n",
    "Note that here the input source is the previous genie as we want to carry forward all the columns added to the very end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "c27b2344-71ab-4626-ab27-57556f9b06f2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline_steps.extend([\n",
    "    # extract page path\n",
    "    PipelineStep(genie_result=gr_page_path,\n",
    "                 data=gr_hits_expanded),\n",
    "    # extract hit type\n",
    "    PipelineStep(genie_result=gr_hit_type,\n",
    "                 data=gr_page_path),\n",
    "    # extract hit number\n",
    "    PipelineStep(genie_result=gr_hit_number,\n",
    "                 data=gr_hit_type),\n",
    "    # extract action type\n",
    "    PipelineStep(genie_result=gr_action_type,\n",
    "                 data=gr_hit_number),\n",
    "    # extract bounce\n",
    "    PipelineStep(genie_result=gr_bounce,\n",
    "                 data=gr_action_type)\n",
    "])\n",
    "len(pipeline_steps)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc92436f-0992-49b5-903b-071cbaf7d043",
   "metadata": {},
   "source": [
    "#### Export results to sinks\n",
    "Now that we have interim layers added, we can add the final aggregation layer and assign the corresponding sinks so that the output data is generated. Note that the base_input_genie argument for all of these is still gr_bounce as the result of that genie needs to be fed into all of them.\n",
    "\n",
    "We will also add the sink parameter to these steps so that the results are generated"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "7dc596bc-77e5-47e8-8536-4645de14e74b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline_steps.extend([\n",
    "    PipelineStep(genie_result=gr_landing_page, data=gr_bounce, sink=sink_bounce_rate),\n",
    "    PipelineStep(genie_result=gr_exit_rate, data=gr_bounce, sink=sink_exit_rate),\n",
    "    PipelineStep(genie_result=gr_funnel, data=gr_bounce, sink=sink_funnel)\n",
    "])\n",
    "len(pipeline_steps)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9d3f46b-1d6e-4bbb-9a5d-c2dec0b6fa4a",
   "metadata": {},
   "source": [
    "#### Create pipeline and export"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "45039951-68b1-4204-accf-806dbcf00f6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = GeniePipeline(name=\"google-analytics-pipeline\", version=\"1\", steps=pipeline_steps, cache_dir=CACHE_DIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "e8abf0d9-bb9a-4101-9356-e0bcbf1ddae2",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline.export(\"pipeline-v1.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "136b9f5f-f344-4eeb-872f-2185dad24dea",
   "metadata": {},
   "source": [
    "## Run the pipeline\n",
    "\n",
    "Now that we have created the pipeline, we can load it into a new session and run it. From this point, the code does not depend on the execution of any pieces of code before this as long as we have exported the pipeline. You can restart your kernel and run again from here on."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c059593-f058-46d7-a4ac-e5769f5d8893",
   "metadata": {},
   "source": [
    "### rerun setup\n",
    "\n",
    "lets load our env vars and setup cache dir as before"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "56b23470-8799-4b1d-b3f2-1ef5b58c6f9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "load_dotenv(\".env\")  # replace with path to your env file\n",
    "CACHE_DIR = \"./_cache_google_analytics\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14526b6f-e30a-4266-983c-4033db46d714",
   "metadata": {},
   "source": [
    "### load the pipeline and run\n",
    "\n",
    "Lets rerun for the same dates as above to see if we can get the same results back."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e3766c07-d039-423a-b0c3-4f057d52a15c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from code_genie.pipeline import GeniePipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "fc7de6a3-f3f1-4eda-a30b-ab19037f7b51",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = GeniePipeline.load(f\"{CACHE_DIR}/pipeline-v1.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "99051182-b4fd-4c08-a47b-b7666352bee4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# lets create a temporary directory to store the generated datasets\n",
    "from tempfile import mkdtemp\n",
    "export_dir = mkdtemp()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6434007d-813b-4b4d-9079-8a31fea79035",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running step 1: separate_hits_70651\n",
      "\tCompleted in 13.2 seconds\n",
      "Running step 2: extract_page_path_78866\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 3: run_75120\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 4: extract_hit_number_77980\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 5: run_83895\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 6: add_bounces_column_93508\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 7: get_page_bounce_rate_81411\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 8: aggregate_data_92888\n",
      "\tCompleted in 0.0 seconds\n",
      "Running step 9: group_action_type_27490\n",
      "\tCompleted in 0.0 seconds\n"
     ]
    }
   ],
   "source": [
    "pipeline_args={\n",
    "    \"analysis-start-date\": \"20170701\",\n",
    "    \"analysis-end-date\": \"20170707\",\n",
    "    \"bounce-rate-export-path\": f\"{export_dir}/bounce_rate.csv\",\n",
    "    \"exit-rate-export-path\": f\"{export_dir}/exit_rate.csv\",\n",
    "    \"funnel-export-path\": f\"{export_dir}/funnel_stats.csv\"\n",
    "}\n",
    "pipeline.run(pipeline_args)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "eefeb89c-0b32-47d4-a455-32b523daa806",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>page_path</th>\n",
       "      <th>views</th>\n",
       "      <th>bounce_rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>/home</td>\n",
       "      <td>7698</td>\n",
       "      <td>0.489738</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>/google+redesign/shop+by+brand/youtube</td>\n",
       "      <td>3911</td>\n",
       "      <td>0.629507</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>/google+redesign/shop+by+brand/waze+baby+on+bo...</td>\n",
       "      <td>681</td>\n",
       "      <td>0.654919</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>/google+redesign/apparel/mens/mens+t+shirts</td>\n",
       "      <td>585</td>\n",
       "      <td>0.482051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>/signin.html</td>\n",
       "      <td>225</td>\n",
       "      <td>0.342222</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           page_path  views  bounce_rate\n",
       "0                                              /home   7698     0.489738\n",
       "1             /google+redesign/shop+by+brand/youtube   3911     0.629507\n",
       "2  /google+redesign/shop+by+brand/waze+baby+on+bo...    681     0.654919\n",
       "3        /google+redesign/apparel/mens/mens+t+shirts    585     0.482051\n",
       "4                                       /signin.html    225     0.342222"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load results\n",
    "import pandas as pd\n",
    "pd.read_csv(pipeline_args[\"bounce-rate-export-path\"]).head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e8a2577a-fb7c-4518-8167-f027bc97b12e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>page_path</th>\n",
       "      <th>views</th>\n",
       "      <th>exit_rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>/home</td>\n",
       "      <td>13174</td>\n",
       "      <td>0.287308</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>/google+redesign/shop+by+brand/youtube</td>\n",
       "      <td>4988</td>\n",
       "      <td>0.493585</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>/basket.html</td>\n",
       "      <td>2426</td>\n",
       "      <td>0.027205</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>/signin.html</td>\n",
       "      <td>1512</td>\n",
       "      <td>0.050926</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>/google+redesign/apparel/mens/mens+t+shirts</td>\n",
       "      <td>1468</td>\n",
       "      <td>0.192098</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     page_path  views  exit_rate\n",
       "0                                        /home  13174   0.287308\n",
       "1       /google+redesign/shop+by+brand/youtube   4988   0.493585\n",
       "2                                 /basket.html   2426   0.027205\n",
       "3                                 /signin.html   1512   0.050926\n",
       "4  /google+redesign/apparel/mens/mens+t+shirts   1468   0.192098"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(pipeline_args[\"exit-rate-export-path\"]).head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5f8b6c53-dd9f-4b5c-b5ec-ea76f6facb8e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>action_type</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Click on product list page</td>\n",
       "      <td>6887</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Product details page</td>\n",
       "      <td>5386</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>Checkout</td>\n",
       "      <td>1115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6</td>\n",
       "      <td>Purchase Complete</td>\n",
       "      <td>292</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   index                 action_type  count\n",
       "0      1  Click on product list page   6887\n",
       "1      2        Product details page   5386\n",
       "2      5                    Checkout   1115\n",
       "3      6           Purchase Complete    292"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(pipeline_args[\"funnel-export-path\"]).head(5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "code-genie",
   "language": "python",
   "name": "code-genie"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}