{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline Metrics of Legacy Search on Commons\n", "\n", "In order to understand the effects of tests of search on Commons, we need to establish baselines. This notebook does that for legacy search on Commons. The phab task for this is [T258723](https://phabricator.wikimedia.org/T258723). The metrics are listed in its parent task [T258229](https://phabricator.wikimedia.org/T258229) because we want to measure these for both legacy search and Media Search.\n", "\n", "The metrics are:\n", "\n", "1. Number of searches made.\n", "2. Number of search sessions.\n", "3. Number of searches per session.\n", "4. Search session length.\n", "5. Click-through rate.\n", "6. Average position of clicked result in successful searches.\n", "\n", "I think we'd like to grab data for this on either a daily or weekly basis, and store aggregates somewhere, then build dashboards on top of it. It would be great to be able to update these datasets regularly, e.g. daily with a cron job." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import datetime as dt\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from wmfdata import spark, mariadb" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "## Load the rpy2 IPython extension so we can use R for graphs\n", "\n", "%load_ext rpy2.ipython" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "%%R\n", "library(ggplot2)\n", "library(hrbrthemes)\n", "library(tidyr)\n", "library(lubridate)\n", "library(zoo)\n", "library(dplyr)\n", "import::from(polloi, compress)" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "%%R\n", "\n", "## Options\n", "options(mc.cores = 4, digits = 3, scipen = 500)\n", "\n", "## Defining a custom theme for all plots\n", "commons_theme = function() {\n", " theme_ipsum_rc(\n", " base_size = 14, axis_title_size = 12, subtitle_size = 16,\n", " axis_title_just = 'cm'\n", " )\n", "}\n", "\n" ] }, { "cell_type": "code", "execution_count": 213, "metadata": {}, "outputs": [], "source": [ "## Timestamps come in two different formats, so we have to be able\n", "## to parse them both with and without fractional seconds.\n", "\n", "def parse_dt(ts):\n", "    try:\n", "        return dt.datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S.%fZ')\n", "    except ValueError: ## no fractional seconds\n", "        return dt.datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ')" ] }, { "cell_type": "code", "execution_count": 208, "metadata": {}, "outputs": [], "source": [ "today = dt.datetime.now(dt.timezone.utc).date()\n", "last_week = today - dt.timedelta(days = 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SearchSatisfaction schema notes\n", "\n", "The SearchSatisfaction schema is, as far as I know, the first schema to be ported to the [Event Platform](https://wikitech.wikimedia.org/wiki/Event_Platform). 
At the time of this analysis (early August 2020), the database for this schema contains data captured through two event infrastructures: EventLogging (EL) and Event Platform (EP). This means that we need to deal with some inconsistencies in that data.\n", "\n", "### Timestamps\n", "\n", "The data contains three timestamp columns: `dt`, `meta.dt`, and `client_dt`. The first one of those is an EL column, the other two are EP columns. `meta.dt` and `dt` are set server-side, except that when `client_dt` is set and `dt` is not, `dt` equals `client_dt`.\n", "\n", "Once we've accumulated enough data to only have EP data, we can most likely simplify our analysis and focus on `client_dt`. In the meantime, we'll combine all three timestamps using `coalesce`, in the priority order used by the queries in this notebook: `meta.dt`, `client_dt`, then `dt`. Later, we might look more closely at the client timestamps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of Searches Per Day on Commons\n", "\n", "We make these measurements similarly to how we did this back in March 2020, when grabbing these numbers for the SDAW grant:\n", "\n", "1. For fulltext searches, we count every \"searchResultPage\".\n", "2. For autocomplete searches, we count every distinct `searchSessionId` + `pageViewId` combination. An autocomplete search session can consist of multiple searches as the user types out their query, and this collapses them into a single unit.\n", "3. A user can hit Enter in their autocomplete search and get to a fulltext search if their autocomplete search did not find any pages. In this case, we count it as two separate searches, partly because we expect this to be somewhat rare compared to autocomplete searches in general, and partly because identifying these cases can be tricky.\n", "4. Users who have Do Not Track enabled are not part of the dataset.\n", "\n", "Regarding No. 
3, one way to go about this, as Mikhail points out, is: \"for each searchSessionId that have autocomplete and fulltext events, grab the query from the last autocomplete search preceding (or that happened temporally near) a fulltext search and do a string comparison.\" However, that's outside the scope of this analysis due to the tight deadline." ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [], "source": [ "# Query to count fulltext and autocomplete searches on Commons\n", "\n", "search_count_query = '''\n", "WITH ac AS (\n", " SELECT TO_DATE(coalesce(meta.dt, client_dt, dt)) AS log_date,\n", " COUNT(DISTINCT event.searchsessionid, event.pageviewid) AS n_autocomp\n", " FROM event.searchsatisfaction\n", " WHERE year = 2020\n", " AND month = 8\n", " AND wiki = \"commonswiki\"\n", " AND useragent.is_bot = false\n", " AND event.subTest IS NULL\n", " AND event.action = \"searchResultPage\"\n", " AND event.isforced IS NULL -- only include non-test users\n", " AND event.source = \"autocomplete\"\n", " GROUP BY TO_DATE(coalesce(meta.dt, client_dt, dt))\n", "), ft AS (\n", " SELECT TO_DATE(coalesce(meta.dt, client_dt, dt)) AS log_date,\n", " SUM(IF(event.hitsReturned > 0 , 1, 0)) AS n_fulltext_successful,\n", " SUM(IF(event.hitsReturned IS NULL , 1, 0)) AS n_fulltext_zeroresults\n", " FROM event.searchsatisfaction\n", " WHERE year = 2020\n", " AND month = 8\n", " AND wiki = \"commonswiki\"\n", " AND useragent.is_bot = false\n", " AND event.subTest IS NULL\n", " AND event.action = \"searchResultPage\"\n", " AND event.isforced IS NULL -- only include non-test users\n", " AND event.source = \"fulltext\"\n", " GROUP BY TO_DATE(coalesce(meta.dt, client_dt, dt))\n", ")\n", "SELECT ac.log_date, n_autocomp, n_fulltext_successful, n_fulltext_zeroresults\n", "FROM ac\n", "LEFT JOIN ft\n", "ON ac.log_date = ft.log_date\n", "'''" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [], "source": [ "commons_searches_daily = 
spark.run(search_count_query)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "## rpy2 doesn't seem to handle datetime.date objects very well, so we make it a string\n", "commons_searches_daily['log_date_str'] = commons_searches_daily['log_date'].apply(str)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "%%R\n", "\n", "## Moving average function with right-alignment and zero-fill\n", "mavg = function(x, ndays) {\n", " rollapply(x, ndays, mean, align = 'right', fill = 0)\n", "}\n", "\n", "## Color-blind-friendly palette with black\n", "## http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/\n", "cbbPalette <- c(\"#000000\", \"#E69F00\", \"#56B4E9\", \"#009E73\", \"#F0E442\", \"#0072B2\", \"#D55E00\", \"#CC79A7\")" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "%%R -i commons_searches_daily\n", "\n", "commons_searches_daily %>% mutate(log_date = ymd(log_date_str)) %>%\n", " filter(log_date < today(tzone = 'UTC')) %>% ## skip today because it's partial data\n", " arrange(log_date) %>%\n", " mutate(n_autocomp_m7 = mavg(n_autocomp, 7)) %>%\n", " ggplot(aes(x = log_date)) +\n", " scale_y_continuous(labels = compress) +\n", " scale_x_date(date_breaks = \"1 month\", minor_breaks = NULL, date_labels = \"%b\\n%Y\") +\n", " labs(x = \"Date\", y = \"Number of searches\",\n", " title = \"Autocomplete searches per day with 7-day MA\") +\n", " geom_line(aes(y = n_autocomp)) +\n", " geom_line(aes(y = n_autocomp_m7), color = cbbPalette[2]) +\n", " commons_theme()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "%%R -i commons_searches_daily\n", "\n", "commons_searches_daily %>% mutate(log_date = ymd(log_date_str)) %>%\n", " filter(log_date < today(tzone = 'UTC')) %>% ## skip today because it's partial data\n", " arrange(log_date) %>%\n", " mutate(n_fulltext_successful_m7 = mavg(n_fulltext_successful, 7)) %>%\n", " ggplot(aes(x = log_date)) +\n", " scale_y_continuous(labels = compress) +\n", " scale_x_date(date_breaks = \"1 month\", minor_breaks = 
NULL, date_labels = \"%b\\n%Y\") +\n", " labs(x = \"Date\", y = \"Number of searches\",\n", " title = \"Successful Fulltext searches per day with 7-day MA\") +\n", " geom_line(aes(y = n_fulltext_successful)) +\n", " geom_line(aes(y = n_fulltext_successful_m7), color = cbbPalette[2]) +\n", " commons_theme()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "%%R -i commons_searches_daily\n", "\n", "commons_searches_daily %>% mutate(log_date = ymd(log_date_str)) %>%\n", " filter(log_date < today(tzone = 'UTC')) %>% ## skip today because it's partial data\n", " arrange(log_date) %>%\n", " mutate(n_fulltext_zeroresults_m7 = mavg(n_fulltext_zeroresults, 7)) %>%\n", " ggplot(aes(x = log_date)) +\n", " scale_y_continuous(labels = compress) +\n", " scale_x_date(date_breaks = \"1 month\", minor_breaks = NULL, date_labels = \"%b\\n%Y\") +\n", " labs(x = \"Date\", y = \"Number of searches\",\n", " title = \"Zero-results Fulltext searches per day with 7-day MA\") +\n", " geom_line(aes(y = n_fulltext_zeroresults)) +\n", " geom_line(aes(y = n_fulltext_zeroresults_m7), color = cbbPalette[2]) +\n", " commons_theme()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baselines for number of searches\n", "\n", "Let's calculate some baselines using the most recent 7-day average." 
] }, { "cell_type": "code", "execution_count": 233, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "n_autocomp 52473.57\n", "n_fulltext_successful 97826.00\n", "n_fulltext_zeroresults 3646.00\n", "dtype: float64" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(commons_searches_daily.loc[(commons_searches_daily['log_date'] < today) &\n", " (commons_searches_daily['log_date'] >= last_week)][\n", " ['n_autocomp', 'n_fulltext_successful', 'n_fulltext_zeroresults']\n", "].mean(), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of Search Sessions and Number of Searches per Session\n", "\n", "Here, we're interested in the number of search sessions that originate on a given day. This means, for each session, get the timestamp of the first SERP in that session. Also, count the number of SERPs in that session.\n", "\n", "Because autocomplete searches generate multiple searches while the user types, we'll count these separately." 
] }, { "cell_type": "code", "execution_count": 234, "metadata": {}, "outputs": [], "source": [ "autocomp_session_query = '''\n", "SELECT event.searchsessionid,\n", " MIN(TO_DATE(coalesce(meta.dt, client_dt, dt))) AS session_start_date,\n", " SUM(1) AS num_searches\n", "FROM event.searchsatisfaction\n", "WHERE year = 2020\n", "AND month = 8\n", "AND wiki = \"commonswiki\"\n", "AND useragent.is_bot = false\n", "AND event.subTest IS NULL\n", "AND event.action = \"searchResultPage\"\n", "AND event.isforced IS NULL -- only include non-test users\n", "AND event.source = \"autocomplete\"\n", "GROUP BY event.searchsessionid\n", "'''" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [], "source": [ "autocomp_session_metrics = spark.run(autocomp_session_query)" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [], "source": [ "fulltext_session_query = '''\n", "SELECT event.searchsessionid,\n", " MIN(TO_DATE(coalesce(meta.dt, client_dt, dt))) AS session_start_date,\n", " SUM(1) AS num_searches\n", "FROM event.searchsatisfaction\n", "WHERE year = 2020\n", "AND month = 8\n", "AND wiki = \"commonswiki\"\n", "AND useragent.is_bot = false\n", "AND event.subTest IS NULL\n", "AND event.action = \"searchResultPage\"\n", "AND event.isforced IS NULL -- only include non-test users\n", "AND event.source = \"fulltext\"\n", "GROUP BY event.searchsessionid\n", "'''" ] }, { "cell_type": "code", "execution_count": 237, "metadata": {}, "outputs": [], "source": [ "fulltext_session_metrics = spark.run(fulltext_session_query)" ] }, { "cell_type": "code", "execution_count": 238, "metadata": {}, "outputs": [], "source": [ "## Drop the session ID columns, we don't really need those\n", "autocomp_session_metrics.drop(columns = 'searchsessionid', inplace = True)\n", "fulltext_session_metrics.drop(columns = 'searchsessionid', inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We aggregate by day, filter out today 
because it's partial data, and remove sessions with more than 50 searches because those tend to be non-human. I got the 50 cutoff from Chelsy and Mikhail's work.\n", "\n", "Daily average number of autocomplete sessions:" ] }, { "cell_type": "code", "execution_count": 239, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17438.6" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(autocomp_session_metrics.loc[(autocomp_session_metrics['session_start_date'] < today) &\n", " (autocomp_session_metrics['session_start_date'] >= last_week) &\n", " (autocomp_session_metrics['num_searches'] < 50)]\n", " .groupby('session_start_date')\n", " .agg({'session_start_date' : 'count'})\n", " ['session_start_date'].mean(), 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Daily average number of fulltext sessions:" ] }, { "cell_type": "code", "execution_count": 240, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21099.6" ] }, "execution_count": 240, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(fulltext_session_metrics.loc[(fulltext_session_metrics['session_start_date'] < today) &\n", " (fulltext_session_metrics['session_start_date'] >= last_week) &\n", " (fulltext_session_metrics['num_searches'] < 50)]\n", " .groupby('session_start_date')\n", " .agg({'session_start_date' : 'count'})\n", " ['session_start_date'].mean(), 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number of searches per session" ] }, { "cell_type": "code", "execution_count": 241, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " session_start_date num_searches session_start_date_str\n", " Min. :737638 Min. : 1 Length:221229 \n", " 1st Qu.:737641 1st Qu.: 2 Class :character \n", " Median :737644 Median : 5 Mode :character \n", " Mean :737644 Mean : 9 \n", " 3rd Qu.:737647 3rd Qu.:12 \n", " Max. :737650 Max. 
:49 \n" ] } ], "source": [ "%%R\n", "\n", "autocomp_session_metrics %>% filter(num_searches < 50) %>%\n", " summary()" ] }, { "cell_type": "code", "execution_count": 242, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " session_start_date num_searches session_start_date_str\n", " Min. :737638 Min. : 1.0 Length:263699 \n", " 1st Qu.:737641 1st Qu.: 1.0 Class :character \n", " Median :737644 Median : 2.0 Mode :character \n", " Mean :737644 Mean : 4.3 \n", " 3rd Qu.:737647 3rd Qu.: 5.0 \n", " Max. :737650 Max. :49.0 \n" ] } ], "source": [ "%%R\n", "\n", "fulltext_session_metrics %>% filter(num_searches < 50) %>%\n", " summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take out today's date, filter to the last week (because activity follows a weekly cycle), and then calculate the median." ] }, { "cell_type": "code", "execution_count": 243, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "num_searches 5.0\n", "dtype: float64" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autocomp_session_metrics.loc[(autocomp_session_metrics['session_start_date'] < today) &\n", " (autocomp_session_metrics['session_start_date'] >= last_week) &\n", " (autocomp_session_metrics['num_searches'] < 50)].median()" ] }, { "cell_type": "code", "execution_count": 244, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "num_searches 2.0\n", "dtype: float64" ] }, "execution_count": 244, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fulltext_session_metrics.loc[(fulltext_session_metrics['session_start_date'] < today) &\n", " (fulltext_session_metrics['session_start_date'] >= last_week) &\n", " (fulltext_session_metrics['num_searches'] < 50)].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search session length\n", "\n", "We define it as the time difference between the first search event and the last event in a session, for non-bot sessions 
with less than 50 searches. Since it's convenient to do more things at the same time, we also gather information about click-through, positions, and dwell time in the same query." ] }, { "cell_type": "code", "execution_count": 249, "metadata": {}, "outputs": [], "source": [ "session_info_query = '''\n", "WITH cs AS (\n", " SELECT event.searchsessionid,\n", " MIN(coalesce(meta.dt, client_dt, dt)) AS session_start_ts\n", " FROM event.searchsatisfaction\n", " WHERE year = 2020\n", " AND month = 8\n", " AND wiki = \"commonswiki\"\n", " AND useragent.is_bot = false\n", " AND event.subTest IS NULL\n", " AND event.action = \"searchResultPage\"\n", " AND event.isforced IS NULL -- only include non-test users\n", " GROUP BY event.searchsessionid\n", " HAVING SUM(1) < 50\n", "),\n", "se AS (\n", " SELECT event.searchsessionid,\n", " MAX(coalesce(meta.dt, client_dt, dt)) AS session_end_ts\n", " FROM event.searchsatisfaction\n", " WHERE year = 2020\n", " AND month = 8\n", " AND wiki = \"commonswiki\"\n", " AND useragent.is_bot = false\n", " AND event.subTest IS NULL\n", " AND event.isforced IS NULL -- only include non-test users\n", " GROUP BY event.searchsessionid\n", "),\n", "ct AS (\n", " SELECT event.searchsessionid, event.position\n", " FROM event.searchsatisfaction\n", " WHERE year = 2020\n", " AND month = 8\n", " AND wiki = \"commonswiki\"\n", " AND useragent.is_bot = false\n", " AND event.subTest IS NULL\n", " AND event.action = \"visitPage\"\n", " AND event.isforced IS NULL -- only include non-test users\n", "),\n", "dw AS (\n", " SELECT event.searchsessionid, max(event.checkin) AS last_checkin\n", " FROM event.searchsatisfaction\n", " WHERE year = 2020\n", " AND month = 8\n", " AND wiki = \"commonswiki\"\n", " AND useragent.is_bot = false\n", " AND event.subTest IS NULL\n", " AND event.action = \"checkin\"\n", " AND event.isforced IS NULL -- only include non-test users\n", " GROUP BY event.searchsessionid\n", ")\n", "SELECT cs.searchsessionid, session_start_ts, 
se.session_end_ts,\n", " IF(ct.searchsessionid IS NOT NULL, 1, 0) AS clicked_through,\n", " coalesce(ct.position, -1) AS position,\n", " IF(dw.last_checkin IS NOT NULL, dw.last_checkin, -1) AS last_checkin\n", "FROM cs\n", "JOIN se\n", "ON cs.searchsessionid = se.searchsessionid\n", "LEFT JOIN ct\n", "ON cs.searchsessionid = ct.searchsessionid\n", "LEFT JOIN dw\n", "ON cs.searchsessionid = dw.searchsessionid\n", "'''" ] }, { "cell_type": "code", "execution_count": 250, "metadata": {}, "outputs": [], "source": [ "session_info = spark.run(session_info_query)" ] }, { "cell_type": "code", "execution_count": 251, "metadata": {}, "outputs": [], "source": [ "session_info.drop(columns = 'searchsessionid', inplace = True)" ] }, { "cell_type": "code", "execution_count": 261, "metadata": {}, "outputs": [], "source": [ "r_session_info = session_info[['session_start_ts', 'session_end_ts']]" ] }, { "cell_type": "code", "execution_count": 264, "metadata": {}, "outputs": [ { "data": { "image/png": "\n" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%R -i r_session_info\n", "\n", "r_session_info %>% mutate(\n", " session_start = ymd_hms(session_start_ts),\n", " session_end = ymd_hms(session_end_ts),\n", " session_length = session_end - session_start\n", " ) %>% filter(session_length > 0) %>%\n", " ggplot(aes(x = as.numeric(session_length))) + \n", " geom_histogram(binwidth = 0.2, colour=\"black\", fill='white') +\n", " scale_x_log10(\n", " \"Time\",\n", " breaks=c(60, 15*60, 60*60, 24*60*60, 7*24*60*60, 30*24*60*60, 365*24*60*60),\n", " labels=c(\"minute\", \"15 min.\", \"hour\", \"day\", \"week\", \"month\", \"year\")) +\n", " commons_theme()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks fairly well-distributed on a log-scale, so let's use the median." 
] }, { "cell_type": "code", "execution_count": 253, "metadata": {}, "outputs": [], "source": [ "## Parse the timestamp strings, then derive each session's length and start date\n", "session_info['session_start'] = session_info['session_start_ts'].apply(parse_dt)\n", "session_info['session_end'] = session_info['session_end_ts'].apply(parse_dt)\n", "session_info['session_length'] = session_info['session_end'] - session_info['session_start']\n", "session_info['session_start_date'] = session_info['session_start'].apply(lambda x: x.date())" ] }, { "cell_type": "code", "execution_count": 254, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Timedelta('0 days 00:00:48.148500')" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "session_info.loc[(session_info['session_start_date'] < today) &\n", " (session_info['session_start_date'] >= last_week)]['session_length'].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Click-through rate\n", "\n", "Per the [data scientist takehome task](https://github.com/nettrom/Contributors-Hiring-DataScientist-2018), the Search Team defines the click-through rate as the \"proportion of search sessions where the user clicked on one of the results displayed.\"\n", "\n", "We're again limiting this to non-bot sessions with fewer than 50 searches made." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the average click-through rate, as a percentage:" ] }, { "cell_type": "code", "execution_count": 255, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "71.55" ] }, "execution_count": 255, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(100 * session_info['clicked_through'].mean(), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Average position of clicked result in successful searches" ] }, { "cell_type": "code", "execution_count": 265, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 252434.000000\n", "mean 19.342549\n", "std 52.663988\n", "min 0.000000\n", "25% 0.000000\n", "50% 3.000000\n", "75% 13.000000\n", "max 499.000000\n", "Name: position, dtype: float64" ] }, "execution_count": 265, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Position can't be -1 (that means it's missing), and the maximum number\n", "## of results is 500, so it can't be above that either.\n", "\n", "session_info.loc[(session_info['clicked_through'] == 1) &\n", " (session_info['position'] < 500) &\n", " (session_info['position'] != -1)]['position'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The large difference between the mean and the median isn't surprising: the larger values pull the mean up. We'll therefore use the median." 
] }, { "cell_type": "code", "execution_count": 258, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.0" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Position can't be -1 (that means it's missing), and the maximum number\n", "## of results is 500, so it can't be above that either.\n", "\n", "session_info.loc[(session_info['clicked_through'] == 1) &\n", " (session_info['position'] < 500) &\n", " (session_info['position'] != -1)]['position'].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Success rate\n", "\n", "Defined as a click-through with a dwell time of at least 10 seconds." ] }, { "cell_type": "code", "execution_count": 257, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "81.35" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Same assumptions as before,\n", "## plus removing all sessions with no checkin\n", "\n", "round(100 *\n", " session_info.loc[(session_info['clicked_through'] == 1) &\n", " (session_info['position'] < 500) &\n", " (session_info['position'] != -1) &\n", " (session_info['last_checkin'] >= 10)]['last_checkin'].count() /\n", " session_info.loc[(session_info['clicked_through'] == 1) &\n", " (session_info['position'] < 500) &\n", " (session_info['position'] != -1)]['last_checkin'].count(),\n", " 2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }