{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Live Demo\n",
    "\n",
    "Before we do any sort of research into a data set we'll need to know what the goal is.\n",
    "\n",
    "In this case I'm demoing how you can use Minhash calculation and text based similarities to discover potential phishing domains. This code is borrowed mostly from the phishy domain analyzer. This is to demonstrate how one can start doing some data analysis/research and end up with an analyzer.\n",
    "\n",
    "The question we are trying to answer is: **Are there domains in the data set that are \"similar\" to the most frequently visited domains, and thus have a chance to be phishing domains?** The theory here being that the most commonly visited domains by the user are most likely \"legit\", and if there are anomalies in the data set, that is domains that are less frequently visited yet very similar to those that are frequently visited they could be malicious or phishing domains.\n",
    "\n",
    "## Imports\n",
    "\n",
    "Simple imports of libraries that are going to be used throughout."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import datetime\n",
    "import getpass\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import altair as alt\n",
    "\n",
    "import difflib \n",
    "\n",
    "from timesketch_api_client import client as timesketch_client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Caveat**\n",
    "\n",
    "If you are running this from jupyterlab, then you will not be able to run the first half, only the Timesketch portion.\n",
    "\n",
    "Also you will not be able to import everything, so the first import statement will work fine, the second import statement is only for those running this locally.\n",
    "\n",
    "Also if you are running from jupyterlab, then you'll need to run the pip command below (commented out for local run)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install vega"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second import statement (only run if you are not using jupyterlab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasketch.minhash import MinHash\n",
    "from six.moves import urllib_parse as urlparse\n",
    "\n",
    "from timesketch.lib import similarity\n",
    "from timesketch.lib.analyzers import utils\n",
    "\n",
    "_ = alt.renderers.enable('notebook')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read the Data\n",
    "\n",
    "The way this would be done normally would be to store the data in timesketch and then query it.\n",
    "\n",
    "In this case we really only care about entries that have a URL or a domain attribute set. The TS query would be: \n",
    "\n",
    "```\n",
    "        query = (\n",
    "            '{\"query\": { \"bool\": { \"should\": [ '\n",
    "            '{ \"exists\" : { \"field\" : \"url\" }}, '\n",
    "            '{ \"exists\" : { \"field\" : \"domain\" }} ] } } }')\n",
    "```\n",
    " \n",
    "However, this notebook was used at a conference for a live demo, and as such we don't want to depend on internet connectivity, thus we read a JSON file on disk.\n",
    "\n",
    "To get a copy of the JSON file, please reach out to kiddi at log2timeline dot net."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_running_notebook = getpass.getuser()\n",
    "\n",
    "DATA_LOCATION = '/Users/{0:s}/Documents/Presentation/data/studentpc1.jsonl'.format(user_running_notebook)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file itself is really large:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!du -hs $DATA_LOCATION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since all we care about are entries with URls or domains in them, we'll open up the file and read it line by line, just picking those and then entering them into a data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "data_entries = []\n",
    "\n",
    "with open(DATA_LOCATION, 'r') as fh:\n",
    "    for line in fh:\n",
    "        parsed = json.loads(line)\n",
    "        if 'url' in parsed or 'domain' in parsed:\n",
    "            data_entries.append(parsed)\n",
    "\n",
    "network_data = pd.DataFrame(data_entries)\n",
    "del data_entries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Initial Exploration\n",
    "\n",
    "No we've got a data frame with just our browsing data, let's look at it first to inspect before we make any other changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "network_data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When doing research, or data exploration it really helps knowing what the data looks like, so we take a small sample of it to look at it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "network_data.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check whether we've got a domain column...."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if 'domain' in network_data.columns:\n",
    "    print('we do have a domain column')\n",
    "else:\n",
    "    print('no such column exists....')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manipulate The Data\n",
    "\n",
    "Now we need to start manipulating or extracting the features we need.\n",
    "\n",
    "We could also have just used regular expressions... eg `r'https?://([^$/]+)'` and the built-in `network_data.url.str.extract(<regular_expression>)` to extract the data. However here I'm using a library to extract the URL part, since it contains features for more types of URLs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_domain(url):\n",
    "    domain_parsed = urlparse.urlparse(url)\n",
    "    domain_full = domain_parsed.netloc\n",
    "    domain, _, _ = domain_full.partition(':')\n",
    "    return domain\n",
    "\n",
    "def get_tld(domain):\n",
    "    return '.'.join(domain.split('.')[-2:])\n",
    "\n",
    "network_data['domain'] = network_data.url.apply(get_domain)\n",
    "network_data['tld'] = network_data.domain.apply(get_tld)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Meat\n",
    "\n",
    "Now that we've got the domain we can start looking at whether there are similar domains, but first we need to figure out what are the most common domains, to see what the most likely \"corporate\" domains are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "network_data.tld.value_counts()[:10].keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a list of domains to \"watch\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We \"bootstrap\" the watch list with the known \"corp\" domains that we've got and we know about.\n",
    "watch_list = ['greendale.xyz']\n",
    "\n",
    "# Let's add the top domains to the list.\n",
    "watch_list.extend(network_data.tld.value_counts()[:10].keys())\n",
    "watch_list.extend(network_data.domain.value_counts()[:10].keys())\n",
    "\n",
    "# Remove empty records...\n",
    "watch_list = [x for x in watch_list if x]\n",
    "watch_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now calculate the MinHash value for all of the domains on the watch list.\n",
    "\n",
    "Start by defining few functions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "    # This is just some code from the phishy analyzer, not really something we go through here.\n",
    "    # Added here to have the code to play with.\n",
    "    domain_scoring_threshold = 0.75\n",
    "\n",
    "    def _get_minhash_from_domain(domain):\n",
    "        \"\"\"Get the Minhash value from a domain name.\n",
    "        This function takes a domain, removes the TLD extension\n",
    "        from it and then creates a MinHash object from every\n",
    "        remaining character in the domain.\n",
    "        If a domain starts with www., it will be stripped of the\n",
    "        domain before the Minhash is calculated.\n",
    "        Args:\n",
    "          domain: string with a full domain, eg. www.google.com\n",
    "        Returns:\n",
    "            A minhash (instance of datasketch.minhash.MinHash)\n",
    "        \"\"\"\n",
    "        domain_items = domain.split('.')\n",
    "        domain_part = '.'.join(domain_items[:-1])\n",
    "\n",
    "        minhash = MinHash(similarity.DEFAULT_PERMUTATIONS)\n",
    "        for char in domain_part:\n",
    "            minhash.update(char.encode('utf8'))\n",
    "\n",
    "        return minhash\n",
    "\n",
    "    def _get_similar_domains(domain, domain_dict):\n",
    "        \"\"\"Compare a domain to a list of domains and return similar domains.\n",
    "        This function takes a domain and a dict object that contains\n",
    "        as key domain names and value the calculated MinHash value for that\n",
    "        domain as well as the domains depth (mbl.is is 2, foobar.evil.com would\n",
    "        be 3). It will then strip www. if needed from the domain, and compare\n",
    "        the Jaccard distance between all domains in the dict and the supplied\n",
    "        domain (removing the TLD extension from all domains).\n",
    "        If the Jaccard distance between the supplied domain and one or more of\n",
    "        the domains in the domain dict is higher than the configured threshold\n",
    "        the domain is further tested to see if there are overlapping substrings\n",
    "        between the two domains. If there is a common substring that is longer\n",
    "        than half the domain name and the Jaccard distance is above the\n",
    "        threshold the domain is considered to be similar.\n",
    "        Args:\n",
    "            domain: string with a full domain, eg. www.google.com\n",
    "            domain_dict: dict with domain names (keys) and MinHash objects\n",
    "                (values) for all domains to compare against.\n",
    "        Returns:\n",
    "            a list of tuples (score, similar_domain_name) with the names of\n",
    "            the similar domains as well as the Jaccard distance between\n",
    "            the supplied domain and the matching one.\n",
    "        \"\"\"\n",
    "        domain = utils.strip_www_from_domain(domain)\n",
    "\n",
    "        similar = []\n",
    "        if '.' not in domain:\n",
    "            return similar\n",
    "\n",
    "        if domain in domain_dict:\n",
    "            return similar\n",
    "\n",
    "        if any(domain.endswith('.{0:s}'.format(x)) for x in domain_dict):\n",
    "            return similar\n",
    "\n",
    "        # We want to get rid of the TLD extension of the domain.\n",
    "        # This is only used in the substring match in case the Jaccard\n",
    "        # distance is above the threshold.\n",
    "        domain_items = domain.split('.')\n",
    "        domain_depth = len(domain_items)\n",
    "        domain_part = '.'.join(domain_items[:-1])\n",
    "\n",
    "        minhashes = {}\n",
    "        for index in range(0, domain_depth - 1):\n",
    "            minhashes[domain_depth - index] = _get_minhash_from_domain(\n",
    "                '.'.join(domain_items[index:]))\n",
    "\n",
    "        for watched_domain, watched_item in iter(domain_dict.items()):\n",
    "            watched_hash = watched_item.get('hash')\n",
    "            watched_depth = watched_item.get('depth')\n",
    "\n",
    "            minhash = minhashes.get(watched_depth)\n",
    "            if not minhash:\n",
    "                # The supplied domains length does not match this watched\n",
    "                # domain.\n",
    "                continue\n",
    "            score = watched_hash.jaccard(minhash)\n",
    "            if score < domain_scoring_threshold:\n",
    "                continue\n",
    "\n",
    "            watched_domain_items = watched_domain.split('.')\n",
    "            watched_domain_part = '.'.join(watched_domain_items[:-1])\n",
    "\n",
    "            # Check if there are also any overlapping strings.\n",
    "            sequence = difflib.SequenceMatcher(\n",
    "                None, domain_part, watched_domain_part)\n",
    "            match = sequence.find_longest_match(\n",
    "                0, len(domain_part), 0, len(watched_domain_part))\n",
    "\n",
    "            # We want to have at least half of the domain matching.\n",
    "            # TODO: This can be improved, this is a value and part that\n",
    "            # needs or can be tweaked. Perhaps move this to a config option\n",
    "            # that is the min length of strings.\n",
    "            match_size = min(\n",
    "                int(len(domain_part)/2), int(len(watched_domain_part)/2))\n",
    "            if match.size < match_size:\n",
    "                continue\n",
    "            similar.append((watched_domain, score))\n",
    "\n",
    "        return similar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and to create the watched domain dict, which is an object that simply contains all the watched domains as keys and the value is the MinHash value and the depth. The depth is defined simply as the number of parts within the domain name, eg. www.mbl.is is 3 level deep, whereas cnn.com is 2, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "watched_domains = {}\n",
    "for domain in watch_list:\n",
    "    minhash = _get_minhash_from_domain(domain)\n",
    "    watched_domains[domain] = {\n",
    "        'hash': minhash,\n",
    "        'depth': len(domain.split('.'))\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've got the common domains, let's calculate the jaccard distance (actually [minhash](https://en.wikipedia.org/wiki/MinHash) ) and similarities (overlapping strings) for all of the domains in the history.\n",
    "\n",
    "Since what we do is to take all the letters in the domain name and add it to a set, which we then create the MinHash value from and use for comparison the domains abc.com and bacccab.moc will be considered the same (same letters in it), that's why we add the overlapping string check to make sure the domains are similar."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can go through the domains we've got and find \"similar\" domains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "entries = []\n",
    "for domain in network_data.domain.unique():\n",
    "    similar_domains = _get_similar_domains(domain, watched_domains)\n",
    "    if not similar_domains:\n",
    "        continue\n",
    "\n",
    "    print('Domain: {0:s} does have similar domains discovered:'.format(domain))\n",
    "    for similarities in similar_domains:\n",
    "        s_domain, s_score = similarities\n",
    "        print('    [{0:s}] - {1:0.2f}%'.format(s_domain, s_score * 100))\n",
    "        entry = {'domain': domain, 'watched_domain': s_domain, 'score': s_score}\n",
    "        entries.append(entry)\n",
    "    print('---')\n",
    "similar_domains = pd.DataFrame(entries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can also be looked at visually.\n",
    "\n",
    "As a table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "similar_domains.sort_values('score', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we see that someone is visiting a URL that looks very similar to our corp URL...\n",
    "\n",
    "This could trigger a detection."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or a chart (might not be the best representation) ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "similar_domains['percentage'] = similar_domains.score * 100\n",
    "alt.Chart(similar_domains).mark_point().encode(\n",
    "    y='domain', x='watched_domain', size='percentage',\n",
    "    color='watched_domain', fill='watched_domain',\n",
    "    tooltip=['watched_domain','domain', 'percentage'],\n",
    ").properties(\n",
    "    width=600, height=400,\n",
    "    title='Similar Domains'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Timesketch Demo\n",
    "\n",
    "Small TS demo. Let's start by connecting to our demo server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = timesketch_client.TimesketchApi('https://demo.timesketch.org', 'demo', 'demo')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can take a look at what sketches are available, and let's pick one of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for index, sketch in enumerate(client.list_sketches()):\n",
    "    print('[{0:d}] ID: {1:d} - {2:s} <{3:s}>'.format(index, sketch.id, sketch.name, sketch.description))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case we want to get the data from the Greendale incident."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sketch = client.get_sketch(238)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start to check whether or not there are saved views."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lines = []\n",
    "for view in sketch.list_views():\n",
    "    view_dict = {\n",
    "        'id': view.id,\n",
    "        'name': view.name}\n",
    "    lines.append(view_dict)\n",
    "\n",
    "pd.DataFrame(lines)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at what browser searches were made."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "view = sketch.get_view(view_id=2010)\n",
    "\n",
    "results = sketch.explore(\n",
    "    view=view, as_pandas=True,\n",
    "    return_fields='datetime,message,timestamp_desc,source,source_short,label,tag,tags,url,domain,search_string')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at some of the browser searches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results.search_string.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a look at the frequency of search queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "results['date'] = pd.to_datetime(results.datetime)\n",
    "results['day'] = results.date.dt.strftime('%Y%m%d')\n",
    "r_group = results[['day', 'datetime']].groupby(by='day', as_index=False)\n",
    "r_count = r_group.count()\n",
    "r_count['count'] = r_count['datetime']\n",
    "del r_count['datetime']\n",
    "\n",
    "alt.Chart(r_count, width=500, height=300).mark_line(point=True).encode(\n",
    "    x='day',\n",
    "    y='count',\n",
    "    tooltip=['day', 'count'],\n",
    ").properties(\n",
    "    title='Search Queries per day'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Problem with this is the fact that there are days with no search queries, that make the chart look a bit odd. Let's fill the values so that we have each day represented in the chart."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's skip the first day, since that was in the year 2000....\n",
    "r = r_count[1:]\n",
    "\n",
    "# We need to find out the first and last day.\n",
    "r_first = r.day.values[0]\n",
    "r_end = r.day.values[-1]\n",
    "\n",
    "# Let's create a timestamp from the first day.\n",
    "year = r_first[0:4]\n",
    "month = r_first[4:6]\n",
    "day = r_first[6:]\n",
    "first_day = datetime.datetime(year=int(year), month=int(month), day=int(day))\n",
    "\n",
    "# Now we can create a list of all dates between first and last date.\n",
    "all_days = []\n",
    "cur_day = first_day\n",
    "while True:\n",
    "    day_string = cur_day.strftime('%Y%m%d')\n",
    "    if day_string == r_end:\n",
    "        break\n",
    "    all_days.append(day_string)\n",
    "    cur_day = cur_day + datetime.timedelta(days=1)\n",
    "\n",
    "# Let's create sets, one for all days, the second for the current days.\n",
    "cur_days = set([str(x) for x in r.day.values])\n",
    "all_days = set(all_days)\n",
    "# Now we can easily find a list of all missing dates.\n",
    "missing_days = all_days.difference(cur_days)\n",
    "\n",
    "# Let's create a data frame that contains just the missing dates, with a count of zero.\n",
    "lines = []\n",
    "for day in missing_days:\n",
    "    line = {'day': str(day), 'count': 0}\n",
    "    lines.append(line)\n",
    "    \n",
    "df = pd.DataFrame(lines)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can repeat what we did before, just this time with filled values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r_filled = r.append(df, sort=True)\n",
    "\n",
    "alt.Chart(r_filled, width=900, height=300).mark_line(point=True).encode(\n",
    "    x=alt.X('day', axis=alt.Axis(title='Day of Query')),\n",
    "    y='count',\n",
    "    tooltip=['day', 'count'],\n",
    ").properties(\n",
    "    title='Search Queries per day'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also just look at saved aggregations..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregations = []\n",
    "for index, agg in enumerate(sketch.list_aggregations()):\n",
    "    print('[{0:d}] - {1:s}, {2:s} <{3:s}> -> {4:s}'.format(index, agg.name, agg.aggregator_name, agg.type, agg.chart_type))\n",
    "    aggregations.append(agg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "They are more or less the same... we can take a look at one of them.\n",
    "\n",
    "As a table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregations[3].table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or as a chart:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregations[3].chart"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we shall do something different. Let's look at all executions that are recorded from our prefetch parser. Once we've get these records we can then search for whether they appear somewhere in our web history.. that is whether there is a link between prefetch and browser history (this will miss ZIP files or other compressed files that get downloaded)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = sketch.explore(\n",
    "    query_string='parser:\"prefetch\"', as_pandas=True,\n",
    "    return_fields='datetime,timestamp_desc,source,source_short,message,executable')\n",
    "\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've got all the prefetch data, let's look at all executables and do a query for browser traffic that contains those names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "browser_traffic = pd.DataFrame()\n",
    "\n",
    "for executable in data.executable.unique():\n",
    "    if not executable:\n",
    "        continue\n",
    "    if executable is np.nan:\n",
    "        continue\n",
    "    if not executable.lower().endswith('.exe'):\n",
    "        continue\n",
    "    exec_data = sketch.explore(\n",
    "        query_string='url:\"*{0:s}*\"'.format(executable.lower()),\n",
    "        return_fields='datetime,timestamp_desc,source,source_short,message,domain,url',\n",
    "        as_pandas=True)\n",
    "    browser_traffic = pd.concat([browser_traffic, exec_data])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's combine the two data frames, that is the browser traffic and the prefetch data. We'll then sort it by dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined = pd.concat([browser_traffic, data], sort=False)\n",
    "c_sorted = combined.sort_values(by='datetime')\n",
    "c_sorted['date'] = pd.to_datetime(c_sorted.datetime)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First of all, let's just look at VPN.exe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c_sorted[c_sorted.message.str.contains('vpn.exe', case=False)][['date', 'timestamp_desc', 'message', 'url', 'executable', 'domain']][2:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And to examine all the hits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c_sorted[['date', 'timestamp_desc', 'message', 'url', 'executable']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also extract the executable from the web history."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import six\n",
    "\n",
    "def check_row(row):\n",
    "    executable = row.executable\n",
    "    if isinstance(executable, six.text_type):\n",
    "        if not executable:\n",
    "            return np.nan\n",
    "        return executable.lower()\n",
    "    url = row.url\n",
    "    if not isinstance(url, six.text_type):\n",
    "        return np.nan\n",
    "    _, _, last_part = url.rpartition('/')\n",
    "    if '&' in last_part:\n",
    "        last_part, _, _ = last_part.partition('&')\n",
    "    if '#' in last_part:\n",
    "        last_part, _, _ = last_part.partition('#')\n",
    "        \n",
    "    if not last_part:\n",
    "        return np.nan\n",
    "    return last_part\n",
    "\n",
    "c_sorted['new_exec'] = c_sorted.apply(check_row, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can look at the most common applications"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c_sorted['day'] = c_sorted.date.dt.strftime('%Y%m%d')\n",
    "c_sorted_data = c_sorted[~c_sorted.new_exec.isna()]\n",
    "\n",
    "c_sorted_data.new_exec.value_counts()[:10]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then look at vpn.exe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c_sorted_data[c_sorted_data.new_exec == 'vpn.exe']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can do this as a chart..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c_group = c_sorted[['day', 'date', 'new_exec']].groupby(by=['new_exec','day'], as_index=False)\n",
    "c_count = c_group.count()\n",
    "c_count['count'] = c_count.date\n",
    "c_count['executable'] = c_count['new_exec']\n",
    "del c_count['date']\n",
    "\n",
    "alt.Chart(c_count).mark_point(filled=True).encode(\n",
    "    x='day',\n",
    "    y='executable',\n",
    "    size='count',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK.. these are all... but what about those executables that appear BOTH in a URL and Prefetch.\n",
    "\n",
    "To be able to do that, we'll first add the executable field to the two data frames from before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "browser_traffic.head(10)\n",
    "\n",
    "def get_exec(url):\n",
    "    items = [x.lower() for x in url.split('/')]\n",
    "    executable = ''\n",
    "    for item in items:\n",
    "        if not '.exe' in item:\n",
    "            continue\n",
    "        executable = item\n",
    "    if '#' in executable:\n",
    "        executable, _, _ = executable.partition('#')\n",
    "    if '&' in executable:\n",
    "        executable, _, _ = executable.partition('&')\n",
    "    if '\\\\' in executable:\n",
    "        _, _, executable = executable.rpartition('\\\\')\n",
    "    return executable\n",
    "\n",
    "browser_traffic['new_exec'] = browser_traffic.url.apply(get_exec)\n",
    "\n",
    "def get_exec(exec_string):\n",
    "    if not isinstance(exec_string, six.text_type):\n",
    "        return exec_string\n",
    "    return exec_string.lower()\n",
    "\n",
    "data['new_exec'] = data.executable.apply(get_exec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can merge the two data frames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_df = data.merge(browser_traffic, on='new_exec')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then we can do the same as we did above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_df['date'] = pd.to_datetime(merged_df['datetime_x'])\n",
    "merged_df['day'] = merged_df.date.dt.strftime('%Y%m%d')\n",
    "m_group = merged_df[['day', 'date', 'new_exec']].groupby(by=['new_exec','day'], as_index=False)\n",
    "m_count = m_group.count()\n",
    "m_count['count'] = m_count.date\n",
    "m_count['executable'] = m_count['new_exec']\n",
    "del m_count['date']\n",
    "\n",
    "alt.Chart(m_count, width=300, height=200).mark_point(filled=True).encode(\n",
    "    x='day',\n",
    "    y='executable',\n",
    "    size='count',\n",
    "    color='executable'\n",
    ").properties(\n",
    "    title='Executables that appear both in Prefetch and Internet history'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search by label.\n",
    "Let's look at some other part of the API, let's look at starred events."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sketch.search_by_label('__ts_star', as_pandas=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}