{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# urlExpander Quickstart\n",
    "View this notebook on [NBViewer](http://nbviewer.jupyter.org/github/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb?flush_cache=true) or [GitHub](https://github.com/SMAPPNYU/urlExpander/blob/master/examples/quickstart.ipynb) | Run it interactively on\n",
    "[Binder](https://mybinder.org/v2/gh/SMAPPNYU/urlExpander/master?filepath=examples%2Fquickstart.ipynb) <br>\n",
    "By [Leon Yin](leonyin.org) for [SMaPP NYU](https://wp.nyu.edu/smapp/)\n",
    "\n",
    "\n",
    "[urlExpander](https://github.com/SMAPPNYU/urlExpander) is a Python package for quickly and thoroughly expanding URLs.\n",
    "\n",
    "You can install the software using pip (`pip install urlexpander`). The cell below imports the package and prints the version used in this notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Updated 2018-12-20 13:34:52.044480\n",
      "By QuickStart User\n",
      "Using Python 3.6.5\n",
      "On Linux-3.10.0-514.10.2.el7.x86_64-x86_64-with-centos-7.3.1611-Core\n",
      "This notebook is using urlExpander v0.0.34\n"
     ]
    }
   ],
   "source": [
    "import urlexpander\n",
    "from runtimestamp.runtimestamp import runtimestamp\n",
    "runtimestamp('QuickStart User')\n",
    "print(f\"This notebook is using urlExpander v{urlexpander.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a toy example of some URLs taken from Congressional Twitter accounts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "urls = [\n",
    "    'https://trib.al/xXI5ruM',\n",
    "    'http://bit.ly/1Sv81cj',\n",
    "    'https://www.youtube.com/watch?v=8NwKcfXvGl4',\n",
    "    'https://t.co/zNU1eHhQRn',\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `expand` function (see the code) to unshorten any link:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urlexpander.expand(urls[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It also works on any list of URLs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',\n",
       " 'http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__',\n",
       " 'https://www.youtube.com/watch?v=8NwKcfXvGl4',\n",
       " 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urlexpander.expand(urls)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To save compute time, we can skip links that don't need to be expanded.<br>\n",
    "The `is_short` function takes any URL and checks whether its domain is on a known list of link shorteners:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://bit.ly/1Sv81cj returns:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f\"{urls[1]} returns:\")\n",
    "urlexpander.is_short(urls[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "bit.ly is probably the best-known link shortener. YouTube.com, however, is not a link shortener!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f\"{urls[2]} returns:\")\n",
    "urlexpander.is_short(urls[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "urlExpander takes advantage of a list of known domains that offer link shortening services."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "85\n"
     ]
    }
   ],
   "source": [
    "known_shorteners = urlexpander.constants.all_short_domains.copy()\n",
    "print(len(known_shorteners))"
   ]
  },
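  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conceptually, a check like `is_short` is a lookup of the link's domain against this list. The cell below is a minimal sketch of that idea using only the standard library. The helper `looks_short` and the three-domain set are hypothetical stand-ins for illustration, and urlExpander's own implementation may differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urlparse\n",
    "\n",
    "# Illustrative subset only (assumption): the full list lives in\n",
    "# urlexpander.constants.all_short_domains.\n",
    "SHORT_DOMAINS = {'bit.ly', 't.co', 'trib.al'}\n",
    "\n",
    "def looks_short(url, domains=SHORT_DOMAINS):\n",
    "    '''Hypothetical helper: True if the URL's host is in `domains`.'''\n",
    "    host = urlparse(url).netloc.lower()\n",
    "    # Drop a leading 'www.' so 'www.youtube.com' compares as 'youtube.com'\n",
    "    if host.startswith('www.'):\n",
    "        host = host[4:]\n",
    "    return host in domains\n",
    "\n",
    "[looks_short(u) for u in ['http://bit.ly/1Sv81cj',\n",
    "                          'https://www.youtube.com/watch?v=8NwKcfXvGl4']]"
   ]
  },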
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can modify this list or pass your own `list_of_domains` as an argument to the `is_short` function or to `is_short_domain` (which is faster and operates at the domain level)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "known_shorteners += ['youtube.com']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Now https://www.youtube.com/watch?v=8NwKcfXvGl4 returns:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f\"Now {urls[2]} returns:\")\n",
    "urlexpander.is_short(urls[2], list_of_domains=known_shorteners)  # use the modified list instead of the default"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can shorten our workload:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://trib.al/xXI5ruM', 'http://bit.ly/1Sv81cj', 'https://t.co/zNU1eHhQRn']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# keep only the links that need to be expanded\n",
    "urls_to_shorten = [link for link in urls if urlexpander.is_short(link)]\n",
    "urls_to_shorten"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "urlExpander's `expand` function does the heavy lifting to quickly and thoroughly expand a list of links:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',\n",
       " 'http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__',\n",
       " 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "expanded_urls = urlexpander.expand(urls_to_shorten)\n",
    "expanded_urls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Note that URLs that resolve to defunct pages still return the domain name, followed by the type of error wrapped in double underscores, e.g. `http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__`.</i>\n",
    "\n",
    "Instead of filtering the inputs before running the `expand` function, you can assign a filter using the `filter_function` argument.<br>\n",
    "Filter functions can be any boolean function that operates on a string. Below is an example function that filters for t.co links:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def custom_filter(url):\n",
    "    '''This function returns True if the url is a shortened Twitter URL'''\n",
    "    if urlexpander.get_domain(url) == 't.co':\n",
    "        return True\n",
    "    else:\n",
    "        return False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/1 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 1 links to unshorten\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:00<00:00,  3.09it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['https://trib.al/xXI5ruM',\n",
       " 'http://bit.ly/1Sv81cj',\n",
       " 'https://www.youtube.com/watch?v=8NwKcfXvGl4',\n",
       " 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "resolved_links = urlexpander.expand(urls, \n",
    "                                    filter_function=custom_filter, \n",
    "                                    verbose=1)\n",
    "resolved_links"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although filtering within the `expand` function is convenient, run time still depends on how many links pass the filter. The `is_short` filter below lets three links through instead of one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/1 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 3 links to unshorten\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:02<00:00,  2.68s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['https://www.breitbart.com/video/2017/12/31/lindsey-graham-trump-just-cant-tweet-iran/',\n",
       " 'http://www.billshusterforcongress.com/__CONNECTIONPOOL_ERROR__',\n",
       " 'https://www.youtube.com/watch?v=8NwKcfXvGl4',\n",
       " 'http://www.nfib.com/content/press-release/elections/small-business-endorses-shuster-for-reelection-73730/?utm_campaign=Advocacy&utm_source=Twitter&utm_medium=Social']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "resolved_links = urlexpander.expand(urls,  \n",
    "                                    filter_function=urlexpander.is_short,\n",
    "                                    verbose=1)\n",
    "resolved_links"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But that is a toy example. Let's see how this fares with a larger dataset.<br>\n",
    "This package comes with a [sampled dataset](https://github.com/SMAPPNYU/urlExpander/blob/master/urlexpander/core/datasets.py#L8-L29) of links extracted from Twitter accounts from the 115th Congress. <br>\n",
    "If you work with Twitter data you'll be glad to know there is a function `urlexpander.tweet_utils.get_link` for creating a similar dataset from Tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The dataset has 10000 rows\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>link_domain</th>\n",
       "      <th>link_url_long</th>\n",
       "      <th>link_url_short</th>\n",
       "      <th>tweet_created_at</th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>tweet_text</th>\n",
       "      <th>user_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>9998</th>\n",
       "      <td>facebook.com</td>\n",
       "      <td>https://www.facebook.com/theDanRather/posts/10...</td>\n",
       "      <td>https://t.co/VOiuOXFi1P</td>\n",
       "      <td>Tue Jun 20 21:36:04 +0000 2017</td>\n",
       "      <td>877278904846888965</td>\n",
       "      <td>RT @DanRather: Nothing I have ever seen approa...</td>\n",
       "      <td>15808765</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9999</th>\n",
       "      <td>bit.ly</td>\n",
       "      <td>http://bit.ly/1YWRIXg</td>\n",
       "      <td>https://t.co/Hz8RojBqOy</td>\n",
       "      <td>Tue Dec 08 19:34:38 +0000 2015</td>\n",
       "      <td>674311141527560197</td>\n",
       "      <td>We need to get people off the sidelines &amp;amp; ...</td>\n",
       "      <td>733751245</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       link_domain                                      link_url_long  \\\n",
       "9998  facebook.com  https://www.facebook.com/theDanRather/posts/10...   \n",
       "9999        bit.ly                              http://bit.ly/1YWRIXg   \n",
       "\n",
       "               link_url_short                tweet_created_at  \\\n",
       "9998  https://t.co/VOiuOXFi1P  Tue Jun 20 21:36:04 +0000 2017   \n",
       "9999  https://t.co/Hz8RojBqOy  Tue Dec 08 19:34:38 +0000 2015   \n",
       "\n",
       "                tweet_id                                         tweet_text  \\\n",
       "9998  877278904846888965  RT @DanRather: Nothing I have ever seen approa...   \n",
       "9999  674311141527560197  We need to get people off the sidelines &amp; ...   \n",
       "\n",
       "        user_id  \n",
       "9998   15808765  \n",
       "9999  733751245  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_congress = urlexpander.datasets.load_congress_twitter_links(nrows=10000)\n",
    "\n",
    "print(f'The dataset has {len(df_congress)} rows')\n",
    "df_congress.tail(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2796270302787247"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "shortened_urls = df_congress[df_congress.link_domain.apply(urlexpander.is_short)].tweet_id.nunique()\n",
    "all_urls = df_congress.tweet_id.nunique()\n",
    "shortened_urls / all_urls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "About 28% of the links are short!<br>\n",
    "The performance of the next script is dependent on your internet connection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Retrieving speedtest.net configuration...\n",
      "Testing from New York University (128.122.215.16)...\n",
      "Retrieving speedtest.net server list...\n",
      "Selecting best server based on ping...\n",
      "Hosted by Speedtest.net (New York City, NY) [2.57 km]: 4.263 ms\n",
      "Testing download speed................................................................................\n",
      "Download: 422.94 Mbit/s\n",
      "Testing upload speed......................................................................................................\n",
      "Upload: 320.82 Mbit/s\n"
     ]
    }
   ],
   "source": [
    "!curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how long it takes to expand these 10k links.<br>\n",
    "\n",
    "This is where the optional parameters for `expand` shine.\n",
    "We can create multiple threads for requests (using `n_workers`), cache results in a JSON file (using `cache_file`), and chunk the input into smaller pieces (using `chunksize`). Why does this last part matter? When expanding links en masse, I noticed that performance degrades over time. Chunking the input prevents this from happening (not sure why though)!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/1 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 1020 links to unshorten\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:20<00:00, 20.96s/it]\n"
     ]
    }
   ],
   "source": [
    "resolved_links = urlexpander.expand(df_congress['link_url_long'], \n",
    "                                    chunksize=1280,\n",
    "                                    n_workers=64, \n",
    "                                    cache_file='temp.json', \n",
    "                                    verbose=1,\n",
    "                                    filter_function=urlexpander.is_short)"
   ]
  },
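  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of chunking concrete, the cell below is a minimal sketch that splits a list into pieces of at most `chunksize` items. It is an illustration only: the helper `chunk` and the placeholder links are hypothetical, and this is not necessarily how urlExpander splits its input internally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk(items, chunksize):\n",
    "    '''Hypothetical helper: yield slices of `items` with at most `chunksize` elements.'''\n",
    "    for start in range(0, len(items), chunksize):\n",
    "        yield items[start:start + chunksize]\n",
    "\n",
    "# 10 placeholder links split into chunks of 4 give pieces of size 4, 4 and 2\n",
    "links = [f'http://example.com/{i}' for i in range(10)]\n",
    "[len(piece) for piece in chunk(links, chunksize=4)]"
   ]
  },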
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At SMaPP, the process of link expansion has been a burden on our research.<br>\n",
    "We hope that this software helps you overcome similar obstacles!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>link_domain</th>\n",
       "      <th>link_url_long</th>\n",
       "      <th>link_url_short</th>\n",
       "      <th>tweet_created_at</th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>tweet_text</th>\n",
       "      <th>user_id</th>\n",
       "      <th>expanded_url</th>\n",
       "      <th>resolved_domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>9998</th>\n",
       "      <td>facebook.com</td>\n",
       "      <td>https://www.facebook.com/theDanRather/posts/10...</td>\n",
       "      <td>https://t.co/VOiuOXFi1P</td>\n",
       "      <td>Tue Jun 20 21:36:04 +0000 2017</td>\n",
       "      <td>877278904846888965</td>\n",
       "      <td>RT @DanRather: Nothing I have ever seen approa...</td>\n",
       "      <td>15808765</td>\n",
       "      <td>https://www.facebook.com/theDanRather/posts/10...</td>\n",
       "      <td>facebook.com</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9999</th>\n",
       "      <td>bit.ly</td>\n",
       "      <td>http://bit.ly/1YWRIXg</td>\n",
       "      <td>https://t.co/Hz8RojBqOy</td>\n",
       "      <td>Tue Dec 08 19:34:38 +0000 2015</td>\n",
       "      <td>674311141527560197</td>\n",
       "      <td>We need to get people off the sidelines &amp;amp; ...</td>\n",
       "      <td>733751245</td>\n",
       "      <td>http://speakerryan.com/__CLIENT_ERROR__</td>\n",
       "      <td>speakerryan.com</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       link_domain                                      link_url_long  \\\n",
       "9998  facebook.com  https://www.facebook.com/theDanRather/posts/10...   \n",
       "9999        bit.ly                              http://bit.ly/1YWRIXg   \n",
       "\n",
       "               link_url_short                tweet_created_at  \\\n",
       "9998  https://t.co/VOiuOXFi1P  Tue Jun 20 21:36:04 +0000 2017   \n",
       "9999  https://t.co/Hz8RojBqOy  Tue Dec 08 19:34:38 +0000 2015   \n",
       "\n",
       "                tweet_id                                         tweet_text  \\\n",
       "9998  877278904846888965  RT @DanRather: Nothing I have ever seen approa...   \n",
       "9999  674311141527560197  We need to get people off the sidelines &amp; ...   \n",
       "\n",
       "        user_id                                       expanded_url  \\\n",
       "9998   15808765  https://www.facebook.com/theDanRather/posts/10...   \n",
       "9999  733751245            http://speakerryan.com/__CLIENT_ERROR__   \n",
       "\n",
       "      resolved_domain  \n",
       "9998     facebook.com  \n",
       "9999  speakerryan.com  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_congress['expanded_url'] = resolved_links\n",
    "df_congress['resolved_domain'] = df_congress['expanded_url'].apply(urlexpander.get_domain)\n",
    "df_congress.tail(2)"
   ]
  },
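  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Failed expansions show up in `expanded_url` with an error tag such as `__CLIENT_ERROR__`. The cell below is a minimal sketch for flagging them. The helper `has_error_tag` and its regular expression are assumptions for illustration, not part of urlExpander; the pattern is based only on the two tags that appear in this notebook. Applied to the DataFrame, `df_congress['expanded_url'].apply(has_error_tag)` would give a boolean column of failed links."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Assumed tag format, inferred from __CLIENT_ERROR__ and __CONNECTIONPOOL_ERROR__\n",
    "ERROR_TAG = re.compile(r'__[A-Z]+_ERROR__$')\n",
    "\n",
    "def has_error_tag(url):\n",
    "    '''Hypothetical helper: True if the expanded URL ends with an error tag.'''\n",
    "    return bool(ERROR_TAG.search(url))\n",
    "\n",
    "examples = [\n",
    "    'http://speakerryan.com/__CLIENT_ERROR__',\n",
    "    'https://www.youtube.com/watch?v=8NwKcfXvGl4',\n",
    "]\n",
    "[has_error_tag(u) for u in examples]"
   ]
  },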
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the top 25 shared domains from this sampled Congress dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "twitter.com               1517\n",
       "house.gov                 1156\n",
       "youtube.com                580\n",
       "facebook.com               524\n",
       "senate.gov                 441\n",
       "instagram.com              176\n",
       "nytimes.com                165\n",
       "washingtonpost.com         157\n",
       "thehill.com                135\n",
       "politico.com                85\n",
       "foxnews.com                 64\n",
       "cnn.com                     64\n",
       "wsj.com                     64\n",
       "twimg.com                   56\n",
       "usatoday.com                46\n",
       "ow.ly                       46\n",
       "washingtonexaminer.com      46\n",
       "huffingtonpost.com          44\n",
       "medium.com                  43\n",
       "speaker.gov                 34\n",
       "healthcare.gov              33\n",
       "gop.gov                     33\n",
       "c-span.org                  33\n",
       "pscp.tv                     31\n",
       "rollcall.com                31\n",
       "Name: resolved_domain, dtype: int64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_congress.resolved_domain.value_counts().head(25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bonus Round!\n",
    "You can count the number of `resolved_domain`s for each `user_id` using `count_matrix()`.<br>\n",
    "You can even choose which domains are counted by modifying the `domain_list` arg:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>facebook.com</th>\n",
       "      <th>youtube.com</th>\n",
       "      <th>twitter.com</th>\n",
       "      <th>google.com</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>user_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>911302336307490816</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>941000686275387392</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>948946378939609089</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    facebook.com  youtube.com  twitter.com  google.com\n",
       "user_id                                                               \n",
       "911302336307490816             0            0            1           0\n",
       "941000686275387392             1            0            2           0\n",
       "948946378939609089             0            1            0           0"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_matrix = urlexpander.tweet_utils.count_matrix(df_congress,\n",
    "                                                    user_col='user_id', \n",
    "                                                    domain_col='resolved_domain', \n",
    "                                                    unique_count_col='tweet_id',\n",
    "                                                    domain_list=['youtube.com','facebook.com', 'google.com', 'twitter.com'])\n",
    "\n",
    "count_matrix.tail(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One domain list you might be interested in is US national media outlets,\n",
    "`datasets.load_us_national_media_outlets()`, compiled by Gregory Eady (Forthcoming)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['abcnews.go.com', 'aim.org', 'alternet.org',\n",
       "       'theamericanconservative.com', 'prospect.org'], dtype=object)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urlexpander.datasets.load_us_national_media_outlets()[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr>\n",
    "We also built a one-size-fits-all scraper that returns the title, description, and/or paragraphs from any given URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart\""
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urlexpander.html_utils.get_webpage_title(urls[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Sunday CBS\\'s \"Face the Nation,\" while discussing the last several\\xa0days of protests in Iran over\\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urlexpander.html_utils.get_webpage_description(urls[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OrderedDict([('url', 'https://trib.al/xXI5ruM'),\n",
       "             ('title',\n",
       "              \"Lindsey Graham to Trump: 'You Just Can't Tweet' About Iran | Breitbart\"),\n",
       "             ('description',\n",
       "              'Sunday CBS\\'s \"Face the Nation,\" while discussing the last several\\xa0days of protests in Iran over\\xa0government corruption, Sen. Lindsey Graham (R-SC) warned | Breitbart TV'),\n",
       "             ('paragraphs',\n",
       "              ['Sunday CBS’s “Face the Nation,” while discussing the last several\\xa0days of protests in Iran over\\xa0government corruption, Sen. Lindsey Graham (R-SC) warned President Donald Trump that he couldn’t “just tweet” about the protests.',\n",
       "               'Graham said, “The Iranian people are not our enemy. The Ayatollah is the enemy of the world. Here is what I would do if I were President Trump. I would explain what a better deal would look like. It’s not enough to watch. President Trump is tweeting very sympathetically to the Iranian people. But you just can’t tweet here. You have to lay out a plan.”',\n",
       "               '<em><span>Follow Pam Key on Twitter <a href=\"https://twitter.com/pamkeyNEN\">@pamkeyNEN</a> </span></em>',\n",
       "               '<a href=\"https://www.facebook.com/Breitbart\"></a>.',\n",
       "               '<small>Copyright © 2018 Breitbart</small>'])])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urlexpander.html_utils.get_webpage_meta(urls[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "Thanks for stumbling upon this package! We hope that it will lead to more research around links.<br>\n",
    "We're working on some projects in this vein and would love to know if you are too!\n",
    "\n",
    "This is an open source package, so please feel free to reach out about bugs, feature requests, or collaboration!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}