{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applications of Regular Expressions\n",
    "\n",
    "**Author:** Bruno Grande\n",
    "\n",
    "**Date:** August 29th, 2017"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this lesson, we will cover a few applications of regular expressions (or regex) that I use all the time. Regex is available in most programming languages, but to keep this lesson accessible to as many people as possible, we will focus on applications in the Bash shell. Specifically, we will cover how you can use `grep`, `sed` and `awk` to get a lot done without writing a script, especially with the power of regex at your side. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The motivation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regular expressions are an extremely powerful tool for pattern matching. You might not realize it, but a lot of what we do is pattern matching, especially if you deal with text at all. The ability to describe a flexible pattern that the computer can then quickly look for in some arbitrary text opens up a world of possibilities. This lesson will focus on some of these possibilities. Notably, we will cover:\n",
    "\n",
    "1. Subsetting text using `grep`\n",
    "2. Search-and-replace text using `sed`\n",
    "3. Filter and/or process tabular data using `awk`\n",
    "\n",
    "These three tools alone justify learning Bash to make your life easier. In combination with regex, they are life-savers! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be using Jenny Bryan's cleaned-up version of the gapminder dataset. It contains 1704 rows and 6 columns. The dataset consists of the population, life expectancy and GDP per capita for 142 countries every 5 years between 1952 and 2007. You can easily download the data using `curl` as follows. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "curl -sL bit.ly/gapm-data > gapminder.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tAsia\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tAsia\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tAsia\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tAsia\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tAsia\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tAsia\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "head gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "----\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Subsetting text using grep"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At its simplest, `grep` can be used to filter lines based on a pattern. We can start with a plain, non-regex pattern. Here, we subset the file to lines that contain the word `Canada`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canada\tAmericas\t1952\t68.75\t14785584\t11367.16112\n",
      "Canada\tAmericas\t1957\t69.96\t17010154\t12489.95006\n",
      "Canada\tAmericas\t1962\t71.3\t18985849\t13462.48555\n",
      "Canada\tAmericas\t1967\t72.13\t20819767\t16076.58803\n",
      "Canada\tAmericas\t1972\t72.88\t22284500\t18970.57086\n",
      "Canada\tAmericas\t1977\t74.21\t23796400\t22090.88306\n",
      "Canada\tAmericas\t1982\t75.76\t25201900\t22898.79214\n",
      "Canada\tAmericas\t1987\t76.86\t26549700\t26626.51503\n",
      "Canada\tAmericas\t1992\t77.95\t28523502\t26342.88426\n",
      "Canada\tAmericas\t1997\t78.61\t30305843\t28954.92589\n",
      "Canada\tAmericas\t2002\t79.77\t31902268\t33328.96507\n",
      "Canada\tAmericas\t2007\t80.653\t33390141\t36319.23501\n"
     ]
    }
   ],
   "source": [
    "grep Canada gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll notice that the header line was removed, because it doesn't contain `Canada`. If we want to ensure that this file remains valid, we need to keep the header. There are multiple ways to do this. \n",
    "\n",
    "First, you can grep the header and the `Canada` lines separately. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Canada\tAmericas\t1952\t68.75\t14785584\t11367.16112\n",
      "Canada\tAmericas\t1957\t69.96\t17010154\t12489.95006\n",
      "Canada\tAmericas\t1962\t71.3\t18985849\t13462.48555\n",
      "Canada\tAmericas\t1967\t72.13\t20819767\t16076.58803\n",
      "Canada\tAmericas\t1972\t72.88\t22284500\t18970.57086\n",
      "Canada\tAmericas\t1977\t74.21\t23796400\t22090.88306\n",
      "Canada\tAmericas\t1982\t75.76\t25201900\t22898.79214\n",
      "Canada\tAmericas\t1987\t76.86\t26549700\t26626.51503\n",
      "Canada\tAmericas\t1992\t77.95\t28523502\t26342.88426\n",
      "Canada\tAmericas\t1997\t78.61\t30305843\t28954.92589\n",
      "Canada\tAmericas\t2002\t79.77\t31902268\t33328.96507\n",
      "Canada\tAmericas\t2007\t80.653\t33390141\t36319.23501\n"
     ]
    }
   ],
   "source": [
    "grep country gapminder.tsv\n",
    "grep Canada gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, you will notice that we are repeating ourselves (the `grep` command and the `gapminder.tsv` file name). Ideally, we want to follow the DRY (don't repeat yourself) principle. \n",
    "\n",
    "**N.B.** An astute reader will notice that I can extract the header using `head -n 1`. Indeed, this would work here, but I am familiar with file formats (_e.g._ VCF, the Variant Call Format) where the header is neither the first line, nor a predictable number of lines into the file. In these cases, `grep` is more general. \n",
    "\n",
    "The second approach involves the use of regex. In fact, `grep` stands for \"globally search a regular expression and print\". However, because there have been multiple versions of regex over the years and we are used to the more modern ones, we will need to use a variant of `grep` that enables extended regex. You can either use `grep -E` or `egrep`. I will be using the latter, although recent versions of GNU grep mark `egrep` as obsolescent and recommend `grep -E` instead. \n",
    "\n",
    "Here, we can start using regex by using the `|` operator, which matches what's on the left **or** on the right. Whenever you use regular expressions, it is safer to quote the pattern using single quotes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Canada\tAmericas\t1952\t68.75\t14785584\t11367.16112\n",
      "Canada\tAmericas\t1957\t69.96\t17010154\t12489.95006\n",
      "Canada\tAmericas\t1962\t71.3\t18985849\t13462.48555\n",
      "Canada\tAmericas\t1967\t72.13\t20819767\t16076.58803\n",
      "Canada\tAmericas\t1972\t72.88\t22284500\t18970.57086\n",
      "Canada\tAmericas\t1977\t74.21\t23796400\t22090.88306\n",
      "Canada\tAmericas\t1982\t75.76\t25201900\t22898.79214\n",
      "Canada\tAmericas\t1987\t76.86\t26549700\t26626.51503\n",
      "Canada\tAmericas\t1992\t77.95\t28523502\t26342.88426\n",
      "Canada\tAmericas\t1997\t78.61\t30305843\t28954.92589\n",
      "Canada\tAmericas\t2002\t79.77\t31902268\t33328.96507\n",
      "Canada\tAmericas\t2007\t80.653\t33390141\t36319.23501\n"
     ]
    }
   ],
   "source": [
    "egrep 'country|Canada' gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we wanted to include the US in our results, it's as simple as adding another `|` operator in our pattern. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Canada\tAmericas\t1952\t68.75\t14785584\t11367.16112\n",
      "Canada\tAmericas\t1957\t69.96\t17010154\t12489.95006\n",
      "Canada\tAmericas\t1962\t71.3\t18985849\t13462.48555\n",
      "Canada\tAmericas\t1967\t72.13\t20819767\t16076.58803\n",
      "Canada\tAmericas\t1972\t72.88\t22284500\t18970.57086\n",
      "Canada\tAmericas\t1977\t74.21\t23796400\t22090.88306\n",
      "Canada\tAmericas\t1982\t75.76\t25201900\t22898.79214\n",
      "Canada\tAmericas\t1987\t76.86\t26549700\t26626.51503\n",
      "Canada\tAmericas\t1992\t77.95\t28523502\t26342.88426\n",
      "Canada\tAmericas\t1997\t78.61\t30305843\t28954.92589\n",
      "Canada\tAmericas\t2002\t79.77\t31902268\t33328.96507\n",
      "Canada\tAmericas\t2007\t80.653\t33390141\t36319.23501\n",
      "United States\tAmericas\t1952\t68.44\t157553000\t13990.48208\n",
      "United States\tAmericas\t1957\t69.49\t171984000\t14847.12712\n",
      "United States\tAmericas\t1962\t70.21\t186538000\t16173.14586\n",
      "United States\tAmericas\t1967\t70.76\t198712000\t19530.36557\n",
      "United States\tAmericas\t1972\t71.34\t209896000\t21806.03594\n",
      "United States\tAmericas\t1977\t73.38\t220239000\t24072.63213\n",
      "United States\tAmericas\t1982\t74.65\t232187835\t25009.55914\n",
      "United States\tAmericas\t1987\t75.02\t242803533\t29884.35041\n",
      "United States\tAmericas\t1992\t76.09\t256894189\t32003.93224\n",
      "United States\tAmericas\t1997\t76.81\t272911760\t35767.43303\n",
      "United States\tAmericas\t2002\t77.31\t287675526\t39097.09955\n",
      "United States\tAmericas\t2007\t78.242\t301139947\t42951.65309\n"
     ]
    }
   ],
   "source": [
    "egrep 'country|Canada|United States' gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`grep` can be as flexible as you need it to be. While it may be contrived, let's say we are interested in the data from 1977 for countries whose names start with `S` (and we want to keep the header). As always, there are multiple ways of approaching this problem. \n",
    "\n",
    "First, we can use UNIX pipes to apply successive filters, one for the countries starting with `S` and another for the rows corresponding to 1977. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1967\t67.946\t1977600\t4977.41854\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Singapore\tAsia\t2002\t78.77\t4197776\t36023.1054\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n",
      "South Africa\tAfrica\t1977\t55.527\t27129932\t8028.651439\n",
      "Spain\tEurope\t1977\t74.39\t36439000\t13236.92117\n",
      "Sri Lanka\tAsia\t1977\t65.949\t14116836\t1348.775651\n",
      "Sudan\tAfrica\t1977\t47.8\t17104986\t2202.988423\n",
      "Swaziland\tAfrica\t1977\t52.537\t551425\t3781.410618\n",
      "Sweden\tEurope\t1977\t75.44\t8251648\t18855.72521\n",
      "Switzerland\tEurope\t1977\t75.39\t6316424\t26982.29052\n",
      "Syria\tAsia\t1977\t61.195\t7932503\t3195.484582\n"
     ]
    }
   ],
   "source": [
    "egrep '^(country|S)' gapminder.tsv | egrep '1977'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of you might have noticed that Singapore comes up three times, while every country is only supposed to show up once at most. Upon closer inspection, you can see why this is happening: the `1977` pattern is appearing in the line within the population number, which is not something we want. \n",
    "\n",
    "The immediate solution to this is to prevent matches of the year within other numbers. In regex, you can specify that word boundaries must be present before and after the number using `\\b`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n",
      "South Africa\tAfrica\t1977\t55.527\t27129932\t8028.651439\n",
      "Spain\tEurope\t1977\t74.39\t36439000\t13236.92117\n",
      "Sri Lanka\tAsia\t1977\t65.949\t14116836\t1348.775651\n",
      "Sudan\tAfrica\t1977\t47.8\t17104986\t2202.988423\n",
      "Swaziland\tAfrica\t1977\t52.537\t551425\t3781.410618\n",
      "Sweden\tEurope\t1977\t75.44\t8251648\t18855.72521\n",
      "Switzerland\tEurope\t1977\t75.39\t6316424\t26982.29052\n",
      "Syria\tAsia\t1977\t61.195\t7932503\t3195.484582\n"
     ]
    }
   ],
   "source": [
    "egrep '^(country|S)' gapminder.tsv | egrep '\\b1977\\b'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, this solves our problem, but we lost the header again. To get it back, we need to include the `country` pattern in both commands, which is slightly repetitive. \n",
    "\n",
    "**N.B.** Our current solution to filtering for observations made in 1977 is imperfect, because we are filtering on the presence of 1977 anywhere in the line. Technically, if a country had a population or GDP per capita of 1977 at some point, that row would be included in the output. Later, we will see how we can use `awk` to apply regex on specific columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n",
      "South Africa\tAfrica\t1977\t55.527\t27129932\t8028.651439\n",
      "Spain\tEurope\t1977\t74.39\t36439000\t13236.92117\n",
      "Sri Lanka\tAsia\t1977\t65.949\t14116836\t1348.775651\n",
      "Sudan\tAfrica\t1977\t47.8\t17104986\t2202.988423\n",
      "Swaziland\tAfrica\t1977\t52.537\t551425\t3781.410618\n",
      "Sweden\tEurope\t1977\t75.44\t8251648\t18855.72521\n",
      "Switzerland\tEurope\t1977\t75.39\t6316424\t26982.29052\n",
      "Syria\tAsia\t1977\t61.195\t7932503\t3195.484582\n"
     ]
    }
   ],
   "source": [
    "egrep '^(country|S)' gapminder.tsv | egrep '(country|\\b1977\\b)'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, we can combine our patterns into one regex. Admittedly, there is no compelling advantage in doing so other than avoiding needless commands wherever possible. For this, we need to acknowledge that the year will always come after the country name by some number of characters. We can specify \"any number of characters\" in regex using `.*`, which matches zero or more of any character. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n",
      "South Africa\tAfrica\t1977\t55.527\t27129932\t8028.651439\n",
      "Spain\tEurope\t1977\t74.39\t36439000\t13236.92117\n",
      "Sri Lanka\tAsia\t1977\t65.949\t14116836\t1348.775651\n",
      "Sudan\tAfrica\t1977\t47.8\t17104986\t2202.988423\n",
      "Swaziland\tAfrica\t1977\t52.537\t551425\t3781.410618\n",
      "Sweden\tEurope\t1977\t75.44\t8251648\t18855.72521\n",
      "Switzerland\tEurope\t1977\t75.39\t6316424\t26982.29052\n",
      "Syria\tAsia\t1977\t61.195\t7932503\t3195.484582\n"
     ]
    }
   ],
   "source": [
    "egrep '^(country|S).*\\b1977\\b' gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "### Challenge Question 1\n",
    "\n",
    "Why is the header missing in the output of the above command?\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a file with a list of countries of interest for the purposes of this demo. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canada\n",
      "Italy\n",
      "Australia\n",
      "United States\n",
      "England\n",
      "France\n"
     ]
    }
   ],
   "source": [
    "echo -e 'Canada\\nItaly\\nAustralia\\nUnited States\\nEngland\\nFrance' > countries.txt\n",
    "cat countries.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given this list, we can easily filter the gapminder dataset for observations made for these countries. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Australia\tOceania\t1952\t69.12\t8691212\t10039.59564\n",
      "Australia\tOceania\t1957\t70.33\t9712569\t10949.64959\n",
      "Australia\tOceania\t1962\t70.93\t10794968\t12217.22686\n",
      "Australia\tOceania\t1967\t71.1\t11872264\t14526.12465\n",
      "Australia\tOceania\t1972\t71.93\t13177000\t16788.62948\n",
      "Australia\tOceania\t1977\t73.49\t14074100\t18334.19751\n",
      "Australia\tOceania\t1982\t74.74\t15184200\t19477.00928\n",
      "Australia\tOceania\t1987\t76.32\t16257249\t21888.88903\n",
      "Australia\tOceania\t1992\t77.56\t17481977\t23424.76683\n",
      "Australia\tOceania\t1997\t78.83\t18565243\t26997.93657\n"
     ]
    }
   ],
   "source": [
    "egrep -f countries.txt gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we've seen how we can use `grep` to subset the lines in a file according to a certain pattern. Another useful feature of `grep` is its quiet mode, which can be used in conjunction with Bash conditional expressions. \n",
    "\n",
    "First, let's review Bash if statements. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "false\n"
     ]
    }
   ],
   "source": [
    "if [[ 1 -gt 2 ]]; then\n",
    "    echo 'true'\n",
    "else\n",
    "    echo 'false'\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, `[[ 1 -gt 2 ]]` is actually a command that evaluates the expression inside the square brackets. We use `-gt` because, inside `[[ ]]`, it compares integers, whereas `>` compares strings lexicographically. This portion of the if statement in Bash can be any command. A command evaluates as true if its exit code is zero (_i.e._ the command was successful). Otherwise, it's considered false. \n",
    "\n",
    "To show this, I will run the commands `true` and `false`, which respectively return exit codes 0 and 1. \n",
    "\n",
    "**N.B.** The `$?` is a useful variable that contains the exit code of the most recently run command. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exit code:  0\n",
      "Considered true\n"
     ]
    }
   ],
   "source": [
    "if true; then\n",
    "    echo 'Exit code: ' $?\n",
    "    echo 'Considered true'\n",
    "else\n",
    "    echo 'Exit code: ' $?\n",
    "    echo 'Considered false'\n",
    "fi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exit code:  1\n",
      "Considered false\n"
     ]
    }
   ],
   "source": [
    "if false; then\n",
    "    echo 'Exit code: ' $?\n",
    "    echo 'Considered true'\n",
    "else\n",
    "    echo 'Exit code: ' $?\n",
    "    echo 'Considered false'\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's say you're interested in running a command only if a file contains some pattern. You can use `grep` in quiet mode inside an if statement, as follows. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canada is in the countries.txt file :D\n"
     ]
    }
   ],
   "source": [
    "if egrep -q 'Canada' countries.txt; then\n",
    "    echo 'Canada is in the countries.txt file :D'\n",
    "else\n",
    "    echo 'Canada is not in the countries.txt file :('\n",
    "fi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Switzerland is not in the countries.txt file :(\n"
     ]
    }
   ],
   "source": [
    "if egrep -q 'Switzerland' countries.txt; then\n",
    "    echo 'Switzerland is in the countries.txt file :D'\n",
    "else\n",
    "    echo 'Switzerland is not in the countries.txt file :('\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we are just echoing some text, but you can do whatever you want once you know a file matches a pattern. A nice thing about quiet mode is that grep stops searching as soon as it encounters the first instance of the pattern. \n",
    "\n",
    "If you wanted to count how many lines in a file match a pattern, you can certainly pipe the output of `grep` to `wc -l`. You can be slightly more efficient by avoiding the extra command and using the `-c` option in `grep`. Note that both approaches count matching lines rather than individual matches. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      12\n"
     ]
    }
   ],
   "source": [
    "grep 'Canada' gapminder.tsv | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n"
     ]
    }
   ],
   "source": [
    "grep -c 'Canada' gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, for all of the above `grep` commands, you can invert the search using the `-v` option. In other words, if you want all lines except for those containing \"Canada\" or \"United States\", you can simply do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canada\n",
      "Italy\n",
      "Australia\n",
      "United States\n",
      "England\n",
      "France\n"
     ]
    }
   ],
   "source": [
    "cat countries.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Italy\n",
      "Australia\n",
      "England\n",
      "France\n"
     ]
    }
   ],
   "source": [
    "egrep -v 'Canada|United States' countries.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "### Challenge Question 2\n",
    "\n",
    "Write an if statement in Bash that checks if there are any countries that start with the letter \"Z\" outside of Africa, and echoes the response accordingly. \n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "----\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search-and-replace text using sed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we've seen `grep`'s amazing ability to subset lines in a file according to a pattern, which can be as complex as you can conjure. Now, we're going to introduce `sed`, which is probably best known for its ability to perform search-and-replace really easily at the command line. \n",
    "\n",
    "Let's remind ourselves of what's in our `gapminder.tsv` file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tAsia\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tAsia\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tAsia\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tAsia\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tAsia\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tAsia\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "head gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To start off with a simple example to examine the structure of a `sed` command, we are going to replace every instance of \"United States\" with \"USA\". Here, we will count instances of each term before and after we apply `sed` to confirm the change. \n",
    "\n",
    "In general, we need to ensure that modern regular expressions are enabled in `sed`. Unfortunately, this option varies based on your platform. Typically, it's `-E` on Macs and `-r` on Linux, although recent versions of GNU `sed` also accept `-E`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sed -E 's/United States/USA/' gapminder.tsv > gapminder.usa.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Before sed\n",
      "12\n",
      "0\n",
      "After sed\n",
      "0\n",
      "12\n"
     ]
    }
   ],
   "source": [
    "echo 'Before sed'\n",
    "grep -c 'United States' gapminder.tsv\n",
    "grep -c 'USA' gapminder.tsv\n",
    "\n",
    "echo 'After sed'\n",
    "grep -c 'United States' gapminder.usa.tsv\n",
    "grep -c 'USA' gapminder.usa.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the search-and-replace worked. The general form of a `sed` search-and-replace is as follows:\n",
    "\n",
    "```\n",
    "sed -E 's/what_you_want_to_replace/what_you_want_to_replace_with/' input_file.txt > output_file.txt\n",
    "```\n",
    "\n",
    "Just in case you're still skeptical, we'll apply the same change on our small `countries.txt` file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canada\n",
      "Italy\n",
      "Australia\n",
      "USA\n",
      "England\n",
      "France\n"
     ]
    }
   ],
   "source": [
    "sed -E 's/United States/USA/' countries.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The initial `s` is necessary to indicate the search-and-replace command within `sed`. There are other commands that we won't see today, such as insert (`i`) and delete (`d`). The slashes are used to delimit the `what_you_want_to_replace` from the `what_you_want_to_replace_with`. The delimiter can actually be almost any character you want (other than a backslash or a newline), as long as you're consistent. \n",
    "\n",
    "For example, you can use colons (`:`) instead. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Canada\n",
      "Italy\n",
      "Australia\n",
      "USA\n",
      "England\n",
      "France\n"
     ]
    }
   ],
   "source": [
    "sed -E 's:United States:USA:' countries.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The character you use is not that important. One thing to consider is that if the character you choose appears in the regex or in the replacement text, you will need to escape it with a backslash. That's why I generally stick with slashes as my delimiter in `sed` commands unless I'm dealing with file paths as my input text (which commonly include slashes), in which case I will switch to colons or vertical bars. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's move on to a slightly more complex change. We are going to replace every period (`.`) with a comma (`,`), as we want to send our data to a collaborator in France, where they use commas instead of periods in decimal numbers. \n",
    "\n",
    "There is an important thing we need to handle: there might be multiple instances of a period on the same line. By default, `sed` will only replace the first instance of a pattern per line. If we want to replace every instance, we'll need to enable the global mode by adding a `g` at the end of the `sed` command. \n",
    "\n",
    "**N.B.** Recall that the period in regex has special meaning and matches any character. If we want to match an actual period, we need to escape it using a backslash. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tAsia\t1952\t28,801\t8425333\t779,4453145\n",
      "Afghanistan\tAsia\t1957\t30,332\t9240934\t820,8530296\n",
      "Afghanistan\tAsia\t1962\t31,997\t10267083\t853,10071\n",
      "Afghanistan\tAsia\t1967\t34,02\t11537966\t836,1971382\n",
      "Afghanistan\tAsia\t1972\t36,088\t13079460\t739,9811058\n",
      "Afghanistan\tAsia\t1977\t38,438\t14880372\t786,11336\n",
      "Afghanistan\tAsia\t1982\t39,854\t12881816\t978,0114388\n",
      "Afghanistan\tAsia\t1987\t40,822\t13867957\t852,3959448\n",
      "Afghanistan\tAsia\t1992\t41,674\t16317921\t649,3413952\n"
     ]
    }
   ],
   "source": [
    "sed -E 's/\\./,/g' gapminder.tsv > gapminder.comma.tsv\n",
    "head gapminder.comma.tsv"
   ]
  },
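  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the `g` flag and the backslash each contribute, here is a small sketch that runs three variants on short strings (the numbers are borrowed from the first data row; the variable names are only for illustration). \n",
    "\n",
    "```bash\n",
    "line='28.801 and 779.4453145'\n",
    "\n",
    "# Without g, only the first period on the line is replaced\n",
    "first_only=$(echo \"$line\" | sed -E 's/\\./,/')\n",
    "\n",
    "# With g, every period on the line is replaced\n",
    "every=$(echo \"$line\" | sed -E 's/\\./,/g')\n",
    "\n",
    "# Without the backslash, the period matches any character\n",
    "unescaped=$(echo '1.5' | sed -E 's/./,/g')\n",
    "\n",
    "echo \"$first_only\"   # 28,801 and 779.4453145\n",
    "echo \"$every\"        # 28,801 and 779,4453145\n",
    "echo \"$unescaped\"    # ,,,\n",
    "```"
   ]
  },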
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "### Challenge Question 3\n",
    "\n",
    "Write a `sed` command that replaces are continent names with \"Pangaea\".\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can easily chain multiple search-and-replace commands by using the `-e` option. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "USA\tAmericas\t1952\t68,44\t157553000\t13990,48208\n",
      "USA\tAmericas\t1957\t69,49\t171984000\t14847,12712\n",
      "USA\tAmericas\t1962\t70,21\t186538000\t16173,14586\n",
      "USA\tAmericas\t1967\t70,76\t198712000\t19530,36557\n",
      "USA\tAmericas\t1972\t71,34\t209896000\t21806,03594\n",
      "USA\tAmericas\t1977\t73,38\t220239000\t24072,63213\n",
      "USA\tAmericas\t1982\t74,65\t232187835\t25009,55914\n",
      "USA\tAmericas\t1987\t75,02\t242803533\t29884,35041\n",
      "USA\tAmericas\t1992\t76,09\t256894189\t32003,93224\n"
     ]
    }
   ],
   "source": [
    "sed -E -e 's/United States/USA/' -e 's/\\./,/g' gapminder.tsv > gapminder.usa_and_comma.tsv\n",
    "egrep 'country|USA' gapminder.usa_and_comma.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perhaps one of the most powerful features of sed and regex when doing search-and-replace is backreferences. They allow you to search for something and replace it with something that includes what was originally matched. I think the best way to explain this is to demonstrate backreferences in action. Our contrived example is to match the country name at the beginning of each line and duplicating it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country_country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan_Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan_Afghanistan\tAsia\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan_Afghanistan\tAsia\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan_Afghanistan\tAsia\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan_Afghanistan\tAsia\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan_Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan_Afghanistan\tAsia\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan_Afghanistan\tAsia\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan_Afghanistan\tAsia\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "sed -E 's/^([^\\t]+)/\\1_\\1/' gapminder.tsv > gapminder.double_country.tsv\n",
    "head gapminder.double_country.tsv"
   ]
  },
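  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Backreferences become more useful once there is more than one group. As a small sketch (the input string and variable names are only for illustration), the command below captures a surname and a given name separately and swaps them. \n",
    "\n",
    "```bash\n",
    "name='Bryan, Jenny'\n",
    "\n",
    "# Group 1 captures everything before the comma, group 2 everything after \", \"\n",
    "swapped=$(echo \"$name\" | sed -E 's/^([^,]+), (.+)$/\\2 \\1/')\n",
    "\n",
    "echo \"$swapped\"   # Jenny Bryan\n",
    "```"
   ]
  },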
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "### Challenge Question 4\n",
    "\n",
    "Use backreferences to get rid of all decimal digits. Don't worry about rounding up or down; just take the floor of the number. \n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "----\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter and/or process tabular data using awk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last tool we will cover today is `awk`. This tool combines the features of `grep` and `sed` and makes them more useful in the context of tabular data, such as our `gapminder.tsv` file consisting of six tab-delimited columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tAsia\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tAsia\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tAsia\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tAsia\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tAsia\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tAsia\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "head gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- FS and OFS\n",
    "- Print subset of columns\n",
    "- Conditionally print lines\n",
    "- sub and gensub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first thing you need to configure with `awk` is the field separator (`FS`), which is what separates the columns in each line. Typically, we use comma- or tab-delimited files. In this case, `gapminder.tsv` uses tabs. We also configure the output field separator (`OFS`) to be the same character. Notice that we use single quotes again to avoid unintended issues down the line. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"}' gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `BEGIN {}` contains awk commands that are run once at the beginning. Here, we only need to set the input and output field separator once. Because there are no commands that follow `BEGIN {}`, `awk` doesn't do anything. If we want to print lines, we can use `print $0`, where `$0` refers to all columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tAsia\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tAsia\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tAsia\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tAsia\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tAsia\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tAsia\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} {print $0}' gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Admittedly, this isn't very useful. You can refer to the first, second, third, etc. columns using `$1`, `$2`, `$3`, etc. So, if we want to print the country name, the year and the population, we can use `awk` as follows. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tyear\tpop\n",
      "Afghanistan\t1952\t8425333\n",
      "Afghanistan\t1957\t9240934\n",
      "Afghanistan\t1962\t10267083\n",
      "Afghanistan\t1967\t11537966\n",
      "Afghanistan\t1972\t13079460\n",
      "Afghanistan\t1977\t14880372\n",
      "Afghanistan\t1982\t12881816\n",
      "Afghanistan\t1987\t13867957\n",
      "Afghanistan\t1992\t16317921\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} {print $1, $3, $5}' gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, this isn't very useful, because can achieve the same effect using `cut` in Bash using much less typing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tyear\tpop\n",
      "Afghanistan\t1952\t8425333\n",
      "Afghanistan\t1957\t9240934\n",
      "Afghanistan\t1962\t10267083\n",
      "Afghanistan\t1967\t11537966\n",
      "Afghanistan\t1972\t13079460\n",
      "Afghanistan\t1977\t14880372\n",
      "Afghanistan\t1982\t12881816\n",
      "Afghanistan\t1987\t13867957\n",
      "Afghanistan\t1992\t16317921\n"
     ]
    }
   ],
   "source": [
    "cut -f1,3,5 gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Things start getting interesting once you perform filtering on specific columns or manipulating text in specific columns. For instance, let's revisit our earlier task of filtering on rows that pertain to 1977. This can be accurately done by simply checking if column 3 is equal to 1977. In this case, we don't have to worry about the digits \"1977\" appearing in other columns such as the population. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Albania\tEurope\t1977\t68.93\t2509048\t3533.00391\n",
      "Algeria\tAfrica\t1977\t58.014\t17152804\t4910.416756\n",
      "Angola\tAfrica\t1977\t39.483\t6162675\t3008.647355\n",
      "Argentina\tAmericas\t1977\t68.481\t26983828\t10079.02674\n",
      "Australia\tOceania\t1977\t73.49\t14074100\t18334.19751\n",
      "Austria\tEurope\t1977\t72.17\t7568430\t19749.4223\n",
      "Bahrain\tAsia\t1977\t65.593\t297410\t19340.10196\n",
      "Bangladesh\tAsia\t1977\t46.923\t80428306\t659.8772322\n",
      "Belgium\tEurope\t1977\t72.8\t9821800\t19117.97448\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} $3 == 1977 {print $0}' gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the `{print $0}` is actually optional when we specify a condition for filtering lines. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Afghanistan\tAsia\t1977\t38.438\t14880372\t786.11336\n",
      "Albania\tEurope\t1977\t68.93\t2509048\t3533.00391\n",
      "Algeria\tAfrica\t1977\t58.014\t17152804\t4910.416756\n",
      "Angola\tAfrica\t1977\t39.483\t6162675\t3008.647355\n",
      "Argentina\tAmericas\t1977\t68.481\t26983828\t10079.02674\n",
      "Australia\tOceania\t1977\t73.49\t14074100\t18334.19751\n",
      "Austria\tEurope\t1977\t72.17\t7568430\t19749.4223\n",
      "Bahrain\tAsia\t1977\t65.593\t297410\t19340.10196\n",
      "Bangladesh\tAsia\t1977\t46.923\t80428306\t659.8772322\n",
      "Belgium\tEurope\t1977\t72.8\t9821800\t19117.97448\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} $3 == 1977' gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also combine multiple conditions using `&&`. Here, we will reproduce our earlier command in `awk`, where we will filter for 1977 data for countries whose names starts with \"S\". "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n",
      "South Africa\tAfrica\t1977\t55.527\t27129932\t8028.651439\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} $3 == 1977 && $1 ~ /^S/' gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now face a similar issue as before, where the header is missing. We can address this in multiple ways. We will use our approach from earlier, by matching country in the first column. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} $3 == 1977 && $1 ~ /^S/ || $1 == \"country\"' gapminder.tsv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, the structure of `awk` commands (within the single quotes) is as follows:\n",
    "\n",
    "```\n",
    "awk 'BEGIN {FS=OFS=\"\\t\"} CONDITION {ACTION} CONDITION {ACTION} {ACTION}' input.tsv > output.tsv\n",
    "```\n",
    "\n",
    "You can think of an `awk` command as a series of conditions and actions that will only run if the preceding condition is true. In fact, `BEGIN` is a condition that is only true at the beginning of the file. Hence, the `{FS=OFS=\"\\t\"}` only gets run once at the outset. Any action that isn't preceded by a condition (like the last `{ACTION}` in the example command above) will run for every line. "
   ]
  },
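  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a rough sketch of that structure with two condition-action pairs followed by an unconditional action. The input is three rows from `gapminder.tsv` reduced to three columns, so the year is column 2 here. The variable names and the use of `toupper` are only for illustration. \n",
    "\n",
    "```bash\n",
    "# Header plus three rows (country, year, pop) taken from gapminder.tsv\n",
    "sample=$(printf 'country\\tyear\\tpop\\nAfghanistan\\t1952\\t8425333\\nSweden\\t1977\\t8251648\\nSpain\\t1977\\t36439000\\n')\n",
    "\n",
    "# 1st pair: rename the third column on the header line only\n",
    "# 2nd pair: upper-case the country name on 1977 lines only\n",
    "# Last action: no condition, so every line is printed\n",
    "result=$(echo \"$sample\" | awk 'BEGIN {FS=OFS=\"\\t\"} $1 == \"country\" {$3 = \"population\"} $2 == 1977 {$1 = toupper($1)} {print $0}')\n",
    "\n",
    "echo \"$result\"\n",
    "```\n",
    "\n",
    "The header should come out with `population` as its third column, the 1952 row should be unchanged, and the two 1977 rows should start with `SWEDEN` and `SPAIN`. "
   ]
  },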
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "### Challenge Question 5\n",
    "\n",
    "Tackle Challenge Question 3, but this time using `awk`. You should be able to simplify your approach. \n",
    "\n",
    "**Hint:** You no longer need to know the continents in the file anymore. \n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "----\n",
    "\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solutions to Challenge Questions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Challenge Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The header is missing from the output because the `.*\\b1977\\b` in the pattern is restricting that all lines (_i.e._ those starting with `country` or `S`) have a `1977` in it. The solution is to move the `.*\\b1977\\b` inside the parentheses such that it only applies to lines starting with `S`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Sao Tome and Principe\tAfrica\t1977\t58.55\t86796\t1737.561657\n",
      "Saudi Arabia\tAsia\t1977\t58.69\t8128505\t34167.7626\n",
      "Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n",
      "Serbia\tEurope\t1977\t70.3\t8686367\t12980.66956\n",
      "Sierra Leone\tAfrica\t1977\t36.788\t3140897\t1348.285159\n",
      "Singapore\tAsia\t1977\t70.795\t2325300\t11210.08948\n",
      "Slovak Republic\tEurope\t1977\t70.45\t4827803\t10922.66404\n",
      "Slovenia\tEurope\t1977\t70.97\t1746919\t15277.03017\n",
      "Somalia\tAfrica\t1977\t41.974\t4353666\t1450.992513\n",
      "South Africa\tAfrica\t1977\t55.527\t27129932\t8028.651439\n",
      "Spain\tEurope\t1977\t74.39\t36439000\t13236.92117\n",
      "Sri Lanka\tAsia\t1977\t65.949\t14116836\t1348.775651\n",
      "Sudan\tAfrica\t1977\t47.8\t17104986\t2202.988423\n",
      "Swaziland\tAfrica\t1977\t52.537\t551425\t3781.410618\n",
      "Sweden\tEurope\t1977\t75.44\t8251648\t18855.72521\n",
      "Switzerland\tEurope\t1977\t75.39\t6316424\t26982.29052\n",
      "Syria\tAsia\t1977\t61.195\t7932503\t3195.484582\n"
     ]
    }
   ],
   "source": [
    "egrep '^(country|S.*\\b1977\\b)' gapminder.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Challenge Question 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There is no country that starts with Z outside of Africa\n"
     ]
    }
   ],
   "source": [
    "if grep -v 'Africa' gapminder.tsv | grep -q '^Z'; then\n",
    "    echo \"There is a country that starts with Z outside of Africa\"\n",
    "else\n",
    "    echo \"There is no country that starts with Z outside of Africa\"\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Challenge Question 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tPangaea\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tPangaea\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tPangaea\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tPangaea\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tPangaea\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tPangaea\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tPangaea\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tPangaea\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tPangaea\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "sed -E 's/Africa|America|Asia|Europe|Oceania/Pangaea/' gapminder.tsv > gapminder.pangaea.tsv\n",
    "head gapminder.pangaea.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Challenge Question 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tAsia\t1952\t28\t8425333\t779\n",
      "Afghanistan\tAsia\t1957\t30\t9240934\t820\n",
      "Afghanistan\tAsia\t1962\t31\t10267083\t853\n",
      "Afghanistan\tAsia\t1967\t34\t11537966\t836\n",
      "Afghanistan\tAsia\t1972\t36\t13079460\t739\n",
      "Afghanistan\tAsia\t1977\t38\t14880372\t786\n",
      "Afghanistan\tAsia\t1982\t39\t12881816\t978\n",
      "Afghanistan\tAsia\t1987\t40\t13867957\t852\n",
      "Afghanistan\tAsia\t1992\t41\t16317921\t649\n"
     ]
    }
   ],
   "source": [
    "sed -E 's/([0-9]+)\\.[0-9]+/\\1/g' gapminder.tsv > gapminder.no_decimal.tsv\n",
    "head gapminder.no_decimal.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Challenge Question 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tPangaea\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tPangaea\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tPangaea\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tPangaea\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tPangaea\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tPangaea\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tPangaea\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tPangaea\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tPangaea\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} $1 != \"country\" {$2 = \"Pangaea\"} {print $0}' gapminder.tsv > gapminder.pangaea.2.tsv\n",
    "head gapminder.pangaea.2.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above solution works fine. You can make it a bit simpler (assuming your header is on the first line). `NR` in `awk` refers to the line number. Here, we are changing the second column for every line with a line number greater than 1 (_i.e._ any non-header line). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n",
      "Afghanistan\tPangaea\t1952\t28.801\t8425333\t779.4453145\n",
      "Afghanistan\tPangaea\t1957\t30.332\t9240934\t820.8530296\n",
      "Afghanistan\tPangaea\t1962\t31.997\t10267083\t853.10071\n",
      "Afghanistan\tPangaea\t1967\t34.02\t11537966\t836.1971382\n",
      "Afghanistan\tPangaea\t1972\t36.088\t13079460\t739.9811058\n",
      "Afghanistan\tPangaea\t1977\t38.438\t14880372\t786.11336\n",
      "Afghanistan\tPangaea\t1982\t39.854\t12881816\t978.0114388\n",
      "Afghanistan\tPangaea\t1987\t40.822\t13867957\t852.3959448\n",
      "Afghanistan\tPangaea\t1992\t41.674\t16317921\t649.3413952\n"
     ]
    }
   ],
   "source": [
    "awk 'BEGIN {FS=OFS=\"\\t\"} NR > 1 {$2 = \"Pangaea\"} {print $0}' gapminder.tsv > gapminder.pangaea.3.tsv\n",
    "head gapminder.pangaea.3.tsv"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}