# Applications of Regular Expressions

**Author:** Bruno Grande

**Date:** August 29th, 2017

## Introduction

In this lesson, we will cover a few applications of regular expressions (or regex) that I use all the time. Regex are available in most programming languages, but to keep this lesson accessible to as many people as possible, we will focus on applications at the Bash shell. Specifically, we will cover how you can use `grep`, `sed` and `awk` to get a lot done without firing up a script, especially with the power of regex at your side. 

### The motivation

Regular expressions are an extremely powerful tool for pattern matching. You might not realize it, but a lot of what we do is pattern matching, especially if you deal with text at all. The ability to describe a flexible pattern that the computer can then quickly look for in some arbitrary text opens up a world of possibilities. This lesson will focus on some of these possibilities. Notably, we will cover:

1. Subsetting text using `grep`
2. Search-and-replace text using `sed`
3. Filter and/or process tabular data using `awk`

These three tools alone justify learning Bash to make your life easier. In combination with regex, they are life-savers! 

### The dataset

We will be using Jenny Bryan's cleaned-up version of the gapminder dataset. It contains 1704 rows and 6 columns. The dataset consists of the population, life expectancy and GDP per capita for 142 countries every 5 years between 1952 and 2007. You can easily download the data using `curl` as follows. 

In [1]:
curl -sL bit.ly/gapm-data > gapminder.tsv

In [2]:
head gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071
Afghanistan	Asia	1967	34.02	11537966	836.1971382
Afghanistan	Asia	1972	36.088	13079460	739.9811058
Afghanistan	Asia	1977	38.438	14880372	786.11336
Afghanistan	Asia	1982	39.854	12881816	978.0114388
Afghanistan	Asia	1987	40.822	13867957	852.3959448
Afghanistan	Asia	1992	41.674	16317921	649.3413952


----

----

----

## Subsetting text using grep

At its simplest, `grep` can be used to filter lines based on a pattern. We can start with a plain, non-regex pattern. Here, we subset the file to lines that contain the word `Canada`. 

In [3]:
grep Canada gapminder.tsv

Canada	Americas	1952	68.75	14785584	11367.16112
Canada	Americas	1957	69.96	17010154	12489.95006
Canada	Americas	1962	71.3	18985849	13462.48555
Canada	Americas	1967	72.13	20819767	16076.58803
Canada	Americas	1972	72.88	22284500	18970.57086
Canada	Americas	1977	74.21	23796400	22090.88306
Canada	Americas	1982	75.76	25201900	22898.79214
Canada	Americas	1987	76.86	26549700	26626.51503
Canada	Americas	1992	77.95	28523502	26342.88426
Canada	Americas	1997	78.61	30305843	28954.92589
Canada	Americas	2002	79.77	31902268	33328.96507
Canada	Americas	2007	80.653	33390141	36319.23501


You'll notice that the header line was removed, because it doesn't contain `Canada`. If we want to ensure that this file remains valid, we need to keep the header. There are multiple ways to do this. 

First, you can grep the header and the `Canada` lines separately. 

In [4]:
grep country gapminder.tsv
grep Canada gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Canada	Americas	1952	68.75	14785584	11367.16112
Canada	Americas	1957	69.96	17010154	12489.95006
Canada	Americas	1962	71.3	18985849	13462.48555
Canada	Americas	1967	72.13	20819767	16076.58803
Canada	Americas	1972	72.88	22284500	18970.57086
Canada	Americas	1977	74.21	23796400	22090.88306
Canada	Americas	1982	75.76	25201900	22898.79214
Canada	Americas	1987	76.86	26549700	26626.51503
Canada	Americas	1992	77.95	28523502	26342.88426
Canada	Americas	1997	78.61	30305843	28954.92589
Canada	Americas	2002	79.77	31902268	33328.96507
Canada	Americas	2007	80.653	33390141	36319.23501


However, you will notice that we are repeating ourselves (the `grep` command and the `gapminder.tsv` file name). Ideally, we want to follow the DRY (don't repeat yourself) principle. 

**N.B.** An astute reader will notice that I can extract the header using `head -1`. Indeed, this would work here, but I am familiar with file formats (_e.g._ VCF variant call format) where the header is neither the first line, nor a predictable number of lines into the file. In these cases, `grep` is more general. 

The second approach involves the use of regex. In fact, `grep` stands for "globally search a regular expression and print". However, because there have been multiple versions of regex over the years and we are used to the more modern versions, we will need to use a variant of grep that enables extended regex. You can either use `grep -E` or `egrep`. I will be using the latter. If you are on a recent version of GNU grep, `egrep` prints a warning that it is obsolescent; `grep -E` is the drop-in replacement. 

Here, we can start using regex by using the `|` operator, which matches what's on the left **or** on the right. Whenever you use regular expressions, it is safer to quote the pattern using single quotes. 

In [5]:
egrep 'country|Canada' gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Canada	Americas	1952	68.75	14785584	11367.16112
Canada	Americas	1957	69.96	17010154	12489.95006
Canada	Americas	1962	71.3	18985849	13462.48555
Canada	Americas	1967	72.13	20819767	16076.58803
Canada	Americas	1972	72.88	22284500	18970.57086
Canada	Americas	1977	74.21	23796400	22090.88306
Canada	Americas	1982	75.76	25201900	22898.79214
Canada	Americas	1987	76.86	26549700	26626.51503
Canada	Americas	1992	77.95	28523502	26342.88426
Canada	Americas	1997	78.61	30305843	28954.92589
Canada	Americas	2002	79.77	31902268	33328.96507
Canada	Americas	2007	80.653	33390141	36319.23501


If we wanted to include the US in our results, it's as simple as adding another `|` operator in our pattern. 

In [6]:
egrep 'country|Canada|United States' gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Canada	Americas	1952	68.75	14785584	11367.16112
Canada	Americas	1957	69.96	17010154	12489.95006
Canada	Americas	1962	71.3	18985849	13462.48555
Canada	Americas	1967	72.13	20819767	16076.58803
Canada	Americas	1972	72.88	22284500	18970.57086
Canada	Americas	1977	74.21	23796400	22090.88306
Canada	Americas	1982	75.76	25201900	22898.79214
Canada	Americas	1987	76.86	26549700	26626.51503
Canada	Americas	1992	77.95	28523502	26342.88426
Canada	Americas	1997	78.61	30305843	28954.92589
Canada	Americas	2002	79.77	31902268	33328.96507
Canada	Americas	2007	80.653	33390141	36319.23501
United States	Americas	1952	68.44	157553000	13990.48208
United States	Americas	1957	69.49	171984000	14847.12712
United States	Americas	1962	70.21	186538000	16173.14586
United States	Americas	1967	70.76	198712000	19530.36557
United States	Americas	1972	71.34	209896000	21806.03594
United States	Americas	1977	73.38	220239000	24072.63213
United States	Americas	1982	74.65	232187835

`grep` can be as flexible as you need it to be. While it may be contrived, let's say we are interested in the data from 1977 for countries whose names start with `S` (and we want to keep the header). As always, there are multiple ways of approaching this problem. 

First, we can use UNIX pipes to perform subsequent filters, one for the countries starting with `S` and another for the rows corresponding to 1977. 

In [7]:
egrep '^(country|S)' gapminder.tsv | egrep '1977'

Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1967	67.946	1977600	4977.41854
Singapore	Asia	1977	70.795	2325300	11210.08948
Singapore	Asia	2002	78.77	4197776	36023.1054
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
South Africa	Africa	1977	55.527	27129932	8028.651439
Spain	Europe	1977	74.39	36439000	13236.92117
Sri Lanka	Asia	1977	65.949	14116836	1348.775651
Sudan	Africa	1977	47.8	17104986	2202.988423
Swaziland	Africa	1977	52.537	551425	3781.410618
Sweden	Europe	1977	75.44	8251648	18855.72521
Switzerland	Europe	1977	75.39	6316424	26982.29052
Syria	Asia	1977	61.195	7932503	3195.484582


Some of you might have noticed that Singapore comes up three times, while every country is only supposed to show up once at most. Upon closer inspection, you can see why this is happening: the `1977` pattern is appearing in the line within the population number, which is not something we want. 

The immediate solution to this is to prevent matches of the year within other numbers. In regex, you can specify that word boundaries must be present before and after the number using `\b`. 

In [8]:
egrep '^(country|S)' gapminder.tsv | egrep '\b1977\b'

Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1977	70.795	2325300	11210.08948
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
South Africa	Africa	1977	55.527	27129932	8028.651439
Spain	Europe	1977	74.39	36439000	13236.92117
Sri Lanka	Asia	1977	65.949	14116836	1348.775651
Sudan	Africa	1977	47.8	17104986	2202.988423
Swaziland	Africa	1977	52.537	551425	3781.410618
Sweden	Europe	1977	75.44	8251648	18855.72521
Switzerland	Europe	1977	75.39	6316424	26982.29052
Syria	Asia	1977	61.195	7932503	3195.484582


Indeed, this solves our problem, but we lost the header again. To get it back, we need to include the `country` pattern in both commands, which is slightly repetitive. 

**N.B.** Our current solution to filtering for observations made in 1977 is imperfect, because we are filtering on the presence of 1977 anywhere in the line. Technically, if a country had a population or GDP per capita of 1977 at some point, that row would be included in the output. Later, we will see how we can use awk to apply regex on specific columns. 

In [9]:
egrep '^(country|S)' gapminder.tsv | egrep '(country|\b1977\b)'

country	continent	year	lifeExp	pop	gdpPercap
Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1977	70.795	2325300	11210.08948
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
South Africa	Africa	1977	55.527	27129932	8028.651439
Spain	Europe	1977	74.39	36439000	13236.92117
Sri Lanka	Asia	1977	65.949	14116836	1348.775651
Sudan	Africa	1977	47.8	17104986	2202.988423
Swaziland	Africa	1977	52.537	551425	3781.410618
Sweden	Europe	1977	75.44	8251648	18855.72521
Switzerland	Europe	1977	75.39	6316424	26982.29052
Syria	Asia	1977	61.195	7932503	3195.484582
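
To make the caveat about matching `1977` anywhere in the line concrete, here is a minimal sketch using two made-up rows (the country `Sampleland` and its values are hypothetical and not part of gapminder). The 1952 row has a GDP per capita of `1977.25`, and `\b1977\b` still matches it because a period counts as a word boundary.

```shell
# Two hypothetical rows (not real gapminder data).
# The 1952 row contains 1977 in its last column.
printf 'Sampleland\tAsia\t1952\t50.1\t1000000\t1977.25\n' > sample_rows.tsv
printf 'Sampleland\tAsia\t1977\t55.2\t1200000\t2100.5\n' >> sample_rows.tsv

# Whole-line matching counts both rows, including the unwanted 1952 row.
matched=$(grep -cE '\b1977\b' sample_rows.tsv)
echo "Rows matched: $matched"
```

Both rows are counted here, which is why filtering on a specific column is the safer approach for tabular data.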


Second, we can combine our patterns into one regex. Admittedly, there is no compelling advantage in doing so other than preventing needless commands wherever possible. For this, we need to acknowledge that the year will always be after the country name by some number of characters. We can specify "some number of characters" in regex using `.*`. 

In [10]:
egrep '^(country|S).*\b1977\b' gapminder.tsv

Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1977	70.795	2325300	11210.08948
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
South Africa	Africa	1977	55.527	27129932	8028.651439
Spain	Europe	1977	74.39	36439000	13236.92117
Sri Lanka	Asia	1977	65.949	14116836	1348.775651
Sudan	Africa	1977	47.8	17104986	2202.988423
Swaziland	Africa	1977	52.537	551425	3781.410618
Sweden	Europe	1977	75.44	8251648	18855.72521
Switzerland	Europe	1977	75.39	6316424	26982.29052
Syria	Asia	1977	61.195	7932503	3195.484582


----
### Challenge Question 1

Why is the header missing in the output of the above command?

----

Let's create a file with a list of countries of interest for the purposes of this demo. 

In [11]:
echo -e 'Canada\nItaly\nAustralia\nUnited States\nEngland\nFrance' > countries.txt
cat countries.txt

Canada
Italy
Australia
United States
England
France


Given this list, we can easily filter the gapminder dataset for observations made for these countries. 

In [12]:
egrep -f countries.txt gapminder.tsv | head

Australia	Oceania	1952	69.12	8691212	10039.59564
Australia	Oceania	1957	70.33	9712569	10949.64959
Australia	Oceania	1962	70.93	10794968	12217.22686
Australia	Oceania	1967	71.1	11872264	14526.12465
Australia	Oceania	1972	71.93	13177000	16788.62948
Australia	Oceania	1977	73.49	14074100	18334.19751
Australia	Oceania	1982	74.74	15184200	19477.00928
Australia	Oceania	1987	76.32	16257249	21888.88903
Australia	Oceania	1992	77.56	17481977	23424.76683
Australia	Oceania	1997	78.83	18565243	26997.93657
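
As before, the header is dropped because no pattern in the file matches it. One possible way to keep it, sketched below, is to combine `-f` with an extra pattern given through `-e`. The file names `wanted.txt` and `grep_f_demo.tsv` are made up for this example, and the rows are copied from outputs shown earlier so that the snippet runs on its own.

```shell
# A short pattern file and a tiny table, created only for this demo.
printf 'Canada\nItaly\n' > wanted.txt
printf 'country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n' > grep_f_demo.tsv
printf 'Canada\tAmericas\t2007\t80.653\t33390141\t36319.23501\n' >> grep_f_demo.tsv
printf 'Sweden\tEurope\t1977\t75.44\t8251648\t18855.72521\n' >> grep_f_demo.tsv

# -e adds one more pattern on top of those read from the file with -f.
result=$(grep -E -e '^country' -f wanted.txt grep_f_demo.tsv)
echo "$result"
```

Only the header and the Canada row are printed; the Sweden row matches neither the extra pattern nor anything in `wanted.txt`.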


So far, we've seen how we can use `grep` to subset the lines in a file according to a certain pattern. Another useful feature of `grep` is its quiet mode, which can be used in conjunction with Bash conditional expressions. 

First, let's review Bash if statements. 

In [13]:
if [[ 1 -gt 2 ]]; then
 echo 'true'
else
 echo 'false'
fi

false


Here, `[[ 1 -gt 2 ]]` is actually a command that evaluates the expression inside the square brackets. We use `-gt` because it compares integers, whereas `>` inside `[[ ]]` compares strings alphabetically. This portion of the if statement in Bash can be any command. A command evaluates as true if its exit code is zero (_i.e._ the command was successful). Otherwise, it's considered false. 

To show this, I will run the commands `true` and `false`, which respectively return exit codes 0 and 1. 

**N.B.** The `$?` is a useful variable that contains the exit code of the most recently run command. 

In [14]:
if true; then
 echo 'Exit code: ' $?
 echo 'Considered true'
else
 echo 'Exit code: ' $?
 echo 'Considered false'
fi

Exit code: 0
Considered true


In [15]:
if false; then
 echo 'Exit code: ' $?
 echo 'Considered true'
else
 echo 'Exit code: ' $?
 echo 'Considered false'
fi

Exit code: 1
Considered false


Now, let's say you're interested in running a command only if a file contains some pattern. You can use `grep` in quiet mode inside an if statement, as follows. 

In [16]:
if egrep -q 'Canada' countries.txt; then
 echo 'Canada is in the countries.txt file :D'
else
 echo 'Canada is not in the countries.txt file :('
fi

Canada is in the countries.txt file :D


In [17]:
if egrep -q 'Switzerland' countries.txt; then
 echo 'Switzerland is in the countries.txt file :D'
else
 echo 'Switzerland is not in the countries.txt file :('
fi

Switzerland is not in the countries.txt file :(


Here, we are just echoing some text, but you can do whatever you want once you know a file matches a pattern. A nice thing about quiet mode is that grep stops searching as soon as it encounters the first instance of the pattern. 

If you wanted to count how many lines in a file match a pattern, you can certainly pipe the output of `grep` to `wc -l`. You can be slightly more efficient by avoiding the extra command and using the `-c` option in `grep`. 

In [18]:
grep 'Canada' gapminder.tsv | wc -l

 12


In [19]:
grep -c 'Canada' gapminder.tsv

12
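
Keep in mind that `-c` counts matching lines rather than individual matches. If a pattern can occur more than once on a line, a sketch like the following, which relies on `grep -o` printing each match on its own line, counts every occurrence instead. The sample text is made up for illustration.

```shell
# Made-up sample: "Canada" appears twice on the first line and once on the second.
printf 'Canada borders the US; Canada is large\nCanada\nItaly\n' > count_demo.txt

# -c counts matching lines.
line_count=$(grep -c 'Canada' count_demo.txt)

# -o prints each match on its own line, so wc -l counts occurrences.
# tr strips the padding that some versions of wc add.
match_count=$(grep -o 'Canada' count_demo.txt | wc -l | tr -d ' ')

echo "Matching lines: $line_count"
echo "Total matches: $match_count"
```

In gapminder, each country name appears once per line, so both approaches give the same answer there.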


Lastly, for all of the above `grep` commands, you can invert the search using the `-v` option. In other words, if you want all lines except for those containing "Canada" or "United States", you can simply do the following:

In [20]:
cat countries.txt

Canada
Italy
Australia
United States
England
France


In [21]:
egrep -v 'Canada|United States' countries.txt

Italy
Australia
England
France


----
### Challenge Question 2

Write an if statement in Bash that checks if there are any countries that start with the letter "Z" outside of Africa, and echoes the response accordingly. 

----

----

----

----

## Search-and-replace text using sed

So far, we've seen `grep`'s amazing ability to subset lines in a file according to a pattern, which can be as complex as you can conjure. Now, we're going to introduce `sed`, which is probably best known for its ability to perform search-and-replace really easily at the command line. 

Let's remind ourselves of what's in our `gapminder.tsv` file. 

In [22]:
head gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071
Afghanistan	Asia	1967	34.02	11537966	836.1971382
Afghanistan	Asia	1972	36.088	13079460	739.9811058
Afghanistan	Asia	1977	38.438	14880372	786.11336
Afghanistan	Asia	1982	39.854	12881816	978.0114388
Afghanistan	Asia	1987	40.822	13867957	852.3959448
Afghanistan	Asia	1992	41.674	16317921	649.3413952


To start off with a simple example to examine the structure of a `sed` command, we are going to replace every instance of "United States" with "USA". Here, we will count instances of each term before and after we apply `sed` to confirm the change. 

In general, we need to ensure that modern regular expressions are enabled in `sed`. Unfortunately, this option varies based on your platform. Typically, it's `-E` on Macs and `-r` on Linux (and probably Windows, although I'm not sure). Recent versions of GNU `sed` accept `-E` as well, so try that first. 

In [23]:
sed -E 's/United States/USA/' gapminder.tsv > gapminder.usa.tsv

In [24]:
echo 'Before sed'
grep -c 'United States' gapminder.tsv
grep -c 'USA' gapminder.tsv

echo 'After sed'
grep -c 'United States' gapminder.usa.tsv
grep -c 'USA' gapminder.usa.tsv

Before sed
12
0
After sed
0
12


As you can see, the search-and-replace worked. The general form of a `sed` search-and-replace is as follows:

```
sed -E 's/what_you_want_to_replace/what_you_want_to_replace_with/' input_file.txt > output_file.txt
```

Just in case you're still skeptical, we'll apply the same change on our small `countries.txt` file. 

In [25]:
sed -E 's/United States/USA/' countries.txt

Canada
Italy
Australia
USA
England
France


The initial `s` is necessary to indicate the search-and-replace command within `sed`. There are other commands that we won't see today, such as insert (`i`) and delete (`d`). The slashes are used to delimit the `what_you_want_to_replace` from the `what_you_want_to_replace_with`. The delimiter can actually be any character you want, as long as you're consistent. 

For example, you can use colons (`:`) instead. 

In [26]:
sed -E 's:United States:USA:' countries.txt

Canada
Italy
Australia
USA
England
France


The character you use is not that important. One thing to consider is that if the character you choose appears in the regex, you will need to escape it with backslashes. That's why I generally stick with slashes as my character in `sed` commands unless I'm dealing with file paths as my input text (which commonly include slashes), in which case I will switch to colons or vertical bars. 
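
To illustrate the point about file paths, the sketch below performs the same substitution twice: once with slashes as the delimiter, where every slash in the pattern must be escaped, and once with colons. The paths are hypothetical.

```shell
# Hypothetical path used only for illustration.
path='/home/user/data/gapminder.tsv'

# With "/" as the delimiter, each slash in the pattern and replacement needs a backslash.
with_slashes=$(echo "$path" | sed -E 's/\/home\/user/\/mnt\/backup/')

# With ":" as the delimiter, the same substitution needs no escaping.
with_colons=$(echo "$path" | sed -E 's:/home/user:/mnt/backup:')

echo "$with_slashes"
echo "$with_colons"
```

Both commands print the same path; the second is simply easier to read.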

Let's move on to a slightly more complex change. We are going to replace every period (`.`) with a comma (`,`), as we want to send our data to a collaborator in France, where they use commas instead of periods in decimal numbers. 

There is an important thing we need to handle: there might be multiple instances of a period on a line. By default, `sed` will only replace the first instance of a pattern per line. If we want to replace every instance, we'll need to enable the global mode by adding a `g` at the end of the `sed` command. 

**N.B.** Recall that the period in regex has special meaning and matches any character. If we want to match an actual period, we need to escape it using a backslash. 

In [27]:
sed -E 's/\./,/g' gapminder.tsv > gapminder.comma.tsv
head gapminder.comma.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28,801	8425333	779,4453145
Afghanistan	Asia	1957	30,332	9240934	820,8530296
Afghanistan	Asia	1962	31,997	10267083	853,10071
Afghanistan	Asia	1967	34,02	11537966	836,1971382
Afghanistan	Asia	1972	36,088	13079460	739,9811058
Afghanistan	Asia	1977	38,438	14880372	786,11336
Afghanistan	Asia	1982	39,854	12881816	978,0114388
Afghanistan	Asia	1987	40,822	13867957	852,3959448
Afghanistan	Asia	1992	41,674	16317921	649,3413952
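
To see what the `g` flag changes, this small sketch runs the same substitution with and without it on a single row taken from the output above.

```shell
# One row from the head of gapminder.tsv shown earlier.
row=$(printf 'Afghanistan\tAsia\t1952\t28.801\t8425333\t779.4453145')

# Without g, only the first period on the line is replaced.
first_only=$(echo "$row" | sed -E 's/\./,/')

# With g, every period on the line is replaced.
all_periods=$(echo "$row" | sed -E 's/\./,/g')

echo "$first_only"
echo "$all_periods"
```

Without `g`, the life expectancy becomes `28,801` but the GDP per capita keeps its period.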


----
### Challenge Question 3

Write a `sed` command that replaces all continent names with "Pangaea".

----

You can easily chain multiple search-and-replace commands by using the `-e` option. 

In [28]:
sed -E -e 's/United States/USA/' -e 's/\./,/g' gapminder.tsv > gapminder.usa_and_comma.tsv
egrep 'country|USA' gapminder.usa_and_comma.tsv | head

country	continent	year	lifeExp	pop	gdpPercap
USA	Americas	1952	68,44	157553000	13990,48208
USA	Americas	1957	69,49	171984000	14847,12712
USA	Americas	1962	70,21	186538000	16173,14586
USA	Americas	1967	70,76	198712000	19530,36557
USA	Americas	1972	71,34	209896000	21806,03594
USA	Americas	1977	73,38	220239000	24072,63213
USA	Americas	1982	74,65	232187835	25009,55914
USA	Americas	1987	75,02	242803533	29884,35041
USA	Americas	1992	76,09	256894189	32003,93224


Perhaps one of the most powerful features of `sed` and regex when doing search-and-replace is backreferences. They allow you to search for something and replace it with something that includes what was originally matched. I think the best way to explain this is to demonstrate backreferences in action. Our contrived example is to match the country name at the beginning of each line and duplicate it. 

In [29]:
sed -E 's/^([^\t]+)/\1_\1/' gapminder.tsv > gapminder.double_country.tsv
head gapminder.double_country.tsv

country_country	continent	year	lifeExp	pop	gdpPercap
Afghanistan_Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan_Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan_Afghanistan	Asia	1962	31.997	10267083	853.10071
Afghanistan_Afghanistan	Asia	1967	34.02	11537966	836.1971382
Afghanistan_Afghanistan	Asia	1972	36.088	13079460	739.9811058
Afghanistan_Afghanistan	Asia	1977	38.438	14880372	786.11336
Afghanistan_Afghanistan	Asia	1982	39.854	12881816	978.0114388
Afghanistan_Afghanistan	Asia	1987	40.822	13867957	852.3959448
Afghanistan_Afghanistan	Asia	1992	41.674	16317921	649.3413952
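
Backreferences become more useful with more than one group. As a further sketch, the command below captures the first two columns and swaps them, leaving the rest of the line untouched. Not every `sed` interprets `\t` as a tab, so this version stores a literal tab in a shell variable (here called `tab`, a name chosen for this example) and uses double quotes so the shell can substitute it.

```shell
# One row copied from the Canada output shown earlier.
row=$(printf 'Canada\tAmericas\t2007\t80.653\t33390141\t36319.23501')

# A literal tab character, so the regex does not depend on sed understanding \t.
tab=$(printf '\t')

# \1 is the country and \2 is the continent; the replacement writes them in reverse order.
swapped=$(echo "$row" | sed -E "s/^([^$tab]+)$tab([^$tab]+)/\2$tab\1/")
echo "$swapped"
```

The groups are numbered by the order of their opening parentheses, so `\1` always refers to the first group regardless of where it is used in the replacement.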


----
### Challenge Question 4

Use backreferences to get rid of all decimal digits. Don't worry about rounding up or down; just take the floor of the number. 

----

----

----

----

## Filter and/or process tabular data using awk

The last tool we will cover today is `awk`. This tool combines the features of `grep` and `sed` and makes them more useful in the context of tabular data, such as our `gapminder.tsv` file consisting of six tab-delimited columns. 

In [30]:
head gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071
Afghanistan	Asia	1967	34.02	11537966	836.1971382
Afghanistan	Asia	1972	36.088	13079460	739.9811058
Afghanistan	Asia	1977	38.438	14880372	786.11336
Afghanistan	Asia	1982	39.854	12881816	978.0114388
Afghanistan	Asia	1987	40.822	13867957	852.3959448
Afghanistan	Asia	1992	41.674	16317921	649.3413952


- FS and OFS
- Print subset of columns
- Conditionally print lines
- sub and gensub

The first thing you need to configure with `awk` is the field separator (`FS`), which is what separates the columns in each line. Typically, we use comma- or tab-delimited files. In this case, `gapminder.tsv` uses tabs. We also configure the output field separator (`OFS`) to be the same character. Notice that we use single quotes again to avoid unintended issues down the line. 

In [3]:
awk 'BEGIN {FS=OFS="\t"}' gapminder.tsv

The `BEGIN {}` contains awk commands that are run once at the beginning. Here, we only need to set the input and output field separator once. Because there are no commands that follow `BEGIN {}`, `awk` doesn't do anything. If we want to print lines, we can use `print $0`, where `$0` refers to the entire line. 

In [4]:
awk 'BEGIN {FS=OFS="\t"} {print $0}' gapminder.tsv | head

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071
Afghanistan	Asia	1967	34.02	11537966	836.1971382
Afghanistan	Asia	1972	36.088	13079460	739.9811058
Afghanistan	Asia	1977	38.438	14880372	786.11336
Afghanistan	Asia	1982	39.854	12881816	978.0114388
Afghanistan	Asia	1987	40.822	13867957	852.3959448
Afghanistan	Asia	1992	41.674	16317921	649.3413952


Admittedly, this isn't very useful. You can refer to the first, second, third, etc. columns using `$1`, `$2`, `$3`, etc. So, if we want to print the country name, the year and the population, we can use `awk` as follows. 

In [6]:
awk 'BEGIN {FS=OFS="\t"} {print $1, $3, $5}' gapminder.tsv | head

country	year	pop
Afghanistan	1952	8425333
Afghanistan	1957	9240934
Afghanistan	1962	10267083
Afghanistan	1967	11537966
Afghanistan	1972	13079460
Afghanistan	1977	14880372
Afghanistan	1982	12881816
Afghanistan	1987	13867957
Afghanistan	1992	16317921


Again, this isn't very useful, because we can achieve the same effect using `cut` in Bash with much less typing.

In [7]:
cut -f1,3,5 gapminder.tsv | head

country	year	pop
Afghanistan	1952	8425333
Afghanistan	1957	9240934
Afghanistan	1962	10267083
Afghanistan	1967	11537966
Afghanistan	1972	13079460
Afghanistan	1977	14880372
Afghanistan	1982	12881816
Afghanistan	1987	13867957
Afghanistan	1992	16317921


Things start getting interesting once you filter on specific columns or manipulate text in specific columns. For instance, let's revisit our earlier task of filtering on rows that pertain to 1977. This can be accurately done by simply checking if column 3 is equal to 1977. In this case, we don't have to worry about the digits "1977" appearing in other columns such as the population. 

In [8]:
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 {print $0}' gapminder.tsv | head

Afghanistan	Asia	1977	38.438	14880372	786.11336
Albania	Europe	1977	68.93	2509048	3533.00391
Algeria	Africa	1977	58.014	17152804	4910.416756
Angola	Africa	1977	39.483	6162675	3008.647355
Argentina	Americas	1977	68.481	26983828	10079.02674
Australia	Oceania	1977	73.49	14074100	18334.19751
Austria	Europe	1977	72.17	7568430	19749.4223
Bahrain	Asia	1977	65.593	297410	19340.10196
Bangladesh	Asia	1977	46.923	80428306	659.8772322
Belgium	Europe	1977	72.8	9821800	19117.97448


Note that the `{print $0}` is actually optional when we specify a condition for filtering lines. 

In [9]:
awk 'BEGIN {FS=OFS="\t"} $3 == 1977' gapminder.tsv | head

Afghanistan	Asia	1977	38.438	14880372	786.11336
Albania	Europe	1977	68.93	2509048	3533.00391
Algeria	Africa	1977	58.014	17152804	4910.416756
Angola	Africa	1977	39.483	6162675	3008.647355
Argentina	Americas	1977	68.481	26983828	10079.02674
Australia	Oceania	1977	73.49	14074100	18334.19751
Austria	Europe	1977	72.17	7568430	19749.4223
Bahrain	Asia	1977	65.593	297410	19340.10196
Bangladesh	Asia	1977	46.923	80428306	659.8772322
Belgium	Europe	1977	72.8	9821800	19117.97448


We can also combine multiple conditions using `&&`. Here, we will reproduce our earlier command in `awk`, where we will filter for 1977 data for countries whose names start with "S". 

In [11]:
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 && $1 ~ /^S/' gapminder.tsv | head

Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1977	70.795	2325300	11210.08948
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
South Africa	Africa	1977	55.527	27129932	8028.651439


We now face a similar issue as before, where the header is missing. We can address this in multiple ways. We will use our approach from earlier, by matching country in the first column. 

In [12]:
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 && $1 ~ /^S/ || $1 == "country"' gapminder.tsv | head

country	continent	year	lifeExp	pop	gdpPercap
Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1977	70.795	2325300	11210.08948
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
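
Another way to keep the header, assuming it is always the first line, is to test the line number with `NR`, awk's built-in counter of lines read so far. The sketch below writes a few rows copied from earlier outputs to a file named `awk_demo.tsv` (a name made up for this example) so that it runs on its own.

```shell
# A tiny table built from rows shown earlier in this lesson.
printf 'country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n' > awk_demo.tsv
printf 'Senegal\tAfrica\t1977\t48.879\t5260855\t1561.769116\n' >> awk_demo.tsv
printf 'Singapore\tAsia\t1967\t67.946\t1977600\t4977.41854\n' >> awk_demo.tsv
printf 'Canada\tAmericas\t1977\t74.21\t23796400\t22090.88306\n' >> awk_demo.tsv

# Keep line 1 (the header) or any 1977 row whose country starts with S.
result=$(awk 'BEGIN {FS=OFS="\t"} NR == 1 || ($3 == 1977 && $1 ~ /^S/)' awk_demo.tsv)
echo "$result"
```

Only the header and the Senegal row are printed. The Singapore row is skipped even though its population contains `1977`, because the comparison is made on column 3 alone.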


In general, the structure of `awk` commands (within the single quotes) is as follows:

```
awk 'BEGIN {FS=OFS="\t"} CONDITION {ACTION} CONDITION {ACTION} {ACTION}' input.tsv > output.tsv
```

You can think of an `awk` command as a series of condition-action pairs, where each action only runs if the condition preceding it is true. In fact, `BEGIN` is a special condition that is only true once, before any input is read. Hence, the `{FS=OFS="\t"}` only gets run once at the outset. Any action that isn't preceded by a condition (like the last `{ACTION}` in the example command above) will run for every line. 
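
As a sketch of this structure, the command below chains two condition-action pairs and a final unconditional action. It uses `sub()`, awk's built-in function for replacing the first regex match in a string, to rename a country in column 1 only, and `int()` to truncate the life expectancy. The file name `awk_actions.tsv` is made up, and the rows are copied from outputs shown earlier.

```shell
# A tiny table built from rows shown earlier in this lesson.
printf 'country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n' > awk_actions.tsv
printf 'United States\tAmericas\t1952\t68.44\t157553000\t13990.48208\n' >> awk_actions.tsv
printf 'Canada\tAmericas\t1952\t68.75\t14785584\t11367.16112\n' >> awk_actions.tsv

result=$(awk 'BEGIN {FS=OFS="\t"}
  # Condition 1: rename the country, touching column 1 only.
  $1 ~ /^United States$/ {sub(/United States/, "USA", $1)}
  # Condition 2: truncate life expectancy on every non-header row.
  $1 != "country" {$4 = int($4)}
  # No condition: print every line, modified or not.
  {print $0}
' awk_actions.tsv)
echo "$result"
```

The pairs are evaluated in order for each line, so a line can trigger several actions before it is printed.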

----

### Challenge Question 5

Tackle Challenge Question 3, but this time using `awk`. You should be able to simplify your approach. 

**Hint:** You no longer need to know which continents are in the file. 

----

----

----

----

## Solutions to Challenge Questions

### Challenge Question 1

The header is missing from the output because the `.*\b1977\b` in the pattern requires every matching line (_i.e._ those starting with `country` or `S`) to contain `1977`, which the header doesn't. The solution is to move the `.*\b1977\b` inside the parentheses such that it only applies to lines starting with `S`. 

In [31]:
egrep '^(country|S.*\b1977\b)' gapminder.tsv

country	continent	year	lifeExp	pop	gdpPercap
Sao Tome and Principe	Africa	1977	58.55	86796	1737.561657
Saudi Arabia	Asia	1977	58.69	8128505	34167.7626
Senegal	Africa	1977	48.879	5260855	1561.769116
Serbia	Europe	1977	70.3	8686367	12980.66956
Sierra Leone	Africa	1977	36.788	3140897	1348.285159
Singapore	Asia	1977	70.795	2325300	11210.08948
Slovak Republic	Europe	1977	70.45	4827803	10922.66404
Slovenia	Europe	1977	70.97	1746919	15277.03017
Somalia	Africa	1977	41.974	4353666	1450.992513
South Africa	Africa	1977	55.527	27129932	8028.651439
Spain	Europe	1977	74.39	36439000	13236.92117
Sri Lanka	Asia	1977	65.949	14116836	1348.775651
Sudan	Africa	1977	47.8	17104986	2202.988423
Swaziland	Africa	1977	52.537	551425	3781.410618
Sweden	Europe	1977	75.44	8251648	18855.72521
Switzerland	Europe	1977	75.39	6316424	26982.29052
Syria	Asia	1977	61.195	7932503	3195.484582


### Challenge Question 2

In [32]:
if grep -v 'Africa' gapminder.tsv | grep -q '^Z'; then
 echo "There is a country that starts with Z outside of Africa"
else
 echo "There is no country that starts with Z outside of Africa"
fi

There is no country that starts with Z outside of Africa


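### Challenge Question 3

The continent names are anchored between tabs so that country names containing a continent name (_e.g._ South Africa) are left untouched. Note that the continent is spelled `Americas` in the data. This assumes your `sed` interprets `\t` as a tab, as GNU `sed` does. 

In [33]:
sed -E 's/\t(Africa|Americas|Asia|Europe|Oceania)\t/\tPangaea\t/' gapminder.tsv > gapminder.pangaea.tsv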
head gapminder.pangaea.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Pangaea	1952	28.801	8425333	779.4453145
Afghanistan	Pangaea	1957	30.332	9240934	820.8530296
Afghanistan	Pangaea	1962	31.997	10267083	853.10071
Afghanistan	Pangaea	1967	34.02	11537966	836.1971382
Afghanistan	Pangaea	1972	36.088	13079460	739.9811058
Afghanistan	Pangaea	1977	38.438	14880372	786.11336
Afghanistan	Pangaea	1982	39.854	12881816	978.0114388
Afghanistan	Pangaea	1987	40.822	13867957	852.3959448
Afghanistan	Pangaea	1992	41.674	16317921	649.3413952


### Challenge Question 4

In [34]:
sed -E 's/([0-9]+)\.[0-9]+/\1/g' gapminder.tsv > gapminder.no_decimal.tsv
head gapminder.no_decimal.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28	8425333	779
Afghanistan	Asia	1957	30	9240934	820
Afghanistan	Asia	1962	31	10267083	853
Afghanistan	Asia	1967	34	11537966	836
Afghanistan	Asia	1972	36	13079460	739
Afghanistan	Asia	1977	38	14880372	786
Afghanistan	Asia	1982	39	12881816	978
Afghanistan	Asia	1987	40	13867957	852
Afghanistan	Asia	1992	41	16317921	649


### Challenge Question 5

In [2]:
awk 'BEGIN {FS=OFS="\t"} $1 != "country" {$2 = "Pangaea"} {print $0}' gapminder.tsv > gapminder.pangaea.2.tsv
head gapminder.pangaea.2.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Pangaea	1952	28.801	8425333	779.4453145
Afghanistan	Pangaea	1957	30.332	9240934	820.8530296
Afghanistan	Pangaea	1962	31.997	10267083	853.10071
Afghanistan	Pangaea	1967	34.02	11537966	836.1971382
Afghanistan	Pangaea	1972	36.088	13079460	739.9811058
Afghanistan	Pangaea	1977	38.438	14880372	786.11336
Afghanistan	Pangaea	1982	39.854	12881816	978.0114388
Afghanistan	Pangaea	1987	40.822	13867957	852.3959448
Afghanistan	Pangaea	1992	41.674	16317921	649.3413952


The above solution works fine. You can make it a bit simpler (assuming your header is on the first line). `NR` in `awk` refers to the line number. Here, we are changing the second column for every line with a line number greater than 1 (_i.e._ any non-header line). 

In [3]:
awk 'BEGIN {FS=OFS="\t"} NR > 1 {$2 = "Pangaea"} {print $0}' gapminder.tsv > gapminder.pangaea.3.tsv
head gapminder.pangaea.3.tsv

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Pangaea	1952	28.801	8425333	779.4453145
Afghanistan	Pangaea	1957	30.332	9240934	820.8530296
Afghanistan	Pangaea	1962	31.997	10267083	853.10071
Afghanistan	Pangaea	1967	34.02	11537966	836.1971382
Afghanistan	Pangaea	1972	36.088	13079460	739.9811058
Afghanistan	Pangaea	1977	38.438	14880372	786.11336
Afghanistan	Pangaea	1982	39.854	12881816	978.0114388
Afghanistan	Pangaea	1987	40.822	13867957	852.3959448
Afghanistan	Pangaea	1992	41.674	16317921	649.3413952
