# Getting started with Bash Notebooks

Notebooks can be loaded with different underlying kernels: Bash, Python and R. They are useful for documenting interactive data analysis because they combine code cells with Markdown cells. A Markdown cell can contain text, math or headings.

You can create a new Bash notebook using the "New" dropdown list in the Jupyter File Browser and selecting "Bash". Existing notebooks open when you click on them.

In Jupyter notebooks, you work with *cells*. You can create new cells, or insert them above or below existing cells, using the items in the `Insert` menu. Use the dropdown list in the Jupyter command bar to change the type of a cell. The two main types we're going to use are `Markdown` and `Code`. Markdown cells are useful for documentation, Code cells for running code. Markdown cells can be edited by double-clicking them and are rendered by pressing Shift-Enter.

Code cells are used to enter and execute code. Let's look at some examples.

We can first check which directory we are in, using the `pwd` (= Print Working Directory) command:

In [1]:
pwd

/Users/schiffels/dev/popgen_course

OK, so we're in the `dev/popgen_course` subfolder within my home folder `/Users/schiffels`. We can list the contents of that folder:

In [2]:
ls

03_Rmd_smartpca.Rmd
03_bashnb_smartpca.ipynb
04_Rmd_plotting_pca.Rmd
04_pynb_plotting_pca.ipynb
05_Rmd_fstatistics.Rmd
05_pynb_fstatistics.ipynb
0_Welcome.ipynb
1A_short_primer_on_jupyter.ipynb
1B_getting_started_with_bash_notebooks.ipynb
1C_getting_started_with_python_notebooks.ipynb
1D_getting_started_with_R_notebooks.ipynb
README.md
adm_f3_param.txt
adm_f3_popfile.txt
f3_outgroup_stats_Han.txt
f3_outgroup_stats_MA1.txt
f4_param.txt
f4_popfile.txt
img
outgroup_f3_param_Han.txt
outgroup_f3_param_MA1.txt
outgroup_f3_popfile_Han.txt
outgroup_f3_popfile_MA1.txt
pca.AllEurasia.eval
pca.AllEurasia.evec
pca.AllEurasia.params.txt
pca.WestEurasia.eval
pca.WestEurasia.evec
pca.WestEurasia.params.txt
population_frequencies.txt
supp
test
testDir

We can now create a new directory:

In [3]:
mkdir testDir

mkdir: testDir: File exists

and change into that directory:

In [4]:
cd testDir

[?2004l[?2004h

: 1

and confirm that we are now in the new dir:

In [5]:
pwd

/Users/schiffels/dev/popgen_course/testDir
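
The `File exists` message from `mkdir` earlier simply means that `testDir` was already there from a previous run. One way to make that step repeatable is the `-p` flag, which lets `mkdir` succeed silently when the directory already exists. The following is a minimal sketch; the scratch location from `mktemp -d` and the variable names `scratch` and `inside` are assumptions for illustration, chosen so that nothing in the course folder is touched.

```shell
# Work in a throwaway directory so the course folder is left untouched.
scratch=$(mktemp -d)

mkdir -p "$scratch/testDir"   # creates the directory
mkdir -p "$scratch/testDir"   # second call: no error although it already exists

# Command substitution runs in a subshell, so this cd does not change
# the working directory of the notebook itself.
inside=$(cd "$scratch/testDir" && pwd)
echo "$inside"

# Clean up the scratch directory again.
rm -r "$scratch"
```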

OK, let's go back and delete the subfolder again:

In [6]:
cd ..
rm -r testDir

[?2004h[?2004l

: 1

Here is a simple example of how to use ``echo``:

In [7]:
echo "Hello, how are you?"

Hello, how are you?

OK, so let's try some more useful things with ``grep``, which can be used to filter large text files by searching for patterns, in this case just the occurrence of the word "French":

In [9]:
grep French example_data/example.ind

 HGDP00511 M French
 HGDP00512 M French
 HGDP00513 F French
 HGDP00514 F French
 HGDP00515 M French
 HGDP00516 F French
 HGDP00517 F French
 HGDP00518 M French
 HGDP00519 M French
 HGDP00522 M French
 HGDP00523 F French
 HGDP00524 F French
 HGDP00525 M French
 HGDP00526 F French
 HGDP00527 F French
 HGDP00528 M French
 HGDP00529 F French
 HGDP00531 F French
 HGDP00533 M French
 HGDP00534 F French
 HGDP00535 F French
 HGDP00536 F French
 HGDP00537 F French
 HGDP00538 M French
 HGDP00539 F French
 SouthFrench3326 M French
 SouthFrench3947 M French
 SouthFrench1323 M French
 SouthFrench3951 M French
 SouthFrench3068 M French
 SouthFrench1112 M French
 SouthFrench4018 M French

Alright, so that lists all French individuals in the file. Now let's count them by passing the `-c` flag:

In [14]:
grep -c French example_data/example.ind

32

***Note:*** So far we have seen the `pwd`, `ls`, `mkdir`, `cd`, `rm`, `echo` and `grep` commands. If you want to find out more about them, run `man` followed by the command name in a terminal (for example `man grep`) or search online. They are among the most widely used commands in Unix.
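
One detail about `grep` worth knowing: it matches the pattern anywhere on a line. In the output above every match has `French` in the population column, so the count is what we want, but a line whose individual ID merely contains "French" would be counted too. A stricter variant can be sketched with `awk`, which is introduced in the next section. The three-line sample file below, including its `FrenchBasque01` row, is made up for illustration.

```shell
# Made-up sample in the same three-column layout as example.ind.
sample=$(mktemp)
printf '%s\n' \
  ' HGDP00511 M French' \
  ' SouthFrench3326 M French' \
  ' FrenchBasque01 F Basque' > "$sample"

# Substring match anywhere on the line: counts all three rows.
loose=$(grep -c French "$sample")

# Exact match on the third (population) column only: counts two rows.
strict=$(awk '$3 == "French"' "$sample" | wc -l | tr -d ' ')

echo "grep -c: $loose, column match: $strict"
rm "$sample"
```

The `tr -d ' '` step only strips the padding that some versions of `wc` put in front of the number.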

In Python 3 notebooks you can also plot things. Create a new Python 3 notebook and run this boilerplate code in the first cell:

    %matplotlib inline
    import matplotlib.pyplot as plt

Then plot something in a second cell:

***Exercise:*** Create a simple plot using `plt.plot([1, 2, 3], [5, 2, 6])`


# Bash Pipes

OK. So this first notebook runs on Bash, which is more or less the lingua franca of Linux command lines: most of what you type in a terminal is interpreted by Bash or a similar shell. One of the most useful techniques in Bash scripts and commands is the Unix pipe. To illustrate it, consider the following.

Let's look at the structure of our ``ind`` file:

In [15]:
head example_data/example.ind

 Yuk_009 M Yukagir
 Yuk_025 F Yukagir
 Yuk_022 F Yukagir
 Yuk_020 F Yukagir
 MC_40 M Chukchi
 Yuk_024 F Yukagir
 Nesk_25 F Eskimo_Naukan
 Yuk_023 F Yukagir
 MC_16 M Chukchi
 MC_15 F Chukchi

***Note:*** By default, the `head` command prints the first 10 lines of a file.
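
The number of lines can be changed with the `-n` option, and `tail` is the counterpart for the end of a file. A small sketch, using a generated five-line placeholder file rather than the course data:

```shell
# Build a five-line placeholder file (line1 ... line5).
demo=$(mktemp)
printf 'line%s\n' 1 2 3 4 5 > "$demo"

first_two=$(head -n 2 "$demo")   # first two lines
last_one=$(tail -n 1 "$demo")    # last line

echo "$first_two"
echo "$last_one"
rm "$demo"
```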

Let's extract the population column:

In [16]:
head example_data/example.ind | awk '{print $3}'

Yukagir
Yukagir
Yukagir
Yukagir
Chukchi
Yukagir
Eskimo_Naukan
Yukagir
Chukchi
Chukchi

***Note:*** The `awk` program is one of the most powerful programs for text-file processing in the Unix world. It is actually a full-fledged programming language in itself. Here we only use it in one of its simplest forms. The program `{print $3}` simply says "For every line of the input, print the third field".
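
To give a flavour of what else a one-line `awk` program can do, here is a hedged sketch that prints two fields in a different order and filters rows by a condition. The three input rows are copied from the `head` output above; the variable names `rows`, `swapped` and `females` are only for illustration.

```shell
rows=' Yuk_009 M Yukagir
 Yuk_025 F Yukagir
 MC_40 M Chukchi'

# Print the population first, then the individual ID (fields 3 and 1).
swapped=$(printf '%s\n' "$rows" | awk '{print $3, $1}')

# Print only the IDs of rows whose second field is "F".
females=$(printf '%s\n' "$rows" | awk '$2 == "F" {print $1}')

echo "$swapped"
echo "$females"
```

A condition placed before the `{...}` block restricts the action to the lines for which it is true.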

***Note:*** The pipe symbol `|` tells the shell to feed the output of the program on its left into the program on its right as standard input.
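
One way to think about a pipe is as a shortcut for writing the first program's output to a temporary file and letting the second program read that file. The sketch below does the same job both ways on three placeholder lines; the names `tmp`, `via_file` and `via_pipe` are illustrative.

```shell
# With an intermediate file.
tmp=$(mktemp)
printf 'b\na\nc\n' > "$tmp"
via_file=$(sort "$tmp")
rm "$tmp"

# Same result with a pipe: no intermediate file is needed.
via_pipe=$(printf 'b\na\nc\n' | sort)

echo "$via_pipe"
```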

Let's sort the output (notice that we now use ``cat`` instead of ``head`` at the start, and use ``head`` at the end):

In [17]:
cat example_data/example.ind | awk '{print $3}' | sort | head

Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Adygei

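OK. Depending on your setup, you may see some error messages at the end (typically complaining about a broken pipe), because ``head`` stops reading after ten lines and the commands before it can no longer write the rest of their output. That is harmless here.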
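
If such messages are distracting, one option is to discard the standard error stream of the command in front of `head`. This is only a sketch with generated numbers instead of the course data; which command complains, and how, depends on the system. Note that `2>/dev/null` hides every error message of that command, not just this one.

```shell
# Generate far more lines than head will read (placeholder data).
# Any complaint from sort about the closed pipe is sent to /dev/null.
first_three=$(seq 1 100000 | sort -n 2>/dev/null | head -n 3)
printf '%s\n' "$first_three"
```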

Now let's use ``uniq`` to get rid of population name duplicates:

In [18]:
cat example_data/example.ind | awk '{print $3}' | sort | uniq | head

Abkhasian
Adygei
Albanian
Aleut
Aleut_Tlingit
Altaian
Ami
Armenian
Atayal
Balkar

And now let's count:

In [19]:
cat example_data/example.ind | awk '{print $3}' | sort | uniq | wc -l

 120

OK, so there are 120 populations in the dataset. And how many individuals?

In [20]:
wc -l example_data/example.ind

 1371 example_data/example.ind

So there are 1371 individuals in 120 populations, which is about 11 per population on average. Good to know!
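
The average can be checked directly in the shell. Shell arithmetic only handles integers, so this sketch uses `awk` for the decimal result. The two numbers are the counts obtained above; the variable names are illustrative.

```shell
individuals=1371
populations=120

# Integer division in the shell drops the fractional part.
echo $((individuals / populations))

# awk does floating-point arithmetic; -v passes shell values in as awk variables.
average=$(awk -v n="$individuals" -v p="$populations" 'BEGIN { printf "%.1f", n / p }')
echo "$average"
```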

***Note:*** we learned some new Unix commands: `awk`, `cat`, `head`, `sort`, `uniq` and `wc`.
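
The order of `sort` and `uniq` in the pipeline matters: `uniq` only collapses identical lines that are directly adjacent, which is why the input is sorted first. A short sketch with a few population names from the file (the variable names are illustrative):

```shell
names='Yukagir
Yukagir
Chukchi
Yukagir'

# Without sorting, the last Yukagir is not adjacent to the first two and survives.
unsorted=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# After sorting, duplicates are adjacent and collapse to one line each.
sorted=$(printf '%s\n' "$names" | sort | uniq | wc -l | tr -d ' ')

echo "uniq alone: $unsorted lines, sort | uniq: $sorted lines"
```

`sort -u` combines both steps in one command, although it cannot count occurrences.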

As a final step, let's modify our pipeline to output not just the unique populations, but also the number of individuals per population. Fortunately this is extremely easy, since the `-c` flag of the `uniq` command already does the job:

In [21]:
cat example_data/example.ind | awk '{print $3}' | sort | uniq -c | head

 9 Abkhasian
 16 Adygei
 6 Albanian
 7 Aleut
 4 Aleut_Tlingit
 7 Altaian
 10 Ami
 10 Armenian
 9 Atayal
 10 Balkar

Nice. Let's put that list into a file that we can then import for plotting later.

In [21]:
cat example_data/example.ind | awk '{print $3}' | sort | uniq -c > population_frequencies.txt

OK, we have created a new file called `population_frequencies.txt` in our current directory. We used the Bash redirection symbol `>` to write the output of a command or pipeline into a file. The file should now contain the number of individuals per population. We can check this by running:

In [22]:
head population_frequencies.txt

 9 Abkhasian
 16 Adygei
 6 Albanian
 7 Aleut
 4 Aleut_Tlingit
 7 Altaian
 10 Ami
 10 Armenian
 9 Atayal
 10 Balkar


OK, it seems to have worked. If you want to look at the file in a more interactive way, go back to your Jupyter File Browser and click on the file, which you should now see within your working directory. The file should open in a text editor that you can use to scroll around.
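
One thing to keep in mind about `>` is that it overwrites an existing file without asking, whereas `>>` appends to it. The sketch below shows both on a scratch file; the file from `mktemp` and the variable names `out` and `lines` are assumptions for illustration.

```shell
out=$(mktemp)

echo "first"  > "$out"    # creates the file or overwrites it
echo "second" > "$out"    # overwrites again: only "second" remains
echo "third" >> "$out"    # appends: the file now has two lines

lines=$(wc -l < "$out" | tr -d ' ')
echo "$lines"
cat "$out"
rm "$out"
```

Re-running the pipeline with `>` therefore replaces `population_frequencies.txt` instead of adding to it.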

OK, now that we have a file to plot, let's try it out using a new python3 notebook. See the next notebook, called `02_pynb_getting_started` in this series.