{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with Bash Notebooks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notebooks can be run with different underlying kernels: bash, Python and R. Notebooks are useful for documenting interactive data analysis. They combine code cells with markdown cells. A markdown cell can contain text, math or headings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create new bash notebooks using the \"New\" dropdown list in the Jupyter File Browser and then selecting \"Bash\". Existing notebooks open when you click on them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Jupyter notebooks, you work with *Cells*. You can create new cells, or insert them above or below existing cells using the menu items in the `Insert` menu. Use the dropdown list in the command bar in Jupyter to change the type of the cell. The two main types we're going to use are `Markdown` and `Code`. Markdown cells are useful for documentation, Code cells for running code. Markdown cells can be edited by double-clicking them. Render them by pressing Shift-Enter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Code cells are used to enter and execute code. Let's look at some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can first check which directory we are in, using the `pwd` (print working directory) command:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/schiffels/dev/popgen_course\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so we're in the `dev/popgen_course` subfolder within my home folder `/Users/schiffels`. 
We can list the contents of that folder:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "03_Rmd_smartpca.Rmd\n", "03_bashnb_smartpca.ipynb\n", "04_Rmd_plotting_pca.Rmd\n", "04_pynb_plotting_pca.ipynb\n", "05_Rmd_fstatistics.Rmd\n", "05_pynb_fstatistics.ipynb\n", "0_Welcome.ipynb\n", "1A_short_primer_on_jupyter.ipynb\n", "1B_getting_started_with_bash_notebooks.ipynb\n", "1C_getting_started_with_python_notebooks.ipynb\n", "1D_getting_started_with_R_notebooks.ipynb\n", "README.md\n", "adm_f3_param.txt\n", "adm_f3_popfile.txt\n", "f3_outgroup_stats_Han.txt\n", "f3_outgroup_stats_MA1.txt\n", "f4_param.txt\n", "f4_popfile.txt\n", "img\n", "outgroup_f3_param_Han.txt\n", "outgroup_f3_param_MA1.txt\n", "outgroup_f3_popfile_Han.txt\n", "outgroup_f3_popfile_MA1.txt\n", "pca.AllEurasia.eval\n", "pca.AllEurasia.evec\n", "pca.AllEurasia.params.txt\n", "pca.WestEurasia.eval\n", "pca.WestEurasia.evec\n", "pca.WestEurasia.params.txt\n", "population_frequencies.txt\n", "supp\n", "test\n", "testDir\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now create a new directory:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: testDir: File exists\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "mkdir testDir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and change into that directory:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "cd testDir" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "and confirm that we are now in the new directory:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/schiffels/dev/popgen_course/testDir\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's go back and delete the subfolder again:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "cd ..\n", "rm -r testDir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a simple example of how to use ``echo``:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, how are you?\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "echo \"Hello, how are you?\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so let's try some more useful things with ``grep``, which can be used to filter large text files by searching for patterns, in this case just the occurrence of the word \"French\":" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " HGDP00511 M French\n", " HGDP00512 M French\n", " HGDP00513 F French\n", " HGDP00514 F French\n", " HGDP00515 M French\n", " HGDP00516 F French\n", " HGDP00517 F French\n", " HGDP00518 M French\n", " HGDP00519 M French\n", " HGDP00522 M French\n", " HGDP00523 F French\n", " HGDP00524 F French\n", " HGDP00525 M French\n", " HGDP00526 F French\n", " HGDP00527 F French\n", " HGDP00528 M French\n", " HGDP00529 F French\n", " HGDP00531 F French\n", " 
HGDP00533 M French\n", " HGDP00534 F French\n", " HGDP00535 F French\n", " HGDP00536 F French\n", " HGDP00537 F French\n", " HGDP00538 M French\n", " HGDP00539 F French\n", " SouthFrench3326 M French\n", " SouthFrench3947 M French\n", " SouthFrench1323 M French\n", " SouthFrench3951 M French\n", " SouthFrench3068 M French\n", " SouthFrench1112 M French\n", " SouthFrench4018 M French\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "grep French example_data/example.ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, so that lists all French individuals in that file. Now let's count them, by simply passing the flag `-c`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "32\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "grep -c French example_data/example.ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Note:*** So far we have seen the `pwd`, `mkdir`, `cd`, `rm`, `ls`, `echo` and `grep` commands. If you want to find out more about those, just search for them online; they are among the most popular and widely used commands/programs in Unix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python3 notebooks you can plot things: Create a new python3 notebook, and run this boilerplate code in the first cell:\n", "\n", "    %matplotlib inline\n", "    import matplotlib.pyplot as plt\n", "\n", "Then plot something in a second cell:\n", "\n", "***Exercise:*** Create a simple plot using `plt.plot([1, 2, 3], [5, 2, 6])`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bash Pipes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK. So this first notebook runs Bash, which is more or less the lingua franca of Linux operating systems. Most of what you type on a Linux command line is interpreted by a shell such as bash. 
One of the most useful techniques in bash scripts and on the bash command line is the Unix pipe. To illustrate pipes, consider the following." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the structure of our ``ind`` file:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Yuk_009 M Yukagir\n", " Yuk_025 F Yukagir\n", " Yuk_022 F Yukagir\n", " Yuk_020 F Yukagir\n", " MC_40 M Chukchi\n", " Yuk_024 F Yukagir\n", " Nesk_25 F Eskimo_Naukan\n", " Yuk_023 F Yukagir\n", " MC_16 M Chukchi\n", " MC_15 F Chukchi\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "head example_data/example.ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Note:*** By default, the `head` command prints the first 10 lines of a file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's extract the population column:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yukagir\n", "Yukagir\n", "Yukagir\n", "Yukagir\n", "Chukchi\n", "Yukagir\n", "Eskimo_Naukan\n", "Yukagir\n", "Chukchi\n", "Chukchi\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "head example_data/example.ind | awk '{print $3}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Note:*** The `awk` program is one of the most powerful programs for text-file processing in the Unix world. It is actually a full-fledged programming language itself. Here we only use it in one of its simplest forms. The program `{print $3}` simply says \"For every line of the input, print the third whitespace-separated field\"."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Note:*** The pipe symbol `|` tells the shell to feed the standard output of the program on its left into the program on its right as standard input." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's sort the output (notice we now use ``cat`` instead of ``head`` at the start, but use ``head`` at the end):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Abkhasian\n", "Adygei\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "cat example_data/example.ind | awk '{print $3}' | sort | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so you may see some error messages at the end, because ``head`` stops reading after ten lines and ungracefully discards the rest of the data, but that's OK."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's use ``uniq`` to remove duplicate population names (``uniq`` only collapses adjacent identical lines, which is why we sort first):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Abkhasian\n", "Adygei\n", "Albanian\n", "Aleut\n", "Aleut_Tlingit\n", "Altaian\n", "Ami\n", "Armenian\n", "Atayal\n", "Balkar\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "cat example_data/example.ind | awk '{print $3}' | sort | uniq | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now let's count:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 120\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "cat example_data/example.ind | awk '{print $3}' | sort | uniq | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so there are 120 populations in the dataset. And how many individuals?" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1371 example_data/example.ind\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "wc -l example_data/example.ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there are 1371 individuals in 120 populations, which is a bit more than 11 per population on average. Good to know!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Note:*** We learned some new Unix commands: `awk`, `cat`, `head`, `sort`, `uniq` and `wc`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a final step, let's modify our pipeline to output not just the unique populations, but also the number of individuals per population. 
Fortunately this is extremely easy, since the flag `-c` to the `uniq` command already does the job:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9 Abkhasian\n", " 16 Adygei\n", " 6 Albanian\n", " 7 Aleut\n", " 4 Aleut_Tlingit\n", " 7 Altaian\n", " 10 Ami\n", " 10 Armenian\n", " 9 Atayal\n", " 10 Balkar\n" ] }, { "ename": "", "evalue": "1", "output_type": "error", "traceback": [] } ], "source": [ "cat example_data/example.ind | awk '{print $3}' | sort | uniq -c | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice. Let's put that list into a file that we can then import for plotting later." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "cat /data/popgen_course/genotypes_small.ind | awk '{print $3}' | sort | uniq -c > population_frequencies.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, we have created a new file called `population_frequencies.txt` in our current directory. We have used the bash redirection symbol `>` to write the output of a command or pipeline into a file. The file should now contain the number of individuals per population. We can check this by running:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9 Abkhasian\n", " 16 Adygei\n", " 6 Albanian\n", " 7 Aleut\n", " 4 Aleut_Tlingit\n", " 7 Altaian\n", " 10 Ami\n", " 10 Armenian\n", " 9 Atayal\n", " 10 Balkar\n" ] } ], "source": [ "head population_frequencies.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, it seems to have worked. If you want to look at the file in a more interactive way, go back to your Jupyter File Browser and click on the file, which you should now see within your working directory. The file should open in a text editor that you can use to scroll around."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now that we have a file to plot, let's try it out using a new python3 notebook. See the next notebook, called `02_pynb_getting_started` in this series." ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }