{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bash - Commands used for running the analyses in the Workshop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Genotype Data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AllEurasia.poplist.txt\tgenotypes_small.ind WestEurasia.poplist.txt\n", "genotypes_small.geno\tgenotypes_small.snp\n" ] } ], "source": [ "SDIR=/data;\n", "cd $SDIR/pca;\n", "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the files. Here are the first 20 individuals:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Yuk_009 M Yukagir\n", " Yuk_025 F Yukagir\n", " Yuk_022 F Yukagir\n", " Yuk_020 F Yukagir\n", " MC_40 M Chukchi\n", " Yuk_024 F Yukagir\n", " Yuk_023 F Yukagir\n", " MC_16 M Chukchi\n", " MC_15 F Chukchi\n", " MC_18 M Chukchi\n", " Yuk_004 M Yukagir\n", " MC_08 F Chukchi\n", " Nov_005 M Nganasan\n", " MC_25 F Chukchi\n", " Yuk_019 F Yukagir\n", " Yuk_011 M Yukagir\n", " Sesk_47 M Chukchi1\n", " MC_17 M Chukchi\n", " Yuk_021 M Yukagir\n", " MC_06 F Chukchi\n" ] } ], "source": [ "head -20 genotypes_small.ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the first 20 SNP rows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1_752566 1 0.020130 752566 G A\n", " 1_842013 1 0.022518 842013 T G\n", " 1_891021 1 0.024116 891021 G A\n", " 1_903426 1 0.024457 903426 C T\n", " 1_949654 1 0.025727 949654 A G\n", " 1_1018704 1 0.026288 1018704 A G\n", " 1_1045331 1 0.026665 1045331 G A\n", " 1_1048955 1 0.026674 1048955 A G\n", " 1_1061166 1 0.026711 1061166 T C\n", " 1_1108637 1 0.028311 1108637 G A\n", " 1_1120431 1 0.028916 1120431 G A\n", " 1_1156131 1 0.029335 1156131 T C\n", " 
1_1157547 1 0.029356 1157547 T C\n", " 1_1158277 1 0.029367 1158277 G A\n", " 1_1161780 1 0.029391 1161780 C T\n", " 1_1170587 1 0.029450 1170587 C T\n", " 1_1205155 1 0.029735 1205155 A C\n", " 1_1211292 1 0.029785 1211292 C T\n", " 1_1235792 1 0.030045 1235792 C T\n", " 1_1254255 1 0.030111 1254255 G A\n" ] } ], "source": [ "head -20 genotypes_small.snp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the first 20 genotypes of the first 100 individuals:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0101101211102210102021200100010200000011001000200001010110001100001111101001110200110100000111100010\n", "2012121012210011122100111202201222121102222121121012202221211212202201101201220222122021220222220221\n", "1100112001110021001001111000011200000111100001110001110100002100110111120000102200110100010010000000\n", "0000112210222121221121100202221222122112112211202122222221022222111221102200112222122210220111121111\n", "0000000000000000000000000000100000000000000000100010000000000000000000000000000100000000100001000000\n", "1012100221102201101121110120110000010012002010200100010011100100011011101110120200010120101112120111\n", "2222222222222222222222222222222222222222222222222222222222222222222222222222222222121221222222221222\n", "2211222002212022102001212222212212222210122212121222112222221112122111222222122021221122222222211122\n", "2211222002212022102001212202012212212210122212121122112221221112121111222122112021211112222111211111\n", "2222222222102222202222222222222222222222222211222212122222122122222222222222222122221222222222212222\n", "2212222212122222222222222222221222222222222220221122222222122221212222221222222202222222222222221222\n", "1101100001000001001000000222010021200001202110101111110122100021211110001221120002110001212222122222\n", "1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222\n", 
"1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222\n", "1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222\n", "2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222\n", "2222222222222222222222222222222222222222222222222222222222222222222222222222222222121212222222222222\n", "1011111102100111001100200122221022211211222021212200120222112121221120012221222102020112222122222222\n", "2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222\n", "2122121212102221202222222222222221122212222192222211122222222112222222222122222122221221222222222222\n" ] } ], "source": [ "head -20 genotypes_small.geno | cut -c1-100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting how many individuals and SNPs there are:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1340 genotypes_small.ind\n", "593124 genotypes_small.snp\n" ] } ], "source": [ "wc -l genotypes_small.ind\n", "wc -l genotypes_small.snp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we check that the first row of the `*.geno` file indeed contains the same number of columns:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1341\n" ] } ], "source": [ "head -1 genotypes_small.geno | wc -c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which is one more, including the newline character at the end of the line. 
Now we count the number of rows in the `*.geno` file (this takes a few seconds, as the file is several hundred MB in size):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "593124 genotypes_small.geno\n" ] } ], "source": [ "wc -l genotypes_small.geno" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, the numbers of rows and columns agree with the numbers of SNPs and individuals listed in the `*.snp` and `*.ind` files!\n", "Now let's count how many different populations there are. Let's first look at the first 20 populations in the sorted list, alongside the number of individuals in each group:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9 Abkhasian\n", " 16 Adygei\n", " 6 Albanian\n", " 7 Aleut\n", " 4 Aleut_Tlingit\n", " 7 Altaian\n", " 10 Ami\n", " 10 Armenian\n", " 9 Atayal\n", " 10 Balkar\n", " 29 Basque\n", " 25 BedouinA\n", " 19 BedouinB\n", " 10 Belarusian\n", " 6 BolshoyOleniOstrov\n", " 9 Borneo\n", " 10 Bulgarian\n", " 8 Cambodian\n", " 2 Canary_Islander\n", " 2 ChalmnyVarre\n" ] } ], "source": [ "awk '{print $3}' genotypes_small.ind | sort | uniq -c | head -20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look further down the list, you will notice that a number of populations have only one sample. Let's filter those out and count only the populations with at least two individuals:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "113\n" ] } ], "source": [ "awk '{print $3}' genotypes_small.ind | sort | uniq -c | awk '$1>1' | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so there are 113 populations with more than one individual in this dataset."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }