{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bash - Commands used for running the analyses in the Workshop"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the Genotype Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AllEurasia.poplist.txt\tgenotypes_small.ind  WestEurasia.poplist.txt\n",
      "genotypes_small.geno\tgenotypes_small.snp\n"
     ]
    }
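   ],
   "source": [
    "# Directory holding the workshop data\n",
    "SDIR=/data\n",
    "# Quote the path so that it still works if it contains spaces\n",
    "cd \"$SDIR/pca\"\n",
    "ls"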
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's explore the files. Here are the first 20 individuals; the three columns are the sample ID, the sex and the population label:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             Yuk_009 M    Yukagir\n",
      "             Yuk_025 F    Yukagir\n",
      "             Yuk_022 F    Yukagir\n",
      "             Yuk_020 F    Yukagir\n",
      "               MC_40 M    Chukchi\n",
      "             Yuk_024 F    Yukagir\n",
      "             Yuk_023 F    Yukagir\n",
      "               MC_16 M    Chukchi\n",
      "               MC_15 F    Chukchi\n",
      "               MC_18 M    Chukchi\n",
      "             Yuk_004 M    Yukagir\n",
      "               MC_08 F    Chukchi\n",
      "             Nov_005 M   Nganasan\n",
      "               MC_25 F    Chukchi\n",
      "             Yuk_019 F    Yukagir\n",
      "             Yuk_011 M    Yukagir\n",
      "             Sesk_47 M   Chukchi1\n",
      "               MC_17 M    Chukchi\n",
      "             Yuk_021 M    Yukagir\n",
      "               MC_06 F    Chukchi\n"
     ]
    }
   ],
   "source": [
    "head -20 genotypes_small.ind"
   ]
  },
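  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here are the first 20 SNP rows, with columns for the SNP ID, chromosome, genetic position, physical position and the two alleles:"
   ]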
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "            1_752566     1        0.020130          752566 G A\n",
      "            1_842013     1        0.022518          842013 T G\n",
      "            1_891021     1        0.024116          891021 G A\n",
      "            1_903426     1        0.024457          903426 C T\n",
      "            1_949654     1        0.025727          949654 A G\n",
      "           1_1018704     1        0.026288         1018704 A G\n",
      "           1_1045331     1        0.026665         1045331 G A\n",
      "           1_1048955     1        0.026674         1048955 A G\n",
      "           1_1061166     1        0.026711         1061166 T C\n",
      "           1_1108637     1        0.028311         1108637 G A\n",
      "           1_1120431     1        0.028916         1120431 G A\n",
      "           1_1156131     1        0.029335         1156131 T C\n",
      "           1_1157547     1        0.029356         1157547 T C\n",
      "           1_1158277     1        0.029367         1158277 G A\n",
      "           1_1161780     1        0.029391         1161780 C T\n",
      "           1_1170587     1        0.029450         1170587 C T\n",
      "           1_1205155     1        0.029735         1205155 A C\n",
      "           1_1211292     1        0.029785         1211292 C T\n",
      "           1_1235792     1        0.030045         1235792 C T\n",
      "           1_1254255     1        0.030111         1254255 G A\n"
     ]
    }
   ],
   "source": [
    "head -20 genotypes_small.snp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here are the genotypes at the first 20 SNPs (rows) for the first 100 individuals (columns), one character per genotype:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0101101211102210102021200100010200000011001000200001010110001100001111101001110200110100000111100010\n",
      "2012121012210011122100111202201222121102222121121012202221211212202201101201220222122021220222220221\n",
      "1100112001110021001001111000011200000111100001110001110100002100110111120000102200110100010010000000\n",
      "0000112210222121221121100202221222122112112211202122222221022222111221102200112222122210220111121111\n",
      "0000000000000000000000000000100000000000000000100010000000000000000000000000000100000000100001000000\n",
      "1012100221102201101121110120110000010012002010200100010011100100011011101110120200010120101112120111\n",
      "2222222222222222222222222222222222222222222222222222222222222222222222222222222222121221222222221222\n",
      "2211222002212022102001212222212212222210122212121222112222221112122111222222122021221122222222211122\n",
      "2211222002212022102001212202012212212210122212121122112221221112121111222122112021211112222111211111\n",
      "2222222222102222202222222222222222222222222211222212122222122122222222222222222122221222222222212222\n",
      "2212222212122222222222222222221222222222222220221122222222122221212222221222222202222222222222221222\n",
      "1101100001000001001000000222010021200001202110101111110122100021211110001221120002110001212222122222\n",
      "1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222\n",
      "1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222\n",
      "1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222\n",
      "2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222\n",
      "2222222222222222222222222222222222222222222222222222222222222222222222222222222222121212222222222222\n",
      "1011111102100111001100200122221022211211222021212200120222112121221120012221222102020112222122222222\n",
      "2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222\n",
      "2122121212102221202222222222222221122212222192222211122222222112222222222122222122221221222222222222\n"
     ]
    }
   ],
   "source": [
    "head -20 genotypes_small.geno | cut -c1-100"
   ]
  },
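  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a quick feel for how often each genotype code occurs, the rows can be split into single characters and tallied. This is only a minimal sketch: `tally_codes` is a hypothetical helper, not part of the workshop material, and reading the codes as 0, 1 or 2 copies of the reference allele with 9 for missing data assumes the usual EIGENSTRAT convention. The example runs on a tiny made-up file so that it does not depend on the workshop data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper: tally the genotype codes in the first N rows of a .geno file.\n",
    "# Usage: tally_codes FILE N (assumes one character per genotype)\n",
    "tally_codes() {\n",
    "    head -n \"$2\" \"$1\" |\n",
    "        awk '{ for (i = 1; i <= length($0); i++) n[substr($0, i, 1)]++ }\n",
    "             END { for (c in n) print c, n[c] }' |\n",
    "        sort\n",
    "}\n",
    "\n",
    "# Made-up toy file with two SNPs (rows) and three individuals (columns).\n",
    "toy=$(mktemp)\n",
    "printf '012\\n291\\n' > \"$toy\"\n",
    "tally_codes \"$toy\" 2\n",
    "# On the workshop data one could run: tally_codes genotypes_small.geno 20"
   ]
  },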
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Counting how many individuals and SNPs there are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1340 genotypes_small.ind\n",
      "593124 genotypes_small.snp\n"
     ]
    }
   ],
   "source": [
    "wc -l genotypes_small.ind\n",
    "wc -l genotypes_small.snp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we check that the first row of the `*.geno` file indeed contains one column per individual:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1341\n"
     ]
    }
   ],
   "source": [
    "head -1 genotypes_small.geno | wc -c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That is 1341, one more than the 1340 individuals, because `wc -c` also counts the newline character at the end of the line. Now we count the number of rows in the `*.geno` file (this takes a few seconds, as the file is several hundred MB in size):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "593124 genotypes_small.geno\n"
     ]
    }
   ],
   "source": [
    "wc -l genotypes_small.geno"
   ]
  },
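  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dimension checks above can also be wrapped into a small function that reports a mismatch automatically. This is a minimal sketch, assuming the three files share a common prefix; `check_dims` and the toy files are hypothetical names made up for illustration. It measures the row length with `awk`, so no correction for the newline character is needed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper: check that PREFIX.geno has one row per SNP\n",
    "# and one column per individual.\n",
    "check_dims() {\n",
    "    local prefix=$1 n_ind n_snp n_rows n_cols\n",
    "    n_ind=$(awk 'END { print NR }' \"$prefix.ind\")\n",
    "    n_snp=$(awk 'END { print NR }' \"$prefix.snp\")\n",
    "    n_rows=$(awk 'END { print NR }' \"$prefix.geno\")\n",
    "    # Length of the first row, not counting the trailing newline\n",
    "    n_cols=$(head -n 1 \"$prefix.geno\" | awk '{ print length($0) }')\n",
    "    if [ \"$n_ind\" -eq \"$n_cols\" ] && [ \"$n_snp\" -eq \"$n_rows\" ]; then\n",
    "        echo \"OK: $n_snp SNPs x $n_ind individuals\"\n",
    "    else\n",
    "        echo \"MISMATCH: ind=$n_ind cols=$n_cols snp=$n_snp rows=$n_rows\"\n",
    "        return 1\n",
    "    fi\n",
    "}\n",
    "\n",
    "# Made-up toy dataset with three individuals and two SNPs.\n",
    "tmp=$(mktemp -d)\n",
    "printf 'A M Pop1\\nB F Pop1\\nC F Pop2\\n' > \"$tmp/toy.ind\"\n",
    "printf 'rs1 1 0.0 100 G A\\nrs2 1 0.1 200 T C\\n' > \"$tmp/toy.snp\"\n",
    "printf '012\\n291\\n' > \"$tmp/toy.geno\"\n",
    "check_dims \"$tmp/toy\"\n",
    "# On the workshop data one could run: check_dims genotypes_small"
   ]
  },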
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great, the numbers of rows and columns agree with the line counts of the `*.snp` and `*.ind` files!\n",
    "Now we count how many different populations there are. Let's first look at the first 20 populations in the sorted list, alongside the number of individuals in each group:"
   ]
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      9 Abkhasian\n",
      "     16 Adygei\n",
      "      6 Albanian\n",
      "      7 Aleut\n",
      "      4 Aleut_Tlingit\n",
      "      7 Altaian\n",
      "     10 Ami\n",
      "     10 Armenian\n",
      "      9 Atayal\n",
      "     10 Balkar\n",
      "     29 Basque\n",
      "     25 BedouinA\n",
      "     19 BedouinB\n",
      "     10 Belarusian\n",
      "      6 BolshoyOleniOstrov\n",
      "      9 Borneo\n",
      "     10 Bulgarian\n",
      "      8 Cambodian\n",
      "      2 Canary_Islander\n",
      "      2 ChalmnyVarre\n"
     ]
    }
   ],
   "source": [
    "awk '{print $3}' genotypes_small.ind | sort | uniq -c | head -20"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you look further down the list, you will notice that a number of populations have only one sample. Let's filter those out and count only the populations with at least two individuals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "113\n"
     ]
    }
   ],
   "source": [
    "awk '{print $3}' genotypes_small.ind | sort | uniq -c | awk '$1>1' | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so there are 113 populations with more than one individual in this dataset."
   ]
  },
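  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same filtering can also be done in a single `awk` pass with an associative array, which makes the threshold easy to change. This is a minimal sketch; `count_pops` is a hypothetical helper, and the toy file is made up so that the example runs without the workshop data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper: count the labels in column 3 that occur at least MIN times.\n",
    "# Usage: count_pops MIN FILE\n",
    "count_pops() {\n",
    "    awk -v min=\"$1\" '{ n[$3]++ }\n",
    "        END { for (p in n) if (n[p] >= min) c++; print c + 0 }' \"$2\"\n",
    "}\n",
    "\n",
    "# Made-up toy .ind file: Pop1 has two individuals, Pop2 only one.\n",
    "toy=$(mktemp)\n",
    "printf 'A M Pop1\\nB F Pop1\\nC F Pop2\\n' > \"$toy\"\n",
    "count_pops 2 \"$toy\"   # prints 1\n",
    "count_pops 1 \"$toy\"   # prints 2\n",
    "# On the workshop data one could run: count_pops 2 genotypes_small.ind"
   ]
  },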
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}