{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function, division"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `PyPlink`\n",
    "\n",
    "`PyPlink` is a Python module to read and write binary Plink files. Here are small examples for `PyPlink`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyplink import PyPlink"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of contents\n",
    "\n",
    "* [**Reading binary pedfile**](#Reading-binary-pedfile)\n",
    "    * [Getting the demo data](#Getting-the-demo-data)\n",
    "    * [Reading the binary file](#Reading-the-binary-file)\n",
    "    * [Getting dataset information](#Getting-dataset-information)\n",
    "    * [Iterating over all markers](#Iterating-over-all-markers)\n",
    "        * [*Additive format*](#iterating_over_all_additive)\n",
    "        * [*Nucleotide format*](#iterating_over_all_nuc)\n",
    "    * [Iterating over selected markers](#Iterating-over-selected-markers)\n",
    "        * [*Additive format*](#iterating_over_selected_additive)\n",
    "        * [*Nucleotide format*](#iterating_over_selected_nuc)\n",
    "    * [Extracting a single marker](#Extracting-a-single-marker)\n",
    "        * [*Additive format*](#extracting_additive)\n",
    "        * [*Nucleotide format*](#extracting_nuc)\n",
    "    * [Misc example](#Misc-example)\n",
    "        * [*Extracting a subset of markers and samples*](#Extracting-a-subset-of-markers-and-samples)\n",
    "        * [*Counting the allele frequency of markers*](#Counting-the-allele-frequency-of-markers)\n",
    "\n",
    "\n",
    "* [**Writing binary pedfile**](#Writing-binary-pedfile)\n",
    "    * [SNP-major format](#SNP-major-format)\n",
    "    * [INDIVIDUAL-major-format](#INDIVIDUAL-major-format)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading binary pedfile\n",
    "\n",
    "### Getting the demo data\n",
    "\n",
    "The [`Plink`](http://pngu.mgh.harvard.edu/~purcell/plink/) softwares provides a testing dataset on the [resources page](http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml). It contains the 270 samples from the HapMap project (release 23) on build GRCh36/hg18."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import zipfile\n",
    "try:\n",
    "    from urllib.request import urlretrieve\n",
    "except ImportError:\n",
    "    from urllib import urlretrieve\n",
    "\n",
    "# Downloading the demo data from Plink webset\n",
    "urlretrieve(\n",
    "    \"http://pngu.mgh.harvard.edu/~purcell/plink/dist/hapmap_r23a.zip\",\n",
    "    \"hapmap_r23a.zip\",\n",
    ")\n",
    "\n",
    "# Extracting the archive content\n",
    "with zipfile.ZipFile(\"hapmap_r23a.zip\", \"r\") as z:\n",
    "    z.extractall(\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading the binary file\n",
    "\n",
    "To read a binary file, `PyPlink` only requires the prefix of the files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pedfile = PyPlink(\"hapmap_r23a\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting dataset information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "270 samples and 4,098,136 markers\n"
     ]
    }
   ],
   "source": [
    "print(\"{:,d} samples and {:,d} markers\".format(\n",
    "    pedfile.get_nb_samples(),\n",
    "    pedfile.get_nb_markers(),\n",
    "))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fid</th>\n",
       "      <th>iid</th>\n",
       "      <th>father</th>\n",
       "      <th>mother</th>\n",
       "      <th>gender</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NA18524</td>\n",
       "      <td>NA18524</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NA18526</td>\n",
       "      <td>NA18526</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NA18529</td>\n",
       "      <td>NA18529</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NA18532</td>\n",
       "      <td>NA18532</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NA18537</td>\n",
       "      <td>NA18537</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       fid      iid father mother  gender  status\n",
       "0  NA18524  NA18524      0      0       1      -9\n",
       "1  NA18526  NA18526      0      0       2      -9\n",
       "2  NA18529  NA18529      0      0       2      -9\n",
       "3  NA18532  NA18532      0      0       2      -9\n",
       "4  NA18537  NA18537      0      0       2      -9"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_samples = pedfile.get_fam()\n",
    "all_samples.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>pos</th>\n",
       "      <th>cm</th>\n",
       "      <th>a1</th>\n",
       "      <th>a2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>snp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>rs10399749</th>\n",
       "      <td>1</td>\n",
       "      <td>45162</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rs2949420</th>\n",
       "      <td>1</td>\n",
       "      <td>45257</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rs4030303</th>\n",
       "      <td>1</td>\n",
       "      <td>72434</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rs4030300</th>\n",
       "      <td>1</td>\n",
       "      <td>72515</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rs3855952</th>\n",
       "      <td>1</td>\n",
       "      <td>77689</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            chrom    pos  cm a1 a2\n",
       "snp                               \n",
       "rs10399749      1  45162   0  0  C\n",
       "rs2949420       1  45257   0  0  T\n",
       "rs4030303       1  72434   0  A  G\n",
       "rs4030300       1  72515   0  0  C\n",
       "rs3855952       1  77689   0  0  A"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_markers = pedfile.get_bim()\n",
    "all_markers.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Iterating over all markers\n",
    "\n",
    "<a id=\"iterating_over_all_additive\"></a>\n",
    "#### Additive format\n",
    "\n",
    "Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is unknown, `0` is homozygous (major allele), `1` is heterozygous, and `2` is homozygous (minor allele)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs10399749\n",
      "[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0\n",
      "  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0 -1\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0 -1  0  0  0\n",
      "  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0 -1  0  0  0  0  0  0  0\n",
      " -1  0 -1  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0 -1  0 -1  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]\n"
     ]
    }
   ],
   "source": [
    "for marker_id, genotypes in pedfile:\n",
    "    print(marker_id)\n",
    "    print(genotypes)\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs10399749\n",
      "[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0\n",
      "  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0 -1\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0 -1  0  0  0\n",
      "  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0 -1  0  0  0  0  0  0  0\n",
      " -1  0 -1  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0 -1  0 -1  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]\n"
     ]
    }
   ],
   "source": [
    "for marker_id, genotypes in pedfile.iter_geno():\n",
    "    print(marker_id)\n",
    "    print(genotypes)\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"iterating_over_all_nuc\"></a>\n",
    "#### Nucleotide format\n",
    "\n",
    "Cycling through genotypes as `A`, `C`, `G` and `T` values (where `00` is unknown)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs10399749\n",
      "['CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " '00' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' '00' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
      " 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC']\n"
     ]
    }
   ],
   "source": [
    "for marker_id, genotypes in pedfile.iter_acgt_geno():\n",
    "    print(marker_id)\n",
    "    print(genotypes)\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Iterating over selected markers\n",
    "\n",
    "<a id=\"iterating_over_selected_additive\"></a>\n",
    "#### Additive format\n",
    "\n",
    "Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is unknown, `0` is homozygous (major allele), `1` is heterozygous, and `2` is homozygous (minor allele)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs7092431\n",
      "[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]\n",
      "\n",
      "rs9943770\n",
      "[ 0  0  0  2  0  2  0  1  1  0  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  0  0\n",
      "  1  0  1  0  0  1  1  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  1\n",
      "  0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  1  1  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  1  0  0  0  0  0  0  0\n",
      "  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0\n",
      "  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  1 -1  0  0  0  1  2  1  1  0  1  1  2  0  0  0  1  0  0  0  0  1  1  0  2\n",
      "  1  1  1  0  1  1 -1  1  0  0  1  0  0  2  0  1  2  0  1  1  1  2  0  1  0\n",
      "  0  0  0  0  0  0  2  1  2  1  1  2  1  1  1  1  0  1  1  2  1  0 -1  0  1\n",
      "  1  1  0  1  0  1 -1  2  1  0  0  0  1  1  1  2  0  0  1  1]\n",
      "\n",
      "rs1587483\n",
      "[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  1  0  0  0  0  0  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0\n",
      "  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "markers = [\"rs7092431\", \"rs9943770\", \"rs1587483\"]\n",
    "for marker_id, genotypes in pedfile.iter_geno_marker(markers):\n",
    "    print(marker_id)\n",
    "    print(genotypes, end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"iterating_over_selected_nuc\"></a>\n",
    "#### Nucleotide format\n",
    "\n",
    "Cycling through genotypes as `A`, `C`, `G` and `T` values (where `00` is unknown)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs7092431\n",
      "['GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' '00'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG']\n",
      "\n",
      "rs9943770\n",
      "['AA' 'AA' 'AA' 'GG' 'AA' 'GG' 'AA' 'GA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'GA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'GA' 'AA' 'AA' 'GA' 'GA' 'GA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'GA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' '00' 'AA' 'AA' 'AA'\n",
      " 'GA' 'GG' 'GA' 'GA' 'AA' 'GA' 'GA' 'GG' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA'\n",
      " 'AA' 'GA' 'GA' 'AA' 'GG' 'GA' 'GA' 'GA' 'AA' 'GA' 'GA' '00' 'GA' 'AA' 'AA'\n",
      " 'GA' 'AA' 'AA' 'GG' 'AA' 'GA' 'GG' 'AA' 'GA' 'GA' 'GA' 'GG' 'AA' 'GA' 'AA'\n",
      " 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GG' 'GA' 'GG' 'GA' 'GA' 'GG' 'GA' 'GA' 'GA'\n",
      " 'GA' 'AA' 'GA' 'GA' 'GG' 'GA' 'AA' '00' 'AA' 'GA' 'GA' 'GA' 'AA' 'GA' 'AA'\n",
      " 'GA' '00' 'GG' 'GA' 'AA' 'AA' 'AA' 'GA' 'GA' 'GA' 'GG' 'AA' 'AA' 'GA' 'GA']\n",
      "\n",
      "rs1587483\n",
      "['GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' '00' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' '00' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'CG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' '00' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
      " 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG']\n",
      "\n"
     ]
    }
   ],
   "source": [
    "markers = [\"rs7092431\", \"rs9943770\", \"rs1587483\"]\n",
    "for marker_id, genotypes in pedfile.iter_acgt_geno_marker(markers):\n",
    "    print(marker_id)\n",
    "    print(genotypes, end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting a single marker\n",
    "\n",
    "<a id=\"extracting_additive\"></a>\n",
    "#### Additive format\n",
    "\n",
    "Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is unknown, `0` is homozygous (major allele), `1` is heterozygous, and `2` is homozygous (minor allele)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  0,  0,  0,  0, -1,  0,  0, -1, -1,  0,  0,  0,  0,  0,  0,  0,\n",
       "        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0, -1,\n",
       "        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n",
       "        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,\n",
       "        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,\n",
       "        0,  0,  0,  0,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
       "       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], dtype=int8)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pedfile.get_geno_marker(\"rs7619974\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"extracting_nuc\"></a>\n",
    "#### Nucleotide format\n",
    "\n",
    "Cycling through genotypes as `A`, `C`, `G` and `T` values (where `00` is unknown)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['CC', 'CC', 'CC', 'CC', 'CC', '00', 'CC', 'CC', '00', '00', 'CC',\n",
       "       'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       'CC', 'CC', '00', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       '00', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       'CC', '00', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
       "       'CC', 'CC', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
       "       '00', '00', '00', '00', '00', '00'], \n",
       "      dtype='<U2')"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pedfile.get_acgt_geno_marker(\"rs7619974\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Misc example\n",
    "\n",
    "#### Extracting a subset of markers and samples\n",
    "\n",
    "To get all markers on the Y chromosomes for the males."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "270 total genotypes\n",
      "142 genotypes for 142 males (rs1140798 on chr24 and position 169,542)\n"
     ]
    }
   ],
   "source": [
    "# Getting the Y markers\n",
    "y_markers = all_markers[all_markers.chrom == 24].index.values\n",
    "\n",
    "# Getting the males\n",
    "males = all_samples.gender == 1\n",
    "\n",
    "# Cycling through the Y markers\n",
    "for marker_id, genotypes in pedfile.iter_geno_marker(y_markers):\n",
    "    male_genotypes = genotypes[males.values]\n",
    "    print(\"{:,d} total genotypes\".format(len(genotypes)))\n",
    "    print(\"{:,d} genotypes for {:,d} males ({} on chr{} and position {:,d})\".format(\n",
    "        len(male_genotypes),\n",
    "        males.sum(),\n",
    "        marker_id,\n",
    "        all_markers.loc[marker_id, \"chrom\"],\n",
    "        all_markers.loc[marker_id, \"pos\"],\n",
    "    ))\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Counting the allele frequency of markers\n",
    "\n",
    "To count the minor allele frequency of a subset of markers (only for founders)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs7619974\t0.0\n",
      "rs2949048\t0.02381\n",
      "rs16941434\t0.357143\n"
     ]
    }
   ],
   "source": [
    "# Getting the founders\n",
    "founders = (all_samples.father == \"0\") & (all_samples.mother == \"0\")\n",
    "\n",
    "# Computing the MAF\n",
    "markers = [\"rs7619974\", \"rs2949048\", \"rs16941434\"]\n",
    "for marker_id, genotypes in pedfile.iter_geno_marker(markers):\n",
    "    valid_genotypes = genotypes[founders.values & (genotypes != -1)]\n",
    "    maf = valid_genotypes.sum() / (len(valid_genotypes) * 2)\n",
    "    print(marker_id, round(maf, 6), sep=\"\\t\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing binary pedfile\n",
    "\n",
    "### *SNP-major* format\n",
    "\n",
    "The following examples shows how to write a binary file using the `PyPlink` module. The *SNP-major*  format is the default. It means that the binary file is written one marker at a time.\n",
    "\n",
    "> Note that `PyPlink` only writes the `BED` file. The user is required to create the `FAM` and `BIM` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# The genotypes for 3 markers and 10 samples\n",
    "all_genotypes = [\n",
    "    [0, 0, 0, 1, 0, 0, -1, 2, 1, 0],\n",
    "    [0, 0, 1, 1, 0, 0,  0, 1, 2, 0],\n",
    "    [0, 0, 0, 0, 1, 1,  0, 0, 0, 1],\n",
    "]\n",
    "\n",
    "# Writing the BED file using PyPlink\n",
    "with PyPlink(\"test_output\", \"w\") as pedfile:\n",
    "    for genotypes in all_genotypes:\n",
    "        pedfile.write_genotypes(genotypes)\n",
    "\n",
    "# Writing a dummy FAM file\n",
    "with open(\"test_output.fam\", \"w\") as fam_file:\n",
    "    for i in range(10):\n",
    "        print(\"family_{}\".format(i+1), \"sample_{}\".format(i+1), \"0\", \"0\", \"0\", \"-9\",\n",
    "              sep=\" \", file=fam_file)\n",
    "\n",
    "# Writing a dummy BIM file\n",
    "with open(\"test_output.bim\", \"w\") as bim_file:\n",
    "    for i in range(3):\n",
    "        print(\"1\", \"marker_{}\".format(i+1), \"0\", i+1, \"A\", \"T\",\n",
    "              sep=\"\\t\", file=bim_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Checking the content of the newly created binary files\n",
    "pedfile = PyPlink(\"test_output\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fid</th>\n",
       "      <th>iid</th>\n",
       "      <th>father</th>\n",
       "      <th>mother</th>\n",
       "      <th>gender</th>\n",
       "      <th>status</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>family_1</td>\n",
       "      <td>sample_1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>family_2</td>\n",
       "      <td>sample_2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>family_3</td>\n",
       "      <td>sample_3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>family_4</td>\n",
       "      <td>sample_4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>family_5</td>\n",
       "      <td>sample_5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>family_6</td>\n",
       "      <td>sample_6</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>family_7</td>\n",
       "      <td>sample_7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>family_8</td>\n",
       "      <td>sample_8</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>family_9</td>\n",
       "      <td>sample_9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>family_10</td>\n",
       "      <td>sample_10</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         fid        iid father mother  gender  status\n",
       "0   family_1   sample_1      0      0       0      -9\n",
       "1   family_2   sample_2      0      0       0      -9\n",
       "2   family_3   sample_3      0      0       0      -9\n",
       "3   family_4   sample_4      0      0       0      -9\n",
       "4   family_5   sample_5      0      0       0      -9\n",
       "5   family_6   sample_6      0      0       0      -9\n",
       "6   family_7   sample_7      0      0       0      -9\n",
       "7   family_8   sample_8      0      0       0      -9\n",
       "8   family_9   sample_9      0      0       0      -9\n",
       "9  family_10  sample_10      0      0       0      -9"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pedfile.get_fam()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chrom</th>\n",
       "      <th>pos</th>\n",
       "      <th>cm</th>\n",
       "      <th>a1</th>\n",
       "      <th>a2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>snp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>marker_1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>marker_2</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>marker_3</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          chrom  pos  cm a1 a2\n",
       "snp                           \n",
       "marker_1      1    1   0  A  T\n",
       "marker_2      1    2   0  A  T\n",
       "marker_3      1    3   0  A  T"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pedfile.get_bim()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "marker_1 [ 0  0  0  1  0  0 -1  2  1  0]\n",
      "marker_2 [0 0 1 1 0 0 0 1 2 0]\n",
      "marker_3 [0 0 0 0 1 1 0 0 0 1]\n"
     ]
    }
   ],
   "source": [
    "for marker, genotypes in pedfile:\n",
    "    print(marker, genotypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The newly created binary files are compatible with Plink."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "@----------------------------------------------------------@\n",
      "|        PLINK!       |     v1.07      |   10/Aug/2009     |\n",
      "|----------------------------------------------------------|\n",
      "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |\n",
      "|----------------------------------------------------------|\n",
      "|  For documentation, citation & bug-report instructions:  |\n",
      "|        http://pngu.mgh.harvard.edu/purcell/plink/        |\n",
      "@----------------------------------------------------------@\n",
      "\n",
      "Skipping web check... [ --noweb ] \n",
      "Writing this text to log file [ plink.log ]\n",
      "Analysis started: Fri Jan 20 09:20:55 2017\n",
      "\n",
      "Options in effect:\n",
      "\t--noweb\n",
      "\t--bfile test_output\n",
      "\t--freq\n",
      "\n",
      "Reading map (extended format) from [ test_output.bim ] \n",
      "3 markers to be included from [ test_output.bim ]\n",
      "Reading pedigree information from [ test_output.fam ] \n",
      "10 individuals read from [ test_output.fam ] \n",
      "0 individuals with nonmissing phenotypes\n",
      "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)\n",
      "Missing phenotype value is also -9\n",
      "0 cases, 0 controls and 10 missing\n",
      "0 males, 0 females, and 10 of unspecified sex\n",
      "Warning, found 10 individuals with ambiguous sex codes\n",
      "These individuals will be set to missing ( or use --allow-no-sex )\n",
      "Writing list of these individuals to [ plink.nosex ]\n",
      "Reading genotype bitfile from [ test_output.bed ] \n",
      "Detected that binary PED file is v1.00 SNP-major mode\n",
      "Before frequency and genotyping pruning, there are 3 SNPs\n",
      "10 founders and 0 non-founders found\n",
      "Writing allele frequencies (founders-only) to [ plink.frq ] \n",
      "\n",
      "Analysis finished: Fri Jan 20 09:20:55 2017\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from subprocess import Popen, PIPE\n",
    "\n",
    "# Computing frequencies\n",
    "proc = Popen([\"plink\", \"--noweb\", \"--bfile\", \"test_output\", \"--freq\"],\n",
    "             stdout=PIPE, stderr=PIPE)\n",
    "outs, errs = proc.communicate()\n",
    "print(outs.decode(), end=\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " CHR        SNP   A1   A2          MAF  NCHROBS\n",
      "   1   marker_1    A    T       0.2222       18\n",
      "   1   marker_2    A    T         0.25       20\n",
      "   1   marker_3    A    T         0.15       20\n"
     ]
    }
   ],
   "source": [
    "with open(\"plink.frq\", \"r\") as i_file:\n",
    "    print(i_file.read(), end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### *INDIVIDUAL-major* format\n",
    "\n",
    "The following examples shows how to write a binary file using the `PyPlink` module. The *INDIVIDUAL-major*  format means that the binary file is written one sample at a time.\n",
    "\n",
    "**Files in *INDIVIDUAL-major* format is not readable by `PyPlink`.** You need to convert it using *Plink*.\n",
    "\n",
    "> Note that `PyPlink` only writes the `BED` file. The user is required to create the `FAM` and `BIM` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# The genotypes for 3 markers and 10 samples (INDIVIDUAL-major)\n",
    "all_genotypes = [\n",
    "    [ 0, 0, 0],\n",
    "    [ 0, 0, 0],\n",
    "    [ 0, 1, 0],\n",
    "    [ 1, 1, 0],\n",
    "    [ 0, 0, 1],\n",
    "    [ 0, 0, 1],\n",
    "    [-1, 0, 0],\n",
    "    [ 2, 1, 0],\n",
    "    [ 1, 2, 0],\n",
    "    [ 0, 0, 1],\n",
    "]\n",
    "\n",
    "# Writing the BED file using PyPlink\n",
    "with PyPlink(\"test_output_2\", \"w\", bed_format=\"INDIVIDUAL-major\") as pedfile:\n",
    "    for genotypes in all_genotypes:\n",
    "        pedfile.write_genotypes(genotypes)\n",
    "\n",
    "# Writing a dummy FAM file\n",
    "with open(\"test_output_2.fam\", \"w\") as fam_file:\n",
    "    for i in range(10):\n",
    "        print(\"family_{}\".format(i+1), \"sample_{}\".format(i+1), \"0\", \"0\", \"0\", \"-9\",\n",
    "              sep=\" \", file=fam_file)\n",
    "\n",
    "# Writing a dummy BIM file\n",
    "with open(\"test_output_2.bim\", \"w\") as bim_file:\n",
    "    for i in range(3):\n",
    "        print(\"1\", \"marker_{}\".format(i+1), \"0\", i+1, \"A\", \"T\",\n",
    "              sep=\"\\t\", file=bim_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "@----------------------------------------------------------@\n",
      "|        PLINK!       |     v1.07      |   10/Aug/2009     |\n",
      "|----------------------------------------------------------|\n",
      "|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |\n",
      "|----------------------------------------------------------|\n",
      "|  For documentation, citation & bug-report instructions:  |\n",
      "|        http://pngu.mgh.harvard.edu/purcell/plink/        |\n",
      "@----------------------------------------------------------@\n",
      "\n",
      "Skipping web check... [ --noweb ] \n",
      "Writing this text to log file [ plink_2.log ]\n",
      "Analysis started: Fri Jan 20 09:21:10 2017\n",
      "\n",
      "Options in effect:\n",
      "\t--noweb\n",
      "\t--bfile test_output_2\n",
      "\t--freq\n",
      "\t--out plink_2\n",
      "\n",
      "Reading map (extended format) from [ test_output_2.bim ] \n",
      "3 markers to be included from [ test_output_2.bim ]\n",
      "Reading pedigree information from [ test_output_2.fam ] \n",
      "10 individuals read from [ test_output_2.fam ] \n",
      "0 individuals with nonmissing phenotypes\n",
      "Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)\n",
      "Missing phenotype value is also -9\n",
      "0 cases, 0 controls and 10 missing\n",
      "0 males, 0 females, and 10 of unspecified sex\n",
      "Warning, found 10 individuals with ambiguous sex codes\n",
      "These individuals will be set to missing ( or use --allow-no-sex )\n",
      "Writing list of these individuals to [ plink_2.nosex ]\n",
      "Reading genotype bitfile from [ test_output_2.bed ] \n",
      "Detected that binary PED file is v1.00 individual-major mode\n",
      "Before frequency and genotyping pruning, there are 3 SNPs\n",
      "Converting data to SNP-major format\n",
      "10 founders and 0 non-founders found\n",
      "Writing allele frequencies (founders-only) to [ plink_2.frq ] \n",
      "\n",
      "Analysis finished: Fri Jan 20 09:21:10 2017\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from subprocess import Popen, PIPE\n",
    "\n",
    "# Computing frequencies\n",
    "proc = Popen([\"plink\", \"--noweb\", \"--bfile\", \"test_output_2\", \"--freq\", \"--out\", \"plink_2\"],\n",
    "             stdout=PIPE, stderr=PIPE)\n",
    "outs, errs = proc.communicate()\n",
    "print(outs.decode(), end=\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " CHR        SNP   A1   A2          MAF  NCHROBS\n",
      "   1   marker_1    A    T       0.2222       18\n",
      "   1   marker_2    A    T         0.25       20\n",
      "   1   marker_3    A    T         0.15       20\n"
     ]
    }
   ],
   "source": [
    "with open(\"plink_2.frq\", \"r\") as i_file:\n",
    "    print(i_file.read(), end=\"\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}