{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from __future__ import print_function, division"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `PyPlink`\n",
"\n",
"`PyPlink` is a Python module to read and write binary Plink files. Here are small examples for `PyPlink`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyplink import PyPlink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"* [**Reading binary pedfile**](#Reading-binary-pedfile)\n",
" * [Getting the demo data](#Getting-the-demo-data)\n",
" * [Reading the binary file](#Reading-the-binary-file)\n",
" * [Getting dataset information](#Getting-dataset-information)\n",
" * [Iterating over all markers](#Iterating-over-all-markers)\n",
" * [*Additive format*](#iterating_over_all_additive)\n",
" * [*Nucleotide format*](#iterating_over_all_nuc)\n",
" * [Iterating over selected markers](#Iterating-over-selected-markers)\n",
" * [*Additive format*](#iterating_over_selected_additive)\n",
" * [*Nucleotide format*](#iterating_over_selected_nuc)\n",
" * [Extracting a single marker](#Extracting-a-single-marker)\n",
" * [*Additive format*](#extracting_additive)\n",
" * [*Nucleotide format*](#extracting_nuc)\n",
" * [Misc example](#Misc-example)\n",
" * [*Extracting a subset of markers and samples*](#Extracting-a-subset-of-markers-and-samples)\n",
" * [*Counting the allele frequency of markers*](#Counting-the-allele-frequency-of-markers)\n",
"\n",
"\n",
"* [**Writing binary pedfile**](#Writing-binary-pedfile)\n",
" * [SNP-major format](#SNP-major-format)\n",
" * [INDIVIDUAL-major-format](#INDIVIDUAL-major-format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading binary pedfile\n",
"\n",
"### Getting the demo data\n",
"\n",
"The [`Plink`](http://pngu.mgh.harvard.edu/~purcell/plink/) softwares provides a testing dataset on the [resources page](http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml). It contains the 270 samples from the HapMap project (release 23) on build GRCh36/hg18."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import zipfile\n",
"try:\n",
" from urllib.request import urlretrieve\n",
"except ImportError:\n",
" from urllib import urlretrieve\n",
"\n",
"# Downloading the demo data from Plink webset\n",
"urlretrieve(\n",
" \"http://pngu.mgh.harvard.edu/~purcell/plink/dist/hapmap_r23a.zip\",\n",
" \"hapmap_r23a.zip\",\n",
")\n",
"\n",
"# Extracting the archive content\n",
"with zipfile.ZipFile(\"hapmap_r23a.zip\", \"r\") as z:\n",
" z.extractall(\".\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading the binary file\n",
"\n",
"To read a binary file, `PyPlink` only requires the prefix of the files."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pedfile = PyPlink(\"hapmap_r23a\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting dataset information"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"270 samples and 4,098,136 markers\n"
]
}
],
"source": [
"print(\"{:,d} samples and {:,d} markers\".format(\n",
" pedfile.get_nb_samples(),\n",
" pedfile.get_nb_markers(),\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" fid | \n",
" iid | \n",
" father | \n",
" mother | \n",
" gender | \n",
" status | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NA18524 | \n",
" NA18524 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" -9 | \n",
"
\n",
" \n",
" 1 | \n",
" NA18526 | \n",
" NA18526 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" -9 | \n",
"
\n",
" \n",
" 2 | \n",
" NA18529 | \n",
" NA18529 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" -9 | \n",
"
\n",
" \n",
" 3 | \n",
" NA18532 | \n",
" NA18532 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" -9 | \n",
"
\n",
" \n",
" 4 | \n",
" NA18537 | \n",
" NA18537 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" -9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fid iid father mother gender status\n",
"0 NA18524 NA18524 0 0 1 -9\n",
"1 NA18526 NA18526 0 0 2 -9\n",
"2 NA18529 NA18529 0 0 2 -9\n",
"3 NA18532 NA18532 0 0 2 -9\n",
"4 NA18537 NA18537 0 0 2 -9"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_samples = pedfile.get_fam()\n",
"all_samples.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" pos | \n",
" cm | \n",
" a1 | \n",
" a2 | \n",
"
\n",
" \n",
" snp | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" rs10399749 | \n",
" 1 | \n",
" 45162 | \n",
" 0 | \n",
" 0 | \n",
" C | \n",
"
\n",
" \n",
" rs2949420 | \n",
" 1 | \n",
" 45257 | \n",
" 0 | \n",
" 0 | \n",
" T | \n",
"
\n",
" \n",
" rs4030303 | \n",
" 1 | \n",
" 72434 | \n",
" 0 | \n",
" A | \n",
" G | \n",
"
\n",
" \n",
" rs4030300 | \n",
" 1 | \n",
" 72515 | \n",
" 0 | \n",
" 0 | \n",
" C | \n",
"
\n",
" \n",
" rs3855952 | \n",
" 1 | \n",
" 77689 | \n",
" 0 | \n",
" 0 | \n",
" A | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom pos cm a1 a2\n",
"snp \n",
"rs10399749 1 45162 0 0 C\n",
"rs2949420 1 45257 0 0 T\n",
"rs4030303 1 72434 0 A G\n",
"rs4030300 1 72515 0 0 C\n",
"rs3855952 1 77689 0 0 A"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_markers = pedfile.get_bim()\n",
"all_markers.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterating over all markers\n",
"\n",
"\n",
"#### Additive format\n",
"\n",
"Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is unknown, `0` is homozygous (major allele), `1` is heterozygous, and `2` is homozygous (minor allele)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rs10399749\n",
"[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0\n",
" 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 -1\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 -1 0 0 0\n",
" 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0 0 0 0 0\n",
" -1 0 -1 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 -1 0 -1 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n"
]
}
],
"source": [
"for marker_id, genotypes in pedfile:\n",
" print(marker_id)\n",
" print(genotypes)\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rs10399749\n",
"[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0\n",
" 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 -1\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 -1 0 0 0\n",
" 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0 0 0 0 0\n",
" -1 0 -1 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 -1 0 -1 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n"
]
}
],
"source": [
"for marker_id, genotypes in pedfile.iter_geno():\n",
" print(marker_id)\n",
" print(genotypes)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Nucleotide format\n",
"\n",
"Cycling through genotypes as `A`, `C`, `G` and `T` values (where `00` is unknown)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rs10399749\n",
"['CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" '00' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' '00' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' '00' 'CC' '00' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC'\n",
" 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC' 'CC']\n"
]
}
],
"source": [
"for marker_id, genotypes in pedfile.iter_acgt_geno():\n",
" print(marker_id)\n",
" print(genotypes)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterating over selected markers\n",
"\n",
"\n",
"#### Additive format\n",
"\n",
"Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is unknown, `0` is homozygous (major allele), `1` is heterozygous, and `2` is homozygous (minor allele)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rs7092431\n",
"[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
"\n",
"rs9943770\n",
"[ 0 0 0 2 0 2 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0\n",
" 1 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1\n",
" 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0\n",
" 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0\n",
" 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 -1 0 0 0 1 2 1 1 0 1 1 2 0 0 0 1 0 0 0 0 1 1 0 2\n",
" 1 1 1 0 1 1 -1 1 0 0 1 0 0 2 0 1 2 0 1 1 1 2 0 1 0\n",
" 0 0 0 0 0 0 2 1 2 1 1 2 1 1 1 1 0 1 1 2 1 0 -1 0 1\n",
" 1 1 0 1 0 1 -1 2 1 0 0 0 1 1 1 2 0 0 1 1]\n",
"\n",
"rs1587483\n",
"[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 1 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
"\n"
]
}
],
"source": [
"markers = [\"rs7092431\", \"rs9943770\", \"rs1587483\"]\n",
"for marker_id, genotypes in pedfile.iter_geno_marker(markers):\n",
" print(marker_id)\n",
" print(genotypes, end=\"\\n\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Nucleotide format\n",
"\n",
"Cycling through genotypes as `A`, `C`, `G` and `T` values (where `00` is unknown)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rs7092431\n",
"['GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' '00'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG']\n",
"\n",
"rs9943770\n",
"['AA' 'AA' 'AA' 'GG' 'AA' 'GG' 'AA' 'GA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
" 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
" 'GA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'GA' 'AA' 'AA' 'GA' 'GA' 'GA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA'\n",
" 'AA' 'GA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GA' '00' 'AA' 'AA' 'AA'\n",
" 'GA' 'GG' 'GA' 'GA' 'AA' 'GA' 'GA' 'GG' 'AA' 'AA' 'AA' 'GA' 'AA' 'AA' 'AA'\n",
" 'AA' 'GA' 'GA' 'AA' 'GG' 'GA' 'GA' 'GA' 'AA' 'GA' 'GA' '00' 'GA' 'AA' 'AA'\n",
" 'GA' 'AA' 'AA' 'GG' 'AA' 'GA' 'GG' 'AA' 'GA' 'GA' 'GA' 'GG' 'AA' 'GA' 'AA'\n",
" 'AA' 'AA' 'AA' 'AA' 'AA' 'AA' 'GG' 'GA' 'GG' 'GA' 'GA' 'GG' 'GA' 'GA' 'GA'\n",
" 'GA' 'AA' 'GA' 'GA' 'GG' 'GA' 'AA' '00' 'AA' 'GA' 'GA' 'GA' 'AA' 'GA' 'AA'\n",
" 'GA' '00' 'GG' 'GA' 'AA' 'AA' 'AA' 'GA' 'GA' 'GA' 'GG' 'AA' 'AA' 'GA' 'GA']\n",
"\n",
"rs1587483\n",
"['GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' '00' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' '00' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'CG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' '00' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG'\n",
" 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG' 'GG']\n",
"\n"
]
}
],
"source": [
"markers = [\"rs7092431\", \"rs9943770\", \"rs1587483\"]\n",
"for marker_id, genotypes in pedfile.iter_acgt_geno_marker(markers):\n",
" print(marker_id)\n",
" print(genotypes, end=\"\\n\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting a single marker\n",
"\n",
"\n",
"#### Additive format\n",
"\n",
"Cycling through genotypes as `-1`, `0`, `1` and `2` values, where `-1` is unknown, `0` is homozygous (major allele), `1` is heterozygous, and `2` is homozygous (minor allele)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 0, 0, 0, 0, -1, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, -1,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n",
" -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], dtype=int8)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pedfile.get_geno_marker(\"rs7619974\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Nucleotide format\n",
"\n",
"Cycling through genotypes as `A`, `C`, `G` and `T` values (where `00` is unknown)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['CC', 'CC', 'CC', 'CC', 'CC', '00', 'CC', 'CC', '00', '00', 'CC',\n",
" 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" 'CC', 'CC', '00', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" '00', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" 'CC', '00', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC',\n",
" 'CC', 'CC', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00', '00', '00', '00', '00', '00',\n",
" '00', '00', '00', '00', '00', '00'], \n",
" dtype=' Note that `PyPlink` only writes the `BED` file. The user is required to create the `FAM` and `BIM` files."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The genotypes for 3 markers and 10 samples\n",
"all_genotypes = [\n",
" [0, 0, 0, 1, 0, 0, -1, 2, 1, 0],\n",
" [0, 0, 1, 1, 0, 0, 0, 1, 2, 0],\n",
" [0, 0, 0, 0, 1, 1, 0, 0, 0, 1],\n",
"]\n",
"\n",
"# Writing the BED file using PyPlink\n",
"with PyPlink(\"test_output\", \"w\") as pedfile:\n",
" for genotypes in all_genotypes:\n",
" pedfile.write_genotypes(genotypes)\n",
"\n",
"# Writing a dummy FAM file\n",
"with open(\"test_output.fam\", \"w\") as fam_file:\n",
" for i in range(10):\n",
" print(\"family_{}\".format(i+1), \"sample_{}\".format(i+1), \"0\", \"0\", \"0\", \"-9\",\n",
" sep=\" \", file=fam_file)\n",
"\n",
"# Writing a dummy BIM file\n",
"with open(\"test_output.bim\", \"w\") as bim_file:\n",
" for i in range(3):\n",
" print(\"1\", \"marker_{}\".format(i+1), \"0\", i+1, \"A\", \"T\",\n",
" sep=\"\\t\", file=bim_file)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Checking the content of the newly created binary files\n",
"pedfile = PyPlink(\"test_output\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fid | \n",
" iid | \n",
" father | \n",
" mother | \n",
" gender | \n",
" status | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" family_1 | \n",
" sample_1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 1 | \n",
" family_2 | \n",
" sample_2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 2 | \n",
" family_3 | \n",
" sample_3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 3 | \n",
" family_4 | \n",
" sample_4 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 4 | \n",
" family_5 | \n",
" sample_5 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 5 | \n",
" family_6 | \n",
" sample_6 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 6 | \n",
" family_7 | \n",
" sample_7 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 7 | \n",
" family_8 | \n",
" sample_8 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 8 | \n",
" family_9 | \n",
" sample_9 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
" 9 | \n",
" family_10 | \n",
" sample_10 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" -9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fid iid father mother gender status\n",
"0 family_1 sample_1 0 0 0 -9\n",
"1 family_2 sample_2 0 0 0 -9\n",
"2 family_3 sample_3 0 0 0 -9\n",
"3 family_4 sample_4 0 0 0 -9\n",
"4 family_5 sample_5 0 0 0 -9\n",
"5 family_6 sample_6 0 0 0 -9\n",
"6 family_7 sample_7 0 0 0 -9\n",
"7 family_8 sample_8 0 0 0 -9\n",
"8 family_9 sample_9 0 0 0 -9\n",
"9 family_10 sample_10 0 0 0 -9"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pedfile.get_fam()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" chrom | \n",
" pos | \n",
" cm | \n",
" a1 | \n",
" a2 | \n",
"
\n",
" \n",
" snp | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" marker_1 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" A | \n",
" T | \n",
"
\n",
" \n",
" marker_2 | \n",
" 1 | \n",
" 2 | \n",
" 0 | \n",
" A | \n",
" T | \n",
"
\n",
" \n",
" marker_3 | \n",
" 1 | \n",
" 3 | \n",
" 0 | \n",
" A | \n",
" T | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" chrom pos cm a1 a2\n",
"snp \n",
"marker_1 1 1 0 A T\n",
"marker_2 1 2 0 A T\n",
"marker_3 1 3 0 A T"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pedfile.get_bim()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"marker_1 [ 0 0 0 1 0 0 -1 2 1 0]\n",
"marker_2 [0 0 1 1 0 0 0 1 2 0]\n",
"marker_3 [0 0 0 0 1 1 0 0 0 1]\n"
]
}
],
"source": [
"for marker, genotypes in pedfile:\n",
" print(marker, genotypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The newly created binary files are compatible with Plink."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"@----------------------------------------------------------@\n",
"| PLINK! | v1.07 | 10/Aug/2009 |\n",
"|----------------------------------------------------------|\n",
"| (C) 2009 Shaun Purcell, GNU General Public License, v2 |\n",
"|----------------------------------------------------------|\n",
"| For documentation, citation & bug-report instructions: |\n",
"| http://pngu.mgh.harvard.edu/purcell/plink/ |\n",
"@----------------------------------------------------------@\n",
"\n",
"Skipping web check... [ --noweb ] \n",
"Writing this text to log file [ plink.log ]\n",
"Analysis started: Fri Jan 20 09:20:55 2017\n",
"\n",
"Options in effect:\n",
"\t--noweb\n",
"\t--bfile test_output\n",
"\t--freq\n",
"\n",
"Reading map (extended format) from [ test_output.bim ] \n",
"3 markers to be included from [ test_output.bim ]\n",
"Reading pedigree information from [ test_output.fam ] \n",
"10 individuals read from [ test_output.fam ] \n",
"0 individuals with nonmissing phenotypes\n",
"Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)\n",
"Missing phenotype value is also -9\n",
"0 cases, 0 controls and 10 missing\n",
"0 males, 0 females, and 10 of unspecified sex\n",
"Warning, found 10 individuals with ambiguous sex codes\n",
"These individuals will be set to missing ( or use --allow-no-sex )\n",
"Writing list of these individuals to [ plink.nosex ]\n",
"Reading genotype bitfile from [ test_output.bed ] \n",
"Detected that binary PED file is v1.00 SNP-major mode\n",
"Before frequency and genotyping pruning, there are 3 SNPs\n",
"10 founders and 0 non-founders found\n",
"Writing allele frequencies (founders-only) to [ plink.frq ] \n",
"\n",
"Analysis finished: Fri Jan 20 09:20:55 2017\n",
"\n"
]
}
],
"source": [
"from subprocess import Popen, PIPE\n",
"\n",
"# Computing frequencies\n",
"proc = Popen([\"plink\", \"--noweb\", \"--bfile\", \"test_output\", \"--freq\"],\n",
" stdout=PIPE, stderr=PIPE)\n",
"outs, errs = proc.communicate()\n",
"print(outs.decode(), end=\"\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" CHR SNP A1 A2 MAF NCHROBS\n",
" 1 marker_1 A T 0.2222 18\n",
" 1 marker_2 A T 0.25 20\n",
" 1 marker_3 A T 0.15 20\n"
]
}
],
"source": [
"with open(\"plink.frq\", \"r\") as i_file:\n",
" print(i_file.read(), end=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### *INDIVIDUAL-major* format\n",
"\n",
"The following examples shows how to write a binary file using the `PyPlink` module. The *INDIVIDUAL-major* format means that the binary file is written one sample at a time.\n",
"\n",
"**Files in *INDIVIDUAL-major* format is not readable by `PyPlink`.** You need to convert it using *Plink*.\n",
"\n",
"> Note that `PyPlink` only writes the `BED` file. The user is required to create the `FAM` and `BIM` files."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The genotypes for 3 markers and 10 samples (INDIVIDUAL-major)\n",
"all_genotypes = [\n",
" [ 0, 0, 0],\n",
" [ 0, 0, 0],\n",
" [ 0, 1, 0],\n",
" [ 1, 1, 0],\n",
" [ 0, 0, 1],\n",
" [ 0, 0, 1],\n",
" [-1, 0, 0],\n",
" [ 2, 1, 0],\n",
" [ 1, 2, 0],\n",
" [ 0, 0, 1],\n",
"]\n",
"\n",
"# Writing the BED file using PyPlink\n",
"with PyPlink(\"test_output_2\", \"w\", bed_format=\"INDIVIDUAL-major\") as pedfile:\n",
" for genotypes in all_genotypes:\n",
" pedfile.write_genotypes(genotypes)\n",
"\n",
"# Writing a dummy FAM file\n",
"with open(\"test_output_2.fam\", \"w\") as fam_file:\n",
" for i in range(10):\n",
" print(\"family_{}\".format(i+1), \"sample_{}\".format(i+1), \"0\", \"0\", \"0\", \"-9\",\n",
" sep=\" \", file=fam_file)\n",
"\n",
"# Writing a dummy BIM file\n",
"with open(\"test_output_2.bim\", \"w\") as bim_file:\n",
" for i in range(3):\n",
" print(\"1\", \"marker_{}\".format(i+1), \"0\", i+1, \"A\", \"T\",\n",
" sep=\"\\t\", file=bim_file)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"@----------------------------------------------------------@\n",
"| PLINK! | v1.07 | 10/Aug/2009 |\n",
"|----------------------------------------------------------|\n",
"| (C) 2009 Shaun Purcell, GNU General Public License, v2 |\n",
"|----------------------------------------------------------|\n",
"| For documentation, citation & bug-report instructions: |\n",
"| http://pngu.mgh.harvard.edu/purcell/plink/ |\n",
"@----------------------------------------------------------@\n",
"\n",
"Skipping web check... [ --noweb ] \n",
"Writing this text to log file [ plink_2.log ]\n",
"Analysis started: Fri Jan 20 09:21:10 2017\n",
"\n",
"Options in effect:\n",
"\t--noweb\n",
"\t--bfile test_output_2\n",
"\t--freq\n",
"\t--out plink_2\n",
"\n",
"Reading map (extended format) from [ test_output_2.bim ] \n",
"3 markers to be included from [ test_output_2.bim ]\n",
"Reading pedigree information from [ test_output_2.fam ] \n",
"10 individuals read from [ test_output_2.fam ] \n",
"0 individuals with nonmissing phenotypes\n",
"Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)\n",
"Missing phenotype value is also -9\n",
"0 cases, 0 controls and 10 missing\n",
"0 males, 0 females, and 10 of unspecified sex\n",
"Warning, found 10 individuals with ambiguous sex codes\n",
"These individuals will be set to missing ( or use --allow-no-sex )\n",
"Writing list of these individuals to [ plink_2.nosex ]\n",
"Reading genotype bitfile from [ test_output_2.bed ] \n",
"Detected that binary PED file is v1.00 individual-major mode\n",
"Before frequency and genotyping pruning, there are 3 SNPs\n",
"Converting data to SNP-major format\n",
"10 founders and 0 non-founders found\n",
"Writing allele frequencies (founders-only) to [ plink_2.frq ] \n",
"\n",
"Analysis finished: Fri Jan 20 09:21:10 2017\n",
"\n"
]
}
],
"source": [
"from subprocess import Popen, PIPE\n",
"\n",
"# Computing frequencies\n",
"proc = Popen([\"plink\", \"--noweb\", \"--bfile\", \"test_output_2\", \"--freq\", \"--out\", \"plink_2\"],\n",
" stdout=PIPE, stderr=PIPE)\n",
"outs, errs = proc.communicate()\n",
"print(outs.decode(), end=\"\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" CHR SNP A1 A2 MAF NCHROBS\n",
" 1 marker_1 A T 0.2222 18\n",
" 1 marker_2 A T 0.25 20\n",
" 1 marker_3 A T 0.15 20\n"
]
}
],
"source": [
"with open(\"plink_2.frq\", \"r\") as i_file:\n",
" print(i_file.read(), end=\"\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}