{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.2 Big Data in Genomics - Visualise\n", "\n", "This notebook visualises the results of k-means clustering of the genomic variants from chromosome 22 of the 1000 Genomes Project dataset (phase 3).\n", "\n", "We have reduced all the variants to 50 cluster centers, so that each of the ~2500 individuals can now be represented by a vector of size 50.\n", "\n", "The results are available in: `data/cluster-centers_chr22.csv.gz`.\n", "\n", "Now we will compute the average representation of each population by averaging the vectors of the individuals from that population, and then use hierarchical clustering to see which populations are similar." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import pandas and set display options\n", "import pandas as pd\n", "pd.set_option('display.max_rows', 5)\n", "pd.set_option('display.max_columns', 8)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| Sample ID | 0 | 1 | 2 | 3 | ... | 46 | 47 | 48 | 49 |
|---|---|---|---|---|---|---|---|---|---|
| HG00096 | 1.763517 | 0.027793 | 0.000000 | 1.129517 | ... | 0.004106 | 1.012876 | 0.000000 | 0.621353 |
| HG00097 | 1.788846 | 0.039244 | 0.000000 | 0.997455 | ... | 0.000880 | 0.000000 | 1.179625 | 0.557942 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| NA21143 | 1.754749 | 0.028460 | 0.725055 | 0.796692 | ... | 0.002933 | 0.000000 | 0.938338 | 0.628241 |
| NA21144 | 1.781052 | 0.017232 | 0.000000 | 0.928753 | ... | 0.010850 | 0.000000 | 0.919571 | 0.559157 |

2535 rows × 50 columns

| Individual ID | Family ID | Paternal ID | Maternal ID | Gender | ... | phase 3 genotypes | related genotypes | omni genotypes | affy_genotypes |
|---|---|---|---|---|---|---|---|---|---|
| HG00096 | HG00096 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 1 |
| HG00097 | HG00097 | 0 | 0 | 2 | ... | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| NA21143 | NA21143 | 0 | 0 | 2 | ... | 1 | 0 | 1 | 1 |
| NA21144 | NA21144 | 0 | 0 | 2 | ... | 1 | 0 | 1 | 1 |

3691 rows × 16 columns

| Population | 0 | 1 | 2 | 3 | ... | 46 | 47 | 48 | 49 |
|---|---|---|---|---|---|---|---|---|---|
| ACB | 1.805210 | 0.309657 | 0.182349 | 0.826776 | ... | 0.111718 | 0.264843 | 0.026167 | 0.792401 |
| ASW | 1.802774 | 0.265552 | 0.179231 | 0.853655 | ... | 0.080712 | 0.265509 | 0.042977 | 0.755019 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TSI | 1.769293 | 0.030587 | 0.099593 | 1.038307 | ... | 0.008914 | 0.200485 | 0.052477 | 0.535489 |
| YRI | 1.808162 | 0.361357 | 0.270144 | 0.791332 | ... | 0.107181 | 0.282396 | 0.021718 | 0.836228 |

26 rows × 50 columns

| Population Code | Population Description | Super Population Code | Sequence Data Available | Alignment Data Available | Variant Data Available |
|---|---|---|---|---|---|
| CHB | Han Chinese in Bejing, China | EAS | 1 | 1 | 1 |
| JPT | Japanese in Tokyo, Japan | EAS | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... |
| STU | Sri Lankan Tamil from the UK | SAS | 1 | 1 | 1 |
| ITU | Indian Telugu from the UK | SAS | 1 | 1 | 1 |

26 rows × 5 columns