{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project: Week 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This week, I continued my approach from last week. I wanted to investigate how COVID-19 initially spread once it reached the US." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Stuff" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: biopython in c:\\users\\alex\\anaconda3\\lib\\site-packages (1.76)\n", "Requirement already satisfied: numpy in c:\\users\\alex\\anaconda3\\lib\\site-packages (from biopython) (1.18.1)\n" ] } ], "source": [ "!pip install biopython" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import yaml\n", "import urllib\n", "from Bio.Phylo.TreeConstruction import DistanceMatrix\n", "from Bio.Phylo.TreeConstruction import DistanceTreeConstructor\n", "from Bio import Phylo\n", "import matplotlib\n", "import random\n", "import matplotlib.pylab as plt\n", "import matplotlib.patches as mpatches" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Given Data - aligned sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, I started with the data given by my professor. Below is a table of all of the COVID-19 genomes made available by the Galaxy Project. Specifically, this table contains the aligned nucleotide sequences of the spike protein gene, one column per codon position." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "position_table = pd.read_csv('../../data/position_table.csv')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | seqid | S_1_1 | S_1_2 | S_1_3 | S_2_1 | S_2_2 | S_2_3 | S_3_1 | S_3_2 | S_3_3 | ... | S_1270_3 | S_1271_1 | S_1271_2 | S_1271_3 | S_1272_1 | S_1272_2 | S_1272_3 | S_1273_1 | S_1273_2 | S_1273_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MT007544.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 1 | MT019529.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 2 | MT019530.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 3 | MT019531.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 4 | MT019532.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | MT334544.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 673 | MT334546.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 674 | MT334547.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 675 | MT334557.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |
| 676 | MT334561.1 | A | T | G | T | T | T | G | T | T | ... | A | C | A | T | T | A | C | A | C | A |

677 rows × 3820 columns

| | MT020880.1 | MT020881.1 | MT027062.1 | MT027063.1 | MT027064.1 | MT039887.1 | MT039888.1 | MT044257.1 | MT044258.1 | MT106052.1 | ... | MT322419.1 | MT322420.1 | MT322421.1 | MT322422.1 | MT322423.1 | MT322424.1 | MT325565.1 | MT325571.1 | MT325573.1 | MT325574.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MT020880.1 | 0 | 1 | 3 | 3 | 4 | 3 | 4 | 4 | 3 | 3 | ... | 4 | 4 | 7 | 7 | 7 | 4 | 4 | 4 | 4 | 3 |
| MT020881.1 | 1 | 0 | 3 | 3 | 4 | 3 | 4 | 4 | 3 | 3 | ... | 4 | 4 | 7 | 7 | 7 | 4 | 4 | 4 | 4 | 3 |
| MT027062.1 | 3 | 3 | 0 | 1 | 2 | 3 | 3 | 4 | 2 | 2 | ... | 4 | 4 | 7 | 7 | 7 | 4 | 4 | 4 | 4 | 3 |
| MT027063.1 | 3 | 3 | 1 | 0 | 2 | 3 | 3 | 4 | 2 | 2 | ... | 4 | 4 | 7 | 7 | 7 | 4 | 4 | 4 | 4 | 3 |
| MT027064.1 | 4 | 4 | 2 | 2 | 0 | 4 | 4 | 5 | 3 | 3 | ... | 5 | 5 | 8 | 8 | 8 | 5 | 5 | 5 | 5 | 4 |

5 rows × 100 columns
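
The matrix above is symmetric, has zeros on the diagonal, and holds small integers, which is what you would expect from counting the positions at which two aligned sequences differ (Hamming distance). The cells that produced it are not shown here, so the following is only a minimal sketch of how such a matrix could be computed from a table shaped like `position_table`. The function name `pairwise_hamming` and the toy sequences are made up for illustration and are not taken from the real data.

```python
import numpy as np
import pandas as pd


def pairwise_hamming(position_table: pd.DataFrame) -> pd.DataFrame:
    """Count differing positions between every pair of aligned sequences.

    Assumes one row per sequence, a 'seqid' column, and one column per
    aligned nucleotide position (as in the table shown earlier).
    """
    ids = position_table["seqid"].to_numpy()
    bases = position_table.drop(columns="seqid").to_numpy()

    n = len(ids)
    dist = np.zeros((n, n), dtype=int)
    # Compare each sequence against all others one row at a time,
    # which keeps memory use at (n x positions) instead of (n x n x positions).
    for i in range(n):
        dist[i] = (bases != bases[i]).sum(axis=1)

    return pd.DataFrame(dist, index=ids, columns=ids)


# Toy stand-in for position_table (hypothetical sequences, not real genomes).
toy = pd.DataFrame({
    "seqid": ["seqA", "seqB", "seqC"],
    "S_1_1": ["A", "A", "A"],
    "S_1_2": ["T", "T", "C"],
    "S_1_3": ["G", "A", "G"],
})

distances = pairwise_hamming(toy)
print(distances)
```

A plain mismatch count treats gaps and ambiguous bases (such as `N`) as ordinary differences; whether that is appropriate depends on how the alignment was prepared.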