{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 3\n", "For this week's analysis I shifted my focus to figuring out which country SARS-CoV-2 may have traveled to the USA from. Building on my work from last week, I continued using the data provided by https://covid19.galaxyproject.org/genomics/4-Variation/current_complete_ncov_genomes.fasta. I am also using the same multiple sequence alignment and position table as in last week's analysis.\n", "\n", "However, in order to perform my analysis this week, I needed a little more information for my data set, namely the country of origin and date of collection for each sequence. This data was provided by https://www.ncbi.nlm.nih.gov/core/assets/genbank/files/ncov-sequences.yaml. In addition to the information I was looking for, the file also contains the accession of the sequence, accession listing, and the state it was collected in, if that information is available. " ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ncov-sequences (2).yaml'" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import wget\n", "\n", "url = 'https://www.ncbi.nlm.nih.gov/core/assets/genbank/files/ncov-sequences.yaml'\n", "filename = 'ncov-sequences.yaml'\n", "wget.download(url, filename)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "import yaml\n", "\n", "#Read in the information for the sequences\n", "with open(filename) as file:\n", "    info = yaml.load(file, Loader=yaml.FullLoader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pulling the data out of the YAML file\n", "Once I had read in the sequence information, I next needed to pull out what I needed and store it in a dictionary. 
Using the accession of each sequence as the key, I stored the country of origin and collection date of the sequence together for later use." ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "#Pull out the country and date information for each sequence\n", "countries_and_dates = {}\n", "for data in info['genbank-sequences']:\n", "    country = data['country']\n", "    #We want just the country, not the state\n", "    if country is not None:\n", "        country = country.split(':')[0]\n", "    date = data['collection_date']\n", "    countries_and_dates[data['accession']] = (country, date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modifying the position table\n", "Since I was using the same position table I used for my previous analysis, I needed to read in the position table I had saved." ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "#read in the position table\n", "position_table = pd.read_csv('../../data/position_table.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once I had read in the position table, I needed to add in the information I had extracted from the YAML file for each of the sequences. I also decided to add an additional name column that matches the accession for each sequence, so I would not constantly need to remove the '.1' from the end of the seqid in order to reference the information in the future." 
] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "#Add the country and date information to the position table for each sequence\n", "for i in range(len(position_table)):\n", "    #Add the name of the sequence without the '.1' ending\n", "    name = position_table.loc[i, 'seqid'][:-2]\n", "    position_table.loc[i, 'name'] = name\n", "    if name in countries_and_dates:\n", "        country, date = countries_and_dates[name]\n", "        position_table.loc[i, 'country'] = country\n", "        position_table.loc[i, 'date'] = date\n", "    else:\n", "        #No metadata was found for this accession\n", "        position_table.loc[i, 'country'] = None\n", "        position_table.loc[i, 'date'] = None" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | seqid | S_1_1 | S_1_2 | S_1_3 | S_2_1 | S_2_2 | S_2_3 | S_3_1 | S_3_2 | S_3_3 | ... | S_1271_3 | S_1272_1 | S_1272_2 | S_1272_3 | S_1273_1 | S_1273_2 | S_1273_3 | name | country | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MT007544.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT007544 | Australia | 2020-01-25 |
| 1 | MT019529.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT019529 | China | 2019-12-23 |
| 2 | MT019530.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT019530 | China | 2019-12-30 |
| 3 | MT019531.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT019531 | China | 2019-12-30 |
| 4 | MT019532.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT019532 | China | 2019-12-30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | MT334544.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT334544 | USA | 2020-03-19 |
| 673 | MT334546.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT334546 | USA | 2020-03-19 |
| 674 | MT334547.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT334547 | USA | 2020-03-19 |
| 675 | MT334557.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT334557 | USA | 2020-03-20 |
| 676 | MT334561.1 | A | T | G | T | T | T | G | T | T | ... | T | T | A | C | A | C | A | MT334561 | USA | 2020-03-26 |

677 rows × 3823 columns

|  | MT019529.1 | MT019530.1 | MT019531.1 | MT019532.1 | MT019533.1 | MT039890.1 | MT291826.1 | MT291827.1 | MT291828.1 | MT291829.1 | MT291830.1 | MT326173.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MT019529.1 | 0 | 2 | 2 | 2 | 2 | 4 | 2 | 2 | 3 | 2 | 2 | 5 |
| MT019530.1 | 2 | 0 | 1 | 1 | 2 | 4 | 1 | 1 | 2 | 1 | 1 | 5 |
| MT019531.1 | 2 | 1 | 0 | 1 | 2 | 4 | 1 | 1 | 2 | 1 | 1 | 5 |
| MT019532.1 | 2 | 1 | 1 | 0 | 2 | 4 | 1 | 1 | 2 | 1 | 1 | 5 |
| MT019533.1 | 2 | 2 | 2 | 2 | 0 | 3 | 2 | 2 | 3 | 2 | 2 | 4 |
| MT039890.1 | 4 | 4 | 4 | 4 | 3 | 0 | 4 | 4 | 5 | 4 | 4 | 5 |
| MT291826.1 | 2 | 1 | 1 | 1 | 2 | 4 | 0 | 1 | 2 | 1 | 1 | 5 |
| MT291827.1 | 2 | 1 | 1 | 1 | 2 | 4 | 1 | 0 | 2 | 1 | 1 | 5 |
| MT291828.1 | 3 | 2 | 2 | 2 | 3 | 5 | 2 | 2 | 0 | 2 | 2 | 6 |
| MT291829.1 | 2 | 1 | 1 | 1 | 2 | 4 | 1 | 1 | 2 | 0 | 1 | 5 |
| MT291830.1 | 2 | 1 | 1 | 1 | 2 | 4 | 1 | 1 | 2 | 1 | 0 | 5 |
| MT326173.1 | 5 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 6 | 5 | 5 | 0 |