{ "metadata": { "name": "", "signature": "sha256:2f1fa0aa24b81d033ad947a4fb53f6b8f0d19a0d521e882946533be6966e19b8" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015](http://py4life.github.io/TAU2015/)\n", "# Homework 4\n", "\n", "## 1) Parsing $\\lambda$ phage FASTQ\n", "The FASTQ file format is commonly used to store deep sequencing reads data. It is similar to the FASTA format, but includes additional information. Each record is represented by four lines:\n", "- Line 1 begins with a `@` character and is followed by a sequence identifier and an optional description (like a FASTA title line).\n", "- Line 2 is the raw sequence letters.\n", "- Line 3 is just a `+` character\n", "- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. \n", "Sequence quality is encoded with characters from:\n", "```\n", "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\n", "```\n", "Where `!` represents the lowest quality, equivalent to a score of 1, and `~` is the highest quality with a score of 94. \n", "\n", "The function given below translates the characters into their corresponding scores and returns a dictionary which you can use later. Make sure you understand how to work with this dictionary before proceeding. Also note the use of `\"\"\"`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def creates_scores_dict():\n", " scores_string = \"\"\"!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\"\"\"\n", " scores_dict = {}\n", " for s in range(len(scores_string)):\n", " scores_dict[scores_string[s]] = s + 1\n", " return scores_dict\n", "\n", "scores_dict = creates_scores_dict()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file `files_for_hw/lambda_reads.fq` contains 10,000 reads from the sequencing of $\\lambda$ phage. We would like to discard low quality reads. A low quality read is defined as one with a mean score lower than some predefined threshold. \n", "\n", "**a)** Write a function `mean_score` that receives a read quality string and returns the mean score (float) of the read. \n", "\n", "For example, the quality string `!!!!!` is equivalent to the scores `1,1,1,1,1`, and thus the mean is `1.0`. \n", "However, the string `49@5<*>=E` is equivalent to the scores `20,25,32,21,28,10,30,29,37` and has a mean of `25.77`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def mean_score(read_quality_string):\n", " scores_sum = 0\n", " for char in read_quality_string:\n", " scores_sum += scores_dict[char]\n", " return scores_sum/len(read_quality_string)\n", "\n", "mean_score('49@5<*>=E')\n", "assert(mean_score('!!!!!') == 1.0)\n", "assert(round(mean_score('49@5<*>=E'),2) == 25.78)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** Write a function `parse_FASTQ` that parses a FASTQ file. It receives a path to a FASTQ file on the local filesystem and returns a _dictionary_ where keys are sequences and values are the mean scores of the sequences. 
\n", "\n", "Use the function on the provided file `files_for_hw/lambda_reads.fq`.\n", "\n", "It is recommended to use the `readline()` method of file objects (although other solutions are also possible)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def parse_FASTQ(file):\n", " seq_scores = {}\n", " with open(file,'r') as f:\n", " line = f.readline()\n", " while line:\n", " if line.startswith('@'): # a new record begins\n", " seq = ''\n", " score = -1\n", " line = f.readline()\t # read next line (sequence)\n", " seq = line.strip() # get sequence\n", " line = f.readline()\n", " line = f.readline() # skip the '+' line and get to the quality line\n", " quality = line.strip()\n", " mean_quality_score = mean_score(quality)\n", " seq_scores[seq] = mean_quality_score\n", " line = f.readline() # finished with record, read next\n", " return seq_scores\n", "\n", "\n", "# parse lambda reads file \n", "lambda_reads_file = \"files_for_hw4/lambda_reads.fq\"\n", "lambda_seqs_dict = parse_FASTQ(lambda_reads_file)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "print(parse_FASTQ('sample_fastq'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{'TTGGCAGGCCAAGGCCGATGGATCA': 25.52, 'GTTGCTTCTGGCGTGGGTGGGGGGG': 24.4, 'CCCTTCTTGTCTTCAGCGTTTCTCC': 26.28}\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**c)** Write a function `filter_reads` that takes the output from section **b** and a score cutoff (integer/float) and prints out the sequences with scores higher than the cutoff. \n", "\n", "Each sequence will be printed in a separate line (no need to keep the FASTQ format). Try different cutoffs (5,10,20) on the $\\lambda$ phage reads." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def filter_reads(seqs_dict,cutoff,out_file):\n", " with open(out_file,'w') as fo:\n", " for seq in seqs_dict:\n", " if seqs_dict[seq] > cutoff:\n", " print(seq, file=fo)\n", "\n", "# run on Lambda reads\n", "lambda_filtered_file = \"files_for_hw4/lambda_filtered_reads.txt\"\n", "filter_reads(lambda_seqs_dict,10,lambda_filtered_file)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2) Endangered turtles\n", "In this question, we will work on data from the [Global Biodiversity Information Facility](http://www.gbif.org/) (GBIF). \n", "This server has lots of data about occurences of organisms around the world. \n", "\n", "Our goal will be to get the coordinates of observations of endangered turtles species in the Southern hemisphere. \n", "The CSV files under `files_for_hw/gbif_files`, named `GBIF1.csv`, `GBIF2.csv` etc. contain observations data for various turtles genera. Each record is an observation, and contains information such as the species name and the coordinates of the observation. \n", "\n", "**a)** The first step in the analysis will be to concatenate all the CSV files into one CSV file to make processing easier. \n", "\n", "Write a function that receives a list of file names and concatenates them (i.e., inserts all records from all files into one file). Remember to include the header line only once, at the beginning. \n", "\n", "Use the function to create a unified csv file. \n", "\n", "__Note__: The function has no return value!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def append_csvs(csvs_list,out_csv):\n", " with open(out_csv,'w') as fo:\n", " # print first file to output file\n", " first_csv = csvs_list[0]\n", " with open(first_csv,'r') as f:\n", " content = f.read()\n", " print(content.strip(),file=fo)\n", " # print all other files to output file, skipping headers line\n", " for file in csvs_list[1:]:\n", " with open(file,'r') as f:\n", " f.readline()\n", " for line in f:\n", " print(line.strip(),file=fo)\n", " \n", "# create unified file\n", "import glob\n", "filenames = glob.glob(\"files_for_hw4/gbif_files/GBIF*.csv\")\n", "print(filenames)\n", "output_filename = \"files_for_hw4/gbif_files/concat_GBIF.csv\"\n", "append_csvs(filenames, output_filename)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['files_for_hw4/gbif_files\\\\GBIF1.csv', 'files_for_hw4/gbif_files\\\\GBIF10.csv', 'files_for_hw4/gbif_files\\\\GBIF11.csv', 'files_for_hw4/gbif_files\\\\GBIF12.csv', 'files_for_hw4/gbif_files\\\\GBIF13.csv', 'files_for_hw4/gbif_files\\\\GBIF14.csv', 'files_for_hw4/gbif_files\\\\GBIF15.csv', 'files_for_hw4/gbif_files\\\\GBIF16.csv', 'files_for_hw4/gbif_files\\\\GBIF17.csv', 'files_for_hw4/gbif_files\\\\GBIF18.csv', 'files_for_hw4/gbif_files\\\\GBIF19.csv', 'files_for_hw4/gbif_files\\\\GBIF2.csv', 'files_for_hw4/gbif_files\\\\GBIF20.csv', 'files_for_hw4/gbif_files\\\\GBIF21.csv', 'files_for_hw4/gbif_files\\\\GBIF22.csv', 'files_for_hw4/gbif_files\\\\GBIF3.csv', 'files_for_hw4/gbif_files\\\\GBIF4.csv', 'files_for_hw4/gbif_files\\\\GBIF5.csv', 'files_for_hw4/gbif_files\\\\GBIF6.csv', 'files_for_hw4/gbif_files\\\\GBIF7.csv', 'files_for_hw4/gbif_files\\\\GBIF8.csv', 'files_for_hw4/gbif_files\\\\GBIF9.csv']\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** The file `files_for_hw/gbif_files/endangered_turtles.txt` contains a list of endangered turtles species from [The IUCN Red List of Threatened Species](http://www.iucnredlist.org/). \n", "\n", "Write a function `get_species` that reads the file and returns a list of the species it includes." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_species(filename):\n", " species_list = []\n", " with open(filename,'r') as f:\n", " for line in f:\n", " species_list.append(line.strip())\n", " return species_list\n", " \n", " \n", "turtles_filename = \"files_for_hw4/gbif_files/endangered_turtles.txt\"\n", "endangered_turtles = get_species(turtles_filename)\n", "print(endangered_turtles)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['Acanthochelys pallidipectoris', 'Actinemys marmorata', 'Amyda cartilaginea', 'Astrochelys radiata', 'Astrochelys yniphora', 'Batagur baska', 'Batagur borneoensis', 'Batagur dhongoka', 'Batagur kachuga', 'Batagur trivittata', 'Caretta caretta', 'Carettochelys insculpta', 'Centrochelys sulcata', 'Chelodina mccordi', 'Chelodina parkeri', 'Chelodina pritchardi', 'Chelonia mydas', 'Chelonoidis chilensis', 'Chelonoidis denticulata', 'Chelonoidis nigra', 'Chelydra rossignonii', 'Chitra chitra', 'Chitra indica', 'Clemmys guttata', 'Cuora amboinensis', 'Cuora aurocapitata', 'Cuora flavomarginata', 'Cuora galbinifrons', 'Cuora mccordi', 'Cuora mouhotii', 'Cuora pani', 'Cuora trifasciata', 'Cuora yunnanensis', 'Cuora zhoui', 'Dermatemys mawii', 'Dermochelys coriacea', 'Elseya bellii', 'Elseya branderhorsti', 'Elusor macrurus', 'Emydoidea blandingii', 'Eretmochelys imbricata', 'Erymnochelys madagascariensis', 'Geochelone gigantea', 'Geochelone platynota', 'Geoclemys hamiltonii', 'Geoemyda japonica', 'Geoemyda spengleri', 'Glyptemys insculpta', 'Glyptemys muhlenbergii', 'Gopherus agassizii', 'Gopherus flavomarginatus', 'Gopherus polyphemus', 'Graptemys barbouri', 'Graptemys caglei', 'Graptemys flavimaculata', 'Graptemys gibbonsi', 'Graptemys oculifera', 'Graptemys pearlensis', 'Hardella thurjii', 'Heosemys annandalii', 'Heosemys depressa', 'Heosemys grandis', 'Heosemys spinosa', 'Homopus solus', 'Hydromedusa maximiliani', 'Indotestudo elongata', 'Indotestudo forstenii', 'Indotestudo travancorica', 'Kinixys homeana', 'Kinosternon angustipons', 'Kinosternon dunni', 'Lepidochelys kempii', 'Lepidochelys olivacea', 'Leucocephalon yuwonoi', 'Macrochelys temminckii', 'Malacochersus tornieri', 'Malayemys subtrijuga', 'Manouria emys', 'Manouria impressa', 'Mauremys annamensis', 'Mauremys mutica', 'Mauremys nigricans', 'Mauremys reevesii', 'Mauremys sinensis', 'Melanochelys tricarinata', 'Mesoclemmys dahli', 'Mesoclemmys hogei', 'Mesoclemmys zuliae', 'Morenia ocellata', 'Morenia petersi', 'Nilssonia formosa', 'Nilssonia gangetica', 'Nilssonia hurum', 'Nilssonia leithii', 'Notochelys platynota', 'Orlitia borneensis', 'Palea steindachneri', 'Pangshura sylhetensis', 'Pelochelys bibroni', 'Pelochelys cantorii', 'Pelodiscus sinensis', 'Peltocephalus dumerilianus', 'Pelusios broadleyi', 'Platysternon megacephalum', 'Podocnemis erythrocephala', 'Podocnemis lewyana', 'Podocnemis sextuberculata', 'Podocnemis unifilis', 'Psammobates geometricus', 'Pseudemydura umbrina', 'Pseudemys alabamensis', 'Pyxis arachnoides', 'Pyxis planicauda', 'Rafetus euphraticus', 'Rafetus swinhoei', 'Rheodytes leukops', 'Sacalia bealei', 'Sacalia quadriocellata', 'Siebenrockiella crassicollis', 'Siebenrockiella leytensis', 'Sternotherus depressus', 'Terrapene carolina', 'Terrapene coahuila', 'Testudo graeca', 'Testudo horsfieldii', 'Testudo kleinmanni', 'Trachemys adiutrix', 'Trachemys decorata', 'Trachemys gaigeae', 'Trachemys ornata', 'Trachemys taylori', 'Trachemys terrapen', 'Trachemys yaquia', 'Vijayachelys silvatica']\n" ] 
} ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**c)** We will now use the unified CSV file and endangered species list to create a filtered CSV, containing only the records we are interested in: those of endangered species, observed in the Southern hemisphere (_i.e._ in latitude < 0). \n", "\n", "Write a function that receives the unified filename, the species list, maximum latitude and output file, and write to the output file the records which satisfy these conditions (no return value is needed)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import csv\n", "def filter_records(filename, species, max_lat, out_csv_file):\n", " with open(out_csv_file,'w',newline='') as fo:\n", " csv_writer = csv.writer(fo)\n", " # write headers\n", " csv_writer.writerow(['','name','scientificName','genus','decimalLongitude','decimalLatitude','country'])\n", " with open(filename, 'r') as f:\n", " csv_reader = csv.reader(f)\n", " f.readline() # skip headers\n", " for row in csv_reader:\n", " spec = row[1]\n", " latitude = row[5]\n", " if (spec in species and latitude != 'NA' and float(latitude) < max_lat):\n", " csv_writer.writerow(row)\n", " \n", " \n", "filtered_file = \"files_for_hw4/gbif_files/endangered.csv\"\n", "filter_records(output_filename, endangered_turtles, 0, filtered_file)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3) Regex drills\n", "In this question, you don't have to write real code, just write the regular expression you'd use within the quotation marks. \n", "\n", "* It is highly recommended to use [The regex Coach](http://www.weitz.de/regex-coach/) for this question.\n", "\n", "a) Write a regex that will match strings containing any kind of number: positive/negative, integer/float etc. For example, all of the following should be matched: 7 , -3 , 6.14 , -0.00054 " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "\n", "re.compile(r'-?\\d+\\.?(\\d+)?')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "re.compile(r'-?\\d+\\.?(\\d+)?', re.UNICODE)" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "b) Write a regex that will match strings that __end__ with a number between 100 and 199, followed by a '.' __or__ a '\\' character. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.compile(r'1\\d{2}[\\.\\\\]$')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "re.compile(r'1\\d{2}[\\.\\\\]$', re.UNICODE)" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "c) Write a regex that will match whole strings of prices in dollars, such as '100\\$', '2.99\\$', '500.90\\$', but not '7.656\\$', '80.0001\\$' or 'price is: 56.80\\$'. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.compile(r'^\\d+\\.?\\d{2}\\$$')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "re.compile(r'^\\d+\\.?\\d{2}\\$$', re.UNICODE)" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "d) Write a regex that will match strings beginning with 3 to 8 uppercase letters, followed by at least 4 characters, which can be anything but '%' or '!', and end with 'XY' or 'QW'. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "re.compile(r'^[A-Z]{3,8}[^%!]{4,}(XY|QW)$')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "re.compile(r'^[A-Z]{3,8}[^%!]{4,}(XY|QW)$', re.UNICODE)" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4) Plant names\n", "A full scientific name of a plant species consists of a genus name, a species name and an authority (usually a short for the name of the person to first describe the species). For example, in _Arabidopsis thaliana (L.) Heynh._, _Arabidopsis_ is the genus, _thaliana)_ is the species and _(L.) Heynh._ is the authority. \n", "\n", "- The genus always begins with a capital letter, followed by one or more lower case letters. \n", "- The species is all lowercasse. \n", "- The authority can include any character. \n", "\n", "In addition, a name may (or may not) include an infraspecific rank. This is added after the species name, and consists of an _epithet_ which is either _subsp._ (for subspecies) or _var._ (for variety). The epithet is followed by the name of the infraspecific rank. \n", "\n", "For example, in _Fraxinus americana var. acuminata (Lam.) K.Koch_, the genus is _Fraxinus_, the species is _americana_, the infraspecies is _var. acuminata_ and the authority is _(Lam.) K.Koch_. \n", "\n", "The file `files_for_hw/plant_names.txt` contains a list of plant names. The goal is to break these names into their components. \n", "\n", "Write a program that reads the names from the file and writes a new CSV file with the following column names: _Genus_, _Species_, _Infraspecific_ and _Authority_. \n", "Each plant name in `plant_names.txt` should then be processed (use regular expressions) and its components inserted into the CSV file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import csv\n", "import re\n", "\n", "# regex for processing plant names\n", "name_regex = re.compile(r'^([A-Z]\\w+)\\s(\\w+)\\s((subsp\\.|var\\.)\\s\\w+\\s)?(.+)$')\n", "\n", "# files\n", "list_file = 'files_for_hw4/plant_names.txt'\n", "out_csv = 'files_for_hw4/parsed_plant_names.csv'\n", "\n", "def process_plant_name(name_string):\n", " \"\"\"\n", " Receives a plant name and breaks it to components.\n", " Returns a list: [genus, species,infraspecific (if not, empty string), authority]\n", " \"\"\"\n", " match_result = re.search(name_regex,name_string)\n", " if match_result: # if a match was found\n", " genus = match_result.group(1)\n", " species = match_result.group(2)\n", " infra = match_result.group(3)\n", " author = match_result.group(5)\n", " return [genus, species, infra, author]\n", " else: # if no match\n", " return None\n", "\n", "\n", "with open(list_file,'r') as f:\n", " with open(out_csv,'w',newline='') as fo:\n", " csv_writer = csv.writer(fo)\n", " # write csv headers\n", " csv_writer.writerow(['Genus','Species','Infraspecific','Authority'])\n", " # iterate on names and process\n", " for name in f:\n", " name = name.strip()\n", " name_parts = process_plant_name(name)\n", " if name_parts: # if name was processed successfully\n", " csv_writer.writerow(name_parts)\n", " else: # warn that something is wrong with the name\n", " print('Name',name,'could not be processed.')\n", "print('Parsing completed')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Parsing completed\n" ] } ], "prompt_number": 13 } ], "metadata": {} } ] }