{
 "metadata": {
  "name": "design"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Basic Program Design\n",
      "====================\n",
      "\n",
      "The most basic goal of program design is to break code into manageable pieces,\n",
      "store them in such a way that it is easy to find the piece we want,\n",
      "and make it easy to reuse pieces of our code in other projects.\n",
      "\n",
      "The basic elements of design for my scientific projects include:\n",
      "\n",
      "1. Modules for core functionality\n",
      "2. Script with project specific functions and commands\n",
      "3. Follow standard Python structure within modules/scripts\n",
      "4. Command line interactions to allow control of the code"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Basic Python Module Structure\n",
      "-----------------------------\n",
      "\n",
      "    \"\"\"module docstring\"\"\"\n",
      "\n",
      "    # imports\n",
      "    # constants\n",
      "    # classes\n",
      "    # functions\n",
      "\n",
      "    def main(...):\n",
      "        ...\n",
      "\n",
      "    if __name__ == '__main__':\n",
      "        main()\n",
      "\n",
      "I've simplified this slightly to focus on the things that most scientific programmers use.\n",
      "If you want to see the full structure check out the [original source](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#module-structure)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Example\n",
      "-------\n",
      "\n",
      "Let's take the answer to the\n",
      "[refactoring problem in Assignment 5](http://www.programmingforbiologists.org/3-refactor-regression-test)\n",
      "as an example."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Module for core functionality\n",
      "We don't have a lot of general functions yet, but the couple we have should be placed in a module."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file dna_analysis.py\n",
      "\n",
      "\"\"\"Code for analyzing DNA sequences\"\"\"\n",
      "\n",
      "from __future__ import division\n",
      "\n",
      "def get_clean_seq(seq):\n",
      "    \"\"\"Get a consistently formated sequence without extra characters for analysis\"\"\"\n",
      "    seq = seq.replace('\\n', '').replace(' ', '').upper()\n",
      "    seq = seq.upper()\n",
      "    return seq\n",
      "\n",
      "def get_gc_content(seq):\n",
      "    \"\"\"Determine the GC content of a sequence\"\"\"\n",
      "    seq = get_clean_seq(seq)\n",
      "    gc_content = 100 * (seq.count('G') + seq.count('C')) / len(seq)\n",
      "    return gc_content"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Script with project specific code\n",
      "We then put our project specific code into another module following Python structure."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "\"\"\"Analysis code for Dr. Granger's project\"\"\"\n",
      "\n",
      "from __future__ import division\n",
      "import urllib\n",
      "import csv\n",
      "\n",
      "import numpy as np\n",
      "\n",
      "import dna_analysis\n",
      "\n",
      "def get_size_class(earlength):\n",
      "    \"\"\"Determine the size class of earlength based on Dr. Grangers specification\"\"\"\n",
      "    assert float(earlength), \"Function requires numeric input\"\n",
      "    if earlength >= 15:\n",
      "        size_class = 'extralarge'\n",
      "    elif 10 <= earlength < 15:\n",
      "        size_class = 'large'\n",
      "    elif 8 <= earlength < 10:\n",
      "        size_class = 'medium'\n",
      "    elif earlength < 8:\n",
      "        size_class = 'small'\n",
      "    return size_class\n",
      "\n",
      "def import_data(datafile, headerrow=False):\n",
      "    \"\"\"Retrieve CSV data from the file and store it in a list of lists\"\"\"\n",
      "    datareader = csv.reader(datafile)\n",
      "    if headerrow == True:\n",
      "        datareader.next()\n",
      "    data = []\n",
      "    for row in datareader:\n",
      "        data.append(row)\n",
      "    return data\n",
      "\n",
      "def export_to_csv(data, filename):\n",
      "    \"\"\"Export a list of lists to a CSV file\"\"\"\n",
      "    output_file = open(filename, 'w')\n",
      "    datawriter = csv.writer(output_file)\n",
      "    datawriter.writerows(data)\n",
      "    output_file.close()\n",
      "    \n",
      "def get_indiv_earlength_gc_data(elves_data):\n",
      "    \"\"\"Determine individual level earth length category and gc content values\"\"\"\n",
      "    results = []\n",
      "    for indiv_id, earlength, dna in elves_data:\n",
      "        gc_content = dna_analysis.get_gc_content(dna)\n",
      "        earlength_size_class = get_size_class(float(earlength))\n",
      "        results.append((indiv_id, earlength_size_class, gc_content))\n",
      "    return results\n",
      "    \n",
      "def get_avg_gc_by_sizeclass(indiv_data):\n",
      "    \"\"\"Get average values of gc content for each size class\"\"\"\n",
      "    indiv_data = np.array(indiv_data, dtype={'names': ['ID', 'SizeClass', 'GCcontent'],\n",
      "                                       'formats': ['a10', 'a10', 'f4']})\n",
      "    size_classes = set(indiv_data['SizeClass'])\n",
      "    results = []\n",
      "    for size_class in size_classes:\n",
      "        avg_gc_content = np.mean(indiv_data['GCcontent'][indiv_data['SizeClass']\n",
      "                                                         == size_class])\n",
      "        results.append([size_class, avg_gc_content])\n",
      "    return results\n",
      "    \n",
      "def main():\n",
      "    url = 'http://programmingforbiologists.org/sites/programmingforbiologists.org/files/houseelf_earlength_dna_data.csv'\n",
      "    datafile = urllib.urlopen(url)\n",
      "    elves_data = import_data(datafile, headerrow=True)\n",
      "    indiv_data = get_indiv_earlength_gc_data(elves_data)\n",
      "    results = get_avg_gc_by_sizeclass(indiv_data)\n",
      "    export_to_csv(results, 'grangers_output.csv')\n",
      "    \n",
      "if __name__ == '__main__':\n",
      "    main()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Command line interaction to allow control of code\n",
      "To add some command line control we can have ``main()`` check to see if there are arguments and use them."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def main():\n",
      "    if sys.argv[1]:\n",
      "        url = sys.argv[1]\n",
      "    else:\n",
      "        url = 'http://programmingforbiologists.org/sites/programmingforbiologists.org/files/houseelf_earlength_dna_data.csv'\n",
      "    datafile = urllib.urlopen(url)\n",
      "    elves_data = import_data(datafile, headerrow=True)\n",
      "    indiv_data = get_indiv_earlength_gc_data(elves_data)\n",
      "    results = get_avg_gc_by_sizeclass(indiv_data)\n",
      "    export_to_csv(results, 'grangers_output.csv')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}