{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Elegant NumPy: The Foundation of Scientific Python\n",
    "\n",
    "> [NumPy] is everywhere. It is all around us. Even now, in this very room.\n",
    "> You can see it when you look out your window or when you turn on your\n",
    "> television. You can feel it when you go to work... when you go to church...\n",
    "> when you pay your taxes.\n",
    ">\n",
    "> — Morpheus, *The Matrix*\n",
    "\n",
    "This chapter touches on some statistical functions in SciPy, but more than that, it focuses on exploring the NumPy array, a data structure that underlies almost all numerical scientific computation in Python.\n",
    "We will see how NumPy array operations enable concise and efficient code when manipulating numerical data.\n",
    "\n",
    "Our use case is using gene expression data from The Cancer Genome Atlas (TCGA) project to predict mortality in skin cancer patients.\n",
    "We will be working toward this goal throughout this chapter and the next one, learning about some key SciPy concepts along the way.\n",
    "Before we can predict mortality, we will need to normalize the expression data using a method called RPKM normalization.\n",
    "This allows the comparison of measurements between different samples and genes.\n",
    "(We will unpack what \"gene expression\" means in just a moment.)\n",
    "\n",
    "Let's start with a code snippet to tantalize you and introduce the ideas in this chapter.\n",
    "As we will do in each chapter, we open with a code sample that we believe epitomizes the elegance and power of a particular function from the SciPy ecosystem.\n",
    "In this case, we want to highlight NumPy's vectorization and broadcasting rules, which allow us to manipulate and reason about data arrays very efficiently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rpkm(counts, lengths):\n",
    "    \"\"\"Calculate reads per kilobase transcript per million reads.\n",
    "\n",
    "    RPKM = (10^9 * C) / (N * L)\n",
    "\n",
    "    Where:\n",
    "    C = Number of reads mapped to a gene\n",
    "    N = Total mapped reads in the experiment\n",
    "    L = Exon length in base pairs for a gene\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    counts: array, shape (N_genes, N_samples)\n",
    "        RNAseq (or similar) count data where columns are individual samples\n",
    "        and rows are genes.\n",
    "    lengths: array, shape (N_genes,)\n",
    "        Gene lengths in base pairs in the same order\n",
    "        as the rows in counts.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    normed : array, shape (N_genes, N_samples)\n",
    "        The RPKM normalized counts matrix.\n",
    "    \"\"\"\n",
    "    # First, convert counts to float to avoid overflow when multiplying by\n",
    "    # 1e9 in the RPKM formula\n",
    "    C = counts.astype(float)\n",
    "    N = np.sum(C, axis=0)  # sum each column to get total reads per sample\n",
    "    L = lengths\n",
    "\n",
    "    normed = 1e9 * C / (N[np.newaxis, :] * L[:, np.newaxis])\n",
    "\n",
    "    return(normed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example illustrates some of the ways that NumPy arrays can make your code more elegant:\n",
    "\n",
    "- Arrays can be 1D, like lists, but they can also be 2D, like matrices, and higher-dimensional still. This allows them to represent many different kinds of numerical data. In our case, we are manipulating a 2D matrix.\n",
    "- Arrays can be operated on along *axes*. In the first line, we calculate the\n",
    "  sum down each column by specifying `axis=0`.\n",
    "- Arrays allow the expression of many numerical operations at once.\n",
    "For example toward the end of the function we divide the 2D array of counts (C) by the 1D array of column sums (N).\n",
    "This is broadcasting. More on how this works in just a moment!\n",
    "\n",
    "Before we delve into the power of NumPy, let's spend some time looking at the biological data that we will be working with.\n",
    "\n",
    "## Introduction to the Data: What Is Gene Expression?\n",
    "\n",
    "We will work our way through a *gene expression analysis* to demonstrate the power of NumPy and SciPy to solve a real-world biological problem.\n",
    "We will use the pandas library, which builds on NumPy, to read and munge our data files, and then we will manipulate our data efficiently in NumPy arrays.\n",
    "\n",
    "The so-called [central dogma of molecular biology](https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology) states that all the information needed to run a cell (or an organism, for that matter) is stored in a molecule called *deoxyribonucleic acid*, or DNA.\n",
    "This molecule has a repetitive backbone on which lie chemical groups called *bases*, in sequence.\n",
    "There are four kinds of bases, abbreviated as A, C, G, and T, comprising an alphabet with which information is stored.\n",
    "\n",
    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/e/e4/DNA_chemical_structure.svg\"/>\n",
    "<!-- caption text=\"The chemical structure of DNA (image by Madeleine Price Ball, used under the terms of the CC0 public domain license)\" -->\n",
    "\n",
    "To access this information, the DNA is *transcribed* into a sister molecule called *messenger ribonucleic acid*, or mRNA.\n",
    "Finally, this mRNA is *translated* into proteins, the workhorses of the cell.\n",
    "A section of DNA that encodes the information to make a protein (via mRNA) is called a gene.\n",
    "\n",
    "The amount of mRNA produced from a given gene is called the *expression* of that gene.\n",
    "Although we would ideally like to measure protein levels, this is a much harder task than measuring mRNA.\n",
    "Fortunately, expression levels of an mRNA and levels of its corresponding protein are usually correlated.[^Maier]\n",
    "\n",
    "[^Maier]: Tobias Maier, Marc Güell, and Luis Serrano. [\"Correlation of mRNA and protein in complex biological samples\"](http://www.sciencedirect.com/science/article/pii/S0014579309008126), FEBS Letters 583, no. 204 (2009).\n",
    "\n",
    "Therefore, we usually measure mRNA levels and base our analyses on that.\n",
    "As you will see below, it often doesn't matter, because we are using mRNA levels for their power to predict biological outcomes, rather than to make specific statements about proteins.\n",
    "\n",
    "<img src=\"../figures/central_dogma.png\"/>\n",
    "<!-- caption text=\"Central dogma of molecular biology\" -->\n",
    "\n",
    "It's important to note that the DNA in every cell of your body is identical.\n",
    "Thus, the differences between cells arise from *differential expression* of\n",
    "that DNA into RNA: in different cells, different parts of the DNA are processed\n",
    "into downstream molecules. Similarly, as we will see in this chapter and the\n",
    "next, differential expression can distinguish different kinds of cancer.\n",
    "\n",
    "<img src=\"../figures/differential_gene_expression.png\"/>\n",
    "<!-- caption text=\"Gene expression\" -->\n",
    "\n",
    "The state-of-the-art technology to measure mRNA is RNA sequencing (RNAseq).\n",
    "RNA is extracted from a tissue sample (e.g., from a biopsy from a patient), *reverse transcribed* back into DNA (which is more stable), and then read out using chemically modified bases that glow when they are incorporated into the DNA sequence.\n",
    "Currently, high-throughput sequencing machines can only read short fragments (approximately 100 bases is common). These short sequences are called “reads.”\n",
    "We measure millions of reads and then based on their sequence we count how many reads came from each gene.\n",
    "We’ll be starting our analysis directly from this count data.\n",
    "\n",
    "<img src=\"../figures/RNAseq.png\"/>\n",
    "<!-- caption text=\"RNA sequencing (RNAseq)\" -->\n",
    "\n",
    "This table shows a minimal example of gene expression count data:\n",
    "\n",
    "|        | Cell type A | Cell type B |\n",
    "|--------|-------------|-------------|\n",
    "| Gene 0 | 100         | 200         |\n",
    "| Gene 1 | 50          | 0           |\n",
    "| Gene 2 | 350         | 100         |\n",
    "\n",
    "The data is a table of counts, integers representing how many reads were observed for each gene in each cell type.\n",
    "See how the counts for each gene differ between the cell types?\n",
    "We can use this information to learn about the differences between these two types of cell.\n",
    "\n",
    "One way to represent this data in Python would be as a list of lists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gene0 = [100, 200]\n",
    "gene1 = [50, 0]\n",
    "gene2 = [350, 100]\n",
    "expression_data = [gene0, gene1, gene2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, each gene's expression across different cell types is stored in a list of Python integers.\n",
    "Then, we store all of these lists in a list (a meta-list, if you will).\n",
    "We can retrieve individual data points using two levels of list indexing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expression_data[2][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because of the way the Python interpreter works, this is a very inefficient way to store these data points.\n",
    "First, Python lists are always lists of *objects*, so that the above list `gene2` is not a list of integers, but a list of *pointers* to integers, which is unnecessary overhead.\n",
    "Additionally, this means that each of these lists and each of these integers ends up in a completely different, random part of your computer's RAM.\n",
    "However, modern processors actually like to retrieve things from memory in *chunks*, so this spreading of the data throughout the RAM is inefficient.\n",
    "\n",
    "This is precisely the problem solved by the *NumPy array*.\n",
    "\n",
    "## NumPy N-Dimensional Arrays\n",
    "\n",
    "One of the key NumPy data types is the N-dimensional array (ndarray, or just array).\n",
    "Ndarrays underpin lots of awesome data manipulation techniques in SciPy.\n",
    "In particular, we're going to explore vectorization and broadcasting,\n",
    "techniques that allow us to write powerful, elegant code to manipulate our data.\n",
    "\n",
    "First, let's get our heads around the the ndarray.\n",
    "These arrays must be homogeneous: all items in an array must be the same type.\n",
    "In our case we will need to store integers.\n",
    "Ndarrays are called N-dimensional because they can have any number of dimensions.\n",
    "A 1-dimensional array is roughly equivalent to a Python list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "array1d = np.array([1, 2, 3, 4])\n",
    "print(array1d)\n",
    "print(type(array1d))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Arrays have particular attributes and methods, that you can access by placing a dot after the array name.\n",
    "For example, you can get the array's *shape* with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(array1d.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, it's just a tuple with a single number.\n",
    "You might wonder why you wouldn't just use `len`, as you would for a list.\n",
    "That will work, but it doesn't extend to *2D* arrays.\n",
    "\n",
    "This is what we use to represent the data in the table above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "array2d = np.array(expression_data)\n",
    "print(array2d)\n",
    "print(array2d.shape)\n",
    "print(type(array2d))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can see that the `shape` attribute generalizes `len` to account for the size of multiple dimensions of an array of data.\n",
    "\n",
    "<img src=\"../figures/NumPy_ndarrays_v2.png\"/>\n",
    "<!-- caption text=\"Visualizing NumPy's ndarrays in one, two and three dimensions\" -->\n",
    "\n",
    "Arrays have other attributes, such as `ndim`, the number of dimensions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(array2d.ndim)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll become familiar with all of these as you start to use NumPy more for your own data analysis.\n",
    "\n",
    "NumPy arrays can represent data that has even more dimensions, such as magnetic resonance imaging (MRI) data, which includes measurements within a 3D volume.\n",
    "If we store MRI values over time, we might need a 4D NumPy array.\n",
    "\n",
    "For now, we'll stick to 2D data.\n",
    "Later chapters will introduce higher-dimensional data and will teach you to write code that works for data of any number of dimensions.\n",
    "\n",
    "### Why Use ndarrays Instead of Python Lists?\n",
    "\n",
    "Arrays are fast because they enable vectorized operations, written in the low-level language C, that act on the whole array.\n",
    "Say you have a list and you want to multiply every element in the list by five.\n",
    "A standard Python approach would be to write a loop that iterates over the\n",
    "elements of the list and multiply each one by five.\n",
    "However, if your data were instead represented as an array,\n",
    "you can multiply every element in the array by five in a single bound.\n",
    "Behind the scenes, the highly-optimized NumPy library is doing the iteration as fast as possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Create an ndarray of integers in the range\n",
    "# 0 up to (but not including) 1,000,000\n",
    "array = np.arange(1e6)\n",
    "\n",
    "# Convert it to a list\n",
    "list_array = array.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare how long it takes to multiply all the values in the array by five,\n",
    "using the IPython `timeit` magic function. First, when the data is in a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit -n10 y = [val * 5 for val in list_array]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, using NumPy's built-in *vectorized* operations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit -n10 x = array * 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Over 50 times faster, and more concise, too!\n",
    "\n",
    "Arrays are also size efficient.\n",
    "In Python, each element in a list is an object and is given a healthy memory allocation (or is that unhealthy?).\n",
    "In contrast, in arrays, each element takes up just the necessary amount of memory.\n",
    "For example, an array of 64-bit integers takes up exactly 64-bits per element, plus some very small overhead for array metadata, such as the `shape` attribute we discussed above.\n",
    "This is generally much less than would be given to objects in a Python list.\n",
    "(If you're interested in digging into how Python memory allocation works, check out Jake VanderPlas's blog post, [\"Why Python Is Slow: Looking Under the Hood\"](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/).)\n",
    "\n",
    "Plus, when computing with arrays, you can also use *slices* that subset the array *without copying the underlying data*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an ndarray x\n",
    "x = np.array([1, 2, 3], np.int32)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a \"slice\" of x\n",
    "y = x[:2]\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the first element of y to be 6\n",
    "y[0] = 6\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that although we edited `y`, `x` has also changed, because `y` was referencing the same data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now the first element in x has changed to 6!\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means you have to be careful with array references.\n",
    "If you want to manipulate the data without touching the original, it's easy to make a copy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = np.copy(x[:2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vectorization\n",
    "\n",
    "Earlier we talked about the speed of operations on arrays.\n",
    "One of the tricks NumPy uses to speed things up is *vectorization*.\n",
    "Vectorization is where you apply a calculation to each element in an array, without having to use a for loop.\n",
    "In addition to speeding things up, this can result in more natural, readable code.\n",
    "Let's look at some examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.array([1, 2, 3, 4])\n",
    "print(x * 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we have `x`, an array of 4 values, and we have implicitly multiplied every element in `x` by 2, a single value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = np.array([0, 1, 2, 1])\n",
    "print(x + y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we have added together each element in `x` to its corresponding element in `y`, an array of the same shape.\n",
    "\n",
    "Both of these operations are simple and, we hope, intuitive examples of vectorization.\n",
    "NumPy also makes them very fast, much faster than iterating over the arrays manually.\n",
    "(Feel free to play with this yourself using the `%timeit` IPython magic we saw earlier.)\n",
    "\n",
    "### Broadcasting\n",
    "\n",
    "One of the most powerful and often misunderstood features of ndarrays is broadcasting.\n",
    "Broadcasting is a way of performing implicit operations between two arrays.\n",
    "It allows you to perform operations on arrays of *compatible* shapes, to create arrays bigger than either of the starting ones.\n",
    "For example, we can compute the [outer product](https://en.wikipedia.org/wiki/Outer_product) of two vectors, by reshaping them appropriately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.array([1, 2, 3, 4])\n",
    "x = np.reshape(x, (len(x), 1))\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = np.array([0, 1, 2, 1])\n",
    "y = np.reshape(y, (1, len(y)))\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two shapes are compatible when, for each dimension, either is equal to\n",
    "1 (one) or they match one another[^more_dimensions].\n",
    "\n",
    "[^more_dimensions]: We always start by comparing the last dimensions,\n",
    "                    and work our way forward, ignoring excess\n",
    "                    dimensions in the case of one array having more\n",
    "                    than the other (e.g., `(3, 5, 1)` and `(5, 8)`\n",
    "                    would match.\n",
    "\n",
    "Let's check the shapes of these two arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(x.shape)\n",
    "print(y.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both arrays have two dimensions and the inner dimensions of both arrays are 1, so the dimensions are compatible!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "outer = x * y\n",
    "print(outer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The outer dimensions tell you the size of the resulting array.\n",
    "In our case we expect a (4, 4) array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(outer.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see for yourself that `outer[i, j] = x[i] * y[j]` for all `(i, j)`.\n",
    "\n",
    "This was accomplished by NumPy's [broadcasting rules](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html), which implicitly expand dimensions of size 1 in one array to match the corresponding dimension of the other array.\n",
    "Don't worry, we will talk about these rules in more detail later in this chapter.\n",
    "\n",
    "As we will see in the rest of the chapter, as we explore real data, broadcasting is extremely valuable for real-world calculations on arrays of data.\n",
    "It allows us to express complex operations concisely and efficiently.\n",
    "\n",
    "## Exploring a Gene Expression Dataset\n",
    "\n",
    "The dataset that we'll be using is an RNAseq experiment of skin cancer samples from The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/).\n",
    "We've already cleaned and sorted the data for you, so you can use `data/counts.txt.bz2`\n",
    "in the book repository.\n",
    "In Chapter 2 we will be using this gene expression data to predict mortality in skin cancer patients, reproducing a simplified version of [Figures 5A and 5B](http://www.cell.com/action/showImagesData?pii=S0092-8674%2815%2900634-0) of a [paper](http://dx.doi.org/10.1016/j.cell.2015.05.044) from the TCGA consortium.\n",
    "But first we need to get our heads around the biases in our data, and think about how we could improve it.\n",
    "\n",
    "### Reading in the Data with pandas\n",
    "\n",
    "We're first going to use pandas to read in the table of counts.\n",
    "pandas is a Python library for data manipulation and analysis,\n",
    "with particular emphasis on tabular and time series data.\n",
    "Here, we will use it here to read in tabular data of mixed type.\n",
    "It uses the `DataFrame` type, which is a flexible tabular format based on the data frame object in R.\n",
    "For example, the data we will read has a column of gene names (strings) and multiple columns of counts (integers), so reading it into a homogeneous array of numbers would be the wrong approach.\n",
    "Although NumPy has some support for mixed data types (called \"structured arrays\"), it is not primarily designed for\n",
    "this use case, which makes subsequent operations harder than they need to be.\n",
    "\n",
    "By reading the data in as a pandas data frame, we can let pandas do all the parsing, then extract out the relevant information and store it in a more efficient data type.\n",
    "Here we are just using pandas briefly to import data.\n",
    "In later chapters we will see a bit more of pandas, but for details, read *Python\n",
    "for Data Analysis* (O'Reilly) by the creator of pandas, Wes McKinney."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import bz2\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Import TCGA melanoma data\n",
    "filename = 'data/counts.txt.bz2'\n",
    "with bz2.open(filename, 'rt') as f:\n",
    "    data_table = pd.read_csv(f, index_col=0) # Parse file with pandas\n",
    "\n",
    "print(data_table.iloc[:5, :5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that pandas has kindly pulled out the header row and used it to name the columns.\n",
    "The first column gives the name of each gene, and the remaining columns represent individual samples.\n",
    "\n",
    "We will also need some corresponding metadata, including the sample information and the gene lengths."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample names\n",
    "samples = list(data_table.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will need some information about the lengths of the genes for our normalization.\n",
    "So that we can take advantage of some fancy pandas indexing, we're going to set\n",
    "the index of the pandas table to be the gene names in the first column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import gene lengths\n",
    "filename = 'data/genes.csv'\n",
    "with open(filename, 'rt') as f:\n",
    "    # Parse file with pandas, index by GeneSymbol\n",
    "    gene_info = pd.read_csv(f, index_col=0)\n",
    "print(gene_info.iloc[:5, :])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check how well our gene length data matches up with our count data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Genes in data_table: \", data_table.shape[0])\n",
    "print(\"Genes in gene_info: \", gene_info.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are more genes in our gene length data than were actually measured in the experiment.\n",
    "Let's filter so we only get the relevant genes, and we want to make sure they are\n",
    "in the same order as in our count data.\n",
    "This is where pandas indexing comes in handy!\n",
    "We can get the intersection of the gene names from our two sources of data\n",
    "and use these to index both datasets, ensuring they have the same genes in the same order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subset gene info to match the count data\n",
    "matched_index = pd.Index.intersection(data_table.index, gene_info.index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's use the intersection of the gene names to index our count data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2D ndarray containing expression counts for each gene in each individual\n",
    "counts = np.asarray(data_table.loc[matched_index], dtype=int)\n",
    "\n",
    "gene_names = np.array(matched_index)\n",
    "\n",
    "# Check how many genes and individuals were measured\n",
    "print(f'{counts.shape[0]} genes measured in {counts.shape[1]} individuals.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And our gene lengths."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1D ndarray containing the lengths of each gene\n",
    "gene_lengths = np.asarray(gene_info.loc[matched_index]['GeneLength'],\n",
    "                          dtype=int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And let's check the dimensions of our objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(counts.shape)\n",
    "print(gene_lengths.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, they now match up nicely!\n",
    "\n",
    "## Normalization\n",
    "\n",
    "Real world data contains all kinds of measurement artifacts.\n",
    "Before doing any kind of analysis with it, it is important to take a look at\n",
    "it to determine whether some normalization is warranted.\n",
    "For example, measurements with digital thermometers may systematically vary from\n",
    "those taken with mercury thermometers and read out by a human.\n",
    "Thus, comparing samples often requires some kind of data wrangling to bring\n",
    "every measurement to a common scale.\n",
    "\n",
    "In our case, we want to make sure that any differences we uncover correspond to\n",
    "real biological differences, and not to technical artifact.\n",
    "We will consider two levels of normalization often applied jointly to gene\n",
    "expression dataset: normalization between samples (columns) and normalization\n",
    "between genes (rows).\n",
    "\n",
    "### Between Samples\n",
    "\n",
    "For example, the number of counts for each individual can vary substantially in RNAseq experiments.\n",
    "Let's take a look at the distribution of expression counts over all the genes.\n",
    "First, we will sum the columns to get the total counts of expression of all genes for each individual, so we can just look at the variation between individuals.\n",
    "To visualize the distribution of total counts, we will use kernel density\n",
    "estimation (KDE), a technique commonly used to smooth out histograms because it\n",
    "gives a clearer picture of the underlying distribution.\n",
    "\n",
    "Before we start, we have to do some plotting setup (which we will do in every\n",
    "chapter). See \"A quick note on plotting\" for details about each line of the following code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make all plots appear inline in the Jupyter notebook from now onwards\n",
    "%matplotlib inline\n",
    "# Use our own style file for the plots\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('style/elegant.mplstyle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **A Quick Note on Plotting {.callout}**\n",
    ">\n",
    "> The preceding code does a few neat things to make our plots prettier.\n",
    ">\n",
    "> First, `%matplotlib inline` is a Jupyter notebook [magic\n",
    "> command](http://ipython.org/ipython-doc/dev/interactive/tutorial.html#magics-explained),\n",
    "> that simply makes all plots appear in the notebook rather than pop up a new\n",
    "> window. If you are running a Jupyter notebook interactively, you can use\n",
    "> `%matplotlib notebook` instead to get an interactive figure, rather than a\n",
    "> static image of each plot.\n",
    ">\n",
    "> Second, we import `matplotlib.pyplot` and then direct it to use our own plotting\n",
    "> style `plt.style.use('style/elegant.mplstyle')`. You will see a block of code\n",
    "> like this before the first plot in every chapter.\n",
    ">\n",
    "> You may have seen people importing existing styles like this:\n",
    "> `plt.style.use('ggplot')`. But we wanted some particular settings, and we\n",
    "> wanted all the plots in this book to follow the same style. So we rolled our\n",
    "> own Matplotlib style. To see how we did it, take a look at the style file in\n",
    "> the Elegant SciPy repository: `style/elegant.mplstyle`. For more information\n",
    "> on styles, check out the [Matplotlib documentation on style\n",
    "> sheets](http://matplotlib.org/users/style_sheets.html).\n",
    "\n",
    "Now back to plotting our counts distribution!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_counts = np.sum(counts, axis=0)  # sum columns together\n",
    "                                       # (axis=1 would sum rows)\n",
    "\n",
    "from scipy import stats\n",
    "\n",
    "# Use Gaussian smoothing to estimate the density\n",
    "density = stats.kde.gaussian_kde(total_counts)\n",
    "\n",
    "# Make values for which to estimate the density, for plotting\n",
    "x = np.arange(min(total_counts), max(total_counts), 10000)\n",
    "\n",
    "# Make the density plot\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(x, density(x))\n",
    "ax.set_xlabel(\"Total counts per individual\")\n",
    "ax.set_ylabel(\"Density\")\n",
    "\n",
    "plt.show()\n",
    "\n",
    "print(f'Count statistics:\\n  min:  {np.min(total_counts)}'\n",
    "      f'\\n  mean: {np.mean(total_counts)}'\n",
    "      f'\\n  max:  {np.max(total_counts)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Density plot of gene expression counts per individual using KDE smoothing\" -->\n",
    "\n",
    "We can see that there is an order of magnitude difference in the total number of counts between the lowest and the highest individual.\n",
    "This means that a different number of RNAseq reads were generated for each individual.\n",
    "We say that these individuals have different library sizes.\n",
    "\n",
    "#### Normalizing library size between samples\n",
    "\n",
    "Let's take a closer look at ranges of gene expression for each individual, so when\n",
    "we apply our normalization we can see it in action. We'll subset a random sample\n",
    "of just 70 columns to keep the plotting from getting too messy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subset data for plotting\n",
    "np.random.seed(seed=7) # Set seed so we will get consistent results\n",
    "# Randomly select 70 samples\n",
    "samples_index = np.random.choice(range(counts.shape[1]), size=70, replace=False)\n",
    "counts_subset = counts[:, samples_index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some custom x-axis labelling to make our plots easier to read\n",
    "def reduce_xaxis_labels(ax, factor):\n",
    "    \"\"\"Show only every ith label to prevent crowding on x-axis\n",
    "        e.g. factor = 2 would plot every second x-axis label,\n",
    "        starting at the first.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    ax : matplotlib plot axis to be adjusted\n",
    "    factor : int, factor to reduce the number of x-axis labels by\n",
    "    \"\"\"\n",
    "    plt.setp(ax.xaxis.get_ticklabels(), visible=False)\n",
    "    for label in ax.xaxis.get_ticklabels()[factor-1::factor]:\n",
    "        label.set_visible(True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Box plot of expression counts by individual\n",
    "fig, ax = plt.subplots(figsize=(4.8, 2.4))\n",
    "\n",
    "with plt.style.context('style/thinner.mplstyle'):\n",
    "    ax.boxplot(counts_subset)\n",
    "    ax.set_xlabel(\"Individuals\")\n",
    "    ax.set_ylabel(\"Gene expression counts\")\n",
    "    reduce_xaxis_labels(ax, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Boxplot of gene expression counts per individual\" -->\n",
    "\n",
    "There are obviously a lot of outliers at the high expression end of the scale and a lot of variation between individuals, but these are hard to see because everything is clustered around zero.\n",
    "So let's do log(n + 1) of our data so it's a bit easier to look at.\n",
    "Both the log function and the n + 1 step can be done using broadcasting to simplify our code and speed things up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bar plot of expression counts by individual\n",
    "fig, ax = plt.subplots(figsize=(4.8, 2.4))\n",
    "\n",
    "with plt.style.context('style/thinner.mplstyle'):\n",
    "    ax.boxplot(np.log(counts_subset + 1))\n",
    "    ax.set_xlabel(\"Individuals\")\n",
    "    ax.set_ylabel(\"log gene expression counts\")\n",
    "    reduce_xaxis_labels(ax, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Boxplot of gene expression counts per individual (log scale)\" -->\n",
    "\n",
    "Now let's see what happens when we normalize by library size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalize by library size\n",
    "# Divide the expression counts by the total counts for that individual\n",
    "# Multiply by 1 million to get things back in a similar scale\n",
    "counts_lib_norm = counts / total_counts * 1000000\n",
    "# Notice how we just used broadcasting twice there!\n",
    "counts_subset_lib_norm = counts_lib_norm[:,samples_index]\n",
    "\n",
    "# Bar plot of expression counts by individual\n",
    "fig, ax = plt.subplots(figsize=(4.8, 2.4))\n",
    "\n",
    "with plt.style.context('style/thinner.mplstyle'):\n",
    "    ax.boxplot(np.log(counts_subset_lib_norm + 1))\n",
    "    ax.set_xlabel(\"Individuals\")\n",
    "    ax.set_ylabel(\"log gene expression counts\")\n",
    "    reduce_xaxis_labels(ax, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Boxplot of library-normalized gene expression counts per individual (log scale)\" -->\n",
    "\n",
    "Much better!\n",
    "Also notice how we used broadcasting twice there.\n",
    "Once to divide all the gene expression counts by the total for that column, and then again to multiply all the values by 1 million.\n",
    "\n",
    "Finally, let's compare our normalized data to the raw data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools as it\n",
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def class_boxplot(data, classes, colors=None, **kwargs):\n",
    "    \"\"\"Make a boxplot with boxes colored according to the class they belong to.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    data : list of array-like of float\n",
    "        The input data. One boxplot will be generated for each element\n",
    "        in `data`.\n",
    "    classes : list of string, same length as `data`\n",
    "        The class each distribution in `data` belongs to.\n",
    "\n",
    "    Other parameters\n",
    "    ----------------\n",
    "    kwargs : dict\n",
    "        Keyword arguments to pass on to `plt.boxplot`.\n",
    "    \"\"\"\n",
    "    all_classes = sorted(set(classes))\n",
    "    colors = plt.rcParams['axes.prop_cycle'].by_key()['color']\n",
    "    class2color = dict(zip(all_classes, it.cycle(colors)))\n",
    "\n",
    "    # map classes to data vectors\n",
    "    # other classes get an empty list at that position for offset\n",
    "    class2data = defaultdict(list)\n",
    "    for distrib, cls in zip(data, classes):\n",
    "        for c in all_classes:\n",
    "            class2data[c].append([])\n",
    "        class2data[cls][-1] = distrib\n",
    "\n",
    "    # then, do each boxplot in turn with the appropriate color\n",
    "    fig, ax = plt.subplots()\n",
    "    lines = []\n",
    "    for cls in all_classes:\n",
    "        # set color for all elements of the boxplot\n",
    "        for key in ['boxprops', 'whiskerprops', 'flierprops']:\n",
    "            kwargs.setdefault(key, {}).update(color=class2color[cls])\n",
    "        # draw the boxplot\n",
    "        box = ax.boxplot(class2data[cls], **kwargs)\n",
    "        lines.append(box['whiskers'][0])\n",
    "    ax.legend(lines, all_classes)\n",
    "    return ax"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot a colored boxplot according to normalized versus unnormalized samples.\n",
    "We show only three samples from each class for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_counts_3 = list(np.log(counts.T[:3] + 1))\n",
    "log_ncounts_3 = list(np.log(counts_lib_norm.T[:3] + 1))\n",
    "ax = class_boxplot(log_counts_3 + log_ncounts_3,\n",
    "                   ['raw counts'] * 3 + ['normalized by library size'] * 3,\n",
    "                   labels=[1, 2, 3, 1, 2, 3])\n",
    "ax.set_xlabel('sample number')\n",
    "ax.set_ylabel('log gene expression counts');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Comparing raw and library normalized gene expression counts in three samples (log scale)\" -->\n",
    "\n",
    "You can see that the normalized distributions are a little more similar\n",
    "when we take library size (the sum of those distributions) into account.\n",
    "Now we are comparing like with like between the samples!\n",
    "But what about differences between the genes?\n",
    "\n",
    "### Between Genes\n",
    "\n",
    "We can also get into some trouble when trying to compare different genes.\n",
    "The number of counts for a gene is related to the gene length.\n",
    "Suppose gene B is twice as long as gene A.\n",
    "Both are expressed at similar levels in the sample (i.e., both produce a similar number of mRNA molecules).\n",
    "Remember that in RNAseq experiment, we fragment the transcripts, and sample reads from that pool of fragments.\n",
    "So if a gene is twice as long, it'll produce twice as many fragments, and we are twice as likely to sample it.\n",
    "Therefore, we would expect gene B to have about twice as many counts as gene A.\n",
    "If we want to compare the expression levels of different genes, we will have to do some more normalization.\n",
    "\n",
    "<img src=\"../figures/gene_length_counts.png\"/>\n",
    "<!-- caption text=\"Relationship between counts and gene length\" -->\n",
    "\n",
    "Let's see if the relationship between gene length and counts plays out in our dataset.\n",
    "First, we define a utility function for plotting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def binned_boxplot(x, y, *,  # check out this Python 3 exclusive! (*see tip box)\n",
    "                   xlabel='gene length (log scale)',\n",
    "                   ylabel='average log counts'):\n",
    "    \"\"\"Plot the distribution of `y` dependent on `x` using many boxplots.\n",
    "\n",
    "    Note: all inputs are expected to be log-scaled.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    x: 1D array of float\n",
    "        Independent variable values.\n",
    "    y: 1D array of float\n",
    "        Dependent variable values.\n",
    "    \"\"\"\n",
    "    # Define bins of `x` depending on density of observations\n",
    "    x_hist, x_bins = np.histogram(x, bins='auto')\n",
    "\n",
    "    # Use `np.digitize` to number the bins\n",
    "    # Discard the last bin edge because it breaks the right-open assumption\n",
    "    # of `digitize`. The max observation correctly goes into the last bin.\n",
    "    x_bin_idxs = np.digitize(x, x_bins[:-1])\n",
    "\n",
    "    # Use those indices to create a list of arrays, each containing the `y`\n",
    "    # values corresponding to `x`s in that bin. This is the input format\n",
    "    # expected by `plt.boxplot`\n",
    "    binned_y = [y[x_bin_idxs == i]\n",
    "                for i in range(np.max(x_bin_idxs))]\n",
    "    fig, ax = plt.subplots(figsize=(4.8,1))\n",
    "\n",
    "    # Make the x-axis labels using the bin centers\n",
    "    x_bin_centers = (x_bins[1:] + x_bins[:-1]) / 2\n",
    "    x_ticklabels = np.round(np.exp(x_bin_centers)).astype(int)\n",
    "\n",
    "    # make the boxplot\n",
    "    ax.boxplot(binned_y, labels=x_ticklabels)\n",
    "\n",
    "    # show only every 10th label to prevent crowding on x-axis\n",
    "    reduce_xaxis_labels(ax, 10)\n",
    "\n",
    "    # Adjust the axis names\n",
    "    ax.set_xlabel(xlabel)\n",
    "    ax.set_ylabel(ylabel)\n",
    "    return ax"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Python 3 Tip: using `*` to create keyword-only arguments {.callout}**\n",
    ">\n",
    "> Since version 3.0 Python allows\n",
    "> [\"keyword-only\" arguments](https://www.python.org/dev/peps/pep-3102/).\n",
    "> These are arguments that you have to call using a keyword, rather than relying\n",
    "> on position alone.\n",
    "> For example, you can call the `binned_boxplot` we just wrote like so:\n",
    ">\n",
    ">     >>> binned_boxplot(x, y, xlabel='my x label', ylabel='my y label')\n",
    ">\n",
    "> but not this, which would have been valid Python 2, but raises an error in\n",
    "> Python 3:\n",
    ">\n",
    ">     >>> binned_boxplot(x, y, 'my x label', 'my y label')\n",
    ">\n",
    ">     ---------------------------------------------------------------------------\n",
    ">     TypeError                                 Traceback (most recent call last)\n",
    ">     <ipython-input-58-7a118d2d5750in <module>()\n",
    ">         1 x_vals = [1, 2, 3, 4, 5]\n",
    ">         2 y_vals = [1, 2, 3, 4, 5]\n",
    ">     ----3 binned_boxplot(x, y, 'my x label', 'my y label')\n",
    ">\n",
    ">     TypeError: binned_boxplot() takes 2 positional arguments but 4 were given\n",
    ">\n",
    "> The idea is to prevent you from accidentally doing something like this:\n",
    ">\n",
    ">     binned_boxplot(x, y, 'my y label')\n",
    ">\n",
    "> which would give you your y label on the x-axis, and is a common error for\n",
    "> signatures with many optional parameters that don't have an obvious ordering.\n",
    "\n",
    "We now compute the gene lengths and counts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_counts = np.log(counts_lib_norm + 1)\n",
    "mean_log_counts = np.mean(log_counts, axis=1)  # across samples\n",
    "log_gene_lengths = np.log(gene_lengths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we plot the counts as a function of gene length:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with plt.style.context('style/thinner.mplstyle'):\n",
    "    binned_boxplot(x=log_gene_lengths, y=mean_log_counts);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see in the previous image that the longer a gene is, the higher its measured counts! As\n",
    "previously explained, this is an artifact of the technique, not a biological signal!\n",
    "How do we account for this?\n",
    "\n",
    "### Normalizing Over Samples and Genes: RPKM\n",
    "\n",
    "One of the simplest normalization methods for RNAseq data is RPKM: reads per\n",
    "kilobase transcript per million reads.\n",
    "RPKM puts together the ideas of normalizing by sample and by gene.\n",
    "When we calculate RPKM, we are normalizing for both the library size (the sum of each column)\n",
    "and the gene length.\n",
    "\n",
    "To work through how RPKM is derived, let's define the following values:\n",
    "\n",
    "- $C$ = Number of reads mapped to a gene\n",
    "- $L$ = Exon length in base-pairs for a gene\n",
    "- $N$ = Total mapped reads in the experiment\n",
    "\n",
    "First, let's calculate reads per kilobase.\n",
    "\n",
    "Reads per base would be:\n",
    "$\\frac{C}{L}$\n",
    "\n",
    "The formula asks for reads per kilobase instead of reads per base.\n",
    "One kilobase = 1,000 bases, so we'll need to divide length (L) by 1,000.\n",
    "\n",
    "Reads per kilobase would be:\n",
    "\n",
    "$\\frac{C}{L/1000}  = \\frac{10^3C}{L}$\n",
    "\n",
    "Next, we need to normalize by library size.\n",
    "If we just divide by the number of mapped reads we get:\n",
    "\n",
    "$ \\frac{10^3C}{LN} $\n",
    "\n",
    "But biologists like thinking in millions of reads so that the numbers don't get\n",
    "too small. Counting per million reads we get:\n",
    "\n",
    "$ \\frac{10^3C}{L(N/10^6)} = \\frac{10^9C}{LN}$\n",
    "\n",
    "\n",
    "In summary, to calculate reads per kilobase transcript per million reads:\n",
    "$RPKM = \\frac{10^9C}{LN}$\n",
    "\n",
    "Now let's implement RPKM over the entire counts array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make our variable names the same as the RPKM formula so we can compare easily\n",
    "C = counts\n",
    "N = np.sum(counts, axis=0)  # sum each column to get total reads per sample\n",
    "L = gene_lengths  # lengths for each gene, matching rows in `C`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Tip: Numbers and computers {.callout}**\n",
    ">\n",
    "> We can't cover everything you need to know about numeric representations\n",
    "> in computers in just a tip box, but you *should* know that numbers are\n",
    "> represented as \"n-bit\" \"integer\" or \"floating point\" numbers in\n",
    "> the computer. As an example, a 32-bit precision integer is an integer\n",
    "> number (no decimal point) represented as a string of 0s and 1s of width\n",
    "> 32. And, just like you can't represent a number larger than 9999 ($10^4-1$)\n",
    "> if you have a length-4 array of numbers, you can't represent a number\n",
    "> larger than $2^32 - 1 \\approx 4 \\times 10^9$ if you are using 32-bit\n",
    "> integers, or $2^31 - 1 \\approx 2 \\times 10^9$ if you want to have negative\n",
    "> numbers (because you need one of the 32 bits to indicate sign).\n",
    ">\n",
    "> So what happens when you go over that limit? You can try it with the\n",
    "> following code:\n",
    ">\n",
    ">     >>> 2**31 - 1\n",
    ">     2147483647\n",
    ">     >>> np.array([2147483647], dtype=np.int32) + 1\n",
    ">     array([-2147483648], dtype=int32)\n",
    ">\n",
    "> As you can see, it just ticks over, without warning!\n",
    ">\n",
    "> Floating point numbers can express much larger numbers, at the cost of some\n",
    "> precision:\n",
    ">\n",
    ">     >>> np.float32(2**96)\n",
    ">     7.9228163e+28\n",
    ">     >>> np.float32(2**96) == np.float32(2**96 + 1)\n",
    ">     True\n",
    ">\n",
    "> As mentioned, we can't go into all the subtleties of dealing with these\n",
    "> errors, but it's probably what we are avoiding if you see us converting an\n",
    "> array with `.astype(float)`!\n",
    ">\n",
    "> The paper \"What Every Computer Scientist Should Know About Floating-Point\n",
    "> Arithmetic\", by David Goldberg, contains a lot of detail about this, if you\n",
    "> are curious. A free version is available at\n",
    "> https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html\n",
    "> . Somewhat more amusingly, try:\n",
    ">\n",
    ">     >>> np.sum(np.array([0.1 + 0.2], dtype=np.float64))\n",
    ">     0.30000000000000004\n",
    ">\n",
    "> Then, copy that value to go to http://0.30000000000000004.com, which\n",
    "> contains a very concise explanation of the problem and links to further\n",
    "> resources, including the Goldberg paper.\n",
    "\n",
    "First, we multiply by $10^9$.\n",
    "Because counts (C) is an ndarray, we can use broadcasting.\n",
    "If we multiple an ndarray by a single value,\n",
    "that value is broadcast over the entire array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multiply all counts by $10^9$. Note that ^ in Python is bitwise-or.\n",
    "# Exponentiation is denoted by `**`\n",
    "# Avoid overflow by converting C to float, see tip \"Numbers and computers\"\n",
    "C_tmp = 10**9 * C.astype(float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we need to divide by the gene length.\n",
    "Broadcasting a single value over a 2D array was pretty clear.\n",
    "We were just multiplying every element in the array by the value.\n",
    "But what happens when we need to divide a 2D array by a 1D array?\n",
    "\n",
    "#### Broadcasting rules\n",
    "\n",
    "Broadcasting allows calculations between ndarrays that have different shapes.\n",
    "Numpy uses broadcasting rules to make these manipulations a little easier.\n",
    "When two arrays have the same number of dimensions,\n",
    "broadcasting can occur if the sizes of each dimension match,\n",
    "or one of them is equal to 1.\n",
    "If arrays have different numbers of dimensions, then $(1,)$ is prepended to the\n",
    "shorter array until the numbers match, and then the standard broadcasting rules\n",
    "apply.\n",
    "\n",
    "For example, suppose we have two ndarrays, A and B, with shapes $(5, 2)$ and\n",
    "$(2,)$.\n",
    "We define the product `A * B` using broadcasting.\n",
    "B has fewer dimensions than A, so during the calculation,\n",
    "a new dimension is prepended to B with value 1, so B's new shape is $(1, 2)$.\n",
    "Finally, where B's shape doesn't match A's, it is *multiplied* by stacking enough\n",
    "versions of B, giving the shape $(5, 2)$. This is done \"virtually\", without using\n",
    "up any additional memory. At this point, the product is just an element-wise\n",
    "multiplication, giving an output array of the same shape as A.\n",
    "\n",
    "Now let's say we have another array, C, of shape $(2, 5)$. To multiply (or add) C \n",
    "to B, we might try to prepend $(1,)$ to the shape of B, but in that case, we still\n",
    "end up with incompatible shapes: $(2, 5)$ and $(1, 2)$. If we want the arrays to\n",
    "broadcast, we have to *ap*pend a dimension to B, manually. Then, we end up with\n",
    "$(2, 5)$ and $(2, 1)$, and broadcasting can proceed.\n",
    "\n",
    "In NumPy, we can explicitly add a new dimension to B using `np.newaxis`.\n",
    "Let's see this in our normalization by RPKM.\n",
    "\n",
    "Let's look at the dimensions of our arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('C_tmp.shape', C_tmp.shape)\n",
    "print('L.shape', L.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that `C_tmp` has two dimensions, while `L` has one.\n",
    "So during broadcasting, an additional dimension will be prepended to `L`.\n",
    "Then we will have:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "C_tmp.shape (20500, 375)\n",
    "L.shape (1, 20500)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dimensions won't match!\n",
    "We want to broadcast L over the first dimension of `C_tmp`,\n",
    "so we need to adjust the dimensions of L ourselves."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "L = L[:, np.newaxis] # append a dimension to L, with value 1\n",
    "print('C_tmp.shape', C_tmp.shape)\n",
    "print('L.shape', L.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that our dimensions match or are equal to 1, we can broadcast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Divide each row by the gene length for that gene (L)\n",
    "C_tmp = C_tmp / L"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we need to normalize by the library size,\n",
    "the total number of counts for that column.\n",
    "Remember that we have already calculated $N$ with:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "N = np.sum(counts, axis=0) # sum each column to get total reads per sample\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the shapes of C_tmp and N\n",
    "print('C_tmp.shape', C_tmp.shape)\n",
    "print('N.shape', N.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we trigger broadcasting, an additional dimension will be\n",
    "prepended to N:\n",
    "\n",
    "`N.shape (1, 375)`\n",
    "\n",
    "The dimensions will match so we don't have to do anything.\n",
    "However, for readability, it can be useful to add the extra dimension to N anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Divide each column by the total counts for that column (N)\n",
    "N = N[np.newaxis, :]\n",
    "print('C_tmp.shape', C_tmp.shape)\n",
    "print('N.shape', N.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Divide each column by the total counts for that column (N)\n",
    "rpkm_counts = C_tmp / N"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's put this in a function so we can reuse it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rpkm(counts, lengths):\n",
    "    \"\"\"Calculate reads per kilobase transcript per million reads.\n",
    "\n",
    "    RPKM = (10^9 * C) / (N * L)\n",
    "\n",
    "    Where:\n",
    "    C = Number of reads mapped to a gene\n",
    "    N = Total mapped reads in the experiment\n",
    "    L = Exon length in base pairs for a gene\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    counts: array, shape (N_genes, N_samples)\n",
    "        RNAseq (or similar) count data where columns are individual samples\n",
    "        and rows are genes.\n",
    "    lengths: array, shape (N_genes,)\n",
    "        Gene lengths in base pairs in the same order\n",
    "        as the rows in counts.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    normed : array, shape (N_genes, N_samples)\n",
    "        The RPKM normalized counts matrix.\n",
    "    \"\"\"\n",
    "    C = counts.astype(float)  # use float to avoid overflow with `1e9 * C`\n",
    "    N = np.sum(C, axis=0)  # sum each column to get total reads per sample\n",
    "    L = lengths\n",
    "\n",
    "    normed = 1e9 * C / (N[np.newaxis, :] * L[:, np.newaxis])\n",
    "\n",
    "    return(normed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts_rpkm = rpkm(counts, gene_lengths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### RPKM between gene normalization\n",
    "\n",
    "Let's see the RPKM normalization's effect in action. First, as a reminder, here's\n",
    "the distribution of mean log counts as a function of gene length:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_counts = np.log(counts + 1)\n",
    "mean_log_counts = np.mean(log_counts, axis=1)\n",
    "log_gene_lengths = np.log(gene_lengths)\n",
    "\n",
    "with plt.style.context('style/thinner.mplstyle'):\n",
    "    # Keep track of axis object so that the next plot can have the same scale\n",
    "    unnormalized_ax = binned_boxplot(x=log_gene_lengths, y=mean_log_counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"The relationship between gene length and average expression before RPKM normalization (log scale)\" -->\n",
    "\n",
    "Now, the same plot with the RPKM-normalized values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_counts = np.log(counts_rpkm + 1)\n",
    "mean_log_counts = np.mean(log_counts, axis=1)\n",
    "log_gene_lengths = np.log(gene_lengths)\n",
    "\n",
    "with plt.style.context('style/thinner.mplstyle'):\n",
    "    rpkm_ax = binned_boxplot(x=log_gene_lengths, y=mean_log_counts)\n",
    "    # Set the axis limits to those of the previous plot for visual comparison\n",
    "    rpkm_ax.set_ylim(unnormalized_ax.get_ylim())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that the mean expression counts have flattened quite a bit,\n",
    "especially for genes larger than about 3,000 base pairs.\n",
    "(Smaller genes still appear to have low expression — these may be too small for\n",
    "the statistical power of the RPKM method.)\n",
    "\n",
    "RPKM normalization can be useful to compare the expression profile of different genes.\n",
    "We've already seen that longer genes have higher counts, but this doesn't mean their expression level is actually higher.\n",
    "Let's choose a short gene and a long gene and compare their counts before and after RPKM normalization to see what we mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gene_idxs = np.nonzero((gene_names == 'RPL24') |\n",
    "                       (gene_names == 'TXNDC5'))\n",
    "gene1, gene2 = gene_names[gene_idxs]\n",
    "len1, len2 = gene_lengths[gene_idxs]\n",
    "gene_labels = [f'{gene1}, {len1}bp', f'{gene2}, {len2}bp']\n",
    "\n",
    "log_counts = list(np.log(counts[gene_idxs] + 1))\n",
    "log_ncounts = list(np.log(counts_rpkm[gene_idxs] + 1))\n",
    "\n",
    "ax = class_boxplot(log_counts,\n",
    "                   ['raw counts'] * 3,\n",
    "                   labels=gene_labels)\n",
    "ax.set_xlabel('Genes')\n",
    "ax.set_ylabel('log gene expression counts over all samples');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Comparing expression of two genes before RPKM normalization\" -->\n",
    "\n",
    "If we look just at the raw counts, it looks like the longer gene, TXNDC5, is expressed\n",
    "slightly more than the shorter one, RPL24.\n",
    "But, after RPKM normalization, a different picture emerges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = class_boxplot(log_ncounts,\n",
    "                   ['RPKM normalized'] * 3,\n",
    "                   labels=gene_labels)\n",
    "ax.set_xlabel('Genes')\n",
    "ax.set_ylabel('log RPKM gene expression counts over all samples');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- caption text=\"Comparing expression of two genes after RPKM normalization\" -->\n",
    "\n",
    "Now it looks like RPL24 is actually expressed at a much higher level than TXNDC5.\n",
    "This is because RPKM includes normalization for gene length, so we can now directly compare between genes of different lengths.\n",
    "\n",
    "## Taking Stock\n",
    "\n",
    "So far we have done the following:\n",
    "- Imported data using pandas\n",
    "- Become familiar with the key NumPy object class — the ndarray\n",
    "- Used the power of broadcasting to make our calculations more elegant.\n",
    "\n",
    "In Chapter 2 we will continue working with the same dataset, implementing a\n",
    "more sophisticated normalization technique, then use clustering to make some\n",
    "predictions about mortality in skin cancer patients."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}