{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Systems Immunogenetics Project\n",
    "\n",
    "## WNV Flow Cytometry Data Cleaning Workflow\n",
    "\n",
    "### McWeeney Lab, Oregon Health & Science University\n",
    "\n",
    "#### Author: Michael Mooney (mooneymi@ohsu.edu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "This document will walk through the steps for parsing and cleaning flow cytometry spreadsheets. The result will be a single R dataframe containing all the data, with standardized column names and formatting.\n",
    "\n",
    "Required Files:\n",
    "\n",
    "- This notebook** (`SIG_WNV_Flow_Data_Cleaning.ipynb`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/SIG_WNV_Flow_Data_Cleaning.ipynb)\n",
    "- The R script (`flow_data_cleaning_functions.r`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/scripts/flow_data_cleaning_functions.r)\n",
    "- The data dictionary containing all flow variables (`WNV_Data_Dictionary.xlsx`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/data/WNV_Data_Dictionary.xlsx)\n",
    "\n",
    "** Note: this notebook can also be downloaded as an R script (only the code blocks seen below will be included): [[Download R script here]](https://raw.githubusercontent.com/biodev/SIG/master/SIG_WNV_Flow_Data_Cleaning.r)\n",
    "\n",
    "Required R Packages:\n",
    "\n",
    "- `gdata`: [https://cran.r-project.org/web/packages/gdata/index.html](https://cran.r-project.org/web/packages/gdata/index.html)\n",
    "\n",
    "**All code is available on GitHub: [https://github.com/biodev/SIG](https://github.com/biodev/SIG)** \n",
    "\n",
    "If you are not familiar with Jupyter Notebooks, I've created a short tutorial to get you up and running quickly. There is also plenty of documentation online:\n",
    "\n",
    "1. [Jupyter for R Tutorial](http://nbviewer.jupyter.org/github/mooneymi/jupyter_notebooks/blob/master/r/Getting_Started_R.ipynb)\n",
    "2. [Jupyter Documentation](http://jupyter.org/)\n",
    "3. [Conda and R](https://www.continuum.io/conda-for-r)\n",
    "\n",
    "#### Up Next: Plotting Flow Cytometry Data\n",
    "\n",
    "After finishing this workflow you will have a cleaned dataset ready for exploration and analysis. A notebook with code examples for plotting the data (including interactive plots created with the Shiny library) is available here: [[Flow Data Plotting Workflow]](http://nbviewer.jupyter.org/github/biodev/SIG/blob/master/SIG_WNV_Flow_Data_Plotting.ipynb) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1. Prepare the Input Files\n",
    "\n",
    "Sometimes, a bit of manual cleaning of the spreadsheets is necessary for the parsing function to work correctly.\n",
    "\n",
    "1. Remove special characters (e.g. ¥, •), and set to NA (blank). It seems this character is entered when a parent population has zero cells, and therefore percentages for sub-populations are meaningless.\n",
    "2. There should be no ID column in sheets other than the Treg panel (IDs will be created automatically)\n",
    "3. All column names should match those in the `final_flow.txt` file (unexpected columns will be printed to the screen to alert the user). If any of the following columns are missing, errors will occur when merging the different panels: 'UNC strain', 'UW strain', 'RIX_ID' (changed from 'Mouse #'), 'Timepoint', 'Tissue' (changed from 'Organ'), 'Total Cell Count'.\n",
    "4. Note that sometimes columns/rows look 'empty' in Excel, but they actually contain spaces (or even formatting) that R will recognize. If these 'empty' columns/rows are not skipped when R reads the file it can cause problems, so they may need to be manually deleted.\n",
    "5. It is also a good idea to \"Unfreeze Panes' and remove any splits within the spreadsheets. This can sometimes cause problems when reading the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2. Load the Necessary R Libraries and Functions\n",
    "\n",
    "There are a number of functions in the accompanying R script (`flow_data_cleaning_functions.r`) necessary for parsing and then processing the flow cytometry data:\n",
    "\n",
    "1. `read_flow_exp_file()`: Parses a flow spreadsheet and creates an R dataframe.\n",
    "2. `fix_column_names()`: Standardizes the column names of the above dataframe.\n",
    "3. Functions for calculating cell counts and ratios:\n",
    "    - `calc_treg_counts()`\n",
    "    - `calc_tcell_counts()`\n",
    "    - `calc_ics_counts()`\n",
    "    - `calc_ics_percent_ratios()`\n",
    "    - `calc_ics_count_ratios()`\n",
    "4. `clean_inf_nan()`: Sets any infinite or NaN values to NA.\n",
    "\n",
    "More information on each of these functions is available by calling the `describe()` function. For example, the following command will print documentation for the `read_flow_exp_file()` function:\n",
    "\n",
    "    describe(read_flow_exp_file)\n",
    "\n",
    "Remember that, in addition to the help documentation provided with `describe()`, you can view the actual function definitions at any time by simply typing the function name without parentheses (e.g. `describe`) at the command prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.\n",
      "\n",
      "gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.\n",
      "\n",
      "Attaching package: ‘gdata’\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    nobs\n",
      "\n",
      "The following object is masked from ‘package:utils’:\n",
      "\n",
      "    object.size\n",
      "\n"
     ]
    }
   ],
   "source": [
    "## Load functions for parsing the flow cytometry spreadsheets\n",
    "## The gdata library is necessary for reading Excel spreadsheets; it will be loaded as well.\n",
    "source('./scripts/flow_data_cleaning_functions.r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "This function parses flow cytometry data from an Excel workbook.\n",
      "\n",
      "Parameters:\n",
      "f: The Excel file name.\n",
      "cn_expected: A character vector containing the expected column names.\n",
      "\n",
      "Returns:\n",
      "A dataframe containing the processed flow cytometry data.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "## View help documentation on the functions listed above\n",
    "describe(read_flow_exp_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3. Read the Data into R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Load all expected flow variables (expected column names)\n",
    "flow_cn = read.xls(xls=\"./data/WNV_Data_Dictionary.xlsx\", sheet=\"Flow Data\", as.is=T)\n",
    "flow_cn = flow_cn[,1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Move to the directory holding the data\n",
    "flow_dir = \"/Users/mooneymi/Documents/MyDocuments/SystemsImmunogenetics/WNV/Lund_Flow_fixed_Apr_26\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Get a list of data files to read (in this case all flow spreadsheets begin with the prefix 'Expt')\n",
    "flow_files = list.files(flow_dir, pattern=\"Expt.*\\\\.xls\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Expt 35 cell counts 04_26_16_fixedMM.xlsx\"\n",
      "[2] \"Expt 45 cell counts 04_26_16_fixedMM.xlsx\"\n"
     ]
    }
   ],
   "source": [
    "print(flow_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Expt 35 cell counts 04_26_16_fixedMM.xlsx\"\n",
      "[1] \"Treg panel\"\n",
      "[1] \"CD8 d7 T cell\" \"CD8 d21\"      \n",
      "[1] \"ICS panel\"\n",
      "[1] \"Expt 45 cell counts 04_26_16_fixedMM.xlsx\"\n",
      "[1] \"Treg panel\"\n",
      "[1] \"CD8 d12\" \"CD8 d21\"\n",
      "[1] \"ICS panel\"\n"
     ]
    }
   ],
   "source": [
    "## Iterate through all the files, parse each, and merge all data into a single dataframe\n",
    "i = 1\n",
    "for (file in flow_files) {\n",
    "    print(file)\n",
    "    flow_dat = read_flow_exp_file(file.path(flow_dir, file), flow_cn)\n",
    "    \n",
    "    ## Check if there are any unexpected columns\n",
    "    new_columns = setdiff(colnames(flow_dat), flow_cn)\n",
    "    if (length(new_columns) > 0) {\n",
    "        flow_cn = c(flow_cn, new_columns)\n",
    "    }\n",
    "    if (i > 1) {\n",
    "        ## Fill extra columns with NAs\n",
    "        for (col in new_columns) {\n",
    "            flow_all[,col] = NA\n",
    "        }\n",
    "        ## Merge data\n",
    "        flow_all = rbind(flow_all[,flow_cn], flow_dat[,flow_cn])\n",
    "    } else {\n",
    "        flow_all = flow_dat\n",
    "    }\n",
    "    i = i + 1\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>60</li>\n",
       "\t<li>277</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 60\n",
       "\\item 277\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 60\n",
       "2. 277\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1]  60 277"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check the dimensions of the dataframe\n",
    "dim(flow_all)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/latex": [],
      "text/markdown": [],
      "text/plain": [
       "character(0)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check that all expected columns are present\n",
    "setdiff(flow_cn, colnames(flow_all))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>ID</th><th scope=col>Mating</th><th scope=col>UW_Line</th><th scope=col>RIX_ID</th><th scope=col>Timepoint</th><th scope=col>Virus</th><th scope=col>Tissue</th><th scope=col>Lab</th><th scope=col>Data_Altered</th><th scope=col>Notes</th><th scope=col>Total_Cell_Count</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>1</th><td>8024x8049_16</td><td>8024x8049</td><td>038</td><td>16</td><td>28</td><td>WNV</td><td>brain</td><td>NA</td><td>NA</td><td>NA</td><td>8000000</td></tr>\n",
       "\t<tr><th scope=row>2</th><td>8024x8049_16</td><td>8024x8049</td><td>038</td><td>16</td><td>28</td><td>WNV</td><td>spleen</td><td>NA</td><td>NA</td><td>NA</td><td>27840000</td></tr>\n",
       "\t<tr><th scope=row>3</th><td>8024x8049_17</td><td>8024x8049</td><td>038</td><td>17</td><td>28</td><td>WNV</td><td>brain</td><td>NA</td><td>NA</td><td>NA</td><td>7040000</td></tr>\n",
       "\t<tr><th scope=row>4</th><td>8024x8049_17</td><td>8024x8049</td><td>038</td><td>17</td><td>28</td><td>WNV</td><td>spleen</td><td>NA</td><td>NA</td><td>NA</td><td>31680000</td></tr>\n",
       "\t<tr><th scope=row>5</th><td>8024x8049_3</td><td>8024x8049</td><td>038</td><td>3</td><td>28</td><td>Mock</td><td>brain</td><td>NA</td><td>NA</td><td>NA</td><td>3200000</td></tr>\n",
       "\t<tr><th scope=row>6</th><td>8024x8049_3</td><td>8024x8049</td><td>038</td><td>3</td><td>28</td><td>Mock</td><td>spleen</td><td>NA</td><td>NA</td><td>NA</td><td>15360000</td></tr>\n",
       "\t<tr><th scope=row>7</th><td>8024x8049_7</td><td>8024x8049</td><td>038</td><td>7</td><td>28</td><td>Mock</td><td>brain</td><td>NA</td><td>NA</td><td>NA</td><td>3520000</td></tr>\n",
       "\t<tr><th scope=row>8</th><td>8024x8049_7</td><td>8024x8049</td><td>038</td><td>7</td><td>28</td><td>Mock</td><td>spleen</td><td>NA</td><td>NA</td><td>NA</td><td>19200000</td></tr>\n",
       "\t<tr><th scope=row>9</th><td>8024x8049_8</td><td>8024x8049</td><td>038</td><td>8</td><td>28</td><td>Mock</td><td>brain</td><td>NA</td><td>NA</td><td>NA</td><td>3680000</td></tr>\n",
       "\t<tr><th scope=row>10</th><td>8024x8049_8</td><td>8024x8049</td><td>038</td><td>8</td><td>28</td><td>Mock</td><td>spleen</td><td>NA</td><td>NA</td><td>NA</td><td>20160000</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|lllllllllll}\n",
       "  & ID & Mating & UW_Line & RIX_ID & Timepoint & Virus & Tissue & Lab & Data_Altered & Notes & Total_Cell_Count\\\\\n",
       "\\hline\n",
       "\t1 & 8024x8049_16 & 8024x8049 & 038 & 16 & 28 & WNV & brain & NA & NA & NA & 8000000\\\\\n",
       "\t2 & 8024x8049_16 & 8024x8049 & 038 & 16 & 28 & WNV & spleen & NA & NA & NA & 27840000\\\\\n",
       "\t3 & 8024x8049_17 & 8024x8049 & 038 & 17 & 28 & WNV & brain & NA & NA & NA & 7040000\\\\\n",
       "\t4 & 8024x8049_17 & 8024x8049 & 038 & 17 & 28 & WNV & spleen & NA & NA & NA & 31680000\\\\\n",
       "\t5 & 8024x8049_3 & 8024x8049 & 038 & 3 & 28 & Mock & brain & NA & NA & NA & 3200000\\\\\n",
       "\t6 & 8024x8049_3 & 8024x8049 & 038 & 3 & 28 & Mock & spleen & NA & NA & NA & 15360000\\\\\n",
       "\t7 & 8024x8049_7 & 8024x8049 & 038 & 7 & 28 & Mock & brain & NA & NA & NA & 3520000\\\\\n",
       "\t8 & 8024x8049_7 & 8024x8049 & 038 & 7 & 28 & Mock & spleen & NA & NA & NA & 19200000\\\\\n",
       "\t9 & 8024x8049_8 & 8024x8049 & 038 & 8 & 28 & Mock & brain & NA & NA & NA & 3680000\\\\\n",
       "\t10 & 8024x8049_8 & 8024x8049 & 038 & 8 & 28 & Mock & spleen & NA & NA & NA & 20160000\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "             ID    Mating UW_Line RIX_ID Timepoint Virus Tissue Lab\n",
       "1  8024x8049_16 8024x8049     038     16        28   WNV  brain  NA\n",
       "2  8024x8049_16 8024x8049     038     16        28   WNV spleen  NA\n",
       "3  8024x8049_17 8024x8049     038     17        28   WNV  brain  NA\n",
       "4  8024x8049_17 8024x8049     038     17        28   WNV spleen  NA\n",
       "5   8024x8049_3 8024x8049     038      3        28  Mock  brain  NA\n",
       "6   8024x8049_3 8024x8049     038      3        28  Mock spleen  NA\n",
       "7   8024x8049_7 8024x8049     038      7        28  Mock  brain  NA\n",
       "8   8024x8049_7 8024x8049     038      7        28  Mock spleen  NA\n",
       "9   8024x8049_8 8024x8049     038      8        28  Mock  brain  NA\n",
       "10  8024x8049_8 8024x8049     038      8        28  Mock spleen  NA\n",
       "   Data_Altered Notes Total_Cell_Count\n",
       "1            NA    NA          8000000\n",
       "2            NA    NA         27840000\n",
       "3            NA    NA          7040000\n",
       "4            NA    NA         31680000\n",
       "5            NA    NA          3200000\n",
       "6            NA    NA         15360000\n",
       "7            NA    NA          3520000\n",
       "8            NA    NA         19200000\n",
       "9            NA    NA          3680000\n",
       "10           NA    NA         20160000"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "flow_all[1:10,1:11]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4. Clean and Reformat the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Order columns, add Lab column and fix formatting\n",
    "flow_all = flow_all[, flow_cn]\n",
    "flow_all$Lab = \"Lund\"\n",
    "\n",
    "flow_all$ID = gsub(\" \", \"\", flow_all$ID)\n",
    "flow_all$ID = gsub(\"X\", \"x\", flow_all$ID)\n",
    "flow_all$Mating = gsub(\" \", \"\", flow_all$Mating)\n",
    "flow_all$Mating = gsub(\"X\", \"x\", flow_all$Mating) \n",
    "flow_all$UW_Line = as.numeric(flow_all$UW_Line)\n",
    "flow_all$Timepoint = as.numeric(flow_all$Timepoint)\n",
    "\n",
    "## For validation data mock animals do not have IDs, set to NA\n",
    "flow_all$ID[is.na(flow_all$RIX_ID)] = NA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>ID</th><th scope=col>Mating</th><th scope=col>UW_Line</th><th scope=col>RIX_ID</th><th scope=col>Timepoint</th><th scope=col>Virus</th><th scope=col>Tissue</th><th scope=col>Lab</th><th scope=col>Data_Altered</th><th scope=col>Notes</th><th scope=col>Total_Cell_Count</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>1</th><td>8024x8049_16</td><td>8024x8049</td><td>38</td><td>16</td><td>28</td><td>WNV</td><td>brain</td><td>Lund</td><td>NA</td><td>NA</td><td>8000000</td></tr>\n",
       "\t<tr><th scope=row>2</th><td>8024x8049_16</td><td>8024x8049</td><td>38</td><td>16</td><td>28</td><td>WNV</td><td>spleen</td><td>Lund</td><td>NA</td><td>NA</td><td>27840000</td></tr>\n",
       "\t<tr><th scope=row>3</th><td>8024x8049_17</td><td>8024x8049</td><td>38</td><td>17</td><td>28</td><td>WNV</td><td>brain</td><td>Lund</td><td>NA</td><td>NA</td><td>7040000</td></tr>\n",
       "\t<tr><th scope=row>4</th><td>8024x8049_17</td><td>8024x8049</td><td>38</td><td>17</td><td>28</td><td>WNV</td><td>spleen</td><td>Lund</td><td>NA</td><td>NA</td><td>31680000</td></tr>\n",
       "\t<tr><th scope=row>5</th><td>8024x8049_3</td><td>8024x8049</td><td>38</td><td>3</td><td>28</td><td>Mock</td><td>brain</td><td>Lund</td><td>NA</td><td>NA</td><td>3200000</td></tr>\n",
       "\t<tr><th scope=row>6</th><td>8024x8049_3</td><td>8024x8049</td><td>38</td><td>3</td><td>28</td><td>Mock</td><td>spleen</td><td>Lund</td><td>NA</td><td>NA</td><td>15360000</td></tr>\n",
       "\t<tr><th scope=row>7</th><td>8024x8049_7</td><td>8024x8049</td><td>38</td><td>7</td><td>28</td><td>Mock</td><td>brain</td><td>Lund</td><td>NA</td><td>NA</td><td>3520000</td></tr>\n",
       "\t<tr><th scope=row>8</th><td>8024x8049_7</td><td>8024x8049</td><td>38</td><td>7</td><td>28</td><td>Mock</td><td>spleen</td><td>Lund</td><td>NA</td><td>NA</td><td>19200000</td></tr>\n",
       "\t<tr><th scope=row>9</th><td>8024x8049_8</td><td>8024x8049</td><td>38</td><td>8</td><td>28</td><td>Mock</td><td>brain</td><td>Lund</td><td>NA</td><td>NA</td><td>3680000</td></tr>\n",
       "\t<tr><th scope=row>10</th><td>8024x8049_8</td><td>8024x8049</td><td>38</td><td>8</td><td>28</td><td>Mock</td><td>spleen</td><td>Lund</td><td>NA</td><td>NA</td><td>20160000</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|lllllllllll}\n",
       "  & ID & Mating & UW_Line & RIX_ID & Timepoint & Virus & Tissue & Lab & Data_Altered & Notes & Total_Cell_Count\\\\\n",
       "\\hline\n",
       "\t1 & 8024x8049_16 & 8024x8049 & 38 & 16 & 28 & WNV & brain & Lund & NA & NA & 8000000\\\\\n",
       "\t2 & 8024x8049_16 & 8024x8049 & 38 & 16 & 28 & WNV & spleen & Lund & NA & NA & 27840000\\\\\n",
       "\t3 & 8024x8049_17 & 8024x8049 & 38 & 17 & 28 & WNV & brain & Lund & NA & NA & 7040000\\\\\n",
       "\t4 & 8024x8049_17 & 8024x8049 & 38 & 17 & 28 & WNV & spleen & Lund & NA & NA & 31680000\\\\\n",
       "\t5 & 8024x8049_3 & 8024x8049 & 38 & 3 & 28 & Mock & brain & Lund & NA & NA & 3200000\\\\\n",
       "\t6 & 8024x8049_3 & 8024x8049 & 38 & 3 & 28 & Mock & spleen & Lund & NA & NA & 15360000\\\\\n",
       "\t7 & 8024x8049_7 & 8024x8049 & 38 & 7 & 28 & Mock & brain & Lund & NA & NA & 3520000\\\\\n",
       "\t8 & 8024x8049_7 & 8024x8049 & 38 & 7 & 28 & Mock & spleen & Lund & NA & NA & 19200000\\\\\n",
       "\t9 & 8024x8049_8 & 8024x8049 & 38 & 8 & 28 & Mock & brain & Lund & NA & NA & 3680000\\\\\n",
       "\t10 & 8024x8049_8 & 8024x8049 & 38 & 8 & 28 & Mock & spleen & Lund & NA & NA & 20160000\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "             ID    Mating UW_Line RIX_ID Timepoint Virus Tissue  Lab\n",
       "1  8024x8049_16 8024x8049      38     16        28   WNV  brain Lund\n",
       "2  8024x8049_16 8024x8049      38     16        28   WNV spleen Lund\n",
       "3  8024x8049_17 8024x8049      38     17        28   WNV  brain Lund\n",
       "4  8024x8049_17 8024x8049      38     17        28   WNV spleen Lund\n",
       "5   8024x8049_3 8024x8049      38      3        28  Mock  brain Lund\n",
       "6   8024x8049_3 8024x8049      38      3        28  Mock spleen Lund\n",
       "7   8024x8049_7 8024x8049      38      7        28  Mock  brain Lund\n",
       "8   8024x8049_7 8024x8049      38      7        28  Mock spleen Lund\n",
       "9   8024x8049_8 8024x8049      38      8        28  Mock  brain Lund\n",
       "10  8024x8049_8 8024x8049      38      8        28  Mock spleen Lund\n",
       "   Data_Altered Notes Total_Cell_Count\n",
       "1            NA    NA          8000000\n",
       "2            NA    NA         27840000\n",
       "3            NA    NA          7040000\n",
       "4            NA    NA         31680000\n",
       "5            NA    NA          3200000\n",
       "6            NA    NA         15360000\n",
       "7            NA    NA          3520000\n",
       "8            NA    NA         19200000\n",
       "9            NA    NA          3680000\n",
       "10           NA    NA         20160000"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "flow_all[1:10,1:11]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "0"
      ],
      "text/latex": [
       "0"
      ],
      "text/markdown": [
       "0"
      ],
      "text/plain": [
       "[1] 0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check for duplicate IDs\n",
    "new_flow_ids = paste(flow_all$ID, flow_all$Tissue, sep='_')\n",
    "sum(duplicated(new_flow_ids))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5. Combine with Previous Data (Optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Read the previously cleaned data\n",
    "## Note: you will have to change the file path\n",
    "cleaned_dir = '/Users/mooneymi/Documents/MyDocuments/SystemsImmunogenetics/WNV/Cleaned_Data_Releases/23-Mar-2016'\n",
    "flow_prev = read.xls(file.path(cleaned_dir, 'Lund_Flow_21-Mar-2016_final.xlsx'), \n",
    "                     header=T, as.is=T, na.strings=c(\"\",\" \", \"NA\", \"#DIV/0!\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>1989</li>\n",
       "\t<li>277</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 1989\n",
       "\\item 277\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 1989\n",
       "2. 277\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 1989  277"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim(flow_prev)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] 12\n"
     ]
    }
   ],
   "source": [
    "## Check for duplicate IDs\n",
    "dup_ids1 = intersect(flow_prev$ID[flow_prev$Tissue=='brain'], flow_all$ID[flow_all$Tissue=='brain'])\n",
    "dup_ids2 = intersect(flow_prev$ID[flow_prev$Tissue=='spleen'], flow_all$ID[flow_all$Tissue=='spleen'])\n",
    "\n",
    "## Overwrite old data with new\n",
    "idx1 = which(flow_prev$ID %in% dup_ids1 & flow_prev$Tissue=='brain')\n",
    "idx2 = which(flow_prev$ID %in% dup_ids2 & flow_prev$Tissue=='spleen')\n",
    "idx_dups = c(idx1, idx2)\n",
    "print(length(idx_dups))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "idx_to_keep = setdiff(1:nrow(flow_prev), idx_dups)\n",
    "flow_prev = flow_prev[idx_to_keep,]\n",
    "flow_all = rbind(flow_prev[, flow_cn], flow_all[, flow_cn])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>2037</li>\n",
       "\t<li>277</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 2037\n",
       "\\item 277\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 2037\n",
       "2. 277\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 2037  277"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check the dimensions of the dataframe\n",
    "dim(flow_all)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6. Calculate Cell Counts and Ratios"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Change all data columns to numeric\n",
    "for (i in 11:277) {\n",
    "    flow_all[,i] = as.numeric(flow_all[,i])\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Calculate cell counts and ratios\n",
    "flow_full = flow_all\n",
    "flow_full = calc_treg_counts(flow_full)\n",
    "flow_full = calc_tcell_counts(flow_full)\n",
    "flow_full = calc_ics_counts(flow_full)\n",
    "flow_full = calc_ics_percent_ratios(flow_full)\n",
    "flow_full = calc_ics_count_ratios(flow_full)\n",
    "flow_full = clean_inf_nan(flow_full)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>2037</li>\n",
       "\t<li>763</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 2037\n",
       "\\item 763\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 2037\n",
       "2. 763\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 2037  763"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim(flow_full)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7. Save the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Save the data file\n",
    "write.table(flow_all, file=file.path(flow_dir, 'Lund_Flow_12-May-2016_final.txt'), \n",
    "            col.names=T, row.names=F, quote=T, sep='\\t', na='')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Save the full data file\n",
    "write.table(flow_full, file=file.path(flow_dir, 'Lund_Flow_Full_12-May-2016_final.txt'), \n",
    "            col.names=T, row.names=F, quote=T, sep='\\t', na='')\n",
    "save(flow_full, file=file.path(flow_dir, 'lund_flow_full_12-May-2016_final.rda'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Up Next: Plotting Flow Cytometry Data\n",
    "\n",
    "Code for plotting this data is available here: [[Flow Data Plotting Workflow]](http://nbviewer.jupyter.org/github/biodev/SIG/blob/master/SIG_WNV_Flow_Data_Plotting.ipynb)\n",
    "\n",
    "#### Last Updated: 24-May-2016"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}