{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Systems Immunogenetics Project\n",
    "\n",
    "## WNV Histology Data Cleaning Workflow\n",
    "\n",
    "### McWeeney Lab, Oregon Health & Science University\n",
    "\n",
    "#### Author: Michael Mooney (mooneymi@ohsu.edu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "This document will walk through the steps for parsing and cleaning flow cytometry spreadsheets. The result will be a single R dataframe containing all the data, with standardized column names and formatting.\n",
    "\n",
    "Required Files:\n",
    "\n",
    "- This notebook** (`SIG_WNV_Histology_Data_Cleaning.ipynb`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/SIG_WNV_Flow_Data_Cleaning.ipynb)\n",
    "- The data dictionary containing all histology variables (`WNV_Data_Dictionary.xlsx`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/data/WNV_Data_Dictonary.xlsx)\n",
    "\n",
    "** Note: this notebook can also be downloaded as an R script (only the code blocks seen below will be included): [[Download R script here]](https://raw.githubusercontent.com/biodev/SIG/master/SIG_WNV_Flow_Data_Cleaning.r)\n",
    "\n",
    "Required R Packages:\n",
    "\n",
    "- `gdata`: [https://cran.r-project.org/web/packages/gdata/index.html](https://cran.r-project.org/web/packages/gdata/index.html)\n",
    "\n",
    "**All code is available on GitHub: [https://github.com/biodev/SIG](https://github.com/biodev/SIG)** \n",
    "\n",
    "If you are not familiar with Jupyter Notebooks, I've created a short tutorial to get you up and running quickly. There is also plenty of documentation online:\n",
    "\n",
    "1. [Jupyter for R Tutorial](http://nbviewer.jupyter.org/github/mooneymi/jupyter_notebooks/blob/master/r/Getting_Started_R.ipynb)\n",
    "2. [Jupyter Documentation](http://jupyter.org/)\n",
    "3. [Conda and R](https://www.continuum.io/conda-for-r)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1. Prepare the Input Files\n",
    "\n",
    "Sometimes, a bit of manual cleaning of the spreadsheets is necessary for the parsing function to work correctly.\n",
    "\n",
    "1. Move histology data to the first tab in the Excel workbook.\n",
    "2. Remove any rows with text above the column names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2. Load the Necessary R Libraries and Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.\n",
      "\n",
      "gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.\n",
      "\n",
      "Attaching package: ‘gdata’\n",
      "\n",
      "The following object is masked from ‘package:stats’:\n",
      "\n",
      "    nobs\n",
      "\n",
      "The following object is masked from ‘package:utils’:\n",
      "\n",
      "    object.size\n",
      "\n"
     ]
    }
   ],
   "source": [
    "library(gdata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3. Read the Data into R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Set data directories\n",
    "cleaned_dir = '~/Documents/SIG/WNV/Cleaned_Data_Releases/15-Jan-2016'\n",
    "hist_dir = '~/Documents/SIG/WNV/Histology/fixed_files'\n",
    "\n",
    "## Load column names\n",
    "hist_cn = read.xls(xls=\"./data/WNV_Data_Dictionary.xlsx\", sheet=\"Histology Data\", as.is=T)\n",
    "hist_cn = hist_cn[,1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Load cleaned weight data\n",
    "gale_weight = read.xls(file.path(cleaned_dir, 'Gale_Weight_13-Jan-2016_final.xlsx'), header=T, \n",
    "                       as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))\n",
    "\n",
    "## Load previously cleaned histology data\n",
    "hist_prev = read.xls(file.path(cleaned_dir, 'Gale_Histology_5-Jan-2016_final.xlsx'), header=T, \n",
    "                     as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Copy of 15-H943_fixedMM.xlsx\"                                                                \n",
      "[2] \"Copy of Copy of Comparative Pathology Program Consult Request  15-H1246 imaging_fixedMM.xlsx\"\n",
      "[3] \"Copy of Copy of Comparative Pathology Program Consult Request  15-H1393_fixedMM.xlsx\"        \n",
      "[4] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1508_fixedMM.xlsx\"         \n",
      "[5] \"Copy of Copy of Gale CC mice CPP request for prefilled 16-H94_fixedMM.xlsx\"                  \n",
      "[6] \"Copy of histo 15-H792 results_fixedMM.xlsx\"                                                  \n",
      "[7] \"New scoring 08 19 14_fixedMM.xlsx\"                                                           \n",
      "[8] \"Results 15-H501+15-H502 06 10 15_fixedMM.xlsx\"                                               \n",
      "[9] \"Results 15-H585 06 17 15_fixedMM.xlsx\"                                                       \n"
     ]
    }
   ],
   "source": [
    "## Get files to be processed\n",
    "hist_files = list.files(hist_dir, pattern=\".*\\\\.xlsx\")\n",
    "print(hist_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Copy of 15-H943_fixedMM.xlsx\"\n",
      "[1] \"Copy of Copy of Comparative Pathology Program Consult Request  15-H1246 imaging_fixedMM.xlsx\"\n",
      "[1] \"Copy of Copy of Comparative Pathology Program Consult Request  15-H1393_fixedMM.xlsx\"\n",
      "[1] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1508_fixedMM.xlsx\"\n",
      "[1] \"Copy of Copy of Gale CC mice CPP request for prefilled 16-H94_fixedMM.xlsx\"\n",
      "[1] \"Copy of histo 15-H792 results_fixedMM.xlsx\"\n",
      "[1] \"New scoring 08 19 14_fixedMM.xlsx\"\n",
      "[1] \"Results 15-H501+15-H502 06 10 15_fixedMM.xlsx\"\n",
      "[1] \"Results 15-H585 06 17 15_fixedMM.xlsx\"\n"
     ]
    }
   ],
   "source": [
    "## Read all files and combine into a single dataframe\n",
    "print(hist_files[1])\n",
    "hist_data_v2 = read.xls(file.path(hist_dir, hist_files[1]), header=T, \n",
    "                        as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))\n",
    "colnames(hist_data_v2) = hist_cn[c(9, c(13:35))]\n",
    "\n",
    "for (file in hist_files[2:length(hist_files)]) {\n",
    "  print(file)\n",
    "  hist_data = read.xls(file.path(hist_dir, file), header=T, \n",
    "                       as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))\n",
    "  colnames(hist_data) = hist_cn[c(9, c(13:35))]\n",
    "  hist_data_v2 = rbind(hist_data_v2, hist_data)\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>616</li>\n",
       "\t<li>24</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 616\n",
       "\\item 24\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 616\n",
       "2. 24\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 616  24"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check dimensions of dataframe\n",
    "dim(hist_data_v2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4. Clean and Reformat the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"ID\"\n",
      "[1] \"Mating\"\n",
      "[1] \"RIX_ID\"\n",
      "[1] \"UW_Line\"\n",
      "[1] \"UWID\"\n",
      "[1] \"Virus\"\n",
      "[1] \"Tissue\"\n",
      "[1] \"Timepoint\"\n",
      "[1] \"Lab\"\n",
      "[1] \"Data_Altered\"\n",
      "[1] \"Notes\"\n",
      "[1] \"GI_Lesions\"\n"
     ]
    }
   ],
   "source": [
    "## Create missing columns\n",
    "for (cn in setdiff(hist_cn, colnames(hist_data_v2))) {\n",
    "  print(cn)\n",
    "  hist_data_v2[,cn] = NA\n",
    "}\n",
    "\n",
    "## Order columns\n",
    "hist_data_v2 = hist_data_v2[, hist_cn]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning message:\n",
      "In eval(expr, envir, enclos): NAs introduced by coercion"
     ]
    }
   ],
   "source": [
    "## Remove any leading or trailing spaces from slide_label\n",
    "hist_data_v2$slide_label = trim(hist_data_v2$slide_label)\n",
    "\n",
    "## Update UW Line\n",
    "hist_data_v2$UW_Line = sapply(hist_data_v2$slide_label, function(x){unlist(strsplit(x, \" \"))[1]})\n",
    "hist_data_v2$UW_Line = as.numeric(hist_data_v2$UW_Line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Update UWID\n",
    "hist_data_v2$UWID = sapply(hist_data_v2$slide_label, function(x){unlist(strsplit(x, \" \"))[2]})\n",
    "hist_data_v2$UWID = gsub(\"m\", \"M\", hist_data_v2$UWID)\n",
    "\n",
    "## Update Lab\n",
    "hist_data_v2$Lab = 'Gale'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Update time points\n",
    "hist_data_v2$Timepoint = sapply(hist_data_v2$UWID, function(x){unlist(strsplit(x, \"\\\\.\"))[2]})\n",
    "\n",
    "## Use time points to update virus\n",
    "hist_data_v2$Virus = 'WNV'\n",
    "hist_data_v2$Virus[grepl(\"M\", hist_data_v2$Timepoint)] = 'Mock'\n",
    "hist_data_v2$Virus[grepl(\"m\", hist_data_v2$Timepoint)] = 'Mock'\n",
    "\n",
    "## Update time points\n",
    "hist_data_v2$Timepoint = gsub(\"M\", \"\", hist_data_v2$Timepoint)\n",
    "hist_data_v2$Timepoint = gsub(\"m\", \"\", hist_data_v2$Timepoint)\n",
    "hist_data_v2$Timepoint = as.numeric(hist_data_v2$Timepoint)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"line=3260x1566; rix_id=2\"  \"line=3260x1566; rix_id=15\"\n"
     ]
    }
   ],
   "source": [
    "## Update Mating and RIX_ID\n",
    "for (i in 1:dim(hist_data_v2)[1]) {\n",
    "  line = hist_data_v2$UW_Line[i]\n",
    "  if (!is.na(line)) {\n",
    "    mating = gale_weight$Mating[with(gale_weight, UW_Line==hist_data_v2$UW_Line[i])]\n",
    "    if (length(mating)>0) {\n",
    "      hist_data_v2$Mating[i] = mating[1]\n",
    "      rix_id = gale_weight$RIX_ID[with(gale_weight, UW_Line==hist_data_v2$UW_Line[i] & UWID==hist_data_v2$UWID[i])]\n",
    "      if (length(rix_id)>0) {\n",
    "        if (length(rix_id)>1) {\n",
    "          ## If more than one ID matches the UWID, print the IDs\n",
    "          print(paste0(\"line=\", mating[1], \"; rix_id=\", rix_id))\n",
    "        }\n",
    "        hist_data_v2$RIX_ID[i] = rix_id[1]\n",
    "      }\n",
    "    }\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>ID</th><th scope=col>Mating</th><th scope=col>RIX_ID</th><th scope=col>UW_Line</th><th scope=col>UWID</th><th scope=col>Virus</th><th scope=col>Tissue</th><th scope=col>Timepoint</th><th scope=col>slide_label</th><th scope=col>Lab</th><th scope=col>Data_Altered</th><th scope=col>Notes</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>273</th><td>NA</td><td>3260x1566</td><td>2</td><td>100</td><td>2.12</td><td>WNV</td><td>NA</td><td>12</td><td>100 2.12</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>274</th><td>NA</td><td>3260x1566</td><td>8</td><td>100</td><td>1.12M</td><td>Mock</td><td>NA</td><td>12</td><td>100 1.12M</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>275</th><td>NA</td><td>3260x1566</td><td>9</td><td>100</td><td>2.12M</td><td>Mock</td><td>NA</td><td>12</td><td>100 2.12M</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>276</th><td>NA</td><td>3260x1566</td><td>16</td><td>100</td><td>3.12M</td><td>Mock</td><td>NA</td><td>12</td><td>100 3.12M</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>331</th><td>NA</td><td>3260x1566</td><td>NA</td><td>100</td><td>1.4</td><td>WNV</td><td>NA</td><td>4</td><td>100 1.4</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>332</th><td>NA</td><td>3260x1566</td><td>NA</td><td>100</td><td>2.4</td><td>WNV</td><td>NA</td><td>4</td><td>100 2.4</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>333</th><td>NA</td><td>3260x1566</td><td>NA</td><td>100</td><td>3.4</td><td>WNV</td><td>NA</td><td>4</td><td>100 3.4</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>364</th><td>NA</td><td>3260x1566</td><td>NA</td><td>100</td><td>1.7</td><td>WNV</td><td>NA</td><td>7</td><td>100 1.7</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>365</th><td>NA</td><td>3260x1566</td><td>NA</td><td>100</td><td>2.7</td><td>WNV</td><td>NA</td><td>7</td><td>100 2.7</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>366</th><td>NA</td><td>3260x1566</td><td>NA</td><td>100</td><td>3.7</td><td>WNV</td><td>NA</td><td>7</td><td>100 3.7</td><td>Gale</td><td>NA</td><td>NA</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|llllllllllll}\n",
       "  & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes\\\\\n",
       "\\hline\n",
       "\t273 & NA & 3260x1566 & 2 & 100 & 2.12 & WNV & NA & 12 & 100 2.12 & Gale & NA & NA\\\\\n",
       "\t274 & NA & 3260x1566 & 8 & 100 & 1.12M & Mock & NA & 12 & 100 1.12M & Gale & NA & NA\\\\\n",
       "\t275 & NA & 3260x1566 & 9 & 100 & 2.12M & Mock & NA & 12 & 100 2.12M & Gale & NA & NA\\\\\n",
       "\t276 & NA & 3260x1566 & 16 & 100 & 3.12M & Mock & NA & 12 & 100 3.12M & Gale & NA & NA\\\\\n",
       "\t331 & NA & 3260x1566 & NA & 100 & 1.4 & WNV & NA & 4 & 100 1.4 & Gale & NA & NA\\\\\n",
       "\t332 & NA & 3260x1566 & NA & 100 & 2.4 & WNV & NA & 4 & 100 2.4 & Gale & NA & NA\\\\\n",
       "\t333 & NA & 3260x1566 & NA & 100 & 3.4 & WNV & NA & 4 & 100 3.4 & Gale & NA & NA\\\\\n",
       "\t364 & NA & 3260x1566 & NA & 100 & 1.7 & WNV & NA & 7 & 100 1.7 & Gale & NA & NA\\\\\n",
       "\t365 & NA & 3260x1566 & NA & 100 & 2.7 & WNV & NA & 7 & 100 2.7 & Gale & NA & NA\\\\\n",
       "\t366 & NA & 3260x1566 & NA & 100 & 3.7 & WNV & NA & 7 & 100 3.7 & Gale & NA & NA\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "    ID    Mating RIX_ID UW_Line  UWID Virus Tissue Timepoint slide_label  Lab\n",
       "273 NA 3260x1566      2     100  2.12   WNV     NA        12    100 2.12 Gale\n",
       "274 NA 3260x1566      8     100 1.12M  Mock     NA        12   100 1.12M Gale\n",
       "275 NA 3260x1566      9     100 2.12M  Mock     NA        12   100 2.12M Gale\n",
       "276 NA 3260x1566     16     100 3.12M  Mock     NA        12   100 3.12M Gale\n",
       "331 NA 3260x1566     NA     100   1.4   WNV     NA         4     100 1.4 Gale\n",
       "332 NA 3260x1566     NA     100   2.4   WNV     NA         4     100 2.4 Gale\n",
       "333 NA 3260x1566     NA     100   3.4   WNV     NA         4     100 3.4 Gale\n",
       "364 NA 3260x1566     NA     100   1.7   WNV     NA         7     100 1.7 Gale\n",
       "365 NA 3260x1566     NA     100   2.7   WNV     NA         7     100 2.7 Gale\n",
       "366 NA 3260x1566     NA     100   3.7   WNV     NA         7     100 3.7 Gale\n",
       "    Data_Altered Notes\n",
       "273           NA    NA\n",
       "274           NA    NA\n",
       "275           NA    NA\n",
       "276           NA    NA\n",
       "331           NA    NA\n",
       "332           NA    NA\n",
       "333           NA    NA\n",
       "364           NA    NA\n",
       "365           NA    NA\n",
       "366           NA    NA"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check Matings and IDs of the discrepancies\n",
    "## Make manual corrections if necessary\n",
    "hist_data_v2[hist_data_v2$Mating=='3260x1566' & !is.na(hist_data_v2$Mating), 1:12]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Update ID\n",
    "hist_data_v2$ID = paste(hist_data_v2$Mating, hist_data_v2$RIX_ID, sep=\"_\")\n",
    "hist_data_v2$ID[grepl(\"NA\", hist_data_v2$ID)] = NA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## Update tissue\n",
    "hist_data_v2$Tissue = 'Brain'\n",
    "for (i in 1:length(hist_data_v2$slide_label)) {\n",
    "  label = hist_data_v2$slide_label[i]\n",
    "  if (grepl(\"Br\", label)) {hist_data_v2$Tissue[i] = 'Brain'}\n",
    "  if (grepl(\"LI\", label)) {hist_data_v2$Tissue[i] = 'Large Intestine'}\n",
    "  if (grepl(\"SI\", label)) {hist_data_v2$Tissue[i] = 'Small Intestine'}\n",
    "  if (grepl(\"Stom\", label)) {hist_data_v2$Tissue[i] = 'Stomach'}\n",
    "  if (grepl(\"[Cc]ec\", label)) {hist_data_v2$Tissue[i] = 'Cecum'}\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Update GI_Lesions\n",
    "hist_data_v2$GI_Lesions = NA\n",
    "for (i in 1:length(hist_data_v2$Tissue)) {\n",
    "  tissue = hist_data_v2$Tissue[i]\n",
    "  if (tissue != 'Brain') {\n",
    "    hist_data_v2$GI_Lesions[i] = hist_data_v2$Cortex_PV_Inf[i]\n",
    "    hist_data_v2$Cortex_PV_Inf[i] = NA\n",
    "  }\n",
    "}\n",
    "\n",
    "hist_data_v2$Cortex_PV_Inf = as.numeric(hist_data_v2$Cortex_PV_Inf)\n",
    "hist_data_v2$GI_Lesions[hist_data_v2$GI_Lesions == 'NSL'] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       " 0  1 \n",
       "98 22 "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table(hist_data_v2$GI_Lesions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>ID</th><th scope=col>Mating</th><th scope=col>RIX_ID</th><th scope=col>UW_Line</th><th scope=col>UWID</th><th scope=col>Virus</th><th scope=col>Tissue</th><th scope=col>Timepoint</th><th scope=col>slide_label</th><th scope=col>Lab</th><th scope=col>Data_Altered</th><th scope=col>Notes</th><th scope=col>Cortex_PV_Inf</th><th scope=col>Cortex_P_Inf</th><th scope=col>Cortex_hm</th><th scope=col>Cortex_nn</th><th scope=col>Cortex_Subtotal</th><th scope=col>Hippocampus_PV_Inf</th><th scope=col>Hippocampus_P_Inf</th><th scope=col>Hippocampus_hm</th><th scope=col>Hippocampus_nn</th><th scope=col>Hippocampus_Subtotal</th><th scope=col>Thalamus_Midbrain_PV_Inf</th><th scope=col>Thalamus_Midbrain_P_Inf</th><th scope=col>Thalamus_Midbrain_hm</th><th scope=col>Thalamus_Midbrain_nn</th><th scope=col>Thalamus_Midbrain_Subtotal</th><th scope=col>Cerebellum_PV_Inf</th><th scope=col>Cerebellum_P_Inf</th><th scope=col>Cerebellum_hm</th><th scope=col>Cerebellum_nn</th><th scope=col>Cerebellum_Subtotal</th><th scope=col>Meningitis</th><th scope=col>Total</th><th scope=col>Comment</th><th scope=col>GI_Lesions</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>397</th><td>NA</td><td>NA</td><td>NA</td><td>NA</td><td>Mock</td><td>WNV</td><td>Brain</td><td>NA</td><td>STI Mock K</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>KIDNEY- no signifincat lesions; mild artifacts MALE</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>398</th><td>NA</td><td>NA</td><td>NA</td><td>NA</td><td>3.6</td><td>WNV</td><td>Brain</td><td>6</td><td>STI 3.6 K</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|llllllllllllllllllllllllllllllllllll}\n",
       "  & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes & Cortex_PV_Inf & Cortex_P_Inf & Cortex_hm & Cortex_nn & Cortex_Subtotal & Hippocampus_PV_Inf & Hippocampus_P_Inf & Hippocampus_hm & Hippocampus_nn & Hippocampus_Subtotal & Thalamus_Midbrain_PV_Inf & Thalamus_Midbrain_P_Inf & Thalamus_Midbrain_hm & Thalamus_Midbrain_nn & Thalamus_Midbrain_Subtotal & Cerebellum_PV_Inf & Cerebellum_P_Inf & Cerebellum_hm & Cerebellum_nn & Cerebellum_Subtotal & Meningitis & Total & Comment & GI_Lesions\\\\\n",
       "\\hline\n",
       "\t397 & NA & NA & NA & NA & Mock & WNV & Brain & NA & STI Mock K & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & KIDNEY- no signifincat lesions; mild artifacts MALE & NA\\\\\n",
       "\t398 & NA & NA & NA & NA & 3.6 & WNV & Brain & 6 & STI 3.6 K & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "      ID Mating RIX_ID UW_Line UWID Virus Tissue Timepoint slide_label  Lab\n",
       "397 <NA>   <NA>     NA      NA Mock   WNV  Brain        NA  STI Mock K Gale\n",
       "398 <NA>   <NA>     NA      NA  3.6   WNV  Brain         6   STI 3.6 K Gale\n",
       "    Data_Altered Notes Cortex_PV_Inf Cortex_P_Inf Cortex_hm Cortex_nn\n",
       "397           NA    NA             0            0         0         0\n",
       "398           NA    NA             0            0         0         0\n",
       "    Cortex_Subtotal Hippocampus_PV_Inf Hippocampus_P_Inf Hippocampus_hm\n",
       "397               0                  0                 0              0\n",
       "398               0                  0                 0              0\n",
       "    Hippocampus_nn Hippocampus_Subtotal Thalamus_Midbrain_PV_Inf\n",
       "397              0                    0                        0\n",
       "398              0                    0                        0\n",
       "    Thalamus_Midbrain_P_Inf Thalamus_Midbrain_hm Thalamus_Midbrain_nn\n",
       "397                       0                    0                    0\n",
       "398                       0                    0                    0\n",
       "    Thalamus_Midbrain_Subtotal Cerebellum_PV_Inf Cerebellum_P_Inf Cerebellum_hm\n",
       "397                          0                 0                0             0\n",
       "398                          0                 0                0             0\n",
       "    Cerebellum_nn Cerebellum_Subtotal Meningitis Total\n",
       "397             0                   0          0     0\n",
       "398             0                   0          0     0\n",
       "                                                Comment GI_Lesions\n",
       "397 KIDNEY- no signifincat lesions; mild artifacts MALE       <NA>\n",
       "398                                                <NA>       <NA>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Fix STI sample\n",
    "hist_data_v2[grepl(\"STI\", hist_data_v2$slide_label),]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "## Fix STI sample\n",
    "hist_data_v2$UWID[397] = NA\n",
    "hist_data_v2$Virus[397] = 'Mock'\n",
    "hist_data_v2$Tissue[397] = 'Kidney'\n",
    "hist_data_v2$Tissue[398] = 'Kidney'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "0"
      ],
      "text/latex": [
       "0"
      ],
      "text/markdown": [
       "0"
      ],
      "text/plain": [
       "[1] 0"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check for duplicates\n",
    "sum(duplicated(hist_data_v2$ID[!is.na(hist_data_v2$ID)]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/latex": [],
      "text/markdown": [],
      "text/plain": [
       "character(0)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Duplicated IDs\n",
    "hist_data_v2$ID[duplicated(hist_data_v2$ID) & !is.na(hist_data_v2$ID)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5. Fix Missing Values & Calculate Totals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cortex_cols = c('Cortex_PV_Inf','Cortex_P_Inf','Cortex_hm','Cortex_nn')\n",
    "hippocampus_cols = c('Hippocampus_PV_Inf','Hippocampus_P_Inf','Hippocampus_hm','Hippocampus_nn')\n",
    "thalamus_cols = c('Thalamus_Midbrain_PV_Inf','Thalamus_Midbrain_P_Inf','Thalamus_Midbrain_hm','Thalamus_Midbrain_nn')\n",
    "cerebellum_cols = c('Cerebellum_PV_Inf','Cerebellum_P_Inf','Cerebellum_hm','Cerebellum_nn')\n",
    "subtotal_cols = c('Cortex_Subtotal','Hippocampus_Subtotal','Thalamus_Midbrain_Subtotal','Cerebellum_Subtotal','Meningitis')\n",
    "\n",
    "## Fill in zeros if not NA\n",
    "for (i in 1:dim(hist_data_v2)[1]) {\n",
    "  if (!is.na(hist_data_v2[i, 'Cortex_Subtotal']) & hist_data_v2[i, 'Cortex_Subtotal'] == 0) {\n",
    "    hist_data_v2[i,cortex_cols] = c(0,0,0,0)\n",
    "  }\n",
    "  if (!is.na(hist_data_v2[i, 'Hippocampus_Subtotal']) & hist_data_v2[i, 'Hippocampus_Subtotal'] == 0) {\n",
    "    hist_data_v2[i,hippocampus_cols] = c(0,0,0,0)\n",
    "  }\n",
    "  if (!is.na(hist_data_v2[i, 'Thalamus_Midbrain_Subtotal']) & hist_data_v2[i, 'Thalamus_Midbrain_Subtotal'] == 0) {\n",
    "    hist_data_v2[i,thalamus_cols] = c(0,0,0,0)\n",
    "  }\n",
    "  if (!is.na(hist_data_v2[i, 'Cerebellum_Subtotal']) & hist_data_v2[i, 'Cerebellum_Subtotal'] == 0) {\n",
    "    hist_data_v2[i,cerebellum_cols] = c(0,0,0,0)\n",
    "  }\n",
    "  if (!is.na(hist_data_v2[i, 'Total']) & hist_data_v2[i, 'Total'] == 0) {\n",
    "    hist_data_v2[i,subtotal_cols] = c(0,0,0,0,0)\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "hist_data_v2$Cortex_Subtotal = apply(hist_data_v2[,cortex_cols], 1, function(x){sum(x)})\n",
    "hist_data_v2$Hippocampus_Subtotal = apply(hist_data_v2[,hippocampus_cols], 1, function(x){sum(x)})\n",
    "hist_data_v2$Thalamus_Midbrain_Subtotal = apply(hist_data_v2[,thalamus_cols], 1, function(x){sum(x)})\n",
    "hist_data_v2$Cerebellum_Subtotal = apply(hist_data_v2[,cerebellum_cols], 1, function(x){sum(x)})\n",
    "hist_data_v2$Total = apply(hist_data_v2[,subtotal_cols], 1, function(x){sum(x)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>ID</th><th scope=col>Mating</th><th scope=col>RIX_ID</th><th scope=col>UW_Line</th><th scope=col>UWID</th><th scope=col>Virus</th><th scope=col>Tissue</th><th scope=col>Timepoint</th><th scope=col>slide_label</th><th scope=col>Lab</th><th scope=col>Data_Altered</th><th scope=col>Notes</th><th scope=col>Cortex_PV_Inf</th><th scope=col>Cortex_P_Inf</th><th scope=col>Cortex_hm</th><th scope=col>Cortex_nn</th><th scope=col>Cortex_Subtotal</th><th scope=col>Hippocampus_PV_Inf</th><th scope=col>Hippocampus_P_Inf</th><th scope=col>Hippocampus_hm</th><th scope=col>Hippocampus_nn</th><th scope=col>Hippocampus_Subtotal</th><th scope=col>Thalamus_Midbrain_PV_Inf</th><th scope=col>Thalamus_Midbrain_P_Inf</th><th scope=col>Thalamus_Midbrain_hm</th><th scope=col>Thalamus_Midbrain_nn</th><th scope=col>Thalamus_Midbrain_Subtotal</th><th scope=col>Cerebellum_PV_Inf</th><th scope=col>Cerebellum_P_Inf</th><th scope=col>Cerebellum_hm</th><th scope=col>Cerebellum_nn</th><th scope=col>Cerebellum_Subtotal</th><th scope=col>Meningitis</th><th scope=col>Total</th><th scope=col>Comment</th><th scope=col>GI_Lesions</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>1</th><td>3154x16012_37</td><td>3154x16012</td><td>37</td><td>83</td><td>1.7</td><td>WNV</td><td>Brain</td><td>7</td><td>083 1.7</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>2</th><td>3154x16012_38</td><td>3154x16012</td><td>38</td><td>83</td><td>2.7</td><td>WNV</td><td>Brain</td><td>7</td><td>083 2.7</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>3</th><td>3154x16012_45</td><td>3154x16012</td><td>45</td><td>83</td><td>3.7</td><td>WNV</td><td>Brain</td><td>7</td><td>083 3.7</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>4</th><td>16072x5346_61</td><td>16072x5346</td><td>61</td><td>81</td><td>1.12</td><td>WNV</td><td>Brain</td><td>12</td><td>081 1.12</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>5</th><td>16072x5346_62</td><td>16072x5346</td><td>62</td><td>81</td><td>2.12</td><td>WNV</td><td>Brain</td><td>12</td><td>081 2.12</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>6</th><td>16072x5346_68</td><td>16072x5346</td><td>68</td><td>81</td><td>3.12</td><td>WNV</td><td>Brain</td><td>12</td><td>081 3.12</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>7</th><td>16072x5346_63</td><td>16072x5346</td><td>63</td><td>81</td><td>1.12M</td><td>Mock</td><td>Brain</td><td>12</td><td>081 1.12M</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>8</th><td>16072x5346_70</td><td>16072x5346</td><td>70</td><td>81</td><td>2.12M</td><td>Mock</td><td>Brain</td><td>12</td><td>081 2.12M</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>9</th><td>16072x5346_71</td><td>16072x5346</td><td>71</td><td>81</td><td>3.12M</td><td>Mock</td><td>Brain</td><td>12</td><td>081 3.12M</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "\t<tr><th scope=row>10</th><td>5358x8046_71</td><td>5358x8046</td><td>71</td><td>73</td><td>1.12</td><td>WNV</td><td>Brain</td><td>12</td><td>073 1.12</td><td>Gale</td><td>NA</td><td>NA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>NA</td><td>NA</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|llllllllllllllllllllllllllllllllllll}\n",
       "  & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes & Cortex_PV_Inf & Cortex_P_Inf & Cortex_hm & Cortex_nn & Cortex_Subtotal & Hippocampus_PV_Inf & Hippocampus_P_Inf & Hippocampus_hm & Hippocampus_nn & Hippocampus_Subtotal & Thalamus_Midbrain_PV_Inf & Thalamus_Midbrain_P_Inf & Thalamus_Midbrain_hm & Thalamus_Midbrain_nn & Thalamus_Midbrain_Subtotal & Cerebellum_PV_Inf & Cerebellum_P_Inf & Cerebellum_hm & Cerebellum_nn & Cerebellum_Subtotal & Meningitis & Total & Comment & GI_Lesions\\\\\n",
       "\\hline\n",
       "\t1 & 3154x16012_37 & 3154x16012 & 37 & 83 & 1.7 & WNV & Brain & 7 & 083 1.7 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t2 & 3154x16012_38 & 3154x16012 & 38 & 83 & 2.7 & WNV & Brain & 7 & 083 2.7 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t3 & 3154x16012_45 & 3154x16012 & 45 & 83 & 3.7 & WNV & Brain & 7 & 083 3.7 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t4 & 16072x5346_61 & 16072x5346 & 61 & 81 & 1.12 & WNV & Brain & 12 & 081 1.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t5 & 16072x5346_62 & 16072x5346 & 62 & 81 & 2.12 & WNV & Brain & 12 & 081 2.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t6 & 16072x5346_68 & 16072x5346 & 68 & 81 & 3.12 & WNV & Brain & 12 & 081 3.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t7 & 16072x5346_63 & 16072x5346 & 63 & 81 & 1.12M & Mock & Brain & 12 & 081 1.12M & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t8 & 16072x5346_70 & 16072x5346 & 70 & 81 & 2.12M & Mock & Brain & 12 & 081 2.12M & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t9 & 16072x5346_71 & 16072x5346 & 71 & 81 & 3.12M & Mock & Brain & 12 & 081 3.12M & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\t10 & 5358x8046_71 & 5358x8046 & 71 & 73 & 1.12 & WNV & Brain & 12 & 073 1.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "              ID     Mating RIX_ID UW_Line  UWID Virus Tissue Timepoint\n",
       "1  3154x16012_37 3154x16012     37      83   1.7   WNV  Brain         7\n",
       "2  3154x16012_38 3154x16012     38      83   2.7   WNV  Brain         7\n",
       "3  3154x16012_45 3154x16012     45      83   3.7   WNV  Brain         7\n",
       "4  16072x5346_61 16072x5346     61      81  1.12   WNV  Brain        12\n",
       "5  16072x5346_62 16072x5346     62      81  2.12   WNV  Brain        12\n",
       "6  16072x5346_68 16072x5346     68      81  3.12   WNV  Brain        12\n",
       "7  16072x5346_63 16072x5346     63      81 1.12M  Mock  Brain        12\n",
       "8  16072x5346_70 16072x5346     70      81 2.12M  Mock  Brain        12\n",
       "9  16072x5346_71 16072x5346     71      81 3.12M  Mock  Brain        12\n",
       "10  5358x8046_71  5358x8046     71      73  1.12   WNV  Brain        12\n",
       "   slide_label  Lab Data_Altered Notes Cortex_PV_Inf Cortex_P_Inf Cortex_hm\n",
       "1      083 1.7 Gale           NA    NA             0            0         0\n",
       "2      083 2.7 Gale           NA    NA             0            0         0\n",
       "3      083 3.7 Gale           NA    NA             0            0         0\n",
       "4     081 1.12 Gale           NA    NA             0            0         0\n",
       "5     081 2.12 Gale           NA    NA             0            0         0\n",
       "6     081 3.12 Gale           NA    NA             0            0         0\n",
       "7    081 1.12M Gale           NA    NA             0            0         0\n",
       "8    081 2.12M Gale           NA    NA             0            0         0\n",
       "9    081 3.12M Gale           NA    NA             0            0         0\n",
       "10    073 1.12 Gale           NA    NA             0            0         0\n",
       "   Cortex_nn Cortex_Subtotal Hippocampus_PV_Inf Hippocampus_P_Inf\n",
       "1          0               0                  0                 0\n",
       "2          0               0                  0                 0\n",
       "3          0               0                  0                 0\n",
       "4          0               0                  0                 0\n",
       "5          0               0                  0                 0\n",
       "6          0               0                  0                 0\n",
       "7          0               0                  0                 0\n",
       "8          0               0                  0                 0\n",
       "9          0               0                  0                 0\n",
       "10         0               0                  0                 0\n",
       "   Hippocampus_hm Hippocampus_nn Hippocampus_Subtotal Thalamus_Midbrain_PV_Inf\n",
       "1               0              0                    0                        0\n",
       "2               0              0                    0                        0\n",
       "3               0              0                    0                        0\n",
       "4               0              0                    0                        0\n",
       "5               0              0                    0                        0\n",
       "6               0              0                    0                        0\n",
       "7               0              0                    0                        0\n",
       "8               0              0                    0                        0\n",
       "9               0              0                    0                        0\n",
       "10              0              0                    0                        0\n",
       "   Thalamus_Midbrain_P_Inf Thalamus_Midbrain_hm Thalamus_Midbrain_nn\n",
       "1                        0                    0                    0\n",
       "2                        0                    0                    0\n",
       "3                        0                    0                    0\n",
       "4                        0                    0                    0\n",
       "5                        0                    0                    0\n",
       "6                        0                    0                    0\n",
       "7                        0                    0                    0\n",
       "8                        0                    0                    0\n",
       "9                        0                    0                    0\n",
       "10                       0                    0                    0\n",
       "   Thalamus_Midbrain_Subtotal Cerebellum_PV_Inf Cerebellum_P_Inf Cerebellum_hm\n",
       "1                           0                 0                0             0\n",
       "2                           0                 0                0             0\n",
       "3                           0                 0                0             0\n",
       "4                           0                 0                0             0\n",
       "5                           0                 0                0             0\n",
       "6                           0                 0                0             0\n",
       "7                           0                 0                0             0\n",
       "8                           0                 0                0             0\n",
       "9                           0                 0                0             0\n",
       "10                          0                 0                0             0\n",
       "   Cerebellum_nn Cerebellum_Subtotal Meningitis Total Comment GI_Lesions\n",
       "1              0                   0          0     0    <NA>       <NA>\n",
       "2              0                   0          0     0    <NA>       <NA>\n",
       "3              0                   0          0     0    <NA>       <NA>\n",
       "4              0                   0          0     0    <NA>       <NA>\n",
       "5              0                   0          0     0    <NA>       <NA>\n",
       "6              0                   0          0     0    <NA>       <NA>\n",
       "7              0                   0          0     0    <NA>       <NA>\n",
       "8              0                   0          0     0    <NA>       <NA>\n",
       "9              0                   0          0     0    <NA>       <NA>\n",
       "10             0                   0          0     0    <NA>       <NA>"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "head(hist_data_v2, 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>ID</th><th scope=col>Mating</th><th scope=col>RIX_ID</th><th scope=col>UW_Line</th><th scope=col>UWID</th><th scope=col>Virus</th><th scope=col>Tissue</th><th scope=col>Timepoint</th><th scope=col>slide_label</th><th scope=col>Lab</th><th scope=col>Data_Altered</th><th scope=col>Notes</th><th scope=col>Cortex_PV_Inf</th><th scope=col>Cortex_P_Inf</th><th scope=col>Cortex_hm</th><th scope=col>Cortex_nn</th><th scope=col>Cortex_Subtotal</th><th scope=col>Hippocampus_PV_Inf</th><th scope=col>Hippocampus_P_Inf</th><th scope=col>Hippocampus_hm</th><th scope=col>Hippocampus_nn</th><th scope=col>Hippocampus_Subtotal</th><th scope=col>Thalamus_Midbrain_PV_Inf</th><th scope=col>Thalamus_Midbrain_P_Inf</th><th scope=col>Thalamus_Midbrain_hm</th><th scope=col>Thalamus_Midbrain_nn</th><th scope=col>Thalamus_Midbrain_Subtotal</th><th scope=col>Cerebellum_PV_Inf</th><th scope=col>Cerebellum_P_Inf</th><th scope=col>Cerebellum_hm</th><th scope=col>Cerebellum_nn</th><th scope=col>Cerebellum_Subtotal</th><th scope=col>Meningitis</th><th scope=col>Total</th><th scope=col>Comment</th><th scope=col>GI_Lesions</th></tr></thead>\n",
       "<tbody>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "\\begin{tabular}{r|llllllllllllllllllllllllllllllllllll}\n",
       "  & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes & Cortex_PV_Inf & Cortex_P_Inf & Cortex_hm & Cortex_nn & Cortex_Subtotal & Hippocampus_PV_Inf & Hippocampus_P_Inf & Hippocampus_hm & Hippocampus_nn & Hippocampus_Subtotal & Thalamus_Midbrain_PV_Inf & Thalamus_Midbrain_P_Inf & Thalamus_Midbrain_hm & Thalamus_Midbrain_nn & Thalamus_Midbrain_Subtotal & Cerebellum_PV_Inf & Cerebellum_P_Inf & Cerebellum_hm & Cerebellum_nn & Cerebellum_Subtotal & Meningitis & Total & Comment & GI_Lesions\\\\\n",
       "\\hline\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       " [1] ID                         Mating                    \n",
       " [3] RIX_ID                     UW_Line                   \n",
       " [5] UWID                       Virus                     \n",
       " [7] Tissue                     Timepoint                 \n",
       " [9] slide_label                Lab                       \n",
       "[11] Data_Altered               Notes                     \n",
       "[13] Cortex_PV_Inf              Cortex_P_Inf              \n",
       "[15] Cortex_hm                  Cortex_nn                 \n",
       "[17] Cortex_Subtotal            Hippocampus_PV_Inf        \n",
       "[19] Hippocampus_P_Inf          Hippocampus_hm            \n",
       "[21] Hippocampus_nn             Hippocampus_Subtotal      \n",
       "[23] Thalamus_Midbrain_PV_Inf   Thalamus_Midbrain_P_Inf   \n",
       "[25] Thalamus_Midbrain_hm       Thalamus_Midbrain_nn      \n",
       "[27] Thalamus_Midbrain_Subtotal Cerebellum_PV_Inf         \n",
       "[29] Cerebellum_P_Inf           Cerebellum_hm             \n",
       "[31] Cerebellum_nn              Cerebellum_Subtotal       \n",
       "[33] Meningitis                 Total                     \n",
       "[35] Comment                    GI_Lesions                \n",
       "<0 rows> (or 0-length row.names)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Check for mocks with scores > 0\n",
    "hist_data_v2[!is.na(hist_data_v2$Total) & hist_data_v2$Total != 0 & hist_data_v2$Virus=='Mock',]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6. Combine with Previous Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/latex": [],
      "text/markdown": [],
      "text/plain": [
       "character(0)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Get IDs duplicated in new data\n",
    "dup_ids = intersect(hist_data_v2$ID, hist_prev$ID)[!is.na(intersect(hist_data_v2$ID, hist_prev$ID))]\n",
    "dup_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "hist_data_v2[hist_data_v2$ID %in% dup_ids,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "hist_prev[hist_prev$ID %in% dup_ids,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>545</li>\n",
       "\t<li>34</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 545\n",
       "\\item 34\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 545\n",
       "2. 34\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 545  34"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim(hist_prev)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "hist_prev = hist_prev[!hist_prev$ID %in% dup_ids,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>545</li>\n",
       "\t<li>34</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 545\n",
       "\\item 34\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 545\n",
       "2. 34\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 545  34"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim(hist_prev)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] \"Tissue\"\n",
      "[1] \"GI_Lesions\"\n"
     ]
    }
   ],
   "source": [
    "for (cn in setdiff(colnames(hist_data_v2), colnames(hist_prev))) {\n",
    "  print(cn)\n",
    "  hist_prev[,cn] = NA\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "gale_hist = rbind(hist_prev[,colnames(hist_data_v2)], hist_data_v2[,colnames(hist_data_v2)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>1161</li>\n",
       "\t<li>36</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate*}\n",
       "\\item 1161\n",
       "\\item 36\n",
       "\\end{enumerate*}\n"
      ],
      "text/markdown": [
       "1. 1161\n",
       "2. 36\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[1] 1161   36"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dim(gale_hist)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "write.table(gale_hist, file=file.path(hist_dir, 'Gale_Histology_21-Mar-2016_final.txt'), \n",
    "            col.names=T, row.names=F, quote=F, sep='\\t', na=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Last Updated: 24-May-2015"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}