{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Systems Immunogenetics Project\n", "\n", "## WNV Histology Data Cleaning Workflow\n", "\n", "### McWeeney Lab, Oregon Health & Science University\n", "\n", "#### Author: Michael Mooney (mooneymi@ohsu.edu)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This document will walk through the steps for parsing and cleaning flow cytometry spreadsheets. The result will be a single R dataframe containing all the data, with standardized column names and formatting.\n", "\n", "Required Files:\n", "\n", "- This notebook** (`SIG_WNV_Histology_Data_Cleaning.ipynb`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/SIG_WNV_Flow_Data_Cleaning.ipynb)\n", "- The data dictionary containing all histology variables (`WNV_Data_Dictionary.xlsx`): [[Download here]](https://raw.githubusercontent.com/biodev/SIG/master/data/WNV_Data_Dictonary.xlsx)\n", "\n", "** Note: this notebook can also be downloaded as an R script (only the code blocks seen below will be included): [[Download R script here]](https://raw.githubusercontent.com/biodev/SIG/master/SIG_WNV_Flow_Data_Cleaning.r)\n", "\n", "Required R Packages:\n", "\n", "- `gdata`: [https://cran.r-project.org/web/packages/gdata/index.html](https://cran.r-project.org/web/packages/gdata/index.html)\n", "\n", "**All code is available on GitHub: [https://github.com/biodev/SIG](https://github.com/biodev/SIG)** \n", "\n", "If you are not familiar with Jupyter Notebooks, I've created a short tutorial to get you up and running quickly. There is also plenty of documentation online:\n", "\n", "1. [Jupyter for R Tutorial](http://nbviewer.jupyter.org/github/mooneymi/jupyter_notebooks/blob/master/r/Getting_Started_R.ipynb)\n", "2. [Jupyter Documentation](http://jupyter.org/)\n", "3. [Conda and R](https://www.continuum.io/conda-for-r)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1. Prepare the Input Files\n", "\n", "Sometimes, a bit of manual cleaning of the spreadsheets is necessary for the parsing function to work correctly.\n", "\n", "1. Move histology data to the first tab in the Excel workbook.\n", "2. Remove any rows with text above the column names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Load the Necessary R Libraries and Functions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.\n", "\n", "gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.\n", "\n", "Attaching package: ‘gdata’\n", "\n", "The following object is masked from ‘package:stats’:\n", "\n", " nobs\n", "\n", "The following object is masked from ‘package:utils’:\n", "\n", " object.size\n", "\n" ] } ], "source": [ "library(gdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3. Read the Data into R" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## Set data directories\n", "cleaned_dir = '~/Documents/SIG/WNV/Cleaned_Data_Releases/15-Jan-2016'\n", "hist_dir = '~/Documents/SIG/WNV/Histology/fixed_files'\n", "\n", "## Load column names\n", "hist_cn = read.xls(xls=\"./data/WNV_Data_Dictionary.xlsx\", sheet=\"Histology Data\", as.is=T)\n", "hist_cn = hist_cn[,1]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## Load cleaned weight data\n", "gale_weight = read.xls(file.path(cleaned_dir, 'Gale_Weight_13-Jan-2016_final.xlsx'), header=T, \n", " as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))\n", "\n", "## Load previously cleaned histology data\n", "hist_prev = read.xls(file.path(cleaned_dir, 'Gale_Histology_5-Jan-2016_final.xlsx'), header=T, \n", " as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Copy of 15-H943_fixedMM.xlsx\" \n", "[2] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1246 imaging_fixedMM.xlsx\"\n", "[3] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1393_fixedMM.xlsx\" \n", "[4] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1508_fixedMM.xlsx\" \n", "[5] \"Copy of Copy of Gale CC mice CPP request for prefilled 16-H94_fixedMM.xlsx\" \n", "[6] \"Copy of histo 15-H792 results_fixedMM.xlsx\" \n", "[7] \"New scoring 08 19 14_fixedMM.xlsx\" \n", "[8] \"Results 15-H501+15-H502 06 10 15_fixedMM.xlsx\" \n", "[9] \"Results 15-H585 06 17 15_fixedMM.xlsx\" \n" ] } ], "source": [ "## Get files to be processed\n", "hist_files = list.files(hist_dir, pattern=\".*\\\\.xlsx\")\n", "print(hist_files)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Copy of 15-H943_fixedMM.xlsx\"\n", "[1] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1246 imaging_fixedMM.xlsx\"\n", "[1] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1393_fixedMM.xlsx\"\n", "[1] \"Copy of Copy of Comparative Pathology Program Consult Request 15-H1508_fixedMM.xlsx\"\n", "[1] \"Copy of Copy of Gale CC mice CPP request for prefilled 16-H94_fixedMM.xlsx\"\n", "[1] \"Copy of histo 15-H792 results_fixedMM.xlsx\"\n", "[1] \"New scoring 08 19 14_fixedMM.xlsx\"\n", "[1] \"Results 15-H501+15-H502 06 10 15_fixedMM.xlsx\"\n", "[1] \"Results 15-H585 06 17 15_fixedMM.xlsx\"\n" ] } ], "source": [ "## Read all files and combine into a single dataframe\n", "print(hist_files[1])\n", "hist_data_v2 = read.xls(file.path(hist_dir, hist_files[1]), header=T, \n", " as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))\n", "colnames(hist_data_v2) = hist_cn[c(9, c(13:35))]\n", "\n", "for (file in hist_files[2:length(hist_files)]) {\n", " print(file)\n", " hist_data = read.xls(file.path(hist_dir, file), header=T, \n", " as.is=T, na.strings=c(\"\",\" \", \"NA\", \"na\", \"#DIV/0!\"))\n", " colnames(hist_data) = hist_cn[c(9, c(13:35))]\n", " hist_data_v2 = rbind(hist_data_v2, hist_data)\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 616
  2. \n", "\t
  3. 24
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 616\n", "\\item 24\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 616\n", "2. 24\n", "\n", "\n" ], "text/plain": [ "[1] 616 24" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Check dimensions of dataframe\n", "dim(hist_data_v2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4. Clean and Reformat the Data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"ID\"\n", "[1] \"Mating\"\n", "[1] \"RIX_ID\"\n", "[1] \"UW_Line\"\n", "[1] \"UWID\"\n", "[1] \"Virus\"\n", "[1] \"Tissue\"\n", "[1] \"Timepoint\"\n", "[1] \"Lab\"\n", "[1] \"Data_Altered\"\n", "[1] \"Notes\"\n", "[1] \"GI_Lesions\"\n" ] } ], "source": [ "## Create missing columns\n", "for (cn in setdiff(hist_cn, colnames(hist_data_v2))) {\n", " print(cn)\n", " hist_data_v2[,cn] = NA\n", "}\n", "\n", "## Order columns\n", "hist_data_v2 = hist_data_v2[, hist_cn]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning message:\n", "In eval(expr, envir, enclos): NAs introduced by coercion" ] } ], "source": [ "## Remove any leading or trailing spaces from slide_label\n", "hist_data_v2$slide_label = trim(hist_data_v2$slide_label)\n", "\n", "## Update UW Line\n", "hist_data_v2$UW_Line = sapply(hist_data_v2$slide_label, function(x){unlist(strsplit(x, \" \"))[1]})\n", "hist_data_v2$UW_Line = as.numeric(hist_data_v2$UW_Line)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Update UWID\n", "hist_data_v2$UWID = sapply(hist_data_v2$slide_label, function(x){unlist(strsplit(x, \" \"))[2]})\n", "hist_data_v2$UWID = gsub(\"m\", \"M\", hist_data_v2$UWID)\n", "\n", "## Update Lab\n", "hist_data_v2$Lab = 'Gale'" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Update time points\n", "hist_data_v2$Timepoint = sapply(hist_data_v2$UWID, function(x){unlist(strsplit(x, \"\\\\.\"))[2]})\n", "\n", "## Use time points to update virus\n", "hist_data_v2$Virus = 'WNV'\n", "hist_data_v2$Virus[grepl(\"M\", hist_data_v2$Timepoint)] = 'Mock'\n", "hist_data_v2$Virus[grepl(\"m\", hist_data_v2$Timepoint)] = 'Mock'\n", "\n", "## Update time points\n", "hist_data_v2$Timepoint = gsub(\"M\", \"\", hist_data_v2$Timepoint)\n", "hist_data_v2$Timepoint = gsub(\"m\", \"\", hist_data_v2$Timepoint)\n", "hist_data_v2$Timepoint = as.numeric(hist_data_v2$Timepoint)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"line=3260x1566; rix_id=2\" \"line=3260x1566; rix_id=15\"\n" ] } ], "source": [ "## Update Mating and RIX_ID\n", "for (i in 1:dim(hist_data_v2)[1]) {\n", " line = hist_data_v2$UW_Line[i]\n", " if (!is.na(line)) {\n", " mating = gale_weight$Mating[with(gale_weight, UW_Line==hist_data_v2$UW_Line[i])]\n", " if (length(mating)>0) {\n", " hist_data_v2$Mating[i] = mating[1]\n", " rix_id = gale_weight$RIX_ID[with(gale_weight, UW_Line==hist_data_v2$UW_Line[i] & UWID==hist_data_v2$UWID[i])]\n", " if (length(rix_id)>0) {\n", " if (length(rix_id)>1) {\n", " ## If more than one ID matches the UWID, print the IDs\n", " print(paste0(\"line=\", mating[1], \"; rix_id=\", rix_id))\n", " }\n", " hist_data_v2$RIX_ID[i] = rix_id[1]\n", " }\n", " }\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDMatingRIX_IDUW_LineUWIDVirusTissueTimepointslide_labelLabData_AlteredNotes
273NA3260x156621002.12WNVNA12100 2.12GaleNANA
274NA3260x156681001.12MMockNA12100 1.12MGaleNANA
275NA3260x156691002.12MMockNA12100 2.12MGaleNANA
276NA3260x1566161003.12MMockNA12100 3.12MGaleNANA
331NA3260x1566NA1001.4WNVNA4100 1.4GaleNANA
332NA3260x1566NA1002.4WNVNA4100 2.4GaleNANA
333NA3260x1566NA1003.4WNVNA4100 3.4GaleNANA
364NA3260x1566NA1001.7WNVNA7100 1.7GaleNANA
365NA3260x1566NA1002.7WNVNA7100 2.7GaleNANA
366NA3260x1566NA1003.7WNVNA7100 3.7GaleNANA
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllll}\n", " & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes\\\\\n", "\\hline\n", "\t273 & NA & 3260x1566 & 2 & 100 & 2.12 & WNV & NA & 12 & 100 2.12 & Gale & NA & NA\\\\\n", "\t274 & NA & 3260x1566 & 8 & 100 & 1.12M & Mock & NA & 12 & 100 1.12M & Gale & NA & NA\\\\\n", "\t275 & NA & 3260x1566 & 9 & 100 & 2.12M & Mock & NA & 12 & 100 2.12M & Gale & NA & NA\\\\\n", "\t276 & NA & 3260x1566 & 16 & 100 & 3.12M & Mock & NA & 12 & 100 3.12M & Gale & NA & NA\\\\\n", "\t331 & NA & 3260x1566 & NA & 100 & 1.4 & WNV & NA & 4 & 100 1.4 & Gale & NA & NA\\\\\n", "\t332 & NA & 3260x1566 & NA & 100 & 2.4 & WNV & NA & 4 & 100 2.4 & Gale & NA & NA\\\\\n", "\t333 & NA & 3260x1566 & NA & 100 & 3.4 & WNV & NA & 4 & 100 3.4 & Gale & NA & NA\\\\\n", "\t364 & NA & 3260x1566 & NA & 100 & 1.7 & WNV & NA & 7 & 100 1.7 & Gale & NA & NA\\\\\n", "\t365 & NA & 3260x1566 & NA & 100 & 2.7 & WNV & NA & 7 & 100 2.7 & Gale & NA & NA\\\\\n", "\t366 & NA & 3260x1566 & NA & 100 & 3.7 & WNV & NA & 7 & 100 3.7 & Gale & NA & NA\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " ID Mating RIX_ID UW_Line UWID Virus Tissue Timepoint slide_label Lab\n", "273 NA 3260x1566 2 100 2.12 WNV NA 12 100 2.12 Gale\n", "274 NA 3260x1566 8 100 1.12M Mock NA 12 100 1.12M Gale\n", "275 NA 3260x1566 9 100 2.12M Mock NA 12 100 2.12M Gale\n", "276 NA 3260x1566 16 100 3.12M Mock NA 12 100 3.12M Gale\n", "331 NA 3260x1566 NA 100 1.4 WNV NA 4 100 1.4 Gale\n", "332 NA 3260x1566 NA 100 2.4 WNV NA 4 100 2.4 Gale\n", "333 NA 3260x1566 NA 100 3.4 WNV NA 4 100 3.4 Gale\n", "364 NA 3260x1566 NA 100 1.7 WNV NA 7 100 1.7 Gale\n", "365 NA 3260x1566 NA 100 2.7 WNV NA 7 100 2.7 Gale\n", "366 NA 3260x1566 NA 100 3.7 WNV NA 7 100 3.7 Gale\n", " Data_Altered Notes\n", "273 NA NA\n", "274 NA NA\n", "275 NA NA\n", "276 NA NA\n", "331 NA NA\n", "332 NA NA\n", "333 NA NA\n", "364 NA NA\n", "365 NA NA\n", "366 NA NA" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Check Matings and IDs of the discrepancies\n", "## Make manual corrections if necessary\n", "hist_data_v2[hist_data_v2$Mating=='3260x1566' & !is.na(hist_data_v2$Mating), 1:12]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Update ID\n", "hist_data_v2$ID = paste(hist_data_v2$Mating, hist_data_v2$RIX_ID, sep=\"_\")\n", "hist_data_v2$ID[grepl(\"NA\", hist_data_v2$ID)] = NA" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "## Update tissue\n", "hist_data_v2$Tissue = 'Brain'\n", "for (i in 1:length(hist_data_v2$slide_label)) {\n", " label = hist_data_v2$slide_label[i]\n", " if (grepl(\"Br\", label)) {hist_data_v2$Tissue[i] = 'Brain'}\n", " if (grepl(\"LI\", label)) {hist_data_v2$Tissue[i] = 'Large Intestine'}\n", " if (grepl(\"SI\", label)) {hist_data_v2$Tissue[i] = 'Small Intestine'}\n", " if (grepl(\"Stom\", label)) {hist_data_v2$Tissue[i] = 'Stomach'}\n", " if (grepl(\"[Cc]ec\", label)) {hist_data_v2$Tissue[i] = 'Cecum'}\n", "}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Update GI_Lesions\n", "hist_data_v2$GI_Lesions = NA\n", "for (i in 1:length(hist_data_v2$Tissue)) {\n", " tissue = hist_data_v2$Tissue[i]\n", " if (tissue != 'Brain') {\n", " hist_data_v2$GI_Lesions[i] = hist_data_v2$Cortex_PV_Inf[i]\n", " hist_data_v2$Cortex_PV_Inf[i] = NA\n", " }\n", "}\n", "\n", "hist_data_v2$Cortex_PV_Inf = as.numeric(hist_data_v2$Cortex_PV_Inf)\n", "hist_data_v2$GI_Lesions[hist_data_v2$GI_Lesions == 'NSL'] = 0" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\n", " 0 1 \n", "98 22 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table(hist_data_v2$GI_Lesions)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "
IDMatingRIX_IDUW_LineUWIDVirusTissueTimepointslide_labelLabData_AlteredNotesCortex_PV_InfCortex_P_InfCortex_hmCortex_nnCortex_SubtotalHippocampus_PV_InfHippocampus_P_InfHippocampus_hmHippocampus_nnHippocampus_SubtotalThalamus_Midbrain_PV_InfThalamus_Midbrain_P_InfThalamus_Midbrain_hmThalamus_Midbrain_nnThalamus_Midbrain_SubtotalCerebellum_PV_InfCerebellum_P_InfCerebellum_hmCerebellum_nnCerebellum_SubtotalMeningitisTotalCommentGI_Lesions
397NANANANAMockWNVBrainNASTI Mock KGaleNANA0000000000000000000000KIDNEY- no signifincat lesions; mild artifacts MALENA
398NANANANA3.6WNVBrain6STI 3.6 KGaleNANA0000000000000000000000NANA
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllllllllllllllllllllllllllll}\n", " & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes & Cortex_PV_Inf & Cortex_P_Inf & Cortex_hm & Cortex_nn & Cortex_Subtotal & Hippocampus_PV_Inf & Hippocampus_P_Inf & Hippocampus_hm & Hippocampus_nn & Hippocampus_Subtotal & Thalamus_Midbrain_PV_Inf & Thalamus_Midbrain_P_Inf & Thalamus_Midbrain_hm & Thalamus_Midbrain_nn & Thalamus_Midbrain_Subtotal & Cerebellum_PV_Inf & Cerebellum_P_Inf & Cerebellum_hm & Cerebellum_nn & Cerebellum_Subtotal & Meningitis & Total & Comment & GI_Lesions\\\\\n", "\\hline\n", "\t397 & NA & NA & NA & NA & Mock & WNV & Brain & NA & STI Mock K & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & KIDNEY- no signifincat lesions; mild artifacts MALE & NA\\\\\n", "\t398 & NA & NA & NA & NA & 3.6 & WNV & Brain & 6 & STI 3.6 K & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " ID Mating RIX_ID UW_Line UWID Virus Tissue Timepoint slide_label Lab\n", "397 NA NA Mock WNV Brain NA STI Mock K Gale\n", "398 NA NA 3.6 WNV Brain 6 STI 3.6 K Gale\n", " Data_Altered Notes Cortex_PV_Inf Cortex_P_Inf Cortex_hm Cortex_nn\n", "397 NA NA 0 0 0 0\n", "398 NA NA 0 0 0 0\n", " Cortex_Subtotal Hippocampus_PV_Inf Hippocampus_P_Inf Hippocampus_hm\n", "397 0 0 0 0\n", "398 0 0 0 0\n", " Hippocampus_nn Hippocampus_Subtotal Thalamus_Midbrain_PV_Inf\n", "397 0 0 0\n", "398 0 0 0\n", " Thalamus_Midbrain_P_Inf Thalamus_Midbrain_hm Thalamus_Midbrain_nn\n", "397 0 0 0\n", "398 0 0 0\n", " Thalamus_Midbrain_Subtotal Cerebellum_PV_Inf Cerebellum_P_Inf Cerebellum_hm\n", "397 0 0 0 0\n", "398 0 0 0 0\n", " Cerebellum_nn Cerebellum_Subtotal Meningitis Total\n", "397 0 0 0 0\n", "398 0 0 0 0\n", " Comment GI_Lesions\n", "397 KIDNEY- no signifincat lesions; mild artifacts MALE \n", "398 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Fix STI sample\n", "hist_data_v2[grepl(\"STI\", hist_data_v2$slide_label),]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Fix STI sample\n", "hist_data_v2$UWID[397] = NA\n", "hist_data_v2$Virus[397] = 'Mock'\n", "hist_data_v2$Tissue[397] = 'Kidney'\n", "hist_data_v2$Tissue[398] = 'Kidney'" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0" ], "text/latex": [ "0" ], "text/markdown": [ "0" ], "text/plain": [ "[1] 0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Check for duplicates\n", "sum(duplicated(hist_data_v2$ID[!is.na(hist_data_v2$ID)]))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [], "text/latex": [], "text/markdown": [], "text/plain": [ "character(0)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Duplicated IDs\n", "hist_data_v2$ID[duplicated(hist_data_v2$ID) & !is.na(hist_data_v2$ID)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5. Fix Missing Values & Calculate Totals" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cortex_cols = c('Cortex_PV_Inf','Cortex_P_Inf','Cortex_hm','Cortex_nn')\n", "hippocampus_cols = c('Hippocampus_PV_Inf','Hippocampus_P_Inf','Hippocampus_hm','Hippocampus_nn')\n", "thalamus_cols = c('Thalamus_Midbrain_PV_Inf','Thalamus_Midbrain_P_Inf','Thalamus_Midbrain_hm','Thalamus_Midbrain_nn')\n", "cerebellum_cols = c('Cerebellum_PV_Inf','Cerebellum_P_Inf','Cerebellum_hm','Cerebellum_nn')\n", "subtotal_cols = c('Cortex_Subtotal','Hippocampus_Subtotal','Thalamus_Midbrain_Subtotal','Cerebellum_Subtotal','Meningitis')\n", "\n", "## Fill in zeros if not NA\n", "for (i in 1:dim(hist_data_v2)[1]) {\n", " if (!is.na(hist_data_v2[i, 'Cortex_Subtotal']) & hist_data_v2[i, 'Cortex_Subtotal'] == 0) {\n", " hist_data_v2[i,cortex_cols] = c(0,0,0,0)\n", " }\n", " if (!is.na(hist_data_v2[i, 'Hippocampus_Subtotal']) & hist_data_v2[i, 'Hippocampus_Subtotal'] == 0) {\n", " hist_data_v2[i,hippocampus_cols] = c(0,0,0,0)\n", " }\n", " if (!is.na(hist_data_v2[i, 'Thalamus_Midbrain_Subtotal']) & hist_data_v2[i, 'Thalamus_Midbrain_Subtotal'] == 0) {\n", " hist_data_v2[i,thalamus_cols] = c(0,0,0,0)\n", " }\n", " if (!is.na(hist_data_v2[i, 'Cerebellum_Subtotal']) & hist_data_v2[i, 'Cerebellum_Subtotal'] == 0) {\n", " hist_data_v2[i,cerebellum_cols] = c(0,0,0,0)\n", " }\n", " if (!is.na(hist_data_v2[i, 'Total']) & hist_data_v2[i, 'Total'] == 0) {\n", " hist_data_v2[i,subtotal_cols] = c(0,0,0,0,0)\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hist_data_v2$Cortex_Subtotal = apply(hist_data_v2[,cortex_cols], 1, function(x){sum(x)})\n", "hist_data_v2$Hippocampus_Subtotal = apply(hist_data_v2[,hippocampus_cols], 1, function(x){sum(x)})\n", "hist_data_v2$Thalamus_Midbrain_Subtotal = apply(hist_data_v2[,thalamus_cols], 1, function(x){sum(x)})\n", "hist_data_v2$Cerebellum_Subtotal = apply(hist_data_v2[,cerebellum_cols], 1, function(x){sum(x)})\n", "hist_data_v2$Total = apply(hist_data_v2[,subtotal_cols], 1, function(x){sum(x)})" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDMatingRIX_IDUW_LineUWIDVirusTissueTimepointslide_labelLabData_AlteredNotesCortex_PV_InfCortex_P_InfCortex_hmCortex_nnCortex_SubtotalHippocampus_PV_InfHippocampus_P_InfHippocampus_hmHippocampus_nnHippocampus_SubtotalThalamus_Midbrain_PV_InfThalamus_Midbrain_P_InfThalamus_Midbrain_hmThalamus_Midbrain_nnThalamus_Midbrain_SubtotalCerebellum_PV_InfCerebellum_P_InfCerebellum_hmCerebellum_nnCerebellum_SubtotalMeningitisTotalCommentGI_Lesions
13154x16012_373154x1601237831.7WNVBrain7083 1.7GaleNANA0000000000000000000000NANA
23154x16012_383154x1601238832.7WNVBrain7083 2.7GaleNANA0000000000000000000000NANA
33154x16012_453154x1601245833.7WNVBrain7083 3.7GaleNANA0000000000000000000000NANA
416072x5346_6116072x534661811.12WNVBrain12081 1.12GaleNANA0000000000000000000000NANA
516072x5346_6216072x534662812.12WNVBrain12081 2.12GaleNANA0000000000000000000000NANA
616072x5346_6816072x534668813.12WNVBrain12081 3.12GaleNANA0000000000000000000000NANA
716072x5346_6316072x534663811.12MMockBrain12081 1.12MGaleNANA0000000000000000000000NANA
816072x5346_7016072x534670812.12MMockBrain12081 2.12MGaleNANA0000000000000000000000NANA
916072x5346_7116072x534671813.12MMockBrain12081 3.12MGaleNANA0000000000000000000000NANA
105358x8046_715358x804671731.12WNVBrain12073 1.12GaleNANA0000000000000000000000NANA
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllllllllllllllllllllllllllll}\n", " & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes & Cortex_PV_Inf & Cortex_P_Inf & Cortex_hm & Cortex_nn & Cortex_Subtotal & Hippocampus_PV_Inf & Hippocampus_P_Inf & Hippocampus_hm & Hippocampus_nn & Hippocampus_Subtotal & Thalamus_Midbrain_PV_Inf & Thalamus_Midbrain_P_Inf & Thalamus_Midbrain_hm & Thalamus_Midbrain_nn & Thalamus_Midbrain_Subtotal & Cerebellum_PV_Inf & Cerebellum_P_Inf & Cerebellum_hm & Cerebellum_nn & Cerebellum_Subtotal & Meningitis & Total & Comment & GI_Lesions\\\\\n", "\\hline\n", "\t1 & 3154x16012_37 & 3154x16012 & 37 & 83 & 1.7 & WNV & Brain & 7 & 083 1.7 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t2 & 3154x16012_38 & 3154x16012 & 38 & 83 & 2.7 & WNV & Brain & 7 & 083 2.7 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t3 & 3154x16012_45 & 3154x16012 & 45 & 83 & 3.7 & WNV & Brain & 7 & 083 3.7 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t4 & 16072x5346_61 & 16072x5346 & 61 & 81 & 1.12 & WNV & Brain & 12 & 081 1.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t5 & 16072x5346_62 & 16072x5346 & 62 & 81 & 2.12 & WNV & Brain & 12 & 081 2.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t6 & 16072x5346_68 & 16072x5346 & 68 & 81 & 3.12 & WNV & Brain & 12 & 081 3.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t7 & 16072x5346_63 & 16072x5346 & 63 & 81 & 1.12M & Mock & Brain & 12 & 081 1.12M & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t8 & 16072x5346_70 & 16072x5346 & 70 & 81 & 2.12M & Mock & Brain & 12 & 081 2.12M & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t9 & 16072x5346_71 & 16072x5346 & 71 & 81 & 3.12M & Mock & Brain & 12 & 081 3.12M & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\t10 & 5358x8046_71 & 5358x8046 & 71 & 73 & 1.12 & WNV & Brain & 12 & 073 1.12 & Gale & NA & NA & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & NA & NA\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " ID Mating RIX_ID UW_Line UWID Virus Tissue Timepoint\n", "1 3154x16012_37 3154x16012 37 83 1.7 WNV Brain 7\n", "2 3154x16012_38 3154x16012 38 83 2.7 WNV Brain 7\n", "3 3154x16012_45 3154x16012 45 83 3.7 WNV Brain 7\n", "4 16072x5346_61 16072x5346 61 81 1.12 WNV Brain 12\n", "5 16072x5346_62 16072x5346 62 81 2.12 WNV Brain 12\n", "6 16072x5346_68 16072x5346 68 81 3.12 WNV Brain 12\n", "7 16072x5346_63 16072x5346 63 81 1.12M Mock Brain 12\n", "8 16072x5346_70 16072x5346 70 81 2.12M Mock Brain 12\n", "9 16072x5346_71 16072x5346 71 81 3.12M Mock Brain 12\n", "10 5358x8046_71 5358x8046 71 73 1.12 WNV Brain 12\n", " slide_label Lab Data_Altered Notes Cortex_PV_Inf Cortex_P_Inf Cortex_hm\n", "1 083 1.7 Gale NA NA 0 0 0\n", "2 083 2.7 Gale NA NA 0 0 0\n", "3 083 3.7 Gale NA NA 0 0 0\n", "4 081 1.12 Gale NA NA 0 0 0\n", "5 081 2.12 Gale NA NA 0 0 0\n", "6 081 3.12 Gale NA NA 0 0 0\n", "7 081 1.12M Gale NA NA 0 0 0\n", "8 081 2.12M Gale NA NA 0 0 0\n", "9 081 3.12M Gale NA NA 0 0 0\n", "10 073 1.12 Gale NA NA 0 0 0\n", " Cortex_nn Cortex_Subtotal Hippocampus_PV_Inf Hippocampus_P_Inf\n", "1 0 0 0 0\n", "2 0 0 0 0\n", "3 0 0 0 0\n", "4 0 0 0 0\n", "5 0 0 0 0\n", "6 0 0 0 0\n", "7 0 0 0 0\n", "8 0 0 0 0\n", "9 0 0 0 0\n", "10 0 0 0 0\n", " Hippocampus_hm Hippocampus_nn Hippocampus_Subtotal Thalamus_Midbrain_PV_Inf\n", "1 0 0 0 0\n", "2 0 0 0 0\n", "3 0 0 0 0\n", "4 0 0 0 0\n", "5 0 0 0 0\n", "6 0 0 0 0\n", "7 0 0 0 0\n", "8 0 0 0 0\n", "9 0 0 0 0\n", "10 0 0 0 0\n", " Thalamus_Midbrain_P_Inf Thalamus_Midbrain_hm Thalamus_Midbrain_nn\n", "1 0 0 0\n", "2 0 0 0\n", "3 0 0 0\n", "4 0 0 0\n", "5 0 0 0\n", "6 0 0 0\n", "7 0 0 0\n", "8 0 0 0\n", "9 0 0 0\n", "10 0 0 0\n", " Thalamus_Midbrain_Subtotal Cerebellum_PV_Inf Cerebellum_P_Inf Cerebellum_hm\n", "1 0 0 0 0\n", "2 0 0 0 0\n", "3 0 0 0 0\n", "4 0 0 0 0\n", "5 0 0 0 0\n", "6 0 0 0 0\n", "7 0 0 0 0\n", "8 0 0 0 0\n", "9 0 0 0 0\n", "10 0 0 0 0\n", " Cerebellum_nn Cerebellum_Subtotal Meningitis Total Comment GI_Lesions\n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 0 0 \n", "5 0 0 0 0 \n", "6 0 0 0 0 \n", "7 0 0 0 0 \n", "8 0 0 0 0 \n", "9 0 0 0 0 \n", "10 0 0 0 0 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "head(hist_data_v2, 10)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
IDMatingRIX_IDUW_LineUWIDVirusTissueTimepointslide_labelLabData_AlteredNotesCortex_PV_InfCortex_P_InfCortex_hmCortex_nnCortex_SubtotalHippocampus_PV_InfHippocampus_P_InfHippocampus_hmHippocampus_nnHippocampus_SubtotalThalamus_Midbrain_PV_InfThalamus_Midbrain_P_InfThalamus_Midbrain_hmThalamus_Midbrain_nnThalamus_Midbrain_SubtotalCerebellum_PV_InfCerebellum_P_InfCerebellum_hmCerebellum_nnCerebellum_SubtotalMeningitisTotalCommentGI_Lesions
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllllllllllllllllllllllllllll}\n", " & ID & Mating & RIX_ID & UW_Line & UWID & Virus & Tissue & Timepoint & slide_label & Lab & Data_Altered & Notes & Cortex_PV_Inf & Cortex_P_Inf & Cortex_hm & Cortex_nn & Cortex_Subtotal & Hippocampus_PV_Inf & Hippocampus_P_Inf & Hippocampus_hm & Hippocampus_nn & Hippocampus_Subtotal & Thalamus_Midbrain_PV_Inf & Thalamus_Midbrain_P_Inf & Thalamus_Midbrain_hm & Thalamus_Midbrain_nn & Thalamus_Midbrain_Subtotal & Cerebellum_PV_Inf & Cerebellum_P_Inf & Cerebellum_hm & Cerebellum_nn & Cerebellum_Subtotal & Meningitis & Total & Comment & GI_Lesions\\\\\n", "\\hline\n", "\\end{tabular}\n" ], "text/plain": [ " [1] ID Mating \n", " [3] RIX_ID UW_Line \n", " [5] UWID Virus \n", " [7] Tissue Timepoint \n", " [9] slide_label Lab \n", "[11] Data_Altered Notes \n", "[13] Cortex_PV_Inf Cortex_P_Inf \n", "[15] Cortex_hm Cortex_nn \n", "[17] Cortex_Subtotal Hippocampus_PV_Inf \n", "[19] Hippocampus_P_Inf Hippocampus_hm \n", "[21] Hippocampus_nn Hippocampus_Subtotal \n", "[23] Thalamus_Midbrain_PV_Inf Thalamus_Midbrain_P_Inf \n", "[25] Thalamus_Midbrain_hm Thalamus_Midbrain_nn \n", "[27] Thalamus_Midbrain_Subtotal Cerebellum_PV_Inf \n", "[29] Cerebellum_P_Inf Cerebellum_hm \n", "[31] Cerebellum_nn Cerebellum_Subtotal \n", "[33] Meningitis Total \n", "[35] Comment GI_Lesions \n", "<0 rows> (or 0-length row.names)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Check for mocks with scores > 0\n", "hist_data_v2[!is.na(hist_data_v2$Total) & hist_data_v2$Total != 0 & hist_data_v2$Virus=='Mock',]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6. Combine with Previous Data" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [], "text/latex": [], "text/markdown": [], "text/plain": [ "character(0)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Get IDs duplicated in new data\n", "dup_ids = intersect(hist_data_v2$ID, hist_prev$ID)[!is.na(intersect(hist_data_v2$ID, hist_prev$ID))]\n", "dup_ids" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hist_data_v2[hist_data_v2$ID %in% dup_ids,]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hist_prev[hist_prev$ID %in% dup_ids,]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 545
  2. \n", "\t
  3. 34
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 545\n", "\\item 34\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 545\n", "2. 34\n", "\n", "\n" ], "text/plain": [ "[1] 545 34" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim(hist_prev)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hist_prev = hist_prev[!hist_prev$ID %in% dup_ids,]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 545
  2. \n", "\t
  3. 34
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 545\n", "\\item 34\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 545\n", "2. 34\n", "\n", "\n" ], "text/plain": [ "[1] 545 34" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim(hist_prev)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Tissue\"\n", "[1] \"GI_Lesions\"\n" ] } ], "source": [ "for (cn in setdiff(colnames(hist_data_v2), colnames(hist_prev))) {\n", " print(cn)\n", " hist_prev[,cn] = NA\n", "}" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "gale_hist = rbind(hist_prev[,colnames(hist_data_v2)], hist_data_v2[,colnames(hist_data_v2)])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 1161
  2. \n", "\t
  3. 36
  4. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1161\n", "\\item 36\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1161\n", "2. 36\n", "\n", "\n" ], "text/plain": [ "[1] 1161 36" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim(gale_hist)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "write.table(gale_hist, file=file.path(hist_dir, 'Gale_Histology_21-Mar-2016_final.txt'), \n", " col.names=T, row.names=F, quote=F, sep='\\t', na=\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Last Updated: 24-May-2015" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.2.2" } }, "nbformat": 4, "nbformat_minor": 0 }