{"cells":[{"metadata":{},"cell_type":"markdown","source":"# ** DIscBIO: a user-friendly pipeline for biomarker discovery in single-cell transcriptomics**"},{"metadata":{},"cell_type":"markdown","source":"The pipeline consists of four successive steps: data pre-processing, cellular clustering and pseudo-temporal ordering, determining differential expressed genes and identifying biomarkers."},{"metadata":{},"cell_type":"markdown","source":"![DIsccBIO](DiscBIO.png)"},{"metadata":{},"cell_type":"markdown","source":"# CTC Notebook [PART 4]\n\n## Running the DIscBIO pipeline based on a list of genes related to Golgi Fragmentation\n## [Data pre-processing and cell clustering]\n "},{"metadata":{},"cell_type":"markdown","source":"## Required Packages"},{"metadata":{"trusted":false},"cell_type":"code","source":"library(DIscBIO)","execution_count":1,"outputs":[{"output_type":"stream","text":"Loading required package: SingleCellExperiment\n\nLoading required package: SummarizedExperiment\n\nLoading required package: MatrixGenerics\n\nLoading required package: matrixStats\n\n\nAttaching package: ‘MatrixGenerics’\n\n\nThe following objects are masked from ‘package:matrixStats’:\n\n colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,\n colCounts, colCummaxs, colCummins, colCumprods, colCumsums,\n colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,\n colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,\n colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,\n colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,\n colWeightedMeans, colWeightedMedians, colWeightedSds,\n colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,\n rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,\n rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,\n rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,\n rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,\n rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,\n rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n rowWeightedSds, rowWeightedVars\n\n\nLoading required package: GenomicRanges\n\nLoading required package: stats4\n\nLoading required package: BiocGenerics\n\nLoading required package: parallel\n\n\nAttaching package: ‘BiocGenerics’\n\n\nThe following objects are masked from ‘package:parallel’:\n\n clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,\n clusterExport, clusterMap, parApply, parCapply, parLapply,\n parLapplyLB, parRapply, parSapply, parSapplyLB\n\n\nThe following objects are masked from ‘package:stats’:\n\n IQR, mad, sd, var, xtabs\n\n\nThe following objects are masked from ‘package:base’:\n\n anyDuplicated, append, as.data.frame, basename, cbind, colnames,\n dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,\n grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,\n order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,\n rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,\n union, unique, unsplit, which.max, which.min\n\n\nLoading required package: S4Vectors\n\n\nAttaching package: ‘S4Vectors’\n\n\nThe following object is masked from ‘package:base’:\n\n expand.grid\n\n\nLoading required package: IRanges\n\nLoading required package: GenomeInfoDb\n\nLoading required package: Biobase\n\nWelcome to Bioconductor\n\n Vignettes contain introductory material; view with\n 'browseVignettes()'. To cite Bioconductor, see\n 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n\n\n\nAttaching package: ‘Biobase’\n\n\nThe following object is masked from ‘package:MatrixGenerics’:\n\n rowMedians\n\n\nThe following objects are masked from ‘package:matrixStats’:\n\n anyMissing, rowMedians\n\n\n\n\n","name":"stderr"}]},{"metadata":{},"cell_type":"markdown","source":"## Loading dataset"},{"metadata":{},"cell_type":"markdown","source":"\nThe \"CTCdataset\" dataset consisting of single migratory circulating tumor cells (CTCs) collected from patients with breast cancer. Data are available in the GEO database with accession numbers GSE51827, GSE55807, GSE67939, GSE75367, GSE109761, GSE111065 and GSE86978. The dataset should be formatted in a data frame where columns refer to samples and rows refer to genes. We provide here the possibility to load the dataset either as \".csv\" or \".rda\" extensions.The dataset should be formatted in a data frame where columns refer to samples and rows refer to genes. \nWe provide here the possibility to load the dataset either as \".csv\" or \".rda\" extensions."},{"metadata":{"trusted":false},"cell_type":"code","source":"FileName<-\"CTCdataset\" # Name of the dataset\n#CSV=TRUE # If the dataset has \".csv\", the user shoud set CSV to TRUE\nCSV=FALSE # If the dataset has \".rda\", the user shoud set CSV to FALSE\n\nif (CSV==TRUE){\n DataSet <- read.csv(file = paste0(FileName,\".csv\"), sep = \",\",header=T)\n rownames(DataSet)<-DataSet[,1]\n DataSet<-DataSet[,-1]\n} else{\n load(paste0(FileName,\".rda\"))\n DataSet<-get(FileName)\n}\ncat(paste0(\"The \", FileName,\" contains:\",\"\\n\",\"Genes: \",length(DataSet[,1]),\"\\n\",\"cells: \",length(DataSet[1,]),\"\\n\"))","execution_count":2,"outputs":[{"output_type":"stream","text":"The CTCdataset contains:\nGenes: 13181\ncells: 1462\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"### 1. Preparing the dataset"},{"metadata":{"trusted":false},"cell_type":"code","source":"FG<- DISCBIO(DataSet)\nFG<-Normalizedata(FG, mintotal=1000, minexpr=0, minnumber=0, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000) \nFG<-FinalPreprocessing(FG,GeneFlitering=\"ExpF\",export = TRUE) # The GeneFiltering should be set to \"ExpF\"\n\nGolgiFragGeneList<- read.csv(file = \"GolgiFragGeneList.csv\", sep = \",\",header=F)\nData<-FG@fdata \ngenes<-rownames(Data)\ngene_list<- GolgiFragGeneList[,1]\nidx_genes <- is.element(genes,gene_list)\nOAdf<-Data[idx_genes,] \nFG@fdata<-OAdf\ndim(FG@fdata)\ncat(paste0(\"A list of \", length(OAdf[,1]), \" genes will be used for the clustering\",\"\\n\"))","execution_count":3,"outputs":[{"output_type":"stream","text":"The gene filtering method = Noise filtering\n\nThe Filtered Normalized dataset contains:\nGenes: 13181\ncells: 1448\n\n\n\nThe Filtered Normalized dataset was saved as: filteredDataset.Rdata\n\n","name":"stderr"},{"output_type":"display_data","data":{"text/html":"\n