1 Institut de Recherche en Cancérologie de Montpellier, Inserm, Montpellier, France ; Institut régional du Cancer Montpellier, Montpellier, France ; Université de Montpellier, Montpellier, France
This guide provides an overview of the SingleCellSignalR package, a comprehensive framework to obtain cellular network maps from scRNA-seq data. SingleCellSignalR comes with a complete pipeline integrating existing methods to cluster individual cell transcriptomes and identify cell subpopulations as well as novel cellular network-specific algorithms. More advanced users can substitute their own logic or alternative tools at various stages of data processing. SingleCellSignalR also maps cell subpopulation internal network linked to genes of iinterest through the integration of regulated KEGG and Reactome pathways together with ligands and receptors involved in inferred cell-cell interactions. The cellular networks can be exported in text files and graphML objects to be further explored with Cytoscape (www.cytoscape.org), yEd (www.yworks.com), or similar software tools.
Independent of the chosen scRNA-seq platform, deep or shallower, data comes as a table of read or unique molecule identifier (UMI) counts, one column per individual cell and one row per gene. Initial processing is required to prepare such data for subsequent analysis and we decided to propose a generic solution for the sake of convenience, though users can easily substitute their own computations. Gene names (HUGO symbols) are provided in the first column of the table.
Each analysis is organized around a working directory (or project folder):
The file containing the read counts should be placed in the working directory.
> file <- "example_data.txt"
Data processing can then start:
> data <- data_prepare(file = file)
log-Normalization
14223 genes
400 cells
Zero rate = 89.6%
The data_prepare() function eliminates non expressed genes before performing read count normalization.
Normalized data are submitted to a clustering algorithm to identify cell subpopulations:
> clust <- clustering(data = data,n = 10, method = "simlr")
Estimating the number of clusters
Estimated number of clusters = 4
Computing the multiple Kernels.
Performing network diffiusion.
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6
Iteration: 7
Iteration: 8
Iteration: 9
Iteration: 10
Iteration: 11
Performing t-SNE.
Epoch: Iteration # 100 error is: 0.1974932
Epoch: Iteration # 200 error is: 0.1873306
Epoch: Iteration # 300 error is: 0.1825675
Epoch: Iteration # 400 error is: 0.1807186
Epoch: Iteration # 500 error is: 0.1805211
Epoch: Iteration # 600 error is: 0.1804052
Epoch: Iteration # 700 error is: 0.18031
Epoch: Iteration # 800 error is: 0.1802304
Epoch: Iteration # 900 error is: 0.1802297
Epoch: Iteration # 1000 error is: 0.1801049
Performing Kmeans.
Performing t-SNE.
Epoch: Iteration # 100 error is: 12.70413
Epoch: Iteration # 200 error is: 0.2041319
Epoch: Iteration # 300 error is: 0.1891918
Epoch: Iteration # 400 error is: 0.1856234
Epoch: Iteration # 500 error is: 0.1847919
Epoch: Iteration # 600 error is: 0.1842574
Epoch: Iteration # 700 error is: 0.1838822
Epoch: Iteration # 800 error is: 0.183603
Epoch: Iteration # 900 error is: 0.1833924
Epoch: Iteration # 1000 error is: 0.1832294
4 clusters detected
cluster 1 -> 27 cells
cluster 2 -> 172 cells
cluster 3 -> 76 cells
cluster 4 -> 125 cells
We set the method argument to simlr
, which caused the SIMLR() function of the SIMLR package [1] to be used. The SIMLR_Estimate_Number_of_Clusters() function determined the number of clusters, between 2 and n (n=10 above).
Next, differentially expressed genes in one cluster compared to the others are identified using the cluster_analysis() function, which relies on edgeR. A result table is automatically created in the cluster-analysis folder:
> clust.ana <- cluster_analysis(data = data, genes = rownames(data), cluster = clust$cluster)
edgeR differential gene expression (dge) processing:
Looking for differentially expressed genes in cluster 1
Looking for differentially expressed genes in cluster 2
Looking for differentially expressed genes in cluster 3
Looking for differentially expressed genes in cluster 4
Once the preliminary steps illustrated above are completed, SingleCellSignalR can be used to generate cellular interaction lists using the cell_signaling() function:
> signal <- cell_signaling(data = data, genes = rownames(data), cluster = clust$cluster)
Paracrine signaling:
Checking for signaling between cell types
9 interactions from cluster 1 to cluster 2
35 interactions from cluster 1 to cluster 3
3 interactions from cluster 1 to cluster 4
16 interactions from cluster 2 to cluster 1
56 interactions from cluster 2 to cluster 3
7 interactions from cluster 2 to cluster 4
8 interactions from cluster 3 to cluster 1
13 interactions from cluster 3 to cluster 2
12 interactions from cluster 3 to cluster 4
43 interactions from cluster 4 to cluster 1
39 interactions from cluster 4 to cluster 2
89 interactions from cluster 4 to cluster 3
An intercellular network can also be generated to map the overall ligand/receptor interactions invoking the inter_network() function:
> inter.net <- inter_network(data = data$complete.dataset, signal = signal, genes = genes, cluster = cluster)
Doing cluster 1 and cluster 2 ... OK
Doing cluster 1 and cluster 3 ... OK
Doing cluster 1 and cluster 4 ... OK
Doing cluster 2 and cluster 1 ... OK
Doing cluster 2 and cluster 3 ... OK
Doing cluster 2 and cluster 4 ... OK
Doing cluster 3 and cluster 1 ... OK
Doing cluster 3 and cluster 2 ... OK
Doing cluster 3 and cluster 4 ... OK
Doing cluster 4 and cluster 1 ... OK
Doing cluster 4 and cluster 2 ... OK
Doing cluster 4 and cluster 3 ... OK
At this point the intercellular network have been generated and exported in text and graphML formats in the networks folder.
A summary of the interactions between cell clusters can be output in the form of a chord diagram by the visualize() function:
> visualize(inter = signal)
This function will create a plot in the R plot window.
The details of the interactions between two clusters, for example cluster 1 and 2, can also be shown in the plot window with the visualize() function. Note that in the example below we ask for the display of two pairs of cell clusters, pair 1 that contains interactions from cluster 1 to 2, and pair 4 from cluster 2 to 1. (names(signal)
returns the cell cluster names in each pair, see function visualize() details.)
> visualize(inter = signal,show.in=c(1,4))
And these plots can be saved into pdf files in the images folder using the write.in
argument of the visualize() function.
> visualize(inter = signal,write.in=c(1,4))
red
red
red
SingleCellSignalR package functions have many arguments parameters that can be changed by the user to fit her needs. Furthermore, several handy functions that were not illustrated above are provided to generate additional plots or reports.
red
red
red
Classifies cells using cell type specific markers.
cell_classifier(data, genes, markers = markers.default, tsne = NULL, plot.details = FALSE, write = TRUE, verbose = TRUE)
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
genes |
a character vector of HUGO official gene symbols of length n |
markers |
a data frame of cell type signature genes |
tsne |
(optional) a table of n rows and 2 columns with t-SNE projection coordinates for each cell |
plot.details |
a logical (if TRUE, then plots the number of cells attributed to one cell type, see below) |
write |
a logical |
verbose |
a logical |
The markers
argument must be a table with cell type gene signatures, one cell type in each column. The column names are the names of the cell types. SingleCellSignalR package markers.default table provides an example of this format.
If tsne
is not provided, then the function will just not display the cells on the t-SNE. Although t-SNE maps are widely used to display cells on a 2D projection, the user can provide any table with two columns and a number of rows equal to the number of columns of data
(e.g. the two first components of a PCA).
If plot.details
is TRUE, then the function plots the number of cells attributed to a single cell type as a function of the threshold applied to the normalized gene signature average.
If write
is TRUE, then the function writes four different text files. (1) The “raw classification matrix” provides the normalized average gene signature for each cell type in each individual cell, a number between 0 and 1. This matrix has one row per cell type and one column per cell, and the sum per column is 1. Row names are the cell type names (column names of the markers table) and the column names are the individual cell identifiers (column names of data
). (2) The “thresholded classification matrix”, which is obtained by eliminating all the values of the “raw classification matrix” that are below a threshold a*. In practice, a* is automatically determined by the function to maximize the number of cells that are assigned to a single cell type and all the cells (columns) assigned to 0 or >1 cell types are discarded. The number of cells assigned to a single type depending on a* can be plotted by using the parameter plot.details=TRUE
. (3) A cluster vector assigning each cell to a cell type. Note that a supplementary, virtual cluster is created to collect all the cells assigned to 0 or >1 types. This virtual cluster is named “undefined”. (4) A table associating each cell type to a cluster number in the cluster vector.
The function returns a list containing the thresholded table, the maximum table, the raw table, a cluster vector and the cluster names. The maximum table is a special thresholded table where in every column only the maximum gene signature is kept. It can be used to force the classification of every cell.
data <- matrix(runif(1000,0,1),nrow=50,ncol=20)
rownames(data) <- paste("gene",1:50)
markers <- matrix(paste("gene",1:10),ncol=5,nrow=2)
colnames(markers) <- paste("type",1:5)
cell_classifier(data,rownames(data),markers)
Computes “autocrine” or “paracrine” interactions between cell clusters.
cell_signaling(data, genes, cluster, int.type = c("paracrine", "autocrine"), c.names = NULL, s.score = 0.5, logFC = NULL, species = c("homo sapiens", "mus musculus"), tol = 0, write = TRUE, verbose = TRUE)
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
genes |
a character vector of HUGO official gene symbols of length n |
cluster |
a numeric vector of length m |
int.type |
“autocrine” or “paracrine” |
c.names |
(optional) cluster names |
s.score |
a number between 0 and 1, the LRscore threshold |
logFC |
a number, the log fold-change threshold for differentially expressed genes |
species |
“homo sapiens” or “mus musculus” |
tol |
a tolerance parameter for balancing between “autocrine” and “paracrine” interactions |
write |
a logical |
verbose |
a logical |
int.type
must be equal to “paracrine” or “autocrine” exclusively. The “paracrine” option looks for ligands expressed in cluster A and their associated receptors according to LRdb that are expressed in any other cluster but A. These interactions are labelled “paracrine”. The interactions that involve a ligand and a receptor, both differentially expressed in their respective cell clusters according to the edgeR analysis performed by the cluster_analysis() function, are labelled “specific”.
The “autocrine” option searches for ligands expressed in cell cluster A and their associated receptors also expressed in A. These interactions are labelled “autocrine”. Additionally, it searches for those associated receptors in the other cell clusters (not A) to cover the part of the signaling that is “autocrine” and “paracrine” simultaneously. These interactions are labelled “autocrine/paracrine”.
The tol
argument allows the user to tolerate a fraction of the cells in cluster A to express the receptors in case int.type="paracrine"
, that is to call paracrine interactions that are dominantly paracrine though not exclusively. Conversely, it allows the user to reject interactions involving receptors that would be expressed by a small fraction of cluster A cells in case int.type="autocrine"
. By construction the association of these two options covers all the possible interactions and increasing the tol
argument allows the user to move interactions from “autocrine” to “paracrine”.
If the user does not set c.names
, the clusters will be named from 1 to the maximum number of clusters (cluster 1, cluster 2, …). The user can exploit the c.names
vector in the list returned by the cell_classifier() function for this purpose. The user can also provide her own cluster names.
s.score
is the threshold on the LRscore. The value must lie in the [0;1] interval, default is 0.4 to ensure confident ligand-receptor pair identifications (see our publication). Lower values increase the number of putative interactions while increasing the false positives. Higher values do the opposite.
logFC
is a threshold applied to the log fold-change (logFC) computed for each gene during the differential gene expression analysis. Its default value is log2(1.5) It further selects the differentially expressed genes (>logFC) after the p-value threshold imposed in the function cluster_analysis() below.
species
must be equal to “homo sapiens” or “mus musculus”, default is “homo sapiens”. In the case of mouse data, the function converts mouse genes in human orthologs (according to Ensembl) such that LRdb can be exploited, and finally output genes are converted back to mouse.
If write
is TRUE, then the function writes a text file that reports the interactions in the cell-signaling folder. This file is a 4-column table: ligands, receptors, interaction types (“paracrine”, “autocrine”, “autocrine/paracrine” and “specific”), and the associated LRscore.
Remarks: this function can be used with any data
table associated with corresponding genes
and cluster
vectors, meaning that advanced users can perform their own data normalization and cell clustering upfront. In case, the function cluster_analysis() was not executed, this function would work but “specific” interactions would not be annotated as such.
The function returns “autocrine” or “paracrine” interaction lists.
data=matrix(runif(1000,0,1),nrow=5,ncol=200)
rownames(data) = c("A2M","LRP1","AANAT","MTNR1A","ACE")
cluster=c(rep(1,100),rep(2,100))
cell_signaling(data,rownames(data),cluster,int.type="paracrine",write=FALSE)
Analysis of the differentially expressed genes in the clusters and their composition by a marker based approach.
cluster_analysis(data, genes, cluster, c.names = NULL, dif.exp = TRUE, s.pval = 10^-2, markers = NULL, write = TRUE, verbose = TRUE)
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
genes |
a character vector of HUGO official gene symbols of length n |
cluster |
a numeric vector of length m |
c.names |
a vector of cluster names |
dif.exp |
a logical (if TRUE, then computes the differential gene expression between the clusters using edgeR) |
s.pval |
a value, a fixed p-value threshold |
markers |
a table of cell type signature genes |
write |
a logical |
verbose |
a logical |
If dif.exp
is TRUE, then the function uses edgeR functions glmFit() and glmRT() to find differentially expressed genes between one cluster and all the other columns of data
. If dif.exp
is FALSE, then the function skips the differential gene analysis.
If the user does not set c.names
, the clusters will be named from 1 to the maximum number of clusters (cluster 1, cluster 2, …). The user can exploit the c.names
vector in the list returned by the cell_classifier() function for this purpose. The user can also provide her own cluster names.
s.pval
is the adjusted (Benjamini-Hochberg) p-value threshold imposed to gene differential expression.
If markers
is set, it must be a table with gene signatures for one cell type in each column. The column names are the names of the cell types. If markers
is not provided, then the function skips the cluster cell type calling step.
If write
and dif.exp
are both TRUE, then the function writes a text file named “table_dge_X.txt”, where X is the cluster name, that contains the list of differentially expressed genes. If write
is TRUE and markers
is provided, then the function writes in a second text file a table containing probabilities of assignments of each cluster to a cell type for each cell cluster. This cell type calling is performed as for the individual cells without thresholding but based on the cluster average transcriptome.
Remark: this function can be used with any data
table associated with corresponding genes
and cluster
vectors, meaning that advanced users can perform their own data normalization and cell clustering upfront.
The function returns a list comprised of a table of differentially expressed genes, a table of cell types, and a table of cell cluster types.
data=matrix(runif(1000,0,1),nrow=5,ncol=200)
rownames(data) = c("A2M","LRP1","AANAT","MTNR1A","ACE")
cluster=c(rep(1,100),rep(2,100))
cluster_analysis(data,rownames(data),cluster,dif.exp=FALSE)
Identifies the cell clusters, i.e. the cell subpopulations.
clustering(data, n.cluster = 0, n = 10, method = c("simlr", "kmeans"), plot = TRUE, pdf = TRUE, write = TRUE)
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
n.cluster |
a number, an estimation of the ideal number of clusters is computed if equal to 0 |
n |
a number, the maximum to consider for an automatic determination of the ideal number of clusters |
method |
“kmeans” or “simlr” |
plot |
a logical |
pdf |
a logical |
write |
a logical |
If the user knows the number of clusters present in her data set, then n.cluster
can be set and the estimation of the number of clusters is skipped.
n
is the maximum number of clusters that the automatic estimation of the number of clusters will consider. It is ignored if n.cluster
is provided.
method
must be “simlr” or “kmeans” exclusively. If set to “simlr”, then the function uses the SIMLR() function (SIMLR package) to perform clustering. If set to “kmeans” the function will perform a dimensionality reduction by principal component analysis (PCA) followed by K-means clustering and 2-dimensional projection by t-distributed stochastic neighbor embedding (t-SNE). Regardless of the value of method
(“simlr” or “kmeans”), in case n.cluster
is not provided, then the function relies on the SIMLR_Estimate_Number_of_Clusters() function to determine the number of clusters, between 2 and n
.
If plot
is TRUE, then the function displays the t-SNE map with each cell colored according to the cluster it belongs to. If method
argument is “simlr”, then it further displays a heatmap of the similarity matrix calculated by the SIMLR() function.
If pdf
is TRUE, then the function exports the t-SNE plot in a pdf file in the images folder. The file is named “t-SNE_map-X.pdf”, where X is the method
argument.
If write
is TRUE, then the function writes two text files in the data folder. The first one is called “cluster-Y-X.txt”, containing the cluster vector assigning each cell of data
to a cluster. The second one is called “tsne-Y-X.txt”, containing the coordinates of each cell in the 2D t-SNE projection. “X” is the method
argument anf “Y” is the retained number of clusters.
The function returns a list containing a numeric vector specifying the cluster assignment for each cell, a 2D t-SNE projection, and the number of cells per cluster.
data <- matrix(runif(100000,0,1),nrow=500,ncol=200)
clustering(data,20,2)
Prepares the data for further analysis.
data_prepare(file, most.variables = 0, lower = 0, upper = 0, lognorm = TRUE, write = TRUE, show.hist.cols = FALSE, verbose = TRUE)
Argument | Description |
---|---|
file |
a string for the scRNAseq data file |
most.variables |
a number |
lower |
a number in [0,1], low quantile threshold |
upper |
a number in [0,1], high quantile threshold |
normalize |
a logical, if TRUE, then computes 99th percentile normalization |
write |
a logical |
show.hist.cols |
a logical |
verbose |
a logical |
file
is the path to the file containing the read or UMI count matrix the user wants to analyze.
most.variables
can be set to N to select the Nth most variables genes. This option allows the user to use a reduced matrix (N x number of cells) to perform the clustering step faster.
lower
and upper
are used to remove the genes whose average counts are outliers. The values of these arguments are fractions of the total number of genes and hence must be between 0 and 1. Namely, if lower = 0.05
, then the function removes the 5% less expressed genes and if upper = 0.05
, then the function removes the 5% most expressed genes.
If normalize
is FALSE, then the function skips the 99th percentile normalization and the log transformation
If write
is TRUE, then the function writes two text files. One for the normalized and gene thresholded read counts table and another one for the genes that passed the lower and upper threshold. Note that the length of the genes vector written in the genes.txt file is equal to the number of rows of the table of read counts written in the data.txt file.
The function returns a data frame of filtered and/or normalized data with genes as row names.
data_prepare("~/scRNAseq_dataset.txt")
Displays the level of expression of a gene in each cell on the 2D projected data.
expression.plot(data, name, tsne, colors = c("default", "rainbow", "heat"))
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
name |
the identifier of the gene of interest |
tsne |
a table of n rows and 2 columns with 2D projection coordinates for each cell |
colors |
“default” returns the default colorpanel, also accepts “rainbow” or “heat” |
This function displays the expression level of a gene of interest on a 2D projection.
name
can be any character that corresponds to a row name of data
.
tsne
corresponds to the 2D coordinates for each cell. Although t-SNE maps are widely used to display cells on a 2D projection, the user can provide any table with two columns and a number of rows equal to the number of columns of data (i.e. the two first components of a PCA).
colors
must be “default”, “rainbow” or “heat” exclusively. “rainbow” and “heat” are the color palettes provided in R.
The function returns a R plot.
data = matrix(runif(5,0,1),ncol=5)
data[2] = data[5] = 0
rownames(data) = "gene 1"
tsne = matrix(runif(10,0,1),ncol=2)
expression.plot(data,"gene 1",tsne)
Displays the level of expression of two genes in each cell on the 2D projected data.
expression.plot.2(data, name.1, name.2, tsne)
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
name.1 |
the identifier of the first gene of interest |
name.2 |
the identifier of the second gene of interest |
tsne |
a table of n rows and 2 columns with t-SNE projection coordinates for each cell |
This function can be used independantly from any other. It displays the expression level of two genes of interest on a 2D projection.
name.1
and name.2
can be any characters that correspond to a row name of data
.
The function returns a R plot.
data = matrix(runif(10,0,1),ncol=5)
data[1,1:3] = 0
data[2,4:5] = 0
rownames(data) = c("gene 1","gene 2")
tsne = matrix(runif(10,0,1),ncol=2)
expression.plot.2(data, "gene 1", "gene 2", tsne)
Computes intercellular gene networks.
inter_network = function(data,genes,cluster,signal,c.names=NULL,species=c("homo sapiens","mus musculus"),write=TRUE,verbose=TRUE)
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
genes |
a character vector of HUGO official gene symbols of length n |
cluster |
a numeric vector of length m |
signal |
(optional) a list (result of the cell_signaling() function) |
c.names |
(optional) cluster names |
species |
“homo sapiens” or “mus musculus” |
write |
a logical (if TRUE writes graphML and text files for the interface network) |
verbose |
a logical |
signal
is a list containing the cell-cell interaction tables. It is the result of the cell_signaling() function.
If the user does not set c.names
, the clusters will be named from 1 to the maximum number of clusters (cluster 1, cluster 2, …). The user can exploit the c.names
vector in the list returned by the cell_classifier() function for this purpose. The user can also provide her own cluster names.
species
must be equal to “homo sapiens” or “mus musculus”. In the case of mouse data, the function converts mouse genes in human orthologs (according to Ensembl) such that the Reactome/KEGG interaction database can be exploited, and finally output genes are converted back to mouse.
If write
is TRUE, then the function writes four different files. A graphML file in the cell-signaling folder for intercellular interactions between each pair of clusters named “intercell_network_Z1-Z2.graphml”, where Z1 and Z2 are the c.names of the clusters. A graphML file in the cell-signaling folder that contains a compilation of all the intercellular, ligand-receptor interactions named “full-intercellular-network.graphml”. A text and a graphML file in the networks folder containing the intracellular network for each cell cluster named “intracell_network_Z.txt” and “intracell_network_Z.graphml”, where Z is the c.names of the cluster.
The function returns a list containing the tables of interaction between two cell types and the table for the full network of all the cell types.
m = data.frame(cell.1=runif(10,0,2),cell.2=runif(10,0,2),cell.3=runif(10,0,2),
cell.4=runif(10,0,2),cell.5=runif(10,0,2),cell.6=runif(10,0,2),cell.7=runif(10,0,2))
rownames(m) = paste("gene", 1:10)
cluster = c(1,1,1,2,3,3,2)
inter_network(m,rownames(m),cluster,signal=NULL)
Computes intracellular networks linked to genes of interest.
intra_network = function(goi,data,genes,cluster,coi,cell.prop=0.2,c.names=NULL,signal=NULL,write=TRUE,plot=TRUE,add.lig=TRUE,species=c("homo sapiens","mus musculus"),connected=FALSE,verbose=TRUE)
Argument | Description |
---|---|
goi |
a vector containing genes of interest (typically one or more receptors) |
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
genes |
a character vector of HUGO official gene symbols of length n |
cluster |
a numeric vector of length m |
coi |
name of the cluster of interest |
cell.prop |
a threshold, only the genes expressed in this proportion of the cells of the coi will be taken into account |
c.names |
(optional) cluster names |
signal |
(optional) a list (result of the cell_signaling() function) |
species |
“homo sapiens” or “mus musculus” |
write |
a logical (if TRUE writes graphML and text files for the internal network) |
plot |
a logical |
add.lig |
a logical (if TRUE adds the goi associated ligands from signal to the network) |
connected |
a logical (if TRUE keeps only the genes connected to the goi) |
verbose |
a logical |
signal
is a list containing the cell-cell interaction tables. It is the result of the cell_signaling() function.
cell.prop
is set to 0.2 by default to avoid unreadable downstream networks. However if the calculated network too small or non-existent (or too big) the user can try lower (or higher) values.
If the user does not set c.names
, the clusters will be named from 1 to the maximum number of clusters (cluster 1, cluster 2, …). The user can exploit the c.names
vector in the list returned by the cell_classifier() function for this purpose. The user can also provide her own cluster names.
species
must be equal to “homo sapiens” or “mus musculus”. In the case of mouse data, the function converts mouse genes in human orthologs (according to Ensembl) such that the Reactome/KEGG interaction database can be exploited, and finally output genes are converted back to mouse.
If write
is TRUE, then the function writes four different files. A graphML file in the cell-signaling folder for intercellular interactions between each pair of clusters named “intercell_network_Z1-Z2.graphml”, where Z1 and Z2 are the c.names of the clusters. A graphML file in the cell-signaling folder that contains a compilation of all the intercellular, ligand-receptor interactions named “full-intercellular-network.graphml”. A text and a graphML file in the networks folder containing the intracellular network for each cell cluster named “intracell_network_Z.txt” and “intracell_network_Z.graphml”, where Z is the c.names of the cluster.
The function returns a list of tables, one for each element in goi.
m = data.frame(cell.1=runif(10,0,2),cell.2=runif(10,0,2),cell.3=runif(10,0,2),
cell.4=runif(10,0,2),cell.5=runif(10,0,2),cell.6=runif(10,0,2),cell.7=runif(10,0,2))
rownames(m) = paste("gene", 1:10)
cluster = c(1,1,1,2,3,3,2)
inter_network(m,rownames(m),cluster,signal=NULL)
Provides a table of cell type specific markers.
markers(category = c("immune", "tme", "melanoma", "bc"))
Argument | Description |
---|---|
category |
one or several of the following “immune”, “tme”, “melanoma”, “bc” |
To use this function the user must have some knowledge about the cell composition of her data set. For instance, if the dataset comes from a breast cancer tumor, the user may select “bc”, but also “immune” and “tme” for cells of the microenvironment (see Examples of use below).
The function returns a cell type specific markers table.
markers(c("immune","bc"))
Displays a heatmap showing the most variable interactions over all clusters.
mv_interactions = function(data,genes,cluster,c.names=NULL,n=30,species=c("homo sapiens","mus musculus"))
Argument | Description |
---|---|
data |
a data frame of n rows (genes) and m columns (cells) of read or UMI counts (note : rownames(data)=genes) |
genes |
a character vector of HUGO official gene symbols of length n |
cluster |
a numeric vector of length m |
c.names |
(optional) cluster names |
n |
an integer, the number of most variables interactions |
species |
“homo sapiens” or “mus musculus” |
To use this function the user must have some knowledge about the cell composition of her data set. For instance, if the dataset comes from a breast cancer tumor, the user may select “bc”, but also “immune” and “tme” for cells of the microenvironment (see Examples of use below).
The function returns a cell type specific markers table.
markers(c("immune","bc"))
Creates chord diagrams from the interactions tables.
visualize(inter, show.in = NULL, write.in = NULL, write.out = FALSE, method = "default", limit = 30)
Argument | Description |
---|---|
inter |
a list of data frames result of the cell_signaling() function |
show.in |
a vector of which elements of inter must be shown |
write.in |
a vector of which elements of inter must be written |
write.out |
a logical |
method |
a string (usually relative to the experiment) |
limit |
a value between 1 and number of interactions |
show.in
gives the elements of inter
to be displayed in the plot window.
write.in
gives the elements of inter
to be written as pdf files in the images folder.
If write.out
is TRUE, then the function writes a pdf file with a summary of the all the interactions of inter
as a chord diagram.
limit
is the maximum number of interactions displayed on one chord diagram. Raising this limit over 30 may decrease the visibility.
The function returns images in the plot window of Rstudio and images in the pdf format in the images folder.
int.1 = matrix(c("gene 1","gene 1", "gene 2", "gene 3"),ncol=2)
colnames(int.1) = c("cluster 1","cluster 2" )
int.2 = matrix(c("gene 1","gene 4","gene 4","gene 2","gene 3","gene 3"),ncol=2)
colnames(int.2) = c("cluster 1","cluster 3" )
inter = list(int.1,int.2)
visualize(inter)
red
red
red
After running the example in the Quick Start section, the user can define cell clusters after the output of the cell_classifier(). The demo data set is comprised of a subset of the 10x PBMC dataset [3], i.e. immune cells. The t-SNE map calculated with the clustering() function will also be used. For this example we will set the plot.details
argument to TRUE to monitor the choice of the threshold of gene signature scores.
> class = cell_classifier(data=data, genes=rownames(data), markers = markers(c("immune")), tsne=clust$`t-SNE`,plot.details=TRUE)
A treshold of 47 % maximizes the number of cells identified to one cell type
23 cell(s) or 5.8 % identified to 0 cell type(s)
370 cell(s) or 92.5 % identified to 1 cell type(s)
7 cell(s) or 1.8 % identified to 2 cell type(s)
Cell_type Number_of_cells
cluster 1 T-cells 126
cluster 2 B-cells 35
cluster 3 Macrophages 57
cluster 4 Cytotoxic cells 114
cluster 5 Neutrophils 35
cluster 6 Treg 3
cluster 7 undefined 30
Let us use the cell clustering obtained with the cell_classifier() function. Although “undefined” cells may be interesting in some cases, here they form a heterogeneous cluster because they represent cells that seem to be in a transition between two states (“T-cells” and “Cytotoxic cells”, or “Neutrophils” and “Macrophages”, see heatmap above). We discard these cells.
> # Define the cluster vector and the cluster names
> cluster <- class$cluster
> c.names <- class$c.names
> # Remove undefined cells
> data <- data[,cluster!=(max(cluster))]
> tsne <- clust$`t-SNE`[cluster!=(max(cluster)),]
> c.names <- c.names[-max(cluster)]
> cluster <- cluster[cluster!=(max(cluster))]
Then the analysis can be carried on.
> clust.ana <- cluster_analysis(data = data, genes = rownames(data), cluster = cluster, c.names = c.names)
edgeR differential gene expression (dge) processing:
Looking for differentially expressed genes in T-cells
Looking for differentially expressed genes in B-cells
Looking for differentially expressed genes in Macrophages
Looking for differentially expressed genes in Cytotoxic cells
Looking for differentially expressed genes in Neutrophils
Looking for differentially expressed genes in Treg
No differentially expressed genes in Treg
Once the cluster analysis is done, the cell_signaling(), inter_network() functions can be used.
> signal <- cell_signaling(data = data, genes = genes, cluster = cluster, c.names = c.names)
No such file as table_dge_Treg.txt in the cluster-analysis folder
Paracrine signaling:
Checking for signaling between cell types
29 interactions from T-cells to B-cells
55 interactions from T-cells to Macrophages
27 interactions from T-cells to Cytotoxic cells
49 interactions from T-cells to Neutrophils
1 interactions from T-cells to Treg
23 interactions from B-cells to T-cells
70 interactions from B-cells to Macrophages
28 interactions from B-cells to Cytotoxic cells
71 interactions from B-cells to Neutrophils
5 interactions from B-cells to Treg
20 interactions from Macrophages to T-cells
29 interactions from Macrophages to B-cells
29 interactions from Macrophages to Cytotoxic cells
7 interactions from Macrophages to Neutrophils
14 interactions from Macrophages to Treg
17 interactions from Cytotoxic cells to T-cells
21 interactions from Cytotoxic cells to B-cells
63 interactions from Cytotoxic cells to Macrophages
60 interactions from Cytotoxic cells to Neutrophils
3 interactions from Cytotoxic cells to Treg
28 interactions from Neutrophils to T-cells
28 interactions from Neutrophils to B-cells
15 interactions from Neutrophils to Macrophages
43 interactions from Neutrophils to Cytotoxic cells
14 interactions from Neutrophils to Treg
56 interactions from Treg to T-cells
62 interactions from Treg to B-cells
102 interactions from Treg to Macrophages
73 interactions from Treg to Cytotoxic cells
95 interactions from Treg to Neutrophils
> inter.net <- inter_network(data = data, signal = signal, genes = genes, cluster = cluster)
Doing T-cells and B-cells ... OK
Doing T-cells and Macrophages ... OK
Doing T-cells and Cytotoxic cells ... OK
Doing T-cells and Neutrophils ... OK
Doing T-cells and Treg ... OK
Doing B-cells and T-cells ... OK
Doing B-cells and Macrophages ... OK
Doing B-cells and Cytotoxic cells ... OK
Doing B-cells and Neutrophils ... OK
Doing B-cells and Treg ... OK
Doing Macrophages and T-cells ... OK
Doing Macrophages and B-cells ... OK
Doing Macrophages and Cytotoxic cells ... OK
Doing Macrophages and Neutrophils ... OK
Doing Macrophages and Treg ... OK
Doing Cytotoxic cells and T-cells ... OK
Doing Cytotoxic cells and B-cells ... OK
Doing Cytotoxic cells and Macrophages ... OK
Doing Cytotoxic cells and Neutrophils ... OK
Doing Cytotoxic cells and Treg ... OK
Doing Neutrophils and T-cells ... OK
Doing Neutrophils and B-cells ... OK
Doing Neutrophils and Macrophages ... OK
Doing Neutrophils and Cytotoxic cells ... OK
Doing Neutrophils and Treg ... OK
Doing Treg and T-cells ... OK
Doing Treg and B-cells ... OK
Doing Treg and Macrophages ... OK
Doing Treg and Cytotoxic cells ... OK
Doing Treg and Neutrophils ... OK
If we take a look at signal[[5]]
(or signal[["B-cells-T-cells"]]
)
> signal[[5]]
2510 PTMA VIPR1 paracrine 0.7974090
235 B2M KLRD1 paracrine 0.7327886
2954 UBA52 ACVR1 paracrine 0.7283035
1473 GNAS VIPR1 paracrine 0.7252499
1540 HLA-B KLRD1 paracrine 0.6973507
3245 CD47 SIRPG paracrine 0.6765857
1442 GNAI2 IGF1R paracrine 0.6740132
1471 GNAS PTGIR paracrine 0.6590192
2964 UBA52 NOTCH1 paracrine 0.6568902
1722 IL16 CD4 paracrine 0.6528601
1558 HLA-E KLRD1 paracrine 0.6520872
2900 TNFSF12 TNFRSF25 paracrine 0.6478817
404 CALM1 VIPR1 paracrine 0.6432291
234 B2M KLRC2 paracrine 0.6180494
1545 HLA-C DDR1 paracrine 0.5980521
228 B2M HFE paracrine 0.5940835
1559 HLA-E SLC16A4 paracrine 0.5931120
2599 RPS27A TLR4 paracrine 0.5733684
405 CALM2 ABCA1 paracrine 0.5716807
190 ARF1 PLD2 paracrine 0.5610602
364 CALM1 ABCA1 paracrine 0.5425848
1589 HSP90B1 LRP1 paracrine 0.5305076
1557 HLA-E KLRC2 paracrine 0.5251504
We can be interested in genes participating in pathways with a receptor of interest inside a cluster of interest. Let us say PTGIR in “T-cells”.
> intra = intra_network(goi = "PTGIR",data = data,genes = genes,cluster = cluster, coi = "T-cells", c.names = c.names, signal = signal)
Patwhay(s) that include ADCY9:
- Rap1 signalling
- Adenylate cyclase inhibitory pathway
- Glucagon signaling in metabolic regulation
- OAS antiviral response
Now, let us take an overview of the signaling between the cell types.
> visualize(signal)
Let us get deeper and look at the signaling between “T-cells” and “B-cells” for example.
> visualize(signal, show.in=c(1,6))
The following command will save these plots in the images folder.
> visualize(signal, write.in=c(1,6))
For this example we use the scRNAseq dataset from Tirosh et al. [4]. We use only the data from patient 80.
> file <- "patient_80.txt"
> data <- data_prepare(file = file)
log-Normalization
19452 genes
480 cells
Zero rate = 75.5%
Remark: One can notice that the zero rate is lower than in the previous example which reflects the fact that the sequencing is deeper.
We know that this dataset is composed of melanoma cells and their microenvironment, we hence define our markers table using the markers() function.
> my.markers <- markers(category = c("immune", "tme", "melanoma"))
> head(my.markers)
T-cells B-cells Macrophages Cytotoxic cells DC Mast cells Neutrophils NK cells Treg Endothelial cells CAFs melanoma
1 CD2 CD19 CD163 PRF1 CCL13 TPSB2 FPR1 XCL1 FOXP3 PECAM1 FAP MIA
2 CD3D CD79A CD14 GZMA CD209 TPSAB1 SIGLEC5 XCL2 VWF THY1 TYR
3 CD3E CD79B CSF1R GZMB HSD11B1 CPA3 CSF3R NCR1 CDH5 DCN SLC45A2
4 CD3G BLK C1QC NKG7 MS4A2 FCAR KIR2DL3 CLDN5 COL1A1 CDH19
5 CD8A MS4A1 VSIG4 GZMH HDC FCGR3B KIR3DL1 PLVAP COL1A2 PMEL
6 SIRPG BANK1 C1QA KLRK1 CEACAM3 KIR3DL2 ECSCR COL6A1 SLC24A5
Let us perform the clustering. For this example, we set the method argument to “kmeans” and the n argument to 12.
> clust <- clustering(data = data,n = 12, method = "kmeans")
Estimating the number of clusters
Estimated number of clusters = 6
6 clusters detected
cluster 1 -> 157 cells
cluster 2 -> 15 cells
cluster 3 -> 137 cells
cluster 4 -> 6 cells
cluster 5 -> 114 cells
cluster 6 -> 51 cells
Now we take advantage of the markers
argument of the cluster_analysis() function using my.markers obtained above with the markers() function.
> clust.ana <- cluster_analysis(data = data, genes = rownames(data), cluster = clust$cluster, markers = my.markers)
edgeR differential gene expression (dge) processing:
Looking for differentially expressed genes in cluster 1
Looking for differentially expressed genes in cluster 2
Looking for differentially expressed genes in cluster 3
Looking for differentially expressed genes in cluster 4
Looking for differentially expressed genes in cluster 5
Looking for differentially expressed genes in cluster 6
We can see that the clusters 2 and 5 are well defined, they are respectively cancer associated fibroblasts (CAFs) and melanoma cells. The cluster 6 is also clearly composed of endothelial cells. Clusters 1 and 2 are immune cells but the clustering did not succeed in sorting them correctly and cluster 4 counts only 6 cells. Those do not seem to be homogeneous and we decide to remove them.
> data <- data[,clust$cluster!=4]
> cluster <- clust$cluster[clust$cluster!=4]
> cluster[cluster>4] <- cluster[cluster>4] - 1
Then we can name our clusters manually before pursuing the analysis.
> c.names <- c("Immune 1", "CAFs", "Immune 2", "melanoma", "endothelial")
> signal <- cell_signaling(data = data, genes = rownames(data), cluster = cluster, c.names = c.names)
Paracrine signaling:
Checking for signaling between cell types
78 interactions from Immune 1 to CAFs
30 interactions from Immune 1 to melanoma
11 interactions from Immune 1 to endothelial
67 interactions from CAFs to Immune 1
85 interactions from CAFs to Immune 2
86 interactions from CAFs to melanoma
78 interactions from CAFs to endothelial
84 interactions from Immune 2 to CAFs
55 interactions from Immune 2 to melanoma
44 interactions from Immune 2 to endothelial
12 interactions from melanoma to Immune 1
33 interactions from melanoma to CAFs
19 interactions from melanoma to Immune 2
12 interactions from melanoma to endothelial
53 interactions from endothelial to Immune 1
33 interactions from endothelial to CAFs
69 interactions from endothelial to Immune 2
28 interactions from endothelial to melanoma
Remark: the names of the dge tables in the cluster_analysis folder must be changed according to the cluster names (c.names).
And now visualize!
> visualize(signal, show.in = c(12,18))
Remark: We observe that in the chord diagrams above, the “specific” interactions were highlighted with a thick black line.
Let us look at one of these specific interactions using the expression.plot.2() function.
> expression.plot.2(data,"NID1","COL13A1",clust$`t-SNE`)
red
red
red
Thank you for reading this guide and for using SingleCellSignalR.
Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14:414-6.
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288-97.
8k PBMCs from a Healthy Donor [Internet]. 2017. Available from: https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k
Tirosh I, Izar B, Prakadan SM, Wadsworth MH, Treacy D, Trombetta JJ, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352:189-96.