{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection Tutorial\n", "\n", "In this Jupyter notebook, we'll walk through the information-theoretic feature selection algorithms in PicturedRocks and demonstrate the interactive marker selection user interface.\n", "\n", "If you are viewing this notebook inside the PicturedRocks documentation, the interactive marker selection tool will not work (as it needs a python backend to perform the computations. You can download this notebook from [GitHub](https://raw.githubusercontent.com/umangv/picturedrocks/master/docs/FeatureSelectionTutorial.ipynb) and run in on your own computer to try the interactive tool.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scanpy as sc\n", "import picturedrocks as pr" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "... storing 'paul15_clusters' as categorical\n", "Trying to set attribute `.uns` of view, making a copy.\n" ] } ], "source": [ "adata = sc.datasets.paul15()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 2730 × 3451 \n", " obs: 'paul15_clusters'\n", " uns: 'iroot'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `process_clusts` method copies the cluster column and precomputes various indices, etc. If you have multiple columns that can be used as target labels (e.g., different treatments, clusters via different clustering algorithms or parameters, or demographics), this sets and processes the given columns as the one we're currently examining.\n", "\n", "This is necessary for supervised analysis and visualization tools in PicturedRocks that use cluster labels." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 2730 × 3451 \n", " obs: 'paul15_clusters', 'clust', 'y'\n", " uns: 'iroot', 'num_clusts', 'clusterindices'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pr.read.process_clusts(adata, \"paul15_clusters\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `make_infoset` method creates a `SparseInformationSet` object with a discretized version of the data matrix. It is useful to have only a small number of discrete states that each gene can take so that entropy is a reasonable measurement. By default, `make_infoset` performs an adaptive transform that we call a recursive quantile transform. This is implemented in `pr.markers.mutualinformation.infoset.quantile_discretize`. If you have a different discretization transformation, you can pass a transformed matrix directly to `SparseInformationSet`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "infoset = pr.markers.makeinfoset(adata, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this dataset only has 3451 features, it is computationally easy to do feature selection without restricting the number of features. If we wanted to, we could do either supervised or unsupervised univariate feature selection (i.e., without considering any interactions between features)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# supervised\n", "mim = pr.markers.mutualinformation.iterative.MIM(infoset)\n", "most_relevant_genes = mim.autoselect(1000)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# unsupervised\n", "ue = pr.markers.mutualinformation.iterative.UniEntropy(infoset)\n", "most_variable_genes = ue.autoselect(1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage we can slice our `adata` object as `adata[:,most_relevant_genes]` or `adata[:,most_variable_genes]` and create a new `InformationSet` object for this sliced object. We don't need to do that here since there are not a lot of genes but will do so anyway for demonstration purposes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supervised Feature Selection\n", "\n", "Let's jump straight into supervised feature selection. Here we will use the `CIFE` objective" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "adata_mr = adata[:,most_relevant_genes].copy()\n", "infoset_mr = pr.markers.makeinfoset(adata_mr, True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "cife = pr.markers.CIFE(infoset_mr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.95022366, 0.93749845, 0.88470651, 0.86819372, 0.8634894 ,\n", " 0.80903075, 0.75775072, 0.75361203, 0.71991963, 0.7106652 ,\n", " 0.70321104, 0.6821289 , 0.67109598, 0.65202536, 0.65192364,\n", " 0.6458561 , 0.64569101, 0.63526239, 0.62452935, 0.62346646])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cife.score[:20]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Mpo', 'Prtn3', 'Ctsg', 'Car2', 'Elane', 'Car1', 'Klf1', 'Blvrb',\n", " 'Ermap', 'Mt2'],\n", " dtype='object')\n" ] } ], "source": [ "top_genes = np.argsort(cife.score)[::-1]\n", "print(adata_mr.var_names[top_genes[:10]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's select 'Mpo'" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "ind = adata_mr.var_names.get_loc('Mpo')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "cife.add(ind)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the top genes are" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Car2', 'Car1', 'Gnb2l1', 'Fth1', 'Atpif1', 'AK158095', 'Ncl', 'Blvrb',\n", " 'Rpl4', 'Atp5b'],\n", " dtype='object')\n" ] } ], "source": [ "top_genes = np.argsort(cife.score)[::-1]\n", "print(adata_mr.var_names[top_genes[:10]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that the order has changed based on redundancy (or lack thereof) with 'Mpo'. Let's add 'Car1'" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "ind = adata_mr.var_names.get_loc('Car1')\n", "cife.add(ind)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Actb', 'Gpx1', 'Hsp90ab1', 'Ftl1', 'Ybx1', 'AK158095', 'Ncl', 'Rps3',\n", " 'hnRNP A2/B1', 'Tuba1b'],\n", " dtype='object')\n" ] } ], "source": [ "top_genes = np.argsort(cife.score)[::-1]\n", "print(adata_mr.var_names[top_genes[:10]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to select the top gene repeatedly, we can use `autoselect`" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "cife.autoselect(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To look at the markers we've selected, we can examine `cife.S`" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 5, 187, 23, 49, 931, 306]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cife.S" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Mpo', 'Car1', 'Actb', 'H2afy', 'Hsp90ab1', 'Gpr56', 'Ly6e'], dtype='object')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata_mr.var_names[cife.S]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User Interface\n", "\n", "This process can also done manually with a user-interface allowing you to incorporate domain knowledge in this process. Use the `View` dropdown to look at heatplots for candidate genes and already selected genes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalize per cell and log transform the data. We are doing this here only to generate familiar features. We do not recommend performing these transformations before `make_infoset`." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "sc.pp.normalize_per_cell(adata_mr)\n", "sc.pp.log1p(adata_mr)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Running tsne on cells...\n" ] } ], "source": [ "im = pr.markers.InteractiveMarkerSelection(adata_mr, cife, ['tsne', 'violin'])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3d46aa64ff394550b34b24a4d8c41da8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "im.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, that because we passed the same `cife` object, any genes added/removed in the interface will affect the `cife` object." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Mpo', 'Car1', 'Actb', 'H2afy', 'Hsp90ab1', 'Gpr56', 'Ly6e'], dtype='object')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata_mr.var_names[cife.S]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unsupervised Feature Selection\n", "\n", "This works very similarly. In the example below, we'll autoselect 5 genes and then run the interface. Note that although the previous section would not work without cluster labels, the following code will." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "cife_unsup = pr.markers.CIFEUnsup(infoset)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "cife_unsup.autoselect(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you ran the example above, this will load faster because the t_SNE coordinates for genes and cells have already been computed." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Running tsne on cells...\n" ] } ], "source": [ "im_unsup = pr.markers.interactive.InteractiveMarkerSelection(adata, cife_unsup, [\"tsne\"])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "15bf7c0b25aa43c6bcea74062b48347c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "im_unsup.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary Feature Selection\n", "\n", "We can also perform feature selection specifically for individual class labels (e.g., clusters). This is done by changing the `SparseInformationSet`'s `y` array. In the example below, we will target the class label \"2Ery\". Notice that the features selected by MIM (MIM doesn't consider redundancy) are only those that are informative about \"2Ery\" in particular.\n", "\n", "Binary (i.e., not multiclass) feature selection can be performed with any information-theoretic feature selection algorithm (e.g., CIFE, JMI, MIM)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# since we are changing y anyway, the value of include_y (True in the line below) doesn't matter\n", "infoset2 = pr.markers.makeinfoset(adata, True)\n", "infoset2.set_y((adata.obs['clust'] == '2Ery').astype(int).values)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "mim2 = pr.markers.mutualinformation.iterative.MIM(infoset2)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "im2 = pr.markers.interactive.InteractiveMarkerSelection(adata, mim2, [\"violin\"])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a782291121e9401b99b3bd3e12892abb", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "im2.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }