{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 10 - Data frames and Statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import skimage.io as io" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Running a basic statistical analysis\n", "## 1.1. Introduction\n", "There are 1000 mouse femur bones which have been measured at high resolution and a number of shape analyses run on each sample. - Phenotypical Information - Each column represents a metric which was assessed in the images - CORT_DTO__C_TH for example is the mean thickness of the cortical bone." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.2 Data preparation\n", "\n", "## 1.2.1. Load data into frames\n", "For this example we will start with a fairly complicated dataset from a genetics analysis done at the Institute of Biomechanics, ETHZ." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pheno = pd.read_csv('phenoTable.csv')\n", "pheno.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Genetic Information (genoTable.csv)\n", "Each animal has been tagged at a number of different regions of the genome (called markers: D1Mit236)\n", "- At each marker there are 3 (actually 4) possibilities\n", "- A is homozygous (the same from both parents) from the A strain\n", "- B is homozygous from the B strain\n", "- H is heterozygous (one from A, one from B)\n", "- ‘-’ is missing or erronous measurements" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "geno = pd.read_csv('genoTable.csv')\n", "geno.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2.2. [Merge data frames](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)\n", "We want to merge the data set using the key 'ID'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.merge(pheno,geno, on='ID')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2.3. Rename a column\n", "The key 'FEMALE' is a boolean, we want to change the key name to 'GENDER' and the contents from true/false to 'F'/'M'. Here, a [lambda function](https://realpython.com/python-lambda/) is used for the mapping when we apply an operation to each row in the GENDER column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=df.rename(columns={\"FEMALE\": \"GENDER\"})\n", "df['GENDER'] = df['GENDER'].apply(lambda x: 'F' if x else 'M')\n", "\n", "df['GENDER'].sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3. Tasks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3.1 First inspection\n", "1. Look at the histograms of the available variables in the phenotype data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig,ax=plt.subplots(1,1,figsize=(15,15));\n", "pheno.hist(ax=ax,bins=50);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are far too many variables to work with. At least for a start. We have to focus on some few e.g.\n", "- Bone mineral density (BMD)\n", "- Cortical bone thickness (CORT_DTO_TH)\n", "- Cortical bone Microstructural thickness (CORT_DTO_TH_SD)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3.2 Look at the pair plot\n", "Explore the data and correlations between various metrics by using the ‘pairplot’ plotting component. Examine different variable combinations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(pheno, vars = ['BMD', 'CORT_DTO__C_TH', 'CORT_DTO__C_TH_SD']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. For the rest of the analysis you can connect the various components to the ‘Column Filter’ node since that is the last step in the processing\n", "4. Use one of the T-Test to see if there is a statistically significant difference between Gender’s when examining Cortical Bone Microstructural Thickness (Mean) ```CORT_DTO__C_TH```\n", " - Which value is the p-value?\n", " - What does the p-value mean, is it significant, by what criterion?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import ttest_ind\n", "\n", "# Extract the CORT_DTO__C_TH column and create a data frame for FEMALE and MALE respectively using the GENDER column\n", "male = df[insert your gender filters][select column to compare]\n", "female = df[insert your gender filters][select column to compare]\n", "\n", "ttest,pval = ttest_ind(female,male)\n", "\n", "print(\"p-value={:0.4f}\".format(pval))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Use another node from the Hypothesis Testing section to evaluate the effect on the D16Mit5 on the Lacuna Distribution Anisotropy? Is it significant?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Insert your test code here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3.3 Questions\n", "1. In the ‘Independent Groups T-Test’ node we can run a t-test against all of the columns at the same time, why SHOULDN’T we do this?\n", "2. If we do, how do we need to interpret this in the results\n", "3. Is p<0.05 a sufficient signifance criteria?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Comparing two real bone samples\n", "For this example we will compare two real cortical bone samples taken from mice.\n", "For the purpose of the analysis and keeping the data sizes small, we will use Anders' Crazy Camera for simulating the noisy detection process. \n", "\n", "The assignment aims to be more integrative and you will combine a number of different lectures to get to the final answer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imgA = (0.0 < io.imread('bone_7H3A_B1.tif')).astype(float)\n", "imgB = (0.0 < io.imread('bone_7H6A_B2.tif')).astype(float)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist([imgA.ravel(), imgB.ravel()],bins=2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the code to simulate a bad camera that produces bad images" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as numpy\n", "import skimage.filters as filters\n", "\n", "def camera(img,blurr=1.0,noise=0.1,illum=0.0) :\n", "\n", " res = filters.gaussian(img,sigma=blurr)\n", " res = res + np.random.normal(size=res.shape,loc=0,scale=noise)\n", "\n", " return res\n", "\n", "def crappyCamera(img) :\n", " return camera(img,blurr=2.0,noise=0.2, illum=0.0)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cimgA=crappyCamera(imgA)\n", "cimgB=crappyCamera(imgB)\n", "\n", "fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,7))\n", "ax1.imshow(cimgA[100])\n", "ax2.imshow(cimgB[100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Questions\n", "1. We want to know if there is a statistically significant difference in\n", " - cell volume\n", " - cell shape\n", " - cell density\n", " \n", " between the two samples given the variation in the detector\n", " 1. which metric do we need here? \n", " 2. why?\n", "2. We see in the volume comparison a very skewed representation of the data Volume/Num Pixels\n", " - why is this? (Hint check the segmented images)\n", " - What might be done to alleviate it (hint Row Filter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Hints\n", "1. Look at the kind of noise (you can peek inside the Crappy Camera) to choose the proper filter\n", "2. Use an automated thresholding technique for finding the bone automatic-methods\n", "3. To do this we will need to enhance the image, segment out the bone (dense) tissue, find the mask so that we can look at the cells.\n", "4. We then need to label the cells, and analyze their volume and shape: labeling\n", "5. Use Morphology and strongly filter parameters it might be possible to maximize the differences in the groups\n", "6. Create a data frame with columns for sample_id, item_id and metric. Look for ideas in the previous exercises and lectures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. T-Test Simulator\n", "This exercise (workflow named - Statistical Significance Hunter) shows the same results as we discussed in the lecture for finding p-values of significance. \n", "\n", "It takes a completely random stream of numbers with a mean 0.0 and tests them against a null hypothesis (that they equal 0) in small batches, you can adjust the size of the batches, the number of items and the confidence interval. The result is a pie chart showing the number of “significant” results found using the standard scientific criteria for common studies. \n", "\n", "### T-test for single value $\\mathcal{H}_0$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import ttest_1samp\n", "\n", "data= np.random.normal(size=(100,3),loc=0,scale=1.0)\n", "\n", "tset, pval = ttest_1samp(data, 0.0) # Test if data has average = 0\n", "\n", "print(\"p-values\",pval)\n", "\n", "reject = pval < 0.05\n", "print(\"Reject : \", reject)\n", "for idx,r in enumerate(reject) :\n", " if r : # alpha value is 0.05 or 5%\n", " print(\"[x] we are rejecting null hypothesis\")\n", " else:\n", " print(\"[v] we are accepting null hypothesis\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def statSignificanceHunter(batches=10,samples=10) :\n", " testLevels=[0.05,0.01]\n", " data= np.random.normal(size=(samples,batches),loc=0,scale=1.0)\n", "\n", " tset, pval = ttest_1samp(data, 0.0)\n", " \n", " counts=[np.sum(testLevels[0]<=pval),np.sum(pval