{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework 3.1: The RNA Pol-II CTD and transcriptional bursting (80 pts)\n",
    "\n",
    "[Data set 1 download](https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_snapshot_parts.csv), [Data set 2 download](https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_frac_active_cells.csv)\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The largest subunit of RNA polymerase II (Pol II) has a disordered domain on its C-terminus, the so-called C-terminal domain, or CTD. The CTD consists of repeats of seven amino acids. The number of repeats ranges from five to sixty in various species. Each heptad is referred to as a CTD repeat, or CTDr. \n",
    "\n",
    "To investigate the effects of the number of CTDrs on transcriptional activity, Porfirio Quintero-Cadena and Paul Sternberg at Caltech, in collaboration with Tineke Lenstra at the Netherlands Cancer Institute, did a clever experiment ([Molec. Cell., 2020](https://doi.org/10.1016/j.molcel.2020.05.030)). Pol II in wild type budding yeast *Saccharomyces cerevisiae* contains twenty-six CTDrs. Quintero-Cadena generated *S. cerevisiae* strains with varying number of CTDrs, starting as low as eight (the minimum number necessary for transcription). They also inserted several copies of a sequence that forms RNA hairpins upon transcription in the 5′ untranslated region (UTR) of the Gal10 gene. They also engineered the cells to have nuclear-expressed PP7, which binds RNA hairpins. The PP7 is fused with GFP, so when the gene of interest is being transcribed, a fluorescent dot will appear in the nucleus of the cell. A brighter dot corresponds to more active transcription. A schematic of the setup is shown below, taken from [the paper](https://doi.org/10.1016/j.molcel.2020.05.030).\n",
    "\n",
    "\n",
    "<div style=\"margin: auto; width: 400px\">\n",
    "    \n",
    "![Quintero PP7 schematic](quintero_pp7.png)\n",
    "    \n",
    "</div>\n",
    "\n",
    "Conveniently, expression of Gal10 is induced by presence of galactose, enabling the experimenter to control when gene expression is turned on.\n",
    "\n",
    "\n",
    "**a)** In one experiment, Quintero-Cadena induced transcription using galactose and then took snapshots of the cells with a fluorescence microscope. He used digital image processing techniques to locate, characterize, and quantify dots. The results of the image acquisition and analysis may be found here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_snapshot_parts.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_snapshot_parts.csv). The data are tidy, and when you load the data frame, each row refers to a single dot in the image. Below is a brief description of the columns. (Many of the columns refer to parameters of the image acquisition and processing using [trackpy](http://soft-matter.github.io/trackpy/v0.4.2/).) \n",
    "\n",
    "| column       | content          |\n",
    "| ------------- |-------------:|\n",
    "| date      | date of the experiment |\n",
    "| ecc      |  eccentricity of the dot   |\n",
    "| ep | estimate of uncertainty in dot position      |\n",
    "| frame | which frame of the movie |\n",
    "| laser power | Laser power for image acquisition |\n",
    "| mass | integrated fluorescent intensity of dot |\n",
    "| mass_norm | fluorescent intensity of dot normalized against nuclear fluorescent intensity|\n",
    "| mov_name | name of movie snapshot was taken from |\n",
    "| nuc_fluor | fluorescence throughout the nucleus containing the dot|\n",
    "| particle | identifier of particle |\n",
    "| pid | tag for image processing ID |\n",
    "| raw_mass | total integrated intensity of the ROI|\n",
    "| roi | index of region of interest containing the dot |\n",
    "| signal | measure of how bright the dot is in bandpass-filtered image |\n",
    "| size | radius of gyration of dot in image |\n",
    "| strain | yeast strain |\n",
    "| traj_len | length of the trajectory tracing the dot |\n",
    "| x | x-position of center of dot in image |\n",
    "| y | y-position of center of dot in image |\n",
    "| corrwideal | correlation with ideal dot using a Gaussian process classifier |\n",
    "| time_postinduction | number of minutes after galactose induction |\n",
    "| CTDr | number of CTDrs in the strain |\n",
    "\n",
    "The columns of most interest to you are mass_norm, time_postinduction, and CTDr. The corrwideal is also important, since we do not want to consider spurious artifacts in the image. Quintero-Cadena only considered dots that had a correlation above 0.5 in his analysis. \n",
    "\n",
    "From this snapshot data set, make an informative plot or plots exploring how the fluorescent intensity varies with the number of CTD repeats. Be sure to comment on your findings. \n",
    "\n",
    "**b)** Quintero-Cadena took another perspective on these data. He took snapshots and determined how many cells in the field of view were actively transcribing the target gene. A cell was deemed to be active if it has a transcription site whose integrated normalized fluorescence (`mass_norm`) exceeded a threshold. Quintero-Cadena used a threshold of 7. \n",
    "\n",
    "You can download the data set with this analysis here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_frac_active_cells.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_frac_active_cells.csv)\n",
    "\n",
    "\n",
    "| Column       | content          |\n",
    "| ------------- |-------------:|\n",
    "| time_postinduction | number of minutes after galactose induction |\n",
    "| mov_name | name of movie snapshot was taken from |\n",
    "| strain | yeast strain |\n",
    "| no_TS | total number of transcription sites in the image|\n",
    "| no_cells | number of cells in the image |\n",
    "|frac_active| fraction of cells deemed active |\n",
    "| thresh | minimum number of dots required for a cell to be deemed active |\n",
    "| rep | replicate of the experiment |\n",
    "| date | date of the experiment |\n",
    "| CTDr | number of CTDrs in the strain |\n",
    "\n",
    "Use these data to make an informative plot or plots exploring the effect of the number of CTDrs and the time after induction on the activity of cells.\n",
    "\n",
    "\n",
    "_Porfirio Quintero-Cadena is a former student and TA of this course. He is also a believer in open access to (tidy) data. The data sets used in the paper are freely available and CC-0 licensed, which means we may use them completely freely. If you want people to learn more from your hard-earned data, follow Porfirio's example and **make them freely available.**_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br />"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}