{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework 4.1: A dashboard for zebrafish sleep studies (100 pts)\n",
    "\n",
    "**[Dataset download 1](https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_genotypes.txt)**, **[Dataset download 2](https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_2B.csv)**\n",
    "\n",
    "<hr />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Colab setup ------------------\n",
    "import os, sys, subprocess\n",
    "if \"google.colab\" in sys.modules:\n",
    "    cmd = \"pip install --upgrade iqplot bebi103 colorcet watermark\"\n",
    "    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
    "    stdout, stderr = process.communicate()\n",
    "    data_path = \"https://s3.amazonaws.com/bebi103.caltech.edu/data/\"\n",
    "else:\n",
    "    data_path = \"../data/\"\n",
    "# ------------------------------\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import csv\n",
    "import datetime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr>\n",
    "\n",
    "Your task in this homework problem is to build a dashboard to explore data sets that come from instruments in [David Prober](http://proberlab.caltech.edu/)'s lab. The instrument takes movies of moving/sleeping zebrafish larvae. An example is shown [below](prober_fish.mp4).\n",
    "\n",
    "<div style=\"margin: auto; width: 500px;\">\n",
    "\n",
    "<video src=\"prober_fish.mp4\" style=\"width: 400px;\" controls>Your browser does not support display of this video.</video>\n",
    "\n",
    "</div>\n",
    "\n",
    "To understand how the data are acquired in their assays, here is the text from the Methods section of David Prober's original paper ([J. Neurosci., 2006](https://doi.org/10.1523/JNEUROSCI.4332-06.2006)) laying out their assay.\n",
    "\n",
    "> Larvae were raised on a 14/10 h light/dark (LD) cycle at 28.5°C. On the fourth day of development, single larva were placed in each of 80 wells of a 96-well plate (7701-1651; Whatman, Clifton, NJ), which allowed simultaneous tracking of each larva and prevented the larvae from interfering with the activity of each other. Locomotor activity was monitored for several days using an automated video-tracking system (Videotrack; ViewPoint Life Sciences, Montreal, Quebec, Canada) with a Dinion one-third inch Monochrome camera (model LTC0385; Bosch, Fairport, NY) fitted with a fixed-angle megapixel lens (M5018-MP; Computar) and infrared filter, and the movement of each larva was recorded using Videotrack quantization mode. The 96-well plate and camera were housed inside a custom-modified Zebrabox (ViewPoint Life Sciences) that was continuously illuminated with infrared lights and was illuminated with white lights from 9:00 A.M. to 11:00 P.M. The 96-well plate was housed in a chamber filled with circulating water to maintain a constant temperature of 28.5°C. The Videotrack threshold parameters for detection were matched to visual observation of the locomotion of single larva. The Videotrack quantization parameters were set as follows: detection threshold, 40; burst (threshold for very large movement), 25; freeze (threshold for no movement), 4; bin size, 60 s. The data were further analyzed using custom PERL software and Visual Basic Macros for Microsoft (Seattle, WA) Excel. Any 1 min bin with zero detectable movement was considered 1 min of rest because this duration of inactivity was correlated with an increased arousal threshold; a rest bout was defined as a continuous string of rest minutes. Sleep latency was defined as the length of time from lights out to the start of the first rest bout. An active minute was defined as a 1 min bin with any detectable activity. An active bout was considered any continuous stretch of 1 min bins with detectable movement.\n",
    "\n",
    "After performing an experiment with larvae, DNA is extracted from the larvae and sent to sequencing so the genotype of the fish in each well may be determined. The genotype data is then hand-entered into a spreadsheet and saved as a text file.\n",
    "\n",
    "The data you will work with in this homework come from a paper by Grigorios Oikonomou (a staff scientist in the Prober lab and a former student in this class). This data set was a set of early measurements that led him and his colleagues toward understanding how tryptophan hydroxylase (TPH) affect sleep patterns ([Neuron, 2019](https://doi.org/10.1016/j.neuron.2019.05.038)). To understand the experiment, you watch the [video abstract of the paper](https://www.cell.com/cms/10.1016/j.neuron.2019.05.038/attachment/bec5ed0b-0b38-4319-af8c-e397824c57e7/mmc3.mp4). The data set you will work with are like the *tph2* experiment that Grigorios discusses around the 1:50 mark of the video.\n",
    "\n",
    "You can access the data sets here:\n",
    "\n",
    "- [https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_genotypes.txt](https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_genotypes.txt)\n",
    "- [https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_2B.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_2B.csv)\n",
    "\n",
    "The file `150717_2A_genotype_3.txt` is a genotype file giving the genotypes of the fish in the different locations of the plate. Only the fish in instrument 2A, numbered 1 through 96 in the activity data file `150717_2A_2B.csv`, were genotyped. The fish in instrument 2B, numbered 97 and above, are not used in the assay. (The instrument has room for two 96 well-plates, and experiments are run in parallel. The experiment in well B of instrument 2 was by another researcher in the Prober lab.)\n",
    "\n",
    "The instrument give data as MS Excel spread sheets in an old format that Pandas has trouble reading. I have converted the Excel sheet to a CSV file, `150717_2A_2B.csv`, but have otherwise not touched it. (When building your dashboard, you may assume the data sets come in the formats of the `.txt` file above and the `.csv` file I just described. I think newer versions of the software for the instrument give CSV files.) \n",
    "\n",
    "To understand the quantitative measurements in this data set, it is important to know how the measurements are made. Images are acquired at a rate of 15 frames per second. Lights were off from 11 PM to 9 AM and on from 9 AM to 11 PM. The image is divided into 96 regions of interest (ROIs), one for each well in the 96-well plate. The detection threshold is a number between zero and 255. If a pixel changes its intensity by more than the *detection threshold* from one frame to the next, the pixel is considered to have changed between frames. If the total number of pixels in an ROI that have changed between frames is less than the *freeze threshold*, the fish is considered to be \"frozen,\" or not moving. If the total number of pixels that have changed between frames is greater than the *burst threshold*, the fish is considered to have undergone a \"burst,\" or rapid movement. If the number of pixels that changed is between the freeze and burst threshold, then the fish is said to have \"middle movement,\" meaning that it is moving, but not moving violently.\n",
    "\n",
    "Here are some useful Some comments on the content of the data set.\n",
    "\n",
    "1. The `'location'` column contains the number of the well preceded by the letter `'c'`.\n",
    "2. The `'animal'` column is redundant; the same information is contained in the `'location'` column.\n",
    "3. The `'user'` column is the account on the controlling computer. It is not used in the analysis.\n",
    "4. The `'sn'` and `'an'` columns are not used.\n",
    "4. Each row of the data set refers to a minute of video data acquisition. The `start` and `end` columns give the time from the beginning of the experiment of the measurement. The `stdate` and `sttime` give the start date and time according to the clocks in Pasadena.\n",
    "5. The columns `'frect'`, `'fredur'`, `'midct'`, `'middur'`, `'burct'`, and `'burdur'` contain information about the fish's activity per minute of observation. The prefixes `fre`, `mid`, and `bur` refer respectively to freeze, middle, and burst. The suffixes `ct` and `dur` refer to count and duration. The counts are the number of events observed in a given time interval. So, `burct` is the number of bursts observed in the time interval.\n",
    "\n",
    "Based on discussion I have had with Grigorios, he recommends considering moving vs. non-moving fish, meaning that the distinction between burst and middle movement is not very important, and summing `burdur` and `middur` is an appropriate metric for fish activity.\n",
    "\n",
    "As you are building your dashboard, think about what quantities to describe sleep that you want to compute from the data. You might look at Fig. 1 from the paper for ideas, but you can also come up with your own. This is the most important part of data exploration/dashboarding. Think carefully about it, and you should discuss thoroughly with your teammates.\n",
    "\n",
    "**Big hint:** Though tidy, there is a fair amount of wrangling that needs to be done to get the data frame in a workable format, particularly because of the time series. The `load_activity()` function below will allow you do load in the data frame and automatically do some wrangling to get the data frame in a workable format. Be sure you understand what it does and how it does it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_activity(\n",
    "    fname,\n",
    "    genotype_fname,\n",
    "    instrument=-9999,\n",
    "    trial=-9999,\n",
    "    lights_on=\"9:00:00\",\n",
    "    lights_off=\"23:00:00\",\n",
    "    day_in_the_life=4,\n",
    "    zeitgeber_0=None,\n",
    "    zeitgeber_0_day=5,\n",
    "    zeitgeber_0_time=None,\n",
    "    wake_threshold=0.1,\n",
    "    rename={},\n",
    "    comment=\"#\",\n",
    "    gtype_double_header=None,\n",
    "    gtype_rstrip=False,\n",
    "):\n",
    "    \"\"\"\n",
    "    Load in activity CSV file to tidy DateFrame\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    fname : str, or list or tuple or strings\n",
    "        If a string, the CSV file containing the activity data. This is\n",
    "        a conversion to CSV of the Excel file that comes off the\n",
    "        instrument. If a list or tuple, each entry contains a CSV file \n",
    "        for a single experiment. The data in these files are stitched\n",
    "        together.\n",
    "    genotype_fname : str\n",
    "        File containing genotype information. This is in standard\n",
    "        Prober lab format, with tab delimited file.\n",
    "        - First row discarded\n",
    "        - Second row contains genotypes. String is only kept up to\n",
    "          the last space because they typically appear something like\n",
    "          'tph2-/- (n=20)', and we do not need the ' (n=20)'.\n",
    "        - Subsequent rows containing wells in the 96 well plate\n",
    "          corresponding to each genotype.\n",
    "    instrument: str of int\n",
    "        Name of instrument used to make measurement. If not specified,\n",
    "        the 'instrument' column is populated with -9999 as a \n",
    "        placeholder.\n",
    "    trial: str or int, default -9999\n",
    "        Trial number of measurement. If not specified, the 'trial'\n",
    "        column is populated with -9999 as a placeholder.\n",
    "    lights_on : string or datetime.time instance, default '9:00:00'\n",
    "        The time where lights come on each day, e.g., '9:00:00'.\n",
    "    lights_off: string or datetime.time, or None, default '23:00:00'\n",
    "        The time where lights go off each day, e.g., '23:00:00'.\n",
    "        If None, the 'light' column is all True, meaning we are not\n",
    "        keeping track of lighting.\n",
    "    day_in_the_life : int, default 4\n",
    "        The day in the life of the embryos when data acquisition\n",
    "        started. The default is 4, which is the standard \n",
    "    zeitgeber_0 : datetime instance, default None\n",
    "        If not None, gives the date and time of Zeitgeber time zero. \n",
    "    zeitgeber_0_day : int, default 5\n",
    "        The day in the life of the embryos where Zeitgeber time zero is.\n",
    "        Ignored if `zeitgeber_0` is not None.\n",
    "    zeigeber_0_time : str, or None (default)\n",
    "        String representing time of Zeitgeber time zero. If None,\n",
    "        defaults to the first time the lights came on. Ignored if \n",
    "        `zeitgeber_0` is not None.\n",
    "    wake_threshold : float, default 0.1\n",
    "        Threshold number of seconds per minute that the fish moved\n",
    "        to be considered awake.\n",
    "    rename : dict, default {}\n",
    "        Dictionary for renaming column headings.\n",
    "    comment : string, default '#'\n",
    "        Test that begins and comment line in the file\n",
    "    gtype_double_header : bool or None, default None\n",
    "        If True, the file has a two-line header. The first line\n",
    "        is ignored and the second is kept as a header, possibly\n",
    "        with stripping using the `rstrip` argument. If False, assume\n",
    "        a single header row. If None, infer the header, giving a\n",
    "        warning if a double header is inferred.\n",
    "    gtype_rstrip : bool, default True\n",
    "        If True, strip out all text in genotype name to the right of\n",
    "        the last space. This is because the genotype files sometimes\n",
    "        have headers like 'wt (n=22)', and the '(n=22)' is useless.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    df : pandas DataFrame\n",
    "        Tidy DataFrame with all of the columns of the input file, plus:\n",
    "        - time: time in proper datetime format, based on the `sttime`\n",
    "          column of the inputted data file\n",
    "        - sleep : 1 if fish is asleep (activity < wake_threshold), and 0 otherwise.\n",
    "          This is convenient for computing sleep when resampling.\n",
    "        - location: ID of the location of the animal. This is often\n",
    "          renamed to `fish`, but not by default.\n",
    "        - genotype: genotype of the fish\n",
    "        - zeit: The Zeitgeber time, based off of the clock time, not\n",
    "          the experimental time. Zeitgeber time zero is specified with \n",
    "          the `zeitgeber_0` kwarg, or alternatively with the \n",
    "          `zeitgeber_0_day` and `zeitgeber_0_time` kwargs.\n",
    "        - zeit_ind: Index of the measured Zeitgeber time. Because of \n",
    "          some errors in the acquisition, sometimes the times do not\n",
    "          perfectly line up. This is needed for computing averages over\n",
    "          locations at each time point.\n",
    "        - exp_ind: an index for the experimental time. Because of some\n",
    "          errors in the acquisition, sometimes the times do not\n",
    "          perfectly line up. exp_ind is just the index of the\n",
    "          measurement. This is needed for computing averages over\n",
    "          fish at each time point.\n",
    "        - acquisition: Number associated with which acquisition the data\n",
    "          are coming from. If the experimenter restarts acquisition,\n",
    "          this number would change.\n",
    "        - instrument: Name of instrument used to acquire the data. If no\n",
    "          instrument is known, this is populated with NaNs.\n",
    "        - trial: Name of trial of data acquisition.  If no trial is \n",
    "          known, this is populated with NaNs.\n",
    "        - light: True if the light is on.\n",
    "        - day: The day in the life of the fish. The day begins with\n",
    "          `lights_on`.\n",
    "          \n",
    "    Notes\n",
    "    -----\n",
    "    .. If `lights_off` is `None`, this means we ignore the lighting,\n",
    "       but we still want to know what day it is. Specification of\n",
    "       `lights_on` says what wall clock time specifies the start of\n",
    "       a day.\n",
    "    \"\"\"\n",
    "    # Get genotype information\n",
    "    df_gt = load_gtype(\n",
    "        genotype_fname,\n",
    "        comment=comment,\n",
    "        double_header=gtype_double_header,\n",
    "        rstrip=gtype_rstrip,\n",
    "    )\n",
    "\n",
    "    # Read in DataFrames\n",
    "    if type(fname) == str:\n",
    "        fname = [fname]\n",
    "\n",
    "    # Read in DataFrames\n",
    "    df = pd.concat(\n",
    "        [\n",
    "            _load_single_activity_file(\n",
    "                filename, df_gt, comment=comment, acquisition=ac + 1,\n",
    "            )\n",
    "            for ac, filename in enumerate(fname)\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    # Columns to use\n",
    "    usecols = list(df.columns)\n",
    "\n",
    "    # Sort by location and then time\n",
    "    df = df.sort_values([\"location\", \"time\"]).reset_index(drop=True)\n",
    "\n",
    "    # Convert lights_on to datetime\n",
    "    if type(lights_on) != datetime.time:\n",
    "        lights_on = pd.to_datetime(lights_on).time()\n",
    "    if type(lights_off) != datetime.time and lights_off is not None:\n",
    "        lights_off = pd.to_datetime(lights_off).time()\n",
    "\n",
    "    # Convert zeitgeber_0 to datetime object\n",
    "    if zeitgeber_0 is not None and type(zeitgeber_0) == str:\n",
    "        zeitgeber_0 = pd.to_datetime(zeitgeber_0)\n",
    "\n",
    "    # Determine light or dark\n",
    "    if lights_off is None:\n",
    "        df[\"light\"] = [True] * len(df)\n",
    "    else:\n",
    "        clock = pd.DatetimeIndex(df[\"time\"]).time\n",
    "        df[\"light\"] = np.logical_and(clock >= lights_on, clock < lights_off)\n",
    "\n",
    "    # Get earliest time point\n",
    "    t_min = pd.DatetimeIndex(df[\"time\"]).min()\n",
    "\n",
    "    # Which day it is (day goes lights on to lights on)\n",
    "    df[\"day\"] = (\n",
    "        (df[\"time\"] - datetime.datetime.combine(t_min.date(), lights_on)).dt.days\n",
    "        + day_in_the_life\n",
    "        - 1\n",
    "    )\n",
    "\n",
    "    # Compute zeitgeber_0\n",
    "    if zeitgeber_0 is None:\n",
    "        times = df.loc[(df[\"day\"] == zeitgeber_0_day) & (df[\"light\"] == True), \"time\"]\n",
    "        if len(times) == 0:\n",
    "            raise RuntimeError(\n",
    "                \"Unable to find Zeitgeber_0. Check `day_in_the_life` and \"\n",
    "                + \"zeitgeber_0_day` inputs.\"\n",
    "            )\n",
    "        zeit_date = times.min().date()\n",
    "        zeitgeber_0 = pd.to_datetime(str(zeit_date) + \" \" + str(lights_on))\n",
    "\n",
    "    # Add Zeitgeber time\n",
    "    df[\"zeit\"] = (df[\"time\"] - zeitgeber_0).dt.total_seconds() / 3600\n",
    "\n",
    "    # Set up exp_time indices\n",
    "    for loc in df[\"location\"].unique():\n",
    "        df.loc[df[\"location\"] == loc, \"exp_ind\"] = np.arange(\n",
    "            np.sum(df[\"location\"] == loc)\n",
    "        )\n",
    "    df[\"exp_ind\"] = df[\"exp_ind\"].astype(int)\n",
    "\n",
    "    # Infer time interval in units of hours (almost always 1/60)\n",
    "    dt = np.diff(df.loc[df[\"location\"] == df[\"location\"].unique()[0], \"time\"])\n",
    "    dt = np.median(dt.astype(float) / 3600e9)\n",
    "\n",
    "    # Add zeit indices\n",
    "    df[\"zeit_ind\"] = (np.round(df[\"zeit\"] / dt)).astype(int)\n",
    "\n",
    "    # Get the columns we want to keep\n",
    "    cols = usecols + [\"zeit\", \"zeit_ind\", \"exp_ind\", \"light\", \"day\"]\n",
    "    df = df[cols]\n",
    "\n",
    "    # Compute sleep\n",
    "    df[\"sleep\"] = (df[\"middur\"] + df[\"burdur\"] < wake_threshold).astype(int)\n",
    "\n",
    "    # Get experimental time in units of hours (DEPRECATED)\n",
    "    # df['exp_time'] = df['start'] / 3600\n",
    "\n",
    "    # Rename columns\n",
    "    if rename is not None:\n",
    "        df = df.rename(columns=rename)\n",
    "\n",
    "    # Fill in trial and instrument information\n",
    "    df[\"instrument\"] = [instrument] * len(df)\n",
    "    df[\"trial\"] = [trial] * len(df)\n",
    "\n",
    "    return df\n",
    "\n",
    "\n",
    "def _sniff_file_info(fname, comment=\"#\", check_header=True, quiet=False):\n",
    "    \"\"\"\n",
    "    Infer number of header rows and delimiter of a file.\n",
    "    Parameters\n",
    "    ----------\n",
    "    fname : string\n",
    "        CSV file containing the genotype information.\n",
    "    comment : string, default '#'\n",
    "        Character that starts a comment row.\n",
    "    check_header : bool, default True\n",
    "        If True, check number of header rows, assuming a row\n",
    "        that begins with a non-digit character is header.\n",
    "    quiet : bool, default False\n",
    "        If True, suppress output to screen.\n",
    "    Returns\n",
    "    -------\n",
    "    n_header : int or None\n",
    "        Number of header rows. None is retured if `check_header`\n",
    "        is False.\n",
    "    delimiter : str\n",
    "        Inferred delimiter\n",
    "    line : str\n",
    "        The first line of data in the file.\n",
    "    Notes\n",
    "    -----\n",
    "    .. Valid delimiters are: ['\\t', ',', ';', '|', ' ']\n",
    "    \"\"\"\n",
    "\n",
    "    valid_delimiters = [\"\\t\", \",\", \";\", \"|\", \" \"]\n",
    "\n",
    "    with open(fname, \"r\") as f:\n",
    "        # Read through comments\n",
    "        line = f.readline()\n",
    "        while line != \"\" and line[0] == comment:\n",
    "            line = f.readline()\n",
    "\n",
    "        # Read through header, counting rows\n",
    "        if check_header:\n",
    "            n_header = 0\n",
    "            while line != \"\" and (not line[0].isdigit()):\n",
    "                line = f.readline()\n",
    "                n_header += 1\n",
    "        else:\n",
    "            n_header = None\n",
    "\n",
    "        if line == \"\":\n",
    "            delimiter = None\n",
    "            if not quiet:\n",
    "                print(\"Unable to determine delimiter, returning None\")\n",
    "        else:\n",
    "            # If no tab, comma, ;, |, or space, assume single entry per column\n",
    "            if not any(d in line for d in valid_delimiters):\n",
    "                delimiter = None\n",
    "                if not quiet:\n",
    "                    print(\"Unable to determine delimiter, returning None\")\n",
    "            else:\n",
    "                delimiter = csv.Sniffer().sniff(line).delimiter\n",
    "\n",
    "    # Return number of header rows and delimiter\n",
    "    return n_header, delimiter, line\n",
    "\n",
    "\n",
    "def load_gtype(fname, comment=\"#\", double_header=None, rstrip=False, quiet=False):\n",
    "    \"\"\"\n",
    "    Read genotype file into tidy DataFrame\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    fname : string\n",
    "        File containing genotype information. This is in standard\n",
    "        Prober lab format, with tab delimited file.\n",
    "        - First row discarded\n",
    "        - Second row contains genotypes. String is only kept up to\n",
    "          the last space because they typically appear something like\n",
    "          'tph2-/- (n=20)', and we do not need the ' (n=20)'.\n",
    "        - Subsequent rows containg wells in the 96 well plate\n",
    "          corresponding to each genotype.\n",
    "    comment : string, default '#'\n",
    "        Test that begins and comment line in the file\n",
    "    double_header : bool or None, default None\n",
    "        If True, the file has a two-line header. The first line\n",
    "        is ignored and the second is kept as a header, possibly\n",
    "        with stripping using the `rstrip` argument. If False, assume\n",
    "        a single header row. If None, infer the header, giving a\n",
    "        warning if a double header is inferred.\n",
    "    rstrip : bool, default True\n",
    "        If True, strip out all text in genotype name to the right of\n",
    "        the last space. This is because the genotype files typically\n",
    "        have headers like 'wt (n=22)', and the '(n=22)' is useless.\n",
    "    quiet : bool, default False\n",
    "        If True, suppress output to screen.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    df : pandas DataFrame\n",
    "        Tidy DataFrame with columns:\n",
    "        - location: ID of location\n",
    "        - genotype: genotype of animal at location\n",
    "    \"\"\"\n",
    "\n",
    "    # Sniff file info\n",
    "    n_header, delimiter, _ = _sniff_file_info(\n",
    "        fname, check_header=True, comment=comment, quiet=True\n",
    "    )\n",
    "    if double_header is None:\n",
    "        if n_header == 2:\n",
    "            double_header = True\n",
    "            if not quiet:\n",
    "                warnings.warn(\"Inferring two header rows.\", RuntimeWarning)\n",
    "\n",
    "    if double_header:\n",
    "        df = pd.read_csv(fname, comment=comment, header=[0, 1], delimiter=delimiter)\n",
    "\n",
    "        # Reset the columns to be the second level of indexing\n",
    "        df.columns = df.columns.get_level_values(1)\n",
    "    else:\n",
    "        df = pd.read_csv(fname, comment=comment, delimiter=delimiter)\n",
    "\n",
    "    # Only keep genotype up to last space because sometimes has n\n",
    "    if rstrip:\n",
    "        df.columns = [\n",
    "            col[: col.rfind(\" \")] if col.rfind(\" \") > 0 else col for col in df.columns\n",
    "        ]\n",
    "\n",
    "    # Melt the DataFrame\n",
    "    df = pd.melt(df, var_name=\"genotype\", value_name=\"location\").dropna()\n",
    "\n",
    "    # Reset the index\n",
    "    df = df.reset_index(drop=True)\n",
    "\n",
    "    # Make sure data type is integer\n",
    "    df.loc[:, \"location\"] = df.loc[:, \"location\"].astype(int)\n",
    "\n",
    "    return df\n",
    "\n",
    "\n",
    "def _load_single_activity_file(fname, df_gt, comment=\"#\", acquisition=1):\n",
    "    \"\"\"\n",
    "    Load in activity CSV file to tidy DateFrame\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    fname : string\n",
    "        The CSV file containing the activity data. This is\n",
    "        a conversion to CSV of the Excel file that comes off the\n",
    "        instrument.\n",
    "    df_gt : pandas DataFrame\n",
    "        Tidy DataFrame with columns:\n",
    "        - location: ID of location\n",
    "        - genotype: genotype of of animal at location\n",
    "    comment : string, default '#'\n",
    "        Test that begins and comment line in the file\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    df : pandas DataFrame\n",
    "        - acquisition: Number associated with which acquisition the data\n",
    "          are coming from. If the experimenter restarts acquisition,\n",
    "          this number would change.\n",
    "        - genotype: genotype of the animal in the location\n",
    "        \n",
    "    Notes\n",
    "    -----\n",
    "    .. If `lights_off` is `None`, this means we ignore the lighting,\n",
    "       but we still want to know what day it is. Specification of\n",
    "       `lights_on` says what wall clock time specifies the start of\n",
    "       a day.\n",
    "    \"\"\"\n",
    "\n",
    "    # Sniff out the delimiter, see how many headers, check file not empty\n",
    "    _, delimiter, _ = _sniff_file_info(\n",
    "        fname, check_header=False, comment=comment, quiet=True\n",
    "    )\n",
    "\n",
    "    # Read file\n",
    "    df = pd.read_csv(fname, comment=comment, delimiter=delimiter)\n",
    "\n",
    "    # Detect if it's the new file format, and the convert fish to integer\n",
    "    if \"-\" in df[\"location\"].iloc[0]:\n",
    "        df[\"location\"] = (\n",
    "            df[\"location\"].apply(lambda x: x[x.rfind(\"-\") + 1 :]).astype(int)\n",
    "        )\n",
    "    else:\n",
    "        df[\"location\"] = df[\"location\"].str.extract(\"(\\d+)\", expand=False).astype(int)\n",
    "\n",
    "    # Only keep fish that we have genotypes for\n",
    "    df = df.loc[df[\"location\"].isin(df_gt[\"location\"]), :]\n",
    "\n",
    "    # Store the genotypes\n",
    "    loc_lookup = {\n",
    "        loc: df_gt.loc[df_gt[\"location\"] == loc, \"genotype\"].values[0]\n",
    "        for loc in df_gt[\"location\"]\n",
    "    }\n",
    "    df[\"genotype\"] = df[\"location\"].apply(lambda x: loc_lookup[x])\n",
    "\n",
    "    # Convert date and time to a time stamp\n",
    "    df[\"time\"] = pd.to_datetime(df[\"stdate\"] + df[\"sttime\"], format=\"%d/%m/%Y%H:%M:%S\")\n",
    "\n",
    "    # Add the acquisition number\n",
    "    df[\"acquisition\"] = acquisition * np.ones(len(df), dtype=int)\n",
    "\n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Smaller, but useful hint:** To display some of the time series, you may wish to resample. Resampling with is easily done with the `rolling()` method of Pandas Series."
   ]
  }
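  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this hint on made-up data follows. The activity trace is randomly generated, the variable names are hypothetical, and the ten-minute window is an arbitrary choice for illustration. The `resample()` call is shown only as an alternative for comparison and is not part of the hint.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up activity trace: one value per minute for three hours\n",
    "time = pd.date_range(\"2015-07-17 09:00:00\", periods=180, freq=\"min\")\n",
    "rng = np.random.default_rng(3252)\n",
    "activity = pd.Series(rng.exponential(scale=1.0, size=180), index=time)\n",
    "\n",
    "# Mean over a sliding ten-minute window; the first nine entries are NaN\n",
    "activity_smooth = activity.rolling(window=10).mean()\n",
    "\n",
    "# Alternative: total activity in non-overlapping ten-minute bins\n",
    "activity_binned = activity.resample(\"10min\").sum()\n",
    "```\n",
    "\n",
    "Here `rolling()` keeps one value per original time point, whereas `resample()` returns one value per bin; which is more useful depends on the plot you are making."
   ]
  }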
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}