{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2: Penguin regression with PyTorch\n", "\n", "\n", "\n", "\n", "Artwork by @allison_horst\n", "\n", "In this exercise, we will again use the [``palmerpenguins``](https://github.com/mcnakhaee/palmerpenguins) data to continue our exploration of PyTorch.\n", "\n", "We will use the same dataset object as before, but this time we'll take a look at a regression problem: predicting the mass of a penguin given other physical features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1: look at the data\n", "In the following code block, we import the ``load_penguins`` function from the ``palmerpenguins`` package.\n", "\n", "- Load the penguin data as you did before.\n", "- This time, consider which features we might like to use to predict a penguin's mass." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from palmerpenguins import load_penguins\n", "\n", "# Insert code here ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features we will use to estimate the mass are:\n", "- ...\n", "- ...\n", "- ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2: creating a ``torch.utils.data.Dataset``\n", "\n", "As before, we need to create PyTorch ``Dataset`` objects to supply data to our neural network. \n", "Since we have already created and explored the ``PenguinDataset`` class there is nothing else to do here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3: Obtaining training and validation datasets.\n", "\n", "- Instantiate the penguin dataloader.\n", " - Make sure you supply the correct column titles for the features and the targets.\n", " - Remember, the target is now mass, not the species!\n", "- Iterate over the dataset\n", " - Hint:\n", " ```python\n", " for features, targets in dataset:\n", " # print the features and targets here\n", " ```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ml_workshop import PenguinDataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4: Applying transforms to the data\n", "\n", "As in the previous exercise, the raw inputs and targets need transforming to ``torch.Tensor``s before they can be passed to a neural network. \n", "We will again use ``torchvision.transforms.Compose`` to take a list of callable objects and apply them to the incoming data.\n", "\n", "Because the raw units of mass are in grams, the numbers are quite large. This can encumber the model's predictive power. A sensible way to address this is to normalise targets using statistics from the training set. The most common form of normalisation is to subtract the mean and divide by the standard deviation (of the training set). However here, for the sake of simplicity, we will just scale the mass by dividing by the mean of the training set.\n", "\n", "Note that this means that the model will now be trained to predict masses as fractions of the training mean.\n", "\n", "We grab the mean of the training split in the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_set = PenguinDataset(features, [\"body_mass_g\"], train=True)\n", "\n", "training_mean = train_set.split.body_mass_g.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create our real training and validation set, and supply transforms as before." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torchvision.transforms import Compose\n", "\n", "# Let's apply the transfroms we need to the PenguinDataset to get out inputs\n", "# targets as Tensors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5: Creating ``DataLoaders``—Again!\n", "\n", "As before, we wrap our ``Dataset``s in ``DataLoader`` before we proceed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch.utils.data import DataLoader\n", "\n", "# Create training and validation DataLoaders." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6: Creating a neural network in PyTorch\n", "\n", "Previously we created our neural network from scratch, but doing this every time we need to solve a new problem is cumbersome. \n", "Many projects working with the ICCS have codes where the numbers of layers, layer sizes, and other parts of the models are hard-coded from scratch every time!\n", "\n", "The result is ungainly, non-general, and heavily-duplicated code. Here, we are going to shamelessly punt Jim Denholm's Python repo, [``TorchTools``](https://github.com/jdenholm/TorchTools), which contains generalisations of many commonly-used PyTorch tools, to save save us some time.\n", "\n", "Here, we can use the ``FCNet`` model, whose documentation lives [here](https://jdenholm.github.io/TorchTools/models.html). This model is a fully-connected neural network with various options for dropout, batch normalisation, and easily-modifiable layers.\n", "\n", "#### A brief sidebar\n", "Note: the repo is pip-installable with\n", "```bash\n", "pip install git+https://github.com/jdenholm/TorchTools.git\n", "```\n", "but has already been installed for you in the requirements of this workshop package.\n", "\n", "It is useful to know you can install Python packages from GitHub using pip. To install specific versions you can use:\n", "```bash\n", "pip install git+https://github.com/jdenholm/TorchTools.git@v0.1.0\n", "```\n", "(The famous [segment anything model](https://github.com/facebookresearch/segment-anything) (SAM) published by Facebook Research was released in this way.)\n", "\n", "One might argue that this is a much better way of making one-off codes available, for example academic codes which might accompany papers, rather than using the global communal package index PyPI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Back to work: let's instantiate the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch_tools import FCNet\n", "\n", "# model =" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 7: Selecting a loss function\n", "\n", "The previous loss function we chose was appropriate for classification, but _not_ regression. \n", "Here we'll use the mean-squared-error loss, which is more appropriate for regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch.nn import MSELoss\n", "\n", "# loss_func = ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 8: Selecting an optimiser\n", "\n", "``Adam`` is regarded as the king of optimisers: let's use it again, but this time with a learning rate.\n", "\n", "[Adam docs](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch.optim import Adam\n", "\n", "# Create an optimiser and give it the model's parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 9: Writing basic training and validation loops\n", "\n", "\n", "As before, we will write the training loop together and you can then continue with the validation loop.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import Dict\n", "from torch.nn import Module\n", "\n", "\n", "def train_one_epoch(\n", " model: Module,\n", " train_loader: DataLoader,\n", " optimiser: Adam,\n", " loss_func: MSELoss,\n", ") -> Dict[str, float]:\n", " \"\"\"Train ``model`` for once epoch.\n", "\n", " Parameters\n", " ----------\n", " model : Module\n", " The neural network.\n", " train_loader : DataLoader\n", " Training dataloader.\n", " optimiser : Adam\n", " The optimiser.\n", " loss_func : MSELoss\n", " Mean-squared-error loss.\n", " Returns\n", " -------\n", " Dict[str, float]\n", " A dictionary of metrics.\n", "\n", " \"\"\"\n", "\n", "\n", "def validate_one_epoch(\n", " model: Module,\n", " valid_loader: DataLoader,\n", " loss_func: MSELoss,\n", ") -> Dict[str, float]:\n", " \"\"\"Validate ``model`` for a single epoch.\n", "\n", " Parameters\n", " ----------\n", " model : Module\n", " The neural network.\n", " valid_loader : DataLoader\n", " Validation dataloader.\n", " loss_func : MSELoss\n", " Mean-squared-error loss.\n", "\n", " Returns\n", " -------\n", " Dict[str, float]\n", " Metrics of interest.\n", "\n", " \"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 10: Training and extracting metrics\n", "\n", "- Now we can train our model for a specified number of epochs.\n", " - During each epoch the model \"sees\" each training item once.\n", "- Append the training and validation metrics to a list.\n", "- Turm them into a ``pandas.DataFrame``\n", " - Note: You can turn a ``List[Dict[str, float]]``, say ``my_list`` into a ``DataFrame`` with ``DataFrame(my_list)``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epochs = 3\n", "\n", "for _ in range(epochs):\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 11: Plotting metrics\n", "\n", "- Use Matplotlib to plot the training and validation metrics as a function of the number of epochs.\n", "- Does this allow us to interpret performance?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }