{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learned Potentials with IPSuite\n", "\n", "In this tutorial, you will train several small machine learned inter-atomic potential using prepared data and in the end, load a fully trained model and perform a small MD simulation.\n", "\n", "In the first part of the tutorial, we provide you with a fully functioning script which produces DFT data before training and deploying a model.\n", "In later parts of the notebook, you will need to copy the model generation and data selection componants to perform your own experiments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "We will install IPSuite and its dependencies for this Project from the `requirements.txt` file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "!pip install -r requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting a Potential\n", "\n", "The following Code describes the generation of training data and fitting a GAP potential.\n", "[![](https://mermaid.ink/img/pako:eNqllFFvgjAQx7-Kqa9qhigiD0vcNM5MJll5GwtpaNHGQg2UOWP87isbaCx1WTKeLv__r3fc9dIjiDgmwAEx4_togzLRWr4Gaav6UmkabwGYwNmc8NVOBOD92u3_uO604Zi1ExoNb3DxQrEhWcJzgZrJh2fsF8iS0KPXf4Y0XTPicZo2mVHJ8DSm6yJDgvIUEkaiMmig9k1U08dYwnP_KXyYwAUM4cyHPfwRNTDjruK8lT978ReT5Q2unLW7dGXImmY5ag9F24RrzHLaXkYw1XdlDK58l4iMRnkTKycOE8pI7vOJ4IkGKeeNY4HNHkZC30a3e_-9G3rZVDaokkd62VY2R59kqC9pVbKhlxV6pC9Zy4ZS0651Jfu40i11CW7oN-Zy1tW6Rl_f1VlXC5g1P1BvuzbUm7IumS5OLg6MVL_WiiljTjsex51cZHxLnLZpmlXc3VMsNs5g96k7bP7jrP3Hs6ADEvlgIIrl03YsMwVAPiEJCYAjQ0xiVDD5TATpSaKoEBwe0gg4IitIBxQ7udlkStE6QwlwYsRycvoCcbuCGw?type=png)](https://mermaid.live/edit#pako:eNqllFFvgjAQx7-Kqa9qhigiD0vcNM5MJll5GwtpaNHGQg2UOWP87isbaCx1WTKeLv__r3fc9dIjiDgmwAEx4_togzLRWr4Gaav6UmkabwGYwNmc8NVOBOD92u3_uO604Zi1ExoNb3DxQrEhWcJzgZrJh2fsF8iS0KPXf4Y0XTPicZo2mVHJ8DSm6yJDgvIUEkaiMmig9k1U08dYwnP_KXyYwAUM4cyHPfwRNTDjruK8lT978ReT5Q2unLW7dGXImmY5ag9F24RrzHLaXkYw1XdlDK58l4iMRnkTKycOE8pI7vOJ4IkGKeeNY4HNHkZC30a3e_-9G3rZVDaokkd62VY2R59kqC9pVbKhlxV6pC9Zy4ZS0651Jfu40i11CW7oN-Zy1tW6Rl_f1VlXC5g1P1BvuzbUm7IumS5OLg6MVL_WiiljTjsex51cZHxLnLZpmlXc3VMsNs5g96k7bP7jrP3Hs6ADEvlgIIrl03YsMwVAPiEJCYAjQ0xiVDD5TATpSaKoEBwe0gg4IitIBxQ7udlkStE6QwlwYsRycvoCcbuCGw)\n", "\n", "For this exercise we will only focus on the highlighted Nodes. You will make some modifications to the ML model.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import ipsuite as ips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Script\n", "\n", "In the following notebook, you can see a fully prepared script for:\n", "\n", "* Generating DFT data\n", "* Selecting training and test data\n", "* Fitting a machine-learned inter-atomic potential\n", "* Deploying the potential in an MD simulation\n", "\n", "This cell will run immediately as results of a pre-run model will be loaded. \n", "Take note of each step as you will need some of them later." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "mapping = ips.geometry.BarycenterMapping(data=None)\n", "thermostat = ips.calculators.LangevinThermostat(\n", " temperature=300, friction=0.01, time_step=0.5\n", ")\n", "\n", "with ips.Project(automatic_node_names=True) as project:\n", " mol = ips.configuration_generation.SmilesToAtoms(smiles=\"O\")\n", " \n", " # Create a box of atoms.\n", " packmol = ips.configuration_generation.Packmol(\n", " data=[mol.atoms], count=[10], density=997\n", " )\n", " \n", " # Define the CP2K calculations\n", " cp2k = ips.calculators.CP2KSinglePoint(\n", " data=packmol,\n", " cp2k_files=[\"GTH_BASIS_SETS\", \"GTH_POTENTIALS\", \"dftd3.dat\"],\n", " cp2k_shell=\"cp2k_shell.psmp\",\n", " )\n", " \n", " # Perform geometry optimization\n", " geopt = ips.calculators.ASEGeoOpt(\n", " model=cp2k,\n", " data=packmol.atoms,\n", " optimizer=\"BFGS\",\n", " run_kwargs={\"fmax\": 0.1},\n", " )\n", " \n", " # Define an integrator\n", " md = ips.calculators.ASEMD(\n", " model=cp2k,\n", " thermostat=thermostat,\n", " data=geopt.atoms,\n", " data_id=-1,\n", " sampling_rate=1,\n", " dump_rate=1,\n", " steps=5000,\n", " )\n", " \n", " # Split data into test and train\n", " test_data = ips.configuration_selection.RandomSelection(\n", " data=md, n_configurations=250\n", " )\n", " train_data = ips.configuration_selection.RandomSelection(\n", " data=md,\n", " n_configurations=250,\n", " exclude_configurations=[\n", " test_data.selected_configurations,\n", " ],\n", " )\n", " \n", " # Define a model\n", " model = ips.models.GAP(data=train_data, soap={\"cutoff\": 3, \"l_max\": 5, \"n_max\": 5})\n", " \n", " # Use model to make some predicitons\n", " predictions = ips.analysis.Prediction(data=test_data, model=model)\n", " analysis = ips.analysis.PredictionMetrics(data=predictions)\n", " \n", " # Use model in an MD simulation.\n", " ml_md = ips.calculators.ASEMD(\n", " model=model,\n", " thermostat=thermostat,\n", " data=geopt.atoms,\n", " data_id=-1,\n", " sampling_rate=1,\n", " dump_rate=1,\n", " steps=5000,\n", " )\n", "\n", "# Run all of the steps.\n", "project.run(repro=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training your own models\n", "\n", "Now that you have seen what it looks like, you can start playing with your own models, but first, let's load up the previous model and see how it performed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!dvc pull --quiet" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Compute the RMSE scores\n", "\n", "RMSE scores provide one with an idea of how far away an ML models predictions are from the underlying DFT data.\n", "In this case it has been computed for you but later, you will need to do this yourself so keep an eye on where to get the data from." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "analysis.load() # Required for IPSuite to run" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analysis.energy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analysis.energy_df.iloc[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following correlation plots was also generated automatically and might help you evaluate the ML Model accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display, Image, display_pretty, Pretty\n", "display(Image(filename='nodes/PredictionMetrics/plots/energy.png'))\n", "display(Image(filename='nodes/PredictionMetrics/plots/forces.png'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute the Radial Distribution Function\n", "\n", "The next thing you want to do is see how well the model performed in a real MD simulation.\n", "To do so, we compare the radial distirbution function of the md run with the machine-learned potential with the underlying DFT data on which it was trained" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the DFT DATA" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "md.load()\n", "md.atoms[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the ML-MD Simulation Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_md.load()\n", "ml_md.atoms[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute the RDF" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import ase\n", "import ase.geometry.analysis\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "data = ase.geometry.analysis.Analysis(ml_md.atoms).get_rdf(rmax=3.0, nbins=100, elements=[\"H\", \"O\"])\n", "dft_data = ase.geometry.analysis.Analysis(md.atoms).get_rdf(rmax=3.0, nbins=100, elements=[\"H\", \"O\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.plot(np.linspace(0, 3.0, 100), np.mean(data, axis=0), label=\"ML PES\")\n", "ax.plot(np.linspace(0, 3.0, 100), np.mean(dft_data, axis=0), label=\"DFT\")\n", "ax.hlines(xmin=0, xmax=3, y=1, ls=\"--\")\n", "ax.set_xlabel(r\"Distance r / $\\AA$\")\n", "ax.set_ylabel(\"RDF g(r)\")\n", "ax.legend()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Fit your own ML model\n", "\n", "Now that we have looked at a pretrained model, it is time train some of your own.\n", "We will create IPSuite `Experiments` to accomplish this.\n", "\n", "Fit five models with data-set sizes ranging from 10, to 200 and make a plot of their RMSE scores with respect to their data-set sizes and explain what you see.\n", "\n", "> If you have lost track of all the experiments that you performed, you can use `dvc exp show` from the command line to view them in a tabular form. 
\n", "\n", "```python\n", "with project.create_experiment() as exp1:\n", " train_data.n_configurations = 10 \n", " \n", "with project.create_experiment() as exp2:\n", " ...\n", "\n", "# copy this line to run the experiments\n", "!dvc exp run --run-all\n", "\n", "print(exp1[analysis.name].energy)\n", "print(exp1[train_data.name].n_configurations)\n", "```" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Your experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deploying a Pre-Trained Model\n", "\n", "You have used the GAP model up until now.\n", "For the last part we will have a look at another ML Model, called MACE https://github.com/ACEsuit/mace\n", "\n", "For this example training a MACE model takes longer than training a GAP model.\n", "Therefore, we will look at a pretrained model that we can download.\n", "\n", "We will save our current model and load the MACE model afterwards." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[main 4b07b60] new ML Model\n", " 3 files changed, 1031 insertions(+), 20 deletions(-)\n", " create mode 100644 .ipynb_checkpoints/main-checkpoint.ipynb\n" ] } ], "source": [ "!git add .\n", "!git commit -m \"new ML Model\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Branch 'mace' set up to track remote branch 'mace' from 'origin'.\n", "Switched to a new branch 'mace'\n" ] } ], "source": [ "!git checkout mace" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "!dvc pull --quiet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following images show the correlation plots of the MACE model.\n", "How much do they differ from your best GAP model?\n", "Copy and paste the code from the **Analyse the Results** section above again and evaluate your model.\n", "Because we are using MACE instead of GAP we need to change the Model via `model = ips.models.MACE.from_rev()`.\n", "Especially compare the RDF of the MACE model with your GAP model." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "model = ips.models.MACE.from_rev()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(Image(filename='nodes/PredictionMetrics/plots/energy.png'))\n", "display(Image(filename='nodes/PredictionMetrics/plots/forces.png'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, let us scale things up a bit.\n", "With DFT we are limited to small box sizes.\n", "With our ML-Pes we can have a look at bigger boxes.\n", "We can set the `repeat` attribute of our `ASEMD` Node to increase the box size.\n", "\n", "Perform an MD simulation for an unscaled and scaled simulation box.\n", "Compute the RDF for each simulation and compare them to one another and the DFT data you loaded at the start of the notebook.\n", "Discuss in your report what you see.\n", "\n", "```python\n", "with project.create_experiment() as mace_exp1:\n", " # original size\n", " ml_md.repeat = (1, 1, 1)\n", "\n", "with project.create_experiment() as mace_exp2:\n", " # 8 times the volume\n", " ml_md.repeat = (2, 2, 2)\n", " \n", "\n", "project.run_exp()\n", "\n", "mace_exp1[ml_md.name].atoms # access the atoms to print the RDF\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Return to your model\n", "\n", "These are just some lines required for IPSuite to work correctly if you want to go back to your model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!git checkout main\n", "!dvc checkout" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "0b56f66012b023714c40f2382361243d24cc7c2d1707d28001be66bf6c226584" } } }, "nbformat": 4, "nbformat_minor": 4 }