{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Advanced MPDS API usage: unusual materials phases from machine learning\n",
    "==========\n",
    "\n",
    "- **Complexity level**: green karate belt\n",
    "- **Requirements**: familiarity with machine learning and parallel programming\n",
    "\n",
    "Here we search the MPDS for \"unusual\" materials phases, _i.e._ those with extreme values of more than one physical property. _Extreme_ in this context means close to either of the prediction bounds, minimum or maximum. We consider 8 properties predicted by machine learning, each of which has clearly defined bounds in the MPDS.\n",
    "\n",
    "For instance, a crystal with a very low **Debye temperature**, a very low **enthalpy of formation**, a very high **linear thermal expansion coefficient**, _etc._ would match. Materials with such unusual combinations of properties certainly deserve attention, so let's list them.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**Important! Before you proceed:** notebooks running on third-party servers are not secure. Using this notebook assumes that you authenticate at the MPDS server with your own API key. Please run this notebook only if you have an open-access account (_i.e._ the **access** section of your MPDS account reads: `Programmatic data access: only open data`).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Please **do not** run this notebook on third-party servers if you have elevated API access to the MPDS, since there is a nonzero probability of key leakage!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Be sure to **always invalidate** (revoke) your API key at your [MPDS account](https://mpds.io/#modal/menu) after using the notebooks.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Now let's proceed with the authentication. First, apply for an [MPDS account](https://mpds.io/open-data-api) if you don't have one. Then copy your API key, run the next cell, paste the key into the prompt that appears, and hit **Enter**.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, getpass\n",
    "os.environ['MPDS_KEY'] = getpass.getpass()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "OK, now this notebook can talk to the MPDS server programmatically on your behalf.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install mpds_client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import division\n",
    "import time\n",
    "import random\n",
    "import threading\n",
    "\n",
    "from mpds_client import MPDSDataRetrieval, MPDSDataTypes\n",
    "\n",
    "ml_data = {\n",
    "    'isothermal bulk modulus': {'bounds': [5, 265], 'units': 'GPa'},\n",
    "    'enthalpy of formation': {'bounds': [-325, 0], 'units': 'kJ g-at.-1'},\n",
    "    'heat capacity at constant pressure': {'bounds': [11, 28], 'units': 'J K-1 g-at.-1'},\n",
    "    'Seebeck coefficient': {'bounds': [-150, 225], 'units': 'muV K-1'},\n",
    "    'values of electronic band gap': {'bounds': [0.5, 10], 'units': 'eV'}, # NB both direct & indirect\n",
    "    'temperature for congruent melting': {'bounds': [300, 2700], 'units': 'K'},\n",
    "    'Debye temperature': {'bounds': [175, 1100], 'units': 'K'},\n",
    "    'linear thermal expansion coefficient': {'bounds': [1.0E-06, 9.5E-05], 'units': 'K-1'}\n",
    "}\n",
    "\n",
    "bound_tolerance_factor = 15"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "What's the `bound_tolerance_factor`? For each machine-learning property, we divide the entire range of values (_e.g._ from `300` to `2700`) into this number of equal segments. Then we take the first and the last segments. Entries with property values falling in these segments are considered **extreme** and kept.\n"
   ]
  },
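  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "To make this concrete, here is a minimal sketch of the segment arithmetic for a single property. The helper name `extreme_thresholds` is hypothetical and introduced only for illustration; it takes the bounds and the factor as plain numbers and does not talk to the MPDS. A larger factor gives narrower segments and therefore a stricter notion of _extreme_.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extreme_thresholds(lower, upper, factor):\n",
    "    '''\n",
    "    Hypothetical helper (not part of mpds_client): returns two cut-offs,\n",
    "    values below the first or above the second fall into the extreme segments\n",
    "    '''\n",
    "    margin = (upper - lower) / float(factor)\n",
    "    return lower + margin, upper - margin\n",
    "\n",
    "# the range from 300 to 2700 split into 15 segments, each 160 wide\n",
    "low_cut, high_cut = extreme_thresholds(300, 2700, 15)\n",
    "print('extreme if below %s or above %s' % (low_cut, high_cut))"
   ]
  },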
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Note that if the key isn't valid, the API returns an HTTP `403` error.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "extremes, extremes_intersects = {}, {}\n",
    "\n",
    "def mpds_download_worker(prop, min_bound, max_bound):\n",
    "    '''\n",
    "    A parallelizable worker\n",
    "    '''\n",
    "    print(\"---Starting with %s\" % prop)\n",
    "\n",
    "    client = MPDSDataRetrieval(dtype=MPDSDataTypes.MACHINE_LEARNING)\n",
    "\n",
    "    min_entries, max_entries = [], []\n",
    "\n",
    "    for item in client.get_data({\"props\": prop}, fields={'P':[\n",
    "        'sample.material.entry',\n",
    "        'sample.material.phase_id',\n",
    "        'sample.material.chemical_formula',\n",
    "        'sample.measurement[0].property.scalar'\n",
    "    ]}):\n",
    "        if item[3] < min_bound:\n",
    "            min_entries.append(item)\n",
    "\n",
    "        elif item[3] > max_bound:\n",
    "            max_entries.append(item)\n",
    "\n",
    "    for item in list(min_entries) + list(max_entries):\n",
    "\n",
    "        keep_info = [prop, item[0]] + item[2:]\n",
    "\n",
    "        if item[1] in extremes:\n",
    "            extremes_intersects.setdefault(item[1], []).append(keep_info)\n",
    "\n",
    "        else:\n",
    "            extremes[item[1]] = keep_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is the most time-consuming step. We need to scan all the machine-learning data. Fetching all the entries for each property takes about 10 minutes, so processing the **8 properties** sequentially takes more than an hour in total. By parallelizing the data extraction we could ideally achieve an **8x** speedup. However, that would also increase the load on the MPDS server **8x**, which we should in principle avoid. Let's be polite! It is safe to double the load, though, so we can run two threads four times to fetch all the data. The total running time is then roughly halved."
   ]
  },
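  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The batching pattern can be tried out without touching the MPDS server. Below is a minimal sketch, assuming a dummy worker that only sleeps; `run_in_batches` and `dummy_worker` are hypothetical names used for illustration and are not part of `mpds_client`. Each batch is joined before the next one starts, so no more than two threads are ever alive at once.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import threading\n",
    "\n",
    "def run_in_batches(worker, jobs, batch_size=2):\n",
    "    '''\n",
    "    Hypothetical helper: calls worker(*args) for each args tuple in jobs,\n",
    "    running at most batch_size threads at a time\n",
    "    '''\n",
    "    for start in range(0, len(jobs), batch_size):\n",
    "        batch = [\n",
    "            threading.Thread(target=worker, args=args)\n",
    "            for args in jobs[start:start + batch_size]\n",
    "        ]\n",
    "        for thread in batch:\n",
    "            thread.start()\n",
    "        # wait for the whole batch before starting the next one\n",
    "        for thread in batch:\n",
    "            thread.join()\n",
    "\n",
    "finished = []\n",
    "finished_lock = threading.Lock()\n",
    "\n",
    "def dummy_worker(name, delay):\n",
    "    time.sleep(delay) # stands in for a slow download\n",
    "    with finished_lock:\n",
    "        finished.append(name)\n",
    "\n",
    "run_in_batches(dummy_worker, [('a', 0.02), ('b', 0.01), ('c', 0.02), ('d', 0.01)])\n",
    "print(finished)"
   ]
  },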
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "threads = []\n",
    "ml_props = list(ml_data.keys())\n",
    "\n",
    "for even, odd in zip(ml_props[0::2], ml_props[1::2]):\n",
    "\n",
    "    print(\"---Preparing a pair of %s & %s\" % (even, odd))\n",
    "\n",
    "    for key in [even, odd]:\n",
    "\n",
    "        # adjust bounds to match entries near the margin\n",
    "        margin = (ml_data[key]['bounds'][1] - ml_data[key]['bounds'][0]) / bound_tolerance_factor\n",
    "        ml_data[key]['bounds'] = [ml_data[key]['bounds'][0] + margin, ml_data[key]['bounds'][1] - margin]\n",
    "\n",
    "        # run in parallel\n",
    "        thread = threading.Thread(target=mpds_download_worker, args=[key] + ml_data[key]['bounds'])\n",
    "        thread.start()\n",
    "        threads.append(thread)\n",
    "\n",
    "    for thread in threads:\n",
    "        thread.join()\n",
    "\n",
    "for phase_id in extremes_intersects:\n",
    "    extremes_intersects[phase_id].append(extremes[phase_id])\n",
    "\n",
    "for phase_id in sorted(extremes_intersects.keys()):\n",
    "\n",
    "    print(\"*\" * 30 + \" Distinct phase https://mpds.io/#phase_id/%s \" % phase_id + \"*\" * 30)\n",
    "\n",
    "    for card in extremes_intersects[phase_id]:\n",
    "        print(\"%s (%s) %s = %s %s\" % (\n",
    "            card[2], card[1], card[0], card[3], ml_data[card[0]]['units']\n",
    "        ))\n",
    "\n",
    "print(\"Done in %1.2f sc\" % (time.time() - start_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Were you able to follow everything? Please try to answer:\n",
    "- How is the value of `bound_tolerance_factor` connected with the total number of results?\n",
    "- How could one obtain the particular crystalline structures for these results?\n",
    "- How could one in principle verify these results?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**PS** don't forget to [invalidate](https://mpds.io/#modal/menu) (revoke) your API key.\n"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}