{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train a Model to Predict Formation Energy using the OQMD\n", "This notebook recreates a 2016 paper by [Ward et al.](https://www.nature.com/articles/npjcompumats201628) on predicting the formation enthalpy of materials based on their composition. We will use the [Materials Data Facility](http://materialsdatafacility.org) to retrieve a training set from the [OQMD](http://oqmd.org), compute features based on the composition of each entry, and then train a random forest model.\n", "\n", "This example was last updated on 06/07/2021 for Matminer v.0.7.0" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from matminer.data_retrieval import retrieve_MDF\n", "from matminer.featurizers.base import MultipleFeaturizer\n", "from matminer.featurizers import composition as cf\n", "from matminer.featurizers.conversions import StrToComposition\n", "from matplotlib import pyplot as plt\n", "from matplotlib.colors import LogNorm\n", "import numpy as np\n", "import pandas as pd\n", "import pickle as pkl\n", "from sklearn import metrics\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, ShuffleSplit, KFold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Settings to change" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "quick_demo = True # Whether to run a faster version of this demo.\n", "# The full OQMD model takes about an hour to test and ~8GB of RAM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Training Set\n", "Ward _et al._ trained their machine learning models on the formation enthalpies of crystalline compounds from the [OQMD](http://oqmd.org). 
Here, we extract the data using the copy of the OQMD available through the MDF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the Data\n", "We first create an `MDFDataRetrieval` instance, matminer's wrapper around the MDF `Forge` client, which simplifies performing search queries against the MDF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to create a tool for reading from the MDF's search index." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "mdf = retrieve_MDF.MDFDataRetrieval(anonymous=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we assemble a query that gets only the converged calculations from the OQMD with a `static` or `standard` configuration. In quick-demo mode, the query is also restricted to entries with `mdf.scroll_id` below 10000." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "query_string = 'mdf.source_name:oqmd AND (oqmd.configuration:static OR '\\\n", "               'oqmd.configuration:standard) AND dft.converged:True'\n", "if quick_demo:\n", "    query_string += \" AND mdf.scroll_id:<10000\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "data = mdf.get_data(query_string, unwind_arrays=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This call returns a DataFrame with the metadata for each OQMD entry that matches the query." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | crystal_structure.cross_reference.icsd | crystal_structure.number_of_atoms | crystal_structure.space_group_number | crystal_structure.volume | dft.converged | dft.cutoff_energy | dft.exchange_correlation_functional | files | material.composition | material.elements | ... | oqmd.delta_e.units | oqmd.delta_e.value | oqmd.magnetic_moment.units | oqmd.magnetic_moment.value | oqmd.stability.units | oqmd.stability.value | oqmd.total_energy.units | oqmd.total_energy.value | oqmd.volume_pa.units | oqmd.volume_pa.value |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 90433.0 | 12 | 62 | 185.154 | True | 520.0 | PBE | [{'data_type': 'ASCII text, with very long lin... | Nb1Pt1Si1 | [Nb, Pt, Si] | ... | eV/atom | -0.805020 | bohr/atom | -0.000119 | eV/atom | -0.105391 | eV/atom | -7.996541 | angstrom^3/atom | 15.4295 |\n",
"| 1 | 639016.0 | 3 | 139 | 59.293 | True | 520.0 | PBE | [{'data_type': 'ASCII text, with very long lin... | Hf2Zn1 | [Hf, Zn] | ... | eV/atom | -0.173969 | bohr/atom | -0.000561 | eV/atom | -0.042780 | eV/atom | -7.232890 | angstrom^3/atom | 19.7643 |\n",
"\n",
"2 rows × 29 columns\n",
"