{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An Example of using sklearn Pipeline with matminer\n", "\n", "This notebook walks through the steps to build a model using sklearn Pipeline and matminer. See the intro_predicting_bulk_modulus notebook for more details about matminer and the featurizers used here.\n", "\n", "This notebook was last updated 06/07/21 for version 0.7.0 of matminer.\n", "\n", "**Note that in order to get the in-line plotting to work, you might need to start Jupyter notebook with a higher data rate, e.g., ``jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10``. We recommend you do this before starting.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why use Pipeline?\n", "\n", "Pre-processing and featurizing materials data can be viewed as a series of transformations on the data, going from the initially loaded state to a training-ready state. Pipelines are a tool for encapsulating this process in a way that makes it easy to reproduce, presents a simple model of data transformation, and helps to avoid errant changes to the data. A pipeline chains several transformations together into a single transformation. Pipelines can also be used to build end-to-end methods for preprocessing, training, and validating a model by optionally putting an estimator at the end of the pipeline. See the [scikit-learn Pipeline documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for details."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Load sklearn modules\n", "from sklearn.pipeline import FeatureUnion, Pipeline\n", "from sklearn.base import TransformerMixin, BaseEstimator\n", "\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.svm import SVR, LinearSVR\n", "\n", "from sklearn.decomposition import PCA, NMF\n", "from sklearn.feature_selection import SelectKBest, chi2\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import RepeatedKFold, cross_val_score, cross_val_predict, train_test_split, GridSearchCV, RandomizedSearchCV\n", "\n", "import numpy as np\n", "from pandas import DataFrame\n", "from scipy.stats import randint as sp_randint\n", "\n", "# Load featurizers and conversion functions\n", "from matminer.featurizers.composition import ElementProperty, OxidationStates\n", "from matminer.featurizers.structure import DensityFeatures\n", "from matminer.featurizers.conversions import CompositionToOxidComposition, StrToComposition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the Dataset\n", "Matminer provides several example data sets you can use. Below, we load a data set of computed elastic properties of materials, sourced from the paper \"Charting the complete elastic properties of inorganic crystalline compounds\", M. de Jong *et al.*, Sci. Data 2 (2015) 150009."
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Decoding objects from /Users/ardunn/alex/lbl/projects/common_env/dev_codes/matminer/matminer/datasets/elastic_tensor_2015.json.gz: 100%|#########9| 4706/4724 [00:01<00:00, 5250.95it/s]" ] } ], "source": [ "from matminer.datasets.convenience_loaders import load_elastic_tensor\n", "df = load_elastic_tensor() # loads dataset in a pandas DataFrame \n", "unwanted_columns = [\"volume\", \"nsites\", \"compliance_tensor\", \"elastic_tensor\", \n", " \"elastic_tensor_original\", \"K_Voigt\", \"G_Voigt\", \"K_Reuss\", \"G_Reuss\"]\n", "df = df.drop(unwanted_columns, axis=1)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Separate out the target values to be estimated (K_VRH, the bulk modulus)\n", "y = df['K_VRH'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "The conversion functions in matminer need to be run before the pipeline as a data preprocessing step." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dd6a17a2d1d94176b5fec4816468690d", "version_major": 2, "version_minor": 0 }, "text/plain": [ 