![](logo.png)

# Welcome to the automatminer basic tutorial!
#### Versions used to make this notebook: `automatminer 2019.10.14`, `matminer 0.6.2`, and `python 3.7.3` on macOS Mojave `10.14.6`

---

[Automatminer](https://github.com/hackingmaterials/automatminer) is a package for *automatically* creating ML pipelines using matminer's featurizers, feature reduction techniques, and Automated Machine Learning (AutoML). Automatminer works end to end, from raw data to prediction, without *any* human input.

#### Put in a dataset, get out a machine that predicts materials properties.

Automatminer is competitive with state-of-the-art hand-tuned machine learning models across multiple domains of materials informatics. Automatminer also includes utilities for running MatBench, a materials science ML benchmark.

#### Learn more about Automatminer and MatBench from the [official documentation](http://hackingmaterials.lbl.gov/automatminer/). 



# How does automatminer work?
Automatminer automatically decorates a dataset using hundreds of descriptor techniques from matminer’s descriptor library, picks the most useful features for learning, and runs a separate AutoML pipeline. Once a pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials.

![](pipe.png)

Materials primitives (e.g., crystal structures) go in one end, and property predictions come out the other. MatPipe handles the intermediate operations such as assigning descriptors, cleaning problematic data, data conversions, imputation, and machine learning.

### MatPipe is the main Automatminer object
`MatPipe` is the central object in Automatminer. It follows the sklearn `BaseEstimator` syntax for `fit` and `predict` operations. Simply `fit` on your training data, then `predict` on your testing data.

### MatPipe uses [pandas](https://pandas.pydata.org) dataframes as inputs and outputs.
Put dataframes (of materials) in, get dataframes (of property predictions) out.
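The calling convention can be illustrated with a toy stand-in. The sketch below is not Automatminer code: `ToyPipe` is a hypothetical class that only mimics the `fit`/`predict` shape described above, using plain lists of dicts in place of dataframes and the mean of the training targets in place of a real model. The `"{target} predicted"` key mirrors the column name `MatPipe` uses for its predictions.

```python
from statistics import mean


class ToyPipe:
    """Hypothetical stand-in that mimics the fit/predict calling convention."""

    def fit(self, rows, target):
        # Remember the target name and learn a single number: the training mean.
        self.target = target
        self.mean_ = mean(row[target] for row in rows)
        return self

    def predict(self, rows):
        # Return copies of the input rows with a "<target> predicted" entry added.
        key = f"{self.target} predicted"
        return [{**row, key: self.mean_} for row in rows]


# Training rows carry the target; rows to predict on do not.
train = [
    {"composition": "Ag0.5Ge1Pb1.75S4", "gap expt": 1.83},
    {"composition": "Ag0.5Ge1Pb1.75Se4", "gap expt": 1.51},
    {"composition": "Ag2BBr", "gap expt": 0.0},
]
new = [{"composition": "ZnSb"}]

predictions = ToyPipe().fit(train, "gap expt").predict(new)
print(predictions)
```

The real `MatPipe` does far more between `fit` and `predict` (featurization, cleaning, feature reduction, AutoML), but the way you call it has the same shape.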


# What's in this notebook?

In this notebook, we walk through the basic steps of using Automatminer to train and predict on data. We'll also view the internals of our AutoML pipeline using Automatminer's API. 

* First, we'll load a dataset of ~4,600 band gaps collected from experimental sources.
* Next, we'll fit an Automatminer `MatPipe` (pipeline) to the data.
* Then, we'll predict experimental band gap from chemical composition, and see how our predictions do (note, this is not an easy problem!)
* We'll examine our pipeline with `MatPipe`'s introspection methods.
* Finally, we look at how to save and load pipelines for reproducible predictions.

*Note: for the sake of brevity, we will use a single train-test split in this notebook. To run a full Automatminer benchmark, see the documentation for `MatPipe.benchmark`.*

# Preparing a dataset

Let's load a dataset to play around with. For this example, we will use matminer to load one of the MatBench v0.1 datasets. If you have been through some of the machine learning or data retrieval tutorials in this repo, you will be familiar with the commands needed to fetch our dataset as a dataframe.


In [1]:
from matminer.datasets import load_dataset

df = load_dataset("matbench_expt_gap")

# Let's look at our dataset
df.describe()

Unnamed: 0,gap expt
count,4604.0
mean,0.975951
std,1.445034
min,0.0
25%,0.0
50%,0.0
75%,1.8125
max,11.7


### Looking at the data

In [2]:
df.head()

Unnamed: 0,composition,gap expt
0,Ag(AuS)2,0.0
1,Ag(W3Br7)2,0.0
2,Ag0.5Ge1Pb1.75S4,1.83
3,Ag0.5Ge1Pb1.75Se4,1.51
4,Ag2BBr,0.0


### Seeing how many unique compositions are present
We should find all the compositions are unique.

In [3]:
# How many unique compositions do we have?
df["composition"].unique().shape[0]

4604
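If the count had come out lower than the number of rows, the next step would be to find which formulas repeat. Here is a minimal sketch using only the standard library, run on a made-up list; in the notebook, the values of `df["composition"]` would take its place.

```python
from collections import Counter

# Toy list of formula strings with one repeat (illustrative values only).
formulas = ["ZnSb", "Ag2BBr", "ZnSb", "TiAlAu2"]

# Count occurrences and keep only the formulas that appear more than once.
counts = Counter(formulas)
duplicates = {formula: n for formula, n in counts.items() if n > 1}

print(len(counts), duplicates)
```

Note that this compares formula strings exactly as written, just as `unique()` does in the cell above.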

### Generate a train-test split

In [4]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
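As a quick sanity check on the split sizes, the arithmetic below reproduces how scikit-learn turns a fractional `test_size` into row counts: the test count is rounded up, and the remaining rows go to training. The 4,604 figure is the row count reported by `df.describe()` above.

```python
import math

n_samples = 4604   # rows in the dataset
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # fractional test sizes are rounded up
n_train = n_samples - n_test                # everything else is training data

print(n_train, n_test)  # 3683 921
```

These are the same sample counts that appear in the fitting and prediction logs later in the notebook.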

### Remove the target property from the test_df

Let's remove the testing dataframe's target property so we can be sure we are not giving Automatminer any test information.

Our target variable is `"gap expt"`.

In [5]:
target = "gap expt"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()

Unnamed: 0,composition
4514,ZnSb
834,Co1Te1.88
4481,Zn2Ni9O13
3958,TiAlAu2
3087,Pr(MnSi)2


In [6]:
prediction_df.describe()

Unnamed: 0,composition
count,921
unique,921
top,La2GeSe5
freq,1


# Fitting and predicting with Automatminer's MatPipe

Our dataset contains 4,604 unique stoichiometries and experimentally measured band gaps. We have everything we need to start our AutoML pipeline.

For simplicity, we will use a `MatPipe` preset. `MatPipe` is highly customizable and has hundreds of configuration options, but most use cases will be satisfied by one of the preset configurations. We load a preset with the `from_preset` method.

In this example, we'll use the "express" preset, which will take approximately an hour.


In [7]:
from automatminer import MatPipe

pipe = MatPipe.from_preset("express")





### Fitting the pipeline

To fit an Automatminer `MatPipe` to the data, pass in your training data and desired target.

In [8]:
pipe.fit(train_df, target)

2019-10-14 20:51:56 INFO     Problem type is: regression
2019-10-14 20:51:56 INFO     Fitting MatPipe pipeline to data.
2019-10-14 20:51:56 INFO     AutoFeaturizer: Starting fitting.
2019-10-14 20:51:56 INFO     AutoFeaturizer: Compositions detected as strings. Attempting conversion to Composition objects...




2019-10-14 20:51:56 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.




2019-10-14 20:52:55 INFO     AutoFeaturizer: Will remove YangSolidSolution because it's fraction passing the precheck for this dataset (0.4051045343469997) was less than the minimum (0.9)
2019-10-14 20:52:55 INFO     AutoFeaturizer: Will remove Miedema because it's fraction passing the precheck for this dataset (0.4051045343469997) was less than the minimum (0.9)
2019-10-14 20:52:55 INFO     AutoFeaturizer: Featurizer type structure not in the dataframe to be fitted. Skipping...
2019-10-14 20:52:55 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe to be fitted. Skipping...
2019-10-14 20:52:55 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe to be fitted. Skipping...
2019-10-14 20:52:55 INFO     AutoFeaturizer: Finished fitting.
2019-10-14 20:52:55 INFO     AutoFeaturizer: Starting transforming.
2019-10-14 20:52:55 INFO     AutoFeaturizer: Featurizing with ElementProperty.




2019-10-14 20:53:03 INFO     AutoFeaturizer: Featurizing with OxidationStates.




2019-10-14 20:53:03 INFO     AutoFeaturizer: Featurizing with ElectronAffinity.




2019-10-14 20:53:03 INFO     AutoFeaturizer: Featurizing with IonProperty.




2019-10-14 20:53:15 INFO     AutoFeaturizer: Featurizer type structure not in the dataframe. Skipping...
2019-10-14 20:53:15 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2019-10-14 20:53:15 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2019-10-14 20:53:15 INFO     AutoFeaturizer: Finished transforming.
2019-10-14 20:53:15 INFO     DataCleaner: Starting fitting.
2019-10-14 20:53:15 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'drop'
2019-10-14 20:53:15 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2019-10-14 20:53:16 INFO     DataCleaner: Before handling na: 3683 samples, 141 features
2019-10-14 20:53:16 INFO     DataCleaner: 0 samples did not have target values. They were dropped.
2019-10-14 20:53:16 INFO     DataCleaner: Handling feature na by max na threshold of 0.01 with method 'drop'.
2019-10-14 20:53:16 INFO     DataCleaner: These 8 feature


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



2019-10-14 20:54:21 INFO     TreeFeatureReducer: Finished tree-based feature reduction of 114 initial features to 48
2019-10-14 20:54:21 INFO     FeatureReducer: Finished fitting.
2019-10-14 20:54:21 INFO     FeatureReducer: Starting transforming.
2019-10-14 20:54:21 INFO     FeatureReducer: Finished transforming.
2019-10-14 20:54:21 INFO     TPOTAdaptor: Starting fitting.
28 operators have been imported by TPOT.



_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by StandardScaler..
Skipped pipeline #38 due to time out. Continuing to the next pipeline.
Generation 1 - Current Pareto front scores:
-3	-0.5019658700076691	RandomForestRegressor(Normalizer(SelectPercentile(input_matrix, SelectPercentile__percentile=43), Normalizer__norm=l1), RandomForestRegressor__bootstrap=False, RandomForestRegressor__max_features=0.7500000000000002, RandomForestRegressor__min_samples_leaf=1, RandomForestRegressor__mi

MatPipe(autofeaturizer=AutoFeaturizer(bandstructure_col=None, cache_src=None,
                                      composition_col='composition',
                                      do_precheck=True, dos_col='dos',
                                      drop_inputs=True, exclude=[],
                                      featurizers={'bandstructure': [BandFeaturizer(find_method='nearest',
                                                                                    kpoints=None,
                                                                                    nbands=2),
                                                                     BranchPointEnergy(atol=1e-05,
                                                                                       calculate_band_edges=True,
                                                                                       n_cb=1,
                                                                                       n_vb=1)],
         

### Predicting new data

Our MatPipe is now fit. Let's predict our test data with `MatPipe.predict`. This should only take a few minutes.

In [10]:
prediction_df = pipe.predict(prediction_df)

2019-10-14 21:56:56 INFO     Beginning MatPipe prediction using fitted pipeline.
2019-10-14 21:56:56 INFO     AutoFeaturizer: Starting transforming.
2019-10-14 21:56:56 INFO     AutoFeaturizer: Compositions detected as strings. Attempting conversion to Composition objects...




2019-10-14 21:56:57 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.




2019-10-14 21:57:08 INFO     AutoFeaturizer: Featurizing with ElementProperty.




2019-10-14 21:57:10 INFO     AutoFeaturizer: Featurizing with OxidationStates.




2019-10-14 21:57:10 INFO     AutoFeaturizer: Featurizing with ElectronAffinity.




2019-10-14 21:57:10 INFO     AutoFeaturizer: Featurizing with IonProperty.




2019-10-14 21:58:45 INFO     AutoFeaturizer: Featurizer type structure not in the dataframe. Skipping...
2019-10-14 21:58:45 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2019-10-14 21:58:45 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2019-10-14 21:58:45 INFO     AutoFeaturizer: Finished transforming.
2019-10-14 21:58:45 INFO     DataCleaner: Starting transforming.
2019-10-14 21:58:45 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2019-10-14 21:58:45 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2019-10-14 21:58:45 INFO     DataCleaner: Before handling na: 921 samples, 140 features
['minimum oxidation state', 'maximum oxidation state', 'range oxidation state', 'std_dev oxidation state', 'avg anion electron affinity', 'compound possible', 'max ionic char', 'avg ionic char']
2019-10-14 21:58:45 INFO     DataCleaner: After handling na: 921 sa

### Examine predictions

`MatPipe` places the predictions in a column called `"{target} predicted"`:

In [11]:
prediction_df.head()

Unnamed: 0,MagpieData maximum Number,MagpieData minimum MendeleevNumber,MagpieData range MendeleevNumber,MagpieData avg_dev MendeleevNumber,MagpieData avg_dev AtomicWeight,MagpieData maximum MeltingT,MagpieData range MeltingT,MagpieData mean MeltingT,MagpieData avg_dev MeltingT,MagpieData mean Column,...,MagpieData mean GSvolume_pa,MagpieData avg_dev GSvolume_pa,MagpieData mode GSvolume_pa,MagpieData maximum GSbandgap,MagpieData mean GSbandgap,MagpieData avg_dev GSbandgap,MagpieData avg_dev GSmagmom,MagpieData mean SpaceGroupNumber,MagpieData avg_dev SpaceGroupNumber,gap expt predicted
4514,51.0,69.0,16.0,8.0,28.19,903.78,211.1,798.23,105.55,13.5,...,22.76,8.8,13.96,0.0,0.0,0.0,0.0,180.0,14.0,0.9297
834,52.0,58.0,32.0,14.506173,31.127892,1768.0,1045.34,1085.625278,473.871335,13.569444,...,26.250023,11.114599,34.763333,0.464,0.302889,0.21034,0.70195,166.583333,19.039352,0.84045
4481,30.0,61.0,26.0,12.1875,21.802408,1728.0,1673.2,735.406667,744.445,13.416667,...,9.965208,0.931892,9.105,0.0,0.0,0.0,0.279091,107.041667,102.961806,0.84395
3958,79.0,43.0,30.0,9.5,79.77115,1941.0,1007.53,1387.2825,276.85875,9.75,...,16.6425,0.08125,16.7,0.0,0.0,0.0,8e-06,217.25,11.625,1.65345
3087,59.0,17.0,61.0,18.08,31.806681,1687.0,483.0,1523.2,131.04,9.0,...,19.506034,7.214759,10.487586,0.773,0.3092,0.37104,0.000149,216.4,8.96,0.0


### Score predictions

Now let's score our predictions using mean absolute error (MAE), and compare them to a `DummyRegressor` baseline from sklearn.

In [17]:
from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor

# Fit the dummy baseline; with its default strategy it always predicts the mean training target
dr = DummyRegressor()
dr.fit(train_df["composition"], train_df[target])
dummy_test = dr.predict(test_df["composition"])


# Score dummy and MatPipe
true = test_df[target]
matpipe_test = prediction_df[target + " predicted"]

mae_matpipe = mean_absolute_error(true, matpipe_test)
mae_dummy = mean_absolute_error(true, dummy_test)

print("Dummy MAE: {} eV".format(mae_dummy))
print("MatPipe MAE: {} eV".format(mae_matpipe))

Dummy MAE: 1.1256546688824407 eV
MatPipe MAE: 0.4591923995656894 eV
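For reference, `DummyRegressor` with its default strategy predicts the mean of the training targets for every sample, and MAE is the average of the absolute differences between true and predicted values. The sketch below spells both out in plain Python on made-up numbers (the toy targets are illustrative and not taken from the dataset), then computes the relative improvement from the two MAE values printed above. The helper name `mae` is ours.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Average absolute difference between paired true and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Toy targets (illustrative values only).
y_train = [0.0, 0.0, 2.0, 4.0]
y_test = [0.0, 1.0, 3.0]

# A mean baseline ignores its inputs and always predicts the training mean.
baseline = mean(y_train)
toy_mae = mae(y_test, [baseline] * len(y_test))
print(toy_mae)

# Relative MAE reduction of MatPipe over the dummy, using the values printed above.
mae_dummy = 1.1256546688824407
mae_matpipe = 0.4591923995656894
improvement = 1 - mae_matpipe / mae_dummy
print(f"{improvement:.1%}")
```

On this split, the fitted pipeline reduces the baseline error by a little under 60%.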


# Examining the internals of MatPipe

Inspect `MatPipe` internals with a dict/text digest from either `MatPipe.inspect` (long, comprehensive version of all proper attribute names) or `MatPipe.summarize` (executive summary).

In [15]:
import pprint

# Get a summary and save a copy to json
summary = pipe.summarize(filename="MatPipe_predict_experimental_gap_from_composition_summary.json")

pprint.pprint(summary)

{'data_cleaning': {'drop_na_targets': 'True',
                   'encoder': 'one-hot',
                   'feature_na_method': 'drop',
                   'na_method_fit': 'drop',
                   'na_method_transform': 'fill'},
 'feature_reduction': {'reducer_params': "{'tree': {'importance_percentile': "
                                         "0.99, 'mode': 'regression', "
                                         "'random_state': 0}}",
                       'reducers': "('corr', 'tree')"},
 'features': ['MagpieData maximum Number',
              'MagpieData minimum MendeleevNumber',
              'MagpieData range MendeleevNumber',
              'MagpieData avg_dev MendeleevNumber',
              'MagpieData avg_dev AtomicWeight',
              'MagpieData maximum MeltingT',
              'MagpieData range MeltingT',
              'MagpieData mean MeltingT',
              'MagpieData avg_dev MeltingT',
              'MagpieData mean Column',
              'MagpieData avg_dev Colu
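The saved digest can be read back later without Automatminer installed. The sketch below assumes the file written by `summarize` is plain JSON with the same layout as the printed dict; to stay self-contained it writes a stand-in copy of the part of the summary shown above to a temporary file. In the printed summary, nested settings such as `reducer_params` appear as strings, so `ast.literal_eval` is used to turn them back into Python objects.

```python
import ast
import json
import os
import tempfile

# Stand-in for the summary dict, copied from the part of the output shown above.
summary = {
    "data_cleaning": {
        "drop_na_targets": "True",
        "encoder": "one-hot",
        "feature_na_method": "drop",
        "na_method_fit": "drop",
        "na_method_transform": "fill",
    },
    "feature_reduction": {
        "reducer_params": "{'tree': {'importance_percentile': 0.99, "
                          "'mode': 'regression', 'random_state': 0}}",
        "reducers": "('corr', 'tree')",
    },
}

# Write and re-read the digest the way any JSON file would be handled.
path = os.path.join(tempfile.mkdtemp(), "summary.json")
with open(path, "w") as f:
    json.dump(summary, f, indent=2)
with open(path) as f:
    loaded = json.load(f)

# Nested settings are stored as strings; literal_eval parses them safely.
reducer_params = ast.literal_eval(loaded["feature_reduction"]["reducer_params"])
reducers = ast.literal_eval(loaded["feature_reduction"]["reducers"])
print(reducers, reducer_params["tree"]["importance_percentile"])
```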

In [16]:
# Explain the MatPipe's internals more comprehensively
details = pipe.inspect(filename="MatPipe_predict_experimental_gap_from_composition_details.json")

print(details)

{'autofeaturizer': {'autofeaturizer': {'cache_src': None, 'preset': 'express', '_logger': <Logger automatminer (INFO)>, 'featurizers': {'composition': [ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x1280f9710>,
                features=['Number', 'MendeleevNumber', 'AtomicWeight',
                          'MeltingT', 'Column', 'Row', 'CovalentRadius',
                          'Electronegativity', 'NsValence', 'NpValence',
                          'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
                          'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
                          'GSvolume_pa', 'GSbandgap', 'GSmagmom',
                          'SpaceGroupNumber'],
                stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
                       'mode']), OxidationStates(stats=['minimum', 'maximum', 'range', 'std_dev']), ElectronAffinity(), IonProperty(data_source=<matminer.utils.data.PymatgenData object at 0x122c17710>,


### Access MatPipe's internal objects directly.

You can access MatPipe's internal objects directly, instead of via a text digest; you just need to know which attributes to access. See the online API docs or the source code for more info.

In [21]:
# Access some attributes of MatPipe directly, instead of via a text digest
print(pipe.learner.best_pipeline)

Pipeline(memory=Memory(location=/var/folders/4z/3vrw2wq10kzfh29c4x35qk3m0000gp/T/tmp79ge0rli/joblib),
         steps=[('variancethreshold', VarianceThreshold(threshold=0.2)),
                ('normalizer', Normalizer(copy=True, norm='max')),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=False, criterion='mse',
                                       max_depth=None,
                                       max_features=0.7500000000000002,
                                       max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=200, n_jobs=None,
                                       oob_score=False, random_state=None,
                                

In [22]:
print(pipe.autofeaturizer.featurizers["composition"])

[ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x1280f9710>,
                features=['Number', 'MendeleevNumber', 'AtomicWeight',
                          'MeltingT', 'Column', 'Row', 'CovalentRadius',
                          'Electronegativity', 'NsValence', 'NpValence',
                          'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
                          'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
                          'GSvolume_pa', 'GSbandgap', 'GSmagmom',
                          'SpaceGroupNumber'],
                stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
                       'mode']), OxidationStates(stats=['minimum', 'maximum', 'range', 'std_dev']), ElectronAffinity(), IonProperty(data_source=<matminer.utils.data.PymatgenData object at 0x122c17710>,
            fast=False)]


# Persistence of pipelines

### Being able to reproduce your results is a crucial aspect of materials informatics.
`MatPipe` provides methods for easily saving and loading **entire pipelines** for use by others.

Save a MatPipe for later with `MatPipe.save`. Load it with `MatPipe.load`.


In [None]:
# Save the pipeline for later

filename = "MatPipe_predict_experimental_gap_from_composition.p"
pipe.save(filename)

In [None]:
# Load your saved pipeline later, or on another machine
pipe_loaded = MatPipe.load(filename)
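A saved pipeline is generally only reliable when it is loaded with the same package versions that created it, which is why this notebook lists its versions at the top. One way to make that explicit is to write a small "sidecar" file next to the saved pipeline. The sketch below is a suggestion, not an Automatminer feature: the helper names and the `.env.json` suffix are hypothetical, and `importlib.metadata` requires Python 3.8 or newer.

```python
import json
import os
import platform
import tempfile
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def write_environment_sidecar(pipeline_filename):
    """Write '<pipeline_filename>.env.json' describing the current environment."""
    info = {
        "python": platform.python_version(),
        "packages": {
            name: installed_version(name)
            for name in ("automatminer", "matminer", "scikit-learn")
        },
    }
    sidecar = pipeline_filename + ".env.json"
    with open(sidecar, "w") as f:
        json.dump(info, f, indent=2)
    return sidecar


# Demonstrated in a temporary directory; in the notebook, pass `filename` instead.
demo = os.path.join(tempfile.mkdtemp(), "MatPipe_example.p")
sidecar_path = write_environment_sidecar(demo)
print(sidecar_path)
```

Sharing the sidecar together with the pipeline file tells the recipient which environment to recreate before calling `MatPipe.load`.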

# This concludes the Automatminer basic tutorial

Congrats! You've made it through the basic Automatminer tutorial!

In this tutorial, you learned how to:

1. Access a MatBench benchmarking dataset with matminer.
2. Fit and make production predictions with `MatPipe`.
3. Inspect the `MatPipe` pipeline.
4. Save and share your results for reproducible science. 


If you encountered any problems running this notebook, please open an issue on the repo or post on our [support forum](https://hackingmaterials.discourse.group).