{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical and Machine Learning Models Exercises\n", "----\n", "\n", "There are multiple ways to complete these exercises. A solution will be given, but other ways can work. If you find better solutions [let me know](mailto:christina.maimone@northwestern.edu).\n", "\n", "### Note\n", "\n", "There are a few exercises here to get you started, but the best way to learn how to use these packages in Python is to take some analysis you've done in another program and try to replicate it in Python instead. There is great variety in statistical and machine learning models. Without teaching the theory behind the models too (which is out of the scope of this particular workshop), it's difficult to have exercises that are both interesting and accessible to a wide audience. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports\n", "\n", "Don't worry about the pandas warning that is generated." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import sklearn\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Working with Distributions\n", "\n", "Make a plot of the pdf of the Beta distribution with different parameter values like the one found on Wikipedia: https://en.wikipedia.org/wiki/Beta_distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Difference in Means\n", "\n", "Load in data on the average daily temperature in Chicago from http://academic.udayton.edu/kissock/http/Weather/gsod95-current/ILCHICAG.txt\n", "\n", "Was the average temperature in Chicago this June significantly different from last June?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Correlation\n", "\n", "Load weather data from Chicago and Detroit (you might have done Chicago above):\n", "\n", "Chicago: http://academic.udayton.edu/kissock/http/Weather/gsod95-current/ILCHICAG.txt\n", "\n", "Detroit: http://academic.udayton.edu/kissock/http/Weather/gsod95-current/MIDETROI.txt\n", "\n", "During June 2017, were the temperatures in the two cities correlated with each other? Plot them against each other as a visual check." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exercise: Logistic Regression\n", "\n", "Implement [UCLA's IDRE](https://idre.ucla.edu/) logistic regression example written in R in Python instead: https://stats.idre.ucla.edu/r/dae/logit-regression/. \n", "\n", "The example is also available for several different statistical programs on [this page](https://stats.idre.ucla.edu/other/dae/) if they help you understand. The example in R most closely mirrors a workflow in Python. Note that in R, the logistic regression is run with the `glm` function if you're looking for where the model is actually run.\n", "\n", "Implement the entire example with `statsmodels` (and `pandas` as appropriate). Then use Scikit-learn to run the actual regression and get predicted values, but you don't need to do extra steps like getting the summary and confidence intervals, as there are not built-in methods for that. Note that the coefficient estimates you get from Scikit-learn will not match those from `statsmodels` or in the reference example. This has to do with a regularization step that Scikit-learn's function applies to the X variables.\n", "\n", "Plot with `matplotlib`.\n", "Note that neither package provides a way to get standard errors on predicted values, so you can ignore that particular step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Classification Models\n", "\n", "Using the data provided in `mydata`, fit both SVM (`sklearn.svm.SVC`) and K nearest neighbors classifier models. \n", "\n", "For each method, make a plot showing which points were wrongly classified.\n", "\n", "Make a plot showing which points are classified differently by the two methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.random.seed(42)\n", "nsamp = 60\n", "A = np.random.multivariate_normal([1,1], np.array([[1,.1],[.1,1]]), nsamp)\n", "B = np.random.multivariate_normal([4,1], np.array([[1,.1],[.1,1]]), nsamp)\n", "C = np.random.multivariate_normal([2.5,4], np.array([[1,.1],[.1,1]]), nsamp)\n", "pts = np.vstack([A,B,C])\n", "mydata = pd.DataFrame({\"x\":pts[:,0], \"y\":pts[:,1], \n", " \"group\":['red']*nsamp+['blue']*nsamp+['green']*nsamp})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.scatter(mydata['x'], mydata['y'], c=mydata['group']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Exploring Classification Model Parameters\n", "\n", "Using the supplied data, explore how well a `KNeighborsClassifier` model fits as a function of the number of neighbors used for the classification, from 1 to 10. Use 10 fold cross-validation to assess the accuracy at each number of neighbors. Plot the results, showing mean accuracy with confidence intervals.\n", "\n", "Does the number of neighbors matter? How sure are you?\n", "\n", "Hint: You can write a loop to do this, or you can use `GridSearchCV`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.random.seed(50)\n", "nsamp = 100\n", "A = np.random.multivariate_normal([1,1], np.array([[1,.1],[.1,1]]), nsamp)\n", "B = np.random.multivariate_normal([3,1], np.array([[1,.1],[.1,1]]), nsamp)\n", "C = np.random.multivariate_normal([2,3], np.array([[1,.1],[.1,1]]), nsamp)\n", "pts = np.vstack([A,B,C])\n", "mydata = pd.DataFrame({\"x\":pts[:,0], \"y\":pts[:,1], \n", " \"group\":['red']*nsamp+['blue']*nsamp+['green']*nsamp})\n", "plt.scatter(mydata['x'], mydata['y'], c=mydata['group']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Predicting Material Properties\n", "\n", "In this exercise, you will build and evaluate a linear model to predict a material property, namely band gap. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Electronic band gap is a materials property which can be described as the energy difference between top of the electronic valence and bottom of the conduction bands. Consequently, band gap primarily affects electrical conductivity and optical properties of materials. A general rule: As the band gap increases, electronic conductivity decreases. \n", "\n", "In this exercise you will train a linear model to predict electronic band gap of materials. The training set for this task contains the band gaps of 4096 binary compounds (AxBy), e.g. $H_2 O$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example is adapted from [Citrine.io](https://citrine.io/2015/03/17/machine-learning-for-the-materials-scientist-feature-engineering-model-building/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some set-up code is provided. `bandgapDFT.csv` is in the repository (download it first if you haven't already)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from numpy import mean\n", "from sklearn import linear_model, metrics, ensemble\n", "from sklearn.model_selection import cross_val_score, ShuffleSplit\n", "import json\n", "\n", "trainFile = open(\"bandgapDFT.csv\",\"r\").readlines()\n", "\n", "# Initialize dictionary and list variables for the rest of the example\n", "atomicnumbers = dict()\n", "bandgaps = []\n", "features = []\n", "\n", "for line in trainFile:\n", " split = str.split(line, ',')\n", " bandgaps.append(float(split[1])) #store numerical values of band gaps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `feature_vectors.txt` (in the repository) contains the vectorized composition of the all the compounds whose band gaps are given in `bandgapDFT.csv`. For instace, let's define $BeH_2$ as a linear combination of compositions of the first 100 elements in the periodic table:\n", "\n", "\n", "$BeH_2$ = (0.66666 H + 0.0 He + 0.0 Li + 0.33333 Be + 0.0 B + 0.0 C + 0.0 N + 0.0 O + 0.0 F + ... )\n", "\n", "The corresponding list for the given linear combination is shown below:\n", "\n", "[0.6666666666666666, 0.0, 0.0, 0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", "\n", "where each number corresponds to the abundance of each element respectively in the list below \n", "\n", "['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "with open(\"feature_vectors.txt\",\"r\") as fp: \n", " features=json.load(fp)\n", "\n", "# atomic_numbers.txt\"contains the names and atomic \n", "# numbers of the first 100 elements in the periodic table\n", "with open(\"atomic_numbers.txt\",\"r\") as fp:\n", " atomicnumbers=json.load(fp)\n", " \n", "co2 = [0.0, 0.0, 0.0, 0.0, 0.0, 0.33333333, 0.0, 0.66666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", "alna = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a (naive) way of converting each material in our training set into a vector for our machine learning model to crunch, and equivalently we can express any new chemical formula with this representation for the purposes of making predictions. Time for the fun part!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this point forward, we have everything we need to train and evaluate a regression model for material band gaps.\n", "\n", "1) First determine a baseline accuracy by calculating the mean absolute error of band gaps. This is the error corresponding to always guessing the average of the band gaps for a material. A good model should have a lower error than this.\n", "\n", "2) Use a linear ridge regression model to model band gaps using our feature set (i.e. the composition vector):\n", "\n", "* Set the alpha parameter of the ridge regression model (a paramater to control size of the fitted coefficients) equal to 0.5\n", "\n", "* Define `ShuffleSplit` cross validation scheme for the band gaps. Use 10-fold cross validation with 0.1 test fraction.\n", "\n", "* Evaluate the cross validation score for your modes with the \"mean absolute error\" scoring scheme.\n", "\n", "* Look at the model coefficients to see if they make sense with what we know about the elements (use the `atomicnumbers` dict to help with labels and matching values).\n", "\n", "3) Finally use a Random Forest Regression model and calculate the mean absolute error score for this model. Use 10 trees in the forest.\n", "\n", "* Using the Random Forest model, predict the band gap for $CO_2$ (expect a high value, non-conductive gas) and AlNa (expect a low value, conductive metal). There are variables defined for these above." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (conda p3)", "language": "python", "name": "p3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }