{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# xlines of Python: machine learning\n", "\n", "This notebook goes with [the blog post of the same name](http://ageo.co/xlines04).\n", "\n", "We're going to go over a very simple machine learning exercise. We're using the data from the [2016 SEG machine learning contest](https://github.com/seg/2016-ml-contest)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.19.1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as mpl\n", "\n", "import sklearn\n", "sklearn.__version__\n", "\n", "# Should be at least 0.18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read the data\n", "\n", "`numpy` has a convenient function, `loadtxt` that can load a CSV file. It needs a file... and ours is on the web. That's OK, we don't need to download it, we can just read it by sending its text content to a `StringIO` object, which acts exactly like a file handle." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import io\n", "\n", "r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv') # 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can't just load it, because we only want NumPy to have to handle an array of floats and there's metadata in this file (we can't tell that, I just happen to know it... and it's normal for CSV files). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](http://pandas.pydata.org/) is really convenient for this sort of data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FaciesFormationWell NameDepthGRILD_log10DeltaPHIPHINDPENM_MRELPOS
03A1 SHSHRIMPLIN2793.077.450.6649.911.9154.611.000
13A1 SHSHRIMPLIN2793.578.260.66114.212.5654.110.979
23A1 SHSHRIMPLIN2794.079.050.65814.813.0503.610.957
33A1 SHSHRIMPLIN2794.586.100.65513.913.1153.510.936
43A1 SHSHRIMPLIN2795.074.580.64713.513.3003.410.915
\n", "
" ], "text/plain": [ " Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND \\\n", "0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 \n", "1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 \n", "2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 \n", "3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 \n", "4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 \n", "\n", " PE NM_M RELPOS \n", "0 4.6 1 1.000 \n", "1 4.1 1 0.979 \n", "2 3.6 1 0.957 \n", "3 3.5 1 0.936 \n", "4 3.4 1 0.915 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(io.StringIO(r.text)) # 2\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**I later learned that you can just do this...**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

**A word about the data.** This dataset is not, strictly speaking, open data. It has been shared by the Kansas Geological Survey for the purposes of the contest. That's why I'm not copying the data into this repository, but instead reading it from the web. We are working on making an open access version of this dataset. In the meantime, I'd appreciarte it if you didn't replicate the data anywhere. Thanks!

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the feature vectors, `X`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE'] # 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll load the data we want. First the feature vectors, `X`. We'll just get the logs, which are in columns 4 to 8:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = df[features].values # 4" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3232, 5)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the label vector, `y`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "y = df.Facies.values # 5" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3232,)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.shape" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAABRMAAABRCAYAAABBlzHpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3XmcHGd95/HPr4/pYw7NSLI8I49kS/Jt2Vg2lzkcYYNJ\nHOIsYSGLiUNgkwAh4BeEI9cuOV6BTTZ2gMReSAgQMN4sIWzigIzBNsawGGxkG+NDsq2RPDPSzGju\nq+/uZ/+orp4+pmda9ow0x/f9etVrpqqeqv5V9fM89dTTdZhzDhEREREREREREZHFBE51ACIiIiIi\nIiIiIrI6qDNRREREREREREREGqLORBEREREREREREWmIOhNFRERERERERESkIepMFBERERERERER\nkYaoM1FEREREREREREQaos5EERERERERERERaYg6E0VERERERERERKQh6kwUERERERERERGRhqgz\nUURERERERERERBqybJ2JZtZhZl8xs0kzGzezz5lZ8wksf6eZFczsuuWKUURERERERERERBoXWsZ1\n3w6cDlwNNAFfBD4L/NpiC5rZB4A84BpIuwl4PXAESD3vaEVERERERERERNanKHAWcJdzbnShhObc\nov11J8zMzgeeBC53zj1SnPZ64JtAt3NucIFlXwTcAbwEGAT+k3PujgXSXw98ZQnDFxERERERERER\nWY/e5py7faEEy3Vl4hXAuN+RWHQ33pWGLwP+fb6FzCyGd0Xje51zx82skc86AnD2i99KvHVLaWJ0\nYJZgtlAaz0WCBPKOQK6AM0hsay3N23POk+y+9jlueWZvzcov2jLAgW+fA0D+0plG4qFtXwyA5t7p\n0rT0xii5lnDNdIDZ7V4szmBkj/HBV+8rzfNjCo4GaP5xpLSsv0zs6AyBvKtY19a39LKzY5Ad0bmO\n5H/seQXhHwZpeWTxzuN0R5Rca5gbbryzYvpEfzP3/s2eRZf3/cpNPwDgroELOP7Vs2junS7F7Zu+\nKomLwnvPua9i+o9Gz+Ki2BD/+qW9BNMNf2RJIFsgNjBbGq/+3OHLvLzV1TbOW1/0ADd//1pCgTzv\nf+Vd3Pz9axv+nPig96SARGehoXR+HvqXS/ZVzN+fjjJdmHvqwNd/71ULri9/6LnS/0/zU87lRfOm\nC5y1jcKRPgCCu84EvH1xw413cnh2E/uOXVxK+4YzHuPM+BjAvGXB13x3jGCmfmxNYykCP312wfhT\nr9kNUMpj/ue17YvVlA9fYmsLodksTZNpUqfFSJ1lJC/PlJYDCCayREdSpe2sty6Y24d+LM290xX7\ntRH+PvU/D2Dq2iQATc+GiD4dLs3f+pZegvFcaXzvhoM16/vyp37hhD5/MWde/wxXnvYsdx58EU8d\nP4MLz+slGskC0DfVzlQqtuDyrgCZ4/Ga6fHBADu7B7jmiv088Nj5PPvdHQAEZ7NERysvEM81h0hv\nqv+9lpvd3kJz7wwbz5pi7/seK03/9uD5PDPdWZE2+GgLAIGnPsdv//dOPvMvb5h3nflLZ0pp/XHw\n6px//sxryabnvqPFYiyvR2648U7uHTqPp6a6+NXtD/F/el9SmufXZ1/+1C/MW9/70/w6strnD19B\nMhcp5ev5VO/r1JY413/026XxiYFm7v3rPWSbw2Q2Resed/wyOF/eKz8G3Dd5Xt1YAJ443gVAaChA\nfH9kwbS+6phyzWHSm6LzpHRMXTu3rblEiFBZWarreIjAUITCxbMEH20hf+lMTbl802/eS7w5vWid\nCzRcP/j1Qnn67EvOIzrmbUP59+7n28JpGWwszGuveZBoIMsdRy8lNxMiP9NEqD3N+y++hy8efjmT\nEy2EWnLEftJE+08SHBj4Dud3va5uLLPbW4l+9/GG4rbWFmzLJgzINwU5+rpmwtsSgJenR56K8N0/\nKLazzAju3F6xvMsXSscbfz/kmoz+a1vZdtPPCJzZTeG5fgrxMImfO49Qai4PlO+r8uNWvf1ZPi/Z\nGSc26MXpHBR66n9PgW1bKfQda2h/BHedyfTOVl711v0EInn2HbuYTbcHaJrK1LQpwGtXXHbJoXnr\n9lKa2Va+9bnavBbI5EvbMO9yl0d51/V3c9uRl/JrZz1YM7+83eK3N6rlL53hvefcx98d2MsbOn/G\ngzddTHpjlMuuPMj27YP82+17a5aJDiYIZvJ14wIvr/Rd20xLnyN5puPGV36LW57ZS3owTnwwgAs4\nXFuOC196hJe29hAPZLnlmb2895z7uOPoJYxn4rx9x48q1vn5w1dwUdsAj9y7m3RHgRtfu69UR5W3\ng/t+/PV583+6I8pb//hugNJn+fspvDnJ+y74bkWZL32fzhG97wkK7XEye3bWtIN9jbRZys1ub2X4\nssr2fd8jp/HQbfXr1XxTkFf90aN0tUyWYo90JkrHhUS38a437au7vM/PGx989T5ueWYvV0708egP\nz61J57fR623PGz7+I5oic/Wuv97wlBHIQKYrj8vW5r1gAkI7Z3nvOd8DmPdY/c+//tVFt2Mp/Nf/\n8aaaafW2t3xfOIPnfrmNjsN5un7xKNFQtpTu4cd2ER02Annj3W/+Brd+5/UEJsK8+83f4Md9u/h/\nPecRP17/qWLVbZTqeQBXbXiaOw9eStf3MuRavePX4cfuYMcl3pPA6tUfe3//p2w8bW77JnIxHp3d\nXoo70pngvefcV1FGbnlmL9e1PsF3v/ZimsYypTb3+Ju8bS4/f7kv6bUPHxjZwcPjXj1d3m4J5ApY\nzhE97sWW7oiS2hLmHb89V656v3ceP7njtHm3P9nVzNs+dBc5F+Czz15JejBOpDNB12dyNXXw7pcc\nYs8rnmbfwG4OPOttY6QzURNztTc/5uXj67c9xDc+e2XFvOpzycWc898Oc//wOQ2nrzAYhs4sHzlr\nPy9pG+LNj11L/tk4TTNe3ilsyOF2pHj55h6e/NJFQPG85XAfFLxz0MyeHbztY/cDcP/w2fxsopv0\nYJz2AxBOuIr6zK/DrC9CYDRM/vxZiDockCnu5+vPfJDbn3tpQ+H7eXjbK49wZHZzRTvYF++bwea5\niO0lH3+CbZHximnl56HlfQQjQxu4859fUZH/sy1w3Vu/z6Vt45wdniub7z94JZkgdLZ4ZWBqOsbe\nzqf491uuqin3u94yyA82drO9eZTW5gwHRrbw7l3fr2kb+ufQvs8++ypyLkTgQIz3/NKdDGQ2cDA5\nd76yd8NB9sbqH9sX8qnhCxnPz7WnpzMRJm4/oyZdvTrMb2u+KJLm4z99GQeHt/Kaix/jh8NnA5Cd\nbCK8IVMqV+nBufO97NBxRm67HYr9bAtZrs7ETuB4+QTnXN7Mxorz6vkb4AfOuW+cwGelAOKtW2jp\n6C5NjE9OEcyUdSbGggRyjkC2gAtAoKN9LtiuQXbuHiZa6KpZ+aYzssR+6q03t2uyoYBaOrwC1To8\nMRdka4xse6RmOoAVY3EG09uMnbvnvkw/plBzkJanY6Vl/WWaxycJ5FzFujaem+CM05LsjCfn9od1\n0vRMiLbY4p2JqdYY2Y5IRRwAI7FW2mK1+6gef/mO1s0kOrppHZ4oxe3LnTVLodnVfNahwY2c2TxL\n88ZuQs/j5vVAOk/zxFzhqv7cqW1eZ2JrcTsjR7oJB3Kl/xsVc8VOwm0Ldyb66fw8dNkllSfM08kY\n44VgaXyx/Zy3qdL/IRemzTrmTReIdlIwrzESLK7TOtrZuTtOZqqNaNnnnLHjCDtbvZ09X1nwNT/c\nQmiBDt5IJkHQFrwimlCxrPrfu/95LR0tNeWjtC3tbYQDGSKZFKG2ZoJbDLcrVVoOIBTKEJtNlLaz\n3rpgbh/6sbQOT1Ts10YEy/afn8cyu7z9HU2EiQ/PHQQ2npsg1DLXEN+5ae5k2Vdehy2FzeeOsfOM\nY3RkTyMS6abj3CTxmNcBOz62iUxy4cfYurxhkdpGbswF6NiWZ+fuOM9ObWawGHcomCGWqDxoZlvC\nhDuaF/wufNa+gdbhSTZV1T8b2zYTnajMk6GBDQAE+4Ls3B0ndv/8+y63a7KU1h8HL++1bjyDTKqp\nNG+xGMvrkZ274zy6YRPR8S62n9NKNNxVMQ+873O++t6fVl3v+ZoDnbhctJSv51O9r0MbWirWN9rS\nQlusi0xLE+mOeN3jTnms1crX97PRzXVjAYi2etsfbgrS0rNwJ7WvOqZsSxPhjvn2iSOza65Bn5sO\nE2rNzpOukkWaCFiU/K4pQgMbyO2arCmXZ13QTEtbsKFjW6P1g18vlKfPtG0llvSOyeX71c+3ha1p\nLNxE1/kHiQczRKNd5CabyE1GCG9OsnN3nOZgJ6nhDYTbMzT3RmmLzRAKRheM3TraidvRhuK20AYs\n1ul1JkaCjHS30bRruhRzLB2dO9aYVdR/AC6fLx1v/P2QixqRbe20WX/xeDRLPtiEbewmnJzLA+X7\nKhDrrFlPdZryecENLTRPeumdcxQW+J4C0dMpWLLu/Op1Fza2s/W8ZwlE80RjXbS2Bolk0zVtCvDa\nFZvOn5y3bvdFp1rnLWuBVK60DfOZ2RJn5+44LaHT2Xl+bRkpb7f47Y1quV2T7NwdJ5rZSve2Xp5q\n3kq4PU7nmcOcec7MvHHFp6cJBhfpTIwGiWxrI552FLYVvM8odEG4lZgrdiZuzLLp/BnOah+iJZgh\nWuhi5+44G+KnkUm3sPOCym1qCXRy2sY0sSe64TRvnS1lx2rw8na9/J9qjVW0L/z/I0e6aeqcZefu\neMVype+z4IjbMfKhVtId3XXr6EbaLOWso52pqvZ9YWwDBxcou/lIkG3n97C9PVuKPbp9eu64cLrV\nja+cnzf87+X04Vlanqr9rv02er3t2XFhM5HYXL3rr7dpzAimwbbncJlgzXKhaQjvmirFOt+xurpN\nvFzmy+P1trd8XziDyLYOmqdzdJyXJl7WYREZ7yYWNAI57/uIPXkGgUgTO3fH6YlsJJLpJmb1OxOr\n2yjV8wA6Nx0nkuimtTVFtsM7foXCsdL21Ks/tp/Xw5atc+V3NNtM39TmUtzR7dOlfFFeXs5o76e5\no5toNlVqcyd3eQ3/8u+qd9Zb5pmBjUSH58qEL5DJYzlHfNqLLdUaI7ipsn2XP7SZp+uUg+CGVnbu\njpMtBIg6r06Jbp+mLZarqYO3bJtg5+44HS2biaS9/RLdPl0Tc7XorPfZ23e11OSP6nPJxZx+/jjR\ntsbPk8tZOILbnuac81q5bOMk0dku8ok2IpPe+WphU5bCrgRbOsfprThvmQHzzkFTrXPf45NHNxEd\n9fZZ8zA0RSrPtf06LFCIEQg1kTtrGpoLOAdW3M/bz20hGmpse/w83H7OLNHpznnbli2jk1ihth+i\n+8I+dsYqTyzLz0PL445v8I6h5fk/0wbbL2jlgo4UF0fKzqXzp5MOwaYNXpkJTDRz5ra+edvmp2/P\nEj29i/bWAO1taWIDXTXHCZg7h/bFXBdZFyI46bXBQ6lWRmbn2ss7N/VxWfPCfQT1nH5sI4HcXHs6\nmIqSO4E6zI/zkqjRPr2FSLSb08/rK+XRwGiUpk2pUrkiXPsjKQ08QvCEXsBiZp8ovhSl3pA3s9qf\nu8pWQZ3nIBZftHIV8IETiUlEnp+lf8CBiMgKo4pOlkJDN8qsHiu9WKz0+EROKhWIdcrN+6/ISnKi\nVyb+NfCFRdL04D3rcEv5RDMLAh3AUJ3lXgPsBCarbm/+upnd75y7aqEPPfzYHYTCZb23ySxbWy+i\nq/2iRcIVERERERERERFZH2b2P8LIbQ+RT851CxaSjd8WekKdicW3uSx8/yJgZg8A7Wa2p+y5iVfj\n/bb74zqLfQL4h6ppjwM3Aove9rzjkusqb3M+Unmbs4hUWmMXWoiI1FJFJ0thjV0VstKLxUqPT+Sk\nUoFYp2zef0WWUsvle9j8xrNJ9c7d5pzu62fgpk82tPyyPDPROXfAzO4B7jazCFAoDl/13+RsZluB\ne4AbnHM/AbLAu4FrgG3ACF7RGXHOndhbEUTWkU62Pa/l1ti5kaxTL/35+R/cLQKs+Yqua4PuvpAT\nt9KLRaPxderuI1kP5ikQm7ddevLjkJPMUepFXOmV9kmm/L9ynNAzE0+QUdmPXj0eBs4F/KdYbgW6\ngA8Cu4G3F6e/bxljFFn1Om374olE1qiXqTNR1jE9yuUk0VUhK5Lyv6xXp23bc6pDEDlllP9XjmW5\nMtHMzse7rfly/zZnM3s98E0z+4BzbrB4tWHplTvOuSeAN5et5rCZvRn4spkFnHO6Z1lERERERERE\nROQUWq4rE68AxsuelwhwN95Fui87gfW0A1PqSBQRERGRU0K3mImIiIhUWK7OxE7gePkE51weGCvO\nW5SZbQb+GPjskkcnIrprS0TWPlV0IjVWerFY6fGJnFQqEOuUXsAiK98JdSaa2SfMrLDAkDezcxda\nBQ38vmtmrcA38d7m/KcnEqOINEYXWojImqeKTpbCGjuRW+nFYqXHJ3JSqUCsU27ef0VWkhN9ZuJf\nA19YJE0PMAhsKZ9oZkGgAxhaaGEzawHuAiaAXyle0biow4/dQSgcK40Hk1m2tl6khzOLiIiIyPOn\nEzkRERFZY2b2P8LIbQ+RT851CxaSqYaXP6HOROfcKDC6WDozewBoN7M9wCuAD+G9rTkEpBdYrhX4\nCdCNd9XkQ2b2+865Oxf7zB2XXEdLR3dpPH5kimBGj1oUqWeNXWghIlJLFZ1IjZVeLFZ6fCInlQrE\nOqXbnGX5tVy+h81vPJtUb2tpWrqvn4GbPtnQ8svyzETn3AG8qwu/BtwM3AYcA54BbjezzWa21cye\nMrMXQ+mKxAeAc4G/wXsb9HeAfzMzXV4ossR0oYWIrHmq6GQprLETuZVeLFZ6fCInlQrEOqXbnGXl\nW64XsABcD7QV//9dvLc5XwokgHcCYbyOw3gxzeXAhcX//wD4PvBBoAn46DLGKSIiIiIiIiIiIg1Y\nzs7EWaAdeLNzrsM591vOuQRep+IVzrnnnHNB59z9AM657wF9wI3F6UHnXADvBSwXL2OcIuvSGrvQ\nQkSklio6WQpr7KqQlV4sVnp8IieVCsQ6pducZeVbzs7EzUCQ2heuDAGddZbpPMH0IvI8rbFzIxGR\nWqroRGqs9GKx0uMTOalUINYp3eYsK99ydibWY5xYkTjR9CIiIiIiS0NXhYiIiIhUOKG3OZ+gESAP\nnF41fQu1Vx/6Bk8wfcnhx+4gFI6VxoPJLFtbL6KrXe9uEZmPzo1EZM1TRSdLYY39pL3Si8VKj0/k\npFKBWKd0m7Msv5n9jzBy20Pkk3PdgoVkquHll60z0TmXNbP9eG9lvgPAzKw4/uk6iz0wz/zXFacv\naMcl19HS0V0ajx+ZIpgpPL/gRdaBNXZuJCJSSxWdSI2VXixWenwiJ5UKxDrlKPUiKg/IMmm5fA+b\n33g2qd7W0rR0Xz8DN32yoeWX+zbnm4H3mNmwmaXwrjxsBb4IYGZfMrOPl6XvAX7JzBJmNmFmh4AX\nA3+3zHGKiIiIiNTSVSEiIiIiFZa7M9GobYKVPwOxm8qXq2wD/gHvtuYo0AFkgPHlDVNERERERERE\nREQWs9ydiR8AbnXOneaci+J1HE4D7wRwzl3lnHunn9g5d4Nz7l3OuR3F9JuBAt6tzyIiIiIiJ5du\nMRMRERGpsGydiWYWBi4H7vGnOecccDdwRYOraQbCwNiSByiyzumuLRFZ81TRidRY6cVipccnclKp\nQKxTegGLrHzLeWXiZiBI7ZuYh6i8tXkhfwkcxeuAFJElpAstRGTNU0UnS2GNncit9GKx0uMTOalU\nINYpN++/IivJsr3NeQHlz0ysn8js94G3AD/nnMsse1QiIiIiItV0IiciIiJSYTk7E0eAPHB61fQt\n1F6tWMHMPgR8BLjaOfdEIx92+LE7CIVjpfFgMsvW1ovoar/ohIIWWS/W2IUWIiK1VNGJ1FjpxWKl\nxydyUqlArFO6zVmW38z+Rxi57SHyybluwUIy1fDyy9aZ6JzLmtl+vJen3AFgZlYc/3S95czsw8Af\nAtc45x5p9PN2XHIdLR3dpfH4kSmCmcLzjF5k7dOFFiKy5qmik6Wwxk7kVnqxWOnxiZxUKhDrlKN0\n8FEekGXScvkeNr/xbFK9raVp6b5+Bm76ZEPLL/fbnG8G3mNmw2aWAgaBVuCLAGb2JTP7uJ/YzD4C\n/Dne254vM7OCmX3TzJqXOU4RERERERERERFZxHJ3Jhq1v+eWPzOxm8qXsbwH7+3NXwM+U0x7LfB7\nyxumyPqzxi60EBGppYpOlsIauypkpReLlR6fyEmlArFO6TZnWfmW+wUsHwBudc7dCKXbnPvwrjz8\nK+fcVeWJnXM7zCwAfA/4PHAlsME592fLHKfIurPGzo1ERGqpohOpsdKLxUqPT+SkUoFYp3Sbs6x8\ny3ZlopmFgcuBe/xpzjkH3A1cscCiHwOOO+e+sFyxiawlg673VIcgcsr8+FvDpzoEkVNmYKKhd9TJ\nC6WrQlYk5X9Zr4b7Gn6tgMiao/y/ciznbc6bgSC1b24eovLW5hIzeyXwDuA3lzEukTVlkL5THYLI\nKfOgOhNlHRuYVGeKrF+D6kyUdWqk79FTHYLIKaP8v3Is9zMT51P+zMS5iWYtwJeB33LOjZ/0qERE\nREREqukWMxEREZEKy/nMxBEgD5xeNX0LtVcrAuwCzgT+o/hsRSh2dppZBjjPOXe43ocdfuwOQuFY\naTyYzLK19SK62i96/lsgIiIiIuubbnMWERGRNWZm/yOM3PYQ+eRct2AhmWp4+WXrTHTOZc1sP3A1\ncAeUXsByNfDpeRZ5Cri4atpfAC3A+2Hhezl3XHIdLR3dpfH4kSmCmcLzjl9ERERERERERGStabl8\nD5vfeDap3tbStHRfPwM3fbKh5Zf7bc43A/9U7FR8EO/tznHgiwBm9iWg3zn3h865DPBk+cJmNoH3\n3panFviMKEBi+njFxFxilmB2rjMxVwgSyDsCuQLOIDE+U5o3ODBGz+MJUocGalY+Oj1CcqgfgPyh\nmZr585kZ966QdMnp0rT0dJScC9dMB5gtxuIM0n1Gz+OJ0jw/puBogMB4pLSsv0w+MUMg7yrWNfb0\nCEeHJ2mKzq0n0TNIdjiIJRe/Vyc9HSVHuCIOgIl+YypZu4/q8ZcfHxhhZrwfl5wuxe1LHkniotAT\nqPys4dExnovNMDvWTzDd8EeWBLIF8snZ0nj156b7vMsMpifH6dmQIN3XTz6Qp+dx7/9GBYe8JwWk\nbeGOaz+dn4cebq7s8X86DdOFuacOLLaf82VPAsiRZarOkwECqUEKxXnB4jpnx2foeTzBsdkpUsfm\nPudoaopQ3Pse5isLpW0ZixHM1I8tM5MisMiTClLj3j7284j/eTPjsZry4UtMTBOazZJOpklNxUgd\nN1KHMqXlAIKJLNnirymz4zN11wVz+9CPxSWnK/ZrI4Jl35Ofx1KHkgAUjoUojIdL88eeHiEYz5XG\nezZU5nlvOxrPe40YeXqUnvEE4z3DpI83MR4fJhnJep81lSOVii24vCtA5ni8ZnpwKMB4+Dg9jycY\nPjJSijs4O7f/fblAiPQC32u52YlpXHKG0PhURf0zNjhCarrycBUc8tYXmPHKrV9PV8sfmiml9cfB\nq3Omx46STc99R4vFWF6P9DyeYGxolNRUE73ZaVK9c3nBr8/8eq96Hf606jq2lObwIKlcpJSv51O9\nr1OT8Yr1TQx49XU2ECYzHq173PGXmS/vla9vdHKkbiwAqePefswNBbDxyIJpfdUx5QJh0uPR+VKS\nOjS3rblEiFBZWarreIjAUITCoVmCQ9PkD83UlMsjT80Sb043dGxrtH7w64Xy9NmpY6Xvq3y/+vm2\nUMhgY2EGDkwSDWRJHR0gNxMiP9NEfjpNT0uC2cODpCdmyI/msONNhJMJcvnUgrHPjs+QazBuy2Wx\npMOAfD5Iun+Kgnmx9gQSjPTky441VlH/Abh8oXS88fdDLm+k+2aYcuOl41EhHyYx1k8oNZcHyvdV\nIDlYs57qNOXzkpNx8kkvTueoWLZaIBVbcH71umfHZjh2cJJAU57UwADT0wHSyUxNmwK8dsVox8i8\ndbtveDY4b1kLZPKlbZhP8niUnscTzBwZoidXm6683eK3N6rlD83QE0iQ6jnG0cQk07PHSE9EGXxu\njCY3M29cudkEwUy+blxQzCt9UySOO9IxV2pPpwcnCQ4FcAGHS+YYPTDCkdZZ4oEsqUMD9AQSTB4d\nZjYzS0++cptmDw8yPDxGcqifdKbgbXvZsRqKebtO/k9PRyvaF36dnO7rp5BM0hNNVCxX+j6dI+fG\nKeTSZMb769bRjbRZKrZnfKamfX/0uckFy24+H6TvwBS5lrnYySZKx4XEkNWNr5yfN/zvZWhibN7v\n2m+j19uew0/O0hSZq3f99RamjEAGMpbHZWvzXi4Buehs6TuY71j98GONXwXzQtTb7vmU7wvv/GyW\n2ZE84wdHSIaypXTpvn5s2Ajkve8jefQogQnvHGq4b4x0fz/B4/WfKlbdRqmeBzA4MkG6r5/p6Qw5\nisfabLK0PfXqj96D08yMlbULco7R2ZFS3GQTXp1QVkZShwY42jrJ7Hg/2enMXJv7kLfN5ecvPUlv\nu0ZGxkiNz5UJXyBXwHKOXDG29HSU1Gjl+eXR/hGmkvNf+p2c9NqCORco1SlkE0wlczV18PG+UXoe\nTzA+MEK6r9huzSZqYq7ml+XeTG0dWH0uuZihAxOkhhs/T64wGIZslmfy07S3pUgdGiB/bJLCjLeP\nC6kcLpLi+OR4RV2Yd+NeYx3ITA+U9u3I8CipiRjpwUlmRyGTcPP2L9jRCIHRMPkjsxB1OCBT3M+9\nuRlSzzW2PX4ennhmmNSsq2gH+wqJGczV9kP0PzmFRSrzb/l5aHkfwchQmJnx/or8n81C71PTbGxL\nkg3Plc2pZ4fIBGG0xYttanqW56Zn5m2bD/WOkpqJMdE8Sr45Q3IkQA+JmjraP4f2JQ8NkHMhAgMx\neh5PMJAJM5qcay/3bEjwcOz51W9Dw2OM5+fa09OZCDPjTTXp6tVhfpytkTQTPcdJD4cYah8v5dHs\nZBOFiUzym+AfAAAI9klEQVSpXKUHJ0vLZodK/WrzNcormJvnS11KZvY7wEfwbnd+FHifc+4nxXn3\nAkecc++ss+wXgA3OuV9ZYP3XA19Z8sBFRERERERERETWl7c5525fKMGydyYuNzPbBLweOAKcnJ+2\nRERERERERERE1o4ocBZwl3NudKGEq74zUURERERERERERE6O+g9xEBERERERERERESmjzkQRERER\nERERERFpiDoTRUREREREREREpCHqTBQREREREREREZGGrPrORDN7r5kdNrOkmf3IzF5yqmMSeSHM\n7GNmVqganiybHzGzW8xsxMymzexrZralah3bzOybZjZrZoNm9ldmturLu6w9ZvZqM7vDzI4W8/p1\n86T5MzM7ZmYJM/uOmZ1dNb/DzL5iZpNmNm5mnzOz5qo0l5jZ/cVjxXNm9uHl3jaRxSyW/83sC/Mc\nD/ZVpVH+l1XHzP7AzB40sykzGzKz/2tm51alWZL2jpntNbP9ZpYys6fN7O0nYxtF6mkw/99XVffn\nzezWqjTK/7LqmNm7zeynxXbLpJn90Mx+vmy+6v5VYlV3LpjZrwI3AR8D9gA/Be4ys82nNDCRF+5x\n4HSgszi8qmzeJ4FfBN4EXAlsBf7Vn1msSPcBIeDlwNuB3wD+7CTELXKimoFHgfcCrnqmmX0U+F3g\nXcBLgVm8er6pLNntwAXA1Xhl40rgs2XraAXuAg4DlwEfBv7EzH5zGbZH5EQsmP+L7qTyePDWqvnK\n/7IavRr4W+BlwGuBMPBtM4uVpXnB7R0zOwv4BnAP8CLgU8DnzOx1y7JVIo1pJP874O+Zq/+7gI/4\nM5X/ZRXrAz4KXF4c7gX+3cwuKM5X3b9aOOdW7QD8CPhU2bgB/cBHTnVsGjQ83wGvc/zhOvPagDTw\nxrJp5wEF4KXF8V8AssDmsjTvAsaB0KnePg0a6g3FfHxd1bRjwAfKxtuAJPCW4vgFxeX2lKV5PZAD\nOovj7wFGyvM/8AngyVO9zRo0+EOd/P8F4OsLLHO+8r+GtTAAm4t5+VXF8SVp7wB/CTxW9Vn/G9h3\nqrdZgwZ/qM7/xWnfBW5eYBnlfw1rZgBGgXeo7l9dw6q9MtHMwng92ff405yXS+4GrjhVcYkskXOK\nt70dMrPbzGxbcfrleL/ClOf7g0Avc/n+5cDPnHMjZeu7C9gAXLT8oYssDTPbgfdrfHl+nwJ+TGV+\nH3fOPVK26N14v+i/rCzN/c65XFmau4DzzGzDMoUvslT2Fm+DO2Bmt5rZxrJ5V6D8L2tDO16+HSuO\nL1V75+V4ZYKqNDpXkJWkOv/73mZmw2b2MzP7eNWVi8r/suqZWcDM/gsQBx5Adf+qsmo7E/F+wQkC\nQ1XTh/BOPkVWqx/hXar9euDdwA7g/uIzsDqBTLFDpVx5vu9k/nIBKhuyunTiNa4Xquc7gePlM51z\nebwGucqErHZ3Ar8OXIV3e9vPAfvMzIrzlf9l1Svm508CP3DO+c+IXqr2Tr00bWYWeaGxi7xQdfI/\nwFeAXwP2Ah8HbgC+XDZf+V9WLTPbbWbTeFch3op3JeIBVPevKqFTHcAyMOo/d0hkxXPO3VU2+riZ\nPQg8B7wFSNVZrNF8r7Iha0Ej+X2xNH5njMqErFjOua+WjT5hZj8DDuGdXH53gUWV/2U1uRW4kMrn\nQ9ezFO0d5X9ZSfz8/8ryic65z5WNPmFmg8A9ZrbDOXd4kXUq/8tKdwDvWYbteM9G/JKZXblAetX9\nK9BqvjJxBMjjPZS23BZqe6FFVi3n3CTwNHA2MAg0mVlbVbLyfD9Ibbnwx1U2ZDUZxDvwL1TPDxbH\nS8wsCHQU5/lp5lsHqEzIKlI8gRzBOx6A8r+scmb2d8C1wF7n3LGyWS+0vbNY/p9yzmVeSOwiL1RV\n/h9YJPmPi3/L63/lf1mVnHM551yPc+5h59wf4b1I90ZU968qq7Yz0TmXBfbjvb0QKF0mfjXww1MV\nl8hSM7MWYBfeiyj24z1YvzzfnwtsZy7fPwBcXPVW82uASaD89gmRFa3YcTJIZX5vw3sWXHl+bzez\nPWWLXo3XCflgWZori50svmuAg8XOepFVwcy6gU2Af9Kp/C+rVrEj5ZeB1zjneqtmv9D2zlNlaa6m\n0jXF6SKnzCL5fz578K6oKq//lf9lrQgAEVT3rypWfLPNqmRmbwH+Ce/tPQ8CHwD+M3C+c274VMYm\n8nyZ2f8E/gPv1uYzgD8FLgEudM6NmtmteG+xegcwDXwaKDjnXl1cPgA8gtf5+FGgC/gS8PfOuf92\nkjdHZEHFZ4Gejdf58TDwQbzbN8ecc31m9hG8fPwbwBHgz/EernyR/8uime3D+7XxPUAT8HngQefc\nDcX5bXi3U3wH7+1uFwP/CNzonPvHk7KhIvNYKP8Xh48B/4rXqX42Xv5tBi4p/qiq/C+rUrEt81bg\nOry7L3yTzrlUWZoX1N4xs7OAx4Fb8MrG1XjPp7vWOVf9cH6Rk2Kx/G9mO4HrgX14b7l9EXAz0Ouc\nu6q4DuV/WZXM7C/wngndB7QCbwM+DFzjnLtXdf8qcqpfJ/1CB+B38E4wk3g9zS8+1TFp0PBCBrzX\n1vcX83QvcDuwo2x+BPhbvFvdpoF/AbZUrWMb8A1gBu+S8L8EAqd62zRoqB7wXihRwHtsRfnw+bI0\nf4LXYEjgvYnt7Kp1tAO34f0iOQ78AxCvSnMx8L3iOnqBD53qbdegYaH8D0SBb+F1JKaAHuB/AadV\nrUP5X8OqG+rk+zzw62VplqS9Uyxn+4vtqmeAG0719mtY38Ni+R/oBu4Dhov19kHgE0BL1XqU/zWs\nugH4XLFNkyy2cb4NXFU2X3X/KhlW9ZWJIiIiIiIiIiIicvKs2mcmioiIiIiIiIiIyMmlzkQRERER\nERERERFpiDoTRUREREREREREpCHqTBQREREREREREZGGqDNRREREREREREREGqLORBERERERERER\nEWmIOhNFRERERERERESkIepMFBERERERERERkYaoM1FEREREREREREQaos5EERERERERERERaYg6\nE0VERERERERERKQh6kwUERERERERERGRhvx/SGnHVlZD3UoAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.figure(figsize=(16, 1))\n", "plt.imshow(np.array([y]), cmap='viridis', aspect=100)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have data! Almost ready to train, we just have to get our test / train subsets sorted.\n", "\n", "## Extracting some training data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y) # 6" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2424, 5), (2424,), (808, 5), (808,))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape, y_train.shape, X_test.shape, y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Optional exercise:** Use [the docs for `train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to set the size of the test set, and also to set a random seed for the splitting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the fun can really begin. \n", "\n", "## Training and evaluating a model" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.ensemble import ExtraTreesClassifier " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "clf = ExtraTreesClassifier() # 7" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_split=1e-07, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=1, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train, y_train) # 8" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6410891089108911" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Optional exercise:** Try changing some hyperparameters, eg `verbose`, `n_estimators`, `n_jobs`, and `random_state`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = ExtraTreesClassifier(... HYPERPARAMETERS GO HERE ...)\n", "clf.fit(X_train, y_train)\n", "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All models have the same API (but not the same hyperparameters), so it's very easy to try lots of models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict!" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "y_pred = clf.predict(X_test) # 9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick score:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6410891089108911" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "accuracy_score(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix, showing exactly what kinds of mistakes (type 1 and type 2 errors) we're making:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 50, 11, 0, 1, 0, 1, 0, 0, 0],\n", " [ 10, 145, 30, 2, 0, 1, 0, 2, 0],\n", " [ 3, 35, 93, 2, 3, 2, 1, 1, 0],\n", " [ 0, 15, 1, 22, 5, 7, 0, 3, 0],\n", " [ 0, 3, 8, 3, 22, 16, 0, 7, 0],\n", " [ 0, 9, 1, 3, 9, 71, 2, 18, 2],\n", " [ 2, 5, 2, 0, 0, 1, 12, 2, 0],\n", " [ 1, 9, 6, 2, 2, 31, 1, 69, 5],\n", " [ 0, 0, 0, 1, 1, 0, 1, 1, 34]])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "confusion_matrix(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the classification report shows the type 1 and type 2 error rates (well, 1 - the error) for each facies, along with the combined, F1, score:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 1 0.76 0.79 0.78 63\n", " 2 0.62 0.76 0.69 190\n", " 3 0.66 0.66 0.66 140\n", " 4 0.61 0.42 0.49 53\n", " 5 0.52 0.37 0.44 59\n", " 6 0.55 0.62 0.58 115\n", " 7 0.71 0.50 0.59 24\n", " 8 0.67 0.55 0.60 126\n", " 9 0.83 0.89 0.86 38\n", "\n", "avg / total 0.64 0.64 0.64 808\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(y_test, y_pred)) # 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More in-depth evaluation: k-fold cross-validation\n", "\n", "We should drop out entire wells, not just a bunch of random samples. Otherwise we're training on data that's just one sample away from data we're validating against.\n", "\n", "To do this, we need a vector that contains an integer (or something) representing each unique well." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "wells = df['Well Name']" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " CHURCHMAN BIBLE 0.460\n", " CROSS H CATTLE 0.315\n", " LUKE G U 0.388\n", " NEWBY 0.376\n", " NOLAN 0.352\n", " Recruit F9 0.868\n", " SHANKLE 0.419\n", " SHRIMPLIN 0.524\n" ] } ], "source": [ "from sklearn.model_selection import LeaveOneGroupOut\n", "\n", "logo = LeaveOneGroupOut()\n", "clf = ExtraTreesClassifier(random_state=0)\n", "\n", "for train, test in logo.split(X, y, groups=wells):\n", " # train and test are the indices of the data to use.\n", " well_name = wells[test[0]]\n", " clf.fit(X[train], y[train])\n", " score = clf.score(X[test], y[test])\n", " print(\"{:>20s} {:.3f}\".format(well_name, score))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
\n", "

© Agile Geoscience 2016

\n", "
" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 1 }