{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# xlines of Python: machine learning\n", "\n", "This notebook goes with [the blog post of the same name](http://ageo.co/xlines04).\n", "\n", "We're going to go over a very simple machine learning exercise. We're using the data from the [2016 SEG machine learning contest](https://github.com/seg/2016-ml-contest)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.19.1'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as mpl\n", "\n", "import sklearn\n", "sklearn.__version__\n", "\n", "# Should be at least 0.18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read the data\n", "\n", "`numpy` has a convenient function, `loadtxt` that can load a CSV file. It needs a file... and ours is on the web. That's OK, we don't need to download it, we can just read it by sending its text content to a `StringIO` object, which acts exactly like a file handle." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import io\n", "\n", "r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv') # 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can't just load it, because we only want NumPy to have to handle an array of floats and there's metadata in this file (we can't tell that, I just happen to know it... and it's normal for CSV files). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](http://pandas.pydata.org/) is really convenient for this sort of data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Facies | Formation | Well Name | Depth | GR | ILD_log10 | DeltaPHI | PHIND | PE | NM_M | RELPOS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | A1 SH | SHRIMPLIN | 2793.0 | 77.45 | 0.664 | 9.9 | 11.915 | 4.6 | 1 | 1.000 |
| 1 | 3 | A1 SH | SHRIMPLIN | 2793.5 | 78.26 | 0.661 | 14.2 | 12.565 | 4.1 | 1 | 0.979 |
| 2 | 3 | A1 SH | SHRIMPLIN | 2794.0 | 79.05 | 0.658 | 14.8 | 13.050 | 3.6 | 1 | 0.957 |
| 3 | 3 | A1 SH | SHRIMPLIN | 2794.5 | 86.10 | 0.655 | 13.9 | 13.115 | 3.5 | 1 | 0.936 |
| 4 | 3 | A1 SH | SHRIMPLIN | 2795.0 | 74.58 | 0.647 | 13.5 | 13.300 | 3.4 | 1 | 0.915 |
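
The table above looks like the first rows of a pandas DataFrame, but the cell that produced it is not shown here. As a minimal sketch of the `StringIO` idea described earlier, and not necessarily the notebook's own code, the snippet below reads CSV text through a file-like object. The `sample` string is an assumption: a two-row excerpt typed from the table, standing in for `r.text` from the `requests.get` call. The `features` name is also hypothetical.

```python
import io

import pandas as pd

# Assumption: a short excerpt standing in for `r.text`, copied from the table above.
sample = (
    "Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS\n"
    "3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.000\n"
    "3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979\n"
)

# StringIO wraps the text in a file-like object, so read_csv can treat it as a file.
df = pd.read_csv(io.StringIO(sample))

# Keep only the numeric columns when a plain float array is needed for NumPy.
features = df.select_dtypes("number").to_numpy(dtype=float)

print(df.dtypes)
print(features.shape)
```

Pandas copes with the text columns (`Formation`, `Well Name`) that would trip up `np.loadtxt`, and selecting the numeric columns afterwards gives NumPy the array of floats it wants.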
**A word about the data.** This dataset is not, strictly speaking, open data. It has been shared by the Kansas Geological Survey for the purposes of the contest. That's why I'm not copying the data into this repository, but instead reading it from the web. We are working on making an open access version of this dataset. In the meantime, I'd appreciate it if you didn't replicate the data anywhere. Thanks!
© Agile Geoscience 2016