{ "metadata": { "name": "", "signature": "sha256:52b8d3f4a5619aa365b898484108613698775ce405efc60cc154ab6c40ea5175" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"# INTRODUCTION TO PYTHON FOR DATA MINING\n",
"\n",
"Python is a great language for data mining, with many mature libraries for exploring, modeling, and visualizing data. To get started, I recommend downloading the [Anaconda distribution](http://continuum.io/downloads). It comes with most of the libraries you will need and provides an IDE and a package manager.\n",
"\n",
"I do most of my work from the command line, but Anaconda comes with a launcher app that can be found in the `~/anaconda` directory. To get the launcher to work on a Mac, you need to do the following:\n",
"1. Open a terminal (press Command-Space and type `terminal`)\n",
"2. Type `conda install -f launcher`\n",
"3. After that runs, type `conda install -f node-webkit`\n",
"\n",
"Now you can open the launcher and see:\n",
"1. [glueviz](https://github.com/glue-viz/glue) - Lets you link multiple plots across files\n",
"2. [IPython Notebook](http://ipython.org/notebook.html) - A great way to display and work on your data mining projects\n",
"3. [IPython qtconsole](http://ipython.org/ipython-doc/2/interactive/qtconsole.html) - Essentially an IPython terminal for coding\n",
"4. [Spyder](https://pythonhosted.org/spyder/) - An IDE for scientific Python that includes an IPython console\n",
"\n",
"# IPython vs Python\n",
"\n",
"IPython is an interactive shell for Python, meaning that you can type some code, get some results, and then type some more code. This is very useful for exploring data because you don't always know what you are looking for, and it is annoying to rerun your entire program every time you make a change.\n",
"\n",
"# Libraries You Should Know About\n",
"\n",
"1. [Pandas](http://pandas.pydata.org/) - Provides R-like data structures and a high-level API for working with data\n",
"2. 
[NumPy](http://www.numpy.org/) - Provides fast numerical computing, including arrays and linear algebra\n",
"3. [SciPy](http://www.scipy.org/) - For scientific computing, such as drawing from distributions\n",
"4. [Matplotlib](http://matplotlib.org/) - For plotting\n",
"    1. [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) - To make your plots look better\n",
"5. [Scikit-Learn](http://scikit-learn.org/stable/) - For machine learning; great documentation and tutorials\n",
"6. [Statsmodels](http://statsmodels.sourceforge.net/) - For more traditional statistics\n",
"\n",
"### Getting Seaborn\n",
"\n",
"In the terminal, type `pip install seaborn`.\n",
"\n",
"# An Example\n",
"### Read in Data\n",
"\n",
"I will use pandas to read in some data from the web and drop the rows that contain missing (NA) values." ] }, { "cell_type": "code", "collapsed": false, "input": [
"# In Python 2, make / perform true division; __future__ imports belong first\n",
"from __future__ import division\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"from pandas import DataFrame, Series\n",
"from sklearn import datasets, linear_model\n",
"# train_test_split lives here; sklearn.cross_validation was removed in scikit-learn 0.20\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"sns.set(style='ticks', palette='Set2')\n",
"%matplotlib inline\n",
"\n",
"# Read the whitespace-delimited Auto MPG data, which has no header row\n",
"data = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original\",\n",
"                   sep=r'\\s+', header=None,\n",
"                   names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',\n",
"                          'model', 'origin', 'car_name'])\n",
"print(data.shape)  # shape before dropping rows\n",
"data = data.dropna()  # drop rows with missing values\n",
"data.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(406, 9)\n" ] }, { "html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>mpg</th><th>cylinders</th><th>displacement</th><th>horsepower</th><th>weight</th><th>acceleration</th><th>model</th><th>origin</th><th>car_name</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>18</td><td>8</td><td>307</td><td>130</td><td>3504</td><td>12.0</td><td>70</td><td>1</td><td>chevrolet chevelle malibu</td></tr>\n",
"<tr><th>1</th><td>15</td><td>8</td><td>350</td><td>165</td><td>3693</td><td>11.5</td><td>70</td><td>1</td><td>buick skylark 320</td></tr>\n",
"<tr><th>2</th><td>18</td><td>8</td><td>318</td><td>150</td><td>3436</td><td>11.0</td><td>70</td><td>1</td><td>plymouth satellite</td></tr>\n",
"<tr><th>3</th><td>16</td><td>8</td><td>304</td><td>150</td><td>3433</td><td>12.0</td><td>70</td><td>1</td><td>amc rebel sst</td></tr>\n",
"<tr><th>4</th><td>17</td><td>8</td><td>302</td><td>140</td><td>3449</td><td>10.5</td><td>70</td><td>1</td><td>ford torino</td></tr>\n",
"</tbody>\n",
"</table>\n", "
<table>\n",
"<thead>\n",
"<tr><th>acceleration</th><th>8.0</th><th>8.5</th><th>9.0</th><th>9.5</th><th>10.0</th><th>10.5</th><th>11.0</th><th>11.1</th><th>11.2</th><th>11.3</th><th>...</th><th>21.5</th><th>21.7</th><th>21.8</th><th>21.9</th><th>22.1</th><th>22.2</th><th>23.5</th><th>23.7</th><th>24.6</th><th>24.8</th></tr>\n",
"<tr><th>cylinders</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>3</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>43.1</td><td>44.3</td><td>30</td><td>19</td><td>24.5</td><td>29.0</td><td>23</td><td>43.4</td><td>44</td><td>27.2</td></tr>\n",
"<tr><th>5</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>6</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>28.8</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>8</th><td>14</td><td>14.5</td><td>14</td><td>15.5</td><td>14.5</td><td>17</td><td>13.285714</td><td>16</td><td>18.1</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>23.9</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows \u00d7 95 columns</p>\n", "