{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Seriation Classification: sc-1 #" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of experiment `sc-1` is to validate that the Laplacian eigenvalue spectral distance can be useful in k-Nearest Neighbor classifiers for seriation output. In this experiment, I take a supervised learning approach, starting with two regional metapopulation models, simulating unbiased cultural transmission with 50 replicates across each model, sampling and time averaging the resulting cultural trait distributions in archaeologically realistic ways, and then seriating the results using our IDSS algorithm. Each seriation resulting from this procedure is thus \"labeled\" as to the regional metapopulation model from which it originated, so we can assess the accuracy of predicting that label based upon the graph spectral similarity. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:22:33.964109", "start_time": "2016-02-23T09:22:32.042017" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Couldn't import dot_parser, loading of dot files will not be possible.\n" ] } ], "source": [ "import numpy as np\n", "import networkx as nx\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "import cPickle as pickle\n", "from copy import deepcopy\n", "from sklearn.metrics import classification_report, accuracy_score, confusion_matrix" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:22:34.122866", "start_time": "2016-02-23T09:22:33.970680" }, "collapsed": true }, "outputs": [], "source": [ "train_graphs = pickle.load(open(\"train-freq-graphs.pkl\",'r'))\n", "train_labels = pickle.load(open(\"train-freq-labels.pkl\",'r'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sklearn-mmadsen 
is a Python package of useful machine learning tools that I'm accumulating for research and commercial work. You can find it at http://github.com/mmadsen/sklearn-mmadsen. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:22:36.866651", "start_time": "2016-02-23T09:22:36.857846" }, "collapsed": false }, "outputs": [], "source": [ "import sklearn_mmadsen.graphs as skm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Classification Attempt ##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's just see if the graph spectral distance does anything useful at all, or whether I'm barking up the wrong tree. I imagine that we want a few neighbors (to rule out relying on a single neighbor which might be anomalous), but not too many. So let's start with k=5. \n", "\n", "The approach here is essentially a \"leave one out\" strategy on the dataset. The KNN model isn't really \"trained\" in the usual sense of the term, so we don't need separate test and train sets; we just need to make sure that the target graph we're trying to predict is not one of the \"training\" graphs that we calculate spectral distances to. Otherwise the target graph would match itself at zero distance and would always appear among its own nearest neighbors. So we first define a simple function which splits one graph out of the training set and returns it along with the rest. I'd use scikit-learn functions for this, but our \"data\" is really a list of NetworkX objects, not a numeric matrix." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:22:39.628097", "start_time": "2016-02-23T09:22:39.625961" }, "collapsed": false }, "outputs": [], "source": [ "gclf = skm.GraphEigenvalueNearestNeighbors(n_neighbors=5)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:22:42.344478", "start_time": "2016-02-23T09:22:42.338773" }, "collapsed": true }, "outputs": [], "source": [ "def leave_one_out_cv(ix, train_graphs, train_labels):\n", " \"\"\"\n", " Simple LOO data sets for kNN classification, given an index, returns a train set, labels, with the left out \n", " graph and label as test_graph, test_label\n", " \"\"\"\n", " test_graph = train_graphs[ix]\n", " test_label = train_labels[ix]\n", " train_loo_graphs = deepcopy(train_graphs)\n", " train_loo_labels = deepcopy(train_labels)\n", " del train_loo_graphs[ix]\n", " del train_loo_labels[ix]\n", " return (train_loo_graphs, train_loo_labels, test_graph, test_label)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:23:20.305174", "start_time": "2016-02-23T09:22:44.289728" }, "collapsed": false }, "outputs": [], "source": [ "test_pred = []\n", "for ix in range(0, len(train_graphs)):\n", " train_loo_graphs, train_loo_labels, test_graph, test_label = leave_one_out_cv(ix, train_graphs, train_labels)\n", " gclf.fit(train_loo_graphs, train_loo_labels)\n", " test_pred.append(gclf.predict([test_graph])[0])\n", " " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:23:26.948388", "start_time": "2016-02-23T09:23:26.921634" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " predicted 0 predicted 1\n", "actual 0 41 9\n", "actual 1 14 35\n", " precision recall f1-score support\n", "\n", " 0 0.75 0.82 0.78 50\n", " 1 0.80 0.71 0.75 49\n", "\n", "avg / total 
0.77 0.77 0.77 99\n", "\n", "Accuracy on test: 0.768\n" ] } ], "source": [ "cm = confusion_matrix(train_labels, test_pred)\n", "cmdf = pd.DataFrame(cm)\n", "cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)\n", "cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)\n", "\n", "print cmdf\n", "print(classification_report(train_labels, test_pred))\n", "print(\"Accuracy on test: %0.3f\" % accuracy_score(train_labels, test_pred))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2016-02-23T09:23:30.472779", "start_time": "2016-02-23T09:23:30.265367" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "