{ "metadata": { "name": "", "signature": "sha256:a7c7eb9eabded78dd972947ff0acd507d29036026f76bffa6cb5a6563a9f5227" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "FCWKernel Benchmark\n", "===" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import time" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import Citeseer Dataset\n", "---\n", "The citeseer dataset is a set of academic papers and between those.\n", "The goal is to predict the topic of the paper.\n", "\n", "* citeseer.content is a table of document-ids, bag of words representation and the label, indicating the topic the document belongs to\n", "* citeseer.cites specifies citation edges between document-ids\n", "\n", "The following calls result in the feature matrix \"features\", vectors \"ids\" and \"labels\" and the adjacency matrix A." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from util.DatasetUtil import LINQ\n", "from sklearn import preprocessing\n", "\n", "f = open('data/citeseer/citeseer.content', 'r')\n", "[ids, features, labels] = LINQ.readContent(f)\n", "f.close()\n", "num_nodes = len(labels)\n", "\n", "# transform labels into consecutive integers starting at 0\n", "le = preprocessing.LabelEncoder()\n", "le.fit(labels)\n", "labels = le.transform(labels)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read adjacency matrix. Note that a few paper-ids appear in the adjacency list, but do not have a content entry." ] }, { "cell_type": "code", "collapsed": false, "input": [ "f = open('data/citeseer/citeseer.cites', 'r')\n", "A = LINQ.readAdjacencyMatrix(f, num_nodes, ids)\n", "f.close()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "'197556' missing in content\n", "'ghani01hypertext' missing in content\n", "'38137' missing in content\n", "'95786' missing in content\n", "'nielsen00designing' missing in content\n", "'flach99database' missing in content\n", "'khardon99relational' missing in content\n", "'kohrs99using' missing in content\n", "'kohrs99using' missing in content\n", "'raisamo99evaluating' missing in content\n", "'raisamo99evaluating' missing in content\n", "'wang01process' missing in content\n", "'hahn98ontology' missing in content\n", "'tobies99pspace' missing in content\n", "'293457' missing in content\n", "'gabbard97taxonomy' missing in content\n", "'weng95shoslifn' missing in content\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following variables represent indexes after a random **train/test split**." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import cross_validation\n", "test_size = 0.3 # percentage of unlabeled data\n", "X_train, X_test, y_train, y_test = cross_validation.train_test_split(np.array(range(num_nodes)), \n", " labels, test_size=test_size, random_state=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Number of nodes and features**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print features.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(3312, 3703)\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Number of labels**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "num_labels = max(labels)+1\n", "print num_labels" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "6\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Proportion of labels**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "r = [0.0 for i in range(num_labels)]\n", "for l in labels:\n", " r[l] += 1\n", "r = map(lambda x: x/num_nodes, r)\n", "print r" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0.07518115942028986, 0.17995169082125603, 0.21165458937198067, 0.15338164251207728, 0.20169082125603865, 0.17814009661835747]\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Number of Links**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print (A != 0).sum(0).sum()/2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "4598\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bag-Of-Words Kernel\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create linear bag-of-words kernel BOWK (content-only) and measure duration in seconds." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from util import Kernel\n", "\n", "start = time.clock()\n", "BOWK = Kernel.LinearKernel(features) \n", "print time.clock()-start" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "33.740926\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "train SVM" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import svm\n", "\n", "BOWK_train = BOWK[X_train][:,X_train]\n", "BOWK_model = svm.SVC(kernel='precomputed', C=1, verbose=True).fit(BOWK_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[LibSVM]" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test accuracy which is the percentage of correctly predicted labels in the test set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import accuracy_score\n", "\n", "BOWK_test = BOWK[X_test,:][:,X_train];\n", "score = accuracy_score(BOWK_model.predict(BOWK_test), y_test)\n", "print str(score)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.706237424547\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "CW Kernel\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute a CWK using 10 hop random walks and absorbing probability alpha=0.5." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import CWKernel\n", "\n", "start = time.clock()\n", "[CWKLabel_train, CWKLabel_test] = CWKernel.CWKernel(A, X_train, labels[X_train], X_test, \n", " max(labels)+1, 10, alpha=0.5)\n", "print time.clock()-start" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1.199689\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "CWK_model = svm.SVC(kernel='precomputed', C=0.1, verbose=True, shrinking=False).fit(CWKLabel_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[LibSVM]" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "score = accuracy_score(CWK_model.predict(CWKLabel_test), y_test)\n", "print str(score)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.738430583501\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "FCW Kernel\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For FCWK additionally supply a node kernel, which in our case is simply the linear bag of words kernel we created earlier." ] }, { "cell_type": "code", "collapsed": false, "input": [ "start = time.clock()\n", "[CWK_train, CWK_test] = CWKernel.CWFeatureKernel(A, X_train, X_test, 10, node_kernel=BOWK, alpha=0.5)\n", "print time.clock()-start" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "60.062712\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "CWK_model = svm.SVC(kernel='precomputed', C=0.1, verbose=True, shrinking=False).fit(CWK_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[LibSVM]" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "score = accuracy_score(CWK_model.predict(CWK_test), y_test)\n", "print str(score)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.778672032193\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Cross Validate FCWK on Citeseer**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def cross_validate(apply_classifier, labels, n_folds=3):\n", " y = labels\n", " kf = cross_validation.KFold(len(labels), n_folds, shuffle=True)\n", " scores = []\n", " for train, test in kf:\n", " [CWK_train, CWK_test] = apply_classifier(train, test)\n", " CWK_model = svm.SVC(kernel='precomputed', C=0.1, verbose=True, shrinking=False).fit(CWK_train, y[train])\n", " score = accuracy_score(CWK_model.predict(CWK_test), y[test])\n", " print \"finished fold: \" + str(score)\n", " scores.append(score)\n", " print np.mean(scores)\n", " " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "def apply_FCWK_factory(A, node_kernel):\n", " X = np.array(range(len(labels)))\n", " def apply_FCWK(train_ind, test_ind):\n", " return CWKernel.CWFeatureKernel(A, X[train_ind], \n", " X[test_ind], 10, \n", " node_kernel=BOWK,\n", " alpha=0.5)\n", " return apply_FCWK" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "cross_validate(apply_FCWK_factory(A, BOWK), labels, n_folds=3)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[LibSVM]finished fold: 0.746376811594" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "[LibSVM]" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "finished fold: 0.789855072464" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "[LibSVM]" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "finished fold: 0.765398550725" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "0.767210144928\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cora Dataset\n", "===" ] }, { "cell_type": "code", "collapsed": false, "input": [ "f = open('data/cora/cora.content', 'r')\n", "[ids, features, labels] = LINQ.readContent(f)\n", "f.close()\n", "num_nodes = len(labels)\n", "\n", "le = preprocessing.LabelEncoder()\n", "le.fit(labels)\n", "labels = le.transform(labels)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "f = open('data/cora/cora.cites', 'r')\n", "A = LINQ.readAdjacencyMatrix(f, num_nodes, ids)\n", "f.close()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "X_train, X_test, y_train, y_test = cross_validation.train_test_split(np.array(range(num_nodes)), labels, test_size=0.5, random_state=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Number of nodes and features**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "features.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "(2708, 1433)" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Number of classes**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "num_labels = max(labels)+1\n", "print num_labels" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "7\n" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Proportion of labels**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "r = [0.0 for i in range(num_labels)]\n", "for l in labels:\n", " r[l] += 1\n", "r = map(lambda x: x/num_nodes, r)\n", "print r" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0.11004431314623338, 0.15435745937961595, 0.3020679468242245, 0.15731166912850814, 0.08013293943870015, 0.06646971935007386, 0.129615952732644]\n" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Number of links**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print (A != 0).sum(0).sum()/2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "5278\n" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "BOWK = Kernel.LinearKernel(features) " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "reload(CWKernel)\n", "start = time.clock()\n", "[CWK_train, CWK_test] = CWKernel.CWFeatureKernel(A, X_train, X_test, 10, alpha=0.5, node_kernel=BOWK)\n", "print time.clock()-start" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "121.642287\n" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "CWK_model = svm.SVC(kernel='precomputed', C=1, verbose=True, shrinking=False).fit(CWK_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[LibSVM]" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "score = accuracy_score(CWK_model.predict(CWK_test), y_test)\n", "print str(score)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.857459379616\n" ] } ], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Cross Validate FCWK on Cora**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cross_validate(apply_FCWK_factory(A, BOWK), labels, n_folds=3)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[LibSVM]finished fold: 0.870431893688" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "[LibSVM]" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "finished fold: 0.900332225914" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "[LibSVM]" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "finished fold: 0.852549889135" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "0.874438002912\n" ] } ], "prompt_number": 32 } ], "metadata": {} } ] }