{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "# Document retrieval from Wikipedia data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fire up GraphLab Create\n", "(See [Getting Started with SFrames](../Week%201/Getting%20Started%20with%20SFrames.ipynb) for setup instructions)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Bind both names: later cells call graphlab.* as well as gl.*\n", "import graphlab\n", "import graphlab as gl" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Limit number of worker processes. This preserves system memory, which prevents hosted notebooks from crashing.\n", "graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load some text data: Wikipedia pages on people" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [], "source": [ "people = gl.SFrame('people_wiki.gl/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data contains a link to the Wikipedia article, the name of the person, and the text of the article." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| URI | name | text |
|---|---|---|
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... |

| URI | name | text |
|---|---|---|
| <http://dbpedia.org/resource/Barack_Obama> ... | Barack Obama | barack hussein obama ii brk husen bm born august ... |

| URI | name | text | word_count |
|---|---|---|---|
| <http://dbpedia.org/resource/Barack_Obama> ... | Barack Obama | barack hussein obama ii brk husen bm born august ... | {'operations': 1, 'represent': 1, 'offi ... |

| word | count |
|---|---|
| the | 40 |
| in | 30 |
| and | 21 |
| of | 18 |
| to | 14 |
| his | 11 |
| obama | 9 |
| act | 8 |
| he | 7 |
| a | 7 |

| URI | name | text | word_count |
|---|---|---|---|
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... | {'selection': 1, 'carltons': 1, 'being': ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... | {'precise': 1, 'thomas': 1, 'closely': 1, ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... | {'just': 1, 'issued': 1, 'mainly': 1, 'nominat ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... | {'all': 1, 'bauforschung': 1, ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... | {'they': 1, 'gangstergenka': 1, ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... | {'currently': 1, 'less': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... | {'exclusive': 2, 'producer': 1, 'show' ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... | {'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... | {'houston': 1, 'frankie': 1, 'labels': 1, ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... | {'phenomenon': 1, 'deborash': 1, 'both' ... |

| URI | name | text | word_count |
|---|---|---|---|
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... | {'selection': 1, 'carltons': 1, 'being': ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... | {'precise': 1, 'thomas': 1, 'closely': 1, ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... | {'just': 1, 'issued': 1, 'mainly': 1, 'nominat ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... | {'all': 1, 'bauforschung': 1, ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... | {'they': 1, 'gangstergenka': 1, ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... | {'currently': 1, 'less': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... | {'exclusive': 2, 'producer': 1, 'show' ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... | {'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... | {'houston': 1, 'frankie': 1, 'labels': 1, ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... | {'phenomenon': 1, 'deborash': 1, 'both' ... |

| tfidf |
|---|
| {'selection': 3.836578553093086, ... |
| {'precise': 6.44320060695519, ... |
| {'just': 2.7007299687108643, ... |
| {'all': 1.6431112434912472, ... |
| {'they': 1.8993401178193898, ... |
| {'currently': 1.637088969126014, ... |
| {'exclusive': 10.455187230695827, ... |
| {'taxi': 6.0520214560945025, ... |
| {'houston': 3.935505942157149, ... |
| {'phenomenon': 5.750053426395245, ... |

| word | tfidf |
|---|---|
| obama | 43.2956530721 |
| act | 27.678222623 |
| iraq | 17.747378588 |
| control | 14.8870608452 |
| law | 14.7229357618 |
| ordered | 14.5333739509 |
| military | 13.1159327785 |
| involvement | 12.7843852412 |
| response | 12.7843852412 |
| democratic | 12.4106886973 |

Starting brute force nearest neighbors model training." ], "text/plain": [ "Starting brute force nearest neighbors model training." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying the nearest-neighbors model for retrieval" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Who is closest to Obama?" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 13.291ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 13.291ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 360.03ms |" ], "text/plain": [ "| Done | | 100 | 360.03ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| query_label | reference_label | distance | rank |
|---|---|---|---|
| 0 | Barack Obama | 0.0 | 1 |
| 0 | Joe Biden | 0.794117647059 | 2 |
| 0 | Joe Lieberman | 0.794685990338 | 3 |
| 0 | Kelly Ayotte | 0.811989100817 | 4 |
| 0 | Bill Clinton | 0.813852813853 | 5 |
| 0 | Artur Davis | 0.817232375979 | 6 |
| 0 | George W. Bush | 0.818947368421 | 7 |
| 0 | John Kerry | 0.819477434679 | 8 |
| 0 | Sam Brownback | 0.821138211382 | 9 |
| 0 | Richard Cordray | 0.821808510638 | 10 |

Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 8.379ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 8.379ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 330.986ms |" ], "text/plain": [ "| Done | | 100 | 330.986ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| query_label | reference_label | distance | rank |
|---|---|---|---|
| 0 | Angelina Jolie | 0.0 | 1 |
| 0 | Brad Pitt | 0.784023668639 | 2 |
| 0 | Julianne Moore | 0.795857988166 | 3 |
| 0 | Billy Bob Thornton | 0.803069053708 | 4 |
| 0 | George Clooney | 0.8046875 | 5 |

Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 13.007ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 13.007ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 336.714ms |" ], "text/plain": [ "| Done | | 100 | 336.714ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-------------+------------------------------+----------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+------------------------------+----------------+------+\n", "| 0 | Arnold Schwarzenegger | 0.0 | 1 |\n", "| 0 | Jesse Ventura | 0.818918918919 | 2 |\n", "| 0 | John Kitzhaber | 0.824615384615 | 3 |\n", "| 0 | Lincoln Chafee | 0.833876221498 | 4 |\n", "| 0 | Anthony Foxx | 0.833910034602 | 5 |\n", "| 0 | Abel Maldonado | 0.834482758621 | 6 |\n", "| 0 | Pat Quinn (politician) | 0.837209302326 | 7 |\n", "| 0 | Scott Walker (politician) | 0.838905775076 | 8 |\n", "| 0 | Mike Johanns | 0.839009287926 | 9 |\n", "| 0 | John Garamendi | 0.839762611276 | 10 |\n", "| 0 | Sean Parnell | 0.840531561462 | 11 |\n", "| 0 | Alec Baldwin | 0.843260188088 | 12 |\n", "| 0 | Gary Herbert | 0.844594594595 | 13 |\n", "| 0 | Lonnie Napier | 0.844776119403 | 14 |\n", "| 0 | David Steelman | 0.845238095238 | 15 |\n", "| 0 | Tom Corbett | 0.845930232558 | 16 |\n", "| 0 | March Fong Eu | 0.846354166667 | 17 |\n", "| 0 | Nat Robertson | 0.846405228758 | 18 |\n", "| 0 | Bob Corker | 0.846405228758 | 19 |\n", "| 0 | David Paterson | 0.847619047619 | 20 |\n", "| 0 | Antonio Villaraigosa | 0.84776119403 | 21 |\n", "| 0 | Mary Fallin | 0.84776119403 | 22 |\n", "| 0 | Jack Markell | 0.848297213622 | 23 |\n", "| 0 | Phil Mitman | 0.848874598071 | 24 |\n", "| 0 | Mark Mahon | 0.849230769231 | 25 |\n", "| 0 | Michael Steele | 0.849673202614 | 26 |\n", "| 0 | Donald E. Hines | 0.85 | 27 |\n", "| 0 | Neil Abercrombie | 0.850152905199 | 28 |\n", "| 0 | Jay Nixon | 0.852112676056 | 29 |\n", "| 0 | Bob Miller (Nevada governor) | 0.852150537634 | 30 |\n", "| 0 | Tom Sieckmann | 0.852564102564 | 31 |\n", "| 0 | Denny Altes | 0.853260869565 | 32 |\n", "| 0 | BettyLou DeCroce | 0.853293413174 | 33 |\n", "| 0 | Javier S%C3%A1nchez | 0.853741496599 | 34 |\n", "| 0 | Patsy Kinsey | 0.854037267081 | 35 |\n", "| 0 | Rodney Alexander | 0.854103343465 | 36 |\n", "| 0 | Ed Case | 0.854838709677 | 37 |\n", "| 0 | Andrew R. Ciesla | 0.854889589905 | 38 |\n", "| 0 | Rick Perry | 0.854961832061 | 39 |\n", "| 0 | Maggie Hassan | 0.85534591195 | 40 |\n", "+-------------+------------------------------+----------------+------+\n", "[40 rows x 4 columns]\n", "\n" ] }, { "data": { "text/plain": [ "graphlab.toolkits.nearest_neighbors._nearest_neighbors.NearestNeighborsModel" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn_model.query(arnold, k=40).print_rows(num_rows=40)\n", "\n", "type(knn_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| URI | name | text |
|---|---|---|
| <http://dbpedia.org/resource/Elton_John> ... | Elton John | sir elton hercules john cbe born reginald ken ... |

| word | count |
|---|---|
| the | 27 |
| in | 18 |
| and | 15 |
| of | 13 |
| a | 10 |
| has | 9 |
| john | 7 |
| he | 7 |
| on | 6 |
| award | 5 |

| word | tfidf |
|---|---|
| furnish | 18.38947184 |
| elton | 17.48232027 |
| billboard | 17.3036809575 |
| john | 13.9393127924 |
| songwriters | 11.250406447 |
| tonightcandle | 10.9864953892 |
| overallelton | 10.9864953892 |
| 19702000 | 10.2933482087 |
| fivedecade | 10.2933482087 |
| aids | 10.262846934 |

| URI | name | text | word_count |
|---|---|---|---|
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... | {'selection': 1, 'carltons': 1, 'being': ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... | {'precise': 1, 'thomas': 1, 'closely': 1, ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... | {'just': 1, 'issued': 1, 'mainly': 1, 'nominat ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... | {'all': 1, 'bauforschung': 1, ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... | {'they': 1, 'gangstergenka': 1, ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... | {'currently': 1, 'less': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... | {'exclusive': 2, 'producer': 1, 'show' ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... | {'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... | {'houston': 1, 'frankie': 1, 'labels': 1, ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... | {'phenomenon': 1, 'deborash': 1, 'both' ... |

| tfidf |
|---|
| {'selection': 3.836578553093086, ... |
| {'precise': 6.44320060695519, ... |
| {'just': 2.7007299687108643, ... |
| {'all': 1.6431112434912472, ... |
| {'they': 1.8993401178193898, ... |
| {'currently': 1.637088969126014, ... |
| {'exclusive': 10.455187230695827, ... |
| {'taxi': 6.0520214560945025, ... |
| {'houston': 3.935505942157149, ... |
| {'phenomenon': 5.750053426395245, ... |

Starting brute force nearest neighbors model training." ], "text/plain": [ "Starting brute force nearest neighbors model training." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "knn_words = graphlab.nearest_neighbors.create(people, features=['word_count'], label='name', distance='cosine')\n", "knn_tfidf = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')\n" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 16.166ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 16.166ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 231.817ms |" ], "text/plain": [ "| Done | | 100 | 231.817ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-------------+-----------------------+-------------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+-----------------------+-------------------+------+\n", "| 0 | Elton John | 2.22044604925e-16 | 1 |\n", "| 0 | Cliff Richard | 0.16142415259 | 2 |\n", "| 0 | Sandro Petrone | 0.16822542751 | 3 |\n", "| 0 | Rod Stewart | 0.168327165587 | 4 |\n", "| 0 | Malachi O'Doherty | 0.177315545979 | 5 |\n", "| 0 | Roger Daltrey | 0.177554184666 | 6 |\n", "| 0 | Peter Paret | 0.180734837403 | 7 |\n", "| 0 | Mervyn Burtch | 0.181990140263 | 8 |\n", "| 0 | Chris Chivers | 0.1830733129 | 9 |\n", "| 0 | Dejan Bogdanovi%C4%87 | 0.184989473454 | 10 |\n", "+-------------+-----------------------+-------------------+------+\n", "[10 rows x 4 columns]\n", "\n" ] }, { "data": { "text/html": [ "
Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 12.86ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 12.86ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 319.98ms |" ], "text/plain": [ "| Done | | 100 | 319.98ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-------------+------------------+----------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+------------------+----------------+------+\n", "| 0 | Elton John | 0.0 | 1 |\n", "| 0 | Phil Collins | 0.76399026764 | 2 |\n", "| 0 | Rod Stewart | 0.773333333333 | 3 |\n", "| 0 | Annie Lennox | 0.776623376623 | 4 |\n", "| 0 | Barry Gibb | 0.780952380952 | 5 |\n", "| 0 | Sting (musician) | 0.787172011662 | 6 |\n", "| 0 | Adele | 0.78813559322 | 7 |\n", "| 0 | Roger Daltrey | 0.788461538462 | 8 |\n", "| 0 | Billy Joel | 0.790769230769 | 9 |\n", "| 0 | Carrie Underwood | 0.79177377892 | 10 |\n", "+-------------+------------------+----------------+------+\n", "[10 rows x 4 columns]\n", "\n" ] } ], "source": [ "knn_words.query(elton, radius=0.84, k=10).print_rows()\n", "knn_tfidf.query(elton, radius=0.84, k=10).print_rows()" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 8.141ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 8.141ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 267.05ms |" ], "text/plain": [ "| Done | | 100 | 267.05ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-------------+--------------------------+--------------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+--------------------------+--------------------+------+\n", "| 0 | Victoria Beckham | -2.22044604925e-16 | 1 |\n", "| 0 | Mary Fitzgerald (artist) | 0.207307036115 | 2 |\n", "| 0 | Adrienne Corri | 0.214509782788 | 3 |\n", "| 0 | Beverly Jane Fry | 0.217466468741 | 4 |\n", "| 0 | Raman Mundair | 0.217695474992 | 5 |\n", "+-------------+--------------------------+--------------------+------+\n", "[5 rows x 4 columns]\n", "\n" ] }, { "data": { "text/html": [ "
Starting pairwise querying." ], "text/plain": [ "Starting pairwise querying." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Query points | # Pairs | % Complete. | Elapsed Time |" ], "text/plain": [ "| Query points | # Pairs | % Complete. | Elapsed Time |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0 | 1 | 0.00169288 | 14.129ms |" ], "text/plain": [ "| 0 | 1 | 0.00169288 | 14.129ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Done | | 100 | 305.451ms |" ], "text/plain": [ "| Done | | 100 | 305.451ms |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------+---------+-------------+--------------+" ], "text/plain": [ "+--------------+---------+-------------+--------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| query_label | reference_label | distance | rank |
|---|---|---|---|
| 0 | Victoria Beckham | 0.0 | 1 |
| 0 | Cheryl Cole | 0.800586510264 | 2 |
| 0 | Heidi Klum | 0.810344827586 | 3 |
| 0 | Simon Fuller | 0.822742474916 | 4 |
| 0 | Adele | 0.824915824916 | 5 |