{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 4: Retrieving Wikipedia articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this module, we focused on using nearest neighbors and clustering to retrieve documents that interest users by analyzing the text of those documents. We explored two document representations: word counts and TF-IDF. We also built an IPython notebook for retrieving articles from Wikipedia about famous people.\n", "\n", "In this assignment, we are going to dig deeper into this application, explore the retrieval results for various famous people, and familiarize ourselves with the code needed to build a retrieval system. These techniques will be key to building the intelligent application in your capstone project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learning outcomes\n", "- Execute document retrieval code in an IPython notebook\n", "- Load and transform real text data\n", "- Compare results with word counts and TF-IDF\n", "- Set the distance function in the retrieval\n", "- Build a document retrieval model using nearest neighbor search" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[INFO] This non-commercial license of GraphLab Create is assigned to chengjun@chem.ku.dk and will expire on January 27, 2017. For commercial licensing options, visit https://dato.com/buy/.\n", "\n", "[INFO] Start server at: ipc:///tmp/graphlab_server-25044 - Server binary: /usr/local/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1454423320.log\n", "[INFO] GraphLab Server Version: 1.8.1\n", "[WARNING] Unable to create session in specified location: '/Users/jcj/.graphlab/artifacts'. 
Using: '/var/tmp/graphlab-jcj/25044/tmp_session_73cf76c0-3994-489f-a377-fd84c2b28012'\n" ] } ], "source": [ "import graphlab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and explore the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "people = graphlab.SFrame('people_wiki.gl/')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| URI | name | text |
| <http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ... |
| <http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ... |
| <http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ... |
| <http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ... |
| <http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ... |
| <http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ... |
| <http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ... |
| <http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ... |
\n", "[10 rows x 3 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tURI\tstr\n", "\tname\tstr\n", "\ttext\tstr\n", "\n", "Rows: 10\n", "\n", "Data:\n", "+-------------------------------+---------------------+\n", "| URI | name |\n", "+-------------------------------+---------------------+\n", "| \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| word | count |
| the | 40 |
| in | 30 |
| and | 21 |
| of | 18 |
| to | 14 |
| his | 11 |
| obama | 9 |
| act | 8 |
| he | 7 |
| a | 7 |
\n", "[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "" ], "text/plain": [ "Columns:\n", "\tword\tstr\n", "\tcount\tint\n", "\n", "Rows: 273\n", "\n", "Data:\n", "+-------+-------+\n", "| word | count |\n", "+-------+-------+\n", "| the | 40 |\n", "| in | 30 |\n", "| and | 21 |\n", "| of | 18 |\n", "| to | 14 |\n", "| his | 11 |\n", "| obama | 9 |\n", "| act | 8 |\n", "| he | 7 |\n", "| a | 7 |\n", "+-------+-------+\n", "[273 rows x 2 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obama_word_count_table.sort('count', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute TF-IDF for the corpus" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "people['word_count'] = graphlab.text_analytics.count_words(people['text'])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tfidf = graphlab.text_analytics.tf_idf(people['word_count'])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [], "source": [ "people['tfidf'] = tfidf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examine the TF-IDF for the Obama article" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| word | count |
| obama | 43.2956530721 |
| act | 27.678222623 |
| iraq | 17.747378588 |
| control | 14.8870608452 |
| law | 14.7229357618 |
| ordered | 14.5333739509 |
| military | 13.1159327785 |
| involvement | 12.7843852412 |
| response | 12.7843852412 |
| democratic | 12.4106886973 |
\n", "[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\tword\tstr\n", "\tcount\tfloat\n", "\n", "Rows: 273\n", "\n", "Data:\n", "+-------------+---------------+\n", "| word | count |\n", "+-------------+---------------+\n", "| obama | 43.2956530721 |\n", "| act | 27.678222623 |\n", "| iraq | 17.747378588 |\n", "| control | 14.8870608452 |\n", "| law | 14.7229357618 |\n", "| ordered | 14.5333739509 |\n", "| military | 13.1159327785 |\n", "| involvement | 12.7843852412 |\n", "| response | 12.7843852412 |\n", "| democratic | 12.4106886973 |\n", "+-------------+---------------+\n", "[273 rows x 2 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obama = people[people['name'] == 'Barack Obama']\n", "obama[['tfidf']].stack('tfidf', new_column_name = ['word', 'count']).sort('count', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manually compute distances between a few people" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clinton = people[people['name'] == 'Bill Clinton']\n", "beckham = people[people['name'] == 'David Beckham']" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.833985493688\n", "0.979130584475\n" ] } ], "source": [ "print graphlab.distances.cosine(obama['tfidf'][0], clinton['tfidf'][0])\n", "print graphlab.distances.cosine(obama['tfidf'][0], beckham['tfidf'][0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a nearest neighbor model for document retrieva" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting brute force nearest 
neighbors model training.\n" ] } ], "source": [ "knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying the nearest-neighbors model for retrieval" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting pairwise querying.\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | 0 | 1 | 0.00169288 | 22.349ms |\n", "PROGRESS: | Done | | 100 | 414.256ms |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| query_label | reference_label | distance | rank |
| 0 | Barack Obama | 0.0 | 1 |
| 0 | Joe Biden | 0.794117647059 | 2 |
| 0 | Joe Lieberman | 0.794685990338 | 3 |
| 0 | Kelly Ayotte | 0.811989100817 | 4 |
| 0 | Bill Clinton | 0.813852813853 | 5 |
\n", "[5 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tquery_label\tint\n", "\treference_label\tstr\n", "\tdistance\tfloat\n", "\trank\tint\n", "\n", "Rows: 5\n", "\n", "Data:\n", "+-------------+-----------------+----------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+-----------------+----------------+------+\n", "| 0 | Barack Obama | 0.0 | 1 |\n", "| 0 | Joe Biden | 0.794117647059 | 2 |\n", "| 0 | Joe Lieberman | 0.794685990338 | 3 |\n", "| 0 | Kelly Ayotte | 0.811989100817 | 4 |\n", "| 0 | Bill Clinton | 0.813852813853 | 5 |\n", "+-------------+-----------------+----------------+------+\n", "[5 rows x 4 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn_model.query(obama)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting pairwise querying.\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | 0 | 1 | 0.00169288 | 23.254ms |\n", "PROGRESS: | Done | | 100 | 398.965ms |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| query_label | reference_label | distance | rank |
| 0 | Taylor Swift | 0.0 | 1 |
| 0 | Carrie Underwood | 0.76231884058 | 2 |
| 0 | Alicia Keys | 0.764705882353 | 3 |
| 0 | Jordin Sparks | 0.769633507853 | 4 |
| 0 | Leona Lewis | 0.776119402985 | 5 |
\n", "[5 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tquery_label\tint\n", "\treference_label\tstr\n", "\tdistance\tfloat\n", "\trank\tint\n", "\n", "Rows: 5\n", "\n", "Data:\n", "+-------------+------------------+----------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+------------------+----------------+------+\n", "| 0 | Taylor Swift | 0.0 | 1 |\n", "| 0 | Carrie Underwood | 0.76231884058 | 2 |\n", "| 0 | Alicia Keys | 0.764705882353 | 3 |\n", "| 0 | Jordin Sparks | 0.769633507853 | 4 |\n", "| 0 | Leona Lewis | 0.776119402985 | 5 |\n", "+-------------+------------------+----------------+------+\n", "[5 rows x 4 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "swift = people[people['name'] == 'Taylor Swift']\n", "knn_model.query(swift)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true }, "outputs": [], "source": [ "john = people[people['name'] == 'Elton John']" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| word | count |
| the | 27 |
| in | 18 |
| and | 15 |
| of | 13 |
| a | 10 |
| has | 9 |
| john | 7 |
| he | 7 |
| on | 6 |
| award | 5 |
\n", "[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\tword\tstr\n", "\tcount\tint\n", "\n", "Rows: 255\n", "\n", "Data:\n", "+-------+-------+\n", "| word | count |\n", "+-------+-------+\n", "| the | 27 |\n", "| in | 18 |\n", "| and | 15 |\n", "| of | 13 |\n", "| a | 10 |\n", "| has | 9 |\n", "| john | 7 |\n", "| he | 7 |\n", "| on | 6 |\n", "| award | 5 |\n", "+-------+-------+\n", "[255 rows x 2 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "john[['word_count']].stack('word_count', new_column_name=['word', 'count']).sort('count', ascending=False)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| word | count |
| furnish | 18.38947184 |
| elton | 17.48232027 |
| billboard | 17.3036809575 |
| john | 13.9393127924 |
| songwriters | 11.250406447 |
| tonightcandle | 10.9864953892 |
| overallelton | 10.9864953892 |
| 19702000 | 10.2933482087 |
| fivedecade | 10.2933482087 |
| aids | 10.262846934 |
\n", "[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.\n", "
" ], "text/plain": [ "Columns:\n", "\tword\tstr\n", "\tcount\tfloat\n", "\n", "Rows: 255\n", "\n", "Data:\n", "+---------------+---------------+\n", "| word | count |\n", "+---------------+---------------+\n", "| furnish | 18.38947184 |\n", "| elton | 17.48232027 |\n", "| billboard | 17.3036809575 |\n", "| john | 13.9393127924 |\n", "| songwriters | 11.250406447 |\n", "| tonightcandle | 10.9864953892 |\n", "| overallelton | 10.9864953892 |\n", "| 19702000 | 10.2933482087 |\n", "| fivedecade | 10.2933482087 |\n", "| aids | 10.262846934 |\n", "+---------------+---------------+\n", "[255 rows x 2 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "john[['tfidf']].stack('tfidf', new_column_name=['word', 'count']).sort('count', ascending=False)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": true }, "outputs": [], "source": [ "vb = people[people['name'] == 'Victoria Beckham']\n", "pm = people[people['name'] == 'Paul McCartney']" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.956700637666\n", "0.825031002922\n" ] } ], "source": [ "print graphlab.distances.cosine(john['tfidf'][0], vb['tfidf'][0])\n", "print graphlab.distances.cosine(john['tfidf'][0], pm['tfidf'][0])" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting brute force nearest neighbors model training.\n" ] } ], "source": [ "knn_model_wc = graphlab.nearest_neighbors.create(people, features=['word_count'], distance='cosine', label='name')" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "PROGRESS: Starting brute force nearest neighbors model training.\n" ] } ], "source": [ "knn_model_tfidf = graphlab.nearest_neighbors.create(people, features=['tfidf'], distance='cosine', label='name')" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting pairwise querying.\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | 0 | 1 | 0.00169288 | 14.839ms |\n", "PROGRESS: | Done | | 100 | 310.023ms |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| query_label | reference_label | distance | rank |
| 0 | Elton John | 2.22044604925e-16 | 1 |
| 0 | Cliff Richard | 0.16142415259 | 2 |
| 0 | Sandro Petrone | 0.16822542751 | 3 |
| 0 | Rod Stewart | 0.168327165587 | 4 |
| 0 | Malachi O'Doherty | 0.177315545979 | 5 |
\n", "[5 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tquery_label\tint\n", "\treference_label\tstr\n", "\tdistance\tfloat\n", "\trank\tint\n", "\n", "Rows: 5\n", "\n", "Data:\n", "+-------------+-------------------+-------------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+-------------------+-------------------+------+\n", "| 0 | Elton John | 2.22044604925e-16 | 1 |\n", "| 0 | Cliff Richard | 0.16142415259 | 2 |\n", "| 0 | Sandro Petrone | 0.16822542751 | 3 |\n", "| 0 | Rod Stewart | 0.168327165587 | 4 |\n", "| 0 | Malachi O'Doherty | 0.177315545979 | 5 |\n", "+-------------+-------------------+-------------------+------+\n", "[5 rows x 4 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn_model_wc.query(john)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting pairwise querying.\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | 0 | 1 | 0.00169288 | 22.611ms |\n", "PROGRESS: | Done | | 100 | 439.836ms |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| query_label | reference_label | distance | rank |
| 0 | Elton John | -2.22044604925e-16 | 1 |
| 0 | Rod Stewart | 0.717219667893 | 2 |
| 0 | George Michael | 0.747600998969 | 3 |
| 0 | Sting (musician) | 0.747671954431 | 4 |
| 0 | Phil Collins | 0.75119324879 | 5 |
\n", "[5 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tquery_label\tint\n", "\treference_label\tstr\n", "\tdistance\tfloat\n", "\trank\tint\n", "\n", "Rows: 5\n", "\n", "Data:\n", "+-------------+------------------+--------------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+------------------+--------------------+------+\n", "| 0 | Elton John | -2.22044604925e-16 | 1 |\n", "| 0 | Rod Stewart | 0.717219667893 | 2 |\n", "| 0 | George Michael | 0.747600998969 | 3 |\n", "| 0 | Sting (musician) | 0.747671954431 | 4 |\n", "| 0 | Phil Collins | 0.75119324879 | 5 |\n", "+-------------+------------------+--------------------+------+\n", "[5 rows x 4 columns]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn_model_tfidf.query(john)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting pairwise querying.\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | 0 | 1 | 0.00169288 | 14.514ms |\n", "PROGRESS: | Done | | 100 | 297.345ms |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| query_label | reference_label | distance | rank |
| 0 | Victoria Beckham | -2.22044604925e-16 | 1 |
| 0 | Mary Fitzgerald (artist) | 0.207307036115 | 2 |
| 0 | Adrienne Corri | 0.214509782788 | 3 |
| 0 | Beverly Jane Fry | 0.217466468741 | 4 |
| 0 | Raman Mundair | 0.217695474992 | 5 |
\n", "[5 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tquery_label\tint\n", "\treference_label\tstr\n", "\tdistance\tfloat\n", "\trank\tint\n", "\n", "Rows: 5\n", "\n", "Data:\n", "+-------------+--------------------------+--------------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+--------------------------+--------------------+------+\n", "| 0 | Victoria Beckham | -2.22044604925e-16 | 1 |\n", "| 0 | Mary Fitzgerald (artist) | 0.207307036115 | 2 |\n", "| 0 | Adrienne Corri | 0.214509782788 | 3 |\n", "| 0 | Beverly Jane Fry | 0.217466468741 | 4 |\n", "| 0 | Raman Mundair | 0.217695474992 | 5 |\n", "+-------------+--------------------------+--------------------+------+\n", "[5 rows x 4 columns]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn_model_wc.query(vb)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PROGRESS: Starting pairwise querying.\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n", "PROGRESS: | 0 | 1 | 0.00169288 | 18.672ms |\n", "PROGRESS: | Done | | 100 | 431.824ms |\n", "PROGRESS: +--------------+---------+-------------+--------------+\n" ] }, { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| query_label | reference_label | distance | rank |
| 0 | Victoria Beckham | 1.11022302463e-16 | 1 |
| 0 | David Beckham | 0.548169610263 | 2 |
| 0 | Stephen Dow Beckham | 0.784986706828 | 3 |
| 0 | Mel B | 0.809585523409 | 4 |
| 0 | Caroline Rush | 0.819826422919 | 5 |
\n", "[5 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tquery_label\tint\n", "\treference_label\tstr\n", "\tdistance\tfloat\n", "\trank\tint\n", "\n", "Rows: 5\n", "\n", "Data:\n", "+-------------+---------------------+-------------------+------+\n", "| query_label | reference_label | distance | rank |\n", "+-------------+---------------------+-------------------+------+\n", "| 0 | Victoria Beckham | 1.11022302463e-16 | 1 |\n", "| 0 | David Beckham | 0.548169610263 | 2 |\n", "| 0 | Stephen Dow Beckham | 0.784986706828 | 3 |\n", "| 0 | Mel B | 0.809585523409 | 4 |\n", "| 0 | Caroline Rush | 0.819826422919 | 5 |\n", "+-------------+---------------------+-------------------+------+\n", "[5 rows x 4 columns]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn_model_tfidf.query(vb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }