# Week 4: Retrieving Wikipedia articles

In this module, we used nearest neighbors and clustering to retrieve documents of interest to users by analyzing the documents' text. We explored two document representations: word counts and TF-IDF. We also built an IPython notebook for retrieving Wikipedia articles about famous people.

In this assignment, we are going to dig deeper into this application, explore the retrieval results for various famous people, and familiarize ourselves with the code needed to build a retrieval system. These techniques will be key to building the intelligent application in your capstone project.

Learning outcomes
- Execute document retrieval code in an IPython notebook
- Load and transform real text data
- Compare results with word counts and TF-IDF
- Set the distance function used for retrieval
- Build a document retrieval model using nearest neighbor search

In [1]:
import graphlab

[INFO] GraphLab Server Version: 1.8.1


## Load and explore the data

In [2]:
people = graphlab.SFrame('people_wiki.gl/')

In [4]:
people.head()

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [15]:
obama = people[people['name'] == 'Barack Obama']
clooney = people[people['name'] == 'George Clooney']

In [17]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [22]:
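# Double brackets select a list of columns, so the result is an SFrame
# (which provides .stack) rather than a single-column SArray.
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name=['word', 'count'])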

In [24]:
obama_word_count_table.sort('count', ascending=False)

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
he,7
a,7


## Compute TF-IDF for the corpus

In [25]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

In [27]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

In [33]:
people['tfidf'] = tfidf
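
The sketch below illustrates one common way to compute TF-IDF from per-document word counts: each count is multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. Whether `graphlab.text_analytics.tf_idf` uses exactly this weighting is an assumption here; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math


def tf_idf(word_counts):
    """Compute TF-IDF weights for a list of {word: count} dicts.

    Assumed weighting: count * log(N / document_frequency).
    """
    n_docs = len(word_counts)

    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for wc in word_counts:
        for word in wc:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    return [
        {word: count * math.log(float(n_docs) / doc_freq[word])
         for word, count in wc.items()}
        for wc in word_counts
    ]


# Hypothetical toy corpus of two documents.
docs = [
    {"the": 3, "act": 2},
    {"the": 2, "football": 1},
]
weights = tf_idf(docs)
```

A word that appears in every document (here "the") gets weight 0, while a word that is frequent in one document but rare in the corpus gets a large weight.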

## Examine the TF-IDF for the Obama article

In [39]:
obama = people[people['name'] == 'Barack Obama']
obama[['tfidf']].stack('tfidf', new_column_name = ['word', 'count']).sort('count', ascending=False)

word,count
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


## Manually compute distances between a few people

In [40]:
clinton = people[people['name'] == 'Bill Clinton']
beckham = people[people['name'] == 'David Beckham']

In [43]:
print graphlab.distances.cosine(obama['tfidf'][0], clinton['tfidf'][0])
print graphlab.distances.cosine(obama['tfidf'][0], beckham['tfidf'][0])

0.833985493688
0.979130584475
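
For reference, cosine distance between two sparse vectors stored as dictionaries can be sketched as follows. This is an illustrative implementation of the standard definition, 1 minus the dot product divided by the product of the norms, and not GraphLab's own code; the function name `cosine_distance` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {key: value} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy vectors.
same_direction = cosine_distance({"x": 1.0}, {"x": 2.0})
no_overlap = cosine_distance({"x": 1.0}, {"y": 1.0})
partial = cosine_distance({"a": 1.0, "b": 1.0}, {"a": 1.0})
```

A smaller distance means the two documents are more similar; documents with no words in common have distance 1. Floating-point rounding can make the distance from a vector to itself a tiny nonzero value rather than exactly 0.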


## Build a nearest neighbor model for document retrieval

In [46]:
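# No distance is specified here, so GraphLab picks a default; the distances it
# reports need not match the cosine distances computed manually above.
knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')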

PROGRESS: Starting brute force nearest neighbors model training.


## Applying the nearest-neighbors model for retrieval

In [47]:
knn_model.query(obama)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 22.349ms     |
PROGRESS: | Done         |         | 100         | 414.256ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5
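
Conceptually, a brute-force nearest neighbor query computes the distance from the query document to every reference document and returns the closest ones. The sketch below shows that idea in plain Python under stated assumptions: it uses cosine distance, and the names `cosine_distance`, `query`, and the toy reference vectors are hypothetical rather than part of GraphLab's API.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {key: value} dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query(reference, query_vector, k=5, distance=cosine_distance):
    """Return the k closest reference items as (label, distance, rank) tuples.

    reference: {label: sparse vector}; every item is scanned (brute force).
    """
    scored = sorted(
        (distance(query_vector, vector), label)
        for label, vector in reference.items()
    )
    return [
        (label, dist, rank)
        for rank, (dist, label) in enumerate(scored[:k], start=1)
    ]


# Hypothetical toy reference set with made-up weights.
reference = {
    "politician_1": {"senate": 2.0, "law": 1.0},
    "politician_2": {"senate": 1.0, "law": 1.0, "election": 1.0},
    "footballer": {"goal": 3.0, "club": 1.0},
}
result = query(reference, reference["politician_1"], k=2)
```

Because the query document is itself in the reference set, it is returned at rank 1 with a distance of (approximately) zero, just as in the output above.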


In [48]:
swift = people[people['name'] == 'Taylor Swift']
knn_model.query(swift)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 23.254ms     |
PROGRESS: | Done         |         | 100         | 398.965ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Taylor Swift,0.0,1
0,Carrie Underwood,0.76231884058,2
0,Alicia Keys,0.764705882353,3
0,Jordin Sparks,0.769633507853,4
0,Leona Lewis,0.776119402985,5


In [49]:
john = people[people['name'] == 'Elton John']

In [54]:
john[['word_count']].stack('word_count', new_column_name=['word', 'count']).sort('count', ascending=False)

word,count
the,27
in,18
and,15
of,13
a,10
has,9
john,7
he,7
on,6
award,5


In [55]:
john[['tfidf']].stack('tfidf', new_column_name=['word', 'count']).sort('count', ascending=False)

word,count
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
tonightcandle,10.9864953892
overallelton,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


In [60]:
vb = people[people['name'] == 'Victoria Beckham']
pm = people[people['name'] == 'Paul McCartney']

In [63]:
print graphlab.distances.cosine(john['tfidf'][0], vb['tfidf'][0])
print graphlab.distances.cosine(john['tfidf'][0], pm['tfidf'][0])

0.956700637666
0.825031002922


In [64]:
knn_model_wc = graphlab.nearest_neighbors.create(people, features=['word_count'], distance='cosine', label='name')

PROGRESS: Starting brute force nearest neighbors model training.


In [65]:
knn_model_tfidf = graphlab.nearest_neighbors.create(people, features=['tfidf'], distance='cosine', label='name')

PROGRESS: Starting brute force nearest neighbors model training.


In [67]:
knn_model_wc.query(john)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 14.839ms     |
PROGRESS: | Done         |         | 100         | 310.023ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


In [68]:
knn_model_tfidf.query(john)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 22.611ms     |
PROGRESS: | Done         |         | 100         | 439.836ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


In [69]:
knn_model_wc.query(vb)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 14.514ms     |
PROGRESS: | Done         |         | 100         | 297.345ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


In [70]:
knn_model_tfidf.query(vb)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 18.672ms     |
PROGRESS: | Done         |         | 100         | 431.824ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5
