{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 4: Retrieving Wikipedia articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this module, we focused on using nearest neighbors and clustering to retrieve documents that interest users by analyzing their text. We explored two document representations: word counts and TF-IDF. We also built an IPython notebook for retrieving articles from Wikipedia about famous people.\n", "\n", "In this assignment, we are going to dig deeper into this application, explore the retrieval results for various famous people, and familiarize ourselves with the code needed to build a retrieval system. These techniques will be key to building the intelligent application in your capstone project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learning outcomes\n", "- Execute document retrieval code with the IPython notebook\n", "- Load and transform real text data\n", "- Compare results with word counts and TF-IDF\n", "- Set the distance function in the retrieval\n", "- Build a document retrieval model using nearest neighbor search" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[INFO] This non-commercial license of GraphLab Create is assigned to chengjun@chem.ku.dk and will expire on January 27, 2017. For commercial licensing options, visit https://dato.com/buy/.\n", "\n", "[INFO] Start server at: ipc:///tmp/graphlab_server-25044 - Server binary: /usr/local/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1454423320.log\n", "[INFO] GraphLab Server Version: 1.8.1\n", "[WARNING] Unable to create session in specified location: '/Users/jcj/.graphlab/artifacts'. 
Using: '/var/tmp/graphlab-jcj/25044/tmp_session_73cf76c0-3994-489f-a377-fd84c2b28012'\n" ] } ], "source": [ "import graphlab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and explore the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "people = graphlab.SFrame('people_wiki.gl/')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
URI | name | text
---|---|---
<http://dbpedia.org/resource/Digby_Morrell> ... | Digby Morrell | digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ... | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ... | Harpdog Brown | harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ... | G-Enka | henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ... | Sam Henderson | sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ... | Aaron LaCrate | aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ... | Trevor Ferguson | trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ... | Grant Nelson | grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ... | Cathy Caruth | cathy caruth born 1955 is frank h t rhodes ...

word | count
---|---
the | 40
in | 30
and | 21
of | 18
to | 14
his | 11
obama | 9
act | 8
he | 7
a | 7
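
The table above ranks words by raw frequency, which is why stop words such as "the" and "in" dominate it. As a rough illustration of how such counts can be produced without GraphLab, the sketch below uses only Python's standard library. The helper name `word_counts` and the sample sentence are hypothetical and are not part of the assignment.

```python
from collections import Counter


def word_counts(text):
    """Count whitespace-separated tokens in lowercased text (hypothetical helper)."""
    return Counter(text.lower().split())


# Made-up sample text, not taken from the people_wiki data.
sample = "the senator signed the act and the law"
counts = word_counts(sample)

# most_common() sorts by descending count, like the table above.
for word, count in counts.most_common(3):
    print("%s %d" % (word, count))
```

Splitting on whitespace is a simplification; a real tokenizer would also need to handle punctuation.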

word | count
---|---
obama | 43.2956530721
act | 27.678222623
iraq | 17.747378588
control | 14.8870608452
law | 14.7229357618
ordered | 14.5333739509
military | 13.1159327785
involvement | 12.7843852412
response | 12.7843852412
democratic | 12.4106886973
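
Although the second column of the table above is labelled `count`, its non-integer values look like TF-IDF weights: common words have dropped out and article-specific words such as "obama" rank highest. The sketch below shows one way such weights could be computed, assuming the common `count * log(N / document frequency)` definition. GraphLab's exact formula is not shown in this notebook, so treat that weighting as an assumption. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Assumes weight = term count * log(N / document frequency),
    a common TF-IDF variant; libraries may smooth or normalize differently.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {w: c * math.log(float(n_docs) / doc_freq[w]) for w, c in counts.items()}
        )
    return weights


# Made-up toy corpus, not the people_wiki data.
corpus = [
    "the president signed the act",
    "the singer won the award",
    "the player joined the club",
]
scores = tf_idf(corpus)

# "the" occurs in every document, so log(3 / 3) gives it zero weight.
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

Words that appear in every document get zero weight under this definition, which is the effect that pushes "the" and "in" out of the top of the ranking.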

query_label | reference_label | distance | rank
---|---|---|---
0 | Barack Obama | 0.0 | 1
0 | Joe Biden | 0.794117647059 | 2
0 | Joe Lieberman | 0.794685990338 | 3
0 | Kelly Ayotte | 0.811989100817 | 4
0 | Bill Clinton | 0.813852813853 | 5

query_label | reference_label | distance | rank
---|---|---|---
0 | Taylor Swift | 0.0 | 1
0 | Carrie Underwood | 0.76231884058 | 2
0 | Alicia Keys | 0.764705882353 | 3
0 | Jordin Sparks | 0.769633507853 | 4
0 | Leona Lewis | 0.776119402985 | 5

word | count
---|---
the | 27
in | 18
and | 15
of | 13
a | 10
has | 9
john | 7
he | 7
on | 6
award | 5

word | count
---|---
furnish | 18.38947184
elton | 17.48232027
billboard | 17.3036809575
john | 13.9393127924
songwriters | 11.250406447
tonightcandle | 10.9864953892
overallelton | 10.9864953892
19702000 | 10.2933482087
fivedecade | 10.2933482087
aids | 10.262846934

query_label | reference_label | distance | rank
---|---|---|---
0 | Elton John | 2.22044604925e-16 | 1
0 | Cliff Richard | 0.16142415259 | 2
0 | Sandro Petrone | 0.16822542751 | 3
0 | Rod Stewart | 0.168327165587 | 4
0 | Malachi O'Doherty | 0.177315545979 | 5

query_label | reference_label | distance | rank
---|---|---|---
0 | Elton John | -2.22044604925e-16 | 1
0 | Rod Stewart | 0.717219667893 | 2
0 | George Michael | 0.747600998969 | 3
0 | Sting (musician) | 0.747671954431 | 4
0 | Phil Collins | 0.75119324879 | 5
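
The self-distances of about ±2.22e-16 in the two Elton John tables above are on the order of double-precision machine epsilon, so they are most likely rounding error, not a real difference between an article and itself. That pattern is consistent with a cosine distance, though the cell that set the distance is not shown here, so this is an assumption. The sketch below computes cosine distance, defined as 1 minus the cosine of the angle, over sparse word-weight dictionaries. The function name and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, 1 - cos(angle), between two sparse {word: weight} dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Made-up toy vectors, not taken from the people_wiki data.
singer = {"singer": 2.0, "album": 1.0}
pianist = {"singer": 1.0, "piano": 2.0}
athlete = {"football": 3.0}

print(cosine_distance(singer, pianist))  # shares one word with the query
print(cosine_distance(singer, athlete))  # shares no words with the query
print(cosine_distance(singer, singer))   # zero up to floating-point rounding
```

Because the result depends only on the angle between the vectors, a long article and a short one with the same word proportions are treated as identical.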

query_label | reference_label | distance | rank
---|---|---|---
0 | Victoria Beckham | -2.22044604925e-16 | 1
0 | Mary Fitzgerald (artist) | 0.207307036115 | 2
0 | Adrienne Corri | 0.214509782788 | 3
0 | Beverly Jane Fry | 0.217466468741 | 4
0 | Raman Mundair | 0.217695474992 | 5

query_label | reference_label | distance | rank
---|---|---|---
0 | Victoria Beckham | 1.11022302463e-16 | 1
0 | David Beckham | 0.548169610263 | 2
0 | Stephen Dow Beckham | 0.784986706828 | 3
0 | Mel B | 0.809585523409 | 4
0 | Caroline Rush | 0.819826422919 | 5