{ "metadata": { "name": "", "signature": "sha256:2f47664872cfaa34ab4161cd438a77f10d2905ddc7b40effff295fc6a1989117" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Tutorial Brief" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

This tutorial is also available on YouTube.

" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import YouTubeVideo\n", "# a talk about IPython at Sage Days at U. Washington, Seattle.\n", "# Video credit: William Stein.\n", "YouTubeVideo('http://youtu.be/HuDIbSCnsqo')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " " ], "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "" ] } ], "prompt_number": 22 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Extracting Features from Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most common ways to analyze is to split it in to words and count how many times each word was used in a text. This may not sound very useful, but with large amounts of text you can build a fairly accurate classifier and this is what we will do in this tutorial." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Import Required Libraries" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import lxml\n", "import requests\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.naive_bayes import MultinomialNB\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Fetch the RSS news feed from Reuters" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url_list = [\"http://feeds.reuters.com/reuters/businessNews\",\n", " \"http://feeds.reuters.com/reuters/technologyNews\",\n", " \"http://feeds.reuters.com/reuters/sportsNews\",\n", " ]\n", "documents = []\n", "\n", "for url in url_list:\n", " response = requests.get(url)\n", " xml_page = response.text\n", " parser = lxml.etree.XMLParser(recover=True, encoding='utf-8')\n", " 
documents.append(lxml.etree.fromstring(xml_page.encode(\"utf-8\"), parser=parser))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Extracting Articles from the XML Feed" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Examining the XML Feed Format" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def print_tag(node):\n", "    # Print the opening tag with its attributes, up to 25 children, then the closing tag.\n", "    print \"<%s %s>%s\" % (node.tag, \" \".join([\"%s=%s\" % (k, v) for k, v in node.attrib.iteritems()]), node.text)\n", "    for item in node[:25]:\n", "        print \" <%s %s>%s</%s>\" % (item.tag, \" \".join([\"%s=%s\" % (k, v) for k, v in item.attrib.iteritems()]), item.text, item.tag)\n", "    print \"</%s>\" % node.tag" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "temp_node = documents[0]\n", "print_tag(temp_node)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "None\n", " None\n", "\n" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "temp_node = temp_node[0]\n", "print_tag(temp_node)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "None\n", " Reuters: Business News\n", " http://www.reuters.com\n", " Reuters.com is your source for breaking news, business, financial and investing news, including personal finance and stocks. Reuters is the leading global provider of news, financial information and technology solutions to the world's media, financial institutions, businesses and individuals.\n", " en-us\n", " All rights reserved. Users may download and print extracts of content from this website for their own personal and non-commercial use only. 
Republication or redistribution of Reuters content, including by framing or similar means, is expressly prohibited without the prior written consent of Reuters. Reuters and the Reuters sphere logo are registered trademarks or trademarks of the Reuters group of companies around the world. \u00a9 Reuters 2015\n", " Sat, 07 Feb 2015 14:34:41 GMT\n", " Sat, 07 Feb 2015 14:34:41 GMT\n", " 5\n", " None\n", " <{http://www.w3.org/2005/Atom}link rel=self type=application/rss+xml href=http://feeds.reuters.com/reuters/businessNews>None\n", " <{http://rssnamespace.org/feedburner/ext/1.0}info uri=reuters/businessnews>None\n", " <{http://www.w3.org/2005/Atom}link rel=hub href=http://pubsubhubbub.appspot.com/>None\n", " <{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://add.my.yahoo.com/rss?url=http%3A%2F%2Ffeeds.reuters.com%2Freuters%2FbusinessNews src=http://us.i1.yimg.com/us.yimg.com/i/us/my/addtomyyahoo4.gif>Subscribe with My Yahoo!\n", " <{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://www.newsgator.com/ngs/subscriber/subext.aspx?url=http%3A%2F%2Ffeeds.reuters.com%2Freuters%2FbusinessNews src=http://www.newsgator.com/images/ngsub1.gif>Subscribe with NewsGator\n", " <{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://feeds.my.aol.com/add.jsp?url=http%3A%2F%2Ffeeds.reuters.com%2Freuters%2FbusinessNews src=http://o.aolcdn.com/favorites.my.aol.com/webmaster/ffclient/webroot/locale/en-US/images/myAOLButtonSmall.gif>Subscribe with My AOL\n", " <{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://www.bloglines.com/sub/http://feeds.reuters.com/reuters/businessNews src=http://www.bloglines.com/images/sub_modern11.gif>Subscribe with Bloglines\n", " <{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://www.netvibes.com/subscribe.php?url=http%3A%2F%2Ffeeds.reuters.com%2Freuters%2FbusinessNews src=http://www.netvibes.com/img/add2netvibes.gif>Subscribe with Netvibes\n", " 
<{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://fusion.google.com/add?feedurl=http%3A%2F%2Ffeeds.reuters.com%2Freuters%2FbusinessNews src=http://buttons.googlesyndication.com/fusion/add.gif>Subscribe with Google\n", " <{http://rssnamespace.org/feedburner/ext/1.0}feedFlare href=http://www.pageflakes.com/subscribe.aspx?url=http%3A%2F%2Ffeeds.reuters.com%2Freuters%2FbusinessNews src=http://www.pageflakes.com/ImageFile.ashx?instanceId=Static_4&fileName=ATP_blu_91x17.gif>Subscribe with Pageflakes\n", " None\n", " None\n", " None\n", " None\n", " None\n", " None\n", "\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "temp_node = temp_node.xpath(\"item\")[0]\n", "print_tag(temp_node)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "None\n", " Exclusive: Top Fox investors seek to convert voting shares, Murdoch may benefit\n", " http://feeds.reuters.com/~r/reuters/businessNews/~3/Ti_4toCmS6w/story01.htm\n", " NEW YORK (Reuters) - Several top investors in Twenty-First Century Fox Inc are pressing for the right to swap their voting shares for ordinary shares, which are trading at an unusual premium, even though the move could hand even more control of the company to Rupert Murdoch, according to people familiar with the matter.
\n", " \n", "
\"\"/
\n", " businessNews\n", " Sat, 07 Feb 2015 14:30:10 GMT\n", " http://www.reuters.com/article/2015/02/07/us-fox-votingshares-exclusive-idUSKBN0LA2HP20150207?feedType=RSS&feedName=businessNews\n", " <{http://rssnamespace.org/feedburner/ext/1.0}origLink >http://reuters.us.feedsportal.com/c/35217/f/654199/s/4325e2cf/sc/2/l/0L0Sreuters0N0Carticle0C20A150C0A20C0A70Cus0Efox0Evotingshares0Eexclusive0EidUSKBN0ALA2HP20A150A20A70DfeedType0FRSS0GfeedName0FbusinessNews/story01.htm\n", "
\n" ] } ], "prompt_number": 6 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Processing All Feeds to a DataFrame" ] }, { "cell_type": "code", "collapsed": false, "input": [ "title_list = []\n", "description_list = []\n", "category_list = []\n", "\n", "for xml_doc in documents:\n", "    articles = xml_doc.xpath(\"//item\")\n", "    for article in articles:\n", "        title_list.append(article[0].text)\n", "        description_list.append(article[2].text)\n", "        category_list.append(article[3].text)\n", "\n", "\n", "news_data = pd.DataFrame(title_list, columns=[\"title\"])\n", "news_data[\"description\"] = description_list\n", "news_data[\"category\"] = category_list\n", "print len(news_data)\n", "news_data" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "60\n" ] }, { "html": [ "
title description category
0 Exclusive: Top Fox investors seek to convert v... NEW YORK (Reuters) - Several top investors in ... businessNews
1 Different delivery, one message to Greece's ne... ROME/PARIS (Reuters) - In Paris and Rome, it w... businessNews
2 Isolated Greece wants no more bailout money wi... ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews
3 Valuations may hurt small caps, despite job gr... NEW YORK (Reuters) - The good news from Friday... businessNews
4 Shippers suspend weekend cargo loading at U.S.... LOS ANGELES (Reuters) - The loading and unload... businessNews
5 Building unions return to Kentucky refinery de... NEW YORK (Reuters) - Workers represented by th... businessNews
6 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... businessNews
7 Wall St. ends down on interest rate, Greece ji... NEW YORK (Reuters) - Wall Street stocks fell o... businessNews
8 Strong U.S. job, wage gains open door to mid-y... WASHINGTON (Reuters) - U.S. job growth rose so... businessNews
9 Building unions return to Kentucky refinery de... NEW YORK (Reuters) - Workers represented by th... businessNews
10 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... businessNews
11 JPMorgan under scrutiny over hiring of Chinese... (Reuters) - JPMorgan Chase & Co is under feder... businessNews
12 Exclusive: Top Fox investors seek to convert v... NEW YORK (Reuters) - Several top investors in ... businessNews
13 Isolated Greece wants no more bailout money wi... ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews
14 Wall St. ends down on interest rate, Greece ji... NEW YORK (Reuters) - Wall Street stocks fell o... businessNews
15 RadioShack would accept liquidation bids: lawyer (Reuters) - A lawyer for RadioShack on Friday ... businessNews
16 Strong U.S. job, wage gains open door to mid-y... WASHINGTON (Reuters) - U.S. job growth rose so... businessNews
17 Wall Street firms waver over June 2015 rate hi... NEW YORK (Reuters) - Economists at Wall Street... businessNews
18 Oil climbs, Brent posts best two weeks since 1998 NEW YORK (Reuters) - Oil rallied again on Frid... businessNews
19 Brazil's Petrobras taps state banker as CEO; s... BRASILIA (Reuters) - Brazil's President Dilma ... businessNews
20 Feds can't indefinitely gag Yahoo about subpoe... SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews
21 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... technologyNews
22 Anthem warns U.S. customers of email scam afte... (Reuters) - Health insurer Anthem Inc on Frida... technologyNews
23 U.S. court orders Symantec to pay $17 mln for ... (Reuters) - Symantec Corp, maker of the popula... technologyNews
24 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... technologyNews
25 RadioShack would accept liquidation bids: lawyer (Reuters) - A lawyer for RadioShack on Friday ... technologyNews
26 Anthem warns U.S. customers of email scam afte... (Reuters) - Health insurer Anthem Inc on Frida... technologyNews
27 Google can disrupt car industry but is no auto... FRANKFURT (Reuters) - Technology companies suc... technologyNews
28 Chipmaker Intel promotes senior executives, du... SAN FRANCISCO (Reuters) - Chipmaker Intel Corp... technologyNews
29 Feds can't indefinitely gag Yahoo about subpoe... SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews
30 Google panel backs firm on EU limit to 'right ... BRUSSELS (Reuters) - A panel of experts appoin... technologyNews
31 Family money puts its faith in European techno... LONDON (Reuters) - Some of Europe's wealthiest... technologyNews
32 U.S. states probe massive data breach at healt... NEW YORK (Reuters) - Several U.S. states are i... technologyNews
33 Health insurer Anthem hit by massive cybersecu... (Reuters) - Health insurer Anthem Inc , which ... technologyNews
34 China to have most robots in world by 2017 FRANKFURT (Reuters) - China will have more rob... technologyNews
35 U.S. businesses ask White House to help on Chi... WASHINGTON (Reuters) - U.S. business lobbies c... technologyNews
36 Sharp says not considering sale of any oversea... TOKYO (Reuters) - Japanese electronics maker S... technologyNews
37 RadioShack files for bankruptcy; Sprint to tak... (Reuters) - Electronics retailer RadioShack Co... technologyNews
38 SEC probes Blackberry options trading ahead of... NEW YORK (Reuters) - The U.S. Securities and E... technologyNews
39 Twitter tops Wall Street revenue target, user ... SAN FRANCISCO (Reuters) - Twitter Inc said on ... technologyNews
40 U.S. Golf Association to launch Senior Women's... NEW YORK (Reuters) - The United States Golf As... sportsNews
41 In-form Wiesberger goes two shots clear in Kua... KUALA LUMPUR (Reuters) - Austrian Bernd Wiesbe... sportsNews
42 Jansrud signals downhill ambition in final tra... BEAVER CREEK, Colorado (Reuters) - Unheralded ... sportsNews
43 NFL, players union spar in court over Peterson... MINNEAPOLIS (Reuters) - The National Football ... sportsNews
44 English takes charge as Mickelson exits Torrey... LA JOLLA, California (Reuters) - Local favorit... sportsNews
45 No pressure for double medallist Fenninger BEAVER CREEK, Colorado (Reuters) - A Super-G v... sportsNews
46 Referees association backs rookie in flare-up ... (Reuters) - The National Basketball Referees A... sportsNews
47 Rangers goalie Lundqvist out for three weeks w... (Reuters) - New York Rangers goalie Henrik Lun... sportsNews
48 Miracle now needed to win gold, says Vonn BEAVER CREEK, Colorado (Reuters) - Denied gold... sportsNews
49 NFL and players union spar in court over Peter... MINNEAPOLIS (Reuters) - The National Football ... sportsNews
50 No sharing world downhill gold for Maze BEAVER CREEK, Colorado (Reuters) - Tina Maze h... sportsNews
51 Doping should be criminal offense: Russia chief MOSCOW (Reuters) - Russia's Athletics Federati... sportsNews
52 Kansas police drop drug charges against Dallas... DALLAS, Texas (Reuters) - Authorities have dro... sportsNews
53 Thompson leads by one stroke at Torrey Pines LA JOLLA, California (Reuters - American Nicho... sportsNews
54 Borzakovskiy takes over as Russia's caretaker ... MOSCOW (Reuters) - Former Olympic 800 meters c... sportsNews
55 Clubs need long-term mental health plan: FIFPro (Reuters) - Soccer clubs need to take a longer... sportsNews
56 European Games will convince skeptics, attract... (Reuters) - The inaugural European Games this ... sportsNews
57 Departing VFLA President to continue fight aga... MOSCOW (Reuters) - Russian Athletics Federatio... sportsNews
58 Canizares joins champion Westwood at the top i... KUALA LUMPUR (Reuters) - Spaniard Alejandro Ca... sportsNews
59 Force India opposed Marussia's bid to use old car LONDON (Reuters) - Marussia's hopes of rising ... sportsNews
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " title \\\n", "0 Exclusive: Top Fox investors seek to convert v... \n", "1 Different delivery, one message to Greece's ne... \n", "2 Isolated Greece wants no more bailout money wi... \n", "3 Valuations may hurt small caps, despite job gr... \n", "4 Shippers suspend weekend cargo loading at U.S.... \n", "5 Building unions return to Kentucky refinery de... \n", "6 Motorola exploring possible sale: Bloomberg \n", "7 Wall St. ends down on interest rate, Greece ji... \n", "8 Strong U.S. job, wage gains open door to mid-y... \n", "9 Building unions return to Kentucky refinery de... \n", "10 Motorola exploring possible sale: Bloomberg \n", "11 JPMorgan under scrutiny over hiring of Chinese... \n", "12 Exclusive: Top Fox investors seek to convert v... \n", "13 Isolated Greece wants no more bailout money wi... \n", "14 Wall St. ends down on interest rate, Greece ji... \n", "15 RadioShack would accept liquidation bids: lawyer \n", "16 Strong U.S. job, wage gains open door to mid-y... \n", "17 Wall Street firms waver over June 2015 rate hi... \n", "18 Oil climbs, Brent posts best two weeks since 1998 \n", "19 Brazil's Petrobras taps state banker as CEO; s... \n", "20 Feds can't indefinitely gag Yahoo about subpoe... \n", "21 Motorola exploring possible sale: Bloomberg \n", "22 Anthem warns U.S. customers of email scam afte... \n", "23 U.S. court orders Symantec to pay $17 mln for ... \n", "24 Motorola exploring possible sale: Bloomberg \n", "25 RadioShack would accept liquidation bids: lawyer \n", "26 Anthem warns U.S. customers of email scam afte... \n", "27 Google can disrupt car industry but is no auto... \n", "28 Chipmaker Intel promotes senior executives, du... \n", "29 Feds can't indefinitely gag Yahoo about subpoe... \n", "30 Google panel backs firm on EU limit to 'right ... \n", "31 Family money puts its faith in European techno... \n", "32 U.S. states probe massive data breach at healt... 
\n", "33 Health insurer Anthem hit by massive cybersecu... \n", "34 China to have most robots in world by 2017 \n", "35 U.S. businesses ask White House to help on Chi... \n", "36 Sharp says not considering sale of any oversea... \n", "37 RadioShack files for bankruptcy; Sprint to tak... \n", "38 SEC probes Blackberry options trading ahead of... \n", "39 Twitter tops Wall Street revenue target, user ... \n", "40 U.S. Golf Association to launch Senior Women's... \n", "41 In-form Wiesberger goes two shots clear in Kua... \n", "42 Jansrud signals downhill ambition in final tra... \n", "43 NFL, players union spar in court over Peterson... \n", "44 English takes charge as Mickelson exits Torrey... \n", "45 No pressure for double medallist Fenninger \n", "46 Referees association backs rookie in flare-up ... \n", "47 Rangers goalie Lundqvist out for three weeks w... \n", "48 Miracle now needed to win gold, says Vonn \n", "49 NFL and players union spar in court over Peter... \n", "50 No sharing world downhill gold for Maze \n", "51 Doping should be criminal offense: Russia chief \n", "52 Kansas police drop drug charges against Dallas... \n", "53 Thompson leads by one stroke at Torrey Pines \n", "54 Borzakovskiy takes over as Russia's caretaker ... \n", "55 Clubs need long-term mental health plan: FIFPro \n", "56 European Games will convince skeptics, attract... \n", "57 Departing VFLA President to continue fight aga... \n", "58 Canizares joins champion Westwood at the top i... \n", "59 Force India opposed Marussia's bid to use old car \n", "\n", " description category \n", "0 NEW YORK (Reuters) - Several top investors in ... businessNews \n", "1 ROME/PARIS (Reuters) - In Paris and Rome, it w... businessNews \n", "2 ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews \n", "3 NEW YORK (Reuters) - The good news from Friday... businessNews \n", "4 LOS ANGELES (Reuters) - The loading and unload... businessNews \n", "5 NEW YORK (Reuters) - Workers represented by th... 
businessNews \n", "6 (Reuters) - Walkie-talkie and radio systems ma... businessNews \n", "7 NEW YORK (Reuters) - Wall Street stocks fell o... businessNews \n", "8 WASHINGTON (Reuters) - U.S. job growth rose so... businessNews \n", "9 NEW YORK (Reuters) - Workers represented by th... businessNews \n", "10 (Reuters) - Walkie-talkie and radio systems ma... businessNews \n", "11 (Reuters) - JPMorgan Chase & Co is under feder... businessNews \n", "12 NEW YORK (Reuters) - Several top investors in ... businessNews \n", "13 ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews \n", "14 NEW YORK (Reuters) - Wall Street stocks fell o... businessNews \n", "15 (Reuters) - A lawyer for RadioShack on Friday ... businessNews \n", "16 WASHINGTON (Reuters) - U.S. job growth rose so... businessNews \n", "17 NEW YORK (Reuters) - Economists at Wall Street... businessNews \n", "18 NEW YORK (Reuters) - Oil rallied again on Frid... businessNews \n", "19 BRASILIA (Reuters) - Brazil's President Dilma ... businessNews \n", "20 SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews \n", "21 (Reuters) - Walkie-talkie and radio systems ma... technologyNews \n", "22 (Reuters) - Health insurer Anthem Inc on Frida... technologyNews \n", "23 (Reuters) - Symantec Corp, maker of the popula... technologyNews \n", "24 (Reuters) - Walkie-talkie and radio systems ma... technologyNews \n", "25 (Reuters) - A lawyer for RadioShack on Friday ... technologyNews \n", "26 (Reuters) - Health insurer Anthem Inc on Frida... technologyNews \n", "27 FRANKFURT (Reuters) - Technology companies suc... technologyNews \n", "28 SAN FRANCISCO (Reuters) - Chipmaker Intel Corp... technologyNews \n", "29 SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews \n", "30 BRUSSELS (Reuters) - A panel of experts appoin... technologyNews \n", "31 LONDON (Reuters) - Some of Europe's wealthiest... technologyNews \n", "32 NEW YORK (Reuters) - Several U.S. states are i... 
technologyNews \n", "33 (Reuters) - Health insurer Anthem Inc , which ... technologyNews \n", "34 FRANKFURT (Reuters) - China will have more rob... technologyNews \n", "35 WASHINGTON (Reuters) - U.S. business lobbies c... technologyNews \n", "36 TOKYO (Reuters) - Japanese electronics maker S... technologyNews \n", "37 (Reuters) - Electronics retailer RadioShack Co... technologyNews \n", "38 NEW YORK (Reuters) - The U.S. Securities and E... technologyNews \n", "39 SAN FRANCISCO (Reuters) - Twitter Inc said on ... technologyNews \n", "40 NEW YORK (Reuters) - The United States Golf As... sportsNews \n", "41 KUALA LUMPUR (Reuters) - Austrian Bernd Wiesbe... sportsNews \n", "42 BEAVER CREEK, Colorado (Reuters) - Unheralded ... sportsNews \n", "43 MINNEAPOLIS (Reuters) - The National Football ... sportsNews \n", "44 LA JOLLA, California (Reuters) - Local favorit... sportsNews \n", "45 BEAVER CREEK, Colorado (Reuters) - A Super-G v... sportsNews \n", "46 (Reuters) - The National Basketball Referees A... sportsNews \n", "47 (Reuters) - New York Rangers goalie Henrik Lun... sportsNews \n", "48 BEAVER CREEK, Colorado (Reuters) - Denied gold... sportsNews \n", "49 MINNEAPOLIS (Reuters) - The National Football ... sportsNews \n", "50 BEAVER CREEK, Colorado (Reuters) - Tina Maze h... sportsNews \n", "51 MOSCOW (Reuters) - Russia's Athletics Federati... sportsNews \n", "52 DALLAS, Texas (Reuters) - Authorities have dro... sportsNews \n", "53 LA JOLLA, California (Reuters - American Nicho... sportsNews \n", "54 MOSCOW (Reuters) - Former Olympic 800 meters c... sportsNews \n", "55 (Reuters) - Soccer clubs need to take a longer... sportsNews \n", "56 (Reuters) - The inaugural European Games this ... sportsNews \n", "57 MOSCOW (Reuters) - Russian Athletics Federatio... sportsNews \n", "58 KUALA LUMPUR (Reuters) - Spaniard Alejandro Ca... sportsNews \n", "59 LONDON (Reuters) - Marussia's hopes of rising ... 
sportsNews " ] } ], "prompt_number": 7 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Clean up data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "news_data[\"description\"].head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "0 NEW YORK (Reuters) - Several top investors in ...\n", "1 ROME/PARIS (Reuters) - In Paris and Rome, it w...\n", "2 ATHENS/BRUSSELS (Reuters) - Greece's new lefti...\n", "3 NEW YORK (Reuters) - The good news from Friday...\n", "4 LOS ANGELES (Reuters) - The loading and unload...\n", "Name: description, dtype: object" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "print news_data[\"description\"][0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "NEW YORK (Reuters) - Several top investors in Twenty-First Century Fox Inc are pressing for the right to swap their voting shares for ordinary shares, which are trading at an unusual premium, even though the move could hand even more control of the company to Rupert Murdoch, according to people familiar with the matter.
\n", " \n", "
\"\"/\n" ] } ], "prompt_number": 9 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Visualizing article in HTML" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%HTML\n", "NEW YORK (Reuters) - Several top investors in Twenty-First Century Fox Inc are pressing for the right to swap their voting shares for ordinary shares, which are trading at an unusual premium, even though the move could hand even more control of the company to Rupert Murdoch, according to people familiar with the matter.
\n", " \n", "
\"\"/" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "NEW YORK (Reuters) - Several top investors in Twenty-First Century Fox Inc are pressing for the right to swap their voting shares for ordinary shares, which are trading at an unusual premium, even though the move could hand even more control of the company to Rupert Murdoch, according to people familiar with the matter.
\n", " \n", "
\"\"/" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 10 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Extracting text from the description" ] }, { "cell_type": "code", "collapsed": false, "input": [ "news_data[\"short_description\"] = [item[item.find(\" - \")+3:item.find(\"<\")] for item in news_data[\"description\"]]\n", "news_data" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
title description category short_description
0 Exclusive: Top Fox investors seek to convert v... NEW YORK (Reuters) - Several top investors in ... businessNews Several top investors in Twenty-First Century ...
1 Different delivery, one message to Greece's ne... ROME/PARIS (Reuters) - In Paris and Rome, it w... businessNews In Paris and Rome, it was sugar coated; in Ber...
2 Isolated Greece wants no more bailout money wi... ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews Greece's new leftist-led government, isolated ...
3 Valuations may hurt small caps, despite job gr... NEW YORK (Reuters) - The good news from Friday... businessNews The good news from Friday's jobs report may al...
4 Shippers suspend weekend cargo loading at U.S.... LOS ANGELES (Reuters) - The loading and unload... businessNews The loading and unloading of freighters will b...
5 Building unions return to Kentucky refinery de... NEW YORK (Reuters) - Workers represented by th... businessNews Workers represented by the Building Trades Uni...
6 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... businessNews Walkie-talkie and radio systems maker Motorola...
7 Wall St. ends down on interest rate, Greece ji... NEW YORK (Reuters) - Wall Street stocks fell o... businessNews Wall Street stocks fell on Friday as a better-...
8 Strong U.S. job, wage gains open door to mid-y... WASHINGTON (Reuters) - U.S. job growth rose so... businessNews U.S. job growth rose solidly in January and wa...
9 Building unions return to Kentucky refinery de... NEW YORK (Reuters) - Workers represented by th... businessNews Workers represented by the Building Trades Uni...
10 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... businessNews Walkie-talkie and radio systems maker Motorola...
11 JPMorgan under scrutiny over hiring of Chinese... (Reuters) - JPMorgan Chase & Co is under feder... businessNews JPMorgan Chase & Co is under federal scrutiny ...
12 Exclusive: Top Fox investors seek to convert v... NEW YORK (Reuters) - Several top investors in ... businessNews Several top investors in Twenty-First Century ...
13 Isolated Greece wants no more bailout money wi... ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews Greece's new leftist-led government, isolated ...
14 Wall St. ends down on interest rate, Greece ji... NEW YORK (Reuters) - Wall Street stocks fell o... businessNews Wall Street stocks fell on Friday as a better-...
15 RadioShack would accept liquidation bids: lawyer (Reuters) - A lawyer for RadioShack on Friday ... businessNews A lawyer for RadioShack on Friday said the ban...
16 Strong U.S. job, wage gains open door to mid-y... WASHINGTON (Reuters) - U.S. job growth rose so... businessNews U.S. job growth rose solidly in January and wa...
17 Wall Street firms waver over June 2015 rate hi... NEW YORK (Reuters) - Economists at Wall Street... businessNews Economists at Wall Street's biggest banks are ...
18 Oil climbs, Brent posts best two weeks since 1998 NEW YORK (Reuters) - Oil rallied again on Frid... businessNews Oil rallied again on Friday, with benchmark Br...
19 Brazil's Petrobras taps state banker as CEO; s... BRASILIA (Reuters) - Brazil's President Dilma ... businessNews Brazil's President Dilma Rousseff tapped a con...
20 Feds can't indefinitely gag Yahoo about subpoe... SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews Law enforcement cannot indefinitely forbid Yah...
21 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... technologyNews Walkie-talkie and radio systems maker Motorola...
22 Anthem warns U.S. customers of email scam afte... (Reuters) - Health insurer Anthem Inc on Frida... technologyNews Health insurer Anthem Inc on Friday warned U.S...
23 U.S. court orders Symantec to pay $17 mln for ... (Reuters) - Symantec Corp, maker of the popula... technologyNews Symantec Corp, maker of the popular Norton ant...
24 Motorola exploring possible sale: Bloomberg (Reuters) - Walkie-talkie and radio systems ma... technologyNews Walkie-talkie and radio systems maker Motorola...
25 RadioShack would accept liquidation bids: lawyer (Reuters) - A lawyer for RadioShack on Friday ... technologyNews A lawyer for RadioShack on Friday said the ban...
26 Anthem warns U.S. customers of email scam afte... (Reuters) - Health insurer Anthem Inc on Frida... technologyNews Health insurer Anthem Inc on Friday warned U.S...
27 Google can disrupt car industry but is no auto... FRANKFURT (Reuters) - Technology companies suc... technologyNews Technology companies such as Google are unlike...
28 Chipmaker Intel promotes senior executives, du... SAN FRANCISCO (Reuters) - Chipmaker Intel Corp... technologyNews Chipmaker Intel Corp said on Friday it promote...
29 Feds can't indefinitely gag Yahoo about subpoe... SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews Law enforcement cannot indefinitely forbid Yah...
30 Google panel backs firm on EU limit to 'right ... BRUSSELS (Reuters) - A panel of experts appoin... technologyNews A panel of experts appointed by Google to advi...
31 Family money puts its faith in European techno... LONDON (Reuters) - Some of Europe's wealthiest... technologyNews Some of Europe's wealthiest families are inves...
32 U.S. states probe massive data breach at healt... NEW YORK (Reuters) - Several U.S. states are i... technologyNews Several U.S. states are investigating a massiv...
33 Health insurer Anthem hit by massive cybersecu... (Reuters) - Health insurer Anthem Inc , which ... technologyNews Health insurer Anthem Inc , which has nearly 4...
34 China to have most robots in world by 2017 FRANKFURT (Reuters) - China will have more rob... technologyNews China will have more robots operating in its p...
35 U.S. businesses ask White House to help on Chi... WASHINGTON (Reuters) - U.S. business lobbies c... technologyNews U.S. business lobbies called on the White Hous...
36 Sharp says not considering sale of any oversea... TOKYO (Reuters) - Japanese electronics maker S... technologyNews Japanese electronics maker Sharp Corp said a s...
37 RadioShack files for bankruptcy; Sprint to tak... (Reuters) - Electronics retailer RadioShack Co... technologyNews Electronics retailer RadioShack Corp filed for...
38 SEC probes Blackberry options trading ahead of... NEW YORK (Reuters) - The U.S. Securities and E... technologyNews The U.S. Securities and Exchange Commission is...
39 Twitter tops Wall Street revenue target, user ... SAN FRANCISCO (Reuters) - Twitter Inc said on ... technologyNews Twitter Inc said on Thursday the social media ...
40 U.S. Golf Association to launch Senior Women's... NEW YORK (Reuters) - The United States Golf As... sportsNews The United States Golf Association will launch...
41 In-form Wiesberger goes two shots clear in Kua... KUALA LUMPUR (Reuters) - Austrian Bernd Wiesbe... sportsNews Austrian Bernd Wiesberger will take a two-shot...
42 Jansrud signals downhill ambition in final tra... BEAVER CREEK, Colorado (Reuters) - Unheralded ... sportsNews Unheralded Frenchman Brice Roger posted the fa...
43 NFL, players union spar in court over Peterson... MINNEAPOLIS (Reuters) - The National Football ... sportsNews The National Football League and its players u...
44 English takes charge as Mickelson exits Torrey... LA JOLLA, California (Reuters) - Local favorit... sportsNews Local favorite Phil Mickelson has joined Tiger...
45 No pressure for double medallist Fenninger BEAVER CREEK, Colorado (Reuters) - A Super-G v... sportsNews A Super-G victory at last year's Winter Olympi...
46 Referees association backs rookie in flare-up ... (Reuters) - The National Basketball Referees A... sportsNews The National Basketball Referees Association c...
47 Rangers goalie Lundqvist out for three weeks w... (Reuters) - New York Rangers goalie Henrik Lun... sportsNews New York Rangers goalie Henrik Lundqvist will ...
48 Miracle now needed to win gold, says Vonn BEAVER CREEK, Colorado (Reuters) - Denied gold... sportsNews Denied gold medals in her best events, America...
49 NFL and players union spar in court over Peter... MINNEAPOLIS (Reuters) - The National Football ... sportsNews The National Football League and its players u...
50 No sharing world downhill gold for Maze BEAVER CREEK, Colorado (Reuters) - Tina Maze h... sportsNews Tina Maze had to share Olympic downhill gold b...
51 Doping should be criminal offense: Russia chief MOSCOW (Reuters) - Russia's Athletics Federati... sportsNews Russia's Athletics Federation (VFLA) president...
52 Kansas police drop drug charges against Dallas... DALLAS, Texas (Reuters) - Authorities have dro... sportsNews Authorities have dropped drug charges against ...
53 Thompson leads by one stroke at Torrey Pines LA JOLLA, California (Reuters - American Nicho... sportsNews American Nicholas Thompson has a one-shot lead...
54 Borzakovskiy takes over as Russia's caretaker ... MOSCOW (Reuters) - Former Olympic 800 meters c... sportsNews Former Olympic 800 meters champion Yury Borzak...
55 Clubs need long-term mental health plan: FIFPro (Reuters) - Soccer clubs need to take a longer... sportsNews Soccer clubs need to take a longer-term approa...
56 European Games will convince skeptics, attract... (Reuters) - The inaugural European Games this ... sportsNews The inaugural European Games this year will co...
57 Departing VFLA President to continue fight aga... MOSCOW (Reuters) - Russian Athletics Federatio... sportsNews Russian Athletics Federation (VFLA) President ...
58 Canizares joins champion Westwood at the top i... KUALA LUMPUR (Reuters) - Spaniard Alejandro Ca... sportsNews Spaniard Alejandro Canizares picked up three s...
59 Force India opposed Marussia's bid to use old car LONDON (Reuters) - Marussia's hopes of rising ... sportsNews Marussia's hopes of rising from the dead to ra...
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ " title \\\n", "0 Exclusive: Top Fox investors seek to convert v... \n", "1 Different delivery, one message to Greece's ne... \n", "2 Isolated Greece wants no more bailout money wi... \n", "3 Valuations may hurt small caps, despite job gr... \n", "4 Shippers suspend weekend cargo loading at U.S.... \n", "5 Building unions return to Kentucky refinery de... \n", "6 Motorola exploring possible sale: Bloomberg \n", "7 Wall St. ends down on interest rate, Greece ji... \n", "8 Strong U.S. job, wage gains open door to mid-y... \n", "9 Building unions return to Kentucky refinery de... \n", "10 Motorola exploring possible sale: Bloomberg \n", "11 JPMorgan under scrutiny over hiring of Chinese... \n", "12 Exclusive: Top Fox investors seek to convert v... \n", "13 Isolated Greece wants no more bailout money wi... \n", "14 Wall St. ends down on interest rate, Greece ji... \n", "15 RadioShack would accept liquidation bids: lawyer \n", "16 Strong U.S. job, wage gains open door to mid-y... \n", "17 Wall Street firms waver over June 2015 rate hi... \n", "18 Oil climbs, Brent posts best two weeks since 1998 \n", "19 Brazil's Petrobras taps state banker as CEO; s... \n", "20 Feds can't indefinitely gag Yahoo about subpoe... \n", "21 Motorola exploring possible sale: Bloomberg \n", "22 Anthem warns U.S. customers of email scam afte... \n", "23 U.S. court orders Symantec to pay $17 mln for ... \n", "24 Motorola exploring possible sale: Bloomberg \n", "25 RadioShack would accept liquidation bids: lawyer \n", "26 Anthem warns U.S. customers of email scam afte... \n", "27 Google can disrupt car industry but is no auto... \n", "28 Chipmaker Intel promotes senior executives, du... \n", "29 Feds can't indefinitely gag Yahoo about subpoe... \n", "30 Google panel backs firm on EU limit to 'right ... \n", "31 Family money puts its faith in European techno... \n", "32 U.S. states probe massive data breach at healt... 
\n", "33 Health insurer Anthem hit by massive cybersecu... \n", "34 China to have most robots in world by 2017 \n", "35 U.S. businesses ask White House to help on Chi... \n", "36 Sharp says not considering sale of any oversea... \n", "37 RadioShack files for bankruptcy; Sprint to tak... \n", "38 SEC probes Blackberry options trading ahead of... \n", "39 Twitter tops Wall Street revenue target, user ... \n", "40 U.S. Golf Association to launch Senior Women's... \n", "41 In-form Wiesberger goes two shots clear in Kua... \n", "42 Jansrud signals downhill ambition in final tra... \n", "43 NFL, players union spar in court over Peterson... \n", "44 English takes charge as Mickelson exits Torrey... \n", "45 No pressure for double medallist Fenninger \n", "46 Referees association backs rookie in flare-up ... \n", "47 Rangers goalie Lundqvist out for three weeks w... \n", "48 Miracle now needed to win gold, says Vonn \n", "49 NFL and players union spar in court over Peter... \n", "50 No sharing world downhill gold for Maze \n", "51 Doping should be criminal offense: Russia chief \n", "52 Kansas police drop drug charges against Dallas... \n", "53 Thompson leads by one stroke at Torrey Pines \n", "54 Borzakovskiy takes over as Russia's caretaker ... \n", "55 Clubs need long-term mental health plan: FIFPro \n", "56 European Games will convince skeptics, attract... \n", "57 Departing VFLA President to continue fight aga... \n", "58 Canizares joins champion Westwood at the top i... \n", "59 Force India opposed Marussia's bid to use old car \n", "\n", " description category \\\n", "0 NEW YORK (Reuters) - Several top investors in ... businessNews \n", "1 ROME/PARIS (Reuters) - In Paris and Rome, it w... businessNews \n", "2 ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews \n", "3 NEW YORK (Reuters) - The good news from Friday... businessNews \n", "4 LOS ANGELES (Reuters) - The loading and unload... 
businessNews \n", "5 NEW YORK (Reuters) - Workers represented by th... businessNews \n", "6 (Reuters) - Walkie-talkie and radio systems ma... businessNews \n", "7 NEW YORK (Reuters) - Wall Street stocks fell o... businessNews \n", "8 WASHINGTON (Reuters) - U.S. job growth rose so... businessNews \n", "9 NEW YORK (Reuters) - Workers represented by th... businessNews \n", "10 (Reuters) - Walkie-talkie and radio systems ma... businessNews \n", "11 (Reuters) - JPMorgan Chase & Co is under feder... businessNews \n", "12 NEW YORK (Reuters) - Several top investors in ... businessNews \n", "13 ATHENS/BRUSSELS (Reuters) - Greece's new lefti... businessNews \n", "14 NEW YORK (Reuters) - Wall Street stocks fell o... businessNews \n", "15 (Reuters) - A lawyer for RadioShack on Friday ... businessNews \n", "16 WASHINGTON (Reuters) - U.S. job growth rose so... businessNews \n", "17 NEW YORK (Reuters) - Economists at Wall Street... businessNews \n", "18 NEW YORK (Reuters) - Oil rallied again on Frid... businessNews \n", "19 BRASILIA (Reuters) - Brazil's President Dilma ... businessNews \n", "20 SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews \n", "21 (Reuters) - Walkie-talkie and radio systems ma... technologyNews \n", "22 (Reuters) - Health insurer Anthem Inc on Frida... technologyNews \n", "23 (Reuters) - Symantec Corp, maker of the popula... technologyNews \n", "24 (Reuters) - Walkie-talkie and radio systems ma... technologyNews \n", "25 (Reuters) - A lawyer for RadioShack on Friday ... technologyNews \n", "26 (Reuters) - Health insurer Anthem Inc on Frida... technologyNews \n", "27 FRANKFURT (Reuters) - Technology companies suc... technologyNews \n", "28 SAN FRANCISCO (Reuters) - Chipmaker Intel Corp... technologyNews \n", "29 SAN FRANCISCO (Reuters) - Law enforcement cann... technologyNews \n", "30 BRUSSELS (Reuters) - A panel of experts appoin... technologyNews \n", "31 LONDON (Reuters) - Some of Europe's wealthiest... 
technologyNews \n", "32 NEW YORK (Reuters) - Several U.S. states are i... technologyNews \n", "33 (Reuters) - Health insurer Anthem Inc , which ... technologyNews \n", "34 FRANKFURT (Reuters) - China will have more rob... technologyNews \n", "35 WASHINGTON (Reuters) - U.S. business lobbies c... technologyNews \n", "36 TOKYO (Reuters) - Japanese electronics maker S... technologyNews \n", "37 (Reuters) - Electronics retailer RadioShack Co... technologyNews \n", "38 NEW YORK (Reuters) - The U.S. Securities and E... technologyNews \n", "39 SAN FRANCISCO (Reuters) - Twitter Inc said on ... technologyNews \n", "40 NEW YORK (Reuters) - The United States Golf As... sportsNews \n", "41 KUALA LUMPUR (Reuters) - Austrian Bernd Wiesbe... sportsNews \n", "42 BEAVER CREEK, Colorado (Reuters) - Unheralded ... sportsNews \n", "43 MINNEAPOLIS (Reuters) - The National Football ... sportsNews \n", "44 LA JOLLA, California (Reuters) - Local favorit... sportsNews \n", "45 BEAVER CREEK, Colorado (Reuters) - A Super-G v... sportsNews \n", "46 (Reuters) - The National Basketball Referees A... sportsNews \n", "47 (Reuters) - New York Rangers goalie Henrik Lun... sportsNews \n", "48 BEAVER CREEK, Colorado (Reuters) - Denied gold... sportsNews \n", "49 MINNEAPOLIS (Reuters) - The National Football ... sportsNews \n", "50 BEAVER CREEK, Colorado (Reuters) - Tina Maze h... sportsNews \n", "51 MOSCOW (Reuters) - Russia's Athletics Federati... sportsNews \n", "52 DALLAS, Texas (Reuters) - Authorities have dro... sportsNews \n", "53 LA JOLLA, California (Reuters - American Nicho... sportsNews \n", "54 MOSCOW (Reuters) - Former Olympic 800 meters c... sportsNews \n", "55 (Reuters) - Soccer clubs need to take a longer... sportsNews \n", "56 (Reuters) - The inaugural European Games this ... sportsNews \n", "57 MOSCOW (Reuters) - Russian Athletics Federatio... sportsNews \n", "58 KUALA LUMPUR (Reuters) - Spaniard Alejandro Ca... sportsNews \n", "59 LONDON (Reuters) - Marussia's hopes of rising ... 
sportsNews \n", "\n", " short_description \n", "0 Several top investors in Twenty-First Century ... \n", "1 In Paris and Rome, it was sugar coated; in Ber... \n", "2 Greece's new leftist-led government, isolated ... \n", "3 The good news from Friday's jobs report may al... \n", "4 The loading and unloading of freighters will b... \n", "5 Workers represented by the Building Trades Uni... \n", "6 Walkie-talkie and radio systems maker Motorola... \n", "7 Wall Street stocks fell on Friday as a better-... \n", "8 U.S. job growth rose solidly in January and wa... \n", "9 Workers represented by the Building Trades Uni... \n", "10 Walkie-talkie and radio systems maker Motorola... \n", "11 JPMorgan Chase & Co is under federal scrutiny ... \n", "12 Several top investors in Twenty-First Century ... \n", "13 Greece's new leftist-led government, isolated ... \n", "14 Wall Street stocks fell on Friday as a better-... \n", "15 A lawyer for RadioShack on Friday said the ban... \n", "16 U.S. job growth rose solidly in January and wa... \n", "17 Economists at Wall Street's biggest banks are ... \n", "18 Oil rallied again on Friday, with benchmark Br... \n", "19 Brazil's President Dilma Rousseff tapped a con... \n", "20 Law enforcement cannot indefinitely forbid Yah... \n", "21 Walkie-talkie and radio systems maker Motorola... \n", "22 Health insurer Anthem Inc on Friday warned U.S... \n", "23 Symantec Corp, maker of the popular Norton ant... \n", "24 Walkie-talkie and radio systems maker Motorola... \n", "25 A lawyer for RadioShack on Friday said the ban... \n", "26 Health insurer Anthem Inc on Friday warned U.S... \n", "27 Technology companies such as Google are unlike... \n", "28 Chipmaker Intel Corp said on Friday it promote... \n", "29 Law enforcement cannot indefinitely forbid Yah... \n", "30 A panel of experts appointed by Google to advi... \n", "31 Some of Europe's wealthiest families are inves... \n", "32 Several U.S. states are investigating a massiv... 
\n", "33 Health insurer Anthem Inc , which has nearly 4... \n", "34 China will have more robots operating in its p... \n", "35 U.S. business lobbies called on the White Hous... \n", "36 Japanese electronics maker Sharp Corp said a s... \n", "37 Electronics retailer RadioShack Corp filed for... \n", "38 The U.S. Securities and Exchange Commission is... \n", "39 Twitter Inc said on Thursday the social media ... \n", "40 The United States Golf Association will launch... \n", "41 Austrian Bernd Wiesberger will take a two-shot... \n", "42 Unheralded Frenchman Brice Roger posted the fa... \n", "43 The National Football League and its players u... \n", "44 Local favorite Phil Mickelson has joined Tiger... \n", "45 A Super-G victory at last year's Winter Olympi... \n", "46 The National Basketball Referees Association c... \n", "47 New York Rangers goalie Henrik Lundqvist will ... \n", "48 Denied gold medals in her best events, America... \n", "49 The National Football League and its players u... \n", "50 Tina Maze had to share Olympic downhill gold b... \n", "51 Russia's Athletics Federation (VFLA) president... \n", "52 Authorities have dropped drug charges against ... \n", "53 American Nicholas Thompson has a one-shot lead... \n", "54 Former Olympic 800 meters champion Yury Borzak... \n", "55 Soccer clubs need to take a longer-term approa... \n", "56 The inaugural European Games this year will co... \n", "57 Russian Athletics Federation (VFLA) President ... \n", "58 Spaniard Alejandro Canizares picked up three s... \n", "59 Marussia's hopes of rising from the dead to ra... 
" ] } ], "prompt_number": 11 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Extract Features with CountVectorizer" ] }, { "cell_type": "code", "collapsed": false, "input": [ "corpus = news_data[\"short_description\"]\n", "vectorizer = CountVectorizer(min_df=1)\n", "X = vectorizer.fit_transform(corpus).toarray()\n", "print X.shape\n", "X" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(60, 809)\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 1, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 1],\n", " ..., \n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]])" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer.get_feature_names()[:25]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "[u'11',\n", " u'14',\n", " u'17',\n", " u'18',\n", " u'2012',\n", " u'2017',\n", " u'2018',\n", " u'29',\n", " u'40',\n", " u'400',\n", " u'65',\n", " u'800',\n", " u'about',\n", " u'abroad',\n", " u'abuse',\n", " u'accept',\n", " u'according',\n", " u'account',\n", " u'added',\n", " u'admitted',\n", " u'adrian',\n", " u'advise',\n", " u'affiliate',\n", " u'after',\n", " u'again']" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "categories = news_data[\"category\"].unique()\n", "category_dict = {value:index for index, value in enumerate(categories)}\n", "results = news_data[\"category\"].map(category_dict)\n", "category_dict" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "{'businessNews': 0, 'sportsNews': 2, 'technologyNews': 1}" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "print \"corpus size: %s\" % 
len(vectorizer.get_feature_names())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "corpus size: 809\n" ] } ], "prompt_number": 15 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Multinomial Naive Bayes Classifier" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_train, x_test, y_train, y_test = train_test_split(X, results, test_size=0.2, random_state=1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "clf = MultinomialNB()\n", "clf.fit(x_train, y_train)\n", "clf.score(x_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "0.83333333333333337" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "y_test" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "array([1, 2, 0, 2, 2, 2, 1, 1, 2, 1, 1, 2])" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "clf.predict(x_test)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "array([1, 2, 0, 2, 2, 2, 1, 0, 2, 1, 0, 2])" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "category_dict" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "{'businessNews': 0, 'sportsNews': 2, 'technologyNews': 1}" ] } ], "prompt_number": 20 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Manual Testing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "text = [\"Who won the Superbawl?\"]\n", "vec_text = vectorizer.transform(text).toarray()\n", 
"category_dict.keys()[category_dict.values().index(clf.predict(vec_text)[0])]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "'sportsNews'" ] } ], "prompt_number": 21 } ], "metadata": {} } ] }