{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 11 - Web Information Retrieval \n", "\n", "by [Alejandro Correa Bahnsen](albahnsen.com/)\n", "\n", "version 0.1, Apr 2016\n", "\n", "## Part of the class [Practical Machine Learning](https://github.com/albahnsen/PracticalMachineLearningClass)\n", "\n", "\n", "\n", "This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. (Wikipedia)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "url_ = 'http://mashable.com/2016/04/01/facebook-live-shooting/?utm_cid=mash-prod-nav-sub-st#44rpYp4efOqf'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import IFrame\n", "IFrame(url_, 600, 600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the webpage into a Python object\n", "\n", "If we want to collect information from hundreds or thousands of webpages, doing it manually is a no go. 
Instead, let's load the webpage's content into Python using web scraping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the HTML code of the webpage" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import urllib.request\n", "response = urllib.request.urlopen(url_)\n", "html = response.read()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "b'\\n\\n\\n\\n\n", " \n", " Chicago man appears to stream his own shooting on Facebook\n", " \n", "

\n", " Mashable\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " We're using cookies to improve your experience.\n", " \n", " Click Here to find out more.\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "

\n", " World\n", "

\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " Chicago man appears to stream his own shooting on Facebook\n", "

\n", " \n", "
\n", " \"Shooter\"\n", "
\n", " A man is seen firing a gun in a Facebook Live video reportedly taken in Chicago.\n", "
\n", "
\n", " Image: Facebook\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "

\n", " I guess it was only a matter of time.\n", "

\n", "

\n", " Chicago Police are investigating an extraordinary video that appears to capture the shooting of a man while he was streaming on Facebook Live.\n", "

\n", "

\n", " Watch the video (graphic content):\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " \n", "

\n", "

\n", " Damn he 󾆵 was on LIVE 󾓹 talking 󾆓 shit with no 󾍑 burner 󾓵 and got his 󾆵 life took 󾆳󾆯 smds shit crazy 󾍁󾍀󾍛󾭻 . #LITERALLY\n", "

\n", "

\n", " Posted by\n", " \n", " TcWorld Creamer\n", " \n", " on Thursday, March 31, 2016\n", "

\n", "
\n", "
\n", "
\n", "
\n", "

\n", " In the clip, which was re-recorded from its original source on Thursday and disseminated widely, a man in a blue Chicago White Sox hat is seen talking into the camera while standing in front of Scott's Convenience Store in Chicago's West Englewood neighborhood.\n", "

\n", "

\n", " He jokes that the store is open because he needed \"somewhere to duck and hide for cover.\"\n", "

\n", "
\n", "

\n", " \"Scott's\n", "

\n", "
\n", "

\n", " Scott's Convenience Store is seen in the Facebook video and on Google Street View.\n", "

\n", "
\n", "
\n", "

\n", " Image: google/facebook\n", "

\n", "
\n", "
\n", "

\n", " \n", " Moments later, shots ring out and the phone drops to the street, camera up. The apparent assailant, wearing red, then steps into the frame and continues firing the gun elsewhere.\n", " \n", "

\n", "

\n", " A minute passes and a woman begins screaming, “Oh my God, no! I can’t believe this.\"\n", "

\n", "

\n", " While many people have questioned the video's authenticity, especially as it surfaced in the hours before April Fools' Day, Chicago Police say it's likely legitimate.\n", "

\n", "

\n", " \"CPD is aware of the social media video in question and suspect the video is connected to the incident,\" Officer Kevin Quaid, with the Office of News Affairs, told\n", " \n", " Mashable.\n", " \n", "

\n", " \n", "

\n", " He added that detectives were working to confirm its authenticity, but that a man, 31, was shot on Thursday at the location seen in the video and transported to Mount Sinai hospital in critical condition.\n", "

\n", "

\n", " Detectives are waiting to speak with the apparent victim, who is now under sedation, The Associated Press reported.\n", "

\n", "

\n", " The victim is believed to be a known gang member.\n", "

\n", "

\n", " Peter Nickeas, a\n", " \n", " Chicago Tribune\n", " \n", " reporter who wrote about the video,\n", " \n", " tweeted\n", " \n", " that police are privately a lot more confident than their official statements suggest.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " Video is probably real but important (essential, IMO) to make sure it's real, not just that it couldn't conceivably be fake.\n", "

\n", "

\n", " — Peter Nickeas (@PeterNickeas)\n", " \n", " April 1, 2016\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " \"Video is probably real but important (essential, IMO) to make sure it's real, not just that it couldn't conceivably be fake,\" he said. \"Police are privately more confident (a lot more) in the video's authenticity than the CPD news affairs statement lets on.\"\n", " \n", "

\n", "

\n", " If it proves to be real, this would probably be a first for the live video product Facebook CEO Mark Zuckerberg is\n", " \n", " reportedly\n", " \n", " \"obsessed\" with.\n", "

\n", "

\n", " The shocking video comes as numerous news organizations have embraced the medium, which has been used to stream live scenes from protests, political events and average life.\n", "

\n", "

\n", " Violence in Chicago is reaching levels \"unseen in years,\"\n", " \n", " according to the\n", " \n", " Tribune\n", " \n", " \n", " \n", " \n", " , with shootings up 73% over this time last year.\n", " \n", " More than 700 people have been shot in the year's first quarter.\n", " \n", "

\n", "

\n", " \n", " The suspected shooter in Thursday's incident is still at large.\n", " \n", "

\n", "

\n", " \n", " Have something to add to this story? Share it in the comments.\n", " \n", "

\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", " Image: Facebook\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " Load Comments\n", " \n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "

\n", " More in World\n", "

\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " What's New\n", "

\n", "
\n", "
\n", "
\n", "
\n", "

\n", " What's Rising\n", "

\n", "
\n", "
\n", "
\n", "
\n", "

\n", " What's Hot\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", " \n", " \n", "
\n", "
\n", " \n", " \n", "\n", "\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(html, 'html.parser')\n", "\n", "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Page title" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Chicago man appears to stream his own shooting on Facebook'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = soup.title.string\n", "title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Author name" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[By Brian Ries]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "author = soup.find_all(\"span\", { \"class\" : \"author_name\"})\n", "author" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[By Brian Ries]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If author is empty try this:\n", "if author == []:\n", " author = soup.find_all(\"span\", { \"class\" : \"byline basic\"})\n", "author" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['[')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'By Brian Ries')[1]\n", "author" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Brian Ries\n", " 1.5k\n", "
Shares
\n", " ]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shares = soup.find_all(\"div\", { \"class\" : \"total-shares\"})\n", "shares" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'1.5k'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "# Extract the share count (e.g. '1.5k') from the text of the first matching element\n", "shares = re.search(r'[\\d.]+k?', shares[0].get_text()).group()\n", "shares" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'1500'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if 'k' in shares:\n", "    # '1.5k' -> '1500'; scaling numerically also handles values without a decimal point, e.g. '2k'\n", "    shares = str(round(float(shares[:-1]) * 1000))\n", "shares" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Author webpage" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[\"Headshot_2015_brianries_updatedshot_1\"
By Brian Ries
]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "author_web = soup.find_all(\"a\", { \"class\" : \"byline\"})\n", "author_web" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'/people/moneyries/\">\"Headshot_2015_brianries_updatedshot_1\"
By Brian Ries
]'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if author_web != []:\n", " author_web = str(author_web).split('href=\"')[1]\n", "author_web" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'/people/moneyries/'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if author_web != []:\n", " author_web = author_web.split('\">')[0]\n", "author_web" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'http://mashable.com/people/moneyries/'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if author_web != []:\n", " author_web = 'http://mashable.com' + author_web\n", "author_web" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Video on webpage" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Where the article text starts and where it ends\n", "\n", "temp = str(soup)[str(soup).find('class=\"author_and_date\"'):]\n", "\n", "temp = temp[:temp.find('class=\"article-topics\"')]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'class=\"author_and_date\">By Brian Ries\\n\\n
\\n
\\n

I guess it was only a matter of time.

\\n

Chicago Police are investigating an extraordinary video that appears to capture the shooting of a man while he was streaming on Facebook Live.

\\n

Watch the video (graphic content):

\\n
\\n
\\n
\\n

\\n

Damn he \\U000fe1b5 was on LIVE \\U000fe4f9 talking \\U000fe193 shit with no \\U000fe351 burner \\U000fe4f5 and got his \\U000fe1b5 life took \\U000fe1b3\\U000fe1af smds shit crazy \\U000fe341\\U000fe340\\U000fe35b\\U000feb7b . #LITERALLY

\\n

Posted by TcWorld Creamer on Thursday, March 31, 2016

\\n
\\n
\\n

In the clip, which was re-recorded from its original source on Thursday and disseminated widely, a man in a blue Chicago White Sox hat is seen talking into the camera while standing in front of Scott\\'s Convenience Store in Chicago\\'s West Englewood neighborhood.

\\n

He jokes that the store is open because he needed \"somewhere to duck and hide for cover.\"

\\n

\"Scott\\'s

\\n

Scott\\'s Convenience Store is seen in the Facebook video and on Google Street View.

Image: google/facebook

Moments later, shots ring out and the phone drops to the street, camera up. The apparent assailant, wearing red, then steps into the frame and continues firing the gun elsewhere.

\\n

A minute passes and a woman begins screaming,\\xa0“Oh my God, no! I can’t believe this.\"

\\n

While many people have questioned the video\\'s authenticity, especially as it surfaced in the hours before April Fools\\' Day, Chicago Police say it\\'s likely legitimate.

\\n

\"CPD is aware of the social media video in question and suspect the video is connected to the incident,\" Officer Kevin Quaid, with the Office of News Affairs, told Mashable.\\xa0

\\n\\n

He added\\xa0that\\xa0detectives were working to confirm its authenticity, but that a man, 31, was shot on Thursday at the location seen in the video and transported to Mount Sinai hospital in critical condition.

\\n

Detectives are waiting to speak with the apparent victim, who is now under sedation, The Associated Press reported.

\\n

The victim is believed to be a known gang member.

\\n

Peter Nickeas, a Chicago Tribune reporter who wrote about the video, tweeted\\xa0that police are privately a lot more confident than their official statements suggest.

\\n
\\n

Video is probably real but important (essential, IMO) to make sure it\\'s real, not just that it couldn\\'t conceivably be fake.

\\n

— Peter Nickeas (@PeterNickeas) April 1, 2016

\\n
\\n

\"Video is probably real but important (essential, IMO) to make sure it\\'s real, not just that it couldn\\'t conceivably be fake,\" he said. \"Police are privately more confident (a lot more) in the video\\'s authenticity than the CPD news affairs statement lets on.\"

\\n

If it proves to be real, this would probably be a first for the live video product Facebook CEO Mark Zuckerberg is reportedly \"obsessed\" with.

\\n

The shocking video comes as numerous news organizations have embraced the medium, which has been used to stream live scenes from protests, political events and average life.

\\n

Violence in Chicago is reaching levels \"unseen in years,\" according to the Tribune, with shootings up\\xa073% over this time last year.\\xa0More than 700 people have been shot in the year\\'s first quarter.

\\n

The suspected shooter in Thursday\\'s incident is still at large.

\\n

Have something to add to this story? Share it in the comments.

\\n
\\n\\n
\\n
\\n