# 11 - Web Information Retrieval 

by [Alejandro Correa Bahnsen](albahnsen.com/)

version 0.1, Apr 2016

## Part of the class [Practical Machine Learning](https://github.com/albahnsen/PracticalMachineLearningClass)



This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). 

## Introduction

Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. (Wikipedia)

In [1]:
url_ = 'http://mashable.com/2016/04/01/facebook-live-shooting/?utm_cid=mash-prod-nav-sub-st#44rpYp4efOqf'

In [2]:
from IPython.display import IFrame
IFrame(url_, 600, 600)

## Get the webpage into a Python object

If we want to collect information from hundreds or thousands of webpages, doing it manually is a no-go. Instead, let's load the content of the webpage into Python using web scraping.

### Download the HTML code of the webpage

In [3]:
import urllib.request
response = urllib.request.urlopen(url_)
html = response.read()

In [4]:
html[0:800]

b'<!DOCTYPE html>\n<!--\no o     o     +              o\n+   +     +             o     +       +\n            +\no  +    +        o  +           +        +\n     __  __           _           _     _\n~_,-|  \\/  | __ _ ___| |__   __ _| |__ | | ___\n    | |\\/| |/ _` / __| \'_ \\ / _` | \'_ \\| |/ _ \\,-~_,- - - ,\n~_,-| |  | | (_| \\__ \\ | | | (_| | |_) | |  __/    |   /\\_/\\\n    |_|  |_|\\__,_|___/_| |_|\\__,_|_.__/|_|\\___|  ~=|__( ^ .^)\n~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,""   ""\no o     o     +              o\n+   +     +             o     +       +\n            +\no  +    +        o  +           +        +\n-->\n<html data-env=\'production\' lang=\'en\' xml:lang=\'en\'>\n<head>\n<script>\n  window.__o = {"channel":"world","content_type":"article","v_buy":null,"v_buy_i":null,"h_pub":16.0,"h_buy":null,"h_pub'
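The `b'...'` prefix in the output above shows that `response.read()` returns raw bytes, not text. The following is a minimal sketch of a slightly more defensive download step. The helper names `decode_body` and `fetch_html`, the 10 second timeout, the `User-Agent` value, and the UTF-8 fallback are all assumptions made for illustration, not part of the original notebook.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode raw response bytes, falling back to UTF-8 (an assumption)."""
    return raw.decode(charset or 'utf-8', errors='replace')


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a str (hypothetical helper)."""
    # Some sites reject the default urllib User-Agent; this value is illustrative.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

`BeautifulSoup` accepts either bytes or text, so decoding is optional for parsing, but a `str` is easier to slice and search by hand.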

### Let's parse the input into a more readable form

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<!--
o o     o     +              o
+   +     +             o     +       +
            +
o  +    +        o  +           +        +
     __  __           _           _     _
~_,-|  \/  | __ _ ___| |__   __ _| |__ | | ___
    | |\/| |/ _` / __| '_ \ / _` | '_ \| |/ _ \,-~_,- - - ,
~_,-| |  | | (_| \__ \ | | | (_| | |_) | |  __/    |   /\_/\
    |_|  |_|\__,_|___/_| |_|\__,_|_.__/|_|\___|  ~=|__( ^ .^)
~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,""   ""
o o     o     +              o
+   +     +             o     +       +
            +
o  +    +        o  +           +        +
-->
<html data-env="production" lang="en" xml:lang="en">
 <head>
  <script>
   window.__o = {"channel":"world","content_type":"article","v_buy":null,"v_buy_i":null,"h_pub":16.0,"h_buy":null,"h_pub_buy":null,"v_cur":0.1,"v_max":0.1,"v_cur_i":0,"v_max_i":0,"events":"event51,event61","top_channel":"world","content_source_type":"Internal","content_source_name":"Internal","author_name":"Brian R

### Page title

In [6]:
title = soup.title.string
title

'Chicago man appears to stream his own shooting on Facebook'

### Author name

In [7]:
author = soup.find_all("span", { "class" : "author_name"})
author

[<span class="author_name">By Brian Ries</span>]

In [8]:
# If author is empty try this:
if author == []:
    author = soup.find_all("span", { "class" : "byline basic"})
author

[<span class="author_name">By Brian Ries</span>]

In [9]:
str(author).split('>')

['[<span class="author_name"', 'By Brian Ries</span', ']']

In [10]:
author = str(author).split('>')[1]
author

'By Brian Ries</span'

In [11]:
author = author.split('By ')[1]
author

'Brian Ries</span'

In [12]:
author = author.split('<')[0]
author

'Brian Ries'
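The chain of `split` calls above works on the string representation of the tag list. As an alternative sketch, the same name can be read from the tag itself with `get_text`. The snippet is copied from the output above, and the variable names `soup_author`, `tag`, and `author_name` are chosen here so they do not overwrite the notebook's own variables.

```python
from bs4 import BeautifulSoup

# Snippet copied from the output above; a real page would be parsed in full.
snippet = '<span class="author_name">By Brian Ries</span>'
soup_author = BeautifulSoup(snippet, 'html.parser')

# find() returns the first matching tag, or None when there is no match.
tag = soup_author.find('span', class_='author_name')
author_name = tag.get_text(strip=True) if tag is not None else ''

# Drop the leading "By " if the byline has one.
if author_name.startswith('By '):
    author_name = author_name[len('By '):]
author_name
```

Checking for `None` keeps the cell from failing on pages that have no such span.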

### Total shares

In [13]:
shares = soup.find_all("div", { "class" : "total-shares"})
shares

[<div class="total-shares" data-index="0">
 <em>1.5k</em>
 <div class="caption">Shares</div>
 </div>]

In [14]:
shares = str(shares).split('<em>')[1].split('</em>')[0]
shares

'1.5k'

In [15]:
if 'k' in shares:
    shares = shares[:-1]
    shares = shares.replace('.', '') + '00'
shares

'1500'
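Appending `'00'` after removing the dot assumes the label always has exactly one digit after the decimal point, as in `'1.5k'`. If a label such as `'7k'` ever appeared, it would become `'700'`. A possible sketch that converts the label numerically instead is shown below; `parse_share_count` is a hypothetical helper, and it assumes a `k` suffix with `.` as the decimal mark.

```python
def parse_share_count(label):
    """Convert a label such as '1.5k' or '762' to an int (hypothetical helper)."""
    label = label.strip().lower()
    if label.endswith('k'):
        # round() guards against float artifacts such as 1.1 * 1000.
        return int(round(float(label[:-1]) * 1000))
    return int(label)


[parse_share_count(s) for s in ['1.5k', '7k', '762']]
```

Returning an `int` rather than a string also makes the value directly usable as a numeric feature.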

### Author webpage

In [16]:
author_web = soup.find_all("a", { "class" : "byline"})
author_web

[<a class="byline " href="/people/moneyries/"><img alt="Headshot_2015_brianries_updatedshot_1" class="author_image" src="http://rack.0.mshcdn.com/media/ZgkyMDE1LzA2LzIyLzBjL0hlYWRzaG90XzIwLjdiMmE5LmpwZwpwCXRodW1iCTkweDkwIwplCWpwZw/e4aef619/c0c/Headshot_2015_BrianRies_UpdatedShot_1.jpg"/><div class="author_and_date"><span class="author_name">By Brian Ries</span><time datetime="Fri, 01 Apr 2016 23:26:01 +0000">2016-04-01 23:26:01 UTC</time></div></a>]

In [17]:
if author_web != []:
    author_web = str(author_web).split('href="')[1]
author_web

'/people/moneyries/"><img alt="Headshot_2015_brianries_updatedshot_1" class="author_image" src="http://rack.0.mshcdn.com/media/ZgkyMDE1LzA2LzIyLzBjL0hlYWRzaG90XzIwLjdiMmE5LmpwZwpwCXRodW1iCTkweDkwIwplCWpwZw/e4aef619/c0c/Headshot_2015_BrianRies_UpdatedShot_1.jpg"/><div class="author_and_date"><span class="author_name">By Brian Ries</span><time datetime="Fri, 01 Apr 2016 23:26:01 +0000">2016-04-01 23:26:01 UTC</time></div></a>]'

In [18]:
if author_web != []:
    author_web = author_web.split('">')[0]
author_web

'/people/moneyries/'

In [19]:
if author_web != []:
    author_web = 'http://mashable.com' + author_web
author_web

'http://mashable.com/people/moneyries/'
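Prefixing `'http://mashable.com'` works for this site-relative link. The standard library's `urllib.parse.urljoin` covers the same case and also leaves absolute links untouched. A small sketch, where `page_url` is the article address without its tracking query string and `author_url` is a name introduced here:

```python
from urllib.parse import urljoin

page_url = 'http://mashable.com/2016/04/01/facebook-live-shooting/'
relative_href = '/people/moneyries/'  # value of the href attribute seen above

# A path starting with '/' is resolved against the site root of page_url.
author_url = urljoin(page_url, relative_href)
author_url
```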

### Video on webpage

In [20]:
# Where the article text starts and where it ends

temp = str(soup)[str(soup).find('class="author_and_date"'):]

temp = temp[:temp.find('class="article-topics"')]

In [45]:
temp

'class="author_and_date"><span class="author_name">By Brian Ries</span><time datetime="Fri, 01 Apr 2016 23:26:01 +0000">2016-04-01 23:26:01 UTC</time></div></a></div>\n</header>\n<section class="article-content blueprint">\n<div class="post-text">\n<p>I guess it was only a matter of time.</p>\n<p>Chicago Police are investigating an extraordinary video that appears to capture the shooting of a man while he was streaming on Facebook Live.</p>\n<h2 class="h2">Watch the video (graphic content):</h2>\n<div>\n<div id="fb-root"></div>\n<div class="fb-video" data-allowfullscreen="1" data-href="/marie.creamer.549/videos/vb.100007259195441/1671424763109481/?type=3"><div class="fb-xfbml-parse-ignore"><blockquote cite="https://www.facebook.com/marie.creamer.549/videos/1671424763109481/">\n<p><a href="https://www.facebook.com/marie.creamer.549/videos/1671424763109481/" target="_blank"></a></p>\n<p>Damn he \U000fe1b5 was on LIVE \U000fe4f9 talking \U000fe193 shit  with no \U000fe351 burner \U000fe4f

In [51]:
# ytp-thumbnail-overlay ytp-cued-thumbnail-overlay

videos = []
index = 0
while index < len(temp):
    index = temp.find('https://www.youtube.com', index)
    if index == -1:
        break
    videos.append(temp[index:index+100].split('"')[0])
    index += 2

index = 0
while index < len(temp):
    index = temp.find('div class="fb-video" data-allowfullscreen="1" data-href=', index)
    if index == -1:
        break
    videos.append('http://facebook.com' + (temp[index:index+1000].split('"><div class="fb')[0].split('data-href="')[1]))
    index += 2
    
videos

['http://facebook.com/marie.creamer.549/videos/vb.100007259195441/1671424763109481/?type=3']

In [52]:
len(videos)

1
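The two `while` loops scan the string by position. The same search can be sketched with regular expressions, which is a different technique from the one used above. The fragment below is modelled on the article HTML printed earlier, and the names `fragment`, `youtube_links`, `facebook_paths`, and `found` are illustrative.

```python
import re

# Fragment modelled on the article HTML shown above.
fragment = ('<div class="fb-video" data-allowfullscreen="1" '
            'data-href="/marie.creamer.549/videos/vb.100007259195441/1671424763109481/?type=3">'
            '<div class="fb-xfbml-parse-ignore"></div></div>')

# YouTube links: everything up to the closing quote, whitespace, or tag start.
youtube_links = re.findall(r'https://www\.youtube\.com[^"\s<]*', fragment)

# Facebook embeds: capture the data-href attribute of each fb-video div.
facebook_paths = re.findall(r'class="fb-video"[^>]*?data-href="([^"]+)"', fragment)

found = youtube_links + ['http://facebook.com' + path for path in facebook_paths]
found
```

Unlike the fixed 100 and 1000 character windows, the patterns stop at the end of the attribute regardless of its length.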

### Get news text

In [23]:
print(soup.get_text())






  window.__o = {"channel":"world","content_type":"article","v_buy":null,"v_buy_i":null,"h_pub":16.0,"h_buy":null,"h_pub_buy":null,"v_cur":0.1,"v_max":0.1,"v_cur_i":0,"v_max_i":0,"events":"event51,event61","top_channel":"world","content_source_type":"Internal","content_source_name":"Internal","author_name":"Brian Ries","age":"0","pub_day":1,"pub_month":4,"pub_year":2016,"pub_date":"04/01/2016","sourced_from":"Internal","isPostView":true,"post_lead_type":"Alt Image Lead","topics":"Chicago,Facebook,facebook live,shooting,World","campaign":null,"display_mode":null,"viral_video_type":null,"b_flag":false};
  window._gaq = window._gaq || [];
  window._gaq.push(['_setAccount', 'UA-92124-1']);
  window._geo = "US";
  window.__domStart = (new Date().getTime())

Chicago man appears to stream his own shooting on Facebook


































{"@context":"http://schema.org","headline":"Chicago man appears to stream his own shooting on Facebook","url":"http://mashable.com/2016/04/0

In [24]:
try:
    text = str(soup.get_text()).split("UTC\n\n\n")[1]
except IndexError:
    text = str(soup.get_text()).split("Analysis\n\n")[1]

text = text.split('Have something to add to this story?')[0]

In [25]:
print(text)


I guess it was only a matter of time.
Chicago Police are investigating an extraordinary video that appears to capture the shooting of a man while he was streaming on Facebook Live.
Watch the video (graphic content):




Damn he 󾆵 was on LIVE 󾓹 talking 󾆓 shit  with no 󾍑 burner 󾓵 and got his 󾆵 life took 󾆳󾆯 smds shit crazy 󾍁󾍀󾍛󾭻 . #LITERALLY
Posted by TcWorld Creamer on Thursday, March 31, 2016


In the clip, which was re-recorded from its original source on Thursday and disseminated widely, a man in a blue Chicago White Sox hat is seen talking into the camera while standing in front of Scott's Convenience Store in Chicago's West Englewood neighborhood.
He jokes that the store is open because he needed "somewhere to duck and hide for cover."

Scott's Convenience Store is seen in the Facebook video and on Google Street View.Image:  google/facebookMoments later, shots ring out and the phone drops to the street, camera up. The apparent assailant, wearing red, then steps into the frame and co
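Splitting the full page text on markers such as `"UTC\n\n\n"` depends on the exact whitespace of the page. Since the article body sits inside `<section class="article-content blueprint">` in the HTML printed earlier, one alternative sketch is to collect only the paragraphs in that section. The fragment below is modelled on that markup, and the names `fragment`, `article`, `paragraphs`, and `body_text` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Fragment modelled on the article markup printed earlier in this notebook.
fragment = '''
<section class="article-content blueprint">
<div class="post-text">
<p>I guess it was only a matter of time.</p>
<p>Chicago Police are investigating an extraordinary video that appears to capture the shooting of a man while he was streaming on Facebook Live.</p>
<h2 class="h2">Watch the video (graphic content):</h2>
</div>
</section>
<script>window._geo = "US";</script>
'''

# class_ matches any element whose class list contains 'article-content'.
article = BeautifulSoup(fragment, 'html.parser').find('section', class_='article-content')

# Keep only <p> elements, which skips scripts, headings, and captions.
paragraphs = [p.get_text(strip=True) for p in article.find_all('p')] if article else []
body_text = '\n'.join(paragraphs)
print(body_text)
```

Whether headings and image captions should be kept is a modelling choice; this sketch drops them.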

## Author information

In [26]:
# Only if author_web != []:
from IPython.display import IFrame
IFrame(author_web, 600, 600)

Binary features indicating whether the author is on Facebook, LinkedIn, Twitter, or Google+.

In [27]:
author_networks = {'facebo': '',
                   'linked': '',
                   'twitte': '',
                   'google': ''}

In [28]:
response = urllib.request.urlopen(author_web)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

### Get all the networks that the author is in

In [29]:
networks = soup.find_all("div", { "class" : "profile-networks"})
networks

[<div class="profile-networks">
 <a class="network-badge network-badge-facebook network-badge-round" href="https://www.facebook.com/brianries" target="_blank"></a>
 <a class="network-badge network-badge-linkedin network-badge-round" href="http://www.linkedin.com/in/briankries" target="_blank"></a>
 <a class="network-badge network-badge-round network-badge-twitter" href="https://twitter.com/moneyries" target="_blank"></a>
 <a class="network-badge network-badge-google network-badge-round" href="https://plus.google.com/+BrianRiesAtMashable?rel=author" target="_blank"></a>
 </div>]

In [30]:
networks = str(networks).replace('network-badge-round', '')
networks

'[<div class="profile-networks">\n<a class="network-badge network-badge-facebook " href="https://www.facebook.com/brianries" target="_blank"></a>\n<a class="network-badge network-badge-linkedin " href="http://www.linkedin.com/in/briankries" target="_blank"></a>\n<a class="network-badge  network-badge-twitter" href="https://twitter.com/moneyries" target="_blank"></a>\n<a class="network-badge network-badge-google " href="https://plus.google.com/+BrianRiesAtMashable?rel=author" target="_blank"></a>\n</div>]'

In [31]:
networks = networks.split('network-badge-')
networks  # Note networks is now a list of strings

['[<div class="profile-networks">\n<a class="network-badge ',
 'facebook " href="https://www.facebook.com/brianries" target="_blank"></a>\n<a class="network-badge ',
 'linkedin " href="http://www.linkedin.com/in/briankries" target="_blank"></a>\n<a class="network-badge  ',
 'twitter" href="https://twitter.com/moneyries" target="_blank"></a>\n<a class="network-badge ',
 'google " href="https://plus.google.com/+BrianRiesAtMashable?rel=author" target="_blank"></a>\n</div>]']

In [32]:
for network in networks:
    if network[:6] in author_networks.keys():
        author_networks[network[:6]] = network.split('href="')[1].split('" target')[0]

In [33]:
author_networks

{'facebo': 'https://www.facebook.com/brianries',
 'google': 'https://plus.google.com/+BrianRiesAtMashable?rel=author',
 'linked': 'http://www.linkedin.com/in/briankries',
 'twitte': 'https://twitter.com/moneyries'}
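The six-character keys (`'facebo'`, `'linked'`, ...) come from matching the start of each CSS class name. Another way to sketch the same idea is to classify each link by its host name and then derive the binary features mentioned earlier. The links are copied from the output above; `known_networks`, `profiles`, and `on_network` are hypothetical names.

```python
from urllib.parse import urlparse

# Links copied from the profile-networks output above.
links = ['https://www.facebook.com/brianries',
         'http://www.linkedin.com/in/briankries',
         'https://twitter.com/moneyries',
         'https://plus.google.com/+BrianRiesAtMashable?rel=author']

# Hypothetical mapping from a domain to a readable network name.
known_networks = {'facebook.com': 'facebook', 'linkedin.com': 'linkedin',
                  'twitter.com': 'twitter', 'plus.google.com': 'google'}

profiles = {name: '' for name in known_networks.values()}
for link in links:
    host = urlparse(link).hostname or ''
    for domain, name in known_networks.items():
        # Accept the bare domain and any subdomain such as 'www.'.
        if host == domain or host.endswith('.' + domain):
            profiles[name] = link

# Binary features: 1 if the author has a profile on the network, else 0.
on_network = {name: int(url != '') for name, url in profiles.items()}
on_network
```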

### Get number of Twitter followers

In [34]:
author_networks['twitter_followers'] = 0
author_networks['twitte']

'https://twitter.com/moneyries'

In [35]:
response = urllib.request.urlopen(author_networks['twitte'])
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

In [36]:
followers = str(soup.find_all("span", { "class" : "ProfileNav-value"})[2])
followers

'<span class="ProfileNav-value" data-is-compact="true">17,5\xa0K</span>'

In [37]:
followers = followers.split('">')[1]
followers

'17,5\xa0K</span>'

In [38]:
if ('K' in followers) or ('mil' in followers):
    followers = followers.split('\xa0')[0]
    if ',' in followers:
        followers = followers.replace(',', '') + '00'
    else:
        followers = followers + '000'
else:
    followers = followers.split('</span')[0].replace('.', '')
followers = int(followers)
followers

17500
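The label `'17,5\xa0K'` uses a comma as the decimal mark and a non-breaking space before the suffix, which suggests the page was served in a non-English locale. The cell above handles one decimal digit; a label with two, such as `'1,25 K'`, would be converted incorrectly. Below is a sketch of a numeric conversion under the assumption that `,` is the decimal mark and `.` the thousands separator. `parse_followers` is a hypothetical helper.

```python
def parse_followers(label):
    """Parse a localized follower label into an int (hypothetical helper).

    Assumes ',' is the decimal mark and '.' the thousands separator,
    which is what the output above suggests.
    """
    # Normalize the non-breaking space so strip() and endswith() behave.
    label = label.replace('\xa0', ' ').strip()
    for suffix in ('K', 'mil'):
        if label.endswith(suffix):
            number = label[:-len(suffix)].strip()
            number = number.replace('.', '').replace(',', '.')
            return int(round(float(number) * 1000))
    # No suffix: only thousands separators need to be removed.
    return int(label.replace('.', ''))


[parse_followers(s) for s in ['17,5\xa0K', '2.992', '3 mil']]
```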

In [39]:
author_networks['twitter_followers'] = followers

## Merge everything into a function

In [40]:
def news_info(url):
    # Download HTML
    response = urllib.request.urlopen(url)  # use the function argument, not the global url_
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    
    # Title, author, text
    title = soup.title.string
    
    author = soup.find_all("span", { "class" : "author_name"})
    # If author is empty try this:
    if author == []:
        author = soup.find_all("span", { "class" : "byline basic"})
    author = str(author).split('>')[1].split('By ')[1].split('<')[0]
    
    # Number of shares
    shares = soup.find_all("div", { "class" : "total-shares"})
    shares = str(shares).split('<em>')[1].split('</em>')[0]
    if 'k' in shares:
        shares = shares[:-1]
        shares = shares.replace('.', '') + '00'
    
    # Get text
    try:
        text = str(soup.get_text()).split("UTC\n\n\n")[1]
    except IndexError:
        text = str(soup.get_text()).split("Analysis\n\n")[1]
        
    text = text.split('Have something to add to this story?')[0]
    
    author_web = soup.find_all("a", { "class" : "byline"})
    if author_web != []:
        author_web = 'http://mashable.com' + str(author_web).split('href="')[1].split('">')[0]
    
        # Author networks
        author_networks = {'facebo': '',
                           'linked': '',
                           'twitte': '',
                           'google': ''}

        response = urllib.request.urlopen(author_web)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')

        networks = str(soup.find_all("div", { "class" : "profile-networks"})).replace('network-badge-round', '').split('network-badge-')

        for network in networks:
            if network[:6] in author_networks.keys():
                author_networks[network[:6]] = network.split('href="')[1].split('" target')[0]

        # Author twitter followers
        author_networks['twitter_followers'] = 0
        if author_networks['twitte'] != '':

            response = urllib.request.urlopen(author_networks['twitte'])
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')

            followers = str(soup.find_all("span", { "class" : "ProfileNav-value"})[2]).split('">')[1]
            if ('K' in followers) or ('mil' in followers):
                followers = followers.split('\xa0')[0]
                if ',' in followers:
                    followers = followers.replace(',', '') + '00'
                else:
                    followers = followers + '000'
            else:
                followers = followers.split('</span')[0].replace('.', '')

            author_networks['twitter_followers'] = int(followers)
        
    else:
        author_networks = {'facebo': '',
                           'linked': '',
                           'twitte': '',
                           'google': '', 
                           'twitter_followers': 0}
        
    return {'title': title, 'author': author, 'shares': shares, 
            'author_web': author_web, 'text':text, 
            'author_networks': author_networks}


In [41]:
url_ = 'http://mashable.com/2016/03/04/the-who-50th-anniversary'
news_info(url_)

{'author': 'Yohana Desta',
 'author_networks': {'facebo': '',
  'google': 'https://plus.google.com/102665910331841447889?rel=author',
  'linked': '',
  'twitte': 'https://twitter.com/YohanaDesta',
  'twitter_followers': 2992},
 'author_web': 'http://mashable.com/people/yohana-desta/',
 'shares': '762',
 'text': '\nFifty years later, the Who can still rule a crowd.\nRoger Daltrey’s still got the howl and Pete Townshend can make the entirety of Madison Square Garden lose its mind over a swivel of his right arm. The band, currently on its 50th anniversary tour, packed Manhattan’s famed stadium on Wednesday night with thousands of fans eager to sing along to hits like “My Generation” and "Won\'t Get Fooled Again."\xa0\nThe show was even more celebratory because it was rescheduled from last fall, when Daltrey was diagnosed with viral meningitis. At the end of the show, he thanked the crowd for sticking around while he was "really having a tough time with the reaper."\n"Here were are, still 

In [42]:
url_ = 'http://mashable.com/2016/03/08/scotland-giant-rabbit-home'
news_info(url_)

{'author': 'Davina Merchant',
 'author_networks': {'facebo': '',
  'google': 'https://plus.google.com/105525238342980116477?rel=author',
  'linked': '',
  'twitte': '',
  'twitter_followers': 0},
 'author_web': 'http://mashable.com/people/568bdab3519840193100211f/',
 'shares': '7000',
 'text': 'LONDON - Last month we reported on a dog-sized rabbit in desperate need of a new home — now just under a month later Atlas the rabbit now has a permanent home. \nAfter his story went global people from all over the world, including the U.S., Canada and France started reaching out to the Scottish Society for the Prevention of Cruelty to Animals to re-home the rabbit.   \nThanks to Jen Hislop from Ayrshire, the adorable bunny will get to stay in his native Scotland. \n\nHe even has a new buggy.Image: Facebook Scottish SPCAJen, a financial fraud investigator, told the charity, "I burst into tears when I got the phone call saying I had been chosen to re-home Atlas and I cried again when I collected 

In [43]:
url_ = 'http://mashable.com/2016/03/08/15-skills-digital-marketers'
news_info(url_)

{'author': 'Scott Gerber',
 'author_networks': {'facebo': '',
  'google': '',
  'linked': '',
  'twitte': '',
  'twitter_followers': 0},
 'author_web': [],
 'shares': '5000',
 'text': 'Today\'s digital marketing experts must have a diverse skill set, including a sophisticated grasp of available media channels, the ability to identify up-and-coming opportunities, on top of having the basic skills of a brilliant marketer. What\'s more, they have to possess a balance of critical and creative thinking skills in order to drive measurable success for their company.\nThat\'s why I asked 15 members of Young Entrepreneur Council (YEC) what they look for when hiring digital marketers. Their best answers are below.\n1. Paid social media advertising expertise\n\nA new digital marketing hire should be well-versed in paid social media advertising, especially through Facebook or a similar social platform that our company uses regularly. They need to be able to understand and implement Facebook analyt

In [44]:
url_ = 'http://mashable.com/2016/03/31/donald-trump-gaslighting-women'
news_info(url_)

{'author': 'Rebecca Ruiz',
 'author_networks': {'facebo': '',
  'google': '',
  'linked': '',
  'twitte': 'https://twitter.com/rebecca_ruiz',
  'twitter_followers': 3736},
 'author_web': 'http://mashable.com/people/rebecca-ruiz/',
 'shares': '1900',
 'text': 'Analysis\n\n\n\n\n\nThere is a reason that\xa0Donald Trump\'s\xa0outrageous statements and behavior feel familiar to many women.\xa0\nIt\'s not because they know his declarative style and trademark shrug from reality television or political debates. Nor is it because his outsized role in American business made an unforgettable impression on them.\nSEE ALSO: As Donald Trump targets women, Republicans say he could cost the party everything\nThe eerie familiarity is more personal than that. They know Trump because they\'ve encountered a man like him at home, work, on social media or in a relationship.\xa0\nThis man extols the virtues of women, but has no problem reducing them to sex objects. He casts himself as unflappable, but blame
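To run the function over many pages, as the introduction suggests, it helps to keep one failing page from stopping the whole run, since the chained `split(...)[1]` calls raise `IndexError` when a page's markup differs. The following is a minimal sketch; `collect` is a hypothetical helper that takes the scraping function as an argument.

```python
def collect(urls, scraper):
    """Apply `scraper` to each URL, keeping failures separate (hypothetical helper)."""
    results, errors = [], {}
    for url in urls:
        try:
            record = scraper(url)
            record['url'] = url  # remember which page the record came from
            results.append(record)
        except Exception as exc:  # one bad page should not stop the whole run
            errors[url] = repr(exc)
    return results, errors
```

A call such as `results, errors = collect(list_of_urls, news_info)` would return the parsed records together with a dictionary of the pages that failed, where `list_of_urls` stands for any list of article addresses.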