# AutoExtract articleBodyHtml example

The [AutoExtract API](https://scrapinghub.com/autoextract) is a service for 
automatically extracting information from web content. This notebook
shows how to extract the body of an article automatically, focusing
on the features offered by the `articleBodyHtml` attribute.

The `articleBodyHtml` attribute **returns**
a clean version of the article content where **all irrelevant content has been removed**
(framing, ads, links to content not directly related to the article, call-to-action elements, etc.)
and where the resulting **HTML is simplified and normalized** in such a way
that it is **consistent across content from different sites**.

The resulting HTML offers great flexibility to:
* Apply custom and consistent styling to content from different sites
* Pick which content elements to show or hide, or even rearrange the elements in the article

AutoExtract relies on machine learning models and is able to detect elements like figure captions or block quotes even if they were not annotated with the proper HTML tag, bringing
normalization one step further.

> **Recommendation:** For a better viewing experience, execute this notebook cell by cell.

Before starting, let's import some stuff that will be needed:

In [None]:
import os
import re
import json
from itertools import chain
from autoextract.sync import request_batch
from IPython.display import HTML
from parsel import Selector
import html_text

The Scrapinghub client library ``scrapinghub-autoextract`` provides access to the Article
Extraction API from Python. A key is required to access the service. You can obtain one
on [this page](https://scrapinghub.com/autoextract). The client library will look
for this key in the environment variable ``SCRAPINGHUB_AUTOEXTRACT_KEY``, but **you can
also set it in the variable `AUTOEXTRACT_KEY` below and then evaluate the cell**.

In [None]:
# Set your AutoExtract key in the variable below
AUTOEXTRACT_KEY = ""

if AUTOEXTRACT_KEY:
    os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = AUTOEXTRACT_KEY
if not os.environ.get('SCRAPINGHUB_AUTOEXTRACT_KEY'):
    raise Exception("Please fill in the variable 'AUTOEXTRACT_KEY' above with your "
                    "AutoExtract key")

The function [``request_batch``](https://github.com/scrapinghub/scrapinghub-autoextract#synchronous-api) 
is the entry point to the AutoExtract API used in this notebook. For convenience, let's define the function ``autoextract_article``
as:  

In [None]:
def autoextract_article(url):
    return request_batch([url], page_type='article')[0]['article']
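
A failed extraction does not necessarily raise an exception on the client side. The sketch below shows one way to guard against that. It assumes that a failed result carries an `error` key instead of an `article` key, which is an assumption about the response shape that should be checked against the API documentation. `article_or_raise` is a hypothetical helper name and the sample dictionaries are made up.

```python
def article_or_raise(result):
    """Return the 'article' payload of one AutoExtract result.

    Assumption: a failed result contains an 'error' key with a message.
    """
    if "error" in result:
        raise RuntimeError(f"Extraction failed: {result['error']}")
    if "article" not in result:
        raise KeyError("Result has neither 'article' nor 'error'")
    return result["article"]


# Made-up results, shaped like the items that request_batch returns above
ok_result = {"article": {"articleBodyHtml": "<article><p>Hi</p></article>"}}
failed_result = {"error": "Downloader error: http404"}

print(article_or_raise(ok_result)["articleBodyHtml"])
```

With real data, the helper could wrap the first item of the list returned by `request_batch` instead of indexing `['article']` directly.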

Among the [attributes returned by AutoExtract](https://doc.scrapinghub.com/autoextract.html#article-extraction),
this notebook focuses on the attribute ``articleBodyHtml``, which contains the simplified, 
normalized and cleaned-up article content as HTML.

Let's see an extraction example for [this page](https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-let-the-aflw-thrive-and-prosper/):

In [None]:
sport_article = autoextract_article(
    "https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-"
    "let-the-aflw-thrive-and-prosper/")
HTML(sport_article['articleBodyHtml'])

Note how only the relevant content of the article was extracted, avoiding elements
like ads, unrelated content, etc. AutoExtract relies on advanced machine learning
models that are able to discriminate between what is relevant and what is not.  

Also note how figures with captions were extracted. Many 
[other elements can also be present](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field). 
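
To get a feel for which elements a given extraction contains, it can help to count the tags in the returned HTML. The following is a minimal sketch that uses only the Python standard library so that it runs without extra dependencies. `TagCounter` and `count_tags` are hypothetical names and the HTML fragment is made up.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag appears in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up fragment; with real data pass sport_article['articleBodyHtml']
sample_html = """
<article>
    <h2>Title</h2>
    <p>First paragraph</p>
    <figure><img src="a.jpg"><figcaption>A caption</figcaption></figure>
    <p>Second paragraph</p>
</article>
"""
tag_counts = count_tags(sample_html)
print(tag_counts.most_common())
```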

## Styling

Having normalized HTML code has some cool advantages. One is that the content
can be formatted independently of the original style with simple CSS rules.
That means that the same consistent formatting can be applied even if the content is coming
from very different pages with different formats.  

AutoExtract encapsulates the `articleBodyHtml` content within ``article`` tags. For example:
```html
<article>
    <p>This is a simple article</p>
</article>
```

For convenience, we are going to wrap the content in a `div` with the class `beauty`. This way we will be able to apply our custom styling only to `div` tags with this class. 
The function `show` will take care of that:  

In [None]:
def show(article):
    return HTML(f"""
        <div class=beauty>
            {article['articleBodyHtml']}
        </div>""")

Now let's create some CSS style rules to be applied for the `beauty` class:  

In [None]:
style = """
<style>
    .beauty {
        font-family: 'Benton Sans', Sans-Serif;
        line-height: 23px;
        font-size: 17.008px;
        font-style: normal;
        background-color: #F9F9F9;
        padding: 20px;
        border: 0.063rem dotted #D0D0D0;
    }
    .beauty h2, .beauty h3, .beauty h4, .beauty h5, .beauty h6 {
        font-family: Majerit, serif;
        font-weight: 700;
    }
    .beauty p {
        margin-bottom: 10px;
        color: #444;
    }
    .beauty dl { margin-top: 30px; }
    .beauty dd { margin-left: 20px; }
    .beauty figure {
        display: table;
        margin: 0 auto;
    }
    .beauty figure img {
      width: 100%;
      height: auto;
    }
    .beauty figcaption {
        display: table-caption;
        caption-side: bottom;
        border-bottom: 0.063rem dotted #D0D0D0;
        margin-bottom: 10px;
        line-height: 22px;
        font-size: 13px;
        color: #646464;
        text-align: center;
    }
    .beauty figcaption * {
        text-align: center;
        font-size: 13px;
        color: #646464;
    }
    .beauty figcaption p { margin-bottom: 0px;}
</style>
"""
HTML(style)

Let's show the article again. It looks better, doesn't it? And best of all, this style (with a little more work) would apply consistently to content from different websites.

In [None]:
show(sport_article)

## Tweets and other embeddings

Have a look at the [following page](https://www.geekwire.com/2019/tesla-shares-slump-sec-accuses-ceo-elon-musk-violating-tweet-deal/):

In [None]:
musk_article = autoextract_article(
    "https://www.geekwire.com/2019/tesla-shares-slump-sec-accuses-ceo-elon-"
    "musk-violating-tweet-deal/")
show(musk_article)

The page is full of tweets, but they are not rendered in the usual format seen on web pages. 
But don't worry. Everything is ready to get them formatted: all we have to do is include
the [Twitter widgets JavaScript library](https://developer.twitter.com/en/docs/twitter-for-websites/javascript-api/guides/set-up-twitter-for-websites)
in the page. Let's do it: 

In [None]:
twitter_js = """<script async src="https://platform.twitter.com/widgets.js" charset="utf-8">
                </script>"""
HTML(twitter_js)

Now the tweets in the article are nicely formatted. Facebook and Instagram content
can also be formatted by [including their JavaScript libraries](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field).    

But not only that. Other `iframe`-based multimedia content like videos, podcasts, maps, etc. 
will also be present and functional in the `articleBodyHtml` attribute.  
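
Before deciding which JavaScript libraries to include, one might first check which kinds of embeds an article contains. The sketch below looks for marker CSS classes in the HTML. `twitter-tweet` is the class that this notebook queries later on; the Instagram and Facebook class names are assumptions based on the usual embed markup of those networks and are not confirmed by this notebook. `detect_embeds` is a hypothetical helper and the sample HTML is made up.

```python
import re

# Marker classes per network. Only 'twitter-tweet' appears in this notebook;
# the other two are assumptions about the usual embed markup.
EMBED_MARKERS = {
    "twitter": "twitter-tweet",
    "instagram": "instagram-media",
    "facebook": "fb-post",
}


def detect_embeds(html):
    """Return the sorted names of networks whose marker class appears in the HTML."""
    found = set()
    for class_attr in re.findall(r'class\s*=\s*["\']([^"\']*)["\']', html):
        classes = class_attr.split()
        for network, marker in EMBED_MARKERS.items():
            if marker in classes:
                found.add(network)
    return sorted(found)


sample_html = """
<article>
    <p>Some text</p>
    <blockquote class="twitter-tweet"><a href="https://twitter.com/x/status/1"></a></blockquote>
</article>
"""
print(detect_embeds(sample_html))
```

A regular expression is enough for this quick check; for anything more precise, a CSS query with `parsel` (as used later in this notebook) is the more robust choice.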

## Cherry picking

Another advantage of having a normalized structure is that we can pick only the parts
we are interested in.

In the following example, we are going to pick just the images
from [this article](https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-website-photos-200th-anniversary)
with their corresponding captions to compose an array of images. 

In [None]:
queen_article = autoextract_article(
    "https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-"
    "website-photos-200th-anniversary")

In [None]:
sel = Selector(queen_article['articleBodyHtml'])
images = [{'img_url': fig.xpath(".//img/@src").get(),
           'caption': html_text.selector_to_text(fig.xpath("(.//figcaption)"))} 
          for fig in sel.xpath("//figure")]
print(json.dumps(images, indent=4))

The [parsel](https://github.com/scrapy/parsel) and [html-text](https://github.com/TeamHG-Memex/html-text)
libraries were used as helpers for the task. `parsel` makes it possible to query the content using
XPath and CSS expressions, and `html-text` converts HTML content to raw text.    

Note that the source code of the page in question contains no `figcaption`
tag: AutoExtract's machine learning capabilities can detect that a particular
section of the page is really a figure caption even if it was not annotated with the right
HTML tag. Such intelligence is also applied to other elements like `blockquote`. 

Let's go further. We are now going to compose a summary page that also 
includes independent sections for figures and tweets. It is really easy to cherry pick 
such elements from `articleBodyHtml`. Let's see it applied to the Musk page: 

In [None]:
sel = Selector(musk_article['articleBodyHtml']) 
only_tweets = sel.css(".twitter-tweet")
only_figures = sel.css("figure")
HTML(
    f"""
    <article class='beauty'>
        <h2>{musk_article['headline']}</h2>
        <dl>
            <dt>Author</dt>       <dd>{musk_article['author']}</dd>
            <dt>Published</dt>    <dd>{musk_article['datePublished'][:10]}</dd>
            <dt>Time to read</dt> <dd>{len(musk_article['articleBody'].split()) / 130:.1f}
                                      minutes
                                  </dd>
        </dl>
        <h3>First paragraph</h3>
        {sel.css("article > p").get()}
        <h3>Tweets ({len(only_tweets)})</h3>
        {"".join(only_tweets.getall())}
        <h3>Figures ({len(only_figures)})</h3>
        {"".join(only_figures.getall())}
    </article>
    {twitter_js}
    """
)
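
The "Time to read" value above is simply the word count divided by a reading speed of 130 words per minute. As a small sketch, the same arithmetic can be pulled into a function; `reading_time_minutes` is a hypothetical name and the sample text is made up.

```python
def reading_time_minutes(text, words_per_minute=130):
    """Estimate reading time as word count divided by reading speed."""
    return len(text.split()) / words_per_minute


sample_text = "word " * 260   # made-up text with 260 words
print(f"{reading_time_minutes(sample_text):.1f} minutes")
```

With real data, `reading_time_minutes(musk_article['articleBody'])` would give the same number that is formatted in the summary.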

The **normalized HTML thus brings the flexibility to adapt the article content to your
own purposes**: you might decide to exclude figure captions, to exclude multimedia content from 
`iframes`, or to show figures in a separate carousel, for example.

Heading levels are also normalized. This can be handy to automatically extract 
a table of contents from `articleBodyHtml`. The function `print_toc` presented below
prints the table of contents of an article extracted by AutoExtract.

In [None]:
def print_toc(html):  
    for section in Selector(html).css("h2,h3,h4,h5,h6"):
        level = int(section.root.tag[-1]) - 2
        print(f"{'   ' * level}{section.css('::text').get()}")

Let's try it with [this article](http://cs231n.github.io/neural-networks-1/):

In [None]:
article_toc = autoextract_article("http://cs231n.github.io/neural-networks-1/")        
print_toc(article_toc['articleBodyHtml'])
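
Headings that contain inline markup are split into several text nodes, and `.get()` returns only the first match, so `print_toc` may truncate such headings. The following sketch is one alternative that collects the full text of every heading and returns the entries as data instead of printing them. It uses only the standard library; `HeadingCollector` and `build_toc` are hypothetical names and the sample HTML is made up.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h2..h6, with h2 as level 0."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading being read, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1]) - 2
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            # Join all text nodes and collapse whitespace
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def build_toc(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


sample_html = """
<article>
    <h2>Modeling one neuron</h2>
    <h3>Biological <em>motivation</em></h3>
    <h3>Activation functions</h3>
    <h2>Summary</h2>
</article>
"""
for level, text in build_toc(sample_html):
    print(f"{'   ' * level}{text}")
```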

### Including figure captions in the text body

The textual attribute `articleBody` does not include any text from figure
elements (i.e. figure captions) by default. This is generally desired because images cannot
be included in raw text and showing a caption without its figure is confusing for readers.

But sometimes the body text is used as the input for some analysis algorithm. 
For example, you could be grouping articles by similarity using the simple technique of 
K Nearest Neighbors. Or you could even be feeding very advanced deep learning
models for NLP.

In all these cases you might want to have the textual information for figure captions included. It is very easy to do. Let's do it for the sport article:

In [None]:
# Converting `articleBodyHtml` into text is enough to have figure captions included
sport_text_with_captions = html_text.selector_to_text(
    Selector(sport_article['articleBodyHtml']))

print("Without captions:")
print("-----------------")
print(sport_article['articleBody'][500:800])
print("\nWith captions:")
print("---------------")
print(sport_text_with_captions[500:800])
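
To illustrate the kind of analysis mentioned above, here is a minimal sketch of a similarity measure between two texts: cosine similarity over plain word counts, which is one possible building block for nearest-neighbor grouping. It is not part of AutoExtract; the function names are hypothetical and the sample texts are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts of a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up texts: the second body includes a figure caption, the first does not
body = "The league postponed the season opener"
body_with_caption = body + " Players training before the opener"
other = "Tesla shares slump after the tweet"

print(cosine_similarity(body_with_caption, body))
print(cosine_similarity(body_with_caption, other))
```

With real data, `sport_text_with_captions` and the `articleBody` of other articles could be passed in place of the sample strings.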

### Removing pull quotes
[Pull quotes](https://en.wikipedia.org/wiki/Pull_quote) are used very often in
articles these days. A pull quote is an **excerpt** of the article content **which is repeated**
within the article **but highlighted** with a different format (e.g. appearing in its own box and using a bigger font).
A couple of examples
can be seen on [this page](https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic). 

Pull quotes are a nice formatting element, but it might be better to strip them out when converting the document to plain text: formatting is lost in raw text, 
so pull quotes become repeated content that distracts the reader rather than helping. The attribute `articleBody` already contains a text version of the article, but pull quotes are not removed there. In the following example, we are
going to convert the article to raw text while excluding all pull quotes.

Note that AutoExtract detects quotes using machine learning techniques and returns
them in `articleBodyHtml` under `blockquote` tags. 

In [None]:
chris_article = autoextract_article("https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic")

In [None]:
def drop_elements(selectors):
    """ Drops HTML subtrees for given selectors """
    for element in selectors:
        tree = element.root
        if tree.getparent() is not None:
            tree.drop_tree()

# First let's get the text of the article without any quote. 
# We'll search over it to detect which quotes are pull quotes.
sel = Selector(chris_article['articleBodyHtml'])
drop_elements(sel.css("blockquote"))
text_without_quotes = html_text.selector_to_text(sel)

# Some quotes change the case or add quotation mark characters. 
# Normalizing both texts helps with the matching
def normalized(text):
    return re.sub(r'["“”]', '', ' '.join(text.split()).lower())

# Now let's iterate over all `blockquote` tags
sel = Selector(chris_article['articleBodyHtml'])
pull_quotes = []
for quote in sel.css("blockquote"):
    # bq_text contains the quote text
    bq_text = html_text.selector_to_text(quote)
    # The quote is a pull quote if the quote text was already in the text without quotes
    if normalized(bq_text) in normalized(text_without_quotes):        
        pull_quotes.append(quote)
        
# Let's show found pull quotes
print(f"Found {len(pull_quotes)} pull quotes from {len(sel.css('blockquote'))} "
       "source quotes:\n")
for idx, quote in enumerate(pull_quotes):
    print(f"Pull quote {idx}:")
    print("------------------")
    print(html_text.selector_to_text(quote))
    print()
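
The matching rule used above can be tried on plain strings, without any HTML involved. The sketch below repeats the same idea (normalize both sides, then test for containment) on made-up data; `normalize` and `find_pull_quotes` are hypothetical names.

```python
import re


def normalize(text):
    """Lowercase, collapse whitespace and drop straight/curly double quotes."""
    return re.sub(r'["“”]', "", " ".join(text.split()).lower())


def find_pull_quotes(quotes, body_text):
    """Return the quotes whose normalized text already appears in the body."""
    body = normalize(body_text)
    return [quote for quote in quotes if normalize(quote) in body]


# Made-up data: the first quote repeats the body, the second does not
body_text = "He said he had not heard back. The company   declined to comment."
quotes = ["“The company declined to comment.”", "An unrelated quotation"]

print(find_pull_quotes(quotes, body_text))
```

Note that a quote that normalizes to an empty string is contained in any body, so real code may want to skip empty quotes before matching.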

Finally we can obtain the full text but with pull quotes stripped out:

In [None]:
# Also removing figures, as you will probably want them removed too
drop_elements(chain(pull_quotes, sel.css("figure")))
cleaned_text = html_text.selector_to_text(sel)

# Printing first 500 characters of the clean text
print(cleaned_text[:500])

Let's verify that we have removed the duplicated text:

In [None]:
def count(needle, haystack):
    # Plain substring count: the needle is literal text, not a regular expression
    return haystack.count(needle)

pquote_excerpt = "haven’t heard from Mark"
cases_before = count(pquote_excerpt, chris_article['articleBodyHtml'])
cases_after = count(pquote_excerpt, cleaned_text)
print(f"Occurrences before: {cases_before} and after the clean up: {cases_after}")

## Try it yourself

Now it is time to try it yourself. Set the `url` variable below and execute the cell
to see the results of AutoExtract on it:

In [None]:
url = "https://www.vox.com/policy-and-politics/2020/1/17/21046874/netherlands-universal-health-insurance-private"

article = autoextract_article(url)
show(article)