{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# AutoExtract articleBodyHtml example\n", "\n", "The [AutoExtract API](https://scrapinghub.com/autoextract) is a service for \n", "automatically extracting information from web content. This notebook\n", "shows how is it possible to extract article body content\n", "from articles automatically and specifically it focuses on the features\n", "offered by the attribute `articleBodyHtml`. \n", "\n", "`articleBodyHtml` attribute **returns**\n", "a clean version of the article content where **all irrelevant stuff has been removed**\n", "(framing, ads, links to content no directly related with article, call to actions elements, etc)\n", "and where the resultant **HTML is simplified and normalized** in such a way\n", "that it is **consistent across content from different sites**.\n", "\n", "Resultant HTML offers a great flexibility to:\n", "* Apply custom and consistent styling to content from different sites\n", "* Pick which content elements to show or hide or even rearange the elements in the article\n", "\n", "AutoExtract is relying in machine learning models and is able to detect elements like figure captions or block quotes even if they were not annotated with the proper HTML tag, bringing\n", "normalization one step further.\n", "\n", "> **Recomendation:** For a better viewing experience execute this notebook cell by cell.\n", "\n", "Before starting, let's import some stuff that will be needed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "import os\n", "import re\n", "import json\n", "from itertools import chain\n", "from autoextract.sync import request_batch\n", "from IPython.core.display import HTML\n", "from parsel import Selector\n", "import html_text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scrapinghub client library ``scrapinghub-autoextract`` brings access to the Articles \n", "Extraction API in Python. A key is required to access the service. You can obtain one\n", "at [in this page](https://scrapinghub.com/autoextract). The client library will look\n", "for this key in the environmental variable ``SCRAPINGHUB_AUTOEXTRACT_KEY`` but **you can\n", "also set it in the variable `AUTOEXTRACT_KEY` below and then evaluate the cell**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "# Set in the variable below your AutoExtract key\n", "AUTOEXTRACT_KEY = \"\"\n", "\n", "if AUTOEXTRACT_KEY:\n", " os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = AUTOEXTRACT_KEY\n", "if not os.environ.get('SCRAPINGHUB_AUTOEXTRACT_KEY'):\n", " raise Exception(\"Please, fill the variable 'AUTOEXTRACT_KEY above with your \"\n", " \"AutoExtract key\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "The method [``request_raw``](https://github.com/scrapinghub/scrapinghub-autoextract#synchronous-api) \n", "is the entrypoint to AutoExtract API. Let's define the method ``autoextract_article`` for convenience \n", "as: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "def autoextract_article(url):\n", " return request_batch([url], page_type='article')[0]['article']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Between the [attributes returned by AutoExtract](https://doc.scrapinghub.com/autoextract.html#article-extraction)\n", "this notebook will focus in the attribute ``articleBodyHtml``, which contains the simplified, \n", "normalized and cleaned up article content in HTML code.\n", "\n", "Let's see an extraction example for [this page](https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-let-the-aflw-thrive-and-prosper/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "sport_article = autoextract_article(\n", " \"https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-\"\n", " \"let-the-aflw-thrive-and-prosper/\")\n", "HTML(sport_article['articleBodyHtml'])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "Note how only the relevant content of the article was extracted, avoiding elements\n", "like ads, unrelated content, etc. AutoExtract relies in advanced machine learning\n", "models that are able to discriminate between what is relevant and what is not. \n", "\n", "Also note how figures with captions was extracted. Many \n", "[other elements can be also present](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field). \n", "\n", "## Styling\n", "\n", "Having normalized HTML code has some cool advantages. One is that the content\n", "can be formatted independently of the original style with simple CSS rules.\n", "That means that the same consistent formatting can be applied even if the content is coming\n", "from very different pages with different formats. \n", "\n", "AutoExtract encapsulates the `articleBodyHtml` content within ``article`` tags. For example:\n", "```html\n", "
\n", "

This is a simple article

\n", "
\n", "```\n", "\n", "For convenience, we are going to encapsulate the content within a `div` with the class `beauty`. This way we will be able to apply our custom styling only to `div` tags with this mark. \n", "The method `show` will take care of that: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "def show(article):\n", " return HTML(f\"\"\"\n", "
\n", " {article['articleBodyHtml']}\n", "
\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Now let's create some CSS style rules to be applied for the `beauty` class: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "style = \"\"\"\n", "\n", "\"\"\"\n", "HTML(style)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "Let's show the article again. It looks better, isn't it? And the best is that this style (with a little bit more of work) would work consistently across content from different websites." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "show(sport_article)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "## Tweets and other embeddings\n", "\n", "Have a look to the [following page](https://www.geekwire.com/2019/tesla-shares-slump-sec-accuses-ceo-elon-musk-violating-tweet-deal/):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "musk_article = autoextract_article(\n", " \"https://www.geekwire.com/2019/tesla-shares-slump-sec-accuses-ceo-elon-\"\n", " \"musk-violating-tweet-deal/\")\n", "show(musk_article)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The page is full of tweets, but the format is not the usual one seen in pages. \n", "But don't worry. Everything is ready to get them formatted, all we have to do is to include\n", "the [Twitter widgets javascript library](https://developer.twitter.com/en/docs/twitter-for-websites/javascript-api/guides/set-up-twitter-for-websites)\n", "into the page. Let's to do it: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "twitter_js = \"\"\"\"\"\"\n", "HTML(twitter_js)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the tweets in the article are nicely formatted. Facebook and Instagram content\n", "can also get formatted by [including its javascript libraries](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field). \n", "\n", "But not only that. Other `iframe` based multimedia content like videos, podcasts, maps, etc \n", "will also be present and functional in the `articleBodyHtml` attribute. " ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "## Cherry picking\n", "\n", "Another advantage of having a normalized structure is that we can pick only the parts\n", "we are in interested in.\n", "\n", "In the following example, we are going to just pick the images\n", "from [this article](https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-website-photos-200th-anniversary)\n", "with its corresponding caption to compose an images array. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "queen_article = autoextract_article(\n", " \"https://www.theguardian.com/uk-news/2019/aug/23/prince-albert-passions-digitised-\"\n", " \"website-photos-200th-anniversary\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "sel = Selector(queen_article['articleBodyHtml'])\n", "images = [{'img_url': fig.xpath(\".//img/@src\").get(),\n", " 'caption': html_text.selector_to_text(fig.xpath(\"(.//figcaption)\"))} \n", " for fig in sel.xpath(\"//figure\")]\n", "print(json.dumps(images, indent=4))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "[parsel](https://github.com/scrapy/parsel) and [html-text](https://github.com/TeamHG-Memex/html-text)\n", "libraries were used as helpers for the task. `parsel` makes possible to query the content using\n", "XPath and CSS expressions and `html-text` converts HTML content to raw text. \n", "\n", "Note that in the source code of the page in question there is not any `figcaption`\n", "tag: AutoExtract machine learning capabilities can detect that a particular\n", "section of the page is really a figure caption even if it was not annotated with the right\n", "HTML tag. Such intelligence is also applied to other elements like `blockquote`. \n", "\n", "Let's go further. We are now going to compose a summary page that also \n", "includes independent sections for figures and tweets. It is really easy to cherry pick \n", "such elements from `articleBodyHtml`. Let's see it applied to the Musk page: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "sel = Selector(musk_article['articleBodyHtml']) \n", "only_tweets = sel.css(\".twitter-tweet\")\n", "only_figures = sel.css(\"figure\")\n", "HTML(\n", " f\"\"\"\n", "
\n", "

{musk_article['headline']}

\n", "
\n", "
Author
{musk_article['author']}
\n", "
Published
{musk_article['datePublished'][:10]}
\n", "
Time to read
{len(musk_article['articleBody'].split()) / 130:.1f}\n", " minutes\n", "
\n", "
\n", "

First paragraph

\n", " {sel.css(\"article > p\").get()}\n", "

Tweets ({len(only_tweets)})

\n", " {\"\".join(only_tweets.getall())}\n", "

Figures ({len(only_figures)})

\n", " {\"\".join(only_figures.getall())}\n", "
\n", " {twitter_js}\n", " \"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **normalized HTML brings thus flexibility to adapt the article content to your\n", "own purposes**: you might decide to exclude figure captions, or to exclude multimedia content from \n", "`iframes`, or show figures in a separated carousel for example.\n", "\n", "Heading levels are also normalized. It can be handy to automatically extract \n", "\"table of contents\" for `articleBodyHtml`. The function `print_toc` presented below\n", "print the table of content of an article extracted by AutoExtract." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "def print_toc(html): \n", " for section in Selector(html).css(\"h2,h3,h4,h5,h6\"):\n", " level = int(section.root.tag[-1]) - 2\n", " print(f\"{' ' * level}{section.css('::text').get()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try it with [this article](http://cs231n.github.io/neural-networks-1/):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "article_toc = autoextract_article(\"http://cs231n.github.io/neural-networks-1/\") \n", "print_toc(article_toc['articleBodyHtml'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Including figure captions in the text body\n", "\n", "The textual attribute `articleBody` is not including any text from figure\n", "elements (i.e. figure captions) by default. This is generally desired because images cannot\n", "be included in raw text and showing a caption without its figure is disturbing for humans.\n", "\n", "But sometimes the body textual information is used as the input for some analysis algorithm. \n", "For example you could be grouping articles by similarity using the simple technique of \n", "K Nearest Neighbors. Or even you can be feeding very advance neural networks using \n", "deep learning models for NLP.\n", "\n", "In all these cases you might want to have the textual information for figure captions included. It is very easy to do. Let's do it for the sport article:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Converting `articleBodyHtml` into text is enough to have figure captions included\n", "sport_text_with_captions = html_text.selector_to_text(\n", " Selector(sport_article['articleBodyHtml']))\n", "\n", "print(\"Without captions:\")\n", "print(\"-----------------\")\n", "print(sport_article['articleBody'][500:800])\n", "print(\"\\nWith captions:\")\n", "print(\"---------------\")\n", "print(sport_text_with_captions[500:800])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Removing pull quotes\n", "[Pull quotes](https://en.wikipedia.org/wiki/Pull_quote) are being used very often in\n", "articles these days. A pull quote is an **excerpt** of the article content **which is repeated**\n", "within the article **but highlighted** with a different format (i.e appearing in its own box and using a bigger font).\n", "A pair of examples\n", "can be seen on [this page](https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic). \n", "\n", "Pull quotes are a nice formatting element, but it might be better to strip them out if we are converting the document to plain text because having repeated content should be avoided here: formatting is lost in raw text \n", "and therefore pull quotes are not useful but disturbing for the reader. The attribute `articleBody` already contains a text version of the article, but pull quotes are not removed there. In the following example, we are\n", "going to convert the article to raw text but excluding all pull quotes.\n", "\n", "Note that AutoExtract detects quotes using machine learning techniques and returns\n", "them in `articleBodyHtml` under `blockquote` tags. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "chris_article = autoextract_article(\"https://www.vox.com/the-highlight/2020/1/15/20863236/chris-hughes-break-up-facebook-economic-security-basic-income-new-republic\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "def drop_elements(selectors):\n", " \"\"\" Drops HTML subtrees for given selectors \"\"\"\n", " for element in selectors:\n", " tree = element.root\n", " if tree.getparent() is not None:\n", " tree.drop_tree()\n", "\n", "# First let's get the text of the article without any quote. \n", "# We'll search over it to detect which quotes are pull quotes.\n", "sel = Selector(chris_article['articleBodyHtml'])\n", "drop_elements(sel.css(\"blockquote\"))\n", "text_without_quotes = html_text.selector_to_text(sel)\n", "\n", "# Some quotes can change the case, or add some '\"\"' characters. \n", "# Using some normalization helps with the matching\n", "normalized = lambda text: re.sub(r'\"|“|”|', '', ' '.join(text.split()).lower().strip())\n", "\n", "# Now let's iterate over all `blockquote` tags\n", "sel = Selector(chris_article['articleBodyHtml'])\n", "pull_quotes = []\n", "for quote in sel.css(\"blockquote\"):\n", " # bq_text contains the quote text\n", " bq_text = html_text.selector_to_text(quote)\n", " # The quote is a pull quote if the quote text was already in the text without quotes\n", " if normalized(bq_text) in normalized(text_without_quotes): \n", " pull_quotes.append(quote)\n", " \n", "# Let's show found pull quotes\n", "print(f\"Found {len(pull_quotes)} pull quotes from {len(sel.css('blockquote'))} \"\n", " \"source quotes:\\n\")\n", "for idx, quote in enumerate(pull_quotes):\n", " print(f\"Pull quote {idx}:\")\n", " print(\"------------------\")\n", " print(html_text.selector_to_text(quote))\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we can obtain the full text but with pull quotes stripped out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%% \n" } }, "outputs": [], "source": [ "# Removing figures as well as probably you will also want them removed\n", "drop_elements(chain(pull_quotes, sel.css(\"figure\")))\n", "cleaned_text = html_text.selector_to_text(sel)\n", "\n", "# Printing first 500 characters of the clean text\n", "print(cleaned_text[:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's verify that we have removed the duplicated text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "def count(needle, haystack):\n", " return len(re.findall(needle, haystack))\n", "\n", "pquote_excerpt = \"haven’t heard from Mark\"\n", "cases_before = count(pquote_excerpt, chris_article['articleBodyHtml'])\n", "cases_after = count(pquote_excerpt, cleaned_text)\n", "print(f\"Occurrences before: {cases_before} and after the clean up: {cases_after}\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Try it yourself\n", "\n", "Now is the moment to try it yourself. Set the `url` variable below and execute the cell\n", "to see the results of autoextract on it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" }, "scrolled": false }, "outputs": [], "source": [ "url = \"https://www.vox.com/policy-and-politics/2020/1/17/21046874/netherlands-universal-health-insurance-private\"\n", "\n", "article = autoextract_article(url)\n", "show(article)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "pycharm": { "stem_cell": { "cell_type": "raw", "source": [], "metadata": { "collapsed": false } } } }, "nbformat": 4, "nbformat_minor": 4 }