{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# AutoExtract articleBodyHtml example\n",
"\n",
"The [AutoExtract API](https://scrapinghub.com/autoextract) is a service for \n",
"automatically extracting information from web content. This notebook\n",
"shows how article body content can be extracted\n",
"from articles automatically, focusing specifically on the features\n",
"offered by the attribute `articleBodyHtml`. \n",
"\n",
"The `articleBodyHtml` attribute **returns**\n",
"a clean version of the article content where **all irrelevant content has been removed**\n",
"(framing, ads, links to content not directly related to the article, call-to-action elements, etc.)\n",
"and where the resultant **HTML is simplified and normalized** in such a way\n",
"that it is **consistent across content from different sites**.\n",
"\n",
"The resultant HTML offers great flexibility to:\n",
"* Apply custom and consistent styling to content from different sites\n",
"* Pick which content elements to show or hide, or even rearrange the elements in the article\n",
"\n",
"AutoExtract relies on machine learning models and is able to detect elements like figure captions or block quotes even if they were not annotated with the proper HTML tag, bringing\n",
"normalization one step further.\n",
"\n",
"> **Recommendation:** For a better viewing experience, execute this notebook cell by cell.\n",
"\n",
"Before starting, let's import some stuff that will be needed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"import json\n",
"from itertools import chain\n",
"from autoextract.sync import request_batch\n",
"from IPython.core.display import HTML\n",
"from parsel import Selector\n",
"import html_text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Scrapinghub client library ``scrapinghub-autoextract`` provides access to the Articles \n",
"Extraction API in Python. A key is required to access the service. You can obtain one\n",
"on [this page](https://scrapinghub.com/autoextract). The client library will look\n",
"for this key in the environment variable ``SCRAPINGHUB_AUTOEXTRACT_KEY``, but **you can\n",
"also set it in the variable `AUTOEXTRACT_KEY` below and then evaluate the cell**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Set your AutoExtract key in the variable below\n",
"AUTOEXTRACT_KEY = \"\"\n",
"\n",
"if AUTOEXTRACT_KEY:\n",
" os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = AUTOEXTRACT_KEY\n",
"if not os.environ.get('SCRAPINGHUB_AUTOEXTRACT_KEY'):\n",
" raise Exception(\"Please fill the variable 'AUTOEXTRACT_KEY' above with your \"\n",
" \"AutoExtract key\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
"The method [``request_batch``](https://github.com/scrapinghub/scrapinghub-autoextract#synchronous-api) \n",
"is the entrypoint to the AutoExtract API. Let's define the method ``autoextract_article`` for convenience \n",
"as: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def autoextract_article(url):\n",
" return request_batch([url], page_type='article')[0]['article']"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Among the [attributes returned by AutoExtract](https://doc.scrapinghub.com/autoextract.html#article-extraction),\n",
"this notebook will focus on the attribute ``articleBodyHtml``, which contains the simplified, \n",
"normalized and cleaned-up article content in HTML code.\n",
"\n",
"Let's see an extraction example for [this page](https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-let-the-aflw-thrive-and-prosper/):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
},
"scrolled": false
},
"outputs": [],
"source": [
"sport_article = autoextract_article(\n",
" \"https://thenewdaily.com.au/sport/afl/2020/03/12/clear-the-decks-and-\"\n",
" \"let-the-aflw-thrive-and-prosper/\")\n",
"HTML(sport_article['articleBodyHtml'])"
]
},
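{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the returned HTML is normalized, it can be queried with standard tools. As a\n",
"quick illustration (a sketch assuming the extraction above succeeded and the article\n",
"contains captioned figures), we can use the already imported `parsel` ``Selector``\n",
"to pick specific elements, for example the figure captions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pick all figure caption texts from the normalized article HTML\n",
"Selector(text=sport_article['articleBodyHtml']).css('figcaption::text').getall()"
]
},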
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% md\n"
}
},
"source": [
"Note how only the relevant content of the article was extracted, avoiding elements\n",
"like ads and unrelated content. AutoExtract relies on advanced machine learning\n",
"models that are able to discriminate between what is relevant and what is not. \n",
"\n",
"Also note how figures with captions were extracted. Many \n",
"[other elements can also be present](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field). \n",
"\n",
"## Styling\n",
"\n",
"Having normalized HTML code has some clear advantages. One is that the content\n",
"can be formatted independently of the original style with simple CSS rules.\n",
"That means the same consistent formatting can be applied even if the content comes\n",
"from very different pages with different formats. \n",
"\n",
"AutoExtract encapsulates the `articleBodyHtml` content within ``article`` tags. For example:\n",
"```html\n",
" This is a simple article