{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "
\n", "Exercise\n", "
\n", "\n",
"Write a function, `status(urls)`, that, given a list of URLs in `urls`, returns a dictionary whose keys are the HTTP status codes of the requests and whose values are the corresponding lists of URLs that produce them.\n",
"\n",
"For example:\n",
"\n",
"urls = [\n",
" \"https://httpbin.org/\",\n",
" \"https://httpbin.org/404\",\n",
" \"https://www.eltiempo.es/\",\n",
" \"https://www.pixar.com/cer890h76yt89j768y6590g43e9f4efv54er\",\n",
"]\n",
"status(urls)\n",
"
\n",
" \n",
"It should produce\n",
"\n",
"\n",
"{200: ['https://httpbin.org/', 'https://www.eltiempo.es/'],\n",
" 404: ['https://httpbin.org/404', 'https://www.pixar.com/cer890h76yt89j768y6590g43e9f4efv54er']}\n",
"
\n",
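A minimal sketch of one way to solve this exercise, assuming the `requests` library is available (the function name and signature come from the exercise statement):

```python
import requests

def status(urls):
    """Group URLs by the HTTP status code returned by a GET request."""
    codes = {}
    for url in urls:
        code = requests.get(url).status_code
        codes.setdefault(code, []).append(url)
    return codes
```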
"
\n", "Exercise\n", "
\n", "\n", "Write the code needed to download the content hosted at `https://httpbin.org/html` and return the 10 most common words together with their absolute frequency (total number of occurrences)\n", "
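One possible sketch for this exercise, using `requests` and `collections.Counter`; the simple regex tokenizer (which also counts words inside tags) is a simplifying assumption, and a real solution might strip the HTML first:

```python
import re
from collections import Counter

import requests

def top_words(url="https://httpbin.org/html", n=10):
    """Return the n most common words in the document with their counts."""
    text = requests.get(url).text.lower()
    words = re.findall(r"[a-z']+", text)
    return Counter(words).most_common(n)
```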
\n", "` whose id is \"subject\" and return the \"text\" it contains.\n", "\n", "In the real world, the HTML code of a website is not necessarily well formed and annotated. This means that we will need to study the structure of a site before extracting information from it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Scraping - HTML\n", "```html\n", "\n", "
\n", "Data Science
\n", "\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Scraping - HTML\n", "```html\n", "\n", "\n", "Javier de la Rosa
\n", "Data science
\n", "November 2019
\n", "\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Scraping - `robots.txt`\n", "\n", "The robots.txt file is a document that defines which parts of a domain may be crawled by search-engine robots, and it provides a link to the XML sitemap (see [sitemap example](https://yoast.com/sitemap_index.xml)).\n", "\n", "```\n", "# robots.txt for http://www.example.com/\n", "\n", "User-agent: UniversalRobot/1.0\n", "User-agent: GoogleBot\n", "Disallow: /sources/dtd/\n", "\n", "User-agent: *\n", "Disallow: /nonsense/\n", "Disallow: /temp/\n", "Disallow: /newsticker.shtml\n", "```\n", "\n", "Real example: https://www.google.com/robots.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Scraping with Python\n", "\n", "- [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/), responsible for building a tree of the elements that make up a web page and making them easy to access\n", "- [requests](https://requests.readthedocs.io/en/master/), to simplify making requests to servers from Python code. 
Natively, [urllib3](https://urllib3.readthedocs.io/en/latest/index.html) can also be used.\n", "- [html5lib](https://github.com/html5lib/html5lib-python), to improve the handling of HTML elements" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "!pip install beautifulsoup4\n", "!pip install requests requests-html\n", "!pip install html5lib" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "### Scraping - requests vs urllib3\n", "import requests\n", "\n", "print(\"requests:\",\n", " requests.get(\"http://httpbin.org/ip\").json()\n", ")\n", "\n", "\n", "import urllib3\n", "import json\n", "\n", "http = urllib3.PoolManager()\n", "response = http.request('GET', 'http://httpbin.org/ip')\n", "print(\"urllib3: \",\n", " json.loads(response.data.decode('utf-8'))\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "### Scraping with Python\n", "\n", "# elpais.com example - First part\n", "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "url = \"https://elpais.com/\"\n", "html = requests.get(url).text\n", "soup = BeautifulSoup(html, 'html5lib')\n", "links = []\n", "all_news_lines = soup.select('.articulo-titulo')\n", "for line in all_news_lines:\n", " link = line.find('a')\n", " links.append(link)\n", "links[:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "### Scraping with Python\n", "\n", "# elpais.com example - Second part\n", "news = []\n", "for link in links:\n", " new = link.text # link.get(\"title\")\n", " news.append(new)\n", "news[:5]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selectors\n", "- 
Type (tag) selector (all elements)\n", "```css\n", "p\n", "p, span\n", "```\n", "\n", "- Descendant selector (inside)\n", "```css\n", "p span\n", "h1 span\n", "```\n", "```html\n", "Some special text!
\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selectors (cont.)\n", "\n", "- ID selectors\n", "```css\n", "#destacado\n", "```\n", "```html\n", "Second paragraph
\n", "```\n", "\n", "- Class selector\n", "```css\n", "p.destacado\n", "```\n", "```html\n", "Lorem ipsum dolor sit amet...
\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selectors (cont.)\n", "\n", "- Combining basic selectors\n", "```css\n", ".aviso .especial\n", "```\n", "```html\n", "Some special text!
\n", "```\n", "\n", "- Direct children\n", "```css\n", "td > span\n", "```\n", "```html\n", "Paragraph and
link\n", "```\n", "\n", "- Positional selectors\n", "```css\n", "p:first span:last a:nth-child(2)\n", "```\n", "\n", "```html\n", "Para 1 Span 1Span 2 Link 1 Link 2
Para 2
\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "Exercise\n", "
\n", "\n", "Write the selector expressions for the following cases:\n", " \n", "- All `div` elements in the document.\n", "- All `p` and `a` elements.\n", "- All `a` elements that are descendants of `p`.\n", "- All elements with class `nombreClase`.\n", "- The `div` elements with class `nombreClase`.\n", "\n", "
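The requested expressions can be checked with BeautifulSoup's `.select()`, which accepts CSS selectors; the tiny HTML document below is made up just to exercise each selector:

```python
from bs4 import BeautifulSoup

html = """
<div class="nombreClase"><p>One <a href="#">link</a></p></div>
<p class="nombreClase">Two</p>
<a href="#">outside</a>
"""
soup = BeautifulSoup(html, "html.parser")

soup.select("div")              # all div elements in the document
soup.select("p, a")             # all p and a elements
soup.select("p a")              # all a elements that descend from a p
soup.select(".nombreClase")     # all elements with class nombreClase
soup.select("div.nombreClase")  # only div elements with class nombreClase
```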
\n", "| Member Name | Birth-Death |\n", "|---|---|\n", "| ADAMS, George Madison | 1837-1920 |\n", "| ALBERT, William Julian | 1816-1879 |\n", "| ALBRIGHT, Charles | 1830-1880 |\n", "
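Tables like the one above can be scraped cell by cell; a hypothetical sketch with BeautifulSoup (the HTML string here is invented for illustration, a real page would be downloaded with `requests` first):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Member Name</th><th>Birth-Death</th></tr>
  <tr><td>ADAMS, George Madison</td><td>1837-1920</td></tr>
  <tr><td>ALBERT, William Julian</td><td>1816-1879</td></tr>
  <tr><td>ALBRIGHT, Charles</td><td>1830-1880</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    rows.append(tuple(td.get_text(strip=True) for td in tr.find_all("td")))
```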
\n", "Exercise\n", "
\n", "\n", "For years, the alexa.com portal has maintained lists of the most visited sites by country. At https://www.alexa.com/topsites/countries/ES you can see the list of the top 50. Using the [`builtwith`](https://pypi.org/project/builtwith/) library, scrape the 50 most visited sites in Spain from the Alexa ranking and return the sites that use the `React` JavaScript *framework* (key `'javascript-frameworks'` in the result returned by `builtwith`).\n", "
\n", "(**Pista**: Para poder procesar una URL, debe tener el protocolo primero, `https://`. Adem'as algunas webs no funcionan con `builtwith`, así que hay que capturar las excepciones correspondientes)
\n", "Exercise\n", "
\n", "\n", "Given the URL https://pythonprogramming.net/parsememcparseface/, scrape it and obtain the text value of the element with id `yesnojs`.\n", "
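A possible sketch: download the page and pull out the element with that id (imports kept inside the helper so it is self-contained):

```python
def get_yesnojs_text(url="https://pythonprogramming.net/parsememcparseface/"):
    """Return the text of the element whose id is 'yesnojs', or None."""
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    element = soup.find(id="yesnojs")
    return element.get_text(strip=True) if element else None
```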
\n", "Exercise\n", "
\n", "\n", "Write the regular expression that validates each of the following cases (one expression per case):\n", " \n", "- Validate a positive real number with x decimal places.\n", "- Validate a negative real number with x decimal places.\n", "- Validate a date in dd/mm/yyyy format.\n", "- Validate a name, including compound names.\n", "- Validate an email address.\n", "- Validate a Twitter username: it starts with @ and may contain uppercase and lowercase letters, digits, hyphens, and underscores.\n", "- Validate a 13-digit ISBN, which always starts with 978 or 979.\n", " \n", "
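Possible (by no means unique) answers; patterns for names and emails are only rough approximations, since real-world validation is looser than any single regex:

```python
import re

# Positive / negative real number with exactly 2 decimals (x = 2 here).
positive_real = re.compile(r"^\d+\.\d{2}$")
negative_real = re.compile(r"^-\d+\.\d{2}$")
# Date dd/mm/yyyy (does not check day/month ranges).
date_ddmmyyyy = re.compile(r"^\d{2}/\d{2}/\d{4}$")
# Name, possibly compound ("Jose Luis"); letters plus Spanish accented ones.
name = re.compile(r"^[A-Za-zÁÉÍÓÚáéíóúñÑ]+( [A-Za-zÁÉÍÓÚáéíóúñÑ]+)*$")
# Very rough email check.
email = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
# Twitter handle: @ then letters, digits, hyphens, underscores.
twitter = re.compile(r"^@[A-Za-z0-9_-]+$")
# 13-digit ISBN starting with 978 or 979.
isbn13 = re.compile(r"^97[89]\d{10}$")
```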
\n", "