{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting information for machine learning purposes. Parsing and grabbing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When studying machine learning we mainly concentrate on the algorithms that process data rather than on collecting that data. This is natural, because so many databases are available for downloading: of any type, any size, for any ML algorithm. But in real life we are given particular goals, and any data science work starts with collecting or getting the information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today our life is closely connected with the internet and web sites: almost any text information we could need is available online. So in this tutorial we'll consider how to collect particular information from web sites. First of all, let's look a little inside HTML code to better understand how to extract information from it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HTML \"tells\" the web browser when, where and what element to show on the page. We can imagine it as a map that specifies the route for a driver: when to start, where to turn left or right and where to go. That's why the HTML structure of web pages is so convenient for grabbing information. Here is a simple piece of HTML code (the original snippet was lost in conversion; this is a minimal reconstruction using the 'h1' and 'p' tags discussed below):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "html_doc = \"\"\"\n", "<html>\n", " <body>\n", "  <h1>This is a headline</h1>\n", "  <p>This is a paragraph of text.</p>\n", " </body>\n", "</html>\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and below is how this code is interpreted by a web browser
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two markers 'h1' and 'p' tell the browser what to show on the page and how, and thus these markers are the keys that help us get exactly the information we need. There is a lot of material about HTML and its main tags ('h1', 'p', 'html', etc. are all tags), so you can study it more deeply on your own; here we will focus on the parsing process. For this purpose we will use the BeautifulSoup Python library:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# parse the HTML example shown above (here assumed to be stored in the string html_doc)\n", "soup = BeautifulSoup(html_doc, \"html.parser\")\n", "# find all <p> blocks in the page\n", "all_p = soup.find_all(\"p\")\n", "print(all_p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
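find_all() returns a list of Tag objects rather than plain strings; to get just the readable text of a tag we can call its get_text() method. Here is a small usage sketch (the sample strings are illustrative, not from the original page):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# usage sketch: extract only the text of each <p> tag, without the markup\n", "doc = BeautifulSoup(\"<p>first</p><p>second</p>\", \"html.parser\")\n", "texts = [p.get_text() for p in doc.find_all(\"p\")]\n", "print(texts)  # ['first', 'second']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we want to grab news headlines, for example about bitcoin, from a Google News search. A search with a custom date range produces a URL like this:\n", "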
https://www.google.com/search?q=bitcoin&num=100&biw=1920&bih=938&source=lnt&tbs=cdr%3A1%2Ccd_min%3A12%2F11%2F2018%2Ccd_max%3A12%2F11%2F2018&tbm=nws
\n", "\"q=bitcoin\" - what we are searching for
\n", "\"num=100\" - the number of results per page
\n", "\"cd_min%3A12%2F11%2F2018\" - start date (%3A and %2F are the URL-encoded ':' and '/', so this reads cd_min:12/11/2018 - MM/DD/YYYY)
\n", "\"cd_max%3A12%2F11%2F2018\" - end date
\n", "\"tbm=nws\" - search in Google News
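\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we can search this page for tags, it has to be downloaded and parsed. This fetching step was lost from the notebook, so the code below is a minimal sketch of one common approach (using the 'requests' library; note that Google may block or redirect automated requests):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "# hypothetical sketch: download the search-results page and parse it\n", "url = (\"https://www.google.com/search?q=bitcoin&num=100\"\n", "       \"&tbs=cdr%3A1%2Ccd_min%3A12%2F11%2F2018%2Ccd_max%3A12%2F11%2F2018&tbm=nws\")\n", "headers = {\"User-Agent\": \"Mozilla/5.0\"}  # pretend to be an ordinary browser\n", "response = requests.get(url, headers=headers)\n", "soup = BeautifulSoup(response.text, \"html.parser\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the downloaded HTML we can now look at how the news titles are marked up.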
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see in the page source, the 'h3' tag is responsible for the blocks with news titles. This tag has the attribute class=\"r dO0Ag\". But in this case we can use the 'h3' tag alone as an anchor, because it is used only to highlight titles.
\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# collect all h3 tags on the page\n", "titles = soup.find_all(\"h3\")\n", "print(titles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "