{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting information for machine learning purposes. Parsing and Grabbing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When studing machine learning we mainly concentrate on algorithms of proccessing data rather than on collecting that data. And this is naturally because there are so many databases available for downloading: any types, any sizes, for any ML algorithms. But in real life we are given particular goals and of course any data science processing starts with collecting/getting of information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today our life is directly connected with internet and web sites: almost any text information that we could need is available online. So in this tutorial we'l consider how to collect particular information from web sites. And first of all we`l look a little inside html code to understand better how to extract information from it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HTML language \"tells\" web browsers when, where and what element to show at the html page. We can imagine it as map that specifies the route for drivers: when to start, where to turn left or right and what place to go. That`s why html structure of web pages are so convinient to grab information. Here is the simple piece of HTML code:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n

Hello! This is the first text paragraph

\\n

and below how this code is interpreted by web browser

\\n'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "

Hello! This is the first text paragraph

\n", "

and below how this code is interpreted by web browser

\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Hello! This is the first text paragraph

\n", "

and below how this code is interpreted by web browser

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two markers 'h1' and 'p' point browser what and how to show on the page, and thus this markers are keys for us that help to get exactly that information we need. There are a lot of information about html language and its main tags ('h1', 'p', 'html', etc. - this are all tags), so you can learn it more deeply, because we will focus on the parsing process. And for this purpose we will use BeautifulSoup Python library:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before learning how to grab information directly online, let's load a little html page as text file from our folder. (right click on this link and download to folder with yupiter notebook), as sometimes we can work with already downloaded files. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "

This is first paragraph.

This is second paragraph.

This is third paragraph.

\n" ] } ], "source": [ "source_text = open(\"toy.html\", \"r\")\n", "html = \"\"\n", "for line in source_text:\n", " html = html + line\n", "print(html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our file consists of several tags and the information we need to get is contained in the second 'p' tag. To get it, we have to \"feed\" the BeautifulSoup module with total html code, that it could parse through it and find what we need. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[

This is first paragraph.

,

This is second paragraph.

,

This is third paragraph.

]\n" ] } ], "source": [ "# \"feed\" total page\n", "soup = BeautifulSoup(html, \"lxml\")\n", "# Find all

blocks in the page\n", "all_p = soup.find_all(\"p\")\n", "print(all_p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "


" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function '.find_all' collects for us all 'p' blocks in our file. We simply choose the necessary p-element from list by certain index and leave only text inside this tag:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is second paragraph.'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# extract tags text\n", "all_p[1].text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But when there are many uniform tags (like 'p') in the code or when from page to page the index number of needed paragraph changes, such aprroach will not do right things. Today almost any tag contains special attributes like 'id', 'class', 'title' etc. To know more about it you can search for CSS style sheet language. For us this attributes are additional anchors to pull from page source exactly the right paragraph (in our case). Using function '.find' we'l get not list but only 1 element (be sure that such element is only one in the page, because you can miss some information in other case.)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is third paragraph.\n", "This is first paragraph.\n" ] } ], "source": [ "# We know the needed paragraph contains attribute id='third'. Let`s grab it\n", "p_third = soup.find(\"p\", id=\"third\").text\n", "print(p_third)\n", "# In case of searching for class attribute, we have to use another way of coding,\n", "# because of reserved word \"class\" in Python\n", "p_first = soup.find(\"p\", {\"class\": \"main 1\"}).text\n", "print(p_first)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the times of dynamic pages which has various CSS styles for different types of devices, you will be very often facing the problem of changing names of tag attributes. Or it could change a little from page to page accroding to other content on it. In case when names of needed tag blocks are totaly different - we have to setup complex grabbing \"architecture\". But usually there are common words in that differing names. In our toy case we have two paragraphs with word \"main\"in 'class' attribute in both of them." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['This is first paragraph.', 'This is second paragraph.']\n", "['This is second paragraph.']\n" ] } ], "source": [ "# find all paragraphs where class attribute contains word \"main\"\n", "p_main = soup.find_all(\"p\", {\"class\": \"main\"})\n", "p_main = [p.text for p in p_main]\n", "print(p_main)\n", "# find all paragraphs with word \"2\"\n", "p_second = soup.find_all(\"p\", {\"class\": \"2\"})\n", "p_second = [p.text for p in p_second]\n", "print(p_second)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also you can get code between two tags which contains inside smaller tag blocks and clean them out, leaving only text inside them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "

This is first paragraph.

This is second paragraph.

This is third paragraph.

\n", "This is first paragraph.This is second paragraph.This is third paragraph.\n" ] } ], "source": [ "# get all html code inside tag \"html\"\n", "html_tag = soup.find(\"html\")\n", "print(html_tag)\n", "# clear it from inside tags, leaving text.\n", "html_tag = html_tag.text\n", "print(html_tag)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's do things more complicated. First grabbing task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine we have the task to analyze if there was correlation between the title of major news and price for Bitcoin? On the one hand we need to collect news about bitcoin for defined period and on the another - price. To do this we need to connect a few extra libraries including \"selenium\". Then download chromedriver and put it in folder with yupiter notebook. Selenium connects python script with chrome browser and let us to send commands to it and recieve html code of loaded pages." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import random\n", "import time\n", "from datetime import datetime, timedelta\n", "\n", "from selenium import webdriver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the way to get necessery news is to use Google Search. First of all because it grabs news headlines from many sites and we don't need to tune our script for that every news portal. The second thing - we can browse news by dates. What we have to do is to understand how the link of google news section works:\n", "

https://www.google.com/search?q=bitcoin&num=100&biw=1920&bih=938&source=lnt&tbs=cdr%3A1%2Ccd_min%3A12%2F11%2F2018%2Ccd_max%3A12%2F11%2F2018&tbm=nws

\n", "

\"search?q=bitcoin\" - what we are searching for

\n", "

\"num=100\" - number of headlines

\n", "

\"cd_min%3A12%2F11%2F2018\" - start date (cd_min%3A [12] %2F [11] %2F [2018] %2C - 12/11/2018 - MM/DD/YYYY)

\n", "

\"cd_max%3A12%2F11%2F2018\" - end date

\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to load news for word bitcoin for 01/15/2018" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# start chrome web browser\n", "driver = webdriver.Chrome()\n", "# set settings\n", "cur_year = 2018\n", "cur_month = 1\n", "cur_day = 1\n", "news_word = \"bitcoin\"\n", "# set up url\n", "cur_url = (\n", " \"https://www.google.com/search?q=\"\n", " + str(news_word)\n", " + \"&num=100&biw=1920&bih=929&source=lnt&tbs=cdr:1,cd_min:\"\n", " + str(cur_month)\n", " + \"/\"\n", " + str(cur_day)\n", " + \"/\"\n", " + str(cur_year)\n", " + \",cd_max:\"\n", " + str(cur_month)\n", " + \"/\"\n", " + str(cur_day)\n", " + \"/\"\n", " + str(cur_year)\n", " + \"&tbm=nws\"\n", ")\n", "# load url to the chromedriver\n", "driver.get(cur_url)\n", "# wait a little to let the page load fully\n", "time.sleep(random.uniform(5, 10))\n", "# read html code of loaded to chrome web page.\n", "html = driver.page_source\n", "soup = BeautifulSoup(html, \"lxml\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are lucky): correct page, correct search word and correct date. To move on, we need to examine html code and to find that tags (anchors) which let us grab necessary information. The most convenient way is to use \"inspect\" button of the right-click menu of Google Chrome web browser (or similar in other browsers). See the screenshot.

As we see 'h3' tag is resposible for block with news titles. This tag has attribute class=\"r dO0Ag\". But in this case we can use only 'h3' tag as anchor because it used only to highlight titles.

\n", "" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[

After Ripple's Rise BTC Dominance Falls Below 40%

,

How is the growth of bitcoin affecting the environment?

,

Two Things We Learned From Bitcoin's Price Moves

,

India Falsely Condemns Bitcoin as Ponzi Scheme, Flawed Logic

,

Bitcoin Futures Fail To Live Up To The Hype

,

Станет ли Lightning Network ответом биткоина на низкие комиссии ...

,

Bitcoin fever to burn out in 'spectacular crash,' David Stockman warned

,

Bitcoin 'Adds 0.3%' To Japan's GDP, Claim Nomura Analysts

,

Bank of England Could Issue “Bitcoin-style Digital Currency” in 2018

,

'Privacy Coin' Verge is Allegedly Leaking Users' IP Addresses

,

How the world hijacked bitcoin in 2017

,

Public Firm Faces Class Action Lawsuit for Falsely Claiming Link to ...

,

The Illogical Value Proposition Of Bitcoin

,

Here are the top 10 cryptoassets of 2017 (and bitcoin's 1000% rise ...

,

Bitcoin Is Coming to Your Bodega

,

Video: Bitcoin Sign Guy Tells All About Infamous Janet Yellen ...

,

Bitcoin's gender divide could be a bad sign, experts say

,

Bitcoin Adoption by Businesses in 2017

,

PR: Promising ICOs and How to Spot Them – ICOtoInvest.com

,

Video: Bitcoin or Litecoin? Charlie Lee on Which Crypto Had a Better ...

,

China & Russia to crash bitcoin & trade oil in yuan: Saxo Bank's ...

,

South Korean Exchanges Revise Policies to Comply with Crypto ...

]\n" ] } ], "source": [ "# collect all h3 tags in the code\n", "titles = soup.find_all(\"h3\")\n", "print(titles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of additional tags inside 'h3' blocks, that why we use loop to clear them and leave only text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "titles = [title.text for title in titles]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"After Ripple's Rise BTC Dominance Falls Below 40%\", 'How is the growth of bitcoin affecting the environment?', 'India Falsely Condemns Bitcoin as Ponzi Scheme, Flawed Logic', \"Two Things We Learned From Bitcoin's Price Moves\", 'Bitcoin Futures Fail To Live Up To The Hype', 'Станет ли Lightning Network ответом биткоина на низкие комиссии ...', \"Bitcoin fever to burn out in 'spectacular crash,' David Stockman warned\", \"Bitcoin 'Adds 0.3%' To Japan's GDP, Claim Nomura Analysts\", 'Bank of England Could Issue “Bitcoin-style Digital Currency” in 2018', \"'Privacy Coin' Verge is Allegedly Leaking Users' IP Addresses\", 'How the world hijacked bitcoin in 2017', 'Public Firm Faces Class Action Lawsuit for Falsely Claiming Link to ...', 'The Illogical Value Proposition Of Bitcoin', \"Here are the top 10 cryptoassets of 2017 (and bitcoin's 1000% rise ...\", 'Bitcoin Is Coming to Your Bodega', 'Video: Bitcoin Sign Guy Tells All About Infamous Janet Yellen ...', \"Bitcoin's gender divide could be a bad sign, experts say\", 'Bitcoin Adoption by Businesses in 2017', 'PR: Promising ICOs and How to Spot Them – ICOtoInvest.com', 'Video: Bitcoin or Litecoin? Charlie Lee on Which Crypto Had a Better ...', \"China & Russia to crash bitcoin & trade oil in yuan: Saxo Bank's ...\", 'South Korean Exchanges Revise Policies to Comply with Crypto ...']\n", "22\n" ] } ], "source": [ "print(titles)\n", "print(len(titles))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all. We get 21 news titles dated 1 january 2018. Also we can grab a few starting sentences from that news and use them in future analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Despite being one of the most amazing periods of sustained gains for bitcoin, 2017 ends with the first cryptocurrency losing market share to altcoins. After falling\\xa0...', 'Bitcoin is the most popular virtual currency in the world, and it has grown in value this year. It was created in 2009 as a new way of paying for things that would\\xa0...', 'Recently the Indian finance ministry criticized Bitcoin and the rest of the digital currencies in the market for their lack of intrinsic value. The Indian finance ministry\\xa0...', \"Even though bitcoin's price jumped by more than 1,400 percent in 2017, observers have yet to come up with a good explanation for its rise.\", \"After the first week I gave the CBOE's Bitcoin Futures contract a grade of C- (link). That contract has been trading for 3 weeks and the CME's contract has now\\xa0...\", 'В начале декабря 2017 года, когда основное внимание биткоин-сообщества было приковано к стремительному росту цены криптовалюты, многие\\xa0...', \"First it was the stock market. Now, it's bitcoin. David Stockman, President Ronald Reagan's former director of the Office of Management and a relentless Wall\\xa0...\", \"The optimism capitalizes on Japan's Bitcoin breakout this year, which saw the country take pole position in Bitcoin-to-fiat trading, as well as adopt nurturing\\xa0...\", \"The Bank of England might have “its own Bitcoin-style digital currency” this year, according to the country's legacy media. The more than three hundred year old\\xa0...\", \"... article on privacy coins, news.Bitcoin.com wrote: “The general consensus is that verge isn't as private as some of its competitors, so don't trust it with your life.\", 'Hearing Jamie Dimon talk out the side of his mouth, calling bitcoin a “fraud” in September, less than a month before the bank he runs, JP Morgan Chase, said it\\xa0...', 'With bitcoin at over $19,000 and ether close to $800 that day, the stock price of Apollo shot up in price from 4.3 shekels to over 10 shekels immediately after this\\xa0...', 'This paper has been doing the rounds on Twitter. Written by the pseudonymous “Mr. Game & Watch”, it purports to “demystify” the “value proposition” of Bitcoin.', \"Bitcoin's value grew by more than 1,000% in 2017, but that wasn't enough to even place it among the 10 best-performing cryptoassets of the year. In a breakout\\xa0...\", 'This year, Bitcoin has increased in value more than a thousand percent, with the price of a single Bitcoin recently (albeit briefly) breaking $17,000. Entrepreneurs\\xa0...', \"This is an entry in CoinDesk's Most Influential in Blockchain 2017 series. He might never spend a summer in D.C. again, but he made this one count.\", \"Bitcoin, and the world of cryptocurrency, is a boys' club, say some experts, and that should be cause for concern. Cryptocurrency is a form of digital currency\\xa0...\", \"2017 was a big year for Bitcoin. CBOE launched the first Bitcoin futures market, the NYSE filed for two Bitcoin ETF's, and Bitcoin price rose over 1,300 percent.\", 'ICOtoInvest.com successfully recognised the potential of trade.io, the project has moved forward to form partnerships with HitBTC, the renown Bitcoin university\\xa0...', \"While litecoin outperformed in the markets, Lee says not to write off bitcoin's technical improvements. In this exclusive CoinDesk interview, Lee talks about ICOs,\\xa0...\", \"Danish Saxo Bank is famous for making 'outrageous' predictions for the year ahead. This year, the bank predicts Russia and China will crack down on bitcoin\\xa0...\", \"South Korea's cryptocurrency exchanges have implemented changes to comply with the government's mandates announced last week. In addition to restricting\\xa0...\"]\n", "22\n" ] } ], "source": [ "# alternative way to set class attribute is to use \"class_\"\n", "news = soup.find_all(\"div\", class_=\"st\")\n", "news = [new.text for new in news]\n", "print(news)\n", "print(len(news))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But 1 day in history is nothing for correlation detection . That's why we'l create the list of 10 dates (for educational purposes) and set up loop grabbing to get news for all of them. Tip: if you want to change the language of news, when the first page is loaded during script execution, change the language in the settings manually or in few minutes you'l learn how to do this by algorithm.\n", "" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[datetime.datetime(2018, 1, 25, 0, 0), datetime.datetime(2018, 1, 26, 0, 0), datetime.datetime(2018, 1, 27, 0, 0), datetime.datetime(2018, 1, 28, 0, 0), datetime.datetime(2018, 1, 29, 0, 0), datetime.datetime(2018, 1, 30, 0, 0), datetime.datetime(2018, 1, 31, 0, 0), datetime.datetime(2018, 2, 1, 0, 0), datetime.datetime(2018, 2, 2, 0, 0), datetime.datetime(2018, 2, 3, 0, 0)]\n" ] } ], "source": [ "# Create the list of dates\n", "dates = [\n", " datetime.strptime(\"01/25/2018\", \"%m/%d/%Y\") + timedelta(days=n) for n in range(10)\n", "]\n", "print(dates)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", "10\n", "10\n", "2018-01-30 00:00:00\n", "['Facebook is banning all ads promoting cryptocurrencies — including ...', 'Bitcoin drops 12%, falls below $10000 amid broad cryptocurrency sell ...', \"Turkish soccer club completes first Bitcoin transfer in country's history ...\", 'Facebook ban on bitcoin ads is the latest in a very bad day for ...', \"Coincheck to Repay Hack Victims' XEM Balances at 81 US Cents Each\"]\n", "['Facebook is banning all ads promoting cryptocurrencies — including ...', 'Bitcoin drops 12%, falls below $10000 amid broad cryptocurrency sell ...', \"Turkish soccer club completes first Bitcoin transfer in country's history ...\", 'Facebook ban on bitcoin ads is the latest in a very bad day for ...', \"Coincheck to Repay Hack Victims' XEM Balances at 81 US Cents Each\"]\n" ] } ], "source": [ "# create lists to save grabbed by dates information\n", "news = []\n", "titles = []\n", "# start loop\n", "for date in dates:\n", " cur_year = date.year\n", " cur_month = date.month\n", " cur_day = date.day\n", " news_word = \"bitcoin\"\n", " cur_url = (\n", " \"https://www.google.com/search?q=\"\n", " + str(news_word)\n", " + \"&num=100&biw=1920&bih=929&source=lnt&tbs=cdr:1,cd_min:\"\n", " + str(cur_month)\n", " + \"/\"\n", " + str(cur_day)\n", " + \"/\"\n", " + str(cur_year)\n", " + \",cd_max:\"\n", " + str(cur_month)\n", " + \"/\"\n", " + str(cur_day)\n", " + \"/\"\n", " + str(cur_year)\n", " + \"&tbm=nws\"\n", " )\n", " driver.get(cur_url)\n", " # we have to increase the pause between loadings of pages to avoid detection\n", " # our activity as robots doing. So you have time for a cup of coffee while waiting.\n", " time.sleep(random.uniform(60, 120))\n", "\n", " html = driver.page_source\n", " soup = BeautifulSoup(html, \"lxml\")\n", " cur_titles = soup.find_all(\"h3\")\n", " cur_titles = [title.text for title in cur_titles]\n", " titles.append(cur_titles)\n", " cur_news = soup.find_all(\"h3\")\n", " cur_news = [new.text for new in cur_news]\n", " news.append(cur_news)\n", "print(len(dates))\n", "print(len(titles))\n", "print(len(news))\n", "# chech if the script works properly\n", "print(dates[5])\n", "print(titles[5][:5])\n", "print(news[5][:5])\n", "driver.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# But if it is so simply, it wouldn't be so interesting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all such grabbing could be detected by web sites algorithms like the robots activity and could be banned. Secondly, some web sites hide all content of their pages and show it only when you scroll down the page. Thirdly, very often we need to input values in input boxes, click links to open next/previous page or click download button. To solve these problems we can use special methods to control browser." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's open the example page at \"Yahoo! Finance\": https://finance.yahoo.com/quote/SNA/history?p=SNA. If you scroll down the page you'l see that the content loads up periodically, and then finally it reaches the last row \"Dec 13, 2017\". But when the page is just opened and we view page source (Ctrl+U for Google Chrome), we'l not find there \"Dec 13, 2017\". So to get data with all dates for this symbol, first of all we have to scroll down to the end and after that parse the page. Such code will help us to solve this problem (to learn different ways of scrolling look here https://goo.gl/JdSvR4):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[706249, 707803, 720356, 720356, 768578, 768578, 768578, 768578, 794404, 794404, 794404, 794404, 794404]\n", " Date Open High Low Close* Adj Close** Volume\n", "0 Dec 12, 2018 149.40 150.82 148.25 148.39 148.39 468400.0\n", "1 Dec 11, 2018 150.74 151.82 146.87 147.34 147.34 538500.0\n", "2 Dec 10, 2018 150.53 151.00 145.94 148.71 148.71 553600.0\n", "3 Dec 07, 2018 156.17 157.50 149.83 151.00 151.00 643200.0\n", "4 Dec 06, 2018 154.19 155.98 151.31 155.97 155.97 859300.0\n", " Date Open High \\\n", "250 Dec 19, 2017 172.66 173.49 \n", "251 Dec 18, 2017 169.30 173.25 \n", "252 Dec 15, 2017 166.71 168.97 \n", "253 Dec 14, 2017 170.37 170.37 \n", "254 *Close price adjusted for splits.**Adjusted cl... NaN NaN \n", "\n", " Low Close* Adj Close** Volume \n", "250 170.11 172.35 168.75 584800.0 \n", "251 169.30 172.62 169.01 656000.0 \n", "252 166.09 168.20 164.69 761900.0 \n", "253 165.45 165.50 162.04 702900.0 \n", "254 NaN NaN NaN NaN \n" ] } ], "source": [ "import pandas as pd\n", "\n", "driver = webdriver.Chrome()\n", "\n", "cur_url = \"https://finance.yahoo.com/quote/SNA/history?p=SNA\"\n", "driver.get(cur_url)\n", "\n", "SCROLL_PAUSE_TIME = 0.5\n", "\n", "equal = 0\n", "html_len_list = []\n", "while True:\n", " window_size = driver.get_window_size()\n", " # get the size of loaded page\n", " html_len = len(driver.page_source.encode(\"utf-8\"))\n", " html_len_list.append(html_len)\n", " # scroll down\n", " driver.execute_script(\n", " \"window.scrollTo(0, window.scrollY +\" + str(window_size[\"height\"]) + \")\"\n", " )\n", " # time to load all content\n", " time.sleep(1)\n", " # get the size of newly loaded content\n", " new_html_len = len(driver.page_source.encode(\"utf-8\"))\n", " # check if size of content not equal before and after scrolling.\n", " # if they are equal: add 1 to \"equal\" and scroll down.\n", " # if they are equal more than 4 last scrollings: break\n", " # if they are not \"equal\": reset equal to 0 and scroll down again\n", " if html_len == new_html_len:\n", " equal += 1\n", " if equal > 4:\n", " break\n", " else:\n", " equal = 0\n", "\n", "print(html_len_list)\n", "html = driver.page_source\n", "soup = BeautifulSoup(html, \"lxml\")\n", "table = soup.find(\"table\", {\"data-test\": \"historical-prices\"})\n", "table = pd.read_html(str(table))[0]\n", "print(table.head())\n", "print(table.tail())\n", "driver.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many web sites that prefer to devide one article for two and more parts, so you have to click 'next' or 'previous' buttons. For us it is the task to open all that pages and grab them). The same task is with multi page catalogues. Here is example: we will open several pages at stackoverflow tags catalogue and collect top tag words with their occurancies through the portal. To do this we will use find_element_by_css_selector() method to locate certain element on the page and click on it with click() method. To read more about locating elements open this: https://goo.gl/PyzbBN" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " tag count\n", "0 javascript 1730078\n", "1 java 1492129\n", "2 c# 1268430\n", "3 php 1248340\n", "4 android 1158141\n", " tag count\n", "103 ruby-on-rails-3 55908\n", "104 validation 54970\n", "105 numpy 54786\n", "106 tsql 54530\n", "107 sorting 54524\n" ] } ], "source": [ "import pandas as pd\n", "from selenium.webdriver.common.keys import Keys\n", "\n", "\n", "def page_parse(html):\n", " soup = BeautifulSoup(html, \"lxml\")\n", " tags = soup.find_all(\"div\", class_=\"grid-layout--cell tag-cell\")\n", " tag_text_list = []\n", " tag_count_list = []\n", " for tag in tags:\n", " tag_text = tag.find(\"a\").text\n", " tag_text_list.append(tag_text)\n", " tag_count = tag.find(\"span\", class_=\"item-multiplier-count\").text\n", " tag_count_list.append(tag_count)\n", "\n", " return tag_text_list, tag_count_list\n", "\n", "\n", "driver = webdriver.Chrome()\n", "cur_url = \"https://stackoverflow.com/tags\"\n", "driver.get(cur_url)\n", "tag_names, tag_counts = [], []\n", "\n", "for i in range(3):\n", " if i == 0:\n", " html = driver.page_source\n", " cur_tag_names, cur_tag_counts = page_parse(html)\n", " tag_names = tag_names + cur_tag_names\n", " tag_counts = tag_counts + cur_tag_counts\n", " else:\n", " # find necessery element to click\n", " next = driver.find_element_by_css_selector(\".page-numbers.next\")\n", " # in some cases it would be enough to run next.click() but sometimes it doesn`t work\n", " # for more information about possible troubles of using click() read here:\n", " # https://goo.gl/kUGvsC\n", " driver.execute_script(\"arguments[0].click();\", next)\n", " time.sleep(2)\n", " html = driver.page_source\n", " cur_tag_names, cur_tag_counts = page_parse(html)\n", " tag_names = tag_names + cur_tag_names\n", " tag_counts = tag_counts + cur_tag_counts\n", "\n", "tag_table = pd.DataFrame({\"tag\": tag_names, \"count\": tag_counts})\n", "\n", "print(tag_table.head())\n", "print(tag_table.tail())\n", "driver.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or here is another example: medium.com site hides the part of comments below articles. But if we need to analyze the \"reasons\" of popularity of the page, comments can play a great role in this analysis and it better to grab all of them. Open this page and scroll to the bottom - you'll find that there is \"Show all responses\" button as \"div\" element there. Let's click on it and open all comments." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "driver = webdriver.Chrome()\n", "cur_url = \"https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60\"\n", "driver.get(cur_url)\n", "# locate div container that contains button\n", "find_div = driver.find_element_by_css_selector(\".container.js-showOtherResponses\")\n", "# locate button inside container\n", "button = find_div.find_element_by_tag_name(\"button\")\n", "driver.execute_script(\"arguments[0].click();\", button)\n", "# check the page before and after running script - in second case all\n", "# comments are opened" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Authorization and input boxes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of information is available only after authorization. So let's learn how to log in at Facebook. Algorithm is the same: find input boxes for login and password, insert into them text and after submit it. To send text to the inputs we will use .send_keys() method and to submit: .submit() method." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "driver = webdriver.Chrome()\n", "cur_url = \"https://www.facebook.com/\"\n", "driver.get(cur_url)\n", "username_field = driver.find_element_by_name(\"email\") # get the username field by name\n", "password_field = driver.find_element_by_name(\"pass\") # get the password field by name\n", "username_field.send_keys(\"your_email\") # insert your email\n", "password_field.send_keys(\"your_password\") # insert your password\n", "password_field.submit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But this methods are also very usefull when we need to change dates or insert values into input boxes to get certain information. For example, here is the \"one page tool\" to recieve etf fund flows information: Etf Fund Flows. There are no special pages for each ETF (as Yahoo! has) to view or download desired values. All you can do: to enter ETF symbol, start and end dates and click button \"Submit\". But if your boss set atask to obtain historical data for 500 etfs and 10 last years (120 months), you'l have to click 60000 the button \"submit\". What's a dull amusement... So let's make an algorithm that can collect this information while you'l be raving somewhere at Ibiza party.\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Ticker Fund Name Net Flows Details Start Date End Date\n", "0 SPY SPDR S&P 500 ETF Trust 1714.15 NaN 03-03-31 06-06-30\n", "1 SPY SPDR S&P 500 ETF Trust 1714.15 NaN 12-12-31 03-03-31\n", "2 SPY SPDR S&P 500 ETF Trust 1714.15 NaN 09-09-30 12-12-31\n", "3 QQQ Invesco QQQ Trust 342.83 NaN 03-03-31 06-06-30\n", "4 QQQ Invesco QQQ Trust 342.83 NaN 12-12-31 03-03-31\n", "5 QQQ Invesco QQQ Trust 342.83 NaN 09-09-30 12-12-31\n" ] } ], "source": [ "# a function to get year, date and month for start and end date inputs\n", "def convert_date(date):\n", " year = date.split(\"/\")[2]\n", " month = date.split(\"/\")[0]\n", " day = date.split(\"/\")[1]\n", " # here we have to add zero before the month if it is less than 10\n", " # because the input form requires such format of data: 01/01/2018\n", " if len(month) < 2:\n", " month = str(\"0\") + str(month)\n", "\n", " return day, month, month\n", "\n", "\n", "driver = webdriver.Chrome()\n", "# set some dates\n", "dates = [\"6/30/2018\", \"3/31/2018\", \"12/31/2017\", \"9/30/2017\"]\n", "# set a few etfs\n", "etfs = [\"SPY\", \"QQQ\"]\n", "# create empty dataframe to store values\n", "export_table = pd.DataFrame({})\n", "\n", "# start ticker loop\n", "for ticker in etfs:\n", " loginUrl = \"http://www.etf.com/etfanalytics/etf-fund-flows-tool\"\n", " # set loop for dates without last date, because we need to set a period of\n", " # 'start' and 'end' date\n", " for i in range(0, len(dates) - 1):\n", " # create current pair of dates\n", " start_day, start_month, start_year = convert_date(dates[i + 1])\n", " end_day, end_month, end_year = convert_date(dates[i])\n", " date_1 = (\n", " str(start_year) + str(\"-\") + str(start_month) + str(\"-\") + str(start_day)\n", " )\n", " date_2 = str(end_year) + str(\"-\") + str(end_month) + str(\"-\") + str(end_day)\n", " driver.get(loginUrl)\n", " # locate the field to input etf symbol\n", " ticker_field = driver.find_element_by_id(\"edit-tickers\")\n", " ticker_field.send_keys(str(ticker))\n", " # locate the field to input start date\n", " start_date_field = driver.find_element_by_id(\n", " \"edit-startdate-datepicker-popup-0\"\n", " )\n", " start_date_field.send_keys(str(date_1))\n", " # locate the field to input end date\n", " end_date_field = driver.find_element_by_id(\"edit-enddate-datepicker-popup-0\")\n", " end_date_field.send_keys(str(date_2))\n", " # submit form\n", " end_date_field.submit()\n", " # read the page souse\n", " html = driver.page_source\n", " soup = BeautifulSoup(html, \"lxml\")\n", " # find certain table with etf flows information\n", " table = soup.find_all(\"table\", id=\"topTenTable\")\n", " # some transformations to get html code readable by pd.read_html() method\n", " table = table[0].find_all(\"tbody\")\n", " table = str(table[0]).split(\"\")[1]\n", " table = table.split(\"\")[0]\n", " data = \"\" + str(table) + \"
\"\n", " soup = BeautifulSoup(data, \"lxml\")\n", " # convert html code to pandas dataframe\n", " df = pd.read_html(str(soup))\n", " current_table = df[0]\n", " current_table.columns = [\"Ticker\", \"Fund Name\", \"Net Flows\", \"Details\"]\n", " current_table[\"Start Date\"] = [date_1]\n", " current_table[\"End Date\"] = [date_2]\n", " # concatenate current inflow table with main dataframe.\n", " export_table = pd.concat([export_table, current_table], ignore_index=True)\n", " # let the algorithm rest for a while\n", " time.sleep(random.uniform(5, 10))\n", "\n", "# some magic and we get the information assigned by task.\n", "print(export_table)\n", "driver.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are enormous amount of sites, each of them has its own design, access to information, protection against robots, etc. That's why this tutorial could be as a little book. But at least one more approach of grabbing information we'l discover. It is connected with parsing dynamic graphs as like www.google/trends uses. Interestingly that Google's programmers don't allow to parse the code of trend graphs (the div tag which contains the code of graph is hidden) but let you download csv file with information (so we can use one of the above algorithms to find this button, to click and download file).\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take another site where we can parse similar graphs: Portfolio Visualizer. Scroll down this page and you'l find graph as at screenshot. The worth of this graph is that the historical prices for Us Treasury Notes are not freely available - you have to buy it. But here we can grab it either manually (rewrite dates and values correspondigly), or to write code which \"rewrites\" the values for us and not only from this page...\n", "\n" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1\n", "0 Dec 31, 1971 $10,000\n", "1 Jan 31, 1972 $9,901\n", "2 Feb 29, 1972 $9,988\n", "3 Mar 31, 1972 $9,979\n", "4 Apr 30, 1972 $10,015\n", " 0 1\n", "559 Jul 31, 2018 $239,204\n", "560 Aug 31, 2018 $241,631\n", "561 Sep 30, 2018 $238,720\n", "562 Oct 31, 2018 $237,998\n", "563 Nov 30, 2018 $241,163\n" ] } ], "source": [ "loginUrl = \"https://www.portfoliovisualizer.com/backtest-asset-class-allocation?s=y&mode=2&startYear=1972&endYear=2018&initialAmount=10000&annualOperation=0&annualAdjustment=0&inflationAdjusted=true&annualPercentage=0.0&frequency=4&rebalanceType=1&portfolio1=Custom&portfolio2=Custom&portfolio3=Custom&TreasuryNotes1=100\"\n", "driver = webdriver.Chrome()\n", "driver.get(loginUrl)\n", "html = driver.page_source\n", "soup = BeautifulSoup(html, \"lxml\")\n", "# find the div with chart values\n", "chart = soup.find_all(\"div\", id=\"chartDiv2\")\n", "table = str(chart[0]).split(\"\")[1]\n", "table = table.split(\"\")[0]\n", "data = \"\" + str(table) + \"
\"\n", "soup = BeautifulSoup(data, \"lxml\")\n", "df = pd.read_html(str(soup))[0]\n", "print(df.head())\n", "print(df.tail())\n", "driver.quit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# In coclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's important to admit that parsing activity can be easily determined as robot's activity and you we'l be asked to pass the \"antirobot's\" captcha. On the one hand you can find solutions how to give the right answers to it, but on the other (i think more natural), you can set up such algorithms that will be similar to human activity, when they use web sites. You are lucky, when the website has no protection against parsing. But in case with Google News - after 10 or 20 loadings of page, you'l meet google's captcha. So try to make your algorithm more humanlike: scroll up and down, click links or buttons, be on the page for at list 10-15 seconds or more, especially when you need to download several thousands of pages, take breaks for hour and for night, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# And good luck!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }