{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping StackOverflow\n", "\n", "In this project, we will scrape the StackOverflow website to:\n", "\n", "- [Goal 1: List the most mentioned/tagged languages along with their tag counts](#Goal1)\n", "- [Goal 2: List the most voted questions along with their attributes (summary, tags, and number of votes, answers and views)](#Goal2)\n", "\n", "We will divide the project into these two goals.\n", "\n", "Before starting, we need to understand a few basics of web scraping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Web Scraping Basics\n", "\n", "First, a quick look at how web pages work and how we can scrape them.\n", "\n", "When we visit a page, our browser makes a request to a web server. Most of the time, this request is a [GET Request](https://realpython.com/lessons/the-get-request/). Our web browser then receives a bunch of files, typically HTML, CSS and JavaScript. HTML contains the content, while CSS and JavaScript tell the browser how to style the page and how it behaves. So, we will be mainly interested in the HTML file.\n", "\n", "### HTML: \n", "HTML has elements called [tags](https://www.w3schools.com/html/html_elements.asp), which help in differentiating the parts of an HTML document. Some common tags are:\n", "* `html` - all content is inside this tag\n", "* `head` - contains the title and references to related files\n", "* `body` - contains the main content to be displayed on the webpage\n", "* `div` - division or area of a page\n", "* `p` - paragraph\n", "* `a` - links\n", "\n", "We will look for our content inside the `body` tag and use `p` and `a` tags to get paragraphs and links.\n", "\n", "HTML also has [class and id properties](https://www.codecademy.com/articles/classes-vs-ids). These properties give HTML elements names and make it easier for us to refer to a particular element. 
A `class` can be shared among multiple elements, and an element can have more than one class, whereas an `id` must be unique and can be used just once in the document.\n", "\n", "### Requests\n", "The `requests` module in Python lets us easily download pages from the web.
\n", "We can request the contents of a webpage with `requests.get()`, passing in the target link as a parameter. This gives us a [response object](https://realpython.com/python-requests/#the-response). \n", "\n", "### Beautiful Soup\n", "The [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library helps us parse the contents of a webpage in an easy-to-use manner. It provides some very useful methods and attributes, such as:\n", "* `find()`, `select_one()` - return the first tag object that matches our filter\n", "* `find_all()`, `select()` - return a list of all tag objects that match our filter\n", "* `children` - lets us iterate over the direct children of a given tag\n", "\n", "These methods help us extract specific portions of the webpage.\n", "\n", "***Tip: When scraping, we try to find common properties shared among the target objects. This helps us extract all of them in just one or two commands.***\n", "\n", "For example, say we want to scrape the points of teams in a league table. We could go to each element and extract its value. Or we could find a common thread (like **same class, same parent + same element type**) between all the points and pass that common thread as an argument to BeautifulSoup, which will then extract and return the elements to us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Goal 1: Listing most tagged Languages\n", "\n", "Now that we know the basics of web scraping, we can move on to our first goal.\n", "\n", "In Goal 1, we have to list the most tagged languages along with their tag counts. First, let's make a list of steps to follow:\n", "\n", "- [1. Download Webpage from stackoverflow](#1.1)\n", "- [2. Parse the document content into BeautifulSoup](#1.2)\n", "- [3. Extract Top Languages](#1.3)\n", "- [4. Extract their respective Tag Counts](#1.4)\n", "- [5. Put all code together and join the two lists](#1.5)\n", "- [6. 
Plot Data](#1.6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import all the required libraries and packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", "import requests # Getting Webpage content\n", "from bs4 import BeautifulSoup as bs # Scraping webpages\n", "import matplotlib.pyplot as plt # Visualization\n", "import matplotlib.style as style # For styling plots\n", "from matplotlib import pyplot as mp # For Saving plots as images\n", "\n", "# For displaying plots in jupyter notebook\n", "%matplotlib inline \n", "\n", "style.use('fivethirtyeight') # matplotlib Style " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading Tags page from StackOverflow\n", "\n", "We will download the [tags page](https://stackoverflow.com/tags) from [stackoverflow](https://stackoverflow.com/), where it has all the languages listed with their tag count." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using requests module for downloading webpage content\n", "response = requests.get('https://stackoverflow.com/tags')\n", "\n", "# Getting status of the request\n", "# 200 status code means our request was successful\n", "# 404 status code means that the resource you were looking for was not found\n", "response.status_code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing the document into Beautiful Soup" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Parsing html data using BeautifulSoup\n", "soup = bs(response.content, 'html.parser')\n", "\n", "# body \n", "body = soup.find('body')\n", "\n", "# printing the object type of body\n", "type(body)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Top Languages\n", "\n", "To achieve this, we need to understand the HTML structure of the document and then narrow down to our element of interest.\n", "\n", "\n", "One way of doing this would be manually searching the webpage (hint: print the `body` variable from above and search through it).
\n", "The second way is to use the browser's Developer Tools. \n", "\n", "We will use the second one. On Chrome, open the [tags page](http://stackoverflow.com/tags?tab=popular), right-click on the language name (shown in top left) and choose **Inspect**.\n", "\n", "![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/tags.png?raw=true)\n", "*Image for Reference*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the language name is inside an `a` tag, which in turn is nested inside several `div` tags. This looks difficult to extract. Here, the [class](#classes) and [id](#id) properties we spoke about earlier come to our rescue. \n", "\n", "If we look more closely at the image above, we can see that the `a` tag has a class of `post-tag`. Using this class along with the `a` tag, we can extract all the language links in a list." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[javascript,\n", " java]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lang_tags = body.find_all('a', class_='post-tag')\n", "lang_tags[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, using [list comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), we will extract all the language names." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['javascript', 'java', 'c#', 'php', 'android']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "languages = [i.text for i in lang_tags]\n", "languages[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Tag Counts\n", "\n", "To extract tag counts, we will follow the same process.\n", "\n", "On Chrome, open the [tags page](http://stackoverflow.com/tags), right-click on the tag count next to the top language (shown in top left) and choose **Inspect**.\n", "\n", "![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/tag_count.png?raw=true)\n", "*
Image for Reference
*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the tag counts are inside `span` tag, with a class of `item-multiplier-count`. Using this class along with `span` tag, we will extract all the tag count spans in a list." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1824582,\n", " 1557391]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag_counts = body.find_all('span', class_='item-multiplier-count')\n", "tag_counts[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, using [list comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), we will extract all the Tag Counts." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1824582, 1557391, 1320273, 1289585, 1200130]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_of_tags = [int(i.text) for i in tag_counts]\n", "no_of_tags[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Put all code together and join the two lists\n", "\n", "We will use [Pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) to put the two lists together.
To make a DataFrame, we pass both lists (in dictionary form) as the argument to `pd.DataFrame`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Function to check whether the extracted bs4 result has the expected number of elements\n", "def error_checking(list_name, length):\n", "    if len(list_name) != length:\n", "        # Report the counts rather than dumping the whole list\n", "        print(\"Error in parsing: expected {} elements, found {}!\".format(length, len(list_name)))\n", "        return -1\n", "\n", "\n", "def get_top_languages(url):\n", "    # Using requests module for downloading webpage content\n", "    response = requests.get(url)\n", "\n", "    # Parsing html data using BeautifulSoup\n", "    soup = bs(response.content, 'html.parser')\n", "    body = soup.find('body')\n", "\n", "    # Extracting Top Languages\n", "    lang_tags = body.find_all('a', class_='post-tag')\n", "    error_checking(lang_tags, 36) # Error Checking (36 tags per page)\n", "    languages = [i.text for i in lang_tags] # Languages List\n", "\n", "    # Extracting Tag Counts\n", "    tag_counts = body.find_all('span', class_='item-multiplier-count')\n", "    error_checking(tag_counts, 36) # Error Checking\n", "    no_of_tags = [int(i.text) for i in tag_counts] # Tag Counts List\n", "\n", "    # Putting the two lists together\n", "    df = pd.DataFrame({'Languages':languages,\n", "                       'Tag Count':no_of_tags})\n", "\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot Data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>Languages</th><th>Tag Count</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>javascript</td><td>1824582</td></tr>\n", " <tr><th>1</th><td>java</td><td>1557391</td></tr>\n", " <tr><th>2</th><td>c#</td><td>1320273</td></tr>\n", " <tr><th>3</th><td>php</td><td>1289585</td></tr>\n", " <tr><th>4</th><td>android</td><td>1200130</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Languages Tag Count\n", "0 javascript 1824582\n", "1 java 1557391\n", "2 c# 1320273\n", "3 php 1289585\n", "4 android 1200130" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "URL1 = 'https://stackoverflow.com/tags'\n", "\n", "df = get_top_languages(URL1)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we will plot the Top Languages along with their Tag Counts." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(8, 3))\n", "plt.bar(height=df['Tag Count'][:10], x=df['Languages'][:10])\n", "plt.xticks(rotation=90)\n", "plt.xlabel('Languages')\n", "plt.ylabel('Tag Counts')\n", "plt.savefig('lang_vs_tag_counts.png', bbox_inches='tight')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Goal 2: Listing most voted Questions\n", "\n", "Now that we have collected data using web scraping one time, it won't be difficult the next time.
\n", "In Goal 2, we have to list the questions with the most votes along with their attributes:\n", "\n", "- Summary\n", "- Tags\n", "- Number of Votes\n", "- Number of Answers\n", "- Number of Views\n", "\n", "I would suggest giving it a try on your own before coming back to see my solution.\n", "\n", "As in the previous goal, we will make a list of steps to follow:\n", "\n", "- [1. Download Webpage from stackoverflow](#2.1)\n", "- [2. Parse the document content into BeautifulSoup](#2.2)\n", "- [3. Extract Top Questions](#2.3)\n", "- [4. Extract their respective Summary](#2.4)\n", "- [5. Extract their respective Tags](#2.5)\n", "- [6. Extract their respective no. of votes, answers and views](#2.6)\n", "- [7. Put all code together and join the lists](#2.7)\n", "- [8. Plot Data](#2.8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading Questions page from StackOverflow\n", "\n", "We will download the [questions page](https://stackoverflow.com/questions?sort=votes&pagesize=50) from [stackoverflow](https://stackoverflow.com/), which lists all the top voted questions.
\n", " \n", "Here, I've appended `?sort=votes&pagesize=50` to the end of the default questions URL to get a list of the top 50 questions." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using requests module for downloading webpage content\n", "response1 = requests.get('https://stackoverflow.com/questions?sort=votes&pagesize=50')\n", "\n", "# Getting status of the request\n", "# 200 status code means our request was successful\n", "# 404 status code means that the resource you were looking for was not found\n", "response1.status_code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A different Scraping Function\n", "\n", "In this section, we will use `select()` and `select_one()` to return BeautifulSoup objects as per our requirement. While `find_all()` filters by tag name and attributes, `select()` takes a CSS selector. I personally tend to use the latter more.\n", "\n", "For example:\n", "- `p a` - finds all a tags inside of a p tag. \n", "> ```soup.select('p a')```\n", "\n", "- `div.outer-text` - finds all div tags with a class of outer-text.\n", "- `div#first` - finds all div tags with an id of first.\n", "- `body p.outer-text` - finds any p tags with a class of outer-text inside of a body tag." 
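, "\n", "\n",
"The selectors above can be tried on a small hand-written HTML snippet. This is only a sketch: the markup, ids, class names and the `demo_soup` variable below are made up for illustration and are not StackOverflow's actual structure.\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical markup, invented for this example\n",
"html = '''\n",
"<body>\n",
"  <div id='first' class='outer-text'><p><a href='#one'>one</a></p></div>\n",
"  <div class='outer-text'><p class='outer-text'>two</p></div>\n",
"</body>\n",
"'''\n",
"demo_soup = BeautifulSoup(html, 'html.parser')\n",
"\n",
"links = demo_soup.select('p a')                     # a tags inside a p tag\n",
"outer_divs = demo_soup.select('div.outer-text')     # div tags with class outer-text\n",
"first_div = demo_soup.select_one('div#first')       # the div tag with id first\n",
"outer_paras = demo_soup.select('body p.outer-text') # p tags with class outer-text inside body\n",
"\n",
"print(len(links), len(outer_divs), len(outer_paras))\n",
"print(first_div['id'])\n",
"```\n",
"\n",
"`select()` always returns a list (possibly empty), while `select_one()` returns a single tag or `None`, so it is worth checking the result before indexing into it."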
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing the document into Beautiful Soup" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Parsing html data using BeautifulSoup\n", "soup1 = bs(response1.content, 'html.parser')\n", "\n", "# body \n", "body1 = soup1.select_one('body')\n", "\n", "# printing the object type of body\n", "type(body1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Top Questions\n", "\n", "On Chrome, open [questions page](https://stackoverflow.com/questions?sort=votes&pagesize=50) and right-click on the top question and choose **Inspect**.\n", "\n", "![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/questions.png?raw=true)\n", "*
Image for Reference
*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the question is inside `a` tag, which has a class of `question-hyperlink`. \n", "\n", "Taking cue from our previous Goal, we can use this class along with `a` tag, to extract all the question links in a list. However, there are more question hyperlinks in sidebar which will also be extracted in this case. To avoid this scenario, we can combine `a` tag, `question-hyperlink` class with their parent `h3` tag. This will give us exactly 50 Tags." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Why is processing a sorted array faster than processing an unsorted array?,\n", " How do I undo the most recent local commits in Git?]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "question_links = body1.select(\"h3 a.question-hyperlink\")\n", "question_links[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[List comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), to extract all the questions." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Why is processing a sorted array faster than processing an unsorted array?',\n", " 'How do I undo the most recent local commits in Git?']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "questions = [i.text for i in question_links]\n", "questions[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Summary\n", "\n", "On Chrome, open [questions page](https://stackoverflow.com/questions?sort=votes&pagesize=50) and right-click on summary of the top question and choose **Inspect**.\n", "\n", "![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/summary.png?raw=true)\n", "*
Image for Reference*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the summary is inside a `div` tag with a class of `excerpt`. Using this class along with the `div` tag, we can extract all the summary divs in a list." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\r\n", " Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code almost six times faster:\n", "\n", "#include <algorithm>\n", "#include &...\r\n", "
\n" ] } ], "source": [ "summary_divs = body1.select(\"div.excerpt\")\n", "print(summary_divs[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[List comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), to extract all the summaries. \n", "\n", "Here, we will also use the [strip()](https://www.programiz.com/python-programming/methods/string/strip) method on each div's text to remove leading and trailing whitespace from the string." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code almost six times faster:\\n\\n#include <algorithm>\\n#include &...'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summaries = [i.text.strip() for i in summary_divs]\n", "summaries[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Tags\n", "\n", "On Chrome, open the [questions page](https://stackoverflow.com/questions?sort=votes&pagesize=50), right-click on a tag of the top question and choose **Inspect**.\n", "\n", "![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/tags_names.png?raw=true)\n", "*
Image for Reference*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting **tags per question** is the most complex task in this post. Here, we cannot find a unique class or id for each tag, and there are multiple tags per question that we need to store. \n", "\n", "To extract **tags per question**, we will follow a multi-step process:\n", "\n", "* As shown in the figure, individual tags are in a third layer, under two nested div tags, and only the upper div tag has a unique class (`summary`).\n", "    - First, we locate the div with the `summary` class.\n", "    - Now notice that our target div is the third child overall and the second `div` child of that `summary` div. Here, we can use the `:nth-of-type()` CSS pseudo-class to select this 2nd `div` child. It is easy to use, and a few examples can be found [here](https://gist.github.com/yoki/b7f2fcef64c893e307c4c59303ead19a#file-20_search-py). With it, a single selector returns the 2nd `div` child directly, without a separate step to extract the `summary` div first." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags_divs = body1.select(\"div.summary > div:nth-of-type(2)\")\n", "tags_divs[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Now, we can use [list comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40) to extract `a` tags in a list, grouped per question." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[java,\n", " c++,\n", " performance,\n", " optimization,\n", " branch-prediction]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_tags_list = [i.select('a') for i in tags_divs]\n", "\n", "# Printing first question's a tags\n", "a_tags_list[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Now we will run a for loop for going through each question and use [list comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40) inside it, to extract the tags names." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['java', 'c++', 'performance', 'optimization', 'branch-prediction']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags = []\n", "\n", "for a_group in a_tags_list:\n", " tags.append([a.text for a in a_group])\n", "\n", "tags[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Number of votes, answers and views\n", "\n", "On Chrome, open [questions page](https://stackoverflow.com/questions?sort=votes&pagesize=50) and inspect vote, answers and views for the topmost answer.\n", "\n", "![Image for Reference](https://github.com/nveenverma/Projects/blob/master/Exploring%20StackOverflow/votes.png?raw=true)\n", "*
Image for Reference*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### No. of Votes \n", "- They can be found by using the `span` tag along with the `vote-count-post` class and nested `strong` tags" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[23111, 19690]\n" ] } ], "source": [ "vote_spans = body1.select(\"span.vote-count-post strong\")\n", "print(vote_spans[:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[List comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), to extract vote counts. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[23111, 19690, 15321, 11030, 9718]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_of_votes = [int(i.text) for i in vote_spans]\n", "no_of_votes[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I won't include screenshots for the last two attributes.\n", "\n", "### No. of Answers \n", "\n", "- They can be found by using the `div` tag along with the `status` class and nested `strong` tags. Here, we don't use `answered-accepted` because it's not common to all questions: those without an accepted answer have the class `answered` instead. " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[22, 78]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "answer_divs = body1.select(\"div.status strong\")\n", "answer_divs[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[List comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), to extract answer counts. 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[22, 78, 38, 40, 34]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_of_answers = [int(i.text) for i in answer_divs]\n", "no_of_answers[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### No. of Views \n", "- For views, we can see two options. One is short form in number of millions and other is full number of views. We will extract the full version. \n", "- They can be found by using `div` tag along with `supernova` class. Then we need to clean the string and convert it into integer format." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", " 1.4m views\n", "
" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "div_views = body1.select(\"div.supernova\")\n", "div_views[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[List comprehension](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), to extract view counts. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1362267, 7932952, 7011126, 2550002, 2490787]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_of_views = [i['title'] for i in div_views]\n", "no_of_views = [i[:-6].replace(',', '') for i in no_of_views]\n", "no_of_views = [int(i) for i in no_of_views]\n", "no_of_views[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting all of them together in a dataframe" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def get_top_questions(url, question_count):\n", "    # WARNING: Only pass one of these 3 values: 15, 30 or 50,\n", "    # since StackOverflow doesn't display question lists of any other size\n", "    url = url + \"?sort=votes&pagesize={}\".format(question_count)\n", "    \n", "    # Using requests module for downloading webpage content\n", "    response = requests.get(url)\n", "\n", "    # Parsing html data using BeautifulSoup\n", "    soup = bs(response.content, 'html.parser')\n", "    body = soup.find('body')\n", "\n", "    # Extracting Top Questions\n", "    # (use the freshly parsed body, not the global body1 from earlier cells)\n", "    question_links = body.select(\"h3 a.question-hyperlink\")\n", "    error_checking(question_links, question_count) # Error Checking\n", "    questions = [i.text for i in question_links] # questions list\n", "    \n", "    # Extracting Summary\n", "    summary_divs = body.select(\"div.excerpt\")\n", "    error_checking(summary_divs, question_count) # Error Checking\n", "    summaries = [i.text.strip() for i in summary_divs] # summaries list\n", "    \n", "    # Extracting Tags\n", "    tags_divs = body.select(\"div.summary > div:nth-of-type(2)\")\n", "    \n", "    error_checking(tags_divs, question_count) # Error Checking\n", "    a_tags_list = [i.select('a') for i in tags_divs] # tag links\n", "    \n", "    tags = []\n", "\n", "    for a_group in a_tags_list:\n", "        tags.append([a.text for a in a_group]) # tags list\n", "    \n", "    # Extracting Number of votes\n", "    vote_spans = body.select(\"span.vote-count-post strong\")\n", "    error_checking(vote_spans, question_count) # Error Checking\n", "    no_of_votes = [int(i.text) for i in vote_spans] # votes list\n", "    \n", "    # Extracting Number of answers\n", "    answer_divs = body.select(\"div.status strong\")\n", "    error_checking(answer_divs, question_count) # Error Checking\n", "    no_of_answers = [int(i.text) for i in answer_divs] # answers list\n", "    \n", "    # Extracting Number of views\n", "    div_views = body.select(\"div.supernova\")\n", "    \n", "    error_checking(div_views, question_count) # Error Checking\n", "    no_of_views = [i['title'] for i in div_views]\n", "    no_of_views = [i[:-6].replace(',', '') for i in no_of_views] # drop the trailing ' views'\n", "    no_of_views = [int(i) for i in no_of_views] # views list\n", "    \n", "    # Putting all of them together\n", "    df = pd.DataFrame({'question': questions, \n", "                       'summary': summaries, \n", "                       'tags': tags,\n", "                       'no_of_votes': no_of_votes,\n", "                       'no_of_answers': no_of_answers,\n", "                       'no_of_views': no_of_views})\n", "\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting Votes v/s Views v/s Answers" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>question</th><th>summary</th><th>tags</th><th>no_of_votes</th><th>no_of_answers</th><th>no_of_views</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>Why is processing a sorted array faster than p...</td><td>Here is a piece of C++ code that shows some ve...</td><td>[java, c++, performance, optimization, branch-...</td><td>23111</td><td>22</td><td>1362267</td></tr>\n", " <tr><th>1</th><td>How do I undo the most recent local commits in...</td><td>I accidentally committed the wrong files to Gi...</td><td>[git, version-control, git-commit, undo]</td><td>19690</td><td>78</td><td>7932952</td></tr>\n", " <tr><th>2</th><td>How do I delete a Git branch locally and remot...</td><td>I want to delete a branch both locally and rem...</td><td>[git, git-branch, git-remote]</td><td>15321</td><td>38</td><td>7011126</td></tr>\n", " <tr><th>3</th><td>What is the difference between 'git pull' and ...</td><td>Moderator Note: Given that this question has a...</td><td>[git, git-pull, git-fetch]</td><td>11030</td><td>40</td><td>2550002</td></tr>\n", " <tr><th>4</th><td>What is the correct JSON content type?</td><td>I've been messing around with JSON for some ti...</td><td>[json, http-headers, content-type]</td><td>9718</td><td>34</td><td>2490787</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " question \\\n", "0 Why is processing a sorted array faster than p... \n", "1 How do I undo the most recent local commits in... \n", "2 How do I delete a Git branch locally and remot... \n", "3 What is the difference between 'git pull' and ... \n", "4 What is the correct JSON content type? \n", "\n", " summary \\\n", "0 Here is a piece of C++ code that shows some ve... \n", "1 I accidentally committed the wrong files to Gi... \n", "2 I want to delete a branch both locally and rem... \n", "3 Moderator Note: Given that this question has a... \n", "4 I've been messing around with JSON for some ti... \n", "\n", " tags no_of_votes \\\n", "0 [java, c++, performance, optimization, branch-... 23111 \n", "1 [git, version-control, git-commit, undo] 19690 \n", "2 [git, git-branch, git-remote] 15321 \n", "3 [git, git-pull, git-fetch] 11030 \n", "4 [json, http-headers, content-type] 9718 \n", "\n", " no_of_answers no_of_views \n", "0 22 1362267 \n", "1 78 7932952 \n", "2 38 7011126 \n", "3 40 2550002 \n", "4 34 2490787 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "URL2 = 'https://stackoverflow.com/questions'\n", "\n", "df1 = get_top_questions(URL2, 50)\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "f, ax = plt.subplots(3, 1, figsize=(12, 8))\n", "\n", "ax[0].bar(df1.index, df1.no_of_votes)\n", "ax[0].set_ylabel('No of Votes')\n", "\n", "ax[1].bar(df1.index, df1.no_of_views)\n", "ax[1].set_ylabel('No of Views')\n", "\n", "ax[2].bar(df1.index, df1.no_of_answers)\n", "ax[2].set_ylabel('No of Answers')\n", "plt.xlabel('Question Number')\n", "\n", "plt.savefig('votes_vs_views_vs_answers.png', bbox_inches='tight')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we may observe that there is no collinearity between the votes, views and answers related to a question. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Useful Resources:\n", "- [Dataquest Tutorial 1](https://www.dataquest.io/blog/web-scraping-tutorial-python/), [2](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)\n", "- [HackerNoon Tutorial](https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184)\n", "- [RealPython Tutorial](https://realpython.com/python-web-scraping-practical-introduction/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }