{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Patrick's Absolute Beginner's Guide to Webscraping- Reflection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction- A Coding Project for Those Who Have Never Seen Code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is, as the title says, an absolute beginner's guide to python and webscraping. And I mean beginner. For the arts student with no prior coding knowledge, who has never seen python or the command line before, who will struggle just to figure out what a jupyter notebook is and how to open it, much less write code in it, this is the notebook for you. In the actual notebook with the code, I explain, to the best of my ability, every single minute part of the code, what means what, what does what, and why we do each step the way we do. If you have even seen code before, this will probably be too simple for you, but if you are one with \"no tech knowledge\" then you're in the right place. \n", "\n", "Webscraping is pretty much what it sounds like, you make a coding program that automatically searches a webpages html (all the underlying code that makes a website look the way it does) and returns all the information from the webpage that you specified in your program. Regardless of whether you understand how exactly the code functions, all you need to realize is that this kind of program can be a really effective way to collect data en masse from the web, allowing you to compile your own databases that are far bigger than anything you would be able to put together manually. \n", "\n", "For this notebook, we're going to work on scraping the Algonquin Park (APK) archives online in order to get a list in a csv spreadsheet file of all the links to photos that relate to the keyword 'fire'. At the end of the notebook, we will have an easy to access list of every photo from the APK Archives that deals with our search term. For me, this will be an effective tool for my ongoing research on Algonquin. \n", "\n", "More broadly, this notebook is simply meant to show you, in the most absolute simple terms possible, how to put together a webscraper system wherein you'll be able to sub in the url to whatever webpage or online archive is most relevant to you. Overall, I just want to expose you to some of the very basics of phython and jupyter notebooks and hopefully give you a soft landing into the world of coding. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Data (The url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All this project requires is the url of the webpage search function we choose to use. In this case it is the url for the photos only search function on the APK Archives. \n", "\n", "This url provides two challenges. Firstly, it provides its results in a random order every time the url is refreshed, meaning that any sorting or organization for your final csv list will have to be done in the csv itself, rather than directly from the webpage. Secondly, the keyword 'fire' gives about 200 results, but these results are spread out across two separate pages, so the scraper in our notebook deals with the first page, but you would have to run the notebook again with the scraper for the second page in order to get the full 200 results generated by the search. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How does the Scraper Function? 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will give a brief overview of how the coding functions here, but please see the primary notebook for step by step explanation in the comments. \n", "\n", "Essentially a server holds all the data for a particular website. My scraper starts by establishing contact with the server so that your program can access the data held by the webpage's server. This is what the response object and the request library do. When we print the response object and get the result '200' it means that our communication with the server has been successful, so we can now access the entirety of the site's underlying code (the html). \n", "\n", "We then make a response object while calling on the Beautiful Soup package so that we can then take that raw html and search it and return specific results. In other words, beautiful soup goes through that raw html and can pick out whatever elements we specify. In order to specifiy elements however, we first need to know what element we need. This is why in the main notebook I encourage youto use the inspect function on your browser to get familiar with the html code and figure out what tags are associated with the content you want. \n", "\n", "Once you've made a loop that allows you to extract all the html elements you want (the photo links and their titles) we then use another code series to create and open a csv (spreadsheet) file and we print all of our parsed html to this csv file. From their, we can then use the spreadsheet's functions to clean up the data from the extra html tags that are still associated with the data we have scraped. [^1]\n", "\n", "[1] Melanie Walsh, \"Webscraping Part 1,\" Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part1.html " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going Further" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few obvious ways that we could go further with this. \n", "\n", "Firstly, the end product of my scraper still has quite a few extra html tags associated with each link and title. I have suggested in the notebook that we can clean up these links once they have been printed to the csv, but of course it would be much better if we could clean these up in the code en masse before they go to the spreadsheet. Melanie Walsh's tutorial shows how to use regular expressions and how to make functions out of these expressions to accomplish this task. These parts were beyond me, hence why I stopped my coding where I did, but if you are comfortable doing more, then I would suggest start here. [^1] \n", "\n", "Secondly, we don't have to stop at links to the photos and their titles. As long as a we have a grasp of the APK Archive's page, we could scrape for any html element we like. For example, alongside the links, we could also scrape the images associated with each link and in so doing we could effectively download the archive onto our own machines. [2] \n", "\n", "Finally, check out Pattel's article on webscraping which describes the many ways that these techniques are used across society today, not just in history. He writes about the various applications for businesses and marketing, background checks, aggregate data collection, and many other interesting ways of using this technology. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Going Further" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There are a few obvious ways that we could go further with this. \n", "\n", "Firstly, the end product of my scraper still has quite a few extra HTML tags attached to each link and title. I have suggested in the notebook that we can clean up these links once they have been printed to the CSV, but of course it would be much better if we could clean them up in the code, en masse, before they go to the spreadsheet. Melanie Walsh's tutorial shows how to use regular expressions, and how to make functions out of those expressions, to accomplish this task. These parts were beyond me, which is why I stopped my coding where I did, but if you are comfortable doing more, then I would suggest starting here (a small sketch of one possible clean-up approach follows at the end of this section). [1] \n", "\n", "Secondly, we don't have to stop at links to the photos and their titles. As long as we have a grasp of the APK Archives' pages, we could scrape for any HTML element we like. For example, alongside the links, we could also scrape the images associated with each link, and in so doing we could effectively download the archive onto our own machines. [2] \n", "\n", "Finally, check out Patel's article on webscraping, which describes the many ways that these techniques are used across society today, not just in history. He writes about the various applications for business and marketing, background checks, aggregate data collection, and many other interesting uses of this technology. As I stated in the main notebook, he makes the astute judgement that academics deal with data, so it stands to reason that technology that helps collect data can be a powerful tool. Take a look at the page for inspiration on how to apply the skills you've learned. [3]\n", "\n", "[1] Melanie Walsh, \"Webscraping Part 1,\" Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part1.html\n", "\n", "Melanie Walsh, \"Webscraping Part 2,\" Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part2.html\n", "\n", "[2] Shawn Graham, \"Simple Scraper,\" https://dhmuse.netlify.app/notebooks/simple-scraper\n", "\n", "[3] Hiren Patel, \"How Webscraping is Transforming the World with its Applications,\" Towards Data Science, https://towardsdatascience.com/https-medium-com-hiren787-patel-web-scraping-applications-a6f370d316f4" ] },
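{ "cell_type": "markdown", "metadata": {}, "source": [ "As a starting point for that first suggestion, the cell below sketches two possible ways of cleaning a scraped link before it reaches the spreadsheet. The raw_tag string is a made-up example of the kind of row the basic scraper produces, not a real entry from the APK Archives, and this is just one possible approach rather than the exact method from Walsh's tutorial. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Two possible ways to strip the extra HTML tags from a scraped link\n", "# before it is written to the CSV.\n", "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "# A made-up example of the kind of raw row the basic scraper produces.\n", "raw_tag = '<a href=\"photos/12345.jpg\">Fire tower photograph</a>'\n", "\n", "# Option 1: let Beautiful Soup pull out just the pieces we want.\n", "tag = BeautifulSoup(raw_tag, 'html.parser').a\n", "title = tag.get_text(strip=True)  # the visible text of the link\n", "link = tag.get('href')            # the address the link points to\n", "print(title, link)\n", "\n", "# Option 2: a regular expression that deletes anything between angle brackets,\n", "# leaving only the visible text.\n", "just_text = re.sub(r'<[^>]+>', '', raw_tag)\n", "print(just_text)" ] },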
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Significance: Gathering Data, Understanding Code" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In terms of what this project does directly, it provides me with a tool to gather data on my topic of choice, which is in this case the topic of fire in Algonquin Park. I have been working on a research project on this topic all semester, and next year I will likely be in an MA program that further explores these ideas, so this type of scraper is a valuable way for me to collect online information from a source like the online archives en masse, which is to say much more quickly than going through it manually. To link it to the Ottawa GLAM sector, the MA project examines the ways that the timber trade connected places like the Ottawa Valley to a growing industrial economy in London after the Napoleonic Wars. Ottawa was part of the route the logs travelled out to ports in Quebec, so in this sense my project is somewhat related to the history of Ottawa as well. \n", "\n", "More broadly, as I stated in the introduction, the primary purpose is simply to expose people to code who have never seen a programming language before. This is why I have gone to such lengths to explain every single step of the code, right down to each letter, so that novices can follow along. My experience has been that most coding tutorials assume a fair amount of basic knowledge, but for someone like me, who has had no training in code whatsoever, even the most basic assumptions and code that most tutorials use trip me up, so this is why I've gone as step by step as possible. Python is a genuine language, and people need to know their letters before they can read words, and they need to know their words before they can actually use sentences. I'm trying to start at the letter level with this notebook and gradually build up so that anybody who was in my position of total ignorance at the start of January can get started on their coding journey without too much frustration. The beauty of this project is that HTML underlies just about every website, so once you get the process for the scraper figured out, you can substitute in the URL of whatever site you please for whatever project you're working on. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion: A Simple Start to Coding" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In summary, this has been a notebook designed for the most uninitiated of the uninitiated to coding. We went through how to make your code communicate with a website, the APK Archives in this case, and then walked through how to use Beautiful Soup to search a page's raw HTML and return only the elements you specified. With our links and their titles in hand, we then created a CSV file, printed those results into our spreadsheet, and saved it for easy access whenever our research needs require it. \n", "\n", "Webscraping has numerous applications and is used by many businesses and professions today. Academics are no exception. Personally, this scraper provides me with a tool to collect photos from the APK Archives for my research, but more importantly this notebook serves as an introduction for any other arts students who are interested in digital history but who have no experience with code whatsoever. For those people, I hope to have made the introductory process less painful than you might have expected. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Melanie Walsh, \"Webscraping Part 1,\" Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part1.html\n", "\n", "Melanie Walsh, \"Webscraping Part 2,\" Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part2.html\n", "\n", "Martin Breuss, \"Beautiful Soup: Build a Web Scraper With Python,\" Real Python, https://realpython.com/beautiful-soup-web-scraper-python/\n", "\n", "\"Writing Scraped Links to a CSV File Using Python 3,\" Stack Overflow, https://stackoverflow.com/questions/47372961/writing-to-scraped-links-to-a-csv-file-using-python3\n", "\n", "\"What does the newline='' argument do?\", Codecademy Discuss, https://discuss.codecademy.com/t/what-does-the-newline-argument-do/463575\n", "\n", "Andrew Schwan, roommate and computer enthusiast\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }