{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pinboard\n", "\n", "Pinboard is a social bookmarking site where people share links to content and *tag* them with words that describe the content. These tags are free-form, and each user decides which ones to use.\n", "\n", "Pinboard has a nice [API](https://pinboard.in/api/) for interacting with your own bookmarks, but it offers no way to get all public bookmarks for a tag. Pinboard also makes all tag pages available as RSS, e.g. https://feeds.pinboard.in/rss/t:covid-19, but the feeds don't allow paging back in time.\n", "\n", "So we're going to have to scrape the pages. Fortunately this won't be too difficult with the [requests_html](https://requests-html.kennethreitz.org/) module, because Pinboard has done such a nice job of using [semantic HTML](https://en.wikipedia.org/wiki/Semantic_HTML)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import time\n", "import requests_html\n", "import dateutil.parser\n", "\n", "def pinboard(hashtag):\n", "    # Generator: yields a dict (url, title, created) for each public bookmark with this tag.\n", "    http = requests_html.HTMLSession()\n", "    pinboard_url = 'https://pinboard.in/t:{}'.format(hashtag)\n", "    while True:\n", "        resp = http.get(pinboard_url)\n", "        bookmarks = resp.html.find('.bookmark')\n", "        for b in bookmarks:\n", "            a = b.find('.bookmark_title', first=True)\n", "            yield {\n", "                'url': a.attrs['href'],\n", "                'title': a.text,\n", "                'created': dateutil.parser.parse(b.find('.when', first=True).attrs['title'])\n", "            }\n", "\n", "        # Follow the #top_earlier link to the next page; stop when there isn't one.\n", "        a = resp.html.find('#top_earlier', first=True)\n", "        if not a:\n", "            break\n", "\n", "        next_url = 'https://pinboard.in' + a.attrs['href']\n", "        if pinboard_url == next_url:\n", "            break\n", "\n", "        # Pause between requests to be polite to the server.\n", "        time.sleep(1)\n", "        pinboard_url = next_url" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'url': 'https://healthweather.us/',\n", " 'title': 'US Health Weather Map by Kinsa',\n", " 
'created': datetime.datetime(2020, 3, 25, 10, 0, 11)}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(pinboard('covid-19'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can write all the results to a CSV file. But let's look for a few variants of the tag: covid-19, covid_19, covid19. To avoid repeating the same URLs we can keep track of them and only write each one once." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "urls_seen = set()\n", "# newline='' is what the csv module recommends, so rows aren't separated by blank lines on some platforms.\n", "with open('data/pinboard.csv', 'w', newline='') as fh:\n", "    out = csv.DictWriter(fh, fieldnames=['url', 'created', 'title'])\n", "    out.writeheader()\n", "    for hashtag in ['covid-19', 'covid_19', 'covid19']:\n", "        for bookmark in pinboard(hashtag):\n", "            # Write each URL only once, even if it appears under several tags.\n", "            if bookmark['url'] not in urls_seen:\n", "                out.writerow(bookmark)\n", "                urls_seen.add(bookmark['url'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | url | \n", "created | \n", "title | \n", "
---|---|---|---|
0 | \n", "https://www.seriouseats.com/2020/03/food-safety-and-coronavirus-a-comprehensive-guide.html | \n", "2020-03-25 10:02:34 | \n", "Food Safety and Coronavirus: A Comprehensive Guide | Serious Eats | \n", "
1 | \n", "https://healthweather.us/ | \n", "2020-03-25 10:00:11 | \n", "US Health Weather Map by Kinsa | \n", "
2 | \n", "https://loinc.org/sars-coronavirus-2/ | \n", "2020-03-25 09:35:57 | \n", "SARS Coronavirus 2 – LOINC | \n", "
3 | \n", "https://twitter.com/katemclennan1/status/1242656904913932290?s=09 | \n", "2020-03-25 09:22:56 | \n", "Kate McLennan on Twitter: \"We were asked to deliver a PSA from the Australian govermnent… \" | \n", "
4 | \n", "https://valor.globo.com/empresas/noticia/2020/03/25/para-dono-da-innova-crise-deixara-mais-falidos-que-falecidos.ghtml | \n", "2020-03-25 09:20:22 | \n", "Para dono da Innova, crise deixará mais falidos que falecidos | Empresas | Valor Econômico | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
810 | \n", "https://www.youtube.com/watch?v=mwrMtJ3DYXg&feature=youtu.be | \n", "2020-03-23 01:23:16 | \n", "How to cope when the world is canceled: 6 critical skills - YouTube | \n", "
811 | \n", "https://hunch.net/?p=13762539 | \n", "2020-03-23 01:04:58 | \n", "What is the most effective policy response to the new coronavirus pandemic? – Machine Learning (Theory) | \n", "
812 | \n", "https://docs.google.com/spreadsheets/d/1sJM9dFwbSluv9JsoYA9o5EP7TOcCPf83SO_p23hCyCc/edit#gid=0 | \n", "2020-03-23 01:04:47 | \n", "Medical Mask Pattern Comparison-comment only - Google Sheets | \n", "
813 | \n", "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data | \n", "2020-03-23 01:04:36 | \n", "COVID-19/csse_covid_19_data at master · CSSEGISandData/COVID-19 | \n", "
814 | \n", "https://www.instructables.com/id/AB-Mask-for-a-Nurse-by-a-Nurse/ | \n", "2020-03-23 01:01:43 | \n", "A.B. Mask - for a Nurse by a Nurse : 15 Steps (with Pictures) - Instructables | \n", "
815 rows × 3 columns
\n", "