{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pinboard\n", "\n", "Pinboard is a social bookmarking site where people share links to content and *tag* them by assigning a word that describes the content. These tags are free-form, and each user decides which ones to use.\n", "\n", "Pinboard has a nice [API](https://pinboard.in/api/) for interacting with your own bookmarks, but not for getting all public bookmarks for a tag. Pinboard also makes all tag pages available as RSS, e.g. https://feeds.pinboard.in/rss/t:covid-19 but it unfortunately doesn't allow paging back in time.\n", "\n", "So unfortunately we're going to have to scrape the pages. But fortunately this won't be too difficult with the [requests_html](https://requests-html.kennethreitz.org/) module because Pinboard has done such a nice job of using [semantic html](https://en.wikipedia.org/wiki/Semantic_HTML)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import time\n", "import requests_html\n", "import dateutil.parser\n", "\n", "def pinboard(hashtag):\n", " http = requests_html.HTMLSession()\n", " pinboard_url = 'https://pinboard.in/t:{}'.format(hashtag)\n", " while True:\n", " resp = http.get(pinboard_url)\n", " bookmarks = resp.html.find('.bookmark')\n", " for b in bookmarks:\n", " a = b.find('.bookmark_title', first=True)\n", " yield {\n", " 'url': a.attrs['href'],\n", " 'title': a.text,\n", " 'created': dateutil.parser.parse(b.find('.when', first=True).attrs['title'])\n", " }\n", " \n", " a = resp.html.find('#top_earlier', first=True)\n", " if not a:\n", " break\n", " \n", " next_url = 'https://pinboard.in' + a.attrs['href']\n", " if pinboard_url == next_url:\n", " break\n", " \n", " time.sleep(1)\n", " pinboard_url = next_url" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'url': 'https://healthweather.us/',\n", " 'title': 'US Health Weather Map by Kinsa',\n", " 'created': datetime.datetime(2020, 3, 25, 10, 0, 11)}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(pinboard('covid-19'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can write all the results to a CSV file. But lets look for a few variants: covid-19, covid_19, covid19. To avoid repeating the same urls we can keep track of them and only write them once." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "urls_seen = set()\n", "with open('data/pinboard.csv', 'w') as fh:\n", " out = csv.DictWriter(fh, fieldnames=['url', 'created', 'title'])\n", " out.writeheader()\n", " for hashtag in ['covid-19', 'covid_19', 'covid19']:\n", " for bookmark in pinboard(hashtag):\n", " if bookmark['url'] not in urls_seen:\n", " out.writerow(bookmark)\n", " urls_seen.add(bookmark['url']) " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlcreatedtitle
0https://www.seriouseats.com/2020/03/food-safety-and-coronavirus-a-comprehensive-guide.html2020-03-25 10:02:34Food Safety and Coronavirus: A Comprehensive Guide | Serious Eats
1https://healthweather.us/2020-03-25 10:00:11US Health Weather Map by Kinsa
2https://loinc.org/sars-coronavirus-2/2020-03-25 09:35:57SARS Coronavirus 2 – LOINC
3https://twitter.com/katemclennan1/status/1242656904913932290?s=092020-03-25 09:22:56Kate McLennan on Twitter: \"We were asked to deliver a PSA from the Australian govermnent… \"
4https://valor.globo.com/empresas/noticia/2020/03/25/para-dono-da-innova-crise-deixara-mais-falidos-que-falecidos.ghtml2020-03-25 09:20:22Para dono da Innova, crise deixará mais falidos que falecidos | Empresas | Valor Econômico
............
810https://www.youtube.com/watch?v=mwrMtJ3DYXg&feature=youtu.be2020-03-23 01:23:16How to cope when the world is canceled: 6 critical skills - YouTube
811https://hunch.net/?p=137625392020-03-23 01:04:58What is the most effective policy response to the new coronavirus pandemic? – Machine Learning (Theory)
812https://docs.google.com/spreadsheets/d/1sJM9dFwbSluv9JsoYA9o5EP7TOcCPf83SO_p23hCyCc/edit#gid=02020-03-23 01:04:47Medical Mask Pattern Comparison-comment only - Google Sheets
813https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data2020-03-23 01:04:36COVID-19/csse_covid_19_data at master · CSSEGISandData/COVID-19
814https://www.instructables.com/id/AB-Mask-for-a-Nurse-by-a-Nurse/2020-03-23 01:01:43A.B. Mask - for a Nurse by a Nurse : 15 Steps (with Pictures) - Instructables
\n", "

815 rows × 3 columns

\n", "
" ], "text/plain": [ " url \\\n", "0 https://www.seriouseats.com/2020/03/food-safety-and-coronavirus-a-comprehensive-guide.html \n", "1 https://healthweather.us/ \n", "2 https://loinc.org/sars-coronavirus-2/ \n", "3 https://twitter.com/katemclennan1/status/1242656904913932290?s=09 \n", "4 https://valor.globo.com/empresas/noticia/2020/03/25/para-dono-da-innova-crise-deixara-mais-falidos-que-falecidos.ghtml \n", ".. ... \n", "810 https://www.youtube.com/watch?v=mwrMtJ3DYXg&feature=youtu.be \n", "811 https://hunch.net/?p=13762539 \n", "812 https://docs.google.com/spreadsheets/d/1sJM9dFwbSluv9JsoYA9o5EP7TOcCPf83SO_p23hCyCc/edit#gid=0 \n", "813 https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data \n", "814 https://www.instructables.com/id/AB-Mask-for-a-Nurse-by-a-Nurse/ \n", "\n", " created \\\n", "0 2020-03-25 10:02:34 \n", "1 2020-03-25 10:00:11 \n", "2 2020-03-25 09:35:57 \n", "3 2020-03-25 09:22:56 \n", "4 2020-03-25 09:20:22 \n", ".. ... \n", "810 2020-03-23 01:23:16 \n", "811 2020-03-23 01:04:58 \n", "812 2020-03-23 01:04:47 \n", "813 2020-03-23 01:04:36 \n", "814 2020-03-23 01:01:43 \n", "\n", " title \n", "0 Food Safety and Coronavirus: A Comprehensive Guide | Serious Eats \n", "1 US Health Weather Map by Kinsa \n", "2 SARS Coronavirus 2 – LOINC \n", "3 Kate McLennan on Twitter: \"We were asked to deliver a PSA from the Australian govermnent… \" \n", "4 Para dono da Innova, crise deixará mais falidos que falecidos | Empresas | Valor Econômico \n", ".. ... \n", "810 How to cope when the world is canceled: 6 critical skills - YouTube \n", "811 What is the most effective policy response to the new coronavirus pandemic? – Machine Learning (Theory) \n", "812 Medical Mask Pattern Comparison-comment only - Google Sheets \n", "813 COVID-19/csse_covid_19_data at master · CSSEGISandData/COVID-19 \n", "814 A.B. Mask - for a Nurse by a Nurse : 15 Steps (with Pictures) - Instructables \n", "\n", "[815 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "\n", "# prevent dataframe columns from being truncated\n", "pandas.set_option('display.max_columns', None)\n", "pandas.set_option('display.width', None)\n", "pandas.set_option('display.max_colwidth', None)\n", "\n", "df = pandas.read_csv('data/pinboard.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just out of curiousity is there currently any overlap with the IIPC seeds?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6',\n", " 'https://github.com/CSSEGISandData/COVID-19',\n", " 'https://www.brookings.edu/research/the-global-macroeconomic-impacts-of-covid-19-seven-scenarios/',\n", " 'https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html'}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iipc = pandas.read_csv('data/iipc.csv')\n", "overlap = set(iipc.url).intersection(set(df.url))\n", "overlap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice, there are a few!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }