{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IIPC\n", "\n", "This notebook explores the seeds that are being crawled in the [Novel Coronavirus COVID-19](https://archive-it.org/collections/13529/) Archive-It collection. It uses the [Archive-It Partner API](https://support.archive-it.org/hc/en-us/articles/360032747311-Access-your-account-with-the-Archive-It-Partner-API) which does not seem to require a key for public collections (yay). More context for this collecting effort can be found in [this IIPC blog post](https://blog.archive.org/2020/02/13/archiving-information-on-the-novel-coronavirus-covid-19/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Import\n", "\n", "First let's import some things we're going to need later. It's useful to do them all here at the beginning in case you want to skip parts of the data collection and use the data that is already present in the repository." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import altair\n", "import pandas\n", "import wayback\n", "import datetime\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Get the Seeds\n", "\n", "First let's download the seeds in the collection and save them as a CSV. If you want to use the CSV that's already here you can move on to **Section 2**. We're going to write out the data to a file called `iipc.csv`. You can see the type of data that is returned by looking at [this API response](https://partner.archive-it.org/api/seed?collection=13529&limit=100). The Archive-It Partner API has a route for returning seeds for a given collection that is indicated with the `collection` parameter. We can use the `limit` and `offset` parameters to walk through the results page by page without getting all of them at once."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "url = 'https://partner.archive-it.org/api/seed'\n", "params = {\n", "    \"collection\": 13529,\n", "    \"limit\": 100\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create a loop that keeps fetching results and incrementing the offset until there are no more seeds. We could have used the CSV output, but it is useful to normalize some of the structured metadata. This will likely take a few minutes to run." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def first_val(meta, name):\n", "    # Return the first value of a metadata field, or None if the field is absent.\n", "    return meta[name][0][\"value\"] if name in meta else None\n", "\n", "params['offset'] = 0\n", "\n", "# Use a with block so the CSV is flushed and closed when the loop finishes.\n", "with open('data/iipc.csv', 'w', newline='') as fh:\n", "    out = csv.writer(fh)\n", "    out.writerow([\n", "        \"id\",\n", "        \"url\",\n", "        \"creator\",\n", "        \"created\",\n", "        \"updated\",\n", "        \"crawl_definition\",\n", "        \"title\",\n", "        \"description\",\n", "        \"language\",\n", "        \"tld\"\n", "    ])\n", "\n", "    while True:\n", "        resp = requests.get(url, params=params)\n", "        seeds = resp.json()\n", "        if len(seeds) == 0:\n", "            break\n", "\n", "        for seed in seeds:\n", "            meta = seed[\"metadata\"]\n", "            out.writerow([\n", "                seed[\"id\"],\n", "                seed[\"url\"],\n", "                seed[\"created_by\"],\n", "                seed[\"created_date\"],\n", "                seed[\"last_updated_date\"],\n", "                seed[\"crawl_definition\"],\n", "                first_val(meta, \"Title\"),\n", "                first_val(meta, \"Description\"),\n", "                first_val(meta, \"Language\"),\n", "                first_val(meta, \"Top-Level Domain\")\n", "            ])\n", "\n", "        # Move on to the next page of results.\n", "        params['offset'] += params['limit']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "url = 'https://partner.archive-it.org/api/seed'\n", "params = {\n", "    \"collection\": 13529,\n", "    \"offset\": 0,\n", "    \"limit\": 100\n", "}\n", "\n", "# Walk the pages again and print one particular seed URL if it is present.\n", "while True:\n", "    resp = requests.get(url, params=params)\n", "    seeds = resp.json()\n", "    if len(seeds) == 0:\n", "        break\n", "    for seed in seeds:\n", "        if seed['url'] == 'https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/':\n", "            print(seed['url'])\n", "    params['offset'] += len(seeds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now you should hopefully see an updated `iipc.csv`!\n", "\n", "## 2. Display the Seeds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First let's load our `iipc.csv` into a Pandas DataFrame where we can more easily manipulate it." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id url creator \\\n", "0 2147692 http://coronavirus.fr/ alext \n", "1 2147693 http://english.whiov.cas.cn/ alext \n", "2 2147694 http://www.china-embassy.or.jp/chn/ alext \n", "3 2147695 http://www.china-embassy.or.jp/jpn/ alext \n", "4 2147696 https://cadenaser.com/tag/ncov/a/ alext \n", "... ... ... ... \n", "2794 2173031 https://www.suntrust.com/resource-center/comme... nicolab \n", "2795 2148539 https://www.eluniversal.com/economia/60496/cor... alext \n", "2796 2149377 https://ue.delegfrance.org/coronavirus-activat... alext \n", "2797 2149468 https://www.healthdirect.gov.au/coronavirus alext \n", "2798 2149123 https://www.youtube.com/watch?v=N6BAkWzrses alext \n", "\n", " created updated \\\n", "0 2020-02-21 03:43:18.662353+00:00 2020-03-16 19:53:45.860949+00:00 \n", "1 2020-02-21 03:43:18.706571+00:00 2020-03-16 19:52:28.575749+00:00 \n", "2 2020-02-21 03:43:18.739126+00:00 2020-03-16 19:53:03.086729+00:00 \n", "3 2020-02-21 03:43:18.766308+00:00 2020-03-16 19:54:02.280945+00:00 \n", "4 2020-02-21 03:43:18.791716+00:00 2020-03-16 19:54:19.694418+00:00 \n", "... ... ... \n", "2794 2020-03-26 15:41:06.629121+00:00 2020-03-26 15:41:06.629220+00:00 \n", "2795 2020-02-21 04:11:12.713039+00:00 2020-03-16 19:53:55.500654+00:00 \n", "2796 2020-02-21 04:28:15.941569+00:00 2020-03-16 19:53:43.544395+00:00 \n", "2797 2020-02-21 04:29:30.095448+00:00 2020-03-16 19:52:16.008948+00:00 \n", "2798 2020-02-21 04:18:08.093558+00:00 2020-03-16 19:53:35.401711+00:00 \n", "\n", " crawl_definition title \\\n", "0 31104294373 Epicorem. Ecoépidémiologie \n", "1 31104294373 Wuhan Institute of Virulogy, official page in ... \n", "2 31104294373 中华人民共和国驻日本大使馆 \n", "3 31104294373 中華人民共和国駐日本国大使館 \n", "4 31104294373 Coronavirus de Wuhan \n", "... ... ... \n", "2794 31104300763 NaN \n", "2795 31104294373 Coronavirus afecta economía mundial y rutas co... \n", "2796 31104294373 Délégation France UE. Coronavirus : Activation... 
\n", "2797 31104297068 Coronavirus disease (COVID-19) \n", "2798 31104294373 CORONAVIRUS PLANO WARNING - YouTube \n", "\n", " description language tld \n", "0 Medical/Scientific aspects French .fr \n", "1 Health Organisation English .cn \n", "2 Embassy Chinese .jp \n", "3 Embassy Japanese .jp \n", "4 Cadena Ser Spanish .com \n", "... ... ... ... \n", "2794 NaN NaN NaN \n", "2795 political aspects,economic aspects, diplomacy Spanish .com \n", "2796 Institutional website French .org \n", "2797 Government health information English .au \n", "2798 video, containment efforts Portuguese .com \n", "\n", "[2799 rows x 10 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seeds = pandas.read_csv('data/iipc.csv', parse_dates=[\"created\", \"updated\"])\n", "seeds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can sort them by created time in ascending order, and save them again. This might make it easier to compare them over time with `git diff`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id url creator \\\n", "0 2147692 http://coronavirus.fr/ alext \n", "636 2147692 http://coronavirus.fr/ alext \n", "1 2147693 http://english.whiov.cas.cn/ alext \n", "2153 2147693 http://english.whiov.cas.cn/ alext \n", "849 2147694 http://www.china-embassy.or.jp/chn/ alext \n", "2 2147694 http://www.china-embassy.or.jp/chn/ alext \n", "1557 2147695 http://www.china-embassy.or.jp/jpn/ alext \n", "3 2147695 http://www.china-embassy.or.jp/jpn/ alext \n", "4 2147696 https://cadenaser.com/tag/ncov/a/ alext \n", "1757 2147697 https://doktor.frettabladid.is/sjukdomur/27626-2/ alext \n", "\n", " created updated \\\n", "0 2020-02-21 03:43:18.662353+00:00 2020-03-16 19:53:45.860949+00:00 \n", "636 2020-02-21 03:43:18.662353+00:00 2020-03-16 19:53:45.860949+00:00 \n", "1 2020-02-21 03:43:18.706571+00:00 2020-03-16 19:52:28.575749+00:00 \n", "2153 2020-02-21 03:43:18.706571+00:00 2020-03-16 19:52:28.575749+00:00 \n", "849 2020-02-21 03:43:18.739126+00:00 2020-03-16 19:53:03.086729+00:00 \n", "2 2020-02-21 03:43:18.739126+00:00 2020-03-16 19:53:03.086729+00:00 \n", "1557 2020-02-21 03:43:18.766308+00:00 2020-03-16 19:54:02.280945+00:00 \n", "3 2020-02-21 03:43:18.766308+00:00 2020-03-16 19:54:02.280945+00:00 \n", "4 2020-02-21 03:43:18.791716+00:00 2020-03-16 19:54:19.694418+00:00 \n", "1757 2020-02-21 03:43:18.814377+00:00 2020-03-16 19:54:20.668796+00:00 \n", "\n", " crawl_definition title \\\n", "0 31104294373 Epicorem. Ecoépidémiologie \n", "636 31104294373 Epicorem. Ecoépidémiologie \n", "1 31104294373 Wuhan Institute of Virulogy, official page in ... \n", "2153 31104294373 Wuhan Institute of Virulogy, official page in ... \n", "849 31104294373 中华人民共和国驻日本大使馆 \n", "2 31104294373 中华人民共和国驻日本大使馆 \n", "1557 31104294373 中華人民共和国駐日本国大使館 \n", "3 31104294373 中華人民共和国駐日本国大使館 \n", "4 31104294373 Coronavirus de Wuhan \n", "1757 31104294373 Allt sem þú þarft að vita um Kóróna veirur (co... 
\n", "\n", " description language tld \n", "0 Medical/Scientific aspects French .fr \n", "636 Medical/Scientific aspects French .fr \n", "1 Health Organisation English .cn \n", "2153 Health Organisation English .cn \n", "849 Embassy Chinese .jp \n", "2 Embassy Chinese .jp \n", "1557 Embassy Japanese .jp \n", "3 Embassy Japanese .jp \n", "4 Cadena Ser Spanish .com \n", "1757 Health care information Icelandic .is " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seeds = seeds.sort_values('created')\n", "# index=False keeps the download-order row index out of the file, so diffs stay clean.\n", "seeds.to_csv('data/iipc.csv', index=False)\n", "seeds.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Languages\n", "\n", "We can see that there are a large number of Portuguese seeds. I guess because someone involved in web archiving in Portugal or Brazil got busy." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "altair.Chart(seeds).mark_bar().encode(\n", "    altair.X('language', title='Language'),\n", "    altair.Y('count(id)')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Created\n", "\n", "We can see that the vast majority of these seeds were entered into Archive-It on February 20, 2020, presumably from the spreadsheet sitting behind the Google Form." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "altair.Chart(seeds).mark_bar().encode(\n", " altair.X('monthdate(created)', title='Created'),\n", " altair.Y('count(id)')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Last Update\n", "\n", "Similarly we can look to see when the last update time was for each seed." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "altair.Chart(seeds).mark_bar().encode(\n", "    altair.X('monthdate(updated)', title='Updates'),\n", "    altair.Y('count(id)')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like most of the seeds were last updated a few days ago. But does this mean that was the last time they were crawled?\n", "\n", "## 6. Get the Crawls\n", "\n", "Oddly I couldn't seem to get any of the crawl-related Partner API endpoints to work. Maybe I need to have created the crawls? At any rate, I can use the URL to look directly in the Wayback Machine to see what is available. The EDGI folks have created a nice [Wayback](https://wayback.readthedocs.io/en/latest/usage.html) module that lets you easily look up URLs in the Wayback Machine (it uses the Wayback Machine's CDX API behind the scenes).\n", "\n", "This can take some time, so I'm going to save off the results in a `crawls.csv`. If you prefer to use the stored `crawls.csv` you can skip ahead to **Section 8**. This will collect crawl information for these URLs from 2019-10-01 on so we can look at their coverage before and after the project started."
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fa2larm.cz%2F2020%2F02%2Fslavoj-zizek-melancholicka-krasa-virove-pandemie%2F&from=20191001000000&showResumeKey=true&resolveRevisits=true\n", "403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true\n", "403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.cagle.com%2Fdave-granlund%2F2020%2F01%2Fcoronavirus-usa&from=20191001000000&showResumeKey=true&resolveRevisits=true\n", "403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fpoliticalcartoons.com%2F%3Fs%3Dcoronavirus&from=20191001000000&showResumeKey=true&resolveRevisits=true\n" ] } ], "source": [ "wb = wayback.WaybackClient()\n", "\n", "# Use a with block so the CSV is flushed and closed when the loop finishes.\n", "with open('data/crawls.csv', 'w', newline='') as fh:\n", "    out = csv.writer(fh)\n", "    out.writerow(['timestamp', 'url', 'status_code', 'archive_url'])\n", "\n", "    for index, row in seeds.iterrows():\n", "        try:\n", "            for crawl in wb.search(row.url, from_date=datetime.datetime(2019, 10, 1)):\n", "                out.writerow([\n", "                    crawl.timestamp.isoformat(),\n", "                    crawl.url,\n", "                    crawl.status_code,\n", "                    crawl.view_url\n", "                ])\n", "        except Exception as e:\n", "            # Some lookups fail (e.g. 403 Forbidden); report them and move on.\n", "            print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's interesting that some of the URLs are forbidden for viewing. I'm not sure what's going on there. One important thing to keep in mind is that these URLs could have been crawled by other users of Archive-It or by the Internet Archive's own crawlers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. View the Crawls\n", "\n", "Now let's load in the `crawls.csv` as a DataFrame and look at the number of crawls over time. It's actually useful to save a sorted version of the `crawls.csv` so that it can easily be diffed with previous versions." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp url \\\n", "15742 2019-10-01 01:24:55 http://www.dw.com/ \n", "15743 2019-10-01 01:24:55 https://www.dw.com/ \n", "69764 2019-10-01 01:56:12 https://www.colorado.gov/cdphe \n", "69195 2019-10-01 02:38:36 https://www.healthlinkbc.ca/ \n", "26215 2019-10-01 03:14:00 https://cn.ambafrance.org/ \n", "... ... ... \n", "67879 2020-03-27 15:01:46 https://www.ecdc.europa.eu/en/novel-coronaviru... \n", "65286 2020-03-27 15:02:42 https://news.ifeng.com/c/special/7tPlDSzDgVk \n", "21852 2020-03-27 15:03:11 https://www.nbcnews.com/health/coronavirus \n", "65287 2020-03-27 15:11:04 https://news.ifeng.com/c/special/7tPlDSzDgVk \n", "65288 2020-03-27 15:19:26 https://news.ifeng.com/c/special/7tPlDSzDgVk \n", "\n", " status_code archive_url \n", "15742 NaN http://web.archive.org/web/20191001012455/http... \n", "15743 NaN http://web.archive.org/web/20191001012455/http... \n", "69764 200.0 http://web.archive.org/web/20191001015612/http... \n", "69195 200.0 http://web.archive.org/web/20191001023836/http... \n", "26215 200.0 http://web.archive.org/web/20191001031400/http... \n", "... ... ... \n", "67879 200.0 http://web.archive.org/web/20200327150146/http... \n", "65286 200.0 http://web.archive.org/web/20200327150242/http... \n", "21852 200.0 http://web.archive.org/web/20200327150311/http... \n", "65287 200.0 http://web.archive.org/web/20200327151104/http... \n", "65288 200.0 http://web.archive.org/web/20200327151926/http... \n", "\n", "[70752 rows x 4 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crawls = pandas.read_csv('data/crawls.csv', parse_dates=['timestamp'])\n", "crawls = crawls.sort_values('timestamp')\n", "crawls.to_csv('data/crawls.csv')\n", "crawls" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date crawls\n", "0 2019-10-01 22\n", "1 2019-10-02 49\n", "2 2019-10-03 22\n", "3 2019-10-04 52\n", "4 2019-10-05 37\n", ".. ... ...\n", "174 2020-03-23 1417\n", "175 2020-03-24 1405\n", "176 2020-03-25 1242\n", "177 2020-03-26 1998\n", "178 2020-03-27 1055\n", "\n", "[179 rows x 2 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crawls_per_day = crawls.set_index('timestamp').resample('1D')['url'].count()\n", "crawls_per_day = crawls_per_day.reset_index()\n", "crawls_per_day.columns = ['date', 'crawls']\n", "crawls_per_day" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "altair.Chart(crawls_per_day, width=800).mark_bar().encode(\n", "    altair.X('date', title='Crawl Date'),\n", "    altair.Y('crawls', title='Crawls')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Missing Crawls\n", "\n", "We can definitely see these URLs are being crawled a whole lot more since the start of the project. But the graph shows what has been crawled (irrespective of who did it). It also doesn't show which seed URLs have not been crawled yet.\n", "\n", "To see what might be missing let's first group our crawl data by url, and count how many crawls there have been for each url." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "url\n", "http://9news.com.au/coronavirus 2\n", "http://abcnews.go.com/Health/1300-people-died-flu-year/story?id=67754182 71\n", "http://abola.pt/africa/2020-02-01/angola-entre-os-paises-africanos-com-maior-risco-de-contagio-do-coronavirus/827264 1\n", "http://abola.pt/nnh/2020-02-03/formula-1-coronavirus-ameaca-gp-da-china/827542 1\n", "http://albertahealthservices.ca/ 2\n", "Name: crawls, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crawls_by_url = crawls.groupby('url').count().timestamp\n", "crawls_by_url.name = 'crawls'\n", "crawls_by_url.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we can take our `seeds` DataFrame and index it by URL so that we can add our `crawls_by_url` series to it, since it is also indexed by `url`. It is kinda nice how pandas makes this join easy. The use of `fillna` there is to convert any null values (where there have been no crawls yet) to 0." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id creator \\\n", "url \n", "http://coronavirus.fr/ 2147692 alext \n", "http://coronavirus.fr/ 2147692 alext \n", "http://english.whiov.cas.cn/ 2147693 alext \n", "http://english.whiov.cas.cn/ 2147693 alext \n", "http://www.china-embassy.or.jp/chn/ 2147694 alext \n", "\n", " created \\\n", "url \n", "http://coronavirus.fr/ 2020-02-21 03:43:18.662353+00:00 \n", "http://coronavirus.fr/ 2020-02-21 03:43:18.662353+00:00 \n", "http://english.whiov.cas.cn/ 2020-02-21 03:43:18.706571+00:00 \n", "http://english.whiov.cas.cn/ 2020-02-21 03:43:18.706571+00:00 \n", "http://www.china-embassy.or.jp/chn/ 2020-02-21 03:43:18.739126+00:00 \n", "\n", " updated \\\n", "url \n", "http://coronavirus.fr/ 2020-03-16 19:53:45.860949+00:00 \n", "http://coronavirus.fr/ 2020-03-16 19:53:45.860949+00:00 \n", "http://english.whiov.cas.cn/ 2020-03-16 19:52:28.575749+00:00 \n", "http://english.whiov.cas.cn/ 2020-03-16 19:52:28.575749+00:00 \n", "http://www.china-embassy.or.jp/chn/ 2020-03-16 19:53:03.086729+00:00 \n", "\n", " crawl_definition \\\n", "url \n", "http://coronavirus.fr/ 31104294373 \n", "http://coronavirus.fr/ 31104294373 \n", "http://english.whiov.cas.cn/ 31104294373 \n", "http://english.whiov.cas.cn/ 31104294373 \n", "http://www.china-embassy.or.jp/chn/ 31104294373 \n", "\n", " title \\\n", "url \n", "http://coronavirus.fr/ Epicorem. Ecoépidémiologie \n", "http://coronavirus.fr/ Epicorem. Ecoépidémiologie \n", "http://english.whiov.cas.cn/ Wuhan Institute of Virulogy, official page in ... \n", "http://english.whiov.cas.cn/ Wuhan Institute of Virulogy, official page in ... 
\n", "http://www.china-embassy.or.jp/chn/ 中华人民共和国驻日本大使馆 \n", "\n", " description language tld \\\n", "url \n", "http://coronavirus.fr/ Medical/Scientific aspects French .fr \n", "http://coronavirus.fr/ Medical/Scientific aspects French .fr \n", "http://english.whiov.cas.cn/ Health Organisation English .cn \n", "http://english.whiov.cas.cn/ Health Organisation English .cn \n", "http://www.china-embassy.or.jp/chn/ Embassy Chinese .jp \n", "\n", " crawls \n", "url \n", "http://coronavirus.fr/ 4.0 \n", "http://coronavirus.fr/ 4.0 \n", "http://english.whiov.cas.cn/ 70.0 \n", "http://english.whiov.cas.cn/ 70.0 \n", "http://www.china-embassy.or.jp/chn/ 306.0 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seeds_by_url = seeds.set_index('url')\n", "seeds_by_url['crawls'] = crawls_by_url\n", "seeds_by_url.crawls = seeds_by_url.crawls.fillna(0)\n", "seeds_by_url.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we can see which seeds still need to be crawled, or to have their crawls made public?" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "438 URLS are missing crawls, which is 15.65% of the total seeds.\n" ] } ], "source": [ "missing = seeds_by_url[seeds_by_url.crawls == 0.0]\n", "print(\"{0} URLS are missing crawls, which is {1:.2f}% of the total seeds.\".format(\n", " len(missing),\n", " len(missing) / len(seeds_by_url) * 100\n", "))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }