# IIPC

This notebook explores the seeds that are being crawled in the [Novel Coronavirus COVID-19](https://archive-it.org/collections/13529/) Archive-It collection. It uses the [Archive-It Partner API](https://support.archive-it.org/hc/en-us/articles/360032747311-Access-your-account-with-the-Archive-It-Partner-API), which does not seem to require a key for public collections (yay). More context for this collecting effort can be found in [this IIPC blog post](https://blog.archive.org/2020/02/13/archiving-information-on-the-novel-coronavirus-covid-19/).

## 0. Import

First let's import some things we're going to need later. It's useful to do them all here at the beginning in case you want to skip parts of the data collection and use the data that is already present in the repository.

In [1]:
import csv
import altair
import pandas
import wayback
import datetime
import requests

## 1. Get the Seeds

First let's download the seeds in the collection and save them as a CSV. If you want to use the CSV that's already here you can move on to **Section 2**. We're going to write out the data to a file called `data/iipc.csv`. You can see the type of data that is returned by looking at [this API response](https://partner.archive-it.org/api/seed?collection=13529&limit=100). The Archive-It Partner API has a route for returning the seeds of a given collection, which is selected with the `collection` parameter. We can use the `limit` and `offset` parameters to walk through the results page by page without getting all of them at once.

In [4]:
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "limit": 100
}

Now we can create a loop that keeps fetching results and incrementing the offset until there are no more seeds. We could have used the CSV output, but it is useful to normalize some of the structured metadata. This will likely take a few minutes to run.

In [16]:
# newline='' stops the csv module from writing blank rows between records on Windows
out = csv.writer(open('data/iipc.csv', 'w', newline=''))
out.writerow([
    "id",
    "url",
    "creator",
    "created",
    "updated",
    "crawl_definition",
    "title",
    "description",
    "language",
    "tld"
])

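def first_val(meta, name):
    # Each metadata field holds a list of {"value": ...} entries; keep the first, or None if the field is absent
    return meta[name][0]["value"] if name in meta else None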

params['offset'] = 0

while True:
    resp = requests.get(url, params=params)
    seeds = resp.json()
    if len(seeds) == 0: break

    for seed in seeds:
        meta = seed["metadata"]
        out.writerow([
            seed["id"],
            seed["url"],
            seed["created_by"],
            seed["created_date"],
            seed["last_updated_date"],
            seed["crawl_definition"],
            first_val(meta, "Title"),
            first_val(meta, "Description"),
            first_val(meta, "Language"),
            first_val(meta, "Top-Level Domain")
        ])

    params['offset'] += 100

In [13]:
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "offset": 0,
    "limit": 100
}

while True:
    resp = requests.get(url, params=params)
    seeds = resp.json()
    if len(seeds) == 0: break
    for seed in seeds:
        if seed['url'] == 'https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/':
            print(seed['url'])
    params['offset'] += len(seeds)

So now you should hopefully see an updated `data/iipc.csv`!
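The fetch-until-empty paging used in this section can also be wrapped in a small generator. This is only a sketch: `iter_pages`, `fake_fetch` and `FAKE_SEEDS` are made-up names for illustration, and the fake fetcher stands in for the real API so the example runs offline.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int, int], List[dict]], limit: int = 100) -> Iterator[dict]:
    """Yield records one at a time, requesting `limit` records per page."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:  # an empty page means we have walked past the last record
            break
        yield from page
        # Advance by what was actually returned, in case a page comes back short.
        offset += len(page)


# Hypothetical stand-in for the API: 250 fake seeds served in slices.
FAKE_SEEDS = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return FAKE_SEEDS[offset:offset + limit]


all_seeds = list(iter_pages(fake_fetch, limit=100))
print(len(all_seeds))
```

Against the real endpoint, `fetch_page` could plausibly be a function that calls `requests.get(url, params=...)` with the `offset` and `limit` it is given and returns `resp.json()`, as the loop above does.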

## 2. Display the Seeds

First let's load our `data/iipc.csv` into a Pandas DataFrame where we can more easily manipulate it.

In [17]:
seeds = pandas.read_csv('data/iipc.csv', parse_dates=["created", "updated"])
seeds

Unnamed: 0,id,url,creator,created,updated,crawl_definition,title,description,language,tld
0,2147692,http://coronavirus.fr/,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr
1,2147693,http://english.whiov.cas.cn/,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn
2,2147694,http://www.china-embassy.or.jp/chn/,alext,2020-02-21 03:43:18.739126+00:00,2020-03-16 19:53:03.086729+00:00,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp
3,2147695,http://www.china-embassy.or.jp/jpn/,alext,2020-02-21 03:43:18.766308+00:00,2020-03-16 19:54:02.280945+00:00,31104294373,中華人民共和国駐日本国大使館,Embassy,Japanese,.jp
4,2147696,https://cadenaser.com/tag/ncov/a/,alext,2020-02-21 03:43:18.791716+00:00,2020-03-16 19:54:19.694418+00:00,31104294373,Coronavirus de Wuhan,Cadena Ser,Spanish,.com
...,...,...,...,...,...,...,...,...,...,...
2794,2173031,https://www.suntrust.com/resource-center/comme...,nicolab,2020-03-26 15:41:06.629121+00:00,2020-03-26 15:41:06.629220+00:00,31104300763,,,,
2795,2148539,https://www.eluniversal.com/economia/60496/cor...,alext,2020-02-21 04:11:12.713039+00:00,2020-03-16 19:53:55.500654+00:00,31104294373,Coronavirus afecta economía mundial y rutas co...,"political aspects,economic aspects, diplomacy",Spanish,.com
2796,2149377,https://ue.delegfrance.org/coronavirus-activat...,alext,2020-02-21 04:28:15.941569+00:00,2020-03-16 19:53:43.544395+00:00,31104294373,Délégation France UE. Coronavirus : Activation...,Institutional website,French,.org
2797,2149468,https://www.healthdirect.gov.au/coronavirus,alext,2020-02-21 04:29:30.095448+00:00,2020-03-16 19:52:16.008948+00:00,31104297068,Coronavirus disease (COVID-19),Government health information,English,.au


We can sort them by created time in ascending order, and save them again. This might make it easier to compare them over time with `git diff`.

In [19]:
seeds = seeds.sort_values('created')
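# index=False keeps the row labels out of the file, so re-reading it does not add an "Unnamed: 0" column
seeds.to_csv('data/iipc.csv', index=False)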
seeds.head(10)

Unnamed: 0,id,url,creator,created,updated,crawl_definition,title,description,language,tld
0,2147692,http://coronavirus.fr/,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr
636,2147692,http://coronavirus.fr/,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr
1,2147693,http://english.whiov.cas.cn/,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn
2153,2147693,http://english.whiov.cas.cn/,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn
849,2147694,http://www.china-embassy.or.jp/chn/,alext,2020-02-21 03:43:18.739126+00:00,2020-03-16 19:53:03.086729+00:00,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp
2,2147694,http://www.china-embassy.or.jp/chn/,alext,2020-02-21 03:43:18.739126+00:00,2020-03-16 19:53:03.086729+00:00,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp
1557,2147695,http://www.china-embassy.or.jp/jpn/,alext,2020-02-21 03:43:18.766308+00:00,2020-03-16 19:54:02.280945+00:00,31104294373,中華人民共和国駐日本国大使館,Embassy,Japanese,.jp
3,2147695,http://www.china-embassy.or.jp/jpn/,alext,2020-02-21 03:43:18.766308+00:00,2020-03-16 19:54:02.280945+00:00,31104294373,中華人民共和国駐日本国大使館,Embassy,Japanese,.jp
4,2147696,https://cadenaser.com/tag/ncov/a/,alext,2020-02-21 03:43:18.791716+00:00,2020-03-16 19:54:19.694418+00:00,31104294373,Coronavirus de Wuhan,Cadena Ser,Spanish,.com
1757,2147697,https://doktor.frettabladid.is/sjukdomur/27626-2/,alext,2020-02-21 03:43:18.814377+00:00,2020-03-16 19:54:20.668796+00:00,31104294373,Allt sem þú þarft að vita um Kóróna veirur (co...,Health care information,Icelandic,.is
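If the goal is a file that diffs cleanly, it may help to make the row order fully deterministic. The sketch below uses made-up rows and an in-memory buffer instead of `data/iipc.csv`; breaking ties on `id` is an assumption about what would be useful here, not something the notebook does.

```python
import io
import pandas

# Made-up rows standing in for the real seeds; two share a `created` time.
example = pandas.DataFrame({
    "id": [3, 1, 2],
    "url": ["http://c.example/", "http://a.example/", "http://b.example/"],
    "created": pandas.to_datetime(
        ["2020-02-21 03:43:18", "2020-02-21 03:43:18", "2020-02-20 10:00:00"],
        utc=True,
    ),
})

# Sort on `created`, then `id`, so rows with equal timestamps
# always land in the same order from one run to the next.
stable = example.sort_values(["created", "id"])

# Write without the index so only the real columns end up in the file.
buffer = io.StringIO()
stable.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Reading `csv_text` back with `pandas.read_csv` gives exactly the three original columns.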


## 3. Languages

We can see that there are a large number of Portuguese seeds. I guess that's because someone involved in web archiving in Portugal or Brazil got busy.

In [20]:
altair.Chart(seeds).mark_bar().encode(
    altair.X('language', title='Language'),
    altair.Y('count(id)')
)
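To get the numbers behind a chart like this one, a plain pandas tally is one option. This is a sketch on a made-up frame (the real `seeds` DataFrame would take the place of `example`), and keeping seeds with no language as their own row is a choice made here for illustration.

```python
import pandas

# Made-up seeds; one has no language recorded.
example = pandas.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "language": ["Portuguese", "English", "Portuguese", None, "French", "Portuguese"],
})

# value_counts sorts from most to least common;
# dropna=False keeps seeds without a language as their own row in the tally.
by_language = example["language"].value_counts(dropna=False)
print(by_language)
```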

## 4. Created

We can see that the vast majority of these seeds were entered into Archive-It on February 20, 2020, presumably from the spreadsheet sitting behind the Google Form.

In [21]:
altair.Chart(seeds).mark_bar().encode(
    altair.X('monthdate(created)', title='Created'),
    altair.Y('count(id)')
)
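Which day a seed falls on depends on the time zone the timestamps are read in. The `created` values shown in Section 2 carry a `+00:00` offset, and the earliest ones are in the small hours of February 21 UTC, which is still the evening of February 20 in US time zones. The sketch below checks this for two values copied from that table; US Eastern time is an assumption picked purely for illustration.

```python
import pandas

# Two `created` values copied from the table in Section 2 (stored in UTC).
created = pandas.Series(pandas.to_datetime([
    "2020-02-21 03:43:18.662353+00:00",
    "2020-02-21 04:29:30.095448+00:00",
], utc=True))

# Assumption: US Eastern time, used here only as an example of a local zone.
local = created.dt.tz_convert("America/New_York")

utc_days = created.dt.date.tolist()
local_days = local.dt.date.tolist()
print(utc_days, local_days)
```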

## 5. Last Update

Similarly we can look to see when the last update time was for each seed.

In [22]:
altair.Chart(seeds).mark_bar().encode(
    altair.X('monthdate(updated)', title='Updates'),
    altair.Y('count(id)')
)

It looks like most of the seeds were last updated a few days ago. But does this mean that was the last time they were crawled?

## 6. Get the Crawls

Oddly I couldn't seem to get any of the crawl-related Partner API endpoints to work. Maybe I need to have created the crawls? At any rate, I can use the URL to look directly in the Wayback Machine to see what is available. The EDGI folks have created a nice [Wayback](https://wayback.readthedocs.io/en/latest/usage.html) module that lets you easily look up URLs in the Wayback Machine (it uses their CDX API behind the scenes).

This can take some time, so I'm going to save off the results in `data/crawls.csv`. If you prefer to use the stored `crawls.csv` you can skip ahead to **Section 8**. This will collect crawl information for these URLs from 2019-10-01 on so we can look at their coverage before and after the project started.

In [23]:
# newline='' stops the csv module from writing blank rows between records on Windows
out = csv.writer(open('data/crawls.csv', 'w', newline=''))
out.writerow(['timestamp', 'url', 'status_code', 'archive_url'])
wb = wayback.WaybackClient()

for index, row in seeds.iterrows():
    try:
        for crawl in wb.search(row.url, from_date=datetime.datetime(2019, 10, 1)):
            out.writerow([
                crawl.timestamp.isoformat(),
                crawl.url,
                crawl.status_code,
                crawl.view_url
            ])
    except Exception as e:
        print(e)

403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fa2larm.cz%2F2020%2F02%2Fslavoj-zizek-melancholicka-krasa-virove-pandemie%2F&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.cagle.com%2Fdave-granlund%2F2020%2F01%2Fcoronavirus-usa&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fpoliticalcartoons.com%2F%3Fs%3Dcoronavirus&from=20191001000000&showResumeKey=true&resolveRevisits=true


It's interesting that some of the URLs are forbidden for viewing. I'm not sure what's going on there. One important thing to keep in mind is that these URLs could have been crawled by other users of Archive-It or by the Internet Archive's own crawlers.
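Rather than only printing the errors, it can be handy to keep the failed URLs around so they can be inspected or retried later. The following is a sketch, not what the notebook does: `collect_crawls` and `fake_search` are hypothetical names, and the lookup is passed in as a plain callable so the example runs without touching the Wayback Machine.

```python
from typing import Any, Callable, Iterable, List, Tuple


def collect_crawls(
    urls: Iterable[str],
    search: Callable[[str], Iterable[Any]],
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Return (crawl records, [(url, error message), ...])."""
    crawls: List[Any] = []
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            crawls.extend(search(url))
        except Exception as e:  # broad on purpose, mirroring the loop above
            failures.append((url, str(e)))
    return crawls, failures


# Hypothetical stand-in for a Wayback lookup, so the sketch runs offline.
def fake_search(url: str) -> List[dict]:
    if "blocked" in url:
        raise RuntimeError("403 Client Error: Forbidden")
    return [{"url": url, "status_code": 200}]


found, failed = collect_crawls(
    ["http://ok.example/", "http://blocked.example/"], fake_search
)
print(len(found), failed)
```

With the real client, `search` could be a small function wrapping `wb.search(url, from_date=datetime.datetime(2019, 10, 1))`, as in the cell above.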

## 8. View the Crawls

Now let's load in the `crawls.csv` as a DataFrame and look at the number of crawls over time. It's actually useful to save a sorted version of `crawls.csv` so that it can easily be diffed with previous versions.

In [24]:
crawls = pandas.read_csv('data/crawls.csv', parse_dates=['timestamp'])
crawls = crawls.sort_values('timestamp')
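# index=False keeps the row labels out of the file, so re-running this cell does not pile up "Unnamed: 0" columns
crawls.to_csv('data/crawls.csv', index=False)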
crawls

Unnamed: 0,timestamp,url,status_code,archive_url
15742,2019-10-01 01:24:55,http://www.dw.com/,,http://web.archive.org/web/20191001012455/http...
15743,2019-10-01 01:24:55,https://www.dw.com/,,http://web.archive.org/web/20191001012455/http...
69764,2019-10-01 01:56:12,https://www.colorado.gov/cdphe,200.0,http://web.archive.org/web/20191001015612/http...
69195,2019-10-01 02:38:36,https://www.healthlinkbc.ca/,200.0,http://web.archive.org/web/20191001023836/http...
26215,2019-10-01 03:14:00,https://cn.ambafrance.org/,200.0,http://web.archive.org/web/20191001031400/http...
...,...,...,...,...
67879,2020-03-27 15:01:46,https://www.ecdc.europa.eu/en/novel-coronaviru...,200.0,http://web.archive.org/web/20200327150146/http...
65286,2020-03-27 15:02:42,https://news.ifeng.com/c/special/7tPlDSzDgVk,200.0,http://web.archive.org/web/20200327150242/http...
21852,2020-03-27 15:03:11,https://www.nbcnews.com/health/coronavirus,200.0,http://web.archive.org/web/20200327150311/http...
65287,2020-03-27 15:11:04,https://news.ifeng.com/c/special/7tPlDSzDgVk,200.0,http://web.archive.org/web/20200327151104/http...


In [25]:
crawls_per_day = crawls.set_index('timestamp').resample('1D')['url'].count()
crawls_per_day = crawls_per_day.reset_index()
crawls_per_day.columns = ['date', 'crawls']
crawls_per_day

Unnamed: 0,date,crawls
0,2019-10-01,22
1,2019-10-02,49
2,2019-10-03,22
3,2019-10-04,52
4,2019-10-05,37
...,...,...
174,2020-03-23,1417
175,2020-03-24,1405
176,2020-03-25,1242
177,2020-03-26,1998


In [26]:
altair.Chart(crawls_per_day, width=800).mark_bar().encode(
    altair.X('date', title='Crawl Date'),
    altair.Y('crawls', title='Crawls')
)
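The daily counting step is easier to follow on a tiny example. This sketch uses three made-up crawl records. Note that a day with no crawls still gets a row, with a count of 0.

```python
import pandas

# Three made-up crawls: two on 2019-10-01, none on 10-02, one on 10-03.
example = pandas.DataFrame({
    "timestamp": pandas.to_datetime([
        "2019-10-01 01:24:55",
        "2019-10-01 03:14:00",
        "2019-10-03 12:00:00",
    ]),
    "url": ["http://a.example/", "http://b.example/", "http://a.example/"],
})

# resample needs a datetime index; '1D' makes one bin per calendar day.
per_day = example.set_index("timestamp").resample("1D")["url"].count()
print(per_day.tolist())  # [2, 0, 1]
```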

## 9. Missing Crawls

We can definitely see these URLs are being crawled a whole lot more since the start of the project. But the graph shows what has been crawled (irrespective of who did it). It also doesn't show which seed URLs have not been crawled yet.

To see what might be missing let's first group our crawl data by URL, and count how many crawls there have been for each URL.

In [27]:
crawls_by_url = crawls.groupby('url').count().timestamp
crawls_by_url.name = 'crawls'
crawls_by_url.head()

url
http://9news.com.au/coronavirus                                                                                          2
http://abcnews.go.com/Health/1300-people-died-flu-year/story?id=67754182                                                71
http://abola.pt/africa/2020-02-01/angola-entre-os-paises-africanos-com-maior-risco-de-contagio-do-coronavirus/827264     1
http://abola.pt/nnh/2020-02-03/formula-1-coronavirus-ameaca-gp-da-china/827542                                           1
http://albertahealthservices.ca/                                                                                         2
Name: crawls, dtype: int64

Next we can take our `seeds` DataFrame and index it by URL, so that we can add our `crawls_by_url` series to it, since that is also indexed by `url`. It is kinda nice how pandas makes this join easy. The use of `fillna` there is to convert any null values (where there have been no crawls yet) to 0.

In [28]:
seeds_by_url = seeds.set_index('url')
seeds_by_url['crawls'] = crawls_by_url
seeds_by_url.crawls = seeds_by_url.crawls.fillna(0)
seeds_by_url.head()

Unnamed: 0_level_0,id,creator,created,updated,crawl_definition,title,description,language,tld,crawls
url,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
http://coronavirus.fr/,2147692,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr,4.0
http://coronavirus.fr/,2147692,alext,2020-02-21 03:43:18.662353+00:00,2020-03-16 19:53:45.860949+00:00,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr,4.0
http://english.whiov.cas.cn/,2147693,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn,70.0
http://english.whiov.cas.cn/,2147693,alext,2020-02-21 03:43:18.706571+00:00,2020-03-16 19:52:28.575749+00:00,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn,70.0
http://www.china-embassy.or.jp/chn/,2147694,alext,2020-02-21 03:43:18.739126+00:00,2020-03-16 19:53:03.086729+00:00,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp,306.0
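Here is the same index-aligned join on a toy example (the URLs and counts are made up). It also shows why `crawls` comes out as a float such as `4.0`: the missing values introduced by the alignment force the column to a float type before `fillna` replaces them.

```python
import pandas

# Made-up seeds indexed by URL.
seeds_example = pandas.DataFrame({
    "url": ["http://a.example/", "http://b.example/", "http://c.example/"],
    "id": [1, 2, 3],
}).set_index("url")

# Made-up crawl counts; there is no entry for b.example.
crawl_counts = pandas.Series(
    {"http://a.example/": 4, "http://c.example/": 70}, name="crawls"
)

# Assignment aligns on the index: rows with no matching URL get NaN.
seeds_example["crawls"] = crawl_counts
seeds_example["crawls"] = seeds_example["crawls"].fillna(0)
print(seeds_example["crawls"].tolist())  # [4.0, 0.0, 70.0]
```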


So now we can see which seeds still need to be crawled, or perhaps still need to have their crawls made public.

In [29]:
missing = seeds_by_url[seeds_by_url.crawls == 0.0]
print("{0} URLS are missing crawls, which is {1:.2f}% of the total seeds.".format(
    len(missing),
    len(missing) / len(seeds_by_url) * 100
))

438 URLS are missing crawls, which is 15.65% of the total seeds.
