# nCovMemory

The nCovMemory project is a GitHub repository where people are curating stories about COVID-19 in the media and social media. You can see it mentioned in a short NYTimes video documentary about censorship in China: [China Is Censoring Coronavirus Stories: These Citizens Are Fighting Back](https://www.nytimes.com/video/world/asia/100000006970549/coronavirus-chinese-citizens.html) by Christoph Koettl, Muyi Xiao, Nilo Tabrizy and Dmitriy Khavin.

The project publishes its data on this [static website](https://2019ncovmemory.github.io/nCovMemory/) and also as CSV in its GitHub repository. We can check that data to see whether any of the URLs need to be added to the IIPC collection.

## GitHub Data

We can download their latest CSV data directly from the web.


In [13]:
import pandas

url = 'https://raw.githubusercontent.com/2019ncovmemory/nCovMemory/master/data/data.csv'
ncovmem = pandas.read_csv(url, index_col='id')
ncovmem

id,category,update,media,date,title,title_en,url,translation_en,is_deleted,alternative,archive
4241,non_fiction,2020-03-23,人间theLivings,2020-03-23,海外疫区里的中国留学生：要学位，还是保命？,,https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA,,,,http://archive.is/XD3Nz
4240,narrative,2020-03-23,在人间living,2020-03-23,今天，武汉封城两个月了,,https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw,,,,http://archive.is/KPoAT
4239,non_fiction,2020-03-23,中国经营报,2020-03-23,“108好汉”为何注射新冠疫苗，这位00后的回答刷屏…,,https://mp.weixin.qq.com/s/GinzGhKnNHZrtlDuVIrkKw,,,,http://archive.is/tHBxO
4238,non_fiction,2020-03-23,中国经营报,2020-03-23,新加坡、澳大利亚“封国”！意大利全国“停产”，美国确诊人数突破3万...,,https://mp.weixin.qq.com/s/hE8J7D-GrkB92GoPnmpcsg,,,,http://archive.is/QcWU5
4237,narrative,2020-03-22,WUXU,2020-03-23,[四十日谈] 条条大路“不”通罗马，老猫的曲折回意之路,,https://mp.weixin.qq.com/s/fLlGjOcZcotS-QybkqZjcw,,,,http://archive.ph/ehsAk
...,...,...,...,...,...,...,...,...,...,...,...
5,non_fiction,2020-02-06,GQ报道,2020-01-29,孝感前线医生：武汉更难，我们下面不好意思提要求,,https://mp.weixin.qq.com/s/uGaFeqrqmLBQe5qdRSTeSQ,,,,https://archive.ph/MnZrn
4,non_fiction,2020-02-06,GQ报道,2020-01-29,疫情危机中不被看见的人们：武汉周边城市百姓的自救行动,,https://mp.weixin.qq.com/s/D8Ob8pNmecHKXg7yR7EWFg,,,,https://archive.ph/vDSj5
3,non_fiction,2020-02-06,GQ报道,2020-01-28,我家离华南海鲜市场很近：返乡、封城、过年，一位武汉大学生的过去一周,,https://mp.weixin.qq.com/s/n7dXGHh-79d6VEzDhhOUbQ,,,,https://archive.ph/RSmFx
2,non_fiction,2020-02-06,GQ报道,2020-01-28,武汉隔离：疫区、信息孤岛与一辆鄂A车的漂流,,https://mp.weixin.qq.com/s/M-hVivF7NQmZHlu8YMnL_w,,,,http://archive.is/3XKZD


## Seeds

Now we need the IIPC seed list. We can get that right here, since we saved it when we ran the Seeds notebook.

In [15]:
seeds = pandas.read_csv('data/iipc.csv')
seeds

Unnamed: 0,id,url,creator,created,updated,crawl_definition,title,description,language,tld
0,2147692,http://coronavirus.fr/,alext,2020-02-21T03:43:18.662353Z,2020-03-16T19:53:45.860949Z,31104294373,Epicorem. Ecoépidémiologie,Medical/Scientific aspects,French,.fr
1,2147693,http://english.whiov.cas.cn/,alext,2020-02-21T03:43:18.706571Z,2020-03-16T19:52:28.575749Z,31104294373,"Wuhan Institute of Virulogy, official page in ...",Health Organisation,English,.cn
2,2147694,http://www.china-embassy.or.jp/chn/,alext,2020-02-21T03:43:18.739126Z,2020-03-16T19:53:03.086729Z,31104294373,中华人民共和国驻日本大使馆,Embassy,Chinese,.jp
3,2147695,http://www.china-embassy.or.jp/jpn/,alext,2020-02-21T03:43:18.766308Z,2020-03-16T19:54:02.280945Z,31104294373,中華人民共和国駐日本国大使館,Embassy,Japanese,.jp
4,2147696,https://cadenaser.com/tag/ncov/a/,alext,2020-02-21T03:43:18.791716Z,2020-03-16T19:54:19.694418Z,31104294373,Coronavirus de Wuhan,Cadena Ser,Spanish,.com


## Massage

Some rows in the original dataset lack a value for `url`, so we drop them before going further.

In [64]:
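# Keep only rows that have a URL; .copy() gives an independent frame so that
# adding columns later does not trigger pandas' SettingWithCopyWarning.
ncovmem = ncovmem[ncovmem.url.notna()].copy()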
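
To check how many rows a filter like this removes, here is a minimal sketch on a toy frame. The ids and URLs are invented for illustration and are not taken from the nCovMemory data.

```python
import pandas

# Toy stand-in for the nCovMemory table; ids and URLs are made up.
toy = pandas.DataFrame(
    {"url": ["https://example.org/a", None, "https://example.org/b"]},
    index=pandas.Index([1, 2, 3], name="id"),
)

# Count the rows that have no URL before dropping them.
missing = int(toy.url.isna().sum())

# Keep rows with a URL, as an independent copy of the filtered frame.
kept = toy[toy.url.notna()].copy()

print(missing, len(kept))
```

Comparing `missing` against the row count before and after is a quick sanity check that nothing else was dropped.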

## Domains

Let's take a look at the domains that are present in this data.

In [65]:
import altair
from urllib.parse import urlparse

ncovmem['domain'] = ncovmem.url.map(lambda u: urlparse(u).netloc, na_action='ignore')

# Bar chart of how many URLs fall under each domain.
altair.Chart(ncovmem.reset_index(), title="nCovMemory URLs by Domain", width=800).mark_bar().encode(
    altair.X('domain', title='Domain'),
    altair.Y('count(id)', title='Number of URLs')
)


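So the dataset is almost entirely links to mp.weixin.qq.com, the article platform for WeChat (Weixin) public accounts, which is run by Tencent, the company behind the [QQ](https://en.wikipedia.org/wiki/Tencent_QQ) instant messaging platform.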
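
The same picture can be read off as numbers rather than a chart. The sketch below counts domains with `value_counts()` on a small toy series; the first two URLs appear in the table above and the third is a made-up placeholder.

```python
import pandas
from urllib.parse import urlparse

toy_urls = pandas.Series([
    "https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA",
    "https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw",
    "https://example.org/story",  # hypothetical URL, for illustration only
])

# Extract the host part of each URL, then count occurrences per host.
# value_counts() sorts in descending order, so the dominant domain comes first.
domain_counts = toy_urls.map(lambda u: urlparse(u).netloc).value_counts()

print(domain_counts)
```

On the real data the equivalent would be `ncovmem.domain.value_counts()`, assuming the `domain` column has been added as above.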

## Overlap?

Now let's see if any of the nCovMemory URLs are already present in the IIPC seed list.

In [66]:
len(set(seeds.url).intersection(ncovmem.url))

0

As you can see, no URL appears verbatim in both sets, so the nCovMemory dataset should be a useful addition to the IIPC collection.
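
The comparison above is an exact string match, so two URLs that differ only in scheme, host capitalisation, a trailing slash, or a fragment count as different. A looser check could normalise both sides first. The `normalize` helper below is a hypothetical sketch, not part of either dataset's tooling, and the example URLs are invented.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Hypothetical normaliser: ignore scheme, host case, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query

# Made-up URLs standing in for the seed list and the candidate list.
seed_urls = ["http://Example.org/story/", "https://example.org/other"]
candidate_urls = ["https://example.org/story", "https://example.org/new#top"]

# Exact string comparison, as in the cell above.
exact = set(seed_urls) & set(candidate_urls)

# Comparison after normalisation.
loose = {normalize(u) for u in seed_urls} & {normalize(u) for u in candidate_urls}

print(len(exact), len(loose))
```

Whether this kind of normalisation is appropriate depends on the sites involved: some servers treat paths with and without a trailing slash as different resources, so a loose match is a hint to look closer rather than proof of a duplicate.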

Let's save the filtered data for later use.

In [67]:
ncovmem.to_csv('data/ncovmem.csv')
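
Because `to_csv` writes the `id` index as the first column, reading the file back needs `index_col='id'` to restore the same shape. A minimal round-trip sketch, using an in-memory buffer and invented rows instead of the real file:

```python
import io
import pandas

# Toy frame indexed by id, mirroring the layout of the saved data; values are made up.
toy = pandas.DataFrame(
    {"url": ["https://example.org/a", "https://example.org/b"]},
    index=pandas.Index([2, 1], name="id"),
)

# Write to an in-memory buffer instead of data/ncovmem.csv.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the id column as the index when reading back.
restored = pandas.read_csv(buffer, index_col="id")

print(restored)
```

Without `index_col`, `id` comes back as an ordinary column and pandas adds a fresh integer index.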