{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# nCovMemory\n", "\n", "The nCovMemory project is a GitHub repository where people curate stories about COVID-19 from the news media and social media. It is mentioned in a short NYTimes video documentary about censorship in China: [China Is Censoring Coronavirus Stories: These Citizens Are Fighting Back](https://www.nytimes.com/video/world/asia/100000006970549/coronavirus-chinese-citizens.html) by Christoph Koettl, Muyi Xiao, Nilo Tabrizy and Dmitriy Khavin.\n", "\n", "They make their data available at this [static website](https://2019ncovmemory.github.io/nCovMemory/) and also as CSV data in their GitHub repository. We can check their data to see whether any of their URLs need to be added to the IIPC collection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GitHub Data\n", "\n", "We can download their latest CSV data directly from the web.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " category update media date \\\n", "id \n", "4241 non_fiction 2020-03-23 人间theLivings 2020-03-23 \n", "4240 narrative 2020-03-23 在人间living 2020-03-23 \n", "4239 non_fiction 2020-03-23 中国经营报 2020-03-23 \n", "4238 non_fiction 2020-03-23 中国经营报 2020-03-23 \n", "4237 narrative 2020-03-22 WUXU 2020-03-23 \n", "... ... ... ... ... \n", "5 non_fiction 2020-02-06 GQ报道 2020-01-29 \n", "4 non_fiction 2020-02-06 GQ报道 2020-01-29 \n", "3 non_fiction 2020-02-06 GQ报道 2020-01-28 \n", "2 non_fiction 2020-02-06 GQ报道 2020-01-28 \n", "1 non_fiction 2020-02-06 GQ报道 2020-01-27 \n", "\n", " title title_en \\\n", "id \n", "4241 海外疫区里的中国留学生:要学位,还是保命? NaN \n", "4240 今天,武汉封城两个月了 NaN \n", "4239 “108好汉”为何注射新冠疫苗,这位00后的回答刷屏… NaN \n", "4238 新加坡、澳大利亚“封国”!意大利全国“停产”,美国确诊人数突破3万... NaN \n", "4237 [四十日谈] 条条大路“不”通罗马,老猫的曲折回意之路 NaN \n", "... ... ... \n", "5 孝感前线医生:武汉更难,我们下面不好意思提要求 NaN \n", "4 疫情危机中不被看见的人们:武汉周边城市百姓的自救行动 NaN \n", "3 我家离华南海鲜市场很近:返乡、封城、过年,一位武汉大学生的过去一周 NaN \n", "2 武汉隔离:疫区、信息孤岛与一辆鄂A车的漂流 NaN \n", "1 10000个临时发往武汉的口罩 NaN \n", "\n", " url translation_en \\\n", "id \n", "4241 https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA NaN \n", "4240 https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw NaN \n", "4239 https://mp.weixin.qq.com/s/GinzGhKnNHZrtlDuVIrkKw NaN \n", "4238 https://mp.weixin.qq.com/s/hE8J7D-GrkB92GoPnmpcsg NaN \n", "4237 https://mp.weixin.qq.com/s/fLlGjOcZcotS-QybkqZjcw NaN \n", "... ... ... \n", "5 https://mp.weixin.qq.com/s/uGaFeqrqmLBQe5qdRSTeSQ NaN \n", "4 https://mp.weixin.qq.com/s/D8Ob8pNmecHKXg7yR7EWFg NaN \n", "3 https://mp.weixin.qq.com/s/n7dXGHh-79d6VEzDhhOUbQ NaN \n", "2 https://mp.weixin.qq.com/s/M-hVivF7NQmZHlu8YMnL_w NaN \n", "1 https://mp.weixin.qq.com/s/p-uPky_zB6XKcAetthqkKg NaN \n", "\n", " is_deleted alternative archive \n", "id \n", "4241 NaN NaN http://archive.is/XD3Nz \n", "4240 NaN NaN http://archive.is/KPoAT \n", "4239 NaN NaN http://archive.is/tHBxO \n", "4238 NaN NaN http://archive.is/QcWU5 \n", "4237 NaN NaN http://archive.ph/ehsAk \n", "... 
... ... ... \n", "5 NaN NaN https://archive.ph/MnZrn \n", "4 NaN NaN https://archive.ph/vDSj5 \n", "3 NaN NaN https://archive.ph/RSmFx \n", "2 NaN NaN http://archive.is/3XKZD \n", "1 NaN NaN https://archive.ph/9s1ug \n", "\n", "[4227 rows x 11 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas\n", "\n", "# Read the latest nCovMemory CSV straight from GitHub, indexed by its id column\n", "url = 'https://raw.githubusercontent.com/2019ncovmemory/nCovMemory/master/data/data.csv'\n", "ncovmem = pandas.read_csv(url, index_col='id')\n", "ncovmem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Seeds\n", "\n", "Now we need the IIPC seed list. We can load it from a local file, since we saved it when we ran the Seeds notebook." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id url creator \\\n", "0 2147692 http://coronavirus.fr/ alext \n", "1 2147693 http://english.whiov.cas.cn/ alext \n", "2 2147694 http://www.china-embassy.or.jp/chn/ alext \n", "3 2147695 http://www.china-embassy.or.jp/jpn/ alext \n", "4 2147696 https://cadenaser.com/tag/ncov/a/ alext \n", "\n", " created updated crawl_definition \\\n", "0 2020-02-21T03:43:18.662353Z 2020-03-16T19:53:45.860949Z 31104294373 \n", "1 2020-02-21T03:43:18.706571Z 2020-03-16T19:52:28.575749Z 31104294373 \n", "2 2020-02-21T03:43:18.739126Z 2020-03-16T19:53:03.086729Z 31104294373 \n", "3 2020-02-21T03:43:18.766308Z 2020-03-16T19:54:02.280945Z 31104294373 \n", "4 2020-02-21T03:43:18.791716Z 2020-03-16T19:54:19.694418Z 31104294373 \n", "\n", " title \\\n", "0 Epicorem. Ecoépidémiologie \n", "1 Wuhan Institute of Virulogy, official page in ... \n", "2 中华人民共和国驻日本大使馆 \n", "3 中華人民共和国駐日本国大使館 \n", "4 Coronavirus de Wuhan \n", "\n", " description language tld \n", "0 Medical/Scientific aspects French .fr \n", "1 Health Organisation English .cn \n", "2 Embassy Chinese .jp \n", "3 Embassy Japanese .jp \n", "4 Cadena Ser Spanish .com " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the IIPC seed list saved earlier by the Seeds notebook\n", "seeds = pandas.read_csv('data/iipc.csv')\n", "seeds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Massage\n", "\n", "Some rows in the original dataset lack a value for `url`, so we drop them before going any further." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "# Keep only rows that have a URL; copy so that columns added later do not trigger SettingWithCopyWarning\n", "ncovmem = ncovmem[ncovmem.url.notna()].copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Domains\n", "\n", "Let's take a look at the domains that are present in this data." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import altair\n", "from urllib.parse import urlparse\n", "\n", "# Extract the host name (netloc) of each URL into a new column\n", "ncovmem['domain'] = ncovmem.url.map(lambda u: urlparse(u).netloc, na_action='ignore')\n", "\n", "# Bar chart of how many nCovMemory URLs fall under each domain\n", "altair.Chart(ncovmem.reset_index(), title=\"nCovMemory URLs by Domain\", width=800).mark_bar().encode(\n", "    altair.X('domain', title='Domain'),\n", "    altair.Y('count(id)', title='Number of URLs')\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the dataset is almost entirely links to qq.com, a domain operated by [Tencent](https://en.wikipedia.org/wiki/Tencent). The sample rows shown earlier all point to mp.weixin.qq.com, which hosts articles published on Tencent's [WeChat](https://en.wikipedia.org/wiki/WeChat) (Weixin) platform." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overlap?\n", "\n", "Now let's see if any of the nCovMemory URLs are present in the IIPC seed list." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count the URLs that appear, as exact strings, in both datasets\n", "len(set(seeds.url).intersection(ncovmem.url))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, there is no overlap at all between the two sets of URLs, at least under exact string matching. So the nCovMemory dataset should be a useful addition to the IIPC collection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save them for later use." 
] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "ncovmem.to_csv('data/ncovmem.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }