{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Local News Dataset\n",
"View this document on [GitHub](https://github.com/yinleon/LocalNewsDataset/blob/master/nbs/local_news_dataset.ipynb?flush_cache=true) | [NbViewer](https://nbviewer.jupyter.org/github/yinleon/LocalNewsDataset/blob/master/nbs/local_news_dataset.ipynb?flush_cache=true#datasheet)\n",
"by [Leon Yin](https://www.leonyin.org/)<br>\n",
"Data scientist at the SMaPP Lab, NYU, and affiliate at Data & Society."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents \n",
"1. [Introduction](#intro)\n",
"2. [Tech Specs](#specs)\n",
"3. [Using the Dataset](#use)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Introduction \n",
"Though it is not a prominent part of the current discussion, 2018 has shown us that understanding \"fake news\" and the media (manipulation) ecosystem at large has as much to do with local broadcasting stations as it does with Alex Jones and CNN.\n",
"\n",
"We saw local news outlets used as a sounding board to decry mainstream media outlets as \"[fake news](https://www.youtube.com/watch?v=khbihkeOISc).\" We also saw Russian trolls masquerade as local news outlets to [build trust](https://www.npr.org/2018/07/12/628085238/russian-influence-campaign-sought-to-exploit-americans-trust-in-local-news) as sleeper accounts on Twitter. \n",
"\n",
"To help put the pieces of this disinformation ecosystem into context, we can refer to a 2016 [Pew Study](http://www.journalism.org/2016/07/07/trust-and-accuracy/) on Trust and Accuracy of the Modern Media Consumer, which showed that 86% of survey respondents had \"a lot\" or \"some\" confidence in local news. This was more than their confidence in national media outlets, social media, and family and friends.\n",
"\n",
"\n",
"\n",
"\n",
"Social media is the least trusted news source according to the 4.6K respondents of the Pew study. It's important to note that this study was published before the 2016 US Presidential Election, when social media platforms were not under the same scrutiny as they are today. \n",
"\n",
"Perhaps the most significant finding in this study is that very few respondents have \"a lot\" of trust in information from professional news outlets. Is this because so-called \"fake news\" blurs the line between reputable and false pieces of information? Political scientist Andy Guess has shown that older (60+ years old) citizens are more susceptible to spreading links containing junk news on Facebook. Yet the mistrust economy is more than the [junk news](https://www.buzzfeednews.com/article/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook) sites Craig Silverman analyzed when he first coined \"fake news\" in late 2016. \n",
"\n",
"\n",
"\n",
"In 2017, media historian Caroline Jack released a [lexicon](https://datasociety.net/output/lexicon-of-lies/) in an effort to define what was formerly referred to as \"fake news,\" with more nuance. Jack calls this umbrella of deceptive content problematic information.\n",
"\n",
"\n",
"The social media scholar Alice Marwick -- who made some of the first breakthroughs in this field with [Becca Lewis](https://datasociety.net/output/media-manipulation-and-disinfo-online/) -- [recently reminded us that](https://www.georgetownlawtechreview.org/why-do-people-share-fake-news-a-sociotechnical-model-of-media-effects/GLTR-07-2018/) problematic information spreads not only through junk news headlines, but also through memes, videos and podcasts. What other mediums are we overlooking? As a hint, we can listen to Marwick and other researchers such as ethnographer [Francesca Tripodi](https://datasociety.net/output/searching-for-alternative-facts/), who observe that problematic information is deeply connected to one's self-presentation and the reinforcement of group identity. So where does local news fit into this equation?\n",
"\n",
"Though local news is widely viewed as a relatively trustworthy news source, its role in the current media and information landscape is not well studied. To better understand that role, I put together the Local News Dataset in the hope that it will accelerate research on local news across the web.\n",
"\n",
"## About the Data Set\n",
"This dataset is a machine-readable directory of state-level newspapers, TV stations and magazines. In addition to basic information such as the name of the outlet and state it is located in, all available information regarding web presence, social media (Twitter, YouTube, Facebook) and their owners is scraped, too.\n",
"\n",
"The sources of this dataset are [usnpl.com](www.usnpl.com) -- newspapers and magazines by state, [stationindex.com](www.stationindex.com) -- TV stations by state and by owner, and homepages of the media corporations [Meredith](http://www.meredith.com/local-media/broadcast-and-digital), [Sinclair](http://sbgi.net/tv-channels/), [Nexstar](https://www.nexstar.tv/stations/), [Tribune](http://www.tribunemedia.com/our-brands/) and [Hearst](http://www.hearst.com/broadcasting/our-markets).\n",
"\n",
"This dataset was inspired by ProPublica's [Congress API](https://projects.propublica.org/api-docs/congress-api/). I hope that this dataset will serve a similar purpose as a starting point for research and applications, as well as a bridge between datasets from social media, news articles and online communities.\n",
"\n",
"While you use this dataset, if you see irregularities, questionable entries, or missing outlets, please [submit an issue](https://github.com/yinleon/LocalNewsDataset/issues/new) on GitHub or contact me on [Twitter](https://twitter.com/LeonYin). I'd love to hear how this dataset is put to work.\n",
"\n",
"You can browse the dataset on [Google Sheets](https://docs.google.com/spreadsheets/d/1f3PjT2A7-qY0SHcDW30Bc_FXYC_7RxnZfCKyXpoWeuY/edit?usp=sharing)<br>\n",
"Or look at the raw dataset on [GitHub](https://github.com/yinleon/LocalNewsDataset/blob/master/data/local_news_dataset_2018.csv)<br>\n",
"Or just scroll down to the [tech specs](#local_news_dataset_2018)!\n",
"\n",
"Happy hunting!\n",
"\n",
"## Acknowledgements\n",
"I'd like to acknowledge the work of the people behind usnpl.com and stationindex.com for compiling lists of local media outlets.\n",
"Andreu Casas and Gregory Eady provided invaluable comments to improve this dataset for public release. Kinjal Dave provided much-needed proofreading. The dataset was created by Leon Yin at the SMaPP Lab at NYU. Thank you to Josh Tucker, Jonathan Nagler, Richard Bonneau and my colleague Nicole Baram.\n",
"\n",
"\n",
"## Citation\n",
"If this dataset is helpful to you, please cite it as:\n",
"```\n",
"@misc{leon_yin_2018_1345145,\n",
" author = {Leon Yin},\n",
" title = {Local News Dataset},\n",
" month = aug,\n",
" year = 2018,\n",
" doi = {10.5281/zenodo.1345145},\n",
" url = {https://doi.org/10.5281/zenodo.1345145}\n",
"}\n",
"\n",
"```\n",
"\n",
"## License\n",
"This data is free to use, but please follow the ProPublica [Terms](#terms).\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Tech Specs \n",
"This section is an in-depth look at what is scraped from the web and how these pieces of disparate Internet matter come together to form the [Local News Dataset](https://github.com/yinleon/LocalNewsDataset).\n",
"\n",
"For those who tinker...<br>\n",
"The intermediates can be generated and updated:<br>\n",
"```$ python download_data.py```<br>\n",
"The output file is created by merging and pre-processing the intermediates:<br>\n",
"```$ python merge.py```<br>\n",
"These [two scripts](https://github.com/yinleon/LocalNewsDataset/tree/master/py) and this notebook are written in Python 3.6.5 using open-source packages listed in [requirements.txt](https://github.com/yinleon/LocalNewsDataset/blob/master/requirements.txt).\n",
"\n",
"[Top of Notebook](#top)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Updated 2018-11-12 16:19:34.240807\n",
"By Leon\n",
"Using Python 3.6.5\n",
"On Linux-3.10.0-514.10.2.el7.x86_64-x86_64-with-centos-7.3.1611-Core\n"
]
},
{
"data": {
"text/markdown": [
"## Inventory\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"### Intermediates \n",
" - [sinclair.tsv](#sinclair)\n",
" - [meredith.tsv](#meredith)\n",
" - [nexstar.tsv](#nexstar)\n",
" - [hearst.tsv](#hearst)\n",
" - [tribune.tsv](#tribune)\n",
" - [station_index.tsv](#station_index)\n",
" - [usnpl.tsv](#usnpl)\n",
"\n",
" \n",
"### Outputs\n",
"- [local_news_dataset_2018.csv](#local_news_dataset_2018)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## sinclair.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of news outlets owned by Sinclair scraped from their website"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/sinclair.tsv):<br>`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/sinclair.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py) used to make this dataset:<br>`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/sinclair.tsv` (N = 1321)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>city</th><th>geo</th><th>network</th><th>state</th><th>station</th><th>website</th><th>broadcaster</th><th>source</th><th>collection_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Chico</td><td>Chico-Redding</td><td>cozitv</td><td>CA</td><td>KRVU-LD-2</td><td>NaN</td><td>Sinclair</td><td>sbgi.net</td><td>2018-08-02 20:31:06.425892</td></tr>\n",
"    <tr><th>1</th><td>West Palm</td><td>West Palm BeachFort Pierce, FL</td><td>tbd</td><td>FL</td><td>WTCN-3</td><td>NaN</td><td>Sinclair</td><td>sbgi.net</td><td>2018-08-02 20:31:06.425892</td></tr>\n",
"    <tr><th>2</th><td>Abilene</td><td>Abilene-Sweetwater</td><td>cw</td><td>TX</td><td>KTXS-2</td><td>NaN</td><td>Sinclair</td><td>sbgi.net</td><td>2018-08-02 20:31:06.425892</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" city geo network state station website \\\n",
"0 Chico Chico-Redding cozitv CA KRVU-LD-2 NaN \n",
"1 West Palm West Palm BeachFort Pierce, FL tbd FL WTCN-3 NaN \n",
"2 Abilene Abilene-Sweetwater cw TX KTXS-2 NaN \n",
"\n",
" broadcaster source collection_date \n",
"0 Sinclair sbgi.net 2018-08-02 20:31:06.425892 \n",
"1 Sinclair sbgi.net 2018-08-02 20:31:06.425892 \n",
"2 Sinclair sbgi.net 2018-08-02 20:31:06.425892 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| city | The name of the city that the TV station is located in. | 85 |\n",
"| geo | The raw geolocation field from the website. We parse this field to get `city` and `state` | 89 |\n",
"| network | The franchise or brand name that the station belongs to, e.g. \"Fox\". | 28 |\n",
"| state | The two-letter abbreviation of the state the media outlet is located in. | 35 |\n",
"| station | The name of the TV station, e.g. \"WGBH\". | 611 |\n",
"| website | The website of the media outlet exactly as we found it online. | 152 |\n",
"| broadcaster | The corporate owner of the station. | 1 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| collection_date | When was this record collected? | 3 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## meredith.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of news outlets owned by Meredith scraped from their website"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/meredith.tsv):<br>`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/meredith.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py) used to make this dataset:<br>`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/meredith.tsv` (N = 16)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>city</th><th>facebook</th><th>google</th><th>network</th><th>state</th><th>station</th><th>twitter</th><th>website</th><th>broadcaster</th><th>source</th><th>collection_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Phoenix</td><td>https://www.facebook.com/CBS5AZ</td><td>https://plus.google.com/+cbs5az/posts</td><td>NaN</td><td>AZ</td><td>KPHO</td><td>https://twitter.com/CBS5AZ</td><td>http://www.kpho.com/</td><td>Meredith</td><td>meridith.com</td><td>2018-08-02 14:55:24.612585</td></tr>\n",
"    <tr><th>1</th><td>Nashville</td><td>https://www.facebook.com/WSMVTV</td><td>https://plus.google.com/117143042785436999262/...</td><td>NaN</td><td>TN</td><td>WSMV</td><td>https://twitter.com/WSMV</td><td>http://www.wsmv.com</td><td>Meredith</td><td>meridith.com</td><td>2018-08-02 14:55:24.612585</td></tr>\n",
"    <tr><th>2</th><td>Springfield</td><td>https://www.facebook.com/westernmassnews</td><td>NaN</td><td>NaN</td><td>MA</td><td>Western Mass News</td><td>https://twitter.com/WMASSNEWS</td><td>http://www.westernmassnews.com</td><td>Meredith</td><td>meridith.com</td><td>2018-08-02 14:55:24.612585</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" city facebook \\\n",
"0 Phoenix https://www.facebook.com/CBS5AZ \n",
"1 Nashville https://www.facebook.com/WSMVTV \n",
"2 Springfield https://www.facebook.com/westernmassnews \n",
"\n",
" google network state \\\n",
"0 https://plus.google.com/+cbs5az/posts NaN AZ \n",
"1 https://plus.google.com/117143042785436999262/... NaN TN \n",
"2 NaN NaN MA \n",
"\n",
" station twitter \\\n",
"0 KPHO https://twitter.com/CBS5AZ \n",
"1 WSMV https://twitter.com/WSMV \n",
"2 Western Mass News https://twitter.com/WMASSNEWS \n",
"\n",
" website broadcaster source \\\n",
"0 http://www.kpho.com/ Meredith meridith.com \n",
"1 http://www.wsmv.com Meredith meridith.com \n",
"2 http://www.westernmassnews.com Meredith meridith.com \n",
"\n",
" collection_date \n",
"0 2018-08-02 14:55:24.612585 \n",
"1 2018-08-02 14:55:24.612585 \n",
"2 2018-08-02 14:55:24.612585 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| city | The name of the city that the TV station is located in. | 12 |\n",
"| facebook | The URL to the media outlet's Facebook presence. | 14 |\n",
"| google | The URL to the media outlet's Google Plus presence. | 13 |\n",
"| network | The franchise or brand name that the station belongs to, e.g. \"Fox\". | 1 |\n",
"| state | The two-letter abbreviation of the state the media outlet is located in. | 11 |\n",
"| station | The name of the TV station, e.g. \"WGBH\". | 16 |\n",
"| twitter | The URL to the Twitter screen name of the news outlet. | 14 |\n",
"| website | The website of the media outlet exactly as we found it online. | 16 |\n",
"| broadcaster | The corporate owner of the station. | 1 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| collection_date | When was this record collected? | 1 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## nexstar.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of news outlets owned by Nexstar scraped from their website"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/nexstar.tsv):<br>`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/nexstar.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py) used to make this dataset:<br>`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/nexstar.tsv` (N = 180)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>station</th><th>website</th><th>city</th><th>state</th><th>broadcaster</th><th>source</th><th>collection_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>KWBQ</td><td>kwbq.com</td><td>Albuquerque</td><td>NM</td><td>Nexstar</td><td>nexstar.tv</td><td>2018-08-02 20:31:06.425892</td></tr>\n",
"    <tr><th>1</th><td>KBVO</td><td>kbvotv.com</td><td>Austin</td><td>TX</td><td>Nexstar</td><td>nexstar.tv</td><td>2018-08-02 20:31:06.425892</td></tr>\n",
"    <tr><th>2</th><td>KOIN</td><td>koin.com</td><td>Portland</td><td>OR</td><td>Nexstar</td><td>nexstar.tv</td><td>2018-08-02 20:31:06.425892</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" station website city state broadcaster source \\\n",
"0 KWBQ kwbq.com Albuquerque NM Nexstar nexstar.tv \n",
"1 KBVO kbvotv.com Austin TX Nexstar nexstar.tv \n",
"2 KOIN koin.com Portland OR Nexstar nexstar.tv \n",
"\n",
" collection_date \n",
"0 2018-08-02 20:31:06.425892 \n",
"1 2018-08-02 20:31:06.425892 \n",
"2 2018-08-02 20:31:06.425892 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| station | The name of the TV station, e.g. \"WGBH\". | 177 |\n",
"| website | The website of the media outlet exactly as we found it online. | 113 |\n",
"| city | The name of the city that the TV station is located in. | 97 |\n",
"| state | The two-letter abbreviation of the state the media outlet is located in. | 39 |\n",
"| broadcaster | The corporate owner of the station. | 1 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| collection_date | When was this record collected? | 1 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## hearst.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of news outlets owned by Hearst scraped from their website"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/hearst.tsv):<br>`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/hearst.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py) used to make this dataset:<br>`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/hearst.tsv` (N = 33)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>city</th><th>facebook</th><th>network</th><th>state</th><th>station</th><th>twitter</th><th>website</th><th>broadcaster</th><th>source</th><th>collection_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Monterey-Salinas</td><td>https://www.facebook.com/ksbw8?fref=ts</td><td>NaN</td><td>CA</td><td>KSBW-TV</td><td>https://twitter.com/ksbw</td><td>http://www.ksbw.com/</td><td>Hearst</td><td>hearst.com</td><td>2018-08-02 14:55:24.612585</td></tr>\n",
"    <tr><th>1</th><td>Portland-Auburn</td><td>https://www.facebook.com/wmtwtv</td><td>NaN</td><td>ME</td><td>WMTW-TV</td><td>https://twitter.com/WMTWTV</td><td>http://www.wmtw.com/</td><td>Hearst</td><td>hearst.com</td><td>2018-08-02 14:55:24.612585</td></tr>\n",
"    <tr><th>2</th><td>Burlington VT/Plattsburgh</td><td>https://www.facebook.com/5WPTZ</td><td>NaN</td><td>NY</td><td>WPTZ-TV/WNNE-TV</td><td>https://twitter.com/mynbc5</td><td>http://www.wptz.com/</td><td>Hearst</td><td>hearst.com</td><td>2018-08-02 14:55:24.612585</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" city facebook \\\n",
"0 Monterey-Salinas https://www.facebook.com/ksbw8?fref=ts \n",
"1 Portland-Auburn https://www.facebook.com/wmtwtv \n",
"2 Burlington VT/Plattsburgh https://www.facebook.com/5WPTZ \n",
"\n",
" network state station twitter \\\n",
"0 NaN CA KSBW-TV https://twitter.com/ksbw \n",
"1 NaN ME WMTW-TV https://twitter.com/WMTWTV \n",
"2 NaN NY WPTZ-TV/WNNE-TV https://twitter.com/mynbc5 \n",
"\n",
" website broadcaster source collection_date \n",
"0 http://www.ksbw.com/ Hearst hearst.com 2018-08-02 14:55:24.612585 \n",
"1 http://www.wmtw.com/ Hearst hearst.com 2018-08-02 14:55:24.612585 \n",
"2 http://www.wptz.com/ Hearst hearst.com 2018-08-02 14:55:24.612585 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| city | The name of the city that the TV station is located in. | 27 |\n",
"| facebook | The URL to the media outlet's Facebook presence. | 33 |\n",
"| network | The franchise or brand name that the station belongs to, e.g. \"Fox\". | 1 |\n",
"| state | The two-letter abbreviation of the state the media outlet is located in. | 23 |\n",
"| station | The name of the TV station, e.g. \"WGBH\". | 33 |\n",
"| twitter | The URL to the Twitter screen name of the news outlet. | 30 |\n",
"| website | The website of the media outlet exactly as we found it online. | 33 |\n",
"| broadcaster | The corporate owner of the station. | 1 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| collection_date | When was this record collected? | 1 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## tribune.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of news outlets owned by Tribune scraped from their website."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/tribune.tsv):<br>`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/tribune.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py#L21-L86) used to make this dataset:<br>`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py#L21-L86`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/tribune.tsv` (N = 47)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>city</th><th>facebook</th><th>network</th><th>station</th><th>twitter</th><th>website</th><th>youtube</th><th>broadcaster</th><th>source</th><th>state</th><th>collection_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>South Florida</td><td>http://www.facebook.com/SFLCW</td><td>NaN</td><td>WSFL</td><td>https://twitter.com/SFLCW</td><td>http://sfltv.net/</td><td>NaN</td><td>Tribune</td><td>tribunemedia.com</td><td>FL</td><td>2018-08-04 01:06:48.283394</td></tr>\n",
"    <tr><th>1</th><td>Indianapolis</td><td>https://www.facebook.com/CBS4Indy</td><td>NaN</td><td>WTTV</td><td>https://twitter.com/cbs4indy</td><td>http://cbs4indy.com/</td><td>NaN</td><td>Tribune</td><td>tribunemedia.com</td><td>IN</td><td>2018-08-04 01:06:48.283394</td></tr>\n",
"    <tr><th>2</th><td>Dallas</td><td>https://www.facebook.com/NightcapNews</td><td>NaN</td><td>KDAF</td><td>https://twitter.com/NewsFixDFW</td><td>http://cw33.com/</td><td>http://www.youtube.com/user/kdaf</td><td>Tribune</td><td>tribunemedia.com</td><td>TX</td><td>2018-08-04 01:06:48.283394</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" city facebook network station \\\n",
"0 South Florida http://www.facebook.com/SFLCW NaN WSFL \n",
"1 Indianapolis https://www.facebook.com/CBS4Indy NaN WTTV \n",
"2 Dallas https://www.facebook.com/NightcapNews NaN KDAF \n",
"\n",
" twitter website \\\n",
"0 https://twitter.com/SFLCW http://sfltv.net/ \n",
"1 https://twitter.com/cbs4indy http://cbs4indy.com/ \n",
"2 https://twitter.com/NewsFixDFW http://cw33.com/ \n",
"\n",
" youtube broadcaster source state \\\n",
"0 NaN Tribune tribunemedia.com FL \n",
"1 NaN Tribune tribunemedia.com IN \n",
"2 http://www.youtube.com/user/kdaf Tribune tribunemedia.com TX \n",
"\n",
" collection_date \n",
"0 2018-08-04 01:06:48.283394 \n",
"1 2018-08-04 01:06:48.283394 \n",
"2 2018-08-04 01:06:48.283394 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| city | The name of the city that the TV station is located in. | 36 |\n",
"| facebook | The URL to the media outlet's Facebook presence. | 46 |\n",
"| network | The franchise or brand name that the station belongs to, e.g. \"Fox\". | 1 |\n",
"| station | The name of the TV station, e.g. \"WGBH\". | 46 |\n",
"| twitter | The URL to the Twitter screen name of the news outlet. | 43 |\n",
"| website | The website of the media outlet exactly as we found it online. | 47 |\n",
"| youtube | The URL to the media outlet's YouTube presence. | 30 |\n",
"| broadcaster | The corporate owner of the station. | 1 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| state | The two-letter abbreviation of the state the media outlet is located in. | 26 |\n",
"| collection_date | When was this record collected? | 1 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## station_index.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of TV stations compiled on stationindex.com. The website is scraped according to the market (region), and again according to the owner. The two scraped datasets are merged and duplicates are dropped. When dropping duplicates, precedence is given to the entry scraped by owner."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/station_index.tsv):<br>`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/station_index.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py) used to make this dataset:<br>`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/station_index.tsv` (N = 1867)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" city collection_date id \\\n",
"0 Houston 2018-08-02 14:55:24.612585 \"FOX 26\" \n",
"1 Boise 2018-08-02 14:55:24.612585 \"Telemundo Boise\" \n",
"2 Phoenix 2018-08-02 14:55:24.612585 NaN \n",
"\n",
" owner source state station \\\n",
"0 Fox Television Stations stationindex TX KRIV \n",
"1 Boise Telecasters stationindex ID KKJB \n",
"2 Daystar stationindex AZ KDPH-LP \n",
"\n",
" station_info \\\n",
"0 Digital Full-Power - 1000 kW \n",
"1 Digital Full-Power - 35 kW \n",
"2 Low-Power - 150 kW \n",
"\n",
" subchannels \\\n",
"0 NaN \n",
"1 39.1 Telemundo, 39.2 Cozi TV, 39.3 Antenna TV... \n",
"2 NaN \n",
"\n",
" website \n",
"0 http://www.fox26houston.com/ \n",
"1 NaN \n",
"2 NaN "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| city | The name of the city that the TV station is located in. | 676 |\n",
"| id | The human-recognizable name for the TV station. | 699 |\n",
"| owner | The corporate owner of the station. | 641 |\n",
"| state | The two letter state abbreviation the media outlet is located in. | 56 |\n",
"| station_info | Technical specs of the transmission, such as signal class and power. | 765 |\n",
"| station | The name of the TV station, e.g. \"WGBH\". | 1866 |\n",
"| subchannels | The station's digital subchannels and the networks they carry. | 626 |\n",
"| website | The website of the media outlet exactly as we found it online. | 1172 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| collection_date | When was this record collected? | 1 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## usnpl.tsv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"An intermediate file of newspapers, magazines, and college papers compiled by usnpl.com. The website is scraped by visiting state-specific pages using requests and BeautifulSoup; websites and social media accounts are collected wherever possible."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/usnpl.tsv):  \n",
"`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/usnpl.tsv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py) used to make this dataset:  \n",
"`https://github.com/yinleon/LocalNewsDataset/blob/master/py/download_data.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/usnpl.tsv` (N = 6221)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" Facebook Geography Medium \\\n",
"0 https://www.facebook.com/MissionTimesCourier CA Newspapers \n",
"1 https://www.facebook.com/adelantevalle CA Newspapers \n",
"2 https://www.facebook.com/calmarcourier IA Newspapers \n",
"\n",
" Name Twitter_Name \\\n",
"0 Mission Times Courier NaN \n",
"1 Adelante Valle IVPNews \n",
"2 Calmar Courier calmarcourier \n",
"\n",
" Website \\\n",
"0 http://www.missiontimescourier.com \n",
"1 http://www.ivpressonline.com/adelantevalle \n",
"2 http://calmarcourier.com \n",
"\n",
" Youtube source \\\n",
"0 NaN usnpl.com \n",
"1 http://www.youtube.com/user/ivpressonline usnpl.com \n",
"2 https://www.youtube.com/channel/UCVTvRL0P_eaIU... usnpl.com \n",
"\n",
" collection_date \n",
"0 2018-08-02 14:55:24.612585 \n",
"1 2018-08-02 14:55:24.612585 \n",
"2 2018-08-02 14:55:24.612585 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| Facebook | The URL to the media outlet's Facebook presence. | 5100 |\n",
"| Geography | The two letter state abbreviation the media outlet is located in. | 51 |\n",
"| Medium | Whether the news outlet is a newspaper, magazine or college newspaper. | 3 |\n",
"| Name | The name of the news outlet. | 5765 |\n",
"| Twitter_Name | The Twitter screen name of the news outlet. | 3643 |\n",
"| Website | The website of the media outlet exactly as we found it online. | 6080 |\n",
"| Youtube | The URL to the media outlet's YouTube presence. | 2226 |\n",
"| source | Where was this record scraped from? | 1 |\n",
"| collection_date | When was this record collected? | 1 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"[Top of Spec Sheet](#specs)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"## local_news_dataset_2018.csv"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"The intermediate files above are preprocessed (renaming columns, removing duplicates) and merged, resulting in the Local News Dataset! This is it! We made it!"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Read the raw file from this [URL](https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv):  \n",
"`https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"See the [code](https://github.com/yinleon/LocalNewsDataset/blob/master/py/merge.py) used to make this dataset:  \n",
"`https://github.com/yinleon/LocalNewsDataset/blob/master/py/merge.py`\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What Does the Data Look Like?"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Sample of `../data/local_news_dataset_2018.csv` (N = 8720)"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" name state website \\\n",
"0 WTVC TN http://www.newschannel9.com/ \n",
"1 Fountain Hills Times AZ http://www.fhtimes.com \n",
"2 Beauregard Daily News LA http://www.beauregarddailynews.net \n",
"\n",
" domain twitter youtube \\\n",
"0 newschannel9.com NaN NaN \n",
"1 fhtimes.com NaN NaN \n",
"2 beauregarddailynews.net beauregardnews NaN \n",
"\n",
" facebook owner \\\n",
"0 NaN Freedom Communications \n",
"1 https://www.facebook.com/fountainhillstimes NaN \n",
"2 https://www.facebook.com/beauregardnews NaN \n",
"\n",
" medium source collection_date \n",
"0 TV station stationindex 2018-08-02 14:55:24.612585 \n",
"1 Newspapers usnpl.com 2018-08-02 14:55:24.612585 \n",
"2 Newspapers usnpl.com 2018-08-02 14:55:24.612585 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### What do the columns mean?\n",
"| Column Name | Description | N Unique Values |\n",
"| --- | --- | --- |\n",
"| name | The name of the news outlet. For TV stations this is the call sign, e.g. \"WGBH\". | 8095 |\n",
"| state | The two letter state abbreviation the media outlet is located in. | 55 |\n",
"| website | The website of the media outlet exactly as we found it online. | 7345 |\n",
"| domain | The domain that houses the media outlet. It is standardized (no \"www\" or \"http://\"). Sometimes multiple media outlets direct to the same domain (but separate sub-domains). | 6296 |\n",
"| twitter | The Twitter screen name of the news outlet. | 3632 |\n",
"| youtube | The URL to the media outlet's YouTube presence. | 2220 |\n",
"| facebook | The URL to the media outlet's Facebook presence. | 5093 |\n",
"| owner | The corporate owner of the station. | 634 |\n",
"| medium | Whether the news outlet is a newspaper, magazine, college newspaper, or TV station. | 4 |\n",
"| source | Where was this record scraped from? | 8 |\n",
"| collection_date | When was this record collected? | 4 |\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### Breakdown of mediums in the Local News Dataset"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" medium\n",
"Newspapers 5336\n",
"TV station 2593\n",
"College Newspapers 480\n",
"Magazines 311"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"#### Breakdown of data sources in the Local News Dataset"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" source\n",
"usnpl.com 6119\n",
"stationindex 1708\n",
"sbgi.net 611\n",
"nexstar.tv 177\n",
"tribunemedia.com 47\n",
"hearst.com 33\n",
"meridith.com 16\n",
"User Input 9"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"The `User Input` entries are custom additions taken from [this JSON file](https://github.com/yinleon/LocalNewsDataset/blob/master/data/custom_additions.json) and added to the dataset in `merge.py`."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"Below is an interactive [Plot.ly](https://plot.ly) choropleth map of state-level representation in this dataset. Hover over each state to see counts (number of stations) of the top mediums and owners."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from runtimestamp.runtimestamp import runtimestamp # for reproducibility\n",
"from docs.build_docs import * # auto-generates docs\n",
"runtimestamp('Leon')\n",
"generate_docs()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Using the Dataset \n",
"Below is some starter code in Python to read the Local News Dataset from the web into a pandas DataFrame.\n",
"\n",
"[Top of Notebook](#top)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'\n",
"df_local = pd.read_csv(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to use this dataset for a list of web domains, there are a few steps you'll need to take:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Keep one row per domain, dropping rows with no domain and domains that\n",
"# belong to shared platforms or a nationwide franchise.\n",
"excluded_domains = ['facebook.com', 'google.com', 'tumblr.com',\n",
"                    'wordpress.com', 'comettv.com']\n",
"df_local_website = df_local[df_local.domain.notnull() &\n",
"                            ~df_local.domain.isin(excluded_domains)\n",
"                           ].drop_duplicates(subset=['domain'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Save the filtered dataset for domain-level analysis.\n",
"df_local_website.to_csv('../data/local_news_dataset_2018_for_domain_analysis.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We apply these filters because some entries don't have websites, some listed websites are pages on shared platforms (Facebook, Google, Tumblr, WordPress) rather than the outlet's own domain, Comet TV is a nationwide franchise, and some stations share a website."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" name state website \\\n",
"5537 Mount Desert Islander ME http://www.mdislander.com \n",
"5992 Lake County News-Chronicle MN http://www.lcnewschronicle.com \n",
"633 WATC GA http://www.watc.tv/ \n",
"\n",
" domain twitter youtube \\\n",
"5537 mdislander.com TheMDIslander NaN \n",
"5992 lcnewschronicle.com NaN NaN \n",
"633 watc.tv NaN NaN \n",
"\n",
" facebook \\\n",
"5537 https://www.facebook.com/mdislander \n",
"5992 https://www.facebook.com/pages/Lake-County-New... \n",
"633 NaN \n",
"\n",
" owner medium source \\\n",
"5537 NaN Newspapers usnpl.com \n",
"5992 NaN Newspapers usnpl.com \n",
"633 Carolina Christian Broadcasting TV station stationindex \n",
"\n",
" collection_date \n",
"5537 2018-08-02 14:55:24.612585 \n",
"5992 2018-08-02 14:55:24.612585 \n",
"633 2018-08-02 14:55:24.612585 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_local_website.sample(3, random_state=303)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience this filtered dataset is available here: `https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018_for_domain_analysis.csv`\n",
"and also here: `http://bit.ly/local_news_dataset_domains`"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" name state website domain twitter youtube facebook \\\n",
"0 KWHE HI http://www.kwhe.com/ kwhe.com NaN NaN NaN \n",
"1 WGVK MI http://www.wgvu.org/ wgvu.org NaN NaN NaN \n",
"\n",
" owner medium source \\\n",
"0 LeSea TV station stationindex \n",
"1 Grand Valley State University TV station stationindex \n",
"\n",
" collection_date \n",
"0 2018-08-02 14:55:24.612585 \n",
"1 2018-08-02 14:55:24.612585 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_local_news_domain = pd.read_csv('http://bit.ly/local_news_dataset_domains')\n",
"df_local_news_domain.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to get Twitter accounts for all local news stations in Kansas you can filter the dataset as follows:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['atchisonglobe', 'BCtimesgazette', 'baldwincity', 'CCNewsAdvocate',\n",
" 'BTelescope', 'ChanuteTribune', 'TimesSentinel1', 'Star_Argosy',\n",
" 'DerbyInformer', 'dcglobe', 'HighPlainsJrnl', 'emporiagazette',\n",
" 'FSTribune', 'GSentineltimes', 'gctelegram', 'GB_Tribune',\n",
" 'HaysDaily', 'RecordTime', 'HoltonRecorder', 'HutchNews',\n",
" 'iolaregister', 'thedailyunion', 'kckansan', 'KCStar', 'ljworld',\n",
" 'fortleavenworth', 'LVTimesNews', 'louisburgherald',\n",
" 'MERCnewsroom ', 'marionrecord', 'MarysvilleTweet', 'macsentinel',\n",
" 'ClarionPaper', 'ChadFrey', 'OsageCounty', 'osawatomienews',\n",
" 'oheraldnews', 'micorepublic', 'ParsonsSun', 'The_Morning_Sun',\n",
" 'pratttribune', 'sabethaherald', 'salinajournal',\n",
" 'shawneedispatch', 'tonganoxie', 'CJ_news', 'arkvalleynews',\n",
" 'wgtndailynews', 'voiceitwichita', 'kansasdotcom',\n",
" 'winfieldcourier', 'esubulletin', 'TigerMediaNet',\n",
" 'kstatecollegian', 'PSU_Collegio', 'sunflowernews'], dtype=object)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"twitter_ks = df_local[(~df_local.twitter.isnull()) & \n",
" (df_local.state == 'KS')]\n",
"twitter_ks.twitter.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also get an array of all domains affiliated with Sinclair:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['wabm68.com', 'abc3340.com', 'wtto21.com', 'comettv.com',\n",
" 'utv44.com', 'wfgxtv.com', 'weartv.com', 'local15tv.com', nan,\n",
" 'katv.com', 'bakersfieldnow.com', 'kmph.com', 'kmph-kfre.com',\n",
" 'wjla.com', 'mycbs4.com', 'sbgi.net', 'myfoxtallahassee.com',\n",
" 'wtwc40.com', 'my15wtcn.com', 'azteca48.com', 'cw34.com',\n",
" 'cbs12.com', 'wfxl.com', 'wgxa.tv', 'foxsavannah.com', 'kgan.com',\n",
" 'cbs2iowa.com', 'fox28iowa.com', 'kdsm.com', 'ktvo.com',\n",
" 'siouxlandnews.com', 'cwtreasurevalley.com', 'kboi2.com',\n",
" 'wics.com', 'wyzz43.com', 'wicd15.com', 'cw23tv.com',\n",
" 'foxillinois.com', 'khqa.com', 'wsbt.com', 'foxkansas.com',\n",
" 'mytvwichita.com', 'wdky56.com', 'foxlexington.com', 'mywdka.com',\n",
" 'kbsi23.com', 'foxbaltimore.com', 'cwbaltimore.com',\n",
" 'mytvbaltimore.com', 'myfoxmaine.com', 'wgme.com', 'nbc25news.com',\n",
" 'wsmh.com', 'thecw46.com', 'cw7michigan.com', 'wwmt.com',\n",
" 'upnorthlive.com', 'thecw23.com', 'krcgtv.com', 'abcstlouis.com',\n",
" 'wlos.com', 'my48.tv', 'abc45.com', 'myrdctv.com', 'raleighcw.com',\n",
" 'nebraska.tv', 'foxnebraska.com', 'cw15kxvo.com', 'kptm.com',\n",
" 'mynews3.com', 'thecwlasvegas.tv', 'mylvtv.com', 'my21reno.com',\n",
" 'mynews4.com', 'foxreno.com', 'wutv.com', 'wsyt68.com',\n",
" 'cbs6albany.com', 'cwalbany.com', 'mytvbuffalo.com', 'wutv29.com',\n",
" 'foxrochester.com', 'cwrochester.com', '13wham.com',\n",
" 'cnycentral.com', 'wsyx6.com', 'my64.tv', 'wkef22.com',\n",
" 'star64.tv', 'local12.com', 'cwcincinnati.com',\n",
" 'abc6onyourside.com', 'cwcolumbus.com', 'myfox28columbus.com',\n",
" 'mytvdayton.com', 'abc22now.com', 'fox45now.com', 'nbc24.com',\n",
" 'okcfox.com', 'cwokc.com', 'ktul.com', 'kcby.com', 'kval.com',\n",
" 'kmtr.com', 'kpic.com', 'ktvl.com', 'southernoregoncw.com',\n",
" 'kunptv.com', 'katu.com', 'cwcentralpa.com', 'local21news.com',\n",
" 'wjactv.com', '22thepoint.com', 'wpgh53.com', 'fox56.com',\n",
" 'turnto10.com', 'abcnews4.com', 'wach.com', 'wpde.com', 'my40.tv',\n",
" 'foxchattanooga.com', 'chattanoogacw.com', 'mytv30web.com',\n",
" 'cw58.tv', 'fox17.com', 'abc7amarillo.com', 'telemundoaustin.com',\n",
" 'cbsaustin.com', 'kfdm.com', 'fox4beaumont.com',\n",
" 'fox38corpuschristi.com', 'kfoxtv.com', 'cbs4local.com',\n",
" 'valleycentral.com', 'foxsanantonio.com', 'news4sanantonio.com',\n",
" 'kmys.tv', 'kenvtv.com', 'kutv.com', 'kmyu.tv', 'fox35.com',\n",
" 'mytvz.com', 'mytvrichmond.com', 'foxrichmond.com', 'wset.com',\n",
" 'komonews.com', 'univisionseattle.com', 'klewtv.com', 'kunwtv.com',\n",
" 'keprtv.com', 'kimatv.com', 'cq9tv.com', 'thattvwebsite.com',\n",
" 'fox11online.com', 'cw14online.com', 'fox47.com', 'super18tv.com',\n",
" 'wvah.com', 'wchstv.com', 'wtov9.com'], dtype=object)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sinclair_stations = df_local[df_local.owner == 'Sinclair'].domain.unique()\n",
"sinclair_stations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stay tuned for more in-depth tutorials about how this dataset can be used!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Data Sheet \n",
"In the spirit of transparency and good documentation, I am going to answer some of the questions for datasets proposed in the recent paper [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford.\n",
"\n",
"[Top of Notebook](#top)\n",
"\n",
"### Motivation for Dataset Creation\n",
"*Why was the dataset created? (e.g., were there specific\n",
"tasks in mind, or a specific gap that needed to be filled?)*  \n",
"This dataset was created to study the role of state-level local news on Twitter.  \n",
"We wanted to find users who follow both local news outlets and members of Congress.\n",
"\n",
"*What (other) tasks could the dataset be used for? Are\n",
"there obvious tasks for which it should not be used?*  \n",
"The dataset can be used to query other social media platforms for local news outlets' social feeds.  \n",
"It can also serve as a list of state-level domains for link analysis. This is one use of this dataset in an upcoming report on the Internet Research Agency's use of links on Twitter.\n",
"\n",
"I hope that this dataset might be of interest for researchers applying to the [Social Science One and Facebook RFP](https://socialscience.one/our-facebook-partnership).\n",
"\n",
"*Has the dataset been used for any tasks already? If so,\n",
"where are the results so others can compare (e.g., links to\n",
"published papers)?*  \n",
"A study of IRA Twitter accounts sharing national, local, and junk news articles.\n",
"\n",
"*Who funded the creation of the dataset? If there is an\n",
"associated grant, provide the grant number.*  \n",
"The dataset was created by Leon Yin at the SMaPP Lab at NYU. For more information, please visit our [website](https://wp.nyu.edu/smapp/).\n",
"\n",
"### Dataset Composition\n",
"*What are the instances? (that is, examples; e.g., documents,\n",
"images, people, countries) Are there multiple types\n",
"of instances? (e.g., movies, users, ratings; people, interactions\n",
"between them; nodes, edges)*  \n",
"Each instance is a local news outlet.\n",
"\n",
"\n",
"*Are relationships between instances made explicit in\n",
"the data (e.g., social network links, user/movie ratings, etc.)?\n",
"How many instances of each type are there?*  \n",
"There are relational links in this data, but it is up to you to make those connections. For counts, please refer to the spec sheet above.\n",
"\n",
"*What data does each instance consist of? “Raw” data\n",
"(e.g., unprocessed text or images)? Features/attributes?*  \n",
"Each instance is a scraped entity from a website. There are no images involved. The metadata fields regarding state, website, and social accounts are scraped from raw HTML.\n",
"\n",
"\n",
"*Is there a label/target associated with instances? If the instances are related to people, are subpopulations identified\n",
"(e.g., by age, gender, etc.) and what is their distribution?*  \n",
"This is not a traditional supervised machine learning dataset.\n",
"\n",
"*Is everything included or does the data rely on external\n",
"resources? (e.g., websites, tweets, datasets) If external\n",
"resources, a) are there guarantees that they will exist, and\n",
"remain constant, over time; b) is there an official archival\n",
"version.*  \n",
"The data relies on external sources! There are absolutely no guarantees that links to Twitter, YouTube, Facebook, the source websites (where data is scraped), or the destination websites (homepages for news outlets) will continue to exist or remain constant.\n",
"\n",
"Currently there are open source libraries, like [Tweepy](http://www.tweepy.org/), to query Twitter, and my colleague Megan Brown and I are about to release a Python wrapper for the YouTube Data API.\n",
"\n",
"*Are there licenses, fees or rights associated with\n",
"any of the data?*  \n",
"This dataset is free to use. We're copying terms of use from [ProPublica](https://www.propublica.org/datastore/terms):\n",
"```\n",
"In general, you may use this dataset under the following terms. However, there may be different terms included for some data sets. It is your responsibility to read carefully the specific terms included with the data you download or purchase from our website.\n",
"\n",
"You can’t republish the raw data in its entirety, or otherwise distribute the data (in whole or in part) on a stand-alone basis.\n",
"You can’t change the data except to update or correct it.\n",
"You can’t charge people money to look at the data, or sell advertising specifically against it.\n",
"You can’t sub-license or resell the data to others.\n",
"If you use the data for publication, you must cite Leon Yin and the SMaPP Lab. \n",
"We do not guarantee the accuracy or completeness of the data. You acknowledge that the data may contain errors and omissions. \n",
"We are not obligated to update the data, but in the event we do, you are solely responsible for checking our site for any updates.\n",
"You will indemnify, hold harmless, and defend Leon Yin and the SMaPP Lab from and against any claims arising out of your use of the data.\n",
"```\n",
"\n",
"### Data Collection Process\n",
"*How was the data collected? (e.g., hardware apparatus/sensor,\n",
"manual human curation, software program,\n",
"software interface/API; how were these constructs/measures/methods\n",
"validated?)*
\n",
"The data was collected using 4 CPUs on the NYU HPC Prince Cluster. It was written using [custom code](https://github.com/yinleon/LocalNewsDataset/tree/master/py) that utilizes the requests, beautifulsoup, and Pandas Python libraries. For this reason no APIs are used to collect this data. Data was quality checked by exploring data in Jupyter Noteooks. It was compared to lists curated by [AbilityPR](https://www.agilitypr.com/resources/top-media-outlets/) of the top 10 newspapers by state.\n",
"\n",
"*Who was involved in the data collection process?*
\n",
"This dataset was collected by Leon Yin.\n",
"\n",
"*Over what time-frame was the data collected?*
\n",
"The `process_datetime` columns capture when datasets are collected. Initial development for this project began in April 2018.\n",
"\n",
"*How was the data associated with each instance acquired?*
\n",
"Data is directly scraped from HTML, there is no inferred data. There is no information how the sources curate their websites-- especially TVstationindex.com and USNPL.com.\n",
"\n",
"*Does the dataset contain all possible instances?*
\n",
"Ths is not a sample, but the best attempt at creating a comprehensive list.\n",
"\n",
"*Is there information missing from the dataset and why?*
\n",
"News Outlets not listed in the websites we scrape, or the custom additions JSON are not included. We'll make attempt to take requests for additions and ammendments on GitHub with the intention of creating a website with a submission forum.\n",
"\n",
"*Are there any known errors, sources of noise, or redundancies\n",
"in the data?*\n",
"There are possible redundencies of news outlets occuring across the websites scraped. We have measures to drop duplicates, but if we missed any please submit an error in GitHub.\n",
"\n",
"### Data Preprocessing\n",
"*What preprocessing/cleaning was done?*
\n",
"Twitter Screen Names are extracted from URLs, states are parsed from raw HTML that usually contains a city name, there is no aggregation or engineered features.\n",
"\n",
"*Was the “raw” data saved in addition to the preprocessed/cleaned\n",
"data?*
\n",
"The raw HTML for each site is not provided (so changes in website UI's) will crash future collection. There are no warranties for this. However the intermediate files are saved, and thoroughly documented in the [tech specs](#specs) above.\n",
"\n",
"*Is the preprocessing software available?*
\n",
"The dataset is a standard CSV, so any relevant open source software can be used.\n",
"\n",
"*Does this dataset collection/processing procedure\n",
"achieve the motivation for creating the dataset stated\n",
"in the first section of this datasheet?*
\n",
"The addition of Twitter Screen names makes it possible to use this data for Twitter research. The inclusion of additional fields like website, other social media platforms (Facebook, Youtube) allows for additional applications\n",
"\n",
"\n",
"### Dataset Distribution\n",
"*How is the dataset distributed? (e.g., website, API, etc.;\n",
"does the data have a DOI; is it archived redundantly?)*
\n",
"The dataset is being hosted on GitHub at the moment. It does not have a DOI (if you have suggestions on how to get one please reach out!). There are plans to migrate the dataset to its own website.\n",
"\n",
"*When will the dataset be released/first distributed?*
\n",
"August 2018.\n",
"\n",
"*What license (if any) is it distributed under?*
\n",
"MIT\n",
"\n",
"*Are there any fees or access/export restrictions?*
\n",
"Not while it is on GitHub, but if its migrated elsewhere that's possible.\n",
"\n",
"### Dataset Maintenance\n",
"*Who is supporting/hosting/maintaining the dataset?*
\n",
"The dataset is currently solely maintained by Leon Yin. This seems unsustainable, so if this project sparks an interest with you please reach out to me here: `data-smapp_lab at nyu dot edu`\n",
"\n",
"*Will the dataset be updated? How often and by whom?\n",
"How will updates/revisions be documented and communicated\n",
"(e.g., mailing list, GitHub)? Is there an erratum?*
\n",
"The dataset can be updated locally by running the scripts in this repo. Ammendments to the hosted dataset will contain a separate filepath and URL, and be documented in the README.\n",
"\n",
"\n",
"*If the dataset becomes obsolete how will this be communicated?*
\n",
"If the dataset becomes obsolete, we'll make this clear in the README in the GitHub repository (or whereever it is being hosted).\n",
"\n",
"*Is there a repository to link to any/all papers/systems\n",
"that use this dataset?*
\n",
"There aren't any publications that use this dataset that are published. We'll keep a list on the README or the website.\n",
"\n",
"*If others want to extend/augment/build on this dataset,\n",
"is there a mechanism for them to do so?*
\n",
"Modifications can be made by adding records to the ammendments [JSON](https://github.com/yinleon/LocalNewsDataset/blob/master/data/custom_additions.json).\n",
"\n",
"### Legal & Ethical Considerations\n",
"*If the dataset relates to people (e.g., their attributes) or\n",
"was generated by people, were they informed about the\n",
"data collection?*
\n",
"This dataset has no people-level information. However we don't know anything about the people who generated the webpages that this dataset is built on.\n",
"\n",
"*Does the dataset contain information that might be considered\n",
"sensitive or confidential?*
\n",
"To my knowledge there is no personally identifiable information in this dataset.\n",
"\n",
"*Does the dataset contain information that might be considered\n",
"inappropriate or offensive?*
\n",
"I hope not!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Top of Notebook](#top)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}