{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring your harvested data\n", "\n", "In this notebook we'll look at some ways of exploring the `results.csv` created by the Trove Newspaper and Gazette Harvester." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import altair as alt\n", "from wordcloud import WordCloud\n", "import zipfile\n", "from pathlib import Path\n", "from textblob import TextBlob\n", "from operator import itemgetter\n", "import nltk\n", "nltk.download('stopwords')\n", "nltk.download('punkt')\n", "\n", "stopwords = nltk.corpus.stopwords.words('english')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, this notebook will look for existing harvests in the `data` directory. If you want to use a harvest that downloaded previously, just upload the zipped harvest to the `data` directory and run the cell below. It will expand all the zipped files in the `data` directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import a harvest zip file you've created previously\n", "# First upload the zip file to the data directory, then run this cell\n", "\n", "for zipped in Path('data').glob('*.zip'):\n", " with zipfile.ZipFile(zipped, 'r') as zip_file:\n", " zip_file.extractall(Path('data', zipped.name[:-4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These functions open up a harvest and convert the `results.csv` into a dataframe for analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_latest_harvest():\n", " '''\n", " Get the timestamp of the most recent harvest.\n", " '''\n", " harvests = sorted([d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d))])\n", " return harvests[-1]\n", "\n", "def open_harvest_data(timestamp=None):\n", " '''\n", " Open the results of the specified harvest (most recent by default).\n", " \n", " Returns a DataFrame.\n", " '''\n", " if not timestamp:\n", " timestamp = get_latest_harvest()\n", " df = pd.read_csv(os.path.join('data', timestamp, 'results.csv'), parse_dates=['date'])\n", " return df " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running `open_harvest_data()` without any parameters will load the most recent harvest. To load a different harvest, just supply the name of the directory containing the harvest (this will generally be a timestamp)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = open_harvest_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the data\n", "\n", "Let's have a peek at the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# .head() displays the first 5 rows of a dataframe\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many articles did we harvest?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the earliest and latest publication date in the dataset?" 
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = open_harvest_data()" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the data\n", "\n", "Let's have a peek at the dataset." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# .head() displays the first 5 rows of a dataframe\n", "df.head()" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "How many articles did we harvest?" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape[0]" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "What are the earliest and latest publication dates in the dataset?" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['date'].min()" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['date'].max()" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "How many different newspapers are represented in our dataset?" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(df['newspaper_id'].unique())" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Which article has the most words?" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[df['words'].idxmax()]" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Show the most common newspapers\n", "\n", "Here we'll visualise the 25 most common newspapers in the dataset." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_newspapers = df.value_counts(['newspaper_title', 'newspaper_id']).to_frame().reset_index()\n", "df_newspapers.columns = ['newspaper_title', 'newspaper_id', 'count']\n", "df_newspapers" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(df_newspapers[:25]).mark_bar().encode(\n", "    x=alt.X('count:Q', title='Number of articles'),\n", "    y=alt.Y('newspaper_title:N', title='Newspaper', sort='-x'),\n", "    tooltip=[alt.Tooltip('newspaper_title:N', title='Newspaper'), alt.Tooltip('count:Q', title='Articles')]\n", ")" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Show when the articles were published" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['year'] = df['date'].dt.year\n", "df_years = df['year'].value_counts().to_frame().reset_index()\n", "df_years.columns = ['year', 'count']\n", "df_years" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alt.Chart(df_years).mark_line().encode(\n", "    x=alt.X('year:Q', axis=alt.Axis(format='d')),\n", "    y=alt.Y('count:Q'),\n", "    tooltip=[alt.Tooltip('year', title='Year'), alt.Tooltip('count', title='Articles', format=',d')]\n", ").properties(width=700)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Make a simple word cloud from the article titles" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exclude untitled articles and advertising\n", "df_titles = df.loc[(df['title'] != 'No Title') & (df['title'] != 'Advertising')]\n", "# Get all the article titles and turn them into a single string\n", "title_text = df_titles['title'].str.lower().str.cat(sep=' ')" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate a word cloud image\n", "wordcloud = WordCloud(width=800, height=500, collocations=True).generate(title_text)\n", "display(wordcloud.to_image())" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Using TextBlob\n", "\n", "[TextBlob](https://textblob.readthedocs.io/) is a simple text analysis library. Here we'll use it to find the most common words in the article titles, filtering out English stopwords." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "blob = TextBlob(title_text)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "word_counts = [[word, count] for word, count in blob.lower().word_counts.items() if word not in stopwords]\n", "word_counts = sorted(word_counts, key=itemgetter(1), reverse=True)[:25]\n", "pd.DataFrame(word_counts, columns=['word', 'count']).style.format({'count': '{:,}'}).bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'})" ] },
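 { "cell_type": "markdown", "metadata": {}, "source": [ "TextBlob can also split the text into n-grams. As a rough sketch (using TextBlob's `ngrams()` method and the NLTK stopwords loaded above), here's one way you might count the most common two-word phrases in the titles. Because the titles were joined into a single string, some bigrams will span the boundary between two titles, so treat the counts as approximate." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# Count two-word phrases (bigrams), skipping any that contain a stopword\n", "bigram_counts = Counter(\n", "    ' '.join(ngram) for ngram in blob.lower().ngrams(2)\n", "    if not any(word in stopwords for word in ngram)\n", ")\n", "pd.DataFrame(bigram_counts.most_common(25), columns=['phrase', 'count'])" ] },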
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping newspaper locations\n", "\n", "This makes use of a spreadsheet that maps Trove newspaper titles to locations. Once we've loaded the spreadsheet, we can use it to locate all of the harvested articles." ] },
 { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# URL of the Trove places spreadsheet\n", "trove_places = 'https://docs.google.com/spreadsheets/d/1rURriHBSf3MocI8wsdl1114t0YeyU0BVSXWeg232MZs/gviz/tq?tqx=out:csv&sheet=198244298'\n", "\n", "# Open the CSV file with Pandas\n", "place_df = pd.read_csv(trove_places)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_located = pd.merge(df_newspapers, place_df, how='left', left_on='newspaper_id', right_on='title_id')\n", "\n", "# There may be some newspapers that haven't been added to the locations dataset yet, so we'll drop them\n", "df_located.dropna(axis=0, subset=['latitude'], inplace=True)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load Australian state boundaries\n", "australia = alt.topo_feature('https://raw.githubusercontent.com/GLAM-Workbench/trove-newspapers/master/data/aus_state.geojson', feature='features')\n", "\n", "# Create the map of Australia using the boundaries\n", "aus_background = alt.Chart(australia).mark_geoshape(\n", "\n", "    # Style the map\n", "    fill='lightgray',\n", "    stroke='white'\n", ").project('equirectangular').properties(width=600, height=600)\n", "\n", "# Plot the places\n", "points = alt.Chart(df_located).mark_circle(\n", "    color='steelblue'\n", ").encode(\n", "\n", "    # Set the position of each place using latitude and longitude\n", "    longitude='longitude:Q',\n", "    latitude='latitude:Q',\n", "\n", "    # Size the circles according to the number of articles\n", "    size=alt.Size('count:Q',\n", "        scale=alt.Scale(range=[0, 1000]),\n", "        legend=alt.Legend(title='Number of articles')\n", "    ),\n", "\n", "    # More details on hover\n", "    tooltip=[alt.Tooltip('newspaper_title_x', title='Newspaper'), 'latitude', 'longitude', 'count']\n", ").properties(width=600, height=600)\n", "\n", "# Combine the map and the points\n", "alt.layer(aus_background, points)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org) ([@wragge](https://twitter.com/wragge)) for the [GLAM Workbench](https://github.com/glam-workbench/). \n", "Support this project by [becoming a GitHub sponsor](https://github.com/sponsors/wragge?o=esb).\n" ] }
 ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }