{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Counting words and phrases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've harvested [a gigabyte of OCRd text](https://glam-workbench.github.io/trove-books/#ocrd-text-from-trove-books-and-ephemera) from Trove's digitised books and shared it through Cloudstor. Here we'll explore [*Australian Plain Cookery by a Practical Cook*](https://nla.gov.au/nla.obj-579917051) from 1882. However, you could change the `text_file` value below to point to any of the other books on Cloudstor. There's a complete list [in this CSV file](https://github.com/GLAM-Workbench/trove-books/blob/master/trove_digitised_books_with_ocr.csv)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from textblob import TextBlob\n", "import re\n", "from collections import Counter\n", "from nltk.corpus import stopwords\n", "from wordcloud import WordCloud\n", "import nltk\n", "nltk.download('stopwords')\n", "nltk.download('punkt')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "CLOUDSTOR_URL = 'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL'\n", "text_file = 'australian-plain-cookery-by-a-practical-cook-nla.obj-579917051.txt'\n", "stop_words = stopwords.words('english')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Get the text file from Cloudstor\n", "response = requests.get(f'{CLOUDSTOR_URL}/download?files={text_file}')\n", "text = response.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counting words\n", "\n", "One way of getting a sense of what a piece of text is about is to look at the frequencies with which words appear. You don't need any special software to do basic word counts. 
You can just split the text into individual words (called tokens) using a regular expression – in the case below, `\\w+` looks for groups of alphanumeric characters, separating words from punctuation and spaces. Then you can use `Counter` to find the frequency of each word and `.most_common()` to rank them." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 3118),\n", " ('and', 2826),\n", " ('a', 2588),\n", " ('of', 2351),\n", " ('to', 1171),\n", " ('in', 1148),\n", " ('with', 1052),\n", " ('it', 824),\n", " ('for', 640),\n", " ('or', 531)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = re.findall(r'\\w+', text.lower())\n", "Counter(words).most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
.most_common()
. When you've finished, remember to run the cell again using Shift+Enter!
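
If you'd like to see how the number passed to `.most_common()` changes the result without downloading the book, here is a minimal sketch on a made-up sentence. The `sample` string and the variable names are assumptions for illustration only, and the pattern `[a-z0-9]+` is a narrower stand-in for the one used above, so counts on real text may differ slightly.

```python
import re
from collections import Counter

# A made-up sentence standing in for the book text (not from the cookbook)
sample = 'Boil the rice, then add the milk and the sugar.'

# Lowercase the text and pull out runs of letters and digits as tokens
tokens = re.findall('[a-z0-9]+', sample.lower())

# Count every token, then keep only the three most frequent
counts = Counter(tokens)
top_three = counts.most_common(3)
print(top_three)
```

Changing the `3` to another number returns a longer or shorter list. Calling `.most_common()` with no number returns every word, ranked from most to least frequent.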