{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLTK - Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 10.1.\n", "\n", "Download the two files below:\n", "\n", "* https://github.com/peterverhaar/PythonCourse/raw/master/Texts/obama.txt\n", "* https://github.com/peterverhaar/PythonCourse/raw/master/Texts/trump.txt\n", "\n", "They are the full transcripts of the inaugurational speeches of the two most recent presidents of he USA. Use the NLTK library to answer the following questions about each of these texts.\n", "\n", "* What is the average sentence length (measured in number of words)?\n", "* What is the ratio between the number of adjectives and the total number of words?\n", "* Which adverbs occur in the text?\n", "* What is the type-token ratio?\n", "\n", "N.B. The Penn Treebank codes for adjectives are 'JJ', 'JJR' and 'JJS'. Adverbs are identifed using the labels 'RB', 'RBR' and 'RBS'. \n", "\n", "If you have never used NLTK before, you need to run the code below to install the relevant components." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download('punkt')\n", "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.tokenize import sent_tokenize, word_tokenize\n", "import re\n", "\n", "\n", "file = open( 'obama.txt', encoding = 'utf-8' )\n", "fullText1 = file.read()\n", "\n", "file = open( 'trump.txt', encoding = 'utf-8' )\n", "fullText2 = file.read()\n", "\n", "\n", "\n", "def averageSentenceLength(text):\n", " words = word_tokenize(text)\n", " sentences = sent_tokenize(text)\n", " return len(words) / len(sentences)\n", "\n", "def typeToken(text):\n", " words = word_tokenize(text)\n", " words = words[ 0 : 500]\n", " unique = dict()\n", " for w in words:\n", " unique[w] = unique.get( w, 0 ) + 1\n", " return len(unique) / len(words)\n", "\n", "def getAdjectives(text):\n", " words = word_tokenize(text)\n", " pos = nltk.pos_tag(words)\n", "\n", " adjectives = dict()\n", "\n", " for p in pos:\n", " if re.search( '^JJ' , p[1] ):\n", " adjectives[p[0].lower()] = adjectives.get( p[0] , 0 ) + 1\n", " return adjectives\n", "\n", "\n", "print( 'Average sentence length: ' )\n", "\n", "print( 'text 1: {} ; text 2 : {}'.format( averageSentenceLength( fullText1 ) , averageSentenceLength( fullText2 ) ) )\n", "\n", "\n", "print( 'Type-token ratio: ' )\n", "print( 'text 1: {} ; text 2 : {}'.format( typeToken( fullText1 ) , typeToken( fullText2 ) ) )\n", "\n", "adj = getAdjectives(fullText1)\n", "\n", "\n", "for a in sorted(adj):\n", " print(a)\n", "\n", "for a in sorted(adj):\n", " print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }