{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/kasparvonbeelen/ghi_python/main?labpath=3_-_Text_and_String_Methods.ipynb)\n",
    "\n",
    "\n",
    "# 3 Working with Text: Strings and string methods\n",
    "\n",
    "\n",
    "## Text Mining for Historians (with Python)\n",
    "## A Gentle Introduction to Working with Textual Data in Python\n",
    "\n",
    "### Created by Kaspar Beelen and Luke Blaxill\n",
    "\n",
    "### For the German Historical Institute, London\n",
    "\n",
    "<img align=\"left\" src=\"https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 String variables and methods\n",
    "\n",
    "Variables can contain, or more correctly, refer to strings. You may have noticed how operations (such as addition) allow you to perform simple string manipulations. For example, we can write a program that prints a greeting with a name.\n",
    "\n",
    "### -- Exercise: \n",
    "\n",
    "Change the value of the `first_name` and `last_name` variables so that the cell below prints a correct greeting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello First_name Last_name\n"
     ]
    }
   ],
   "source": [
    "first_name = 'First_name' # change this to your first name\n",
    "last_name = 'Last_name' # change this to your last name\n",
    "print(\"Hello\"+' '+first_name+' '+last_name) # this combines the variables in a greeting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'd achieve the same results by passing these variables as separate arguments to the `print()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello First_name Last_name\n"
     ]
    }
   ],
   "source": [
    "print(\"Hello\", first_name, last_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But Python provides you with many more tools to process and manipulate strings (and, by extension, whole documents).\n",
    "\n",
    "Below we first inspect the general syntax and discuss a few simple examples.\n",
    "\n",
    "The `Breakout` provides more detailed background information.\n",
    "\n",
    "Let's store (a part of) the famous opening sentence of \"A Tale of Two Cities\" in a variable `first_sentence`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_sentence = \"It was the best of times, it was the worst of times.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### -- Exercise: \n",
    "\n",
    "Print the content of `first_sentence`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Enter answer here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "String variables (and numbers) can be thought of as **objects**, \"things you can do stuff with\". In Python, each object has a set of **methods/functions** attached to it, which are the tools that enable you to manipulate these objects. \n",
    "\n",
    "If objects can be thought of as the **nouns** of a programming language, then methods/functions serve as the **verbs**: they are the tools that operate on (do something with) these objects. \n",
    "\n",
    "In general the methods (or functions) appear in these forms:\n",
    "- `function(object)`\n",
    "- `object.method()`\n",
    "\n",
    "For string objects (`str` in Python), we can change the general notation to:\n",
    "- `function(str)`\n",
    "- `str.method()`\n",
    "\n",
    "\n",
    "This may look confusing at first—and we can't go into detail here about these syntactic differences—but you will get familiar with the syntax pretty soon, we promise.\n",
    "\n",
    "Below we discuss a few functions and methods, which will provide you with the tools for working with text data (more technically strings).\n"
   ]
  },
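  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two forms listed above can be compared side by side. The following is only a minimal sketch: the variable name `word` and its value are made up for illustration and are not used elsewhere in this notebook.\n",
    "\n",
    "```python\n",
    "# 'word' is a hypothetical example value\n",
    "word = 'Revolution'\n",
    "\n",
    "# function(object): the string goes between the parentheses\n",
    "number_of_characters = len(word)\n",
    "\n",
    "# object.method(): the string comes before the dot\n",
    "shouted = word.upper()\n",
    "\n",
    "print(number_of_characters)\n",
    "print(shouted)\n",
    "```\n",
    "\n",
    "Copy the lines into a code cell to run them. In both cases the original string stays untouched and the result is stored under a new name."
   ]
  },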
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### `len()`\n",
    "\n",
    "`len()` takes an object and returns the number of elements, i.e. the length of the object. When given a string, `len()` counts the number of characters, not words.\n",
    "\n",
    "Applying `len()` to `first_sentence` should return 52."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "52"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(first_sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `first_sentence` variable is just a toy example. We can easily load the actual content of [\"A Tale of Two Cities\"](https://www.gutenberg.org/files/98/98-0.txt) and print the number of characters it contains. (Please ignore the code in the example; we show it here only to demonstrate how easily you can scale up from one line of text to a whole book.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "﻿The Project Gutenberg eBook of A Tale of Two Cities, by Charles Dickens\r\n",
      "\r\n",
      "This eBook is for the use of anyone anywhere in the United States and\r\n",
      "most other parts of the world at no cost and with almost no restrictions\r\n",
      "whatsoever. You may copy it, give it away or re-use it under the terms\r\n",
      "of the Project Gutenberg License included with this eBook or online at\r\n",
      "www.gutenberg.org. If you are not located in the United States, you\r\n",
      "will have to check the laws of the country where you are located before\r\n",
      "using this eBook.\r\n",
      "\r\n",
      "Title: A Tale of Two Cities\r\n",
      "       A Story of the French Revolution\r\n",
      "\r\n",
      "Author: Charles Dickens\r\n",
      "\r\n",
      "Release Date: January, 1994 [eBook #98]\r\n",
      "[Most recently updated: December 20, 2020]\r\n",
      "\r\n",
      "Language: English\r\n",
      "\r\n",
      "Character set encoding: UTF-8\r\n",
      "\r\n",
      "Produced by: Judith Boss and David Widger\r\n",
      "\r\n",
      "*** START OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***\r\n",
      "\r\n",
      "\r\n",
      "\r\n",
      "\r\n",
      "A TALE OF TWO CITIES\r\n",
      "\r\n",
      "A STORY OF THE FRENCH REVOLUTION\r\n",
      "\r\n",
      "By Charles Dickens\r\n",
      "\r\n",
      "\r\n",
      "CONTENTS\r\n",
      "\r\n",
      "\r\n",
      "     Book the \n"
     ]
    }
   ],
   "source": [
    "import requests \n",
    "book = requests.get('https://www.gutenberg.org/files/98/98-0.txt').content.decode('utf-8') # download book\n",
    "print(book[:1000]) # print first 1000 characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "793331\n"
     ]
    }
   ],
   "source": [
    "print(len(book)) # print the number of characters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `str.lower()`\n",
    "\n",
    "Lowercasing is often useful for normalizing texts, i.e. removing distinctions between words that we don't really care about when analysing collections at scale. For example, many search engines use lowercasing in the background to provide you with all documents that match your query, i.e. if you search for `berlin` you will also get results for `Berlin` etc. Later in this course, when we focus on counting words, lowercasing will also be useful because we want to count `\"Book\"` and `\"book\"` as the same word.\n",
    "\n",
    "\n",
    "Converting all capitals to lowercase is common practice in text mining, but of course, whether it's appropriate or not depends on the purposes of your research. For example, if you are interested in Named Entities (such as place names), you had better retain capitals, as these contain useful signals for detecting such entities.\n",
    "\n",
    "However, the most important thing at this point is that you understand the syntax of the statement and what it returns. `str.lower()` acts on the string (which comes before the dot) and returns a new string object.\n",
    "\n",
    "Please note that this method works directly on a string or on a variable referring to a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lowercase me!\n"
     ]
    }
   ],
   "source": [
    "print('LOWERCASE ME!'.lower()) # lowercase and print"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lowercase me!\n"
     ]
    }
   ],
   "source": [
    "lowercase = 'LOWERCASE ME!' # variable assignment\n",
    "print(lowercase.lower()) # lowercase variable and print"
   ]
  },
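  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why lowercasing matters for counting, here is a minimal sketch. It relies on `str.count()`, a string method that is not introduced in this notebook, and on a made-up example string `snippet`; treat it as a preview rather than as the approach used later in the course.\n",
    "\n",
    "```python\n",
    "# 'snippet' is a hypothetical example value\n",
    "snippet = 'Book the First. The book begins.'\n",
    "\n",
    "# counting is case-sensitive, so 'Book' is missed here\n",
    "count_original = snippet.count('book')\n",
    "\n",
    "# after lowercasing, both spellings are counted\n",
    "count_lowered = snippet.lower().count('book')\n",
    "\n",
    "print(count_original)\n",
    "print(count_lowered)\n",
    "```\n",
    "\n",
    "Note that `str.count()` matches substrings, not whole words, so a word such as `bookshop` would be counted as well."
   ]
  },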
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Both `len()` and `str.lower()` are called **fruitful** functions/methods: they return something (i.e. a number or a string respectively).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### -- Exercise\n",
    "\n",
    "Lowercase the variable `first_sentence`, store the lowercased version in a new variable and print the length of this variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# add answer here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `str.endswith(parameter)`\n",
    "\n",
    "`str.endswith(parameter)` is another commonly used string method. It differs slightly from `str.lower()` because it requires an argument between the parentheses. `str.endswith(parameter)` returns a **boolean value**: `True` if the string at the left-hand side of the `.` ends with the string given as an argument, and `False` otherwise. This is commonly used to check the extension of a document, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename = 'document_1.txt'\n",
    "filename.endswith('.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename.endswith('.doc')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are using some technical terms here, which will be explained in more detail later. However, we hope that you slowly start to pick up and remember some of these terms just by reading through the notebook. Don't worry too much about the explanations. Try to understand how the code works: that's the most important thing at this point!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `dir()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course the Python string toolkit is much larger. Use the `dir()` function to see all the methods you can apply to a string. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']\n"
     ]
    }
   ],
   "source": [
    "print(dir(str))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']\n"
     ]
    }
   ],
   "source": [
    "print(dir(\"Hello World.\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`dir()` returns a list of all the tools that apply to a string. You can ignore the items starting with `__`, but please look at the elements further down, for example the `str.upper()` method."
   ]
  },
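  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the long list is hard to read, the items starting with `__` can be filtered out. The sketch below uses a list comprehension and `str.startswith()`, neither of which has been explained at this point, so treat it as a preview; the variable name `public_methods` is made up for this illustration.\n",
    "\n",
    "```python\n",
    "# keep only the names that do not start with a double underscore\n",
    "public_methods = [name for name in dir(str) if not name.startswith('__')]\n",
    "\n",
    "print(public_methods)\n",
    "```\n",
    "\n",
    "The remaining names are the string methods you would normally call yourself, such as `lower`, `upper` and `endswith`."
   ]
  },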
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To inspect the `docstring` of a method, which explains its functionality, use `help()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on method_descriptor:\n",
      "\n",
      "upper(self, /)\n",
      "    Return a copy of the string converted to uppercase.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(str.upper)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what `str.upper()` does!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'HELLO'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'hello'.upper()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### -- Exercise\n",
    "\n",
    "- Create a few code cells below\n",
    "- Inspect the docstring of the following methods `str.strip()`, `str.isalpha()` and `str.startswith()`\n",
    "- Create a new string variable (whatever text you prefer)\n",
    "- Apply the above methods to the string and print the outcome"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `Breakout:`\n",
    "- more about [string methods](break_out/string_methods.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Indexing and slicing\n",
    "\n",
    "Another common type of string manipulation is indexing and slicing. Indexing here means retrieving characters of a string (it could also be another data type) by their position (i.e. obtaining the fifth or last character of a word)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Python, we start counting from `0`: to retrieve the first element, we add `[0]` to the end of a string (variable). Note the square brackets!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To print the second character, we need to access the item at position 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "t\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To access the last character, use `[-1]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Slicing is similar to indexing, but it allows you to select a sequence of (multiple) characters. We still use square brackets but add a colon. To the left of the colon stands the index of the first character to include; to the right stands the index at which the slice stops. The character at this stop index is itself not included. \n",
    "\n",
    "Below we print everything between (and including) the sixth and tenth characters, i.e. positions `5` up to (but not including) `10`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "s the\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[5:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Negative indices can also be used for slicing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "times\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[-6:-1])"
   ]
  },
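  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The start or the end of a slice can remain implicit: leaving out the number before the colon starts the slice at the first character, and leaving out the number after the colon continues it up to and including the last character."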
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "It wa\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "imes.\n"
     ]
    }
   ],
   "source": [
    "print(first_sentence[-5:])"
   ]
  },
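  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of how the numbers in a slice relate to each other, using a made-up example string `word` (not part of the notebook's data). Because the stop index is not included, the length of a slice equals stop minus start, and two slices that share a boundary glue back together into the original string.\n",
    "\n",
    "```python\n",
    "# 'word' is a hypothetical example value\n",
    "word = 'historian'\n",
    "\n",
    "start = 2\n",
    "stop = 6\n",
    "middle = word[start:stop]  # positions 2, 3, 4 and 5\n",
    "\n",
    "print(middle)\n",
    "print(len(middle) == stop - start)\n",
    "print(word[:stop] + word[stop:] == word)\n",
    "```\n",
    "\n",
    "Try changing `start` and `stop` to check that both comparisons keep printing `True`."
   ]
  },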
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even though these operations seem pretty abstract, we will use indexing and slicing frequently later in this course. Please consult the `breakout` for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### -- Exercise\n",
    "\n",
    "- Assign the sentence (from Jane Austen's \"Pride and Prejudice\") below to a variable named `sentence`. (Please remember, double click on any Markdown cell to reveal the actual text)\n",
    ">   \"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\"                     \n",
    "- Lowercase the sentence and assign it to `sentence_lower`\n",
    "- Print the first and last **words** of the lowercased sentence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Enter code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `Breakout`:\n",
    "[More on string indexing](break_out/indexing_and_slicing.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 Reading and Opening Text Files\n",
    "\n",
    "In this section, we transition from experimenting with mock examples to working with more realistic, historical examples. First, we do this on a small scale, but soon we'll be processing thousands of newspaper articles!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To open a file in Python, you first have to explain where it is stored. More technically, you provide a location or `path` as a string. The `Break out` will point you to more information about the path syntax; for now a simple example of (what is called) a **relative** path should suffice.\n",
    "\n",
    "A relative path gives the location of a file relative to your current position in the folder structure of your working environment. In our case, this means relative to where the Notebook (the one in which you are working at the moment) is located.\n",
    "\n",
    "To see the files in the current folder, run the `ls .` (list) command in the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10_-_Hypothesis_Testing.ipynb\r\n",
      "11_-_Linear_Regression.ipynb\r\n",
      "12_-_Generalised_Linear_Models.ipynb\r\n",
      "13_-_Supervised_Learning.ipynb\r\n",
      "14_-_Topic_Modelling.ipynb\r\n",
      "15_-_Word_Vectors.ipynb\r\n",
      "1_-_Introduction.ipynb\r\n",
      "2_-_Values_and_Variables.ipynb\r\n",
      "3_-_Text_and_String_Methods.ipynb\r\n",
      "4_-_Processing_texts.ipynb\r\n",
      "5_-_Corpus_Selection.ipynb\r\n",
      "6_-_Corpus_Exploration.ipynb\r\n",
      "7_-_Trends_over_time.ipynb\r\n",
      "8_-_Data_Exploration_with_Pandas_I.ipynb\r\n",
      "9_-_Data_Exploration_with_Pandas_Part_II.ipynb\r\n",
      "LICENSE\r\n",
      "README.md\r\n",
      "\u001b[34mbreak_out\u001b[m\u001b[m\r\n",
      "\u001b[34mcolab_backup\u001b[m\u001b[m\r\n",
      "\u001b[34mdata\u001b[m\u001b[m\r\n",
      "\u001b[34mexample_data\u001b[m\u001b[m\r\n",
      "\u001b[34mimgs\u001b[m\u001b[m\r\n",
      "\u001b[34mlecture_1\u001b[m\u001b[m\r\n",
      "\u001b[34mlecture_2\u001b[m\u001b[m\r\n",
      "postBuild\r\n",
      "requirements.txt\r\n",
      "\u001b[34mutils\u001b[m\u001b[m\r\n"
     ]
    }
   ],
   "source": [
    "!ls ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please note that `!ls` starts with an exclamation mark. `ls` is a bash command you'd normally use in a terminal. This is not very important at the moment, just remember that lines starting with `!` are not Python code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You see the folder `example_data` appearing. Now we can list the items in `example_data`, and then in its subfolder `notebook_3`, again using `ls`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[34mnotebook_3\u001b[m\u001b[m\r\n"
     ]
    }
   ],
   "source": [
    "!ls example_data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shakespeare_sonnet_i.txt\r\n"
     ]
    }
   ],
   "source": [
    "!ls example_data/notebook_3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The relative path to our file is `example_data/notebook_3/shakespeare_sonnet_i.txt`. Python requires you to define the path as a string (i.e. enclosed by single or double quotation marks).\n",
    "\n",
    "Getting the location right is the first part of the puzzle. Next, we need some Python tools to open a file and read its content. It may sound confusing at first (why open _and_ read?), but these are separate steps in Python. \n",
    "\n",
    "Let's use the `open()` function to open the sonnet. As you will notice, this doesn't return the actual text, but a `_io.TextIOWrapper` object (you can safely ignore that)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_io.TextIOWrapper name='example_data/notebook_3/shakespeare_sonnet_i.txt' mode='r' encoding='UTF-8'>"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = \"example_data/notebook_3/shakespeare_sonnet_i.txt\"\n",
    "sonnet = open(path)\n",
    "sonnet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to apply the `read()` method to the `_io.TextIOWrapper` object to inspect the content of the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"From fairest creatures we desire increase,\\nThat thereby beauty's rose might never die,\\nBut as the riper should by time decease,\\nHis tender heir might bear his memory:\\nBut thou, contracted to thine own bright eyes,\\nFeed'st thy light's flame with self-substantial fuel,\\nMaking a famine where abundance lies,\\nThyself thy foe, to thy sweet self too cruel:\\nThou that art now the world's fresh ornament,\\nAnd only herald to the gaudy spring,\\nWithin thine own bud buriest thy content,\\nAnd tender churl mak'st waste in niggarding:\\nPity the world, or else this glutton be,\\nTo eat the world's due, by the grave and thee.\""
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sonnet = open(path).read()\n",
    "sonnet"
   ]
  },
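  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Files that are opened should also be closed again. A common way to do this in Python is the `with` statement, which closes the file automatically when the indented block ends. The sketch below is self-contained and does not touch the sonnet: it first writes a small temporary file (the file name `example.txt` and its content are made up for illustration) and then reads it back.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "# create a hypothetical example file in a temporary folder\n",
    "folder = tempfile.mkdtemp()\n",
    "example_path = os.path.join(folder, 'example.txt')\n",
    "\n",
    "with open(example_path, 'w', encoding='utf-8') as out_file:\n",
    "    out_file.write('It was the best of times')\n",
    "\n",
    "# open and read; the file is closed when the block ends\n",
    "with open(example_path, encoding='utf-8') as in_file:\n",
    "    text = in_file.read()\n",
    "\n",
    "print(text)\n",
    "print(in_file.closed)\n",
    "```\n",
    "\n",
    "Stating the `encoding` explicitly is optional, but it avoids surprises when the same notebook runs on different operating systems."
   ]
  },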
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please note the special characters such as `\\n` (which marks a new line). This becomes apparent when we print the sonnet.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "From fairest creatures we desire increase,\n",
      "That thereby beauty's rose might never die,\n",
      "But as the riper should by time decease,\n",
      "His tender heir might bear his memory:\n",
      "But thou, contracted to thine own bright eyes,\n",
      "Feed'st thy light's flame with self-substantial fuel,\n",
      "Making a famine where abundance lies,\n",
      "Thyself thy foe, to thy sweet self too cruel:\n",
      "Thou that art now the world's fresh ornament,\n",
      "And only herald to the gaudy spring,\n",
      "Within thine own bud buriest thy content,\n",
      "And tender churl mak'st waste in niggarding:\n",
      "Pity the world, or else this glutton be,\n",
      "To eat the world's due, by the grave and thee.\n"
     ]
    }
   ],
   "source": [
    "print(sonnet)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the `sonnet` variable refers to a string, we can use everything we learned before to analyse and manipulate this string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "609"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(sonnet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"from fairest creatures we desire increase,\\nthat thereby beauty's rose might never die,\\nbut as the riper should by time decease,\\nhis tender heir might bear his memory:\\nbut thou, contracted to thine own bright eyes,\\nfeed'st thy light's flame with self-substantial fuel,\\nmaking a famine where abundance lies,\\nthyself thy foe, to thy sweet self too cruel:\\nthou that art now the world's fresh ornament,\\nand only herald to the gaudy spring,\\nwithin thine own bud buriest thy content,\\nand tender churl mak'st waste in niggarding:\\npity the world, or else this glutton be,\\nto eat the world's due, by the grave and thee.\""
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sonnet.lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`str.find()` allows you to query a string for a substring. It returns the index at which the first match for your query starts (the lowest such index), or `-1` if the substring does not occur."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on method_descriptor:\n",
      "\n",
      "find(...)\n",
      "    S.find(sub[, start[, end]]) -> int\n",
      "    \n",
      "    Return the lowest index in S where substring sub is found,\n",
      "    such that sub is contained within S[start:end].  Optional\n",
      "    arguments start and end are interpreted as in slice notation.\n",
      "    \n",
      "    Return -1 on failure.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(str.find)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "98"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sonnet.find('riper')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"riper should by time decease,\\nHis tender heir might bear his memory:\\nBut thou, contracted to thine own bright eyes,\\nFeed'st thy light's flame with self-substantial fuel,\\nMaking a famine where abundance lies,\\nThyself thy foe, to thy sweet self too cruel:\\nThou that art now the world's fresh ornament,\\nAnd only herald to the gaudy spring,\\nWithin thine own bud buriest thy content,\\nAnd tender churl mak'st waste in niggarding:\\nPity the world, or else this glutton be,\\nTo eat the world's due, by the grave and thee.\""
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sonnet[98:]"
   ]
  },
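  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the docstring states, `str.find()` returns `-1` when the substring does not occur. It is also case-sensitive. The sketch below illustrates both points on the first line of the sonnet, retyped here as the variable `line` so that the example runs on its own; checking for `-1` before slicing is a suggestion, not a requirement.\n",
    "\n",
    "```python\n",
    "# first line of the sonnet, retyped so the example is self-contained\n",
    "line = 'From fairest creatures we desire increase,'\n",
    "\n",
    "position = line.find('creatures')\n",
    "missing = line.find('Creatures')  # capital C: no match\n",
    "\n",
    "print(position)\n",
    "print(missing)\n",
    "\n",
    "# only slice when the substring was actually found\n",
    "if position != -1:\n",
    "    print(line[position:])\n",
    "```\n",
    "\n",
    "Without the check, a result of `-1` would silently be read as a negative index, and `line[-1:]` would return only the last character."
   ]
  },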
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## `Break out`\n",
    "- [paths and filenames](break_out/paths.ipynb)\n",
    "\n",
    "Read more on [reading and writing files](https://openbookproject.net/thinkcs/python/english3e/files.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fin."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}