{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(text-intro)=\n", "# Introduction to Text\n", "\n", "This chapter covers how to use code to work with text as data, including opening files that contain text, changing and cleaning text, and vectorised operations on text.\n", "\n", "It has benefitted from the [Python String Cook Book](https://mkaz.blog/code/python-string-format-cookbook/) and Jake VanderPlas' [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html).\n", "\n", "Note that regexes are mentioned a few times in this chapter; you'll find out much more about them in the {ref}`text-regex` chapter.\n", "\n", "## An aside on encodings\n", "\n", "Before we get to the good stuff, we need to talk about string encodings. Whether you're using code or a text editor (Notepad, Word, Pages, Visual Studio Code), every bit of text that you see on a computer will have an encoding behind the scenes that tells the computer how to display the underlying data. There is no such thing as 'plain' text: all text on computers is the result of an encoding. Oftentimes, a computer programme (email reader, Word, whatever) will guess the encoding and show you what it thinks the text should look like. But it doesn't always know, or get it right: *that is what is happening when you get an email or open a file full of weird symbols and question marks*. If a computer doesn't know whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), it simply cannot display it correctly and you get gibberish.\n", "\n", "When it comes to encodings, there are just two things to remember: i) you should use UTF-8, the most widely used Unicode encoding; it's the international standard. ii) the Windows operating system tends to use either Latin 1 or Windows 1252 but (and this is good news) it is moving to UTF-8.\n", "\n", "[Unicode](https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.\n", "\n", "Take special care when saving CSV files containing text on a Windows machine using Excel; unless you specify otherwise, the text may not be saved in UTF-8. If you (or your computer) get confused about encodings and re-save a file with the wrong one, you could lose data.\n", "\n", "Hopefully you'll never have to worry about string encodings. But if you *do* see weird symbols appearing in your text, at least you'll know that there's an encoding problem and will know where to start Googling. You can find a much more in-depth explanation of text encodings [here](https://kunststube.net/encoding/).\n", "\n", "## Strings\n", "\n", "Note that there are many built-in methods for working with strings in Python; you can find a comprehensive list [here](https://www.w3schools.com/python/python_ref_string.asp).\n", "\n", "Strings are the basic data type for text in Python. They can be of any length. A string can be signalled by single quote marks or double quote marks, like so:\n", "\n", "`'text'`\n", "\n", "or\n", "\n", "`\"text\"`\n", "\n", "Style guides tend to prefer the latter but some coders (ahem!) have a bad habit of using the former. 
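" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Either way, you get exactly the same string; as a quick check, a single-quoted and a double-quoted version of the same text compare as equal:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'text' == \"text\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "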
We can put this into a variable like so:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var = \"banana\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if we check the type of the variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that it is `str`, which is short for string.\n", "\n", "Strings in Python can be indexed, so we can get certain characters out by using square brackets to say which positions we would like." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usual slicing tricks that apply to lists work for strings too, i.e. the positions you want to get can be retrieved using the `var[start:stop:step]` syntax. Here's an example of getting every other character from the string starting from the 2nd position." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var[1::2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that strings, like tuples such as `(1, 2, 3)` but unlike lists such as `[1, 2, 3]`, are *immutable*. This means commands like `var[0] = \"B\"` will result in an error. If you want to change a single character, you will have to replace the entire string. In this example, the command to do that would be `var = \"Banana\"`.\n", "\n", "Like lists, you can find the length of a string using `len()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `+` operator concatenates two or more strings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "second_word = \"panther\"\n", "first_word = \"black\"\n", "print(first_word + \" \" + second_word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we added a space so that the phrase made sense. Another way of achieving the same end that scales to many words more efficiently (if you have them in a list) is:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\" \".join([first_word, second_word])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three useful functions to know about are `upper()`, `lower()`, and `title()`. Let's see what they do\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var = \"input TEXT\"\n", "var_list = [var.upper(), var.lower(), var.title()]\n", "print(var_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Exercise\n", "Reverse the string `\"gnirts desrever a si sihT\"` using indexing operations.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we're using `print()`, it has a few tricks. If we have a list, we can print out entries with a given separator:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*var_list, sep=\"; and \\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(We'll find out more about what '\\n' does shortly.) 
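" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Incidentally, you can get the same printed output by using `join()` from earlier to build a single combined string and then printing that:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"; and \\n\".join(var_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "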
To turn variables of other kinds into strings, use the `str()` function, for example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", "    \"A boolean is either \"\n", "    + str(True)\n", "    + \" or \"\n", "    + str(False)\n", "    + \", there are only \"\n", "    + str(2)\n", "    + \" options.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, two booleans and one integer were converted to strings. `str()` generally makes an intelligent guess at how you'd like to convert your non-string type variable into a string type. You can pass a variable or a literal value to `str()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### f-strings\n", "\n", "The example above is quite verbose. Another way of combining strings with variables is via *f-strings*. A simple f-string looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "variable = 15.32399\n", "print(f\"You scored {variable}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is similar to calling `str()` on `variable` and using `+` for concatenation, but much shorter to write. You can add expressions to f-strings too:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"You scored {variable**2}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also works with functions; after all, `**2` is just a function (raising to the power of 2) written with its own special syntax." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, the score number that came out had a lot of (probably) uninteresting decimal places. So how do we polish the printed output? You can pass more information to the f-string to get the output formatted just the way you want. 
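" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As an aside, the point above about functions already gives one quick way to tidy this up: call the built-in `round()` function inside the braces.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"You scored {round(variable, 2)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "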
Let's say we wanted two decimal places and a sign (although you always write `+` in the formatting, the sign comes out as + or - depending on the value):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"You scored {variable:+.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a whole range of formatting options for numbers as shown in the following table:\n", "\n", "| Number \t| Format \t| Output \t| Description \t|\n", "|------------\t|---------\t|------------\t|-----------------------------------------------\t|\n", "| 15.32347 \t| {:.2f} \t| 15.32 \t| Format float 2 decimal places \t|\n", "| 15.32347 \t| {:+.2f} \t| +15.32 \t| Format float 2 decimal places with sign \t|\n", "| -1 \t| {:+.2f} \t| -1.00 \t| Format float 2 decimal places with sign \t|\n", "| 15.32347 \t| {:.0f} \t| 15 \t| Format float with no decimal places \t|\n", "| 3 \t| {:0>2d} \t| 03 \t| Pad number with zeros (left padding, width 2) \t|\n", "| 3 \t| {:*<4d} \t| 3*** \t| Pad number with *’s (right padding, width 4) \t|\n", "| 13 \t| {:*<4d} \t| 13** \t| Pad number with *’s (right padding, width 4) \t|\n", "| 1000000 \t| {:,} \t| 1,000,000 \t| Number format with comma separator \t|\n", "| 0.25 \t| {:.1%} \t| 25.0% \t| Format percentage \t|\n", "| 1000000000 \t| {:.2e} \t| 1.00e+09 \t| Exponent notation \t|\n", "| 12 \t| {:10d} \t| 12 | Right aligned (default, width 10) \t|\n", "| 12 \t| {:<10d} \t| 12 | Left aligned (width 10) \t|\n", "| 12 \t| {:^10d} \t| 12 | Center aligned (width 10) \t|\n", "\n", "As well as using this page interactively through the Colab and Binder links at the top of the page, or downloading this page and using it on your own computer, you can play around with some of these options over at [this link](https://www.python-utils.com/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Special characters\n", "\n", "Python has a string module that comes with some useful built-in strings and characters. For example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import string\n", "\n", "string.punctuation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "gives you all of the punctuation," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "string.ascii_letters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "returns all of the basic letters in the 'ASCII' encoding (with `.ascii_lowercase` and `.ascii_uppercase` variants), and" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "string.digits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "gives you the numbers from 0 to 9. Finally, though less impressive visually, `string.whitespace` gives a string containing all of the different (there is more than one!) types of whitespace." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are other special characters around; in fact, we already met the most famous of them: \"\\n\" for new line. 
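" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, \"\\n\" is one of the characters in `string.whitespace`; evaluating that constant shows the full set of whitespace characters in their escaped forms:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "string.whitespace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "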
To actually print \"\\n\" we have to 'escape' the backslash by adding another backslash:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Here is a \\n new line\")\n", "print(\"Here is an \\\\n escaped new line \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table below shows the most important escape commands:\n", "\n", "| Code \t| Result \t|\n", "|------\t|-----------------\t|\n", "| `\\'` \t| Single Quote (useful if using `'` for strings) \t|\n", "| `\\\"` \t| Double Quote (useful if using `\"` for strings) \t|\n", "| `\\\\` \t| Backslash \t|\n", "| `\\n` \t| New Line \t|\n", "| `\\r` \t| Carriage Return \t|\n", "| `\\t` \t| Tab \t|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Methods for Strings\n", "\n", "Let's end this sub-section on strings with a comprehensive overview of all string methods, courtesy of the excellent [**rich**](https://github.com/willmcgugan/rich) package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rich import inspect\n", "\n", "var_of_type_str = \"string\"\n", "inspect(var_of_type_str, methods=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning Text\n", "\n", "You often want to make changes to the text you're working with. In this section, we'll look at the various options to do this.\n", "\n", "### Replacing sub-strings\n", "\n", "A common text task is to replace a substring within a longer string. If you have a string variable, say `var`, you can use `var.replace(old_text, new_text)` to do this.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"Value is objective\".replace(\"objective\", \"subjective\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with other string methods, this also works when the strings involved are stored in variables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = \"Value is objective\"\n", "old_substr = \"objective\"\n", "new_substr = \"subjective\"\n", "text.replace(old_substr, new_substr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `.replace()` performs an exact replace and so is case-sensitive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing characters with translate\n", "\n", "A character is an individual entry within a string, like the 'l' in 'equilibrium'. You can always count the number of characters in a string variable called `var` by using `len(var)`. A very fast method for replacing individual characters in a string is `str.translate()`.\n", "\n", "Replacing characters is extremely useful in certain situations, most commonly when you wish to remove all punctuation prior to doing other text analysis. You can use the built-in `string.punctuation` for this.\n", "\n", "Let's see how to use it to remove all of the vowels from some text. With apologies to economist Lisa Cook, we'll use the abstract from {cite:t}`cook2011inventing` as the text we'll modify, and we'll first create a dictionary that translates each vowel to nothing, i.e. `\"\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example_text = \"Much recent work has focused on the influence of social capital on innovative outcomes. 
Little research has been done on disadvantaged groups who were often restricted from participation in social networks that provide information necessary for invention and innovation. Unique new data on African American inventors and patentees between 1843 and 1930 permit an empirical investigation of the relation between social capital and economic outcomes. I find that African Americans used both traditional, i.e., occupation-based, and nontraditional, i.e., civic, networks to maximize inventive output and that laws constraining social-capital formation are most negatively correlated with economically important inventive activity.\"\n", "vowels = \"aeiou\"\n", "translation_dict = {x: \"\" for x in vowels}\n", "translation_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we turn our dictionary into a string translator and apply it to our text:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "translator = example_text.maketrans(translation_dict)\n", "example_text.translate(translator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Exercise\n", "Use `translate()` to replace all punctuation in the following sentence with spaces: \"The well-known story I told at the conferences [about hypocondria] in Boston, New York, Philadelphia,...and Richmond went as follows: It amused people who knew Tommy to hear this; however, it distressed Suzi when Tommy (1982--2019) asked, \\\"How can I find out who yelled, 'Fire!' in the theater?\\\" and then didn't wait to hear Missy give the answer---'Dick Tracy.'\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, `str.translate()` is very fast at replacing individual characters in strings. But you can also do it using a list comprehension and a `join()` of the resulting list, like so:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\".join(\n", "    [\n", "        ch\n", "        for ch in \"Example. string. with- excess_ [punctuation]/,\"\n", "        if ch not in string.punctuation\n", "    ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slugifying\n", "\n", "A special case of string cleaning occurs when you are given text containing lots of non-standard characters, spaces, and other symbols, and what you want is a clean string suitable for a filename or a column heading in a dataframe. Remember that it's best practice to have filenames that don't have spaces in them. Slugifying is the process of creating the latter from the former, and we can use the [**slugify**](https://github.com/un33k/python-slugify) package to do it.\n", "\n", "Here are some examples of slugifying text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from slugify import slugify\n", "\n", "txt = \"the quick brown fox jumps over the lazy dog\"\n", "slugify(txt, stopwords=[\"the\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this very simple example, the words listed in the `stopwords=` keyword argument (a list) are removed and spaces are replaced by hyphens. Let's now see a more complicated example:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "slugify(\"当我的信息改变时... àccêntæd tËXT \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slugify converts text to Latin characters, while also removing accents and whitespace of all kinds (the last whitespace in the example above is a tab). 
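" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also have **slugify** swap specific strings for others as it goes. As a minimal sketch (this assumes a recent version of the package, where the keyword argument is `replacements=` and it takes a list of `[old, new]` pairs):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "slugify(txt, replacements=[[\"fox\", \"dog\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "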
The `replacements=` keyword argument will replace specific strings with other strings; it takes a list of lists format, eg `replacements=[['old_text', 'new_text']]`, with each inner pair giving the old text and its replacement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting strings\n", "\n", "If you want to split a string at a certain position, there are two quick ways to do it. The first is to use indexing methods, which work well if you know at which position you want to split text, eg\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"This is a sentence and we will split it at character 18\"[:18]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next up, we can use the built-in `split()` method, which breaks up a string wherever a given sub-string occurs and returns a list of the pieces:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"This is a sentence. And another sentence. And a third sentence\".split(\".\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the character used to split the string is removed from the resulting list of strings. Let's see an example with a string used for splitting instead of a single character:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"This is a sentence. And another sentence. And a third sentence\".split(\"sentence\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A useful extra function to know about is `splitlines()`, which splits a string at line breaks and returns the split parts as a list." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### count and find\n", "\n", "Let's do some simple counting of words within text using `str.count()`, taking the first verse of Elizabeth Bishop's sestina 'A Miracle for Breakfast' as our text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = \"At six o'clock we were waiting for coffee, \\n waiting for coffee and the charitable crumb \\n that was going to be served from a certain balcony \\n --like kings of old, or like a miracle. \\n It was still dark. One foot of the sun \\n steadied itself on a long ripple in the river.\"\n", "word = \"coffee\"\n", "print(f'The word \"{word}\" appears {text.count(word)} times.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Meanwhile, `find()` returns the position of the first occurrence of a particular word or character." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text.find(word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check this using the number we get and some string indexing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text[text.find(word) : text.find(word) + len(word)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But this isn't the only place where the word 'coffee' appears. If we want to find the last occurrence, it's" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text.rfind(word)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scaling up from a single string to a corpus\n", "\n", "For this section, it's useful to be familiar with the **pandas** package, which is covered in the [Data Analysis Quickstart](data-quickstart) and [Working with Data](working-with-data) sections. 
This section will closely follow the treatment by Jake VanderPlas.\n", "\n", "We've seen how to work with individual strings. But often we want to work with a group of strings, otherwise known as a corpus, that is a collection of texts. It could be a collection of words, sentences, paragraphs, or some domain-based grouping (eg job descriptions).\n", "\n", "Fortunately, many of the methods that we have seen deployed on a single string can be straightforwardly scaled up to hundreds, thousands, or millions of strings using **pandas** or other tools. This scaling up is achieved via *vectorisation*, in analogy with going from a single value (a scalar) to multiple values in a list (a vector).\n", "\n", "As a very minimal example, here is capitalisation of names vectorised using a list comprehension:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[name.capitalize() for name in [\"ada\", \"adam\", \"elinor\", \"grace\", \"jean\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A **pandas** series can be used in place of a list. Let's create the series first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "dfs = pd.Series(\n", " [\"ada lovelace\", \"adam smith\", \"elinor ostrom\", \"grace hopper\", \"jean bartik\"],\n", " dtype=\"string\",\n", ")\n", "dfs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we use the syntax series.str.function to change the text series:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfs.str.title()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we had a dataframe and not a series, the syntax would change to refer just to the column of interest like so:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(dfs, columns=[\"names\"])\n", "df[\"names\"].str.title()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table below shows a non-exhaustive list of the string methods that are available in **pandas**.\n", "\n", "| Function (preceded by `.str.`) | What it does |\n", "|-----------------------------|-------------------------|\n", "| `len()` | Length of string. |\n", "| `lower()` | Put string in lower case. |\n", "| `upper()` | Put string in upper case. |\n", "| `capitalize()` | Put string in leading upper case. |\n", "| `swapcase()` | Swap cases in a string. |\n", "| `translate()` | Returns a copy of the string in which each character has been mapped through a given translation table. 
|\n", "| `ljust()` | Left pad a string (default is to pad with spaces) |\n", "| `rjust()` | Right pad a string (default is to pad with spaces) |\n", "| `center()` | Pad such that string appears in centre (default is to pad with spaces) |\n", "| `zfill()` | Pad with zeros |\n", "| `strip()` | Strip out leading and trailing whitespace |\n", "| `rstrip()` | Strip out trailing whitespace |\n", "| `lstrip()` | Strip out leading whitespace |\n", "| `find()` | Return the lowest index in the data where a substring appears |\n", "| `split()` | Split the string using a passed substring as the delimiter |\n", "| `isupper()` | Check whether string is upper case |\n", "| `isdigit()` | Check whether string is composed of digits |\n", "| `islower()` | Check whether string is lower case |\n", "| `startswith()` | Check whether string starts with a given sub-string |\n", "\n", "Regular expressions can also be scaled up with **pandas**. The below table shows vectorised regular expressions.\n", "\n", "| Function | What it does |\n", "|-|----------------------------------|\n", "| `match()` | Call `re.match()` on each element, returning a boolean. |\n", "| `extract()` | Call `re.match()` on each element, returning matched groups as strings. |\n", "| `findall()` | Call `re.findall()` on each element |\n", "| `replace()` | Replace occurrences of pattern with some other string |\n", "| `contains()` | Call `re.search()` on each element, returning a boolean |\n", "| `count()` | Count occurrences of pattern |\n", "| `split()` | Equivalent to `str.split()`, but accepts regexes |\n", "| `rsplit()` | Equivalent to `str.rsplit()`, but accepts regexes |\n", "\n", "\n", "Let's see a couple of these in action. First, splitting on a given sub-string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"names\"].str.split(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's fairly common that you want to split out strings and save the results to new columns in your dataframe. 
You can specify a (max) number of splits via the `n=` kwarg and you can get the columns using `expand`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"names\"].str.split(\" \", n=2, expand=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Exercise\n", "Using vectorised operations, create a new column with the index position where the first vowel occurs for each row in the `names` column.\n", "```\n", "\n", "Here's an example of using a regex function with **pandas**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"names\"].str.extract(\"(\\w+)\", expand=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few more vectorised string operations that are useful.\n", "\n", "| Method | Description |\n", "|-|-|\n", "| `get()` | Index each element |\n", "| `slice()` | Slice each element |\n", "| `slice_replace()` | Replace slice in each element with passed value |\n", "| `cat()` | Concatenate strings |\n", "| `repeat()` | Repeat values |\n", "| `normalize()` | Return Unicode form of string |\n", "| `pad()` | Add whitespace to left, right, or both sides of strings |\n", "| `wrap()` | Split long strings into lines with length less than a given width |\n", "| `join()` | Join strings in each element of the Series with passed separator |\n", "| `get_dummies()` | extract dummy variables as a dataframe |\n", "\n", "\n", "The `get()` and `slice()` methods give access to elements of the lists returned by `split()`. Here's an example that combines `split()` and `get()`:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"names\"].str.split().str.get(-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We already saw `get_dummies()` in the [Regression](regression) chapter, but it's worth revisiting it here with strings. If we have a column with tags split by a symbol, we can use this function to split it out. For example, let's create a dataframe with a single column that mixes subject and nationality tags:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(\n", " {\n", " \"names\": [\n", " \"ada lovelace\",\n", " \"adam smith\",\n", " \"elinor ostrom\",\n", " \"grace hopper\",\n", " \"jean bartik\",\n", " ],\n", " \"tags\": [\"uk; cs\", \"uk; econ\", \"usa; econ\", \"usa; cs\", \"usa; cs\"],\n", " }\n", ")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we now use `str.get_dummies` and split on `;` we can get a dataframe of dummies." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"tags\"].str.get_dummies(\";\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading Text In\n", "\n", "### Text file\n", "\n", "If you have just a plain text file, you can read it in like so:\n", "\n", "```python\n", "fname = 'book.txt'\n", "with open(fname, encoding='utf-8') as f:\n", " text_of_book = f.read()\n", "```\n", "\n", "You can also read a text file directly into a **pandas** dataframe using \n", "\n", "```python\n", "df = pd.read_csv('book.txt', delimiter = \"\\n\")\n", "```\n", "\n", "In the above, the delimiter for different rows of the dataframe is set as \"\\n\", which means new line, but you could use whatever delimiter you prefer.\n", "\n", "```{admonition} Exercise\n", "Download the file 'smith_won.txt' from this book's github repository using this [link](https://github.com/aeturrell/coding-for-economists/blob/main/data/smith_won.txt) (use right-click and save as). Then read the text in using **pandas**.\n", "```\n", "\n", "### CSV file\n", "\n", "CSV files are already split into rows. By far the easiest way to read in csv files is using **pandas**,\n", "\n", "```python\n", "df = pd.read_csv('book.csv')\n", "```\n", "\n", "Remember that **pandas** can read many other file types too." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('codeforecon')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "vscode": { "interpreter": { "hash": "c4570b151692b3082981c89d172815ada9960dee4eb0bedb37dc10c95601d3bd" } } }, "nbformat": 4, "nbformat_minor": 4 }