{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Classics Cookbook\n", "#### Edited by [Patrick J. Burns](https://github.com/diyclassics)\n", "with contributions from [Anna Conser](https://github.com/aconser/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Replace macrons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem**: You want to remove all of the macrons from a string, like the following sentence from Caesar's *Bellum Gallicum*." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "text_with_macrons = \"\"\"Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur.\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are three methods for removing macrons: 1. with ```replace```; 2. with ```re.sub```; and 3. with ```translate```. Click [here](#remove-macrons-best) for the TLDR best solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### string ```replace```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dīvīsa > divisa\n", "Aquītānī > Aquitani\n" ] } ], "source": [ "# simple replacement\n", "\n", "word = 'dīvīsa'\n", "word_without_macrons = word.replace('ī', 'i')\n", "print(f\"{word} > {word_without_macrons}\")\n", "\n", "word = 'Aquītānī'\n", "word_without_macrons = word.replace('ā', 'a').replace('ī', 'i')\n", "print(f\"{word} > {word_without_macrons}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It would be tedious to chain together enough ```replace``` methods to solve this problem. So, we could create a dictionary of replacement patterns and loop over them, replacing the text with each pass." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'ā': 'a', 'ē': 'e', 'ī': 'i', 'ō': 'o', 'ū': 'u', 'ȳ': 'y', 'Ā': 'A', 'Ē': 'E', 'Ī': 'I', 'Ō': 'O', 'Ū': 'U', 'Ȳ': 'Y'}\n" ] } ], "source": [ "# create dictionary of macrons\n", "\n", "macron_map = {\n", "    'ā': 'a',\n", "    'ē': 'e',\n", "    'ī': 'i',\n", "    'ō': 'o',\n", "    'ū': 'u',\n", "    'ȳ': 'y',\n", "    'Ā': 'A',\n", "    'Ē': 'E',\n", "    'Ī': 'I',\n", "    'Ō': 'O',\n", "    'Ū': 'U',\n", "    'Ȳ': 'Y'\n", "}\n", "\n", "# compact method with dictionary comprehension\n", "\n", "vowels = 'aeiouyAEIOUY'\n", "vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'\n", "macron_map = {k: v for k, v in zip(vowels_with_macrons, vowels)}\n", "\n", "print(macron_map)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.\n" ] } ], "source": [ "# replace by iterating over dictionary\n", "\n", "text_without_macrons = text_with_macrons\n", "\n", "for k, v in macron_map.items():\n", "    text_without_macrons = text_without_macrons.replace(k, v)\n", "\n", "print(text_without_macrons)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# function for replace by iterating over dictionary\n", "\n", "def remove_macrons_1(text_with_macrons, replacement_dictionary):\n", "    text_without_macrons = text_with_macrons\n", "    for k, v in replacement_dictionary.items():\n", "        text_without_macrons = text_without_macrons.replace(k, v)\n", "    return text_without_macrons" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4 µs, sys: 1 µs, total: 5 µs\n", "Wall time: 12.2 µs\n" ] }, { "data": { "text/plain": [ "'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time remove_macrons_1(text_with_macrons, macron_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### replacement with regular expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option would be to do the same thing with regular expressions instead of ```replace```..."
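] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function in the next cell loops over the replacement dictionary and calls ```re.sub``` once per key. As an aside, a single compiled pattern with a character class and a dictionary lookup can make all of the replacements in one pass; the cell below is only a sketch of that idea (it reuses the ```macron_map``` and ```text_with_macrons``` defined above) and is not used in the timings that follow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: one-pass regex replacement with a character class and a dict lookup\n", "\n", "import re\n", "\n", "macron_pattern = re.compile('[%s]' % re.escape(''.join(macron_map)))\n", "print(macron_pattern.sub(lambda m: macron_map[m.group(0)], text_with_macrons))"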
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# function for re.sub by iterating over dictionary\n", "\n", "import re\n", "\n", "def remove_macrons_2(text_with_macrons, replacement_dictionary):\n", "    text_without_macrons = text_with_macrons\n", "    for k, v in replacement_dictionary.items():\n", "        text_without_macrons = re.sub(rf'{k}', v, text_without_macrons, flags=re.MULTILINE)\n", "    return text_without_macrons" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4 µs, sys: 1 µs, total: 5 µs\n", "Wall time: 7.87 µs\n" ] }, { "data": { "text/plain": [ "'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time remove_macrons_2(text_with_macrons, macron_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a single sentence, this turns out to take about the same amount of time to run (not so with larger texts, as we see below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### replacement with ```translate```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option is the ```translate``` method. This allows us to make all of the changes using a translation table without having to loop repeatedly over the original string." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{257: 'a', 275: 'e', 299: 'i', 333: 'o', 363: 'u', 563: 'y', 256: 'A', 274: 'E', 298: 'I', 332: 'O', 362: 'U', 562: 'Y'}\n" ] } ], "source": [ "# compact method with dictionary comprehension\n", "# note that translate uses ord(), i.e. the Unicode code point, for each mapped character\n", "\n", "vowels = 'aeiouyAEIOUY'\n", "vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'\n", "macron_table = {ord(k): v for k, v in zip(vowels_with_macrons, vowels)}\n", "\n", "print(macron_table)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# function for replacing macrons with translate\n", "\n", "def remove_macrons_3(text_with_macrons, macron_table):\n", "    text_without_macrons = text_with_macrons.translate(macron_table)\n", "    return text_without_macrons" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14 µs, sys: 1 µs, total: 15 µs\n", "Wall time: 19.8 µs\n" ] }, { "data": { "text/plain": [ "'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time remove_macrons_3(text_with_macrons, macron_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing recipes on larger texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All three methods run at about the same speed on a single sentence. But minor differences can add up as the amount of text to be processed increases. How do these recipes perform on larger texts?"
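] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before timing the recipes on a longer text, it is worth a quick check that all three of them actually agree on the sample sentence. The next cell is a small sanity-check sketch added for that purpose; it only uses the functions and variables defined above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sanity check (sketch): the three recipes should produce identical results\n", "\n", "assert (remove_macrons_1(text_with_macrons, macron_map)\n", "        == remove_macrons_2(text_with_macrons, macron_map)\n", "        == remove_macrons_3(text_with_macrons, macron_table))"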
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gallia est omnis dīvīsa in partēs trēs, quārum ūnam incolunt Belgae, aliam Aquītānī, tertiam quī ipsōrum linguā Celtae, nostrā Gallī appellantur.\n" ] } ], "source": [ "# Get sample text with macrons\n", "# Here we'll use the Dickinson College Commentaries text of Caesar's *Bellum Gallicum* (which has macrons!) as found in conventus-lex's github repo for Maccer.\n", "\n", "from requests_html import HTMLSession\n", "session = HTMLSession()\n", "url = 'https://raw.githubusercontent.com/conventus-lex/maccer/master/sources/DCC/Caesar%20-%20Selections%20from%20the%20Gallic%20War.txt'\n", "r = session.get(url)\n", "test = r.text\n", "test = test[test.find('1.1'):] # remove 'metadata'\n", "test = re.sub(r'\\d\\.\\d+', '', test) # remove chapter headings, e.g. 1.1\n", "print(test[2:147]) # print sample" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This text has 6399 words.\n" ] } ], "source": [ "print(f'This text has {len(test.split())} words.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results of timeit on my iMac 2.7 GHz Intel Core i5..." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "317 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit -n 1000 remove_macrons_1(test, macron_map)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.74 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit -n 1000 remove_macrons_2(test, macron_map)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.74 ms ± 947 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit -n 1000 remove_macrons_3(test, macron_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The string method ```replace``` even with the multiple passes over the string is much faster than the other two methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Warning about combining characters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before wrapping up a discussion about string replacement and unicode characters with diacriticals, it seems like a good time to mention decomposed and precomposed unicode characters. Note the following behavior." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n" ] } ], "source": [ "word1 = 'dīvīsa'\n", "print(len(word1))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8\n" ] } ], "source": [ "word2 = 'dīvīsa'\n", "print(len(word2))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "print(word1 == word2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These strings are not the same—word2 contains two decomposed lower-case-i-with-macrons." 
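] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see exactly which code points each string contains, we can ask ```unicodedata``` for the name of every character. This is just a quick inspection sketch; the ```unicode-escape``` view in the next cell shows the same difference more compactly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: label each code point in word1 and word2 with its unicode name\n", "\n", "import unicodedata\n", "\n", "for label, word in [('word1', word1), ('word2', word2)]:\n", "    print(label, [unicodedata.name(c) for c in word])"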
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'd\\\\u012bv\\\\u012bsa'\n", "b'di\\\\u0304vi\\\\u0304sa'\n" ] } ], "source": [ "print(word1.encode('unicode-escape'))\n", "print(word2.encode('unicode-escape'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems like a good idea to handle these differences before attempting to replace characters. We can use ```unicodedata.normalize``` to convert all strings for replacement to Normalization Form C (NFC) before processing." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n" ] } ], "source": [ "import unicodedata\n", "word2 = unicodedata.normalize('NFC', word2)\n", "print(len(word2))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# function with NFC preprocessing\n", "\n", "def remove_macrons_1b(text_with_macrons, replacement_dictionary):\n", "    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)\n", "    for k, v in replacement_dictionary.items():\n", "        text_without_macrons = text_without_macrons.replace(k, v)\n", "    return text_without_macrons" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8 µs, sys: 2 µs, total: 10 µs\n", "Wall time: 18.8 µs\n" ] }, { "data": { "text/plain": [ "'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time remove_macrons_1b(text_with_macrons, macron_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id='remove-macrons-best'></a>\n", "#### Remove Macrons: Best solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting it all together, we have the following function that we can use for macron replacement." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "import unicodedata\n", "\n", "def remove_macrons(text_with_macrons):\n", "    '''Replace macrons in Latin text'''\n", "    vowels = 'aeiouyAEIOUY'\n", "    vowels_with_macrons = 'āēīōūȳĀĒĪŌŪȲ'\n", "    replacement_dictionary = {k: v for k, v in zip(vowels_with_macrons, vowels)}\n", "\n", "    text_without_macrons = unicodedata.normalize('NFC', text_with_macrons)\n", "\n", "    for k, v in replacement_dictionary.items():\n", "        text_without_macrons = text_without_macrons.replace(k, v)\n", "\n", "    return text_without_macrons" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "415 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit -n 1000 remove_macrons(test)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.\n" ] } ], "source": [ "print(remove_macrons(test)[:147])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, slightly slower with normalization, but still faster than the other methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
Remove diacriticals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem**: You want to remove all of the diacriticals from a string of Greek text, like the following sentence from Thucydides's *Historiae*." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "text_with_diacriticals = \"\"\"Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον.\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a certain respect, this recipe appears to be similar to [Replace macrons](#1.-Replace-macrons) above: it is a replacement task. But the number of possible combinations would be burdensome to map by hand, so we are better off looking for a more manageable solution, one that works with entire ranges of characters. One strategy along these lines is to decompose the unicode characters and strip the combining marks using ```translate``` (cf. **PCRx 2.12**). This way, rather than figuring out a mapping for every possible precomposed character (e.g. ί to ι), we can just map the diacriticals separately and remove them." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Θουκυδίδης is 10 characters long.\n", "Θουκυδίδης is 11 characters long.\n", "Θουκυδίδης and Θουκυδίδης are equal: False.\n" ] } ], "source": [ "word1 = 'Θουκυδίδης' # composed\n", "word2 = 'Θουκυδι\\u0301δης' # decomposed, with the combining acute written out explicitly\n", "\n", "print(f'{word1} is {len(word1)} characters long.')\n", "print(f'{word2} is {len(word2)} characters long.')\n", "print(f'{word1} and {word2} are equal: {word1 == word2}.')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Θ b'\\\\u0398'\n", "ο b'\\\\u03bf'\n", "υ b'\\\\u03c5'\n", "κ b'\\\\u03ba'\n", "υ b'\\\\u03c5'\n", "δ b'\\\\u03b4'\n", "ι b'\\\\u03b9'\n", "́ b'\\\\u0301'\n", "δ b'\\\\u03b4'\n", "η b'\\\\u03b7'\n", "ς b'\\\\u03c2'\n" ] } ], "source": [ "# Characters with their unicode code points\n", "\n", "for char in word2:\n", "    print(char, char.encode('unicode-escape'))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ί\n", "ι\n" ] } ], "source": [ "# So, our plan is to strip every combining character, such as \\u0301, as below...\n", "\n", "print(u'\\u03b9\\u0301') # prints iota with acute accent\n", "print(u'\\u03b9\\u0301'.replace(u'\\u0301', '')) # prints iota" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the ```combining``` function in ```unicodedata``` returns a character's canonical combining class, and returns 0 if the character is not a combining character..." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "230\n" ] } ], "source": [ "import unicodedata\n", "\n", "print(unicodedata.combining(u'\\u0398')) # capital theta is not a combining character\n", "print(unicodedata.combining(u'\\u0301')) # the combining acute accent has combining class 230" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can build a map of all combining characters by iterating through the unicode code points and discarding anything that returns 0 for ```unicodedata.combining```."
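] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, the same idea can be expressed without a table at all, by filtering the decomposed string with a comprehension. The next cell is only a sketch of that alternative (the ```strip_marks_by_filter``` name is just for illustration); the recipe below sticks with a ```translate``` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch: drop combining marks by filtering the decomposed string\n", "\n", "import unicodedata\n", "\n", "def strip_marks_by_filter(text):\n", "    decomposed = unicodedata.normalize('NFD', text)\n", "    return ''.join(c for c in decomposed if not unicodedata.combining(c))\n", "\n", "print(strip_marks_by_filter('Θουκυδίδης Ἀθηναῖος'))"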
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "combining_character_table = dict.fromkeys(\n", "    c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))\n", ")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.\n" ] } ], "source": [ "# decompose string\n", "\n", "text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)\n", "\n", "# replace combining characters with translate\n", "text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)\n", "\n", "# print results\n", "print(text_without_diacriticals)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# function for removing diacriticals\n", "\n", "import sys\n", "import unicodedata\n", "\n", "def remove_diacriticals(text_with_diacriticals):\n", "    '''Remove diacriticals, i.e. combining characters, from a string'''\n", "    # note: building this table checks every unicode code point, which accounts for most of the runtime\n", "    combining_character_table = dict.fromkeys(\n", "        c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))\n", "    )\n", "\n", "    text_with_diacriticals = unicodedata.normalize('NFD', text_with_diacriticals)\n", "\n", "    text_without_diacriticals = text_with_diacriticals.translate(combining_character_table)\n", "\n", "    return text_without_diacriticals" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "390 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit -n 100 text_without_diacriticals = remove_diacriticals(text_with_diacriticals)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Θουκυδιδης Αθηναιος ξυνεγραψε τον πολεμον των Πελοποννησιων και Αθηναιων, ως επολεμησαν προς αλληλους, αρξαμενος ευθυς καθισταμενου και ελπισας μεγαν τε εσεσθαι και αξιολογωτατον των προγεγενημενων, τεκμαιρομενος οτι ακμαζοντες τε ησαν ες αυτον αμφοτεροι παρασκευη τη παση και το αλλο Ελληνικον ορων ξυνισταμενον προς εκατερους, το μεν ευθυς, το δε και διανοουμενον.\n" ] } ], "source": [ "print(text_without_diacriticals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[aconser](https://github.com/aconser) notes that this method could also be used for removing macrons..." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import unicodedata\n", "\n", "def remove_macrons_4(text_with_macrons):\n", "    MACRON = u'\\u0304'\n", "    temp = unicodedata.normalize('NFD', text_with_macrons)\n", "    text_without_macrons = temp.replace(MACRON, '')\n", "    return unicodedata.normalize('NFC', text_without_macrons)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.22 ms ± 49.3 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit -n 1000 remove_macrons_4(test)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "376 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "# Cf. the replace-based remove_macrons recipe above\n", "%timeit -n 1000 remove_macrons(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Extract Greek words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem**: You want to extract the Greek words from a Latin text (or, really, any non-Greek text)." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# Cicero Att 1.4\n", "# http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.02.0008%3Abook%3D1%3Aletter%3D1%3Asection%3D4\n", "\n", "text_with_greek = \"\"\"\n", "abs te peto ut mihi hoc ignoscas et me existimes humanitate esse prohibitum ne contra amici summam existimationem miserrimo eius tempore venirem, cum is omnia sua studia et officia in me contulisset. quod si voles in me esse durior, ambitionem putabis mihi obstitisse. ego autem arbitror, etiam si id sit, mihi ignoscendum esse, “ἐπεὶ οὐχ ἱερήϊον οὐδὲ βοεΐην.” vides enim in quo cursu simus et quam omnis gratias non modo retinendas verum etiam acquirendas putemus. spero tibi me causam probasse, cupio quidem certe. \n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use regular expressions to replace any non-Greek character with a space and then split the resulting string on whitespace to return a list of Greek words. We start by defining the unicode ranges for Greek characters..." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# Define unicode ranges for Greek characters\n", "# (\\u0300-\\u03FF also covers the combining diacritical marks block; \\u1F00-\\u1FFF is Greek Extended)\n", "GREEK_UNICODE_RANGE = '\\u0300-\\u03FF'\n", "GREEK_EXT_UNICODE_RANGE = '\\u1F00-\\u1FFF'" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "91.3 µs ± 3 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" ] } ], "source": [ "import re\n", "\n", "%timeit -n 1000 greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην']\n", "True\n" ] } ], "source": [ "greek_words = re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()\n", "print(greek_words)\n", "print(greek_words == ['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην'])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# function for extracting Greek words\n", "\n", "import re\n", "\n", "def extract_greek(text_with_greek):\n", "    '''Return a list of the Greek words in a text'''\n", "    GREEK_UNICODE_RANGE = '\\u0300-\\u03FF'\n", "    GREEK_EXT_UNICODE_RANGE = '\\u1F00-\\u1FFF'\n", "\n", "    return re.sub('[^%s%s]' % (GREEK_UNICODE_RANGE, GREEK_EXT_UNICODE_RANGE),' ', text_with_greek).split()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ἐπεὶ', 'οὐχ', 'ἱερήϊον', 'οὐδὲ', 'βοεΐην']\n" ] } ], "source": [ "print(extract_greek(text_with_greek))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This recipe could be extended to other language character sets by redefining the unicode character ranges as necessary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Make iota adscripts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Problem**: You want to change iota subscripts to iota adscripts, e.g. τῷ θεῷ to τῶι θεῶι." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "text_with_iota_subscripts = \"\"\"Χάρις δὲ τῷ θεῷ τῷ διδόντι τὴν αὐτὴν σπουδὴν ὑπὲρ ὑμῶν ἐν τῇ καρδίᾳ Τίτου, ὅτι τὴν μὲν παράκλησιν ἐδέξατο, σπουδαιότερος δὲ ὑπάρχων αὐθαίρετος ἐξῆλθεν πρὸς ὑμᾶς.\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the recipes above for dealing with diacriticals, we will take advantage of unicode decomposition to replace iota subscripts with a full iota." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "import unicodedata\n", "\n", "def make_iota_adscripts(text):\n", "    text = unicodedata.normalize('NFD', text)\n", "    text = text.replace('\\u0345', 'ι') # \\u0345 is the combining iota subscript (ypogegrammeni)\n", "    text = unicodedata.normalize('NFC', text)\n", "    return text" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27.2 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit -n 100 text_with_iota_adscripts = make_iota_adscripts(text_with_iota_subscripts)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Χάρις δὲ τῶι θεῶι τῶι διδόντι τὴν αὐτὴν σπουδὴν ὑπὲρ ὑμῶν ἐν τῆι καρδίαι Τίτου, ὅτι τὴν μὲν παράκλησιν ἐδέξατο, σπουδαιότερος δὲ ὑπάρχων αὐθαίρετος ἐξῆλθεν πρὸς ὑμᾶς.\n" ] } ], "source": [ "print(make_iota_adscripts(text_with_iota_subscripts))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "Please open an issue for any problems you see with the code. You can also use issues to suggest another Python solution for any of the recipes in this notebook."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }