{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating a plaintext Odyssey on the fly with Perseus Table of Contents\n",
    "\n",
    "Patrick J. Burns 10.1.2017"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I recently read an article on sentence length in Greek hexameter poetry by [Dee Clayman from 1981](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1627358) (\"Sentence Length in Greek Hexameter Poetry\" in Hexameter Studies, *Quantitative Linguistics* 11)—an interesting article in many ways, that I will blog about at greater length in the near future. For now, I will just say that it is a data-driven study and an excellent example of computational/philological/literary critical work in Classics from nearly four decades ago.\n",
    "\n",
    "In the article, Clayman presents a series of charts that I wanted to replicate, including this one on \"Sentences One Line in Length in Greek Hexameter Poetry.\" I could do this sort of thing relatively easily for Latin hexameters using CLTK and the plaintext Latin Library corpus that I've [written about at *Disiecta Membra*](https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/). But I didn't have a plaintext Greek corpus at hand and I decided that I probably should."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Figure from Clayman 1981](img/clayman.jpg)\n",
    "*Figure from Clayman's 1981 study*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I also just happened to teach a seminar last week on using Python to scrape XML by URL and I thought this would make a good example of an intermediate level scraping project.\n",
    "\n",
    "The [Perseus Digital Library](http://www.perseus.tufts.edu) provides open-access XML texts of many of the hexameter texts from Clayman's article that I wanted to test. We could, I suppose, cut and paste the texts from the browser. But for the Odyssey that would be almost three hundred pages chunked by section. Even chunked by book, we'd have to work through 24 pages. And we'd have to do that for every work we wanted to scrape.\n",
    "\n",
    "Fortunately, the library also provides a Table of Contents which gives us a map of all the individual sections. With these TOC files, we can use Python to build a list of URLs for the sections, scrape these pages, extract lines of poetry, and finally stitch the results together. Python and [lxml](http://lxml.de) are well-suited to this task."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Perseus XML Table of Contents](img/perseus-toc.png)\n",
    "*Perseus XML Table of Contents for Homer's* Odyssey*, available [here](http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.01.0135%3Abook%3D1).*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a first pass at handling two Greek hexameter poems from Perseus: Homer's *Odyssey* and Hesiod's *Shield of Heracles*. In an upcoming post/notebook, I will collect all of the hexameter poems from Clayman's study and show how we can use Python to replicate her study."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting a plaintext *Odyssey*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Imports\n",
    "import urllib.request\n",
    "from lxml import etree\n",
    "\n",
    "import collections\n",
    "\n",
    "import time\n",
    "\n",
    "from pprint import pprint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Constants\n",
    "perseus_xml_base_url = \"http://www.perseus.tufts.edu/hopper/xmlchunk?doc=\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Homer's Odyssey TOC XML\n",
    "odyssey_toc_url = \"http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D1\"\n",
    "\n",
    "# Hesiod's Shield TOC XML \n",
    "shield_toc_url = \"http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.01.0127%3Acard%3D1\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def check_for_books(root):\n",
    "    \"\"\"\n",
    "    Some poems are single, self-contained works (e.g. Hesiod's Shield)\n",
    "    others are divided into books (e.g. Homer's Odyssey). This tests for\n",
    "    the presence of the attribute type with value 'book' in the <chunk>\n",
    "    element, so that book-level information can be retained when parsing.\n",
    "    \"\"\"\n",
    "    if root.findall(\".//chunk[@type='book']\"):\n",
    "        return True\n",
    "    return False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with urllib.request.urlopen(odyssey_toc_url) as f:\n",
    "    perseus_toc_xml = f.read()\n",
    "\n",
    "root = etree.fromstring(perseus_toc_xml)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get list of refs from <chunk> elements\n",
    "\n",
    "if check_for_books(root):\n",
    "    books = root.findall(\".//chunk[@type='book']\")\n",
    "    booknames = [book.find('head').text for book in books]\n",
    "else:\n",
    "    books = [root]\n",
    "    booknames = ['work']\n",
    "    \n",
    "book_refs = []\n",
    "\n",
    "for book in books:\n",
    "    chunks = book.findall('chunk')\n",
    "    refs = [chunk.attrib['ref'] for chunk in chunks]\n",
    "    book_refs.append(refs)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D1', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D44', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D80', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D125', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D178', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D230', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D280', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D325', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D365', 'Perseus%3Atext%3A1999.01.0135%3Abook%3D1%3Acard%3D421']\n"
     ]
    }
   ],
   "source": [
    "print(book_refs[0]) # Example of retrieved refs from TOC for Odyssey 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get xml for each ref\n",
    "book_sections = []\n",
    "\n",
    "for book_ref in book_refs:\n",
    "    book_section_xml = []\n",
    "    for ref in book_ref:\n",
    "        #print(ref) #Uncomment if you want to see the progress\n",
    "        time.sleep(.1)\n",
    "        with urllib.request.urlopen(perseus_xml_base_url+ref) as f:\n",
    "            book_section_xml.append(f.read())\n",
    "    book_sections.append(book_section_xml)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n",
      "<TEI.2><text><body><div1 n=\"1\" type=\"Book\" org=\"uniform\" sample=\"complete\"><milestone n=\"1\" unit=\"card\" ed=\"p\" />\n",
      "<l>ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ</l>\n",
      "<l>πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν:</l>\n",
      "<l>πολλῶν δ᾽ ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω,</l>\n",
      "<l>πολλὰ δ᾽ ὅ γ᾽ ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν,</l>\n",
      "<l n=\"5\">ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων.</l>\n",
      "<l>ἀλλ᾽ οὐδ᾽ ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ:</l>\n",
      "<l>αὐτῶν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο,</l>\n",
      "<l>νήπιοι, οἳ κατὰ βοῦς Ὑπερίονος Ἠελίοιο</l>\n",
      "<l>ἤσθιον: αὐτὰρ ὁ τοῖσιν ἀφείλετο νόστιμον ἦμαρ.</l>\n",
      "<l n=\"10\">τῶν ἁμόθεν γε,  θεά, θύγατερ Διός, εἰπὲ καὶ ἡμῖν.</l>\n",
      "<l><milestone unit=\"para\" ed=\"P\" />ἔνθ᾽ ἄλλοι μὲν πάντες, ὅσοι φύγον αἰπὺν ὄλεθρον,</l>\n",
      "<l>οἴκοι ἔσαν, πόλεμόν τε πεφευγότες ἠδὲ θάλασσαν:</l>\n",
      "<l>τὸν δ᾽ οἶον νόστου κεχρημένον ἠδὲ γυναικὸς</l>\n",
      "<l>νύμφη πότνι᾽ ἔρυκε Καλυψὼ δῖα θεάων</l>\n",
      "<l n=\"15\">ἐν σπέσσι γλαφυροῖσι,  λιλαιομένη πόσιν εἶναι.</l>\n",
      "<l>ἀλλ᾽ ὅτ\n"
     ]
    }
   ],
   "source": [
    "# Example XML from Odyssey 1, Section 1\n",
    "print(book_sections[0][0].decode('utf-8')[:1000])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Some helper functions\n",
    "\n",
    "def check_for_lb(root):\n",
    "    \"\"\"\n",
    "    Some poetry in the Perseus XML has lines delimited by <lb>\n",
    "    and some by <l>. This tests for the presence of <lb>, so that\n",
    "    the right parser is used below.\n",
    "    \"\"\"\n",
    "    if root.findall(\".//lb\"):\n",
    "        return True\n",
    "    return False\n",
    "\n",
    "\n",
    "# Need this helper function to retrieve lines which have\n",
    "# intervening <milestone> elements.\n",
    "def node_text(node):\n",
    "    \"\"\"https://stackoverflow.com/a/7500304/1816347\"\"\"\n",
    "    if node.text:\n",
    "        result = node.text\n",
    "    else:\n",
    "        result = ''\n",
    "    for child in node:\n",
    "        if child.tail is not None:\n",
    "            result += child.tail\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ', 'πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν:', 'πολλῶν δ᾽ ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω,', 'πολλὰ δ᾽ ὅ γ᾽ ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν,', 'ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων.', 'ἀλλ᾽ οὐδ᾽ ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ:', 'αὐτῶν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο,', 'νήπιοι, οἳ κατὰ βοῦς Ὑπερίονος Ἠελίοιο', 'ἤσθιον: αὐτὰρ ὁ τοῖσιν ἀφείλετο νόστιμον ἦμαρ.', 'τῶν ἁμόθεν γε,  θεά, θύγατερ Διός, εἰπὲ καὶ ἡμῖν.', 'ἔνθ᾽ ἄλλοι μὲν πάντες, ὅσοι φύγον αἰπὺν ὄλεθρον,', 'οἴκοι ἔσαν, πόλεμόν τε πεφευγότες ἠδὲ θάλασσαν:', 'τὸν δ᾽ οἶον νόστου κεχρημένον ἠδὲ γυναικὸς', 'νύμφη πότνι᾽ ἔρυκε Καλυψὼ δῖα θεάων', 'ἐν σπέσσι γλαφυροῖσι,  λιλαιομένη πόσιν εἶναι.', 'ἀλλ᾽ ὅτε δὴ ἔτος ἦλθε περιπλομένων ἐνιαυτῶν,', 'τῷ οἱ ἐπεκλώσαντο θεοὶ οἶκόνδε νέεσθαι', 'εἰς Ἰθάκην, οὐδ᾽ ἔνθα πεφυγμένος ἦεν ἀέθλων', 'καὶ μετὰ οἷσι φίλοισι. θεοὶ δ᾽ ἐλέαιρον ἅπαντες', 'νόσφι Ποσειδάωνος: ὁ δ᾽ ἀσπερχὲς μενέαινεν', 'ἀντιθέῳ Ὀδυσῆι πάρος ἣν γαῖαν ἱκέσθαι.', 'ἀλλ᾽ ὁ μὲν Αἰθίοπας μετεκίαθε τηλόθ᾽ ἐόντας,', 'Αἰθίοπας τοὶ διχθὰ δεδαίαται, ἔσχατοι ἀνδρῶν,', 'οἱ μὲν δυσομένου Ὑπερίονος οἱ δ᾽ ἀνιόντος,', 'ἀντιόων ταύρων τε καὶ ἀρνειῶν ἑκατόμβης.']\n"
     ]
    }
   ],
   "source": [
    "# Get xml for each ref\n",
    "book_lines = []\n",
    "\n",
    "for section in book_sections:\n",
    "    section_lines = []\n",
    "    for xml in section:\n",
    "        root = etree.fromstring(xml)\n",
    "        \n",
    "        if check_for_lb(root):\n",
    "            lines = root.findall('.//lb')\n",
    "            lines = [line.tail for line in lines]\n",
    "            lines = ['\\n' if line is None else line for line in lines]\n",
    "            section_lines.append(lines)\n",
    "        else:\n",
    "            lines = root.findall('.//l')\n",
    "            lines = [node_text(line) for line in lines]\n",
    "            lines = ['\\n' if line is None else line for line in lines]\n",
    "            section_lines.append(lines)\n",
    "\n",
    "    book_lines.append(section_lines)\n",
    "\n",
    "print(book_lines[0][0][:25])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def flatten(l):\n",
    "    \"\"\"https://stackoverflow.com/a/2158532/1816347\"\"\"\n",
    "    for el in l:\n",
    "        if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):\n",
    "            yield from flatten(el)\n",
    "        else:\n",
    "            yield el"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ\n",
      "πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν:\n",
      "πολλῶν δ᾽ ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω,\n",
      "πολλὰ δ᾽ ὅ γ᾽ ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν,\n",
      "ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων.\n",
      "ἀλλ᾽ οὐδ᾽ ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ:\n",
      "αὐτῶν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο,\n",
      "νήπιοι, οἳ κατὰ βοῦς Ὑπερίονος Ἠελίοιο\n",
      "ἤσθιον: αὐτὰρ ὁ τοῖσιν ἀφείλετο νόστιμον ἦμαρ.\n",
      "τῶν ἁμόθεν γε,  θεά, θύγατερ Διός, εἰπὲ καὶ ἡμῖν.\n",
      "ἔνθ᾽ ἄλλοι μὲν πάντες, ὅσοι φύγον αἰπὺν ὄλεθρον,\n",
      "οἴκοι ἔσαν, πόλεμόν τε πεφευγότες ἠδὲ θάλασσαν:\n",
      "τὸν δ᾽ οἶον νόστου κεχρημένον ἠδὲ γυναικὸς\n",
      "νύμφη πότνι᾽ ἔρυκε Καλυψὼ δῖα θεάων\n",
      "ἐν σπέσσι γλαφυροῖσι,  λιλαιομένη πόσιν εἶναι.\n",
      "ἀλλ᾽ ὅτε δὴ ἔτος ἦλθε περιπλομένων ἐνιαυτῶν,\n",
      "τῷ οἱ ἐπεκλώσαντο θεοὶ οἶκόνδε νέεσθαι\n",
      "εἰς Ἰθάκην, οὐδ᾽ ἔνθα πεφυγμένος ἦεν ἀέθλων\n",
      "καὶ μετὰ οἷσι φίλοισι. θεοὶ δ᾽ ἐλέαιρον ἅπαντες\n",
      "νόσφι Ποσειδάωνος: ὁ δ᾽ ἀσπερχὲς μενέαινεν\n",
      "ἀντιθέῳ Ὀδυσῆι πάρος ἣν γαῖαν ἱκέσθαι.\n",
      "ἀλλ᾽ ὁ μὲν Αἰθίοπας μετεκίαθε τηλόθ᾽ ἐόντας,\n",
      "Αἰθίοπας τοὶ διχθ\n"
     ]
    }
   ],
   "source": [
    "plaintext = flatten(book_lines)\n",
    "print(\"\\n\".join(list(plaintext))[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting a plaintext *Shield*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with urllib.request.urlopen(shield_toc_url) as f:\n",
    "    perseus_toc_xml = f.read()\n",
    "\n",
    "root = etree.fromstring(perseus_toc_xml)\n",
    "\n",
    "# Get list of refs from <chunk> elements\n",
    "\n",
    "if check_for_books(root):\n",
    "    books = root.findall(\".//chunk[@type='book']\")\n",
    "    booknames = [book.find('head').text for book in books]\n",
    "else:\n",
    "    books = [root]\n",
    "    booknames = ['work']\n",
    "    \n",
    "book_refs = []\n",
    "\n",
    "for book in books:\n",
    "    chunks = book.findall('chunk')\n",
    "    refs = [chunk.attrib['ref'] for chunk in chunks]\n",
    "    book_refs.append(refs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get xml for each ref\n",
    "book_sections = []\n",
    "\n",
    "for book_ref in book_refs:\n",
    "    book_section_xml = []\n",
    "    for ref in book_ref:\n",
    "        #print(ref) #Uncomment if you want to see the progress\n",
    "        time.sleep(.1)\n",
    "        with urllib.request.urlopen(perseus_xml_base_url+ref) as f:\n",
    "            book_section_xml.append(f.read())\n",
    "    book_sections.append(book_section_xml)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['\\n', 'ἤλυθεν ἐς Θήβας μετ᾽ ἀρήιον Ἀμφιτρύωνα\\n', 'Ἀλκμήνη, θυγάτηρ λαοσσόου Ἠλεκτρύωνος:\\n', 'ἥ ῥα γυναικῶν φῦλον ἐκαίνυτο θηλυτεράων\\n', 'εἴδεΐ τε μεγέθει τε: νόον γε μὲν οὔ τις ἔριζε\\n', 'τάων, ἃς θνηταὶ θνητοῖς τέκον εὐνηθεῖσαι.\\n', 'τῆς καὶ ἀπὸ κρῆθεν βλεφάρων τ᾽ ἄπο κυανεάων\\n', 'τοῖον ἄηθ᾽ οἶόν τε πολυχρύσου Ἀφροδίτης.\\n', 'ἣ δὲ καὶ ὣς κατὰ θυμὸν ἑὸν τίεσκεν ἀκοίτην,\\n', 'ὡς οὔ πώ τις ἔτισε γυναικῶν θηλυτεράων:\\n', 'ἦ μέν οἱ πατέρ᾽ ἐσθλὸν ἀπέκτανε ἶφι δαμάσσας,\\n', 'χωσάμενος περὶ βουσί: λιπὼν δ᾽ ὅ γε πατρίδα γαῖαν\\n', 'ἐς Θήβας ἱκέτευσε φερεσσακέας Καδμείους.\\n', 'ἔνθ᾽ ὅ γε δώματ᾽ ἔναιε σὺν αἰδοίῃ παρακοίτι\\n', 'νόσφιν ἄτερ φιλότητος ἐφιμέρου,  οὐδέ οἱ ἦεν\\n', 'πρὶν λεχέων ἐπιβῆναι ἐυσφύρου Ἠλεκτρυώνης,\\n', 'πρίν γε φόνον τίσαιτο κασιγνήτων μεγαθύμων\\n', 'ἧς ἀλόχου, μαλερῷ δὲ καταφλέξαι πυρὶ κώμας\\n', 'ἀνδρῶν ἡρώων Ταφίων ἰδὲ Τηλεβοάων.\\n', 'τὼς  γάρ οἱ διέκειτο,  θεοὶ δ᾽ ἐπὶ μάρτυροι ἦσαν:\\n', 'τῶν ὅ γ᾽ ὀπίζετο μῆνιν, ἐπείγετο δ᾽ ὅττι τάχιστα\\n', 'ἐκτελέσαι μέγα ἔργον, ὅ οἱ Διόθεν θέμις ἦεν.\\n', 'τῷ δ᾽ ἅμα ἱέμενοι πολέμοιό τε φυλόπιδός τε\\n', 'Βοιωτοὶ πλήξιπποι, ὑπὲρ σακέων πνείοντες,\\n', 'Λοκροί τ᾽ ἀγχέμαχοι καὶ Φωκῆες μεγάθυμοι\\n']\n"
     ]
    }
   ],
   "source": [
    "# Get xml for each ref\n",
    "book_lines = []\n",
    "\n",
    "for section in book_sections:\n",
    "    section_lines = []\n",
    "    for xml in section:\n",
    "        root = etree.fromstring(xml)\n",
    "        \n",
    "        if check_for_lb(root):\n",
    "            lines = root.findall('.//lb')\n",
    "            lines = [line.tail for line in lines]\n",
    "            lines = ['\\n' if line is None else line for line in lines]\n",
    "            section_lines.append(lines)\n",
    "        else:\n",
    "            lines = root.findall('.//l')\n",
    "            lines = [node_text(line) for line in lines]\n",
    "            lines = ['\\n' if line is None else line for line in lines]\n",
    "            section_lines.append(lines)\n",
    "\n",
    "    book_lines.append(section_lines)\n",
    "\n",
    "print(book_lines[0][0][:25])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "ἤλυθεν ἐς Θήβας μετ᾽ ἀρήιον Ἀμφιτρύωνα\n",
      "Ἀλκμήνη, θυγάτηρ λαοσσόου Ἠλεκτρύωνος:\n",
      "ἥ ῥα γυναικῶν φῦλον ἐκαίνυτο θηλυτεράων\n",
      "εἴδεΐ τε μεγέθει τε: νόον γε μὲν οὔ τις ἔριζε\n",
      "τάων, ἃς θνηταὶ θνητοῖς τέκον εὐνηθεῖσαι.\n",
      "τῆς καὶ ἀπὸ κρῆθεν βλεφάρων τ᾽ ἄπο κυανεάων\n",
      "τοῖον ἄηθ᾽ οἶόν τε πολυχρύσου Ἀφροδίτης.\n",
      "ἣ δὲ καὶ ὣς κατὰ θυμὸν ἑὸν τίεσκεν ἀκοίτην,\n",
      "ὡς οὔ πώ τις ἔτισε γυναικῶν θηλυτεράων:\n",
      "ἦ μέν οἱ πατέρ᾽ ἐσθλὸν ἀπέκτανε ἶφι δαμάσσας,\n",
      "χωσάμενος περὶ βουσί: λιπὼν δ᾽ ὅ γε πατρίδα γαῖαν\n",
      "ἐς Θήβας ἱκέτευσε φερεσσακέας Καδμείους.\n",
      "ἔνθ᾽ ὅ γε δώματ᾽ ἔναιε σὺν αἰδοίῃ παρακοίτι\n",
      "νόσφιν ἄτερ φιλότητος ἐφιμέρου,  οὐδέ οἱ ἦεν\n",
      "πρὶν λεχέων ἐπιβῆναι ἐυσφύρου Ἠλεκτρυώνης,\n",
      "πρίν γε φόνον τίσαιτο κασιγνήτων μεγαθύμων\n",
      "ἧς ἀλόχου, μαλερῷ δὲ καταφλέξαι πυρὶ κώμας\n",
      "ἀνδρῶν ἡρώων Ταφίων ἰδὲ Τηλεβοάων.\n",
      "τὼς  γάρ οἱ διέκειτο,  θεοὶ δ᾽ ἐπὶ μάρτυροι ἦσαν:\n",
      "τῶν ὅ γ᾽ ὀπίζετο μῆνιν, ἐπείγετο δ᾽ ὅττι τάχιστα\n",
      "ἐκτελέσαι μέγα ἔργον, ὅ οἱ Διόθεν θέμις ἦεν.\n",
      "τῷ δ᾽ ἅμα ἱέμενοι πολέμοιό τε φυλόπιδός τε\n",
      "Βοιωτοὶ πλήξιπποι, ὑπὲρ σακέων πνείοντες,\n",
      "Λοκρο\n"
     ]
    }
   ],
   "source": [
    "plaintext = flatten(book_lines)\n",
    "print(\"\".join(list(plaintext))[:1000]) # Handle '\\n' better"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}