{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic analysis\n", "\n", "Joint work of Mathias Coeckelbergs and Dirk Roorda.\n", "\n", "Mathias is running topic detection experiments with Mallet.\n", "Dirk and Mathias integrate the results of those experiments into the Hebrew text.\n", "\n", "We are still in a preliminary stage (2016-03-15).\n", "\n", "We render the Hebrew text and add a hyperlink to each word that is part of a topic definition. The hyperlink points to the corresponding topic in the topic list.\n", "\n", "# Result\n", "\n", "[Full hyperlinked text](etcbc_topictext.html) (5.7 MB)\n", "\n", "[Topic list](topic_list.html)\n", "\n", "Save both files to the same directory, and you can jump from text to topic list." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.5.22\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "import sys, collections, re\n", "from markdown import markdown\n", "\n", "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prepare\n", "fabric = LafFabric()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "source='etcbc'\n", "version='4b'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... 
\n", " 0.00s USING main DATA COMPILED AT: 2015-11-02T15-08-56\n", " 0.00s USING annox DATA COMPILED AT: 2016-01-27T19-01-17\n", " 2.45s LOGFILE=/Users/dirk/laf-fabric-output/etcbc4b/workshop/__log__workshop.txt\n", " 2.45s INFO: LOADING PREPARED data: please wait ... \n", " 2.45s prep prep: G.node_sort\n", " 2.57s prep prep: G.node_sort_inv\n", " 3.07s prep prep: L.node_up\n", " 6.82s prep prep: L.node_down\n", " 12s prep prep: V.verses\n", " 12s prep prep: V.books_la\n", " 12s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n", " 14s INFO: LOADED PREPARED data\n", " 14s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK workshop AT 2016-03-14T22-37-41\n" ] } ], "source": [ "API=fabric.load(source+version, 'lexicon', 'workshop', {\n", "    \"xmlids\": {\"node\": False, \"edge\": False},\n", "    \"features\": ('''\n", "        otype\n", "        lex\n", "        sp gloss\n", "        chapter verse\n", "    ''',''),\n", "    \"prepare\": prepare,\n", "    \"primary\": False,\n", "})\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic compilation\n", "\n", "We compile the list of topics produced by Mallet into two dictionaries: `word2topic`, with topic words as keys and sets of topic numbers as values, and `topic2words`, with topic numbers as keys and lists of topic words as values." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "word2topic = collections.defaultdict(set)\n", "topic2words = {}\n", "with open('etcbc_keys.txt') as keys:\n", "    for line in keys:\n", "        # each line: topic number, a second field (unused here), space-separated topic words\n", "        (n, r, words) = line.rstrip().split('\\t')\n", "        n = int(n)\n", "        word_list = words.split()\n", "        for word in word_list:\n", "            word2topic[word].add(n)\n", "        topic2words[n] = word_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the resulting mapping." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "defaultdict(set,\n", " {'ba[yhwˈāh': {14},\n", " 'bammˈāyim': {16},\n", " 'bayyˈôm': {3},\n", " 'baššānˈā': {9},\n", " 'baḥˈerev': {11},\n", " 'bišᵊnˈaṯ': {9},\n", " 'bānˈîm': {4},\n", " 'bāvˈel': {5},\n", " 'bāśˈār': {0},\n", " 'bēʔḏˈayin': {14},\n", " 'bˈêṯ': {17},\n", " 'bˈêṯ-ʔˈēl': {14},\n", " 'bˈānû': {0},\n", " 'bˈāʔû': {1},\n", " 'bᵊhˌar': {9},\n", " 'bᵊmālᵊḵˈô': {1},\n", " 'bᵊnˈê': {17},\n", " 'bᵊnˌê': {15},\n", " 'bᵊnˌô': {12},\n", " 'bᵊrîṯ-[yᵊhwˈāh': {7},\n", " 'bᵊʕênˈê': {18},\n", " 'bᵊḵˌōl': {2},\n", " 'dāwˈiḏ': {17},\n", " 'dāwˌîḏ': {16},\n", " 'dˌî': {12},\n", " 'fᵊlištˈîm': {18},\n", " 'gˈam': {8},\n", " 'gˌam': {4},\n", " 'gᵊvˈûl': {0},\n", " 'haddāvˈār': {15},\n", " 'haddᵊvārˈîm': {1},\n", " 'haggilʕˈāḏ': {6},\n", " 'hahˈî': {2},\n", " 'hahˈû': {3},\n", " 'hakkōhˈēn': {5},\n", " 'halᵊwiyyˌim': {16},\n", " 'hammalkˈā': {10},\n", " 'hammeleḵ': {7},\n", " 'hammilḥāmˌā': {19},\n", " 'hammˈeleḵ': {17},\n", " 'hammˈāyim': {16},\n", " 'hannāvˈî': {2},\n", " 'hannᵊvîʔˈîm': {11},\n", " 'hayyardˈēn': {15},\n", " 'hayyardˌēn': {1},\n", " 'hayyāmˌîm': {8},\n", " 'hayyˈôm': {17},\n", " 'hazzāhˈāv': {10},\n", " 'hazzˈeh': {3},\n", " 'haššabbˈāṯ': {4},\n", " 'haššēnˈîṯ': {19},\n", " 'haššˈaʕar': {11},\n", " 'haṭṭˈôv': {1},\n", " 'heḥᵉzˈîq': {10},\n", " 'hinnˌî': {15},\n", " 'hāhˈēm': {6},\n", " 'hāyˌû': {5},\n", " 'hāʔˈāreṣ': {13},\n", " 'hāʔˈēlleh': {3},\n", " 'hāʔᵃnāšˈîm': {0},\n", " 'hāʔᵃḏāmˈā': {7},\n", " 'hāʔᵉlōhˈîm': {18},\n", " 'hāʕˈîr': {13},\n", " 'hōlˈēḵ': {0},\n", " 'hˈû': {18},\n", " 'hˈāʔᵉlōhˈîm': {5},\n", " 'hˌēm': {9},\n", " 'kaʔᵃšˌer': {3},\n", " 'kol-hāʔˈāreṣ': {19},\n", " 'kol-ʔᵃšˈer': {4},\n", " 'kullˈām': {6},\n", " 'kˈî': {3},\n", " 'kˈî-ʔattˈā': {14},\n", " 'kˈî-ʔᵃnˈî': {16},\n", " 'kˈēn': {18},\n", " 'kˈō-ʔāmˈar': {5},\n", " 'kˌî': {17},\n", " 'kˌō': {12},\n", " 'kᵊnˈaʕan': {4},\n", " 'kᵊḵˌōl': {8},\n", " 
'la[yhwˈāh': {3},\n", " 'lalᵊwiyyˈim': {11},\n", " 'lammilḥāmˈā': {8},\n", " 'laʕᵃśˈôṯ': {5},\n", " 'laʕᵃśˌôṯ': {2, 19},\n", " 'leʔᵉḵˈōl': {9},\n", " 'lifᵊnˈê': {17},\n", " 'lifᵊnˌê': {3},\n", " 'liqᵊrāṯˈô': {7},\n", " 'liqᵊrˈaṯ': {16},\n", " 'lirᵊʔˈôṯ': {6},\n", " 'livᵊnˌê': {16},\n", " 'lišᵊlōmˈō': {10},\n", " 'llēʔmˈōr': {7, 8},\n", " 'lāhˈem': {3},\n", " 'lāhˌem': {19},\n", " 'lālˈeḵeṯ': {4},\n", " 'lāqˈaḥ': {7},\n", " 'lāḏˈaʕaṯ': {1},\n", " 'lāḵˈem': {17},\n", " 'lāḵˈēn': {5},\n", " 'lāḵˌēn': {10},\n", " 'lāṯˌēṯ': {6},\n", " 'lēʔmˈōr': {17},\n", " 'lēʔmˌōr': {19},\n", " 'lˈa[yhwˌāh': {13},\n", " 'lˈeḥem': {18},\n", " 'lˈî': {2, 6},\n", " 'lˈô': {17},\n", " 'lˈāmmā': {18},\n", " 'lˈēv': {19},\n", " 'lˈēḵ': {6},\n", " 'lˈō': {17},\n", " 'lˌeḥem': {4},\n", " 'lˌô': {3},\n", " 'lˌāh': {10},\n", " 'lˌānû': {10},\n", " 'lˌō': {17},\n", " 'lᵊfānˈeʸḵā': {18},\n", " 'lᵊhillāḥˌēm': {7},\n", " 'lᵊmalᵊḵˌê': {14},\n", " 'lᵊrištˈāh': {14},\n", " 'lᵊšālˈôm': {11},\n", " 'lᵊḥaṭṭˈāṯ': {9},\n", " 'lᵊḵˌā': {5},\n", " 'maddˈûₐʕ': {9},\n", " 'malkˈā': {8, 9},\n", " 'maʕᵃśˌē': {12},\n", " 'mibbˌen': {4},\n", " 'mikkˈōl': {4},\n", " 'milḥāmˈā': {4},\n", " 'minḥˈā': {4},\n", " 'minḥˌā': {11},\n", " 'miyyˈaḏ': {5},\n", " 'mizbˈēₐḥ': {11},\n", " 'mišpˌaḥaṯ': {4},\n", " 'mišpˌāṭ': {6},\n", " 'miššˌām': {7},\n", " 'miṣrˈayim': {2},\n", " 'miṣrˈāyim': {13},\n", " 'miṣrˌayim': {5},\n", " 'mālˈaḵ': {14},\n", " 'mālˌaḵ': {18},\n", " 'māḡˈēn': {0},\n", " 'mēhˈem': {16},\n", " 'mēʔˌereṣ': {16},\n", " 'mēʔˌā': {14},\n", " 'mēʕˌal': {13},\n", " 'mōšˈeh': {17},\n", " 'mˈeleḵ': {15},\n", " 'mˈeleḵ-bāvˈel': {9},\n", " 'mˌî': {6},\n", " 'mˌôṯ': {14},\n", " 'mᵊʔˈûmā': {12},\n", " 'nafšˈô': {4},\n", " 'nāṯˈattî': {9},\n", " 'nāṯˌan': {2},\n", " 'nōśˈē': {11},\n", " 'parʕˈō': {13},\n", " 'pᵊnˌê': {1},\n", " 'qoḏšˈî': {0},\n", " 'qˈôl': {12},\n", " 'qˌôl': {1},\n", " 'qᵊṭˈōreṯ': {6},\n", " 'rabbˈîm': {18},\n", " 'rabbˌîm': {11},\n", " 'rāšˈāʕ': {15},\n", " 'rāʔˈîṯî': {2},\n", " 'rāʕˈā': {1},\n", " 
'rˈāv': {1},\n", " 'sˈōleṯ': {0},\n", " 'taʕᵃśˈeh': {0},\n", " 'tihyˌeh': {10},\n", " 'vabbˈōqer': {12},\n", " 'vˈāh': {5},\n", " 'vᵊnˈê-yiśrāʔˈēl': {11},\n", " 'vᵊnˌô': {8},\n", " 'wattˌōmer': {12},\n", " 'wayyihyˈû': {16},\n", " 'wayyimlˌōḵ': {1},\n", " 'wayyiqrˈā': {15},\n", " 'wayyiqrˌā': {5, 19},\n", " 'wayyiśśˌā': {19},\n", " 'wayyišlˈaḥ': {15},\n", " 'wayyišlˌaḥ': {8},\n", " 'wayyāvˈō': {13},\n", " 'wayyōmᵊrˌû': {18},\n", " 'wayyˈaʕan': {7},\n", " 'wayyˈaḵ': {6},\n", " 'wayyˈōmer': {13},\n", " 'wayyˈōmᵊrû': {5},\n", " 'wayyˌaḵ': {12},\n", " 'wayyˌāqom': {14},\n", " 'wayyˌēšev': {12},\n", " 'wayyˌōmer': {13},\n", " 'wayᵊhˈî': {9, 18},\n", " 'wayᵊhˌî': {17},\n", " 'wayᵊḏabbˌēr': {8},\n", " 'waʔᵃšˈer': {15},\n", " 'waʔᵃšˌer': {5},\n", " 'waʕᵃśîṯˌem': {0},\n", " 'waḥᵃmiššˈā': {0},\n", " 'wᵊhalᵊwiyyˈim': {8},\n", " 'wᵊhinnˌē': {8},\n", " 'wᵊhāyˌû': {10},\n", " 'wᵊhāʕˌām': {6},\n", " 'wᵊhˌû': {19},\n", " 'wᵊhˌēm': {4},\n", " 'wᵊlifᵊnˌê': {6},\n", " 'wᵊlˈō': {3},\n", " 'wᵊlˌō': {13},\n", " 'wᵊnāṯattˌā': {12},\n", " 'wᵊyˌeṯer': {10},\n", " 'wᵊšˈēm': {8},\n", " 'wᵊšˌēm': {7},\n", " 'wᵊʔahᵃrˈōn': {1},\n", " 'wᵊʔarbāʕˌā': {14},\n", " 'wᵊʔargāmˈān': {10},\n", " 'wᵊʔattˈem': {16},\n", " 'wᵊʔattˌem': {2},\n", " 'wᵊʔāmartˈā': {8},\n", " 'wᵊʔānōḵˌî': {19},\n", " 'wᵊʔˌîš': {15},\n", " 'wᵊʔˌēlleh': {9},\n", " 'wᵊʔˌēṯ': {3},\n", " 'wᵊʕeśrˈîm': {4},\n", " 'wᵊʕeśrˌîm': {10},\n", " 'wᵊʕāśˈîṯā': {2},\n", " 'wᵊʕˈal': {6},\n", " 'wᵊʕˌaḏ': {15},\n", " 'wᵊḥaṣrêhˈen': {19},\n", " 'wᵊḵol-hāʕˈām': {12},\n", " 'wᵊḵol-yiśrāʔˈēl': {19},\n", " 'wᵊḵol-yiśrāʔˌēl': {16},\n", " 'wᵊḵˈēn': {7},\n", " 'wᵊḵˌî': {12},\n", " 'wᵊḵˌōl': {2},\n", " 'yaʕᵃqˌōv': {5},\n", " 'yaʕᵃśˈeh': {15},\n", " 'yiśrāʔˈēl': {3},\n", " 'yiśrāʔˌēl': {15},\n", " 'yôšˈēv': {12},\n", " 'yāmˈûṯ': {16},\n", " 'yāmˌîm': {2},\n", " 'yāvˈōʔû': {14},\n", " 'yāḏˌî': {12},\n", " 'yāḏˌô': {2},\n", " 'yōšᵊvˌê': {11},\n", " 'yˈom': {9},\n", " 'yˈôm': {13},\n", " 'yᵊhwˈāh': {17},\n", " 'yᵊhwˌāh': {17},\n", " 'yᵊhôšˈuₐʕ': {13},\n", " 
'yᵊhûḏˈā': {3},\n", " 'yᵊmˌê': {0},\n", " 'yᵊrûšālˈāim': {8},\n", " 'yᵊḥizqiyyˈāhû': {4},\n", " 'zāhˌāv': {18},\n", " 'zˈōṯ': {13},\n", " 'zˌeh': {15},\n", " 'ûmˌî': {16},\n", " 'ûvānˈôṯ': {10},\n", " 'ûšᵊnˈê': {11},\n", " 'šimšˈôn': {10},\n", " 'šivʕˈaṯ': {2},\n", " 'šivʕˈā': {9},\n", " 'šivʕˌîm': {4},\n", " 'šivʕˌā': {1},\n", " 'šāmˈaʕtî': {7},\n", " 'šānˈā': {18},\n", " 'šānˌîm': {6},\n", " 'šēnˈîṯ': {16},\n", " 'šōmᵊrˈôn': {1},\n", " 'šˈaʕar': {8},\n", " 'šˈemen': {4},\n", " 'šˈāmmā': {16},\n", " 'šᵊlōmˈō': {13},\n", " 'šᵊlōšˈîm': {19},\n", " 'šᵊlōšˌîm': {6},\n", " 'šᵊlˌōš': {16},\n", " 'šᵊmayyˈā': {11},\n", " 'šᵊmˈeḵā': {11},\n", " 'šᵊmˈô': {10},\n", " 'šᵊnˈê': {1},\n", " 'šᵊnˌêm': {4},\n", " 'šᵊʔērˈîṯ': {0},\n", " 'šᵊʔˈôl': {12},\n", " 'ʔavšālˈôm': {0},\n", " 'ʔaḥʔˌāv': {0},\n", " 'ʔaḥᵃrˌê': {18},\n", " 'ʔefrˈayim': {8},\n", " 'ʔel-hammˈeleḵ': {9},\n", " 'ʔel-mōšˈeh': {5},\n", " 'ʔel-ʔˈereṣ': {6},\n", " 'ʔel-ʔˈōhel': {1},\n", " 'ʔel-ʔˌereṣ': {7},\n", " 'ʔestˈēr': {11},\n", " 'ʔettˈēn': {7},\n", " 'ʔeḥˈāḏ': {2},\n", " 'ʔeṯ-[yᵊhwˌāh': {15},\n", " 'ʔeṯ-haddāvˌār': {10},\n", " 'ʔeṯ-hāʔˌāreṣ': {11},\n", " 'ʔeṯ-pᵊnˈê': {10},\n", " 'ʔeṯ-ʕammˈî': {10},\n", " 'ʔimmˈô': {15},\n", " 'ʔittˈî': {19},\n", " 'ʔištˈô': {19},\n", " 'ʔôṯˈām': {13},\n", " 'ʔāmˈar': {13},\n", " 'ʔārˈûr': {0},\n", " 'ʔāsˈā': {0},\n", " 'ʔēlˈay': {3},\n", " 'ʔēlˈāʸw': {13},\n", " 'ʔōṯˈô': {17},\n", " 'ʔōṯˈām': {13},\n", " 'ʔˈarṣā': {16},\n", " 'ʔˈaḵ': {18},\n", " 'ʔˈîš': {3},\n", " 'ʔˈô': {13},\n", " 'ʔˈānōḵî': {11},\n", " 'ʔˈārᵊṣā': {14},\n", " 'ʔˈāwen': {12},\n", " 'ʔˈāz': {14},\n", " 'ʔˈēṯ': {13},\n", " 'ʔˈōreḵ': {9},\n", " 'ʔˌên': {6},\n", " 'ʔˌîš': {3},\n", " 'ʔˌô': {15},\n", " 'ʔˌēš': {5},\n", " 'ʔˌēṣel': {0},\n", " 'ʔˌēṯ': {3},\n", " 'ʔᵃlêḵˈem': {9},\n", " 'ʔᵃlēhˈem': {5},\n", " 'ʔᵃlēhˌem': {0},\n", " 'ʔᵃnˈaḥnû': {2},\n", " 'ʔᵃrˈôn': {6},\n", " 'ʔᵃvōṯˈām': {4},\n", " 'ʔᵃšer-dibbˌer': {14},\n", " 'ʔᵃšer-lˈô': {7},\n", " 'ʔᵃšer-ʕāśˌā': {9},\n", " 'ʔᵃšˈer': {3},\n", " 'ʔᵃšˌer': {17},\n", " 
'ʔᵃḏabbˈēr': {7},\n", " 'ʔᵃḏōnˈî': {5},\n", " 'ʔᵃḏōnˈāy': {2},\n", " 'ʔᵃḥˈî': {11},\n", " 'ʔᵉlōhˈeʸḵā': {17},\n", " 'ʔᵉlōhˈîm': {17},\n", " 'ʔᵉlōhˌênû': {14},\n", " 'ʔᵉlōhˌîm': {8, 18},\n", " 'ʔᵉmˈeṯ': {19},\n", " 'ʕal-hammizbˌēₐḥ': {1},\n", " 'ʕal-kˈēn': {2, 10},\n", " 'ʕammˈîm': {7},\n", " 'ʕavdᵊḵˈā': {9},\n", " 'ʕavᵊḏˈê': {9},\n", " 'ʕaḏ-hāʕˈārev': {19},\n", " 'ʕimmˈô': {15},\n", " 'ʕimmˌô': {12},\n", " 'ʕôlˈām': {1, 7},\n", " 'ʕālˈayiḵ': {16},\n", " 'ʕālˈāy': {14},\n", " 'ʕālˈāʸw': {3},\n", " 'ʕālˌayiḵ': {10},\n", " 'ʕāśˈîṯā': {2},\n", " 'ʕāśˈû': {12},\n", " 'ʕāśˌû': {0},\n", " 'ʕāśˌār': {2},\n", " 'ʕēśˈāw': {19},\n", " 'ʕōśˈeh': {12},\n", " 'ʕˈal': {18},\n", " 'ʕˈîr': {8},\n", " 'ʕᵃlêḵˌem': {11},\n", " 'ˈkî': {18},\n", " 'ˈkō': {8},\n", " 'ˈkōl': {8},\n", " 'ˈʔîš': {6},\n", " 'ḏāwˈiḏ': {15},\n", " 'ḏāwˈîḏ': {1},\n", " 'ḏᵊvar-[yᵊhwˈāh': {14},\n", " 'ḏᵊvar-[yᵊhwˌāh': {18},\n", " 'ḥoḵmˈā': {14},\n", " 'ḥālˌāv': {16},\n", " 'ḥˈāyil': {1},\n", " 'ḥᵃmiššˈā': {14},\n", " 'ḥᵒḏāšˈîm': {7},\n", " 'ṣāfˈônā': {7},\n", " 'ṣˈōn': {19},\n", " 'ṭˈôv': {5},\n", " 'ṭˌôv': {15},\n", " 'ṯaʕᵃśˈû': {11}})" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "word2topic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just to explore: are there words that belong to multiple topics?" 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multiples = {x for x in word2topic if len(word2topic[x]) > 1}\n", "len(multiples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes, and here they are:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'laʕᵃśˌôṯ',\n", " 'llēʔmˈōr',\n", " 'lˈî',\n", " 'malkˈā',\n", " 'wayyiqrˌā',\n", " 'wayᵊhˈî',\n", " 'ʔᵉlōhˌîm',\n", " 'ʕal-kˈēn',\n", " 'ʕôlˈām'}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multiples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write an HTML file of topics." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{0: ['ʕāśˌû',\n", " 'gᵊvˈûl',\n", " 'hāʔᵃnāšˈîm',\n", " 'taʕᵃśˈeh',\n", " 'ʔavšālˈôm',\n", " 'ʔᵃlēhˌem',\n", " 'yᵊmˌê',\n", " 'waḥᵃmiššˈā',\n", " 'ʔārˈûr',\n", " 'ʔaḥʔˌāv',\n", " 'ʔāsˈā',\n", " 'hōlˈēḵ',\n", " 'qoḏšˈî',\n", " 'waʕᵃśîṯˌem',\n", " 'māḡˈēn',\n", " 'bˈānû',\n", " 'sˈōleṯ',\n", " 'šᵊʔērˈîṯ',\n", " 'ʔˌēṣel',\n", " 'bāśˈār'],\n", " 1: ['ʕôlˈām',\n", " 'bˈāʔû',\n", " 'haddᵊvārˈîm',\n", " 'wᵊʔahᵃrˈōn',\n", " 'pᵊnˌê',\n", " 'šōmᵊrˈôn',\n", " 'qˌôl',\n", " 'rˈāv',\n", " 'šᵊnˈê',\n", " 'hayyardˌēn',\n", " 'wayyimlˌōḵ',\n", " 'šivʕˌā',\n", " 'rāʕˈā',\n", " 'lāḏˈaʕaṯ',\n", " 'haṭṭˈôv',\n", " 'ʕal-hammizbˌēₐḥ',\n", " 'ʔel-ʔˈōhel',\n", " 'ḏāwˈîḏ',\n", " 'bᵊmālᵊḵˈô',\n", " 'ḥˈāyil'],\n", " 2: ['ʔᵃḏōnˈāy',\n", " 'ʔeḥˈāḏ',\n", " 'lˈî',\n", " 'miṣrˈayim',\n", " 'ʕāśˌār',\n", " 'wᵊḵˌōl',\n", " 'hannāvˈî',\n", " 'yāmˌîm',\n", " 'bᵊḵˌōl',\n", " 'wᵊʕāśˈîṯā',\n", " 'yāḏˌô',\n", " 'laʕᵃśˌôṯ',\n", " 'hahˈî',\n", " 'šivʕˈaṯ',\n", " 'ʕal-kˈēn',\n", " 'ʕāśˈîṯā',\n", " 'nāṯˌan',\n", " 'ʔᵃnˈaḥnû',\n", 
" 'rāʔˈîṯî',\n", " 'wᵊʔattˌem'],\n", " 3: ['kˈî',\n", " 'ʔᵃšˈer',\n", " 'yiśrāʔˈēl',\n", " 'hazzˈeh',\n", " 'ʔˈîš',\n", " 'wᵊlˈō',\n", " 'yᵊhûḏˈā',\n", " 'lāhˈem',\n", " 'wᵊʔˌēṯ',\n", " 'lifᵊnˌê',\n", " 'ʔˌîš',\n", " 'kaʔᵃšˌer',\n", " 'lˌô',\n", " 'ʕālˈāʸw',\n", " 'hāʔˈēlleh',\n", " 'la[yhwˈāh',\n", " 'ʔˌēṯ',\n", " 'bayyˈôm',\n", " 'hahˈû',\n", " 'ʔēlˈay'],\n", " 4: ['milḥāmˈā',\n", " 'mišpˌaḥaṯ',\n", " 'lālˈeḵeṯ',\n", " 'nafšˈô',\n", " 'bānˈîm',\n", " 'lˌeḥem',\n", " 'šᵊnˌêm',\n", " 'ʔᵃvōṯˈām',\n", " 'šivʕˌîm',\n", " 'haššabbˈāṯ',\n", " 'mikkˈōl',\n", " 'wᵊhˌēm',\n", " 'wᵊʕeśrˈîm',\n", " 'kol-ʔᵃšˈer',\n", " 'mibbˌen',\n", " 'minḥˈā',\n", " 'gˌam',\n", " 'yᵊḥizqiyyˈāhû',\n", " 'šˈemen',\n", " 'kᵊnˈaʕan'],\n", " 5: ['hakkōhˈēn',\n", " 'lᵊḵˌā',\n", " 'ṭˈôv',\n", " 'laʕᵃśˈôṯ',\n", " 'ʔᵃlēhˈem',\n", " 'ʔel-mōšˈeh',\n", " 'kˈō-ʔāmˈar',\n", " 'bāvˈel',\n", " 'wayyiqrˌā',\n", " 'lāḵˈēn',\n", " 'yaʕᵃqˌōv',\n", " 'waʔᵃšˌer',\n", " 'wayyˈōmᵊrû',\n", " 'vˈāh',\n", " 'miṣrˌayim',\n", " 'ʔᵃḏōnˈî',\n", " 'hāyˌû',\n", " 'ʔˌēš',\n", " 'miyyˈaḏ',\n", " 'hˈāʔᵉlōhˈîm'],\n", " 6: ['lˈî',\n", " 'lˈēḵ',\n", " 'kullˈām',\n", " 'wᵊʕˈal',\n", " 'lāṯˌēṯ',\n", " 'ʔᵃrˈôn',\n", " 'ʔel-ʔˈereṣ',\n", " 'ʔˌên',\n", " 'šᵊlōšˌîm',\n", " 'mišpˌāṭ',\n", " 'wayyˈaḵ',\n", " 'ˈʔîš',\n", " 'hāhˈēm',\n", " 'qᵊṭˈōreṯ',\n", " 'mˌî',\n", " 'wᵊhāʕˌām',\n", " 'lirᵊʔˈôṯ',\n", " 'wᵊlifᵊnˌê',\n", " 'haggilʕˈāḏ',\n", " 'šānˌîm'],\n", " 7: ['miššˌām',\n", " 'wayyˈaʕan',\n", " 'wᵊšˌēm',\n", " 'hāʔᵃḏāmˈā',\n", " 'ʔettˈēn',\n", " 'šāmˈaʕtî',\n", " 'ṣāfˈônā',\n", " 'ʕôlˈām',\n", " 'ʕammˈîm',\n", " 'llēʔmˈōr',\n", " 'ʔᵃḏabbˈēr',\n", " 'ḥᵒḏāšˈîm',\n", " 'ʔᵃšer-lˈô',\n", " 'hammeleḵ',\n", " 'lᵊhillāḥˌēm',\n", " 'liqᵊrāṯˈô',\n", " 'bᵊrîṯ-[yᵊhwˈāh',\n", " 'wᵊḵˈēn',\n", " 'lāqˈaḥ',\n", " 'ʔel-ʔˌereṣ'],\n", " 8: ['wᵊhinnˌē',\n", " 'wayᵊḏabbˌēr',\n", " 'wayyišlˌaḥ',\n", " 'gˈam',\n", " 'wᵊʔāmartˈā',\n", " 'ʔᵉlōhˌîm',\n", " 'kᵊḵˌōl',\n", " 'llēʔmˈōr',\n", " 'ʕˈîr',\n", " 'ˈkō',\n", " 'šˈaʕar',\n", " 'ˈkōl',\n", " 
'lammilḥāmˈā',\n", " 'wᵊhalᵊwiyyˈim',\n", " 'ʔefrˈayim',\n", " 'malkˈā',\n", " 'vᵊnˌô',\n", " 'hayyāmˌîm',\n", " 'wᵊšˈēm',\n", " 'yᵊrûšālˈāim'],\n", " 9: ['ʔel-hammˈeleḵ',\n", " 'ʔᵃlêḵˈem',\n", " 'nāṯˈattî',\n", " 'ʕavdᵊḵˈā',\n", " 'mˈeleḵ-bāvˈel',\n", " 'malkˈā',\n", " 'ʔˈōreḵ',\n", " 'wayᵊhˈî',\n", " 'bᵊhˌar',\n", " 'lᵊḥaṭṭˈāṯ',\n", " 'šivʕˈā',\n", " 'yˈom',\n", " 'hˌēm',\n", " 'leʔᵉḵˈōl',\n", " 'wᵊʔˌēlleh',\n", " 'ʕavᵊḏˈê',\n", " 'ʔᵃšer-ʕāśˌā',\n", " 'baššānˈā',\n", " 'maddˈûₐʕ',\n", " 'bišᵊnˈaṯ'],\n", " 10: ['wᵊhāyˌû',\n", " 'šᵊmˈô',\n", " 'lāḵˌēn',\n", " 'hazzāhˈāv',\n", " 'lˌāh',\n", " 'lišᵊlōmˈō',\n", " 'tihyˌeh',\n", " 'ʔeṯ-haddāvˌār',\n", " 'šimšˈôn',\n", " 'ʔeṯ-ʕammˈî',\n", " 'lˌānû',\n", " 'ʕālˌayiḵ',\n", " 'ʕal-kˈēn',\n", " 'wᵊʕeśrˌîm',\n", " 'wᵊʔargāmˈān',\n", " 'ʔeṯ-pᵊnˈê',\n", " 'heḥᵉzˈîq',\n", " 'wᵊyˌeṯer',\n", " 'ûvānˈôṯ',\n", " 'hammalkˈā'],\n", " 11: ['ʕᵃlêḵˌem',\n", " 'vᵊnˈê-yiśrāʔˈēl',\n", " 'haššˈaʕar',\n", " 'lalᵊwiyyˈim',\n", " 'ʔˈānōḵî',\n", " 'minḥˌā',\n", " 'baḥˈerev',\n", " 'ṯaʕᵃśˈû',\n", " 'ʔestˈēr',\n", " 'šᵊmˈeḵā',\n", " 'ûšᵊnˈê',\n", " 'yōšᵊvˌê',\n", " 'mizbˈēₐḥ',\n", " 'rabbˌîm',\n", " 'ʔᵃḥˈî',\n", " 'šᵊmayyˈā',\n", " 'lᵊšālˈôm',\n", " 'ʔeṯ-hāʔˌāreṣ',\n", " 'hannᵊvîʔˈîm',\n", " 'nōśˈē'],\n", " 12: ['dˌî',\n", " 'kˌō',\n", " 'ʕāśˈû',\n", " 'ʔˈāwen',\n", " 'qˈôl',\n", " 'wattˌōmer',\n", " 'ʕōśˈeh',\n", " 'wayyˌēšev',\n", " 'ʕimmˌô',\n", " 'wayyˌaḵ',\n", " 'vabbˈōqer',\n", " 'bᵊnˌô',\n", " 'mᵊʔˈûmā',\n", " 'wᵊḵol-hāʕˈām',\n", " 'wᵊḵˌî',\n", " 'maʕᵃśˌē',\n", " 'šᵊʔˈôl',\n", " 'yāḏˌî',\n", " 'yôšˈēv',\n", " 'wᵊnāṯattˌā'],\n", " 13: ['wayyˈōmer',\n", " 'wayyˌōmer',\n", " 'wᵊlˌō',\n", " 'hāʔˈāreṣ',\n", " 'ʔāmˈar',\n", " 'ʔēlˈāʸw',\n", " 'ʔˈēṯ',\n", " 'miṣrˈāyim',\n", " 'šᵊlōmˈō',\n", " 'ʔōṯˈām',\n", " 'ʔˈô',\n", " 'hāʕˈîr',\n", " 'yᵊhôšˈuₐʕ',\n", " 'ʔôṯˈām',\n", " 'wayyāvˈō',\n", " 'yˈôm',\n", " 'zˈōṯ',\n", " 'mēʕˌal',\n", " 'lˈa[yhwˌāh',\n", " 'parʕˈō'],\n", " 14: ['ḥᵃmiššˈā',\n", " 'mēʔˌā',\n", " 'ḥoḵmˈā',\n", " 'ʔˈārᵊṣā',\n", " 
'ʔˈāz',\n", " 'wayyˌāqom',\n", " 'lᵊrištˈāh',\n", " 'ʕālˈāy',\n", " 'ba[yhwˈāh',\n", " 'ʔᵃšer-dibbˌer',\n", " 'yāvˈōʔû',\n", " 'mˌôṯ',\n", " 'mālˈaḵ',\n", " 'ḏᵊvar-[yᵊhwˈāh',\n", " 'bˈêṯ-ʔˈēl',\n", " 'bēʔḏˈayin',\n", " 'lᵊmalᵊḵˌê',\n", " 'wᵊʔarbāʕˌā',\n", " 'ʔᵉlōhˌênû',\n", " 'kˈî-ʔattˈā'],\n", " 15: ['yiśrāʔˌēl',\n", " 'mˈeleḵ',\n", " 'bᵊnˌê',\n", " 'ʕimmˈô',\n", " 'ʔˌô',\n", " 'hinnˌî',\n", " 'wayyiqrˈā',\n", " 'wayyišlˈaḥ',\n", " 'haddāvˈār',\n", " 'ḏāwˈiḏ',\n", " 'ʔimmˈô',\n", " 'ʔeṯ-[yᵊhwˌāh',\n", " 'zˌeh',\n", " 'hayyardˈēn',\n", " 'rāšˈāʕ',\n", " 'wᵊʔˌîš',\n", " 'ṭˌôv',\n", " 'yaʕᵃśˈeh',\n", " 'wᵊʕˌaḏ',\n", " 'waʔᵃšˈer'],\n", " 16: ['šˈāmmā',\n", " 'mēʔˌereṣ',\n", " 'mēhˈem',\n", " 'livᵊnˌê',\n", " 'wᵊʔattˈem',\n", " 'ʔˈarṣā',\n", " 'ûmˌî',\n", " 'liqᵊrˈaṯ',\n", " 'halᵊwiyyˌim',\n", " 'yāmˈûṯ',\n", " 'wayyihyˈû',\n", " 'kˈî-ʔᵃnˈî',\n", " 'bammˈāyim',\n", " 'ḥālˌāv',\n", " 'dāwˌîḏ',\n", " 'wᵊḵol-yiśrāʔˌēl',\n", " 'šēnˈîṯ',\n", " 'šᵊlˌōš',\n", " 'hammˈāyim',\n", " 'ʕālˈayiḵ'],\n", " 17: ['yᵊhwˈāh',\n", " 'ʔᵃšˌer',\n", " 'yᵊhwˌāh',\n", " 'lˈō',\n", " 'kˌî',\n", " 'lēʔmˈōr',\n", " 'lˌō',\n", " 'hammˈeleḵ',\n", " 'bᵊnˈê',\n", " 'lˈô',\n", " 'ʔᵉlōhˈîm',\n", " 'lāḵˈem',\n", " 'lifᵊnˈê',\n", " 'mōšˈeh',\n", " 'ʔōṯˈô',\n", " 'ʔᵉlōhˈeʸḵā',\n", " 'wayᵊhˌî',\n", " 'bˈêṯ',\n", " 'hayyˈôm',\n", " 'dāwˈiḏ'],\n", " 18: ['hˈû',\n", " 'wayᵊhˈî',\n", " 'šānˈā',\n", " 'hāʔᵉlōhˈîm',\n", " 'ˈkî',\n", " 'kˈēn',\n", " 'rabbˈîm',\n", " 'ʕˈal',\n", " 'lˈeḥem',\n", " 'ḏᵊvar-[yᵊhwˌāh',\n", " 'bᵊʕênˈê',\n", " 'ʔᵉlōhˌîm',\n", " 'zāhˌāv',\n", " 'ʔaḥᵃrˌê',\n", " 'ʔˈaḵ',\n", " 'lᵊfānˈeʸḵā',\n", " 'wayyōmᵊrˌû',\n", " 'fᵊlištˈîm',\n", " 'lˈāmmā',\n", " 'mālˌaḵ'],\n", " 19: ['lāhˌem',\n", " 'wᵊhˌû',\n", " 'lˈēv',\n", " 'ṣˈōn',\n", " 'ʔištˈô',\n", " 'ʕēśˈāw',\n", " 'ʕaḏ-hāʕˈārev',\n", " 'laʕᵃśˌôṯ',\n", " 'ʔᵉmˈeṯ',\n", " 'wayyiqrˌā',\n", " 'lēʔmˌōr',\n", " 'ʔittˈî',\n", " 'šᵊlōšˈîm',\n", " 'wᵊḥaṣrêhˈen',\n", " 'wᵊʔānōḵˌî',\n", " 'hammilḥāmˌā',\n", " 'haššēnˈîṯ',\n", " 'wayyiśśˌā',\n", " 
'kol-hāʔˈāreṣ',\n", " 'wᵊḵol-yiśrāʔˈēl']}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topic2words" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filename = 'topic_list.html'\n", "with open(filename, 'w') as f:\n", "    f.write('''\n", "\n", "\n", "''')\n", "    for (t, words) in sorted(topic2words.items()):\n", "        # each topic gets an anchor topic_N, the target of the links in the full text\n", "        f.write('''\n", "<h2 id=\"topic_{}\">Topic {}</h2>\n", "<p>{}</p>\n", "'''.format(t, t, '<br/>'.join(words)))\n", "    f.write('''\n", "\n", "\n", "''')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Producing a text\n", "\n", "We produce a phonemic Hebrew Bible in markdown syntax, with links to the topic list in front of every topic word, and convert it to HTML." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "31m 38s Writing Bible in markdown\n", "31m 41s Converting markdown to html\n", "31m 52s Done\n" ] } ], "source": [ "msg('Writing Bible in markdown')\n", "filename = 'etcbc_topictext.html'\n", "md_lines = []\n", "for v in F.otype.s('verse'):\n", "    b = L.u('book', v)\n", "    book = T.book_name(b)\n", "    chapter = F.chapter.v(v)\n", "    verse = F.verse.v(v)\n", "    label = '{} {}:{}'.format(book, chapter, verse)\n", "    line = []\n", "    for w in L.d('word', v):\n", "        word = T.words([w], fmt='pf')\n", "        word_lookup = word.rstrip()\n", "        topics = word2topic.get(word_lookup, [])\n", "        # link to the anchors in topic_list.html, the file written above\n", "        topic_refs = ' '.join(\n", "            '[{}](topic_list.html#topic_{})'.format(topic, topic) for topic in topics\n", "        )\n", "        line.append('{}{}'.format(topic_refs, word))\n", "    md_lines.append('{} {}'.format(label, ''.join(line)))\n", "msg('Converting markdown to html')\n", "tf = open(filename, 'w')\n", "tf.write('''\n", "\n", "{}\n", "\n", "\n", "'''.format(markdown('\\n'.join(md_lines))))\n", "tf.close()\n", "msg('Done')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }