{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started\n", "\n", "It is assumed that you have read\n", "[start](start.ipynb)\n", "and followed the installation instructions there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Corpus\n", "\n", "This is:\n", "\n", "* `dss` Dead Sea Scrolls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# First acquaintance\n", "\n", "We just want to grasp what the corpus is about and how we can find our way in the data.\n", "\n", "Open a terminal or command prompt and run the following command:\n", "\n", "```text-fabric dss```\n", "\n", "A lot will happen before your browser starts up and shows you an interface to the corpus:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text-Fabric needs an app to deal with the corpus-specific things.\n", "It downloads/finds/caches the latest version of the **app**:\n", "\n", "```\n", "Using TF-app in /Users/dirk/text-fabric-data/annotation/app-dss/code:\n", "\trv0.6=#304d66fd7eab50bbe4de8505c24d8b3eca30b1f1 (latest release)\n", "```\n", "\n", "It downloads/finds/caches the latest version of the **data**:\n", "\n", "```\n", "Using data in /Users/dirk/text-fabric-data/etcbc/dss/tf/0.6:\n", "\trv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)\n", "```\n", "\n", "The data is preprocessed in order to speed up typical Text-Fabric operations.\n", "The result is cached on your computer.\n", "Preprocessing costs time. 
Next time you use this corpus on this machine, startup is much quicker.\n", "\n", "```\n", "TF setup done.\n", "```\n", "\n", "Then the app goes on to act as a local webserver serving the corpus that has just been downloaded,\n", "and it will open your browser for you and load the corpus page:\n", "\n", "```\n", " * Running on http://localhost:8107/ (Press CTRL+C to quit)\n", "Opening dss in browser\n", "Listening at port 18987\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Help!\n", "\n", "Indeed, that is what you need. Click the vertical `Help` tab.\n", "\n", "From there, click around a little bit. Don't read closely, just note the kinds of information that are presented to you.\n", "\n", "Later on, it will make more sense!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Browsing\n", "\n", "First we browse our data. Click the browse button.\n", "\n", "\n", "\n", "and then, in the table of *documents* (scrolls), click on a fragment of scroll `1QSb`:\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you're looking at a fragment of a scroll: the writing in Hebrew characters without vowel signs.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now click the *Options* tab and select the `layout-orig-unicode` format to see the same fragment in a layout that indicates the status\n", "of the pieces of writing.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can click a triangle to see how a line is broken down:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching\n", "\n", "In this corpus a lot of attention is paid to the uncertainty of signs and to whether they have been corrected, either in antiquity or\n", "in more modern times.\n", "\n", "Also, the corpus is marked up with part-of-speech for each word.\n", "\n", "So we can, for example, search 
for *verbs* that have an uncertain, corrected, removed, or reconstructed consonant in them.\n", "\n", "```\n", "word sp=verb\n", " sign type=cons\n", " /with/\n", " .. unc=1|2|3|4\n", " /or/\n", " .. cor=1|2|3\n", " /or/\n", " .. rem=1|2\n", " /or/\n", " .. rec=1\n", " /-/\n", "```\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In English:\n", "\n", "find all `word`s whose feature `sp` has value `verb` and that contain a `sign` with feature `type`\n", "having value `cons` (consonant) where at least one of the following holds for\n", "that sign:\n", "\n", "* the feature `unc` has value `1` or `2` or `3` or `4`\n", "* the feature `cor` has value `1` or `2` or `3`\n", "* the feature `rem` has value `1` or `2`\n", "* the feature `rec` has value `1`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Words with multiple uncertain signs correspond to multiple results. We can condense the results in such a way that all results for the same word are shown as one result.\n", "\n", "Click the options tab, check *condense results*, and check *word* as the container into which you want to condense.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can expand results by clicking the triangle.\n", "\n", "You can see the result in context by clicking the browse icon.\n", "\n", "You can go back to the result list by clicking the results icon.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Computing\n", "\n", "This triggers other questions.\n", "\n", "For example: how many verbs are there, if there are already 37344 with uncertain signs?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How is uncertainty distributed over the verbs?\n", "I.e. 
how many verbs have how many uncertain/corrected/removed signs?\n", "\n", "*This is a typical question where you want to leave the search mode and enter computing mode*.\n", "\n", "Let's find out.\n", "\n", "Extra information:\n", "\n", "* the features `unc`, `cor`, `rem` have small integer values (`unc`: 1 to 4, `cor`: 1 to 3, `rem`: 1 to 2) that indicate the kind of uncertainty, correction, removal.\n", " We just use those values as a measure of the seriousness of the uncertainty.\n", " Essentially, we just sum up all values of these features for each sign.\n", "* the feature `rec`, if set, means that the sign is reconstructed. We consider such a sign to be severely uncertain, and add a penalty of 10 for\n", " such signs.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have followed the installation instructions, you are set.\n", "Go to the browser window that opened when you gave the command `jupyter notebook` in your terminal.\n", "\n", "Then continue reading and ... executing.\n", "\n", "You can execute a cell by putting your cursor inside it and pressing `Shift Enter`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we import seaborn (for plotting later on) and the `use` function of Text-Fabric, as follows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "from tf.app import use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we load the TF-app for the corpus `dss` and that app loads the corpus data.\n", "\n", "We give a name to the result of all that loading: `A`."
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "TF-app: ~/github/annotation/app-dss/code" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/etcbc/dss/tf/0.6" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "data: ~/github/etcbc/dss/parallels/tf/0.6" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Text-Fabric: Text-Fabric API 8.3.0, app-dss, Search Reference
Data: DSS, Character table, Feature docs
Features:
Parallel Passagessim
Dead Sea Scrollsafter
alt
biblical
book
chapter
cl
cl2
cor
fragment
full
fulle
fullo
glex
glexe
glexo
glyph
glyphe
glypho
gn
gn2
gn3
halfverse
intl
lang
lex
lexe
lexo
line
md
merr
morpho
nu
nu2
nu3
otype
ps
ps2
ps3
punc
punce
punco
rec
rem
script
scroll
sp
srcLn
st
type
unc
vac
verse
vs
vt
occ
oslots
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Text-Fabric API: names N F E L T S C TF directly usable

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A = use(\"dss:clone\", checkout=\"clone\", hoist=globals())\n", "# A = use('dss', hoist=globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some bits are familiar from above, when you ran the `text-fabric` command in the terminal.\n", "\n", "Other bits are links to the documentation, they point to the same places as the links on the Text-Fabric browser.\n", "\n", "You see a list of all the data features that have been loaded.\n", "\n", "And a list of references to the API documentation, which tells you how you can use this data in your program statements." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Searching (revisited)\n", "\n", "We do the same search again, but now inside our program.\n", "\n", "That means that we can capture the results in a list for further processing." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4.47s 125111 results\n" ] } ], "source": [ "template = \"\"\"\n", "word sp=verb\n", " sign type=cons\n", " /with/\n", " .. unc=1|2|3|4\n", " /or/\n", " .. cor=1|2|3\n", " /or/\n", " .. rem=1|2\n", " /or/\n", " .. rec=1\n", " /-/\n", "\"\"\"\n", "results = A.search(template)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In less than five seconds, we have all the results!\n", "\n", "Let's look at the first one:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1607456, 1742)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each result is a list of numbers: for a\n", "\n", "1. word\n", "1. 
sign" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the second one:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1607456, 1743)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here the last one:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2107836, 1430165)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are only interested in the words that we have encountered.\n", "We collect them in a set:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37344" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "verbs = sorted({result[0] for result in results})\n", "len(verbs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This corresponds exactly to the number of condensed results!\n", "\n", "Now we get the number of verbs:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "58873" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(F.sp.s(\"verb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In English: take feature `sp` (part-of-speech), and collect all nodes that have value `verb` for this feature.\n", "Then take the length of this list." 
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Now we want to find out something for each result verb: what is the accumulated uncertainty of that verb?\n", "Some verbs have more signs than others, so we divide by the number of signs.\n", "\n", "We define a function that collects the uncertainty of a single sign:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def getUncertainty(sign):\n", "    return sum(\n", "        (\n", "            F.unc.v(sign) or 0,\n", "            F.cor.v(sign) or 0,\n", "            F.rem.v(sign) or 0,\n", "            10 if F.rec.v(sign) else 0,\n", "        )\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what this gives for the sign in the 1000th result:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
י
type=cons
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sign = results[999][1]\n", "A.pretty(sign)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n" ] } ], "source": [ "unc = getUncertainty(sign)\n", "print(unc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An other one:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/html": [ "
ו
type=cons
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "sign = results[12][1]\n", "A.pretty(sign)\n", "print(getUncertainty(sign))" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Now we define a function that gives us the uncertainty of a word.\n", "We collect the consonants of the word.\n", "We sum the uncertainty of them and divide it by the number of consonants in the word." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def uncertainty(word):\n", " signs = L.d(\n", " word, otype=\"sign\"\n", " ) # go a Level down to signs and collect them in a list\n", " return sum(getUncertainty(sign) for sign in signs) / len(signs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute the uncertainty of some verbs." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "verb = verbs[999]\n", "A.pretty(verb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the computation:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.0\n" ] } ], "source": [ "unc = uncertainty(verb)\n", "print(unc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An other one:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "1.25\n" ] } ], "source": [ "verb = verbs[12]\n", "A.pretty(verb)\n", "print(uncertainty(verb))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute all word uncertainties and store them in a dictionary:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37344" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "verbUncertainty = {verb: uncertainty(verb) for verb in verbs}\n", "len(verbUncertainty)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the minimum and the maximum uncertainty?" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(verbUncertainty.values())" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.14285714285714285" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "min(verbUncertainty.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to visualize how many how uncertain verbs there are, we make a scatterplot,\n", "using the *seaborn* library." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.set(color_codes=True)\n", "sns.distplot(list(verbUncertainty.values()), axlabel=\"uncertainty\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's single out the verbs with uncertainty greater than 9, but lower than 10, and inspect a few." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "verbHighUnc = [verb for (verb, unc) in verbUncertainty.items() if 9 < unc < 10]\n", "len(verbHighUnc)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

result 1

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 2

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 3

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 4

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 5

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
word הלך
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 6

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 7

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 8

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 9

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
word בנו
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

result 10

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
sp=verbtype=glyph
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "A.show([[verb] for verb in verbHighUnc], fmt=\"layout-orig-full\", condenseType=\"word\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }