<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/ninologo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Similar lines

We spot the many similarities between lines in the corpus.

There are about 25,000 lines in the corpus. Comparing every line with every other line takes roughly 300 million comparisons.
That is a costly operation.
[On this laptop it took 6 whole minutes](https://nbviewer.jupyter.org/github/nino-cunei/oldbabylonian/blob/master/programs/parallels.ipynb).
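
The figure of 300 million comes from counting unordered pairs: $n$ lines give $n(n-1)/2$ pairs. The snippet below is a quick check, assuming the round figure of 25,000 lines from the text rather than the exact corpus size; `pair_count` is a name made up for this illustration.

```python
import math


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


N_LINES = 25_000  # round figure from the text, not the exact number of lines

pairs = pair_count(N_LINES)

# Cross-check against the binomial coefficient "n choose 2".
assert pairs == math.comb(N_LINES, 2)

print(f"{pairs:,} comparisons")  # 312,487,500 comparisons
```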

The good news is that we have stored the outcome in an extra feature.

This feature is packaged in a TF data module, which we load below by passing the parameter `mod` to `use()`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

from tf.app import use

In [3]:
A = use(
    "Nino-cunei/oldbabylonian",
    mod="Nino-cunei/oldbabylonian/parallels/tf:clone",
    hoist=globals(),
)

This is Text-Fabric 9.2.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

68 features found and 0 ignored


The new feature is **sim** and it is an edge feature.
It annotates pairs of lines $(l, m)$ where $l$ and $m$ have similar content.
The degree of similarity is a percentage (between 90 and 100), and this value
is annotated onto the edges.

Here is an example:

In [4]:
exampleLine = F.otype.s("line")[0]
sisters = E.sim.b(exampleLine)
print(f"{len(sisters)} similar lines")
print("\n".join(f"{s[0]} with similarity {s[1]}" for s in sisters[0:10]))
A.table(tuple((s[0],) for s in sisters), end=10)

75 similar lines
235394 with similarity 100
235421 with similarity 100
235434 with similarity 100
235464 with similarity 100
235478 with similarity 100
235503 with similarity 100
235529 with similarity 100
235585 with similarity 100
235615 with similarity 100
235629 with similarity 100


n,p,line
1,P510729 obverse:1,a-na {d}suen-i-din-[nam]
2,P510730 obverse:1,a-na {d}suen-i-din-nam
3,P510731 obverse:1,a-na {d}suen-i-din-nam
4,P510732 obverse:1,a-na {d}suen#-i-din-nam#
5,P497779 obverse:1,a-na {d}suen#-[i]-din-nam
6,P510733 obverse:1,[a-na] {d}[suen-i-din-nam]
7,P510734 obverse:1,[a-na {d}suen-i-din-nam]
8,P510736 obverse:1,a-na {d}suen-i-din-nam
9,P510737 obverse:1,a-na {d}suen-i-din-nam#
10,P370926 obverse:1,a-na {d}suen-i-din-nam
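
The example uses `E.sim.b()`, which looks at edges in both directions, while later cells use `E.sim.f()`, which only follows edges going out *from* a node. The toy class below is not Text-Fabric code. It is a minimal sketch that imitates this behaviour on three made-up edges with hypothetical node numbers, so that the difference between `f` (from), `t` (to) and `b` (both) is easy to see.

```python
class ToyEdgeFeature:
    """Toy stand-in for an edge feature with values (not the Text-Fabric implementation)."""

    def __init__(self, edges):
        # edges: iterable of (from_node, to_node, value) triples
        self._out = {}
        self._in = {}
        for (src, dst, value) in edges:
            self._out.setdefault(src, {})[dst] = value
            self._in.setdefault(dst, {})[src] = value

    def f(self, node):
        """Edges going out FROM node, as (other_node, value) pairs."""
        return tuple(sorted(self._out.get(node, {}).items()))

    def t(self, node):
        """Edges coming in TO node, as (other_node, value) pairs."""
        return tuple(sorted(self._in.get(node, {}).items()))

    def b(self, node):
        """Edges in BOTH directions, as (other_node, value) pairs."""
        merged = dict(self._in.get(node, {}))
        merged.update(self._out.get(node, {}))
        return tuple(sorted(merged.items()))


# Three made-up edges: 1 -> 2, 1 -> 3 and 4 -> 1.
toy_sim = ToyEdgeFeature([(1, 2, 100), (1, 3, 95), (4, 1, 90)])

print(toy_sim.f(1))  # ((2, 100), (3, 95))
print(toy_sim.t(1))  # ((4, 90),)
print(toy_sim.b(1))  # ((2, 100), (3, 95), (4, 90))
```

If a similarity is stored in one direction only, `f` may return nothing for a line that `b` still connects to its sisters, which is why `b` is the safer choice when you want all similar lines.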


# All similarities

Let's first find out the range of similarities:

In [5]:
minSim = None
maxSim = None

for ln in F.otype.s("line"):
    sisters = E.sim.f(ln)
    if not sisters:
        continue
    thisMin = min(s[1] for s in sisters)
    thisMax = max(s[1] for s in sisters)
    if minSim is None or thisMin < minSim:
        minSim = thisMin
    if maxSim is None or thisMax > maxSim:
        maxSim = thisMax

print(f"minimum similarity is {minSim:>3}")
print(f"maximum similarity is {maxSim:>3}")

minimum similarity is  90
maximum similarity is 100


# The bottom lines

We give a few examples of the least similar lines.

**N.B.** When lines are less than 90% similar, they have not made it into the `sim` feature!

We can use a search template to get the 90% lines.

In [6]:
query = """
line
-sim=90> line
"""

In words: find a line connected via a `sim` edge with value 90 to another line.
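
As a rough mental model (not a description of how the Text-Fabric search engine works internally), the template selects every ordered pair of lines joined by a `sim` edge whose value is exactly 90. On a made-up edge table with hypothetical node numbers, the same selection could be sketched like this; `pairs_with_value` is a helper invented for the illustration.

```python
# Made-up sim edges: (from_line, to_line) -> similarity percentage.
toy_sim = {
    (10, 11): 100,
    (10, 12): 90,
    (11, 12): 90,
    (13, 14): 95,
}


def pairs_with_value(edges, value):
    """Return the (from, to) pairs whose edge value equals `value`, sorted."""
    return sorted(pair for (pair, v) in edges.items() if v == value)


results_90 = pairs_with_value(toy_sim, 90)
print(results_90)  # [(10, 12), (11, 12)]
```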

In [7]:
results = A.search(query)

  0.19s 722 results


That is not many. It seems that lines are either very similar or not similar at all.

In [8]:
A.table(results, start=1, end=10)

n,p,line,line.1
1,P509373 obverse:10,_a-sza3 a-gar3_ na-ag-[ma-lum] _uru_ x x x{ki},_a-[sza3 a-gar3_ na-ag]-ma-lum _uru gan2_ x x{ki}
2,P509374 obverse:4,_{d}utu_ u3 _{d}marduk_ da-ri-[isz] _u4_-[mi x],{d}utu# u3 {d}marduk# [da-ri-isz _u4_-mi-im]
3,P509374 obverse:4,_{d}utu_ u3 _{d}marduk_ da-ri-[isz] _u4_-[mi x],_{d}utu_ u3 _{d}marduk_ da-ri-isz u4-mi-im
4,P509374 obverse:4,_{d}utu_ u3 _{d}marduk_ da-ri-[isz] _u4_-[mi x],{d}utu u3 {d}[marduk da-ri-isz _u4_]-mi#-im
5,P509376 obverse:11,it-ti-szu a-na _a-sza3_ ri-id-ma,[it-ti]-szu#-nu a-na _a-sza3_ ri-id-ma
6,P510527 obverse:4,"{d}utu u3 {d}marduk li-ba-al-li-t,u2-ka","{d}utu u3 {d}marduk li-ba-al-li-t,u2-ka!(KI)"
7,P510527 obverse:4,"{d}utu u3 {d}marduk li-ba-al-li-t,u2-ka","{d}utu u3 {d}marduk tu-ba-al-li-t,u2-ka"
8,P510529 obverse:4,{d}utu u3 {d}marduk da-ri-isz _u4_-mi,{d}utu# u3 {d}marduk# [da-ri-isz _u4_-mi-im]
9,P510529 obverse:4,{d}utu u3 {d}marduk da-ri-isz _u4_-mi,_{d}utu_ u3 _{d}marduk_ da-ri-isz u4-mi-im
10,P510529 obverse:4,{d}utu u3 {d}marduk da-ri-isz _u4_-mi,{d}utu u3 {d}[marduk da-ri-isz _u4_]-mi#-im


In case the ATF flags and clusters are a bit heavy on the eye, you can switch to a more pleasing rich text layout:

In [9]:
A.table(results, start=1, end=10, fmt="layout-orig-rich")

n,p,line,line.1
1,P509373 obverse:10,a-ša₃ a-gar₃ na-ag-ma-lum uru x x xki,a-ša₃ a-gar₃ na-ag-ma-lum uru gan₂ x xki
2,P509374 obverse:4,dutu u₃ dmarduk da-ri-iš u₄-mi x,dutu u₃ dmarduk da-ri-iš u₄-mi-im
3,P509374 obverse:4,dutu u₃ dmarduk da-ri-iš u₄-mi x,dutu u₃ dmarduk da-ri-iš u₄-mi-im
4,P509374 obverse:4,dutu u₃ dmarduk da-ri-iš u₄-mi x,dutu u₃ dmarduk da-ri-iš u₄-mi-im
5,P509376 obverse:11,it-ti-šu a-na a-ša₃ ri-id-ma,it-ti-šu-nu a-na a-ša₃ ri-id-ma
6,P510527 obverse:4,dutu u₃ dmarduk li-ba-al-li-ṭu₂-ka,dutu u₃ dmarduk li-ba-al-li-ṭu₂-ka=⌈KI⌉
7,P510527 obverse:4,dutu u₃ dmarduk li-ba-al-li-ṭu₂-ka,dutu u₃ dmarduk tu-ba-al-li-ṭu₂-ka
8,P510529 obverse:4,dutu u₃ dmarduk da-ri-iš u₄-mi,dutu u₃ dmarduk da-ri-iš u₄-mi-im
9,P510529 obverse:4,dutu u₃ dmarduk da-ri-iš u₄-mi,dutu u₃ dmarduk da-ri-iš u₄-mi-im
10,P510529 obverse:4,dutu u₃ dmarduk da-ri-iš u₄-mi,dutu u₃ dmarduk da-ri-iš u₄-mi-im


Or even in cuneiform Unicode:

In [10]:
A.table(results, start=1, end=10, fmt="layout-orig-unicode")

n,p,line,line.1
1,P509373 obverse:10,𒀀𒊮 𒀀𒃼 𒈾𒀝𒈠𒈝 𒌷 x x x𒆠,𒀀𒊮 𒀀𒃼 𒈾𒀝𒈠𒈝 𒌷 𒃷 x x𒆠
2,P509374 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 x,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎
3,P509374 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 x,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎
4,P509374 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪 x,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎
5,P509376 obverse:11,𒀉𒋾𒋗 𒀀𒈾 𒀀𒊮 𒊑𒀉𒈠,𒀉𒋾𒋗𒉡 𒀀𒈾 𒀀𒊮 𒊑𒀉𒈠
6,P510527 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒇷𒁀𒀠𒇷𒌅𒅗,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒇷𒁀𒀠𒇷𒌅𒅗=⌈𒆠⌉
7,P510527 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒇷𒁀𒀠𒇷𒌅𒅗,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒌅𒁀𒀠𒇷𒌅𒅗
8,P510529 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎
9,P510529 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎
10,P510529 obverse:4,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪,𒀭𒌓 𒅇 𒀭𒀫𒌓 𒁕𒊑𒅖 𒌓𒈪𒅎


From now on we ignore the exact level of similarity and only ask whether two lines are "similar",
meaning that they are linked by a `sim` edge (so they are at least 90% similar).

# Cluster the lines

Before we look for interesting groups, let's see if we can group the lines into clusters of similar lines.

In [11]:
CLUSTER_THRESHOLD = 0.5


def makeClusters():
    A.indent(reset=True)
    chunkSize = 1000
    b = 0
    j = 0
    clusters = []
    for ln in F.otype.s("line"):
        j += 1
        b += 1
        if b == chunkSize:
            b = 0
            A.info(f"{j:>5} lines and {len(clusters):>5} clusters")
        lSisters = {x[0] for x in E.sim.b(ln)}
        lAdded = False
        for cl in clusters:
            if len(cl & lSisters) > CLUSTER_THRESHOLD * len(cl):
                cl.add(ln)
                lAdded = True
                break
        if not lAdded:
            clusters.append({ln})
    A.info(f"{j} lines and {len(clusters)} clusters")
    return clusters

In [12]:
clusters = makeClusters()

  0.10s  1000 lines and   858 clusters
  0.31s  2000 lines and  1691 clusters
  0.65s  3000 lines and  2509 clusters
  1.12s  4000 lines and  3338 clusters
  1.72s  5000 lines and  4135 clusters
  2.42s  6000 lines and  4885 clusters
  3.21s  7000 lines and  5659 clusters
  4.05s  8000 lines and  6358 clusters
  5.07s  9000 lines and  7125 clusters
  6.23s 10000 lines and  7894 clusters
  7.49s 11000 lines and  8715 clusters
  8.82s 12000 lines and  9450 clusters
    10s 13000 lines and 10166 clusters
    12s 14000 lines and 11011 clusters
    14s 15000 lines and 11774 clusters
    15s 16000 lines and 12592 clusters
    17s 17000 lines and 13219 clusters
    19s 18000 lines and 13893 clusters
    21s 19000 lines and 14637 clusters
    24s 20000 lines and 15380 clusters
    26s 21000 lines and 16095 clusters
    28s 22000 lines and 16799 clusters
    31s 23000 lines and 17505 clusters
    33s 24000 lines and 18235 clusters
    36s 25000 lines and 19005 clusters
    39s 26000 lines and 1
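
The clustering above is greedy: each line joins the first existing cluster in which more than half of the members are its sisters, and otherwise starts a new cluster. The sketch below replays that rule on made-up data so the behaviour can be checked by hand. The names `make_clusters` and `toy_sisters` and the node numbers are hypothetical, and the plain dictionary stands in for `E.sim.b()`.

```python
def make_clusters(nodes, sisters_of, threshold=0.5):
    """Greedy clustering.

    A node joins the first cluster of which more than `threshold` of the
    members are among its sisters; if there is none, it starts a new cluster.
    """
    clusters = []
    for node in nodes:
        sisters = sisters_of.get(node, set())
        for cluster in clusters:
            if len(cluster & sisters) > threshold * len(cluster):
                cluster.add(node)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append({node})
    return clusters


# Made-up similarity data: node -> set of similar nodes.
toy_sisters = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {4},
    6: set(),
}

toy_clusters = make_clusters(sorted(toy_sisters), toy_sisters)
print(len(toy_clusters), "clusters")  # 3 clusters
```

Two properties of this approach are worth keeping in mind. The outcome depends on the order in which the lines are visited, and every line is checked against the clusters found so far, which fits the timings above, where each successive batch of 1000 lines takes longer than the one before.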

What is the distribution of the clusters, in terms of how many similar lines they contain?
We count them.

In [13]:
clusterSizes = collections.Counter()

for cl in clusters:
    clusterSizes[len(cl)] += 1

for (size, amount) in sorted(
    clusterSizes.items(),
    key=lambda x: (-x[0], x[1]),
):
    print(f"clusters of size {size:>4}: {amount:>5}")

clusters of size 1006:     1
clusters of size  129:     1
clusters of size  126:     1
clusters of size  125:     1
clusters of size   84:     1
clusters of size   78:     1
clusters of size   76:     1
clusters of size   74:     1
clusters of size   69:     1
clusters of size   64:     1
clusters of size   56:     1
clusters of size   52:     1
clusters of size   51:     1
clusters of size   49:     1
clusters of size   48:     1
clusters of size   45:     1
clusters of size   44:     1
clusters of size   43:     1
clusters of size   39:     1
clusters of size   35:     1
clusters of size   34:     1
clusters of size   32:     1
clusters of size   30:     3
clusters of size   29:     1
clusters of size   28:     4
clusters of size   27:     2
clusters of size   26:     2
clusters of size   25:     3
clusters of size   24:     1
clusters of size   23:     3
clusters of size   22:     3
clusters of size   20:     4
clusters of size   19:     2
clusters of size   18:     3
clusters of si

# Interesting groups

Let's investigate some interesting groups that lie in a few sweet spots.

* the biggest clusters: more than 31 members
* the medium clusters: between 12 and 30 members
* the small clusters: between 2 and 11 members
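
One way to split a list of clusters into these three groups is sketched below. It runs on made-up clusters, and `size_class` is a helper invented for the illustration; the boundaries are taken literally from the list above, which means that singletons and a cluster of exactly 31 members fall outside all three groups.

```python
import collections


def size_class(size):
    """Map a cluster size to one of the three groups, or None if it fits none."""
    if size > 31:
        return "biggest"
    if 12 <= size <= 30:
        return "medium"
    if 2 <= size <= 11:
        return "small"
    return None  # singletons and size 31 are not covered by the three groups


# Made-up clusters of sizes 40, 15, 3, 1 and 31.
toy_clusters = [set(range(40)), set(range(15)), {1, 2, 3}, {7}, set(range(31))]

groups = collections.defaultdict(list)
for cl in toy_clusters:
    label = size_class(len(cl))
    if label is not None:
        groups[label].append(cl)

for label in ("biggest", "medium", "small"):
    print(f"{label:<7}: {len(groups[label])} clusters")
```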

---

All chapters:

* **[start](start.ipynb)** get started with the corpus
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **similarLines** spot the similarities between lines

---

See the [cookbook](cookbook) for recipes for small, concrete tasks.

CC-BY Dirk Roorda