# Building a citable text corpus from OCRE

This notebook shows how to load OCRE (Online Coins of the Roman Empire) data from a CEX file over the internet and build a corpus of texts citable by CTS URN. It uses version `1.7.0` of the `nomisma` library.


## Configure Jupyter notebook

First, configure the Jupyter notebook. In addition to the `nomisma` library, we'll need the `xcite` and `ohco2` libraries from the CITE architecture.

In [None]:
// 1. Add the Maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

In [None]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:1.7.0`
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`

## Load the full OCRE data set

In [None]:
import edu.holycross.shot.nomisma._
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
require(ocre.size > 50000) 

## TL;DR

You can build an OHCO2 corpus with the `corpus` function.

In [None]:
import edu.holycross.shot.ohco2._
import edu.holycross.shot.cite._

val corpus: Corpus = ocre.corpus
println("Citable nodes of text in corpus: " + corpus.size)

## How it works for individual issues

The `OcreIssue` class includes a `textNodes` function that creates a Vector of 0-2 `CitableNode`s: one for each legend the issue has, so an issue with both an obverse and a reverse legend yields two text nodes. Let's examine the CTS URNs of an issue that has both legends.

In [None]:
val issueId = "3.com.43"
val randomIssue = ocre.issue(issueId).get

println("In issue " + issueId + ", made " + randomIssue.textNodes.size + " text nodes")

for (n <- randomIssue.textNodes) {
  println("\nReference: " + n.urn)
  println("Text content: " + n.text)
}


Let's parse the components of one of these URNs.

It belongs to the CTS namespace `hcnum` and the text group `issues`.

Within that group, its document identifier is `ric`, and the specific version identifier is `raw`. When we process the corpus (e.g., to generate a fully expanded version of abbreviated terms), we will use a different version identifier, but the rest of the URN will be the same.

The passage component is directly adapted from the nomisma.org identifier: `3.com.43` identifies RIC volume 3, Commodus, issue 43. The final piece of the passage component distinguishes obverse text from reverse text.
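
The breakdown above can be sketched with plain string handling. This is an illustration only, not the `CtsUrn` API, and the trailing `obv` label in the sample string is an assumed value: compare it with the references printed by the previous cell.

```scala
// Illustration only: plain string splitting, not the xcite `CtsUrn` API.
// ASSUMPTION: the final "obv" label is a placeholder for this sketch.
val urnString = "urn:cts:hcnum:issues.ric.raw:3.com.43.obv"

// A CTS URN with a passage has five colon-delimited parts:
// "urn", "cts", namespace, work hierarchy, passage hierarchy.
val parts = urnString.split(":").toVector
require(parts.size == 5, "Expected both a work and a passage component")

val namespace = parts(2)
val workParts = parts(3).split("\\.").toVector     // text group, document, version
val passageParts = parts(4).split("\\.").toVector  // issue identifier plus side label

println("Namespace:  " + namespace)
println("Text group: " + workParts(0))
println("Document:   " + workParts(1))
println("Version:    " + workParts(2))
println("Issue ID:   " + passageParts.dropRight(1).mkString("."))
println("Side:       " + passageParts.last)
```

Only the version part of the work hierarchy would change in a processed edition; the passage component stays the same.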

## How it works: building a corpus

The `corpus` function in `Ocre` creates 0-2 `CitableNode`s for each issue and compiles them into a text `Corpus`. 
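
As a rough model of that process, the sketch below uses hypothetical stand-in types rather than the library's own classes. `SketchIssue`, the placeholder identifiers and legends, and the `obv`/`rev` suffixes are all assumptions made for illustration; the point is only that each issue contributes zero, one, or two nodes, which are then flattened into a single sequence.

```scala
// Hypothetical stand-in for this sketch; this is NOT the library's `OcreIssue`.
case class SketchIssue(id: String, obvLegend: Option[String], revLegend: Option[String]) {
  // One (reference, text) pair per legend that is present: 0, 1, or 2 in total.
  def textNodes: Vector[(String, String)] = {
    val base = "urn:cts:hcnum:issues.ric.raw:" + id
    Vector(
      obvLegend.map(txt => (base + ".obv", txt)), // ASSUMED suffix
      revLegend.map(txt => (base + ".rev", txt))  // ASSUMED suffix
    ).flatten
  }
}

// Placeholder data, not real OCRE records.
val sketchIssues = Vector(
  SketchIssue("x.1", Some("LEGEND A"), Some("LEGEND B")),
  SketchIssue("x.2", Some("LEGEND C"), None),
  SketchIssue("x.3", None, None)
)

// flatMap concatenates the per-issue vectors into one flat sequence of nodes.
val sketchNodes = sketchIssues.flatMap(_.textNodes)
println("Nodes in sketch corpus: " + sketchNodes.size)
```

Issues without any legend simply contribute nothing, which is why the number of citable nodes in the real corpus need not be twice the number of issues.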

As in any CTS environment, we can then select texts identified at any level of the passage and work hierarchies.
 

In [None]:
val commodus43 = corpus.nodes.filter(_.urn <= CtsUrn("urn:cts:hcnum:issues.ric.raw:3.com.43"))
println("**OBV** " + commodus43.map(_.text).mkString(" **REV** "))


In [None]:
val allCommodus = corpus.nodes.filter(_.urn <= CtsUrn("urn:cts:hcnum:issues.ric.raw:3.com"))
println("All legends on coins of Commodus: " + allCommodus.size)


In [None]:
val allRIC3 = corpus.nodes.filter(_.urn <= CtsUrn("urn:cts:hcnum:issues.ric.raw:3"))
println("All legends in RIC 3: " + allRIC3.size)
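
The three filters above differ only in how much of the passage hierarchy the URN specifies. A simplified model of that containment test, written with plain strings, might look like the following. The helper `passageContainedBy` is hypothetical and only approximates what the `<=` comparison on `CtsUrn` does, and the `obv` label in the sample passages is again an assumed value.

```scala
// Simplified, hypothetical model of passage containment using plain strings;
// in the notebook the real comparison is made by `<=` on `CtsUrn`.
def passageContainedBy(passage: String, container: String): Boolean = {
  val p = passage.split("\\.").toVector
  val c = container.split("\\.").toVector
  // Compare whole dot-separated components, so "3" does not match "30.x.1".
  p.startsWith(c)
}

println(passageContainedBy("3.com.43.obv", "3.com.43")) // a single issue
println(passageContainedBy("3.com.43.obv", "3.com"))    // all of Commodus
println(passageContainedBy("3.com.43.obv", "3"))        // all of RIC 3
println(passageContainedBy("30.x.1.obv", "3"))          // a different volume
```

Comparing whole components rather than raw string prefixes is the design choice that matters here: a plain `String.startsWith` would wrongly treat volume `30` as part of volume `3`.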