# ETCBC in R

This notebook exports the ETCBC database to an R data frame.
The nodes, which correspond to the objects, are exported as rows.
The edges corresponding to the ETCBC features *mother*, *functional_parent*, and *distributional_parent* are
exported as columns: for each row, such a column holds the target of the corresponding outgoing edge.
In the ETCBC data, an object has at most one outgoing edge of each type.

Extra data such as lexicon, phonetic transcription, and ketiv-qere is also included.
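
The row-and-column layout described above can be illustrated with a toy example. This is a minimal sketch, not the notebook's export code: the node ids, feature values, and edge targets are invented, and the names `nodes`, `edges`, `columns`, and `rows` are hypothetical. It relies only on the stated property that a node has at most one outgoing edge per edge type, so each edge type fits in a single column.

```python
# Toy illustration only: all ids, values and edges below are invented
# and do not come from the ETCBC data.
nodes = {
    1: {"otype": "word", "lex": "a"},
    2: {"otype": "word", "lex": "b"},
    3: {"otype": "phrase", "lex": None},
}
# Each edge type maps a source node to its single target node.
edges = {
    "mother": {3: 1},
    "functional_parent": {1: 3, 2: 3},
}

columns = ["node", "otype", "lex"] + sorted(edges)
rows = []
for node, features in sorted(nodes.items()):
    row = [str(node), features["otype"] or "", features["lex"] or ""]
    # An edge column holds the target node, or stays empty without an edge.
    for edge_type in sorted(edges):
        target = edges[edge_type].get(node)
        row.append("" if target is None else str(target))
    rows.append(row)

print("\t".join(columns))
for row in rows:
    print("\t".join(row))
```

Because every edge type contributes exactly one cell per row, the table stays rectangular regardless of how many nodes actually have such an edge.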

In [1]:
import sys, collections
%load_ext rpy2.ipython

from laf.fabric import LafFabric
import etcbc

fabric = LafFabric()

 0.00s This is LAF-Fabric 4.5.6
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
API = fabric.load('etcbc4b', 'lexicon', 'hinr', {
 "xmlids": {"node": False, "edge": False},
 "features": ('''
''',""),
 "primary": False,
})

API['F_all']

 0.00s LOADING API: please wait ... 
 0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
 0.00s INFO: USING DATA COMPILED AT: 2015-11-03T06-44-21
 1.07s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/hinr/__log__hinr.txt
 1.07s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK hinr AT 2016-01-12T15-50-18


[('etcbc4',
 ['db.maxmonad',
 'db.minmonad',
 'db.monads',
 'db.oid',
 'db.otype',
 'ft.code',
 'ft.det',
 'ft.dist',
 'ft.dist_unit',
 'ft.domain',
 'ft.function',
 'ft.g_cons',
 'ft.g_cons_utf8',
 'ft.g_lex',
 'ft.g_lex_utf8',
 'ft.g_nme',
 'ft.g_nme_utf8',
 'ft.g_pfm',
 'ft.g_pfm_utf8',
 'ft.g_prs',
 'ft.g_prs_utf8',
 'ft.g_uvf',
 'ft.g_uvf_utf8',
 'ft.g_vbe',
 'ft.g_vbe_utf8',
 'ft.g_vbs',
 'ft.g_vbs_utf8',
 'ft.g_word',
 'ft.g_word_utf8',
 'ft.gn',
 'ft.is_root',
 'ft.kind',
 'ft.language',
 'ft.lex',
 'ft.lex_utf8',
 'ft.ls',
 'ft.mother_object_type',
 'ft.nme',
 'ft.nu',
 'ft.number',
 'ft.pdp',
 'ft.pfm',
 'ft.prs',
 'ft.ps',
 'ft.rela',
 'ft.sp',
 'ft.st',
 'ft.tab',
 'ft.trailer_utf8',
 'ft.txt',
 'ft.typ',
 'ft.uvf',
 'ft.vbe',
 'ft.vbs',
 'ft.vs',
 'ft.vt',
 'kq.g_qere_utf8',
 'kq.qtrailer_utf8',
 'lex.entry',
 'lex.entry_heb',
 'lex.entryid',
 'lex.g_entry',
 'lex.g_entry_heb',
 'lex.gloss',
 'lex.id',
 'lex.lan',
 'lex.nametype',
 'lex.pos',
 'lex.root',
 'lex.subpos',
 '

In [3]:
all_features = [x.split('.', 1)[1] for x in API['F_all'][0][1]]
all_feature_str = ' '.join(all_features)

API = fabric.load_again({
 "xmlids": {"node": False, "edge": False},
 "features": (all_feature_str,""),
 "primary": False,
})
exec(fabric.localnames.format(var='fabric'))

 0.00s LOADING API: please wait ... 
 0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
 0.00s INFO: USING DATA COMPILED AT: 2015-11-03T06-44-21
 17s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK hinr AT 2016-01-12T15-50-51


In [14]:
hr = open('/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.txt', 'w')
use_features = all_features
# use_features = ['oid'] + all_features[60:77]
# header line: one column per feature
hr.write('{}\n'.format('\t'.join(use_features)))
chunk_size = 100000
i = 0  # total number of nodes written
s = 0  # nodes written since the last progress message
for n in NN():
    all_values = [F.item[x].v(n) for x in use_features]
    # missing values become empty cells; newlines are stripped to keep one row per node
    hr.write('{}\n'.format(('\t'.join(x or '' for x in all_values)).replace('\n','')))
    i += 1
    s += 1
    if s == chunk_size:
        s = 0
        msg('{:>7} nodes written'.format(i))
hr.close()
msg('{:>7} nodes written and done'.format(i))

56m 22s 100000 nodes written
56m 29s 200000 nodes written
56m 36s 300000 nodes written
56m 43s 400000 nodes written
56m 50s 500000 nodes written
56m 57s 600000 nodes written
57m 04s 700000 nodes written
57m 10s 800000 nodes written
57m 18s 900000 nodes written
57m 26s 1000000 nodes written
57m 32s 1100000 nodes written
57m 40s 1200000 nodes written
57m 47s 1300000 nodes written
57m 54s 1400000 nodes written
57m 57s 1436858 nodes written and done
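
The export loop above strips newlines from each row but leaves tab characters inside values untouched. Whether any ETCBC feature value actually contains a tab is not known here, so the following is only a hedged sketch of a stricter cell cleaner. The helpers `clean_field` and `make_row` are hypothetical, the sample values are invented, and unlike the loop above it replaces line breaks with a space rather than removing them.

```python
def clean_field(value):
    """Return text that is safe to put in a single tab-separated cell."""
    if value is None:
        return ''
    # Tabs and line breaks would split a cell or a row, so replace them.
    return str(value).replace('\t', ' ').replace('\r', ' ').replace('\n', ' ')


def make_row(values):
    """Join cleaned values into one line of the tab-separated file."""
    return '\t'.join(clean_field(v) for v in values)


# Invented values, not taken from the ETCBC data.
header = ['oid', 'otype', 'gloss']
row = make_row(['42', None, 'in\tthe\nbeginning'])
print(repr(row))
```

With such a helper, every written row has exactly as many cells as the header, which is what `read.table` needs in order to build a rectangular data frame.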


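Now read the data in R and save it in the compact .rds format.

Note that quotes and comment signs must be read as ordinary characters: by default `read.table` treats `"` and `'` as quote characters and `#` as the start of a comment, so both are switched off with `quote=""` and `comment.char=""`.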

In [15]:
%%R
etcbc = read.table(
    '/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.txt',
    sep="\t",
    header=TRUE,
    comment.char="",
    quote="",
    as.is=TRUE
)
dim(etcbc)

[1] 1436858 76
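
The literal treatment of quotes and comment signs matters for any reader of the exported file, not only for R. As an analogous sketch on the Python side (not part of the original notebook, and using an invented sample instead of the real file), the standard `csv` module can read the same tab-separated layout with quote handling switched off through `csv.QUOTE_NONE`. The `csv` module has no comment character, so `#` is always kept as text.

```python
import csv
import io

# Invented sample in the same layout as the export: a header line and
# one data line whose cells contain a quote and a hash sign.
sample = 'lex\tgloss\tnote\n"B\tin\t#1\n'

# QUOTE_NONE tells the reader to treat quote characters as ordinary text.
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
header, *rows = list(reader)

print(header)
print(rows)
```

The quote at the start of the first cell and the hash sign in the last cell both survive as plain cell content, which mirrors what `quote=""` and `comment.char=""` achieve in the R cell.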


In [16]:
%%R
saveRDS(
 object=etcbc, 
 file='/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.rds'
)

In [17]:
!ls -lh /Users/dirk/SURFdrive/laf-fabric-data/r

total 565792
-rw-r--r-- 1 dirk staff 43M Jan 12 17:53 etcbc4b.rds
-rw-r--r-- 1 dirk staff 233M Jan 12 17:48 etcbc4b.txt


Now check how fast the .rds file loads. It takes roughly half the time needed for the text file.

In [18]:
%%R
etcbc = readRDS(
 file='/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.rds'
)
dim(etcbc)

[1] 1436858 76


Copy it to the GitHub directory.

In [19]:
!cp '/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.rds' '/Users/dirk/SURFdrive/current/demos/github/laf-fabric-data/etcbc4b.rds'

The result is in the [laf-fabric-data GitHub repository](https://github.com/ETCBC/laf-fabric-data).