{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"
\n",
"
\n",
"\n",
"\n",
"# Import BHSA data into Pandas\n",
"\n",
"This notebook contains the Pandas instructions to load the\n",
"[Pandas export](export.ipynb) export of the BHSA.\n",
"\n",
"We then perform some simple information extracting on the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to get the Pandas file\n",
"\n",
"The direct download link is \n",
"[data-2021.pd](https://github.com/ETCBC/bhsa/releases/download/v1.8/data-2021.pd)\n",
"\n",
"The pandas file is over 50 MB, a bit too large for GitHub without large file support.\n",
"So I attached it to the\n",
"[latest release](https://github.com/ETCBC/bhsa/releases/tag/v1.8).\n",
"\n",
"## Reproduction\n",
"\n",
"If you want to do it yourself,\n",
"\n",
"* clone this repo\n",
"* find the [export](export.ipynb) notebook\n",
"* run it in Jupyterlab\n",
"* pick up the newly generated file from the `/pandas` subdirectory."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd # pip3 install pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# File locations\n",
"\n",
"We set up some variables for the location of the Pandas file and a location\n",
"where we will save the full text of this corpus."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"VERSION = \"2021\"\n",
"PANDAS_DIR = os.path.abspath(\"../pandas\")\n",
"TEXT_DIR = os.path.abspath(os.path.expanduser(\"~/Downloads/text\"))\n",
"TABLE_FILE_PD = f\"{PANDAS_DIR}/data-{VERSION}.pd\"\n",
"TABLE_FILE_TXT = f\"{TEXT_DIR}/data-{VERSION}.txt\"\n",
"\n",
"if not os.path.exists(TEXT_DIR):\n",
" os.makedirs(TEXT_DIR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load the dataframe"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done. Size=104171832\n"
]
}
],
"source": [
"frame = pd.read_parquet(TABLE_FILE_PD, engine=\"pyarrow\")\n",
"print(\"Done. Size={}\".format(frame.size))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1446831, 72)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"frame.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
|  | nd | otype | g_cons | g_cons_utf8 | g_lex | g_lex_utf8 | g_word | g_word_utf8 | lex | lex_utf8 | ... | tab | txt | typ | uvf | vbe | vbs | verse | voc_lex | vs | vt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 426591 | book |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 1 | 426630 | chapter |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 2 | 1414389 | verse |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | 1 |  |  |  |
| 3 | 1172308 | sentence |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 4 | 1236025 | sentence_atom |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 5 | 427559 | clause |  |  |  |  |  |  |  |  | ... | <NA> | ? | xQtX |  |  |  | <NA> |  |  |  |
| 6 | 515690 | clause_atom |  |  |  |  |  |  |  |  | ... | <NA> |  | xQtX |  |  |  | <NA> |  |  |  |
| 7 | 606394 | half_verse |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 8 | 651573 | phrase |  |  |  |  |  |  |  |  | ... | <NA> |  | PP |  |  |  | <NA> |  |  |  |
| 9 | 904776 | phrase_atom |  |  |  |  |  |  |  |  | ... | <NA> |  | PP |  |  |  | <NA> |  |  |  |
| 10 | 1437602 | lex |  |  |  |  |  |  | B | ב | ... | <NA> |  |  |  |  |  | <NA> | B.: |  |  |
| 11 | 1 | word | B | ב | B.:- | בְּ | B.:- | בְּ | B | ב | ... | <NA> |  |  | absent | n/a | n/a | <NA> | B.: | NA | NA |
| 12 | 1437603 | lex |  |  |  |  |  |  | R>CJT/ | ראשׁית֜ | ... | <NA> |  |  |  |  |  | <NA> | R;>CIJT |  |  |
| 13 | 2 | word | R>CJT | ראשׁית | R;>CIJT | רֵאשִׁית | R;>CI73JT | רֵאשִׁ֖ית | R>CJT/ | ראשׁית | ... | <NA> |  |  | absent | n/a | n/a | <NA> | R;>CIJT | NA | NA |
| 14 | 1437604 | lex |  |  |  |  |  |  | BR>[ | ברא | ... | <NA> |  |  |  |  |  | <NA> | BR> |  |  |
| 15 | 651574 | phrase |  |  |  |  |  |  |  |  | ... | <NA> |  | VP |  |  |  | <NA> |  |  |  |
| 16 | 904777 | phrase_atom |  |  |  |  |  |  |  |  | ... | <NA> |  | VP |  |  |  | <NA> |  |  |  |
| 17 | 3 | word | BR> | ברא | B.@R@> | בָּרָא | B.@R@74> | בָּרָ֣א | BR>[ | ברא | ... | <NA> |  |  | absent |  | absent | <NA> | BR> | qal | perf |
| 18 | 1437605 | lex |  |  |  |  |  |  | >LHJM/ | אלהים֜ | ... | <NA> |  |  |  |  |  | <NA> | >:ELOHIJM |  |  |
| 19 | 651575 | phrase |  |  |  |  |  |  |  |  | ... | <NA> |  | NP |  |  |  | <NA> |  |  |  |
| 20 | 904778 | phrase_atom |  |  |  |  |  |  |  |  | ... | <NA> |  | NP |  |  |  | <NA> |  |  |  |
| 21 | 4 | word | >LHJM | אלהים | >:ELOH | אֱלֹה | >:ELOHI92JM | אֱלֹהִ֑ים | >LHJM/ | אלהים | ... | <NA> |  |  | absent | n/a | n/a | <NA> | >:ELOHIJM | NA | NA |
| 22 | 606395 | half_verse |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 23 | 651576 | phrase |  |  |  |  |  |  |  |  | ... | <NA> |  | PP |  |  |  | <NA> |  |  |  |
| 24 | 904779 | phrase_atom |  |  |  |  |  |  |  |  | ... | <NA> |  | PP |  |  |  | <NA> |  |  |  |
| 25 | 1300539 | subphrase |  |  |  |  |  |  |  |  | ... | <NA> |  |  |  |  |  | <NA> |  |  |  |
| 26 | 1437606 | lex |  |  |  |  |  |  | >T | את | ... | <NA> |  |  |  |  |  | <NA> | >;T |  |  |
| 27 | 5 | word | >T | את | >;T | אֵת | >;71T | אֵ֥ת | >T | את | ... | <NA> |  |  | absent | n/a | n/a | <NA> | >;T | NA | NA |
| 28 | 1437607 | lex |  |  |  |  |  |  | H | ה | ... | <NA> |  |  |  |  |  | <NA> | HA |  |  |
| 29 | 6 | word | H | ה | HA- | הַ | HA- | הַ | H | ה | ... | <NA> |  |  | absent | n/a | n/a | <NA> | HA | NA | NA |

30 rows × 72 columns
\n", "