{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Apache Avro as serialization of TF data\n", "\n", "[Apache Avro](https://avro.apache.org/docs/current/index.html)\n", "is a data serialization format used in Big Data use cases.\n", "\n", "```\n", "pip3 install avro-python3\n", "```\n", "\n", "It looks attractive from the specifactions, but how will it perform?\n", "\n", "As a simple test, we take the feature data for `g_word_utf8`.\n", "It is a map from the numbers 1 to 426584 to Hebrew word occurrences (Unicode strings).\n", "\n", "In Text-Fabric we have a representation in plain text and a compressed, pickled representation.\n", "\n", "# Outcome\n", "\n", "Text-Fabric is much faster in loading this kind of data.\n", "\n", "The size of the Avro binary serialization is much bigger than the TF text representation.\n", "\n", "The sizes of the gzipped Avro serialization and the gzipped, pickled TF serialization are approximately equal.\n", "\n", "# Detailed comparison\n", "\n", "name | kind | size | load time\n", ":--- | :--- | ---: | ---:\n", "g_word_utf8.tf | tf: plain unicode text | 5.4 MB | 1.6 s\n", "g_word_utf8.tfx | tf: gzipped binary |3.2 MB | 0.2 s\n", "g_word_utf8.avdt | avro: binary | 8.3 MB | 3.2 s\n", "g_word_utf8.avdt.gz | avro: gzipped binary | 3.2 MB | 4.7 s\n", "\n", "# Conclusion\n", "\n", "**We do not see reasons to replace the TF feature data serialization by Avro.**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-05-09T13:52:56.726798Z", "start_time": "2018-05-09T13:52:56.660991Z" } }, "outputs": [], "source": [ "import os\n", "import gzip\n", "import avro\n", "import avro.schema\n", "from avro.datafile import DataFileReader, DataFileWriter\n", "from avro.io import DatumReader, DatumWriter\n", "\n", "from tf.fabric import Fabric\n", "\n", "GZIP_LEVEL = 2 # same as used in Text-Fabric" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load from the textual data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-09T13:53:03.953876Z", "start_time": "2018-05-09T13:52:57.594126Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 5.5.22\n", "Api reference : https://annotation.github.io/text-fabric/Api/Fabric/\n", "Tutorial : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb\n", "Example data : https://github.com/annotation/text-fabric-data\n", "\n", "117 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 1.60s T g_word_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " 6.02s All features loaded/computed - for details use loadLog()\n" ] } ], "source": [ "VERSION = 'c'\n", "BHSA = f'BHSA/tf/{VERSION}'\n", "PARA = f'parallels/tf/{VERSION}'\n", "\n", "TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])\n", "api = TF.load('')\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The load time is **~ 1.6 seconds**.\n", "\n", "But during this time, the textual data has been compiled and written to a binary form.\n", "Let's load again.\n", "\n", "## Load from binary data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 5.5.22\n", "Api reference : https://annotation.github.io/text-fabric/Api/Fabric/\n", "Tutorial : 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Load from the textual data" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-05-09T13:53:03.953876Z", "start_time": "2018-05-09T13:52:57.594126Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 5.5.22\n", "Api reference : https://annotation.github.io/text-fabric/Api/Fabric/\n", "Tutorial : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb\n", "Example data : https://github.com/annotation/text-fabric-data\n", "\n", "117 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 1.60s T g_word_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " 6.02s All features loaded/computed - for details use loadLog()\n" ] } ], "source": [ "VERSION = 'c'\n", "BHSA = f'BHSA/tf/{VERSION}'\n", "PARA = f'parallels/tf/{VERSION}'\n", "\n", "TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])\n", "api = TF.load('')\n", "api.makeAvailableIn(globals())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The load time is **~ 1.6 seconds**.\n", "\n", "During this load, the textual data has also been compiled and written to binary form.\n", "Let's load again.\n", "\n", "## Load from binary data" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is Text-Fabric 5.5.22\n", "Api reference : https://annotation.github.io/text-fabric/Api/Fabric/\n", "Tutorial : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb\n", "Example data : https://github.com/annotation/text-fabric-data\n", "\n", "117 features found and 0 ignored\n", " 0.00s loading features ...\n", " 5.28s All features loaded/computed - for details use loadLog()\n" ] } ], "source": [ "TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])\n", "api = TF.load('')\n", "api.makeAvailableIn(globals())" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " | 0.03s B otype from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.55s B oslots from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.01s B book from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.01s B chapter from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B verse from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.18s B g_cons from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.27s B g_cons_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.18s B g_lex from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.25s B g_lex_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.17s B g_word from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.21s B g_word_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.12s B lex0 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.18s B lex_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B qere from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B qere_trailer from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B qere_trailer_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B qere_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.07s B trailer from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.08s B trailer_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B __levels__ from otype, oslots, otext\n", " | 0.03s B __order__ from otype, oslots, __levels__\n", " | 0.03s B __rank__ from otype, __order__\n", " | 1.36s B __levUp__ from otype, oslots, __rank__\n", " | 1.12s B __levDown__ from otype, __levUp__, __rank__\n", " | 0.39s B __boundary__ from otype, oslots, __rank__\n", " | 0.01s B __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse\n", " | 0.00s B book@am from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@ar from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@bn from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@da from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@de from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@el from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@en from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@es from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@fa from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@fr from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@he from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@hi from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@id from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@ja from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@ko from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@la from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@nl from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@pa from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@pt from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@ru from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@sw from 
/Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@syc from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@tr from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@ur from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@yo from /Users/dirk/github/etcbc/BHSA/tf/c\n", " | 0.00s B book@zh from /Users/dirk/github/etcbc/BHSA/tf/c\n" ] } ], "source": [ "loadLog()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The load time of the feature `g_word_utf8` is **~ 0.2 seconds**." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Make an Avro feature data file" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "tempDir = os.path.expanduser('~/github/annotation/text-fabric/_temp/avro')\n", "os.makedirs(tempDir, exist_ok=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The data is a map from *integers* to strings, but Avro map keys must be strings.\n", "So we have to represent the integer keys as strings." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "426584\n", "426584\n", "רֵאשִׁ֖ית\n" ] } ], "source": [ "feature = 'g_word_utf8'\n", "tfData = TF.features[feature].data\n", "print(len(tfData))\n", "#data = {str(i): w for (i,w) in tfData.items() if i < 12}\n", "data = {str(i): w for (i,w) in tfData.items()}\n", "print(len(data))\n", "print(data['2'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We define a record schema, where each record consists of the name of a feature and its data.\n", "We will put only one record in each file, because we want to load features individually." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "schemaJSON = '''\n", "{\n", "  \"name\": \"tffeature\",\n", "  \"type\": \"record\",\n", "  \"fields\": [\n", "    {\n", "      \"name\": \"name\",\n", "      \"type\": \"string\"\n", "    },\n", "    { \"name\": \"data\",\n", "      \"type\": {\n", "        \"type\": \"map\",\n", "        \"values\": \"string\"\n", "      }\n", "    }\n", "  ]\n", "}\n", "'''\n", "schemaFile = 'tf.avsc'\n", "schemaPath = f'{tempDir}/{schemaFile}'\n", "with open(schemaPath, 'w') as sf:\n", "    sf.write(schemaJSON)" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "with open(schemaPath, 'rb') as sh:\n", "    schema = avro.schema.Parse(sh.read())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We write the feature data to an Avro data file." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "dataFile = f'{tempDir}/{feature}.avdt'\n", "writer = DataFileWriter(open(dataFile, \"wb\"), DatumWriter(), schema)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s start writing\n", " 2.55s done\n" ] } ], "source": [ "indent(reset=True)\n", "info('start writing')\n", "writer.append({\"name\": feature, \"data\": data})\n", "writer.close()\n", "info('done')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We also make a gzipped data file." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "dataFileZ = f'{dataFile}.gz'\n", "with open(dataFile, 'rb') as ah:\n", "    avroData = ah.read()\n", "with gzip.open(dataFileZ, 'wb', compresslevel=GZIP_LEVEL) as ah:\n", "    ah.write(avroData)" ] },
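{ "cell_type": "markdown", "metadata": {}, "source": [ "Both Avro files are now on disk.\n", "As a sanity check on the sizes in the comparison table above, we can measure them directly — a minimal sketch, using only the `dataFile` and `dataFileZ` defined above:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# report the on-disk sizes of the plain and the gzipped Avro file\n", "for path in (dataFile, dataFileZ):\n", "    size = os.path.getsize(path)\n", "    print(f'{os.path.basename(path)}: {size / (1 << 20):.1f} MB')" ] },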
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "dataFileZ = f'{dataFile}.gz'\n", "with open(dataFile, 'rb') as ah:\n", " avroData = ah.read()\n", "with gzip.open(dataFileZ, 'wb', compresslevel=GZIP_LEVEL) as ah:\n", " ah.write(avroData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load from avro binary" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s start reading\n", "g_word_utf8 - רֵאשִׁ֖ית\n", " 3.24s done\n" ] } ], "source": [ "indent(reset=True)\n", "info('start reading')\n", "reader = DataFileReader(open(dataFile, \"rb\"), DatumReader())\n", "for x in reader:\n", " print(f'{x[\"name\"]} - {x[\"data\"][\"2\"]}')\n", "reader.close()\n", "info('done')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load time **~ 3.2 seconds**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load from avro binary (gzipped)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s start reading\n", "g_word_utf8 - רֵאשִׁ֖ית\n", " 4.72s done\n" ] } ], "source": [ "indent(reset=True)\n", "info('start reading')\n", "reader = DataFileReader(gzip.open(dataFileZ, \"rb\"), DatumReader())\n", "for x in reader:\n", " print(f'{x[\"name\"]} - {x[\"data\"][\"2\"]}')\n", "reader.close()\n", "info('done')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }