{ "cells": [{ "cell_type": "markdown", "metadata": {}, "source": ["# Lesson 1: Transformers\n", "\n", "Transformers are just a composable way to serialize the stuff. `unprocess` serializes, `process` parses."] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["testJsonDict = {\"abolish\": [\"patent\", \"copyright\"]}"] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["{\n", "\t\"abolish\": [\n", "\t\t\"patent\",\n", "\t\t\"copyright\"\n", "\t]\n", "}\n"] } ], "source": ["from transformerz.serialization.json import jsonFancySerializer\n", "\n", "print(jsonFancySerializer.unprocess(testJsonDict))"] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["[None]\n"] } ], "source": ["print(jsonFancySerializer.process(\"[null]\"))"] }, { "cell_type": "markdown", "metadata": {}, "source": [" They can be composed using `+` operation. "] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["'{\\n\\t\"abolish\": [\\n\\t\\t\"patent\",\\n\\t\\t\"copyright\"\\n\\t]\\n}'\n"] } ], "source": ["from transformerz.serialization.pon import ponSerializer # JSON <-> JavaScript === PON <-> Python\n", "\n", "print((ponSerializer + jsonFancySerializer).unprocess(testJsonDict)) # Returns a \"PON\" `str` in which a JSON string is serialized"] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["\"{'abolish': ['patent', 'copyright']}\"\n"] } ], "source": ["print((jsonFancySerializer + ponSerializer).unprocess(testJsonDict)) # Returns a JSON `str` in which \"PON\" string is serialized"] }, { "cell_type": "markdown", "metadata": {}, "source": ["But you cannot save strings into files, you need to save bytes into files ..."] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["b'{\\n\\t\"abolish\": [\\n\\t\\t\"patent\",\\n\\t\\t\"copyright\"\\n\\t]\\n}'\n"] } ], "source": ["from transformerz.text import utf8Transformer\n", "\n", "ourTransformer = utf8Transformer + jsonFancySerializer\n", "print(ourTransformer.unprocess(testJsonDict)) # Returns raw bytes of a \"PON\" string is serialized"] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["del testJsonDict"] }, { "cell_type": "markdown", "metadata": {}, "source": ["The data can be compressed. For compression we use the stuff available in my fork of `kaitai.compress` library (I hope it would be merged somewhen). `BinaryProcessor` is an adapter allowing to use the stuff from that lib."] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["b'x\\x9c\\xed\\xc6\\xa1\\r\\x00 \\x0c\\x000\\xcd\\xce\\x98\\xdeG\\x04E\\x82A\\xef\\x7f\\x14_\\xb4\\xaa3F\\x9e\\xde7KDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD~b=s\\x83\\xe3\\x84'\n"] } ], "source": ["from transformerz.compression import BinaryProcessor\n", "from transformerz.kaitai.compress import Zlib\n", "\n", "zlibProcessor = BinaryProcessor(\"zlib\", Zlib()) # You must name your processor!\n", "print((zlibProcessor + ourTransformer).unprocess([\"fuck\"] * 8000)) # Returns ZLib-compressed UTF-8 encoded JSON string"] }, { "cell_type": "markdown", "metadata": {}, "source": ["# Lesson 2: Concepts of cold and hot storage and objects representing them"] }, { "cell_type": "markdown", "metadata": {}, "source": ["There exist a pair of concepts of storage in computer science. Cold storage and hot storage.\n", "\n", "* Cold storage is permanent and costly to access. It is used for long-term storage of data and distribution on physical medium. The examples are HDD, magnetic tape, flash memory, a piece of paper with handwritten data, a DVD, a holograph, anything mentioned within a safe, or even [pieces of glass with dots burned in thew with a laser](https://www.microsoft.com/en-us/research/project/project-silica/) [stored in a abandoned mine in permafrost](https://archiveprogram.github.com/).\n", "* Hot storage may be not permanent, but it must be efficient to access. It is usally RAM.\n", "\n", " We use the term `store` for cold storage and the term `cache` for hot storage. In our case hot storage is usually RAM and requires no explicit serialization (the data are stored the way defined by runtime and compiler), and cold storage is usually HDD/SSD requiring data to be serialized before written and unserialized after being read."] }, { "cell_type": "markdown", "metadata": {}, "source": ["## Defining cold storage\n", "To define the way we are going to STORE data we need to answer to the following \"orthogonal\" questions:\n", "* WHERE are we going to store it? `Saver` object is an answer. `Mapper.saver` answers the question."] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": ["# Here is an example of a saver saving the data into a file. The first argument is a path to the file (without extension!). Second argument is the extension.\n", "from pathlib import Path\n", "from urm.storers.cold import FileSaver\n", "\n", "savedDataRootDir = Path(\"./tests/testSavedDataRootDir\") # a directory where a file will reside\n", "ourSaver = FileSaver(savedDataRootDir, \"json\") # json is the extension of the files used. `ourSaver` will be populated into `saver` property in future"] }, { "cell_type": "markdown", "metadata": {}, "source": ["* WHAT is the mapping between our internal `key`s and `key`s in the storage of this piece of data? A `key` is an answer. `Mapper.key` answers this question."] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["from urm.mappers.key import fieldNameKeyMapper\n", "\n", "keyMapper = fieldNameKeyMapper # We don't need a key mapper currently, since we deal with scalars in this example"] }, { "cell_type": "markdown", "metadata": {}, "source": ["For cold storage we need to answer an additional question:\n", "* HOW are we going to serialize the data? `transformerz.Transformer` object is an answer. `ColdMapper.serializer` answers this question. It is a function, that returns a transformer we will use to serialize the data. It allows you to change the transformer depending on some conditions, some of which can be encoded in other serialized data."] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": ["from transformerz.core import TransformerBase\n", "from urm.ProtoBundle import ProtoBundle\n", "\n", "def constantParamsSerializerMapper(parent: ProtoBundle) -> TransformerBase: # it will be populated into `serializer` property in future\n", " return ourTransformer"] }, { "cell_type": "markdown", "metadata": {}, "source": ["So `ColdMapper` object answers the questions related to storage of values in cold storage. Let's construct it!"] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": ["from urm.mappers import ColdMapper\n", "\n", "ourStorer = ColdMapper(keyMapper, ourSaver, constantParamsSerializerMapper)"] }, { "cell_type": "markdown", "metadata": {}, "source": ["## Defining hot storage\n", "To work with data we need a hot storage."] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["from urm.mappers.key import PrefixKeyMapper, fieldNameKeyMapper\n", "from urm.mappers import HotMapper\n", "from urm.storers.hot import PrefixCacher\n", "\n", "ourCacher = HotMapper(fieldNameKeyMapper, PrefixCacher())"] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": ["# Lesson 3: Fields, strategies and bundles\n", "\n", "A `key` is a `tuple` of strings and numbers. It is an unique identifier of a piece of data. It is decoupled from the actual storage implementation. We address data by keys."] }, { "cell_type": "markdown", "metadata": {}, "source": ["A field strategy is an object of `FieldStrategy` class that describes our pattern of loading/storing data. There are 2:\n", "* \"cold\" one (`ColdStrategy` class): always load data from medium on getting the field value and always store data to medium when the field is assigned with a value.\n", "* \"cached\" one (`CachedStrategy` class): on accesses only alter the data in the hot storage (aka `cache`). Load data to cache from cold storage the first time it is read. Store the data to cold storage when explicitly asked.\n", "\n", "As you see, the most basic strategy is the cold one. The strategy using only hot storage makes completely no sense by itself, to use it you don't need all this framework.\n", "\n", "So, let's get familiar with the cold strategy first.\n", "\n", "## Defining a cold strategy"] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": ["from urm.fields import ColdStrategy\n", "\n", "coldStrategy = ColdStrategy(ourStorer) # Don't do like this in real code, a strategy object must never be reused! We have a better way to set it."] }, { "cell_type": "markdown", "metadata": {}, "source": ["## Defining our class with a property backed by cold storage\n", "\n", "Now we define a class, which properties are backed by cold storage."] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": ["from urm.fields import Field, Field0D, FieldND\n", "from urm.mappers.serializer import JustReturnSerializerMapper\n", "\n", "class A:\n", " __slots__ = ()\n", " scalarField = Field0D(None)\n", " scalarField.strategy = coldStrategy # Don't define the field like this, we have a better option. This way is about how strategies work"] }, { "cell_type": "markdown", "metadata": {}, "source": ["... and test it ..."] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": false }, "outputs": [{ "data": { "text/plain": ["True"] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": ["import json\n", "\n", "a = A()\n", "dataToSave = {\"a\": [\"b\", \"c\"]}\n", "a.scalarField = dataToSave\n", "json.loads((savedDataRootDir / (A.scalarField.strategy.name + \".json\")).read_text()) == dataToSave # the data read from the file by another way must match the value we have saved!"] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{ "data": { "text/plain": ["100500"] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": ["(savedDataRootDir / (A.scalarField.strategy.name + \".json\")).write_text(\"100500\") # we replace the value in the file ...\n", "a.scalarField # ... and the returned value changes"] }, { "cell_type": "markdown", "metadata": {}, "source": ["## Defining a cached strategy\n", "\n", "A cached strategy requires both cold and hot mappers."] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": ["from urm.fields import CachedStrategy\n", "\n", "cachedStrategy = CachedStrategy(ourStorer, ourCacher)"] }, { "cell_type": "markdown", "metadata": {}, "source": ["## Defining our class with a property backed by cached storage\n", "\n", "Now we define a class, which properties are backed by cached storage.\n", "* Such classes must inherit from `ProtoBundle`!\n", "* Hot storage is placed into an attr of the class prefixed with an underscore, so they have to be added into slots."] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": ["class B(ProtoBundle):\n", " __slots__ = (\"_scalarField\",)\n", " scalarField = Field0D(None)\n", " scalarField.strategy = cachedStrategy # Don't define the field like this in production, we have a better option. This way is only to explain how strategies work"] }, { "cell_type": "markdown", "metadata": {}, "source": ["... and test it ..."] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{ "data": { "text/plain": ["True"] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": ["b = B() # since we use the same storer, the data is loaded from the same storage\n", "b.scalarField == a.scalarField"] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{ "data": { "text/plain": ["True"] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": ["dataToSave2 = {\"d\": [\"e\", \"f\"]}\n", "b.scalarField = dataToSave2\n", "b.scalarField == dataToSave2"] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{ "data": { "text/plain": ["False"] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": ["# but the data in cold storage is not automatically updated ...\n", "b.scalarField == dataToSave"] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{ "data": { "text/plain": ["False"] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": ["# We can save the data ....\n", "b.save()\n", "a.scalarField == dataToSave"] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{ "data": { "text/plain": ["True"] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": ["a.scalarField == dataToSave2"] }, { "cell_type": "markdown", "metadata": {}, "source": ["The way above is useful for just understanding how the stuff works. In real code you gonna create the storage-backed fields using the following syntax:"] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{ "data": { "text/plain": ["True"] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": ["class A:\n", " scalarField = Field0D(ourStorer) # uncached\n", "\n", "class B(ProtoBundle):\n", " __slots__ = (\"_scalarField\",)\n", " scalarField = Field0D(ourStorer, ourCacher) # cached\n", "\n", "\n", "a = A()\n", "b = B()\n", "b.scalarField == a.scalarField"] }, { "cell_type": "markdown", "metadata": {}, "source": ["# Lesson 4: Bird-eye picture"] }, { "cell_type": "markdown", "metadata": {}, "source": ["To create a bidirectional mapping between class properties, we need to answer the following questions:\n", "* How does data flows between hot and cold storages? `FieldStrategy` subclasses are the answers.\n", "* How do we STORE the data? `ColdMapper` object is an answer. `FieldStrategy.cold\" answers this question.\n", "* How do we CACHE the data? `HotMapper` object is an answer. `FieldStrategy.hot\" answers this question.\n", "* How do we create our internal `key`s? `Field` subclasses contain the answers."] }, { "cell_type": "markdown", "metadata": {}, "source": ["# Lesson 5: File-backed collections\n", "\n", "For fields containing collections of objects mapped to unrelational entities you need key mappers, mapping keys. For scalars keys always were empty. For collections the keys are provided by the user when indexing."] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["True\n", "False\n", "True\n"] } ], "source": ["from urm.storers.hot import CollectionCacher\n", "\n", "vectorKeyMapper = PrefixKeyMapper()\n", "ourVectorStorer = ColdMapper(vectorKeyMapper, ourSaver, constantParamsSerializerMapper)\n", "ourVectorCacher = HotMapper(vectorKeyMapper, CollectionCacher(dict)) # our hot storage is a dict, but we can plug there any collection\n", "\n", "class C(ProtoBundle):\n", " vectorField = FieldND(ourVectorStorer, ourVectorCacher) # cached\n", "\n", "\n", "c = C()\n", "c.vectorField[\"aaaa\"] = 10\n", "c.vectorField[\"bbbb\"] = {25: 36}\n", "c.vectorField[\"cccc\"] = {\"25\": 36}\n", "c.save()\n", "print((savedDataRootDir / \"aaaa.json\").read_text() == str(c.vectorField[\"aaaa\"]))\n", "print(json.loads((savedDataRootDir / \"bbbb.json\").read_text()) == c.vectorField[\"bbbb\"]) # False, because it is JSON!\n", "print(json.loads((savedDataRootDir / \"cccc.json\").read_text()) == c.vectorField[\"cccc\"])"] }, { "cell_type": "markdown", "metadata": {}, "source": ["# Lesson 6: Controlling paths with dynamic attributes and cache invalidation\n", "\n", "To resolve paths dynamically we have a wrapper object `Dynamic`. It is a path in object hierarchy.\n", "To invalidate cache, set the value to None"] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{ "name": "stdout", "output_type": "stream", "text": ["Wn: hv y brgt??\n", "ptchk: Y nw hv 1\n", "1\n", "0\n"] } ], "source": ["from urm.core import Dynamic\n", "from urm.fields import Field0D, FieldND\n", "from urm.mappers.serializer import JustReturnSerializerMapper\n", "\n", "controlledPathKeyMapper = PrefixKeyMapper(Dynamic(\"name\"))\n", "ourNameControlledStorer = ColdMapper(controlledPathKeyMapper, ourSaver, constantParamsSerializerMapper)\n", "\n", "class Pocket(ProtoBundle):\n", " __slots__ = (\"name\", \"_shit\")\n", " shit = Field0D(ourNameControlledStorer, ourCacher)\n", " def __init__(self, name: str):\n", " self.name = name\n", "\n", "\n", "ptchkPocket = Pocket(\"ptchk\")\n", "ptchkPocket.shit = 2\n", "ptchkPocket.save()\n", "print(\"Wn: hv y brgt??\")\n", "(savedDataRootDir / \"ptchk.json\").write_text(str(json.loads((savedDataRootDir / \"ptchk.json\").read_text()) - 1))\n", "ptchkPocket.shit = None # invalidates cache\n", "print(\"ptchk: Y nw hv\", ptchkPocket.shit)\n", "ptchkPocket.shit -= 1\n", "print(json.loads((savedDataRootDir / \"ptchk.json\").read_text()))\n", "ptchkPocket.save()\n", "print(json.loads((savedDataRootDir / \"ptchk.json\").read_text()))"] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9" } }, "nbformat": 4, "nbformat_minor": 4 }