{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "![](files/pycon-uk-images/intel.jpg)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Actions : Events\n", "============\n", "\n", "* Search\n", "* Clicks\n", "* Visits\n", "* Pageviews\n", "* Interactions\n", "* Product Search\n", "* Product Detail" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "![](files/pycon-uk-images/lambda_new.jpg)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

First Approach - Log everything ?

" ] }, { "cell_type": "code", "collapsed": false, "input": [ "'''\n", "1,0,3,\"Braund, Mr. Owen Harris\",male,22,1,0,A/5 21171,7.25,,S\n", "2,1,1,\"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\",female,38,1,0,PC 17599,71.2833,C85,C\n", "'''" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Challenges :\n", "=============\n", "\n", "* Log formats are all over the place.\n", "* No logging guidelines\n", "* No proper categorization / Taxonomies\n", "* Not at all flexible\n", "* Scaling becomes a challenge\n", "* Single line and multi-line logs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Second Approach : Simply use JSON for logging ?

" ] }, { "cell_type": "code", "collapsed": false, "input": [ "'''\n", "{\n", " \"Fare\": 71.2833,\n", " \"Name\": \"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\",\n", " \"Embarked\": \"C\",\n", " \"Age\": 38,\n", " \"Parch\": 0,\n", " \"Pclass\": 1,\n", " \"Sex\": \"female\",\n", " \"Survived\": 1,\n", " \"SibSp\": 1,\n", " \"PassengerId\": 2,\n", " \"Ticket\": \"PC 17599\",\n", " \"Cabin\": \"C85\"\n", "}\n", "'''" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Challenges :

\n", "\n", "

\"When starting a data analysis project, most developers don\u2019t think about how they\u2019ll be able to handle gradual application upgrades through a large organization\"

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Challenges :

\n", "\n", "

\"Schema changes happen frequently, and often without warning.\"

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Challenges :

\n", "\n", "

\"This makes it harder to independently upgrade the applications that are writing the data and the applications reading the data.\"

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Other side-effects:\n", "----------\n", "\n", "* \"Flexible\" hence anyone can dump anything.\n", "* Suffers from lack of information.\n", "* KeyNames are not very informative.\n", " * Same keyname might be used for different types of values.\n", " * Same type of values but different Keynames.\n", "* Need of compression to save disk space, network-bandwidth, optimizing map-reduce jobs.\n", "* Cannot evolve, adding or removing keys may result in breaking things.\n", "* Does not protect against data inconsistency.\n", "* Validation of data as soon as it enters the pipeline.\n", "* Binary representation is hard." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Third Approach : Using Schema

\n", "\n", "* Structural integrity\n", "* Apply Constraints\n", "* Protection against corruption.\n", " * If there is no schema, data corruptions go ignored until data is read.\n", " * It is better to throw an exception where the mistake is made.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

What is Avro ?

\n", "\n", "* **Avro** is a data serialization format.\n", "* **Schemas** written in human-friendly format aka **JSONs**\n", "* Schemas with **Rich** data structures\n", "\t* Primitive\n", "\t* Complex\n", "* Fast, compact serialization.\n", "* **Compressible** file formats\n", "\t* Deflate\n", "\t* Snappy\n", "* Support for many programming languages.\n", "* **Schema evolution** : Get rid of Version Management Baggage\n", "* **Schema resolutions** : Making data both forward and backward compatible.\n", "* Supports both **unstructured** and **semi-structured** data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# AVRO SCHEMA\n", "{\n", " \"namespace\": \"avro.example.titanic\",\n", " \"type\": \"record\",\n", " \"name\": \"demo\",\n", " \"doc\": \"Titanic data set\",\n", " \"fields\": [\n", " {\"name\": \"PassengerId\", \"type\": \"int\", \"doc\": \"Passenger ID\"},\n", " {\"name\": \"Survived\", \"type\": \"int\", \"doc\": \"Survival (0 = No; 1 = Yes)\"},\n", " {\"name\": \"Pclass\", \"type\": \"int\", \"doc\": \"Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)\"},\n", " {\"name\": \"Name\", \"type\": \"string\", \"doc\": \"Name\"},\n", " {\"name\": \"Sex\", \"type\": \"string\", \"doc\": \"Sex\"},\n", " {\"name\": \"Age\", \"type\": \"int\", \"doc\": \"Age\"},\n", " {\"name\": \"SibSp\", \"type\": \"int\", \"doc\": \"Number of Siblings/Spouses Aboard\"},\n", " {\"name\": \"Parch\", \"type\": \"int\", \"doc\": \"Number of Parents/Children Aboard\"},\n", " {\"name\": \"Ticket\", \"type\": \"string\", \"doc\": \"Ticket Number\"},\n", " {\"name\": \"Fare\", \"type\": \"float\", \"doc\": \"Passenger Fare\"},\n", " {\"name\": \"Cabin\", \"type\": \"string\", \"doc\": \"Cabin\"},\n", " {\"name\": \"Embarked\", {\"type\":\"enum\",\"symbols\":[\"C\",\"S\",\"Q\"], \"doc\": \"Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)\"}\n", " ]\n", "}" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import avro.schema\n", "import io, random\n", "from avro.io import DatumWriter, BinaryEncoder\n", "from avro.datafile import DataFileReader, DataFileWriter\n", "\n", "# Path to user.avsc avro schema\n", "schema_path=\"schemas/titanic.avsc\"\n", "schema = avro.schema.parse(open(schema_path).read())\n", "writer = DataFileWriter(open(\"avroFiles/titanic.avro\",'w'), DatumWriter(), schema, codec=\"null\")\n", "\n", "\n", "writer.append({\"Fare\": 71.2833, \"Name\": \"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\", \"Embarked\": \"C\", \"Age\": 38, \"Parch\": 0, \"Pclass\": 1, \"Sex\": \"female\", \"Survived\": 1, \"SibSp\": 1, \"PassengerId\": 2, \"Ticket\": \"PC 17599\", \"Cabin\": \"C85\"})\n", "writer.append({\"Fare\": 7.925, \"Name\": \"Heikkinen, Miss. Laina\", \"Embarked\": \"S\", \"Age\": 26, \"Parch\": 0, \"Pclass\": 3, \"Sex\": \"female\", \"Survived\": 1, \"SibSp\": 0, \"PassengerId\": 3, \"Ticket\": \"STON/O2. 3101282\", \"Cabin\": \"\"})\n", "\n", "writer.close()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Schema Evolution

\n", "\n", "
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Schema Version1\n", "{\n", " \"type\": \"record\",\n", " \"name\": \"Employee\",\n", " \"fields\": [\n", " {\"name\": \"name\", \"type\": \"string\"},\n", " {\"name\": \"age\", \"type\": \"int\"},\n", " {\"name\": \"emails\", \"type\": {\"type\": \"array\", \"items\": \"string\"}},\n", " {\"name\": \"boss\", \"type\": [\"Employee\",\"null\"]}\n", " ]\n", "}\n", "\n", "\n", "# Schema Version2\n", "{\n", " \"type\": \"record\",\n", " \"name\": \"Employee\",\n", " \"fields\": [\n", " {\"name\": \"name\", \"type\": \"string\"},\n", " {\"name\": \"yrs\", \"type\": \"int\", \"aliases\": [\"age\"]},\n", " {\"name\": \"gender\", \"type\": \"string\", \"default\":\"unknown\"},\n", " {\"name\": \"emails\", \"type\": {\"type\": \"array\", \"items\": \"string\"}}\n", " ]\n", "}" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Schema-Resolution

\n", "\n", "The schemas can be categorized into two types:\n", "-----------------------------\n", "\n", "* Writers Schema\n", "* Readers Schema\n", "\n", "\n", "How it works :\n", "---------\n", "\n", "| New Schema | Writer | Reader | Action |\n", "|--------------|--------|--------|--------------------------------------------------------------|\n", "| **Added Field** | Old | New | Reader uses default value |\n", "| | New | Old | Reader Ignores it. |\n", "| **Remove Field** | Old | New | Reader ignores the removed field |\n", "| | New | Old | Old reads only if there is a default value;Other wise error. |\n", "| **Change Name** | Old | New | Define Aliases. |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Is storing avro schema an overhead ?

\n", "\n", "Object container files\n", "--------------------\n", "\n", "* They just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema. Super efficeint when for systems like Hadoop.\n", " * A file has a schema, and all objects stored in the file must be written according to that schema.\n", " * Syncronization markers are used between blocks to permit efficient splitting of files for MapReduce processing\n", " \n", "Schema Registry :\n", "-----------------\n", "\n", "* Attach the record with hash, rather than the schema.\n", "* Schema metadata can then be used to auto-create hive tables or infer fields in PIG scripts.\n", "* Serves as up-to-date documentation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Sneek peek into custom build schema-repository

\n", "\n", "Purposes:\n", "---------\n", "\n", "* Register a schema which gives back a unique id for it or you can query for a schema. \n", "* Before emitting/storing a record you must first publish its schema to the registry or know that it has already been published (by checking your cache of published schemas). \n", "* When reading you check your cache and if you don't find the id/schema pair there you query the registry to look it up\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

AVRO in the big data ecosystem

\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

AVRO CLI - Demo

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Words of Caution

\n", "\n", "

\"Schemas are very powerful, if done the right way.\"

\n", "\n", "* For best results, always provide a default value for the fields in your schema. This makes it possible to delete fields later on if you decide it is necessary. If you do not provide a default value for a field, you cannot delete that field from your schema.\n", "\n", "* You cannot change a field's data type. If you have decided that a field should be some data type other than what it was originally created using, then add a whole new field to your schema that uses the appropriate data type.\n", "\n", "* You cannot rename an existing field. However, if you want to access the field by some name other than what it was originally created using, add and use aliases for the field." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
" ] } ], "metadata": {} } ] }