{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## CPC Sketch Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Sketch Usage" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from datasketches import cpc_sketch, cpc_union" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll create a sketch with log2(k) = 12" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "sk = cpc_sketch(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### CPC sketch summary:\n", " lgK : 12\n", " seed hash : 93cc\n", " C : 38212\n", " flavor : 4\n", " merged : false\n", " compressed : false\n", " intresting col : 5\n", " HIP estimate : 2.09721e+06\n", " kxp : 11.4725\n", " offset : 6\n", " table : allocated\n", " num SV : 135\n", " window : allocated\n", "### End sketch summary\n", "\n" ] } ], "source": [ "n = 1 << 21\n", "for i in range(0, n):\n", " sk.update(i)\n", "print(sk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we know the exact value of n we can look at the estimate and upper/lower bounds as a % of the true value. We'll look at the bounds at 1 standard deviation. In this case, the true value does lie within the bounds, but since these are probabilistic bounds the true value will sometimes be outside them (especially at 1 standard deviation)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Upper bound (1 std. dev) as % of true value: 100.9281\n" ] } ], "source": [ "print(\"Upper bound (1 std. 
dev) as % of true value: \", round(100*sk.get_upper_bound(1) / n, 4))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimate as % of true value: 100.0026\n" ] } ], "source": [ "print(\"Estimate as % of true value: \", round(100*sk.get_estimate() / n, 4))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lower bound (1 std. dev) as % of true value: 99.0935\n" ] } ], "source": [ "print(\"Lower bound (1 std. dev) as % of true value: \", round(100*sk.get_lower_bound(1) / n, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can serialize and deserialize the sketch, which will give us back the same structure." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2484" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sk_bytes = sk.serialize()\n", "len(sk_bytes)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### CPC sketch summary:\n", " lgK : 12\n", " seed hash : 93cc\n", " C : 38212\n", " flavor : 4\n", " merged : false\n", " compressed : false\n", " intresting col : 5\n", " HIP estimate : 2.09721e+06\n", " kxp : 11.4725\n", " offset : 6\n", " table : allocated\n", " num SV : 135\n", " window : allocated\n", "### End sketch summary\n", "\n" ] } ], "source": [ "sk2 = cpc_sketch.deserialize(sk_bytes)\n", "print(sk2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sketch Union Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. 
For most applications we'd generally create all new data using the same sketch size, so differences usually creep in only when combining new and historical data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "k = 12\n", "n = 1 << 20\n", "offset = int(3 * n / 4)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "sk1 = cpc_sketch(k)\n", "sk2 = cpc_sketch(k + 1)\n", "for i in range(0, n):\n", "    sk1.update(i)\n", "    sk2.update(i + offset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a union object and add the sketches to that. To demonstrate smoothly handling multiple sketch sizes, we'll use a size of k+1 here." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "union = cpc_union(k+1)\n", "union.update(sk1)\n", "union.update(sk2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how lgK in the result has automatically adopted the value of the smaller input sketch." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "### CPC sketch summary:\n", " lgK : 12\n", " seed hash : 93cc\n", " C : 37418\n", " flavor : 4\n", " merged : true\n", " compressed : false\n", " intresting col : 5\n", " HIP estimate : 0\n", " kxp : 4096\n", " offset : 6\n", " table : allocated\n", " num SV : 123\n", " window : allocated\n", "### End sketch summary\n", "\n" ] } ], "source": [ "result = union.get_result()\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can again compare against the exact result, in this case 1.75*n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimate as % of true value: 99.6646\n" ] } ], "source": [ "print(\"Estimate as % of true value: \", round(100*result.get_estimate() / (7*n/4), 4))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }