{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Simple Examples of using PL/Python in a SQL Database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Written by Ian Huston.\n", "\n", "These notes are a short introduction to using the procedural language PL/Python and how this can help speed up calculations by leveraging the power of a massively parallel processing database like the [Pivotal Greenplum Database](http://gopivotal.com/products/pivotal-greenplum-database).\n", "\n", "Requirements: [PostgreSQL](http://postgresql.org) or [Greenplum DB](http://gopivotal.com/products/pivotal-greenplum-database), PL/Python + [NumPy](http://numpy.scipy.org) installed on all nodes.\n", "\n", "For this notebook you will also need [ipython-sql](https://github.com/catherinedevlin/ipython-sql) by Catherine Devlin" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Load the ipython-sql magic command\n", "%load_ext sql" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "The sql extension is already loaded. To reload it, use:\n", " %reload_ext sql\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to connect to the Greenplum database" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Normally use the following:\n", "# %sql postgresql://user:passwd@server/db" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# Use a workaround to not display password!\n", "with open(\"./.config\") as conn_config:\n", " conn_info = conn_config.readline().strip()\n", "ip = get_ipython()\n", "ip.magic(\"%sql \" + conn_info)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "u'Connected: ihuston@hustondb'" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will do our work in a separate schema (container inside a database) to keep things tidy." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql \n", "-- Some quick tests of the capability of PL/Python on \n", "-- PostgreSQL and Greenplum DB.\n", "-- Create a schema to do some work in\n", "\n", "DROP SCHEMA plp CASCADE;\n", "CREATE SCHEMA plp;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n", "Done." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "[]" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "First demonstrate what a User Defined Function looks like in SQL." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql \n", "-- Simple SQL User Defined Function to get started\n", "\n", "CREATE FUNCTION plp.times2(INT)\n", "RETURNS INT\n", "AS $$\n", "SELECT 2 * $1;\n", "$$ LANGUAGE sql;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "[]" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql \n", "--Try it out\n", "\n", "SELECT plp.times2(10);" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
times2
20
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "[(20,)]" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Test using Python code\n", "\n", "CREATE FUNCTION plp.pymax (a integer, b integer)\n", "RETURNS integer\n", "AS $$\n", "if a > b:\n", " return a\n", "return b\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "[]" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Test Python code\n", "\n", "SELECT plp.pymax(10, 5);" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
pymax
10
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "[(10,)]" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Create a composite return type\n", "\n", "CREATE TYPE plp.named_value AS (\n", " name text,\n", " value integer\n", ");" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "[]" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Simple function which returns a composite object\n", "\n", "CREATE OR REPLACE FUNCTION plp.make_pair (name text, value integer)\n", "RETURNS named_value\n", "AS $$\n", "return [ name, value ]\n", " # or alternatively, as tuple: return ( name, value )\n", " # or as dict: return { \"name\": name, \"value\": value }\n", " # or as an object with attributes .name and .value\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "[]" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Try out the function\n", "-- (See https://en.wikipedia.org/wiki/Zozimus)\n", "\n", "SELECT plp.make_pair('Zozimus', 1);" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
make_pair
(Zozimus,1)
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "[('(Zozimus,1)',)]" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that UDFs and PL/X functions allow for multiple function signatures." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Using NumPy inside a PL/Python function\n", "\n", "CREATE OR REPLACE FUNCTION plp.make_pair (name text)\n", "RETURNS named_value\n", "AS $$\n", "import numpy as np\n", "a = np.arange(100)\n", "return [name, a[2]]\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "[]" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Try it out\n", "\n", "SELECT plp.make_pair('Horatio');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
make_pair
(Horatio,2)
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "[('(Horatio,2)',)]" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the SQL calling syntax affects how the results are returned. \n", "\n", "In SQL this is the difference between returning composite types and splitting out results into each subtype (name and value here)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "\n", "SELECT * FROM plp.make_pair('Horatio');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namevalue
Horatio2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "[(u'Horatio', 2)]" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Returning a set of results using SETOF\n", "\n", "CREATE OR REPLACE FUNCTION make_pair_sets (name text)\n", "RETURNS SETOF named_value\n", "AS $$\n", "import numpy as np\n", "return ((name, i) for i in np.arange(3))\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "[]" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Try it out\n", "\n", "SELECT * FROM make_pair_sets('Gerald');" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namevalue
Gerald0
Gerald1
Gerald2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "[(u'Gerald', 0), (u'Gerald', 1), (u'Gerald', 2)]" ] } ], "prompt_number": 20 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Parallelisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will try to use parallelisation with PL/Python" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Set up some data to show parallelisation\n", "\n", "DROP TABLE IF EXISTS plp.test_data;\n", "\n", "CREATE TABLE plp.test_data AS\n", "SELECT \n", " 'a'::text AS name\n", " , generate_series(0,1000000)::float AS x\n", " , generate_series(0,1000000)/100.0 AS y\n", "DISTRIBUTED BY (name);\n", "\n", "INSERT INTO plp.test_data \n", "SELECT \n", " 'b'::text AS name\n", " , generate_series(0,1000000)::float AS x\n", " , sin(generate_series(0,1000000)/100.0) AS y;\n", "\n", "INSERT INTO plp.test_data \n", "SELECT \n", " 'c'::text AS name\n", " , generate_series(0,1000000)::float AS x\n", " , 100.0 + sin(generate_series(0,1000000)/100.0) AS y;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n", "1000001 rows affected." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1000001 rows affected." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "1000001 rows affected." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "[]" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Create a function to find the mean of some numbers\n", "\n", "DROP FUNCTION IF EXISTS plp.np_mean(double precision[]);\n", "\n", "CREATE OR REPLACE FUNCTION plp.np_mean(value_array double precision[])\n", "RETURNS float\n", "AS $$\n", "import numpy as np\n", "return np.mean(value_array)\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n", "Done." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "[]" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Need to pass the numbers as an array using array_agg\n", "\n", "SELECT plp.np_mean(array_agg(y)) FROM plp.test_data;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
np_mean
1700.00013004
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "[(1700.00013004,)]" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Now try to do this for each type of data in parallel by grouping\n", "\n", "SELECT \n", " name, \n", " plp.np_mean(array_agg(y)) \n", "FROM plp.test_data \n", "GROUP BY name ORDER BY name;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namenp_mean
a5000.0
b0.000195060907772
c100.000195061
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ "[(u'a', 5000.0), (u'b', 0.000195060907772), (u'c', 100.000195061)]" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Now try do something even more interesting\n", "\n", "DROP FUNCTION IF EXISTS plp.linregr(double precision[]);\n", "\n", "CREATE OR REPLACE FUNCTION \n", " plp.linregr(x double precision[], y double precision[])\n", "RETURNS float[]\n", "AS $$\n", "from scipy import stats\n", "return stats.linregress(x, y)\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Done.\n", "Done." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "[]" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Do linear regression for all data\n", "\n", "SELECT plp.linregr(array_agg(x), array_agg(y)) \n", "FROM plp.test_data;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
linregr
[0.00333333331357, 33.3334732575, 0.335532910587, 0.0, 5.4031491216e-06]
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ "[([0.00333333331357, 33.3334732575, 0.335532910587, 0.0, 5.4031491216e-06],)]" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Now do it separately for each 'name'\n", "\n", "SELECT name, plp.linregr(array_agg(x), array_agg(y)) \n", "FROM plp.test_data \n", "GROUP BY name ORDER BY name;" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 rows affected.\n" ] }, { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelinregr
a[0.0100000000001, -6.82430254528e-08, 1.0, 0.0, 0.0]
b[-2.96599680404e-11, 0.000209890891792, -1.210882548e-05, 0.990338798439, 2.44945167615e-09]
c[-2.96599680394e-11, 100.000209891, -1.21088254796e-05, 0.99033879844, 2.44945167615e-09]
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "[(u'a', [0.0100000000001, -6.82430254528e-08, 1.0, 0.0, 0.0]),\n", " (u'b', [-2.96599680404e-11, 0.000209890891792, -1.210882548e-05, 0.990338798439, 2.44945167615e-09]),\n", " (u'c', [-2.96599680394e-11, 100.000209891, -1.21088254796e-05, 0.99033879844, 2.44945167615e-09])]" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we have shown how to run models separately for different data using the GROUP BY clause. It is important to have distributed your data in the correct way to utilise the parallel architecture.\n", "\n", "For further information [see these notes](http://gopivotal.github.io/gp-r/) on using PL/R in addition to PL/Python." ] } ], "metadata": {} } ] }