{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Simple Examples of using PL/Python in a SQL Database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Written by Ian Huston.\n", "\n", "These notes are a short introduction to using the procedural language PL/Python and how this can help speed up calculations by leveraging the power of a massively parallel processing database like the [Pivotal Greenplum Database](http://gopivotal.com/products/pivotal-greenplum-database).\n", "\n", "Requirements: [PostgreSQL](http://postgresql.org) or [Greenplum DB](http://gopivotal.com/products/pivotal-greenplum-database), PL/Python + [NumPy](http://numpy.scipy.org) installed on all nodes.\n", "\n", "For this notebook you will also need [ipython-sql](https://github.com/catherinedevlin/ipython-sql) by Catherine Devlin" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Load the ipython-sql magic command\n", "%load_ext sql" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to connect to the Greenplum database" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Normally use the following:\n", "# %sql postgres://user:passwd@server/db" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Use a workaround to not display password!\n", "with open(\"./.config\") as conn_config:\n", " conn_info = conn_config.readline().strip()\n", "ip = get_ipython()\n", "ip.magic(\"%sql \" + conn_info)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will do our work in a separate schema (container inside a database) to keep things tidy." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql \n", "-- Some quick tests of the capability of PL/Python on \n", "-- PostgreSQL and Greenplum DB.\n", "-- Create a schema to do some work in\n", "DROP SCHEMA plp CASCADE;\n", "CREATE SCHEMA plp;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First demonstrate what a User Defined Function looks like in SQL." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql \n", "-- Simple SQL User Defined Function to get started\n", "CREATE FUNCTION plp.times2(INT)\n", "RETURNS INT\n", "AS $$\n", "SELECT 2 * $1;\n", "$$ LANGUAGE sql;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql \n", "--Try it out\n", "SELECT plp.times2(10);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Test using Python code\n", "CREATE FUNCTION plp.pymax (a integer, b integer)\n", "RETURNS integer\n", "AS $$\n", "if a > b:\n", " return a\n", "return b\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Test Python code\n", "SELECT plp.pymax(10, 5);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Create a composite return type\n", "CREATE TYPE plp.named_value AS (\n", " name text,\n", " value integer\n", ");" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Simple function which returns a composite object\n", "CREATE OR REPLACE FUNCTION plp.make_pair (name text, value integer)\n", "RETURNS named_value\n", "AS $$\n", "return [ name, value ]\n", " # or alternatively, as tuple: return ( name, value )\n", " # or as dict: return { \"name\": name, \"value\": value }\n", " # or as an object with attributes .name and .value\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Try out the function\n", "-- (See https://en.wikipedia.org/wiki/Zozimus)\n", "SELECT plp.make_pair('Zozimus', 1);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that UDFs and PL/X functions allow for multiple function signatures." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Using NumPy inside a PL/Python function\n", "CREATE OR REPLACE FUNCTION plp.make_pair (name text)\n", "RETURNS named_value\n", "AS $$\n", "import numpy as np\n", "a = np.arange(100)\n", "return [name, a[2]]\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Try it out\n", "SELECT plp.make_pair('Horatio');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the SQL calling syntax affects how the results are returned. \n", "\n", "In SQL this is the difference between returning composite types and splitting out results into each subtype (name and value here)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "SELECT * FROM plp.make_pair('Horatio');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Returning a set of results using SETOF\n", "CREATE OR REPLACE FUNCTION make_pair_sets (name text)\n", "RETURNS SETOF named_value\n", "AS $$\n", "import numpy as np\n", "return ((name, i) for i in np.arange(3))\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Try it out\n", "SELECT * FROM make_pair_sets('Gerald');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Parallelisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will try to use parallelisation with PL/Python" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "--Set up some data to show parallelisation\n", "DROP TABLE IF EXISTS plp.test_data;\n", "\n", "CREATE TABLE plp.test_data AS\n", "SELECT \n", " 'a'::text AS name\n", " , generate_series(0,1000000)::float AS x\n", " , generate_series(0,1000000)/100.0 AS y\n", "DISTRIBUTED BY (name);\n", "\n", "INSERT INTO plp.test_data \n", "SELECT \n", " 'b'::text AS name\n", " , generate_series(0,1000000)::float AS x\n", " , sin(generate_series(0,1000000)/100.0) AS y;\n", "\n", "INSERT INTO plp.test_data \n", "SELECT \n", " 'c'::text AS name\n", " , generate_series(0,1000000)::float AS x\n", " , 100.0 + sin(generate_series(0,1000000)/100.0) AS y;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Create a function to find the mean of some numbers\n", "DROP FUNCTION IF EXISTS plp.np_mean(double precision[]);\n", "\n", "CREATE OR REPLACE FUNCTION plp.np_mean(value_array double precision[])\n", "RETURNS float\n", "AS $$\n", "import numpy as np\n", "return np.mean(value_array)\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Need to pass the numbers as an array using array_agg\n", "SELECT plp.np_mean(array_agg(y)) FROM plp.test_data;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Now try to do this for each type of data in parallel by grouping\n", "SELECT \n", " name, \n", " plp.np_mean(array_agg(y)) \n", "FROM plp.test_data \n", "GROUP BY name ORDER BY name;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Now try do something even more interesting\n", "DROP FUNCTION IF EXISTS plp.linregr(double precision[]);\n", "\n", "CREATE OR REPLACE FUNCTION \n", " plp.linregr(x double precision[], y double precision[])\n", "RETURNS float[]\n", "AS $$\n", "from scipy import stats\n", "return stats.linregress(x, y)\n", "$$ LANGUAGE plpythonu;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Do linear regression for all data\n", "SELECT plp.linregr(array_agg(x), array_agg(y)) \n", "FROM plp.test_data;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sql\n", "-- Now do it separately for each 'name'\n", "SELECT name, plp.linregr(array_agg(x), array_agg(y)) \n", "FROM plp.test_data \n", "GROUP BY name ORDER BY name;" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we have shown how to run models separately for different data using the GROUP BY clause. It is important to have distributed your data in the correct way to utilise the parallel architecture.\n", "\n", "For further information [see these notes](http://gopivotal.github.io/gp-r/) on using PL/R in addition to PL/Python." ] } ], "metadata": {} } ] }