{ "metadata": { "name": "", "signature": "sha256:4a51afdcc72d6c3700e33a0f3fa5a5570e51f803ccfcaeea43ab06c775c03da7" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Pyxplorer - interactive data set exploration\n", "\n", "The goal of pyxplorer is to provide a simple tool that allows interactive\n", "profiling of datasets that are accessible via a SQL like interface. The only\n", "requirement to run data profiling is that you are able to provide a Python\n", "DBAPI like interface to your data source and the data source is able to\n", "understand simplistic SQL queries.\n", "\n", "I built this piece of software while trying to get a better understanding of\n", "data distribution in a massive several hundred million record large dataset.\n", "Depending on the size of the dataset and the query engine the resposne time\n", "can ranging from seconds (Impala) to minutes (Hive) or even hourse (MySQL)\n", "\n", "The typical use case is to use `pyxplorer` interactively from an iPython\n", "Notebook or iPython shell to incrementally extract information about your data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " $> pip install pyxplorer pympala" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Questions, Ideas, Comments:\n", "\n", "https://github.com/grundprinzip/pyxplorer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example using Impala\n", "\n", "Basically `pyexplorer` works with all DBAPI like interfaces, but to show the advantages of running a high-performance data analysis on large amounts of data we will use Impala to store our data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from impala.dbapi import connect\n", "conn = connect(host='diufpc57', port=21050)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Database Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine that you are provided with access to a huge Hive/Impala database on\n", "your very own Hadoop cluster and you're asked to profile the data to get a\n", "better understanding for performing more specific data science later on. \n", "Based on this connection, we can now instantiate a new explorer object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pyxplorer as pxp\n", "data = pxp.Database(\"tpcc3\", conn)\n", "data" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
Name | Size |
---|---|
customerp | 30000000 |
districtp | 10000 |
historyp | 30000000 |
itemp | 100000 |
new_orderp | 9000000 |
oorderp | 30000000 |
order_linep | 299991280 |
stockp | 100000000 |
warehousep | 1000 |
Rows: 9 / Columns: 2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "Name | Value |
---|---|
Min | 1 |
Max | 1000 |
#Distinct Values | 1000 |
Most Frequent | 109 (301593) |
Least Frequent | 212 (298395) |
Top 10 MF | 109,676,117,460,19,877,165,764,340,689 |
Top 10 LF | 212, 405, 52, 284, 304, 769, 727, 665, 90, 163 |
Uniqueness | 3.33343022504e-06 |
Constancy | 0.00100533922186 |
Rows: 9 / Columns: 2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "ol_w_id" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Its possible to access the column either using subscript notation or directly as an attribute of the object" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tab.ol_w_id" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "Name | Value |
---|---|
Min | 1 |
Max | 1000 |
#Distinct Values | 1000 |
Most Frequent | 109 (301593) |
Least Frequent | 212 (298395) |
Top 10 MF | 109,676,117,460,19,877,165,764,340,689 |
Top 10 LF | 212, 405, 52, 284, 304, 769, 727, 665, 90, 163 |
Uniqueness | 3.33343022504e-06 |
Constancy | 0.00100533922186 |
Rows: 9 / Columns: 2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "ol_w_id" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on this information we can further deduct what role this column might have in the overall schema. For example, based on the uniqueness we can say that the column is not suitable to uniquely identify every row. In additiona, based on the constancy of the most frequent value and the spread between the most and least frequent value we can deduct that the data is almost uniformly distributed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distinct Values and Distinct Value Distribution\n", "\n", "One important feature ist to look at the distinct values and their distribution" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tab.ol_w_id.dcount()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "1000" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "dist = tab.ol_w_id.distribution(limit=10000)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "dist.fraction.hist()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "\n", " | table | \n", "column | \n", "distinct | \n", "size | \n", "fraction | \n", "
---|---|---|---|---|---|
0 | \n", "customerp | \n", "c_w_id | \n", "1000 | \n", "30000000 | \n", "3.333333e-05 | \n", "
1 | \n", "customerp | \n", "c_d_id | \n", "10 | \n", "30000000 | \n", "3.333333e-07 | \n", "
2 | \n", "customerp | \n", "c_id | \n", "3000 | \n", "30000000 | \n", "1.000000e-04 | \n", "
3 | \n", "customerp | \n", "c_discount | \n", "5000 | \n", "30000000 | \n", "1.666667e-04 | \n", "
4 | \n", "customerp | \n", "c_credit | \n", "2 | \n", "30000000 | \n", "6.666667e-08 | \n", "
5 | \n", "customerp | \n", "c_last | \n", "1000 | \n", "30000000 | \n", "3.333333e-05 | \n", "
6 | \n", "customerp | \n", "c_first | \n", "29999284 | \n", "30000000 | \n", "9.999761e-01 | \n", "
7 | \n", "customerp | \n", "c_credit_lim | \n", "1 | \n", "30000000 | \n", "3.333333e-08 | \n", "
8 | \n", "customerp | \n", "c_balance | \n", "1 | \n", "30000000 | \n", "3.333333e-08 | \n", "
9 | \n", "customerp | \n", "c_ytd_payment | \n", "1 | \n", "30000000 | \n", "3.333333e-08 | \n", "
Name | Size |
---|---|
pyxplorer_data | 7 |
Rows: 1 / Columns: 2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 33, "text": [ "Name | Value |
---|---|
Min | 100 |
Max | distance |
#Distinct Values | 7 |
Most Frequent | distance (1) |
Least Frequent | distance (1) |
Top 10 MF | 100,129,148,distance,192,113,168 |
Top 10 LF | 100, 129, 148, distance, 192, 113, 168 |
Uniqueness | 1.0 |
Constancy | 0.142857142857 |
Rows: 9 / Columns: 2
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 39, "text": [ "col_0" ] } ], "prompt_number": 39 } ], "metadata": {} } ] }