{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Connect Python and Pandas To Redshift\n", "> I've found the Pandas + SQLAlchemy combination to be very useful; extending this setup just a bit further can help query Redshift directly from Python\n", "\n", "- toc: false\n", "- branch: master\n", "- badges: true\n", "- categories: [redshift,pandas]\n", "- hide: false\n", "- search_exclude: false" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many situations where you might like to get data from a database into a `pandas` dataframe. The simplest way to is to initialize a dataframe via the `pandas` [read_sql_query](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_sql_query.html) method. The ORM of choice for pandas is [SQLAlchemy](http://www.sqlalchemy.org/). By installing a couple more packages, you can query Redshift data and read that into a dataframe with just a few lines of of Python code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Setup\n", "This post assumes you have a number of Python packages already installed, such as `pandas`, `numpy`, `sqlalchemy` and `iPython` or `Jupyter Lab`. I've used the Python distribution [Anacoda](http://docs.continuum.io/anaconda/install.html), which can be downloaded [here](http://continuum.io/downloads). It comes packed with many of the [libraries](http://docs.continuum.io/anaconda/pkg-docs.html) you might need for data analysis.\n", "\n", "The [Dialects documentation](http://docs.sqlalchemy.org/en/rel_0_9/dialects/) for SQLAlchemy mentions that Redshift is supported through another Python package, which itself depends on a PostgreSQL driver. So, the additional packages needed for connecting to Redshift are [redshift-sqlalchemy](https://pypi.org/project/sqlalchemy-redshift/) and [psycopg2](http://initd.org/psycopg/). If you already have Anaconda, you can install `psycopg2` quickly using conda. For getting `redshift-sqlalchemy` installed, here are some docs on using [conda & pip](http://conda.pydata.org/docs/faq.html#env-installation) together. That being said, the process of downloading the packages looked like this in my terminal:\n", "\n", "```bash\n", "conda install psycopg2\n", "cd Downloads/redshift-sqlalchemy-0.4.1\n", "python setup.py install\n", "```\n", "\n", "Of course, I did this in my default environment. Everything seems to work, but the Anaconda docs are very useful if you have multiple Python environments on your machine." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## On To The Data\n", "Start iPython in your terminal (or Notebook if you prefer). Note that the [connection string](http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html) follows the pattern `'flavor+DBAPI://username:password@host:port/database'`. Make sure to change the connection string below to your own!\n", "\n", "```python\n", "import pandas as pd\n", "import numpy as np\n", "from sqlalchemy import create_engine\n", " \n", "# what is the database equivalent of 'hello world' anyways?\n", "test_query = \"\"\"\n", "SELECT COUNT(*)\n", "FROM datawarehouse.users u\n", "WHERE u.date BETWEEN '2015-01-01' AND '2015-05-31'\n", "\"\"\"\n", "\n", "red_engine = create_engine('redshift+psycopg2://username:password@your.redshift.host.123456abcdef.us-east-1.redshift.amazonaws.com:port/database')\n", "test = pd.read_sql_query(test_query,red_engine)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we look at the \"test\" object I see:\n", "```python\n", "In [5]: test\n", "Out[5]:\n", " count\n", "0 178545\n", "```\n", "\n", "Of course, you'll need to modify the query string to match the settings for your own warehouse to test this out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update/Issues\n", "It's been a while since I had to use this type of connection, so things may have changed since this was originally in June of 2015. Specifically, once my company required SSL connections, the setup above went south and I got errors like:\n", "```python\n", "OperationalError: (OperationalError) sslmode value \"require\" invalid when SSL support is not compiled in\n", " None None\n", "```\n", "\n", "[Samantha Zeitlin](https://twitter.com/samanthazeitlin?lang=en) responded to my request for help with a super-thorough post to help debug the issue, which had to do with fixing symlinks. Unfortunately the Google Group where that response was posted is now defunct :weary:\n", "\n", "Miraculously I happen to have the last post to that thread:\n", "\n", "> For posterity, a workaround that worked for me without having to rely on symlinks:\n", "> \n", "> As others have suggested, install with pip:\n", "> $ `conda uninstall psycopg2`\n", "> $ `pip install psycopg2`\n", "> \n", "> Python will throw an error when trying to import psycopg2, so run this:\n", "> $ `export DYLD_FALLBACK_LIBRARY_PATH=$HOME/anaconda/lib/:$DYLD_FALLBACK_LIBRARY_PATH`\n", "> \n", "> Credit: http://stackoverflow.com/questions/27264574/import-psycopg2-library-not-loaded-libssl-1-0-0-dylib\n", "\n", "\n", "If you have any issues or fixes, I'd love to hear about them; you can reach out to me on [Twitter](https://twitter.com/measurestuff)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }