{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "***TL;DR*** *I generate a big amount of fake data for Spring PetClinic with Faker that I store directly in a MySQL database via Pandas / SQLAlchemy.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "In preparation for a talk about performance optimization, I needed some monstrous amounts of fake data for a system under test. I choose the Spring Pet Clinic project as my \"patient\" because there are some typical problems that this application does wrong. But this application comes with round about 100 database entries. This isn't enough at all.\n", "\n", "So in this notebook, I'll show you how to\n", "- examine the existing database tables\n", "- generate some fake data\n", "- initializing a database schema\n", "- dumping all data into the database\n", "\n", "And of course: How to do that very efficiently with the Python Data Analysis Library aka [Pandas](http://pandas.pydata.org/#what-problem-does-pandas-solve) and some other little helpers.\n", "\n", "We will also be playing a little around to get an impression where to get some other fake data, too!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Context\n", "The Spring PetClinic software project is a showcase example for the Spring framework. It comes with a nice little Web UI and a backend written in Java. The [documentation](http://docs.spring.io/docs/petclinic.html) says\n", "\n", "> The application requirement is for an information system that is accessible through a web browser. The users of the application are employees of the clinic who in the course of their work need to view and manage information regarding the veterinarians, the clients, and their pets.\n", "\n", "The application stores some data into a database:\n", "\n", "![](./resources/spring_petclinic_db_schema.png)\n", "\n", "There are some issues with the application while accessing the data, but I won't get into this in this notebook. Instead, I will focus on the data generation for all these tables. My approach here is straightforward: I adopt the existing tables with their data types and constraints, delete existing data and inserts some new data into the existing tables. Hereby I respect the generation of unique primary keys and foreign keys by the means of Pandas' abilities. We also have to keep the right insertion order in mind. There are some tables that depend on already existing data from other tables. But I will get into details later." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Configuration\n", "At the beginning, one can define the amount of data that should be created and be stored in the database. We don't need it yet, but I find it always nice to have the parameters that can be change at the beginning of a notebook. The configuration is then printed out. Let's produce some production data!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Generating fake data for\n", "- 1000 vets, \n", "- each having ~2000 specialties, \n", "- each for serving ~10000 owners,\n", "- each caring for ~20000 pets,\n", "- of max. 
~2000 types/breeds and \n", "- each taking them to ~40000 visits.\n", "\n" ] } ], "source": [ "AMOUNT_VETS = 1000\n", "AMOUNT_SPECIALTIES = 2 * AMOUNT_VETS\n", "AMOUNT_OWNERS = 10 * AMOUNT_VETS\n", "AMOUNT_PETS = 2 * AMOUNT_OWNERS\n", "AMOUNT_PET_TYPES = int(AMOUNT_PETS / 10)\n", "AMOUNT_VISITS = 2 * AMOUNT_PETS\n", "\n", "print(\"\"\"\n", "Generating fake data for\n", "- %d vets, \n", "- each having ~%d specialties, \n", "- each for serving ~%d owners,\n", "- each caring for ~%d pets,\n", "- of max. ~%d types/breeds and \n", "- each taking them to ~%d visits.\n", "\"\"\" % (AMOUNT_VETS, AMOUNT_SPECIALTIES, AMOUNT_OWNERS, AMOUNT_PETS, AMOUNT_PET_TYPES, AMOUNT_VISITS))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examine the database schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to the database\n", "This step usually occurs at the end of such a script, but in this notebook, I want to show you upfront how the tables are made up. So we simply create a database connection with [SQLAlchemy](https://www.sqlalchemy.org/) and the underlying [MySQL Python Connector](https://dev.mysql.com/downloads/connector/python/):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'mysqlconnector'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sqlalchemy import create_engine\n", "engine = create_engine('mysql+mysqlconnector://root:admin@localhost:3306/petclinic', echo=False)\n", "engine.driver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect the schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQLAlchemy provides a great tool for inspecting a database: the Inspector."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'petclinic'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sqlalchemy.engine import reflection\n", "insp = reflection.Inspector.from_engine(engine)\n", "insp.default_schema_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Inspector allows us to iterator over various data of the schema:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['get_columns',\n", " 'get_foreign_keys',\n", " 'get_indexes',\n", " 'get_pk_constraint',\n", " 'get_primary_keys',\n", " 'get_schema_names',\n", " 'get_sorted_table_and_fkc_names',\n", " 'get_table_names',\n", " 'get_table_options',\n", " 'get_temp_table_names',\n", " 'get_temp_view_names',\n", " 'get_unique_constraints',\n", " 'get_view_definition',\n", " 'get_view_names']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relevant_methods = [x for x in dir(insp) if x.startswith(\"get\")]\n", "relevant_methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So for example you, can easily lists all tables:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['owners',\n", " 'pets',\n", " 'specialties',\n", " 'types',\n", " 'vet_specialties',\n", " 'vets',\n", " 'vets_specialties',\n", " 'visits']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "insp.get_table_names()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the Inspector from SQLAlchemy, we can easily list the needed data types for the table:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0TrueNoneidFalseINTEGER(4) UNSIGNED
1NaNNonefirst_nameTrueVARCHAR(30)
2NaNNonelast_nameTrueVARCHAR(30)
3NaNNoneaddressTrueVARCHAR(255)
4NaNNonecityTrueVARCHAR(80)
5NaNNonetelephoneTrueVARCHAR(20)
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 True None id False INTEGER(4) UNSIGNED\n", "1 NaN None first_name True VARCHAR(30)\n", "2 NaN None last_name True VARCHAR(30)\n", "3 NaN None address True VARCHAR(255)\n", "4 NaN None city True VARCHAR(80)\n", "5 NaN None telephone True VARCHAR(20)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.DataFrame(insp.get_columns('owners'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake data in general\n", "We could just fill up the data randomly, but I want to show you, how to get some real looking data. For this, nice little helpers are out there for implementing that. In this notebook, I use the fake data provider Faker (https://github.com/joke2k/faker). It comes with nice helper methods for generating data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Nancy Garcia'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from faker import Factory\n", "fake = Factory.create()\n", "fake.name()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'28699 Gilmore Bridge Suite 610'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fake.street_address()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'(562)399-3432x5090'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fake.phone_number()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there is one drawback: Faker doesn't seem to be appropriate for generating massive amount of test data. For example, on my machine (Lenovo X220 i5) it takes almost 5 seconds to generate 100k phone numbers." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 5.39 s\n" ] } ], "source": [ "%%time\n", "[fake.phone_number() for _ in range (1,100000)]\n", "_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this is no problem for our little scenario, there could be room for performance improvement (and I've already programmed a prototype, stay tuned!).\n", "\n", "But let's get back to our original task: Generating fake data for Spring PetClinic!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Owners\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, for the table for all the pet's owners, we need a DataFrame that looks like this one:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameaddresscitytelephone
id
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [first_name, last_name, address, city, telephone]\n", "Index: []" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# just some unreadable code to make a point\n", "pd.DataFrame(columns=pd.DataFrame(insp.get_columns('owners'))[['name']].T.reset_index().iloc[0][1::]).set_index('id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other words: We need a set of Series objects that we fill up with data that makes sense for each column. OK, let's rebuild it step by step (or better to say: column by column). To keep it simple, we ignore the database specific data types in this example.\n", "\n", "The first trick is to fill the index (the later \"id\" column aka primary key) with the amount of data that is requested. We set the amount already at the beginning of the notebook and simply use it here. We use the built-in range method for generating a continuous stream of numbers from 1 to the requested number of owners +1. We need to shift the lower and upper bound because the primary keys for our database starts at 1." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1
2
3
4
5
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [1, 2, 3, 4, 5]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners = pd.DataFrame(index=range(1,AMOUNT_OWNERS+1))\n", "owners.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set the name of the index column to id. This is just a minor correction to store the data more easily in the database later on." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id
1
2
3
4
5
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [1, 2, 3, 4, 5]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners.index.name='id'\n", "owners.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, let's generate some first names with Faker. We sample via the map function of the index (which is not very performant, but will do for now)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_name
id
1Eric
2Cathy
3Barbara
4Kevin
5James
\n", "
" ], "text/plain": [ " first_name\n", "id \n", "1 Eric\n", "2 Cathy\n", "3 Barbara\n", "4 Kevin\n", "5 James" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners['first_name'] = owners.index.map(lambda x : fake.first_name())\n", "owners.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We repeat that for all the other columns with the appropriate data." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_nameaddresscitytelephone
id
1EricMann28236 Danielle TrackEast Nicholaston(882)917-5347x3716
2CathyMullen87654 Lewis SpursJasonshire1-189-763-2384x16254
3BarbaraMiller63860 Reed CliffEast Kristin1-065-041-3835
4KevinConley909 Christian ManorsJasonville726-589-8897x948
5JamesPayne74424 Oscar CourtColintown055-423-8490x95700
\n", "
" ], "text/plain": [ " first_name last_name address city \\\n", "id \n", "1 Eric Mann 28236 Danielle Track East Nicholaston \n", "2 Cathy Mullen 87654 Lewis Spurs Jasonshire \n", "3 Barbara Miller 63860 Reed Cliff East Kristin \n", "4 Kevin Conley 909 Christian Manors Jasonville \n", "5 James Payne 74424 Oscar Court Colintown \n", "\n", " telephone \n", "id \n", "1 (882)917-5347x3716 \n", "2 1-189-763-2384x16254 \n", "3 1-065-041-3835 \n", "4 726-589-8897x948 \n", "5 055-423-8490x95700 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owners['last_name'] = owners.index.map(lambda x : fake.last_name())\n", "owners['address'] = owners.index.map(lambda x : fake.street_address())\n", "owners['city'] = owners.index.map(lambda x : fake.city())\n", "owners['telephone'] = owners.index.map(lambda x : fake.phone_number())\n", "owners.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generation of this table was very easy. Let's see what's next!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Types\" (of a pet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each owner has a pet of a specific type." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0TrueNoneidFalseINTEGER(4) UNSIGNED
1NaNNonenameTrueVARCHAR(80)
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 True None id False INTEGER(4) UNSIGNED\n", "1 NaN None name True VARCHAR(80)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(insp.get_columns('types'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we need a DataFrame like this:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
id
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [name]\n", "Index: []" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# just some unreadable code to make a point\n", "pd.DataFrame(columns=pd.DataFrame(insp.get_columns('types'))[['name']].T.reset_index().iloc[0][1::]).set_index('id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need some animal names for generating the pet's type table. Unfortunately, Faker doesn't provide such data. Luckily, after one Google search, someone placed a list of animals on the World Wide Web. We just read that data with Pandas as an index.\n", "\n", "Note: We could have written our own [specific provider for fake data](https://faker.readthedocs.io/en/latest/index.html#how-to-create-a-provider), but that too much for this notebook.\n", "\n", "Side note: I took not the raw data GitHub provides and saved it locally with a reference to the original origin (as you normally should do), but instead took the HTML pages **just because I can** :-)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Aardvark
Albatross
Alligator
Alpaca
Ant
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [Aardvark, Albatross, Alligator, Alpaca, Ant]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# loads all HTML tables from the site, but take only the first found and the second column\n", "animal_names = pd.read_html(\"https://github.com/hzlzh/Domain-Name-List/blob/master/Animal-words.txt\")[0][[1]]\n", "# set the ony column as index \n", "animal_names = animal_names.set_index(1)\n", "# remove the index name\n", "animal_names.index.name = None\n", "animal_names.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we are getting to a key trick in generating data very efficiently: Vector operations. We have a DataFrame only consisting of an index column. Mathematically speaking, it's a one-dimensional vector. Pandas (respectively the underlying Numpy library) is very efficient in working with these kinds of data. \n", "\n", "There exist multiple operations that support working on vectors. What we need is to get a random amount of entries from a given data set, which is called \"sampling\". Pandas DataFrame provides such a sampling function to achieve that. We use sampling to draw some entries from the animals' data set, e. g. three different kinds:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Whale
Rail
Shrimp
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [Whale, Rail, Shrimp]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "animal_names.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, lets' get back to the types table. We generate the index first. Here we have to be careful: It could be that one requests more pet types as there are in the animal_names dataset, but we don't want to allow duplicates. So we limit the index with a min-function if the requested number of animals exceeds the number of animals avaliable." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id
1
2
3
4
5
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [1, 2, 3, 4, 5]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "types = pd.DataFrame(index=range(1, min(AMOUNT_PET_TYPES, len(animal_names))+1))\n", "types.index.name='id'\n", "types.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we draw the animals from animal_names. We sample the number of requested pet types at once from the animal_names' index." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
id
1Snail
2Gaur
3Mule
4Meerkat
5Dogfish
\n", "
" ], "text/plain": [ " name\n", "id \n", "1 Snail\n", "2 Gaur\n", "3 Mule\n", "4 Meerkat\n", "5 Dogfish" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "types['name'] = animal_names.sample(len(types)).index\n", "types.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's all fake data for the pet types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Pets\"\n", "Let's get back to some more easy data: The Pets." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0TrueNoneidFalseINTEGER(4) UNSIGNED
1NaNNonenameTrueVARCHAR(30)
2NaNNonebirth_dateTrueDATE
3FalseNonetype_idFalseINTEGER(4) UNSIGNED
4FalseNoneowner_idFalseINTEGER(4) UNSIGNED
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 True None id False INTEGER(4) UNSIGNED\n", "1 NaN None name True VARCHAR(30)\n", "2 NaN None birth_date True DATE\n", "3 False None type_id False INTEGER(4) UNSIGNED\n", "4 False None owner_id False INTEGER(4) UNSIGNED" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(insp.get_columns('pets'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need some fake data and some ids to already existing entries from the two tables owners and types. \n", "\n", "Let's see if we can get some nice data looking like that Dataframe here:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebirth_datetype_idowner_id
id
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [name, birth_date, type_id, owner_id]\n", "Index: []" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# just some unreadable code to make a point\n", "pd.DataFrame(columns=pd.DataFrame(insp.get_columns('pets'))[['name']].T.reset_index().iloc[0][1::]).set_index('id')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebirth_date
id
1Shawn2004-08-28
2Michele1975-02-18
3Stephen2009-04-29
4Justin1973-04-08
5Joel2014-01-25
\n", "
" ], "text/plain": [ " name birth_date\n", "id \n", "1 Shawn 2004-08-28\n", "2 Michele 1975-02-18\n", "3 Stephen 2009-04-29\n", "4 Justin 1973-04-08\n", "5 Joel 2014-01-25" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pets = pd.DataFrame(index=range(1,AMOUNT_PETS+1))\n", "pets.index.name='id'\n", "pets['name'] = pets.index.map(lambda x : fake.first_name())\n", "pets['birth_date'] = pets.index.map(lambda x : fake.date())\n", "pets.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the ids to the owners and types table, we use the sampling function that I've introduced above to draw some ids. The important different is, that we set an additional argument replace=True, which is necessary when more samples should be drawn than data entries are available in the dataset. Or in plain English: If duplicates should be allowed. This makes perfect sense: One owner can own more than one pet of different kinds/types." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebirth_datetype_idowner_id
id
1Shawn2004-08-28723737
2Michele1975-02-188720
3Stephen2009-04-291877814
4Justin1973-04-08556659
5Joel2014-01-25197915
\n", "
" ], "text/plain": [ " name birth_date type_id owner_id\n", "id \n", "1 Shawn 2004-08-28 72 3737\n", "2 Michele 1975-02-18 8 720\n", "3 Stephen 2009-04-29 187 7814\n", "4 Justin 1973-04-08 55 6659\n", "5 Joel 2014-01-25 19 7915" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pets['type_id'] = types.sample(len(pets), replace=True).index\n", "pets['owner_id'] = owners.sample(len(pets), replace=True).index\n", "pets.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Visits\"\n", "The next few tables are straightforward." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0TrueNoneidFalseINTEGER(4) UNSIGNED
1FalseNonepet_idFalseINTEGER(4) UNSIGNED
2NaNNonevisit_dateTrueDATE
3NaNNonedescriptionTrueVARCHAR(255)
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 True None id False INTEGER(4) UNSIGNED\n", "1 False None pet_id False INTEGER(4) UNSIGNED\n", "2 NaN None visit_date True DATE\n", "3 NaN None description True VARCHAR(255)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(insp.get_columns('visits'))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pet_idvisit_datedescription
id
1165961997-05-04Aut ipsa illo maiores necessitatibus velit. Do...
2144272014-10-16Mollitia praesentium quasi libero accusantium ...
3125822002-12-06Qui nobis sit consequatur voluptate et nulla. ...
431351972-01-07Consectetur debitis in alias quis magnam elige...
513281988-07-28Doloribus facere perferendis illum necessitati...
\n", "
" ], "text/plain": [ " pet_id visit_date description\n", "id \n", "1 16596 1997-05-04 Aut ipsa illo maiores necessitatibus velit. Do...\n", "2 14427 2014-10-16 Mollitia praesentium quasi libero accusantium ...\n", "3 12582 2002-12-06 Qui nobis sit consequatur voluptate et nulla. ...\n", "4 3135 1972-01-07 Consectetur debitis in alias quis magnam elige...\n", "5 1328 1988-07-28 Doloribus facere perferendis illum necessitati..." ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visits = pd.DataFrame(index=range(1,AMOUNT_VISITS+1))\n", "visits.index.name='id'\n", "visits['pet_id'] = pets.sample(len(visits), replace=True).index\n", "visits['visit_date'] = visits.index.map(lambda x : fake.date())\n", "# just add some random texts\n", "visits['description'] = visits.index.map(lambda x : fake.text())\n", "visits.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Vets\"" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0TrueNoneidFalseINTEGER(4) UNSIGNED
1NaNNonefirst_nameTrueVARCHAR(30)
2NaNNonelast_nameTrueVARCHAR(30)
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 True None id False INTEGER(4) UNSIGNED\n", "1 NaN None first_name True VARCHAR(30)\n", "2 NaN None last_name True VARCHAR(30)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(insp.get_columns('vets'))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
first_namelast_name
id
1TiffanyTownsend
2KarenLewis
3RichardRoberts
4DanielNunez
5BenjaminCastillo
\n", "
" ], "text/plain": [ " first_name last_name\n", "id \n", "1 Tiffany Townsend\n", "2 Karen Lewis\n", "3 Richard Roberts\n", "4 Daniel Nunez\n", "5 Benjamin Castillo" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vets = pd.DataFrame(index=range(1,AMOUNT_VETS+1))\n", "vets.index.name='id'\n", "vets['first_name'] = vets.index.map(lambda x : fake.first_name())\n", "vets['last_name'] = vets.index.map(lambda x : fake.last_name())\n", "vets.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Specialties\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0TrueNoneidFalseINTEGER(4) UNSIGNED
1NaNNonenameTrueVARCHAR(80)
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 True None id False INTEGER(4) UNSIGNED\n", "1 NaN None name True VARCHAR(80)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(insp.get_columns('specialties'))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
id
1Harum
2Fuga
3Perspiciatis
4Laudantium
5Aspernatur
\n", "
" ], "text/plain": [ " name\n", "id \n", "1 Harum\n", "2 Fuga\n", "3 Perspiciatis\n", "4 Laudantium\n", "5 Aspernatur" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "specialties = pd.DataFrame(index=range(1,AMOUNT_SPECIALTIES+1))\n", "specialties.index.name='id'\n", "specialties['name'] = specialties.index.map(lambda x : fake.word().title())\n", "specialties.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fake \"Vet_Specialties\"\n", "OK, this table is special and worth a few words." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
autoincrementdefaultnamenullabletype
0FalseNonevet_idFalseINTEGER(4) UNSIGNED
1FalseNonespecialty_idFalseINTEGER(4) UNSIGNED
\n", "
" ], "text/plain": [ " autoincrement default name nullable type\n", "0 False None vet_id False INTEGER(4) UNSIGNED\n", "1 False None specialty_id False INTEGER(4) UNSIGNED" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(insp.get_columns('vet_specialties'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a many to many join table between the vets table and the specialties table. So we need a table that has the connections to the ids of both tables with the appropriate length \"n x m\". But there is a catch that we have to address later, this is why I use a temporary (\"tmp\") DataFrame:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
specialty_id
1792
111
1440
845
1608
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [1792, 111, 1440, 845, 1608]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vet_specialties_tmp = pd.DataFrame(\n", " index=specialties.sample(\n", " len(vets)*len(specialties),\n", " replace=True).index)\n", "\n", "vet_specialties_tmp.index.name = \"specialty_id\"\n", "vet_specialties_tmp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For all specialties, we assign vets." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
vet_id
specialty_id
17927
111427
1440517
845589
1608350
\n", "
" ], "text/plain": [ " vet_id\n", "specialty_id \n", "1792 7\n", "111 427\n", "1440 517\n", "845 589\n", "1608 350" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vet_specialties_tmp['vet_id'] = vets.sample(len(vet_specialties_tmp), replace=True).index\n", "vet_specialties_tmp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We set the ids of the vets as the index, too." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
specialty_idvet_id
17927
111427
1440517
845589
1608350
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [(1792, 7), (111, 427), (1440, 517), (845, 589), (1608, 350)]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vet_specialties_tmp = vet_specialties_tmp.set_index([vet_specialties_tmp.index, 'vet_id'])\n", "vet_specialties_tmp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have to make sure, that we don't have duplicates in the dataset. We take only the unique index entries and create the actual vet_specialties DataFrame with the right index names." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
specialty_idvet_id
17927
111427
1440517
845589
1608350
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [(1792, 7), (111, 427), (1440, 517), (845, 589), (1608, 350)]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vet_specialties = pd.DataFrame(index=pd.MultiIndex.from_tuples(vet_specialties_tmp.index.unique()))\n", "vet_specialties.index.names =[\"specialty_id\" , \"vet_id\"]\n", "vet_specialties.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we're almost done! So far it seems like a brainless activity in most cases...maybe we can automate that in the future ;-)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Store the data\n", "Now we store the generated data in the database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remove old data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we insert the data, we clean up the existing database by dropping all the tables with all the existing data. We have to do that in the right order to avoid violating any constraints." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "drop_order = [\n", " \"vet_specialties\",\n", " \"specialties\",\n", " \"vets\",\n", " \"visits\",\n", " \"pets\",\n", " \"owners\",\n", " \"types\" \n", " ]\n", "\n", "with engine.connect() as con:\n", " for table in drop_order:\n", " con.execute(\"DROP TABLE IF EXISTS \" + table + \";\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the database schema\n", "We execute the init script that comes with the Spring PetClinic project to get a shiny new database. For this, we read the original file via Pandas read_csv method and make sure, that we break the statements as needed." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sql
0CREATE DATABASE IF NOT EXISTS petclinic
1ALTER DATABASE petclinic DEFAULT CHARACTER SE...
2GRANT ALL PRIVILEGES ON petclinic.* TO pc@loca...
3USE petclinic
4CREATE TABLE IF NOT EXISTS vets ( id INT(4) U...
\n", "
" ], "text/plain": [ " sql\n", "0 CREATE DATABASE IF NOT EXISTS petclinic\n", "1 ALTER DATABASE petclinic DEFAULT CHARACTER SE...\n", "2 GRANT ALL PRIVILEGES ON petclinic.* TO pc@loca...\n", "3 USE petclinic\n", "4 CREATE TABLE IF NOT EXISTS vets ( id INT(4) U..." ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "init_db = pd.read_csv(\"data/spring-petclinic/initDB.sql\", lineterminator=\";\", sep=\"\\u0012\", header=None, names=['sql'])\n", "init_db['sql'] = init_db['sql'].apply(lambda x : x.replace(\"\\r\", \"\").replace(\"\\n\", \"\"))\n", "init_db.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we execute all statements line by line via SQLAlchemy." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with engine.connect() as con:\n", " init_db['sql'].apply(lambda statement : con.execute(statement))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Store the new data\n", "Last but not least, here comes the easy part: Storing the data. We've modeled the DataFrames after the existing tables, so we don't face any problems. The short helper function store avoids heavy code duplication. In here, we use the if_exists=\"append\"-Parameter to reuse the existing database schema (with all the original data types, constraints and indexes). To send data on chunks to the database, we add chunksize=100." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def store(dataframe, table_name):\n", " dataframe.to_sql(table_name, con=engine, if_exists=\"append\", chunksize=100)\n", "\n", "store(owners,'owners')\n", "store(types, 'types')\n", "store(pets, 'pets')\n", "\n", "store(visits, 'visits')\n", "\n", "store(vets, 'vets')\n", "store(specialties, 'specialties')\n", "store(vet_specialties, 'vet_specialties')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "OK, I hope I could help you to understand how you can easily generate nice fake data with Pandas and Faker. I tried to explain the common tasks like generating data and connecting tables via foreign keys.\n", "\n", "Please let me know if you think that there is anything awkward (or good) with my pragmatic approach." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }