{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Table Enforcer Demo\n", "\n", "\n", "## Description\n", "\n", "A python package to facilitate the iterative process of developing and\n", "using schema-like representations of table data to recode and validate\n", "instances of these data stored in pandas DataFrames. This is a fairly\n", "young attempt to solve a recurrent problem many people have. So far I\n", "have looked at multiple solutions, but none really did it for me.\n", "\n", "They either deal primarily with JSON encoded data or they only really solve\n", "the validation side of the problem and consider recoding to be a\n", "separate issue. They seem to assume that recoding and cleaning has\n", "already been done and all we care about is making sure the final product\n", "is sane.\n", "\n", "To me, this seems backwards.\n", "\n", "I need to load, recode, and validate tables all day, everyday. Sometimes\n", "its simple; I can `pandas.read_table()` and all is good. But sometimes I\n", "have a 700 column long RedCap data dump that is complicated af, and it\n", "really helps me to develop my recoding logic through an iterative\n", "process. For me it makes sense to couple the recoding process directly\n", "with the validation process: to write the \"tests\" for each column first,\n", "then add recoding logic in steps until the tests pass.\n", "\n", "So Table Enforcer is my attempt to apply a sort of \"test driven\n", "development\" workflow to data cleaning and validation.\n", "\n", "## Basic Workflow\n", "\n", "1. For each column that you care about in your source table:\n", " 1. Define a `Column` object that represents the ideal state of your\n", " data by passing a list of small, independent, reusable validator\n", " functions and some descriptive information.\n", " 2. Use this object to validate the column data from your source\n", " table.\n", " - It will fail.\n", " 3. 
Add small, composable, reusable recoding functions to the column\n", "       object and iterate until your validations pass.\n", "2. Define an `Enforcer` object by passing it a list of your column\n", "   representation objects.\n", "3. This enforcer can be used to recode or validate recoded tables of\n", "   the same kind as your source table wherever your applications use\n", "   that type of data.\n", "\n", "Please take a look and offer thoughts/advice.\n", "\n", "  - Free software: MIT license\n", "  - Web site: \n", "  - Documentation: .\n", "\n", "## Features\n", "\n", "  - `Enforcer` and `Column` classes to define what columns should look\n", "    like in a table.\n", "  - `CompoundColumn` class that supports complex operations, including\n", "    \"one-to-many\" and \"many-to-one\" recoding logic, for when a column\n", "    tries to do too much and should really be multiple columns, or the\n", "    reverse.\n", "  - Growing cadre of built-in validator functions and decorators.\n", "  - Decorators for use in defining parameterized validators like\n", "    `between_4_and_60()`.
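, "\n", "As a taste of that last feature, here is a rough sketch of how a parameterized validator like `between_4_and_60()` can be built from a decorator factory. This is an illustrative stand-in only; it mirrors the idea behind the package's `minmax` decorator, not its actual implementation:\n", "\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def minmax(low, high):\n",
"    \"\"\"Decorator factory: wrap a validator so it also checks low <= x <= high.\"\"\"\n",
"    def decorator(func):\n",
"        def wrapper(series):\n",
"            series = func(series)\n",
"            return (series >= low) & (series <= high)\n",
"        wrapper.__name__ = func.__name__\n",
"        return wrapper\n",
"    return decorator\n",
"\n",
"@minmax(low=4, high=60)\n",
"def between_4_and_60(series):\n",
"    return series\n",
"\n",
"between_4_and_60(pd.Series([3, 10, 61])).tolist()  # [False, True, False]\n",
"```"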
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:50.170525Z", "start_time": "2018-02-14T20:36:50.152356Z" } }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:50.538563Z", "start_time": "2018-02-14T20:36:50.172364Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "import re\n", "\n", "from box import Box\n", "import pandas as pd\n", "import numpy as np\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table Enforcer" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:50.561705Z", "start_time": "2018-02-14T20:36:50.540462Z" } }, "outputs": [], "source": [ "from table_enforcer import Enforcer, Column, CompoundColumn\n", "import table_enforcer.errors as e\n", "\n", "from table_enforcer import validate as v\n", "from table_enforcer import recode as r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Constants" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load or create your Table" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:50.742115Z", "start_time": "2018-02-14T20:36:50.563415Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table_path = \"../tests/files/demo_table.csv\"\n", "df = pd.read_csv(table_path)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Validator Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Built-in Validators\n", "\n", "Some validator functions come built-in (See the `table_enforcer/validate/funcs.py` module code for current list)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:50.764140Z", "start_time": "2018-02-14T20:36:50.746789Z" } }, "outputs": [], "source": [ "# ! cat $v.funcs.__file__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validator Call Signatures\n", "\n", "In general, validators take a single `pandas.Series` object as input and return a `pandas.Series` of the same shape and indexes containing `True` or `False` relative to which items passed the validation logic. 
\n", "\n", "This changes a little if you are using a decorator like `@maxmin` (_See section on Decorating Validators_)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing Custom Validators" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:50.952317Z", "start_time": "2018-02-14T20:36:50.929601Z" } }, "outputs": [], "source": [ "def gte2(series):\n", " return series >= 2\n", "\n", "def lte10(series):\n", " return series <= 10\n", "\n", "def length_is_one(series):\n", " return series.str.len() == 1\n", "\n", "def valid_sex(series):\n", " sex = set(['M', 'F'])\n", " return series.isin(sex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decorating Validators\n", "You can define decorators that extend the call signature of validators to accomadate more complex things like ranges.\n", "\n", "Here we can define a single validator that accomplishes the same thing as the combination of the first two validators that we defined above:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.064903Z", "start_time": "2018-02-14T20:36:51.036035Z" } }, "outputs": [], "source": [ "@v.decorators.minmax(low=2, high=10)\n", "def bt_2_and_10(series):\n", " \"\"\"Test that the data items fall within range: 2 <= x <= 10.\"\"\"\n", " return series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Note:__ pay attention to the fact that here we return the __original__ series object. _We didn't do ANYTHING to it._ The testing gets done in the `@v.decorators.minmax` decorator function. This allows our validators to always expect a single argument at run-time and always return a single argument. 
This is _very_ important.\n", "\n", "### Writing Custom Validator Decorators\n", "\n", "This is a bit beyond this demo's scope, but here is the `table_enforcer/validate/decorators.py` code so that you can get an idea of how to do it if you are brave." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.162212Z", "start_time": "2018-02-14T20:36:51.145349Z" } }, "outputs": [], "source": [ "# !cat $v.decorators.__file__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Defining Columns\n", "\n", "Now that we have a few validator functions, it's time to put together a representation of our table and use that object to help us get a final data table in a form we like. We will start with `col1`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.371062Z", "start_time": "2018-02-14T20:36:51.320455Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets say for this example we **only** care about `col1`,`col3`, and `col4`.\n", "\n", "We create a column object for each column we want that contains the name of the column we want to represent. We set the options to describe our ideal format. We set the values for each option for each column including a a list of validator functions that should all pass for all items in each column if we got our way.\n", "\n", "For now, we will ignore the `recoders` argument. What we end up with is the three column objects below." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.518575Z", "start_time": "2018-02-14T20:36:51.455214Z" } }, "outputs": [], "source": [ "col1 = Column(name='col1',\n", " dtype=np.int,\n", " unique=False,\n", " validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],\n", " recoders=None)\n", "\n", "col3 = Column(name='col3',\n", " dtype=np.int,\n", " unique=True,\n", " validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],\n", " recoders=None)\n", "\n", "col4 = Column(name='col4',\n", " dtype=str,\n", " unique=False,\n", " validators=[v.funcs.upper, length_is_one, valid_sex],\n", " recoders=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validating a Column object\n", "For each column object we store the validator functions in a `dict`.\n", "\n", "Here is the value for `col4`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.643722Z", "start_time": "2018-02-14T20:36:51.562661Z" } }, "outputs": [ { "data": { "text/plain": [ "{'length_is_one': ,\n", " 
'upper': <function upper at 0x...>,\n", " 'valid_sex': <function valid_sex at 0x...>}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col4.validators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We validate a column by passing our entire original dataframe to the column object. Each test runs independently, and once the data has passed through every test we get a dataframe of results. We can see right away which data items fail which tests.\n", "\n", "This helps us plan our recoding efforts later. We can immediately see what sorts of operations need to be run to convert those `False` tests to `True`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.754279Z", "start_time": "2018-02-14T20:36:51.670249Z" }, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
length_is_oneuppervalid_sexdtype
0FalseFalseFalseTrue
1TrueFalseFalseTrue
2TrueTrueTrueTrue
3FalseFalseFalseTrue
\n", "
" ], "text/plain": [ " length_is_one upper valid_sex dtype\n", "0 False False False True\n", "1 True False False True\n", "2 True True True True\n", "3 False False False True" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col4.validate(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In reality we can have hundreds or thousands of rows and it would be nice to focus on only those that fail validation. Here is how you do that." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:51.880391Z", "start_time": "2018-02-14T20:36:51.782530Z" }, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
length_is_oneuppervalid_sexdtype
0FalseFalseFalseTrue
1TrueFalseFalseTrue
3FalseFalseFalseTrue
\n", "
" ], "text/plain": [ " length_is_one upper valid_sex dtype\n", "0 False False False True\n", "1 True False False True\n", "3 False False False True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col4.validate(df, failed_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait a minute... Where did that validation test called `dtype` come from?\n", "\n", "Turns out we get that one for free since we defined that this column needs to be of `dtype` `np.int`. We get a similar `unique` validation check inserted if we set that option to `True`. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recoder Functions\n", "\n", "Recoder functions have a very similar structure to validator functions. But their purpose is to coerce the raw data into a more useful form. For example `Col4` is obviously meant to represent 'male'/'female' logic. But we don't want to have to interpret all possible ways to represent that logic every time we want to use that column. So we write a recoder function to do convert all items to either \"M\" or \"F\" so we only need to understand two choices from now on." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:52.029978Z", "start_time": "2018-02-14T20:36:51.999938Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Built-in Recoders\n", "\n", "Some recoder functions come built-in (See the `table_enforcer/recode/funcs.py` module code for current list)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:52.125086Z", "start_time": "2018-02-14T20:36:52.107938Z" } }, "outputs": [], "source": [ "# !cat $r.funcs.__file__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recoder Call Signatures\n", "\n", "Like validators, recoders take a single `pandas.Series` object as input and return a `pandas.Series` of the same shape and indexes as the original series object. However, instead of returning a series of `True`/`False` values, it performs some operation on the data that gets the column data closer to being how you want it to look during analysis operations.\n", "\n", "## Writing Custom Recoders\n", "\n", "### Recoders are Meant to be Composable\n", "\n", "You are able to build a single, monolithic recoder that transforms a column all the way into what you want in a single step. But it may be better to write recoders that perform only a single step on the way to getting a particular column into shape. \n", "\n", "This results in reusable functions that are \"composable\". Meaning that for the NEXT column, you may not even need to write a new function. All you may need to do is list a sequence of recoders that already exist. \n", "\n", "This is because, recoders are applied in a pipeline model. The output from the first is supplied to the second etc.\n", "\n", "Now, lets write our recoders to transform `col4`." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:52.274906Z", "start_time": "2018-02-14T20:36:52.224742Z" } }, "outputs": [], "source": [ "def standardize_sex(series):\n", " \"\"\"Return a series where common representations of 'male'/'female' are standardized.\n", " \n", " Things like ['M', 'MALE', 'M', 'BOY', ...] are converted to `M`.\n", " Representations of female are treated similarly. \n", " \"\"\"\n", " mapper = {\"M\": \"M\",\n", " \"MALE\": \"M\",\n", " \"BOY\": \"M\",\n", " \"F\": \"F\",\n", " \"FEMALE\": \"F\",\n", " \"GIRL\": \"F\",\n", " }\n", " if series.str.islower().any():\n", " raise ValueError(\"standardize_sex expects input series to contain only UPPERCASE letters.\")\n", " else:\n", " return series.apply(lambda x: mapper[x])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we plan to call the built-in recoder `upper` first, we only need to support uppercase text here. This is kind of a silly requirement in practice but it demonstrates how you can write composable recoders to fail if expectations are not met." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Iteration: Validate Recoded Table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets revisit our original definition of `col4` and add the recoders we have now." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:52.627560Z", "start_time": "2018-02-14T20:36:52.609558Z" } }, "outputs": [], "source": [ "col4 = Column(name='col4',\n", " dtype=str,\n", " unique=False,\n", " validators=[v.funcs.upper, length_is_one, valid_sex],\n", " recoders=[r.funcs.upper, standardize_sex])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now what do we have after we recode this column?" 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:52.746143Z", "start_time": "2018-02-14T20:36:52.721506Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col4
0M
1M
2F
3F
\n", "
" ], "text/plain": [ " col4\n", "0 M\n", "1 M\n", "2 F\n", "3 F" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col4.recode(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks pretty good. But let's not rely on our eyes and instead rely on the validation tests we defined. \n", "\n", "Now we try `col4.recode` again but this time tell it to perform validation after recoding and see what happens.\n", "\n", "If all is well, we will simply get the recoded column back. If validation fails, we will get a `ValidationError` raised." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:52.872898Z", "start_time": "2018-02-14T20:36:52.840705Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col4
0M
1M
2F
3F
\n", "
" ], "text/plain": [ " col4\n", "0 M\n", "1 M\n", "2 F\n", "3 F" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col4.recode(df, validate=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are good! We can now move on to our other columns and repeat the process until all the validation check pass on all of our columns. Then we can populate our `Enforcer` object and we will be nearing the end of our initial sanity check / recoding phase for this table. " ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-11-14T22:36:25.371934Z", "start_time": "2017-11-14T22:36:25.347337Z" } }, "source": [ "# Defining Table Enforcers\n", "\n", "Just pass a list of the column objects we created to the `Enforcer`. The validation/recoding api for the `Enforcer` mirrors the one for `Columns` so you can reuse what we learned above." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.073198Z", "start_time": "2018-02-14T20:36:53.055550Z" } }, "outputs": [], "source": [ "demo = Enforcer(columns=[col1,col3,col4])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.170458Z", "start_time": "2018-02-14T20:36:53.074747Z" } }, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "demo.columns" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.319126Z", "start_time": "2018-02-14T20:36:53.172985Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col3col4
0710M
126M
262F
355F
\n", "
" ], "text/plain": [ " col1 col3 col4\n", "0 7 10 M\n", "1 2 6 M\n", "2 6 2 F\n", "3 5 5 F" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "demo.recode(df, validate=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__NOTE:__\n", "\n", "- we will see what happens when the `demo.recode(df, validate=True)` encounters problems in recoding or validating when we pass some bad data to our `load_csv()` function later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing loading functions that use enforcers\n", "\n", "Use these enforcers to recode or validate recoded tables of the same kind as your source table wherever your applications use that type of data like in your loading functions.\n", "\n", "You can pass them into a function as an argument or reference them from the script's local scope as you would an imported function.\n", "\n", "I pass it in here so that I can easily switch between different enforcers for the examples. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.400218Z", "start_time": "2018-02-14T20:36:53.320895Z" } }, "outputs": [], "source": [ "def load_csv(path, enforcer):\n", " df = pd.read_csv(path)\n", " return enforcer.recode(df, validate=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading Good Data" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.512678Z", "start_time": "2018-02-14T20:36:53.402456Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.637337Z", "start_time": "2018-02-14T20:36:53.514450Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col3col4
0710M
126M
262F
355F
\n", "
" ], "text/plain": [ " col1 col3 col4\n", "0 7 10 M\n", "1 2 6 M\n", "2 6 2 F\n", "3 5 5 F" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = load_csv(path=table_path, enforcer=demo)\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading Bad Data \n", "In this case we load a CSV that is slightly different that the one we built our `enforcer` on: perhaps this is represents a new data dump or data from another source.\n", "\n", "## The problems\n", "1. `col1` includes a value less than 2 and we have a validator that requires that all values be between 2 and 10. \n", " - Perhaps this is because this value represents some lab test that has a limit of detection of 2 so we can't confidently distinguish between 0-2.\n", " - So we want to set all values below the detection limit to the detection limit itself. However, we neglected to include a recoder for this since we didnt see these types of values in our first data-set.\n", " - We will see how this causes a `ValidationError` with helpful information to be raised.\n", "\n", "2. `Feemale` is not found in the `standardize_sex()` recoder's `mapper` dictionary.\n", " - Here we encounter a value that actually has nothing to do with a validation check failing, but breaks our recoder itself.\n", " - **This time**, it causes a helpful `RecodingError` to be raised.\n", "\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:53.731006Z", "start_time": "2018-02-14T20:36:53.641883Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2162F10:ten010
3655Feemale6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 1 6 2 F 10:ten 0 1 0\n", "3 6 5 5 Feemale 6:Six 1 1 1" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table_path_bad = \"../tests/files/demo_table_bad.csv\"\n", "df_bad = pd.read_csv(table_path_bad)\n", "df_bad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ValidationErrors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `ValidationError` is raised when we try to get a recoded table or column while setting the `recode` method's option `validate=True`.\n", "\n", "Lets see what happens when we try to load this table with our original `demo` enforcer." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:54.263306Z", "start_time": "2018-02-14T20:36:54.072997Z" } }, "outputs": [ { "ename": "ValidationError", "evalue": "Rows that failed to validate for column 'col1':\n bt_2_and_10 not_null positive dtype\n2 False True True True", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValidationError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mload_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtable_path_bad\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0menforcer\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdemo\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mload_csv\u001b[0;34m(path, enforcer)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mload_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0menforcer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0menforcer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mrecode\u001b[0;34m(self, table, validate)\u001b[0m\n\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcolumn\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 69\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcolumn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate_dataframe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtable\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 70\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mupdate_dataframe\u001b[0;34m(self, df, table, validate)\u001b[0m\n\u001b[1;32m 81\u001b[0m \u001b[0;34m\"\"\"Perform ``self.recode`` and add resulting 
column(s) to ``df`` and return ``df``.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 82\u001b[0m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 83\u001b[0;31m \u001b[0mrecoded_columns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 84\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrecoded_columns\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 85\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mrecode\u001b[0;34m(self, table, validate)\u001b[0m\n\u001b[1;32m 202\u001b[0m \u001b[0mfailed_rows\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfind_failed_rows\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_frame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 203\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfailed_rows\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m>\u001b[0m 
\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 204\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValidationError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf"Rows that failed to validate for column '{self.name}':\\n{failed_rows}"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 205\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 206\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_frame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValidationError\u001b[0m: Rows that failed to validate for column 'col1':\n bt_2_and_10 not_null positive dtype\n2 False True True True" ] } ], "source": [ "load_csv(path=table_path_bad, enforcer=demo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A column object raised a `ValidationError` because, inside the `recode` method, it saw data that did not pass validation (we told it to validate). The error also gives us some helpful information:\n", "\n", "1. The column name\n", "2. The rows that did not pass validation\n", "3. Which validations failed on each row\n", "\n", "Now we add a recoder that deals with that value and move on."
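, "\n",
"\n",
"Under the hood, the traceback shows the pattern: the column collects one boolean result column per validator, keeps the rows where any validator returned `False`, and raises if any remain. A simplified, self-contained sketch of that logic (the helper and example frame below are illustrative, not the library's actual code):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"class ValidationError(Exception):\n",
"    pass\n",
"\n",
"def find_failed_rows(results):\n",
"    # keep only the rows where at least one validator returned False\n",
"    return results[~results.all(axis=1)]\n",
"\n",
"# illustrative validation results for 'col1': one boolean column per validator\n",
"results = pd.DataFrame({'bt_2_and_10': [True, True, False, True],\n",
"                        'not_null': [True] * 4,\n",
"                        'positive': [True] * 4,\n",
"                        'dtype': [True] * 4})\n",
"\n",
"failed_rows = find_failed_rows(results)\n",
"# the library raises ValidationError at this point when failed_rows is\n",
"# non-empty; here only row 2 fails (its bt_2_and_10 check is False)\n",
"```"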
] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:50:57.031675Z", "start_time": "2018-02-14T20:50:56.978604Z" } }, "outputs": [], "source": [ "# write another recoder\n", "def handle_values_below_detection_limit(series):\n", " series[series < 2] = 2\n", " return series\n", "\n", "# add the recoder to a Column representing `col1`\n", "col1_new = Column(name='col1',\n", " dtype=np.int,\n", " unique=False,\n", " validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],\n", " recoders=[handle_values_below_detection_limit])\n", "\n", "# build enforcer with the corrected Column object\n", "demo2 = Enforcer(columns=[col1_new,\n", " col3,\n", " col4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets try this again with `demo2`." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:37:38.919074Z", "start_time": "2018-02-14T20:37:38.805609Z" } }, "outputs": [ { "ename": "RecodingError", "evalue": "Recoder 'standardize_sex' raised the following error on column 'col4': KeyError('FEEMALE',).", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mrecode\u001b[0;34m(self, table, validate)\u001b[0m\n\u001b[1;32m 196\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 197\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecoder\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 198\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mBaseException\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m 
\u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m\u001b[0m in \u001b[0;36mstandardize_sex\u001b[0;34m(series)\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 17\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mseries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mmapper\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.anaconda/envs/table_enforcer/lib/python3.6/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36mapply\u001b[0;34m(self, func, convert_dtype, args, **kwds)\u001b[0m\n\u001b[1;32m 2550\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masobject\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2551\u001b[0;31m \u001b[0mmapped\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmap_infer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconvert\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconvert_dtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2552\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32mpandas/_libs/src/inference.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.map_infer\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 17\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mseries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m:\u001b[0m 
\u001b[0mmapper\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mKeyError\u001b[0m: 'FEEMALE'", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mRecodingError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf3\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtable_path_bad\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0menforcer\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdemo2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m\u001b[0m in \u001b[0;36mload_csv\u001b[0;34m(path, enforcer)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mload_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0menforcer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0menforcer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mrecode\u001b[0;34m(self, table, validate)\u001b[0m\n\u001b[1;32m 67\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m 
\u001b[0;32mfor\u001b[0m \u001b[0mcolumn\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 69\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcolumn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate_dataframe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtable\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 70\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mupdate_dataframe\u001b[0;34m(self, df, table, validate)\u001b[0m\n\u001b[1;32m 81\u001b[0m \u001b[0;34m\"\"\"Perform ``self.recode`` and add resulting column(s) to ``df`` and return ``df``.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 82\u001b[0m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 83\u001b[0;31m \u001b[0mrecoded_columns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 84\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mrecoded_columns\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 85\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py\u001b[0m in \u001b[0;36mrecode\u001b[0;34m(self, table, validate)\u001b[0m\n\u001b[1;32m 197\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecoder\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 198\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mBaseException\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 199\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mRecodingError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrecoder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 200\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 201\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mvalidate\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mRecodingError\u001b[0m: Recoder 'standardize_sex' raised the following error on column 'col4': KeyError('FEEMALE',)." ] } ], "source": [ "df3 = load_csv(path=table_path_bad, enforcer=demo2)\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Problem 1__ seems to be fixed, but now we get this new `RecodingError`." 
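, "\n",
"\n",
"The traceback also reveals how the wrapping works: each recoder runs inside a `try` block, and any exception it raises is re-raised as a `RecodingError` that names the offending recoder and column. A self-contained sketch of that pattern (class and function names here are illustrative, not the library's actual code):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"class RecodingError(Exception):\n",
"    def __init__(self, col_name, recoder, err):\n",
"        super().__init__(f'Recoder {recoder.__name__!r} raised the following '\n",
"                         f'error on column {col_name!r}: {err!r}.')\n",
"\n",
"def apply_recoders(col_name, data, recoders):\n",
"    # mirror of the try/except seen in the traceback above\n",
"    for recoder in recoders:\n",
"        try:\n",
"            data = recoder(data)\n",
"        except BaseException as err:\n",
"            raise RecodingError(col_name, recoder, err)\n",
"    return data\n",
"\n",
"def standardize_sex(series):\n",
"    mapper = {'M': 'M', 'MALE': 'M', 'F': 'F', 'FEMALE': 'F'}\n",
"    return series.apply(lambda x: mapper[x])\n",
"\n",
"try:\n",
"    apply_recoders('col4', pd.Series(['MALE', 'F', 'FEEMALE']), [standardize_sex])\n",
"except RecodingError as err:\n",
"    print(err)  # names the recoder, the column, and the underlying KeyError\n",
"```"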
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RecodingErrors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `RecodingError` is raised when a recoder function fails and raises an exception: any exception.\n", "\n", "Like the `ValidationError` it tries to give you as much helpful information about where and how it failed.\n", "\n", "1. The recoder that raised an exception\n", "2. The column name\n", "3. And the exception and its value.\n", "\n", "In this case we need to update our recoder to deal with this new typo of the word \"female\"" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:51:07.170254Z", "start_time": "2018-02-14T20:51:07.098889Z" } }, "outputs": [], "source": [ "# update the recoder\n", "def standardize_sex2(series):\n", " \"\"\"Return a series where common representations of 'male'/'female' are standardized.\n", " \n", " Things like ['M', 'MALE', 'M', 'BOY', ...] are converted to `M`.\n", " Representations of female are treated similarly. 
\n", " \"\"\"\n", " mapper = {\"M\": \"M\",\n", " \"MALE\": \"M\",\n", " \"BOY\": \"M\",\n", " \"F\": \"F\",\n", " \"FEMALE\": \"F\",\n", " \"GIRL\": \"F\",\n", " \"FEEMALE\": \"F\",\n", " }\n", " if series.str.islower().any():\n", " raise ValueError(\"standardize_sex expects input series to contain only UPPERCASE letters.\")\n", " else:\n", " return series.apply(lambda x: mapper[x])\n", "\n", "# add the recoder to a Column representing `col4`\n", "col4_new = Column(name='col4',\n", " dtype=str,\n", " unique=False,\n", " validators=[v.funcs.upper, length_is_one, valid_sex],\n", " recoders=[r.funcs.upper, standardize_sex2])\n", "\n", "# build enforcer with the corrected Column object\n", "demo3 = Enforcer(columns=[col1_new,\n", " col3,\n", " col4_new])" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2018-02-02T16:19:56.618954Z", "start_time": "2018-02-02T16:19:56.583069Z" } }, "source": [ "OK one more time with `demo3` this time." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:51:10.259969Z", "start_time": "2018-02-14T20:51:10.202687Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col3col4
0710M
126M
222F
365F
\n", "
" ], "text/plain": [ " col1 col3 col4\n", "0 7 10 M\n", "1 2 6 M\n", "2 2 2 F\n", "3 6 5 F" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4 = load_csv(path=table_path_bad, enforcer=demo3)\n", "df4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Success! __Problem 2__ is no also fixed and we are ready to get back to work.\n", "\n", "\n", "...until the next head ache." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The CompoundColumn Class " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Overview of `CompoundColumn` class:__\n", "\n", "- The `CompoundColumn` class is based on the same class that the `Column` inherits from (`BaseColumn`).\n", "- This means it has many methods with the same names and general behavior as the `Column` class that you are used to.\n", " - namely: `validate`, `recode`, and `update_dataframe`\n", " - `update_dataframe` is used by the `Enforcer` class and determines how the Column object adds columns to the final table.\n", "- However they behave slightly differently because this class has a slightly different job to do than `Column`.\n", "- This class is actually contains and coordinates **multiple** `Column` objects.\n", "- These objects allow more complex transformation and validation logic that spans multiple columns in either or both the original table and final table.\n", " - Transformations like **Many-To-One** and **One-To-Many** column conversions for example.\n", "- It is structured sort of like a container of `Column`s and will actually contain three plain `Column` objects.\n", "- Here we introduce the concept of *input_columns* and *output_columns*.\n", "- Its call signature is fairly different than its cousin the `Column`.\n", "- It accepts only three arguments:\n", " - `input_columns`: A list of fully defined `Column` objects, each representing a single column in the **original** `Dataframe`. 
\n", " - `output_columns`: A list of fully defined `Column` objects, each representing a single column in the **final** `DataFrame`. \n", " - `column_transform`: A function accepting the input `DataFrame`, performing transformations to it and returning a new `DataFrame` containing the TRANSFORMED columns only.\n", " \n", "__Interaction with `Enforcer` objects:__\n", "\n", "- These are subclasses of `BaseColumn` and implement the three essential methods compatible with the interface that `Enforcer` objects expect each column to use. \n", " - `validate`\n", " - `recode`\n", " - `update_dataframe`\n", "- So adding them to `Enforcer` objects works exactly like the `Column` objects we already know." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:55.141538Z", "start_time": "2018-02-14T20:36:55.124259Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Construct a new ``CompoundColumn`` object.\n", "\n", " Args:\n", " input_columns (list, Column): A list of ``Column`` objects representing column(s) from the SOURCE table.\n", " output_columns (list, Column): A list of ``Column`` objects representing column(s) from the FINAL table.\n", " column_transform (Callable): Function accepting the table object, performing transformations to it and returning a DataFrame containing the NEW columns only.\n", " \n" ] } ], "source": [ "print(CompoundColumn.__init__.__doc__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-to-many Column Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Now we will be focusing on column 5.\n", "- Here we want to split it into two columns because it represents complex information encoded into a single column.\n", "- It would be better for us to treat it as separate columns.\n", "- In these examples we will split `col5` into two new columns so we must keep tract of how to deal with **three** columns:\n", " - the original 
column plus two derivative columns." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:55.384938Z", "start_time": "2018-02-14T20:36:55.351781Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Definition and Usage" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:55.612668Z", "start_time": "2018-02-14T20:36:55.485509Z" } }, "outputs": [], "source": [ "\n", "## Validators and Recoders\n", "bad_chars = re.compile(pattern=\"\"\"[*(]\"\"\")\n", "\n", "def no_bad_characters(series):\n", " \"\"\"Validator\"\"\"\n", " def test(x):\n", " if bad_chars.search(x) is None:\n", " return True\n", " else:\n", " return False\n", " return series.astype(str).apply(test)\n", "\n", "\n", "def fix_bad_characters(series):\n", " \"\"\"Recoder\"\"\"\n", " def fix(x):\n", " return bad_chars.sub(repl='', string=x)\n", " return series.astype(str).apply(fix)\n", "\n", "def recode_upper(series):\n", " return series.astype(str).str.upper()\n", "\n", "def to_int(series):\n", " return series.astype(np.int)\n", "\n", "def to_str(series):\n", " return series.astype(str)\n", "\n", "\n", "# Transformation function\n", "def split_on_colon(df):\n", " cols = Box()\n", " cols.col5_number = df.col5.apply(lambda x: x.split(\":\")[0])\n", " cols.col5_word = df.col5.apply(lambda x: x.split(\":\")[1])\n", " \n", " new_columns = pd.DataFrame(cols)[[\"col5_number\", \"col5_word\"]]\n", " return new_columns\n", "\n", "\n", "\n", "## Defining the Input Column\n", "col5 = Column(\n", " name='col5',\n", " dtype=str,\n", " unique=False,\n", " validators=[v.funcs.not_null, no_bad_characters],\n", " recoders=[fix_bad_characters])\n", "\n", "## Defining the Output Columns (col5_a/col5_b)\n", "col5_a = Column(\n", " name='col5_number',\n", " dtype=np.int,\n", " unique=False,\n", " 
validators=[v.funcs.not_null,],\n", " recoders=[to_int],)\n", "\n", "col5_b = Column(\n", " name='col5_word',\n", " dtype=str,\n", " unique=False,\n", " validators=[v.funcs.not_null, v.funcs.upper, no_bad_characters],\n", " recoders=[to_str, recode_upper],)\n", "\n", "## Defining the Compound Column\n", "col5_split = CompoundColumn(input_columns=[col5],\n", " output_columns=[col5_a, col5_b],\n", " column_transform=split_on_colon)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- When we run `col5_split.validate` we get a more complex dataframe than before.\n", "- It contains the combined validation results for every column involved (three columns in this case)." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:55.784520Z", "start_time": "2018-02-14T20:36:55.614270Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dtypeno_bad_charactersnot_nullupper
validation_typecolumn_namerow
inputcol50TrueFalseTrueTrue
1TrueTrueTrueTrue
2TrueTrueTrueTrue
3TrueTrueTrueTrue
outputcol5_number0FalseTrueTrueTrue
1FalseTrueTrueTrue
2FalseTrueTrueTrue
3FalseTrueTrueTrue
col5_word0TrueFalseTrueFalse
1TrueTrueTrueFalse
2TrueTrueTrueFalse
3TrueTrueTrueFalse
\n", "
" ], "text/plain": [ " dtype no_bad_characters not_null upper\n", "validation_type column_name row \n", "input col5 0 True False True True\n", " 1 True True True True\n", " 2 True True True True\n", " 3 True True True True\n", "output col5_number 0 False True True True\n", " 1 False True True True\n", " 2 False True True True\n", " 3 False True True True\n", " col5_word 0 True False True False\n", " 1 True True True False\n", " 2 True True True False\n", " 3 True True True False" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col5_split.validate(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `col5_split.recode` returns a `pd.DataFrame` representing the `output_columns` in the provided order." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:55.866106Z", "start_time": "2018-02-14T20:36:55.786364Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col5_numbercol5_word
01ONE
13THREE
210TEN
36SIX
\n", "
" ], "text/plain": [ " col5_number col5_word\n", "0 1 ONE\n", "1 3 THREE\n", "2 10 TEN\n", "3 6 SIX" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col5_split.recode(df, validate=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Many-To-One Column Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Lets look at columns col6, col7, and col8.\n", " - Imagine that these columns result from a situation were a subject's tissue was submitted for zero or more different tests.\n", " - These columns track which tests were done for each subject.\n", "- In a pure SQL-like relational context we would likely want to break these data out into a supporting table with a one-to-many relation (1 subject to *N* tests).\n", "- But what if the data is destined for a more flexible format like MongoDB, where you are able to store these data in a list-like format.\n", " - In that case we may want to collect these columns into a single column where each item is a `list` representing the names of which if any of the tests were done for each subject.\n", "- We can do this using the `CompoundColumn` class." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:56.015486Z", "start_time": "2018-02-14T20:36:55.986607Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Definition and Usage" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:56.445739Z", "start_time": "2018-02-14T20:36:56.128775Z" } }, "outputs": [], "source": [ "# helper functions\n", "def is_subset(x, ref_set):\n", " if not isinstance(ref_set, set):\n", " valid = set(ref_set)\n", " \n", " if isinstance(x, (list, tuple, set)):\n", " set_x = set(x)\n", " else:\n", " set_x = set([x])\n", " \n", " return set_x.issubset(ref_set)\n", "\n", "\n", "# Transformation function\n", "def join_as_tuple(df):\n", " cols = Box()\n", " cols.col6_7_8 = df[[\"col6\", \"col7\", \"col8\"]].apply(lambda row: (row.col6, row.col7, row.col8,), axis=1)\n", " \n", " new_columns = pd.DataFrame(cols)\n", " return new_columns\n", " \n", "\n", "# Validators\n", "def col6_valid_values(series):\n", " \"\"\"Validator\"\"\"\n", " valid = [None, \"DNASeq\"]\n", " return series.apply(is_subset, ref_set=valid)\n", "\n", "def col7_valid_values(series):\n", " \"\"\"Validator\"\"\"\n", " valid = [None, \"Protein Function\"]\n", " return series.apply(is_subset, ref_set=valid)\n", "\n", "def col8_valid_values(series):\n", " \"\"\"Validator\"\"\"\n", " valid = [None, \"RNASeq\"]\n", " return series.apply(is_subset, ref_set=valid)\n", "\n", "\n", "def col6_7_8_valid_values(series):\n", " \"\"\"Validator\"\"\"\n", " valid = set([\"DNASeq\", \"Protein Function\", \"RNASeq\"])\n", " return series.apply(is_subset, ref_set=valid)\n", "\n", "# Recoders\n", "def translate_col6(series):\n", " \"\"\"Recode 0-> None; 1-> 'DNASeq' \"\"\"\n", " def rcode(x):\n", " mapping = {0: None, 1: \"DNASeq\"}\n", " return 
mapping[x]\n", " return series.apply(rcode)\n", " \n", "def translate_col7(series):\n", " \"\"\"Recode 0-> None; 1-> 'Protein Function' \"\"\"\n", " def rcode(x):\n", " mapping = {0: None, 1: \"Protein Function\"}\n", " return mapping[x]\n", " return series.apply(rcode)\n", " \n", "def translate_col8(series):\n", " \"\"\"Recode 0-> None; 1-> 'RNASeq' \"\"\"\n", " def rcode(x):\n", " mapping = {0: None, 1: \"RNASeq\"}\n", " return mapping[x]\n", " return series.apply(rcode)\n", "\n", "def setify_drop_nones(series):\n", " \"\"\"Convert to sets and drop ``None`` values.\"\"\"\n", " def drop_nones(x):\n", " x.discard(None)\n", " return x\n", " return series.apply(lambda x: set(x)).apply(drop_nones).apply(list)\n", "\n", "# Defining the Input Columns\n", "col6 = Column(\n", " name='col6',\n", " dtype=(str, type(None)),\n", " unique=False,\n", " validators=[col6_valid_values],\n", " recoders=[translate_col6],)\n", "\n", "col7 = Column(\n", " name='col7',\n", " dtype=(str, type(None)),\n", " unique=False,\n", " validators=[col7_valid_values],\n", " recoders=[translate_col7],)\n", "\n", "col8 = Column(\n", " name='col8',\n", " dtype=(str, type(None)),\n", " unique=False,\n", " validators=[col8_valid_values],\n", " recoders=[translate_col8],)\n", "\n", "\n", "# Defining the Output Column\n", "col6_7_8 = Column(\n", " name='col6_7_8',\n", " dtype=list,\n", " unique=False,\n", " validators=[v.funcs.not_null, col6_7_8_valid_values],\n", " recoders=[setify_drop_nones],)\n", "\n", "# Defining the Compound Column\n", "col6_7_8_join = CompoundColumn(input_columns=[col6, col7, col8],\n", " output_columns=[col6_7_8],\n", " column_transform=join_as_tuple)\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:56.595678Z", "start_time": "2018-02-14T20:36:56.447377Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- When we run `col6_7_8_join.validate` we, again, get a more complex dataframe than with a `Column` object." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:56.748596Z", "start_time": "2018-02-14T20:36:56.597387Z" }, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col6_7_8_valid_valuescol6_valid_valuescol7_valid_valuescol8_valid_valuesdtypenot_null
validation_typecolumn_namerow
inputcol60TrueFalseTrueTrueFalseTrue
1TrueFalseTrueTrueFalseTrue
2TrueFalseTrueTrueFalseTrue
3TrueFalseTrueTrueFalseTrue
col70TrueTrueFalseTrueFalseTrue
1TrueTrueFalseTrueFalseTrue
2TrueTrueFalseTrueFalseTrue
3TrueTrueFalseTrueFalseTrue
col80TrueTrueTrueFalseFalseTrue
1TrueTrueTrueFalseFalseTrue
2TrueTrueTrueFalseFalseTrue
3TrueTrueTrueFalseFalseTrue
outputcol6_7_80FalseTrueTrueTrueFalseTrue
1FalseTrueTrueTrueFalseTrue
2FalseTrueTrueTrueFalseTrue
3FalseTrueTrueTrueFalseTrue
\n", "
" ], "text/plain": [ " col6_7_8_valid_values col6_valid_values \\\n", "validation_type column_name row \n", "input col6 0 True False \n", " 1 True False \n", " 2 True False \n", " 3 True False \n", " col7 0 True True \n", " 1 True True \n", " 2 True True \n", " 3 True True \n", " col8 0 True True \n", " 1 True True \n", " 2 True True \n", " 3 True True \n", "output col6_7_8 0 False True \n", " 1 False True \n", " 2 False True \n", " 3 False True \n", "\n", " col7_valid_values col8_valid_values dtype \\\n", "validation_type column_name row \n", "input col6 0 True True False \n", " 1 True True False \n", " 2 True True False \n", " 3 True True False \n", " col7 0 False True False \n", " 1 False True False \n", " 2 False True False \n", " 3 False True False \n", " col8 0 True False False \n", " 1 True False False \n", " 2 True False False \n", " 3 True False False \n", "output col6_7_8 0 True True False \n", " 1 True True False \n", " 2 True True False \n", " 3 True True False \n", "\n", " not_null \n", "validation_type column_name row \n", "input col6 0 True \n", " 1 True \n", " 2 True \n", " 3 True \n", " col7 0 True \n", " 1 True \n", " 2 True \n", " 3 True \n", " col8 0 True \n", " 1 True \n", " 2 True \n", " 3 True \n", "output col6_7_8 0 True \n", " 1 True \n", " 2 True \n", " 3 True " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col6_7_8_join.validate(df)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:56.848036Z", "start_time": "2018-02-14T20:36:56.751350Z" }, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col6_7_8
0[RNASeq, Protein Function]
1[DNASeq]
2[Protein Function]
3[DNASeq, RNASeq, Protein Function]
\n", "
" ], "text/plain": [ " col6_7_8\n", "0 [RNASeq, Protein Function]\n", "1 [DNASeq]\n", "2 [Protein Function]\n", "3 [DNASeq, RNASeq, Protein Function]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col6_7_8_join.recode(df, validate=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building an Enforcer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Now lets see how these fit together into an `Enforcer` object.\n", "- We are using the same table as before but now we add our new `CompoundColumns`." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:56.983776Z", "start_time": "2018-02-14T20:36:56.849652Z" } }, "outputs": [], "source": [ "now_w_moar_complexity = Enforcer(columns=[col1,\n", " col3,\n", " col4,\n", " col5_split,\n", " col6_7_8_join])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We can do all the same stuff with this `Enforcer`.\n", "- This is because each subclass of `BaseColumn` includes a special set of methods that allow it to interact with `Enforcer` objects in their specific ways." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.165249Z", "start_time": "2018-02-14T20:36:56.985662Z" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this should fail\n", "now_w_moar_complexity.validate(df)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.281155Z", "start_time": "2018-02-14T20:36:57.167100Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col3col4col5_numbercol5_wordcol6_7_8
0710M1ONE[RNASeq, Protein Function]
126M3THREE[DNASeq]
262F10TEN[Protein Function]
355F6SIX[DNASeq, RNASeq, Protein Function]
\n", "
" ], "text/plain": [ " col1 col3 col4 col5_number col5_word col6_7_8\n", "0 7 10 M 1 ONE [RNASeq, Protein Function]\n", "1 2 6 M 3 THREE [DNASeq]\n", "2 6 2 F 10 TEN [Protein Function]\n", "3 5 5 F 6 SIX [DNASeq, RNASeq, Protein Function]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this should not\n", "now_w_moar_complexity.recode(df, validate=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The original table again:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.329325Z", "start_time": "2018-02-14T20:36:57.282806Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4col5col6col7col8
07310male01:one*011
1246m3:Three100
2662F10:ten010
3555Female6:Six111
\n", "
" ], "text/plain": [ " col1 col2 col3 col4 col5 col6 col7 col8\n", "0 7 3 10 male 01:one* 0 1 1\n", "1 2 4 6 m 3:Three 1 0 0\n", "2 6 6 2 F 10:ten 0 1 0\n", "3 5 5 5 Female 6:Six 1 1 1" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing Facilitation" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.435377Z", "start_time": "2018-02-14T20:36:57.331086Z" } }, "outputs": [], "source": [ "def sort_columns(df):\n", " return df.T.sort_index().T.sort_index()\n", "\n", "def check_and_print(df, column):\n", " if sort_columns(column.validate(df).reset_index()).equals(sort_columns(pd.read_json(validate_all_json))):\n", " print(f\"validate_all_json = '''{validate_all_json}'''\")\n", " else:\n", " raise ValueError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OTM" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.561967Z", "start_time": "2018-02-14T20:36:57.437018Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validate_input_json = '''{\"validation_type\":{\"0\":\"input\",\"1\":\"input\",\"2\":\"input\",\"3\":\"input\"},\"column_name\":{\"0\":\"col5\",\"1\":\"col5\",\"2\":\"col5\",\"3\":\"col5\"},\"row\":{\"0\":0,\"1\":1,\"2\":2,\"3\":3},\"no_bad_characters\":{\"0\":false,\"1\":true,\"2\":true,\"3\":true},\"not_null\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true},\"dtype\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column_namedtypeno_bad_charactersnot_nullrowvalidation_type
0col5TrueFalseTrue0input
1col5TrueTrueTrue1input
2col5TrueTrueTrue2input
3col5TrueTrueTrue3input
\n", "
" ], "text/plain": [ " column_name dtype no_bad_characters not_null row validation_type\n", "0 col5 True False True 0 input\n", "1 col5 True True True 1 input\n", "2 col5 True True True 2 input\n", "3 col5 True True True 3 input" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validate_input = col5_split._validate_input(df).reset_index()\n", "validate_input\n", "validate_input_json = validate_input.to_json()\n", "\n", "print(f\"validate_input_json = '''{validate_input_json}'''\")\n", "pd.read_json(validate_input_json)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.674692Z", "start_time": "2018-02-14T20:36:57.564495Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validate_output_json = '''{\"validation_type\":{\"0\":\"output\",\"1\":\"output\",\"2\":\"output\",\"3\":\"output\",\"4\":\"output\",\"5\":\"output\",\"6\":\"output\",\"7\":\"output\"},\"column_name\":{\"0\":\"col5_number\",\"1\":\"col5_number\",\"2\":\"col5_number\",\"3\":\"col5_number\",\"4\":\"col5_word\",\"5\":\"col5_word\",\"6\":\"col5_word\",\"7\":\"col5_word\"},\"row\":{\"0\":0,\"1\":1,\"2\":2,\"3\":3,\"4\":0,\"5\":1,\"6\":2,\"7\":3},\"dtype\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false,\"4\":true,\"5\":true,\"6\":true,\"7\":true},\"no_bad_characters\":{\"0\":null,\"1\":null,\"2\":null,\"3\":null,\"4\":false,\"5\":true,\"6\":true,\"7\":true},\"not_null\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true},\"upper\":{\"0\":null,\"1\":null,\"2\":null,\"3\":null,\"4\":false,\"5\":false,\"6\":false,\"7\":false}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column_namedtypeno_bad_charactersnot_nullrowuppervalidation_type
0col5_numberFalseNaNTrue0NaNoutput
1col5_numberFalseNaNTrue1NaNoutput
2col5_numberFalseNaNTrue2NaNoutput
3col5_numberFalseNaNTrue3NaNoutput
4col5_wordTrue0.0True00.0output
5col5_wordTrue1.0True10.0output
6col5_wordTrue1.0True20.0output
7col5_wordTrue1.0True30.0output
\n", "
" ], "text/plain": [ " column_name dtype no_bad_characters not_null row upper validation_type\n", "0 col5_number False NaN True 0 NaN output\n", "1 col5_number False NaN True 1 NaN output\n", "2 col5_number False NaN True 2 NaN output\n", "3 col5_number False NaN True 3 NaN output\n", "4 col5_word True 0.0 True 0 0.0 output\n", "5 col5_word True 1.0 True 1 0.0 output\n", "6 col5_word True 1.0 True 2 0.0 output\n", "7 col5_word True 1.0 True 3 0.0 output" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validate_output = col5_split._validate_output(df).reset_index()\n", "validate_output\n", "validate_output_json = validate_output.to_json()\n", "\n", "print(f\"validate_output_json = '''{validate_output_json}'''\") \n", "pd.read_json(validate_output_json)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.788188Z", "start_time": "2018-02-14T20:36:57.679358Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validate_all_json = 
'''{\"validation_type\":{\"0\":\"input\",\"1\":\"input\",\"2\":\"input\",\"3\":\"input\",\"4\":\"output\",\"5\":\"output\",\"6\":\"output\",\"7\":\"output\",\"8\":\"output\",\"9\":\"output\",\"10\":\"output\",\"11\":\"output\"},\"column_name\":{\"0\":\"col5\",\"1\":\"col5\",\"2\":\"col5\",\"3\":\"col5\",\"4\":\"col5_number\",\"5\":\"col5_number\",\"6\":\"col5_number\",\"7\":\"col5_number\",\"8\":\"col5_word\",\"9\":\"col5_word\",\"10\":\"col5_word\",\"11\":\"col5_word\"},\"row\":{\"0\":0,\"1\":1,\"2\":2,\"3\":3,\"4\":0,\"5\":1,\"6\":2,\"7\":3,\"8\":0,\"9\":1,\"10\":2,\"11\":3},\"dtype\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":false,\"5\":false,\"6\":false,\"7\":false,\"8\":true,\"9\":true,\"10\":true,\"11\":true},\"no_bad_characters\":{\"0\":false,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":false,\"9\":true,\"10\":true,\"11\":true},\"not_null\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":true,\"9\":true,\"10\":true,\"11\":true},\"upper\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":false,\"9\":false,\"10\":false,\"11\":false}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
column_namedtypeno_bad_charactersnot_nullrowuppervalidation_type
0col5TrueFalseTrue0Trueinput
1col5TrueTrueTrue1Trueinput
10col5_wordTrueTrueTrue2Falseoutput
11col5_wordTrueTrueTrue3Falseoutput
2col5TrueTrueTrue2Trueinput
3col5TrueTrueTrue3Trueinput
4col5_numberFalseTrueTrue0Trueoutput
5col5_numberFalseTrueTrue1Trueoutput
6col5_numberFalseTrueTrue2Trueoutput
7col5_numberFalseTrueTrue3Trueoutput
8col5_wordTrueFalseTrue0Falseoutput
9col5_wordTrueTrueTrue1Falseoutput
\n", "
" ], "text/plain": [ " column_name dtype no_bad_characters not_null row upper \\\n", "0 col5 True False True 0 True \n", "1 col5 True True True 1 True \n", "10 col5_word True True True 2 False \n", "11 col5_word True True True 3 False \n", "2 col5 True True True 2 True \n", "3 col5 True True True 3 True \n", "4 col5_number False True True 0 True \n", "5 col5_number False True True 1 True \n", "6 col5_number False True True 2 True \n", "7 col5_number False True True 3 True \n", "8 col5_word True False True 0 False \n", "9 col5_word True True True 1 False \n", "\n", " validation_type \n", "0 input \n", "1 input \n", "10 output \n", "11 output \n", "2 input \n", "3 input \n", "4 output \n", "5 output \n", "6 output \n", "7 output \n", "8 output \n", "9 output " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validate_all = col5_split.validate(df).reset_index()\n", "validate_all\n", "validate_all_json = validate_all.to_json()\n", "\n", "print(f\"validate_all_json = '''{validate_all_json}'''\") \n", "pd.read_json(validate_all_json)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MTO" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:57.921233Z", "start_time": "2018-02-14T20:36:57.789925Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validate_input_json = 
'''{\"validation_type\":{\"0\":\"input\",\"1\":\"input\",\"2\":\"input\",\"3\":\"input\",\"4\":\"input\",\"5\":\"input\",\"6\":\"input\",\"7\":\"input\",\"8\":\"input\",\"9\":\"input\",\"10\":\"input\",\"11\":\"input\"},\"column_name\":{\"0\":\"col6\",\"1\":\"col6\",\"2\":\"col6\",\"3\":\"col6\",\"4\":\"col7\",\"5\":\"col7\",\"6\":\"col7\",\"7\":\"col7\",\"8\":\"col8\",\"9\":\"col8\",\"10\":\"col8\",\"11\":\"col8\"},\"row\":{\"0\":0,\"1\":1,\"2\":2,\"3\":3,\"4\":0,\"5\":1,\"6\":2,\"7\":3,\"8\":0,\"9\":1,\"10\":2,\"11\":3},\"col6_valid_values\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false,\"4\":null,\"5\":null,\"6\":null,\"7\":null,\"8\":null,\"9\":null,\"10\":null,\"11\":null},\"col7_valid_values\":{\"0\":null,\"1\":null,\"2\":null,\"3\":null,\"4\":false,\"5\":false,\"6\":false,\"7\":false,\"8\":null,\"9\":null,\"10\":null,\"11\":null},\"col8_valid_values\":{\"0\":null,\"1\":null,\"2\":null,\"3\":null,\"4\":null,\"5\":null,\"6\":null,\"7\":null,\"8\":false,\"9\":false,\"10\":false,\"11\":false},\"dtype\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false,\"4\":false,\"5\":false,\"6\":false,\"7\":false,\"8\":false,\"9\":false,\"10\":false,\"11\":false}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col6_valid_valuescol7_valid_valuescol8_valid_valuescolumn_namedtyperowvalidation_type
00.0NaNNaNcol6False0input
10.0NaNNaNcol6False1input
10NaNNaN0.0col8False2input
11NaNNaN0.0col8False3input
20.0NaNNaNcol6False2input
30.0NaNNaNcol6False3input
4NaN0.0NaNcol7False0input
5NaN0.0NaNcol7False1input
6NaN0.0NaNcol7False2input
7NaN0.0NaNcol7False3input
8NaNNaN0.0col8False0input
9NaNNaN0.0col8False1input
\n", "
" ], "text/plain": [ " col6_valid_values col7_valid_values col8_valid_values column_name \\\n", "0 0.0 NaN NaN col6 \n", "1 0.0 NaN NaN col6 \n", "10 NaN NaN 0.0 col8 \n", "11 NaN NaN 0.0 col8 \n", "2 0.0 NaN NaN col6 \n", "3 0.0 NaN NaN col6 \n", "4 NaN 0.0 NaN col7 \n", "5 NaN 0.0 NaN col7 \n", "6 NaN 0.0 NaN col7 \n", "7 NaN 0.0 NaN col7 \n", "8 NaN NaN 0.0 col8 \n", "9 NaN NaN 0.0 col8 \n", "\n", " dtype row validation_type \n", "0 False 0 input \n", "1 False 1 input \n", "10 False 2 input \n", "11 False 3 input \n", "2 False 2 input \n", "3 False 3 input \n", "4 False 0 input \n", "5 False 1 input \n", "6 False 2 input \n", "7 False 3 input \n", "8 False 0 input \n", "9 False 1 input " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validate_input = col6_7_8_join._validate_input(df).reset_index()\n", "validate_input\n", "validate_input_json = validate_input.to_json()\n", "\n", "print(f\"validate_input_json = '''{validate_input_json}'''\")\n", "pd.read_json(validate_input_json)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:58.013834Z", "start_time": "2018-02-14T20:36:57.925570Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validate_output_json = '''{\"validation_type\":{\"0\":\"output\",\"1\":\"output\",\"2\":\"output\",\"3\":\"output\"},\"column_name\":{\"0\":\"col6_7_8\",\"1\":\"col6_7_8\",\"2\":\"col6_7_8\",\"3\":\"col6_7_8\"},\"row\":{\"0\":0,\"1\":1,\"2\":2,\"3\":3},\"col6_7_8_valid_values\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false},\"not_null\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true},\"dtype\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col6_7_8_valid_valuescolumn_namedtypenot_nullrowvalidation_type
0Falsecol6_7_8FalseTrue0output
1Falsecol6_7_8FalseTrue1output
2Falsecol6_7_8FalseTrue2output
3Falsecol6_7_8FalseTrue3output
\n", "
" ], "text/plain": [ " col6_7_8_valid_values column_name dtype not_null row validation_type\n", "0 False col6_7_8 False True 0 output\n", "1 False col6_7_8 False True 1 output\n", "2 False col6_7_8 False True 2 output\n", "3 False col6_7_8 False True 3 output" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validate_output = col6_7_8_join._validate_output(df).reset_index()\n", "validate_output\n", "validate_output_json = validate_output.to_json()\n", "\n", "print(f\"validate_output_json = '''{validate_output_json}'''\")\n", "pd.read_json(validate_output_json)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:58.154965Z", "start_time": "2018-02-14T20:36:58.015578Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "validate_all_json = '''{\"validation_type\":{\"0\":\"input\",\"1\":\"input\",\"2\":\"input\",\"3\":\"input\",\"4\":\"input\",\"5\":\"input\",\"6\":\"input\",\"7\":\"input\",\"8\":\"input\",\"9\":\"input\",\"10\":\"input\",\"11\":\"input\",\"12\":\"output\",\"13\":\"output\",\"14\":\"output\",\"15\":\"output\"},\"column_name\":{\"0\":\"col6\",\"1\":\"col6\",\"2\":\"col6\",\"3\":\"col6\",\"4\":\"col7\",\"5\":\"col7\",\"6\":\"col7\",\"7\":\"col7\",\"8\":\"col8\",\"9\":\"col8\",\"10\":\"col8\",\"11\":\"col8\",\"12\":\"col6_7_8\",\"13\":\"col6_7_8\",\"14\":\"col6_7_8\",\"15\":\"col6_7_8\"},\"row\":{\"0\":0,\"1\":1,\"2\":2,\"3\":3,\"4\":0,\"5\":1,\"6\":2,\"7\":3,\"8\":0,\"9\":1,\"10\":2,\"11\":3,\"12\":0,\"13\":1,\"14\":2,\"15\":3},\"col6_7_8_valid_values\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":true,\"9\":true,\"10\":true,\"11\":true,\"12\":false,\"13\":false,\"14\":false,\"15\":false},\"col6_valid_values\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":true,\"9\":true,\"10\":true,\"11\":true,\"12\":true,\"13\":true,\"14\":tr
ue,\"15\":true},\"col7_valid_values\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":false,\"5\":false,\"6\":false,\"7\":false,\"8\":true,\"9\":true,\"10\":true,\"11\":true,\"12\":true,\"13\":true,\"14\":true,\"15\":true},\"col8_valid_values\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":false,\"9\":false,\"10\":false,\"11\":false,\"12\":true,\"13\":true,\"14\":true,\"15\":true},\"dtype\":{\"0\":false,\"1\":false,\"2\":false,\"3\":false,\"4\":false,\"5\":false,\"6\":false,\"7\":false,\"8\":false,\"9\":false,\"10\":false,\"11\":false,\"12\":false,\"13\":false,\"14\":false,\"15\":false},\"not_null\":{\"0\":true,\"1\":true,\"2\":true,\"3\":true,\"4\":true,\"5\":true,\"6\":true,\"7\":true,\"8\":true,\"9\":true,\"10\":true,\"11\":true,\"12\":true,\"13\":true,\"14\":true,\"15\":true}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col6_7_8_valid_valuescol6_valid_valuescol7_valid_valuescol8_valid_valuescolumn_namedtypenot_nullrowvalidation_type
0TrueFalseTrueTruecol6FalseTrue0input
1TrueFalseTrueTruecol6FalseTrue1input
10TrueTrueTrueFalsecol8FalseTrue2input
11TrueTrueTrueFalsecol8FalseTrue3input
12FalseTrueTrueTruecol6_7_8FalseTrue0output
13FalseTrueTrueTruecol6_7_8FalseTrue1output
14FalseTrueTrueTruecol6_7_8FalseTrue2output
15FalseTrueTrueTruecol6_7_8FalseTrue3output
2TrueFalseTrueTruecol6FalseTrue2input
3TrueFalseTrueTruecol6FalseTrue3input
4TrueTrueFalseTruecol7FalseTrue0input
5TrueTrueFalseTruecol7FalseTrue1input
6TrueTrueFalseTruecol7FalseTrue2input
7TrueTrueFalseTruecol7FalseTrue3input
8TrueTrueTrueFalsecol8FalseTrue0input
9TrueTrueTrueFalsecol8FalseTrue1input
\n", "
" ], "text/plain": [ " col6_7_8_valid_values col6_valid_values col7_valid_values \\\n", "0 True False True \n", "1 True False True \n", "10 True True True \n", "11 True True True \n", "12 False True True \n", "13 False True True \n", "14 False True True \n", "15 False True True \n", "2 True False True \n", "3 True False True \n", "4 True True False \n", "5 True True False \n", "6 True True False \n", "7 True True False \n", "8 True True True \n", "9 True True True \n", "\n", " col8_valid_values column_name dtype not_null row validation_type \n", "0 True col6 False True 0 input \n", "1 True col6 False True 1 input \n", "10 False col8 False True 2 input \n", "11 False col8 False True 3 input \n", "12 True col6_7_8 False True 0 output \n", "13 True col6_7_8 False True 1 output \n", "14 True col6_7_8 False True 2 output \n", "15 True col6_7_8 False True 3 output \n", "2 True col6 False True 2 input \n", "3 True col6 False True 3 input \n", "4 True col7 False True 0 input \n", "5 True col7 False True 1 input \n", "6 True col7 False True 2 input \n", "7 True col7 False True 3 input \n", "8 False col8 False True 0 input \n", "9 False col8 False True 1 input " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validate_all = col6_7_8_join.validate(df).reset_index()\n", "validate_all\n", "validate_all_json = validate_all.to_json()\n", "\n", "print(f\"validate_all_json = '''{validate_all_json}'''\")\n", "pd.read_json(validate_all_json)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Enforcer" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2018-02-14T20:36:58.283355Z", "start_time": "2018-02-14T20:36:58.157359Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "enforcer_recode_json = 
'''{\"col1\":{\"0\":7,\"1\":2,\"2\":6,\"3\":5},\"col3\":{\"0\":10,\"1\":6,\"2\":2,\"3\":5},\"col4\":{\"0\":\"M\",\"1\":\"M\",\"2\":\"F\",\"3\":\"F\"},\"col5_number\":{\"0\":1,\"1\":3,\"2\":10,\"3\":6},\"col5_word\":{\"0\":\"ONE\",\"1\":\"THREE\",\"2\":\"TEN\",\"3\":\"SIX\"},\"col6_7_8\":{\"0\":[\"RNASeq\",\"Protein Function\"],\"1\":[\"DNASeq\"],\"2\":[\"Protein Function\"],\"3\":[\"DNASeq\",\"RNASeq\",\"Protein Function\"]}}'''\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col3col4col5_numbercol5_wordcol6_7_8
0710M1ONE[RNASeq, Protein Function]
126M3THREE[DNASeq]
262F10TEN[Protein Function]
355F6SIX[DNASeq, RNASeq, Protein Function]
\n", "
" ], "text/plain": [ " col1 col3 col4 col5_number col5_word col6_7_8\n", "0 7 10 M 1 ONE [RNASeq, Protein Function]\n", "1 2 6 M 3 THREE [DNASeq]\n", "2 6 2 F 10 TEN [Protein Function]\n", "3 5 5 F 6 SIX [DNASeq, RNASeq, Protein Function]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enforcer_recode = now_w_moar_complexity.recode(df, validate=True)\n", "\n", "enforcer_recode_json = enforcer_recode.to_json()\n", "print(f\"enforcer_recode_json = '''{enforcer_recode_json}'''\")\n", "pd.read_json(enforcer_recode_json)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "table_enforcer", "language": "python", "name": "table_enforcer" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "991px", "left": "0px", "right": "2145px", "top": "207.011px", "width": "361px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }