{ "metadata": { "name": "", "signature": "sha256:0f9af32c03c3961c6aa73ac9230dcba29211d2d6932125b0b270ce77b14385d3" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Convert A String Categorical Variable With Patsy\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:**\n", "\n", "Originally from: Data Origami." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### import modules" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import patsy" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create dataframe" ] }, { "cell_type": "code", "collapsed": false, "input": [ "raw_data = {'patient': [1, 1, 1, 0, 0], \n", " 'obs': [1, 2, 3, 1, 2], \n", " 'treatment': [0, 1, 0, 1, 0],\n", " 'score': ['strong', 'weak', 'normal', 'weak', 'strong']} \n", "df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])\n", "df" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>patient</th><th>obs</th><th>treatment</th><th>score</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>0</td><td>strong</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>2</td><td>1</td><td>weak</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>3</td><td>0</td><td>normal</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>1</td><td>1</td><td>weak</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>2</td><td>0</td><td>strong</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ " patient obs treatment score\n", "0 1 1 0 strong\n", "1 1 2 1 weak\n", "2 1 3 0 normal\n", "3 0 1 1 weak\n", "4 0 2 0 strong" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert df['score'] into a categorical variable ready for regression (i.e. set one category as the baseline)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# On the 'score' variable in the df dataframe, convert to a categorical variable, and spit out a dataframe\n", "patsy.dmatrix('score', df, return_type='dataframe')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Intercept</th><th>score[T.strong]</th><th>score[T.weak]</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>1</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ " Intercept score[T.strong] score[T.weak]\n", "0 1 1 0\n", "1 1 0 1\n", "2 1 0 0\n", "3 1 0 1\n", "4 1 1 0" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert df['score'] into a categorical variable without setting one category as baseline\n", "\n", "This is likely what you will want to do" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# On the 'score' variable in the df dataframe, convert to a categorical variable, and spit out a dataframe\n", "patsy.dmatrix('score - 1', df, return_type='dataframe')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>score[normal]</th><th>score[strong]</th><th>score[weak]</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>1</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ " score[normal] score[strong] score[weak]\n", "0 0 1 0\n", "1 0 0 1\n", "2 1 0 0\n", "3 0 0 1\n", "4 0 1 0" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a variable that is \"1\" if the variables of patient and treatment are both 1" ] }, { "cell_type": "code", "collapsed": false, "input": [ "patsy.dmatrix('patient + treatment + patient:treatment-1', df, return_type='dataframe')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>patient</th><th>treatment</th><th>patient:treatment</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ " patient treatment patient:treatment\n", "0 1 0 0\n", "1 1 1 1\n", "2 1 0 0\n", "3 0 1 0\n", "4 0 0 0" ] } ], "prompt_number": 18 } ], "metadata": {} } ] }