{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# datatable_demo\n", "\n", "This notebook demonstrates the use of the DataTable object in the ukds package.\n", "\n", "This demonstration uses for an example the following dataset: Gershuny, J., Sullivan, O. (2017). United Kingdom Time Use Survey, 2014-2015. Centre for Time Use Research, University of Oxford. [data collection]. UK Data Service. SN: 8128, http://doi.org/10.5255/UKDA-SN-8128-1\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the ukds package\n", "\n", "This demonstration used the `ukds` package, which is available on PyPi." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import ukds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up a filepath to a .tab data table file\n", "\n", "The filepath to the data table under study is specified here. This can be changed as needed." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "fp_tab=r'C:\\Users\\cvskf\\OneDrive - Loughborough University\\_Data\\United_Kingdom_Time_Use_Survey_2014-2015'+\\\n", " r'\\UKDA-8128-tab\\tab\\uktus15_household.tab'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up a filepath to a UKDS .rtf data dictionary file\n", "\n", "The filepath to the associated data dictionary is specified here. This can be changed as needed." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "fp_dd=r'C:\\Users\\cvskf\\OneDrive - Loughborough University\\_Data\\United_Kingdom_Time_Use_Survey_2014-2015' + \\\n", " r'\\UKDA-8128-tab\\mrdoc\\allissue\\uktus15_household_ukda_data_dictionary.rtf'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a DataTable object \n", "\n", "A DataTable object is created. The filepaths are supplied as arguments and the files are read into the DataTable object." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A class for reading a UK Data Service .tab data table file\n", " \n", "\n" ] } ], "source": [ "dt=ukds.DataTable(fp_tab,fp_dd)\n", "print(dt.__doc__)\n", "print(dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data table .tab file is stored in the `tab` attribute as a pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
serialstratapsuHhOuthh_wtIMonthIYearDM014DM016DM510...Relate10_P1Relate10_P2Relate10_P3Relate10_P4Relate10_P5Relate10_P6Relate10_P7Relate10_P8Relate10_P9Relate10_P10
011010903-2-2598NaN92014000...-2-2.0NaNNaNNaNNaNNaNNaNNaNNaN
111010904-2-2598NaN92014000...-2-2.0NaNNaNNaNNaNNaNNaNNaNNaN
211010906-2-2598NaN102014000...-2-2.0-2.0NaNNaNNaNNaNNaNNaNNaN
311010907-2-2598NaN92014110...-2-2.0-2.0NaNNaNNaNNaNNaNNaNNaN
411010908-2-2598NaN92014000...-2NaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 335 columns

\n", "
" ], "text/plain": [ " serial strata psu HhOut hh_wt IMonth IYear DM014 DM016 DM510 \\\n", "0 11010903 -2 -2 598 NaN 9 2014 0 0 0 \n", "1 11010904 -2 -2 598 NaN 9 2014 0 0 0 \n", "2 11010906 -2 -2 598 NaN 10 2014 0 0 0 \n", "3 11010907 -2 -2 598 NaN 9 2014 1 1 0 \n", "4 11010908 -2 -2 598 NaN 9 2014 0 0 0 \n", "\n", " ... Relate10_P1 Relate10_P2 Relate10_P3 Relate10_P4 Relate10_P5 \\\n", "0 ... -2 -2.0 NaN NaN NaN \n", "1 ... -2 -2.0 NaN NaN NaN \n", "2 ... -2 -2.0 -2.0 NaN NaN \n", "3 ... -2 -2.0 -2.0 NaN NaN \n", "4 ... -2 NaN NaN NaN NaN \n", "\n", " Relate10_P6 Relate10_P7 Relate10_P8 Relate10_P9 Relate10_P10 \n", "0 NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN \n", "\n", "[5 rows x 335 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt.tab.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data dictionary .rtf file is stored in the `datadictionary` attribute as a ukds.DataDictionary object:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt.datadictionary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get dataframe\n", "\n", "The information in the `tab` and `datadictionary` attributes can be combined by the `get_dataframe` method.\n", "\n", "This method returns a new pandas Dataframe in which:\n", "- the columns are a multi-level index which hold the data dictionary information \n", "- the table values are converted from numerical values to the label values, where applicable\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variableserialstratapsuHhOuthh_wtIMonthIYearDM014DM016DM510...Relate10_P1Relate10_P2Relate10_P3Relate10_P4Relate10_P5Relate10_P6Relate10_P7Relate10_P8Relate10_P9Relate10_P10
variable_labelHousehold numberStrataPrimary sampling unitFinal outcome - householdHousehold weightInterview monthInterview YearNumber of children aged 0-14Number of children aged 0-16Number of children aged 5-10...Relate10_P1: How related to person 10Relate10_P2: How related to person 10Relate10_P3: How related to person 10Relate10_P4: How related to person 10Relate10_P5: How related to person 10Relate10_P6: How related to person 10Relate10_P7: How related to person 10Relate10_P8: How related to person 10Relate10_P9: How related to person 10Relate10_P10: How related to person 10
variable_typenumericnumericnumericnumericnumericnumericnumericnumericnumericnumeric...numericnumericnumericnumericnumericnumericnumericnumericnumericnumeric
SPSS_measurement_levelSCALESCALESCALESCALESCALENOMINALNOMINALNOMINALNOMINALNOMINAL...SCALESCALESCALESCALESCALESCALESCALESCALESCALENOMINAL
SPSS_user_missing_values...
pos12345678910...326327328329330331332333334335
011010903Schedule not applicableSchedule not applicableOther reasons why unproductiveNaNSeptember2014000...Schedule not applicableSchedule not applicableNaNNaNNaNNaNNaNNaNNaNNaN
111010904Schedule not applicableSchedule not applicableOther reasons why unproductiveNaNSeptember2014000...Schedule not applicableSchedule not applicableNaNNaNNaNNaNNaNNaNNaNNaN
211010906Schedule not applicableSchedule not applicableOther reasons why unproductiveNaNOctober2014000...Schedule not applicableSchedule not applicableSchedule not applicableNaNNaNNaNNaNNaNNaNNaN
311010907Schedule not applicableSchedule not applicableOther reasons why unproductiveNaNSeptember2014110...Schedule not applicableSchedule not applicableSchedule not applicableNaNNaNNaNNaNNaNNaNNaN
411010908Schedule not applicableSchedule not applicableOther reasons why unproductiveNaNSeptember2014000...Schedule not applicableNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 335 columns

\n", "
" ], "text/plain": [ "variable serial strata \\\n", "variable_label Household number Strata \n", "variable_type numeric numeric \n", "SPSS_measurement_level SCALE SCALE \n", "SPSS_user_missing_values \n", "pos 1 2 \n", "0 11010903 Schedule not applicable \n", "1 11010904 Schedule not applicable \n", "2 11010906 Schedule not applicable \n", "3 11010907 Schedule not applicable \n", "4 11010908 Schedule not applicable \n", "\n", "variable psu \\\n", "variable_label Primary sampling unit \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 3 \n", "0 Schedule not applicable \n", "1 Schedule not applicable \n", "2 Schedule not applicable \n", "3 Schedule not applicable \n", "4 Schedule not applicable \n", "\n", "variable HhOut hh_wt \\\n", "variable_label Final outcome - household Household weight \n", "variable_type numeric numeric \n", "SPSS_measurement_level SCALE SCALE \n", "SPSS_user_missing_values \n", "pos 4 5 \n", "0 Other reasons why unproductive NaN \n", "1 Other reasons why unproductive NaN \n", "2 Other reasons why unproductive NaN \n", "3 Other reasons why unproductive NaN \n", "4 Other reasons why unproductive NaN \n", "\n", "variable IMonth IYear \\\n", "variable_label Interview month Interview Year \n", "variable_type numeric numeric \n", "SPSS_measurement_level NOMINAL NOMINAL \n", "SPSS_user_missing_values \n", "pos 6 7 \n", "0 September 2014 \n", "1 September 2014 \n", "2 October 2014 \n", "3 September 2014 \n", "4 September 2014 \n", "\n", "variable DM014 \\\n", "variable_label Number of children aged 0-14 \n", "variable_type numeric \n", "SPSS_measurement_level NOMINAL \n", "SPSS_user_missing_values \n", "pos 8 \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 1 \n", "4 0 \n", "\n", "variable DM016 \\\n", "variable_label Number of children aged 0-16 \n", "variable_type numeric \n", "SPSS_measurement_level NOMINAL \n", "SPSS_user_missing_values \n", "pos 9 \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 1 \n", "4 0 \n", "\n", "variable DM510 ... \\\n", "variable_label Number of children aged 5-10 ... \n", "variable_type numeric ... \n", "SPSS_measurement_level NOMINAL ... \n", "SPSS_user_missing_values ... \n", "pos 10 ... \n", "0 0 ... \n", "1 0 ... \n", "2 0 ... \n", "3 0 ... \n", "4 0 ... \n", "\n", "variable Relate10_P1 \\\n", "variable_label Relate10_P1: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 326 \n", "0 Schedule not applicable \n", "1 Schedule not applicable \n", "2 Schedule not applicable \n", "3 Schedule not applicable \n", "4 Schedule not applicable \n", "\n", "variable Relate10_P2 \\\n", "variable_label Relate10_P2: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 327 \n", "0 Schedule not applicable \n", "1 Schedule not applicable \n", "2 Schedule not applicable \n", "3 Schedule not applicable \n", "4 NaN \n", "\n", "variable Relate10_P3 \\\n", "variable_label Relate10_P3: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 328 \n", "0 NaN \n", "1 NaN \n", "2 Schedule not applicable \n", "3 Schedule not applicable \n", "4 NaN \n", "\n", "variable Relate10_P4 \\\n", "variable_label Relate10_P4: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 329 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "variable Relate10_P5 \\\n", "variable_label Relate10_P5: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 330 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "variable Relate10_P6 \\\n", "variable_label Relate10_P6: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 331 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "variable Relate10_P7 \\\n", "variable_label Relate10_P7: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 332 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "variable Relate10_P8 \\\n", "variable_label Relate10_P8: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 333 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "variable Relate10_P9 \\\n", "variable_label Relate10_P9: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level SCALE \n", "SPSS_user_missing_values \n", "pos 334 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "variable Relate10_P10 \n", "variable_label Relate10_P10: How related to person 10 \n", "variable_type numeric \n", "SPSS_measurement_level NOMINAL \n", "SPSS_user_missing_values \n", "pos 335 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", "[5 rows x 335 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=dt.get_dataframe()\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }