{ "metadata": { "name": "", "signature": "sha256:b60e51d051fa6c9c3777600a4e6e0401de982cb75bb521366c45044e1941d9e7" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "3. Data Pre-processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data. Different models have different sensitivities to the type of predictors in the model; *how* the predictors enter the model is also important.\n", "\n", "The need for data pre-processing is determined by the type of model being used. Some procedures, such as tree-based models, are notably insensitive to the characteristics of the predictor data. Others, like linear regression, are not. In this chapter, a wide array of possible methodologies is discussed.\n", "\n", "How the predictors are encoded, called *feature engineering*, can have a significant impact on model performance. Often the most effective encoding of the data is informed by the modeler's understanding of the problem and thus is not derived from any mathematical technique." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "3.1 Case Study: Cell Segmentation in High-Content Screening" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the data file exists." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ls -l ../datasets/segmentationOriginal/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total 4016\r\n", "-rw-r--r-- 1 leigong staff 2053006 Nov 24 15:58 segmentationOriginal.csv\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset, from Hill et al. (2007), consists of 2019 cells. 
Of these cells, 1300 were judged to be poorly segmented (PS) and 719 were well segmented (WS); 1009 cells were reserved for the training set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import pandas as pd\n", "\n", "cell_segmentation = pd.read_csv(\"../datasets/segmentationOriginal/segmentationOriginal.csv\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "cell_segmentation.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "(2019, 120)" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A first look at the dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "cell_segmentation.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | Unnamed: 0 | Cell | Case | Class | AngleCh1 | AngleStatusCh1 | AreaCh1 | AreaStatusCh1 | AvgIntenCh1 | AvgIntenCh2 | ... | VarIntenCh1 | VarIntenCh3 | VarIntenCh4 | VarIntenStatusCh1 | VarIntenStatusCh3 | VarIntenStatusCh4 | WidthCh1 | WidthStatusCh1 | XCentroid | YCentroid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 207827637 | Test | PS | 143.247705 | 1 | 185 | 0 | 15.711864 | 3.954802 | ... | 12.474676 | 7.609035 | 2.714100 | 0 | 2 | 2 | 10.642974 | 2 | 42 | 14 |
| 1 | 2 | 207932307 | Train | PS | 133.752037 | 0 | 819 | 1 | 31.923274 | 205.878517 | ... | 18.809225 | 56.715352 | 118.388139 | 0 | 0 | 0 | 32.161261 | 1 | 215 | 347 |
| 2 | 3 | 207932463 | Train | WS | 106.646387 | 0 | 431 | 0 | 28.038835 | 115.315534 | ... | 17.295643 | 37.671053 | 49.470524 | 0 | 0 | 0 | 21.185525 | 0 | 371 | 252 |
| 3 | 4 | 207932470 | Train | PS | 69.150325 | 0 | 298 | 0 | 19.456140 | 101.294737 | ... | 13.818968 | 30.005643 | 24.749537 | 0 | 0 | 2 | 13.392830 | 0 | 487 | 295 |
| 4 | 5 | 207932455 | Test | PS | 2.887837 | 2 | 285 | 0 | 24.275735 | 111.415441 | ... | 15.407972 | 20.504288 | 45.450457 | 0 | 0 | 0 | 13.198561 | 0 | 283 | 159 |

5 rows × 120 columns
| Case | Unnamed: 0 | Cell | Class | AngleCh1 | AngleStatusCh1 | AreaCh1 | AreaStatusCh1 | AvgIntenCh1 | AvgIntenCh2 | AvgIntenCh3 | ... | VarIntenCh1 | VarIntenCh3 | VarIntenCh4 | VarIntenStatusCh1 | VarIntenStatusCh3 | VarIntenStatusCh4 | WidthCh1 | WidthStatusCh1 | XCentroid | YCentroid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | ... | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 | 1010 |
| Train | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | ... | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 | 1009 |

2 rows × 119 columns
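
The per-partition counts above (1010 test rows, 1009 training rows) come from the `Case` column. A minimal sketch of counting and selecting rows by that column follows; the small frame `toy` is an illustrative stand-in built from the first five rows shown earlier (with only a few of the 120 columns), not the notebook's own code.

```python
import pandas as pd

# Illustrative stand-in: first five rows of the dataset, a few columns only.
toy = pd.DataFrame({
    "Cell": [207827637, 207932307, 207932463, 207932470, 207932455],
    "Case": ["Test", "Train", "Train", "Train", "Test"],
    "Class": ["PS", "PS", "WS", "PS", "PS"],
    "AreaCh1": [185, 819, 431, 298, 285],
})

# Number of rows in each partition (one value per group).
case_counts = toy.groupby("Case").size()

# A boolean mask keeps only the training rows; the original row index
# is preserved, so the result is indexed 1, 2, 3 rather than 0, 1, 2.
train = toy[toy["Case"] == "Train"]

print(case_counts)
print(train)
```

Because the mask does not reset the index, a training subset of the full data would be expected to keep labels from the original frame instead of a fresh 0-based range.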
| | Unnamed: 0 | Cell | Case | Class | AngleCh1 | AngleStatusCh1 | AreaCh1 | AreaStatusCh1 | AvgIntenCh1 | AvgIntenCh2 | ... | VarIntenCh1 | VarIntenCh3 | VarIntenCh4 | VarIntenStatusCh1 | VarIntenStatusCh3 | VarIntenStatusCh4 | WidthCh1 | WidthStatusCh1 | XCentroid | YCentroid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 207932307 | Train | PS | 133.752037 | 0 | 819 | 1 | 31.923274 | 205.878517 | ... | 18.809225 | 56.715352 | 118.388139 | 0 | 0 | 0 | 32.161261 | 1 | 215 | 347 |
| 2 | 3 | 207932463 | Train | WS | 106.646387 | 0 | 431 | 0 | 28.038835 | 115.315534 | ... | 17.295643 | 37.671053 | 49.470524 | 0 | 0 | 0 | 21.185525 | 0 | 371 | 252 |
| 3 | 4 | 207932470 | Train | PS | 69.150325 | 0 | 298 | 0 | 19.456140 | 101.294737 | ... | 13.818968 | 30.005643 | 24.749537 | 0 | 0 | 2 | 13.392830 | 0 | 487 | 295 |
| 11 | 12 | 207932484 | Train | WS | 109.416426 | 0 | 256 | 0 | 18.828571 | 125.938776 | ... | 13.922937 | 18.643027 | 40.331747 | 0 | 0 | 2 | 17.546861 | 0 | 211 | 495 |
| 14 | 15 | 207932459 | Train | PS | 104.278654 | 0 | 258 | 0 | 17.570850 | 124.368421 | ... | 12.324971 | 17.747143 | 41.928533 | 0 | 0 | 2 | 17.660339 | 0 | 172 | 207 |

5 rows × 120 columns
| | AngleCh1 | AngleStatusCh1 | AreaCh1 | AreaStatusCh1 | AvgIntenCh1 | AvgIntenCh2 | AvgIntenCh3 | AvgIntenCh4 | AvgIntenStatusCh1 | AvgIntenStatusCh2 | ... | VarIntenCh1 | VarIntenCh3 | VarIntenCh4 | VarIntenStatusCh1 | VarIntenStatusCh3 | VarIntenStatusCh4 | WidthCh1 | WidthStatusCh1 | XCentroid | YCentroid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 133.752037 | 0 | 819 | 1 | 31.923274 | 205.878517 | 69.916880 | 164.153453 | 0 | 0 | ... | 18.809225 | 56.715352 | 118.388139 | 0 | 0 | 0 | 32.161261 | 1 | 215 | 347 |
| 2 | 106.646387 | 0 | 431 | 0 | 28.038835 | 115.315534 | 63.941748 | 106.696602 | 0 | 0 | ... | 17.295643 | 37.671053 | 49.470524 | 0 | 0 | 0 | 21.185525 | 0 | 371 | 252 |
| 3 | 69.150325 | 0 | 298 | 0 | 19.456140 | 101.294737 | 28.217544 | 31.028070 | 0 | 0 | ... | 13.818968 | 30.005643 | 24.749537 | 0 | 0 | 2 | 13.392830 | 0 | 487 | 295 |
| 11 | 109.416426 | 0 | 256 | 0 | 18.828571 | 125.938776 | 13.600000 | 46.800000 | 0 | 0 | ... | 13.922937 | 18.643027 | 40.331747 | 0 | 0 | 2 | 17.546861 | 0 | 211 | 495 |
| 14 | 104.278654 | 0 | 258 | 0 | 17.570850 | 124.368421 | 22.461538 | 71.206478 | 0 | 0 | ... | 12.324971 | 17.747143 | 41.928533 | 0 | 0 | 2 | 17.660339 | 0 | 172 | 207 |

5 rows × 116 columns
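
The table above has 116 columns and no longer shows `Unnamed: 0`, `Cell`, `Case`, or `Class`, which suggests the identifier and label columns were set aside before working with the predictors. One way this could be done in pandas is sketched below; the frame `train` and the names `id_cols`, `labels`, and `predictors` are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Illustrative stand-in for the training subset (three rows, two predictors).
train = pd.DataFrame(
    {
        "Unnamed: 0": [2, 3, 4],
        "Cell": [207932307, 207932463, 207932470],
        "Case": ["Train", "Train", "Train"],
        "Class": ["PS", "WS", "PS"],
        "AngleCh1": [133.752037, 106.646387, 69.150325],
        "AreaCh1": [819, 431, 298],
    },
    index=[1, 2, 3],
)

# Columns that identify or label a cell rather than describe it.
id_cols = ["Unnamed: 0", "Cell", "Case", "Class"]

# Keep the class labels separately, then drop all four columns at once.
labels = train["Class"]
predictors = train.drop(columns=id_cols)

print(predictors)
```

Keeping `labels` as a separate Series with the same index makes it straightforward to realign outcomes with predictors later.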
| | Unnamed: 0 | Duration | Amount | InstallmentRatePercentage | ResidenceDuration | Age | NumberExistingCredits | NumberPeopleMaintenance | Telephone | ForeignWorker | ... | OtherInstallmentPlans.Bank | OtherInstallmentPlans.Stores | OtherInstallmentPlans.None | Housing.Rent | Housing.Own | Housing.ForFree | Job.UnemployedUnskilled | Job.UnskilledResident | Job.SkilledEmployee | Job.Management.SelfEmp.HighlyQualified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 1169 | 4 | 4 | 67 | 2 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 48 | 5951 | 2 | 2 | 22 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 12 | 2096 | 2 | 3 | 49 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | 42 | 7882 | 2 | 4 | 45 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 5 | 24 | 4870 | 3 | 4 | 53 | 2 | 2 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |

5 rows × 63 columns
| | SavingsAccountBonds.lt.100 | SavingsAccountBonds.100.to.500 | SavingsAccountBonds.500.to.1000 | SavingsAccountBonds.gt.1000 | SavingsAccountBonds.Unknown |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 |
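
The five `SavingsAccountBonds.*` columns above are indicator (dummy) variables: each row has a single 1 marking its category. As a rough sketch of how such an encoding can be produced, the example below applies `pd.get_dummies` to a categorical column. The single column `savings` and its level names are a reconstruction inferred from the indicator column names and the ten rows shown, which is an assumption; the data in this notebook is already stored in encoded form.

```python
import pandas as pd

# Level names inferred from the indicator column names (assumption).
levels = ["lt.100", "100.to.500", "500.to.1000", "gt.1000", "Unknown"]

# Categories reconstructed from the ten indicator rows shown above.
savings = pd.Series(
    pd.Categorical(
        ["Unknown", "lt.100", "lt.100", "lt.100", "lt.100",
         "Unknown", "500.to.1000", "lt.100", "gt.1000", "lt.100"],
        categories=levels,
    ),
    name="SavingsAccountBonds",
)

# One indicator column per level, in the declared category order.
# astype(int) gives 0/1 values regardless of the pandas version's default dtype.
dummies = pd.get_dummies(
    savings, prefix="SavingsAccountBonds", prefix_sep="."
).astype(int)

print(dummies)
```

Declaring the categories explicitly fixes the column order and keeps a column for `100.to.500` even though no row in this small sample falls in that level. Passing `drop_first=True` to `pd.get_dummies` would drop one level, which is the usual choice for models that need a full-rank design matrix alongside an intercept.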