B - Clean

Prev Next

Overview

Most data obtained from database and external sources are not in the right format so you can start processing and analyzing the contents. Typical issues include, bugt are not limited to:

  • The table of interest does not begin at the top and needs to be moved up or extracted, with garbage around it removed
  • Non-ANSI spaces, e.g. non-break-spaces are used
  • Unnecessary or empty rows and columns
  • Contents contain leading and/or trailing spaces
  • Numbers contain distracting formatting, e.g. CHF 1'200,-
  • Incomplete and ambiguous dates, e.g. what is "02/09/24"?
  • Header names different in similar tables from different sources and along the timeline
  • Some contents are properly capitalized, other are not.

B4P provides rich set of functions which allows you to prepare and clean the data

Cleaning Columns and Headers


table lift header row() Identify the header row, then lift the header row to top row, with data blow moving up, too.
table rename headers() Rename table headers.
table correct headers() Correct the header names if wrong. Do nothing if OK. Useful to eliminate spelling and upper/lower case issues.
table insert columns() Introduce additional columns.
table insert missing columns() Introduce additional columns if missing, otherwise do nothing.
table delete columns() Delete columns not needed.
table keep columns() Keep specified columns and delete the other columns.



Cleaning Rows


table delete blank rows() Remove all blank rows.
table delete rows() Remove 1 or more rows using a specified row number.
table delete remaining rows() Delete remaining rows below a specified row number. Various functions are available to search for the end of the table and obtain the row number automatically, for example with table search row().
table keep selected rows() Keep selected rows matching the condition specified.
table delete selected rows() The opposite.



Cleaning Contents


table clean() General replacement of exotic spaces and line breaks in the UNICODE space to the familar spaces and 'new line', and trimming options for white spaces and line breaks.
clean num() Cleans numbers by removing distracting symbols such as 1000-separators, currency symbols, etc. Numbers in parentheses are treated as negative numbers, e.g. (1200) changes to -1200 .
table process selected rows() General-purpse function call to process table rows and contents without need to specify loops and variables
table process cells() General-purpose function to process all table cells: Example: table process cells( [.] = trim([.]) );



Wikipedia Example (continued from step 1)

The file downloaded from Wikipedia does indeed need some cleanup.

  • Remove all dangling and redundant spaces in both tables.
  • Remove the summary row for the whole world
  • Exotic space characters are converted to standard space characters.
  • Elimnate the footnote references
  • In table c1, remove the last row as it repeats the header row



Simple Example

table clean                     ( c1 );
table delete selected rows      ( c1, [0]='',Erde );            // Remove blank and summary rows
table process all cells         ( c1, [.] = left( [.], '[' ) ); // Remove cited references
table delete rows               ( c1, -1 );                     // Delete last row (bottom header row - repeated)

table clean                     ( c2, trim spaces );
table delete selected rows      ( c2, [0]='',World );           // Remove blank and summary rows
table process all cells         ( c2, [.] = left( [.], '[' ) ); // Remove cited references

echo( "Table C2: ");
table list                      ( c2, briefly, 4, last col, 2 ); // List just 3 columns and first and last 4 rows
The table has been cleaned up.
Table C2:
    0 : Country                           | Population (1 July 2022) | Population (1 July 2023)
    1 : India                             | 1,425,423,212            | 1,438,069,596           
    2 : China                             | 1,425,179,569            | 1,422,584,933           
    3 : United States                     | 341,534,046              | 343,477,335             
    4 : Indonesia                         | 278,830,529              | 281,190,067             
  ... :
  234 : Falkland Islands (United Kingdom) | 3.490                    | 3.477                   
  235 : Tokelau (New Zealand)             | 2.290                    | 2.397                   
  236 : Niue (New Zealand)                | 1.821                    | 1.817                   
  237 : Vatican City                      | 505                      | 496                     

Try it yourself: Open TAB_Features_Clean_Data.b4p in B4P_Examples.zip. Decompress before use.