Objectives
This lab will focus on how to effectively clean data in excel using the available functions.
Cleaning data is a very important task for a data analyst. There are some functions in Excel you can use to support this process.
Some data when imported into Excel brings with it old formatting from the source system, this may be as symbols, concatenation of strings or duplicate values. It is essential to prepare the data for analysis and thus limit the issues that may arise
The image below shows each piece of data has a symbol prefixing the actual value.

In this example it is important to strip off the leading paragraph symbol.
The Clean() function will remove any of the non-printable symbols that you may get in your data when it is imported from various sources.
Open the following file:
Using the clean() formula as below enter the formula into cell B14:
=clean(B4)
Copy the formula down to B20, click the menu button that appears beside the copied cells and choose fill without formatting to ensure just the formula is copied and not the coloured cell format.

Another common data problem is the concatenation of strings with a delimiter or separator in between each value. This a simple text extract file which can drastically shrink file size of a data set but is impossible to work with in terms of data analysis.
In the worksheet highlight the data in cells J6:J11 and then on the data ribbon click the button Text to Columns.
This tool allows us to change text strings into separate columns easier to use. In this case each value is separated by a delimiter the underscore, choose the delimited option and click next.

The default delimiter is the Tab and there are other options such as comma or semi colon. Check Other and type in an underscore.
The preview function is very useful to see if your data looks correct once separated.

click finish (or click next and ensure the data types are set as expected).

Open the sheet named Trim(), here we have rows of data that describe products. There are extra spaces between many of the letters and numbers in the data.
The trim function will strip extra spaces from text, leaving only single spaces between words and no space characters at the start or end of text.
=trim(B5)

Hit enter. The contents should now show without extra spaces.
Drag down the formula so it is copied for each of the cells from B16 to B22

It is also possible to use the trim function with the substitution function.
In this data there is a comma separating some of the numbers where we would prefer a decimal point. There are also the extra spaces to deal with.

=TRIM(Substitute(B4,",","."))
All text values must be in quotes so the comma is in quotes and the decimal point.

Copy the formula to the end of the data.
This task could of been done using the find and replace tool in excel however it is always best practice to keep the raw data in its original form. It allows you to look back at the original data to check, especially if you conduct a lot of clean up.
Investigate the use of Find and Replace command.
Using the substitute function again you will create a new list of data for the years 2015 2016.

=substitute(G5,"2013","2015")
This formula will work for the first four lines of data so copy it down, then edit the formula so that 2014 changes to 2016.
The next issue in raw data is also common when pulling data from other sources and then combining the data in an excel sheet.
Open the Values sheet and you will see Excel marks with a green triangle each value in the table.

When you click on the cell the error sign pops up. By clicking on the arrow we can see that this number is actually stored as text.

In the arrow menu we have a function to convert the cells to numbers one by one. If a number is stored as text we cannot use some of the most common functions such as sum, or average. Consequently, the sum function in cell B11 returns zero.
There are two methods for converting these values.
Using the value function:
=value(B4)

Copy the formula down and the errors have been removed and the values are now numbers. The sum function should now work.

Using text to columns:
Select the raw data, click text to column from the data ribbon, choose delimited and click next. Uncheck Tab and other as we are only dealing with one piece of data. Click next. The text to columns function knows that the data is numeric. General converts numeric values to numbers, date values to dates and all remaining values to text. Choose the destination as our table below.

Now the values are numeric and the sum function works.

Open the following file
In the worksheet ExtraRows, we’ve got some valuable information here. Within each department, for example, ADC we see a list of information about the people who work in that department. And scrolling up and down you can see over in column A how the data is indented and every time we have a new department, we've got a new list with a new heading. That heading does repeat the information that we see to the right.
The titles if you investigate are at the bottom of the sheet. If you press Ctrl End you will see them, and delete them if you wish.
Open the following workbook:
There are some entries in column A, also in column C, that are similar, they're in reverse order. They've got month and year but not a day associated with them.
See the format below:
In cell B2: =date ... link to the year out of cell A2. To access the four right-most characters, we use the function right. Cell A2 comma 4, meaning pull out those four right-most characters, that's the year.
Then we need the month, that's the two left-most characters. We're in cell A2 again comma two. Now there's no day over there, so let's simply make it be day one, first day of the month, comma one. Right parenthesis. We've got a date.
Copy/autofill this down the column.
We might also need something almost the same in column D. Since this formula we just created is based on the cell to its left, we can copy this formula we just created in column B. We’ve a different layout. So you can edit it as follows:
Start with: {{{ =VLOOKUP(MID(E2,6,3),J:K,2,FALSE) }}}
The ‘2’ means the second column which is ‘K’ and ‘FALSE’ means – choosing exact match. If all goes well, it should have ‘9’ as the answer. Convert this to Number – general format from home ribbon. Just to test this in column ‘F’, autofill down this column ‘F’ to see if it picked up the month as the numerical field: