Excel Hack 54: The Data Clean Up Function Squad

For many data analysts, a significant part of daily work is spent cleaning collated data in preparation for further analysis.

When receiving data from multiple users and sources, there is a high likelihood of finding anomalies, discrepancies and plain errors (“dirty data”) that need to be rectified if any real analysis and insights can be performed on the data. Most “dirty data” are user-generated and are due to one or more of the following:

Entry of wrong data type: Entering text into values fields, and vice versa. For instance, inputting TWO instead of 2 in columns requesting for # of units sold.

Incorrect ordering of input information: For instance, putting data in as FIRST NAME, SURNAME in some fields, and SURNAME, FIRST NAME in other fields. Or merging FIRST NAME and SURNAME into one field when they should ideally be represented across two separate columns. Or even worse, expressing dates as DD/MM/YYYY in some cells and MM/DD/YYYY (this particular error may not be rectifiable in some instances as there might be no real way to tell if the user meant February 10 or October 2.

Wrong capitalization: Using a mix of Sentence Case, UPPER CASE, lower case within the same column

Addition of extra spaces within cells: This is the most challenging type as the human eye usually will not be able to pick out the difference between Square and Square in an Excel cell. But this error will cause both entries to be aggregated separately, so if for instance you wanted to find out how many times the word Square appeared in a column, you would get a count of 4 instead of 5 because Excel counts Square as a separate item.

Addition of extraneous characters within cells: This usually happens in data fields asking for phone number or similar info. There are various formats for typing phone numbers – some use +234 900 000 00, others +234-900-000-00, or even +234.900.000.00. If users use a combination of full stops and hyphens when entering phone number data, this will need to be cleansed in preparation for further analysis.

Inclusion of non-printable characters: Sometimes, data fields end up reflecting odd-looking characters like

Some of these issues can be avoided through the use of data validation and User Forms, but in a scenario where the dataset has already been received, the following function squad line-up can come in handy in helping to resolve them:

THE DATA CLEAN-UP SQUAD

*Note: there might be more Excel functions useful for data clean-up, so the below list focuses on the most popular functions

LEN() – counts the number of characters within a string. Among other things, can be helpful in identifying cells with extra spaces within datasets, or cells with incomplete info (e.g. phone numbers that have a specific number count)
FIND() – identifies the position of one character or text string within another string. Can be used to identify positions of delimiters separating strings of text (e.g. space bars, colons) which are then used to extract these strings

LEFT() – extracts a defined number of characters from the left or beginning of a text string within a cell. Can be helpful in extracting first names from text strings of first names and surnames, and similar scenarios
RIGHT() – extracts a defined number of characters from the right or end of a text string within a cell. Can be helpful in extracting last names from text strings of first names and surnames
MID() – returns the characters from the middle of a text string, given a starting position and length

UPPER() – converts a text string to upper case
PROPER() – converts a text string to sentence case
LOWER() – converts a text string to lower case

ISNUMBER() – confirms the value in the cell is a numerical value. This is very useful as one of the key causes of wrong analysis in Excel is values that look like numbers but are recognised as text by Excel. When this happens, those values will not be aggregated like regular values in =SUM, =AVERAGE and other mathematical equations, and will cause those functions to generate incorrect results.

SUBSTITUTE() – replaces existing text with another within a text string. This can be used to remove the extraneous full stops and hyphens in the phone column highlighted above (demonstrated below). In the snapshot below, two sets of =SUBSTITUTE() functions are executed – one for each type of extraneous character. Both functions can be nested, as shown in Column R
VALUE() – converts text string to values. This comes in handy when you need to reconvert numerical data back to values, after it has been converted to text in the data cleansing process

TRIM OR CLEAN() – removes extra spaces or non-printable characters within a string.
EXACT() – Checks if the content of two cells are equal.

Bonus: =ISTEXT (just like ISNUMBER, but confirms if cell contains text), =REPLACE (similar to SUBSTITUTE)

2 thoughts on “Excel Hack 54: The Data Clean Up Function Squad”

Pingback: Excel Hack 53: Learning the most relevant Excel function for your daily task (the Function Squad Series) – 101 Excel Hacks
Pingback: Excel Hack 32: Counting items included with other text in a range – 101 Excel Hacks

101 Excel Hacks