Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 8

Data Cleaning Checklist

Make sure you identified the most common problems and corrected them, including:

 Sources of errors: Did you use the right tools and functions to find the source of the
errors in your dataset?
 Null data: Did you search for NULLs using conditional formatting and filters?
 Misspelled words: Did you locate all misspellings?
 Mistyped numbers: Did you double-check that your numeric data has been entered
correctly?
 Extra spaces and characters: Did you remove any extra spaces or characters using
the TRIM function?
 Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates
function or DISTINCT in SQL?
 Mismatched data types: Did you check that numeric, date, and string data are
typecast correctly?
 Messy (inconsistent) strings: Did you make sure that all of your strings are
consistent and meaningful?
 Messy (inconsistent) date formats: Did you format the dates consistently throughout
your dataset?
 Misleading variable labels (columns): Did you name your columns meaningfully?
 Truncated data: Did you check for truncated or missing data that needs correction?
 Business Logic: Did you check that the data makes sense given your knowledge of
the business?

Review the goal of your project


Once you have finished these data cleaning tasks, it is a good idea to review the goal of your
project and confirm that your data is still aligned with that goal. This is a continuous process
that you will do throughout your project-- but here are three steps you can keep in mind while
thinking about this:

 Confirm the business problem


 Confirm the goal of the project
 Verify that data can solve the problem and is aligned to the goal
1. TRIM
This function removes leading and trailing spaces from a string.
Example: Remove spaces from the employee names.
SELECT TRIM(employee_name) AS trimmed_name
FROM employees;

2. UPPER and LOWER


SELECT UPPER('Hello World'), LOWER('Hello World');

3. REPLACE
This function replaces all occurrences of a specified substring with
another substring.
SELECT REPLACE(email, '@old_domain.com', '@new_domain.com') AS
updated_email
FROM employees;

4. NULLIF
This function returns NULL if two expressions are equal; otherwise,
it returns the first expression.
SELECT employee_id, NULLIF(salary, 0) AS adjusted_salary
FROM employees;

5. COALESCE
This function returns the first non-NULL expression from a list of
expressions.
SELECT employee_id, COALESCE(salary, default_salary) AS final_salary
FROM employees;

6. CONCAT and CONCAT_WS


The CONCAT function concatenates two or more strings,
while CONCAT_WS concatenates strings with a specified separator.
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;

7. SUBSTRING and SUBSTRING_INDEX


The SUBSTRING and SUBSTRING_INDEX functions are used to extract
parts of a string.
SELECT SUBSTRING(employee_id, 1, 3) AS short_id
FROM employees;

8. LENGTH and CHAR_LENGTH


The LENGTH and CHAR_LENGTH functions return the length of a string in
bytes and characters, respectively.
SELECT employee_name
FROM employees
WHERE CHAR_LENGTH(employee_name) > 10;

9. ROUND, CEIL, and FLOOR


The ROUND, CEIL, and FLOOR functions are used to round numbers to
the nearest integer, the smallest integer greater than or equal to the
number, and the largest integer less than or equal to the number,
respectively.
SELECT employee_id, ROUND(salary, -2) AS rounded_salary
FROM employees;

10. CAST and CONVERT


The CAST and CONVERT functions are used to change the data type of a
value or column.
SELECT employee_id, CAST(hire_date AS VARCHAR) AS hire_date_string
FROM employees; 3
Handling Missing Data

1. Filtering NULL values

NULL values can be used to represent missing data in SQL. To


filter out rows with missing data, use the IS NULL or IS NOT

NULL operators.

SELECT * FROM table_name WHERE column_name IS NULL;

2. Setting Default values

Default values can be assigned to a column during table creation,


which will be used when no value is provided during data insertion
or update. To set a default value for a column, use
the DEFAULT keyword.

CREATE TABLE employees (


employee_id INT PRIMARY KEY,
employee_name VARCHAR(255),
employee_status VARCHAR(50) DEFAULT 'Active'
);

3. SELECT DISTINCT

Duplicates can occur when data is collected from multiple sources or


due to data entry errors. To remove duplicates, use
the DISTINCT keyword.
SELECT DISTINCT employee_id, department_id
FROM employees;
Data Validation and Constraints

1. CHECK

A CHECK constraint ensures that the data in a column meets a


specific condition. If the condition is not met, the data cannot be
inserted or updated.
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_price DECIMAL(10, 2) CHECK (product_price > 0)
);

2. UNIQUE

A UNIQUE constraint ensures that all values in a column are


unique. This helps prevent duplicate data.
CREATE TABLE users (
user_id INT PRIMARY KEY,
email VARCHAR(255) UNIQUE
);

3. FOREIGN KEY

A FOREIGN KEY constraint is used to maintain referential integrity


between two tables. It ensures that the data in a column matches the
data in the primary key column of another table.
CREATE TABLE orders (
order_id INT PRIMARY KEY,
order_date DATE
);

CREATE TABLE order_items (


order_item_id INT PRIMARY KEY,
order_id INT REFERENCES orders (order_id),
product_id INT,
quantity INT
);

You might also like