
University of Science and Technology

School of Computational Science and Artificial Intelligence


Data governance

Quality control sheet

1. Write code to identify and remove duplicate rows from a dataset.
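
One possible answer, sketched with pandas. The sample frame and its column names are hypothetical, since the sheet supplies no dataset for this question.

```python
import pandas as pd

# Hypothetical sample data (not from the sheet); row 2 repeats row 1.
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

# Identify: True for every row that repeats an earlier row.
duplicate_mask = df.duplicated()
print(df[duplicate_mask])

# Remove: keep the first occurrence of each row and drop the repeats.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Passing `subset=[...]` to both calls would restrict the comparison to selected columns.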

2. Write code to convert the data type of “column 1” in a dataset from string to numeric
values.
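
A minimal sketch of one answer, assuming the data is held in a pandas DataFrame. The sample values are hypothetical, and turning unparseable strings into NaN with `errors="coerce"` is an assumption about the desired behavior.

```python
import pandas as pd

# Hypothetical sample data (not from the sheet); "abc" cannot be parsed.
df = pd.DataFrame({"column 1": ["1", "2.5", "abc", "4"]})

# Convert strings to numbers; values that cannot be parsed become NaN.
df["column 1"] = pd.to_numeric(df["column 1"], errors="coerce")

print(df.dtypes)
print(df)
```

With the default `errors="raise"`, the same call would stop at the first value it cannot parse.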

3. You have a dataset with a column “color” whose categorical values are stored as strings.
You want to convert these categorical values into numeric values using one-hot encoding.
Write a code snippet using sklearn to perform one-hot encoding on the dataset.

4. You have a dataset with a column “color” whose categorical values are stored as strings.
You want to convert these categorical values into numeric values using label encoding.
Write a code snippet using sklearn to perform label encoding on the dataset.
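
A short sketch using scikit-learn's `LabelEncoder`, again on hypothetical sample values. Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` is the usual alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data (not from the sheet).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each distinct string gets an integer code, assigned in sorted order.
encoder = LabelEncoder()
df["color_encoded"] = encoder.fit_transform(df["color"])

print(df)
print(list(encoder.classes_))
```

The fitted `encoder.classes_` array maps each integer code back to its original string.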

5. Write code that scales data to a common range for use in a linear regression model
that needs the data normalized to the range 0 to 1.
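
One possible answer uses scikit-learn's `MinMaxScaler`, whose default output range is 0 to 1. The columns and values below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data (not from the sheet).
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 80.0, 90.0],
})

# Rescale each column independently so its minimum is 0 and its maximum is 1.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled)
```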

6. Write code that scales data to a common scale for use in a linear regression model
that needs the data centered around 0.
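
A sketch that reads "normalized around 0" as standardization (zero mean, unit variance), which is an assumption about the intent. It uses scikit-learn's `StandardScaler` on hypothetical data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data (not from the sheet).
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 80.0, 90.0],
})

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled)
```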

7. Write code that scales data containing outliers to a common range using a suitable
scaler.
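
One candidate for a "suitable scaler" is scikit-learn's `RobustScaler`, which centers on the median and scales by the interquartile range, so a single extreme value has little influence on the other rows. The income figures are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical sample data (not from the sheet); 1000 is the outlier.
df = pd.DataFrame({"income": [30.0, 32.0, 35.0, 38.0, 40.0, 1000.0]})

# Center each column on its median and divide by its interquartile range.
scaler = RobustScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled)
```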

8. Write Python code that inserts a new column into a dataframe, categorizing people based on
the "Age" column as follows: age less than 11 is categorized as "child", age between 11 and 20
(inclusive) is categorized as "teenager", and age greater than 20 is categorized as "adult".

Sample data:

data = {'Name': ['John', 'Emma', 'Ryan', 'Sophia'],
        'Age': [8, 15, 28, 35]}
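
A possible answer built on the sample data above, using `numpy.select` so the boundaries match the wording exactly. The new column name `Category` is an assumption.

```python
import numpy as np
import pandas as pd

data = {'Name': ['John', 'Emma', 'Ryan', 'Sophia'],
        'Age': [8, 15, 28, 35]}
df = pd.DataFrame(data)

# Conditions are checked in order; the first one that holds wins.
conditions = [df['Age'] < 11, df['Age'] <= 20]
choices = ['child', 'teenager']

# Anything above 20 falls through to the default.
df['Category'] = np.select(conditions, choices, default='adult')
print(df)
```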

9. Write Python code that adds a column to a dataset containing the sum of the values of three
columns plus a 20% tax.
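
A minimal sketch under two assumptions: the three column names below are hypothetical, and the 20% tax is applied to the sum of the three columns.

```python
import pandas as pd

# Hypothetical sample data (not from the sheet).
df = pd.DataFrame({
    "price": [100, 200],
    "shipping": [10, 20],
    "service": [5, 15],
})

TAX_RATE = 0.20  # 20% tax, as stated in the question

# Sum the three columns row by row, then add the tax on top.
subtotal = df[["price", "shipping", "service"]].sum(axis=1)
df["total_with_tax"] = subtotal * (1 + TAX_RATE)
print(df)
```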

10. Write Python code to merge the “data1” and “data2” dataframes.

data1 = {'Name': ['John', 'Jane', 'Mike'],
         'Age': [25, 30, 35]}

data2 = {'Name': ['Sarah', 'David'],
         'Age': [28, 32]}
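
Because both dictionaries share the same columns and hold different people, this sketch reads "merge" as stacking the rows with `pd.concat`. That reading is an assumption; a key-based join with `pd.merge` would suit frames that describe the same people with different columns.

```python
import pandas as pd

data1 = {'Name': ['John', 'Jane', 'Mike'],
         'Age': [25, 30, 35]}
data2 = {'Name': ['Sarah', 'David'],
         'Age': [28, 32]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Stack the rows of both frames and renumber the index from 0.
merged = pd.concat([df1, df2], ignore_index=True)
print(merged)
```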

11. Consider a scenario where you have a large data frame that needs to be validated against a
schema using Pandera in Python. Explain how you can leverage the lazy=True parameter in
Pandera to optimize the validation process.
Provide an example code snippet demonstrating the usage of lazy=True in conjunction with
DataFrame schema validation.
Assume the data frame is loaded as follows:
data = pd.read_csv('data.csv')

12. You have been given a dataset containing customer information, including their names,
ages, and email addresses.
Write a Python code snippet to perform data validation and profiling on this dataset using
Pandas and Pandera.
Data Validation:
a. Validate that the 'name' column contains only string values and does not have any missing or
null values.
b. Validate that the 'age' column contains only integer values and falls within a specific range
(18 to 65).
Data Profiling:
a. Calculate and display basic statistics for the 'age' column, such as minimum, maximum,
mean, and standard deviation.
b. Count and display the number of unique values in the 'name' column.

13. You are working with a dataset containing information about customers' preferences for
outdoor activities. The dataset includes columns for 'Customer ID', 'Age', 'Activity Type', and
'Rating'. Your task is to use Great Expectations to validate the data and ensure that the 'Rating'
column values are within a specific range based on the 'Activity Type'. Write Python code to
load the dataset, define custom expectations for rating ranges based on activity types, apply
these expectations, and print the validation result.
