Assignment 02
Submitted to:
Submitted By:
Software Re-Engineering (SWE-417) SSUET/QR/114
Answer # 01:
i. Part (a): In this part we have to download a dataset from Kaggle and determine the steps
that can be taken to clean it. The dataset we are working on is named “Student Skillset
Analysis”. The steps that can help us clean this dataset are:
• Understanding the data: Comprehending the data that will be the focus of our work.
We identify the variables, their definitions, and the type of data each one holds.
For example, a variable can be numerical, categorical, or textual.
• Fixing structural errors: Correcting typos, inconsistent capitalization, and odd
naming conventions. Mislabeled categories in our data are one example of these
inconsistencies.
• Handling missing data: In this step we deal with the empty cells that leave a row, or
the information about a particular subject, incomplete. To handle them, we can either
fill in the missing values or remove the affected row or column entirely. Removing data
may result in data loss, and either choice may change how the data can be used
afterwards.
• Filtering unwanted outliers: Values that lie far from the rest of the data, and have no
real relation to it, occasionally appear in the dataset. By filtering out these outliers,
we can analyze the data more efficiently and reliably.
• Validating the data: Validation is an important final step in the data cleaning process
because it confirms that the cleaned data is consistent and of high quality.
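As a minimal sketch of the steps listed above, the snippet below applies them to a small made-up table. The column names (name, community, languages), the sample values, and the helper clean_students are hypothetical and are not taken from the Kaggle dataset; they only illustrate the order of operations.

```python
import pandas as pd


def clean_students(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the list above to a toy table."""
    out = df.copy()

    # Fixing structural errors: trim whitespace and unify capitalization
    out["community"] = out["community"].str.strip().str.capitalize()

    # Handling missing data: drop rows where any field is missing
    out = out.dropna()

    # Remove exact duplicate rows that remain after normalization
    out = out.drop_duplicates().reset_index(drop=True)

    # Validating the data: only the expected categories may remain
    assert set(out["community"]) <= {"Yes", "No"}
    return out


# Hypothetical sample data, not taken from the real dataset
raw = pd.DataFrame(
    {
        "name": ["Ali", "Sara", "Sara", "Omar", "Hina"],
        "community": ["yes", " Yes", "Yes", None, "NO"],
        "languages": [2, 3, 3, 1, 4],
    }
)

# Understanding the data: inspect column types and missing-value counts
print(raw.dtypes)
print(raw.isna().sum())

cleaned = clean_students(raw)
print(cleaned)
```

In this toy table the row with the missing answer is dropped and the two "Sara" rows collapse into one once the capitalization is unified, leaving three rows.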
These steps should be sufficient for the “Student Skillset Analysis” dataset. In my view, the
data we downloaded is not very complex: it is simple and does not even contain numerical
values, so we would not have to apply the standardization or normalization step. Many other
steps can help to clean data, but which ones apply depends entirely on which types of
• Code:
import pandas as pd

# Import dataset (reading .xlsx files requires the openpyxl package)
file_path = "studentdata.xlsx"
data = pd.read_excel(file_path)
print(data)

# Checking and converting data types first, so that the later steps work on
# real numbers; non-numeric values are coerced to NaN
for col in ['Percentage in Class 10th', 'Percentage in Class 12th']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Handling missing values
data = data.dropna()  # Remove rows with missing values

# Removing duplicates
data = data.drop_duplicates()

# Handling outliers using IQR (computed on the numeric columns only)
numeric = data.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
is_outlier = (numeric.lt(lower_bound) | numeric.gt(upper_bound)).any(axis=1)
data = data[~is_outlier].copy()  # .copy() avoids SettingWithCopyWarning

# Checking for inconsistencies
# ('incorrect_value' and 'correct_value' are placeholders)
data['Active in developer Communities'] = (
    data['Active in developer Communities'].replace('incorrect_value', 'correct_value')
)
data['How many programming languages do you know'] = (
    data['How many programming languages do you know'].replace('incorrect_value', 'correct_value')
)

# Removing unnecessary columns
data = data.drop(columns=['Are you associated with any developer Community'])

# Resetting index
data = data.reset_index(drop=True)

# Final cleaned dataset
print("\nAfter applying the data cleaning steps, the dataset is:\n\n", data)
print("\nTop rows:\n\n", data.head())
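To see what the IQR step in the code above does, here is a small worked sketch on a made-up column of percentages. The values and the helper name iqr_bounds are hypothetical; the 1.5 multiplier is the same one used in the code above.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for one column."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical percentages; the value 5 lies far below the rest
scores = pd.Series([70, 72, 75, 78, 80, 82, 85, 5])

lower, upper = iqr_bounds(scores)

# Keep only the values that fall inside the fences (inclusive)
kept = scores[scores.between(lower, upper)]

print(lower, upper)
print(kept.tolist())
```

With these values Q1 is 71.5 and Q3 is 80.5, so the fences come out as 58.0 and 94.0 and only the value 5 is dropped.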
• Output:
Answer # 02:
• Code:
1. read (n)
2. i := 1
3. sum := 0
4. product := 1
5. while i <= n
6. {
7. sum := sum + i
8. product := product * i
9. i := i + 1
10. }
11. write (sum)
12. write (product)
[2] i = 1
[4] product = 1
[6] {
[9] i = i + 1
[10] }
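For reference, the numbered pseudocode in this answer can be written as a short runnable Python sketch. The function name sum_and_product is hypothetical, and the input is passed as an argument instead of being read with read (n); the comments map each statement back to the pseudocode line numbers.

```python
def sum_and_product(n: int) -> tuple[int, int]:
    """Return the sum and the product of the integers 1..n."""
    i = 1              # line 2
    total = 0          # line 3 (named total to avoid shadowing the built-in sum)
    product = 1        # line 4
    while i <= n:      # line 5
        total += i     # line 7
        product *= i   # line 8
        i += 1         # line 9
    return total, product


total, product = sum_and_product(5)
print(total)    # line 11: write (sum)
print(product)  # line 12: write (product)
```

For n = 5 this prints 15 and 120.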
References
1. https://www.javatpoint.com/data-cleaning-in-data-mining#:~:text=Data%20mining%20is%20a%20key%20technique%20for%20data%20cleaning.
2. https://www.kaggle.com/datasets/kushagrathisside/student-skillset-analysis
3. https://monkeylearn.com/blog/data-cleaning-steps/
4. https://www.geotab.com/blog/data-cleaning/
5. https://analyticsindiamag.com/understanding-the-importance-of-data-cleaning-and-normalization/
6. https://realpython.com/python-data-cleaning-numpy-pandas/
7. https://www.w3schools.com/python/pandas/pandas_cleaning.asp
8. Miss Khalil, N. (2023, May). Data Preprocessing [PowerPoint slides]. Lecture presented in
the course "Software Reengineering," Sir Syed University of Engineering and Technology.
9. Miss Khalil, N. (2023). Lecture 7 & 8 [PowerPoint slides]. Lecture presented in the course