Assignment 02

Software Re-Engineering (SWE-417) SSUET/QR/114
Sir Syed University of Engineering & Technology (SSUET)
Department: Software Engineering
Course Name: Software Re-Engineering
Submitted to:
Miss Nida Khalil
Submitted By:
Student Name      Roll Number    Section
Mukadus Jawaid    2020-SE-104    C
Abdullah Khan     2020-SE-135    C
Rabee Amir        2020-SE-137    C
Nofil Saleem      2020-SE-140    C
Answer # 01:
i. Part (a): In this part we download a dataset from Kaggle and determine the steps that can
be taken to clean it. The dataset we are working on is named “Student Skillset Analysis”.
The steps that can help us clean our dataset are:
• Understanding the data: comprehending the data that will be the focus of our work.
We identify the variables, their definitions, and the kind of data each one holds. For
example, a variable can be numerical, categorical, or textual.
• Fixing structural errors: correcting typos, incorrect capitalization, and inconsistent
naming conventions. Mislabeled categories in our data are one example of these
inconsistencies.
• Handling missing data: in this step we take care of the empty cells that make a row, or
the information about a particular subject, look incomplete. To handle them, we can
either fill in the missing values or remove the affected row or column completely, but
removal may result in data loss and filled-in values may change what the data shows.
• Filtering unwanted outliers: the dataset occasionally contains values that lie far from
the rest, as well as columns that have no relation to the other columns. By removing
them, we can analyze the data more efficiently.
• Validating the data: validation is a valuable step in the data cleaning process because
it confirms that the cleaned data is consistent and of high quality.
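The steps above can be sketched in pandas as follows. This is only a minimal illustration: the sample rows, the "Name" and "Known languages" columns, and the helper name clean are assumptions made for the example and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample; names and values are illustrative, not from the real dataset.
raw = pd.DataFrame({
    "Name": ["Asha", "Bilal", "Chen", "Bilal"],
    "Active in developer Communities": ["Yes", " yes", "NO", "YES"],
    "Known languages": ["Python", None, "Java", None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Structural errors: trim whitespace and unify the capitalization of a category.
    col = "Active in developer Communities"
    out[col] = out[col].str.strip().str.capitalize()
    # Missing data: fill with an explicit label instead of dropping the row.
    out["Known languages"] = out["Known languages"].fillna("Unknown")
    # The two "Bilal" rows only become exact duplicates after the fixes above.
    return out.drop_duplicates().reset_index(drop=True)

cleaned = clean(raw)

# Validation: only the allowed labels remain and no cell is empty.
assert set(cleaned["Active in developer Communities"]) <= {"Yes", "No"}
assert not cleaned.isna().any().any()
print(cleaned)
```

Filling with a label such as "Unknown" keeps the row, whereas dropna() would discard it; which one is appropriate depends on how the column is used later.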
In our view, these steps are sufficient for the “Student Skillset Analysis” dataset because
the data we downloaded is not very complex. It is simple and does not even contain numerical
values, so we do not have to apply the step of standardizing or normalizing the data. Many
other steps can help to clean data, but which ones apply depends entirely on the type of
dataset you choose.
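One possible way to check whether a scaling step is needed is to look at the column types before deciding. The sketch below uses a hypothetical two-column frame as a stand-in for the survey data; the "Preferred language" column is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical frame standing in for the survey data; the real columns may differ.
df = pd.DataFrame({
    "Active in developer Communities": ["Yes", "No", "Yes"],
    "Preferred language": ["Python", "Java", "C++"],
})

# Standardizing or normalizing only applies to numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
needs_scaling = len(numeric_cols) > 0

print("Numeric columns:", numeric_cols)
print("Scaling step needed:", needs_scaling)
```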
ii. Part (b): Cleaning the data through a Python program.
• Code:
import pandas as pd
import numpy as np
# Import dataset
file_path = "studentdata.xlsx"
data = pd.read_excel(file_path)
print(data)
# Handling missing values
data.dropna(inplace=True) # Remove rows with missing values
# Removing duplicates
data.drop_duplicates(inplace=True)
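# Handling outliers using IQR (quantiles are only defined for numeric columns)
numeric_cols = data.select_dtypes(include="number").columns
Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = (data[numeric_cols].lt(lower_bound) | data[numeric_cols].gt(upper_bound)).any(axis=1)
data = data[~outliers].copy()
# Use .copy() to avoid SettingWithCopyWarning on later assignments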
# Checking for inconsistencies
# 'incorrect_value' and 'correct_value' are placeholders; substitute the real labels
for col in ['Active in developer Communities', 'How many programming languages do you know']:
    data[col] = data[col].replace('incorrect_value', 'correct_value')
# Removing unnecessary columns
data = data.drop(['Are you associated with any developer Community'], axis=1)
# Checking and converting data types
# Convert to numeric type, coercing non-numeric values to NaN
data['Percentage in Class 10th'] = pd.to_numeric(data['Percentage in Class 10th'], errors='coerce')
data['Percentage in Class 12th'] = pd.to_numeric(data['Percentage in Class 12th'], errors='coerce')
# Resetting index
data.reset_index(drop=True, inplace=True)
print("\nAfter applying the data cleaning commands, the data is:\n\n", data)
# Final cleaned dataset
print("\n Top Rows:\n\n",data.head())
• Output:
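The IQR rule used in the program above can also be tried in isolation. The sketch below runs it on a small hypothetical column of percentages (the values are made up, with 5.0 planted as an outlier) rather than on the real dataset.

```python
import pandas as pd

# Hypothetical percentages; 5.0 is a deliberately planted outlier.
scores = pd.DataFrame({"Percentage": [70.0, 72.0, 75.0, 78.0, 80.0, 5.0]})

q1 = scores["Percentage"].quantile(0.25)
q3 = scores["Percentage"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose value lies inside [lower, upper].
inliers = scores[scores["Percentage"].between(lower, upper)].reset_index(drop=True)
print(inliers)
```

Only the planted value falls outside the bounds, so the other five rows are kept.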
Answer # 02:
• Code:
1. read (n)
2. i := 1
3. sum := 0
4. product := 1
5. while i <= n
6. {
7. sum := sum + i
8. product := product * i
9. i := i + 1
10. }
11. write (sum)
12. write (product)
• After applying backward slicing with the criterion [12, product]:
[1] read (n)
[2] i := 1
[4] product := 1
[5] while i <= n
[6] {
[8] product := product * i
[9] i := i + 1
[10] }
[12] write (product)
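As an informal check of the slice, both versions can be translated to Python and compared. The function names original and product_slice are chosen here for illustration; the slice is correct for this criterion if both always yield the same product.

```python
def original(n: int) -> tuple[int, int]:
    """Direct translation of the 12-line program; returns (sum, product)."""
    i, total, product = 1, 0, 1
    while i <= n:
        total += i
        product *= i
        i += 1
    return total, product

def product_slice(n: int) -> int:
    """Only statements 1, 2, 4, 5, 8, 9 and 12 of the program."""
    i, product = 1, 1
    while i <= n:
        product *= i
        i += 1
    return product

# The slice must compute the same product as the full program for every input.
for n in range(0, 8):
    assert original(n)[1] == product_slice(n)

print(product_slice(5))  # 120
```

Statements 3, 7 and 11 are left out because sum never influences the value of product written at line 12.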
References
1. https://www.javatpoint.com/data-cleaning-in-data-mining#:~:text=Data%20mining%20is%20a%20key%20technique%20for%20data%20cleaning.
2. https://www.kaggle.com/datasets/kushagrathisside/student-skillset-analysis
3. https://monkeylearn.com/blog/data-cleaning-steps/
4. https://www.geotab.com/blog/data-cleaning/
5. https://analyticsindiamag.com/understanding-the-importance-of-data-cleaning-and-normalization/
6. https://realpython.com/python-data-cleaning-numpy-pandas/
7. https://www.w3schools.com/python/pandas/pandas_cleaning.asp
8. Khalil, N. (2023, May). Data Preprocessing [PowerPoint slides]. Lecture presented in
the course "Software Re-Engineering," Sir Syed University of Engineering & Technology.
9. Khalil, N. (2023). Lecture 7 & 8 [PowerPoint slides]. Lecture presented in the course
"Software Re-Engineering," Sir Syed University of Engineering & Technology.
----------------------------- End of the Assignment 02 ----------------------