Assignment 02


Software Re-Engineering (SWE-417) SSUET/QR/114

Sir Syed University of Engineering & Technology (SSUET)

Department: Software Engineering

Course Name: Software Re-Engineering

Assignment 02
Submitted to:

Miss Nida Khalil

Submitted By:

Student Name      Roll Number    Section

Mukadus Jawaid    2020-SE-104    C

Abdullah Khan     2020-SE-135    C

Rabee Amir        2020-SE-137    C

Nofil Saleem      2020-SE-140    C


Assignment 02

Answer # 01:

i. Part (a): In this part we download a dataset from Kaggle and determine the steps needed to
clean it. The dataset we are working on is named “Student Skillset Analysis”. The steps that can
help us clean this dataset are:

• Understanding the data: We first study the data we will be working with by identifying the
variables, their definitions, and the type of data each one holds. For example, a variable can
be numerical, categorical, or textual.

• Fixing structural errors: We correct typos, inconsistent capitalization, and odd naming
conventions. Mislabeled categories, such as the same category spelled in two different ways,
are one example of these inconsistencies.

• Handling missing data: In this step we take care of the empty cells that leave a row, or the
information about a particular subject, incomplete. We can either fill in the missing values or
remove the affected row or column entirely, but removal may cause data loss, and filling in
values may change the results of later analysis.

• Filtering unwanted outliers: Values that lie far from the rest of the data, as well as columns
that have no relation to the other columns, occasionally appear in the dataset. By filtering
these out, we can analyze the data more efficiently.

• Validating the data: Validation is an important final step in the data cleaning process
because it confirms that the cleaned data is consistent and of high quality.
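The steps listed above can be sketched on a small table. This is only a minimal illustration, not the cleaning applied to the actual Kaggle file: the DataFrame `raw`, the `Name` and `Favourite Language` columns, all of the values, and the helper `clean` are invented for the example.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration only.
raw = pd.DataFrame({
    "Name": ["Ali", "Sara", "Sara", "Omar", "Hina"],
    "Favourite Language": ["python", "Python ", "Python ", "JAVA", None],
    "Active in developer Communities": ["Yes", "yes", "yes", "No", "No"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Understanding the data: look at the type of each column.
    print(out.dtypes)

    # Fixing structural errors: trim whitespace and unify capitalization.
    for col in ["Favourite Language", "Active in developer Communities"]:
        out[col] = out[col].str.strip().str.capitalize()

    # Handling missing data: here we simply drop rows with an empty cell.
    out = out.dropna()

    # Remove rows that are exact duplicates after the fixes above.
    out = out.drop_duplicates()

    # Validating the data: the categorical column may only hold known labels.
    allowed = {"Yes", "No"}
    assert set(out["Active in developer Communities"]) <= allowed

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Dropping incomplete rows is the simplest choice; filling in a default value instead would keep more rows at the cost of introducing assumed data.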

In our view, these steps are sufficient for the “Student Skillset Analysis” dataset because the
data we downloaded is not very complex. It is simple and does not even contain numerical values,
so we do not have to standardize or normalize the data. Many other steps can help clean data, but
which ones apply depends on the type of dataset chosen.
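For comparison, this is roughly what the skipped normalization and standardization step would look like on a dataset that does contain a numerical column. It is a sketch under assumptions: the `score` column, its values, and the helper names `min_max` and `z_score` are hypothetical and do not come from the assignment's dataset.

```python
import pandas as pd

# Hypothetical numerical column, invented for illustration.
scores = pd.DataFrame({"score": [50.0, 60.0, 70.0, 80.0, 90.0]})


def min_max(col: pd.Series) -> pd.Series:
    """Rescale values to the range [0, 1] (normalization).

    Assumes the column is not constant, otherwise the denominator is zero.
    """
    return (col - col.min()) / (col.max() - col.min())


def z_score(col: pd.Series) -> pd.Series:
    """Rescale values to mean 0 and standard deviation 1 (standardization)."""
    return (col - col.mean()) / col.std(ddof=0)


scores["score_minmax"] = min_max(scores["score"])
scores["score_z"] = z_score(scores["score"])
print(scores)
```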

ii. Part (b): Cleaning the data through a Python program.

• Code:

import pandas as pd

# Import dataset (reading .xlsx files requires the openpyxl package)
file_path = "studentdata.xlsx"
data = pd.read_excel(file_path)
print(data)

# Checking and converting data types first, so that non-numeric entries
# are coerced to NaN before missing values are handled
for col in ['Percentage in Class 10th', 'Percentage in Class 12th']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Handling missing values
data = data.dropna()  # Remove rows with missing values

# Removing duplicates
data = data.drop_duplicates()

# Handling outliers using IQR (computed on the numeric columns only,
# because quantiles are not defined for text columns)
numeric = data.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_rows = (numeric.lt(lower_bound) | numeric.gt(upper_bound)).any(axis=1)
data = data[~outlier_rows].copy()  # .copy() avoids SettingWithCopyWarning

# Checking for inconsistencies
# ('incorrect_value' and 'correct_value' are placeholders for the real labels)
data['Active in developer Communities'] = data['Active in developer Communities'].replace('incorrect_value', 'correct_value')
data['How many programming languages do you know'] = data['How many programming languages do you know'].replace('incorrect_value', 'correct_value')

# Removing unnecessary columns
data = data.drop(columns=['Are you associated with any developer Community'])

# Resetting index
data = data.reset_index(drop=True)
print("\nAfter applying the data cleaning commands, the data is:\n\n", data)

# Final cleaned dataset
print("\n Top Rows:\n\n", data.head())
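To make the IQR step in the program easier to follow, here is a small worked example on a single column. It is a sketch only: the `marks` column, its values, and the helper `iqr_filter` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical marks; the value 150 is an obvious outlier.
marks = pd.DataFrame({"marks": [55, 60, 62, 65, 68, 70, 72, 150]})


def iqr_bounds(col: pd.Series) -> tuple:
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def iqr_filter(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep only the rows whose value in `column` lies inside the fences."""
    lower, upper = iqr_bounds(df[column])
    inside = df[column].between(lower, upper)
    return df[inside].reset_index(drop=True)


lower, upper = iqr_bounds(marks["marks"])
filtered = iqr_filter(marks, "marks")
print(lower, upper)
print(filtered)
```

The rule only flags values that are far from the middle half of the data, so a low but plausible mark such as 55 is kept while 150 is removed.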


• Output:


Answer # 02:

• Code:

1. read (n)
2. i := 1
3. sum := 0
4. product := 1
5. while i <= n
6. {
7. sum := sum + i
8. product := product * i
9. i := i + 1
10. }
11. write (sum)
12. write (product)

• After applying backward slicing with the criterion [12, product] (statements 3, 7 and 11 only affect sum, so they are removed):

[1] read (n)

[2] i := 1

[4] product := 1

[5] while i <= n

[6] {

[8] product := product * i

[9] i := i + 1

[10] }

[12] write (product)
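The slice above can be checked mechanically. The sketch below is a deliberately simplified backward slicer, not a general tool: the definition and use sets of each statement are written by hand, the names `Stmt`, `stmt`, `PROGRAM` and `backward_slice` are hypothetical, and statement order is ignored, so on other programs it may keep more statements than a precise slicer would. The braces on lines 6 and 10 are syntax that belongs to the `while` on line 5 and are therefore not modelled separately.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional


class Stmt(NamedTuple):
    text: str
    defs: FrozenSet[str]    # variables the statement assigns
    uses: FrozenSet[str]    # variables the statement reads
    control: Optional[int]  # line of the enclosing while, if any


def stmt(text: str, defs: str = "", uses: str = "",
         control: Optional[int] = None) -> Stmt:
    return Stmt(text, frozenset(defs.split()), frozenset(uses.split()), control)


# The program from Answer # 02, keyed by its original line numbers.
PROGRAM: Dict[int, Stmt] = {
    1: stmt("read (n)", defs="n"),
    2: stmt("i := 1", defs="i"),
    3: stmt("sum := 0", defs="sum"),
    4: stmt("product := 1", defs="product"),
    5: stmt("while i <= n", uses="i n"),
    7: stmt("sum := sum + i", defs="sum", uses="sum i", control=5),
    8: stmt("product := product * i", defs="product", uses="product i", control=5),
    9: stmt("i := i + 1", defs="i", uses="i", control=5),
    11: stmt("write (sum)", uses="sum"),
    12: stmt("write (product)", uses="product"),
}


def backward_slice(program: Dict[int, Stmt], line: int) -> List[int]:
    """Collect every statement the criterion line depends on."""
    in_slice = set()
    worklist = [line]
    while worklist:
        current = worklist.pop()
        if current in in_slice:
            continue
        in_slice.add(current)
        s = program[current]
        # Control dependence: keep the loop header that guards this statement.
        if s.control is not None:
            worklist.append(s.control)
        # Data dependence: keep statements that define a variable read here.
        for other, candidate in program.items():
            if candidate.defs & s.uses:
                worklist.append(other)
    return sorted(in_slice)


for number in backward_slice(PROGRAM, 12):
    print(f"[{number}] {PROGRAM[number].text}")
```

Running the same function with line 11 as the criterion gives the slice for `sum` instead.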


References

1. https://www.javatpoint.com/data-cleaning-in-data-mining#:~:text=Data%20mining%20is%20a%20key%20technique%20for%20data%20cleaning.

2. https://www.kaggle.com/datasets/kushagrathisside/student-skillset-analysis

3. https://monkeylearn.com/blog/data-cleaning-steps/

4. https://www.geotab.com/blog/data-cleaning/

5. https://analyticsindiamag.com/understanding-the-importance-of-data-cleaning-and-normalization/

6. https://realpython.com/python-data-cleaning-numpy-pandas/

7. https://www.w3schools.com/python/pandas/pandas_cleaning.asp

8. Khalil, N. (2023, May). Data Preprocessing [PowerPoint slides]. Lecture presented in the course "Software Re-Engineering," Sir Syed University of Engineering & Technology.

9. Khalil, N. (2023). Lecture 7 & 8 [PowerPoint slides]. Lecture presented in the course "Software Re-Engineering," Sir Syed University of Engineering & Technology.

----------------------------- End of the Assignment 02 ----------------------
