
Sub: Data Science  Class: TYCS  Name: Amit Bablu Rai  Roll No: 101

Practical No. 1
Aim: Introduction to Excel:
● Perform conditional formatting on a dataset using various criteria.
● Create a pivot table to analyze and summarize data.
● Use the VLOOKUP function to retrieve information from a different worksheet or table.
● Perform what-if analysis using Goal Seek to determine input values for desired output.

1. Perform conditional formatting on a dataset using various criteria.


We perform conditional formatting on the "Sell Price" column to highlight cells with a
price greater than $3,500 using the following steps:
Steps:
1. Select the "Sell Price" column (Column E).

2. Go to the "Home" tab on the ribbon.


3. Click on "Conditional Formatting" in the toolbar.
4. Choose "Highlight Cells Rules" and then "Greater Than."


5. Enter the threshold value as 3500.


6. Customize the formatting options (e.g., choose a fill color).

7. Click "OK" to apply the rule.


2. Create a pivot table to analyze and summarize data.


Create a pivot table for the following analyses:
a) How many cars do you have by make and model and by color?
b) Find out the profit margin on different vehicles.
c) Find out the average cost of vehicles.
d) Find out the percentage of cars of each color.
Steps:
1. Select the entire dataset including headers.

2. Go to the "Insert" tab on the ribbon.


3. Click on "PivotTable."


4. Choose where you want to place the PivotTable (e.g., new worksheet).


5. Let us find out how many cars we have by make, model, and color.
Drag the "Make" column to the Rows area.
Drag the “Model” column to the Rows area under the “Make” column.
Drag the “Color” column to the Column Section.
Drag the “Make” column to the Values Section.

This gives us the number of cars by Make, Model, and Color.


6. Let us find out the profit margin on different vehicles.
Drag the "Make" column to the Rows area.
Drag the “Make” column to the Values Section.
Drag the “Sell Price” column to the Values Section.
Drag the “Buy Price” column to the Values Section.


Add a calculated field named "Profit Margin".


Enter ='Sell Price' - 'Buy Price' in the Formula box. This adds a Profit Margin
column to the pivot table. (You can sort the data if needed.)


7. Let us find out the average cost of vehicles.


Drag the "Make" column to the Rows area.
Drag the “Buy Price” column to the Values Section.


8. Let us find out the percentage of cars of each color.


Drag the "Color" column to the Rows area.
Drag the “Color” column to the Values Section.
Right-click a value in the Count of Color column, choose "Show Values As", and
select "% of Column Total".


Now visualize the data using a PivotChart.


On the "PivotTable Analyze" tab, click "PivotChart".

Add the "Make" column to the Rows area and the PivotChart updates immediately.


3. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
1. Consider the car dataset. Let us find the sale price of a particular car.
2. Enter the car company name in one cell and then add the VLOOKUP formula:
=VLOOKUP(A12,A1:F7,5,0)
i) Lookup_value: the value to search for (here A12, the cell holding the car company name).
ii) Table_array: the range containing the complete dataset (A1:F7).
iii) Col_index_num: the number of the column whose value should be returned (5, the Sell Price column).
iv) Range_lookup: TRUE for an approximate match, FALSE (0) for an exact match.

4. Perform what-if analysis using Goal Seek to determine input values for desired output.
Steps:
1. Go to the "Data" tab on the ribbon.
2. Click on "What-If Analysis" and select "Goal Seek."


3. Set "Set cell" to the Saving cell, "To value" to 7000, and "By changing cell" to
the Transportation cell.

4. Click "OK" to let Excel determine the required Savings.


Practical No. 2
Aim: Data Frames and Basic Data Pre-processing
● Read data from CSV and JSON files into a data frame.
● Perform basic data pre-processing tasks such as handling missing values and outliers.
● Manipulate and transform data using functions like filtering, sorting, and grouping.

Create the CSV file given below:

Code:
#Read Data From .CSV File
import pandas as pd

data=pd.read_csv('student.csv')
print("Reading data from CSV file:")
data.head()

#Read Data From JSon File:


data_json = pd.read_json("animals.json")
print("Reading data from JSON file:")
data_json.head()


import json

#Creating JSON data manually


json_input = '''[
{"studid" : "001", "name" : "Nandita", "Age": "25"},
{"studid" : "002", "name" : "Rinki", "Age": "19"},
{"studid" : "009", "name" : "Kajal", "Age": "15"}
]'''
info = json.loads(json_input)  # renamed from input to avoid shadowing the built-in
print("User Count : ", len(info))
for item in info:
    print("\nStudent ID: ", item["studid"])
    print("Name: ", item["name"])
    print("Age: ", item["Age"])
Output:

#Handling Missing Values:


df_car = pd.read_csv("Car Inventory Details.csv")
df_car


# Option 1: drop every row that contains a missing value
df_car_clean = df_car.dropna(axis=0, how='any')
df_car_clean

# Option 2: fill missing values with 0 (applied to the original frame,
# since df_car_clean no longer has any NaNs to fill)
df_car_filled = df_car.fillna(0)
df_car_filled
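
As a hedged alternative sketch (reusing the 'Sell Price' column from this dataset), numeric gaps can be filled with the column median instead of a blanket 0:

df_car_imputed = df_car.fillna({'Sell Price': df_car['Sell Price'].median()})  # median fill, sketch only
df_car_imputed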


#Handling Outliers
median_value = df_car_filled['Sell Price'].median()
upper_threshold = df_car_filled['Sell Price'].mean() + 2 * df_car_filled['Sell Price'].std()
lower_threshold = df_car_filled['Sell Price'].mean() - 2 * df_car_filled['Sell Price'].std()
# Replace values more than 2 standard deviations from the mean with the median
df_car_filled['Sell Price'] = df_car_filled['Sell Price'].apply(
    lambda x: median_value if x > upper_threshold or x < lower_threshold else x)

df_car_filled
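
For comparison, a minimal IQR-based sketch flags values outside 1.5 × IQR instead of using mean ± 2 std (same 'Sell Price' column assumed):

q1 = df_car_filled['Sell Price'].quantile(0.25)
q3 = df_car_filled['Sell Price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
df_car_filled[(df_car_filled['Sell Price'] < lower) | (df_car_filled['Sell Price'] > upper)]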

#Manipulate And Transform Data:


filtered_data = df_car_filled[df_car_filled['Sell Price'] > 3000]


filtered_data

sorted_data = df_car_filled.sort_values(by='Sell Price', ascending=True)


sorted_data

numeric_columns = ['Sell Price', 'Buy Price']


grouped_data = df_car_filled.groupby('Make')[numeric_columns].mean()


grouped_data
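
If several summaries per group are needed at once, a short sketch with agg (same columns as above):

grouped_summary = df_car_filled.groupby('Make')[numeric_columns].agg(['mean', 'min', 'max'])
grouped_summary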


Practical No. 3
Aim: Feature Scaling and Dummification
● Apply feature-scaling techniques like standardization and normalization to numerical
features.
● Perform feature dummification to convert categorical variables into numerical
representations.
Code:
#Importing Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

#Define The Data:


data = {
    'Product': ['Apple_juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Juice',
                'Mango_Pickle', 'Pineapple_Sorbet', 'Strawberry_Yoghurt', 'Blueberry_Pie', 'Cherry_Salsa'],
    'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi', 'Mango', 'Pineapple', 'Strawberry', 'Blueberry', 'Cherry'],
    'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
    'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
    'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
Output:


#Feature Scaling (Standardization and Normalization):


numeric_columns = ['Sales', 'Cost', 'Profit']
scaler_std = StandardScaler()
scaler_normal = MinMaxScaler()
df_scaled_std = pd.DataFrame(scaler_std.fit_transform(df[numeric_columns]), columns=numeric_columns)
df_scaled_normal = pd.DataFrame(scaler_normal.fit_transform(df[numeric_columns]), columns=numeric_columns)

df_scaled_std

df_scaled=pd.concat([df_scaled_std,df.drop(numeric_columns,axis=1)],axis=1)
print("\n Dataset after Feature Scaling:")
print(df_scaled)
Output:
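
To see what StandardScaler actually computes, a minimal manual check of z = (x - mean) / std (scikit-learn uses the population standard deviation, ddof=0):

z_sales = (df['Sales'] - df['Sales'].mean()) / df['Sales'].std(ddof=0)  # ddof=0 matches StandardScaler
print(z_sales.head())  # should equal df_scaled_std['Sales']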

#Feature Dummification (Convert Categorical Columns to numerical representation):


categorical_columns = ['Product', 'Category']
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(), categorical_columns)
    ], remainder='passthrough')

df_dummified = pd.DataFrame(preprocessor.fit_transform(df))

print("\n Dataset after Feature Dummification:")


print(df_dummified)
Output:
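
For reference, the same dummification can be done directly in pandas, and readable column names can be recovered from the ColumnTransformer (get_feature_names_out is available in scikit-learn 1.0+); a short sketch:

df_dummies = pd.get_dummies(df, columns=categorical_columns)  # pandas one-liner, sketch only
print(df_dummies.head())

df_dummified.columns = preprocessor.get_feature_names_out()  # assumes scikit-learn >= 1.0
print(df_dummified.head())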


Practical No. 4
Aim: Hypothesis Testing
● Formulate null and alternative hypotheses for a given problem.
● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
● Interpret the results and draw conclusions based on the test outcomes.
Codes:
from scipy import stats
import numpy as np

#One Sampled T-Test:


ages = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
print(ages)
mean = np.mean(ages)
print("Mean of Age values: ", mean)
H0 = "The average age of 10 people is 30."
H1 = "The average age of 10 people is more than 30."
t_stats, p_val = stats.ttest_1samp(ages, 30)
print("P-value is: ", p_val)
print("The T-Statistics is: ", t_stats)
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:
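
For reference, the one-sample t-statistic can be reproduced by hand as t = (mean - mu0) / (s / sqrt(n)), using the sample standard deviation:

n = len(ages)
t_manual = (np.mean(ages) - 30) / (np.std(ages, ddof=1) / np.sqrt(n))  # ddof=1: sample std, as ttest_1samp uses
print("Manual t-statistic:", t_manual)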

#Independent T-Test or Two Sampled T-Test:


data_group1 = np.array([12, 18, 12, 13, 15, 1, 7, 20, 21, 25, 19, 31, 21, 17, 17, 15, 19, 15, 12, 15])
data_group2 = np.array([23, 22, 24, 25, 21, 26, 21, 21, 25, 30, 24, 21, 23, 19, 14, 18, 14, 12, 19, 15])
mean1 = np.mean(data_group1)
mean2 = np.mean(data_group2)
print("Data group 1 mean value:", mean1)
print("Data group 2 mean value:", mean2)
std1 = np.std(data_group1)
std2 = np.std(data_group2)
print("Data group 1 STD value:", std1)
print("Data group 2 STD value:", std2)
Output:


H0 = "Independent sample means are equal."


H1 = "Independent sample means are not equal."
t_stats,p_val = stats.ttest_ind(data_group1, data_group2)
print("The P-value is: ", p_val)
print("The T-Statistics is: ",t_stats)
if p_val < 0.05:
print("We can reject the null hypothesis")
else:
print("We can accept the null hypothesis")
Output:

#Paired T-Test:
sample1 = [29, 30, 33, 41, 38, 36, 35, 31, 29, 30]
sample2 = [31, 32, 33, 39, 30, 33, 30, 28, 29, 31]
H0 = "Dependent sample means are equal."
H1 = "Dependent sample means are not equal."
t_stats, p_val = stats.ttest_rel(sample1, sample2)  # paired test on the two dependent samples
print("The P-value of the test is: ", p_val)
print("The T-Statistics is: ", t_stats)
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:

#Chi-Square Test:


data = [[231, 256, 321], [245, 312, 213]]

H0 = "There is no relation between the variables."
H1 = "There is a significant relation between the variables."
chi2_stats, p_val, dof, expected_val = stats.chi2_contingency(data)
print("The p-value of our test is " + str(p_val))
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:
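
The expected frequencies returned by chi2_contingency follow expected = (row total × column total) / grand total; printing them alongside the degrees of freedom aids interpretation:

print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected_val)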


Practical No. 5
Aim: ANOVA (Analysis of Variance)
● Perform One-way ANOVA to compare means across multiple groups.
Code:
from scipy import stats

performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
f_stats, p_val = stats.f_oneway(performance1, performance2, performance3, performance4)
print("p-Value : ", p_val)
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:
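
A significant F-statistic only says that at least one group mean differs; a quick sketch printing the group means shows where the difference lies (numpy imported here since this practical only uses scipy):

import numpy as np

for i, grp in enumerate([performance1, performance2, performance3, performance4], start=1):
    print("Mean of performance{}: {:.2f}".format(i, np.mean(grp)))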


Practical No. 6(A)


Aim: Regression and Its Types
● Implement Simple Linear Regression using a given dataset.
● Explore and interpret the Regression Model coefficients and goodness-of-fit measures.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

data=pd.read_csv('salaryData.csv')

x=data.iloc[:,:-1].values
y=data.iloc[:,1].values

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

from sklearn.linear_model import LinearRegression

regre = LinearRegression()
regre.fit(x_train, y_train)
Output:

y_pred=regre.predict(x_test)
x_pred=regre.predict(x_train)

print('R squared: {:.2f}'.format(regre.score(x, y) * 100))


print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred, squared=False))
Output:

print('coefficients:',regre.coef_)


print('intercept:',regre.intercept_)
Output:
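
With the fitted line, a new prediction is simply intercept + coefficient × x; a small sketch, assuming x holds years of experience as the plot labels below suggest (the input value is hypothetical):

print("Predicted salary for 5 years of experience:", regre.predict(np.array([[5.0]])))  # hypothetical input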

plt.scatter(x_train,y_train,color="purple")
plt.plot(x_train,x_pred,color="orange")
plt.title("salary vs experience(training dataset)")
plt.xlabel("years of experience")
plt.ylabel("salary")
plt.show()
Output:

plt.scatter(x_test,y_test,color="green")
plt.plot(x_train,x_pred,color="red")


plt.title("salary vs experience(training dataset)")


plt.xlabel("years of experience")
plt.ylabel("salary")
plt.show()
Output:


Practical No. 6(B)


Aim: Implement Multiple Linear Regression and assess the impact of additional predictors.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics

dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

print(X)
Output:

from sklearn.compose import ColumnTransformer


from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))


print(X)
Output:

from sklearn.model_selection import train_test_split


X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

from sklearn.linear_model import LinearRegression


regressor=LinearRegression()
regressor.fit(X_train,Y_train)
Output:

y_pred=regressor.predict(X_test)
print("predictions{}:".format(y_pred))
Output:

mlr_diff=pd.DataFrame({'Actual value':Y_test,'predicted value':y_pred})


mlr_diff.head()
Output:

print('R squared: {:.2f}'.format(regressor.score(X, Y) * 100))
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred))
print('Root Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred, squared=False))
Output:
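
To judge whether the extra predictors genuinely help, adjusted R² penalizes added features; a minimal sketch, computed on the test split:

n, p = X_test.shape
r2 = metrics.r2_score(Y_test, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared formula
print('Adjusted R squared: {:.2f}'.format(adj_r2 * 100))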


Practical No. 7(A)


Aim: Logistic Regression and Decision Tree
● Build a Logistic Regression Model to predict a binary outcome.
● Evaluate the model’s performance using classification metrics.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Social_Network_Ads.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

print(X_train)
Output:

print(y_train)
Output:

from sklearn.preprocessing import StandardScaler


sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

from sklearn.linear_model import LogisticRegression


classifier=LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
Output:

print(classifier.predict(sc.transform([[30,87000]])))
Output:

y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))
Output:

from sklearn.metrics import confusion_matrix,accuracy_score


cm=confusion_matrix(y_test,y_pred)


print(cm)
accuracy_score(y_test,y_pred)
Output:
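
Beyond accuracy, per-class precision, recall, and F1 summarize the classifier more fully; a short sketch using scikit-learn's classification_report:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))  # precision, recall, F1 per class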


Practical No. 7(B)


Aim: Construct a Decision Tree model and interpret rules for classification.
Code:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
from sklearn import datasets
from sklearn import tree

iris_data=load_iris()
iris = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)  # feature matrix as a DataFrame

print("Features Name:",iris_data.feature_names)
Output:

X=iris_data.data
print(X)
Output:

Y=iris_data.target


print(Y)
Output:

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,Y,random_state=50,test_size=0.3)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output:

from sklearn.tree import DecisionTreeClassifier


clf=DecisionTreeClassifier(random_state=100)
clf.fit(X_train,y_train)
Output:

y_pred=clf.predict(X_test)
print(y_pred)
Output:

print("Accuracy:",accuracy_score(y_test,y_pred))
Output:

from sklearn.metrics import confusion_matrix


cm=np.array(confusion_matrix(y_test,y_pred))
cm


Output:

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()
Output:

tree.plot_tree(clf)
Output:
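
To interpret the rules as text rather than a drawing, scikit-learn's export_text prints the learned if/else splits; a minimal sketch:

from sklearn.tree import export_text

print(export_text(clf, feature_names=iris_data.feature_names))  # textual if/else rules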


Practical No. 8
Aim: K-Means Clustering
● Apply the K-Means algorithm to group similar data points into clusters.
● Determine the optimal number of clusters using the elbow method or silhouette analysis.
● Visualize the clustering results and analyze the cluster characteristics.
Code:
#Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
import seaborn as sns
import sklearn.metrics as metrics

dataset = pd.read_csv('Iris.csv')
x = dataset.iloc[:, [1,2,3,4]].values

print(x)
Output:


K = range(1, 10)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k, init="k-means++")
    kmeans = kmeans.fit(x)  # Fit the model to the data
    wss_iter = kmeans.inertia_  # Access inertia_ after fitting
    wss.append(wss_iter)
Output:

#Storing the number of clusters along with the WSS in a DataFrame:


mycenters = pd.DataFrame({'Cluster' : K, 'WSS' : wss})
mycenters
Output:

#Plot Elbow Plot:


sns.lineplot(x = 'Cluster', y = 'WSS', data = mycenters, marker="+")


Output:

#Silhouette Method To Identify Clusters:


SK = range(3, 10)
sil_score = []
for i in SK:
    labels = cluster.KMeans(n_clusters=i, init="k-means++", random_state=100).fit(x).labels_
    score = metrics.silhouette_score(x, labels, metric="euclidean", sample_size=1000, random_state=100)
    sil_score.append(score)
    print("Silhouette score for k(Clusters) = " + str(i) + " is " + str(score))
Output:

sil_centers = pd.DataFrame({'Clusters' : SK, 'Sil Score' : sil_score})


sil_centers


Output:

#Perform K-Means Clustering With 3 Clusters:


kmeans = cluster.KMeans(n_clusters=3, init="k-means++")
y_kmeans = kmeans.fit_predict(x)

#Visualization of Clusters:
# Note: k-means cluster labels are arbitrary; the species names below assume a particular label order
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], c='red', label="setosa")
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], c='blue', label="versicolour")
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], c='green', label="virginica")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='yellow', label='Centroids')
plt.legend()
Output:

sns.lineplot(x='Clusters', y='Sil Score', data=sil_centers, marker="+")


Output:
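
To analyze the cluster characteristics numerically, the per-cluster feature means can be tabulated; a minimal sketch (the column names are assumed from the Iris measurements, not read from Iris.csv itself):

feature_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']  # assumed names
cluster_profile = pd.DataFrame(x, columns=feature_names)
cluster_profile['Cluster'] = y_kmeans
print(cluster_profile.groupby('Cluster').mean())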


Practical No. 9
Aim: Principal Component Analysis
● Perform PCA on a dataset to reduce dimensionality.
● Evaluate the explained variance and select the appropriate number of principal
components.
● Visualize the data in the reduced dimensional space.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Wine.csv')
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

from sklearn.preprocessing import StandardScaler


sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)

print(x_train)
Output:


from sklearn.decomposition import PCA


pca=PCA(n_components = 2)
x_train =pca.fit_transform(x_train)
x_test= pca.transform(x_test)

print(x_train)
Output:
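
To evaluate the explained variance and justify keeping two components, PCA exposes explained_variance_ratio_; a short sketch:

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())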

plt.scatter(x_train[:, 0], x_train[:, 1])


plt.xlabel('First principal component')


plt.ylabel('Second principal component')
plt.show()
Output:
