
Sub: Data Science  Class: TYCS  Name: Amit Bablu Rai  Roll No: 101

Practical No. 1
Aim: Introduction to Excel:
● Perform conditional formatting on a dataset using various criteria.
● Create a pivot table to analyze and summarize data.
● Use the VLOOKUP function to retrieve information from a different worksheet or table.
● Perform what-if analysis using Goal Seek to determine input values for desired output.

1. Perform conditional formatting on a dataset using various criteria.


We perform conditional formatting on the "Sell Price" column to highlight cells with a
price greater than $3,500 using the following steps:
Steps:
1. Select the "Sell Price" column (Column E).

2. Go to the "Home" tab on the ribbon.


3. Click on "Conditional Formatting" in the toolbar.
4. Choose "Highlight Cells Rules" and then "Greater Than."


5. Enter the threshold value as 3500.


6. Customize the formatting options (e.g., choose a fill color).

7. Click "OK" to apply the rule.


2. Create a pivot table to analyze and summarize data.


Create a pivot table for the following analyses:
a) How many cars do you have by make and model and by color?
b) Find out the profit margin on different vehicles.
c) Find out the average cost of vehicles.
d) Find out the percentage of cars of each color.
Steps:
1. Select the entire dataset including headers.

2. Go to the "Insert" tab on the ribbon.


3. Click on "PivotTable."


4. Choose where you want to place the PivotTable (e.g., new worksheet).


5. Let us find out how many cars we have by make, model, and color.
Drag the "Make" column to the Rows area.
Drag the “Model” column to the Rows area under the “Make” column.
Drag the “Color” column to the Column Section.
Drag the “Make” column to the Values Section.

This gives us the number of cars by Make, Model, and Color.


6. Let us find out the profit margin on different vehicles.
Drag the "Make" column to the Rows area.
Drag the “Make” column to the Values Section.
Drag the “Sell Price” column to the Values Section.
Drag the “Buy Price” column to the Values Section.


Add a calculated field named "Profit Margin".


Enter ='Sell Price' - 'Buy Price' in the Formula box. This adds a Profit Margin
column to the pivot table. (You can sort the data if needed.)


7. Let us find out the average cost of vehicles.


Drag the "Make" column to the Rows area.
Drag the “Buy Price” column to the Values Section.


8. Let us find out the percentage of cars of each color.


Drag the "Color" column to the Rows area.
Drag the “Color” column to the Values Section.
Right-click a value in the Count of Color column, choose "Show Values As", and
select "% of Column Total".


Now visualize the data using a PivotChart.


On the "PivotTable Analyze" tab, click "PivotChart".

Add the "Make" column to the Rows area and the PivotChart updates immediately.


3. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
1. Consider the car dataset. Let us find the sale price of a particular car.
2. Enter the car company name in one cell and then add the VLOOKUP formula:
=VLOOKUP(A12,A1:F7,5,0)
i) Lookup_value: the value to search for (here A12, the cell holding the car company name).
ii) Table_array: the range containing the complete dataset (A1:F7).
iii) Col_index_num: the number of the column whose value should be returned (5, the Sell Price column).
iv) Range_lookup: TRUE for an approximate match, FALSE (0) for an exact match.

4. Perform what-if analysis using Goal Seek to determine input values for desired output.
Steps:
1. Go to the "Data" tab on the ribbon.
2. Click on "What-If Analysis" and select "Goal Seek."


3. Set "Set cell" to the Saving cell, "To value" to 7000, and "By changing cell" to
the Transportation cell.

4. Click "OK" to let Excel determine the required Savings.


Practical No. 2
Aim: Data Frames and Basic Data Pre-processing
● Read data from CSV and JSON files into a data frame.
● Perform basic data pre-processing tasks such as handling missing values and outliers.
● Manipulate and transform data using functions like filtering, sorting, and grouping.

Create the CSV file given below:

Code:
#Read Data From .CSV File
import pandas as pd

data=pd.read_csv('student.csv')
print("Reading data from CSV file:")
data.head()

#Read Data From JSon File:


data_json = pd.read_json("animals.json")
print("Reading data from JSON file:")
data_json.head()


import json

#Creating JSON data manually


json_input = '''[
{"studid" : "001", "name" : "Nandita", "Age": "25"},
{"studid" : "002", "name" : "Rinki", "Age": "19"},
{"studid" : "009", "name" : "Kajal", "Age": "15"}
]'''
info = json.loads(json_input)  # renamed from input to avoid shadowing the built-in
print("User Count : ", len(info))
for item in info:
    print("\nStudent ID: ", item["studid"])
    print("Name: ", item["name"])
    print("Age: ", item["Age"])
Output:

#Handling Missing Values:


df_car = pd.read_csv("Car Inventory Details.csv")
df_car


# Option 1: drop every row that contains a missing value
df_car_clean = df_car.dropna(axis=0, how='any')
df_car_clean

# Option 2: fill missing values with 0 (applied to the original frame,
# since df_car_clean no longer has any NaNs to fill)
df_car_filled = df_car.fillna(0)
df_car_filled
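
As a hedged alternative sketch (reusing the 'Sell Price' column from this dataset), numeric gaps can be filled with the column median instead of a blanket 0:

df_car_imputed = df_car.fillna({'Sell Price': df_car['Sell Price'].median()})  # median fill, sketch only
df_car_imputed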


#Handling Outliers
median_value = df_car_filled['Sell Price'].median()
upper_threshold = df_car_filled['Sell Price'].mean() + 2 * df_car_filled['Sell Price'].std()
lower_threshold = df_car_filled['Sell Price'].mean() - 2 * df_car_filled['Sell Price'].std()
# Replace values more than 2 standard deviations from the mean with the median
df_car_filled['Sell Price'] = df_car_filled['Sell Price'].apply(
    lambda x: median_value if x > upper_threshold or x < lower_threshold else x)

df_car_filled
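
For comparison, a minimal IQR-based sketch flags values outside 1.5 × IQR instead of using mean ± 2 std (same 'Sell Price' column assumed):

q1 = df_car_filled['Sell Price'].quantile(0.25)
q3 = df_car_filled['Sell Price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
df_car_filled[(df_car_filled['Sell Price'] < lower) | (df_car_filled['Sell Price'] > upper)]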

#Manipulate And Transform Data:


filtered_data = df_car_filled[df_car_filled['Sell Price'] > 3000]


filtered_data

sorted_data = df_car_filled.sort_values(by='Sell Price', ascending=True)


sorted_data

numeric_columns = ['Sell Price', 'Buy Price']


grouped_data = df_car_filled.groupby('Make')[numeric_columns].mean()


grouped_data
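
If several summaries per group are needed at once, a short sketch with agg (same columns as above):

grouped_summary = df_car_filled.groupby('Make')[numeric_columns].agg(['mean', 'min', 'max'])
grouped_summary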


Practical No. 3
Aim: Feature Scaling and Dummification
● Apply feature-scaling techniques like standardization and normalization to numerical
features.
● Perform feature dummification to convert categorical variables into numerical
representations.
Code:
#Importing Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

#Define The Data:


data = {
    'Product': ['Apple_juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Juice',
                'Mango_Pickle', 'Pineapple_Sorbet', 'Strawberry_Yoghurt', 'Blueberry_Pie', 'Cherry_Salsa'],
    'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi', 'Mango', 'Pineapple', 'Strawberry', 'Blueberry', 'Cherry'],
    'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
    'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
    'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
Output:


#Feature Scaling (Standardization and Normalization):


numeric_columns = ['Sales', 'Cost', 'Profit']
scaler_std = StandardScaler()
scaler_normal = MinMaxScaler()
df_scaled_std = pd.DataFrame(scaler_std.fit_transform(df[numeric_columns]), columns=numeric_columns)
df_scaled_normal = pd.DataFrame(scaler_normal.fit_transform(df[numeric_columns]), columns=numeric_columns)

df_scaled_std

df_scaled=pd.concat([df_scaled_std,df.drop(numeric_columns,axis=1)],axis=1)
print("\n Dataset after Feature Scaling:")
print(df_scaled)
Output:
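
To see what StandardScaler actually computes, a minimal manual check of z = (x - mean) / std (scikit-learn uses the population standard deviation, ddof=0):

z_sales = (df['Sales'] - df['Sales'].mean()) / df['Sales'].std(ddof=0)  # ddof=0 matches StandardScaler
print(z_sales.head())  # should equal df_scaled_std['Sales']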

#Feature Dummification (Convert Categorical Columns to numerical representation):


categorical_columns = ['Product', 'Category']
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(), categorical_columns)
    ], remainder='passthrough')

df_dummified = pd.DataFrame(preprocessor.fit_transform(df))

print("\n Dataset after Feature Dummification:")


print(df_dummified)
Output:
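
For reference, the same dummification can be done directly in pandas, and readable column names can be recovered from the ColumnTransformer (get_feature_names_out is available in scikit-learn 1.0+); a short sketch:

df_dummies = pd.get_dummies(df, columns=categorical_columns)  # pandas one-liner, sketch only
print(df_dummies.head())

df_dummified.columns = preprocessor.get_feature_names_out()  # assumes scikit-learn >= 1.0
print(df_dummified.head())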


Practical No. 4
Aim: Hypothesis Testing
● Formulate null and alternative hypotheses for a given problem.
● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
● Interpret the results and draw conclusions based on the test outcomes.
Codes:
from scipy import stats
import numpy as np

#One Sampled T-Test:


ages = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
print(ages)
mean = np.mean(ages)
print("Mean of Age values: ", mean)
H0 = "The average age of 10 people is 30."
H1 = "The average age of 10 people is more than 30."
t_stats, p_val = stats.ttest_1samp(ages, 30)
print("P-value is: ", p_val)
print("The T-Statistics is: ", t_stats)
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:
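
For reference, the one-sample t-statistic can be reproduced by hand as t = (mean - mu0) / (s / sqrt(n)), using the sample standard deviation:

n = len(ages)
t_manual = (np.mean(ages) - 30) / (np.std(ages, ddof=1) / np.sqrt(n))  # ddof=1: sample std, as ttest_1samp uses
print("Manual t-statistic:", t_manual)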

#Independent T-Test or Two Sampled T-Test:


data_group1 = np.array([12, 18, 12, 13, 15, 1, 7, 20, 21, 25, 19, 31, 21, 17, 17, 15, 19, 15, 12, 15])
data_group2 = np.array([23, 22, 24, 25, 21, 26, 21, 21, 25, 30, 24, 21, 23, 19, 14, 18, 14, 12, 19, 15])
mean1 = np.mean(data_group1)
mean2 = np.mean(data_group2)
print("Data group 1 mean value:", mean1)
print("Data group 2 mean value:", mean2)
std1 = np.std(data_group1)
std2 = np.std(data_group2)
print("Data group 1 STD value:", std1)
print("Data group 2 STD value:", std2)
Output:


H0 = "Independent sample means are equal."


H1 = "Independent sample means are not equal."
t_stats,p_val = stats.ttest_ind(data_group1, data_group2)
print("The P-value is: ", p_val)
print("The T-Statistics is: ",t_stats)
if p_val < 0.05:
print("We can reject the null hypothesis")
else:
print("We can accept the null hypothesis")
Output:

#Paired T-Test:
sample1 = [29, 30, 33, 41, 38, 36, 35, 31, 29, 30]
sample2 = [31, 32, 33, 39, 30, 33, 30, 28, 29, 31]
H0 = "Dependent sample means are equal."
H1 = "Dependent sample means are not equal."
t_stats, p_val = stats.ttest_rel(sample1, sample2)  # paired test on the two dependent samples
print("The P-value of the test is: ", p_val)
print("The T-Statistics is: ", t_stats)
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:

#Chi-Square Test:


data = [[231, 256, 321], [245, 312, 213]]

H0 = "There is no relation between the variables."
H1 = "There is a significant relation between the variables."
chi2_stats, p_val, dof, expected_val = stats.chi2_contingency(data)
print("The p-value of our test is " + str(p_val))
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:
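
The expected frequencies returned by chi2_contingency follow expected = (row total × column total) / grand total; printing them alongside the degrees of freedom aids interpretation:

print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected_val)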


Practical No. 5
Aim: ANOVA (Analysis of Variance)
● Perform One-way ANOVA to compare means across multiple groups.
Code:
from scipy import stats

performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
f_stats, p_val = stats.f_oneway(performance1, performance2, performance3, performance4)
print("p-Value : ", p_val)
if p_val < 0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
Output:
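
A significant F-statistic only says that at least one group mean differs; a quick sketch printing the group means shows where the difference lies (numpy imported here since this practical only uses scipy):

import numpy as np

for i, grp in enumerate([performance1, performance2, performance3, performance4], start=1):
    print("Mean of performance{}: {:.2f}".format(i, np.mean(grp)))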


Practical No. 6(A)


Aim: Regression and Its Types
● Implement Simple Linear Regression using a given dataset.
● Explore and interpret the Regression Model coefficients and goodness-of-fit measures.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

data=pd.read_csv('salaryData.csv')

x=data.iloc[:,:-1].values
y=data.iloc[:,1].values

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

from sklearn.linear_model import LinearRegression

regre = LinearRegression()
regre.fit(x_train, y_train)
Output:

y_pred=regre.predict(x_test)
x_pred=regre.predict(x_train)

print('R squared: {:.2f}'.format(regre.score(x, y) * 100))


print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred, squared=False))
Output:

print('coefficients:',regre.coef_)


print('intercept:',regre.intercept_)
Output:
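
With the fitted line, a new prediction is simply intercept + coefficient × x; a small sketch, assuming x holds years of experience as the plot labels below suggest (the input value is hypothetical):

print("Predicted salary for 5 years of experience:", regre.predict(np.array([[5.0]])))  # hypothetical input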

plt.scatter(x_train,y_train,color="purple")
plt.plot(x_train,x_pred,color="orange")
plt.title("salary vs experience(training dataset)")
plt.xlabel("years of experience")
plt.ylabel("salary")
plt.show()
Output:

plt.scatter(x_test,y_test,color="green")
plt.plot(x_train,x_pred,color="red")


plt.title("salary vs experience(training dataset)")


plt.xlabel("years of experience")
plt.ylabel("salary")
plt.show()
Output:


Practical No. 6(B)


Aim: Implement Multiple Linear Regression and assess the impact of additional predictors.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics

dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

print(X)
Output:

from sklearn.compose import ColumnTransformer


from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))


print(X)
Output:

from sklearn.model_selection import train_test_split


X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

from sklearn.linear_model import LinearRegression


regressor=LinearRegression()
regressor.fit(X_train,Y_train)
Output:

y_pred=regressor.predict(X_test)
print("predictions{}:".format(y_pred))
Output:

mlr_diff=pd.DataFrame({'Actual value':Y_test,'predicted value':y_pred})


mlr_diff.head()
Output:

print('R squared: {:.2f}'.format(regressor.score(X, Y) * 100))
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred))
print('Root Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred, squared=False))
Output:
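
To judge whether the extra predictors genuinely help, adjusted R² penalizes added features; a minimal sketch, computed on the test split:

n, p = X_test.shape
r2 = metrics.r2_score(Y_test, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared formula
print('Adjusted R squared: {:.2f}'.format(adj_r2 * 100))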


Practical No. 7(A)


Aim: Logistic Regression and Decision Tree
● Build a Logistic Regression Model to predict a binary outcome.
● Evaluate the model’s performance using classification metrics.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Social_Network_Ads.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

print(X_train)
Output:

print(y_train)
Output:

from sklearn.preprocessing import StandardScaler


sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

from sklearn.linear_model import LogisticRegression


classifier=LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
Output:

print(classifier.predict(sc.transform([[30,87000]])))
Output:

y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))
Output:

from sklearn.metrics import confusion_matrix,accuracy_score


cm=confusion_matrix(y_test,y_pred)


print(cm)
accuracy_score(y_test,y_pred)
Output:
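
Beyond accuracy, per-class precision, recall, and F1 summarize the classifier more fully; a short sketch using scikit-learn's classification_report:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))  # precision, recall, F1 per class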


Practical No. 7(B)


Aim: Construct a Decision Tree model and interpret rules for classification.
Code:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
from sklearn import datasets
from sklearn import tree

iris_data=load_iris()
iris = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)  # feature matrix as a DataFrame

print("Features Name:",iris_data.feature_names)
Output:

X=iris_data.data
print(X)
Output:

Y=iris_data.target


print(Y)
Output:

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,Y,random_state=50,test_size=0.3)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output:

from sklearn.tree import DecisionTreeClassifier


clf=DecisionTreeClassifier(random_state=100)
clf.fit(X_train,y_train)
Output:

y_pred=clf.predict(X_test)
print(y_pred)
Output:

print("Accuracy:",accuracy_score(y_test,y_pred))
Output:

from sklearn.metrics import confusion_matrix


cm=np.array(confusion_matrix(y_test,y_pred))
cm


Output:

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()
Output:

tree.plot_tree(clf)
Output:
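
To interpret the rules as text rather than a drawing, scikit-learn's export_text prints the learned if/else splits; a minimal sketch:

from sklearn.tree import export_text

print(export_text(clf, feature_names=iris_data.feature_names))  # textual if/else rules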


Practical No. 8
Aim: K-Means Clustering
● Apply the K-Means algorithm to group similar data points into clusters.
● Determine the optimal number of clusters using the elbow method or silhouette analysis.
● Visualize the clustering results and analyze the cluster characteristics.
Code:
#Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
import seaborn as sns
import sklearn.metrics as metrics

dataset = pd.read_csv('Iris.csv')
x = dataset.iloc[:, [1,2,3,4]].values

print(x)
Output:


K = range(1, 10)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k, init="k-means++")
    kmeans = kmeans.fit(x)  # Fit the model to the data
    wss_iter = kmeans.inertia_  # Access inertia_ after fitting
    wss.append(wss_iter)
Output:

#Storing the number of clusters along with the WSS in a DataFrame:


mycenters = pd.DataFrame({'Cluster' : K, 'WSS' : wss})
mycenters
Output:

#Plot Elbow Plot:


sns.lineplot(x = 'Cluster', y = 'WSS', data = mycenters, marker="+")


Output:

#Silhouette Method To Identify Clusters:


SK = range(3, 10)
sil_score = []
for i in SK:
    labels = cluster.KMeans(n_clusters=i, init="k-means++", random_state=100).fit(x).labels_
    score = metrics.silhouette_score(x, labels, metric="euclidean", sample_size=1000, random_state=100)
    sil_score.append(score)
    print("Silhouette score for k(Clusters) = " + str(i) + " is " + str(score))
Output:

sil_centers = pd.DataFrame({'Clusters' : SK, 'Sil Score' : sil_score})


sil_centers


Output:

#Perform K-Means Clustering With 3 Clusters:


kmeans = cluster.KMeans(n_clusters=3, init="k-means++")
y_kmeans = kmeans.fit_predict(x)

#Visualization of Clusters:
# Note: k-means cluster labels are arbitrary; the species names below assume a particular label order
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], c='red', label="setosa")
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], c='blue', label="versicolour")
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], c='green', label="virginica")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='yellow', label='Centroids')
plt.legend()
Output:

sns.lineplot(x='Clusters', y='Sil Score', data=sil_centers, marker="+")


Output:
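
To analyze the cluster characteristics numerically, the per-cluster feature means can be tabulated; a minimal sketch (the column names are assumed from the Iris measurements, not read from Iris.csv itself):

feature_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']  # assumed names
cluster_profile = pd.DataFrame(x, columns=feature_names)
cluster_profile['Cluster'] = y_kmeans
print(cluster_profile.groupby('Cluster').mean())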


Practical No. 9
Aim: Principal Component Analysis
● Perform PCA on a dataset to reduce dimensionality.
● Evaluate the explained variance and select the appropriate number of principal
components.
● Visualize the data in the reduced dimensional space.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Wine.csv')
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

from sklearn.preprocessing import StandardScaler


sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)

print(x_train)
Output:


from sklearn.decomposition import PCA


pca=PCA(n_components = 2)
x_train =pca.fit_transform(x_train)
x_test= pca.transform(x_test)

print(x_train)
Output:
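
To evaluate the explained variance and justify keeping two components, PCA exposes explained_variance_ratio_; a short sketch:

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())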

plt.scatter(x_train[:, 0], x_train[:, 1])


plt.xlabel('First principal component')


plt.ylabel('Second principal component')
plt.show()
Output:
