DS Practical
Practical No. 1
Aim: Introduction to Excel
● Perform conditional formatting on a dataset using various criteria.
● Create a pivot table to analyze and summarize data.
● Use the VLOOKUP function to retrieve information from a different worksheet or table.
● Perform what-if analysis using Goal Seek to determine input values for desired output.
2. Create a pivot table to analyze and summarize data.
Steps:
1. Select any cell inside the dataset.
2. Go to the Insert tab and click PivotTable.
3. Confirm the data range in the dialog box.
4. Choose where you want to place the PivotTable (e.g., new worksheet).
5. Let us find how many cars we have by make and model, and by color:
Drag the "Make" column to the Rows area.
Drag the "Model" column to the Rows area under the "Make" column.
Drag the "Color" column to the Columns area.
Drag the "Make" column to the Values area.
Add the "Make" column to the Rows area and the PivotChart will update immediately.
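For reference, the same summary can be built in pandas; a minimal sketch, assuming the car data sits in a hypothetical cars.csv with Make, Model, and Color columns:
import pandas as pd
cars = pd.read_csv('cars.csv')  #hypothetical file with Make, Model, Color columns
#Count cars by make and model (rows) against color (columns), as in the PivotTable above
pivot = cars.pivot_table(index=['Make', 'Model'], columns='Color', aggfunc='size', fill_value=0)
print(pivot)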
4. Perform what-if analysis using Goal Seek to determine input values for desired output.
Steps:
1. Go to the "Data" tab on the ribbon.
2. Click on "What-If Analysis" and select "Goal Seek."
3. Set "Set cell" to the Saving cell, "To value" to 7000, and "By changing cell" to the Transportation cell.
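Goal Seek is a one-variable root finder; a minimal Python sketch of the same idea using scipy, with all budget figures assumed for illustration:
from scipy.optimize import brentq
#Hypothetical monthly budget (all figures assumed)
income, rent, food = 20000, 8000, 4000
target_saving = 7000
def saving_gap(transportation):
    return income - (rent + food + transportation) - target_saving
#Solve saving_gap(t) = 0 on [0, income], like Goal Seek's "By changing cell"
transportation = brentq(saving_gap, 0, income)
print(transportation)  #transportation spend that achieves the target saving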
Practical No. 2
Aim: Data Frames and Basic Data Pre-processing
● Read data from CSV and JSON files into a data frame.
● Perform basic data pre-processing tasks such as handling missing values and outliers.
● Manipulate and transform data using functions like filtering, sorting, and grouping.
Code:
#Read Data From .CSV File
import pandas as pd
data=pd.read_csv('student.csv')
print("Reading data from CSV file:")
print(data.head())
#Read Data From .JSON File
import json
#File name assumed for illustration; the original read step was omitted
df_car = pd.read_json('car.json')
print("Reading data from JSON file:")
print(df_car.head())
#Handling Missing Values (cleaning step assumed; the original was omitted)
df_car_clean = df_car.dropna(how='all')
df_car_filled = df_car_clean.fillna(0)
print(df_car_filled)
#Handling Outliers: replace values beyond 2 standard deviations from the mean with the median
median_value = df_car_filled['Sell Price'].median()
upper_threshold = df_car_filled['Sell Price'].mean() + 2 * df_car_filled['Sell Price'].std()
lower_threshold = df_car_filled['Sell Price'].mean() - 2 * df_car_filled['Sell Price'].std()
df_car_filled['Sell Price'] = df_car_filled['Sell Price'].apply(lambda x: median_value if x > upper_threshold or x < lower_threshold else x)
print(df_car_filled)
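An alternative rule (not in the source) is the interquartile range; a minimal sketch on the same column:
#Flag rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df_car_filled['Sell Price'].quantile(0.25)
q3 = df_car_filled['Sell Price'].quantile(0.75)
iqr = q3 - q1
outliers = df_car_filled[(df_car_filled['Sell Price'] < q1 - 1.5 * iqr) | (df_car_filled['Sell Price'] > q3 + 1.5 * iqr)]
print(outliers)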
#Grouping (grouping column assumed; the original statement was omitted)
grouped_data = df_car_filled.groupby('Make')['Sell Price'].mean()
print(grouped_data)
Practical No. 3
Aim: Feature Scaling and Dummification
● Apply feature-scaling techniques like standardization and normalization to numerical
features.
● Perform feature dummification to convert categorical variables into numerical
representations.
Code:
#Importing Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
#Sample dataset (values assumed; the original data definition was omitted)
data = {'Product': ['A', 'B', 'C', 'D'],
        'Category': ['X', 'Y', 'X', 'Y'],
        'Price': [100, 250, 175, 320],
        'Quantity': [10, 5, 8, 3]}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
Output:
#Standardize the numeric columns (standardization step assumed; it was omitted in the source)
numeric_columns = ['Price', 'Quantity']
scaler = StandardScaler()
df_scaled_std = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
print(df_scaled_std)
df_scaled=pd.concat([df_scaled_std,df.drop(numeric_columns,axis=1)],axis=1)
print("\n Dataset after Feature Scaling:")
print(df_scaled)
Output:
categorical_columns = ['Product', 'Category']
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(), categorical_columns)
    ], remainder='passthrough')
df_dummified = pd.DataFrame(preprocessor.fit_transform(df))
print("\nDataset after Dummification:")
print(df_dummified)
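For comparison (not in the source), pandas can produce the same dummy columns directly; a minimal sketch on the same assumed columns:
df_dummies_alt = pd.get_dummies(df, columns=['Product', 'Category'])  #one indicator column per category level
print(df_dummies_alt)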
Practical No. 4
Aim: Hypothesis Testing
● Formulate null and alternative hypotheses for a given problem.
● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
● Interpret the results and draw conclusions based on the test outcomes.
Code:
from scipy import stats
import numpy as np
#Paired T-Test:
sample1 = [29, 30, 33, 41, 38, 36, 35, 31, 29, 30]
sample2 = [31, 32, 33, 39, 30, 33, 30, 28, 29, 31]
H0 = "Dependent sample means are equal."
H1 = "Dependent sample means are not equal."
t_stats, p_val = stats.ttest_rel(sample1, sample2)
print("The P-value of the test is: ", p_val)
print("The T-Statistics is: ", t_stats)
if p_val < 0.05:
    print("We reject the null hypothesis:", H1)
else:
    print("We fail to reject the null hypothesis:", H0)
Output:
#Chi-Square Test:
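The chi-square code is not included in the source; a minimal sketch using scipy's chi2_contingency on an assumed contingency table:
#Hypothetical 2x2 contingency table of observed frequencies (values assumed)
observed = np.array([[20, 30], [25, 25]])
chi2_stat, p_val, dof, expected = stats.chi2_contingency(observed)
print("Chi-square statistic:", chi2_stat)
print("P-value:", p_val)
if p_val < 0.05:
    print("We reject the null hypothesis: the variables are dependent")
else:
    print("We fail to reject the null hypothesis: the variables are independent")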
Practical No. 5
Aim: ANOVA (Analysis of Variance)
● Perform One-way ANOVA to compare means across multiple groups.
Code:
from scipy import stats
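The ANOVA code itself is not included in the source; a minimal sketch of a one-way ANOVA using scipy's f_oneway on three assumed groups:
#Hypothetical scores for three independent groups (values assumed)
group1 = [85, 86, 88, 75, 78, 94, 98, 79]
group2 = [91, 92, 93, 85, 87, 84, 82, 88]
group3 = [79, 78, 88, 94, 92, 85, 83, 85]
f_stat, p_val = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_stat)
print("P-value:", p_val)
if p_val < 0.05:
    print("We reject the null hypothesis: at least one group mean differs")
else:
    print("We fail to reject the null hypothesis")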
Practical No. 6
Aim: Regression (Simple and Multiple Linear Regression)
Code:
#Simple Linear Regression
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data=pd.read_csv('salaryData.csv')
x=data.iloc[:,:-1].values
y=data.iloc[:,1].values
#Split into training and test sets (split ratio assumed; it was omitted in the source)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
#Fit the simple linear regression model
regre=LinearRegression()
regre.fit(x_train,y_train)
y_pred=regre.predict(x_test)
x_pred=regre.predict(x_train)  #training-set predictions, used to draw the fitted line
print('coefficients:',regre.coef_)
print('intercept:',regre.intercept_)
Output:
plt.scatter(x_train,y_train,color="purple")
plt.plot(x_train,x_pred,color="orange")
plt.title("salary vs experience(training dataset)")
plt.xlabel("years of experience")
plt.ylabel("salary")
plt.show()
Output:
plt.scatter(x_test,y_test,color="green")
plt.plot(x_train,x_pred,color="red")
#Title and labels assumed to mirror the training plot; they were omitted in the source
plt.title("salary vs experience(test dataset)")
plt.xlabel("years of experience")
plt.ylabel("salary")
plt.show()
#Multiple Linear Regression
dataset=pd.read_csv('50_Startups.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values
print(X)
Output:
#Encode the categorical State column (encoding step assumed; it was omitted in the source)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')
X=ct.fit_transform(X)
print(X)
Output:
#Split and fit the multiple linear regression model (split ratio assumed; it was omitted in the source)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)
regressor=LinearRegression()
regressor.fit(X_train,Y_train)
y_pred=regressor.predict(X_test)
print("predictions: {}".format(y_pred))
Output:
#Compare actual and predicted values (construction assumed; it was omitted in the source)
mlr_diff=pd.DataFrame({'Actual value':Y_test,'Predicted value':y_pred})
print(mlr_diff.head())
Output:
from sklearn import metrics
print('R squared:{:.2f}'.format(regressor.score(X,Y)*100))
print('Mean Absolute Error:',metrics.mean_absolute_error(Y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(Y_test,y_pred))
print('Root Mean Squared Error:',metrics.mean_squared_error(Y_test,y_pred,squared=False))
Output:
Practical No. 7
Aim: Logistic Regression and Decision Tree
Code:
#Logistic Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
dataset=pd.read_csv('Social_Network_Ads.csv')
X=dataset.iloc[:,:-1].values
Y=dataset.iloc[:,-1].values
#Split into training and test sets (split ratio assumed; it was omitted in the source)
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
print(X_train)
Output:
print(y_train)
Output:
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
#Fit the logistic regression classifier (fitting step assumed; it was omitted in the source)
classifier=LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predict a single observation (age 30, salary 87000) after applying the same scaling
print(classifier.predict(sc.transform([[30,87000]])))
Output:
y_pred=classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))
Output:
#Confusion matrix and accuracy
cm=confusion_matrix(y_test,y_pred)
print(cm)
print("Accuracy:",accuracy_score(y_test,y_pred))
Output:
#Decision Tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
iris_data=load_iris()
iris=pd.DataFrame(iris_data.feature_names)
print("Features Name:",iris_data.feature_names)
Output:
X=iris_data.data
print(X)
Output:
Y=iris_data.target
print(Y)
Output:
X_train,X_test,y_train,y_test=train_test_split(X,Y,random_state=50,test_size=0.3)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output:
#Train the decision tree classifier (fitting step assumed; it was omitted in the source)
clf=DecisionTreeClassifier()
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(y_pred)
Output:
print("Accuracy:",accuracy_score(y_test,y_pred))
Output:
tree.plot_tree(clf)
plt.show()
Output:
Practical No. 8
Aim: K-Means Clustering
● Apply the K-Means algorithm to group similar data points into clusters.
● Determine the optimal number of clusters using the elbow method or silhouette analysis.
● Visualize the clustering results and analyze the cluster characteristics.
Code:
#Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
import seaborn as sns
import sklearn.metrics as metrics
dataset = pd.read_csv('Iris.csv')
x = dataset.iloc[:, [1,2,3,4]].values
print(x)
Output:
K = range(1, 10)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k, init="k-means++")
    kmeans = kmeans.fit(x)  # Fit the model to the data
    wss_iter = kmeans.inertia_  # Access inertia_ after fitting
    wss.append(wss_iter)
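The elbow plot and silhouette computation are not shown in the source; a minimal sketch using the variables above:
#Plot the elbow curve
plt.plot(K, wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow Method')
plt.show()
#Silhouette analysis (defined only for k >= 2)
for k in range(2, 10):
    labels = cluster.KMeans(n_clusters=k, init="k-means++").fit_predict(x)
    print("k =", k, "silhouette score =", metrics.silhouette_score(x, labels))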
Output:
Output:
#Visualization of Clusters:
#Fit the final model with 3 clusters (value assumed from the elbow curve) to get labels
kmeans = cluster.KMeans(n_clusters=3, init="k-means++")
y_kmeans = kmeans.fit_predict(x)
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], c = 'red', label = "setosa")
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], c = 'blue', label = "versicolour")
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], c = 'green', label = "virginica")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'yellow', label = 'Centroids')
plt.legend()
plt.show()
Output:
Practical No. 9
Aim: Principal Component Analysis
● Perform PCA on a dataset to reduce dimensionality.
● Evaluate the explained variance and select the appropriate number of principal
components.
● Visualize the data in the reduced dimensional space.
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Wine.csv')
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
#Split into training and test sets and standardize (these steps assumed; they were omitted in the source)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
print(x_train)
Output:
#Apply PCA with 2 components (component count assumed; choose it from the explained variance)
from sklearn.decomposition import PCA
pca=PCA(n_components=2)
x_train=pca.fit_transform(x_train)
x_test=pca.transform(x_test)
print('Explained variance ratio:',pca.explained_variance_ratio_)
print(x_train)
Output:
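The visualization step is not shown in the source; a minimal sketch plotting the training data in the two-component space, using the variables above:
plt.figure()
for label in np.unique(y_train):
    plt.scatter(x_train[y_train == label, 0], x_train[y_train == label, 1], label = 'Class ' + str(label))
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Wine dataset in the reduced PCA space')
plt.legend()
plt.show()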