DV Lab


EX.NO:1 DPLYR PACKAGE

DATE:

Perform data manipulation operations on the iris and mtcars datasets using the dplyr package and obtain
the results for the following functions:

i)filter

ii)select

iii)arrange

iv) mutate

v) summarise

AIM:

To perform data manipulation operations on the iris and mtcars datasets using the dplyr package and obtain the results for the following functions.

Procedure and code:

i) Filter:

To install dplyr, use the below command

install.packages("dplyr")

To load dplyr, use the below command

library(dplyr)

Loading a data set

data("mtcars")

data("iris")
mydata <- mtcars

mydata

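Creating a local data frame (tibble). Tibbles are easier to read because they print only the first rows and the columns that fit on screen. tbl_df() is deprecated in current dplyr, so as_tibble() is used instead.

mynewdata <- as_tibble(mydata)

mynewdata

myirisdata <- as_tibble(iris)

myirisdata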
Use filter to filter data with required condition

filter(mynewdata,cyl>4 & gear>4)

filter(mynewdata,cyl>4)

filter( myirisdata,Species %in% c('setosa' , 'virginica'))


II) select:

When you are working with large datasets with many columns but are interested in only a few, select() lets you rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions.

select(mynewdata, cyl, mpg, hp)

Hide a range of columns:

select(mynewdata,-c(mpg, cyl, disp))


III) arrange:

mynewdata %>%

select( cyl,wt,gear)%>%

arrange( desc(wt))
IV) mutate:

mutate() adds new variables while preserving the existing ones. The new columns are computed as functions of existing columns; here select() first keeps mpg and cyl, and mutate() then adds their product.

mynewdata %>%

select( mpg,cyl)%>%

mutate( newvariable = mpg*cyl)

v)summarise:

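The summarise() function collapses a data frame to a single row. Combined with group_by(), it returns one row per group, here the mean Sepal.Length for each Species.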

myirisdata %>%

group_by(Species)%>%

summarise(average=mean(Sepal.Length,na.rm=TRUE))
RESULT:

The dplyr package Program has been Executed Successfully


EX.NO:2 TIDYR PACKAGE

DATE:
Create a data frame and do the following operations using tidyr package

i)gather
ii)spread
iii) separate

iv)unite
AIM:
To Create a data frame and do the following operations using tidyr package
Procedure and Code:
Installing tidyr package
install.packages('tidyr')
library(tidyr)
Creating a dummy data set.
name <- c('Akanash', 'Bhanu','Vinay', 'Varun', 'Prashanth')
weight <- c(35,45,55,65,75)
age <- c(20,21,22,23,24)
class <- c('maths','physics','chemistry','biology','science')
Create a data frame
tdata <- data.frame( name, weight, age, class)
tdata
I) gather():

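gather() collects multiple columns and converts them into key-value pairs, transforming data from wide form to long form. It can be used as an alternative to melt() in the reshape package. Because weight:class mixes numeric and character columns, the gathered value column is stored as character.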
longt <- tdata %>% gather( key, value, weight:class)
longt

II) spread():
Does the reverse of gather(). It accepts a key-value pair and converts it into separate columns.
wide <- longt %>% spread( key, value)
wide
III) separate():
Splits a column into multiple columns.
Use the separate function when you have a date-time variable in the data set. Because such a column contains
several pieces of information, it makes sense to split it and use those values individually. The following code
shows the usage of the separate function.
Create a data frame:
Humidity <- c(37.79,42.34,52.16,44.57,48.83,44.59)
Rain <- c(0.971360441,1.1096716,1.06475853,0.953183435,0.98878849,0.9887643)
Time <- c("13/03/2018 23:24","09/01/2019 15:44","25/12/2018 19:15",
          "02/01/2019 07:46","14/03/2018 01:55","20/10/2018 20:52")
dset <- data.frame (Humidity,Rain,Time)
dset

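Using the separate function we can split out the date, month and year. Time also holds the hour and minute, so extra = "drop" discards those remaining pieces without a warning.

separate_d <- dset %>% separate(Time, c('Date','Month','Year'), extra = "drop")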
separate_d
IV) unite():
Does the reverse of separate(). It unites multiple columns into a single column.
unite_d <- separate_d %>% unite(Time, c(Date, Month, Year), sep = "/")
unite_d

RESULT:
The tidyr package Program has been Executed Successfully
EX.NO:3 DATA.TABLE PACKAGE

DATE:
Do the following operations for any external dataset using the data.table package
i) Select a subset row
ii) Select a column with particular values

iii) Select columns with multiple values

AIM:
To do the data manipulation operations for external csv file using data.table package.
Procedure and Code:

Loading air quality data:


data("airquality")
mydata <- airquality
mydata

Loading iris data:


data("iris")
myiris <- iris
myiris
Converting into table format:
install.packages("data.table")
library(data.table)
myirisdata <-data.table(myiris)
myirisdata

i) Select subset rows:


mydata[2:4,]

ii) Select a column with particular values


myirisdata[Species == 'setosa']
iii) Select columns with multiple values
myirisdata[Species %in% c('setosa','virginica')]

RESULT:
Thus, the Data Manipulation using data.table Package Executed Successfully.
EX.NO:4 GGPLOT PACKAGE

DATE:

AIM:
To do the different types of visualization for air quality data set using the ggplot2
package in R.
a. Line graph
b. Bar graph
c. Histogram
d. Scatter plot
e. Pie chart

PROCEDURE AND CODE:

ggplot2 is a plotting package that provides helpful commands to create plots
from data in a data frame. It provides a programmatic interface for specifying which variables to plot and how they are displayed.
Installation and Loading:
install.packages("ggplot2")
library(ggplot2)
The 'iris' data comprises 150 observations of 5 variables.
i) Line graph:
ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_density( )
OUTPUT:

ii) Bar graph:


ggplot(mpg, aes(x= class)) + geom_bar()
qplot -> quick plot function from the ggplot2 library
geom -> geometry (plot type)
fill -> denotes colour
OUTPUT:
iii) Histogram:
ggplot(data = iris, aes( x = Sepal.Length)) + geom_histogram( )
OUTPUT:

iv) Scatter plot:


library(ggplot2)
ggplot(mtcars, aes(x = drat, y = mpg)) +
geom_point()
OUTPUT:

RESULT:
Thus, the Data Visualisation using ggplot2 Package Executed Successfully.
EX.NO:5 DATA.TABLE PACKAGE
DATE:

AIM:
To Do the data manipulation operations for iris and airquality dataset
using data.table package and obtain the results for following
functions.
a. Select a subset row
b. Select a column with particular values
c. Select columns with multiple values
d. Select a column to return a vector
e. Select multiple columns
f. Returns the sum and standard deviation
g. Sum of selected columns
PROCEDURE AND CODE:
data.table is an enhanced version of data.frame, which
is the standard data structure for storing data in base R.
To install data.table, use the below command
install.packages("data.table")

To load data.table, use the below command


library(data.table)
Converted data set into data.table:
data<-as.data.table(iris)
air<-as.data.table(airquality)
head(air)
head(data)
i) Select subset rows:
data[2:4, ]
OUTPUT:

ii) Select a column with particular values:

data[Species == 'setosa']
OUTPUT:

iii) Select columns with multiple values:

data [Species %in% c('setosa','virginica')]


OUTPUT:
iv) Select a column to return a vector:
air[, Temp]
OUTPUT:

v) Select multiple columns:
air[, .(Temp, Month)]


OUTPUT:

vi) Returns the sum and standard deviation


air[, .(sum(Ozone, na.rm=TRUE), sd(Ozone, na.rm=TRUE))]
OUTPUT:

vii) Sum of selected columns:


air[, sum(Ozone, na.rm=TRUE)]
OUTPUT:
RESULT:
Thus, the data.table Package Program has been Executed
Successfully.
EX.NO:6 CREATE THE VISUALIZATION GRAPHS
DATE:

AIM:
To create different types of graphs for user inputs: 1) Line graph 2) Line
graph with style 3) Bar graph (horizontal and vertical) 4) Histogram
5) Scatter plot.

PROCEDURE AND CODE:


1. LINE GRAPH:
import matplotlib.pyplot as plt
x=[5,6,8,10,15]
y=[20,30,40,50,55]
plt.plot(x,y)
plt.title("STUDENT DATA-LINE GRAPH")
plt.ylabel('Present %')
plt.xlabel('Roll.no')
plt.show()
OUTPUT PLOT:

2. LINE GRAPH WITH STYLE:


import matplotlib.pyplot as plt
import matplotlib.style
x=[5,6,8,10,15]
y=[20,30,40,50,55]
x2=[2,13,16,20,18]
y2=[25,35,16,23.5,40]
plt.plot(x,y,'c',label='A',linewidth=6)
plt.plot(x2,y2,'purple',label='B',linewidth=6)
plt.title('STUDENT DATA-LINE GRAPH WITH STYLE')
plt.ylabel('Present %')
plt.xlabel('Roll.no')
plt.legend()
plt.show()
OUTPUT PLOT:

3. BAR GRAPH:
A-VERTICAL
import matplotlib.pyplot as plt
studentnames = ['Adeline','Jane','Roo','Bluewhale','Rossey']
marks = [85,55,90,45,60]
plt.bar(studentnames,marks,color='purple')
plt.title('STUDENT DATA-BAR GRAPH VERTICAL')
plt.xlabel('NAMES')
plt.ylabel('MARKS')
plt.show()
OUTPUT PLOT:

B-HORIZONTAL :
import matplotlib.pyplot as plt
studentnames = ['Adeline','Jane','Roo','Bluewhale','Rossey']
marks = [85,55,90,45,60]
plt.barh(studentnames,marks,color='c')
plt.title('STUDENT DATA-BAR GRAPH HORIZONTAL')
plt.xlabel('MARKS')
plt.ylabel ('NAMES')
plt.show()
OUTPUT PLOT:

4. HISTOGRAM:
import matplotlib.pyplot as plt
student_marks=[45,12,13,26,15,55,100,98,95,54,58,56,52,24,71,66,66.5,
               12,23,55,78,10,9,5,10,22,35,65,45]
bins=[0,10,20,30,40,50,60,70,80,90,100]
plt.hist(student_marks,bins,histtype='bar',rwidth=0.8,color='purple')
plt.xlabel('MARKS')
plt.ylabel('NUMBER OF STUDENT')
plt.title('STUDENT DATA-HISTOGRAM')
plt.show()
OUTPUT PLOT:
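
The bins list passed to plt.hist gives the bin edges. As a rough sketch of the counting behind the plot, assuming the NumPy convention that matplotlib follows (every bin includes its lower edge and excludes its upper edge, except the last bin, which includes both), the counts can be reproduced in plain Python. The function name histogram_counts and the sample marks are hypothetical and used only for illustration.

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall into each bin defined by consecutive edges."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lower, upper = bin_edges[i], bin_edges[i + 1]
            # Bins are [lower, upper); only the last bin also includes its upper edge.
            if lower <= v < upper or (i == last and v == upper):
                counts[i] += 1
                break
    return counts


bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sample_marks = [5, 10, 45, 66.5, 100]  # hypothetical marks
counts = histogram_counts(sample_marks, bins)
print(counts)
```

A mark of exactly 10 is counted in the 10-20 bin, while a mark of 100 stays in the final 90-100 bin.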

5. SCATTER PLOT:
import matplotlib.pyplot as plt
import matplotlib.style
x=[5,6,8,10,15]
y=[20,30,40,50,55]
x2=[2,13,16,20,18]
y2=[25,35,16,23.5,40]
plt.scatter(x,y,color='purple')
plt.scatter(x2,y2,color='c')
plt.title('STUDENT DATA-SCATTER PLOT')
plt.ylabel('Present %')
plt.xlabel('Roll.no')
plt.show()
OUTPUT PLOT:

RESULT:
The Visualization Graphs Program has been Executed Successfully
EX.NO:7 EXPLORATORY DATA ANALYSIS(EDA)
DATE:

AIM:
To write a Python program to implement Exploratory
Data Analysis (EDA) on a data set for data
visualization

PROCEDURE AND CODE:


The EDA approach can be used to gather knowledge about the
following aspects of data:
Main characteristics or features of the data.
Finding out the important variables that can be used in our
problem
In this exercise, EDA is considered under two broad
classifications:
Descriptive Statistics, which includes mean, median, mode,
inter-quartile range, and so on.
Graphical Methods, which includes histogram, density
estimation, box plots, and so on.
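
As a small illustration of the descriptive statistics named above, the following sketch computes the mean, median, mode and inter-quartile range with Python's standard statistics module. The sample values are hypothetical, and statistics.quantiles uses its default "exclusive" method here, so other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical sample values, used only for illustration.
values = [12, 15, 15, 18, 21, 24, 30]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
mode_value = statistics.mode(values)

# quantiles(n=4) returns the three quartile cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("inter-quartile range:", iqr)
```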
Reading dataset:
import pandas as pd
import numpy as np
data=pd.read_csv(r"D:\Dataset\train.csv")
print(data)
OUTPUT:

1) Getting first few rows of the dataset:


print(data.head())

OUTPUT:
2) Getting shape of the data:
print(data.shape)
OUTPUT:

3) Checking missing values in the data:


print(data.isnull().sum())
OUTPUT:

4) Checking Data Types of the data:


print(data.dtypes)
OUTPUT:
5) Filling missing values in categorical variables with the mode:

data["Gender"] = data["Gender"].fillna(data["Gender"].mode()[0])
data["Married"] = data["Married"].fillna(data["Married"].mode()[0])
data["Dependents"] = data["Dependents"].fillna(data["Dependents"].mode()[0])
data["Self_Employed"] = data["Self_Employed"].fillna(data["Self_Employed"].mode()[0])
data["Loan_Amount_Term"] = data["Loan_Amount_Term"].fillna(data["Loan_Amount_Term"].mode()[0])
data["Credit_History"] = data["Credit_History"].fillna(data["Credit_History"].mode()[0])
6) Filling missing values in a continuous variable with the mean:
data["LoanAmount"] = data["LoanAmount"].fillna(data["LoanAmount"].mean())
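
To make the two filling strategies concrete, here is a minimal sketch in plain Python, without pandas, of what mode and mean imputation compute. The helper names fill_with_mode and fill_with_mean and the sample columns are hypothetical, and None stands in for a missing value.

```python
import statistics

# Hypothetical columns; None marks a missing value.
gender = ["Male", None, "Female", "Male", None]
loan_amount = [120.0, None, 100.0, 140.0]


def fill_with_mode(column):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    most_common = statistics.mode(observed)
    return [most_common if v is None else v for v in column]


def fill_with_mean(column):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    average = statistics.mean(observed)
    return [average if v is None else v for v in column]


gender_filled = fill_with_mode(gender)
loan_filled = fill_with_mean(loan_amount)
print(gender_filled)
print(loan_filled)
```

The mode suits categorical columns because an average of labels is undefined, while the mean is a common default for continuous columns.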
7) Checking missing values:

print(data.isnull().sum())
OUTPUT:
8) Converting categorical values into numerical:

data['Gender'] = data['Gender'].replace(['Male', 'Female'], [0, 1])
data['Married'] = data['Married'].replace(['No', 'Yes'], [0, 1])
data['Dependents'] = data['Dependents'].replace(['0', '1', '2', '3+'], [0, 1, 2, 3])
data['Education'] = data['Education'].replace(['Not Graduate', 'Graduate'], [0, 1])
data['Self_Employed'] = data['Self_Employed'].replace(['No', 'Yes'], [0, 1])
data['Property_Area'] = data['Property_Area'].replace(['Rural', 'Semiurban', 'Urban'], [0, 1, 2])
data['Loan_Status'] = data['Loan_Status'].replace(['N', 'Y'], [0, 1])
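
The conversion in this step amounts to a lookup from each category label to an integer code. A minimal sketch in plain Python, using the Property_Area codes from this step and a hypothetical sample column (the helper name encode is also hypothetical), could look like this:

```python
# Codes as used for Property_Area in this step; the sample column is hypothetical.
property_area_codes = {"Rural": 0, "Semiurban": 1, "Urban": 2}
property_area = ["Urban", "Rural", "Semiurban", "Urban"]


def encode(column, codes):
    """Replace each category label with its integer code.

    Labels that are not in the mapping are left unchanged.
    """
    return [codes.get(value, value) for value in column]


encoded = encode(property_area, property_area_codes)
print(encoded)
```

Leaving unmapped labels untouched makes unexpected categories easy to spot when the data values are checked afterwards.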
9) Checking data values:

print(data.head())
OUTPUT:

10) Saving the pre-processed data:


data.to_csv("new_data.csv",index=False)
RESULT:
The Exploratory data Analysis Program has been Executed Successfully
EX.NO: 8 IBM WATSON STUDIO-PROJECT
DATE:

AIM:

To create a new data visualization project in IBM Watson Studio using an individual account.

PROCEDURE AND CODE:


1) Open IBM Watson studio and Login using your account. The
project and the catalog must be created by members of the same IBM
Cloud account.
2) Click New project on the home page or on your Projects page.

3) Choose whether to create an empty project or to create a project based on an exported project file or a sample project.
4) On the New project screen, add a name and optional
description for the project.

If the project file that you select to import is encrypted, you must
enter the password that was used for encryption to enable decrypting
sensitive connection properties.

5) Create a New project in Data analysis section


6) Add to Project

7) Select Data for data manipulation.


8) Assets and Browse the data

9) Refine the data:


10) Visualizations

11) Columns to visualize:


12) Visualize the Data

RESULT:
Thus, the IBM-Watson Studio Project has been Executed
Successfully.
EX.NO: 9 DATA ANALYSIS – COVID 19 DATASET
DATE:

AIM:
To do the data analysis and visualization for covid19
dataset.

PROCEDURE AND CODE:


1) Import the libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2) Read the Covid Analysis Data set:
data=pd.read_csv(r"D:\Dataset\Covid Analysis.csv")
print(data.head())
OUTPUT:
3) Getting statistical information from the data:
print(data.describe())
OUTPUT:

4) Making serial number as an index:


data.index.name = 'S_No'
print(data.index.name)
OUTPUT:

5) Removing Null value:


new_data = data.drop(0)
print(new_data)
OUTPUT:
6) Describing the new cleaned dataset:
print(new_data.describe())
OUTPUT:

7) Fetching the information of Tamilnadu:


Tn = new_data.loc[29]
print(Tn)
OUTPUT:

8) Plotting a bargraph for STATE Vs Death:


plt.figure(figsize=(10,10))
plt.bar(new_data['Name of State / UT'],new_data['Death'])
plt.xticks(rotation=90)
plt.show()
OUTPUT:

9) Plotting all variables for each state:

plt.plot(new_data['Name of State / UT'],new_data['Date'],color='Blue')
plt.scatter(new_data['Name of State / UT'],new_data['Total Confirmed cases*'],color='Blue')
plt.plot(new_data['Name of State / UT'],new_data['Cured/Discharged/Migrated'],color='Red')
plt.scatter(new_data['Name of State / UT'],new_data['Death'],color='Red')
plt.plot(new_data['Name of State / UT'],new_data['Latitude'],color='Green')
plt.scatter(new_data['Name of State / UT'],new_data['Longitude'],color='Green')
plt.xticks(rotation=90)
plt.show()
OUTPUT:

RESULT:
The Data analysis Program has been Executed Successfully.
