DSBDAL - Assignment No 10

TE (Comp)
Experiment:10
Title :
Download the Iris flower dataset or any other dataset into a DataFrame.
(eg. https://archive.ics.uci.edu/ml/datasets/Iris) Use Python/R and Perform following –
• How many features are there and what are their types (e.g., numeric, nominal)?
• Compute and display summary statistics for each feature available in the dataset.
(eg. minimum value, maximum value, mean, range, standard deviation, variance and
percentiles
• Data Visualization-Create a histogram for each feature in the dataset to illustrate
the feature distributions. Plot each histogram.
• Create a boxplot for each feature in the dataset. All of the boxplots should be
combined into a single plot. Compare distributions and identify outliers
Aim :
Implement a dataset into a dataframe. Implement the following operations:
1. Display data set details.

2. Calculate min, max ,mean, range, standard deviation, variance.
3. Create histograph using hist function.
4. Create boxplot using boxplot function.
Prerequisites:
Fundamentals of R -Programming Languages OR Python
Objectives :
To learn the concept of how to display summary statistics for each feature Available
in the dataset
Theory:
How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of number, especially large sets of number, by calculating
the center values using mean, mode and median. Use the ranges and standard
deviations of the sets to examine the variability of data.
Calculating Mean
The mean identifies the average value of the set of numbers. For example, consider
the data set containing the values 20, 24, 25, 36, 25, 22, 23.
Formula
Department of Computer Engineering

TE (Comp)
To find the mean, use the formula: Mean equals the sum of the numbers in the data set
divided by the number of values in the data set. In mathematical terms: Mean=(sum of
all terms)÷(how many terms or values in the set).
Adding Data Set

Add the numbers in the example data set: 20+24+25+36+25+22+23=175.
Finding Divisor
Divide by the number of data points in the set. This set has seven values so divide by 7.
Finding Mean
Insert the values into the formula to calculate the mean. The mean equals the sum of the
values (175) divided by the number of data points (7). Since 175÷7=25, the mean of
this data set equals 25. Not all mean values will equal a whole number.
Calculating Range
Range shows the mathematical distance between the lowest and highest values in the
data set. Range measures the variability of the data set. A wide range indicates greater
variability in the data, or perhaps a single outlier far from the rest of the data. Outliers
may skew, or shift, the mean value enough to impact data analysis.
Identifying Low and High Values

In the sample group, the lowest value is 20 and the highest value is 36.
Calculating Range
To calculate range, subtract the lowest value from the highest value. Since 36-20=16,
the range equals 16.
.
Calculating Standard Deviation

Standard deviation measures the variability of the data set. Like range, a smaller
standard deviation indicates less variability.
Formula
Finding standard deviation requires summing the squared difference between each data
2
point and the mean [∑(x-µ) ], adding all the squares, dividing that sum by one less than
the number of values (N-1), and finally calculating the square root of the dividend.
Mathematically, start with calculating the mean.
Calculating the Mean

Calculate the mean by adding all the data point values, then dividing by the number of
data points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum,
175, by the number of data points, 7, or 175÷7=25. The mean equals 25.
Squaring the Difference

TE (Comp)
Next, subtract the mean from each data point, then square each difference. The formula
looks like this: ∑(x-µ)2, where ∑ means sum, x represents each data set value and µ
represents the mean value. Continuing with the example set, the values become: 20-25=-
5 and -52=25; 24-25=-1 and -12=1; 25-25=0 and 02=0; 36-25=11 and 112=121; 25-25=0
2 2 2
and 0 =0; 22-25=-3 and -3 =9; and 23-25=-2 and -2 =4.
Adding the Squared Differences
Adding the squared differences yields: 25+1+0+121+0+9+4=160.
Division by N-1
Divide the sum of the squared differences by one less than the number of data points.
The example data set has 7 values, so N-1 equals 7-1=6. The sum of the squared
differences, 160, divided by 6 equals approximately 26.6667.
Standard Deviation
Calculate the standard deviation by finding the square root of the division by N-1. In the
example, the square root of 26.6667 equals approximately 5.164. Therefore, the
standard deviation equals approximately 5.164.
Evaluating Standard Deviation
Standard deviation helps evaluate data. Numbers in the data set that fall within one
standard deviation of the mean are part of the data set. Numbers that fall outside of two
standard deviations are extreme values or outliers. In the example set, the value 36 lies
more than two standard deviations from the mean, so 36 is an outlier. Outliers may
represent erroneous data or may suggest unforeseen circumstances and should be
carefully considered when interpreting data.
Facilities :Windows/Linux Operating Systems, RStudio, jdk.
Application:
1. The histogram is suitable for visualizing distribution of numerical data over
acontinuous interval, or a certain time period. The histogram organizes large
amounts of data, and produces a visualization quickly, using a single dimension.
2. The box plot allows quick graphical examination of one or more data sets. Box plots
may seem more primitive than a histogram but they do have some advantages. They take
up less space and are therefore particularly useful for comparing distributions between
several groups or sets of data. Choice of number and width of bins techniques can
heavily influence the appearance of a histogram, and choice of bandwidth can heavily
influence the appearance of a kernel density estimate.
3. Data Visualization Application lets you quickly create insightful data visualizations,
in minutes.
Data visualization tools allow anyone to organize and present information intuitively.
They enables users to share data visualizations with others.

TE (Comp)
Input :
Structured Dataset : Iris Dataset
File: iris.csv
Output :
1. Display Dataset Details.
2. Calculate Min, Max,Mean,Varience value and Percentiles of probabilities also Display
Specific use quantile.
3. Dispaly the Histogram using Hist Function.
4.Display the Boxplot using Boxplot Function.
Conclusion:
Hence, we have studied using dataset into a dataframe and compare distribution and
identify outliers.
Questions:
1. What is Data visualization?
2. How to calculate min,max,range and standard deviation?
3. How to create boxplot for each feature in the daset?
4. How to create histogram?
5. What is dataset.

DSBDAL - Assignment No 10

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DSBDAL - Assignment No 10

Uploaded by

Copyright:

Available Formats

TE (Comp)

1. Display data set details.

Department of Computer Engineering

Adding Data Set

Identifying Low and High Values

Calculating Standard Deviation

Calculating the Mean

Squaring the Difference

Facilities :Windows/Linux Operating Systems, RStudio, jdk.

Department of Computer Engineering

4.Display the Boxplot using Boxplot Function.

You might also like