This document introduces key concepts for data science with R programming, including:
- Standard deviation and variance, which measure how dispersed data values are around the mean.
- Linear regression, which models the relationship between two variables and can be used to make predictions.
- The ggplot2 package for creating elegant graphics and plots in R.
- Key steps and formulae for calculating standard deviation and variance and for performing linear regression, with examples that illustrate these statistical techniques.
Associate Professor
Department of Computer Science
Nehru Arts and Science College, Coimbatore

TABLE OF CONTENTS
• Standard deviation
• Variance
• Linear Regression

Standard deviation
• Standard deviation (SD) is a measure of how varied the data in a data set are.
• Mathematically, it measures how far each value lies from the mean of the data set.
• A standard deviation close to 0 indicates that the data points tend to be very close to the mean of the data set.
• A high standard deviation indicates that the data points are spread out over a wider range of values.

Procedure
• To calculate the standard deviation of the numbers:
1. Work out the mean (the simple average of the numbers).
2. Then, for each number, subtract the mean and square the result.

    # sd() returns the sample standard deviation (it divides by n - 1).
    Vec <- c(4, 6, 8, 4, 10)
    S <- sd(Vec)
    print(S)

Steps
• Calculate the mean.
• Subtract the mean from each observation.
• Square each of the resulting differences.
• Add these squared results together.
• Divide this total by the number of observations (variance, S²).
• Take the positive square root (standard deviation, S).
• Note: R's sd() and var() divide by n - 1 rather than by the number of observations, so their results differ slightly from the steps above.

Formulae
• Variance: $S^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
• Standard deviation: $S = \sqrt{S^2}$
• Here $N$ is the number of observations, $x_i$ is the i-th observation and $\bar{x}$ is the mean.

SD Example
• For Vec above, the mean is 6.4 and the squared differences sum to 27.2. Dividing by the 5 observations gives S² = 5.44 and S ≈ 2.33, whereas sd(Vec) divides by 4 and returns about 2.61.

Variate
• A variate is a quantity that may take any of the values of a specified set with a specified relative frequency or probability. A variate is therefore often called a random variable.
• Univariate data: this type of data consists of only one variable. The analysis of univariate data is thus the simplest form of analysis, since the information deals with only one quantity that changes.
• Bivariate data calls for slightly more complex analysis than univariate data. In bivariate data, the analysis is based on two variables per observation simultaneously.
• Multivariate data is data in which the analysis is based on more than two variables per observation. Usually multivariate data is used for explanatory purposes.

ggplot2
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots.
Its popularity in the R community has exploded in recent years. ... There is a helper function called qplot() (for quick plot) that can hide much of this complexity when creating standard graphs.

Linear Regression

    # Heights in cm (x) and weights in kg (y) for ten people.
    x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
    y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

    # Fit weight as a linear function of height.
    relation <- lm(y ~ x)
    print(summary(relation))

    # Predict the weight of a person who is 170 cm tall.
    a <- data.frame(x = 170)
    result <- predict(relation, a)
    print(result)

    # Give the chart file a name.
    png(file = "linearregression.png")

    # Plot weight (horizontal axis) against height (vertical axis),
    # then draw the fitted line on top of the finished plot.
    plot(y, x, col = "blue", main = "Height & Weight Regression",
         cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
    abline(lm(x ~ y))

    # Save the file.
    dev.off()
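As a cross-check on the lm() fit above, the slope and intercept can also be computed directly from the usual least-squares formulas for a single predictor. This is a minimal sketch that reuses the x and y vectors from the slide; the names slope, intercept and manual_prediction are introduced here for illustration and do not appear in the original deck.

```r
# Same data as the slide: heights in cm (x) and weights in kg (y).
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Least-squares estimates for simple linear regression:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# Compare the hand-computed values with the coefficients from lm().
relation <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(relation))

# Prediction for a height of 170 cm, computed by hand (hypothetical name).
manual_prediction <- intercept + slope * 170
print(manual_prediction)  # about 76.23
```

Both printouts should show the same pair of coefficients, which suggests that predict() is simply evaluating intercept + slope * x for the new height.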
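The same scatter plot and fitted line can also be drawn with ggplot2. The sketch below assumes the ggplot2 package is installed; the data frame name people, its column names and the output file name are hypothetical choices made for illustration. It uses ggplot() with geom_smooth(method = "lm") rather than qplot(), since qplot() is marked as deprecated in recent ggplot2 releases, and it puts height on the horizontal axis so that the plot matches the lm(y ~ x) model.

```r
library(ggplot2)

# Same data as the slide, stored in a data frame (names are hypothetical).
people <- data.frame(
  height_cm = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131),
  weight_kg = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
)

# Scatter plot of weight against height with a least-squares line.
# se = FALSE hides the confidence band around the fitted line.
p <- ggplot(people, aes(x = height_cm, y = weight_kg)) +
  geom_point(colour = "blue", size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Height & Weight Regression",
       x = "Height in cm", y = "Weight in Kg")

# Save the chart to a PNG file (file name is an assumption).
ggsave("linearregression_ggplot.png", plot = p, width = 6, height = 4)
```

Layers are added with +, so the points, the fitted line and the labels can each be changed without touching the others.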