Data Analytics Using R-Programming Notes

Abhishek Bawa Data Analytics using R-Programming KJ SIMSR 2018-20

Data Analytics using R-Programming


29th July 2019

Introduction to R-Programming
Everything in R is treated as a vector; even a single value is a vector of length 1.
Ctrl + L clears the console screen in R.
1. Greet the world using "Hello World"
a. Command: > print("Hello World")
2. Add 2 to each element of vector x
a. x + 2
3. Add the values 6:10
a. sum(6:10)
4. x <- 100:105
a. Then do x + 1:3
b. The shorter vector 1:3 is recycled along x, so the result is 101 103 105 104 106 108.
5. Generate the sum of the first 10 natural numbers
a. s <- sum(1:10), followed by s
b. This is a case of compute, store and return.
6. Generate the sum of the squares of the first 10 natural numbers
a. sum((1:10)^2)
7. Construct a vector with elements 1,2,3,4,5
a. > c(1:5)
b. c() is the function used to construct a vector.
8. Display the first 10 natural numbers, their squares and cubes
a. a <- 1:10
b. b <- a^2
c. c <- a^3
d. print(a, b, c)
i. print() displays only a single object, its first argument.
ii. Further arguments are read as printing options, so b and c are not displayed in the above example.
iii. print() returns the value of its first argument.
e. paste(a, b, c)
i. It picks up the corresponding values of a, b and c and joins them element by element.

9. Assign the elements 1:5 to x and 10 to y. Add y to every value of x


a. x <- 1:5
b. y <- 10
c. x <- x + y
d. print(x)
The value of a vector changes only when the result is assigned back to it with <- or =.
R is a case-sensitive programming language.
c() can also be used for concatenation.
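The exercises above rely on R's vectorized arithmetic and recycling. A minimal sketch, reusing the values from exercises 4 and 9:

```r
# Exercise 9: the single value y is recycled across every element of x
x <- 1:5
y <- 10
x <- x + y          # x changes only because the result is assigned back
print(x)            # 11 12 13 14 15

# Exercise 4: the shorter vector 1:3 is recycled along 100:105
z <- 100:105 + 1:3
print(z)            # 101 103 105 104 106 108

# c() also concatenates existing vectors
both <- c(x, z)
length(both)        # 11
```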

Consider the case where we call print(p, g):

o Here p holds numerical values
o g is a string
o print() displays a single object, so p and g cannot be printed together this way

Combine them first with paste():
print(paste(p, g))

paste() is meant for combining values, not for displaying them. However, in the console its result is also shown, because the console prints the value of any expression.
print() is the actual function for displaying.
paste() combines the values, and each value is converted to a string.
Anything enclosed in "" is taken as a string.
Each vector has an order.
Vectors have the following:
1. Vectors have an index
2. Vectors have an order
m = c(11, 13, 12, 9, 17)
order(m): This gives the indices of the elements of m, arranged in ascending order of value.
sort(m): Sorts the values in ascending order.

To sort the vector in decreasing order, use sort(m, decreasing = TRUE).
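A short sketch of the difference between order() and sort(), using the vector m from the notes:

```r
m <- c(11, 13, 12, 9, 17)

ord <- order(m)             # 4 1 3 2 5: positions of the values, smallest first
asc <- sort(m)              # 9 11 12 13 17
dsc <- sort(m, decreasing = TRUE)   # 17 13 12 11 9

m[ord]                      # indexing by order(m) gives the same result as sort(m)
```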


To accept the value of a variable through the console at runtime:

Yourname = readline(prompt = "message")

prompt: the message displayed while R pauses and waits for a value.

getwd(): This command shows the current working directory.

When we choose New Script, the script editor opens and the console moves to the background.
Any programming language is made up of keywords. Keywords are the commands of the programming language.
A command plus its parameters makes one instruction.
The R console takes one instruction at a time and executes it. A saved set of such instructions is referred to as a program; in R it is known as a script.
Once the script is saved with a name, it persists until it is deleted.
To execute a saved script, use the source() function.
objects(): This function lists all the objects created during the session.

To check all the packages installed in R, use the function: installed.packages()
R provides built-in datasets for practising analysis.


To convert a 1D array to a 2D array:

x <- 1:12
y <- c(x, x^2, x^3)
dim(y) <- c(12,3): reshapes y into a matrix with 12 rows and 3 columns.
t(y): transpose of the matrix y.
cbind() stands for column bind: it binds vectors together as the columns of a matrix, e.g. cbind(x, x^2, x^3).
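The two routes to a 2D structure mentioned above can be compared directly. A minimal sketch, using the same x and y as in the notes:

```r
x <- 1:12
y <- c(x, x^2, x^3)        # one long vector of 36 values

dim(y) <- c(12, 3)         # reshape in place: 12 rows, 3 columns, filled column by column
m <- cbind(x, x^2, x^3)    # bind the three vectors as columns directly

all(y == m)                # TRUE: both routes hold the same values
dim(t(y))                  # 3 12: the transpose swaps rows and columns
```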


Illustration of cbind() function in R-Programming. cbind() requires the argument to be specified again.


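The mode() function returns the data type (storage mode) of a variable. It is not the statistical mode (the most frequent value).

Difference between the c() function and list(): c() builds a vector whose elements all share one data type, while list() can hold elements of different data types.

Powerful illustration of paste(): every item of A is pasted with B (which has only a single value), because B is recycled.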
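A small sketch of these three points. The vectors A and B are made-up examples, since the originals from the notes are not shown:

```r
v <- c(1, "a", TRUE)       # c() coerces everything to one type (character here)
l <- list(1, "a", TRUE)    # list() keeps each element's own type

mode(v)                    # "character"
sapply(l, mode)            # "numeric" "character" "logical"

A <- c("x", "y", "z")      # hypothetical multi-element vector
B <- "1"                   # hypothetical single value
pasted <- paste(A, B)      # "x 1" "y 1" "z 1": B is reused for every item of A
```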

Function type:
There are two types of arguments in functions:
• Arguments with a default value
• Arguments without a default value


Every time print() is given a numeric argument, the number is displayed with 7 significant digits by default, i.e. 11/7 is displayed as 1.571429.

To specify the number of significant digits to be displayed, use the digits argument, e.g. print(11/7, digits = 3).
If the number is too large or too small, it is displayed in scientific notation using e (10 to the power of).
For example, 0.0000062 is displayed as 6.2e-06.

pi is a ready-to-use built-in constant in R.

There is another option for printing known as cat(). After print() displays a value, the output moves to a new line by default. With cat(), the output stays on the same line unless "\n" is added.
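The display rules above can be tried out as follows (a minimal sketch):

```r
x <- 11 / 7
print(x)                   # 1.571429: 7 significant digits by default
print(x, digits = 3)       # 1.57
print(0.0000062)           # 6.2e-06: scientific notation for very small numbers

cat("Value of pi:", pi)    # cat() stays on the same line ...
cat(" (built-in constant)\n")   # ... until "\n" ends the line
```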


30th July 2019

Operators

R has four different groups of mathematical operators and functions:


1. Basic Arithmetic operators
2. Mathematical Functions
3. Vector operations
4. Matrix operations

5%%2 gives the remainder (1). 5%/%2 gives the integer quotient (2).

2%%5 = 2
2%/%5 = 0

5/0 gives Inf, which stands for infinity.

Complex numbers
y <- 3 + 2i
• Re(y) gives the real part of a complex number.
• Im(y) gives the imaginary part of a complex number.


Mathematical Functions
Sr. No   Mathematical Function   Description of the function
1.       abs()                   Absolute value of a number
2.       log()                   Natural logarithm of a number
3.       sqrt()                  Square root of a number
4.       exp()                   Exponential function; e^3 is written as exp(3)
5.       choose(n, r)            Number of combinations, i.e. nCr
6.       floor(n)                Largest integer not greater than n
7.       ceiling(n)              Smallest integer not less than n
8.       round(n, 6)             Rounds n to 6 decimal places
9.       trunc(n)                Drops the decimal part of n
10.      str(v)                  Displays the structure of v

round(n, -1) rounds the number n to the nearest ten.
round(n, -2) rounds the number n to the nearest hundred.
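A quick sketch of the rounding functions and operators listed above, with an arbitrary example value n:

```r
n <- 1234.5678             # arbitrary example value

floor(n)                   # 1234
ceiling(n)                 # 1235
trunc(n)                   # 1234
round(n, 2)                # 1234.57
round(n, -1)               # 1230: nearest ten
round(n, -2)               # 1200: nearest hundred

choose(5, 2)               # 10: number of ways to pick 2 items out of 5
5 %% 2                     # 1: remainder
5 %/% 2                    # 2: integer quotient
```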

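Trigonometric Functions
Trigonometric functions in R take angles in radians. cos(pi/4) is the cosine of 180°/4 = 45°, which is 0.707.

To get the alphabets in lower case and upper case, use the built-in constants letters and LETTERS.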


Types of vectors:
• Numeric vectors
• Integer vectors
• Logical vectors
• Character vectors
• DateTime vectors
• Factors


Replicate/ Repeat

Index of an element
Generate the natural numbers from 1 to 30 in reverse order. After that, display the elements at index 11, 17 and 21.

A continuous range is allowed in the index. For a discrete set of positions, put the numbers in a vector with c() and use that vector in the index position. The last 10 elements of the vector can be extracted in the same way.

To change the element at a specific index of the vector, assign a new value to that index.


To change the 1st and the 5th elements to 18, assign to both indices at once.
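The indexing exercises above can be sketched as follows:

```r
x <- 30:1                        # natural numbers 1 to 30 in reverse order

picked <- x[c(11, 17, 21)]       # discrete positions go inside c(): 20 14 10
ranged <- x[5:8]                 # a continuous range: 26 25 24 23
last10 <- tail(x, 10)            # last 10 elements: 10 9 ... 1

x[c(1, 5)] <- 18                 # change the 1st and the 5th elements to 18
head(x)                          # 18 29 28 27 18 25
```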

a = 5 (Assigning the value 5 to a)


b = 5 (Assigning the value 5 to b)
a == b (Tests whether a equals b: gives TRUE if yes, otherwise FALSE)
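Logical Operators
Sr. No   Operator   Description of the Operator
1.       &          Logical AND (element-wise, for vectors)
2.       &&         Logical AND (for single values, as in if conditions)
3.       |          Logical OR (element-wise, for vectors)
4.       ||         Logical OR (for single values, as in if conditions)
5.       !          Logical NOT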

Or
> i <- which(a>b)
> paste(i, a[i], b[i])


any() and all() function in R-Programming


x <- rep(1,10)
This function replicates 1 ten times in x
x then becomes 1 1 1 1 1 1 1 1 1 1
> all(x==1)
[1] TRUE
> any(x!=1)
[1] FALSE
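A short sketch tying together comparison, the logical operators, which(), any() and all(). The vectors a and b are made-up examples:

```r
a <- c(5, 8, 2, 9)               # hypothetical values
b <- c(5, 3, 7, 4)               # hypothetical values

a == b                           # TRUE FALSE FALSE FALSE
a > b & a > 8                    # element-wise AND: FALSE FALSE FALSE TRUE
a > b | a == b                   # element-wise OR:  TRUE TRUE FALSE TRUE
!(a > b)                         # NOT: TRUE FALSE TRUE FALSE

i <- which(a > b)                # indices where a exceeds b: 2 4
report <- paste(i, a[i], b[i])   # "2 8 3" "4 9 4"

any(a > b)                       # TRUE: at least one element satisfies the test
all(a > b)                       # FALSE: not every element satisfies it
```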

Arithmetic Vector Operation

Sr. No   Operation   Description
1.       sum()       Sum of all the numbers
2.       prod()      Product of all the numbers
3.       min()       Minimum of all the numbers
4.       max()       Maximum of all the numbers
5.       cumsum()    New series in which each element is the sum of all the numbers up to and including that index
6.       cumprod()   New series in which each element is the product of all the numbers up to and including that index
7.       cummin()    Running minimum: each element is the lowest value seen up to that index
8.       cummax()    Running maximum: each element is the highest value seen up to that index
9.       diff()      Differences between consecutive elements
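The table above can be checked on a small made-up vector:

```r
v <- c(3, 1, 4, 1, 5)      # hypothetical values

sum(v)                     # 14
prod(v)                    # 60
cumsum(v)                  # 3 4 8 9 14
cumprod(v)                 # 3 3 12 12 60
cummin(v)                  # 3 1 1 1 1
cummax(v)                  # 3 3 4 4 5
diff(v)                    # -2 3 -3 4
```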


31st July 2019

Class Exercises
A1 – Dealing with missing values
A student must attend any 4 out of 5 tests, i.e. may skip any one of the five tests. The scores of the
student are: (27, 32, 45, NA, 39)
Any construct requires:
1. Input:
a. Through console
i. Verification and Validation happens
b. From any of the data file one is extracting
c. From unstructured sources
2. Processing
a. Compute
b. Branching
c. Looping
3. Output
a. Format
Algorithm:
1. Input: Scores in 5 tests
2. Process:
a. Determine index of NA (Not available)
b. Determine valid test scores
c. Compute internal marks
3. Output: Provide the internal marks


To remove the element at the index where the NA occurs:
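A minimal sketch of the A1 algorithm. The notes do not say how the internal marks are computed; taking the mean of the valid scores is an assumption made here for illustration:

```r
scores <- c(27, 32, 45, NA, 39)

na_index <- which(is.na(scores))     # index of the missing test: 4
valid    <- scores[!is.na(scores)]   # valid test scores: 27 32 45 39

# Assumption: internal marks = average of the valid scores
internal <- mean(valid)              # 35.75

mean(scores, na.rm = TRUE)           # same result without removing NA by hand
```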

A2 – Finding Min and Max


“CSK scores in IPL 2018 matches are given”
Determine the min and max scores
Also, return the match numbers where they scored min and max scores.

A3 – Improvement/ Decline over previous result


Analyze the performance improvement/ decline of CSK as compared to the previous game (over
16 matches)
Logic: Analyze the match numbers from 2 to 16 and find out if there has been an improvement/
decline from the previous match and determine the percentage of improvement.
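Exercises A2 and A3 can be sketched as below. The actual CSK scores are not reproduced in the notes, so the vector csk holds made-up values for 16 matches:

```r
# Hypothetical scores for 16 matches (not the real IPL 2018 figures)
csk <- c(169, 205, 197, 182, 204, 211, 170, 177,
         198, 127, 176, 180, 128, 181, 140, 179)

# A2: min and max scores and the match numbers where they occurred
min(csk); which.min(csk)             # 127 in match 10
max(csk); which.max(csk)             # 211 in match 6

# A3: change compared with the previous match (matches 2 to 16)
change <- diff(csk)
pct    <- round(100 * change / head(csk, -1), 1)   # percentage change
trend  <- ifelse(change > 0, "improved",
                 ifelse(change < 0, "declined", "same"))
result <- data.frame(match = 2:16, score = csk[-1], change, pct, trend)
print(result)
```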


A3a – Improvement/ Decline over previous result


Analyze the improvement/ decline in test performance of the following scores by a student in
comparison with the previous score (discard the NA)

A4 – Solve
Consider vector j having elements 11:16. Multiply the elements of the vector alternately by 2
and 3.
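One way to solve A4 is to let R recycle the multiplier vector c(2, 3) along j:

```r
j <- 11:16
k <- j * c(2, 3)     # c(2, 3) is recycled: 11*2, 12*3, 13*2, 14*3, 15*2, 16*3
print(k)             # 22 36 26 42 30 48
```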


31st July 2019

Text Manipulations

> nm <- "Adam"
> length(nm)
length() gives the number of elements in nm, which is 1.
> nm <- "Adam Smith"
> length(nm)
This again gives 1, because "Adam Smith" is a single string, i.e. one element.

However,
> nm <- c("Adam", "Smith")
> length(nm)
This gives 2, because nm now has two elements.

length() gives the number of elements of a vector, not the number of words. nchar() gives the number of characters in each string.
islands is a built-in dataset available in R.


The str() function displays the structure of an object.

Exercise 5
Upper-case and lower-case alphabets
print(letters)
print(LETTERS)

Generate uppercase alphabets from A to J
> upper_case_alphabets <- LETTERS
> print(upper_case_alphabets[1:10])

Generate lowercase alphabets from a to j
> lower_case_alphabets <- letters
> print(lower_case_alphabets[1:10])

> lower_case_alphabets[1:6]
The head() function gives the same result as the above command and displays the first 6 elements.
To display the last 6 elements, use the tail() function.
> head(lower_case_alphabets)
To get more than 6 elements, pass the count:
> head(lower_case_alphabets, 9)

The default number of elements for head() and tail() is 6.


Extract names from islands

str(islands)
names(islands)
str(islands) shows that the island names are stored as an attribute: attr(*, "names").
An atomic value is a value which cannot be split further.
Extract the names of the first 8 islands:
names(islands)[1:8]

One can also use:
names(islands[1:8])
Extract names from islands in ascending order of size:
names(sort(islands))

Create a named vector: month.days19 <- c(31,28,31,30,31,30,31,31,30,31,30,31), then attach the month names with names(month.days19) <- month.name
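A sketch of working with named vectors, using the built-in islands dataset and the month.days19 vector from the notes. Using the built-in constant month.name for the labels is an assumption about how the names were attached:

```r
# Two equivalent ways to get the names of the first 8 islands
first8_a <- names(islands)[1:8]
first8_b <- names(islands[1:8])

# Names in ascending order of size: sort the values, then read the names
by_size <- names(sort(islands))

# Named vector of days per month for 2019
month.days19 <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(month.days19) <- month.name        # assumption: labels come from month.name

month.days19["February"]                 # look up by name: 28
names(month.days19[month.days19 == 31])  # months with 31 days
```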


Splitting and concatenating text.

strsplit() splits a string at a given separator; splitting at a space gives as many pieces as there are words in the string.

unique() returns each value only once, dropping the duplicates.

To convert upper case to lower case, use the tolower() function.
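The three functions combine naturally. A minimal sketch on a made-up sentence:

```r
s <- "The quick brown fox and the lazy dog"   # hypothetical example text

words <- strsplit(s, " ")[[1]]   # strsplit() returns a list; [[1]] takes the word vector
length(words)                    # 8 words

lower <- tolower(words)          # "The" and "the" now match
distinct <- unique(lower)        # duplicates removed: 7 distinct words

paste(distinct, collapse = " ")  # concatenate back into one string
```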

Data Frame
A data frame presents data in rows and columns, the same view as in an Excel sheet.

An Excel sheet can be brought into R in the form of a data frame. A data frame has a fixed rectangular shape: it is a 2D data structure in which every column has the same length.
In a list, different data types can be combined.
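A minimal sketch of building a data frame from R's built-in state vectors. The notes do not show how us.states was created, so the column choice here is an assumption:

```r
# Assumption: us.states is assembled from the built-in state.* vectors
us.states <- data.frame(
  name   = state.name,
  abb    = state.abb,
  area   = state.area,
  region = state.region
)

dim(us.states)     # 50 rows, 4 columns
str(us.states)     # each column keeps its own data type
head(us.states)    # first 6 rows, like the top of a spreadsheet
```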

Now if one uses the function edit(us.states), this appears:


Other functions:
• substr() extracts the characters between given start and stop positions of a string.
• grep() returns the indices of the elements of a character vector that match a pattern.
o grep() is a search function, especially useful on large datasets.
• gsub() replaces all matches of a pattern in a string.

To find out the substring of all names displayed by the head function.

Using the gsub() function
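A short sketch of substr(), grep() and gsub() on a small made-up character vector:

```r
nm <- c("Africa", "Antarctica", "Asia", "Australia")   # hypothetical names

short <- substr(nm, 1, 3)      # first three characters: "Afr" "Ant" "Asi" "Aus"
hits  <- grep("ia$", nm)       # indices of names ending in "ia": 3 4
nm[hits]                       # "Asia" "Australia"
caps  <- gsub("a", "A", nm)    # replace every lower-case "a" with "A"
```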


1st Aug 2019

Benford’s Law

Flexible Spending Account (FSC) is another name for CTC.


One cannot audit all the transactions.
Google application of Benford’s law.
The null hypothesis should state what was believed before beginning the experiment; it stands unless the experiment provides evidence against it.
Fraud would show up as a significant departure of the actual values from the expected values.
Sr. No (leading digit)   Actual   Expected - Benford's law
1                        132      89.41
2                        50       53.46
3                        32       35.64
4                        29       29.70
5                        19       23.76
6                        11       20.79
7                        10       17.82
8                        9        14.85
9                        5        14.85
Total ->                 297      300.28

There are n = 9 observations and 3 vectors.

The p-value of 0.2425 is the important part here: at the usual 5% level it is greater than 0.05, so the null hypothesis is not rejected.
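Benford's law gives the expected share of leading digit d as log10(1 + 1/d). A minimal sketch that recomputes the expected counts for the actual column above; the table in the notes appears to use rounded percentages, so its values differ slightly from digit 2 onward:

```r
digit  <- 1:9
actual <- c(132, 50, 32, 29, 19, 11, 10, 9, 5)

benford_p <- log10(1 + 1 / digit)          # expected proportion of each leading digit
expected  <- sum(actual) * benford_p       # expected counts for 297 transactions

comparison <- data.frame(digit, actual,
                         expected  = round(expected, 2),
                         departure = round(actual - expected, 2))
print(comparison)
```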


Measuring Central Tendency


A few functions for measuring central tendency are:
1. Mean: mean()
2. Median: median()
One wants to know the value around which most of the data tends to cluster.

The median is technically the 50% trimmed mean, i.e. trimming 50% removes half of the
observations above the middle of x and half of the observations below the middle of x.


When one trims 50% of the values from both the ends, then mean = median.
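The relationship between the trimmed mean and the median can be checked with the trim argument of mean(). The vector x is a made-up example with one outlier:

```r
x <- c(2, 4, 4, 7, 9, 30)        # hypothetical data with an outlier

mean(x)                          # 9.333...: pulled up by the outlier
median(x)                        # 5.5
mean(x, trim = 0.5)              # 5.5: trimming 50% from each end gives the median
mean(x, trim = 0.2)              # 6: drops the lowest and the highest value
```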
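Year   Invs A   Invs B
1      12%      50%
2      -3%      -40%
3      8%       30%
4      15%      70%
5      0%       10%
6      4%       -50%

The geometric mean applies only to positive numbers. To work with negative returns, transform the
data set first: convert each return r into a growth factor 1 + r, take the geometric mean of the factors, and subtract 1.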
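A sketch of the geometric mean for the two investments in the table above. The helper geo_mean_return() is a hypothetical name introduced here; it works on growth factors 1 + r so that negative returns are allowed:

```r
inv_a <- c(0.12, -0.03, 0.08, 0.15, 0.00, 0.04)
inv_b <- c(0.50, -0.40, 0.30, 0.70, 0.10, -0.50)

# Hypothetical helper: geometric mean return via growth factors
geo_mean_return <- function(r) prod(1 + r)^(1 / length(r)) - 1

mean(inv_a); geo_mean_return(inv_a)   # arithmetic vs geometric average of A
mean(inv_b); geo_mean_return(inv_b)   # B's arithmetic mean overstates its growth

prod(1 + inv_b)                       # total growth factor of B over 6 years
```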


quantile() is the function for finding the 1st quartile and the other quartiles.

The standard deviation is larger when the values are more spread out around the mean.
For computing Z-scores, the function is scale().
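A minimal sketch of these three functions on made-up data:

```r
x <- c(12, 15, 11, 20, 18, 25, 14, 16)   # hypothetical observations

quantile(x)                  # minimum, 1st quartile, median, 3rd quartile, maximum
quantile(x, 0.25)            # 1st quartile only
sd(x)                        # standard deviation

z <- scale(x)                # z-scores: (x - mean(x)) / sd(x), returned as a matrix
z_manual <- (x - mean(x)) / sd(x)
```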


Mean Absolute Deviation


The mean absolute deviation (MAD) of a set of data is the average distance between each data value
and the mean, i.e. the average of the positive distances of each point from the mean.
The larger the MAD, the greater the variability in the data (the data is more spread out).
Find out the purpose for MAD

MAD: Exercise
The likes received are: (10, 15, 15, 17, 18, 21)
To find out how far, on average, each count of likes is from the mean, one would use the mean absolute deviation.
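The exercise can be worked out directly from the definition. Note that R's built-in mad() computes the median absolute deviation, which is a different measure, so the mean absolute deviation is computed by hand here:

```r
likes <- c(10, 15, 15, 17, 18, 21)

m        <- mean(likes)            # 16
dist     <- abs(likes - m)         # positive distances: 6 1 1 1 2 5
mean_abs <- mean(dist)             # 16 / 6 = 2.666...
```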


1st Aug 2019

Unit VIII: Tables and Graphs


One way of doing aggregation is using frequency tables, built with table().

In the above table, 78 minutes is the most frequent waiting time between two eruptions.

The difference between the above 2 images is that in the second picture the dataset is divided into
bins, so counts are reported for ranges of values rather than for individual values. To obtain the
bins, the cut() function must be used. This is helpful especially when dealing with larger
datasets.
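A sketch of both kinds of frequency table. The notes do not name the dataset; using the waiting column of the built-in faithful dataset is an assumption, as are the bin boundaries:

```r
waiting <- faithful$waiting          # assumption: eruption data from the built-in dataset

freq <- table(waiting)               # one count per distinct waiting time
names(freq)[which.max(freq)]         # the most frequent waiting time

# Binned table: cut() assigns each value to a range (bin)
bins   <- cut(waiting, breaks = seq(40, 100, by = 10))   # assumed 10-minute bins
binned <- table(bins)
print(binned)
```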
Table: Illustrations


Use the function: pie(scores)

For the above pie-chart, the colors of the slices should match the colors listed in the code.

In order to change the color of each slice of the pie-chart, pass the color names or color codes through the col argument.

Bar Plots
In the plot above, 0 to 15 is the default axis range; R chooses it from the data, and it can be changed with ylim.


abline(h = mean(sample_data))
abline() with the h argument draws a horizontal line at the given value, here the mean of sample_data.


In the barplot function instead of using percentage, if we use -percentage, then:


Using the sort function for colors:

To plot the above graph horizontally:


              Business      Credit   Exception   Process   Trend
              Development   Rating   Handling    Mapping   Analysis
PG Finance    2             9        3           11        9
MMS Finance   4             8        7           3         12
PG FS         5             2        8           10        11


If we remove beside = TRUE in the barplot(), the bars of each group are stacked instead of being drawn side by side:
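A sketch of grouped versus stacked bars for the table above. The five column labels are an assumed reading of the flattened table header; the row names and numbers are taken from the notes:

```r
trend <- matrix(
  c(2, 9, 3, 11,  9,
    4, 8, 7,  3, 12,
    5, 2, 8, 10, 11),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("PG Finance", "MMS Finance", "PG FS"),
    # Assumption: column labels reconstructed from the two-line header
    c("Business Dev.", "Credit Rating", "Exception Handling",
      "Process Mapping", "Trend Analysis")
  )
)

# Grouped bars: one bar per programme within each topic
barplot(trend, beside = TRUE, legend.text = TRUE, cex.names = 0.7)

# Stacked bars: the default when beside = TRUE is removed
barplot(trend, legend.text = TRUE, cex.names = 0.7)
```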


Boxplots (or a box and whisker plot)


Boxplots Illustration – 5


Histogram


Line Graphs
plot() by default gives a scatter plot.


If we put pch = 19 inside the plot function, the points are drawn as solid circles:

To plot the Year on the x-axis, insert Year before Value in plot(), as shown below.


Density


There is a ready to use dataset known as trees.


Girth tells the diameter of the trunk.


plot(trees)


contour(volcano)

persp(volcano)


image(volcano)

Creating Factors



Research
1. Qualitative
2. Quantitative
a. Experimental (Cause and effect)
i. Between subject design: Different participants are randomly assigned to
different conditions.
ii. Within subject design: Same participants are randomly allocated to more
than one condition – referred also to as repeated measures design.
iii. Mixed design experiment
b. Non- experimental
i. Commonly used in sociology, political science and management
disciplines.
ii. Research is often done with surveys
iii. No random assignment of participants to a particular group
iv. Two approaches to analyzing such data are:
1. Testing for significant differences across the groups
a. E.g. IQ levels of participants from different ethnic
backgrounds
2. Testing for significant association between two factors
a. E.g. firms’ sales and advertising expenditure


Qualitative Research
• Involves collecting qualitative data by means of interviews, observations, field notes,
open-ended questions etc.
• Researcher is the primary data collection instrument.
• Data could be in the form of words, images, patterns etc.
• Two types of data used are:
o Primary data: Data collected directly from the subjects of study
o Secondary data: also known as archival data

Type of variables
1. Qualitative variables
a. Differ in kind rather than degree
2. Quantitative variables
a. Differ in degree rather than kind

Reliability and validity


Two important characteristics of measurement are reliability and validity.
The reliability of an instrument does not guarantee its validity.
Reliability is the degree to which one may expect to find the same result if a measurement is
repeated.

Assessing validity
Aims at determining how accurately the measure reflects the underlying
trait it is trying to measure.
Epsilon: the inherent error present in every individual, human being and task.

Three important aspects of validity:


1. Predictive
2. Content
a. Refers to the extent to which a measurement reflects the specific domain of
content
3. Construct
a. Most commonly used technique in social sciences

b. Looks for expected patterns of relationships among variables (based on theory and principles)
Hypothesis Testing
A hypothesis is an assumption or claim about some characteristics of population, which through
empirical evidence can be supported or rejected.
The different types of hypothesis are:
1. Null hypothesis
o When Null hypothesis is accepted, no action is required
2. Alternative hypothesis
o When Alternative hypothesis is accepted, a corrective action must be taken

Types of error
1. Type I error
a. The null hypothesis is rejected when it should have been accepted (it is actually true).
2. Type II error
a. The null hypothesis is accepted when it should have been rejected (it is actually false).


6th Aug 2019

Stats App 2

$$\text{Coefficient of variation} = \frac{\text{Standard Deviation}}{\text{Mean}}$$
y -> Dependent variable
x -> Independent variable

Probability Distribution


Normal distribution
Normal distribution is continuous with range -infinity to infinity.
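R's d/p/q/r family of functions covers the normal distribution. A minimal sketch for the standard normal (mean 0, standard deviation 1):

```r
x <- seq(-4, 4, by = 0.1)
plot(x, dnorm(x), type = "l", lwd = 2,
     main = "Standard normal density")   # dnorm(): height of the bell curve

pnorm(1.96)                # P(Z <= 1.96), about 0.975
pnorm(1) - pnorm(-1)       # share within 1 standard deviation, about 0.683
qnorm(0.975)               # inverse of pnorm(): about 1.96
```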


Some of the options available for the type argument when creating the graph:

• "p" for points
• "l" for lines
• "b" for both
• "c" for the lines part alone of "b"
• "o" for both overplotted
• "h" for ‘histogram’ like vertical lines
• "n" for no plotting


lwd is the thickness of the line.


Strip chart

RColorBrewer package


> hist(x, col = "green", font = 4, lty = 8)


For the mtext() function, side sets the position of the text (1 = bottom, 2 = left, 3 = top, 4 = right).


7th Aug 2019

Analysis using R

Data analysis is all about searching for patterns. Research in behavioral sciences:
1. Qualitative research
2. Quantitative research
a. Data in the form of variables, establishing statistical relationship
b. Results are then generalizable to the entire population
c. There are two types of this research:
i. Experimental Research
1. In between-subject design: Different participants are randomly
assigned to different conditions
2. In within-subject design: The same participants are allocated to more than one condition
ii. Non-experimental Research
1. Testing for significant differences
2. Testing for significant associations
Hypothesis testing: Options are mutually exclusive and exhaustive
• Null hypothesis (H0)
• Alternative hypothesis (H1)


hist(sales_data$Age)

> boxplot(sales_data$Age)


boxplot(sales_data)

describeBy() from the psych package reports descriptive statistics separately for each group, as shown above.

Skewness
For skewed distributions, it is quite common for one tail of the distribution to be longer than the other. In real data it is very
rare to get a skewness of exactly 0. The skewness value can be positive (longer right tail) or negative (longer left tail).


Kurtosis is used for cluster analysis. The coefficient of kurtosis measures the peakedness of a
distribution. High kurtosis means that values close to the mean are relatively more frequent and
extreme values (very far from the mean) are also relatively more frequent.
Z-tests and t-tests
• Common rationale but different assumptions
• For z-tests, the population mean and standard deviation should be known exactly.
• If the sample is small and the population standard deviation is unknown, use t-tests

One sample t-test


Compare the mean of a single sample with the population mean.
• Economist: Is the per-capita income of a particular region the same as the national average?


What is to be rejected must be a part of the null hypothesis (H0).

• H0: There is no difference between the avg. salary of the business school being tested and
the avg. salary of top five business schools.
• H1: There is a difference between the avg. salary of the business school being tested and
the avg. salary of top five business schools.

From the output of the above function, we conclude in favor of the alternative hypothesis: the true mean is not
equal to 750.
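A minimal sketch of the one-sample t-test described above. The salary figures are made up for illustration; only the benchmark of 750 comes from the notes:

```r
# Hypothetical salaries of graduates from the school being tested
salary <- c(720, 690, 765, 705, 680, 735, 710, 695, 725, 700)

# H0: true mean = 750, H1: true mean != 750 (two-sided by default)
res <- t.test(salary, mu = 750)
print(res)

res$statistic      # t-value
res$p.value        # reject H0 at the 5% level when this is below 0.05
```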
Analysis Case Exercise - 3
Compare the efficiency of the workers of two mines: privately owned (mine 1) and government
owned (mine 2)
What is to be rejected must be a part of the null hypothesis (H0).
• H0: There is no difference between the efficiency of the workers of mine 1
and the efficiency of the workers of mine 2.
• H1: There is a difference between the efficiency of the workers of mine 1 and
the efficiency of the workers of mine 2.


However, if the test is one-sided, the hypotheses are:

• H0: Efficiency of workers of mine 1 is ≤ Efficiency of workers of mine 2
• H1: Efficiency of workers of mine 1 is > Efficiency of workers of mine 2
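Both versions of the two-sample test can be sketched with t.test(). The efficiency figures for the two mines are made up for illustration:

```r
# Hypothetical efficiency scores of workers in each mine
mine1 <- c(52, 55, 58, 60, 57, 54, 59, 61)   # privately owned
mine2 <- c(48, 50, 53, 49, 51, 47, 52, 50)   # government owned

# Two-sided: is there any difference between the two means?
two_sided <- t.test(mine1, mine2)

# One-sided: is mine 1 more efficient than mine 2?
one_sided <- t.test(mine1, mine2, alternative = "greater")

two_sided$p.value
one_sided$p.value
```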

Dependent (paired) samples t-test


• A mid-size call center deputes 20 of its agents to a training program.
• The training consultant's claim is that completing the training will greatly enhance the
efficiency of the agents.
• The task is to compare the efficiency of the same employees before and after training.
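Because the same 20 agents are measured twice, the paired form of t.test() applies. A minimal sketch with made-up efficiency scores:

```r
# Hypothetical efficiency scores of the 20 agents before training
before <- c(62, 58, 70, 65, 61, 68, 59, 72, 66, 63,
            60, 69, 64, 67, 71, 57, 65, 62, 68, 66)
# Hypothetical gains after training, one per agent
gain   <- c(4, 6, 2, 5, 3, 1, 7, 2, 4, 5,
            6, 0, 3, 2, 1, 8, 4, 5, 2, 3)
after  <- before + gain

# H0: mean difference <= 0, H1: training increased efficiency
res_paired <- t.test(after, before, paired = TRUE, alternative = "greater")
print(res_paired)
```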


7th Aug 2019

Regression Analysis

Predictive modelling technique


Estimates the relationship between a dependent (target) variable and an independent (predictor) variable.
Types of regression analysis:
1. Linear Regression lm()
a. It attempts to model the relationship between two variables by fitting a linear
equation to observed data.
b. A linear regression line has an equation of the form Y = ax + b, where a is the slope and b is the intercept.
c. Purpose:
i. Used for forecasting
ii. Modelling the relationship between x and Y.
iii. Testing of hypothesis
2. Polynomial regression
a. Relationship between the independent variable x and the dependent variable y is
modelled as an nth degree polynomial in x.
b. Polynomial regression fits a non-linear relationship between x and Y.
3. Logistic Regression
a. It is a statistical method for analyzing a dataset in which the outcome variable is binary.
Multiple regression analysis estimates the relationship between:
• Multiple independent variables
• One dependent variable
Linear Regression


• H0: There exists no relationship between number of hrs. of study and freshman score.
• H1: There exists a relationship between number of hrs. of study and freshman score.

The mean, as a measure of central tendency, cannot use the hours of study to predict the score. Hence linear
regression must be done here for analysis.


The equation from the above analysis is: Y = 22.315x + 3.479

In the above equation there is a deviation between the estimated value (y) and the actual value
(fscr).

The residual value (diff) = Actual Value (fscr) – Predicted Value (y)
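A sketch of the same workflow with lm(). The study-hours data below are made up (the dataset behind the equation above is not reproduced in the notes); fscr is the name the notes use for the freshman score, and hrs is a hypothetical name for hours of study:

```r
# Hypothetical data: 10 students
hrs  <- c(10, 12, 15, 18, 20, 22, 25, 28, 30, 32)
fscr <- c(55, 60, 62, 70, 72, 75, 80, 86, 88, 93)

fit <- lm(fscr ~ hrs)              # dependent ~ independent
coef(fit)                          # intercept and slope of the fitted line

y        <- fitted(fit)            # predicted value for each student
residual <- fscr - y               # residual = actual - predicted

all.equal(unname(residual), unname(resid(fit)))   # TRUE: same as resid()
df.residual(fit)                   # 10 observations - 2 coefficients = 8
```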


In the above output, Pr(>|t|) is the probability of observing a t-value at least as extreme as the one computed, if the true coefficient were zero. A
coefficient marked with ‘***’ is highly significant (p < 0.001). A coefficient
marked with ‘**’ is significant at p < 0.01.


Any activity carries some inherent error, known as the epsilon error.
$$t\text{-value} = \frac{\text{Coefficient Estimate}}{\text{Standard error}}$$
The further the t-value is from zero, the stronger the evidence for rejecting the null hypothesis.
Degrees of freedom = number of rows in dataset – number of columns in dataset (for a simple regression, the number of observations minus the two estimated coefficients)
The degrees of freedom here are 8.
Residual standard error (RSE): It is a measure of the quality of a linear regression fit.
Every linear model is assumed to contain an error term, E, which prevents exact prediction of the
response values.

Multiple R-squared (the R² statistic, or coefficient of determination) provides a measure of how
well the model fits the actual data. R² is the measure of the linear relationship between the predictor
and response variables.


8th Aug 2019

R Analytics

If the value of x = 25, y = 30. For all values of x, y = 30.


Any statistical model in R works properly with a data.frame().


The mean is not an exact measure for predicting a score, because it ignores the values of the independent
variable. If there are more variables, correlation becomes very important.
A relationship among the independent variables is known as multi-collinearity.
For creating a simple regression model: lm(dependent variable ~ independent variable).
In any business, one must compare expected values with actual values. Benchmarking means
comparing against the best practices that the industry follows.

Case Analysis 12.1


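Adjusted R² is less than multiple R².
$$F = \frac{\text{Explained variation}}{\text{Unexplained variation}} \cdot \frac{n-k}{k-1}$$

$$F = \frac{R^2}{1-R^2} \cdot \frac{n-k}{k-1}$$
Here n is the number of observations and k is the number of estimated coefficients (including the intercept).
The larger the number of factors, the more important it is to understand the statistics.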
Two most important functions to remember: predict() and resid().
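The F formula above can be checked against the output of summary(). This sketch uses the built-in women dataset (height and weight of 15 women) as a stand-in, since the case data is not reproduced in the notes:

```r
fit <- lm(weight ~ height, data = women)   # assumption: women as example data
s   <- summary(fit)

n  <- nrow(women)            # number of observations
k  <- length(coef(fit))      # number of estimated coefficients (intercept + slope)
r2 <- s$r.squared

f_manual <- (r2 / (1 - r2)) * (n - k) / (k - 1)
f_manual                     # matches the F-statistic reported by summary()
s$fstatistic[["value"]]

s$adj.r.squared < s$r.squared   # TRUE: adjusted R-squared is the smaller one
```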


To predict weight for new values, the corresponding heights must be put inside a data frame a, which is then passed to predict() as newdata.
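A minimal sketch of predict() and resid(), assuming the built-in women dataset as the height and weight data; the new heights in a are made-up values:

```r
fit <- lm(weight ~ height, data = women)   # assumption: women as example data

# The column name in a must match the predictor name used in the model
a <- data.frame(height = c(61, 66, 71))    # hypothetical new heights
pred <- predict(fit, newdata = a)          # one predicted weight per row of a
pred

head(resid(fit))                           # actual - predicted for the training data
```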

Exercise 2: Polynomial Regression


y = 0.006757 * t + 2.306306


Add a polynomial term of degree 2 to the model:

y = intercept + (coefficient of t) * t + (coefficient of t2) * t^2
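A sketch of fitting the degree-2 model with lm(). The data are made up and follow a curved pattern; I() is needed so that ^ is read as arithmetic inside the formula:

```r
# Hypothetical data with a curved relationship
t <- 1:10
y <- c(2.7, 3.3, 4.5, 5.5, 7.1, 8.5, 10.5, 12.3, 14.7, 17.0)

fit_linear <- lm(y ~ t)               # straight line
fit_poly   <- lm(y ~ t + I(t^2))      # adds the squared term

coef(fit_poly)                        # intercept, coefficient of t, coefficient of t^2

# The polynomial leaves less unexplained variation than the straight line
deviance(fit_linear)
deviance(fit_poly)

plot(t, y, pch = 19)
lines(t, fitted(fit_poly), col = "red", lwd = 2)
```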


Multiple Regression
Total delivery time depends upon the total distance travelled in miles as well as the number of deliveries.
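With two predictors, the model formula lists both on the right-hand side. A minimal sketch with made-up delivery data; the column names miles, deliveries and time are hypothetical:

```r
# Hypothetical delivery records
delivery <- data.frame(
  miles      = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  deliveries = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  time       = c(7.0, 5.4, 6.6, 7.4, 4.8, 6.4, 7.0, 5.6, 7.3, 6.4)
)

fit_multi <- lm(time ~ miles + deliveries, data = delivery)
summary(fit_multi)

# Predicted time for a hypothetical trip of 90 miles with 4 deliveries
new_trip <- data.frame(miles = 90, deliveries = 4)
predict(fit_multi, newdata = new_trip)

cor(delivery$miles, delivery$deliveries)   # check for multi-collinearity
```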


> plot(sds_data, col = "red", pch = 19)

Infographics was created in 1862.


One needs to know the audience in print media.
Hans Rosling looked at the social causes and created a wonderful visual.


Hans Rosling books:


1. Factfulness
2. Leap

p <- c(12,14,16,18,21) can also be written as: c(12,14,16,18,21) -> p


Functions can also be created in R.
The 3 regressions are:
1. Linear
2. Polynomial
3. Multiple
glm() fits a generalized linear model.
ts() is the function for creating a time series.
A decision tree is a model with the structure of a tree: each internal node tests a variable and each leaf gives a prediction.

Logistic Regression
Logistic regression is a predictive modelling algorithm that is used when the Y variable is binary
categorical. The goal is to determine a mathematical equation that can be used to predict the
probability of event 1. Once the equation is established, it can be used to predict the Y when only
the X’s are known. Logistic regression is used for:
1. Spam detection
2. Credit card fraud
3. Health
4. Marketing
5. Banking

Logistic Regression example


From the above data, check whether any value is missing (NA), using the function is.na().

To check whether the admits are distributed well enough in each category of rank, cross-tabulate admit against rank, e.g. with xtabs(~ admit + rank).

To change the order use:

Convert the rank variable from integer to factor, using the function as.factor(rank).


A unit increase in GPA increases the log-odds of getting admitted by 0.777014.


The greater the difference between the null deviance and the residual deviance, the better the model.
Predict the chances of getting admitted if GRE is 790, GPA is 3.8 and the undergraduate college is a rank 1
college.

This shows a 75.52% chance of getting admitted to the university.
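The steps above can be sketched end to end with glm(). The admissions data used in class is not reproduced in the notes, so this example simulates a data frame with the same column names (admit, gre, gpa, rank); the coefficients and the predicted probability will therefore differ from the figures quoted above:

```r
set.seed(42)
n <- 200

# Simulated stand-in for the admissions data (not the real dataset)
admissions <- data.frame(
  gre  = round(runif(n, 300, 800)),
  gpa  = round(runif(n, 2.2, 4.0), 2),
  rank = sample(1:4, n, replace = TRUE)
)
log_odds <- with(admissions, -6 + 0.004 * gre + 1.0 * gpa - 0.4 * (rank - 1))
admissions$admit <- rbinom(n, 1, plogis(log_odds))

any(is.na(admissions))                       # check for missing values
admissions$rank <- as.factor(admissions$rank)
xtabs(~ admit + rank, data = admissions)     # admits per rank category

fit <- glm(admit ~ gre + gpa + rank, data = admissions, family = binomial)
coef(fit)["gpa"]                             # change in log-odds per unit of GPA
exp(coef(fit)["gpa"])                        # the same effect as an odds ratio

new_student <- data.frame(
  gre = 790, gpa = 3.8,
  rank = factor(1, levels = levels(admissions$rank))
)
p <- predict(fit, newdata = new_student, type = "response")   # probability of admission
p
```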


Decision tree
The function used is ctree() from the party package, which is loaded with:

library("party")


Time series
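This can be used, for example, for continuous monitoring of a person’s heart.
The four types of variations are:
1. Trend variations
2. Seasonal variations
3. Cyclical variations
4. Irregular (random) variations

ts() is the function used to create a time series.


The function to plot the time series is plot.ts()

Changing from frequency 1 to frequency 12 turns yearly observations into monthly ones.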


Create a vector to contain prices of a product in a year from Jan’17 to Dec’17. Create a time
series data of 12 months.

Now to do a time series from Apr to Dec
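Both steps can be sketched with ts() and window(). The price values are made up for illustration:

```r
# Hypothetical monthly prices, Jan'17 to Dec'17
prices <- c(101, 103, 102, 105, 108, 110, 109, 112, 115, 117, 116, 120)

# frequency = 12 means 12 observations per year, i.e. monthly data
price_ts <- ts(prices, start = c(2017, 1), frequency = 12)
print(price_ts)
plot.ts(price_ts)

# Sub-series from Apr'17 to Dec'17
apr_dec <- window(price_ts, start = c(2017, 4), end = c(2017, 12))
print(apr_dec)
```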
