Data Analytics Using R-Programming Notes

Abhishek Bawa Data Analytics using R-Programming KJ SIMSR 2018-20
Data Analytics using R-Programming

29th July 2019
Introduction to R-Programming
Everything in R is treated as a vector; even a single value is a vector of length one.
Ctrl + L clears the console screen in R.
1. Greet the world using "Hello World"
a. Command: > print("Hello World")
2. Add 2 to each element of vector x
a. x + 2
3. Add the values 6:10
a. sum(6:10)
4. x <- 100:105
a. Then do x + 1:3
b. The shorter vector 1:3 is recycled, giving 101 103 105 104 106 108.
5. Generate the sum of the first 10 natural numbers
a. sum(1:10)
b. This is a case of compute, store and return.
6. Generate the sum of the squares of the first 10 natural numbers
a. sum((1:10)^2)
7. Construct a vector with elements 1,2,3,4,5
a. > c(1:5)
b. c() is the function used to construct (combine values into) a vector.
8. Display First 10 natural numbers, their squares and cubes
a. a <- 1:10
b. b <- a^2
c. c <- a^3
d. print (a,b,c)
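i. print() displays only a single object
ii. Hence in the above example only a will be displayed
iii. print() returns the value of its first argument; the remaining arguments are treated as printing options, not as further values to display.
e. paste(a, b, c)
i. It picks up the corresponding values of a, b and c and joins them element by element into strings, e.g. "1 1 1", "2 4 8", "3 9 27".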
9. Assign the elements 1:5 to x and 10 to y. Add y to every value of x

a. x <- 1:5
b. y <- 10
c. x <- x + y
d. print(x)
The value of a vector changes only when it is reassigned with "<-" or "=".
R is a case-sensitive programming language.
c() can also be used for concatenation.
Suppose we call print(p, g), where:

o p holds numerical values
o g is a string
o print() displays only its first argument, so p and g cannot be printed together this way.
print(paste(p, g))
paste() is not meant for display: it combines values and returns a character vector. At the console the result is printed automatically, so it appears to do both.
print() is the actual function for display.
paste() combines values, each of which is converted to a string.
Anything written inside "" is taken as a string.
Each vector has an order.
Vectors have the following:
1. Vectors have an index
2. Vectors have an order
m = c(11, 13, 12, 9, 17)
order(m): Gives the index positions of m arranged so that the values come out in ascending order.
sort(m): Sorts the values in ascending order.
To sort the vector in decreasing order, use sort(m, decreasing = TRUE).
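The difference between order() and sort() is easiest to see side by side. A minimal sketch using the vector m from the notes; the names idx, ascending and descending are only illustrative:

```r
# Vector from the notes
m <- c(11, 13, 12, 9, 17)

idx <- order(m)                           # positions of the elements, smallest value first
ascending <- sort(m)                      # the values themselves in ascending order
descending <- sort(m, decreasing = TRUE)  # the values in decreasing order

print(idx)
print(ascending)
print(descending)
print(m[idx])                             # indexing m with order(m) reproduces sort(m)
```

Indexing with order() is useful when a second vector has to be rearranged in the same sequence as m.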
To accept the value of a variable through the console at runtime, use readline():

yourname <- readline("prompt")
prompt: the message displayed while R pauses and waits for the user to type a value. The value is returned as a string.

getwd(): This command shows the current working directory.
When New Script is chosen, the script editor comes to the front and the console moves to the background.
Any programming language is made up of keywords. Keywords are the commands of the programming language.
A command plus its parameters makes one instruction.
In R, the console takes one instruction at a time and executes it. A set of such instructions is referred to as a program; in R it is known as a script.
Once the script is saved with a name, it remains available until it is deleted.
To execute a saved script, use the source() function.
objects(): This function lists all the objects created during the session.
To check all the packages installed in R, use the function installed.packages().
R provides built-in datasets for practising analysis.
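To convert a 1D array (vector) to a 2D array (matrix):
t(x): transpose of the matrix x

cbind() stands for column bind: it binds vectors together as the columns of a matrix.
x <- 1:12
y <- c(x, x^2, x^3)
cbind(y): binds the single vector y as one column.
dim(y) <- c(12,3): reshapes y into 12 rows and 3 columns.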
Illustration of cbind() function in R-Programming. cbind() requires the argument to be specified again.
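The two routes from a vector to a matrix described above can be compared directly. A minimal sketch using x and y from the notes; z is an illustrative name:

```r
x <- 1:12
y <- c(x, x^2, x^3)        # one long vector of 36 values
dim(y) <- c(12, 3)         # reshape in place: 12 rows, 3 columns, filled column by column

z <- cbind(x, x^2, x^3)    # the same values, built by binding three columns

print(dim(y))
print(all(y == z))         # both routes hold identical values
print(t(y)[, 1:4])         # the transpose has 3 rows; show its first 4 columns
```

Setting dim() relies on the values already being in column order, whereas cbind() makes the column structure explicit.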
mode() returns the data type of a variable. It does not compute the statistical mode (the most frequent value).
Difference between c() and list(): c() builds a vector whose elements all share one data type, whereas list() can hold elements of different types.
Powerful illustration of paste(): when B has only a single value, it is pasted onto every item of A.
Function arguments:
There are two types of arguments in functions:
• Argument with default value
• Argument without default value
Every time a numeric value is printed, it is displayed with up to 7 significant digits, e.g. 11/7 is displayed as 1.571429.
To specify the number of significant digits to be displayed, use the digits argument, e.g. print(x, digits = 3).
If the number is too large or too small, it is displayed in scientific notation using e (10 to the power of).
Printing 0.0000062 displays 6.2e-06.
By default, R displays seven significant digits for a numeric value.
pi is a ready-to-use constant in R.
Another way to display output is cat(). print() ends its output with a new line, so the next output starts on a fresh line. cat() does not add a new line, so the next output continues on the same line unless "\n" is included.
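The display rules above can be tried out in a few lines. A minimal sketch; the value 11/7 comes from the notes, the rest is illustrative:

```r
x <- 11 / 7
print(x)                    # default display: 7 significant digits
print(x, digits = 3)        # 3 significant digits
print(round(x, 2))          # rounded to 2 decimal places
print(0.0000062)            # very small numbers switch to scientific notation

# cat() does not append a newline on its own; "\n" must be supplied explicitly
cat("x is", format(x, digits = 4))
cat(" (still on the same line)\n")
```

Note that digits controls significant digits, while round() controls decimal places.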
30th July 2019
Operators
R has four different groups of mathematical operators and functions:

1. Basic Arithmetic operators
2. Mathematical Functions
3. Vector operations
4. Matrix operations
5%%2 gives the remainder (1). 5%/%2 gives the integer quotient (2).

2%%5 = 2
2%/%5 = 0
5/0 gives Inf. Inf stands for infinity.
Complex numbers
y <- 3 + 2i
• Re(y) gives the real part of a number.
• Im(y) gives the imaginary part of a number.
Mathematical Functions
Sr. No   Mathematical Function   Description of the function
1.       abs()                   Absolute value of a number
2.       log()                   Natural logarithm of a number (base e by default)
3.       sqrt()                  Square root of a number
4.       exp()                   e raised to a power; e^3 is obtained with exp(3)
5.       choose(n, r)            Number of combinations, i.e. nCr
6.       floor(n)                Largest whole number not greater than n
7.       ceiling(n)              Smallest whole number not less than n
8.       round(n, 6)             Rounds n to 6 decimal places
9.       trunc(n)                Drops the decimal part of the number
10.      str(v)                  Structure of v
round(n, 0): Rounds the number n to the nearest rupee (whole number).
round(n, -1): Rounds the number n to the nearest ten.
round(n, -2): Rounds the number n to the nearest hundred.
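The effect of the second argument of round(), and the difference between floor(), ceiling() and trunc(), shows up clearly on one number. A minimal sketch; the amount n is a made-up value:

```r
n <- 1234.5678                       # hypothetical amount

print(round(n, 2))                   # 2 decimal places
print(round(n))                      # nearest whole number
print(round(n, -1))                  # nearest ten
print(round(n, -2))                  # nearest hundred

print(c(floor(n), ceiling(n), trunc(n)))
print(c(floor(-2.5), trunc(-2.5)))   # floor goes down to -3, trunc drops the decimals to give -2
print(choose(5, 2))                  # 5C2
```

floor() and trunc() agree for positive numbers and differ only for negative ones.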
Trigonometric Functions
cos(pi/4): angles are given in radians. π/4 radians = 180°/4 = 45°, and cos(45°) ≈ 0.707.
To get the alphabets in lower case and upper case, use the built-in vectors letters and LETTERS.
Types of vectors:
• Numeric vectors
• Integer vectors
• Logical vectors
• Character vectors
• DateTime vectors
• Factors
Replicate/ Repeat: rep()
Index of an element
Generate the natural numbers from 1 to 30 in reverse order. After that, display the elements at index 11, 17 and 21.
A continuous range (e.g. 21:30) can be used directly as an index. For a discrete set of positions, combine the numbers with c() and use that in the index position. The same approach extracts the last 10 elements of the vector.
To change the value at a specific index of the vector, assign to that index.
To change the 1st and the 5th index to 18.
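The indexing exercises above can be sketched as follows. The vector names v and w are illustrative, and the choice to change position 3 is arbitrary:

```r
v <- 30:1                   # natural numbers 1 to 30 in reverse order
print(v[c(11, 17, 21)])     # discrete positions are combined with c()
print(v[21:30])             # continuous range: the last 10 elements
print(tail(v, 10))          # the same result using tail()

w <- v                      # work on a copy so that v stays unchanged
w[3] <- 0                   # change one specific position
w[c(1, 5)] <- 18            # change the 1st and the 5th positions to 18
print(head(w))
```

Positions in R start at 1, so v[11] is the eleventh element, which is 20 here.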
a = 5 (Assigning the value 5 to a)

b = 5 (Assigning the value 5 to b)
a == b (Tests whether a equals b; returns TRUE if so, otherwise FALSE)
Logical Operators
Sr. No   Operator   Description of the Operator
1.       &          Logical AND, applied element by element
2.       &&         Logical AND on single values
3.       |          Logical OR, applied element by element
4.       ||         Logical OR on single values
5.       !          Logical NOT
Or
> i <- which(a>b)
> paste(i, a[i], b[i])
any() and all() function in R-Programming

x <- rep(1,10)
This function replicates 1 ten times in x
x then becomes 1 1 1 1 1 1 1 1 1 1
> all(x==1)
[1] TRUE
> any(x!=1)
[1] FALSE
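The logical operators, which(), any() and all() fit together as in the following sketch. The vectors a and b hold made-up values; x follows the notes:

```r
a <- c(5, 8, 2, 9)                    # hypothetical values
b <- c(3, 8, 7, 1)

print(a == b)                         # element-wise comparison
print(a > b & a > 4)                  # & works element by element
print(a[1] > b[1] && a[2] > b[2])     # && compares single values only
print(!(a > b))                       # ! negates each element

i <- which(a > b)                     # positions where a exceeds b
print(paste(i, a[i], b[i]))           # position, value in a, value in b

x <- rep(1, 10)
print(all(x == 1))                    # TRUE only if every element passes
print(any(x != 1))                    # TRUE if at least one element passes
```

which() turns a logical vector into the positions that are TRUE, which is why it pairs naturally with indexing.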
Arithmetic Vector Operation

Sr. No   Arithmetic Vector Operation   Description
1.       sum()       Sum of all the numbers
2.       prod()      Product of all the numbers
3.       min()       Minimum of all the numbers
4.       max()       Maximum of all the numbers
5.       cumsum()    Returns a new series in which each element is the sum of all the numbers up to and including that index.
6.       cumprod()   Returns a new series in which each element is the product of all the numbers up to and including that index.
7.       cummin()    Returns the running minimum: each value is compared with the lowest value so far and the lower of the two is kept.
8.       cummax()    Returns the running maximum: each value is compared with the highest value so far and the higher of the two is kept.
9.       diff()      Differences between consecutive elements
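The cumulative functions in the table are easiest to follow on a short vector. A minimal sketch with made-up values in v:

```r
v <- c(3, 1, 4, 1, 5)       # hypothetical values

print(c(sum(v), prod(v), min(v), max(v)))
print(cumsum(v))            # running total
print(cumprod(v))           # running product
print(cummin(v))            # lowest value seen so far
print(cummax(v))            # highest value seen so far
print(diff(v))              # each element minus the one before it
```

diff() returns one element fewer than its input, because the first element has no predecessor.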
31st July 2019
Class Exercises
A1 – Dealing with missing values
A student must attend any 4 out of 5 tests, i.e. a student can skip any one of the five tests. Scores of the
student are: (27, 32, 45, NA, 39)
Any construct requires:
1. Input:
a. Through console
i. Verification and Validation happens
b. From any of the data file one is extracting
c. From unstructured sources
2. Processing
a. Compute
b. Branching
c. Looping
3. Output
a. Format
Algorithm:
1. Input: Scores in 5 tests
2. Process:
a. Determine index of NA (Not available)
b. Determine valid test scores
c. Compute internal marks
3. Output: Provide the internal marks
To remove the index where the NA is, keep only the non-missing values, e.g. x[!is.na(x)] for a score vector x.
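The algorithm for exercise A1 can be sketched as below. The scores come from the notes; the rule for internal marks is not stated there, so taking the average of the valid scores is only an assumption:

```r
scores <- c(27, 32, 45, NA, 39)

na_index <- which(is.na(scores))   # step a: position of the missing test
valid <- scores[!is.na(scores)]    # step b: valid test scores

# Step c: ASSUMPTION - internal marks are taken as the average of the valid scores
internal <- mean(valid)

print(na_index)
print(valid)
print(internal)
print(mean(scores, na.rm = TRUE))  # same average without removing the NA by hand
```

Without na.rm = TRUE, mean(scores) would itself return NA.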
A2 – Finding Min and Max

“CSK scores in IPL 2018 matches are given”
Determine the min and max scores
Also, return the match numbers where they scored min and max scores.
A3 – Improvement/ Decline over previous result

Analyze the performance improvement/ decline of CSK as compared to the previous game (over
16 matches)
Logic: Analyze the match numbers from 2 to 16 and find out if there has been an improvement/
decline from the previous match and determine the percentage of improvement.
A3a – Improvement/ Decline over previous result

Analyze the improvement/ decline in test performance of the following scores by a student in
comparison with the previous score (discard the NA)
A4 – Solve
Consider vector j having elements 11:16. Multiply each element of vector alternatively with 2
and 3 respectively
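Exercise A4 can be solved with vector recycling, as in this minimal sketch; the name result is illustrative:

```r
j <- 11:16
result <- j * c(2, 3)   # c(2, 3) is recycled as 2, 3, 2, 3, 2, 3
print(result)
```

Recycling works without a warning here because the length of j (6) is a multiple of the length of c(2, 3).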
31st July 2019
Text Manipulations
> nm <- "Adam"

> length(nm)
length() gives the number of elements in nm, which is 1.
> nm <- "Adam Smith"
> length(nm)
This is still 1, because nm holds a single string. However,
> nm <- c("Adam", "Smith")
> length(nm)
This gives 2. length() gives the number of elements in the vector; nchar() gives the number of characters in each string.
islands is a built-in dataset available in R.
str function is for structure.
Exercise 5
Upper-case and lower-case alphabets
print(letters)
print(LETTERS)
Generate uppercase alphabets from A to J

print(upper_case_alphabets[1:10])
Generate lowercase alphabets from a to j

print(lower_case_alphabets[1:10])
> lower_case_alphabets <- letters

> lower_case_alphabets[1:6]
To replicate the above function and display the first 6 characters, use the function head.
To display the last 6 characters, use the function tail.
> head(lower_case_alphabets)
To get more than 6 characters use:
> head(lower_case_alphabets, 9)
The default value for head and tail is 6.
Extract names from islands
str(islands)
names(islands)
str(islands) shows that the island names are stored as an attribute: attr(*, "names")
Atomic value is a value which cannot be further split.
Extract the names of the first 8 islands:
One can also use:

names(islands[1:8])
Extract names from islands in Ascending order of size
Create a named vector month.days19 <- c(31,28,31,30,31,30,31,31,30,31,30,31)
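The named-vector exercises above can be sketched as follows. islands and month.abb are built into R; the names first8 and by_size are illustrative, and labelling month.days19 with month.abb is an assumption about how the vector was named:

```r
str(islands)                              # numeric vector with a names attribute

first8 <- names(islands)[1:8]             # names of the first 8 islands
print(first8)

by_size <- names(sort(islands))           # names in ascending order of size
print(head(by_size))

month.days19 <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(month.days19) <- month.abb          # ASSUMPTION: label each value with its month
print(month.days19)
print(names(month.days19)[month.days19 == 30])   # months with 30 days
```

names(islands)[1:8] and names(islands[1:8]) give the same result, because subsetting a named vector keeps its names.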
Splitting and concatenating text.
strsplit() splits a string at a given separator; splitting at spaces gives as many pieces as there are words in the string.
unique() removes duplicates, returning each distinct value only once.

To convert upper case to lower case, use the tolower() function.
Data Frame
A data frame gives the same view as an Excel sheet.
An Excel sheet can be brought into R in the form of a data frame. A data frame has a fixed, rectangular 2D structure: every column has the same number of rows.
In a list, different data types can be combined.
Now if one uses the function edit(us.states), this appears:
Other functions:
• substr() extracts the specified characters from a string.
• grep() returns the index of the elements of a character vector that match a pattern
o grep() is a search function, especially useful in large databases.
• gsub() replaces all matches of a pattern in a string
To find out the substring of all names displayed by the head function.
Using the gsub() function
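The text functions from this section can be tried together. A minimal sketch: the string "Adam Smith" comes from the notes, while using the island names as sample text and the particular patterns are illustrative choices:

```r
nm <- "Adam Smith"
print(length(nm))                     # 1 element
print(nchar(nm))                      # number of characters, counting the space

words <- strsplit(nm, " ")[[1]]       # split at spaces; [[1]] takes the pieces for the first string
print(words)
print(tolower(words))
print(unique(c("a", "b", "a", "c", "b")))

isl <- head(names(islands))           # the first 6 island names
print(substr(isl, 1, 3))              # characters 1 to 3 of each name
print(grep("New", names(islands)))    # index positions of names containing "New"
print(gsub(" ", "_", "New Guinea"))   # replace every space with an underscore
```

strsplit() returns a list because it can split several strings at once, one list element per input string.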
1st Aug 2019
Benford’s Law
Flexible Spending Account (FSA) is another name for CTC.

One cannot audit all the transactions.
Google application of Benford’s law.
The null hypothesis should be framed as the belief held before beginning the experiment; it is retained after the experiment unless the data provide sufficient evidence against it.
Fraud would show up as a significant departure of the actual values from the expected values.
Sr. No     Actual   Expected - Benford's law
1          132      89.41
2          50       53.46
3          32       35.64
4          29       29.70
5          19       23.76
6          11       20.79
7          10       17.82
8          9        14.85
9          5        14.85
Total ->   297      300.28
There are n = 9 observations and 3 vectors.
p-value = 0.2425 is an important part here.
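One common way to compare actual first-digit counts with Benford's law is a chi-square goodness-of-fit test. The sketch below is only one possible approach and is not necessarily the method used in class, so its result need not reproduce the p-value quoted above. It uses the actual counts from the table and the exact Benford probabilities log10(1 + 1/d), which differ slightly from the rounded expected column:

```r
actual <- c(132, 50, 32, 29, 19, 11, 10, 9, 5)   # actual counts from the table
digits <- 1:9

benford_p <- log10(1 + 1 / digits)     # Benford probability of each leading digit
expected <- sum(actual) * benford_p    # expected counts for 297 observations

result <- chisq.test(actual, p = benford_p)   # goodness-of-fit test against Benford's law

print(round(expected, 2))
print(result)
```

A small p-value from this test would indicate a significant departure of the actual counts from Benford's law.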
Measuring Central Tendency

A few functions for measuring central tendency are:
1. Mean
2. Median
One wants to know where the larger number of data values tend to converge or diverge.
The median is technically the 50% trimmed mean, i.e. trimming to 50% removes the half of the observations above the middle of x and the half of the observations below the middle of x.
When one trims 50% of the values from both ends, mean = median.
Year   Invs A   Invs B
1      12%      50%
2      -3%      -40%
3      8%       30%
4      15%      70%
5      0%       10%
6      4%       -50%
Geometric mean applies only to positive numbers. To work with negative values such as negative returns, transform the data set first, e.g. convert each return r to a growth factor 1 + r.
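The measures of central tendency can be compared on the investment returns in the table. A minimal sketch: the helper geo_mean_return() is a hypothetical name, and converting returns to growth factors 1 + r is one standard way of handling negative returns, not necessarily the one used in class:

```r
inv_a <- c(12, -3, 8, 15, 0, 4) / 100        # Invs A returns as fractions
inv_b <- c(50, -40, 30, 70, 10, -50) / 100   # Invs B returns as fractions

print(c(mean(inv_a), median(inv_a)))
print(mean(inv_a, trim = 0.5))               # the 50% trimmed mean equals the median

# Hypothetical helper: geometric mean return via growth factors (1 + r)
geo_mean_return <- function(r) prod(1 + r)^(1 / length(r)) - 1

print(c(mean(inv_a), geo_mean_return(inv_a)))
print(c(mean(inv_b), geo_mean_return(inv_b)))
```

For the more volatile investment the geometric mean falls well below the arithmetic mean, which is why it is preferred for averaging returns over time.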
quantile() is the function for finding the 1st quartile and the subsequent quartiles.
The standard deviation is bigger when the data are more spread out.
For computing the Z-score, the function is scale().
Mean Absolute Deviation

The mean absolute deviation (MAD) of a set of data is the avg. distance between each data value
and the mean. The MAD is the “Avg.” of the “positive distances” of each point from the mean.
The larger the MAD, the greater variability there is in the data (the data is more spread out).
Find out the purpose for MAD
MAD: Exercise
The likes received by MAD is: (10, 15, 15, 17, 18, 21)
To find out how far each like is away from the mean, one would use Mean Absolute Deviation.
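The mean absolute deviation for the exercise can be computed step by step. A minimal sketch using the likes from the notes; the variable names are illustrative:

```r
likes <- c(10, 15, 15, 17, 18, 21)

deviations <- abs(likes - mean(likes))   # positive distance of each value from the mean
mad_value <- mean(deviations)            # average of those distances

print(mean(likes))
print(deviations)
print(mad_value)

# Note: base R's mad() computes the MEDIAN absolute deviation, a different statistic
print(mad(likes))
```

Because mad() in R is the median absolute deviation, the mean absolute deviation has to be computed by hand as shown.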
1st Aug 2019
Unit VIII: Tables and Graphs
One way of doing aggregation is using frequency tables.
In the above table, 78 mins was the most frequently observed waiting time between two eruptions.
The difference between the above 2 images is that in the second pictures, there are ranges where
the dataset is divided into different bins. Hence a range of numbers is provided. For obtaining
bins, the cut function must be used. This would be helpful especially when dealing with larger
datasets.
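A frequency table and its binned version can be sketched as follows. This assumes the eruption data referred to above is R's built-in faithful dataset; the 10-minute bin width is an arbitrary choice:

```r
waiting <- faithful$waiting                  # ASSUMPTION: built-in Old Faithful data

freq <- table(waiting)                       # one count per distinct waiting time
most_common <- names(freq)[which.max(freq)]  # waiting time with the highest count
print(most_common)

# cut() assigns each value to a bin; table() then counts per bin
bins <- cut(waiting, breaks = seq(40, 100, by = 10))
binned <- table(bins)
print(binned)
```

With bins, the table stays short however many distinct values the dataset contains.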
Table: Illustrations
Use the function: pie(scores)
For the above pie-chart, the color should match with the colors mentioned.
In order to change the colors of each pie of the pie-chart, use the color codes for the names of
colors.
Bar Plots
0 to 15 is the default range for the bar plot.
abline(h = mean(sample_data))
abline(h = ...) draws a horizontal line at the given value, here the mean of sample_data.
In the barplot function instead of using percentage, if we use -percentage, then:
Using the sort function for colors:
To plot the above graph horizontally, add horiz = TRUE to barplot():
              Business      Credit   Exception   Process   Trend
              Development   Rating   Handling    Mapping   Analysis
PG Finance    2             9        3           11        9
MMS Finance   4             8        7           3         12
PG FS         5             2        8           10        11
If we remove beside = TRUE from barplot(), the bars are stacked instead of being placed side by side:
Boxplots (or a box and whisker plot)
Boxplots Illustration – 5
Histogram
Line Graphs
plot() is ideally used to give a scatter plot.
If we put pch = 19 inside the plot function, the points are drawn as solid circles:
To plot the Year in the graph insert Year before Value in the plot() as shown below.
Density
There is a ready to use dataset known as trees.
Girth tells the diameter of the trunk.
plot(trees)
contour(volcano)
persp(volcano)
image(volcano)
Creating Factors
By typing, we
get:
Research
1. Qualitative
2. Quantitative
a. Experimental (Cause and effect)
i. Between subject design: Different participants are randomly assigned to
different conditions.
ii. Within subject design: Same participants are randomly allocated to more
than one condition – referred also to as repeated measures design.
iii. Mixed design experiment
b. Non- experimental
i. Commonly used in sociology, political science and management
disciplines.
ii. Research is often done with the survey
iii. No random assignment of participants to a particular group
iv. Two approaches to analyzing such data are:
1. Testing for significant differences across the groups
a. E.g. such as IQ levels of participants from different ethnic
background
2. Testing for significant association between two factors
a. E.g. such as firms’ sales and advertising expenditure
Qualitative Research
• Involves collecting qualitative data by means of interviews, observations, field notes, open-ended questions etc.
• Researcher is the primary data collection instrument.
• Data could be in the form of words, images, patterns etc.
• Two types of data used are:
o Primary data: Data collected directly from the subjects of study
o Secondary data: also known as archival data
Type of variables
1. Qualitative variables
a. Differ in kind rather than degree
2. Quantitative variables
a. Differ in degree rather than kind
Reliability and validity

Two important characteristics of measurement are reliability and validity
Reliability of an instrument does not guarantee its validity.
Reliability is the degree to which one may expect to find the same result if a measurement is
repeated.
Assessing validity
Aims at determining how accurate is the relationship between the measure and the underlying
trait it is trying to measure.
Epsilon: Inherent error in every individual, human being and task.
Three important aspects of validity:

1. Predictive
2. Content
a. Refers to the extent to which a measurement reflects the specific domain of
content
3. Construct
a. Most commonly used technique in social sciences
b. Looks for expected patterns of relationships among variables (based on theory and principles)
Hypothesis Testing
A hypothesis is an assumption or claim about some characteristics of population, which through
empirical evidence can be supported or rejected.
The different types of hypothesis are:
1. Null hypothesis
o When Null hypothesis is accepted, no action is required
2. Alternative hypothesis
o When Alternative hypothesis is accepted, a corrective action must be taken
Types of error
1. Type I error
a. Rejecting a null hypothesis that is actually true, i.e. the hypothesis gets rejected when it should have been accepted.
2. Type II error
a. Accepting a null hypothesis that is actually false, i.e. the hypothesis gets accepted when it should have been rejected.
6th Aug 2019
Stats App 2
$$ \text{Coefficient of variation} = \frac{\text{Standard Deviation}}{\text{Mean}} $$
y -> Dependent variable
x -> Independent variable
Probability Distribution
Normal distribution
Normal distribution is continuous with range -infinity to infinity.
Some of the options available for the type argument of plot() when creating the graph:

• "p" for points
• "l" for lines
• "b" for both
• "c" for the lines part alone of "b"
• "o" for both overplotted
• "h" for 'histogram' like vertical lines
• "n": No Plotting
lwd is the thickness of the line.
Strip chart
RColorBrewer package
> hist(x, col = "green", font = 4, lty = 8)
For the mtext() function, side sets the position of the text (1 = bottom, 2 = left, 3 = top, 4 = right).
7th Aug 2019
Analysis using R
Data analysis is all about searching for patterns. Research in behavioral sciences:
1. Qualitative research
2. Quantitative research
a. Data in the form of variables, establishing statistical relationship
b. Results are then generalizable to the entire population
c. There are two types of this research:
i. Experimental Research
1. In between-subject design: Different participants are randomly
assigned to different conditions
2. In within-subject design
ii. Non-experimental Research
1. Testing for significant differences
2. Testing for significant associations
Hypothesis testing: Options are mutually exclusive and exhaustive
• Null hypothesis (H0)
• Alternative hypothesis (H1)
hist(sales_data$Age)
> boxplot(sales_data$Age)
boxplot(sales_data)
describeBy() helps to group the variables as shown above.
Skewness
For skewed distributions, one tail of the distribution is longer than the other. It is very
difficult to get a skewness of exactly 0. The skewness value can be positive or negative.
Kurtosis is used for cluster analysis. The coefficient of kurtosis measures the peakedness of a
distribution. High kurtosis means that values close to the mean are relatively more frequent and
extreme values (very far from the mean) are also relatively more frequent.
Z-tests and t-tests
• Common rationale but different assumptions
• For z-tests, the population mean and standard deviation should be known exactly.
• If the sample is small, then use t-tests
One sample t-test

Compare the mean of a single sample with the population mean.
• Economist: Is the per-capita income of a particular region the same as the national average?
What is to be rejected must be a part of the null hypothesis (H0).

• H0: There is no difference between the avg. salary of the business school being tested and the avg. salary of the top five business schools.
• H1: There is a difference between the avg. salary of the business school being tested and the avg. salary of the top five business schools.
From the above function, we find that the null hypothesis is rejected in favour of the alternative hypothesis: the true mean is not equal to 750.
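A one-sample t-test of this kind can be run with t.test(). In the sketch below the salary figures are made up for illustration; only the benchmark mean of 750 comes from the notes:

```r
# HYPOTHETICAL sample of salaries from the business school being tested
salaries <- c(780, 810, 765, 790, 802, 775, 798, 820, 785, 770)

# H0: true mean = 750; H1: true mean != 750 (two-sided by default)
result <- t.test(salaries, mu = 750)
print(result)

# Decision at the 5% significance level
if (result$p.value < 0.05) {
  cat("Reject H0: the true mean is not equal to 750\n")
} else {
  cat("Do not reject H0\n")
}
```

For a one-sided alternative, t.test() accepts alternative = "greater" or alternative = "less".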
Analysis Case Exercise - 3
Compare the efficiency of the workers of two mines – privately owned (mine 1) and government
owned (mine 2)
What is to be rejected must be a part of the null hypothesis (H0).
• H0: There is no difference between the efficiency of the workers of mine 1 and the efficiency of the workers of mine 2.
• H1: There is a difference between the efficiency of the workers of mine 1 and the efficiency of the workers of mine 2.
However, if the test assumption is:

• H0: Efficiency of workers of mine 1 is ≤ Efficiency of workers of mine 2
• H1: Efficiency of workers of mine 1 is > Efficiency of workers of mine 2
Dependent (paired) samples t-test

• A mid-size call center deputes 20 of its agents to a training program.
• The training consultant's claim is that completing the training will greatly enhance the efficiency of the agents.
• The test checks the efficiency of the employees before training and after training.
7th Aug 2019
Regression Analysis
Predictive modelling technique

Estimates relationship between dependent (target) and an independent (predictor) variable.
Types of regression analysis:
1. Linear Regression lm()
a. It attempts to model the relationship between two variables by fitting a linear
equation to observed data.
b. A linear regression line has an equation of the form Y = ax + b
c. Purpose:
i. Used for forecasting
ii. Modelling the relationship between x and Y.
iii. Testing of hypothesis
2. Polynomial Regression
a. Relationship between the independent variable x and the dependent variable y is
modelled as an nth degree polynomial in x.
b. Polynomial regression fits a non-linear relationship between x and Y.
3. Logistic Regression
a. It is a statistical method for analyzing a dataset in which the outcome (dependent variable) is binary.
Regression analysis estimates the relationship between:
• Multiple independent variables
• One dependent variable
Linear Regression
• H0: There exists no relationship between the number of hrs. of study and the freshman score.
• H1: There exists a relationship between the number of hrs. of study and the freshman score.
Mean as a measure of central tendency doesn't fall into the scheme of things. Hence linear
regression must be done here for analysis.
The equation from the above analysis is: Y = 22.315x + 3.479
In the above equation there is a deviation between the estimated value (y) and the actual value
(fscr).
The residual value (diff) = Actual Value (fscr) – Predicted Value (y)
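The fit, the predicted values and the residuals can be sketched as follows. The data frame study and its values are hypothetical, so the fitted coefficients will differ from the equation quoted above; the column name fscr and the variable y follow the notes:

```r
# HYPOTHETICAL data: hours of study and freshman score for 10 students
study <- data.frame(
  hrs  = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  fscr = c(48, 55, 61, 70, 74, 83, 88, 97, 101, 110)
)

fit <- lm(fscr ~ hrs, data = study)   # dependent variable ~ independent variable
print(coef(fit))                      # intercept and slope of the fitted line

y <- predict(fit)                     # predicted value for each row
manual_resid <- study$fscr - y        # residual = actual value - predicted value
print(all.equal(unname(manual_resid), unname(resid(fit))))

print(df.residual(fit))               # rows minus estimated coefficients
print(summary(fit))                   # estimates, t values, Pr(>|t|), R-squared
```

resid() returns the same differences directly, so the manual subtraction is only there to show what a residual is.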
In the above picture, Pr(>|t|) is the probability of observing a value as large as or larger than |t| if the null hypothesis were true. A
coefficient marked with ‘***’ is highly significant (p < 0.001). A coefficient
marked with ‘**’ is very significant (p < 0.01).
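Any activity involves some inherent error, known as the epsilon error.
$$ t\text{-value} = \frac{\text{Coefficient Estimate}}{\text{Standard Error}} $$
The further the t-value is from zero, the stronger the evidence for rejecting the null hypothesis.
Degrees of freedom = number of rows in the dataset – number of coefficients estimated (here equal to the number of columns in the dataset)
Degrees of freedom is 8.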
Residual standard error (RSE): It is a measure of quality of a linear regression fit.
Every linear model is assumed to contain an error term – E, preventing prediction of exact
response values.
Multiple R-squared, also called the R² statistic or coefficient of determination, provides a measure of how
well the model fits the actual data. R² is the measure of linear relationship between predictor
and response variables.
8th Aug 2019
R Analytics
If the value of x = 25, y = 30. For all values of x, y = 30.
Any statistical model in R works properly with data.frame()

Mean is not an exact measure for predicting the exact score based on the values of the dependent
variable. If there are more no. of variables, then correlation becomes very important.
Relationship between independent variables is known as multi-collinearity.
For creating a simple regression model: lm(dependent variable ~ independent variable).
In any business, one must compare expected values with actual values. Benchmarking means
comparing against the best practices that the industry follows.
Case Analysis 12.1

Adjusted R2 is less than multiple R2
$$ F = \frac{\text{Explained variation}}{\text{Unexplained variation}} \cdot \frac{n-k}{k-1} $$
$$ F = \frac{R^2}{1-R^2} \cdot \frac{n-k}{k-1} $$
The larger the number of factors, the more important it is to understand the statistics.
Two most important functions to remember: predict() and resid().
For all the values to be predicted for weight, the corresponding heights must be put inside a.
Exercise 2: Polynomial Regression
y = 0.006757 * t + 2.306306
Add a polynomial term of degree 2:
y = intercept + (coefficient of t) × t + (coefficient of t²) × t²
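A degree-2 polynomial fit can be compared with the straight-line fit as in the sketch below. The data frame d and its values are hypothetical; I(t^2) is the standard way of adding a squared term inside an lm() formula:

```r
# HYPOTHETICAL data with a curved relationship between t and y
d <- data.frame(
  t = 1:10,
  y = c(2.1, 2.9, 4.2, 6.1, 8.0, 11.2, 14.1, 17.9, 22.2, 26.8)
)

linear <- lm(y ~ t, data = d)              # straight line
quad <- lm(y ~ t + I(t^2), data = d)       # adds the degree-2 term

print(coef(quad))                          # intercept, coefficient of t, coefficient of t^2
print(c(summary(linear)$r.squared, summary(quad)$r.squared))
print(predict(quad, newdata = data.frame(t = 11)))   # prediction for a new value of t
```

I() is needed because ^ has a special meaning inside a formula; without it, t^2 would not be treated as the arithmetic square of t.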
Multiple Regression
Total delivery time depends upon total delivery in miles as well as number of deliveries.
> plot(sds_data, col = "red", pch = 19)
Infographics was created in 1862.

One needs to know the audience in print media.
Hans Rosling looked at the social causes and created a wonderful visual.
Hans Rosling books:

1. Factfulness
2. Leap
p <- c(12,14,16,18,21) can also be written as: c(12,14,16,18,21) -> p

Functions can also be created in R.
The 3 regressions are:
1. Linear
2. Polynomial
3. Multiple
glm() is known as generalized linear model.
ts() is the function for creating the time series.
A decision tree is a model with the structure of a tree: each branch point tests a variable and each leaf gives a prediction.
Logistic Regression
Logistic regression is a predictive modelling algorithm that is used when the Y variable is binary
categorical. The goal is to determine a mathematical equation that can be used to predict the
probability of event 1. Once the equation is established, it can be used to predict the Y when only
the X's are known. Logistic regression is used for:
1. Spam detection
2. Credit card fraud
3. Health
4. Marketing
5. Banking
Logistic Regression example
From the above data, check if any value is missing (NA). Use the function is.na()
To check if the admits are distributed well enough in each category of rank.
To change the order use:
Convert rank variable from integer to factor. Using the function as.factor(rank)
A unit change in GPA increases the log-odds of getting admitted by 0.777014.

The greater the difference between the null deviance and the residual deviance, the better the model.
Predict the chances of getting admitted if GRE is 790, GPA is 3.8 and undergrad from rank 1
college.
This shows a 75.52% chance of getting admitted to the university.
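The steps above can be sketched with glm(). The admissions data used in class is not reproduced in these notes, so the data frame adm below is simulated and its coefficients and predicted probability will differ from the figures quoted; only the column names and the new applicant's values follow the notes:

```r
set.seed(42)
n <- 200

# SIMULATED stand-in for the admissions data (hypothetical values)
gre <- round(runif(n, 300, 800))
gpa <- round(runif(n, 2.2, 4), 2)
rank <- sample(1:4, n, replace = TRUE)
admit <- rbinom(n, 1, plogis(-4 + 0.003 * gre + 0.8 * gpa - 0.5 * rank))

adm <- data.frame(admit, gre, gpa, rank = as.factor(rank))  # rank converted to a factor
print(anyNA(adm))                      # check for missing values
print(table(adm$admit, adm$rank))      # spread of admits across the ranks

fit <- glm(admit ~ gre + gpa + rank, data = adm, family = binomial)
print(coef(fit))                       # coefficients are on the log-odds scale
print(c(fit$null.deviance, fit$deviance))

# Probability of admission for GRE 790, GPA 3.8, rank 1 college
new <- data.frame(gre = 790, gpa = 3.8,
                  rank = factor("1", levels = levels(adm$rank)))
prob <- predict(fit, newdata = new, type = "response")
print(prob)
```

type = "response" returns a probability; without it, predict() returns the log-odds.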
Decision tree
The function used is ctree()
library("party")
Time series
This can be used for continuous monitoring of a person's heart.
The four types of variations are:
Trend variations, seasonal variations, cyclical variations and irregular (random) variations
ts() is a function used for time series.

The function to plot the time series is plot.ts()
Changing from frequency 1 to frequency 12.
Create a vector to contain prices of a product in a year from Jan’17 to Dec’17. Create a time
series data of 12 months.
Now to do a time series from Apr to Dec
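The monthly series and the April to December subset can be sketched as below. The price values are made up for illustration; ts() and window() are the base R functions:

```r
# HYPOTHETICAL monthly prices for Jan'17 to Dec'17
prices <- c(110, 112, 115, 113, 118, 121, 125, 124, 128, 131, 135, 140)

# frequency = 12 marks the data as monthly; start = c(2017, 1) means January 2017
price_ts <- ts(prices, start = c(2017, 1), frequency = 12)
print(price_ts)

# Keep only April to December 2017
apr_dec <- window(price_ts, start = c(2017, 4))
print(apr_dec)

# plot.ts(price_ts) would draw the series as a line graph
```

With frequency = 1 the same values would be treated as yearly observations, which is the change from frequency 1 to frequency 12 mentioned above.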