Tutorial 1 - Regression

Big Data

Big-Data-Analytics-with-R-and-Hadoop

Trainer: Ts. Dr. Ahmad Anwar Zainuddin

Performing data modelling in R

Data modelling is a machine learning technique that identifies hidden patterns in a historical
dataset; those patterns then help to predict future values for the same kind of data. The technique
focuses heavily on past user actions and learns users' tastes. Many popular organizations have
adopted data modelling techniques to understand the behaviour of their customers based on their
past transactions. These techniques analyse the data and predict what customers are looking for.
Amazon, Google, Facebook, eBay, LinkedIn, Twitter, and many other organizations use data mining
to redefine their applications.

Tutorial 1: Simple Linear Regression in R: https://www.youtube.com/watch?v=66z_MRwtFJM

Objective:
• Regression: In statistics, regression is a classic technique for identifying the relationship
between two or more variables by fitting a straight line to the variable values. That relationship
helps to predict the value of a variable for future events. For example, a variable y can be modelled
as a linear function of another variable x with the formula y = mx + c. Here, x is the predictor
variable, y is the response variable, m is the slope of the line, and c is the intercept. Sales
forecasting for products or services and stock price prediction can be approached with this kind of
regression. R provides regression through the lm function, which is available in R by default.
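
To make the formula y = mx + c concrete, the following is a minimal sketch that fits a straight line with lm on made-up data. The data, the assumed slope (2) and intercept (5), and the variable names are all hypothetical and are not part of the tutorial dataset.

```r
# Minimal sketch: fit y = m*x + c with lm() on made-up data.
# All values below are hypothetical and chosen only for illustration.
set.seed(42)                      # make the random noise reproducible
x <- 1:50                         # predictor variable
true_m <- 2                       # assumed slope
true_c <- 5                       # assumed intercept
y <- true_m * x + true_c + rnorm(length(x), sd = 1)  # response with noise

fit <- lm(y ~ x)                  # least-squares straight-line fit
est <- coef(fit)                  # named vector: "(Intercept)" and "x"
print(est)

# Predict the response for a new value of the predictor
new_x <- data.frame(x = 60)
pred <- unname(predict(fit, newdata = new_x))
print(pred)
```

The estimated slope and intercept should land close to the assumed values, and the prediction for x = 60 should be close to 2 * 60 + 5 = 125.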

Method:
1. Before you start coding, install some packages: go to Package > Install, as
shown in Figure 1.

Figure 1 : Install Packages

2. Go to the Packages section, type the following package names, and install them as shown in
Figure 2.
i. ggplot2
ii. plyr
iii. shiny
iv. Rpubs
v. devtools
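
The same installation can also be done from the console. The sketch below only checks which packages are already available and leaves the actual install call commented out. "Rpubs" is left out on purpose, since it is an assumption here that it may not be installable under that exact name; the variable names are illustrative.

```r
# Packages named in the list above (package names are case-sensitive).
# "Rpubs" is omitted: it is not certain that a package with that exact name exists.
pkgs <- c("ggplot2", "plyr", "shiny", "devtools")

# requireNamespace() returns TRUE when a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing (needs an internet connection)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```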
Figure 2: ggplot2 Installing Package

3. Import the dataset from this link:

http://www.mediafire.com/file/nayf5x3fz208wm8/BigData-with-R-LungCapData.zip/file

Go to File > Import Dataset > From Excel as shown in Figure 3. The dataset of
“LungCapData” is imported and displayed on the screen as shown in Figure 4.
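
As a rough illustration of importing from code instead of the menu, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back with read.delim. The three data rows are made up, and it is an assumption that the downloaded file is tab-delimited text; only the column names come from the dataset.

```r
# Hypothetical three-row sample; only the column names match the real dataset
sample_lines <- c(
  "LungCap\tAge\tHeight\tSmoke\tGender\tCaesarean",
  "6.5\t6\t62.1\tno\tmale\tno",
  "10.1\t18\t74.7\tyes\tfemale\tno",
  "9.6\t16\t69.7\tno\tfemale\tyes"
)
path <- tempfile(fileext = ".txt")   # stand-in for the path to the real file
writeLines(sample_lines, path)

# read.delim() reads tab-separated text; header = TRUE takes names from the first row
dat <- read.delim(path, header = TRUE)
str(dat)
```

For the real data, path would be replaced with the location of the extracted file.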

Figure 3: Import Dataset


Figure 4: The LungCapData is imported

4. Type the following code:

> #Import the LungCapData file (a file chooser dialog opens)
> LungCapData <- read.delim(file.choose(), header=TRUE)
> #Attach the data so columns can be referenced by name
> attach(LungCapData)
> #Open the data in the viewer
> View(LungCapData)
> #Check the column names
> names(LungCapData)
[1] "LungCap" "Age" "Height" "Smoke" "Gender" "Caesarean"
> #Check the type of the Age variable
> class(Age)
[1] "numeric"
>
> #To draw a scatterplot of LungCap (y-axis) against Age (x-axis)
> plot(Age, LungCap, main="Scatterplot")
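> #To calculate the correlation between Age and LungCap
> cor(Age, LungCap)
[1] 0.8196749
>
> #To open the help page for lm
> help(lm)
>
> #To fit the linear regression of LungCap on Age
> mod <- lm(LungCap ~ Age)
> #To add the fitted regression line to the scatterplot (mod must exist first)
> abline(mod)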
> summary(mod)

Call:
lm(formula = LungCap ~ Age)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7799 -1.0203 -0.0005  0.9789  4.2650

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.14686    0.18353   6.249 7.06e-10 ***
Age          0.54485    0.01416  38.476  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.526 on 723 degrees of freedom
Multiple R-squared:  0.6719, Adjusted R-squared:  0.6714
F-statistic:  1480 on 1 and 723 DF,  p-value: < 2.2e-16
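
Two of the numbers in this summary can be cross-checked by hand. In a regression with a single predictor, R-squared equals the squared correlation and the F-statistic equals the squared t value of the slope. The sketch below uses only values printed in the transcript; the variable names are illustrative.

```r
# Values copied from the console output in this tutorial
r     <- 0.8196749   # cor(Age, LungCap)
t_age <- 38.476      # t value of the Age coefficient

# With one predictor, R-squared is the squared correlation
r_squared <- r^2
print(round(r_squared, 4))

# With one predictor, the F-statistic is the squared t value of the slope
f_stat <- t_age^2
print(round(f_stat, 1))
```

Both results agree, up to rounding, with the Multiple R-squared and F-statistic lines of the summary.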

> #To list the components stored in the fitted model object
> attributes(mod)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
[8] "df.residual" "xlevels" "call" "terms" "model"

$class
[1] "lm"

> #To extract the model coefficients (three equivalent ways)
> mod$coefficients
(Intercept)         Age
  1.1468578   0.5448484
> mod$coef
(Intercept)         Age
  1.1468578   0.5448484
> coef(mod)
(Intercept)         Age
  1.1468578   0.5448484
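
These coefficients define the fitted line LungCap = 1.1468578 + 0.5448484 * Age. As a small sketch, the line can be evaluated by hand for a few ages. The helper predict_lungcap and the ages chosen are hypothetical and only for illustration.

```r
# Coefficients copied from the output above
intercept <- 1.1468578
slope     <- 0.5448484

# Hypothetical helper: evaluate the fitted line for one or more ages
predict_lungcap <- function(age) intercept + slope * age

ages <- c(5, 10, 15)   # illustrative ages, not taken from the dataset
print(predict_lungcap(ages))

# With the fitted model object available, the equivalent call would be:
# predict(mod, newdata = data.frame(Age = ages))
```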

> #To show the analysis of variance table for the model
> anova(mod)
Analysis of Variance Table

Response: LungCap
           Df Sum Sq Mean Sq F value    Pr(>F)
Age         1 3447.0  3447.0  1480.4 < 2.2e-16 ***
Residuals 723 1683.5     2.3
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
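
The entries in this table are linked by simple arithmetic, which the sketch below reproduces from the printed sums of squares and degrees of freedom. The inputs are rounded console values, so the results match the table only approximately; the variable names are illustrative.

```r
# Values copied from the analysis of variance table
ss_age   <- 3447.0   # sum of squares explained by Age (1 degree of freedom)
ss_resid <- 1683.5   # residual sum of squares
df_resid <- 723      # residual degrees of freedom

# Mean square = sum of squares / degrees of freedom
ms_age   <- ss_age / 1
ms_resid <- ss_resid / df_resid

# F value = mean square of the predictor / residual mean square
f_value <- ms_age / ms_resid

# Residual standard error = square root of the residual mean square
rse <- sqrt(ms_resid)

print(round(c(ms_resid = ms_resid, f_value = f_value, rse = rse), 3))
```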
