
GETTING STARTED IN R

1. Getting and setting the working directory


Check where R currently reads and saves files: getwd()
Set the working directory with setwd("...") (use forward slashes / in the path).
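For example, a minimal sketch (the folder path is only a placeholder):

getwd()                         # print the current working directory
setwd("C:/Users/me/Documents")  # set it; note the forward slashes, even on Windows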

2. Objects:
Objects are assigned values using the <- operator (although = also works)
There are four modes: numeric, character, logical, complex.
To check the mode of an object: mode(name_object)
Objects consisting of multiple values of the same mode are called vectors. Vectors are
usually created with the c() function. If a vector is built from values of different modes, all
elements are coerced to a single common mode (for example, mixing numbers and characters
gives a character vector).
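A short sketch of assignment, modes and coercion:

x <- c(1, 2, 3)        # numeric vector
mode(x)                # "numeric"
y <- c(1, "a", TRUE)   # mixed modes are coerced to a common one
mode(y)                # "character"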

3. Managing the Workspace


ls: lists objects in workspace
save.image: saves the entire workspace as a .RData file (save.image("MyWorkspace.RData"))
save: saves a given object as a .RData file (save(z,file="z.RData"))
load: loads .RData files (load("MyWorkspace.RData"))
You can use rm to delete objects: rm(z)

4. Dealing with missing data


Be careful with missing data (NA). Many functions can handle them, but not all functions
treat them in the same way (some return NA unless told to remove the missing values).
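For instance, a small illustrative sketch:

z <- c(3, NA, 5)
mean(z)                # NA, because of the missing value
mean(z, na.rm = TRUE)  # 4; the NA is removed before averaging
is.na(z)               # FALSE TRUE FALSE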

5. Functions involving random numbers


Functions that generate random numbers produce different values each time they are
executed. To reproduce the same random numbers from the same function, set the random
seed first with set.seed().
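A quick sketch:

set.seed(123)   # fix the seed so the results are reproducible
rnorm(3)        # three random values from a standard normal
set.seed(123)
rnorm(3)        # the same three values again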

6. User-defined functions
function.name <- function(arguments) {
  # computations on the arguments and some other code
}
For recursion in R, use Recall() inside the function definition to avoid problems (the recursive
call still works even if the function is renamed).
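A minimal sketch of a recursive function using Recall() (the factorial example is only an
illustration):

fact <- function(n) {
  if (n <= 1) return(1)
  n * Recall(n - 1)   # Recall refers to the current function, even if it is renamed
}
fact(5)               # 120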

7. Loading data from text files


Usually file extensions .csv, .txt or .dat.
Two ways:
1. Click on Import Dataset and navigate to the folder containing the file. Check for
correctness of the settings and click on import.
2. From the command line:
- Using read.table: reads a file in table format (see ?read.table); set
stringsAsFactors = TRUE to convert character columns to factors
- Using read.csv: identical to read.table except for the defaults
It is highly advisable to have a look at the data before importing them!!
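For example, a hedged sketch (the file name "bikes.csv" is only a placeholder):

bikes.data <- read.csv("bikes.csv", stringsAsFactors = TRUE)
# equivalent call with the main defaults written out:
bikes.data <- read.table("bikes.csv", header = TRUE, sep = ",",
                         stringsAsFactors = TRUE)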
8. Exporting data from R

Native R files (.RData files) are stored on the computer and can be directly loaded in R later
using the command load:
load("bikes data.RData")

9. Basic Data management


- Number of rows and columns: dim(bikes.data)
- Brief description of the variable types: str(bikes.data)
- Print first three rows: head(bikes.data,3)
- Print last three rows: tail(bikes.data,3)
- Load an example data set: data("iris")
- Variable (column) names: names(iris)
- You can refer to columns within a data frame using the “$” symbol.
- Using the command attach, you add the data set to R's search path so that you can
refer to its columns directly by name. When you are done with a data set, use
detach(). Be careful when attaching multiple data sets, because they can have the
same column names. The attached columns are copies, separate from the data set
they were derived from; to modify the data set itself, refer to it with $.
- Obtain a frequency table of a column: table(name_column)
- Add margins to a table (row and column sums): addmargins(name_table)
- Use proportions (percentages) instead of counts: prop.table(name_table, margin = 1)
by row, margin = 2 by column, or omit margin for the whole table (see the sketch below).
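A small sketch with the built-in iris data set:

data("iris")
dim(iris); str(iris); head(iris, 3)
tab <- table(iris$Species)
addmargins(tab)    # adds the total
prop.table(tab)    # proportions over the whole table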

10. Factors
Factors in R refer to categorical variables.
Columns containing non-numerical values are automatically considered factors by R.
Levels of a factor refer to the unique categories of the variable: levels(name_column).
The function factor converts a variable into a factor. If the data contain categorical variables
that have been coded using numbers, then it is essential to convert these to factors before
running any statistical analyses: gender.new <- factor(gender) or bikes.data$transport <-
factor(bikes.data$transport)
It is not essential to rename the levels to text rather than numbers, but it often helps to avoid
confusion: levels(gender.new) <- c("Male","Female")
Replace the first value in gender.new (Male) by Female: gender.new[1] <- "Female"; a value can
only be replaced by an existing category/level.
Automatic conversion to factor can be forced by setting the argument as.is = FALSE in the
function read.csv().
Factor to numeric values: as.numeric(as.character(x)). If the factor contains any values which
cannot be converted to a number these will be set to be NA, and a warning will be printed.
Creating your own categorical variable from the data: daytime <- (hour > 6) & (hour < 20);
fdaytime <- factor(daytime)
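A short sketch, assuming gender is a numeric code (1/2) and hour a numeric column; both are
only illustrative:

gender <- c(1, 2, 2, 1)               # hypothetical numeric coding of a categorical variable
gender.new <- factor(gender)
levels(gender.new) <- c("Male", "Female")
gender.new[1] <- "Female"             # allowed: "Female" is an existing level
hour <- c(3, 10, 15, 22)              # hypothetical hours of the day
fdaytime <- factor((hour > 6) & (hour < 20))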

11. Exploratory Data Analysis


Summary statistics: summary(name_data); summary() is also used throughout R to show the
main output after fitting models.
Table: table(name_column) provides a way of tabulating the frequencies of one or two
categorical variables (factors).
tapply provides a way of summarising a numeric variable that has been split into groups using
a categorical variable: tapply(numerical_column, categorical_column, function); for one
continuous variable and two factors (creating a table):
tapply(numerical_column, list(categorical_column1, categorical_column2), function)
Producing graphs:
For numerical variables
- Histogram: shows the frequency of the variable in different intervals of the data,
used to look at the distribution of a numerical variable - hist(column_name)
- Scatterplot: graphical representation of the data showing the relationship between
two quantitative variables (positive or negative correlation, or no correlation) -
plot(column_name1, column_name2) or plot(column_name1 ~ column_name2,
data = name_data); for two continuous variables and one factor: plot(column_name1,
column_name2, col = c("blue","red","green")[categorical_column_name]) or
plot(column_name1, column_name2, col = categorical_column_name). Useful
arguments: pch (point shape), cex (point size), xlab, ylab, main, col = "", xlim/ylim
(c(), axis ranges). Add points to an existing plot with points(), join points with
lines(), arrange several plots with par().
- Boxplot: shows the distribution and dispersion of the data and helps identify
atypical values (outliers). The line in the box's centre is the median, which divides
the sorted data into two equal parts; the box contains the middle 50% of the data,
between Q1 (25%) and Q3 (75%), the interquartile range; points outside the
whiskers are potential outliers - boxplot(column_name1 ~ column_name2) (see the
sketch below).
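A small sketch with the iris data (the variable choices are only illustrative):

data("iris")
hist(iris$Sepal.Length)                       # distribution of one numeric variable
plot(iris$Sepal.Length, iris$Petal.Length,
     col = c("blue", "red", "green")[iris$Species],
     pch = 19, xlab = "Sepal length", ylab = "Petal length")
boxplot(Sepal.Length ~ Species, data = iris)  # numeric variable split by a factor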

For categorical variables


- Barplot: useful to compare different categories; represents the count of each
category of the categorical variable in the data set - barplot(table(housing), col="red") or
barplot(values/frequencies, names.arg = categories)
Saving graphs: go to Plots pane > Export
There exist specialised R add-on packages for graphics which offer additional functionality
and different styles for producing graphs in R. The most popular nowadays are the lattice
package and the ggplot2 package (based on the grammar of graphics), together with related
packages forming a ggplot2 ecosystem.

Basic Statistics:
The syntax for statistical modelling is generally very consistent, regardless of the type of
model/analysis being used: modellingfunction(myform, data = datasetname)
And the formula is generally of the form: myform <- responsevariable ~ explanatoryvariables
- T-test: a statistical test used to determine whether there is a difference between
the means of two groups. The test provides a T-value and a P-value. The T-value
measures the difference between the two group means relative to the variability in
the data; a large T-value suggests a clear difference in means. The null hypothesis
asserts that there is no real difference between the groups (any observed difference
is due to chance), and the alternative hypothesis asserts that there is a real
difference. The P-value is the probability of observing a difference at least as large
as the one in the data if the null hypothesis were true. If the P-value is low (less
than 0.05), the observed difference is unlikely to be due to chance alone, so we
reject the null hypothesis and accept the alternative (the smaller the p-value, the
stronger the evidence). If the P-value is high (more than 0.05), we do not have
sufficient evidence to reject the null hypothesis, and it is retained. - t.test(myform,
data = name_data). pt() calculates the cumulative probability of a Student's t
distribution; it receives as arguments the value x (point of interest), the degrees of
freedom df and optionally lower.tail, which indicates whether the probability is
calculated in the lower tail (TRUE by default). The cumulative probability of a t
distribution is the probability that a random variable following a Student's t
distribution takes a value less than or equal to x.
- One-way ANOVA: used to compare the means of three or more groups and
determine whether at least one of them differs significantly from the others. -
aov(result ~ group, data = name_data); use summary() on the aov object to
summarise the result. The summary reports the F statistic and the P-value. The F
statistic is the ratio of the variability between groups to the variability within
groups; a high F-value makes it more likely that at least one group has a different
mean. If the P-value is low, it indicates that at least one group mean differs from
the others. If the P-value is high, there is not sufficient evidence to conclude that
there are significant differences between the groups. - summary(aov(wtgain ~
diet, data = turkey)) or anova(lm(wtgain ~ diet, data = turkey)).
- Two-way ANOVA: an extension of one-way (one-factor) ANOVA; it evaluates the
effect of two categorical variables (factors), and their interaction, on a continuous
response. - aov(wtgain ~ diet + housing + diet:housing, data = turkey) or aov(wtgain ~
diet*housing, data = turkey), summary(aov(wtgain ~ diet*housing, data = turkey))
- Linear regression: performs a linear regression analysis to evaluate the relationship
between a response/dependent variable (y) and one (simple linear regression) or
more explanatory/independent variables (x), which may be a mix of categorical
and numeric variables (multiple linear regression). Fit the linear regression model
with lm(dependent ~ independent1 + ..., data = name_data), then interpret the
results with summary(model_created). A low P-value indicates a relationship
between x and y. R-squared indicates how much of the variability of the dependent
variable is explained by the model (the closer to 1, the better). The F-statistic tests
whether any of the independent variables has a significant effect on the dependent
variable. We can extract the ANOVA table for a linear regression using anova:
anova(name_model). We can produce graphical diagnostics, to assess how well the
model fits, using plot: plot(name_model): residuals vs fitted, normal Q-Q,
scale-location and residuals vs leverage. (A combined sketch of these modelling
calls follows this list.)
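A combined sketch of the modelling calls above. The two numeric samples are hypothetical;
the commented lines assume the turkey data set used elsewhere in the notes:

x1 <- rnorm(20, mean = 5); x2 <- rnorm(20, mean = 6)    # hypothetical samples
t.test(x1, x2)                                          # two-sample t-test
# summary(aov(wtgain ~ diet, data = turkey))            # one-way ANOVA
# summary(aov(wtgain ~ diet * housing, data = turkey))  # two-way ANOVA with interaction
# model <- lm(wtgain ~ diet + housing, data = turkey)   # linear regression
# summary(model); anova(model); plot(model)             # interpretation and diagnostics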

12. Add-on packages


Packages are essentially collections of specialised R functions that extend the basic analysis and
data management capabilities of R. Over 10,000 are available; some popular ones are:
- Tidyverse adds a set of packages sharing an underlying design philosophy,
grammar, and data structures
- dplyr and data.table to easily manipulate data
- stringr to manipulate strings
- zoo to work with regular and irregular time series
- ggplot2, plotly, lattice, and rCharts to visualise data
- caret and tidymodels for machine learning, predictive modelling
- shiny to build web apps and dashboards based on R code
- knitr, an engine for dynamic report generation that enables integration of varied input
languages (e.g., R, Python, and shell scripts) into varied markup languages (e.g.,
LaTeX, HTML, Markdown, AsciiDoc, and reStructuredText)
Most of them are on CRAN (https://cran.r-project.org), but R-Forge and Bioconductor are also
popular repositories. Locate and install a package (CRAN is the default repository, but this can
be changed): go to the Packages pane > Install, or with commands:
install.packages("MASS", repos="http://cran.us.r-project.org") Or install.packages("MASS")
Or also installing the countreg package from R-forge: install.packages("countreg",
repos="http://R-Forge.R-project.org")
Installing packages from local files: install.packages("c:/path/to/location/packagename.zip",
repos = NULL)
And load the package into the R session: library(MASS)
You can remove a package from the R session: detach(package:MASS)

13. Data Subsetting


Square brackets allow us to extract elements from a vector:
x <- c(7,5,8,9,9,1)
x[2] # access the element at index 2 (indexing begins at 1)
x[3:5] # return the elements from index 3 to 5
x[-4] # remove the element at index 4; several can be removed with a range, e.g. x[-(3:5)]

Square brackets [] can also be used to subset by logical values:


y <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
x[y]
[1] 7 9 1

This also allows us to subset based on whether particular conditions apply:


x[x > 7]
[1] 8 9 9

Square brackets [ , ] can also be used to extract rows and columns from a data frame, or
ranges of them using : .
x <- data.frame(my = c(1,2), yours = c("a", "b"))
Extract columns using their names:
names(name_data) # extract name of columns of the data
name_data$name_column

Also:
new <- turkey[turkey$initwt > 25,] # select the rows where initwt > 25
new <- turkey
new$wtgain[new$initwt > 25] <- NA # set wtgain to NA in the rows where initwt > 25
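A short sketch with the small data frame defined above:

x <- data.frame(my = c(1, 2), yours = c("a", "b"))
x[1, ]            # first row, all columns
x[, "yours"]      # the column named "yours"
x$my              # the same column accessed with $
x[x$my > 1, ]     # rows where my > 1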

OVERVIEW BASIC STATS AND PROBABILITY

1. Introduction

Data

Data Structures

- Structured, like a spreadsheet (excel)


- Unstructured, text, images, …

Data Types

- Categorical data (factors): variables acting as labels/categories:


nominal, when the categories don’t have an order and are exclusive, and ordinal,
when they have an order.
- Continuous data: variables that can take any value in a certain range. These
variables are measured on a continuous scale and may not be exact, decimal
values. Denote size information or amount of some quantity, and usually have
units (e.g. meters, grams)
- Discrete data: variables that can only take specific and distinct values, without the
possibility of intermediate values within a range. These values are countable and
generally represent counts. They cannot take fractional or decimal values, limited
by integer values (int).

Data typically arranged in rectangular shape:

- Each row represents a sample (subject)


- Each column has information on the sample (variable, feature, measurements)

Data pre-processing:

- Cleaning: correcting and/or removing inaccurate or missing records.


- Integration: combining several sources of data into one column, spreadsheet or
database.
- Reduction: consolidating categories to reduce the number of attributes.
Types of analyses/methods:

Parametric (model-based)

- Use known models/probability distributions (e.g. normal curve) to obtain


probabilities, estimates, make inferences
- In statistical testing: more powerful when the correct distribution is chosen
- In modelling: make assumptions about the functional form/shape of a relationship
(e.g. linear regression); generally require estimating fewer parameters and less data
to work, but give poor predictions if the assumed shape is far from the actual one

Non-parametric

- Do not depend on a particular distribution


- Still assume variables or groups have the same distribution, though it is unknown
- Generally more powerful than parametric methods when those would assume the
wrong distribution
- Do not impose an explicit functional form (e.g. curve fitting by smoothing splines);
flexible, but require more data and carry a risk of overfitting

Descriptive and inferential statistics:

- Descriptive: descriptive statistics focus on summarizing and describing the
fundamental characteristics of a data set. These statistics provide an overview of
the data and help understand its structure and distribution: measures of central
tendency (mean, median, mode), measures of dispersion or variability (standard
deviation, range, interquartile range), measures of the shape of the distribution
(symmetry, kurtosis) and descriptive graphics (histograms, scatter plots,
boxplots, etc.).
- Inferential/modelling: used to make inferences (generalizations) or conclusions
about a population based on information collected from a sample. This involves
making decisions or statements about a population based on evidence from a
sample: hypothesis testing, confidence intervals, regression and analysis of
variance, and predictions and models.

Descriptive statistics help you understand and summarize your data, while inferential statistics
allow you to draw conclusions beyond your sample.

Most common statistical inference paradigms:

Frequentist Paradigm:

- The parameters are seen as fixed and unknown.


- Use probabilities to investigate what might happen with repeated data.
- It employs Maximum Likelihood Estimation (MLE) to estimate the parameters.

Bayesian paradigm:

- The parameters are viewed as random variables with a priori probability


distributions based on prior knowledge.
- It uses the data to update beliefs and obtain posterior distributions for the
parameters.
- The inference is based on the posterior distributions of the parameters.

In short, the frequentist views parameters as fixed and uses probabilities to analyze repeated
collections of data. The Bayesian views parameters as random variables and uses prior
knowledge along with the data to update beliefs and obtain posterior distributions.

Exploratory data analysis (EDA):

It consists of the process of visualization, summary and initial understanding of a data set.
Helps you gain a deep understanding of the data before performing more advanced analysis or
modeling.

Some common steps performed:

1. Statistical Summary: Calculate descriptive statistics such as mean, median, standard


deviation, etc., to understand the central characteristics and dispersion of the data.
2. Data Visualization: Create graphs such as histograms, scatter plots, box plots, etc., to
explore the distribution and relationships between variables.
3. Handling Outliers: Identify and address outliers that may affect subsequent analyses.
4. Correlation Analysis: Determine relationships between variables and understand how
they relate to each other.
5. Exploring Patterns or Trends: Look for temporal, seasonal or cyclical patterns in the
data.
6. Segmentation or Clustering: Group similar data to identify subgroups or segments.
7. Frequency Analysis: Count the frequency of different categories or values in a
categorical variable.
8. Multivariate Visualization: Explore complex relationships between multiple variables at
once.
9. Principal Component Analysis (PCA): Dimensionality reduction to visualize high-
dimensional data.
10. Data Preparation: Deal with missing or null values, normalization of variables if
necessary, etc.
11. Missing data: include deletion, imputation and other schemes
12. Transforming data: raw data may be on different scales or not meet model
assumptions (see the sketch after this list):
- Normalization: convert the numerical variables to a common scale, e.g. from 0 to 1
- Standardization: convert data so that the numerical mean is zero with a standard
deviation of one
- Log transformation: convert a skewed distribution into a more normal-shaped one;
or use a Fourier transform to change from the time domain to the frequency domain
13. Categorization (binning): converting continuous data into bins/categories, also
known as "discretization": e.g. ages grouped into ranges such as "young", "adults" and
"seniors".
14. Dummy coding: there are times when nominal (categorical) data must be converted to
numerical values, such as 0 and 1; this is called dummy coding, so that the data can be
processed by algorithms and analysis models.
15. Dimension reduction: at times variables can be grouped together to simplify a model,
simplifying the data set while retaining as much relevant information as possible.
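A small sketch of some of these transformations (the variable names and values are only
illustrative):

x <- c(2, 5, 9, 40)
(x - min(x)) / (max(x) - min(x))    # normalization to the 0-1 range
scale(x)                            # standardization: mean 0, standard deviation 1
log(x)                              # log transformation of right-skewed data
age <- c(15, 34, 70)
cut(age, breaks = c(0, 18, 65, Inf),
    labels = c("young", "adults", "seniors"))    # binning / discretization
model.matrix(~ factor(c("a", "b", "a")))         # dummy (0/1) coding of a factor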

Summary plots

- Histogram (hist(name_variable)): shows the distribution of the data; look for
skewness, outliers and multimodality. INTERPRETATION: Shape of the distribution:
observe the general shape of the histogram; it can be symmetrical, skewed to the
right (positively skewed), skewed to the left (negatively skewed), or have other
specific shapes. Modes (peaks): the peaks in the histogram represent the "modes",
the values that occur most frequently. Dispersion and kurtosis: dispersion is related
to the extent or width of the distribution (a wider histogram indicates greater
dispersion); kurtosis describes the shape of the tails of the distribution (whether
they are pointed or flat). Outliers: observe whether there are values very far from
the rest of the data. -> Apply a logarithmic transformation when there is
exponential growth, to linearize the relationship, or to stabilize/reduce the variance
in right-skewed data and improve symmetry; or take square roots when the data are
non-negative, to stabilize/reduce the variance in right-skewed data.
- Boxplot (boxplot(name_variable)): the box contains 50% of the data. The whiskers
stretch out to where most of the data would be expected. Points outside the whiskers
are potential outliers.

- Plot (plot(name_variable)): plot is a generic function that generates different outputs
depending on its arguments.
• To identify specific data in x: axis(1, at=c(12,36,60,84,108,132,156)) # add
tick marks at defined locations on the horizontal axis
• Vertical lines to see it even better:
abline(v=c(12,36,60,84,108,132,156),col="green", lty="dotted") # v draws
vertical lines, h draws horizontal lines
• Relationships among variables: If both variables are continuous
scatterplots are useful: plot(name_variable1, name_variable2)
• One numerical variable, one categorical variable:
plot(windsp,winddir,pch='|'). Add labels to the categorical variable:
plot(windsp,winddir,pch='|',yaxt="n") #don’t include y-axis tickmarks
axis(2,at=c(1,2,3,4),labels=levels(winddir)) #add labels to plot
- Summary statistics (summary(name_data)):
• Mean: arithmetic average, measure of the centre of the data
• Median: half-way value, 50% of observations are less than it, 50% are
bigger. Less influenced by outliers or skewness
• Minimum and maximum: the smallest and largest values respectively
• Lower and upper quartiles: 25% of values are below (above) these values
- Measures of variability: Standard deviation (sd) , variance (var), interquartile range
(IQR) (upper quartile – lower quartile).
- Coefficient of variation (CV): a measure of the relative variability of a statistical
distribution; it expresses the dispersion as a percentage of the mean. The higher the
CV value, the greater the variability compared to the mean. If the CV is low, the data
are relatively clustered around the mean (%).
CV <- 100*sd(name_variable)/mean(name_variable)
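A short sketch of these summaries on one numeric variable (the vector is only illustrative):

x <- c(12, 15, 9, 22, 30, 14)
summary(x)                    # min, quartiles, median, mean, max
sd(x); var(x); IQR(x)         # measures of variability
CV <- 100 * sd(x) / mean(x)   # coefficient of variation, in %
CV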

Exploring Data – Relationships among variables:

pairs(): visualize relationships between multiple variables in a data set. Each cell in the matrix
generated by pairs() represents a scatter plot showing the relationship between two variables.
The main diagonal contains histograms of each individual variable.

INTERPRETATION:

Positive or Negative Relationship: If the points tend to form an ascending line or curve, it
indicates a positive relationship. If they tend to form a descending line or curve, it indicates a
negative relationship.

Dispersion: The dispersion of the points around the trend line indicates the strength of the
relationship between the two variables. A narrower dispersion indicates a stronger
relationship.

Outliers: you can identify outliers, i.e. values that deviate from the general pattern.

Shape of the Distribution: The histograms on the main diagonal show the univariate
distribution of each variable. You can see if they follow a normal distribution or if they show
some type of skew.

pairs(whr[c(4,5,6,8)], col=rain) # scatterplot matrix of the selected variables, coloured by the
levels of the factor rain

Correlation coefficient r: measures the strength of a linear relationship (cor(column1, column2))

- Close to -1: Strong negative relationship


- Between -1 and 0: Negative relationship
- Near 0: Little relationship
- Between 0 and 1: Positive relationship
- Close to 1: Strong positive relationship

Subsetting data: extract a subset of rows or columns from a data set based on certain
conditions - subset(name_data, condition, select = c(columns), ...)
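A small sketch with the iris data (the variable choices are only illustrative):

data("iris")
pairs(iris[1:4], col = iris$Species)           # scatterplot matrix coloured by a factor
cor(iris$Sepal.Length, iris$Petal.Length)      # correlation coefficient r
subset(iris, Sepal.Length > 7, select = c(Sepal.Length, Species))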

2. Probability and distributions/models

Probability (Prob): a number between 0 (impossible) and 1 (certain) measuring the uncertainty
of an event.
Full table of probabilities: addmargins(prop.table(table(name_column)))

The normal distribution:

Bell shape: a peak in the center and tails extending to both sides.

Mean and Standard Deviation: The normal distribution is determined by its mean (μ) and its
standard deviation (σ). The mean determines the location of the peak and the standard
deviation controls how wide or narrow the bell is.

Symmetry: It is symmetrical about the mean.

Probabilities here correspond to the area under the curve.

For μ = 0, σ = 1: standard normal distribution

Comparison between relative frequencies and theoretical probabilities:

In practice: based on observed data, calculate the proportion of values that meet a
specific condition.

prop.table(table(x< 1.5))

FALSE TRUE

0.057 0.943

Theoretically: use the cumulative distribution function (CDF) of a standard normal
distribution to calculate the probability that a random variable following a normal distribution
with a mean of 0 and a standard deviation of 1 is less than or equal to 1.5.

pnorm(1.5,mean=0,sd=1)

0.9332

Standardisation of normal variables: Useful transformation to homogenise variability and


centre when dealing with multiple variables measured on diverse scales. Makes them more
comparable, gives them analogous weight in the analysis.

Quantiles: For the standard normal distribution we might denote these by z0.25, z0.50 and
z0.75 - z25 <- qnorm(.25,mean=0,sd=1)

X ~ N(μ, σ²), Prob(μ − 1.96σ < X < μ + 1.96σ) = 0.95

R functions for normal distributions:

- rnorm(n, μ, σ) to generate n random values from a Normal(μ, σ²) distribution
(use set.seed(num) first for reproducibility)
- pnorm(k, μ, σ) to find the probability that X is less than or equal to k when
X ~ N(μ, σ²)
- qnorm(p, μ, σ) to find the quantile Zp such that Prob(X < Zp) = p, where
X ~ N(μ, σ²)
- dnorm(x, μ, σ) to find the height of the density (shape) function at X = x,
X ~ N(μ, σ²) (see the sketch below).
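A quick sketch of the four functions for a standard normal distribution:

set.seed(1)
rnorm(3, mean = 0, sd = 1)      # three random values
pnorm(1.5, mean = 0, sd = 1)    # P(X <= 1.5), about 0.933
qnorm(0.25, mean = 0, sd = 1)   # the 25% quantile, about -0.674
dnorm(0, mean = 0, sd = 1)      # density height at x = 0, about 0.399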

Discrete distributions: applies to random variables that can take a finite or countably infinite
set of specific and distinct (discrete) values.

Two common distributions:

- Binomial model: models the number of successes in a fixed number of trials


(experiments with two possible outcomes: success or failure), each with the same
probability of success. (n, num trials and p, probability of success)
• dbinom(x, size, prob) returns the probability mass function (PMF) of the
binomial distribution. In other words, calculate the probability of getting
exactly x successes on size trials, where the probability of success on each
trial is prob.
• pbinom(x, size, prob) returns the cumulative distribution function (CDF) of
the binomial distribution. In other words, it calculates the probability of
obtaining x or fewer successes on size trials, where the probability of
success on each trial is prob.
- Poisson model: models the number of events (usually rare events) that occur in a
fixed, continuous interval of time or space. (lambda, average event rate per
interval).
• dpois(x, lambda) returns the probability mass function (PMF) of the
Poisson distribution. In other words, it calculates the probability of getting
exactly x events in an interval of time or space, where the average rate of
events per interval is lambda.
• ppois(x, lambda) returns the cumulative distribution function (CDF) of the
Poisson distribution. In other words, it calculates the probability of getting
x or fewer events in an interval of time or space, where the average rate of
events per interval is lambda.
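A small sketch of the binomial and Poisson functions (the parameter values are only illustrative):

dbinom(3, size = 10, prob = 0.5)   # P(exactly 3 successes in 10 trials)
pbinom(3, size = 10, prob = 0.5)   # P(3 or fewer successes in 10 trials)
dpois(2, lambda = 4)               # P(exactly 2 events when the rate is 4 per interval)
ppois(2, lambda = 4)               # P(2 or fewer events)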

Sampling, estimation and confidence intervals:

Population: the total group of interest, the target of the data analysis we want to draw
conclusions about, not possible to measure everyone.

Sample: affordable subset of the population, selected by some sampling method (typically
some random procedure to avoid biases).

Sample unit: every single element (individual, object, ...) being studied through its variables
(data/values). The size of the sample is the number of sample units chosen.
The variability of the sample means is much smaller than the variability in the population.

Standard Error (SE) = SD of the sample mean = SD/square root(n)

In real life: population SD unknown, estimated from data (sd(samples)):

SE = sample SD/square root(n)

(sample mean ± 1.96·σ/√n) is called a 95% confidence interval for the population mean μ. In
general, for any parameter θ with a standard error SE(θ), an approximate confidence interval
will be:

Estimate of θ ± 2*SE(θ)
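A short sketch computing an approximate 95% confidence interval for a mean (the sample is
only illustrative):

set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)     # hypothetical sample of size 50
se <- sd(x) / sqrt(length(x))         # estimated standard error of the mean
mean(x) + c(-1.96, 1.96) * se         # approximate 95% confidence interval
t.test(x)$conf.int                    # exact t-based interval, for comparison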

Significance testing:

Hypothesis test: a way to statistically measure the support the data provide to a research idea
(hypothesis). It is used to make decisions about a statement or hypothesis about a population
parameter, based on the evidence provided by a data sample.

- Null Hypothesis (H0): it is easier to disprove than to prove an idea, so we state it
in the negative and see whether we can reject that statement
- Alternative Hypothesis (HA): the reverse of the null hypothesis.
- Test statistic: a quantity derived from the sample used to perform the test.
- P-value and cutoff value: the cutoff is a probability, usually called alpha and set at
α = 0.05. The p-value measures the probability of obtaining a value of the test
statistic as extreme as, or more extreme than, the observed one if the null
hypothesis were true.

Two types of significance:

- Statistical significance: says that what we have found in the sample is likely to
happen in the population of interest
- With large sample sizes (e.g. 10,000 patients), we can be confident of statistically
detecting every tiny relationship or difference in our sample, but …
- “Real-world” significance: says that what we have found statistically is actually
meaningful, relevant in the context of application

Variables and effects:

- Independent variable: a variable which can be controlled; also called a predictor or
attribute.
- Dependent variable: an outcome variable, with the independent variables
controlled as inputs.
- Effect: the impact of the independent variable on the dependent.
- Effect size: measures the strength of relationship or amount of group difference
found in standardized ways.

Types of errors:

- Type I: rejecting the null hypothesis when it’s true (false positive)
- Type II: accepting the null hypothesis when it’s false (false negative)
- Multiple testing problem: need to control the overall error rate when performing
multiple comparisons or statistical tests to avoid incorrect conclusions due to
significant results obtained by chance → Bonferroni, Benjamini and Hochberg’s
false discovery rate control, …
- Power: the probability of correctly rejecting the null hypothesis. Related to sample
size, effect size, significance level, … → power calculations to design experiments

General hypothesis testing procedure:

1. Define the null and alternative hypotheses, H0 and HA


2. Select an appropriate test statistic (typically: signal/noise)
3. Calculate the value of the test statistic for your data
4. Derive the probability of getting a value for the test statistic as extreme, or more
extreme, than the value generated by your data, assuming H0 is true. This is the p-
value of the test
5. Compare the p-value with a pre-established significance threshold, typically 0.05

A schematic guide to basic statistical tests:

One sided vs two-sided tests:

- A two-sided test is open to all possibilities: it determines whether there is a significant
difference, in either a positive or a negative direction
- A one-sided test should only be used if we are interested in changes in one direction
only, or if changes in the other direction can be ruled out: it determines whether a
sample is significantly greater or significantly less than a reference value, but not both.
That is, we are investigating whether the observed value is significantly different in a
specific direction.

The chi-square test: determine if there is an association between two categorical variables.
How to use:

1. Hypothesis formulation:
- Null Hypothesis (H0): There is no association between the variables.
- Alternative Hypothesis (H1): There is an association between the variables.
2. Data collection: The data should be organized in a contingency table that shows the
observed frequencies for each combination of categories of the two variables.
3. Calculation of the Chi-Square Statistic: The chi-square statistic is calculated from the
observed and expected frequencies.
4. Degrees of freedom: The number of degrees of freedom depends on the size of the
table and is calculated as (r-1)(c-1), where "r" is the number of rows and "c" is the
number of columns in the table.
5. Comparison with the Critical Value or p-value: Using the chi-square distribution and
degrees of freedom, determine whether the observed chi-square value is statistically
significant. This is done by comparing the calculated value with a critical value from the
chi-square table or by calculating a p-value.

How to interpret:

If the calculated chi-square value is greater than the critical value in the table or if the p-value
is less than a previously established significance level (for example, 0.05), then the null
hypothesis is rejected and it is concluded that there is a significant association between the
variables.

If the calculated chi-square value is not large enough or the p-value is greater than the
significance level, there is not enough evidence to reject the null hypothesis and it is concluded
that there is no significant association between the variables.
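A small sketch, assuming a 2x2 contingency table of two hypothetical categorical variables:

tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(gender = c("Male", "Female"),
                              transport = c("bike", "car")))
chisq.test(tab)   # chi-square test of association; compare its p-value with 0.05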

NOTE: Contingency table is a tool in statistics that is used to organize and summarize
categorical data in the form of a matrix of two or more dimensions.
