
Log 708 - Chapter 1

Halvard Arntzen

Log 708 - Chapter 1 1/15


About statistics

Making sense out of data - in two ways


1 Descriptive statistics
Summarize and present data for “human processing”
Key numbers and graphics
2 Statistical analysis (Inferential statistics)
Gain knowledge about the world through data and models.



Data

Objects are in rows, variables are in columns. For each object
(here apartments), values are recorded on a set of variables.
## price area rooms standard situated town
## 1 1031 100 3 2 6 1
## 2 1129 116 3 1 5 1
## 3 1123 110 3 2 5 1
## 4 607 59 2 3 5 1
## 5 858 72 2 3 4 1
## 6 679 64 2 2 3 1
## 7 1300 129 4 1 5 1
## 8 1004 103 3 2 5 1
## 9 673 59 2 2 5 1
## 10 1187 115 3 1 6 1
This we could call “raw data”.
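
The rows-and-columns layout can be sketched in code. This is a minimal illustration using only the ten rows shown above; the names `columns`, `rows` and `apartments` are made up for the example and are not from the course material.

```python
# The ten rows shown above: one tuple per object (apartment),
# one position per variable (column).
columns = ["price", "area", "rooms", "standard", "situated", "town"]
rows = [
    (1031, 100, 3, 2, 6, 1),
    (1129, 116, 3, 1, 5, 1),
    (1123, 110, 3, 2, 5, 1),
    (607, 59, 2, 3, 5, 1),
    (858, 72, 2, 3, 4, 1),
    (679, 64, 2, 2, 3, 1),
    (1300, 129, 4, 1, 5, 1),
    (1004, 103, 3, 2, 5, 1),
    (673, 59, 2, 2, 5, 1),
    (1187, 115, 3, 1, 6, 1),
]

# Attach the variable names to every row.
apartments = [dict(zip(columns, row)) for row in rows]

# One object (a row): all variables recorded for the first apartment.
first_apartment = apartments[0]

# One variable (a column): the price of every apartment.
prices = [apartment["price"] for apartment in apartments]

print(first_apartment)
print(prices)
```

Reading along a row gives everything known about one apartment. Reading down a column gives all observations of one variable.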



Getting some information out of (raw) data

Key numbers, like mean, median, standard deviation, ...
Graphical display of various characteristics.
Relevant information depends on what type of variable we are
dealing with.



Types of variables

Numerical - when values are naturally numeric: counting or
measuring things
Discrete
Continuous

Categorical - when there are no natural numeric values: used
for classification of objects.
Nominal
Ordinal



Numerical variables
Discrete variables - limited number of prescribed numerical
values.
E.g. counting something.

Continuous variables - any value in an interval is possible.
E.g. measuring something.

Sometimes it is not entirely apparent whether a variable X should
be treated as discrete or continuous. Then ask: does it make
practical sense to assign individual probabilities

P[X = a]

for all possible values a? If YES -> discrete, otherwise -> continuous.
For example, the price of a flat in NOK is best treated as continuous,
even though it takes integer values.
Categorical variables

Nominal
When no ordering can be assigned to values. (Color of cars =
Green, Blue, Black,. . . )

Ordinal
Some sort of order exists. (Health state: Bad, Medium, Good)

One should always take the existing order into account when
presenting summaries (e.g. not in the order Bad - Good - Medium).
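
A small sketch of presenting an ordinal variable in its natural order. The observations are hypothetical and invented for the illustration; only the three levels Bad, Medium, Good come from the example above.

```python
from collections import Counter

# Natural order of the ordinal variable, lowest to highest.
HEALTH_ORDER = ["Bad", "Medium", "Good"]

# Hypothetical observations, in the order they happened to be recorded.
observations = ["Good", "Bad", "Medium", "Good", "Medium", "Good"]

counts = Counter(observations)

# Build the summary by walking through the levels in their natural order,
# not alphabetically and not by frequency.
ordered_table = [(level, counts[level]) for level in HEALTH_ORDER]

for level, count in ordered_table:
    print(f"{level:<6} {count}")
```

Sorting the levels alphabetically would give Bad - Good - Medium, which is exactly the ordering to avoid.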



Key numbers - numerical variables

Measures of central tendency


Mean x̄ (traditional average value)
Median m (middle observation)
Can be quite different if data distribution is skewed.

Measures of variation
Variance - Standard deviation
Interquartile range
Standard deviation far more common, but sensitive to extreme
values.
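
The key numbers can be computed with Python's standard `statistics` module. This is a sketch on the ten prices from the raw data shown earlier, so the results will not match figures computed from the full data set. Quartiles can be defined in several ways; `statistics.quantiles` uses its default ("exclusive") method here, and other software may give a slightly different interquartile range.

```python
import statistics

# Prices from the ten rows of raw data shown earlier (a subset only).
prices = [1031, 1129, 1123, 607, 858, 679, 1300, 1004, 673, 1187]

# Measures of central tendency.
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Measures of variation. variance() and stdev() are the sample versions
# (division by n - 1).
variance = statistics.variance(prices)
std_dev = statistics.stdev(prices)

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1

print("mean:", mean_price)
print("median:", median_price)
print("variance:", variance)
print("standard deviation:", std_dev)
print("interquartile range:", iqr)
```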



About skewness

[Figure: histogram of flat prices; x-axis: price, y-axis: count]

These data are slightly right skewed, so the mean will be slightly
higher than the median.
## Mean = 995.4333 Median = 974
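
Why right skew pulls the mean above the median can be seen in a tiny sketch. The numbers are hypothetical and chosen only to make the effect visible; they are not taken from the flat data.

```python
import statistics

# Hypothetical symmetric sample: mean and median coincide.
symmetric = [600, 700, 800, 900, 1000]

# Same sample with one very large value added, giving a long right tail.
right_skewed = symmetric + [3000]

mean_sym = statistics.mean(symmetric)
median_sym = statistics.median(symmetric)
mean_skew = statistics.mean(right_skewed)
median_skew = statistics.median(right_skewed)

print("symmetric:    mean =", mean_sym, " median =", median_sym)
print("right skewed: mean =", mean_skew, " median =", median_skew)
```

The single large value moves the mean a lot, while the median only shifts to the midpoint of the two middle observations.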



Rule of thumb for standard deviation.

For data that are not very skewed, roughly 95% of observations will
be within
x̄ ± 2S
where S is the sample standard deviation.
For the flat price variable we get
## Mean = 995.4333 Standard Dev = 283.4122
So roughly 95% of the data should lie between
995 − 2 · 283 = 429 and 995 + 2 · 283 = 1561
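
The rule of thumb is easy to check in code. This sketch uses the mean and standard deviation quoted above; the helper names `two_sd_interval` and `fraction_within` are made up for the example. Since the full data set is not reproduced here, the check of the share of observations inside the interval uses only the ten prices from the raw data shown earlier.

```python
def two_sd_interval(mean, sd):
    """Return the interval mean ± 2 * sd as (low, high)."""
    return mean - 2 * sd, mean + 2 * sd


def fraction_within(data, low, high):
    """Share of observations that fall inside [low, high]."""
    inside = sum(1 for value in data if low <= value <= high)
    return inside / len(data)


# Summary numbers quoted above for the flat price variable.
low, high = two_sd_interval(995.4333, 283.4122)
print("interval:", round(low), "to", round(high))

# Only the ten rows shown earlier, not the full data set.
prices = [1031, 1129, 1123, 607, 858, 679, 1300, 1004, 673, 1187]
share = fraction_within(prices, low, high)
print("share of the ten prices inside the interval:", share)
```

With the unrounded inputs the upper limit comes out one unit higher than the 1561 obtained from the rounded values 995 and 283.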



Sample correlation coefficient

For two numerical (usually continuous) variables X, Y
Measures the connection between variations in the variables
Covariance - hard to interpret
Correlation coefficient - better
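
One reason covariance is hard to interpret is that it changes with the units of measurement, while the correlation coefficient does not. The sketch below shows this on the ten (area, price) pairs from the raw data shown earlier. The function names are made up for the example, and multiplying area by 10 is a hypothetical change of units.

```python
import math

# Area and price from the ten rows of raw data shown earlier (a subset only).
area = [100, 116, 110, 59, 72, 64, 129, 103, 59, 115]
price = [1031, 1129, 1123, 607, 858, 679, 1300, 1004, 673, 1187]


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(sample_covariance(x, x))
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)


cov = sample_covariance(area, price)
r = sample_correlation(area, price)

# Hypothetical change of units: every area value multiplied by 10.
area_rescaled = [10 * a for a in area]
cov_rescaled = sample_covariance(area_rescaled, price)
r_rescaled = sample_correlation(area_rescaled, price)

print("covariance:", cov, "-> after rescaling:", cov_rescaled)
print("correlation:", r, "-> after rescaling:", r_rescaled)
```

The covariance grows by the same factor as the rescaling, while the correlation stays between -1 and 1 and is unchanged. The value for these ten rows need not equal the one computed from the full data set.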



Example with scatterplot

[Figure: scatterplot of price vs area for flats; x-axis: area, y-axis: price]

The correlation coefficient between these variables is 0.95.



More about visualization for continuous variables

The compendium shows more examples, like
histogram
boxplot
scatterplot



Most important visualizations and summaries for categorical variables

Frequency tables
Bar chart, Pie chart
Crosstabulations
Stacked or clustered charts
The compendium shows examples.
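
A frequency table and a crosstabulation can be sketched with the standard library. The example uses the `rooms` and `standard` columns from the ten rows of raw data shown earlier and treats both as categories. The meaning of the `standard` codes is not given in the slides, so they are shown as plain codes.

```python
from collections import Counter

# Two columns from the ten rows of raw data shown earlier (a subset only).
rooms = [3, 3, 3, 2, 2, 2, 4, 3, 2, 3]
standard = [2, 1, 2, 3, 3, 2, 1, 2, 2, 1]

# Frequency table: count and share for each value of one variable.
room_counts = Counter(rooms)
n = len(rooms)
frequency_table = [
    (value, room_counts[value], room_counts[value] / n)
    for value in sorted(room_counts)
]

print("rooms  count  share")
for value, count, share in frequency_table:
    print(f"{value:>5}  {count:>5}  {share:>5.2f}")

# Crosstabulation: count for each combination of two variables.
crosstab = Counter(zip(rooms, standard))
standard_levels = sorted(set(standard))

print("rows: rooms, columns: standard", standard_levels)
for r in sorted(set(rooms)):
    # A Counter returns 0 for combinations that never occur.
    print(r, [crosstab[(r, s)] for s in standard_levels])
```

The same counts are what a bar chart (one variable) or a stacked or clustered chart (two variables) would display.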



Summary

Talked about descriptive statistics


Making data accessible for us
Key numbers
Visualizations
Depending on type of variable (numerical vs categorical)
