Lec SFM


IV. Data sources (Slides)


26/9/2023
Lesson 2
Descriptive statistics: Tabular and graphical methods
I. Describing data

II. Qualitative data


* Frequency distribution:
- A tabular summary of data showing the frequency (or number) of items in
each of several nonoverlapping classes.
-> Provides insights about the data that cannot be quickly obtained by looking
only at the original data
- A grouping of qualitative data into mutually exclusive classes showing the
number of observations in each class
* Relative frequency distribution
- The relative frequency of a class is the fraction or proportion of the total
number of data items belonging to the class.
- A tabular summary of a set of data showing the relative frequency for
each class.
* Percent frequency distribution
- The percent frequency of a class is the relative frequency multiplied by 100
- A tabular summary of a set of data showing the percent frequency for
each class.
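A minimal sketch of the three tables defined above, built from one list of qualitative observations. The soft-drink data and the name frequency_table are hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical qualitative data (not taken from the lecture slides).
purchases = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
             "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]

def frequency_table(items):
    """Map each class to (frequency, relative frequency, percent frequency)."""
    counts = Counter(items)
    n = len(items)
    return {label: (f, f / n, 100 * f / n) for label, f in counts.items()}

table = frequency_table(purchases)
for label, (f, rel, pct) in table.items():
    # One row per class: frequency, relative frequency, percent frequency.
    print(f"{label:<8}{f:>4}{rel:>8.2f}{pct:>8.1f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any table built this way.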
Ex:

III. Quantitative data


* Frequency distribution
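Ex: temperatures grouped into the classes 12 – 21.2 and 21.2 – 30.4
=> That is, class 1: from 12 up to (but not including) 21.2
Class 2: from 21.2 up to (but not including) 30.4
2^k rule => k: number of classes
* Choose the smallest k such that 2^k > n; n: number of observations
W = (58 − 12) / 5 = 9.2
-> round up = 10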

Ex:
W = (109 − 52) / 6 = 9.5
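The class-building steps can be sketched as follows. The function names are hypothetical, and the sample size n = 30 is an assumption because the notes do not state n; the two width calculations reuse the figures from the notes.

```python
import math

def number_of_classes(n):
    """Smallest k such that 2**k > n (the 2^k rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def class_width(x_min, x_max, k):
    """Raw class width W = (max - min) / k, before rounding up to a convenient value."""
    return (x_max - x_min) / k

def class_index(value, lower_limit, width):
    """Index of the class [lower, lower + width) holding value: lower limit included, upper excluded."""
    return math.floor((value - lower_limit) / width)

k = number_of_classes(30)        # assumed n = 30, since 2**5 = 32 > 30
w1 = class_width(12, 58, 5)      # 9.2, rounded up to 10 in the notes
w2 = class_width(52, 109, 6)     # 9.5
print(k, w1, w2)
print(class_index(22, 12, 10))   # with classes 12-22, 22-32, ... the value 22 opens the second class
```

Treating the upper limit as excluded is what makes "12 up to 21.2" and "21.2 up to 30.4" nonoverlapping.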
* Dot plot
- One of the simplest graphical summaries of data is a dot plot
- A horizontal axis shows the range of data values
- Each data value is represented by a dot placed above the axis
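A rough text-only sketch of a dot plot for small integer data. The function name dot_plot and the sample values are hypothetical; a plotting library would normally be used instead.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: a horizontal axis of integers with one 'o' per observation."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    axis = range(lo, hi + 1)
    height = max(counts.values())
    cell = len(str(hi)) + 1  # column width, wide enough for the largest label
    rows = []
    for level in range(height, 0, -1):
        # Draw a dot in this row when the value occurs at least `level` times.
        row = "".join(("o" if counts[v] >= level else " ").rjust(cell) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(cell) for v in axis))
    return "\n".join(rows)

sample = [1, 2, 2, 3, 5, 5, 5]  # hypothetical data
print(dot_plot(sample))
```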
* Histogram and bar chart
- A histogram displays continuous (quantitative) data, so there are no gaps between adjacent bars; its
horizontal axis also differs from a bar chart's, showing numeric class limits instead of category labels.
* Stem-and-Leaf display
=> Shows how the actual values are distributed; however, it is easier to construct
when the data set is small.
Explain:
Data: 52, 57,65,…, 109
Stem Leaf
5 2
5 7
6 5
10 9
Method 1:
Stem Leaf
9 8 9
10 2 4 6
11 5 4 5

Method 2:
Stem Leaf
0 98 99
1
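A minimal sketch of a stem-and-leaf display that uses the last digit as the leaf and the remaining digits as the stem. The function name is hypothetical; 52, 57, 65 and 109 come from the notes, and the other values are made up to fill the display.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (the last digit)."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

# 52, 57, 65 and 109 appear in the notes; the remaining values are hypothetical.
data = [52, 57, 65, 109, 68, 73, 98, 99, 102]
for stem, leaves in sorted(stem_and_leaf(data).items()):
    print(f"{stem:>2} | {' '.join(map(str, leaves))}")
```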
IV. Cumulative Distribution

* Frequency distribution

Ex: 9 = 3 + 6; 14 = 3 + 6 + 5
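The running totals in this example (9 = 3 + 6, 14 = 3 + 6 + 5) can be reproduced with a cumulative sum. Only the first three class frequencies are given in the notes, so the list below stops there.

```python
from itertools import accumulate

# First three class frequencies taken from the example in the notes.
frequencies = [3, 6, 5]

# Each cumulative frequency is the sum of all class frequencies up to and including that class.
cumulative = list(accumulate(frequencies))
print(cumulative)
```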
* Ogive
3/10/2023
Lesson 3
Descriptive Statistics: Numerical Methods
          Population (All) => Parameters    Sample (Some) => Statistics
Size      N                                 n
Mean      μ                                 x̄
x_i = 15 for the class “10 up to 20” (the class midpoint)
* Mean
-> Arithmetic
=> Simple: x̄ = ∑ x_i / n
=> Weighted: x̄ = ∑ x_i w_i / ∑ w_i
=> Grouped: x̄ = ∑ x_i f_i / ∑ f_i
-> Geometric
=> Simple: x̄ = (∏ x_i)^(1/n)
=> Weighted: x̄ = (∏ x_i^w_i)^(1/∑ w_i)
∏: product
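The three arithmetic means can be sketched in a few lines. The function names and all numbers are hypothetical, except the midpoint 15 for the class "10 up to 20", which comes from the notes.

```python
def simple_mean(xs):
    """Sum of the values divided by their count."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Sum of x_i * w_i divided by the sum of the weights."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def grouped_mean(midpoints, freqs):
    """Grouped mean: a weighted mean of class midpoints, weighted by class frequency."""
    return weighted_mean(midpoints, freqs)

# Hypothetical classes 10-20, 20-30, 30-40 with midpoints 15, 25, 35.
print(simple_mean([2, 4, 6]))
print(weighted_mean([10, 20], [1, 3]))
print(grouped_mean([15, 25, 35], [2, 5, 3]))
```

The grouped mean is only an estimate, because every observation in a class is treated as if it sat at the class midpoint.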

- For growth rates, compute the growth factors first


- Growth rate = Growth factor – 1 (or – 100 when expressed in percent)
-> The average growth rate cannot be computed directly from the growth rates; it must be computed from the growth factors
- Growth factor = Growth rate + 1
Stock A Stock B
104.01
114.31
119.01
85.31
73.51
108.01
105.81

x̄ = (104.01 × 114.31 × … × 105.81)^(1/7) = 102.05


=> Growth rate = 102.05 – 100 = 2.05%
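A minimal sketch of why the average must go through the growth factors. The function name and the growth rates are hypothetical and are not the stock data from the slides.

```python
import math

def average_growth_rate(growth_rates):
    """Average growth rate from rates given as decimals (0.10 means +10%)."""
    factors = [1 + r for r in growth_rates]          # growth factor = growth rate + 1
    geometric_mean = math.prod(factors) ** (1 / len(factors))
    return geometric_mean - 1                        # growth rate = growth factor - 1

# Hypothetical case: +100% in one year, then -50% the next.
rates = [1.0, -0.5]
print(sum(rates) / len(rates))      # arithmetic mean of the rates suggests +25% per year
print(average_growth_rate(rates))   # geometric mean of the factors shows no growth overall
```

Doubling and then halving returns to the starting value, so the geometric result of zero growth is the one that matches what actually happened.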
Data 1: 6
Data 2: Mean of 7 and 8
Middle positions (MP) example, even number of observations
4 = 2 × 2 -> MP: 2 & 3
6 = 2 × 3 -> MP: 3 & 4
8 = 2 × 4 -> MP: 4 & 5
∑ f_i = 2 × m -> MP: m & m + 1
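The middle-position rule can be sketched as follows. The data sets are hypothetical, chosen so that the results match the two cases in the notes (a single middle value of 6, and the mean of 7 and 8).

```python
def median(values):
    """Middle value for odd n; mean of positions m and m + 1 (1-based) for n = 2m."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # 1-based positions m and m + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 6, 9]))       # odd number of observations
print(median([3, 7, 8, 10]))   # even number: mean of 7 and 8
```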
- A weakness of the mean and the median is that they apply only to quantitative data
Data 1: Mode is 7
Data 2: No mode
Data 3: Modes are 5 & 6
Data 4: No mode

Mode here is 40-50
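A small sketch covering the three situations listed above: one mode, several modes, and no mode. The function name and data are hypothetical, and treating "every value equally frequent" as "no mode" is an assumed convention.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modes, or [] when all values occur equally often."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: if several distinct values all tie, report no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([5, 7, 7, 9]))      # one mode
print(modes([1, 2, 3, 4]))      # no mode
print(modes([5, 5, 6, 6, 8]))   # two modes
```

For grouped data the same idea applies to classes: the modal class is the one with the highest frequency.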


Green: Normal distribution
Blue: Flatter (more spread out)
Red: Sharper (more peaked)
=> Comment on data distribution with the Skewness coefficient.
- Percentiles divide the data into 100 equal parts; compare the median, which divides the data into 2 equal parts
-> each individual percentile splits the data into 2 parts
Ex: 90th percentile: lower 90% and upper 10%; the value of the 90th percentile
is 3 s.
-> 90% of the requests have a response time lower than or equal
to 3 s.
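A minimal sketch of the 90th-percentile reading. It uses the nearest-rank rule, which is an assumption: textbooks use several slightly different location rules, and the course may use another. The response times are hypothetical.

```python
def percentile_nearest_rank(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (assumed rule, not the only one)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = max(1, (p * n + 99) // 100)  # ceil(p * n / 100) using integer arithmetic
    return ordered[rank - 1]

def share_at_or_below(values, threshold):
    """Fraction of observations that are lower than or equal to threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Hypothetical response times in seconds.
times = [0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 3.0, 4.5]
p90 = percentile_nearest_rank(times, 90)
print(p90, share_at_or_below(times, p90))
```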
Exercise

1.
Mean: 270
Median: 260
Mode: 240
2. Skewed to the right, because the mode is the smallest of the three measures
3.
- The mean is high because the values above 240 are all fairly high; different job positions
earn higher salaries
=> Choose the mean because it is already the highest figure, so the salary negotiation is less likely
to lead to conflict
=> If you are the employees' representative, choose the mode, because it shows that employees
typically reach only 240
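The comparison used in part 2 of the exercise can be sketched as a simple rule of thumb. The function name is hypothetical, and the rule only looks at the ordering of the three measures, not at a skewness coefficient.

```python
def skew_direction(mean, median, mode):
    """Rule-of-thumb skew direction from the ordering of mean, median and mode."""
    if mean > median > mode:
        return "right-skewed"   # a long right tail pulls the mean upward
    if mean < median < mode:
        return "left-skewed"    # a long left tail pulls the mean downward
    if mean == median == mode:
        return "symmetric"
    return "unclear"

# Figures from the exercise: mean 270, median 260, mode 240.
print(skew_direction(270, 260, 240))
```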
Week 4
- Cons: the range depends only on the maximum and minimum values
- R = x_max − x_min
=> ∑ (x_i − x̄)
(see notebook)

3. Interquartile range
-
Squared because the deviations can be negative or positive
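A minimal sketch of why the deviations are squared: the raw deviations from the mean cancel out, while the squared ones do not. The data are hypothetical, and the n − 1 divisor assumes the data are a sample, not a whole population.

```python
import math

def data_range(xs):
    """R = x_max - x_min."""
    return max(xs) - min(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1 (sample formula, assumed)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

data = [1, 2, 3, 4, 5]  # hypothetical observations
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

print(sum(deviations))                    # positive and negative deviations cancel
print(data_range(data))
print(sample_variance(data))
print(math.sqrt(sample_variance(data)))   # standard deviation
```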
Grouped data
Ex: 29 textbooks (p.124)

c.
Mean of Pomona: 48.3
Mean of Anaheim: 48.5
S of Pomona = 9.63
S of Anaheim = 11.66
=> CV for Pomona: 9.63 / 48.3 × 100 = 19.94%
=> CV for Anaheim: 11.66 / 48.5 × 100 = 24.04%
=> Pomona's air quality is rated higher because its CV is lower; the CV measures the degree of
fluctuation.
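The comparison between the two cities can be checked with a short sketch. The function name is hypothetical; the means and standard deviations are the ones given in the notes.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = s / mean * 100, the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

# Means and standard deviations from the notes.
cv_pomona = coefficient_of_variation(9.63, 48.3)
cv_anaheim = coefficient_of_variation(11.66, 48.5)

print(round(cv_pomona, 2), round(cv_anaheim, 2))
print("Pomona varies less" if cv_pomona < cv_anaheim else "Anaheim varies less")
```

Because the CV is unit-free, it stays comparable even when the two data sets have different means or different units.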
The means of the 2 data sets are different => must use the coefficient of variation
CV1 = 707.11 / 50000 × 100 = 1.41%
CV2 = 112.12 / 500 × 100 = 22.42%

=> CV1 has the lower degree of risk.

C. MEASURES OF RELATIVE LOCATION AND DETECTING OUTLIERS
I. Z-Score
- Measures of relative location: how far a particular value is from the mean
- Also called the standardized value
• Formula: z_i = (x_i − x̄) / s

=> The number of standard deviations x_i is from the mean


Eg:
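A minimal sketch of z-scores on hypothetical data. It assumes the sample standard deviation (n − 1 divisor), and the |z| > 3 cut-off for flagging possible outliers is a common rule of thumb, not something stated in the notes.

```python
import statistics

def z_scores(values):
    """Standardized values: how many standard deviations each x_i lies from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (assumed)
    return [(x - mean) / s for x in values]

def possible_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (rule-of-thumb cut-off, assumed)."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

data = [1, 2, 3, 4, 5]  # hypothetical observations
z = z_scores(data)
print([round(v, 4) for v in z])
print(possible_outliers(data))
```

A z-score of 0 means the value equals the mean, and the sign shows on which side of the mean it lies.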
