Statistic Corner

A practical overview on probability distributions


Andrea Viti¹, Alberto Terzi², Luca Bertolaccini²
¹Thoracic Surgery Unit, S. Croce e Carle Hospital, Cuneo, Italy; ²Thoracic Surgery Unit, Sacro Cuore Research Hospital, Negrar Verona, Italy
Correspondence to: Andrea Viti, MD, PhD. Thoracic Surgery Unit, S. Croce e Carle Hospital, Via Michele Coppino 26, 12100 Cuneo, Italy.
Email: vitimassa@hotmail.it.

Abstract: The aim of this paper is to give a general definition of probability, of its main mathematical
features and of the features it presents under particular circumstances. The behavior of probability is linked
to the features of the phenomenon we would like to predict. This link can be defined as a probability
distribution. Given the characteristics of the phenomena (which we can also define as variables), specific
probability distributions are defined. For categorical (or discrete) variables, the probability can be described
by a binomial or Poisson distribution in the majority of cases. For continuous variables, the probability can be
described by the most important distribution in statistics, the normal distribution. Distributions of probability
are briefly described together with some examples of their possible application.

Keywords: Probability distributions; discrete variables; continuous variables

Submitted Nov 09, 2014. Accepted for publication Dec 17, 2014.
doi: 10.3978/j.issn.2072-1439.2015.01.37
View this article at: http://dx.doi.org/10.3978/j.issn.2072-1439.2015.01.37

© Journal of Thoracic Disease. All rights reserved. www.jthoracdis.com J Thorac Dis 2015;7(3):E7-E10

A short definition of probability

We can define the probability of a given event by evaluating, in previous observations, the incidence of the same event under circumstances that are as similar as possible to the circumstances we are observing [this is the frequentist definition of probability, and is based on the relative frequency of an event observed in previous circumstances (1)]. In other words, probability describes the possibility that an event occurs given a series of circumstances (or under a series of pre-event factors). It is a form of inference, a way to predict what may happen, based on what happened before under the same (never exactly the same) circumstances. Probability can vary from 0 (our expected event was never observed, and should never happen) to 1 (or 100%, the event is almost sure). It is described by the following formula, where P(X = x) is the probability of a given event x (Eq. [1]):

$$\sum P(X = x) = P(E_1) + P(E_2) + \dots + P(E_n) = 1 \qquad [1]$$

This is one of the three axioms of probability, as described by Kolmogorov (2):
(I) If, under some circumstances, a given number of events (E) could occur (E1, E2, E3, …, En), the probability (P) of any E is never less than zero;
(II) The sum of the probabilities of all the events, P(E1) + P(E2) + … + P(En), is 100%;
(III) If E1 and E3 are two mutually exclusive events, the probability that one or the other happens, P(E1 or E3), is equal to the sum of the probability of E1 and the probability of E3 (Eq. [2]):

$$P(E_1 \text{ or } E_3) = P(E_1) + P(E_3) \qquad [2]$$

Probability can be described by a formula or a graph in which each event is linked to its probability. This kind of description of probability is called a probability distribution.

Binomial distribution

A classic example of probability distribution is the binomial distribution. It represents the probability when only two mutually exclusive events may happen. The typical example is the toss of a coin: there are only two possible results and, in this case, the probability is 50% for each of them. However, the binomial distribution may also describe two events that are mutually exclusive but not equally likely (for instance, that a newborn baby will be left-handed or right-handed).
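The frequentist definition and the axioms above can be illustrated with a small simulation. The following is only a minimal sketch in plain Python: the helper names `simulate_events` and `relative_frequencies`, the number of trials, the fixed random seed and the 30% probability used for the unequal case are all assumptions chosen for illustration, not part of the article.

```python
import random


def simulate_events(n_trials, p_event, seed=0):
    """Simulate n_trials observations of two mutually exclusive events.

    Each observation is "E1" with probability p_event, otherwise "E2".
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return ["E1" if rng.random() < p_event else "E2" for _ in range(n_trials)]


def relative_frequencies(observations):
    """Estimate probabilities as relative frequencies of the observed events."""
    counts = {}
    for event in observations:
        counts[event] = counts.get(event, 0) + 1
    total = len(observations)
    return {event: count / total for event, count in counts.items()}


# Coin toss: the two events are equally likely.
coin = relative_frequencies(simulate_events(100_000, 0.5))

# Two events that are mutually exclusive but not equally likely
# (0.3 is a hypothetical value chosen only for illustration).
unequal = relative_frequencies(simulate_events(100_000, 0.3))

for name, freqs in (("coin", coin), ("unequal", unequal)):
    print(name, {event: round(f, 3) for event, f in sorted(freqs.items())})
    # Every estimated probability lies between 0 and 1 ...
    print("  all within [0, 1]:", all(0.0 <= f <= 1.0 for f in freqs.values()))
    # ... and P(E1 or E2) = P(E1) + P(E2) covers all the observations.
    print("  sum of probabilities:", round(sum(freqs.values()), 6))
```

With many observations the relative frequencies settle close to the underlying probabilities, which is the sense in which past observations are used to predict future events.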
The probability that individuals present a given characteristic, p, that is mutually exclusive of another one, called q, depends on the number of possible combinations of x individuals within the population, called C. If my population is composed of five individuals, each of whom can be p or q, I have ten possible combinations of, for instance, three individuals with p (Eq. [3]):

pppqq, ppqpq, ppqqp, pqppq, pqpqp, pqqpp, qpppq, qppqp, qpqpp, qqppp

$$P(p, p, p, q, q) = pppqq = p^3 q^2 \qquad [3]$$

Then $p^3 q^2$ will be multiplied by the number of combinations (ten), giving $10\,p^3 q^2$.

If, in an experimental population, I had a large number of individuals (n), the number of combinations of x individuals within the population will be (Eq. [4]):

$${}_nC_x = \frac{n!}{x!\,(n - x)!} \qquad [4]$$

Therefore, the probability that a group of x individuals within the population of n individuals presents the characteristic p, that excludes q, will be described by the following formula (Eq. [5]):

$$f(x) = {}_nC_x\, p^x q^{n-x} \qquad [5]$$

that describes the binomial distribution. It follows Kolmogorov's rules (Eq. [6]):

$$0 \le f(x) \le 1, \qquad \sum f(x) = 1 \qquad [6]$$

In a given population, 30% of the people are left-handed. If we select ten individuals from this population, what is the probability that four out of ten individuals are left-handed? We can apply the binomial distribution, since we suppose that a person may be either left-handed or right-handed. So we can use our formula (Eq. [7]):

$$f(4) = {}_{10}C_4\,(0.3)^4 (0.7)^6 = 0.2001 \qquad [7]$$

Poisson distribution

Another important distribution of probability is the Poisson distribution. It is useful to describe the probability that a given event happens a given number of times within a given period (for instance, how many thoracic traumas could need the involvement of the thoracic surgeon in a day, or a week, etc.). The events that may be described by this distribution have the following characteristics:
(I) The events are independent from one another;
(II) Within a given interval the event may occur from 0 to infinite times;
(III) The probability that an event happens increases when the period of observation is longer.

To predict the probability, I must know how the events behave (this data comes from previous, or historical, observations of the same event before the time I am trying to perform my analysis). This parameter, that is the mean number of events in a given interval, as derived from previous observations, is called λ.

The Poisson distribution follows the formula (Eq. [8]):

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!} \qquad [8]$$

where the number e is an important mathematical constant that is the base of the natural logarithm. It is approximately equal to 2.71828.

For example, the distribution of major thoracic traumas needing intensive care unit (ICU) recovery during a month in the last three years in a Third Level Trauma Center follows a Poisson distribution, where λ = 2.75. In a future period of one month, what is the probability of having three patients with major thoracic trauma in ICU? (Eq. [9]):

$$P(X = 3) = \frac{2.75^3\, e^{-2.75}}{3!} \approx 0.2216 \qquad [9]$$

Therefore, the probability is approximately 22.2%.

The binomial distribution refers only to discrete variables (which present a limited number of values within a given interval). However, in nature, many variables may present an infinite number of values within a given interval. These are called continuous variables (3).

Distributions of continuous variables

An example of continuous variable is the systolic blood pressure. Within a given cohort, the distribution of systolic blood pressure can be presented as in Figure 1. The width of each bar of the histogram represents an interval of the measure of interest between two values on the x-axis, while the height of the bar represents the number of measured values within the interval. When the number of observations becomes very large (tends to infinity) and the width of the bars becomes narrower (tends to 0), the above representation becomes more similar to a curved line (Figure 2). This curve describes the distribution of probability, f (density of probability), for any given value of x, the continuous variable. The area under the curve is equal to 1 (100% of probability).
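The discrete worked examples above (Eq. [7] and Eq. [9]) can be checked numerically. The sketch below assumes only the Python standard library; the function names `binomial_pmf` and `poisson_pmf` are illustrative choices, not taken from the article. It also checks that the binomial probabilities sum to 1 (Eq. [6]), which is the discrete counterpart of the statement that the area under a density curve equals 1.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Eq. [5]: probability that exactly x of n individuals show characteristic p."""
    q = 1.0 - p  # probability of the mutually exclusive characteristic q
    return comb(n, x) * p**x * q ** (n - x)


def poisson_pmf(x, lam):
    """Eq. [8]: probability of exactly x events when the mean number of events is lam."""
    return lam**x * exp(-lam) / factorial(x)


# Eq. [4]: combinations of three individuals with p among five
combinations_3_of_5 = comb(5, 3)

# Eq. [7]: four left-handed individuals out of ten, with 30% left-handed people
left_handed = binomial_pmf(4, 10, 0.3)

# Eq. [9]: three major thoracic traumas in one month, with lambda = 2.75
trauma = poisson_pmf(3, 2.75)

# Eq. [6]: the probabilities of all possible values of x sum to 1
binomial_total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))

print("combinations of 3 among 5:", combinations_3_of_5)
print("binomial f(4):", round(left_handed, 4))
print("Poisson P(X = 3):", round(trauma, 4))
print("sum of binomial probabilities:", round(binomial_total, 6))
```

The binomial value agrees with the 0.2001 of Eq. [7], and the Poisson value is about 0.2216, in line with the roughly 22% reported for Eq. [9].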

Figure 1 Graphical description of the distribution of systolic blood pressure in a given population. [Histogram; x-axis: systolic blood pressure (mmHg), intervals 110-119, 120-129, 130-139 and 140-149; y-axis: number of patients.]

Figure 2 Graphical description of the normal distribution. [Curve; x-axis: systolic blood pressure (mmHg); y-axis: number of patients.]

If we now assume that the value of our continuous variable X depends on a very large number of other factors (in many cases beyond our possibility of direct analysis), the probability distribution of X becomes similar to a particular form of distribution, called normal distribution or Gauss distribution. The aforementioned concept is the famous Central Limit Theorem. The normal distribution represents a very important distribution of probability because f, that is the distribution of probability of our variable, can be represented by only two parameters:
• µ = mean;
• σ = standard deviation.

The mean is a so-called measure of central tendency (it represents the most central value of our curve), while the standard deviation represents how dispersed the values are around the central value (it is a measure of dispersion).

The main characteristics of this distribution are:
(I) It is symmetric around µ;
(II) The area under the curve is 1;
(III) The area under the curve between µ − σ and µ + σ covers 68% of all the possible values of X, while the area between µ − 2σ and µ + 2σ covers 95% of all the values.

The two parameters of the distribution are linked in the formula (Eq. [10]):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty \qquad [10]$$

For µ = 0 and σ = 1, the curve is called standardized normal distribution. All the possible normal distributions of x may be "normalized" by defining a derived variable called z (Eq. [11]):

$$z = \frac{x - \mu}{\sigma}, \qquad f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty \qquad [11]$$

To calculate the probability that our variable falls within a given interval, for instance between z0 and z1, we should calculate the following definite integral (Eq. [12]):

$$\int_{z_0}^{z_1} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz \qquad [12]$$

Fortunately, for the standardized normal distribution of z, every possible interval has been tabulated.

In a given population of adult men, the mean weight is 70 kg, with a standard deviation of 3 kg. What is the probability that a randomly selected individual from this population would have a weight of 65 kg or less?

To "normalize" our distribution, we should calculate the value of z (Eq. [13]):

$$z = \frac{65 - 70}{3} = -1.67 \qquad [13]$$

Then, we should calculate the area under the curve (Eq. [14]):

$$\int_{-\infty}^{-1.67} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz = P(-\infty < z \le -1.67) \qquad [14]$$

The value of our interval has already been calculated and tabulated [the tables can be easily found in any statistics textbook or on the web (4)]. Our probability is 0.0475 (4.75%). We may also calculate the probability of finding, within the same population, someone whose weight is between 65 and 74 kg.
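The tabulated areas used in this example can also be computed directly. The following is a minimal sketch that assumes Python's `math.erf` as a stand-in for the printed table (this is not a method described in the article, and the function names are illustrative). It evaluates the probability of a weight of 65 kg or less, the probability of a weight between 65 and 74 kg as the difference of two cumulative areas, and the areas within one and two standard deviations of the mean.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Area under the standardized normal curve from -infinity to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def z_score(x, mu, sigma):
    """Derived variable z = (x - mu) / sigma, as in Eq. [11]."""
    return (x - mu) / sigma


mu, sigma = 70.0, 3.0  # mean and standard deviation of weight (kg), from the text

# z values rounded to two decimals, mimicking a lookup in a printed table
z_65 = round(z_score(65.0, mu, sigma), 2)
z_74 = round(z_score(74.0, mu, sigma), 2)

p_65_or_less = standard_normal_cdf(z_65)
p_74_or_less = standard_normal_cdf(z_74)
p_between = p_74_or_less - p_65_or_less  # difference of two cumulative areas

# Areas within one and two standard deviations of the mean
within_1_sd = standard_normal_cdf(1.0) - standard_normal_cdf(-1.0)
within_2_sd = standard_normal_cdf(2.0) - standard_normal_cdf(-2.0)

print(f"z for 65 kg: {z_65}, z for 74 kg: {z_74}")
print(f"P(weight <= 65 kg)       = {p_65_or_less:.4f}")
print(f"P(65 kg <= weight <= 74) = {p_between:.4f}")
print(f"area within mu +/- sigma   = {within_1_sd:.4f}")
print(f"area within mu +/- 2 sigma = {within_2_sd:.4f}")
```

Because z is rounded to two decimals, as it would be for a table lookup, the results agree with the tabulated 0.0475 and with 0.8607 for the interval to about three decimal places. Using the unrounded z values changes the third decimal slightly.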
This probability can be seen as the difference between the probability of a weight of 74 kg or less and the probability of a weight of 65 kg or less (Eq. [15]):

$$P(65 \le x \le 74) = P\left(\frac{65 - 70}{3} \le z \le \frac{74 - 70}{3}\right) = P(-1.67 \le z \le 1.33) = P(-\infty < z \le 1.33) - P(-\infty < z \le -1.67) \qquad [15]$$

We already know that (Eq. [16]):

$$P(-\infty < z \le -1.67) = 0.0475 \qquad [16]$$

In the table we can also find the value for (Eq. [17]):

$$P(-\infty < z \le 1.33) = 0.9082 \qquad [17]$$

Our probability is (Eq. [18]):

$$P = 0.9082 - 0.0475 = 0.8607\ (86.07\%) \qquad [18]$$

Conclusions

The probability distributions are a common way to describe, and possibly predict, the probability of an event. The main point is to define the character of the variables whose behaviour we are trying to describe through probability (discrete or continuous). The identification of the right category will allow a proper application of a model (for instance, the standardized normal distribution) that would easily predict the probability of a given event.

Acknowledgements

Disclosure: The authors declare no conflict of interest.

References

1. Daniel WW. eds. Biostatistics: a foundation for analysis in the health sciences. New York: John Wiley & Sons, 1995.
2. Kolmogorov AN. eds. Foundations of Theory of Probability. Oxford: Chelsea Publishing, 1950.
3. Lim E. Basic statistics (the fundamental concepts). J Thorac Dis 2014;6:1875-8.
4. Standard Normal Distribution Table. Available online: http://www.mathsisfun.com/data/standard-normal-distribution-table.html

Cite this article as: Viti A, Terzi A, Bertolaccini L. A practical overview on probability distributions. J Thorac Dis 2015;7(3):E7-E10. doi: 10.3978/j.issn.2072-1439.2015.01.37.

