ME 288 Data Analysis Lab:


Histogram and Probability Density Function (Pdf)

A good way of understanding a Pdf is to start with a histogram. A histogram is a preferred
graphical way of presenting data that have been collected in categories. Data samples grouped
into categories or bins cannot be plotted in scatter plots.

Histograms are similar to bar charts except that:

(i) A histogram is drawn to represent the proportion (fraction) in each category (bin).
(ii) Bar width represents the range for the category (bin).
(iii) Bar height is given by Height = Fraction/width.
(iv) Bar area (not height) represents the proportion.
(v) The bars are adjacent (no gaps) because the abscissa is a continuous variable.
(vi) The proportion in each category is also the probability of belonging to that category. In
other words, the next data point is most likely to fall in the category with the
highest proportion (area).

Histograms and bar charts look very different if the category widths are not the same.
Please look at the example at the top of: http://en.wikipedia.org/wiki/Histogram.

For example, we would make a bar chart with two bars to show the number of men and women
in a population. Let's do a simple example for a histogram:

The NEW ENERGY COMPANY makes pressurized heat pipes to sell commercially. The
product performs well but there is concern that the heat pipes may burst. A co-op student is hired
to test the burst pressure of heat pipes that are manufactured. He runs tests on 20 samples and
gets the following table to make a histogram:

Category/bin
Pressure range (atm)   Midpoint   # heat pipes burst   Proportion = fraction bursting   Histogram height = fraction/width
3.5 - 4.5              4          6                    6/20 = 0.3 = 30%                 0.3
4.5 - 5.5              5          10                   10/20 = 0.5 = 50%                0.5
5.5 - 6.5              6          4                    4/20 = 0.2 = 20%                 0.2
3.5 - 6.5              Total      20                   20/20 = 1 = 100%                 1
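
The table columns can be reproduced with a few lines of code. The following is a minimal sketch in Python, assuming the bin edges and burst counts from the table; the names `edges`, `counts`, and `histogram_heights` are illustrative and not part of the lab handout.

```python
# Bin edges (atm) and number of burst heat pipes per bin, from the table.
edges = [3.5, 4.5, 5.5, 6.5]
counts = [6, 10, 4]


def histogram_heights(edges, counts):
    """Return (fractions, heights), where height = fraction / bin width."""
    total = sum(counts)
    fractions = [c / total for c in counts]
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    heights = [f / w for f, w in zip(fractions, widths)]
    return fractions, heights


fractions, heights = histogram_heights(edges, counts)

# Bar area = height * width; the areas give back the proportions and sum to 1.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:])]

for lo, hi, f, h in zip(edges[:-1], edges[1:], fractions, heights):
    print(f"{lo}-{hi} atm: fraction = {f:.2f}, height = {h:.2f}")
print(f"total area = {sum(areas):.2f}")
```

Because every bin here is 1 atm wide, the heights are numerically equal to the fractions; with unequal bin widths the two columns would differ.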

In this simple case the category widths are the same (1 atm), so each bar height is numerically
equal to its fraction. The proportion for each category is the same as the probability that the
next data point falls in that pressure range.

We can get more points if we make smaller bins or categories, e.g. 3.0 to 3.1, 3.1 to 3.2, etc.
When we do that, the histogram shape approaches the probability density function (Pdf).
Therefore, histograms are an approximation of a Pdf.
We can draw a Pdf curve to approximate this distribution by using the 3 points from the above
table. Using EXCEL, we can get the curve going through these 3 points:

$$\mathrm{pdf}(x) = -25x^2 + 245x - 550 \quad \text{in \%, or}$$

$$\mathrm{pdf}(x) = -0.25x^2 + 2.45x - 5.50 \quad \text{in fractions.}$$
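
The handout obtains this curve with EXCEL. As a cross-check, the sketch below fits the same quadratic in Python using Lagrange interpolation through the three (midpoint, height in %) points (4, 30), (5, 50), (6, 20). This is an assumed alternative to the EXCEL fit, and the function name `quadratic_through` is illustrative only.

```python
from fractions import Fraction


def quadratic_through(points):
    """Coefficients (a, b, c) of a*x^2 + b*x + c through three (x, y) points."""
    pts = [(Fraction(x), Fraction(y)) for x, y in points]
    a = b = c = Fraction(0)
    for i, (xi, yi) in enumerate(pts):
        # The other two x-values define the Lagrange basis polynomial for xi:
        # (x - xj)(x - xk) / ((xi - xj)(xi - xk))
        (xj, _), (xk, _) = [p for j, p in enumerate(pts) if j != i]
        w = yi / ((xi - xj) * (xi - xk))
        a += w
        b -= w * (xj + xk)
        c += w * xj * xk
    return a, b, c


# (bin midpoint in atm, histogram height in %) from the table
points = [(4, 30), (5, 50), (6, 20)]
a, b, c = quadratic_through(points)
print(a, b, c)  # -25 245 -550
```

Dividing the three coefficients by 100 gives the fractional form, -0.25, 2.45 and -5.50.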

The plot is shown below:



Definition: The probability density function is a curve; the area under the curve in an interval
gives the probability of a data point to be in that interval. It can be obtained by smoothing a
histogram.

If we use P as the probability, then:

Mathematically,

$$\mathrm{pdf}(x) = \frac{dP}{dx} \approx \frac{\Delta P}{\Delta x}$$

Therefore the probability:

For the interval $\Delta x$ from $a$ to $b$:

$$\Delta P = \int_a^b \mathrm{pdf}(x)\,dx$$

= area under the curve (just like the histogram!)

What is the probability that the next data point will belong to the range 3.5 to 6.5? The answer is
given by:
$$P(3.5 \le x \le 6.5) = \int_{3.5}^{6.5} \left(-25x^2 + 245x - 550\right) dx = 93.75\% \approx 94\% \approx 100\%$$
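
This integral can be checked by evaluating the antiderivative of the quadratic at the two limits. The sketch below assumes the percent form of the Pdf, with coefficients -25, 245 and -550; the helper name `probability` is hypothetical.

```python
from fractions import Fraction

# Coefficients of pdf(x) = a*x^2 + b*x + c, in % per atm.
COEFFS = (Fraction(-25), Fraction(245), Fraction(-550))


def probability(lo, hi, coeffs=COEFFS):
    """Area under the quadratic pdf between lo and hi (in %)."""
    a, b, c = coeffs

    def antiderivative(x):
        x = Fraction(x)
        return a * x**3 / 3 + b * x**2 / 2 + c * x

    return antiderivative(hi) - antiderivative(lo)


# Limits are passed as strings so Fraction represents 3.5 and 6.5 exactly.
p = probability("3.5", "6.5")
print(float(p))  # 93.75
```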

We can do this calculation for other intervals and compare with the histogram. We expect the
Pdf to approximate the histogram:

3.5 to 4.5: $$P(3.5 \le x \le 4.5) = \int_{3.5}^{4.5} \left(-25x^2 + 245x - 550\right) dx = 27.9\% \approx 30\%$$

4.5 to 5.5: $$P(4.5 \le x \le 5.5) = \int_{4.5}^{5.5} \left(-25x^2 + 245x - 550\right) dx = 47.9\% \approx 50\%$$

5.5 to 6.5: $$P(5.5 \le x \le 6.5) = \int_{5.5}^{6.5} \left(-25x^2 + 245x - 550\right) dx = 17.9\% \approx 20\%$$
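
The three bin-by-bin comparisons can be tabulated in one pass. The following sketch assumes the percent-form Pdf with coefficients -25, 245 and -550 and the histogram proportions 30%, 50% and 20%; the names `pdf_percent_integral`, `bins`, and `rows` are illustrative.

```python
from fractions import Fraction


def pdf_percent_integral(lo, hi):
    """Exact integral of -25x^2 + 245x - 550 (in % per atm) from lo to hi."""

    def antiderivative(x):
        return Fraction(-25, 3) * x**3 + Fraction(245, 2) * x**2 - 550 * x

    return antiderivative(Fraction(hi)) - antiderivative(Fraction(lo))


# (lower edge, upper edge, histogram proportion in %) for each bin
bins = [("3.5", "4.5", 30), ("4.5", "5.5", 50), ("5.5", "6.5", 20)]

rows = []
for lo, hi, hist in bins:
    p = round(float(pdf_percent_integral(lo, hi)), 1)
    rows.append((lo, hi, p, hist))
    print(f"{lo}-{hi} atm: Pdf area = {p}%, histogram = {hist}%")
```

Each Pdf area falls about 2 percentage points below the corresponding histogram proportion, which is consistent with the 93.75% total found for the full range 3.5 to 6.5.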
