

2. Statistical and Probability Tools for Cost Engineering


2.1 Introduction

Statistics is the field of study in which data are collected for the purpose of drawing conclusions and making inferences. Descriptive statistics is the summarization and description of data; inferential statistics is the estimation, prediction, and/or generalization about a population based on the data from a sample.

Four elements are essential to inferential statistical problems:

1. Population is the collection of all elements of interest to the decision-maker. The size of the population is usually denoted by N. Very often, the population is so large that a complete census is out of the question. Sometimes not even a small population can be examined entirely, because obtaining the data may be destructive or prohibitively expensive. Under these situations, we draw inferences based upon a part of the population (called a sample).
2. Sample is a subset of data randomly selected from a population. The size of a sample is usually denoted by n.
3. Statistical inference is an estimation, prediction, or generalization about the population based on the information from the sample.
4. Reliability is the measurement of the "goodness" or "confidence level" of the inference.

Mean (average): The mean is the sum of the measurements divided by the number of measurements.

Population mean: µ = (sum of all values in the population)/N

Sample mean: x̄ = (sum of all values in the sample)/n

Median: The median is the middle number when the data observations are arranged in ascending or descending order. If the number n of measurements is even, the median is the average of the two middle measurements in the ranking. For a symmetric data set, the mean equals the median. If the median is less than the mean, the data set is skewed to the right. If the median is greater than the mean, the data set is skewed to the left.

Mode: The mode is the measurement that occurs most often in the data set. If the observations have two modes, the data set is said to have a bimodal distribution. When the data set is multi-modal, the mode is no longer a viable measure of central tendency. In a large data set, the modal class is the class containing the largest frequency; the simplest way to define the mode is then as the midpoint of the modal class.
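
As a quick illustration, here is a minimal Python sketch (the cost figures are hypothetical) that computes all three measures for a small sample:

```python
import statistics

# Hypothetical sample of unit costs (in $1,000s)
costs = [12, 15, 15, 18, 21, 24, 95]

print(statistics.mean(costs))    # about 28.6 -- pulled upward by the outlier 95
print(statistics.median(costs))  # 18 -- the middle of the 7 ranked values
print(statistics.mode(costs))    # 15 -- the most frequent value
```

Note that the median (18) is less than the mean (about 28.6), so this hypothetical data set is skewed to the right, consistent with the rule above.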

Figure 2.1 Mean and Median

Comparison of the Mean, Median, and Mode
The mean is the most commonly used measure of central location. However, it is affected by extreme values. For example, the high incomes of a few employees will influence the mean income of a small company. In such situations, the median may be a better measure of central tendency. The median is of most value in describing large data sets and is often used in reporting salaries, ages, sale prices, and test scores. The mode is frequently applied in marketing.

Numerical Measures of Variability

Measures of central tendency do not describe the spread of the data set,
which may be of greater interest to the decision-maker.

Range: The difference between the largest and the smallest values of the data set. The range uses only the two extreme values and ignores the rest of the data set. One instinctive attempt to measure dispersion would be to find the deviation of each value from the mean and then calculate the average of these deviations. This value is always zero, and the answer is no accident: the positive and negative deviations cancel exactly. An alternative is to calculate the average absolute deviation. However, this measure is rarely used because it is difficult to handle algebraically and lacks the convenient mathematical properties possessed by the variance.

Variance: The average of the squared deviations from the mean.

The population variance, denoted by σ², is σ² = Σ(x − µ)²/N.

The sample variance, denoted by s², is s² = Σ(x − x̄)²/(n − 1).

Standard Deviation: The positive square root of the variance. The population standard deviation is denoted by σ. The sample standard deviation is denoted by s.
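
A minimal Python sketch of these definitions, using hypothetical data:

```python
import statistics

# Hypothetical sample of labor-hour estimates
data = [40, 44, 47, 51, 58]

print(statistics.pvariance(data))  # population variance (divides by N)
print(statistics.variance(data))   # sample variance (divides by n - 1)
print(statistics.pstdev(data))     # population standard deviation
print(statistics.stdev(data))      # sample standard deviation
```

The sample versions divide by n − 1 rather than n, the usual correction when a sample is used to estimate a population quantity.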

Measures of Relative Standing

Another measure of interest is the relative location of a particular observation within a data set. In any data set, the pth percentile is the number such that exactly p% of the measurements fall below it and (100 − p)% fall above it when the data are arranged in ascending or descending order.
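
For example, a short sketch using NumPy (the bid figures are hypothetical; NumPy's default method interpolates between ranked values, one common convention for computing percentiles):

```python
import numpy as np

# Hypothetical bid prices (in $1,000s)
bids = np.array([210, 225, 240, 255, 270, 300, 340, 410])

# The 75th percentile: roughly 75% of the bids fall below this value
print(np.percentile(bids, 75))  # 310.0 under the default interpolation
```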

2.2 Random Variables and Some Important Probability Distributions

Analyzing a frequency distribution for every decision-making situation would be very time consuming. Fortunately, many physical events that appear to be unrelated have the same underlying characteristics and can be explained by the same laws of probability. The mathematical model used to represent frequency distributions is called a probability distribution. Understanding the concept and knowing which distribution to use in a particular situation saves considerable time and effort in the decision-making process.

Random Variable: A random variable is a variable whose numerical value is determined by the outcome of a random experiment. A random experiment is one that may produce different results in spite of all efforts to keep the conditions of performance constant. If a random variable can take on only a countable number of values, we call it a discrete random variable, such as the number of sales made by a salesperson in a given day. Random variables that can assume any value within some interval or intervals are called continuous random variables, such as the length of time an employee is late for work.

Probability Distribution: A probability distribution is expressed as a table listing all possible values that a random variable can take on, together with the associated probabilities. Notice that the random variable itself is denoted by X, while the small x denotes a particular value of X. The symbol p(x) = Pr(X = x) means the probability that the experiment yields the value x. In a discrete probability distribution, p(x) ≥ 0 for all values of x and Σ p(x) = 1.

Discrete Random Variables

There are several theoretical discrete probability distributions that have extensive applications in decision-making. One will be introduced in this section.

Binomial Distribution

Many decisions are of the either/or variety. A company bidding for a contract may either get the contract or not. The responses to a public opinion poll may be either "favor" or "oppose". Many experiments (situations) have only two possible alternatives, such as yes/no, pass/fail, or acceptable/defective.

Consider a series of experiments which have the following properties:

1. The experiment is performed n times under identical conditions.
2. The result of each experiment can be classified into one of two categories, say, success (S) and failure (F).
3. The probability of a success, denoted by p, is the same for each experiment. The probability of a failure is denoted by q. Note that q = 1 − p.
4. Each experiment is independent of all the others.

The binomial random variable X is the number of successes in n experiments. The probability of x successes in n experiments is

p(x) = [n!/(x!(n − x)!)] p^x q^(n−x), x = 0, 1, 2, …, n.

The name binomial arises from the fact that the probabilities p(x), x = 0, 1, 2, …, n, are the terms of the binomial expansion (q + p)^n.
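
A minimal Python check of this formula (the bidding scenario is hypothetical):

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Hypothetical example: a contractor bids on 10 independent contracts,
# winning each with probability 0.3; probability of exactly 4 wins.
print(binomial_pmf(4, 10, 0.3))  # about 0.200
```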

Continuous Random Variables

The probability distribution for a continuous random variable is often denoted by f(x) and is called a probability density function. The primary difference between probabilities for discrete and continuous random variables is that while probabilities for a discrete random variable are defined for specific values of the variable, the probabilities of a continuous random variable are defined for a range of values of the variable. The graphic form of f(x) is a smooth curve, and the area under the curve corresponds to probabilities for x. For example, the area A beneath the curve between the two points "a" and "b" is the probability Pr(a < x < b).

Figure 2.2 Continuous Distribution

Because there is no area over a single point, the probability associated with any particular value of x, say x = a, is equal to zero. Thus Pr(a < x < b) = Pr(a ≤ x ≤ b); in other words, the probability is the same regardless of whether the endpoints of the interval are included. The total area under the curve, which is the total probability for x, equals 1.

The areas under most probability density functions are obtained by the use of calculus or other numerical methods. This is often a difficult procedure. However, as with commonly used discrete probability distributions, tables exist for finding probabilities under commonly used continuous probability distributions.
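
To make the area-probability connection concrete, here is a small Python sketch that approximates Pr(−1 < x < 1) for a standard normal density numerically (a simple trapezoidal rule stands in for the calculus or tables mentioned above):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x) of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under(f, a, b, steps=10_000):
    """Trapezoidal approximation of the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h

# Pr(-1 < x < 1) for a standard normal: about 0.6827
print(area_under(normal_pdf, -1.0, 1.0))
```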

The Normal Distribution

The most important continuous distribution in statistical decision-making is the normal distribution. It is important for the following reasons:

1. Many observed variables are normally distributed.
2. Many of the procedures used in statistical inference require the assumption that the population is normal.

Figure 2.3 Two Normal Distributions

The graph of a normal distribution is called a normal curve and it has the
following characteristics:

1. It is bell-shaped and symmetrical about the mean. The mean, median, and mode are all equal. Probability density decreases symmetrically as x values move away from the mean in either direction.
2. Since the total area (probability) under the curve is 1, the area on each side of the mean is 1/2.
3. The curve approaches but never touches the horizontal axis. However, when the value of X is more than three standard deviations from the mean, the curve approaches the axis so closely that the area under the curve beyond that point is negligible.

The Standard Normal Distribution

If X is a normally distributed random variable with mean µ and standard deviation σ, the random variable Z, defined by Z = (X − µ)/σ, is a normally distributed variable with µ = 0 and σ = 1. The probability distribution of Z is called the standard normal distribution. Notice that z gives the number of standard deviations that the value of x lies above or below the mean.
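
A small Python sketch of the transformation (the cost figures are hypothetical; the standard normal probability comes from the error function in the standard library):

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """Pr(Z <= z) for the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical example: cost is normally distributed with mean $100K, sd $15K.
# Probability that the cost falls below $120K:
z = z_score(120, 100, 15)      # z = 1.33: $120K is 1.33 sd above the mean
print(standard_normal_cdf(z))  # about 0.909
```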

Figure 2.4 A Normal Probability Distribution

2.3 Monte Carlo Simulation

Throughout many examples and case discussions, analytical techniques have been used to develop or approximate the probability distribution of a system's cost. However, using analytical techniques has its limitations. A system's work breakdown structure (WBS) cost model can contain cost estimating relationships too complex for strict analytical study. In such circumstances, a technique known as the Monte Carlo method is frequently used.

The Monte Carlo method falls into a class of techniques known as simulation. Simulation has varying definitions among practitioners. One definition is: "Simulation is a numerical technique for conducting experiments on a digital computer, which involves certain types of mathematical and logical models that describe the behavior of a business or economic system over extended periods of real time." With easy access to powerful microcomputers and applications software (such as electronic spreadsheets), simulation is a widely used problem-solving technique in management science and operations research.

The Monte Carlo method involves the generation of random variables from known, or assumed, probability distributions. The process of generating random variables from such distributions is known as random variable generation or Monte Carlo sampling. Simulations driven by sampling are known as Monte Carlo simulations. Monte Carlo simulation became (and remains) a popular approach for studying cost uncertainty, as well as for evaluating the cost-effectiveness of a system's design alternatives. In concert with Rubinstein's definition, the WBS serves as the mathematical-logical cost model of the system within which to conduct the simulation. In this context, the steps in a Monte Carlo simulation are as follows (a code sketch follows the list):

1. For each random variable defined in the system's WBS, randomly select (sample) a value from its distribution function, which is known (or assumed).
2. Once a set of feasible values for each random variable has been established, combine these values according to the mathematical relationships specified across the WBS. This process produces a single value for the system's total cost.
3. Repeat the above two steps n times (e.g., ten thousand times). This produces n values, each representing a possible (i.e., feasible) value for the system's total cost.
4. Develop a frequency distribution from these n values. This distribution is the simulated (i.e., empirical) distribution of total system cost.
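
The sketch below illustrates these four steps in Python for a hypothetical three-element WBS; the cost elements, their distributions, and their parameters are all assumptions chosen for illustration:

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

def total_cost():
    """Steps 1 and 2: sample each WBS element cost from its assumed
    distribution, then combine per the WBS relationship (here, a sum)."""
    hardware = random.triangular(3.0, 6.0, 4.0)  # $M, assumed triangular
    software = random.normalvariate(2.5, 0.4)    # $M, assumed normal
    testing = random.uniform(0.8, 1.4)           # $M, assumed uniform
    return hardware + software + testing

# Step 3: repeat n times.
n = 10_000
samples = [total_cost() for _ in range(n)]

# Step 4: 'samples' is the simulated (empirical) distribution of total cost;
# summary statistics can be read directly from it.
print(statistics.mean(samples))
print(statistics.stdev(samples))
```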

In cost uncertainty analysis, Monte Carlo simulations are generally static simulations. Static simulations are those used to study the behavior of a system (or model) at a specific point in time. In contrast, dynamic simulations are those used to study such behavior as it changes over time.

Decision-makers require an understanding of how uncertainties in a system's cost and schedule interact. A decision-maker might bet on a high-risk schedule in hopes of keeping the system's cost within requirements. On the other hand, the decision-maker may be willing to assume more cost for a schedule with a small chance of being exceeded. Such tradeoffs are common for decision-makers on systems engineering projects. A family of joint probability distributions provides an analytical basis for computing these tradeoffs, using joint and conditional cost-schedule probabilities. This family is a set of mathematical models that might be hypothesized for capturing the joint interaction between cost and schedule.

Monte Carlo Simulation - Crystal Ball - Examples

The following example shows how to enhance a simple what-if spreadsheet model into a more complex quantitative risk analysis model using Monte Carlo simulation.

Initial situation
A product is to be developed, tested, and introduced into the market. To examine the PROFIT & LOSS potential of the product, the following model has been constructed:

Model

As shown above, the expenses required to research, produce, ship, and market the product add up to $6 million. The expected number of units sold, 400,000, at $18 a unit will give us revenue of $7.2 million. The expected profit will therefore be $1,200,000. Under these assumptions the product shows a profit.
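
The deterministic model is a one-line calculation; a sketch in Python using the figures above:

```python
# Deterministic what-if model of PROFIT & LOSS
expenses = 6_000_000     # research + production + shipping + marketing, $
units_sold = 400_000     # expected units
unit_price = 18          # $ per unit

revenue = units_sold * unit_price   # $7,200,000
profit = revenue - expenses         # $1,200,000
print(profit)
```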

What-if scenarios
We have seen that if our model follows our expectations, we will achieve a profit of $1,200,000. However, this is a very unstable reading, as the following example demonstrates:

If we were to assume that the product could sell not 400,000 units but only 300,000, the product would no longer achieve a profit but a loss of $600,000. This demonstrates how changing only one variable can result in the product never even reaching the development stage. A classical what-if analysis helps us understand the bounds of our PROFIT & LOSS value by supplying worst, probable, and best case values to our model.

After examining this result table, we have not reached a conclusion but
merely gathered more information, allowing us to depict the range of profit
and loss for our product. We do not know the probability of any given value,
nor do we know what exact PROFIT & LOSS we can expect.

Combination of what-if scenarios:
We assume that not all variables will take on their most negative value at the same time. To evaluate every single case we would have to simulate all numbers for all variables, an extremely time-consuming procedure. And even if we had all the results, we would still not know the probability of reaching a certain outcome.

Monte Carlo simulation
Crystal Ball can be used to specify a probability distribution for every input identified as an assumption cell. Then, after running the simulation model, Crystal Ball automatically performs a statistical analysis for every output we have specified as a forecast cell. By using probabilistic assumptions and analyzing forecasts statistically, Crystal Ball helps us gain insights that we could not extract from a deterministic model. These insights lead to better decisions in the probabilistic situations we face in real-world applications.

Assigning a distribution
Let us now concentrate on each variable separately, starting with the research costs. We can assume that the research costs will vary between $1.3 million and $2.3 million. We can also assume that each value will emerge with the same probability. Graphically depicted, our assumptions would look like this:

As all values are equally likely to appear, this distribution is called a uniform distribution.

The uniform distribution is only one of many (usually more complex) distributions that can be chosen for a variable. Looking at the marketing strategy cost, we can assume that our costs will be in the vicinity of $1.5 million. We also know that our costs are as likely to be above as below this number, but realize that they are more likely to be near this value than far away from our expectations. Graphically, our assumptions for our marketing strategy costs would look something like this:

This distribution is called the normal distribution, and it theoretically describes many natural phenomena.

Running a simulation
If we assume each of our variable costs (research, production, shipping, and wages) is uniformly distributed, and our marketing strategy costs are normally distributed, we can set a minimum and maximum range for each, using our worst and best case examples from before. Let us also assume that the number of units sold, as well as the unit price, is uniformly distributed between the worst and best case examples. The PROFIT & LOSS value, which is a calculated value, we will define as our forecast cell.

After defining all variables (assumptions and forecasts), we are ready to perform a simulation. For every trial, a number will be picked randomly for each assumption cell, from within the range specified and with the distribution specified. From these assumptions, the forecast cell will be calculated and the result graphically plotted, producing a graph that shows each result achieved in the simulation and the probability of reaching it.
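
Outside a spreadsheet add-in, the same experiment can be sketched in plain Python. Only the research-cost range ($1.3M to $2.3M), the $1.5M marketing mean, and the 300,000 to 400,000 unit bounds come from the example above; every other range and the marketing standard deviation are assumptions for illustration, so the resulting certainty will not match Crystal Ball's 25.25%:

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

def profit_trial():
    """One trial of the PROFIT & LOSS model with probabilistic inputs."""
    research = random.uniform(1.3e6, 2.3e6)         # range from the example
    production = random.uniform(1.0e6, 1.6e6)       # assumed range
    shipping = random.uniform(0.4e6, 0.8e6)         # assumed range
    wages = random.uniform(0.6e6, 1.0e6)            # assumed range
    marketing = random.normalvariate(1.5e6, 0.2e6)  # mean from example; sd assumed
    units = random.uniform(300_000, 400_000)        # worst/best case units sold
    price = random.uniform(16, 20)                  # assumed range around $18
    expenses = research + production + shipping + wages + marketing
    return units * price - expenses

trials = [profit_trial() for _ in range(100_000)]
certainty = sum(t > 0 for t in trials) / len(trials)
print(f"Certainty of achieving a profit: {certainty:.2%}")
```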

Interpreting the results
Once the simulation has started, a window will appear and the user will be presented with a graphic illustration of the results.

This chart shows the PROFIT & LOSS forecast achieved in the 100,000 trials specified. On the horizontal axis we see the PROFIT & LOSS expected. On the vertical axis we see the probability with which a certain PROFIT & LOSS is expected. At the bottom of the window, we are shown that with 100% certainty, the total will be within the range from negative infinity to infinity. If we want to know the probability of achieving a profit, we can change the lower bound of our range to 0. From this diagram we can read that the certainty of achieving a profit is 25.25%. It is therefore unlikely that this product will be produced.

Quantitative Risk Management:
We can now estimate, with a stated degree of certainty, how much profit our product will achieve. However, Crystal Ball also lets us analyze which factors affect our forecast variable, and by how much.

From this sensitivity chart, we can see that the number of units sold has the
greatest effect on our business model and that our production costs will
have a relatively small effect on our final PROFIT & LOSS.

2.4 References

1. AACE International. 2004. Skills & Knowledge of Cost Engineering, 5th edition, Section 7.

2. Paul R. Garvey. 1999. Probability Methods for Cost Uncertainty Analysis. Marcel Dekker, Inc.
