
Core 6: Business Analytics (MS6192)

Business analytics: need – scope – applications – descriptive analytics – predictive analytics – prescriptive analytics; Descriptive analytics – types of data – creating distributions from data – measures of location – measures of variability – measures of association.

Data Visualization for Managers: Visualization Imperative – Message to Charts – Visual Perception – Grammar of Graphics (Using R) – Component-Level Design of Tables and Graphs – Storytelling Using Visualization;

Introduction to multivariate statistical analysis techniques: multivariate linear regression models, principal component analysis, linear discriminant analysis, factor analysis, evaluation metrics and model diagnostics for regression models; logistic regression, decision trees, cluster analysis, causality tests, forecasting techniques (AR, MA, ARMA and ARIMA models) using R.

Books:
1. Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams, "Essentials of Business Analytics", Cengage Learning
2. Sandhya Kuruganti, "Business Analytics: Applications to Consumer Marketing", McGraw Hill
3. Bernard Marr, "Big Data: Using Smart Big Data, Analytics and Metrics to Make Better Decisions and Improve Performance", Wiley
4. Andrie de Vries and Joris Meys, "R For Dummies", Wiley, 2015
5. Cooper DR & Schindler PS, "Marketing Research Concepts and Cases", TMH, 2006

Business analytics: need – scope – applications – descriptive analytics – predictive analytics – prescriptive analytics; Descriptive analytics –

Descriptive analytics: importance, benefits, & example

Have you ever wondered how businesses make sense of the mountains of data they collect
daily? The secret lies in the realm of data analytics, where one type of analytics – descriptive
analytics – transforms raw data into invaluable insights.

In this article, we’ll explore the importance of descriptive analytics, examine its remarkable
benefits, and showcase real-world examples of how various industries use it successfully.

Prepare to unlock the mysterious world of descriptive analytics and explore how it could
elevate your business. By the time you finish reading, you’ll be armed with the knowledge
needed to harness your data and steer your organization toward greater success.

What is descriptive analytics?

Descriptive analytics is one of the foundational aspects of data analytics that transforms raw
data into easily understood patterns, trends, and insights. It’s a prime example of data
aggregation that uses business intelligence and data science. This analytics process focuses
on giving decision-makers an overview of historical data and an understanding of how certain
events or actions unfolded.

Unlike predictive analytics or prescriptive analytics, descriptive analytics isn’t about


predicting future outcomes or recommending a course of action. Instead, it gives you a clear
snapshot of past data so you can understand the key factors that contributed to specific
situations.

Now that we have a basic idea of what descriptive analytics is, let’s dive into its purpose.
Imagine having a massive amount of data and trying to make the most of it – descriptive
analytics works to present the data required in a more digestible format.

Organizations can then spot important developments, challenges, and opportunities that can
shape future strategies and improvements. They can also leverage these insights to monitor
key performance indicators (KPIs) and assess how well certain initiatives are doing.

How can descriptive analytics help your business?

Let’s go over how descriptive analytics can help your organization.

Enhancing business performance: Descriptive analytics helps businesses identify data trends and patterns. For a simple example, imagine a clothing store tracking past sales metrics and noticing that jackets sell like hotcakes during the fall.

This insight offers the business an understanding of customer behavior, which ultimately
helps it develop targeted marketing strategies, increase sales, and boost performance.

Leveraging historical data: One of the most powerful things about descriptive analytics is its
ability to give meaning to historical data. Businesses can use past data to gain insights into
the root cause of what shaped their current situation. For example, take a company that uses
AI, machine learning, and descriptive analytics to analyze historical sales data and customer
demographics.

This analysis can deliver tangible benefits for demand forecasting. By understanding patterns
in previous sales, the company can better predict future product demands and adjust its
inventory accordingly. This helps the company avoid excess stock, minimize waste, and
improve its bottom line.

Improving communication: Descriptive analytics can work wonders when it comes to


packaging complex data into something easily digestible. Let’s say a team leader wants to
share information on project progress with their team and stakeholders.

Using descriptive analytics, they can convert that raw data into visually appealing charts or
graphs. This helps ensure everyone gets the picture without swimming through a sea of
numbers, making communication much more effective and enjoyable.

Enabling data-driven decisions: Lastly, descriptive analytics empowers businesses to make


well-informed decisions by providing solid data. Imagine a restaurant owner examining
customer reviews to gauge a dish’s popularity.

Descriptive analytics could highlight patterns and trends, such as a specific dish receiving
rave reviews or another with less-than-stellar ratings. The owner can then decide to promote
the popular dish or improve the one that’s not performing well. This data-driven approach
enhances decision-making and increases the chances of achieving business goals.

What are some data analysis and visualization techniques to try?

Now, we’ll explore some data analysis and visualization techniques that are at your disposal
thanks to descriptive analytics. These analytics tools help paint a full picture of our datasets
while keeping them interesting and easy to understand.

Data mining: Data mining is essentially treasure hunting in the world of data analysis. As a
descriptive analysis technique, it involves sifting through large data sets to identify patterns,
trends, and correlations that tell an informative story behind a given situation.

Data mining helps businesses make sense of their data by uncovering those valuable nuggets
of information hidden beneath the surface.

Charts and graphs: A picture is worth a thousand words, and that’s especially true when it
comes to presenting data. Line graphs, pie charts, and bar charts all help communicate
complex data in a simple, visual format.

Charts and graphs make it easier for businesses to quickly identify trends or anomalies, so
stakeholders can grasp the information and act accordingly, or get on board with new plans
and proposals.

Visualization tools: Tools like Tableau take data analysis and visualization to the next level.
You can use intuitive drag-and-drop interfaces to create eye-catching visuals that really bring
your data to life, even if you don’t have graphic design or coding skills.
And the best part? These tools save you a ton of time and effort compared to wrestling with
Excel. Advanced visualization capabilities let you explore data from multiple angles, identify
hidden patterns, and tell a compelling narrative.

Dashboards: The days of flipping through countless spreadsheets are behind us. In the realm
of descriptive analytics, dashboards offer a one-stop shop for all your key metrics,
attractively displayed in real-time. They consolidate and present important data in a way
that’s engaging and easy to understand.

Customizable dashboards tailored to specific roles or goals enable stakeholders to quickly


gauge the performance of various business aspects. This accessibility supports faster
decision-making and keeps everyone on the same page.

How do you apply descriptive analytics in action?

Let’s take the magic of descriptive analytics from theory to practice. We’ll explore a few
real-world examples of descriptive analytics in action to help you get the hang of it. These
use cases show the power of descriptive analytics and how it can help your business flourish.

Gaining customer insight

Businesses must understand customers’ preferences, habits, and behaviors to truly connect
with them. Descriptive analytics lets you analyze data from sources like customer reviews,
purchase history, feedback forms, and surveys.

Identifying data trends and patterns equips you with valuable insights that enable you to tailor
your products or services to customers’ needs and preferences.

Monitoring business performance

Keeping tabs on how your business is doing is essential. Descriptive analytics allows you to
track various metrics and key performance indicators (KPIs), giving you a clear picture of
your business’s health.
For instance, you can analyze sales trends to determine which products are most popular, or
dig into website data to identify areas needing improvement. Knowledge is power, and these
insights help you make data-driven decisions that can boost performance and help your
business grow.

Improving marketing campaigns

Descriptive analytics can help you optimize your campaigns by analyzing data points, such as
social media engagement, email open rates, or number of subscribers. By understanding what
works well and what doesn’t, you can tweak your strategies and allocate resources more
efficiently.

For example, Australian-based swimming pool builder Narellan Pools experienced a decline
in sales and knew they needed a targeted marketing strategy. So, the company compiled and
analyzed five years of marketing data and used the insights to drive a 23% increase in sales in
one year, spending only 70% of its media budget.
Supply chain management
Optimizing supply chain efficiency is a big win for businesses (think synchronized inventory
control, reduced lead times, and seamless logistics). Descriptive analytics can help you
achieve this by analyzing data related to supplier performance, inventory levels, and
transportation.
These trends can help you identify bottlenecks or inefficiencies so you can take timely steps
to improve your overall supply chain performance.

How can you drive better business decisions with data?


Descriptive analytics helps businesses better understand their customers, improves
workflows, fine-tunes marketing campaigns, and has the power to totally transform decision-
making for the better.

As we step further into the age of data-driven decision-making, businesses should make the
most of descriptive analytics to stay competitive and agile amidst changing markets.
Mastering these techniques can pave the way for smarter, more informed decisions, helping
your business stay on top.

Predictive Analytics: Techniques and Applications

In today's era of constantly connected technology, ever-growing volumes of data are being collected and analyzed, and predictive analytics plays an important role in making sense of them. The modern digital tech sector employs predictive analytics throughout the business and IT domains to attain a competitive edge.

"Predictive analytics, an advanced form of data analytics, learns from historical data behaviour and anticipates future outcomes as data-driven insights."

Let's start learning predictive analytics in terms of its techniques, workflow and applications.

Understanding Predictive Analytics

Predictive analytics is a form of advanced data analytics that makes predictions about future outcomes by analyzing previous data. To do this, it combines statistical modelling, data mining and machine learning tools and techniques to produce accurate and actionable insights. The science of predictive analytics can build future insights with a significant degree of accuracy.

For example, with the help of predictive analytics tools and models, organizations can find patterns in past data and identify risks and opportunities.
Presently, companies have a flood of data residing across transactional databases, equipment
log files and media files (images, videos, documents), sensors and other data resources. Data
experts examine this information to gain insights and make predictions about future events
using predictive analytics.

For example, companies use it to improve the bottom line and gain competitive advantage. The process encompasses a variety of techniques, discussed in the next section.

In short, predictive analytics is the practice of combining predictive algorithms with historical data to compute the likelihood that something will take place.
Predictive Analytics Techniques

In todays’ industries involving healthcare, life sciences, oil and gas, insurance, etc, predictive
analytics is widely employed in these areas and provides most valued anticipations when
business strategies and applications are clearly defined.

Predictive analytics incorporates a combination of scientific methods and techniques, as discussed below:

 Data Mining: Data mining aims to manage large data sets, structured or unstructured, to recognize hidden patterns and relationships among the variables provided. Once identified, these relationships can be used to understand the behaviour of the event from which the data was compiled.
 Statistical Modelling: In parallel with the data mining process, statistical models can be developed, depending on what needs to be anticipated, using the same collected data as for data mining. Once the model is built, new data is fed to it to predict future outcomes. For example, a business expert can build a cross-selling model using current customer data and predict what other items customers are likely to purchase from the same company.
 Machine Learning: ML deploys iterative methods and techniques to identify patterns in large data sets and build models. For example, recommendation engines are widely used for online shopping, where predictions are made from customers' prior purchasing and browsing behavior.
Why is Predictive Analytics Important?

Turning to predictive analytics helps organizations address complex business problems and discover potential opportunities. Some common benefits include:

1. Fraud detection

In general, multiple analytical methods are combined to analyze data, improving the accuracy of pattern detection, catching criminal behavior and preventing repeated fraud. With growing cybersecurity concerns, there is demand for high-performing behavioral analytics that examines suspicious behavior and activities across a network in real time to detect fraud, zero-day vulnerabilities and underlying threats.

2. Marketing campaigns optimization

Predictive analytics is beneficial in optimizing marketing campaigns and promotional events. By determining customers' purchasing responses and publicizing cross-sell opportunities, predictive models help businesses attract, retain and grow their most valuable customers.

3. Minimization in risks

Consider a simple example of risk reduction: credit scores, which are widely used to assess the likelihood of default based on a user's credit behaviour. In practice, a credit score is a number generated by a predictive model that analyzes data relevant to a person's credit history. Other risk-related examples include insurance claims and fraud claim collections.

4. Improvements in business operations:

Predictive analytics enables organizations to make smarter decisions by making operations and functions run more smoothly and efficiently.

Many companies use predictive analytics for everything from managing resources to forecasting inventory in order to streamline operational efficiency. For example, airlines use predictive analytics to set ticket prices, and hotels use it to predict the number of guests on a particular day or hour so they can maximize occupancy and increase revenue.

5 Applications of Predictive Analytics

1. Marketing

Consumers are bombarded with a pool of advertising and marketing. Individuals working in the marketing domain need to know how consumers will react to a particular marketing campaign, and what its overall impact will be, before conducting such an event. In this case:

 Predictive analytics tools can help segment marketing leads by displaying ads on websites and social media platforms that relate to consumer behavior and interests.
 Predictive analytics tools can identify "likely to purchase" prospects by analyzing consumers' past and current behaviour to find people whose data matches that of ideal consumers.
 Marketers can also use predictive analytics for lead scoring: analyzing data to identify which prospects are potentially most valuable to the company, or to gauge how likely prospective consumers are to buy products or services, and to plan how and with what information they should be contacted.
2. Retail

Whether online or brick and mortar, every retailer needs to manage inventory and logistics, which makes predictive analytics extremely important. The method allows retailers to correlate large volumes of information, such as historical sales data, purchasing behavior and geographical references, to optimize operations and efficiency in the following ways:

 Customer sales data drives personalized recommendations and promotions for individual customers. Through predictive analytics, better targeting built on real-time data helps retailers plan campaigns, ads and promotions that buyers will respond to most.
 Analyzing sales and logistics data with predictive analytics helps retailers ensure sufficient inventory in warehouses and the right merchandise in stores at the right time.
 Timing sales and promotions has become an art; running predictive analytics over customer, inventory and historical sales data reveals suitable circumstances and timing for lowering or raising prices.
 Predictive analytics supports retailers in merchandise planning and price optimization, allowing them to investigate the impact of promotional events and figure out appropriate offers for consumers.

3. Manufacturing

With modern technology and fully automated factory machines, predictive analytics tools are very significant in operating and optimizing the manufacturing process at each stage: design, purchasing, development, quality and inventory control, delivery, and so on. Moreover:

 Predictive analytics is helpful when combined with machine data, making it possible to track and compare machines' performance and equipment maintenance status and to predict which particular machine will fail.
 Predictive analytics insights can reduce shipping and transportation expenses by accounting for all the factors involved in moving manufactured products between locations under the proper system.
 Applying predictions to supply chain and sales data supports more considered purchasing decisions, ensuring that expensive raw materials are not purchased unless required. This data can also be used to align manufacturing processes with consumer demand.

4. Healthcare

The healthcare industry is among the dominant adopters of predictive analytics techniques, using the technology to save money and improve the efficiency of health practices.
 Predictive analytics can help medical practitioners by analysing data on global disease statistics, drug interactions and individual patients' diagnostic histories to provide advanced care and conduct more effective medical practice.
 Applying predictive analytics to clinics' past appointment data helps identify probable no-shows or late cancellations more accurately, saving time and resources.
 The health insurance industry uses predictive analytics to detect claims fraud and to discover patients most at risk of chronic or incurable disease, helping companies find suitable interventions.

5. Finance

Applied across a broad spectrum of banking and financial services and activities, predictive analytics is a highly valuable process, helping with everything from assessing risks to maximizing customer satisfaction.

Predictive analytics is useful for:


 Preventing credit card fraud by flagging unusual transactions,
 Credit scoring to determine whether to approve or deny loan applications,
 Most importantly, analysing customers' churn data, enabling banks to approach potential defectors before they are likely to switch institutions,
 Measuring credit risk, maximizing cross-sell/up-sell opportunities and retaining valuable customers.
 Commonwealth Bank, for example, uses predictive analytics to anticipate fraudulent activity for a given transaction before it is completed, within 40 milliseconds of the transaction occurring.
types of data – creating distributions from data –

What is Distribution?

The distribution of a statistical dataset is the spread of the data which shows all possible
values or intervals of the data and how they occur.

A distribution is simply a collection of data or scores on a variable. Usually, these scores are
arranged in order from ascending to descending and then they can be presented graphically.

The distribution provides a parameterized mathematical function which will calculate the
probability of any individual observation from the sample space.

Before moving on to distributions, it is important to understand the term "data", which is critical for the data analyst/data scientist.

What is Data?

Data is a collection of information (numbers, words, measurements, observations) about


facts, figures and statistics collected together for analysis.

Example: Distribution of Categorical Data (True/False, Yes/No): it shows the number (or percentage) of individuals in each group.

How to Visualize Categorical Data: Bar Plot, Pie Chart and Pareto Chart.

Distribution of Numerical Data (Height, Weight and Salary): first, the data is sorted in order and grouped based on similarity; it is then represented in graphs and charts to examine the amount of variance in the data.

How to Visualize Numerical Data: Histogram, Line Plot and Scatter Plot.


Measurement Levels of Data

Qualitative:
- Nominal – brand name, zip code, gender
- Ordinal – grades, star reviews, position in a race

Quantitative:
- Interval – temperature in Celsius, year of birth, dates
- Ratio – height, age, weight

What does data do? In what ways does it matter most?

1. Identifying relationships between two variables
2. Predicting and forecasting the future based on previous data trends
3. Determining patterns that exist in the dataset
4. Detecting fraud and anomalies
Why are distributions important?

Sampling distributions are important in statistics because we collect samples in order to estimate the parameters of the population distribution. Distributions are therefore necessary for making inferences about the overall population.

For example, the most common measures of how samples differ from each other are the standard deviation and the standard error of the mean.

Difference between Frequency and Probability Distribution

Frequency Distribution: records how often an event occurs; it is based on actual observations.
Probability Distribution: records the likelihood that an event will occur; it is based on theoretical assumptions about what should happen.

Frequency Distribution:

The number of times each numerical value occurs.

Probability Distribution:

List of Probabilities associated with each of its possible numerical values.

Types of Distributions

 Bernoulli Distribution
 Uniform Distribution
 Binomial Distribution
 Normal Distribution
 Poisson Distribution
 Exponential Distribution
Python Libraries for Distributions
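The code snippets for this section did not survive extraction, so the sketches below are reconstructions assuming the common SciPy stack (NumPy, SciPy, Matplotlib) — one standard choice of libraries, not necessarily the source's exact tooling.

```python
# Shared setup assumed by the distribution sketches below:
# NumPy for arrays, scipy.stats for distribution objects, Matplotlib for plots.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
```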

Bernoulli Distribution

A special case of the binomial distribution. It is a discrete probability distribution with exactly two possible outcomes – 1 (success) and 0 (failure) – and a single trial.

Example: in cricket, tossing a coin leads to winning or losing the toss; there is no intermediate result. The occurrence of a head denotes success, and the occurrence of a tail denotes failure.

Suppose the probability of success (1) is 0.4 and of failure (0) is 0.6.

Bernoulli Distribution in Python
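A minimal sketch, assuming scipy.stats and the p = 0.4 success probability used above:

```python
from scipy.stats import bernoulli

p = 0.4                                    # probability of success, as above
rv = bernoulli(p)
print(rv.pmf([0, 1]))                      # [0.6, 0.4]: P(failure), P(success)
print(rv.rvs(size=10, random_state=42))    # ten simulated single trials (0s and 1s)
```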

Normal Distribution

It is otherwise known as the Gaussian distribution, and it is a symmetric distribution. It is a type of continuous probability distribution that is symmetric about the mean. The majority of the observations cluster around the central peak.

It is a bell-shaped curve.

Examples: performance appraisals, height, blood pressure, measurement error and IQ scores all follow a normal distribution.
Mean = Median = Mode

The standard normal distribution is a normal distribution with µ = 0 and σ = 1.

Basic Properties:

The normal distribution runs between −∞ and +∞.
Zero skewness: the distribution is symmetrical about the mean.
Zero excess kurtosis.
68% of the values are within 1 SD of the mean.
95% of the values are within 2 SD of the mean.
99.7% of the values are within 3 SD of the mean.
Normal Distribution in Python
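A minimal sketch (scipy.stats assumed) that plots the standard normal curve and checks the 68-95-99.7 rule:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 0, 1                               # the standard normal: µ = 0, σ = 1
x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma))  # the bell-shaped curve
plt.title("Standard normal distribution")
plt.show()

# Probability mass within 1, 2 and 3 SD of the mean (~0.68, 0.95, 0.997):
for k in (1, 2, 3):
    print(k, "SD:", norm.cdf(k) - norm.cdf(-k))
```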

Binomial Distribution

The most widely known discrete probability distribution, it has been used for hundreds of years.

Assumptions:

1. The experiment involves n identical trials.
2. Each trial has only two possible outcomes – success or failure.
3. Each trial is independent of the previous trials.
4. The terms p and q remain constant throughout the experiment, where p is the probability of getting a success on any one trial and q = (1 – p) is the probability of getting a failure on any one trial.
Binomial Distribution in Python
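A minimal sketch (scipy.stats assumed), with n = 10 trials and p = 0.5 as illustrative values:

```python
from scipy.stats import binom

n, p = 10, 0.5                 # 10 identical independent trials, P(success) = 0.5
rv = binom(n, p)
print(rv.pmf(4))               # probability of exactly 4 successes
print(rv.cdf(4))               # probability of at most 4 successes
print(rv.mean(), rv.var())     # mean = np = 5.0, variance = npq = 2.5
```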

Poisson Distribution

It is the discrete probability distribution of the number of times an event is likely to occur within a specified period of time. It is used for independent events that occur at a constant rate within a given interval of time.

The number of occurrences in each interval can range from zero to infinity (0 to ∞).

Examples:

1. The number of black cars in a random sample of 50 cars
2. The number of cars arriving at a car wash during a 20-minute interval
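The source has no code placeholder here, but for symmetry with the other distributions, a minimal sketch (scipy.stats assumed, with an illustrative average of 5 cars per 20-minute interval):

```python
from scipy.stats import poisson

lam = 5                # illustrative rate: 5 cars per 20-minute interval on average
rv = poisson(lam)
print(rv.pmf(3))       # probability that exactly 3 cars arrive in the interval
print(1 - rv.cdf(8))   # probability that more than 8 cars arrive
```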
Uniform Distribution

It is a continuous distribution, also called the rectangular distribution. It describes an experiment whose outcome lies between certain boundaries.

Examples:
1. The time to fly from Newark to Atlanta ranges from 120 to 150 minutes; if we monitor the flight time for many commercial flights, it will follow more or less a uniform distribution.
2. The time taken by students to finish a one-hour test may range from 50 to 60 minutes, with an approximately equal number of students finishing across intervals within this range – 50, 54, 56, 58 and 60 minutes. The finishing time of the test can be approximated by a uniform distribution.
3. The time for a pizza delivery from Nanganallur to Alandur may range uniformly from 20 to 30 minutes from the time the delivery man leaves the Pizza Hut.
Uniform Distribution in Python
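A minimal sketch (scipy.stats assumed), using the 120-150 minute flight-time range from example 1; note that scipy parameterizes the uniform distribution by loc (the start) and scale (the width):

```python
from scipy.stats import uniform

rv = uniform(loc=120, scale=30)     # flight time uniform on [120, 150] minutes
print(rv.mean())                    # 135.0
print(rv.cdf(140) - rv.cdf(130))    # P(130 <= X <= 140) = 10/30 ~= 0.333
```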

Gamma Distribution

It deals with continuous variables that take on a wide range of values, such as individual call times. With it we can model probabilities across any range of possible values using a gamma distribution function, which has two parameters: a shape parameter (α) and a scale parameter (β).

Examples:

 The amount of rainfall accumulated in a reservoir
 The size of loan defaults and aggregate insurance claims
 The flow of items through manufacturing and distribution processes
 The load on web servers

Gamma Distribution in Python
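A minimal sketch (scipy.stats assumed), with illustrative shape and scale values:

```python
from scipy.stats import gamma

alpha, beta = 2.0, 3.0                  # illustrative shape (α) and scale (β)
rv = gamma(a=alpha, scale=beta)
print(rv.mean())                        # αβ = 6.0
print(rv.pdf(4.0))                      # density at x = 4
print(rv.rvs(size=5, random_state=0))   # five simulated values
```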

Exponential Distribution

It is concerned with the amount of time until some specific event occurs.

Examples:

 The amount of time until an earthquake occurs has an exponential distribution.
 The length of business telephone calls.
 The length of time a car battery lasts.
 The amount of money customers spend on one trip to the supermarket follows an exponential distribution: there are more people who spend small amounts of money and fewer people who spend large amounts of money.
The exponential distribution is widely used in the field of reliability.

Note: Reliability deals with the amount of time a product lasts.
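The source has no code placeholder here either; for completeness, a minimal sketch (scipy.stats assumed, with an illustrative mean call length of 4 minutes):

```python
from scipy.stats import expon

mean_duration = 4.0               # illustrative: calls last 4 minutes on average
rv = expon(scale=mean_duration)   # scipy's scale parameter equals the mean (1/rate)
print(rv.cdf(5))                  # P(call ends within 5 minutes)
print(rv.sf(10))                  # P(call lasts longer than 10 minutes)
```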


measures of location – measures of variability – measures of association.

Measures of Location
Measures of location describe the central tendency of the data. They include the mean,
median and mode.

Mean or Average
The (arithmetic) mean, or average, of n observations, written x̄ (pronounced "x bar"), is simply the sum of the observations divided by the number of observations; thus:

x̄ = Σxᵢ / n

In this equation, xᵢ represents the individual sample values and Σxᵢ their sum. The Greek letter Σ (sigma) is the Greek capital 'S' and stands for 'sum'. The calculation is described in Example 1, below.

Median
The median is defined as the middle point of the ordered data. It is estimated by first
ordering the data from smallest to largest, and then counting upwards for half the
observations. The estimate of the median is either the observation at the centre of the
ordering in the case of an odd number of observations, or the simple average of the
middle two observations if the total number of observations is even. More specifically, if
there are an odd number of observations, it is the [(n+1)/2]th observation, and if there are
an even number of observations, it is the average of the [n/2]th and the [(n/2)+1]th
observations.

Example 1 Calculation of mean and median


Consider the following 5 birth weights, in kilograms, recorded to 1 decimal place:
1.2, 1.3, 1.4, 1.5, 2.1
The mean is defined as the sum of the observations divided by the number of
observations. Thus mean = (1.2+1.3+…+2.1)/5 = 1.50kg. It is usual to quote 1 more
decimal place for the mean than the data recorded.
There are 5 observations, which is an odd number, so the median value is the (5+1)/2 =
3rd observation, which is 1.4kg. Remember that if the number of observations was even,
then the median is defined as the average of the [n/2]th and the [(n/2)+1]th. Thus, if we
had observed an additional value of 3.5kg in the birth weights sample, the median would
be the average of the 3rd and the 4th observation in the ranking, namely the average of
1.4 and 1.5, which is 1.45kg.
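Example 1 can be reproduced with Python's standard statistics module (a minimal sketch; the source does not prescribe any tooling):

```python
import statistics

weights = [1.2, 1.3, 1.4, 1.5, 2.1]        # birth weights in kg
print(statistics.mean(weights))             # 1.5
print(statistics.median(weights))           # 1.4, the middle of 5 ordered values

# With an even number of observations the median averages the middle two:
print(statistics.median(weights + [3.5]))   # (1.4 + 1.5) / 2 = 1.45
```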

Advantages and disadvantages of the mean and median


The major advantage of the mean is that it uses all the data values, and is, in a statistical
sense, efficient.
The main disadvantage of the mean is that it is vulnerable to outliers. Outliers are single observations which, if excluded from the calculations, have a noticeable influence on the results. For example, if we had entered '21' instead of '2.1' in the calculation of the mean in Example 1, we would find the mean changed from 1.50kg to 5.28kg. It does not necessarily follow, however, that outliers should be excluded from the final data summary, or that they always result from an erroneous measurement.
The median has the advantage that it is not affected by outliers, so for example the
median in the example would be unaffected by replacing '2.1' with '21'. However, it is not
statistically efficient, as it does not make use of all the individual data values.

Mode
A third measure of location is the mode. This is the value that occurs most frequently, or,
if the data are grouped, the grouping with the highest frequency. It is not used much in
statistical analysis, since its value depends on the accuracy with which the data are
measured; although it may be useful for categorical data to describe the most frequent
category. The expression 'bimodal' distribution is used to describe a distribution with two
peaks in it. This can be caused by mixing populations. For example, height might appear
bimodal if one had men and women in the population. Some illnesses may raise a
biochemical measure, so in a population containing healthy and ill people one might
expect a bimodal distribution. However, some illnesses are defined by the measure (e.g.
obesity or high blood pressure) and in this case the distributions are usually unimodal.

Measures of Dispersion or Variability


Measures of dispersion describe the spread of the data. They include the range,
interquartile range, standard deviation and variance.

Range and Interquartile Range


The range is given as the smallest and largest observations. This is the simplest measure
of variability. Note in statistics (unlike physics) a range is given by two numbers, not the
difference between the smallest and largest. For some data it is very useful, because one
would want to know these numbers, for example knowing in a sample the ages of
youngest and oldest participant. If outliers are present it may give a distorted impression
of the variability of the data, since only two observations are included in the estimate.
Quartiles and Interquartile Range
The quartiles, namely the lower quartile, the median and the upper quartile, divide the
data into four equal parts; that is there will be approximately equal numbers of
observations in the four sections (and exactly equal if the sample size is divisible by four
and the measures are all distinct). Note that there are in fact only three quartiles and these
are points not proportions. It is a common misuse of language to refer to being ‘in the top
quartile’. Instead one should refer to being ‘in the top quarter or ‘above the top quartile’.
However, the meaning of the first statement is clear and so the distinction is really only
useful to display a superior knowledge of statistics! The quartiles are calculated in a
similar way to the median; first arrange the data in size order and determine the median,
using the method described above. Now split the data in two (the lower half and upper
half, based on the median). The first quartile is the middle observation of the lower half,
and the third quartile is the middle observation of the upper half. This process is
demonstrated in Example 2, below.
The interquartile range is a useful measure of variability and is given by the lower and
upper quartiles. The interquartile range is not vulnerable to outliers and, whatever the
distribution of the data, we know that 50% of observations lie within the interquartile
range.

Example 2 Calculation of the quartiles


Suppose we had 18 birth weights arranged in increasing order:
1.51, 1.53, 1.55, 1.55, 1.79, 1.81, 2.10, 2.15, 2.18,
2.22, 2.35, 2.37, 2.40, 2.40, 2.45, 2.78, 2.81, 2.85
The median is the average of the 9th and 10th observations (2.18+2.22)/2 = 2.20 kg. The
first half of the data has 9 observations so the first quartile is the 5th observation, namely
1.79kg. Similarly the 3rd quartile would be the 5th observation in the upper half of the
data, or the 14th observation, namely 2.40 kg. Hence the interquartile range is 1.79 to
2.40 kg.
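A sketch implementing the median-of-halves method described above (note that library defaults such as numpy.percentile use a different interpolation rule and may give slightly different quartiles):

```python
import statistics

weights = [1.51, 1.53, 1.55, 1.55, 1.79, 1.81, 2.10, 2.15, 2.18,
           2.22, 2.35, 2.37, 2.40, 2.40, 2.45, 2.78, 2.81, 2.85]

def quartiles(data):
    """Lower quartile, median, upper quartile via the median-of-halves method."""
    x = sorted(data)
    n = len(x)
    lower, upper = x[:n // 2], x[(n + 1) // 2:]   # halves, excluding an odd median
    return statistics.median(lower), statistics.median(x), statistics.median(upper)

print(quartiles(weights))  # (1.79, 2.20, 2.40) -> interquartile range 1.79 to 2.40 kg
```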

Standard Deviation and Variance


The standard deviation of a sample (s) is calculated as follows:

s = √[ Σ(xᵢ − x̄)² / (n − 1) ]

The expression Σ(xᵢ − x̄)² is interpreted as: from each individual observation (xᵢ) subtract the mean (x̄), then square this difference. Next add each of the n squared differences. This sum is then divided by (n − 1). This expression is known as the sample variance (s²). The variance is expressed in square units, so we take the square root to return to the original units, which gives the standard deviation, s. Examining this expression it can be seen that if all the observations were the same (i.e. x₁ = x₂ = x₃ = … = xₙ), then they would equal the mean, and so s would be zero. If the x's were widely scattered about, then s would be large. In this way, s reflects the variability in the data. The calculation of the standard deviation is described in Example 3. The standard deviation is vulnerable to outliers, so if the 2.1 were replaced by 21 in Example 3 we would get a very different result.

Example 3 Calculation of the standard deviation

Consider the data from Example 1. The calculations required to determine the sum of the squared differences from the mean are given in Table 1, below. We found the mean to be 1.5kg. We subtract this from each of the observations. Note that the mean of this column is zero. This will always be the case: the positive deviations from the mean cancel the negative ones. A convenient method for removing the negative signs is squaring the deviations, which is given in the next column. These values are then summed to get a value of 0.50 kg². We need to find the average squared deviation. Common sense would suggest dividing by n, but it turns out that this actually gives an estimate of the population variance which is too small. This is because we are using the estimated mean in the calculation when we should really be using the true population mean. It can be shown that it is better to divide by the degrees of freedom, which is n minus the number of estimated parameters, in this case n − 1. An intuitive way of looking at this is to suppose one had n telegraph poles each 100 meters apart. How much wire would one need to link them? As with variation, here we are not interested in where the telegraph poles are, but simply how far apart they are. A moment's thought should convince one that n − 1 lengths of wire are required to link n telegraph poles.

Table 1 Calculation of the mean squared deviation

Observation (kg)   Deviation from mean (kg)   Squared deviation (kg²)
1.2                −0.3                        0.09
1.3                −0.2                        0.04
1.4                −0.1                        0.01
1.5                 0.0                        0.00
2.1                 0.6                        0.36
Total               0.0                        0.50

From the results calculated thus far, we can determine the variance and standard deviation, as follows:
n = 5
Variance = 0.50/(5 − 1) = 0.125 kg²
Standard deviation = √(0.125) = 0.35 kg
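The same computation in Python (a minimal sketch using the standard library, which divides by n − 1 as described above):

```python
import statistics

weights = [1.2, 1.3, 1.4, 1.5, 2.1]
mean = statistics.mean(weights)                    # 1.5
squared_devs = [(x - mean) ** 2 for x in weights]
print(sum(squared_devs))                           # ~0.50 kg^2
print(statistics.variance(weights))                # 0.125 (sum / (n - 1))
print(statistics.stdev(weights))                   # ~0.3536, i.e. 0.35 kg
```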


Why is the standard deviation useful?
It turns out in many situations that about 95% of observations will be within two standard
deviations of the mean, known as a reference interval. It is this characteristic of the
standard deviation which makes it so useful. It holds for a large number of measurements
commonly made in medicine. In particular, it holds for data that follow a Normal
distribution. Standard deviations should not be used for highly skewed data, such as
counts or bounded data, since they do not illustrate a meaningful measure of variation,
and instead an IQR or range should be used. In particular, if the standard deviation is of a
similar size to the mean, then the SD is not an informative summary measure, save to
indicate that the data are skewed.
Standard deviation is often abbreviated to SD in the medical literature.

Data Visualization for Managers: Visualization Imperative – Message to Charts – Visual Perception – Grammar of Graphics (Using R) –

These topics cover various aspects of data visualization and the principles behind conveying information effectively through charts. Let's break down the terms and concepts:

1. Visualization Imperative: This refers to the necessity or requirement for visualizing


data. In many cases, data is complex and difficult to understand when presented in
raw form. Visualization helps in simplifying and clarifying the patterns, trends, and
relationships within the data, making it easier for humans to comprehend and derive
insights.
2. Message to Charts: This phrase suggests the process of translating a message or
information into visual representations, typically through charts or graphs. It involves
deciding which type of chart best communicates the intended message, selecting
appropriate data variables, and designing the visualization to effectively convey the
desired information.
3. Visual Perception: Understanding how humans perceive and interpret visual
information is crucial in designing effective visualizations. Principles such as Gestalt
psychology, which explains how humans perceive patterns and organize visual
elements, play a significant role. Factors like color, shape, size, and spatial
arrangement influence how data is interpreted visually.
4. Grammar of Graphics: Coined by Leland Wilkinson, the "Grammar of Graphics"
refers to a systematic framework for constructing visualizations. It breaks down the
process of creating visualizations into a set of grammar rules or components,
including data, aesthetics (visual properties like color, size, shape), scales (mapping
data values to aesthetics), and layers (combining multiple graphical elements). This
concept provides a structured approach to creating a wide range of visualizations.

In summary, effectively communicating data through visualizations requires understanding


the visualization imperative, translating messages into charts, considering principles of visual
perception, and applying frameworks like the Grammar of Graphics to design clear and
informative visual representations.
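To make the grammar concrete: the syllabus specifies R's ggplot2, but the same grammar is available in Python through plotnine (a port of ggplot2). The sketch below is an illustrative stand-in, not the course's prescribed code, and assumes plotnine with its bundled mtcars sample data:

```python
from plotnine import ggplot, aes, geom_point, geom_smooth, labs
from plotnine.data import mtcars   # sample dataset bundled with plotnine

plot = (
    ggplot(mtcars, aes(x="wt", y="mpg", color="factor(cyl)"))  # data + aesthetics
    + geom_point()                 # layer 1: one point per car
    + geom_smooth(method="lm")     # layer 2: a linear trend per cylinder group
    + labs(x="Weight (1000 lbs)", y="Miles per gallon", color="Cylinders")
)
plot.save("mtcars_grammar.png")
```

Each `+` adds one grammatical component (data, aesthetic mappings, layers, labels), which is exactly the compositional structure Wilkinson's framework describes.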

Component-Level Design of Tables and Graphs – Storytelling Using Visualization

Designing tables and graphs at a component level involves breaking down each element of the visualization and understanding its purpose and how it contributes to the overall communication of data. Here's a breakdown of some key components (a short code sketch follows the list):

1. Title: A clear and concise title provides context for the visualization and
communicates the main message or purpose.
2. Axis Labels: Axes should be labeled clearly to indicate the variables being
represented. Include units of measurement if applicable.
3. Data Points or Bars: These are the graphical representations of the data. Ensure that
they are accurately plotted and appropriately sized to reflect the underlying values.
4. Legends: If the visualization includes multiple data series or categories, a legend can
help clarify what each color or symbol represents.
5. Gridlines: Gridlines can aid in reading values from the axes and provide reference
points for comparison.
6. Annotations: Annotations such as labels, arrows, or text boxes can provide additional
context or highlight specific data points of interest.
7. Color Palette: Choose a color palette that is visually appealing and accessible for all
viewers. Avoid using colors that are too similar or difficult to distinguish.
8. Whitespace: Proper use of whitespace can improve the readability and clarity of the
visualization by separating different elements and reducing clutter.
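A minimal matplotlib sketch exercising most of these components (the data and styling choices are illustrative assumptions, not from the source):

```python
import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
jackets = [120, 90, 60, 40, 30, 25]     # hypothetical unit sales
tshirts = [40, 55, 80, 110, 140, 160]

fig, ax = plt.subplots()
ax.plot(months, jackets, marker="o", label="Jackets")   # data points (3)
ax.plot(months, tshirts, marker="s", label="T-shirts")
ax.set_title("Monthly unit sales by product line")      # title (1)
ax.set_xlabel("Month")                                  # axis labels (2)
ax.set_ylabel("Units sold")
ax.set_xticks(months)
ax.set_xticklabels(labels)
ax.legend(title="Product")                              # legend (4)
ax.grid(True, linestyle=":", alpha=0.5)                 # gridlines (5)
ax.annotate("Lines cross in April",                     # annotation (6)
            xy=(4, 110), xytext=(1.5, 145),
            arrowprops={"arrowstyle": "->"})
fig.tight_layout()                                      # whitespace (8)
plt.show()
```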

Now, when it comes to storytelling using visualization, the goal is to use these components to
craft a narrative that guides the viewer through the data and conveys a clear message or
insight. Here's how you can approach it:

1. Identify the Story: Start by understanding the key message or insight you want to
communicate with the data. What story does the data tell?
2. Sequence of Information: Arrange the components of the visualization in a logical
sequence that guides the viewer through the story. This might involve starting with an
overview and then delving into specific details.
3. Use of Visual Cues: Use visual cues such as color, size, and position to draw
attention to important points or trends in the data. Highlighting key data points can
help reinforce the narrative.
4. Contextualization: Provide context and background information to help the viewer
understand the significance of the data. This might include explaining any relevant
trends, patterns, or outliers.
5. Interactivity: If possible, incorporate interactive elements into the visualization that
allow viewers to explore the data further and draw their own conclusions.
6. Conclusion: Summarize the key findings or takeaways from the data and reiterate the
main message of the story.

By carefully designing each component of the visualization and weaving them together into a
cohesive narrative, you can effectively use visualization to tell a compelling story with your
data.

Introduction to multivariate statistical analysis techniques: multivariate linear regression models, principal component analysis, linear discriminant analysis, factor analysis, evaluation metrics and model diagnostics for regression models; logistic regression, decision trees, cluster analysis, causality tests, forecasting techniques (AR, MA, ARMA and ARIMA models) using R.
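The detailed material for this unit is not included here, and the syllabus specifies R. As a loose illustrative stand-in only (Python's statsmodels on synthetic data), here is a sketch of two of the listed techniques — a multiple linear regression with summary diagnostics and an ARIMA forecast:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)

# Multiple linear regression on synthetic data, with diagnostic summary.
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())            # coefficients, R-squared, residual diagnostics

# ARIMA(1,1,1) forecast on a synthetic random-walk series.
series = pd.Series(np.cumsum(rng.normal(size=200)))
arima = ARIMA(series, order=(1, 1, 1)).fit()
print(arima.forecast(steps=5))
```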
