Lecture 6: Modeling, Evaluation, and Visualization


1. Modeling and Evaluation

Understanding various models and techniques used for data analytics is an increasingly
important skill for accountants. In this module, we evaluate several different approaches and
models and identify when to use them and how to interpret the results. We also provide specific
accounting-related examples of when each of these specific data approaches and models is
appropriate to address our particular question.

Using the IMPACT cycle model, in Topic 5 we listed various approaches that business analysts
use to address business questions. Before we discuss these approaches, we need to bring you up
to speed on some data-specific terms:

• A target is an expected attribute or value that we want to evaluate. For example, if we
are trying to predict whether a transaction is fraudulent, the target might be a specific
“fraud score.” If we’re trying to predict an interest rate, the target would be “interest
rate.”
• A class is a manually assigned category applied to a record based on an event. For
example, if the credit department has rejected a credit line for a customer, the credit
department assigns the class “Rejected” to the customer’s master record. Likewise, if the
internal auditors have confirmed that fraud has occurred, they would assign the class
“fraud” to that transaction.

There are numerous models to choose from when evaluating a given set of data. The choice of
model depends on the desired outcome of the business question. If you don’t have a specific
question and are simply exploring the data for potential patterns of interest, you would use
an unsupervised approach. For example, consider the question: “Do our vendors form natural
groups based on similar attributes?” In this case, there isn’t a specific target because you don’t
yet know what similarities our vendors have. You may use clustering to evaluate the vendor
attributes and see which ones are closely related, shown in Figure 6.1.
 

Figure 6.1. Clustering
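
To make the clustering idea concrete, here is a minimal sketch of one common clustering technique, k-means, written in plain Python. The vendor names, the two attributes, and the choice of two clusters are all hypothetical, and the attributes are left unscaled for brevity; in practice you would normally standardize them first.

```python
from math import dist

# Hypothetical vendor attributes:
# (average order size in $1,000s, average days to deliver)
vendors = {
    "Vendor A": (5.0, 3.0),
    "Vendor B": (6.0, 4.0),
    "Vendor C": (5.5, 3.5),
    "Vendor D": (40.0, 12.0),
    "Vendor E": (42.0, 14.0),
    "Vendor F": (38.0, 13.0),
}


def k_means(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


points = list(vendors.values())
# Starting guesses: the first and last vendor (an arbitrary choice).
centroids = k_means(points, [points[0], points[-1]])

# Label each vendor with the index of its nearest final centroid.
labels = {
    name: min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
    for name, p in vendors.items()
}
print(labels)
```

Note that no target is supplied anywhere: the groups come only from how close the vendors' attributes are to one another.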

You could also use co-occurrence grouping to match vendors by geographic region; data
reduction to simplify vendors into obvious categories, such as wholesale or retail or based on
overall volume of orders; or profiling to evaluate vendors with similar on-time delivery
behavior, shown in Figure 6.2. In any of these cases, the data drive the decision, and you
evaluate the output to see if it matches your intuition. These exploratory exercises may help to
define better questions but are generally less useful for making decisions.

Figure 6.2. Profiling

 
On the other hand, we may ask questions with specific outcomes, such as: “Will a new vendor
ship a large order on time?” When you perform analysis that uses historical data to predict
a future outcome, you use a supervised approach: the historical data are used to create the
model. Using a classification model, you can predict whether a new vendor belongs to one class
or another based on the behavior of the others, shown in Figure 6.3. You might also
use regression to predict a specific value to answer a question such as, “How many days do we
predict it will take a new vendor to ship an order?” Again, the prediction is based on the activity
we have observed from other vendors, shown in Figure 6.4. Causal modeling, similarity
matching, and link prediction are additional supervised approaches in which you attempt,
respectively, to identify causation (which can be expensive), to identify a series of
characteristics that predict a model, or to identify other relationships.

Figure 6.3. Classification
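
As an illustration of classification, the sketch below predicts whether a new vendor will ship on time by taking a majority vote among the most similar historical vendors (a k-nearest-neighbors approach, chosen here only because it is easy to read). The history, the two attributes, and the function name `classify` are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical history:
# (order size in units, shipping distance in 100s of miles) -> observed class
history = [
    ((100, 1.0), "On time"),
    ((150, 2.0), "On time"),
    ((120, 1.5), "On time"),
    ((900, 8.0), "Late"),
    ((950, 9.0), "Late"),
    ((870, 7.5), "Late"),
]


def classify(new_vendor, history, k=3):
    """Return the most common class among the k most similar past records."""
    nearest = sorted(history, key=lambda record: dist(record[0], new_vendor))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((880, 8.2), history))
print(classify((130, 1.2), history))
```

The classes in the history were assigned from observed events, and the model simply carries them over to the new record.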

 
Figure 6.4. Regression
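
Regression predicts a value rather than a class. The sketch below fits a straight line by ordinary least squares to hypothetical observations of order size and days to ship, then uses the line to estimate shipping time for a new order. The data and the function name `fit_line` are assumptions made for illustration.

```python
# Hypothetical observations from existing vendors
order_sizes = [100, 200, 300, 400]   # units ordered
days_to_ship = [3, 6, 7, 10]         # days until shipment


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(order_sizes, days_to_ship)

# Predicted days to ship for a new vendor receiving a 250-unit order
predicted_days = intercept + slope * 250
print(f"Predicted days to ship: {predicted_days:.1f}")
```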

Ultimately, the model you use comes down to the questions you are trying to answer. The
flowchart in Figure 6.5 shows several decisions that will help you select an appropriate model, or
data approach. By evaluating your data, the question that needs to be addressed, and the
desired outcomes, you can determine an appropriate data approach. Once you’ve selected an
approach, your analysis can begin.

Figure 6.5. Flowchart to Help Choose an Appropriate Data Model
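
The first decisions described in this section can be sketched as a small function. This is only a simplification of the prose above, not a reproduction of Figure 6.5; the function name and its two parameters are hypothetical.

```python
def choose_approach(has_specific_target: bool, target_is_numeric: bool = False) -> str:
    """Suggest a data approach from two yes/no questions taken from the text."""
    if not has_specific_target:
        # Exploring for patterns without a specific question or target
        return "unsupervised (clustering, co-occurrence grouping, data reduction, profiling)"
    if target_is_numeric:
        # Predicting a specific value, e.g., days to ship
        return "supervised: regression"
    # Predicting membership in a class, e.g., on time vs. late
    return "supervised: classification"


print(choose_approach(has_specific_target=False))
print(choose_approach(has_specific_target=True, target_is_numeric=True))
print(choose_approach(has_specific_target=True, target_is_numeric=False))
```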

2. Data Visualization

Data are important, and Data Analytics are effective, but only to the extent that we can
communicate the results and make the data understandable. What would you do if you were an
intern and your boss asked you to supply information about where the customers her organization
serves are located? Would you simply point your boss to the Customers table in the sales
database? Would you go a step further and isolate the attributes to the Company Name and the
State? Perhaps you could go a step further still and run a quick query or PivotTable to
count the customers in each state that the company serves. To give your boss what she
actually wants, however, you should provide a short written summary of the answer to the
research question, as well as an organized chart to visualize the results. Data visualization
isn’t just for people who are “visual” learners. When the results of data analysis are
visualized appropriately, they are easier and quicker for everybody to interpret. Whether the
data you are analyzing are “small” data or “big” data, they still merit synthesis and
visualization to help your stakeholders interpret the results with ease and efficiency.
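
The "quick query or PivotTable" step can be sketched in a few lines of Python. The customer rows below are hypothetical stand-ins for a Customers table, and the text bar chart is just a placeholder for a proper chart.

```python
from collections import Counter

# Hypothetical rows from a Customers table: (company name, state)
customers = [
    ("Acme Supply", "TX"),
    ("Birch & Co.", "CA"),
    ("Cedar Foods", "TX"),
    ("Delta Goods", "NY"),
    ("Elm Street Cafe", "CA"),
    ("Fir Trading", "TX"),
]

# Count customers per state (the PivotTable-style summary)
counts = Counter(state for _, state in customers)

# Print a simple text bar chart, largest state first
for state, n in counts.most_common():
    print(f"{state:<3}{'#' * n} {n}")
```

Even this rough chart answers the boss's question faster than a pointer to the raw table would.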

Think back to some of the first data visualizations and categorizations you were exposed to (the
food guide pyramid/food plate, the animal kingdom, the periodic table) and, more recently,
to how frequently infographics are used on social media to break down complicated
information. These charts and infographics make it easier for people to understand difficult
concepts by breaking them down into categories and visual components.

2.1. Determine the Purpose of your Data Visualization

As with selecting and refining your analytical model, communicating results is more art than
science. Once you are familiar with the tools that are available, your goal should always be to
share critical information with stakeholders in a clear, concise manner. This could involve a chart
or graph, a callout box, or a few key statistics. Visualizations have become very popular over the
past three decades. Managers use dashboards to quickly evaluate key performance
indicators (KPIs) and quickly adjust operational tasks; analysts use graphs to plot stock price
and financial performance over time to select portfolios that meet expected performance goals.

In any project that will result in a visual representation of data, the first task is to ensure
that the data are reliable and that the content necessitates a visual. In our case, however, the
first three steps of the IMPACT model have already ensured that the data are reliable and
useful.

At this stage in the IMPACT model, determining the method for communicating your results
requires the answers to two questions:

1. Are you explaining the results of previously done analysis, or are you exploring the data
through the visualization? (Is your purpose declarative or exploratory?)
2. What type of data is being visualized (conceptual, qualitative data or data-driven,
quantitative data)?

A summary of the possible answers to these questions is shown in a chart in Figure 6.6. The
majority of the work that we will do with the results of data analysis projects will reside
in quadrant 2 of Figure 6.6, the declarative, data-driven quadrant. We will also do a bit of
work in Figure 6.6’s quadrant 4, the data-driven, exploratory quadrant. There isn’t as much
qualitative work to be done, although we will work with categorical qualitative data
occasionally. When we do work with qualitative data, it will most frequently be visualized using
the tools in quadrant 1, the declarative, conceptual quadrant.

 
Figure 6.6. The Four Chart Types
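
The two questions map onto the four quadrants as in the lookup below. Quadrants 1, 2, and 4 are named in the text; quadrant 3 (exploratory, conceptual) is inferred from the section headings that follow, and the function name `quadrant` is hypothetical.

```python
def quadrant(purpose: str, data_type: str) -> int:
    """Map (purpose, data type) to the quadrant number used in Figure 6.6."""
    table = {
        ("declarative", "conceptual"): 1,
        ("declarative", "data-driven"): 2,
        ("exploratory", "conceptual"): 3,   # inferred, not stated in the text
        ("exploratory", "data-driven"): 4,
    }
    return table[(purpose, data_type)]


# Most of the work in this course: presenting results of quantitative analysis
print(quadrant("declarative", "data-driven"))
```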

Once you know the answers to the two key questions and have determined which quadrant
you’re working in, you can determine the best tool for the job. Is a written report with a simple
chart sufficient? If so, Word will suffice. Will an interactive dashboard and repeatable report be
required? If so, Excel or Tableau may be a better tool.

2.2. Quadrants 1 and 3 versus Quadrants 2 and 4: Qualitative versus Quantitative

Qualitative data are categorical data. All you can do with these data is count them
and group them, and in some cases, you can rank them. Qualitative data can be further divided
into two types: nominal data and ordinal data.

Nominal data are the simplest form of data: categories with no inherent order. Examples of
nominal data are hair color, gender, and ethnic groups.

Increasing in complexity, but still categorized as qualitative data, are ordinal data. Ordinal data
can be counted and categorized like nominal data but go a step further: the categories
can also be ranked. Examples of ordinal data include gold, silver, and bronze medals, 1–5 rating
scales on teacher evaluations, and letter grades.

Beyond counting and possibly sorting (if you have ordinal data), the primary statistic used with
qualitative data is proportion. The proportion is calculated by counting the number of items in
a particular category, then dividing that number by the total number of observations. For
example, if I had a dataset of 150 people and had each individual’s corresponding hair color with
25 people in my dataset having red hair, I could calculate the proportion of red-haired people in
my dataset by dividing 25 (the number of people with red hair) by 150 (the total number of
observations in my dataset). The proportion of red-haired people, then, would be 16.7 percent.
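
The same calculation can be sketched in Python. Only the 25 red-haired people and the total of 150 come from the example above; the split of the remaining 125 people across other hair colors is made up for illustration.

```python
from collections import Counter

# 25 red-haired people out of 150 (from the example);
# the other categories are hypothetical filler that sums to 125.
hair_colors = ["red"] * 25 + ["brown"] * 70 + ["blond"] * 40 + ["black"] * 15

counts = Counter(hair_colors)
total = len(hair_colors)

# Proportion = count in the category / total number of observations
proportions = {color: n / total for color, n in counts.items()}

print(f"Red: {proportions['red']:.1%}")
```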

Qualitative data (both nominal and ordinal) can also be referred to as “conceptual” data because
such data are text-driven and represent concepts instead of numbers.

Quantitative data are more complex than qualitative data because not only can they be counted
and grouped just like qualitative data, but the differences between data points are
meaningful: the difference between 5 and 4 is a numerical measure that can be
compared with the difference between 5 and 3. Quantitative data are made up of observations
that are numerical and can be counted and ranked, just like ordinal qualitative data, but that
can also be averaged. A standard deviation can be calculated, and datasets can be easily
compared when standardized (if applicable).
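
A short sketch of these statistics follows, using Python's standard `statistics` module. The invoice amounts are hypothetical, and the population standard deviation (`pstdev`) is used on the assumption that the list is the whole dataset rather than a sample.

```python
import statistics

# Hypothetical quantitative data: invoice amounts in dollars
invoices = [100, 200, 300, 400, 500]

mean = statistics.mean(invoices)
std_dev = statistics.pstdev(invoices)  # population standard deviation

# Standardizing (z-scores) expresses each value as the number of standard
# deviations from the mean, which puts different datasets on a common scale.
z_scores = [(x - mean) / std_dev for x in invoices]

print(f"mean = {mean}, standard deviation = {std_dev:.2f}")
print([round(z, 2) for z in z_scores])
```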

Similar to qualitative data, quantitative data can be categorized into two different
types: interval and ratio.

Ratio data are considered the most sophisticated type of data, and the simplest way to express
the difference between interval and ratio data is that ratio data have a meaningful 0 and
interval data do not. In other words, for ratio data, a value of 0 means “the absence of.”
Consider money as ratio data: we can have 5 dollars, 72 dollars, or 8,967 dollars,
but as soon as we reach 0, we have “the absence of” money.

The other scale for quantitative data is interval data, which are not as sophisticated as ratio
data. Interval data do not have a meaningful 0; in other words, in interval data, 0 does not
mean “the absence of” but is simply another number. An example of interval data is the
Fahrenheit scale of temperature measurement, where 90 degrees is hotter than 70 degrees, which
is hotter than 0 degrees, but 0 degrees does not represent “the absence of” temperature—it’s just
another number on the scale.
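
One practical consequence of a meaningful 0 is that ratios only make sense for ratio data. The sketch below contrasts dollars with Fahrenheit temperatures; converting to Kelvin (a temperature scale that does have a true zero) is an illustration added here, not part of the original discussion, and the specific values are arbitrary.

```python
def fahrenheit_to_kelvin(f: float) -> float:
    """Convert degrees Fahrenheit to kelvins."""
    return (f - 32) * 5 / 9 + 273.15


# Ratio data: 10 dollars really is twice as much money as 5 dollars.
dollar_ratio = 10 / 5

# Interval data: 90 / 45 equals 2, but 90 degrees F is not "twice as hot"
# as 45 degrees F, because 0 degrees F is just another point on the scale.
naive_ratio = 90 / 45

# On a scale with a true zero, the same two temperatures are about 9% apart.
kelvin_ratio = fahrenheit_to_kelvin(90) / fahrenheit_to_kelvin(45)

print(dollar_ratio, naive_ratio, round(kelvin_ratio, 3))
```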

Quantitative data can be further categorized as either discrete or continuous data. Discrete
data are data that are represented by whole numbers. An example of discrete data is points in a
basketball game—you can earn 2 points, 3 points, or 157 points, but you cannot earn 3.5 points.
On the other hand, continuous data are data that can take on any value within a range. An
example of continuous data is height: you can be 4.7 feet, 5 feet, or 6.27345 feet. The difference
between discrete and continuous data can be blurry sometimes because you can express a
discrete variable as continuous—for example, the number of children a person can have is
discrete (a woman can’t have 2.7 children, but she could have 2 or 3), but if you are researching
the average number of children that women aged 25–40 have in the United States, the average
would be a continuous variable. Whether your data are discrete or continuous can also help you
determine the type of chart you create because continuous data lend themselves more to a line
chart than do discrete data.
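
The blurry boundary is easy to see in code: every observation below is a whole number, yet their average is not. The counts are hypothetical.

```python
import statistics

# Hypothetical discrete observations: number of children per woman surveyed
children = [2, 3, 1, 2, 3]

# Each observation is a whole number (discrete) ...
all_whole = all(isinstance(n, int) for n in children)

# ... but the average can take any value in a range (continuous).
average = statistics.mean(children)

print(all_whole, average)
```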

2.3. Quadrants 1 and 2 versus Quadrants 3 and 4: Declarative versus Exploratory

Declarative visualizations are the product of wanting to “declare” or present your findings to an
audience. The data analysis projects begin with a question, proceed through analysis, and end
with communicating those findings. This means that while the visualization may prompt
conversation and debate, the information provided in the charts should be solid.

On the other hand, you will sometimes use data visualizations for an exploratory purpose.
When this is done, the lines between steps P (perform test plan), A
(address and refine results), and C (communicate results) are not as clearly divided. Exploratory
data visualization aligns with performing the test plan within visualization software (for
example, Tableau) and gaining insights while you interact with the data. Exploratory
data are often presented in an interactive setting, and the questions from step I (identify
the questions) won’t have been answered before you work with the data in the
visualization software.

Figure 6.7 is similar to the four chart types presented in Figure 6.6, but Figure 6.7 has
more detail to help you determine what to do once you’ve answered the two key questions.
Remember that the quadrants represent two main questions:

1. Are you explaining the results of the previously done analysis, or are you exploring the
data through the visualization? (Is your purpose declarative or exploratory?)
2. What type of data is being visualized (conceptual qualitative data or data-driven
quantitative data)?

Once you have determined the answers to the first two questions, you are ready to begin
determining which type of visualization will be the most appropriate for your purpose and
dataset.

Figure 6.7. The Four Chart Types with Detail


3. Choosing the Right Chart

Once you have determined the type of data you’re working with and the purpose of your data
visualization, the next questions concern the design of the visualization (color, font,
graphics) and, most importantly, the type of chart or graph. The visual should speak for itself
as much as possible, without needing much explanation of what is being represented. Aim for
simplicity over bells and whistles that “look cool” but end up being distracting.

Because qualitative and quantitative data have such different levels of complexity and
sophistication, there are some charts that are not appropriate for qualitative data that do work for
quantitative data.

When it comes to visually representing qualitative data, the charts most frequently considered
are:

• Bar charts.
• Pie charts.
• Stacked bar charts.

The pie chart is probably the most famous (some would say infamous) data visualization for
qualitative data. It shows the parts of the whole; in other words, it represents the proportion of
each category as it corresponds to the whole dataset.

A bar chart likewise shows the proportion of each category, making it easy to compare each
category with the others.

In most cases, a bar chart is more easily interpreted than a pie chart because our eyes are more
skilled at comparing the heights of columns (or the lengths of horizontal bars, depending on the
orientation of your chart) than they are at comparing the sizes of pie slices, especially if the
proportions are relatively similar.

Consider the two different charts from the Sláinte dataset in Figure 6.8. Each compares the
proportion of each beer type sold by the brewery.

The magnitude of the difference between the Imperial Stout and the IPA is almost impossible to
see in the pie chart. This difference is easier to digest in the bar chart.

Of course, we could improve the pie chart by adding in the percentages associated with each
proportion, but it is much quicker for us to see the difference in proportions by glancing at the
order and length of the bars in a bar chart (Figure 6.9).

Figure 6.8. Pie Charts and Column Chart Show Different Ways to Visualize Proportions

 
Figure 6.9
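
To see why ordered bars make small differences easy to read, here is a minimal text-only sketch. The shares are hypothetical and are not the actual Sláinte figures; apart from IPA and Imperial Stout, the beer names are also made up.

```python
# Hypothetical share of units sold by beer type (not the actual Sláinte data)
shares = {
    "Pale Ale": 0.17,
    "Imperial Stout": 0.19,
    "Lager": 0.13,
    "IPA": 0.21,
    "Wheat": 0.14,
    "Stout": 0.16,
}

# Sort from largest to smallest so that order and bar length both carry meaning
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)

# One '#' per percentage point
for beer, share in ranked:
    print(f"{beer:<15}{'#' * round(share * 100)} {share:.0%}")
```

A two-point gap such as the one between the top two rows shows up as two extra characters in the bar, whereas the corresponding pie slices would look nearly identical.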

A summary of the chart types just described appears in Figure 6.10. Each chart option works
equally well for exploratory and declarative data visualizations. The chart types are categorized
based on when they will be best used (e.g., when comparing qualitative variables, a bar chart is
an optimal choice), but this figure shouldn’t be used to stifle creativity. Bar charts can also be
used to show comparisons among quantitative variables, just as many of the charts in the listed
categories can work well with data types and purposes other than their primary categorization
in the figure.

Figure 6.10. Summary of Chart Types

Bottom Line: 

As with selecting and refining your analytical model, communicating results is more art than
science. Once you are familiar with the tools that are available, your goal should always be to
share critical information with stakeholders in a clear, concise manner.
