3 Understanding Data

3.1 Business problem → Data problem

Figure 3-1 The CRISP Cycle

Business understanding requires knowledge of the business: its objectives, its products, its
customers, its processes and people, and so on. It is context specific, so it is difficult to give
guidance for this phase beyond the questions highlighted previously. In many instances, the
organization has difficulty articulating the issue, except in general terms.

In the summer of 2020, we were in the midst of the COVID-19 pandemic. Colleges and
universities were worried about their finances. Most Canadian and US institutions are highly
dependent upon tuition revenues. A top-of-mind question was: what is the impact of COVID-19
on revenue?

To answer the question, we needed data. But what data? The pandemic may affect different
students differently. What is the impact on local students? Those from out-of-province?
International students? Which programs can be offered online? How are returning students
affected compared to new students? How will online course delivery affect retention and then
future revenues? Will new students defer admission and how will this affect revenue in future
years?

The business problem has become a set of specific questions that may be answered with data
analysis. The business problem has become a set of data problems.
• What data do we need?
• Where can we get the data?
• How can we get the data?
• What will be the “quality” of the data?
• Will the data answer our questions?

3.2 How do we obtain our data?


There is so much data. We leave behind traces that we may not realize or see, just like the DNA
or fingerprint evidence analyzed in crime shows. One can think of all of these traces as the
population, and what we actually analyze as a sample.

Example: Credit Card Fraud Detection


If we are trying to build a model to detect whether a credit card transaction is fraudulent, we
could analyze billions of transactions that have occurred over the last 10 years, but this might
overwhelm us. The population is extremely large. Can we look at a subset (a sample) and
still make valid inferences?

Each transaction is an observation. Associated with the transaction we have data about
• the customer,
• the vendor,
• the amount,
• the time and date,
• the location of the vendor,
• was it tap, pin code, chip, swipe, data entry, paper-based?
• did the customer subsequently report the transaction as fraudulent?
• was there confirmatory evidence of fraud?
• ….

We want the sample to be a representative snapshot of transactions over the last decade.
Even a sample of a million observations is not necessarily an accurate picture. If there
had been 100 billion transactions over the last decade, we could select every 100,000th
transaction (100 billion is 100,000 × 1 million). But if transactions have been increasing, from 5
billion transactions in 2013 to 15 billion in 2022, we would be over-representing recent years
relative to the more distant past. Then again, maybe this is a good idea: recent data may be
more reflective of current transactions.

Ideally, we want a random sample, in which each member of the population has an equal chance
of being in the sample. In this manner, we are not introducing bias into our sample. There are
many ways we can inadvertently introduce bias, and in some cases, we won't be able to avoid
it.
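
To make the two designs concrete, here is a minimal sketch in Python, assuming the
transactions sit in a pandas DataFrame in chronological order; the function names and setup
are illustrative only, not from any real system.

    import pandas as pd

    def systematic_sample(transactions: pd.DataFrame, step: int) -> pd.DataFrame:
        """Every step-th transaction: the sample mirrors how volume is
        distributed over time, so busy recent years dominate."""
        return transactions.iloc[::step]

    def simple_random_sample(transactions: pd.DataFrame, n: int, seed: int = 1) -> pd.DataFrame:
        """Each transaction has an equal chance of selection."""
        return transactions.sample(n=n, random_state=seed)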

Example: What were people doing during Hurricane Sandy in 2012?


Cathy O’Neil and Rachel Schutt1 give the example of analyzing tweets at the time Hurricane
Sandy hit the east coast of the US in 2012. Looking at the tweets pre-Sandy, it would appear
everyone was shopping, and post-Sandy, that they were partying. Most of the tweets were from
residents of New York. Those hardest hit by Sandy were on the New Jersey coast, boarding up
their homes. You have to ask whether those who were active on Twitter are representative of
those who were affected by Sandy. The population we want to study is not necessarily the
population of people who were active on Twitter during the hurricane.

This is an example of an observational study. An observational study is one in which we are
simply analyzing the data we have at hand.

Returning to the question about evaluating the impact of the pandemic on university revenues,
we have current and historical data about students who have registered by June each year. We
know where they are from and what programs they are taking. We have significant data that is
stored in our student information system. This information system is an administrative
database.

Administrative data is usually stored in a structured database. These databases are complex
collections of interconnected files (a relational database). For example, Amazon has records
about each customer.

• One file holds the various addresses the customer has shipped to.
• Another holds the various credit cards they have used.
• Another holds the many purchase transactions.
• Another holds the history of which items they have viewed, when, and for how long,…
• And lots more in many other files.

To do analysis, we need to extract the data we want and organize it in a fashion we can easily
analyze. The extract takes data from multiple files (tables) and combines them into a single file.
The extract is usually written in a query language, such as SQL. The final data file is
usually a “flat file” that looks like a simple spreadsheet. Each row is one “observation” (record)
and the columns are the attributes (variables). If we were studying customers, then ideally we
would want one row (record) for each unique customer.
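
For illustration only, here is roughly what such an extract might look like, run from Python
against a small SQLite database; the tables and columns (customers, orders) are invented for
the sketch, not a real schema.

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("store.db")  # hypothetical database file
    query = """
        SELECT c.customer_id,
               c.billing_city,
               COUNT(o.order_id)  AS n_orders,
               AVG(o.order_total) AS avg_order_total
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.billing_city
    """
    # One row per customer: the "flat file" we analyze
    flat = pd.read_sql_query(query, conn)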

Since customers have made many visits to Amazon and many purchases, we might transform
the raw information into new variables (a sketch in code follows the list), such as:

• Average number of visits per month
• Average purchase amount in dollars
• Average number of items purchased per order
• Where the customer lives (billing address?)
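
Here is one hedged way such a transformation might look in pandas; the column names
(customer_id, order_date, order_total, n_items) are assumptions, and order_date is assumed
to be a datetime column.

    import pandas as pd

    def customer_features(orders: pd.DataFrame) -> pd.DataFrame:
        """Collapse one-row-per-order data into one-row-per-customer features."""
        orders = orders.copy()
        orders["month"] = orders["order_date"].dt.to_period("M")
        months_active = orders.groupby("customer_id")["month"].nunique()
        features = orders.groupby("customer_id").agg(
            n_orders=("order_total", "size"),
            avg_purchase=("order_total", "mean"),
            avg_items_per_order=("n_items", "mean"),
        )
        features["orders_per_month"] = features["n_orders"] / months_active
        return features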

For the case of studying the effect of COVID-19 on revenues, the registration data is incomplete
and possibly inaccurate. Many students may register in July and August, and others who have
registered may choose to defer. To get better insights, we may wish to email students and ask
them to complete a survey with respect to their intentions. This is another observational study.

Not everyone responds. There is variation in response rates by gender, age, income, location,
education level, ethnicity,… But even more importantly, we don’t know why some respond and
some do not, and whether this has any bearing on the issue we want to examine.

In an ideal world, we would perform experiments, where we have control over the
characteristic that we wish to study. With experiments, you can exclude all other factors that
may affect behavior, so you can be confident in drawing conclusions that generalize to the
broader population. In a medical setting, we randomly assign patients to a treatment
group and a control group. The treatment group is given the new drug we want to evaluate and
the control group is given a placebo. Neither patient nor doctor knows which is being given,
so neither can bias the results. If there is a difference between the groups, then the drug is the
cause. But is this process ethical?

Even with a well-designed experiment, there may be limits on how we can generalize the
results. The AstraZeneca vaccine for COVID was subjected to clinical trials on tens of
thousands of individuals. However, only people aged 16 to 65 were included. Initially, COVID
had the greatest impact on the elderly. They were the priority group for vaccination, but we
had no data about the vaccine's efficacy for this population. Initially, AstraZeneca was
recommended only for those under 65. There needs to be alignment between the sample and
the population we wish to speak about.

You are being subjected to such testing all the time online. Marketers are constantly running
small experiments in which you randomly see one advertisement or another, and the
marketer evaluates which works better. They call these A/B experiments. The internet has
opened up a huge number of opportunities to do experiments on consumers. Unfortunately, most
of the time we cannot assign “treatments” to subjects and must work with observational
data. We will look more closely at A/B experiments later in the text.
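
To make the idea concrete, here is a minimal sketch of evaluating an A/B comparison with a
two-proportion z-test from statsmodels; the counts are made up, and this is one common
approach rather than the only one.

    from statsmodels.stats.proportion import proportions_ztest

    # Made-up results: each ad was shown to 5,000 random visitors
    conversions = [220, 274]   # purchases after seeing ad A vs. ad B
    shown = [5000, 5000]

    z, p_value = proportions_ztest(count=conversions, nobs=shown)
    print(f"z = {z:.2f}, p = {p_value:.3f}")
    # A small p-value suggests the difference in conversion rates is
    # unlikely to be due to chance alone.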

Example: Signet Bank revolutionized the credit card industry


Have you ever heard of Signet Bank? In 1988, Richard Fairbank believed that banks were
missing an opportunity with their credit card business. He engaged another consultant, Nigel
Morris, to help develop a more scientific approach to marketing credit cards. Fairbank and
Morris pitched their idea to more than 20 US retail banks before Signet finally agreed to try
them out.

"The genius of Morris and Fairbank was to burrow deep into the spending habits and
lifestyles of these so-called prime customers to find their best bets, then offer them various
rates based upon their various risks."2

How did they get the data for their models to tell them what rate to offer a customer of a
given risk? They did experiments. They randomly offered rates to various customers and
then tracked their behaviour. Sometimes they gave good credit terms to bad risks. This was
expensive. It took time to fine-tune the models, but once they were working well, the increased
profits easily repaid the earlier losses. Eventually the credit card division was spun off as a
separate business. You may have heard of it: Capital One.

3.3 Measurement Scales


Traditionally, we classify data in terms of measurement scales. Measurements are either
numeric (continuous or discrete/counting) or categorical (qualitative, i.e., words). Common
measurement scales are ratio, interval, ordinal, and nominal.

Income is ratio data. If one person earns $80,000 per year and another earns $40,000, then the
first earns twice what the second one does. It is valid to divide one observation by another.

Financial data is usually ratio.

Temperature is an example of interval data. It is quantitative (numeric), but a place that is 30
degrees is not twice as warm as one that is 15 degrees. Is temperature being measured in
Celsius or Fahrenheit? The ratios would be different.
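
A quick computation shows why ratios of interval measurements are not meaningful:

    def c_to_f(celsius):
        """Convert Celsius to Fahrenheit."""
        return celsius * 9 / 5 + 32

    print(30 / 15)                  # 2.0 in Celsius
    print(c_to_f(30) / c_to_f(15))  # 86/59 ≈ 1.46: same temperatures, different ratio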

There are very few cases where numeric variables are only measured on an interval scale and
not a ratio scale. With ratio and interval data, it is valid to calculate averages and do
mathematical transformations to the data. We will treat interval and ratio data the same and
simply refer to it as numeric.

Later, when looking at some theoretic aspects of numeric data, we will differentiate between
discrete and continuous variables. “Discrete” means that there are gaps between successive
values, such as integers used for counting. “Continuous” means that there are no gaps, such as
measurements of length or weight. From a data analysis perspective, we use the same
methods to explore both types of variables.

Ordinal and nominal data are considered to be categorical (qualitative). Objects are placed into
categories with verbal labels.

A ranking is an ordinal scale. A grade of A is better than B, which is better than C. We frequently
get ordinal data when a survey asks: are you (1) very satisfied, (2) satisfied, (3) neither,
(4) dissatisfied, or (5) very dissatisfied? This is an example of what is known as a Likert scale.
We sometimes call this equal-interval data and treat it as if it were interval by taking averages.
At the end of this course you will complete a course evaluation survey and score my
effectiveness as a teacher. I will receive a report with my average score. What does an average
score of 4.2 or 1.9 represent? On the evaluation's five-point scale (where 5 is best), 4.2 suggests
most students gave very high ratings and 1.9 suggests most gave low ratings, but there is no
simple interpretation of the average as a number. Although treated as interval data, it is really
just numbers representing words. And don't make the additional error of thinking of these
averages as ratio data. An instructor with a 4.2 score is not twice as good as one with a 2.1
score. The assignment of numbers to labels was arbitrary. You may draw misleading
conclusions by treating ordinal data as numeric.
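
A tiny made-up example shows how the average can hide what the ratings actually look like:

    from statistics import mean

    polarized = [5, 5, 5, 1, 1, 1]  # half loved the course, half hated it
    lukewarm = [3, 3, 3, 3, 3, 3]   # everyone was indifferent

    print(mean(polarized), mean(lukewarm))  # both 3.0, yet very different stories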

“Program” is an example of nominal data. You are enrolled in Arts, Business or Science. The
data is words that have no sequence or scale attached to them. I cannot put them in my
calculator and average them.

However, today we must take a much broader view of what data is. We now analyze text
data (tweets, emails, books,…) looking for patterns. Your email server has a filter that
analyzes every message, trying to screen out spam and malicious email. When you deposit
a cheque at the bank machine, it scans the image and deciphers what is written on the
cheque. Facial recognition software scans images looking for matches. Enormous amounts
of many different types of sensor data are being collected continuously. The GPS location
data from your smartphone is being looked at right now by someone, somewhere.
Data is every trace we leave behind us.

Although these "new" types of data may look different, to analyze them, we describe them with
various measurements (e.g., text: word count, word position, frequency of word pairs,...).
These measurements are numerical or categorical. In these notes, we will limit ourselves to the
traditional numerical and categorical data, rather than the analysis of such measurements that
come from measuring text and image data.

3.4 Data Quality and Cleaning


We must also be concerned about the quality of our data. Are the responses honest?

Example: Condom use in the USA


Seth Stephens-Davidowitz3, in his book Everybody Lies: Big Data, New Data, and What the
Internet Can Tell Us About Who We Really Are, cites numerous examples of what we say and
what we do being different. One example he cites has to do with the General Social Survey,
one of the most authoritative surveys on the behavior of Americans.

According to the survey, heterosexual women say that, on average, they have sex 55 times
per year and use a condom 16% of the time. This translates into 1.1 billion condoms per
year. Men's responses translate into 1.6 billion per year. But Nielsen, the firm that tracks
everything, reports that annual condom sales are just 600 million. It seems everyone is lying
or just has a bad memory.

In the 2016 US presidential election, poll results were almost unanimous that Clinton
would defeat Trump. But people lied about whether they would vote and whom they
would vote for. Polls are particularly challenging to interpret correctly. Much depends upon
the wording of the question, whether respondents understand the issue, and whose
responses are being reported.

Example: Conflicting Results about the US Debt Ceiling


One of the most challenging issues in recent years in the US has been increasing the “Debt
Ceiling”. This used to be a routine authorization allowing the government to borrow money
to pay for programs that Congress had already approved. The US routinely runs deficits and
must borrow to cover them. The debt ceiling is the limit on that borrowing. Republicans and
Democrats use this vote as a lever to get other decisions approved. In the summer of 2023,
as in previous years, opinion polls gave conflicting results with respect to whether the
American public supported or opposed increasing the debt ceiling. The differences in poll
results could be explained by differences in how the issue was presented and whether the
public really understood what the issue was. In general, most individuals likely did not
understand the subject well enough to give a credible response. For a full discussion, read
“Why Debt Ceiling Polls Keep Giving Us Conflicting Results”, by Kaleigh Rogers.
https://fivethirtyeight.com/features/debt-ceiling-spending-cuts-confusing-polls/

Is the data what you think it is?


When I was a Registrar, I was often asked how many students we had from a particular town,
province or country. This should be easy to find: just look at students by their address. But all
too often, the student's address on file was their local address. Almost everyone appeared to be
from Halifax or the surrounding area! When a student changed their address, our information
system created a new record and did not delete the old address. For many students, we had
many addresses.

We could classify the first address we had on file as the home address. This was likely the city
they lived in when they applied for admission. That worked for most students, but what about
transfer students? Such a student was from one city, enrolled at a university in another city, and
then wanted to transfer to Saint Mary's University in Halifax. The address on their application
for admission was where they were currently studying, not where their home was.

We often generated summaries based upon the student's citizenship as a proxy for where they
were from. But this was misleading since we had a significant number of students who were
permanent residents or had dual citizenship. We discovered that almost half of our students
with Lebanese or Jordanian citizenship were local residents. Most of our Palestinian and Iraqi
citizens had come to us from the United Arab Emirates. How do you define “home”?

Is data coded in a consistent manner? Are there coding errors?


We will not be merging data files in this text, but it is common to merge data from more than
one database. Some fields may exist in both databases. Do the fields mean the same thing in
both? Often there are subtle differences, or the fields have been coded in different ways (e.g.,
the name of a province versus its abbreviation).

In the university’s information system, most high schools in Canada are coded into a database,
such that you don’t type in the school name, but select it from a list (a very long list). It is easy
to pick the wrong school when several have similar names, or to accidentally choose the one
above or below the correct choice. How will we know that an error has been made?
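
For the simpler coding mismatch, here is a hedged sketch of normalizing one field before a
merge; the mapping is partial and purely illustrative.

    # Map full province names to the abbreviations used in the other file.
    PROVINCE_TO_ABBR = {"Nova Scotia": "NS", "Ontario": "ON", "Quebec": "QC"}

    def normalize_province(value: str) -> str:
        """Return the abbreviation, passing through values already coded that way."""
        return PROVINCE_TO_ABBR.get(value.strip(), value.strip())

    print(normalize_province("Nova Scotia"))  # NS
    print(normalize_province("NS"))           # NS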

Is there missing data?


In most surveys, we find that respondents do not answer every question. Maybe they don't
know how to answer, or they simply don't want to. Many respondents experience survey
fatigue and do not complete the survey, so missing values are more common at the end of
a survey than at the beginning. Our challenge is that we do not know why the individual did not
respond. Are those who don't answer different from those who do?

From our customer database, if we were to look for patterns among customers' product
purchases, we would find that our data file has many blanks (missing values) because not every
customer buys every product.
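
A small sketch of profiling missingness with pandas; the survey columns and values here are
invented.

    import pandas as pd

    responses = pd.DataFrame({
        "q1_age": [19, 21, 20, 22],
        "q8_income": [30000, None, None, 25000],  # later questions have more blanks
    })

    # Fraction of missing answers per question, in questionnaire order
    print(responses.isna().mean())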

Are there duplicate records?


When extracting data from an administrative relational database, we have to link together the
multiple tables where information is stored. When extracting our file, we must be careful that
the linking does not generate duplicate records: multiple records for the same individual.
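
A hedged sketch of checking an extract that should have one row per student; the key column
name is an assumption.

    import pandas as pd

    def duplicated_rows(extract: pd.DataFrame, key: str = "student_id") -> pd.DataFrame:
        """Return every row whose key appears more than once in the extract."""
        return extract[extract.duplicated(subset=key, keep=False)]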

Are the data values reasonable?


Look at the smallest and largest data values. Are they plausible? In surveys of incomes, it is not
unusual for someone to add or drop a digit; for example, they may type 5000 or 500000
instead of 50000. It may be possible to design surveys so that respondents are asked to confirm
an answer when it looks unreasonable. Try depositing a $30,000 cheque at an ATM and it will
likely ask you whether the value is correct. Values that appear unreasonably large or small
should be investigated.

Also look at the very small numbers. In a survey of student total summer earnings, one student
reported earning $45. Another earned $95, but reportedly worked 9 hours per week. Maybe
the student only worked 9 hours in total? Maybe the student was reporting average daily
earnings?

Most values were reported in round hundreds or thousands of dollars, but some were recorded
precisely (e.g., $3,586). Did the student just make up the number? In auditing financial
records, it is common to check for patterns among digits; patterns may suggest that amounts
have been fabricated or changed.
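
One such digit test simply tallies leading digits, in the spirit of Benford's law (under which
naturally occurring amounts begin with 1 far more often than with 9); the amounts below are
made up.

    from collections import Counter

    def first_digit_counts(amounts):
        """Tally the leading digit of each nonzero amount."""
        return Counter(str(abs(a))[0] for a in amounts if a)

    print(first_digit_counts([3586, 1200, 1500, 950, 45, 95, 2100]))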

In some instances, you can incorporate consistency checks (e.g., date shipped must be later
than date ordered).
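
Here is a hedged sketch combining the range and consistency checks described above; the
column names and the threshold are assumptions for illustration.

    import pandas as pd

    def flag_suspect_rows(shipments: pd.DataFrame) -> pd.DataFrame:
        """Return rows that fail simple plausibility checks."""
        shipped_before_ordered = shipments["date_shipped"] < shipments["date_ordered"]
        implausible_amount = (shipments["amount"] <= 0) | (shipments["amount"] > 100_000)
        return shipments[shipped_before_ordered | implausible_amount]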

Have you been asked to answer survey questions for which you do not know the answer or
can't remember? How much did you earn last summer? How much did you save for
school? How much do you spend on Christmas presents? How often did you eat at a
restaurant last year? How much tax was deducted from your last paycheck? Would you take the
time to look up exact answers, or just make something up? How does the researcher evaluate
the validity of any of these answers?

The COVID-19 pandemic highlighted the challenges of collecting good quality data. Donald
Trump and some other world leaders suggested that the numbers were exaggerated, while
others claimed that they seriously underestimated the crisis. Nate Silver offered his take in his
blog post, “Coronavirus Case Counts Are Meaningless”.

3.5 Data Issues Summary


• Business problems are often poorly defined and unstructured.
• To create a data analysis problem from a business problem, you must define questions
that you can measure.
• Often you have data for some questions, but maybe not all, or the data may not be
exactly what the question calls for.
• Your data may come from an administrative database, a survey, a third party, or a
combination of all of these if you can link the records together.
• Most of the time you are using “observational data” – what you observed in an
uncontrolled environment. This may limit your ability to generalize to a broader
population.
• Ideally we would like to have controlled experiments, but often people will not let you
do experiments on them.
• There are usually some data quality challenges, even with administrative data.
• Many valuable insights can only be obtained through surveys. Surveys present many
data quality challenges (non-response, bias, data errors, honesty, misunderstanding,…).

• The type of analysis you can do is somewhat limited by the measurement scale
(nominal, ordinal, interval and ratio).
• The new frontiers in data analysis are with respect to text, image, sound and other types
of non-numeric data that we can now capture and store. Although non-numeric, we can
measure characteristics of text or images and then analyze these measurements.
• Our single biggest issue is data quality. There are many dimensions to data quality. It is
not a simple question. The critical issue is to question the data. Is it what you think it is?

Footnotes
1. O’Neil, C. and Schutt, R. (2014), Doing Data Science. O’Reilly Media, Inc., Sebastopol, CA.
2. Condon, Bernard, “House of Cards,” Forbes, April 2, 2001, p. 77.
3. Stephens-Davidowitz, S. (2017), Everybody Lies. HarperCollins Publishers, New York.
