3 Understanding Data
Business understanding requires knowledge of the business: its objectives, its products, its
customers, its processes, its people, and so on. It is context specific, so it is difficult to give guidance on
this phase beyond the questions highlighted previously. In many instances, the organization has
difficulty articulating the issue, except in general terms.
In the summer of 2020, we were in the midst of the COVID-19 pandemic. Colleges and
universities were worried about their finances. Most Canadian and US institutions are highly
dependent upon tuition revenues. A top-of-mind question was: what is the impact of COVID on
revenue?
To answer the question, we needed data. But what data? The pandemic may affect different
students differently. What is the impact on local students? Those from out-of-province?
International students? Which programs can be offered online? How are returning students
affected compared to new students? How will online course delivery affect retention and then
future revenues? Will new students defer admission and how will this affect revenue in future
years?
Each transaction is an observation. Associated with the transaction we have data about
• the customer,
• the vendor,
• the amount,
• the time and date,
• the location of the vendor,
• was it tap, pin code, chip, swipe, data entry, paper-based?
• did the customer subsequently report the transaction as fraudulent?
• was there confirmatory evidence of fraud?
• ….
We want the sample to be a representative snapshot of transactions over the last decade.
Even a sample of a million observations is not guaranteed to give an accurate picture. If there
had been 100 billion transactions over the last decade, we could select every 100,000th
transaction (100 billion is 100,000 million). But if transactions have been increasing from 5
billion in 2013 to 15 billion in 2022, the sample would hold three times as many transactions
from 2022 as from 2013, so recent years would be represented more heavily than the more
distant past. But maybe this is a good idea. Recent data may be more reflective of current
transactions.
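The effect of taking every k-th transaction from a growing stream can be sketched in a few lines of Python. This is only an illustration: the volumes are scaled down (thousands instead of billions), the linear growth between 2013 and 2022 is an assumption, and the function names are hypothetical.

```python
from collections import Counter

def yearly_volumes(first_year=2013, last_year=2022, start=5_000, end=15_000):
    """Hypothetical yearly transaction counts, growing linearly from start to end."""
    n = last_year - first_year
    return {first_year + i: start + (end - start) * i // n for i in range(n + 1)}

def systematic_sample(volumes, step):
    """Walk the time-ordered stream, keep every step-th transaction, count by year."""
    counts = Counter()
    position = 0
    for year, volume in volumes.items():
        for _ in range(volume):
            position += 1
            if position % step == 0:
                counts[year] += 1
    return counts

volumes = yearly_volumes()
sample = systematic_sample(volumes, step=100)
for year in volumes:
    print(year, volumes[year], sample[year])
```

With these assumed volumes, the sample holds 50 transactions from 2013 and 150 from 2022: each year's share of the sample mirrors its share of all transactions, not an equal share per year.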
Ideally, we want a random sample, where each member of the population has equal chance of
being in the sample. In this manner we are not introducing bias into our sample. There are
Returning to the question about evaluating the impact of the pandemic on university revenues,
we have current and historical data about students who have registered by June each year. We
know where they are from and what programs they are taking. We have significant data that is
stored in our student information system. This information system is an administrative
database.
Administrative data is usually stored in a structured database. These databases are complex
sets of interconnected files (a relational database). For example, Amazon has records about each
customer.
• In one file, they have the various addresses that they have shipped to.
• In another, they have the various credit cards they have used.
• In another, there are the many purchase transactions.
• In another, there is the history of what items they have viewed and when, how long,…
• And lots more in many other files.
To do analysis, we need to extract the data we want and organize it in a fashion we can easily
analyze. The extract takes data from multiple files (tables) and combines them into a single file.
The extract is usually done with a query language, such as SQL. The final data file is
usually a “flat file” that looks like a simple spreadsheet. Each row is one “observation” (record)
and the columns are the attributes (variables). If we were studying customers, then ideally, we
would want one row (record) for each unique customer.
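A minimal sketch of such an extract, using Python's built-in sqlite3 module with an in-memory database, is shown here. The table names, column names and records are invented for illustration; a real administrative database would have many more tables and fields.

```python
import sqlite3

# Build a tiny, hypothetical relational database in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (purchase_id INTEGER PRIMARY KEY,
                            customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO purchases VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0);
""")

# Combine the two tables into a flat file: one row per unique customer.
# LEFT JOIN keeps customers who have no purchases at all.
flat_file = conn.execute("""
    SELECT c.customer_id, c.name,
           COUNT(p.purchase_id)       AS n_purchases,
           COALESCE(SUM(p.amount), 0) AS total_spent
    FROM customers AS c
    LEFT JOIN purchases AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
""").fetchall()

for row in flat_file:
    print(row)
```

Each tuple in `flat_file` is one observation (a customer) and each position is one attribute, which is the spreadsheet-like layout described above.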
Since customers have made many visits and made purchases from Amazon, we might
transform the raw information into new variables, such as,
For the case of studying the effect of COVID-19 on revenues, the registration data is incomplete
and possibly inaccurate. Many students may register in July and August, and others who have
registered may choose to defer. To get better insights, we may wish to email students and ask
them to complete a survey with respect to their intentions. This is another observational study.
Not everyone responds. There is variation in response rates by gender, age, income, location,
education level, ethnicity,… But even more importantly, we don’t know why some respond and
some do not, and whether this has any bearing on the issue we want to examine.
In an ideal world, we would want to perform experiments, where we have control over some
characteristic that we wish to study. With experiments, you can exclude all other factors that
may affect behavior, so you can be confident in drawing conclusions that can be generalized to
the general population. In a medical situation, we randomly assign patients to a treatment
group and a control group. The treatment group is given the new drug we want to evaluate and
the control group is given a placebo. Neither the patient nor the doctor knows which is being
given, so neither can bias the results. If there is a difference between the groups, we can
attribute it to the drug. But is this process ethical?
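Random assignment itself is simple to sketch. The example below is a hedged illustration, not a clinical trial protocol: the patient IDs, the even split and the fixed seed are all assumptions made so the example is repeatable.

```python
import random

def randomize(patient_ids, seed=None):
    """Shuffle the patients and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

groups = randomize(range(1, 21), seed=42)   # 20 hypothetical patient IDs
print(sorted(groups["treatment"]))
print(sorted(groups["control"]))
```

Because chance alone decides who lands in which group, characteristics such as age or health are expected to balance out between the two groups on average.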
Even when it is a well-designed experiment, there may be limits on how we can generalize the
results. The AstraZeneca vaccine for COVID was subjected to clinical trials on tens of
thousands of individuals. However, only people aged 16 to 65 were included. Initially, COVID
had the greatest impact on the elderly. They were the priority group for vaccination, but we
had no data about the AstraZeneca vaccine's efficacy for this population. Initially, AstraZeneca was
recommended only for those under 65. There needs to be alignment between the sample and
the population we wish to speak about.
You are being subjected to such testing all the time online. Marketers are constantly doing
small experiments where you are randomly seeing one advertisement or another and the
marketer is evaluating which works better. They call them A/B experiments. The internet has
opened a huge number of opportunities to do experiments on consumers. Unfortunately, most
of the time, we cannot assign “treatments” to subjects and we must work with observational
data. We will look more closely at A/B experiments later in the text.
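As a rough preview, one common way to compare two advertisements is a two-proportion z-test on their click rates. The sketch below assumes that approach; the click and impression counts are invented, and nothing here is claimed to be what any particular marketer does.

```python
import math

def two_proportion_z_test(clicks_a, shown_a, clicks_b, shown_b):
    """Compare the click rates of ads A and B; return rates, z statistic, two-sided p-value."""
    p_a = clicks_a / shown_a
    p_b = clicks_b / shown_b
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return p_a, p_b, z, p_value

# Hypothetical results: each ad was shown to 10,000 randomly chosen visitors.
p_a, p_b, z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"A: {p_a:.3f}  B: {p_b:.3f}  z = {z:.2f}  p = {p_value:.4f}")
```

A small p-value suggests the gap between the two click rates is unlikely to be due to chance alone, which is only meaningful because visitors were assigned to the ads at random.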
How did they get the data for their models to tell them what rate to offer a customer of a
given risk? They did experiments. They randomly offered rates to various customers and
then tracked their behaviour. Sometimes they gave good credit to bad risks. This was
expensive. It took time to fine tune the models, but once working well, the increased profits
easily repaid the earlier losses. Eventually the credit card division was spun off as a separate
business. You may have heard of it. Capital One.
Income is ratio data. If one person earns $80,000 per year and another earns $40,000, then the
first earns twice what the second one does. It is valid to divide one observation by another.
There are very few cases where numeric variables are only measured on an interval scale and
not a ratio scale. With ratio and interval data, it is valid to calculate averages and do
mathematical transformations to the data. We will treat interval and ratio data the same and
simply refer to it as numeric.
Later, when looking at some theoretical aspects of numeric data, we will differentiate between
discrete and continuous variables. “Discrete” means that there are gaps between successive
values, such as integers used for counting. “Continuous” means that there are no gaps, such as
measurements of length or weight. From a data analysis perspective, we use the same
methods to explore both types of variables.
Ordinal and nominal data are considered to be categorical (qualitative). Objects are placed into
categories with verbal labels.
“Program” is an example of nominal data. You are enrolled in Arts, Business or Science. The
values are words that have no sequence or scale attached to them. I cannot put them in my
calculator and average them.
However, today we must take a much broader view of what data is. We are now analyzing text
data (tweets, emails, books,…) looking for patterns. Your email server has a filter that is
analyzing every email and trying to filter out the spam and malicious emails. When you deposit
a cheque in the bank machine, it is scanning the image and deciphering what is written on the
cheque. Facial recognition software is scanning images looking for matches. An enormous
amount of sensor data of many different types is being collected continuously. Your
GPS location data from your smartphone is being looked at right now by someone somewhere.
Data is every trace we are leaving behind us.
Although these "new" types of data may look different, to analyze them, we describe them with
various measurements (e.g., text: word count, word position, frequency of word pairs,...).
These measurements are numerical or categorical. In these notes, we will limit ourselves to
traditional numerical and categorical data and will not cover the analysis of measurements
derived from text and image data.
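To make the idea of "measuring" text concrete, here is a small sketch that turns a sentence into a word count, word frequencies and word-pair frequencies. The sample text and the very simple tokenization rule (lowercase letters and apostrophes only) are assumptions for illustration.

```python
import re
from collections import Counter

def text_measurements(text):
    """Reduce a piece of text to a few numerical measurements."""
    words = re.findall(r"[a-z']+", text.lower())   # crude tokenizer
    return {
        "word_count": len(words),
        "word_freq": Counter(words),
        "pair_freq": Counter(zip(words, words[1:])),  # adjacent word pairs
    }

m = text_measurements("Win a free prize now. Claim your free prize today!")
print(m["word_count"])
print(m["word_freq"].most_common(2))
print(m["pair_freq"].most_common(1))
```

The output is ordinary numerical data (counts) attached to categorical labels (words and word pairs), which is the point made above.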
For men, their responses translate into 1.6 billion per year. But Nielsen, the firm that tracks
everything, claims that annual condom sales are just 600 million. It seems everyone is lying or
just has a bad memory.
In the 2016 presidential election in the US, poll results were almost unanimous that Clinton
would defeat Trump. But people lied with respect to whether they would vote and who they
would vote for. Polls are particularly challenging to interpret correctly. Much depends upon
the wording of the question, whether respondents understand the issue/question, and whose
responses are being reported.
We could classify the first address we had on file as the home address. This was likely the city
they lived in when they applied for admission. That worked for most students, but what about
transfer students? The student was from one city, enrolled at a university in another city, and
We often generated summaries based upon the student’s citizenship as a proxy for where they
were from. But this was misleading since we have a significant number of students who are
permanent residents or have dual citizenship. We discovered that almost half of our students
with Lebanese or Jordanian citizenship were local residents. Most of our Palestinian and Iraqi
citizens had come to us from the United Arab Emirates. How do you define “home”?
In the university’s information system, most high schools in Canada are coded into a database,
such that you don’t type in the school name, but select it from a list (a very long list). It is easy
to pick the wrong school when several have similar names, or to accidentally choose the one
above or below the correct choice. How will we know that an error has been made?
From our customer database, if we were to look for patterns in product purchases across
customers, we would find that our data file has many blanks (missing values) because not
every customer buys every product.
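A tiny example shows how quickly the blanks appear once purchases are laid out as one row per customer and one column per product. The customers, products and quantities below are invented.

```python
# Hypothetical purchase records: (customer, product, quantity).
purchases = [
    ("ana", "book", 2),
    ("ana", "pen", 5),
    ("ben", "book", 1),
    ("chen", "lamp", 1),
]

customers = sorted({c for c, _, _ in purchases})
products = sorted({p for _, p, _ in purchases})

# One row per customer, one column per product; None marks a missing value.
table = {c: {p: None for p in products} for c in customers}
for customer, product, quantity in purchases:
    table[customer][product] = quantity

missing = sum(value is None for row in table.values() for value in row.values())
total_cells = len(customers) * len(products)
print(f"{missing} of {total_cells} cells are missing")
```

Even with three customers and three products, more than half of the cells are empty; with thousands of products, the table would be almost entirely blank.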
Most values were reported in hundreds or thousands of dollars, but some were recorded
precisely (e.g., $3,586). Did the student just make up the number? In auditing financial
records, it is common to check for patterns among digits. Patterns may suggest that amounts
have been fabricated or changed.
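One simple digit check, sketched below, counts how many reported amounts are round multiples of 100 and tabulates the last digits. The amounts are made up, and this is only one of many possible digit checks, not a description of any specific auditing standard.

```python
from collections import Counter

def digit_summary(amounts, round_to=100):
    """Split amounts into round and precise values and count the last digits."""
    rounded = [a for a in amounts if a % round_to == 0]
    precise = [a for a in amounts if a % round_to != 0]
    last_digits = Counter(a % 10 for a in amounts)
    return rounded, precise, last_digits

# Hypothetical reported amounts in whole dollars.
amounts = [3000, 2500, 3586, 4000, 1200, 1999]
rounded, precise, last_digits = digit_summary(amounts)
print("round:", rounded)
print("precise:", precise)
print("last digits:", dict(last_digits))
```

A pile-up on the digit 0 suggests respondents were estimating; the precise values are the ones worth a second look, either as carefully looked-up answers or as possible fabrications.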
In some instances, you can incorporate consistency checks (e.g., date shipped must be later
than date ordered).
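Such a consistency check might look like the following sketch. The order records and field names are hypothetical, and the rule used here flags only orders shipped before they were placed, on the assumption that same-day shipping is allowed.

```python
from datetime import date

# Hypothetical order records; None means the order has not shipped yet.
orders = [
    {"order_id": 1, "ordered": date(2022, 3, 1), "shipped": date(2022, 3, 3)},
    {"order_id": 2, "ordered": date(2022, 3, 5), "shipped": date(2022, 3, 2)},
    {"order_id": 3, "ordered": date(2022, 3, 7), "shipped": None},
]

def inconsistent_orders(orders):
    """Return the IDs of orders whose ship date is earlier than the order date."""
    return [
        o["order_id"]
        for o in orders
        if o["shipped"] is not None and o["shipped"] < o["ordered"]
    ]

flagged = inconsistent_orders(orders)
print(flagged)
```

The check cannot say which of the two dates is wrong, only that the record needs to be reviewed.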
Have you been asked to answer survey questions for which you do not know the answer or
can't remember? How much did you earn last summer? How much did you save for
school? How much do you spend on Christmas presents? How often did you eat at a
restaurant last year? How much tax was deducted from your last paycheck? Would you take the
time to look up exact answers or just make something up? How does the researcher evaluate
the validity of any of these answers?
The COVID-19 pandemic highlighted the challenges of collecting good quality data. Donald
Trump and some other world leaders suggested that the numbers were exaggerated, while others
claimed that the numbers seriously understated the crisis. Nate Silver offered his take in his blog,
Coronavirus Case Counts are Meaningless.
Footnotes
1
O’Neil, C. and Schutt, R., (2014), Doing Data Science. O’Reilly Media, Inc., Sebastopol, CA.
2
Condon, Bernard, “House of Cards,” Forbes, April 2, 2001, p. 77.
2
Stephens-Davidowitz, S. (2017) Everybody Lies. Harper Collins Publishers, New York.