
Session Transcript

Getting To Know Your Data

Note: This transcription document is a text version of the upGrad videos present in this session. It is not
meant to be read independently, but can be used to complement your video watching experience.

Video 1

Speaker: Edward H K Ng

In today’s business world, competition is not just about products or services. With digitalisation, data
has become the new resource and analytics the new means of competition. One needs to look no
further than Google, which has mastered the art of big data analytics to become a global behemoth.

To perform analytics, one has to first know the population of concern. The population of smartphone
users will be anyone who owns or uses a smartphone. It will not include those who watch TV, even
though the two groups overlap.

A population comprises all data points in a domain. A data point or datum is also termed a record or
observation.

It is a challenge, however, to identify every datum in a population. If the population is small, analytics
is unlikely to be economically worthwhile. No analytics is needed for spacecraft that have travelled
to Mars, as there are fewer than a handful. Where the population is large, and it is here that analytics
is useful, gathering data from every observation is usually impossible.

To overcome this, one or more samples are collected for analysis.

A sample is a selection of data drawn from, and representative of, the population. The keyword here is
‘representative’, as a sample that is not can lead to incorrect inferences or conclusions about the
population. There is no fixed minimum sample size, but a common rule of thumb is that no fewer than
30 observations is advisable.

There are multiple facets of a population.

The main ones of concern and use are the mean and the variance, or its square-root form, the standard
deviation. The mean is the average of a measure across all observations. The variance is a measure of
dispersion used to evaluate how widely distributed the measure is. Sometimes the median, which is the
middle value when observations are sorted from smallest to largest, is also captured.

Metrics of a population are called parameters. The same metrics computed from samples are known as statistics.
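As a sketch of these ideas, using Python's standard `statistics` module on a made-up population, the same formulas yield parameters when applied to the population and statistics when applied to a sample:

```python
import random
import statistics

random.seed(1)
# Hypothetical population of 100,000 measurements.
population = [random.gauss(50, 10) for _ in range(100_000)]
sample = random.sample(population, k=500)

pop_mean = statistics.mean(population)     # a parameter (whole population)
sample_mean = statistics.mean(sample)      # a statistic (estimates pop_mean)
sample_var = statistics.variance(sample)   # dispersion of the sample
sample_sd = statistics.stdev(sample)       # square root of the variance
sample_median = statistics.median(sample)  # middle value of the sorted sample

print(round(pop_mean, 2), round(sample_mean, 2))
print(round(sample_sd ** 2 - sample_var, 6))  # ~0: sd is the root of variance
```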

© Copyright upGrad Education Pvt. Ltd. All rights reserved


We now look at an example of Apple preparing for an analysis of iPhone users. Apple wants to derive
a profile of the population of iPhone users to identify areas for further market penetration, but that
is impossible, so it has to take a sample for analysis.

The population of concern is all iPhone owners or users.

Around 35 million iPhones were sold in 2020, a steep decline from 71 million in 2015.

Apple would like to know which demographics it has failed to penetrate. It could be losing senior
citizen customers who feel unable to cope with increasingly complicated features. As it is not
possible to obtain population data, it plans to collect a sample of 10,000 for analysis. This sample must
be representative of the population to ensure that any insights obtained through analysis are not
wrong or distorted.

Non-iPhone users will, therefore, not be appropriate for the sample.

Besides being the only practical approach to analysing a population, collecting and using samples
offers other advantages. It allows for a quick and cost-effective means of getting the data needed.
Without sampling, analytics would not be economically attractive, as the return on investment would
be too low.

The size of a sample is much smaller than that of the population and so much more manageable.
Although hardware capacity has increased substantially, it still requires quite massive resources to
analyse large amounts of data. As long as a sample is not biased, any results obtained can be regarded
as reliable representations of the population.

A final word about sampling.

Unless it is possible to collect a large sample, it is advisable to have multiple small ones to ensure that
any results obtained are not due to sampling error, which we will discuss further in this course.
This property of sampling is especially important for clinical trials before a new drug is approved for
public use.
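A short sketch, with made-up numbers, of why multiple small samples help: the spread of the sample means reveals how much any single small sample could mislead.

```python
import random
import statistics

random.seed(7)
# Hypothetical population with mean ~100.
population = [random.gauss(100, 15) for _ in range(50_000)]

# Draw 20 small samples of 30 observations each and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, k=30))
    for _ in range(20)
]

# Individual means scatter around 100; their spread is the sampling error.
print(round(min(sample_means), 1), round(max(sample_means), 1))
print(round(statistics.mean(sample_means), 1))  # close to the population mean
```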

Video 2

Speaker: Edward H K Ng

Business analytics is about deriving insights from information and information is essentially
processed data. Data is then the raw material and has been called the virtual gold. There are several
aspects of data that should be understood to determine if any analytics is reliable.
A single piece of data is a datum, also termed an observation or record. A datum has a value or
measurement, which need not be numeric. ‘Silver’, for example, can be the value of a datum, such as
the colour of a laptop cover. The number of times a value occurs is called its frequency or count.
A collection of data is a dataset. A dataset can be the sample or extractions from the sample which
are called subsamples. Data comprise variables that are used for diagnostics or predictions.
Common variables are demographics like Gender, Age, Education etc or business indicators like
Industry, Size, etc. The measurement of variables can take on different forms or types which can be
broadly divided into nominal, categorical, ordinal and cardinal. The paint colour on a car is nominal.
It has a measurement or value that cannot be meaningfully compared to another. It cannot be said
that red colour is superior or inferior to black. Nominal data are usually described as those with no
intrinsic value. No intrinsic value, however, does not mean no analytical value.
In big data analytics, nominal data are used to derive predictive models. A search for the word
‘Green’, which has no intrinsic value, may suggest that the web user is interested in plants, gardening,
etc. Google and other e-commerce sites often use such associations to push notifications to potential
buyers of their products or services.
The second type of data is categorical. A datum is assigned to one natural or created category.
Natural categories are like Gender, Citizenship etc. Created categories are important for analysis and
modelling.
A property agent, for example, may want to know if a resident is an Owner or a Tenant as tenants
cannot make decisions on disposal or renting out. A very common form of categorical data is the
Yes/No divide according to a criterion.
Like nominal data, there is no comparability across categories, although one category may be
preferable, such as a Yes to the question of whether the respondent is a customer. Categorical data are
widely captured in surveys as they allow for easier and less intrusive answers by participants.
The third type is ordinal, which is data that can be ranked: 2 is larger than 1, or a degree is a higher
education level than high school. This data type is popular for testing a theory, since variables with
such values can be used to explain and predict. A widely tested theory is that education is a key
determinant of income level.
The final type is cardinal, which is essentially a more granular form of ordinal. Instead of just 1, 2, 3
and so on, cardinal data can take values such as 1, 1.1111, 1.1112, etc.
This data type applies to income, time intervals and other measurements that cannot be
predetermined. Because ordinal and cardinal data allow for comparability, they were regarded as
superior to nominal and categorical data in the past. More traditional statistical methods can only
make use of ordinal or cardinal values, but more recent ones, such as classification trees,
have extracted useful information from nominal and categorical values.
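The distinction can be sketched in code. The record below is hypothetical; the point is which operations are meaningful for each measurement type:

```python
# A hypothetical customer record illustrating the four measurement types.
record = {
    "paint_colour": "red",   # nominal: values cannot be ranked
    "is_customer": "Yes",    # categorical: one of a fixed set of categories
    "education": 3,          # ordinal: 1=high school, 2=diploma, 3=degree
    "income": 52_431.75,     # cardinal: granular, with meaningful magnitudes
}

# Ordinal and cardinal values support order comparisons and arithmetic...
print(record["education"] > 1)            # True: a degree ranks above high school
print(record["income"] - 50_000)          # differences are meaningful for cardinal

# ...whereas nominal and categorical values support only equality checks.
print(record["paint_colour"] == "black")  # False, and no 'greater than' exists
```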
Another dimension to data is whether they are discrete or continuous.
Discrete data are those with distinct values i.e. one can say that a value is different from another. All
nominal, categorical and ordinal data are discrete. Continuous data are those where a distinct value
cannot be located. The number of cars pulling into a gas station at an exact time cannot be
determined, for instance, although it is possible to do so for a time interval.
It is important to note that data are all discrete when collected. It is impossible to collect continuous
data as each observation has a specific value. Continuous data is a concept created to facilitate the
generation of a distribution that can be used for probability estimation as will be covered a little later.
The probability of an observation at a specific value on a continuous data distribution is zero.
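This zero-probability point can be illustrated numerically. The sketch below assumes a standard normal distribution and uses the error function from Python's `math` module to compute interval probabilities; as the interval around a point shrinks, the probability shrinks towards zero:

```python
import math

def normal_prob(a, b, mu=0.0, sigma=1.0):
    """P(a < X < b) for a normal distribution, via the error function."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    return cdf(b) - cdf(a)

# Probability over ever-narrower intervals around the value 0.
for width in (1.0, 0.1, 0.001):
    print(width, round(normal_prob(-width / 2, width / 2), 6))
```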
We continue with the iPhone analysis for illustration. The product codes for a single iPhone model
would have nominal values if they differ across countries, since the code has no intrinsic value although
the model has. One code cannot be regarded as superior or inferior to another, as both are used
for the same model. Gender is categorical, as a datum must fall into one predetermined category.
Such data are very useful for classification e.g. buy or not buy.
The age band (in years) is ordinal as a ranking or order is possible. The band 45-55 is certainly senior
to 25-35. Finally, the Price of the iPhone is cardinal as it can lie on a continuous range. The sale price
is a discrete value but it can vary by a single cent among retailers such that it will be difficult to
determine exactly how many units are sold at a specific price.

Video 3

Speaker: Edward H K Ng

Like other forms of raw material, data themselves are of little use. They have to be processed
and packaged into information that can be consumed to support decision making. As mentioned
earlier, only one or more samples of data can be collected for use. Such data are termed historical
as they are for occurrences in the past. There are other types of data that are simulated or
generated for scenarios.

These will be discussed later. A key assumption underlying the use of historical data is that they
provide some indications for the future. There is a school of thought that historical data is
useless as history doesn't repeat itself.

This is quite erroneous: though the future will not be identical to the past, historical data
provide a basis for predictions. Someone who predicts that it will snow in the Sahara desert next
year will be ridiculed, as it has barely rained there over recorded history.

Climate change may be incorporated to predict the weather in the Sahara but that cannot be
independent of how the desert has been over the past centuries. As historical data are essential
to information generated, what is collected must be properly designed to ensure maximum value
is derived.

Data can be in the form of cross-sectional or time series.

Cross-sectional data are variables across different sources of the same kind. An example is the
profit margin across industries. Keeping sources to the same kind for consistency is important
for such data.

If it is industries, one should not use technologies as the basis as different industries may depend
on the same technologies such as computers. Such data are collected to analyse if the source or



variable is a factor in determining an outcome like profitability. It is also essential to note that
cross-sectional data should be collected over the same period to eliminate any time effect.

Data can also be collected over time and these are called time-series data. Such data are most
common in finance like securities prices, for example. They are needed to estimate volatility and
correlations over time. How long a period to sample is a critical consideration for time-series
data.

If it is too short, there is a risk that any estimate is unreliable, as it could be driven by events
happening during that particular period. Securities prices collected during the Global Financial
Crisis between 2008 and 2010 will not be representative of long-term levels. If the sample period
is too long, it can be subject to too many changes over time that make it difficult to arrive at
meaningful conclusions.

The right sampling period is dependent on the objective of the analysis.

For slower-moving values like GNP or Gross National Product, for example, 20 years of data
may yield 80 quarters of data points. For daily securities prices, 2 years can provide 500 trading
days of data. It is worthwhile noting that time series data should be collected over fixed intervals.
Daily values should not be commingled with monthly or other time intervals.
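A small sketch of checking that a time series sits on fixed intervals, using a hypothetical set of daily closing prices:

```python
from datetime import date, timedelta

# Hypothetical daily closing prices; 5 Jan is missing from the series.
series = {
    date(2023, 1, 2): 101.5,
    date(2023, 1, 3): 102.1,
    date(2023, 1, 4): 101.9,
    date(2023, 1, 6): 103.0,
}

# Time-series data should be at fixed intervals, so check the gaps between dates.
dates = sorted(series)
gaps = [b - a for a, b in zip(dates, dates[1:])]
uniform = all(g == timedelta(days=1) for g in gaps)

print(uniform)  # False: one gap is two days, so the intervals are not fixed
```

Note that trading-day data legitimately skips weekends and holidays; in practice the check would allow for the market calendar.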

A combination of cross-sectional and time series data is known as panel data.

Such data are essential for studies on treatments, where subjects are divided into those that
receive the treatment and those who don’t. The aim is to determine whether a specific treatment
produces the desired results, so both pre- and post-treatment data are collected. This
combines the cross-sectional dimension, which is with and without treatment, and the time-series
dimension, which is pre- and post-treatment.
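A minimal sketch of the panel layout, with hypothetical treatment-study records; each record carries both a cross-sectional label (group) and a time label (period):

```python
# Hypothetical panel data: two subjects observed before and after a treatment.
panel = [
    {"subject": "S1", "group": "treatment", "period": "pre",  "score": 62},
    {"subject": "S1", "group": "treatment", "period": "post", "score": 78},
    {"subject": "S2", "group": "control",   "period": "pre",  "score": 60},
    {"subject": "S2", "group": "control",   "period": "post", "score": 63},
]

def mean_score(group, period):
    vals = [r["score"] for r in panel
            if r["group"] == group and r["period"] == period]
    return sum(vals) / len(vals)

# Time-series comparison (pre vs post) within each cross-sectional group.
print(mean_score("treatment", "post") - mean_score("treatment", "pre"))  # 16.0
print(mean_score("control", "post") - mean_score("control", "pre"))      # 3.0
```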

We now use the example of the State Bank of India analysing its staff turnover.

Concerned with staff turnover, the State Bank of India wants to gain some insights through data.
Turnover figures across states, cities, branches etc. would be considered cross-sectional data. The Bank can
decide if it wants to examine differences across states, cities, branches etc. It can do so for all
sources and use the information derived for decision making. The bank may also want to
understand turnover throughout the years.

That would be time-series data. Combining these with the cross-sectional data provides panel
data that can produce valuable insights into human resource management in different areas over
the years. It may find that certain states have significantly reduced turnover in the past decade
through successful implementation of some policies whereas some cities are not doing so well
in keeping staff compared to their peers.



Video 4

Speaker: Edward H K Ng

Almost every type of data collected has deficiencies that become issues in analytics and modelling.
Ignoring them can lead to technically perfect models with very wrong results.

The acronym GIGO, which stands for Garbage In, Garbage Out, is very real in business analytics. In
a well-publicised case, the Hubble Space Telescope produced blurred images after a successful launch
because its primary mirror had been polished to the wrong shape, the result of a miscalibrated test
instrument. In another, NASA lost the Mars Climate Orbiter because its navigation software expected
metric units while one team supplied values in imperial ones.

Simple human errors like these have cost NASA hundreds of millions of dollars. In some countries, data issues due to
regulations have caused major losses and withdrawals of investments. Any analysis has to be
evaluated on possible errors caused by the data themselves. The most common issues are missing,
outdated, invalid and unreliable data. Data gathered before digitisation are riddled with missing
values.

With paper, there is no way to enforce mandatory fields that must be filled in before a form is
submitted. Digital forms have significantly reduced this problem but discretionary data fields remain
a challenge. Missing data can cause biases beyond the fact that there is no value. In one bank, the
marital status field is filled with missing values. On closer examination, few customers declare
themselves as divorcees.

A credit risk model shows that the missing marital status is a significant factor in defaulting. This is
problematic, as the bank does not know what the missing values represent and so cannot use them to
make better credit decisions.
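One common remedy, sketched below with hypothetical loan records, is to make ‘missing’ an explicit category rather than silently dropping the rows, so its influence can be examined:

```python
# Hypothetical loan applications; marital status is optional and often missing.
records = [
    {"id": 1, "marital_status": "married"},
    {"id": 2, "marital_status": None},
    {"id": 3, "marital_status": "single"},
    {"id": 4, "marital_status": None},
]

# Recode missing values as their own category so the analysis can measure
# how influential "undisclosed" is instead of discarding those records.
for r in records:
    if r["marital_status"] is None:
        r["marital_status"] = "undisclosed"

missing_share = sum(r["marital_status"] == "undisclosed"
                    for r in records) / len(records)
print(missing_share)  # 0.5
```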

Data can be outdated and undetected.

The values are there but they are from the past. This is especially common for financial statements
like Balance Sheets and Income statements.

In some countries, regulatory enforcement is weak or non-existent and firms can recycle old
financial statements to be audited and submitted. Outdated data can lead to wrong conclusions
especially when sales or transactions have declined substantially but not been updated. Care should
always be exercised when reviewing analysis based on financial statements to ensure that they are
not based on outdated data.

Data values can be outright invalid and should have been eliminated but are not. Sales, for instance,
should not have a negative value. Invalid values can arise from two causes. One is data entry error
which is easier to correct. A negative personal income, for example, is obviously wrong.



The other is due to an inconsistent definition. It is easily assumed that Sales is a common term
known to everyone in business. In accounting standards, however, sales can be gross or net and
both are acceptable. Net sales are sales less returns and if the latter is larger, negative net sales are
produced which is correct by accounting standards. The problem lies with the commingling of the
net with gross sales resulting in invalid negative values. Unfortunately, such a deficiency can be
easily missed as it may be buried by other valid data.
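A validity check must therefore know which definition applies. The sketch below uses hypothetical order records: a negative gross figure is flagged as invalid, while a negative net figure is allowed:

```python
# Hypothetical sales records where gross and net figures sit in one dataset.
sales = [
    {"order": "A1", "basis": "gross", "amount": 120.0},
    {"order": "A2", "basis": "gross", "amount": -35.0},  # invalid: gross < 0
    {"order": "A3", "basis": "net",   "amount": -15.0},  # valid: returns > sales
]

# Gross sales can never be negative, so a negative gross value is a data error;
# negative net sales can be legitimate under accounting standards.
invalid = [s["order"] for s in sales
           if s["basis"] == "gross" and s["amount"] < 0]
print(invalid)  # ['A2']
```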

The final deficiency is unreliable values.

This is perhaps the hardest to detect and requires knowledge of the business domain. It is common
for e-commerce sites to request customers to provide reviews of their products or services. A
rigorous analysis found that many such reviews of products from a specific country are
unreliable: the percentage of perfect scores is significantly higher than on comparable sites from
other countries.
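Such a check can be sketched simply: compute the share of perfect scores per market (the figures below are made up) and compare it across markets.

```python
# Hypothetical review scores (1-5) from two markets.
market_a = [5, 5, 5, 5, 5, 5, 5, 4, 5, 5]
market_b = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]

def perfect_share(scores, top=5):
    """Fraction of reviews awarding the maximum possible score."""
    return sum(s == top for s in scores) / len(scores)

# An unusually high share of perfect scores is a red flag for unreliable data.
print(perfect_share(market_a))  # 0.9
print(perfect_share(market_b))  # 0.3
```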

Even then, vendors in this country deny the charge and continue to promote the superior
reviews of their products. Outliers, by contrast, are legitimate values but have to be treated with care to ensure
that they don't influence a model's results. There are different ways to manage outliers as will be
discussed in the module on modelling later.

We now use an example of Amazon assessing customer satisfaction for illustration.

Amazon wants to evaluate the customer satisfaction data collected in major markets. Such data can
be riddled with deficiencies. They can be missing if customers refuse to provide some values. The
average customer is reluctant to reveal personal income even if it is a choice of a range. During the
analysis, old data from earlier years could be wrongly uploaded resulting in outdated values.

This caused past scores of one market to be compared to recent ones of other markets.

A programming bug can create invalid data with negative values when only positive ones are possible,
e.g. when the value is derived by subtracting one figure from another. In some markets, returns are
common, resulting in negative net sales that the algorithm derived incorrectly. Finally,
data can be fake and so unreliable, e.g. when some vendors hire feedback providers to submit
perfect scores.

Aside from these, there could be outliers e.g. when scores are the lowest possible in a market that
has political frictions with Amazon's home country. All these deficiencies can significantly affect a
model leading to incorrect predictions and decisions.

Disclaimer: All content and material on the upGrad website is copyrighted, either belonging to upGrad or
its bona fide contributors, and is purely for the dissemination of education. You are permitted to access,
print and download extracts from this site purely for your own education only and on the following basis:
● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disk or to any other storage medium, may only
be used for subsequent, self-viewing purposes or to print an individual extract or copy for non-
commercial personal use only.
● Any further dissemination, distribution, reproduction, copying of the content of the document
herein or the uploading thereof on other websites, or use of the content for any other
commercial/unauthorised purposes in any way which could infringe the intellectual property rights
of upGrad or its contributors, is strictly prohibited.
● No graphics, images or photographs from any accompanying text in this document will be used
separately for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or upGrad content may be reproduced or stored in any other website or
included in any public or private electronic retrieval system or service without upGrad’s prior written
permission.
● Any right not expressly granted in these terms is reserved.

