
Types of data and data quality

KIT306/606: Data Analytics


Unit Coordinator: A/Prof. Quan Bai
University of Tasmania
Announcement
• Quiz 1
• Due date: Sunday 11:59 pm
• 2%
• 4 Questions
• One attempt
• True or false and multiple choice questions

• Tutorials start from this week


• Tutorial tasks (1.5% each week)
• You need to attend the tutorial, complete all the tasks and show them to your tutor
Data! What is it?
• Data are characteristics that are collected through observation. In a more
technical sense, data is a set of values of qualitative or quantitative variables
about one or more objects (Wikipedia)

Single attribute:
176 cm

Multiple attributes:

Student ID  Age  Height  Study hours per day  Grade
001         23   180     4                    HD
• A data set can often be viewed as a collection of data instances, also called records or examples.
• An attribute/feature is a property of a data instance that may vary, either from one instance to another or from one time to another.
• Measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.

Student ID  Age  Height  Study hours per day  Grade
001         23   180     4                    HD
002         20   165     2                    DN
003         31   210     1                    CR
004         29   167     0.5                  PP
005         21   175     3                    HD
006         23   178     2                    HD
007         38   182     4                    DN
008         22   166     5                    HD
009         24   159     0.1                  F
010         25   167     0.2                  F

Each row is a data instance; each column is an attribute/feature.
Types of attributes
• What operations can we apply to attributes?
• Distinctness: =, != (Nominal)
• Order: <, >, >=, <= (Ordinal)
• Addition: +, - (Interval)
• Multiplication: *, / (Ratio)

• Accordingly, we can define 4 types of attributes:


• Nominal
• Ordinal
• Interval
• Ratio
• Nominal: a variable with categories that do not have a natural order or ranking.
• E.g., car colour, blood type, phone brand, …
• Blue != Red, but you cannot say Blue > Red or Blue < Red

• Ordinal: ordered scale, but the differences between adjacent categories do not necessarily have the same meaning.
• E.g., satisfaction degree (“like”, “neutral”, “dislike”), judo levels, …
• Black Belt > Red Belt > …, but it is meaningless to compute Black Belt + Red Belt
• Interval: ordered scale, and the difference between two values is meaningful; 0 does NOT mean there is none of that variable.
• E.g., temperature (in °C), GPA, …
• 20 °C - 15 °C is meaningful, but 20 °C / 15 °C is not.
• 0 °C != no temperature

• Ratio: has all the properties of an interval variable, and also has a clear definition of 0: when the variable equals 0, there is none of that variable. In addition, the ratio of two measurements has a meaningful interpretation.
• E.g., weight, length, number of students, …
• 50 kg / 100 kg is meaningful
• 0 kg means no weight
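The relationship between the four attribute types and their operations can be sketched in Python. This is a minimal illustration, not part of the original slides: the names `SCALES`, `OPERATIONS`, `allowed_operations` and `is_less` are hypothetical, and the sketch assumes that each scale also supports every operation of the scales below it.

```python
# Scales listed from weakest to strongest (assumption: operations accumulate).
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Operations that each scale adds on top of the previous ones.
OPERATIONS = {
    "nominal": {"==", "!="},
    "ordinal": {"<", ">", "<=", ">="},
    "interval": {"+", "-"},
    "ratio": {"*", "/"},
}


def allowed_operations(scale):
    """Return every operation that is meaningful for the given scale."""
    level = SCALES.index(scale)
    allowed = set()
    for name in SCALES[: level + 1]:
        allowed |= OPERATIONS[name]
    return allowed


# An ordinal attribute can be compared through an explicit category order.
SATISFACTION_ORDER = ["dislike", "neutral", "like"]


def is_less(a, b, order=SATISFACTION_ORDER):
    """Compare two ordinal categories by their position in the order."""
    return order.index(a) < order.index(b)
```

For example, `allowed_operations("interval")` contains `"-"` but not `"/"`, which matches the temperature example: differences are meaningful, ratios are not.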
Types of attributes (cont.)

All variables
• Numerical: continuous, discrete
• Categorical: ordinal, not ordinal


Types of variables: Numerical
• Numerical: values or observations that can be measured
• Discrete variable
• A variable that can take on only certain individual numeric values
• E.g., number of pages in a book, number of people in Hobart

• Continuous variable
• A variable that can take on any value in a certain range
• An infinite number of possible values
• E.g., length of a film, temperature, time taken to run a race
General characteristics of data sets:
• Dimensionality: number of attributes
• Sparsity: most attributes of an instance have the value 0, and very few attributes have non-zero values.

a1  a2  a3  a4  a5  a6
0   0   0   0   0   0
0   0   0   0   0   0
0   1   0   0   0   0
0   0   0   4   0   0
0   0   0   0   0   0

• Resolution: how fine the measurement is. E.g., metre vs. kilometre, hour vs. second, …
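As a rough illustration of dimensionality and sparsity, the sketch below uses the 5 × 6 example matrix from this slide. The function names `sparsity` and `to_sparse` are hypothetical, and the definition of sparsity as "fraction of zero-valued cells" is one common convention, assumed here.

```python
# The example matrix from the slide: 5 instances, 6 attributes (a1..a6).
MATRIX = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Dimensionality: the number of attributes (columns).
DIMENSIONALITY = len(MATRIX[0])


def sparsity(rows):
    """Fraction of cells whose value is 0 (assumed definition)."""
    total = sum(len(row) for row in rows)
    zeros = sum(value == 0 for row in rows for value in row)
    return zeros / total


def to_sparse(rows):
    """Store only the non-zero cells, keyed by (row, column) position."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    }
```

Here 28 of the 30 cells are zero, so only two entries need to be stored in the sparse form. This is why sparse data sets are often kept as lists of non-zero entries rather than full tables.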
Types of data sets
• Transaction data: each record involves a set of items.
• E.g., shopping basket data

• Graph-based data:
• Data with relationships among instances
• We can use graph to capture relationships
among data instances
• E.g., social network data
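A small sketch of how these two kinds of data set might be represented in Python. The baskets, names and helper functions below are all hypothetical and only illustrate the shapes of the data: a transaction is a set of items, and a graph is stored here as adjacency sets.

```python
from collections import Counter

# Transaction data: each record is a set of items (hypothetical baskets).
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
]

# Graph-based data: each instance maps to the instances it is related to
# (hypothetical, undirected friendships in a social network).
FRIENDS = {
    "Ann": {"Bob", "Cho"},
    "Bob": {"Ann"},
    "Cho": {"Ann"},
}


def item_counts(baskets):
    """Count in how many baskets each item appears."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts


def degree(graph, node):
    """Number of direct relationships a node has."""
    return len(graph[node])
```

Note that a basket has no fixed set of attributes, so transaction data does not fit a fixed-width table as naturally as the student records do.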
• Ordered data: the attributes of data instances have relationships
which can be ordered, e.g., in time or space.
• Sequence data
• A data set that is a sequence of individual entities
• E.g., natural language data
“London is the capital city of the UK”
• Time series data:
• A special type of sequential data
• Each record is a time series, e.g., of daily measurements
• E.g., water level of a river in 2020

• Spatial data: data instances with spatial attributes, i.e., positions or areas.
• E.g., weather data
Structured vs. unstructured data
• Structured data is comprised of clearly defined data types whose
pattern makes them easily searchable; while unstructured data –
“everything else” – is comprised of data that is usually not as easily
searchable, including formats like audio, video, and social media
postings. (datamation.com)
• Structured data normally has a pre-defined data schema; unstructured
data has internal structure but is not structured via pre-defined data
models or schemas.

Schema vs Schemaless data
• A schema specifies the structure and types of data, e.g., the type of each column in a table.
It may also specify constraints within or between data fields.
Structured data generally has a schema associated with it.

• Schema is a kind of meta-data


Meta-data: data about data, e.g. author of a file, file size, the date the document was created,
keywords of a document, …
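To make the idea of a schema concrete, here is a minimal sketch that checks a record against column types plus one constraint. The field names, the `ALLOWED_GRADES` constraint and the `conforms` function are hypothetical illustrations based on the student table earlier in these slides, not a real schema language.

```python
# A schema as a mapping from column name to expected Python type
# (hypothetical field names modelled on the student table).
STUDENT_SCHEMA = {
    "student_id": str,
    "age": int,
    "height": int,
    "study_hours_per_day": float,
    "grade": str,
}

# A constraint within a field (assumed: only grades seen in the table).
ALLOWED_GRADES = {"HD", "DN", "CR", "PP", "F"}


def conforms(record, schema=STUDENT_SCHEMA):
    """True if the record has exactly the schema's fields, with the
    expected types, and satisfies the grade constraint."""
    if set(record) != set(schema):
        return False
    if not all(isinstance(record[name], kind) for name, kind in schema.items()):
        return False
    return record["grade"] in ALLOWED_GRADES


good = {"student_id": "001", "age": 23, "height": 180,
        "study_hours_per_day": 4.0, "grade": "HD"}
bad = {"student_id": "002", "age": "twenty", "height": 165,
       "study_hours_per_day": 2.0, "grade": "DN"}
```

The schema itself is meta-data: it describes the records without being one of them.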

Relational DB: Schema or Schemaless?


Tabular Data
• What is a table?
• A table is a collection of rows and columns
• Each row has an index
• Each column has a name
• A cell is specified by an (index, name) pair
• A cell may or may not have a value

• Schema = (minimally) column types.

• Often stored as text files in CSV format.
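The table model above can be sketched with Python's standard `csv` module. The CSV text reuses the first three rows of the student table from earlier in these slides; the column names, `COLUMN_TYPES` and `load_table` are hypothetical. The student ID is used as the row index and kept as text, since it is an identifier rather than a quantity.

```python
import csv
import io

# CSV text with a header row (column names are assumptions).
CSV_TEXT = """student_id,age,height,study_hours_per_day,grade
001,23,180,4,HD
002,20,165,2,DN
003,31,210,1,CR
"""

# Minimal schema: one type per column (the index column is kept as str).
COLUMN_TYPES = {
    "age": int,
    "height": int,
    "study_hours_per_day": float,
    "grade": str,
}


def load_table(text, column_types):
    """Parse CSV text into {index: {column name: typed value}}."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        index = row["student_id"]
        table[index] = {
            name: cast(row[name]) for name, cast in column_types.items()
        }
    return table


TABLE = load_table(CSV_TEXT, COLUMN_TYPES)
# A cell is addressed by an (index, name) pair:
cell = TABLE["002"]["height"]
```

CSV files store every value as text, so the column types have to be supplied separately, which is exactly the role of the schema.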

Case Study: Cars dataset

type price passengers weight


1 small 15.9 5 2705
2 midsize 33.9 5 3580
… … … ... …
54 midsize 26.7 5 3245

• What are the instances?


• What are the variables?
• What is the schema?
Types of variables in car dataset

type price passengers weight


1 small 15.9 5 2705
2 midsize 33.9 5 3580
… … … ... …
54 midsize 26.7 5 3245

• Type:
• Price:
• Passengers:
• Weight:
Quick Question: the type of variable
• What type of variable is a postcode?
a) Numerical, continuous
b) Numerical, discrete
c) Categorical, nominal
d) Categorical, ordinal
Associated vs. independent
• When two variables show some connection with one another, they
are called associated variables.
• Associated variables can also be called dependent variables and vice-versa.
• If two variables are not associated (i.e., there is no evident connection
between the two), then they are said to be independent.
• Observations can also be independent of one another.
Sample Data Extraction
• A sample data set is extracted from a larger data set called the population (all data).

• It is usually not feasible to collect information on the entire population due to the high
cost of data collection, so statisticians instead work with samples that are
(hopefully) representative of the populations they come from.
• Using summary statistics and graphs based on these samples, we try to
understand certain features of the population as a whole.
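A minimal sketch of drawing a simple random sample and comparing a summary statistic with the population value. For illustration only, the ten heights from the student table are treated as the whole population; `draw_sample` and its `seed` parameter are hypothetical names, and simple random sampling without replacement is just one of several possible sampling schemes.

```python
import random
import statistics

# Toy "population": the ten heights from the student table.
POPULATION = [180, 165, 210, 167, 175, 178, 182, 166, 159, 167]


def draw_sample(population, size, seed=0):
    """Simple random sample without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(population, size)


sample = draw_sample(POPULATION, size=4)

# The sample mean is an estimate of the population mean; the two will
# generally differ, and the gap tends to shrink as the sample grows.
population_mean = statistics.mean(POPULATION)
sample_mean = statistics.mean(sample)
```

In practice the population mean is unknown, which is the reason for sampling in the first place; it is computed here only so the estimate can be compared against it.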
Data quality
• Measurement and data collection issues:
• Measurement error: any problem resulting from the measurement process, i.e., the
difference between the measurement value and the true value
• Data collection error: errors such as omitting data objects or attribute values, or
inappropriately including a data object.

• Noise: random component of a measurement error


Measurement of data/measurement quality
• Precision: the closeness of repeated measurements of the same quantity to one another
• Normally measured by the standard deviation of a set of values
• Bias: a systematic variation of measurements from the quantity being measured.
• Measured by taking the difference between the mean of the set of values and the known value of the
quantity being measured.
• Accuracy: the closeness of measurements to the true value of the quantity being measured.
• Depends on precision and bias.

[Figure: three targets showing measurements that are accurate but NOT precise; precise but NOT accurate (can be caused by bias); and precise AND accurate]
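Precision and bias as defined above can be computed directly. The sketch below assumes five hypothetical repeated weighings of a reference mass known to be exactly 1 g; the measurement values and function names are illustrative only.

```python
import statistics

# Hypothetical repeated measurements (in grams) of a 1 g reference mass.
TRUE_VALUE = 1.0
MEASUREMENTS = [1.015, 0.990, 1.013, 1.001, 0.986]


def precision(values):
    """Spread of repeated measurements: the sample standard deviation."""
    return statistics.stdev(values)


def bias(values, true_value):
    """Mean of the measurements minus the known true value."""
    return statistics.mean(values) - true_value


measured_precision = precision(MEASUREMENTS)
measured_bias = bias(MEASUREMENTS, TRUE_VALUE)
```

With these values the bias is about 0.001 g and the standard deviation about 0.013 g: the readings scatter noticeably but are centred close to the true value.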


Data quality issues
• Outliers: an outlier is a data point that differs significantly from
other observations.
• An outlier may be due to variability in the measurement or it
may indicate experimental error

• Missing values
• Duplicate data

• Timeliness: some data starts to age as soon as it has been
collected, e.g., purchasing behaviour of customers, popularity of
movies, …
• Relevance: data needs to contain necessary information for the
application. E.g., age of the driver for car insurance analysis.
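The first three issues can be checked with a few lines of Python. This is a sketch under stated assumptions: the 1.5 × IQR rule used to flag outliers is one common convention that these slides do not prescribe, the heights come from the student table earlier in the slides, and `RECORDS` and all function names are hypothetical.

```python
import statistics

# Heights from the student table.
HEIGHTS = [180, 165, 210, 167, 175, 178, 182, 166, 159, 167]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical records with one missing value and one exact duplicate.
RECORDS = [
    {"id": "a", "age": 30},
    {"id": "b", "age": None},
    {"id": "a", "age": 30},
]


def count_missing(records):
    """Number of cells that have no value (represented here as None)."""
    return sum(value is None for record in records for value in record.values())


def drop_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Under this rule the 210 height is flagged. Whether it is a recording error or a genuinely tall student cannot be decided from the data alone.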
Questions?
