Types of Data and Data Quality: KIT306/606: Data Analytics Unit Coordinator: A/Prof. Quan Bai University of Tasmania

Types of data and data quality
KIT306/606: Data Analytics

Unit Coordinator: A/Prof. Quan Bai
University of Tasmania
Announcement
• Quiz 1
• Due date: Sunday 11:59 pm
• 2%
• 4 Questions
• One attempt
• True or false and multiple choice questions
• Tutorials start from this week

• Tutorial tasks (1.5% each week)
• You need to attend the tutorial, complete all the tasks and show them to your tutor
Data!! What is it?
• Data are characteristics that are collected through observation. In a more
technical sense, data is a set of values of qualitative or quantitative variables
about one or more objects (Wikipedia)
Single attribute: 176 cm
Multiple attributes:
Student ID  Age  Height  Study hours per day  Grade
001         23   180     4                    HD
• A data set can often be viewed as a collection of data instances, also called records or
examples.
• An attribute/feature is a property of a data instance that may vary, either from one
instance to another or from one time to another.
• Measurement is the application of a measurement scale to associate a value with a
particular attribute of a specific object.
Each row of the table below is a data instance; each column is an attribute/feature.
Student ID  Age  Height  Study hours per day  Grade
001         23   180     4                    HD
002         20   165     2                    DN
003         31   210     1                    CR
004         29   167     0.5                  PP
005         21   175     3                    HD
006         23   178     2                    HD
007         38   182     4                    DN
008         22   166     5                    HD
009         24   159     0.1                  F
010         25   167     0.2                  F
Types of attributes
• What operations can we apply to attributes?
• Distinctness: =, != (Nominal)
• Order: <, >, >=, <= (Ordinal)
• Addition: +, - (Interval)
• Multiplication: *, / (Ratio)
• Accordingly, we can define 4 types of attributes:

• Nominal
• Ordinal
• Interval
• Ratio
• Nominal: a variable with categories that do not have a natural order or
ranking.
• E.g.: car colour, blood type, phone brand …
• Blue != Red, but you cannot say Blue > Red or Blue < Red
• Ordinal: ordered scale, but the differences between adjacent categories
do not necessarily have the same meaning.
• E.g.: Satisfaction degree (“like”, “neutral”, “dislike”), Judo levels, …
• Black Belt > Red Belt > …, but it is meaningless to compute Black Belt + Red Belt
• Interval: ordered scale and the difference between 2 values is meaningful; 0 does
NOT mean there is none of that variable.
• E.g., temperature in °C, GPA, …
20 °C - 15 °C is meaningful, but 20 °C / 15 °C is not.
0 °C != no temperature
• Ratio: all the properties of an interval variable, and also has a clear definition of 0.
When the variable equals 0, there is none of that variable, AND the ratio of two
measurements has a meaningful interpretation.
• E.g., weight, length, number of students …
50 kg / 100 kg is meaningful;
0 kg means no weight
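One way to see the difference between interval and ratio scales is to change the unit of measurement: on a ratio scale the ratio of two values survives the change, while on an interval scale only ratios of differences do. The sketch below is illustrative only; the helper names `c_to_f` and `kg_to_lb` are made up for this example.

```python
import math

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit (interval scale)."""
    return celsius * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (ratio scale, zero stays zero)."""
    return kg * 2.20462

# Interval scale: the ratio of two values depends on the unit,
# so "20 degrees is 1.33 times 15 degrees" carries no meaning.
ratio_c = 20 / 15
ratio_f = c_to_f(20) / c_to_f(15)
interval_ratio_preserved = math.isclose(ratio_c, ratio_f)

# Differences are meaningful: a gap twice as large in Celsius
# is still twice as large in Fahrenheit.
diff_ratio_c = (20 - 10) / (20 - 15)
diff_ratio_f = (c_to_f(20) - c_to_f(10)) / (c_to_f(20) - c_to_f(15))

# Ratio scale: 50 kg is half of 100 kg in any unit.
ratio_kg = 50 / 100
ratio_lb = kg_to_lb(50) / kg_to_lb(100)

print(interval_ratio_preserved, diff_ratio_c, diff_ratio_f, ratio_kg, ratio_lb)
```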
Types of attributes (cont.)
All variables
• numerical: continuous, discrete
• categorical: ordinal, not ordinal

Types of variables: Numerical
• Numerical: values or observations that can be measured
• Discrete Variable
• A variable that can take on only certain individual numeric values.
• E.g. # of pages in a book, # of people in Hobart
• Continuous Variable
• A variable that can take on any value in a certain range
• An infinite number of values
• E.g. Length of a film, temperature, time taken to run a race
General characteristics of data sets:
• Dimensionality: number of attributes
• Sparsity: most attributes of an instance have the value 0, and very few attributes
have non-zero values.
a1  a2  a3  a4  a5  a6
0   0   0   0   0   0
0   0   0   0   0   0
0   1   0   0   0   0
0   0   0   4   0   0
0   0   0   0   0   0
• Resolution: how fine the measurement is. E.g., meter vs. km, hour vs. second, …
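The dimensionality and sparsity of the small example matrix can be computed directly. This is a minimal sketch; measuring sparsity as the fraction of zero-valued cells is one common convention and is an assumption here, not a definition given in the slides.

```python
# The 5 x 6 example matrix (columns a1..a6).
data = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

dimensionality = len(data[0])                  # number of attributes
total_cells = len(data) * dimensionality
nonzero_cells = sum(1 for row in data for value in row if value != 0)
sparsity = 1 - nonzero_cells / total_cells     # fraction of zero entries (assumed definition)

# A sparse representation stores only the non-zero cells,
# keyed by (row position, column position).
sparse = {
    (i, j): value
    for i, row in enumerate(data)
    for j, value in enumerate(row)
    if value != 0
}

print(dimensionality, nonzero_cells, round(sparsity, 3), sparse)
```

Storing only the non-zero cells is why sparse data sets can be processed cheaply even when the dimensionality is high.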
Types of data sets
• Transaction data: each record involves a set
of items.
• E.g., shopping basket data
• Graph-based data:
• Data with relationships among instances
• We can use a graph to capture relationships
among data instances
• E.g., social network data
• Ordered data: the attributes of data instances have relationships
which can be ordered, e.g., in time or space.
• Sequence data
• A data set that is a sequence of individual entities
• E.g., natural language data
“London is the capital city of the UK”
• Time series data:
• A special type of sequential data
• Each record is a time series, i.e., a series of measurements taken over time (e.g., daily)
• E.g., water level of a river in 2020
• Spatial data: data instances with special attributes, i.e., positions or areas.
• E.g., weather data
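Each of these data set types maps naturally onto a different in-memory structure. The sketch below uses plain Python containers; all baskets, names, and water-level readings are invented for illustration, and only the example sentence comes from the slide.

```python
# Transaction data: each record is a set of items (hypothetical baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "eggs", "butter"},
    {"milk", "eggs"},
]
item_counts = {}
for basket in transactions:
    for item in basket:
        item_counts[item] = item_counts.get(item, 0) + 1

# Graph-based data: adjacency lists for a tiny, made-up social network.
friends = {"Ann": ["Bob", "Cat"], "Bob": ["Ann"], "Cat": ["Ann"]}
degree = {person: len(neighbours) for person, neighbours in friends.items()}

# Sequence data: order matters, so a list of tokens is a natural fit.
tokens = "London is the capital city of the UK".split()

# Time series data: (date, value) pairs ordered by time; readings are invented.
water_level = [("2020-01-01", 1.2), ("2020-01-02", 1.3), ("2020-01-03", 1.1)]
highest_day = max(water_level, key=lambda pair: pair[1])[0]

print(item_counts, degree, len(tokens), highest_day)
```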
Structured vs. unstructured data
• Structured data is comprised of clearly defined data types whose
pattern makes them easily searchable; while unstructured data –
“everything else” – is comprised of data that is usually not as easily
searchable, including formats like audio, video, and social media
postings. (datamation.com)
• Structured data normally has a pre-defined data schema; unstructured
data has internal structure but is not structured via pre-defined data
models or schema.
Schema vs Schemaless data
• A schema specifies the structure and types of data, e.g. the type of each column in a table.
It may also specify constraints within or between data fields.
Structured data generally has a schema associated with it.
• A schema is a kind of meta-data.

Meta-data: data about data, e.g. the author of a file, file size, the date the document was created,
keywords of a document, …
Relational DB: Schema or Schemaless?

Tabular Data
• What is a table?
• A table is a collection of rows and columns
• Each row has an index
• Each column has a name
• A cell is specified by an (index, name) pair
• A cell may or may not have a value
• Schema = (minimally) column types.
• Often stored as text files in CSV format.
Case Study: Cars dataset
type price passengers weight

1 small 15.9 5 2705
2 midsize 33.9 5 3580
… … … ... …
54 midsize 26.7 5 3245
• What are the instances?

• What are the variables?
• What is the schema?
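As a rough illustration of how a table, its CSV form, and a minimal schema fit together, the sketch below loads the three rows shown above. The column name `index`, the helper `load_table`, and the choice of Python types per column are assumptions made for this example; they are one possible schema, not the only valid one.

```python
import csv
import io

# The three visible rows of the cars table; the "index" header is assumed.
CSV_TEXT = """index,type,price,passengers,weight
1,small,15.9,5,2705
2,midsize,33.9,5,3580
54,midsize,26.7,5,3245
"""

# A minimal schema: one Python type per column name (assumed types).
SCHEMA = {"index": int, "type": str, "price": float,
          "passengers": int, "weight": int}

def load_table(text, schema):
    """Parse CSV text and cast each cell using the schema.

    Empty cells become None, since a cell may or may not have a value.
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            name: (cast(raw[name]) if raw[name] != "" else None)
            for name, cast in schema.items()
        })
    return rows

table = load_table(CSV_TEXT, SCHEMA)

# A cell is addressed by an (index, name) pair.
cells = {
    (row["index"], name): value
    for row in table
    for name, value in row.items()
    if name != "index"
}

print(cells[(54, "price")])
```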
Types of variables in car dataset
type price passengers weight

1 small 15.9 5 2705
2 midsize 33.9 5 3580
… … … ... …
54 midsize 26.7 5 3245
• Type:
• Price:
• Passengers:
• Weight:
Quick Question: the type of variable
• What type of variable is a postcode?
a) Numerical, continuous
b) Numerical, discrete
c) Categorical, nominal
d) Categorical, ordinal
Associated vs. independent
• When two variables show some connection with one another, they
are called associated variables.
• Associated variables can also be called dependent variables and vice-versa.
• If two variables are not associated (i.e., there is no evident connection
between the two), then they are said to be independent.
• Observations can also be independent of one another.
Sample Data Extraction
• A sample data set is extracted from a larger data set called the population (all data).
• It is usually not feasible to collect information on the entire population due to the high
cost of data collection, so statisticians instead work with samples that are
(hopefully) representative of the populations they come from.
• Using summary statistics and graphs based on these samples we try to
understand certain features of the population as a whole.
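The idea of estimating a population feature from a sample can be sketched in a few lines. The population below is synthetic (normally distributed heights with an assumed mean of 170 cm and standard deviation of 10 cm), and the sample size of 200 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# A made-up "population" of 10,000 heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Simple random sample without replacement.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

With a simple random sample, the sample mean is usually close to the population mean; a sample that is not representative (for example, only the tallest students) would not have this property.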
Data quality
• Measurement and data collection issues:
• Measurement error: any problem resulting from the measurement process, i.e., the
difference between the measurement value and the true value
• Data collection error: errors such as omitting data objects or attribute values, or
inappropriately including a data object.
• Noise: random component of a measurement error

Measurement of data/measurement quality
• Precision: the closeness of repeated measurements of the same quantity to one another
• Normally measured by the standard deviation of a set of values
• Bias: a systematic variation of measurements from the quantity being measured.
• Measured by taking the difference between the mean of the set of values and the known value of the
quantity being measured.
• Accuracy: the closeness of measurements to the true value of the quantity being measured.
• Depends on precision and bias.
[Figure: three targets labelled “accurate NOT precise”, “precise NOT accurate” (can be caused by bias), and “precise AND accurate”]
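Precision and bias as defined above can be computed from repeated measurements of a quantity whose true value is known. The readings below are hypothetical: five weighings of a reference mass assumed to be exactly 1 g.

```python
import statistics

true_value = 1.000  # known mass of the reference weight in g (assumed)
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]  # hypothetical readings

# Precision: spread of the repeated measurements (sample standard deviation).
precision = statistics.stdev(measurements)

# Bias: mean of the measurements minus the known true value.
bias = statistics.mean(measurements) - true_value

print(f"precision (std dev): {precision:.3f} g")
print(f"bias:                {bias:.3f} g")
```

A small standard deviation with a large bias corresponds to the "precise NOT accurate" case; both values must be small for the measurements to be accurate.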

Data quality issues
• Outliers: an outlier is a data point that differs significantly from
other observations.
• An outlier may be due to variability in the measurement or it
may indicate experimental error
• Missing values
• Duplicate data
• Timeliness: some data starts to age as soon as it has been
collected, e.g., purchasing behaviour of customers, popularity of
movies, …
• Relevance: data needs to contain necessary information for the
application. E.g., age of the driver for car insurance analysis.
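Three of these issues (missing values, duplicate data, and outliers) can be checked with simple code. The records below are hypothetical, and flagging values more than 2 standard deviations from the mean is just one simple outlier rule chosen for this sketch.

```python
import statistics

# Hypothetical records (student_id, height_cm); None marks a missing value.
records = [
    ("001", 180), ("002", 165), ("003", 210), ("004", 167),
    ("005", 175), ("005", 175),   # duplicate row
    ("006", None),                # missing height
    ("007", 182), ("008", 166), ("009", 159), ("010", 167),
]

# Missing values: records with no height recorded.
missing = [sid for sid, height in records if height is None]

# Duplicate data: keep the first occurrence of each record.
seen = set()
unique, duplicates = [], []
for record in records:
    if record in seen:
        duplicates.append(record)
    else:
        seen.add(record)
        unique.append(record)

# Outliers: heights more than 2 standard deviations from the mean
# (the threshold of 2 is an assumption, not a universal rule).
heights = [height for _, height in unique if height is not None]
mean = statistics.mean(heights)
sd = statistics.stdev(heights)
outliers = [
    (sid, height)
    for sid, height in unique
    if height is not None and abs(height - mean) > 2 * sd
]

print(missing, duplicates, outliers)
```

Whether a flagged value such as a height of 210 cm is an error or genuine variability still needs a human judgement; the code only points to candidates.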
Questions?
