Types of Data and Data Quality: KIT306/606: Data Analytics Unit Coordinator: A/Prof. Quan Bai University of Tasmania
176 cm
Single attribute:
multiple attributes:
attributes/features
Types of attributes
• What operations can we apply to attributes?
• Distinctness: =, != (Nominal)
• Order: <, >, >=, <= (Ordinal)
• Addition: +, - (Interval)
• Multiplication: *, / (Ratio)
• Blue != Red, but you cannot say Blue > Red or Blue < Red
• Ratio: has all the properties of an interval variable, and also has a clearly defined 0.
When the variable equals 0, there is none of that quantity, and the ratio of two
measurements has a meaningful interpretation.
• E.g., weight, length, number of students …
50 kg / 100 kg is meaningful;
0 kg means no weight
Types of attributes (cont.)
All variables
numerical categorical
• Continuous Variable
• A variable can take on any value in a certain range
• An infinite number of values
• E.g. Length of a film, temperature, time taken to run a race
General characteristics of data sets:
• Dimensionality: number of attributes
• Sparsity: most attributes of an instance have the value 0, and very few attributes
have non-zero values.
a1 a2 a3 a4 a5 a6
0 0 0 0 0 0
0 0 0 0 0 0
0 1 0 0 0 0
0 0 0 4 0 0
0 0 0 0 0 0
• Resolution: how fine the measurement is. E.g., meter vs. km, hour vs. second, …
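Dimensionality and sparsity can be computed directly from the small table above. The sketch below is only one way to do it: measuring sparsity as the fraction of zero entries is an assumption of this example, and the function names are illustrative.

```python
# The 5 x 6 example table: rows are instances, columns are attributes a1..a6.
data = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def dimensionality(rows):
    """Number of attributes (columns) per instance."""
    return len(rows[0]) if rows else 0


def sparsity(rows):
    """Fraction of entries equal to 0 (assumed definition for this sketch)."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


def to_sparse(rows):
    """Store only the non-zero entries, keyed by (row index, column index)."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    }


print(dimensionality(data))
print(round(sparsity(data), 3))
print(to_sparse(data))
```

Storing only the non-zero entries is why sparse data can be kept compactly even when dimensionality is high.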
Types of data sets
• Transaction data: each record involves a set
of items.
• E.g., shopping basket data
• Graph-based data:
• Data with relationships among instances
• We can use graph to capture relationships
among data instances
• E.g., social network data
• Ordered data: the attributes of data instances have relationships
which can be ordered, e.g., in time or space.
• Sequence data
• A data set that is a sequence of individual entities
• E.g., natural language data
“London is the capital city of the UK”
• Time series data:
• A special type of sequential data
• Each record is a time series, i.e., a series of measurements taken over time (e.g., daily)
• E.g., water level of a river in 2020
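The data set types listed above map naturally onto basic Python structures. The following is a rough sketch, not a prescribed representation: the basket contents, the friendship pairs and the water-level values are all hypothetical, and the helper names are made up for illustration.

```python
from collections import defaultdict

# Transaction data: each record is a set of items (hypothetical baskets).
baskets = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer"},
]


def item_counts(transactions):
    """Count in how many transactions each item appears."""
    counts = defaultdict(int)
    for basket in transactions:
        for item in basket:
            counts[item] += 1
    return dict(counts)


# Graph-based data: relationships among instances (hypothetical friendships).
friendships = [("ann", "bob"), ("bob", "cat")]


def build_graph(edges):
    """Build an undirected adjacency mapping from a list of edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Sequence data: the order of the tokens carries the meaning.
sentence = "London is the capital city of the UK".split()

# Time series data: (ISO date, value) pairs; the values are hypothetical.
water_level = [("2020-01-01", 1.20), ("2020-01-02", 1.25), ("2020-01-03", 1.22)]


def is_time_ordered(series):
    """ISO-formatted dates sort chronologically when compared as strings."""
    dates = [date for date, _ in series]
    return dates == sorted(dates)


print(item_counts(baskets))
print(build_graph(friendships))
print(sentence)
print(is_time_ordered(water_level))
```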
Schema vs Schemaless data
• A schema specifies the structure and types of the data, e.g., the type of each column in a table.
It may also specify constraints within or between data fields.
Structured data generally has a schema associated with it
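A schema check can be sketched in a few lines. The example below is an assumption-laden illustration: the field names borrow from the cars case study, but the Python types chosen for them, the sample values, the single constraint and the `validate` helper are all hypothetical.

```python
# Hypothetical schema: field name -> expected Python type.
schema = {"type": str, "price": float, "passengers": int, "weight": float}

# Hypothetical constraint within a field: a car carries at least one passenger.
constraints = {"passengers": lambda value: value > 0}


def validate(record, schema, constraints=None):
    """Return a list of schema violations; an empty list means the record is valid."""
    constraints = constraints or {}
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
        elif field in constraints and not constraints[field](record[field]):
            errors.append(f"{field}: constraint violated")
    return errors


good = {"type": "Small", "price": 15.9, "passengers": 5, "weight": 2705.0}
bad = {"type": "Small", "price": "cheap", "passengers": 0}

print(validate(good, schema, constraints))
print(validate(bad, schema, constraints))
```

Schemaless data skips this step: records may carry any fields, so checks of this kind have to happen in the code that reads them.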
Case Study: Cars dataset
• Type:
• Price:
• Passengers:
• Weight:
Quick Question: the type of variable
• What type of variable is a postcode?
a) Numerical, continuous
b) Numerical, discrete
c) Categorical, nominal
d) Categorical, ordinal
Associated vs. independent
• When two variables show some connection with one another, they
are called associated variables.
• Associated variables can also be called dependent variables and vice-versa.
• If two variables are not associated (i.e., there is no evident connection
between the two), then they are said to be independent.
• Observations can also be independent of one another.
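One common way to look for a connection between two numerical variables is the Pearson correlation coefficient. It is used here only as an illustration and is not the slides' definition of association: it detects linear relationships only, so a value near 0 does not by itself prove independence. The data values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: y doubles with x, so the two are strongly associated.
hours = [1, 2, 3, 4]
distance = [2, 4, 6, 8]

# Hypothetical data with no linear trend between the two variables.
day = [1, 2, 3, 4, 5]
noise = [2, 1, 3, 1, 2]

print(pearson(hours, distance))
print(pearson(day, noise))
```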
Sample Data Extraction
• A sample data set is extracted from a larger data set called the population.
All data
• It is usually not feasible to collect information on the entire population due to the high
cost of data collection, so statisticians instead work with samples that are
(hopefully) representative of the populations they come from.
• Using summary statistics and graphs based on these samples we try to
understand certain features of the population as a whole.
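Drawing a simple random sample and comparing a summary statistic with the population value can be sketched as follows. The population of heights is hypothetical, simple random sampling without replacement is just one possible sampling scheme, and `draw_sample` is an illustrative helper.

```python
import random
import statistics

# Hypothetical population: heights in cm for 50 people.
population = list(range(150, 200))


def draw_sample(data, size, seed=None):
    """Simple random sample without replacement; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(data, size)


sample = draw_sample(population, 10, seed=42)

# The sample mean is an estimate of the population mean, not an exact match.
print("population mean:", statistics.mean(population))
print("sample mean:", statistics.mean(sample))
```

Larger samples tend to give estimates closer to the population value, which is the sense in which a sample is hoped to be representative.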
Data quality
• Measurement and data collection issues:
• Measurement error: any problem resulting from the measurement process, i.e., the
difference between the measurement value and the true value
• Data collection error: errors such as omitting data objects or attribute values, or
inappropriately including a data object.
• Missing values
• Duplicate data
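Missing values and duplicate records can be detected with simple checks. The sketch below assumes that missing values are encoded as `None` and that two records are duplicates only when all their fields match; the records and helper names are hypothetical.

```python
# Hypothetical records; None marks a missing attribute value.
records = [
    {"id": 1, "height_cm": 176, "weight_kg": 70},
    {"id": 2, "height_cm": None, "weight_kg": 82},
    {"id": 1, "height_cm": 176, "weight_kg": 70},  # exact duplicate of the first
    {"id": 3, "height_cm": 165, "weight_kg": None},
]


def missing_counts(rows):
    """Count missing (None) values per attribute."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable, independent of field order
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


print(missing_counts(records))
print(len(drop_duplicates(records)))
```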