Unit - 3 Data Taxonomy

UNIT-3

DATA THEORY AND TAXONOMY OF DATA

Data are individual pieces of factual information recorded and used for the purpose of analysis;
they are the raw material from which statistics are created. Data analysis is the practice of
working with data to glean useful information, which can then be used to make informed
decisions.

DATA TAXONOMY:
Data taxonomy is the classification of data into categories and sub-categories. It provides
a unified view of the data in an organization and introduces common terminologies and
semantics across multiple systems. Establishing a hierarchy within a set of metadata and
segregating it into categories creates a better understanding of the relationships between data
points.

Benefits gained by creating a taxonomy:

 Fundamental understanding of data — As a result of the GDPR, it is possible that
many existing data elements are not compliant with the regulation and need to be fixed.
Taxonomy helps discover such data quality issues by providing a basic understanding of
what the data is and its lineage.
 Data access — GDPR provides data subjects with the right to access their data in an
electronic format whenever needed. Categorization of data helps with faster retrieval of
data, as it extends the search for a keyword automatically to other, closely related terms.
 Risk analysis — Classification of data helps determine whether it risks non-compliance.
The process helps identify data that falls under the highly sensitive category. Such data
would require anonymization per the GDPR. Other, non-sensitive data can be ignored
for compliance analysis, saving time and effort.
 Reduce unwanted data — GDPR recommends data minimization to collect and store
only as much personal data as required. A taxonomy helps get rid of existing ROT
(redundant, obsolete, or trivial) data, which decreases the risk of storing non-compliant
personal data.
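To make the idea concrete, below is a minimal sketch (in Python) of how a taxonomy might be represented as a hierarchy of categories and sub-categories. The category names and the find_path() helper are hypothetical examples used only for illustration; they are not part of any standard or library.

```python
# A minimal sketch of a data taxonomy as a hierarchy of categories and
# sub-categories. The category names and find_path() are hypothetical.

TAXONOMY = {
    "Customer Data": {
        "Identity": ["name", "date_of_birth", "national_id"],
        "Contact": ["email", "phone", "postal_address"],
    },
    "Transaction Data": {
        "Orders": ["order_id", "order_total"],
        "Payments": ["card_token", "payment_status"],
    },
}

def find_path(element, taxonomy=TAXONOMY):
    """Return the category > sub-category path of a data element, if any."""
    for category, subcategories in taxonomy.items():
        for subcategory, elements in subcategories.items():
            if element in elements:
                return f"{category} > {subcategory} > {element}"
    return None

print(find_path("email"))        # Customer Data > Contact > email
print(find_path("card_token"))   # Transaction Data > Payments > card_token
```

Placing every data element in such a hierarchy is what gives the "unified view" described above: related elements sit under a common parent, so a search or a compliance review can be extended from one term to its siblings.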
VARIOUS SCALES OF MEASUREMENT OF DATA:

Levels of Measurement
There are four different scales of measurement, and any data can be classified as belonging to
one of them. The four types of scales are:

 Nominal Scale
 Ordinal Scale
 Interval Scale
 Ratio Scale

Nominal Scale
A nominal scale is the 1st level of measurement scale, in which numbers serve as “tags” or
“labels” to classify or identify objects. A nominal scale usually deals with non-numeric
variables, or with numbers that carry no quantitative value.
Characteristics of Nominal Scale

 A nominal scale variable is classified into two or more categories. In this measurement
mechanism, each answer must fall into one of these categories.
 It is qualitative. The numbers are used here to identify the objects.
 The numbers don’t define the object characteristics. The only permissible aspect of
numbers in the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the variables are used as tags, and the answer to this question should be either M or F.
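A short illustrative sketch, assuming the M/F coding above: the only permissible operation on nominal data is counting the labels.

```python
from collections import Counter

# Nominal data: "M" and "F" are labels only, so the only meaningful
# operation is counting how often each label occurs.
responses = ["M", "F", "F", "M", "F", "F"]

counts = Counter(responses)
print(counts)                 # Counter({'F': 4, 'M': 2})
print(counts.most_common(1))  # the mode: [('F', 4)]
# Arithmetic such as a "mean gender" would be meaningless here.
```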

Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the ordering and ranking of data
without establishing the degree of variation between them. Ordinal represents the “order.”
Ordinal data is known as qualitative data or categorical data. It can be grouped, named and also
ranked.
Characteristics of the Ordinal Scale

 The ordinal scale shows the relative ranking of the variables
 It identifies and describes the magnitude of a variable
 Along with the information provided by the nominal scale, ordinal scales give the
rankings of those variables
 The interval properties are not known
 The surveyors can quickly analyse the degree of agreement concerning the identified
order of variables
Example:

 Ranking of school students – 1st, 2nd, 3rd, etc.
 Ratings in restaurants
 Evaluating the frequency of occurrences
   o Very often
   o Often
   o Not often
   o Not at all
 Assessing the degree of agreement
   o Totally agree
   o Agree
   o Neutral
   o Disagree
   o Totally disagree
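The sketch below shows one way ordinal data such as the agreement scale above might be summarised; the numeric codes are an assumed encoding that preserves order only.

```python
import statistics

# Ordinal data: an assumed encoding that preserves order only; the gaps
# between codes are not guaranteed to be equal.
levels = {"Totally disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Totally agree": 5}

answers = ["Agree", "Neutral", "Totally agree", "Agree", "Disagree"]
codes = sorted(levels[a] for a in answers)

print(codes)                     # [2, 3, 4, 4, 5]
print(statistics.median(codes))  # 4 -> "Agree"; the median respects order
# The mean of these codes is not meaningful, because the interval
# between, say, "Neutral" and "Agree" is unknown.
```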

Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative
measurement scale in which the difference between two values is meaningful. However, the zero
point on an interval scale is arbitrary, so ratios between values are not meaningful (for
example, 20 °C is not “twice as warm” as 10 °C).
Characteristics of Interval Scale:

 The interval scale is quantitative as it can quantify the difference between the values
 It allows calculating the mean and median of the variables
 To understand the difference between the variables, you can subtract the values between
the variables
 The interval scale is widely used in statistics because it allows numerical values to be
assigned to otherwise arbitrary assessments such as feelings, calendar dates, etc.
Example:

 Likert Scale
 Net Promoter Score (NPS)
 Bipolar Matrix Table

Ratio Scale
The ratio scale is the 4th level of measurement scale, and it is quantitative. It is a type of
variable measurement scale that allows researchers to compare differences or intervals. The
ratio scale has a unique feature: it possesses a true zero point (an absolute origin), so ratios
between values are meaningful.
Characteristics of Ratio Scale:

 Ratio scale has a feature of absolute zero
 It doesn’t have negative numbers, because of its zero-point feature
 It affords unique opportunities for statistical analysis. The values can be meaningfully
added, subtracted, multiplied, and divided. Mean, median, and mode can all be calculated
using the ratio scale.
 Ratio scale has unique and useful properties. One such feature is that it allows unit
conversions, such as kilograms to grams or calories to kilojoules.
Example:
An example of a ratio scale is:
What is your weight in Kgs?

 Less than 55 kgs
 55 – 75 kgs
 76 – 85 kgs
 86 – 95 kgs
 More than 95 kgs
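The following sketch contrasts the interval and ratio scales, using temperature in degrees Celsius (arbitrary zero) and weight in kilograms (true zero); the values are invented for illustration.

```python
# Interval scale: temperature in Celsius. Differences are meaningful,
# but the zero point is arbitrary, so ratios are not.
temps_c = [10.0, 20.0, 30.0]
print(temps_c[1] - temps_c[0])   # 10.0 -> a meaningful difference
# 20 °C is NOT "twice as warm" as 10 °C, so temps_c[1] / temps_c[0] is misleading.

# Ratio scale: weight in kilograms. A true zero exists, so both
# differences and ratios are meaningful.
weights_kg = [55.0, 110.0]
print(weights_kg[1] - weights_kg[0])  # 55.0 kg heavier
print(weights_kg[1] / weights_kg[0])  # 2.0 -> genuinely twice as heavy
```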

TYPES OF DATA BASED ON STRUCTURE:

1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository, typically a database. It covers all data that
can be stored in a SQL database in a table with rows and columns. Such data have
relational keys and can easily be mapped into pre-designed fields. Today, structured data
is the most commonly processed and the simplest kind of data to
manage. Example: Relational data.

2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze. With some processing,
it can be stored in a relational database (though this can be very hard for some kinds of
semi-structured data), but keeping it in its semi-structured form is often easier. Example: XML data.

3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not
have a predefined data model, so it is not a good fit for a mainstream relational
database. Alternative platforms exist for storing and managing unstructured data; it is
increasingly prevalent in IT systems and is used by organizations in a variety of
business intelligence and analytics applications. Example: Word documents, PDFs, text,
media logs.
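As a rough illustration of the three types above, the sketch below stores a small structured table in SQLite, parses a made-up semi-structured XML snippet, and searches a made-up unstructured log line.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured data: rows and columns with a fixed schema (relational table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
print(conn.execute("SELECT name FROM customers").fetchall())  # [('Asha',), ('Ravi',)]

# Semi-structured data: no fixed table schema, but tags give it some structure.
xml_doc = "<order id='42'><item sku='A1' qty='3'/></order>"
root = ET.fromstring(xml_doc)
print(root.attrib["id"], root.find("item").attrib["sku"])     # 42 A1

# Unstructured data: free text with no predefined model; it has to be
# searched or mined rather than queried by column.
log_text = "User reported that the checkout page froze after payment."
print("checkout" in log_text.lower())                         # True
```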
What is quantitative data?

Quantitative data refers to any information that can be quantified. If it can be counted or
measured, and given a numerical value, it’s quantitative data. Quantitative data can tell you
“how many,” “how much,” or “how often”—for example, how many people attended last
week’s webinar? How much revenue did the company make in 2019? How often does a certain
customer group use online banking?
What is qualitative data?

Unlike quantitative data, qualitative data cannot be measured or counted. It’s descriptive,
expressed in terms of language rather than numerical values. Researchers will often turn to
qualitative data to answer “Why?” or “How?” questions. For example, if your quantitative data
tells you that a certain website visitor abandoned their shopping cart three times in one week,
you’d probably want to investigate why—and this might involve collecting some form of
qualitative data from the user. Perhaps you want to know how a user feels about a particular
product; again, qualitative data can provide such insights. In this case, you’re not just looking
at numbers; you’re asking the user to tell you, using language, why they did something or how
they feel. Qualitative data also refers to the words or labels used to describe certain
characteristics or traits—for example, describing the sky as blue or labeling a particular ice
cream flavor as vanilla.
What are the main differences between quantitative and qualitative data?

The main differences between quantitative and qualitative data lie in what they tell us, how
they are collected, and how they are analyzed. Let’s summarize the key differences before
exploring each aspect in more detail:
 Quantitative data is countable or measurable, relating to numbers. Qualitative data is
descriptive, relating to language.
 Quantitative data tells us how many, how much, or how often (e.g. “20 people signed
up to our email newsletter last week”). Qualitative data can help us to understand the
“why” or “how” behind certain behaviors, or it can simply describe a certain attribute—
for example, “The postbox is red” or “I signed up to the email newsletter because I’m
really interested in hearing about local events.”
 Quantitative data is fixed and “universal,” while qualitative data is subjective and
dynamic. For example, if something weighs 20 kilograms, that can be considered an
objective fact. However, two people may have very different qualitative accounts of
how they experience a particular event.
 Quantitative data is gathered by measuring and counting. Qualitative data is collected
by interviewing and observing.
 Quantitative data is analyzed using statistical analysis, while qualitative data is
analyzed by grouping it in terms of meaningful categories or themes.

COMMONLY USED STATISTICAL TOOLS


NCSS: A robust statistical and graphics program, NCSS is used in a variety of industries, from
medical research and business analytics to engineering, quality control, and academic
research. It is used worldwide by thousands of researchers, consultants, professionals,
engineers, and scientists.
POWER BI: Power BI is an interactive data visualization software product developed
by Microsoft with a primary focus on business intelligence. It is part of the Microsoft Power
Platform. Power BI is a collection of software services, apps, and connectors that work together
to turn unrelated sources of data into coherent, visually immersive, and interactive insights.
Data may be input by reading directly from a database, webpage, or structured files such as
spreadsheets, CSV, XML, and JSON.
SPSS (IBM): SPSS (Statistical Package for the Social Sciences) is perhaps the most widely
used statistics software package within human behavior research. SPSS offers the ability to
easily compile descriptive statistics, parametric and non-parametric analyses, as well as
graphical depictions of results through the graphical user interface (GUI). It also includes the
option to create scripts to automate analysis, or to carry out more advanced statistical
processing.
R (R Foundation for Statistical Computing): R is a free statistical software package that is
widely used across both human behavior research and in other fields. Toolboxes (essentially
plugins) are available for a great range of applications, which can simplify various aspects of
data processing. While R is very powerful software, it also has a steep learning curve and
requires a certain degree of coding. It does, however, come with an active community engaged
in building and improving R and the associated plugins, which ensures that help is never too
far away.
MATLAB (The MathWorks): MATLAB is an analytical platform and programming language
that is widely used by engineers and scientists. As with R, the learning path is steep, and you
will be required to write your own code at some point. A wide range of toolboxes is also
available to help answer your research questions (such as EEGLab for analysing EEG data).
While MATLAB can be difficult to use for novices, it offers a massive amount of flexibility
in terms of what you want to do – as long as you can code it (or at least operate the toolbox
you require).
Microsoft Excel: While not a cutting-edge solution for statistical analysis, MS Excel does
offer a wide variety of tools for data visualization and simple statistics. It’s simple to generate
summary metrics and customizable graphics and figures, making it a usable tool for many who
want to see the basics of their data. Because many individuals and companies own and know
how to use Excel, it is also an accessible option for those looking to get started with
statistics.
SAS (Statistical Analysis Software): SAS is a statistical analysis platform that offers options
to use either the GUI, or to create scripts for more advanced analyses. It is a premium solution
that is widely used in business, healthcare, and human behavior research alike. It’s possible to
carry out advanced analyses and produce publication-worthy graphs and charts, although the
coding can also be a difficult adjustment for those not used to this approach.
GraphPad Prism: GraphPad Prism is premium software primarily used within statistics
related to biology, but offers a range of capabilities that can be used across various fields.
Similar to SPSS, scripting options are available to automate analyses, or carry out more
complex statistical calculations, but the majority of the work can be completed through the
GUI.
Minitab: The Minitab software offers a range of both basic and fairly advanced statistical tools
for data analysis. Similar to GraphPad Prism, commands can be executed through both the GUI
and scripted commands, making it accessible to novices as well as users looking to carry out
more complex analyses.

MEASUREMENT AND SCALING CONCEPT:

Scales of measurement describe how variables are defined and categorised. Psychologist Stanley
Stevens developed the four common scales of
measurement: nominal, ordinal, interval and ratio. Each scale of measurement has properties
that determine how to properly analyse the data. The properties evaluated
are identity, magnitude, equal intervals and a minimum value of zero.
Properties of Measurement
• Identity: Identity refers to each value having a unique meaning.
• Magnitude: Magnitude means that the values have an ordered relationship to one
another, so there is a specific order to the variables.
• Equal intervals: Equal intervals mean that data points along the scale are equal, so the
difference between data points one and two will be the same as the difference between
data points five and six.
• A minimum value of zero: A minimum value of zero means the scale has a true zero
point. Temperature in degrees Celsius, for example, can fall below zero and still have
meaning, so it lacks a true zero; weight, by contrast, cannot fall below zero, so it has
a true zero point.

Quantitative Processing
Quantitative processing describes the relationships within the data. Depending on the sample,
there are different ways to communicate quantitative data; two of them, frequency distribution
and correlation, are sketched in code after the list below.
• Nominal comparison: Sub-categories are individually compared in no particular order.
• Time series: An individual variable is tracked over a period of time, usually represented
in a line chart.
• Ranking: Sub-categories are ranked in order, usually represented in a bar chart.
• Part-to-whole: Sub-categories are represented as a ratio in comparison with the whole,
usually represented in a bar or pie chart.
• Deviation: Sub-categories are compared with a reference point, usually represented in
a bar chart.
• Frequency distribution: Sub-categories are counted in intervals, usually represented in
a histogram.
• Correlation: Two sets of measures are compared to identify if they move in the same or
opposite directions, usually represented in a scatter plot.
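As mentioned above, here is a brief sketch of two of these representations (frequency distribution and correlation) using matplotlib; the sample numbers are randomly generated for illustration.

```python
import random
import matplotlib.pyplot as plt

random.seed(0)
values = [random.gauss(50, 10) for _ in range(200)]   # invented sample
x = list(range(50))
y = [2 * xi + random.gauss(0, 5) for xi in x]         # roughly correlated pair

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Frequency distribution: counts of values falling in equal-width intervals.
ax1.hist(values, bins=10)
ax1.set_title("Frequency distribution (histogram)")

# Correlation: do the two measures move in the same or opposite direction?
ax2.scatter(x, y)
ax2.set_title("Correlation (scatter plot)")

plt.tight_layout()
plt.show()
```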

Steps to Effective Data Classification and Categorization Procedure


Formalized Classification Policy

Major Classification Methods


• Equal intervals
• Mean-standard deviation
• Quantiles
• Maximum breaks
• Natural breaks

The Equal Interval Classification (constant class intervals)


To determine the class interval, you divide the whole range of all your data (highest data value
minus lowest data value) by the number of classes you have decided to generate.
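A minimal sketch of this calculation, assuming an arbitrary example dataset and four classes:

```python
def equal_interval_breaks(data, n_classes):
    """Upper class limits for equal-width (constant interval) classes."""
    width = (max(data) - min(data)) / n_classes
    return [min(data) + width * i for i in range(1, n_classes + 1)]

values = [2, 5, 7, 12, 18, 21, 24, 30, 41, 50]   # arbitrary example data
print(equal_interval_breaks(values, 4))          # [14.0, 26.0, 38.0, 50.0]
```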

The Mean-Standard Deviation Classification


Another method that allows us to classify our dataset is the standard deviation. This method
takes into account how data is distributed along the dispersion graph. To apply this method, we
repeatedly add (or subtract) the calculated standard deviation from the statistical mean of our
dataset. The resulting classes reveal the frequency of elements in each class.
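A small sketch of this idea, using the population standard deviation and the same arbitrary example data:

```python
import statistics

def mean_stdev_breaks(data, n_deviations=2):
    """Class breaks at mean ± k standard deviations (population st. dev.)."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return [mean + k * sd for k in range(-n_deviations, n_deviations + 1)]

values = [2, 5, 7, 12, 18, 21, 24, 30, 41, 50]   # arbitrary example data
print([round(b, 1) for b in mean_stdev_breaks(values)])
```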
The Maximum Breaks Classification
When we choose to use the method of maximum breaks, we first order our raw data from low
to high. Then we calculate the differences between neighbouring values; the largest
differences are used as class breaks. You can also recognize the maximum breaks
visually on a dispersion graph: large value differences appear as blank spaces.
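A sketch of the maximum-breaks idea: sort the data, compute the gaps between neighbouring values, and place class breaks in the largest gaps. The data values are arbitrary.

```python
def maximum_breaks(data, n_classes):
    """Place class breaks in the middle of the (n_classes - 1) largest gaps."""
    ordered = sorted(data)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    largest = sorted(gaps, reverse=True)[: n_classes - 1]
    # A break is taken halfway across each of the chosen gaps.
    return sorted((ordered[i] + ordered[i + 1]) / 2 for _, i in largest)

values = [2, 5, 7, 12, 18, 21, 24, 30, 41, 50]
print(maximum_breaks(values, 4))   # [27.0, 35.5, 45.5]
```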
The Natural Breaks Classification
Applying the classification method of "natural breaks", we group the data set according to
visually logical and subjective criteria. One important purpose of natural breaks is to
minimise value differences between data within the same class. Another purpose is to
emphasize the differences between the created classes.
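Natural breaks are usually computed with Jenks' optimisation. The brute-force sketch below only illustrates the underlying idea for a small dataset, by picking the partition that minimises the within-class sum of squared deviations; it is not the standard Jenks algorithm.

```python
from itertools import combinations

def sum_of_squared_deviations(group):
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)

def natural_breaks(data, n_classes):
    """Brute-force search over all partitions of the sorted data (small n only)."""
    ordered = sorted(data)
    best_cost, best_breaks = float("inf"), None
    # Choose n_classes - 1 cut positions between consecutive elements.
    for cuts in combinations(range(1, len(ordered)), n_classes - 1):
        bounds = [0, *cuts, len(ordered)]
        groups = [ordered[bounds[i]:bounds[i + 1]] for i in range(n_classes)]
        cost = sum(sum_of_squared_deviations(g) for g in groups)
        if cost < best_cost:
            best_cost = cost
            best_breaks = [g[-1] for g in groups[:-1]]   # upper limit of each class
    return best_breaks

values = [2, 5, 7, 12, 18, 21, 24, 30, 41, 50]
print(natural_breaks(values, 4))
```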

Data Processing Cycle and Techniques

Types of Data Processing


Neural Network:
SOLVED EXAMPLE NEURAL NETWORK

FUZZY LOGIC:
FUZZY LOGIC NUMERICAL
