RM - Data Exploration and Pre-Processing Using Orange

Segment: Data Exploration and Preprocessing Using Orange
Topic 1: Overview of Data Exploration
Topic 2: Preprocessing Data in Orange
Topic 3: Data Exploration Techniques in Orange
DATA EXPLORATION AND PRE-PROCESSING USING ORANGE
Hello, I am Karthikeya Bolar. Welcome to this video on data exploration and pre-processing
using Orange.
In this video, we will cover three topics. The first topic is an overview of data exploration,
in which we will look at the importance of data exploration. The second topic is pre-processing
data in Orange, which covers the different pre-processing activities carried out in Orange. The
third topic is data exploration techniques in Orange, which covers the data exploration
activities. Let us look at these topics in depth to better understand data exploration and
pre-processing using Orange.
©COPYRIGHT 2023 (VER. 1.0), ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION
Let us first look at the learning objectives of this session.
At the end of this video, you will be able to:
• Apply preprocessing techniques on data using Orange

• Interpret data using Orange
Topic 1: Overview of Data Exploration
Let us begin our exploration of the first topic, which is an overview of data exploration.
Data exploration is an important step in the data analysis process as it helps us get a first
look at the data to identify patterns, trends, and potential outliers.
The main goal of data exploration is to understand the data and prepare it for further
analysis. Data exploration techniques include visualization, summary statistics, and data
cleaning.
The process can help identify potential data issues, such as missing values or data entry
errors, which is important for ensuring the accuracy and reliability of the analysis results.
It can also help identify opportunities for further research and analysis.
Topic 2: Pre-processing Data in Orange
Following are the pre-processing activities in Orange:

• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretisation
• Feature Selection
Data cleaning is the process of removing and correcting noisy, incomplete, or inconsistent
data in the dataset.
Data integration is the process of merging multiple data sources into a single dataset to
facilitate analysis.
Data transformation is the process of converting the raw data into an appropriate format for
analysis.
Data reduction is the process of reducing the size of the dataset without losing important
information. Sometimes a dataset is too large to analyse comfortably, and reducing its size
makes the analysis easier.
Data discretisation is the process of converting continuous variables into discrete
categories for easier analysis.
Feature selection is the process of selecting the relevant features from the dataset to
improve the model's performance.
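As a rough illustration of the data discretisation described above, the following Python sketch turns a continuous variable into three categories. It assumes equal-width intervals, which is only one possible way to discretise; the function name `equal_width_bins`, the sample values, and the labels "short", "medium", and "long" are hypothetical and are not taken from Orange.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous measurements, used only for illustration.
petal_lengths = [1.4, 1.5, 3.5, 4.5, 5.1, 6.0]
labels = ["short", "medium", "long"]  # hypothetical category names

bins = equal_width_bins(petal_lengths, n_bins=3)
categories = [labels[b] for b in bins]
print(categories)
```

Each continuous value is replaced by the label of the interval it falls into, so later analysis can treat the variable as categorical.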
Furthermore, pre-processing activities include handling missing values, outlier detection,
data normalisation, and data encoding.
1) Handling missing values: Filling in (imputing) the missing data with appropriate values
in the incomplete dataset. Incomplete records or variables can also be deleted if a large
number of their values are missing.
2) Outlier detection: Identifying and handling data points that deviate significantly from
the rest of the dataset. Outliers must be detected for effective and smooth analysis,
because they can distort the results.
3) Data normalisation: The process of scaling the data to a common range to avoid bias in
the analysis. It is used when different variables are measured in different units, so
that the choice of unit does not affect the results. Standardisation, which rescales a
variable to mean zero and variance one, is one common form of normalisation.
4) Data encoding: The process of converting categorical data into a numerical form for
analysis. An example is dummy variable regression, where each category is represented
by an indicator variable that takes the value 1 or 0.
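The first two activities in the list above, handling missing values and outlier detection, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the missing value is filled with the mean of the observed values, and outliers are flagged with the common 1.5 × IQR rule. Neither choice is prescribed by the video, and the function names and the sample column are hypothetical.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values (an assumed strategy)."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical column with one missing value and one suspicious value.
raw = [5.0, 4.8, None, 5.2, 5.0, 12.0]

filled = impute_mean(raw)
outliers = iqr_outliers(filled)
print(filled)
print(outliers)
```

Note that the mean used for imputation is itself pulled upwards by the suspicious value, which is one reason why outliers are usually examined before deciding how to fill the gaps.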
Topic 3: Data Exploration Techniques in Orange
Let us now move on to the data exploration activities.
1. Data profiling: This involves understanding the data types, ranges, missing values, and
other basic characteristics of the dataset. It helps to identify any data quality issues
and gives a general overview of the data.
2. Univariate analysis: As the name suggests, univariate analysis examines a single
variable at a time. This involves isolating individual variables to understand
distributions, ranges, central tendencies, and variability. It also helps to identify
outliers, data skewness, and other issues that may affect the analysis.
3. Bivariate analysis: As the name suggests, bivariate analysis explores the relationship
between pairs of variables. It helps identify correlations and dependencies between
variables and can highlight potential causal relationships.
4. Multivariate analysis: It involves multiple variables at a time for analysis to identify
patterns and relationships between them. Multivariate analysis can be useful in
identifying clusters or sub-groups within the data or in understanding how different
variables interact with each other.
5. Data visualization: It involves graphs, charts, and other visual tools to represent the
data in a way that is easy to understand. It also helps to identify patterns and outliers
that may not be immediately obvious from the raw data.
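To make the univariate and bivariate analysis from the list above a little more concrete, here is a small Python sketch. It assumes that summary statistics stand in for the univariate step and that the Pearson correlation coefficient stands in for the bivariate step; both are common choices, not the only ones. The function names are hypothetical, and the handful of measurements is used only for illustration.

```python
import math
import statistics


def describe(values):
    """Univariate summary: centre, spread, and range of a single variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Bivariate summary: Pearson correlation coefficient between two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# A few illustrative measurements (petal length and petal width).
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

summary = describe(petal_length)
r = pearson(petal_length, petal_width)
print(summary)
print(round(r, 3))
```

A coefficient close to 1 suggests that the two variables increase together. It does not show on its own that one causes the other.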
Let us now explore some of the data exploration options in Orange with an example. First,
use one of the pre-existing widgets, Datasets, which contains pre-existing data. One should
stay connected to the internet, as the dataset is retrieved from an online repository. Then,
click on the dataset.
Use the Iris dataset, which is already available in the Datasets widget. To make a dataset
active and see its description, click on the dataset, and it will get activated, as indicated
by the green colour. This dataset is about three different types of iris flowers, each
described by four measurements: petal length, petal width, sepal length, and sepal width. We
will use this dataset for the data exploration and pre-processing activities.
Note: Not every pre-processing and exploration activity is applicable to every dataset, but
each of these activities may be useful and applicable for different datasets.
Let's move on to some of the pre-processing and exploration features which can be applied to
this dataset. First, activate the dataset. Then, go to the Data Table to examine the data.
The loaded dataset has 150 instances, four features, and a target with three values.
Instances are the rows of the dataset. The four features are the attributes which describe
each row of the dataset. The target takes three values, one for each of the three types of
iris flowers. So, altogether, there are five variables (four features and one target
variable) and 150 records in this dataset.
Let us move ahead with a data transformation activity, the standardisation procedure. Use
Select Columns to choose the columns of the dataset on which to carry out the process.
The features in this dataset are quantitative variables, denoted by an N in a red-coloured
box in front of their names. These are the variables to be converted into a standardised
form.
We can now proceed to the Preprocess widget.
A number of preprocessors are available in the Preprocess widget in Orange. To normalise
the features, select 'Normalize Features'.
There are different ways of normalising the features. Here we will go with the default
one, 'Standardise', where the mean equals zero and the variance equals one.
The output that we get after this step contains the normalised features. Here, normalisation
means converting the data to a different scale on which the mean is zero and the variance is
one. So this is one of the pre-processing activities, and it is classified as data
transformation.
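The arithmetic behind standardisation can be sketched in a few lines of Python: subtract the mean from each value and divide by the standard deviation. This sketch assumes the population standard deviation, which makes the variance of the result exactly one; whether Orange divides by the population or the sample standard deviation is not stated in this video. The function name and the sample values are for illustration only.

```python
import statistics


def standardise(values):
    """Rescale values so that their mean is 0 and their (population) variance is 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / sd for v in values]


# A few illustrative sepal length measurements.
sepal_length = [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]

z = standardise(sepal_length)
print([round(v, 3) for v in z])
print(round(statistics.mean(z), 6), round(statistics.pvariance(z), 6))
```

After the transformation, the values no longer carry their original unit, so features measured on different scales become directly comparable.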
Next, let us go to another 'Select Columns' widget.
The target variable, 'Iris', will be brought to the features. It is up to the user whether to
keep the other variables in the ignored section or the meta section. Anything put in the
ignored section will not be available for further analysis, whereas anything kept in the meta
section can be used for annotation purposes, and it remains available for further analysis if
you wish to bring it back to the features.
Let's move the other variables to Metas. It can be noticed that only Iris is retained in the
feature list, that is, the categorical variable, which is to be converted into a continuous
variable.
Next, let us use one more widget, 'Continuize'.
It can be used with the default settings without making any changes, and its output can be
viewed in a data table.
Here is the encoded version of the categorical variable 'Iris'. It is now represented by two
indicator variables, 'Iris = Iris versicolor' and 'Iris = Iris virginica', each taking the
value 0 or 1. Two variables are enough because the categorical variable has only three
categories. For example, the first record is a setosa flower, so both 'Iris = Iris versicolor'
and 'Iris = Iris virginica' are encoded as zero. Here, the zeros mean that the flower is
neither versicolor nor virginica, so it is a setosa flower. So, whenever a variable has three
categories, only two variables are required to encode them into zeros and ones.
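The encoding described above, where three categories are represented by two 0/1 variables, can be mimicked in plain Python as follows. This is a minimal sketch, not Orange's own implementation: the function name `dummy_encode` is hypothetical, and it assumes that the first category in the list (here setosa) is the reference category represented by all zeros.

```python
def dummy_encode(values, categories):
    """Encode a categorical column as len(categories) - 1 indicator columns.

    The first category is treated as the reference and becomes all zeros.
    """
    indicator_levels = categories[1:]
    return [[1 if v == level else 0 for level in indicator_levels] for v in values]


species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
column = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoded = dummy_encode(column, species)
for label, row in zip(column, encoded):
    print(label, row)
```

Each output row has two entries, one for versicolor and one for virginica. A row with two zeros can only be setosa, which is why a third indicator variable is not needed.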
Now, let us use a box plot for data visualization. A box plot can be drawn using the same
features and target variable.
Here is a box plot indicating the five-number summary: the lower limit, the first quartile,
the median, the third quartile, and the upper limit. As we have three different categories of
flowers, here we have three side-by-side box plots for each of the quantitative variables.
So, here, we can see that the box plots are set up for the categories Iris setosa, Iris
versicolor, and Iris virginica, and the variable which we are considering in the box plot is
the sepal length. So, we can make a comparison of the same feature across the different
species of flowers, and this is how exploration is done.
Here, overlap can be noticed between two of the categories, versicolor and virginica. Iris
setosa, however, stands clearly apart from the other two, indicating that the setosa
flower's features are different from those of versicolor and virginica. So, the dataset can
be explored in these ways, and certain types of relationships between the variables can be
figured out. This can also be viewed as a bivariate analysis, as it shows the relationship
between a categorical variable and a quantitative variable.
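The numbers behind such side-by-side box plots can be approximated with a short Python sketch that computes a five-number summary of one feature for each category. It rests on a few assumptions: the minimum and maximum stand in for the lower and upper limits, the quartiles use the "inclusive" method of Python's statistics module (box plot tools may use a different convention), and only a few illustrative sepal length values per species are used. The function names are hypothetical.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summarise_by_group(rows):
    """Group (label, value) pairs by label and summarise each group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {label: five_number_summary(vals) for label, vals in groups.items()}


# A few illustrative (species, sepal length) pairs.
rows = [
    ("Iris-setosa", 5.1), ("Iris-setosa", 4.9), ("Iris-setosa", 4.7),
    ("Iris-setosa", 4.6), ("Iris-setosa", 5.0),
    ("Iris-versicolor", 7.0), ("Iris-versicolor", 6.4), ("Iris-versicolor", 6.9),
    ("Iris-versicolor", 5.5), ("Iris-versicolor", 6.5),
    ("Iris-virginica", 6.3), ("Iris-virginica", 5.8), ("Iris-virginica", 7.1),
    ("Iris-virginica", 6.3), ("Iris-virginica", 6.5),
]

summaries = summarise_by_group(rows)
for species, summary in summaries.items():
    print(species, summary)
```

Comparing the summaries across the three groups is the numerical counterpart of comparing the boxes by eye: ranges that overlap point to similar categories, and a range that sits apart points to a clearly different one.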
Summary
In this topic, we discussed:
• Orange is a widely used open-source tool in academia and industry. It offers various
techniques for preprocessing data through widgets available under 'Transformation'.
• Orange also offers various data exploration and visualization techniques, mainly
through widgets available under 'Data' and 'Visualize'.
