RM - Data Exploration and Pre-Processing Using Orange

Segment: Data Exploration and Preprocessing Using Orange

Topic 1: Overview of Data Exploration
Topic 2: Preprocessing Data in Orange
Topic 3: Data Exploration Techniques in Orange
DATA EXPLORATION AND PRE-PROCESSING USING ORANGE

Hello, I am Karthikeya Bolar. Welcome to this video on data exploration and pre-processing
using Orange.

In this video, we will cover three topics. The first topic is an overview of data exploration,
in which we will look at why data exploration is important. The second topic is pre-processing
data in Orange, which covers the different pre-processing activities carried out in Orange.
The third topic is data exploration techniques in Orange, which covers the data exploration
activities. Let us look at these topics in depth to better understand data exploration and
pre-processing using Orange.

©COPYRIGHT 2023 (VER. 1.0), ALL RIGHTS RESERVED. MANIPAL ACADEMY OF HIGHER EDUCATION

Let us first look at the learning objectives of this session.

At the end of this video, you will be able to:

• Apply preprocessing techniques on data using Orange
• Interpret data using Orange

Topic 1: Overview of Data Exploration

Let us begin our exploration of the first topic, which is an overview of data exploration.

Data exploration is an important step in the data analysis process as it helps us get a first
look at the data to identify patterns, trends, and potential outliers.

The main goal of data exploration is to understand the data and prepare it for further
analysis. Data exploration techniques include visualization, summary statistics, and data
cleaning.

The process can help identify potential data issues, such as missing values or data entry
errors, which is important for ensuring the accuracy and reliability of data analysis
results. Further, it can help identify opportunities for further research and analysis.

Topic 2: Pre-processing Data in Orange

The following are the pre-processing activities in Orange:

• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretisation
• Feature Selection

Data cleaning is the process of removing or correcting noisy, incomplete, or inconsistent
data in the dataset.

Data integration is the process of merging multiple data sources into a single dataset to
facilitate analysis.

Data transformation is the process of converting the raw data into an appropriate format for
analysis.

Data reduction is the process of reducing the size of the dataset without losing important
information. Sometimes a dataset is too large to analyse conveniently, and reducing its size
makes the analysis easier.

Data discretisation is the process of converting continuous variables into discrete
categories for easier analysis.
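To make the idea of discretisation concrete, here is a minimal Python sketch of equal-width binning. It is only an illustration and not how Orange implements its own discretisation. The function name `equal_width_bins`, the sample values, and the labels "short", "medium", and "long" are all assumptions made up for this example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical measurements, made up for illustration only.
lengths = [1.0, 1.5, 2.0, 3.5, 4.0, 5.5, 6.0, 7.0]
labels = ["short", "medium", "long"]

bins = equal_width_bins(lengths, n_bins=3)
categories = [labels[b] for b in bins]
print(list(zip(lengths, categories)))
```

Equal-width binning is only one possible choice. Bins with equal numbers of records are another common option, and which one suits a dataset depends on how the values are distributed.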

Feature selection is the process that helps select relevant features from the dataset to
improve the model's performance.

Furthermore, pre-processing activities include handling missing values, outlier detection,
data normalisation, and data encoding.

1) Handling missing values: Filling or imputing the missing data in an incomplete dataset
with appropriate values. Incomplete records can also be deleted if a large number of
their values are missing.

2) Outlier detection: Identifying and handling data points that deviate significantly from
the rest of the dataset. Outliers must be detected for effective and smooth analysis.

3) Data normalisation: It is the process of scaling the data to a common range to avoid
bias in the analysis. It is carried out when variables are measured in different units, so
that the choice of units has no effect on the results. One common form, which rescales
each variable to a mean of zero and a variance of one, is called data standardisation.

4) Data encoding: It is the process of converting categorical data into a numerical form
for analysis, as in dummy variable regression, where each category is represented by
variables taking numerical values such as 1 and 0.

Topic 3: Data Exploration Techniques in Orange

Let us now move on to the data exploration activities.

1. Data profiling: This involves understanding the data types, ranges, missing values, and
other basic characteristics of the dataset. It helps to identify any data quality issues
and gives a general overview of the data.

2. Univariate analysis: As the name suggests, univariate analysis examines a single variable
at a time. This involves isolating individual variables to understand their
distributions, ranges, central tendencies, and variability. It also helps to identify
outliers, data skewness, and other issues that may affect the analysis.

3. Bivariate analysis: As the name suggests, bivariate analysis explores the relationship
between pairs of variables. It helps identify correlations and dependencies between
variables and can highlight potential causal relationships, although a correlation alone
does not establish causation.

4. Multivariate analysis: It involves analysing multiple variables at a time to identify
patterns and relationships between them. Multivariate analysis can be useful in
identifying clusters or sub-groups within the data or in understanding how different
variables interact with each other.

5. Data visualization: It involves graphs, charts, and other visual tools to represent the
data in a way that is easy to understand. It also helps to identify patterns and outliers
that may not be immediately obvious from the raw data.
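As a rough illustration of univariate and bivariate analysis outside Orange, the following Python sketch computes a few summary statistics for one variable and the Pearson correlation for a pair of variables. The function names `summarise` and `pearson` and the two sample lists are assumptions made up for this example.

```python
import math
import statistics


def summarise(values):
    """Univariate summary: range, central tendency, and variability."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson(x, y):
    """Bivariate summary: Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical paired measurements, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

summary_x = summarise(x)
r = pearson(x, y)
print(summary_x)
print(r)
```

Here `y` is exactly twice `x`, so the correlation comes out at about 1. Real data would rarely be this clean.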

Let us now explore some of the data exploration possibilities in Orange through an example.
First, use one of the built-in widgets, such as Datasets, which provides ready-made data.
Stay connected to the internet, as the widget loads its datasets from an online repository.
Then, click on the dataset.

Use the Iris dataset, which is already available in the Datasets widget. To make a dataset
active and see its description, click on it; an active dataset is indicated in green. This
dataset is about three different types of iris flowers, described by four measurements:
petal length, petal width, sepal length, and sepal width. Let us use this dataset for the
data exploration and pre-processing activities.

Note: Not every pre-processing and exploration activity is applicable to every dataset, but
each of these activities may be useful for some datasets.

Let us move on to some of the pre-processing and exploration features that can be applied to
this dataset. First, examine the data as represented in the Data Table.

Before examining the data, activate the dataset. Then, go to the Data Table.

The loaded dataset has 150 instances, four features, and a target with three values.
Instances are the rows of the dataset. The four features are the attributes that describe
each row. The target takes three values, one for each type of iris flower. So, altogether,
there are five variables, four features and one target variable, with 150 records in this
dataset.

Let us move ahead with data transformation, using standardisation as the procedure. Use
'Select Columns' to choose the columns of the dataset on which to apply the process.

The features present in the dataset are its quantitative variables, denoted by an N in front
of them in the red-coloured box. These are to be converted into a standardised form.

We can now proceed to the Preprocess widget.

A number of pre-processors are available in the Preprocess widget in Orange. To normalise
the features, select "Normalize Features".

Further, there are different ways of normalising the features. Here we will go with the
default one, 'Standardise', where the mean equals zero and the variance equals one.

Once this pre-processor is applied, the output data is normalised automatically. Here,
normalisation means converting the data to a different scale on which the mean is zero and
the variance equals one. This is one of the pre-processing activities, called data
transformation.
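The arithmetic behind standardisation can be sketched in plain Python as follows. This is an illustration only: the function name `standardise` and the sample values are made up, and the use of the population standard deviation is an assumption for this sketch rather than a statement about how Orange computes it.

```python
import statistics


def standardise(values):
    """Rescale values so that their mean is 0 and their variance is 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical measurements, made up for illustration only.
widths = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardise(widths)

print(z)
print(statistics.fmean(z), statistics.pvariance(z))
```

Each value is shifted by the mean and divided by the standard deviation, so the result no longer depends on the unit in which the original variable was measured.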

Next, add another 'Select Columns' widget.

The target variable, 'Iris', is moved to the features. It is up to the user whether to keep
the other variables in the ignored section or the meta section. A variable placed in the
ignored section will not be available for further analysis, whereas a variable kept in the
meta section can be used for annotation purposes and remains available for further analysis
if you later wish to move it back to the features.

Let us move them to Metas. It can be noticed that only Iris is retained in the feature list,
that is, the categorical variable which we intend to convert into a continuous variable.

Next, let us use one more widget, 'Continuize'.

It can be used with the default settings, without making any changes, and its output can
then be viewed in a data table.

Here is the encoded version of the categorical variable 'Iris'. It is represented by two new
variables, Iris equal to Iris versicolor and Iris equal to Iris virginica, each taking the
value 0 or 1. Two variables are enough because the categorical variable has only three
categories. For example, suppose the first record is a setosa flower. Then both Iris equal
to Iris versicolor and Iris equal to Iris virginica are encoded as zero. Here, zero in both
variables means that the flower is neither versicolor nor virginica, so it must be setosa.
So, whenever a variable has three categories, only two variables are required to encode
them into zeros and ones.
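The same encoding idea can be sketched in plain Python. This is a minimal illustration, not Orange's implementation: the function name `dummy_encode`, the column naming pattern, and the choice of the first category as the reference are assumptions made for this example.

```python
def dummy_encode(labels, categories, name="iris"):
    """Encode labels as 0/1 indicator columns, one per non-reference category.

    The first entry of `categories` is treated as the reference category and
    gets no column of its own: it is recognised by zeros in all columns.
    """
    others = categories[1:]
    return [
        {f"{name}={c}": int(label == c) for c in others}
        for label in labels
    ]


species = ["setosa", "versicolor", "virginica"]
rows = dummy_encode(["setosa", "versicolor", "virginica"], species)

for row in rows:
    print(row)
```

With three categories the function produces two columns, and a setosa record appears as zero in both of them, which matches the encoding described above.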

Now, use a box plot for data visualization. A box plot can be created using the same
features and target variable.

Here is a box plot showing the five-number summary: the lower limit, the first quartile, the
median, the third quartile, and the upper limit. As there are three categories of flowers,
there are three side-by-side box plots for each quantitative variable. Here, the box plots
are drawn for the categories Iris setosa, Iris versicolor, and Iris virginica, and the
variable under consideration is the sepal length. This lets us compare the same feature
across the different species, which is how exploration is done. Some overlap can be noticed
between versicolor and virginica, whereas Iris setosa stands clearly apart from the other
two, indicating that its features differ from those of versicolor and virginica. The dataset
can be explored in this way to figure out relationships between the variables. This
comparison is itself a form of bivariate analysis, as it shows the relationship between a
categorical variable and a quantitative variable.
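The numbers behind such side-by-side box plots can be sketched in Python as follows. This is only an illustration: the groups "A", "B", and "C" and their values are made up and are not real Iris measurements. The quartile method used here and the use of the minimum and maximum as the limits are also assumptions, since box plot conventions differ between tools.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def ranges_overlap(s1, s2):
    """True if the min-to-max ranges of two summaries overlap."""
    return s1[0] <= s2[4] and s2[0] <= s1[4]


# Hypothetical measurements for three groups, made up for illustration only.
measurements = {
    "A": [4.6, 4.8, 5.0, 5.2, 5.4],
    "B": [5.5, 5.8, 6.0, 6.3, 6.9],
    "C": [5.8, 6.3, 6.5, 6.9, 7.4],
}

summaries = {g: five_number_summary(v) for g, v in measurements.items()}
for group, summary in summaries.items():
    print(group, summary)

print(ranges_overlap(summaries["B"], summaries["C"]))
print(ranges_overlap(summaries["A"], summaries["B"]))
```

In this made-up data, groups B and C overlap while group A stands apart. That is the kind of pattern a side-by-side box plot makes visible at a glance.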

Summary

In this video, we discussed the following:

• Orange is a widely used open-source tool in academia and industry. Its widgets under
'Transform' support various data pre-processing techniques.

• Orange also supports various data exploration and visualization techniques, mainly
through widgets under 'Data' and 'Visualize'.
