Introduction To Orange: Data Analytics Core

Introduction to Orange
Data Analytics Core

Introduction to Data Mining
• Data Mining: discovering patterns in large datasets using artificial
intelligence, machine learning, statistics, and database systems
• Orange is a data mining toolkit, so you don’t need to be an expert in any of

those subjects to use it
• We will use Orange to:

• load, manipulate, and save large data sets
• visualize the relationships between variables
• discover and quantify patterns in data
• create rules to predict outcomes based on observed data
Widgets and Signals
• Orange is a graphical programming interface, where you design a program
• Each program is called a "scheme" and is made by dragging widgets
and signals around a "canvas"
• Each widget does one type of processing

• Widgets do things like load and save data to files, edit data, make
graphs, and perform calculations
• Most have inputs, outputs, or both
• A signal carries information from one widget to another.

• Signals can carry different types of information, e.g. data, calculated
values, or settings
Orange’s Graphical Programming
Environment
Using the Orange interface
• To add a widget, drag it onto the canvas from the widget panel, or just click
on it in the widget panel
• To add a signal, click on the signal attachment point on a widget and drag
from it to the signal attachment point on another widget
• Input signals come in from the left, output signals go out to the right
Using the Orange interface
• Some widgets have multiple possible input and output ports
• Orange tries to guess which one you mean
• If it guesses wrong, double click on the signal to select which inputs

and outputs you are using
• You can also temporarily disconnect or delete signals by right-clicking

on them
File Widget
• Loads data from a file
• Many different file types are supported
• Recommended: tab-delimited text
• If you click on the arrow and scroll down to the bottom of the dialog, you
can see "browse documentation datasets"
• iris.tab is an example dataset that comes with Orange, and contains 150
iris flowers from three species
Data Table Widget
• Lists rows in a dataset, sort
by clicking on the column
heading
• Each value has a bar

indicating the size of the
value
• In this case, last column is

assumed to be a
category
• Bar is colored by
category
(species)
For each of the 150
flowers in the dataset,
there is a value for:
• Petal Length
• Petal Width
• Sepal Length
• Sepal Width
Select Rows Widget
• Filters data according to simple rules
• For example: exclude all irises with short petals
• Select an attribute and a condition and press “Add” to add it to the filter
Data Selection Results
• The “petal length” column now only contains values longer than 3 cm
• The green category, iris-setosa, is now completely absent.
• All iris-setosa flowers have petals shorter than 3 cm.

Select Columns Widget (I)
• Choose which columns go in the dataset
• Features are data values to be included

in output
• Target Variable is the category of the

row
• Meta Attributes are descriptive

attributes that are excluded from the
analysis (such as a row ID)
• Available Variables are attributes

available to be loaded, but are not used in
model
Select Columns Widget (II)
• Drag or move variables between
categories with the > and < buttons
• Each variable is marked C for continuous

(numerical values) or D for discrete
(categorical values)
• You may need to click Apply before any

changes you make take effect
Select Columns in action
• Suppose we were only interested in sepals, not petals.
Feature Constructor Widget
• Defines new attributes (i.e. columns)
based on the values of existing
attributes
• Type a formula and click Add to add

a new feature
• Select fields and functions using the

drop down menus
• Widget outputs the same data set with

new attributes added
• This calculation is assumes petals are

triangular
Feature Construction Results
• New attribute is added after existing attributes but before the class
Save Widget
• Save a modified file
• Saves whatever is going to its input

• If you made changes elsewhere in the scheme, they will not be saved
• Be careful not to accidentally overwrite your input file

Exercise: First Scheme
• Load and inspect the imports-85.tab data file (documentation dataset),
which contains information about various imported cars
• Add a Volume attribute (i.e. length x width x height)
• Remove the original length, width, and height attributes
• Save the dataset using a different filename

Solution (I)
Solution (II)
• Remember to click Apply after you make changes!
Visualization
• So far, we have used Orange as a glorified spreadsheet: filtering and
sorting by attribute values, and performing computations
• With data mining, we want to gain insight that is not obvious from the
original data
• Now we will look at visualizations, designed to examine the data in a way

we can’t do with numbers in a spreadsheet
Visualization Widgets
• All of these visualization widgets work the same way:
• Connect a data input and double-click the widget to see the

visualization
• Main panel of the window contains the graphic, side panel on the left
contains options
• Left-click in the graphic panel to zoom in, right-click to zoom out

Visualizing a single variable:
box plot
• The Box Plot widget shows you the distribution of one attribute
• Select which attribute to look at on the side panel
• Quartiles are marked and shaded
• Mean value is marked in yellow

box plot
what do you notice?
Visualizing a single attribute:
Distributions (I)
• Displays a histogram or density plot – can color-code by categorical
variables
• See the overall shape and the distribution among classes

Distributions (II)
• The dashed lines display the probability density that a flower with
that measurement is from the given class
• E.g. the shorter the sepal length, the more likely the flower is to be Iris
setosa
• select which class the probability line targets on the side panel
Distributions (III)
Distributions
• For the imports-85.tab dataset, plot a continuous variable - there is no
discrete class and therefore no color-coding unless you select something
to Group by
Exercise
• Look at the Distributions tool for each variable on the auto imports dataset
(one per person)
• Which patterns do you see?

• In the case of a continuous variable, do the data appear to follow a
bell curve, or something more exotic?
• In the case of discrete variables, is there clear bias toward a specific
class?
Solutions
• The observations that you've made are probably correct; there are a lot of
patterns and correlations in a large multidimensional dataset!
• Remember what you saw here, later we will extract this information in a
quantitative way
Visualizing multiple attributes:
scatter plot
• Compare the values of two different
variables
• Select which variables on the side

panel, points are color coded by class
(species)
• Shows a relationship between petal

length and sepal length, relationship is
different for each species
• Green and red species are well

separated in two dimensions, even
though they overlapped in the one-
dimensional Distributions visualization
Visualizing multiple attributes:
Scatter Plot
• With no discrete class, points are
shaded in grayscale according to
another variable (price)
• Shading provides a 3rd dimension
• Here we can see a clear

relationship between horsepower,
gas mileage, and price
• The relationship between

horsepower and gas mileage
probably explains the
relationship between gas
mileage and price we saw in the
Distributions visualization

Introduction To Orange: Data Analytics Core

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introduction To Orange: Data Analytics Core

Uploaded by

Copyright:

Available Formats

Introduction to Orange

Data Analytics Core

• Orange is a data mining toolkit, so you don’t need to be an expert in any of

• We will use Orange to:

• Each widget does one type of processing

• A signal carries information from one widget to another.

• Orange tries to guess which one you mean

• If it guesses wrong, double click on the signal to select which inputs

• You can also temporarily disconnect or delete signals by right-clicking

• Each value has a bar

• In this case, last column is

• For example: exclude all irises with short petals

• The green category, iris-setosa, is now completely absent.

• All iris-setosa flowers have petals shorter than 3 cm.

• Features are data values to be included

• Target Variable is the category of the

• Meta Attributes are descriptive

• Available Variables are attributes

• Each variable is marked C for continuous

• You may need to click Apply before any

• Type a formula and click Add to add

• Select fields and functions using the

• Widget outputs the same data set with

• This calculation is assumes petals are

• Saves whatever is going to its input

• Be careful not to accidentally overwrite your input file

• Add a Volume attribute (i.e. length x width x height)

• Remove the original length, width, and height attributes

• Save the dataset using a different filename

• Now we will look at visualizations, designed to examine the data in a way

• Connect a data input and double-click the widget to see the

• Left-click in the graphic panel to zoom in, right-click to zoom out

• Select which attribute to look at on the side panel

• Quartiles are marked and shaded

• Mean value is marked in yellow

• See the overall shape and the distribution among classes

• Which patterns do you see?

• Select which variables on the side

• Shows a relationship between petal

• Green and red species are well

• Shading provides a 3rd dimension

• Here we can see a clear

• The relationship between

You might also like