
Introduction to Orange

Data Analytics Core


Introduction to Data Mining
• Data Mining: discovering patterns in large datasets using artificial
intelligence, machine learning, statistics, and database systems

• Orange is a data mining toolkit, so you don’t need to be an expert in any of
those subjects to use it

• We will use Orange to:
• load, manipulate, and save large data sets
• visualize the relationships between variables
• discover and quantify patterns in data
• create rules to predict outcomes based on observed data
Widgets and Signals
• Orange is a graphical programming environment: you design a program
visually by arranging building blocks
• Each program is called a "scheme" and is built by placing widgets on a
"canvas" and connecting them with signals

• Each widget does one type of processing
• Widgets do things like load and save data to files, edit data, make
graphs, and perform calculations
• Most have inputs, outputs, or both

• A signal carries information from one widget to another.
• Signals can carry different types of information, e.g. data, calculated
values, or settings
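The widget-and-signal idea can be pictured in ordinary code. The sketch below is only an analogy and does not use Orange's API: each hypothetical function plays the role of a widget, and the value handed from one function to the next plays the role of a signal.

```python
# Hypothetical analogy only: none of these names come from Orange.

def load_widget():
    """Plays the role of a File widget: outputs a data signal."""
    return [{"name": "a", "value": 1.0}, {"name": "b", "value": 4.0}]

def filter_widget(rows, minimum):
    """Plays the role of a filtering widget: data in, data out."""
    return [row for row in rows if row["value"] >= minimum]

def count_widget(rows):
    """Plays the role of a widget that outputs a calculated value."""
    return len(rows)

# The "signals" are the values flowing left to right between the functions.
data = load_widget()
filtered = filter_widget(data, minimum=2.0)
n_rows = count_widget(filtered)
print(n_rows)
```

Each function has one job, just as each widget does one type of processing, and the order of the calls mirrors the left-to-right wiring on the canvas.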
Orange’s Graphical Programming Environment
Using the Orange interface
• To add a widget, drag it onto the canvas from the widget panel, or just click
on it in the widget panel

• To add a signal, click on the signal attachment point on a widget and drag
from it to the signal attachment point on another widget
• Input signals come in from the left, output signals go out to the right
Using the Orange interface
• Some widgets have multiple possible input and output ports

• Orange tries to guess which one you mean

• If it guesses wrong, double-click on the signal to select which inputs
and outputs you are using

• You can also temporarily disconnect or delete signals by right-clicking
on them
File Widget
• Loads data from a file
• Many different file types are supported
• Recommended: tab-delimited text
• If you click on the arrow and scroll down to the bottom of the dialog, you
can see "browse documentation datasets"
• iris.tab is an example dataset that comes with Orange, and contains 150
iris flowers from three species
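For readers who want to see what a tab-delimited file looks like outside the GUI, here is a minimal sketch using only Python's standard library. It does not use Orange itself. The text below is a small stand-in with a single header row, and the column names are assumptions; Orange's own .tab files may carry additional header rows that this sketch does not model.

```python
import csv
import io

# Small stand-in for a tab-delimited file with one header row.
# Column names are assumed; only three illustrative rows are included.
TEXT = (
    "sepal length\tsepal width\tpetal length\tpetal width\tiris\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "7.0\t3.2\t4.7\t1.4\tIris-versicolor\n"
    "6.3\t3.3\t6.0\t2.5\tIris-virginica\n"
)

def load_tab(handle):
    """Read tab-delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = load_tab(io.StringIO(TEXT))
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Note that `csv.DictReader` returns every value as a string, so numeric columns would still need converting with `float()` before any calculation.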
Data Table Widget
• Lists the rows in a dataset; sort by clicking on a column heading

• Each value has a bar indicating the size of the value

• In this case, the last column is assumed to be a category

• The bar is colored by category (species)

For each of the 150 flowers in the dataset, there is a value for:

• Petal Length
• Petal Width
• Sepal Length
• Sepal Width
Select Rows Widget
• Filters data according to simple rules

• For example: exclude all irises with short petals

• Select an attribute and a condition and press “Add” to add it to the filter
Data Selection Results
• The “petal length” column now only contains values longer than 3 cm

• The green category, iris-setosa, is now completely absent.

• All iris-setosa flowers have petals shorter than 3 cm.
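The same filter can be written as a short sketch in plain Python. This is not how the Select Rows widget is implemented; the function name `select_rows`, the column names, and the handful of rows are all illustrative assumptions.

```python
# Illustrative rows only; column names are assumptions.
flowers = [
    {"petal length": 1.4, "iris": "Iris-setosa"},
    {"petal length": 1.7, "iris": "Iris-setosa"},
    {"petal length": 4.7, "iris": "Iris-versicolor"},
    {"petal length": 6.0, "iris": "Iris-virginica"},
]

def select_rows(rows, attribute, threshold):
    """Keep rows whose attribute is strictly greater than threshold."""
    return [row for row in rows if row[attribute] > threshold]

long_petals = select_rows(flowers, "petal length", 3.0)
remaining_species = {row["iris"] for row in long_petals}
print(sorted(remaining_species))
```

With these sample rows, every setosa flower falls below the 3 cm threshold, so the species disappears from the filtered output.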


Select Columns Widget (I)
• Choose which columns go in the dataset

• Features are the data values to be included in the output

• Target Variable is the category of the row

• Meta Attributes are descriptive attributes that are excluded from the
analysis (such as a row ID)

• Available Variables are attributes that can be loaded but are not used in
the model
Select Columns Widget (II)
• Drag or move variables between
categories with the > and < buttons

• Each variable is marked C for continuous
(numerical values) or D for discrete
(categorical values)

• You may need to click Apply before any
changes you make take effect
Select Columns in action
• Suppose we were only interested in sepals, not petals.
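As a rough illustration of keeping only the sepal columns, the sketch below uses plain Python dictionaries rather than Orange. The helper `select_columns` and the column names are hypothetical, and the rows are illustrative.

```python
# Illustrative rows; column names are assumptions.
flowers = [
    {"sepal length": 5.1, "sepal width": 3.5,
     "petal length": 1.4, "petal width": 0.2, "iris": "Iris-setosa"},
    {"sepal length": 7.0, "sepal width": 3.2,
     "petal length": 4.7, "petal width": 1.4, "iris": "Iris-versicolor"},
]

def select_columns(rows, features, target):
    """Keep only the chosen feature columns plus the target column."""
    keep = list(features) + [target]
    return [{name: row[name] for name in keep} for row in rows]

sepals_only = select_columns(flowers, ["sepal length", "sepal width"], "iris")
print(list(sepals_only[0].keys()))
```

The petal columns are not destroyed; they are simply left out of the output, much as unused attributes stay under Available Variables.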
Feature Constructor Widget
• Defines new attributes (i.e. columns)
based on the values of existing
attributes

• Type a formula and click Add to add
a new feature

• Select fields and functions using the
drop-down menus

• The widget outputs the same data set with
the new attributes added

• This calculation assumes petals are
triangular
Feature Construction Results
• New attribute is added after existing attributes but before the class
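A minimal sketch of the idea in plain Python follows. The exact formula typed into the widget is not reproduced in the text, so the triangle area `0.5 * length * width` is an assumption, as are the helper name `add_feature` and the column names.

```python
# Illustrative row; column names are assumptions.
flowers = [
    {"sepal length": 5.1, "sepal width": 3.5,
     "petal length": 1.4, "petal width": 0.2, "iris": "Iris-setosa"},
]

def add_feature(rows, name, formula, target="iris"):
    """Return copies of rows with a new column placed just before the target."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != target}
        new_row[name] = formula(row)   # new attribute after existing ones
        new_row[target] = row[target]  # class column stays last
        result.append(new_row)
    return result

def triangle_area(row):
    """Assumed formula: area of a triangle with the petal's length and width."""
    return 0.5 * row["petal length"] * row["petal width"]

with_area = add_feature(flowers, "petal area", triangle_area)
print(list(with_area[0].keys()))
```

The original rows are left untouched, and the new column sits after the existing attributes but before the class, matching the ordering described above.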
Save Widget
• Save a modified file

• Saves whatever data arrives at its input

• Changes made elsewhere in the scheme, outside the path feeding this
widget, are not saved

• Be careful not to accidentally overwrite your input file
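The overwrite warning can be made concrete with a small standard-library sketch. It is not the Save widget's behavior, just one way to guard against overwriting in code: opening a file in mode "x" fails if the file already exists. The function name `save_tab` and the filename are hypothetical.

```python
import csv
import os
import tempfile

def save_tab(rows, path):
    """Write rows as tab-delimited text; fail rather than overwrite a file."""
    # Mode "x" raises FileExistsError if the path already exists.
    with open(path, "x", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=list(rows[0].keys()), delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)

rows = [{"sepal length": 5.1, "iris": "Iris-setosa"}]
folder = tempfile.mkdtemp()
out_path = os.path.join(folder, "iris_modified.tab")  # hypothetical filename
save_tab(rows, out_path)

try:
    save_tab(rows, out_path)  # second attempt targets an existing file
    overwritten = True
except FileExistsError:
    overwritten = False
print("overwritten:", overwritten)
```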


Exercise: First Scheme
• Load and inspect the imports-85.tab data file (documentation dataset),
which contains information about various imported cars

• Add a Volume attribute (i.e. length x width x height)

• Remove the original length, width, and height attributes

• Save the dataset using a different filename


Solution (I)
Solution (II)
• Remember to click Apply after you make changes!
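For comparison with the scheme, the same two steps (add Volume, drop the original dimensions) might look like this in plain Python. The row values and the helper `add_volume` are made up for illustration; only the attribute names length, width, and height come from the exercise.

```python
# Made-up example row; not taken from imports-85.tab.
cars = [
    {"make": "example-a", "length": 170.0, "width": 65.0,
     "height": 50.0, "price": 13000},
]

def add_volume(rows):
    """Add volume = length * width * height and drop the three originals."""
    dropped = ("length", "width", "height")
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in dropped}
        new_row["volume"] = row["length"] * row["width"] * row["height"]
        result.append(new_row)
    return result

with_volume = add_volume(cars)
print(with_volume[0])
```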
Visualization
• So far, we have used Orange as a glorified spreadsheet: filtering and
sorting by attribute values, and performing computations

• With data mining, we want to gain insight that is not obvious from the
original data

• Now we will look at visualizations, designed to examine the data in ways
that numbers in a spreadsheet do not allow
Visualization Widgets
• All of these visualization widgets work the same way:

• Connect a data input and double-click the widget to see the
visualization

• The main panel of the window contains the graphic; the side panel on the
left contains options

• Left-click in the graphic panel to zoom in, right-click to zoom out


Visualizing a single variable:
box plot
• The Box Plot widget shows you the distribution of one attribute

• Select which attribute to look at on the side panel

• Quartiles are marked and shaded

• Mean value is marked in yellow
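The numbers behind a box plot can be computed with Python's standard `statistics` module. This is a sketch on made-up values, and the quartile method used here (the module's default) is not necessarily the one the Box Plot widget uses.

```python
import statistics

# Made-up petal lengths for illustration.
petal_lengths = [1.4, 1.5, 4.5, 4.7, 5.1, 6.0]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(petal_lengths, n=4)
mean = statistics.mean(petal_lengths)

print("Q1:", q1, "median:", median, "Q3:", q3, "mean:", mean)
```

The mean and the median need not coincide; a visible gap between them on the plot is a hint that the distribution is skewed.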


Visualizing a single variable:
what do you notice?
Visualizing a single attribute:
Distributions (I)
• Displays a histogram or density plot – can color-code by categorical
variables

• See the overall shape and the distribution among classes
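A histogram split by class is just a count per (bin, class) pair. The sketch below shows that bookkeeping in plain Python on made-up sepal lengths; the function name `histogram_by_class`, the bin width, and the shortened class labels are assumptions, not Orange settings.

```python
from collections import Counter
import math

# Made-up (sepal length, class) pairs for illustration.
samples = [
    (4.9, "setosa"), (5.1, "setosa"), (5.4, "setosa"),
    (5.6, "versicolor"), (6.1, "versicolor"),
    (6.3, "virginica"), (7.1, "virginica"),
]

def histogram_by_class(pairs, bin_width):
    """Count values per (bin start, class label)."""
    counts = Counter()
    for value, label in pairs:
        bin_start = math.floor(value / bin_width) * bin_width
        counts[(bin_start, label)] += 1
    return counts

counts = histogram_by_class(samples, bin_width=1.0)
for (bin_start, label), n in sorted(counts.items()):
    print(bin_start, label, n)
```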


Visualizing a single attribute:
Distributions (II)
• The dashed lines display the probability that a flower with
that measurement is from the given class

• E.g. the shorter the sepal length, the more likely the flower is to be Iris
setosa
• Select which class the probability line targets on the side panel
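One simple way to estimate such a class probability is to divide, within each bin, the number of flowers of the target class by the total number of flowers in that bin. The sketch below does this on made-up data; it is an assumption about the general idea, and any smoothing the widget applies to draw its line is not modeled.

```python
from collections import Counter
import math

# Made-up (sepal length, class) pairs for illustration.
samples = [
    (4.9, "setosa"), (5.1, "setosa"), (5.4, "setosa"),
    (5.6, "versicolor"), (6.1, "versicolor"),
    (6.3, "virginica"), (7.1, "virginica"),
]

def class_probability(pairs, target, bin_width=1.0):
    """Estimate P(class == target | bin) as a per-bin fraction."""
    totals = Counter()
    hits = Counter()
    for value, label in pairs:
        bin_start = math.floor(value / bin_width) * bin_width
        totals[bin_start] += 1
        if label == target:
            hits[bin_start] += 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}

prob_setosa = class_probability(samples, "setosa")
print(prob_setosa)
```

In this toy data the estimated probability of setosa falls as sepal length increases, which is the pattern the dashed line makes visible.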
Visualizing a single attribute:
Distributions (III)
Visualizing a single attribute:
Distributions
• For the imports-85.tab dataset, plot a continuous variable. There is no
discrete class and therefore no color-coding unless you select something
to Group by
Exercise
• Look at the Distributions tool for each variable on the auto imports dataset
(one per person)

• Which patterns do you see?
• In the case of a continuous variable, do the data appear to follow a
bell curve, or something more exotic?
• In the case of discrete variables, is there clear bias toward a specific
class?
Solutions
• The observations that you've made are probably correct; there are a lot of
patterns and correlations in a large multidimensional dataset!

• Remember what you saw here; later we will extract this information in a
quantitative way
Visualizing multiple attributes:
scatter plot
• Compare the values of two different
variables

• Select the variables on the side panel; points are color-coded by class
(species)

• Shows a relationship between petal length and sepal length; the
relationship is different for each species

• The green and red species are well separated in two dimensions, even
though they overlapped in the one-dimensional Distributions visualization
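One way to put a number on "the relationship is different for each species" is to compute a correlation coefficient separately for each class. The sketch below uses a hand-written Pearson correlation on made-up points; the labels "A" and "B" and all values are hypothetical, and this is not something the Scatter Plot widget is claimed to report.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up (petal length, sepal length, class) points for illustration.
points = [
    (1.3, 4.8, "A"), (1.4, 5.0, "A"), (1.5, 5.1, "A"),
    (4.0, 5.5, "B"), (4.5, 6.0, "B"), (5.0, 6.5, "B"),
]

# Group the points by class, then correlate within each group.
groups = defaultdict(lambda: ([], []))
for petal, sepal, label in points:
    groups[label][0].append(petal)
    groups[label][1].append(sepal)

r_by_class = {label: pearson(xs, ys) for label, (xs, ys) in groups.items()}
print(r_by_class)
```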
Visualizing multiple attributes:
Scatter Plot
• With no discrete class, points are shaded in grayscale according to
another variable (price)

• Shading provides a third dimension

• Here we can see a clear relationship between horsepower, gas mileage,
and price

• The relationship between horsepower and gas mileage probably explains
the relationship between gas mileage and price we saw in the
Distributions visualization
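Shading by a third variable amounts to mapping each value onto a gray level. A minimal sketch of a linear mapping follows; the direction of the scale (low values dark, high values light), the helper `to_gray`, and the prices are all assumptions for illustration, not a description of how Orange chooses its shades.

```python
def to_gray(value, low, high):
    """Map value in [low, high] linearly to 0 (black) .. 255 (white)."""
    fraction = (value - low) / (high - low)
    return round(255 * fraction)

# Hypothetical prices for illustration.
prices = [5000, 15000, 45000]
low, high = min(prices), max(prices)
grays = [to_gray(p, low, high) for p in prices]
print(grays)
```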
