
What is Data Visualization?

 Data visualization is the graphical representation of quantitative information and data using visual elements like graphs, charts, and maps.

 Data visualization converts large and small data sets into visuals, which are easier for humans to understand and process.

 Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.

 In the world of Big Data, data visualization tools and technologies are required to analyse vast amounts of information.

 Data visualizations are common in your everyday life, and they often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.

 Data visualizations are used to discover unknown facts and trends. You can see visualizations in the form of line charts to display change over time. Bar and column charts are useful for observing relationships and making comparisons. A pie chart is a great way to show parts of a whole, and maps are the best way to share geographical data visually.

 Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
What makes Data Visualization Effective?

Effective data visualizations are created where communication, data science, and design collide. Done right, they distill key insights from complicated data sets into something meaningful and natural.

American statistician and Yale professor Edward Tufte believed useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."

To craft effective data visualization, you need to start with clean data
that is well-sourced and complete. After the data is ready to visualize,
you need to pick the right chart.
History of Data Visualization

The concept of using pictures to understand data was launched in the 17th century with maps and graphs, and then in the early 1800s the pie chart was invented.

Several decades later, one of the most advanced examples of statistical graphics occurred when Charles Minard mapped Napoleon's invasion of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, and ties that information to temperature and time scales for a more in-depth understanding of the event.

Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few years.

Importance of Data Visualization

Data visualization is important because of how the human brain processes information. Using graphs and charts to visualize large amounts of complex data is more comfortable than studying spreadsheets and reports.

Data visualization is an easy and quick way to convey concepts universally. You can experiment with different outlines by making slight adjustments.

Data visualization also has some more specific uses:

o Data visualization can identify areas that need improvement or modification.
o Data visualization can clarify which factors influence customer behavior.
o Data visualization helps you understand which products to place where.
o Data visualization can predict sales volumes.
Why Use Data Visualization?
1. To make data easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To analyze the competition.
6. To improve insights.
Examples of Data Visualization in Data Science
Here are some popular data visualization examples.
1. Weather reports: Maps and other plot types are commonly used
in weather reports.
2. Internet websites: Social media analytics websites such as Social
Blade and Google Analytics use data visualization techniques to
analyze and compare the performance of websites.
3. Astronomy: NASA uses advanced data visualization techniques in
its reports and presentations.
4. Geography
5. Gaming industry

What Makes Data Visualization Effective?


 Clarity: Data should be visualized in a way that everyone can
understand.
 Problem domain: When presenting data, the visualizations
should be related to the business problem.
 Interactivity: Interactive plots are useful to compare and
highlight certain things within the plot.
 Comparability: We can compare things easily with good plots.
 Aesthetics: Quality plots are visually aesthetic.
 Informative: A good plot summarizes all relevant information.
Importance of Data Science Visualization

 Data Cleaning
 Data Exploration
 Evaluation of Modeling Outputs
 Identifying Trends
 Presenting Results
Data Cleaning
 Data Science Visualization can help detect Null values of data
items in large datasets by representing them distinctively.
 This helps professionals reduce the burden of finding these
errors before working with the data.
 The data to be processed by a Data Scientist could be pulled from multiple data sources like databases, datasets, etc.
 This data may contain redundancy and noise, which need to be eliminated before analysis.
 Visualizing these datasets gives you a complete overview
without assumptions about the correctness of the data.
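As a minimal sketch of this idea (the dataset, column names, and null positions below are made up for illustration), null values in a pandas DataFrame can be counted and plotted with matplotlib so they stand out at a glance:

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical dataset with missing (null) values
df = pd.DataFrame({
    'age':    [25, None, 31, 47, None],
    'income': [50000, 62000, None, 58000, 61000],
    'city':   ['NY', 'LA', 'NY', None, 'LA'],
})

# count nulls per column and plot them as a bar chart
null_counts = df.isnull().sum()
null_counts.plot(kind='bar', title='Null values per column')
plt.ylabel('Number of nulls')
plt.show()
```

A dedicated library like missingno offers richer views of the same information, but even this simple bar chart spares you from scanning the raw table row by row.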
Data Exploration

 The visual representation of data helps both technical and non-technical professionals get an overview of what the data is about.
 Data Science Visualization gives anyone the power to perform exploratory analysis on the datasets provided and turn it into explanatory analysis. The advent of user-friendly Data Science Visualization Tools like Tableau has also improved the process.
 These Data Science Visualization Tools provide on-the-go
analysis on portable devices.
 Data Scientists can also leverage this to improve their decision-
making process by identifying anomalies and relationships
between data items.
Evaluation of Modeling Outputs

 Data Scientists build Machine Learning Models for Predictive Analysis.

 These models are trained with large datasets to improve them. When training the models, the results are evaluated with Data Science Visualization Tools to know how well the model is doing and where it is lacking.

 Apart from visualizing just outputs, the test data used to train the
models and the model’s responsiveness can also be visualized to
make more informed decisions.
Identifying Trends

Data Scientists and Data Analysts, at times, work with real-time data
to derive meaningful trends. As real-time data is always fluctuating, it
becomes difficult to analyze it. This is where the data can be
visualized using charts and graphs for better understanding. This
helps in making informed decisions not just in Data Science but in
Business Intelligence in general.
Presenting Results

The result of analysis at any point of processing can always be visualized. The visualization can be done by anyone with knowledge of Data Science Visualization Tools, not just a Data Scientist. As long as the data is from a supported data source, a Data Science Visualization Tool can represent it in its supported formats such as Graphs, Curves, or Charts.
Different types of Data Visualizations:
There are different types of plots
 Bar Plot
 Line Plot
 Scatter Plot
 Area Plot
 Histogram
 Pie Chart
Bar Chart:

 A Bar Plot is very easy to understand and is therefore the most widely used plotting model.
 Simplicity and Clarity are the 2 major advantages of using a Bar
Plot. It can be used when you are comparing variables in the
same category or tracking the progression of 1 or 2 variables
over time.
 For example, to compare the marks of a student in multiple
subjects, a Bar Plot is the best choice.
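The marks example above can be sketched with matplotlib (the subject names and marks here are made up):

```python
import matplotlib.pyplot as plt

# hypothetical marks of one student in five subjects
subjects = ['Maths', 'Physics', 'Chemistry', 'English', 'Biology']
marks = [85, 72, 90, 78, 66]

# one bar per subject makes the comparison immediate
plt.bar(subjects, marks, color='steelblue')
plt.xlabel('Subject')
plt.ylabel('Marks')
plt.title('Marks of a student across subjects')
plt.show()
```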
Line Plot:
 A Line Plot is widely used for the comparison of stockpiles, or
for analyzing views on a video or post over time.
 The major advantage of using Line Plot is that it is very intuitive
and you can easily understand the result, even if you have no
experience in this field.
 It is commonly used to track and compare several variables over
time, analyze trends, and predict future values.
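As a small sketch of tracking several variables over time (the view counts below are invented), a Line Plot can compare two posts day by day:

```python
import matplotlib.pyplot as plt

# hypothetical daily views on two posts over a week
days = list(range(1, 8))
post_a = [120, 150, 180, 170, 210, 260, 300]
post_b = [80, 95, 90, 130, 160, 155, 190]

# one line per variable; markers make individual days readable
plt.plot(days, post_a, marker='o', label='Post A')
plt.plot(days, post_b, marker='s', label='Post B')
plt.xlabel('Day')
plt.ylabel('Views')
plt.title('Views over time')
plt.legend()
plt.show()
```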
Scatter Plot:
 A Scatter Plot uses dots to illustrate values of Numerical
Variables.
 It is used to analyze individual points, observe and visualize
relationships between variables, or get a general overview of
variables.
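A minimal Scatter Plot sketch (with synthetic data standing in for two related numerical variables):

```python
import matplotlib.pyplot as plt
import random

# hypothetical relationship: hours studied vs. exam score
random.seed(0)
hours = [random.uniform(0, 10) for _ in range(30)]
scores = [40 + 5 * h + random.uniform(-8, 8) for h in hours]

# each dot is one observation; the cloud's shape reveals the relationship
plt.scatter(hours, scores)
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.title('Relationship between two numerical variables')
plt.show()
```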
Area Plot:
 An Area Plot displays Quantitative Data graphically.
 It is very much like Line Plot but with the key difference of
highlighting the distance between different variables.
 This makes it visually clearer and easy to understand.
 It is generally used to analyze progress in Time Series, analyze
Market Trends and Variations, etc.
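The time-series use case above can be sketched as a stacked Area Plot (the revenue figures are made up):

```python
import matplotlib.pyplot as plt

# hypothetical monthly revenue of two product lines (a time series)
months = list(range(1, 7))
product_a = [10, 12, 14, 13, 17, 20]
product_b = [5, 6, 9, 11, 10, 13]

# stacking highlights the distance between the two series
plt.stackplot(months, product_a, product_b,
              labels=['Product A', 'Product B'])
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.legend(loc='upper left')
plt.title('Progress of two series over time')
plt.show()
```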

Histogram:

 A Histogram graphically represents the frequency of Numerical Data using bars.
 Unlike Bar Plot, it only represents Quantitative Data.
 The bars in the Histogram touch each other i.e. there is no space
between the bars.
 It is generally used when you are dealing with large datasets and
want to detect any unusual activities or gaps in the data.
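A short Histogram sketch on a synthetic large dataset (1000 random measurements, generated here only for illustration):

```python
import matplotlib.pyplot as plt
import random

# hypothetical dataset: 1000 measurements around a mean of 50
random.seed(1)
values = [random.gauss(50, 10) for _ in range(1000)]

# bars touch each other; each bar counts values falling in one bin
plt.hist(values, bins=20, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of 1000 measurements')
plt.show()
```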
Pie Chart:
 A Pie Chart represents the data in a circular graph.
 The slices in a Pie Chart represent the relative size of the data.
 Pie Chart is generally used to represent Categorical Data.
 For example, comparison in Areas of Growth within a business
such as Profit, Market Expenses, etc.
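The business example above can be sketched as a Pie Chart (the category shares are invented and chosen to sum to 100):

```python
import matplotlib.pyplot as plt

# hypothetical breakdown of areas of growth within a business
categories = ['Profit', 'Market Expenses', 'Salaries', 'R&D']
shares = [35, 20, 30, 15]

# each slice shows the relative size of its category
plt.pie(shares, labels=categories, autopct='%1.0f%%')
plt.title('Areas of growth within a business')
plt.show()
```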
Top 3 Data Science Visualization Tools

Although there are various Data Science Visualization Tools available in the market, the top 3 favored tools are listed below:

 Tableau
 Looker
 Microsoft Power BI

Tableau:
 Tableau is the most preferred Data Science Visualization Tool
by Data Analysts, Data Scientists, and Statisticians.
 It gives you the power to explore and analyze data in seconds.
 You can connect your data to your Tableau account to analyze
the trends.
 Tableau is highly compatible with Spreadsheets (Excel, Access,
etc), Databases (Microsoft SQL Server, MySQL, MonetDB,
etc), and Big Data (Cloudera Hadoop, DataStax Enterprise, etc).
 You can also access your Data Warehouses or Cloud Data using
Tableau. It is very easy to use and helps you transform and
shape your data for analysis.
Looker:
 Looker is a Data Science Visualization Tool that provides a real-time dashboard of the data for more in-depth analysis.
 This gives you the advantage of making instant decisions based on the data visualizations obtained.
 Looker provides easy connections to Cloud-Based Data Warehouses like Amazon Redshift, Google BigQuery, and Snowflake, as well as 50+ supported SQL dialects.
Microsoft Power BI:
 Microsoft Power BI is a Data Science Visualization Tool that
focuses on creating a data-driven Business Intelligence culture
in an organization.
 It helps you quickly connect to your data, model it, and then
visualize it for better analysis. Microsoft Power BI also gives
you the option to securely share meaningful insights of your
data among your team members with a graphical view.
 It supports 100s of data sources (on-premise or cloud) like
Excel, Salesforce, Google Analytics, and Social Networks
(Facebook, Twitter, Reddit, etc). It also supports IoT devices for
real-time information.
Data Encoding:
 When working with datasets, we often find that some of the features are categorical; if we pass such a feature directly to our model, the model can't understand those feature variables.
 We all know that machines can't understand categorical data. Machines require all independent and dependent variables, i.e. input and output features, to be numeric.
 This means that if our data contains a categorical variable, we must encode it to numbers before we fit our data to the model.
What is Data Encoding?
Data Encoding is an important pre-processing step in Machine
Learning. It refers to the process of converting categorical or textual
data into numerical format, so that it can be used as input for
algorithms to process. The reason for encoding is that most machine
learning algorithms work with numbers and not with text or
categorical variables.

Why it is Important?
• Most machine learning algorithms work only with numerical data,
so categorical variables (such as text labels) must be transformed into
numerical values.
• This allows the model to identify patterns in the data and make
predictions based on those patterns.

• Encoding also helps to prevent bias in the model by ensuring that all
features are equally weighted.

• The choice of encoding method can have a significant impact on


model performance, so it is important to choose an appropriate
encoding technique based on the nature of the data and the specific
requirements of the model.
There are several methods for encoding categorical variables, including:

1. One-Hot Encoding

2. Dummy Encoding

3. Label Encoding

4. Ordinal Encoding

5. Binary Encoding

6. Count Encoding

7. Target Encoding

One-Hot Encoding:
• One-Hot Encoding is the most common method for encoding categorical variables.

• A binary column is created for each unique category in the variable.

• If a category is present in a sample, the corresponding column is set to 1, and all other columns are set to 0.

• For example, if a variable has three categories ‘A’, ‘B’ and ‘C’, three columns will be created, and a sample with category ‘B’ will have the value [0,1,0].

# One-Hot Encoding:
import pandas as pd

# create a sample dataframe with a categorical variable
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# perform one-hot encoding on the 'color' column
one_hot = pd.get_dummies(df['color'])

# concatenate the one-hot encoding with the original dataframe
df1 = pd.concat([df, one_hot], axis=1)

# drop the original 'color' column
df1 = df1.drop('color', axis=1)
print(df1)

Dummy Encoding

• The dummy coding scheme is similar to one-hot encoding.

• This categorical data encoding method transforms the categorical variable into a set of binary variables [0/1].

• In the case of one-hot encoding, for N categories in a variable, it uses N binary variables.

• Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.
One-Hot Encoding vs Dummy Encoding:

One-Hot Encoding — N categories in a variable, N binary variables.

Dummy Encoding — N categories in a variable, N-1 binary variables.

# Dummy Encoding:
import pandas as pd

# create a sample dataframe with a categorical variable
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Blue']}
df = pd.DataFrame(data)

# use get_dummies() with drop_first=True for dummy encoding (N-1 columns)
dummy_df = pd.get_dummies(df['Color'], drop_first=True, prefix='Color')

# concatenate the dummy dataframe with the original dataframe
df = pd.concat([df, dummy_df], axis=1)
print(df)

Label Encoding:

Each unique category is assigned a unique integer value. This is a simpler encoding method, but it has a drawback: the assigned integers may be misinterpreted by the machine learning algorithm as having an ordered relationship when in fact they do not.

# Label Encoding:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# create a sample dataframe with categorical data
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
print(f"Before Encoding the Data:\n\n{df}\n")

# create a LabelEncoder object
le = LabelEncoder()

# fit and transform the categorical data
df['color_label'] = le.fit_transform(df['color'])
print(f"After Encoding the Data:\n\n{df}\n")

Ordinal Encoding:

• Ordinal Encoding is used when the categories in a variable have a natural ordering.

• In this method, the categories are assigned a numerical value based on their order, such as 1, 2, 3, etc.

• For example, if a variable has categories ‘Low’, ‘Medium’ and ‘High’, they can be assigned the values 1, 2, and 3, respectively.

# Ordinal Encoding:
import pandas as pd

# create a sample dataframe with a categorical variable
df = pd.DataFrame({'quality': ['low', 'medium', 'high', 'medium']})
print(f"Before Encoding the Data:\n\n{df}\n")

# specify the order of the categories
quality_map = {'low': 0, 'medium': 1, 'high': 2}

# perform ordinal encoding on the 'quality' column
df['quality_map'] = df['quality'].map(quality_map)
print(f"After Encoding the Data:\n\n{df}\n")

Binary Encoding:

• Binary Encoding is similar to One-Hot Encoding, but instead of creating a separate column for each category, the categories are represented as binary digits.

• For example, if a variable has four categories ‘A’, ‘B’, ‘C’ and ‘D’, they can be represented with just two binary digits as 00, 01, 10 and 11, respectively.
# Binary Encoding:
import pandas as pd

# create a sample dataframe with a categorical variable
df = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'cat']})
print(f"Before Encoding the Data:\n\n{df}\n")

# perform binary encoding on the 'animal' column:
# first map each category to an integer, then write it in binary
animal_map = {'cat': 0, 'dog': 1, 'bird': 2}
df['animal'] = df['animal'].map(animal_map)
df['animal'] = df['animal'].apply(lambda x: format(x, 'b'))

# print the resulting dataframe
print(f"After Encoding the Data:\n\n{df}\n")

Count Encoding:

• Count Encoding is a method for encoding categorical variables by counting the number of times a category appears in the dataset.

• For example, if a variable has categories ‘A’, ‘B’ and ‘C’ and category ‘A’ appears 10 times in the dataset, it will be assigned a value of 10.

# Count Encoding:
import pandas as pd

# create a sample dataframe with a categorical variable
df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'banana']})
print(f"Before Encoding the Data:\n\n{df}\n")

# perform count encoding on the 'fruit' column
counts = df['fruit'].value_counts()
df['fruit'] = df['fruit'].map(counts)

# print the resulting dataframe
print(f"After Encoding the Data:\n\n{df}\n")

Target Encoding:

• This is a more advanced encoding technique used for dealing with high cardinality categorical features, i.e., features with many unique categories.

• The average target value for each category is calculated and this average value is used to replace the categorical feature.

• This has the advantage of considering the relationship between the target and the categorical feature, but it can also lead to overfitting if not used with caution.

# Target Encoding:
import pandas as pd

# create a sample dataframe with categorical data and target
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                   'target': [0, 1, 0, 1, 0]})
print(f"Before Encoding the Data:\n\n{df}\n")

# calculate the mean target value for each category
target_mean = df.groupby('color')['target'].mean()

# replace the categorical data with the mean target value
df['color_label'] = df['color'].map(target_mean)
print(f"After Encoding the Data:\n\n{df}")

Visual Encodings:

The visual encoding is the way in which data is mapped into visual structures, upon which we build the images on a screen.

There are two types of visual encoding variables: planar and retinal.
 Humans are sensitive to the retinal variables.
 They easily differentiate between various colors, shapes, sizes and other properties.
 Retinal variables were introduced by Jacques Bertin.

Data types

There are three basic types of data: something you can count,
something you can order and something you can just differentiate.
Quantitative

Anything that has exact numbers.

For example, Effort in points: 0, 1, 2, 3, 5, 8, 13. Duration in days: 1, 4, 666.
Ordered / Qualitative

Anything that can be compared and ordered.

User Story Priority: Must Have, Great, Good, Not Sure. Bug Severity: Blocking, Average, Who Cares.
Categorical

Everything else.

Entity types: Bugs, Stories, Features, Test Cases. Fruits: Apples, Oranges, Plums.

Planar and Retinal Variables

We have several visual encoding variables.
X and Y

Planar variables are known to everybody. If you’ve studied maths (which I’m sure you have), you’ve been drawing graphs across the X- and Y-axis. Planar variables work for any data type. They work great to present any quantitative data. It’s a pity that we have to deal with flat screens and just two planar variables. Well, we can try to use the Z-axis, but 3D charts look horrible on screen in 95.8% of cases.

So what should we do then to present three or more variables? We can use the retinal variables!
Size

We know that size does matter. You can see the difference right away.
Small is innocuous, large is dangerous perhaps. Size is a good
visualizer for the quantitative data.
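The idea above can be sketched with matplotlib: position covers the two planar variables, and a third quantitative variable is encoded as a retinal variable, marker size (the coordinates and populations below are made up):

```python
import matplotlib.pyplot as plt

# hypothetical cities: x/y are planar variables (position),
# population is a retinal variable (marker size, in thousands)
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
population = [100, 400, 250, 900, 600]

# larger markers read as "more" at a glance
plt.scatter(x, y, s=population)
plt.xlabel('X (planar)')
plt.ylabel('Y (planar)')
plt.title('Third variable encoded as size')
plt.show()
```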

Texture

Texture is less common. You can’t touch it on screen, and it’s usually
less catchy than color. So, in theory texture can be used for soft
encoding, but in practice it’s better to pass on it.
Shape

Round circles ○, edgy stars ☆, solid rectangles █. We can easily distinguish dozens of shapes. They do work well sometimes for the visual encoding of categories.

Orientation

Orientation is tricky.

While we’re able to clearly identify vertical vs. horizontal lines, it is harder to use orientation properly for visual encoding.

Color Value

Any color value can be moved over a scale. Greyscale is a good example. While we can’t be certain at a glance that #999 is lighter than #888, it’s still a helpful technique to visualize ordered data.
Color Hue

Red color is alarming. Green color is calm. Blue color is peaceful. Colors are great to separate categories.

The general rule of thumb is that you can use no more than a dozen colors to encode categories effectively. If there are more, it’d be hard to differentiate between categories quickly.
