Assignment1 DataViz D

1.
Given a dataset with age and income (assuming continuous data type), if we want to
understand the relationship between two or more continuous variables, what kind of graphs
could be useful? Support your argument with an example of data generated with a seed value
of 1. The age should be between 25 and 55 and income should be between 10000 and 50000.
Generate one or more plots and attach them to your assignment.
Answer:
To understand the relationship between two or more continuous variables, such as age and income, several types
of graphs are useful. A scatter plot is the standard choice for two variables, a line plot suits data with a natural
ordering such as time, and a heatmap of the correlation matrix summarizes many variables at once. The example
below generates a dataset with a seed value of 1, where age ranges from 25 to 55 and income ranges from
10,000 to 50,000, and draws a scatter plot of age against income using Python and the Matplotlib library:
import numpy as np
import matplotlib.pyplot as plt
# Set a random seed for reproducibility
np.random.seed(1)
# Generate example data
age = np.random.uniform(25, 55, 100)
income = np.random.uniform(10000, 50000, 100)
# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(age, income, alpha=0.6, color='b', edgecolors='k')
plt.title("Relationship between Age and Income")
plt.xlabel("Age")
plt.ylabel("Income")
plt.grid(True)
plt.show()
In this scatter plot, age is plotted on the x-axis and income on the y-axis. Each point represents one individual
in the dataset. The plot lets you assess visually whether there is any pattern or correlation between the two
variables. In this example there is no apparent linear relationship, which is expected because age and income
were drawn independently of each other. A statistical measure such as the Pearson correlation coefficient can
quantify the relationship if needed.
Additionally, you can explore other types of plots, such as line plots for time series data or heatmaps for
visualizing the correlation matrix of multiple continuous variables. The choice of the plot depends on the specific
goals of your analysis and the nature of the data.
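As a rough sketch of how the relationship could be quantified, the snippet below recreates the same seeded data and computes the Pearson correlation coefficient with NumPy. The `savings` column is a hypothetical third variable, invented here only so that the correlation matrix has more than two columns; it is not part of the assignment data.

```python
import numpy as np

# Recreate the data from the scatter plot example (same seed, same draw order)
np.random.seed(1)
age = np.random.uniform(25, 55, 100)
income = np.random.uniform(10000, 50000, 100)

# Pearson correlation coefficient between age and income
r = np.corrcoef(age, income)[0, 1]
print(f"Pearson r between age and income: {r:.3f}")

# Hypothetical third variable (an assumption for illustration only):
# savings depends on income plus some noise
savings = 0.2 * income + np.random.normal(0, 1000, 100)

# Correlation matrix; rows and columns are ordered age, income, savings
corr = np.corrcoef(np.vstack([age, income, savings]))
print(np.round(corr, 2))
```

Because age and income are drawn independently, the coefficient should land close to zero, while the constructed savings column correlates strongly with income. The `corr` array is the kind of matrix a heatmap would display, for example with `plt.imshow(corr)` or `sns.heatmap(corr)`.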
2. Why do you think the object-oriented approach of matplotlib is better than the stateful version of
matplotlib? Write your points to support this argument with an example.
Answer:
The object-oriented approach of Matplotlib is generally considered more powerful and flexible than the stateful
version for several reasons. Here are some points to support this argument:
Explicit Control: With the object-oriented approach, you have more explicit control over every element of your
plot. You create and manipulate individual objects like Figure, Axes, and Artists, which allows for fine-grained
customization. This control can be essential when creating complex or customized visualizations.
Modularity: The object-oriented approach promotes modularity and reusability. You can create complex plots by
composing individual components, making your code more organized and easier to maintain. This is particularly
advantageous when working on large, multi-panel plots.
Multiple Figures and Axes: Object-oriented Matplotlib allows you to work with multiple Figures and Axes
simultaneously, making it easy to create subplots or complex layouts. In contrast, the stateful version can
become cumbersome when managing multiple plots.
Clearer Code: The object-oriented approach results in more readable and self-explanatory code. You explicitly
create and manipulate each element, making it easier to understand the purpose and hierarchy of each part of
the plot.
Let's illustrate these points with an example:
import matplotlib.pyplot as plt

# Object-oriented Matplotlib
# Create a Figure and a 2x2 grid of subplots (Axes)
fig, axes = plt.subplots(2, 2)

# Customize each subplot individually through its own Axes object
axes[0, 0].plot([1, 2, 3, 4], [2, 4, 6, 8], label='Line 1')
axes[0, 0].set_title('Subplot 1')
axes[1, 1].bar(['A', 'B', 'C'], [3, 7, 2], color='red')
axes[1, 1].set_title('Subplot 4')
# Add a legend to the first subplot
axes[0, 0].legend()
fig.tight_layout()

# Stateful Matplotlib
plt.figure() # Create a new figure, which becomes the implicit "current" figure
plt.plot([1, 2, 3, 4], [2, 4, 6, 8], label='Line 1')
plt.title('Line Plot')
plt.figure() # Create another figure, which replaces the previous one as "current"
plt.bar(['A', 'B', 'C'], [3, 7, 2], color='red')
plt.title('Bar Plot')
# This code results in two separate figures, but customization is less explicit.
# Stateful version may lead to less modular and more error-prone code.
plt.show()
In the object-oriented approach, you can see that you have full control over individual subplots, making it more
organized and modular. In contrast, the stateful version results in separate figures, but it's less clear how they
are related, which can make the code less maintainable and harder to understand.
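The modularity point can be sketched with a small helper that receives the Axes it should draw on. The function name `draw_line_panel` and the sample values are hypothetical, chosen only for illustration; the point is that the same function can fill any panel of any figure because nothing depends on a hidden "current" plot.

```python
import matplotlib.pyplot as plt


def draw_line_panel(ax, x, y, title):
    """Draw one labelled line plot on the Axes passed in (hypothetical helper)."""
    ax.plot(x, y, marker="o", label=title)
    ax.set_title(title)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return ax


# Reuse the same helper for two different panels of one figure
x = [1, 2, 3, 4]
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
draw_line_panel(axes[0], x, [2, 4, 6, 8], "Linear")
draw_line_panel(axes[1], x, [1, 4, 9, 16], "Quadratic")
fig.tight_layout()
```

Calling `plt.show()` afterwards displays the figure. A stateful equivalent would have to call `plt.subplot(...)` before each group of commands and rely on the reader tracking which panel is active at each line.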
3. What are the differences and similarities between histograms and bar plots in general?
Answer:
Histograms and bar plots are both common ways to visualize data, but they have distinct differences and some
similarities:
Differences:
Data Representation:
Histogram: Histograms are used to represent the distribution of a continuous variable. They divide the data into
bins or intervals and show the frequency or count of data points within each bin.
Bar Plot: Bar plots, on the other hand, are typically used to display discrete, categorical data. They show the
values of different categories or groups.
X-Axis:
Histogram: The x-axis in a histogram represents the continuous variable being measured, and the bins are
ranges of values.
Bar Plot: The x-axis in a bar plot usually represents categories or groups, and each bar represents the value
associated with that category.
Data Type:
Histogram: Typically used for quantitative data like age, income, or test scores.
Bar Plot: Typically used for categorical data like product names, cities, or survey responses.
Similarities:
Visual Representation:
Both histograms and bar plots use rectangular bars to represent data values, making them easy to interpret and
compare.
Summarization:
Both types of plots summarize data in a visually understandable format, allowing viewers to quickly grasp
essential information about the dataset.
Use of Bars:
In both plots, the height of the bars conveys information. In histograms, it represents the frequency or density of
data within each bin, while in bar plots, it represents the value associated with each category.
Customization:
Both types of plots can be customized with colors, labels, and additional information to enhance their
communicative power.
In summary, the primary difference between histograms and bar plots lies in the type of data they are designed
to represent and the x-axis's nature. Histograms are used for continuous data, showing the distribution of values,
while bar plots are used for discrete, categorical data, displaying comparisons between different categories or
groups. However, both types of plots use bars to convey information and serve as valuable tools for data
visualization.
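A minimal sketch of the contrast, assuming made-up data: the ages are random draws and the city labels are invented for illustration. The histogram bins a continuous variable, while the bar plot draws one bar per category.

```python
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

# Illustrative data only: random ages and made-up city labels
rng = np.random.default_rng(1)
ages = rng.uniform(25, 55, 200)
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"]

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(9, 3.5))

# Histogram: continuous values are grouped into bins, and adjacent bars share edges
counts, bin_edges, _ = ax_hist.hist(ages, bins=6, edgecolor="black")
ax_hist.set_title("Histogram of age (continuous)")
ax_hist.set_xlabel("Age")
ax_hist.set_ylabel("Count")

# Bar plot: one separate bar per category; the order of the categories is arbitrary
city_counts = Counter(cities)
ax_bar.bar(list(city_counts.keys()), list(city_counts.values()))
ax_bar.set_title("Bar plot of city (categorical)")
ax_bar.set_xlabel("City")
ax_bar.set_ylabel("Count")

fig.tight_layout()
```

Changing `bins` changes the shape of the histogram because the bin edges are a choice made by the analyst, whereas the bars of the bar plot are fixed by the categories present in the data.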
4. What are the disadvantages of using a pie chart in normal visualization?
Answer:
Pie charts are widely used for visualizing data, but they have several disadvantages when compared to other
types of charts, which should be taken into consideration:
Limited Information: Pie charts are not suitable for representing complex or multivariate data. They can
effectively show the proportions of a whole but struggle to convey more detailed information, such as trends,
comparisons, or relationships among data points.
Difficulty in Comparisons: It's challenging to compare the sizes of different slices in a pie chart accurately,
especially when there are many slices or when the differences are subtle. This makes it less suitable for data with
many categories.
Misleading: Pie charts can be misleading if not used appropriately. Small differences in slice size can be hard to
distinguish, leading to misinterpretations of the data. In some cases, it's better to use a different chart type to
avoid miscommunication.
Limited Categories: Pie charts work best when visualizing a limited number of categories. If you have too many
data points, the chart can become cluttered and confusing.
Ineffective for Time-Series Data: Pie charts are not designed for representing changes over time. If you want to
show trends or time-series data, other chart types like line charts or bar charts are more suitable.
Labeling Issues: To make a pie chart readable, you often need to add labels to each slice. This can lead to
overcrowding, particularly when dealing with small slices. Labeling can become a significant challenge if the data
has many categories.
Accessibility Concerns: Pie charts may not be accessible to individuals with visual impairments or color blindness.
Ensuring that they are usable for a wide audience can be more challenging compared to some other chart types.
Wasted Space: Pie charts require a circular shape, which can waste a significant amount of space on a page. In
situations where space is limited, other chart types might be more space-efficient.
In summary, while pie charts have their place in data visualization, it's crucial to be aware of their limitations.
They are best suited for simple data representations when the goal is to show the proportion of each category in
a whole. When dealing with more complex data or when precise comparisons are required, other chart types like
bar charts, stacked bar charts, or scatterplots may be more effective and informative.
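The comparison problem can be seen by drawing the same values both ways. The category labels and shares below are made up, with deliberately similar values; this is a sketch for illustration, not data from the assignment.

```python
import matplotlib.pyplot as plt

# Made-up shares with deliberately similar values (they sum to 100)
labels = ["A", "B", "C", "D", "E"]
shares = [22, 21, 20, 19, 18]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(9, 4))

# Pie chart: the reader has to compare angles and areas
ax_pie.pie(shares, labels=labels, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

# Bar chart: the reader compares bar heights along a common axis
ax_bar.bar(labels, shares)
ax_bar.set_ylabel("Share (%)")
ax_bar.set_title("Bar chart")

fig.tight_layout()
```

Without the percentage labels, the slices are hard to rank by eye, while the bar heights can be read against a shared baseline.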
5. What is the difference between univariate and bivariate plots? Use the mtcars dataset and
generate univariate and bivariate plots for at least 5 of the features.
Answer:
Univariate and bivariate plots are two types of data visualization techniques used to explore and understand
data.
Univariate Plots:
Univariate plots focus on a single variable or feature at a time, allowing you to understand its distribution and
characteristics in isolation.
These plots are useful for visualizing the distribution of a single variable, identifying patterns, outliers, and
skewness.
Common univariate plots include histograms, box plots, density plots, and bar plots.
Bivariate Plots:
Bivariate plots, on the other hand, involve the analysis of the relationships between two variables simultaneously.
They are used to explore the interactions and correlations between pairs of variables, helping to uncover
patterns and associations.
Common bivariate plots include scatter plots, line plots, and joint histograms.
Now, using the mtcars dataset, let's generate both univariate and bivariate plots to illustrate the difference.
The example uses four features (mpg, hp, cyl, and gear); the same pattern extends to further features such as wt:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Load the mtcars dataset (assumes mtcars.csv is in the working directory)
mtcars = pd.read_csv("mtcars.csv")
# Draw six panels in a 2x3 grid, selecting a new subplot before each plot
plt.figure(figsize=(15, 8))
plt.subplot(2, 3, 1)
sns.histplot(mtcars['mpg'], kde=True)
plt.title('Univariate: MPG Distribution')
plt.subplot(2, 3, 2)
sns.boxplot(x=mtcars['cyl'], y=mtcars['mpg'])
plt.title('Bivariate: MPG vs. Cylinders')
plt.subplot(2, 3, 3)
sns.histplot(mtcars['hp'], kde=True)
plt.title('Univariate: Horsepower Distribution')
plt.subplot(2, 3, 4)
sns.boxplot(x=mtcars['gear'], y=mtcars['hp'])
plt.title('Bivariate: Horsepower vs. Gears')
plt.subplot(2, 3, 5)
sns.countplot(x=mtcars['cyl'])
plt.title('Univariate: Cylinder Count')
plt.subplot(2, 3, 6)
sns.boxplot(x=mtcars['gear'], y=mtcars['cyl'])
plt.title('Bivariate: Cylinders vs. Gears')
plt.tight_layout()
plt.show()
In the example above, we've created univariate plots for the 'mpg' (miles per gallon) and 'hp' (horsepower)
variables, as well as a count plot for the 'cyl' (number of cylinders) variable. For bivariate plots, we've explored
relationships between these features and 'cyl' and 'gear' (number of forward gears) variables. These plots help
us understand the data from both univariate and bivariate perspectives, revealing individual variable distributions
and pairwise relationships between them.
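Since the question asks for at least five features, one possible way to cover them is a loop that draws one univariate and one bivariate panel per feature. The helper name `plot_feature_grid` is hypothetical. The six rows below are typed in by hand to keep the sketch self-contained; they are intended to mirror the first rows of the standard mtcars table, but treat them as illustrative and load the full mtcars.csv for the actual assignment.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_grid(df, features, target):
    """Top row: histogram of each feature. Bottom row: each feature against target."""
    fig, axes = plt.subplots(2, len(features), figsize=(3 * len(features), 6),
                             squeeze=False)
    for col, name in enumerate(features):
        # Univariate view: distribution of a single feature
        axes[0, col].hist(df[name], bins=5, edgecolor="black")
        axes[0, col].set_title(f"Univariate: {name}")
        # Bivariate view: the same feature plotted against the target
        axes[1, col].scatter(df[name], df[target])
        axes[1, col].set_xlabel(name)
        axes[1, col].set_ylabel(target)
        axes[1, col].set_title(f"Bivariate: {name} vs. {target}")
    fig.tight_layout()
    return fig, axes


# Hand-typed sample (assumption: stands in for pd.read_csv("mtcars.csv"))
sample = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
    "cyl": [6, 6, 4, 6, 8, 6],
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0, 225.0],
    "hp": [110, 110, 93, 110, 175, 105],
    "wt": [2.620, 2.875, 2.320, 3.215, 3.440, 3.460],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22],
})

fig, axes = plot_feature_grid(sample, ["cyl", "disp", "hp", "wt", "qsec"], target="mpg")
```

With the real dataset, replace `sample` with the loaded DataFrame and call `plt.show()` to display the ten panels. Discrete features such as cyl may read better as a count plot in the top row and a box plot in the bottom row.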
