
KNOWLEDGE INSTITUTE OF TECHNOLOGY, SALEM-637 504

Department of Computer Science and Engineering


Internal Assessment Test – II
CS3352 – FOUNDATIONS OF DATA SCIENCE
ANSWER KEY
PART - A
Q.No. Question
Write the computation formula for Standard Error of Estimate.
Standard Error of Estimate:
The standard error of estimate represents a special kind of standard deviation that
reflects the magnitude of predictive error. It is computed as
1. Sy|x = √[ Σ(Y − Y′)² / (n − 2) ] = √[ SSy(1 − r²) / (n − 2) ]
where Y′ is the predicted Y value, SSy is the sum of squares for Y, r is the correlation, and n is the number of paired observations.

What is Causation?
Causation refers to the relationship between cause and effect, where one event or variable is
2. responsible for the occurrence of another event or the change in the value of another variable. In other
words, if a change in one variable leads to a change in another variable, there is said to be a causal
relationship between them.
List the categories of NumPy's basic array manipulation.
The main categories of basic array manipulation in NumPy include:
• Changing Array Shape
3. • Joining Arrays
• Splitting Arrays
• Adding/Removing Elements
• Indexing and Slicing
Write a code snippet to explain concatenation of arrays using concatenate(), hstack() and vstack().
4.
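A minimal sketch of the three joining functions (the arrays a and b below are made up for the example):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# concatenate() joins arrays along an existing axis (axis=0 for 1-D arrays)
c = np.concatenate([a, b])   # [1 2 3 4 5 6]

# hstack() stacks arrays horizontally (column-wise)
h = np.hstack([a, b])        # [1 2 3 4 5 6]

# vstack() stacks arrays vertically (row-wise), producing a 2-D array
v = np.vstack([a, b])        # [[1 2 3]
                             #  [4 5 6]]
print(c, h, v, sep='\n')
```

For 1-D inputs, concatenate() and hstack() give the same result; vstack() adds a new row axis.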
What is the essential difference between NumPy array and Pandas array indexing?
• Indexing Syntax
• Index Types
5.
• Handling Missing Values
• Alignment
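The differences listed above can be sketched with a small example (the values and labels are made up for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# NumPy indexing is purely positional
print(arr[0])        # 10

# Pandas supports label-based (.loc) and positional (.iloc) indexing
print(s.loc['a'])    # 10
print(s.iloc[0])     # 10

# Alignment: Pandas arithmetic aligns on index labels,
# introducing NaN where labels do not match
t = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
print(s + t)         # 'a' and 'd' become NaN; 'b' and 'c' are added
```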
Enumerate the attributes of numpy array.
6.
(i)shape (ii)ndim (iii)size (iv)dtype (v)itemsize (vi)nbytes
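A short snippet showing all six attributes on a sample array:

```python
import numpy as np

x = np.zeros((2, 3), dtype=np.int32)

print(x.shape)     # (2, 3)  dimensions of the array
print(x.ndim)      # 2       number of dimensions
print(x.size)      # 6       total number of elements
print(x.dtype)     # int32   element data type
print(x.itemsize)  # 4       bytes per element
print(x.nbytes)    # 24      total bytes (size * itemsize)
```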
Write a code snippet for basic errorbar with a single Matplotlib function call.
7.
Sample Snippet + Output
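One possible snippet, using synthetic data with a constant error of 0.8 on each y value; the single plt.errorbar call draws the points together with their error bars:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: a noisy sine curve with known measurement error dy
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)

# A single function call draws both the points and the error bars
plt.errorbar(x, y, yerr=dy, fmt='.k')
plt.show()
```

The output is a scatter of black dots, each with a vertical error bar of half-length 0.8.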
Write the code snippet and complete the output for data = np.random.randn(1000) using plt.hist.
8. import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)
plt.show()
The output is a histogram of 1,000 normally distributed values: roughly bell-shaped, centred near 0, with nearly all values falling between –3 and 3.
Define Kernel Density Estimation (KDE).
Kernel Density Estimation (KDE) is a non-parametric method for estimating the probability density
9. function (PDF) of a random variable. It provides a smooth, continuous estimate of the underlying
distribution of the data, which can be particularly useful for visualizing and analyzing the shape of
the distribution.
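A minimal sketch of KDE using SciPy's gaussian_kde (assuming SciPy is available; seaborn's kdeplot provides the same idea for plotting):

```python
import numpy as np
from scipy.stats import gaussian_kde

# 1000 samples from a standard normal distribution
data = np.random.randn(1000)

# Fit a Gaussian kernel density estimate to the sample
kde = gaussian_kde(data)

# Evaluate the estimated PDF on a grid of points
grid = np.linspace(-4, 4, 200)
density = kde(grid)
print(density.max())  # the peak should be near the normal PDF peak (~0.4)
```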
List the ways to customize matplotlib.
• Changing Plot Style
10. • Setting Plot Colors and Markers
• Adjusting Plot Size and Aspect Ratio
• Adding Labels and Titles
PART - B
Q.No. Question
Assume that an r of –.80 describes the strong negative relationship between years of heavy smoking
(X) and life expectancy (Y). Assume, furthermore, that the distributions of heavy smoking and life
expectancy each have the following means and sums of squares:
(i) Determine the least squares regression equation for predicting life expectancy from years of
11(A) heavy smoking.
(ii) Determine the standard error of estimate, Sy|x, assuming that the correlation of –.80 was based
on n = 50 pairs of observations.
(iii) Supply a rough interpretation of Sy|x.

(OR)

Calculate the Linear Regression Line using the following Example


X 16 12 18 4 3 10 5 12
y 87 88 89 68 78 80 75 83

Step 1: Tabulate x, y, xy, x²


S.No X Y xy x2
1 16 87 1392 256
2 12 88 1056 144
3 18 89 1602 324
4 4 68 272 16
5 3 78 234 9
6 10 80 800 100
7 5 75 375 25
8 12 83 996 144
∑ ∑x = 80 ∑y = 648 ∑xy = 6727 ∑x² = 1018
11(B)
Step 2: Take the summation of x, y, xy, x²

∑x = 80 ∑y = 648 ∑xy = 6727 ∑x² = 1018


Step 3: Find SSxx, SSxy
SSxx = ∑x² – [(∑x)²/n] => 1018 – [(80)²/8] => 1018 – 800 => 218
SSxy = ∑xy – [(∑x∑y)/n] => 6727 – [(80×648)/8] => 6727 – 6480 => 247
SSxx = 218
SSxy = 247
Step 4: Find the slope b and the intercept a in y = bx + a
b = SSxy / SSxx => b = 247/218 => 1.133
a = ȳ – b*x̄
x̄ = ∑x/n => 80/8 => 10
ȳ = ∑y/n => 648/8 => 81
a = 81 – 1.133*10 => 81 – 11.33 => 69.67
Step 5: Substitute a, b in the equation
y = bx + a
y = 1.133x + 69.67
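The hand computation above can be checked with a short NumPy snippet:

```python
import numpy as np

x = np.array([16, 12, 18, 4, 3, 10, 5, 12])
y = np.array([87, 88, 89, 68, 78, 80, 75, 83])

# Least-squares sums, matching Steps 1-3 above
ss_xx = (x**2).sum() - x.sum()**2 / len(x)       # 218.0
ss_xy = (x*y).sum() - x.sum()*y.sum() / len(x)   # 247.0

b = ss_xy / ss_xx             # slope     ≈ 1.133
a = y.mean() - b * x.mean()   # intercept ≈ 69.67
print(round(b, 3), round(a, 2))
```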
Explain NumPy's functions in detail with examples:
i. Ufuncs ii. Broadcasting
Universal Functions (ufuncs):
Universal functions (ufuncs) in NumPy are functions that operate element-wise on arrays, meaning
they perform operations on each element of an array independently. Ufuncs are a key feature of
12(A) NumPy and provide efficient computation with large datasets.
1. np.add(): Element-wise addition.
Sample Code + Output
2. np.subtract(): Element-wise subtraction.
Sample Code + Output
3. np.multiply(): Element-wise multiplication.
Sample Code + Output
4. np.divide(): Element-wise division.
Sample Code + Output
5. np.sqrt(): Element-wise square root.
Sample Code + Output
6. np.sin(): Element-wise sine.
Sample Code + Output
7. np.exp(): Element-wise exponential.
Sample Code + Output
8. np.sum(): Sum of array elements.
Sample Code + Output
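A compact sketch covering several of the ufuncs listed above (np.sin and np.exp follow the same element-wise pattern):

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

print(np.add(a, b))       # [ 2.  6. 12.]
print(np.subtract(a, b))  # [0. 2. 6.]
print(np.multiply(a, b))  # [ 1.  8. 27.]
print(np.divide(a, b))    # [1. 2. 3.]
print(np.sqrt(a))         # [1. 2. 3.]
print(np.sum(a))          # 14.0
```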
ii. Broadcasting:
NumPy's broadcasting is a powerful mechanism that allows operations on arrays of different shapes
and sizes. It implicitly expands smaller arrays to match the shape of larger arrays, making element-
wise operations possible.
1. Scalar and Array:
Sample Code + Output
2. Arrays of Different Shapes:
Sample Code + Output
3. Broadcasting with a Larger Array:
Sample Code + Output
4. Broadcasting Rules:
Broadcasting follows strict rules to determine how arrays are expanded. For example, dimensions
are padded on the left, and sizes must be compatible or one of them must be 1.
5. Broadcasting Limitations:
While broadcasting is powerful, it's important to be aware of its limitations. Not all operations can be
broadcasted, and in some cases, explicit reshaping of arrays may be necessary.
These examples demonstrate the flexibility and convenience that NumPy's ufuncs and broadcasting
bring to array operations. They significantly simplify code and enhance performance in numerical
computing tasks.
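The broadcasting cases above can be sketched as follows (the arrays are made up for illustration):

```python
import numpy as np

# 1. Scalar and array: the scalar is stretched across the array
a = np.array([1, 2, 3])
print(a + 5)                      # [6 7 8]

# 2. Arrays of different shapes: a (3,) row and a (3, 1) column
row = np.array([0, 1, 2])
col = np.array([[0], [10], [20]])
grid = col + row                  # both broadcast to shape (3, 3)
print(grid)

# 3. Incompatible shapes raise an error instead of broadcasting
try:
    np.ones((3, 2)) + np.ones(3)
except ValueError as e:
    print('broadcast failed:', e)
```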
(OR)
Discuss various methods of handling the missing data in pandas.
Pandas’ choice for how to handle missing values is constrained by its reliance on the NumPy
package, which does not have a built-in notion of NA values for non-floating-point datatypes.
NumPy does have support for masked arrays – i.e. arrays which have a separate boolean mask array
attached which marks data as “good” or “bad”. Pandas could have derived from this, but the
overhead in storage, computation, and code maintenance makes that an unattractive choice.
Pandas uses sentinels to handle missing values; more specifically, it uses two already-existing
Python null values:
• the Python None object
• the special floating-point NaN value
Python None object
The first sentinel value used by Pandas is None, a Python object that is often used for
missing data in Python code. Because it is a Python object, None cannot be used in arbitrary
12(B) NumPy/Pandas arrays, but only in arrays with data type 'object'.
Sample Snippet
Show the Sample Output
The use of Python objects in an array also means that you will generally get an error if you perform
aggregations like sum() or min() across an array containing None.
None and NaN in Pandas
Pandas is built to handle the None and NaN nearly interchangeably, converting between them
where appropriate:
pd.Series([1, np.nan, 2, None])
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For types that don’t have an available sentinel value, Pandas automatically type-casts when NaN
values are present.
For example, let's create a Pandas Series with dtype=int.
x = pd.Series(range(2), dtype=int)
x
0 0
1 1
dtype: int64
If we set a value in an integer array to np.nan (or None), it will automatically be upcast to a
floating-point type to accommodate the NaN:
x[0] = None
x
0 NaN
1 1.0
dtype: float64
Detecting missing values
Pandas has two useful methods for detecting missing values: isnull() and notnull() . Either one will
return a Boolean mask over the data. For example:
df.isnull() returns a Boolean same-sized DataFrame indicating if values are missing
Output
output of df.isnull()
df.notnull() returns a Boolean same-sized DataFrame which is just opposite of isnull()
output of df.notnull()
When the dataset is large, you can count the number of missing values instead.
df.isnull().sum() returns the number of missing values for each column (Pandas Series)
df.isnull().sum().sum() returns the total number of missing values
df.isnull().sum().sum()
3
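As a sketch, with a hypothetical DataFrame containing three missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: one NaN in column A, two nulls in column B
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 'x', None]})

print(df.isnull())              # Boolean mask: True where a value is missing
print(df.notnull())             # the opposite mask
print(df.isnull().sum())        # missing values per column: A: 1, B: 2
print(df.isnull().sum().sum())  # total number of missing values: 3
```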
Dropping missing data
In addition to the masking used before, there is the convenience method dropna() to remove missing
values. The method is defined as:
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
To drop rows if any NaN values are present:
df.dropna(axis = 0)
output of df.dropna(axis = 0)
To drop columns if any NaN values are present:
df.dropna(axis = 1)
output of df.dropna(axis = 1)
To drop rows with fewer than 6 non-NaN values:
df.dropna(axis = 0, thresh = 6)
output of df.dropna(axis = 0, thresh = 6)
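A runnable sketch of dropna() on a hypothetical DataFrame (thresh is lowered to 2 to suit the small example):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with NaN values in different rows and columns
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan],
                   'C': [7.0, 8.0, 9.0]})

print(df.dropna(axis=0))            # keeps only row 0 (the only row with no NaN)
print(df.dropna(axis=1))            # keeps only column C (the only complete column)
print(df.dropna(axis=0, thresh=2))  # keeps rows with at least 2 non-NaN values
```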
Replacing missing values
Data is a valuable asset so we should not give it up easily. Also, machine learning models almost
always tend to perform better with more data. Therefore, depending on the situation, we may prefer
replacing missing values instead of dropping.
There is the convenience method fillna() to replace missing values. The method is defined as:
To replace all NaN values with a scalar
df.fillna(value=10)
output of df.fillna(value=10)
To replace NaN values with the values from the previous row (forward fill):
df.fillna(axis=0, method='ffill')
output of df.fillna(axis=0, method=’ffill’)
To facilitate this convention, there are several useful methods for detecting, removing, and replacing
null values in Pandas data structures. They are:
isnull(): generate a boolean mask indicating missing values
notnull(): opposite of isnull()
dropna(): return a filtered version of the data
fillna(): return a copy of the data with missing values filled or imputed
Checking for missing values using isnull() and notnull()
In order to check missing values in a Pandas DataFrame, we use the functions isnull() and notnull().
Both functions help in checking whether a value is NaN or not.
Checking for missing values using isnull()
Sample Snippet
Show the Sample Output
Checking for missing values using notnull()
Sample Snippet
Show the Sample Output
Filling missing values using fillna(), replace() and interpolate()
In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions;
these functions replace NaN values with some value of their own.
All these functions help in filling null values in the DataFrame.
The interpolate() function is also used to fill NA values in the dataframe, but it uses various
interpolation techniques to compute the missing values rather than hard-coding a value.
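The three filling approaches can be sketched on a hypothetical Series (ffill() is used for the forward fill):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.fillna(0))            # scalar fill:          1, 0, 0, 4
print(s.ffill())              # forward fill:         1, 1, 1, 4
print(s.replace(np.nan, -1))  # replace() target NaN: 1, -1, -1, 4
print(s.interpolate())        # linear interpolation: 1, 2, 3, 4
```

Note how interpolate() computes intermediate values from the neighbours rather than repeating or hard-coding one.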
Discuss the approaches to combine datasets and identify the challenges with an example program.
Pandas Merging, Joining, and Concatenating
•Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labelled axes (rows and columns).
•A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows
and columns. We can join, merge, and concat dataframe using different methods.
•In a DataFrame, the df.merge() and df.join() methods and the pd.concat() function help in joining,
merging and concatenating different dataframes.
Concatenating DataFrame
13(A)
In order to concatenate dataframes, we use the concat() function, which helps in concatenating
dataframes. We can concatenate dataframes in many different ways:
•Concatenating DataFrame using .concat()(Briefly Explain with suitable examples)
•Concatenating DataFrame by setting logic on axes(Briefly Explain with suitable examples)
•Concatenating DataFrame using .append() (removed in recent Pandas versions; pd.concat() is the replacement)(Briefly Explain with suitable examples)
•Concatenating DataFrame by ignoring indexes(Briefly Explain with suitable examples)
•Concatenating DataFrame with group keys(Briefly Explain with suitable examples)
•Concatenating with mixed ndims(Briefly Explain with suitable examples)
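A sketch of concatenation and merging, including a common challenge — keys missing on one side (the DataFrames are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'val1': [1, 2]})
df2 = pd.DataFrame({'key': ['a', 'b'], 'val2': [3, 4]})

# Concatenation stacks DataFrames along an axis
stacked = pd.concat([df1, df1], ignore_index=True)   # 4 rows
print(stacked)

# Merging joins DataFrames on a common key column
merged = pd.merge(df1, df2, on='key')                # columns: key, val1, val2
print(merged)

# Challenge: keys missing from one side are dropped by the default inner join
df3 = pd.DataFrame({'key': ['a', 'c'], 'val2': [5, 6]})
print(pd.merge(df1, df3, on='key'))                  # only the 'a' row survives
print(pd.merge(df1, df3, on='key', how='outer'))     # outer join keeps all keys
```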

Explain in detail about Pivot table with suitable examples with an example program.
A pivot table is a data processing tool used in spreadsheet programs and data analysis tools to
summarize and aggregate data based on specific criteria. It allows you to reshape and transform
data, making it easier to analyze and draw insights. Pivot tables are commonly used in spreadsheet
13(B) software like Microsoft Excel and data analysis libraries like Pandas in Python.
Sample Snippet + Output
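One possible example, using a hypothetical sales table summarized with pandas' pivot_table:

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250],
})

# Summarize total sales by Region (rows) and Product (columns)
table = df.pivot_table(values='Sales', index='Region',
                       columns='Product', aggfunc='sum')
print(table)
# Product    A    B
# Region
# North    100  150
# South    200  250
```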
Paraphrase on Visualization with Seaborn using an example program.
Distribution Plots
• distplot (Sample Snippet + Code)
• jointplot (Sample Snippet + Code)
• pairplot (Sample Snippet + Code)
• rugplot (Sample Snippet + Code)
• kdeplot (Sample Snippet + Code)
14(A) Categorical Data Plots
• factorplot (Sample Snippet + Code)
• boxplot (Sample Snippet + Code)
• violinplot (Sample Snippet + Code)
• stripplot (Sample Snippet + Code)
• swarmplot (Sample Snippet + Code)
• barplot (Sample Snippet + Code)
• countplot (Sample Snippet + Code)
Elaborate Visualization with Error using an example program.
Visualizing data with errors is crucial for conveying the uncertainty or variability associated with
measurements. Seaborn, along with Matplotlib, provides tools for creating visually informative plots
with error bars. Let's explore an example using a bar plot with error bars:
Example: Bar Plot with Error Bars
Suppose we have data representing the average scores of students in different subjects along with their
standard deviations:
Since the standard deviations here are precomputed, the example uses Matplotlib's plt.bar, which accepts them directly through yerr (Seaborn's barplot instead estimates error bars from raw observations):
import pandas as pd
import matplotlib.pyplot as plt
# Creating a sample DataFrame
data = {
'Subject': ['Math', 'Science', 'English', 'History'],
'AverageScore': [80, 75, 85, 78],
'StdDev': [5, 3, 6, 4]
}
df = pd.DataFrame(data)
Now, let's create a bar plot with error bars to visualize the average scores and their uncertainties:
14(B)
# Bar plot with error bars
plt.bar(df['Subject'], df['AverageScore'], yerr=df['StdDev'], capsize=4, color='skyblue')
# Adding labels and title
plt.xlabel('Subject')
plt.ylabel('Average Score')
plt.title('Average Scores in Different Subjects with Error Bars')
# Display the plot
plt.show()
In this example:
plt.bar creates the bar plot, with the subject names on the x-axis, the average scores as bar
heights, and yerr supplying the standard deviations for the error bars.
capsize=4 controls the size of the caps at the end of the error bars.
color='skyblue' sets the bar color.
This visualization provides a clear representation of the average scores in different subjects, while the
error bars indicate the uncertainty associated with each average due to the standard deviations.
Viewers can quickly assess both the central tendency and the variability of the data.
Including error bars in visualizations is essential for accurately communicating the precision or
confidence intervals of measurements. Seaborn simplifies the process of creating such informative
plots, allowing for effective data communication and interpretation.
Explain 3d Plotting with example program.
The most basic three-dimensional plot is a line or scatter plot created from sets of (x, y, z) triples. In
analogy with the more common two-dimensional plots discussed earlier, we can create these using
the ax.plot3D and ax.scatter3D functions. The call signature for these is nearly identical to that of their
two-dimensional counterparts.

Analogous to the contour plots we explored in “Density and Contour Plots” on page 241, mplot3d
15(A) contains tools to create three-dimensional relief plots using the same inputs. Like two-dimensional
ax.contour plots, ax.contour3D requires all the input data to be in the form of two-dimensional regular
grids, with the Z data evaluated at each point. Here we’ll show a three-dimensional contour diagram
of a three-dimensional sinusoidal function
Wireframe:
Two other types of three-dimensional plots that work on gridded data are wireframes and surface
plots. These take a grid of values and project it onto the specified three-dimensional surface, and can
make the resulting three-dimensional forms quite easy to visualize.

A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a
colormap to the filled polygons can aid perception of the topology of the surface being visualized

Surface Triangulations
For some applications, the evenly sampled grids required by the preceding routines are overly
restrictive and inconvenient. In these situations, the triangulation-based plots can be very useful.

This leaves a lot to be desired. The function that will help us in this case is ax.plot_trisurf, which creates
a surface by first finding a set of triangles formed between adjacent points. The result is certainly not
as clean as when it is plotted with a grid, but the flexibility of such a triangulation allows for some
really interesting three-dimensional plots.
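A sketch combining a 3D line plot and a 3D contour of a sinusoidal function on one set of axes (wireframes and surfaces use ax.plot_wireframe and ax.plot_surface with the same gridded inputs):

```python
import numpy as np
import matplotlib.pyplot as plt

ax = plt.axes(projection='3d')

# Three-dimensional line from (x, y, z) triples
z = np.linspace(0, 15, 500)
x = np.sin(z)
y = np.cos(z)
ax.plot3D(x, y, z, 'gray')

# Three-dimensional contour of a sinusoidal function on a regular grid
X, Y = np.meshgrid(np.linspace(-5, 5, 40), np.linspace(-5, 5, 40))
Z = np.sin(np.sqrt(X**2 + Y**2))
ax.contour3D(X, Y, Z, 40, cmap='viridis')
plt.show()
```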

Explain Density and Contour Plots with example.


Sometimes it is useful to display three-dimensional data in two dimensions using contours or color-
coded regions. There are three Matplotlib functions that can be helpful for this task: plt.contour for
contour plots, plt.contourf for filled contour plots, and plt.imshow for showing images.

A contour plot can be created with the plt.contour function. It takes three arguments: a grid of x
values, a grid of y values, and a grid of z values. The x and y values represent positions on the plot,
and the z values will be represented by the contour levels. Perhaps the most straightforward way to
prepare such data is to use the np.meshgrid function, which builds two-dimensional grids from one-
dimensional arrays:

A standard line-only contour plot can be created as follows:

15(B)

Notice that by default when a single color is used, negative values are represented by dashed lines,
and positive values by solid lines. Alternatively, you can color-code the lines by specifying a colormap
with the cmap argument. Here, we’ll also specify that we want more lines to be drawn—20 equally
spaced intervals within the data range
Here we chose the RdGy (short for Red-Gray) colormap, which is a good choice for centered data.
Matplotlib has a wide range of colormaps available, which you can easily browse in IPython by doing
a tab completion on the plt.cm module:
plt.cm.
Our plot is looking nicer, but the spaces between the lines may be a bit distracting. We can change this
by switching to a filled contour plot using the plt.contourf() function (notice the f at the end), which
uses largely the same syntax as
plt.contour()
Additionally, we’ll add a plt.colorbar() command, which automatically creates an additional axis with
labeled color information for the plot

The colorbar makes it clear that the black regions are “peaks,” while the red regions are “valleys.” One
potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather than
continuous, which is not always what is desired. You could remedy this by setting the number of
contours to a very high number, but this results in a rather inefficient plot: Matplotlib must render a
new polygon for each step in the level. A better way to handle this is to use the plt.imshow() function,
which interprets a two-dimensional grid of data as an image.

plt.imshow() will automatically adjust the axis aspect ratio to match the input data; you can change
this by setting, for example, plt.axis(aspect='image') to make x and y units match.
Finally, it can sometimes be useful to combine contour plots and image plots. For example, to create
the effect shown in Figure 4-34, we’ll use a partially transparent background image (with transparency
set via the alpha parameter) and over-plot contours with labels on the contours themselves (using the
plt.clabel() function)

The combination of these three functions (plt.contour, plt.contourf, and plt.imshow) gives nearly
limitless possibilities for displaying this sort of three-dimensional data within a two-dimensional plot.
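The workflow described above can be sketched end to end (the function f below is an arbitrary example):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    # An arbitrary function of two variables used for illustration
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

# np.meshgrid builds 2-D coordinate grids from 1-D arrays
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.contour(X, Y, Z, colors='black')    # line-only contour plot
plt.contourf(X, Y, Z, 20, cmap='RdGy')  # filled contours, 20 levels
plt.colorbar()                          # labeled color scale
plt.show()
```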
PART – C
Perform EDA on Flipkart Sales Analysis.
The analysis involves the following steps:
Step 1: Import Python Libraries
Step 2: Reading Dataset
Step 3: Data Reduction
Step 4: Feature Engineering
16(A) Step 5: Creating Features
Step 6: Data Cleaning/Wrangling
Step 7: EDA Exploratory Data Analysis
Step 8: Statistics Summary
Step 9: EDA Univariate Analysis
Step 10: Data Transformation
Step 11: Impute Missing values
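Several of the steps above can be sketched as a minimal pandas skeleton; the DataFrame below is hypothetical, standing in for a dataset read with pd.read_csv(...):

```python
import numpy as np
import pandas as pd

# Hypothetical product data standing in for the real CSV (Step 2)
df = pd.DataFrame({'price': [250.0, 499.0, np.nan, 150.0],
                   'rating': [4.1, 3.8, 4.5, np.nan],
                   'category': ['phone', 'laptop', 'phone', 'audio']})

df.info()                             # dtypes, non-null counts (Steps 3/6)
print(df.describe())                  # statistics summary (Step 8)
print(df['category'].value_counts())  # univariate analysis (Step 9)

# Impute missing numeric values with the column median (Step 11)
df['price'] = df['price'].fillna(df['price'].median())
df['rating'] = df['rating'].fillna(df['rating'].median())
print(df.isnull().sum().sum())        # 0 remaining missing values
```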
Perform EDA on Youtube trending analysis.
The analysis involves the following steps:
Step 1: Import Python Libraries
Step 2: Reading Dataset
Step 3: Data Reduction
Step 4: Feature Engineering
16(B) Step 5: Creating Features
Step 6: Data Cleaning/Wrangling
Step 7: EDA Exploratory Data Analysis
Step 8: Statistics Summary
Step 9: EDA Univariate Analysis
Step 10: Data Transformation
Step 11: Impute Missing values

Prepared by HOD
