Data Science Cat - 1
Python:
Pandas:
Pandas is an open-source library that is made mainly for working with relational or labelled
data both easily and intuitively.
It provides various data structures and operations for manipulating numerical data and time
series. This library is built on top of the NumPy library.
Pandas is fast and offers high performance and productivity for its users.
Pandas
NumPy
Keras
TensorFlow
Scikit-Learn
Eli5
SciPy
PyTorch
LightGBM
Skewness is a measure of the asymmetry of a distribution. A distribution is asymmetrical when its left
and right side are not mirror images.
Right skew (also called positive skew). A right-skewed distribution is longer on the right side
of its peak than on its left.
Left skew (also called negative skew). A left-skewed distribution is longer on the left side of
its peak than on its right.
Zero skew. A zero-skew distribution is symmetrical: its left and right sides are mirror images.
5. What is ANOVA?
Analysis of Variance (ANOVA) is a statistical method used to compare variances across the
means (or averages) of different groups. It is used in a range of scenarios to determine whether
there is any difference between the means of different groups.
The formula for the ANOVA coefficient is F = mean sum of squares between the groups (MSB) /
mean sum of squares of errors (MSE). Therefore F = MSB/MSE.
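As a sketch of this formula, the F ratio can be computed by hand for three hypothetical groups (all numbers below are assumptions for illustration, not data from the text):

```python
from statistics import mean

# Hypothetical example: scores from three groups (assumed for illustration).
groups = [
    [85, 86, 88, 75, 78],
    [91, 92, 93, 85, 87],
    [79, 78, 88, 94, 92],
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations
grand_mean = mean(x for g in groups for x in g)

# Between-group sum of squares and its mean square (MSB).
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
msb = ssb / (k - 1)

# Within-group (error) sum of squares and its mean square (MSE).
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
mse = sse / (n - k)

# ANOVA coefficient: F = MSB / MSE.
f_statistic = msb / mse
print(round(f_statistic, 3))
```

A large F suggests the between-group variation is large relative to the within-group variation, i.e. the group means likely differ.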
PART B
import pandas as pd
# Passing in a dictionary (example data):
data = {'name': ['Arun', 'Bala'], 'mark': [88, 92]}
df = pd.DataFrame(data)
# Reading from a CSV file:
df = pd.read_csv('students.csv')
The Pandas apply() function can be used to apply a function to every value in a column or
row of a DataFrame, transforming that column or row into the resulting values.
def double(x):
    return 2 * x
df.column1 = df.column1.apply(double)
To apply the function across each row rather than each column, pass axis=1 to
DataFrame.apply().
A new column can be added from a list, from a single value, or from an existing column:
df['newColumn'] = [1, 2, 3, 4]
df['newColumn'] = 1
df['newColumn'] = df['oldColumn'] * 5
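The snippets above can be combined into one minimal runnable sketch (the column names and sample data are assumptions for illustration):

```python
import pandas as pd

# Hypothetical student data (names and marks are assumed for illustration).
df = pd.DataFrame({'name': ['Arun', 'Bala', 'Chitra'], 'mark': [40, 45, 50]})

def double(x):
    return 2 * x

# Element-wise apply on a single column.
df['mark'] = df['mark'].apply(double)

# Row-wise apply: with axis=1, the function receives one row at a time.
df['summary'] = df.apply(lambda row: f"{row['name']}: {row['mark']}", axis=1)

# Adding columns from a list, a scalar, and an existing column.
df['section'] = ['A', 'B', 'A']
df['passed'] = True
df['bonus'] = df['mark'] // 10

print(df)
```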
The principle of centrality is used to describe the centre or middle value of the data.
Several measures can be used to show centrality; the common ones are the average (also
called the mean), the median, and the mode. All three summarize the distribution of the sample data.
Mean: This is used to calculate the numerical average of the set of values.
Mode: This shows the most frequently repeated value in a dataset.
Median: This identifies the value in the middle of all the values in the dataset when values
are ranked in order.
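The three measures of centrality can be sketched with Python's standard library (the marks below are assumed example data):

```python
from statistics import mean, median, mode

# Hypothetical sample of exam marks (assumed for illustration).
marks = [55, 60, 60, 70, 75, 80, 95]

print(mean(marks))    # numerical average of the values
print(median(marks))  # middle value when the values are ranked in order
print(mode(marks))    # most frequently repeated value
```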
Dispersion
The dispersion of a sample refers to how spread out the values are around the average (centre).
Standard deviation: This provides a standard way of knowing what is normal, and what is
extra-large or extra-small, helping you understand the spread of the variable around the
mean. It shows how close the values are to the mean.
Variance: This is the square of the standard deviation; like it, variance measures how tightly
or loosely the values are spread around the average.
Range: The range indicates the difference between the largest and the smallest values
thereby showing the distance between the extremes.
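The three dispersion measures can be sketched with the standard library (the sample values are assumptions for illustration):

```python
from statistics import pstdev, pvariance

# Hypothetical sample (assumed for illustration).
values = [4, 8, 6, 5, 3, 7, 9]

spread = max(values) - min(values)   # range: distance between the extremes
var = pvariance(values)              # population variance
sd = pstdev(values)                  # population standard deviation

print(spread, var, sd)
```

Note that the variance is exactly the square of the standard deviation, as the text states.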
The distribution of sample data values has to do with shape, which refers to how the data values are
spread across the range of values in the sample. In simple terms, it describes whether the values are
clustered symmetrically around the average or whether there are more values to one side than the
other.
Histograms: Histograms are similar to bar charts, where each bar represents the frequency of
values in the data falling into various size classes. The difference is that the bars are drawn
without gaps between them, because the x-axis represents a continuous variable.
Tally plots: A tally plot is a kind of data frequency distribution graph that can be used to
represent the values from a dataset.
Skewness and kurtosis put numbers on how central the average is and on how tightly the values
cluster around it.
Skewness: This is a measure of how central the average is to the overall spread of values; it
describes the asymmetry of the distribution.
Kurtosis: This is a measure of how peaked the distribution is; it shows how tightly the values
are clustered around the middle.
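Both measures can be sketched directly from their moment definitions using only the standard library (the samples below are assumptions for illustration):

```python
from statistics import mean, pstdev

def skewness(xs):
    # Third standardized moment: positive -> long right tail, negative -> long left tail.
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def kurtosis(xs):
    # Fourth standardized moment; a normal distribution scores 3 (mesokurtic).
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4)

# Hypothetical right-skewed sample: one large value stretches the right tail.
right_skewed = [1, 1, 2, 2, 3, 9]
print(skewness(right_skewed))  # positive value
```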
3. Describe the pivot table in Data science.
A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows
of data in a spreadsheet or database table to obtain a desired report.
The tool does not actually change the spreadsheet or database itself, it simply “pivots” or
turns the data to view it from different perspectives.
Pivot tables are especially useful with large amounts of data that would be time-consuming
to calculate by hand.
A few data processing functions a pivot table can perform include identifying sums, averages,
ranges or outliers.
The table then arranges this information in a simple, meaningful layout that draws attention
to key values.
When users create a pivot table, there are four main components:
Columns- When a field is chosen for the column area, only the unique values of the field are
listed across the top.
Rows- When a field is chosen for the row area, it populates as the first column. Similar to the
columns, all row labels are the unique values and duplicates are removed.
Values- Each value is kept in a pivot table cell and displays the summarized information. The
most common values are sum, average, minimum and maximum.
Filters- Filters apply a calculation or restriction to the entire table.
A pivot table helps users answer business questions with minimal effort.
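The four components above (rows, columns, values, and an aggregation) can be sketched with pandas.pivot_table (the sales data, column names, and aggregation are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sales records (all names and values are assumed for illustration).
sales = pd.DataFrame({
    'region':  ['North', 'North', 'South', 'South', 'South'],
    'product': ['Pen', 'Book', 'Pen', 'Book', 'Pen'],
    'amount':  [100, 200, 150, 250, 50],
})

# Rows: unique regions; Columns: unique products; Values: sum of amount.
report = pd.pivot_table(sales, index='region', columns='product',
                        values='amount', aggfunc='sum')
print(report)
```

The original sales table is unchanged; the pivot table is a separate summarized view of it.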
ANOVA Terminology
Dependent variable: This is the item being measured that is theorized to be affected by the
independent variables.
Independent variable/s: These are the items being measured that may have an effect on the
dependent variable.
A null hypothesis (H0): This states that there is no difference between the groups or means.
Depending on the result of the ANOVA test, the null hypothesis will either be rejected or not
rejected.
An alternative hypothesis (H1): This states that there is a difference between the groups or
means.
Factors and levels: In ANOVA terminology, an independent variable is called a factor which
affects the dependent variable. Level denotes the different values of the independent
variable that are used in an experiment.
Fixed-factor model: Some experiments use only a discrete set of levels for factors. For
example, a fixed-factor test would be testing three different dosages of a drug and not
looking at any other dosages.
Random-factor model: This model draws a random value of level from all the possible values
of the independent variable.
The one-way ANOVA is suitable for experiments with only one independent variable (factor) with
two or more levels. A one-way ANOVA assumes:
Independence: The value of the dependent variable for one observation is independent of
the value of any other observations.
Normality: The value of the dependent variable is normally distributed.
Variance: The variance is comparable in different experiment groups.
Continuous: The dependent variable is continuous and can be measured on a scale which can
be subdivided.
A two-way ANOVA is used when there are two independent variables; it measures not only each
factor's effect on the dependent variable but also whether the two factors affect each other. A
two-way ANOVA assumes:
Continuous: The same as a one-way ANOVA, the dependent variable should be continuous.
Independence: Each sample is independent of other samples, with no crossover.
Variance: The variance in data across the different groups is the same.
Normality: The samples are representative of a normal population.
Categories: The independent variables should be in separate categories or groups.
PART C
TensorFlow
NumPy
SciPy
Pandas
Matplotlib
Keras
SciKit-Learn
PyTorch
Scrapy
BeautifulSoup
LightGBM
ELI5
Theano
NuPIC
Ramp
Pipenv
Bob
PyBrain
Caffe2
Chainer
A) TensorFlow
The first in the list of python libraries for data science is TensorFlow. TensorFlow is a library
for high-performance numerical computations with around 35,000 comments and a vibrant
community of around 1,500 contributors. It’s used across various scientific fields.
C) SciPy
Features:
Collection of algorithms and functions built on the NumPy extension of Python
High-level commands for data manipulation and visualization
Multidimensional image processing with the SciPy ndimage submodule
Includes built-in functions for solving differential equations.
D) Pandas
Features:
Eloquent syntax and rich functionality that give you the freedom to deal with missing data
Enables you to create your own function and run it across a series of data
High-level abstraction
Contains high-level data structures and manipulation tools
E) Matplotlib
Features:
Usable as a MATLAB replacement, with the advantage of being free and open source
Supports dozens of backends and output types, which means you can use it regardless of
which operating system you’re using or which output format you wish to use
Its pyplot module provides a familiar MATLAB-like plotting interface
Low memory consumption and better runtime behaviour
F) Keras
Features:
Keras provides a vast collection of prelabelled datasets that can be imported and loaded directly.
It contains various implemented layers and parameters that can be used for construction,
configuration, training, and evaluation of neural networks.
G) Scikit-Learn
Applications:
clustering
classification
regression
model selection
dimensionality reduction
H) PyTorch
PyTorch is a Python-based scientific computing package that uses the power of graphics
processing units. It is one of the most preferred deep learning research platforms, built to
provide maximum flexibility and speed.
I) Scrapy
Applications:
Scrapy helps in building crawling programs (spider bots) that can retrieve structured data
from the web
Scrapy is also used to gather data from APIs. It follows a 'Don't Repeat Yourself' principle in
the design of its interface, encouraging users to write general code that can be reused for
building and scaling large crawlers.
J) LightGBM
The LightGBM Python library is a popular tool for implementing gradient-boosting algorithms
in data science projects. It provides a high-performance implementation of gradient boosting
that can handle large datasets and high-dimensional feature spaces.
Features:
The LightGBM Python library is easy to integrate with other Python libraries, such as Pandas,
Scikit-Learn, and XGBoost.
LightGBM is designed to be fast and memory-efficient, making it suitable for large-scale
datasets and high-dimensional feature spaces.
The LightGBM Python library provides a wide range of hyperparameters that can be
customised to optimise model performance for specific datasets and use cases.
Applications:
Anomaly detection
Time series analysis
Natural language processing
Classification
2. Explain data distribution in Data Science in detail.
Skewness is a measure of the symmetry or asymmetry of a data distribution, and kurtosis measures
whether the data are heavy-tailed or light-tailed relative to a normal distribution.
Data can be positively skewed (the long tail extends towards the right side) or negatively skewed
(the long tail extends towards the left side).
What are the three types of skewness?
A right-skewed distribution is longer on the right side of its peak than on its left.
A left-skewed distribution is longer on the left side of its peak than on its right.
Zero skew. A zero-skew distribution is symmetrical about its peak.
Positively Skewed:
In a positively skewed distribution, the values are more concentrated towards the left side,
and the right tail is spread out.
Hence, the mean is pulled towards the right of the peak, giving the order mode < median < mean,
and the skewness coefficient is always positive.
Negatively Skewed:
In a negatively skewed distribution, the data points are more concentrated towards the right-hand
side, and the left tail is spread out.
This pulls the mean towards the left, giving the order mean < median < mode, and the skewness
coefficient is always negative.
A random variable is a variable whose value depends on the outcome of a random event.
For example, flipping a coin gives you either heads or tails at random; you cannot determine
with absolute certainty whether the next outcome will be heads or tails.
Kurtosis:
Kurtosis is also a characteristic of the frequency distribution: it gives an idea of the shape of a
frequency distribution. Basically, kurtosis measures the extent to which a frequency distribution
is peaked in comparison with the normal curve; it is the degree of peakedness of a distribution.
Types of kurtosis: Kurtosis is classified into three types:
Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve,
there is a heavy concentration of items near the central value.
Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this curve, items are
distributed around the central value as in a normal distribution.
Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is
less concentration of items around the central value.
To calculate a skewness value with Pearson's first coefficient, subtract the mode from the mean,
and then divide the difference by the standard deviation: Sk1 = (mean - mode) / standard deviation.
Pearson's second coefficient uses the median instead: multiply the difference between the mean
and the median by 3, and divide the product by the standard deviation:
Sk2 = 3 * (mean - median) / standard deviation.
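Both Pearson coefficients can be sketched in a few lines of standard-library Python (the sample below is assumed for illustration):

```python
from statistics import mean, median, mode, pstdev

# Hypothetical right-skewed sample (assumed for illustration).
data = [2, 3, 3, 4, 4, 4, 5, 6, 9]

# Pearson's first coefficient: (mean - mode) / standard deviation.
sk1 = (mean(data) - mode(data)) / pstdev(data)

# Pearson's second coefficient: 3 * (mean - median) / standard deviation.
sk2 = 3 * (mean(data) - median(data)) / pstdev(data)

print(sk1, sk2)
```

For this sample both coefficients come out positive, consistent with the long right tail at 9.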
Skewness vs. Kurtosis:
Skewness indicates the shape and size of variation on either side of the central value, whereas
kurtosis indicates the frequencies of the distribution at the central value.
Skewness indicates how far the distribution differs from the normal distribution, whereas
kurtosis studies the divergence of the given distribution from the normal distribution.