Data Science Cat - 1
Python:
Pandas:
Pandas is an open-source library that is made mainly for working with relational or labelled
data both easily and intuitively.
It provides various data structures and operations for manipulating numerical data and time
series. This library is built on top of the NumPy library.
Pandas is fast and offers high performance and productivity for its users.
Pandas
NumPy
Keras
TensorFlow
Scikit-Learn
Eli5
SciPy
PyTorch
LightGBM
Skewness is a measure of the asymmetry of a distribution. A distribution is asymmetrical when its left
and right side are not mirror images.
Right skew (also called positive skew). A right-skewed distribution is longer on the right side
of its peak than on its left.
Left skew (also called negative skew). A left-skewed distribution is longer on the left side of
its peak than on its right.
Zero skew. A zero-skew distribution is symmetrical: its left and right sides are mirror images.
5. What is ANOVA?
Analysis of Variance (ANOVA) is a statistical method used to compare variances across the
means (or averages) of different groups. It is used in a range of scenarios to determine whether
there is any difference between the means of different groups.
The formula for the ANOVA coefficient is F = mean sum of squares between the groups (MSB) /
mean sum of squares of errors (MSE). Therefore F = MSB/MSE.
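As a sketch of this formula, the F ratio can be computed by hand for three hypothetical groups (all numbers below are assumptions for illustration, not data from the text):

```python
from statistics import mean

# Hypothetical example: scores from three groups (assumed for illustration).
groups = [
    [85, 86, 88, 75, 78],
    [91, 92, 93, 85, 87],
    [79, 78, 88, 94, 92],
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations
grand_mean = mean(x for g in groups for x in g)

# Between-group sum of squares and its mean square (MSB).
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
msb = ssb / (k - 1)

# Within-group (error) sum of squares and its mean square (MSE).
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
mse = sse / (n - k)

# ANOVA coefficient: F = MSB / MSE.
f_statistic = msb / mse
print(round(f_statistic, 3))
```

A large F suggests the between-group variation is large relative to the within-group variation, i.e. the group means likely differ.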
PART B
import pandas as pd
# Passing in a dictionary (example data):
data = {'name': ['Arun', 'Bala'], 'mark': [88, 92]}
df = pd.DataFrame(data)
# Reading from a CSV file:
df = pd.read_csv('students.csv')
The Pandas apply() function can be used to apply a function to every value in a column or
row of a DataFrame, transforming that column or row into the resulting values.
def double(x):
    return 2 * x
df.column1 = df.column1.apply(double)
To apply the function across each row rather than each column, pass axis=1 to
DataFrame.apply().
A new column can be added from a list, from a single value, or from an existing column:
df['newColumn'] = [1, 2, 3, 4]
df['newColumn'] = 1
df['newColumn'] = df['oldColumn'] * 5
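The snippets above can be combined into one minimal runnable sketch (the column names and sample data are assumptions for illustration):

```python
import pandas as pd

# Hypothetical student data (names and marks are assumed for illustration).
df = pd.DataFrame({'name': ['Arun', 'Bala', 'Chitra'], 'mark': [40, 45, 50]})

def double(x):
    return 2 * x

# Element-wise apply on a single column.
df['mark'] = df['mark'].apply(double)

# Row-wise apply: with axis=1, the function receives one row at a time.
df['summary'] = df.apply(lambda row: f"{row['name']}: {row['mark']}", axis=1)

# Adding columns from a list, a scalar, and an existing column.
df['section'] = ['A', 'B', 'A']
df['passed'] = True
df['bonus'] = df['mark'] // 10

print(df)
```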
The principle of centrality is used to describe the centre or middle value of the data.
Several measures can be used to show centrality; the common ones are the average (also
called the mean), the median, and the mode. All three summarize the distribution of the sample data.
Mean: This is used to calculate the numerical average of the set of values.
Mode: This shows the most frequently repeated value in a dataset.
Median: This identifies the value in the middle of all the values in the dataset when values
are ranked in order.
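The three measures of centrality can be sketched with Python's standard library (the marks below are assumed example data):

```python
from statistics import mean, median, mode

# Hypothetical sample of exam marks (assumed for illustration).
marks = [55, 60, 60, 70, 75, 80, 95]

print(mean(marks))    # numerical average of the values
print(median(marks))  # middle value when the values are ranked in order
print(mode(marks))    # most frequently repeated value
```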
Dispersion
The dispersion of a sample refers to how spread out the values are around the average (centre).
Standard deviation: This provides a standard way of knowing what is normal, and what is
extra-large or extra-small, helping you understand the spread of the variable around the
mean. It shows how close the values are to the mean.
Variance: This is the square of the standard deviation; like it, variance measures how tightly
or loosely the values are spread around the average.
Range: The range indicates the difference between the largest and the smallest values
thereby showing the distance between the extremes.
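The three dispersion measures can be sketched with the standard library (the sample values are assumptions for illustration):

```python
from statistics import pstdev, pvariance

# Hypothetical sample (assumed for illustration).
values = [4, 8, 6, 5, 3, 7, 9]

spread = max(values) - min(values)   # range: distance between the extremes
var = pvariance(values)              # population variance
sd = pstdev(values)                  # population standard deviation

print(spread, var, sd)
```

Note that the variance is exactly the square of the standard deviation, as the text states.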
The distribution of sample data values has to do with shape, which refers to how the data values are
spread across the range of values in the sample. In simple terms, it describes whether the values are
clustered symmetrically around the average or whether there are more values to one side than the
other.
Histograms: Histograms are similar to bar charts, where each bar represents the frequency of
values in the data falling into various size classes. The difference is that the bars are drawn
without gaps between them, because the x-axis represents a continuous variable.
Tally plots: A tally plot is a kind of data frequency distribution graph that can be used to
represent the values from a dataset.
Skewness and kurtosis put numbers on how central the average is and on how tightly the values
cluster around it.
Skewness: This is a measure of how central the average is to the overall spread of values; it
describes the asymmetry of the distribution.
Kurtosis: This is a measure of how peaked the distribution is; it shows how tightly the values
are clustered around the middle.
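Both measures can be sketched directly from their moment definitions using only the standard library (the samples below are assumptions for illustration):

```python
from statistics import mean, pstdev

def skewness(xs):
    # Third standardized moment: positive -> long right tail, negative -> long left tail.
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def kurtosis(xs):
    # Fourth standardized moment; a normal distribution scores 3 (mesokurtic).
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4)

# Hypothetical right-skewed sample: one large value stretches the right tail.
right_skewed = [1, 1, 2, 2, 3, 9]
print(skewness(right_skewed))  # positive value
```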
3. Describe the pivot table in Data science.
A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows
of data in a spreadsheet or database table to obtain a desired report.
The tool does not actually change the spreadsheet or database itself, it simply “pivots” or
turns the data to view it from different perspectives.
Pivot tables are especially useful with large amounts of data that would be time-consuming
to calculate by hand.
A few data processing functions a pivot table can perform include identifying sums, averages,
ranges or outliers.
The table then arranges this information in a simple, meaningful layout that draws attention
to key values.
When users create a pivot table, there are four main components:
Columns- When a field is chosen for the column area, only the unique values of the field are
listed across the top.
Rows- When a field is chosen for the row area, it populates as the first column. Similar to the
columns, all row labels are the unique values and duplicates are removed.
Values- Each value is kept in a pivot table cell and displays the summarized information. The
most common values are sum, average, minimum and maximum.
Filters- Filters apply a calculation or restriction to the entire table.
A pivot table helps users answer business questions with minimal effort.
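The four components above (rows, columns, values, and an aggregation) can be sketched with pandas.pivot_table (the sales data, column names, and aggregation are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sales records (all names and values are assumed for illustration).
sales = pd.DataFrame({
    'region':  ['North', 'North', 'South', 'South', 'South'],
    'product': ['Pen', 'Book', 'Pen', 'Book', 'Pen'],
    'amount':  [100, 200, 150, 250, 50],
})

# Rows: unique regions; Columns: unique products; Values: sum of amount.
report = pd.pivot_table(sales, index='region', columns='product',
                        values='amount', aggfunc='sum')
print(report)
```

The original sales table is unchanged; the pivot table is a separate summarized view of it.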
ANOVA Terminology
Dependent variable: This is the item being measured that is theorized to be affected by the
independent variables.
Independent variable/s: These are the items being measured that may have an effect on the
dependent variable.
A null hypothesis (H0): This states that there is no difference between the groups or means.
Depending on the result of the ANOVA test, the null hypothesis will either be rejected or not
rejected.
An alternative hypothesis (H1): This states that there is a difference between the groups or
means.
Factors and levels: In ANOVA terminology, an independent variable is called a factor which
affects the dependent variable. Level denotes the different values of the independent
variable that are used in an experiment.
Fixed-factor model: Some experiments use only a discrete set of levels for factors. For
example, a fixed-factor test would be testing three different dosages of a drug and not
looking at any other dosages.
Random-factor model: This model draws a random value of level from all the possible values
of the independent variable.
The one-way ANOVA is suitable for experiments with only one independent variable (factor) with
two or more levels. A one-way ANOVA assumes:
Independence: The value of the dependent variable for one observation is independent of
the value of any other observations.
Normality: The value of the dependent variable is normally distributed.
Variance: The variance is comparable in different experiment groups.
Continuous: The dependent variable is continuous and can be measured on a scale which can
be subdivided.
A two-way ANOVA is used when there are two independent variables; it measures not only each
factor's effect on the dependent variable but also whether the two factors affect each other. A
two-way ANOVA assumes:
Continuous: The same as a one-way ANOVA, the dependent variable should be continuous.
Independence: Each sample is independent of other samples, with no crossover.
Variance: The variance in data across the different groups is the same.
Normality: The samples are representative of a normal population.
Categories: The independent variables should be in separate categories or groups.
PART C
TensorFlow
NumPy
SciPy
Pandas
Matplotlib
Keras
SciKit-Learn
PyTorch
Scrapy
BeautifulSoup
LightGBM
ELI5
Theano
NuPIC
Ramp
Pipenv
Bob
PyBrain
Caffe2
Chainer
A) TensorFlow
The first in the list of python libraries for data science is TensorFlow. TensorFlow is a library
for high-performance numerical computations with around 35,000 comments and a vibrant
community of around 1,500 contributors. It’s used across various scientific fields.
C) SciPy
Features:
Collection of algorithms and functions built on the NumPy extension of Python
High-level commands for data manipulation and visualization
Multidimensional image processing with the SciPy ndimage submodule
Includes built-in functions for solving differential equations.
D) Pandas
Features:
Eloquent syntax and rich functionality that give you the freedom to deal with missing data
Enables you to create your own function and run it across a series of data
High-level abstraction
Contains high-level data structures and manipulation tools
E) Matplotlib
Features:
Usable as a MATLAB replacement, with the advantage of being free and open source
Supports dozens of backends and output types, which means you can use it regardless of
which operating system you’re using or which output format you wish to use
Its pyplot module provides a familiar MATLAB-like plotting interface
Low memory consumption and better runtime behaviour
F) Keras
Features:
Keras provides a vast collection of prelabelled datasets that can be imported and loaded directly.
It contains various implemented layers and parameters that can be used for construction,
configuration, training, and evaluation of neural networks.
G) Scikit-Learn
Applications:
clustering
classification
regression
model selection
dimensionality reduction
H) PyTorch
PyTorch is a Python-based scientific computing package that uses the power of graphics
processing units. It is one of the most preferred deep learning research platforms, built to
provide maximum flexibility and speed.
I) Scrapy
Applications:
Scrapy helps in building crawling programs (spider bots) that can retrieve structured data
from the web
Scrapy is also used to gather data from APIs. It follows a 'Don't Repeat Yourself' principle in
the design of its interface, encouraging users to write general code that can be reused for
building and scaling large crawlers.
J) LightGBM
The LightGBM Python library is a popular tool for implementing gradient-boosting algorithms
in data science projects. It provides a high-performance implementation of gradient boosting
that can handle large datasets and high-dimensional feature spaces.
Features:
The LightGBM Python library is easy to integrate with other Python libraries, such as Pandas,
Scikit-Learn, and XGBoost.
LightGBM is designed to be fast and memory-efficient, making it suitable for large-scale
datasets and high-dimensional feature spaces.
The LightGBM Python library provides a wide range of hyperparameters that can be
customised to optimise model performance for specific datasets and use cases.
Applications:
Anomaly detection
Time series analysis
Natural language processing
Classification
2. Explain data distribution in Data Science in detail.
Skewness is a measure of the symmetry or asymmetry of a data distribution, and kurtosis measures
whether the data are heavy-tailed or light-tailed relative to a normal distribution.
Data can be positively skewed (the long tail extends towards the right side) or negatively skewed
(the long tail extends towards the left side).
What are the three types of skewness?
A right-skewed distribution is longer on the right side of its peak than on its left.
A left-skewed distribution is longer on the left side of its peak than on its right.
Zero skew. A zero-skew distribution is symmetrical about its peak.
Positively Skewed:
In a positively skewed distribution, the values are more concentrated towards the left side,
and the right tail is spread out.
Hence, the mean is pulled towards the right of the peak, giving the order mode < median < mean,
and the skewness coefficient is always positive.
Negatively Skewed:
In a negatively skewed distribution, the data points are more concentrated towards the right-hand
side, and the left tail is spread out.
This pulls the mean towards the left, giving the order mean < median < mode, and the skewness
coefficient is always negative.
A random variable is a variable whose value depends on the outcome of a random event.
For example, flipping a coin gives you either heads or tails at random; you cannot determine
with absolute certainty whether the next outcome will be heads or tails.
Kurtosis:
Kurtosis is also a characteristic of the frequency distribution: it gives an idea of the shape of a
frequency distribution. Basically, kurtosis measures the extent to which a frequency distribution
is peaked in comparison with the normal curve; it is the degree of peakedness of a distribution.
Types of kurtosis: Kurtosis is classified into three types:
Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve,
there is a heavy concentration of items near the central value.
Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this curve, items are
distributed around the central value as in a normal distribution.
Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is
less concentration of items around the central value.
To calculate a skewness value with Pearson's first coefficient, subtract the mode from the mean,
and then divide the difference by the standard deviation: Sk1 = (mean - mode) / standard deviation.
Pearson's second coefficient uses the median instead: multiply the difference between the mean
and the median by 3, and divide the product by the standard deviation:
Sk2 = 3 * (mean - median) / standard deviation.
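Both Pearson coefficients can be sketched in a few lines of standard-library Python (the sample below is assumed for illustration):

```python
from statistics import mean, median, mode, pstdev

# Hypothetical right-skewed sample (assumed for illustration).
data = [2, 3, 3, 4, 4, 4, 5, 6, 9]

# Pearson's first coefficient: (mean - mode) / standard deviation.
sk1 = (mean(data) - mode(data)) / pstdev(data)

# Pearson's second coefficient: 3 * (mean - median) / standard deviation.
sk2 = 3 * (mean(data) - median(data)) / pstdev(data)

print(sk1, sk2)
```

For this sample both coefficients come out positive, consistent with the long right tail at 9.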
Skewness vs. Kurtosis:
Skewness indicates the shape and size of variation on either side of the central value, whereas
kurtosis indicates the frequencies of the distribution at the central value.
Skewness indicates how far the distribution differs from the normal distribution, whereas
kurtosis studies the divergence of the given distribution from the normal distribution.