What Exactly Is Data Science

Data Science combines Statistics, Machine Learning, and Data Analysis.

When we have a large amount of data, we need to analyze it before we can draw accurate
conclusions and make sound decisions.

Data Science is closely tied to the field of big data, which seeks to extract insight
from large amounts of complex data.

It uses a variety of tools, techniques, and principles to organize large amounts of data
into suitable models.

Data Science Life Cycle:

1) Data Discovery and Business Understanding

First, we have to understand the business problem and gather the relevant data, which can
be structured or unstructured.

2) Data Acquisition/Data Preparation

We need to convert the data into a common format.

3) Modeling/Mathematical Models

In this stage, variables and equations are used to model relationships in the data and
produce accurate results. Statistics plays an important role here.

4) Deploy

Once the results are correct and accurate, they can be deployed.

Data Science Components:

1) Data

Raw data is filtered to keep the part that is useful; this useful data is the core of
Data Science.

The data can be of 2 types:

1) Structured

Structured data is in tabular form.

2) Unstructured

Data such as images, videos, PDFs, etc. is unstructured.

2) Programming (Python and R)

Programming languages such as Python and R are used in data science to manage and analyze
the data.

3) Statistics

Statistics is a very powerful tool for working with data: with the help of mathematics, we
can extract useful information from raw data.

What is Statistics and what role does it play?

Statistics can be defined as a methodology for data collection, analysis, interpretation,
and presentation.

It is a mathematical science that applies statistical methods or algorithms to a set of
data to obtain values that can be used to solve real-life problems.

To solve real problems in various industries, statistics uses methods such as frequency
analysis, mean, median, mode, variance analysis, correlation, and regression, and it
focuses on analysis using standard techniques based on mathematical formulas and methods.

How statistical analysis plays an important role:

▪ It presents key findings revealed by a dataset.
▪ It summarizes information.
▪ It computes measures of the data using mathematical methods.
▪ It makes future predictions based on previously recorded data.
▪ It tests experimental predictions.

In simple words, Statistics can be used to derive meaningful insights from data by
performing mathematical computations on it. To become a successful Data Scientist, one
must have a strong command of Statistics.

Application Areas:

▪ Stock markets, commerce, and trade
▪ Retail, education, and insurance
▪ Psychology and astronomy
▪ Life science and weather

Terminologies in Statistics – Statistics for Data Science

To work with statistics, one must be familiar with its terminology. A few key statistical
terms are stated below:

1) Population
The complete set of sources from which data is to be collected.

2) Sample
A subset of the population.

3) Variable
Any characteristic, number, or quantity that can be counted or measured.

4) Statistical Parameter
A quantity that indexes a family of probability distributions. Examples: mean, median,
mode, etc.
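
The four terms above can be tied together in a small sketch. The population here is synthetic (randomly generated ages, an assumption made purely for illustration), and the parameter of interest is taken to be the mean.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 customers (synthetic values).
rng = random.Random(42)
population = [rng.randint(18, 70) for _ in range(1000)]

# A sample is a subset of the population, drawn here without replacement.
sample = rng.sample(population, k=50)

# The variable being measured is "age"; the parameter of interest is the mean.
population_mean = statistics.mean(population)  # parameter (usually unknown in practice)
sample_mean = statistics.mean(sample)          # statistic used to estimate the parameter

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice the population is rarely available in full, which is why the sample statistic is used as an estimate.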

Types of Analysis:

There are two types of analysis:

1) Quantitative
In this type of analysis, data is collected and interpreted as numbers, charts, graphs,
etc. in order to identify patterns.

2) Qualitative
This type of analysis gives the user non-numerical information such as text, sound, etc.

For example, suppose I want to purchase a burger meal from McDonald’s, which is available
in small, medium, and large. Describing the meal by its size category is an example of
Qualitative Analysis.

But if we record that McDonald’s stores sell 50 large regular-burger meals in a week, that
is Quantitative Analysis.

Categories in Statistics

There are mainly 2 categories in Statistics:

1) Descriptive Statistics
2) Inferential Statistics

1) Descriptive Statistics:
It uses the data to describe the population in numerical, graphical, and tabular form.
When we represent data using graphs, lines, histograms, etc., the data is summarized
around its central tendency. Measures of central tendency (mean, median, mode) and
measures of spread are used for statistical analysis.

Mean: The mean is equal to the sum of all the values in the data set divided by the number
of values in the data set, i.e., the calculated average.

Median: If a set of values is arranged in ascending (or descending) order of magnitude,
the middlemost value is called the median of the series.

Mode: The mode is defined as the value of the variable that occurs most frequently in the
set of observations.
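
As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The observations are made up for illustration.

```python
import statistics

values = [4, 8, 6, 5, 3, 8, 9]  # made-up observations

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value after sorting: 3 4 5 6 8 8 9
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```

Here the median is 6 and the mode is 8, while the mean is 43 / 7, roughly 6.14.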

2)Inferential Statistics:
It makes inferences and predictions about a population based on a sample of data taken
from the population in question.
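
One common form of inference is an interval estimate for the population mean. The sketch below uses a normal approximation and hypothetical sample values; for a sample this small, a t-distribution would usually be preferred, so treat it as an illustration of the idea rather than a recommended procedure.

```python
import math
import statistics

sample = [52, 48, 55, 50, 47, 53, 49, 51, 54, 46]  # hypothetical sample

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
standard_error = sample_sd / math.sqrt(n)

# Two-sided 95% critical value of the standard normal distribution (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

low = sample_mean - z * standard_error
high = sample_mean + z * standard_error
print(f"estimated population mean: {sample_mean} (approx. 95% CI {low:.2f} to {high:.2f})")
```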

So, the role of a Data Scientist is very important in any industry, because it helps the
business make proper use of its data.

What exactly is Data Science?

Data science is a field of study that utilizes cutting-edge tools and techniques to uncover
hidden patterns and trends, thereby generating valuable insights that can be used to make
more informed business decisions. It also encompasses predictive analytics, in which data
scientists employ a variety of machine learning or statistical algorithms.

The Data Science Lifecycle

To comprehend the role that statistics plays in data science, you must first have a thorough
understanding of the data science lifecycle. There are several perspectives on the lifecycle,
but I use a simplified one. It consists of the five stages listed below.
Terminologies associated with statistics

• Population: The entire pool of data from which a statistical sample is drawn. It can be
viewed as a complete data set of items that are similar in nature.
• Sample: A subset of the population, i.e., the part of the population that has been
collected for analysis.
• Variable: A characteristic, such as a quantity, that can be measured. An individual value
of it is also called a data point or a data item.
• Distribution: The way the sample data is spread over a specific range of values.
• Parameter: A value that describes an attribute of a complete data set (also known as the
‘population’). Examples: average, percentage.
• Quantitative analysis: Deals with specific numerical characteristics of the data,
summarizing it through measures such as its mean, variance, and so on.
• Qualitative analysis: Deals with generic information about the type of data and how
clean or structured it is.

Why Statistics

Statistics is essential in everyday life as well as professional life. It helps you analyze
the data given to you and make decisions based on it.

• Statistical knowledge makes it easier to read pie charts, bar graphs, etc., aids in data
comprehension, and ultimately improves your ability to present data in a way that allows
both you and others to draw conclusions.

• It enables you to see trends in data easily, analyze the data effectively, and reach
better and more accurate conclusions.

• In ML, statistical knowledge allows you to fully understand how effective your models are
based on their evaluation. Without it, you simply cannot understand R² or any other
performance metric.

How does analysing data using statistics help gain deep insights into data?

Statistics serves as a foundation for dealing with data and its analysis in data science.
There are certain core concepts and basics that need to be thoroughly understood
before jumping into advanced algorithms.

Not everyone understands the performance metrics of machine learning algorithms, such as
F-score, recall, precision, accuracy, root mean squared error, and so on. For the
layperson, a visual representation of the data and of the algorithm's performance on it
is often an easier way to understand the same information.

Visual representation also helps identify outliers and specific patterns, and summary
metrics such as the mean, median, and variance help in understanding the middlemost
value and how an outlier affects the rest of the data.
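
The effect of an outlier on summary metrics can be seen in a short sketch. The salary figures are hypothetical.

```python
import statistics

salaries = [30, 32, 35, 31, 33]   # hypothetical salaries, in thousands
with_outlier = salaries + [300]   # the same data plus one extreme value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean={mean_before:.1f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

The single extreme value more than doubles the mean, while the median moves only from 32 to 32.5, which is why the two are often reported together.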

Statistical Data Analysis

Statistical data analysis involves the use of statistical tools that require knowledge
of statistics. Software can help with this, but without understanding why something is
happening, it is impossible to get meaningful work done in statistics and data science.

Statistics deals with data that is either univariate or multivariate. Univariate data,
as the name suggests, involves a single variable, whereas multivariate data involves
multiple variables. Discriminant analysis and factor analysis can be performed on
multivariate data. On the other hand, univariate analyses such as the Z-test and F-test
can be performed when we are dealing with univariate data.
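
As an illustration of a univariate test, here is a minimal sketch of a two-sided one-sample Z-test. The function name `one_sample_z_test`, the measurements, the hypothesized mean, and the assumed known population standard deviation are all hypothetical.

```python
import math
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma):
    """Two-sided one-sample Z-test; sigma is the population SD, assumed known."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

weights = [102, 98, 101, 105, 99, 103, 100, 104]  # hypothetical measurements
z, p = one_sample_z_test(weights, mu0=100, sigma=3)
print(f"z = {z:.3f}, p = {p:.3f}")
```

A small p-value would suggest the sample mean differs from `mu0`; with these made-up numbers the difference is not significant at the usual 5% level.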

Data associated with statistics is of many types. Some of them are discussed below.

Categorical data represents characteristics of people, such as marital status, gender,
food they like, and so on. It is also known as ‘qualitative data’, and categorical data
with only two categories is sometimes called ‘yes/no data’. It may be coded with
numerical values like ‘1’, ‘2’, where each number indicates one type of characteristic.
These numbers are not mathematically significant, which means arithmetic on them is not
meaningful.
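
A short sketch of this coding idea follows. The category values and the code mapping are made up, and the integer codes are arbitrary labels.

```python
from collections import Counter

marital_status = ["single", "married", "single", "divorced", "married", "single"]

# Arbitrary integer codes: they label categories and carry no order or magnitude.
codes = {"single": 1, "married": 2, "divorced": 3}
encoded = [codes[status] for status in marital_status]

# Counting how often each category occurs is meaningful;
# averaging the codes (e.g. "1.67") would not describe anything real.
counts = Counter(marital_status)
print(encoded)
print(counts.most_common())
```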

Continuous data is data that is measured rather than counted and can take any value
within a range. Predictions from a linear regression are continuous in nature. The
distribution of continuous data is described by a probability density function.

On the other hand, discrete values can be counted and are discontinuous. The class labels
predicted by logistic regression are considered to be discrete in nature. Since discrete
data is non-continuous, the concept of density doesn’t come into the picture here; its
distribution is described by a probability mass function.
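
The difference between a probability mass function and a probability density function can be sketched as follows. The binomial and normal distributions, and their parameters, are chosen only as familiar examples.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials (discrete)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Discrete: each countable outcome has its own probability, and they sum to 1.
pmf = [binomial_pmf(k, n=10, p=0.5) for k in range(11)]
total_probability = sum(pmf)

# Continuous: the density at a point is not a probability;
# probabilities come from areas under the curve (differences of the CDF).
height = NormalDist(mu=170, sigma=10)   # hypothetical heights in cm
density_at_170 = height.pdf(170)
prob_160_to_180 = height.cdf(180) - height.cdf(160)

print(total_probability, density_at_170, prob_160_to_180)
```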

Descriptive Statistics

Descriptive statistics is a branch of statistics that is concerned with describing the
characteristics of the known data. Descriptive statistics provides summaries
about either the population data or the sample data. Apart from descriptive
statistics, inferential statistics is another crucial branch of statistics that is used to
make inferences about the population data.

Descriptive statistics can be broadly classified into two categories: measures of
central tendency and measures of dispersion. In this article, we will learn more
about descriptive statistics, its various types, formulas, and see associated
examples.

What are Descriptive Statistics?

Descriptive statistics are used to quantitatively or visually summarize the features
of a sample. By using certain tools, data from a sample can be analyzed to catch
certain trends or patterns followed by it. It helps to organize the data in a more
manageable and readable format.

Descriptive Statistics Definition

Descriptive statistics can be defined as a field of statistics that is used to
summarize the characteristics of a sample by utilizing certain quantitative
techniques. It helps to provide simple and precise summaries of the sample and
the observations using measures like mean, median, variance, graphs, and
charts. Univariate descriptive statistics are used to describe data containing only
one variable. On the other hand, bivariate and multivariate descriptive statistics
are used to describe data with multiple variables.

Types of Descriptive Statistics

Measures of central tendency and measures of dispersion are two types of
descriptive statistics that are used to quantitatively summarize the characteristics
of grouped and ungrouped data. When an experiment is conducted, the raw data
obtained is known as ungrouped data. When this data is organized logically it is
known as grouped data. To visually represent data, descriptive statistics use
graphs, charts, and tables. Some important types of descriptive statistics are
given below.
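
Measures of Central Tendency

In descriptive statistics, the measures of central tendency are used to describe data
by determining a single representative central value. The important measures of
central tendency are given below:

Mean: The mean can be defined as the sum of all observations divided by the
total number of observations. The formulas for the mean are given as follows:
Ungrouped data Mean: x̄ = (x1 + x2 + ... + xn) / n
Grouped data Mean: x̄ = Σ(f × x) / Σf
x is an observation (or the class midpoint for grouped data), f is its frequency and n
is the number of observations.

Median: The median can be defined as the center-most observation that is
obtained by arranging the data in ascending order. The formulas for the median
are given as follows:
Ungrouped data Median (n odd): the ((n + 1) / 2)th observation
Ungrouped data Median (n even): the average of the (n / 2)th and ((n / 2) + 1)th observations
Grouped data Median: l + ((n / 2 - c) / f) × h
l is the lower limit of the median class (the class containing the (n / 2)th observation),
c is the cumulative frequency of the class preceding the median class, f is the frequency
of the median class and h is the class width.

Mode: The mode is the most frequently occurring observation in the data set. The
formulas for the mode are given as follows:
Ungrouped data Mode: Most recurrent observation
Grouped data Mode: l + ((f1 - f0) / (2 × f1 - f0 - f2)) × h
l is the lower limit of the modal class, f1 is its frequency, f0 and f2 are the frequencies
of the classes before and after it, and h is the class width.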
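
For grouped data, the median is commonly computed as l + ((n / 2 - c) / f) × h, using the symbols l, c, f and h described above. The sketch below applies that formula; the function name `grouped_median` and the class intervals are hypothetical.

```python
def grouped_median(classes):
    """Median of grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples,
    sorted by lower_limit.
    """
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # c: cumulative frequency of the classes before the median class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= n / 2:
            h = upper - lower  # class width
            return lower + ((n / 2 - cumulative) / freq) * h
        cumulative += freq
    raise ValueError("median class not found")

# Hypothetical frequency table: marks grouped into classes of width 10.
marks = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 5)]
print(grouped_median(marks))
```

With these made-up frequencies, n = 30, the median class is 20 to 30, c = 13 and f = 12, so the result is 20 + (2 / 12) × 10, roughly 21.67.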

Measures of Dispersion

In descriptive statistics, the measures of dispersion are used to determine how
spread out a distribution is with respect to the central value. The important
measures of dispersion are given below:

Range: The range can be defined as the difference between the highest value
and the lowest value. The formula is given as follows:
Range = H - S
H is the highest value and S is the lowest value in a data set.

Variance: The variance gives the variability of the distribution with respect to the
mean. The formulas for the variance are given as follows:
Population Variance: σ² = Σ(x - μ)² / N
Sample Variance: s² = Σ(x - x̄)² / (n - 1)
μ is the population mean, x̄ is the sample mean, N is the population size and n is the
sample size.

Standard Deviation: The square root of the variance gives the standard
deviation. Because it is expressed in the same units as the data, it is easier to
interpret than the variance. The formula is given as follows:
Standard Deviation: S.D. = √Variance = σ

Mean Deviation: The mean deviation gives the average of the absolute deviations of
the data about the mean, median, or mode. It is also known as absolute
deviation. The formula is given as follows:
Mean Deviation: M.D. = Σ|x - a| / n
a is the chosen central value (mean, median, or mode) and n is the number of observations.

Quartile Deviation: Half of the difference between the third and first quartiles
gives the quartile deviation. The formula is given as follows:
Quartile Deviation: Q.D. = (Q3 - Q1) / 2

Other measures of dispersion include the relative measures, also known as the
coefficients of dispersion.
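
The measures of dispersion listed above can be sketched with Python's standard `statistics` module. The data values are made up, the variance shown is the population variance, and quartiles depend on the convention used (`statistics.quantiles` defaults to its "exclusive" method), so the quartile deviation may differ slightly from hand calculations that use another rule.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 11]  # made-up observations

data_range = max(data) - min(data)       # highest value minus lowest value
variance = statistics.pvariance(data)    # population variance (divides by N)
std_dev = statistics.pstdev(data)        # square root of the population variance

center = statistics.mean(data)
mean_deviation = sum(abs(x - center) for x in data) / len(data)  # about the mean

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
quartile_deviation = (q3 - q1) / 2

print(data_range, variance, std_dev, mean_deviation, quartile_deviation)
```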

Descriptive Statistics Representations

Descriptive statistics can also be used to summarize data visually before
quantitative methods of analysis are applied to them. Some important forms of
representations of descriptive statistics are as follows:

Frequency Distribution Tables: These can be either simple or
grouped frequency distribution tables. They are used to show the distribution of
values or classes along with the corresponding frequencies. Such tables are very
useful in making charts as well as catching patterns in data.

Graphs and Charts: Graphs and charts help to represent data in a completely
visual format. They can be used to display percentages, distributions, and
frequencies. Scatter plots, bar graphs, pie charts, etc., are some graphs that are
used in descriptive statistics.
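
A simple and a grouped frequency distribution table can be built with `collections.Counter`, as in this sketch. The marks and the class width of 10 are assumptions for illustration.

```python
from collections import Counter

marks = [55, 62, 67, 71, 74, 78, 81, 85, 88, 93]  # hypothetical marks

# Simple frequency table: each distinct value with its count.
simple = Counter(marks)

# Grouped frequency table: values binned into classes of width 10.
width = 10
grouped = Counter((m // width) * width for m in marks)

for lower in sorted(grouped):
    print(f"{lower}-{lower + width - 1}: {grouped[lower]}")
```

The grouped table is the usual starting point for a histogram or bar chart.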

Descriptive Statistics Examples

Descriptive statistics help to provide the summary statistics for different data sets,
thereby enabling comparison. The descriptive statistics examples are given as
follows:
• Suppose the marks of students belonging to class A are {70, 85, 90, 65} and class B are {60,
40, 89, 96}. Then the average marks of each class can be given by the mean as 77.5 and
71.25, respectively. This denotes that the average of class A is higher than that of class B.
• Using the same example, suppose we need to determine how far apart the most extreme
values are; then the range is used. Range A = 25 and Range B = 56, showing that the
range of class B is higher than the range of class A.
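
The two examples above can be reproduced in a few lines of Python. The helper name `value_range` is introduced here only for illustration.

```python
import statistics

class_a = [70, 85, 90, 65]
class_b = [60, 40, 89, 96]

def value_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)

mean_a, mean_b = statistics.mean(class_a), statistics.mean(class_b)
range_a, range_b = value_range(class_a), value_range(class_b)

print(mean_a, mean_b)    # 77.5 71.25
print(range_a, range_b)  # 25 56
```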

Descriptive Statistics vs Inferential Statistics

Inferential and descriptive statistics are both used to analyze data. Descriptive
statistics helps to describe the data quantitatively while inferential statistics uses
these parameters to make inferences about the population. The differences
between descriptive statistics and inferential statistics are given below.

Descriptive Statistics: It is used to describe the characteristics of either the sample
or the population by using quantitative tools.
Inferential Statistics: It is used to draw inferences about the population data from the
sample data by making use of analytical tools.

Descriptive Statistics: Measures of central tendency and measures of dispersion are the
most important types of descriptive statistics.
Inferential Statistics: Hypothesis testing and regression analysis are the types of
inferential statistics.

Descriptive Statistics: It is used to describe the characteristics of a known dataset.
Inferential Statistics: It tries to make inferences about the population that go beyond
the known data.

Descriptive Statistics: Measures of descriptive statistics are mean, median, variance,
range, etc.
Inferential Statistics: Measures of inferential statistics are z test, f test, linear
regression, ANOVA test, etc.

Important Notes on Descriptive Statistics


• Descriptive statistics are used to describe the features of a sample or population using
quantitative analysis methods.
• Descriptive statistics can be classified into measures of central tendency and measures of
dispersion.
• Mean, mode, standard deviation, etc., are some measures of descriptive statistics.
• Data of descriptive statistics can be visually represented using tables, charts, and graphs.
