CRITIQUE PAPER

ENGINEERING DATA ANALYSIS

(MWF 2:30-4:00 PM)

SUBMITTED BY :

ANNA MIKAELA DC SANCHEZ

BSCE-2B

SUBMITTED TO : ENGR. JOEL MOLINA

DATA MINING VS STATISTICS : 7 CRITICAL DIFFERENCES


TABLE OF CONTENTS

• What is Data Mining?

• What is Statistics?

• Data Mining vs Statistics: Key Differences

    o Data Mining vs Statistics: Deriving Insights and Interpreting Data

    o Data Mining vs Statistics: Quantitative and Generic Input

    o Data Mining vs Statistics: Exploring Data and Formalizing thoughts

    o Data Mining vs Statistics: Importance of Domain Knowledge

    o Data Mining vs Statistics: Focus on Data Collection

    o Data Mining vs Statistics: Tools and Techniques

• Conclusion

INTRODUCTION

Data Mining and Statistics are two widely used terms in the data domain. Data Mining is about looking

deep into data to derive hidden patterns. It involves using a variety of techniques, including

domain understanding and mathematical rules. It is usually performed by a Data scientist,

business intelligence developer, or business analyst with data exposure. Tools and techniques are

used to mine data, such as statistical and visualization frameworks.


Data Mining can be divided into four concepts on a high level. Data mining involves grouping

data according to patterns, finding anomalies, determining relationships, and predictive

modeling. Statistics is the science of analysis and interpretation of numeric data. It involves

drawing conclusions based on a small amount of data and then extending it to the whole

population. Hypothesis testing helps establish the validity of results found on smaller data to the

larger outside world.

SUMMARY

WHAT IS DATA MINING?

Data mining is about looking deep into data to derive hidden patterns. Data in this context can be

anything: natural language sentences, images, or numeric data. Data Mining involves using a

variety of techniques, including domain understanding and mathematical rules. 

In the earlier days, Data Mining used to be a manual process, but with the advent of cheap

processing power, it has become a semi-automatic process. It is usually performed by a Data

scientist, business intelligence developer, or business analyst with data exposure. 

Numerous tools are available to mine data, including statistical and visualization frameworks. A

Data mining professional usually has exposure to tools related to storage, exploration,

visualization, and statistics. Even a database with good querying ability is a productive tool for

an expert data miner. 

Data Mining can be divided into the below concepts on a high level.
Grouping Data According to Patterns: This involves techniques like clustering and

classification. Clustering groups data without prior knowledge of the number of output groups.

Classification attempts to categorize data points to one of the predefined labels.

Finding Anomalies: Extracting data that is significantly different from other data points in the

set is required to establish patterns. Concepts like Normal distribution and statistical rules are

employed to extract anomalies.

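Deriving Relationships: Extracting relationships between variables can be done statistically.

Association rule learning is commonly used to accomplish this, although the rules it finds describe

items that occur together rather than proven cause and effect.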

Predictive Modeling: While it may seem like an entirely different concept compared to Data

Mining, predictive modeling is often used to uncover insights like reasons for specific customer

behavior and estimate other unknown outcomes.  
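
Two of the concepts above, finding anomalies and deriving relationships, can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the threshold of 2 standard deviations, and the sample data are assumptions made for this example and do not come from the article.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the data set
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def rule_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Confidence = (transactions containing both) / (transactions containing
    the antecedent). It measures co-occurrence, not cause and effect.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical data, made up for illustration only.
daily_orders = [52, 48, 50, 51, 49, 53, 47, 50, 120]
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

print("Anomalies:", find_anomalies(daily_orders))
print("Confidence bread -> butter:", rule_confidence(baskets, {"bread"}, {"butter"}))
```

The z-score rule assumes the data is roughly normally distributed, and one very large outlier can inflate the standard deviation, so the threshold usually needs tuning for each data set.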

Verifying results obtained through data mining is usually done using a statistical technique called

hypothesis testing. Hypothesis testing helps one establish the validity of results found on smaller

data to the larger outside world.
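
As a rough illustration of hypothesis testing, the sketch below runs a permutation test, which is one of several possible techniques and is not named in the article. It asks how often a difference in group means at least as large as the observed one would appear if the group labels were assigned at random. The function name, the two groups, and the number of permutations are assumptions made for this example.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups, made up for illustration only.
old_layout = [12, 15, 11, 14, 13, 12, 16, 14]
new_layout = [18, 21, 17, 20, 19, 22, 18, 20]

p_value = permutation_p_value(old_layout, new_layout)
print("Estimated p-value:", p_value)
```

A small p-value suggests that the pattern found in the sample is unlikely to be a product of chance alone, which is the kind of check that supports extending a mined result to the wider population.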

Since Data mining often involves dealing with personal information and deriving patterns, it

usually raises questions regarding legality and ethics. 

WHAT IS STATISTICS?

Statistics is the science of analysis and interpretation of numeric data. It is considered a part of

applied mathematics. Using Statistics generally involves drawing conclusions based on a small

amount of data and then extending it to the whole population. A population in a statistical sense is

the complete set of data that a conclusion is meant to apply to. A sample is a subset of the population where an

experiment or observation is conducted. Statistics can be divided into two branches on a high level:

descriptive statistics and inferential statistics. Descriptive statistics focuses on summarizing the

data in terms of different metrics. These could be measures of central tendency like mean, median,

or mode, or measures of variation in the data like standard deviation and range.

Distribution is another term that is generally used with descriptive statistics. It denotes the shape

of the data and forms the basis of defining properties like probability distribution functions. 
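
The descriptive metrics mentioned above can be computed with Python's built-in statistics module. The following is a minimal sketch: the helper name `describe` and the sample values are assumptions made for illustration.

```python
from statistics import mean, median, mode, stdev


def describe(values):
    """Summarize a data set with common descriptive statistics."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "stdev": stdev(values),  # sample standard deviation (n - 1)
        "range": max(values) - min(values),
    }


# Hypothetical sample, made up for illustration only.
sample_values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample_values)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here `stdev` treats the values as a sample drawn from a larger population. If the values were the entire population, `pstdev` would be the matching choice.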

Inferential Statistics is the method of using descriptive statistics to form deductions about the

sample and then extending it to the whole population. It relies on probability distributions and

makes deductions based on it. Hypothesis testing is a critical part of inferential statistics.

Hypothesis testing establishes how well the sample represents the population and the degree of

validity of extending sample results to population results. 

An example of this could be using a simple survey among a small percentage of your customers

about a product feature and generalizing the results to the whole set of people who use the

product. 
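
The survey example can be made concrete with a confidence interval for a proportion. The sketch below uses the normal approximation (the Wald interval), which is a common textbook choice but only one of several methods. The survey counts and the function name are hypothetical, and the interval assumes the surveyed customers were chosen at random.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Wald (normal approximation) confidence interval for a proportion."""
    p_hat = successes / n  # proportion observed in the sample
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical survey: 130 of 200 sampled customers like the feature.
low, high = proportion_confidence_interval(130, 200)
print(f"Estimated share of all customers: {low:.3f} to {high:.3f}")
```

The width of the interval shows how much uncertainty is introduced by generalizing from the sample to all customers. A larger sample gives a narrower interval.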

Data Mining vs Statistics: Key Differences

Now that we understand the basics of what Data Mining and Statistics are, let us explore how

these are different from each other.

Data Mining vs Statistics: Deriving Insights and Interpreting Data


As evident from the sections above, Data Mining and Statistics are entirely different concepts.

Data Mining is the process of deriving useful insights from data. Statistics is the science of

collecting, analyzing, and interpreting data. Statistics can be one of the methods that are used in

data mining. 

Data Mining vs Statistics: Quantitative and Generic Input

Statistics is concerned with quantitative data only while Data Mining deals with any kind of data.

Deriving numeric metrics out of data is often the first step of using statistics on it.
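
As a small illustration of deriving numeric metrics from non-numeric data, the sketch below turns free-text sentences into counts that statistics can then summarize. The function name and the sample sentences are assumptions made for this example.

```python
from statistics import mean


def text_to_metrics(sentences):
    """Derive simple numeric metrics (length in characters and words)."""
    return [{"chars": len(s), "words": len(s.split())} for s in sentences]


# Hypothetical customer comments, made up for illustration only.
reviews = ["Great product", "Too slow and too expensive", "Works as expected"]
metrics = text_to_metrics(reviews)

# Once the text is numeric, ordinary descriptive statistics apply.
average_words = mean([m["words"] for m in metrics])
print("Average words per review:", average_words)
```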

Data Mining vs Statistics: Exploring Data and Formalizing thoughts

The final result of data mining is often a prediction method, while for statistics, this is more

about deducing something based on probability distributions. Data Mining is often exploratory in

nature. Statistics is about confirming hypotheses. 

Data Mining vs Statistics: Importance of Domain Knowledge

Heuristics are rules of thumb that are formed based on the knowledge of a domain. Heuristics are

very important in data mining and often form the base of exploration. Statistics is about setting aside

heuristics and interpreting data only on the basis of mathematical evidence and probability.

Data Mining vs Statistics: Focus on Data Collection

Collecting and cleaning data is an important part of statistics. Data Mining is supposed to work

with virtually any kind of data and does not put much emphasis on the collection of data. It is

more about working with available data than defining strategies for collecting data.

Data Mining vs Statistics: Tools and Techniques

A Data Mining expert must be aware of tools and techniques used in data storage, exploration,

and visualization, which means being comfortable with a wide range of tools. For storage, it could

be anything from a simple relational database to a fully managed object storage service like S3.

Even NoSQL databases are important for a data mining professional. Data exploration tools like

SQL and processing frameworks like Spark are also important for Data Mining. Visualization

tools like Tableau and Power BI help present the results. Last but not least, a Data Miner

must also have some background in statistics.

A Statistician works with open-source or proprietary tools that help compute descriptive

statistics and derive inferences. This includes open-source tools like R or scikit-learn and

proprietary tools like SAS, SPSS, Minitab, etc. Even a spreadsheet tool like Microsoft Excel or

OpenOffice is a potent tool for statisticians.

CRITIQUE

The article served its purpose of explaining and differentiating Data Mining and Statistics.

Its structure and flow were clear and grounded in facts, and the comparison of the two was

simple yet thorough. Two definitions stood out. The first: "Data Mining is about looking deep

into data to derive hidden patterns. Data in this context can be anything: natural language

sentences, images, or numeric data." The second: "Statistics is the science of analysis and

interpretation of numeric data. It is considered a part of applied mathematics." These

definitions were self-explanatory: as the article says, Data Mining works with data of any kind,

while Statistics works with specific numeric data gathered to draw general conclusions.

CONCLUSION

We have now learned about the basics of Data Mining and Statistics. As discussed, Data Mining

and Statistics are different concepts on their own. While Data Mining is the exploration of data

to derive insights, Statistics is the science of interpreting data. Statistics is a core part of Data

Mining, but they are not the same. Data Mining employs statistical techniques to derive

prediction models or confirm results, but it is much more than statistics and includes storage,

exploration, visualization, etc.
