




PERIOD 58 (2021-2021)



Cuenca 15/6/2021

Theoretical framework

1. Introduction

a) What is data mining?

b) The data mining extraction process

2. Data preparation

3. Data mining techniques

4. Evaluation, dissemination and use of data mining

5. Complex data mining

6. Implementation and impact of data mining

7. Data mining systems and tools

8. Conclusion

9. Bibliography

Data Mining

The information revolution, the increasing accumulation of information and the development
of advanced statistical methods for the analysis of such information are the scenarios in
which Data Mining arises.

Data Mining methods combine the analysis of information external to the company, from
surveys and macroeconomic variables, with information from internal sources within the
organization. Market research explores, describes, explains or forecasts significant facts in
the relationship between a brand and its customers. Data Mining complements the above with
the exploration and discovery of permanent or sporadic relationships in the changing history
of the company itself.

By discovering the stable or circumstantial elements within a sequence of unstable scenarios, it
makes it possible to estimate what the company or institution will be like in the days to come.
It is in this sense that it is fair and appropriate to state that Data Mining is anticipating.

Data Mining is emerging as a new activity brought about by the IT revolution and the
progressive professionalization of the computational analysis of data. We are making great
strides towards a network of total and permanent mobile personal connectivity, accompanied
by a commerce that is enriching its concepts with the help of Data Mining. Computing has
produced such a vast process of transformation in human society that it is not yet possible to
comprehend the possibilities of its ever-expanding limits. In these circumstances of
astonishment and permanent change, we are at the dawn of the twenty-first century.

General objective:

- Gain a thorough understanding of the basic concepts of Data Mining and everything related to it.

Specific objective:

- Identify trends and behaviors not only to extract information but also to discover
relationships in a database that can identify behaviors that are not very obvious.
- Identify techniques and tools that help in the decision-making process of

Theoretical framework

The progressive accumulation of data, the great processing capacity provided by the
information revolution and the need to develop competitive advantages, have given rise to an
activity called Business Intelligence (BI), which consists of a set of protocols and resources
aimed at the creation of knowledge through the analysis of existing data inside and outside a

The distinctive commitment of a Business Intelligence, Marketing Intelligence or Insights
department, or an area with another suggestive name, is to exploit the data
of a company to contribute to its vision and decision making in the short and long term, in a
competitive environment.

Data from the company itself, information from the competitive environment, and
information from the macroeconomic environment are used at three points in time: past,
present and future. Depending on the depth and complexity of the information exploitation,
three types of results can be identified: descriptive exploitation, which tells us how things are;
explanatory exploitation, which identifies why things are the way they are; and prognostic
exploitation, which tells us how things will be in a conditioned future.

Business Intelligence is the commitment to transform data into relevant, exclusive and
confidential information, in order to build superior knowledge that optimizes the business
decision-making process. A sound implementation of a Business Intelligence area works at
several levels, which can be summed up in five verbs: observe (what is happening?),
understand (why is it happening?), predict (what would happen?), propose (what should the
team do?) and decide (which path to follow?). Business Intelligence acts as a strategic factor
for generating competitive advantages.

Another area of development is the extraction of consumer information from databases,
through applications that can isolate and identify consumer patterns or trends in a high
volume of data, using methods that range from descriptive and inferential statistics to
neural networks.

The applications needed to manage the flow of information in business activities can be
classified into two important categories: the applications that handle transactions and the
statistics that help convert data into useful information for decision making. In addition, there
is the indicator system, consisting of the databases where important data are stored to
evaluate and improve the performance of the activities that make up the supply chain and
analysis applications that facilitate the understanding of trends and patterns in the data. The
indicator system is seen as a basic integration tool through the communication and dialogue
that is established, based on the data, between the different actors in the process.

The use of Data Mining as a decision support in business activities requires much more than
the application of sophisticated techniques such as neural networks or decision trees on data
tables. For this reason, this paper shows Data Mining, on the one hand, as one of the steps in
the process of knowledge discovery in databases (KDD) and, on the other hand, as a process
consisting of different phases, in which techniques related to statistics, pattern recognition
and learning algorithms, among others, are used as support.

This work constitutes a first approach to a recent research area, which aims to present some
theoretical bases on the incidence of Data Mining as a support for decision making, applied to
business activities. The elaboration of the theoretical reflection emphasizes the
methodological postulates of the qualitative paradigm, which allows the construction of
knowledge based on an integral, interpretative and contextual vision of the phenomenon to be
studied. The theories consulted were interpreted to establish by deductive inference some
considerations related to Data Mining and some indicators that allow us to measure the
interest and impact of the knowledge that can be obtained by using it as a support for decision
making in organizations.

Data Mining is defined as the process of extracting useful and understandable knowledge,
previously unknown, from large amounts of data stored in different formats. In other words,
the fundamental task of data mining is to find intelligible patterns from the data. For this
process to be effective it should be automatic or semi-automatic and the use of the patterns
discovered should help to make safer decisions that bring, therefore, some benefit to the

Therefore, there are two challenges in data mining: on the one hand, working with large
volumes of data, mostly from information systems, with the problems that this entails. On the
other hand, using appropriate techniques to analyze the data and extract new and useful
knowledge. In many cases, the usefulness of the mined knowledge is intimately related to the
comprehensibility of the inferred model.

In a simplistic but ambitious way, we could say that the goal of data mining is to turn data
into knowledge. This goal is not only ambitious but very broad.

IT revolution (or information revolution): a period of technological advances spanning
from the mid-20th century to the present day.

Business Intelligence (BI): a set of strategies, applications, data, products, technologies and
technical architecture focused on the management and creation of knowledge
about the environment, through the analysis of existing data in an organization or company.

Knowledge discovery in databases (KDD): the process of discovering valuable knowledge from
a collection of data. This widely used data mining technique is a process that comprises data
preparation and selection, data cleansing, the incorporation of prior knowledge about the data
sets and the interpretation of accurate solutions from the observed results. Major KDD
application areas include marketing, telecommunications and manufacturing.

1. What is data mining?

We call Data Mining the exploitation of corporate databases that record personal, family and
socioeconomic characteristics, purchasing behavior and payment behavior. Data Mining
opens a new dimension in the design of market research, data collection and analysis. All of
this usually takes place in the marketing areas of companies that are strongly customer
oriented and manage brands in highly competitive markets, where the discovery of a new
opportunity can be the basis for building a new temporary advantage over the competition.

Data mining makes it possible to obtain exclusive, confidential and actionable knowledge, which
serves as raw material for creating short- and medium-term competitive advantages. Data Mining is the
extraction of sensitive information that resides implicitly in the data. This information is
previously unknown and may be useful for some processes. In other words, data mining
prepares, probes and explores the data to extract the information hidden in it, so that a small
finding, a small relationship that is discovered, can be a fact of high impact on the company's

Data mining has several statistical and computational methodologies that, together with a
behavioral science approach, allow data analysis and the elaboration of descriptive and
predictive mathematical models of consumer behavior.

Data mining and its methodologies have countless areas of application. Segmentation
or clustering techniques are applied to risk classification problems (good customers, bad
customers); regression analysis is applied to studies of how factors are associated with a
response variable of interest, for example: how does education level affect decisions to
consume a product? Likewise, econometric analysis is applied to the study of the behavior of
economic or financial variables.

The data mining extraction process

KDD is an iterative and interactive process. It is iterative because the output of some
phases may make it necessary to go back to previous steps, and because several iterations are
often needed to extract high-quality knowledge. It is interactive because the user must help
with the preparation of the data, the validation of the extracted knowledge, etc.

The KDD process is organized around five phases.

In the integration and data collection phase, the useful sources of information and where to
obtain them are determined. All the data are then transformed into a common format, usually a
data warehouse, that unifies all the information collected, detecting and resolving
inconsistencies. This data warehouse greatly facilitates navigating and previewing the data, to
discern which aspects may be of interest to study. Since the data come from different sources,
they may contain erroneous or missing values.

This situation is dealt with in the selection, cleaning and transformation phase, in which
incorrect data are removed or corrected and the strategy to be followed with incomplete data
is decided. In addition, the data are projected so that only the variables or attributes
that are going to be relevant are considered, in order to make the mining task easier and the
results more useful. Selection includes screening or merging both horizontally (rows) and
vertically (attributes). These two phases are usually grouped together as data preparation.
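
As a minimal sketch of the selection, cleaning and transformation phase described above, the following Python fragment removes rows with incorrect values (horizontal selection), keeps only the relevant attributes (vertical selection) and fills missing values with the attribute mean. The records, the attribute names and the choice of mean imputation are assumptions made for illustration; they do not come from a real data set, and other strategies for incomplete data are equally possible.

```python
from statistics import mean

# Hypothetical raw records merged from several sources; None marks a missing value.
raw_records = [
    {"customer_id": 1, "age": 34, "city": "Cuenca", "monthly_spend": 120.0, "fax": None},
    {"customer_id": 2, "age": None, "city": "Quito", "monthly_spend": 80.0, "fax": None},
    {"customer_id": 3, "age": 51, "city": "Cuenca", "monthly_spend": -15.0, "fax": None},
    {"customer_id": 4, "age": 29, "city": "Loja", "monthly_spend": 200.0, "fax": None},
]


def is_valid(record):
    """Horizontal selection: keep only rows without clearly incorrect values."""
    return record["monthly_spend"] >= 0


def project(record, attributes):
    """Vertical selection: keep only the attributes relevant to the mining task."""
    return {name: record[name] for name in attributes}


def impute_mean(records, attribute):
    """One possible strategy for incomplete data: fill gaps with the attribute mean."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    fill = mean(known)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in records
    ]


relevant = ["customer_id", "age", "monthly_spend"]  # assumed to be the useful attributes
rows = [r for r in raw_records if is_valid(r)]      # drops the negative-spend record
rows = [project(r, relevant) for r in rows]         # drops "city" and "fax"
prepared = impute_mean(rows, "age")                 # fills the missing age
```

In this sketch the record with a negative spend is discarded rather than corrected; whether to remove or repair incorrect data is exactly the kind of decision this phase has to make.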

In the data mining phase, the task to be performed is decided and the method to be used is
chosen.
In the evaluation and interpretation phase, the patterns are evaluated and analyzed by the
experts and, if necessary, the process returns to earlier phases for a new iteration. This
includes resolving possible conflicts with previously available knowledge.

Finally, in the dissemination phase, the new knowledge is used and shared with all potential
users.
2. Data Preparation

What is data preparation?

Data preparation is a self-service activity that converts disparate, unformatted and messy data
into a clean and consistent view. The process includes searching, cleaning, transforming,
organising and collecting data.

Proper data preparation reduces potential errors and allows for more agile and efficient
analysis of information. This stage may be the most extensive, but it is essential for
eliminating any traces of poor-quality data, standardising formats, enriching source data and/or
eliminating outliers. The steps to follow in this pre-analysis stage are:

- Data collection: The first step is to actively extract information from all available sources,
such as clouds and data lakes. This step aims to create the largest possible pool of

- Active preparation: This is when data analysts must start refining and cleaning the
quantitative information they collect. This means that they must meticulously search for
errors and missing values in the raw data and throw out any bad information that could
damage the results.

- Loading: In this stage, the cleaned data will be loaded into a database and transformed to
make it usable.

- Processing: Here, the data will be subjected to further processing using algorithms to make
it easy to digest and interpret across various systems and channels. The details of this step
will vary depending on the business and its industry.

- Interpretation: At this stage, the processed quality data is extracted into digestible formats
such as text, visual media or graphics.

- Final storage: This final step is to ensure that all data is stored clearly and concisely
on an accessible medium, such as a file or USB drive, for future use.
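
The active preparation step can be illustrated with a short Python sketch that standardises number formats, discards missing values and removes outliers. The raw values are invented, and the interquartile-range rule with a factor of 1.5 is only one common rule of thumb assumed here, not a method prescribed by the text.

```python
from statistics import quantiles

# Hypothetical raw values as they might arrive from different sources.
raw_amounts = ["12.5", " 14 ", "13,2", "n/a", "15.1", "13.9", "250", "12.8", ""]


def parse_amount(text):
    """Standardise formats: trim spaces, accept a decimal comma, reject non-numbers."""
    cleaned = text.strip().replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None  # treated as a missing value


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


parsed = [parse_amount(t) for t in raw_amounts]
present = [v for v in parsed if v is not None]  # discard missing values
clean = remove_outliers(present)                # the value 250 is rejected as an outlier
```

Throwing out bad information, as the step describes, is the simplest option; in practice an analyst may prefer to correct or flag such values instead.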

3. Data mining techniques

• Neural networks
A paradigm of learning and automated processing inspired by the way the nervous
system of animals works. It is a system of interconnected neurons that collaborate in a
network to produce output stimuli.
• Decision trees
A prediction model used in the field of artificial intelligence: starting from a database,
diagrams of logical constructions are built.
• Statistical models
A symbolic expression in the form of an equation that is used in experimental designs and in
• Clustering (grouping)
It is about arranging the input vectors so that they are closer to those with common
• Linear regression
Fast and efficient, but insufficient in multidimensional spaces where more than two variables
can be related.
• Association rules
They are used to discover events that occur together within a given data set.

These techniques rely on two kinds of algorithms:
- Supervised algorithms: They predict unknown data based on other, known data.
- Unsupervised algorithms: They discover patterns and trends in the data.
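
To make the idea of a supervised algorithm more concrete, the following sketch fits a linear regression with a single predictor by ordinary least squares and then uses it to predict an unknown value from a known one. The data and the names `fit_line` and `predict` are invented for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: years as a customer (known) vs. yearly purchases (known).
years = [1, 2, 3, 4, 5]
purchases = [3, 5, 7, 9, 11]  # exactly 2 * years + 1, so the fit is easy to check

slope, intercept = fit_line(years, purchases)


def predict(x):
    """Supervised prediction: estimate the unknown value from a known input."""
    return slope * x + intercept
```

With these numbers the fitted line has slope 2 and intercept 1, so `predict(6)` gives 13. With more than one explanatory variable a multiple regression or another technique would be needed, which is the limitation noted in the list above.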

4. Evaluation, dissemination and use of data mining

● Applications of use
The problem variables or association relationships between these variables. Several
techniques can also be used at the same time to generate different models, although generally
each technique requires different preprocessing of the data.

● Interpretation and evaluation

Once the model has been obtained, it must be validated, checking that the conclusions it
yields are valid and sufficiently satisfactory. If several models have been obtained using
different techniques, the models should be compared in search of the one that best fits the
problem. If none of the models achieves the expected results, one of the previous steps should
be altered to generate new models.
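
A minimal sketch of this comparison, assuming that each candidate model can be called as a function and that a set of validation examples with known outcomes has been kept aside. The two models, the data and the acceptance threshold are all hypothetical, and mean absolute error is just one of many possible evaluation measures.

```python
def mean_absolute_error(model, examples):
    """Average absolute gap between the model's output and the known answer."""
    return sum(abs(model(x) - y) for x, y in examples) / len(examples)


# Hypothetical candidate models, standing in for the output of different techniques.
def model_a(x):
    return 2 * x + 1


def model_b(x):
    return 3 * x


# Made-up validation examples (input, known outcome) kept aside from training.
validation = [(1, 3.2), (2, 4.9), (3, 7.1), (4, 8.8)]

candidates = {"model_a": model_a, "model_b": model_b}
errors = {name: mean_absolute_error(m, validation) for name, m in candidates.items()}
best_name = min(errors, key=errors.get)  # the model that best fits the problem

# If even the best model misses the target, an earlier step has to be revisited.
ACCEPTABLE_ERROR = 0.5  # assumed threshold, set by the analyst
needs_new_iteration = errors[best_name] > ACCEPTABLE_ERROR
```

Here `model_a` has the smaller error and falls under the assumed threshold, so no new iteration would be triggered.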

● Usage Applications

Every year, the different congresses, symposiums and workshops held around the world bring
together researchers with very diverse applications. Especially in the United States, data
mining has been incorporated into the life of companies, governments, universities, hospitals
and various organizations that are interested in exploring their databases.

Data mining is very useful in the following domains:

- Market analysis and management

- Business analysis and risk management

- Fraud detection

Apart from these, data mining can also be used in the areas of production control, customer
retention, scientific exploration, sports, astrology and Internet web browsing.

5. Complex Data mining

With data mining, a retailer can use point-of-sale records of customer purchases to send
targeted promotions based on an individual's purchase history. By mining demographic data
from comment or warranty cards, the retailer could develop products and promotions to
appeal to specific customer segments.

While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides a link between the two. Data mining software
analyzes relationships and patterns in stored transaction data based on open-ended user

Several types of analytical software are available: statistical, machine learning,
and neural networks. In general, they look for any of these four types of relationships:

- Classes: Stored data is used to locate data in predetermined groups. For example, a
restaurant chain could mine customer purchase data to determine when customers
visit and what they typically order. This information could be used to increase traffic
by having daily specials.
- Groups: Data elements are grouped according to logical relationships or consumer
preferences. For example, data can be mined to identify market segments or consumer
- Associations: Data can be mined to identify associations. The beer-candy example is
an example of associative mining.
- Sequential patterns: Data is mined to anticipate behavioral patterns and trends. For
example, a retailer specializing in outdoor equipment can predict the probability of
purchasing a backpack based on a consumer's purchase of sleeping bags and hiking

Data mining consists of five main elements:

- Extract, transform, and load transaction data into the data warehousing system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data with application software.
- Present the data in a useful format, such as a graph or table.

Complexity of queries

The more complex the queries and the greater the number of queries in process, the more
powerful the required system must be. Relational database-based storage technology for
management is adequate for many data mining applications that handle less than 50

However, this infrastructure must be significantly enhanced to support larger applications.

Some vendors have added extensive indexing capabilities to improve query performance.

Others use new hardware architectures, such as massively parallel processors (MPP), to
achieve order-of-magnitude improvements in query time.

6. Implementation and impact of data mining:

In the information age, companies need to take advantage of digital tools to
collect large amounts of data, which allow them to improve communication and
relationships with clients, users and collaborators. Currently, most companies gather a large
amount of data from clients, users of their platforms and internal collaborators. Managing
this information makes it possible to obtain patterns, trends or factors that help the
organization to communicate effectively. In this context, there are currently various
tools for data management; among the most important are the Data Warehouse and Data
Mining.

Data mining is of great importance in today's highly competitive business environment. A
new concept, business intelligence data mining, has now evolved and is widely used by
major corporations to stay ahead of their competitors. Business Intelligence (BI) can help
provide the latest information and can be used for competition analysis, market research,
economic trends, consumer behavior, industry research, geographic information analysis and
so on. Business intelligence data mining aids decision making.

However, data mining is a crucial process that requires a lot of time and patience to collect
the desired data, owing to the complexity of the databases. It may also be necessary to
seek the help of outsourcing companies, which specialize in extracting the data, filtering it
and then maintaining it for analysis. Data mining has been used in different contexts, but it
is most commonly used to meet business and organizational needs for analytical purposes, with
benefits such as:

- Discovering information that was not expected.
- Results that are easy to understand: people without prior knowledge of computer
engineering can interpret the results with their own ideas.
- Finding, attracting and retaining customers.
- Reducing the risk of losing customers: offering specific promotions or
special products to retain them and increase sales.

Data mining is implemented in four steps. The first is goal setting, which determines what
data will be collected. For example, the purpose may be to group the information in order to
influence the communication management of the organization. Second, there is data processing,
followed by analysis, and finally, the collection of evaluations and observations for use in a corporate

In today's business environment, the most valuable information in areas such as finance,
business management and human resources, among others, resides on digital platforms.
Therefore, the professionals in charge of managing it are key to the growth of any
business. Data mining is a fundamental factor in making important decisions and generating
strategic communication plans.

7. Data mining tools and systems:

With the growing need for and interest in analyzing massive data, a new generation of tools
called Data Science and Machine Learning Platforms has appeared in organizations. These tools
allow data scientists, analysts and business users to interact with their data. The idea is, on
the one hand, to create a work platform for data scientists that facilitates and standardizes
their data mining work and, on the other hand, to empower business users by making data mining
accessible to them through ease of use. These tools support the complete Data Mining cycle
to create, deploy and manage advanced analytics models. They integrate the main
functionalities needed to carry out data mining projects: data import, data preparation, data
exploration, modeling, evaluation and deployment.

During the last few years, several Data Science and Machine Learning Platforms have
appeared, creating a very dynamic market that is evolving rapidly. Although large companies
such as IBM, SAP and Microsoft have launched their own tools, so far they have not managed
to dominate the market, leaving room for innovative new companies, for example:

KNIME (Konstanz Information Miner):

KNIME is a data mining platform that greatly facilitates the tasks of data analysis,
modeling, processing and visualization and, in addition to all this, it is free software.
Despite its great growth and success, it has maintained its open source character. KNIME
has a free version, KNIME Analytics Platform, for personal use (85% of the
functionalities), as well as a paid version, KNIME Server, for organizations that want
to take their data mining activities to a new level. KNIME Server differs from the free version
through additional functionalities for team collaboration, automation, the WebPortal
(graphical interface) and greater computing power.

KNIME supports the user throughout the Data Mining cycle and distinguishes itself through its
flexibility, power and ease of use. The tool makes it possible to integrate data from different
sources, manipulate it, analyze it and create data mining applications. Through its graphical
interface, in which nodes that encapsulate functions are connected, the user can create workflows easily and

Weka:

Weka is a software platform for machine learning and data mining written in Java and developed
at the University of Waikato. It is free software distributed under the GNU GPL license.
The original version was initially designed as a tool to analyze data from the domain of
agriculture, but the most recent version based on Java (WEKA 3), whose
development began in 1997, is used in many different areas, in particular for teaching and
research purposes.

The Weka platform is characterized by the following features:
- Available: This software platform is free thanks to the GNU General Public License.
- Adaptable: being implemented in the Java language, it is compatible with almost any
- Functional: it consists of a wide repository of techniques for data preprocessing and
- Simple: its use is very easy thanks to its graphical user interface.

Weka is a very complete solution that incorporates powerful features for data exploitation,
with characteristics very similar to those of commercial tools and with the advantage over
them that it is freely accessible and completely free of charge.

Orange:

Orange has been used since its inception in biomedical studies, bioinformatics, genomic
research and even teaching. In these sectors, the tool has served as a trial-and-error
platform for new machine learning algorithms, while in education it has spread among
students of biology and biomedicine as a way to apply machine learning methods and
data mining analysis. Orange allows users to create their own interactive workflows in order
to analyze and visualize the data more fully. In this way, the tool can be redesigned and
adapted to the needs of the company, so that the information can be viewed in different
formats, such as scatter diagrams, bar graphs, trees or networks, and heat maps. This makes it
possible to choose the type of visualization that shows the results most clearly and helps to
interpret the information better.

RapidMiner:

RapidMiner stands out for allowing free access and for its ease of use, given that it does not
require elaborate programming knowledge, as well as for the large selection of operators that
it offers. It is written in Java and contains more than 500 operators with different
approaches to showing the connections in the data: there are options for data mining, text
mining and web mining, as well as sentiment analysis or opinion mining. The program can also
import Excel tables, SPSS files and data sets from different databases, and it integrates the
WEKA and R data mining programs. All of this highlights the versatility of this software. The
tool is made up of three large modules: RapidMiner Studio, RapidMiner Server and
RapidMiner Radoop, each tasked with a different data mining technique. In addition,
RapidMiner prepares the data before analysis and optimizes it for rapid processing. For each
of these three modules there is a free version and different payment options.

SAS:

SAS is the main data mining tool for business analysis and, in fact, it is
considered the most suitable program for large companies, although it is also the software
with the highest cost of all those described here. The new analytical offerings for
SAS Viya are structured for a diverse range of users, and remain consistent and manageable.
In addition to SAS Visual Data Mining and Machine Learning for data scientists, the Viya
family will include SAS Visual Analytics for business analysts and SAS Visual Statistics,
aimed at advanced users of statistics.

SAS Viya offers a breadth of applications for all types of users, all of which maintain a
consistent structure. The speed of SAS Viya's multithreaded parallel processing engine helps
users make faster decisions, while the robustness of its analytics is intended to produce
reliable results.

8. Conclusion

Data Mining is presented as a support technology for exploring, analyzing, understanding and
applying the knowledge obtained from large volumes of data, and for discovering patterns that
help in the identification of structures in the data. The commercial products are expensive and
require a lot of experience to use, and it is very easy to find misleading or uninteresting
patterns. Even so, the application of these tools helps in the decision-making process of
organizations.

In this article we have tried to provide guidance on the most appropriate techniques and tools
that currently exist in the field of data mining. Given the growing capacity to store
information and companies' growing awareness that analyzing the information they hold can
improve quality, these tools have remarkable future potential.

Data Mining, properly employed, becomes a strategic tool that raises the level of
competitiveness in the changing business world. Effective decision making depends on the speed
with which important information is identified and analyzed. The existence of innovative
methodologies for carrying out this identification and analysis should therefore strengthen
competitive advantage and help to increase the number of customers.

9. Bibliography
Aggarwal, C. C. (2015). Data Mining. Springer.

Palma, C., Palma, W., & Pérez, R. (2012). Data mining: El arte de anticipar. RIL editores.

Ferri Ramírez, C., & Ramírez Quintana, M. J. (2004). Introducción a la minería de datos. Pearson Educación.

Larose, D. T., & Larose, C. D. (2015). Data Mining and Predictive Analytics (2nd ed.).

Márques, M. (2011). Bases de datos (1st ed.). UJI.

Marco Teórico - Sistemas de Información 3. (n.d.). Retrieved June 15, 2021, from

Vallejos, S. J. (2006). Minería de Datos. Universidad Nacional del Nordeste, Facultad de Ciencias Exactas.

Rodríguez, Y., & Díaz, A. (2009, November). Herramientas de minería de datos. Revista Cubana de Ciencias Informáticas.
