S2-Slo1 & Slo2


What is Data Mining?

Data Mining is defined as extracting information from huge sets of data. In
other words, data mining is the process of discovering knowledge from data. The
information or knowledge extracted in this way can be used for any of the following
applications −
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration
The data mining process can be categorized into these four primary steps:
• Data Gathering
• Data Preparation
• Mining the Data
• Data Analysis and Interpretation
Let us look at each step in detail:
• Data Gathering- Relevant data for an analytics application is identified and
assembled. The data may reside in different source systems, a data lake, or a data
warehouse, an increasingly common repository in big data environments that holds
both structured and unstructured data.
• Data Preparation- This step covers the stages needed to get the data ready for
mining. It begins with data exploration, profiling, and pre-processing, followed by
data cleansing to fix errors and other data quality issues.
• Mining the Data- Once the data is prepared, the data scientist chooses a suitable
data mining technique and then implements one or more algorithms to do the
mining. In machine learning applications, for example, the algorithms must be
trained on sample data sets to look for the information sought before being run
against the full data set.
• Data Analysis and Interpretation- Finally, the data mining results are used to
develop analytical models to help decision-making and other business actions.
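The four steps above can be sketched as a minimal pipeline. The records, the missing-value cleanup, and the frequency-count "mining" step are all illustrative assumptions standing in for real source systems and algorithms:

```python
# Minimal sketch of the four data mining steps on toy, in-memory data.
from collections import Counter

# Data Gathering: assemble records from a (here, hard-coded) source.
records = [
    {"customer": "A", "product": "milk"},
    {"customer": "B", "product": "bread"},
    {"customer": "A", "product": None},  # a data quality issue
    {"customer": "C", "product": "milk"},
]

# Data Preparation: clean out records with missing values.
clean = [r for r in records if all(v is not None for v in r.values())]

# Mining the Data: a simple frequency analysis stands in for a real algorithm.
counts = Counter(r["product"] for r in clean)

# Data Analysis and Interpretation: report the most popular product.
top_product, top_count = counts.most_common(1)[0]
print(top_product, top_count)  # milk 2
```

In practice each stage would be far larger (ETL jobs, profiling tools, trained models), but the shape of the workflow is the same.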

Numerical optimization:
Numerical optimization methods have been used for many years across a wide
range of applications.
Optimization, linear algebra, and statistics form the bedrock of modern data
science, each playing a crucial role in extracting insights and building powerful
models from data. Let's delve into their individual contributions and how they
intertwine:

Optimization:

 Finding the "best": Optimization algorithms seek the optimal values for
variables that minimize or maximize a specific objective function. This is
essential for training machine learning models, selecting features, tuning
hyperparameters, and more.
 Algorithmic toolbox: Data science relies on a diverse arsenal of optimization
algorithms like gradient descent, Newton-Raphson, and evolutionary
algorithms, each tailored to specific problem types and objective functions.
 Scalability and efficiency: Large datasets require scalable optimization
techniques like stochastic gradient descent and distributed computing to find
solutions efficiently.
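As one illustration, plain gradient descent minimizing a simple quadratic objective; the function, starting point, and learning rate here are arbitrary choices for the sketch:

```python
# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
def grad(x):
    return 2 * (x - 3)  # derivative of the objective

x = 0.0   # starting point
lr = 0.1  # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)  # step opposite the gradient

print(round(x, 4))  # converges toward 3.0
```

Training a machine learning model follows the same pattern, except the "variable" is a vector of millions of weights and the objective is a loss computed over data.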

Linear Algebra:

 Mathematical language of data: Vectors, matrices, and linear equations form
the lingua franca of data manipulation and analysis. Linear algebra
underpins data representation, calculations, and model formulations.
 Dimensionality reduction: Techniques like Principal Component Analysis
(PCA) leverage linear algebra to reduce data dimensionality while preserving
crucial information, improving model performance and interpretability.
 Solving complex systems: Linear algebra provides powerful tools like
eigenvalue analysis and singular value decomposition (SVD) for analyzing
complex systems such as recommender systems and for uncovering relationships
between data points.
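A small sketch of PCA via the SVD, using NumPy on made-up 2-D data; the data set and the single-component projection are assumptions for illustration:

```python
import numpy as np

# Toy 2-D data lying mostly along one direction.
X = np.array([[2.0, 1.9], [0.5, 0.6], [3.1, 3.0], [1.2, 1.1], [2.4, 2.5]])

# Center the data, then take the SVD; the rows of Vt are the
# principal directions, ordered by explained variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first principal component: 2-D -> 1-D.
X_reduced = Xc @ Vt[0]

# Fraction of total variance retained by the first component.
explained = S[0] ** 2 / np.sum(S ** 2)
print(X_reduced.shape, float(explained))
```

Because the points lie close to a line, a single component keeps nearly all of the variance, which is exactly the dimensionality-reduction idea behind PCA.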

Statistics:

 Data understanding and uncertainty: Statistics equips data scientists with the
tools to understand the distribution of data, identify patterns, and quantify
uncertainty. This informs data cleaning, model selection, and interpretation of
results.
 Hypothesis testing and inference: Statistical tests help assess the significance
of findings and draw conclusions from data. This is crucial for validating
models, measuring their performance, and making informed decisions based
on data.
 Probabilistic modeling: Statistical models represent relationships between
variables and predict future outcomes under uncertainty. This forms the basis
for numerous data science applications like anomaly detection, time series
forecasting, and risk analysis.
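For instance, a basic statistical anomaly detector flags points far from the sample mean; the observations and the 2-standard-deviation threshold are illustrative assumptions:

```python
import statistics

# Daily request counts; the last value is an obvious outlier.
observations = [102, 98, 101, 97, 103, 99, 100, 250]

mean = statistics.mean(observations)
stdev = statistics.stdev(observations)

# Flag values more than 2 standard deviations from the mean.
anomalies = [x for x in observations if abs(x - mean) > 2 * stdev]
print(anomalies)  # [250]
```

Real anomaly detection systems use more robust statistics (the outlier here inflates the standard deviation itself), but the principle of quantifying how unlikely a point is under a fitted distribution is the same.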
