
Data Warehousing and Data Mining

BSIT - 2
COMBIS
BUENTIPO
AMORA
DATA WAREHOUSING
● A type of data management system that is designed to enable and support business
intelligence activities, especially analytics.

● Also known as an Enterprise Data Warehouse (EDW), it is a system used for reporting and data analysis. Data warehouses are central repositories of integrated data from one or more disparate sources.
DATA WAREHOUSING
● Data Warehouses typically follow a dimensional modeling approach, organized
into dimensions and facts.

● The data warehouse architecture usually includes components such as Extract, Transform, Load (ETL) processes for data integration, a Relational Database Management System (RDBMS) for storage, and tools for querying and reporting.
CONCEPTS OF DATA WAREHOUSING
● Extract, Transform, Load (ETL)

● Dimensional Modeling

● Star Schema and Snowflake Schema

● Online Analytical Processing (OLAP)

● Data Mart

● Metadata Management

● Query and Reporting Tools


EXTRACT, TRANSFORM, LOAD (ETL)
● ETL is the process of extracting data from various sources, transforming it into a
consistent format, and loading it into the data warehouse.

● This process involves cleansing, standardizing, and integrating data to ensure its
quality and consistency.
DIMENSIONAL MODELING
● It is a design technique used in data warehousing to organize data into dimensions
and facts.

● Dimensions are descriptive attributes by which data is analyzed (e.g., time, product,
customer), while facts are numerical measures or metrics representing business
activities (e.g., sales revenue, quantity sold).
STAR SCHEMA AND SNOWFLAKE SCHEMA
● Star schema and snowflake schema are two common dimensional modeling
techniques. In a star schema, the data warehouse is organized into a central fact
table surrounded by dimension tables, forming a star-like structure.

● In a snowflake schema, dimension tables are normalized into multiple levels, creating a more normalized structure.
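
As a rough illustration of the star schema described above, the sketch below creates one fact table and three dimension tables in an in-memory SQLite database. The table and column names (fact_sales, dim_date, dim_product, dim_customer) are hypothetical, chosen only to make the structure concrete.

```python
import sqlite3

# A minimal star-schema sketch: one central fact table whose foreign keys
# point at three dimension tables. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);

-- Fact table: numeric measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    date_key      INTEGER REFERENCES dim_date(date_key),
    product_key   INTEGER REFERENCES dim_product(product_key),
    customer_key  INTEGER REFERENCES dim_customer(customer_key),
    quantity_sold INTEGER,
    sales_revenue REAL
);
""")
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```

A snowflake schema would further split a dimension such as dim_product into separate product and category tables, trading a simpler load for extra joins at query time.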
ONLINE ANALYTICAL PROCESSING (OLAP)
● OLAP is a category of software tools used for multidimensional analysis of data in
a data warehouse.

● OLAP allows users to analyze data from multiple perspectives, perform complex
calculations, and create interactive reports and dashboards for decision-making.
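
OLAP products are typically dedicated cube engines or servers, but the core idea of multidimensional aggregation can be sketched with a pandas pivot table. The tiny sales table below is made-up data; the pivot aggregates revenue along two dimensions (region and year), which is roughly what an OLAP "slice and dice" query does.

```python
import pandas as pd

# Not a full OLAP engine, just the core "slice and dice" idea:
# aggregate a measure (revenue) along two dimensions (region x year).
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "year":    [2022, 2023, 2022, 2023],
    "revenue": [100.0, 120.0, 80.0, 95.0],
})

cube = sales.pivot_table(values="revenue", index="region",
                         columns="year", aggfunc="sum")
print(cube)
```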
DATA MART
● A data mart is a subset of a data warehouse that is focused on a specific business
function, department, or user group.

● Data marts are often used to address the needs of individual departments or teams
and are typically smaller and more specialized than enterprise data warehouses.
METADATA MANAGEMENT
● Metadata is data about the data stored in the data warehouse, such as table definitions, source-to-target mappings, and load timestamps.

● Metadata management involves capturing, storing, and managing metadata to provide information about the structure, meaning, and usage of data within the data warehouse.
QUERY AND REPORTING TOOLS
● Query and reporting tools allow users to access and analyze data stored in the data
warehouse.

● These tools provide an intuitive interface for building and running queries,
generating reports, and visualizing data to support decision-making processes.
EXTRACT, TRANSFORM, LOAD (ETL) PROCESSES

● The Extract, Transform, Load (ETL) process is a crucial component of data warehousing, responsible for collecting data from various sources, transforming it into a consistent format, and loading it into the data warehouse.
EXTRACT
● Source Identification: Identify the sources of data that need to be extracted. These can include databases,
flat files, spreadsheets, web services, and other data repositories.

● Connection Establishment: Establish connections to the source systems to retrieve data. This may involve
connecting to databases using database drivers, accessing files via file transfer protocols, or interfacing
with APIs.

● Data Extraction: Extract data from the source systems based on predefined criteria or queries. This can
involve querying databases using SQL, reading files line by line, or invoking APIs to retrieve data.

● Change Data Capture (CDC): In some cases, only the changes or delta records need to be extracted from the source systems rather than the entire dataset.
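
A minimal extraction sketch in Python, assuming two hypothetical sources (a SQLite database file and a CSV export); real pipelines would also handle credentials, scheduling, and full change data capture.

```python
import csv
import sqlite3

# Hypothetical sources: a SQLite database file and a CSV export.
def extract_from_database(db_path, query):
    with sqlite3.connect(db_path) as conn:        # connection establishment
        return conn.execute(query).fetchall()     # data extraction via SQL

def extract_from_csv(csv_path):
    with open(csv_path, newline="") as f:         # flat-file source
        return list(csv.DictReader(f))            # one dict per row

# A simple delta-style criterion (only yesterday's rows), in place of full CDC:
# rows = extract_from_database(
#     "sales.db",
#     "SELECT * FROM sales WHERE sale_date = date('now', '-1 day')")
```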
TRANSFORM
● Data Cleaning: Cleanse the extracted data to remove errors, inconsistencies, and duplicates. Data
cleaning tasks may include correcting misspellings, standardizing formats, and removing irrelevant or
incomplete records.

● Data Integration: Integrate data from multiple sources into a unified format. This involves resolving
inconsistencies in data formats, units, and definitions to ensure consistency and compatibility.

● Data Transformation: Transform the data into a format suitable for storage and analysis in the data
warehouse. This may involve converting data types, aggregating or summarizing data, and applying
business rules or calculations.

● Data Enrichment: Enhance the extracted data with additional information from external sources to
provide more context or detail. Data enrichment can involve merging data with reference datasets,
appending geospatial information, or enriching with demographic data.
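
A minimal transformation sketch using pandas, showing cleaning, standardization, type conversion, and aggregation on a made-up extract; all column names here are hypothetical.

```python
import pandas as pd

# Hypothetical extracted rows with the usual problems: inconsistent
# formats, duplicates, and a value that is not a number.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date":  ["2024-01-01", "2024-01-01", "2024-01-02"],
    "country":     [" us", "US ", "ph"],
    "amount":      ["10.5", "10.5", "oops"],
})

df = raw.copy()
df["country"] = df["country"].str.strip().str.upper()         # standardize formats
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # convert data types
df = df.drop_duplicates()                                     # cleaning: remove duplicates
df = df.dropna(subset=["amount"])                             # drop incomplete records

# Transformation: aggregate to one row per customer per day.
summary = (df.groupby(["customer_id", "order_date"], as_index=False)
             .agg(total_amount=("amount", "sum")))
print(summary)
```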
LOAD
● Target Identification: Identify the target tables or structures in the data warehouse
where the transformed data will be loaded.

● Connection Establishment: Establish connections to the target data warehouse or database where the data will be loaded.

● Data Loading: Load the transformed data into the target tables or structures in the data warehouse. Loading
methods can vary, including bulk loading, incremental loading, or real-time streaming, depending on the
requirements and capabilities of the data warehouse.

● Error Handling: Handle errors encountered during the loading process, such as data validation failures,
integrity constraints violations, or connectivity issues. Implement mechanisms to log errors, retry failed
loads, and notify administrators of issues.
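
A minimal load sketch targeting a hypothetical SQLite warehouse table, with simple row-level error handling; production loads would typically rely on bulk or incremental loading utilities instead.

```python
import sqlite3

# Hypothetical target: a local SQLite file standing in for the warehouse.
def load(rows, db_path="warehouse.db"):
    conn = sqlite3.connect(db_path)                            # connection establishment
    conn.execute("""CREATE TABLE IF NOT EXISTS fact_sales
                    (customer_id INTEGER, order_date TEXT, total_amount REAL)""")
    failed = []
    for row in rows:                                           # row-by-row insert (not bulk)
        try:
            conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)
        except sqlite3.Error as exc:                           # e.g. constraint violations
            failed.append((row, str(exc)))                     # keep for logging / retry
    conn.commit()
    conn.close()
    return failed

errors = load([(1, "2024-01-01", 21.0), (2, "2024-01-02", 4.5)])
print(f"{len(errors)} rows failed to load")
```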
DATA MINING
● It is the process of discovering patterns, trends, correlations, and insights from
large datasets using various statistical, machine learning, and computational
techniques.

● It involves extracting actionable knowledge from data to support decision-making and improve business outcomes.
DATA MINING
● Data mining has widespread applications across various industries, including
finance, healthcare, retail, manufacturing, and telecommunications.

● It enables organizations to extract valuable insights from their data, improve decision-making processes, enhance customer satisfaction, and gain a competitive edge in the marketplace.
DATA MINING TECHNIQUES
● Classification: Classification techniques are used to categorize data into predefined
classes or labels based on input features.

● Clustering: Clustering techniques aim to group similar data points together into
clusters or segments based on their characteristics.

● Association Rule Mining: Association rule mining identifies interesting relationships or associations between variables in large datasets.

● Regression Analysis: Regression analysis techniques are used to model the relationship
between dependent and independent variables and predict numerical outcomes.
DATA MINING TECHNIQUES
● Anomaly Detection: Anomaly detection techniques identify unusual or abnormal
patterns in data that deviate from the norm.

● Neural Networks: Neural network techniques, inspired by the structure of the human
brain, are used for various data mining tasks, including classification, regression,
clustering, and pattern recognition.

● Text Mining: Text mining techniques analyze unstructured text data to extract
meaningful information and insights.
DATA MINING ALGORITHMS
● Data mining algorithms are computational techniques used to discover patterns,
relationships, trends, and insights from large datasets.

● These algorithms operate on data to extract meaningful information that can be used for various purposes such as classification, clustering, association rule mining, regression analysis, and anomaly detection.
CLASSIFICATION ALGORITHMS
● Decision Trees: Recursive partitioning of data based on feature attributes to
create a tree-like structure for classification.
● Random Forest: Ensemble learning technique that builds multiple decision trees
and aggregates their predictions to improve accuracy.
● Support Vector Machines (SVM): Supervised learning algorithms that find the
optimal hyperplane to separate data into different classes.
● k-Nearest Neighbors (k-NN): Instance-based learning algorithm that classifies
data points based on the majority class of their nearest neighbors.
● Logistic Regression: Statistical technique used to model the probability of
binary or multiclass outcomes.
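
A minimal classification sketch using scikit-learn's decision tree on the built-in iris dataset; any of the algorithms listed above could be substituted with the same fit/predict pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fit a shallow decision tree on the iris dataset and measure accuracy
# on a held-out test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)   # recursive partitioning
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```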
CLUSTERING ALGORITHMS
● K-Means Clustering: Partitioning algorithm that divides data into k clusters based on
similarity, aiming to minimize the within-cluster variance.
● Hierarchical Clustering: Agglomerative or divisive technique that creates a
hierarchical tree of clusters by iteratively merging or splitting clusters.
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Density-
based algorithm that groups data points into clusters based on their density within a
specified neighborhood.
● Mean-Shift Clustering: Iterative technique that shifts cluster centers towards regions
of higher density in the data space.
● Gaussian Mixture Models (GMM): Probabilistic model representing data points as a
mixture of several Gaussian distributions, useful for clustering and density estimation.
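
A minimal k-means sketch with scikit-learn on made-up 2-D points: the algorithm assigns each point to one of k clusters while minimizing within-cluster variance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points that form two obvious groups.
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [8.0, 8.2], [7.9, 8.1], [8.1, 7.8]])

# k-means assigns each point to one of k=2 clusters, iteratively moving
# the cluster centers to minimize within-cluster variance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels: ", kmeans.labels_)
print("centers:", kmeans.cluster_centers_)
```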
ASSOCIATION RULE MINING ALGORITHMS
● Apriori Algorithm: Frequent itemset mining algorithm that generates association
rules based on the presence of itemsets in transactions.

● Frequent Pattern Growth (FP-Growth): Tree-based algorithm that efficiently discovers frequent patterns by compressing the dataset into a frequent pattern tree.

● Eclat Algorithm: Depth-first search approach to mining frequent itemsets using an intersection-based technique.
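
A hand-rolled sketch of the core idea behind Apriori: compute itemset support and a rule's confidence over a few made-up transactions. Real implementations such as Apriori or FP-Growth prune candidate itemsets far more efficiently.

```python
from itertools import combinations
from collections import Counter

# Four made-up transactions (market baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Count all item pairs and keep those with support >= 0.5 (frequent itemsets).
pairs = Counter()
for t in transactions:
    pairs.update(frozenset(p) for p in combinations(sorted(t), 2))
frequent = {p for p in pairs if support(p) >= 0.5}
print("frequent pairs:", frequent)

# Rule {bread} -> {butter}: confidence = support({bread, butter}) / support({bread})
confidence = support({"bread", "butter"}) / support({"bread"})
print("confidence(bread -> butter):", round(confidence, 2))
```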
REGRESSION ALGORITHMS

● Linear Regression: Statistical method used to model the relationship between independent variables and a continuous dependent variable.
● Polynomial Regression: Extension of linear regression where the relationship
between variables is modeled using polynomial functions.
● Ridge Regression and Lasso Regression: Regularized regression techniques
that penalize large coefficients to prevent overfitting.
● Support Vector Regression (SVR): Regression algorithm based on support
vector machines, used for predicting continuous outcomes.
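
A minimal linear-regression sketch on synthetic data generated around y = 2x + 1; Ridge, Lasso, or SVR could be swapped in through the same fit/predict interface.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)                  # ordinary least squares
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x=4:", model.predict([[4.0]])[0])
```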
ANOMALY DETECTION ALGORITHMS
● Isolation Forest: Ensemble-based algorithm that isolates anomalies by randomly
partitioning data into trees.
● Local Outlier Factor (LOF): Density-based algorithm that measures the local
density of data points relative to their neighbors to detect outliers.
● One-Class SVM: Support vector machine algorithm trained on a single class of data
to detect anomalies based on deviations from the normal class distribution.
● Autoencoder Neural Networks: Unsupervised learning algorithm that learns to
reconstruct input data and identifies anomalies based on reconstruction errors.
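
A minimal anomaly-detection sketch with scikit-learn's Isolation Forest on synthetic data containing one obvious outlier; a prediction of -1 marks a point flagged as anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 100 "normal" points around the origin plus one far-away outlier.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
data = np.vstack([normal, [[8.0, 8.0]]])

detector = IsolationForest(contamination=0.01, random_state=0).fit(data)
labels = detector.predict(data)                        # 1 = normal, -1 = anomaly
print("anomalous row indices:", np.where(labels == -1)[0])
```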
THANK YOU FOR LISTENING
