
1- What are 15 different areas where DBMS is used?

15 Different Areas Where DBMS is Used:


1. Banking and Finance: Storing customer information, account details, transaction history,
market data, risk analysis.
2. E-commerce and Retail: Product information, customer information, orders, inventory
management, sales analysis.
3. Healthcare: Patient medical records, appointment scheduling, insurance claims, drug
interactions, research data.
4. Government and Public Services: Citizen records, voting data, tax forms, licenses,
social welfare programs.
5. Education: Student records, course information, grades, attendance, research data.
6. Human Resources: Employee information, payroll, benefits, performance evaluations,
training records.
7. Manufacturing and Supply Chain Management: Product design, inventory control,
production planning, supplier information, shipping logistics.
8. Telecommunications: Customer information, call records, network analytics, billing data.
9. Transportation and Logistics: Booking information, tracking and tracing, vehicle
maintenance, route optimization.
10. Media and Entertainment: Content management, user profiles, recommendation
engines, streaming statistics.
11. Social Media and Online Communities: User profiles, interactions, comments, content
moderation.
12. Scientific Research: Data collection, analysis, visualization, collaboration, experiment
tracking.
13. Marketing and Advertising: Customer segmentation, campaign tracking, lead
generation, market research.
14. Travel and Tourism: Booking information, itinerary planning, customer reviews, location
data.
15. Digital Libraries and Archives: Preserving and managing historical documents, books,
artifacts, research data.

These are just a few examples, and the list could go on! DBMSs are incredibly
versatile tools used in almost any field that needs to store, manage, and analyze
large amounts of data.
2- What types of databases are there, and what are an RDBMS and a DBMS? (What
types of databases exist, and what are a Relational Database Management
System and a Database Management System?)

Types of Databases:

 Relational databases: These are the most common type of database. They store data in
tables, which are made up of rows and columns. Each row represents a record, and
each column represents a field. Relational databases are good for storing structured
data, such as customer information or financial data.


 NoSQL databases: These are databases that do not use the traditional relational model.
They are more flexible than relational databases and can store unstructured data, such
as text, images, and videos. NoSQL databases are often used for big data applications.

 Object-oriented databases: These databases store data in objects, which are
collections of data and code. Object-oriented databases are used for applications that
require complex data structures.


 Graph databases: These databases store data in nodes and edges. Nodes represent
entities, and edges represent relationships between entities. Graph databases are good
for storing data that has complex relationships, such as social networks.

 Document databases: These databases store data in documents, which are collections
of key-value pairs. Document databases are good for storing unstructured data, such as
text, images, and videos.


- Relational Database Management System (RDBMS):

A relational database management system (RDBMS) is a software program that allows
you to create, manage, and access relational databases. Some popular RDBMSs
include MySQL, Oracle Database, and Microsoft SQL Server.

- Database Management System (DBMS):

A database management system (DBMS) is a software program that allows you to
create, manage, and access any type of database, not just relational databases. Some
popular DBMSs include IBM DB2, Oracle Database, and MongoDB.

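As a minimal illustration of working with a relational DBMS, the sketch below uses
Python's built-in sqlite3 module (SQLite is itself a small relational DBMS); the table,
columns, and data are invented purely for demonstration.

```python
# Minimal sketch: using a relational DBMS (SQLite via Python's built-in
# sqlite3 module) to create a table, insert rows, and query them.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Rows and columns: each row is a record, each column is a field.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Mumbai")],
)
conn.commit()

# SQL query against the relational model.
for row in cur.execute("SELECT name, city FROM customers ORDER BY name"):
    print(row)

conn.close()
```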


3- What are the types of graphs? (At least 20 types)
There are many different types of graphs, each with its own strengths and weaknesses.
Here are 20 common types of graphs:

Basic graphs:

 Line graph: Shows trends or changes over time.


 Bar graph: Compares different categories of data.
 Column graph: Similar to a bar graph, but the bars are vertical instead of horizontal.
 Pie chart: Shows how a whole is divided into parts.
 Scatter plot: Shows the relationship between two numerical variables.

Statistical graphs:

 Histogram: Shows the distribution of a numerical variable.


 Box plot: Shows the quartiles and outliers of a numerical variable.
 Density plot: Shows the probability distribution of a numerical variable.
 Error bar plot: Shows the mean and standard deviation of a numerical variable.

Specialized graphs:

 Heatmap: Shows the distribution of a two-dimensional variable.


 Radar chart: Shows multiple quantitative variables from the same center point.
 Sankey diagram: Shows the flow of data between different categories.
 Force-directed graph: Shows the relationships between nodes in a network.
 Chord diagram: Shows the relationships between multiple sets of data.

Other types of graphs:

 Dendrogram: Shows the hierarchical relationships between objects.


 Parallel coordinates plot: Shows the relationships between multiple variables for
individual data points.
 Violin plot: Shows the distribution of a numerical variable for several categories.
 Polar plot: Shows data points in a circular coordinate system.
 Sunburst chart: Shows hierarchical data using concentric circles.
 Treemap: Shows hierarchical data using nested rectangles.
 Network graph: Shows the relationships between nodes in a network.
 Timeline: Shows events or changes over time.
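
A few of the basic chart types above can be sketched with a plotting library. Here is a
rough sketch using matplotlib (assumed to be installed), with small made-up datasets
purely for illustration.

```python
# Sketch: drawing four of the basic chart types above with matplotlib.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Line graph: trend over time.
axes[0, 0].plot([2019, 2020, 2021, 2022], [10, 12, 9, 15])
axes[0, 0].set_title("Line graph")

# Bar graph: comparing categories.
axes[0, 1].bar(["A", "B", "C"], [5, 7, 3])
axes[0, 1].set_title("Bar graph")

# Pie chart: parts of a whole.
axes[1, 0].pie([40, 35, 25], labels=["X", "Y", "Z"])
axes[1, 0].set_title("Pie chart")

# Scatter plot: relationship between two numeric variables.
axes[1, 1].scatter([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
axes[1, 1].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```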

4- what are the types of Reports ?

There are many types of reports, but some of the most common include:

 Informational reports: These reports provide objective data or facts, such as
sales figures or employee absenteeism rates.
 Analytical reports: These reports go beyond simply presenting data and use it
to identify trends, solve problems, or make recommendations.
 Operational reports: These reports track the day-to-day activities of a
business, such as production levels or inventory levels.
 Progress reports: These reports track the progress of a project or task over
time.
 Proposal reports: These reports make recommendations for a course of
action, such as a new marketing campaign or a new product launch.

5- what is Normalization ?

In the context of databases, normalization is the process of organizing data in
a way that reduces redundancy and improves data integrity. Imagine a messy
desk with the same documents scattered everywhere. Normalization is like
tidying up that desk, putting each document in its proper place, and making
sure there are no duplicates.

Here's a short explanation of what normalization does:

 Reduces redundancy: This means storing data only once, eliminating the
need for duplicate entries. Think of it like avoiding having multiple copies of
the same book on your shelf.
 Improves data integrity: This ensures that data is accurate and consistent
throughout the database. If you update one copy of a document, all other
copies are automatically updated as well, preventing errors.
 Enhances efficiency: A well-normalized database is smaller and easier to
manage, which makes it faster to query and update data. It's like having a
clean and organized desk, where you can easily find what you need.

Normalization is achieved through a series of "normal forms," each with its
own set of rules for organizing data. The most common normal forms are:

 First normal form (1NF): Eliminates repeating groups of data.


 Second normal form (2NF): Removes partial dependencies, where a non-key
attribute depends on only part of the primary key.



 Third normal form (3NF): Eliminates all transitive dependencies, where a non-key
attribute depends on another non-key attribute.

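As a small illustration of the decomposition that normalization drives toward, here is a
sketch using Python's built-in sqlite3 module; the customers/orders tables and their
columns are hypothetical.

```python
# Sketch: normalizing a hypothetical "orders" table.  Instead of repeating
# customer details on every order row (redundancy), the data is split into
# a customers table and an orders table linked by a key, which is the kind
# of decomposition the normal forms drive toward.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Unnormalized idea (NOT created): orders(order_id, customer_name,
-- customer_city, product, price) repeats customer data on every row.

-- Normalized: customer facts stored once ...
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT
);

-- ... and each order refers to the customer by key only.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    product TEXT,
    price REAL
);
""")
conn.close()
```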

6- What are the different store sectors in D-Mart?


7- what is Data warehouse ? Draw figure and explain its architecture?

A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation for end-user computing within
the enterprise. Each data warehouse is different, but all are characterized by standard
vital components.

Data Warehouse Architecture: Basic

Operational System

An operational system is a term used in data warehousing to refer to a system that
is used to process the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.

Meta Data

Metadata is a set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into the
warehouse.
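
As a rough illustration of lightly versus highly summarized data, the following pandas
sketch (pandas assumed to be available) aggregates invented transaction records at two
levels of detail.

```python
# Sketch: producing "lightly" and "highly" summarized data from detailed
# transactions with pandas.  The sales records are invented; in a real
# warehouse the warehouse manager maintains such aggregates.
import pandas as pd

transactions = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]),
    "store": ["North", "South", "North", "South"],
    "amount": [120.0, 80.0, 150.0, 95.0],
})

# Lightly summarized: monthly totals per store (still fairly detailed).
monthly = (
    transactions
    .groupby([transactions["date"].dt.to_period("M"), "store"])["amount"]
    .sum()
)
print(monthly)

# Highly summarized: one total per store across all time.
print(transactions.groupby("store")["amount"].sum())
```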

End-User access Tools

The principal purpose of a data warehouse is to provide information to business
managers for strategic decision-making. These users interact with the warehouse
using end-user access tools.

Examples of end-user access tools include:

o Reporting and Query Tools


o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

8- Explain Data scale & types of Data scales with examples?

Data Scale: Understanding Size and Meaning


In statistics and data analysis, the data scale refers to the level of
measurement used to quantify data points. Essentially, it tells us what kinds of
meaningful comparisons and calculations we can make based on the data's
values. There are four main types of data scales, each with its own
characteristics and limitations:

1. Nominal Scale:

 Characteristics: Categorizes data into distinct groups without any inherent
order or ranking. Imagine sorting books by genre. Each genre (fantasy,
history, etc.) is distinct, but there's no order between them.
 Examples: Eye color (blue, green, brown), blood type (A, B, AB, O), job titles
(doctor, teacher, engineer).
 Operations allowed: Counting and identifying frequencies within each
category.

2. Ordinal Scale:

 Characteristics: Data points are ranked or ordered, but the intervals between
ranks are not necessarily equal. Think of movie ratings (1-5 stars). While we
know 4 stars is "better" than 2 stars, the difference in quality might not be the
same between all levels.
 Examples: Customer satisfaction ratings (poor, average, good, excellent),
socioeconomic status (low, middle, high), degree of injury (minor, moderate,
severe).
 Operations allowed: Ranking, identifying median and mode, comparing
relative order.

3. Interval Scale:
 Characteristics: Data points are ordered with equal intervals between them,
but there is no true zero point. Consider temperature in Celsius. The
difference between 20°C and 30°C is the same as 0°C and 10°C, but a
temperature of 0°C doesn't mean "no heat" at all.
 Examples: Temperature (Celsius, Fahrenheit), calendar years, IQ scores.
 Operations allowed: All operations of ordinal scales plus calculations like
addition, subtraction, finding mean and standard deviation.

4. Ratio Scale:

 Characteristics: Data points are ordered with equal intervals and have a true
zero point, meaning the absence of the measured quantity. Imagine money. A
balance of $0 truly means no money, and the difference between $10 and $20
is the same as $20 and $30.
 Examples: Age, time, distance, salary, weight (in grams).
 Operations allowed: All operations of interval scales plus calculations like
ratios and proportions.

Choosing the right data scale is crucial for proper analysis and interpretation.
Using operations beyond the scale's limitations can lead to misleading results.
Remember, scales tell us how much we can squeeze out of our data in terms
of meaningful comparisons and calculations.
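
The following sketch shows, with invented values and pandas (assumed to be available),
how the four scales map to different permissible operations.

```python
# Sketch: representing the four measurement scales in pandas and the kinds
# of operations each one supports.  The values are invented examples.
import pandas as pd

# Nominal: categories with no order -> counting frequencies is meaningful.
eye_color = pd.Series(["blue", "brown", "brown", "green"], dtype="category")
print(eye_color.value_counts())

# Ordinal: ordered categories -> ranking, median, and mode are meaningful.
rating = pd.Series(
    pd.Categorical(["good", "poor", "excellent", "good"],
                   categories=["poor", "average", "good", "excellent"],
                   ordered=True)
)
print(rating.min(), rating.max())

# Interval: equal intervals, no true zero -> differences and means are fine,
# but ratios ("twice as hot") are not meaningful.
celsius = pd.Series([10.0, 20.0, 30.0])
print(celsius.mean(), celsius.diff())

# Ratio: true zero -> ratios and proportions are meaningful too.
salary = pd.Series([30000, 60000])
print(salary[1] / salary[0])  # "twice as much" is a valid statement
```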

9- Explain quantitative & qualitative Data?

Data comes in many flavors, but two of the most fundamental types are
quantitative and qualitative. Understanding the difference between them is
crucial for effectively analyzing information and gaining insights.

Quantitative Data:

 Numbers-based: This data consists of numerical values that can be counted
or measured. Examples include age, income, weight, temperature, distance,
test scores, or the number of customers.
 Focus on "how many, how much, or how often": Quantitative data reveals
patterns and trends through statistical analysis. It answers questions like:
o How many people prefer apples over oranges?
o How much has the average temperature risen in the past decade?
o How often do customers visit our website?
 Analysis Tools: Spreadsheets, statistical software (e.g., SPSS, R), and charts
are used to analyze quantitative data.

Qualitative Data:

 Descriptive: This data involves words, text, images, or sounds that describe
qualities or characteristics. Examples include interview transcripts, open-
ended survey responses, observations, photographs, or video recordings.
 Focus on "why, how, or what": Qualitative data provides insights into
motivations, experiences, and meanings. It answers questions like:
o Why do people choose to buy organic products?
o How do students feel about the new teaching methods?
o What are the key themes emerging from customer reviews?
 Analysis Tools: Thematic analysis, coding, discourse analysis, and narrative
analysis are used to analyze qualitative data.

Key Differences:

Feature             | Quantitative Data                  | Qualitative Data
Type of information | Numbers, measurements              | Words, text, images, sounds
Focus               | "How many, how much, how often"    | "Why, how, what"
Analysis            | Statistical tools                  | Thematic analysis, coding, etc.
Strengths           | Precise, objective, generalizable  | Rich, detailed, insightful
Weaknesses          | Limited in scope, can miss nuances | Subjective, difficult to quantify

10- what are the methods of Data collection?


Data collection methods are like tools in a researcher's toolbox, each serving
a specific purpose and suited to different situations. Choosing the right
method depends on factors like the type of data needed, the research
question, and available resources. Here's a breakdown of some common
methods:

Primary Data Collection:

 Surveys: Gather information from a large group of people through
questionnaires (online, paper, phone). Useful for quantitative data like
demographics, opinions, and preferences.
 Interviews: In-depth one-on-one conversations to explore experiences,
motivations, and perspectives. Yield rich qualitative data but can be time-
consuming.
 Observations: Recording behavior or phenomena directly. Can be structured
(following a protocol) or unstructured (flexible and exploratory). Good for
understanding natural settings and non-verbal cues.
 Focus groups: Gather a small group of people for moderated discussions on a
specific topic. Useful for generating ideas, understanding shared experiences,
and gauging initial reactions.
 Experiments: Manipulate variables to observe cause-and-effect relationships.
Often used in scientific research and testing interventions.

Secondary Data Collection:

 Existing datasets: Utilize data already collected by others, like government
records, census data, or market research reports. Saves time and resources
but may not be specific enough for your needs.
 Published materials: Articles, books, and other media can provide valuable
insights and context. Good for background research and literature reviews.

Additional Considerations:

 Sampling: Selecting a representative subset of a larger population to collect
data from. Ensures your findings can be generalized to the whole population.
 Ethical considerations: Data collection should be respectful of privacy,
informed consent, and anonymity/confidentiality.

Remember, the best method is often a combination of approaches. Choose
the tools that best fit your research needs and ensure reliable and insightful
data collection.

11- Explain in detail VUCA with respect to Data?

VUCA and Data: A Tangled Relationship


The VUCA acronym (Volatility, Uncertainty, Complexity, Ambiguity) describes the
current dynamic and unpredictable environment that organizations and individuals face.
When we examine data through this lens, it presents both challenges and opportunities:

Challenges:

 Volatility: Data sources can be highly volatile, with rapid fluctuations and unexpected
changes. Think about how quickly social media trends or economic indicators can shift.
This makes it difficult to rely on historical data for future predictions.
 Uncertainty: The meaning and interpretation of data can be uncertain, especially in
complex systems. Correlations might not imply causation, and biases can be hidden
within datasets. This challenges our ability to draw clear conclusions from data.
 Complexity: The sheer volume and variety of data can be overwhelming. Extracting
insights requires sophisticated tools and expertise, leaving many organizations
struggling to harness the full potential of their data.
 Ambiguity: Data often lacks clear context or explanation, leading to ambiguity in its
interpretation. Different stakeholders might draw different conclusions from the same
data set, leading to confusion and disagreements.

Opportunities:
 VUCA-proofing: With careful analysis and robust data governance, organizations can
build systems that are more resilient to VUCA fluctuations. By actively monitoring data
for unusual patterns and adapting as needed, they can become more agile and
responsive to change.
 Deeper insights: Advanced analytics tools and techniques can cut through the
complexity of data to reveal hidden patterns and trends. This can lead to better
understanding of customer behavior, market dynamics, and operational efficiency.
 Decision-making under uncertainty: While absolute certainty might be elusive, data can
still inform decision-making even in uncertain environments. By using probabilistic
models and scenario planning, organizations can make informed choices even when
the future is unclear.
 Transparency and trust: Data can be used to promote transparency and trust in
decision-making. By sharing data insights and making decision-making processes more
visible, organizations can build stronger relationships with stakeholders.

In conclusion, VUCA and data are intrinsically linked. While the VUCA environment
presents challenges for data-driven decision-making, it also creates opportunities for
organizations that can harness the power of data effectively. By embracing agility,
adopting robust data governance, and investing in advanced analytics tools,
organizations can navigate the VUCA landscape and turn data into a strategic
advantage.

12- Explain the steps of data processing?

Stages of data processing


Data processing involves transforming raw data into valuable information, and it usually
follows these key steps:

1. Data Collection:

This first step gathers or collects data from various sources like sensors, databases,
websites, surveys, or experiments. The chosen method depends on your specific data
needs and goals.

2. Data Preparation:

Here, you make the raw data usable for analysis. This often involves:

 Cleaning: Removing errors, inconsistencies, and missing values.

 Transformation: Formatting data into a consistent structure, converting units, and
handling outliers.
 Integration: Combining data from multiple sources if needed.

3. Data Input:

The prepared data is then loaded into a chosen platform for analysis, like a data
warehouse, spreadsheet, or statistical software.

4. Data Processing:

This is where you analyze and manipulate the data to extract insights. This can involve:

 Descriptive statistics: Summarizing the data through measures like mean, median, and
standard deviation.

 Data visualization: Creating charts, graphs, and other visual representations to
understand patterns and trends.

 Modeling: Building statistical or machine learning models to predict future outcomes or
relationships within the data.

5. Data Output:

The extracted insights are presented in a clear and concise way, often through reports,
dashboards, or visualizations.

6. Data Storage:

Finally, the processed data is saved securely for future use, analysis, or reference.

These steps form a general roadmap, but the specific tasks and tools used may vary
depending on the type and complexity of your data and the desired analysis.
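
A compact sketch of these steps with pandas (assumed to be available); the records and
output file name are invented, and in practice step 1 would read from a real source such
as a file, database, or API.

```python
# Sketch of the data processing steps above on invented records.
import pandas as pd

# 1. Collection / 3. Input: raw records loaded into the analysis platform.
raw = pd.DataFrame({
    "region":  ["North", "South", "South", "North", None],
    "revenue": [120.0, 95.0, 95.0, 80.0, 60.0],
})

# 2. Preparation: cleaning (drop missing values and duplicate rows).
clean = raw.dropna().drop_duplicates()

# 4. Processing: descriptive statistics and a simple aggregation.
print(clean.describe())
print(clean.groupby("region")["revenue"].sum())

# 5. Output / 6. Storage: save the result for reporting and future use.
clean.to_csv("sales_clean.csv", index=False)   # hypothetical output path
```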

13- Explain Data Mart and Data lake in Detail?


A data mart is a subject-oriented, integrated, time-variant, non-volatile collection of
data in support of decision-making processes for a specific department or business
unit within an organization.

Essentially, it's a smaller, more focused version of a data warehouse that caters to
the specific needs of a particular team or business area. Here's a breakdown of its
key characteristics:

1. Subject-oriented: Data marts are built around specific topics or areas of interest,
such as marketing, sales, finance, or human resources. This means they only
contain data relevant to that particular subject, making it easier for users to find the
information they need.
2. Integrated: Data marts integrate data from various sources, both internal and
external, into a single, consistent format. This eliminates the need for users to
access and merge data from multiple disparate systems.

3. Time-variant: Data marts typically track data over time, allowing users to analyze
trends and patterns. This historical data can be crucial for identifying areas for
improvement and making informed decisions.

4. Non-volatile: Unlike operational databases that are constantly being updated, data
marts are relatively static. This means that data once loaded into a data mart is not
typically subject to frequent changes, making it more reliable for analysis.

5. Decision-making support: Ultimately, the purpose of a data mart is to support
decision-making processes within a specific department or business unit. By
providing users with easy access to relevant, reliable data, data marts can help
improve operational efficiency, identify new opportunities, and make better business
decisions.
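
As a small illustration of the subject-oriented idea, the sketch below derives a
marketing-focused data mart from a wider warehouse table using pandas; all table
names, columns, and values are invented.

```python
# Sketch: deriving a small "marketing" data mart from a wider warehouse
# table with pandas.
import pandas as pd

warehouse = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-01-11", "2024-01-12"]),
    "department": ["marketing", "finance", "marketing"],
    "campaign": ["email", None, "social"],
    "spend": [500.0, 0.0, 750.0],
})

# Subject-oriented: keep only the rows and columns marketing cares about.
marketing_mart = (
    warehouse[warehouse["department"] == "marketing"]
    .loc[:, ["date", "campaign", "spend"]]
    .reset_index(drop=True)
)
print(marketing_mart)
```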

Data Lake Explained in Detail


A data lake is essentially a giant container that can hold a massive amount of data in its
raw, native format. Imagine it like a digital warehouse, but instead of neatly organizing
everything into shelves and categories, it just throws everything in together. This
includes structured data like spreadsheets and databases, as well as unstructured data
like emails, social media posts, and sensor readings.

Here are some key characteristics of a data lake:

 Scalability: Data lakes can scale up easily to accommodate whatever amount of data
you throw at them. This is because they typically use object storage, which is a cost-
effective way to store large amounts of data.
 Flexibility: You can store any type of data in a data lake, regardless of its structure or
format. This makes them ideal for organizations that deal with a lot of diverse data.
 Accessibility: Data lakes are designed to be easily accessible by data analysts and
scientists. This allows them to quickly and easily find the data they need for their
analyses.
 Cost-effectiveness: Compared to data warehouses, data lakes are typically more cost-
effective, especially for storing large amounts of data.

But it's not all sunshine and rainbows. Here are some potential drawbacks of data lakes:
 Complexity: Managing a data lake can be complex, especially as it grows in size.
It requires specialized skills and expertise to ensure that the data is properly
organized and secured.
 Data quality: Because data lakes store everything, it's easy for low-quality or
irrelevant data to creep in. This can make it difficult to find the data you need and
can lead to inaccurate results.
 Security: Ensuring the security of all that data in a data lake is crucial.
Organizations need to have strong security measures in place to prevent
unauthorized access or breaches.

14- Explain the example steps by step of Data processing ?

Example of Data Processing: Analyzing Website Traffic for Marketing Insights

Data Source: Website server logs containing information about user visits, such as
timestamp, IP address, page visited, referrer, etc.

Steps:

1. Data Collection:

 Download server logs, which might be in plain text or a specific format like Apache
Combined Log Format.
 Use data extraction tools if necessary to pull relevant data from log files.

2. Data Cleaning:

 Identify and remove incomplete or invalid entries.


 Filter out bot traffic and unwanted requests.
 Address duplicate records.
 Standardize formatting for consistency (e.g., date format, page names).

3. Data Transformation:

 Create new variables based on existing data (e.g., "session duration" from timestamps).
 Group data by relevant dimensions (e.g., page, referrer, user agent).
 Calculate aggregate statistics (e.g., total visits, unique visitors, average session
duration).

4. Data Analysis:

 Visualize data using charts and graphs to identify trends and patterns.
 Compare data across different dimensions to understand user behavior.
 Use statistical tests to assess the significance of findings.

5. Output and Insights:

 Generate reports and dashboards to share findings with stakeholders.


 Develop actionable insights for marketing campaigns (e.g., optimizing website content,
targeting ads, improving user experience).
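
A compact sketch of steps 1-4 with pandas (assumed to be available); the log lines here
are simplified and invented, so a real parser for a format such as the Apache Combined
Log Format would need a proper regular expression.

```python
# Sketch of steps 1-4 above on a tiny, made-up access log.
import pandas as pd
from io import StringIO

log_text = """timestamp,ip,page,referrer
2024-03-01 10:00:00,203.0.113.5,/home,google
2024-03-01 10:00:30,203.0.113.5,/pricing,/home
2024-03-01 10:05:00,198.51.100.7,/home,direct
"""

# 1. Collection / 2. Cleaning: load, parse timestamps, drop bad rows.
visits = pd.read_csv(StringIO(log_text), parse_dates=["timestamp"]).dropna()

# 3. Transformation: aggregate statistics per page.
page_stats = visits.groupby("page").agg(
    total_visits=("ip", "size"),
    unique_visitors=("ip", "nunique"),
)

# 4. Analysis / output: inspect the result (or plot / report it).
print(page_stats)
```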

15- Explain VUCA in detail ?


VUCA in Data Warehouses: Navigating Uncertainty with
Insights
Data warehouses have traditionally aimed to provide a stable and reliable source of
information for organizations to make informed decisions. However, the modern world
presents a different reality - one characterized by VUCA:

 Volatility: Rapid changes in markets, regulations, and technologies create constant
upheaval.
 Uncertainty: Difficult to predict future outcomes due to unpredictable events and limited
information.
 Complexity: Interconnected systems and diverse data sources create intricate
landscapes.
 Ambiguity: Multiple interpretations and unclear cause-and-effect relationships make
meaning elusive.

These factors pose significant challenges for data warehouses and the insights they
offer. Here's how VUCA impacts them:

Challenges:
 Data currency: Keeping data fresh and relevant in a volatile environment can be difficult.
 Data quality: Uncertain data sources and complex systems can lead to errors and
inconsistencies.
 Data access: Users may struggle to navigate complex data and find relevant insights
quickly.
 Data interpretation: Ambiguity can make it hard to draw clear conclusions from the data.

Opportunities:

 Real-time integration: Integrate real-time data sources to capture volatility and make
more timely decisions.
 Data governance: Implement robust data quality checks and processes to ensure data
accuracy.
 Self-service analytics: Empower users with intuitive tools to explore and analyze data
effectively.
 Advanced analytics: Use AI and machine learning to uncover hidden patterns and gain
deeper insights.

Strategies for navigating VUCA with data warehouses:

 Agile data architectures: Flexible and scalable architectures to adapt to changing needs.
 Data virtualization: Provide unified access to diverse data sources without physically
moving data.
 Cloud-based solutions: Leverage scalability, agility, and cost-effectiveness of cloud
platforms.
 Focus on data lineage: Track data provenance to understand its origin and reliability.
 Data storytelling: Present complex insights in a clear and compelling way.

By adopting these strategies, data warehouses can transform from static repositories
into dynamic tools for navigating the VUCA world. They can empower organizations to
make data-driven decisions amidst uncertainty, complexity, and ambiguity, ultimately
leading to better outcomes.
16- Explain KDD in detail with advantages & disadvantages with
diagram & example?
In the context of computer science, “Data Mining” can be referred to as
knowledge mining from data, knowledge extraction, data/pattern analysis, data
archaeology, and data dredging. Data Mining also known as Knowledge
Discovery in Databases, refers to the nontrivial extraction of implicit, previously
unknown and potentially useful information from data stored in databases.
The need of data mining is to extract useful information from large datasets and
use it to make predictions or better decision-making. Nowadays, data mining is
used in almost all places where a large amount of data is stored and processed.
For example: the banking sector, Market Basket Analysis, and Network Intrusion
Detection.

KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets. The KDD process is iterative and may require multiple passes
through its steps to extract accurate knowledge from the data. The following steps
are included in the KDD process:
Data Cleaning
Data cleaning is defined as removal of noisy and irrelevant data from
collection.
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with Data discrepancy detection and Data transformation tools.
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources
into a common source (data warehouse). Data integration is performed using data
migration tools, data synchronization tools, and the ETL (Extract, Transform, Load)
process.
Data Selection
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this, we can use neural
networks, decision trees, Naive Bayes, clustering, and regression methods.
Data Transformation
Data Transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. Data Transformation is a two-step
process:
1. Data Mapping: Assigning elements from source base to destination to
capture transformations.
2. Code generation: Creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially useful
patterns. It transforms task-relevant data into patterns and decides the purpose of
the model using classification or characterization.
Pattern Evaluation
Pattern Evaluation is defined as identifying truly interesting patterns
representing knowledge based on given interestingness measures. It finds an
interestingness score for each pattern, and uses summarization and visualization to
make the results understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Note: KDD is an iterative process where evaluation measures can be
enhanced, mining can be refined, and new data can be integrated and transformed
in order to get different and more appropriate results. Preprocessing of
databases consists of data cleaning and data integration.
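
As a rough end-to-end sketch of these KDD steps, the following code uses pandas and
scikit-learn (both assumed to be available); the customer records and column names are
hypothetical, and in practice the data would come from a database or file.

```python
# Rough sketch of the KDD steps on invented customer records.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "age":                [23, 45, 31, 52, 36, 29, None],
    "annual_income":      [30, 90, 50, 110, 60, 40, 55],   # in thousands
    "purchases_per_year": [5, 20, 9, 25, 12, 7, 10],
})

# Data cleaning: remove rows with missing values.
clean = customers.dropna()

# Data selection: keep only attributes relevant to the analysis.
selected = clean[["age", "annual_income", "purchases_per_year"]]

# Data transformation: scale features into a comparable range.
X = StandardScaler().fit_transform(selected)

# Data mining: discover groups of similar customers.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Pattern evaluation / knowledge representation: summarize each cluster.
print(selected.assign(cluster=labels).groupby("cluster").mean())
```
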
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge
that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks
and makes the data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can help
them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can
forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting
and analyzing large amounts of data, which can include sensitive information
about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills
and knowledge to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences,
such as bias or discrimination, if the data or models are not properly
understood or used.
4. Data Quality: KDD process heavily depends on the quality of data, if data is
not accurate or consistent, the results can be misleading
5. High cost: KDD can be an expensive process, requiring significant
investments in hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common
problem in machine learning where a model learns the detail and noise in
the training data to the extent that it negatively impacts the performance of
the model on new unseen data.
17- Explain one-, two- & three-tier architecture with diagram?

Types of Data Warehouse Architectures

Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the
amount of data stored; to reach this goal, it removes data redundancies.

In this architecture, the only layer physically available is the source layer. In this method,
data warehouses are virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an
intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for
separation between analytical and transactional processing. Analysis queries are submitted
to operational data after the middleware interprets them. In this way, queries affect
transactional workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system.
Although it is typically called two-layer architecture to highlight the separation between
physically available sources and the data warehouse, it in fact consists of four subsequent
data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data
is stored initially in corporate relational databases or legacy databases, or it may come
from an information system outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-named Extraction, Transformation, and Loading (ETL) tools
can combine heterogeneous schemata, extract, transform, cleanse, validate, filter,
and load source data into a data warehouse (a minimal ETL sketch appears after this list).
3. Data Warehouse layer: Information is saved to one logically centralized individual
repository: a data warehouse. The data warehouse can be directly accessed, but it can
also be used as a source for creating data marts, which partially replicate data warehouse
contents and are designed for specific enterprise departments. Meta-data repositories
store information on sources, access procedures, data staging, users, data mart schema,
and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should
feature aggregate information navigators, complex query optimizers, and user-friendly
GUIs.
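
As a minimal sketch of the data-staging (ETL) stage mentioned above, the following code
uses pandas and SQLite (both assumed to be available); the two "source systems" are
represented by invented DataFrames, and the warehouse file and table names are
hypothetical.

```python
# Minimal ETL sketch: extract from two sources, transform to one schema,
# load into a SQLite "warehouse".
import sqlite3
import pandas as pd

# Extract: pull data from two heterogeneous sources.
crm = pd.DataFrame({"name": ["Asha", "Ravi"], "city": ["Pune", "Mumbai"]})
erp = pd.DataFrame({"customer_name": ["Meera"], "region": ["Delhi"]})

# Transform: cleanse and align both sources to one standard schema.
erp = erp.rename(columns={"customer_name": "name", "region": "city"})
merged = pd.concat([crm, erp], ignore_index=True).dropna().drop_duplicates()

# Load: write the integrated data into the warehouse layer.
with sqlite3.connect("warehouse.db") as conn:      # hypothetical warehouse
    merged.to_sql("dim_customer", conn, if_exists="replace", index=False)
```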

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and data
warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data
model for a whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population. In some cases,
the reconciled layer is also directly used to accomplish better some operational tasks,
such as producing daily reports that cannot be satisfactorily prepared using the
corporate applications or generating data flows to feed external processes periodically
to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A
disadvantage of this structure is the extra file storage space used by the redundant
reconciled layer. It also makes the analytical tools a little further away from
being real-time.
18- Explain data cube in data mining ?

What is OLAP?
OLAP stands for Online Analytical Processing, which is a technology that
enables multi-dimensional analysis of business data. It provides interactive
access to large amounts of data and supports complex calculations and data
aggregation. OLAP is used to support business intelligence and decision-
making processes.
Grouping of data in a multidimensional matrix is called a data cube. In data
warehousing, we generally deal with various multidimensional data models as the
data will be represented by multiple dimensions and multiple attributes. This
multidimensional data is represented in the data cube as the cube represents a
high-dimensional space. The Data cube pictorially shows how different
attributes of data are arranged in the data model. Below is the diagram of a
general data cube.

The example above is a 3D cube having attributes like branch (A, B, C, D), item
type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999).

Data cube classification:


The data cube can be classified into two categories:
 Multidimensional data cube: It basically helps in storing large amounts of
data by making use of a multi-dimensional array. It increases its efficiency by
keeping an index of each dimension. Thus, it is able to retrieve data quickly.
 Relational data cube: It basically helps in storing large amounts of data by
making use of relational tables. Each relational table displays the
dimensions of the data cube. It is slower compared to a Multidimensional
Data Cube.
Data cube operations:

Data cube operations are used to manipulate data to meet the needs of users.
These operations help to select particular data for the analysis purpose. There
are mainly 5 operations listed below-
 Roll-up: this operation aggregates similar data attributes along the same
dimension, moving to a coarser level of granularity. For example, if the data cube
displays the daily income of a customer, we can use a roll-up operation to find
his monthly income.

 Drill-down: this operation is the reverse of the roll-up operation. It allows us
to take a particular piece of information and then subdivide it further for finer
granularity analysis. It zooms into more detail. For example, if India is an
attribute of a country column and we wish to see villages in India, then the
drill-down operation splits India into states, districts, towns, cities, and villages,
and then displays the required information.

 Slicing: this operation filters out the unnecessary portions. Suppose in a
particular dimension, the user doesn't need everything for analysis, but rather a
particular attribute. For example, country="Jamaica" will display data only for
Jamaica and not for the other countries present in the country list.

 Dicing: this operation performs a multidimensional cut; it restricts not just
one dimension but can also go to another dimension and cut a certain range of it.
As a result, it looks more like a subcube out of the whole cube. For example,
the user wants to see the annual salary of Jharkhand state employees.

 Pivot: this operation is very important from a viewing point of view. It


basically transforms the data cube in terms of view. It doesn’t change the
data present in the data cube. For example, if the user is comparing year
versus branch, using the pivot operation, the user can change the viewpoint
and now compare branch versus item type.
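
These operations can be imitated on a small scale with a pandas pivot table; the sketch
below uses invented sales records, and a real OLAP engine would of course do this at
much larger scale.

```python
# Sketch: imitating roll-up, slice, dice, and pivot with a pandas pivot table.
import pandas as pd

sales = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "A", "B"],
    "item_type": ["home", "phone", "home", "phone", "phone", "home"],
    "year":   [1997, 1997, 1998, 1998, 1999, 1999],
    "amount": [100, 150, 120, 170, 160, 130],
})

# A small "cube": branch x item_type x year with summed amounts.
cube = pd.pivot_table(sales, values="amount",
                      index=["branch", "item_type"], columns="year",
                      aggfunc="sum")

# Roll-up: aggregate away the item_type dimension (coarser granularity).
rolled_up = cube.groupby(level="branch").sum()

# Slice: fix one dimension to a single value (year == 1998).
slice_1998 = cube[1998]

# Dice: restrict ranges on two dimensions at once.
dice = sales[(sales["branch"] == "A") & (sales["year"] >= 1998)]

# Pivot: swap the axes of the view without changing the data.
pivoted = cube.T

print(rolled_up, slice_1998, dice, pivoted, sep="\n\n")
```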

Advantages of data cubes:

 Multi-dimensional analysis: Data cubes enable multi-dimensional analysis


of business data, allowing users to view data from different perspectives and
levels of detail.
 Interactivity: Data cubes provide interactive access to large amounts of
data, allowing users to easily navigate and manipulate the data to support
their analysis.
 Speed and efficiency: Data cubes are optimized for OLAP analysis,
enabling fast and efficient querying and aggregation of data.
 Data aggregation: Data cubes support complex calculations and data
aggregation, enabling users to quickly and easily summarize large amounts
of data.
 Improved decision-making: Data cubes provide a clear and
comprehensive view of business data, enabling improved decision-making
and business intelligence.
 Accessibility: Data cubes can be accessed from a variety of devices and
platforms, making it easy for users to access and analyze business data
from anywhere.
 Helps in giving a summarized view of data.
 Data cubes store large data in a simple way.
 Data cube operations provide quick and better analysis.
 Improve the performance of data retrieval.

Disadvantages of data cube:

 Complexity: OLAP systems can be complex to set up and maintain,


requiring specialized technical expertise.
 Data size limitations: OLAP systems can struggle with very large data sets
and may require extensive data aggregation or summarization.
 Performance issues: OLAP systems can be slow when dealing with large
amounts of data, especially when running complex queries or calculations.
 Data integrity: Inconsistent data definitions and data quality issues can
affect the accuracy of OLAP analysis.
 Cost: OLAP technology can be expensive, especially for enterprise-level
solutions, due to the need for specialized hardware and software.
 Inflexibility: OLAP systems may not easily accommodate changing
business needs and may require significant effort to modify or extend.

19- Explain different types of data cube ?


Data cube classification:
The data cube can be classified into two categories:
 Multidimensional data cube: It basically helps in storing large amounts of
data by making use of a multi-dimensional array. It increases its efficiency by
keeping an index of each dimension. Thus, it is able to retrieve data quickly.
 Relational data cube: It basically helps in storing large amounts of data by
making use of relational tables. Each relational table displays the
dimensions of the data cube. It is slower compared to a multidimensional
data cube.

