Tutorial Answer Unit 1
These are just a few examples, and the list could go on! DBMSs are incredibly
versatile tools used in almost any field that needs to store, manage, and analyze
large amounts of data.
2- What types of databases are there, and what are a Relational Database
Management System (RDBMS) and a Database Management System (DBMS)?
Types of Databases:
Relational databases: These are the most common type of database. They store data in
tables, which are made up of rows and columns. Each row represents a record, and
each column represents a field. Relational databases are good for storing structured
data, such as customer information or financial data.
Relational Database
NoSQL databases: These are databases that do not use the traditional relational model.
They are more flexible than relational databases and can store unstructured data, such
as text, images, and videos. NoSQL databases are often used for big data applications.
Object-oriented databases: These databases store data in objects, which are
collections of data and code. Object-oriented databases are used for applications that
require complex data structures.
Object-Oriented Database
Graph databases: These databases store data in nodes and edges. Nodes represent
entities, and edges represent relationships between entities. Graph databases are good
for storing data that has complex relationships, such as social networks.
Graph Database
Document databases: These databases store data in documents, which are collections
of key-value pairs. Document databases are good for storing unstructured data, such as
text, images, and videos.
Document Databases
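To make these differences concrete, here is a minimal sketch, using plain Python structures with invented field names and sample values, of how the same customer data might look in a relational table, a document store, and a graph model:

```python
# Illustrative only: the same customer data modeled three ways.

# Relational model: rows in a table with a fixed set of columns.
customers_table = [
    # (customer_id, name, city)
    (1, "Alice", "Cairo"),
    (2, "Omar", "Alexandria"),
]

# Document model: each record is a self-contained set of key-value pairs,
# and different documents may carry different fields.
customer_documents = [
    {"_id": 1, "name": "Alice", "city": "Cairo", "tags": ["vip"]},
    {"_id": 2, "name": "Omar", "interests": {"sports": ["football"]}},
]

# Graph model: entities are nodes, relationships are edges.
nodes = {1: {"name": "Alice"}, 2: {"name": "Omar"}}
edges = [(1, 2, "FRIEND_OF")]  # an edge: Alice knows Omar

print(customers_table[0], customer_documents[1], edges[0])
```

The point is not the syntax but the shape of the data: fixed rows and columns, flexible self-describing documents, and explicit nodes and relationships.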
Basic graphs:
Statistical graphs:
Specialized graphs:
There are many types of reports, but some of the most common include:
5- What is Normalization?
Normalization is the process of organizing the tables of a relational database to reduce
redundancy and improve data integrity. Its main benefits are:
Reduces redundancy: This means storing data only once, eliminating the
need for duplicate entries. Think of it like avoiding having multiple copies of
the same book on your shelf.
Improves data integrity: This ensures that data is accurate and consistent
throughout the database. Because each fact is stored in only one place, an
update needs to be made only once, so inconsistent copies cannot arise.
Enhances efficiency: A well-normalized database is smaller and easier to
manage, which makes it faster to query and update data. It's like having a
clean and organized desk, where you can easily find what you need.
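As a small, hedged illustration of these benefits (the table and column names are invented for the example), the sketch below uses Python's built-in sqlite3 module to store each customer only once and let orders reference the customer by key:

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: customer details are stored exactly once ...
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
# ... and each order refers to the customer by its key instead of repeating the details.
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), amount REAL)"
)

cur.execute("INSERT INTO customers VALUES (1, 'Alice', 'Cairo')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 50.0), (2, 1, 75.0)])

# Because the city is stored only once, a single update is enough;
# every order "sees" the change through the join, so no copies can fall out of sync.
cur.execute("UPDATE customers SET city = 'Giza' WHERE id = 1")
for row in cur.execute(
    "SELECT o.id, c.name, c.city, o.amount "
    "FROM orders o JOIN customers c ON o.customer_id = c.id"
):
    print(row)

conn.close()
```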
Operational System
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.
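A minimal sketch of such document metadata, with invented field names and values, as a Python dictionary:

```python
# Data about a document, not its contents.
document_metadata = {
    "author": "J. Smith",          # who created the document
    "date_created": "2023-01-15",  # when it was built
    "date_modified": "2023-02-02", # when it last changed
    "file_size_bytes": 48213,      # how large it is
}

# Metadata makes documents easy to find without opening them,
# e.g. checking whether a file was modified after a given date.
print(document_metadata["date_modified"] > "2023-01-31")
```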
The summarized data area of the data warehouse saves all the predefined lightly and
highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into the
warehouse.
1. Nominal Scale:
Characteristics: Data points are labels or categories with no inherent order or
ranking. Think of eye color: "blue" is not more or less than "brown", just different.
Examples: Gender, blood type, nationality, marital status.
Operations allowed: Counting frequencies, identifying the mode, checking equality.
2. Ordinal Scale:
Characteristics: Data points are ranked or ordered, but the intervals between
ranks are not necessarily equal. Think of movie ratings (1-5 stars). While we
know 4 stars is "better" than 2 stars, the difference in quality might not be the
same between all levels.
Examples: Customer satisfaction ratings (poor, average, good, excellent),
socioeconomic status (low, middle, high), degree of injury (minor, moderate,
severe).
Operations allowed: Ranking, identifying median and mode, comparing
relative order.
3. Interval Scale:
Characteristics: Data points are ordered with equal intervals between them,
but there is no true zero point. Consider temperature in Celsius. The
difference between 20°C and 30°C is the same as 0°C and 10°C, but a
temperature of 0°C doesn't mean "no heat" at all.
Examples: Temperature (Celsius, Fahrenheit), calendar years, IQ scores.
Operations allowed: All operations of ordinal scales plus calculations like
addition, subtraction, finding mean and standard deviation.
4. Ratio Scale:
Characteristics: Data points are ordered with equal intervals and have a true
zero point, meaning the absence of the measured quantity. Imagine money. A
balance of $0 truly means no money, and the difference between $10 and $20
is the same as $20 and $30.
Examples: Age, time, distance, salary, weight (in grams).
Operations allowed: All operations of interval scales plus calculations like
ratios and proportions.
Choosing the right data scale is crucial for proper analysis and interpretation.
Using operations beyond the scale's limitations can lead to misleading results.
Remember, scales tell us how much we can squeeze out of our data in terms
of meaningful comparisons and calculations.
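The sketch below, with invented sample values, hints at which operations are meaningful on each scale; it is only an illustration of the rules above:

```python
import statistics

# Nominal: categories only -> counting and the mode are meaningful.
eye_colors = ["brown", "blue", "brown", "green"]
print("mode:", statistics.mode(eye_colors))

# Ordinal: ordered ranks -> the median is meaningful, but differences are not.
satisfaction = [1, 3, 4, 4, 5]  # 1 = poor ... 5 = excellent
print("median rating:", statistics.median(satisfaction))

# Interval: equal intervals, no true zero -> means and differences are fine,
# but ratios are not (20 degrees C is not "twice as hot" as 10 degrees C).
temps_c = [10.0, 20.0, 25.0]
print("mean temperature:", statistics.mean(temps_c))

# Ratio: true zero -> ratios are meaningful ($20 really is twice $10).
balances = [10.0, 20.0, 40.0]
print("balance ratio:", balances[1] / balances[0])
```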
Data comes in many flavors, but two of the most fundamental types are
quantitative and qualitative. Understanding the difference between them is
crucial for effectively analyzing information and gaining insights.
Quantitative Data:
Numerical: This data involves numbers, counts, and measurements that can be
quantified. Examples include sales figures, temperatures, test scores, or survey
ratings on a numeric scale.
Focus on "how many, how much, or how often": Quantitative data answers
questions that can be expressed and compared numerically.
Analysis Tools: Descriptive statistics, correlation, and regression are used to
analyze quantitative data.
Qualitative Data:
Descriptive: This data involves words, text, images, or sounds that describe
qualities or characteristics. Examples include interview transcripts, open-
ended survey responses, observations, photographs, or video recordings.
Focus on "why, how, or what": Qualitative data provides insights into
motivations, experiences, and meanings. It answers questions like:
o Why do people choose to buy organic products?
o How do students feel about the new teaching methods?
o What are the key themes emerging from customer reviews?
Analysis Tools: Thematic analysis, coding, discourse analysis, and narrative
analysis are used to analyze qualitative data.
Key Differences:
Type of information: quantitative data consists of numbers and measurements;
qualitative data consists of words, text, images, and sounds.
Additional Considerations:
Challenges:
Volatility: Data sources can be highly volatile, with rapid fluctuations and unexpected
changes. Think about how quickly social media trends or economic indicators can shift.
This makes it difficult to rely on historical data for future predictions.
Uncertainty: The meaning and interpretation of data can be uncertain, especially in
complex systems. Correlations might not imply causation, and biases can be hidden
within datasets. This challenges our ability to draw clear conclusions from data.
Complexity: The sheer volume and variety of data can be overwhelming. Extracting
insights requires sophisticated tools and expertise, leaving many organizations
struggling to harness the full potential of their data.
Ambiguity: Data often lacks clear context or explanation, leading to ambiguity in its
interpretation. Different stakeholders might draw different conclusions from the same
data set, leading to confusion and disagreements.
Opportunities:
VUCA-proofing: With careful analysis and robust data governance, organizations can
build systems that are more resilient to VUCA fluctuations. By actively monitoring data
for unusual patterns and adapting as needed, they can become more agile and
responsive to change.
Deeper insights: Advanced analytics tools and techniques can cut through the
complexity of data to reveal hidden patterns and trends. This can lead to better
understanding of customer behavior, market dynamics, and operational efficiency.
Decision-making under uncertainty: While absolute certainty might be elusive, data can
still inform decision-making even in uncertain environments. By using probabilistic
models and scenario planning, organizations can make informed choices even when
the future is unclear.
Transparency and trust: Data can be used to promote transparency and trust in
decision-making. By sharing data insights and making decision-making processes more
visible, organizations can build stronger relationships with stakeholders.
In conclusion, VUCA and data are intrinsically linked. While the VUCA environment
presents challenges for data-driven decision-making, it also creates opportunities for
organizations that can harness the power of data effectively. By embracing agility,
adopting robust data governance, and investing in advanced analytics tools,
organizations can navigate the VUCA landscape and turn data into a strategic
advantage.
1. Data Collection:
This first step gathers or collects data from various sources like sensors, databases,
websites, surveys, or experiments. The chosen method depends on your specific data
needs and goals.
2. Data Preparation:
Here, you make the raw data usable for analysis. This often involves cleaning the data,
handling missing values, removing duplicates, and converting it into a consistent format.
3. Data Input:
The prepared data is then loaded into a chosen platform for analysis, like a data
warehouse, spreadsheet, or statistical software.
4. Data Processing:
This is where you analyze and manipulate the data to extract insights. This can involve:
Descriptive statistics: Summarizing the data through measures like mean, median, and
standard deviation.
5. Data Output:
The extracted insights are presented in a clear and concise way, often through reports,
dashboards, or visualizations.
6. Data Storage:
Finally, the processed data is saved securely for future use, analysis, or reference.
These steps form a general roadmap, but the specific tasks and tools used may vary
depending on the type and complexity of your data and the desired analysis.
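As a rough sketch of how steps 2, 4, and 5 fit together (the records and values are invented), the following uses only the Python standard library to prepare a small dataset, compute descriptive statistics, and print a tiny report:

```python
import statistics

# Steps 1-2: collected raw data, then prepared by dropping records with missing values.
raw_records = [
    {"customer": "A", "purchase": 120.0},
    {"customer": "B", "purchase": None},   # missing value, removed during preparation
    {"customer": "C", "purchase": 75.5},
    {"customer": "D", "purchase": 230.0},
]
clean = [r for r in raw_records if r["purchase"] is not None]

# Step 4: processing - descriptive statistics over the prepared data.
amounts = [r["purchase"] for r in clean]
summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "std_dev": statistics.stdev(amounts),
}

# Step 5: output - a simple textual report of the extracted insights.
for key, value in summary.items():
    print(f"{key}: {value}")
```

Step 6 (storage) would then save the summary or the cleaned data to a file or database for later use.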
Essentially, it's a smaller, more focused version of a data warehouse that caters to
the specific needs of a particular team or business area. Here's a breakdown of its
key characteristics:
1. Subject-oriented: Data marts are built around specific topics or areas of interest,
such as marketing, sales, finance, or human resources. This means they only
contain data relevant to that particular subject, making it easier for users to find the
information they need.
2. Integrated: Data marts integrate data from various sources, both internal and
external, into a single, consistent format. This eliminates the need for users to
access and merge data from multiple disparate systems.
3. Time-variant: Data marts typically track data over time, allowing users to analyze
trends and patterns. This historical data can be crucial for identifying areas for
improvement and making informed decisions.
4. Non-volatile: Unlike operational databases that are constantly being updated, data
marts are relatively static. This means that data once loaded into a data mart is not
typically subject to frequent changes, making it more reliable for analysis.
Scalability: Data lakes can scale up easily to accommodate whatever amount of data
you throw at them. This is because they typically use object storage, which is a cost-
effective way to store large amounts of data.
Flexibility: You can store any type of data in a data lake, regardless of its structure or
format. This makes them ideal for organizations that deal with a lot of diverse data.
Accessibility: Data lakes are designed to be easily accessible by data analysts and
scientists. This allows them to quickly and easily find the data they need for their
analyses.
Cost-effectiveness: Compared to data warehouses, data lakes are typically more cost-
effective, especially for storing large amounts of data.
But it's not all sunshine and rainbows. Here are some potential drawbacks of data lakes:
Complexity: Managing a data lake can be complex, especially as it grows in size.
It requires specialized skills and expertise to ensure that the data is properly
organized and secured.
Data quality: Because data lakes store everything, it's easy for low-quality or
irrelevant data to creep in. This can make it difficult to find the data you need and
can lead to inaccurate results.
Security: Ensuring the security of all that data in a data lake is crucial.
Organizations need to have strong security measures in place to prevent
unauthorized access or breaches.
Steps:
1. Data Collection:
Download server logs, which might be in plain text or a specific format like Apache
Combined Log Format.
Use data extraction tools if necessary to pull relevant data from log files.
2. Data Cleaning:
Remove malformed or incomplete log entries, bot traffic, and duplicate records, and
standardize fields such as timestamps.
3. Data Transformation:
Create new variables based on existing data (e.g., "session duration" from timestamps).
Group data by relevant dimensions (e.g., page, referrer, user agent).
Calculate aggregate statistics (e.g., total visits, unique visitors, average session
duration).
4. Data Analysis:
Visualize data using charts and graphs to identify trends and patterns.
Compare data across different dimensions to understand user behavior.
Use statistical tests to assess the significance of findings.
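A minimal sketch of steps 1-3 for server log data, assuming simplified, well-formed lines (the real Apache Combined Log Format adds referrer and user-agent fields, which are ignored here):

```python
import re
from collections import Counter

# Simplified pattern for Apache-style access log lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

sample_lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1042',
    '198.51.100.7 - - [10/Oct/2023:14:01:12 +0000] "GET /index.html HTTP/1.1" 404 512',
]

# Step 2 (cleaning): keep only the lines the pattern can parse.
parsed = [m.groupdict() for line in sample_lines if (m := LOG_PATTERN.match(line))]

# Step 3 (transformation): group by page and by visitor, then aggregate.
visits_per_page = Counter(entry["path"] for entry in parsed)
unique_visitors = len({entry["ip"] for entry in parsed})

print("total visits:", len(parsed))
print("unique visitors:", unique_visitors)
print("visits per page:", dict(visits_per_page))
```

Step 4 would then chart these aggregates over time or compare them across dimensions such as referrer or user agent.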
These factors pose significant challenges for data warehouses and the insights they
offer. Here's how VUCA impacts them:
Challenges:
Data currency: Keeping data fresh and relevant in a volatile environment can be difficult.
Data quality: Uncertain data sources and complex systems can lead to errors and
inconsistencies.
Data access: Users may struggle to navigate complex data and find relevant insights
quickly.
Data interpretation: Ambiguity can make it hard to draw clear conclusions from the data.
Opportunities:
Real-time integration: Integrate real-time data sources to capture volatility and make
more timely decisions.
Data governance: Implement robust data quality checks and processes to ensure data
accuracy.
Self-service analytics: Empower users with intuitive tools to explore and analyze data
effectively.
Advanced analytics: Use AI and machine learning to uncover hidden patterns and gain
deeper insights.
Agile data architectures: Flexible and scalable architectures to adapt to changing needs.
Data virtualization: Provide unified access to diverse data sources without physically
moving data.
Cloud-based solutions: Leverage scalability, agility, and cost-effectiveness of cloud
platforms.
Focus on data lineage: Track data provenance to understand its origin and reliability.
Data storytelling: Present complex insights in a clear and compelling way.
By adopting these strategies, data warehouses can transform from static repositories
into dynamic tools for navigating the VUCA world. They can empower organizations to
make data-driven decisions amidst uncertainty, complexity, and ambiguity, ultimately
leading to better outcomes.
16- Explain KDD in detail with advantages & disadvantages with
diagram & example?
In the context of computer science, "Data Mining" can be referred to as
knowledge mining from data, knowledge extraction, data/pattern analysis, data
archaeology, and data dredging. Data Mining, also known as Knowledge
Discovery in Databases, refers to the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data stored in databases.
The need for data mining is to extract useful information from large datasets and
use it to make predictions or support better decision-making. Nowadays, data mining is
used in almost all places where a large amount of data is stored and processed,
for example in the banking sector, market basket analysis, and network intrusion
detection.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets. KDD is an iterative process and may require multiple passes
over the steps below to extract accurate knowledge from the data. The following
steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection. It includes:
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with Data discrepancy detection and Data transformation tools.
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources
into a common source (data warehouse). Data integration is performed using data
migration tools, data synchronization tools, and the ETL (Extract-Transform-Load)
process.
Data Selection
Data selection is defined as the process where the data relevant to the analysis is
decided and retrieved from the data collection. For this we can use methods such as
neural networks, decision trees, Naive Bayes, clustering, and regression.
Data Transformation
Data Transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. Data Transformation is a two-step
process:
1. Data Mapping: Assigning elements from source base to destination to
capture transformations.
2. Code generation: Creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially useful
patterns. It transforms task-relevant data into patterns and decides the purpose of
the model, using classification or characterization.
Pattern Evaluation
Pattern Evaluation is defined as identifying the truly interesting patterns
representing knowledge based on given interestingness measures. It finds the
interestingness score of each pattern and uses summarization and visualization to
make the data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Note: KDD is an iterative process where evaluation measures can be
enhanced, mining can be refined, new data can be integrated and transformed
in order to get different and more appropriate results. Preprocessing of
databases consists of data cleaning and data integration.
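To tie the stages together, here is a toy, end-to-end sketch on invented data; a simple rule-based characterization stands in for a real mining algorithm, and the attribute names and thresholds are assumptions made only for illustration:

```python
from collections import defaultdict

raw = [
    {"age": 25, "income": 30000, "bought": "yes"},
    {"age": None, "income": 52000, "bought": "no"},   # record with a missing value
    {"age": 47, "income": 90000, "bought": "yes"},
    {"age": 33, "income": 45000, "bought": "no"},
]

# Data cleaning: drop records with missing values.
cleaned = [r for r in raw if all(v is not None for v in r.values())]

# Data selection: keep only the attributes relevant to the analysis.
selected = [{"income": r["income"], "bought": r["bought"]} for r in cleaned]

# Data transformation: map raw income into coarse bands suitable for mining.
def income_band(income):
    return "high" if income >= 50000 else "low"

transformed = [{"band": income_band(r["income"]), "bought": r["bought"]} for r in selected]

# Data mining: a trivial characterization - buying rate per income band.
counts = defaultdict(lambda: {"yes": 0, "no": 0})
for r in transformed:
    counts[r["band"]][r["bought"]] += 1

# Pattern evaluation and knowledge representation: report the discovered pattern.
for band, c in sorted(counts.items()):
    total = c["yes"] + c["no"]
    print(f"income band '{band}': {c['yes']}/{total} customers bought")
```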
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge
that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks
and makes the data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can help
them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can
forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting
and analyzing large amounts of data, which can include sensitive information
about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills
and knowledge to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences,
such as bias or discrimination, if the data or models are not properly
understood or used.
4. Data Quality: The KDD process heavily depends on the quality of the data; if the
data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant
investments in hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common
problem in machine learning where a model learns the detail and noise in
the training data to the extent that it negatively impacts the performance of
the model on new unseen data.
17- Explain one-, two- & three-tier architecture with diagrams?
Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the
amount of data stored; to reach this goal, it removes data redundancies.
The figure shows the only layer physically available is the source layer. In this method,
data warehouses are virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an
intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for
separation between analytical and transactional processing. Analysis queries are
submitted to the operational data after the middleware interprets them. In this way,
the queries affect the transactional workload.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight a separation between
the physically available sources and the data warehouse, it in fact consists of four
subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data
is stored initially in corporate relational databases or legacy databases, or it may come
from an information system outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can
combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load
source data into a data warehouse (a minimal ETL sketch appears after this list).
3. Data Warehouse layer: Information is saved to one logically centralized individual
repository: a data warehouse. The data warehouse can be accessed directly, but it can
also be used as a source for creating data marts, which partially replicate data warehouse
contents and are designed for specific enterprise departments. Meta-data repositories
store information on sources, access procedures, data staging, users, data mart schemas,
and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should
feature aggregate information navigators, complex query optimizers, and customer-
friendly GUIs.
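As referenced in the data staging step above, here is a minimal ETL-style sketch; the source formats, field names, and target schema are invented for illustration. It extracts records from two heterogeneous sources, transforms them into one standard schema, and loads them into an in-memory warehouse table using sqlite3:

```python
import sqlite3

# Two heterogeneous "sources" with different schemata (invented for illustration).
crm_source = [{"cust_name": "Alice ", "city": "Cairo"}]
legacy_source = [("Omar", "ALEXANDRIA")]  # (name, city) tuples, city in upper case

# Extract + Transform: merge both sources into one standard schema and cleanse values.
def extract_and_transform():
    for rec in crm_source:
        yield (rec["cust_name"].strip(), rec["city"].title())
    for name, city in legacy_source:
        yield (name.strip(), city.title())

# Load: write the unified records into the warehouse layer.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dim_customer (name TEXT, city TEXT)")
warehouse.executemany("INSERT INTO dim_customer VALUES (?, ?)", extract_and_transform())

print(warehouse.execute("SELECT name, city FROM dim_customer").fetchall())
warehouse.close()
```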
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data
model for the whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population. In some cases,
the reconciled layer is also used directly to better accomplish some operational tasks,
such as producing daily reports that cannot be satisfactorily prepared using the
corporate applications, or generating data flows to feed external processes periodically
in order to benefit from cleaning and integration.
What is OLAP?
OLAP stands for Online Analytical Processing, which is a technology that
enables multi-dimensional analysis of business data. It provides interactive
access to large amounts of data and supports complex calculations and data
aggregation. OLAP is used to support business intelligence and decision-
making processes.
Grouping of data in a multidimensional matrix is called a data cube. In data
warehousing, we generally deal with various multidimensional data models, as the
data will be represented by multiple dimensions and multiple attributes. This
multidimensional data is represented in the data cube, as the cube represents a
high-dimensional space. The data cube pictorially shows how different
attributes of data are arranged in the data model. Below is the diagram of a
general data cube.
Data cube operations are used to manipulate data to meet the needs of users.
These operations help to select particular data for analysis purposes. There
are mainly five such operations (roll-up, drill-down, slice, dice, and pivot); two of
them are described below:
Roll-up: This operation aggregates similar data attributes along a dimension.
For example, if the data cube displays the daily income of a customer, we can
use a roll-up operation to find their monthly income.
Dicing: This operation performs a multidimensional cut: rather than cutting along a
single dimension, it selects a range of values on two or more dimensions at once,
producing a subcube of the whole cube (as depicted in the figure). For example,
the user wants to see the annual salary of Jharkhand state employees.
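A minimal sketch of roll-up and dice using plain Python; the dimensions and numbers are invented. Daily income records are rolled up to monthly totals, and a dice selects a sub-range across two dimensions at once:

```python
from collections import defaultdict

# Tiny fact table for the "cube": dimensions (year, month, day, state), measure income.
facts = [
    {"year": 2023, "month": 1, "day": 1, "state": "Jharkhand", "income": 100},
    {"year": 2023, "month": 1, "day": 2, "state": "Jharkhand", "income": 150},
    {"year": 2023, "month": 2, "day": 1, "state": "Bihar", "income": 200},
]

# Roll-up: aggregate the day level away, keeping (year, month).
monthly = defaultdict(int)
for f in facts:
    monthly[(f["year"], f["month"])] += f["income"]
print("roll-up to month:", dict(monthly))

# Dice: cut a range on more than one dimension at once
# (state == 'Jharkhand' AND month in January-February), producing a subcube.
subcube = [f for f in facts if f["state"] == "Jharkhand" and f["month"] in (1, 2)]
print("dice result:", subcube)
```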