
1. What is data modeling?

Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.

The goal of data modeling is to illustrate the types of data used and stored within the system, the relationships among these data types, the ways the data can be grouped and organized, and its formats and attributes.
Data modeling employs
standardized schemas and formal
techniques. This provides a
common, consistent, and
predictable way of defining and
managing data resources across
an organization, or even beyond.
Ideally, data models are living
documents that evolve along with
changing business needs. They
play an important role in
supporting business processes
and planning IT architecture and
strategy. Data models can be
shared with vendors, partners,
and/or industry peers.

Types of data models


Like any design process, database and information system design begins at a high level of abstraction and becomes increasingly concrete and specific. Data models can generally be divided into three categories, which vary according to their degree of abstraction. The process starts with a conceptual model, progresses to a logical model, and concludes with a physical model.
Conceptual data models
Also referred to as domain models, they offer a big-picture view of what the system will contain, how it will be organized, and which business rules are involved.
Conceptual models are usually
created as part of the process of
gathering initial project
requirements. Typically, they
include entity classes (defining the
types of things that are important
for the business to represent in
the data model), their
characteristics and constraints,
the relationships between them
and relevant security and data
integrity requirements. Any
notation is typically simple.

Query language
What is query language?
A query language is a specialized
programming language for
searching and changing the
contents of a database.
Essentially, a query language is a computer programming language used to retrieve information or data from a database. Even though the term originally referred to a sublanguage for only searching (querying) the contents of a database, modern query languages such as SQL are general languages for interacting with the DBMS, including
statements for defining and
changing the database schema,
populating the contents of the
database, searching the contents
of the database, updating the
contents of the database, defining
integrity constraints over the
database, defining stored
procedures, defining authorization
rules, defining triggers, etc.

The data definition statements of a query language provide primitives for defining and changing the database schema, while data manipulation statements allow populating, querying, as well as updating the database. Queries are usually expressed declaratively, without side effects, using logical conditions.
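As a rough sketch of this division of labor, using Python's built-in sqlite3 module (the employee table and its columns are hypothetical): the CREATE TABLE statement is data definition, while INSERT, UPDATE, and SELECT are data manipulation.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # Data definition: create the schema
    cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

    # Data manipulation: populate and update the contents
    cur.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Ada", 52000.0))
    cur.execute("UPDATE employee SET salary = salary * 1.05 WHERE name = ?", ("Ada",))

    # A declarative query: we state the logical condition, not how to scan the table
    for row in cur.execute("SELECT name, salary FROM employee WHERE salary > 50000"):
        print(row)

    conn.close()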

[Figure: Query language. Source: Nebula Graph]
Early query languages were
extremely complicated, which
meant that only specially trained
individuals could interact with
electronic databases. But the
interfaces have evolved and have
now become far more user
friendly, thus making it possible for
casual users to access database
information.
The most popular types of query
modes are the menu, the “fill-in-
the-blank” technique, and the
structured query. The menu is the best option for novices, as it simply requires them to pick from a range of alternatives displayed on the screen. The fill-
in-the-blank technique prompts
the user to enter key words as
search statements. The structured
query approach is very effective
when it is used on relational
databases. It has a formal,
powerful syntax that is actually a
programming language, and it has
the ability to accommodate logical
operators. One implementation of
the structured query approach, the
Structured Query Language
(SQL), has the form:

select [field Fa, Fb, . . ., Fn]
from [database Da, Db, . . ., Dn]
where [field Fa = abc] and [field Fb = def].
Structured query languages
enable database searching and
other operations by making use of
commands like “find,” “delete,”
print,” “sum,” etc. The sentence-like structure of an SQL query resembles natural language, except that its syntax is limited and fixed.

If you do not want to use an SQL statement, it is possible for you to represent queries in a tabular format. This technique, known as query-by-example (QBE), displays an empty tabular form and expects the searcher to fill in the search specifications into the right columns. The program then constructs an SQL-type query from the table and executes it.
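As an illustration of the idea, the hypothetical helper below (not any real QBE product) turns QBE-style column/value pairs into a parameterized SELECT statement:

    def qbe_to_sql(table, criteria):
        """Build a parameterized SQL query from QBE-style column/value pairs."""
        where = " AND ".join(f"{col} = ?" for col in criteria)
        sql = f"SELECT * FROM {table}"
        if where:
            sql += f" WHERE {where}"
        return sql, list(criteria.values())

    # A 'form' in which the searcher filled in two columns:
    sql, params = qbe_to_sql("employee", {"dept": "sales", "city": "Pune"})
    print(sql)     # SELECT * FROM employee WHERE dept = ? AND city = ?
    print(params)  # ['sales', 'Pune']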
2. Normalization

The goal of normalization is to transform features to be on a similar scale. This improves the performance and training stability of the model.

Normalization Techniques at a
Glance
Four common normalization
techniques may be useful:

scaling to a range
clipping
log scaling
z-score
The following charts show the
effect of each normalization
technique on the distribution of the
raw feature (price) on the left. The
charts are based on the data set
from 1985 Ward's Automotive
Yearbook that is part of
the UCI Machine Learning
Repository under Automobile Data
Set.

[Figure 1. Summary of normalization techniques. Five graphs: the raw distribution; the raw distribution scaled to a range (same shape as the raw distribution); the raw distribution clipped (highest values eliminated); the raw distribution log-scaled (data bunched in the middle of the distribution); and the z-score of the distribution (similar shape to the raw distribution).]

Scaling to a range
Recall from MLCC that scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range, usually 0 to 1 (or sometimes -1 to +1). Use the following simple formula to scale to a range:

x' = (x - x_min) / (x_max - x_min)
Scaling to a range is a good choice when both of the following conditions are met:

You know the approximate upper and lower bounds on your data, with few or no outliers.
Your data is approximately uniformly distributed across that range.
A good example is age. Most age values fall between 0 and 90, and every part of the range has a substantial number of people.
In contrast, you would not use
scaling on income, because only a
few people have very high
incomes. The upper bound of the
linear scale for income would be
very high, and most people would
be squeezed into a small part of
the scale.
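A minimal Python sketch of the formula above, using a hypothetical list of ages:

    def scale_to_range(values, new_min=0.0, new_max=1.0):
        """Linearly map values from their observed range into [new_min, new_max]."""
        lo, hi = min(values), max(values)
        return [new_min + (x - lo) * (new_max - new_min) / (hi - lo) for x in values]

    ages = [0, 18, 35, 62, 90]
    print(scale_to_range(ages))  # [0.0, 0.2, 0.388..., 0.688..., 1.0]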
Feature Clipping
If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain value to a fixed value. For example, you could clip all temperature values above 40 to be exactly 40.

You may apply feature clipping before or after other normalizations.

Formula: set min/max bounds to avoid outliers; every value outside the bounds is replaced by the bound itself, i.e. x' = min(max(x, lower), upper).

[Figure 2. Comparing a raw distribution and its clipped version. In the raw distribution, nearly all values fall within the range 1 to 4, but a small percentage lie between 5 and 55; in the clipped distribution, all values originally above 4 now have the value 4.]

Another simple clipping strategy is to clip by z-score to ±Nσ (for example, limit to ±3σ), where σ is the standard deviation.
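A minimal sketch of both clipping variants (the temperature values are made up):

    import statistics

    def clip(values, lower, upper):
        """Cap every value to the [lower, upper] interval."""
        return [min(max(x, lower), upper) for x in values]

    def clip_by_zscore(values, n_sigma=3.0):
        """Clip to mean +/- n_sigma standard deviations."""
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)
        return clip(values, mu - n_sigma * sigma, mu + n_sigma * sigma)

    temps = [12, 18, 25, 31, 38, 55]  # 55 is an outlier
    print(clip(temps, 0, 40))         # [12, 18, 25, 31, 38, 40]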

Log Scaling
Log scaling computes the log of
your values to compress a wide
range to a narrow range.

Log scaling is helpful when a handful of your values have many points, while most other values have few points. This data distribution is known as the power law distribution. Movie ratings are a good example: in the chart below, most movies have very few ratings (the data in the tail), while a few have lots of ratings (the data in the head). Log scaling changes the distribution, helping to improve linear model performance.
[Figure 3. Comparing a raw distribution to its log. The raw data graph shows a lot of ratings in the head, followed by a long tail; the log graph has a more even distribution.]
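A minimal sketch (the rating counts are made up); log1p is used so that zero counts do not blow up:

    import math

    def log_scale(values):
        """Compress a wide, power-law-like range with a logarithm."""
        return [math.log1p(x) for x in values]

    rating_counts = [1, 3, 10, 250, 12000]
    print(log_scale(rating_counts))  # the wide range collapses to roughly 0.7 to 9.4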

Z-Score
Z-score is a variation of scaling
that represents the number of
standard deviations away from the
mean. You would use z-score to
ensure your feature distributions
have mean = 0 and std = 1. It’s
useful when there are a few
outliers, but not so extreme that
you need clipping.

The formula for calculating the z-score of a point, x, is as follows:

x' = (x - μ) / σ

where μ is the mean and σ is the standard deviation.
[Figure 4. Comparing a raw distribution to its z-score distribution. The raw data shows a rough Poisson distribution ranging from 5,000 to 45,000; the normalized data ranges from -1 to +4.]

Notice that z-score squeezes raw values that have a range of ~40,000 down into a range from roughly -1 to +4.
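A minimal sketch of the z-score formula (the prices are made up):

    import statistics

    def z_score(values):
        """Standardize values to mean 0 and standard deviation 1."""
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)  # population standard deviation
        return [(x - mu) / sigma for x in values]

    prices = [5000, 9000, 12000, 15000, 45000]
    print(z_score(prices))  # the 45000 outlier lands near +2, the rest near 0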

Suppose you're not sure whether the outliers truly are extreme. In this case, start with z-score unless you have feature values that you don't want the model to learn; for example, the values may be the result of a measurement error or a quirk.
3. Describe isolation levels and online analytical processing

What is an “Isolation
Level”?

Database isolation refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
There is such a thing as perfect
isolation (we will define this
below). Unfortunately, perfection
usually comes at a performance
cost, in terms of transaction latency (how long before a transaction completes) or throughput (how many transactions per second the system can complete).
Depending on how a particular
system is architected, perfect
isolation becomes easier or
harder to achieve. In poorly
designed systems, achieving
perfection comes with a
prohibitive performance cost, and
users of such systems will be
pushed to accept guarantees
significantly short of perfection.
However, even in well-designed
systems, there is often a non-
trivial performance benefit
achieved by accepting guarantees
short of perfection.
Therefore, isolation levels came
into existence: they provide the
user of a system the ability to
trade off isolation guarantees for
improved performance.

Isolation levels define the degree to which a transaction must be isolated from the data modifications made by any other transaction in the database system. A transaction isolation level is defined by the following phenomena (a short code sketch illustrating them follows the list):

• Dirty Read – A dirty read is a situation when a transaction reads data that has not yet been committed. For example, say transaction 1 updates a row and leaves it uncommitted; meanwhile, transaction 2 reads the updated row. If transaction 1 rolls back the change, transaction 2 will have read data that is considered never to have existed.

• Non-Repeatable Read – A non-repeatable read occurs when a transaction reads the same row twice and gets a different value each time. For example, suppose transaction T1 reads data. Due to concurrency, another transaction T2 updates the same data and commits. Now, if transaction T1 rereads the same data, it will retrieve a different value.
• Phantom Read – A phantom read occurs when the same query is executed twice, but the rows retrieved by the two executions are different. For example, suppose transaction T1 retrieves a set of rows that satisfy some search criteria. Now, transaction T2 generates some new rows that match the search criteria for transaction T1. If transaction T1 re-executes the statement that reads the rows, it gets a different set of rows this time.
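The sketch below is a minimal illustration, assuming a PostgreSQL-style database reached through the psycopg2 driver; the connection string and the accounts table are hypothetical, and the exact syntax varies by DBMS. It requests a stricter isolation level so that the non-repeatable read described above cannot occur:

    import psycopg2  # assumed PostgreSQL driver

    # Hypothetical connection string and table, for illustration only
    conn = psycopg2.connect("dbname=shop user=app")
    cur = conn.cursor()

    # Request a stricter level; under READ COMMITTED (a common default),
    # the two reads below could disagree (a non-repeatable read) if another
    # transaction committed an update in between.
    cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    first = cur.fetchone()
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    second = cur.fetchone()
    assert first == second  # guaranteed at REPEATABLE READ or stronger
    conn.commit()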

What is OLAP (online analytical processing)?

OLAP (online analytical processing) is a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view. OLAP business intelligence queries often aid in trends analysis, financial reporting, sales forecasting, budgeting, and other planning purposes.

How OLAP systems work
To facilitate this kind of analysis,
data is collected from multiple
sources and stored in data
warehouses, then cleansed and
organized into data cubes. Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region, and time period) derived from dimension tables in the data warehouses. Dimensions are then populated by members (such as customer names, countries, and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.
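As a rough, hypothetical illustration (not any specific OLAP engine), the pandas sketch below builds a tiny cube-like summary from fact rows, pre-aggregating the amount measure across two dimensions:

    import pandas as pd

    # Toy fact data: one row per sale, with two dimensions (region, month)
    sales = pd.DataFrame({
        "region": ["EMEA", "EMEA", "APAC", "APAC"],
        "month":  ["Jan",  "Feb",  "Jan",  "Feb"],
        "amount": [100,    150,    80,     120],
    })

    # Pre-summarize the measure across both dimensions, as a cube would
    cube = sales.pivot_table(values="amount", index="region",
                             columns="month", aggfunc="sum")
    print(cube)              # a 2-D slice: region x month totals
    print(cube.loc["EMEA"])  # 'slicing' along the region dimension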

Types of OLAP
systems
OLAP systems typically fall into
one of three types:

• Multidimensional OLAP (MOLAP) is OLAP that indexes directly into a multidimensional database.

• Relational OLAP (ROLAP) is OLAP that performs dynamic multidimensional analysis of data stored in a relational database.

• Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP combines the greater data capacity of ROLAP with the superior processing capability of MOLAP.
4. Explain SQL injection attacks

SQL Injection
SQL injection is a code injection
technique attackers use to gain
unauthorized access to a
database by injecting malicious
SQL commands into web page
inputs.

Attackers can extract sensitive information, modify database data, execute administration operations on the database (such as shutting down the DBMS), recover the content of a given file present on the DBMS file system, and in some cases, issue commands to the operating system.

In this article, we will discuss what SQLi (SQL injection) is, the types of SQL injection, SQL injection in web pages, how to prevent SQL injection attacks, and more.
What is SQL Injection?
SQLi, or SQL injection, is a web application vulnerability that lets an attacker run queries against the database. Attackers take advantage of a web application vulnerability and inject an SQL command via user input to the application.

Attackers can use SQL queries like SELECT to retrieve confidential information which otherwise wouldn't be visible. SQL injection also lets the attacker perform denial-of-service (DoS) attacks by overloading the server with requests.
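A minimal sketch of the vulnerability, using Python's sqlite3 module with a hypothetical users table: pasting user input directly into the SQL text lets the classic ' OR '1'='1 payload bypass the intended filter.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

    # VULNERABLE: user input is pasted straight into the SQL text
    user_input = "' OR '1'='1"
    query = f"SELECT * FROM users WHERE name = '{user_input}'"
    print(conn.execute(query).fetchall())  # returns every row, no valid name needed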
SQL Injection Types
There are different types of SQL
injection attacks:

1. In-band SQL Injection
It involves sending malicious SQL queries directly through the web application's interface and allows attackers to extract sensitive information or modify the database itself.

2. Error-based SQL Injection
Attackers exploit error messages generated by the web application, analyzing them to gain access to confidential data or modify the database.
3. Blind SQL Injection
Attackers send malicious SQL
queries and observe the
application’s response. By
analyzing the application’s
behavior,
attackers can determine the
success of the query.

4. Out-of-band SQL Injection
Uses a different channel to communicate with the database, allowing attackers to exfiltrate sensitive data from the database.

5. Inference-based SQL Injection
Uses statistical inference to gain access to confidential data. Attackers create queries that return the same result regardless of input values.

SQL Injection Prevention
Developers can use the following measures to prevent SQL injection attacks (a short code sketch follows the list):

• User authentication and input validation: Validate user input by pre-defining the length and type of the input field, and authenticate the user.
• Restrict access privileges: Limit how much data any outsider can access from the database. Basically, users should not be granted permission to access everything in the database.
• Do not use system administrator accounts.
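One widely used complement to the measures above is parameterized queries, which keep user input separate from the SQL text. A minimal sketch with Python's sqlite3 module and the same hypothetical users table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

    # SAFE: the placeholder keeps user input as data, never as SQL text
    user_input = "' OR '1'='1"
    rows = conn.execute("SELECT * FROM users WHERE name = ?",
                        (user_input,)).fetchall()
    print(rows)  # [] -- the payload is treated as a literal string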
