1 SQL
for
Process Industry
- Dr. C.V.Rao
1
Why To Study This Course?
Analytics are everywhere
Upstream activities (Exploration and production
of crude oil and natural gas)
Midstream activities (processes, stores, markets
and transports commodities such as crude oil,
natural gas, natural gas liquids)
Downstream activities (includes oil refineries,
petrochemical plants, petroleum products
distributors, retail outlets and natural gas
distribution companies.)
Why To Study This Course?
Software is used in all the 3 streams
The Internet of Things (IoT) (many sensors,
actuators are remotely monitored and controlled)
There are IoT standards for all the 3 streams
Understand Technologies behind ERP
products
Different Methods and Tools for Analyzing
the data
Syllabus
• Languages of Data Science: R, Excel, SQL, Python, and Tableau;
• Introduction to Data Warehousing and OLAP; Data Preparation and
Visualization;
• Descriptive Statistics: central tendency and variability; Inferential
Statistics: Probability, Central Limit Theorem; Exploratory Data Analysis;
Hypothesis Testing; Linear Regression;
• Classification: KNN, Naive Bayes and Logistic Regression; K-Means and
Hierarchical clustering; Time Series; Decision Trees; Support Vector
Machines; Neural networks; Association Rule Mining;
• Introduction to Big Data And Hadoop; Managing Big Data: Hadoop
Ecosystem tools (Sqoop and Hive); Introduction to Spark; Big Data Analysis
using SparkR, SparkSQL; Case Studies;
• Spatial Data Model; Visualization and Query of Spatial Data; Subsurface
Mapping and Correlation and applications.
6
Database: What
• Database
– is collection of related data and its metadata organized in a structured
format
– for optimized information management
• Database System
– is an integrated system of hardware, software, people, procedures,
and data
– that define and regulate the collection, storage, management, and use
of data within a database environment
7
Database Management System
- manages interaction between end users and database
8
Database System Environment
Hardware
Software
- OS
- DBMS
- Applications
People
Procedures
Data
9
Database: Why
• Purpose of Database
– Optimizes data management
– Transforms data into information
10
Database: Why
11
Database: Data Models
• Importance
– Abstraction of complex real-world data structures in
relatively simple (graphical) representations
– Facilitate interaction among the designer, the applications
programmer, and the end user
File-based
Hierarchical
Object-oriented
Network
Relational
Web-based
Entity-Relationship
13
Database: Historical Roots
• Manual File System
– to keep track of data
– used tagged file folders in a filing cabinet
– organized according to expected use
• e.g. file per customer
– easy to create, but hard to
• aggregate/summarize data
• locate data
15
File System: Weakness
• Weakness
– “Islands of data” in scattered file systems.
• Problems
– Duplication
• same data may be stored in multiple files
– Inconsistency
• same data may be stored under different names and in different formats
– Rigidity
• requires customized programming to implement any changes
• cannot do ad-hoc queries
• Implications
– Waste of space
– Data inaccuracies
– High overhead of data manipulation and maintenance
16
File System: Problem Case
CUSTOMER file AGENT file SALES file
17
Database System vs. File System
18
Hierarchical Database
• Background
– Developed to manage large amounts of data for complex manufacturing
projects
– e.g., Information Management System (IMS)
• IBM-Rockwell joint venture
• clustered related data together
• hierarchically associated data clusters using pointers
19
Hierarchical Database: Example
20
Hierarchical Database: Pros & Cons
• Advantages
– Conceptual simplicity
• groups of data could be related to each other
• related data could be viewed together
– Centralization of data
• reduced redundancy and promoted consistency
• Disadvantages
– Limited representation of data relationships
• did not allow Many-to-Many (M:N) relations
– Complex implementation
• required in-depth knowledge of physical data storage
– Structural Dependence
• data access requires physical storage path
– Lack of Standards
• limited portability
21
Network Database
• Objectives
– Represent more complex data relationships
– Improve database performance
– Impose a database standard
22
Network Database: Example
23
Network Database: Pros & Cons
• Advantages
– More data relationship types
– More efficient and flexible data access
• “network” vs. “tree” path traversal
– Conformance to standards
• enhanced database administration and portability
• Disadvantages
– System complexity
• requires familiarity with the internal structure for data access
– Lack of structural independence
• small structural changes require significant program changes
24
Relational Database
• Problems with legacy database systems
– Required excessive effort to maintain
• Data manipulation (programs) too dependent on physical file
structure
– Hard to manipulate by end-users
• No capacity for ad-hoc query (must rely on DB programmers).
25
Relational Database
– Relational Database Model
• Dominant database model of today
• Eliminated pointers and used tables to represent data
• Tables
– flexible logical structure for data representation
– a series of row/column intersections
– related by sharing common entity characteristic(s)
26
Relational Database: Example
Provides a logical “human-level” view of the data and associations
among groups of data (i.e., tables)
27
Relational Database: Pros & Cons
• Advantages
– Structural independence
• Separation of database design and physical data storage/access
• Easier database design, implementation, management, and use
– Ad hoc query capability with Structured Query Language (SQL)
• The RDBMS translates SQL queries into the operations needed to retrieve the data
• Disadvantages
– Substantial hardware and system software overhead
• more complex system
– Poor design and implementation is made easy
• ease-of-use allows careless use of RDBMS
28
Entity Relationship Model
• Peter Chen’s landmark paper, 1976
– “The Entity-Relationship Model: Toward a Unified View of Data”
– Graphical representation of entities and their relationships
29
Entity Relationship Model
• Entity Relationship (ER) Model
30
E-R Diagram: Chen Model
• Entity
– represented by a rectangle
with its name in capital
letters.
• Relationships
– represented by an active or
passive verb inside the
diamond that connects the
related entities.
• Connectivities
– i.e., types of relationship
– written next to each entity
box.
Database Systems: Design, Implementation, & Management: Rob & Coronel
31
E-R Diagram: Crow’s Foot Model
• Entity
– represented by a rectangle
with its name in capital
letters.
• Relationships
– represented by an active or
passive verb that connects
the related entities.
• Connectivities
– indicated by symbols next to
entities.
• 2 vertical lines for 1
• “crow’s foot” for M
32
E-R Model: Pros & Cons
• Advantages
• Disadvantages
– Incomplete model on its own
• Limited representational power
– cannot model data constraints not tied to entity relationships
» e.g. attribute constraints
– cannot represent relationships between attributes within
entities
• No data manipulation language (e.g. SQL)
– Loss of information content
• Hard to include attributes in ERD
33
Object-Oriented Database
• Semantic Data Model (SDM)
– Modeled both data and their relationships in a single structure
(object)
• Developed by Hammer & McLeod in 1981
34
Object-Oriented Database
• Object-Oriented Database Model (OODBM)
– Maintains the advantages of the ER model but adds more features
– Object = entity + relationships (between & within entity)
• consists of attributes & methods
– attributes describe properties of an object
– methods are all relevant operations that can be performed on
an object
• self-contained abstraction of real-world entity
– Class = collection of similar objects with shared attributes and
methods
• e.g. EMPLOYEE class = (employ1 object, employ2 object, …)
• organized in a class hierarchy
– e.g. PERSON > EMPLOYEE, CUSTOMER
– Incorporates the notion of inheritance
• attributes and methods of a class are inherited by its descendent
classes
35
OO Database Model vs. E-R Model
OODBM:
- can accommodate relationships within an object
- allows objects to be used as building blocks for autonomous structures
36
Object-Oriented Database: Pros & Cons
• Advantages
– Semantic representation of data
• fuller and more meaningful description of data via object
– Modularity, reusability, inheritance
– Ability to handle
• complex data
• sophisticated information requirements
• Disadvantages
– Lack of standards
• no standard data access method
– Complex navigational data access
• class hierarchy traversal
– Steep learning curve
• difficult to design and implement properly
– More system-oriented than user-centered
– High system overhead
• slow transactions
37
Web Database
• Internet is emerging as a prime business tool
– Shift away from models (e.g. relational vs. O-O)
– Emphasis on interfacing with the Internet
38
39
SQL Introduction
Standard language for querying and manipulating data
40
SQL
• Data Definition Language (DDL)
– Create/alter/delete tables and their attributes
• Data Manipulation Language (DML)
– Query one or more tables
– Insert/delete/modify tuples in tables
• Transact-SQL
– Idea: package a sequence of SQL statements and send it to the server
41
Data in SQL
1. Atomic types, a.k.a. data types
2. Tables built from atomic types
42
Data Types in SQL
• Characters:
– CHAR(20) -- fixed length
– VARCHAR(40)-- variable length
• Numbers:
– BIGINT, INT, SMALLINT, TINYINT
– REAL, FLOAT -- differ in precision
– MONEY
• Times and dates:
– DATE
– DATETIME -- SQL Server
• Others... All are simple
43
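Tables in SQL
Table name: Product
Attribute names: PName, Price, Category, Manufacturer
Tuples or rows:
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi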
44
Tables Explained
• A tuple = a record
– Restriction: all attributes are of atomic type
• A table = a set of tuples
– Like a list…
– …but it is unordered: no first(), no next(), no last().
45
Tables Explained
• The schema of a table is the table name and
its attributes:
Product(PName, Price, Category, Manufacturer)
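The schema above can be declared with a DDL statement. The following is a minimal sketch using Python's built-in sqlite3 module; the column types and the choice of PName as primary key are assumptions, since the slide gives only the attribute names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Schema = table name plus attribute names. The types are assumed.
conn.execute("""
    CREATE TABLE Product (
        PName        VARCHAR(40) PRIMARY KEY,
        Price        REAL,
        Category     VARCHAR(20),
        Manufacturer VARCHAR(40)
    )
""")

# PRAGMA table_info returns one row per attribute; index 1 is its name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Product)")]
print(columns)
```

SQLite is used here only because it ships with Python; other systems may offer additional types such as MONEY or DATETIME.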
46
SQL Query
SELECT attributes
FROM relations (possibly multiple)
WHERE conditions (selections)
47
Simple SQL Query
SELECT *
FROM Product
WHERE category='Gadgets'
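As a runnable sketch, the query can be tried against the Product rows that appear elsewhere in these slides. The sqlite3 setup and the column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# The WHERE clause keeps only the rows whose category is 'Gadgets'.
gadgets = conn.execute(
    "SELECT * FROM Product WHERE Category = 'Gadgets'"
).fetchall()
for row in gadgets:
    print(row)
```

Both Gizmo and Powergizmo are returned, each with all four attributes because of SELECT *.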
48
Simple SQL Query
49
A Notation for SQL Queries
Input Schema
Output Schema
50
Selections
What goes in the WHERE clause:
• x = y, x < y, x <= y, etc
– For numbers, they have the usual meanings
– For CHAR and VARCHAR: lexicographic ordering
• Expected conversion between CHAR and VARCHAR
– For dates and times, what you expect...
• Pattern matching on strings...
51
The LIKE operator
• s LIKE p: pattern matching on strings
• p may contain two special symbols:
– % = any sequence of characters
– _ = any single character
SELECT *
FROM Product
WHERE PName LIKE '%gizmo%'
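A small sketch of both wildcards, using sqlite3 and the Product rows from these slides (setup assumed). Note that SQLite's LIKE ignores case for ASCII letters by default, so '%gizmo%' also matches 'Gizmo'; some other systems compare case-sensitively and would return only 'Powergizmo'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# % matches any sequence of characters (including none).
any_gizmo = sorted(
    r[0] for r in conn.execute("SELECT PName FROM Product WHERE PName LIKE '%gizmo%'")
)

# _ matches exactly one character, so 'Gi_mo' matches 'Gizmo' only.
one_char = [
    r[0] for r in conn.execute("SELECT PName FROM Product WHERE PName LIKE 'Gi_mo'")
]

print(any_gizmo, one_char)
```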
52
Eliminating Duplicates
SELECT DISTINCT category
FROM Product
Result:
Category
Gadgets
Photography
Household
Compare to:
SELECT category
FROM Product
Result:
Category
Gadgets
Gadgets
Photography
Household
53
Ordering the Results
Ties are broken by the second attribute on the ORDER BY list, etc.
54
Ordering the Results
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
SELECT category
FROM Product
ORDER BY pname
?
55
Ordering the Results
SELECT DISTINCT category
FROM Product
ORDER BY category
Result:
Category
Gadgets
Household
Photography
Compare to:
SELECT category
FROM Product
ORDER BY pname
?
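To check both queries, here is a minimal sqlite3 sketch over the Product rows from these slides (setup assumed). The second query sorts by a column that is not in the output, so the categories appear in product-name order and can look unsorted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# Duplicates removed, then sorted by the output column itself.
by_category = [
    r[0]
    for r in conn.execute("SELECT DISTINCT Category FROM Product ORDER BY Category")
]

# Sorted by PName (Gizmo, MultiTouch, Powergizmo, SingleTouch); duplicates kept.
by_name = [r[0] for r in conn.execute("SELECT Category FROM Product ORDER BY PName")]

print(by_category)  # ['Gadgets', 'Household', 'Photography']
print(by_name)      # ['Gadgets', 'Household', 'Gadgets', 'Photography']
```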
56
Joins in SQL
• Connect two or more tables:
58
Joins in SQL
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
Company
CName        StockPrice  Country
GizmoWorks   25          USA
Canon        65          Japan
Hitachi      15          Japan
Query result:
PName        Price
SingleTouch  $149.99
59
Joins
Product (pname, price, category, manufacturer)
Company (cname, stockPrice, country)
SELECT country
FROM Product, Company
WHERE manufacturer=cname AND category='Gadgets'
60
Joins in SQL
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
Company
CName        StockPrice  Country
GizmoWorks   25          USA
Canon        65          Japan
Hitachi      15          Japan
SELECT country
FROM Product, Company
WHERE manufacturer=cname AND category='Gadgets'
Country
??
??
What is the problem? What’s the solution?
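Running the join shows what the question is hinting at. This is a sketch with sqlite3 over the two tables above (setup and column types assumed): each matching Product row contributes one output row, so the same country appears twice, and DISTINCT is one way to remove the repetition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT)"
)
conn.execute("CREATE TABLE Company (CName TEXT, StockPrice REAL, Country TEXT)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)
conn.executemany(
    "INSERT INTO Company VALUES (?, ?, ?)",
    [("GizmoWorks", 25, "USA"), ("Canon", 65, "Japan"), ("Hitachi", 15, "Japan")],
)

join = "FROM Product, Company WHERE Manufacturer = CName AND Category = 'Gadgets'"

# Two gadgets share one manufacturer, so its country is listed twice.
with_duplicates = [r[0] for r in conn.execute("SELECT Country " + join)]

# DISTINCT collapses the repeated rows.
without_duplicates = [r[0] for r in conn.execute("SELECT DISTINCT Country " + join)]

print(with_duplicates)     # ['USA', 'USA']
print(without_duplicates)  # ['USA']
```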
61
Joins
Product (pname, price, category, manufacturer)
Purchase (buyer, seller, store, product)
Person(persname, phoneNumber, city)
62
When are two tables related?
• You guess they are
• I tell you so
• Foreign keys are a method for schema designers to
tell you so
– A foreign key states that a column is a reference to the key
of another table
e.g., Product.manufacturer is a foreign key referencing Company.cname
Product (pname, price, category, manufacturer)
Company (cname, stockPrice, country)
63
Disambiguating Attributes
• Sometimes two relations have the same attr:
Person(pname, address, worksfor)
Company(cname, address)
SELECT DISTINCT pname, address
FROM Person, Company
WHERE worksfor = cname
Which address?
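A sketch of the usual fix, prefixing each attribute with its table name. It uses sqlite3 with one hypothetical row per table (the names and addresses are invented for illustration). SQLite rejects the unqualified query outright; other systems report a similar error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (pname TEXT, address TEXT, worksfor TEXT)")
conn.execute("CREATE TABLE Company (cname TEXT, address TEXT)")
# Hypothetical sample rows.
conn.execute("INSERT INTO Person VALUES ('Alice', '12 Elm St', 'GizmoWorks')")
conn.execute("INSERT INTO Company VALUES ('GizmoWorks', '1 Factory Rd')")

# Unqualified: 'address' exists in both tables, so the query is rejected.
ambiguous = False
try:
    conn.execute(
        "SELECT DISTINCT pname, address FROM Person, Company WHERE worksfor = cname"
    )
except sqlite3.OperationalError as err:
    ambiguous = True
    print("rejected:", err)

# Qualified: state explicitly which table each attribute comes from.
rows = conn.execute(
    """
    SELECT DISTINCT Person.pname, Company.address
    FROM Person, Company
    WHERE Person.worksfor = Company.cname
    """
).fetchall()
print(rows)
```

Writing Person.address instead would return the employee's home address, so the qualifier changes the meaning of the query, not just its validity.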
Find all stores that sold at least one product that the store
‘BestBuy’ also sold:
Answer (store)
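One way to express this query is with two tuple variables ranging over the same table. The sketch below assumes the Purchase(buyer, seller, store, product) schema from the earlier Joins slide and uses hypothetical rows; the aliases x and y are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (buyer TEXT, seller TEXT, store TEXT, product TEXT)")
# Hypothetical purchases.
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Bob", "BestBuy", "Gizmo"),
        ("Cal", "Dee", "Target", "Gizmo"),
        ("Eve", "Fay", "Costco", "MultiTouch"),
    ],
)

# x ranges over all purchases, y over BestBuy's purchases only;
# a store qualifies when some x and y refer to the same product.
stores = sorted(
    r[0]
    for r in conn.execute(
        """
        SELECT DISTINCT x.store
        FROM Purchase AS x, Purchase AS y
        WHERE x.product = y.product AND y.store = 'BestBuy'
        """
    )
)
print(stores)  # ['BestBuy', 'Target']
```

BestBuy itself appears in the answer because its own purchases match themselves; adding x.store <> 'BestBuy' would exclude it if that is not wanted.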
65
Meaning (Semantics) of SQL Queries
SELECT a1, a2, …, ak
FROM R1 AS x1, R2 AS x2, …, Rn AS xn
WHERE Conditions
1. Nested loops:
Answer = {}
for x1 in R1 do
    for x2 in R2 do
        …
            for xn in Rn do
                if Conditions
                    then Answer = Answer ∪ {(a1,…,ak)}
return Answer
66
Meaning (Semantics) of SQL Queries
SELECT a1, a2, …, ak
FROM R1 AS x1, R2 AS x2, …, Rn AS xn
WHERE Conditions
2. Parallel assignment
Answer = {}
for all assignments x1 in R1, …, xn in Rn do
    if Conditions then Answer = Answer ∪ {(a1,…,ak)}
return Answer
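The pseudocode translates almost directly into Python. In this sketch the function name evaluate and its parameters are illustrative; itertools.product enumerates every assignment of x1, …, xn, which covers both the nested-loop and the parallel-assignment reading. Like the slide, it collects the answer as a set, whereas a real SQL engine keeps duplicates unless DISTINCT is specified.

```python
from itertools import product


def evaluate(relations, condition, projection):
    """Set-based meaning of SELECT ... FROM R1, ..., Rn WHERE condition."""
    answer = set()                        # Answer = {}
    for binding in product(*relations):   # every assignment (x1, ..., xn)
        if condition(*binding):           # if Conditions
            answer.add(projection(*binding))  # Answer = Answer ∪ {(a1, ..., ak)}
    return answer


product_rows = [
    ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
    ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "Photography", "Canon"),
    ("MultiTouch", 203.99, "Household", "Hitachi"),
]
company_rows = [("GizmoWorks", 25, "USA"), ("Canon", 65, "Japan"), ("Hitachi", 15, "Japan")]

# SELECT country FROM Product, Company
# WHERE manufacturer = cname AND category = 'Gadgets'
countries = evaluate(
    [product_rows, company_rows],
    condition=lambda p, c: p[3] == c[0] and p[2] == "Gadgets",
    projection=lambda p, c: (c[2],),
)
print(countries)  # {('USA',)}
```

This defines what the query means, not how a DBMS executes it; real engines choose far cheaper plans that return the same rows.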
• Truncate Syntax:
TRUNCATE TABLE TableName, e.g.: TRUNCATE TABLE employee
– Quicker way of deleting all the rows from a table
– It releases the space used by the table
• Drop Syntax:
DROP TABLE TableName, e.g.: DROP TABLE employee
– Removes the table completely from the database
Relational Database Schema--Figure 5.5
• Examples:
U4A: DELETE FROM EMPLOYEE
WHERE LNAME='Brown'
• In this request, the modified SALARY value depends on the original SALARY
value in each tuple
• The reference to the SALARY attribute on the right of = refers to the old SALARY
value before modification
• The reference to the SALARY attribute on the left of = refers to the new SALARY
value after modification
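The request these bullets describe is not reproduced on the slide, so the sketch below substitutes an assumed one: a flat raise of 1000 for department 5, over a cut-down EMPLOYEE(LNAME, SALARY, DNO) table with hypothetical rows. It illustrates the same point: the SALARY on the right of = is read before the row is modified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY INTEGER, DNO INTEGER)")
# Hypothetical employees.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Smith", 30000, 5), ("Wong", 40000, 5), ("Zelaya", 25000, 4)],
)

# Right-hand SALARY = old value of the row; left-hand SALARY = new value.
conn.execute("UPDATE EMPLOYEE SET SALARY = SALARY + 1000 WHERE DNO = 5")

salaries = dict(conn.execute("SELECT LNAME, SALARY FROM EMPLOYEE"))
print(salaries)
```

Each department-5 row gets its own new value computed from its own old value; rows outside the WHERE condition are left untouched.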
AGGREGATE FUNCTIONS
• Include COUNT, SUM, MAX, MIN, and AVG
• Query 15: Find the maximum salary, the minimum salary, and
the average salary among all employees.
Slide 8-82
AGGREGATE FUNCTIONS (cont.)
• Query 16: Find the maximum salary, the minimum salary,
and the average salary among employees who work for
the 'Research' department.
Slide 8-83
AGGREGATE FUNCTIONS (cont.)
• Queries 17 and 18: Retrieve the total number of employees
in the company (Q17), and the number of employees in the
'Research' department (Q18).
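The SQL for Queries 15 to 18 is not shown on these slides. One possible formulation is sketched below with sqlite3, assuming cut-down EMPLOYEE(LNAME, SALARY, DNO) and DEPARTMENT(DNAME, DNUMBER) tables and hypothetical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY INTEGER, DNO INTEGER)")
conn.execute("CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER)")
# Hypothetical data.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Smith", 30000, 5), ("Wong", 40000, 5), ("Zelaya", 25000, 4), ("Borg", 55000, 1)],
)
conn.executemany(
    "INSERT INTO DEPARTMENT VALUES (?, ?)",
    [("Research", 5), ("Administration", 4), ("Headquarters", 1)],
)

research = "FROM EMPLOYEE, DEPARTMENT WHERE DNO = DNUMBER AND DNAME = 'Research'"

# Q15: aggregates over all employees.
q15 = conn.execute("SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY) FROM EMPLOYEE").fetchone()
# Q16: the same aggregates, restricted by a join to one department.
q16 = conn.execute("SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY) " + research).fetchone()
# Q17 and Q18: COUNT(*) counts rows.
q17 = conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]
q18 = conn.execute("SELECT COUNT(*) " + research).fetchone()[0]

print(q15, q16, q17, q18)
```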
Slide 8-84
GROUPING
• In many cases, we want to apply the aggregate
functions to subgroups of tuples in a relation
• Each subgroup of tuples consists of the set of tuples
that have the same value for the grouping
attribute(s)
• The function is applied to each subgroup
independently
• SQL has a GROUP BY-clause for specifying the
grouping attributes, which must also appear in the
SELECT-clause
Slide 8-85
GROUPING (cont.)
• Query 20: For each department, retrieve the department number, the
number of employees in the department, and their average salary.
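Query 20 might be written as follows. This is a sketch with sqlite3 over an assumed EMPLOYEE(LNAME, SALARY, DNO) table and hypothetical rows; the ORDER BY is added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY INTEGER, DNO INTEGER)")
# Hypothetical employees.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Smith", 30000, 5), ("Wong", 40000, 5), ("Zelaya", 25000, 4), ("Borg", 55000, 1)],
)

# One group per distinct DNO; COUNT and AVG are computed per group.
q20 = conn.execute(
    """
    SELECT DNO, COUNT(*), AVG(SALARY)
    FROM EMPLOYEE
    GROUP BY DNO
    ORDER BY DNO
    """
).fetchall()
for dno, headcount, avg_salary in q20:
    print(dno, headcount, avg_salary)
```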
Slide 8-86
GROUPING (cont.)
• Query 21: For each project, retrieve the project number, project
name, and the number of employees who work on that project.
– In this case, the grouping and functions are applied after the joining of
the two relations
Slide 8-87
THE HAVING-CLAUSE
• Sometimes we want to retrieve the values of
these functions for only those groups that
satisfy certain conditions
• The HAVING-clause is used for specifying a
selection condition on groups (rather than on
individual tuples)
Slide 8-88
THE HAVING-CLAUSE (cont.)
• Query 22: For each project on which more than two
employees work , retrieve the project number, project
name, and the number of employees who work on
that project.
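Queries 21 and 22 differ only in the HAVING clause, which the following sketch makes visible. It assumes PROJECT(PNUMBER, PNAME) and WORKS_ON(ESSN, PNO) tables with hypothetical rows; the grouping is applied after the join, and HAVING then filters whole groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROJECT (PNUMBER INTEGER, PNAME TEXT)")
conn.execute("CREATE TABLE WORKS_ON (ESSN TEXT, PNO INTEGER)")
# Hypothetical data: 3 people on project 1, 2 on project 2, 1 on project 3.
conn.executemany(
    "INSERT INTO PROJECT VALUES (?, ?)",
    [(1, "ProductX"), (2, "ProductY"), (3, "ProductZ")],
)
conn.executemany(
    "INSERT INTO WORKS_ON VALUES (?, ?)",
    [("e1", 1), ("e2", 1), ("e3", 1), ("e1", 2), ("e2", 2), ("e3", 3)],
)

grouped = """
    SELECT PNUMBER, PNAME, COUNT(*)
    FROM PROJECT, WORKS_ON
    WHERE PNUMBER = PNO
    GROUP BY PNUMBER, PNAME
"""

# Q21: every project with its head count.
q21 = conn.execute(grouped + " ORDER BY PNUMBER").fetchall()
# Q22: keep only groups with more than two employees.
q22 = conn.execute(grouped + " HAVING COUNT(*) > 2 ORDER BY PNUMBER").fetchall()

print(q21)
print(q22)
```

WHERE is evaluated per row before grouping; HAVING is evaluated per group afterwards, which is why a condition on COUNT(*) belongs in HAVING.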
Slide 8-89
Why Transactions?
• Database systems are normally being accessed
by many users or processes at the same time.
– Both queries and modifications.
• Unlike operating systems, which support
interaction of processes, a DBMS needs to
keep processes from troublesome
interactions.
90
Example: Bad Interaction
• You and your domestic partner each take $100
from different ATMs at about the same time.
– The DBMS had better make sure one account
deduction doesn’t get lost.
• Compare: An OS allows two people to edit a
document at the same time. If both write,
one’s changes get lost.
91
Transactions
• Transaction = process involving database
queries and/or modification.
• Normally with some strong properties
regarding concurrency.
• Formed in SQL from single statements or
explicit programmer control.
92
ACID Transactions
• ACID transactions are:
– Atomic : Whole transaction or none is done.
– Consistent : Database constraints preserved.
– Isolated : It appears to the user as if only one process
executes at a time.
– Durable : Effects of a process survive a crash.
• Optional: weaker forms of transactions are often
supported as well.
93
COMMIT
94
ROLLBACK
95
Example: Interacting Processes
• Assume the usual
Sells(shop,toothpaste,price) relation, and
suppose that Joe’s shop sells only Colgate for
$2.50 and Pepsodent for $3.00.
• Sally is querying Sells for the highest and
lowest price Joe charges.
• Joe decides to stop selling Colgate and
Pepsodent, but to sell only DantKanti at
$3.50.
96
Sally’s Program
• Sally executes the following two SQL
statements, called (max) and (min) to help us
remember what they do.
(max) SELECT MAX(price) FROM Sells
WHERE shop = 'Joe''s shop';
(min) SELECT MIN(price) FROM Sells
WHERE shop = 'Joe''s shop';
97
Joe’s Program
98
Interleaving of Statements
• Although (max) must come before (min), and
(del) must come before (ins), there are no
other constraints on the order of these
statements, unless we group Sally’s and/or
Joe’s statements into transactions.
99
Example: Strange Interleaving
• Suppose the steps execute in the order
(max)(del)(ins)(min).
Joe’s Prices:  {2.50,3.00}  {2.50,3.00}           {3.50}
Statement:     (max)        (del)        (ins)    (min)
Result:        3.00                               3.50
Sally’s reported maximum (3.00) is lower than her reported minimum (3.50).
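The interleaving can be replayed step by step. Joe's (del) and (ins) statements are not shown on these slides, so the ones below are assumed from the description of his price change; a single sqlite3 connection is used because only the order of the four statements matters here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sells (shop TEXT, toothpaste TEXT, price REAL)")
conn.executemany(
    "INSERT INTO Sells VALUES (?, ?, ?)",
    [("Joe's shop", "Colgate", 2.50), ("Joe's shop", "Pepsodent", 3.00)],
)
shop = ("Joe's shop",)

# (max) Sally reads the highest price while the old prices are in place.
highest = conn.execute("SELECT MAX(price) FROM Sells WHERE shop = ?", shop).fetchone()[0]
# (del) and (ins): Joe replaces his price list (assumed statements).
conn.execute("DELETE FROM Sells WHERE shop = ?", shop)
conn.execute("INSERT INTO Sells VALUES (?, 'DantKanti', 3.50)", shop)
# (min) Sally reads the lowest price, but now from the new prices.
lowest = conn.execute("SELECT MIN(price) FROM Sells WHERE shop = ?", shop).fetchone()[0]

print(highest, lowest)  # 3.0 3.5
```

No single state of the table ever had a maximum below its minimum; the anomaly comes purely from reading two different states.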
100
Fixing the Problem by Using Transactions
• If we group Sally’s statements (max)(min) into
one transaction, then she cannot see this
inconsistency.
• She sees Joe’s prices at some fixed time.
– Either before or after he changes prices, or in the
middle, but the MAX and MIN are computed from
the same prices.
101
Another Problem: Rollback
102
Solution
• If Joe executes (del)(ins) as a transaction, its
effect cannot be seen by others until the
transaction executes COMMIT.
– If the transaction executes ROLLBACK instead,
then its effects can never be seen.
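The behaviour described above can be observed with two connections to the same SQLite file, one standing in for Joe and one for Sally. This is a sketch under assumptions: the (del) and (ins) statements are reconstructed from the earlier description, the helper names are illustrative, and the visibility rules shown are those of SQLite's default settings rather than of every DBMS.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sells.db")
    joe = sqlite3.connect(path)
    sally = sqlite3.connect(path)

    joe.execute("CREATE TABLE Sells (shop TEXT, toothpaste TEXT, price REAL)")
    joe.executemany(
        "INSERT INTO Sells VALUES (?, ?, ?)",
        [("Joe's shop", "Colgate", 2.50), ("Joe's shop", "Pepsodent", 3.00)],
    )
    joe.commit()

    def price_range():
        # What Sally currently sees: (lowest, highest).
        return sally.execute("SELECT MIN(price), MAX(price) FROM Sells").fetchall()[0]

    def change_prices():
        # (del) then (ins); sqlite3 opens a transaction before the DELETE.
        joe.execute("DELETE FROM Sells WHERE shop = 'Joe''s shop'")
        joe.execute("INSERT INTO Sells VALUES ('Joe''s shop', 'DantKanti', 3.50)")

    change_prices()
    during = price_range()          # uncommitted work is invisible to Sally
    joe.rollback()
    after_rollback = price_range()  # rolled back: the change never happened

    change_prices()
    joe.commit()
    after_commit = price_range()    # committed: Sally now sees the new price

    joe.close()
    sally.close()

print(during, after_rollback, after_commit)
```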
103