
Data Analytics and AI

for
Process Industry

- Dr. C.V.Rao

Why To Study This Course?
• Analytics are everywhere
• Upstream activities (exploration and production
of crude oil and natural gas)
• Midstream activities (processing, storing, marketing,
and transporting commodities such as crude oil,
natural gas, and natural gas liquids)
• Downstream activities (oil refineries,
petrochemical plants, petroleum product
distributors, retail outlets, and natural gas
distribution companies)
Why To Study This Course?
• Software is used in all 3 streams
• The Internet of Things (IoT): many sensors and
actuators are remotely monitored and controlled
• There are IoT standards for all 3 streams
• Understand the technologies behind ERP
products
• Different methods and tools for analyzing
the data
Syllabus
• Languages of Data Science: R, Excel, SQL, Python, and Tableau;
• Introduction to Data Warehousing and OLAP; Data Preparation and
Visualization;
• Descriptive Statistics: central tendency and variability; Inferential
Statistics: Probability, Central Limit Theorem; Exploratory Data Analysis;
Hypothesis Testing; Linear Regression;
• Classification: KNN, Naive Bayes and Logistic Regression; K-Means and
Hierarchical clustering; Time Series; Decision Trees; Support Vector
Machines; Neural networks; Association Rule Mining;
• Introduction to Big Data And Hadoop; Managing Big Data: Hadoop
Ecosystem tools (Sqoop and Hive); Introduction to Spark; Big Data Analysis
using SparkR, SparkSQL; Case Studies;
• Spatial Data Model; Visualization and Query of Spatial Data; Subsurface
Mapping and Correlation and applications.

• Evaluation: 20 (Assignments) + 30 (Mid) + 50 (Final)


Languages of Data Science:
SQL

Pre-Requisite: Database Overview


Outline
• Database
– What, Why, How
• Evolution of Database
– File System
– Data Models
• Hierarchical
• Network
• Relational
• Entity-Relationship
• Object-Oriented
– Web Database

Database: What
• Database
– is a collection of related data and its metadata organized in a structured
format
– for optimized information management

• Database Management System (DBMS)


– is software that enables easy creation, access, and modification of
databases
– for efficient and effective database management

• Database System
– is an integrated system of hardware, software, people, procedures,
and data
– that define and regulate the collection, storage, management, and use
of data within a database environment

Database Management System
- manages interaction between end users and database

Database Systems: Design, Implementation, & Management: Rob & Coronel

Database System Environment

• Hardware
• Software
- OS
- DBMS
- Applications
• People
• Procedures
• Data

Database Systems: Design, Implementation, & Management: Rob & Coronel

Database: Why
• Purpose of Database
– Optimizes data management
– Transforms data into information

• Importance of Database Design


– Defines the database’s expected use
• different approach needed for different types of
databases
– Avoids data redundancy & ensures data integrity
• data is accurate and verifiable
– Poorly designed database generates errors
• leads to bad decisions
• can lead to failure of organization

Database: Why

• Functions of DBMS/Database System


– Stores data and related data entry forms, report
definitions, etc.
– Hides the complexities of relational database model from
the user
• facilitates the definition of data elements and their
relationships
• enables data transformation and presentation
– Enforces data integrity
– Implements data security management
• access, privacy, backup & restoration

Database: Data Models
• Importance
– Abstraction of complex real-world data structures into
relatively simple (graphical) representations
– Facilitate interaction among the designer, the applications
programmer, and the end user

• Basic Building Blocks


– Entity
• thing about which data are to be collected and stored
– Attribute
• a characteristic of an entity
– Relationship
• describes an association among entities
– Constraint
• restrictions placed on the data
Evolution of Data Models
• Timeline

[Figure: timeline from the 1960s to 2000+ showing the data models File-based, Hierarchical, Network, Relational, Entity-Relationship, Object-oriented, and Web-based]

Database: Historical Roots
• Manual File System
– to keep track of data
– used tagged file folders in a filing cabinet
– organized according to expected use
• e.g. file per customer
– easy to create, but hard to
• aggregate/summarize data
• locate data

• Computerized File System


– to accommodate the data growth and information need
– manual file system structures were duplicated in the computer
– Data Processing (DP) specialists wrote customized programs to
• write, delete, update data (i.e. management)
• extract and present data in various formats (i.e. report)
File System: Example

Database Systems: Design, Implementation, & Management: Rob & Coronel

File System: Weakness
• Weakness
– “Islands of data” in scattered file systems.

• Problems
– Duplication
• same data may be stored in multiple files
– Inconsistency
• same data may be stored under different names in different formats
– Rigidity
• requires customized programming to implement any changes
• cannot do ad-hoc queries

• Implications
– Waste of space
– Data inaccuracies
– High overhead of data manipulation and maintenance

File System: Problem Case
File           Field name  Field size  Sample value
CUSTOMER file  A_Name      15 char     Carol Johnson
AGENT file     A_Name      20 char     Carol T. Johnson
SALES file     AGENT       20 char     Carol J. Smith

- inconsistent field name, field size


- inconsistent data values
- data duplication

Database System vs. File System

Database Systems: Design, Implementation, & Management: Rob & Coronel

Hierarchical Database
• Background
– Developed to manage large amounts of data for complex manufacturing
projects
– e.g., Information Management System (IMS)
• IBM-Rockwell joint venture
• clustered related data together
• hierarchically associated data clusters using pointers

• Hierarchical Database Model


– Assumes data relationships are hierarchical
• One-to-Many (1:M) relationships
– Each parent can have many children
– Each child has only one parent
– Logically represented by an upside down tree

Hierarchical Database: Example

Database Systems: Design, Implementation, & Management: Rob & Coronel

Hierarchical Database: Pros & Cons
• Advantages
– Conceptual simplicity
• groups of data could be related to each other
• related data could be viewed together
– Centralization of data
• reduced redundancy and promoted consistency

• Disadvantages
– Limited representation of data relationships
• did not allow Many-to-Many (M:N) relations
– Complex implementation
• required in-depth knowledge of physical data storage
– Structural Dependence
• data access requires physical storage path
– Lack of Standards
• limited portability
Network Database
• Objectives
– Represent more complex data relationships
– Improve database performance
– Impose a database standard

• Network Database Model


– Similar to Hierarchical Model
• Records linked by pointers
– Composed of sets
• Each set consists of owner (parent) and member (child)
– Many-to-Many (M:N) relationships representation
• Each owner can have multiple members (1:M)
• A member may have several owners

Network Database: Example

Database Systems: Design, Implementation, & Management: Rob & Coronel

Network Database: Pros & Cons
• Advantages
– More data relationship types
– More efficient and flexible data access
• “network” vs. “tree” path traversal
– Conformance to standards
• enhanced database administration and portability

• Disadvantages
– System complexity
• requires familiarity with the internal structure for data access
– Lack of structural independence
• small structural changes require significant program changes

Relational Database
• Problems with legacy database systems
– Required excessive effort to maintain
• Data manipulation (programs) too dependent on physical file
structure
– Hard to manipulate by end-users
• No capacity for ad-hoc query (must rely on DB programmers).

• Evolution in Data Organization

– E. F. Codd’s Relational Model proposal


• Separated the notion of physical representation (machine-view)
from logical representation (human-view)
• Considered ingenious but computationally impractical in 1970

Relational Database
– Relational Database Model
• Dominant database model of today
• Eliminated pointers and used tables to represent data
• Tables
– flexible logical structure for data representation
– a series of row/column intersections
– related by sharing common entity characteristic(s)

Relational Database: Example
• Provides a logical “human-level” view of the data and associations
among groups of data (i.e., tables)

Customer_ID  Customer_Account  Agent_ID
1224         4556              23
1225         4558              25

Agent_ID  Last_Name  First_Name  Phone
23        Sturm      David       334-5678
25        Long       Kyle        556-3421

Customer_ID  Last_Name  First_Name  Phone     Account_Balance
1224         Vira       Dyne        678-9987  1223.95
1225         Davies     Tricia      556-3342  234.25
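
A minimal sketch of how the three tables above relate through their shared columns, using Python's built-in sqlite3 module. The table names `customer`, `agent`, and `customer_agent` are assumptions for illustration; the slide does not name the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table names are hypothetical; columns and rows follow the slide.
conn.executescript("""
CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, last_name TEXT,
                    first_name TEXT, phone TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT,
                       first_name TEXT, phone TEXT, account_balance REAL);
CREATE TABLE customer_agent (customer_id INTEGER, customer_account INTEGER,
                             agent_id INTEGER);
INSERT INTO agent VALUES (23, 'Sturm', 'David', '334-5678'),
                         (25, 'Long', 'Kyle', '556-3421');
INSERT INTO customer VALUES (1224, 'Vira', 'Dyne', '678-9987', 1223.95),
                            (1225, 'Davies', 'Tricia', '556-3342', 234.25);
INSERT INTO customer_agent VALUES (1224, 4556, 23), (1225, 4558, 25);
""")

# Tables are related only by matching values (Customer_ID, Agent_ID), not by pointers.
rows = conn.execute("""
SELECT c.last_name, a.last_name
FROM customer AS c, customer_agent AS ca, agent AS a
WHERE c.customer_id = ca.customer_id AND ca.agent_id = a.agent_id
ORDER BY c.customer_id
""").fetchall()
print(rows)
```

Each customer is paired with an agent purely because the Agent_ID values match, which is the "common entity characteristic" idea from the previous slide.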

Relational Database: Pros & Cons
• Advantages
– Structural independence
• Separation of database design and physical data storage/access
• Easier database design, implementation, management, and use
– Ad hoc query capability with Structured Query Language (SQL)
• the DBMS translates SQL queries into the low-level operations that retrieve the data

• Disadvantages
– Substantial hardware and system software overhead
• more complex system
– Poor design and implementation is made easy
• ease-of-use allows careless use of RDBMS

Entity Relationship Model
• Peter Chen’s Landmark Paper in 1976
– “The Entity-Relationship Model: Toward a Unified View of Data”
– Graphical representation of entities and their relationships

• Entity Relationship (ER) Model

– Based on Entity, Attributes & Relationships


• Entity is a thing about which data are to be collected and stored
– e.g. EMPLOYEE
• Attributes are characteristics of the entity
– e.g. SSN, last name, first name
• Relationships describe associations between entities
– i.e. 1:M, M:N, 1:1

Entity Relationship Model
• Entity Relationship (ER) Model

– Complements the relational data model concepts


• Helps to visualize structure and content of data groups
– entity is mapped to a relational table
• Tool for conceptual data modeling (higher level representation)

– Represented in an Entity Relationship Diagram (ERD)


• Formalizes a way to describe relationships between groups of
data

E-R Diagram: Chen Model
• Entity
– represented by a rectangle
with its name in capital
letters.

• Relationships
– represented by an active or
passive verb inside the
diamond that connects the
related entities.

• Connectivities
– i.e., types of relationship
– written next to each entity
box.
Database Systems: Design, Implementation, & Management: Rob & Coronel

E-R Diagram: Crow’s Foot Model
• Entity
– represented by a rectangle
with its name in capital
letters.

• Relationships
– represented by an active or
passive verb that connects
the related entities.

• Connectivities
– indicated by symbols next to
entities.
• 2 vertical lines for 1
• “crow’s foot” for M
E-R Model: Pros & Cons
• Advantages

– Exceptional conceptual simplicity


• easily viewed and understood representation of database
• facilitates database design and management
– Integration with the relational database model
• enables better database design via conceptual modeling

• Disadvantages
– Incomplete model on its own
• Limited representational power
– cannot model data constraints not tied to entity relationships
» e.g. attribute constraints
– cannot represent relationships between attributes within
entities
• No data manipulation language (e.g. SQL)
– Loss of information content
• Hard to include attributes in ERD
Object-Oriented Database
• Semantic Data Model (SDM)
– Modeled both data and their relationships in a single structure
(object)
• Developed by Hammer & McLeod in 1981

• Object-oriented concepts became popular in 1990s


– Modularity facilitated program reuse and construction of complex
structures
– Ability to handle complex data types (e.g. multimedia data)

Object-Oriented Database
• Object-Oriented Database Model (OODBM)
– Maintains the advantages of the ER model but adds more features
– Object = entity + relationships (between & within entity)
• consists of attributes & methods
– attributes describe properties of an object
– methods are all relevant operations that can be performed on
an object
• self-contained abstraction of real-world entity
– Class = collection of similar objects with shared attributes and
methods
• e.g. EMPLOYEE class = (employ1 object, employ2 object, …)
• organized in a class hierarchy
– e.g. PERSON > EMPLOYEE, CUSTOMER
– Incorporates the notion of inheritance
• attributes and methods of a class are inherited by its descendant
classes
OO Database Model vs. E-R Model
OODBM:
- can accommodate relationships within an object
- objects to be used as building blocks for autonomous structures

Database Systems: Design, Implementation, & Management: Rob & Coronel

Object-Oriented Database: Pros & Cons
• Advantages
– Semantic representation of data
• fuller and more meaningful description of data via object
– Modularity, reusability, inheritance
– Ability to handle
• complex data
• sophisticated information requirements

• Disadvantages
– Lack of standards
• no standard data access method
– Complex navigational data access
• class hierarchy traversal
– Steep learning curve
• difficult to design and implement properly
– More system-oriented than user-centered
– High system overhead
• slow transactions
Web Database
• Internet is emerging as a prime business tool
– Shift away from models (e.g. relational vs. O-O)
– Emphasis on interfacing with the Internet

• Characteristics of “Internet age” databases


– Flexible, efficient, and secure Internet access
– Support for complex data types & relationships
– Seamless interfaces with multiple data sources and structures
– Ease of use for end-user, database architect, and database
administrator
• Simplicity of conceptual database model
• Many database design, implementation, and application
development tools
• Powerful DBMS GUI

SQL Introduction
Standard language for querying and manipulating data

Structured Query Language

Many standards out there:


• ANSI SQL
• SQL92 (a.k.a. SQL2)
• SQL99 (a.k.a. SQL3)
• Vendors support various subsets of these
• What we discuss is common to all of them

SQL
• Data Definition Language (DDL)
– Create/alter/delete tables and their attributes
• Data Manipulation Language (DML)
– Query one or more tables
– Insert/delete/modify tuples in tables
• Transact-SQL
– Idea: package a sequence of SQL statements → server

Data in SQL
1. Atomic types, a.k.a. data types
2. Tables built from atomic types

Data Types in SQL
• Characters:
– CHAR(20) -- fixed length
– VARCHAR(40)-- variable length
• Numbers:
– BIGINT, INT, SMALLINT, TINYINT
– REAL, FLOAT -- differ in precision
– MONEY
• Times and dates:
– DATE
– DATETIME -- SQL Server
• Others... All are simple
Tables in SQL

Table name: Product
Attribute names: PName, Price, Category, Manufacturer

PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

Tuples or rows: each line below the attribute names
Tables Explained
• A tuple = a record
– Restriction: all attributes are of atomic type
• A table = a set of tuples
– Like a list…
– …but it is unordered: no first(), no next(), no last().

Tables Explained
• The schema of a table is the table name and
its attributes:
Product(PName, Price, Category, Manufacturer)

• A key is an attribute whose values are
unique;
we underline a key (here, PName)
Product(PName, Price, Category, Manufacturer)

SQL Query

Basic form: (plus many many more bells and whistles)

SELECT attributes
FROM relations (possibly multiple)
WHERE conditions (selections)

Simple SQL Query

Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

SELECT *
FROM Product
WHERE category='Gadgets'

Result (“selection”):
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks

Simple SQL Query

Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

SELECT PName, Price, Manufacturer
FROM Product
WHERE Price > 100

Result (“selection” and “projection”):
PName        Price    Manufacturer
SingleTouch  $149.99  Canon
MultiTouch   $203.99  Hitachi
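
The two queries above can be tried with Python's built-in sqlite3 module. This is a small sketch, assuming prices are stored as plain numbers (REAL) rather than with a currency symbol; the ORDER BY clauses are added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (PName TEXT PRIMARY KEY, Price REAL, "
             "Category TEXT, Manufacturer TEXT)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
     ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
     ("SingleTouch", 149.99, "Photography", "Canon"),
     ("MultiTouch", 203.99, "Household", "Hitachi")],
)

# Selection: keep whole rows that satisfy the WHERE condition.
selection = conn.execute(
    "SELECT * FROM Product WHERE Category = 'Gadgets' ORDER BY PName"
).fetchall()

# Selection plus projection: filter rows, then keep only some columns.
projection = conn.execute(
    "SELECT PName, Price, Manufacturer FROM Product "
    "WHERE Price > 100 ORDER BY Price"
).fetchall()

print(selection)
print(projection)
```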

A Notation for SQL Queries
Input Schema:
Product(PName, Price, Category, Manufacturer)

SELECT PName, Price, Manufacturer
FROM Product
WHERE Price > 100

Output Schema:
Answer(PName, Price, Manufacturer)
Selections
What goes in the WHERE clause:
• x = y, x < y, x <= y, etc
– For numbers, they have the usual meanings
– For CHAR and VARCHAR: lexicographic ordering
• Expected conversion between CHAR and VARCHAR
– For dates and times, what you expect...
• Pattern matching on strings...

The LIKE operator
• s LIKE p: pattern matching on strings
• p may contain two special symbols:
– % = any sequence of characters
– _ = any single character

Product(PName, Price, Category, Manufacturer)


Find all products whose name mentions ‘gizmo’:

SELECT *
FROM Product
WHERE PName LIKE '%gizmo%'
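
A quick sketch of the two wildcards, run against the Product table with Python's sqlite3 module. One assumption to keep in mind: SQLite's LIKE ignores case for ASCII letters by default, which is why the lowercase pattern also matches 'Gizmo'; other database systems may compare case-sensitively. The helper `names` is a hypothetical convenience function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (PName TEXT, Price REAL, "
             "Category TEXT, Manufacturer TEXT)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
     ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
     ("SingleTouch", 149.99, "Photography", "Canon"),
     ("MultiTouch", 203.99, "Household", "Hitachi")],
)

def names(pattern):
    """Return the product names matching a LIKE pattern, sorted."""
    cur = conn.execute(
        "SELECT PName FROM Product WHERE PName LIKE ? ORDER BY PName",
        (pattern,),
    )
    return [row[0] for row in cur]

mentions_gizmo = names("%gizmo%")  # % matches any sequence, including an empty one
one_char_prefix = names("_izmo")   # _ matches exactly one character
print(mentions_gizmo, one_char_prefix)
```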

Eliminating Duplicates
SELECT DISTINCT category
FROM Product

Category
Gadgets
Photography
Household

Compare to:

SELECT category
FROM Product

Category
Gadgets
Gadgets
Photography
Household

Ordering the Results

SELECT pname, price, manufacturer
FROM Product
WHERE category='gizmo' AND price > 50
ORDER BY price, pname

Ordering is ascending, unless you specify the DESC keyword.

Ties are broken by the second attribute on the ORDER BY list, etc.

Ordering the Results

SELECT category
FROM Product
ORDER BY pname

Result: ?

PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

Ordering the Results
SELECT DISTINCT category
FROM Product
ORDER BY category

Category
Gadgets
Household
Photography

Compare to:

SELECT category
FROM Product
ORDER BY pname

Result: ?

Joins in SQL
• Connect two or more tables:

Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

Company
Cname       StockPrice  Country
GizmoWorks  25          USA
Canon       65          Japan
Hitachi     15          Japan

What is the connection between them?
Joins
Product (pname, price, category, manufacturer)
Company (cname, stockPrice, country)

Find all products under $200 manufactured in Japan;
return their names and prices.

SELECT pname, price
FROM Product, Company
WHERE manufacturer=cname AND country='Japan'   -- join between Product and Company
AND price <= 200

Joins in SQL
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

Company
Cname       StockPrice  Country
GizmoWorks  25          USA
Canon       65          Japan
Hitachi     15          Japan

SELECT pname, price
FROM Product, Company
WHERE manufacturer=cname AND country='Japan'
AND price <= 200

PName        Price
SingleTouch  $149.99

Joins
Product (pname, price, category, manufacturer)
Company (cname, stockPrice, country)

Find all countries that manufacture some product in the 'Gadgets'
category.

SELECT country
FROM Product, Company
WHERE manufacturer=cname AND category='Gadgets'

Joins in SQL
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

Company
Cname       StockPrice  Country
GizmoWorks  25          USA
Canon       65          Japan
Hitachi     15          Japan

SELECT country
FROM Product, Company
WHERE manufacturer=cname AND category='Gadgets'

Country
??
??

What is the problem? What's the solution?
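
A small sketch of the join queries on the Product and Company tables, using Python's sqlite3 module with prices stored as plain numbers (an assumption). It suggests one answer to the question on this slide: the join yields one output row per matching product, so the country is repeated, and DISTINCT removes the repeats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company (cname TEXT PRIMARY KEY, stockPrice REAL, country TEXT);
CREATE TABLE Product (pname TEXT PRIMARY KEY, price REAL, category TEXT,
                      manufacturer TEXT);
INSERT INTO Company VALUES ('GizmoWorks', 25, 'USA'), ('Canon', 65, 'Japan'),
                           ('Hitachi', 15, 'Japan');
INSERT INTO Product VALUES ('Gizmo', 19.99, 'Gadgets', 'GizmoWorks'),
                           ('Powergizmo', 29.99, 'Gadgets', 'GizmoWorks'),
                           ('SingleTouch', 149.99, 'Photography', 'Canon'),
                           ('MultiTouch', 203.99, 'Household', 'Hitachi');
""")

# Products under $200 made in Japan.
japan = conn.execute("""
SELECT pname, price FROM Product, Company
WHERE manufacturer = cname AND country = 'Japan' AND price <= 200
""").fetchall()

# One row per matching product, so the same country appears twice.
with_duplicates = conn.execute("""
SELECT country FROM Product, Company
WHERE manufacturer = cname AND category = 'Gadgets'
""").fetchall()

# DISTINCT collapses the repeated rows.
without_duplicates = conn.execute("""
SELECT DISTINCT country FROM Product, Company
WHERE manufacturer = cname AND category = 'Gadgets'
""").fetchall()

print(japan, with_duplicates, without_duplicates)
```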
Joins
Product (pname, price, category, manufacturer)
Purchase (buyer, seller, store, product)
Person(persname, phoneNumber, city)

Find names of people living in Seattle that bought some product in
the 'Gadgets' category, and the names of the stores they bought
such product from

SELECT DISTINCT persname, store
FROM Person, Purchase, Product
WHERE persname=buyer AND product = pname AND
city='Seattle' AND category='Gadgets'

When are two tables related?
• You guess they are
• I tell you so
• Foreign keys are a method for schema designers to
tell you so
– A foreign key states that a column is a reference to the key
of another table
ex: Product.manufacturer is a foreign key referencing Company.cname
Product (pname, price, category, manufacturer)
Company (cname, stockPrice, country)

– Gives information and enforces constraint
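
The "enforces constraint" part can be illustrated with a minimal sqlite3 sketch. Two assumptions: SQLite only checks foreign keys after `PRAGMA foreign_keys = ON` (other systems usually enforce them by default), and the rejected row ('Widget' from 'NoSuchCo') is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.execute("CREATE TABLE Company (cname TEXT PRIMARY KEY, "
             "stockPrice REAL, country TEXT)")
conn.execute("CREATE TABLE Product (pname TEXT PRIMARY KEY, price REAL, "
             "category TEXT, manufacturer TEXT REFERENCES Company(cname))")

conn.execute("INSERT INTO Company VALUES ('Canon', 65, 'Japan')")
# Accepted: 'Canon' exists as a key in Company.
conn.execute("INSERT INTO Product VALUES "
             "('SingleTouch', 149.99, 'Photography', 'Canon')")

try:
    # Hypothetical row whose manufacturer is not in Company.
    conn.execute("INSERT INTO Product VALUES "
                 "('Widget', 9.99, 'Gadgets', 'NoSuchCo')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

product_count = conn.execute("SELECT COUNT(*) FROM Product").fetchone()[0]
print(rejected, product_count)
```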

Disambiguating Attributes
• Sometimes two relations have the same attribute:
Person(pname, address, worksfor)
Company(cname, address)

SELECT DISTINCT pname, address      -- which address?
FROM Person, Company
WHERE worksfor = cname

SELECT DISTINCT Person.pname, Company.address
FROM Person, Company
WHERE Person.worksfor = Company.cname
Tuple Variables
Purchase (buyer, seller, store, product)

Find all stores that sold at least one product that the store
‘BestBuy’ also sold:

SELECT DISTINCT x.store
FROM Purchase AS x, Purchase AS y
WHERE x.product = y.product AND y.store = 'BestBuy'

Answer (store)
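
A minimal sketch of the self-join using sqlite3. All Purchase rows below (buyers, sellers, and the stores 'ShopA' and 'ShopB') are invented for illustration. Note that the query as written also returns 'BestBuy' itself, since every store sells the products it sells.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (buyer TEXT, seller TEXT, "
             "store TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?, ?)",
    [("Ann", "Sam", "BestBuy", "Gizmo"),      # hypothetical rows
     ("Bob", "Sue", "ShopA", "Gizmo"),
     ("Cid", "Sam", "ShopB", "MultiTouch")],
)

# x and y are tuple variables ranging over the same table independently.
stores = [row[0] for row in conn.execute("""
SELECT DISTINCT x.store
FROM Purchase AS x, Purchase AS y
WHERE x.product = y.product AND y.store = 'BestBuy'
ORDER BY x.store
""")]
print(stores)
```

Adding `AND x.store <> 'BestBuy'` would exclude the store from its own answer, if that is what the question intends.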
Meaning (Semantics) of SQL
Queries
SELECT a1, a2, …, ak
FROM R1 AS x1, R2 AS x2, …, Rn AS xn
WHERE Conditions

1. Nested loops:
Answer = {}
for x1 in R1 do
    for x2 in R2 do
        …..
            for xn in Rn do
                if Conditions
                    then Answer = Answer ∪ {(a1,…,ak)}
return Answer

Meaning (Semantics) of SQL
Queries
SELECT a1, a2, …, ak
FROM R1 AS x1, R2 AS x2, …, Rn AS xn
WHERE Conditions

2. Parallel assignment
Answer = {}
for all assignments x1 in R1, …, xn in Rn do
    if Conditions then Answer = Answer ∪ {(a1,…,ak)}
return Answer

Doesn’t impose any order !
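
The nested-loop reading can be written out directly in Python. This sketch evaluates the earlier query "products under $200 manufactured in Japan" over in-memory lists. It uses a list rather than a set for the answer, an assumption that mirrors how SQL keeps duplicates unless DISTINCT is given.

```python
# Rows follow the Product and Company tables from the earlier slides.
product = [
    ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
    ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "Photography", "Canon"),
    ("MultiTouch", 203.99, "Household", "Hitachi"),
]
company = [
    ("GizmoWorks", 25, "USA"),
    ("Canon", 65, "Japan"),
    ("Hitachi", 15, "Japan"),
]

answer = []
for p in product:                     # FROM Product AS p
    for c in company:                 # ,    Company AS c
        pname, price, _category, manufacturer = p
        cname, _stock_price, country = c
        # WHERE manufacturer = cname AND country = 'Japan' AND price <= 200
        if manufacturer == cname and country == "Japan" and price <= 200:
            answer.append((pname, price))   # SELECT pname, price

print(answer)
```

A real DBMS is free to pick a different evaluation strategy, as long as the result matches what these loops would produce.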


SQL Environment
• Catalog
– a set of schemas that constitute the description of a database (Dictionary)
• Schema
– The structure that contains descriptions of objects created by a user (base
tables, views, constraints)
• Data Definition Language (DDL)
– Commands that define a database, including creating, altering, and dropping
tables and establishing constraints
• Data Manipulation Language (DML)
– Commands that maintain and query a database
• Data Control Language (DCL)
– Commands that control a database, including administering privileges and
committing data
System Catalog
• CREATE TABLE inserts information into the catalog
• Catalog is another table that describes Objects created such
as:
– Table names
– Constraint names
– Role Names
– Triggers, Sequences, Views, etc
– Attribute names of different tables
– Corresponding attribute types, etc.
• Catalog schema is generally fixed by vendor
• In Oracle SQL this catalog is called DICTIONARY
SQL Database Definition
• Data Definition Language (DDL)
• Major CREATE statements:
– CREATE SCHEMA – defines a portion of the database owned by a
particular user
– CREATE TABLE – defines a table and its columns
– CREATE VIEW – defines a logical table from one or more tables or views

• Other CREATE statements: CHARACTER SET,
Sequence, Index, Constraint, Role, etc.
Example table creation
Employee
Emp_Name    Dept_no  Gender  Age  Salary
Sara John   2        M       27   1000
Sally Wood  2        F       27   2600
John Smith  1        M       32   5000
Mary Smith  10       F       42   1550

CREATE TABLE Employee (
    Emp_Name VARCHAR(12),
    Dept_no  NUMERIC(2),
    Gender   CHAR(1),
    Age      NUMERIC(3),
    Salary   NUMERIC(7,2));

• After creating the table, you can view its structure using the
command: DESC tableName
ALTER TABLE
• ALTER TABLE table_name
ADD column_name datatype
– Adds a column to the table
Ex : Alter table employee add address varchar2(40);
• ALTER TABLE table_name
DROP COLUMN column_name
– Removes a column (and all its data) from the table
Ex : Alter table employee drop column address;
• ALTER TABLE table_name
MODIFY (column_name newType/length)
Ex : Alter table employee modify age varchar2(15);
INSERT INTO (DML)
• Adds data to a table
• Syntax:
INSERT INTO table_name (column, …, column)
VALUES (value, …, value);

• The columns are the names of columns you are


putting data into, and the values are that data
• String data must be enclosed in single quotes
• Numbers are not quoted
• You can omit the column names if you supply a value
for every column
• Important Note: Only the constraints specified in the
DDL commands are automatically enforced by the
DBMS when updates are applied to the database
INSERT INTO (Cont.)
• Inserting into a table
– Insert into employee (emp_Name, Dept_no, gender, salary)
Values ('Sara johns', 1, 'F', 1440);
• Inserting a record that has some null attributes requires identifying the fields that
actually get data
• When you insert a record and you have values for all attributes, there is no need to
specify the attribute names; supply one value per column, in table order.
– Insert into employee
Values ('Suzy Alan', 10, 'F', 30, 1200);
• Inserting from another table
– INSERT INTO emp_senior
select * from employee where age > 60;
The main condition in this case is that both tables have the same attributes,
in the same order
Delete
Delete certain rows (depending on a
condition)
– Delete from employee where age<30;

Delete all rows


– Delete from employee;
DELETE (cont.)

• Truncate Syntax:
Truncate Table TableName, eg. : Truncate table employee
– Quicker way for deleting all the rows from a table
– It releases the space used by the table
Drop Syntax:
Drop TABLE TableName, eg. : Drop table employee
– Remove the table completely from the database
Relational Database Schema--Figure 5.5
• Examples:
U4A: DELETE FROM EMPLOYEE
WHERE LNAME='Brown'

U4B: DELETE FROM EMPLOYEE
WHERE SSN='123456789'

U4C: DELETE FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research')

U4D: DELETE FROM EMPLOYEE


UPDATE
• Used to modify attribute values of one or more
selected tuples
• A WHERE-clause selects the tuples to be modified
• An additional SET-clause specifies the attributes to
be modified and their new values
• Each command modifies tuples in the same relation
• Referential integrity should be enforced
UPDATE (cont.)
• Example: Change the location (plocation) and controlling
department number(dnum) of project number 10 to
'Bellaire' and 5, respectively.

U5: UPDATE PROJECT
SET PLOCATION = 'Bellaire', DNUM = 5
WHERE PNUMBER=10
UPDATE (cont.)
• Example: Give all employees in the 'Research' department a 10% raise
in salary.

U6: UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO IN (SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research')

• In this request, the modified SALARY value depends on the original SALARY
value in each tuple
• The reference to the SALARY attribute on the right of = refers to the old SALARY
value before modification
• The reference to the SALARY attribute on the left of = refers to the new SALARY
value after modification
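
A minimal sqlite3 sketch of U6. The schema is cut down to the columns the statement touches, and the two employees and their salaries are illustrative values, not data from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY REAL, DNO INTEGER);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Administration', 4);
-- Illustrative rows only.
INSERT INTO EMPLOYEE VALUES ('Smith', 30000, 5), ('Wallace', 43000, 4);
""")

# SALARY on the right of = is the old value; on the left, the new one.
conn.execute("""
UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME = 'Research')
""")

salaries = dict(conn.execute("SELECT LNAME, SALARY FROM EMPLOYEE"))
print(salaries)
```

Only the 'Research' employee gets the raise; the printed value may show a tiny floating-point remainder because SALARY is stored as a binary float here.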
AGGREGATE FUNCTIONS
• Include COUNT, SUM, MAX, MIN, and AVG
• Query 15: Find the maximum salary, the minimum salary, and
the average salary among all employees.

Q15: SELECT MAX(SALARY),
MIN(SALARY), AVG(SALARY)
FROM EMPLOYEE

– Some SQL implementations may not allow more than one
function in the SELECT-clause

AGGREGATE FUNCTIONS (cont.)
• Query 16: Find the maximum salary, the minimum salary,
and the average salary among employees who work for
the 'Research' department.

Q16: SELECT MAX(SALARY), MIN(SALARY),
AVG(SALARY)
FROM EMPLOYEE, DEPARTMENT
WHERE DNO=DNUMBER AND
DNAME='Research'

AGGREGATE FUNCTIONS (cont.)
• Queries 17 and 18: Retrieve the total number of employees
in the company (Q17), and the number of employees in the
'Research' department (Q18).

Q17: SELECT COUNT (*)
FROM EMPLOYEE

Q18: SELECT COUNT (*)
FROM EMPLOYEE,
DEPARTMENT
WHERE DNO=DNUMBER AND
DNAME='Research'

GROUPING
• In many cases, we want to apply the aggregate
functions to subgroups of tuples in a relation
• Each subgroup of tuples consists of the set of tuples
that have the same value for the grouping
attribute(s)
• The function is applied to each subgroup
independently
• SQL has a GROUP BY-clause for specifying the
grouping attributes; any non-aggregated attribute in the
SELECT-clause must also appear in the GROUP BY-clause

GROUPING (cont.)
• Query 20: For each department, retrieve the department number, the
number of employees in the department, and their average salary.

Q20: SELECT DNO, COUNT (*), AVG (SALARY)
FROM EMPLOYEE
GROUP BY DNO

– In Q20, the EMPLOYEE tuples are divided into groups, each group
having the same value for the grouping attribute DNO
– The COUNT and AVG functions are applied to each such group of
tuples separately
– The SELECT-clause includes only the grouping attribute and the
functions to be applied on each group of tuples
– A join condition can be used in conjunction with grouping
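
Q20 can be tried with sqlite3 as follows. The three EMPLOYEE rows are invented so that one department has two employees; the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY REAL, DNO INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("A", 30000, 5), ("B", 40000, 5), ("C", 25000, 4)],  # hypothetical rows
)

# One output row per distinct DNO; COUNT and AVG are computed per group.
per_department = conn.execute("""
SELECT DNO, COUNT(*), AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO
ORDER BY DNO
""").fetchall()
print(per_department)
```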

GROUPING (cont.)
• Query 21: For each project, retrieve the project number, project
name, and the number of employees who work on that project.

Q21: SELECT PNUMBER, PNAME, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE PNUMBER=PNO
GROUP BY PNUMBER, PNAME

– In this case, the grouping and functions are applied after the joining of
the two relations

THE HAVING-CLAUSE
• Sometimes we want to retrieve the values of
these functions for only those groups that
satisfy certain conditions
• The HAVING-clause is used for specifying a
selection condition on groups (rather than on
individual tuples)

THE HAVING-CLAUSE (cont.)
• Query 22: For each project on which more than two
employees work , retrieve the project number, project
name, and the number of employees who work on
that project.

Q22: SELECT PNUMBER, PNAME, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE PNUMBER=PNO
GROUP BY PNUMBER, PNAME
HAVING COUNT (*) > 2
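
A short sqlite3 sketch of Q22. The project names and employee identifiers are made up: one project has three workers and the other has two, so only the first passes the HAVING test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROJECT (PNUMBER INTEGER PRIMARY KEY, PNAME TEXT);
CREATE TABLE WORKS_ON (ESSN TEXT, PNO INTEGER);
-- Hypothetical data.
INSERT INTO PROJECT VALUES (1, 'ProductX'), (2, 'ProductY');
INSERT INTO WORKS_ON VALUES ('e1', 1), ('e2', 1), ('e3', 1),
                            ('e1', 2), ('e2', 2);
""")

# WHERE filters individual joined rows; HAVING filters whole groups.
busy_projects = conn.execute("""
SELECT PNUMBER, PNAME, COUNT(*)
FROM PROJECT, WORKS_ON
WHERE PNUMBER = PNO
GROUP BY PNUMBER, PNAME
HAVING COUNT(*) > 2
""").fetchall()
print(busy_projects)
```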

Why Transactions?
• Database systems are normally being accessed
by many users or processes at the same time.
– Both queries and modifications.
• Unlike operating systems, which support
interaction of processes, a DBMS needs to
keep processes from troublesome
interactions.

Example: Bad Interaction
• You and your domestic partner each take $100
from different ATM’s at about the same time.
– The DBMS had better make sure one account
deduction doesn’t get lost.
• Compare: An OS allows two people to edit a
document at the same time. If both write,
one’s changes get lost.

Transactions
• Transaction = process involving database
queries and/or modification.
• Normally with some strong properties
regarding concurrency.
• Formed in SQL from single statements or
explicit programmer control.

ACID Transactions
• ACID transactions are:
– Atomic : Whole transaction or none is done.
– Consistent : Database constraints preserved.
– Isolated : It appears to the user as if only one process
executes at a time.
– Durable : Effects of a process survive a crash.
• Optional: weaker forms of transactions are often
supported as well.

COMMIT

• The SQL statement COMMIT causes a
transaction to complete.
– Its database modifications are now permanent
in the database.

ROLLBACK

• The SQL statement ROLLBACK also causes
the transaction to end, but by aborting.
– No effects on the database.
• Failures like division by 0 or a constraint
violation can also cause rollback, even if
the programmer does not request it.

Example: Interacting Processes
• Assume the usual
Sells(shop,toothpaste,price) relation, and
suppose that Joe’s shop sells only Colgate for
$2.50 and Pepsodent for $3.00.
• Sally is querying Sells for the highest and
lowest price Joe charges.
• Joe decides to stop selling Colgate and
Pepsodent, but to sell only DantKanti at
$3.50.
Sally’s Program
• Sally executes the following two SQL
statements, called (max) and (min) to help us
remember what they do.
(max) SELECT MAX(price) FROM Sells
WHERE shop = 'Joe''s shop';
(min) SELECT MIN(price) FROM Sells
WHERE shop = 'Joe''s shop';

Joe’s Program

• At about the same time, Joe executes the
following steps: (del) and (ins).
(del) DELETE FROM Sells
WHERE shop = 'Joe''s shop';
(ins) INSERT INTO Sells
VALUES('Joe''s shop', 'DantKanti', 3.50);

Interleaving of Statements
• Although (max) must come before (min), and
(del) must come before (ins), there are no
other constraints on the order of these
statements, unless we group Sally’s and/or
Joe’s statements into transactions.

Example: Strange Interleaving
• Suppose the steps execute in the order
(max)(del)(ins)(min).
Statement:     (max)         (del)         (ins)   (min)
Joe’s Prices:  {2.50,3.00}   {2.50,3.00}           {3.50}
Result:        3.00                                3.50

• Sally sees MAX < MIN!
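
The interleaving can be replayed with a plain Python list standing in for Joe's rows in Sells. This is only a simulation of the statement order (max)(del)(ins)(min), not a concurrent database session.

```python
# Joe's prices before anything runs: Colgate 2.50, Pepsodent 3.00.
prices = [2.50, 3.00]

max_seen = max(prices)   # (max)  Sally reads the highest price
prices.clear()           # (del)  Joe deletes his old rows
prices.append(3.50)      # (ins)  Joe inserts DantKanti at 3.50
min_seen = min(prices)   # (min)  Sally reads the lowest price

print(max_seen, min_seen, max_seen < min_seen)
```

Sally's two reads come from different states of the table, which is how a "maximum" below the "minimum" becomes possible.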

Fixing the Problem by Using
Transactions
• If we group Sally’s statements (max)(min) into
one transaction, then she cannot see this
inconsistency.
• She sees Joe’s prices at some fixed time.
– Either before or after he changes prices, or in the
middle, but the MAX and MIN are computed from
the same prices.

Another Problem: Rollback

• Suppose Joe executes (del)(ins), not as a
transaction, but after executing these
statements, thinks better of it and issues a
ROLLBACK statement.
• If Sally executes her statements after (ins)
but before the rollback, she sees a value,
3.50, that never existed in the database.

Solution
• If Joe executes (del)(ins) as a transaction, its
effect cannot be seen by others until the
transaction executes COMMIT.
– If the transaction executes ROLLBACK instead,
then its effects can never be seen.
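
A minimal sketch of COMMIT and ROLLBACK on the Sells relation, using a single sqlite3 connection. It assumes Python's default sqlite3 behaviour, where the first data-modifying statement opens a transaction that lasts until `commit()` or `rollback()`. It shows only that a rolled-back (del)(ins) leaves no trace; showing what a second user sees would need a second connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sells (shop TEXT, toothpaste TEXT, price REAL)")
conn.executemany(
    "INSERT INTO Sells VALUES (?, ?, ?)",
    [("Joe's shop", "Colgate", 2.50), ("Joe's shop", "Pepsodent", 3.00)],
)
conn.commit()

def joes_prices():
    """Return Joe's current prices, sorted."""
    cur = conn.execute("SELECT price FROM Sells WHERE shop = ?", ("Joe's shop",))
    return sorted(row[0] for row in cur)

def change_prices():
    """Run (del) then (ins) inside the currently open transaction."""
    conn.execute("DELETE FROM Sells WHERE shop = ?", ("Joe's shop",))
    conn.execute("INSERT INTO Sells VALUES (?, ?, ?)",
                 ("Joe's shop", "DantKanti", 3.50))

change_prices()
conn.rollback()                 # abort: both statements are undone together
after_rollback = joes_prices()

change_prices()
conn.commit()                   # complete: the changes become permanent
after_commit = joes_prices()

print(after_rollback, after_commit)
```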
