1. Database System Concepts and Architecture
Data Models
A data model is a collection of concepts that can be used to describe the structure of a database: its data types, relationships, and constraints.
Most data models also include a set of basic operations for retrievals and updates.
A data model may also specify the dynamic aspect, or behavior, of a database application through user-defined operations. Example: COMPUTE_GPA, which can be applied to a STUDENT object.
Physical data models describe how data is stored: record formats, record orderings, and access paths (structures that make search more efficient).
Schemas
Each object in the schema, such as STUDENT or COURSE, is a schema construct. A schema diagram represents only some aspects of a schema: the names of record types and data items, and some types of constraints.
Jan 29, 2002
The data in the database at a particular moment in time is called a database state or snapshot. It is also called the current set of occurrences or instances in the database.
When we define a new database, its state is the empty state: only the schema has been specified to the DBMS.
The initial state is reached when the database is first populated. From then on, at any point in time, the database has a current state.
Schema evolution: changing the schema when the need arises.
The aim is to separate user applications from the physical database. Schemas can be defined at three levels:
The internal level has an internal schema, which describes the physical storage structure of the database. It uses a physical data model.
Data Independence
Data independence is the capacity to change the schema at one level of a database system without having to change the schema at the next higher level.
Logical data independence: the capacity to change the conceptual schema without having to change external schemas or application programs.
Physical data independence: the capacity to change the internal schema without having to change the conceptual (or external) schemas.
DBMS Languages
Data definition language (DDL): specifies the conceptual and internal schemas for the database and any mappings between the two.
Storage definition language (SDL): specifies the internal schema when there is a clear distinction between the conceptual and internal schemas.
View definition language (VDL): specifies user views and their mappings to the conceptual schema.
Data manipulation language (DML): retrieval, insertion, deletion, and modification of the data.
DBMS Languages (continued)
SQL, the relational database language, represents a combination of DDL, VDL, and DML, as well as statements for constraint specification and schema evolution.
There are two main types of DMLs:
A high-level or nonprocedural DML specifies complex database operations and works set-at-a-time. Example: SQL.
A low-level or procedural DML retrieves individual records or objects from the database and processes each separately (record-at-a-time).
DBMS Interfaces
Forms-Based Interfaces
Display a form for each user (for example, to insert or select data). Designed for naive users.
DBMS Interfaces
Natural Language Interfaces
Accept requests in natural language and attempt to understand them. The interface refers to the words in its schema, and to a set of standard words, to interpret the request.
Interfaces for Parametric Users (e.g., bank tellers)
The goal is to minimize the number of keystrokes required (for example, through the use of function keys).
Interfaces for the DBA
Used for creating accounts, granting system privileges, changing a schema, etc.
Data model:
Centralized DBMS; distributed DBMS (DDBMS); homogeneous DDBMSs; federated DBMS (software developed to access several autonomous, preexisting databases stored under heterogeneous DBMSs).
Example: online transaction processing (OLTP) systems, which must support a large number of concurrent transactions without imposing excessive delays.
What is DBMS?
Need for information management. A database is a very large, integrated collection of data that models a real-world enterprise.
A Database Management System (DBMS) is a software package designed to store and manage databases.
Data independence and efficient access. Data integrity and security. Uniform data administration. Concurrent access, recovery from crashes. Replication control. Reduced application development time.
At the low end: access to the physical world. At the high end: scientific applications. Digital libraries, interactive video, Human Genome project, e-commerce, sensor networks ... the need for DBMS/data services is exploding.
Data Models
A data model is a collection of concepts for describing data. A schema is a description of a particular collection of data, using a given data model. The relational model of data is the most widely used model today.
Main concept: relation, basically a table with rows and columns. Every relation has a schema, which describes the columns, or fields.
Levels of Abstraction
[Figure: View 1, View 2, View 3]
Views describe how users see the data.
Conceptual schema defines the logical structure.
Physical schema describes the files and indexes used.
Conceptual schema:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
Courses(cid: string, cname: string, credits: integer)
Enrolled(sid: string, cid: string, grade: string)
Physical schema:
Relations stored as unordered files.
Index on first column of Students.
External schema (view):
Course_info(cid: string, enrollment: integer)
Data Independence
Applications insulated from how data is structured and stored. Logical data independence: Protection from changes in logical structure of data. Physical data independence: Protection from changes in physical structure of data.
Concurrency Control
Interleaving actions of different user programs can lead to inconsistency: e.g., a check is cleared while the account balance is being computed. The DBMS ensures such problems don't arise: users can pretend they are using a single-user system.
Because disk accesses are frequent, and relatively slow, it is important to keep the CPU humming by working on several user programs concurrently.
Users can specify some simple integrity constraints on the data, and the DBMS will enforce these constraints. Beyond this, the DBMS does not really understand the semantics of the data (e.g., it does not understand how the interest on a bank account is computed). Why not? Thus, ensuring that a transaction (run alone) preserves consistency is ultimately the user's responsibility!
The DBMS ensures that the execution of {T1, ... , Tn} is equivalent to some serial execution T1' ... Tn' (the same transactions, run one after another in some order).
Before reading/writing an object, a transaction requests a lock on the object and waits until the DBMS gives it the lock. All locks are released at the end of the transaction. (Strict 2PL locking protocol.)
Idea: If an action of Ti (say, writing X) affects Tj (which perhaps reads X), one of them, say Ti, will obtain the lock on X first, and Tj is forced to wait until Ti completes; this effectively orders the transactions.
What if Tj already has a lock on Y and Ti later requests a lock on Y? What is it called? What will happen?
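The closing question describes a deadlock: Ti holds X and waits for Y, while Tj holds Y and waits for X. A minimal Python sketch of that situation follows. The `LockManager` class, its method names, and the wait-for graph are illustrative assumptions, not the mechanism of any particular DBMS; the sketch handles exclusive locks only and never wakes up waiting transactions.

```python
from collections import defaultdict


class LockManager:
    """Toy exclusive-lock table (hypothetical): one holder per object."""

    def __init__(self):
        self.holder = {}                    # object -> transaction holding its lock
        self.waits_for = defaultdict(set)   # transaction -> transactions it waits on

    def request(self, txn, obj):
        """Return True if the lock is granted, False if txn has to wait."""
        owner = self.holder.get(obj)
        if owner is None or owner == txn:
            self.holder[obj] = txn
            return True
        self.waits_for[txn].add(owner)      # record an edge in the wait-for graph
        return False

    def release_all(self, txn):
        """Strict 2PL: all locks are released together, at the end of the transaction."""
        for obj in [o for o, t in self.holder.items() if t == txn]:
            del self.holder[obj]
        self.waits_for.pop(txn, None)
        for waiting in self.waits_for.values():
            waiting.discard(txn)

    def has_deadlock(self):
        """A cycle in the wait-for graph means no transaction in it can proceed."""
        def reaches(start, node, seen):
            for nxt in self.waits_for.get(node, ()):
                if nxt == start:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    if reaches(start, nxt, seen):
                        return True
            return False
        return any(reaches(t, t, set()) for t in list(self.waits_for))


lm = LockManager()
granted_ti_x = lm.request("Ti", "X")   # Ti locks X
granted_tj_y = lm.request("Tj", "Y")   # Tj locks Y
granted_tj_x = lm.request("Tj", "X")   # Tj has to wait for Ti
deadlock_before = lm.has_deadlock()    # only one waiter so far: no cycle
granted_ti_y = lm.request("Ti", "Y")   # Ti has to wait for Tj: cycle
deadlock_after = lm.has_deadlock()
```

A real system would break the cycle by aborting one of the transactions and undoing its changes, which is where the log comes in.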
Ensuring Atomicity
The DBMS ensures atomicity (the all-or-nothing property) even if the system crashes in the middle of a Xact. Idea: keep a log (history) of all actions carried out by the DBMS while executing a set of Xacts:
Before a change is made to the database, the corresponding log entry is forced to a safe location. (WAL protocol.)
After a crash, the effects of partially executed transactions are undone using the log. (Thanks to WAL, if the log entry wasn't saved before the crash, the corresponding change was not applied to the database!)
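The write-ahead rule and the undo step can be sketched in a few lines of Python. This is a toy under stated assumptions: `TinyDB`, its record layout, and the in-memory list standing in for the "safe location" are all hypothetical, and real recovery also has to redo committed work, which is omitted here.

```python
MISSING = object()   # marks "the key did not exist before this write"


class TinyDB:
    """Toy key-value store (hypothetical) that logs the old value before each change."""

    def __init__(self):
        self.data = {}   # stands in for the database
        self.log = []    # stands in for the log on stable storage

    def write(self, xact, key, new_value):
        # WAL: the log record is appended BEFORE the data is changed.
        self.log.append(("WRITE", xact, key, self.data.get(key, MISSING)))
        self.data[key] = new_value

    def commit(self, xact):
        self.log.append(("COMMIT", xact))

    def recover(self):
        """Undo, newest first, every write of a Xact that has no COMMIT record."""
        committed = {rec[1] for rec in self.log if rec[0] == "COMMIT"}
        for rec in reversed(self.log):
            if rec[0] != "WRITE" or rec[1] in committed:
                continue
            _, _, key, old_value = rec
            if old_value is MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old_value


db = TinyDB()
db.write("T1", "A", 100)
db.commit("T1")
db.write("T2", "A", 50)   # T2 never commits: the "crash" happens here
db.write("T2", "B", 7)
db.recover()
state_after_recovery = dict(db.data)
```

After `recover()`, only the committed write of T1 remains; both writes of T2 have been rolled back.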
The Log
Log records are chained together by Xact id, so it's easy to undo a specific Xact (e.g., to resolve a deadlock). The log is often duplexed and archived on stable storage. All log-related activities (and in fact, all CC-related activities such as lock/unlock, dealing with deadlocks, etc.) are handled transparently by the DBMS.
DB application programmers: e.g., webmasters.
Database administrator (DBA):
Designs logical/physical schemas.
Handles security and authorization.
Data availability, crash recovery.
Database tuning as needs evolve.
Structure of a DBMS
A typical DBMS has a layered architecture. The figure does not show the concurrency control and recovery components. This is one of several possible architectures; each system has its own variations.
[Figure: layers, top to bottom: Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB]
Summary
A DBMS is used to maintain and query large datasets.
Benefits include recovery from system crashes, concurrent access, quick application development, data integrity and security.
Levels of abstraction give data independence.
A DBMS typically has a layered architecture.
DBAs hold responsible jobs and are well-paid!
DBMS R&D is one of the broadest and most mature areas in CS.
Data Models
A database models some portion of the real world. The data model is the link between the user's view of the world and the bits stored in the computer. Many models have been proposed. We will concentrate on the relational model.
[Figure: the schema Student(sid: string, name: string, login: string, age: integer, gpa: real) linked to stored bits (10101 11101)]
A database schema is a description of a particular collection of data, using a given data model. The relational model of data is the most widely used model today.
Main concept: relation, basically a table with rows and columns. Every relation has a schema, which describes the columns, or fields.
Levels of Abstraction
Views describe how users see the data.
Conceptual schema defines the logical structure.
Physical schema describes the files and indexes used.
(Sometimes called the ANSI/SPARC model.)
[Figure: Users; View 1, View 2, View 3; Conceptual Schema; Physical Schema; DB]
A simple idea: applications should be insulated from how data is structured and stored.
Logical data independence: protection from changes in the logical structure of data.
The relational model is the most widely used model currently: DB2, MySQL, Oracle, PostgreSQL, SQL Server, ...
Note: some legacy systems use older models, e.g., IBM's IMS.
Object-oriented concepts have recently been merged in, giving the object-relational model: Informix, IBM DB2, Oracle 8i. Early work was done in the POSTGRES research project at Berkeley.
Relational database: a set of relations.
Relation: made up of 2 parts:
Schema: specifies the name of the relation, plus the name and type of each column. E.g., Students(sid: string, name: string, login: string, age: integer, gpa: real)
Instance: a table, with rows and columns. #rows = cardinality; #fields = degree / arity.
Can think of a relation as a set of rows or tuples, i.e., all rows are distinct.
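The schema/instance split, and the "set of tuples" view, can be made concrete with a short Python sketch. The `Student` type and variable names are illustrative assumptions; the three rows are the Students instance used elsewhere in these slides.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """The schema: relation name plus the name and type of each column."""
    sid: str
    name: str
    login: str
    age: int
    gpa: float


# The instance: a set of tuples, so duplicate rows cannot exist.
students = {
    Student("53666", "Jones", "jones@cs", 18, 3.4),
    Student("53688", "Smith", "smith@eecs", 18, 3.2),
    Student("53650", "Smith", "smith@math", 19, 3.8),
}

cardinality = len(students)       # number of rows
degree = len(Student._fields)     # number of fields (arity)

# Adding a row that is already present leaves the set unchanged.
students.add(Student("53666", "Jones", "jones@cs", 18, 3.4))
cardinality_after_duplicate = len(students)
```

Note that SQL tables are not sets by default; a table without a key can hold duplicate rows.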
Conceptual schema:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
Courses(cid: string, cname: string, credits: integer)
Enrolled(sid: string, cid: string, grade: string)
SQL (a.k.a. "Sequel"), the "Intergalactic Standard for Data"
Stands for Structured Query Language.
Two sub-languages:
Data Definition Language (DDL): create, modify, delete relations; specify constraints; administer users, security, etc.
Data Manipulation Language (DML): specify queries to find tuples that satisfy criteria; add, modify, remove tuples.
SQL Overview
CREATE TABLE <name> ( <field> <domain>, ... )
INSERT INTO <name> (<field names>) VALUES (<field values>)
DELETE FROM <name> WHERE <condition>
UPDATE <name> SET <field name> = <value> WHERE <condition>
SELECT <fields> FROM <name> WHERE <condition>
Creates the Students relation. Note: the type (domain) of each field is specified, and enforced by the DBMS whenever tuples are added or modified.
CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)
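Another example: the Enrolled table holds information about courses students take.
CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2))
Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa) VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)
Can delete all tuples satisfying some condition (e.g., name = 'Smith'):
DELETE FROM Students WHERE name = 'Smith'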
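The statements above can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch, assuming an in-memory SQLite database as a stand-in for whichever DBMS the course uses; the second inserted row is taken from the Students instance shown on the Keys slide. One caveat: SQLite applies type affinity rather than strict domain checking, so it is more lenient than the slide's statement that types are enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# DDL: create the Students relation.
conn.execute(
    "CREATE TABLE Students "
    "(sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)"
)

# DML: insert two tuples. String values are quoted (here passed as parameters).
insert = "INSERT INTO Students (sid, name, login, age, gpa) VALUES (?, ?, ?, ?, ?)"
conn.execute(insert, ("53688", "Smith", "smith@ee", 18, 3.2))
conn.execute(insert, ("53666", "Jones", "jones@cs", 18, 3.4))
count_before = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

# DML: delete every tuple that satisfies the condition.
conn.execute("DELETE FROM Students WHERE name = 'Smith'")
remaining = conn.execute("SELECT sid, name FROM Students").fetchall()
```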
Keys
Keys are a way to associate tuples in different relations. Keys are one form of integrity constraint (IC).
Enrolled
sid   | cid         | grade
53666 | Carnatic101 | C
53666 | Reggae203   | B
53650 | Topology112 | A
53666 | History105  | B

Students
sid   | name  | login      | age | gpa
53666 | Jones | jones@cs   | 18  | 3.4
53688 | Smith | smith@eecs | 18  | 3.2
53650 | Smith | smith@math | 19  | 3.8

In Enrolled, sid is a FOREIGN key; in Students, sid is the PRIMARY key.
Primary Keys
Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the primary key.
Keys must be used carefully!
"For a given student and course, there is a single grade."
CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid))
vs.
CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))
"Students can take only one course, and no two students in a course receive the same grade."
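To see how differently the two declarations behave, the sketch below loads the Enrolled rows from the Keys slide under each definition and counts how many are accepted. It assumes SQLite via Python's `sqlite3` module; the helper `try_rows` is hypothetical.

```python
import sqlite3

COMPOSITE_KEY = (
    "CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
    "PRIMARY KEY (sid, cid))"
)
OVERCONSTRAINED = (
    "CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
    "PRIMARY KEY (sid), UNIQUE (cid, grade))"
)


def try_rows(ddl, rows):
    """Create Enrolled from ddl, insert rows in order, return how many were accepted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    accepted = 0
    for row in rows:
        try:
            conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass   # the row violates a key constraint and is rejected
    conn.close()
    return accepted


rows = [
    ("53666", "Carnatic101", "C"),
    ("53666", "Reggae203", "B"),
    ("53650", "Topology112", "A"),
    ("53666", "History105", "B"),
]

accepted_composite = try_rows(COMPOSITE_KEY, rows)          # every (sid, cid) pair is distinct
accepted_overconstrained = try_rows(OVERCONSTRAINED, rows)  # sid alone must be unique
```

With the second definition, student 53666 keeps only the first enrollment: the key says more than the application intends.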
Foreign key: a set of fields in one relation that is used to "refer" to a tuple in another relation. It must correspond to the primary key of the other relation. Like a "logical pointer". If all foreign key constraints are enforced, referential integrity is achieved (i.e., no dangling references).
E.g. Only students listed in the Students relation should be allowed to enroll for courses.
CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid, cid), FOREIGN KEY (sid) REFERENCES Students)
[Figure: the Enrolled and Students instances, plus a candidate Enrolled tuple (11111, English102, A) whose sid does not appear in Students]
Consider Students and Enrolled; sid in Enrolled is a foreign key that references Students.
What should be done if an Enrolled tuple with a nonexistent student id is inserted? (Reject it!)
What should be done if a Students tuple is deleted?
Also delete all Enrolled tuples that refer to it?
Disallow deletion of a Students tuple that is referred to?
Set sid in Enrolled tuples that refer to it to a default sid?
(In SQL, also: set sid in Enrolled tuples that refer to it to a special value null, denoting "unknown" or "inapplicable".)
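Two of these options can be tried directly. The sketch below assumes SQLite through Python's `sqlite3` module, where foreign key enforcement has to be switched on per connection; the helper `fresh_db` and the reduced two-column Students table are illustrative assumptions. `ON DELETE CASCADE` corresponds to "also delete the Enrolled tuples", and `ON DELETE NO ACTION` to "disallow the deletion".

```python
import sqlite3


def fresh_db(on_delete):
    """Build Students and Enrolled with the given ON DELETE action (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite does not enforce FKs by default
    conn.execute("CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(20))")
    conn.execute(
        "CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), "
        "PRIMARY KEY (sid, cid), "
        f"FOREIGN KEY (sid) REFERENCES Students ON DELETE {on_delete})"
    )
    conn.execute("INSERT INTO Students VALUES ('53666', 'Jones')")
    conn.execute("INSERT INTO Enrolled VALUES ('53666', 'Carnatic101', 'C')")
    return conn


def rejected(conn, sql):
    """Return True if the DBMS refuses the statement with an integrity error."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


cascade = fresh_db("CASCADE")
# An Enrolled tuple with a nonexistent student id is rejected.
dangling_insert_rejected = rejected(
    cascade, "INSERT INTO Enrolled VALUES ('11111', 'English102', 'A')"
)
# Deleting the student also deletes the Enrolled tuples that refer to it.
cascade.execute("DELETE FROM Students WHERE sid = '53666'")
enrolled_after_cascade = cascade.execute("SELECT COUNT(*) FROM Enrolled").fetchone()[0]

no_action = fresh_db("NO ACTION")
# Deleting a student who is still referred to is disallowed.
delete_rejected = rejected(no_action, "DELETE FROM Students WHERE sid = '53666'")
```

SQL also offers `SET DEFAULT` and `SET NULL` for the remaining two options; they are left out here because sid is part of the primary key of Enrolled.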
ICs are based upon the semantics of the real world that is being described in the database relations. We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true by looking at an instance. An IC is a statement about all possible instances! From the example, we know name is not a key, but the assertion that sid is a key is given to us. Key and foreign key ICs are the most common; more general ICs are supported too.
The key: precise semantics for relational queries. Allows the optimizer to extensively re-order operations, and still ensure that the answer does not change.
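To find all 18-year-old students, we can write:
SELECT * FROM Students S WHERE S.age = 18
On the Students instance shown earlier, we get:
sid   | name  | login      | age | gpa
53666 | Jones | jones@cs   | 18  | 3.4
53688 | Smith | smith@eecs | 18  | 3.2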
Semantics of a Query
Remember, this is conceptual. Actual evaluation will be much more efficient, but must produce the same answers.
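The slide does not spell the conceptual strategy out, so the following Python sketch assumes the usual textbook reading: form the cross-product of the FROM tables, discard rows that fail the WHERE condition, then keep only the SELECT columns. The function `conceptual_query` and its argument shapes are hypothetical.

```python
from itertools import product

students = [
    {"sid": "53666", "name": "Jones", "login": "jones@cs", "age": 18, "gpa": 3.4},
    {"sid": "53688", "name": "Smith", "login": "smith@eecs", "age": 18, "gpa": 3.2},
    {"sid": "53650", "name": "Smith", "login": "smith@math", "age": 19, "gpa": 3.8},
]


def conceptual_query(tables, where, select):
    """Evaluate SELECT ... FROM ... WHERE ... the slow, conceptual way.

    tables: alias -> list of row dicts; where: predicate over {alias: row};
    select: list of (alias, column) pairs.
    """
    aliases = list(tables)
    result = []
    for combo in product(*(tables[a] for a in aliases)):        # 1. cross-product
        env = dict(zip(aliases, combo))
        if where(env):                                          # 2. apply WHERE
            result.append(tuple(env[a][c] for a, c in select))  # 3. project
    return result


# SELECT S.sid, S.name FROM Students S WHERE S.age = 18
eighteen = conceptual_query(
    {"S": students},
    lambda env: env["S"]["age"] == 18,
    [("S", "sid"), ("S", "name")],
)
```

With several tables the cross-product grows multiplicatively, which is why an optimizer reorders the work while preserving the answer.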
System handles query plan generation & optimization; ensures correct execution.
Issues: view reconciliation, operator ordering, physical operator choice, memory management, access path (index) use, ...
Structure of a DBMS
A typical DBMS has a layered architecture. The figure does not show the concurrency control and recovery components. Each system has its own variations. The book shows a somewhat more detailed version. You will see the "real deal" in PostgreSQL; it's a pretty full-featured example. Next class: we will start on this stack, bottom up.
Integrity constraints can be specified by the DBA, based on application semantics. DBMS checks for violations.
Two important ICs: primary and foreign keys. In addition, we always have domain constraints.
Storage
There are two general types of storage media used with computers:
Primary storage: all storage media that can be operated on directly by the CPU (RAM, L1 and L2 cache memory).
Secondary storage: hard drives, CDs, and tape.
Chapter 5
69
The memory hierarchy is based on speed of access. However, speed comes with a price tag: cost varies inversely with the access time of the memory. As with cars, the faster the memory access, the more it costs.
Cylinder - the set of tracks with the same diameter on the different disk surfaces of a disk pack.
Computing Times
Given :
Seek time (s) = 10 msec
Rotational speed = 3600 rpm
Track size = 50 KB
Block size (B) = 512 bytes
Interblock gap = 128 bytes
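The slides that used these figures did not survive, so the following is only a sketch of the quantities usually derived from them, assuming the standard formulas (average rotational delay is half a revolution; transfer rate is track size divided by revolution time) and assuming 1 KB = 1024 bytes. All variable names are illustrative.

```python
SEEK_MS = 10.0              # seek time s
RPM = 3600                  # rotational speed
TRACK_BYTES = 50 * 1024     # track size, assuming 1 KB = 1024 bytes
BLOCK_BYTES = 512           # block size B
GAP_BYTES = 128             # interblock gap

rev_ms = 60_000 / RPM                          # one full revolution, in msec
rot_delay_ms = rev_ms / 2                      # average rotational delay
transfer_rate = TRACK_BYTES / rev_ms           # bytes per msec along a track
block_transfer_ms = BLOCK_BYTES / transfer_rate

# Gaps carry no data, so the useful rate over consecutive blocks is lower.
bulk_rate = BLOCK_BYTES / (BLOCK_BYTES + GAP_BYTES) * transfer_rate

# Time to locate and read one block at a random position.
random_block_ms = SEEK_MS + rot_delay_ms + block_transfer_ms
```

Under these assumptions, reading one random block takes about 18.5 msec, almost all of it seek time and rotational delay rather than data transfer.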
RAID Levels
Level 0 - has no redundancy and the best write performance, but its read performance is not as good as level 1.
Level 1 - uses mirrored disks, which provide redundancy and improved read performance.
Level 2 - provides redundancy using Hamming codes.
RAID Levels
Level 3 - uses a single parity disk.
Levels 4 and 5 - use block-level data striping, with level 5 distributing the data and parity information across all the disks.
Level 6 - uses the P + Q redundancy scheme, making use of Reed-Solomon codes to protect against the failure of 2 disks.
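How a parity disk protects against a single disk failure can be shown with bytewise XOR, which is the usual parity computation; the slides do not name it, so treat that as an assumption. The `parity` function and the three 4-byte "disks" are illustrative.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one block of the same stripe.
data = [b"DBMS", b"RAID", b"DISK"]
parity_block = parity(data)   # what the parity disk stores

# The second disk fails: XOR of the surviving blocks and the parity restores it.
rebuilt = parity([data[0], data[2], parity_block])
```

The same arithmetic shows why one parity block cannot recover from two simultaneous failures, which is the case level 6 addresses.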
Records
A record is a collection of related values or items. Each value or item is stored in a field of a specific data type. Records may be of either fixed or variable length.
There are several reasons why records of the same record type may be of variable length:
Variable-length fields
Repeating fields
When the records in a file are stored on disk, they are placed in blocks of a fixed size, which rarely matches the record size. When the record size is smaller than the block size and the block size is not a multiple of the record size, a decision must be made: store each record entirely within one block and leave some space unused, or let a record span two blocks.
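The trade-off can be put in numbers. The sketch below assumes fixed-length records and ignores the pointer a spanned record needs to find its continuation; the function names, the 100-byte record size, and the 1000-record file are hypothetical, while the 512-byte block size is the one used on the Computing Times slide.

```python
def unspanned_layout(block_size, record_size, num_records):
    """Records never cross a block boundary."""
    per_block = block_size // record_size              # records that fit in one block
    wasted_per_block = block_size - per_block * record_size
    blocks = -(-num_records // per_block)              # ceiling division
    return per_block, wasted_per_block, blocks


def spanned_blocks(block_size, record_size, num_records):
    """Records are packed back to back and may straddle two blocks."""
    return -(-(num_records * record_size) // block_size)


per_block, wasted, blocks_unspanned = unspanned_layout(512, 100, 1000)
blocks_spanned = spanned_blocks(512, 100, 1000)
```

Here the unspanned layout wastes 12 bytes in every block and needs a few more blocks, but reading any record touches exactly one block.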
File Operations
A file may be stored either in contiguous blocks or by linking the blocks together. There are advantages and disadvantages to both methods. Operations on files can be grouped into two types: retrieval and update. Retrieval involves only a read, while an update involves a read, a modification, and a write.
File Structure
Heap (Pile) Files
Hash (Direct) Files
Ordered (Sorted) Files
B-Trees
Once the data has been brought into memory, it can be accessed by an instruction in 0.00000004 seconds on a machine running at 25 MIPS. The disparity between the time for a memory access and a disk access is enormous: we can perform 625,000 instructions in the time it takes to read or write one disk page. To put this in human terms: suppose you are typing a letter for your boss and find a word you cannot make out, so you leave him a voice mail message. Since you were told to do nothing else, you patiently wait for his reply, doing nothing. Unfortunately, he has just gone on vacation and does not get your message for 3 weeks. This is similar to the computer waiting 0.025 seconds to get the needed data into memory from a disk read.
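The two figures in this paragraph follow from each other, as a quick check shows. The variable names are illustrative, and the calculation assumes one memory access per instruction, as the text does.

```python
MIPS = 25             # million instructions per second
DISK_READ_MS = 25     # 0.025 seconds per disk read, in milliseconds

instructions_per_second = MIPS * 1_000_000
seconds_per_instruction = 1 / instructions_per_second

# Instructions the CPU could have executed while waiting for one disk read.
instructions_per_disk_read = instructions_per_second * DISK_READ_MS // 1000
```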