CHAPTER 10 STORAGE AND FILE STRUCTURE

In preceding chapters, we have emphasized the higher-level models of a database. At the conceptual or logical level, we viewed the database, in the relational model, as a collection of tables. The logical model of the database is the correct level for database users to focus on. The goal of a database system is to simplify and facilitate access to data. Users of the system should not be burdened unnecessarily with the physical details of the implementation of the system.

In this chapter, as well as in Chapters 11 and 12, we describe various methods for implementing the data models and languages presented in preceding chapters. We start with characteristics of the underlying storage media, such as disk and tape systems. We then define various data structures that will allow fast access to data. We consider several alternative structures, each best suited to a different kind of access to data. The final choice of data structure needs to be made on the basis of the expected use of the system and of the physical characteristics of the specific machine.

10.1 Overview of Physical Storage Media

Several types of data storage exist in most computer systems. These storage media are classified by the speed with which data can be accessed, by the cost per unit of data to buy the medium, and by the medium's reliability. Among the media typically available are:

• Cache. Cache is the fastest and most costly form of storage. Cache memory is small; its use is managed by the operating system. We shall not be concerned about managing cache storage in the database system.

• Main memory. The storage medium used for data that are available to be operated on is main memory. The general-purpose machine instructions operate on main memory. Although main memory may hold a substantial amount of data, it is generally too small (or too expensive) to store the entire database. The contents of main memory are usually lost if a power failure or system crash occurs.

• Flash memory.
Also known as electrically erasable programmable read-only memory (EEPROM), flash memory differs from main memory in that data survive power failure. Reading data from flash memory takes a time measured in nanoseconds (a nanosecond is 0.001 microsecond), which is roughly as fast as reading data from main memory. However, writing data to flash memory is more complicated: data can be written once, which takes at least about 4 microseconds, but cannot be overwritten directly. To overwrite memory that has been written already, we have to erase an entire bank of memory at once; it is then ready to be written again. A drawback of flash memory is that it can support only a limited number of erase cycles, ranging from 10,000 to 1 million. Flash memory has found popularity as a replacement for magnetic disks for storing small volumes of data (5 to 10 megabytes) in low-cost computer systems, such as computer systems that are embedded in other devices.

• Magnetic-disk storage. The primary medium for the long-term on-line storage of data is the magnetic disk. Typically, the entire database is stored on magnetic disk. Data must be moved from disk to main memory to be accessed. After operations are performed, the data that have been modified must be written to disk. Disk storage is referred to as direct-access storage, because it is possible to read data on disk in any order (unlike sequential-access storage). Disk storage survives power failures and system crashes. Disk-storage devices themselves may sometimes fail and thus destroy data, but such failures usually occur much less frequently than do system crashes.

• Optical storage. A common form of optical storage is the compact-disk read-only memory (CD-ROM). The disks used in CD-ROM storage cannot be written, but are supplied with data prerecorded, and can be loaded into or removed from a drive. Another version of optical storage is the write-once, read-many (WORM) disk, which allows data to be written once, but does not allow them to be erased and rewritten. Such a medium is used for archival storage of data. There are also combined magnetic-optical storage devices that use optical means to read magnetically encoded data, and that allow overwriting of old data. Most types of optical drives allow the disks to be removed and replaced by other disks.
Jukebox systems contain numerous disks that can be loaded into one of the drives automatically (by a robot arm) on demand.

• Tape storage. Tape storage is used primarily for backup and archival data. Although magnetic tape is much cheaper than disks, access to data is much slower, because the tape must be accessed sequentially from the beginning. For this reason, tape storage is referred to as sequential-access storage. Tapes have a high capacity (5-gigabyte tapes are commonly available), and can be removed from the tape drive, facilitating cheap archival storage. Tape jukeboxes are used to hold exceptionally large collections of data, such as remote-sensing data from satellites, which could include as much as 12 terabytes (a terabyte is 10^12 bytes) in the near future.

The various storage media can be organized in a hierarchy (Figure 10.1) according to their speed and their cost. The higher levels are expensive, but are fast. As we move down the hierarchy, the cost per bit decreases, whereas the access time increases. This tradeoff is reasonable; if a given storage system were both faster and less expensive than another (other properties being the same), then there would be no reason to use the slower, more expensive memory. In fact, many early storage devices, including paper tape and core memories, are relegated to museums now that magnetic tape and semiconductor memory have become faster and cheaper. Magnetic tapes themselves were used to store active data back when disks were expensive and had low storage capacity. Today, almost all active data are stored on disks, except in rare cases where they are stored on tape or in optical jukeboxes.

Figure 10.1 Storage-device hierarchy.

The fastest storage media (for example, cache and main memory) are referred to as primary storage.
The media in the next level in the hierarchy (for example, magnetic disks) are referred to as secondary storage, or on-line storage. The media in the lowest level in the hierarchy (for example, magnetic tape and optical-disk jukeboxes) are referred to as tertiary storage, or off-line storage.

In addition to the speed and cost of the various storage systems, there is also the issue of storage volatility. Volatile storage loses its contents when the power to the device is removed. In the absence of expensive battery and generator backup systems, data must be written to nonvolatile storage for safekeeping. In the hierarchy shown in Figure 10.1, the storage systems from main memory up are volatile, whereas the storage systems below main memory are nonvolatile. We shall return to this subject in Chapter 15.

10.2 Magnetic Disks

Magnetic disks provide the bulk of secondary storage for modern computer systems. The storage capacity of a single disk ranges from 10 megabytes to 10 gigabytes. A typical large commercial database may require hundreds of disks.

10.2.1 Physical Characteristics of Disks

Physically, disks are relatively simple (Figure 10.2). Each disk platter has a flat circular shape. Its two surfaces are covered with a magnetic material [...]

[Figure 10.2: disk diagram; legible labels include track and sector.]

[...]

Figure 10.9 Slotted-page structure. [Diagram labels: block header with entries giving each record's size and location, end of free space, records.]

This level of indirection allows records to be moved to prevent fragmentation of space inside a block, while supporting indirect pointers to the record.

Databases often store data that can be much larger than a disk block. For instance, an image or an audio recording may be multiple megabytes in size, while a video object may be multiple gigabytes in size. Recall that SQL supports the types blob and clob, which store binary and character large objects.
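The indirection that the slotted-page structure of Figure 10.9 provides can be made concrete with a small sketch. The Python class below is only an illustration under stated assumptions: the name SlottedPage, its method names, and the decision to keep the header entries in a Python list (rather than inside the block's own bytes, so header space is not charged against free space) are all hypothetical choices made for brevity. It shows that a record is always reached through its header entry, so the record can be moved inside the block without invalidating its slot number.

```python
class SlottedPage:
    """Toy model of a slotted page; names and layout details are illustrative."""

    def __init__(self, size=64):
        self.data = bytearray(size)   # the block's bytes; records are packed at the end
        self.slots = []               # header entries [location, size]; None marks a deleted record
        self.free_end = size          # end of free space: records grow downward from here

    def insert(self, record):
        """Store a record at the end of free space and return its slot number."""
        if len(record) > self.free_end:
            raise ValueError("not enough free space in block")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append([self.free_end, len(record)])
        return len(self.slots) - 1

    def get(self, slot):
        """Follow the indirection: slot number -> header entry -> record bytes."""
        location, size = self.slots[slot]
        return bytes(self.data[location:location + size])

    def delete(self, slot):
        """Delete a record, then slide the survivors together so free space stays contiguous."""
        self.slots[slot] = None
        end = len(self.data)
        # Repack from the end of the block; slot numbers never change, only locations.
        live = [entry for entry in self.slots if entry is not None]
        for entry in sorted(live, key=lambda e: e[0], reverse=True):
            location, size = entry
            record = bytes(self.data[location:location + size])
            end -= size
            self.data[end:end + size] = record
            entry[0] = end
        self.free_end = end


page = SlottedPage(size=32)
a = page.insert(b"aaaa")
b = page.insert(b"bb")
c = page.insert(b"cccccc")
page.delete(b)  # the record in slot c is moved, but slot number c still finds it
```

A pointer kept outside the block would name the slot number rather than a byte offset, which is why the move performed by delete is invisible to it.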
Most relational databases restrict the size of a record to be no larger than the size of a block, to simplify buffer management and free-space management. Large objects are often stored in a special file (or collection of files) instead of being stored with the other (short) attributes of records in which they occur. A (logical) pointer to the object is then stored in the record containing the large object. Large objects are often represented using B+-tree file organizations, which we study in Section 11.4.1. B+-tree file organizations permit us to read an entire object, or specified byte ranges in the object, as well as to insert and delete parts of the object.

10.6 Organization of Records in Files

So far, we have studied how records are represented in a file structure. A relation is a set of records. Given a set of records, the next question is how to organize them in a file. Several of the possible ways of organizing records in files are:

• Heap file organization. Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records. Typically, there is a single file for each relation.

• Sequential file organization. Records are stored in sequential order, according to the value of a "search key" of each record. Section 10.6.1 describes this organization.

• Hashing file organization. A hash function is computed on some attribute of each record. The result of the hash function specifies in which block of the file the record should be placed. Chapter 11 describes this organization; it is closely related to the indexing structures described in that chapter.

Generally, a separate file is used to store the records of each relation. However, in a multitable clustering file organization, records of several different relations are stored in the same file; further, related records of the different relations are stored on the same block, so that one I/O operation fetches related records from all the relations. For example, records of the two relations can be considered to be related if they would match in a join of the two relations. Section 10.6.2 describes this organization.

10.6.1 Sequential File Organization

A sequential file is designed for efficient processing of records in sorted order based on some search key. A search key is any attribute or set of attributes; it need not be the primary key, or even a superkey. To permit fast retrieval of records in search-key order, we chain together records by pointers. The pointer in each record points to the next record in search-key order. Furthermore, to minimize the number of block accesses in sequential file processing, we store records physically in search-key order, or as close to search-key order as possible.

Figure 10.10 shows a sequential file of instructor records taken from our university example. In that example, the records are stored in search-key order, using ID as the search key.

    10101 | Srinivasan | Comp. Sci. | 65000
    12121 | Wu         | Finance    | 90000
    15151 | Mozart     | Music      | 40000
    22222 | Einstein   | Physics    | 95000
    32343 | El Said    | History    | 60000
    33456 | Gold       | Physics    | 87000
    45565 | Katz       | Comp. Sci. | 75000
    58583 | Califieri  | History    | 62000
    76543 | Singh      | Finance    | 80000
    76766 | Crick      | Biology    | 72000
    83821 | Brandt     | Comp. Sci. | 92000
    98345 | Kim        | Elec. Eng. | 80000

Figure 10.10 Sequential file for instructor records. (In the figure, each record also carries a pointer to the next record in search-key order.)

The sequential file organization allows records to be read in sorted order; that can be useful for display purposes, as well as for certain query-processing algorithms that we shall study in Chapter 12.

It is difficult, however, to maintain physical sequential order as records are inserted and deleted, since it is costly to move many records as a result of a single insertion or deletion. We can manage deletion by using pointer chains, as we saw previously. For insertion, we apply the following rules:

1. Locate the record in the file that comes before the record to be inserted in search-key order.

2. If there is a free record (that is, space left after a deletion) within the same block as this record, insert the new record there. Otherwise, insert the new record in an overflow block. In either case, adjust the pointers so as to chain together the records in search-key order.

Figure 10.11 shows the file of Figure 10.10 after the insertion of the record (32222, Verdi, Music, 48000).

    10101 | Srinivasan | Comp. Sci. | 65000
    12121 | Wu         | Finance    | 90000
    15151 | Mozart     | Music      | 40000
    22222 | Einstein   | Physics    | 95000
    32343 | El Said    | History    | 60000
    33456 | Gold       | Physics    | 87000
    45565 | Katz       | Comp. Sci. | 75000
    58583 | Califieri  | History    | 62000
    76543 | Singh      | Finance    | 80000
    76766 | Crick      | Biology    | 72000
    83821 | Brandt     | Comp. Sci. | 92000
    98345 | Kim        | Elec. Eng. | 80000

    32222 | Verdi      | Music      | 48000    (overflow block)

Figure 10.11 Sequential file after an insertion.

The structure in Figure 10.11 allows fast insertion of new records, but forces sequential file-processing applications to process records in an order that does not match the physical order of the records.

If relatively few records need to be stored in overflow blocks, this approach works well. Eventually, however, the correspondence between search-key order and physical order may be totally lost over a period of time, in which case sequential processing will become much less efficient. At this point, the file should be reorganized so that it is once again physically in sequential order. Such reorganizations are costly, and must be done during times when the system load is low. The frequency with which reorganizations are needed depends on the frequency of insertion of new records. In the extreme case in which insertions rarely occur, it is possible always to keep the file in physically sorted order.
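The two insertion rules of Section 10.6.1 can be sketched in a few lines of Python. This is a simplified model, not the book's implementation: the class SequentialFile, the representation of a free record as None, the in-memory list standing in for disk blocks, and the single growing overflow area are all assumptions made for illustration. A new record with no predecessor (a new smallest key) simply goes to the overflow area in this sketch.

```python
class SequentialFile:
    """Toy sequential file: fixed-size blocks of slots plus an overflow area."""

    def __init__(self, layout, block_size):
        # layout: physical slots of the main area, each (key, data) or None (free),
        # with the non-free entries already in ascending search-key order.
        self.block_size = block_size
        self.main_size = len(layout)
        self.slots = []        # each used slot is [key, data, next_slot]
        self.head = None       # slot number of the first record in search-key order
        prev = None
        for i, entry in enumerate(layout):
            if entry is None:
                self.slots.append(None)
                continue
            self.slots.append([entry[0], entry[1], None])
            if prev is None:
                self.head = i
            else:
                self.slots[prev][2] = i
            prev = i

    def insert(self, key, data):
        # Rule 1: locate the record that precedes the new one in search-key order.
        prev, cur = None, self.head
        while cur is not None and self.slots[cur][0] < key:
            prev, cur = cur, self.slots[cur][2]
        # Rule 2: use a free slot in the predecessor's block if there is one ...
        target = None
        if prev is not None and prev < self.main_size:
            start = (prev // self.block_size) * self.block_size
            for i in range(start, min(start + self.block_size, self.main_size)):
                if self.slots[i] is None:
                    target = i
                    break
        # ... otherwise place the record in the overflow area at the end of the file.
        if target is None:
            self.slots.append(None)
            target = len(self.slots) - 1
        # In either case, relink the chain so it stays in search-key order.
        self.slots[target] = [key, data, cur]
        if prev is None:
            self.head = target
        else:
            self.slots[prev][2] = target
        return target

    def scan(self):
        """Yield keys by following the pointer chain (search-key order)."""
        cur = self.head
        while cur is not None:
            yield self.slots[cur][0]
            cur = self.slots[cur][2]


f = SequentialFile(
    [(10101, "Srinivasan"), (12121, "Wu"), (15151, "Mozart"), None,
     (32343, "El Said"), (33456, "Gold")],
    block_size=2,
)
first = f.insert(32222, "Verdi")      # free slot in the predecessor's block
second = f.insert(22222, "Einstein")  # that block is now full: goes to overflow
```

Following the chain with scan() still visits keys in ascending order, but the second record sits physically at the end of the file, which is exactly the mismatch between search-key order and physical order that eventually makes reorganization necessary.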
When the file is always kept in physically sorted order, the pointer field in Figure 10.10 is not needed.

CHAPTER 3 RELATIONAL MODEL

The relational model has established itself as the primary data model for commercial data-processing applications. The first database systems were based on either the network model (see Appendix A) or the hierarchical model (see Appendix B). Those two older models are tied more closely to the underlying implementation of the database than is the relational model.

A substantial theory exists for relational databases. This theory assists in the design of relational databases and in the efficient processing of user requests for information from the database. We shall examine this theory in Chapters 6 and 7.

The relational model is now being used in numerous applications outside the domain of traditional data processing. We shall consider extensions to the relational model required to handle these newer applications in Chapter 9.

3.1 Structure of Relational Databases

A relational database consists of a collection of tables, each of which is assigned a unique name. Each table has a structure similar to that presented in Chapter 2, where we represented E-R databases by tables. A row in a table represents a relationship among a set of values. Since a table is a collection of such relationships, there is a close correspondence between the concept of table and the mathematical concept of relation, from which the relational data model takes its name. In what follows, we introduce the concept of relation.

In this chapter, we shall be using a number of different relations to illustrate the various concepts underlying the relational data model. These relations represent part of a banking enterprise. They differ slightly from the tables that were used in Chapter 2, so that we can simplify our presentation. We shall discuss criteria for the appropriateness of relational structures in great detail in Chapter 7.
3.1.1 Basic Structure

Consider the account table of Figure 3.1. It has three column headers: branch-name, account-number, and balance. Following the terminology of the relational model, we refer to these headers as attributes (as we did for the E-R model in Chapter 2). For each attribute, there is a set of permitted values, called the domain of that attribute. For the attribute branch-name, for example, the domain is the set of all branch names. Let D_1 denote this set, D_2 denote the set of all account numbers, and D_3 the set of all balances. As we saw in Chapter 2, any row of account must consist of a 3-tuple (v_1, v_2, v_3), where v_1 is a branch name (that is, v_1 is in domain D_1), v_2 is an account number (that is, v_2 is in domain D_2), and v_3 is a balance (that is, v_3 is in domain D_3). In general, account will contain only a subset of the set of all possible rows. Therefore, account is a subset of

    D_1 × D_2 × D_3

In general, a table of n attributes must be a subset of

    D_1 × D_2 × ... × D_{n-1} × D_n

Mathematicians define a relation to be a subset of a Cartesian product of a list of domains. This definition corresponds almost exactly with our definition of table. The only difference is that we have assigned names to attributes, whereas mathematicians rely on numeric "names," using the integer 1 to denote the attribute whose domain appears first in the list of domains, 2 for the attribute whose domain appears second, and so on. Because tables are essentially relations, we shall use the mathematical terms relation and tuple in place of the terms table and row.

In the account relation of Figure 3.1, there are seven tuples. Let the tuple variable t refer to the first tuple of the relation. We use the notation t[branch-name] to denote the value of t on the branch-name attribute.
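The definition of a relation as a subset of a Cartesian product can be checked directly on a toy example. The sketch below is illustrative only: the three domains are deliberately tiny, and the variable names are assumptions rather than anything defined in the text. It builds D_1 × D_2 × D_3 with itertools.product, confirms that a small account relation is a subset of it, and shows how naming the attributes gives the t[branch-name] style of access.

```python
from itertools import product

# Tiny illustrative domains (real domains would be far larger).
D1 = {"Downtown", "Mianus"}   # branch names
D2 = {"A-101", "A-215"}       # account numbers
D3 = {500, 700}               # balances

all_possible_rows = set(product(D1, D2, D3))   # D1 x D2 x D3

# A relation is any subset of the Cartesian product.
account = {("Downtown", "A-101", 500), ("Mianus", "A-215", 700)}

# Naming the attributes lets us write t["branch-name"] instead of a position.
# (The text numbers attributes from 1; Python tuples are indexed from 0.)
attributes = ("branch-name", "account-number", "balance")
t = dict(zip(attributes, ("Downtown", "A-101", 500)))
```

With two values in each domain there are 2 × 2 × 2 = 8 possible rows, of which this relation uses two.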
For the first tuple of the account relation, t[branch-name] = "Downtown" and t[balance] = 500. Alternatively, we may write t[1] to denote the value of tuple t on the first attribute (branch-name), t[2] to denote account-number, and so on. Since a relation is a set of tuples, we use the mathematical notation of t ∈ r to denote that tuple t is in relation r.

    branch-name | account-number | balance
    Downtown    | A-101          | 500
    Mianus      | A-215          | 700
    Perryridge  | A-102          | 400
    Round Hill  | A-305          | 350
    Brighton    | A-201          | 900
    Redwood     | A-222          | 700
    Brighton    | A-217          | 750

Figure 3.1 The account relation.

We shall require that, for all relations r, the domains of all attributes of r be atomic. A domain is atomic if elements of the domain are considered to be indivisible units. For example, the set of integers is an atomic domain, but the set of all sets of integers is a nonatomic domain. The distinction is that we do not normally consider integers to have subparts, but we consider sets of integers to have subparts, namely, the integers comprising the set. The important issue is not what the domain itself is, but rather how we use domain elements in our database. The domain of all integers would be nonatomic if we considered each integer to be an ordered list of digits. In all our examples, we shall assume atomic domains. In Chapter 9, we shall discuss the nested relational data model, which allows nonatomic domains.

It is possible for several attributes to have the same domain. For example, suppose that we have a relation customer that has the three attributes customer-name, customer-street, and customer-city, and a relation employee that includes the attribute employee-name. It is possible that the attributes customer-name and employee-name will have the same domain: the set of all person names. The domains of balance and branch-name, on the other hand, certainly ought to be distinct. It is perhaps less clear whether customer-name and branch-name should have the same domain.
At the physical level, both customer names and branch names are character strings. However, at the logical level, we may want customer-name and branch-name to have distinct domains.

One domain value that is a member of any possible domain is the null value, which signifies that the value is unknown or does not exist. For example, suppose that we include the attribute telephone-number in the customer relation. It may be that a customer does not have a telephone number, or that the telephone number is unlisted. We would then have to resort to null values to signify that the value is unknown or does not exist. We shall see later that null values cause a number of difficulties when we access or update the database, and thus should be eliminated if at all possible.

3.1.2 Database Schema

When we talk about a database, we must differentiate between the database schema, or the logical design of the database, and a database instance, which is a snapshot of the data in the database at a given instant in time.

The concept of a relation corresponds to the programming-language notion of a variable. The concept of a relation schema corresponds to the programming-language notion of type definition.

    branch-name | branch-city | assets
    Downtown    | Brooklyn    | 9000000
    Redwood     | Palo Alto   | 2100000
    Perryridge  | Horseneck   | 1700000
    Mianus      | Horseneck   | 400000
    Round Hill  | Horseneck   | 8000000
    Pownal      | Bennington  | 300000
    North Town  | Rye         | 3700000
    Brighton    | Brooklyn    | 7100000

Figure 3.2 The branch relation.

It is convenient to give a name to a relation schema, just as we give names to type definitions in programming languages. We adopt the convention of using lowercase names for relations, and names beginning with an uppercase letter for relation schemas. Following this notation, we use Account-schema to denote the relation schema for relation account.
Thus,

    Account-schema = (branch-name, account-number, balance)

We denote the fact that account is a relation on Account-schema by

    account (Account-schema)

In general, a relation schema comprises a list of attributes and their corresponding domains. We shall not be concerned about the precise definition of the domain of each attribute until we discuss the SQL language in Chapter 4.

The concept of a relation instance corresponds to the programming-language notion of a value of a variable. The value of a given variable may change with time; similarly, the contents of a relation instance may change with time as the relation is updated. However, we often simply say "relation" when we actually mean "relation instance."

As an example of a relation instance, consider the branch relation of Figure 3.2. The schema for that relation is

    Branch-schema = (branch-name, branch-city, assets)

Note that the attribute branch-name appears in both Branch-schema and Account-schema. This duplication is not a coincidence. Rather, using common attributes in relation schemas is one way of relating tuples of distinct relations. For example, suppose we wish to find the information about all of the accounts maintained in branches located in Brooklyn. We look first at the branch relation to find the names of all the branches located in Brooklyn. Then, for each such branch, we would look in the account relation to find the information about the accounts maintained at that branch. Using the terminology of the E-R model, we say that the attribute branch-name represents the same entity set in both relations.

Let's continue our banking example. We need a relation to describe information about customers. The relation schema is

    Customer-schema = (customer-name, customer-street, customer-city)

A sample relation customer (Customer-schema) is shown in Figure 3.3.
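The programming-language analogy used in this section (schema as type definition, relation instance as the current value of a variable) and the two-step Brooklyn lookup can both be sketched in Python. Everything named here is an assumption made for illustration: the classes Branch and Account stand in for Branch-schema and Account-schema, underscores replace the hyphens that Python identifiers cannot contain, and only a few illustrative rows are included.

```python
from typing import NamedTuple

# A relation schema plays the role of a type definition ...
class Branch(NamedTuple):
    branch_name: str
    branch_city: str
    assets: int

class Account(NamedTuple):
    branch_name: str
    account_number: str
    balance: int

# ... and a relation instance is the current value of a variable of that type.
branch = {
    Branch("Downtown", "Brooklyn", 9000000),
    Branch("Redwood", "Palo Alto", 2100000),
    Branch("Brighton", "Brooklyn", 7100000),
}
account = {  # illustrative rows only
    Account("Downtown", "A-101", 500),
    Account("Redwood", "A-222", 700),
    Account("Brighton", "A-217", 750),
}

# Step 1: names of the branches located in Brooklyn.
brooklyn = {b.branch_name for b in branch if b.branch_city == "Brooklyn"}
# Step 2: accounts whose branch-name matches one of those names.
brooklyn_accounts = {a for a in account if a.branch_name in brooklyn}
```

The only link between the two relations is the shared attribute branch-name; adding or removing rows changes the instances but leaves the two class definitions, and hence the schemas, untouched.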
    customer-name | customer-street | customer-city
    Jones         | Main            | Harrison
    Smith         | North           | Rye
    Hayes         | Main            | Harrison
    Curry         | North           | Rye
    Lindsay       | Park            | Pittsfield
    Turner        | Putnam          | Stamford
    Williams      | Nassau          | Princeton
    Adams         | Spring          | Pittsfield
    Johnson       | Alma            | Palo Alto
    Glenn         | Sand Hill       | Woodside
    Brooks        | Senator         | Brooklyn
    Green         | Walnut          | Stamford

Figure 3.3 The customer relation.

We also need a relation to describe the association between customers and accounts. The relation schema to describe this association is

    Depositor-schema = (customer-name, account-number)

A sample relation depositor (Depositor-schema) is shown in Figure 3.4.

    customer-name | account-number
    Johnson       | A-101
    Smith         | A-215
    Hayes         | A-102
    Turner        | A-305
    Johnson       | A-201
    Jones         | A-217
    Lindsay       | A-222

Figure 3.4 The depositor relation.

    branch-name | loan-number | amount
    Downtown    | L-17        | 1000
    Redwood     | L-23        | 2000
    Perryridge  | L-15        | 1500
    Downtown    | L-14        | 1500
    Mianus      | L-93        | 500
    Round Hill  | L-11        | 900
    Perryridge  | L-16        | 1300

Figure 3.5 The loan relation.

It would appear that, for our banking example, we could have just one relation schema, rather than several. That is, it may be easier for a user to think in terms of one relation schema, rather than in terms of several. Suppose that we used only one relation for our example, with schema

    (branch-name, branch-city, assets, customer-name, customer-street,
     customer-city, account-number, balance)

Observe that, if a customer has several accounts, we must list her address once for each account. That is, we must repeat certain information several times. This repetition is wasteful and is avoided by the use of several relations, as in our example.

In addition, if a branch has no accounts (a newly created branch, say, that has no customers yet), we cannot construct a complete tuple on the preceding single relation, because no data concerning customer and account are available yet.
To represent incomplete tuples, we must use null values that signify that the value is unknown or does not exist. Thus, in our example, the values for customer-name, customer-street, and so on must be null. By using several relations, we can represent the branch information for a bank with no customers without using null values. We simply use a tuple on Branch-schema to represent the information about the branch, and create tuples on the other schemas only when the appropriate information becomes available.

In Chapter 7, we shall study criteria to help us decide when one set of relation schemas is more appropriate than another, in terms of information repetition and the existence of null values. For now, we shall assume that the relation schemas are given.

We include two additional relations to describe data about loans maintained in the various branches in the bank:

    Loan-schema = (branch-name, loan-number, amount)
    Borrower-schema = (customer-name, loan-number)

The sample relations loan (Loan-schema) and borrower (Borrower-schema) are shown in Figures 3.5 and 3.6, respectively.

    customer-name | loan-number
    Jones         | L-17
    Smith         | L-23
    Hayes         | L-15
    Jackson       | L-14
    Curry         | L-93
    Smith         | L-11
    Williams      | L-17
    Adams         | L-16

Figure 3.6 The borrower relation.

The banking enterprise that we have described is derived from the E-R diagram shown in Figure 3.7. The relation schemas correspond to the set of tables that we might generate using the method outlined in Section 2.9.

Figure 3.7 E-R diagram for the banking enterprise. [Diagram; legible labels: account, account-branch, branch, depositor, loan-branch, customer, loan.]

We assume that the primary key for the branch entity set is branch-name. The primary key for Customer-schema is customer-name.
We are not using the social-security attribute, as we did in Chapter 2, because now we want to have smaller relation schemas in our running example of a bank database. We expect that, in a real-world database, the social-security attribute would serve as a primary key. The primary keys for the account entity set and the loan entity set are account-number and loan-number, respectively. Finally, we note that the customer relation may contain information about customers who have neither an account nor a loan at the bank.

The banking enterprise described here will serve as our primary example in this chapter and in subsequent ones. On occasion, we shall need to introduce additional relation schemas to illustrate particular points.

3.1.3 Keys

The notions of superkey, candidate key, and primary key, as discussed in Chapter 2, are also applicable to the relational model. For example, in Branch-schema, {branch-name} and {branch-name, branch-city} are both superkeys. {branch-name, branch-city} is not a candidate key, because {branch-name} is a subset of {branch-name, branch-city} and {branch-name} itself is a superkey. However, {branch-name} is a candidate key, and for our purpose also will serve as a primary key. The attribute branch-city is not a superkey, since two branches in the same city may have different names (and different asset figures).

Let R be a relation schema. If we say that a subset K of R is a superkey for R, we are restricting consideration to relations r(R) in which no two distinct tuples have the same values on all attributes in K. That is, if t_1 and t_2 are in r and t_1 ≠ t_2, then t_1[K] ≠ t_2[K].

If a relational database schema is based on tables derived from an E-R schema, it is possible to determine the primary key for a relation schema from the primary keys of the entity or relationship sets from which the schema is derived:

• Strong entity set. The primary key of the entity set becomes the primary key of the relation.
• Weak entity set. The table, and thus the relation, corresponding to a weak entity set includes
    ◦ The attributes of the weak entity set
    ◦ The primary key of the strong entity set on which the weak entity set depends
  The primary key of the relation consists of the union of the primary key of the strong entity set and the discriminator of the weak entity set.

• Relationship set. The union of the primary keys of the related entity sets becomes a superkey of the relation. If the relationship is many-to-many, this superkey is also the primary key. Section 2.4.2 describes how to determine the primary keys in other cases. Recall from Section 2.9.3 that no table is generated for relationship sets linking a weak entity set to the corresponding strong entity set.

• Combined tables. Recall from Section 2.9.3 that a binary many-to-one relationship set from A to B can be represented by a table consisting of the attributes of A and attributes (if any exist) of the relationship set. The primary key of the "many" entity set becomes the primary key of the relation (that is, if the relationship set is many to one from A to B, the primary key of A is the primary key of the relation). For one-to-one relationship sets, the relation is constructed like that for a many-to-one relationship set. However, either entity set's primary key can be chosen as the primary key of the relation.

• Multivalued attributes. Recall from Section 2.9.4 that a multivalued attribute M is represented by a table consisting of the primary key of the entity set or relationship set of which M is an attribute, plus a column C holding an individual value of M. The primary key of the entity or relationship set together with the attribute C becomes the primary key for the relation.

From the preceding list, we see that a relation schema may include among its attributes the primary key of another schema. This key is called a foreign key. For example, the attribute branch-name in Account-schema is a foreign key, since branch-name is the primary key of Branch-schema.

3.1.4 Query Languages

A query language is a language in which a user requests information from the database. These languages are typically of a level higher than that of a standard programming language. Query languages can be categorized as being either procedural or nonprocedural. In a procedural language, the user instructs the system to perform a sequence of operations on the database to compute the desired result. In a nonprocedural language, the user describes the information desired without giving a specific procedure for obtaining that information.

Most commercial relational-database systems offer a query language that includes elements of both the procedural and the nonprocedural approaches. We shall study several commercial languages in Chapters 4 and 5. In this chapter, we examine "pure" languages: the relational algebra is procedural, whereas the tuple relational calculus and the domain relational calculus are nonprocedural. These query languages are terse and formal, lacking the "syntactic sugar" of commercial languages, but they illustrate the fundamental techniques for extracting data from the database.

Initially, we shall be concerned with only queries. A complete data-manipulation language includes not only a query language, but also a language for database modification. Such languages include commands to insert and delete tuples, as well as commands to modify parts of existing tuples. We shall examine database modification after we complete our discussion of "pure" query languages.

3.2 The Relational Algebra

The relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as their result.
The fundamental operations in the relational algebra are select, project, union, set difference, Cartesian product, and rename. In addition to the fundamental operations, there are several other operations: set intersection, natural join, division, and assignment. These operations will be defined in terms of the fundamental operations.

3.2.1 Fundamental Operations

The select, project, and rename operations are called unary operations, because they operate on one relation. The other three operations operate on pairs of relations and are, therefore, called binary operations.

3.2.1.1 The Select Operation

The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek letter sigma (σ) to denote selection. The predicate appears as a subscript to σ. The argument relation is given in parentheses following the σ. Thus, to select those tuples of the loan relation where the branch is "Perryridge," we write

  σ_{branch-name = "Perryridge"}(loan)

If the loan relation is as shown in Figure 3.5, then the relation that results from the preceding query is as shown in Figure 3.8.

  branch-name   loan-number   amount
  Perryridge    L-15          1500
  Perryridge    L-16          1300

  Figure 3.8 Result of σ_{branch-name = "Perryridge"}(loan).

We can find all tuples in which the amount lent is more than $1200 by writing

  σ_{amount > 1200}(loan)

In general, we allow comparisons using =, ≠, <, ≤, >, ≥ in the selection predicate. Furthermore, we can combine several predicates into a larger predicate using the connectives and (∧) and or (∨). Thus, to find those tuples pertaining to loans of more than $1200 made by the Perryridge branch, we write

  σ_{branch-name = "Perryridge" ∧ amount > 1200}(loan)

The selection predicate may include comparisons between two attributes.
To illustrate, consider the relation loan-officer that consists of three attributes: customer-name, banker-name, and loan-number, which specifies that a particular banker is the loan officer for a loan that belongs to some customer. To find all customers who have the same name as their loan officer, we can write

  σ_{customer-name = banker-name}(loan-officer)

Since the special value null indicates "value unknown or nonexistent," any comparison involving a null value evaluates to false.

3.2.1.2 The Project Operation

Suppose we want to list all loan numbers and the amount of the loans, but do not care about the branch name. The project operation allows us to produce this relation. The project operation is a unary operation that returns its argument relation, with certain attributes left out. Since a relation is a set, any duplicate rows are eliminated. Projection is denoted by the Greek letter pi (Π). We list those attributes that we wish to appear in the result as a subscript to Π. The argument relation follows in parentheses. Thus, the query to list all loan numbers and the amount of the loan can be written as

  Π_{loan-number, amount}(loan)

The relation that results from this query is shown in Figure 3.9.

  loan-number   amount
  L-17          1000
  L-23          2000
  L-15          1500
  L-14          1500
  L-93          500
  L-11          900
  L-16          1300

  Figure 3.9 Loan number and the amount of the loan.

3.2.1.3 Composition of Relational Operations

The fact that the result of a relational operation is itself a relation is important. Let us consider the more complicated query "Find those customers who live in Harrison." We write:

  Π_{customer-name}(σ_{customer-city = "Harrison"}(customer))

Notice that, instead of giving the name of a relation as the argument of the projection operation, we give an expression that evaluates to a relation.
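The select and project operations, and their composition, can be sketched in Python as follows. This is only an illustrative sketch, not part of the algebra itself: the helper names select and project are assumptions, a relation is modeled as a list of dictionaries, and the loan tuples are the ones that appear in the chapter's figures.

```python
# A relation is modeled as a list of dicts (one dict per tuple).
# The helper names `select` and `project` are illustrative assumptions.

loan = [
    {"branch-name": "Downtown",   "loan-number": "L-17", "amount": 1000},
    {"branch-name": "Redwood",    "loan-number": "L-23", "amount": 2000},
    {"branch-name": "Perryridge", "loan-number": "L-15", "amount": 1500},
    {"branch-name": "Downtown",   "loan-number": "L-14", "amount": 1500},
    {"branch-name": "Mianus",     "loan-number": "L-93", "amount": 500},
    {"branch-name": "Round Hill", "loan-number": "L-11", "amount": 900},
    {"branch-name": "Perryridge", "loan-number": "L-16", "amount": 1300},
]


def select(relation, predicate):
    """sigma_predicate(relation): keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attributes):
    """pi_attributes(relation): keep the listed attributes, dropping duplicates."""
    seen = set()
    result = []
    for t in relation:
        key = tuple(t[a] for a in attributes)
        if key not in seen:  # a relation is a set, so duplicates are eliminated
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


# sigma_{branch-name = "Perryridge"}(loan)
perryridge = select(loan, lambda t: t["branch-name"] == "Perryridge")

# sigma_{branch-name = "Perryridge" AND amount > 1200}(loan)
big_perryridge = select(
    loan, lambda t: t["branch-name"] == "Perryridge" and t["amount"] > 1200
)

# Composition: pi_{loan-number}(sigma_{amount > 1200}(loan))
loan_numbers_over_1200 = project(
    select(loan, lambda t: t["amount"] > 1200), ["loan-number"]
)

# Projection on amount alone collapses the two 1500 loans into one tuple.
distinct_amounts = project(loan, ["amount"])
```

Because select and project both return a relation in the same representation, the output of one can be passed directly to the other, which is the point of Section 3.2.1.3.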
In general, since the result of a relational-algebra operation is of the same type (relation) as its inputs, relational-algebra operations can be composed together into a relational-algebra expression. Composing relational-algebra operations into relational-algebra expressions is just like composing arithmetic operations (such as +, −, ∗, and ÷) into arithmetic expressions. We study the formal definition of relational-algebra expressions in Section 3.2.2.

3.2.1.4 The Union Operation

Consider a query to find the names of all bank customers who have either an account or a loan or both. Note that the customer relation does not contain the information, since a customer does not need to have either an account or a loan at the bank. To answer this query, we need the information in the depositor relation (Figure 3.4) and in the borrower relation (Figure 3.6). We know how to find the names of all customers with a loan in the bank:

  Π_{customer-name}(borrower)

We also know how to find the names of all customers with an account in the bank:

  Π_{customer-name}(depositor)

To answer the query, we need the union of these two sets; that is, we need all customer names that appear in either or both of the two relations. We find these data by the binary operation union, denoted, as in set theory, by ∪. So the expression needed is

  Π_{customer-name}(borrower) ∪ Π_{customer-name}(depositor)

The result relation for this query appears in Figure 3.10.

  customer-name
  Johnson
  Smith
  Hayes
  Turner
  Jones
  Lindsay
  Jackson
  Curry
  Williams
  Adams

  Figure 3.10 Names of all customers who have either a loan or an account.

Notice that there are 10 tuples in the result, even though there are seven distinct borrowers and six depositors. This apparent discrepancy occurs because Smith, Jones, and Hayes are borrowers as well as depositors. Since relations are sets, duplicate values are eliminated.
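The count of 10 can be checked with a small Python sketch that models each projected relation as a set of names. The borrower names are those that appear in the chapter's figures; the depositor names are an assumption inferred from Figure 3.10 and from the counts quoted above (six depositors, three of whom are also borrowers).

```python
# Each single-attribute relation is modeled as a Python set of names.
# depositor_names is inferred from Figure 3.10 and the counts in the text,
# so treat it as an assumption rather than as the depositor relation itself.
borrower_names = {"Jones", "Smith", "Hayes", "Jackson", "Curry", "Williams", "Adams"}
depositor_names = {"Johnson", "Smith", "Hayes", "Turner", "Jones", "Lindsay"}

# pi_{customer-name}(borrower) UNION pi_{customer-name}(depositor)
either = borrower_names | depositor_names

# The customers counted twice before duplicates are eliminated.
both = borrower_names & depositor_names

# 7 borrowers + 6 depositors - 3 in both = 10 distinct names
total = len(borrower_names) + len(depositor_names) - len(both)
```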
Observe that, in our example, we took the union of two sets, both of which consisted of customer-name values. In general, we must ensure that unions are taken between compatible relations. For example, it would not make sense to take the union of the loan relation and the borrower relation. The former is a relation of three attributes; the latter is a relation of two. Furthermore, consider a union of a set of customer names and a set of cities. Such a union would not make sense in most situations. Therefore, for a union operation r ∪ s to be valid, we require that two conditions hold:

1. The relations r and s must be of the same arity. That is, they must have the same number of attributes.
2. The domains of the ith attribute of r and the ith attribute of s must be the same.

Note that r and s can be, in general, temporary relations that are the result of relational-algebra expressions.

3.2.1.5 The Set Difference Operation

The set-difference operation, denoted by −, allows us to find tuples that are in one relation but are not in another. The expression r − s results in a relation containing those tuples in r but not in s.

We can find all customers of the bank who have an account but not a loan by writing

  Π_{customer-name}(depositor) − Π_{customer-name}(borrower)

The result relation for this query appears in Figure 3.11.

  customer-name
  Johnson
  Turner
  Lindsay

  Figure 3.11 Customers with an account but no loan.

As was the case with the union operation, we must ensure that set differences are taken between compatible relations. Therefore, for a set difference operation r − s to be valid, we require that the relations r and s be of the same arity, and that the domains of the ith attribute of r and the ith attribute of s be the same.

3.2.1.6 The Cartesian-Product Operation

The Cartesian-product operation, denoted by a cross (×), allows us to combine information from any two relations.
We write the Cartesian product of relations r1 and r2 as r1 × r2. Recall that a relation is defined to be a subset of a Cartesian product of a set of domains. From that definition, we should already have an intuition about the definition of the Cartesian-product operation. However, since the same attribute name may appear in both r1 and r2, we need to devise a naming schema to distinguish between these attributes. We do so here by attaching to an attribute the name of the relation from which the attribute originally came. For example, the relation schema for r = borrower × loan is

  (borrower.customer-name, borrower.loan-number, loan.branch-name, loan.loan-number, loan.amount)

With this schema, we can distinguish borrower.loan-number from loan.loan-number. For those attributes that appear in only one of the two schemas, we shall usually drop the relation-name prefix. This simplification does not lead to any ambiguity. We can then write the relation schema for r as

  (customer-name, borrower.loan-number, branch-name, loan.loan-number, amount)

This naming convention requires that the relations that are the arguments of the Cartesian-product operation have distinct names. This requirement causes problems in some cases, such as if the Cartesian product of a relation with itself is desired. A similar problem arises if we use the result of a relational-algebra expression in a Cartesian product, since we will need a name for the relation so that we can refer to the relation's attributes. In Section 3.2.1.7, we see how to avoid these problems by using a rename operation.

Now that we know the relation schema for r = borrower × loan, what tuples appear in r? As you may have suspected, we construct a tuple of r out of each possible pair of tuples: one from the borrower relation and one from the loan relation. Thus, r is a large relation, as you can see from Figure 3.12, where we included only a portion of the tuples that comprise r.

Assume that we have n1 tuples in borrower and n2 tuples in loan. Then, there are n1 ∗ n2 ways of choosing a pair of tuples, one tuple from each relation; so there are n1 ∗ n2 tuples in r. In particular, note that for some tuples t in r, it may be that t[borrower.loan-number] ≠ t[loan.loan-number].

In general, if we have relations r1(R1) and r2(R2), then r1 × r2 is a relation whose schema is the concatenation of R1 and R2. The relation r1 × r2 contains all tuples t for which there is a tuple t1 in r1 and a tuple t2 in r2 for which t[R1] = t1[R1] and t[R2] = t2[R2].

Suppose that we want to find the names of all customers who have a loan at the Perryridge branch. We need the information in both the loan relation and the borrower relation to do so. If we write

  σ_{branch-name = "Perryridge"}(borrower × loan)

then the result is the relation shown in Figure 3.13. We have a relation that pertains to only the Perryridge branch. However, the customer-name column may contain customers who do not have a loan at the Perryridge branch. (If you do not see why that is true, recall that the Cartesian product takes all possible pairings of one tuple from borrower with one tuple of loan.)

Since the Cartesian-product operation associates every tuple of loan with every tuple of borrower, we know that, if a customer has a loan in the Perryridge branch, then there is some tuple in borrower × loan that contains his name, and borrower.loan-number = loan.loan-number. So, if we write

  σ_{borrower.loan-number = loan.loan-number}(σ_{branch-name = "Perryridge"}(borrower × loan))

we get only those tuples of borrower × loan that pertain to customers who have a loan at the Perryridge branch.

Finally, since we want only customer-name, we do a projection:

  Π_{customer-name}(σ_{borrower.loan-number = loan.loan-number}(σ_{branch-name = "Perryridge"}(borrower × loan)))

The result of this expression is shown in Figure 3.14 and is the correct answer to our query.
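The three steps above (Cartesian product, two selections, projection) can be traced with a short Python sketch. It assumes relations modeled as lists of plain tuples, with attribute positions standing in for attribute names; the borrower and loan tuples are the ones that appear in the chapter's figures.

```python
from itertools import product

# borrower(customer-name, loan-number)
borrower = [
    ("Jones", "L-17"), ("Smith", "L-23"), ("Hayes", "L-15"), ("Jackson", "L-14"),
    ("Curry", "L-93"), ("Smith", "L-11"), ("Williams", "L-17"), ("Adams", "L-16"),
]

# loan(branch-name, loan-number, amount)
loan = [
    ("Downtown", "L-17", 1000), ("Redwood", "L-23", 2000),
    ("Perryridge", "L-15", 1500), ("Downtown", "L-14", 1500),
    ("Mianus", "L-93", 500), ("Round Hill", "L-11", 900),
    ("Perryridge", "L-16", 1300),
]

# borrower x loan: every borrower tuple paired with every loan tuple.
# Resulting positions: 0 customer-name, 1 borrower.loan-number,
# 2 branch-name, 3 loan.loan-number, 4 amount.
pairs = [b + l for b, l in product(borrower, loan)]

# sigma_{branch-name = "Perryridge"}: still contains mismatched pairings.
at_perryridge = [t for t in pairs if t[2] == "Perryridge"]

# sigma_{borrower.loan-number = loan.loan-number}: keep genuine pairings only.
matched = [t for t in at_perryridge if t[1] == t[3]]

# pi_{customer-name}: a set, so duplicates are eliminated.
names = {t[0] for t in matched}
```

With 8 borrower tuples and 7 loan tuples the product has 8 ∗ 7 = 56 tuples, of which 16 survive the first selection and only 2 survive the second.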
  customer-name   borrower.loan-number   branch-name   loan.loan-number   amount
  Jones           L-17                   Downtown      L-17               1000
  Jones           L-17                   Redwood       L-23               2000
  Jones           L-17                   Perryridge    L-15               1500
  Jones           L-17                   Downtown      L-14               1500
  Jones           L-17                   Mianus        L-93               500
  Jones           L-17                   Round Hill    L-11               900
  Jones           L-17                   Perryridge    L-16               1300
  Smith           L-23                   Downtown      L-17               1000
  Smith           L-23                   Redwood       L-23               2000
  Smith           L-23                   Perryridge    L-15               1500
  Smith           L-23                   Downtown      L-14               1500
  Smith           L-23                   Mianus        L-93               500
  Smith           L-23                   Round Hill    L-11               900
  Smith           L-23                   Perryridge    L-16               1300
  Hayes           L-15                   Downtown      L-17               1000
  Hayes           L-15                   Redwood       L-23               2000
  Hayes           L-15                   Perryridge    L-15               1500
  Hayes           L-15                   Downtown      L-14               1500
  Hayes           L-15                   Mianus        L-93               500
  Hayes           L-15                   Round Hill    L-11               900
  Hayes           L-15                   Perryridge    L-16               1300
  Williams        L-17                   Downtown      L-17               1000
  Williams        L-17                   Redwood       L-23               2000
  Williams        L-17                   Perryridge    L-15               1500
  Williams        L-17                   Downtown      L-14               1500
  Williams        L-17                   Mianus        L-93               500
  Williams        L-17                   Round Hill    L-11               900
  Williams        L-17                   Perryridge    L-16               1300
  Adams           L-16                   Downtown      L-17               1000
  Adams           L-16                   Redwood       L-23               2000
  Adams           L-16                   Perryridge    L-15               1500
  Adams           L-16                   Downtown      L-14               1500
  Adams           L-16                   Mianus        L-93               500
  Adams           L-16                   Round Hill    L-11               900
  Adams           L-16                   Perryridge    L-16               1300

  Figure 3.12 Result of borrower × loan.

  customer-name   borrower.loan-number   branch-name   loan.loan-number   amount
  Jones           L-17                   Perryridge    L-15               1500
  Jones           L-17                   Perryridge    L-16               1300
  Smith           L-23                   Perryridge    L-15               1500
  Smith           L-23                   Perryridge    L-16               1300
  Hayes           L-15                   Perryridge    L-15               1500
  Hayes           L-15                   Perryridge    L-16               1300
  Jackson         L-14                   Perryridge    L-15               1500
  Jackson         L-14                   Perryridge    L-16               1300
  Curry           L-93                   Perryridge    L-15               1500
  Curry           L-93                   Perryridge    L-16               1300
  Smith           L-11                   Perryridge    L-15               1500
  Smith           L-11                   Perryridge    L-16               1300
  Williams        L-17                   Perryridge    L-15               1500
  Williams        L-17                   Perryridge    L-16               1300
  Adams           L-16                   Perryridge    L-15               1500
  Adams           L-16                   Perryridge    L-16               1300

  Figure 3.13 Result of σ_{branch-name = "Perryridge"}(borrower × loan).

  customer-name
  Hayes
  Adams

  Figure 3.14 Result of Π_{customer-name}(σ_{borrower.loan-number = loan.loan-number}(σ_{branch-name = "Perryridge"}(borrower × loan))).

3.2.1.7 The Rename Operation

Unlike relations in the database, the results of relational-algebra expressions do not have a name that we can use to refer to them. It is useful to be able to give them names; the rename operator, denoted by the lowercase Greek letter rho (ρ), lets us perform this task. Given a relational-algebra expression E, the expression

  ρ_x(E)

returns the result of expression E under the name x.

A relation r by itself is considered to be a (trivial) relational-algebra expression. Thus, we can also apply the rename operation to a relation r to get the same relation under a new name.

A second form of the rename operation is as follows. Assume that a relational-algebra expression E has arity n. Then, the expression

  ρ_{x(A1, A2, ..., An)}(E)

returns the result of expression E under the name x, and with the attributes renamed to A1, A2, ..., An.

To illustrate the use of renaming a relation, we consider the query "Find the largest account balance in the bank." Our strategy is to compute first a temporary relation consisting of those balances that are not the largest, and then to take the set difference between the relation Π_{balance}(account) and the temporary relation just computed, to obtain the result.

To compute the temporary relation, we need to compare the values of all account balances. We do this comparison by computing the Cartesian product account × account and forming a selection to compare the value of any two balances appearing in one tuple. First, we need to devise a mechanism to distinguish between the two balance attributes. We shall use the rename operation to rename one reference to the account relation; thus, we can reference the relation twice without ambiguity.

The temporary relation that consists of the balances that are not the largest can now be written as

  Π_{account.balance}(σ_{account.balance < d.balance}(account × ρ_d(account)))

This expression gives those balances in the account relation for which a larger balance appears somewhere in the account relation (renamed as d). The result contains all balances except the largest one. This relation is shown in Figure 3.15.

  balance
  500
  700
  400
  350
  750

  Figure 3.15 Result of subexpression Π_{account.balance}(σ_{account.balance < d.balance}(account × ρ_d(account))).

The query to find the largest account balance in the bank can be written as follows:

  Π_{balance}(account) − Π_{account.balance}(σ_{account.balance < d.balance}(account × ρ_d(account)))

Figure 3.16 shows the result of this query.

  balance
  900

  Figure 3.16 Largest account balance in the bank.

Let us present one more example to illustrate the rename operation. Consider the query "Find the names of all customers who live on the same street and in the same city as Smith." We can obtain the street and city of Smith by writing

  Π_{customer-street, customer-city}(σ_{customer-name = "Smith"}(customer))

However, to find other customers with this street and city, we must reference the customer relation a second time. In the following query, we use the rename operation on the preceding expression to give its result the name smith-addr, and to rename its attributes to street and city, instead of customer-street and customer-city:

  Π_{customer.customer-name}(σ_{customer.customer-street = smith-addr.street ∧ customer.customer-city = smith-addr.city}(customer × ρ_{smith-addr(street, city)}(Π_{customer-street, customer-city}(σ_{customer-name = "Smith"}(customer)))))

The result of this query, when we apply it to the customer relation of Figure 3.3, is shown in Figure 3.17.

The rename operation is not strictly required, since it is possible to use a positional notation for attributes.
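The largest-balance query can be traced with a minimal Python sketch. The balances are those of Figures 3.15 and 3.16; the account numbers (A-1 through A-7) and the repeated balance of 700 are hypothetical, added only so that the relation has a key and shows duplicate elimination. Renaming is modeled simply by binding the same relation to a second variable d.

```python
from itertools import product

# account(account-number, balance); account numbers are hypothetical.
account = [
    ("A-1", 500), ("A-2", 400), ("A-3", 900), ("A-4", 700),
    ("A-5", 750), ("A-6", 700), ("A-7", 350),
]

# rho_d(account): a second name for the same relation, so that the two
# balance attributes in account x d can be told apart.
d = account

# pi_{account.balance}(sigma_{account.balance < d.balance}(account x d)):
# every balance for which some larger balance exists.
not_largest = {a[1] for a, other in product(account, d) if a[1] < other[1]}

# pi_{balance}(account) - not_largest
largest = {a[1] for a in account} - not_largest
```

The sketch uses only product, selection, projection, and set difference, mirroring the algebra expression rather than calling max().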
We can name attributes of a relation implicitly using a positional notation, where $1, $2, ... refer to the first attribute, the second attribute, and so on. The positional notation also applies to results of relational-algebra operations. The following relational-algebra expression illustrates the use of positional notation with the unary operator σ:

  σ_{$2 = $3}(R × R)

If a binary operation needs to distinguish between its two operand relations, a similar positional notation can be used for relation names as well. For example, $R1 could refer to the first operand, and $R2 could refer to the second operand. However, the positional notation is inconvenient for humans, since the position of the attribute is a number, rather than an easy-to-remember attribute name. Hence, we do not use the positional notation in this textbook.

  customer-name
  Smith
  Curry

  Figure 3.17 Customers who live on the same street and in the same city as Smith.

3.2.2 Formal Definition of the Relational Algebra

The operations that we saw in Section 3.2.1 allow us to give a complete definition of an expression in the relational algebra. A basic expression in the relational algebra consists of either one of the following:

• A relation in the database
• A constant relation

A general expression in the relational algebra is constructed out of smaller subexpressions. Let E1 and E2 be relational-algebra expressions. Then, the following are all relational-algebra expressions:

• E1 ∪ E2
• E1 − E2
• E1 × E2
• σ_P(E1), where P is a predicate on attributes in E1
• Π_S(E1), where S is a list consisting of some of the attributes in E1
• ρ_x(E1), where x is the new name for the result of E1

3.2.3 Additional Operations

The fundamental operations of the relational algebra are sufficient to express any relational-algebra query.¹ However, if we restrict ourselves to just the fundamental operations, certain common queries are lengthy to express.
Therefore, we define additional operations that do not add any power to the algebra, but that simplify common queries. For each new operation, we give an equivalent expression using only the fundamental operations.

¹ In Section 3.5, we introduce operations that extend the power of the relational algebra to handle null and aggregate values.

3.2.3.1 The Set-Intersection Operation

The first additional relational-algebra operation that we shall define is set intersection (∩). Suppose that we wish to find all customers who have both a loan and an account. Using set intersection, we can write

  Π_{customer-name}(borrower) ∩ Π_{customer-name}(depositor)

The result relation for this query appears in Figure 3.18.

  customer-name
  Hayes
  Jones
  Smith

  Figure 3.18 Customers with both an account and a loan at the bank.

Note that we can rewrite any relational-algebra expression using set intersection by replacing the intersection operation with a pair of set-difference operations as follows:

  r ∩ s = r − (r − s)

Thus, set intersection is not a fundamental operation and does not add any power to the relational algebra. It is simply more convenient to write r ∩ s than to write r − (r − s).

3.2.3.2 The Natural-Join Operation

It is often desirable to simplify certain queries that require a Cartesian product. Typically, a query that involves a Cartesian product includes a selection operation on the result of the Cartesian product. Consider the query "Find the names of all customers who have a loan at the bank, and find the amount of the loan." We first form the Cartesian product of the borrower and loan relations. Then, we select those tuples that pertain to only the same loan-number, followed by the projection of the resulting customer-name, loan-number, and amount:

  Π_{customer-name, loan.loan-number, amount}(σ_{borrower.loan-number = loan.loan-number}(borrower × loan))

The natural join is a binary operation that allows us to combine certain selections and a Cartesian product into one operation.
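The product-select-project pattern just shown, and a natural join that packages the same steps, can be compared in a short Python sketch. It is a sketch under stated assumptions: the function name natural_join is hypothetical, relations are modeled as lists of dictionaries keyed by attribute name, and the join matches on whatever attribute names the two tuples share.

```python
borrower = [
    {"customer-name": n, "loan-number": l}
    for n, l in [("Jones", "L-17"), ("Smith", "L-23"), ("Hayes", "L-15"),
                 ("Jackson", "L-14"), ("Curry", "L-93"), ("Smith", "L-11"),
                 ("Williams", "L-17"), ("Adams", "L-16")]
]
loan = [
    {"branch-name": b, "loan-number": l, "amount": a}
    for b, l, a in [("Downtown", "L-17", 1000), ("Redwood", "L-23", 2000),
                    ("Perryridge", "L-15", 1500), ("Downtown", "L-14", 1500),
                    ("Mianus", "L-93", 500), ("Round Hill", "L-11", 900),
                    ("Perryridge", "L-16", 1300)]
]


def natural_join(r, s):
    """Pair tuples that agree on every shared attribute name and merge them."""
    result = []
    for t1 in r:
        for t2 in s:
            common = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in common):
                merged = {**t1, **t2}  # shared attributes appear only once
                if merged not in result:
                    result.append(merged)
    return result


# Long form: Cartesian product, selection on loan-number, then projection.
long_form = {
    (b["customer-name"], l["loan-number"], l["amount"])
    for b in borrower for l in loan
    if b["loan-number"] == l["loan-number"]
}

# Short form: the same projection applied to the natural join.
short_form = {
    (t["customer-name"], t["loan-number"], t["amount"])
    for t in natural_join(borrower, loan)
}
```

Both forms produce the same eight tuples; the join version simply hides the equality test on loan-number inside the operator.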
The natural join is denoted by the "join" symbol ⋈. The natural-join operation forms a Cartesian product of its two arguments, performs a selection forcing equality on those attributes that appear in both relation schemas, and finally removes duplicate attributes.

Although the definition of natural join is complicated, the operation is easy to apply. As an illustration, let us consider again the example "Find the names of all customers who have a loan at the bank, and find the amount of the loan." This query can be expressed using the natural join as follows:

  Π_{customer-name, loan-number, amount}(borrower ⋈ loan)

Since the schemas for borrower and loan (that is, Borrower-schema and Loan-schema) have the attribute loan-number in common, the natural-join operation considers only pairs of tuples that have the same value on loan-number. It combines each such pair of tuples into a single tuple on the union of the two schemas (that is, customer-name, branch-name, loan-number, amount). After performing the projection, we obtain the relation shown in Figure 3.19.

  customer-name   loan-number   amount
  Jones           L-17          1000
  Smith           L-23          2000
  Hayes           L-15          1500
  Jackson         L-14          1500
  Curry           L-93          500
  Smith           L-11          900
  Williams        L-17          1000
  Adams           L-16          1300

  Figure 3.19 Result of Π_{customer-name, loan-number, amount}(borrower ⋈ loan).

Consider two relation schemas R and S, which are, of course, lists of attribute names. If we consider the schemas to be sets, rather than lists, we can denote those attribute names that appear in both R and S by R ∩ S, and denote those attribute names that appear in R, in S, or in both by R ∪ S. Similarly, those attribute names that appear in R but not S are denoted by R − S, whereas S − R denotes those attribute names that appear in S but not in R. Note that the union, intersection, and difference operations here are on sets of attributes, rather than on relations.

We are now ready for a formal definition of the natural join. Consider two relations r(R) and s(S). The natural join of r and s, denoted by r ⋈ s, is a relation on schema R ∪ S formally defined as follows:

  r ⋈ s = Π_{R ∪ S}(σ_{r.A1 = s.A1 ∧ r.A2 = s.A2 ∧ ... ∧ r.An = s.An}(r × s))

where R ∩ S = {A1, A2, ..., An}.

Because the natural join is central to much of relational-database theory and practice, we give several examples of its use.

• Find the names of all branches with customers who have an account in the bank and who live in Harrison.

    Π_{branch-name}(σ_{customer-city = "Harrison"}(customer ⋈ account ⋈ depositor))

  The result relation for this query appears in Figure 3.20.

    branch-name
    Brighton
    Perryridge

    Figure 3.20 Result of Π_{branch-name}(σ_{customer-city = "Harrison"}(customer ⋈ account ⋈ depositor)).

  Notice that we wrote customer ⋈ account ⋈ depositor without inserting parentheses to specify the order in which the natural-join operations on the three relations be executed. In the preceding case, there are two possibilities:

    o (customer ⋈ account) ⋈ depositor
    o customer ⋈ (account ⋈ depositor)

  We did not specify which expression we intended, because the two are equivalent. That is, the natural join is associative.

• Find all customers who have both a loan and an account at the bank.

    Π_{customer-name}(borrower ⋈ depositor)

  Note that in Section 3.2.3.1, we wrote an expression for this query using set intersection. We repeat this expression here.

    Π_{customer-name}(borrower) ∩ Π_{customer-name}(depositor)

  The result relation for this query was shown earlier in Figure 3.18. This example illustrates a general fact about the relational algebra: It is possible to write several equivalent relational-algebra expressions that are quite different from one another.

    branch-name
    Brighton
    Downtown

    Figure 3.21 Result of Π_{branch-name}(σ_{branch-city = "Brooklyn"}(branch)).

• Let r(R) and s(S) be relations without any attributes in common; that is, R ∩ S = ∅.
(GO denotes t ion it ion to the natural-join operation that joi ration is an extension u 1 ‘ The theta sine © selection and a Cartesian product into a single operation, Comi Erica r(R) and s($), and let @ be a predicate on attributes in the coma R US, The theta join operation r Mg s, is defined as follows: r Mg s = o9(r x s) 3.2.3.3 The Division Operation i +, is sui i include the phrase ivis ration, denoted by +, is suited to queries that incl as ai” Serpe that we wish to find all customers who have an account at all the branches located in Brooklyn, We can obtain all branches in Brooklyn by the expression 11 = Tloranch-name (Gbranch-city ="Brooklyn” (branch)) The result relation for this expression appears in Figure 3.21. 2 We can find all (customer-name, branch-name) pairs for which the customer has an account at a branch by writing 12 = Teustomer-name, branch-name (depositor ™ account) Figure 3.22 shows the result relation for this expression. , __ Now, we need to find customers who appear in r2 with every branch name in ry. The operation that provides exactly those customers is the divide operation. [customer-name | branch-name Johnson | Downtown Smith Mianus Hayes Perryridge Turner Round Hill Williams | Perryridge Lindsay | Redwood Johnson | Brighton Jones Brighton Figure 3.22 Result of q “tomer-name, branch-name (depositor 4 account). @ scanned with OKEN Scanner smulate the query by writing We fot Tleustomer-name, branch-name (depositor % account) + Mhranch-name (Otranch-city =“Brooktyn" (branch) ‘The result of this expression is a relation that has the schema i that contains the tuple (Johnson), _ Formally, let r(R) and S(S) be relations, and let S CR; that is artibute of schema S is also in schema R. The relation r = s is a relation on schema R — S—that is, on the schema containing all attributes of schema R that gre not in schema S. A tuple ¢ is in r + 5 if and only if both of two conditions hold: (customer-namey is every 1. 
t is in Π_{R−S}(r)
2. For every tuple ts in s, there is a tuple tr in r satisfying both of the following:
   a. tr[S] = ts[S]
   b. tr[R − S] = t

It may surprise you to discover that, given a division operation and the schemas of the relations, we can, in fact, define the division operation in terms of the fundamental operations. Let r(R) and s(S) be given, with S ⊆ R:

  r ÷ s = Π_{R−S}(r) − Π_{R−S}((Π_{R−S}(r) × s) − Π_{R−S,S}(r))

To see that this expression is true, we observe that Π_{R−S}(r) gives us all tuples t that satisfy the first condition of the definition of division. The expression on the right side of the set difference operator,

  Π_{R−S}((Π_{R−S}(r) × s) − Π_{R−S,S}(r)),

serves to eliminate those tuples that fail to satisfy the second condition of the definition of division. Let us see how it does so. Consider Π_{R−S}(r) × s. This relation is on schema R, and pairs every tuple in Π_{R−S}(r) with every tuple in s. The expression Π_{R−S,S}(r) merely reorders the attributes of r. Thus, (Π_{R−S}(r) × s) − Π_{R−S,S}(r) gives us those pairs of tuples from Π_{R−S}(r) and s that do not appear in r. If a tuple tj is in

  Π_{R−S}((Π_{R−S}(r) × s) − Π_{R−S,S}(r)),

then there is some tuple ts in s that does not combine with tuple tj to form a tuple in r. Thus, tj holds a value for attributes R − S that does not appear in r ÷ s. It is these values that we eliminate from Π_{R−S}(r).

3.2.3.4 The Assignment Operation

It is convenient at times to write a relational-algebra expression in parts using assignment to a temporary relation variable. The assignment operation, denoted by ←, works in a manner similar to assignment in a programming language. To illustrate this operation, we consider the definition of division given in Section 3.2.3.3. We could write r ÷ s as

  temp1 ← Π_{R−S}(r)
  temp2 ← Π_{R−S}((temp1 × s) − Π_{R−S,S}(r))
  result = temp1 − temp2

The evaluation of an assignment does not result in any relation being displayed to the user. Rather, the result of the expression to the right of the ← is assigned to the relation variable on the left of the ←. This relation variable may be used in subsequent expressions.
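The temp1, temp2, result formulation of division can be followed step by step in a small Python sketch. It assumes relations modeled as sets of tuples, takes r from Figure 3.22 and s from Figure 3.21, and uses ordinary Python assignment to stand in for the ← operator.

```python
from itertools import product

# r(customer-name, branch-name), as in Figure 3.22
r = {
    ("Johnson", "Downtown"), ("Smith", "Mianus"), ("Hayes", "Perryridge"),
    ("Turner", "Round Hill"), ("Williams", "Perryridge"),
    ("Lindsay", "Redwood"), ("Johnson", "Brighton"), ("Jones", "Brighton"),
}

# s(branch-name): the Brooklyn branches, as in Figure 3.21
s = {("Brighton",), ("Downtown",)}

# temp1 <- pi_{R-S}(r): every customer that appears in r at all.
temp1 = {(customer,) for customer, _ in r}

# temp2 <- pi_{R-S}((temp1 x s) - pi_{R-S,S}(r)):
# customers for whom at least one (customer, branch) pairing is missing from r.
temp2 = {
    (customer,)
    for (customer,), (branch,) in product(temp1, s)
    if (customer, branch) not in r
}

# result = temp1 - temp2: customers paired in r with every branch in s.
result = temp1 - temp2
```

Jones, for example, lands in temp2 because the pairing (Jones, Downtown) is absent from r, so only Johnson remains in the result.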
With the assignment operation, a query can be written as a sequential program consisting of a series of assignments followed by an expression whose value is displayed as the result of the query. For relational-algebra queries, assignment must always be made to a temporary relation variable. Assignments to permanent relations constitute a database modification. We discuss this issue in Section 3.6. Note that the assignment operation does not provide any additional power to the algebra. It is, however, a convenient way to express complex queries.

3.3 The Tuple Relational Calculus

When we write a relational-algebra expression, we provide a sequence of procedures that generates the answer to our query. The tuple relational calculus, by contrast, is a nonprocedural query language. It describes the desired information without giving a specific procedure for obtaining that information.

A query in the tuple relational calculus is expressed as

  {t | P(t)}

that is, it is the set of all tuples t such that predicate P is true for t. Following our earlier notation, we use t[A] to denote the value of tuple t on attribute A, and we use t ∈ r to denote that tuple t is in relation r.

Before we give a formal definition of the tuple relational calculus, we return to some of the queries for which we wrote relational-algebra expressions in Section 3.2.

3.3.1 Example Queries

Say that we want to find the branch-name, loan-number, and amount for loans of over $1200:

  {t | t ∈ loan ∧ t[amount] > 1200}

Suppose that we want only the loan-number attribute, rather than all attributes of the loan relation. To write this query in the tuple relational calculus, we need to write an expression for a relation on the schema (loan-number). We need those tuples on (loan-number) such that there is a tuple in loan with the amount attribute > 1200. To express this request, we need the construct "there exists" from mathematical logic.
The "there exists" notation

  ∃ t ∈ r (Q(t))

means "there exists a tuple t in relation r such that predicate Q(t) is true."

Using this notation, we can write the query "Find the loan number for each loan of an amount greater than $1200" as

  {t | ∃ s ∈ loan (t[loan-number] = s[loan-number] ∧ s[amount] > 1200)}

In English, we read the preceding expression as "the set of all tuples t such that there exists a tuple s in relation loan for which the values of t and s for the loan-number attribute are equal, and the value of s for the amount attribute is greater than $1200."

Tuple variable t is defined on only the loan-number attribute, since that is the only attribute for which a condition is specified for t. Thus, the result is a relation on (loan-number).

Consider the query "Find the names of all customers who have a loan from the Perryridge branch." This query is slightly more complex than the previous queries, since it involves two relations: borrower and loan. As we shall see, however, all it requires is that we have two "there exists" clauses in our tuple-relational-calculus expression, connected by and (∧). We write the query as follows:

  {t | ∃ s ∈ borrower (t[customer-name] = s[customer-name]
       ∧ ∃ u ∈ loan (u[loan-number] = s[loan-number]
                     ∧ u[branch-name] = "Perryridge"))}

In English, this expression is "the set of all (customer-name) tuples for which the customer has a loan that is at the Perryridge branch." Tuple variable u ensures that the customer is a borrower at the Perryridge branch. Tuple variable s is restricted to pertain to the same loan number as u. The result of this query is shown in Figure 3.23.

  customer-name
  Hayes
  Adams

  Figure 3.23 Names of all customers who have a loan at the Perryridge branch.

To find all customers who have a loan, an account, or both at the bank, we used the union operation in the relational algebra.
In the tuple relational calculus, we shall need two “there exists” clauses, connected by or (∨):

    {t | ∃ s ∈ borrower (t[customer-name] = s[customer-name])
         ∨ ∃ u ∈ depositor (t[customer-name] = u[customer-name])}

    customer-name
    Hayes
    Adams

    Figure 3.23 Names of all customers who have a loan at the Perryridge branch.

This expression gives us the set of all customer-name tuples such that at least one of the following holds:

• The customer-name appears in some tuple of the borrower relation as a borrower from the bank.
• The customer-name appears in some tuple of the depositor relation as a depositor of the bank.

If some customer has both a loan and an account at the bank, that customer appears only once in the result, because the mathematical definition of a set does not allow duplicate members. The result of this query was shown in an earlier figure.

If we now want only those customers that have both an account and a loan at the bank, all we need to do is to change the or (∨) to and (∧) in the preceding expression.

    {t | ∃ s ∈ borrower (t[customer-name] = s[customer-name])
         ∧ ∃ u ∈ depositor (t[customer-name] = u[customer-name])}

The result of this query was shown in Figure 3.18.

Now consider the query “Find all customers who have an account at the bank but do not have a loan from the bank.” The tuple-relational-calculus expression for this query is similar to the expressions that we have just seen, except for the use of the not (¬) symbol:

    {t | ∃ u ∈ depositor (t[customer-name] = u[customer-name])
         ∧ ¬ ∃ s ∈ borrower (t[customer-name] = s[customer-name])}

The ∃ u ∈ depositor clause requires that the customer have an account at the bank, and the ¬ ∃ s ∈ borrower clause eliminates those customers who appear in some tuple of the borrower relation as having a loan from the bank. The result of this query was shown earlier.

The query that we shall consider next uses implication, denoted by ⇒.
The formula P ⇒ Q means “P implies Q”; that is, “if P is true, then Q must be true.” Note that P ⇒ Q is logically equivalent to ¬P ∨ Q. The use of implication rather than not and or often suggests a more intuitive interpretation of a query in English.

Consider the query that we used in Section 3.2.3 to illustrate the division operation: “Find all customers who have an account at all branches located in Brooklyn.” To write this query in the tuple relational calculus, we introduce the “for all” construct, denoted by ∀. The notation

    ∀ t ∈ r (Q(t))

means “Q is true for all tuples t in relation r.”

We write the expression for our query as follows:

    {t | ∀ u ∈ branch (u[branch-city] = “Brooklyn” ⇒
         ∃ s ∈ depositor (t[customer-name] = s[customer-name]
         ∧ ∃ w ∈ account (w[account-number] = s[account-number]
                          ∧ w[branch-name] = u[branch-name])))}

In English, we interpret this expression as “the set of all customers (that is, (customer-name) tuples t) such that, for all tuples u in the branch relation, if the value of u on attribute branch-city is Brooklyn, then the customer has an account at the branch whose name appears in the branch-name attribute of u.”

3.3.2 Formal Definition

We are now ready for a formal definition. A tuple-relational-calculus expression is of the form

    {t | P(t)}

where P is a formula. Several tuple variables may appear in a formula. A tuple variable is said to be a free variable unless it is quantified by a ∃ or ∀. Thus, in

    t ∈ loan ∧ ∃ s ∈ customer (t[branch-name] = s[branch-name])

t is a free variable. Tuple variable s is said to be a bound variable.

A tuple-relational-calculus formula is built up out of atoms.
An atom has one of the following forms:

• s ∈ r, where s is a tuple variable and r is a relation (we do not allow use of the ∉ operator)
• s[x] Θ u[y], where s and u are tuple variables, x is an attribute on which s is defined, y is an attribute on which u is defined, and Θ is a comparison operator (<, ≤, =, ≠, >, ≥); we require that attributes x and y have domains whose members can be compared by Θ
• s[x] Θ c, where s is a tuple variable, x is an attribute on which s is defined, Θ is a comparison operator, and c is a constant in the domain of attribute x

We build up formulae from atoms using the following rules:

• An atom is a formula.
• If P1 is a formula, then so are ¬P1 and (P1).
• If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2, and P1 ⇒ P2.
• If P1(s) is a formula containing a free tuple variable s, and r is a relation, then ∃ s ∈ r (P1(s)) and ∀ s ∈ r (P1(s)) are also formulae.

As we could for the relational algebra, we can write equivalent expressions that are not identical in appearance. In the tuple relational calculus, these equivalences include the following three rules:

1. P1 ∧ P2 is equivalent to ¬(¬P1 ∨ ¬P2).
2. ∀ t ∈ r (P1(t)) is equivalent to ¬ ∃ t ∈ r (¬P1(t)).
3. P1 ⇒ P2 is equivalent to ¬P1 ∨ P2.

3.3.3 Safety of Expressions

There is one final issue to be addressed. A tuple-relational-calculus expression may generate an infinite relation. Suppose that we wrote the expression

    {t | ¬(t ∈ loan)}

There are infinitely many tuples that are not in loan. Most of these tuples contain values that do not even appear in the database! Clearly, we do not wish to allow such expressions.

To assist us in defining a restriction of the tuple relational calculus, we introduce the concept of the domain of a tuple relational formula, P. Intuitively, the domain of P, denoted dom(P), is the set of all values referenced by P.
These include values mentioned in P itself, as well as values that appear in a tuple of a relation mentioned in P. Thus, the domain of P is the set of all values that appear explicitly in P or that appear in one or more relations whose names appear in P. For example, dom(t ∈ loan ∧ t[amount] > 1200) is the set containing 1200 as well as the set of all values appearing in loan. Also, dom(¬(t ∈ loan)) is the set of all values appearing in loan, since the relation loan is mentioned in the expression.

We say that an expression {t | P(t)} is safe if all values that appear in the result are values from dom(P). The expression {t | ¬(t ∈ loan)} is not safe. Note that dom(¬(t ∈ loan)) is the set of all values appearing in loan. However, it is possible to have a tuple t not in loan that contains values that do not appear in loan. The other examples of tuple-relational-calculus expressions that we have written in this section are safe.

3.3.4 Expressive Power of Languages

The tuple relational calculus restricted to safe expressions is equivalent in expressive power to the relational algebra. Thus, for every relational-algebra expression, there is an equivalent expression in the tuple relational calculus, and for every tuple-relational-calculus expression, there is an equivalent relational-algebra expression. We will not prove this assertion here; the bibliographic notes contain references to the proof. Some parts of the proof are included in the exercises.

3.4 The Domain Relational Calculus

There is a second form of relational calculus called domain relational calculus. This form uses domain variables that take on values from an attribute's domain, rather than values for an entire tuple. The domain relational calculus, however, is closely related to the tuple relational calculus.

3.4.1 Formal Definition

An expression in the domain relational calculus is of the form

    {⟨x1, x2, ..., xn⟩ | P(x1, x2, ..., xn)}

where x1, x2, ..., xn represent domain variables.
P represents a formula composed of atoms, as was the case in the tuple relational calculus. An atom in the domain relational calculus has one of the following forms:

• ⟨x1, x2, ..., xn⟩ ∈ r, where r is a relation on n attributes and x1, x2, ..., xn are domain variables or domain constants.
• x Θ y, where x and y are domain variables and Θ is a comparison operator (<, ≤, =, ≠, >, ≥). We require that attributes x and y have domains that can be compared by Θ.
• x Θ c, where x is a domain variable, Θ is a comparison operator, and c is a constant in the domain of the attribute for which x is a domain variable.

We build up formulae from atoms using the following rules:

• An atom is a formula.
• If P1 is a formula, then so are ¬P1 and (P1).
• If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2, and P1 ⇒ P2.
• If P1(x) is a formula in x, where x is a domain variable, then ∃ x (P1(x)) and ∀ x (P1(x)) are also formulae.

As a notational shorthand, we write

    ∃ a, b, c (P(a, b, c))

for

    ∃ a (∃ b (∃ c (P(a, b, c))))

3.4.2 Example Queries

We now give domain-relational-calculus queries for the examples that we considered earlier. Note the similarity of these expressions and the corresponding tuple-relational-calculus expressions.

• Find the branch name, loan number, and amount for loans of over $1200:

    {⟨b, l, a⟩ | ⟨b, l, a⟩ ∈ loan ∧ a > 1200}

• Find all loan numbers for loans with an amount greater than $1200:

    {⟨l⟩ | ∃ b, a (⟨b, l, a⟩ ∈ loan ∧ a > 1200)}

Although the second query appears similar to the one that we wrote for the tuple relational calculus, there is an important difference. In the tuple calculus, when we write ∃ s for some tuple variable s, we bind it immediately to a relation by writing ∃ s ∈ r. However, when we write ∃ b in the domain calculus, b refers not to a tuple, but rather to a domain value. Thus, the domain of variable b is unconstrained until the subformula ⟨b, l, a⟩ ∈ loan constrains b to branch names that appear in the loan relation.
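The difference between binding a tuple variable and constraining domain variables can be sketched with Python set comprehensions. This is only an illustration under stated assumptions: the rows of loan are made up for the sketch, the attribute order (branch-name, loan-number, amount) follows the ⟨b, l, a⟩ convention used above, and the candidate values for b, l, and a are drawn from the columns of loan so that the search stays finite.

```python
# Hypothetical rows for loan(branch-name, loan-number, amount); made up for this sketch.
loan = {
    ("Downtown", "L-17", 1000),
    ("Redwood", "L-23", 2000),
    ("Perryridge", "L-15", 1500),
}

# {<b, l, a> | <b, l, a> in loan and a > 1200}
over_1200 = {(b, l, a) for (b, l, a) in loan if a > 1200}

# Tuple-calculus style:
# {t | exists s in loan (t[loan-number] = s[loan-number] and s[amount] > 1200)}
# The tuple variable s ranges over loan from the moment it is introduced.
tuple_style = {(s[1],) for s in loan if s[2] > 1200}

# Domain-calculus style: {<l> | exists b, a (<b, l, a> in loan and a > 1200)}
# b and a are plain values; only the membership test ties them to loan.
# Assumption: candidate values come from the columns of loan.
branch_names = {b for (b, _, _) in loan}
loan_numbers = {l for (_, l, _) in loan}
amounts = {a for (_, _, a) in loan}
domain_style = {
    (l,)
    for l in loan_numbers
    if any((b, l, a) in loan and a > 1200 for b in branch_names for a in amounts)
}

print(sorted(over_1200))
print(sorted(tuple_style))
print(sorted(domain_style))
```

Both styles produce the same one-attribute relation; the domain-style version simply makes explicit that b and a are unconstrained values until the membership test on loan is applied.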
• Find the names of all customers who have a loan from the Perryridge branch and find the loan amount:

    {⟨c, a⟩ | ∃ l (⟨c, l⟩ ∈ borrower
              ∧ ∃ b (⟨b, l, a⟩ ∈ loan ∧ b = “Perryridge”))}

• Find the names of all customers who have a loan, an account, or both at the Perryridge branch:

    {⟨c⟩ | ∃ l (⟨c, l⟩ ∈ borrower
            ∧ ∃ b, a (⟨b, l, a⟩ ∈ loan ∧ b = “Perryridge”))
          ∨ ∃ a (⟨c, a⟩ ∈ depositor
            ∧ ∃ b, n (⟨b, a, n⟩ ∈ account ∧ b = “Perryridge”))}

• Find the names of all customers who have an account at all the branches located in Brooklyn:

    {⟨c⟩ | ∀ x, y, z ((⟨x, y, z⟩ ∈ branch ∧ y = “Brooklyn”) ⇒
            ∃ a, b (⟨x, a, b⟩ ∈ account ∧ ⟨c, a⟩ ∈ depositor))}

In English, we interpret the preceding expression as “the set of all (customer-name) tuples c such that, for all (branch-name, branch-city, assets) tuples, x, y, z, if the branch city is Brooklyn, then the following is true”:

• There exists a tuple in the relation account with account number a and branch name x
• There exists a tuple in the relation depositor with customer c and account number a

3.4.3 Safety of Expressions

CHAPTER 6
INTEGRITY CONSTRAINTS

Integrity constraints provide a means of ensuring that changes made to the database by authorized users do not result in a loss of data consistency. Thus, integrity constraints guard against accidental damage to the database.

We have already seen a form of integrity constraint for the E-R model in Chapter 2. These constraints were in the following forms:

• Key declarations: the stipulation that certain attributes form a candidate key for a given entity set. The set of legal insertions and updates is constrained to those that do not create two entities with the same value on a candidate key.
• Form of a relationship: many to many, one to many, one to one. A one-to-one or one-to-many relationship restricts the set of legal relationships among entities of a collection of entity sets.

In general, an integrity constraint can be an arbitrary predicate pertaining to the database. However, arbitrary predicates may be costly to test.
Thus, we usually limit ourselves to integrity constraints that can be tested with minimal overhead.

6.1 Domain Constraints

We have seen that a domain of possible values must be associated with every attribute. In Chapter 4, we saw how such constraints are specified in the SQL DDL. Domain constraints are the most elementary form of integrity constraint. They are tested easily by the system whenever a new data item is entered into the database.

It is possible for several attributes to have the same domain. For example, the attributes customer-name and employee-name might have the same domain: the set of all person names. However, the domains of balance and branch-name certainly ought to be distinct. It is perhaps less clear whether customer-name and branch-name should have the same domain. At the implementation level, both customer names and branch names are character strings. However, we would normally not consider the query “Find all customers who have the same name as a branch” to be a meaningful query. Thus, if we view the database at the conceptual, rather than the physical, level, customer-name and branch-name should have distinct domains.

From the preceding discussion, we can see that a proper definition of domain constraints not only allows us to test values inserted in the database, but also permits us to test queries to ensure that the comparisons made make sense. The principle behind attribute domains is similar to that behind typing of variables in programming languages. Strongly typed programming languages allow the compiler to check the program in greater detail. However, strongly typed languages inhibit “clever hacks” that are often required for systems programming. Since database systems are designed to support users who are not computer experts, the benefits of strong typing often outweigh the disadvantages. Nevertheless, many existing systems allow only a small number of types of domains. Newer systems, particularly object-oriented database systems, offer a rich set of domain types that can be extended easily. Object-oriented databases are discussed in Chapters 8 and 9.
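The idea that distinct domains let a system reject meaningless comparisons can be sketched in a few lines of Python. The class DomainValue and the method same_as are hypothetical names introduced only for this illustration; they are not part of SQL or of any database system discussed here. The sketch assumes that each value simply carries the name of its conceptual domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainValue:
    """A string value tagged with the name of its conceptual domain."""

    domain: str
    value: str

    def same_as(self, other: "DomainValue") -> bool:
        # Reject comparisons across domains, much as a type checker would.
        if self.domain != other.domain:
            raise TypeError(f"cannot compare {self.domain} with {other.domain}")
        return self.value == other.value


# customer-name and employee-name share the domain of person names.
customer = DomainValue("person-name", "Hayes")
employee = DomainValue("person-name", "Hayes")
# branch-name gets its own domain, even though it is also a character string.
branch = DomainValue("branch-name", "Perryridge")

print(customer.same_as(employee))  # same domain, so the comparison is allowed

try:
    customer.same_as(branch)
except TypeError as err:
    print("rejected:", err)
```

At the physical level all three values are character strings, yet the comparison between a customer name and a branch name is refused because the two values belong to different conceptual domains.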
The check clause in SQL-92 permits domains to be restricted in powerful ways that most programming language type systems do not permit. Specifically, the check clause permits the schema designer to specify a predicate that must be satisfied by any value assigned to a variable whose type is the domain. For instance, a check clause can ensure that an hourly wage domain allows only values greater than a specified value (such as the minimum wage), as illustrated here:

    create domain hourly-wage numeric(5,2)
        constraint wage-value-test
        check(value >= 4.00)

The domain hourly-wage is declared to be a decimal number with a total of five digits, two of which are placed after the decimal point, and the domain has a constraint that ensures that the hourly wage is at least 4.00. The clause constraint wage-value-test is optional, and is used to give the name wage-value-test to the constraint. The name is used to indicate which constraint an update violated.

The check clause can also be used to restrict a domain to not contain any null values, as illustrated here:

    create domain account-number char(10)
        constraint account-number-null-test
        check(value not null)

As another example, the domain can be restricted to contain only a specified set of values by using the in clause:

    create domain account-type char(10)
        constraint account-type-test
        check(value in (“Checking”,
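The behavior of a named check constraint can be imitated outside the database, which may help to show what the system does when a value is assigned to a domain. The following Python sketch is an assumption-laden toy, not an implementation of SQL-92: the names Domain, ConstraintViolation, and validate are hypothetical, wage values are passed as decimal strings, and the sketch ignores the way SQL treats nulls inside check predicates.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Callable


class ConstraintViolation(ValueError):
    """Raised when a value fails a named domain constraint."""


@dataclass(frozen=True)
class Domain:
    """Hypothetical stand-in for: create domain ... constraint ... check(...)."""

    name: str
    constraint_name: str
    predicate: Callable[[Any], bool]

    def validate(self, value: Any) -> Any:
        # Name the failed constraint, mirroring the purpose of the constraint clause.
        if not self.predicate(value):
            raise ConstraintViolation(
                f"{self.constraint_name} violated in domain {self.name}: {value!r}"
            )
        return value


# Mirrors check(value >= 4.00) on the hourly-wage domain.
hourly_wage = Domain(
    name="hourly-wage",
    constraint_name="wage-value-test",
    predicate=lambda v: Decimal(v) >= Decimal("4.00"),
)

# Mirrors the not-null test on the account-number domain.
account_number = Domain(
    name="account-number",
    constraint_name="account-number-null-test",
    predicate=lambda v: v is not None,
)

print(hourly_wage.validate("5.25"))
try:
    hourly_wage.validate("3.50")
except ConstraintViolation as err:
    print(err)
```

The error message carries the constraint name, which is the role the text assigns to the optional constraint clause: it tells the user which constraint an update violated.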
