De Unit 1-Database Concepts

DATA ENVIRONMENT

Prof. Ravi Prakash


8979048096
ravi.prakash@ddn.upes.ac.in
COURSE OBJECTIVES

1. To help students understand the importance of data in the data
environment.
2. To enable students to describe and analyze data using different
techniques under different conditions.
3. To enable students to analyze specific characteristics of data
management.
4. To enable students to synthesize related information and evaluate
options to reach the most logical and optimal solution.
DATA ENVIRONMENT
COURSE OUTLINE

UNIT CONTENTS
1 Understanding Data
2 Understanding Data Collection
3 Understanding Data Storage & Management
4 Understanding Data Visualization
SYLLABUS
I. Understanding Data
• Data, information and knowledge
• Types of data
• Introduction to database management systems
• Data modeling using ER Diagrams
• Using relational DBMS
• SQL: how to create a database, load data, insert/delete, and ask queries
II. Understanding Data Collection
• Basics of Data Collection
• Data Measurement & Scaling techniques
• Data collection methods for Primary and Secondary data
• Issues with data collection methods
III. Understanding Data Storage and Management
• Data storage techniques
• Data management techniques
IV. Understanding Data Visualization
• Data Visualization need and concept
• Data Visualization techniques
MODULE I - UNDERSTANDING DATA
OBJECTIVES OF THIS SESSION
• Definition and Concept of DBMS
• Drawbacks of file processing systems
• Database environment Components
• Database Users
• Advantages of DBMS
• When Not to Use a DBMS
• Evolution of Database Systems
Information is the backbone of any organization. In a world that
focuses on achievement and advantage, information is the critical
factor that enables managers and organizations to gain a
competitive edge. It is the most critical resource of an organization.
Information is nothing but refined data.

According to Burch and Grudnitski, “Information is data that have
been put into a meaningful and useful context and communicated
to a recipient who uses it to make decisions”.

Information consists of data, images, text, documents, audio and
video, but always organized in a meaningful context.
Data are processed to produce information for the decision-maker
in business organizations.
A database consists of four elements, as shown in the figure below: data
items, relationships, constraints and a schema.
Data are the binary computer representation of what is stored.

Fig : Components of a Database (data items, relationships, constraints, schema)

In a table or relation, the cardinality of the relation is the number of
tuples (rows or records) in that relation.
The degree of a relation is the number of attributes (columns) in that
relation.
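
The two definitions above can be checked on a small table. This is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database; the student table, its rows and the helper name cardinality_and_degree are invented for illustration.

```python
import sqlite3


def cardinality_and_degree(conn, table):
    """Return (cardinality, degree) for a table whose name is trusted."""
    # Cardinality: the number of tuples (rows) in the relation.
    cardinality = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # Degree: the number of attributes (columns), read from table metadata.
    degree = len(conn.execute(f"PRAGMA table_info({table})").fetchall())
    return cardinality, degree


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, course TEXT)"
)
conn.executemany(
    "INSERT INTO student (id, name, age, course) VALUES (?, ?, ?, ?)",
    [(1, "Asha", 20, "BBA"), (2, "Bilal", 21, "MBA"), (3, "Chen", 19, "BBA")],
)

print(cardinality_and_degree(conn, "student"))  # (3, 4)
```

Adding a row changes the cardinality but not the degree; adding a column changes the degree but not the cardinality.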

• RDBMS stands for Relational Database Management System.
• Most widely used database management systems, such as MS SQL Server,
IBM DB2, Oracle, MySQL and Microsoft Access, are relational. SQL is the
query language these systems share, not a DBMS in itself.
• It is called a Relational Database Management System (RDBMS) because
it is based on the relational model introduced by E.F. Codd.
• Data is stored in tables (relations); each row of a table is a tuple.
• The relational database is the most commonly used kind of database. It
contains a number of tables, and each table normally has its own primary key.
• Because the data is organized as a set of tables, it can be accessed
easily in an RDBMS.
• A table is a collection of related data entries and consists of rows
and columns.
• A table is the simplest form of data storage in an RDBMS.

Student Table

• A field (column) is a smaller entity of the table that holds one specific piece of
information about every record. The fields of the Student table are id, name, age and course.
• A row of a table is also called a record. It contains the information for one individual
entry and is a horizontal entity in the table. The Student table contains 5 records.
• A NULL value specifies that the field was left blank when the record was created. It is
different from a value of zero and from a field that contains spaces.
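
The difference between NULL, zero and an empty value can be seen directly in a query. This is a minimal sketch, assuming Python's built-in sqlite3 module; the table contents are invented for illustration and are not the rows of the Student table in the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (id, name, age) VALUES (?, ?, ?)",
    [
        (1, "Asha", 20),
        (2, "Bilal", 0),    # age stored as the number zero
        (3, "", 19),        # name stored as an empty string
        (4, "Chen", None),  # age left blank: Python None is stored as NULL
    ],
)

# IS NULL matches only the row whose age was never filled in.
null_age_ids = [row[0] for row in conn.execute("SELECT id FROM student WHERE age IS NULL")]
# Comparing with zero matches only the row that really holds 0.
zero_age_ids = [row[0] for row in conn.execute("SELECT id FROM student WHERE age = 0")]
# An empty string is a value too, so it is not NULL.
empty_name_ids = [row[0] for row in conn.execute("SELECT id FROM student WHERE name = ''")]

print(null_age_ids, zero_age_ids, empty_name_ids)  # [4] [2] [3]
```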

ER Model - Basic Concepts

• ER Diagram stands for Entity Relationship Diagram (ERD), a diagram that
displays the relationships among the entity sets stored in a database. The ER
model defines the conceptual view of a database. It works around real-world
entities and the associations among them.
• ER diagrams help to explain the logical structure of databases.
• ER diagrams are created based on three basic concepts: entities,
attributes and relationships.
• An entity is a real-world object, either animate or inanimate, that is
easily identifiable. For example, in a school database, students, teachers,
classes, and courses offered can be considered as entities. All these entities
have some attributes or properties that give them their identity.
• Entities are represented by means of their properties, called attributes. All
attributes have values. For example, a student entity may have name, class,
and age as attributes.
• The association among entities is called a relationship. For example, an
employee works_at a department, and a student enrolls in a course. Here,
Works_at and Enrolls are called relationships.
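
One common way to carry these ER concepts into a relational DBMS is to turn each entity set into a table, each attribute into a column, and a many-to-many relationship such as Enrolls into a table of its own. This is a minimal sketch, assuming Python's built-in sqlite3 module; all table names, column names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    -- Entity sets become tables; attributes become columns.
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- The relationship "enrolls" links one student to one course per row.
    CREATE TABLE enrolls (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [(1, "Asha", 20), (2, "Bilal", 21)])
conn.executemany("INSERT INTO course VALUES (?, ?)", [(10, "Databases"), (20, "Statistics")])
conn.executemany("INSERT INTO enrolls VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])

# Follow the relationship from students to the courses they enroll in.
enrollments = conn.execute("""
    SELECT s.name, c.title
    FROM enrolls e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(enrollments)
```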
Definitions
• Data: Meaningful facts, text, graphics, images, sound, video
segments.
• Database: An organized collection of logically related data.
• Information: Data processed to be useful in decision making.
• Metadata: Data that describes data.
• DBMS : A database management system (DBMS) is a collection of
programs that enables users to create and maintain a database.
• The DBMS is hence a general-purpose software system that
facilitates the processes of defining, constructing, manipulating,
and sharing databases among various users and applications.
DBMS vs. File System

Disadvantages of file processing systems

File processing systems are still widely used today (e.g. for backup) but have the following problems:
• Program-Data Dependence: file descriptions are stored within each
application that accesses a file, so a change to the file structure requires
changes to the file descriptions in all programs.
• Data Redundancy (duplication of data): wasteful and prone to inconsistency
and loss of metadata integrity (the same data has different names in different
files, or the same name is used for different data in different files).
• Limited Data Sharing: users have little opportunity to share data outside
their own applications.
• Lengthy Development Times: little opportunity to re-use previous
development efforts.
• Excessive Program Maintenance: the factors above combine to create a heavy
maintenance load.
A simplified database system environment

Fig 1.1 : The Database environment
Actors on the scene (Database Users)
1. Database administrators (DBAs): Responsible for managing the database system,
authorizing access, coordinating and monitoring its use, and acquiring resources.
2. Database designers: Responsible for designing the database, identifying the data
to be stored, and choosing the structures to represent and store this data.
3. End users: The people who use the database for querying, updating, generating
reports, etc.
• Casual end users: Occasional users (typically middle- or high-level managers).
• Parametric or naïve users: Make up a sizable portion of database end users. Their main job
function revolves around constantly querying and updating the database, using standard
types of queries and updates, called canned transactions, that have been carefully
programmed and tested. Examples: bank tellers and reservation clerks.
• Sophisticated end users: Use the full capabilities of the DBMS to implement complex applications.
• Stand-alone users: Maintain personal databases by using ready-made program packages
that provide easy-to-use menu-based or graphics-based interfaces. An example is the user of
a tax package that stores a variety of personal financial data for tax purposes.
4. System analysts/application programmers: Design and implement canned
transactions for parametric users.
Advantages of Database Management
• Reducing Data Redundancy
• File-based data management systems kept multiple files in many different locations in a
system, or even across multiple systems. As a result, there were sometimes multiple copies
of the same data, which led to data redundancy. A database reduces this because the data
is held in a single database and a change made there is visible to every application that
uses it. Redundancy is controlled rather than eliminated: duplicates can still arise if
the design allows them.
• Sharing of Data
• In a database, the users of the database can share the data among themselves.
• Data Integrity
• Data integrity means that the data in the database is accurate and consistent. In a
multi-user environment, it is necessary to ensure that the data stays correct and consistent.
• Data Security
• Data security is a vital concept in a database. Only authorized users are allowed to access
the database, and their identity is authenticated, for example with a username and password.
Unauthorized users should not be allowed to access the database under any
circumstances.
• Providing Persistent Storage
• Databases can provide persistent storage for program objects and data structures;
object-oriented database systems are built around this capability.
• Backup and Recovery
• A DBMS provides facilities for backup and for recovery from hardware or software failures,
so users do not have to manage copies of the data themselves. Backups still have to be
configured and scheduled, normally by the DBA.
• Data Consistency
• All users viewing the database see the same data. Once a change to the database is
committed, it is visible to all users, which avoids data inconsistency.
• Providing Multiple User Interfaces
• A DBMS should provide a variety of user interfaces. These include query languages and
programming language interfaces for application programmers.
• Representing Complex Relationships among Data
• A database can represent a variety of complex relationships among the data and
allows new relationships to be defined as they arise.
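
The data integrity point above can be made concrete with declared constraints that the DBMS checks on every insert. This is a minimal sketch, assuming Python's built-in sqlite3 module; the department and employee tables, the rows and the helper try_insert are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  REAL CHECK (salary >= 0),
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Accounts');
    INSERT INTO employee VALUES (100, 'Asha', 52000, 1);
""")


def try_insert(row):
    """Attempt an insert; return True if accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid row": try_insert((101, "Bilal", 48000, 1)),
    "negative salary": try_insert((102, "Chen", -5, 1)),          # violates CHECK
    "unknown department": try_insert((103, "Dana", 40000, 99)),   # violates foreign key
    "duplicate key": try_insert((100, "Eve", 30000, 1)),          # violates primary key
}
print(results)
```

Only the first row is accepted; the other three are rejected by the DBMS itself, so no application program has to repeat these checks.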

When not to use a DBMS
Main costs of using a DBMS:
- High initial investment in hardware, software and training,
and the possible need for additional hardware.
- Overhead for providing generality, security, recovery, integrity
and concurrency control.
- The generality that a DBMS provides for defining and processing data.

When a DBMS may be unnecessary:
- If the database and applications are simple, well defined, and not
expected to change.
- If there are stringent real-time requirements that may not be met
because of DBMS overhead.
- If access to data by multiple users is not required.
Evolution of database systems

• 1960s: file processing systems (punch cards, paper tape, magnetic tape)
with sequential access and batch processing.
• 1970s: hierarchical and network systems (legacy, some still used today).
Difficulties: data was hard to access (navigational, record-at-a-time
procedures), data independence was limited, and there was no widely accepted
theoretical model (unlike the relational model).
• 1980s: relational systems. E.F. Codd and others developed this theoretically
well-founded model, in which all data is represented in the form of tables.
Examples: Oracle, DB2, Ingres.
• 1990s: object-oriented systems. Because some organisations have to handle
large amounts of both structured and unstructured data, object-relational
databases were also developed.
• 2000 and beyond: multi-tier, client-server and distributed environments,
web-based systems, content-addressable storage, data mining.
Example of a simple database (UNIVERSITY)
Example of a Database (with a Conceptual Data Model)
• Some mini-world relationships:
  - SECTIONs are of specific COURSEs
  - STUDENTs take SECTIONs
  - COURSEs have prerequisite COURSEs
  - INSTRUCTORs teach SECTIONs
  - COURSEs are offered by DEPARTMENTs
  - STUDENTs major in DEPARTMENTs

Note: The above could be expressed in the ENTITY-RELATIONSHIP data model.
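
Some of the mini-world relationships listed above can be sketched as relational tables. This is a minimal sketch, assuming Python's built-in sqlite3 module; it covers only DEPARTMENT, COURSE, SECTION and the prerequisite relationship, and every column name, course number and row is invented for illustration rather than taken from the slide's figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (dept_code TEXT PRIMARY KEY, name TEXT NOT NULL);
    -- COURSEs are offered by DEPARTMENTs
    CREATE TABLE course (
        course_no TEXT PRIMARY KEY,
        title     TEXT NOT NULL,
        dept_code TEXT NOT NULL REFERENCES department(dept_code)
    );
    -- COURSEs have prerequisite COURSEs (COURSE related to itself)
    CREATE TABLE prerequisite (
        course_no TEXT NOT NULL REFERENCES course(course_no),
        prereq_no TEXT NOT NULL REFERENCES course(course_no),
        PRIMARY KEY (course_no, prereq_no)
    );
    -- SECTIONs are of specific COURSEs
    CREATE TABLE section (
        section_id INTEGER PRIMARY KEY,
        course_no  TEXT NOT NULL REFERENCES course(course_no),
        semester   TEXT NOT NULL
    );
    INSERT INTO department VALUES ('CS', 'Computer Science');
    INSERT INTO course VALUES ('CS101', 'Intro to Computing', 'CS'),
                              ('CS201', 'Data Structures', 'CS'),
                              ('CS301', 'Databases', 'CS');
    INSERT INTO prerequisite VALUES ('CS201', 'CS101'), ('CS301', 'CS201');
    INSERT INTO section VALUES (1, 'CS301', 'Fall'), (2, 'CS301', 'Spring');
""")


def prerequisites_of(course_no):
    """Return (course number, title) for each direct prerequisite of a course."""
    return conn.execute(
        """SELECT p.prereq_no, c.title
           FROM prerequisite p JOIN course c ON c.course_no = p.prereq_no
           WHERE p.course_no = ? ORDER BY p.prereq_no""",
        (course_no,),
    ).fetchall()


def sections_of(course_no):
    """Return (section id, semester) for each section of a course."""
    return conn.execute(
        "SELECT section_id, semester FROM section WHERE course_no = ? ORDER BY section_id",
        (course_no,),
    ).fetchall()


print(prerequisites_of("CS301"))
print(sections_of("CS301"))
```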
Role of the Database Administrator (DBA)

1. The DBA administers the three levels of the database and, in consultation
with the overall user community, sets up the definition of the global
view or conceptual level of the database.
2. Mappings between the internal and the conceptual levels, as well as
between the conceptual and external levels, are also defined by the
DBA.
3. The DBA ensures that appropriate measures are in place to maintain the
integrity of the database and that the database is not accessible to
unauthorized users.
4. The DBA is responsible for granting permissions to the users of the
database and stores the profile of each user in the database.
5. The DBA is responsible for defining procedures to recover the database
from failures.
Additional Implications of Using the Database Approach

• Potential for enforcing standards: this is crucial for the success of
database applications in large organizations. Standards refer to data item
names, display formats, screens, report structures, metadata
(description of data), etc.
• Reduced application development time: the incremental time to add each
new application is reduced.
• Flexibility to change data structures: the database structure may evolve as
new requirements are defined.
• Availability of current information: extremely important for on-line
transaction systems such as airline, hotel and car reservations.
• Economies of scale: by consolidating data and applications across
departments, wasteful overlap of resources and personnel can be
avoided.
Extending Database Capabilities

• New functionality is being added to DBMSs in the following areas:
  - Scientific Applications
  - Image Storage and Management
  - Audio and Video data management
  - Data Warehousing technologies
  - Data Mining technologies
  - Distributed database systems
  - Time Series and Historical Data Management

The above gives rise to new research and development in incorporating
new data types, complex data structures, new operations, and storage and
indexing schemes in database systems.
Data Model

• A DBMS provides users with a conceptual representation of data
that does not include many of the details of how the data is
stored or how the operations are implemented.
• Informally, a data model is a type of data abstraction that is used
to provide this conceptual representation. The data model uses
logical concepts, such as objects, their properties, and their
interrelationships, that may be easier for most users to
understand than computer storage concepts.
• Hence, the data model hides storage and implementation details
that are not of interest to most database users.
Data Models

• Data Model: A set of concepts to describe the structure of a database, and certain constraints
that the database should obey.
• Data Model Structure and Constraints:
• Constructs are used to define the database structure
• Constructs typically include elements (and their data types) as well as groups of elements
(e.g. entity, record, table), and relationships among such groups
• Constraints specify some restrictions on valid data; these constraints must be enforced at
all times
• Data Model Operations: Operations for specifying database retrievals and updates by
referring to the concepts of the data model. Operations on the data model may include basic
operations and user-defined operations.
• By structure of a database, we mean the data types, relationships, and constraints that should
hold for the data.
Categories of data models

1. Conceptual (high-level, semantic) data models: Provide concepts that are close to the
way many users perceive data. (Also called entity-based or object-based data models.)
2. Physical (low-level, internal) data models: Provide concepts that describe details of how
data is stored in the computer.
3. Implementation (representational) data models: Provide concepts that fall between the
above two, balancing user views with some computer storage details.
• Conceptual data models use concepts such as entities, attributes, and relationships.
• An entity represents a real-world object or concept, such as an employee or a project,
that is described in the database.
• An attribute represents some property of interest that further describes an entity, such
as the employee's name or salary.
• A relationship represents an association among two or more entities, for
example, a works-on relationship between an employee and a project.
History of Data Models

• Relational Data Model: proposed in 1970 by E.F. Codd (IBM); first
commercial systems in 1981-82. Now in several commercial products
(DB2, ORACLE, SQL Server, SYBASE, INFORMIX).
• Network Data Model: the first network DBMS was implemented by Honeywell
in 1964-65 (IDS System). Adopted heavily due to the support by
CODASYL (CODASYL - DBTG report of 1971). Later implemented in a
large variety of systems: IDMS (Cullinet, now CA), DMS 1100 (Unisys),
IMAGE (H.P.), VAX-DBMS (Digital Equipment Corp.).
• Hierarchical Data Model: implemented in a joint effort by IBM and
North American Rockwell around 1965. Resulted in the IMS family of
systems. Historically the most popular model. Other systems based on this
model: System 2k (SAS Inc.).
History of Data Models

• Object-oriented Data Model(s): several models have been proposed for
implementation in a database system. One set comprises models of persistent
O-O programming languages such as C++ (e.g., in OBJECTSTORE or VERSANT) and
Smalltalk (e.g., in GEMSTONE). Other systems include O2, ORION
(at MCC, then ITASCA) and IRIS (at H.P., used in Open OODB).
• Object-Relational Models: the most recent trend. Started with Informix
Universal Server. Exemplified in recent versions of Oracle (e.g., Oracle 10g),
DB2, SQL Server and other systems.
Network Model

• The first network DBMS was implemented by Honeywell in
1964-65 (IDS System).
• Adopted heavily due to the support by CODASYL (Conference
on Data Systems Languages).
• Later implemented in a large variety of systems: IDMS
(Cullinet), DMS 1100 (Unisys), IMAGE (Hewlett-Packard),
VAX-DBMS (Digital Equipment Corp).
Example of Network Model Schema
Network Model - Pros & Cons

ADVANTAGES:
• Network Model is able to model complex relationships and represents
semantics of add/delete on the relationships.
• Can handle most situations for modeling using record types and relationship
types.
• Language is navigational; uses constructs like FIND, FIND member, FIND
owner, FIND NEXT within set, GET etc.
• Programmers can do optimal navigation through the database.

DISADVANTAGES:
• Navigational and procedural nature of processing
• Database contains a complex array of pointers that thread through a set of
records.
• Little scope for automated "query optimization"
Hierarchical Model

• Initially implemented in a joint effort by IBM and North
American Rockwell around 1965. Resulted in the IMS family
of systems.
• IBM’s IMS product had (and still has) a very large customer
base worldwide.
• The hierarchical model was formalized based on the IMS system.
• Other systems based on this model: System 2k (SAS Inc.).
Hierarchical Model – Pros & Cons

• ADVANTAGES:
• Hierarchical Model is simple to construct and operate on
• Corresponds to a number of natural hierarchically organized domains - e.g.,
assemblies in manufacturing, personnel organization in companies
• Language is simple :
• uses constructs like GET, GET UNIQUE, GET NEXT, GET NEXT WITHIN PARENT etc.

• DISADVANTAGES:
• Navigational and procedural nature of processing
• Database is visualized as a linear arrangement of records
• Little scope for "query optimization"
Relational Data Model

• Proposed in 1970 by E.F. Codd (IBM); first commercial systems
in 1981-82.
• Now in several commercial products (e.g. DB2, ORACLE, MS
SQL Server, SYBASE, INFORMIX, MS Access).
• Several free open-source implementations, e.g. MySQL,
PostgreSQL.
• Currently the most dominant model for developing database
applications.
• SQL relational standards: SQL-89 (SQL1), SQL-92 (SQL2),
SQL-99 (SQL3), etc.
Object-Oriented Data Model

• Several models have been proposed for implementation in a
database system.
• One set comprises models of persistent O-O programming
languages such as C++ (e.g., in OBJECTSTORE) and Smalltalk
(e.g., in GEMSTONE).
• Object Database Standard: ODMG-93, ODMG version 2.0,
ODMG version 3.0.
Object-Relational Data Model

• Most recent trend. Started with Informix Universal Server.
• Relational systems incorporate concepts from object
databases, leading to object-relational systems.
• Exemplified in recent versions of Oracle (e.g., Oracle 10g), DB2,
SQL Server and other DBMSs.
• Standards are included in SQL-99 and expected to be enhanced
in future SQL standards.
Database schemas versus Database instances

• In any data model, it is important to distinguish between the description
of the database and the database itself.
• Database Schema: The description of a database. Includes descriptions of
the database structure and the constraints that should hold on the
database. It is specified during database design and is not expected
to change frequently.
• Schema Diagram: A diagrammatic display of (some aspects of) a
database schema.
• Schema Construct: A component of the schema or an object within the
schema, e.g., STUDENT, COURSE.
• Database Instance: The actual data stored in a database at a particular
moment in time. Also called database state (or occurrence/snapshot).
Database Schema Vs. Database State
• Database State:
  - Refers to the content of a database at a moment in time.
  - The actual data stored in a database at a particular moment in time. This includes
    the collection of all the data in the database.
  - Also called database instance (or occurrence or snapshot).
  - The term instance is also applied to individual database components, e.g. record
    instance, table instance, entity instance.
• Initial Database State: Refers to the database when it is initially loaded into the
system.
• Valid State: A state that satisfies the structure and constraints of the database.
• Distinction:
  - The database schema changes very infrequently.
  - The database state changes every time the database is updated.
  - The schema is also called the intension, whereas the state is called the extension.
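
The distinction between schema and state can be observed directly. This is a minimal sketch, assuming Python's built-in sqlite3 module, where SQLite keeps the schema in its sqlite_master catalog table; the student table, its rows and the helper names schema and state are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def schema(conn):
    """The schema (intension): the stored table definitions, not the data."""
    return [row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name")]


def state(conn):
    """The state (extension): the rows stored at this moment."""
    return conn.execute("SELECT id, name FROM student ORDER BY id").fetchall()


schema_before = schema(conn)
empty_state = state(conn)  # before any data is loaded

conn.execute("INSERT INTO student VALUES (1, 'Asha')")
state_after_first = state(conn)

conn.execute("INSERT INTO student VALUES (2, 'Bilal')")
state_after_second = state(conn)

schema_after = schema(conn)

# Every update produced a new state; the schema stayed the same throughout.
print(schema_before == schema_after, empty_state, state_after_first, state_after_second)
```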
Database schema - Example

Fig : Schema diagram sample

Example of a Database State

Fig : A database state
Three-Schema Architecture

• Proposed to support the DBMS characteristics of:
  - Program-data independence.
  - Support of multiple views of the data.
• Not explicitly used in commercial DBMS products, but has been useful
in explaining database system organization.
• Defines DBMS schemas at three levels:
  - Internal schema at the internal level to describe physical storage structures and
    access paths. Typically uses a physical data model.
  - Conceptual schema at the conceptual level to describe the structure and
    constraints for the whole database for a community of users. Uses a conceptual or
    an implementation data model.
  - External schemas at the external level to describe the various user views. Usually
    use the same data model as the conceptual level.
Three-Schema Architecture

• Mappings among schema levels are needed to
transform requests and data.
  - Programs refer to an external schema, and are mapped by
    the DBMS to the internal schema for execution.
  - Data extracted from the internal DBMS level is reformatted
    to match the user’s external view (e.g. formatting the
    results of an SQL query for display in a web page).
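
In a relational DBMS, a view is one way to approximate an external schema defined over the conceptual schema. This is a minimal sketch, assuming Python's built-in sqlite3 module; the employee table, the staff_directory view and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: the full description of EMPLOYEE for the whole organization.
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
    INSERT INTO employee VALUES (1, 'Asha', 'Accounts', 52000),
                                (2, 'Bilal', 'Sales', 48000);
    -- External level: a user view that leaves out the salary attribute.
    CREATE VIEW staff_directory AS SELECT name, dept FROM employee;
""")

# The program refers only to the external view; the DBMS maps the request
# down to the stored table.
cursor = conn.execute("SELECT * FROM staff_directory ORDER BY name")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()

# Reformat the extracted data for presentation, e.g. as lines of a web page.
lines = [f"{name} ({dept})" for name, dept in view_rows]
print(view_columns, lines)
```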
Data Independence
• Data independence can be explained using the three-schema architecture.
• Data independence is the ability to modify the schema at one level of the
database system without altering the schema at the next higher level.
• There are two types of data independence:

• Logical Data Independence
  - Logical data independence is the ability to change the conceptual schema
    without having to change the external schemas.
  - It separates the external level from the conceptual level.
  - If we change the conceptual view of the data, the user view of the data
    is not affected.
  - Logical data independence occurs at the user interface level.

• Physical Data Independence
  - Physical data independence is the capacity to change the internal schema
    without having to change the conceptual schema.
  - If we change how the data is stored, for example the storage size of the
    database server, the conceptual structure of the database is not affected.
  - It separates the conceptual level from the internal level.
  - Physical data independence occurs at the logical interface level.
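
Both kinds of independence can be tried out on a small database. This is a minimal sketch, assuming Python's built-in sqlite3 module; an index stands in for a change to the internal schema and an added column for a change to the conceptual schema. The table, view, index and query names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    INSERT INTO employee VALUES (1, 'Asha', 'Accounts'), (2, 'Bilal', 'Sales');
    CREATE VIEW staff_directory AS SELECT name, dept FROM employee;
""")

APP_QUERY = "SELECT name FROM employee WHERE dept = 'Sales'"
VIEW_QUERY = "SELECT name, dept FROM staff_directory ORDER BY name"

before_app = conn.execute(APP_QUERY).fetchall()
before_view = conn.execute(VIEW_QUERY).fetchall()

# Physical data independence: add an access path (internal level).
# The table definition and the application query are untouched.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")
after_index = conn.execute(APP_QUERY).fetchall()

# Logical data independence: add an attribute (conceptual level).
# The external view and the query against it are untouched.
conn.execute("ALTER TABLE employee ADD COLUMN salary REAL")
after_new_column = conn.execute(VIEW_QUERY).fetchall()

print(before_app == after_index, before_view == after_new_column)
```

The same two queries return the same results before and after each change, which is what the two definitions require.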
Database Languages
• A DBMS provides appropriate languages and interfaces to express database queries and
updates. Database languages are used to read, store and update the data in the database.
1. Data Definition Language (DDL):
• DDL is used to define the database structure or pattern. It is used to create
schemas, tables, indexes, constraints, etc. in the database.
• Using DDL statements, you can create the skeleton of the database.
• DDL statements record metadata such as the number of tables and schemas, their
names, the indexes, the columns in each table, constraints, etc.
2. Data Manipulation Language (DML):
• DML is used for accessing and manipulating data in a database. It handles user
requests.
  - Select: retrieves data from a database.
  - Insert: inserts data into a table.
  - Update: updates existing data within a table.
  - Delete: deletes records from a table: those that match the given condition, or
    all records if no condition is given.
3. Data Control Language (DCL):
• DCL is used to control access to the stored data.
  - Grant: gives a user access privileges to a database.
  - Revoke: takes back permissions from a user.
4. Transaction Control Language (TCL):
• TCL is used to manage the changes made by DML statements. DML statements can be
grouped into a logical transaction.
  - Commit: saves the changes made in the transaction to the database.
  - Rollback: restores the database to its state as of the last Commit.
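
The DDL, DML and TCL statements above can be tried out in a few lines. This is a minimal sketch, assuming Python's built-in sqlite3 module with its default transaction handling; the account table and its rows are invented for illustration. DCL is left out because SQLite has no user accounts and does not implement GRANT or REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure (the skeleton of the database).
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT NOT NULL, balance REAL NOT NULL)"
)

# DML (Insert) followed by TCL (Commit) to make the rows permanent.
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, "Asha", 100.0), (2, "Bilal", 50.0), (3, "Chen", 0.0)],
)
conn.commit()

# DML (Update, Delete) grouped into one logical transaction.
conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
conn.execute("DELETE FROM account WHERE balance = 0")  # only rows matching the condition
conn.commit()
after_commit = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()

# A Delete without a condition removes every row, but only within the open transaction.
conn.execute("DELETE FROM account")
rows_during = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]

# TCL (Rollback): return to the state as of the last Commit.
conn.rollback()
after_rollback = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()

print(after_commit, rows_during, after_rollback)
```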
Classification of DBMSs

• Based on the data model used
  - Traditional: Relational, Network, Hierarchical.
  - Emerging: Object-oriented, Object-relational.
• Other classifications
  - Single-user (typically used with personal computers)
    vs. multi-user (most DBMSs).
  - Centralized (uses a single computer with one database)
    vs. distributed (uses multiple computers, multiple databases).
Cost considerations for DBMSs

• Cost range: from free open-source systems to
configurations costing millions of dollars.
• Examples of free relational DBMSs: MySQL, PostgreSQL,
and others.
• Commercial DBMSs offer additional specialized modules,
e.g. time-series module, spatial data module, document
module, XML module.
• Different licensing options: site license, maximum
number of concurrent users (seat license), single user,
etc.
