
118.721 Analysis and Interpretation of Animal Health Data
Module 2 - Design and use of a database
Patrick Biggs1 & Tim-Hinnerk Heuer2
1Infectious Disease Research Centre, IVABS, Massey University
2Landcare Research, Palmerston North

Introductions:
3 day course covering the design & use of a database
With focus on PostgreSQL

Course taught by Patrick Biggs and Tim-Hinnerk Heuer, with technical support from Simon Verschaffelt

Day 1
Introduction
Why not use Excel?
History of databases
PostgreSQL as a database
Terminology

Database design
ERD (entity relationship diagrams)

Day 1 (contd)
Discussion of software environment that we will use (THH)
Data normalisation
Working with test data (THH)
The path from data in Excel to PostgreSQL
Reading data into the database

Day 2
Data types
Null values

Data summaries
Basic queries

Data validation
Missing values, spelling errors

Day 3
More queries
Data export
Data backup
Talking to other software
R, Perl, PHP
Extensions e.g. PostGIS

Discussion and recap

Day 1

Why not use Excel?


New versions can accommodate truly vast data spaces
Up to 2^20 rows (1,048,576) and 2^14 columns (16,384) in Excel 2008
2^34 cells (1.7179 x 10^10)
That's a lot of data!
If your computer can handle it!

http://jasonlbaptiste.com/startups/microsoftexcel-is-the-worlds-most-used-database/

MS Access - the Microsoft database


http://office.microsoft.com/en-us/accesshelp/using-access-or-excel-to-manage-yourdata-HA010210195.aspx
Each has its place, as do other data types
Discuss this in a while

History of PostgreSQL (1)


http://www.postgresql.org/
Free and open source software
Huge community of users and developers
Following key points directly from http://www.postgresql.org/about/history/
PostgreSQL, originally called Postgres, was created at UCB by a computer science professor named Michael Stonebraker
Stonebraker started Postgres as a follow-up project to its predecessor, Ingres
Postgres thus plays off of its predecessor (as in "after Ingres")
Ingres (1977 to 1985) was an exercise in creating a database system according to classic RDBMS theory
Postgres (1986-1994) was a project meant to break new ground in database concepts such as exploration of "object relational" technologies

History of PostgreSQL (2)


1995 - two Ph.D. students from Stonebraker's lab, Andrew Yu and Jolly Chen, replaced Postgres' POSTQUEL query language with an extended subset of SQL. They renamed the system to Postgres95.
1996 - Postgres95 departed from academia and started a new life in the open source world when a group of dedicated developers outside of Berkeley saw the promise of the system, and devoted themselves to its continued development.
Contributing enormous amounts of time, skill, labor, and technical expertise, this global development group radically transformed Postgres. Over the next eight years, they brought consistency and uniformity to the code base and renamed it to PostgreSQL.

History of PostgreSQL (3)


PostgreSQL began at version 6.0 (Jan 1997)
Giving credit to its many years of prior development
With the help of hundreds of developers from around the world, the system was changed and improved in almost every area
Over the next four years (versions 6.0 - 7.0), major improvements were made across the system

Version 7.0 released in May 2000
Major changes introduced

Version 8.0 released in Jan 2005
Debut into enterprise market
More advanced features, such as Java procedures and a native Windows port

Version 9.0 released in Sep 2010
We are using version 9.1 on this course

Key terms and concepts


Database
A repository for data (stored information) and metadata (information about the structure)

Relational database
Special type of database using tables

Relationships
Tables linked together by relationships
A few types exist

Normalization
Process of simplifying the data structure
Whole science, with many forms (see later)

Database model
Plan or concept for data storage using an entity relationship diagram (ERD)

Brief history of databases (1)


Over 60 years old
File systems (pre-1950s)
Flat files are files where data is simply dumped with no structure
Flat files are stored within an OS
Searching for data has to be programmed

Hierarchical database models (1950s)
Inverted tree-like structure
Tables have parent-child relationship
Each child table has single parent table
Each parent table can have multiple child tables
Child tables dependent on parent tables
One-to-many relationships
[Diagram: hierarchical model linking the tables company, department, manager, employee, project and task]

Brief history of databases (2)


Network database models (1960s)
Refinement of hierarchical model
Child tables can have more than one parent
Allows for many-to-many relationships
E.g. many-to-many between employees & tasks
This is bad database practice

Employee can be assigned many tasks
Task can be assigned to many employees
Requirement for new table (assignment) to resolve this dilemma by creating unique definition for employee and task combinations
[Diagram: network model with the tables company, department, manager, employee, employee type, project, task and assignment]

Brief history of databases (3)


Relational database model (1970s)
Refinement of hierarchical model
Tables can be accessed directly without accessing parent tables
Tables can be linked regardless of hierarchical position
[Diagram: relational model with the tables company, department, employee, employee type, project, task and assignment; one-to-many relationships are marked, and the many-to-many relationship is shown as resolved]

Brief history of databases (4)


Object database models (1980s)
3D structure where any DB item can be retrieved rapidly
Performs poorly when retrieving more than 1 item
RDB much better for this
Does get around RDB need for many-to-many relationship tables
However generally more complex, relying on object-based programming
Object structure fundamentally different to relational structure

Whilst it may have its problems & limitations, RDB is the best general purpose database for most situations

Terminology wrt PostgreSQL


Key terms to understand:
Schema
Database
Command
Query
Table
Column
Row
Composite type
Domain
View
Client/server
Client
Server
Postmaster
Transaction
Commit
Rollback
Index
Tablespace
Result set

Terminology - some degree of detail


Schema
Named collection of tables
In PostgreSQL can also contain views, indices, sequences, data types, operators and functions
Aka a catalogue

Database (db)
Named collection of schemas
When client application connects to server, db name is specified
Client cannot interact with more than 1 db per connection, but can open any number of connections to access multiple db

Terminology - some degree of detail


Command
String that gets sent to the server to make it do something
Aka statement

Query
Command type that retrieves data from the server

Table
Collection of rows
Usually named, but can exist temporarily
All rows have the same shape
Every row in table contains same columns

Terminology - some degree of detail


Column
Smallest storage unit inside a db
1 piece of information about an object
Has name and data type
Columns grouped into rows, and rows into tables

Row
Collection of column values
Every row has same shape
Same set of columns
E.g. if you run a car dealership, and you have a vehicles table, each row is an individual car

Terminology - some degree of detail


Composite type
From PostgreSQL v 8.0+, data types can be made that are composed of multiple values

Domain
Named specialisation of another datatype
Can be included in >1 table

View
Alternate way to present a table
A virtual table
Not storing data, just creating a way of looking at existing data
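
A view as a virtual table can be sketched as follows. This is a minimal illustration, not course material: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the vehicles table, its columns and the toyotas view are hypothetical names chosen for the example.

```python
import sqlite3

# SQLite (Python standard library) stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (vin TEXT PRIMARY KEY, make TEXT, price REAL)")
conn.executemany(
    "INSERT INTO vehicles VALUES (?, ?, ?)",
    [("V1", "Toyota", 12000.0), ("V2", "Ford", 8000.0), ("V3", "Toyota", 30000.0)],
)

# The view stores no rows of its own; it is a saved query over vehicles.
conn.execute(
    "CREATE VIEW toyotas AS SELECT vin, price FROM vehicles WHERE make = 'Toyota'"
)
before = conn.execute("SELECT vin FROM toyotas ORDER BY vin").fetchall()

# A change to the underlying table shows up through the view straight away.
conn.execute("INSERT INTO vehicles VALUES ('V4', 'Toyota', 15000.0)")
after = conn.execute("SELECT vin FROM toyotas ORDER BY vin").fetchall()
```

The same CREATE VIEW statement should also be accepted by PostgreSQL, but that has not been checked against the course's version 9.1 setup.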

Terminology - some degree of detail


Client/server
Kind of architecture using 2 programs: client and server
Can be on same host or not, as long as they are connected by a network
Server
Offers a service - to store, retrieve and change data
Has no UI; can't talk to it directly, need to use a client

Client
Connects to a server via a postmaster
Asks server to perform work

Terminology - some degree of detail


Postmaster
Listens for connections from client and creates new server process

Transaction
Collection of db operations treated as a unit: all or nothing
PostgreSQL ensures all complete, or none do
Usually starts with a BEGIN command, and ends with COMMIT or ROLLBACK

Commit
Successful end of a transaction
Changes are made permanent

Terminology - some degree of detail


Rollback
Unsuccessful end of transaction
Discards any changes made since start of transaction

Index
Data structure to help reduce time to find data
Absolutely vital to performance optimisation
Especially with large and/or complex db

Tablespace
Alternative storage location for tables and indices

Result set
All data you get back from db after issuing a query
May be empty
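
The transaction, commit, rollback and result set terms above can be tried out with a short sketch. It assumes Python's built-in sqlite3 module in place of PostgreSQL; the accounts table and the transfer function are hypothetical, and the failure is simulated.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK commands below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")


def transfer(amount, fail=False):
    """Move amount from account a to account b as one all-or-nothing unit."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'a'", (amount,)
        )
        if fail:
            raise RuntimeError("simulated failure half-way through")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = 'b'", (amount,)
        )
        conn.execute("COMMIT")    # successful end: both updates become permanent
    except RuntimeError:
        conn.execute("ROLLBACK")  # unsuccessful end: the first update is undone


transfer(30)             # committed
transfer(50, fail=True)  # rolled back, balances are left unchanged

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# A query that matches nothing still returns a result set; it is just empty.
empty = conn.execute("SELECT name FROM accounts WHERE balance > 1000").fetchall()
```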

Structural Terminology - RDBMS


Relational DataBase Management System
DB: database
Data organised into tables
Table organised into rows and columns
Each row in a table is a record
Records contain several pieces of information; each column in a table is one of those pieces

MS: management system
Software that lets data be inserted, retrieved, modified and deleted

R: relational
Data stored in separate tables can be linked by looking for common elements
Allows data to be pulled quickly and questions answered from data stored in more than one table

Database design
Going to take an overview approach
Aim is to get you being able to design, build and then use a database in your own work
Guiding principles and procedures are the key here
If it's not 100.0% maximised for performance, or fully normalised, that's not the end of the world
This is not Amazon.com, for example, where queries taking a few milliseconds longer mount up on a daily basis

Database design - goals


Create an effective database
What should it do? What makes it useful? What problems can it solve?

Want to get data out quickly, reliably & consistently
Want to be free of incorrect and/or contradictory data

Software design - small diversion

Coding is fun, design isn't!!!
Let's get straight to coding (i.e. setting up the db)...
NO. Let's think about what we are going to do
Remember GIGO (garbage in, garbage out)
Is your data a true representation of what you want to accomplish?
Will a bad model do all you want it to do? It will result in a bad db

Information containers
Database
A repository for data (stored information) and metadata (information about the structure)
Can create, read, update and delete data in some manner

Broad definition, not just computers here:
a filing cabinet
a notebook
your brain

None of these things are perfect
Have strengths and weaknesses as databases

Desirable features of our db


Lots of them:
CRUD
Retrieval
Consistency
Validity
Easy Error Correction
Speed
Atomic Transactions
ACID
Persistence and Backups
Low Cost and Extensibility
Ease of Use
Portability
Security
Sharing
Ability to Perform Complex Calculations

Features (1)
CRUD
The 4 things any db should provide
Create, Read, Update and Delete
If it has CRUD, it is a db
E.g. a plane's black box flight recorder cannot (rightfully so) modify the data, so it's not a db

Retrieval (R in CRUD)
Aka read: you should be able to find
all data in the db
all data in specific ways

Data structure allows quick retrieval
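
The four CRUD operations map directly onto four SQL commands. The sketch below assumes Python's built-in sqlite3 module rather than PostgreSQL, and the animals table with its tag, species and weight_kg columns is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE animals (tag TEXT PRIMARY KEY, species TEXT, weight_kg REAL)"
)

# Create: add new rows
conn.execute("INSERT INTO animals VALUES ('NZ001', 'sheep', 61.5)")
conn.execute("INSERT INTO animals VALUES ('NZ002', 'cattle', 540.0)")

# Read: retrieve rows in a specific way
sheep = conn.execute(
    "SELECT tag, weight_kg FROM animals WHERE species = 'sheep'"
).fetchall()

# Update: change an existing row
conn.execute("UPDATE animals SET weight_kg = 63.0 WHERE tag = 'NZ001'")

# Delete: remove a row
conn.execute("DELETE FROM animals WHERE tag = 'NZ002'")

remaining = conn.execute("SELECT tag, weight_kg FROM animals").fetchall()
```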

Features (2)
Consistency (part of R in CRUD)
Performing the same search now and in the future on exactly the same dataset gives the same result
Db design has a role here
Where is your data stored? Is it stored in >1 place in the db?

Validity
Data is validated against other information in the db
E.g. if you have a date field it should not be able to accept "the sky is blue"
"Old" is not a valid age
"Lots" is not a valid quantity

Features (3)
Easy Error Correction
Ability to update data throughout system
E.g. a building supply company db has had entries for "duck tape" listed, when they should be "duct tape"

Need good design for this to work well
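
Why design matters for error correction can be shown with the duct tape example. This is a sketch under assumptions: sqlite3 stands in for PostgreSQL, and the products and order_lines tables are hypothetical. Because the product name is stored once and order lines only refer to it by number, a single UPDATE corrects the spelling everywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE order_lines ("
    " order_id INTEGER,"
    " product_id INTEGER REFERENCES products (product_id),"
    " quantity INTEGER)"
)
conn.execute("INSERT INTO products VALUES (1, 'duck tape')")
conn.executemany(
    "INSERT INTO order_lines VALUES (?, 1, ?)", [(100, 2), (101, 5), (102, 1)]
)

# The misspelling lives in exactly one row, so one UPDATE fixes every order.
cursor = conn.execute(
    "UPDATE products SET name = 'duct tape' WHERE name = 'duck tape'"
)
rows_changed = cursor.rowcount

names_on_orders = conn.execute(
    "SELECT DISTINCT p.name FROM order_lines o"
    " JOIN products p ON p.product_id = o.product_id"
).fetchall()
```

Had the name been typed into every order line instead, each copy would need finding and fixing separately.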

Speed
Good design plays role in db efficiency

Features (4)
Atomic Transactions
A possibly complex series of actions that is considered as a single operation by those not involved directly in performing the transaction
Transactions cannot half happen
Important for the R and U parts of CRUD

Features (5)
ACID
4 features that an effective transaction system should provide:
Atomicity
Already mentioned this

Consistency
Db is in a consistent state before and after transaction happened

Isolation
Transaction isolates details from everyone except person making it

Durability
Once complete, it does not disappear, and is final

Features (6)
Persistence and Backups
Data must be persistent, i.e. not disappear without warning
Easy to back up a computer db, harder for a physical one e.g. a notebook
Most db systems allow easy backup procedures
You need to remember to do them though!

Low Cost and Extensibility


Easy to obtain and install, inexpensive (i.e. free)
Extensible, aka scalable
Can your db tolerate the change from a thousand to a million rows, or even more?

Features (7)
Ease of use
How good is your UI?
We will use one later on for exactly this purpose

Portability
If the db is accessible on the net, in principle you can see it from anywhere

Security
See above: if you can see your db from anywhere, so can others, maybe with worse intentions for your data than you
Within the db, can apply security conditions on tables etc.

Features (8)
Sharing
Can many people use the db simultaneously without there being problems?
Transactional control important here

Good db design ensures data compartmentalized properly for sharing data too

Ability to Perform Complex Calculations


Says it all really
Want to be able to do these very quickly at times
Element of good design here as well

Types of db (1)
Focus on relational db here, but there are other types:
Flat files
INI files
Windows System Registry
Relational db
Spreadsheets
Hierarchical databases
XML
Network
Object
Object-Relational
Exotic

Types of db (2)
Again, strengths and weaknesses to each type
Mention a few here, not all of them
Flat files
Have their use as a place for data storage
Not sophisticated though, e.g. will not handle hierarchical data, or allow you to do sophisticated searches

Spreadsheets
Already mentioned to some degree
Can't do complex queries or check data integrity without writing extra code etc.

Types of db (3)
Hierarchical
Data in a tree-like structure e.g. files and folders on a computer hard drive
Lots of data takes this form:
E.g. family trees
Parts lists in complicated objects such as a computer

XML
Language for storing, transferring and retrieving hierarchical data only; nothing else happens with it

3 points of view on relational db


Db-theoretical
Mathematical and rigorous but can be intimidating

build the db and get it done


Less precise, but more intuitive
Use familiar terms e.g. row and table
Most common practice now

Files and structure


More concerned with the files and how it works on the computer than anything else
Bit old school these days

Db fundamentals
Tables, rows and columns
Relations, attributes and tuples
Keys
Indexes/indices
Constraints

Tables, rows and columns


Informally, the db is a collection of tables
Set of values allowed for a column is called the domain
E.g. telephone numbers, shoe sizes, car colours
Similar to data type (the kind of data a column can hold) but not quite

Relations, attributes and tuples


Formal term for a table is a relation
All values in a row in a table are related as they apply to one thing

Formal name for a column is attribute
Formal name for a row is tuple
Sounds like scruple
A two attribute relation holds pairs, a three attribute relation holds triples
A six attribute relation holds 6-tuples

Keys (1)
Loosely, a key is a combination of 1 or more columns that can be used to find rows in a table
Again multiple types exist
Key
Set of 1 or more columns that have certain properties

Compound key
Key that includes more than 1 column

Superkey
Set of 1 or more columns for which no 2 rows can have the exact same values
As they define fields that must be unique within a table, sometimes called unique keys

Keys (2)
Candidate key
Minimal superkey
Remove any column from it and it is no longer a superkey

Unique key
A superkey used to uniquely identify rows in table
Similar to candidate key, except in use
Candidate key could be used to identify data, whereas unique key is used to constrain data
Implementation issue, not a theoretical one like a candidate key
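
The difference between a superkey and a candidate key can be checked mechanically on a small table. The sketch below is plain Python with hypothetical employee data and helper names. One caveat is labelled in the code: it only tests the rows it is given, whereas a real key is a statement about every row the table is ever allowed to hold.

```python
from itertools import combinations

# Hypothetical rows; column names are illustrative only.
rows = [
    {"employee_id": 1, "email": "a@example.org", "surname": "Smith"},
    {"employee_id": 2, "email": "b@example.org", "surname": "Smith"},
    {"employee_id": 3, "email": "c@example.org", "surname": "Jones"},
]


def is_superkey(rows, columns):
    """True if no two of the given rows share the same values in columns."""
    seen = {tuple(row[c] for c in columns) for row in rows}
    return len(seen) == len(rows)


def is_candidate_key(rows, columns):
    """A candidate key is a superkey with no smaller superkey inside it."""
    if not is_superkey(rows, columns):
        return False
    return not any(
        is_superkey(rows, subset)
        for size in range(1, len(columns))
        for subset in combinations(columns, size)
    )
```

Here (employee_id, surname) is a superkey but not a candidate key, because employee_id alone already identifies every row.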

Keys (3)
Primary key
A superkey that is used to identify rows in a table
A table has only 1 primary key
Again more for implementation than theory
Very good way to find data quickly

Foreign key
Used as a constraint rather than to find records in a table (later)

Indexes/indices
Db structure that makes it easier to find records based on values in 1 or more fields
Index != key
Do not use indices gratuitously
They require db overhead
This can be significant on big tables (i.e. million plus rows)

Only use them on fields that you are most likely to search on, e.g. not house number
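
Whether a query actually uses an index can be inspected by asking the database for its query plan. The sketch below assumes SQLite via Python's sqlite3 module, whose EXPLAIN QUERY PLAN command differs from PostgreSQL's EXPLAIN in output format; the people table and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " person_id INTEGER PRIMARY KEY, surname TEXT, house_number INTEGER)"
)
conn.executemany(
    "INSERT INTO people (surname, house_number) VALUES (?, ?)",
    [("Smith", 12), ("Jones", 7), ("Ngata", 3)],
)

QUERY = "EXPLAIN QUERY PLAN SELECT * FROM people WHERE surname = 'Jones'"


def plan(connection):
    """Return the human-readable plan text (last column of each plan row)."""
    return " ".join(row[-1] for row in connection.execute(QUERY))


plan_without_index = plan(conn)  # every row has to be scanned

# Index the field that is searched on most often, here surname.
conn.execute("CREATE INDEX idx_people_surname ON people (surname)")
plan_with_index = plan(conn)     # the plan now names the index
```

No index is created on house_number, in line with the advice to index only the fields most likely to be searched.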

Constraints (1)
A constraint places restrictions on data in a table
Formally not part of the db
Practically they are, as they play a role in data management

Again multiple types

Basic constraints
E.g. can make a field required, so that all rows have to hold a value
Compare to NULL (later)

Prevent a field from holding a value that does not match its data type
E.g. in an integer field "ten" is not allowed
E.g. in a column allowing a 10 character string, "deoxyribose" is not allowed

Constraints (2)
Check constraints
Boolean expression to see if certain data allowed
If true it is allowed

Field-level check constraint validates a single column
E.g. to be loaded, the constraint Salary > 0 must be met

Table-level check constraint validates more than a single column
E.g. to be loaded, the constraints Salary > 0 and Commission > 0 must both be met
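
The two salary examples above can be written as actual check constraints. This is a sketch using Python's sqlite3 module as a stand-in for PostgreSQL; the staff table and the try_insert helper are hypothetical. Note that in SQL a check whose expression evaluates to NULL is treated as passing, so the example avoids NULL values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE staff (
        name       TEXT NOT NULL,                  -- basic constraint: required
        salary     REAL CHECK (salary > 0),        -- field-level check
        commission REAL,
        CHECK (salary > 0 AND commission > 0)      -- table-level check
    )
    """
)


def try_insert(name, salary, commission):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO staff VALUES (?, ?, ?)", (name, salary, commission)
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("Ana", 50000, 1000)
bad_salary = try_insert("Bob", -1, 1000)     # fails the field-level check
bad_commission = try_insert("Cy", 50000, 0)  # fails the table-level check
row_count = conn.execute("SELECT COUNT(*) FROM staff").fetchone()[0]
```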

Constraints (3)
Primary key constraints
By definition, no 2 records can have identical values for the field that defines the table's primary key

Unique constraints
Values in 1 or more fields must be unique

Foreign key (FK) constraints
Different: refers to a key in another table
Db uses it to locate records in another table
Defines a reference from 1 table to another; this is also called a referential integrity constraint
Data values must match in each table

FKs define associations between tables called relationships or links
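
A foreign key constraint in action, as a minimal sketch: sqlite3 again stands in for PostgreSQL, and the customers and orders tables are hypothetical. SQLite only enforces foreign keys after the PRAGMA shown below, whereas PostgreSQL enforces them without any such switch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Example Farm Supplies')")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (11, 99)")  # there is no customer 99
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```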

Db operations (1)
8 operations originally defined for relational db at their inception:
Selection
Select some or all rows in a table

Projection
Drops columns from a table

Union
Combines tables and removes duplicate rows

Intersection
Finds the records that are same in 2 tables
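
The first four operations have direct SQL counterparts. The sketch below assumes Python's sqlite3 module in place of PostgreSQL; the farm_a and farm_b tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE farm_a (tag TEXT, species TEXT)")
conn.execute("CREATE TABLE farm_b (tag TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO farm_a VALUES (?, ?)",
    [("T1", "sheep"), ("T2", "cattle"), ("T3", "sheep")],
)
conn.executemany(
    "INSERT INTO farm_b VALUES (?, ?)", [("T3", "sheep"), ("T4", "deer")]
)


def q(sql):
    return conn.execute(sql).fetchall()


# Selection: keep some of the rows (WHERE)
selection = q("SELECT tag, species FROM farm_a WHERE species = 'sheep' ORDER BY tag")

# Projection: keep some of the columns (the column list after SELECT)
projection = q("SELECT DISTINCT species FROM farm_a ORDER BY species")

# Union: rows from both tables, duplicates removed
union = q(
    "SELECT tag, species FROM farm_a UNION "
    "SELECT tag, species FROM farm_b ORDER BY tag"
)

# Intersection: rows present in both tables
intersection = q(
    "SELECT tag, species FROM farm_a INTERSECT SELECT tag, species FROM farm_b"
)
```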

Db operations (2)
Difference
Records present in 1 table not in the other

Cartesian Product
A new table created by combining every record in the first table with every record in the second
E.g. table one has 1, 2, 3 and table two has A, B, C, the product is 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C

Join
A subset of Cartesian product where some condition is met

Divide
The opposite of the Cartesian product
Uses one table to partition records in another
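
Difference, Cartesian product and join can be sketched with the 1, 2, 3 and A, B, C example from this slide. Assumptions: sqlite3 stands in for PostgreSQL; the pos column and the extra table three are hypothetical additions so that the join and the difference have something to work on. Divide has no single SQL keyword and is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one (n INTEGER)")
conn.execute("CREATE TABLE two (pos INTEGER, letter TEXT)")
conn.execute("CREATE TABLE three (n INTEGER)")
conn.executemany("INSERT INTO one VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO two VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.executemany("INSERT INTO three VALUES (?)", [(2,), (3,), (4,)])

# Cartesian product: every row of one paired with every row of two
product = [
    f"{n}{letter}"
    for n, letter in conn.execute(
        "SELECT n, letter FROM one CROSS JOIN two ORDER BY n, letter"
    )
]

# Join: the subset of the product where a condition holds (here n = pos)
join = [
    f"{n}{letter}"
    for n, letter in conn.execute(
        "SELECT n, letter FROM one JOIN two ON one.n = two.pos ORDER BY n"
    )
]

# Difference: rows in one that are not in three
difference = conn.execute("SELECT n FROM one EXCEPT SELECT n FROM three").fetchall()
```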

ERDs - Entity relationship diagrams

Diagram that allows relationships between objects to be visualised
Important part of database design
Start off as concepts, then can get into actual tables, constraints, keys, indices etc.
Software exists to visualise tables and relationships within a db
Some software can reverse engineer ERDs to create the db schema

ERDs
ERDs often use symbols to represent three different types of information
Boxes are commonly used to represent entities
Diamonds are normally used to represent relationships
Ovals are used to represent attributes

ERDs
Entities, attributes and identifiers
An entity represents an instance of something you want to track (rectangle)
Can be physical or abstract

Similar entities are grouped into entity sets
Attributes describe the object they represent (ellipses)
Identifiers (aka key or primary key) are underlined
This can get limited on space if every attribute is listed for an entity
[Diagram: entity employee with attributes employeeID, firstName and lastName]

ERDs
Diamonds show the relationship's name (usually a verb or something similarly descriptive) e.g.
Person Drives Forklift

No notion of direction here, even though it exists, i.e. forklift does not drive person
Can help by adding arrows e.g.
Customer Places Order; Shipper Ships Order

Cardinality in ERDs
Can take this further and add cardinality to the images
An attribute's cardinality tells how many values of that attribute an object might have
Expressed in variety of notations, but uses 3 symbols:
Ring means 0
Line means 1
Crow's foot means many
[Diagram: SwordSwallower - Swallows - Sword]

Hence here, one and only one sword swallower can swallow between 0 and many swords

Inheritance in ERDs
Hierarchy is present
Similar idea in notation, but now add triangle that represents IsA
1 point points towards the parent class
Work your way up the tree
The commander is space trained and is an astronaut
[Diagram: IsA hierarchy with Astronaut at the top; Space Trained and Payload Specialist below it; Commander, Pilot and Mission Specialist below Space Trained]

Converting ERDs to actual db tables


Example of an ordering system in a business
ERD
[Diagram: ERD linking the five entities below]
customers: customerID, name
contacts: name, phone, email, customerID
orders: orderID, date, customerID
addresses: addressesType, Street, city, state, zip, customerID
items: sku, description, quantity, unitPrice, orderID

Converting ERDs to actual db tables


Example of an ordering system in a business
Db tables

customers
customerID     number    required
name           string    required

contacts (three of the four fields are marked required)
name           string
phone          string
email          string
customerID     number

orders
orderID        number    required
date           date      required
customerID     number    required

addresses
AddressesType  string    required
Street         string    required
city           string    required
state          string    required
zip            string    required
customerID     number    required

items
sku            string    required
description    string    required
quantity       number    required
unitPrice      currency  required
orderID        number    required
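
Turning part of this design into real tables might look like the sketch below. It is an illustration under several assumptions, none of which come from the slides: sqlite3 stands in for PostgreSQL; customerID and orderID are taken to be the primary keys; number, string, date and currency are mapped to INTEGER, TEXT, TEXT and NUMERIC; "required" is mapped to NOT NULL; and the inserted rows are made-up test data. Only three of the five tables are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript(
    """
    CREATE TABLE customers (
        customerID INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE orders (
        orderID    INTEGER PRIMARY KEY,
        date       TEXT NOT NULL,
        customerID INTEGER NOT NULL REFERENCES customers (customerID)
    );
    CREATE TABLE items (
        sku         TEXT NOT NULL,
        description TEXT NOT NULL,
        quantity    INTEGER NOT NULL,
        unitPrice   NUMERIC NOT NULL,
        orderID     INTEGER NOT NULL REFERENCES orders (orderID)
    );
    """
)

# Made-up rows to show the links between the tables working.
conn.execute("INSERT INTO customers VALUES (1, 'Example Customer')")
conn.execute("INSERT INTO orders VALUES (1, '2012-01-31', 1)")
conn.execute("INSERT INTO items VALUES ('SKU-1', 'example item', 3, 2.50, 1)")

order_total = conn.execute(
    "SELECT SUM(quantity * unitPrice) FROM items WHERE orderID = 1"
).fetchone()[0]
table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

In PostgreSQL the date column would more naturally use the DATE type and the price a NUMERIC with fixed precision.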

Data normalisation (1)


A way to make a database flexible and robust
Protects the db against certain error types
Lots of duplicated data, which wastes space and makes updating data a problem
Incorrectly associating 2 unrelated pieces of data, so it is not possible to delete one without deleting the other
Requiring a piece of non-existent data to represent a piece of data that does exist
Limiting the number of values that can be entered

Normalisation - a rearrangement of data within the db to put it in a standard normal form that prevents these kinds of anomalies

Data normalisation (2)


There are 7 forms, from weakest to strongest:
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
Domain/Key Normal Form (DKNF)

Only consider up to 3NF here, as most people do not go beyond this (best bang for your buck)

First Normal Form (1NF)


You are a db!
Mostly enforced by most db products
Official rules for 1NF are:
1. Each column must have a unique name
2. The order of rows and columns does not matter
3. Each column must have a single data type
4. No two rows can have identical values
5. Each column must contain a single value
6. Columns cannot contain repeating groups

First Normal Form (1NF)


Rules 1 and 2 come for free with most db
Rule 3 means that 2 types of data cannot be stored in the same column e.g. date and number
Rule 4: if two rows were identical, how could you tell them apart?
Add a primary key

Rule 5 is the most likely to be violated, but it is important
Create a new table, and link data back to main table with primary key

Rule 6: you cannot have multiple columns containing indistinguishable values
Create a new table, and link data back to main table with primary key

First Normal Form (1NF)


Example of airline flights
Rows ordered so that frequent flyers are at top of table

Following is not normalised

city  city  connections
DEN   PHX   1
SAN   LAX   JFK, SEA, TPA
SAN   LAX   JFK, SEA, TPA
DEN   PHX   1
SAN   LAX   JFK, SEA, TPA

Normalise this table to 1NF

First Normal Form (1NF)


Make sure columns have unique names

startCity  destinationCity  connections
DEN        PHX              1
SAN        LAX              JFK, SEA, TPA
SAN        LAX              JFK, SEA, TPA
DEN        PHX              1
SAN        LAX              JFK, SEA, TPA

Make sure order of rows does not matter; if it does, add a column to imply the order

startCity  destinationCity  connections    priority
DEN        PHX              1              1
SAN        LAX              JFK, SEA, TPA  1
SAN        LAX              JFK, SEA, TPA  1
DEN        PHX              1              2
SAN        LAX              JFK, SEA, TPA  2

First Normal Form (1NF)


Make sure columns hold a single data type. Here list the connecting cities, not numbers

startCity  destinationCity  connections    priority
DEN        PHX              LON            1
SAN        LAX              JFK, SEA, TPA  1
SAN        LAX              JFK, SEA, TPA  1
DEN        PHX              LON            2
SAN        LAX              JFK, SEA, TPA  2

Make sure no two rows contain identical values; if so, add a field to differentiate them, e.g. Customer ID to tell them apart

startCity  destinationCity  connections    priority  customerID
DEN        PHX              LON            1         4637
SAN        LAX              JFK, SEA, TPA  1         12878
SAN        LAX              JFK, SEA, TPA  1         2871
DEN        PHX              LON            2         28718
SAN        LAX              JFK, SEA, TPA  2         9287

First Normal Form (1NF)


This is OK now, but may want to add a date column to future-proof the table even more

startCity  destinationCity  connections    priority  customerID  date
DEN        PHX              LON            1         4637        21/05/10
SAN        LAX              JFK, SEA, TPA  1         12878       7/09/09
SAN        LAX              JFK, SEA, TPA  1         2871        7/09/09
DEN        PHX              LON            2         28718       21/05/10
SAN        LAX              JFK, SEA, TPA  2         9287        7/09/09

customerID and date are primary key

Make sure each column contains a single value
If a column has multiple values, split them into a new table (a new connections table)
Add a new column that corresponds to the primary key in original table

First Normal Form (1NF)


Result with two tables:
[Figures: relational model and data for the two tables]
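
The split into two tables can be reproduced in a few lines of plain Python, using the flight rows from the previous slide. The table names trips and connections and the connectionNumber column follow the names used later in these slides; the city column name is an assumption, since the original figure is not available.

```python
# Rows of the table before the split; connections still holds several values.
flat_rows = [
    ("DEN", "PHX", "LON", 1, 4637, "21/05/10"),
    ("SAN", "LAX", "JFK, SEA, TPA", 1, 12878, "7/09/09"),
    ("SAN", "LAX", "JFK, SEA, TPA", 1, 2871, "7/09/09"),
    ("DEN", "PHX", "LON", 2, 28718, "21/05/10"),
    ("SAN", "LAX", "JFK, SEA, TPA", 2, 9287, "7/09/09"),
]

trips = []
connections = []
for start, destination, stops, priority, customer_id, date in flat_rows:
    # The trip keeps every single-valued column.
    trips.append(
        {
            "customerID": customer_id,
            "date": date,
            "startCity": start,
            "destinationCity": destination,
            "priority": priority,
        }
    )
    # Each connecting city becomes its own row, linked back to the trip
    # through the trip's primary key (customerID, date).
    for number, city in enumerate(stops.split(", "), start=1):
        connections.append(
            {
                "customerID": customer_id,
                "date": date,
                "connectionNumber": number,
                "city": city,
            }
        )
```

Every column in both tables now holds a single value, which is what rule 5 asks for.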

First Normal Form (1NF)


Make sure columns don't contain repeating groups
Whilst start and destination city are similar they are distinguishable
To some degree it depends on what is being done with the data, and what is being used to search on

Is this new structure 1NF?
Probably not strictly, as primary key currently is customerID/date in Trips table, and customerID/date/connectionNumber in Connections table
If traveler flies more than once per day, primary key is violated, so need to add TripNumber to both tables

Second Normal Form (2NF)


A table is 2NF if:
1. It is 1NF
2. All of the non-key fields depend on all of the key fields

For example:

time  wrestler      class    rank
1:30  Annette Cart  pro      3
1:30  Ben Jones     pro      2
2:00  Sydney Dart   amateur  1
2:15  Ben Jones     pro      2
2:30  Annette Cart  pro      3
3:30  Sydney Dart   amateur  1
3:30  Mike Acosta   amateur  6
3:45  Annette Cart  pro      3

Technically wrestler violates 1NF
Time/wrestler combination is primary key

Second Normal Form (2NF)


Table is vulnerable to
update anomalies due to repeated data
deletion anomalies, as all other bits of information are lost if a row is deleted
insertion anomalies if addition of data violates primary key

Solution here is to pull columns that do not depend on entire primary key out to a new table

Second Normal Form (2NF)


[Figures: relational model and data for the resulting tables]

Holding data about wrestlers and matches in same table is a problem
If a table represents a single unified concept such as wrestler or match, the table will be in 2NF
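
The 2NF split of the wrestling schedule can be sketched in plain Python. The names matches and wrestlers are assumptions, as the original figures are not available; the data rows are the ones from the example table.

```python
schedule = [
    ("1:30", "Annette Cart", "pro", 3),
    ("1:30", "Ben Jones", "pro", 2),
    ("2:00", "Sydney Dart", "amateur", 1),
    ("2:15", "Ben Jones", "pro", 2),
    ("2:30", "Annette Cart", "pro", 3),
    ("3:30", "Sydney Dart", "amateur", 1),
    ("3:30", "Mike Acosta", "amateur", 6),
    ("3:45", "Annette Cart", "pro", 3),
]

# matches keeps only the key fields: who wrestles at what time.
matches = [(time, wrestler) for time, wrestler, _, _ in schedule]

# class and rank depend on the wrestler alone, not on the whole time/wrestler
# key, so they move to a table keyed by wrestler.
wrestlers = {}
for _, wrestler, wrestler_class, rank in schedule:
    stored = wrestlers.setdefault(wrestler, (wrestler_class, rank))
    if stored != (wrestler_class, rank):
        raise ValueError(f"contradictory data for {wrestler}")

# A rank change is now a single update rather than one per scheduled match.
wrestlers["Annette Cart"] = ("pro", 2)
```

Deleting a match no longer loses a wrestler's class and rank, and a wrestler can be recorded before any match is scheduled.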

Third Normal Form (3NF)


A table is 3NF if:
1. It is in 2NF
2. It contains no transitive dependencies

A transitive dependency is when one non-key field's value depends on another non-key field's value, e.g. using Person as the primary key in a book club table

Third Normal Form (3NF)


Table is 1NF and has a single-field primary key, so it is also 2NF
Subject to modification anomalies
Some fields here are related to others
Author, Pages and Year related to Title
Title depends on Person, and all other fields depend on Title

Main clue of a transitive dependency is lots of duplicated values in the table

Third Normal Form (3NF)


Sort this out by finding problem fields and putting them into a new table
[Figures: relational model and data for the resulting tables]
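
Removing the transitive dependency from the book club table can be sketched as below. The slide's own data is not available, so every person, title, author and number here is invented for illustration; only the field names Person, Title, Author, Pages and Year come from the slides.

```python
# Hypothetical book club table with Person as the primary key.
book_club = [
    {"person": "Ana", "title": "Book A", "author": "Author X", "pages": 300, "year": 2001},
    {"person": "Ben", "title": "Book A", "author": "Author X", "pages": 300, "year": 2001},
    {"person": "Cy", "title": "Book B", "author": "Author Y", "pages": 150, "year": 1995},
]

# Person determines Title ...
person_titles = {row["person"]: row["title"] for row in book_club}

# ... and Title determines Author, Pages and Year, so those fields move to
# their own table keyed by Title.
books = {
    row["title"]: {"author": row["author"], "pages": row["pages"], "year": row["year"]}
    for row in book_club
}


def author_read_by(person):
    """Follow Person -> Title -> Author across the two tables."""
    return books[person_titles[person]]["author"]
```

The details of Book A are now stored once instead of once per reader, which removes the duplicated values that signalled the transitive dependency.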

Beyond 3NF
Pretty easy to get to 3NF, and it prevents most common data anomalies
Stores separate data separately
Removes redundant data

Plan to get data in your db into 3NF as a starting point
Beyond 3NF data should be robust to less common anomalies
Practically though, data can be so split up it is hard to maintain, implement and use

References
PostgreSQL
By Korry Douglas & Susan Douglas, 2nd edition
Developer's Library, Sams Publishing, Indianapolis, IN, USA

Beginning database design solutions
By Rod Stephens, www.wrox.com
Wiley Publishing Inc., Indianapolis, IN, USA
