118.721 Analysis and Interpretation of Animal Health Data
Module 2 - Design and use of a database
Patrick Biggs (1) & Tim-Hinnerk Heuer (2)
(1) Infectious Disease Research Centre, IVABS, Massey University
(2) Landcare Research, Palmerston North
Introductions:
3-day course covering the design & use of a database
With a focus on PostgreSQL
Course taught by Patrick Biggs and Tim-Hinnerk Heuer, with technical support from Simon Verschaffelt
Day 1
Introduction
Why not use Excel?
History of databases
PostgreSQL as a database
Terminology
Day 1 (contd)
Discussion of software environment that we will use (THH)
Data normalisation
Working with test data (THH)
The path from data in Excel to PostgreSQL
Reading data into the database
Day 2
Data types
Null values
Data summaries
Basic queries
Data validation
Missing values, spelling errors
Day 3
More queries
Data export
Data backup
Talking to other software
R, Perl, PHP
Extensions e.g. PostGIS
Day 1
http://jasonlbaptiste.com/startups/microsoftexcel-is-the-worlds-most-used-database/
Relational database
Special type of database using tables
Relationships
Tables linked together by relationships
A few types exist
Normalization
Process of simplifying the data structure
A whole science, with many forms (see later)
Database model
Plan or concept for data storage using an entity relationship diagram (ERD)
[Figure: ERD with the entities department, manager, project and task]
An employee can be assigned many tasks
A task can be assigned to many employees
This requires a new table (assignment) that resolves the dilemma by giving each employee and task combination a unique definition
[Figure: ERD with the entities task, assignment, project and employee]
[Figure: full ERD with the entities company, department, employee, employee type, project, task and assignment]
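The assignment table described above can be sketched as follows. This is a minimal illustration, not the course schema: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the column names and sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE task (task_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- One row per employee/task combination; the pair is the primary key.
CREATE TABLE assignment (
    employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
    task_id INTEGER NOT NULL REFERENCES task(task_id),
    PRIMARY KEY (employee_id, task_id)
);
INSERT INTO employee VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO task VALUES (10, 'Collect samples'), (20, 'Enter data');
INSERT INTO assignment VALUES (1, 10), (1, 20), (2, 20);
""")

# One employee, many tasks.
tasks_for_ana = [row[0] for row in conn.execute(
    """SELECT t.title FROM assignment a
       JOIN task t ON t.task_id = a.task_id
       WHERE a.employee_id = 1 ORDER BY t.title""")]

# One task, many employees.
staff_on_data_entry = [row[0] for row in conn.execute(
    """SELECT e.name FROM assignment a
       JOIN employee e ON e.employee_id = a.employee_id
       WHERE a.task_id = 20 ORDER BY e.name""")]

# The same employee/task pair cannot be stored twice.
try:
    conn.execute("INSERT INTO assignment VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Making the pair (employee_id, task_id) the primary key is one way to give each combination the unique definition mentioned above.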
Whilst it may have its problems & limitations, RDB is the best general purpose database for most situations
Database (db)
Named collection of schemas
When a client application connects to the server, the db name is specified
A client cannot interact with > 1 db per connection, but can open any number of connections to access multiple db
Query
Command type that retrieves data from the server
Table
Collection of rows
Usually named, but can exist temporarily
All rows have the same shape
Every row in a table contains the same columns
Row
Collection of column values
Every row has the same shape
Same set of columns
E.g. if you run a car dealership, and you have a vehicles table, each row is an individual car
Domain
Named specialisation of another datatype
Can be included in >1 table
View
Alternate way to present a table
A virtual table
Does not store data, just creates a way of looking at existing data
Client
Connects to a server via a postmaster
Asks the server to perform work
Transaction
Collection of db operations treated as a unit: all or nothing
PostgreSQL ensures all complete, or none do
Usually starts with a BEGIN command, and ends with COMMIT or ROLLBACK
Commit
Successful end of a transaction
Changes can be made permanent
Index
Data structure to help reduce the time to find data
Absolutely vital to performance optimisation
Especially with large and/or complex db
Tablespace
Alternative storage location for tables and indices
Result set
All data you get back from the db after issuing a query
May be empty
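Several of the terms above (table, row, view, query, result set) can be seen together in a short sketch. It reuses the car dealership idea from the Row entry; the columns, the sample cars and the use of Python's sqlite3 module in place of PostgreSQL are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the vehicles data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicles (
    vehicle_id INTEGER PRIMARY KEY,
    make TEXT NOT NULL,
    price INTEGER NOT NULL,
    sold INTEGER NOT NULL
);
INSERT INTO vehicles VALUES (1, 'Toyota', 12000, 0),
                            (2, 'Ford', 9000, 1),
                            (3, 'Mazda', 7000, 0);
-- The view stores no rows itself; it is a saved way of looking at vehicles.
CREATE VIEW unsold_vehicles AS
    SELECT vehicle_id, make, price FROM vehicles WHERE sold = 0;
""")

# A query returns a result set.
unsold_before = conn.execute(
    "SELECT make FROM unsold_vehicles ORDER BY make").fetchall()

# Changing the underlying table changes what the view shows.
conn.execute("UPDATE vehicles SET sold = 1 WHERE vehicle_id = 3")
unsold_after = conn.execute(
    "SELECT make FROM unsold_vehicles ORDER BY make").fetchall()

# A result set may be empty.
expensive_unsold = conn.execute(
    "SELECT make FROM unsold_vehicles WHERE price > 50000").fetchall()
```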
R: relational
Data stored in separate tables can be linked by looking for common elements
Allows data to be pulled quickly, and questions to be answered from data stored in more than one table
Database design
Going to take an overview approach
Aim is to get you to the point of being able to design, build and then use a database in your own work
Guiding principles and procedures are the key here
If it's not 100% optimised for performance, or fully normalised, that's not the end of the world
This is not Amazon.com, for example, where queries taking a few milliseconds longer mount up on a daily basis
Want to get data out quickly, reliably & consistently
Want to be free of incorrect and/or contradictory data
Information containers
Database
A repository for data (stored information) and metadata (information about the structure)
Can create, read, update and delete data in some manner
Features (1)
CRUD
The 4 things any db should provide
Create, Read, Update and Delete
If it has CRUD, it is a db
E.g. a plane's black box flight recorder cannot (rightfully so) modify the data, so it's not a db
Retrieval (R in CRUD)
Aka read; you should be able to find
all data in the db
all data in specific ways
Features (2)
Consistency (part of R in CRUD)
Performing the same search now and in the future on exactly the same dataset gives the same result
Db design has a role here
Where is your data stored?
Is it stored in >1 place in the db?
Validity
Data is validated against other information in the db
E.g. if you have a date field, it should not be able to accept "the sky is blue"
"Old" is not a valid age
"Lots" is not a valid quantity
Features (3)
Easy Error Correction
Ability to update data throughout system
E.g. a building supply company db has had entries listed for "duck tape" when they should be "duct tape"
Speed
Good design plays role in db efficiency
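The duct tape correction can be sketched as a single UPDATE, provided the product name is stored in only one place. The two tables, their columns and the sample rows below are hypothetical, and sqlite3 is used in place of PostgreSQL only so that the example is runnable.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE order_line (
    order_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity INTEGER NOT NULL
);
INSERT INTO product VALUES (1, 'duck tape'), (2, 'nails');
INSERT INTO order_line VALUES (100, 1, 3), (101, 1, 1), (101, 2, 50);
""")

# The name lives in exactly one row, so one UPDATE corrects it everywhere.
cursor = conn.execute(
    "UPDATE product SET name = 'duct tape' WHERE name = 'duck tape'")
rows_changed = cursor.rowcount

# Every order line now reports the corrected name through the join.
names_on_orders = sorted({row[0] for row in conn.execute(
    """SELECT p.name FROM order_line o
       JOIN product p ON p.product_id = o.product_id""")})
```

Had the name been typed into every order line instead, each copy would need finding and fixing, which is where design affects both error correction and consistency.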
Features (4)
Atomic Transactions
A possibly complex series of actions that is considered as a single operation by those not involved directly in performing the transaction
Transactions cannot half happen
Important for the R and U parts of CRUD
Features (5)
ACID
4 features that an effective transaction system should provide:
Atomicity
Already mentioned this
Consistency
Db is in a consistent state before and after transaction happened
Isolation
A transaction isolates its details from everyone except the person making it
Durability
Once complete, it does not disappear, and is final
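Atomicity can be sketched with a transfer between two accounts: either both updates happen or neither does. The account table, its CHECK rule and the amounts are hypothetical, and sqlite3 stands in for PostgreSQL; the BEGIN, COMMIT and ROLLBACK commands are the ones named in the terminology section.

```python
import sqlite3

# isolation_level=None means the module issues no implicit BEGIN; we do it ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("""
CREATE TABLE account (
    name TEXT PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0)
)""")
conn.execute("INSERT INTO account VALUES ('farm', 100), ('vet', 0)")


def transfer(amount):
    """Move money from 'farm' to 'vet' as one all-or-nothing unit."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = 'vet'",
            (amount,))
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = 'farm'",
            (amount,))
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK rule, so undo the credit as well.
        conn.execute("ROLLBACK")
        return False


first_ok = transfer(60)    # farm has 100, so this completes
second_ok = transfer(60)   # farm has only 40 left, so this is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed second transfer the vet account does not keep the extra credit: the transaction did not half happen.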
Features (6)
Persistence and Backups
Data must be persistent, i.e. not disappear without warning
Easy to back up a computer db, harder for a physical one, e.g. a notebook
Most db systems allow easy backup procedures
You need to remember to do them though!
Features (7)
Ease of use
How good is your UI?
We will use one later on for exactly this purpose
Portability
If the db is accessible on the net, in principle you can see it from anywhere
Security
See above: if you can see your db from anywhere, so can others, maybe with worse intentions for your data than you
Within the db, you can apply security conditions on tables etc.
Features (8)
Sharing
Can many people use the db simultaneously without there being problems?
Transactional control important here
Good db design ensures data is compartmentalised properly for sharing too
Types of db (1)
Focus on relational db here, but there are other types:
Flat files
INI files
Windows System Registry
Types of db (2)
Again, strengths and weaknesses to each type
Mention a few here, not all of them
Flat files
Have their use as a place for data storage
Not sophisticated though, e.g. will not handle hierarchical data, or allow you to do sophisticated searches
Spreadsheets
Already mentioned to some degree
Can't do complex queries or check data integrity without writing extra code etc.
Types of db (3)
Hierarchical
Data in a tree-like structure, e.g. files and folders on a computer hard drive
Lots of data takes this form:
E.g. family trees
Parts lists in complicated objects such as a computer
XML
Language for storing, transferring and retrieving hierarchical data only; nothing else happens with it
Db fundamentals
Tables, rows and columns
Relations, attributes and tuples
Keys
Indexes/indices
Constraints
Formal name for a column is attribute
Formal name for a row is tuple
Sounds like scruple
A two-attribute relation holds pairs, a three-attribute relation holds triples
A six-attribute relation holds 6-tuples
Keys (1)
Loosely, a key is a combination of 1 or more columns that can be used to find rows in a table
Again, multiple types exist
Key
Set of 1 or more columns that have certain properties
Compound key
Key that includes more than 1 column
Superkey
Set of 1 or more columns for which no 2 rows can have the exact same values
As they define fields that must be unique within a table, they are sometimes called unique keys
Keys (2)
Candidate key
Minimal superkey
Remove any column from it and it is no longer a superkey
Unique key
A superkey used to uniquely identify rows in a table
Similar to a candidate key, except in use
A candidate key could be used to identify data, whereas a unique key is used to constrain data
An implementation issue, not a theoretical one like a candidate key
Keys (3)
Primary key
A superkey that is used to identify rows in a table
A table has only 1 primary key
Again, more for implementation than theory
Very good way to find data quickly
Foreign key
Used as a constraint rather than to find records in a table (later)
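The superkey and candidate key definitions above can be checked mechanically on sample data. The sketch below is plain Python; the employee rows and both helper functions (is_superkey, candidate_keys) are hypothetical names made up for this illustration. A real key is a rule about all possible data, so sample rows can only show that a column set is not a key; they cannot prove that it is one.

```python
from itertools import combinations

# Hypothetical sample rows.
rows = [
    {"employee_id": 1, "email": "ana@example.org", "first_name": "Ana", "department": "Lab"},
    {"employee_id": 2, "email": "ben@example.org", "first_name": "Ben", "department": "Lab"},
    {"employee_id": 3, "email": "ana2@example.org", "first_name": "Ana", "department": "Lab"},
    {"employee_id": 4, "email": "ben2@example.org", "first_name": "Ben", "department": "Field"},
]


def is_superkey(rows, columns):
    """True if no two rows share the same values in all of the given columns."""
    seen = {tuple(row[c] for c in columns) for row in rows}
    return len(seen) == len(rows)


def candidate_keys(rows):
    """Superkeys that stop being superkeys when any column is removed."""
    all_columns = sorted(rows[0])
    found = []
    # Work upwards in size so that smaller keys are found first.
    for size in range(1, len(all_columns) + 1):
        for columns in combinations(all_columns, size):
            contains_smaller_key = any(set(k) <= set(columns) for k in found)
            if not contains_smaller_key and is_superkey(rows, columns):
                found.append(columns)
    return found


name_alone = is_superkey(rows, ("first_name",))                 # two Anas
id_and_name = is_superkey(rows, ("employee_id", "first_name"))  # unique, but not minimal
minimal = candidate_keys(rows)
```

Here (employee_id, first_name) is a superkey but not a candidate key, because employee_id alone already identifies each row.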
Indexes/indices
Db structure that makes it easier to find records based on values in 1 or more fields
Index != key
Do not use indices gratuitously
They require db overhead
This can be significant on big tables (i.e. a million plus rows)
Only use them on fields that you are most likely to search on, e.g. not house number
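The effect of an index can be sketched by asking the db how it plans to run a query before and after the index exists. The owner table and its contents are hypothetical, and the example uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; PostgreSQL has its own EXPLAIN command, whose output looks different.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE owner (
    owner_id INTEGER PRIMARY KEY,
    surname TEXT NOT NULL,
    house_number INTEGER NOT NULL
)""")
conn.executemany(
    "INSERT INTO owner (surname, house_number) VALUES (?, ?)",
    [(f"Surname{i}", i % 200) for i in range(5000)])


def plan(sql):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " ".join(row[-1] for row in rows)  # last column holds the detail text


query = "SELECT owner_id FROM owner WHERE surname = 'Surname42'"

plan_before = plan(query)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_owner_surname ON owner (surname)")
plan_after = plan(query)   # the plan now refers to idx_owner_surname
```

The index is placed on surname because that is the field being searched; house_number is left unindexed, in line with the advice above.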
Constraints (1)
A constraint places restrictions on data in a table
Formally not part of the db
Practically they are, as they play a role in data management
Prevent a field from holding a value that does not match its data type
E.g. in an integer field, "ten" is not allowed
E.g. in a column allowing a 10 character string, "deoxyribose" (11 characters) is not allowed
Constraints (2)
Check constraints
Boolean expression that tests whether certain data is allowed
If true, it is allowed
Constraints (3)
Primary key constraints
By definition, no 2 records can have identical values for the field(s) that define the table's primary key
Unique constraints
Values in 1 or more fields must be unique
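Primary key, unique and check constraints can be sketched together on one table. The animal table and its rules are hypothetical. One caveat about the stand-in: SQLite does not enforce column data types as strictly as PostgreSQL does, so the 10 character limit is written here as a CHECK rule rather than relying on the column type.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and rules are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE animal (
    animal_id INTEGER PRIMARY KEY,                 -- primary key constraint
    tag TEXT NOT NULL UNIQUE,                      -- unique constraint
    age_years INTEGER CHECK (age_years >= 0),      -- check constraint
    species TEXT CHECK (length(species) <= 10)     -- check standing in for a 10 character type
)""")
conn.execute("INSERT INTO animal VALUES (1, 'NZ-001', 4, 'cattle')")


def rejected(sql):
    """True if the db refuses the statement because it breaks a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "duplicate primary key": rejected("INSERT INTO animal VALUES (1, 'NZ-002', 2, 'sheep')"),
    "duplicate tag": rejected("INSERT INTO animal VALUES (2, 'NZ-001', 2, 'sheep')"),
    "negative age": rejected("INSERT INTO animal VALUES (3, 'NZ-003', -1, 'sheep')"),
    "string too long": rejected("INSERT INTO animal VALUES (4, 'NZ-004', 2, 'deoxyribose')"),
    "valid row": rejected("INSERT INTO animal VALUES (5, 'NZ-005', 2, 'sheep')"),
}
```

Each of the four bad rows is refused by the db itself, so no application code has to remember to check them.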
Db operations (1)
8 operations originally defined for relational db at their inception:
Selection
Select some or all rows in a table
Projection
Drops columns from a table
Union
Combines tables and removes duplicate rows
Intersection
Finds the records that are the same in 2 tables
Db operations (2)
Difference
Records present in 1 table not in the other
Cartesian Product
A new table created by combining every record in the first table with every record in the second
E.g. if table one has 1, 2, 3 and table two has A, B, C, the product is 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C
Join
A subset of Cartesian product where some condition is met
Divide
The opposite of the Cartesian product
Uses one table to partition records in another
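Six of the eight operations map directly onto SQL and can be tried on small tables. The sketch below reuses the 1, 2, 3 and A, B, C example for the Cartesian product; the herd_a and herd_b tables are hypothetical additions, and sqlite3 stands in for PostgreSQL. Projection needs a table with several columns and divide has no single SQL keyword, so both are left out.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; herd_a and herd_b are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE one (n INTEGER);
CREATE TABLE two (letter TEXT);
INSERT INTO one VALUES (1), (2), (3);
INSERT INTO two VALUES ('A'), ('B'), ('C');
CREATE TABLE herd_a (tag TEXT);
CREATE TABLE herd_b (tag TEXT);
INSERT INTO herd_a VALUES ('t1'), ('t2'), ('t3');
INSERT INTO herd_b VALUES ('t2'), ('t3'), ('t4');
""")


def column(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Selection: some of the rows in a table.
selection = column("SELECT n FROM one WHERE n >= 2 ORDER BY n")

# Cartesian product: every row of one paired with every row of two.
product = column("SELECT n || letter FROM one CROSS JOIN two ORDER BY n, letter")

# Join: the part of the product where a condition is met.
joined = column("SELECT n || letter FROM one JOIN two ON letter = 'A' ORDER BY n")

# Union, intersection and difference of two tables with the same columns.
union = column("SELECT tag FROM herd_a UNION SELECT tag FROM herd_b ORDER BY tag")
intersection = column("SELECT tag FROM herd_a INTERSECT SELECT tag FROM herd_b ORDER BY tag")
difference = column("SELECT tag FROM herd_a EXCEPT SELECT tag FROM herd_b ORDER BY tag")
```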
ERDs
ERDs often use symbols to represent three different types of information
Boxes are commonly used to represent entities
Diamonds are normally used to represent relationships
Ovals are used to represent attributes
ERDs
Entities, attributes and identifiers
An entity represents an instance of something you want to track (rectangle)
Can be physical or abstract
Similar entities are grouped into entity sets
Attributes describe the object they represent (ellipses)
Identifiers (aka key or primary key) are underlined
Space can get limited if every attribute is listed for an entity
[Figure: employee entity with the attributes firstName, employeeID and lastName]
ERDs
Diamonds show the relationship's name (usually a verb or something similarly descriptive), e.g.
[Figure: Person - Drives - Forklift]
No notion of direction here, even though it exists, i.e. the forklift does not drive the person
Can help by adding arrows, e.g.
[Figure: Customer - Places - Order - Ships - Shipper]
Cardinality in ERDs
Can take this further and add cardinality to the diagrams
An attribute's cardinality tells how many values of that attribute an object might have
Expressed in a variety of notations, but uses 3 symbols:
Ring means 0
Line means 1
Crow's foot means many
[Figure: SwordSwallower - Swallows - Sword]
Hence here, one and only one sword swallower can swallow between 0 and many swords
Inheritance in ERDs
Hierarchy is present
Similar idea in notation, but now add a triangle that represents IsA
1 point points towards the parent class
Work your way up the tree
E.g. the commander is space trained and is an astronaut
[Figure: inheritance diagram including the Astronaut class]
addresses
AddressesType   string
Street          string
city            string
state           string
zip             string
customerID      number

items
sku             string
description     string
quantity        number
unitPrice       currency
orderID         number
Normalisation: a rearrangement of data within the db to put it in a standard (normal) form that prevents these kinds of anomalies
Only consider up to 3NF here, as most people do not go beyond this (best bang for your buck)
Make sure the order of rows does not matter; if it does, add a column to make the order explicit

startCity   destinationCity   connections     priority
DEN         PHX               1               1
SAN         LAX               JFK, SEA, TPA   1
SAN         LAX               JFK, SEA, TPA   1
DEN         PHX               1               2
SAN         LAX               JFK, SEA, TPA   2
Make sure no two rows contain identical values; if they do, add a field to differentiate them, e.g. a customer ID to tell them apart

startCity   destinationCity   connections     priority   customerID
DEN         PHX               LON             1          4637
SAN         LAX               JFK, SEA, TPA   1          12878
SAN         LAX               JFK, SEA, TPA   1          2871
DEN         PHX               LON             2          28718
SAN         LAX               JFK, SEA, TPA   2          9287
For example:
time   wrestler       class     rank
1:30   Annette Cart   pro       3
1:30   Ben Jones      pro       2
2:00   Sydney Dart    amateur   1
2:15   Ben Jones      pro       2
2:30   Annette Cart   pro       3
3:30   Sydney Dart    amateur   1
3:30   Mike Acosta    amateur   6
3:45   Annette Cart   pro       3
The solution here is to pull the columns that do not depend on the entire primary key out into a new table
Data
Holding data about wrestlers and matches in the same table is a problem
If a table represents a single unified concept such as wrestler or match, the table will be in 2NF
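The split of the wrestler data into two tables can be sketched as follows. It assumes that the primary key of the original table is the combination of time and wrestler, which the slides imply but do not state; the table and column names are also assumptions, and sqlite3 stands in for PostgreSQL.

```python
import sqlite3

# The original single table: (time, wrestler, class, rank).
matches_flat = [
    ("1:30", "Annette Cart", "pro", 3),
    ("1:30", "Ben Jones", "pro", 2),
    ("2:00", "Sydney Dart", "amateur", 1),
    ("2:15", "Ben Jones", "pro", 2),
    ("2:30", "Annette Cart", "pro", 3),
    ("3:30", "Sydney Dart", "amateur", 1),
    ("3:30", "Mike Acosta", "amateur", 6),
    ("3:45", "Annette Cart", "pro", 3),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- class and rank depend on the wrestler alone, so they move to their own table.
CREATE TABLE wrestler (
    name TEXT PRIMARY KEY,
    wrestler_class TEXT NOT NULL,
    wrestler_rank INTEGER NOT NULL
);
-- What is left depends on the whole key (match_time, wrestler_name).
CREATE TABLE matches (
    match_time TEXT NOT NULL,
    wrestler_name TEXT NOT NULL REFERENCES wrestler(name),
    PRIMARY KEY (match_time, wrestler_name)
);
""")
conn.executemany(
    "INSERT OR IGNORE INTO wrestler VALUES (?, ?, ?)",
    [(w, c, r) for _, w, c, r in matches_flat])
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [(t, w) for t, w, _, _ in matches_flat])

wrestler_count = conn.execute("SELECT COUNT(*) FROM wrestler").fetchone()[0]

# Joining the two tables rebuilds the original rows, so nothing was lost.
rebuilt = conn.execute(
    """SELECT m.match_time, w.name, w.wrestler_class, w.wrestler_rank
       FROM matches m JOIN wrestler w ON w.name = m.wrestler_name""").fetchall()
lossless = sorted(rebuilt) == sorted(matches_flat)

# A rank change now touches one row, not one row per match.
rows_to_fix = conn.execute(
    "UPDATE wrestler SET wrestler_rank = 2 WHERE name = 'Annette Cart'").rowcount
```

In the single table the same rank change would have had to be made in three rows, with the risk of missing one.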
A transitive dependency is when one non-key field's value depends on another non-key field's value, e.g. using Person as the primary key in a book club table
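A transitive dependency can be sketched in plain Python. The slides only name the book club table and its Person key, so the title and author columns, the sample rows and the determines helper below are all assumptions made for illustration: each person picks one title, and the author depends on the title, not directly on the person.

```python
# Hypothetical book club table with person as the primary key.
book_club = [
    {"person": "Ana", "title": "Emma", "author": "Jane Austen"},
    {"person": "Ben", "title": "Emma", "author": "Jane Austen"},
    {"person": "Cal", "title": "Dracula", "author": "Bram Stoker"},
]


def determines(rows, lhs, rhs):
    """True if each value of column lhs always goes with the same value of rhs."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


# author (non-key) is fixed by title (non-key): a transitive dependency.
title_determines_author = determines(book_club, "title", "author")
# title does not identify the person, so title is not a key of this table.
title_determines_person = determines(book_club, "title", "person")

# 3NF fix: keep the person's choice in one table and book facts in another.
person_choice = [{"person": r["person"], "title": r["title"]} for r in book_club]
books = sorted({(r["title"], r["author"]) for r in book_club})
```

After the split the author of Emma is stored once, however many members choose it.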
Beyond 3NF
Pretty easy to get to 3NF, and it prevents the most common data anomalies
Stores separate data separately
Removes redundant data
Plan to get the data in your db into 3NF as a starting point
Beyond 3NF, data should be robust to less common anomalies
Practically though, data can be so split up that it is hard to maintain, implement and use
References
PostgreSQL
By Korry Douglas & Susan Douglas
2nd edition
Developer's Library, Sams Publishing, Indianapolis, IN, USA