Homework 1b

Overview
In this assignment we will expand our PostgreSQL countries database to capture more
information, and learn to use XQuery to extract information from web pages to populate the
database.
Due
The project is due at midnight on Monday, March 10, with an ER diagram due in
discussion section on Monday, March 3.
Database Design
Please visit https://www.cia.gov/library/publications/the-world-factbook and familiarize
yourself with the information provided about the various countries (hint: look at the text,
low-bandwidth version). Select a portion of the available information and create an
entity-relationship diagram (on paper) representing that portion. Your diagram should include:
1. Five (or more) simple attributes not covered in the first part of the homework example.
2. Three other entity sets, including International Organizations.
3. Relationships as needed, including Borders.
Be sure to show key constraints, weak entities, participation constraints, etc. Label each
relationship and comment on your choices of participation constraints and key
constraints for each relationship in your diagram.
Building the Database
Add attributes to your existing country table and add new tables to capture the information in
your ER model.
Extracting Information
The web page sources for the text, low-bandwidth version are available in directory
/home/ff/cs186/factbook/xml. Each file is named AA.xml, where AA is a two-character
abbreviation of the country name. An example XQuery that uses catalog files to iterate over
multiple documents and extract background information from these files is available in
factback.xq. Note that two catalogs are given: one, cat.xml, refers to all the countries,
while catsmall.xml covers only 5 countries and is useful for development and testing.
In addition, a single XML file mondial.xml is available, along with a DTD for this file. It is
also fine to use XQuery to extract data from this file to populate your database.

In order to run XQuery, create a file ending in .xq and then use the Saxon XQuery
processor, available in the /home/ff/cs186/Saxon subdirectory. (The CLASSPATH for this
should be set on login, but if necessary add saxon9.jar in the Saxon directory to your
CLASSPATH.)
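With the CLASSPATH set, a query can then be run from the command line roughly as follows (the output file name is a placeholder; net.sf.saxon.Query is Saxon 9's command-line XQuery entry point):

```
% java net.sf.saxon.Query factback.xq > records.xml
```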

The goal is to write one XQuery script for each table in your database, producing a new
XML file consisting of a set of <record> elements.

A program is provided that will turn an XML file in this <record> format into SQL
INSERT statements.
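As a rough illustration of what such a converter does (the actual provided program may differ, and the <record> child-element names here are invented), a minimal Python sketch using the standard-library XML parser:

```python
import xml.etree.ElementTree as ET

def records_to_inserts(xml_text, table):
    """Turn a file of <record> elements into SQL INSERT statements.

    Each child element of a <record> is treated as a column; values are
    escaped by doubling single quotes. Column names are assumed to match
    the table's columns.
    """
    root = ET.fromstring(xml_text)
    stmts = []
    for rec in root.findall("record"):
        cols = [child.tag for child in rec]
        vals = ["'" + (child.text or "").replace("'", "''") + "'" for child in rec]
        stmts.append("INSERT INTO %s (%s) VALUES (%s);"
                     % (table, ", ".join(cols), ", ".join(vals)))
    return stmts

# Example: two hypothetical country records.
xml_text = """<records>
  <record><name>France</name><capital>Paris</capital></record>
  <record><name>Chad</name><capital>N'Djamena</capital></record>
</records>"""
for stmt in records_to_inserts(xml_text, "country"):
    print(stmt)
```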
Challenges
Additional string processing may be required to clean up the data for insertion into the
tables, including deciding on a standard format for key fields, etc. You are free to attack
this problem, commonly referred to as the Extract, Transform, Load (ETL) problem, any way
you want: for example, by running Perl or Python scripts on either the <record> file above
or the file of INSERT statements. Alternatively, it is fine to insert data into staging
tables for processing before the actual production tables are populated, making use of the
many string-manipulation functions available in PostgreSQL
(http://www.postgresql.org/docs/8.1/interactive/sql-commands.html).
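For instance, a field like the Factbook's area value often arrives as a formatted string. A minimal cleanup sketch in Python (the exact input format here is an assumption):

```python
import re

def clean_area(raw):
    """Extract an integer area from a Factbook-style string.

    Turns e.g. '336,237 km' into 336237; returns None if no number
    is found (e.g. the field is 'NA').
    """
    match = re.search(r"[\d,]+", raw)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))

print(clean_area("336,237 km"))
```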
Note that XQuery provides the function tokenize($b/text(), sep), which takes a string as
its first parameter and a separator as its second, and returns the list of substrings of
the first parameter obtained by splitting it on the second. For example, if sep is ',' and
$b/text() is 'apple, orange, lemon', the result will be a list of three strings, one for
each fruit, and a 'for $w in tokenize(...)' loop can be used to iterate over them. You may
want to look at w3c.org or w3schools.com for information on additional XPath functions
that can help you with this.
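If you do this splitting in a post-processing script instead, the Python equivalent is str.split (or re.split for regular-expression separators), typically combined with strip to drop surrounding whitespace:

```python
text = "apple, orange, lemon"
# Split on the separator and trim whitespace around each piece.
fruits = [word.strip() for word in text.split(",")]
for fruit in fruits:
    print(fruit)
```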
Deliverables
1. ER diagram (hand drawn) with discussion attached.
2. Turn in the remaining files by putting them in a directory hw1b and running submit:

   % cd hw1b
   % submit hw1b

Required files for submission (the required files are createdb.sql, xq.tar, scripts.tar,
dump.tar, hw1b.txt):
1. Create Table statements in a file createdb.sql. You may want to use pg_dump
   --schema-only to dump this information.

2. XQuery scripts for each table foo in a file foo.xq. The XQuery should work when run
in the directory above the xml directory. Put them all together in a single xq.tar file.
(only submit the tar file)
3. Any scripts used to process the data, in a single scripts.tar tar file. (Create an
   empty scripts.tar even if you don't have scripts.)
4. The results of 'select * from foo' for each table foo in your database, in a foo.out
   file (there should be more than one). Put them all together in a single dump.tar tar
   file. (only submit the tar file)
5. A short description file hw1b.txt. Describe the purpose of each script file, if any
   (e.g., give a short example of what a script file can parse). Also mention whether you
   extracted data from the per-country files or from mondial.xml. You are welcome to
   document the effort you put into the assignment (e.g., special text-processing cases
   you handled, interesting cases you faced). Don't write too much; a short description
   is enough.
Note
1. As mentioned in the description above, it's OK to use mondial.xml only.
2. Extra credit will be given to those who put extra effort into text processing.
3. If you don't do extra text processing, it's OK to leave some fields as strings (for
   example, the area attribute can be left as '336,237 km' instead of '336237'), but
   again, we'll give extra credit for extra work.
