Oasis Project Report: Arizona State University
December 8, 2014
Contents

1 Introduction
2 Transactional Database
  2.1 Business Rules
  2.2 Entity-Relationship Diagram
  2.3 Tests
3 Star Schemas
4 Data Creation
  4.1 PHP Scripts
  4.2 Python Scripts
  4.3 Mockaroo Scripts
5 ETL Process
6 Remarks
List of Figures

1 ER Diagram
2 Test 1
3 Test 2
4 Test 3
5 Test 4
6 Policies Sales Star Schema
7 Monthly Sales Periodic Snapshot
8 Insurance Business Matrix
9 Mockaroo homepage
10 PHP script of ETL Part 1
11 PHP script of ETL Part 2
12 Monthly Sales Fact Data
Listings

1 PHP script to populate Policy Coverage table
2 Python script to populate Policy table
3 Python script to populate Vehicle table
4 SQL script for the ETL process of the periodic snapshot
1 Introduction
The project is based on chapter 16 of the book The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling [2]. The authors discuss various types of insurance, such as homeowner and private property, but for this project we focused on automobile insurance.

The first step was to create the business rules to back the development of the ER diagram of the transactional database, which is the primary data source for the data warehouse. Next, star schemas were created for two different purposes, and fake data was then generated via scripts to simulate the ETL process after the implementation of both star schemas. Finally, a report was generated as an example of the information that can be extracted from the data model developed in this project.
According to Kimball and Ross (2013), insurance companies have some common demands, such as measuring performance over time by coverage, covered item, policyholder, and sales distribution channel characteristics. Some enterprises are also engaged in many other external processes, such as the investment of premium payments or the compensation of contract agents. Despite all these requirements, we concentrated on the core of the business, which is related to policies and payments.
2 Transactional Database

2.1 Business Rules

To support the development of the transactional database, fifteen business rules were created, among them:
1. A customer can have zero or more policies
2. A vehicle must be owned by one and only one customer
3. Every vehicle must have one and only one model
4. A policy must have a customer, an agent and a vehicle
5. A customer must have an SSN, address and name
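Rules like these translate directly into schema constraints. The sketch below shows how rules 2, 3, and 5 might be enforced; it is only an illustration, and the table and column names (and the referenced model table) are assumptions rather than the project's actual schema.

CREATE TABLE customer (
    customer_id SERIAL PRIMARY KEY,
    ssn         CHAR(11)     NOT NULL,  -- rule 5: SSN required
    name        VARCHAR(100) NOT NULL,  -- rule 5: name required
    address     VARCHAR(200) NOT NULL   -- rule 5: address required
);

CREATE TABLE vehicle (
    vehicle_id  SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customer (customer_id),  -- rule 2: exactly one owner
    model_id    INT NOT NULL REFERENCES model (model_id)         -- rule 3: exactly one model
);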
2.2 Entity-Relationship Diagram
The main entities are Vehicle, Customer, Agent and Policy. A bridge entity called Policy Coverage was used because one policy can have more than one coverage, and one type of coverage may be in one or more policies. Based on these fifteen business rules, the following entity-relationship diagram was generated:
Figure 1: ER Diagram
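As an illustration of how the bridge entity resolves this many-to-many relationship, the following sketch gives a possible definition; the names are assumptions and not the project's actual schema.

CREATE TABLE policy_coverage (
    policy_id   INT NOT NULL REFERENCES policy (policy_id),
    coverage_id INT NOT NULL REFERENCES coverage (coverage_id),
    PRIMARY KEY (policy_id, coverage_id)  -- one row per policy/coverage pair
);

The composite primary key lets a policy carry several coverages and a coverage appear on several policies, while preventing duplicate pairs.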
2.3 Tests
Figure 2: Test 1
Figure 3: Test 2
Figure 4: Test 3
Figure 5: Test 4
3 Star Schemas

The main purpose of our data warehouse is to support the analysis of the number of policies sold, by whom they were sold, and where they were sold. The grain is one row per agent per policy, and based on that the following star schema was created:

Figure 6: Policies Sales Star Schema
Figure 7: Monthly Sales Periodic Snapshot

Figure 8: Insurance Business Matrix
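As a rough illustration of what the stated grain implies, a fact table at one row per agent per policy might look like the sketch below; the dimension and column names are assumptions, not the project's actual design.

CREATE TABLE policy_sales_fact (
    policy_id INT NOT NULL,   -- key for the policy
    agent_id  INT NOT NULL,   -- key to the agent dimension
    date_key  INT NOT NULL,   -- key to the date dimension
    amount    NUMERIC(10,2),  -- measure: policy amount sold
    PRIMARY KEY (policy_id, agent_id)  -- enforces the stated grain
);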
4 Data Creation
Python, PHP, and SQL scripts were used for data creation; the SQL scripts were created with the support of the Mockaroo website.
4.1 PHP Scripts
PHP scripts were used to populate the Policy Coverage table. The following script generates 1200 rows of fake data; the loop body shown here is a reconstruction in outline, and the output file and column names are assumptions.
Listing 1: PHP script to populate Policy Coverage table

<?php
// Number of policies and coverage types to combine.
$numberPolicies = 1200;
$numberCoverages = 4;

// Reconstructed body: write one INSERT per policy with a random
// coverage (file and column names are assumptions).
$f = fopen("policy_coverage.sql", "w");
for ($i = 1; $i <= $numberPolicies; $i++) {
    $coverage = rand(1, $numberCoverages);
    fwrite($f, "INSERT INTO policy_coverage (policy_id, coverage_id) "
             . "VALUES ($i, $coverage);\n");
}
fclose($f);
?>
4.2 Python Scripts
Python scripts were used to populate the Vehicle and Policy tables. The first one chooses a random date (month, day, and year), a vehicle, and an agent for each policy, and creates 500 rows of fake data. Parts of the listing were lost, so the reconstructed portions below are marked; the output file and column names are assumptions.
Listing 2: Python script to populate Policy table

import random

# Date source
year = [2010, 2011, 2012]
month = list(range(1, 13))
day = list(range(1, 29))

# Policy id
id = 0
# List of all vehicles
vehicle = list(range(1, 1001))
# List of all agents
agent = list(range(1, 11))

# Reconstructed: open an output file and write one INSERT per policy
# (file and column names are assumptions).
f = open("policy.sql", "w")
for _ in range(500):
    id += 1
    rand_year = random.choice(year)
    rand_month = random.choice(month)
    rand_day = random.choice(day)
    rand_vehicle = random.choice(vehicle)
    rand_agent = random.choice(agent)

    # Initial date of the policy, formatted as a SQL date literal
    rand_date_ini = "'" + str(rand_year) + "-" + str(rand_month) + "-" + str(rand_day) + "'"

    f.write("INSERT INTO policy (policy_id, vehicle_id, agent_id, date_initial) "
            "VALUES (%d, %d, %d, %s);\n" % (id, rand_vehicle, rand_agent, rand_date_ini))
f.close()
The second script generates a random plate, customer, model, and year for each vehicle. Its caption and parts of its body were lost, so the listing below is a sketch; the row count, plate format, output file, and column names are assumptions.

Listing 3: Python script to populate Vehicle table

import random

# Plate source
char = ["a","b","c","d","e","f","g","h","i","j","k","l",
        "m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
num = ["0","1","2","3","4","5","6","7","8","9"]

id = 0

# List of customers
cust = list(range(1, 101))
# List of models
model = list(range(1, 275))

# Car years
year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006",
        "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014"]

# Reconstructed: write one INSERT per vehicle with a random plate
# (three letters plus four digits; the plate format is an assumption).
f = open("vehicle.sql", "w")
for _ in range(1000):
    id += 1
    plate = "".join(random.choice(char) for _ in range(3)) \
          + "".join(random.choice(num) for _ in range(4))
    rand_cust = random.choice(cust)
    rand_model = random.choice(model)
    rand_year = random.choice(year)
    f.write("INSERT INTO vehicle (vehicle_id, plate, customer_id, model_id, year) "
            "VALUES (%d, '%s', %d, %d, %s);\n"
            % (id, plate, rand_cust, rand_model, rand_year))
f.close()
4.3 Mockaroo Scripts

The Mockaroo website was used to generate most of the data due to its simplicity and effectiveness. It is a simple website with a very simple interface for generating scripts in many different languages; it was decided to generate all scripts in SQL to maintain the consistency of the project.
Figure 9: Mockaroo homepage
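For example, a Mockaroo run for a customer-like table produces plain INSERT statements along these lines; the values and column names here are illustrative, not actual project data.

INSERT INTO customer (customer_id, name, ssn, address)
VALUES (1, 'Jane Smith', '123-45-6789', '10 Mill Ave, Tempe, AZ');
INSERT INTO customer (customer_id, name, ssn, address)
VALUES (2, 'John Doe', '987-65-4321', '42 University Dr, Tempe, AZ');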
5 ETL Process
Figure 10: PHP script of ETL Part 1

Figure 11: PHP script of ETL Part 2
As shown in Figures 10 and 11, part of the ETL process was implemented in PHP with Laravel [3]; SQL scripting was used to create the ETL process for the periodic snapshot. Only the caption of the listing survived, so the body below is a hedged sketch of its likely shape; the table and column names are assumptions.

Listing 4: SQL script for the ETL process of the periodic snapshot

-- Reconstructed sketch: aggregate policies into the monthly snapshot
-- fact table (names are assumptions, not the project's schema).
INSERT INTO monthly_sales_fact (month_key, agent_id, policies_sold, total_amount)
SELECT EXTRACT(YEAR FROM p.date_initial) * 100
       + EXTRACT(MONTH FROM p.date_initial) AS month_key,
       p.agent_id,
       COUNT(*)      AS policies_sold,
       SUM(p.amount) AS total_amount
  FROM policy p
 GROUP BY 1, p.agent_id;
After the ETL process finished, the following table was generated as an example of a report that can be extracted from the data model developed in this project.

Figure 12: Monthly Sales Fact Data
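A report like this can be produced with a simple query over the snapshot fact table; the following is a hedged example that assumes the table and column names used in the ETL sketch above.

SELECT month_key, agent_id, policies_sold, total_amount
  FROM monthly_sales_fact
 ORDER BY month_key, total_amount DESC;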
6 Remarks
In sum, a fully functional data warehouse was developed following the business rules that the team created according to the business process discussed in chapter sixteen of Kimball and Ross (2013) [2]. The first step was to define the business rules in order to design a transactional database, which is the backbone of the data warehouse. After that, the transactional database was implemented and tested to ensure it conformed to the requirements of the next phase, which was the creation of the star schema of the main business process. A star schema of a periodic snapshot was also designed to support the analysis of monthly sales. After the implementation of both star schemas, scripts were created to populate the database with fake data for the purpose of testing the scripts created for the ETL process. The ETL process was developed via Laravel [3] and SQL scripts, and the last step was to generate an example of the kind of report that the data model designed by the team can support.
Some notable experiences from the project were:

1. The need to change the initial ER model during the process to fit all requirements
2. The ETL process was the most complex step
3. There are many different ways to generate fake data
References

[1] The PostgreSQL Global Development Group. PostgreSQL Documentation, 2014.

[2] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, Inc., 3rd edition, 2013.

[3] Taylor Otwell. Laravel 4.2 Documentation, 2014.