CSC 457

Project Portfolio

Christian Williams
4 May 2020

Information Requirements
During the information requirements stage of designing a data warehouse, the purpose of the system is formalized by determining its business requirements and its data requirements.

While determining the business requirements of the data warehouse, it is important to express those requirements in terms that are meaningful to the organization that the system is being designed for. The data warehouse should be able to generate meaningful reports that will assist the end user in decision making. By researching the organization, the exact domain of the data warehouse can be better understood through the discovery of the specific operational activities and processes vital to the organization. Interviewing the client and personnel from the organization can also help determine the exact requirements of the system. The business requirements should be formalized into a list of expected reports that the data warehouse will be able to create using the data generated by the organization [1].

After determining the business requirements of the data warehouse, it is important to determine the data that will be required to generate the specified reports. Data requirements should be based upon the data generated by the organization and the data available in the source systems [1].

The purpose of the data warehouse that my team designed was to assist the Mobile Police Department (MPD) in allocating its resources. The first task was determining the resources that would need to be allocated by the MPD. The team decided that the MPD would be making decisions about how to better allocate individual police units. Then, the team had to determine the potential users of the data warehouse and decide upon the types of reports that could be generated to assist those users in the allocation of resources. Finally, the team had to decide upon the data that would be needed to generate those reports and where that data would be generated during the day-to-day activities of the MPD.

Potential Users
• Police Officials
• City Planners/Zones

Information Requirements
• Analyze the average response time for incidents in a precinct.
• Analyze the number of incidents with respect to a precinct and crime category.
• Analyze the number of violent crimes committed with respect to precinct and year.
• Analyze the number of crimes committed by year.
• Analyze the number of incidents with respect to time.
• Analyze the number of incidents with respect to the type of crime.


Logical Design
After the information requirements of the data warehouse have been formalized, the logical
design process can begin. The logical design process of a data warehouse consists of four phases:
selecting a “business” process, declaring the granularity, identifying the measures, and identifying the
dimensions.

Logical design of the data warehouse begins with selecting a business process. A business process is a major operational activity or event, supported by a source system, that generates the data represented in the fact table or a set of related fact tables [2].

After the business process has been determined, the granularity of the fact table should be
determined. The granularity of the fact table is representative of the level of detail stored by a row in
the fact table for the selected business process. The granularity of the fact table should be designed to
meet the expectations of the business requirements and should be designed at the lowest level of detail
or most atomic level supported by the data collected by the source systems. However, there are
industries where the most atomic level of granularity would result in an overwhelmingly large volume of
data and a coarser granularity would be more reasonable to manage [2].

After the granularity of the fact table has been determined, the dimensions with respect to the
granularity of the fact table should be identified. Dimensions provide context for the measures. They
define what the value of a measure represents. The attributes of the dimension will be used to filter and
constrain queries [2].

After the dimensions have been identified, the facts or measures from the business process
should be identified. Measures are usually numeric quantities generated in a source system by a
business process, such as an amount or a quantity. Measures should be true to the granularity of the
fact table [2].

There are three types of measures: additive, semi-additive, and non-additive. The value of an additive measure is meaningful when it is summed along any or all of the dimensions. Additive measures usually represent the measure of an activity, such as quantity sold. The value of a semi-additive measure, however, is only meaningful when it is summed along specific dimensions. Semi-additive measures usually represent the measure of an intensity, such as an account balance. Non-additive measures cannot be added along any of the dimensions.
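As a brief T-SQL illustration of the difference, using the Incident fact table defined later for the additive case and a hypothetical Account_Balance_Fact table (not part of this project) for the semi-additive case:

-- Additive: the number of EMTs can be summed along any combination of dimensions.
select d.year, p.station_num, sum(f.num_emt) as total_emt
from Incident_Fact f
join Date d on d.date_id = f.date_id
join Precinct p on p.precinct_id = f.precinct_id
group by d.year, p.station_num;

-- Semi-additive (hypothetical account snapshot): summing a balance across dates
-- is meaningless, so the balance is averaged over the Date dimension instead.
select account_id, avg(balance) as avg_daily_balance
from Account_Balance_Fact
group by account_id;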

When data is loaded from a source system, it is important to assign unique records a surrogate
key to buffer the data warehouse from changes in the operational keys from the source system or from
multiple source systems using the same operational key to identify multiple unique records. The
surrogate key is a unique identifier assigned by the data warehouse to identify unique records.

After the dimensions and measures have been identified, the next step in the logical design of a
data warehouse is to create a logical star schema. When creating a logical star schema, the attributes of
each dimension should be created, and the hierarchy of the attributes established. The attributes of
each dimension will enable the user to constrain and filter queries while the hierarchy will be used to
drill-down or roll-up. The relationship between the dimensions and the fact table should be represented
by a foreign key in the fact table.

Granularity
• The police report of an incident.

Measures
• Number of EMT
• Response Time
• Time on Scene

Dimensions
• Date: Date_ID, Time of Day, Day, Month, Year
• Precinct: Precinct_ID, Station Number, Population, City, Zip
• Crime: Crime_ID, Crime Name, Crime Type, Crime Category, Violent Crime Status
• Responder: Responder_ID, Name, Age, Badge Number, Precinct Number, Gender

Logical Star Schema

• Incident (fact table): Date ID (FK), Precinct ID (FK), Crime ID (FK), Responder ID (FK), Response Time, Time on Scene, Number of EMT
• Date: Date ID, Time of Day, Day, Month, Year
• Crime: Crime ID, Name, Type, Category, Violent
• Precinct: Precinct ID, Station Number, Population, City, Zip
• Responder: Responder ID, Name, Age, Badge Number, Precinct Number, Gender

Database Design (SQL)


create table Date (
date_id int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
date_op_key nvarchar(32) NOT NULL,
time nvarchar(10) NOT NULL,
day int NOT NULL,
month nvarchar(4) NOT NULL,
year int NOT NULL
);
create table Responder (
responder_id int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
responder_op_key nvarchar (32) NOT NULL,
fname nvarchar(30) NOT NULL,
lname nvarchar(30) NOT NULL,
age int NOT NULL,
badge_num nvarchar(15) NOT NULL,
precinct_num int not null,
gender nvarchar(6) not null
);
create table Precinct (
precinct_id int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
precinct_op_key nvarchar(32) NOT NULL,
station_num int NOT NULL,
population int NOT NULL,
city nvarchar(30) NOT NULL,
zip int NOT NULL
);
create table Crime (
crime_id int NOT NULL IDENTITY(1, 1) PRIMARY KEY,
crime_op_key nvarchar(32) NOT NULL,
crime_name nvarchar(100) NOT NULL,
crime_type nvarchar(30) NOT NULL,
crime_category nvarchar(30) NOT NULL,
violent nvarchar(10) NOT NULL
);
create table Incident_Fact (
date_id int NOT NULL,
precinct_id int NOT NULL,
responder_id int NOT NULL,
crime_id int NOT NULL,
response_time int NOT NULL,
time_on_scene int NOT NULL,
num_emt int NOT NULL
);
alter table Incident_Fact add constraint fk_date_id foreign key (date_id)
references Date(date_id);

alter table Incident_Fact add constraint fk_precinct_id foreign key (precinct_id)
references Precinct(precinct_id);

alter table Incident_Fact add constraint fk_crime_id foreign key (crime_id)
references Crime(crime_id);

alter table Incident_Fact add constraint fk_responder_id foreign key (responder_id)
references Responder(responder_id);
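As a usage sketch (with illustrative values only), inserting a row into one of these dimensions lets the IDENTITY property assign the surrogate key automatically:

insert into Crime (crime_op_key, crime_name, crime_type, crime_category, violent)
values ('CR-1042', 'Burglary', 'Felony', 'Class B', 'nonviolent');

-- IDENTITY(1, 1) assigns crime_id; SCOPE_IDENTITY() returns the key just generated.
select SCOPE_IDENTITY() as new_crime_id;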

Extraction Transformation and Load (ETL)


The ETL process can be challenging for several reasons. When creating a data warehouse, large volumes of data from several source systems have to be managed. This data can contain duplicates, mismatched IDs, missing fields, or noise. Because of the volume involved, it is not possible to manually verify that the transformations performed on each row are correct, so the transformation algorithms must be general enough to handle all of the data yet specific enough to ensure a high degree of accuracy.

Since the data being extracted typically comes from multiple sources or is in a format different from the format desired for storage in the data warehouse, data cleaning and data transformation must be performed before the data can be loaded into the data warehouse. Data cleaning is a process meant to improve the overall quality and correctness of the data. It can involve decoding fields into meaningful values, converting the values within fields to a uniform format, ensuring that identifiers are unique, validating the information in the fields, and removing inconsistencies [3]. Typical transformations performed during data transformation include data cleaning, format revisions, decoding fields, splitting fields, merging records, and removing duplicate records; a short SQL sketch of several of these transformations follows the example list below.

Example Transformations
• Decoding of fields: when data in a record is stored as a coded value, such as a 0 or 1, it should be replaced with a meaningful value before the record is loaded into the data warehouse. For example, a field such as current address might use a 0 to represent ‘no’ and a 1 to represent ‘yes’; the 0 or 1 would be replaced with the corresponding meaningful value before being loaded into the data warehouse.
• Merging of information: when records from two different sources have a relationship, information from those records can be combined to form a single record before it is loaded into the data warehouse. For example, if all possible products are stored in one source as name, category, and product_id, while the products that the store carries are stored in a different source as product_id and department, records from the two sources can be merged to form a new record that lists every product the store currently carries by product_id, name, category, and department.
• De-duplication: de-duplication is detecting records that refer to the same entity and storing only one of them. For example, one or more sources may store multiple records describing the same product. The duplicate records need to be identified by a key, name, or a combination of fields using some sort of matching algorithm (e.g., fuzzy matching, distance matching, classification algorithms), and then the duplicates need to be removed.
• Splitting of single fields: when a data source stores data as a single field with multiple components but a dimension table requires the individual components of that field, the field must be split apart into its components before it is stored in the data warehouse. For example, a source might store an address as one long string while the dimension table requires that the street, city, state, and zip code be stored as individual fields; the address must be split into those components before it is stored in the data warehouse.
• Missing information: when a source contains null values or is missing information from a record, a value for the missing field must either be generated from the information present using an algorithm or be represented by a dummy value such as ‘unknown’. For example, a source storing demographic information may not have a phone number for a person; a value indicating that the phone number is ‘unknown’ must be used for the record, or the phone number must be obtained from a different source if multiple sources are available.
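A minimal T-SQL sketch of a few of these transformations, assuming hypothetical staging tables stg_person and stg_product that are not part of this project's schema:

-- Decoding of fields and missing information: replace coded values with
-- meaningful ones and substitute 'unknown' for missing phone numbers.
select
    person_id,
    case current_address when '1' then 'yes'
                         when '0' then 'no'
                         else 'unknown' end as current_address,
    coalesce(phone_number, 'unknown') as phone_number
from stg_person;

-- De-duplication: keep one row per product when a source stores the same
-- product more than once.
select product_id, name, category
from (
    select product_id, name, category,
           row_number() over (partition by product_id order by name) as rn
    from stg_product
) ranked
where rn = 1;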

After the data has been cleansed and undergone transformations specific to the data sets, it can be loaded into the data warehouse. The dimension tables should be populated before the fact table(s) to ensure that there are no foreign key constraint errors. When populating the dimension and fact tables, it is important to replace operational keys with surrogate keys unique to the data warehouse. If the operational keys are guaranteed to be unique, a simple mechanism such as the IDENTITY property in Microsoft SQL Server can be used. However, if the operational keys are not unique across the data sources, then a map can be used to translate operational keys into surrogate keys.
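As a sketch of the key-map approach, assuming a hypothetical Crime_Key_Map table that records the source system alongside the operational key (a similar map would exist for any dimension populated from multiple sources):

-- One mapping row per (source system, operational key) pair, so the same
-- operational key coming from two different sources maps to two surrogate keys.
create table Crime_Key_Map (
    source_system nvarchar(30) NOT NULL,
    crime_op_key nvarchar(32) NOT NULL,
    crime_id int NOT NULL,
    primary key (source_system, crime_op_key)
);

-- Look up the surrogate key for an incoming operational key.
select m.crime_id
from Crime_Key_Map m
where m.source_system = 'RMS'       -- hypothetical source system name
  and m.crime_op_key = 'CR-1042';   -- illustrative operational key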

Crime Dimension

Using the tMap component in Talend, the data from two comma-delimited text files was joined by a unique identifier to add the ‘violent crime status’ to each crime record, and then a surrogate key was generated for each record using the Numeric.sequence function. Next, tExtractDelimitedFields split the original ‘crime type’ field into two fields, ‘crime type’ and ‘crime category’. After ‘crime type’ was split into two separate fields, tReplace decoded the ‘violent crime status’ field by replacing the value ‘0’ with ‘nonviolent’, replacing the value ‘1’ with ‘violent’, and adding ‘unknown’ to any records with a missing value for ‘violent crime status’. After the transformations on the Crime dimension were completed, the records were loaded into the corresponding database table to be stored in the data warehouse.
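A rough T-SQL approximation of this job is sketched below; it assumes hypothetical staging tables stg_crime and stg_violent_flag loaded from the two delimited files, and it assumes every original ‘crime type’ value contains a hyphen separating type from category (the actual work was done with tMap, tExtractDelimitedFields, and tReplace in Talend):

-- Join the two staged files, split the original 'crime type' field, decode the
-- violent flag, and let IDENTITY on Crime.crime_id supply the surrogate key.
insert into Crime (crime_op_key, crime_name, crime_type, crime_category, violent)
select
    c.crime_op_key,
    c.crime_name,
    ltrim(rtrim(left(c.crime_type, charindex('-', c.crime_type) - 1))) as crime_type,
    ltrim(rtrim(substring(c.crime_type, charindex('-', c.crime_type) + 1, 30))) as crime_category,
    case v.violent_flag when '1' then 'violent'
                        when '0' then 'nonviolent'
                        else 'unknown' end as violent
from stg_crime c
left join stg_violent_flag v on v.crime_op_key = c.crime_op_key;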

Responder Dimension

Using a tMap component in Talend, several columns of the Responder records were filtered out of the data by mapping the
fields to be kept to the corresponding fields in the schema retrieved from the Responder table in the data warehouse. In addition
to filtering the columns in the Responder dataset, the tMap generated a surrogate key for each record using the
Numeric.sequence function. After all the transformations on the Responder dimension were completed, the records were loaded
into the corresponding database table to be stored in the data warehouse.

Precinct Dimension

Using the tExtractDelimitedFields component in Talend, the ‘address’ field in the Precinct records was split into multiple fields.
The records were then piped into the tMap component where some fields of the address were filtered out of the data by
mapping the fields to be kept to the corresponding fields in the schema retrieved from the Precinct table in the data warehouse.
A surrogate key was generated by the tMap for each record using the Numeric.sequence function. After all the transformations
were completed on the Precinct dimension, the records were loaded into the corresponding database table to be stored in the
data warehouse.

Date Dimension

Using a tExtractDelimitedFields component in Talend, the information from the datetime strings stored in a spreadsheet was split into several corresponding fields, such as month, day, and year. Then, a tMap mapped those fields to the corresponding fields in the schema retrieved from the Date table in the data warehouse and generated a surrogate key for each record using the Numeric.sequence function. After the transformations on the Date dimension were completed, the records were loaded into the corresponding database table to be stored in the data warehouse.
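A rough T-SQL equivalent of this split, assuming a hypothetical staging table stg_incident_date with a datetime column occurred_at and an operational key date_op_key (the actual work was done in Talend as described above):

-- Break the datetime into the fields expected by the Date dimension;
-- IDENTITY on Date.date_id supplies the surrogate key.
insert into Date (date_op_key, time, day, month, year)
select
    d.date_op_key,
    convert(nvarchar(10), cast(d.occurred_at as time), 108) as time,  -- hh:mm:ss
    datepart(day, d.occurred_at) as day,
    left(datename(month, d.occurred_at), 3) as month,                 -- 3-letter month fits nvarchar(4)
    datepart(year, d.occurred_at) as year
from stg_incident_date d;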

Incident (Fact Table)

Using the tDBRow component in Talend, each operational key in the fact table was used to generate a query to the corresponding table in the data warehouse to return the surrogate key associated with that operational key. The tParseRecordSet component used the surrogate key returned from the query to replace the operational key with the value of the surrogate key. After each operational key had been replaced with the corresponding surrogate key, the tMap component was used to map the columns of the dataset to the schema retrieved from the Incident table in the data warehouse, and then the records were loaded into the Incident table in the data warehouse.
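The same replacement could be sketched as a single T-SQL statement, assuming a hypothetical staging table stg_incident that still carries the operational keys (the actual job used tDBRow, tParseRecordSet, and tMap as described above):

-- Replace each operational key with the surrogate key from its dimension
-- before inserting the row into the fact table.
insert into Incident_Fact (date_id, precinct_id, responder_id, crime_id,
                           response_time, time_on_scene, num_emt)
select d.date_id, p.precinct_id, r.responder_id, c.crime_id,
       s.response_time, s.time_on_scene, s.num_emt
from stg_incident s
join Date d      on d.date_op_key      = s.date_op_key
join Precinct p  on p.precinct_op_key  = s.precinct_op_key
join Responder r on r.responder_op_key = s.responder_op_key
join Crime c     on c.crime_op_key     = s.crime_op_key;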

Physical Design
When dealing with large volumes of data, query performance can be accelerated by creating
aggregate fact tables or data cubes since the aggregation tables and cuboids act as an index into a
dimension attribute. In addition to acting as an index into a dimension attribute, the number of records
that need to be parsed to fulfill a query from an application or user can potentially be reduced because
some of the records might be filtered out by a cuboid. When creating aggregations and data cubes, the
size of potential aggregations needs to be balanced by the expected frequency of its use [2].

Aggregate fact tables are usually handled by the DBMS or an OLAP engine. By using an OLAP engine or a DBMS's built-in aggregate navigator to handle the aggregations, the system is not restricted to a single BI front-end application [2]. A properly designed set of aggregates should behave like database indexes and should not be directly accessible to the user or business intelligence (BI) application [4].

Data cubes contain a subset of records from the data warehouse that have been aggregated or
filtered based upon criteria common to frequent queries and are meant to be directly accessed by the
user. ROLAP engines create data cubes from aggregate tables dynamically for BI applications whereas
MOLAP engines prefabricate data cubes and select the correct data cube based upon the needs of the BI
application.

MPD Incident Report Cuboids


Incidents [Date, Crime, Precinct, Responder]

Cuboid 1: {year, category, station number, responder id}

Cuboid 1 would answer queries by crime category and precinct with the ability to drill down into crime
class and violent status.

Cuboid 2: {year, crime name, station number, responder id} where year = 2019

Cuboid 2 would answer all queries related to the last year of crime data by precinct.

Cuboid 3: {year, category, city, responder id}

Cuboid 3 would answer all queries related to crime category by city.
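For example, Cuboid 1 could be materialized as an aggregate fact table with a query along these lines (a sketch only; in practice the aggregates would be built and maintained by the DBMS or OLAP engine rather than created ad hoc, and the table name here is hypothetical):

-- Pre-aggregate incident counts at the {year, category, station number, responder} grain.
select d.year,
       c.crime_category,
       p.station_num,
       f.responder_id,
       count(*) as num_incidents
into Incident_Agg_Year_Category_Station   -- hypothetical aggregate (cuboid 1) table
from Incident_Fact f
join Date d     on d.date_id     = f.date_id
join Crime c    on c.crime_id    = f.crime_id
join Precinct p on p.precinct_id = f.precinct_id
group by d.year, c.crime_category, p.station_num, f.responder_id;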



OLAP and Intelligence Applications


After the data has been loaded into the data warehouse and segmented into various aggregations to generate multidimensional views of the data, the data needs to be parsed, analyzed, and mined to understand patterns in the data and to generate reports that assist an organization in decision making. Business intelligence applications interface with the OLAP engine or the DBMS to analyze the data and generate reports or machine learning models [2]. There are several major subsets of business intelligence applications, including direct access query tools, data mining, standard report generation, analytic applications, dashboards, and operational BI applications.

Direct access query tools allow the user to formulate ad hoc queries against the dimensional model through a metadata layer to generate a report or dataset. These tools should support query formulation, analysis and presentation of the results, a good user experience, and the technical features power users need. Because translating business questions into the syntax of a query language requires an understanding of the details of the data, direct access query tools are mostly used by expert power users who are experienced in both the business and the technology [2].
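For example, a power user might issue an ad hoc query such as the following against the Incident star schema to count violent crimes by precinct and year (one of the information requirements listed earlier):

-- Ad hoc query: number of violent crimes by precinct and year.
select p.station_num, d.year, count(*) as violent_incidents
from Incident_Fact f
join Precinct p on p.precinct_id = f.precinct_id
join Date d     on d.date_id     = f.date_id
join Crime c    on c.crime_id    = f.crime_id
where c.violent = 'violent'
group by p.station_num, d.year
order by p.station_num, d.year;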

Standard reports are typically parameter-driven with a predefined, preformatted output, and they provide users with a core set of information about specific areas of the business. They are generally easy to use and geared toward non-technical users. They provide a reference point for metrics and can be used to verify the metrics of other types of business intelligence applications [2].

Analytic applications are managed sets of reports targeted at a specific business process that encapsulate domain-specific expertise to analyze and interpret that business process and solve a specific problem. They often include complex, code-based algorithms or data mining models to help identify underlying issues or opportunities. Some analytic applications allow the user to make changes to operational systems based upon insights gained from the application [2].

Dashboard applications are consolidated information applications that analyze data from multiple business processes, providing status reporting and alerts across multiple data sources at a high level with the ability to drill down to more detailed data. They are generally geared toward a specific type of user within an organization and give that user the ability to spot trends in the data and then drill down to investigate the cause of the trend.

Operational intelligence applications leverage the historical context of multiple business processes by utilizing the data warehouse to assist users in operational decision making, often using data mining models to help identify patterns and opportunities [2].

Data mining applications assist in identifying patterns and relationships that exist in the data to
find opportunity or predict behavior. They are often embedded in other types of business intelligence
applications to provide the applications with additional functionality [2].

Visual Analytic Reports (Tableau)

Average Response Time and Average Time on Scene by Year

This report shows the average response time and average time on scene for the MPD by year, with the functionality to drill down into month and day.
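The underlying aggregation for this report could be expressed roughly as the following query (a sketch; the report itself was built in Tableau against the warehouse):

-- Average response time and average time on scene by year (drill down by
-- adding d.month and d.day to the select list and group by clause).
select d.year,
       avg(cast(f.response_time as float)) as avg_response_time,
       avg(cast(f.time_on_scene as float)) as avg_time_on_scene
from Incident_Fact f
join Date d on d.date_id = f.date_id
group by d.year
order by d.year;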

Number of Crimes by Category per Precinct by Year

This report shows the number of crimes committed in each precinct by year, sorted by misdemeanor and felony, with the ability to drill down into the class of the offense (Class A, Class B, Class C) and into the violent status (violent, nonviolent). In addition to drilling down into the details of the crimes committed, this report has the functionality to drill down into month and day.

Number of EMT per Precinct by Year

This report shows the number of EMT each precinct used by year and has the functionality to drill down into month and day.

Average Response Time by Precinct

This report shows the average response time for each precinct and includes the functionality to drill down into month and day.

Crimes Committed in 2019

This report shows the crimes committed in 2019 by name, with the functionality to drill down into crime name and into month and day.

Reports Exported from Tableau

Average Response Time by Precinct

Average Response Time and Average Time on Scene by Year



Number of Crimes by Category per Precinct by Year



Violent Crime per Precinct by Year



Crimes Committed in 2019



Number of EMT per Precinct by Year



References
[1] D. Walker, “Gathering Business Requirements for Data Warehouses,” LinkedIn SlideShare, 29-Jul-2012. [Online]. Available: https://www.slideshare.net/datamgmt/gathering-business-requirements. [Accessed: 30-Apr-2020].

[2] R. Kimball, The Data Warehouse Lifecycle Toolkit, 2nd ed. Indianapolis: Wiley, 2008.

[3] P. Pathak, “ETL - Understanding It and Effectively Using It,” Medium, 07-Jan-2019. [Online]. Available: https://medium.com/hashmapinc/etl-understanding-it-and-effectively-using-it-f827a5b3e54d. [Accessed: 01-May-2020].

[4] “Aggregate Fact Tables or Cubes,” Kimball Group. [Online]. Available: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/aggregate-fact-table-cube/. [Accessed: 01-May-2020].
