
Our chosen dataset, "Electric Vehicle Population," is sourced from the U.S. Government's open
data website. It covers Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric
Vehicles (PHEVs) registered through the Washington State Department of Licensing (DOL). With over
153,000 tuples, the dataset is publicly accessible, making it an ideal resource for our project.

We selected this dataset due to its relevance to the critical issue of global warming. As climate
change poses a significant environmental concern, the transition from gasoline vehicles to
electric vehicles is seen as a pivotal step in mitigating carbon emissions. By working with a
dataset that captures information on electric vehicle registrations, we aim to contribute to the
broader understanding of this eco-friendly shift.

The dataset encompasses key attributes, including:

• VIN (Vehicle Identification Number): A unique identifier for each vehicle.
• Model Year: The manufacturing period of a specific vehicle model.
• Make: The name of the vehicle manufacturer.
• Model: A distinct design or version within a manufacturer's lineup.
• EV Type: Indicates if the vehicle is a Battery Electric Vehicle (BEV) or a Plug-in Hybrid
Electric Vehicle (PHEV).
• CAFV (Clean Alternative Fuel Vehicle) Eligibility: Designation for vehicles meeting clean
fuel standards.
• Range: The electric range of the vehicle.
• Base MSRP (Manufacturer's Suggested Retail Price): The base price of the vehicle.
• DOL_ID: A unique ID for each registered vehicle.
• Location, State, County, City: Information on where the vehicle is registered.
• Postal Code: The postal code of the registration location.
• Electric Utility: Specifies the electric utility used in the location.

While exploring the dataset, our group encountered a column labeled "Census." Despite our
collective efforts, we could not determine its purpose or extract meaningful insights from it.
Consequently, we discarded the "Census" column from our analysis, as it did not contribute
substantively to the objectives of our project.
ER diagram:

Data Model Discussion


Based on normalization through 0NF, 2NF, 3NF, and BCNF, we ended up breaking the dataset into 8
tables.

List of Tables:
1. Vehicle (VIN, Model_year, Make, Model)
a. The primary key "VIN" alone determines the model year, make, and model of a vehicle,
so we created this table.

2. EVtype (EV_ID, EVtype)
a. The primary key "EV_ID" alone determines EVtype.
b. We deliberately created a separate table for electric vehicle types rather than
storing this information directly in the main dataset, because it offers
flexibility for future changes. For instance, if the nomenclature
for Battery Electric Vehicles changes from "BEV" to "FBEV"
(Fully Battery Electric Vehicle), the adjustment is made in one place.
This avoids modifying each individual record in
the primary dataset and simplifies database management and updates.
3. CAFV (CleanAF_ID, cleanAF)
a. The primary key "CleanAF_ID" alone determines cleanAF.
b. We created a separate table for the same reason as for the EVtype
table.
4. VehicleDetail (VIN, Make, Model, Model_year, Erange, price, EV_ID, cleanAF_ID)
a. In this table, VIN, Model_year, Make, and Model together form the composite primary key.
Only in combination do they uniquely identify a vehicle's electric range, base MSRP,
CAFV eligibility, and electric vehicle type.
b. Attributes:
i. Erange (Electric Range): Represents the distance a vehicle can travel on
electric power alone.
ii. price: The base Manufacturer's Suggested Retail Price (MSRP) of the
vehicle.
iii. EV_ID: A foreign key linking to the "EVtype" table, indicating the type of
electric vehicle (BEV or PHEV).
iv. cleanAF_ID: A foreign key linking to the "CAFV" table, which holds the Clean
Alternative Fuel Vehicle (CAFV) eligibility and so provides the vehicle's environmental
classification.
5. Location (DOL_ID, Longitude, Latitude, State, County, City, Postal code, UtilID)
a. The "Location" table is a key component of our database schema. It provides the
geographical placement and utility details for each registered electric vehicle.
Its main aspects are:
b. DOL_ID: Using "DOL_ID" as the primary key ensures that each location record is
uniquely identified within the "Location" table.
c. Attributes:
i. Longitude and Latitude: Geographical coordinates specifying the location of the
registered vehicle.
ii. State, County, City, and Postal Code: Information about the location where the
vehicle is registered.
iii. UtilID: Utility ID, which may serve as a foreign key linking to information about
the electric utility used in that location.
6. District (DOL_ID, Longitude, Latitude, LD)
a. Contains geographical details (Longitude, Latitude) and LD (District information)
for each location. This table structure allows for efficient organization and
retrieval of data related to districts, utilizing the DOL_ID as a primary key for data
integrity.
7. Utility (UtilityID, UtilName)
a. This table structure is designed to efficiently manage and retrieve data
related to electric utilities, utilizing UtilityID as the primary key for data
integrity.
8. VehicleLocation (VIN, DOL_ID)
a. A junction table that connects each vehicle to the location where it was registered.
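
The eight tables above can be sketched as SQL DDL. The following is a minimal illustration using Python's built-in sqlite3 module, not the schema we actually submitted: the column types, the foreign key clauses, the column name Postal_code, and the helper names create_schema and rename_ev_type are all assumptions made for this example. It also shows the "BEV" to "FBEV" rename from table 2 as a single-row update.

```python
import sqlite3

# Assumed column types and constraints; only table and column names come from the list above.
SCHEMA = """
CREATE TABLE Vehicle (
    VIN TEXT PRIMARY KEY, Model_year INTEGER, Make TEXT, Model TEXT
);
CREATE TABLE EVtype (EV_ID INTEGER PRIMARY KEY, EVtype TEXT);
CREATE TABLE CAFV (CleanAF_ID INTEGER PRIMARY KEY, cleanAF TEXT);
CREATE TABLE VehicleDetail (
    VIN TEXT, Make TEXT, Model TEXT, Model_year INTEGER,
    Erange INTEGER, price REAL,
    EV_ID INTEGER REFERENCES EVtype(EV_ID),
    cleanAF_ID INTEGER REFERENCES CAFV(CleanAF_ID),
    PRIMARY KEY (VIN, Make, Model, Model_year)
);
CREATE TABLE Utility (UtilityID INTEGER PRIMARY KEY, UtilName TEXT);
CREATE TABLE Location (
    DOL_ID INTEGER PRIMARY KEY,
    Longitude REAL, Latitude REAL,
    State TEXT, County TEXT, City TEXT, Postal_code TEXT,
    UtilID INTEGER REFERENCES Utility(UtilityID)
);
CREATE TABLE District (
    DOL_ID INTEGER PRIMARY KEY REFERENCES Location(DOL_ID),
    Longitude REAL, Latitude REAL, LD TEXT
);
CREATE TABLE VehicleLocation (
    VIN TEXT REFERENCES Vehicle(VIN),
    DOL_ID INTEGER REFERENCES Location(DOL_ID),
    PRIMARY KEY (VIN, DOL_ID)
);
"""


def create_schema(conn):
    """Create all eight tables on the given connection."""
    conn.executescript(SCHEMA)


def rename_ev_type(conn, old_name, new_name):
    """Rename an EV type in the lookup table; VehicleDetail rows are untouched."""
    conn.execute("UPDATE EVtype SET EVtype = ? WHERE EVtype = ?", (new_name, old_name))


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO EVtype VALUES (1, 'BEV')")
rename_ev_type(conn, "BEV", "FBEV")
print(conn.execute("SELECT EVtype FROM EVtype WHERE EV_ID = 1").fetchone()[0])  # prints FBEV
```

Because VehicleDetail stores only EV_ID, every vehicle of that type picks up the new name through a join without any change to its own row.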

Cardinality and Participation:


• The relationship between "Vehicle" and "Location" is many-to-one: multiple vehicles can be
registered to one specific "Location". The "Vehicle" entity has total participation with the
"Location" entity, as every vehicle must be registered to a location.
• The relationship between "Location" and "Utility" is one-to-many: multiple utility providers
can serve a specific location. The "Utility" entity has total participation with the
"Location" entity, which tells us that every utility provider must serve a location to be in
the dataset.
• We established a one-to-one relationship between the "Location" and "District" entities, with
partial participation for both. This relationship indicates that one location contains one district.
• There is a one-to-one relationship between the "VehicleDetail" and "CleanAFV" entities, with
total participation for both, which means that every vehicle falls into one of the three
CAFV eligibility kinds.
• There is a one-to-one relationship between the "VehicleDetail" and "EVtype" entities, with
total participation for both, which means that every vehicle falls into one of the two
vehicle types: Plug-in Hybrid Electric Vehicle (PHEV) or Battery Electric Vehicle
(BEV).

Challenges and Changes from Part 1 to Part 2:

Changes:
• We renamed the "EVData" table from part 1 to "VehicleLocation" in part 2 and removed
all attributes other than VIN and DOL_ID, as the other attributes turned out to be
unrelated to the table.
• We also removed the "AltFuel" table in part 2 because CAFV eligibility could not be determined
from Make, Model, and Model Year.
• We incorporated CAFV into the "VehicleDetail" table because the primary key of
"VehicleDetail" determines the CAFV eligibility of a vehicle.
• We added two more tables in part 2, "EVtype" and "CAFV". With these tables, changing a
value no longer requires going through the entire "Vehicle" table; instead, we change it
once in the relevant table, and every related row in the other tables reflects the
change.
• We removed the census attribute because it is not related to any other attribute, and the
dataset does not explain what it means or how it was collected.
• Our data did not fit cleanly into a relational model, and we had to make quite a few
adjustments, as described in the points above. Several VINs were registered to multiple
locations, which is not possible in reality since a VIN is unique. We therefore removed the
faulty and duplicated tuples, which reduced the number of tuples in part 2 compared to
part 1.
• After deleting these rows, we were left with approximately 9,654 rows, compared to the original
153,000+ rows.
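
The check for VINs registered to more than one location can be expressed as a grouped query. The sketch below is one possible way to do it, not a record of the exact steps we ran: the staging table name RawRegistration, the helper names, and the sample rows are hypothetical and made up for illustration.

```python
import sqlite3


def vins_with_multiple_locations(conn):
    """Return (VIN, number of distinct DOL_IDs) for VINs tied to more than one location."""
    return conn.execute(
        """
        SELECT VIN, COUNT(DISTINCT DOL_ID) AS n_locations
        FROM RawRegistration
        GROUP BY VIN
        HAVING COUNT(DISTINCT DOL_ID) > 1
        ORDER BY VIN
        """
    ).fetchall()


def drop_conflicting_vins(conn):
    """Delete every row whose VIN appears with more than one location."""
    conn.execute(
        """
        DELETE FROM RawRegistration
        WHERE VIN IN (
            SELECT VIN FROM RawRegistration
            GROUP BY VIN
            HAVING COUNT(DISTINCT DOL_ID) > 1
        )
        """
    )


def build_demo_db():
    """Build a tiny in-memory table with made-up rows (not taken from the dataset)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE RawRegistration (VIN TEXT, DOL_ID INTEGER)")
    conn.executemany(
        "INSERT INTO RawRegistration VALUES (?, ?)",
        [("VIN_A", 1), ("VIN_A", 2), ("VIN_B", 3), ("VIN_C", 4)],
    )
    return conn


conn = build_demo_db()
print(vins_with_multiple_locations(conn))  # [('VIN_A', 2)]
drop_conflicting_vins(conn)
print(conn.execute("SELECT COUNT(*) FROM RawRegistration").fetchone()[0])  # 2
```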

Challenges:
• The original data represented locations as POINT (-120.524012 46.5973939), which made it
difficult to extract the longitude and latitude values. Since no one in
our group was proficient in Excel, we sought the assistance of ChatGPT. ChatGPT provided a
helpful method that needed a few adjustments before it worked. With these
changes, we successfully split the data into longitude and latitude components.
• We also struggled to understand how a utility is related to a location and whether multiple
utilities can serve a particular location. Researching the dataset in depth helped us resolve
this issue and establish the relationship. In addition, the many faulty tuples in
our dataset caused problems while building our relational model; we identified the faulty data
points, removed them, and modified our dataset accordingly.
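
For comparison with the spreadsheet approach, the POINT strings can also be split programmatically. This is a minimal sketch, assuming every value follows the "POINT (longitude latitude)" pattern shown above; the function name parse_point is hypothetical, and malformed values simply return None.

```python
import re

# Matches "POINT (<x> <y>)" with optional minus signs and decimal parts.
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_point(text):
    """Return (longitude, latitude) as floats, or None if the text does not match."""
    match = POINT_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_point("POINT (-120.524012 46.5973939)"))  # (-120.524012, 46.5973939)
print(parse_point("not a point"))  # None
```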

Our only regret lies in the decision to separate the "Vehicle" and "VehicleDetail" tables. Initially, we
opted for this division under the assumption that every car with the same make, model, and model year
should uniformly provide vehicle specifications such as price and range. However, upon completing the
entire database, we discovered instances where not all vehicles with the same make, model, and model
year had registered prices or ranges. In hindsight, if given the opportunity to redesign the database, we
would prefer to consolidate "Vehicle" and "VehicleDetail" into a single table to save space and make it
more efficient.

We could have designed the data model differently, as described in the preceding paragraph.
This change might have made the data structure more efficient and simpler to construct. It could also have
saved time during the creation and insertion of data into the database by reducing the number of tables.
However, despite these considerations, we are content with the current model as it fulfills the objective of
retrieving all the necessary information.

Interesting Queries:
• Which city in Washington State has the most electric vehicles?
o This query returns the name of the city that has the most electric vehicles.
o This was an interesting query because our group wanted to find out which city is
helping the most in terms of reducing carbon emissions.
• List the cities that have the most electric vehicles
o This query is similar to the previous one. What makes it interesting is that it
lets the user enter the number of cities to view and displays the list ranked from
most to least.
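
Both queries can be covered by one parameterized statement. The following is a sketch of how such a query might look against the Location and VehicleLocation tables, not the exact SQL we submitted. The helper names and the demo rows (CityA, CityB, CityC) are made up for illustration, and the demo tables contain only the columns the query needs.

```python
import sqlite3

# Count vehicles per city and keep the top N; ties are broken alphabetically.
TOP_CITIES_SQL = """
SELECT l.City, COUNT(*) AS vehicle_count
FROM VehicleLocation AS vl
JOIN Location AS l ON l.DOL_ID = vl.DOL_ID
GROUP BY l.City
ORDER BY vehicle_count DESC, l.City ASC
LIMIT ?
"""


def top_cities(conn, n):
    """Return the n cities with the most registered vehicles as (city, count) rows."""
    return conn.execute(TOP_CITIES_SQL, (n,)).fetchall()


def build_demo_db():
    """Build a tiny in-memory database with made-up rows (not taken from the dataset)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Location (DOL_ID INTEGER PRIMARY KEY, City TEXT);
        CREATE TABLE VehicleLocation (VIN TEXT, DOL_ID INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO Location VALUES (?, ?)",
        [(1, "CityA"), (2, "CityA"), (3, "CityB"), (4, "CityC")],
    )
    conn.executemany(
        "INSERT INTO VehicleLocation VALUES (?, ?)",
        [("V1", 1), ("V2", 2), ("V3", 3), ("V4", 4), ("V5", 1)],
    )
    return conn


conn = build_demo_db()
print(top_cities(conn, 1))  # [('CityA', 3)]
print(top_cities(conn, 2))  # [('CityA', 3), ('CityB', 1)]
```

Calling top_cities with n = 1 answers the first question, and any larger n gives the ranked list from the second.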

The dataset that we used does require a relational database. Using the Microsoft Access database system to
model this data would have been much easier and more convenient, as its graphical user interface allows users
to create and manage databases without extensive programming knowledge. For the same reason,
creating the interesting queries in Microsoft Access would have been much
simpler, and its user-friendliness would have
allowed us to write more varied and interesting queries with ease.

Given our group's limited background in databases, we consider the queries we formulated to be good,
although there may be some unnoticed flaws. This database could serve as an effective learning tool in
COMP3380, showcasing common mistakes to avoid when writing SQL queries. One specific challenge
future students might explore is determining which utility company has the most consumers.

Our dataset posed a challenge in this regard, as the utility names were formatted inconsistently. For
instance, some names appeared alone, like "PUGET SOUND ENERGY INC," while others were
combined, such as "PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)" and "CITY OF
SEATTLE - (WA)|CITY OF TACOMA - (WA)." The difficulty arose when trying to count instances of a
specific utility like "PUGET SOUND ENERGY INC" while accounting for the different separators
("|" and "||"). In Java, we could use a split function, but achieving this in SQL felt challenging.

Future students may find this a compelling challenge, encouraging them to devise their own method for
accurately counting utility company consumers despite variations in the dataset's formatting.
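
As a starting point for that challenge, the splitting can be done outside SQL. The sketch below is one possible approach, assuming the only separators are runs of "|" as in the examples above; the function names are hypothetical, and each utility is counted at most once per row.

```python
import re
from collections import Counter


def split_utilities(field):
    """Split a raw utility field on one or more '|' characters and trim whitespace."""
    return [name.strip() for name in re.split(r"\|+", field) if name.strip()]


def count_utilities(fields):
    """Count in how many rows each utility name appears."""
    counts = Counter()
    for field in fields:
        counts.update(set(split_utilities(field)))
    return counts


examples = [
    "PUGET SOUND ENERGY INC",
    "PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)",
    "CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA)",
]
counts = count_utilities(examples)
print(counts["PUGET SOUND ENERGY INC"])  # 2
print(counts["CITY OF TACOMA - (WA)"])  # 2
```

Doing the same inside SQL would require a string-splitting construct that depends on the database system in use.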
