
Compiling and Pre-Processing Geospatial Data Sets

Automated Using Python

Benjamin Felton, EIT


MS Graduate Student
University of Virginia
Charlottesville, VA
brf2er@virginia.edu

Abstract—Data management is an important aspect of any project. This process can include the maintenance and formatting of raw data. This study focuses on developing a Python-scripted tool to automate the acquisition and pre-processing steps of a project. The particular project of interest involves the delineation of wetlands using the following ancillary data: SSURGO, NWI, and NHD. The tool uses the appropriate libraries to download the zipped data from federal FTP sites, extract the data into an appropriate folder structure, process the data by projecting and clipping it to a desired datum and spatial scope, and create a compiled geodatabase of these geospatial data sets for use in ArcGIS.

Keywords—data management; SSURGO; NWI; NHD; FTP; ArcGIS

I. INTRODUCTION

Civil engineering encompasses a wide range of fields, spanning from roadway design in transportation to stormwater control in hydrology. Irrespective of the particular field within civil engineering, there will most likely be concerns about environmental impact during the design and planning phases of a project. Multiple considerations are needed to ensure that the ecology is maintained and that projects are as non-invasive as possible. The particular consideration of interest in this paper is the preservation of wetlands.

Wetlands are an important natural feature inherently capable of many beneficial hydrological and environmental processes. Some of these benefits include stormwater runoff control, effluent and sediment control, and providing habitat for wildlife and plants. Historically, wetlands were thought to be bastions of disease and were destroyed or repurposed for agricultural reasons. Because of this practice, approximately half of America's original wetlands no longer exist. The need for wetlands preservation quickly became apparent, and a number of agencies took interest in the matter.

To adequately prevent the destruction of wetlands, we need to be able to identify their locations accurately and easily. The U.S. Fish and Wildlife Service maintains a current database of wetlands called the National Wetlands Inventory (NWI). However, this inventory has proven to be insufficient at times, requiring further identification techniques like ground surveying, which can be costly and time consuming. This has caused a push in research towards using remote sensing techniques for wetlands delineation.

Remote sensing leverages technological hardware, particularly aerial sensors, to observe and record data about Earth's surface, oceans, and atmosphere by means of electromagnetic radiation. From remote sensing we are able to obtain data sets useful for wetlands delineation, such as Digital Elevation Models (DEMs), Light Detection and Ranging (LiDAR), multispectral satellite imagery, radar, and the Normalized Difference Vegetation Index (NDVI). Other ancillary data sets, not necessarily obtained by means of remote sensing, are also required for wetlands delineation. These include the National Wetlands Inventory (NWI), Soil Survey Geographic (SSURGO) soils data, and the National Hydrography Dataset (NHD).

A variety of formats are used to represent this geospatial information, ranging from polylines and polygons to rasters. Not only are there different formats, but also different resolutions, and the data are packaged in different ways. Because each agency uses a different schema to organize and distribute its data, it would be extremely beneficial to develop a script that takes all of this data and preprocesses it so that everything is formatted in an easily readable and consistent way, facilitating quicker and easier wetland delineations.

II. OBJECTIVE

The main objectives of this study are to develop a tool that will compile a wide range of data sets, preprocess those data sets into a usable and consistent format, and organize the downloaded and processed data sets in a systematic and readable hierarchy. The tool's primary purpose is to serve as a starting point for an expanded project: the development of an ArcGIS tool to delineate wetlands. In the interest of time, the following three data sets have been selected for inclusion in this study: NWI, SSURGO, and NHD.
As these data sets have a number of different sources, the tool will need the ability to use a source's established web application programming interface (API) or, if none exists, to automate the data acquisition procedure almost completely, requiring as little user input as possible. Preprocessing these data sets will require the appropriate geospatial data processes to ensure that all data sets have consistent attributes. Finally, the organization will play an important role in separating the user from the tool's processes, while making it easier for the user to give the tool the required inputs.

III. METHODS

A. Language

Selecting the appropriate programming language was an important step for this study. Considering that this study's tool is intended for use in another project to develop a tool for ArcGIS, it seemed pertinent to use the programming language required by that system. Since Python is the intended language for the development of the wetlands delineation tool, this study's tool was also written in Python.

The tool is segmented into two separate scripts: a main script that queries the user for the appropriate preliminary information, and a separate module that contains all of the procedures of the script. This segmentation between querying and computation reduces the risk of overwhelming the user with portions of code that should remain unchanged.

B. Preliminary Requirements

In order for the script to run without error, some preliminary information and a file are required from the user. The user must first provide the state and county of interest, both in the appropriate format. The desired state must be given as the state's abbreviation, e.g., NC, NY. The desired county must be typed out in full, including the term "County" and maintaining capitalization, e.g., Hyde County, Lexington County. The user must also provide a Comma-Separated Values (.csv) file containing the Federal Information Processing Standard (FIPS) codes for each county within a state, which can be downloaded from the following site:
https://www.census.gov/geo/reference/codes/cou.html

C. Data Acquisition

To accomplish the data acquisition portion of this project, the first step was to identify any web APIs used by the agencies in charge of the desired data sets. These groups include the United States Geological Survey, the United States Fish and Wildlife Service, and the Natural Resources Conservation Service. Each of these entities maintains a File Transfer Protocol (FTP) or web site at the following locations for the indicated data sets:

NHD: ftp://nhdftp.usgs.gov/DataSets/Staged/States/FileGDB/
SSURGO: http://websoilsurvey.sc.egov.usda.gov/App/WebSoilSurvey.aspx
NWI: http://www.fws.gov/wetlands/data/State-Downloads.html

Using these designated sites, the developed Python script retrieves the data sets using the urllib2 library. With this library, the script establishes a connection to the particular site location, at which point it opens the indicated file for reading. The script then reads the file in defined blocks and writes these blocks one after another to the same directory as the script, using the built-in .read() and .write() functions. The script also contains a file-transfer progress update that prints the status to the Integrated Development Environment (IDE) console using the .info() and .getheaders() functions. The block size was defined as 250,000 bytes, a value found empirically to be appropriate: a block too small would cause a remote termination of the connection with the host server, while a block too large would not give the user frequent enough updates on the file-transfer progress.

The NHD and NWI sites can supply the required data using only the user-defined state, which also means that these data sets span the entirety of the state. The SSURGO data, however, also requires the county of interest in order to determine the county's FIPS code and to call the appropriate URL.
To accomplish this, a Python function is used, with the csv library, to unpack the file and generate a Python dictionary containing the county names as keys and the FIPS codes associated with those counties as values. The dictionary is then used to look up the FIPS code for the user-defined county.

Each of these data sets is downloaded as a zip file. To use the data sets in the pre-processing step, they must first be extracted. To accomplish this, the zipfile library is used to extract the data, while the os library is used to specify the path for the extracted data sets.

D. Pre-Processing

The pre-processing portion of the script was initially developed using ArcGIS's ModelBuilder tool. ModelBuilder was used to outline the operations used to process the data. The particular tools of interest are Project, Clip, Create Personal Geodatabase, Feature Class to Geodatabase, and Table to Geodatabase. The Project tool was used to convert all data sets to the North American Datum 1983 (NAD83), since the intended use of this study's tool is the download and pre-processing of any county within any state in North America. The Clip tool was used to reduce the NHD and NWI data sets to the particular county of interest. Create Personal Geodatabase initializes the geodatabase, into which Feature Class to Geodatabase and Table to Geodatabase append the data sets.

The model developed in ModelBuilder was exported to a Python script, which was altered and pasted into the main script. This portion of the script uses the ArcGIS Python library, arcpy, to access and run the Project, Clip, Create Personal Geodatabase, Feature Class to Geodatabase, and Table to Geodatabase geoprocessing tools. The output is a geodatabase file (.gdb) containing the NWI, NHD, and SSURGO data within the scope of the user-defined county.

IV. CONCLUSION

The development of a tool to acquire and pre-process data sets used for wetlands delineation was successful. However, due to time limitations, only certain data sets could be included in the tool. The tool shows that by using Python to automate the tedious processes of data acquisition and pre-processing, time and effort can be saved by limiting redundant tasks with minimal user input.

As with any automated task, a number of assumptions are made that could prove problematic in the future. For example, the website URLs used to download the data are hard-coded into the script's functions. If one of the sources decides to change its web location, the script will break and no longer function properly. Similarly, if a source were to change the naming convention of its zip files, the script would also break. Including an intelligent algorithm that would search the potential files for the most likely desired data set is not impossible, but it is outside the scope of this study; it could be a rigorous procedure that would make the script more robust.
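The county-to-FIPS lookup described in Section III can be sketched as follows. The column layout assumed here (state abbreviation, state FIPS, county FIPS, county name, class code) follows the Census county codes file, but it is an assumption and should be verified against the downloaded .csv; the sample rows are for illustration.

```python
import csv
import io

def load_fips_codes(csv_file, state):
    """Build a dictionary mapping county name -> county FIPS code for one state.
    Assumed columns: state abbrev, state FIPS, county FIPS, county name, class code."""
    fips = {}
    for row in csv.reader(csv_file):
        if row and row[0] == state:
            fips[row[3]] = row[2]
    return fips

# Sample rows in the assumed layout; the user supplies the state
# abbreviation and the full county name, as Section III requires.
sample = io.StringIO(
    "NC,37,095,Hyde County,H1\n"
    "SC,45,063,Lexington County,H1\n"
)
print(load_fips_codes(sample, "NC"))  # {'Hyde County': '095'}
```

Keeping the codes as strings preserves leading zeros, which matter when the FIPS code is spliced into a download URL.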
Even with the successful development of this script, there is room for improvement before its full potential is reached. First, other data sets, such as DEMs, multispectral satellite imagery, radar, and NDVI, will need to be incorporated. Second, since the spatial area of interest for this tool does not extend beyond the United States, building in a comprehensive list of all FIPS codes would be beneficial, as the user would no longer have to provide a .csv file. Third, implementing catches for error codes would benefit users without programming experience. These tasks would improve the functionality and ease of use of the script in its current condition. Finally, wrapping the code in a loop would allow the user to define multiple states and counties of interest.

This tool provides useful functionality for compiling multiple data sets from federal FTP sites and remote sensing technologies, facilitating quicker and easier acquisition and pre-processing of data for use within a wetlands delineation model. The tool uses the appropriate libraries to access, download, and extract zip files from the appropriate sites, followed by projecting and clipping the data sets so that they have a consistent datum and spatial span. The benefit of this tool is that it automates the data acquisition and pre-processing steps: the user defines the area of interest and lets the script run over a period of time, freeing the user from navigating through multiple sites that would demand much more user input and time.

REFERENCES

[1] Soil Survey Staff, Natural Resources Conservation Service, United States Department of Agriculture. Web Soil Survey. Available online at http://websoilsurvey.nrcs.usda.gov/. Accessed 12/9/2014.
[2] National Wetlands Inventory Staff, U.S. Fish and Wildlife Service, U.S. Department of the Interior. National Wetlands Inventory. Available online at http://www.fws.gov/wetlands/data/Data-Download.html. Accessed 12/9/2014.
[3] Python Software Foundation. Python Language Reference, version 2.7. Available online at http://www.python.org.
[4] ESRI 2011. ArcGIS Desktop: Release 10. Redlands, CA: Environmental Systems Research Institute.
[5] U.S. Census Bureau. (2014). 2010 FIPS Codes for Counties and County Equivalent Entities. Available online at https://www.census.gov/geo/reference/codes/cou.html.
