Seismic Data Management and beyond
on Amazon Web Services (AWS)
Osokey
Published: 15th September 2019
Updated: 30th October 2019
Contact: James Selvage (james@osokey.com)
Visit: https://osokey.com
1 / 24
Copyright 2019 Osokey Ltd. All Rights Reserved.
Contents
Abstract 3
1. Introduction 4
a. What is seismic data? 5
3. Implementation 15
4. Performance 16
6. Conclusion 22
Abstract
SEG-Y and SEG-D are oil & gas industry standard file formats for seismic data. This
whitepaper describes a serverless solution for cloud-based management of seismic
data that enables a lift and shift of the SEG-Y or SEG-D format data into Amazon
Simple Storage Service (S3). The event-driven architecture can ingest seismic data
at any scale and automatically generates a file inventory that can be searched using
Amazon Athena. Each seismic data file progresses through custom code running on
AWS Lambda that automatically captures metadata and stores it in Amazon
DynamoDB. For each SEG-Y file, a trace level index is created to enable the
architecture to be extended beyond data management, e.g. viewing seismic sections
and gathers or transforming seismic data into a streaming format for on-premise
geoscience applications. Raw read performance from multiple AWS Lambda
functions reading the same 1TB SEG-Y file achieved an aggregate read performance
of 42 GB/s. The architecture enables on-demand compression of seismic data using
parallel AWS Lambda functions to perform read, compress and write operations,
achieving a rate of 2.8 GB/s. It is shown that Amazon S3 Batch Operations provides
a cost-effective way to bulk process files, and it is used to perform duplicate
detection of 592,921 SEG-D files, with a total AWS cost of less than $15 USD.
1. Introduction
SEG-Y and SEG-D are oil & gas industry standard file formats. A major oil & gas
company is likely to have seismic data spanning decades and consuming petabytes
(PB) of storage. Furthermore, this seismic data will be stored across a variety of
different storage media depending on operational requirements. Table 1 shows
the typical storage media that are used:
Storage medium Operational requirement
Figure 1 - The architecture described in this whitepaper connects different AWS
Services to create a serverless seismic data management solution.
a. Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers
industry-leading scalability, data availability, security, and performance. In this
section we will see that S3 is more than just storage and provides many benefits for
seismic data management.
Upload
Each SEG-Y or SEG-D file is stored as an object on S3 within an S3 Bucket. Seismic
data can be added to the S3 bucket via the internet or by using an AWS Snowball.
Standard company folder structures can be mirrored to help organise files within the
S3 Bucket. For example, by storing SEG-Y data with the following prefixes:
<country>/<survey name>/<attribute>/filename.segy
It is possible to drill-down to files by prefix directly from the AWS Management
Console for S3. Figure 2 shows an example of public domain seismic data from the
Equinor Volve Dataset stored in an S3 Bucket, 9f7f65067d31-oso-segy, under the Key
prefix:
nor/equinor/ST0202vsST10010_4D/Stacks/<filename>.segy
The combination of bucket and key define a unique URL for a given object, e.g.
https://9f7f65067d31-oso-segy.s3-eu-west-1.amazonaws.com/nor/equinor/S
T0202vsST10010_4D/Stacks/ST0202ZDC12-PZ-PSDM-KIRCH-FULL-D.MIG_FI
N.POST_STACK.3D.JS-017534.segy
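As a sketch, the prefix convention above can be captured in a small helper. The function name is hypothetical (not part of any AWS SDK); the bucket, region and key components are the ones from the example:

```python
# Hypothetical helper illustrating the <country>/<survey name>/<attribute>/
# prefix convention described above.
def s3_object_url(bucket, region, *prefix_parts):
    """Join prefix components into an S3 key and the matching object URL."""
    key = "/".join(prefix_parts)
    return key, f"https://{bucket}.s3-{region}.amazonaws.com/{key}"

key, url = s3_object_url(
    "9f7f65067d31-oso-segy", "eu-west-1",
    "nor", "equinor", "ST0202vsST10010_4D", "Stacks",
    "ST0202ZDC12-PZ-PSDM-KIRCH-FULL-D.MIG_FIN.POST_STACK.3D.JS-017534.segy",
)
```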
Figure 2 - Example from AWS Management Console of S3 showing SEG-Y files. These
files are stored in the S3 Glacier Storage Class.
The S3 Bucket, 9f7f65067d31-oso-segy, is not publicly accessible so following the
above link will result in an Access Denied Error. In this architecture seismic data
within this bucket is kept secure by limiting access with AWS Identity and Access
Management Permissions, which will be described in more detail in the Security,
Permissions & Activity Logging and Implementation sections. The objects are also
encrypted at rest and in transit.
Multipart objects
Amazon S3 supports multipart uploads of SEG-Y files, which means that large
objects are stored in many smaller parts. The number of parts can be determined
from the object’s ETag:
For multipart uploads the ETag is the MD5 hexdigest of each part’s MD5 digest
concatenated together, followed by the number of parts separated by a dash.
For example, a 119 MB
SEG-Y file in our S3 Bucket has the ETag a8abbeb338a3e0f689186ef78f95e904-8; the
trailing -8 indicates that this object is made up of 8 parts. A 1TB SEG-Y file in our S3
Bucket is made up of 8097 parts (ETag 29014b75882070511aa863b2f90b2e37-8097).
This is handled transparently by Amazon S3: the files appear as single objects, and it
means that the SEG-Y is effectively “bricked” automatically. Therefore, Amazon S3 natively
supports multiple parallel reads of SEG-Y format objects. This will be shown in more
detail in the Extending the Architecture - Transforming SEG-Y data and Performance
sections.
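The part count can be recovered with a few lines of Python; this follows the ETag description above (a plain MD5 hex digest with no dash means a single-part upload):

```python
def multipart_count(etag):
    """Return the number of parts encoded in an S3 ETag.

    Multipart ETags end in '-<parts>'; a plain MD5 hex digest
    (no dash) means the object was uploaded as a single part."""
    etag = etag.strip('"')  # ETags are often quoted in API responses
    if "-" in etag:
        return int(etag.rsplit("-", 1)[1])
    return 1

multipart_count("a8abbeb338a3e0f689186ef78f95e904-8")     # 8
multipart_count("29014b75882070511aa863b2f90b2e37-8097")  # 8097
```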
Search
To enhance the search capabilities of the S3 Bucket, the architecture utilises daily S3
Inventory reports that can be queried using Amazon Athena. This means that individual
SEG-Y files can be found based on keywords in the object’s key, size and storage
class using SQL Queries. The search results can be downloaded as a CSV file.
Archiving
The storage class of an individual object can be changed depending on usage
requirements. In this architecture a mixture of S3 Standard, S3 Glacier and S3 Glacier
Deep Archive are utilised to accommodate the different operational requirements of
seismic data (Table 2). Using Amazon S3 in this way eliminates the need for
magnetic tape storage.
Figure 3 - Key-value tags can be added to objects. The osoArchive tag is used by a
Lifecycle Rule to transition the object to the Glacier Storage Class.
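A minimal sketch of such a tag-based Lifecycle Rule, expressed as the dictionary a boto3 `put_bucket_lifecycle_configuration` call would accept; the osoArchive tag name comes from Figure 3, while the rule ID and the immediate (0-day) transition are assumptions:

```python
# Sketch of a tag-based lifecycle rule; the osoArchive tag name is from
# the whitepaper, the rule ID and 0-day transition are assumptions.
lifecycle_rule = {
    "ID": "osoArchive-to-glacier",
    "Status": "Enabled",
    "Filter": {"Tag": {"Key": "osoArchive", "Value": "true"}},
    "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
}
# Applied with, e.g.:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="9f7f65067d31-oso-segy",
#     LifecycleConfiguration={"Rules": [lifecycle_rule]})
```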
In this architecture, IAM roles with policies that enable access
to the bucket are used to enable the Osokey AWS Account to read seismic data in a
customer’s AWS Account. This approach is described in the Implementation section.
Durability
Amazon S3 helps to ensure data durability by synchronously storing your data
across multiple facilities. In Figure 2, the bucket is located in the AWS Region,
eu-west-1. This region has three isolated availability zones and the seismic data in
the S3 Standard or S3 Glacier Storage Classes is redundantly stored within each
zone. This benefit is included in the cost of Amazon S3 per GB pricing.
S3 Batch Operations
Amazon S3 Batch Operations provides a way to bulk process objects stored on S3.
For example, you can copy each object to another bucket, set tags on
each object, restore each object from Glacier or invoke an AWS Lambda function on
each object. The latter operation can be used to perform a consistent data operation
on a seismic file. Osokey recently performed duplicate detection across 592,921
SEG-D files using Batch Operations. This is described in the Extending the
Architecture - Duplicate Detection section.
Events
The Amazon S3 notification feature enables you to take actions whenever specific
events happen on your buckets. In this architecture the events in Figure 4 are used to
trigger a Lambda function whenever objects with extensions .segy, .SEGY, .sgy or
.SGY are created in the bucket. The Lambda functions are in the Osokey AWS
Account and run custom Python code to automatically ingest the new SEG-Y files.
This includes the automatic extraction of pertinent metadata and creating a trace
level index.
Figure 4 - The S3 notification feature is used to trigger a Lambda function whenever a
new SEG-Y file is added to the S3 bucket.
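A minimal sketch of the event-triggered entry point, assuming the standard S3 notification event shape; the handler body is illustrative only, not Osokey's actual ingestion code:

```python
from urllib.parse import unquote_plus

SEGY_EXTENSIONS = (".segy", ".SEGY", ".sgy", ".SGY")

def lambda_handler(event, context):
    """Collect (bucket, key) pairs for newly created SEG-Y objects
    from an S3 notification event."""
    ingested = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if key.endswith(SEGY_EXTENSIONS):
            # The real pipeline would start metadata extraction and
            # trace indexing here.
            ingested.append((bucket, key))
    return ingested
```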
b. Amazon Athena
Amazon Athena is a serverless, interactive query service that can analyse data in
Amazon S3 using standard SQL. In this architecture Amazon Athena is used to enable
searching of the daily S3 Inventory reports. For example, the results in Figure 5,
formatted for display, come from an Athena query for object keys containing
“P000”, performed with the SQL query:
> SELECT * FROM <Athena Table> WHERE key LIKE '%P000%'
Figure 5 - The results from an Athena query based on a keyword search of seismic
filenames. The .CSV file has been formatted for display in a web browser.
Figure 6 shows a CSV file downloaded from the result of the Athena query:
> SELECT storage_class, count(*) FROM <Athena Table> GROUP BY
storage_class
This provides a way to audit how many objects are in a given storage class.
Figure 6 - A CSV downloaded from an Amazon Athena query that shows data by Amazon
S3 Storage Class.
c. AWS Lambda
AWS Lambda lets you run code without provisioning or managing servers. In this
architecture AWS Lambda functions run custom Python code and are triggered by
events, e.g. when a new seismic file is uploaded to the AWS S3 Buckets (Figure 7).
Multiple AWS Lambda functions are chained together in the ingestion pipeline shown
below. Multiple SEG-Y files are processed in parallel.
This provides a highly scalable and automated metadata extraction approach. It
creates the required metadata to start utilising the ingested seismic data. For
example, Osokey recently ingested over 500,000 SEG-D files uploaded using an AWS
Snowball with this architecture. These SEG-D files can be searched using AWS
Athena and transitioned to the Amazon S3 Glacier or Deep Glacier Storage Classes.
These high levels of automation enable Osokey to offer this seismic ingestion
service on a pay-as-you-go basis, starting at a cost of 0.24 USD per GB.
Figure 7 - Scalable SEG-Y ingestion service using AWS Lambda to run custom code.
Results from the Lambda functions are stored in the customer’s AWS Account.
d. Amazon DynamoDB
Amazon DynamoDB is a key-value and document database that delivers single-digit
millisecond performance at any scale. In this architecture Amazon DynamoDB
provides a flexible metadata store for ingesting seismic data. It is used for both
transient metadata created during ingestion and for the permanent storage of
metadata that enhances search capabilities and enables trace level indexing of a
SEG-Y or a SEG-D file. Like Amazon S3, all data that is stored in DynamoDB is
encrypted at-rest.
Figure 8 - Early metadata added to the seismic ingestion DynamoDB table. This
metadata is updated as the seismic data progresses through the ingestion pipeline.
Figure 9 - As a SEG-Y or SEG-D file progresses through the ingestion process,
additional metadata is captured in DynamoDB. Metadata captured from the Binary
Header is shown above.
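As an illustration of the kind of item shown in Figures 8 and 9, a metadata record might look like the following; the attribute names and values are assumptions, not Osokey's actual schema (boto3 represents DynamoDB numbers as `Decimal`):

```python
from decimal import Decimal  # boto3 stores DynamoDB numbers as Decimal

# Illustrative metadata item; attribute names and values are assumed.
item = {
    "Bucket": "9f7f65067d31-oso-segy",
    "Key": "nor/equinor/ST0202vsST10010_4D/Stacks/example.segy",
    "IngestTimestamp": "2019-09-15T00:00:00Z",
    # Values below would be read from the SEG-Y Binary Header:
    "SampleInterval": Decimal(4000),   # microseconds (4 ms)
    "SamplesPerTrace": Decimal(1001),
    "DataFormatCode": Decimal(5),      # 4-byte IEEE floating point
}
# table.put_item(Item=item) would write this during ingestion.
```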
To support different types of queries across metadata in the DynamoDB table, Global
Secondary Indexes (GSIs) are used. A GSI can contain a selection of attributes from
the main table, but organised by a different primary key. Up to 20 global secondary
indexes (default limit) can be created per table.
In Figure 10 a GSI is created based on the Bucket and Key attributes and projects a
subset of attributes from the table. This enables a query for SEG-Y files that begin
with “aus/Gippsland/”.
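Such a query can be sketched with DynamoDB's low-level expression syntax; the table name, index name and attribute names here are assumptions:

```python
# Sketch of the Figure 10 query; index and attribute names are assumed.
# "Key" is the GSI sort key, so begins_with() can match a key prefix.
query_kwargs = {
    "TableName": "seismic-metadata",
    "IndexName": "Bucket-Key-index",
    "KeyConditionExpression": "#b = :bucket AND begins_with(#k, :prefix)",
    "ExpressionAttributeNames": {"#b": "Bucket", "#k": "Key"},
    "ExpressionAttributeValues": {
        ":bucket": {"S": "9f7f65067d31-oso-segy"},
        ":prefix": {"S": "aus/Gippsland/"},
    },
}
# response = dynamodb.query(**query_kwargs)
```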
Figure 10 - Global Secondary Indexes (GSIs) are used to enable different types of
queries.
In Figure 11 a seismic data management table is formatted based on another GSI
query for SEG-Y data ingested between 15th December 2018 and 22nd December
2018. This GSI has a primary key and sort key based on the bucket and timestamp of
when a seismic data file was added.
Figure 11 - Global Secondary Index (GSI) query results can be formatted to produce
useful tables for seismic data management.
3. Implementation
The implementation of the architecture separates Osokey's code from our
customers' seismic data. This enables Osokey to update our code for all customers
and enables each customer to retain control of their data in their own AWS Account
(Figure 12).
Figure 12 - The architecture separates Osokey’s code from customers’ data by
connecting separate AWS Accounts using IAM permissions.
This permissions model is enabled through the use of AWS Identity and Access
Management (IAM). In each customer cloud account IAM roles are used to grant
cross-account access to Osokey. These roles have policies that limit access, by
Osokey, to the minimum permissions needed for Osokey to provide the seismic
ingestion service, i.e. the AWS Services and data that Osokey code can access.
Whenever Osokey’s code is invoked by a customer’s cloud account the appropriate
role is adopted to service the request and return metadata to the customer’s
Amazon S3 buckets and Amazon DynamoDB tables.
Figure 13 shows a summary from the IAM console of the Admin role that can be
adopted by Osokey to perform operations on the customer’s AWS Account. The Last
Accessed column is shown, along with the IAM policies that grant access to a given
AWS Service, e.g. Amazon DynamoDB.
Figure 13 - Customers can use the IAM Access Advisor for visibility on when the
Osokey AWS Account is accessing AWS Services in their AWS Account.
4. Performance
The architecture enables SEG-Y files to be read in parallel by multiple AWS Lambda
functions. For example, the 1TB SEG-Y file made up of 8097 parts was read with 740
Lambda functions. At peak, 500 Lambda functions were concurrently reading from
the SEG-Y object and achieved a peak aggregate read performance of 42 GB/s
(gigabytes per second).
If the submission of the Lambda functions is included, then the total system time to
have the entire SEG-Y file available in memory to perform operations on was 52
seconds, which equates to a system read performance of 19.7 GB/s.
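Parallel reads like this rely on S3 ranged GETs. A sketch of how an object can be split into ranges, one per Lambda function; the 128 MB part size is purely illustrative (the actual file's 8097 parts were defined by its multipart upload):

```python
def byte_ranges(object_size, part_size):
    """Split an object into inclusive (start, end) byte ranges, one per
    parallel ranged GET (HTTP header: Range: bytes=start-end)."""
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 1 TB object split into 128 MB ranges gives 8192 parallel reads:
parts = byte_ranges(1024**4, 128 * 1024**2)
len(parts)  # 8192
parts[0]    # (0, 134217727)
```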
Figure 14 - CloudWatch metrics can be used to produce custom dashboard
components.
Figure 15 - The seismic metadata and trace index capabilities are used to construct a
web-based data management portal on-the-fly.
Figure 16 - Map portal showing seismic trace outlines and other spatial information.
Spatial information is associated with each seismic data file during ingestion.
Figure 17 - Seismic section and metadata opened from the map to review. This uses
the trace level indexing to produce the familiar inline, crossline, random line, gather or
timeslice displays.
Figure 19 - Invocations of the AWS Lambda function against time for the duplicate
detection using S3 Batch Operations.
S3 Batch Operations was used to apply this Lambda function to 592,921 SEG-D files
(7,264 GB) to detect duplicates. Figure 19 shows a graph of the Lambda function
being invoked against time. It took less than 25 minutes for the Batch Operations job
to complete and a peak of 916 Lambda functions were run in parallel. Each hash was
stored in a DynamoDB table and a query across this table discovered that there were
56,154 duplicate SEG-D files. The combined AWS costs of Amazon S3 Batch
Operations, Amazon DynamoDB and AWS Lambda were less than $15 USD.
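The duplicate-detection logic can be sketched as hashing each file's bytes and grouping by digest. The whitepaper does not name the hash function, so MD5 here is an assumption, and the real pipeline stores each digest in DynamoDB rather than in memory:

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """Group (name, data) pairs by content hash; any group with more
    than one member is a set of byte-identical duplicates."""
    by_hash = defaultdict(list)
    for name, data in files:
        by_hash[hashlib.md5(data).hexdigest()].append(name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}

dupes = find_duplicates([
    ("a.segd", b"trace-data"),
    ("b.segd", b"trace-data"),  # byte-identical to a.segd
    ("c.segd", b"other-data"),
])
# dupes holds one group containing a.segd and b.segd
```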
In Figure 20, an inline from a 1TB 3D seismic dataset has been streamed into a
Jupyter Notebook. An ephemeral streaming format was generated from the SEG-Y
data using AWS Lambda functions running in parallel. The custom code compresses
the seismic data and stores it as 293,662 separate parts consuming between 74GB
and 166GB depending on the chosen compression quality. It took approximately 360
seconds to read, compress and write, which corresponds to a rate of ~2.8 GB/s. This
means that these files can be cost effectively recreated and removed based on
usage patterns.
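The per-part read, compress and write step can be sketched as follows; the whitepaper does not specify the codec, so lossless zlib stands in here for whatever quality-dependent compression the real pipeline uses:

```python
import zlib

def compress_part(part, level=6):
    """Compress one part of a seismic object; `level` stands in for the
    'compression quality' knob mentioned above."""
    return zlib.compress(part, level)

raw = bytes(range(256)) * 4096        # ~1 MB stand-in for one SEG-Y part
small = compress_part(raw, level=9)
assert zlib.decompress(small) == raw  # lossless round trip
```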
Figure 20 - SEG-Y data can be transformed on-the-fly into ephemeral formats. An inline
from a 1TB 3D Seismic dataset has been streamed into a Jupyter Notebook and
converted to a NumPy array.
6. Conclusion
Osokey’s experience has been that a layered approach to development is an
effective way to start with the cloud. AWS provides a foundation with a scalable,
reliable and global cloud infrastructure (Figure 21). In this whitepaper Amazon S3,
Amazon DynamoDB and Amazon Athena have been connected using AWS Lambda
into a serverless seismic data management solution, which Osokey call the data
layer.
Figure 21 - Osokey have found it effective to take a layered approach to development,
starting with the AWS foundation of a scalable, reliable and global cloud infrastructure.
AWS CloudFormation is used to simplify the deployment of AWS Resources in any
AWS Region. AWS Identity and Access Management (IAM) is used to separate
Osokey’s custom AWS Lambda functions from customers’ seismic data. A customer
can utilise the monitoring capabilities of AWS to audit Osokey’s access.
Amazon S3 Lifecycle Management Rules and object tags are used to transition
seismic data from S3 Standard Storage Class to S3 Glacier Storage Class. This
simplifies archiving because the location of the seismic data does not change. By
using Amazon Athena with Amazon S3 Inventory Reports, a data manager can
quickly establish which seismic data is archived. When operational requirements
change, seismic data can be restored to the S3 Standard Storage Class within hours.
The AWS Lambda ingestion pipeline is triggered automatically by new SEG-Y or
SEG-D files. The custom code extracts and identifies pertinent metadata and stores
this in Amazon DynamoDB in the customer’s AWS Account. The massive parallel
read performance of Amazon S3 makes it viable to re-extract additional metadata
on-demand and, with the flexibility of DynamoDB, permanently store this metadata to
enhance search capabilities. Amazon DynamoDB Global Secondary Indexes (GSIs)
can be created to query this metadata.
The ingestion pipeline also creates a trace level index for each file; with this, you can
read in parallel from many parts of the file without contention or slowdown. By
utilising the stored metadata and trace index capabilities, access and analytics
layers can be added to the solution. Osokey utilise the outputs from this architecture
to provide a Software as a Service (SaaS) solution that delivers cloud-based data
management, collaboration & data analysis for seismic data stored on AWS (Figure
22). Seismic data can be quickly located, viewed with a few clicks and streamed
globally.
The performance of Amazon S3 enables on-demand transformation of SEG-Y data
and a read, compress and write rate of 2.8 GB/s was achieved from a 1TB seismic
file. The ETag of this file (29014b75882070511aa863b2f90b2e37-8097) shows that
it is stored as 8097 parts on S3. This performance also enables services like S3
Batch Operations to be used to process files in bulk, and an example of finding
duplicates amongst 592,921 SEG-D files was completed within 25 minutes, with total
AWS costs of less than $15 USD.
This architecture and approach simplifies the adoption of cloud for seismic data
because there is no need to transcribe your data before ingestion. Moving to cloud
can be a lift and shift of the SEG-Y or SEG-D format data rather than a read, identify
and convert before uploading. Once the data is in the cloud other AWS Services can
be integrated to deliver continuous innovation.
Figure 22 - The outputs from the seismic ingestion pipeline can be integrated into a
web-based seismic data management, viewing and collaboration solution.