Paper Volve Data Exploration
All content following this page was uploaded by Andrzej Tunkiel on 24 May 2021.
OMAE2020-18151
Copyright © 2020 by ASME

ABSTRACT
In 2018 Equinor took an unprecedented step for an energy company and made a multi-terabyte dataset from the Volve field open. However, there is a long way from downloading data to executing meaningful analysis. With no way of quickly evaluating the data, due to its size and unfamiliar file formats, the use of Volve data has so far been limited.

This paper presents our exploratory work related to the real-time drilling part of the dataset. We describe common obstacles and approaches for overcoming them. We also describe the specific contents of the dataset so that others can gauge its potential for case studies. We hope that this will lower the bar for Volve field data accessibility, promote research, and become a catalyst for other data science projects.

INTRODUCTION
Work done in the energy sector was, until recently, fully dependent on laboratory-scale work [1] and commercial partners tied to specific research projects. The prerequisite legal overhead, confidentiality, and sensitivity of real well data mean that they are often prohibitively difficult to obtain.

The Volve field dataset [2], made public by Equinor, has the potential to become the go-to dataset for data scientists working within oil and gas. The field is located in the North Sea and was in operation from 2008 to 2016. One of the biggest benefits of using an open dataset is that it allows for experiment reproduction and benchmarking. This is impossible if results are published while the raw data are withheld. Over 52 percent of scientists surveyed by Nature in 2016 claim there is currently a reproducibility crisis in science [3], with unavailability of raw data among the top contributing factors.

Additionally, data related to drilling operations are of a relatively unusual type: a data series with up to hundreds of correlated attributes. It is hard to find a dataset where the data structure and data problems are similar.

Curated datasets exist in other fields. In machine learning in particular there are a number of datasets that allow researchers to benchmark their methods, with the MNIST database¹, containing labelled images of handwritten numbers, being the most notable one. There are also other, more specialized open datasets, such as the Johns Hopkins Turbulence Database², which enables research by providing data that would otherwise be prohibitively expensive to acquire.

¹ http://yann.lecun.com/exdb/mnist/
² http://turbulence.pha.jhu.edu/

In this study, we want to facilitate a step change in research quality within the energy sector by making Volve real-time drilling data easily accessible. Today, the data in question
are available as WITSML (Wellsite Information Transfer Standard Markup Language) files [4], a format common in the industry but not compatible with most common data science tools.

In this paper, we present what data are available and methods for dealing with common problems. As an extension to this paper we also provide an online platform for data exploration and download. This will lift the real-time drilling part of Volve from being accessible only to experts to a resource that is easily accessible to a wide range of scientists and engineers.

The presented work contributes to the field of data preparation, one of six major phases in the Cross-industry standard process for data mining (CRISP-DM), the most widely used analytics model in data mining, by characterizing the typical problems found in real-time drilling data. The discussed issues and presented solutions are meant to bridge the gap between researchers and the public raw dataset within drilling, as well as to be a reference for the wider data science community that is not necessarily familiar with the petroleum industry.

This paper starts with an introduction to the Volve dataset and its contents, with a focus on WITSML real-time drilling data. As a second step, typical data-related issues and potential pitfalls are highlighted together with recommended solutions. To better visualize some of the discussed topics we present several examples, coupled with tables giving a bird's-eye view of the dataset.

VOLVE DATASET
The Volve field is located in the North Sea, midway between Stavanger and Aberdeen. It was discovered in 1993; production started in February 2008 and lasted eight years. The reservoir is in sandstone of Middle Jurassic age in the Hugin Formation, at a depth between 2700 m and 3100 m, with the seabed at a depth of 80 m. Peak production was 56 000 barrels per day, with a total of 63 million barrels of oil produced. In June 2018 Equinor decided to disclose all subsurface and operating data for this field, totalling approximately 40 000 files of various kinds. The data are published under a very permissive license, Creative Commons BY-NC-SA 4.0, which, in short, means that any derivative work has to attribute the original license holder (BY, by attribution), cannot be commercial (NC, non-commercial), and must be shared under an identical license (SA, share-alike). There are 14 compressed archives in total available for download; see Table 2 for size reference.

Real-time dataset
Our work was focused on the part of the dataset named Volve WITSML Realtime drilling data, the 5GB archive seen in Table 3, in the appendix.

Within multiple nested folders, it contains drilling logs as both time-based and depth-based data. The main difference between these is the indexing attribute. Additionally, there is usually significantly more information in the time-based data, since it is recorded continuously throughout operations. The depth-based data are only a subset, recorded when actual drilling is performed, that is, when the rock is being physically cut.

In Table 3, appendix, we summarize the available data in terms of wellbores and the number of samples available for both depth and time data.³

Parsing of WITSML data
WITSML is an industry standard for transferring real-time data between cooperating oilfield companies. Despite its wide adoption we failed to identify a Python library allowing for direct import of such data.

We decided to parse the files using regular expressions. Other methods, such as libraries dedicated to reading XML files, are equally suitable.

The WITSML format provides a number of useful values that we decided to retain. There is information attached to each attribute, such as full name, mnemonic, unit, data type, and minimum and maximum date index. Our conversion effort aimed at providing data in CSV format to achieve full compatibility with the most common Python data analysis library, Pandas, and with other data analytics tools such as R and Excel. This meant that only a simple title is possible for each attribute. We used the full name concatenated with the unit.

An additional attribute was created to inform the user which section of the well the log is from. This information was obtained from metafileinfo.txt files residing within the folder structure. No other modification to the data was performed. All but one well were parsed successfully, with the exception stemming from a seemingly different data structure.

Available data
As indicated earlier, the available data exist as either time-based or depth-based data. In general terms, time-based data are expected to contain values as recorded, with mostly fixed time steps. One can observe the movement of the drawworks, downtime between drilled sections, pulling out of hole, etc. Depth-based data, on the other hand, are processed to contain a seemingly continuous drilling operation. Depth-based data do not contain time information.

In some, but not all, wells the time-based data contain not only drilling operations but also events such as casing or completion running.

With just one exception, if depth-based data are available, a time-based equivalent also exists. There are typically above 100 attributes available in the depth-based logs and above 200 in the time-based ones. Measurement units in the Volve dataset are exclusively metric.

³ We did not investigate why seemingly the same wells are logged in different folders.
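
The regular-expression conversion described under "Parsing of WITSML data" can be illustrated with a minimal sketch. This is not the parser used for this work. The embedded sample is a hypothetical, heavily simplified WITSML-like log, and the element names (logCurveInfo, mnemonic, unit, curveDescription, logData, data) as well as the use of curveDescription as the "full name" are assumptions that should be checked against the actual files. The helper names tag and witsml_to_csv are made up for illustration.

```python
import csv
import io
import re

# Hypothetical, simplified WITSML-like content; real Volve files carry
# more elements, attributes and namespaces than shown here.
SAMPLE = """<log>
  <logCurveInfo>
    <mnemonic>DEPT</mnemonic>
    <unit>m</unit>
    <curveDescription>Measured Depth</curveDescription>
  </logCurveInfo>
  <logCurveInfo>
    <mnemonic>ROP</mnemonic>
    <unit>m/h</unit>
    <curveDescription>Rate of Penetration</curveDescription>
  </logCurveInfo>
  <logData>
    <data>100.0,12.5</data>
    <data>100.5,</data>
  </logData>
</log>"""

CURVE_RE = re.compile(r"<logCurveInfo\b[^>]*>(.*?)</logCurveInfo>", re.S)
DATA_RE = re.compile(r"<data>(.*?)</data>", re.S)


def tag(block, name):
    """Return the text of the first <name> element in block, or ''."""
    match = re.search(rf"<{name}\b[^>]*>(.*?)</{name}>", block, re.S)
    return match.group(1).strip() if match else ""


def witsml_to_csv(xml_text):
    """Convert the simplified log to CSV text with 'full name + unit' titles."""
    headers = [
        f"{tag(curve, 'curveDescription')} {tag(curve, 'unit')}".strip()
        for curve in CURVE_RE.findall(xml_text)
    ]
    # Each <data> element is assumed to hold one comma-separated row;
    # empty fields stay empty and become missing values on import.
    rows = [row.strip().split(",") for row in DATA_RE.findall(xml_text)]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return out.getvalue()


csv_text = witsml_to_csv(SAMPLE)
print(csv_text)
```

The resulting text can be written to a file and opened with Pandas, R or Excel. An XML library would handle namespaces and escaping more robustly than regular expressions.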
Raw amount of data
Different logging frequencies mean that a new entry is generated even though not all attributes have a new value available. This leads to a relatively high number of missing values (not a number, NaN) in the dataset. For example, in the depth-based data of well F9 A, over 80.8 percent of individual data points are empty. In the time-based data for the same well, 74.2 percent of cells also have no value. This is common throughout all the wells.

Depth
One of the key attributes of well data is depth or, more precisely, multiple variants of depth. From a catalogue of definitions, one can typically find at least one of the common ones that is available for the majority of the dataset.

Depth-based logs will usually have an attribute called Measured Depth m, which is typically complete. In the case of well F9 A a Bit Depth m value is also available; however, it covers only a small section of the well and in practice is identical to Measured Depth m.

The time-based F9 A dataset has a number of depth-related attributes. One has to investigate the three available Bit Depths, as well as Continuous Survey Depth m, especially when analysing the directional drilling aspect of the well, as it is corrected for the sensor position in the bottom hole assembly.

The depth range of the depth-based datasets is reported in Table 1. The raw minimum depth is not reported, as some datasets contained clearly incorrect values. We decided to use the 5th percentile of the depth series, which corrects this issue while still providing a good indication of the wells' usable data range.

TABLE 1. Depth range

Folder               Well        Minimum⁴ depth (m)  Maximum depth (m)
Norway-Statoil-NO    F-1 C - 0              1 257.0            3 632.1
Norway-Statoil-NO    F-1 C - A              2 564.0            3 682.8
Norway-Statoil-NO    F-1 C - B              2 591.9            3 465.0
Norway-Statoil-NO    F-1 C - C              2 528.6            4 008.4
Norway-StatoilHydro  F-4                      191.6            2 992.4
Norway-StatoilHydro  F-5                    2 911.7            3 793.0
Norway-Statoil       F-7                      131.8              914.9
Norway-StatoilHydro  F-9                      162.2              633.5
Norway-NA            F-9 A                    400.1            1 206.0
Norway-StatoilHydro  F-10                     448.5            5 311.1
Norway-Statoil-NO    F-11                     196.0              347.6
Norway-Statoil-NO    F-11 T2                  363.8            2 574.0
Norway-Statoil-NO    F-11 A                 2 522.7            3 762.0
Norway-Statoil-NO    F-11 B                 2 655.9            4 770.6
Norway-Statoil       F-12                     279.3            3 464.5
Norway-StatoilHydro  F-14                     316.0            3 466.1
Norway-StatoilHydro  F-15                   1 392.6            4 065.3
Norway-StatoilHydro  F-15 A                 2 517.5            3 233.0
Norway-StatoilHydro  F-15 B                 2 968.8            3 035.5
Norway-StatoilHydro  F-15 S                 1 503.6            4 090.0

Attribute availability
All the logs differ in terms of attribute availability due to differences in the equipment and practices used during operations. A listing of selected attributes' keywords is provided in the appendix, in Table 4.

One can roughly identify what kind of analysis is possible based on the presented table. Wells F-1, F-11, and F-15 D have a significant number of attributes related to gamma and neutron based measurements. Those wells were drilled in 2013, as opposed to the other wells, drilled in the years 2007 - 2009. We did not investigate the reason behind the use of different types of equipment.

Nearly all logs contain basic drilling attributes, such as rate of penetration, surface torque, weight on bit, etc. When exploring the dataset it is worth using a script [5] that automates searching for attributes through all the files, as well as plotting charts to gauge the usability of the data for a given research problem.

TYPICAL PROBLEMS AND PROPOSED SOLUTIONS

Uneven data frequency
Not all data are received simultaneously. Mud pulse telemetry provides data in a continuous slow stream, in sequences repeating every couple of minutes. In the logged data, every time a new value is available it is written as a new line in the log. There is minimal waiting time to collect other attributes, hence rows are very often sparsely populated and dominated by missing values.

This problem is visualized in Figure 1. The sampling of all three attributes is different. Attribute A is logged at half the frequency of attribute B. Attribute C is logged at the same frequency as B, but at different times. In the presented example, the first sample will only have a value for attribute C. The second sample will contain A and B, the third one again only C, the fourth one only B, and so on. This practical approach to logging retains the maximum amount of information, while at the same time it creates a number of problems for data analysts.

In the case of a correlation analysis between the (often logged) attributes B and C, it is possible that they never co-exist in the same row. If both are downhole attributes, uploaded through mud pulse telemetry, they will necessarily be offset by a fixed value.

Additionally, series-type analysis, be it depth or time, re-

⁴ 5th percentile
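
The logging pattern described for attributes A, B and C can be reproduced on synthetic data to show its two consequences: a large share of empty cells, and the absence of any row in which B and C are both present. The sketch below uses made-up random values and an arbitrary length of 12 samples; only the presence pattern follows the description above.

```python
import numpy as np
import pandas as pd

n = 12                      # arbitrary number of logged rows
rows = np.arange(n)
rng = np.random.default_rng(0)
values = rng.normal(size=(n, 3))   # synthetic measurements

# Presence pattern: row 0 holds only C, row 1 holds A and B,
# row 2 only C, row 3 only B, and so on.
present = np.column_stack([
    rows % 4 == 1,          # A: half the frequency of B
    rows % 2 == 1,          # B
    rows % 2 == 0,          # C: same frequency as B, different rows
])
log = pd.DataFrame(np.where(present, values, np.nan), columns=["A", "B", "C"])

# Share of empty cells in the whole table.
missing_share = log.isna().to_numpy().mean()

# Number of rows where B and C are both available for a direct comparison.
overlap = int((log["B"].notna() & log["C"].notna()).sum())

print(f"missing share: {missing_share:.3f}, rows with both B and C: {overlap}")
```

With no overlapping rows, a row-wise correlation between B and C cannot be computed without first resampling or filling the series.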
Big gaps
There are limited possibilities when it comes to gaps that exceed several meters in datasets. They may appear due to changes in equipment, sensor failures, logging failures, data corruption, and similar. Volve data contain a number of such longer gaps.

We discovered that some gaps may exist in the depth-based dataset while not being present in the time-based one, and vice versa. One example is illustrated in Figure 3 and discussed in detail in further sections. There are no simple universal methods of merging depth and time datasets. Below we provide a generic algorithm that can be used to address this issue.

Patching. Data from a time-based log can be used to fill in, or patch, relatively large gaps in a depth-based log. A process that allows patching of a depth-based dataset using data from a time-based dataset is shown in Figure 2.

[FIGURE 2. PATCHING FROM TIME PROCESS. Flowchart boxes: Sort by Time; Forward Fill; Select time range; Sort by Depth; Select given Depth range; Adjust attribute names; Depth based data with a gap; Ensure no corrupted data exists on the gap edges; Merge; Gap-filled depth based data.]

Time-based data have to be sorted by time and the attributes forward filled. This is done to ensure that there is no data loss in the subsequent depth-based sorting. As mentioned in the previous sections, the dataset is in large part empty, as values are recorded at different intervals. If the data are sorted by depth without forward filling first, all the samples that lie between logged depth values would be lost. If multiple operations, including tripping in and out, are logged, it is beneficial to isolate a rough time range of the log that corresponds exclusively to drilling. With the drilling-only part of the dataset, one can simply sort by depth to convert the log into a depth-based log. Note that there may be different depths in the time-based log, and it may not be immediately clear which one is the most appropriate or which one is the most complete.

The next step is to identify the exact depth range that is missing or otherwise corrupted in the depth-based log. Note that when a gap in the data exists, attributes near the gap may be incorrect. It must be ensured that there are no corrupted data on the edges of the gap. Data removal is often necessary so that the edge values are without errors. With the depth range identified, the range of the patch dataset must be adjusted. For technical reasons it is likely necessary to adjust the names of the attributes so they can merge correctly. Take note that seemingly the same attributes may in fact differ in terms of filtering, noise levels, data artifacts or depth shift. Figure 4 shows values for four inclination attributes for the same well, without any two lines overlapping completely. One must evaluate if given gap filling is acceptable on a case by case

Outliers, artifacts, and sentinel values
Not all values in a log can be considered valid. There are a number of reasons for erroneous values, each with its own methods for removing them. Traditional outliers, in the context of Volve WITSML data, can often be removed relatively easily with a median filter. There are publications [10–12] that deal with other, more complex methods.

Data artifacts are incorrect values due to flaws in measurement or recording techniques. These can often be seen as impossible straight lines in plots, such as the one at the bottom of Figure 3. This unfortunately often requires manual intervention, such as removing all given values of an attribute from samples between two depths.

Sentinel values are employed to show a lack of value. They are often selected to be physically or mathematically impossible,
DISCUSSION, RECOMMENDATION
Folder               Well     Data   Neutron  Gamma  Inclination  Azimuth  Continuous  MWD  Caliper
Norway-NA            F-1      time         0      2            1        1           3   13        1
Norway-Statoil-NO    F-1 C 0  depth       45     46           13       32           0    1       22
Norway-Statoil-NO    F-1 C 0  time         0     13            8        7           0    1        3
Norway-Statoil-NO    F-1 C A  depth       37     17            7       27           0    1       18
Norway-Statoil-NO    F-1 C A  time        28     13            6        7           0    2       15
Norway-Statoil-NO    F-1 C B  depth       45     25            5       36           0    1       22
Norway-Statoil-NO    F-1 C B  time        28     24            6       19           0    2       15
Norway-Statoil-NO    F-1 C C  depth       39     68            9       75           0    1       20
Norway-Statoil-NO    F-1 C C  time        28     47            8       33           0    2       15
Norway-Statoil-NO    F-4      time         0      0            0        0           0    1        1
Norway-StatoilHydro  F-4      depth        4      6            3        3           4   22        1
Norway-StatoilHydro  F-4      time         3      7            5        3           8   29        2
NA-NA                F-5      time         0      0            0        0           0    1        1
Norway-Statoil-NO    F-5      time         0      0            0        0           0    1        1
Norway-StatoilHydro  F-5      depth        2      2            2        1           3   13        2
Norway-StatoilHydro  F-5      time         0      6            3        1           4   23        3
Norway-Statoil       F-7      depth        0      2            2        1           3   13        0
Norway-Statoil       F-7      time         0      2            2        1           4   14        1
Norway-Statoil-NO    F-7      time         0      0            0        0           0    1        1
Norway-NA            F-9 A    depth        0      3            1        1           2   15        0
Norway-NA            F-9 A    time         0      3            1        1           3   26        1
Norway-Statoil-NO    F-9      time         0      0            0        0           0    1        1
Norway-StatoilHydro  F-9      depth        0      2            1        1           2   17        0
Norway-StatoilHydro  F-9      time         0      4            1        1           4   20        1
Norway-StatoilHydro  F-10     depth        5      7            3        2           4   15        1
Norway-StatoilHydro  F-10     time         3      6            3        2           5   16        2
Norway-Statoil-NO    F-11 A   depth       39     17            7       27           0    0       20
Norway-Statoil-NO    F-11 A   time        28     12            6        7           0    1       15
Norway-Statoil-NO    F-11 B   depth       39     49            9       64           0    1       20
Norway-Statoil-NO    F-11 B   time        28     25            6       21           0    2       15
Norway-Statoil-NO    F-11     depth        0     17            6        6           0    0        0
Norway-Statoil-NO    F-11 T2  depth       31     37           10       10           0    0       16
Norway-Statoil-NO    F-11 T2  time        28     23            9       10           0    1       15
Norway-Statoil-NO    F-11     time         0     13            6        6           0    1        1
Norway-Statoil       F-12     depth        3      5            2        2           2   14        1
Norway-Statoil       F-12     time         3     10            4        3           6   32        3
Norway-Statoil-NO    F-12     time         0      0            0        0           0    1        1
Norway-Statoil-NO    F-14     time         0      0            0        0           0    1        1
Norway-StatoilHydro  F-14     depth        3      8            3        2           4   21        1
Norway-StatoilHydro  F-14     time         3      6            3        2           5   22        2
Norway-StatoilHydro  F-15     depth        3      5            3        3           4   22        2
Norway-StatoilHydro  F-15     time         1      3            3        3           5   23        3
Norway-StatoilHydro  F-15 A   depth        5      4            2        2           4   23        1
Norway-StatoilHydro  F-15 B   depth        9     14            5        3           4   38        6
Norway-StatoilHydro  F-15 B   time         1      1            2        2           5   22        3
Norway-Statoil-NO    F-15 C   depth        0      0            0        0           0    0        0
Norway-Statoil-NO    F-15 C   time         0      0            0        0           0    1        1
Norway-Statoil-NO    F-15 D   time        28     47            7       33           0    2       15
Norway-StatoilHydro  F-15 S   depth        5      6            3        2           3   14        1
Norway-StatoilHydro  F-15 S   time         3      5            3        2           4   15        2
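
Counts of this kind can be obtained by searching the attribute titles of each converted log for keywords. The sketch below assumes that each column of the table above is the number of attribute names containing the given keyword; this interpretation, the function name count_keywords and the example attribute titles (apart from Measured Depth m and Continuous Survey Depth m) are assumptions made for illustration.

```python
KEYWORDS = ["Neutron", "Gamma", "Inclination", "Azimuth",
            "Continuous", "MWD", "Caliper"]


def count_keywords(attribute_names, keywords=KEYWORDS):
    """Count attribute names containing each keyword, ignoring case."""
    lowered = [name.lower() for name in attribute_names]
    return {kw: sum(kw.lower() in name for name in lowered) for kw in keywords}


# Hypothetical column titles of one converted log.
columns = [
    "Measured Depth m",
    "Continuous Survey Depth m",
    "Gamma Ray gAPI",
    "Inclination deg",
    "MWD Inclination deg",
]

counts = count_keywords(columns)
print(counts)
```

Running the same function over the header row of every CSV file gives a quick overview of which wells are candidates for a given analysis.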