EGU2020 21549 Presentation

Optimizing NetCDF usage
Valentín KIVACHUK BURDÁ
05/05/2020 1
INTRODUCTION
- Original data sources comes in different formats (CSV,

GRIB, numpy arrays, …)
- Must convert to NetCDF format.
- Efficient most of the times. But NOT always.
05/05/2020 2
CONTEXT
3 Vars (Floats) with: Algorithm example

- 1461 time slots
img_len = 64
- 384 points of latitude
list_rand = 6000 random values of time, lat and lon
- 288 points of longitude
for time, lat, lon in list_rand:
A-% | Nebulosity (0 – 100) for var in [ ‘A’, ‘B’, ‘C’ ]:
B - ºC | Temperature (-10 – 40) data = NetCDF[var][time, lat + img_len, lon + img_len ]

acumulate_mean(var, data)
C - m/s | Wind Speed (0.xx – 400.xx)
We collect 6,000 samples for each variable,

randomly in all dimensions, and compute the
mean thereof.
05/05/2020 3
WHAT IS NetCDF?
NetCDF is a set of software and data formats that manage

scientific data.
- Self-describing format
- Multidimensional variables
- Native compression (zlib)
05/05/2020 4
EXTERNAL COMPRESSION
Apply NetCDF compression (zlib) with maximum level (9).

Size MB
2000 1853
1500
1064
1000
500
0
Original Zlib lvl 9
Original Zlib lvl 9
05/05/2020 5
The performance is REALLY BAD.
1030s (17min 10s)
05/05/2020 6
What can I do?
05/05/2020 7
Higer compression level needs more CPU time.

Size MB
2000 1853
1800
1600
1400 1216
1200 1064
1000
800
600
400
200
0
Original Zlib lvl 9 Zlib lvl 2
Original Zlib lvl 9 Zlib lvl 2
05/05/2020 8
BUT, bigger size -> more data to read -> slower

Performance (seconds)
1600
1339,9
1400
1200 1030,2
1000
800
600
400
200
0
Zlib lvl 9 Zlib lvl 2 img_len = 64
Zlib lvl 9 Zlib lvl 2 for time, lat, lon in list_rand:
for var in [ ‘A’, ‘B’, ‘C’ ]:
data = NetCDF[var][time, lat + img_len, lon + img_len ]
05/05/2020 9
TYPE OF DATA
- A value can be stored in different formats (int, float, ..)

- All values within a variable have the same format
- A format have a size (per value) and can represent a
delimited range of values
- Choosing a format with smallest size that can represent
our range of values
05/05/2020 10
TYPE OF DATA
Reduce size per value
Name Size (bytes) Range
BYTE 1 -127 ... 128
UNSIGNED BYTE 1 0 … 255
SHORT 2 -32,768 … 32,767 A - % | Nebulosity (0 – 100)
B - ºC | Temperature (-10 – 40)
UNSIGNED SHORT 2 0 … 65,535
C - m/s | Wind Speed (0.xx – 400.xx)
INT 4 -2,147,483,648 …
2,147,483,647
UNSIGNED INT 4 0 … 4,294,967,295
INT64 8 −𝟐𝟔𝟑 … 𝟐𝟔𝟑 − 𝟏
UNSIGNED INT64 8 0 … 𝟐𝟔𝟒 − 𝟏
FLOAT 4 𝟑. 𝟒 ± 𝑬𝟑𝟖 * * Can represent
DOUBLE 8 𝟏. 𝟕 ± 𝑬𝟑𝟎𝟖 * decimals
05/05/2020 11
TYPE OF DATA
Size MB Performance (seconds)
1400 1216 1600
1400 1339,9
1200 1064
1000 1200 1030,2
800 700 1000
783
800
600
600
400
400
200
200
0 0
Zlib lvl 9 Zlib lvl 2 Data type Zlib lvl 9 Zlib lvl 2 Data type
Zlib lvl 9 Zlib lvl 2 Data type Zlib lvl 9 Zlib lvl 2 Data type
img_len = 64
for var in [ ‘A’, ‘B’, ‘C’ ]:
05/05/2020 12
TRANSFORMATION
Linear Packing PPC

- Pack floats, doubles (4, - Set 0s some mantissa
8 Bytes) inside smaller positions (IEEE-754)
formats (4, 2, 1 Bytes) - Loss precission (can be
- Loss precission controlled)
(depends of data range - Impact on external
and output type) compressors (zlib)
05/05/2020 13
TRANSFORMATION
• C range (floats) can be packed inside shorts
A-% | Nebulosity (0 – 100)

B - ºC | Temperature (-10 – 40)
C - m2/s | Wind Speed (0.xx – 400.xx)
05/05/2020 14
TRANSFORMATION
Size MB Performance (seconds)
1400 1216 1600
1400 1339,9
1200 1064
1000 1200 1030,2
800 700 1000
783
553 800
600
600
400
400 252,6
200
200
0 0
Zlib lvl 9 Zlib lvl 2 Data type Packing Zlib lvl 9 Zlib lvl 2 Data type Packing
Zlib lvl 9 Zlib lvl 2 Data type Packing Zlib lvl 9 Zlib lvl 2 Data type Packing
img_len = 64
for var in [ ‘A’, ‘B’, ‘C’ ]:
05/05/2020 15
ACCESS PATTERN
05/05/2020 16
ACCESS PATTERN
- Multidimensional data can be stored in chunks.

- Each chunk is processed internally as atomic set of data.
- The shape of the chunk is closely related with the
performance of final application.
05/05/2020 17
ACCESS PATTERN
- How we access the data?

- Each time: 1 time, 64 lat and 64 lon
- Different aproach for Read, Write or Both
- A REALLY GOOD chunking for one access pattern can
be REALLY BAD for other. img_len = 64
for var in [ ‘A’, ‘B’, ‘C’ ]:
05/05/2020 18
CHUNKING
Each chunk with 1 time, 64 lat and 64 lon.

1600
1400 1339,9
1200 1030,2
1000
783
800
600
400 252,6
200 10
0
Zlib lvl 9 Zlib lvl 2 Data type Packing Chunk 1x64x64
Zlib lvl 9 Zlib lvl 2 Data type Packing Chunk 1x64x64
05/05/2020 19
CHUNKING
Find the best chunksize: A complex problem (1x64x72)

1600
1400 1339,9
1200 1030,2
1000
783
800
600
400 252,6
200 10 9,9
0
Zlib lvl 9 Zlib lvl 2 Data type Packing Chunk 1x64x64 Chunk 1x64x72
Zlib lvl 9 Zlib lvl 2 Data type Packing Chunk 1x64x64 Chunk 1x64x72
05/05/2020 20
ITERATE BY VARIABLE
- NetCDF read data from disk to RAM. A expensive

operation.
- Stores data in cache (RAM) to quickly retrieve previously
read data.
- Cache is per variable (not reused between variables)
05/05/2020 21
ITERATE BY VARIABLE
img_len = 64
Performance (seconds) list_rand = 6000 random values of time, lat and lon
1600 for time, lat, lon in list_rand:
1400 1339,9 for var in [ ‘A’, ‘B’, ‘C’ ]:
1200
Not effective for this code data = NetCDF[var][time, lat + img_len, lon + img_len ]
1030,2 acumulate_mean(var, data)
1000
783
800
600
400 252,6
img_len = 64
200
10 9,9 10,2 list_rand = 6000 random values of time, lat and lon
0
Zlib lvl 9 Zlib lvl 2 Data type Packing Chunk Chunk Iter by var for var in [ ‘A’, ‘B’, ‘C’ ]:
1x64x64 1x64x72 for time, lat, lon in list_rand:
Zlib lvl 9 Zlib lvl 2 Data type Packing data = NetCDF[var][time, lat + img_len, lon + img_len ]
Chunk 1x64x64 Chunk 1x64x72 Iter by var
05/05/2020 22
RANDOMNESS
- Arbitrary access to data is very slow compared to

sequential.
- Some applications requires random order of values.
- Order of data access is independent from application
data consumption.
- Better usage of cache mechanisms (spatial locality)
05/05/2020 23
RANDOMNESS
img_len = 64
Performance (seconds) list_rand = 6000 random values of time, lat and lon
1600 for var in [ ‘A’, ‘B’, ‘C’ ]:
1339,9 for time, lat, lon in list_rand:
1400
1200 1030,2 data = NetCDF[var][time, lat + img_len, lon + img_len ]
1000 acumulate_mean(var, data)
783
800
600
400 252,6
200 10 9,9 10,2 7,4
0 img_len = 64
list_seq, permutation = order_list(list_rand)
for var in [ ‘A’, ‘B’, ‘C’ ]:
Zlib lvl 9 Zlib lvl 2 Data type Packing for time, lat, lon in list_seq :
Chunk 1x64x64 Chunk 1x64x72 Iter by var Randomness data = NetCDF[var][time, lat + img_len, lon + img_len ]
05/05/2020 24
FINAL RESULTS
- Storage
- Original 1853 MB (100,00 %)
- Optimized 553 MB ( 29,84 %)
- Performance
- Original 1030,2s
- Optimized 7,4s (~x139 faster)
05/05/2020 25
CONCLUSIONS
• Find the best type of data for the values you will store.
• A bad chunking have a big impact in the performance.
• Choosing the best chunking is a complex problem.
• Always try to access the data as much sequential as
possible.
05/05/2020 26
• The « NetCDF: Performance and Storage Optimization
of Meteorological Data » contain more details and
reproducible example.
05/05/2020 27
Merci de votre attention
© IRT AESE ”Saint Exupéry” - All rights reserved Confidential and proprietary document. This document and all information contained
herein is the sole property of IRT AESE “Saint Exupéry”. No intellectual property rights are granted by the delivery of this document
or the disclosure of its content. This document shall not be reproduced or disclosed to a third party without the express written
consent of IRT AESE “Saint Exupéry” . This document and its content shall not be used for any purpose other than that for which it is
supplied. IRT AESE ”Saint Exupéry” and its logo are registered trademarks.
05/05/2020 28

EGU2020 21549 Presentation

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

EGU2020 21549 Presentation

Uploaded by

Copyright:

Available Formats

Optimizing NetCDF usage

Valentín KIVACHUK BURDÁ

- Original data sources comes in different formats (CSV,

3 Vars (Floats) with: Algorithm example

A-% | Nebulosity (0 – 100) for var in [ ‘A’, ‘B’, ‘C’ ]:

B - ºC | Temperature (-10 – 40) data = NetCDF[var][time, lat + img_len, lon + img_len ]

We collect 6,000 samples for each variable,

NetCDF is a set of software and data formats that manage

Apply NetCDF compression (zlib) with maximum level (9).

The performance is REALLY BAD.

1030s (17min 10s)

Higer compression level needs more CPU time.

BUT, bigger size -> more data to read -> slower

- A value can be stored in different formats (int, float, ..)

Linear Packing PPC

• C range (floats) can be packed inside shorts

A-% | Nebulosity (0 – 100)

- Multidimensional data can be stored in chunks.

- How we access the data?

Each chunk with 1 time, 64 lat and 64 lon.

Find the best chunksize: A complex problem (1x64x72)

- NetCDF read data from disk to RAM. A expensive

- Arbitrary access to data is very slow compared to

You might also like