
Introduction to Gluster

a.a 2020-2021 prof. Guido Russo 1





Design of AOPP Cluster


Data management





Introduction to gluster architecture

All nodes are the same: there are no separate metadata servers. In this respect it is like Isilon.

Each node contributes storage, bandwidth and processing to the cluster, so as you add
more nodes performance increases linearly in a scale-out fashion.

Different data layouts are possible: the entire cluster can be one namespace, or it can
be divided into volumes.

All network connections are used, enabling bandwidth to be scaled.

Adding and removing nodes can be done online.

Nodes contribute “bricks” to the cluster, which are basically disk partitions. These do not
need to be on identical hardware, but it is better if they are the same size. They can also
have any file system on them (the default is XFS). Bricks are set up on RHEL (or clone)
servers with the gluster software installed.

Sold by Red Hat as “Red Hat Storage”: very expensive, and more or less the same software,
but it comes with support. Red Hat supports XFS over LVM volumes. Alternatively, you can
set up gluster for free (no support).

Client access is via NFS, the glusterfs native client or object storage.


Introduction to gluster architecture
A gluster volume is a collection of servers belonging
to a Trusted Storage Pool. A management daemon
(glusterd) runs on each server and manages a brick
process (glusterfsd), which in turn exports the
underlying on-disk storage (XFS filesystem). The
client process mounts the volume and exposes the
storage from all the bricks as a single unified
storage namespace to the applications accessing it.
The client and brick processes' stacks have various
translators loaded in them. I/O from the application
is routed to different bricks via these translators.



Volumes are the basic building blocks of a gluster cluster, each
consisting of several bricks. One cluster can have several volumes.
There are several volume types, for example:

Distributed. Files are distributed over several bricks as
whole files. There is only one copy, so you rely on the underlying
RAID for data protection.

Replicated. Files are distributed over several bricks as
whole files and in multiple copies. The number of bricks needs to
be a multiple of the replica count. Gluster itself provides
the redundancy.

Disperse (new feature). Files are dispersed with parity (like
RAID 5). However, they are no longer present as whole files.

There are other types (striped, distributed-replicated etc).
Types of Volumes
Gluster file system supports different types of volumes
based on the requirements. Some volumes are good for
scaling storage size, some for improving performance and
some for both.
1. Distributed Glusterfs Volume - This is the type of
volume which is created by default if no volume type is
specified. Here, files are distributed across the various bricks in
the volume. So file1 may be stored on brick1 or on brick2,
but not on both. Hence there is no data redundancy. The
purpose of such a storage volume is to scale the volume size
easily and cheaply. However, this also means that a brick
failure will lead to complete loss of the data on that brick,
and one must rely on the underlying hardware for data loss protection.
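The whole-file placement described above can be sketched as follows. This is only an illustration: it uses a plain MD5 hash in place of Gluster's own elastic hashing, and the brick names and the helpers `pick_brick` and `files_lost` are hypothetical. It shows that each file name maps to exactly one brick, so a failed brick takes its files with it.

```python
import hashlib

# Hypothetical brick names; a real volume lists bricks as server:/path.
BRICKS = ["node1:/brick", "node2:/brick", "node3:/brick"]


def pick_brick(filename, bricks=BRICKS):
    """Map a file name to exactly one brick (illustrative hash, not Gluster's)."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[index]


def files_lost(filenames, failed_brick, bricks=BRICKS):
    """Return the files that become unavailable when one brick fails."""
    return [f for f in filenames if pick_brick(f, bricks) == failed_brick]


if __name__ == "__main__":
    names = ["file%d" % i for i in range(10)]
    for name in names:
        print(name, "->", pick_brick(name))
    print("lost with node2:", files_lost(names, "node2:/brick"))
```

Because no file has a second copy, the files lost per brick always add up to the full set of files.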


2. Replicated Glusterfs Volume

In this volume we overcome the risk of data loss which is
present in the distributed volume. Here exact copies of the
data are maintained on all bricks. The number of replicas
in the volume is chosen when the volume is created.
So we need at least two bricks to create a
volume with 2 replicas, or a minimum of three bricks to
create a volume with 3 replicas. One major advantage of
such a volume is that even if one brick fails, the data can
still be accessed from its replica bricks. Such a volume
is used for better reliability and data redundancy.
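As a rough sketch of how the replica count is given at creation time, the helper below builds the argument list for the `gluster volume create` command without running it. The function name `volume_create_command`, the server names and the brick paths are hypothetical; the check on the brick count mirrors the rule that the number of bricks must be a multiple of the replica count.

```python
def volume_create_command(name, bricks, replica=None, transport="tcp"):
    """Build the argv for `gluster volume create` (the command is not executed)."""
    if replica is not None:
        if replica < 2:
            raise ValueError("replica count must be at least 2")
        if len(bricks) % replica != 0:
            raise ValueError("number of bricks must be a multiple of the replica count")
    cmd = ["gluster", "volume", "create", name]
    if replica is not None:
        cmd += ["replica", str(replica)]
    cmd += ["transport", transport]
    cmd += list(bricks)
    return cmd


if __name__ == "__main__":
    # Hypothetical servers and brick paths.
    bricks = ["server1:/data/brick1", "server2:/data/brick1"]
    print(" ".join(volume_create_command("volume01", bricks, replica=2)))
```

Leaving out `replica` gives the default distributed volume.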


Distributed Replicated Glusterfs
- In this volume files are distributed across replicated
sets of bricks. The number of bricks must be a multiple
of the replica count. Also the order in which we specify
the bricks is important since adjacent bricks become
replicas of each other. This type of volume is used when
high availability of data due to redundancy and scaling
storage is required. So if there were eight bricks and
replica count 2, then the first two bricks become replicas
of each other, then the next two, and so on. This volume
is denoted as 4x2. Similarly, if there were eight bricks
and replica count 4, then each group of four bricks become
replicas of each other and we denote this as a 2x4 volume.
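The grouping rule above can be sketched in a few lines. The helper names `replica_sets` and `layout_label` and the brick names are hypothetical; the logic simply takes adjacent bricks, in the order given, as replicas of each other.

```python
def replica_sets(bricks, replica):
    """Group bricks, in the order given, into sets of `replica` adjacent bricks."""
    if len(bricks) % replica != 0:
        raise ValueError("number of bricks must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def layout_label(bricks, replica):
    """Return the 'distribute x replica' label, e.g. '4x2'."""
    return "%dx%d" % (len(bricks) // replica, replica)


if __name__ == "__main__":
    bricks = ["brick%d" % i for i in range(1, 9)]  # eight hypothetical bricks
    for count in (2, 4):
        print(layout_label(bricks, count), replica_sets(bricks, count))
```

This is why the brick order matters: listing two bricks from the same server next to each other would put both copies of a file on that server.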

Dispersed Glusterfs Volume
Dispersed volumes are based on erasure codes. A dispersed
volume stripes the encoded data of files, with some
redundancy added, across multiple bricks in the
volume. You can use dispersed volumes to have a
configurable level of reliability with minimum space
waste. The number of redundant bricks in the volume
is chosen when the volume is created.
The number of redundant bricks determines how many bricks
can be lost without interrupting the operation of the
volume.
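The space trade-off of a dispersed volume can be illustrated with a short calculation. This is a sketch under stated assumptions: the function names and the 10 TB brick size are hypothetical, all bricks are assumed to be the same size, and the constraint that the brick count must exceed twice the redundancy is how I understand the Gluster documentation.

```python
def dispersed_usable_tb(bricks, redundancy, brick_size_tb):
    """Usable space of a dispersed volume: data bricks times brick size."""
    if redundancy < 1 or bricks <= 2 * redundancy:
        raise ValueError("need redundancy >= 1 and bricks > 2 * redundancy")
    return (bricks - redundancy) * brick_size_tb


def replicated_usable_tb(bricks, replica, brick_size_tb):
    """Usable space of a (distributed) replicated volume, for comparison."""
    if bricks % replica != 0:
        raise ValueError("number of bricks must be a multiple of the replica count")
    return (bricks // replica) * brick_size_tb


if __name__ == "__main__":
    # Six hypothetical 10 TB bricks; both layouts survive the loss of two bricks.
    print("dispersed, redundancy 2:", dispersed_usable_tb(6, 2, 10), "TB")
    print("replicated, replica 3:  ", replicated_usable_tb(6, 3, 10), "TB")
```

With these numbers the dispersed layout keeps 40 TB usable against 20 TB for three-way replication, which is the "minimum space waste" point made above.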


Distributed Dispersed Glusterfs

Distributed dispersed volumes are the equivalent
of distributed replicated volumes, but using
dispersed subvolumes instead of replicated
ones. The number of bricks must be a multiple of
the number of bricks in the first subvolume. The
purpose of such a volume is to easily scale the
volume size and distribute the load across various bricks.


Gluster has many advanced features including:

Support for infiniband

Geo-replication: you can have another cluster in a
different location and replicate to it
asynchronously. That is the theory; when I tested it
before, it didn't work very well.

I think some additional features are only available
on Red Hat Storage.


Design of AOPP cluster

Nodes are Dell PE730XD servers with 12x6TB data disks and separate (additional) disks for
the OS. There are 5 nodes, each in RAID6, giving 300TB usable.
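The 300TB figure can be checked with a short calculation. The sketch assumes one RAID6 array per node spanning all 12 data disks (two disks' worth of parity) and ignores file system overhead; the function name is hypothetical.

```python
def raid6_usable_tb(disks, disk_tb):
    """Usable capacity of a single RAID6 array: two disks' worth goes to parity."""
    if disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disks - 2) * disk_tb


if __name__ == "__main__":
    nodes = 5
    per_node = raid6_usable_tb(disks=12, disk_tb=6)  # 60 TB per node
    print("per node:", per_node, "TB")
    print("cluster: ", nodes * per_node, "TB")  # distributed volume: capacities add up
```

Because the volumes are distributed rather than replicated, the per-node capacities simply add up.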

Installed with SL7 and the free version of gluster. (Ubuntu would also be possible but is likely
to be less well supported.) Ubuntu on the client side is fine.

Distributed volumes (should be robust against data loss, although somewhat less so for
availability). Whole-file distribution means the risk of catastrophic data loss due to
(gluster) file system corruption is low. Also, most performance slowdowns are due to
replication. Will set up two volumes to allow some separation between different groups.
One brick per node.

Quotas at group and project level.

Complex setup because of the need to accommodate legacy systems, so the nodes are multi-homed.
Main storage and cluster communication is on the infiniband network, but some client
communication is over ethernet.


Performance (1)
Fio test: sequential read of a 10GB file with 32k block size.

system          bandwidth     iops     time
Simple 1 gb     114417KB/s    3575     91645msec    2234.84 usec
Simple ib       470192KB/s    14693    22301msec    542.14 usec
Isilon 1 gb     113864KB/s    3558     92090msec    2246.15 usec
Gluster 1 gb    113931KB/s    3560     92036msec    2244 usec
Gluster 10 gb   363912KB/s    11372    28814msec    701 usec
Gluster ib      621415KB/s    19419    16874msec    411 usec
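The columns of the sequential-read results are consistent with each other, which a short sketch can show. It assumes the 10GB file is 10 x 1024 x 1024 KB and that the block size is 32 KB; the function name `derived` is hypothetical.

```python
FILE_KB = 10 * 1024 * 1024  # 10GB file, assuming binary units
BLOCK_KB = 32               # 32k block size


def derived(bandwidth_kb_s):
    """Derive (iops, time in msec) from the measured bandwidth in KB/s."""
    iops = int(bandwidth_kb_s / BLOCK_KB)
    time_ms = round(FILE_KB / bandwidth_kb_s * 1000)
    return iops, time_ms


if __name__ == "__main__":
    for system, bandwidth in [("Simple 1 gb", 114417), ("Gluster ib", 621415)]:
        print(system, derived(bandwidth))
```

For these two rows the derived values match the table: 3575 iops and 91645msec for Simple 1 gb, 19419 iops and 16874msec for Gluster ib.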


Performance (2)

Fio test: writing 1000 small files of 100 KB each.

system          bandwidth     iops     time
Simple 1 gb     1730.6KB/s    432      57786msec    18449.49 usec
Simple ib       65317KB/s     16329    1531msec     464.13 usec
Isilon 1 gb     17425KB/s     4356     5739msec     1811.24 usec
Gluster 1 gb    40833KB/s     10208    2449msec     751 usec
Gluster 10 gb   45851KB/s     11462    2181msec     672 usec
Gluster ib      50125KB/s     12531    1995msec     612 usec
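To make the small-file comparison easier to read, the bandwidth figures can be turned into ratios. The sketch below copies the bandwidths from the table; the dictionary and function names are hypothetical.

```python
# Small-file write bandwidth in KB/s, copied from the table above.
SMALL_FILE_KB_S = {
    "Simple 1 gb": 1730.6,
    "Simple ib": 65317,
    "Isilon 1 gb": 17425,
    "Gluster 1 gb": 40833,
    "Gluster 10 gb": 45851,
    "Gluster ib": 50125,
}


def speedup(system, baseline):
    """Ratio of the small-file bandwidth of `system` to that of `baseline`."""
    return SMALL_FILE_KB_S[system] / SMALL_FILE_KB_S[baseline]


if __name__ == "__main__":
    print("Gluster 1 gb vs Isilon 1 gb: %.2fx" % speedup("Gluster 1 gb", "Isilon 1 gb"))
    print("Gluster ib vs Gluster 1 gb:  %.2fx" % speedup("Gluster ib", "Gluster 1 gb"))
```

On the same 1 gb network Gluster comes out at roughly 2.3 times the Isilon figure, while moving Gluster from 1 gb to infiniband gains only about 1.2 times for this workload.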


Data management
Projects described using Dublin Core XML data format and controlled by
<?xml version="1.0" encoding="UTF-8"?>
<!-- The wrapper element name is an assumption; the slide shows only the dc: fields. -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>Modelling Jupiter's atmospheric spin-up using the MITgcm</dc:title>
<dc:creator>Roland Young</dc:creator>
<dc:subject>Moist convection</dc:subject>
<dc:description>Simulations using the MITgcm studying jet formation in Jupiter's atmosphere under passive and active cloud conditions (moist convection). Also includes test runs of the Jupiter MITgcm, and analysis of these simulations.</dc:description>
<dc:publisher>Data not yet published</dc:publisher>
<dc:contributor>Roland Young</dc:contributor>
<dc:contributor>Peter Read</dc:contributor>
<dc:date>11 May 2016</dc:date>
<dc:type>GCM output</dc:type>
<dc:format>MITgcm custom binaries</dc:format>
<dc:format>IDL .sav</dc:format>
<dc:rights>For internal AOPP use only.</dc:rights>
</metadata>
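A project record like this can be read with the Python standard library. The sketch below is illustrative only: the `<metadata>` wrapper element and the `read_record` helper are assumptions, since the slide shows just the `dc:` fields, and the sample is a shortened copy of the record above.

```python
import xml.etree.ElementTree as ET

DC_URI = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Shortened copy of the record; the <metadata> wrapper is an assumption.
RECORD = """<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Modelling Jupiter's atmospheric spin-up using the MITgcm</dc:title>
  <dc:creator>Roland Young</dc:creator>
  <dc:contributor>Roland Young</dc:contributor>
  <dc:contributor>Peter Read</dc:contributor>
  <dc:date>11 May 2016</dc:date>
  <dc:rights>For internal AOPP use only.</dc:rights>
</metadata>"""


def read_record(xml_text):
    """Return {field name: [values]} for the Dublin Core elements in a record."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    prefix = "{" + DC_URI + "}"
    record = {}
    for child in root:
        if child.tag.startswith(prefix):
            # Repeated fields such as dc:contributor are collected into a list.
            record.setdefault(child.tag[len(prefix):], []).append(child.text)
    return record


if __name__ == "__main__":
    for field, values in read_record(RECORD).items():
        print(field, "=", "; ".join(values))
```

Keeping every field as a list reflects that Dublin Core elements such as `dc:contributor` and `dc:format` may repeat.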


[root@cplxconfig2 manifests]# cat glusternode.pp
class profile::glusternode(
  $hosts      = {},
  $bricks     = {},
  $volumes    = {},
  $properties = {}
) {
  # class using gluster::server
  include gluster::params

  # Any parameters passed to gluster::server are not shown on the slide.
  class { 'gluster::server': }

  # Create hosts, bricks, volumes and volume properties from the hashes.
  create_resources('gluster::host', $hosts)
  create_resources('gluster::brick', $bricks)
  create_resources('gluster::volume', $volumes)
  create_resources('gluster::volume::property', $properties)
}


Puppet-gluster (2)
gluster::server::shorewall: false
gluster::server::infiniband: true
gluster::server::vip: ''
gluster::server::vrrp: true

profile::glusternode::hosts:
  '':
    ip: ''
    uuid: 'c6bcc598-53ab-41dd-ad9a-532e2215df5e'

profile::glusternode::bricks:
  '':
    dev: '/dev/sdb'
    fsuuid: '02f2e01a-5728-4263-be30-0b4850fc5cdc'
    lvm: false
    xfs_inode64: true
    areyousure: false
    again: false

profile::glusternode::volumes:
  volume01:
    bricks:
      - ''
      - ''
      - ''
    transport: 'tcp,rdma'
    again: false
    start: true

profile::glusternode::properties:
  volume01#auth.allow:
    value:
      - '163.1.242.*'
      - '192.168.0.*'
  volume01#nfs.rpc-auth-allow:
    value:
      - '163.1.242.*'
      - '192.168.0.*'
…

All client access is currently via NFS. The glusterfs
native client seems to have slower performance.
“Infiniband” access is currently IP over Infiniband,
and RDMA does not seem to work very well. On the
other hand, the current setup performs well.

You can't seem to add mount options, as then you
get sec=null, so you need to mount with no
options. Likewise, although you can set up root
squash, you then have no access at all as root.

Sometimes there are issues where different nodes have a
different view of the cluster, or puppet runs don't complete,
but overall the whole system seems quite stable.

• Gluster is relatively easy to set up, stable and works well.

• Performance is good, probably at least as good as Isilon, at a fraction of the price.
• It has the features of advanced file systems, so it is realistic as a production facility,
although you will mainly need to use the command line to configure them.
• It is easy to expand in future and the hardware will not need to be exactly the same.
• Puppet has already been set up. NB a separate role needs to be set up for each
cluster, but the manifest for each node in the cluster contains the information on the
entire cluster.



• Gluster does not seem to be widely used for HPC. I have carefully evaluated it for
the areas I have built systems for (bioinformatics, AOPP) and am satisfied
it is suitable, but it may be less suitable for systems with rapid turnover of millions of
files (i.e. fast scratch spaces). In fact Isilon is also recommended for mixed workloads
rather than fast scratch spaces. So this will need to be carefully evaluated when
moving into new areas.
• It may be marginally more expensive than systems that comprise metadata servers
and data nodes, as with a pure scale-out system each node needs to have
reasonable performance and typically you do not use additional attached storage.
