
A triad-based architecture for a multipurpose Lustre filesystem at /rdlab
May 2021
Gabriel Verdejo Alvarez

https://rdlab.cs.upc.edu
rdlab@cs.upc.edu
/rdlab
• The research and development Lab (context)

- Founded in 2010 at the Computer Science department

- IT support for research groups only

- National and European Projects (FP7, H2020…)

- Technology transfer

• The research and development Lab (infrastructure)

- 160 researchers, 18 research groups

- HPC and Cloud services for research projects

- 400 TB of Lustre (2.12.5 + ZFS) storage

LUG2021 -2- https://rdlab.cs.upc.edu/


SQUARING THE CIRCLE
• Why not Lustre?

- Well-known project

- Using Lustre since 2010 (HPC service)

- Most of our data was already in Lustre

- Lustre provides a flexible architecture to play with

• OK, but…

- Misconceptions (expensive, difficult to understand…)

- Compatibility issues (vendors and technologies)

- Who is using Lustre as a general purpose filesystem? (Early adopter panic)

- Undocumented experiences and good practices


STARTING POINT
• Classical Lustre setups

- Type A: several n-disk OST volumes served by a single dedicated OSS

- Type B (HA): a pair of multi-disk OSTs attached to a pair of OSSs

(Diagram labels: Lustre, Router)


THE SCIENTIFIC METHOD
• Cooking the idea

- Identify the main ingredients, goal(s) and constraints

- Set metrics and baselines

- Play: Combine, test and “taste”

(Figure: "Trust me, I am an engineer". Ingredients shown: ldiskfs, iSER, RDMA, MDADM, LVM, RAID)


THE SCIENTIFIC METHOD II
• Milestones

- 2015: Just an idea

- 2018: Lustre ZFS

- 2019: Deployment

- 2021: LUG2021


OFF THE BEATEN TRACK
• Ingredients for a triad-based recipe

- 3 dedicated physical disk servers (different models/vendors?)

- Same disk technology layout

- Dedicated high-speed, low-latency network (IB + iSER)

High speed Local Area Network

Point to Point dedicated low latency network (IB) for iSER device export (one link on each side of the diagram)

ZFS mirror: 1, 1’, 1’’

ZFS mirror: 2, 2’, 2’’

ZFS mirror: 3, 3’, 3’’
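The mirror grouping drawn above can be sketched in a few lines of Python. This is only an illustration: the host names (`server1` to `server3`) and the `server:diskN` naming are hypothetical labels, and the sketch assumes that disk n, n’ and n’’ each sit in a different server, as the diagram suggests.

```python
from typing import Dict, List

# Hypothetical host names; the slides only show three disk servers.
SERVERS = ["server1", "server2", "server3"]


def build_mirrors(slots: int, servers: List[str] = SERVERS) -> Dict[int, List[str]]:
    """Map each disk slot n to one disk per server (n, n', n'')."""
    return {
        slot: [f"{server}:disk{slot}" for server in servers]
        for slot in range(1, slots + 1)
    }


def remote_members(mirror: List[str], local_server: str) -> List[str]:
    """Mirror members the local server does not hold itself.

    In the architecture described here, these would be the devices
    reached over the dedicated IB links via iSER.
    """
    prefix = local_server + ":"
    return [member for member in mirror if not member.startswith(prefix)]


mirrors = build_mirrors(slots=3)
for slot, members in mirrors.items():
    print(slot, members, "remote from server1:", remote_members(members, "server1"))
```

Every mirror has exactly one local member and two remote ones from the point of view of any server, which is why the dedicated low-latency links matter for write performance.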


OFF THE BEATEN TRACK II
• Spicing the triad

- Group similar ZFS mirrors into ZFS stripes

- Group ZFS stripes into a 3 OST setup

- “Serve” every OST with HA and a Zpool cache disk

High speed Local Area Network

OSS 1, OSS 2, OSS 3 (Corosync HA; Zpool cache on the active OST only)

Point to Point dedicated low latency network (IB) for iSER device export

OST 0: ZFS Stripe (A,B,C,D)
  A: ZFS mirror (1+1’+1’’)
  B: ZFS mirror (2+2’+2’’)
  C: ZFS mirror (3+3’+3’’)
  D: ZFS mirror (4+4’+4’’)

OST 1: ZFS Stripe (E,F,G,H)
  E: ZFS mirror (5+5’+5’’)
  F: ZFS mirror (6+6’+6’’)
  G: ZFS mirror (7+7’+7’’)
  H: ZFS mirror (8+8’+8’’)

OST 2: ZFS Stripe (I,J,K,L)
  I: ZFS mirror (9+9’+9’’)
  J: ZFS mirror (10+10’+10’’)
  K: ZFS mirror (11+11’+11’’)
  L: ZFS mirror (12+12’+12’’)
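As a rough sketch of the grouping above, the following Python builds the slot-to-OST mapping (four mirrors per stripe, as on the slide) and assembles a `zpool create` argument list for one OST. The pool name, device paths and cache device are hypothetical placeholders, not the names used at /rdlab, and the command is only printed, never executed.

```python
from string import ascii_uppercase
from typing import Dict, List

MIRRORS_PER_OST = 4  # A-D, E-H and I-L on the slide


def mirror_label(slot: int) -> str:
    """Slot 1 -> 'A', slot 2 -> 'B', ... as labelled on the slide."""
    return ascii_uppercase[slot - 1]


def build_osts(slots: int = 12, per_ost: int = MIRRORS_PER_OST) -> Dict[str, List[int]]:
    """Group consecutive mirror slots into OSTs (one ZFS stripe each)."""
    osts: Dict[str, List[int]] = {}
    for first in range(1, slots + 1, per_ost):
        name = f"OST {(first - 1) // per_ost}"
        osts[name] = list(range(first, min(first + per_ost, slots + 1)))
    return osts


def zpool_create_args(pool: str, slots: List[int], cache_dev: str) -> List[str]:
    """Build 'zpool create' arguments: a stripe of three-way mirrors plus a cache.

    Device paths are hypothetical; real ones depend on how the iSER
    devices appear on the serving node.
    """
    args = ["zpool", "create", pool]
    for slot in slots:
        args.append("mirror")
        args += [f"/dev/disk/by-id/srv{srv}-disk{slot}" for srv in (1, 2, 3)]
    args += ["cache", cache_dev]
    return args


osts = build_osts()
for name, slots in osts.items():
    print(name, "= stripe of mirrors", [mirror_label(s) for s in slots])

print(" ".join(zpool_create_args("ost0pool", osts["OST 0"], "/dev/disk/by-id/cache0")))
```

Listing several `mirror` groups in one `zpool create` call is how ZFS stripes data across mirrors, so each OST ends up as a single pool of four three-way mirrors.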



THE FINAL FORM
(Diagram: every group below runs on server1, server2 and server3, all attached to the High speed Local Area Network)

- OST0, OST1, OST2: SATA3

- OST3, OST4, OST5: SATA3 + SSD

- OST6, OST7, OST8: SSD

- MDS/MDT

- Services shown: HPC services, Database services, Cloud services


WHY A TRIAD-BASED ARCHITECTURE?
• Features and flavors

- Customization

- Performance (dedicated network + I/O split)

- Data cost vs redundancy

- Reliability (data CRC + Quorum)

- Isolation (maintenance, disaster)

- Rebuild impact

- Big File support (in Lustre, size matters)

- ZFS benefits (compression, deduplication, cache…)
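The "data cost vs redundancy" and "quorum" points can be made concrete with a little arithmetic. The sketch below assumes a plain three-way mirror (usable space is one third of raw, before any ZFS compression) and a simple majority quorum among the three servers; the disk count and disk size are hypothetical values chosen only for illustration.

```python
from typing import Tuple


def capacity_tb(disks_per_server: int, disk_tb: float, copies: int = 3) -> Tuple[float, float]:
    """Return (raw, usable) capacity for mirrors with one copy per server.

    Each mirror stores a single disk's worth of data, so usable space
    is raw / copies. Compression and metadata overhead are ignored.
    """
    raw = disks_per_server * copies * disk_tb
    usable = disks_per_server * disk_tb
    return raw, usable


def quorum(nodes: int) -> int:
    """Votes needed for a simple majority among `nodes` voters."""
    return nodes // 2 + 1


def tolerated_node_failures(nodes: int) -> int:
    """Nodes that can be lost while the remaining ones still hold quorum."""
    return nodes - quorum(nodes)


raw, usable = capacity_tb(disks_per_server=12, disk_tb=4.0)  # hypothetical sizes
print(f"raw={raw} TB usable={usable} TB ({usable / raw:.0%} of raw)")
print("quorum for 3 nodes:", quorum(3))
print("node failures tolerated:", tolerated_node_failures(3))
```

Under these assumptions a triad pays three disks for every disk of usable space, and in exchange one whole server can be taken down for maintenance or lost while the other two still form a majority.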

REFERENCES
[1] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A3=ind1012&N=DELL04.Handover_training.pdf

[2] https://wiki.lustre.org/Lustre_Object_Storage_Service_(OSS)

[3] https://en.wikipedia.org/wiki/ISCSI_Extensions_for_RDMA

[4] https://wiki.lustre.org/ZFS_OSD

[5] https://openzfs.org/wiki/

[6] https://www.digistor.com.au/the-latest/Whether-RAID-5-is-still-safe-in-2019/

[7] https://blog.westerndigital.com/hyperscale-why-raid-systems-are-dangerous/

[8] https://www.laverdad.es/gastronomia/clave-innovacion-estar-20201107011018-ntvo.html

[9] https://www.lavanguardia.com/comer/al-dia/20201005/33652/mejor-croissant-espana-encontraras-pasteleria-barcelona-brunells.html

[10] https://www.upc.edu

[11] https://www.cs.upc.edu

[12] https://rdlab.cs.upc.edu
