
STO1479BU

vSAN Beyond the Basics


VMworld 2017 Content: Not for publication or distribution
Sumit Lahiri – Product Line Manager
Eric Knauft – Staff Engineer

#VMworld #STO1479BU
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

#STO1479BU CONFIDENTIAL 2
Agenda

1 The World of Objects
2 Life of vSAN Component
3 The 4 Rs of vSAN
4 Multi-Level Fault Domains
5 All Flash I/O Flow
The World of Objects
Disk layout in host

vSAN Datastore
Disk groups contribute to a single vSAN datastore in the vSphere cluster

▪ Max 64 nodes
▪ Min 2 nodes (ROBO)
▪ Max 5 disk groups per host
▪ 2 tiers per disk group: cache and capacity
Creating a VM creates several objects in the background

• Virtual disk (VMDK)
• VM home namespace: VMX, log files
• Virtual memory swap objects
From VM to components

[Figure: an object is divided into components (max size: 255 GB each), and each component is divided into blocks (in the low MBs).]
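The 255 GB component limit implies a lower bound on how many components one replica of an object needs. The sketch below is only that arithmetic; the function name and the `stripe_width` parameter are illustrative assumptions, and the real placement logic may split objects differently.

```python
import math

MAX_COMPONENT_GB = 255  # per-component size limit quoted on the slide


def min_components(object_size_gb: float, stripe_width: int = 1) -> int:
    """Lower bound on components per replica (hypothetical helper).

    An object needs at least enough components to respect the size limit,
    and at least one component per stripe.
    """
    if object_size_gb <= 0 or stripe_width < 1:
        raise ValueError("size must be positive and stripe width at least 1")
    by_size = math.ceil(object_size_gb / MAX_COMPONENT_GB)
    return max(by_size, stripe_width)


if __name__ == "__main__":
    for size in (250, 1024):
        print(size, "GB ->", min_components(size), "component(s) per replica")
```

A 250 GB disk fits in one component per replica, while a 1 TB disk has to be split even when the policy asks for no striping.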
Fault Domains

[Figure: fault domains in a vSphere/vSAN cluster can be hosts, racks, or sites.]
Failures to Tolerate (FTT)

Always in the context of fault domains

[Figure: failures to tolerate counted at the host, rack, or site level.]
Failures to Tolerate (FTT)

FTT implies host failures to tolerate if the fault domain is not mentioned

[Figure: three vSphere/vSAN clusters illustrating FTT=1, FTT=2, and FTT=3.]
Failures to Tolerate (FTT) can be Nested

Survive one site failure and one host failure on the other site

[Figure: nested fault domains in a vSphere/vSAN cluster: hosts, racks, and sites.]
Fault Tolerance Methods
Fault Tolerance Method (FTM)

Capacity consumed per byte of data, by method and FTT (✓ = supported, X = not supported):

          FTT=1               FTT=2               FTT=3
RAID-1    ✓ 2 bytes/byte      ✓ 3 bytes/byte      ✓ 4 bytes/byte
RAID-5    ✓ 1.3 bytes/byte    X                   X
RAID-6    X                   ✓ 1.5 bytes/byte    X
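The overhead table translates directly into a raw-capacity estimate. The sketch below is illustrative only: the function name is made up, and the 1.3 bytes/byte figure is treated as a rounding of 4/3 (three data blocks plus one parity block), which is an assumption consistent with the four-component RAID-5 layout shown later in the deck.

```python
from fractions import Fraction

# Bytes stored per byte of data, keyed by (FTM, FTT), from the table above.
# RAID-5 is taken as 4/3 (assumption: "1.3" on the slide is rounded).
OVERHEAD = {
    ("RAID-1", 1): Fraction(2),
    ("RAID-1", 2): Fraction(3),
    ("RAID-1", 3): Fraction(4),
    ("RAID-5", 1): Fraction(4, 3),
    ("RAID-6", 2): Fraction(3, 2),
}


def raw_capacity_gb(usable_gb: float, ftm: str, ftt: int) -> float:
    """Raw capacity consumed by `usable_gb` of data (hypothetical helper)."""
    try:
        factor = OVERHEAD[(ftm, ftt)]
    except KeyError:
        raise ValueError(f"{ftm} with FTT={ftt} is not a supported combination") from None
    return float(usable_gb * factor)


if __name__ == "__main__":
    for ftm, ftt in OVERHEAD:
        print(f"{ftm} FTT={ftt}: 100 GB of data uses {raw_capacity_gb(100, ftm, ftt):.1f} GB raw")
```

Combinations marked X in the table raise `ValueError` rather than returning a number.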
FTT = Failures to Tolerate
FTM = Fault Tolerance Method
Notation
Object is associated with underlying policy

[Figure: a VMDK object with its policy attached.]

Policy:
1. Failures to Tolerate
2. Fault Tolerance Method
Policy dictates how objects are managed
FTT = 1, FTM = RAID-1, Stripe Width > 2

Policy:
1. Failures to Tolerate (FTT)
2. Fault Tolerance Method (FTM)

[Figure: the VMDK is mirrored into two replicas; each replica is divided into stripes, stored as components C1, C2, ….]
RAID Abstraction Model
FTT = 1, FTM = RAID-1, Stripe Width > 2, no witness

[Figure: the replica/stripe view of the VMDK next to the equivalent RAID tree: a RAID-1 (R1) root over two RAID-0 (R0) nodes, one per replica, each with components C1, C2, … as leaves.]
FTT=1, FTM=RAID-1: comparison with and without stripes

[Figure, no witness shown: on the left, a 250 GB VMDK without striping is an R1 node over two 250 GB components (C). On the right, a larger VMDK is an R1 node over two 1 TB R0 replicas, each split into 250 GB components C1, C2, ….]
vSAN is managed as a bunch of components

[Figure: the vSAN datastore as a pool of components (C).]
Each replica on a different fault domain (e.g. host)

FTT = 2, FTM = RAID-1, Stripe Width = 2

[Figure: VMDK RAID tree with an R1 root over three R0 replicas, each with components C1 and C2.]
Each component is commonly placed on a different host

FTT = 2, FTM = RAID-1, Stripe Width = 2

[Figure: VMDK RAID tree with an R1 root over three R0 replicas; the six components (C1 and C2 under each R0) are spread across hosts.]
Can we survive 2 host failures with 3 hosts?

FTT = 2, FTM = RAID-1, Stripe Width = 2

[Figure: VMDK RAID tree with an R1 root over three R0 replicas, each with components C1 and C2, one replica per host.]
Liveness = Availability && Quorum
Quorum: in the event of a cluster partition, which partition shall proceed?

[Figure: a cluster split into partition-01 with N hosts and partition-02 with M hosts.]
Quorum: the partition with more votes proceeds

[Figure: partition-01 with N hosts holds N votes; partition-02 with M hosts holds M votes.]

Cluster members participate in voting
If M > N, partition-02 proceeds

[Figure: partition-01 with N hosts and N votes; partition-02 with M hosts and M votes proceeds.]

Cluster members participate in voting
Voting
FTT=1 and FTM = RAID-1
Quorum is calculated on a per-object basis

• Each component participates in voting
• With two components, this sums to an even number of votes

[Figure, no witness: VMDK with an R1 node over two components (C), 1 vote each.]
Add a witness for the tie-breaker vote

• The witness is added as a tie-breaker vote
• It acts as an observer of which component has the latest data

[Figure: VMDK with an R1 node over two components (C) and a witness (W), 1 vote each.]
For VMDK-A, partition-02 has more votes

[Figure: VMDK-A is an R1 node over two components (C) and a witness (W), 1 vote each. Partition-01 (N hosts) holds one component: 1 vote. Partition-02 (M hosts) holds the other component and the witness: 2 votes, so partition-02 proceeds.]
General case: different objects proceed on different partitions

[Figure: VMDK-A and VMDK-B are each an R1 node over two components (C) and a witness (W), 1 vote each. Partition-01 (N hosts) holds a component and the witness of VMDK-B plus one component of VMDK-A. Partition-02 (M hosts) holds a component and the witness of VMDK-A plus one component of VMDK-B.]

Partition-01 proceeds for VMDK-B; partition-02 proceeds for VMDK-A.
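Because votes are counted per object, the winner has to be worked out object by object. The sketch below is a minimal model, assuming that "more votes" means strictly more than half of the object's total votes; the function name and dictionary layout are illustrative, not a vSAN interface.

```python
from typing import Dict, Optional


def winning_partition(votes_by_partition: Dict[str, int]) -> Optional[str]:
    """Return the partition holding a strict majority of an object's votes.

    Returns None when no partition holds more than half (for example on a tie).
    """
    total = sum(votes_by_partition.values())
    for name, votes in votes_by_partition.items():
        if 2 * votes > total:
            return name
    return None


# Vote placement read off the two-object slide: each object has two
# components and one witness, 1 vote each.
VMDK_A = {"partition-01": 1, "partition-02": 2}
VMDK_B = {"partition-01": 2, "partition-02": 1}

if __name__ == "__main__":
    print("VMDK-A proceeds on", winning_partition(VMDK_A))
    print("VMDK-B proceeds on", winning_partition(VMDK_B))
```

The tie case is why the witness exists: two components alone give `{"partition-01": 1, "partition-02": 1}`, for which no partition wins.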
Components can be classified as data components and witness components

[Figure: VMDK with an R1 node (no striping) over two data components (D) and one witness component (W), 1 vote each.]
Minimum number of hosts required to survive N host failures?
Minimum 2N+1 hosts required to survive N host failures

[Figure: partition-01 has N hosts = N shares of votes; partition-02 has (N+1) hosts = (N+1) shares of votes, so partition-02 is the winning partition.]

• Assume each host represents the same share of votes
• The winning partition requires a minimum of N+1 hosts
• Minimum size of cluster = 2N+1 hosts to survive N host failures
Min cluster size is determined by meeting Liveness requirement

• Liveness = (Quorum) && (Availability)
• Min hosts in cluster = Max (Min hosts for Quorum, Min hosts for Availability)
Examples

• FTT = 1, FTM = RAID-1
  • Min hosts for availability = 2
  • Min hosts for quorum = 2N+1 = 3
  • Min cluster size = 3

• FTT = 2, FTM = RAID-1
  • Min hosts for availability = 3
  • Min hosts for quorum = 2N+1 = 5
  • Min cluster size = 5
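The two examples follow one rule, which can be sketched as below. The function names are illustrative; the availability bound of FTT+1 hosts is the RAID-1 case taken from the examples above (one host per replica).

```python
def min_hosts_for_quorum(ftt: int) -> int:
    """2N+1 hosts, so N+1 hosts still hold a majority after N failures."""
    return 2 * ftt + 1


def min_hosts_for_availability_raid1(ftt: int) -> int:
    """RAID-1 keeps FTT+1 replicas, each on its own host."""
    return ftt + 1


def min_cluster_size_raid1(ftt: int) -> int:
    """Liveness needs both, so take the larger of the two requirements."""
    if ftt < 1:
        raise ValueError("FTT must be at least 1")
    return max(min_hosts_for_quorum(ftt), min_hosts_for_availability_raid1(ftt))


if __name__ == "__main__":
    for ftt in (1, 2, 3):
        print(f"FTT={ftt}: minimum cluster size {min_cluster_size_raid1(ftt)} hosts")
```

For RAID-1 the quorum requirement always dominates, which is why the minimum cluster size equals 2N+1 in both examples.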
Examples of Liveness
(Quorum + Availability)
Quorum (FTT:2, FTM: RAID-1 ) = 5 Hosts, no stripe

FTT = 2, FTM = RAID-1, Stripe Width = 1

[Figure: VMDK with an R1 node over three data components (D) and two witness components (W), 1 vote each.]

3 data components = 3 votes
2 witness components = 2 votes
Votes Re-assigned / Re-balanced as stripe width is changed

FTT = 2, FTM = RAID-1, Stripe Width = 2

[Figure: R1 root over three R0 replicas, each with components C1 and C2 at 1 vote apiece (2 votes per replica), plus two witnesses (W) holding 2 and 3 votes. One witness is assigned a higher vote count to break ties.]
Quorum with stripe width =2

[Figure: Partition-1 holds one replica (C1, C2: 2 votes) and a witness (2 votes): availability but no quorum. Partition-2 holds two replicas (2 votes each) and a witness (3 votes): availability and quorum, so Partition-2 proceeds.]
Quorum = True
Availability = False
It is possible to have Quorum but no Availability

[Figure: R1 root over three R0 replicas. Partition-1 holds component C1 of every replica (1 vote each) and the 3-vote witness: quorum, but no complete replica, so no availability. Partition-2 holds component C2 of every replica (1 vote each) and the 2-vote witness: no quorum.]
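Both partition examples for the FTT=2, stripe width 2 object can be checked with a small model. This is a sketch under stated assumptions: quorum means strictly more than half of the object's votes, availability means at least one replica has all of its components reachable, and the component names (`R0a.C1`, `W1`, and so on) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Component:
    name: str
    votes: int
    replica: Optional[str] = None  # None marks a witness component


def build_object() -> List[Component]:
    """Three replicas of two 1-vote components, plus witnesses of 2 and 3 votes."""
    comps = [Component(f"{r}.C{i}", 1, r) for r in ("R0a", "R0b", "R0c") for i in (1, 2)]
    comps += [Component("W1", 2), Component("W2", 3)]
    return comps


def has_quorum(reachable: Iterable[str], obj: List[Component]) -> bool:
    names = set(reachable)
    total = sum(c.votes for c in obj)
    held = sum(c.votes for c in obj if c.name in names)
    return 2 * held > total


def has_availability(reachable: Iterable[str], obj: List[Component]) -> bool:
    names = set(reachable)
    replicas = {c.replica for c in obj if c.replica is not None}
    return any(all(c.name in names for c in obj if c.replica == r) for r in replicas)


def is_live(reachable: Iterable[str], obj: List[Component]) -> bool:
    """Liveness = Quorum && Availability."""
    return has_quorum(reachable, obj) and has_availability(reachable, obj)


if __name__ == "__main__":
    obj = build_object()
    split_by_component = ["R0a.C1", "R0b.C1", "R0c.C1", "W2"]
    print("quorum:", has_quorum(split_by_component, obj))
    print("availability:", has_availability(split_by_component, obj))
    print("live:", is_live(split_by_component, obj))
```

The partition that holds one component from every replica wins the vote count yet cannot serve I/O, which is the "quorum but no availability" case.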
RAID-5
RAID-5 protection against 1 host failure

[Figure: VMDK with an R5 node over four components, each on a separate host: C0 (1 vote) on esxi-01, C1 (1 vote) on esxi-02, C2 (2 votes) on esxi-03, C3 (1 vote) on esxi-04. One component is assigned a higher vote to break ties.]
RAID-5 protection against 1 host failure

[Figure: VMDK with an R5 node over four components C0 to C3 (votes 1, 1, 2, 1) on esxi-01 to esxi-04. Each component is divided into data and parity blocks; one stripe is shown as D1, D2, P1, D3 across the four hosts.]
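Single-parity RAID-5 computes each parity block as the XOR of the data blocks in its stripe, which is what lets any one lost block be rebuilt. The sketch below shows that arithmetic on a toy stripe; the block contents and the helper name are made up, and it says nothing about how vSAN lays out or sizes its stripes.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


if __name__ == "__main__":
    # One stripe: three data blocks and their parity, each on a different host.
    d1, d2, d3 = b"vSAN", b"data", b"blok"
    p1 = xor_blocks(d1, d2, d3)

    # The host holding D2 fails: rebuild it from the three surviving blocks.
    rebuilt_d2 = xor_blocks(d1, d3, p1)
    print(rebuilt_d2 == d2)
```

Losing a second block from the same stripe leaves too little information to recover, which matches RAID-5 being offered only for FTT=1.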
The Life of vSAN Component
Object States: can be “not compliant” but accessible

• Compliance status: Are all replicas good?
• Operational status: Is it accessible?
• Accessible implies Liveness

[Figure: VMDK with an R1 node over two R0 replicas (2 votes each, components C1 and C2) on esxi-01 and esxi-02, and a witness (W, 3 votes) on esxi-03.]
Object States: can be “not compliant” but accessible

Component states:

• Active = known good
• Degraded = known bad, rebuild now
• Absent = known bad, cause not known, repair after 60 mins
• Stale = Active, however needs update
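The four states map onto different follow-up actions, which can be sketched as a small lookup. The enum, the action strings, and the function name are illustrative assumptions; only the states and the 60-minute delay come from the slide.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DEGRADED = "degraded"
    ABSENT = "absent"
    STALE = "stale"


REPAIR_DELAY_MIN = 60  # wait before repairing an Absent component


def next_action(state: State, minutes_in_state: float = 0) -> str:
    """Return the follow-up action for a component (hypothetical labels)."""
    if state is State.ACTIVE:
        return "none"
    if state is State.DEGRADED:
        return "rebuild now"
    if state is State.STALE:
        return "resync"
    # Absent: the cause is unknown, so give the component time to come back.
    return "rebuild" if minutes_in_state >= REPAIR_DELAY_MIN else "wait"


if __name__ == "__main__":
    print(next_action(State.ABSENT, 10))
    print(next_action(State.ABSENT, 60))
    print(next_action(State.DEGRADED))
```

The delay for Absent components avoids copying a whole component when a host is only rebooting or briefly partitioned.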
4 Rs – Resync , Rebuild, Repair and Reconfiguration

[Figure: three views of a VMDK with an R1 node over components C1 … C4.]

Layout
• VMDK is divided into components
• Components are made up of data blocks
• Each component is on a different host
• Each data block is of fixed size

Partial Resync (example: component on Host-1, state: active-stale, resync blocks)
• Copy data to stale components
• When a component comes back from being absent

Repair / Reconfigure (example: component on Host-4, state: degraded, build out the component)
• Build fresh component
• Full Resync
Rebuild Example
Begin: All components / elements are in active state

[Figure: two replicas (components C1 and C2, 2 votes per replica) and a witness (W, 3 votes), all in the Active (A) state.]

Tolerate 1 host failure with RAID-1
Cluster partitions with unknown cause, components go "Absent"

[Figure: Partition-1 holds one replica (C1, C2; 2 votes), now marked Absent. Partition-2 holds the witness (W, 3 votes) and the other replica (C1, C2; 2 votes), both Active.]

• Absent: known bad, but cause not known
• Cluster partition, cause unknown: do not repair immediately
• Object is not compliant but accessible
Partition with both Availability and Quorum proceeds

The VM fails over (HA) to Partition-2, which has both quorum and availability.

[Figure: Partition-1 holds the Absent replica (C1, C2; 2 votes): availability, no quorum. Partition-2 holds the witness (W, 3 votes) and the Active replica (C1, C2; 2 votes): quorum and availability, so it proceeds.]
Partition is resolved, component is Resynced

[Figure: the returning replica's components C1 and C2 (2 votes) are marked Active-Stale (AS) and resynced; the witness (W, 3 votes) and the other replica (2 votes) stay Active (A).]

• Component is marked as Active-Stale; the object is not compliant
• The Active-Stale component is resynced
All components / elements are in active state

[Figure: both replicas (C1, C2; 2 votes each) and the witness (W, 3 votes) are Active (A).]

• All components are Active
• Object is compliant and accessible
Repair Scenarios
Absent Components Repair After 60 Min

[Figure: Partition-1 holds the Absent replica (C1, C2; 2 votes). Partition-2 holds the most recent data: the witness (W, 3 votes) and the Active replica (C1, C2; 2 votes).]

Resync after 60 min
Degraded Components Repair Immediately

A hardware failure causes the Degraded state

[Figure: one replica (C1, C2; 2 votes) is Degraded (D); the witness (W, 3 votes) and the other replica (C1, C2; 2 votes) are Active (A).]

Known bad, resync now
Fresh components Resynced From Existing Components

[Figure: the Degraded (D) replica (C1, C2; 2 votes) stays in place while a fresh replica in the Reconfiguring (R) state (C1, C2; 2 votes) is resynced on another host. The witness (W, 3 votes) and the surviving replica (2 votes) are Active (A).]

• Find another host to resync to; resync begins
• Object state is not compliant but accessible
Object is Compliant Again

[Figure: the rebuilt replica, the witness (W, 3 votes), and the surviving replica are all Active (A); the Degraded (D) replica is removed.]

Degraded component is marked for deletion
Rebuild RAID schematics – Resync begins

[Figure: VMDK RAID tree with an R1 root, a witness (W), and three R0 nodes with components C1 and C2: the surviving replica, the Degraded replica, and the new replica, where resync begins.]
Rebuild RAID schematics – Resync ends

[Figure: VMDK RAID tree with an R1 root, a witness (W), and three R0 nodes with components C1 and C2. Resync ends on the new replica, and the Degraded replica is marked for removal.]
Reconfiguration
Changing Storage Policies
Reconfiguration – Increase FTT =2 to FTT =3

[Figure: RAID tree before the change (R1 over three R0 replicas) and after it (R1 over four R0 replicas).]
Reconfiguration – Increase Stripe Width

[Figure: RAID trees (R1 over R0 nodes) before and after increasing the stripe width.]
Multi-Level Fault Domains
Failures to Tolerate (FTT) can be Nested

Survive one site failure and one host failure on the other site

[Figure: nested fault domains in a vSphere/vSAN cluster: hosts, racks, and sites.]
Stretched Cluster deployment with local fault protection

• In prior examples, the host is the fault domain
• 2 levels of fault domain
  – Site and host
• Failures to tolerate at each level

[Figure: two sites, each a vSphere/vSAN cluster protected locally with RAID-5 and mirrored across sites with RAID-1 over a 5 ms RTT, 10GbE link, plus a 3rd site for the witness.]
RAID tree for stretched cluster with local fault protection

[Figure: an R1 root mirrors across sites; at each of Site-1 and Site-2 an R5 node spans four components holding data and parity blocks (D1, D2, D3, P1).]
Survive 1 site failure

[Figure: stretched-cluster RAID tree (R1 over one R5 per site, blocks D1, D2, D3, P1 at each site), illustrating the loss of one whole site.]
Survive 1 site failure and 1 host failure

[Figure: stretched-cluster RAID tree (R1 over one R5 per site, blocks D1, D2, D3, P1 at each site), illustrating the loss of one site plus one host at the other site.]
Anatomy of write: from site - 1 to site - 2
1 Issue write
2a Update local data and parity
2b Send only data (Dn) across sites, to the remote helper RAID tree (proxy owner)
3 Remote side calculates parity

[Figure: R1 root over an R5 at Site-1 and an R5 at Site-2, each spanning blocks D1, D2, D3, P1, with the numbered steps overlaid.]
Votes in Stretched Cluster
5 Votes per site

• 3 voting entities for the first level: Site-1, Site-2 and the witness (W)
• Witness has an equal share of votes as the other 2 entities (e.g. sites)
• 4 components for the second level, at each site
• Total of 5 votes per site (odd number of votes)

[Figure: R1 root over one R5 per site. At each site, one of the four components holds 2 votes and the other three hold 1 vote each, 5 in total.]
Witness is assigned same voting rights as the sites

• 3 voting entities for the first level: Site-1, Site-2 and the witness (W)
• Witness has an equal share of votes as the other 2 entities (e.g. sites): 5 votes each
• 4 components for the second level, at each site
• Total of 5 votes per site (odd number of votes)

[Figure: R1 root over one R5 per site; Site-1, Site-2, and the witness each carry 5 votes.]
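With 5 votes per site and 5 for the witness, the vote count after a nested failure can be checked by hand or with a short model. The sketch below only tests quorum, assumed to mean strictly more than half of all votes; the host names and which host carries the 2-vote component are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Votes per component at each site (2 + 1 + 1 + 1 = 5), as read from the slide.
# Host names and the placement of the 2-vote component are assumptions.
SITE_VOTES = {
    "site-1": {"h1": 2, "h2": 1, "h3": 1, "h4": 1},
    "site-2": {"h1": 2, "h2": 1, "h3": 1, "h4": 1},
}
WITNESS_VOTES = 5


def has_quorum(
    failed_sites: Iterable[str] = (),
    failed_hosts: Iterable[Tuple[str, str]] = (),
    witness_up: bool = True,
) -> bool:
    """True if the surviving votes are a strict majority of all votes."""
    down_sites: Set[str] = set(failed_sites)
    down_hosts: Set[Tuple[str, str]] = set(failed_hosts)
    total = sum(sum(h.values()) for h in SITE_VOTES.values()) + WITNESS_VOTES
    alive = WITNESS_VOTES if witness_up else 0
    for site, hosts in SITE_VOTES.items():
        if site in down_sites:
            continue
        alive += sum(v for host, v in hosts.items() if (site, host) not in down_hosts)
    return 2 * alive > total


if __name__ == "__main__":
    print(has_quorum(failed_sites=["site-1"]))
    print(has_quorum(failed_sites=["site-1"], failed_hosts=[("site-2", "h1")]))
    print(has_quorum(failed_sites=["site-1"], witness_up=False))
```

Even when the failed host is the one holding 2 votes, the surviving site and the witness still hold 8 of 15 votes.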
I/O Flows
Anatomy of an All-Flash Write
Pretty much the same as hybrid:
▪ VM running on host H1
▪ H1 is owner of virtual disk object
▪ Number Of Failures To Tolerate = 1
▪ Object has 2 replicas, on H1 and H2

1. Guest OS issues write op to virtual disk
2. Owner clones write op
3. In parallel: sends "prepare" op to H1 (locally) and H2
4. H1, H2 persist op to Flash (log)
5. H1, H2 ACK prepare op to owner
6. Owner waits for ACK from both "prepares" and completes I/O
7. Later, owner commits batch of writes

[Figure: hosts H1, H2, H3 in a vSphere / Virtual SAN cluster, with the numbered steps overlaid on the virtual disk and its two replicas.]
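The seven steps amount to a prepare/acknowledge phase followed by a deferred commit. The sketch below models that flow in memory. All class and method names are illustrative assumptions, the "prepare" calls run one after another rather than in parallel, and nothing here reflects the actual vSAN implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicaHost:
    name: str
    log: List[bytes] = field(default_factory=list)        # stands in for the flash write log
    committed: List[bytes] = field(default_factory=list)

    def prepare(self, data: bytes) -> bool:
        self.log.append(data)   # step 4: persist the op to the log
        return True             # step 5: ACK the prepare to the owner

    def commit(self) -> None:
        self.committed.extend(self.log)   # step 7: commit the batched writes
        self.log.clear()


class Owner:
    def __init__(self, replicas: List[ReplicaHost]) -> None:
        self.replicas = list(replicas)

    def write(self, data: bytes) -> bool:
        # Steps 2-3: clone the op and send one "prepare" per replica.
        acks = [replica.prepare(bytes(data)) for replica in self.replicas]
        # Step 6: the I/O completes only when every replica has ACKed.
        return all(acks)

    def commit_batch(self) -> None:
        for replica in self.replicas:
            replica.commit()


if __name__ == "__main__":
    h1, h2 = ReplicaHost("H1"), ReplicaHost("H2")
    owner = Owner([h1, h2])
    print("write completed:", owner.write(b"block-0"))   # step 1: guest issues a write
    owner.commit_batch()
    print(h1.committed == h2.committed)
```

The guest sees the write complete after step 6, once both logs hold the data; the commit in step 7 happens later and off the latency path.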
All-flash: Destaging Cache to Capacity
▪ Data from committed writes accumulates on Flash Cache (Write Buffer)
  • From different VMs / virtual disks
▪ In all-flash, blocks that are written most often (hot) stay in write cache.
▪ In all-flash, blocks that are infrequently accessed (cold) are destaged to flash capacity layer.

[Figure: hosts H1, H2, H3 in a vSphere / Virtual SAN cluster; hot blocks stay in cache, cold blocks move to capacity.]
Nerd Out With These Key vSAN Activities at VMworld

Practice with Hands-on-Labs
Learn from self-paced and expert led hands on labs
• vSAN Getting Started Workshop (Expert led)
• VxRail Getting Started (Self paced)
• Self-Paced lab available online 24x7

Visit SDDC Assessment Lounge
Discover how to assess if your IT is a good fit for HCI
• Four Seasons Willow Room/2nd floor
• Open from 11am – 5pm Sun, Mon, and Tue
• Learn more at Assessing & Sizing in STO1500BU

Become a vSAN Specialist
Earn VMware digital badges to showcase your skills
• New 2017 vSAN Specialist Badge
• Education & Certification Lounge: VM Village
• Certification Exam Center: Jasmine EFG, Level 3

#HitRefresh on your current data center and discover the possibilities!


3 Easy Ways to Learn More about vSAN

Storage Hub Technical Library
• StorageHub.vmware.com
• Reference architectures, off-line demos and more
• Easy search function
• And More!

New vSAN Tools
• vSAN Sizer
• vSAN Assessment

Hands-On Lab
Test drive vSAN for free today!
• Live at VMworld
• Practical learning of vSAN, VxRail and more
• 24x7 availability online – for free!