Ethernet Storage Fabrics

Using RDMA with Fast NVMe-oF Storage to Reduce Latency and Improve Efficiency

Kevin Deierling & Idan Burstein
Mellanox Technologies

2017 Storage Developer Conference. © Mellanox Technologies. All Rights Reserved.
Storage Media Technology

(Chart: storage media access time, showing a 10,000X improvement across media generations.)
Ethernet Storage Fabric

Everything a traditional SAN offers, but faster, smarter, & less expensive.

PERFORMANCE
• Highest bandwidth
• Lowest latency
• RDMA and storage offloads
• Native NVMe-oF acceleration

INTELLIGENCE
• Integrated & automated provisioning
• Hardware-enforced security & isolation
• Monitoring, management, & visualization
• Storage-aware QoS

EFFICIENCY
• Just works out of the box
• Flexibility: block, file, object, HCI
• Converged: storage, VMs, containers
• Affordable: SAN without the $$
Networked Storage Growth is Ethernet

• Ethernet is growing very rapidly, driven by:
  • Cloud & hyperconverged infrastructure
  • NVMe over Fabrics
  • Software-defined storage

Server SAN is the new storage network, because there is no Fibre Channel in the cloud.
Extending NVMe Over Fabrics (NVMe-oF)

• NVMe SSDs shared by multiple servers
  • Better utilization of capacity, rack space, and power
  • Scalability, management, fault isolation
• NVMe-oF industry standard
  • Version 1.0 completed in June 2016
• RDMA protocol is part of the standard
  • NVMe-oF version 1.0 includes a transport binding specification for RDMA
  • Ethernet (RoCE) and InfiniBand
How Does NVMe-oF Maintain Performance?

• Extends NVMe efficiency over a fabric
  • NVMe commands and data structures are transferred end to end
• RDMA is key to performance
  • Reduces latency
  • Increases throughput
  • Eliminates TCP/IP overhead
RDMA: More Efficient Networking

RDMA performs four critical functions in hardware (an adapter-based transport):

1. Reliable data transport
2. App-level user-space I/O (AKA kernel bypass)
3. Address translation
4. Memory protection

• CPU not consumed moving data - free to run apps!
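Functions 3 and 4 are set up when a buffer is registered with the adapter. A minimal sketch with the libibverbs API, assuming a protection domain pd already exists (illustrative, not taken from the deck):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Registering a buffer pins it, builds the NIC's address-translation
     * entries for it, and sets the access rights (memory protection). */
    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (!buf)
            return NULL;

        /* Local write is needed to receive into the buffer; the remote
         * flags let a peer move data with one-sided RDMA READ/WRITE. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }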
RDMA is Natural Extension for NVMe

• SW-HW communication through work & completion queues in shared memory

(Diagram: target software in host memory drives submission and completion queue pairs for both the RoCE/IB NIC and the NVMe flash, which consume them in hardware.)
Memory Queue Based Data Transfer Flow

(Diagram: data transfer flow between the memory-resident queues and the RDMA adapter.)
RDMA

• Transport built on simple primitives deployed in the industry for 15 years
• Queue Pair (QP) – RDMA communication endpoint: send, receive and completion queues
• Connect for mutually establishing a connection
• Registration of a memory region (REG_MR) for enabling virtual network access to memory
• SEND and RCV for reliable two-sided messaging
• RDMA READ and RDMA WRITE for reliable one-sided memory-to-memory transfers (sketched below)
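A minimal sketch of the one-sided primitives with the libibverbs API. The queue pair is assumed to be connected already, and the remote address and rkey to have been advertised by the peer (in NVMe-oF, inside the command capsule's SGL):

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Post a one-sided RDMA READ that pulls len bytes from the peer's
     * registered memory into the local region mr. */
    static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr,
                              uint64_t remote_addr, uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided read          */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED; /* ask for a completion    */
        wr.wr.rdma.remote_addr = remote_addr;       /* peer's virtual address  */
        wr.wr.rdma.rkey        = rkey;              /* peer's registration key */

        /* IBV_WR_RDMA_WRITE pushes data the other way; IBV_WR_SEND gives
         * two-sided messaging and consumes a RECV posted by the peer. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }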
NVMe and NVMe-oF Fit Together Well

(Diagram: the NVMe queueing model on each side, joined across the network by the NVMe-oF transport.)
NVMe-oF IO WRITE

• Host
  • Post SEND carrying the Command Capsule (CC)
• Subsystem
  • Upon RCV completion:
    • Allocate memory for the data and register it to the RNIC
    • Post RDMA READ to fetch the data
  • Upon READ completion:
    • Post the command to the backing store
  • Upon SSD completion:
    • Send the NVMe-oF Response Capsule (RC)
    • Free the memory
  • Upon SEND completion:
    • Free CC and completion resources

(Sequence diagram, NVMe initiator – RNIC – RNIC – NVMe target: Post Send (CC); Send – Command Capsule; Ack; completions on both sides and the initiator frees its send buffer; the target allocates and registers memory; RDMA Read returns the data; the target posts the NVMe command and waits for completion; Post Send (RC); Send – Response Capsule; Ack; completions; send, receive and data buffers are freed. A target-side sketch follows.)
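The target side of that flow can be pictured as three completion handlers. This is a schematic sketch only; every type and helper name below (nvmf_request, alloc_data_buf, and so on) is a hypothetical placeholder, not a real driver API:

    #include <stdint.h>

    /* Hypothetical request context carried between completion handlers. */
    struct nvmf_request {
        void     *qp;          /* RDMA connection the capsule arrived on   */
        void     *data;        /* staging buffer for the write payload     */
        uint32_t  length;      /* data length from the command capsule     */
        uint64_t  remote_addr; /* host buffer address from the capsule SGL */
        uint32_t  rkey;        /* remote key from the capsule SGL          */
    };

    void *alloc_data_buf(uint32_t len);
    void  register_with_rnic(void *buf, uint32_t len);
    void  post_rdma_read(void *qp, void *dst, uint64_t remote_addr,
                         uint32_t rkey, uint32_t len);
    void  backing_store_submit(struct nvmf_request *req);
    void  send_response_capsule(struct nvmf_request *req);
    void  free_data_buf(void *buf);

    /* 1. RCV completion: a Command Capsule arrived. */
    void on_command_capsule(struct nvmf_request *req)
    {
        req->data = alloc_data_buf(req->length);     /* allocate memory for data */
        register_with_rnic(req->data, req->length);  /* register to the RNIC     */
        post_rdma_read(req->qp, req->data,           /* fetch the data           */
                       req->remote_addr, req->rkey, req->length);
    }

    /* 2. RDMA READ completion: the write data is now local. */
    void on_rdma_read_done(struct nvmf_request *req)
    {
        backing_store_submit(req);                   /* post the NVMe command    */
    }

    /* 3. SSD completion: return the Response Capsule and release the buffer. */
    void on_ssd_completion(struct nvmf_request *req)
    {
        send_response_capsule(req);                  /* SEND the NVMe-oF RC      */
        free_data_buf(req->data);                    /* free the staging memory  */
    }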
NVMe-oF IO READ

• Host
  • Post SEND carrying the Command Capsule (CC)
• Subsystem
  • Upon RCV completion:
    • Allocate memory for the data
    • Post the command to the backing store
  • Upon SSD completion:
    • Post RDMA WRITE to write the data back to the host
    • Send the NVMe Response Capsule (RC)
  • Upon SEND completion:
    • Free the memory
    • Free CC and completion resources

(Sequence diagram: Post Send (CC); Send – Command Capsule; Ack; completions; the target posts the NVMe command and waits for completion; RDMA Write carries the read data to the host; Post Send (RC); Send – Response Capsule; Ack; completions; buffers freed.)
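The READ path differs from the WRITE path only in the direction of the one-sided transfer: on SSD completion the target pushes the data with an RDMA WRITE before returning the Response Capsule. Continuing the hypothetical helpers from the WRITE sketch above:

    /* Sketch only, continuing the hypothetical nvmf_request and helpers
     * from the WRITE flow. */
    void post_rdma_write(void *qp, void *src, uint64_t remote_addr,
                         uint32_t rkey, uint32_t len);

    /* SSD completion on an IO READ: push the data, then return the RC. */
    void on_ssd_read_completion(struct nvmf_request *req)
    {
        post_rdma_write(req->qp, req->data,          /* write data back to host */
                        req->remote_addr, req->rkey, req->length);
        send_response_capsule(req);                  /* SEND the NVMe RC        */
        /* On a reliable connection the SEND completes only after the earlier
         * RDMA WRITE, so buffers are freed on the SEND completion below. */
    }

    void on_read_send_completion(struct nvmf_request *req)
    {
        free_data_buf(req->data);                    /* free memory and CC resources */
    }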
NVMe-oF IO WRITE, In-Capsule Data

• Host
  • Post SEND carrying the Command Capsule (CC), with the write data inside the capsule
• Subsystem
  • Upon RCV completion:
    • Allocate memory for the data
    • Post the NVMe command to the backing store
  • Upon SSD completion:
    • Send the NVMe-oF Response Capsule (RC)
    • Free the memory
  • Upon SEND completion:
    • Free CC and completion resources

(Sequence diagram: Post Send (CC); Send – Command Capsule; Ack; completions; the target posts the NVMe command and waits for completion; Post Send (RC); Send – Response Capsule; Ack; completions; buffers freed. No RDMA READ is needed, because the data arrives inside the capsule.)
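Whether a write goes in-capsule is an initiator-side decision against the capsule size the controller advertises. A tiny illustrative sketch; in_capsule_limit is assumed to have been derived from the controller's Identify data, and the exact derivation is omitted here:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: choose in-capsule data for small writes. */
    static bool use_in_capsule(uint32_t data_len, uint32_t in_capsule_limit)
    {
        /* A small write rides inside the command capsule itself: one SEND,
         * no buffer to advertise and no RDMA READ round trip by the target. */
        return data_len > 0 && data_len <= in_capsule_limit;
    }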
NVMf is Great!

Shared Receive Queue (SRQ)

• Problem: the memory footprint of each connection and the receive buffers associated with it
  • Locality, scalability, etc.
• Solution: share receive buffering resources between QPs
  • Sized according to the parallelism required by the application
  • End-to-end credits are managed by RNR NACK, with a timeout matched to the application latency
• We have submitted patches to fix performance in Linux – please try!

(Diagram: several QP receive queues drawing WQEs and memory buffers from one shared work queue.)
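A minimal sketch of the verbs involved, assuming a protection domain and completion queue already exist: create one SRQ, then create QPs that point at it so every connection draws receive buffers from the shared pool (the queue sizes here are arbitrary examples):

    #include <infiniband/verbs.h>

    /* One shared pool of receive WQEs/buffers for many connections. */
    struct ibv_srq *make_srq(struct ibv_pd *pd)
    {
        struct ibv_srq_init_attr attr = {
            .attr = { .max_wr = 4096, .max_sge = 1 },
        };
        return ibv_create_srq(pd, &attr);
    }

    /* Each connection's QP is created with .srq set, so it has no
     * private receive queue of its own. */
    struct ibv_qp *make_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                                  struct ibv_srq *srq)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,
            .qp_type = IBV_QPT_RC,
            .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
        };
        return ibv_create_qp(pd, &attr);
    }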
NVMe-oF System Example

(Diagram: host memory holds the NVMe SQ/CQ, the RDMA SQ/RQ and the data buffers; the NVMe SSD and the RNIC sit below on PCIe.)
Target Data Path (NVMe WRITE)

(Diagram: the same layout as the previous slide, with numbered steps.)

1. Fabrics command
2. Data fetch into host memory
3. Submit NVMe command
4. Doorbell
5. Command + data fetch by the SSD
6. NVMe completion
7. Send fabrics response
8. Fabrics response
Controller Memory Buffer (CMB)

• Internal memory of the NVMe device, exposed over PCIe
• A few MB are enough to buffer the PCIe bandwidth for the latency of the NVMe device
  • Latency ~100-200 µs and bandwidth ~25-50 GbE → capacity ~2.5 MB
• Enabler for peer-to-peer transfer of data and commands between an RDMA-capable NIC and the NVMe SSD
• Optional since NVMe 1.2
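As a rough sanity check on those figures (my arithmetic, not a number from the deck), the buffer needs to cover roughly the bandwidth-latency product of the device:

    50 Gb/s ≈ 6.25 GB/s
    6.25 GB/s × 200 µs ≈ 1.25 MB in flight
    with ~2× headroom (e.g. round trips / double buffering) ≈ 2.5 MB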
Target Data Path with CMB (NVMe WRITE)

(Diagram: as before, but the data buffer now lives in the SSD's Controller Memory Buffer instead of host memory.)

1. Fabrics command
2. Data fetch directly into the CMB (peer to peer, bypassing host memory)
3. Submit NVMe command
4. Doorbell
5. Command + data fetch
6. NVMe completion
7. Send fabrics response
8. Fabrics response
SW Arch for P2P NVMe and RDMA

NVMe Over Fabrics Target Offload

• NVMe over Fabrics is built on top of RDMA
  • Transport communication in hardware
• NVMe over Fabrics target offload enables hosts to access the remote NVMe devices without any target CPU processing
  • By offloading the control part of the NVMf data path
  • Encapsulation/decapsulation between NVMf and NVMe is done by the adapter with 0% CPU
• Resiliency – the NVMe devices stay exposed even through a kernel panic or OS noise
• Admin operations are maintained in software

(Diagram: the RNIC's RDMA transport and NVMf target offload sit between the network and the NVMe device; the admin and NVMe I/O paths are shown against the host root complex and memory subsystem.)
Target Data Path with NVMf Target Offload (NVMe WRITE)

(Diagram: the RDMA SQ/RQ now live on the RNIC; host memory holds the NVMe SQ/CQ and the data buffers, and the host CPU is not involved.)

1. Fabrics command
2. Data fetch into host memory
3. Submit NVMe command
4. Poll CQ
5. Doorbell
6. Command + data fetch by the SSD
7. NVMe completion
8. Fabrics response
Software API for NVMf Target Offload

(Diagram: the NVMf target sits on top of the block layer (submit_bio() for I/O), the NVMe-PCI driver and the RDMA stack driving a ConnectX®-5 and the NVMe I/O device. The offload API consists of: Get IO Queues, Get Properties, Get Data Buffer*, Register NVMf Offload Parameters, and Bind queue pair to NVMf offloads.)
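Put together, the setup implied by that diagram looks roughly like the sketch below. Every type and function name is a hypothetical placeholder standing in for the real driver interfaces, which the deck does not spell out:

    /* Sketch of the offload setup flow; all names below are hypothetical. */
    struct nvme_ctrl;     /* backend NVMe-PCI controller     */
    struct rdma_qp;       /* RDMA queue pair from the fabric */
    struct nvmf_offload;  /* offload context on the RNIC     */

    struct nvmf_offload *register_nvmf_offload_params(void);
    void get_io_queues(struct nvme_ctrl *c, struct nvmf_offload *o);
    void get_properties(struct nvme_ctrl *c, struct nvmf_offload *o);
    void get_data_buffer(struct nvme_ctrl *c, struct nvmf_offload *o);
    void bind_qp_to_nvmf_offload(struct rdma_qp *qp, struct nvmf_offload *o);

    struct nvmf_offload *configure_offload(struct nvme_ctrl *ctrl,
                                           struct rdma_qp *qp)
    {
        struct nvmf_offload *ofl = register_nvmf_offload_params();

        get_io_queues(ctrl, ofl);    /* NVMe I/O queues handed to the RNIC     */
        get_properties(ctrl, ofl);   /* controller properties for the frontend */
        get_data_buffer(ctrl, ofl);  /* staging buffer (host memory or CMB)    */

        /* Once bound, command capsules on this QP are handled entirely by
         * the adapter; only admin commands reach software (previous slide). */
        bind_qp_to_nvmf_offload(qp, ofl);
        return ofl;
    }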
Namespace / Controller Virtualization

• Controller virtualization
  • Expose a single backend controller as multiple NVMf controllers
  • Multiplex the commands in the transport
  • Enables scaling the number of supported initiators
• Subsystem virtualization and namespace provisioning
  • Expose a single NVMf front-end subsystem for multiple NVMe subsystems
  • Provision namespaces to the NVMf front-end controllers

(Diagram: an NVMe-oF offload front-end subsystem exposing namespaces 0-4, mapped onto backend controllers A (nvme0: ns0, ns1), B (nvme1: ns0) and C (nvme2: ns0, ns1).)
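One way to picture the provisioning in that example is a small mapping table from front-end namespace IDs to backend controllers and namespaces. The structure and the exact assignment below are illustrative assumptions, not an API from the deck:

    /* Illustrative front-end -> backend namespace map for the example. */
    struct ns_map {
        unsigned    frontend_nsid; /* namespace ID seen by NVMf initiators */
        const char *backend_ctrl;  /* backing NVMe controller              */
        unsigned    backend_nsid;  /* namespace ID on that controller      */
    };

    static const struct ns_map frontend_subsystem[] = {
        { 0, "nvme0", 0 },  /* backend controller A */
        { 1, "nvme0", 1 },
        { 2, "nvme1", 0 },  /* backend controller B */
        { 3, "nvme2", 0 },  /* backend controller C */
        { 4, "nvme2", 1 },
    };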
Performance

Target Data Path with NVMf Target Offload and CMB (NVMe WRITE)

(Diagram: the RDMA SQ/RQ live on the RNIC and the NVMe SQ/CQ and data buffers live in the SSD's CMB; host memory is not touched on the data path.)

1. Fabrics command
2. Data fetch into the CMB
3. Submit NVMe command
4. Poll CQ
5. Doorbell
6. NVMe completion
7. Fabrics response
