
An SCI-based Software VIA System for PC Clustering

Jeonghee Shin
University of Southern California
AGENDA

Introduction

SCI-based Software VIA (SS-VIA)
 - Overview of SS-VIA
 - Implementation of SS-VIA

Experimental Results
 - Latency and Bandwidth
 - NAS Parallel Benchmark

Conclusion
INTRODUCTION
Introduction
Drawbacks of TCP/IP
 - Unnecessary data copy
 - Context switch (user mode <-> kernel mode)
 - Interrupt

[Figure: message path through TCP/IP (user application -> kernel -> NIC)
versus a user-level message-passing interface (user application -> NIC,
bypassing the kernel)]

User-level message-passing interface
 - Communicates without intervention of the kernel
 - Examples: U-Net, SHRIMP, Active Message, Fast Message, VIA
Virtual Interface Architecture
 - Proposed by Compaq, Intel, and Microsoft
 - Eliminates the system call overhead of traditional models

Hardware implementations
 - Emulex/Giganet's cLan
 - Finisar's Fibre Channel VI Host Bus Adapter
 - Technical University of Chemnitz

Software implementations
 - Intel (Fast Ethernet, Myrinet)
 - Lawrence Berkeley Laboratory's M-VIA (Fast Ethernet, Gigabit Ethernet)
 - UC Berkeley (Myrinet)
SS-VIA SYSTEM
SS-VIA

SCI (Scalable Coherent Interface)
 - ANSI/IEEE standard (1596-1992)
 - Low latency & high bandwidth
 - Remote memory access mechanism: enables a node to transfer data in
   a remote node's memory without a system call
 - An excellent environment on which to build a VIA system

Supports both
 - MPI-based message-passing programming
 - SCI-based shared-memory programming
Structure of SS-VIA
[Figure: layered structure of SS-VIA. The VI Consumer (Application,
MPI, VIPL) runs in user space; the VI Provider (Kernel Agent, VI
Manager, SISCI, Device Driver, and the PCI-SCI Adapter Card as the VI
Network Adapter) sits beneath it; each VI exposes send/receive Work
Queues, with a Completion Queue (CQ) beside them]

Components
 - Virtual Interface (VI)
 - Completion Queue (CQ)
 - VI Provider
 - VI Consumer
VI
 - Handled by the VI Manager
 - Consists of a pair of Work Queues: a Send Queue and a Receive Queue
VI Provider
 - The Kernel Agent & the VI Network Adapter

Kernel Agent
 - Setup & resource management functions
VI Network Adapter
 - Performs the data transfer functions
 - Dolphin's PCI-SCI Adapter Card
VI Consumer
 - VIPL, MPI, and user applications
 - Makes system calls to the Kernel Agent to create a VI on the local
   system & connect it to a VI on a remote system
 - After a connection is established, the application's requests are
   posted directly to the local VI (see the sketch below)
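
To make the consumer's role concrete, here is a minimal sketch of the
connection path using the standard VIPL 1.0 calls (VipOpenNic,
VipCreateVi, VipConnectRequest). The device name is a hypothetical
placeholder, error handling is reduced to asserts, and SS-VIA's VIPL
may differ in detail:

    /* Minimal sketch of the VI Consumer connection path (VIPL 1.0).
     * "/dev/ssvia0" is a hypothetical device name. */
    #include <vipl.h>
    #include <assert.h>

    void connect_vi(VIP_NET_ADDRESS *local, VIP_NET_ADDRESS *remote)
    {
        VIP_NIC_HANDLE    nic;
        VIP_VI_HANDLE     vi;
        VIP_VI_ATTRIBUTES attrs = { 0 };   /* default VI attributes */

        /* System calls into the Kernel Agent: open the NIC and create
         * a VI (a Send Queue / Receive Queue pair) on the local node. */
        assert(VipOpenNic("/dev/ssvia0", &nic) == VIP_SUCCESS);
        assert(VipCreateVi(nic, &attrs, NULL, NULL, &vi) == VIP_SUCCESS);

        /* Ask the Kernel Agent to connect the local VI to a VI on the
         * remote node (the peer answers with VipConnectWait/Accept). */
        assert(VipConnectRequest(vi, local, remote,
                                 5000 /* ms timeout */, &attrs)
               == VIP_SUCCESS);

        /* From here on, requests are posted directly to the local VI
         * with no further kernel involvement. */
    }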
CQ
 - The completion of a request is posted to the CQ
 - Associated with a Work Queue
PCI-SCI Mapping

SS-VIA's functions are implemented with the SISCI API

Mapping between a local memory and a remote memory
 - The local node can access the remote node's memory directly
 - Communication: remote read/write

[Figure: Node A's local segment and Node B's remote segment joined
through the SCI global shared address space]
 1. Allocate the target memory in the user address space of A
 2. Establish the connection between A's local segment and B's remote
    segment
 3. Directly transfer data from B to A, without extra copies
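
Steps 2 and 3 map almost directly onto SISCI calls. A minimal sketch,
assuming the standard Dolphin SISCI entry points (SCIConnectSegment,
SCIMapRemoteSegment); the segment ID and size are illustrative
placeholders, and exact signatures may vary across SISCI releases:

    /* Illustrative sketch of steps 2-3 above, using the Dolphin
     * SISCI API. SEG_ID/SEG_SIZE are hypothetical placeholders. */
    #include "sisci_api.h"

    #define SEG_ID   42          /* hypothetical segment identifier */
    #define SEG_SIZE 65536       /* bytes exported by the remote node */
    #define ADAPTER  0           /* local SCI adapter number */

    volatile void *map_remote(sci_desc_t sd, unsigned int remote_node,
                              sci_remote_segment_t *rseg, sci_map_t *map)
    {
        sci_error_t err;

        /* Step 2: connect to the segment the remote node allocated in
         * its user address space and made available over SCI. */
        SCIConnectSegment(sd, rseg, remote_node, SEG_ID, ADAPTER,
                          NULL, NULL, SCI_INFINITE_TIMEOUT, 0, &err);
        if (err != SCI_ERR_OK)
            return NULL;

        /* Step 3: map it into our own address space. Ordinary loads
         * and stores through the returned pointer become SCI remote
         * read/write transactions: no system call, no extra copy. */
        return SCIMapRemoteSegment(*rseg, map, 0, SEG_SIZE, NULL, 0, &err);
    }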
Initialization Mechanism

By the Kernel Agent
 - Requests the VI Manager to create a local VI
 - Connects it with another VI
 - Only remote write transactions are used for data transfer
   (remote write latency < remote read latency)

[Figure: connection handshake between Node A and Node B. A creates a
local segment, sends connect_request, and waits; on connect_reject or
timeout it keeps waiting. B creates and maps the remote segment,
examines the connection attributes, and replies connect_reject or
connect_accept. On accept, each side creates and maps the peer's
segment and both VIs enter the Connected state]
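
The receiver's side of this handshake reduces to one decision over the
requested connection attributes. In the sketch below, the struct and
field names are hypothetical simplifications; only the message names
(connect_request, connect_accept, connect_reject) come from the
diagram:

    /* Hypothetical, simplified connection attributes carried by
     * connect_request. */
    typedef struct {
        unsigned max_transfer_size;
        unsigned reliability_level;
    } vi_attributes;

    typedef enum { CONN_REJECTED, CONN_ACCEPTED } conn_reply;

    /* Node B's side: examine the attributes of an incoming
     * connect_request and decide the reply. */
    conn_reply examine_request(const vi_attributes *req,
                               const vi_attributes *local)
    {
        if (req->max_transfer_size > local->max_transfer_size ||
            req->reliability_level != local->reliability_level)
            return CONN_REJECTED;      /* send connect_reject */

        /* On accept, B creates its local segment and maps A's segment
         * as a remote segment (see the SISCI sketch earlier), then
         * sends connect_accept; both VIs become Connected. */
        return CONN_ACCEPTED;
    }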
Data Transfer Mechanism

Once the VI connection is completed, data is transferred without
intervention of the Kernel Agent
[Figure: sender and receiver protocol stacks (Application / MPI / VIPL
in user space; VI Manager, Kernel Agent, SISCI, Device Driver, and
PCI-SCI Adapter Card below), each with its VIs' Work Queues and CQ]

 1. The sender and the receiver put their descriptors in the Work
    Queues through the VI Manager
 2. On the sender side, the VI Manager picks up the descriptor in the
    Send Queue and sends the data pointed to by the descriptor to the
    VI Network Adapter
 3. The VI Network Adapter transfers the data to the network as an SCI
    remote write
 4. On the receiver side, the data received through the network is
    stored in the user buffer pointed to by the descriptor in the
    Receive Queue
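
These steps correspond to the VIPL post interface. A hedged sketch of
one send, assuming the descriptor layout (CS control segment, DS data
segments) of the VI Architecture specification as commonly rendered in
vipl.h; SS-VIA's headers may differ, and protection-tag setup is
omitted:

    #include <vipl.h>
    #include <string.h>

    /* One send along steps 1-3. Buffer and descriptor must live in
     * registered (pinned) memory so the adapter can reach them
     * without an intermediate copy. */
    void send_buffer(VIP_NIC_HANDLE nic, VIP_VI_HANDLE vi,
                     void *buf, VIP_ULONG len)
    {
        VIP_MEM_ATTRIBUTES ma = { 0 };     /* simplified attributes */
        VIP_MEM_HANDLE     buf_mh, desc_mh;
        static VIP_DESCRIPTOR desc;        /* must outlive the post */

        /* Register the user buffer and the descriptor with the NIC. */
        VipRegisterMem(nic, buf,   len,         &ma, &buf_mh);
        VipRegisterMem(nic, &desc, sizeof desc, &ma, &desc_mh);

        /* Step 1: build a one-segment descriptor that points directly
         * at the user buffer. */
        memset((void *)&desc, 0, sizeof desc);
        desc.CS.Length                = len;
        desc.CS.SegCount              = 1;
        desc.DS[0].Local.Data.Address = buf;
        desc.DS[0].Local.Handle       = buf_mh;
        desc.DS[0].Local.Length       = len;

        /* Steps 2-3: post it on the Send Queue; the VI Manager picks
         * it up and the adapter performs the SCI remote write. */
        VipPostSend(vi, &desc, desc_mh);
    }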
Data Transfer Mechanism

Eliminates the time-consuming system call
 - Uses the remote memory access mechanism, the main feature of NUMA

Removes the unnecessary data copy in intermediate buffers
 - Moves the data directly between the user buffer and the network
   interface

Uses a polling mechanism (sketched below)
 - Avoids the context switch overhead
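
A minimal sketch of the polling idea, assuming VIPL's non-blocking
VipSendDone returns VIP_NOT_DONE while the descriptor is still in
flight; a blocking wait (VipSendWait) would instead park the caller in
the kernel and pay a context switch on wake-up:

    #include <vipl.h>

    /* Busy-wait for a send completion entirely in user mode: no
     * system call, no context switch. */
    VIP_DESCRIPTOR *wait_send_completion(VIP_VI_HANDLE vi)
    {
        VIP_DESCRIPTOR *done = NULL;

        while (VipSendDone(vi, &done) == VIP_NOT_DONE)
            ;                          /* spin */

        return done;   /* completed descriptor from the Send Queue */
    }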
MPI on SS-VIA

MPI-2 Standard
 - Provides an easy way of programming

[Figure: software stack: User Application / MPI API / MPI Runtime
Library / SS-VIA]
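
For illustration, a minimal MPI program of the kind that runs
unchanged on this stack: the MPI runtime library maps MPI_Send and
MPI_Recv onto SS-VIA descriptors, so the application never sees the VI
layer.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)        /* sender: becomes a Send Queue descriptor */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) { /* receiver: data lands in the user buffer */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", buf);
        }

        MPI_Finalize();
        return 0;
    }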
EXPERIMENTS
Experimental Environment

8-node PC cluster system
 - 350 MHz Pentium II
 - 128 MB memory
 - 4.3 GB SCSI hard disk
 - Linux

Interconnects
 - Fast Ethernet through a 3COM SuperStack II switch
 - SCI through a 4x4 SCI switch with ring links

[Figure: cluster topology and node internals. In each node, the CPU
(L1/L2 caches) and main memory connect through a PCI bridge to the PCI
bus, which hosts a PCI Fast Ethernet adapter and a PCI-SCI bridge
card]
Experiments

Compared systems
 - M-VIA on Fast Ethernet
   - A software-emulated VIA developed at Lawrence Berkeley Laboratory
   - A modular VIA on Linux
 - SCI-MPI
   - Developed at Aachen University
   - Provides MPI on the SCI-based shared-memory system
 - TCP/IP on Fast Ethernet
Latency & Bandwidth

[Figure: latency (usec, log scale) and bandwidth (MB/s) versus message
size from 4 bytes to 512 KB for SS-VIA, SCI-MPI, M-VIA, and TCP/IP]

Other VIA systems (latency, bandwidth)
 - M-VIA on Gigabit Ethernet: 19 usec, 60 MB/s
 - UC Berkeley's VIA on Myrinet: 30 usec, 68 MB/s
 - Intel's VIA on Myrinet: 53 usec, 87 MB/s
 - Emulex/Giganet's cLan: 7 usec, 110 MB/s
Execution Time of NPB

NAS (Numerical Aerodynamic Simulation) Parallel Benchmark (NPB)
version 2.3
 - Developed by NASA

Benchmark Program              Problem Size   # of Iterations
LU solver (LU)                 33x33x33       300
Multigrid (MG)                 64x64x64       40
Conjugate Gradient (CG)        14000          15
Embarrassingly Parallel (EP)   33554432       0
[Figure: NPB-LU and NPB-MG results: total execution time (sec) on 1,
2, 4, and 8 nodes, and communication time (sec) on 2, 4, and 8 nodes,
for SS-VIA, SCI-MPI, M-VIA, and TCP/IP]
Execution Time of NPB

[Figure: NPB-CG and NPB-EP results: total execution time (sec) on 1,
2, 4, and 8 nodes, and communication time (sec) on 2, 4, and 8 nodes,
for SS-VIA, SCI-MPI, M-VIA, and TCP/IP]
CONCLUSION
Conclusion

SS-VIA
 - A software VIA developed on top of an SCI-based PC cluster system
 - Eliminates intervention of the kernel
 - Bandwidth: 84 MB/s, latency: 8 usec
 - NAS Parallel Benchmark: average speed-up of 6.1
 - Supports both
   - Message-passing programming
   - SCI-based shared-memory programming
 - Guarantees stability

Future Work
 - Optimization