
Oracle Real Application Clusters (RAC)

RAC Internals, Cache Fusion and Performance Tuning

A BrainSurface Presentation

The views/content in this document are those of the author and do not necessarily reflect those of Oracle Corporation and/or its affiliates/subsidiaries. The material in this document is for informational purposes only and is published with no guarantee or warranty, express or implied.

Oracle RAC Internals

Node & Clusterware stack startup sequence
Heartbeat mechanism
Voting disk functionality
Split-brain resolution
Node reboot causes

Oracle RAC Internals: Node Startup Sequence

Clusterware startup order discussed in the coming slides

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Clusterware Stack Startup Sequence: Pre-11gR2

Entries in the /etc/inittab
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

Added during the execution

Startup flow: node boots up -> OS startup -> Clusterware stack

evmd: publishes events upon detecting them; responsible for executing callouts
cssd: provides cluster group membership; monitors nodes in the cluster via the heartbeat mechanism; uses the voting disk
crsd: manages and monitors CRS resources; updates the OCR when srvctl is used
Other processes shown: oclsmon.bin, oprocd.bin

Oracle RAC Internals: Clusterware Stack Startup Sequence: 11gR2

Entries in the /etc/inittab
h1:3:respawn:/sbin/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

Added during the execution

Startup flow: node boots up -> OS startup -> inittab -> init.ohasd -> Oracle High Availability Services Daemon

Started next: CSSD Monitor, CRSD, CTSSD, Diskmon, ACFS Drivers
Then: ONS, ASM Instance, DB Instance, Listener, SCAN Listener


Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2

Oracle High Availability Services Daemon

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Clusterware and Heartbeat Mechanism

Clusterware and heartbeat mechanism
Two types of heartbeat:
1. Network heartbeat
Performed once per second.
A node is evicted from the cluster when it fails to send a network heartbeat within the misscount time frame (maximum time in seconds).
Logged by clssnmPollingThread (ocssd.log):
CSSD]2009-01-27 11:15:37.409 [18] >TRACE: clssnmPollingThread: Eviction started for node usogp06 (6), flags 0x0001, state 3,wt4c 0
2. Disk (voting disk) heartbeat
Each node of the cluster writes a disk heartbeat to the voting disk every second.
Each node also reads the kill block every second, to commit suicide if required.
A node is evicted from the cluster if no heartbeat is updated within the I/O timeout (misscount/disktimeout).
Logged by clssnmDiskPMT (ocssd.log):
CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)
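When scanning ocssd.log for disk heartbeat trouble, it can help to pull the latency figure out of the clssnmDiskPMT warning. The following is a minimal sketch that parses the sample line quoted above; the function name and the regular expression are illustrative assumptions, not part of any Oracle tool, and other Clusterware versions may format the message differently.

```python
import re

# Pattern written against the single sample line quoted in this slide (an assumption).
DISK_PMT = re.compile(
    r"clssnmDiskPMT: long disk latency >\((\d+) ms\) to voting disk \((.+)\)\s*$"
)


def parse_disk_latency(line):
    """Return (latency_ms, voting_disk) for a long-latency warning, else None."""
    match = DISK_PMT.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


sample = (
    "CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: "
    "long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)"
)
latency_ms, voting_disk = parse_disk_latency(sample)
print(latency_ms / 1000, "seconds of latency on", voting_disk)
```

A latency extracted this way can then be compared against the configured disktimeout to judge how close the node came to eviction.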

Oracle RAC Internals: Clusterware and Heartbeat Mechanism

CSS parameters and their default values in 11gR2:
crsctl get css <parameter>
crsctl set css <parameter> <value>
clusterguid
disktimeout: 200 seconds
misscount: 30 seconds (longer when a vendor cluster is configured)
reboottime: 3 seconds
priority: 4 (UNIX), 3 (Windows)
logfilesize: 50 MB
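To make the two timeouts concrete, here is a simplified sketch of the eviction decision using the default misscount (30 seconds) and disktimeout (200 seconds) listed above. The class and function names are hypothetical, and the real CSS daemon applies more conditions than this toy model does.

```python
from dataclasses import dataclass

MISSCOUNT_S = 30      # default network heartbeat timeout from the slide
DISKTIMEOUT_S = 200   # default voting disk I/O timeout from the slide


@dataclass
class NodeHeartbeats:
    name: str
    last_network_hb: float  # time of last network heartbeat, in seconds
    last_disk_hb: float     # time of last voting disk heartbeat, in seconds


def eviction_reason(node, now, misscount=MISSCOUNT_S, disktimeout=DISKTIMEOUT_S):
    """Return why the node would be evicted, or None if it is still healthy."""
    if now - node.last_network_hb > misscount:
        return "network heartbeat missed"
    if now - node.last_disk_hb > disktimeout:
        return "disk heartbeat missed"
    return None


healthy = NodeHeartbeats("node1", last_network_hb=99.0, last_disk_hb=99.0)
silent = NodeHeartbeats("node3", last_network_hb=60.0, last_disk_hb=99.0)
print(eviction_reason(healthy, now=100.0))
print(eviction_reason(silent, now=100.0))
```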

Oracle RAC Internals: Voting Disk Functionality

Used by the Cluster Synchronization Service (CSS). It records and manages the node membership information. At any time, each node of a cluster must be able to access more than half of the voting disks.

Recommended to have 2n+1 (odd number) voting disk files.

[Figure: three nodes, each running CSS, exchange a network heartbeat every second and write a disk heartbeat to three voting disks once per second. All 3 nodes can see each other: ALL IS WELL!]
Figure/Diagram from Oracle Documentation
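The "more than half" rule and the 2n+1 recommendation can be checked with a little arithmetic. This is a minimal sketch with hypothetical function names; it only models the majority rule stated above.

```python
def has_voting_quorum(accessible, total):
    """A node must be able to access more than half of the voting disks."""
    return 2 * accessible > total


def tolerated_disk_failures(total):
    """How many voting disks can be lost while a strict majority remains."""
    return (total - 1) // 2


for total in (1, 2, 3, 4, 5):
    print(total, "voting disks tolerate", tolerated_disk_failures(total), "failure(s)")
```

Going from 3 to 4 voting disks does not raise the number of tolerated failures, which is why an odd count (2n+1) is recommended.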

Oracle RAC Internals: Split-Brain Syndrome


[Figure: three nodes running CSS and a voting disk. Nodes 1 and 2 can see each other, but neither can see Node 3, so they decide to evict Node 3. Node 3 cannot see Nodes 1 and 2 and is told to kill itself.]

Figure/Diagram from Oracle Documentation

Oracle RAC Internals: Split-Brain Resolution

What is Split-Brain?

The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a distributed system, typically a high availability cluster, lose connectivity with one another but then continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or using the said resources.
Quote/Abstract from MOS document
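The slides do not spell out how the surviving side is chosen. One commonly described rule for 11gR2 is that the largest sub-cluster survives and, on a tie, the sub-cluster containing the lowest node number survives. The sketch below assumes that rule; the function name is hypothetical and later releases may weigh other factors.

```python
def surviving_subcluster(subclusters):
    """Pick the survivor: largest sub-cluster, ties broken by lowest node number.

    This encodes an assumed rule, not something stated in the slides.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


# The scenario from the split-brain diagram: nodes 1 and 2 cannot see node 3.
print(surviving_subcluster([{1, 2}, {3}]))

# An even split of a four-node cluster: the side holding node 1 survives.
print(surviving_subcluster([{2, 4}, {1, 3}]))
```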

Oracle RAC Internals: Node Reboot Causes

When does a node reboot?
Network failure: interconnect failure or slow interconnect (latency); must fail 30 consecutive times! | check private interconnect configuration
Voting disk I/O: cannot read or write | refer to ocssd.log
CPU-bound: CPU is too busy to maintain the heartbeat | configure OSWatcher to verify resource consumption
Files moved, deleted, changed, or some other human error
Configuration error: wrong network for private interconnect
ocssd process died
Some Oracle Clusterware bug

Oracle RAC Internals: Grid Infrastructure: Log Files Hierarchy

Figure/Diagram from Oracle Documentation

What is Cache Fusion? Synopsis & Overview

Cache Fusion is the driving technology behind Oracle RAC that enables applications to scale out across multiple servers/instances.
Cache Fusion/Synchronization enables concurrent transaction processing between all instances using the private cluster interconnect.
DB blocks are synchronized, NOT mirrored = faster performance.

What is Cache Fusion? Synopsis & Overview

Since the advent of Oracle RAC 9i in 2001, Cache Fusion has provided the following features:
Nodes can be added/removed in HOT MODE, with zero database downtime, to provide elasticity and scalability.
Database files residing on a shared-disk cluster file system provide a uniform, fast and read-consistent image to the end user.
Applications typically scale out of the box with zero/minimal tuning.

Cache Fusion Synopsis & Overview

Cache Fusion is very fast because disk writes are eliminated when other instances request blocks for updates.
Cache Fusion is a mechanism within Oracle RAC that employs a shared cache architecture: it fuses the in-memory data buffer caches across all nodes into a single logical, read-consistent buffer cache available to all instances.
DB blocks are transferred in memory from one instance's cache to another over the cluster interconnect when requested, after the proper locking procedures are carried out.

Cache Fusion Synopsis & Overview

Global Cache Service (GCS) is used for FAST instance-to-instance block buffer transfer and implements cache coherency = never more than 3 hops.
Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is used for block buffer locking.
Global Resource Directory (GRD) is used for keeping track of block buffer location/mode/role information.
The private cluster interconnect is used for block transfers amongst instances to enable Cache Fusion.
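The "never more than 3 hops" claim can be illustrated with a toy message count: the requester asks the instance that masters the block's resource, the master forwards the request to the current holder, and the holder ships the block to the requester. The function below is a hypothetical model of that flow, not Oracle's actual protocol, and it ignores disk reads and lock conversions.

```python
def hops_for_block_request(requester, master, holder):
    """Count interconnect messages needed to get a block to the requester."""
    if holder == requester:
        return 0  # block is already in the local buffer cache
    hops = 0
    if requester != master:
        hops += 1  # requester -> master: ask who holds the block
    if master != holder:
        hops += 1  # master -> holder: forward the request
    hops += 1      # holder -> requester: ship the block
    return hops


# Worst case: requester, master and holder are three different instances.
print(hops_for_block_request(requester=1, master=2, holder=3))
```

Because the count depends only on these three roles, adding more instances to the cluster does not lengthen the path in this model.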

Cache Fusion Architecture Overview

Figure/Diagram from Oracle Documentation

Cache Fusion Architecture Global Resource Directory (GRD)

GCS & GES maintain the Global Resource Directory (GRD).
The GRD is an internal repository stored across all instances of the RAC cluster.
It keeps track of data structures, block buffer location, mode, role, inventory, etc.

Cache Fusion Architecture Global Cache Service (GCS)

The backbone of Cache Fusion: responsible for cache coherence.
Responsible for maintaining the different block modes and for transferring data buffers amongst the instances.
Implemented by the Global Cache Service Processes (LMSn).
Lock Manager Server (LMS): processes that are responsible for remote messaging.
LMSn, n = 0-9: up to 10 LMS processes; the number can be set with the init parameter GCS_SERVER_PROCESSES.

Cache Fusion Architecture Global Enqueue Service (GES)

Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is responsible for the locking mechanisms used in Cache Fusion.
LMON process: responsible for cluster monitoring & management of global resources; also known as Cluster Group Services.
LMD0 process is responsible for:
Management of resource requests from RAC instances.
Distributed deadlock detection.
Processing of enqueued requests.
Access control to global enqueues.

Cache Fusion Measuring Efficiency

Global Cache Services (GCS) Waits = Cross-Instance Block transfer Waits = Measure of Data Block Transfer Efficiency.
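One widely used efficiency figure is the average time to receive a consistent-read block from another instance, derived from two system statistics. The sketch below assumes the statistic names "gc cr block receive time" (reported in centiseconds) and "gc cr blocks received" as they appear in v$sysstat; the function name and the sample numbers are made up for illustration.

```python
def avg_gc_cr_receive_ms(stats):
    """Average global cache CR block receive time in milliseconds.

    Assumes the receive time statistic is in centiseconds, hence the factor 10.
    """
    blocks = stats["gc cr blocks received"]
    if blocks == 0:
        return 0.0
    return 10.0 * stats["gc cr block receive time"] / blocks


# Hypothetical values, as if read from v$sysstat on one instance.
sample_stats = {"gc cr block receive time": 1200, "gc cr blocks received": 6000}
print(avg_gc_cr_receive_ms(sample_stats), "ms per CR block")
```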

Cache Fusion Dynamic Performance Views

Some useful Dynamic Performance Views for monitoring Cache Fusion:
v$gc_element
v$cache
v$instance_cache_transfer
v$cr_block_server
v$cache_transfer
v$ges_blocking_enqueue
gv$file_cache_transfer
gv$temp_cache_transfer
gv$cache_transfer
gv$class_cache_transfer

RAC Performance Tuning: Starting Out

Nemiec (2004, 9i RAC): App Tuning, Database Tuning, OS Tuning, THEN... RAC Tuning

Nanda (2009): CPU and I/O (not Interconnect) are necessary for RAC Performance

Lawson (2010): The Essence Of Performance Tuning Is The Same

These quotes are from presentations in the RAC SIG library.

RAC Performance Tuning: Approaches

Top-Down
Application Responsiveness
Grid Control Performance Tab
Statspack/AWR Reports
Goal: Minimize Response Time or Maximize Throughput

Bottom-Up
Storage: Spindles, Controllers, Paths
OS I/O times, queues
Network latency
Memory
CPU (each core)
Goal: Balance & Maximize Utilization

RAC Performance Tuning: Application & Schema Design

Look Out For:

Indexes
Sequences
Hot rows or small tables
MSSM
gc Wait Events
High Interconnect Utilization

RAC Performance Tuning: Application & Schema Design

Main Principle: parallelize (avoid serialization on any data)
If it doesn't scale on SMP then it won't scale on RAC
Same principles of good app design for non-RAC!!

Decrease rows/block
Reverse Key or Hash Indexes (No Range Scans)
Sequences: NOORDER + CACHE
ASSM (or Freelist Groups)
Data & Index Partitioning
App Partitioning
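The reason a reverse key index relieves a hot index block can be shown with a toy example: sequential keys that would sort next to each other end up far apart once reversed. Oracle reverses the bytes of the stored key; reversing decimal digits here is a simplification, and the function name and the width are assumptions made for illustration.

```python
def reverse_key(value, width=6):
    """Reverse the zero-padded decimal digits of a key (a stand-in for byte reversal)."""
    return str(value).zfill(width)[::-1]


sequential_ids = range(1001, 1005)
reversed_keys = [reverse_key(n) for n in sequential_ids]
for original, rev in zip(sequential_ids, reversed_keys):
    print(original, "->", rev)
```

Consecutive values now differ in their leading character, so concurrent inserts from different instances land in different parts of the index. The price is the one noted on the slide: the index can no longer serve range scans on the original key order.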

RAC Performance Tuning: Tune the Entire System as a Whole

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Tune the Entire System as a Whole

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Real Life Case Study

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Configuration Checklist

Hardware
  All nodes have similar performance characteristics
  Interconnect (the RAC Achilles heel)
    Network segment truly private
    Bond NICs to improve throughput
    All nodes set NICs to Jumbo Frames
    Switches / VLANs set to Jumbo Frames
    Consider 10Gbit Ethernet for Interconnect
  Storage
    Multipath
    Verify settings for read & write caching match application nature
    If using iSCSI, treat as similar to interconnect network (see above)
Software
  All nodes have the exact same OS patches
  All nodes have the exact same Oracle patches
  Oracle both recommends and pushes for using ASM on RAC
  Do NOT rely on non-RAC enabled scripts or tools for handling RAC

RAC Performance Tuning: Block Size is Important

DBCA's default block size is 8K
Many DBAs' experience is that a bigger block size is better
So most databases these days often have block sizes >= 8K
But bigger is not always better
Block size and number of nodes should be considered (next 2 slides)
No matter how fast or good Cache Fusion is, don't stress it if unnecessary
Example: OLTP application using 8K block size and having 8 nodes
Larger block size = more rows per block
More rows per block = more likelihood of block contention
More nodes (>=4) = more likelihood of block contention
More block contention means more Cache Fusion work
Remember, the interconnect is most often RAC's Achilles heel
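The link between rows per block, session count and contention can be illustrated with a birthday-problem style estimate. The sketch assumes each session touches one uniformly random row of a table at the same moment, which real workloads do not; the function name and the figures are hypothetical.

```python
def block_collision_probability(total_rows, rows_per_block, sessions):
    """Chance that at least two sessions touch the same block (toy model)."""
    blocks = max(1, total_rows // rows_per_block)
    p_all_distinct = 1.0
    for already_taken in range(sessions):
        p_all_distinct *= max(0.0, 1 - already_taken / blocks)
    return 1 - p_all_distinct


# A 10,000-row table, one active session per node on 8 nodes.
for rows_per_block in (40, 80, 160):
    p = block_collision_probability(10_000, rows_per_block, sessions=8)
    print(rows_per_block, "rows/block ->", round(p, 3))
```

In this model, doubling the rows per block halves the number of blocks and raises the collision chance, and so does adding sessions, which matches the direction of the argument on the slide.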

RAC Performance Tuning: Block Contention

Figure/Diagram from Bert Scalzo

RAC Performance Tuning: Block Contention

Figure/Diagram from Bert Scalzo

To summarize, Oracle RAC is proven, robust and stable and is used by corporations, organizations & governments across the globe to achieve High Availability, Elasticity & Scalability by providing a lower-cost and higher ROI alternative to Mainframe-like SMP (Symmetric Multi-Processing) models of computing. Learn more about Oracle RAC at Oracle's RAC homepage.
