Oracle Real Application Clusters (RAC) : RAC Internals, Cache Fusion and Performance Tuning

A BrainSurface Presentation www.brainsurface.com
Disclaimer
The views/content in this document are those of the author and do not necessarily reflect those of Oracle Corporation and/or its affiliates/subsidiaries. The material in this document is for informational purposes only and is published with no guarantee or warranty, express or implied.
Oracle RAC Internals
Agenda
Node & Clusterware stack startup sequence
Heartbeat mechanism
Voting disk functionality
Split-brain resolution
Node reboot causes
Oracle RAC Internals: Node Startup Sequence
Clusterware startup order discussed in the coming slides
Figure/Diagram from Oracle Documentation
Oracle RAC Internals: Clusterware Stack Startup Sequence: Pre-11gR2

Entries in the /etc/inittab
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null
Added during the root.sh execution
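Each inittab entry follows the id:runlevels:action:process layout. As a minimal sketch of how the entries above break down, the following Python splits one line into its four fields. The helper name parse_inittab_line and the InittabEntry type are hypothetical, introduced only for illustration.

```python
from typing import NamedTuple


class InittabEntry(NamedTuple):
    ident: str      # e.g. "h2"
    runlevels: str  # e.g. "35" means runlevels 3 and 5
    action: str     # e.g. "respawn": init restarts the process if it exits
    process: str    # command line that init runs


def parse_inittab_line(line: str) -> InittabEntry:
    # Split on the first three colons only; the command may contain colons.
    ident, runlevels, action, process = line.strip().split(":", 3)
    return InittabEntry(ident, runlevels, action, process)


entry = parse_inittab_line(
    "h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null"
)
print(entry.ident, entry.runlevels, entry.action)
```

The respawn action is what makes init restart the Clusterware wrapper scripts whenever they exit.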
Startup flow (diagram): Node boots up -> OS startup -> inittab -> Clusterware stack
init.evmd -> evmd.bin: publishes events as they are detected; responsible for executing callouts
init.cssd -> ocssd.bin (uses the voting disk): provides cluster group membership; monitors nodes in the cluster via the heartbeat mechanism
init.crsd -> crsd.bin (uses the OCR): manages and monitors CRS resources; updates the OCR when srvctl is used
Also shown: oclsmon.bin, oprocd.bin
Oracle RAC Internals: Clusterware Stack Startup Sequence: 11gR2

Entries in the /etc/inittab
h1:3:respawn:/sbin/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Added during the root.sh execution
Startup flow (diagram): Node boots up -> OS startup -> inittab -> init.ohasd -> Oracle High Availability Services Daemon (OHASD)
Agents started by OHASD:
  oraagent.bin: MDNSD, GIPCD, GPNPD, EVMD, ASM
  orarootagent.bin: CSSD Monitor, CRSD, CTSSD, Diskmon, ACFS Drivers
  cssdagent: OCSSD
Agents started by CRSD:
  oraagent: ONS, ASM Instance, DB Instance, Listener, SCAN Listener
  orarootagent: GSD, VIP, SCAN VIP
Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2
Oracle High Availability Services Daemon
Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2
Oracle RAC Internals: Clusterware and Heartbeat Mechanism

Clusterware and heartbeat mechanism
Two (2) types of heartbeats:
1. Network heartbeat
   Performed once per second.
   A node is evicted from the cluster when it fails to send a network heartbeat within the misscount period (maximum time in seconds).
   Logged by clssnmPollingThread (ocssd.log):
   CSSD]2009-01-27 11:15:37.409 [18] >TRACE: clssnmPollingThread: Eviction started for node usogp06 (6), flags 0x0001, state 3,wt4c 0
2. Disk (voting disk) heartbeat
   Each node of the cluster writes a disk heartbeat to the voting disk every second.
   Each node also reads the kill block every second and commits suicide if required.
   A node is evicted from the cluster if its heartbeat is not updated within the I/O timeout (misscount/disktimeout).
   Logged by clssnmDiskPMT (ocssd.log):
   CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)
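The network heartbeat rule can be illustrated with a small Python sketch. It is a simplified model, not the CSSD implementation: it assumes a 30-second misscount (the 11gR2 default quoted in this deck), and the function names and timestamps are hypothetical.

```python
from __future__ import annotations

MISSCOUNT_SECONDS = 30  # assumed default misscount, as quoted in this deck


def missed_heartbeat(last_heartbeat: float, now: float,
                     misscount: float = MISSCOUNT_SECONDS) -> bool:
    """True when no network heartbeat arrived within the misscount window."""
    return (now - last_heartbeat) > misscount


def nodes_to_evict(last_seen: dict[str, float], now: float,
                   misscount: float = MISSCOUNT_SECONDS) -> list[str]:
    """Names of nodes whose last heartbeat is older than misscount seconds."""
    return sorted(node for node, ts in last_seen.items()
                  if missed_heartbeat(ts, now, misscount))


# Hypothetical timestamps in seconds: node3 has been silent for 35 seconds.
last_seen = {"node1": 99.0, "node2": 100.0, "node3": 65.0}
print(nodes_to_evict(last_seen, now=100.0))
```

With heartbeats sent once per second, a 30-second misscount corresponds to roughly 30 consecutive missed heartbeats before eviction starts.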
Oracle RAC Internals: Clusterware and Heartbeat Mechanism

CSS parameters and their default values in 11gR2:
crsctl get css <parameter>
crsctl set css <parameter> <value>
clusterguid
disktimeout: 200 seconds
misscount: 30 seconds (a longer misscount applies when a vendor cluster is configured)
reboottime: 3 seconds
priority: 4 (UNIX), 3 (Windows)
logfilesize: 50 MB
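To relate a logged voting disk latency to the disktimeout value, the sketch below extracts the latency from a clssnmDiskPMT warning such as the one quoted earlier and expresses it as a fraction of the timeout. The regular expression is written against that single sample line only, and the helper names are hypothetical.

```python
from __future__ import annotations

import re

DISKTIMEOUT_SECONDS = 200  # default disktimeout quoted in this deck

# Pattern derived from the one sample line in this deck; other log formats may differ.
LATENCY_RE = re.compile(
    r"clssnmDiskPMT: long disk latency >?\((\d+) ms\) to voting disk \((.+)\)"
)


def disk_latency(log_line: str) -> tuple[float, str] | None:
    """Return (latency in seconds, voting disk name), or None if no match."""
    match = LATENCY_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / 1000.0, match.group(2)


def fraction_of_timeout(latency_seconds: float,
                        disktimeout: float = DISKTIMEOUT_SECONDS) -> float:
    """How much of the disktimeout budget a single latency consumed."""
    return latency_seconds / disktimeout


sample = ("CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: "
          "long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)")
seconds, disk = disk_latency(sample)
print(disk, seconds, round(fraction_of_timeout(seconds), 2))
```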
Oracle RAC Internals: Voting Disk Functionality

The voting disk is used by the Cluster Synchronization Service (CSS). It records and manages the node membership information. At any time, each node of a cluster must be able to access more than half of the voting disks.
Recommended to have 2n+1 (odd number) voting disk files.
Diagram: Node1, Node2 and Node3 each run CSS, exchange a network heartbeat every second, and write a disk heartbeat to the three voting disks once per second. All 3 nodes can see each other: ALL IS WELL!
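The "more than half" rule and the 2n+1 recommendation can be checked with simple arithmetic. A minimal Python sketch follows; the function names are hypothetical.

```python
def has_voting_disk_quorum(accessible: int, total: int) -> bool:
    """A node needs access to strictly more than half of the voting disks."""
    return 2 * accessible > total


def tolerated_disk_failures(total: int) -> int:
    """Largest number of voting disks that can be lost while keeping quorum."""
    return (total - 1) // 2


for total in (1, 2, 3, 4, 5):
    print(total, "voting disks tolerate", tolerated_disk_failures(total), "failure(s)")
```

With 2n+1 disks a node survives the loss of n of them. Adding one more disk to reach an even count does not raise the number of tolerated failures, which is why an odd number is recommended.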
Oracle RAC Internals: Split-Brain Syndrome

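Split-brain
Diagram: Node1, Node2 and Node3 each run CSS and share the voting disk; Node3 has lost contact with the other two.
Node1 & Node2: we can see each other, but neither of us can see Node3. Let's evict Node3.
Node3: can't see Node1 & Node2. Kill yourself (Node3).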
Oracle RAC Internals: Split-Brain Resolution What is Split-Brain?

The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a distributed system, typically a high availability cluster, lose connectivity with one another but then continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or using the said resources.
Quote/Abstract from MOS document
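The three-node example, where Node1 and Node2 evict Node3, can be sketched as choosing which sub-cluster survives. The Python below is a simplified illustration, not Oracle's algorithm: it assumes the largest sub-cluster wins and, as a labeled assumption for ties, prefers the sub-cluster containing the lowest node number.

```python
from __future__ import annotations


def surviving_cohort(cohorts: list[set[int]]) -> set[int]:
    """Pick the sub-cluster that stays up after the interconnect splits.

    Assumption: the largest cohort wins; on a tie, the cohort holding the
    lowest node number wins. Every other cohort is evicted.
    """
    return max(cohorts, key=lambda cohort: (len(cohort), -min(cohort)))


def evicted_nodes(cohorts: list[set[int]]) -> set[int]:
    """All nodes that are not part of the surviving cohort."""
    survivors = surviving_cohort(cohorts)
    return set().union(*cohorts) - survivors


# Nodes 1 and 2 still see each other; node 3 is isolated.
print(surviving_cohort([{1, 2}, {3}]), evicted_nodes([{1, 2}, {3}]))
```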
Oracle RAC Internals: Node Reboot Causes

When does a node reboot?
Network failure: interconnect down or slow (latency); the heartbeat must fail 30 consecutive times! | check the private interconnect configuration
Voting disk I/O: cannot read or write | refer to ocssd.log
CPU-bound: CPU is too busy to maintain the heartbeat | configure OSWatcher to verify resource consumption
Human error: files moved, deleted, changed, or some other mistake
Configuration error: wrong network for the private interconnect
ocssd process died
Some Oracle Clusterware bug
Oracle RAC Internals: Grid Infrastructure: Log Files Hierarchy
What is Cache Fusion? Synopsis & Overview

Cache Fusion is the driving technology behind Oracle RAC that enables applications to scale out on multiple servers/instances.
Cache Fusion/Synchronization enables concurrent/simultaneous transaction processing between all instances using the Private Cluster Interconnect.
DB Blocks are synchronized, NOT mirrored = Faster performance.
What is Cache Fusion? Synopsis & Overview

Since the advent of Oracle RAC 9i in 2001, Cache Fusion has provided the following features:
More nodes can be added/removed in HOT MODE, with zero database downtime, to provide elasticity and scalability.
Database files residing on a shared-disk cluster file system provide a uniform, fast and read-consistent image to the end user.
Applications typically scale out-of-the-box with zero/minimal tuning.
Cache Fusion Synopsis & Overview

Cache Fusion is very fast because disk writes are eliminated when other instances request blocks for updates.
Cache Fusion is a mechanism within Oracle RAC that employs a Shared Cache Architecture: it fuses the in-memory data buffer caches of all nodes into a single logical, read-consistent buffer cache available to all instances.
DB Blocks are transferred in-memory from instance-to-instance cache over the Cluster Interconnect when requested, after the proper locking procedures have been carried out.
Cache Fusion Synopsis & Overview

Global Cache Service (GCS) is used for FAST instance-to-instance block buffer transfer and establishes/implements Cache Coherency = Never more than 3 hops.
Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), manages global enqueues, i.e. inter-instance locks on resources other than data blocks.
Global Resource Directory (GRD) is used for keeping track of Block Buffer Location/Mode/Role information.
The Private Cluster Interconnect is used for block transfers amongst instances to enable Cache Fusion.
Cache Fusion Architecture Overview
Cache Fusion Architecture Global Resource Directory (GRD)

GCS & GES maintain the Global Resource Directory (GRD).
It is an internal repository distributed across all instances of the RAC cluster.
The GRD keeps track of data structures, block buffer location, mode, role, inventory, etc.
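The role of the directory can be pictured with a toy model: a map from block to the instance that currently holds it, consulted before a block is read from disk. The Python below is purely illustrative and ignores block modes, roles and locking. The class and method names are hypothetical and do not describe Oracle internals.

```python
from __future__ import annotations


class ToyResourceDirectory:
    """Tracks which instance currently holds each block (location only)."""

    def __init__(self) -> None:
        self.holder: dict[int, str] = {}  # block id -> holding instance

    def request(self, block: int, requester: str) -> str:
        """Return where the requester gets the block from, then record it."""
        current = self.holder.get(block)
        self.holder[block] = requester
        if current is None:
            return "disk"            # nobody caches it yet: read from disk
        if current == requester:
            return "local"           # already in the requester's own cache
        return f"cache:{current}"    # shipped over the interconnect


grd = ToyResourceDirectory()
print(grd.request(42, "inst1"))  # first access comes from disk
print(grd.request(42, "inst2"))  # second instance gets it from inst1's cache
print(grd.request(42, "inst2"))  # repeated access is served locally
```

In this picture the worst case involves three parties (requester, directory master, current holder), which is the intuition behind the "never more than 3 hops" figure.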
Cache Fusion Architecture Global Cache Service (GCS)

The backbone of Cache Fusion: responsible for Cache Coherence.
Responsible for maintaining the different block modes and for the transfer of data buffers amongst the instances.
Implemented by the Global Cache Service Processes (LMSn).
Lock Manager Server (LMS): processes that are responsible for remote messaging.
LMSn: n = 0-9: up to 10 LMS processes. The number can be set with the init parameter GCS_SERVER_PROCESSES.
Cache Fusion Architecture Global Enqueue Service (GES)

Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is responsible for the locking mechanisms used in Cache Fusion.
The LMON process is responsible for cluster monitoring & management of global resources: also known as Cluster Group Services.
The LMD0 process is responsible for:
  Management of resource requests from RAC instances.
  Distributed deadlock detection.
  Processing of enqueued requests.
  Access control to global enqueues.
Cache Fusion Measuring Efficiency

Global Cache Services (GCS) Waits = Cross-Instance Block transfer Waits = Measure of Data Block Transfer Efficiency.
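One common way to put a number on transfer efficiency is the average time to receive a block from another instance. The sketch below assumes two cumulative statistics that are not named in the slides: a receive-time counter reported in centiseconds (for example 'gc cr block receive time') and a count of blocks received (for example 'gc cr blocks received'). The function name and sample values are hypothetical.

```python
def avg_block_receive_ms(receive_time_cs: float, blocks_received: int) -> float:
    """Average receive time per block in milliseconds.

    receive_time_cs is assumed to be in centiseconds, hence the factor of 10.
    """
    if blocks_received == 0:
        return 0.0
    return 10.0 * receive_time_cs / blocks_received


# Hypothetical counters: 1500 centiseconds spent receiving 10000 blocks.
print(avg_block_receive_ms(1500, 10000))
```

Because the counters are cumulative, computing the ratio over the delta between two snapshots gives a more useful picture than the value since instance startup.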
Cache Fusion Dynamic Performance Views

Some useful Dynamic Performance Views for monitoring Cache Fusion:
v$gc_element
v$cache
v$instance_cache_transfer
v$cr_block_server
v$cache_transfer
v$ges_blocking_enqueue
gv$file_cache_transfer
gv$temp_cache_transfer
gv$cache_transfer
gv$class_cache_transfer
RAC Performance Tuning: Starting Out
Niemiec (2004, 9i RAC): App Tuning, Database Tuning, OS Tuning, THEN... RAC Tuning
Nanda (2009): CPU and I/O (not Interconnect) are necessary for RAC Performance
Lawson (2010): The Essence Of Performance Tuning Is The Same
These quotes are from presentations in the RAC SIG library.
RAC Performance Tuning: Approaches
Top-Down
  Application Responsiveness
  Grid Control Performance Tab
  Statspack/AWR Reports
  Goal: Minimize Response Time or Maximize Throughput
Bottom-Up
  Storage: Spindles, Controllers, Paths
  OS: I/O times, queues
  Network latency
  Memory
  CPU (each core)
  Goal: Balance & Maximize Utilization
RAC Performance Tuning: Application & Schema Design
Look Out For:

Indexes
Sequences
Hot rows or small tables
MSSM
gc Wait Events
High Interconnect Utilization
RAC Performance Tuning: Application & Schema Design
Main Principle: parallelize (avoid serialization on any data)
If it doesn't scale on SMP then it won't scale on RAC
Same principles of good app design as for non-RAC!!
Decrease rows/block
Reverse Key or Hash Indexes (No Range Scans)
Sequences: NOORDER + CACHE
ASSM (or freelist groups)
Data & Index Partitioning
App Partitioning
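To see why a reverse key index relieves contention on sequence-generated keys, consider what byte reversal does to consecutive values. The sketch below is a simplified illustration, assuming a fixed-width big-endian integer key; Oracle's internal number format differs, and the function name is hypothetical.

```python
def reverse_key(value: int, width: int = 4) -> bytes:
    """Reverse the bytes of a fixed-width big-endian key."""
    return value.to_bytes(width, "big")[::-1]


keys = [1000, 1001, 1002]

# Normal keys share their leading bytes, so they sort next to each other
# and land in the same right-most index leaf block.
print([k.to_bytes(4, "big").hex() for k in keys])

# Reversed keys differ in the leading byte, so they scatter across the index.
print([reverse_key(k).hex() for k in keys])
```

The scattering is also why the slide notes "No Range Scans": adjacent key values are no longer adjacent in the index.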
RAC Performance Tuning: Tune the Entire System as a Whole
Figure/Diagram from Bert Scalzo
RAC Performance Tuning: Tune the Entire System as a Whole
RAC Performance Tuning: Real Life Case Study
RAC Performance Tuning: Configuration Checklist

Hardware
  All nodes have similar performance characteristics
  Interconnect (the RAC Achilles heel)
    Network segment truly private
    Bond NICs to improve throughput
    All nodes set NICs to Jumbo Frames
    Switches / VLANs set to Jumbo Frames
    Consider 10Gbit Ethernet for Interconnect
  Storage
    Multipath
    Verify settings for read & write caching match application nature
    If using iSCSI, treat as similar to interconnect network (see above)
Software
  All nodes have the exact same OS patches
  All nodes have the exact same Oracle patches
  Oracle both recommends and pushes for using ASM on RAC
  Do NOT rely on non-RAC enabled scripts or tools for handling RAC
RAC Performance Tuning: Block Size is Important

DBCA's default block size is 8K
Many DBAs' experience is that a bigger block size is better
So most databases these days have block sizes >= 8K
But bigger is not always better
Block size and number of nodes should be considered (next 2 slides)
No matter how fast or good Cache Fusion is, don't stress it if unnecessary
Example: OLTP application using 8K block size and having 8 nodes
  Larger block size = more rows per block
  More rows per block = more likelihood of block contention
  More nodes (>=4) = more likelihood of block contention
  More block contention means more Cache Fusion work
Remember, the interconnect is most often RAC's Achilles heel
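The link between block size and contention can be made concrete with a toy calculation. The sketch below assumes uniformly random row access, ignores block overhead, and uses a hypothetical 128-byte average row; it is back-of-the-envelope arithmetic, not a model of Oracle's block layout.

```python
import math


def rows_per_block(block_size: int, avg_row_bytes: int) -> int:
    """Rows that fit in one block, ignoring header and free-space overhead."""
    return block_size // avg_row_bytes


def same_block_probability(total_rows: int, block_size: int,
                           avg_row_bytes: int) -> float:
    """Chance that two uniformly random row accesses hit the same block."""
    blocks = math.ceil(total_rows / rows_per_block(block_size, avg_row_bytes))
    return 1.0 / blocks


for size in (8192, 16384):
    print(size, rows_per_block(size, 128),
          same_block_probability(64000, size, 128))
```

Under these assumptions, doubling the block size doubles the rows per block and doubles the chance that two sessions on different nodes touch the same block.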
RAC Performance Tuning: Block Contention
RAC Performance Tuning: Block Contention
Summary
To summarize, Oracle RAC is proven, robust and stable and is used by corporations, organizations & governments across the globe to achieve High Availability, Elasticity & Scalability by providing a lower-cost and higher ROI alternative to Mainframe-like SMP (Symmetric Multi-Processing) models of computing. Learn more about Oracle RAC at Oracle's RAC homepage.
http://www.oracle.com/technology/products/database/clustering/index.html
