Lundhild-Understanding RAC Internals

Understanding RAC Internals

Barb Lundhild Oracle Corporation RAC Product Management
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
1. What are the major components of Oracle Clusterware and how do they interact?
2. Why does Oracle reboot nodes?
3. How does Oracle handle private interconnect failure and scalability?
4. When my public network fails, why do ASM and the db instance get shut down?
5. What exactly is the VIP, what is its purpose, and how does it work?
6. What is the purpose of ONS? Is it required for anything other than FAN?
7. How does Oracle do load balancing across RAC instances?
What are the major components of Oracle Clusterware and how do they interact?
RAC 10g Architecture
[Figure: Node1 through Node n on the public network. Each node runs a VIP, Service, Listener, database instance, and ASM on top of Oracle Clusterware and the Operating System. The Oracle Clusterware stacks are joined by the cluster interconnect. Shared storage, managed by ASM or RAW devices, holds the redo / archive logs of all instances, the database / control files, and the OCR and voting disks.]
What does Clusterware provide?
[Figure: Clusterware sits on top of the Operating System and provides: VIP, Event Management, High Availability Framework, Process Monitor, Group Membership.]
Oracle Clusterware 10g Architecture
[Figure: Oracle Clusterware components on top of the Operating System: VIP, EVM, RACG, CRS, OPROC, CSS.]
Why does Oracle Clusterware reboot nodes?
Oracle Clusterware
Group Membership and Heartbeats

- The cluster needs to know who is a member at all times
- Oracle Clusterware has 2 heartbeats: a network heartbeat and a disk heartbeat
- If a node does not send a network heartbeat for <MissCount> (time in seconds), the node is evicted from the cluster
- If the disk heartbeat (voting disk) is not updated within <I/O timeout>, the node is evicted from the cluster
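The two eviction rules above can be sketched in a few lines of Python. This is only an illustration of the decision, not how CSS is implemented; `NodeHeartbeat`, `should_evict`, and the timestamps are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeHeartbeat:
    """Last time (in seconds) each heartbeat was seen for one node (hypothetical model)."""
    node: int
    last_network_checkin: float
    last_disk_update: float


def should_evict(hb: NodeHeartbeat, now: float, misscount: float, io_timeout: float) -> bool:
    """Return True if either heartbeat has been silent longer than its limit."""
    network_silent = (now - hb.last_network_checkin) > misscount
    disk_stale = (now - hb.last_disk_update) > io_timeout
    return network_silent or disk_stale


# Node 4 last checked in on the network 61 seconds ago; with MissCount = 60 it is evicted
# even though its disk heartbeat is recent.
node4 = NodeHeartbeat(node=4, last_network_checkin=0.0, last_disk_update=55.0)
print(should_evict(node4, now=61.0, misscount=60.0, io_timeout=200.0))
```

Either condition alone is enough to trigger eviction, which is why the two timeouts are worth reasoning about separately.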
Heartbeat Failures
Network Heartbeat
    node(4) missed(59) checkin(s)
    >2005-06-18 08:14:37.858 [3002575792] >WARNING: clssnmPollingThread: Eviction started for node 4, flags 0x000d, state 3, wt4c 0
    >2005-06-18 08:14:41.985 [3047074736] >TRACE: clssnmHandleSync:
Disk Heartbeat
    CSSD]2005-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: long disk latency (45940 ms) to voting disk (0//dev/raw/raw1)
Oracle Clusterware
Split Brain Resolution
- Determine the surviving subcluster:
    - Sub-cluster with the largest number of nodes
    - Sub-cluster with the lowest node number
- I/O fencing via STONITH algorithm (remote power reset)
- The voting disk is used to detect and resolve network problems that could lead to a split-brain
    - Final arbiter of the status of configured nodes, either up or down, and delivers eviction notices
    - Recommended to have at least 3 voting disks
    - Multiple voting disks supported in RAC 10g Release 2
    - Dynamic addition of voting disks in RAC 11g
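A minimal sketch of the survivor selection, assuming the second criterion (lowest node number) acts as a tie-breaker when two sub-clusters are the same size. The slide lists the two criteria without stating how they combine, so that ordering is an assumption, and `surviving_subcluster` is a hypothetical name.

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that survives a split.

    Largest sub-cluster wins; on a tie (assumed here), the sub-cluster
    containing the lowest node number wins.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


# A 5-node cluster splits 2 / 3: the larger side survives.
print(surviving_subcluster([{1, 2}, {3, 4, 5}]))

# A 4-node cluster splits 2 / 2: the side holding node 1 survives.
print(surviving_subcluster([{2, 3}, {1, 4}]))
```

The nodes in the losing sub-cluster are the ones that receive eviction notices and are fenced.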
Oracle Clusterware Disk Heartbeat

Disktimeout: maximum time (s) for voting file I/O to complete.
- 10g Release 1 and 10.2.0.1: I/O timeout was directly related to MissCount, i.e. MissCount governed the sensitivity of both heartbeats
- 10.2.0.2: more granular sensitivity via separation of network and disk heartbeats
- Disktimeout parameter set for CSS, default = 200s
- Tune disktimeout for the voting disk storage solution; be careful, some multipathing solutions require high disktimeout values
Changing MissCount
IT IS NOT SUPPORTED TO REDUCE MISSCOUNT BELOW THE DEFAULT
- Default varies somewhat by platform (30s or 60s)
- Default = 600s if vendor clusterware is installed
It should not be necessary to tune Disktimeout
How does Oracle handle private interconnect failure and scalability?
Private Interconnect
[Figure: Node1, Node 2 through Node n on the public network. Each node runs a VIP, Service, Listener, instance, and ASM on top of Oracle Clusterware and the Operating System. The nodes are joined by the cluster interconnect through two switches, Switch 1 and Switch 2.]
Private Interconnect
- Network between the nodes of a RAC cluster MUST be private
- Supported links: GbE, IB (IPoIB: 10.2)
- Supported transport protocols:
    - Oracle Clusterware uses TCP
    - RAC: UDP, RDS (10.2.0.3)
- Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
- Large (Jumbo) Frames for GbE recommended
Interconnect Bandwidth
Bandwidth requirements depend on
- CPU power per cluster node
- Application-driven data access frequency
- Number of nodes and size of the working set
- Data distribution between PQ slaves
10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)
Typical utilization approx. 10-30% in OLTP
- Multiple NICs generally not required for performance and scalability
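The saturation figure can be checked with simple arithmetic. The sketch below counts payload bytes only and ignores protocol headers and message overhead, so it gives an upper bound; `blocks_per_second` is a hypothetical helper and 8K is taken as 8192 bytes.

```python
def blocks_per_second(link_bits_per_sec: int, block_bytes: int, utilization: float) -> int:
    """Whole blocks per second that fit in the usable share of a link (payload only)."""
    usable_bytes_per_sec = link_bits_per_sec / 8 * utilization
    return int(usable_bytes_per_sec // block_bytes)


GIGABIT = 1_000_000_000  # 1 x Gb Ethernet, bits per second
BLOCK_8K = 8192          # bytes

for share in (0.75, 0.80):
    print(share, blocks_per_second(GIGABIT, BLOCK_8K, share))
```

At 75% to 80% of the theoretical bandwidth this comes out in the 11,000 to 12,000 range, which agrees with the upper end of the figure quoted above once per-packet overhead is allowed for.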
IPC configuration
Settings:
- Socket receive buffers (256 KB to 1 MB)
- Negotiated top bit rate and full duplex mode
- NIC ring buffers
- Ethernet flow control settings
- CPU(s) receiving network interrupts
Verify your setup:
- CVU does checking
- Load testing eliminates potential for problems
Interconnect Bonding
- Terminology: NIC bonding, link aggregation, port trunking, NIC teaming
- Multiple physical links combined into a single logical link
- Provides redundancy and/or scalability
- Logical link is provided to Oracle Clusterware and RAC
- Most operate at OSI Layer 2
- Different implementations on different platforms
    - Read the fine print
- Generally recommend failover only (active/passive) configuration
Interconnect Bonding
Some cluster managers provide support for multiple interconnects
- Not required with Oracle Clusterware
OS-specific bonding
- Solaris: IPMP, Sun Trunking
- AIX: EtherChannel
- HP-UX: APA
- Linux: NIC Bonding
- Windows: NIC Teaming
IB drivers inherently support failover and load balancing.
Interconnect Configuration
OCR
    [SYSTEM.css.interfaces.global.bond0.192|d168|d12|d0.1]
    ORATEXT : cluster_interconnect
    SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : oracle, GROUP_NAME : odba}
RDBMS
    SQL> select * from x$ksxpia;

    ADDR     INDX INST_ID P PICK NAME_KSXPIA IP_KSXPIA
    -------- ---- ------- - ---- ----------- ------------
    58EC8340    0       1 Y OCR  bond0       192.168.12.1
cluster_interconnects (init.ora for RAC)
- Overrides clusterware setting
- Supports load balancing, not failover
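To relate the OCR key to the address reported by the instance, the key name can be pulled apart. This sketch assumes that `|d` in the dumped key stands in for a literal dot inside the subnet, and that the final `.1` is a sub-entry rather than part of the address; both are assumptions about the dump format, and `parse_interconnect_key` is a hypothetical helper.

```python
def parse_interconnect_key(key: str):
    """Split an OCR interface key into (interface name, subnet).

    Assumes '|d' encodes '.' inside the subnet component (an assumption
    about the dump format, not a documented rule).
    """
    parts = key.strip("[]").split(".")
    interface = parts[-3]
    subnet = parts[-2].replace("|d", ".")
    return interface, subnet


key = "[SYSTEM.css.interfaces.global.bond0.192|d168|d12|d0.1]"
print(parse_interconnect_key(key))
```

Under those assumptions the key names interface bond0 on subnet 192.168.12.0, which is consistent with the bond0 / 192.168.12.1 row that the instance reports with PICK = OCR.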
Operating System Dependency

- Block access latencies increase when CPU(s) are busy and run queues are long
- Immediate LMS scheduling is critical for predictable block access latencies when CPU is > 80% busy
- Fewer and busier LMS processes may be more efficient, i.e. monitor their CPU utilization
- Real Time or fixed priority for LMS is supported
    - Implemented by default with 10.2
- Do not put more instances than CPUs on a server
Misconfigured or Faulty Interconnect Can Cause:

- Dropped packets/fragments
- Buffer overflows
- Packet reassembly failures or timeouts
- Ethernet flow control kicks in
- TX/RX errors
Lost blocks at the RDBMS level, responsible for 64% of escalations
Lost Blocks: NIC Receive Errors
db_block_size = 8K
ifconfig -a:
    eth0  Link encap:Ethernet  HWaddr 00:0B:DB:4B:A2:04
          inet addr:130.35.25.110  Bcast:130.35.27.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
          TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
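The receive counters are easy to check by script. The following is a rough sketch that reads the RX line in the format shown above and reports the counters; `rx_error_stats` is a hypothetical helper, and the pattern only matches this older ifconfig output style.

```python
import re

RX_LINE = re.compile(
    r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)"
)


def rx_error_stats(ifconfig_text: str):
    """Return the RX counters from ifconfig output, or None if no RX line is found."""
    match = RX_LINE.search(ifconfig_text)
    if match is None:
        return None
    packets, errors, dropped, overruns, frame = map(int, match.groups())
    return {
        "packets": packets,
        "errors": errors,
        "dropped": dropped,
        "overruns": overruns,
        "frame": frame,
        "error_rate": errors / packets if packets else 0.0,
    }


sample = "RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95"
stats = rx_error_stats(sample)
print(stats["errors"], stats["frame"])
```

Any non-zero errors or frame count on the interconnect interface is worth following up, since a block that does not arrive intact has to be requested again.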
Lost Blocks: IP Packet Reassembly Failures

netstat -s
    Ip:
        84884742 total packets received
        1201 fragments dropped after timeout
        3384 packet reassembles failed
Finding a Problem with the Interconnect or IPC

    Top 5 Timed Events                         Avg  %Total
    ~~~~~~~~~~~~~~~~~~                        wait    Call
    Event                Waits    Time(s)     (ms)    Time  Wait Class
    ------------------ --------- -------- -------- ------- ----------
    log file sync        286,038   49,872      174    41.7  Commit
    gc buffer busy       177,315   29,021      164    24.3  Cluster
    gc cr block busy     110,348    5,703       52     4.8  Cluster
    gc cr block lost       4,272    4,953     1159     4.1  Cluster
    cr request retry       6,316    4,668      739     3.9  Other
Should never be here
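The columns of the report are related: the average wait is the total wait time divided by the number of waits. A quick check of that relation, using the figures from the report above (`avg_wait_ms` is a hypothetical helper):

```python
def avg_wait_ms(waits: int, time_s: float) -> int:
    """Average wait in milliseconds, rounded to the nearest whole number."""
    return round(time_s * 1000 / waits)


# (event, waits, total wait time in seconds) taken from the report
events = [
    ("log file sync", 286_038, 49_872),
    ("gc buffer busy", 177_315, 29_021),
    ("gc cr block busy", 110_348, 5_703),
    ("gc cr block lost", 4_272, 4_953),
    ("cr request retry", 6_316, 4_668),
]

for name, waits, time_s in events:
    print(f"{name:18s} {avg_wait_ms(waits, time_s):5d} ms")
```

The two events with the fewest waits have by far the longest average waits, on the order of a second each, which is why even a small number of them matters.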
What are the startup/shutdown sequence and dependencies?
Node Startup Sequence

[Figure: the stack on one node (VIP1, Service, Listener, Instance 1, ASM, Oracle Clusterware, Operating System), annotated with startup step numbers 1 through 7.]
Oracle Dependencies
Prior to 10.2.0.3
[Figure: Node1 and Node2 on the public network, each on its own Operating System, attached to shared storage.]
Oracle Dependencies
Prior to 10.2.0.3
[Figure: the same two nodes; Node2 now carries VIP1 and VIP2 together with its Service, Listener, instance 2, and ASM.]
Oracle Dependencies
[Figure: Node1 and Node 2 on the public network, each on its own Operating System, attached to shared storage.]
Oracle Dependencies
[Figure: the same two nodes; Node 2 now carries VIP1 and VIP2 together with its Service, Listener, instance 2, and ASM.]
What exactly is the VIP, its purpose, and how does it work?
Why does Oracle RAC 10g have a VIP?

- Protects database clients from long TCP/IP timeouts (can be >10 minutes)
- During normal operation, works the same as a hostname
- During failure, it removes the network timeout from connection request time; the client fails over immediately to the next address in the list

    sales.us.acme.com =
      (DESCRIPTION=
        (ADDRESS_LIST=
          (LOAD_BALANCE=on)(FAILOVER=ON)
          (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-vip)(PORT=1521))
          (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-vip)(PORT=1521)))
        (CONNECT_DATA=
          (SERVICE_NAME=sales.us.acme.com)))
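The client-side effect of FAILOVER=ON can be pictured as a loop over the address list: an address that returns an error right away costs almost nothing, and the client moves on to the next one. This is an illustration only, not Oracle Net code; `connect_first_available` and the `connect` callable are hypothetical, and the random ordering that LOAD_BALANCE=on adds is left out.

```python
def connect_first_available(addresses, connect):
    """Try each (host, port) in order and return the first successful connection.

    `connect` is any callable that returns a connection or raises OSError.
    """
    errors = []
    for host, port in addresses:
        try:
            return connect(host, port)
        except OSError as exc:
            # A relocated VIP answers with an error immediately,
            # so this branch is reached without a long TCP timeout.
            errors.append((host, exc))
    raise ConnectionError(f"no address reachable: {errors}")


def fake_connect(host, port):
    """Stand-in for a real connect call: sales1-vip is down, sales2-vip is up."""
    if host == "sales1-vip":
        raise ConnectionRefusedError("connection refused")
    return f"connected to {host}:{port}"


addresses = [("sales1-vip", 1521), ("sales2-vip", 1521)]
print(connect_first_available(addresses, fake_connect))
```

Without a VIP, the first attempt would instead hang until the TCP/IP timeout expired before the loop could move on.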
Oracle RAC 10g VIP

The Details!
- One for each node in the cluster
- Required for Oracle Clusterware installation
- IP and network name should not currently be in use
- Should be registered in DNS and be on the same subnet as the public IP address
- Can use OS bonding to provide failover and load balancing on network interfaces on the node
- Configuration managed by VIPCA
- Note that the netmask defaults to 255.255.255.0, rather than defaulting to the netmask of the underlying physical interface.
Oracle RAC VIP is DIFFERENT

- Only accepts connections when on its home node
- Failure on home node: relocates to another node in the cluster only to send an error back to the client (it will not be in the listener, so it cannot accept connections!)
- You will only have one active RAC VIP per node (there may be others that have relocated due to failure!)
- Independent of the number of databases running in the cluster
Oracle RAC 10g VIP

    [root@pmrac1 root]# ifconfig
    eth0      Link encap:Ethernet  HWaddr 00:12:79:D8:90:93
              inet addr:144.25.214.45  Bcast:144.25.215.255  Mask:255.255.252.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:5070815 errors:0 dropped:0 overruns:0 frame:0
              TX packets:3064435 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:509963813 (486.3 Mb)  TX bytes:3621223517 (3453.4 Mb)
              Interrupt:25
    eth0:1    Link encap:Ethernet  HWaddr 00:12:79:D8:90:93
              inet addr:144.25.214.47  Bcast:144.25.215.255  Mask:255.255.252.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:5762695 errors:0 dropped:0 overruns:0 frame:0
              TX packets:5679252 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:3400642002 (3243.1 Mb)  TX bytes:3166774792 (3020.0 Mb)
              Interrupt:25
VIP: eth0:1, 144.25.214.47

Listener.ora
    SID_LIST_LISTENER_PMRAC1 =
      (SID_LIST =
        (SID_DESC =
          (SID_NAME = PLSExtProc)
          (ORACLE_HOME = /u01/oracle/product/10gR2/asm)
          (PROGRAM = extproc)
        )
      )

    LISTENER_PMRAC1 =
      (DESCRIPTION_LIST =
        (DESCRIPTION =
          (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1))
          (ADDRESS = (PROTOCOL = TCP)(HOST = pmrac1-vip)(PORT = 1521)(IP = FIRST))
          (ADDRESS = (PROTOCOL = TCP)(HOST = 144.25.214.45)(PORT = 1521)(IP = FIRST))
        )
      )
VIP: the pmrac1-vip address
Application VIPs
- New resource in Oracle RAC 10g Release 2
- Created as functional VIPs which can be used to connect to an application regardless of the node it is running on
- VIP is a dependent resource of the user registered application
- There can be many VIPs, one per user application
Creating an Application VIP

- The usrvip script must run as root
- The default permissions need to be changed. As root:
      crs_setperm ApplicationVIP1 -o root
- Allow the oracle user to execute this script. As root:
      crs_setperm ApplicationVIP1 -u user:oracle:r-x
- Start the VIP. As oracle:
      crs_start ApplicationVIP1
What is the purpose of ONS? Is it required for anything other than FAN?
Oracle Notification Service (ONS)

- Publish/Subscribe messaging system
- Allows both local and remote consumption
- Used by Fast Application Notification (FAN) to publish HA Events and Load Balancing Events
- Used by FAN clients to subscribe to events
- Automatically installed and configured by the installation of Oracle Clusterware
- DO NOT TURN OFF: required by Oracle Clusterware and RAC
What is FAN?
- Fast Application Notification (FAN) is a RAC notification mechanism
- FAN HA Events: notification of Up/Down for service, instance & node
- Load Balancing Advisory Events: advise clients of current load for service and where to send connection requests
- Enable it, and forget it.
FAN Clients
- HA Events: JDBC Implicit Connection Cache, OCI, ODP.NET Connection Pools, Listener, Server Side Callouts, CMAN
- Load Balancing Advisory Events: JDBC Implicit Connection Cache, ODP.NET Connection Pools, Listener, CMAN
- New in RAC 11g: OCI Session Pools subscribe to Load Balancing Advisory Events to provide Runtime Connection Load Balancing
How does Oracle do load balancing across RAC instances?
Connection Load Balancing
[Figure: an application server asks the LISTENER over the network for service OLTP. The RAC database offers OLTP1 on N1, OLTP2 on N2, and OLTP3 on N3.]
Connection Load Balancing
[Figure: the LISTENER directs the connection to OLTP1. Clients reach the listeners and the RAC database over the network.]
Connection Pools
How do you Load Balance?
[Figure: an application connection pool holding many connections (c) into Real Application Clusters.]
Load Balancing Advisory

- Load Balancing Advisory is an advisory for balancing work across RAC instances.
- Load balances at the transaction level (not connections!)
- Directs work to where services are executing well and resources are available.
- Adjusts distribution for different power nodes, different priority and shape workloads, changing demand.
- Stops sending work to slow, hung, failed nodes early.
Load Balancing Advisory

- Automatic Workload Repository
- Calculates goodness locally, forwards to master MMON
- Master MMON builds advisory for distribution of work
- Records advice to SYS$SERVICE_METRICS
- Posts FAN event to AQ, PMON, ONS
View LBA FAN Event
Runtime Connection Load Balancing

- When the application does getConnection, the connection given is the one that will provide the best service.
- Supported by Oracle JDBC and ODP.NET connection pools (OCI Session Pools in RAC 11g!)
- Policy defined by setting GOAL on the Service
- Need to have Connection Load Balancing
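One way to picture what a pool does with the advisory is a weighted choice: each instance gets a share of new work in proportion to the advice it received. This is a sketch of the idea only, not the JDBC or ODP.NET implementation; `pick_instance` and the percentages are hypothetical.

```python
import random


def pick_instance(advice, rng=random):
    """Choose an instance name with probability proportional to its advised share.

    `advice` maps instance name -> percentage of work the advisory suggests.
    """
    instances = list(advice)
    weights = [advice[name] for name in instances]
    return rng.choices(instances, weights=weights, k=1)[0]


# Hypothetical advice: OLTP1 is performing well, OLTP3 is struggling.
advice = {"OLTP1": 60, "OLTP2": 30, "OLTP3": 10}
counts = {name: 0 for name in advice}
for _ in range(1000):
    counts[pick_instance(advice)] += 1
print(counts)
```

An instance whose advised share drops to zero stops receiving new requests, which matches the goal of steering work away from slow or hung nodes.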
Load Balancing Advisory Enabled through Service Goal

- THROUGHPUT: work requests are directed based on throughput.
    - Used when the work in a service completes at homogeneous rates. An example is a trading system where work requests are similar lengths.
- SERVICE_TIME: work requests are directed based on response time.
    - Used when the work in a service completes at various rates. An example is an Internet shopping system where work requests are various lengths.
- NONE: default setting, turns off the advisory
Fast Connection Failover

- Fast and reliable high availability for connections in an Oracle Real Application Clusters 10g environment
- Enable it and forget it
- Application can make it transparent to the user by trapping the SQL exception and retrying
- Supported by Oracle JDBC, OCI, and ODP.NET
FAN/FCF Client Integration

JDBC
When DOWN signal received from RAC 10g
- First pass: connections are marked as down
- Second pass: aborts and removes connections that are marked as down
- Routes new requests to surviving instances
- Throws exception if application was in midst of transaction
When UP signal received from RAC 10g
- Creates new connections to new instances
- Distributes new work requests evenly to all available instances
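The two-pass handling of a DOWN event can be sketched as follows. This is a simplified model of the behaviour described above, not the JDBC connection cache itself; `PooledConnection` and `handle_down_event` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PooledConnection:
    instance: str
    in_transaction: bool = False
    down: bool = False


def handle_down_event(pool, failed_instance):
    """Apply a DOWN event to a list of pooled connections, in place.

    Returns the connections that were in the middle of a transaction;
    their callers are the ones that would see an exception.
    """
    # First pass: mark every connection to the failed instance as down.
    for conn in pool:
        if conn.instance == failed_instance:
            conn.down = True
    # Second pass: remove the marked connections from the pool.
    interrupted = [c for c in pool if c.down and c.in_transaction]
    pool[:] = [c for c in pool if not c.down]
    return interrupted


pool = [
    PooledConnection("inst1", in_transaction=True),
    PooledConnection("inst1"),
    PooledConnection("inst2"),
]
interrupted = handle_down_event(pool, "inst1")
print(len(pool), len(interrupted))
```

Marking first and removing second means no new request is handed a connection that is about to be aborted while the cleanup is still running.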
Q & A
Appendix
For More Information
Search http://search.oracle.com for REAL APPLICATION CLUSTERS, or go to otn.oracle.com/rac
Useful Metalink Notes

- Note 342082.1: How to Change Subnet Masks for VIPs
- Note 294430.1: CSS Timeout Computation in RAC 10g
- Note 284752.1: 10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout
- Note 291962.1: Setting Up Bonding in SLES 9
- Note 291958.1: Setting Up Bonding in Suse SLES8
- Note 298891.1: Configuring Linux for the Oracle 10g VIP using bonding
- Note 283107.1: Configuring Solaris IP Multipathing (IPMP) for the Oracle 10g VIP
OTN.ORACLE.COM/RAC
- Workload Management with Oracle Real Application Clusters (FAN, FCF, Load Balancing)
- Using standard NFS to support a third voting disk on a stretch cluster configuration on Linux
- Using Oracle Clusterware to Protect 3rd Party Applications
- RAC Sample Code Page: http://www.oracle.com/technology/sample_code/products/rac/index.html
