
PURPOSE

Network information in Oracle Clusterware MUST be consistent with the settings at the OS level. This
note explains the basics of IPv4 subnets and how they relate to Oracle Clusterware (CRS or Grid Infrastructure).

DETAILS
A. What's IPv4 Subnet
An IP subnet is a logical range of IP addresses bounded by the subnet mask. The first address in the
range is used as the subnet identification (subnet ID, subnet number or route definition) while the last
is the broadcast address.

For example, subnet ID 10.1.0.0 and netmask 255.255.255.128 produces an IP range of 10.1.0.0 -
10.1.0.127. In this range, 10.1.0.0 is the subnet ID and 10.1.0.127 is the broadcast address. Changing
the netmask to 255.255.254.0, the IP range will be 10.1.0.0 - 10.1.1.255 where the broadcast address
is now 10.1.1.255.

B. Calculate Subnet ID online with ifconfig output


Several websites can be used to calculate the subnet ID when the IP address and netmask are known;
search for "subnet calculator" to find them.
Assuming the following as ifconfig output of a network adapter:
eth7 Link encap:Ethernet HWaddr 00:16:3E:11:22:88
..
inet 10.xxx.xx.108 netmask 0xfffff800 broadcast 10.220.15.255
For this adapter, the netmask is 255.255.248.0 (converted from hex 0xfffff800), the broadcast address is
10.220.15.255 and the subnet ID is 10.220.8.0.
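The same calculation can be done directly on the host. Below is a minimal bash sketch; the address
10.220.10.108 is an assumed example inside the 10.220.8.0 subnet above (substitute the real values from
ifconfig). It ANDs the address with the netmask to get the subnet ID and ORs in the inverted mask to get
the broadcast address.

#!/bin/bash
# Sketch: derive subnet ID and broadcast address from an IPv4 address and netmask.
# The address below is an assumed example; use the values reported by ifconfig.
IP=10.220.10.108
MASK=255.255.248.0        # dotted form of hex 0xfffff800

to_int() { local IFS=.; set -- $1; echo $(( ($1<<24) | ($2<<16) | ($3<<8) | $4 )); }
to_dot() { echo "$((($1>>24)&255)).$((($1>>16)&255)).$((($1>>8)&255)).$(($1&255))"; }

ip_i=$(to_int "$IP"); mask_i=$(to_int "$MASK")
echo "subnet ID : $(to_dot $(( ip_i & mask_i )))"                             # prints 10.220.8.0
echo "broadcast : $(to_dot $(( (ip_i & mask_i) | (~mask_i & 0xFFFFFFFF) )))"  # prints 10.220.15.255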

C. Find out Subnet ID from oifcfg


The oifcfg utility from Oracle Clusterware (CRS or Grid Infrastructure) can also be used to find out the
subnet ID and netmask:
$CLUSTERWARE_HOME/bin/oifcfg iflist -p -n
eth0 10.208.4.0 PRIVATE 255.255.255.0
eth1 10.1.0.128 PRIVATE 255.255.255.128
eth1 169.254.0.0 UNKNOWN 255.255.0.0
Note:
The first column is the network adapter name
The second column is the subnet ID
The third column indicates whether the subnet is private, public or unknown according to the RFC
standard; it has NOTHING to do with whether the interface is used as the private or public network in Oracle Clusterware
The last column is the netmask

D. Subnet Info in Oracle Clusterware - OCR


Both public and private network information is stored in OCR.
For pre-11.2, the private network information in OCR is used by ASM and database instances; for
11.2+, the information is used by Oracle Clusterware (GI), ASM and database instances.
To find out what's in OCR:
$CLUSTERWARE_HOME/bin/oifcfg getif
eth1 10.1.0.0 global cluster_interconnect
eth3 120.0.0.0 global public
Note:
The first column is the network adapter name
The second column is the subnet ID
The third column is always "global" and should not be changed
The last column indicates whether it's public or cluster_interconnect(private) in Oracle Clusterware
To modify:
Network information in OCR cannot be modified in place. To change it, as the Oracle Clusterware user,
remove the incorrect network entry and add it back.

For example, if "eth3" has wrong subnet number in OCR, to fix it, as Oracle Clusterware user:
$CLUSTERWARE_HOME/bin/oifcfg delif -global eth3
$CLUSTERWARE_HOME/bin/oifcfg setif -global eth3/120.0.0.0:public
Note: the same network adapter is not allowed to appear in OCR more than once, otherwise the
command fails with error: PRIF-50: duplicate interface is given in the input
Refer to Document 283684.1 for details to change both public and private network in OCR.
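As a quick consistency check between OCR and the OS, the two oifcfg outputs above can be compared
programmatically. The sketch below assumes CLUSTERWARE_HOME is set on the local node and simply flags any
OCR entry whose adapter/subnet pair is not reported by oifcfg iflist:

#!/bin/bash
# Sketch: flag OCR network entries not visible at the OS level on this node.
OIFCFG=$CLUSTERWARE_HOME/bin/oifcfg

$OIFCFG getif | while read ifname subnet scope role; do
  if $OIFCFG iflist -p -n | grep -q "^${ifname}  *${subnet} "; then
    echo "OK      : $ifname $subnet ($role) matches an OS-level subnet"
  else
    echo "MISMATCH: $ifname $subnet ($role) not found in 'oifcfg iflist -p -n'"
  fi
done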

E. Subnet Info in Oracle Clusterware - Nodeapps(network resource)


To find out nodeapps information
For pre-11.2 CRS:
$CLUSTERWARE_HOME/bin/srvctl config nodeapps -n <nodename> -a
VIP exists.: /<racnode1v>/10.1.0.44/255.255.255.128/eth3
Note:
The first column is the node VIP name - in this example "<racnode1v>"
The second column is the node VIP IP address
The third column is the public network netmask for this VIP
The fourth column is the network adapter name

For 11.2 GI
$CLUSTERWARE_HOME/bin/srvctl config network
Network exists: 1/10.10.0.0/255.255.255.128/eth3, type static
Network exists: 2/10.20.1.0/255.255.255.128/eth4, type static

$CLUSTERWARE_HOME/bin/srvctl config nodeapps


Network exists: 1/10.10.0.0/255.255.255.128/eth3, type static
VIP exists: /<wirac1fv>/10.10.0.52/10.10.0.0/255.255.255.128/eth3, hosting node <hostname>
GSD exists
ONS exists: Local port 6100, remote port 6200, EM port 2016
Note:
The VIP part is the same as in 10.2
For Network part:
The first column is the network number, the default network has number 1
The second column is the subnet ID, in this example 10.10.0.0 for the first network
The third column is the netmask, in this example both networks have 255.255.255.128
The fourth column is the network adapter name
"type static" means this is a static network rather than a dynamic network (DHCP etc.)

Error numbers/messages associated with a mismatch of public network info


PRCR-1013 : Failed to start resource ora.net1.network
PRCR-1064 : Failed to start resource ora.net1.network on node <hostname>
CRS-2674: Start of 'ora.net1.network' on '<hostname>' failed
PRCR-1079 : Failed to start resource ora.<hostname>.vip
CRS-2632: There are no more servers to try to place resource 'ora.<hostname>.vip' on that would
satisfy its placement policy

To modify nodeapps (network resource)


As root, execute "srvctl modify network -h" for 11.2.0.2 and above, or "srvctl modify nodeapps -h" for
pre-11.2.0.2. Refer to Note 276434.1 for details.
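For illustration only - the subnet, netmask, interface and network number below are assumptions, and
the exact options should be confirmed with the -h output mentioned above - correcting the default
network might look like this:

# 11.2.0.2 and above, as root (network number 1 is the default network):
srvctl modify network -k 1 -S 10.10.0.0/255.255.255.128/eth3

# pre-11.2.0.2, as root (the VIP definition carries the netmask and interface):
srvctl modify nodeapps -n <nodename> -A <vip name or address>/255.255.255.128/eth3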
PURPOSE
The purpose of this note is to make DBAs and System Administrators familiar with the concept of the
cluster (private) interconnect and its usage in Oracle Clusterware and Oracle RAC.

SCOPE
The main audience is DBAs and System Administrators. This note applies to Oracle Clusterware and
Oracle RAC releases 10.1, 10.2 and 11.1.

DETAILS
The term 'Private Interconnect' or 'Cluster Interconnect' is the private communication link between cluster nodes.
For Oracle Real Application Clusters and Oracle Clusterware we can differentiate between:
1. Physical interface (including NIC, cables and switches)
2. Private Interconnect used by Oracle Clusterware
3. Private Interconnect used by Oracle Real Application Clusters (RAC)

Physical Layout of the Private Interconnect


The basic requirements are described in the Installation Guide for each platform. Additional information
about certification can be found on Metalink Certify.
The interconnect, as identified by both subnet number and interface name, must be configured on all
clustered nodes.
A switch between the clustered nodes is an absolute requirement.

Why Do We Need a Private Interconnect?


Clusterware uses the interconnect for cluster synchronization (network heartbeat) and daemon
communication between the clustered nodes. This communication is based on the TCP protocol.
RAC uses the interconnect for cache fusion (UDP) and inter-process communication (TCP). Cache
Fusion is the remote memory mapping of Oracle buffers, shared between the caches of the participating
nodes in the cluster. The volume and traffic pattern of this data, shared between nodes, can vary greatly
depending on the application.
There are some vendor-specific protocol exceptions in Oracle 10g.

Interconnect Failure
The private interconnect is the critical communication link between nodes and instances. Network
errors will negatively impact Oracle Clusterware communication as well as RAC communication and
performance.
Private interconnect failures are recognized by Oracle Clusterware and result in what is known as a
'split-brain' or subdivided cluster.
A subdivided cluster can result in data corruption, so immediate action is taken to resolve this
condition. Interconnect failures therefore result in a node or subset of nodes in the cluster shutting
down. In the case of two equally sized sub-clusters, it is essentially random which sub-cluster survives,
and a customer's architecture and design should take this into account. At this time Oracle Clusterware
happens to use the node numbers to resolve this, but this could change in the future.

RAC instances will wait for the end of cluster reconfiguration to start their own reconfiguration.

Interconnect High Availability


It is an Oracle Best Practice to make the private interconnect highly available. Depending on the
operating system and vendor you can use OS network drivers such as bonding, teaming, IPMP,
Etherchannel, APA, or MultiPrivNIC. For that purpose the interconnect should be configured across two
NICs as well as two switches for complete redundancy so that the cluster can survive a single point of failure.
The setup of the highly available interconnect is transparent to Oracle Clusterware and RAC. This means
that an underlying failure is handled by the OS or networking drivers that manage the interfaces.
Oracle software does not recognize the change underneath, because the failover is handled by the
operating system transparently to Oracle.

Private Interconnect for Oracle Clusterware


The private node name determines the interface being used for Oracle Clusterware and is defined
during installation of the Clusterware.
With 3rd party vendor clusterware in place, Oracle Clusterware should be configured to use the same
interconnect (often referred to as the heartbeat) as the underlying vendor cluster software.

There are three ways to identify the private node name after installation:
olsnodes -n -p can be used to identify the private node name.
[oracle@racnode1 ~]$ olsnodes -n -p
racnode1 1 racnode1-priv
racnode2 2 racnode2-priv
[oracle@racnode1 ~]$
You may also check the private node name in the ocrdump output.
[SYSTEM.css.node_numbers.node1.privatename]
ORATEXT : racnode1-priv
ocssd.log has a line with clssnmClusterListener
[ CSSD]2009-02-23 03:09:06.945 [3086] >TRACE: clssnmClusterListener: Listening on
(ADDRESS=(PROTOCOL=tcp)(HOST=racnode1-priv)(PORT=49895))
Although the Oracle Universal Installer allows IP addresses when prompted for the private node name,
you should always use the host names defined in the hosts file or DNS. This enables you to change the
IP addresses when you need to move the server to a different IP range. If you used IP addresses instead
of host names, you would have to reinstall Oracle Clusterware when changing the server's IP addresses.

Private Interconnect for RAC


The database uses the private interconnect for communication and cache fusion. The private
interconnect is also referred to as "The Cluster Interconnect". Oracle highly recommends that the
database and Oracle Clusterware share the same interconnect.
The instances get the private interconnect definition from:

the spfile or init.ora when the parameter CLUSTER_INTERCONNECTS is set

the Oracle Cluster Registry (OCR), defined during the installation (default)

automatically from platform defaults if not defined in the spfile/init.ora or OCR
A value in the spfile or init.ora overrides a definition in the OCR.

Identification of the Private Interconnect for RAC


The value of the private interconnect for an instance can be identified using:
1). The views V$CLUSTER_INTERCONNECTS and V$CONFIGURED_INTERCONNECTS
V$CLUSTER_INTERCONNECTS displays one or more interconnects that are being used for cluster
communication.
V$CONFIGURED_INTERCONNECTS displays all the interconnects that Oracle is aware of. This view
aims to answer the question of where Oracle found the information about a specific interconnect.
Note: these views are not available in Oracle 10g Release 1 (10.1).
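A minimal way to query both views from the OS is sketched below; it assumes sqlplus is in the PATH
and the environment (ORACLE_HOME, ORACLE_SID) points at a running instance:

#!/bin/bash
# Sketch: show which interconnect(s) the local instance is using and where it found them.
sqlplus -s / as sysdba <<'EOF'
set linesize 120 pagesize 100
col name format a10
col ip_address format a16
col source format a40
select name, ip_address, is_public, source from v$cluster_interconnects;
select name, ip_address, is_public, source from v$configured_interconnects;
EOF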

2). The alert.log


a) CLUSTER_INTERCONNECTS parameter is set
b) Value read from Cluster Registry
Interface type 1 <interface name> 10.x.x.0 configured from OCR for use as a cluster interconnect
c) Interconnect finally used
Cluster communication is configured to use the following interface(s) for this instance
10.x.x.1
Note that there might be a difference from the interconnect defined in the OCR, because the value was
overridden by setting the CLUSTER_INTERCONNECTS parameter, or because Oracle selected another
interconnect when the one in the OCR was not available.
d) Protocol
Unix and Linux: cluster interconnect IPC version:Oracle UDP/IP (generic)
Windows: cluster interconnect IPC version:Oracle 9i Winsock2 TCP/IP IPC

Private Interconnect for ASM


Oracle ASM must use the same interconnect as Oracle Clusterware.

Different Interconnects for Clusterware and RAC


Oracle Clusterware does not monitor interfaces other than the one specified via the private host name
during installation. If RAC instances run with an interconnect that is different from the Oracle
Clusterware interconnect, any failure of that interconnect will remain undetected by Oracle Clusterware.
The instances will detect the failure themselves and clear the situation through a mechanism called
Instance Membership Recovery (IMR). Instance evictions due to IMR happen after 10 minutes.
This configuration should be avoided.

CLUSTER_INTERCONNECTS Parameter
This parameter overrides the value configured in the OCR. You may specify multiple IP addresses in
CLUSTER_INTERCONNECTS, which causes all of the listed interfaces to be used. Keep in mind that
a failure of any one of those interfaces will cause the instance to fail. The failed instance can only be
restarted once all the interfaces are fixed, or you restart all the instances without the faulty interface.
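For illustration, specifying two interfaces for one instance looks like the following; the addresses are
assumed examples of two private NICs on that node, and the list is colon-separated:

alter system set cluster_interconnects = '192.168.1.1:192.168.2.1' scope=spfile sid='<SID1>';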

Changing the Private Interconnect


1). Clusterware
A. Change of the IP address: change the IP address in the hosts file and/or DNS and make sure that
ASM and the database also use the same interconnect.
B. Change of the private node name used by Oracle Clusterware: requires a reinstall of Oracle
Clusterware.
2). Real Application Clusters
A. OCR: use Note 283684.1 How to Change Interconnect/Public Interface IP or Subnet in Oracle
Clusterware.
B. CLUSTER_INTERCONNECTS: shut down all instances, change the IP address and restart the
instances.

SYMPTOMS
After installing 10.1.0.2.0 Real Application Clusters on Linux, the private network is not used for RAC traffic.
In the alert log you may see a warning like the following:
WARNING: unknown interface type -1073839360 returned from OCR ifquery
interface name ^B interface identifier
this interface will not be used for cluster communication
Interface type 1 eth0 <138.X.XXX.0> configured from OCR for use as a public interface

You may also see:


Cluster communication is configured to use the following interface(s) for this instance
*public IP address*
CAUSE: This is a known issue

SOLUTION
Find the private network interface on each node with ifconfig -a:
[root@opcbrh1 root]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:90:27:BC:D9:8C
inet addr:*Public IP Address* Bcast:<138.X.XXX.255> Mask:<XXX.XXX.252.0>
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2522101 errors:0 dropped:0 overruns:0 frame:0
TX packets:2146596 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:1403142233 (1338.1 Mb) TX bytes:2127338734 (2028.7 Mb)
Interrupt:11 Base address:0x9000

eth1 Link encap:Ethernet HWaddr 00:30:BD:05:D5:63


inet addr:<192.XXX.X.10> Bcast:<192.XXX.X.255> Mask:<XXX.XXX.255.0>
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:742712 errors:0 dropped:0 overruns:0 frame:0
TX packets:466855 errors:0 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:100
RX bytes:356479236 (339.9 Mb) TX bytes:193062887 (184.1 Mb)
Interrupt:11

In this case eth1 is my private network so <192.XXX.X.10> is my private IP.


PFILE:
If you are using init.ora files, set cluster_interconnects to the private IP on each node and restart the
instances. Node 1 example:
cluster_interconnects = '<192.XXX.X.10>'

SPFILE:
If you are using an spfile, use an alter system command for each instance. Example:
alter system set cluster_interconnects = '<192.XXX.X.10>' scope=spfile sid='<SID1>';
alter system set cluster_interconnects = '<192.XXX.X.20>' scope=spfile sid='<SID2>';

After restarting the instances, you should see the correct IP address in the alert log. Example:
Cluster communication is configured to use the following interface(s) for this instance
192.XXX.X.10

GOAL
Frequently, in the case of node reboots, the log of the CSS daemon processes (ocssd.log) indicates that
the network heartbeat from one or more remote nodes was not received (for example, the message
"CRS-1610:Network communication with node xxxxxx (3) missing for 90% of timeout interval.
Removal of this node from cluster in 2.656 seconds" appears in the ocssd.log), and that the node
subsequently was rebooted (to avoid a split brain or because it was evicted by another node).

The script in this note performs a network connectivity check using ssh. This check complements ping
or traceroute since ssh uses the TCP protocol, while ping uses ICMP and traceroute on Linux/Unix uses
UDP (traceroute on Windows uses ICMP).

The network communication involves both the actual physical connection and the OS layers such as IP,
UDP, and TCP.
CRS (10g and 11.1) uses TCP to communicate, so using ssh to test the connection as well as the TCP and
IP layers is a better test than ping or traceroute.
Because CRS on 11.2 uses UDP to communicate, using ssh to test TCP is not the optimal test, but it
complements the traceroute test.

The script tests the private interconnect once every 5 seconds, so it puts an insignificant load on the
server.

SOLUTION
1) Create a file in a location of your choice and copy and paste the lines in the following note box:
#!/bin/ksh
export TODAY=`date "+%Y%m%d"`
while [ $TODAY -lt <the time you want the script to stop running> ] # format needs to be YearMonthDate, e.g. 20121231
do
export TODAY=`date "+%Y%m%d"`
export LOGFILE=<log file directory>/interconnect_test_${TODAY}.log
ssh <private Ip address for node 1> "hostname; date" >> $LOGFILE 2>&1
ssh <private Ip address for node 2> "hostname; date" >> $LOGFILE 2>&1

echo "" >> $LOGFILE


echo "" >> $LOGFILE

sleep 5
done

2) Replace <private Ip address for node 1> and <private Ip address for node 2> with the real private
interconnect IP addresses or private interconnect host names. The script executes the commands
"hostname" and "date" on each node and appends the output to a log file.

3) If there are more than two nodes in the cluster, add a line for each additional node, for example:
ssh <private Ip address for node 3> "hostname; date" >> $LOGFILE 2>&1
Make sure that this script issues ssh to every node including the local node.

4) Replace <log file directory> with a real directory name where the output of this script will go.
The log will likely grow by less than one MB every day, so you do not need a large amount of space.
You can also regularly delete old log files.

5) Replace <the time you want the script to stop running> with the date and year that you want the
script to stop running. The format has to be YearMonthDate like 20121231 for December 31, 2012.

6) Save the file and issue "chmod +x <the script file name>" to make the script executable.

7) Make sure that the ssh works without asking for any password over the private interconnect.
It is best to first test the ssh connection over the private interconnect from all nodes to every other
node including itself (local node).

8) Issue "nohup <the script file name> &" to run the script in background.
Run this script from every node in the cluster.

How to interpret the output in the log file:


When there is a problem with the private interconnect, or when a node is down, the timestamps in the
log file will be more than 5 seconds apart.
If the difference between successive timestamps is more than 10 seconds while the script was running,
then the network or server is experiencing a serious delay in transmitting network heartbeats. If the
difference is greater than 30 seconds the node will reboot, so you are unlikely to see a difference greater
than 30 seconds in the log.

Find the approximate time at which the node rebooted and check when the script produced its last
output before the reboot. If the time difference is more than 15 seconds, then a network problem is
the cause of the missing network heartbeats; investigate the reason that ssh (a regular OS command) hung.
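A quick way to spot such gaps is to convert every logged timestamp to epoch seconds and print any
interval above a threshold. The sketch below assumes GNU date (date -d), the log format produced by the
script above (a hostname line followed by a date line per node), and an assumed log file name:

#!/bin/bash
# Sketch: report gaps of more than 10 seconds between consecutive 'date' lines in the log.
LOG=/tmp/interconnect_test_20121231.log   # assumed log file name; adjust as needed
prev=""
grep -E '[0-9]{2}:[0-9]{2}:[0-9]{2}' "$LOG" | while read -r line; do
  now=$(date -d "$line" +%s 2>/dev/null) || continue
  if [ -n "$prev" ] && [ $(( now - prev )) -gt 10 ]; then
    echo "Gap of $(( now - prev )) seconds before: $line"
  fi
  prev=$now
done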

The following script is an example for a three-node cluster:


#!/bin/ksh
export TODAY=`date "+%Y%m%d"`
while [ $TODAY -lt 20121231 ] # format needs to be YearMonthDate
do
export TODAY=`date "+%Y%m%d"`
export LOGFILE=/tmp/interconnect_test_${TODAY}.log
ssh drrac1-priv "hostname; date" >> $LOGFILE 2>&1
ssh drrac2-priv "hostname; date" >> $LOGFILE 2>&1
ssh drrac3-priv "hostname; date" >> $LOGFILE 2>&1

echo "" >> $LOGFILE


echo "" >> $LOGFILE

sleep 5
done

PURPOSE
This note is a troubleshooting guide for the following situation: Oracle Clusterware cannot be started
on all nodes at once. For example, in a 2-node cluster, the Oracle Clusterware on the 2nd node won't
start, or attempting to start clusterware on the second node causes the first node's clusterware to shut
down.

In the clusterware alert log ($GRID_HOME/log/<hostname>/alert<hostname>.log) of one or more
nodes where Oracle Clusterware is started, the following messages are seen:

2012-07-14 19:24:18.420
[cssd(6192)]CRS-1612:Network communication with node racnode02 (2) missing for 50% of timeout
interval. Removal of this node from cluster in 14.500 seconds
2012-07-14 19:24:25.422
[cssd(6192)]CRS-1611:Network communication with node racnode02 (2) missing for 75% of timeout
interval. Removal of this node from cluster in 7.500 seconds
2012-07-14 19:24:30.424
[cssd(6192)]CRS-1610:Network communication with node racnode02 (2) missing for 90% of timeout
interval. Removal of this node from cluster in 2.500 seconds
2012-07-14 19:24:32.925
[cssd(6192)]CRS-1607:Node racnode02 is being evicted in cluster incarnation 179915229; details at
(:CSSNM00007:) in /u01/app/gridhome/log/racnode01/cssd/ocssd.log.

In the clusterware alert log ($GRID_HOME/log/<hostname>/alert<hostname>.log) of the evicted
node(s), the following messages are seen:
2012-07-14 19:24:29.282
[cssd(8625)]CRS-1608:This node was evicted by node 1, racnode01; details at (:CSSNM00005:) in
/u01/app/gridhome/log/racnode02/cssd/ocssd.log.
2012-07-14 19:24:29.282
[cssd(8625)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at
(:CSSSC00012:) in /u01/app/gridhome/log/racnode02/cssd/ocssd.log

TROUBLESHOOTING STEPS
The Oracle clusterware cannot be up on two (or more) nodes if those nodes cannot communicate with
each other over the interconnect.

The CRS-1612, CRS-1611 and CRS-1610 messages "Network communication with node NAME (n) missing
for PCT% of timeout interval" warn that ocssd on that node cannot communicate with ocssd on the other
node(s) over the interconnect. If this persists for the full timeout interval (usually thirty seconds -
reference: Doc ID 294430.1), Oracle Clusterware is designed to evict one of the nodes.

Therefore, the issue that requires troubleshooting in such a case is: why can the nodes not communicate
over the interconnect?

Step 1. Basic connectivity:


Follow the steps in Note 1054902.1 to validate the network connectivity:
Note 1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC

Note: If the problem is intermittent, also conduct the test from the following My Oracle Support
document:

To check TCP/IP communication:


Note 1445075.1 - Node reboot or eviction: How to check if your private interconnect CRS can
transmit network heartbeats

Step 2. After basic connectivity is confirmed, proceed with the following advanced checks:


1. Firewall
Firewall needs to be turned off on the private network.
If unsure whether there is any firewall between the nodes, use a tool like ipmon or wireshark.
Linux: Turn off iptables completely on all nodes and test:
service iptables stop
If clusterware on all nodes can come up when iptables is turned off completely but cannot come up
when iptables is running, then the IP packet filter rules need to be adjusted to allow ALL traffic between
the private interconnect interfaces of all nodes.
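Before (or in addition to) disabling the firewall, it helps to confirm whether any filter rules are loaded
at all and whether iptables is set to start at boot; these are standard Linux commands, run as root on
every node:

# List the currently loaded filter rules with packet counters
iptables -L -n -v

# Check whether iptables is configured to start at boot
chkconfig --list iptables

# Stop it completely for the test (as already noted above)
service iptables stop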

2. Multicast
In 11.2.0.2 (only), multicast must be configured on either 230.0.1.0 or 224.0.0.251 for Clusterware
startup. Follow the steps in Document 1212703.1 to check multicast communication.

Reference: GI 11.2.0.2 Install/Upgrade may fail due to Multicasting Requirement (Doc ID 1212703.1)

3. Jumbo Frames Configuration


If Jumbo Frames is configured, check to make sure it is configured properly:
a) Check the MTU on the private interconnect interface(s) of each node:
/bin/netstat -in
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 203273 0 0 0 2727 0 0 0 BMRU
Note: In the above example MTU is set to 1500 for eth0
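To confirm that the MTU matches on every node, the check from a) can be run over ssh in one pass. The
sketch below assumes Linux, passwordless ssh between the nodes, and that eth1 is the private interface;
the node and interface names are illustrative:

#!/bin/bash
# Sketch: print the MTU of the assumed private interface (eth1) on every node.
for node in racnode01 racnode02; do
  echo -n "$node eth1 MTU: "
  ssh "$node" cat /sys/class/net/eth1/mtu
done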

b) If MTU > 1500 on any interface, follow the steps in Note 1085638.1 to check if Jumbo Frames are
properly configured.

4. Third-party mDNS daemons running


HAIP uses mDNS. If there are any 3rd-party mDNS daemons running, such as avahi or bonjour, they
can actually remove the HAIP addresses and prevent cluster communication. Make sure that there are
no 3rd party mDNS daemons running on the server.
Note 1501093.1 - CSSD Fails to Join the Cluster After Private Network Recovered if avahi Daemon is
up and Running
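A quick way to check for such daemons on Linux is shown below; note that the mdnsd.bin process running
from the GI home is Oracle's own mDNS daemon and is expected, so only third-party daemons such as avahi
should be stopped (standard Linux service/chkconfig commands, run as root):

# Look for third-party mDNS daemons (ignore Oracle's own mdnsd.bin under the GI home)
ps -ef | egrep -i 'avahi|bonjour' | grep -v grep

# Example: stop avahi for testing and prevent it from starting at boot
service avahi-daemon stop
chkconfig avahi-daemon off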

5. Advanced UDP checks


Please refer to the steps in the following document to check UDP communication over the
interconnect:
Note 563566.1 - Troubleshooting gc block lost and Poor Network Performance in a RAC
Environment

6. OS resource starvation
If the system is too busy, e.g. the CPU is 100% busy, this issue may occur; review OSWatcher (OSW) and
Cluster Health Monitor (CHM) data to confirm.

Step 3. Known bugs


After reviewing all of the above, if no problems were found, check the following known issues:
Document 1488378.1 - List of gipc defects that prevent GI from starting/joining after private network
is restored or node rebooted.

PURPOSE
This note covers the current recommendation for the Real Application Cluster Interconnect and Jumbo
Frames

SCOPE
This article points out the issues surrounding Ethernet Jumbo Frame usage for the Oracle Real
Application Clusters (RAC) Interconnect. In Oracle Real Application Clusters, the Cluster Interconnect
is designed to run on a dedicated, or stand-alone, network. The Interconnect is designed to carry the
communication between the nodes in the cluster needed to check the cluster's condition and to
synchronize the various memory caches used by the database.

Ethernet is a widely used networking technology for Cluster Interconnects. Its variable frame size of
46-1500 bytes is the transfer unit between all Ethernet participants, such as the hosts and switches.
The upper bound, in this case 1500, is called the MTU (Maximum Transmission Unit). When an application
sends a message greater than 1500 bytes (the MTU), it is fragmented into frames of 1500 bytes or smaller
for transfer from one end-point to another. In Oracle RAC, the setting of DB_BLOCK_SIZE multiplied by
DB_FILE_MULTIBLOCK_READ_COUNT determines the maximum size of a message for the Global Cache, and
PARALLEL_EXECUTION_MESSAGE_SIZE determines the maximum size of a message used in Parallel Query. These
message sizes can range from 2K to 64K or more, and hence will be fragmented more with a lower/default
MTU.

Jumbo Frames introduce the ability for an Ethernet frame to exceed its IEEE 802 specified Maximum
Transmission Unit of 1500 bytes, up to a maximum of 9000 bytes. Even though Jumbo Frames are widely
available in most NICs and data-center class managed switches, they are not an IEEE approved standard.
While the benefits are clear, Jumbo Frames interoperability is not guaranteed with some existing
networking devices. Though Jumbo Frames can be implemented for private Cluster Interconnects, very
careful configuration and testing is required to realize their benefits. In many cases, failures or
inconsistencies can occur due to incorrect setup or bugs in the driver or switch software, which can
result in sub-optimal performance and network errors.

DETAILS
Configuration
In order to make Jumbo Frames work properly for a Cluster Interconnect network, careful configuration
at the host, Network Interface Card and switch level is required:
The host's network adapter must be configured with a persistent MTU size of 9000 (one that survives
reboots).
For example, on Linux run ifconfig <interface> mtu 9000, followed by ifconfig -a to show that the
setting took effect.
Certain NICs require additional hardware configuration.
For example, some Intel NICs require special descriptors and buffers to be configured for Jumbo Frames
to work properly.
The LAN switches must also be properly configured to increase the MTU for Jumbo Frame support.
Ensure the changes made are permanent (survive a power cycle) and that all devices use the same "Jumbo"
size, recommended 9000 (some switches do not support this size).
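On Linux, for example, an MTU set with ifconfig alone is lost at reboot; a sketch for making it
persistent on a Red Hat style system, assuming eth1 is the private interface (bounce the interface only
during an outage window, since this interrupts the interconnect):

# Immediate, non-persistent change (as root):
ifconfig eth1 mtu 9000

# Persistent change: add MTU=9000 to the interface configuration file, then bounce the interface
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth1
ifdown eth1 && ifup eth1

# Verify:
ifconfig eth1 | grep -i mtu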

Because of the lack of standards with Jumbo Frames, interoperability between switches can be
problematic and requires advanced networking skills to troubleshoot.
Remember that the smallest MTU used by any device in a given network path determines the maximum MTU
(the MTU ceiling) for all traffic travelling along that path.
Failing to properly set these parameters on all nodes of the cluster and on the switches can result in
unpredictable errors as well as degraded performance.

Testing
Ask your network/system administrators, along with the vendors, to fully test the configuration using
standard tools such as SPRAY/NETCAT and show that there is an improvement, not a degradation, when
using Jumbo Frames. Other basic ways to check that it is configured correctly on Linux/Unix are:

Traceroute: Notice that the 9000-byte packet goes through with no error while the 9001-byte packet
fails; this is a correct configuration that supports a message of up to 9000 bytes with no fragmentation:
[node01] $ traceroute -F node02-priv 9000
traceroute to node02-priv (10.x.x.2), 30 hops max, 9000 byte packets
1 node02-priv (10.x.x.2) 0.232 ms 0.176 ms 0.160 ms

[node01] $ traceroute -F node02-priv 9001


traceroute to node02-priv (10.x.x.2), 30 hops max, 9001 byte packets
traceroute: sendto: Message too long
1 traceroute: wrote node02-priv 9001 chars, ret=-1
* Note: Due to Oracle Bugzilla 7182 (requires logon privileges) -- also known as RedHat Bugzilla
464044 -- traceroute versions older than EL4.7 may not work correctly for this purpose.
* Note: Some versions of traceroute, e.g. traceroute 2.0.1 shipped with EL5, add the header size on top
of what is specified with the -F flag (the same as the ping behavior below). Newer versions of
traceroute, like 2.0.14 (shipped with OL6), have the old behavior of traceroute version 1 (the packet
size is exactly what is specified with the -F flag).

Ping: With ping we have to take into account an overhead of about 28 bytes per packet, so 8972 bytes
go through with no errors while 8973 bytes fail; this is a correct configuration that supports a message
of up to 9000 bytes with no fragmentation:
[node01]$ ping -c 2 -M do -s 8972 node02-priv
PING node02-priv (10.x.x.2) 8972(9000) bytes of data.
8980 bytes from node02-priv (10.x.x.2): icmp_seq=0 ttl=64 time=0.220 ms
8980 bytes from node02-priv (10.x.x.2): icmp_seq=1 ttl=64 time=0.197 ms

[node01]$ ping -c 2 -M do -s 8973 node02-priv


From node02-priv (10.x.x.1) icmp_seq=0 Frag needed and DF set (mtu = 9000)
From node02-priv (10.x.x.1) icmp_seq=0 Frag needed and DF set (mtu = 9000)
--- node02-priv ping statistics ---
0 packets transmitted, 0 received, +2 errors

For Solaris platform, the similar ping command is:


$ ping -c 2 -s node02-priv 8972
* Note: Ping reports fragmentation errors, due to exceeding the MTU size.

Performance
For RAC Interconnect traffic, devices correctly configured for Jumbo Frames improve performance by
reducing the TCP, UDP, and Ethernet overhead that occurs when large messages have to be broken up into
the smaller frames of standard Ethernet. Because one larger packet can be sent, the inter-packet latency
between the various smaller packets is eliminated. The increase in performance is most noticeable in
scenarios requiring high throughput and bandwidth, and when systems are CPU bound.

When using Jumbo Frames, fewer buffer transfers are required, which is part of the reduction in
fragmentation and reassembly in the IP stack, and thus helps reduce the latency of an Oracle block
transfer.

As illustrated in the configuration section, any incorrect setup may prevent instances from starting up
or can have a very negative effect on the performance.

Known Bugs
In some versions of Linux there are specific bugs in Intel's Ethernet drivers and in the UDP code path
in conjunction with Jumbo Frames that could affect performance. Check for and use the latest version of
these drivers to be sure you are not running into these older bugs.
The following bugzilla bugs, 162197 and 125122, are limited to RHEL3.

Recommendation
There is some complexity involved in configuring Jumbo Frames, which is highly hardware and OS
specific. The lack of a specific standard may expose OS and hardware bugs. Even with these
considerations, Oracle recommends using Jumbo Frames for private Cluster Interconnects.

Since there is no official standard for Jumbo Frames, this configuration should be properly load tested
by customers. Any indication of packet loss, socket buffer or DMA overflows, or TX and RX errors in
adapters should be noted and checked with the hardware and operating system vendors.

The recommendation in this note is strictly for the Oracle private interconnect only; it does not apply
to other NAS or iSCSI vendor tested and validated Jumbo Frames configured networks.

Oracle VM did not support Jumbo Frames in any OVM 2 release or in OVM 3.0; starting with OVM 3.1.1 it
is supported, refer to:
http://www.oracle.com/us/technologies/virtualization/ovm-3-1-whats-new-1634275.pdf

For the procedure to change the MTU, refer to Note 283684.1 - "Changing private network MTU only".
