Network information in Oracle Clusterware MUST be consistent with the settings at the OS level. This
note explains the basics of IPv4 subnets and how they relate to Oracle Clusterware (CRS or Grid Infrastructure).
DETAILS
A. What Is an IPv4 Subnet
An IP subnet is a logical separation of IP addresses bounded by the subnet mask. The first address is
used as the subnet identification (subnet ID, subnet number, or route definition), while the last is the
broadcast address.
For example, subnet ID 10.1.0.0 with netmask 255.255.255.128 produces the IP range 10.1.0.0 -
10.1.0.127. In this range, 10.1.0.0 is the subnet ID and 10.1.0.127 is the broadcast address. If the
netmask is changed to 255.255.254.0, the IP range becomes 10.1.0.0 - 10.1.1.255, and the broadcast
address is now 10.1.1.255.
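The subnet ID is the bitwise AND of the address and the netmask, and the broadcast address sets all
host bits to one. A minimal ksh sketch of this arithmetic (the address, netmask, and helper names are
illustrative, not from this note):
#!/bin/ksh
ip=10.1.0.5
mask=255.255.254.0
to_int() {   # dotted quad -> 32-bit integer
  echo "$1" | awk -F. '{ print ($1*16777216)+($2*65536)+($3*256)+$4 }'
}
to_dq() {    # 32-bit integer -> dotted quad
  echo "$(( $1/16777216%256 )).$(( $1/65536%256 )).$(( $1/256%256 )).$(( $1%256 ))"
}
ipn=$(to_int $ip)
maskn=$(to_int $mask)
subnet=$(( ipn & maskn ))                    # first address: subnet ID
bcast=$(( subnet | (4294967295 ^ maskn) ))   # last address: broadcast
echo "subnet ID: $(to_dq $subnet)  broadcast: $(to_dq $bcast)"
# prints: subnet ID: 10.1.0.0  broadcast: 10.1.1.255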
For example, if "eth3" has the wrong subnet number in the OCR, fix it as the Oracle Clusterware user:
$CLUSTERWARE_HOME/bin/oifcfg delif -global eth3
$CLUSTERWARE_HOME/bin/oifcfg setif -global eth3/120.0.0.0:public
Note: the same network adapter must not appear in the OCR more than once; otherwise the
command fails with the error: PRIF-50: duplicate interface is given in the input
Refer to Document 283684.1 for details on changing both the public and private network in the OCR.
For 11.2 GI
$CLUSTERWARE_HOME/bin/srvctl config network
Network exists: 1/10.10.0.0/255.255.255.128/eth3, type static
Network exists: 2/10.20.1.0/255.255.255.128/eth4, type static
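The interfaces currently stored in the OCR can also be listed with oifcfg getif; the output below is
illustrative only (interface names and subnets assumed):
$CLUSTERWARE_HOME/bin/oifcfg getif
eth3 10.10.0.0 global public
eth4 10.20.1.0 global cluster_interconnect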
SCOPE
The main audience is DBAs and system administrators. This note applies to Oracle Clusterware and
Oracle RAC releases 10.1, 10.2, and 11.1.
DETAILS
The term 'Private Interconnect' or 'Cluster Interconnect' refers to the private communication link
between cluster nodes.
For Oracle Real Application Clusters and Oracle Clusterware we can differentiate between:
1. Physical interface (including NIC, cables and switches)
2. Private Interconnect used by Oracle Clusterware
3. Private Interconnect used by Oracle Real Application Clusters (RAC)
Interconnect Failure
The private interconnect is the critical communication link between nodes and instances. Network
errors negatively impact Oracle Clusterware communication as well as RAC communication and
performance.
Private interconnect failures are recognized by Oracle Clusterware and result in what is known as a
'split-brain' or subdivided cluster.
A subdivided cluster can result in data corruption, so immediate action is taken to resolve this
condition. Interconnect failures, therefore, result in a node or a subset of nodes in the cluster shutting
down. In the case of two equally sized sub-clusters, it is essentially random which sub-cluster will
survive, and a customer's architecture and design should account for this. At this time Oracle
Clusterware happens to use the node numbers to resolve this, but this could change in the future.
RAC instances wait for the end of the cluster reconfiguration before starting their own reconfiguration.
There are three ways to identify the private node name after installation.
olsnodes -n -p can be used to identify the private node name:
[oracle@racnode1 ~]$ olsnodes -n -p
racnode1 1 racnode1-priv
racnode2 2 racnode2-priv
[oracle@racnode1 ~]$
You may also check the private node name in the ocrdump output.
[SYSTEM.css.node_numbers.node1.privatename]
ORATEXT : racnode1-priv
ocssd.log has a line with clssnmClusterListener:
[ CSSD]2009-02-23 03:09:06.945 [3086] >TRACE: clssnmClusterListener: Listening on
(ADDRESS=(PROTOCOL=tcp)(HOST=racnode1-priv)(PORT=49895))
Although the Oracle Universal Installer accepts IP addresses when prompting for the private node name,
you should always use the host names defined in the hosts file or DNS. This enables you to change the
IP addresses when you need to move the server to a different IP range. If you used IP addresses instead
of host names, you would have to reinstall Oracle Clusterware when changing the server's IP addresses.
CLUSTER_INTERCONNECTS Parameter
This parameter overrides the value configured in the OCR. You may specify multiple colon-separated
IP addresses in CLUSTER_INTERCONNECTS, which causes all listed interfaces to be used. Keep in
mind that a failure of any one of those interfaces will cause the instance to fail. The failed instance can
only be restarted once all the interfaces are fixed, or by starting all the instances without the faulty
interface.
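As a hedged illustration (the instance SIDs and addresses below are assumed, not from this note), one
value is set per instance, with multiple addresses separated by colons:
sqlplus -s / as sysdba <<'EOF'
alter system set cluster_interconnects='10.0.0.1:10.0.1.1' scope=spfile sid='RAC1';
alter system set cluster_interconnects='10.0.0.2:10.0.1.2' scope=spfile sid='RAC2';
EOF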
SYMPTOMS
After installing 10.1.0.2.0 Real Application Clusters on Linux, the private network is not used for RAC traffic.
In the alert log you may see a warning like the following:
WARNING: unknown interface type -1073839360 returned from OCR ifquery
interface name ^B interface identifier
this interface will not be used for cluster communication
Interface type 1 eth0 <138.X.XXX.0> configured from OCR for use as a public interface
SOLUTION
Find the private network interface on each node with ifconfig -a:
[root@opcbrh1 root]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:90:27:BC:D9:8C
inet addr:*Public IP Address* Bcast:<138.X.XXX.255> Mask:<XXX.XXX.252.0>
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2522101 errors:0 dropped:0 overruns:0 frame:0
TX packets:2146596 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:1403142233 (1338.1 Mb) TX bytes:2127338734 (2028.7 Mb)
Interrupt:11 Base address:0x9000
SPFILE:
If you are using an spfile, use an alter system command for each instance. Example:
alter system set cluster_interconnects = '<192.XXX.X.10>' scope=spfile sid='<SID1>';
alter system set cluster_interconnects = '<192.XXX.X.20>' scope=spfile sid='<SID2>';
After restarting the instances, you should see the correct IP address in the alert log. Example:
Cluster communication is configured to use the following interface(s) for this instance
192.XXX.X.10
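A quick hedged check (the alert log file name and location are assumed; adjust for your diagnostic
destination) is to search the alert log after the restart:
grep -A 2 "Cluster communication is configured" alert_${ORACLE_SID}.log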
GOAL
Frequently, in the case of node reboots, the log of the CSS daemon processes (ocssd.log) indicates that
the network heartbeat from one or more remote nodes was not received (for example, the message
"CRS-1610:Network communication with node xxxxxx (3) missing for 90% of timeout interval.
Removal of this node from cluster in 2.656 seconds" appears in the ocssd.log), and that the node
was subsequently rebooted (to avoid a split brain or because it was evicted by another node).
The script here performs a network connectivity check using ssh. This check complements ping and
traceroute, since ssh uses the TCP protocol while ping uses ICMP and traceroute on Linux/Unix uses
UDP (traceroute on Windows uses ICMP).
Network communication involves both the physical connection and OS layers such as IP, UDP, and TCP.
CRS (10g and 11.1) uses TCP to communicate, so using ssh to test the connection, as well as the TCP
and IP layers, is a better test than ping or traceroute.
Because CRS on 11.2 uses UDP to communicate, using ssh to test TCP is not the optimal test, but it
complements the traceroute test.
The script tests the private interconnect once every 5 seconds, so it puts an insignificant load on the
server.
SOLUTION
1) Create a file in a location of your choice and copy and paste the lines in the following note box:
#!/bin/ksh
export TODAY=`date "+%Y%m%d"`
while [ $TODAY -lt <the time you want the script to stop running> ] # format needs to be YYYYMMDD
do
  export TODAY=`date "+%Y%m%d"`
  export LOGFILE=<log file directory>/interconnect_test_${TODAY}.log
  ssh <private Ip address for node 1> "hostname; date" >> $LOGFILE 2>&1
  ssh <private Ip address for node 2> "hostname; date" >> $LOGFILE 2>&1
  sleep 5
done
2) Replace <private Ip address for node 1> and <private Ip address for node 2> with the real private
interconnect IP addresses or private interconnect host names. The script executes the commands
"hostname" and "date" and appends the output to a log file.
3) If there are more than two nodes in the cluster, add a line like the following for each additional node:
ssh <private Ip address for node N> "hostname; date" >> $LOGFILE 2>&1
Make sure that this script issues ssh to every node including the local node.
4) Replace <log file directory> with a real directory name where the output of this script will go.
The log file will likely grow by less than one MB per day, so you do not need a large amount of space.
You can also regularly delete old log files.
5) Replace <the time you want the script to stop running> with the date on which you want the
script to stop running. The format has to be YYYYMMDD, for example 20121231 for December 31, 2012.
6) Save the file and issue "chmod +x <the script file name>" to make the script executable.
7) Make sure that ssh works without asking for any password over the private interconnect.
It is best to first test the ssh connection over the private interconnect from all nodes to every other
node including itself (the local node); a quick pre-check is sketched after this list.
8) Issue "nohup <the script file name> &" to run the script in background.
Run this script from every node in the cluster.
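The pre-check referenced in step 7 can be scripted; this is a minimal sketch (the -priv host names are
assumed) that fails fast instead of hanging at a password prompt:
for host in racnode1-priv racnode2-priv; do
  ssh -o BatchMode=yes -o ConnectTimeout=5 $host hostname || echo "ssh to $host failed"
done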
Find the approximate time at which the node was rebooted and check when the script shows its last
output before the reboot. If the time difference is more than 15 seconds, then a network problem is
the cause of the missing network heartbeats. Investigate the reason that ssh (a regular OS command) hung.
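A minimal ksh sketch of that comparison (it assumes the log format produced by the script above and
GNU date; both are assumptions, not part of this note):
#!/bin/ksh
# Flag gaps of more than 15 seconds between consecutive "date" lines in the log.
LOGFILE=$1
grep -E '^[A-Z][a-z]{2} [A-Z][a-z]{2}' "$LOGFILE" | while read -r line; do
  now=$(date -d "$line" +%s 2>/dev/null) || continue   # GNU date required
  if [ -n "$prev" ] && [ $(( now - prev )) -gt 15 ]; then
    echo "gap of $(( now - prev ))s before: $line"
  fi
  prev=$now
done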
PURPOSE
This note is a troubleshooting guide for the following situation: Oracle Clusterware cannot be started
on all nodes at once. For example, in a 2-node cluster, the Oracle Clusterware on the 2nd node won't
start, or attempting to start the clusterware on the second node causes the first node's clusterware to
shut down.
2012-07-14 19:24:18.420
[cssd(6192)]CRS-1612:Network communication with node racnode02 (2) missing for 50% of timeout
interval. Removal of this node from cluster in 14.500 seconds
2012-07-14 19:24:25.422
[cssd(6192)]CRS-1611:Network communication with node racnode02 (2) missing for 75% of timeout
interval. Removal of this node from cluster in 7.500 seconds
2012-07-14 19:24:30.424
[cssd(6192)]CRS-1610:Network communication with node racnode02 (2) missing for 90% of timeout
interval. Removal of this node from cluster in 2.500 seconds
2012-07-14 19:24:32.925
[cssd(6192)]CRS-1607:Node racnode02 is being evicted in cluster incarnation 179915229; details at
(:CSSNM00007:) in /u01/app/gridhome/log/racnode01/cssd/ocssd.log.
TROUBLESHOOTING STEPS
The Oracle Clusterware cannot be up on two (or more) nodes if those nodes cannot communicate with
each other over the interconnect.
The CRS-1612, CRS-1611, and CRS-1610 messages "Network communication with node NAME (n)
missing for PCT% of timeout interval" warn that ocssd on one node cannot communicate with ocssd
on the other node(s) over the interconnect. If this persists for the full timeout interval (usually thirty
seconds - reference: Document 294430.1), then Oracle Clusterware is designed to evict one of the nodes.
Therefore, the issue that requires troubleshooting in such a case is: why can the nodes not communicate
over the interconnect?
Note: If the problem is intermittent, also conduct the test from the following My Oracle Support
document:
2. Multicast
In 11.2.0.2 (only), multicast must be configured on either 230.0.1.0 or 224.0.0.251 for Clusterware
startup. Follow the steps in Document 1212703.1 to check multicast communication.
b) If MTU > 1500 on any interface, follow the steps in Note 1085638.1 to check whether Jumbo Frames
are properly configured (a quick MTU listing is sketched after this list).
6. OS resource starvation
If the system is too busy (for example, 100% CPU usage), this issue may occur. Review OSWatcher
and CHM data to confirm.
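A quick way to list the MTU on every interface, as referenced in b) above (Linux syntax; a sketch,
not from the original note):
ip link show | grep -i mtu       # modern Linux
ifconfig -a | grep -i mtu        # older systems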
PURPOSE
This note covers the current recommendation for the Real Application Clusters Interconnect and
Jumbo Frames.
SCOPE
This article points out the issues surrounding Ethernet Jumbo Frame usage for the Oracle Real
Application Clusters (RAC) Interconnect. In Oracle Real Application Clusters, the Cluster Interconnect
is designed to run on a dedicated, stand-alone network. The Interconnect carries the communication
between the nodes in the cluster needed to check the cluster's condition and to synchronize the
various memory caches used by the database.
Ethernet is a widely used networking technology for Cluster Interconnects. Its variable frame size of
46-1500 bytes is the transfer unit between all Ethernet participants, such as hosts and switches.
The upper bound, in this case 1500, is called the MTU (Maximum Transmission Unit). When an
application sends a message greater than the MTU, the message is fragmented into frames of 1500
bytes or smaller for transfer from one end-point to another. In Oracle RAC, DB_BLOCK_SIZE
multiplied by DB_FILE_MULTIBLOCK_READ_COUNT determines the maximum size of a message
for the Global Cache, and PARALLEL_EXECUTION_MESSAGE_SIZE determines the maximum
size of a message used in Parallel Query. These message sizes can range from 2K to 64K or more,
and hence will be fragmented more with a lower/default MTU.
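A back-of-envelope ksh sketch of that fragmentation effect (the parameter values and the 28-byte
IP+UDP header overhead are illustrative assumptions):
#!/bin/ksh
msg=$(( 8192 * 16 ))    # e.g. 8K block size * multiblock read count of 16 = 128K message
for mtu in 1500 9000; do
  payload=$(( mtu - 28 ))                      # subtract IP (20) + UDP (8) headers
  frames=$(( (msg + payload - 1) / payload ))  # round up
  echo "MTU $mtu: $frames frames per ${msg}-byte message"
done
# MTU 1500: 90 frames per 131072-byte message
# MTU 9000: 15 frames per 131072-byte message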
Jumbo Frames introduce the ability for an Ethernet frame to exceed its IEEE 802 specified Maximum
Transfer Unit of 1500 bytes, up to a maximum of 9000 bytes. Even though Jumbo Frames are widely
available in most NICs and data-center class managed switches, they are not an IEEE approved standard.
While the benefits are clear, Jumbo Frames interoperability is not guaranteed with some existing
networking devices. Though Jumbo Frames can be implemented for private Cluster Interconnects,
very careful configuration and testing is required to realize their benefits. In many cases, failures or
inconsistencies can occur due to incorrect setup or bugs in the driver or switch software, which can
result in sub-optimal performance and network errors.
DETAILS
Configuration
In order to make Jumbo Frames work properly for a Cluster Interconnect network, careful configuration
at the host, its Network Interface Card, and the switch level is required:
The host's network adapter must be configured with a persistent MTU size of 9000 (one that survives
reboots).
For example, run ifconfig <interface> mtu 9000, followed by ifconfig -a to show that the setting took effect.
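As a hedged example on RHEL-style Linux (the interface name eth1 is assumed), the setting can be
made persistent in the interface configuration file:
ifconfig eth1 mtu 9000                                         # runtime change only
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth1   # survives reboot on RHEL-style systems
ifconfig eth1 | grep -i mtu                                    # verify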
Certain NICs require additional hardware configuration.
For example, some Intel NICs require special descriptors and buffers to be configured for Jumbo
Frames to work properly.
The LAN switches must also be properly configured to increase the MTU for Jumbo Frame support.
Ensure the changes made are permanent (they survive a power cycle) and that both ends agree on the
same Jumbo Frame size, recommended 9000 (some switches do not support this size).
Because of the lack of standards for Jumbo Frames, interoperability between switches can be
problematic and requires advanced networking skills to troubleshoot.
Remember that the smallest MTU used by any device in a given network path determines the
maximum MTU (the MTU ceiling) for all traffic travelling along that path.
Failing to properly set these parameters on all nodes of the cluster and on the switches can result in
unpredictable errors as well as a degradation in performance.
Testing
Ask your network and system administrators, along with the vendors, to fully test the configuration
using standard tools such as SPRAY/NETCAT and to show that there is an improvement, not a
degradation, when using Jumbo Frames. Other basic ways to check that it is configured correctly on
Linux/Unix are:
Traceroute: Notice that the 9000-byte packet goes through with no error while a 9001-byte packet
fails; this is a correct configuration that supports a message of up to 9000 bytes with no fragmentation:
[node01] $ traceroute -F node02-priv 9000
traceroute to node02-priv (10.x.x.2), 30 hops max, 9000 byte packets
1 node02-priv (10.x.x.2) 0.232 ms 0.176 ms 0.160 ms
Ping: With ping we have to take into account an overhead of about 28 bytes per packet, so an
8972-byte payload goes through with no errors while 8973 fails; this is a correct configuration that
supports a message of up to 9000 bytes with no fragmentation:
[node01]$ ping -c 2 -M do -s 8972 node02-priv
PING node02-priv (10.x.x.2) 8972(9000) bytes of data.
8980 bytes from node02-priv (10.x.x.2): icmp_seq=0 ttl=64 time=0.220 ms
8980 bytes from node02-priv (10.x.x.2): icmp_seq=1 ttl=64 time=0.197 ms
Performance
For RAC Interconnect traffic, devices correctly configured for Jumbo Frames improve performance
by reducing the TCP, UDP, and Ethernet overhead that occurs when large messages have to be broken
up into the smaller frames of standard Ethernet. Because one larger packet can be sent, the inter-packet
latency between the various smaller packets is eliminated. The increase in performance is most
noticeable in scenarios requiring high throughput and bandwidth, and when systems are CPU bound.
When using Jumbo Frames, fewer buffer transfers are required, which is part of the reduction in
fragmentation and reassembly in the IP stack, and thus helps reduce the latency of an Oracle block
transfer.
As illustrated in the configuration section, any incorrect setup may prevent instances from starting up
or can have a very negative effect on performance.
Known Bugs
In some versions of Linux there are specific bugs in Intel's Ethernet drivers and in the UDP code path
in conjunction with Jumbo Frames that could affect performance. Check for and use the latest version
of these drivers to be sure you are not running into these older bugs.
The Red Hat bugzilla bugs 162197 and 125122 are limited to RHEL3.
Recommendation
There is some complexity involved in configuring Jumbo Frames, which is highly hardware and OS
specific, and the lack of a specific standard may expose OS and hardware bugs. Even with these
considerations, Oracle recommends using Jumbo Frames for private Cluster Interconnects.
Since there is no official standard for Jumbo Frames, this configuration should be properly load tested
by customers. Any indication of packet loss, socket buffer or DMA overflows, or TX and RX errors in
adapters should be noted and checked with the hardware and operating system vendors.
The recommendation in this note is strictly for the Oracle private interconnect only; it does not apply
to NAS or iSCSI networks with vendor-tested and validated Jumbo Frames configurations.
Oracle VM did not support Jumbo Frames in any OVM 2.x release or in OVM 3.0; it is supported
starting with OVM 3.1.1. Refer to:
http://www.oracle.com/us/technologies/virtualization/ovm-3-1-whats-new-1634275.pdf
For the procedure to change the MTU, refer to Note 283684.1 - "Changing private network MTU only".