Tuesday, 11 August 2015

RAC on Windows: Oracle Clusterware Node Evictions a.k.a. Why do we get a Blue Screen (BSOD) Caused By Orafencedrv.sys? [ID 337784.1]

Applies to: Oracle Server - Enterprise Edition - Version 10.2.0.1 to 11.2.0.3 [Release 10.2 to 11.2]
Microsoft Windows Itanium (64-bit)
Microsoft Windows x64 (64-bit)
Microsoft Windows 2000
Microsoft Windows XP
Microsoft Windows Server 2003 (64-bit Itanium)
Microsoft Windows Server 2003 (64-bit AMD64 and Intel EM64T)
Microsoft Windows Server 2003 R2 (64-bit AMD64 and Intel EM64T)
Microsoft Windows Server 2003 R2 (32-bit)
Oracle Server Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7

Symptoms

While running RAC, a blue screen is shown and the node reboots. Windows creates a memory dump showing that orafencedrv.sys is involved.

Changes

The following STOP code can be observed in the Blue screen:
STOP: 0x0000FFFF (0x00000000000000000000, 0x00000000000000000000, 0x00000000000000000000, 0x00000000000000000000)

Cause

When running Oracle RAC/Clusterware on Windows, the OracleCSService is designed to reboot the OS if it detects a problem in the clusterware. When the CSS daemon reboots the node, the result is a blue screen.

This failure is by design. Any time the OracleCSService process fails, it causes the machine to reboot by means of an IOCTL to the I/O fence driver. This kernel driver then takes a fault, which Windows treats as an unhandled exception, and that is what causes the blue screen.

Therefore, blue screens that implicate orafencedrv.sys occur by design in an Oracle RAC on Windows environment.

Note that our Clusterware / Grid Infrastructure software is designed to fence and reboot a node in either of the following two cases:

1. Oracle Cluster nodes are designed to 'check in' with each other every second in two ways:
a. They check that they can each ping the other(s) on their private interconnect IP addresses. If a node does not respond to network pings on the private interconnect within the network heartbeat timeout (which defaults to 30 seconds), then the Oracle Cluster Synchronization Services Daemon (OCSSD) instructs our Oracle Fence Driver (orafencedrv.sys) to evict the unresponsive node from the cluster.
b. The nodes also check in with each other every second as to whether each node can read and write to the voting disk. The timeouts for this event differ depending on the activity of the clusterware at that moment. What we deem to be a 'short disk timeout' is 60 seconds (this applies when the cluster is under normal operation), while what we deem to be a 'long disk timeout' is 200 seconds (this applies when the cluster is undergoing reconfiguration at the time the node fails to meet its 'voting disk' checkin). In either case, the action we take is the same: the Oracle Cluster Synchronization Services Daemon (OCSSD) instructs our Oracle Fence Driver (orafencedrv.sys) to evict the unresponsive node from the cluster.

2. OraFence has a built-in mechanism to check that it was scheduled in time. If it is not scheduled within 5 seconds, it will also reboot the node. In this way, OraFence is designed to fence and reboot a node that it perceives to be 'hung' once its own timeout has been reached. Note that the default timeout for the OraFence driver is a (very low) 0x05 (5 seconds). This means that if the OraFence driver detects what it perceives to be a hang, for example at the operating system level, and that hang persists beyond 5 seconds, the OraFence driver may, of its own accord, fence and evict the node.
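
The two cases above can be summarised as a small decision function. This is only an illustrative sketch in Python, not how OCSSD or orafencedrv.sys are actually implemented: the function name, parameter names and return strings are hypothetical, and the thresholds are simply the default values quoted in this note.

```python
from typing import Optional

# Default values quoted in this note (all in seconds).
NETWORK_TIMEOUT_S = 30      # private interconnect checkin (case 1a)
SHORT_DISK_TIMEOUT_S = 60   # voting disk checkin, normal operation (case 1b)
LONG_DISK_TIMEOUT_S = 200   # voting disk checkin, during reconfiguration (case 1b)
ORAFENCE_TIMEOUT_S = 5      # OraFence scheduling check (case 2)


def eviction_reason(network_silence_s: float,
                    disk_silence_s: float,
                    fence_delay_s: float,
                    reconfiguring: bool = False) -> Optional[str]:
    """Return which case (if any) would lead to a fence/reboot.

    Hypothetical helper: the inputs are the seconds elapsed since the last
    successful network checkin, voting disk checkin and OraFence scheduling.
    """
    if fence_delay_s >= ORAFENCE_TIMEOUT_S:
        return "case 2: OraFence was not scheduled in time"
    if network_silence_s >= NETWORK_TIMEOUT_S:
        return "case 1a: missed network checkins"
    disk_limit = LONG_DISK_TIMEOUT_S if reconfiguring else SHORT_DISK_TIMEOUT_S
    if disk_silence_s >= disk_limit:
        return "case 1b: missed voting disk checkins"
    return None  # node is considered healthy
```

For example, a node that has not reached the voting disk for 90 seconds would be evicted under normal operation, but not while the cluster is reconfiguring, because the longer 200 second limit applies there.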

Solution

So the question is not why the blue screen occurs, but why the OracleCSService process is failing (node eviction) or why OraFenceService was not scheduled.
The first step in diagnosing any BSOD that implicates the OraFence driver is to determine which type (1 or 2) of eviction / reboot your cluster has experienced.

To that end, please start with the Windows System Event Viewer log (if you intend to upload this log to support, please save it off as a .TXT file before doing so), and check for the 'bugcheck' or 'stop code' that was reported when the node was brought down. If that bugcheck code shows:

1. 0x0000FFFF, then you have experienced the first type of eviction explained above, and you should look to your Oracle Clusterware alert log file ($GRID_HOME\log\) and your ocssd.log files for answers. More than likely you will see a series of missed checkins leading up to the eviction. This almost always indicates either an underlying network issue (for example: faulty cables, cards, drivers, or ports on the network switch) or OS resource contention, whereby the node cannot respond to the checkin because it is 'busy' with other resource-intensive operations and/or cannot get CPU time to respond to the network checkin. In both cases, the root cause needs to be sought at the OS / System Administration level. The log locations by release are:
10.1: %CRS_HOME%\css\log\ocssd.log
10.2: %CRS_HOME%\log\alert.log AND %CRS_HOME%\log\\cssd\ocssd.log
11.1: %CRS_HOME%\log\alert.log AND %CRS_HOME%\log\\cssd\ocssd.log
11.2: %CRS_HOME%\log\alert.log AND %GI_HOME%\log\\cssd\ocssd.log


2. 0x0000FFFE, then you have experienced the second type of eviction explained above. You should check whether your node was under heavy load or is truly hanging from time to time for any reason, and/or consider increasing the default OraFence timeout value. We have found that 5 seconds is a very aggressive timeout value and can safely be adjusted upwards. It is controlled with the following registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OraFenceService\
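
As a quick aid when going through event log text, the mapping from bugcheck code to eviction type can be sketched as below. This is a minimal Python sketch: the function name and the wording of the returned descriptions are hypothetical, and it assumes the STOP code appears in the text as a hexadecimal value such as 0x0000FFFF, as in the example shown earlier in this note.

```python
import re
from typing import Optional

# Bugcheck codes described in this note and where to look next.
ORAFENCE_BUGCHECKS = {
    0x0000FFFF: "type 1: CSS eviction - check the Clusterware alert log and ocssd.log",
    0x0000FFFE: "type 2: OraFence timeout - check for node hangs / heavy load",
}


def classify_stop_code(text: str) -> Optional[str]:
    """Classify the first hexadecimal code found in a STOP / bugcheck line.

    Returns None if no hex value is present or if the code is not one of
    the two OraFence codes covered by this note.
    """
    match = re.search(r"0x[0-9A-Fa-f]+", text)
    if match is None:
        return None
    code = int(match.group(0), 16)
    return ORAFENCE_BUGCHECKS.get(code)


if __name__ == "__main__":
    print(classify_stop_code("STOP: 0x0000FFFE (0x0, 0x0, 0x0, 0x0)"))
```

Only the first hexadecimal value on the line is examined, so the bugcheck parameters in parentheses are ignored.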


Please be sure to configure your server(s) to automatically reboot on a Bug Check/System Failure event. Otherwise, you will see a blue screen without any further activity (the node will not actually reboot). To check this setting, go to:
Control Panel -> System -> Advanced system settings -> 'Advanced' tab -> 'Startup and Recovery' Settings -> "System Failure" -> select "Automatically Restart"


 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As a normal function of our Oracle Clusterware / Grid Infrastructure, OraFenceService is designed to fence (I/O) and reboot a node if it perceives that
node is 'hung' once its configured timeout has been reached. The default timeout for the OraFence driver is a (very low) 5 seconds.
What this means is that if the OraFence driver detects what it perceives to be a hang at the operating system level and that hang persists beyond 5 seconds,
it's possible that the OraFence driver - of its own accord - will fence and evict the node.
In some cases it is advisable to increase the OraFence timeout value to as high as 10 seconds.
The OraFence timeout is controlled by the following Windows registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OraFenceService\Timeout.
Note that modification of the OraFenceService timeout value requires a node reboot.
Please increase the OraFence timeout from the default of 5 seconds to a value of 10 seconds and let us know whether you encounter the issue again.
~~~~~~~~~~~~~~~~~~~~~~
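
Before changing anything, it can be useful to read the current values. The following is a minimal Python sketch using the standard winreg module. Several things here are assumptions rather than statements from this note: that the Timeout value is stored as a DWORD, and that the "Automatically Restart" setting corresponds to the AutoReboot value under SYSTEM\CurrentControlSet\Control\CrashControl. The helper name is hypothetical. Please verify both locations on your own system; the sketch only reads and never writes.

```python
import sys
from typing import Optional

# Key taken from this note; the value name "Timeout" is given above.
ORAFENCE_KEY = r"SYSTEM\CurrentControlSet\Services\OraFenceService"
# Assumption: where Windows keeps the "Automatically Restart" setting.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def read_hklm_int(subkey: str, name: str) -> Optional[int]:
    """Read an integer value under HKEY_LOCAL_MACHINE.

    Returns None when not running on Windows, when the key or value
    does not exist, or when the stored value is not an integer.
    """
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
            value, _value_type = winreg.QueryValueEx(key, name)
    except OSError:
        return None
    return value if isinstance(value, int) else None


if __name__ == "__main__":
    print("OraFence Timeout (seconds):", read_hklm_int(ORAFENCE_KEY, "Timeout"))
    print("AutoReboot (1 = enabled):", read_hklm_int(CRASH_CONTROL_KEY, "AutoReboot"))
```

A result of None means the value could not be read, for example because OraFenceService is not installed on that machine.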

 
