Oracle Database Notes: August 2015

Saturday 29 August 2015

Steps to Add a Node to an Existing Cluster

1> Compare "/etc/sysctl.conf" between the existing and new nodes.
2> Compare "/etc/security/limits.conf" between the existing and new nodes.
3> Compare "/etc/pam.d/login" between the existing and new nodes.
4> Create the necessary oracle groups and users.
5> Check that ntpd is enabled and running: chkconfig --list ntpd, then service ntpd status
6> Create the profiles that set the environment for the grid and oracle users (compare each file between the existing and new nodes):
"/home/oracle/.bash_profile" file and "/home/grid/.bash_profile"
"/home/oracle/grid_env" and "/home/grid/grid_env"
"/home/oracle/db_env" and "/home/grid/db_env"
7> Check the kernel release and architecture: uname -rm
8> Install all necessary ASM libraries and packages.
9> Ping all the IP names of New_Node:
ping -c 3 112-rac1
ping -c 3 112-rac1-priv
ping -c 3 112-rac1-vip
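
The file comparisons in steps 1 to 3 can be partly automated. The following is a minimal Python sketch, assuming the contents of /etc/sysctl.conf from each node have already been copied into two strings. The function names are hypothetical, and the parser only understands "key = value" lines, so it suits sysctl.conf but not limits.conf or pam.d files.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, ignoring blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line, skip it
        # Collapse runs of whitespace so "a  b" and "a b" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def diff_settings(old_text, new_text):
    """Return {key: (old_value, new_value)} for every key that differs."""
    old, new = parse_sysctl(old_text), parse_sysctl(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


if __name__ == "__main__":
    # Illustrative values only, not recommendations.
    existing = "kernel.sem = 250 32000 100 128\nfs.file-max = 6815744\n"
    new_node = "kernel.sem = 250 32000 100 128\nfs.file-max = 65536\n"
    for key, (old, new) in diff_settings(existing, new_node).items():
        print(f"{key}: existing={old} new={new}")
```

A key that is missing on one side is reported with None for that side, so a setting that was never added on New_Node shows up as clearly as one with a different value.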

10> ASM must be configured, and running "oracleasm listdisks" as root on New_Node should show all the disks.
Useful commands: oracleasm listdisks, oracleasm init, oracleasm scandisks

11> Configure secure shell for oracle user on all nodes

From oracle_home/oui/bin on existing Node -
./runSSHSetup.sh -user oracle -hosts "Existing_Node New_Node" -advanced -exverify

12> Verify New_Node (HWOS)

From grid_home on existing Node
$GRID_HOME/bin/cluvfy stage -post hwos -n New_Node > /u02/hwos.log

13> Verify Peer (REFNODE)

From grid_home on existing Node
$GRID_HOME/bin/cluvfy comp peer -refnode Existing_Node -n New_Node -orainv oinstall -osdba dba -verbose > /u02/comppeer.log

14> Verify New_Node (New_Node PRE)

From grid_home on existing Node
$GRID_HOME/bin/cluvfy stage -pre nodeadd -n New_Node -fixup -verbose > /u02/fixup.log

15> Extend Clusterware

Run “addNode.sh”

a) [oracle@Existing_Node bin]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={New_Node}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={New_Node-vip}"

As root on New_Node:

b) [root@New_Node oraInventory]# ./orainstRoot.sh

As root on New_Node:

c) [root@New_Node oraInventory]# cd /u01/app/11.2.0/grid/
[root@New_Node grid]# ./root.sh

If successful, the clusterware daemons, the listener, the ASM instance, etc. should be started.

d) [oracle@New_Node ~]$ crsctl check crs
e) [oracle@New_Node ~]$ crs_stat -t -v
f) [oracle@New_Node ~]$ olsnodes -n
Existing_Node 1
Existing_Node 2
New_Node 3

g) [oracle@New_Node ~]$ srvctl status asm -a
ASM is running on Existing_Node,New_Node,Existing_Node
ASM is enabled.

h) [oracle@New_Node ~]$ ocrcheck
i) [oracle@New_Node ~]$ crsctl query css votedisk

16> Verify New_Node (New_Node POST)

[oracle@Existing_Node u02]$ $GRID_HOME/bin/cluvfy stage -post nodeadd -n New_Node -verbose > /u02/clusterpost.log

17> CREATE ORACLE INSTANCE

#Run DBCA to add instance from RAC1
cd $ORACLE_HOME/bin
./dbca

Select “Oracle Real Application Clusters Database”, then “Instance Management”, and then “Add instance”.

Tuesday 11 August 2015

Resetting Listener Log without stopping Database

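It is often necessary to reset the listener log when it grows too large, for example beyond 2 GB. Because the listener keeps the log open all the time, you cannot simply delete it. The steps below point the listener at a temporary log file and then back at listener.log; while the temporary file is in use, the original listener.log is released and can be archived or deleted.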

C:\Documents and Settings\inam>lsnrctl

LSNRCTL for 32-bit Windows: Version 10.2.0.1.0 - Production on 01-AUG-2010 10:33:31

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Welcome to LSNRCTL, type "help" for information.

LSNRCTL> SET CURRENT_LISTENER LISTENER
Current Listener is LISTENER

LSNRCTL> SET LOG_FILE TEMP_LISTENER.LOG
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=srv2.domain)(PORT=1521)))
LISTENER parameter "log_file" set to temp_listener.log
The command completed successfully

LSNRCTL> SET LOG_FILE LISTENER.LOG
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=srv2.domain)(PORT=1521)))
LISTENER parameter "log_file" set to listener.log
The command completed successfully
LSNRCTL>

Create database (10gR2) manually on Windows based on ASM instance

Step 1: Set ORACLE_SID at the command prompt
C:\Documents and Settings\inam>set ORACLE_SID=ASMDB

Step 2: Create the parameter file for the database instance to be created
D:\ASMTEST\ASMDB\pfile\initASMDB.ora
control_files = +DB_DATA
undo_management = AUTO
db_name = ASMDB
db_block_size = 8192
sga_max_size = 1073741824
sga_target = 1073741824
db_create_file_dest = +DB_DATA
db_create_online_log_dest_1 = +DB_DATA

Step 3: Create a password file
C:\Documents and Settings\inam>orapwd file=D:\ASMTEST\ASMDB\pwdASMDB.ora password=oracle entries=5

Step 4: Create and start the instance
C:\Documents and Settings\inam>oradim -new -sid ASMDB -syspwd asmdb123 -pfile D:\ASMTEST\ASMDB\pfile\initASMDB.ora -startmode a
Instance created.

C:\Documents and Settings\inam>sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Tue Oct 26 12:37:05 2010

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> shutdown immediate
SQL> startup nomount pfile=D:\ASMTEST\ASMDB\pfile\initASMDB.ora
ORACLE instance started.

Total System Global Area 1073741824 bytes
Fixed Size                  1253124 bytes
Variable Size             264241404 bytes
Database Buffers          801112064 bytes
Redo Buffers                7135232 bytes
SQL>
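
As a quick sanity check on the output above, the following Python sketch confirms that the four reported components add up to the Total System Global Area, and that the sga_target of 1073741824 bytes is exactly 1 GiB. The numbers are copied from the transcript; the variable names are only illustrative.

```python
# Values copied from the "startup nomount" output above, in bytes.
total_sga = 1073741824
components = {
    "Fixed Size": 1253124,
    "Variable Size": 264241404,
    "Database Buffers": 801112064,
    "Redo Buffers": 7135232,
}

component_sum = sum(components.values())
matches_total = component_sum == total_sga
sga_in_gib = total_sga / 2**30  # 2**30 bytes = 1 GiB

print(f"sum of components: {component_sum}")
print(f"matches total:     {matches_total}")
print(f"size in GiB:       {sga_in_gib}")
```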

Step 5: Create the database

SQL> create database ASMDB
2 character set WE8ISO8859P1
3 national character set utf8
4 undo tablespace undotbs1
5 default temporary tablespace temp;

Database created.

Step 6: Run the following scripts (catalog.sql and catproc.sql as SYS; pupbld.sql is meant to be run as SYSTEM)
SQL> @D:\oracle\product\10.2.0\db_1\RDBMS\ADMIN\catalog.sql
SQL> @D:\oracle\product\10.2.0\db_1\RDBMS\ADMIN\catproc.sql
SQL> @D:\oracle\product\10.2.0\db_1\sqlplus\admin\pupbld.sql

Step 7: Test the database (some admin tasks)
Create the service entry for your newly created database in tnsnames.ora and connect via Toad or SQL*Plus for testing.
SQL> create tablespace myts_on_ASM datafile '+DB_DATA' size 200M;

Note: If you are on Linux, skip the oradim command in step 4, as there is no oradim on Linux; the other steps are the same.

Create ASM Instance (Manually) on Windows

Automatic Storage Management (ASM) is an integrated file system and volume manager expressly built for Oracle database files. ASM provides the performance of raw I/O with the easy management of a file system. It simplifies database administration by eliminating the need for you to directly manage potentially thousands of Oracle database files. It does this by enabling you to divide all available storage into disk groups. You manage a small set of disk groups and ASM automates the placement of the database files within those disk groups.

The following steps create an ASM instance on Windows. The instance can then be used when creating an Oracle database.

Step 1: Create New Partitions for Device Files
Create partitions for the device files, for example E:\ and F:\
Step 2: Create the CSS service if it does not already exist. Cluster Synchronization Services (CSS) is required to enable synchronization between an Automatic Storage Management (ASM) instance and the database instances.
Create this service by running the following batch file:
D:\oracle\product\10.2.0\db_1\BIN>localconfig add   -- it will give the following output
            Step 1: creating new OCR repository
            Successfully accumulated necessary OCR keys.
            Creating OCR keys for user 'domainName\inam', privgrp ''..
            Operation successful.
            Step 2: creating new CSS service
            successfully created local CSS service
            successfully added CSS to home
Step 3: Build the ASM candidate "disks" using asmtool.
asmtool stamps new disks on Windows for use as ASM disks. You can also use asmtoolg (the GUI version). Execute the following commands at the command prompt.
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK1 1024
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK2 1024
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK3 1024

Note: You could also use the DISKPART utility to create the virtual disks
DISKPART> create vdisk file="c:\temp\DISK1.vhd" maximum=500
100 percent completed
DiskPart successfully created the virtual disk file.

DISKPART> select vdisk file="c:\temp\DISK1.vhd"

DiskPart successfully selected the virtual disk file.

DISKPART> attach vdisk

100 percent completed

DiskPart successfully attached the virtual disk file.

DISKPART> list disk

Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
Disk 0 Online 465 GB 0 B
* Disk 1 Online 500 MB 500 MB

Initialize and format disk as RAW
Right click Computer > Manage > Storage > Disk Management
Don't assign drive letter

After this you could use the asmtool to stamp this new disk.

Step 4: Create admin directories for your new ASM instance. I created them in the following location:
D:\ASMTEST\DATABASE\admin\+ASM\bdump
D:\ASMTEST\DATABASE\admin\+ASM\cdump
D:\ASMTEST\DATABASE\admin\+ASM\hdump
D:\ASMTEST\DATABASE\admin\+ASM\pfile
D:\ASMTEST\DATABASE\admin\+ASM\udump

Step 5: Create ASM Instance Parameter File
File name: D:\ASMTEST\DATABASE\admin\+ASM\pfile\init.ora

INSTANCE_TYPE=ASM
_ASM_ALLOW_ONLY_RAW_DISKS = FALSE
DB_UNIQUE_NAME = +ASM
ASM_DISKSTRING ='D:\ASMTEST\DISK*'
LARGE_POOL_SIZE = 16M
BACKGROUND_DUMP_DEST = 'D:\ASMTEST\DATABASE\admin\+ASM\bdump'
USER_DUMP_DEST = 'D:\ASMTEST\DATABASE\admin\+ASM\udump'
CORE_DUMP_DEST = 'D:\ASMTEST\DATABASE\admin\+ASM\cdump'
ASM_DISKGROUPS='DB_DATA' ,'DB_ARCHIVELOG'

Step 6: Creating ASM Instance
D:\oracle\product\10.2.0\db_1\BIN>oradim -new -asmsid +ASM -syspwd asm123 -pfile d:\asmtest\database\admin\+ASM\pfile\init.ora -startmode a
Instance created.

Step 7: Starting the ASM Instance
D:\oracle\product\10.2.0\db_1\BIN>set ORACLE_SID=+ASM
D:\oracle\product\10.2.0\db_1\BIN>sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Tue Oct 26 10:42:13 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup nomount pfile='D:\ASMTEST\DATABASE\admin\+ASM\pfile\init.ora'
ASM instance started

Total System Global Area   88080384 bytes
Fixed Size                  1247444 bytes
Variable Size              61667116 bytes
ASM Cache                  25165824 bytes
SQL>

Step 8: Create ASM Disk Groups
Check the asm disk status
SQL> SELECT group_number, disk_number, mount_status, header_status, state, path FROM v$asm_disk
2 /

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           0           0 CLOSED CANDIDATE    NORMAL
D:\ASMTEST\DISK1
           0           2 CLOSED CANDIDATE    NORMAL
D:\ASMTEST\DISK3
           0           1 CLOSED CANDIDATE    NORMAL
D:\ASMTEST\DISK2

The value of zero in the GROUP_NUMBER column for every disk indicates that the disk is available but has not yet been assigned to a disk group.
SQL> CREATE DISKGROUP DB_DATA NORMAL REDUNDANCY FAILGROUP controller1 DISK 'D:\ASMTEST\DISK1', 'D:\ASMTEST\DISK2'
2    FAILGROUP controller2 DISK 'D:\ASMTEST\DISK3', 'D:\ASMTEST\DISK4';

Diskgroup created.

Step 9: Mount diskgroup
SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup nomount pfile='D:\ASMTEST\DATABASE\admin\+ASM\pfile\init.ora'
ASM instance started

Total System Global Area   88080384 bytes
Fixed Size                  1247444 bytes
Variable Size              61667116 bytes
ASM Cache                  25165824 bytes
SQL> SELECT group_number, disk_number, mount_status, header_status, state, path FROM v$asm_disk;
GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           0           0 CLOSED MEMBER       NORMAL
D:\ASMTEST\DISK1
           0           3 CLOSED MEMBER       NORMAL
D:\ASMTEST\DISK4
           0           2 CLOSED MEMBER       NORMAL
D:\ASMTEST\DISK3

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           0           1 CLOSED MEMBER       NORMAL
D:\ASMTEST\DISK2

SQL> alter diskgroup DB_DATA mount;

Diskgroup altered.

SQL>
SQL> SELECT group_number, disk_number, mount_status, header_status, state, path FROM v$asm_disk;

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           1           0 CACHED MEMBER       NORMAL
D:\ASMTEST\DISK1
           1           1 CACHED MEMBER       NORMAL
D:\ASMTEST\DISK2
           1           2 CACHED MEMBER       NORMAL
D:\ASMTEST\DISK3

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           1           3 CACHED MEMBER       NORMAL
D:\ASMTEST\DISK4
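
The three V$ASM_DISK listings above show the same disks moving through three states. The following Python sketch summarises how to read them, assuming only the column combinations that appear in these transcripts. The function name is hypothetical, and V$ASM_DISK has other MOUNT_STATUS and HEADER_STATUS values that this sketch does not attempt to cover.

```python
def describe_disk(group_number, mount_status, header_status):
    """Interpret the V$ASM_DISK combinations seen in the transcripts above."""
    if header_status == "CANDIDATE" and group_number == 0:
        # Before CREATE DISKGROUP: stamped, but not yet in any disk group.
        return "available, not assigned to any disk group"
    if header_status == "MEMBER" and mount_status == "CACHED":
        # After ALTER DISKGROUP ... MOUNT: group number is assigned.
        return "member of a mounted disk group"
    if header_status == "MEMBER":
        # After a restart in NOMOUNT: the header is written, group not mounted.
        return "member of a disk group that is not mounted"
    return "state not covered by this sketch"


if __name__ == "__main__":
    rows = [
        (0, "CLOSED", "CANDIDATE"),  # Step 8 output
        (0, "CLOSED", "MEMBER"),     # Step 9 output, before mounting
        (1, "CACHED", "MEMBER"),     # Step 9 output, after mounting
    ]
    for row in rows:
        print(row, "->", describe_disk(*row))
```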

Step 10: Test ASM Instance (some admin tasks)
C:\Documents and Settings\inam> sqlplus / as sysdba
SQL> ALTER DISKGROUP DB_DATA ADD DIRECTORY '+DB_DATA/my_dir';

Diskgroup altered.
ALTER DISKGROUP DB_DATA RENAME DIRECTORY '+DB_DATA/my_dir' TO '+DB_DATA/my_dir_2';

How to Delete a directory and all its contents:
ALTER DISKGROUP DB_DATA DROP DIRECTORY '+DB_DATA/my_dir_2' FORCE;

Aliases
Aliases allow you to reference ASM files using user-friendly names, rather than the fully qualified ASM filenames.

How to Create an alias using the fully qualified filename:
ALTER DISKGROUP DB_DATA ADD ALIAS '+DB_DATA/my_dir/my_file.dbf'
FOR '+DB_DATA/mydb/datafile/my_ts.342.3';

How to Create an alias using the numeric form filename:
ALTER DISKGROUP DB_DATA ADD ALIAS '+DB_DATA/my_dir/my_file.dbf'
FOR '+DB_DATA.342.3';

How to Rename an alias:
ALTER DISKGROUP DB_DATA RENAME ALIAS '+DB_DATA/my_dir/my_file.dbf'
TO '+DB_DATA/my_dir/my_file2.dbf';

How to Delete an alias:
ALTER DISKGROUP DB_DATA DROP ALIAS '+DB_DATA/my_dir/my_file.dbf';

Files
Files are not deleted automatically if they are created using aliases, as they are not Oracle Managed Files (OMF), or if a recovery is done to a point-in-time before the file was created. For these circumstances it is necessary to manually delete the files, as shown below.

How to Drop file using an alias:
ALTER DISKGROUP DB_DATA DROP FILE '+DB_DATA/my_dir/my_file.dbf';

How to Drop file using a numeric form filename:
ALTER DISKGROUP DB_DATA DROP FILE '+DB_DATA.342.3';

How to Drop file using a fully qualified filename:
ALTER DISKGROUP DB_DATA DROP FILE '+DB_DATA/mydb/datafile/my_ts.342.3';

Metadata
The internal consistency of disk group metadata can be checked in a number of ways using the CHECK clause of the ALTER DISKGROUP statement.

How to Check metadata for a specific file:
ALTER DISKGROUP DB_DATA CHECK FILE '+DB_DATA/my_dir/my_file.dbf';

How to Check metadata for a specific failure group in the disk group:
ALTER DISKGROUP DB_DATA CHECK FAILGROUP failure_group_1;

How to Check metadata for a specific disk in the disk group:
ALTER DISKGROUP DB_DATA CHECK DISK diska1;

How to Check metadata for all disks in the disk group:
ALTER DISKGROUP DB_DATA CHECK ALL;

Templates
Templates are named groups of attributes that can be applied to the files within a disk group. The following examples show how templates can be created, altered and dropped.

How to Create a new template:
ALTER DISKGROUP DB_DATA ADD TEMPLATE my_template ATTRIBUTES (MIRROR FINE);

How to Modify template:
ALTER DISKGROUP DB_DATA ALTER TEMPLATE my_template ATTRIBUTES (COARSE);

How to Drop template:
ALTER DISKGROUP DB_DATA DROP TEMPLATE my_template;

SQLNet Logging/Tracing

Net Logging and Trace
Oracle Net logging and tracing are configured in sqlnet.ora, typically found at ORACLE_HOME\network\admin.

Logging

The following parameters can be set to configure Oracle Net logging in sqlnet.ora:

Parameter	Description
LOG_DIRECTORY_CLIENT	Specifies the directory for the client log file
LOG_FILE_CLIENT	Specifies the name of the client log file
LOG_DIRECTORY_SERVER	Specifies the directory for the server log file
LOG_FILE_SERVER	Specifies the name of the server log file

By default, both the client and server log files are named sqlnet.log.

Trace

The following parameters can be set to configure Oracle Net tracing in sqlnet.ora:

Parameter	Description
TRACE_DIRECTORY_CLIENT	Specifies the directory for the client trace file
TRACE_FILE_CLIENT	Specifies the name of the client trace file
TRACE_DIRECTORY_SERVER	Specifies the directory for the server trace file
TRACE_FILE_SERVER	Specifies the name of the server trace file
TRACE_FILELEN_CLIENT	Specifies the size of each client trace file in kilobytes
TRACE_FILENO_CLIENT	Specifies the number of client trace files
TRACE_FILELEN_SERVER	Specifies the size of each server trace file in kilobytes
TRACE_FILENO_SERVER	Specifies the number of server trace files
TRACE_LEVEL_CLIENT	Specifies the level of detail for client trace
TRACE_LEVEL_SERVER	Specifies the level of detail for server trace
TRACE_TIMESTAMP_CLIENT	Includes a timestamp (to microseconds) for each event in the client trace
TRACE_TIMESTAMP_SERVER	Includes a timestamp (to microseconds) for each event in the server trace
TRACE_UNIQUE_CLIENT	Creates an individual client trace file for each process

For both the client and server trace files, the default directory is $ORACLE_HOME/network/trace.

For the client, the default trace file name is sqlnet.trc; for the server the default trace file name is svr_pid.trc

When both TRACE_FILELEN_CLIENT and TRACE_FILENO_CLIENT are set to non-zero values, the trace files are used cyclically. When one file is full, output continues in the next file; when all files are full output continues in the first file. A sequence number is included in the file name. For example if TRACE_FILE_CLIENT is client and TRACE_FILENO_CLIENT is 5 then the files will be:

client1_pid.trc

client2_pid.trc

client3_pid.trc

client4_pid.trc

client5_pid.trc
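
The cyclic naming described above can be sketched in Python as follows. This only illustrates the naming pattern from the example; the function names are hypothetical, "pid" stands in for the real process id, and the sketch treats every write as filling exactly one file, which is a simplification of how Oracle Net actually switches files.

```python
from itertools import cycle, islice


def trace_file_names(base, file_count, pid):
    """Names used when TRACE_FILE_CLIENT=base and TRACE_FILENO_CLIENT=file_count."""
    return [f"{base}{n}_{pid}.trc" for n in range(1, file_count + 1)]


def write_order(names, fills):
    """Order in which files receive output if each fill uses up one file."""
    return list(islice(cycle(names), fills))


if __name__ == "__main__":
    names = trace_file_names("client", 5, "pid")
    print(names)
    # After the fifth file is full, output wraps around to the first file.
    print(write_order(names, 7))
```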

TRACE_FILELEN_SERVER and TRACE_FILENO_SERVER work in a similar way to TRACE_FILELEN_CLIENT and TRACE_FILENO_CLIENT.

For both TRACE_LEVEL_CLIENT and TRACE_LEVEL_SERVER, the parameter can take a numeric value between 0 and 16, where 0 disables tracing and 16 is the most detailed. Alternatively, these parameters can take a named value as follows:

Name	Level	Description
OFF	0	No tracing
USER	4	Include user errors
ADMIN	6	Include administrative errors
SUPPORT	16	Include packet contents
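
The relationship between the named and numeric levels can be sketched in Python as follows. The mapping comes from the table above; the function name and the error message are hypothetical, and this is not how Oracle Net itself parses the parameter.

```python
# Named trace levels and their numeric equivalents, from the table above.
TRACE_LEVELS = {"OFF": 0, "USER": 4, "ADMIN": 6, "SUPPORT": 16}


def trace_level(value):
    """Turn a TRACE_LEVEL_CLIENT / TRACE_LEVEL_SERVER value into a number 0..16."""
    text = str(value).strip().upper()
    if text in TRACE_LEVELS:
        return TRACE_LEVELS[text]
    level = int(text)  # raises ValueError if the text is not a number
    if not 0 <= level <= 16:
        raise ValueError(f"trace level must be between 0 and 16, got {level}")
    return level


if __name__ == "__main__":
    for setting in ("OFF", "user", 6, "16"):
        print(setting, "->", trace_level(setting))
```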

Level 16 (SUPPORT) is the most detailed trace level. Take care when enabling this level of detail as it will consume disk space very rapidly. Consider using the TRACE_FILELEN_SERVER and TRACE_FILENO_SERVER parameters to reduce the impact on the server.

If TRACE_UNIQUE_CLIENT is set to ON then a separate trace file will be created for each client. The pid is appended to the file name, e.g. client_123.trc. Note that this appears to be the default behaviour in recent versions.

Example: on one of our servers, the following was set for client-side tracing:

###################TRACING/LOGGING#########################
LOG_DIRECTORY_CLIENT=D:\Oracle\product\10.1.0\Client_1\networks\Log
LOG_FILE_CLIENT=sqlnet_log
TRACE_LEVEL_CLIENT = SUPPORT
TRACE_UNIQUE_CLIENT = on
TRACE_DIRECTORY_CLIENT = D:\Oracle\product\10.1.0\Client_1\network\trace
TRACE_FILE_CLIENT = CLIENT_4_24
TRACE_TIMESTAMP_CLIENT = ON
###################################################
After the above change, start sqlplus and observe that .trc files appear in the specified trace folder for analysis and investigation.
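
A mistyped parameter name in sqlnet.ora is easy to miss. As a rough aid, the Python sketch below flags any parameter name that is not one of the logging and trace parameters listed in the tables above, so a name with a stray space such as "TRACE_TIMESTAMP_ CLIENT" would be reported. The function name is hypothetical, and the known-name list deliberately covers only the parameters discussed here; sqlnet.ora supports many others, which this sketch would flag too.

```python
# Only the logging and trace parameters described in this post.
KNOWN_PARAMETERS = {
    "LOG_DIRECTORY_CLIENT", "LOG_FILE_CLIENT",
    "LOG_DIRECTORY_SERVER", "LOG_FILE_SERVER",
    "TRACE_DIRECTORY_CLIENT", "TRACE_FILE_CLIENT",
    "TRACE_DIRECTORY_SERVER", "TRACE_FILE_SERVER",
    "TRACE_FILELEN_CLIENT", "TRACE_FILENO_CLIENT",
    "TRACE_FILELEN_SERVER", "TRACE_FILENO_SERVER",
    "TRACE_LEVEL_CLIENT", "TRACE_LEVEL_SERVER",
    "TRACE_TIMESTAMP_CLIENT", "TRACE_TIMESTAMP_SERVER",
    "TRACE_UNIQUE_CLIENT",
}


def unknown_parameters(text):
    """Return parameter names in the text that are not in KNOWN_PARAMETERS."""
    unknown = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        name = line.partition("=")[0].strip().upper()
        if name not in KNOWN_PARAMETERS:
            unknown.append(name)
    return unknown


if __name__ == "__main__":
    sample = "TRACE_LEVEL_CLIENT = SUPPORT\nTRACE_TIMESTAMP_ CLIENT = ON\n"
    print(unknown_parameters(sample))
```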

RAC on Windows: Oracle Clusterware Node Evictions a.k.a. Why do we get a Blue Screen (BSOD) Caused By Orafencedrv.sys? [ID 337784.1]

Applies to: Oracle Server - Enterprise Edition - Version 10.2.0.1 to 11.2.0.3 [Release 10.2 to 11.2]
Microsoft Windows Itanium (64-bit)
Microsoft Windows x64 (64-bit)
Microsoft Windows 2000
Microsoft Windows XP
Microsoft Windows Server 2003 (64-bit Itanium)
Microsoft Windows Server 2003 (64-bit AMD64 and Intel EM64T)
Microsoft Windows Server 2003 R2 (64-bit AMD64 and Intel EM64T)
Microsoft Windows Server 2003 R2 (32-bit)
Oracle Server Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7

Symptoms

While running RAC a blue screen is shown and a reboot takes place. Windows creates a coredump that shows that orafencedrv.sys is involved.

Changes

The following STOP code can be observed in the Blue screen:

STOP: 0x0000FFFF (0x00000000000000000000, 0x00000000000000000000, 0x00000000000000000000, 0x00000000000000000000)

Cause

When running Oracle RAC/Clusterware on Windows, the OracleCSService is SUPPOSED to reboot the OS if it detects a problem in the clusterware. The result of a CSS daemon rebooting the node will be that a blue screen will occur.

The failure is by design. Any time the OracleCSService process fails, it is designed to cause the machine to reboot; it does this by means of an IOCTL to the IOFENCE driver. This is a kernel driver, and the fault it takes is, for Windows, an unhandled exception that causes the blue screen.

Therefore, blue screens that implicate orafencedrv.sys occur by design in an Oracle RAC on Windows environment.

Note that our Clusterware / Grid Infrastructure software is designed to fence and reboot a node in either of the following two ways / cases:

1. Oracle Cluster nodes are designed to 'checkin' every second with each other in two ways:

a. They check to make sure they can each ping the other(s) on their private interconnect IP addresses. If a node does not respond to network pings on the private interconnect within the misscount interval (which defaults to 30 seconds), then the Oracle Cluster Synchronization Services Daemon (OCSSD) instructs our Oracle Fence Driver (orafencedrv.sys) to evict the unresponsive node from the cluster.

b. The nodes also check in with each other every second with regard to whether or not each node can read and write to the voting disk. The timeouts for this event differ depending on the activity of the clusterware at that moment. What we deem to be a 'short disk timeout' is 60 seconds (this applies when the cluster is under normal operation) while what we deem to be a 'long disk timeout' is 200 seconds (this applies when the cluster is undergoing reconfiguration at the time when the node fails to meet its 'voting disk' check-in). In any case, the action we take is the same: the Oracle Cluster Synchronization Services Daemon (OCSSD) instructs our Oracle Fence Driver (orafencedrv.sys) to evict the unresponsive node from the cluster.

2. OraFence has a built-in mechanism to check that it was scheduled in time. If it is not scheduled within 5 seconds it will also reboot the node. In this way, OraFence is designed to fence and reboot a node if it perceives that a given node is 'hung' once its own timeout has been reached. Note that the default timeout for the OraFence driver is a (very low) 0x05 (5 seconds). What this means is that if the OraFence driver detects what it perceives to be a hang, for example at the operating system level, and that hang persists beyond 5 seconds, it's possible that the OraFence driver - of its own accord - will fence and evict the node.
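
The timeouts described in cases 1 and 2 can be put side by side in a small Python sketch. The threshold values are the defaults quoted in this note; the function and parameter names are hypothetical, and the logic is a simplified model for reasoning about the thresholds, not the actual OCSSD or OraFence algorithm.

```python
# Default thresholds quoted in the note above, in seconds.
NETWORK_TIMEOUT_S = 30      # missed private-interconnect pings
SHORT_DISK_TIMEOUT_S = 60   # voting disk, cluster under normal operation
LONG_DISK_TIMEOUT_S = 200   # voting disk, cluster undergoing reconfiguration
ORAFENCE_TIMEOUT_S = 5      # OraFence not scheduled in time


def eviction_reasons(network_silence_s, disk_silence_s, fence_delay_s,
                     reconfiguring=False):
    """List which of the described timeouts have been exceeded."""
    reasons = []
    if network_silence_s > NETWORK_TIMEOUT_S:
        reasons.append("case 1a: missed network check-ins")
    disk_limit = LONG_DISK_TIMEOUT_S if reconfiguring else SHORT_DISK_TIMEOUT_S
    if disk_silence_s > disk_limit:
        reasons.append("case 1b: missed voting disk check-ins")
    if fence_delay_s > ORAFENCE_TIMEOUT_S:
        reasons.append("case 2: OraFence not scheduled in time")
    return reasons


if __name__ == "__main__":
    # A 90-second voting disk stall evicts in normal operation...
    print(eviction_reasons(0, 90, 0))
    # ...but is still within the long timeout during reconfiguration.
    print(eviction_reasons(0, 90, 0, reconfiguring=True))
```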

Solution

So the question is not why does the blue screen occur, but why is the OracleCSService process failing (node eviction) or why OraFenceService was not scheduled.

The first step in diagnosing any BSOD that implicates the OraFence driver is to determine which type (1 or 2) of eviction / reboot your cluster has experienced.

To that end, please start with the Windows System Event Viewer log (if you intend to upload this log to support, please save it off as a .TXT file before doing so), and check for the 'bugcheck' or 'stop code' that was reported when the node was brought down. If that bugcheck code shows:

1. 0x0000FFFF then that means that you have experienced the first type of eviction explained here above: and you should look to your Oracle Clusterware alert log file ($GRID_HOME\log\) and your ocssd.log files for answers. More than likely you'll see that there was a series of missed checkins that led up to the eviction occurrence and this almost always indicates an underlying network issue (for example: faulty cables, cards, drivers, or ports on the network switch) or other OS resource issue / contention whereby the network is not able to respond to the checkin as it is 'busy' with other resource intensive operations and/or cannot get CPU to respond to the network checkin. In both cases, the root cause really needs to be sought at the OS / System Administration level.

10.1: %CRS_HOME%\css\log\ocssd.log
10.2: %CRS_HOME%\log\alert.log AND %CRS_HOME%\log\\cssd\ocssd.log
11.1: %CRS_HOME%\log\alert.log AND %CRS_HOME%\log\\cssd\ocssd.log
11.2: %CRS_HOME%\log\alert.log AND %GI_HOME%\log\\cssd\ocssd.log

2. 0x0000FFFE then that means that you have experienced the second type of eviction explained here above: and you should look at whether or not your node was under heavy load / is truly hanging from time to time for any reason - and/or - look at increasing the default orafencetimeout value - again - we have found that 5 seconds is a very aggressive timeout value and can safely be adjusted upwards. This is controlled with the following registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OraFenceService\
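
The triage rule above, which maps the stop code to the type of eviction, can be sketched in Python as follows. The two codes and their meanings come from this note; the function name and the wording of the returned hints are hypothetical.

```python
def eviction_type(stop_code):
    """Map an OraFence-related bugcheck code to the eviction type in this note.

    stop_code may be an int or a hex string such as "0x0000FFFF".
    """
    code = int(stop_code, 16) if isinstance(stop_code, str) else stop_code
    if code == 0x0000FFFF:
        return "type 1: CSS eviction, check the clusterware alert log and ocssd.log"
    if code == 0x0000FFFE:
        return "type 2: OraFence timeout, check node load and the OraFence timeout"
    return "not one of the two codes described in this note"


if __name__ == "__main__":
    for code in ("0x0000FFFF", "0x0000FFFE", 0x0000007E):
        print(code, "->", eviction_type(code))
```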

Please be sure to configure your server(s) to automatically reboot on a Bug Check/System Failure event, otherwise, you will see a blue screen without any further activity (the node will not actually reboot). To that end, please check this setting by going to

Control Panel -> System -> Advanced system settings -> 'Advanced' tab -> 'Startup and Recovery' Settings -> "System failure" -> select "Automatically restart"

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As a normal function of our Oracle Clusterware / Grid Infrastructure, OraFenceService is designed to fence (I/O) and reboot a node if it perceives that
node is 'hung' once its configured timeout has been reached. The default timeout for the OraFence driver is a (very low) 5 seconds.
What this means is that if the OraFence driver detects what it perceives to be a hang at the operating system level and that hang persists beyond 5 seconds,
it's possible that the OraFence driver - of its own accord - will fence and evict the node.
It is advisable in some cases to increase the OraFence timeout value to as high as 10 seconds.
The OraFence timeout is controlled by the following Windows registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OraFenceService\Timeout.
Note that modification of the OraFenceService timeout value requires a node reboot.

Please increase the OraFence timeout from the default of 5 seconds to a value of 10 seconds and let us know whether you encounter the issue again.

~~~~~~~~~~~~~~~~~~~~~~