Wednesday, 2 September 2015

Important Logs for Investigating Cluster Issues

Collect the following information from ALL nodes in order to investigate any Cluster HAIP issue (a collection script sketch follows the checklist):

0. Ping all the SCAN IPs, tnsping the listener, and check whether connections resolve correctly through the SCAN name.
1. Upload the endpoints_listener.ora, listener.ora, tnsnames.ora and sqlnet.ora files you are using.
2. Provide the output of the following (from all nodes):
$GRID_HOME/bin/crsctl stat res -t
$GRID_HOME/bin/crsctl stat res -t -init
srvctl config
$GRID_HOME/bin/srvctl config scan
$GRID_HOME/bin/srvctl config scan_listener
$GRID_HOME/bin/srvctl config listener
$GRID_HOME/bin/lsnrctl status LISTENER_SCAN1
$GRID_HOME/bin/lsnrctl status LISTENER_SCAN2
$GRID_HOME/bin/lsnrctl status LISTENER_SCAN3
$GRID_HOME/bin/lsnrctl service LISTENER_SCAN1
$GRID_HOME/bin/lsnrctl service LISTENER_SCAN2
$GRID_HOME/bin/lsnrctl service LISTENER_SCAN3
$GRID_HOME/bin/lsnrctl status LISTENER
$GRID_HOME/bin/lsnrctl service LISTENER
From a SQL*Plus connection to each instance:
show parameter remote_listener
show parameter local_listener

3. Upload local and scan_listener logs from all nodes.
a) listener.log

b) Please find the location for SCAN listeners and upload the files

=>  $GRID_HOME/log/diag/tnslsnr/<hostname>/listener_scan1/trace/listener_scan1.log

    $GRID_HOME/log/diag/tnslsnr/<hostname>/listener_scan2/trace/listener_scan2.log

    $GRID_HOME/log/diag/tnslsnr/<hostname>/listener_scan3/trace/listener_scan3.log

4. cat /etc/hosts
5. CRS Logs:
    1. Clusterware alert log in $GRID_HOME/log/<hostname>/
    2. The cssdagent log(s) in $GRID_HOME/log/<hostname>/agent/ohasd/oracssdagent_root
    3. The cssdmonitor log(s) in $GRID_HOME/log/<hostname>/agent/ohasd/oracssdmonitor_root
    4. The ocssd log(s) in $GRID_HOME/log/<hostname>/cssd
    5. The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
    6. IPD/OS or OS Watcher data
    7. 'opatch lsinventory -detail' output for the GRID home
    8. *Messages files:

* Messages file locations:

    Linux: /var/log/messages
    Sun: /var/adm/messages
    HP-UX: /var/adm/syslog/syslog.log
    IBM: /bin/errpt -a > messages.out
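
Where it helps, the checklist above can be scripted. The following is a minimal bash sketch (not an official tool) that gathers the crsctl/srvctl/lsnrctl output requested in item 2 plus /etc/hosts into one file per node; it assumes GRID_HOME is set in the grid owner's environment and that the listener names match the ones listed above.

#!/bin/bash
# Collection sketch: run as the grid owner on EACH node.
# Assumes $GRID_HOME is set and the listener names below match your configuration.
OUT=/tmp/cluster_diag_$(hostname -s)_$(date +%Y%m%d%H%M).log
{
  echo "### crsctl stat res -t";          $GRID_HOME/bin/crsctl stat res -t
  echo "### crsctl stat res -t -init";    $GRID_HOME/bin/crsctl stat res -t -init
  echo "### srvctl config scan";          $GRID_HOME/bin/srvctl config scan
  echo "### srvctl config scan_listener"; $GRID_HOME/bin/srvctl config scan_listener
  echo "### srvctl config listener";      $GRID_HOME/bin/srvctl config listener
  for l in LISTENER LISTENER_SCAN1 LISTENER_SCAN2 LISTENER_SCAN3; do
    echo "### lsnrctl status $l";   $GRID_HOME/bin/lsnrctl status $l
    echo "### lsnrctl services $l"; $GRID_HOME/bin/lsnrctl services $l
  done
  echo "### /etc/hosts"; cat /etc/hosts
} > "$OUT" 2>&1
echo "Diagnostics written to $OUT"

Upload the resulting file from every node together with the listener logs and messages files listed in items 3-5.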

Saturday, 29 August 2015

Add Node to an Existing Cluster - Steps

1> Compare the mentioned file in the new and old environment: "/etc/sysctl.conf" file.
2> Compare the mentioned file in the new and old environment: "/etc/security/limits.conf" file.
3> Compare the mentioned file in the new and old environment "/etc/pam.d/login" file.
4> Create the necessary oracle groups and users.
5> Check that NTP is configured and running: service ntpd status (and chkconfig --list ntpd)
6> Create the .profile files setting the environment for the grid and oracle users (compare the mentioned files in the new and old environment; a comparison sketch follows the file list below):
"/home/oracle/.bash_profile" file and "/home/grid/.bash_profile"
"/home/oracle/grid_env" and "/home/grid/grid_env"
"/home/oracle/db_env" and "/home/grid/db_env"
7> uname -rm
8> Install all necessary ASM libraries and packages.
9> Ping all the IPs of the New_Node
ping -c 3 112-rac1
ping -c 3 112-rac1-priv
ping -c 3 112-rac1-vip

10> ASM must be configured and "#oracleasm listdisks" command on the New_Node should show all the disks.
A few command references: oracleasm init, oracleasm scandisks, oracleasm listdisks

11> Configure secure shell for oracle user on all nodes

From oracle_home/oui/bin on existing Node -
./runSSHSetup.sh -user oracle -hosts "Existing_Node New_Node" -advanced -exverify

12> Verify New_Node (HWOS)

From grid_home on existing Node
$GRID_HOME/bin/cluvfy stage -post hwos -n New_Node > /u02/hwos.log

13> Verify Peer (REFNODE)

From grid_home on existing Node
$GRID_HOME/bin/cluvfy comp peer -refnode Existing_Node -n New_Node -orainv oinstall -osdba dba -verbose > /u02/comppeer.log

14> Verify New_Node (New_Node PRE)

From grid_home on existing Node
$GRID_HOME/bin/cluvfy stage -pre nodeadd -n New_Node -fixup -verbose > /u02/fixup.log

15> Extend Clusterware

Run “addNode.sh”

a) [oracle@existing Node bin]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={New_Node}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={New_Node-vip}"
From root user on New_Node :::

b) [root@New_Node oraInventory]# ./orainstRoot.sh
From root user on New_Node :::

c) [root@New_Node oraInventory]# cd /u01/app/11.2.0/grid/
[root@New_Node grid]# ./root.sh

If successful, clusterware daemons, the listener, the ASM instance, etc. should be started

d) [oracle@New_Node ~]$ crsctl check crs
e) [oracle@New_Node ~]$ crs_stat -t -v
f) [oracle@New_Node ~]$ olsnodes -n
Existing Node  1
Existing_Node 2
New_Node  3

g)[oracle@New_Node ~]$ srvctl status asm -a
ASM is running on Existing_Node,New_Node,existing Node
ASM is enabled.

h) [oracle@New_Node ~]$ ocrcheck
i) [oracle@New_Node ~]$ crsctl query css votedisk

16) Verify New_Node (New_Node POST)

[oracle@existing Node u02]$ $GRID_HOME/bin/cluvfy stage -post nodeadd -n New_Node -verbose > /u02/clusterpost.log




17) CREATE ORACLE INSTANCE


#Run DBCA to add instance from RAC1
cd $ORACLE_HOME/bin
./dbca


Choose “Oracle Real Application Clusters database”, then “Instance Management”, then “Add an instance”.
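
If you prefer the command line to the DBCA GUI, the same step can be done in silent mode. A sketch follows; the database name RACDB, instance name RACDB3, node name New_Node and the SYS password are placeholders to adjust for your environment.

# Silent-mode alternative to the DBCA GUI (all names and the password below are placeholders).
cd $ORACLE_HOME/bin
./dbca -silent -addInstance \
  -nodeList New_Node \
  -gdbName RACDB \
  -instanceName RACDB3 \
  -sysDBAUserName sys -sysDBAPassword change_me

After DBCA completes, srvctl status database -d RACDB should show the new instance running on New_Node.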

Tuesday, 11 August 2015

Resetting Listener Log without stopping Database

It is often necessary to reset the listener log when it grows too large (for example, beyond 2 GB). Because the listener keeps the file open all the time, you cannot simply delete it. By doing the following you can reset the listener log without stopping the listener or the database.

C:\Documents and Settings\inam>lsnrctl

LSNRCTL for 32-bit Windows: Version 10.2.0.1.0 - Production on 01-AUG-2010 10:33:31

Copyright (c) 1991, 2005, Oracle.  All rights reserved.

Welcome to LSNRCTL, type "help" for information.

LSNRCTL> SET CURRENT_LISTENER LISTENER
Current Listener is LISTENER

LSNRCTL> SET LOG_FILE TEMP_LISTENER.LOG
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=srv2.domain)(PORT=1521)))
LISTENER parameter "log_file" set to temp_listener.log
The command completed successfully

LSNRCTL> SET LOG_FILE LISTENER.LOG
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=srv2.domain)(PORT=1521)))
LISTENER parameter "log_file" set to listener.log
The command completed successfully
LSNRCTL>
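
The point of switching log_file to a temporary name is that, while the listener writes to temp_listener.log, the original listener.log is no longer held open and can be renamed or archived from the operating system before you switch back. A Windows sketch (the path is the 10g default; adjust it if your listener log lives elsewhere):

REM Run between the two SET LOG_FILE commands, while temp_listener.log is active.
cd /d %ORACLE_HOME%\network\log
ren listener.log listener_2010_08_01.log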

Create database (10gR2) manually on Windows based on ASM instance

Step 1: Set ORACLE_SID at the DOS prompt
C:\Documents and Settings\inam>set ORACLE_SID=ASMDB

Step 2: Create a parameter file for the database instance to be created
D:\ASMTEST\ASMDB\pfile\initASMDB.ora
control_files = +DB_DATA
undo_management = AUTO
db_name = ASMDB
db_block_size = 8192
sga_max_size = 1073741824
sga_target = 1073741824
db_create_file_dest = +DB_DATA
db_create_online_log_dest_1 = +DB_DATA

Step 3: Create a password file
C:\Documents and Settings\inam>orapwd file=D:\ASMTEST\ASMDB\pwdASMDB.ora password=oracle entries=5

Step 4: create/Start the instance
C:\Documents and Settings\inam>oradim -new -sid ASMDB -syspwd asmdb123 -pfile D:\ASMTEST\ASMDB\pfile\initASMDB.ora -startmode a
Instance created.


C:\Documents and Settings\inam>sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Tue Oct 26 12:37:05 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options

 SQL> shutdown immediate
 SQL> startup nomount pfile=D:\ASMTEST\ASMDB\pfile\initASMDB.ora
ORACLE instance started.

Total System Global Area 1073741824 bytes
Fixed Size                  1253124 bytes
Variable Size             264241404 bytes
Database Buffers          801112064 bytes
Redo Buffers                7135232 bytes
SQL>


Step 5: Create the database

SQL> create database ASMDB
  2  character set WE8ISO8859P1
  3  national character set utf8
  4  undo tablespace undotbs1
  5  default temporary tablespace temp;

Database created.

Step 6: Run following scripts
SQL> @D:\oracle\product\10.2.0\db_1\RDBMS\ADMIN\catalog.sql
SQL> @D:\oracle\product\10.2.0\db_1\RDBMS\ADMIN\catproc.sql
SQL> @D:\oracle\product\10.2.0\db_1\sqlplus\admin\pupbld.sql


Step 7: Test DB ( some admin task)
Create the service for your newly created database in tnsnames.ora and access it via TOAD or SQL*Plus for your testing.
SQL> create tablespace myts_on_ASM datafile '+DB_DATA' size 200M
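
For the tnsnames.ora service mentioned above, a minimal entry might look like the following sketch (the host name and port are placeholders):

ASMDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = your_host_name)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ASMDB)
    )
  )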

Note: if you are on Linux, skip step 4 as there is no oradim on Linux; the other steps are the same.

Create ASM Instance(Manually) on Windows

Automatic Storage Management (ASM) is an integrated file system and volume manager expressly built for Oracle database files. ASM provides the performance of raw I/O with the easy management of a file system. It simplifies database administration by eliminating the need for you to directly manage potentially thousands of Oracle database files. It does this by enabling you to divide all available storage into disk groups. You manage a small set of disk groups and ASM automates the placement of the database files within those disk groups.


With the following steps you can create an ASM instance on Windows. Afterwards this instance can be used when creating the Oracle database.


Step 1:  Create New Partition for Device Files
Create partitions for device files like E:\, F:\
Step 2: Create CSS service if it is not there; Cluster Synchronization Services (CSS) is required to enable synchronization between an Automatic Storage Management (ASM) instance and the database instances.
Create this service by running the following batch file
D:\oracle\product\10.2.0\db_1\BIN>localconfig add   -- it will give the following output
            Step 1:  creating new OCR repository
            Successfully accumulated necessary OCR keys.
            Creating OCR keys for user 'domainName\inam', privgrp ''..
            Operation successful.
            Step 2:  creating new CSS service
            successfully created local CSS service
            successfully added CSS to home
Step 3: Building the ASM Candidate "disks" using asmtool.
asmtool helps to stamp the new disks on Windows for use as ASM disks. You can also use asmtoolg (the GUI version). Execute the following commands at the command prompt.
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK1 1024
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK2 1024
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK3 1024
D:\oracle\product\10.2.0\db_1\BIN>asmtool -create d:\ASMTEST\DISK4 1024

Note: You could use the DISKPART utility also to create the virtual disks
DISKPART> create vdisk file="c:\temp\DISK1.vhd" maximum 500
  100 percent completed
DiskPart successfully created the virtual disk file.

DISKPART> select vdisk file="c:\temp\DISK1.vhd"

DiskPart successfully selected the virtual disk file.

DISKPART> attach vdisk

  100 percent completed

DiskPart successfully attached the virtual disk file.

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          465 GB      0 B
* Disk 1    Online          500 MB   500 MB

Initialize and format disk as RAW
Right click Computer > Manage > Storage > Disk Management
Don't assign drive letter

After this you could use the asmtool to stamp this new disk.

Step 4: Create admin directories for your new ASM instance; I have created them in the following locations:
D:\ASMTEST\DATABASE\admin\+ASM\bdump
D:\ASMTEST\DATABASE\admin\+ASM\cdump
D:\ASMTEST\DATABASE\admin\+ASM\hdump
D:\ASMTEST\DATABASE\admin\+ASM\pfile
D:\ASMTEST\DATABASE\admin\+ASM\udump

Step 5: Create ASM Instance Parameter File
Filename: D:\ASMTEST\DATABASE\admin\+ASM\pfile\init.ora

INSTANCE_TYPE=ASM
_ASM_ALLOW_ONLY_RAW_DISKS = FALSE
DB_UNIQUE_NAME = +ASM
ASM_DISKSTRING ='D:\ASMTEST\DISK*'
LARGE_POOL_SIZE = 16M
BACKGROUND_DUMP_DEST = 'D:\ASMTEST\DATABASE\admin\+ASM\bdump'
USER_DUMP_DEST = 'D:\ASMTEST\DATABASE\admin\+ASM\udump'
CORE_DUMP_DEST = 'D:\ASMTEST\DATABASE\admin\+ASM\cdump'
ASM_DISKGROUPS='DB_DATA' ,'DB_ARCHIVELOG'

Step 6: Creating ASM Instance
D:\oracle\product\10.2.0\db_1\BIN>oradim -new -asmsid +ASM -syspwd asm123 -pfile d:\asmtest\database\admin\+ASM\pfile\init.ora -startmode a
Instance created.

Step 7: Starting the ASM Instance
D:\oracle\product\10.2.0\db_1\BIN>set ORACLE_SID=+ASM
D:\oracle\product\10.2.0\db_1\BIN>sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Tue Oct 26 10:42:13 2010
Copyright (c) 1982, 2005, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> startup nomount pfile='D:\ASMTEST\DATABASE\admin\+ASM\pfile\init.ora'
ASM instance started

Total System Global Area   88080384 bytes
Fixed Size                  1247444 bytes
Variable Size              61667116 bytes
ASM Cache                  25165824 bytes
SQL>

Step 8: Create ASM Disk Groups
Check the asm disk status
SQL> SELECT group_number, disk_number, mount_status, header_status, state, path FROM v$asm_disk
  2  /

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           0           0 CLOSED  CANDIDATE    NORMAL
D:\ASMTEST\DISK1
           0           2 CLOSED  CANDIDATE    NORMAL
D:\ASMTEST\DISK3
           0           1 CLOSED  CANDIDATE    NORMAL
D:\ASMTEST\DISK2

A value of zero in the GROUP_NUMBER column indicates that a disk is available but has not yet been assigned to a disk group.
 SQL> CREATE DISKGROUP DB_DATA NORMAL REDUNDANCY FAILGROUP controller1 DISK 'D:\ASMTEST\DISK1', 'D:\ASMTEST\DISK2'
  2    FAILGROUP controller2 DISK 'D:\ASMTEST\DISK3', 'D:\ASMTEST\DISK4';

Diskgroup created.

Step 9: Mount diskgroup
SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup nomount pfile='D:\ASMTEST\DATABASE\admin\+ASM\pfile\init.ora'
ASM instance started

Total System Global Area   88080384 bytes
Fixed Size                  1247444 bytes
Variable Size              61667116 bytes
ASM Cache                  25165824 bytes
SQL> SELECT group_number, disk_number, mount_status, header_status, state, path FROM v$asm_disk;
GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           0           0 CLOSED  MEMBER       NORMAL
D:\ASMTEST\DISK1
           0           3 CLOSED  MEMBER       NORMAL
D:\ASMTEST\DISK4
           0           2 CLOSED  MEMBER       NORMAL
D:\ASMTEST\DISK3

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           0           1 CLOSED  MEMBER       NORMAL
D:\ASMTEST\DISK2

SQL> alter diskgroup DB_DATA mount;

Diskgroup altered.

SQL>
SQL> SELECT group_number, disk_number, mount_status, header_status, state, path FROM v$asm_disk;

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           1           0 CACHED  MEMBER       NORMAL
D:\ASMTEST\DISK1
           1           1 CACHED  MEMBER       NORMAL
D:\ASMTEST\DISK2
          1           2 CACHED  MEMBER       NORMAL
D:\ASMTEST\DISK3

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU STATE
------------ ----------- ------- ------------ --------
PATH
--------------------------------------------------------------------------------
           1           3 CACHED  MEMBER       NORMAL
D:\ASMTEST\DISK4

Step 10: Test ASM Instance (some admin tasks)
C:\Documents and Settings\inam> sqlplus / as sysdba
SQL> ALTER DISKGROUP DB_DATA ADD DIRECTORY '+DB_DATA/my_dir';

Diskgroup altered.
ALTER DISKGROUP DB_DATA RENAME DIRECTORY '+DB_DATA/my_dir' TO '+DB_DATA/my_dir_2';

How to Delete a directory and all its contents:
ALTER DISKGROUP DB_DATA DROP DIRECTORY '+DB_DATA/my_dir_2' FORCE;

Aliases
Aliases allow you to reference ASM files using user-friendly names, rather than the fully qualified ASM filenames.

How to Create an alias using the fully qualified filename:
ALTER DISKGROUP DB_DATA ADD ALIAS '+DB_DATA/my_dir/my_file.dbf'
FOR '+DB_DATA/mydb/datafile/my_ts.342.3';

How to Create an alias using the numeric form filename:
ALTER DISKGROUP DB_DATA ADD ALIAS '+DB_DATA/my_dir/my_file.dbf'
FOR '+DB_DATA.342.3';

How to Rename an alias:
ALTER DISKGROUP DB_DATA RENAME ALIAS '+DB_DATA/my_dir/my_file.dbf'
TO '+DB_DATA/my_dir/my_file2.dbf';

How to Delete an alias:
ALTER DISKGROUP DB_DATA DELETE ALIAS '+DB_DATA/my_dir/my_file.dbf';

Files
Files are not deleted automatically if they are created using aliases, as they are not Oracle Managed Files (OMF), or if a recovery is done to a point-in-time before the file was created. For these circumstances it is necessary to manually delete the files, as shown below.

How to Drop file using an alias:
ALTER DISKGROUP DB_DATA DROP FILE '+DB_DATA/my_dir/my_file.dbf';

How to Drop file using a numeric form filename:
ALTER DISKGROUP DB_DATA DROP FILE '+DB_DATA.342.3';

How to Drop file using a fully qualified filename:
ALTER DISKGROUP DB_DATA DROP FILE '+DB_DATA/mydb/datafile/my_ts.342.3';

Metadata
The internal consistency of disk group metadata can be checked in a number of ways using the CHECK clause of the ALTER DISKGROUP statement.

How to Check metadata for a specific file:
ALTER DISKGROUP DB_DATA CHECK FILE '+DB_DATA/my_dir/my_file.dbf';

How to Check metadata for a specific failure group in the disk group:
ALTER DISKGROUP DB_DATA CHECK FAILGROUP failure_group_1;

How to Check metadata for a specific disk in the disk group:
ALTER DISKGROUP DB_DATA CHECK DISK diska1;

How to Check metadata for all disks in the disk group:
ALTER DISKGROUP DB_DATA CHECK ALL;

Templates
Templates are named groups of attributes that can be applied to the files within a disk group. The following examples show how templates can be created, altered and dropped.

How to Create a new template:
ALTER DISKGROUP DB_DATA ADD TEMPLATE my_template ATTRIBUTES (MIRROR FINE);

How to Modify template:
ALTER DISKGROUP DB_DATA ALTER TEMPLATE my_template ATTRIBUTES (COARSE);

How to Drop template.
ALTER DISKGROUP DB_DATA DROP TEMPLATE my_template;

SQLNet Logging/Tracing

Net Logging and Trace
Oracle Net logging and tracing are configured in sqlnet.ora, typically found at ORACLE_HOME\network\admin.

Logging

The following parameters can be set to configure Oracle Net logging in sqlnet.ora:

Parameter              Description
LOG_DIRECTORY_CLIENT   Specifies the directory for the client log file
LOG_FILE_CLIENT        Specifies the name of the client log file
LOG_DIRECTORY_SERVER   Specifies the directory for the server log file
LOG_FILE_SERVER        Specifies the name of the server log file

By default, both the client and server log file names default to sqlnet.log.

Trace

The following parameters can be set to configure Oracle Net tracing in sqlnet.ora:

Parameter                Description
TRACE_DIRECTORY_CLIENT   Specifies the directory for the client trace file
TRACE_FILE_CLIENT        Specifies the name of the client trace file
TRACE_DIRECTORY_SERVER   Specifies the directory for the server trace file
TRACE_FILE_SERVER        Specifies the name of the server trace file
TRACE_FILELEN_CLIENT     Specifies the size of each client trace file in kilobytes
TRACE_FILENO_CLIENT      Specifies the number of client trace files
TRACE_FILELEN_SERVER     Specifies the size of each server trace file in kilobytes
TRACE_FILENO_SERVER      Specifies the number of server trace files
TRACE_LEVEL_CLIENT       Specifies the level of detail for the client trace
TRACE_LEVEL_SERVER       Specifies the level of detail for the server trace
TRACE_TIMESTAMP_CLIENT   Includes a timestamp (to microseconds) for each event in the client trace
TRACE_TIMESTAMP_SERVER   Includes a timestamp (to microseconds) for each event in the server trace
TRACE_UNIQUE_CLIENT      Creates an individual client trace file for each process
For both the client and server trace files, the default directory is $ORACLE_HOME/network/trace.
For the client, the default trace file name is sqlnet.trc; for the server the default trace file name is svr_pid.trc


When both TRACE_FILELEN_CLIENT and TRACE_FILENO_CLIENT are set to non-zero values, the trace files are used cyclically. When one file is full, output continues in the next file; when all files are full output continues in the first file. A sequence number is included in the file name. For example if TRACE_FILE_CLIENT is client and TRACE_FILENO_CLIENT is 5 then the files will be:
client1_pid.trc
client2_pid.trc
client3_pid.trc
client4_pid.trc
client5_pid.trc
TRACE_FILELEN_SERVER and TRACE_FILENO_SERVER work in a similar way to TRACE_FILELEN_CLIENT and TRACE_FILENO_CLIENT.
For both TRACE_LEVEL_CLIENT and TRACE_LEVEL_SERVER, the parameter can take a numeric value between 0 and 16, where 0 is disabled and 16 is the most detailed. Alternatively these parameters can also take one of the following values:

Value     Level   Description
OFF       0       No tracing
USER      4       Include user errors
ADMIN     6       Include administrative errors
SUPPORT   16      Include packet contents

Level 16 (SUPPORT) is the most detailed trace level. Take care when enabling this level of detail as it will consume disk space very rapidly; consider using the TRACE_FILELEN_SERVER and TRACE_FILENO_SERVER parameters to reduce the impact on the server. If TRACE_UNIQUE_CLIENT is set to ON then a separate trace file will be created for each client; the pid is appended to the file name, e.g. client_123.trc. Note that this appears to be the default behaviour in recent versions.

Example: on one of our servers the following was configured for client-side tracing:

###################TRACING/LOGGING#########################
LOG_DIRECTORY_CLIENT=D:\Oracle\product\10.1.0\Client_1\networks\Log
LOG_FILE_CLIENT=sqlnet_log
TRACE_LEVEL_CLIENT = SUPPORT
TRACE_UNIQUE_CLIENT = on
TRACE_DIRECTORY_CLIENT = D:\Oracle\product\10.1.0\Client_1\network\trace
TRACE_FILE_CLIENT = CLIENT_4_24
TRACE_TIMESTAMP_CLIENT = ON
###################################################
After the above-mentioned change, start SQL*Plus and observe that .trc files appear in the specified trace folder for analysis and investigation.
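
For completeness, a corresponding server-side tracing block might look like the sketch below; the directories are placeholders, and level 16 grows very quickly, so remove or lower it once the problem has been reproduced.

# Server-side tracing sketch for sqlnet.ora (directory paths are placeholders)
TRACE_LEVEL_SERVER = 16
TRACE_DIRECTORY_SERVER = /u01/app/oracle/network/trace
TRACE_FILELEN_SERVER = 10240
TRACE_FILENO_SERVER = 5
TRACE_TIMESTAMP_SERVER = ON
LOG_DIRECTORY_SERVER = /u01/app/oracle/network/log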

RAC on Windows: Oracle Clusterware Node Evictions a.k.a. Why do we get a Blue Screen (BSOD) Caused By Orafencedrv.sys? [ID 337784.1]

Applies to: Oracle Server - Enterprise Edition - Version 10.2.0.1 to 11.2.0.3 [Release 10.2 to 11.2]
Microsoft Windows Itanium (64-bit)
Microsoft Windows x64 (64-bit)
Microsoft Windows 2000
Microsoft Windows XP
Microsoft Windows Server 2003 (64-bit Itanium)
Microsoft Windows Server 2003 (64-bit AMD64 and Intel EM64T)
Microsoft Windows Server 2003 R2 (64-bit AMD64 and Intel EM64T)
Microsoft Windows Server 2003 R2 (32-bit)
Oracle Server Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7

Symptoms

While running RAC a blue screen is shown and a reboot takes place. Windows creates a coredump that shows that orafencedrv.sys is involved.

Changes

The following STOP code can be observed in the Blue screen:
STOP: 0x0000FFFF (0x00000000000000000000, 0x00000000000000000000, 0x00000000000000000000, 0x00000000000000000000)

Cause

When running  Oracle RAC/Clusterware on Windows, the OracleCSService is SUPPOSED to reboot the OS if it detects a problem in the clusterware. The result of a CSS daemon rebooting the node will be that a blue screen will occur.

The failure is as per design. Anytime that the OracleCSService process fails, it is designed to cause the machine to reboot; it does this by means of an IOCTL to the IOFENCE driver, which is a kernel driver that raises a fault. On Windows this is an unhandled exception that causes the blue screen.

Therefore, blue screens that implicate orafencedrv.sys occur by design in an Oracle RAC on Windows environment. 

Note that our Clusterware / Grid Infrastructure  software is designed to fence and reboot a node in either of the following two ways / cases:

1. Oracle Cluster nodes are designed to 'checkin' every second with each other in two ways:
a. They check to make sure they can each ping the other(s) on their private interconnect IP addresses. If a node does not respond to network pings on the private interconnect within the CSS misscount interval (which defaults to 30 seconds), then the Oracle Cluster Synchronization Services Daemon (OCSSD) instructs our Oracle Fence Driver (orafencedrv.sys) to evict the unresponsive node from the cluster.
b. The nodes also checkin with each other every second with regard to whether or not each node can read and write to the voting disk. The timeouts for this event are different depending on the activity of the clusterware at that moment. What we deem to be a 'short disk timeout' is 60 seconds (this happens when the cluster is under normal operation) while what we deem to be a 'long disk timeout' is 200 seconds (this happens when the cluster is undergoing reconfiguration at the time when the node fails to meet its 'voting disk' checkin). In any case, the action we take is the same: the Oracle Cluster Synchronization Services Daemon (OCSSD) instructs our Oracle Fence Driver (orafencedrv.sys) to evict the unresponsive node from the cluster.

2. OraFence has a built-in mechanism to check it was scheduled in time. If it is not scheduled within 5 seconds it will also reboot the node. In this way, OraFence is designed to fence and reboot a node if it perceives that a given node is 'hung' once its own timeout has been reached. Note that the default timeout for the OraFence driver is a (very low) 0x05 (5 seconds). What this means is that if the OraFence driver detects what it perceives to be a hang, for example at the operating system level, and that hang persists beyond 5 seconds, it's possible that the OraFence driver - of its own accord - will fence and evict the node.

Solution

So the question is not why the blue screen occurs, but why the OracleCSService process is failing (node eviction) or why the OraFenceService was not scheduled.
The first step in diagnosing any case of BSOD implicating the OraFence driver is to determine which type (1 or 2) of eviction / reboot your cluster has experienced.

To that end, please start with the Windows System Event Viewer log (if you intend to upload this log to support, please save it off as a .TXT file before doing so), and check for the 'bugcheck' or 'stop code' that was reported when the node was brought down. If that bugcheck code shows:

1. 0x0000FFFF then that means that you have experienced the first type of eviction explained above: you should look to your Oracle Clusterware alert log file ($GRID_HOME\log\<hostname>\) and your ocssd.log files for answers. More than likely you'll see that there was a series of missed checkins that led up to the eviction occurrence and this almost always indicates an underlying network issue (for example: faulty cables, cards, drivers, or ports on the network switch) or other OS resource issue / contention whereby the network is not able to respond to the checkin as it is 'busy' with other resource intensive operations and/or cannot get CPU to respond to the network checkin. In both cases, the root cause really needs to be sought at the OS / System Administration level.
10.1: %CRS_HOME%\css\log\ocssd.log
10.2: %CRS_HOME%\log\<hostname>\alert<hostname>.log AND %CRS_HOME%\log\<hostname>\cssd\ocssd.log
11.1: %CRS_HOME%\log\<hostname>\alert<hostname>.log AND %CRS_HOME%\log\<hostname>\cssd\ocssd.log
11.2: %CRS_HOME%\log\<hostname>\alert<hostname>.log AND %GI_HOME%\log\<hostname>\cssd\ocssd.log


2. 0x0000FFFE then that means that you have experienced the second type of eviction explained here above: and you should look at whether or not your node was under heavy load / is truly hanging from time to time for any reason - and/or - look at increasing the default orafencetimeout value - again - we have found that 5 seconds is a very aggressive timeout value and can safely be adjusted upwards. This is controlled with the following registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OraFenceService\Timeout


Please be sure to configure your server(s) to automatically reboot on a Bug Check/System Failure event, otherwise, you will see a blue screen without any further activity (the node will not actually reboot).  To that end, please check this setting by going to
Control Panel -> System -> Advanced system Settings -> 'Advanced tab'
'Startup and Recovery' Settings -> "System Failure" Select "Automatically
Restart"


 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As a normal function of our Oracle Clusterware / Grid Infrastructure, OraFenceService is designed to fence (I/O) and reboot a node if it perceives that
node is 'hung' once its configured timeout has been reached. The default timeout for the OraFence driver is a (very low) 5 seconds.
What this means is that if the OraFence driver detects what it perceives to be a hang at the operating system level and that hang persists beyond 5 seconds,
it's possible that the OraFence driver - of its own accord - will fence and evict the node.
It is advisable in some cases to increase the OraFence timeout value to as high as 10 seconds.
The OraFence timeout is controlled by the following Windows registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\OraFenceService\Timeout.
Note that modification of the OraFenceService timeout value requires a node reboot.
Please increase the OraFence timeout from the default of 5 seconds to a value of 10 seconds and let us know whether you encounter the issue again.
~~~~~~~~~~~~~~~~~~~~~~
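
As a convenience, the registry change above can be made from an elevated command prompt. The sketch below assumes you want the suggested 10-second value; a reboot of the node is still required for it to take effect.

REM Raise the OraFence timeout from the default 5 seconds to 10 seconds (reboot required).
reg add HKLM\SYSTEM\CurrentControlSet\Services\OraFenceService /v Timeout /t REG_DWORD /d 10 /f
reg query HKLM\SYSTEM\CurrentControlSet\Services\OraFenceService /v Timeout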

 

Tuesday, 28 July 2015

dNFS and ASM/Filesystem

An Oracle ASM disk group can be created from NFS files, including Oracle Direct NFS (dNFS), as well as whole disks, partitions, and LUNs. The NFS files that are provisioned to a disk group may be from multiple NFS servers to provide better load balancing and flexible capacity planning.

Direct NFS can be used to store data files, but is not supported for Oracle Clusterware files. To install Oracle Real Application Clusters (Oracle RAC) on Windows using Direct NFS, you must also have access to a shared storage method other than NFS for the Oracle Clusterware files.


Advantages of 'HAVING' ASM with dNFS --

1. If we remove ASM then the DBA has little or no control over storage; he cannot even see how much free space is left at the physical level.
2. Striping is available to gain performance benefits.
3. Multiplexing has benefits over mirroring when it comes to recovery.

(e.g. suppose one database is created with only 1 controlfile because external mirroring is in place at the storage level, and another database is created with 2 copies (multiplexed at the Oracle level); if an rm command is issued to remove that file, there will definitely be a difference in the time needed to restore the file.)


"Please also note that ASM is not required for using Direct NFS and NAS. ASM can be used if customers feel that ASM functionality is a value-add in their environment. " ------How to configure ASM on top of dNFS disks in 11gR2 (Doc ID 1570073.1


Oracle Direct NFS (dNFS) is an optimized NFS (Network File System) client that provides faster and more scalable access to NFS storage located on NAS storage devices (accessible over TCP/IP). Direct NFS is built directly into the database kernel.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The advantages of ASM are well known --

1. Easier administration for the DBA, as using this 'layer' we know the storage very well.
2. Automatic rebalancing and dynamic reconfiguration.
3. Striping and mirroring (though we are not using this option in our environment; external redundancy is provided at the storage level).
4. Less (or no) dependency on the storage admin for DB file related tasks.
5. Oracle also recommends using ASM rather than file system storage.

 Advantages of dNFS (Direct NFS) ---

1. Oracle bypasses the OS layer and connects directly to storage.
2. Better performance, as the user's data need not be loaded into the OS kernel.
3. It load balances across multiple network interfaces in a similar fashion to how ASM operates in SAN environments.

Now if we combine these two options, how will that configuration behave in terms of administration, manageability, performance, and downtime in case of a future migration?

I have collected some points.

In favor of 'NOT' HAVING ASM--

1. ASM is an extra layer on top of storage, so if using dNFS this layer could be removed as there are no performance benefits.
2. Store the data in a file system rather than ASM.
3. Striping will be provided at the storage level (not entirely sure about this).
4. External redundancy is being used at the storage level, so it is better to remove ASM.



How To Setup DNFS (Direct NFS) On Oracle Release 11.2 (Doc ID 1452614.1) 
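
As a quick reference, setting up dNFS in 11.2 essentially comes down to linking the Direct NFS ODM library into the database home and describing the NFS server in an oranfstab file. The sketch below is only an outline; the server name, network path, export and mount point are placeholders, so follow Doc ID 1452614.1 for the full procedure.

# 1) Enable the Direct NFS ODM library in the RDBMS home (11.2):
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk dnfs_on

# 2) Describe the NFS server in $ORACLE_HOME/dbs/oranfstab
#    (server name, path, export and mount point below are placeholders):
server: mynfsfiler
path: 192.168.10.5
export: /export/oradata mount: /u02/oradata

After restarting the instance, v$dnfs_servers can be queried to confirm that the database is actually using Direct NFS.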

Troubleshooting TFA issues

Check the TFA process on both nodes
ps -ef | grep tfa

Verify TFA runtime status
# /u01/app/grid/tfa/bin/tfactl print status

Check for errors:
# /u01/app/grid/tfa/bin/tfactl print errors

Check for database startups:
# /u01/app/grid/tfa/bin/tfactl print startups

Restart TFA
# /etc/init.d/init.tfa  restart

Stops the TFAMain process and removes related inittab entries
# /etc/init.d/init.tfa shutdown
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

To list hosts
tfactl print hosts

To add the host successfully, try the following steps:
1. Get the cookie in node1 using:
./tfa_home/bin/tfactl print cookie
2. Set the cookie from Step 1 in node2 using:
  ./tfa_home/bin/tfactl set cookie=
3. After Step 2, add host again:
 ./tfa_home/bin/tfactl host add node2

Verify TFA runtime status
tfactl print status

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default TFA will try to use ports 5000 to 5005 for secure root communications; however, if that port range is not
available on your system, this can be changed.

The file TFA_HOME/internal/usableports.txt when installed looks like this:

#cat /opt/oracle/tfa/tfa_home/internal/usableports.txt
5000
5001
5002
5003
5004
5005

To change the ports TFA will try to use, you need to (a command-level sketch follows the note below):
1) Stop TFA on all nodes
2) Edit usableports.txt to replace the ports you wish tfa to use.
3) Remove the file tfa_home/internal/port.txt on all nodes.
4) Start TFA on all nodes.

Note: The usableports.txt file must be identical on all nodes.
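
A command-level sketch of that procedure is shown below; it assumes the TFA home shown above (/opt/oracle/tfa/tfa_home) and that init.tfa accepts start/stop on your platform. Repeat on every node.

# On every node: stop TFA, adjust the port list, remove the cached port file, restart.
/etc/init.d/init.tfa stop
vi /opt/oracle/tfa/tfa_home/internal/usableports.txt   # make identical edits on all nodes
rm -f /opt/oracle/tfa/tfa_home/internal/port.txt
/etc/init.d/init.tfa start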

TFA will also use one of ports 5006 to 5011 on the loopback interface for non-root communications; however,
if that port range is not available on your system, then this can also be changed.

The file TFA_HOME/internal/usableNonRootports.txt when installed looks like this:

#cat /opt/oracle/tfa/tfa_home/internal/usableNonRootports.txt
5006
5007
5008
5009
5010
5011

To change the ports TFA will try to use you need to:-
1) Stop TFA on all nodes
2) Edit usableNonRootports.txt to replace the ports you wish TFA to use.
3) Remove the file tfa_home/internal/NonRootports.txt on all nodes.
4) Start TFA on all nodes.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Re-installation with the -local option will resolve issues with adding the hosts.

Small Notes on Clusterware resource ora.cvu

ora.cvu is a new resource introduced with Grid Infrastructure 11.2.0.2. The purpose of this resource is to invoke clusterware health checks at regular intervals. It is a singleton resource with cardinality of 1 and invokes the cluster verification utility. It executes the following command in the background.
cluvfy comp health -_format
Attribute '_format' sets the format which is redirected to the log file of the ora.cvu
Logs:
The logs related to ora.cvu are located under $GRID_HOME/log/<hostname>/cvu/.
There are two directories underneath: one for log files (cvulog) and one for trace files (cvutrc).

Monitoring/Managing cvu:
-- status of the resource ora.cvu
crsctl stat res ora.cvu -t

-- Stopping cvu
srvctl stop cvu
srvctl status cvu

--Starting cvu
srvctl start cvu
srvctl status cvu

-- enable/disable ora.cvu
srvctl enable cvu

-- check the status
crsctl stat res ora.cvu -p | grep ENABLED

-- disable ora.cvu
srvctl disable cvu

-- change execution frequency of ora.cvu
The frequency of ora.cvu execution can be changed by changing the value of the attribute "CHECK_INTERVAL"

crsctl modify resource ora.cvu -attr CHECK_INTERVAL=21600
crsctl stat res ora.cvu -p

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CVU resource output directory will contain .xml and .txt files (it creates up to 5 .txt and/or .xml files in that directory), normally generated every 6 hours by the ora.cvu resource. Due to a code bug these files are not purged; you might be hitting bug 19703199, the workaround being to delete the files (or apply the patch).
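
A cleanup sketch for that workaround is shown below; <cvu_output_dir> is a placeholder for the directory in which you see the accumulating files, and the 7-day age filter is arbitrary, so review the list before deleting anything.

# List CVU output files older than 7 days (replace <cvu_output_dir> with the real path).
find <cvu_output_dir> -maxdepth 1 \( -name "*.txt" -o -name "*.xml" \) -mtime +7 -ls
# After reviewing the list, re-run the same command with -delete instead of -ls.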

Thursday, 23 July 2015

To purge WRH$_SQL_PLAN table (old awr reports)

From the diag0 trace logs, the WRH$_SQL_PLAN table may be causing the database hangs. Please purge the table after taking a backup of any good SQL plans (if any).

1> Please keep a backup of good SQL plans so that you can restore them after purging the WRH$_SQL_PLAN table.

2> Please use one of the following options to purge the WRH$_SQL_PLAN table (Doc ID 1478615.1).

OPTION 1
========

Try executing the dbms_workload_repository.purge_sql_details() as per below:

First count the number in WRH$_SQL_PLAN table.

SQL> select count(*) from sys.wrh$_sql_plan;

SQL>SELECT dbid FROM v$database;

SQL> exec dbms_workload_repository.purge_sql_details(1000, &dbid);               --------------------------- Enter the dbid from above query.

SQL> select count(*) from sys.wrh$_sql_plan;


See if this deletes any rows from the WRH$_SQL_PLAN table. You then need to shrink the table and check the table size at the end to reclaim the space.

OPTION 2
========


Another option is to remove the old wrh$_sql_plan data by running a simple "delete from wrh$_sql_plan where snap_id between x and y" manually (delete little by little). After deletion of the old records from the wrh$_sql_plan table, the table needs to be shrunk. Then check the table size at the end.

The SQL below will help in identifying the range of snapshots that can be deleted from WRH$_SQL_PLAN.

SQL> select snap_id ,timestamp from  wrh$_sql_plan where snap_id=(select min(snap_id) from  wrh$_sql_plan);

The above query gives the oldest available data in table WRH$_SQL_PLAN.

This data could be older than the last available snapshot in the AWR repository.

SQL> select dbid,SNAP_ID,BEGIN_INTERVAL_TIME,END_INTERVAL_TIME from dba_hist_snapshot where SNAP_ID=&snap_id;

The above query gives the oldest available AWR data from the dba_hist_snapshot view.

Example for the delete process:

SQL> select min(TIMESTAMP) from wrh$_sql_plan;
MIN(TIMESTAMP)
------------------
06-JUN-08

1 row selected.

SQL> select min(BEGIN_INTERVAL_TIME) from dba_hist_snapshot;
MIN(BEGIN_INTERVAL_TIME)
---------------------------------------------------------------------------
31-JUL-12 12.00.24.903 AM

SQL> select count(*) from sys.wrh$_sql_plan;
COUNT(*)
----------
27147

SQL> delete from wrh$_sql_plan where trunc(TIMESTAMP) < (select min(BEGIN_INTERVAL_TIME) from dba_hist_snapshot);
7449 rows deleted.

SQL> commit;
Commit complete.

SQL> select min(TIMESTAMP) from wrh$_sql_plan;
MIN(TIMESTAMP)
------------------
01-AUG-12

1 row selected.

SQL> select min(BEGIN_INTERVAL_TIME) from dba_hist_snapshot;
MIN(BEGIN_INTERVAL_TIME)
---------------------------------------------------------------------------
31-JUL-12 12.00.24.903 AM

1 row selected.

Delete the older snap_ids from the table, then shrink the table to reclaim the space.
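
The shrink referred to above could look like the following sketch (run as SYS; row movement must be enabled first, and as with any operation on a SYS-owned AWR table it should be tested on a non-production database before being run in production):

-- Reclaim space in WRH$_SQL_PLAN after the delete (sketch only).
ALTER TABLE sys.wrh$_sql_plan ENABLE ROW MOVEMENT;
ALTER TABLE sys.wrh$_sql_plan SHRINK SPACE;
ALTER TABLE sys.wrh$_sql_plan DISABLE ROW MOVEMENT;

-- Check the resulting size:
SELECT bytes/1024/1024 AS size_mb FROM dba_segments
 WHERE owner = 'SYS' AND segment_name = 'WRH$_SQL_PLAN';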

Also, set the purging policy as per the Database usage.

Wednesday, 22 July 2015

UNDO TUNING : ORA-01555

ORA-01555 caused by SQL statement below (SQL ID: fdxcyoin67ty8t, Query Duration=380128 sec, SCN: 0x0229.ff00afd0):
 

Following are the existing settings

Code:

SQL> show parameter undo

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
undo_management                      string      AUTO
undo_retention                       integer     96000
undo_tablespace                      string          undo


SQL> select RETENTION from dba_tablespaces where TABLESPACE_NAME='UNDO';

RETENTION
-----------
NOGUARANTEE

SQL> select distinct AUTOEXTENSIBLE from dba_data_files where TABLESPACE_NAME like 'UNDO%';

AUT
---
NO

SQL> select sum(bytes)/1024/1024  "size_in_mb" from dba_data_files where tablespace_name='UNDO';

SIZE_IN_MB
--------------------
  216511


Following are the details from v$undostat

Code:

select begin_time, end_time, undotsn, undoblks, maxquerylen, maxqueryid, activeblks, unexpiredblks, expiredblks, tuned_undoretention from v$undostat
where trunc(begin_time)=trunc(sysdate)-1 order by begin_time;

BEGIN_TIME     END_TIME          UNDOTSN   UNDOBLKS MAXQUERYLEN MAXQUERYID    ACTIVEBLKS UNEXPIREDBLKS EXPIREDBLKS TUNED_UNDORETENTION
-------------- -------------- ---------- ---------- ----------- ------------- ---------- ------------- ----------- -------------------
21-04-13 00:08 21-04-13 00:18          1      12733      378446 duqnawh32hp4u      91152       7068448      225440              345600
21-04-13 00:18 21-04-13 00:28          1       8951      379047 duqnawh32hp4u      99344       7072800      225440              345600
21-04-13 00:28 21-04-13 00:38          1      14073      379650 duqnawh32hp4u      90128       7075872      234656              345600
21-04-13 00:38 21-04-13 00:48          1       6109        2651 2hgnkv6xxncmq      90128       6919328      399520              345600
21-04-13 00:48 21-04-13 00:58          1      10088        3253 2hgnkv6xxncmq      89104       6768928      525472              345600
21-04-13 00:58 21-04-13 01:08          1      69593        3854 2hgnkv6xxncmq      90128       6664736      564384              345600
21-04-13 01:08 21-04-13 01:18          1     163304        4457 2hgnkv6xxncmq     122896       6753824      541856              345600
21-04-13 01:18 21-04-13 01:28          1     172593        5058 2hgnkv6xxncmq     121872       6909472      482464              345600
21-04-13 01:28 21-04-13 01:38          1     171075        5660 2hgnkv6xxncmq     119824       7080480      480416              345600
21-04-13 01:38 21-04-13 01:48          1      84882        6262 2hgnkv6xxncmq      98320       7270048      478368              345600
21-04-13 01:48 21-04-13 01:58          1       1622        6863 2hgnkv6xxncmq      72720       7330464      476320              345600
21-04-13 01:58 21-04-13 02:08          1       2075        6335 2hgnkv6xxncmq      64528       7337632      477344              345600
21-04-13 02:08 21-04-13 02:18          1       1491         236 a4mhnydhzsv0y      64528       7336608      478368              345600
21-04-13 02:18 21-04-13 02:28          1        884         795 62yh0pbb4p9f2      63504       7325472      490656              345600


Following are the details in the AWR report (00:00 till 01:00 of 21-Apr-2013). Note that the error was produced at 00:42.

Code:

Undo Segment Summary                DB/Inst: DBCPY/dbcpy01  Snaps: 18853-18854
- Min/Max TR (mins) - Min and Max Tuned Retention (minutes)
- STO - Snapshot Too Old count,  OOS - Out of Space count
- Undo segment block stats:
- uS - unexpired Stolen,   uR - unexpired Released,   uU - unexpired reUsed
- eS - expired   Stolen,   eR - expired   Released,   eU - expired   reUsed

Undo   Num Undo       Number of  Max Qry   Max Tx Min/Max   STO/     uS/uR/uU/
 TS# Blocks (K)    Transactions  Len (s) Concurcy TR (mins) OOS      eS/eR/eU
---- ---------- --------------- -------- -------- --------- ----- --------------
   1      156.1          41,962  379,650       38 5760/5760 1/0   0/0/0/0/0/0
          -------------------------------------------------------------

    Undo Segment Stats                   DB/Inst: DBCPY/dbcpy01  Snaps: 18853-18854
- Most recent 35 Undostat rows, ordered by Time desc

                Num Undo    Number of Max Qry  Max Tx Tun Ret STO/    uS/uR/uU/
End Time          Blocks Transactions Len (s)   Concy  (mins) OOS     eS/eR/eU
------------ ----------- ------------ ------- ------- ------- ----- ------------
21-Apr 01:08      69,593        4,871   3,854      37   5,760 0/0   0/0/0/0/0/0
21-Apr 00:58      10,088        7,261   3,253      37   5,760 0/0   0/0/0/0/0/0
21-Apr 00:48       6,109        4,110   2,651      35   5,760 1/0   0/0/0/0/0/0
21-Apr 00:38      14,073        4,168 #######      36   5,760 0/0   0/0/0/0/0/0
21-Apr 00:28       8,951        4,031 #######      36   5,760 0/0   0/0/0/0/0/0
21-Apr 00:18      12,733        5,979 #######      36   5,760 0/0   0/0/0/0/0/0
21-Apr 00:08      34,602       11,542 #######      38   5,760 0/0   0/0/0/0/0/0
          -------------------------------------------------------------
   


Undo Advisor information taken 'now' is as follows

Code:

SQL> select dbms_undo_adv.longest_query(sysdate-2,sysdate) from dual;

DBMS_UNDO_ADV.LONGEST_QUERY(SYSDATE-2,SYSDATE)
----------------------------------------------
                                        379650

SQL> select dbms_undo_adv.required_retention from dual;

REQUIRED_RETENTION
------------------
            379650

SQL> select dbms_undo_adv.best_possible_retention from dual;

BEST_POSSIBLE_RETENTION
-----------------------
                1169169

SQL> select dbms_undo_adv.required_undo_size(379650) from dual;

DBMS_UNDO_ADV.REQUIRED_UNDO_SIZE(379650)
----------------------------------------
                                   80655

SQL> select dbms_undo_adv.required_undo_size(379650,sysdate-3,sysdate) from dual;

DBMS_UNDO_ADV.REQUIRED_UNDO_SIZE(379650,SYSDATE-3,SYSDATE)
----------------------------------------------------------
                                                     56028

In above situation what should be my first choice (assuming increasing space is not an issue) - increase undo tablespace or increase undo retention?

If latter is the choice then what should be the value?
Because, as I understand it, the present value of 96000 is taken as a lower limit, and because of auto tuning the actual value being used (TUNED_UNDORETENTION) was 345600.
In that case shall I set it to something > max(maxquerylen) i.e 379,650 + X?
Or I shall increase the undo tablespace size?

From the Undo Advisor output it looks to me that even if I increase the undo retention to 379650, the current undo size will be able to support it (maybe at the expense of DMLs).
Is that right?


~~~~~

General calculation :-

TUNED_UNDORETENTION = MAXQUERYLEN + 300 Sec.

If UNDO_RETENTION is set > 0, the value specified acts as a minimum setting.

If UNDO_MANAGEMENT=AUTO automatic undo tuning will be used.
MMON will calculate the MAXQUERYLEN every 30 sec interval. Based upon the MAXQUERYLEN, MMON decides the
TUNED_UNDORETENTION. That means the new UNDO RETENTION will be set to TUNED_UNDORETENTION.

For optimal undo management, rather than
tuning based on 100% of the tablespace size, the database tunes the undo retention period based
on 85% of the tablespace size.

Automatic tuning will help to avoid ORA-01555, but if your UNDO tablespace has autoextend off, then you might get into a situation
where active DML needs more space than it can get by reusing expired UNDO segments. The database will be under space pressure and Oracle will give
higher priority to finishing the DML, and not to queries.

So it is better to set autoextend on.

~~~~~~~~~~
_____________________________________________________________________________________________

Data Concurrency and Read Consistency

ROLLBACK or UNDO is the backbone of the READ CONSISTENCY mechanism provided by Oracle. The Multi-User Data Concurrency and Read Consistency mechanisms make Oracle stand tall in the Relational Database Management Systems (RDBMS) world.

Starting in Oracle9i, rollback segments were renamed undo segments. Traditionally, transaction undo information was stored in rollback segments until a commit or rollback statement was issued, at which point it was made available for overwriting.

Best of all, automatic undo management allows the DBA to specify how long undo information should be retained after commit, preventing “snapshot too old” errors on long running queries.

This is done by setting the UNDO_RETENTION parameter. The default is 900 seconds (15 minutes), and you can set this parameter to guarantee that Oracle keeps undo logs for extended periods of time.

Rather than having to define and manage rollback segments, you can simply define an Undo tablespace and let Oracle take care of the rest. Turning on automatic undo management is easy.  All you need to do is create an undo tablespace and set UNDO_MANAGEMENT = AUTO.

UNDO_RETENTION is a parameter in the init.ora initialization parameters file that specifies the time period in seconds for which the system retains undo data for committed transactions. A flashback query can go up to the point in time specified as a value in the UNDO_RETENTION parameter.

The properties of the UNDO_RETENTION parameter are mentioned below:

    Parameter type – Integer
    Default value – 900
    Range of values – 0 to 2^32 - 1
    Syntax – ALTER SYSTEM SET UNDO_RETENTION = < seconds >;

However, it is worth tuning the following important parameters:

1. The size of the UNDO tablespace
2. The UNDO_RETENTION parameter

Calculate UNDO_RETENTION for a given UNDO Tablespace

You can choose to allocate a specific size for the UNDO tablespace and then set the UNDO_RETENTION parameter to an optimal value according to the UNDO size and the database activity. If your disk space is limited and you do not want to allocate more space than necessary to the UNDO tablespace, this is the way to proceed. The following query will help you to optimize the UNDO_RETENTION parameter:

Optimal Undo Retention =
           Actual Undo Size / (DB_BLOCK_SIZE × UNDO_BLOCK_PER_SEC)

Because these following queries use the V$UNDOSTAT statistics, run the queries only after the database has been running with UNDO for a significant and representative time.
Actual Undo Size

SELECT SUM(a.bytes) "UNDO_SIZE"
FROM v$datafile a,
       v$tablespace b,
       dba_tablespaces c
WHERE c.contents = 'UNDO'
   AND c.status = 'ONLINE'
   AND b.name = c.tablespace_name
   AND a.ts# = b.ts#;

Undo Blocks per Second

SELECT MAX(undoblks/((end_time-begin_time)*3600*24)) "UNDO_BLOCK_PER_SEC"
FROM v$undostat;

DB Block Size

SELECT TO_NUMBER(value) "DB_BLOCK_SIZE [KByte]"
FROM v$parameter
WHERE name = 'db_block_size';

Optimal Undo Retention Calculation

Formula:
Optimal Undo Retention =
           Actual Undo Size / (DB_BLOCK_SIZE × UNDO_BLOCK_PER_SEC)

Using Inline Views, you can do all calculation in one query

SQL Code:
SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
    SUBSTR(e.value,1,25)    "UNDO RETENTION [Sec]",
    ROUND((d.undo_size / (to_number(f.value) *
    g.undo_block_per_sec)))"OPTIMAL UNDO RETENTION [Sec]"
  FROM (
       SELECT SUM(a.bytes) undo_size
          FROM v$datafile a,
               v$tablespace b,
               dba_tablespaces c
         WHERE c.contents = 'UNDO'
           AND c.status = 'ONLINE'
           AND b.name = c.tablespace_name
           AND a.ts# = b.ts#
       ) d,
       v$parameter e,
       v$parameter f,
       (
       SELECT MAX(undoblks/((end_time-begin_time)*3600*24))undo_block_per_sec
       FROM v$undostat
       ) g
WHERE e.name = 'undo_retention'
  AND f.name = 'db_block_size'

Calculate Needed UNDO Size for given Database Activity

If you are not limited by disk space, then it would be better to choose the UNDO_RETENTION time that is best for you (for FLASHBACK, etc.). Allocate the appropriate size to the UNDO tablespace according to the database activity:

Formula:
Undo Size = Optimal Undo Retention × DB_BLOCK_SIZE × UNDO_BLOCK_PER_SEC

Using Inline Views, you can do all calculation in one query

SQL Code:
SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
       SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]",
       (TO_NUMBER(e.value) * TO_NUMBER(f.value) *
       g.undo_block_per_sec) / (1024*1024)
      "NEEDED UNDO SIZE [MByte]"
  FROM (
       SELECT SUM(a.bytes) undo_size
         FROM v$datafile a,
              v$tablespace b,
              dba_tablespaces c
        WHERE c.contents = 'UNDO'
          AND c.status = 'ONLINE'
          AND b.name = c.tablespace_name
          AND a.ts# = b.ts#
       ) d,
      v$parameter e,
      v$parameter f,
       (
       SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
         undo_block_per_sec
         FROM v$undostat
       ) g
 WHERE e.name = 'undo_retention'
  AND f.name = 'db_block_size'

Automatic Undo Retention Tuning

Oracle 10g automatically tunes undo retention to reduce the chances of “snapshot too old” errors during long-running queries. The UNDO_RETENTION parameter is used to set a low retention time threshold which the system will attempt to achieve. In the event of any undo space constraints the system will prioritize DML operations over undo retention, meaning the low threshold may not be achieved. If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace during or after creation:
Reset the undo low threshold.

ALTER SYSTEM SET UNDO_RETENTION = 2400;

Guarantee the minimum threshold is maintained.

ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;

SELECT tablespace_name, retention FROM dba_tablespaces;

TABLESPACE_NAME                RETENTION
------------------------------ -----------
SYSTEM                         NOT APPLY
UNDOTBS1                       GUARANTEE
SYSAUX                         NOT APPLY
TEMP                           NOT APPLY
USERS                          NOT APPLY

5 rows selected.

Switch back to the default mode.

ALTER TABLESPACE undotbs1 RETENTION NOGUARANTEE;

SELECT tablespace_name, retention FROM dba_tablespaces;

TABLESPACE_NAME                RETENTION
------------------------------ -----------
SYSTEM                         NOT APPLY
UNDOTBS1                       NOGUARANTEE
SYSAUX                         NOT APPLY
TEMP                           NOT APPLY
USERS                          NOT APPLY

5 rows selected.

The Undo Advisor PL/SQL Interface

Oracle Database provides an Undo Advisor that provides advice on and helps automate the establishment of your undo environment. You activate the Undo Advisor by creating an undo advisor task through the advisor framework. The following example creates an undo advisor task to evaluate the undo tablespace. The name of the advisor is ‘Undo Advisor‘. The analysis is based on Automatic Workload Repository snapshots, which you must specify by setting parameters START_SNAPSHOT and END_SNAPSHOT. In the following example, the START_SNAPSHOT is “1” and END_SNAPSHOT is “2”.

DECLARE
   tid    NUMBER;
   tname  VARCHAR2(30);
   oid    NUMBER;
   BEGIN
   DBMS_ADVISOR.CREATE_TASK('Undo Advisor', tid, tname, 'Undo Advisor Task');
   DBMS_ADVISOR.CREATE_OBJECT(tname, 'UNDO_TBS', null, null, null, 'null', oid);
   DBMS_ADVISOR.SET_TASK_PARAMETER(tname, 'TARGET_OBJECTS', oid);
   DBMS_ADVISOR.SET_TASK_PARAMETER(tname, 'START_SNAPSHOT', 1);
   DBMS_ADVISOR.SET_TASK_PARAMETER(tname, 'END_SNAPSHOT', 2);
   DBMS_ADVISOR.execute_task(tname);
end;

_____________________________________________________________________________________________

To check the optimum size, please execute this query:

SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]",
SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]",
ROUND((d.undo_size / (to_number(f.value) *
g.undo_block_per_sec))) "OPTIMAL UNDO RETENTION [Sec]"
FROM (
SELECT SUM(a.bytes) undo_size
FROM v$datafile a,
v$tablespace b,
dba_tablespaces c
WHERE c.contents = 'UNDO'
AND c.STATUS = 'ONLINE'
AND b.name = c.tablespace_name
AND a.ts# = b.ts#
) d,
v$parameter e,
v$parameter f,
(
SELECT MAX(undoblks/((end_time-begin_time)*3600*24))
undo_block_per_sec
FROM v$undostat
) g
WHERE e.name = 'undo_retention'
AND f.name = 'db_block_size';

Cleanup orphaned datapump jobs from DBA_DATAPUMP_JOBS:

Step 01: Check the orphaned datapump jobs.

sqlplus / as sysdba
SET lines 140
COL owner_name FORMAT a10;
COL job_name FORMAT a20
COL state FORMAT a12
COL operation LIKE owner_name
COL job_mode LIKE owner_name
SELECT owner_name, job_name, operation, job_mode,
state, attached_sessions
FROM dba_datapump_jobs;

OWNER_NAME JOB_NAME OPERATION JOB_MODE STATE ATTACHED_SESSIONS
---------- -------------------- ---------- ---------- ------------ -----------------
RANA SYS_EXPORT_SCHEMA_01 EXPORT SCHEMA NOT RUNNING 0
RANA SYS_EXPORT_SCHEMA_02 EXPORT SCHEMA NOT RUNNING 0

Step 02: Check the state field. For orphaned jobs the state will be NOT RUNNING. So from the output we can say both are orphaned jobs.

Step 03: Drop the master table.

DROP TABLE RANA.SYS_EXPORT_SCHEMA_01;
DROP TABLE RANA.SYS_EXPORT_SCHEMA_02;

Step 04: Check for existing Data Pump jobs with the query issued in step 01. If the objects are in the recycle bin, then purge the objects from the recycle bin.

SQL> SELECT owner_name, job_name, operation, job_mode,
 state, attached_sessions
 FROM dba_datapump_jobs;
OWNER_NAME JOB_NAME OPERATION JOB_MODE STATE ATTACHED_SESSIONS
---------- -------------------- ---------- ---------- ------------ -----------------
RANA BIN$xMNQdACzQ6yl22kj EXPORT SCHEMA NOT RUNNING 0
 9U0B8A==$0
RANA BIN$BmUy4r5MSX+ojxFk EXPORT SCHEMA NOT RUNNING 0
 sw8ocg==$0

SQL> PURGE TABLE RANA.SYS_EXPORT_SCHEMA_01;
Table purged.

SQL> PURGE TABLE RANA.SYS_EXPORT_SCHEMA_02;
Table purged.

Check again whether any orphaned jobs remain.

SQL> SELECT owner_name, job_name, operation, job_mode,
state, attached_sessions
FROM dba_datapump_jobs;
no rows selected

Step 05: At this stage there should be no orphaned jobs left, provided they had master tables. If jobs are still listed in DBA_DATAPUMP_JOBS, clean them up by attaching to each job and stopping it, as shown below.

SET serveroutput on
SET lines 100
DECLARE
 job1 NUMBER;
BEGIN
 job1 := DBMS_DATAPUMP.ATTACH('SYS_EXPORT_SCHEMA_01','RANA');
 DBMS_DATAPUMP.STOP_JOB (job1);
END;
/
DECLARE
 job2 NUMBER;
BEGIN
 job2 := DBMS_DATAPUMP.ATTACH('SYS_EXPORT_SCHEMA_02','RANA');
 DBMS_DATAPUMP.STOP_JOB (job2);
END;
/
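
If a job refuses to stop cleanly, DBMS_DATAPUMP.STOP_JOB also accepts additional parameters. The sketch below is an assumption on my part rather than part of the original steps; it aborts the job immediately and drops its master table:

DECLARE
  h NUMBER;
BEGIN
  h := DBMS_DATAPUMP.ATTACH('SYS_EXPORT_SCHEMA_01', 'RANA');
  -- immediate => 1 aborts without waiting for worker processes,
  -- keep_master => 0 drops the master table when the job stops
  DBMS_DATAPUMP.STOP_JOB(h, 1, 0);
END;
/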

RHEL6 vs OEL6

The differences between Oracle Linux and Red Hat Enterprise Linux are mostly differences in licensing and support.

The two vendors also use different mechanisms to count the number of installed instances.

Second, Oracle Linux offers a choice of kernels: you can run the same kernel Red Hat ships, or one of the kernels modified by Oracle.

Oracle's Unbreakable Enterprise Kernel includes additional drivers (e.g. for SSD disks) that are not available in Red Hat's kernel, and because of these additions Oracle claims significant performance gains.

Of course, you need to consider how much weight you give such marketing claims and whether they apply to your environment, i.e. do you actually use SSD disks and InfiniBand messaging?

On the other hand, Oracle does not, in my opinion, provide a real alternative to Red Hat Network Satellite server; that weighed heavily in our decision to run our new Oracle databases on standard RHEL servers. Again, how important this is depends on your environment. Oracle Enterprise Manager has been extended with some management features.

With Oracle Linux, there are some bonus features that differentiate it from RHEL:

1. Unbreakable Kernel (optimized OLTP, InfiniBand and SSD disk access, NUMA optimizations, Reliable Datagram Sockets (RDS), async I/O, OCFS2, and networking)
2. OCFS2 (clustered file system)
3. Ksplice (hot kernel patching without rebooting)
4. DTrace (ported from Solaris)
5. Cheaper support costs than RHEL

Files for DNS and DHCP


# Public
192.9.201.183    rac1.shilpa.com        rac1
192.9.201.187    rac2.shilpa.com        rac2

#Private
192.9.201.173    rac1-priv.shilpa.com        rac1-priv
192.9.201.187    rac2-priv.shilpa.com        rac2-priv

#Virtual
192.9.201.173    rac1-vip.shilpa.com        rac1-vip
192.9.201.177    rac2-vip.shilpa.com        rac2-vip

#Scan
192.9.201.163    rac-scan.shilpa.com        rac-scan
192.9.201.167    rac-scan.shilpa.com        rac-scan
192.9.201.169    rac-scan.shilpa.com        rac-scan

Netmask:255.255.255.0
Default gateway IP : 192.9.201.2
Primary DNS Server : 192.9.201.59
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
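
For the RAC nodes to resolve these names through the DNS server above, each node's /etc/resolv.conf would typically point at it; a minimal sketch based only on the addresses listed here:

# /etc/resolv.conf on each cluster node (sketch)
search shilpa.com
nameserver 192.9.201.59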

[root@server1 Desktop]# cat /var/named/forward.zone
$TTL 1D
@    IN SOA    server1.shilpa.com. root.server1.shilpa.com. (
                    0    ; serial
                    1D    ; refresh
                    1H    ; retry
                    1W    ; expire
                    3H )    ; minimum
        NS    server1.shilpa.com.
server1        A    192.9.201.59
rac1        A    192.9.201.183
rac2        A    192.9.201.187
rac3        A    192.9.201.189
host01        A    192.9.201.23
host02        A    192.9.201.27
rac1-priv    A    10.0.0.1
rac2-priv    A    10.0.0.2
rac3-priv    A    10.0.0.3
cluster01-gns    A    192.9.201.180
    AAAA    ::1
$ORIGIN        cluster01.shilpa.com.
@    IN      NS    cluster01-gns.shilpa.com.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@server1 Desktop]# cat /var/named/reverse.zone
$TTL 1D
@    IN SOA    server1.shilpa.com. root.server1.shilpa.com. (
                    0    ; serial
                    1D    ; refresh
                    1H    ; retry
                    1W    ; expire
                    3H )    ; minimum
    NS    server1.shilpa.com.
    A    192.9.201.59
    AAAA    ::1
59    PTR    server1.shilpa.com.
183    PTR    rac1.shilpa.com.
187    PTR    rac2.shilpa.com.
189    PTR    rac3.shilpa.com.
23    PTR    host01.shilpa.com.
27    PTR    host02.shilpa.com.
180    PTR    cluster01-gns.shilpa.com.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@server1 Desktop]# cat /var/named/reverse1.zone
$TTL 1D
@    IN SOA    server1.shilpa.com. root.server1.shilpa.com. (
                    0    ; serial
                    1D    ; refresh
                    1H    ; retry
                    1W    ; expire
                    3H )    ; minimum
    NS    server1.shilpa.com.
    A    192.9.201.59
    AAAA    ::1
1    PTR    rac1-priv.shilpa.com.
2    PTR    rac2-priv.shilpa.com.
3    PTR    rac3-priv.shilpa.com.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@server1 Desktop]# cat /var/named/chroot/etc/named.conf
//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//

options {
    listen-on port 53 { 192.9.201.59; };
#    listen-on-v6 port 53 { ::1; };
    directory     "/var/named";
    dump-file     "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
    allow-query     { any; };
    recursion yes;

    dnssec-enable yes;
    dnssec-validation no;
    dnssec-lookaside auto;

    /* Path to ISC DLV key */
    bindkeys-file "/etc/named.iscdlv.key";

    managed-keys-directory "/var/named/dynamic";
};

logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};

zone "." IN {
    type hint;
    file "named.ca";
};

include "/etc/named.rfc1912.zones";
include "/etc/named.root.key";
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@server1 Desktop]# cat /etc/named.rfc1912.zones
// named.rfc1912.zones:
//
// Provided by Red Hat caching-nameserver package
//
// ISC BIND named zone configuration for zones recommended by
// RFC 1912 section 4.1 : localhost TLDs and address zones
// and http://www.ietf.org/internet-drafts/draft-ietf-dnsop-default-local-zones-02.txt
// (c)2007 R W Franks
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//

zone "shilpa.com" IN {
    type master;
    file "forward.zone";
    allow-transfer { 192.9.201.59; };
};

zone "localhost" IN {
    type master;
    file "named.localhost";
    allow-update { none; };
};

zone "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa" IN {
    type master;
    file "named.loopback";
    allow-update { none; };
};

zone "201.9.192.in-addr.arpa" IN {
    type master;
    file "reverse.zone";
};

zone "0.0.10.in-addr.arpa" IN {
    type master;
    file "reverse1.zone";
};
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
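Before restarting named, the configuration and zone files can be sanity-checked with the tools shipped in the bind package; the paths below assume the chroot layout shown above:

# validate the main configuration and each zone file
named-checkconf /var/named/chroot/etc/named.conf
named-checkzone shilpa.com /var/named/forward.zone
named-checkzone 201.9.192.in-addr.arpa /var/named/reverse.zone
named-checkzone 0.0.10.in-addr.arpa /var/named/reverse1.zone
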
[root@server1 Desktop]# cat /etc/dhcp/dhcpd.conf
# dhcpd.conf
#
# Subnet Declaration

subnet 192.9.201.0 netmask 255.255.255.0 {
        option routers                  192.9.201.1;
        option subnet-mask              255.255.255.0;
        option domain-search              "shilpa.com";
        option domain-name-servers       192.9.201.59;
        option time-offset              -18000;     # Eastern Standard Time
#    range 192.9.201.190 192.9.201.254;
}

# Use this to enable / disable dynamic DNS updates globally.
ddns-update-style interim;

# If this DHCP server is the official DHCP server for the local
# network, the authoritative directive should be uncommented.
authoritative;

# Use this to send dhcp log messages to a different log file (you also
# have to hack syslog.conf to complete the redirection).
log-facility local6;

# Range Parameter

default-lease-time 1600;
max-lease-time 7200;
option subnet-mask 255.255.255.0;
option routers 192.9.201.1;
option domain-name-servers 192.9.201.59;
option domain-search "shilpa.com";
subnet 192.9.201.0 netmask 255.255.255.0 {
   range 192.9.201.190 192.9.201.254;
}

# Static IP Address Using DHCP

host ns {
   next-server marvin.redhat.com;
   hardware ethernet 12:34:56:78:AB:CD;
   fixed-address 207.175.42.254;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
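
Once the files are in place, a quick functional check can be run from the DNS server or any client; this is a sketch using standard tools, and the service commands assume this RHEL6-era init setup:

# restart the services and test resolution against the new server (192.9.201.59)
service named restart
service dhcpd restart
dig +short rac1.shilpa.com @192.9.201.59      # forward lookup from forward.zone
dig +short -x 192.9.201.183 @192.9.201.59     # reverse lookup via 201.9.192.in-addr.arpa
dig +short -x 10.0.0.1 @192.9.201.59          # reverse lookup via 0.0.10.in-addr.arpa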