I have recently been deploying a new ODA X8-2-HA. After reimaging and creating the appliance, I had to patch the ODA. The day after, while checking the status, I realized that my node 0 was not working as expected. I did some interesting troubleshooting that I wanted to share, hoping it might be helpful for someone.
Why patching an ODA after reimaging it?
It might seem curious to speak about patching an ODA after a fresh installation (reimaging). The problem is that reimaging with the latest available version will only install the latest operating system and grid versions. ILOM and BIOS will not be updated. The need to patch can easily be seen when you check the installed components:
[root@ODA-node0 ~]# odacli describe-component
System Version
---------------
19.8.0.0.0

System node Name
---------------
ODA-node0
Local System Version
---------------
19.8.0.0.0
Component                                Installed Version    Available Version
---------------------------------------- -------------------- --------------------
OAK                                      19.8.0.0.0           up-to-date
GI                                       19.8.0.0.200714      up-to-date
DB                                       12.1.0.2.200714      up-to-date
DCSAGENT                                 19.8.0.0.0           up-to-date
ILOM                                     4.0.4.38.r130206     4.0.4.51.r134837
BIOS                                     52010400             52021300
OS                                       7.8                  up-to-date
FIRMWARECONTROLLER                       16.00.01.00          16.00.08.00
FIRMWAREEXPANDER                         0309                 0310
FIRMWAREDISK {
[ c2d0,c2d1 ]                            1120                 1102
[ c0d0,c0d1,c0d2,c0d3,c0d4,c0d5,c0d6,    A959                 up-to-date
c0d7,c0d8,c0d9,c0d10,c0d11,c0d12,c0d13,
c0d14,c0d15,c0d16,c0d17,c0d18,c0d19,
c0d20,c0d21,c0d22,c0d23,c1d0,c1d1,c1d2,
c1d3,c1d4,c1d5,c1d6,c1d7,c1d8,c1d9,
c1d10,c1d11,c1d12,c1d13,c1d14,c1d15,
c1d16,c1d17,c1d18,c1d19,c1d20,c1d21,
c1d22,c1d23 ] }
HMP                                      2.4.5.0.1            up-to-date

System node Name
---------------
ODA-node1
Local System Version
---------------
19.8.0.0.0
Component                                Installed Version    Available Version
---------------------------------------- -------------------- --------------------
OAK                                      19.8.0.0.0           up-to-date
GI                                       19.8.0.0.200714      up-to-date
DB                                       12.1.0.2.200714      up-to-date
DCSAGENT                                 19.8.0.0.0           up-to-date
ILOM                                     4.0.4.38.r130206     4.0.4.51.r134837
BIOS                                     52010400             52021300
OS                                       7.8                  up-to-date
FIRMWARECONTROLLER                       16.00.01.00          16.00.08.00
FIRMWAREEXPANDER                         0309                 0310
FIRMWAREDISK {
[ c2d0,c2d1 ]                            1120                 1102
[ c0d0,c0d1,c0d2,c0d3,c0d4,c0d5,c0d6,    A959                 up-to-date
c0d7,c0d8,c0d9,c0d10,c0d11,c0d12,c0d13,
c0d14,c0d15,c0d16,c0d17,c0d18,c0d19,
c0d20,c0d21,c0d22,c0d23,c1d0,c1d1,c1d2,
c1d3,c1d4,c1d5,c1d6,c1d7,c1d8,c1d9,
c1d10,c1d11,c1d12,c1d13,c1d14,c1d15,
c1d16,c1d17,c1d18,c1d19,c1d20,c1d21,
c1d22,c1d23 ] }
HMP                                      2.4.5.0.1            up-to-date
In my case here, I had redeployed the ODA with version 19.8 and there was a version gap for ILOM, BIOS and the storage firmware. There are some very good blogs about patching an ODA to version 19.X:
Patching ODA from 18.8 to 19.6
Reimaging ODA in version 19.8 and patching
In some cases the ODA patching fails to update the ILOM and the BIOS, and this then needs to be done manually. Such an operation is also described in detail in the same last blog:
ILOM and BIOS manual updates
Cluster offset issue
The day after, before moving forward, I checked the ODA status and could see a problem with my node 0. Neither the ASM instance nor the ASM listener was started, see below.
oracle@ODA-node0:/home/oracle/ [rdbms12102_1] ser
2020-09-16_10-23-05::dmk-run.pl::CheckListenerFile ::ERROR ==> Couldn't open the listener.ora : /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin
Dummy:
------
rdbms12102_1 : DUMMY (12.1.0.2/dbhome_1)

Database(s):
------------
+ASM1 : STOPPED (grid)

oracle@ODA-node0:/home/oracle/ [rdbms12102_1]
Please note that our customer environments run the dbi DMK management kit, which is why some of these outputs may look unusual.
Surprised, but expecting an easy fix, I decided to simply try to start the ASM listener:
grid@ODA-node0:/home/grid/ [+ASM1] srvctl status listener -listener ASMNET1LSNR_ASM
PRCR-1070 : Failed to check if resource ora.ASMNET1LSNR_ASM.lsnr is registered
CRS-0184 : Cannot communicate with the CRS daemon.
This was definitely unsuccessful.
I then checked the Oracle Restart status:
grid@psrpri230:/home/grid/ [rdbms12102_1] crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
The cluster was having problems, so I tried to stop and start it.
[root@ODA-node0 bin]# ./crsctl start cluster
CRS-2672: Attempting to start 'ora.ctssd' on 'ODA-node0'
The clock on host ODA-node0 differs from mean cluster time by 21656584768 microseconds. The Cluster Time Synchronization Service will not perform time synchronization because the time difference is beyond the permissible offset of 600 seconds. Details in /u01/app/grid/diag/crs/ODA-node0/crs/trace/octssd.trc.
CRS-2674: Start of 'ora.ctssd' on 'ODA-node0' failed
CRS-2672: Attempting to start 'ora.ctssd' on 'ODA-node0'
The clock on host ODA-node0 differs from mean cluster time by 21656573817 microseconds. The Cluster Time Synchronization Service will not perform time synchronization because the time difference is beyond the permissible offset of 600 seconds. Details in /u01/app/grid/diag/crs/ODA-node0/crs/trace/octssd.trc.
CRS-2674: Start of 'ora.ctssd' on 'ODA-node0' failed
CRS-4000: Command Start failed, or completed with errors.
There was definitely a time synchronization issue between my two nodes, node0 and node1. This could be seen in the log file as well:
[root@ODA-node0 bin]# tail -20 /u01/app/grid/diag/crs/ODA-node0/crs/trace/octssd.trc
2020-09-16 10:51:56.830 : GIPCTLS:2817517312: gipcmodTlsSetAuthFlags: nzcred auth (global 0) flags, feature: 32, optional:0
2020-09-16 10:51:56.830 : GIPCTLS:2817517312: gipcmodTlsAuthInit: tls context initialized successfully
2020-09-16 10:51:56.839 : GIPCTLS:2817517312: gipcmodTlsAuthStart: TLS HANDSHAKE - SUCCESSFUL
2020-09-16 10:51:56.839 : GIPCTLS:2817517312: gipcmodTlsAuthStart: peerUser: NULL
2020-09-16 10:51:56.839 : GIPCTLS:2817517312: gipcmodTlsAuthStart: name_CN=ea78a0117f76ff9cff4967dc0d8bf3cf_4294692300,O=Oracle Clusterware,
2020-09-16 10:51:56.839 : GIPCTLS:2817517312: gipcmodTlsAuthStart: name_CN=ea78a0117f76ff9cff4967dc0d8bf3cf_1599664758,O=Oracle_Clusterware,
2020-09-16 10:51:56.839 : GIPCTLS:2817517312: gipcmodTlsAuthStart: endpoint 0x7efe9003ac00 [0000000000000a28] { gipcEndpoint : localAddr 'gipcha://ODA-node0:3202-6a4f-ef55-4696', remoteAddr 'gipcha://ODA-node1:CTSSGROUP_2/a7fa-52e9-ad59-eca4', numPend 2, numReady 0, numDone 1, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x7efe9003d410, sendp (nil) status 13flags 0x200b8602, flags-2 0x10, usrFlags 0x20020 }, auth state: gipcmodTlsAuthStateReady (3)
2020-09-16 10:51:56.839 : GIPCTLS:2817517312: gipcmodTlsAuthReady: TLS Auth completed Successfully
2020-09-16 10:51:56.977 : CRSCCL:2817517312: ConnAccepted from Peer_msgTag= 0xcccccccc version= 0 msgType= 4 msgId= 0 msglen = 0 clschdr.size_clscmsgh= 88 src= (2, 4294731470) dest= (1, 1099434)
2020-09-16 10:51:56.977 : CTSS:2811213568: ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2020-09-16 10:51:56.977 : CTSS:2815416064: ctsscomm_recv_cb4_2: Receive active version change msg. Old active version [318767104] New active version [318767104].
2020-09-16 10:51:56.977 : CTSS:2815416064: ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2020-09-16 10:51:56.977 : CTSS:2811213568: ctssslave_swm2_3: Received time sync message from master.
2020-09-16 10:51:56.977 : CTSS:2811213568: ctssslave_swm: The magnitude [21656573817] of the offset [21656573817 usec] is larger than [600000000 usec] sec which is the CTSS limit. Hence CTSS is exiting.
2020-09-16 10:51:56.977 : CTSS:2811213568: (:ctsselect_msm3:): Failed in clsctssslave_sync_with_master [12]: Time offset is too much to be corrected, exiting.
2020-09-16 10:51:56.977 : CTSS:2815416064: ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2020-09-16 10:51:57.310 : CRSCCL:3293234944: clsCclGetPriMemberData: Detected pridata change for node[1]. Retrieving it to the cache.
2020-09-16 10:51:57.532 : CTSS:3315066624: ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xf6], offset[21656573 ms]}, length=[8].
2020-09-16 10:51:57.532 : CTSS:2811213568: ctsselect_msm: CTSS daemon exiting [12] as offset is to large.
2020-09-16 10:51:57.532 : CTSS:2811213568: CTSS daemon aborting
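CTSS reports the offset in microseconds, which is hard to read at a glance. A quick shell conversion of the value from the trace above shows the magnitude of the gap:

```shell
# Convert the CTSS offset from microseconds to whole hours
offset_us=21656573817
echo $(( offset_us / 1000000 / 3600 ))   # prints 6: roughly a 6-hour clock gap
```

So one node was about six hours off, far beyond the 600-second limit CTSS is willing to correct.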
I then decided to check the OS time displayed on both servers:
grid@ODA-node0:/home/grid/ [rdbms12102_1] date
Wed Sep 16 11:10:17 CEST 2020

oracle@ODA-node1:/home/oracle/ [rdbms12102_1] date
Wed Sep 16 05:09:29 CEST 2020
Node 0 was on time, but node 1 definitely was not.
But why? NTP had been configured on the appliance:
[root@ODA-node1 ~]# odacli describe-system

Appliance Information
----------------------------------------------------------------
                     ID: 74352d02-f2f6-4547-a5cf-71b40f7b7879
               Platform: X8-2-HA
        Data Disk Count: 24
         CPU Core Count: 32
                Created: September 9, 2020 2:04:08 AM CEST

System Information
----------------------------------------------------------------
                   Name:
            Domain Name: domain_name
              Time Zone: Europe/Zurich
             DB Edition: EE
            DNS Servers: 10.X.X.X 10.X.X.X
            NTP Servers: 10.10.40.2 10.7.40.23 10.7.40.25

Disk Group Information
----------------------------------------------------------------
DG Name                   Redundancy                Percentage
------------------------- ------------------------- ------------
Data                      Normal                    50
Reco                      Normal                    50
Please note that the IP addresses have been changed so as not to display the customer's ones.
NTP configuration on the ODA
The NTP configuration on the ODA is a common Linux NTP configuration.
I checked the NTP configuration and everything looked good. In the meantime I realized that the Red Hat NTP pool servers are configured and active by default. Most customers won't accept such external connections and have their own internal NTP servers. This was the case for this customer, so I commented out those lines in the NTP configuration file.
[root@ODA-node1 bin]# cp -p /etc/ntp.conf /etc/ntp.conf.20200916
[root@ODA-node1 bin]# vi /etc/ntp.conf
[root@ODA-node1 bin]# diff /etc/ntp.conf /etc/ntp.conf.20200916
21,24c21,24
< #server 0.rhel.pool.ntp.org iburst
< #server 1.rhel.pool.ntp.org iburst
< #server 2.rhel.pool.ntp.org iburst
< #server 3.rhel.pool.ntp.org iburst
---
> server 0.rhel.pool.ntp.org iburst
> server 1.rhel.pool.ntp.org iburst
> server 2.rhel.pool.ntp.org iburst
> server 3.rhel.pool.ntp.org iburst
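If you have several nodes to adjust, the same edit can be scripted with sed instead of vi. This is just a sketch, demonstrated here on a scratch file so nothing touches the real /etc/ntp.conf; adapt the path (and keep the dated backup step shown above) before using it for real:

```shell
# Build a scratch copy containing the four stock pool entries
conf=$(mktemp)
printf 'server %s.rhel.pool.ntp.org iburst\n' 0 1 2 3 > "$conf"

# Comment out every rhel.pool.ntp.org server line in place
sed -i 's/^\(server [0-3]\.rhel\.pool\.ntp\.org iburst\)/#\1/' "$conf"

cat "$conf"    # all four entries are now commented
rm -f "$conf"
```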
The NTP configuration file then looks like this:
[root@ODA-node1 bin]# cat /etc/ntp.conf
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).

driftfile /var/lib/ntp/drift

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict ::1

# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.rhel.pool.ntp.org iburst
#server 1.rhel.pool.ntp.org iburst
#server 2.rhel.pool.ntp.org iburst
#server 3.rhel.pool.ntp.org iburst

#broadcast 192.168.1.255 autokey        # broadcast server
#broadcastclient                        # broadcast client
#broadcast 224.0.1.1 autokey            # multicast server
#multicastclient 224.0.1.1              # multicast client
#manycastserver 239.255.254.254         # manycast server
#manycastclient 239.255.254.254 autokey # manycast client

# Enable public key cryptography.
#crypto

includefile /etc/ntp/crypto/pw

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys

# Specify the key identifiers which are trusted.
#trustedkey 4 8 42

# Specify the key identifier to use with the ntpdc utility.
#requestkey 8

# Specify the key identifier to use with the ntpq utility.
#controlkey 8

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats

# Disable the monitoring facility to prevent amplification attacks using ntpdc
# monlist command when default restrict does not include the noquery flag. See
# CVE-2013-5211 for more details.
# Note: Monitoring will not be disabled with the limited restriction flag.
disable monitor

server 10.10.40.2 prefer
server 10.7.40.23
server 10.7.40.25
I wanted to check how the server was synchronizing, using the ntpq tool.
[root@ODA-node1 bin]# ntpq -p
ntpq: read: Connection refused
But I had to restart the service first. The reason is explained in my blog NTP is not working for ODA new deployment (reimage) in version:
[root@ODA-node1 bin]# service ntpd restart
Redirecting to /bin/systemctl restart ntpd.service
I could then check the NTP configuration.
[root@ODA-node1 bin]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 lantime.domain_name .INIT.      16 u    -  512    0    0.000    0.000   0.000
+ntp1.domain_name 131.188.3.223   2 u   32   64  377    0.364    0.635   0.247
*ntp2.domain_name 131.188.3.223   2 u   26   64  377    0.385    0.851   0.173
remote
The remote column will display the NTP server names, which should be resolvable by the DNS server. In my case these are the NTP servers I configured when creating the appliance; see the previous odacli describe-system command and the nslookup command below.
[root@psrpri231 ~]# nslookup ntp2.domain_name
Server:   10.7.9.9
Address:  10.7.9.9#53

Name: ntp2.domain_name
Address: 10.7.40.25

[root@psrpri231 ~]# nslookup ntp1.domain_name
Server:   10.7.9.9
Address:  10.7.9.9#53

Name: ntp1.domain_name
Address: 10.7.40.23
At that time there was a network routing issue to the preferred NTP server. This is why only the two other NTP servers appear as reachable in the ntpq output. This was most likely the root cause of the synchronization problem between the two nodes.
refid
The refid entry will show the current source of synchronization for each peer. In my case the source was 131.188.3.223, which is an NTP server of the Friedrich-Alexander University in Erlangen (Germany). This can be confirmed with a simple nslookup from my laptop.
C:\Users\maw>nslookup 131.188.3.223
Server:  UnKnown
Address:  192.168.1.254

Name:    ntp3.rrze.uni-erlangen.de
Address:  131.188.3.223
The character at the left margin will display the synchronization status of the peer.

The character * will highlight the peer that is currently selected. It is called the system peer.

The character + will highlight the peers that are considered acceptable for synchronization but are not currently selected. They are good reference time sources and survivors of the selection process. As potential system peers, they are called candidates.

The character - will highlight a peer that is discarded from the selection. Such time sources display a time that differs from the survivors' time. They are called falsetickers.
st
The st entry will display the peer stratum. The stratum measures the distance from the peer to the reference NTP source, i.e. how many hops/servers there are up to the hardware reference clock. In my case, the customer's NTP servers are stratum 2 servers. This indicates that they synchronize directly with the NTP server of the Friedrich-Alexander University, which is itself a stratum 1 server attached to a hardware refclock (stratum 0). The stratum can be seen as a level in the timing hierarchy.
t
The t entry will display the type of the peer.
u=unicast
m=multicast
l=local
-=don’t know
when
The when entry will display the time in seconds since the peer was last heard.
poll
The poll entry will display the polling interval in seconds, that is to say the delay between two successive requests sent to the peer.
reach
The reach entry will display, in octal format, the status of the last eight synchronization attempts. After 8 successful attempts the value will be 377.
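reach is really an 8-bit shift register printed in octal, one bit per poll (1 = the peer answered). A quick way to see that 377 means eight consecutive successes:

```shell
# 377 octal = 255 decimal = 11111111 binary: the last eight polls all succeeded
echo $(( 8#377 ))   # prints 255
```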
delay
The delay entry will display the latency the request takes from client to server and back again, i.e. the round-trip delay of the NTP request. This value is important for NTP to take the network latency into account and adjust the timing accordingly.
offset
The offset entry will give the time difference, in milliseconds, between the ODA itself and the peer (the synchronization server).
jitter
The jitter entry will give the dispersion, i.e. the magnitude of the variance in the measured offset. Each peer has a different amount of jitter: the lower the jitter value, the more accurate the peer.
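Once the column layout is clear, pulling the selected system peer and its offset out of the ntpq output is a simple awk one-liner. A sketch below, using a captured sample line instead of a live query (in real life you would feed it with `peers=$(ntpq -pn)`; the field numbers follow the standard ntpq -p layout shown above):

```shell
# Sample ntpq -pn line (numeric form of the ntp2 entry above)
peers='*10.7.40.25 131.188.3.223 2 u 26 64 377 0.385 0.851 0.173'

# Print the current system peer (the line flagged '*') and its offset in ms
echo "$peers" | awk '$1 ~ /^\*/ {sub(/^\*/, "", $1); print $1, "offset:", $9, "ms"}'
```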
Resolving and checking the cluster offset issue
Now that the NTP issue was resolved, I could restart the Oracle cluster and check that the nodes were up and running again.
I could first see that the ODA date had been automatically adjusted by NTP:
[root@ODA-node1 ~]# date
Wed Sep 16 11:29:31 CEST 2020
I could restart the cluster:
[root@ODA-node0 bin]# cd /u01/app/19.0.0.0/grid/bin/
[root@ODA-node0 bin]# ./crsctl start cluster
CRS-2672: Attempting to start 'ora.ctssd' on 'ODA-node0'
CRS-2676: Start of 'ora.ctssd' on 'ODA-node0' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'ODA-node0'
CRS-2672: Attempting to start 'ora.crsd' on 'ODA-node0'
CRS-2676: Start of 'ora.crsd' on 'ODA-node0' succeeded
CRS-2676: Start of 'ora.asm' on 'ODA-node0' succeeded
And I could check that the cluster was up and running:
[root@ODA-node0 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@ODA-node0 bin]# ./crsctl status resource -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.COMMONSTORE.advm
               ONLINE  ONLINE       ODA-node0                STABLE
               ONLINE  ONLINE       ODA-node1                STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       ODA-node0                STABLE
               ONLINE  ONLINE       ODA-node1                STABLE
ora.chad
               ONLINE  ONLINE       ODA-node0                STABLE
               ONLINE  ONLINE       ODA-node1                STABLE
ora.data.commonstore.acfs
               ONLINE  ONLINE       ODA-node0                mounted on /opt/orac
                                                             le/dcs/commonstore,S
                                                             TABLE
               ONLINE  ONLINE       ODA-node1                mounted on /opt/orac
                                                             le/dcs/commonstore,S
                                                             TABLE
ora.net1.network
               ONLINE  ONLINE       ODA-node0                STABLE
               ONLINE  ONLINE       ODA-node1                STABLE
ora.ons
               ONLINE  ONLINE       ODA-node0                STABLE
               ONLINE  ONLINE       ODA-node1                STABLE
ora.proxy_advm
               ONLINE  ONLINE       ODA-node0                STABLE
               ONLINE  ONLINE       ODA-node1                STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       ODA-node0                STABLE
      2        ONLINE  ONLINE       ODA-node1                STABLE
ora.DATA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       ODA-node0                STABLE
      2        ONLINE  ONLINE       ODA-node1                STABLE
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ODA-node0                STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       ODA-node1                STABLE
ora.RECO.dg(ora.asmgroup)
      1        ONLINE  ONLINE       ODA-node0                STABLE
      2        OFFLINE OFFLINE                               STABLE
ora.asm(ora.asmgroup)
      1        ONLINE  ONLINE       ODA-node0                Started,STABLE
      2        ONLINE  ONLINE       ODA-node1                STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       ODA-node0                STABLE
      2        ONLINE  ONLINE       ODA-node1                STABLE
ora.cvu
      1        ONLINE  ONLINE       ODA-node1                STABLE
ora.ODA-node0.vip
      1        ONLINE  ONLINE       ODA-node0                STABLE
ora.ODA-node1.vip
      1        ONLINE  ONLINE       ODA-node1                STABLE
ora.qosmserver
      1        ONLINE  ONLINE       ODA-node1                STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       ODA-node0                STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       ODA-node1                STABLE
--------------------------------------------------------------------------------
And finally, the ASM instance and the ASM listener were up and running again:
oracle@ODA-node0:/home/oracle/ [rdbms12102_1] ser
2020-09-16_11-34-31::dmk-run.pl::CheckListenerFile ::ERROR ==> Couldn't open the listener.ora : /u01/app/oracle/product/12.1.0.2/dbhome_1/network/admin
Dummy:
------
rdbms12102_1 : DUMMY (12.1.0.2/dbhome_1)

Database(s):
------------
+ASM1 : STARTED (grid)

Listener(s):
------------
ASMNET1LSNR_ASM : Up (grid)
LISTENER        : Up (grid)
LISTENER_SCAN1  : Up (grid)