Introduction
Apart from the number of servers (1 vs 2), the main difference between Oracle Database Appliance lite (S/L) and High-Availability (HA) is the location of the data disks. They are inside the server on lite ODAs, and in a dedicated disk enclosure on HA ODAs. Obviously, when 2 nodes want to use the same disks, these disks have to be shared, and this is why HA needs a SAS disk enclosure.
Disk technology on ODA
On lite ODAs, disks are SSDs inside the server, connected directly to the PCI Express bus without any intermediate controller: this is NVMe technology, and it is very fast. There are faster technologies, like NVRAM, but the price/performance ratio made NVMe a game changer.
HA ODAs are not that fast regarding disk bandwidth. This is because NVMe only works for disks locally connected to the server’s motherboard. Both HA ODA nodes come with SAS controllers, which are connected to a SAS disk enclosure holding SAS SSDs. As this enclosure is quite big (the same height as the 2 nodes together), disk capacity is much higher than on lite ODAs. A fully loaded X9-2HA with SSDs has 184TB, more than twice the 81TB capacity of a fully loaded ODA X9-2L. Furthermore, you can add another storage enclosure to the X9-2HA to double the disk capacity to 369TB. And if you need even more capacity, there is a high-capacity version of this enclosure with a mix of SSDs and HDDs for a maximum raw capacity of 740TB. This is huge!
Hardware monitoring on ODA
Monitoring the ODA hardware is done from ILOM, the management console. ILOM can send SNMP traps and raise an alert if something is wrong. For an HA ODA, you have 2 ILOMs to monitor, as the 2 nodes are separate pieces of hardware. There’s a catch when it comes to monitoring the storage enclosure. This enclosure is not active, meaning that it doesn’t have any intelligence and therefore cannot raise any alert. And the ILOMs of the nodes are not aware of hardware outside the nodes. You may think it’s not really a problem because the data disks are monitored by ASM. But this enclosure also has SAS interfaces connecting it to the nodes, and if one of these interfaces is down, you may not detect the problem.
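For reference, here is roughly what pointing an ILOM alert rule at an SNMP trap receiver looks like from the ILOM CLI. This is only a sketch: the receiver 10.1.2.3, the hostname and the community string are placeholders, and property names can differ slightly between ILOM versions.

ssh root@oda-node0-ilom
-> set /SP/alertmgmt/rules/1 type=snmptrap level=minor snmp_version=2c destination=10.1.2.3 destination_port=162 community_or_username=public
-> show /SP/alertmgmt/rules/1

You would do this on the ILOM of each node, but even then, the shared storage enclosure itself stays out of ILOM’s sight.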
The use case
My customer has multiple HA ODAs, and I was doing a sanity check of these ODAs. Everything was fine until I ran an orachk on an X6-2HA:
odaadmcli orachk
INFO: 2022-11-16 16:41:11: Running orachk under /usr/bin/orachk Searching for running databases . . . . .
........
List of running databases registered in OCR
1. XXX
3. YYY
4. ZZZ
5. All of above
6. None of above
Select databases from list for checking best practices. For multiple databases, select 5 for All or comma separated number like 1,2 etc [1-6][5]. 6
RDBMS binaries found at /u01/app/oracle/product/19.0.0.0/dbhome_1 and ORACLE_HOME not set. Do you want to set ORACLE_HOME to "/u01/app/oracle/product/19.0.0.0/dbhome_1"?[y/n][y] y
...
FAIL => Several enclosure components controllers might be down
...
This is not something nice to see: my storage enclosure has a problem.
Let me do another check with odaadmcli:
odaadmcli show enclosure
NAME SUBSYSTEM STATUS METRIC
E0_FAN0 Cooling OK 4910 rpm
E0_FAN1 Cooling OK 4530 rpm
E0_FAN2 Cooling OK 4920 rpm
E0_FAN3 Cooling OK 4570 rpm
E0_IOM0 Encl_Electronics OK -
E0_IOM1 Encl_Electronics Not availab -
E0_PSU0 Power_Supply OK -
E0_PSU1 Power_Supply OK -
E0_TEMP0 Amb_Temp OK 23 C
E0_TEMP1 Midplane_Temp OK 23 C
E0_TEMP2 PCM0_Inlet_Temp OK 29 C
E0_TEMP3 PCM0_Hotspot_Temp OK 26 C
E0_TEMP4 PCM1_Inlet_Temp OK 44 C
E0_TEMP5 PCM1_Hotspot_Temp OK 28 C
E0_TEMP6 IOM0_Temp OK 22 C
E0_TEMP7 IOM1_Temp OK 28 C
The enclosure is not visible through one of the SAS controllers. Maybe there is a failure, but the node is not able to tell for sure. It may also be related to an unplugged SAS cable, as I found on MOS.
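As neither ILOM nor ASM will raise an alert for this, a small scheduled check can at least make it visible. Here is a rough sketch: it looks for any component line in the odaadmcli output whose STATUS column is not OK. The awk column positions are an assumption based on the listing above, so they may need adjusting on other ODA releases.

#!/bin/bash
# Sketch of a cron-friendly check: report any enclosure component whose
# STATUS column is not "OK" (column layout assumed from the listing above).
NOT_OK=$(odaadmcli show enclosure | awk '$1 ~ /^E[0-9]_/ && $3 != "OK" {print}')
if [ -n "$NOT_OK" ]; then
  echo "Enclosure components not OK on $(hostname):"
  echo "$NOT_OK"
  exit 1   # non-zero exit code so cron or a monitoring agent can alert on it
fi
echo "Enclosure OK"

Running something like this from cron on both nodes would have surfaced the IOM problem much earlier.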
Let’s run a storage topology validation:
odacli validate-storagetopology
INFO : ODA Topology Verification
INFO : Running on Node0
INFO : Check hardware type
SUCCESS : Type of hardware found : X6-2
INFO : Check for Environment(Bare Metal or Virtual Machine)
SUCCESS : Type of environment found : Bare Metal
INFO : Check number of Controllers
SUCCESS : Number of Internal RAID bus controllers found : 1
SUCCESS : Number of External SCSI controllers found : 2
INFO : Check for Controllers correct PCIe slot address
SUCCESS : Internal RAID controller : 23:00.0
SUCCESS : External LSI SAS controller 0 : 03:00.0
SUCCESS : External LSI SAS controller 1 : 13:00.0
INFO : Check if JBOD powered on
SUCCESS : 0JBOD : Powered-on
INFO : Check for correct number of EBODS(2 or 4)
FAILURE : Check for correct number of EBODS(2 or 4) : 1
ERROR : 1 EBOD found on the system, which is less than 2 EBODS with 1 JBOD
INFO : Above details can also be found in the log file=/opt/oracle/oak/log/srvxxx/storagetopology/StorageTopology-2022-11-16-17:21:43_34790_17083.log
EBOD stands for Expanded Bunch Of Disks, which is not very self-explanatory. But as the disks are OK, this is probably related to the cabling or to a controller in the enclosure.
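The same idea can be applied to the validation command itself if you want to catch this kind of problem without waiting for the next manual orachk. A minimal sketch, assuming the FAILURE/ERROR prefixes shown above and a placeholder mail address:

#!/bin/bash
# Sketch: run the topology validation and mail the full output if any
# FAILURE or ERROR line shows up (dba-team@example.com is a placeholder).
OUT=$(odacli validate-storagetopology 2>&1)
if echo "$OUT" | grep -Eq '^(FAILURE|ERROR)'; then
  echo "$OUT" | mail -s "ODA storage topology problem on $(hostname)" dba-team@example.com
  exit 1
fi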
Solution
My customer went to the datacenter and first checked the cabling, but it was fine. Opening an SR on My Oracle Support quickly solved the problem: a new controller was sent, it was swapped with the defective one in the enclosure without any downtime, and everything has been fine since.
Conclusion
There is absolutely no problem with the HA storage enclosure not being smart. You don’t need smart storage for this kind of server, as the ODA is a “Simple. Reliable. Affordable” solution.
In this particular case, it’s hard to detect that the failure is a real one. But my customer was running a RAC setup with a failure in one of the redundant components, maybe for months. That’s definitely not satisfying. From time to time, manual checks by a human are still needed!