Introduction

Apart from the number of servers (1 vs 2), the main difference between Oracle Database Appliance lite (S/L) and High-Availability (HA) is the location of the data disks. They are inside the server on lite ODAs, and in a dedicated disk enclosure on HA ODAs. This is simply because when 2 nodes want to use the same disks, these disks have to be shared, and this is why HA needs a SAS disk enclosure.

Disk technology on ODA

On lite ODAs, disks are SSDs inside the server, connected directly to the PCI Express bus without any intermediate interface: this is NVMe technology. It is very fast. There are faster technologies, like NVRAM, but the price/performance ratio made NVMe a game changer.
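
As a quick illustration (a generic Linux sketch, not ODA-specific), you can check which NVMe devices the operating system sees, assuming the nvme-cli package is installed:

# List NVMe controllers and namespaces visible to the OS (requires nvme-cli)
nvme list
# Or simply look for NVMe devices on the PCI bus
lspci | grep -i nvme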

HA ODAs are not that fast regarding disk bandwidth, because NVMe only works for disks locally connected to the server’s motherboard. Both HA ODA nodes come with SAS controllers, and these are connected to a SAS disk enclosure filled with SAS SSDs. As this enclosure is quite big (the same height as the 2 nodes together), disk capacity is much higher than on lite ODAs. A fully loaded X9-2HA with SSDs offers 184TB, more than twice the 81TB capacity of a fully loaded X9-2L. Furthermore, you can add another storage enclosure to the X9-2HA to double the disk capacity to 369TB. And if you need even more capacity, there is a high-capacity version of this enclosure with a mix of SSDs and HDDs for a maximum raw capacity of 740TB. This is huge!
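
On an HA node, you can see the external SAS HBAs on the PCI bus; a quick sketch below (the exact controller names vary with the ODA generation):

# List the SAS controllers seen on the PCI bus (names vary with the ODA generation)
lspci | grep -i sas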

Hardware monitoring on ODA

Monitoring the ODA hardware is done from ILOM, the management console. ILOM can send SNMP traps and raise an alert if something is wrong. For an HA ODA, you have 2 ILOMs to monitor, as the 2 nodes are separate pieces of hardware. There’s a catch when it comes to monitoring the storage enclosure: this enclosure is not active, meaning that it doesn’t have any intelligence of its own and therefore cannot raise any alert. And the ILOM on the nodes is not aware of hardware outside the nodes. You may think that it’s not really a problem because the data disks are monitored by ASM. But this enclosure also has SAS interface modules connecting it to the nodes, and if one of these interfaces is down, you may not detect the problem.
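
For reference, here is the kind of configuration used on each node’s ILOM to send SNMP traps to a monitoring server. This is a generic ILOM CLI sketch run over ssh; the ILOM hostname, trap destination and community below are placeholders to adapt to your environment:

# Configure an SNMP trap destination on each node's ILOM (run from an admin host;
# ILOM hostname, destination and community are placeholders)
ssh root@oda-node0-ilom "set /SP/alertmgmt/rules/1 type=snmptrap level=minor destination=192.0.2.10 destination_port=162 snmp_version=2c community_or_username=public"
ssh root@oda-node0-ilom "show /SP/alertmgmt/rules/1"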

The use case

My customer has multiple HA ODAs, and I was doing a sanity check of these ODAs. Everything was fine until I ran orachk on an X6-2HA:

odaadmcli orachk
INFO: 2022-11-16 16:41:11: Running orachk under /usr/bin/orachk Searching for running databases . . . . .
........
List of running databases registered in OCR
1. XXX
3. YYY
4. ZZZ 
5. All of above
6. None of above
Select databases from list for checking best practices. For multiple databases, select 5 for All or comma separated number like 1,2 etc [1-6][5]. 6
RDBMS binaries found at /u01/app/oracle/product/19.0.0.0/dbhome_1 and ORACLE_HOME not set. Do you want to set ORACLE_HOME to "/u01/app/oracle/product/19.0.0.0/dbhome_1"?[y/n][y] y
...
FAIL => Several enclosure components controllers might be down
...

This is not something you want to see: my storage enclosure has a problem.

Let’s do another check with odaadmcli:

odaadmcli show enclosure

        NAME        SUBSYSTEM         STATUS      METRIC

        E0_FAN0     Cooling           OK          4910 rpm
        E0_FAN1     Cooling           OK          4530 rpm
        E0_FAN2     Cooling           OK          4920 rpm
        E0_FAN3     Cooling           OK          4570 rpm
        E0_IOM0     Encl_Electronics  OK          -
        E0_IOM1     Encl_Electronics  Not availab -
        E0_PSU0     Power_Supply      OK          -
        E0_PSU1     Power_Supply      OK          -
        E0_TEMP0    Amb_Temp          OK          23 C
        E0_TEMP1    Midplane_Temp     OK          23 C
        E0_TEMP2    PCM0_Inlet_Temp   OK          29 C
        E0_TEMP3    PCM0_Hotspot_Temp OK          26 C
        E0_TEMP4    PCM1_Inlet_Temp   OK          44 C
        E0_TEMP5    PCM1_Hotspot_Temp OK          28 C
        E0_TEMP6    IOM0_Temp         OK          22 C
        E0_TEMP7    IOM1_Temp         OK          28 C

The enclosure is not visible through one of the SAS controllers. There may well be a hardware failure, but the node is not able to tell exactly what is failing. According to what I found on MOS, it may be something as simple as an unplugged SAS cable.
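
Another way to cross-check from the OS: the shared disks of an HA ODA are accessed through Linux device-mapper multipath, so each shared disk should normally show one path per SAS controller. A quick sketch (the exact device names and output depend on the model and release):

# Each shared disk should show one active path per SAS HBA; a missing path
# can point to a controller, cable or enclosure I/O module issue
multipath -ll | head -20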

Let’s run a storage topology validation:

odacli validate-storagetopology
INFO    : ODA Topology Verification
INFO    : Running on Node0
INFO    : Check hardware type
SUCCESS : Type of hardware found : X6-2
INFO    : Check for Environment(Bare Metal or Virtual Machine)
SUCCESS : Type of environment found : Bare Metal
INFO    : Check number of Controllers
SUCCESS : Number of Internal RAID bus controllers found : 1
SUCCESS : Number of External SCSI controllers found : 2
INFO    : Check for Controllers correct PCIe slot address
SUCCESS : Internal RAID controller   : 23:00.0
SUCCESS : External LSI SAS controller 0 : 03:00.0
SUCCESS : External LSI SAS controller 1 : 13:00.0
INFO    : Check if JBOD powered on
SUCCESS : 0JBOD : Powered-on
INFO    : Check for correct number of EBODS(2 or 4)
FAILURE : Check for correct number of EBODS(2 or 4) : 1
ERROR   : 1 EBOD found on the system, which is less than 2 EBODS with 1 JBOD
INFO    : Above details can also be found in the log file=/opt/oracle/oak/log/srvxxx/storagetopology/StorageTopology-2022-11-16-17:21:43_34790_17083.log

EBOD stands for Expanded Bunch Of Disks, which is not very clear. But as the disks themselves are OK, the problem is probably related to the cabling or to one of the I/O modules (controllers) in the enclosure.
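
The enclosure’s I/O modules are presented to the OS as SCSI Enclosure Services (SES) devices, so you can also count how many of them a node can reach. A sketch, assuming the lsscsi package is installed:

# List SCSI enclosure services devices (one per reachable enclosure I/O module)
lsscsi | grep -i enclosu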

Solution

My customer went to the datacenter and first checked the cabling, but it was fine. Opening an SR on My Oracle Support quickly solved the problem: a new controller was sent, it was swapped in the enclosure with the defective one without any downtime, and everything has been fine since.

Conclusion

There is absolutely no problem with the HA storage enclosure not being smart. You don’t need smart storage for this kind of server, as the ODA is a “Simple. Reliable. Affordable” solution.

In this particular case, the failure was hard to detect as a real one. But my customer was running a RAC setup with a failure in one of its redundant components, possibly for months, and that is definitely not satisfying. From time to time, manual and human checks are still needed!
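
To avoid relying only on occasional human checks, a small script run regularly on each node can at least flag this kind of silent failure. Below is a minimal sketch based on the two commands used above; the grep patterns and the alerting part are assumptions to adapt to your environment:

#!/bin/bash
# Minimal periodic check of the ODA shared storage, to run as root on each node.
# It flags any enclosure component that is not OK and any storage topology error.
REPORT=$(odaadmcli show enclosure 2>&1; odacli validate-storagetopology 2>&1)
if echo "$REPORT" | grep -qiE 'FAILURE|ERROR|Not avail'; then
  # Replace this echo with your own alerting (mail, monitoring agent, ...)
  echo "Storage check failed on $(hostname):"
  echo "$REPORT" | grep -iE 'FAILURE|ERROR|Not avail'
fi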