Having supported ODAs of different generations and versions for several years, we have occasionally faced hardware alerts reported by the ILOM. However, not all of them are related to real hardware issues; some are false positives. To get rid of those, the solution is to reset them manually.
When a hardware error occurs, the first reaction is to open a Service Request and provide an ILOM snapshot to support. This can easily be done using the Maintenance menu in the ILOM web interface.
Based on the snapshot, support may confirm that the alert is simply a false positive. Another option, if support is too slow to answer, is simply to give the reset a try 😀
However, this requires a server reboot to make sure the alert has really disappeared.
Here is an example of a fault alarm we faced, reported against the CPU among other components:
Date/Time                 Subsystems          Component
------------------------  ------------------  ------------
Tue Feb 13 14:00:26 2018  Power               PS1 (Power Supply 1)
        A loss of AC input power to a power supply has been detected.
        (Probability:100, UUID:84846f3c-036d-6941-eaca-de18c4c236bd,
        Resource:/SYS/PS1, Part Number:7333459, Serial Number:465824T+1734D30847,
        Reference Document:http://support.oracle.com/msg/SPX86A-8003-EL)
Thu Feb 15 14:27:04 2018  System              DBP (Disk Backplane)
        ILOM has detected that a PCIE link layer is inactive.
        (Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72,
        Resource:/SYS/DBP, Part Number:7341145, Serial Number:465136N+1739P2009T,
        Reference Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Thu Feb 15 14:27:04 2018  System              MB (Motherboard)
        ILOM has detected that a PCIE link layer is inactive.
        (Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72,
        Resource:/SYS/MB, Part Number:7317636, Serial Number:465136N+1742P500BX,
        Reference Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)
        ILOM has detected that a PCIE link layer is inactive.
        (Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72,
        Resource:/SYS/MB/P1, Part Number:SR3AX, Serial Number:54-85FED07F672D3DD3,
        Reference Document:http://support.oracle.com/msg/SPX86A-8009-3J)
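We can see that this single issue actually raised 3 alerts (DBP, MB and P1), all sharing the same UUID. The power supply entry has its own UUID and is a separate fault.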
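When the list of Open Problems is longer, it can help to group the entries by UUID to see which alerts belong together. The following is only a minimal sketch in Python, assuming the output has been saved as text. The function name `group_by_uuid` and the regular expression are illustrative and not part of any Oracle tooling.

```python
import re

# Each Open Problems entry carries "UUID:<36 chars>, Resource:<path>," in its details.
# \s* tolerates a line wrap between the two fields.
ENTRY_RE = re.compile(r"UUID:([0-9a-fA-F-]{36}),\s*Resource:([^,\s]+)")


def group_by_uuid(open_problems_text):
    """Map each fault UUID to the resources reported under it, in listing order."""
    groups = {}
    for uuid, resource in ENTRY_RE.findall(open_problems_text):
        groups.setdefault(uuid, []).append(resource)
    return groups


if __name__ == "__main__":
    # Details taken from the listing above, shortened to a few fields.
    sample = (
        "(Probability:100, UUID:84846f3c-036d-6941-eaca-de18c4c236bd, "
        "Resource:/SYS/PS1, Part Number:7333459)\n"
        "(Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72, "
        "Resource:/SYS/DBP, Part Number:7341145)\n"
        "(Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72, "
        "Resource:/SYS/MB, Part Number:7317636)\n"
        "(Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72, "
        "Resource:/SYS/MB/P1, Part Number:SR3AX)\n"
    )
    for uuid, resources in group_by_uuid(sample).items():
        print(uuid, len(resources), " ".join(resources))
```

A UUID that maps to several resources, like the PCIE one here, is a single fault reported against several components, so it only needs to be reset once.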
To reset such an alert, you first need to log in to the server as root and open the ILOM command line through the IPMI tool:
[root@oda-dbi01 ~]# ipmitool -I open sunoem cli
Connected. Use ^D to exit.
Oracle(R) Integrated Lights Out Manager
Version 4.0.0.28 r121827
Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
Warning: password is set to factory default.
Warning: HTTPS certificate is set to factory default.
Hostname: oda-dbi01-ilom
Once connected, you can list the Open Problems to get the same output as above using the following command:
-> ls /System/Open_Problems
In the list of Open Problems, we can find the UUID of the affected component (the UUID field in the entry's details):
Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)
        ILOM has detected that a PCIE link layer is inactive.
        (Probability:25, UUID:49015767-38b2-6372-9526-c2d2c3885a72,
        Resource:/SYS/MB/P1, Part Number:SR3AX, Serial Number:54-85FED07F672D3DD3,
        Reference Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Now it is time to access the fault manager shell to reset all alerts related to this UUID:
-> cd SP/faultmgmt/shell/
/SP/faultmgmt/shell
-> start
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
The alert is reset with the fmadm command:
faultmgmtsp> fmadm acquit 49015767-38b2-6372-9526-c2d2c3885a72
At this point the alerts are already removed from the Open Problems. However, to make sure the issue is really gone, we need to reboot the ODA and check the Open Problems again afterwards.
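The check after the reboot can also be scripted. The sketch below is an illustration only: `uuid_still_open` and `fetch_open_problems` are hypothetical helper names, and passing the ILOM command as a trailing argument to `ipmitool -I open sunoem cli` (instead of typing it interactively as above) is an assumption that should be verified on your own ODA first.

```python
import subprocess


def uuid_still_open(open_problems_text, uuid):
    """Return True if the fault UUID still appears in the Open Problems listing."""
    return f"uuid:{uuid.lower()}" in open_problems_text.lower()


def fetch_open_problems():
    """Run the listing through ipmitool and return its text output.

    Assumption: ipmitool forwards the trailing string to the ILOM CLI
    non-interactively. Must be run as root on the ODA itself.
    """
    result = subprocess.run(
        ["ipmitool", "-I", "open", "sunoem", "cli", "ls /System/Open_Problems"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    acquitted = "49015767-38b2-6372-9526-c2d2c3885a72"
    if uuid_still_open(fetch_open_problems(), acquitted):
        print("Fault is back after the reboot, probably not a false positive")
    else:
        print("Fault did not come back")
```

If the UUID shows up again after the reboot, the alert was most likely not a false positive and the Service Request should be followed up with support.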
Note that I showed here how to check the Open Problems using the IPMI command line, but the same output is also available in the ILOM web interface.
Hope it helps!