Advisory: HP ProLiant BL465c G7 – Storage Fault with P410 removed

Last year I posted about an issue I found with the HP BL465c G7 blades where we were getting storage errors in the iLO 3. These blades were ordered without the P410 controller, as they boot ESXi from SD card and use SAN-attached storage for the virtual machines. The iLO would report the storage status strangely (see my earlier post), flip-flopping between OK and failed.

After several months with a job logged with HP, and being told that a fix would come in iLO 3 v1.6, I eventually tested the new firmware and, to my disappointment, the issue persisted. I didn’t want my hosts continuously alerting in vCenter with storage failures, so I’ve kept my blades at an older version that doesn’t pass this failure through to the CIM.

Anyway, just this morning I received an email from HP linking to an advisory on the problem. It says that there is still no firmware fix and that, as a workaround, the SAS/SATA cable should be removed from the backplane.

It is certainly not an ideal workaround, but it would allow me to upgrade the blades to the latest versions of iLO firmware. I’ll look at testing one and post an update on the issue 🙂

Storage alert bug running vSphere on HP BL465c G7 blades

I have recently been configuring some BL465c G7 blades at work running vSphere 5.0 Update 1 installed on an internal SD card and using FC storage for the datastores. I ran into a strange issue where from within the vSphere client my blade hosts would show as having alerts after running HP’s October firmware bundle containing the BIOS version A19 15/8/12.

Hosts showing alert status

After some investigation I found that they had storage faults, specifically the following alarm:

Alert Detail

Then when drilling into the Hardware Status tab (with the HP offline bundle installed) it shows that the storage controller has drive faults or missing disks as seen below.

Storage status showing failed drives

This was really weird, as these particular blades were ordered without the integrated P410 controller because we weren’t planning on using local disk.

Anyway, I spent a while trying different things to clear the alerts without any luck. Diving into the iLO from the System Information summary, the storage tab shows the same error:

iLO storage summary – BAD

But after a few minutes and a page refresh, the status clears and looks as expected:

iLO storage summary – OK

This continues to flip-flop between good and bad every minute or so. WHAT???
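To tell this kind of flip-flopping apart from a steady fault, one option is to sample the status at intervals and count how often it changes. The following is only a rough sketch: how the samples are collected is not shown, and the status strings, function names, and threshold are assumptions for illustration rather than anything the iLO exposes.

```python
from typing import Sequence


def count_transitions(samples: Sequence[str]) -> int:
    """Count how many times consecutive status samples differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples: Sequence[str], min_transitions: int = 3) -> bool:
    """Treat the status as flapping once it has changed at least min_transitions times."""
    return count_transitions(samples) >= min_transitions


# Hypothetical readings noted once a minute from the iLO storage summary page
readings = ["Fault", "OK", "Fault", "OK", "OK", "Fault"]
print(count_transitions(readings))  # number of status changes in the window
print(is_flapping(readings))        # True if the status keeps changing
```

A real failed drive would normally sit at a fault state, so a high transition count over a short window points at a reporting problem rather than at the hardware.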

I decided to roll back the BIOS (as well as the iLO firmware, which made no difference). What was interesting is that after going back to BIOS version A19 19/3/12, the iLO status still shows the same problem as above, but the CIM provider information within vSphere no longer shows the storage controller…because it DOESN’T HAVE ONE! :-) See the screenshot below:

Older BIOS doesn’t present storage controller CIM information

I logged a job with HP and after working through several diagnostics with them they came to the conclusion that this was definitely a bug and would be addressed in a future BIOS update.

So for those of you out there who have been pulling your hair out like I have: there is a bug, and there is no immediate fix other than rolling back to BIOS rev. A19 19/3/12 or earlier. Either that or you have to live with your hosts continually showing alerts 🙂 NOTE: Rolling back the BIOS only stops the alert from coming through into the vSphere client; it still shows up in the iLO status summary. However, the alert doesn’t appear to generate any IML events, and it does not show up in HP SIM either.
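If you have a lot of blades to check, the "A19 19/3/12 or earlier" rule can be expressed as a small comparison. This is a sketch only: it assumes the BIOS string has already been collected in the "family day/month/year" form used in this post, and the function names are made up for illustration.

```python
from datetime import date, datetime
from typing import Tuple

# Last BIOS release that, in my testing, did not pass the storage fault to vSphere
LAST_GOOD = ("A19", date(2012, 3, 19))


def parse_bios(version: str) -> Tuple[str, date]:
    """Split a string like 'A19 15/8/12' into (family, release date)."""
    family, raw_date = version.split()
    return family, datetime.strptime(raw_date, "%d/%m/%y").date()


def masks_alert(version: str) -> bool:
    """True if the BIOS is the same family and no newer than the last good release."""
    family, released = parse_bios(version)
    return family == LAST_GOOD[0] and released <= LAST_GOOD[1]


print(masks_alert("A19 19/3/12"))  # the rollback target
print(masks_alert("A19 15/8/12"))  # the October bundle BIOS
```

Treat a passing result as a first filter only, since the iLO summary will still flip-flop regardless of the BIOS version.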

This bug only appears to affect blades that don’t have the optional P410 controller installed, and I only have BL465c G7 blades to test this on. It may also affect other BL465c/BL460c models, G7 and earlier, where the controller is optional.

UPDATE: I have been told by HP that the bug is caused by the disk drive backplane being active even when the controller isn’t present, and they also suggest that it can be observed with any BIOS/iLO combination. I have also found that some blades seem fine with just the BIOS rollback, while others still bring the storage controller status back into vSphere. For these odd few, a rollback to iLO 3 v1.28 seems to fix the problem, hence I am making this my baseline for now.

UPDATE – October 2 2013

After stumbling across a few updates to the iLO 3 firmware I noticed that v1.57 specifically mentions the following fix:

  • Incorrect Drive status shown in the iLO 3 GUI when the HP Smart Array P410 controller is removed from the ProLiant c-Class Blade BL465c G7.

However, after testing this new firmware the same problem exists, and it is also present on the latest v1.61 firmware. What is interesting is that the error is slightly different: while the drives still flip-flop between “not installed” and “Fault”, the number of drive bays no longer changes. The number of drives is now always correctly shown as two…I guess progress is progress, right??? 😛

I’ll open a new case with HP and hopefully find a fix for this hugely annoying bug!!!