Advisory: HP Proliant BL465c G7 – Storage Fault with P410 removed

I posted about an issue I found with the HP BL465c G7 blades last year where we were getting storage errors in the iLO 3. These blades were ordered without the P410 controller as they were booting ESXi from SD card and using SAN-attached storage for the virtual machines. The iLO would report the storage status strangely (see my earlier post:), as it would flip-flop between OK and failed.

After several months of having a job logged with HP and being told that a fix would come in iLO 3 v1.6, I eventually tested the new firmware and, to my disappointment, the issue persisted. I didn't want my hosts continuously alerting in vCenter with storage failures, so I've kept my blades at an older version that doesn't pass this failure through to CIM.

Anyway, just this morning I received an email from HP with an advisory link for the problem, saying that there is still no firmware fix and that, as a workaround, the SAS/SATA cable should be removed from the backplane.

It is certainly not an ideal workaround, but it would allow me to upgrade the blades to the latest iLO firmware. I'll look at testing one and post an update on the issue 🙂

vCenter Converter Standalone – Slow conversion rates and SSL

I thought I’d quickly write about this as it was something I was not previously aware of, mostly because I have not performed many P2V migrations using the Converter tool, and when I have, it’s mostly been with the old offline converter.

Anyway, I was doing some benchmarking of the conversion process on a Windows 2003 R2 server and I couldn’t understand why I seemed to be hitting a network throughput ceiling of around 10MB/s. At first I thought it must have been a routing issue, as this particular environment had multiple VLANs and the source machine was in a different VLAN to the ESXi host. The router in this case only had a routing throughput of 100Mb/s, so the 10MB/s made me think that this was the case.
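That hunch was at least arithmetically plausible; here’s the quick sanity check in Python (just unit conversion, nothing specific to this environment):

```python
def mbps_to_MBps(megabits_per_sec):
    """Convert link throughput from megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_sec / 8

# A 100Mb/s router tops out at 12.5MB/s of raw throughput, so a
# sustained ~10MB/s transfer looks a lot like a link/routing limit.
print(mbps_to_MBps(100))  # 12.5
```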

However, when I moved the host into the same VLAN as the source machine I got the same speeds…now I was really confused. Everything else in between seemed fine and I could not work out what was making the conversion so slow.

So I jumped on Twitter to see if anyone else had come across this before (admittedly Google probably would have told me :-P) and a couple of smart guys I know, @vStorage and @dmanconi, suggested I turn off SSL.

I immediately went back to Converter and was clicking around like crazy, thinking “where the hell do I turn that off!!??!!??”. Thankfully Google stepped in here and led me to this VMware KB article:

Aha! So as of vCenter Converter 5.0, SSL is enabled by default…I’m not sure why, to be honest; in my opinion, securing the traffic during a P2V would be the last thing on my mind, but that’s just me 🙂

Anyway, I followed the instructions and set the <useSsl> parameter to False, restarted the Worker service and kicked off my conversion again.
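For reference, the tweak lives in the worker’s converter-worker.xml file. The fragment below is a sketch from memory, so the exact element nesting and file location may differ between Converter versions; check the KB article for your build:

```xml
<!-- converter-worker.xml: restart the "VMware vCenter Converter Standalone
     Worker" service after editing. Element nesting shown here is an
     approximation and may vary between versions. -->
<Config>
  <nfc>
    <!-- false disables SSL encryption of the conversion data stream,
         trading transfer security for throughput -->
    <useSsl>false</useSsl>
  </nfc>
</Config>
```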

WOW, now I was getting around 50MB/s throughput on my conversion, around five times faster than before! This meant that my upcoming P2V jobs were going to complete in way less time than I first thought.

So a word of advice: if you are thinking about doing some P2Vs, or are not blown away by how slow they are running, apply this tweak and you will be away laughing.

Oh, while we are on the topic, for those of you using HP ProLiant servers, Guillermo Musumeci has written a handy tool for automating the removal of the HP ProLiant Support Pack drivers and software after you’ve done your P2V: …unfortunately at this moment the website won’t let me register, but I’m sure it will be working again soon. The tool has been around for some time but is one of those really handy tools to have and saves manually removing the HP drivers and services.

VMware VCAP5-DCA Experience

This morning I received a much anticipated email from VMware Certification. It had been almost three weeks since re-sitting my DCA exam and I was extremely nervous about the result.

My first attempt at the exam was terrible and it really threw me. At the time the lab environment was very slow and I panicked, wasting valuable time just thinking about the time! After that I was totally put off the experience and put my study on hold. In fairness I had just had another baby so I cut myself a little slack 🙂

Anyway, three months went by and I geared up once more to attack the beast. I sat the exam at the same testing centre in sunny Tauranga, New Zealand (seems to be the ONLY place you can actually book the exam in NZ anyways). This time the lab environment seemed much faster at responding and I found myself getting through questions far quicker. To be fair, I had also put more study into the areas I was clearly lacking in on the first attempt, and this made a lot of difference.

The exam was still extremely tough and it was clearly a very different set of questions and lab setup from the first time. I managed to run out of time with about five questions left completely unanswered. I still felt very nervous about how I did but hoped that I had done enough to pass, in this case more than 300/500. My first attempt came in at about 250/500 but I had only answered about 60% of the questions, so I was hopeful that this time was much better.

So, this morning I was up at about 6am with my two boys, half asleep sitting on the couch. I’m on my phone using the web browser to view webmail (don’t ask!) and the attached pdf result wouldn’t download!!!! Then when I finally managed to get it to download it wouldn’t open the pdf, argh!!!!!! I finally got the thing open and scrolled down to the score…389/500, woohoo!

I was so stoked, I really didn’t want to have to sit it a third time! Now that I have both DCD and DCA I can apply for VCDX. This will be my next major goal and realistically will take me some time. Between a busy job and family life there ain’t a lot of time left for preparing a design, but let’s hope that through work I can prepare a suitable one.

Finally, to give my two cents’ worth of advice to others out there considering taking the exam:

1. Work through the blueprint end to end, and I really do mean end to end. The exam covers stuff from all over the blueprint (go figure :-P)

2. Check out Jason Langer and Josh Coen’s study guides here: and here:, both of these guys fricken rule! Many thanks to them for their massive efforts in creating such a great resource.

3. Use AutoLab for creating your home lab environment; it will save you a heap of time: A BIG thanks to Alastair Cooke for his work, and to the others that have helped him. It is a fantastic tool for deploying vSphere at home.

4. Make sure that you don’t skip over areas that you think you already know. I did this both times and realised afterwards that I didn’t really know as much as I thought I did!

5. During the exam, manage your time very carefully. Don’t stall on any one particular question too long and if you get stuck, move on and come back later.

6. Oh, and lastly, if you do fail, don’t beat yourself up. It’s a real tough exam with a lot of content to work through in a short space of time. I was way too hard on myself the first time and this put me off getting back in the driver’s seat for a long time. Sometimes it is good to fail; it gives us perspective.

Well, that’s about it from me. I’m over the moon about passing and I can sit back for a little while now…just not too long eh! VCDX…

Study hard and good luck!

vSphere Update Manager – Remediation failures due to removable media

This is something that has caught me out a few times before so I thought I’d quickly post about it.

Have you ever tried to remediate a cluster using VUM and encountered a message something like this?

vum - error

The message itself doesn’t tell you a lot, leaving you to dive into the error logs within vCenter. This can be quite tedious, particularly if you don’t know what to look for. The best place to start is searching for entries containing “Error”…surprise surprise 😛

In this particular scenario I have several VMs with removable media attached to them; bear in mind your particular scenario could be different and may have nothing to do with removable media!

Digging through the VUM logs (normally found under C:\Users\All Users\VMware\VMware Update Manager\Logs or C:\Documents and Settings\All Users\Application Data\VMware\Update Manager) we come across a file named vmware-vum-server-log4cpp.log.

Browsing through the file searching for any errors we find the following entry:

vum - log error

Aha! We now have an explanation for why the remediation failed! You might think the easiest way to fix this is to check each of the VMs in the cluster and remove any removable media devices…

While it is probably a good idea to remove these generally speaking, there may be valid reasons for having them attached. Fortunately, VUM has a great feature found under “Maintenance Mode Options”!


vum - remediate disable media


By ticking the box at the bottom, the remediation task will automatically disable any attached removable media devices, allowing the task to complete successfully.
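If you want to see in advance which VMs would trip this up, the check itself is simple. Here is a sketch in Python: with pyVmomi you would pass a VM’s device list (vm.config.hardware.device), and the type names below mirror the vSphere API device classes, which is an assumption you should verify against your SDK. The stand-in classes at the bottom are purely for the demo and are not the real API types.

```python
# Sketch: find removable devices (CD-ROM/floppy) that are currently
# connected on a VM's virtual hardware.
def connected_removable_devices(devices):
    removable_types = {"VirtualCdrom", "VirtualFloppy"}
    found = []
    for dev in devices:
        if type(dev).__name__ in removable_types:
            conn = getattr(dev, "connectable", None)
            if conn is not None and getattr(conn, "connected", False):
                found.append(dev)
    return found

# Minimal stand-ins for a quick demo (NOT the real pyVmomi classes).
class _Conn:
    def __init__(self, connected):
        self.connected = connected

class VirtualCdrom:
    def __init__(self, connected):
        self.connectable = _Conn(connected)

class VirtualDisk:
    connectable = None

vm_devices = [VirtualCdrom(True), VirtualCdrom(False), VirtualDisk()]
print(len(connected_removable_devices(vm_devices)))  # 1 connected CD-ROM
```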

Anyway, I hope this little trick helps others out there who are wondering why their cluster remediations aren’t completing!

CPU Ready and vCPU over-subscription

I’ve been doing a bit of performance tuning on some of the clusters I look after at work and started looking deeper into CPU Ready times. This particular metric has always been something I’m aware of, along with its impact on performance, but I had never gone looking for issues relating to it, mostly because I’d never had a host or cluster that was that over-subscribed!

Anyway, I thought I’d do a quick post on what CPU Ready times mean, how you can measure them and how you can help reduce them…here goes.

What is CPU Ready???

The term CPU Ready is a bit confusing at first, as one would assume that it refers to how much CPU is ready to be used, but this is not the case. The lower the CPU Ready time, the better!

CPU Ready is the amount of time a VM is ready to run but is waiting for the CPU scheduler to allow it to run on a physical processor. I.e. “I’m ready to go but I can’t do anything yet!”.

So now that we have a better understanding of what CPU Ready is, let’s look at what can cause this value to increase and hurt your VMs’ performance.

What causes CPU Ready times to increase???

1. Over-commitment/Over-subscription of physical CPU

This is the most common cause and happens when you have allocated too many vCPUs in relation to the number of physical CPU cores in your host.

From what I have read, for best performance you should keep your pCPU:vCPU ratio at 1:3 or better, i.e. no more than three vCPUs allocated per physical core. So in other words, if your host has a total of four CPU cores, you should not allocate more than a total of 12 vCPUs to the VMs on that host. This isn’t to say you can’t have more, but you may run into performance problems doing so.
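The guideline boils down to simple arithmetic; here’s a throwaway Python helper (the 3:1 threshold is the rule of thumb above, not a hard limit):

```python
# Flag hosts whose total vCPU allocation exceeds a chosen
# vCPU-per-physical-core threshold (3:1 by default).
def oversubscription_ratio(total_vcpus, physical_cores):
    return total_vcpus / physical_cores

def is_oversubscribed(total_vcpus, physical_cores, max_ratio=3.0):
    return oversubscription_ratio(total_vcpus, physical_cores) > max_ratio

print(is_oversubscribed(12, 4))  # 12 vCPUs on 4 cores = 3:1, within guideline -> False
print(is_oversubscribed(16, 4))  # 4:1 exceeds the guideline -> True
```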

2. Using CPU affinity rules

Using CPU affinity rules across multiple VM’s can cause high CPU Ready times as this can restrict how the CPU scheduler balances load. Unless specifically required I would not recommend using CPU affinity rules.

3. Using CPU limits on virtual machines

Another potential cause of CPU Ready is using CPU limits on virtual machines. Again, from what I have read I would suggest that you do not use CPU limits unless absolutely necessary. CPU limits can prevent the scheduler from allocating CPU time to a VM if it were to violate the limit set, hence causing ready times to increase.

4. When Fault Tolerance is configured on a VM

The last scenario is where you have deployed a VM using FT and the primary and secondary VMs can’t keep up with the synchronisation of changes. When this happens, the CPU can be throttled, causing higher ready times.

Now that we’ve covered what can cause CPU Ready times to increase, let’s look at how to measure and reduce them. For this example I’ve used the most common cause, over-provisioning.

How do I look for CPU Ready issues???

Take the example below: it is a VM that has been configured with four vCPUs. Looking at the last day’s CPU usage, you can see this particular VM is doing almost nothing (it is a test VM).


When I then look at the CPU Ready times for the same period, I see that the summation value is around 9200ms. Remember that both of these charts are the last-day roll-up.

Now you are probably thinking, what the hell does that mean? Well, we can convert this summation into a percentage to make things a little easier to quantify.

The formula is simply this:

CPU Ready % = (Summation value / (chart update interval in seconds x 1000)) x 100

The available update intervals are listed below (refer to KB article 2002181):

  • Realtime: 20 seconds
  • Past Day: 5 minutes (300 seconds)
  • Past Week: 30 minutes (1800 seconds)
  • Past Month: 2 hours (7200 seconds)
  • Past Year: 1 day (86400 seconds)

So returning to our previous chart of the last day we get:

(9200 / (300 x 1000)) x 100 = 3.07%

Now this isn’t a bad CPU Ready time percentage but it will do for the purposes of this example. VMware recommends that for best performance CPU Ready % should be less than 5%.
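To save doing this by hand, the formula and interval table collapse into a few lines of Python (interval values as per KB 2002181):

```python
# Convert a CPU Ready summation (milliseconds) into a percentage for a
# given chart update interval. Interval lengths are from KB 2002181.
INTERVALS = {
    "realtime": 20,      # seconds
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}

def cpu_ready_percent(summation_ms, interval):
    interval_s = INTERVALS[interval]
    return (summation_ms / (interval_s * 1000)) * 100

# The worked example above: 9200ms on a past-day chart.
print(round(cpu_ready_percent(9200, "past_day"), 2))  # 3.07
print(round(cpu_ready_percent(54, "realtime"), 2))    # 0.27
```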

Given that my virtual machine is by no means busy and has been given four vCPUs, I will now drop this back to two. Yes, I could drop it back even further to one, but for the purposes of this example I’ll bring it back to two 🙂

After powering off the VM, changing the vCPU and powering back on I get a significant drop in CPU Ready time as seen below in a Real-time chart.


Running a new calculation on this value of around 54ms (a very rough guesstimate average :-P) we get this:

(54 / (20 x 1000)) x 100 = 0.27%

As you can see, the average CPU Ready time has decreased quite significantly simply by lowering the resources committed to the VM. Obviously this would only be practical on VMs that are not vCPU-constrained.

In my experience most people (myself included) over-allocate vCPUs, particularly when translating vendor hardware requirements into virtual machine requirements! Some of the worst I’ve seen are when sizing some of the Microsoft System Center products. The sizing guides often suggest dual quad-core physical servers, but this does not mean you should give your VM eight vCPUs.

I think the best approach is to size lower and adjust accordingly if you are hitting a CPU resource limit. Spend some time looking over your environment and see where you might be able to tune your performance, you might be surprised at how much you can improve it!

Removing Datastores/LUNs within vSphere 5

It’s been a while since my last post (too long!) and I thought I’d talk about something that has recently come back to bite me in the rear end.

I’m sitting at my desk at home doing some work, in particular documenting a procedure for some volume migration that needs to happen on several clusters. I’ve been stepping through my documentation to test it and I’ve hit a strange issue that appears to be linked to removing LUNs from a host incorrectly.

I had unmounted several datastores that were no longer required, while still maintaining the recommended two for HA datastore heartbeating. I made sure that I disabled SIOC on the datastores, and when unmounting them I was shown lots of green ticks telling me I’d completed all of the prerequisites.

However, I proceeded to un-present the LUNs from the hosts without first detaching the LUN from the Storage->Devices view:

Unmounting Device

Unmounting Device

Bear in mind that the above screenshot shows attempting to unmount a test volume I have, hence the red tick!

What can happen if you don’t perform this step is an APD or All Paths Down state for the host. Cormac Hogan has a great article here about the APD condition:

Unfortunately, in this particular case I un-presented the LUNs without properly detaching them first. When I tried a rescan of the iSCSI software HBA, the operation eventually timed out and my host disconnected. I now have a host running production VMs that cannot be managed via the vSphere client or via SSH/PowerShell AND the DCUI has frozen! Yay.

So in summary, if you want to ensure you don’t cause yourself any unwanted pain when removing datastores/LUNs from a host or cluster, make sure you follow the KB article PROPERLY!

Oh, and to add further pain to this problem, it seems to have the same symptom I’ve seen before when applying host profiles, where the vMotion VMkernel port IP address gets applied to the management VMkernel port…WEIRD!

Anyway, I’d better get back to fixing what I’ve broken 😛

Storage alert bug running vSphere on HP BL465c G7 blades

I have recently been configuring some BL465c G7 blades at work, running vSphere 5.0 Update 1 installed on an internal SD card and using FC storage for the datastores. I ran into a strange issue where, from within the vSphere client, my blade hosts would show as having alerts after applying HP’s October firmware bundle containing BIOS version A19 15/8/12.

Hosts showing alert status

Hosts showing alert status

After some investigation I found that they had storage faults, specifically the following alarm:

Alert Detail

Alert Detail

Then when drilling into the Hardware Status tab (with the HP offline bundle installed) it shows that the storage controller has drive faults or missing disks as seen below.

Storage status showing failed drives

Storage status showing failed drives

This was really weird as these particular blades were ordered without the integrated P410 controller as we weren’t planning on using local disk. Weird…

Anyway, I spent a while trying different things to clear the alerts without any luck. Diving into the iLO from the System Information summary, the storage tab shows the same error:

iLO storage summary

iLO storage summary – BAD

But after a few minutes and a page refresh, the status clears and looks as expected:

iLO storage summary - OK

iLO storage summary – OK

This continues to flip-flop between good and bad every minute or so. WHAT???

I decided to roll back the BIOS and iLO firmware (the iLO rollback made no difference), and what was interesting is that when going back to BIOS version A19 19/3/12, the iLO status still shows the same problem as above, but the CIM provider information within vSphere no longer shows the storage controller…because it DOESN’T HAVE ONE! :-). See the screenshot below:

Older BIOS doesn't present storage controller CIM information

Older BIOS doesn’t present storage controller CIM information

I logged a job with HP and after working through several diagnostics with them they came to the conclusion that this was definitely a bug and would be addressed in a future BIOS update.

So for those of you out there who have been pulling your hair out like I have: there is a bug, and there is no immediate fix other than rolling back to BIOS rev. A19 19/3/12 or earlier. Either that or you have to live with your hosts continually showing alerts 🙂 NOTE: Rolling back the BIOS only masks the alert from coming through into the vSphere client; it still shows up in the iLO status summary. However, the alert doesn’t appear to generate any IML event log entries, and it does not show up in HP SIM either.

This bug only appears to affect blades that don’t have the optional P410 controller installed, and I only have BL465c G7 blades to test on. It may also affect the BL465c/460c G7 and earlier models where the controller is optional.

UPDATE: I have been told by HP that the bug is caused by the disk drive backplane being active even when the controller isn’t present, and they also suggest that it can be observed with any BIOS/iLO combination. I have also found that some blades seem fine with just the BIOS rollback, while some still bring the storage controller status back into vSphere. For these odd few, a rollback to iLO 3 v1.28 seems to fix the problem, hence I am making this my baseline for now.

UPDATE – October 2 2013

After stumbling across a few updates to the iLO 3 firmware I noticed that v1.57 specifically mentions the following fix:

  • Incorrect Drive status shown in the iLO 3 GUI when the HP Smart Array P410 controller is removed from the ProLiant c-Class Blade BL465c G7.

However, after testing this new firmware the same problem exists, and it is also present on the latest v1.61 firmware. What is interesting is that the error is slightly different: while the drives still flip-flop between “not installed” and “Fault”, the number of drive bays no longer changes and is always correctly shown as two…I guess progress is progress, right??? 😛

I’ll open a new case with HP and hopefully find a fix for this hugely annoying bug!!!