vCenter 5.1 SSO default domain clashes if domain is added to username for CommVault Auto-Discovery

TheSaffaGeek

Today on a customer site, after an upgrade of their vCenter to vCenter 5.1, the customer started experiencing problems with their CommVault Simpana 9 backup proxies talking to the vCenter server to see the hosts and datastores contained in the environment, and when they ran an auto-discovery on the environment it came back blank. I searched the VMware forums, the VMware documentation and the internet, and the customer searched Maintenance Advantage, but we couldn’t find a solution to the problem.

 

We tested logging into the vCenter server as the service account and noticed a very strange error when we ticked the “Use Windows session credentials” box. The error we received was “incorrect username or password”, yet the service account had the correct permissions. We realised that by unticking the “Use Windows session credentials” box and typing in the username and password for the service account and thereby…

View original post 126 more words

Removing Datastores/LUNs within vSphere 5

It’s been a while since my last post (too long!) and I thought I’d talk about something that has recently come back to bite me in the rear end.

I’m sitting at my desk at home doing some work, in particular documenting a procedure for doing some volume migration that needs to happen on several clusters. I’ve been stepping through my documentation to test it and I’ve hit a strange issue which appears to be linked to problems occurring when removing LUNs from a host incorrectly.

I had unmounted several datastores that were no longer required, while still maintaining the recommended two for HA datastore heart-beating. I made sure that I disabled SIOC on the datastores and when un-mounting them I was shown lots of green ticks telling me I’d completed all of the prerequisites.

However, I proceeded to un-present the LUNs from the hosts without first detaching the LUN from the Storage->Devices view:

Unmounting Device

Bear in mind that the above screenshot shows attempting to unmount a test volume I have, hence the red tick!

What can happen if you don’t perform this step is an APD or All Paths Down state for the host. Cormac Hogan has a great article here about the APD condition: http://blogs.vmware.com/vsphere/2011/08/all-path-down-apd-handling-in-50.html

Unfortunately for me, in this particular case I un-presented the LUNs without first detaching them properly. When I tried a rescan of the iSCSI software HBA the operation eventually timed out and my host disconnected. I now have a host running production VMs that cannot be managed via the vSphere client or via SSH/PowerShell AND the DCUI has frozen! Yay.

So in summary, if you want to ensure you don’t cause yourself any unwanted pain when removing datastores/LUNs from a host or cluster, make sure you follow the KB article PROPERLY! http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2004605
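
For reference, the rough order of operations from the ESXi shell looks something like the sketch below. The datastore name and naa ID are made-up placeholders, and this is only a memory aid rather than the full procedure; the KB covers the prerequisite checks (no running VMs, SIOC off, heartbeat datastores, and so on), so treat it as a sketch only:

# List mounted VMFS volumes and confirm the one you want to remove
esxcli storage filesystem list

# Unmount the datastore (by label here; -u <uuid> also works)
esxcli storage filesystem unmount -l MyOldDatastore

# Find the naa ID of the device backing the datastore
esxcli storage core device list

# Detach the device BEFORE un-presenting the LUN on the array
esxcli storage core device set --state=off -d naa.60000000000000000000000000000001

# Only once the LUN has been un-presented on the array, rescan
esxcli storage core adapter rescan --all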

Oh, and to add further pain, this problem seems to have the same symptom I’ve seen before when applying host profiles, where the vMotion vmkernel port IP address gets applied to the management vmkernel port…WEIRD!

Anyway, I’d better get back to fixing what I’ve broken 😛

Storage alert bug running vSphere on HP BL465c G7 blades

I have recently been configuring some BL465c G7 blades at work running vSphere 5.0 Update 1, installed on an internal SD card and using FC storage for the datastores. I ran into a strange issue where, from within the vSphere client, my blade hosts would show as having alerts after applying HP’s October firmware bundle containing BIOS version A19 (15/8/12).

Hosts showing alert status

After some investigation I found that they had storage faults, specifically the following alarm:

Alert Detail

Then, when drilling into the Hardware Status tab (with the HP offline bundle installed), it shows that the storage controller has drive faults or missing disks, as seen below.

Storage status showing failed drives

This was really strange, as these particular blades were ordered without the integrated P410 controller because we weren’t planning on using local disk. Weird…

Anyway, I spent a while trying different things to clear the alerts without any luck. Diving into the iLO from the System Information summary, the storage tab shows the same error:

iLO storage summary – BAD

But after a few minutes and a page refresh, the status clears and looks as expected:

iLO storage summary – OK

This continues to flip-flop between good and bad every minute or so. WHAT???

I decided to roll back the BIOS firmware (as well as the iLO firmware, which made no difference), and what was interesting is that when going back to BIOS version A19 19/3/12, the iLO status still shows the same problem as above, but the CIM provider information within vSphere no longer shows the storage controller…because it DOESN’T HAVE ONE! :-). See the screenshot below:

Older BIOS doesn’t present storage controller CIM information

I logged a job with HP and after working through several diagnostics with them they came to the conclusion that this was definitely a bug and would be addressed in a future BIOS update.

So for those of you out there that have been pulling your hair out like I have, there is a bug, and there is no immediate fix other than rolling back to BIOS rev. A19 19/3/12 or earlier. Either that, or you have to live with your hosts continually showing alerts 🙂 NOTE: Rolling back the BIOS only masks the alert from coming through into the vSphere client; it still shows up in the iLO status summary. However, the alert doesn’t appear to generate any IML event log entries, and it does not show up in HP SIM either.

This bug only appears to affect blades that don’t have the optional P410 controller installed, and I only have BL465c G7 blades to test this on. It may also affect BL465c/460c G7 and earlier models where the controller is optional.

UPDATE: I have been told by HP that the bug is caused by the disk drive backplane being active even when the controller isn’t present, and they also suggest that it can be observed with any BIOS/iLO combination. I have also found that some blades seem fine with just the BIOS rollback, while some still bring the storage controller status back into vSphere. For these odd few, a rollback to iLO 3 1.28 seems to fix the problem, hence I am making this my baseline for now.

UPDATE – October 2 2013

After stumbling across a few updates to the iLO 3 firmware I noticed that v1.57 specifically mentions the following fix:

  • Incorrect Drive status shown in the iLO 3 GUI when the HP Smart Array P410 controller is removed from the ProLiant c-Class Blade BL465c G7.

However, after testing this new firmware the same problem exists, and it is also present on the latest v1.61 firmware. What is interesting is that the error is slightly different: while the drives still flip-flop between “not installed” and “Fault”, the number of drive bays no longer changes, and the number of drives is now always correctly shown as two…I guess progress is progress, right??? 😛

I’ll open a new case with HP and hopefully find a fix for this hugely annoying bug!!!

HP Lefthand OS 10.0 upgrade issue with certain management group password strings

A few weeks ago HP released the new LeftHand OS 10.0 for their StoreVirtual products. I’ve since upgraded my VSA clusters successfully; however, when attempting this on a test P4300 cluster I ran into some weird issues.

During the upgrade (which now has a nice progress window) the first node in the cluster would upgrade OK, but upon rebooting it would not reconnect to the CMC. When trying to log into the direct console on the upgraded node, it would not accept my Management Group (MG) password. I had no other option but to forcibly cancel the upgrade and rebuild the failed node.

After attempting a subsequent upgrade, the same problem occurred. This time I noticed in the status bar of the CMC that it was failing to log into the newly upgraded node, saying the username and password were incorrect! As a result I logged a job with HP.

The response I received was rather interesting…as it turns out they have had several similar incidents, all related to MG passwords containing the characters ‘~‘ or ‘$‘.

So, this got me thinking…if the password is corrupted by the upgrade, what happens if I try logging in with a partial password? I tried several combinations on my test cluster until, voilà, I could log in with the first part of the password leading up to the special character!

To clarify this, let’s say our MG password was abc123$def. After the upgrade the CMC will fail to reconnect, but you should be able to log into the node using abc123, i.e. we’ve dropped the $ symbol and anything after it. I’m not sure what happens if your password begins with one of those characters…could be interesting, as the CMC does not allow a blank password!

I would imagine HP will release a patch for this in the coming weeks; in the meantime you could change your MG passwords prior to the upgrade 🙂

HP LeftHand CMC 10.0 Changes

HP’s LeftHand / P4000 / StoreVirtual product has had a major version upgrade with the announcement of LeftHand OS 10.0. This release is the first to drop the SAN/iQ moniker in favour of the name of the company that created the product before HP’s acquisition a few years ago.

The release of this software upgrade was slated for the 4th of December if I’m not mistaken, but interestingly their FTP site has had the updated patches/upgrades available since the 26th of November.

I had the chance to download the new files (with some difficulty, I get the feeling their FTP site is taking a hammering at the moment!) and have since installed the new version of their Centralised Management Console or CMC.

Going into this upgrade I had high hopes for its new support for an internet proxy when downloading patches, the lack of which has really let the product down previously, in my opinion. In any case, the new version now allows you to specify a SOCKS proxy…yay!

Now, the bad news…

It does not allow you to specify any authentication for the proxy…argh!!!! In our environment this is a real pain from a security perspective and as such is not going to help. For now it will be back to downloading the media from an alternative location and copying it to the CMC. This in itself can prove to be tedious, particularly when the CMC decides that the downloaded media is corrupt and needs to re-download it! Oh well…baby steps eh 😛

CMC 10.0 Proxy Setting

On a more positive note, the new version now supports Active Directory integrated authentication. So far I can’t see where this is configured, but I’m guessing you’ll need to have your nodes upgraded to version 10 first…I’ll post an update on this shortly.

Further to this, there is now an additional SAN status page panel showing all active tasks, which should prove extremely useful; this was something that was lacking previously, especially when more than one administrator is managing multiple clusters from a single CMC. Again, I’ll post more on this when I see it in action. In the meantime here’s a shot of the Active Tasks pane; not very exciting, but it gives you an idea.

CMC 10.0 Active Tasks

So that seems to be about it for now; I’d be keen to hear from any others that have found new features I’ve missed. Once I’ve fully downloaded all of the new patches I’ll upgrade one of my test VSA clusters and post about that, and hopefully I’ll then be able to integrate the cluster security into AD 🙂

Thanks for reading!

vSphere 5.1 – VMware Tools filling event logs on a Terminal Server

I just stumbled across this bug while working on an ESXi 5.1 server running a Windows Server 2008 R2 Terminal Server VM.

The VM had been crashing randomly and throughout the Application event logs were the following messages:

[Screenshot]

Reading the event details showed the following:

[Screenshot]

After a bit of digging around I found this KB article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2036350

As it turns out this is a known issue, and a workaround currently exists to help alleviate the symptoms.

First, disable VMware Tools application logging by modifying the tools.conf file, normally found under C:\ProgramData\VMware\VMware Tools on Server 2008/R2/2012. If this file does not exist you can create it manually via Notepad or similar. Add the following lines to the file:

[logging]
vmusr.level = error

Save the file and restart the VMTools service from the Services snap-in tool.
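
If you’d rather not open the Services snap-in, restarting the service from an elevated command prompt should do the same job (this assumes the default Windows service name of VMTools, so check yours first):

rem Restart VMware Tools so the new tools.conf logging level is picked up
net stop VMTools
net start VMTools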

Secondly, disable the virtual machine general logging via the vSphere client. Edit the virtual machine settings, click on the Options tab, select Advanced->General and un-tick the “Enable Logging” tick box.

[Screenshot]

Save the configuration and then restart the VM. If you are unable to restart the VM, you can also vMotion it to another host to make this setting take effect.
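
For what it’s worth, un-ticking that box should simply correspond to the following entry in the VM’s .vmx file, which can be a handy way to confirm the change took; verify against your own environment rather than taking my word for it:

logging = "FALSE"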

Anyway, hope this helps 🙂

vSphere Home Lab: Part 3 – Procurves and static routing

So I’ve just spent the last three hours trying to work out why my Procurve switch wasn’t routing between the various VLANs I have configured for my home lab.

I had to move my three hosts and the switch into the garage because the heat in the office was becoming unbearable! Unfortunately, because of this I broke my connection to the iSCSI VLAN I had configured for my lab’s IP storage. Because I’m running my SAN management software on my main PC, I had a second NIC directly plugged into that VLAN; nice and simple, right?

However, when I moved the gear I no longer had two cables running to my main PC, I now only had one. I thought to myself, “surely I can set up some static routing!?!?”.

Anyway, as it turns out my little Thomson ADSL router supports static routing, cool! I configured this like so:

:rtadd 192.168.3.0/24 192.168.2.50 (where 192.168.3.0/24 is my iSCSI subnet and 192.168.2.50 is the management IP of the Procurve). Step one done!
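
As an aside, another option would have been to add the route directly on my main PC instead of (or as well as) on the ADSL router, which also saves the iSCSI traffic from hairpinning through the router. From an elevated command prompt on Windows it would look something like this (same subnets as above; adjust to suit):

rem Persistent route: send iSCSI subnet traffic via the Procurve's VLAN 1 address
route -p add 192.168.3.0 mask 255.255.255.0 192.168.2.50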

Next I jumped onto my Procurve 2910al and enabled ip routing, giving me this config:

hostname "SWITCH1"
module 1 type j9147a
no stack
ip default-gateway 192.168.2.1
ip route 0.0.0.0 0.0.0.0 192.168.2.1
ip routing
snmp-server community "public" unrestricted
spanning-tree legacy-path-cost
spanning-tree force-version stp-compatible
vlan 1
   name "DEFAULT_VLAN"
   no untagged 25-36
   untagged 1-24,37-48
   ip address 192.168.2.50 255.255.255.0
   exit
vlan 10
   name "vMotion"
   tagged 13-18
   no ip address
   exit
vlan 20
   name "FT"
   tagged 13-18
   no ip address
   exit
vlan 30
   name "iSCSI"
   untagged 25-36
   ip address 192.168.3.1 255.255.255.0
   exit
management-vlan 1

Now, doing a tracert from my main PC on VLAN 1, it would get as far as the Procurve, but the switch would respond with “destination net unreachable”.

I continued to try different commands and read several blog posts on configuring static routes and everything I had done looked fine!

I finally came across a comment someone had posted on a forum suggesting that when you specify a management VLAN on the switch, it breaks routing! ARGHHHHHHH! (In hindsight this makes sense: my PC was sitting on VLAN 1, which I had made the management VLAN, and the switch deliberately won’t route traffic into or out of a management VLAN.)

So, I ran "no management-vlan 1" and saved the config. Now the switch is properly routing all VLANs, yay!!!!!
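
In case it helps anyone else, the whole fix from the switch CLI was roughly the following (typed from memory, so double-check against your own model’s documentation):

configure terminal
no management-vlan 1
exit
write memory
show ip route

After that, show ip route should list a connected route for each VLAN that has an IP address, plus the static default route out to the ADSL router.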

Now I can fire up my HP P4000 CMC and connect to my VSA’s from my main pc on VLAN1, woohoo.