ESXi disks must be "considered local" for scratch to be created

A new KB was released yesterday (http://kb.vmware.com/kb/1033696), in which I noticed something interesting.

ESXi Installable creates a 4 GB FAT16 partition on the target device during installation if there is sufficient space, and if the device is considered local.

This made me prick up my ears, as only a couple of weeks ago I was having problems using a kickstart script to deploy ESXi to some HP DL580 G7 servers.  The issue arose because the ESXi installer treated the local disk controller as non-local.

<aside>
To get around this kickstart issue, I had to add “remote” to the firstdisk option on the autopart line, so it ended up looking like this:
autopart --firstdisk=local,remote --overwritevmfs
Basically, this tells the installer to try the first local disk and, if it can’t find one, to go for the first remote disk.  Clearly this increases the chances of accidentally wiping a SAN LUN, but as the site had migrated to NFS only, I wasn’t too concerned.
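For context, here is a minimal ESXi 4.1 kickstart sketch with that workaround in place – the password, install source and network settings are just placeholder values:

accepteula
rootpw changeme
autopart --firstdisk=local,remote --overwritevmfs
install url http://deploy.example.com/esxi41
network --bootproto=dhcp --device=vmnic0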
</aside>

So I had a quick check of a few ESXi hosts that I had rolled out recently, and sure enough no scratch partition had been created.  This was unexpected behaviour, as the hosts did indeed have local spinning disks with enough free space (4GB) for the scratch partition to be created during the install.  This means there will be no persistent scratch area – the scratch will instead be created on a volatile ramdisk, which eats a bit of your host’s memory, and means the scratch contents don’t survive a reboot.  After further investigation I found this was also true on some DL380 G6 servers, but not on some DL380 G5 servers.  It seems this is something you want to go and check yourself on a case-by-case (RAID controller-by-RAID controller) basis.

To check whether a host has a scratch partition, log in via the TSM and run:
cat /etc/vmware/locker.config
EDIT – see here for an update
If the file is blank, then no scratch is configured.

[Screenshot: the locker.config output without a scratch partition]

[Screenshot: the locker.config output with a scratch partition created by the installer]

To create a scratch partition for these servers on their local “non-local” disks, follow the steps in the KB.  You can do this after deployment via the vSphere Client, vCLI, PowerCLI or the TSM.

Here is an outline of doing it at the TSM (a worked example follows the list):
1.  Create a directory on the local VMFS volume
2.  Run vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/DatastoreName/DirectoryName
3.  Reboot the host
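Put together, it looks something like this at the TSM – the datastore and directory names here are just examples, so substitute your own:

mkdir /vmfs/volumes/datastore1/.locker-esx01
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-esx01
reboot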

The KB also details how to add this configuration to your kickstart files for future deployments or rebuilds.
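If memory serves, the kickstart version boils down to a %firstboot section along these lines (again with placeholder names – check the KB for the exact syntax):

%firstboot --unsupported --interpreter=busybox
mkdir /vmfs/volumes/datastore1/.locker-esx01
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-esx01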

Great new MS clustering KB

(Update: BTW, I’m not being sarcastic.  I really do like this KB very much; I think it’s an excellent resource.)

One of my more popular posts, written a couple of years ago, was about configuring Microsoft Cluster Service (MSCS) on VMs.  Getting MSCS configured correctly in VMs has always been tricky.  Even now there is still some mystery surrounding VMware’s support of MSCS (or Windows Failover Clustering, as it is now known) and Microsoft’s other clustering technologies.  Often these grey areas are as much a result of Microsoft’s careful wording around which physical hardware it considers supported as of how that wording translates into the virtual hardware world that VMware presents to us.

So I was delighted this weekend to see a new Knowledge Base article published by VMware, which deciphers the support requirements for each of Microsoft’s clustering technologies:

http://kb.vmware.com/kb/1037959

The KB even includes this rather natty ready-reckoner:
[Screenshot: the KB’s clustering support matrix]

Go and look, read and digest the entire pithy article for infinite wisdom: http://kb.vmware.com/kb/1037959

I want a template replicator vApp


Here’s the problem:

Companies with more than one office split by a WAN link have problems keeping their templates in sync.  There are two common approaches to this:
1) When updating a template, touch each site and update the same template in each location.
2) Update a master template and copy it out to each site.
Neither of these approaches scales well when you have multiple templates and lots of remote sites.

I dream of a vApp which you can deploy at each site, and which is aware of the other instances at the other sites.  The vApp’s sole purpose would be to watch for changes in a local template store on the designated “master” appliance, and replicate those changes out to all connected instances at each site.  It would be nice if those changes were just block-level changes.  The templates could sit in the vApp itself on an NFS export, mounted by the local ESXi hosts, so the templates could be deployed and updated straight from the vApp.

What do you think?  Obviously these tasks can be offloaded to storage array replication, DataDomain-type devices, etc.  But I’d like a native tool that could be used anywhere and wouldn’t rely on specific equipment.  Let me know in the comments below if you already have any groovy tools or rsync scripts that you use to do this automagically.
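For what it’s worth, the crudest DIY version of this – assuming the template store is a directory that a Linux box at each site can reach, and that SSH is set up between the sites – is a one-line rsync, which only ships the changed portions of each file:

rsync -avz --delete --inplace /mnt/templates/ siteb:/mnt/templates/

It’s a long way from a self-aware vApp, but it’s a start.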


Hopefully, VMware Labs can create me a new fling for Christmas 🙂

Manually rolling back a failed ESX 3.5 to 4.x upgrade

I thought I’d write a quick post about this, as much for my own reference as anything.  Maybe someone else out there will find it useful one day.  Its purpose is to document the changes made by the ESX 3.5 to 4.x rollback script.

Normally when you do a 3.5 to 4.x upgrade, if the upgrade fails it automatically rolls the changes back for you.  Assuming the install completes successfully, you can do one of two things afterwards.  You can manually remove the old 3.5 cruft with the cleanup-esx3 shell script.  Or, if you decide there is something not quite right with the new 4.x install, you can manually roll the server back to 3.5 by running the rollback-to-esx3 script.  All pretty straightforward stuff really.

However, I was just in a situation where I had to boot back into a 3.5 install (until you run the cleanup script, the 3.5 boot option remains in the GRUB boot menu), and I wanted to run the rollback script.  Normally you run the rollback script from within the booted 4.x install.  But in this case I couldn’t boot into the 4.x image (following a failed patching session), so I couldn’t get to the rollback script, which lives in /usr/sbin/ inside the esxconsole.vmdk file – a file which 3.5 doesn’t mount.

The reason: the server had recently been upgraded to 4.1 just fine, but when the latest 4.1 U1 patches were subsequently applied, the update went belly-up.  This particular server had been built by a consultant with only a 100MB boot partition.  The ESX 4 kernel images are around 25MB each and aren’t removed automatically, so the /boot partition can fill up after a couple of patching sessions.

The sort of nonsensical VUM error messages I was getting included:
HostPatchESXUpdateFailure
HostUpgradeIncompatible
RemediateFailure
HostUpgradePrecheckTestFailBootStorage
<whinge>This is the one that finally made me check the filesystem usage, but you’d think it could just say explicitly why it failed and save us all a lot of head-scratching.</whinge>
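For reference, confirming a full /boot partition from the ESX service console is just standard Linux:

df -h /boot
ls -lh /boot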

Most of the vCenter error messages are much improved these days.  However, VUM messages are still dreadful.

Fortunately I had access to another recently upgraded host, which I hadn’t yet patched to 4.1 U1, so I could get to the rollback script.  I just made the same changes manually.  (Actually, I didn’t remove the files; I moved them to the /tmp directory instead – see the sketch after the rollback script below.)

Once I’d done that, I was able to rescan the host in VUM again and run the upgrade from 3.5 to 4.1 again.  Then I could run the cleanup script, and then upgrade it to 4.1U1.

So here are the contents of the rollback-to-esx3 script:

# Remove the ESX 4.x kernel, initrd and System.map files from /boot
rm -rf /boot/config-2.6.*
rm -rf /boot/initrd-2.6.*
rm -rf /boot/initrd.img
rm -rf /boot/System.map-2.6.*
rm -rf /boot/vmlinuz-2.6.*
rm -rf /boot/vmlinuz
rm -rf /boot/trouble
cp /boot/grub/grub.conf.esx3 /boot/grub/grub.conf        ## restores the ESX 3 grub.conf – make a copy of the current one first
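And if, like me, you’d rather move the files out of the way than delete them outright, something like this achieves the same result with everything parked in /tmp:

mkdir -p /tmp/esx4-boot
mv /boot/config-2.6.* /boot/initrd-2.6.* /boot/initrd.img /tmp/esx4-boot/
mv /boot/System.map-2.6.* /boot/vmlinuz-2.6.* /boot/vmlinuz /boot/trouble /tmp/esx4-boot/
cp /boot/grub/grub.conf /boot/grub/grub.conf.esx4        ## keep a copy of the 4.x config, just in case
cp /boot/grub/grub.conf.esx3 /boot/grub/grub.conf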

And for completeness, here is the cleanup-esx3 script:

rm /usr/sbin/rollback-to-esx3
sed -i -e '/^# BEGIN migrated entries/,/^# END migrated entries/d' /etc/fstab
# Remove old ESX v3 titles in grub.conf
sed -i -e '/^# BEGIN ESX v3 title/,/^# END ESX v3 title/d' /boot/grub/grub.conf
# Remove old ESX v3 boot files:
rm -f /boot/initrd-2.4.21-58.ELvmnix.img-dbg
rm -f /boot/initrd-2.4.21-58.ELvmnix.img
rm -f /boot/System.map
rm -f /boot/System.map-2.4.21-58.ELvmnix
rm -f /boot/config-2.4.21-58.ELvmnix
rm -f /boot/kernel.h
rm -f /boot/vmlinuz-2.4.21-58.ELvmnix
rm -f /boot/initrd-2.4.21-58.ELvmnix.img-sc
rm -f /boot/vmlinux-2.4.21-58.ELvmnix