Understanding ESXi – stateless, diskless, feckless

There are a couple of terms used when talking about ESXi design that always create confusion.  As many users, now late into the vSphere 4 cycle, consider their migration to ESXi, I want to take a moment to clear up these misunderstandings.  Hopefully this will also explain what ESXi is capable of in a little more depth, and make you think about its deployment options.

ESXi can be installed as diskless or diskful [sic].  ESXi can be configured as stateless or stateful.  However, I think those monikers are somewhat misleading and open to interpretation.  Here’s a bit of background first.

There are three interesting directories whose location ESXi needs to decide on: /bootbank, /locker and /scratch.

bootbank – this is the boot image, along with the vendor drivers and CIM providers.

locker – this is where it keeps the vSphere client and VMware tools and other non-essential stuff. (UPDATE: Carter Shanklin let me know that since 4.1, the locker directory doesn’t keep a copy of the VIC anymore)

scratch – for the state archive (configuration settings), logs, core dumps, diagnostic bundles.

Whether ESXi is stateless/stateful and diskless/diskful basically determines where this stuff is stored.  When ESXi boots up, it loads the running image entirely into RAM.  It uses a combination of tardisks, which are fixed in size and comprise static files, and ramdisks that can grow and shrink as required.  In addition to RAM, it can optionally make use of the disk that it booted from, and of a 4GB VFAT scratch partition that can live alongside the boot disk or be stored remotely.
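If you're curious where a particular host has ended up putting these, all three paths are, as far as I know, just symlinks on the ESXi filesystem, so resolving them shows the backing location.  A trivial Python sketch of the idea (in Tech Support Mode a plain ls -l / tells you the same thing, and whether you have a Python interpreter to hand is an assumption on my part):

```python
import os

# Sketch only: on an ESXi host these three paths are symlinks, and resolving
# them shows where the host has actually placed each one (a ramdisk, the boot
# device, or a remote VMFS/NFS volume).
for d in ("/bootbank", "/locker", "/scratch"):
    print("%-10s -> %s" % (d, os.path.realpath(d)))
```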

A stateless server doesn’t mean the server has no state (configuration), and a diskless server doesn’t mean it has no storage. So what do they mean?

Diskless – ESXi has a read-only bootdisk after it has booted up.

Diskful – ESXi has a writeable bootdisk after it has booted up.

Stateless – ESXi doesn’t actively persist its state across reboots.

Stateful – ESXi does persist its state across reboots.

So just to clarify, a diskless server does not necessarily mean a server with no spinning disks.  ESXi will consider boot-from-SAN and boot-from-USB key/SD card as diskful.  And tangentially, a server chock-full of the finest RAIDed disks can equally be configured as diskless if you really want.  A stateless ESXi server does not mean that it doesn’t have any configuration settings, just that the server stores its configuration in a volatile ramdisk and therefore isn’t usually authoritative for that configuration when it boots up.

I’ve highlighted the diskless and stateless definitions above mainly because those are the “new” deployment options.  If you’ve installed classic ESX before, it will have installed to writable disks and continuously saved its configuration to them.  So what do the two new options give you in particular?

Diskless – I believe the primary purpose of a diskless ESXi setup is to allow the image to be loaded via PXE boot. I don’t mean a PXE-booted installer that you then use to install an OS to disk; a “PXE boot” boot actually runs the OS from the network, not just the installer. This means that images can be stored centrally and replaced easily. Remember, ESXi loads the running boot image into RAM as it boots up, so it’s feasible to replace the image and have all servers “upgraded” en masse after their next reboot.

If a diskless server is also configured as stateless, it will use up to 4GB of RAM for a ramdisk-based scratch partition.  This eats into your RAM, and can be avoided with a stateful scratch partition created on a remote VMFS or NFS volume.

One advantage of a diskful server is that it stores two copies of the bootbank directory but only mounts one of them.  The offline copy can be used to boot from if a problem occurs with the live, mounted version.

Stateless – At first glance, the thought of a server which can lose its configuration after a reboot seems silly.  However, the point is that it allows the use of a centralized configuration authority.  One tool can push out configuration settings to all the hosts.  This is where Host Profiles potentially steps in to allow policy-based configuration, so you can avoid having to touch every server in a cluster when a new configuration change is required.  To me it’s all about separating the configuration from the deployment process.

The important impact of running ESXi as stateless is one of persistence.  Not only may the configuration need to be reloaded from somewhere remote after a reboot, but the logs and core dumps (think vmkcore contents) are lost when the server shuts down.  So an important consideration when deploying stateless ESXi is to configure a remote syslog server to capture those logs.
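As a concrete example, the remote syslog target on vSphere 4.x is, as far as I know, just the Syslog.Remote.Hostname advanced setting, which you can change through the vSphere Client’s Advanced Settings, vicfg-syslog, or the API.  Here’s a hedged sketch using the pyVmomi Python bindings (not something covered in this post); the host name, credentials and syslog target are all placeholders:

```python
# Hedged sketch: point an ESXi 4.x host at a remote syslog server by setting
# the Syslog.Remote.Hostname advanced option through the vSphere API.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="esxi01.example.com", user="root", pwd="password",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.HostSystem], True)
    host = view.view[0]  # the (only) host behind this direct connection
    host.configManager.advancedOption.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Syslog.Remote.Hostname",
                               value="syslog.example.com")])
finally:
    Disconnect(si)
```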

A diskless server is stateless by default. But my understanding is that if you configure a diskful server as stateless (it can write to the boot device, but it uses a volatile scratch partition), it will save the state archive to the boot disk every 10 minutes.  This was done to prevent wear on small USB-based storage, which was the most likely candidate for this type of setup.  The impact here is that reboots would lose logs and core dumps (as for all stateless setups), and the startup config could be different from the running config.  Potentially, any configuration changes made to the host in the last 10 minutes could be lost.  This is where the “actively” comes from in my stateless definition.  A stateless server can persist state across reboots if it’s diskful, but because it only copies it to disk periodically, I wouldn’t consider that active or necessarily authoritative.
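To pull the four combinations together, here is a rough Python sketch of how I read the behaviour described above.  It is purely an illustration of my understanding, not anything official:

```python
# Illustration only: my reading of the four diskless/stateless combinations.
def esxi_behaviour(diskful, stateful):
    if stateful:
        scratch = "persistent scratch partition (local disk or remote VMFS/NFS)"
        state = "state archive actively persisted across reboots"
    else:
        scratch = "volatile ramdisk scratch (up to 4GB of RAM)"
        if diskful:
            state = "state archive copied to the boot disk roughly every 10 minutes"
        else:
            state = "state lost on reboot; reloaded from a central authority"
    return scratch, state

for diskful in (True, False):
    for stateful in (True, False):
        scratch, state = esxi_behaviour(diskful, stateful)
        print("diskful=%-5s stateful=%-5s | %s | %s"
              % (diskful, stateful, scratch, state))
```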

So why do I care?

Usually, unless you plan to run ESXi as diskless, stateless or both, you will simply install it and not think about it, assuming it to be both stateful and diskful.  Most of us expect our servers to actively maintain their state across reboots, to save logs and dumps locally, and to load the bootable image from disk (or maybe a USB key/SD card).

However, all this information becomes more interesting when you realize that the amount of writeable storage available when you first build your ESXi server will, by default, dictate whether its state is persistent.  If 4GB is free locally (after the base install), then a 4GB scratch partition is created.  That doesn’t sound like much of a requirement these days, but as an example consider an HP BL490c (a blade server with no local disk bays).  You could be thinking this item would work wonders in your chassis as an ESXi server.  The G7 model has an onboard SD card slot. Perfect.  However, as one of my colleagues pointed out the other day, the largest HP-certified SD card is currently 4GB (as far as I know there is no HCL for SD cards, so your device should be certified by the server vendor for a truly supported configuration).

What a sweet ESXi hardware choice. But without you realizing it, the ESXi install sets the server up as stateless by default and creates the scratch partition on a ramdisk, and you probably never even realised the impact of this.  Now that you do, it can easily be fixed by redirecting the scratch partition to an external VMFS or NFS volume.
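For completeness, the redirect itself is, to the best of my knowledge, the ScratchConfig.ConfiguredScratchLocation advanced setting, and the host needs a reboot before it starts using the new location.  A hedged sketch, reusing the same kind of API connection as the syslog example earlier (the datastore path is a placeholder):

```python
# Hedged sketch: redirect the scratch partition to a folder on a shared
# datastore. "host" is a vim.HostSystem obtained exactly as in the syslog
# sketch above; the path below is a placeholder.
from pyVmomi import vim

host.configManager.advancedOption.UpdateOptions(changedValue=[
    vim.option.OptionValue(key="ScratchConfig.ConfiguredScratchLocation",
                           value="/vmfs/volumes/shared-datastore/.locker-esxi01")])
# A reboot is required before the host starts using the new scratch location.
```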

A combination of diskless and stateless makes for a very agile platform.  Imagine re-provisioning ESXi hosts with simple reboots.  They could pick up new configurations and boot images centrally, being re-roled for different purposes: different networking configurations, different storage LUNs or even different storage devices.  Your VDI cluster needs more horsepower at 9am, but after the initial rush you need more grunt for your Tier 1 VMs: ESXi servers automatically re-provisioned through a DRS-like automation tool.  Okay, I’m totally dreaming here (honestly, this is pure conjecture), but it’s not too difficult to imagine very scalable “cloud” (quotation marks and italics for extra special emphasis here :)) setups gobbling this stuff up for breakfast as soon as it becomes a workable, easily implementable deployment option.

Now I don’t know about you, but my head spins whenever I try to think these options through.  I find myself re-reading and re-correcting myself all the time. Hopefully what I’ve described above is reasonably clear.  If you disagree with any of the definitions or the impacts, let me know in the comments and I’ll consider updating the post.  I want it to be as definitive and accurate as possible, and I’m sure there are plenty of impacts I’ve not considered here.

In the new vSphere Design book, I mention elements of this subject, and I also include a table that helps you understand where the bootbank, locker and scratch directories are physically stored for each state and disk combination.

##################


Book follow-up

Following our announcement last week about the new VMware vSphere Design book, I wanted to follow up on some of the feedback we’ve received.

Firstly, and I know I speak for Scott and Maish here, we are extremely grateful for all the kind words and thoroughly encouraged by those of you who let us know via email, blog comments and twitter that you had already ordered your copy.  All three of us were very enthused about writing this book, because we felt it would fill a gap that exists at the moment.  I sincerely hope that everyone in the VMware community who grabs themselves a copy finds it useful.

There are a few recurring questions that we’ve been asked, so I’ll try to answer them here (let me know in the comments if there are others, and I’ll update the post):

Will there be an electronic/kindle/nook/etc version available?

Yes.  The publishers will release an electronic version at the same time that the paper copy leaves the warehouse.  It is up to each retailer to convert it into their chosen electronic format.  We expect the electronic versions to be available from your favourite retailers around the same time as the printed version.

Does the book focus on vSphere 4.0 or 4.1? Does it cover feature XXX in 4.1?

The book was conceived before the release of version 4.1, and in fact the writing had begun by the time 4.1 was released.  Personally I had completed one chapter and was partway through a second when 4.1 hit.  However, I went back and made sure I “retrofitted” those to include anything new from 4.1.  So all the chapters I wrote certainly cover 4.1, and I also tried to make note of where there was a difference between the two.

The book certainly holds for both versions, and I fully expect the majority of it to remain relevant for vSphere 4+1, as it tries to focus on conceptual ideas first.

Will the book be available in the UK/Europe/Netherlands/Timbuktu?

This book is distributed by a regular book publisher (Sybex, part of the Wiley group), not a self-publishing group.  As such it should be available to order from any reputable book retailer.  As with any technical book, you are more likely to find it in their online store first.  If you don’t find it on the shelf of your local bricks ‘n mortar bookshop, they should be able to order it for you.

Does it prepare me for my VCDX defence/application or VCAP-DCD exam?

This book is not specifically written for either the VCDX process or the VCAP-DCD exam.  It does discuss vSphere design in depth, and therefore should be valuable when working through the design process.  Primarily this book is intended as a practical guide to vSphere design, not a study guide.  If it does help you, then that’s great.

——-

I’m immensely proud to have it finished, and to have been able to work in such great company as Scott, Maish and Jason Boche (who was our technical editor).  Until it’s actually out there, here is a peek at the chapters:

  1. An introduction to designing vSphere environments
  2. ESX versus ESXi
  3. Designing the management layer
  4. Server hardware
  5. Networking
  6. Storage
  7. Virtual machine design
  8. Datacenter designs
  9. Designing with security in mind
  10. Monitoring and capacity planning
  11. Bringing it all together (design case-study)

——-

Now that I’ve actually finished working on it, it affords me some more time to get back to my (ir)regular blogging.  One thing I noted whilst writing the book was that there were occasionally not enough pages (and unfortunately never enough time) to dive as deeply into some areas as I’d have liked.  There were a number of areas I realized weren’t covered in much depth elsewhere, and I’d love to explore them more.  So, as time permits, I’m going to concentrate on some of those and dig into things a bit.  I’ve already got a couple of posts brewing; I’m just working out my plan for them now.  All will be published here on my blog.  Now I just need to find the hours to write them 🙂

##################


New vSphere Design Book

This is a joint post by three prominent writers in the virtualization community: Maish Saidel-Keesing, Scott Lowe and Forbes Guthrie.

For the past 6 months we have worked together on a new book.  This has been kept pretty quiet, but it’s now time to make it public.

Previous VMware vSphere books have focused on how to master the technology, deep-diving into certain elements and giving tips & tricks that help you manage your virtual infrastructure.  But we felt there was something missing in all these books – how to design an infrastructure.  For example:

– What kind of servers should I use?
– Which storage protocol should I choose: NFS, iSCSI or FC?
– How do I scale a vCenter server appropriately?

The three of us collaborated on the book, not only to explain how to configure each element of your infrastructure, but to make you think about all the options available, and how each choice can impact the overall design. It should help you find the right solution for your environment – because no “one size fits all”.

It is the only book focused on designing VMware vSphere implementations. It’s written for engineers and architects who plan, install, maintain and optimize vSphere solutions.

The book details the overall design process, server hardware selection, network layout, security considerations, storage infrastructure, virtual machine design, and more. We debate the merits of scaling up servers versus scaling out, ESX versus ESXi hypervisors, vSwitches versus dvSwitches, and FC, FCoE, iSCSI or NFS storage. We show you which tools can be used to monitor, plan, manage, deploy and secure your vSphere landscape. We run through the design decisions that a typical company might face, and question the choices you come to. The book is packed with real-world, proven strategies. VMware vSphere Design examines how the virtualization architecture for your company should ideally look, whether you are deploying a new environment or optimizing an existing infrastructure.

We would like to thank Jason Boche for acting as the technical editor for the book.

We hope you enjoy reading this book, as much as we enjoyed writing it.
Maish, Scott and Forbes.

The VMware vSphere Design book is now available for pre-order on Amazon and will be in the stores around the middle of March 2011.  Pre-order your copy today.

Large Pages – a problem of perception and measurement

This post is in response to Gabe’s recent post:

Large Pages, Transparent Page Sharing and how they influence the consolidation ratio

and then Frank Denneman’s reply here: Re: Impact of Large Pages on Consolidation ratios

Finished reading? Ok, let’s continue…

Let me try to summarize very briefly what happens.  When Large Pages are not in use, Transparent Page Sharing (TPS) runs across a host’s VMs periodically (every 60 minutes by default) and reclaims identical 4KB pages – think of it like de-duping your memory.  When using Large Pages (2MB), TPS does not try to reclaim those identical pages because of the “cost” of comparing these much larger pages.  That is, until there is memory contention, at which point the Large Pages are broken down into 4KB blocks and identical pages are shared.
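If the de-duping analogy helps, here is a toy Python illustration (nothing like the real vmkernel implementation, just the idea of sharing identical pages at two different granularities).  Two “VMs” built from the same small pool of 4KB pages share almost everything at 4KB granularity, but their 2MB pages almost never line up identically:

```python
# Toy illustration only -- not how the vmkernel does it. "Memory" is a byte
# string; chop it into pages of a given size, hash each page, and count how
# many bytes could be backed by an already-seen identical page.
import hashlib
import random

PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

def shareable_bytes(memory, page_size):
    seen, shared = set(), 0
    for off in range(0, len(memory), page_size):
        digest = hashlib.sha1(memory[off:off + page_size]).digest()
        if digest in seen:
            shared += page_size   # identical page: one physical copy would do
        else:
            seen.add(digest)
    return shared

# Two toy "VMs": each uses the same small pool of 4KB pages (think common
# guest OS code), but laid out in a different order.
random.seed(1)
pool = [bytes([i]) * PAGE_4K for i in range(64)]          # 64 distinct 4KB pages
vm1 = b"".join(random.choice(pool) for _ in range(4096))  # 16MB each
vm2 = b"".join(random.choice(pool) for _ in range(4096))
memory = vm1 + vm2

for size in (PAGE_4K, PAGE_2M):
    mb = shareable_bytes(memory, size) / float(1 << 20)
    print("page size %7d: %.1f MB of %d MB shareable" % (size, mb, len(memory) >> 20))
```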

Large Pages have been proven to offer performance improvements.  The strategy that VMware are following is technically advantageous: why incur the expense of figuring out which Large Pages are identical if the host doesn’t need to reclaim memory?  If it can back all VM memory requests with physical memory, then it doesn’t need to worry about reclamation yet.

However, the problem I see is one of perception.  If you run lots of VMs with very similar memory contents on a host, you would expect to see TPS kick in and deliver suitable memory savings.  But if the Large Pages don’t get shared until the host thinks it is under pressure, you won’t see those savings until there appears to be a problem.  The host will wait until it hits 94% memory usage (6% free) before it deems itself under memory contention and starts to break those Large Pages into smaller 4KB ones.  So in an environment with very similar VMs, you are consistently going to run your host at around 94% memory used.  This isn’t a technical issue.  All those identical memory pages can still be reclaimed, just as before, and you still get the performance benefit of Large Pages.  This is a perception issue.
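To put a number on it, here is a trivial calculation of where that threshold sits for a few host sizes (the 6% free figure comes from the behaviour described above; everything else is just arithmetic):

```python
# Illustrative arithmetic only: at what point does a host cross the "6% free"
# threshold described above and start breaking Large Pages back down into
# 4KB pages for TPS to share?
def contention_threshold_gb(host_ram_gb, min_free_pct=6.0):
    return host_ram_gb * (1.0 - min_free_pct / 100.0)

for ram in (48, 96, 144):
    print("%3d GB host: large pages start being broken down at ~%.1f GB used"
          % (ram, contention_threshold_gb(ram)))
```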

Most vSphere administrators probably don’t realize this, and their managers almost certainly don’t.  All they see is their hosts running out of memory: they have less than 10% memory free! Time for some more hosts.  And even those who understand that at this level there is potentially still more memory to be reclaimed can’t easily tell how much saving to expect or how far to push it.  Do we hide all memory usage in the vSphere Client until it hits only 4% or 2% free?  Obviously not. I think this is a measurement issue.