VMware High Availability Constructs
Date: Aug 25, 2010
When configuring HA, two major decisions need to be made:
- Isolation Response
- Admission Control
Both are important to how HA behaves and both have an impact on availability. It is essential to understand these concepts, and each comes with specific caveats. Without a good understanding of them it is very easy to increase downtime instead of decreasing it.
Isolation Response
One of the first decisions that needs to be made when HA is configured is the “isolation response”. The isolation response refers to the action that HA takes for its VMs when the host has lost its connection with the network. This does not necessarily mean that the whole network is down; it could just be this host’s network ports or just the ports that are used by HA for the heartbeat. Even if your virtual machine still has a network connection and only your “heartbeat network” is isolated, the isolation response is triggered.
Today there are three isolation responses: “Power off”, “Leave powered on” and “Shut down”. These answer the question of what a host should do when it has detected that it is isolated from the network.
The remaining, non-isolated hosts will always try to restart the virtual machines, no matter which of the following three options is chosen as the isolation response (a short sketch of this logic follows the list):
- Power off: When network isolation occurs, all virtual machines are powered off. It is a hard stop; to put it bluntly, the power cable of the VMs is pulled out!
- Shut down: When network isolation occurs, all virtual machines running on the host will be shut down using VMware Tools. If this is not successful within 5 minutes, a “power off” will be executed. This time-out value can be adjusted by setting the advanced option das.isolationShutdownTimeout. If VMware Tools is not installed, a “power off” will be initiated immediately.
- Leave powered on: When network isolation occurs on the host, the state of the virtual machines remains unchanged.
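To make the three responses concrete, here is a minimal Python sketch of the decision an isolated host makes. The vm object and its methods are hypothetical and exist only for illustration; the 300-second value mirrors the das.isolationShutdownTimeout default mentioned above, but none of this is actual VMware code.

```python
import time

# Hypothetical sketch of the isolation response logic described above.
# The vm object and its methods are made up for illustration; this is not a VMware API.

ISOLATION_SHUTDOWN_TIMEOUT = 300  # seconds, the das.isolationShutdownTimeout default


def handle_isolation(vm, isolation_response):
    if isolation_response == "Leave powered on":
        return                      # state of the virtual machine remains unchanged

    if isolation_response == "Power off":
        vm.power_off()              # hard stop: the virtual power cable is pulled
        return

    if isolation_response == "Shut down":
        if not vm.tools_installed:
            vm.power_off()          # no VMware Tools, so power off immediately
            return
        vm.guest_shutdown()         # clean shutdown via VMware Tools
        deadline = time.time() + ISOLATION_SHUTDOWN_TIMEOUT
        while time.time() < deadline:
            if vm.is_powered_off():
                return              # clean shutdown completed in time
            time.sleep(5)
        vm.power_off()              # shutdown took too long: fall back to a hard stop
```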
This setting can be changed in the cluster settings under Virtual Machine Options.
Figure 1: Cluster default settings
The default setting for the isolation response has changed multiple times over the last couple of years. Up to ESX 3.5 U2 / vCenter 2.5 U2, the default isolation response when creating a new cluster was “Power off”. This changed to “Leave powered on” as of ESX 3.5 U3 / vCenter 2.5 U3. With vSphere 4.0, however, it has changed again: the default setting for newly created clusters is “Shut down”. When installing a new environment, you might want to change the default setting based on your customer’s requirements or constraints.
The question remains: which setting should you use? The obvious answer applies here; it depends. We prefer “Shut down” because we do not want to run our virtual machines on a degraded host, and it will shut down your virtual machines in a clean manner. Many people, however, prefer to use “Leave powered on” because it eliminates the chance of a false positive and the associated downtime. A false positive in this case is an isolated heartbeat network combined with a non-isolated virtual machine network and a non-isolated iSCSI / NFS network.
That leaves the question of how the other HA nodes know whether the host is isolated or has failed.
HA actually does not know the difference. The other HA nodes will try to restart the affected virtual machines in either case. When the host has failed, a restart attempt will take place no matter which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will not be able to restart the affected virtual machines. This is caused by the lock on the VMDK and swap files. None of the hosts will be able to boot a virtual machine when the files are locked. For those who don’t know, ESX locks files to prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a host fails, this lock expires and a restart can occur.
To reiterate, the remaining nodes will always try to restart the “failed” virtual machines. In the case of an isolation event, the lock on the VMDK files belonging to these virtual machines prevents them from being started. This assumes that the isolated host can still reach the files, which might not be true if the files are accessed over the network via iSCSI, NFS or FCoE. HA, however, will repeatedly try to start the “failed” virtual machines when a restart is unsuccessful.
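The locking behaviour is easiest to picture with an exclusive-lock analogy. The sketch below uses plain POSIX advisory locks on a local file, which is emphatically not how ESX locks VMDK files on VMFS or NFS; it only illustrates the principle that a second party cannot acquire a lock that is still held.

```python
import fcntl

# Local analogy only: ESX implements its own locking on VMFS/NFS, but the effect
# is the same as an exclusive lock that a second opener cannot take while the
# first holder is still alive.

def try_lock(path):
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle      # lock acquired: this "host" may power on the VM
    except BlockingIOError:
        handle.close()
        return None        # lock held elsewhere: the restart attempt fails

holder = try_lock("/tmp/example-vm.lck")    # first "host" owns the lock
second = try_lock("/tmp/example-vm.lck")    # refused while the holder exists
print(second is None)                       # True: the second power-on fails
```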
The number of retries is configurable as of vCenter 2.5 U4 with the advanced option “das.maxvmrestartcount”. The default value is 5. Pre-vCenter 2.5 U4, HA would keep retrying forever, which could lead to serious problems, as described in KB article 1009625, where multiple virtual machines were registered on multiple hosts simultaneously, leading to a confusing and inconsistent state. (http://kb.vmware.com/kb/1009625)
HA will try to start the virtual machine on one of the hosts in the affected cluster; if this is unsuccessful on that host, the restart count is increased by 1. The next restart attempt will then occur after two minutes. If that one fails, the next will occur after 4 minutes, and if that one fails, the following will occur after 8 minutes, until “das.maxvmrestartcount” has been reached.
To make this clearer, look at the following list:
- T+0 Restart
- T+2 Restart retry 1
- T+4 Restart retry 2
- T+8 Restart retry 3
- T+8 Restart retry 4
- T+8 Restart retry 5
As shown in the bullet list above and clearly depicted in the diagram below, a successful power-on attempt could take up to 30 minutes when multiple power-on attempts are unsuccessful. However, HA does not give a guarantee, and a successful power-on attempt might never take place.
Figure 2: High Availability restart timeline
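The retry schedule can also be written out as simple arithmetic. The snippet below is just a sketch of the 2/4/8/8/8-minute intervals and the roughly 30-minute worst case described above; it is not VMware code.

```python
# Sketch of the restart retry schedule: the first retry comes after 2 minutes,
# the next after 4, then every 8 minutes, until das.maxvmrestartcount
# (default 5) has been reached.

def restart_schedule(max_retries=5):
    attempts, elapsed = [0], 0            # T+0: the initial restart attempt
    for retry in range(1, max_retries + 1):
        elapsed += min(2 ** retry, 8)     # intervals of 2, 4, 8, 8, 8 minutes
        attempts.append(elapsed)
    return attempts

print(restart_schedule())  # [0, 2, 6, 14, 22, 30] -> last attempt around T+30 minutes
```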
Split-Brain
When creating your design, make sure you understand the isolation response setting. For instance, when using an iSCSI array or NFS-based storage, choosing “Leave powered on” as your default isolation response might lead to a split-brain situation.
A split-brain situation can occur when the VMDK file lock times out. This could happen when the iSCSI, FCoE or NFS network is also unavailable. In that case the virtual machine is restarted on a different host while it is not powered off on the original host, because the selected isolation response is “Leave powered on”. This could potentially leave vCenter in an inconsistent state, as two VMs with the same UUID would be reported as running on both hosts. It would cause a “ping-pong” effect where the VM would appear to live on ESX host 1 at one moment and on ESX host 2 soon after.
VMware’s engineers recognized this as a potential risk and have come up with a solution for this unwanted situation, as explained by one of the engineers on the VMTN Community forums. (http://communities.vmware.com/message/1488426#1488426)
In short: as of version 4.0 Update 2, ESX detects that the lock on the VMDK has been lost, raises a question asking whether the virtual machine should be powered off, and automatically answers that question with “yes”. However, you will only see this question if you connect directly to the ESX host. HA will generate an event for this auto-answer, though, which is viewable within vCenter. Below you can find a screenshot of this question.
Figure 3: Virtual machine message
As stated above, as of ESX 4.0 Update 2 the question will be auto-answered and the virtual machine will be powered off to recover from the split-brain scenario.
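Conceptually, the ESX 4.0 Update 2 behaviour boils down to a simple check, sketched below with hypothetical helper methods; this is not VMware code, it only restates the auto-answer logic described above.

```python
# Hypothetical sketch of the split-brain recovery introduced in ESX 4.0 Update 2:
# a running VM whose VMDK lock has been taken over by another host is powered
# off automatically (the "power off?" question is auto-answered with yes).

def recover_from_split_brain(vm):
    if vm.is_running() and not vm.holds_vmdk_lock():
        vm.power_off()                         # the stale copy of the VM is removed
        return "auto-answered: powered off"    # an event is logged and visible in vCenter
    return "no action"
```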
The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them powered on?
As described above, in earlier versions “Leave powered on” could lead to a split-brain scenario: you would end up seeing virtual machines ping-ponging between hosts, as vCenter would not know where a VM resided because it was active in memory on two hosts. As of ESX 4.0 Update 2 this is no longer the case, and it should be safe to use “Leave powered on”.
We recommend reducing the chances of a split-brain scenario: configure a secondary Service Console on the same vSwitch and network as the iSCSI or NFS VMkernel portgroup and, pre-vSphere 4.0 Update 2, select either “Power off” or “Shut down” as the isolation response. By doing this you will be able to detect an outage on the storage network. We will discuss the options for Service Console / Management Network redundancy more extensively later in this book.
Isolation Detection
We have explained the options for responding to an isolation event. However, we have not extensively discussed how isolation is detected, and this is one of the key mechanisms of HA. Isolation detection is a mechanism that takes place on the host that is isolated. The remaining, non-isolated hosts don’t know whether that host has failed completely or is merely isolated from the network; they only know it is unavailable.
The mechanism is fairly straightforward, though, and works with heartbeats, as explained earlier. When a node receives no heartbeats from any of the other nodes for 13 seconds (the default setting), HA will ping the “isolation address”. Remember: primary nodes send heartbeats to primaries and secondaries; secondary nodes send heartbeats only to primaries.
The isolation address is the gateway specified for the Service Console network (or management network on ESXi), but it is possible to specify one or more additional isolation addresses with an advanced setting. This advanced setting is called “das.isolationaddress” and can be used to reduce the chance of a false positive. We recommend setting at least one additional isolation address.
Figure 4: das.isolationaddress
When isolation has been confirmed, meaning no heartbeats have been received and HA was unable to ping any of the isolation addresses, HA will execute the isolation response. This could be any of the options described above: power off, shut down or leave powered on.
If even a single heartbeat is received or just one isolation address can be pinged, the isolation response will not be triggered, which is exactly what you want.
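The detection flow can be summarised in a few lines of Python. The sketch below assumes a hypothetical ping helper and simply restates the 13-second heartbeat window and the das.isolationaddress list; it is not how the HA agent is actually implemented.

```python
HEARTBEAT_TIMEOUT = 13  # seconds without heartbeats before isolation is suspected


def is_isolated(seconds_since_last_heartbeat, isolation_addresses, ping):
    """Return True only when no heartbeats arrive AND no isolation address replies."""
    if seconds_since_last_heartbeat < HEARTBEAT_TIMEOUT:
        return False                       # a single heartbeat keeps the host in the cluster
    for address in isolation_addresses:    # gateway plus any das.isolationaddress entries
        if ping(address):
            return False                   # one reachable address prevents the isolation response
    return True                            # isolation confirmed: execute the isolation response

# Hypothetical usage:
# is_isolated(14, ["192.168.1.1", "192.168.1.254"], ping=my_ping_function)
```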
Selecting an Additional Isolation Address
A question asked by many people is which address should be specified for this additional isolation verification. We generally recommend an isolation address as close to the hosts as possible, to avoid too many network hops. In many cases the most logical choice is the physical switch to which the host is directly connected; another usual suspect would be a router or any other reliable and pingable device.
Failure Detection Time
Failure Detection Time is a concept that is often misunderstood but is critical when designing a virtual infrastructure. It is basically the time it takes before the “isolation response” is triggered. There are two primary concepts when talking about failure detection time:
- The time the host will detect it is isolated
- The time the non-isolated hosts will mark the host as isolated and initiate the failover
The following diagram depicts the timeline for both concepts:
Figure 5: High Availability failure detection time
The default value for isolation failure detection is 15 seconds (das.failuredetectiontime). In other words, the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second, and a restart will be initiated by the failover coordinator.
For now let’s assume the isolation response is “Power off”. The isolation response “Power off” will be triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words, a “Power off” will be initiated on the fourteenth second. A restart will be initiated on the fifteenth second by the failover coordinator.
Does this mean that you can end up with your virtual machines being down and HA not restarting them?
Yes. When the heartbeat returns between the 14th and 15th second, the “Power off” might already have been initiated. The restart, however, will not be initiated, because the received heartbeat indicates that the host is no longer isolated.
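The 1-second window can be written out explicitly. The snippet below just restates the default timings with das.failuredetectiontime at 15 seconds; the numbers come straight from the text above and this is not VMware code.

```python
# Default timeline with das.failuredetectiontime = 15 seconds:
#   T+13  isolated host starts pinging the isolation address
#   T+14  isolated host triggers the isolation response ("Power off")
#   T+15  failover coordinator declares the host dead and initiates the restart

das_failuredetectiontime = 15                            # seconds
isolation_response_at = das_failuredetectiontime - 1     # 14: VMs are powered off
restart_initiated_at = das_failuredetectiontime          # 15: restart is initiated

# If heartbeats return inside this window, the VMs have already been powered
# off but no restart follows: the false positive described above.
print(restart_initiated_at - isolation_response_at)      # 1 second
```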
How can you avoid this?
Selecting “Leave powered on” as the isolation response is one option. Increasing das.failuredetectiontime will also decrease the chances of running into issues like these; with ESX 3.5 it was a standard best practice to increase the failure detection time to 30 seconds.
At the time of writing (vSphere) this is no longer a best practice, as the “1-second” gap exists with any value and the likelihood of running into this issue is small. We recommend keeping das.failuredetectiontime as low as possible.