vSphere 4.0 bug on HA with dvSwitches
We are using Distributed Virtual Switches to provide a centralized network configuration to the ESXi hosts and naturally we are using HA to provide availability to the infrastructure.
The issue:
In case of an HA event, some of the virtual machines restarted by HA are not connected to the network.
In the network connection properties of the VM we can see that:
- The “connect at startup” checkbox is correctly maintained
- The “connected” checkbox is not!

In this situation, trying to connect the VM to the network (selecting the checkbox) results in an error message:
‘invalid config for device 0’
Obviously, this results in an environment where, even though all the VMs are running, the VMs restarted by HA are not reachable on the network!
The RCA:
In normal operation with a dvSwitch, the hostd process in ESX is in charge of creating the dvPort in the dvSwitch so that the VM can use the network (connect the VM to port number XX).
In case of an HA event, the hostd daemon does not do that, because HA uses a kind of “shortcut” to start the VM. As a result, the dvSwitch, even though it is present, configured and working on the ESX hosts where the VMs are restarted, is not aware of the dvPort to which the VM is connected.
Environment:
vSphere ESXi, 4.0.0 build 193498
vCenter version 4.0.0 build 162856
Workaround:
The most effective workaround so far is to restart all the affected VMs from vCenter (please note: from vCenter, not from the guest OS itself) so that the hostd daemon properly starts and connects the machine.
-Or-
Restart hostd on all the ESX hosts (which in the case of ESXi is not very handy to do!) and reconnect all the affected machines to the dvSwitch.
The problem has now been escalated to the developers at VMware and, since they were able to reproduce it in house, we expect a solution in a reasonable time. In the meantime, a bunch of scripts can help us work around the problem.
I have the same problem, but VMware are unable to find any internal reference to it. Do you have the SR number for this problem so I can give it to VMware? Do you have any further info on whether this is fixed in a future update?
Thanks
SteveH
The problem has been addressed (and, according to VMware, fixed) in Update 1 for vCenter.
From the release notes:
Virtual machine with vDS loses network connectivity on failover
After the primary datastore of a virtual machine is relocated, the virtual machine might lose network connectivity upon failover if the source virtual machine is configured with vDS. This issue might also occur on a virtual machine that is deployed from a template. An error message similar to the following might be displayed in the vCenter Client if you try to restore the network connectivity on the virtual machine: Invalid configuration for device '0'
This issue is resolved in this release.
Wonderful, Thanks for the info.
I experienced the same problem, but all my components are version 4 Update 1, so it looks like the problem is not entirely fixed.
Restarting or shutting down the VM from within vCenter didn’t fix it, and I didn’t try restarting the management agents on all the ESX hosts.
I found a much simpler solution. I changed the portgroup membership to something else, then changed it back to what it should be, then checked the Connected box.
Hi KMF
Yes, you are right: the problem is not fixed with Update 1. I forgot to update the post.
It will be fixed with Update 2. We are running Update 2 Early Access at the moment and the problem is fixed.
It turned out that it is a race condition between ESX/HA and vCenter.
To make it simple: after an HA event, HA restarts a machine, but vCenter is not yet aware of the HA event. So when the ESX host goes back to vCenter to reserve the port in the dvSwitch, vCenter disconnects the VM because the port is in use (by the same VM on another host, according to its version of the dvSwitch).
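The race described above can be illustrated with a small toy model. This is only a sketch in plain Python, not VMware code: the class `VCenterView`, the function `ha_restart` and all host, VM and port names are hypothetical and exist purely to show why the order of events matters.

```python
from dataclasses import dataclass, field


@dataclass
class VCenterView:
    """Hypothetical model of vCenter's record of who owns each dvPort."""
    port_owner: dict = field(default_factory=dict)  # port id -> (vm, host)

    def reserve(self, port, vm, host):
        """Return True if the VM may stay connected to `port` on `host`."""
        current = self.port_owner.get(port)
        if current is not None and current != (vm, host):
            # vCenter still believes the port is in use on another host,
            # so the request is refused and the vNIC ends up disconnected.
            return False
        self.port_owner[port] = (vm, host)
        return True

    def learn_host_failed(self, host):
        """Process the HA event: forget every port held on the failed host."""
        self.port_owner = {
            port: owner
            for port, owner in self.port_owner.items()
            if owner[1] != host
        }


def ha_restart(vc, port, vm, new_host, failed_host, vcenter_knows_first):
    """Simulate one HA restart and report whether the vNIC is connected."""
    if vcenter_knows_first:
        vc.learn_host_failed(failed_host)
    connected = vc.reserve(port, vm, new_host)
    if not vcenter_knows_first:
        vc.learn_host_failed(failed_host)  # too late for this VM
    return connected


if __name__ == "__main__":
    stale = VCenterView({100: ("vm1", "esx1")})
    print(ha_restart(stale, 100, "vm1", "esx2", "esx1", False))
    fresh = VCenterView({100: ("vm1", "esx1")})
    print(ha_restart(fresh, 100, "vm1", "esx2", "esx1", True))
```

In this model the VM only keeps its connection when vCenter has processed the host failure before the reservation request arrives, which is the ordering problem described in the comment.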
Your workaround is fine if you want to script it.
If you are working from the vSphere Client, though, you can get the same effect simply by reselecting the same port group (scroll down the dvPortGroup list and reselect).
Unfortunately any VMs that were created prior to the update being applied will need to have their network cards removed and recreated in order for the fix to apply.
See http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017861
Did you have the same problem or did you find some other way of getting it to work?
Hi Jason
Things are a bit confusing here because we are facing two different problems that show the same symptom (Invalid configuration for device 0).
Update 1 only addresses one of them, and the official workaround is the one you mentioned.
In my experience it is sufficient to move the vNIC to another dvPortGroup and then move it back, or to re-select the same port group (scroll down the dvPortGroup list and reselect). Both of these actions trigger hostd to update the configuration in the vCenter database (VPX_DVPORT table) for the port, resyncing it with the one in the dvSwitch on the ESX host (you can see it with net-dvs from ESX).
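To make the idea of the two views drifting apart more concrete, here is a minimal Python sketch. It assumes you have already extracted both views into plain dictionaries (dvPort id mapped to VM name); how to get that data out of the VPX_DVPORT table or out of net-dvs output is not shown, and the function name `out_of_sync_ports` is hypothetical.

```python
def out_of_sync_ports(vcenter_ports, host_ports):
    """Return the sorted dvPort ids whose connected VM differs between views.

    Both arguments are hypothetical dicts mapping a dvPort id to the name of
    the VM connected to it (None or a missing key means "no VM").
    """
    all_ids = vcenter_ports.keys() | host_ports.keys()
    return sorted(
        port for port in all_ids
        if vcenter_ports.get(port) != host_ports.get(port)
    )


if __name__ == "__main__":
    vcenter_view = {100: "vm1", 101: "vm2"}            # made-up sample data
    host_view = {100: "vm1", 101: None, 102: "vm3"}    # made-up sample data
    print(out_of_sync_ports(vcenter_view, host_view))
```

The ports reported this way would be the candidates for the "move away and back" or "reselect" action described above.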
Once this is addressed you can still hit the “race condition problem” I mentioned (and which we experienced), where vCenter does not cope with HA. Update 2, which should be released in the next few days if not today, will also fix this second problem.
Going back to your question:
My answer is: yes. We scripted a change of the dvPortGroup configuration from port binding “static binding” to “ephemeral – no binding”.
With ephemeral binding, the dvPortGroup behaves like a vSwitch when assigning ports: ESX assigns the ports to the VM.
In this way you remove the vCenter dependency for configuring/reconfiguring VM connectivity, but you still benefit from the centralized administration offered by dvSwitches.
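The difference in dependency can be summarised in a few lines of Python. This is only a toy decision model of the behaviour described above, not a call into any VMware API; `connect_vnic` and `VCenterUnavailable` are hypothetical names.

```python
class VCenterUnavailable(Exception):
    """Raised in this toy model when a connection needs vCenter but it is down."""


def connect_vnic(binding, vcenter_up):
    """Return which component assigns the dvPort for a vNIC connection.

    binding:    "static" or "ephemeral" (the two modes discussed in the post)
    vcenter_up: whether vCenter is reachable at that moment
    """
    if binding == "ephemeral":
        # The ESX host assigns the port itself, as with a standard vSwitch.
        return "host"
    if binding == "static":
        if not vcenter_up:
            raise VCenterUnavailable("static binding needs vCenter to assign the port")
        return "vcenter"
    raise ValueError(f"unknown binding type: {binding!r}")


if __name__ == "__main__":
    print(connect_vnic("ephemeral", vcenter_up=False))
    print(connect_vnic("static", vcenter_up=True))
```

With ephemeral binding the function never depends on `vcenter_up`, which is the property that matters in the incident scenarios that follow.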
What leads me to advise this is not only the bug mentioned in this article but also one or two incident scenarios where, with a static-binding dvPortGroup, you will not be able to recover easily.
Scenario one:
Your vCenter is a VM.
You will need vCenter to reconfigure your VM connectivity, but vCenter itself is disconnected. Catch-22!
Scenario two:
Your vCenter is installed in a Microsoft cluster (hence running with domain credentials) and your DCs are VMs.
In case of an incident affecting your vCenter and your DCs, if you have a dvPortGroup with static port binding and you have to reconfigure your DC connectivity to get it working, you won’t have any freedom to manage your VMs’ connectivity (… a double failure, OK, but still!)
More in general:
In case of a major disaster where manual actions are required (HA not failing over VMs or whatever …), you will have a dependency on your vCenter availability, but I bet you will be more interested in restoring VM availability than vCenter availability. At the end of the day your company’s business runs in the VMs and not in vCenter.
The migration is pretty much painless (once scripted) and can be done online without any downtime:
- Clone your dvPortGroup dvPG-A into dvPG-A-clone
- Migrate all NICs connected to dvPG-A to the clone (dvPG-A-clone)
- Change dvPG-A to “ephemeral – no binding”
- Migrate your NICs back
- Remove dvPG-A-clone
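The order of the steps above can be sketched as follows. This is not the Perl script mentioned in the comments and it does not talk to vCenter: it is a plain Python model working on in-memory dictionaries, and the function name `migrate_to_ephemeral` and the data layout are assumptions made for illustration.

```python
def migrate_to_ephemeral(portgroups, nics, name):
    """Apply the five migration steps to an in-memory model.

    portgroups: dict mapping a port group name to its binding type
    nics:       dict mapping a NIC label to the port group it is connected to
    name:       the port group to convert (for example "dvPG-A")
    Both dicts are modified in place; the list of performed steps is returned.
    """
    clone = f"{name}-clone"
    steps = []

    # 1. Clone the port group, keeping its current binding type.
    portgroups[clone] = portgroups[name]
    steps.append(f"clone {name} into {clone}")

    # 2. Move every NIC on the original port group to the clone.
    moved = [nic for nic, pg in nics.items() if pg == name]
    for nic in moved:
        nics[nic] = clone
    steps.append(f"migrate {len(moved)} NIC(s) to {clone}")

    # 3. Change the now empty original port group to ephemeral binding.
    portgroups[name] = "ephemeral"
    steps.append(f"change {name} to ephemeral")

    # 4. Move the same NICs back to the original port group.
    for nic in moved:
        nics[nic] = name
    steps.append(f"migrate {len(moved)} NIC(s) back to {name}")

    # 5. Remove the temporary clone.
    del portgroups[clone]
    steps.append(f"remove {clone}")
    return steps


if __name__ == "__main__":
    pgs = {"dvPG-A": "static"}
    vnics = {"vm1-nic0": "dvPG-A", "vm2-nic0": "dvPG-B"}
    for step in migrate_to_ephemeral(pgs, vnics, "dvPG-A"):
        print(step)
```

In the model every NIC is attached to some port group at each step, which mirrors why the procedure can be done online; a real script would replace each dictionary update with the corresponding vCenter operation.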
(I think this entire answer would deserve another article in our blog.)
Hope this helps
Wow Enrico, thank you so much. This information is very useful.
Would you be willing to share the script that you used to make the change so we don’t have to start from scratch with our script? Did you use powershell, perl, or something else?
No problem. The script is in perl and it’s in your mailbox.
Cheers,
Enrico
@Enrico Romani
I am also interested in the script.
If you are willing to share it, why not publish it?
Thank you,
Calin
Hi,
I have a cluster of 3 ESXi 4.1.0 (260247) hosts and run the VC as a VM.
We have a virtual distributed switch configured.
One of the hosts was restarted and now seems to have lost its connection to the vDS.
The networking section is blank.
This host had the VC. I removed it from inventory and added it to another host via the vSphere Client, but the NICs are not connected. When I try to connect the NIC, I get: invalid configuration for device '0'
Any advice? This is causing a lot of stress.
Hi,
I just had this problem with ESX 4.1.0 260247 & vCenter 259021 on Windows Server 2008 R2. My vCenter is a VM running on this cluster of two hosts. Both ESX hosts had lost the connection to the NFS datastore; the VMs were able to boot, but there was no network connection. I fixed this with a local switch on one of the hosts, and once vCenter regained its connection to both hosts, they were reconfigured and I was able to connect the VMs through vCenter. So this problem still isn’t fixed.
Does this affect v1000 VDS as well?
@Simon Wilmann
Hi Simon,
Unfortunately I cannot help you very much here. I’m not using v1000 at the moment, so I have not experienced the issue.
Enrico
A year later… I have had vSphere 5.0 for three weeks and am experiencing exactly the same problem with the device '0' issue. C’mon VMware, this needs to be fixed, or else it is no longer advisable to run vCenter in a VM when using dvSwitches. Highly inconvenient.
@Peter
Peter, I’m having the same issue on vSphere 5. This is used with View and is causing issues when View is provisioning new desktops. After it clones, the NIC is not connected, so my desktops don’t provision properly. I can also fix the issue by manually changing the port group, then changing it back and connecting.
@Brian
Hi, I can confirm that vSphere 5 has this issue too. I hit this bug today while I was stressing our cluster by repeated vMotions of all VMs between randomly chosen hosts. I’m not convinced it is related to HA as in vSphere 4.0 because according to the logs there was no HA restart of the affected VM. I’ll try to reproduce it.