Beware the 802.3ad Beast!
I readily admit I’m not a network person. We have engineers for that where I work, so I don’t have to be. But every once in a while, something crops up that makes even our network engineers scratch their heads, shrug, and say something to the effect of: “from what I can tell, it should work.”
We use VMware ESX as our virtualization hypervisor and manage the hosts with vSphere. I love my VMware boxes. You’d have to pry them from my cold, dead fingers if you wanted to get them away from me. But last week brought me very close to actually cursing them. The problem seemed like something very basic: virtual machines on one ESX host couldn’t always talk to virtual machines on another ESX host. Identically configured hosts, no less! If I ssh’d into the VM that had become unreachable and issued a “ping” back to the first VM, traffic started flowing again. Something was simply forgetting how to get to that other VM after some number of minutes. We tried every setting we could think of on the Cisco switch. We even called VMware support; they looked at everything and couldn’t find any glaring errors.
The next day, on a whim, I had our network engineer enable EtherChannel (Cisco’s link aggregation feature, the same idea that 802.3ad standardizes) on the switch ports my ESX hosts were plugged into, and I switched the ESX servers’ NIC load balancing from originating port ID to “IP hash.” Voilà! It magically started working.
I had been so frustrated the previous day that I was a little annoyed, truth be told, that the fix was something so simple. So if your VMs have trouble talking to other VMs reliably and you think your switches are losing the ARP entries for those VMs, enable EtherChannel on the switch ports that ESX host’s NICs are plugged into and change the load balancing to “IP hash” in the vSwitch properties in your vSphere client. It’ll save you a lot of headaches.
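For the curious, here is a rough sketch of why those two settings belong together. With “IP hash,” the uplink a packet leaves on depends on both the source and destination IP address, so a single VM can show up on more than one physical NIC depending on who it is talking to. The switch only copes with that gracefully if those ports are bundled into one channel. The function below is a simplified illustration, not VMware’s actual code: the name ip_hash_uplink and the exact formula (XOR of the two IPv4 addresses, modulo the number of uplinks) are assumptions made for the example.

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    """Pick an uplink index for a flow under a simplified IP-hash policy.

    Assumption: the hash is XOR(source, destination) modulo the number
    of uplinks. Real hypervisors may compute it differently.
    """
    if uplink_count < 1:
        raise ValueError("uplink_count must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


if __name__ == "__main__":
    # Hypothetical addresses: one VM talking to two different peers
    # on a host with two physical NICs.
    vm = "10.0.0.5"
    for peer in ("10.0.0.8", "10.0.0.9"):
        print(vm, "->", peer, "uses uplink", ip_hash_uplink(vm, peer, 2))
```

In this sketch the same VM lands on uplink 1 for the first peer and uplink 0 for the second, which is exactly the situation the switch needs a port channel to handle.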