wildcattdw
Contributor

ESX 3.5 - guest CPU after VMotion

Hey all. After upgrading several of my ESX hosts (my development cluster and my production VDI cluster) to 3.5, I have been running into a pretty serious issue. I have not seen anyone else ask or complain about this, but from my testing it seems somebody else should be seeing something similar.

After upgrading, when a machine is VMotioned between two hosts and the VMotion hits 90%, the guest CPU spikes to 100% and just cooks. I have to reboot the guest to clear the issue. My VDI environment is all Unisys ES7000/one hosts and my dev environment is a mixture of HP and IBM hosts; it happens in both environments. All hosts involved were upgraded from 3.0.2, most via esxupdate.

Anyone else experiencing anything similar? Ideas?

Tim

DanRDALE
Contributor

I had a similar issue. After upgrading from 3.0.2 to 3.5, I went from getting about 5-10 CPU warning emails to a few hundred a day. We only have 3 hosts and 52 virtual machines. We had HA on and DRS Fully Automated and Aggressive with no problems prior to the upgrade. After the upgrade, the VMs were reporting CPU spikes like no tomorrow, with or without VMware Tools being upgraded and the server being rebooted.

To test the issue, I turned off DRS and HA and let it sit for a while. Everything went back to normal, but I wanted HA and DRS running again, so I just turned on HA and set DRS to Partially Automated for now. That's where I am currently sitting and have had no issues.

fitzsimmonsr
Contributor

I am having a similar problem, but it is limited to only one guest out of the 45 that I have. I upgraded from 3.0.2 to 3.5 and VC 2.0.2 to 2.5, and since then, any time VMotion occurs for this one guest, the CPU gets pegged at 100% and the VM has to be shut down so it will come back up stable. The VM is a W2000 server running IIS. It had been converted from a physical server using Converter well before the update, and it ran fine before the update. I do have other W2000 servers (none running IIS) that are not having this problem. The problem happened both before and after I updated the tools to 3.5. I have currently disabled DRS automation for that box, and it has been stable since then. I figured it was just a problem with that box, but maybe not....

Thanks

Bobby

wildcattdw
Contributor

Yep, the first thing I did was disable DRS, which sorta stinks. I want to try doing a scratch install on a couple of the machines, but I can't evacuate the hardware, because if I do, 50+ machines per box will go nuts. I spent a couple of hours yesterday working with the folks at VMware to try to troubleshoot the issue. I built an XP guest from scratch, and when I used it to test (with nothing installed on it, just bare XP), the CPU jumps up to 50-60% and sits there. Same issue, just not as extreme. I'll keep you all posted.

I feel better that it's not just me.

T

fitzsimmonsr
Contributor

Any luck with VMware tech support? I am beginning to see this affect at least one more server, maybe two. I think I will have to open a ticket as well.

Thanks

wildcattdw
Contributor

Yes, they have been exceptionally helpful. Early last week they sent me a message that the issue has been identified as a "known issue." They asked me to remove two hosts from my DRS cluster and test, and it looks like the issue doesn't arise when the hosts are not in a cluster; however, I have not been doing a lot of VMotions either. I did get a request to do some more testing, but I am not sure how much risk I can take right now; everything is working happily and the "worst offenders" are part of a group that hosts VDIs. I don't want to put any end users through any more pain... 😃

fitzsimmonsr
Contributor

I opened a ticket with VMware today. We tested a few scenarios. No luck yet. If we get it resolved, I'll post it here.

Thanks

dirckvdb
Contributor

We are having the same problem on a few of our virtual servers. Sometimes the systems respond badly without us seeing high CPU or memory usage. Disabling DRS solves our problem. I was wondering if anyone has any news on this issue.

thanks

depping
Leadership

rserao
Contributor

According to VMware they are working on a patch, but it has not been released yet. They did ask me to test a workaround on Friday. I have not had a chance to test it out yet. Will try to get to it today. The workaround they asked me to test is:

"Edit the following file:

C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\vpxd.cfg

Add the following lines after the <vpxd> line:

<cluster>
   <VMOverheadGrowthLimit>5</VMOverheadGrowthLimit>
</cluster>

After editing the vpxd.cfg, you'll need to restart the Virtual Center Service. Disable and re-enable DRS, then test a VMotion to see if that helps you.

Looking forward to hearing from you."
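For reference, the way I read those instructions, the relevant part of vpxd.cfg ends up looking roughly like the sketch below. I'm assuming the usual <config> root element and that the new block goes directly inside <vpxd>; everything else in the file stays as it was.

<config>
   <vpxd>
      <cluster>
         <VMOverheadGrowthLimit>5</VMOverheadGrowthLimit>
      </cluster>
      <!-- existing <vpxd> entries remain unchanged -->
   </vpxd>
   <!-- other existing sections remain unchanged -->
</config>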

I'll post my results after I test.

Bobby

wildcattdw
Contributor

I implemented this work-around this morning, and so far my test results have been very good. I had a couple of VMs in this environment (7 hosts, 80 guests) that were very prone to the CPU running away, and they look good now. I enabled automatic DRS so I could try using Update Manager, and it's all going very well.

T

tanino
Contributor

I tried this at one site, with one Virtual Center server, and so far I haven't had the issue anymore.

But at another site, with a different Virtual Center server, I can't get the parameter to stick; every time I set it, it goes back to 0 or -1.

I tried it in the attached vpxd.cfg (like on the other VC) and in the advanced parameters of the host, and I also tried moving the 3.5 host to another cluster.

Here are the vmkernel entries from several attempts:

Apr 17 16:47:24 goavm04 vmkernel: 0:00:00:03.596 cpu5:1040)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 17 16:47:34 goavm04 vmkernel: 0:00:01:20.239 cpu2:1039)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 16:48:01 goavm04 vmkernel: 0:00:02:02.857 cpu2:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 17 17:02:28 goavm04 vmkernel: 0:00:16:30.069 cpu2:1041)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: 0, (Status: 0x0)
Apr 17 17:02:32 goavm04 vmkernel: 0:00:16:33.627 cpu3:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 17 17:03:25 goavm04 vmkernel: 0:00:17:27.076 cpu2:1039)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: 0, (Status: 0x0)
Apr 17 17:03:27 goavm04 vmkernel: 0:00:17:28.610 cpu3:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 17 17:11:42 goavm04 vmkernel: 0:00:25:44.075 cpu5:1041)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 17:12:02 goavm04 vmkernel: 0:00:26:03.977 cpu5:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 17 17:12:32 goavm04 vmkernel: 0:00:26:33.993 cpu5:1041)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 17:13:01 goavm04 vmkernel: 0:00:27:02.976 cpu3:1039)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)
Apr 17 17:13:12 goavm04 vmkernel: 0:00:27:14.230 cpu1:1039)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 5, (Status: 0x0)
Apr 17 17:13:42 goavm04 vmkernel: 0:00:27:44.072 cpu1:1040)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 17 17:14:31 goavm04 vmkernel: 0:00:28:32.745 cpu1:1040)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 17:14:40 goavm04 vmkernel: 0:00:28:41.716 cpu1:1040)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)
Apr 17 17:14:43 goavm04 vmkernel: 0:00:28:44.285 cpu3:1040)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 17 17:18:20 goavm04 vmkernel: 0:00:32:21.976 cpu4:1039)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: 0, (Status: 0x0)
Apr 17 17:18:30 goavm04 vmkernel: 0:00:32:32.210 cpu3:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 17 17:18:58 goavm04 vmkernel: 0:00:32:59.632 cpu4:1039)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 17:19:15 goavm04 vmkernel: 0:00:33:17.135 cpu2:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 17 17:19:34 goavm04 vmkernel: 0:00:33:35.390 cpu5:1040)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 17:20:07 goavm04 vmkernel: 0:00:34:09.202 cpu2:1041)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: -1, (Status: 0x0)
Apr 17 17:20:25 goavm04 vmkernel: 0:00:34:27.264 cpu6:1039)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 17 17:23:21 goavm04 vmkernel: 0:00:37:22.424 cpu5:1040)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: 0, (Status: 0x0)
Apr 17 17:23:24 goavm04 vmkernel: 0:00:37:26.133 cpu3:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 17 17:26:03 goavm04 vmkernel: 0:00:40:04.884 cpu3:1041)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: 0, (Status: 0x0)
Apr 17 17:26:05 goavm04 vmkernel: 0:00:40:06.354 cpu5:1040)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 17 17:43:15 goavm04 vmkernel: 0:00:57:17.058 cpu6:1039)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)
Apr 17 19:00:58 goavm04 vmkernel: 0:02:14:59.423 cpu6:1041)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: -1, (Status: 0x0)
Apr 17 19:58:19 goavm04 vmkernel: 0:03:12:20.447 cpu2:1039)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)

Now I have disabled DRS completely, and I hope not to see the issue again.

What else can I try to get the value set to 5 as suggested? Should I upgrade to VC 2.5 Update 1 as soon as possible? Is this the SOLUTION?

Thank you in advance.

fitzsimmonsr
Contributor

After you updated the vpxd.cfg file did you restart the Virtual Center Service? Then disable and enable DRS?

You can also change this setting on each host by going to the Configuration tab, choosing Advanced Settings, and then Mem. VMOverheadGrowthLimit should be towards the end of the list.

If you are setting it on the Virtual Center server and a host is not retaining the setting, it sounds like a problem with the host's registration with the Virtual Center server. Are the 2 clusters you are referring to managed by the same VC server? Try removing the host from Virtual Center, then reboot the host and add it back into the cluster. The new registration may fix the problem.
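If you have service console access, you can also check the value there and watch whether it sticks. Something like the commands below should work; I'm assuming the option lives under /Mem (that is where it shows up in the Advanced Settings dialog) and that the vmkernel log is in the usual /var/log/vmkernel location.

# check the current value of the advanced option
esxcfg-advcfg -g /Mem/VMOverheadGrowthLimit

# set it to 5, matching the Virtual Center workaround
esxcfg-advcfg -s 5 /Mem/VMOverheadGrowthLimit

# watch the vmkernel log to see whether something keeps resetting it
tail -f /var/log/vmkernel | grep VMOverheadGrowthLimit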

Bobby

tanino
Contributor

Thank you for the quick reply.

"After you updated the vpxd.cfg file did you restart the Virtual Center Service? Then disable and enable DRS?"

Yes, I tried that several times.

"You can also change this setting on each host by going to the Configuration tab, choosing Advanced Settings, and then Mem. VMOverheadGrowthLimit should be towards the end of the list."

Yes, I also tried this.

"Are the 2 clusters you are referring to managed by the same VC server?"

No, different ones.

"Try removing the host from Virtual Center, then reboot the host and add it back into the cluster. The new registration may fix the problem."

I will try. But the parameter was already there before I registered the host back on the VC server (after its re-installation with 3.5)...

I will let you know if the disconnect, remove, reboot, and re-add works...

Thank you.

tanino
Contributor

We've just tried what you suggested (unregister, reboot, and register the host back in the Virtual Center server), but the parameter still changes back to the old value within a few seconds.

Apr 18 11:03:09 goavm04 vmkernel: 0:00:02:00.255 cpu6:1041)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: -1, (Status: 0x0)
Apr 18 11:07:43 goavm04 vmkernel: 0:00:06:34.200 cpu1:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
Apr 18 11:12:42 goavm04 vmkernel: 0:00:11:32.577 cpu6:1039)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: 0, (Status: 0x0)
Apr 18 11:12:54 goavm04 vmkernel: 0:00:11:44.465 cpu2:1041)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: 5, (Status: 0x0)
Apr 18 11:13:34 goavm04 vmkernel: 0:00:12:24.580 cpu7:1039)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)

What should I do now? This is our critical production environment.

Note that we aren't using DRS at the moment (it was enabled but set to manual). Is it safe to disable DRS completely? Should we expect CPU spikes in this configuration while waiting for the fix?

Is the final solution to upgrade the Virtual Center server to VC 2.5 Update 1?

Thank you in advance.

Alessandro

gdesmo
Enthusiast

I have applied the fix to the vpxd.cfg file. It was not working when I set it on an individual host.

It changed the values on each host from 0 to 5.

Going forward, when I apply VC Update 1, do I need to revert them back to 0?

planetman
Contributor

""Going forward. When I apply vc update1 do I need to revert them back to 0?"

I would also be interested in the answer to this question. Indeed, is the bug actually fixed in Update1?

Many thanks

marvinthebassma
Contributor

Hi there,

Same question here: I applied the fix to VC 2.5.0.

Now we have upgraded to VC 2.5.0 Update 1.

Do I have to remove the parameter from the vpxd.cfg file?

Martin

fitzsimmonsr
Contributor

I applied the fix for VC 2.5, then updated to Update 1 without changing the setting. Everything seems to be running OK. I am not sure whether Update 1 (VC 2.5 or ESX 3.5) included the change, since I had applied it manually before the update. I guess we need someone who did not apply the fix but ran the update to check whether the setting was changed. Is anyone who applied the fix and then applied Update 1 having any problems?

AdamSnow
Enthusiast

I did not make the change manually, and after updating to VC 2.5 Update 1 the issue was still there. I ended up making the change manually to fix it. The KB article says that Update 1 fixes it, but it does not.
