After recently upgrading vCenter and ESXi to 6.0U1a and installing all patches (now build numbers 3018524 and 3073146 respectively), we began experiencing random host disconnects from vCenter. The host itself and all guests on it are still alive when it's disconnected; I can SSH to the host and RDP/SSH to guests. If I literally do nothing, it eventually fixes itself and rejoins vCenter within 15-20 minutes. We did not have this issue prior to upgrading to 6.0U1a. This is not the "NETDEV WATCHDOG: vmnic4: transmit timed out" issue. In fact, the reason we upgraded to the latest build was to get the fix for that particular issue.
I've personally witnessed this happen on three different hosts now, and as far as we've noticed it has never recurred on the same host twice. The vmkernel.log simply shows:
2015-11-18T20:56:42.662Z cpu12:173086)User: 3816: wantCoreDump:vpxa-worker signal:6 exitCode:0 coredump:enabled
2015-11-18T20:56:42.819Z cpu15:173086)UserDump: 1907: Dumping cartel 172357 (from world 173086) to file /var/core/vpxa-worker-zdump.000 ...
The vpxa.log doesn't show anything building up to the disconnection and leaves a large gap after the agent crashes, like so:
2015-11-18T20:56:42.638Z info vpxa[FFF2AB70] [Originator@6876 sub=vpxLro opID=QS-host-311567-2883ed8a-1e-SWI-42a5654a] [VpxLroList::ForgetTask] Unregistering vim.Task:sessio
2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=VpxaHalCnxHostagent opID=QS-host-311567-2883ed8a-1e] [VpxaHalCnxHostagent::DoCheckForUpdates] CheckForUp
2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=vpxaMoService opID=QS-host-311567-2883ed8a-1e] [VpxaMoService] GetChanges: 97820 -> 97820
2015-11-18T20:56:42.641Z verbose vpxa[FFF6CB70] [Originator@6876 sub=VpxProfiler opID=QS-host-311567-2883ed8a-1e] [2+] VpxaStatsMetadata::PrepareStatsChanges
2015-11-18T21:10:20.328Z Section for VMware ESX, pid=3326854, version=6.0.0, build=3073146, option=Release
2015-11-18T21:10:20.329Z verbose vpxa[FF8A6A60] [Originator@6876 sub=Default] Dumping early logs:
2015-11-18T21:10:20.329Z info vpxa[FF8A6A60] [Originator@6876 sub=Default] Logging uses fast path: false
vCenter logs simply show the host becoming unreachable, so the problem is clearly host-side.
Anyone else seeing similar activity? This has all the feel of another "known issue" but I don't see any talk about it. I did open a case with VMware support and am awaiting contact now.
Hi,
I'm experiencing vpxa crashing due to out of memory on 5.5 (ESXi build 3116895, vCenter build 3142196). Hosts go offline in vCenter and after a while they reconnect. This probably started with the update to this version in autumn.
From the vpxa process:
2015-11-20T08:00:31Z Unknown: out of memory [34516]
2015-11-20T08:00:31.935Z cpu0:54442)UserDump: 1820: Dumping cartel 34516 (from world 54442) to file /var/core/vpxa-worker-zdump.000
I had opened this twice with VMware support and they advised increasing ThreadStackSizeKb in vpxa.cfg. That was OK for a few days (or maybe it was just the restart of vpxa), and the interval between crashes seemed to get longer. Anyway, I opened it once more, and the answer from support was that it is caused by having a lot of snapshots; we had a few VMs with more than 32 (the officially supported number). We deleted snapshots to get below 32 and... it's still crashing.
We are not on the latest 5.5, but I'm unsure whether an upgrade will help. Because of the SSLv3 patches I have to upgrade vCenter first, so I can't patch the hosts to try. In any case, vpxa is installed and upgraded from/with vCenter, right?
Your "Unknown: out of memory" certainly sounds like the same issue I started seeing on 6.0U1a; I just haven't seen it on any other version thus far. Checking your build number, you're on ESXi 5.5 Update 3a (Express Patch 8), which was released the same day as 6.0U1a (10/6/15). So it's possible a change was made in 6.0U1a that started causing this issue and was then back-ported to 5.5U3a as well. Unfortunately I don't have any clients on 5.5U3 or higher yet to confirm the issue is happening there too, but it seems a little more than coincidental.
The vpxa agent gets installed/enabled when you join a host to vCenter, and the vpxa agent on the hosts gets upgraded when you upgrade vCenter.
Hi guys,
This is a known issue (vpxa crash, out of memory error) from 5.5 U2 32485427 onwards, and VMware is releasing a patch to fix it as a high priority. The only temporary solution is changing the value, but the value will be lost after a reboot.
1. Connect to each of the hosts mentioned above through SSH.
2. Run the following commands to change the default value for vpxa:
a) Run the following command to store the group ID of the vpxa process in a variable:
grpID=$(vsish -e set /sched/groupPathNameToID host vim vmvisor vpxa | cut -d' ' -f 1)
b) Run the following to increase the max memory allocation of the vpxa process to 400 MB (the default is 304 MB):
vsish -e set /sched/groups/$grpID/memAllocationInMB max=400 minLimit=unlimited
3. Confirm that the max memory allocation of the vpxa process has been changed:
vsish -e get /sched/groups/$grpID/memAllocationInMB
The output should be similar to the following:
sched-allocation {
   min:0
   max:400
   shares:0
   minLimit:-1
   units:units: 3 -> mb
}
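For convenience, the steps above can be combined into one small script. This is only a sketch: vsish exists only in the ESXi shell, so the calls are guarded, and parse_grpid is my own helper name (not a VMware tool) that just keeps the first space-separated field of the vsish output, i.e. the group ID.

```shell
# Sketch of the workaround as one script (run in the ESXi shell).
# parse_grpid keeps the first space-separated field, i.e. the group ID.
parse_grpid() { echo "$1" | cut -d' ' -f1; }

if command -v vsish >/dev/null 2>&1; then
  # Look up the scheduler group ID of the vpxa process.
  grpID=$(parse_grpid "$(vsish -e set /sched/groupPathNameToID host vim vmvisor vpxa)")
  # Raise the max memory allocation from the 304 MB default to 400 MB.
  vsish -e set "/sched/groups/$grpID/memAllocationInMB" max=400 minLimit=unlimited
  # Confirm the change.
  vsish -e get "/sched/groups/$grpID/memAllocationInMB"
fi
```

On a non-ESXi machine the guard makes this a no-op, so it is safe to paste and inspect before running it on a host.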
Hope this helps.
Hi,
I had no confirmation from support that this is a known issue in my two opened cases. After reading this discussion I changed the vpxa memory limit using the vSphere Client; I think that's the same as your solution using vsish, right?
I doubled the original value (it seems to be computed from the host memory size). There was one crash on one host (of 5) after a week. Going to try it on another cluster...
Hi Ivanerben,
Have you had any crashes even after the change? If so, please let me know, as I still have my case open with them.
This is a known issue internally within VMware support. I don't think they have acknowledged it to everyone.
Hi, yes we had one crash on host with modified memory settings:
2016-03-30T01:40:54Z | XXX Unknown: out of memory [15946563] |
2016-03-30T01:40:54.042Z XXX vmkernel: cpu34:13686138)User: 2888: wantCoreDump : vpxa-worker -enabled : 1
2016-03-30T01:50:46.047Z XXX Hostd: [49701B70 info 'Vimsvc.ha-eventmgr'] Event 502173 : /usr/lib/vmware/vpxa/bin/vpxa crashed (11 time(s) so far) and a core file might have been created at /var/core/vpxa-worker-zdump.000. This might have caused connections to the host to be dropped.
2016-03-30T01:50:46Z | XXX watchdog-vpxa: '/usr/lib/vmware/vpxa/bin/vpxa ++min=0,swapscope=system,group=host/vim/vmvisor/vpxa -D /etc/vmware/vpxa' exited after 972216 seconds 134 |
Alright, I just noticed that even after running the commands, the values don't change; you have to restart the vpxa service for the change to take effect.
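For reference, the agent can be restarted from the ESXi shell via its init script (restarting vpxa only bounces the vCenter connection; running VMs are not affected). A sketch, guarded so it does nothing on a non-ESXi machine:

```shell
# Restart the vpxa management agent so the new memory limit takes effect.
if [ -x /etc/init.d/vpxa ]; then
  /etc/init.d/vpxa restart
  vpxa_restart_attempted=yes
else
  # Not an ESXi host; nothing to restart.
  vpxa_restart_attempted=no
fi
```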
Had the value definitely been changed on the crashed host?
Changing it on a couple of hosts today with build 3248547; will keep you posted.
If it doesn't fix the issue, the only hope is to wait for VMware to release patches. This is their 2nd top priority, according to them.
So...one crash is ok = restart
Hi, it seems that modifying the settings using the vSphere Client is persistent and survives a host reboot.
But is it "fixed" with the increased settings or does it just take longer to crash now?
We have to wait longer for confirmation, but I have had only one crash per ESXi host since April 1st, which is promising.
Naah, just a longer interval between crashes with the modified settings. Still crashing with 'Unknown: out of memory'.
They finally made a public KB article for the problem, but still no fix:
For 5.5 there is Update 3e; I'm reading the release notes and trying to find out whether it is fixed...
It appears the article posted by Bleeder above, KB2144799, now indicates this was fixed in ESXi600-201608401-BG and ESXi550-201608401-BG, released 8/4/16. I haven't installed any of the patches released 8/4 yet, so I can't confirm. While the issue did seem less frequent after dropping our stats collection levels as someone above indicated, that didn't completely fix it; I witnessed the problem again just the other day. I also never increased the memory above the default as the earlier workarounds mentioned.
I upgraded my ESXi version to 6.0U2 (build 4192238) and my ESXi server disconnects from vCenter; a few VMs get disconnected from vCenter as well, and I had to migrate them to another ESXi host to resolve it. When I opened a case with VMware they said to upgrade the NIC firmware and driver to resolve this issue, but even after the upgrade the issue still exists.
Even after trying the KB below, the issue persists.
Has anyone got a solution? I feel like stopping upgrades to ESXi 6; on ESXi 5.5 U3 I did not have a single issue like this.
What NIC are you using, and what driver/firmware? Run ethtool -i <vmnic> and you should see something like:
driver: i40e
version: 1.4.26
firmware-version: 5.02 0x8000222e 17.5.10
bus-info: 0000:01:00.0
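If you want that output for every NIC at once, here's a sketch for the ESXi shell (where esxcli and ethtool are both available); nic_names is just my helper name for pulling the first column out of esxcli's table output, and the esxcli call is guarded so the snippet is inert elsewhere:

```shell
# Print driver/firmware info for each vmnic listed by esxcli.
# nic_names skips esxcli's two header lines and keeps the first column.
nic_names() { awk 'NR > 2 { print $1 }'; }

if command -v esxcli >/dev/null 2>&1; then
  for nic in $(esxcli network nic list | nic_names); do
    echo "== $nic =="
    ethtool -i "$nic"
  done
fi
```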
Please find the details:
Driver Info:
Bus Info: 0000:0c:00:0
Driver: elxnet
Firmware Version: 4.2.433.604
Version: 10.2.445.0
Driver Info:
Bus Info: 0000:0c:00:0
Driver: elxnet
Firmware Version: 10.6.144.2702
Version: 10.6.144.2712