VMware Cloud Community
mikeskomal
Contributor

VM loses network connectivity

I have several ESX 3.0.1 servers. On occasion, VMs can no longer ping their default gateway. If I shut down the VM and start it up again, the problem is fixed. What gives?

16 Replies
MR-T
Immortal

Any particular guests which this seems to happen on?

There used to be issues with NT4 machines in 2.5.x but a patch fixed that.

mikeskomal
Contributor

Windows Server 2003

bigvee
Enthusiast

vmtools installed and up-to-date?

MR-T
Immortal

I'd second this.

Having the latest vmtools loaded is very important.

mikeskomal
Contributor

Looks like that may be the problem. I thought those would upgrade automatically with host upgrades?

media_gen
Contributor

I'd also recommend double-checking that the NIC speed is set up the same on the switches and on your host servers.

VMware Support recommended that we set our switches and NICs to 1000 Mb/full duplex.

We had similar problems a few months back, and it ended up being a faulty switch. The drops were intermittent until we started isolating and running tests on each individual switch.

Good Luck!

MR-T
Immortal

No, you need to perform this yourself.

Did you originally create these machines on ESX 2.5 and then move them to ESX 3?

mikeskomal
Contributor

I created templates with the original 3.0.1 build (32039). I've since installed a good number of patches. I'm now at Build 41412. Looks like I have some tool updating to do. Is there somewhere in VC that shows the Tools version of each VM?

bigvee
Enthusiast

Not really as far as I know... When you look at each VM it shows as out of date or installed.

I believe there is a script to do bulk updates, but they all require reboots after anyway.

mikeskomal
Contributor

Thanks for the responses

jhanekom
Virtuoso

Long shot: since you're saying "cannot ping default gateway", I'm assuming the VMs can still ping other hosts on the same network?

Are you using network bonds in your virtual switches? If so, have you changed any of the default load balancing or failure detection options? Are you using EtherChannel on the physical switches?
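
For anyone who wants to answer the first question from inside an affected guest, here is a minimal Python sketch, not a definitive diagnostic. It checks whether the gateway and a few same-subnet peers respond to ping. The function names (`ping_once`, `triage`) and the result labels are illustrative assumptions. The ping flags shown are those of the stock Windows and Linux (iputils) ping commands, so other platforms may need different options.

```python
import platform
import subprocess


def ping_once(host, timeout_s=2):
    """Send one ICMP echo via the OS ping command; True if it exits with 0.

    Rough check only: some ping implementations exit with 0 even when the
    reply is an error such as "Destination host unreachable".
    """
    if platform.system() == "Windows":
        # Windows ping: -n count, -w timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux (iputils) ping: -c count, -W timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def triage(gateway, same_subnet_peers, probe=ping_once):
    """Classify reachability; the labels are illustrative, not VMware terms."""
    if probe(gateway):
        return "gateway-ok"
    # Gateway is silent: see whether anything on the local segment answers.
    if any(probe(peer) for peer in same_subnet_peers):
        return "gateway-only"      # peers answer, only the gateway is unreachable
    return "no-connectivity"       # nothing on the local network answers
```

Calling `triage("<gateway IP>", ["<peer IP>", ...])` on the affected VM while the problem is occurring gives a quick answer to "gateway only, or everything?", which is the distinction the questions above are trying to draw.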

plsntn_rules1
Contributor

Exact same issues here.

But not all the VMs on the host lose connectivity. Just a couple do, and a reboot of the VM fixes the issue.

Any insight, anybody?

I will post details here of anything anyone wants to see.

thanks

krc1517
Enthusiast

I have the same problem too, and it isn't isolated to one OS, one VI3 host, or even one version of ESX:

SLES 10 SP1

Win2k3 SP1 / SP2

ESX 2.5.4, VI3 3.0.1 and 3.0.2

It usually only happens to 1 VM at a time, but it has been happening more and more in the last few weeks.

VMware Tools versions are all over the board. Some are up to date, some are not.

Seems to happen more with VI3 than 2.5.4 although it does happen there too.

Most of the templates were made in 2.5.4. All SLES VMs are built from scratch.

Here's a log capture from at or near the time. At 09:56 the VM was restarted, and this fixed the issue. Usually I'd VMotion it, but the other admin beat me to it.

Dec 17 09:07:44 vmhost46 vmkernel: 67:17:48:33.319 cpu4:1040)FS3: 4055: Reclaimed timed out heartbeat [HB state abcdef02 offset 3682816 gen 32 stamp 5852894630079 uuid 470d5d27-ec812b60-5f72-0017a44cb69b jrnl <FB 213015>]

Dec 17 09:08:07 vmhost46 vmkernel: 67:17:48:56.286 cpu1:1037)SCSI: vm 1037: 5510: Sync CR at 64

Dec 17 09:09:07 vmhost46 vmkernel: 67:17:49:56.458 cpu2:1032)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 3891670, handle 1145240/0x3d205388

Dec 17 09:09:07 vmhost46 vmkernel: 67:17:49:56.458 cpu2:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x3d205388, originSN 3891670 from vmhba1:2:6

Dec 17 09:09:07 vmhost46 vmkernel: 67:17:49:56.458 cpu2:1032)WARNING: LinSCSI: 3920: The driver failed to call scsi_done from it's abort handler and yet it returned SUCCESS

Dec 17 09:10:20 vmhost46 vmkernel: 67:17:51:09.240 cpu4:1037)SCSI: vm 1037: 5510: Sync CR at 64

Dec 17 09:11:44 vmhost46 vmkernel: 67:17:52:32.546 cpu1:1032)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 2641816, handle 106220/0x3d207678

Dec 17 09:11:44 vmhost46 vmkernel: 67:17:52:32.546 cpu1:1032)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x3d207678, originSN 2641816 from vmhba1:2:5

Dec 17 09:11:44 vmhost46 vmkernel: 67:17:52:32.546 cpu1:1032)WARNING: LinSCSI: 3920: The driver failed to call scsi_done from it's abort handler and yet it returned SUCCESS

Dec 17 09:14:42 vmhost46 vmkernel: 67:17:55:31.404 cpu2:1038)SCSI: vm 1038: 5510: Sync CR at 64

Dec 17 09:14:47 vmhost46 vmkernel: 67:17:55:36.531 cpu0:1035)SCSI: vm 1035: 5510: Sync CR at 64

Dec 17 09:15:02 vmhost46 vmkernel: 67:17:55:50.526 cpu1:1034)FS3: 1717: Checking if lock holders are live for lock [type 10c00002 offset 9568256 v 151664, hb offset 3236864

Dec 17 09:15:02 vmhost46 vmkernel: gen 77717, mode 1, owner 470bcab7-b892b183-e6e5-0018fe75da11 mtime 5956622]

Dec 17 09:55:29 vmhost46 vmkernel: 67:18:36:17.573 cpu4:1529)World: vm 1529: 3864: Killing self with status=0x0:Success

Dec 17 09:55:29 vmhost46 vmkernel: 67:18:36:17.574 cpu4:1531)World: vm 1531: 3864: Killing self with status=0x0:Success

Dec 17 09:55:29 vmhost46 vmkernel: 67:18:36:17.607 cpu5:1530)World: vm 1530: 3864: Killing self with status=0x0:Success

Dec 17 09:55:29 vmhost46 vmkernel: 67:18:36:17.666 cpu5:1532)World: vm 1532: 3864: Killing self with status=0x0:Success

Dec 17 09:55:29 vmhost46 vmkernel: 67:18:36:17.667 cpu5:1533)World: vm 1533: 3864: Killing self with status=0x0:Success

Dec 17 09:55:30 vmhost46 vmkernel: 67:18:36:18.579 cpu1:1037)WARNING: Alloc: vm 1529: 1293: Deallocating pinned ppn 0xd33, throttle 0.

Dec 17 09:55:30 vmhost46 vmkernel: 67:18:36:19.119 cpu5:1506)World: vm 1506: 3864: Killing self with status=0x0:Success

Dec 17 09:55:30 vmhost46 vmkernel: 67:18:36:19.126 cpu4:1504)World: vm 1504: 3864: Killing self with status=0x0:Success

Dec 17 09:56:10 vmhost46 vmkernel: 67:18:36:59.091 cpu2:1038)World: vm 1764: 690: Starting world vmware-vmx with flags 4

Dec 17 09:56:10 vmhost46 vmkernel: 67:18:36:59.452 cpu1:1764)World: vm 1765: 690: Starting world vmm0:rescron with flags 8

Dec 17 09:56:10 vmhost46 vmkernel: 67:18:36:59.453 cpu1:1764)Sched: vm 1765: 4836: adding 'vmm0:rescron': group 'host/user': cpu: shares=762 min=0 max=80048

Dec 17 09:56:10 vmhost46 vmkernel: 67:18:36:59.453 cpu1:1764)Sched: vm 1765: 4849: renamed group 128 to vm.1764

Dec 17 09:56:10 vmhost46 vmkernel: 67:18:36:59.453 cpu1:1764)Sched: vm 1765: 4863: moved group 128 to be under group 4

Dec 17 09:56:10 vmhost46 vmkernel: 67:18:36:59.487 cpu1:1764)Swap: vm 1765: 1426: extending swap to 524288 KB

Dec 17 09:56:11 vmhost46 vmkernel: 67:18:36:59.724 cpu2:1764)VSCSI: 2604: Creating Virtual Device for world 1765 vscsi0:0

Dec 17 09:56:11 vmhost46 vmkernel: 67:18:36:59.724 cpu2:1764)SCSI: 1271: Set shares value for world 1765 to 0x3e8

Dec 17 09:56:11 vmhost46 vmkernel: 67:18:36:59.897 cpu3:1764)World: vm 1766: 690: Starting world vmware-vmx with flags 44

Dec 17 09:56:11 vmhost46 vmkernel: 67:18:36:59.899 cpu4:1766)World: vm 1767: 690: Starting world vmware-vmx with flags 44

Dec 17 09:56:11 vmhost46 vmkernel: 67:18:36:59.900 cpu4:1766)World: vm 1768: 690: Starting world vmware-vmx with flags 44

Dec 17 09:56:11 vmhost46 vmkernel: 67:18:36:59.900 cpu6:1765)Init: 740: Received INIT from world 1765

Dec 17 09:56:15 vmhost46 vmkernel: 67:18:37:03.832 cpu3:1766)World: vm 1769: 690: Starting world vmware-vmx with flags 44

Dec 17 09:56:15 vmhost46 vmkernel: 67:18:37:03.837 cpu3:1766)World: vm 1770: 690: Starting world vmware-vmx with flags 44

Dec 17 09:56:35 vmhost46 vmkernel: 67:18:37:24.484 cpu2:1765)VSCSI: 1897: Reset request on handle 8312 (0 outstanding commands)

Dec 17 09:56:35 vmhost46 vmkernel: 67:18:37:24.484 cpu7:1049)VSCSI: 2103: Resetting handle 8312

Dec 17 09:56:35 vmhost46 vmkernel: 67:18:37:24.484 cpu7:1049)SCSI: 3295: handle 1145240 / orig 0x3e174448

Dec 17 09:56:35 vmhost46 vmkernel: 67:18:37:24.484 cpu7:1049)VSCSI: 1946: Completing reset on handle 8312 (0 outstanding commands)

Dec 17 13:15:01 vmhost46 vmkernel: 67:21:55:50.495 cpu0:1037)FS2: 1371: Scheduling maintenance on 43ce9b56-7c74fa0e-72e3-00110a59cad7. Last opener 0.0.0.0

I don't recall whether there's been any networking work lately to pin this on, but I'll keep track from now on.

Oh yeah, these are DL585 G1s with 2 NICs bonded into a pair to provide fault tolerance.

Thanks,
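
The capture above is mostly storage-side messages (AsyncIO timeouts and aborted commands on vmhba1:2:6 and vmhba1:2:5) rather than network ones. For anyone comparing similar vmkernel logs across hosts, here is a minimal Python sketch that tallies the "Aborting cmds" lines per storage path. The regular expression is an assumption derived only from the lines shown in this post, and the function name `count_aborts_by_path` is illustrative. A tally like this says nothing by itself about whether the aborts are related to the network drops.

```python
import re
from collections import Counter

# Pattern assumed from the log lines quoted in this thread, e.g.
# "...LinSCSI: 3616: Aborting cmds with world 1024, ... from vmhba1:2:6"
ABORT_RE = re.compile(
    r"LinSCSI: \d+: Aborting cmds with world \d+, .* from (vmhba\d+:\d+:\d+)"
)


def count_aborts_by_path(lines):
    """Return a Counter mapping storage path (vmhbaA:T:L) to abort count."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two abort lines and one warning line copied from the capture above.
sample = [
    "Dec 17 09:09:07 vmhost46 vmkernel: 67:17:49:56.458 cpu2:1032)LinSCSI: 3616: "
    "Aborting cmds with world 1024, originHandle 0x3d205388, originSN 3891670 from vmhba1:2:6",
    "Dec 17 09:09:07 vmhost46 vmkernel: 67:17:49:56.458 cpu2:1032)WARNING: LinSCSI: 3920: "
    "The driver failed to call scsi_done from it's abort handler and yet it returned SUCCESS",
    "Dec 17 09:11:44 vmhost46 vmkernel: 67:17:52:32.546 cpu1:1032)LinSCSI: 3616: "
    "Aborting cmds with world 1024, originHandle 0x3d207678, originSN 2641816 from vmhba1:2:5",
]
abort_counts = count_aborts_by_path(sample)
```

To run it against a saved log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.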

dougdavis22
Hot Shot

You can see the status of VMware Tools on each VM either at datacenter, cluster or host level in VC. For example, select a cluster and then select the Virtual Machines tab on the right. Once the list has finished loading, right-click the headings at the top and add the 'Tools Status' column. This will then give you a value of either ToolsOK or ToolsOld.

Hope this helps,

Doug.

krc1517
Enthusiast

Not sure if this was in response to my post or the one earlier... I know what versions of Tools I'm running: old and new. :)

xav_bx
Contributor

Hi,

I'm experiencing something similar...

From the guest OS the network appears to be present, but there is no communication (no ping, no services) with any other server. A simple reboot often resolves the issue.

The strangest part is that guests configured for DHCP get a valid IP address from the DHCP server and then lose their network. I've verified that VMware Tools are OK.

Our configuration:

  • ESX 3.0.2

  • Guest OS: Windows 2003 SP2 and XP SP2; VMware Tools are up to date on both (build 55869)

  • Connected to the LAN via a vSwitch with 2 NICs (auto-negotiate: 1000 Mb full duplex).

I ran a test on Windows XP SP2 configured for DHCP (I repeated each test several times):

  • TEST1: after a restart, if I wait a little while before logging in (30 seconds) and then ping another server, everything works fine.

  • TEST2: after a restart, if I log in and ping another server as soon as possible, the request times out, even though ipconfig shows that a DHCP address has been obtained.

It seems that if the network is used before all services (and perhaps VMware Tools) have started, something breaks at the network layer.

One of my colleagues is testing with and without VMware Tools on a Linux guest (because we also see the no-network symptom on Linux).
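
To make the TEST1/TEST2 comparison repeatable, something like the following Python sketch could be started right after login to record a timeline of probe results. It is a rough sketch under assumptions, not a finished tool. `probe` is any zero-argument callable you supply that returns True when the target answers (for example, a wrapper around the OS ping command), and the names `probe_timeline` and `first_success` are illustrative.

```python
import time


def probe_timeline(probe, attempts, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Call probe() `attempts` times, `interval_s` apart.

    Returns a list of (seconds_since_start, succeeded) tuples.
    `clock` and `sleep` are injectable so the loop can be tested offline.
    """
    start = clock()
    timeline = []
    for attempt in range(attempts):
        ok = bool(probe())
        timeline.append((round(clock() - start, 1), ok))
        if attempt < attempts - 1:
            sleep(interval_s)
    return timeline


def first_success(timeline):
    """Seconds until the first successful probe, or None if none succeeded."""
    for elapsed, ok in timeline:
        if ok:
            return elapsed
    return None
```

If `first_success` returns None when the script starts immediately after login, but returns a value when it starts 30 seconds later, that would match the behaviour described in TEST2 and TEST1 respectively.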
