VMware Cloud Community
wrf3f34ff
Enthusiast

ESXi dropping ARP packets?

We're widening our ESXi 3.5.1 test deployment to a second server, and we've encountered an unusual problem: ESXi doesn't appear to be receiving ARP packets. Periodically the switch just "loses" IPs from its NIC, and we've confirmed that it's due to ARP caches expiring. Inside a VM, we can run a packet dump without ever seeing any ARP traffic. Sometimes doing a "Restart Networking" from the console helps for a while, and as long as traffic continually passes to an IP, its ARP cache entry obviously does not expire.

The first machine we loaded ESXi on does not have this problem; running all the same tests there, we get exactly the results one would expect.

We have already exhaustively tested the hardware and switch port. Both are working fine. Switch port monitoring shows that the ARP requests are outbound to the VMware box, but they never make it.
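For anyone who wants to repeat the "do ARP requests actually arrive?" check from a Linux guest or a Linux boot of the same hardware, here is a minimal sketch. It is not what we ran; the function names `parse_arp` and `watch_arp` are made up for illustration, and the capture part assumes Linux (`AF_PACKET`) and root privileges.

```python
import socket
import struct

ETH_P_ARP = 0x0806  # EtherType for ARP


def parse_arp(frame: bytes):
    """Return a dict describing an Ethernet/IPv4 ARP frame, or None if it is not one."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETH_P_ARP:
        return None
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None  # not Ethernet/IPv4 ARP
    sha, spa, tha, tpa = struct.unpack("!6s4s6s4s", frame[22:42])
    return {
        "op": {1: "request", 2: "reply"}.get(oper, str(oper)),
        "sender_mac": sha.hex(":"),
        "sender_ip": socket.inet_ntoa(spa),
        "target_ip": socket.inet_ntoa(tpa),
    }


def watch_arp(ifname: str, count: int = 10):
    """Print the next `count` ARP frames seen on `ifname` (Linux only, needs root)."""
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ARP))
    sock.bind((ifname, ETH_P_ARP))
    try:
        for _ in range(count):
            info = parse_arp(sock.recv(2048))
            if info:
                print(info)
    finally:
        sock.close()
```

Calling something like `watch_arp("eth0")` while another host pings an address behind the NIC should print the incoming requests; if the switch's port monitor shows them leaving but nothing is printed, the frames are being lost between the wire and the OS.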

This is not related to the guest networking stack; it happens with the ESXi management NIC as well, even right after startup when no guests are running. ("Test Management Network" has to be done to render it accessible to the VI client.) So it's somewhere in the ESXi physical LAN driver or the vSwitch.

Both the working ESXi server and the new test server are Supermicro servers that are either on the HCL or from the same "series" as machines on the HCL (same electronics, different disk configuration or form factor). The machines are identical in terms of CPUs, RAM, disks, etc. The only material difference between the one that works and the one that doesn't is that the server having the problem is a bit older; the onboard dual-port NIC is an Intel (ESB2/Gilgal) 82563EB. However, the 82563EB is a component in many of the HCL systems, so I don't think it's a fundamental compatibility issue.

Does anyone have any idea what might be going on?

Thanks!

(I originally managed to post this to the wrong forum. I'm easily confused! :) )

24 Replies
elazar
Enthusiast

I did some digging as well (I was curious) and found that some of the newer Intel chipsets handle ARP at the hardware level; additionally, they can/will filter ARP requests. This may be what you are seeing, and it can be turned off by the driver. Based on what I have read, this was fixed at some point in the mainline Linux kernel, but I have yet to find a bug report with the specific patch. Take a look at this post to the e1000-devel list: http://marc.info/?l=e1000-devel&m=118833355412469&w=2

elazar

wrf3f34ff
Enthusiast

I just wrote a long, polite explanation of why I am indeed frustrated, and the forum software logged me out when I went to post it.

With my sincere apologies if the tone is more brusque than the original, here is the short, less polite summary:

We were evaluating VI3. We finished almost the whole eval. It was a lot of work. We encountered this problem, which afflicts 4 out of our 5 VMware target servers. We posted here. Someone suggested we get a sales rep involved. We spent a week trying to get ahold of a VMware sales rep but they wouldn't return our calls. We concluded that VMware sells VI3 Foundation like Fusion or Workstation (buy it on the website if you want it, but don't bother us about it) and saves the sales force for the multi-million dollar license buys. Thus, we reluctantly gave up.

Not only am I the one who did all the work necessary to make our application work under VMware, but some of that included working with contacts at VMware so we could enhance one of the open source virtualized drivers to meet our needs, at our expense. The instant I raised this issue, the contact evaporated; that work now cannot be finished or contributed back and was wasted effort. So I was very emotionally invested in this project and personally feel pretty poorly treated by VMware as a result and am in a poor position to consider the matter objectively. Yes, very frustrating. Talk about snatching defeat from the jaws of victory!

There is no solution other than a VMware driver update that does not appear to be forthcoming. I came back to post that finding so that people searching could get closure, not to continue a futile discussion, and I apologize if I gave a different impression. If someone else wants to track down the specific patch, it may be related to hardware ARP filtering or to IPMI "LAN sharing", but it is definitely chipset-specific; not all chipsets serviced by the e1000 driver exhibit this problem. The argument in favor of IPMI involvement is partly logical (the scary "LAN sharing" feature is at the scene of the crime, even though we weren't using it) and partly statistical (that only 2-3 people have encountered this suggests a combination of factors is involved, making it rarer than the 82563EB itself).

I wish you the best of luck both in finding the right patch and getting VMware to care. But since our eval is over we have repurposed the hardware and cannot test further or rebut additional claims that this problem might be something we previously established it isn't. Such claims only cloud the issue in my opinion, and that is frustrating too. Thus, we are ready to move on and would ask people who want to hypothesize about this problem to save their efforts for the next person who has it.

The problem is as follows:

Using VMware ESXi 3.5u2, if you have an Intel 82563EB / 80003ES2LAN dual LAN chipset, as found on certain Supermicro systems on the official HCL, you may experience dropped inbound ARP packets on the first physical interface, leading to eventual loss of connectivity to both vmknics and guest network adapters. This may be related to having an IPMI card installed, even if you are using the dedicated LAN port on the IPMI card. The workaround is to disable this chipset and use an add-in LAN card; both another party and we tested the "Intel Pro/1000 PT Dual-Port PCI Express Server adapter" successfully. Our research showed that each of Windows, Linux, and FreeBSD previously had this issue with this chipset, but that it was fixed by a driver update in each case. No such driver update is currently available for ESXi.
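If you want a quick way to tell whether a box carries this controller before installing ESXi, one option is to boot Linux, capture `lspci` output, and scan it for the chipset names above. The sketch below is only an illustration: the helper name `find_affected_nics` is made up, and the `AFFECTED` list contains just the two names from this thread, so a match means "worth testing for dropped ARP", not "definitely broken".

```python
import re

# Controller names taken from this thread; illustrative, not an exhaustive list.
AFFECTED = ("82563EB", "80003ES2LAN")

# Matches plain `lspci` lines such as "04:00.0 Ethernet controller: Intel ..."
LSPCI_LINE = re.compile(r"^\s*([0-9a-fA-F:.]+)\s+Ethernet controller:\s*(.+)$")


def find_affected_nics(lspci_text: str):
    """Return (pci_address, description) pairs for Ethernet controllers in AFFECTED."""
    hits = []
    for line in lspci_text.splitlines():
        m = LSPCI_LINE.match(line)
        if m and any(name in m.group(2) for name in AFFECTED):
            hits.append((m.group(1), m.group(2).strip()))
    return hits
```

Feeding it the text of `lspci` (for example via `subprocess.run(["lspci"], capture_output=True, text=True).stdout`) returns one entry per matching port, which also makes it easy to see which PCI function is the first physical interface.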

The end.

tru
Contributor

hello,

Thank YOU very much for this detailed explanation. I am hitting this very same issue on a Supermicro X7DBR-3:

04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)

04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)

04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)

Subsystem: Super Micro Computer Inc Unknown device 0000

Flags: bus master, fast devsel, latency 0, IRQ 66

Memory at d8000000 (32-bit, non-prefetchable)

I/O ports at 2000

Capabilities: Power Management version 2

Capabilities: Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+

Capabilities: Express Endpoint IRQ 0

Capabilities: Advanced Error Reporting

Capabilities: Device Serial Number 1c-c2-32-ff-ff-48-30-00

04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)

Subsystem: Super Micro Computer Inc Unknown device 0000

Flags: bus master, fast devsel, latency 0, IRQ 58

Memory at d8020000 (32-bit, non-prefetchable)

I/O ports at 2020

Capabilities: Power Management version 2

Capabilities: Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-

Capabilities: Express Endpoint IRQ 0

Capabilities: Advanced Error Reporting

Capabilities: Device Serial Number 1c-c2-32-ff-ff-48-30-00

seen from a CentOS-5 x86_64 OS: Linux xx.fr 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

no issue

but when running ESXi 3 (fresh clean install)

1) can't connect through the VI client ("A connection failure occured")

2) ESXi claims a DHCP lease but cannot be pinged

3) once visible (console -> ping), the connection is lost after some time and one needs to restart the management interface at the console.

VMkernel sxxfr 3.5.0 #1 SMP Release build-110271 Aug 12 2008 19:36:55 i686 unknown

/var/log/messages:

Nov 5 23:47:00 vmkernel: 0:03:45:29.903 cpu4:1502)Config: 489: "HostIPAddr" = "0.0.0.0", Old value: "1xx188" (Status: 0x0)

Nov 5 23:47:00 vmkernel: 0:03:45:29.974 cpu4:1502)Uplink: 2491: Setting capabilities 0x300 for device vmnic0

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Uplink: 2491: Setting capabilities 0x2b for device vmnic0

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Net: 1846: Setting Tx-complete cb for port

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Net: 1883: Setting cb for port

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip_Support: 3294: NIC supports Tso

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip_Support: 3301: Stack supports TSO. MSS (minus TCP options) = 40960

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip_Support: 3309: NIC support TX checksum offloading

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip_Support: 3315: NIC supports Scatter-Gather transmits

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip_Support: 3367: ether attach complete

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip: 3389: Attempting to set a default gateway

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip: 1405: SetGateway (15a639d) failed with 0x33

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip: 1411: SetGateway 1xx1: network unreachable

Nov 5 23:47:00 vmkernel: 0:03:45:29.983 cpu4:1502)Tcpip: 1418: resetting to old gateway failed with 0x33

Nov 5 23:47:00 esxcfg-dhcp: deconfig vmk3

Nov 5 23:47:03 dhcp: bound vmk3 ip=1xx188 subnet=255.255.255.0 gateway=1xx1

Nov 5 23:47:03 vmkernel: 0:03:45:33.134 cpu0:46800)Tcpip_Support: 2723: index = 9086724, ip_addr = 0xbc5a639d, netmask = 0x0

Nov 5 23:47:03 vmkernel: 0:03:45:33.135 cpu0:46800)Tcpip_Support: 2723: index = 9086724, ip_addr = 0xbc5a639d, netmask = 0xffffff

Nov 5 23:47:03 Hostd: Refreshing the entire network configuration...

Nov 5 23:47:03 vmkernel: 0:03:45:33.137 cpu0:46800)Config: 489: "HostIPAddr" = "1xx188", Old value: "0.0.0.0" (Status: 0x0)

Nov 5 23:47:03 dhcp: bound vmk3 ip=1xx88 subnet=255.255.255.0 gateway=1xx1

Nov 5 23:47:04 dhcp: bound vmk3 ip=1xx88 subnet=255.255.255.0 gateway=1xx.1unable to set up domain and hostname: Invalid hostname, may not contain '.'

Nov 5 23:47:05 Hostd: Refreshed the entire network configuration.

Nov 5 23:47:05 Hostd: Refreshing the entire network configuration...

Nov 5 23:47:05 Hostd: Refreshed the entire network configuration.

Nov 5 23:47:17 dropbear[46932]: Child connection from 1xx140:53468

Nov 5 23:47:19 dropbear[46932]: password auth succeeded for 'root' from 1xx140:53468

I will try on another Supermicro (X7DWU) and report back asap

08:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)

08:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)

tru
Contributor

The Supermicro (X7DWU) with

08:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)

08:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)

works fine so far :)

patch applied/Viclient could connect

VMkernel localhost.localdomain 3.5.0 #1 SMP Release build-120505 Sep 29 2008 23:27:40 i686 unknown

/var/log/messages:

Nov 5 23:26:58 vmkernel: 0:00:00:36.993 cpu2:1211)Loading module igb ...

Nov 5 23:26:58 vmkernel: 0:00:00:37.006 cpu2:1211)Mod: 936: Starting load for module: igb R/O length: 0x13000 R/W length: 0x8000 Md5sum: cfef9a0a631c96e5fc1

Nov 5 23:26:59 vmkernel: 0:00:00:37.116 cpu2:1211)Mod: 1373: Module igb: initFunc: 0x98e780 text: 0x986000 data: 0x293aac0 bss: 0x293ae40 (writeable align 3

Nov 5 23:26:59 vmkernel: 0:00:00:37.150 cpu2:1211)Mod: 1389: modLoaderHeap avail before: 7799000

Nov 5 23:26:59 vmkernel: 0:00:00:37.169 cpu2:1211)Initial heap size : 102400, max heap size: 4194304

Nov 5 23:26:59 vmkernel: 0:00:00:37.189 cpu2:1211)<6>Intel(R) Gigabit Ethernet Network Driver - version 1.0.0
Nov 5 23:26:59 vmkernel: 0:00:00:37.212 cpu2:1211)<6>Copyright (c) 2007 Intel Corporation.
Nov 5 23:26:59 vmkernel: 0:00:00:37.229 cpu2:1211)PCI: driver igb is looking for devices
...
Nov 5 23:26:59 vmkernel: 0:00:00:37.604 cpu2:1211)<7>PCI: Setting latency timer of device 08:00.0 to 64
Nov 5 23:26:59 vmkernel: 0:00:00:37.712 cpu2:1211)<6>igb: eth0: igb_probe: Intel(R) Gigabit Ethernet Network Connection
Nov 5 23:26:59 vmkernel: <6>igb: eth0: igb_probe: (PCIe:2.5Gb/s:Width x4) 00:30:48:65:3d:8a
Nov 5 23:26:59 vmkernel: 0:00:00:37.754 cpu2:1211)<6>igb: eth0: igb_probe: Using legacy interrupts. 1 rx queue(s), 1 tx queue(s)
Nov 5 23:26:59 vmkernel: 0:00:00:37.782 cpu2:1211)PCI: driver igb claimed device 08:00.0
Nov 5 23:26:59 vmkernel: 0:00:00:37.799 cpu2:1211)PCI: Registering network device 08:00.0
Nov 5 23:26:59 vmkernel: 0:00:00:37.874 cpu2:1211)LinPCI: 202: Device 8:0 claimed.
Nov 5 23:26:59 vmkernel: 0:00:00:37.889 cpu2:1211)Mod: 2535: called already for this device.
Nov 5 23:26:59 vmkernel: 0:00:00:37.907 cpu2:1211)PCI: Trying 08:00.1
Nov 5 23:26:59 vmkernel: 0:00:00:37.919 cpu2:1211)PCI: Announcing 08:00.1
Nov 5 23:26:59 vmkernel: 0:00:00:37.932 cpu2:1211)<7>PCI: Setting latency timer of device 08:00.1 to 64
Nov 5 23:26:59 vmkernel: 0:00:00:38.025 cpu2:1211)<6>igb: eth0: igb_probe: Intel(R) Gigabit Ethernet Network Connection
Nov 5 23:26:59 vmkernel: <6>igb: eth0: igb_probe: (PCIe:2.5Gb/s:Width x4) 00:30:48:65:3d:8b
Nov 5 23:26:59 vmkernel: 0:00:00:38.068 cpu2:1211)<6>igb: eth0: igb_probe: Using legacy interrupts. 1 rx queue(s), 1 tx queue(s)

Nov 5 23:26:59 vmkernel: 0:00:00:38.095 cpu2:1211)PCI: driver igb claimed device 08:00.1

Nov 5 23:26:59 vmkernel: 0:00:00:38.112 cpu2:1211)PCI: Registering network device 08:00.1

d64
Contributor

I seem to be having this problem as well, on a new Supermicro server. I first saw it as early as a year ago on other Supermicros, but those were CentOS machines, and they were fixed by updating CentOS from 5 to 5.1 or 5.2 after install.

ESX build is 123629.
