VMware Cloud Community
Glenn7
Contributor

ESXi 5.0 & DELL R720 Network Connectivity Loss

OK, there have been a lot of threads about getting the Dell R720 working with the Broadcom 5720 daughter card. You can either inject the drivers post-build or use the Dell recovery CD (http://ftp.dell.com/FOLDER00609866M/1/) to build the server.

The problem I am highlighting in this post occurs once the servers are commissioned. We put 4 Dell R720 servers into production, each configured with 4 ports on the 5720 in an EtherChannel and using a vDS. Within 2-3 weeks, at different times, all 4 hosts experienced a network failure that caused production outages. All VMs were unresponsive and offline. Rebooting each physical host resolved the issue temporarily, but it would recur at a later date. Calls to both VMware and Dell were not very productive, and it took 2 months to finally work out the problem, with a lot of time and effort on my part. Specifically for the Dell R720, the criteria below must be enforced to ensure the hosts do not experience random network loss.

The above was implemented on all 4 hosts, and they have been stable for the last 8 weeks.

79 Replies
Glenn7
Contributor

Hi

Probably best to create a new post and add this to it, so other people with similar issues see it. Our issue seems unrelated to the QLogic, but see below for how to find the best supported VMware driver for any of your devices.

SSH to the host and run the vmkchdev command below. What you put in quotes is case sensitive. In our case the QLogic ports are vmhba2 and vmhba3. The two groups of hex codes on each line (for example 1077:2532 1077:015c) are VID:DID and SVID:SSID. Go to the VMware I/O HCL list (link below) and select the four 4-digit codes from the relevant drop-down lists. Then click on the link for your card and it will give you the latest driver for each ESX version.

http://www.vmware.com/resources/compatibility/search.php?deviceCategory=io

vmkchdev -l | grep "vmhba"

000:000:31.2 8086:1d02 1028:048c vmkernel vmhba0
000:003:00.0 1000:005b 1028:1f34 vmkernel vmhba1
000:004:00.0 1077:2532 1077:015c vmkernel vmhba2
000:005:00.0 1077:2532 1077:015c vmkernel vmhba3
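
The lookup above can be scripted so the four codes come out ready to enter into the HCL drop-downs. This is only a minimal sketch, not part of the original post: it assumes `vmkchdev -l` prints one device per line in the five-column layout shown above, and the function name `hcl_ids` is made up for illustration.

```shell
#!/bin/sh
# Print "VID DID SVID SSID" for one device name, reading
# vmkchdev -l style lines on stdin.
# Assumes the five-column layout shown in the post above.
hcl_ids() {
    awk -v dev="$1" '
        $5 == dev {
            split($2, a, ":")   # second column is VID:DID
            split($3, b, ":")   # third column is SVID:SSID
            print a[1], a[2], b[1], b[2]
        }'
}

# On the host this would be: vmkchdev -l | hcl_ids vmhba2
# Here two sample lines from the post stand in for live output.
printf '%s\n' \
    '000:003:00.0 1000:005b 1028:1f34 vmkernel vmhba1' \
    '000:004:00.0 1077:2532 1077:015c vmkernel vmhba2' |
    hcl_ids vmhba2
```

Matching on the exact device name in the last column avoids the case-sensitivity pitfall of a loose grep pattern.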

Glenn7
Contributor

All

I am using the Dell ESXi 5.0 Update 1 ISO image on the Dell R720 and it did not fix the issue. As per Jason's comments, I am also waiting to see whether disabling NetQueue has resolved the problem.

MARKWILL
Contributor

Question to ALL,

Are your affected servers using the iDRAC in "shared" mode on the Broadcom NIC(s), or are you using the embedded DRAC NIC for your server(s)?

Also, are you using iDRAC Express or Enterprise?

Glenn7
Contributor

iDRAC Enterprise on the embedded iDRAC NIC is what we use in our environment.

MARKWILL
Contributor

Had a similar problem on an R810 (not running ESXi) when the iDRAC was used in shared mode with the Broadcom NIC: network connectivity would periodically stop, and only a reboot restored service. The system had iDRAC Express, so no dedicated NIC was available. If your system has a dedicated NIC for iDRAC, use that instead. Otherwise, try disabling iDRAC on the Broadcom NICs. This fixed our issue with the Broadcom NIC dropping or not routing traffic.

Furthermore, we upgraded our iDRAC to Enterprise with the daughter board and used the onboard dedicated NIC for iDRAC, and the network issue abated.

Just a consideration...

vcocaud
Contributor

New driver release:

[Attachment: Sans titre.jpg]

Will try with this config:

5.0 Update 1 Dell ISO upgraded to 5.1, plus the latest driver above.

revox
Contributor

We are experiencing the same issue with our Dell R720.

The onboard daughter card is a BCM57800 (2x10Gb 2x1Gb) with a BCM5720 as a PCIe DP NIC. I've installed the 5.0U1 Dell image and updated the BIOS to 1.2.6. Is this issue limited to the BCM5720? If so, it would be easy to remove.

I have not disabled NetQueue. Would that be a smart thing to do?

Glenn7
Contributor

Yes, disabling NetQueue is the proposed fix. So far no one has reported an issue after disabling it, but it can take weeks for the problem to show up. Judging by everyone who has contacted me, it does seem limited to the BCM5720.

PJudgeAAM
Contributor

I believe that we have just experienced this problem - embedded NICs are 5720s. Server BIOS 1.1.2

Does anyone know if 5719s are definitely NOT affected?

Each of our ESX hosts has an additional PCI card with quad 5719s.

Will implementing the fix just on the 5720s impact etherchannels composed of 2 x 5719s and 2 x 5720s?

Thanks for any update.

JProos
Contributor

My understanding of the proposed fix is that it changes the NetQueue settings in the tg3 driver rather than making any changes to the NIC itself, per se. As such, it probably doesn't care what the NIC model number is, as long as the NIC is using the tg3 driver.

That said, I'm not aware of anyone reporting the issue with 5719s. I believe that all the reports have been with 5720s.

I suggest checking with VMware tech support directly regarding the appropriateness of the fix for the Broadcom 5719 NIC.

Regarding targeting particular NICs for the fix: I don't know how ESXi correlates which 0 goes with which NIC. In all the cases I understand at this point, the host had nothing but 5720s installed (6 in my case), so the command used a simple sequence of 0,0... where the number of zeros matched the number of 5720s in the host. In your case you might need a slightly different command if you want to target just the 5720s and leave the driver for the 5719s alone.
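
To make the "one zero per NIC" idea concrete, here is a small sketch that builds the option string for a given port count. It is illustrative only: the helper name `force_netq_opts` is invented here, the count has to be supplied by hand, and the script only prints the command (the one quoted later in this thread) rather than running it. How positions in the list map to specific NICs is not something this thread establishes.

```shell
#!/bin/sh
# Build "force_netq=0,0,...,0" with one zero per tg3 port.
# $1 = number of ports (must be a positive integer).
force_netq_opts() {
    n=$1
    case "$n" in
        ''|*[!0-9]*)
            echo "usage: force_netq_opts COUNT" >&2
            return 1
            ;;
    esac
    if [ "$n" -lt 1 ]; then
        echo "COUNT must be at least 1" >&2
        return 1
    fi
    list=0
    i=1
    # Append ",0" until the list holds n entries.
    while [ "$i" -lt "$n" ]; do
        list="$list,0"
        i=$((i + 1))
    done
    echo "force_netq=$list"
}

# Print (do not run) the command for a host with six tg3 ports.
echo "esxcfg-module -s $(force_netq_opts 6) tg3"
```

Printing the command first leaves room to review it, which matters on a mixed 5719/5720 host where the right list is still an open question.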

Jason

JProos
Contributor

I agree with the implication that Glenn is making here that it's going to take a while to be sure that the proposed fix actually worked.  In my case, I was experiencing as much as a few weeks between incidents.  In Glenn's case, I think he went as long as 2 months.

I'm quite interested in whether 5.0 U1 is still subject to the same problem or not. In the meantime, I'm sticking with 5.0 with the fix applied.

Jason

Hairyman
Enthusiast

Hey All,

Does anyone have a link to Dell's FTP server for ESXi 5.1, like this one for 5.0 U1: http://ftp.dell.com/FOLDER00609866M/1/ ?

I am trying to install ESXi on a brand new R720 and am also hitting the "no network card detected" issue during the install.

Glenn7
Contributor

You may need to contact Dell, as it's not here. It may not be available yet.

http://ftp.dell.com/Browse_For_Drivers/Servers,%20Storage%20&%20Networking/PowerEdge/PowerEdge%20R72...

JurijC
Contributor

Hairyman, I have made a custom ISO for our R720 which consists of the standard ISO plus the latest net-tg3 drivers published on 18.9.2012. If you need to install 5.1 urgently I can share it with you, but if I were you, I would wait for the official Dell ISO.

I had to make my own ISO because I tried to upgrade 5.0U1 to 5.1 using the vanilla ISO via Update Manager, but got stuck with a half-installed 5.1 which did not recognize the network interface. Instead of reinstalling 5.0U1, I opted for a custom install, since I didn't know when I would have a time slot available for another host upgrade.

JProos
Contributor

Enterprise, here.

revox
Contributor

PJudgeAAM
Contributor

Thanks, Revox - it looks like the same fix is appropriate for both 5719 and 5720. That simplifies things if we need to apply the fix.

I'm going to wait and see for a while. If I don't see any more problems (and the logs don't have the relevant errors), I will try to wait until there's an official patch or driver update from either Dell or VMware (or both).

JProos
Contributor

Everyone,

As of today, Dell tech support tells me the following regarding the status of their support engineer awareness of the issue:

"This has been noted and most of our team is informed of the situation and on how to resolve the matter if the issue has not replicated since the NetQ change."

That was from a Dell Pro-Support engineer.  I don't know if other support engineers were notified.

It sounds like Dell tech support are also not quite convinced that the tg3 driver NetQueue workaround actually works and are relying on us to provide that confirmation. I mentioned to them that it's still too early to say for certain, from my perspective. All the confidence I have in the workaround at this point comes from the fact that I believe it comes from Broadcom via VMware, and I would expect one or both of them to be speaking from some kind of position of knowledge. I don't get the feeling that Dell has done any testing around this issue themselves.

I've asked Dell about the status of the Dell ESXi 5.1 ISO with respect to this issue, and whether that ISO contains the latest Broadcom driver. If it doesn't, I asked whether the latest driver contains a fix.

Anyone starting to wonder what "vmware ready" means?  Or am I late to that party?

Jason

JProos
Contributor

For those who are planning to upgrade to a newer release of ESXi:

vmware tech support tells me that it should not be necessary to reissue the esxcfg-module -s force_netq=0,0,0,0 tg3 command after the upgrade.  It's supposed to persist across ESXi upgrades.  I suppose it's still worth checking that the values are what you want them to be after the upgrade.  There's another command in this discussion that can be used to report the current values.
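
A post-upgrade check along these lines might look like the sketch below. It assumes the current options can be read with `esxcfg-module -g tg3` and that the output contains the option string verbatim. The exact output format is not shown in this thread, so the script only searches for the expected setting. The function name `netq_still_disabled` and the stand-in text are made up for the example.

```shell
#!/bin/sh
# Succeed if the text on stdin contains exactly the expected
# force_netq list. $1 = expected value, e.g. "0,0,0,0".
# The trailing group stops "0,0" from matching "0,0,0,0".
netq_still_disabled() {
    grep -Eq "force_netq=$1([^0-9,]|\$)"
}

# On the host this would be:
#   esxcfg-module -g tg3 | netq_still_disabled 0,0,0,0
# Stand-in text (not real tool output) is used here so the
# sketch runs anywhere.
if echo "options = 'force_netq=0,0,0,0'" | netq_still_disabled 0,0,0,0; then
    echo "force_netq setting still present"
else
    echo "force_netq setting missing, reapply it"
fi
```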

Jason

SCMHenry
Enthusiast

Can you outline the procedure you used to build the custom ISO?

I can't seem to find an ESXi 5.1 offline bundle depot package to do this using the PowerCLI method suggested in KB2005205.
