Hey,
I have a very frustrating problem with Linux VMs' storage hanging. We store our VMs on an EMC Isilon cluster accessed over NFS. The machines frequently freeze for between 4 and 5 seconds, and this affects all Linux VMs (CentOS 5.5) on a particular ESXi host at the same time.
It seems to be related to the virtual disk controller: all VMs with the LSI controller exhibit the issue, but a VM with the IDE controller doesn't.
So far I've boiled it down to a very simple setup that recreates the issue:
1) Install ESXi 4.1 U1 on a server or workstation connected to the network with a single 1 Gbit link
2) Set up a VMkernel port for storage and management traffic
3) Set up a datastore on the Isilon cluster mounted over NFS
4) Create two CentOS 5.5 VMs with the LSI SCSI controller (they don't need network). Boot them into runlevel 1 (i.e. no network, minimal services).
5) On one VM, run ioping to measure latency to its virtual disk, e.g.:
ioping -c 1000 /tmp
On the other VM, write some data to its own virtual disk, e.g.:
dd if=/dev/zero of=/tmp/test bs=1024 count=40000
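If ioping isn't installed in the guest, a rough stand-in (an assumption on my part — it only needs GNU coreutils, which CentOS 5.5 ships) is to time small synchronous writes with dd yourself:

```shell
#!/bin/sh
# Rough ioping stand-in: time a series of small synchronous 4 KiB writes
# and print each latency in milliseconds. /tmp/latprobe is an example path.
for i in 1 2 3 4 5 6 7 8 9 10; do
    start=$(date +%s%N)                # nanoseconds since epoch (GNU date)
    dd if=/dev/zero of=/tmp/latprobe bs=4096 count=1 oflag=sync 2>/dev/null
    end=$(date +%s%N)
    echo "write $i: $(( (end - start) / 1000000 )) ms"
done
rm -f /tmp/latprobe
```

During one of the 4-5 second hangs you would expect a single write in this loop to report several thousand milliseconds.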
Most often, a few seconds after you run the dd command, both VMs will hang for 4-5 seconds. When they return from hanging, ioping always reports a request time between 4000 and 5000 ms. Both machines are frozen during this period, but the network is fine; I can still ping the ESXi host over the same link.
As I said, using the IDE controller seems to be a workaround, but it's not ideal. Interestingly, I don't see the issue when using local disks as the datastore, so it seems to be specific to NFS-mounted datastores.
I've tried updating to the latest patches using vCenter Update Manager.
Any ideas?
Nick
I'd be very curious to find out what kind of resolution you get on this. I am seeing a very similar problem on all guest OSes in my environment. We are running a Nexenta storage platform, which sees almost identical 4000-5000 ms latency spikes (over 10 Gbit links, no less). The spikes go away when using block storage (iSCSI/FC) or the IDE driver. I have found other comments on the web (http://serverfault.com/questions/285214/troubleshooting-latency-spikes-on-esxi-nfs-datastores) suggesting that it is resolved in ESXi 5.0, but that really doesn't help me today. I've got an open ticket on this issue; hopefully they will have a solution that is not "move to ESXi 5.0".
That's interesting to know. I am also planning to open a case. A couple of things I've noticed over the past few days:
1) Using iometer on Windows 7 x64, the highest overall latency measured is nowhere near 4000-5000 ms, and the OS seems responsive while the Linux VMs are hanging. This had me wondering whether it's an issue affecting only Linux (CentOS 5.5 in my case). Interesting that it affects all your VMs.
2) On our production cluster of Dell R610 ESXi hosts, the virtual IDE controller also suffers the hangs, but it doesn't when set up on the test workstation mentioned in my first post. The LSI SCSI and LSI SAS controllers, however, hang on both the test workstation and the R610s. I've yet to clear down one of the R610s and rebuild it from scratch to see if that resolves the hanging with the IDE controller.
Please do let me know if VMware come back to you with any ideas.
Nick
Hi,
The article below describes patch updates that may relate to the issue you are facing.
http://kb.vmware.com/kb/1014886
That KB article does not appear to address NFS latency at all; it covers some Dell Broadcom NIC issues. It also applies to ESXi 4.0, and I am currently on ESXi 4.1.
As for the Windows 7 guests: we do not have any Windows 7 guests in our environment. We have a lot of Windows 2008 R2 and a lot of Linux, but no desktop OSes.
Hi BharatR,
I don't see anything in that KB that refers to this issue. I am using VMware Update Manager and am running the latest patches for ESXi 4.1.0.
Do you see this problem with a single VM running one of the stat utilities referenced (ioping, fsync-tester) against storage provided by an otherwise idle NFS store?
Hi J1mbo,
Our Isilon clusters are in use in production, so unfortunately we've not had the chance to test against an idle cluster. However, running one VM does seem OK; as soon as you launch a second VM on the same ESXi host and run dd, for example, the hangs begin.
I have also tested using a workstation as NFS storage (otherwise idle), and this has no problems with one or two VMs. ioping reports higher latencies when the other VM is writing, but there is no 4-5 second hang.
Interestingly, I don't see this issue with iometer on Windows 7 x64; it seems responsive whilst the Linux hosts are hung.
Nick
PS - great blog, lots of your tips are now linked on our internal wiki page!
I have run it against an idle NFS datastore, and with one VM I do not see the pauses. As soon as I get more than one VM running on an ESXi host, the pauses start.
I've been trying to reproduce this, but without success. My setup is as follows:
Without dd running, ioping is stable at around 1 ms. Running dd ramps up the response times as expected, mostly consistent with NFS server load, with the odd jump; the highest recorded was around 2,700 ms, with 1,700 ms occurring more frequently. I tried 8 and 64 threads on the NFS server, with no particular difference between them (although I couldn't say how many threads were actually in use, as the "th" line in /proc/net/rpc/nfsd seems to be broken on my test box for some reason and always shows zeros).
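For reference, the first number on that "th" line is the thread count, so it can be pulled out with awk. A minimal sketch, shown here against a sample line since the live file only exists while nfsd is running:

```shell
# Parse the nfsd thread count from the "th" line of /proc/net/rpc/nfsd.
# On a live NFS server, replace the echo with:
#   awk '/^th/ {print "nfsd threads:", $2}' /proc/net/rpc/nfsd
sample='th 8 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000'
echo "$sample" | awk '/^th/ {print "nfsd threads:", $2}'
```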
Re ESXi 5: I've only looked at it very briefly, but I did notice a number of new configurables in the Advanced/NFS section, including NFS.MaxQueueDepth, although it defaults to a huge number anyway.
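For anyone who wants to experiment with that setting, it can be read and set from the ESXi 5 console. This is a host-configuration sketch, not a recommendation; the value 64 is just an example, and the host needs a reboot for it to take effect:

```shell
# Read the current NFS queue-depth limit (ESXi 5 advanced setting)
esxcfg-advcfg -g /NFS/MaxQueueDepth

# Cap it at e.g. 64 outstanding requests (example value; reboot to apply)
esxcfg-advcfg -s 64 /NFS/MaxQueueDepth
```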
Any pointers on reproducing the issue would be gratefully received!
To add: generating a workload on a 4 KiB-aligned XFS partition doubled the write throughput (vs. 63-sector alignment) and yielded much more consistent ioping times (ioping -q -i 0 -w 60 -S 10G):
I've seen this before, particularly with NFS shares delivered via XFS running on arrays: a mixed read/write workload in unaligned guest partitions seemed to throttle the disk queue at the NFS server to one IO, effectively limiting disk performance to that of a single spindle. I'm not convinced it's related to the problem here, but I thought I'd mention it, as the greatly extended response times were only present with the unaligned workload (I've run this three times, to be sure).
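The arithmetic behind the alignment check is simple: a partition is 4 KiB-aligned when its start sector times 512 bytes is divisible by 4096. A quick sketch (sector 63 is the old DOS fdisk default; 2048 is the modern 1 MiB default):

```shell
# Report whether a partition's start sector (in 512-byte sectors)
# falls on a 4 KiB boundary.
check_alignment() {
    start=$1
    if [ $(( (start * 512) % 4096 )) -eq 0 ]; then
        echo "sector $start: aligned"
    else
        echo "sector $start: unaligned"
    fi
}
check_alignment 63     # old DOS fdisk default
check_alignment 2048   # 1 MiB boundary
```

You can get the start sector of an existing partition from fdisk -lu or /sys/block/sda/sda1/start.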
Here's the view of network utilisation between two runs:
So I've been running some more tests, and have found an interesting correlation.
I've run ioping on an LSI Logic SAS-connected disk (/dev/sda), and it sees the latency.
I've also set up an NFS mount to the same pool of disks (a different NFS share) and created a .img file there. I formatted that .img file as ext2, mounted it via a loop device, and ran ioping against that mountpoint. When I see the latency on the LSI-connected disk, I do not see the same latency on the NFS-mounted disk. ioping results below.
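For anyone wanting to repeat that comparison, the loop-device side can be set up roughly like this. Paths and sizes are examples of my own, not the poster's exact commands, and the format/mount steps need root, so the sketch skips them when run unprivileged:

```shell
#!/bin/sh
# Build an ext2 image on the NFS share, mount it via a loop device,
# and measure latency through it. IMG/MNT are example paths.
IMG=/tmp/nfstest.img
MNT=/mnt/looptest
# Create a 1 GiB sparse image file (count=0 + seek just truncates to size)
dd if=/dev/zero of="$IMG" bs=1M count=0 seek=1024 2>/dev/null
ls -l "$IMG"
if [ "$(id -u)" -eq 0 ] && command -v mke2fs >/dev/null \
        && command -v ioping >/dev/null; then
    mke2fs -F -q "$IMG"           # format the image as ext2
    mkdir -p "$MNT"
    mount -o loop "$IMG" "$MNT"   # attach via a loop device
    ioping -c 100 "$MNT"          # latency through the loop-mounted image
    umount "$MNT"
else
    echo "skipping format/mount (needs root, mke2fs and ioping)"
fi
rm -f "$IMG"
```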
LSI connected disk
<snip>
4096 bytes from . (ext2 /dev/sda): request=146 time=0.4 ms
NFS connected image file
<snip>
4096 bytes from . (ext2 /dev/loop1): request=147 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=148 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=149 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=150 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=151 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=152 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=153 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=154 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=155 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=156 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=157 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=158 time=0.1 ms
4096 bytes from . (ext2 /dev/loop1): request=159 time=0.4 ms
4096 bytes from . (ext2 /dev/loop1): request=160 time=0.1 ms
I just can't reproduce this problem for some reason.
I tested the IDE controller in one VM and the LSI controller in the other, both ways, i.e. generating load on the LSI and ping-testing on the IDE, and vice versa. The results were:
I also tested moving the VMs to the noop scheduler for these volumes, which seemed to slow things down slightly.
Not too sure what else to add here. I'd proceed by simplifying the NFS test rig, for example down to a single interface and a single pSwitch, maybe?
Thanks for taking the time to look at this, Jim. I also tried it with a workstation acting as an NFS datastore (CentOS 5.5, XFS-formatted, single 1 TB HDD, quad-core 2.93 GHz, 12 GB RAM). Using that, I saw no issues. Maybe that's why you can't recreate it?
I'd be interested to see if mbreitbach tested on some NFS storage other than the main storage he was having issues with (he doesn't specify what storage the idle NFS datastore he tried was on).
One interesting thing I found yesterday is that with an Ubuntu VM (10.04.2 Lucid) running ioping 0.5, I don't see the hangs. This VM happily keeps writing to its disk with generally sub-10 ms response times, while at the same time ioping on my CentOS 5.5 VMs hangs. Again, I'd be interested to know if mbreitbach sees similar behaviour.
I guess this pins it down to something in the VMware software stack that doesn't like certain combinations of guest OSes and storage vendors.
Vague, I know!
Hello,
What kind of equipment do you have between your filer and your ESX hosts?
We experienced the same problem (high latency) with our Linux VMs two weeks ago, and the solution was on the Cisco switch. The two Ethernet links were configured as an EtherChannel on the NetApp but not on the switch, so we recreated the EtherChannel (port channel) with a trunk of two ports on the Cisco, and now it works fine. I think the NetApp was trying to load-balance connections but the switch refused them, causing TCP retransmissions.
It's easy to figure out whether the problem comes from the link between the filer and the switch: shut down all the ports except one and see what happens.
We also disabled flow control (set it to none) because we suspected it of sending pause frames.
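On the Linux side (e.g. a Linux NFS server), pause-frame flow control can be inspected and disabled with ethtool. This is host configuration, so treat it as a sketch; eth0 is an example interface name:

```shell
# Show current flow-control (pause frame) settings for an interface
ethtool -a eth0

# Disable pause-frame autonegotiation, RX and TX handling
ethtool -A eth0 autoneg off rx off tx off
```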
Now the biggest spikes are 50 ms, where they were 5000 ms before we changed the configuration.
Hope this helps.
Thanks for the suggestion, but the issue remains on my test box, which doesn't use any trunking or LAG groups, so I think it's something else.
Testing with a trial of ESXi 5, the problem is no longer there. Now I just need a fix for 4.1!
We see the same bug sometimes.
I've found that the bug appears when we create a big disk file (like 200 GB): my NAS server (HP NAS X1400) overloads its hard drives, then my two ESX hosts lose their connection to it, and if the outage lasts too long the VMs shut down.
Hi!
I hate to say that it's "nice" to see others with the same problem, but at least I'm not alone!
I have very similar problems, and I'm running ESXi 5 with the latest patch.
I'll try to describe my setup quickly. I have two servers, one primary and one secondary. They currently reside in the same server room, but they will be moved to separate fire cells with separate power supplies as soon as I'm done with them. The main concept is that if one fails, the other takes over; I have tested this, and it works.
On both ESX hosts I have a virtual CentOS machine with DRBD and Heartbeat that syncs disk space between the two servers via 10 Gbit fibre and acts as an NFS server providing ESX with NFS datastores. This works brilliantly on one of the stores I set up for the virtual system discs. The latency problem is on a larger datastore I set up for data volumes: as soon as I push that datastore, write latency jumps to 7000 ms.
It's strange, because when I subject the very same disk space to the same write operations directly, latency is 2 ms.
My conclusion: the problem is between NFS and ESX. Some kind of buffer or queue gets clogged and the whole thing goes belly up.
I've come to the end of my line now, and I'm about to delve into the very scary realm of the ESX "Advanced settings". I sure hope I don't break my hosts...
If I find anything, I will write it up in this thread!
=T=