VMware Cloud Community
tjk
Contributor

VM corruption with iSCSI and Software SANs - anyone else?

Hello folks, I'm a long-time reader, first-time poster!

I could really use some guidance here, because I've run out of ideas and of different ways to test this. I'm getting discouraged: I've spent 10+ days, 18 hours a day on this problem, and nothing I've tried has fixed it or even produced better results yet.

Problem:

Guest VM corruption in a lab setup. The corruption ranges from ext3 journal errors, to fsck runs on the filesystem that never fix anything, to a broken /proc, missing libraries, etc. The guest VMs are running RHEL 4.4 and RHEL 4.5 updated to the latest kernel; it even happens with RHEL 3.8 VMs.

The configuration:

Host machines: 4 x Dell 1950s with PERC 5s for the OS and local storage, 16 GB RAM, dual quad-core Intel 5300-series CPUs, and 2 x GbE connections on each. One GbE goes to the 'public' network; the second GbE is used for the iSCSI network and carries a VMkernel/service console connection to a vSwitch on the 192.168.0.0 network. The front-end switch is a Cisco 3560G; the back-end switch dedicated to the iSCSI network for this lab is a Cisco 2960G with flow control on. Testing has been done with both jumbo and non-jumbo frame sizes configured, which made no difference in the results.

For the host iSCSI setup, I've had the initiator talk to one target that presents all the LUNs, or to multiple targets with one LUN presented per target.

The software SAN setup is as follows:

Intel 5300 quad-core, 8 GB RAM, Windows 2003 Enterprise, 16 x 160 GB SATA II NCQ Seagate HDDs, and a 1 x 16-port 3ware 9650SE RAID controller in various RAID setups/groupings for testing (all HDDs in one RAID 10, 4 HDDs in a RAID 10, etc.). I'm running trials of SANmelody and Falcon's iSCSI Server to test their software solutions and see how far they can go in terms of I/O. Network connectivity off the SAN has been 1 x GbE for all LUNs, 1 x GbE per LUN, 4 x GbE NIC-teamed for load balancing, and NIC-teamed for redundancy. I've tested everything from a single target that all the hosts connect to and share the LUNs through, to a dedicated target per host machine. Every setup I test produces the same results.

The testing:

Create a VM, template it, and deploy it one at a time, two at a time, or 4-5 at a time. Some deployments boot fine; some come up with filesystem errors that an fsck will fix, and sometimes it won't fix them. Sometimes I can boot a VM, run bonnie++ in it looped for 1-2 hours, reboot it, and it is corrupt. Same with IOzone.
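
In case anyone wants to reproduce the loop, here is roughly the kind of driver script I run inside a guest; it's only a minimal sketch, and the mount point and pass count are placeholders for my setup:

#!/usr/bin/env python
# Rough sketch of the stress loop run inside a guest (paths/counts are placeholders).
# It loops bonnie++ against a test directory and checks dmesg for ext3 complaints
# between passes, so I can see which pass first trips the corruption.
import os, sys

TEST_DIR = "/mnt/test"   # placeholder: a directory on the vmdk under test
PASSES = 10              # placeholder: enough passes to keep it busy for 1-2 hours

for i in range(1, PASSES + 1):
    status = os.system("bonnie++ -d %s -u root" % TEST_DIR)   # raw wait status
    dmesg = os.popen("dmesg").read()
    if status != 0 or "EXT3-fs error" in dmesg:
        print("pass %d failed (bonnie++ status %d); check dmesg for EXT3-fs errors" % (i, status))
        sys.exit(1)
    print("pass %d clean" % i)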

I'm not getting any errors on the switch ports: no CRC errors, dropped packets, etc. At peak I'm doing about 400-600 Mb/s off each host node to the SAN, and when the SAN is NIC-teamed I can see 800 Mb/s to 1.5 Gb/s of throughput via the SNMP monitoring I'm doing against the SAN server and the Cisco switch ports. I can even reproduce the issue when I'm not pushing much traffic: if I just clone a VM 3 to 5 times, 2 or 3 of the clones are bad immediately or after a couple of reboots.
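
For reference, the throughput numbers above just come from sampling the switch port byte counters (ifHCInOctets) twice over a known interval; here is the arithmetic as a small sketch, with made-up sample values:

# Throughput in Mb/s from two samples of a 64-bit interface byte counter
# (ifHCInOctets) taken `seconds` apart. The sample values below are made up.
def mbps(octets_t0, octets_t1, seconds):
    return (octets_t1 - octets_t0) * 8 / (seconds * 1000000.0)

# e.g. 2.25 GB transferred over a 30-second window works out to 600 Mb/s:
print(mbps(0, 2250000000, 30))   # prints 600.0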

If anyone wants to touch the configs to poke around, it is an isolated lab I can grant access to.

I'm not seeing async errors in the /var/log/vmk* logs or on the SAN in general.
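
In case it's useful, this is roughly how I've been sweeping those logs; the patterns are just strings I chose to watch for, not any official list of ESX error messages:

# Quick sweep of the vmkernel logs for anything SCSI/iSCSI-looking.
# The patterns are only the strings I happen to be watching for.
import glob, re

PATTERNS = re.compile(r"(scsi|iscsi|abort|reset|H:0x|D:0x)", re.IGNORECASE)

for path in glob.glob("/var/log/vmk*"):
    n = 0
    for line in open(path):
        n += 1
        if PATTERNS.search(line):
            print("%s:%d: %s" % (path, n, line.rstrip()))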

The end goal here is to see how viable an iSCSI back end with software or virtual SANs really is, and so far I am not very impressed.

I am running the latest VI3 with all patches installed, including the ones from yesterday. The iSCSI initiator is the software initiator built into ESX; there is no HBA.

Any ideas?

Best,

Tom

8 Replies
mcwill
Expert

I'm currently running a very similar setup:

2 ESX hosts, SANmelody, 8-way 3ware 9650SE.

So far (5 months running) we have had no corruption. However, all our heavy-I/O VMs are Windows, and I get the impression you are seeing errors in Linux VMs. Have you tried running Iometer in an XP or Win2k3 VM for a couple of hours?

SANmelody also has a trace log that is accessed from its MMC plugin; have you looked there for errors?

Finally, on the 3ware card, what version of firmware are you running? I seem to remember having to update ours, as it was delivered with the initial firmware, which was quite out of date. (Also, have you tried disabling NCQ on the drives?)

tjk
Contributor

Thanks for the reply, answers below.

2 ESX hosts, SANmelody, 8-way 3ware 9650SE.

How are you presenting your LUNs to each host? Just one target with all the LUNs there? Or a target for each host on a different interface?

Are you measuring your iSCSI network performance from the SM server or from each host? If so, what kind of throughput are you pushing?

So far (5 months running) we have had no corruption. However, all our heavy-I/O VMs are Windows, and I get the impression you are seeing errors in Linux VMs. Have you tried running Iometer in an XP or Win2k3 VM for a couple of hours?

I've tested mainly with Linux guest VMs. I did load 2003 twice, and each time, after the build/reboot, it had errors booting, HAL issues, or corruption.

SANmelody also has a trace log that is accessed from its MMC plugin; have you looked there for errors?

Yes, and one of the SEs for SANmelody has looked as well; so far nothing is standing out. No obvious I/O or async errors, etc.

What version of SM sw are you running? What patch level?

Finally, on the 3ware card, what version of firmware are you running? I seem to remember having to update ours, as it was delivered with the initial firmware, which was quite out of date. (Also, have you tried disabling NCQ on the drives?)

9.4.1.2, which is the latest I think. I have not disabled NCQ. Wouldn't that hurt performance? Why would I want to disable it?

Thanks!

Tom

mcwill
Expert

How are you presenting your LUNs to each host? Just one target with all the LUNs there? Or a target for each host on a different interface?

Currently we have 1 target with 1 LUN and all VMs on that LUN.

Are you measuring your iSCSI network performance from the SM server or from each host? If so, what kind of throughput are you pushing?

From the SM server; we tend to peak at 600 Mbps.

What version of SM sw are you running? What patch level?

We run 2.0.1 Update 6 (from memory).

9.4.1.2, which is the latest I think. I have not disabled NCQ. Wouldn't that hurt performance? Why would I want to disable it?

We're running firmware FE9X 3.08.00.004 on the 9650.

I suggested turning off the command queuing because you appear to be losing data under heavy load; if it were me, I'd be looking to simplify the setup as much as possible.

Paul_Lalonde
Commander

Take a look at:

http://kb.vmware.com/kb/51306

And see if any of this matches your issues.

Also, are you using Dell-brand memory or 3rd-party memory? I'd suggest running a VERY strict memory test (look for memtest86+).

Paul

ShadowTechnicia
Contributor

mcwill, you indicated (in March 2006) that you had ESX running on a 3ware 9650SE card - can you provide any more information PLEASE?

I'm trying to do the same now with ESX Server 3.5, and I cannot get past the "no disk found to install on" error.

Thanks in Advance!!

mike_laspina
Champion

Hello,

You have definitely done a lot of groundwork on this.

I would not rule out a hardware fault on the iSCSI server. It really does hint in that direction.

I would turn off flow control; I had one nightmare bug with it some time ago. You have probably tried that already.

Have you considered quickly throwing together a ZFS/iSCSI box, just to rule out hardware/software on the current one?

I am running one in my home lab. It sees very low I/O compared to what your lab is doing, but it has been running two ESX 3.5 engines against a single 160GB LUN for 2 months and has never corrupted anything. Great performance for a junk box: ~200 Mbit/s.

Ben Rockwood has some blogs that would allow you to set it up in a few hours.
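
The gist of it on the Solaris side is only a handful of commands; a rough sketch follows, where the pool, disk and volume names are just examples and the size matches my 160GB LUN (this assumes an OpenSolaris-era box where the zfs shareiscsi property is available):

# Rough sketch of standing up the ZFS/iSCSI target box (names/sizes are examples).
import os

cmds = [
    "zpool create tank c1t1d0 c1t2d0",    # stripe a pool across two spare disks
    "zfs create -V 160G tank/esxlun",     # carve out a 160 GB zvol for ESX
    "zfs set shareiscsi=on tank/esxlun",  # export the zvol as an iSCSI target
]
for c in cmds:
    print("running: " + c)
    os.system(c)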

http://blog.laspina.ca/
vExpert 2009
mcwill
Expert

ShadowTechnician, you misunderstood, I'm afraid.

We have a Win2k3 server running SANmelody on a 3ware 9650SE card, which is used to present an iSCSI target to the ESX server. The ESX server boots from its internal SAS drive.

jasonboche
Immortal

If it helps any, I tried several times to use Fedora Core 4, 5, and 6 as an iSCSI IET target and ran into corruption issues every time. I scrapped Fedora Core, went with the rPath Linux-based Openfiler, and haven't seen corruption in 2 years.

I wanted to mention that because it points to your iSCSI target as a likely source of the corruption issues.

Jas

Jason Boche
VMware Communities User Moderator (http://communities.vmware.com/docs/DOC-2444)
VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+