Hello all,
we are experiencing serious SCSI reservation issues in our ESX 3.0.1 / VC 2.0.1 environment.
This is our setup and the whole story:
Host hardware:
- 2 IBM xSeries 445 (each with 8 single-core CPUs and 32 GB RAM)
- 3 HP ProLiant DL585 (each with 4 dual-core CPUs and 32 GB RAM)
- 2 HP ProLiant DL580 (each with 4 single-core CPUs and 16 GB RAM)
We started with all servers running ESX 2.5.x attached to an EMC Symmetrix 8530. All servers used three 600 GB LUNs on this box. All have two QLogic HBAs in them. No issues.
Then we started our migration to ESX3. At the same time we also needed to migrate to new SAN storage: six 400 GB LUNs on an HP XP12000. We used the brand new "VMotion with storage relocation" feature to do both migrations. At the beginning this worked really well.
So we re-installed all hosts one after the other with ESX3, attached the new storage LUNs to them (in addition to the old ones) and migrated the VMs from the not-yet-upgraded hosts to the already-upgraded hosts and the new storage.
We started with the three DL585s and were very pleased with the speed and reliability of the process.
However, when we re-installed the first IBM host the trouble began. All sorts of VM-related procedures (e.g. storage relocation, hot and cold, powering on VMs, VMotion, creating new VMs) failed with all sorts of error messages in VirtualCenter. Looking at the vmkernel logs of the hosts we discovered the reason for this: excessive SCSI reservation conflicts. The messages look like this, e.g.:
Nov 14 13:29:43 frasvmhst06 vmkernel: 0:00:03:34.249 cpu4:1045)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts
Nov 14 13:29:43 frasvmhst06 vmkernel: 0:00:03:34.249 cpu4:1045)WARNING: SCSI: 5615: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba2:0:0. residual R 919, CR 0, ER 3
Nov 14 13:29:43 frasvmhst06 vmkernel: 0:00:03:39.086 cpu4:1045)FSS: 343: Failed with status 0xbad0022 for f530 28 2 453782fc 6b8bc9e9 1700770d 1d624ca 4 4 1 0 0 0 0 0
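A quick way to gauge how often this happens is to count the conflict warnings in the vmkernel log. The sketch below runs against two inline sample lines so it can be tried anywhere; on a live host you would point grep at /var/log/vmkernel (or /var/log/vmkwarning) instead:

```shell
# Count SCSI reservation conflict warnings.
# Two sample log lines stand in for a real /var/log/vmkernel here.
printf '%s\n' \
  'Nov 14 13:29:43 host vmkernel: WARNING: SCSI: 5615: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba2:0:0' \
  'Nov 14 13:29:43 host vmkernel: FSS: 343: Failed with status 0xbad0022' \
  | grep -c 'reservation conflict'
```

On a live host the equivalent is simply `grep -c 'reservation conflict' /var/log/vmkernel`.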
Things we have tried so far to make it better:
- filed an SR with VMware. No helpful answers yet.
- checked the firmware code of the XP12000. It is the latest: 50.07.64.
- distributed the SAN load across the two HBAs in each host (three LUNs fixed on the first path, the other three fixed on the second). This helped a lot(!), but we still had frequent reservation conflicts.
- updated all HBAs to the latest EMC-supported BIOS (version 1.47). Did not change anything.
- doubled the HBAs' queue depth to 64. Doesn't seem to help.
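For reference, on ESX 3 the QLogic queue depth is set through a driver module option from the service console. This is only a sketch: the option name ql2xmaxqdepth and the driver module name qla2300_707 are what applied in our setup; verify both on your own host (e.g. with vmkload_mod -l) before running anything.

```shell
# Sketch (assumed option/driver names -- verify on your host first):
# set the QLogic HBA queue depth to 64 on ESX 3
esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
esxcfg-boot -b      # rebuild boot config so the option survives a reboot
reboot
```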
In the meantime we have updated all seven hosts and migrated all 124 VMs to the new storage. The old EMC storage is still connected to all hosts but is unused. We even unloaded the VMFS2 driver as advised somewhere in the SAN configuration guide. So, everything should be quiet now. However, we still see sporadic SCSI reservation conflicts, although there is no storage relocation or VMotion etc. in progress! Even if we just reboot a host it will generate these errors when initializing its SAN storage access.
What's wrong here? Are we already driving VMware to its limits by having 7 hosts accessing 6 LUNs concurrently?
Is it the IBM hardware? Is it ESX3 not properly releasing SCSI locks?
I'd love to read comments from people that have similar problems with maybe even similar hardware configurations or better: no issues with a similar hardware configuration (esp. the IBM hosts accessing a XP12000).
- Andreas
Andreas,
I raised another case with VMware and suggested they already have at least one case logged against this issue. Waiting to hear....
Also getting our SAN guys to double-check everything - waiting for them to get back to me too!
We have two QLA2342 in each of our boxes - as for the detail of the SAN I will have to get back to you on that one.
Gary
Hi all,
here is a short update on this issue.
We have reached a quite stable situation now by doing the following:
1) We configured the SAN paths on each host in such a way that each host
accesses a LUN via the same physical path. Our assumption was that ESX
has a problem handling SCSI reservations if different hosts use different
physical paths (ending up on different ports on the XP12000) to the same
LUN. This way we got rid of most SCSI reservation conflicts!
However, I'm not so sure that you really need the "one physical path". It
helps a lot if you just change the active path from one HBA to the other
(on each host, for each LUN). It looks like changing active paths forces
the hosts to drop (or just forget) SCSI reservations!?
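We made the path changes through the VI Client, but the same thing can be done from the service console. The sketch below is from memory, not a verified recipe: `esxcfg-mpath -l` definitely lists the paths, while the exact flags for pinning a preferred path should be double-checked against `esxcfg-mpath -h` on your own host, and the vmhba names are just placeholders.

```shell
# List all paths per LUN and their state, to see which HBA/port each host uses
esxcfg-mpath -l

# Sketch (flags from memory -- confirm with esxcfg-mpath -h):
# set a fixed policy and mark the same path preferred on every host
esxcfg-mpath --lun vmhba1:0:3 --policy fixed
esxcfg-mpath --lun vmhba1:0:3 --path vmhba1:0:3 --preferred
```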
2) After doing this we were able to use VMotion again on most of the VMs.
So we were able to free the hosts from VMs and reboot them one after
the other. We also took the chance to patch all hosts with the three official
patches that were released for ESX 3.0.1 in the meantime.
After having rebooted all seven hosts all SCSI reservation conflicts were
gone!
We also got interesting feedback from VMware support: They told us that
there is a bug in the QLogic HBA driver of ESX causing some sort of
misbehavior under certain circumstances. I'm missing a straight-forward
explanation here. It involves SAN switch ports and/or ESX hosts going
"bad" or "strange" at the wrong time.
At least they think that this might be the cause of our problems and they
suggested as a workaround to change the ql2xlogintimeout-parameter
of the QLogic HBAs to 5 seconds (default is 20) like this:
# esxcfg-module -s ql2xlogintimeout=5 qla2300_707
# esxcfg-boot -b
# reboot
Just to be sure we implemented this, too (in step 2 above).
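To confirm after the reboot that the option actually stuck, the configured option string for the driver can be queried. Assumption: I believe esxcfg-module on ESX 3 supports -g/--get-options for this; check esxcfg-module --help if it does not.

```shell
# Query the recorded options for the QLogic driver; the output
# should include ql2xlogintimeout=5 if the workaround is in place
esxcfg-module -g qla2300_707
```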
They also provided us with an instrumented build of the QLogic driver that
should fix this problem without needing any workaround.
We will try this next if problems re-appear...
I hope this helps anyone else having these problems. Let me know...
- Andreas
We are using QLogic HBAs.
After doing the math relating to queue depth, we decided not to change the HBAs from the default queue depth.
Good news!
Initially, all our disks were grouped into 33 GB logical device (LDEV) units that were combined into Logical Unit Size Expansion (LUSE) units to create 468 GB LUNs for use by ESX Server. Using these LUNs resulted in various operations failing due to SCSI reservation conflicts reported in the vmkernel log file.
With the same 33 GB LDEVs we were able to create 33 GB LUNs that could be used without problems, but 200 GB LUNs had the same failures due to SCSI reservation conflicts.
VMware support says they understand our situation and are still looking for a solution.
Meanwhile, our storage people suspected that using too many small LDEVs to create a large LUSE LUN might be a bad idea. So they created some large 468 GB LDEVs that were used to create 468 GB LUNs. These single-LDEV LUNs have not failed any of our stress testing with two hosts, so we are going to increase the testing with six hosts and with alternating paths.
We'll keep you posted on these results.
Just wondering if you had heard back about the LUSE LUN issue. We are using LUSE and have seen a few small errors, and I am wondering if that is why.
We have not received any feedback from VMware in a while, except that they have asked us to install all the recent patches to ESX 3.0.1 and VC 2.0.1. We are working on that.
Still no problems with the single-LDEV LUNs, but we have not started to move production virtual machines to the new LUNs. We are waiting for our storage people to restructure their disks for a more effective use of disk space with large, single-LDEV LUNs.
I am also experiencing this problem. I have a VMware case, 323208. It is in Escalated Orange status.
I also noticed that servers in a cluster/HA/DRS configuration performed worse than those in a standalone state when accessing the same LUNs. I am currently trying to find out if our 500 GB LUNs were created using a single LDEV.
For now all my hosts are in a standalone state until this is resolved. In this state I have about 95% success on VMotion, clones, creating from templates, etc.
We have been having the same problem with our ESX servers, and we're connected to a Sun 9990. We've had the problem ever since we installed 3.0 and implemented DRS and Consolidated Backup. We have since removed those and removed the cluster, and still have the same issues. We are running ours on HP BL-20p's and HP BL-45p's. We had tickets in with VMware but were told it was a problem with the SAN. In fact, the SAN was not on the HCL until after 3.0.1 was released. I am trying to replace the 4 Gb HBA drivers with the 2 Gb drivers, but I don't know Linux very well. With document 1560391 I need to know how to find out what driver version we have. I can tell what driver is running, but it's the long name I don't know how to find. Anyone that can help would be greatly appreciated. We're sick of this problem and have a huge stake in VMware in our environment. It really puts a damper on everything. Thanks for the help!
To list the drivers loaded in the VMkernel, use "vmkload_mod -l".
To find out which driver RPM package is installed, use "rpm -qa | grep -i qla".
Hi peetz,
have you tested the parameter "esxcfg-module -s ql2xlogintimeout=5 qla2300_707", and what are the results?
regards
Werner
All our hosts are running with this parameter now. I cannot tell if this helped with our problem. At least it did not cause any new problems.
Andreas
There is a new KB article (http://kb.vmware.com/KanisaPlatform/Publishing/725/8411304_f.SAL_Public.html) with the title "VMotion Failure of Virtual Machines Located on LUSE LUNs on HP XP 10000 and 12000".
Have the HP technician enable Host Mode Option 19. The Host Mode for the LUN should be 0C (Windows).
gretz Fabian
Yes, I'm aware of this article. However, you missed that a firmware
upgrade on the XP12000 might also be necessary to achieve an
improvement.
VMware told me that this is their (and HP's and HDS's) final resolution of
the issue we (and others) are seeing. To me it sounds more like a
workaround...
It will take some more time until we can implement the required changes
in our environment, and I will post again here when this is done to
inform about the (hopefully positive) results.
We still see SCSI reservation conflicts under high I/O load and Sync CRs
(Conflict resolution) retries e.g. when creating or deleting VM snapshots,
and I hope that the changes will make them go away or at least
significantly reduce them.
- Andreas
We have an XP12000 using QLogic HBAs.
Ever since we upgraded to ESX 3.x, even doing a directory list on /vmfs/volumes on any host would take at least 10-15 seconds, and HBA scans for new LUNs could take 60 seconds or more. We even had a server lock up hard while doing an HBA scan, and it had to be rebooted. I thought this was just something we had to live with, because the VMs didn't really show any adverse effects on disk performance; the scans and dir lists just took forever. Anything we did with new LUNs was done with the VMs migrated off the host.
We have 7 hosts - 3 in one cluster, 4 in another. While we were waiting on HP for support on the SAN option (Host Mode Option 19 = ON), I arranged our LUNs so that they were only visible to one cluster at a time (max of 4 ESX hosts per LUN). This seemed to clear up most of the reservation issues. We were able to VMotion, etc. with very few incidents. If a VMotion timed out, I could try it again in 10 minutes or so and it would work.
Most SCSI reservation issues in /var/log/vmkwarning were gone. We would get one every once in a while.
Now, with the HOSTMODE Option 19, those issues have gone away. Dir lists on /vmfs/volumes and lun scans happen immediately.
So this setting has cleared up our current problem with SCSI reservations and some other SAN related ones that have been lingering since our upgrade to ESX 3.
My VMware SE said the other day in passing:
"Thank goodness you went with the DMX instead of the XP12000, they are having tons of issues on the XP12000..."
So it's well known to them; their offices are right down the road and EMC corporate is about 20 minutes away from me.
Our SAN chick is asking me to find out if you set the Host Mode to 19 [Reserve] or if you set it to something else and chose Option 19? We have a Sun 9990, which is the same thing as your XP12000. We have had the same issues that you've had since we installed ESX 3.0. We've been told to set the Host Mode to Novell, and that hasn't worked. We've also been told to change the HBA driver to the 2 Gb one from the ESX 3.0.1 default of 4 Gb. None of that has helped. Thanks for your help!
Hi robsxlt,
the answer is here:
http://kb.vmware.com/KanisaPlatform/Publishing/725/8411304_f.SAL_Public.html
You need to set the host mode to 0c (Windows) and then activate option 19.
However, on the XP12000 you cannot make this change on your own, but need
HP to do it.
- Andreas
May be a dumb question, but what effect did that have on anything else running on your SAN? Did you have to shut down your ESX servers to make this change? HDS is researching for us, but we have Exchange servers, file servers, ESX servers and SAN-to-SAN replication going on with ours, and I'm curious about the implications for the other systems. Thanks for your input!
We have not yet implemented this change, so I cannot give definitive
answers to your questions.
It looks like setting the host mode option alone is not a big deal.
However, since you also need to upgrade your SAN array's firmware this
may have effects on other servers connected to it.
E.g. we have SAP servers that use special features of the XP12000 (like
business copies) and have installed SAN agent software for accomplishing
this. Before updating the XP12000's firmware we need to update this
agent software, because the current versions won't work with the new
firmware.
So, there may be many implications, and you will want to check carefully
with your vendor before you make this change.
Hi jatwell,
We are experiencing the same problem as you, and also the SCSI reservation conflicts, only we have an EMC CX700 array (sorry to invade this HP thread, but I have not found any other info on this problem). The ESX servers take approx. 10-12 minutes to boot with LUNs attached, less than 5 minutes without them.
I have 4 ESX 3.0.1 (fully patched) servers attached to the CX700. I also get timeouts when trying to add VMFS storage to some of the servers; the others work. We have 36 LUNs attached to the servers, 28 RDMs and the rest VMFS for VM system drives.
Anyone know how to change Host mode on a CX700??
Thanks