Hi, could anyone help? I suspect LUN I/O issues. I pulled these entries from vmkwarning in /var/log:
489bb29d-620d30df-76aa-0ff150000000 jrnl drv 4.31] failed: SCSI reservation conflict
Aug 28 12:06:25 esx01 vmkernel: 19:23:22:22.095 cpu7:1037)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Aug 28 12:06:29 esx01 vmkernel: 19:23:22:26.136 cpu3:1040)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Aug 30 02:03:14 esx01 vmkernel: 21:13:18:54.926 cpu6:1043)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Aug 30 02:03:14 esx01 vmkernel: 21:13:18:54.926 cpu6:1043)WARNING: FS3: 4785: Reservation error: SCSI reservation conflict
Aug 30 02:03:14 esx01 vmkernel: 21:13:18:54.926 cpu6:1043)WARNING: FS3: 4979: Reclaiming timed out heartbeat [HB state abcdef02 offset 3766272 gen 366 stamp 1862330112268 uuid
Could anyone help me analyze these logs? Do they refer to a particular LUN?
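For anyone triaging similar logs, a quick way to tally how often the conflict warnings fire is to grep the vmkernel log. This is a self-contained sketch: the two sample lines stand in for what `cat /var/log/vmkwarning` would give you on a real host.

```shell
#!/bin/sh
# Count "reservation conflict" warnings. On a real ESX host you would pipe
# in /var/log/vmkwarning instead of these sample lines.
log='Aug 28 12:06:25 esx01 vmkernel: WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Aug 30 02:03:14 esx01 vmkernel: WARNING: FS3: 4785: Reservation error: SCSI reservation conflict'

# -i: case-insensitive, -c: count matching lines
printf '%s\n' "$log" | grep -ic "reservation conflict"
```

If the count keeps climbing day over day, the LUN is being hammered with reservations rather than hitting a one-off burst.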
We had similar issues in the past; the following settings were configured to fix the problem:
1. Identical LUN ID mapped across the cluster for each LUN
2. Adjust the queue depth of the Qlogic adapter
3. Set Disk.SchedNumReqOutstanding to 64
4. Set the DiskTimeOut registry setting on Windows VMs to avoid disk events being logged
5. Set the LUN's multipathing policy to Fixed (check with your storage vendor)
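On an ESX 3.x service console, items 2–4 would look roughly like the fragment below. This is a sketch, not a recipe: the module name qla2300_707_vmw and the value 64 are examples only; the right module name depends on your HBA model, and the right depth on your storage vendor's guidance.

```
# 3. Cap outstanding requests per VM when several VMs share a LUN
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding

# 2. Set the Qlogic HBA queue depth, then rebuild the boot config
#    (module name is an example; check which qla module your host loads)
esxcfg-module -s ql2xmaxqdepth=64 qla2300_707_vmw
esxcfg-boot -b    # the new depth takes effect after a reboot

# 4. Inside the Windows VM, raise the disk timeout (seconds):
#    HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue (DWORD) = 60
```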
hope this helps.
MF
Hard to tell from the logs, but something is tying up your VMDK files, which is where the conflicts come from.
The order is:
- Number of VMs per LUN: hosts times open VMDKs per LUN should be less than 256, and closer to 200 if you have snapshots.
- Number of hosts per LUN: more hosts sharing a LUN increases the likelihood of SCSI conflicts, because each host needs a reservation per open VMDK. Many VMs plus many hosts means a higher chance of conflict.
- Number of open VMDKs: snapshots and multiple disks per VM both increase SCSI reservations.
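As a quick sanity check of that first rule, you can plug in your own numbers; the host and VMDK counts below are made up for illustration.

```shell
#!/bin/sh
# Rule of thumb: hosts * open VMDKs per LUN should stay under 256,
# and roughly under 200 once snapshots are in play. Example numbers only.
hosts=4
vmdks_per_lun=50
total=$((hosts * vmdks_per_lun))
echo "$total"    # 200: already at the snapshot ceiling
```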
How many hosts do you have per LUN? How many VMs do you have with snapshots or multiple disks?
That should give you some indication of where the problem is. Also, is this Fibre Channel or iSCSI? If Fibre Channel, you should set ql2xmaxqdepth on the Qlogic HBA; the more hosts, the smaller this depth should be.
Hello,
Most reservation conflicts will concentrate on the metadata files on the VMFS volume.
Here are the factors that need to be considered in resolving reservation conflicts:
1. Size of the VMDKs within the VMs
2. Type of I/O generated by the VMs
3. Number of VMs per LUN
4. Snapshot and swap file sizes and locations
5. Performance characteristics of the storage subsystem
6. HBA queue depths
7. Number of ESX hosts per shared LUN
8. Number of storage paths and how they are balanced across multiple SPs
If you can describe some of these elements, we should be able to see if there is an issue in your deployment.
Hello mike.laspina
I appreciate your willingness to help. If you don't mind, I will try to gather as much data as possible and get back to you.
Hello MF,
I will go back and gather as much info as possible. May I ask a few questions regarding your solutions?
1. I don't understand what you mean by "identical LUN ID mapped across the cluster for a LUN."
2. I have 3 ESX hosts in the cluster and am adding the 4th today. I'm using 4Gb Qlogic HBAs; should I adjust the queue depth?
3. Set Disk.SchedNumReqOutstanding to 64 <--- this is the queue depth value, correct?
4. Set the DiskTimeOut registry setting on a Windows VM to avoid disk events being logged <--- are these events logged within the guest OS, or does the VMkernel log them?
5. I am currently using MRU for the multipathing policy; would you mind explaining why Fixed is better?
Sorry about all the questions; I really do appreciate everyone's help.
Thanks in advance.
1) In other words, every LUN must be presented to all ESX hosts with the same LUN ID. So if you have the option to create host groups on your SAN and tie a group to the LUN, then all hosts must be in the same group to enforce this.
2) Yes, you could benefit from setting the queue depth.
3) No, the queue depth is set via the command line: /usr/sbin/esxcfg-module -s ql2xmaxqdepth=64 <qla2300_707_vmw> (the part between the <> depends on your specific HBA model).
4) These events will be logged within the VM's guest OS.
5) MRU or Fixed should be used according to the type of SAN you own. If you have an active/active SAN (both controllers present the LUN to the hosts at the same time), then you need to set this to Fixed. If you have never touched these settings, then MRU is probably what it should be; VMware normally detects this automatically.
So now a couple of questions for you:
1) How big are your VMFS datastores?
2) How many VMs does each datastore contain?
3) Does your virtual infrastructure contain heavy-I/O VMs like Oracle, SQL, etc.?
4) What kind of SAN do you own?
5) Print the output of "esxcfg-mpath -l".
6) Are there any snapshots running? Check with: find /vmfs/volumes/ -iname "*delta.vmdk"
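The snapshot check in (6) works because snapshot redo logs end in -delta.vmdk. You can try the pattern safely against a scratch directory first; the /tmp paths and file names below are made up for the demo, and on a real host you would point find at /vmfs/volumes/ instead.

```shell
#!/bin/sh
# Build a throwaway tree that mimics a datastore with one snapshotted VM.
mkdir -p /tmp/vmfs-demo/datastore1/vm1
touch /tmp/vmfs-demo/datastore1/vm1/vm1.vmdk
touch /tmp/vmfs-demo/datastore1/vm1/vm1-000001-delta.vmdk

# Only the snapshot redo log matches the pattern, not the base disk.
find /tmp/vmfs-demo -iname "*delta.vmdk"
```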
Duncan
My virtualisation blog:
If you find this information useful, please award points for "correct" or "helpful".