VMware Cloud Community
evan_1
Contributor

MSCS and RAW Device Mapping

Hi everyone,

I'm having an issue using raw LUNs for a cluster-in-a-box setup with MSCS in our lab environment.

We have two nodes using three mapped raw LUNs for the shared disks in the cluster.

We have another two nodes (not related to the ones above) using four mapped raw LUNs for the shared disks in the cluster.

The user was complaining about getting errors with the original setup of VMDK files, so I had these raw disks presented to our ESX environment to use for these clusters. The issue is that when these VMs are powered on, we get loads of SCSI reservation conflicts in the vmkernel log on all ESX hosts. On top of that, everything else slows down and I get "time out" errors when trying to add more storage. I have to power these VMs off for things to run normally.
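(For reference, this is roughly how I've been pulling the conflicts out of the log from the service console - the exact message text seems to vary, so the pattern may need loosening, and the log location may differ by ESX version:)

# count reservation conflicts logged so far, then show the most recent ones
grep -ic "reservation conflict" /var/log/vmkernel
grep -i "reservation conflict" /var/log/vmkernel | tail -20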

So, I know something is wrong...

Is it because I added the raw LUNs with the nodes powered on? (They didn't get initialized in Windows until all drives were attached to all nodes.) Could it be because MSCS is not configured right in Windows? It appears that when those VMs are powered on, Windows is locking the LUNs and in turn affecting all ESX servers (the other VMFS datastores run fine). Is it because I have too many raw LUNs? The total is seven. The SAN being used is a NetApp 3070 (all I know) - could it be that something was not done on the SAN side?

I attached the vmkernel log to this thread. Any help is appreciated. Thanks so much!

10 Replies
tlaurent
Contributor

Can you go into Cluster Administrator on the nodes and confirm which disks are active on which node? Is this an Active/Active or Active/Passive cluster?
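If it's quicker than clicking through the GUI, cluster.exe from a command prompt on one of the nodes should list each group and resource along with the node that currently owns it (this assumes the Windows Server 2003-era cluster.exe tool):

rem show each cluster group and its current owner node
cluster group /status

rem show each resource (quorum, shared disks, IP, network name) and its owner
cluster resource /status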

evan_1
Contributor

This is an Active/Passive cluster.

Node A is the active node on both clusters.

This is a noob question, but when attaching raw LUNs for a cluster scenario, do I want to set Physical or Virtual compatibility? I set Physical.

tlaurent
Contributor

Do the active nodes show as controlling all the disks?

As I recall, choose Physical for performance and Virtual for advanced functionality like snapshots.

evan_1
Contributor

I'll power them on tonight and take a look. Last time I looked, though, they were all under the control of the active nodes and showing up in Cluster Administrator.

kjb007
Immortal

Are you using physical or virtual mode RDM?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
evan_1
Contributor

Physical

kjb007
Immortal
Accepted Solution

OK, if you're using physical mode RDM, then ESX is not controlling any of the SCSI reservation/release commands for those LUNs; your VM is doing that directly. That is the difference between physical and virtual mode RDM. Can you verify which LUNs the errors you are seeing are for?
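If you want to double-check how the mappings were created, vmkfstools on the service console can query an RDM pointer file and report whether it is a passthrough (physical) or non-passthrough (virtual) mapping. The flags below are what I'd expect on ESX 3.x, and the file/device names are only placeholders for your environment:

# query an existing RDM pointer file - reports passthrough (physical) vs non-passthrough (virtual)
vmkfstools -q /vmfs/volumes/datastore1/node1/quorum-rdm.vmdk

# for reference, how the two modes are created in the first place (run from a folder on a VMFS volume):
vmkfstools -r /vmfs/devices/disks/vmhba1:0:28:0 virtual-rdm.vmdk    # virtual compatibility mode
vmkfstools -z /vmfs/devices/disks/vmhba1:0:28:0 physical-rdm.vmdk   # physical (passthrough) mode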

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
evan_1
Contributor

Yeah,

I went to one of the ESX hosts and looked at a log from May 16, when this all started happening. The following paths are reporting the errors in the vmkernel log (attached to this reply):

vmhba:0:0:28:0
vmhba:0:0:27:0
vmhba:0:0:29:0
vmhba:0:0:30:0
vmhba:0:0:31:0
vmhba:0:0:32:0
vmhba:0:0:33:0

These are the same (raw LUN) paths that are attached to the four nodes. So, if the VM is directly causing this, could it be because of a bad configuration in the guest OS? Err, in Cluster Administrator (MSCS)?
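(In case it's useful, this is roughly how I matched the log entries to those paths from the service console; the path used below is just one of the suspect LUNs as an example, and the commands may differ by ESX version:)

# list every path/LUN this host sees, then narrow to one of the suspect paths
esxcfg-mpath -l
esxcfg-mpath -l | grep "vmhba0:0:28"

# pull the conflict lines for that path out of the vmkernel log
grep -i "vmhba0:0:28" /var/log/vmkernel | grep -i conflict | tail -10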

I didn't set this cluster up, so I would have to talk with the admin who did. If that turns out to be the case, I guess the best plan of action would be to remove the cluster completely and rebuild it?

Thanks again!

evan_1
Contributor

The active nodes do show control of the disks.

kjb007
Immortal

If these events are occurring fairly regularly, then what I would do is shut down the passive node and leave it down. Then watch and see if the errors persist. If they do not, bring the passive node back up, fail the cluster over, shut down the new passive node, and watch for errors again. That way, you can at least rule out contention on the cluster disks as a possibility, and see if you maybe need to rebuild one node instead of the entire cluster.
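If it helps, the failover piece can also be driven from the command line on one of the nodes; the group and node names below are only placeholders for whatever your cluster actually uses:

rem see which node currently owns each group
cluster group /status

rem move the groups to the other node before shutting the new passive node down
cluster group "Cluster Group" /moveto:NODEB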

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB