VMware Cloud Community
espsgroup
Contributor

Help! SAN crashed but can't get volumes back up and baffled!

Hello all,

First off let me say I opened a Gold Support ticket on this issue but it happened this afternoon and coverage doesn't start until 6:00AM 7/23. I've got an entire Sharepoint and Exchange site completely down so I could really use some help. Apparently a Platinum uplift per incident is $1200.00 and I can't do that. I decided to try posting.

I've got two identical ESX 3.0.1 hosts. They are dual-attached to a pair of QLogic 1400 switches and those switches connect to a pair of redundant Infortrend R2224 SAN arrays. There is no LUN masking in our SAN. Both hosts are zoned to see all LUNs from both arrays. Up until now everything has been working great. We've been in full production. We're only licensed for Standard edition, so we don't run VMotion but I really want to and I think it would work great.

This morning we had a disk crash, and for whatever reason VMware lost access to the LUN on both paths and hung the virtual machines. The volume rebuilt onto the hot-spare (and it's a RAID-6 volume in the first place), but I believe our particular Infortrend unit had a firmware issue. I upgraded the firmware and the array is back to operating normally.

My specific issue is that both hosts should be seeing the same LUNs and therefore the same VMFS volumes, but they are not. One host sees 6 of the VMFS volumes and the other host sees 7; there are 8 total that should be visible to both hosts. It's also weird that even though these are identical machines, cabled symmetrically, they see the LUNs in a different order: sda on machine A doesn't match sda on machine B, etc.

I have tried the echo 1 > /proc/vmware/LVM/EnableResignature and then a rescan, but it doesn't bring anything back. I had to do this initially to get both hosts to see any VMFS volumes on the SAN unit that suffered the failure, but even after all of that the two hosts still aren't seeing the same set of volumes. Subsequent attempts have produced no further results.
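For anyone following along, this is roughly the sequence I've been repeating (esxcfg-advcfg should be the supported equivalent of the echo into /proc; the adapter names are just what's on my hosts, so treat this as a sketch):

# set the resignature flag (same effect as the echo into /proc)
esxcfg-advcfg -s 1 /LVM/EnableResignature
# rescan both FC HBAs for new LUNs and VMFS volumes
esxcfg-rescan vmhba0
esxcfg-rescan vmhba2
# see what came back
ls -l /vmfs/volumes/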

Both hosts ARE seeing the right LUNs, at least Linux is. They aren't in the same order on each host as far as device paths, but they come in the same order from the SAN unit and they are all present on both systems. I just don't understand why each system sees different VMFS volumes.

I have double checked the configuration (at least the way it was working before) and haven't been able to find anything.

I read some threads about having to call support to have help with using fdisk commands but support isn't available to me until tomorrow morning. I've got users complaining hourly.

You know, $1200 per support incident is pretty steep. Even Microsoft only charges $245 per or whatever. Bleh.

Any ideas?

Jeff

GBromage
Expert

Hi Jeff!

I can't offer much (except sympathy and moral support) at this point, but while you're waiting for better solutions, you might want to see if you can get more detail as to why VMware lost access. If your RAID and hot-spares are all working as they should, the VM hosts shouldn't have even known it had happened.

So the problem might be more with the storage processor than the VM configuration. If that's the case, poking at it from the VM side of things could cause more harm than good.

You said that the Linux side of things is seeing the right LUNs. Does that mean that Linux sees all 8 LUNs, even if it can't see a VMFS file system on them?
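One way to cross-check that from the service console (just a sketch, assuming the standard ESX 3.x tools) would be:

# what the Linux layer sees
cat /proc/scsi/scsi
# map vmhba names to /dev/sd* devices, and (with -m) to mounted VMFS volumes
esxcfg-vmhbadevs
esxcfg-vmhbadevs -m

If a LUN shows up in the first two but not in the -m output, ESX sees the disk but isn't mounting a VMFS volume from it.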

Greg.

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!
espsgroup
Contributor

Hey, thanks for looking.

Yes, both hosts see all of the LUNs.

One thing I just noticed is that I still had Resignature set to 1 on both hosts and I've rebooted a number of times. I just set it to 0 on both and I'm rebooting both of them again. I guess I'll see what that does.

Yes, I'm trying not to change anything with VMware. I haven't touched fdisk yet because I'm scared to death of anything that changes the partition table.

Another issue I think I may have is that my multipath policy is set to Fixed (Active/Active) in VMware. I'm not 100% sure, but I think the Infortrends are set up as Active/Passive. I'm trying to dig further to see. This may explain why the servers lost access in the first place.
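If the arrays do turn out to be active/passive, I think the policy can be flipped per LUN from the service console; something like this (a sketch of the ESX 3.x esxcfg-mpath syntax, the LUN name is just an example):

# show current paths and policy
esxcfg-mpath -l
# change one LUN to Most Recently Used (repeat for each LUN)
esxcfg-mpath --policy=mru --lun=vmhba0:0:0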

Thanks!

Jeff

espsgroup
Contributor

Just to illustrate better, this is what /vmfs/volumes looks like on the first host:

drwxrwxrwt 1 root root 980 Jan 4 2007 459d6eb1-0cbef30d-3a30-00144f1ff1a4

drwxrwxrwt 1 root root 2940 Mar 22 23:33 46a17df0-6038a77e-9fb6-00144f1ff84a

drwxrwxrwt 1 root root 1120 Nov 15 2006 46a17df0-75181ab6-d015-00144f1ff84a

drwxrwxrwt 1 root root 1120 May 3 15:12 46a17df1-9bebdc0a-b3cf-00144f1ff84a

drwxrwxrwt 1 root root 1260 Apr 19 14:18 46a3adc9-00a9a2c4-8c68-00144f1ff1a4

drwxrwxrwt 1 root root 1960 Mar 20 11:54 46a3adca-4eec0c66-f42a-00144f1ff1a4

drwxrwxrwt 1 root root 1120 Apr 25 23:23 46a3adca-66715b6d-6b0f-00144f1ff1a4

lrwxrwxrwx 1 root root 35 Jul 22 21:05 AS01-LD00-LUN00 -> 46a17df0-6038a77e-9fb6-00144f1ff84a

lrwxrwxrwx 1 root root 35 Jul 22 21:05 AS01-LD00-LUN01 -> 46a17df0-75181ab6-d015-00144f1ff84a

lrwxrwxrwx 1 root root 35 Jul 22 21:05 AS01-LD02-LUN00 -> 46a17df1-9bebdc0a-b3cf-00144f1ff84a

lrwxrwxrwx 1 root root 35 Jul 22 21:05 AS02-LD00-LUN00 -> 46a3adc9-00a9a2c4-8c68-00144f1ff1a4

lrwxrwxrwx 1 root root 35 Jul 22 21:05 AS02-LD01-LUN00 -> 46a3adca-4eec0c66-f42a-00144f1ff1a4

lrwxrwxrwx 1 root root 35 Jul 22 21:05 AS02-LD02-LUN00 -> 46a3adca-66715b6d-6b0f-00144f1ff1a4

And here it is on the other:

drwxrwxrwt 1 root root 980 Apr 3 12:07 461289c2-fdd288d3-1602-00144f1ff93c

drwxrwxrwt 1 root root 2940 Mar 22 23:33 46a17df0-6038a77e-9fb6-00144f1ff84a

drwxrwxrwt 1 root root 1120 Nov 15 2006 46a17df0-75181ab6-d015-00144f1ff84a

drwxrwxrwt 1 root root 1120 May 3 15:12 46a17df1-9bebdc0a-b3cf-00144f1ff84a

drwxrwxrwt 1 root root 1960 Mar 20 11:54 46a3adca-4eec0c66-f42a-00144f1ff1a4

drwxrwxrwt 1 root root 1120 Apr 25 23:23 46a3adca-66715b6d-6b0f-00144f1ff1a4

drwxrwxrwt 1 root root 1540 Apr 14 00:32 46a3ae2c-3ab5b673-4ec2-00144f1ff93c

drwxrwxrwt 1 root root 1260 Apr 19 14:22 46a3ae32-29686110-40a7-00144f1ff93c

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS01-LD00-LUN00 -> 46a17df0-6038a77e-9fb6-00144f1ff84a

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS01-LD00-LUN01 -> 46a17df0-75181ab6-d015-00144f1ff84a

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS01-LD01-LUN01 -> 46a3ae2c-3ab5b673-4ec2-00144f1ff93c

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS01-LD02-LUN00 -> 46a17df1-9bebdc0a-b3cf-00144f1ff84a

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS02-LD01-LUN00 -> 46a3adca-4eec0c66-f42a-00144f1ff1a4

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS02-LD02-LUN00 -> 46a3adca-66715b6d-6b0f-00144f1ff1a4

lrwxrwxrwx 1 root root 35 Jul 22 21:06 AS02-LD03-LUN00 -> 46a3ae32-29686110-40a7-00144f1ff93c

lrwxrwxrwx 1 root root 35 Jul 22 21:06 Local Storage -> 461289c2-fdd288d3-1602-00144f1ff93c

We don't use the local disks so ignore those.

We should be seeing these volumes on both hosts:

AS01-LD00-LUN00

AS01-LD00-LUN01

AS01-LD01-LUN01

AS01-LD02-LUN00

AS02-LD00-LUN00

AS02-LD01-LUN00

AS02-LD02-LUN00

AS02-LD03-LUN00

Here is a look at dmesg on Host A:

SCSI device sdb: 2048000000 512-byte hdwr sectors (1000000 MB)

sdb: sdb1

SCSI device sdc: 2225655808 512-byte hdwr sectors (1086746 MB)

sdc: sdc1

SCSI device sdd: 276480000 512-byte hdwr sectors (135000 MB)

sdd: unknown partition table

SCSI device sde: 122880000 512-byte hdwr sectors (60000 MB)

sde: unknown partition table

SCSI device sdf: 2048000000 512-byte hdwr sectors (1000000 MB)

sdf: sdf1

SCSI device sdg: 276480000 512-byte hdwr sectors (135000 MB)

sdg: sdg1

SCSI device sdh: 2048000000 512-byte hdwr sectors (1000000 MB)

sdh: sdh1

SCSI device sdi: 1856440320 512-byte hdwr sectors (906464 MB)

sdi: unknown partition table

SCSI device sdj: 2048000000 512-byte hdwr sectors (1000000 MB)

sdj: sdj1

SCSI device sdk: 780888064 512-byte hdwr sectors (381292 MB)

sdk: sdk1

SCSI device sdl: 2621440000 512-byte hdwr sectors (1280000 MB)

sdl: sdl1

SCSI device sdm: 409600000 512-byte hdwr sectors (200000 MB)

sdm: sdm1

SCSI device sdn: 1152176128 512-byte hdwr sectors (562586 MB)

sdn: unknown partition table

SCSI device sdo: 780888064 512-byte hdwr sectors (381292 MB)

sdo: unknown partition table

And on Host B:

SCSI device sda: 2048000000 512-byte hdwr sectors (1000000 MB)

sda: sda1

SCSI device sdb: 2225655808 512-byte hdwr sectors (1086746 MB)

sdb: sdb1

SCSI device sdc: 276480000 512-byte hdwr sectors (135000 MB)

sdc: unknown partition table

SCSI device sdd: 122880000 512-byte hdwr sectors (60000 MB)

sdd: unknown partition table

SCSI device sde: 2048000000 512-byte hdwr sectors (1000000 MB)

sde: sde1

SCSI device sdf: 276480000 512-byte hdwr sectors (135000 MB)

sdf: sdf1

SCSI device sdg: 2048000000 512-byte hdwr sectors (1000000 MB)

sdg: sdg1

SCSI device sdh: 1856440320 512-byte hdwr sectors (906464 MB)

sdh: unknown partition table

SCSI device sdi: 2048000000 512-byte hdwr sectors (1000000 MB)

sdi: sdi1

SCSI device sdj: 780888064 512-byte hdwr sectors (381292 MB)

sdj: sdj1

SCSI device sdk: 2621440000 512-byte hdwr sectors (1280000 MB)

sdk: sdk1

SCSI device sdl: 409600000 512-byte hdwr sectors (200000 MB)

sdl: sdl1

SCSI device sdm: 1152176128 512-byte hdwr sectors (562586 MB)

sdm: unknown partition table

SCSI device sdn: 780888064 512-byte hdwr sectors (381292 MB)

sdn: unknown partition table

cat /proc/scsi/qla2300/1 -- Host A:

SCSI LUN Information:

(Id:Lun) * - indicates lun is not registered with the OS.

( 0: 0): Total reqs 365, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 1): Total reqs 367, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 2): Total reqs 32, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 3): Total reqs 32, Pending reqs 0, flags 0x0, 0:0:81,

( 1: 0): Total reqs 789, Pending reqs 0, flags 0x0, 0:0:82,

( 1: 1): Total reqs 429, Pending reqs 0, flags 0x0, 0:0:82,

( 1: 2): Total reqs 636, Pending reqs 0, flags 0x0, 0:0:82,

( 1: 3): Total reqs 32, Pending reqs 0, flags 0x0, 0:0:82,

( 2: 0): Total reqs 761, Pending reqs 0, flags 0x0, 0:0:83,

( 2: 1): Total reqs 334, Pending reqs 0, flags 0x0, 0:0:83,

( 3: 0): Total reqs 3121, Pending reqs 0, flags 0x0, 0:0:84,

( 3: 1): Total reqs 544, Pending reqs 0, flags 0x0, 0:0:84,

( 3: 2): Total reqs 32, Pending reqs 0, flags 0x0, 0:0:84,

( 3: 3): Total reqs 32, Pending reqs 0, flags 0x0, 0:0:84,

cat /proc/scsi/qla2300/1 -- Host B:

SCSI LUN Information:

(Id:Lun) * - indicates lun is not registered with the OS.

( 0: 0): Total reqs 12, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 1): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 2): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 3): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:81,

( 1: 0): Total reqs 12, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 1): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 2): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 3): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:82,

( 2: 0): Total reqs 12, Pending reqs 0, flags 0x0, 1:0:83,

( 2: 1): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:83,

( 3: 0): Total reqs 12, Pending reqs 0, flags 0x0, 1:0:84,

( 3: 1): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:84,

( 3: 2): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:84,

( 3: 3): Total reqs 11, Pending reqs 0, flags 0x0, 1:0:84,

esxcfg-mpath -l -- Host A:

Disk vmhba0:0:0 /dev/sdb (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeaf7 vmhba0:0:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeaf7 vmhba2:0:0 On

Disk vmhba0:0:1 /dev/sdc (1086746MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeaf7 vmhba0:0:1 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeaf7 vmhba2:0:1 On

Disk vmhba0:0:2 /dev/sdd (135000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeaf7 vmhba0:0:2 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeaf7 vmhba2:0:2 On

Disk vmhba0:0:3 /dev/sde (60000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeaf7 vmhba0:0:3 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeaf7 vmhba2:0:3 On

Disk vmhba0:1:0 /dev/sdf (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeaf7 vmhba0:1:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeaf7 vmhba2:1:0 On

Disk vmhba0:1:1 /dev/sdg (135000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeaf7 vmhba0:1:1 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeaf7 vmhba2:1:1 On

Disk vmhba0:1:2 /dev/sdh (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeaf7 vmhba0:1:2 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeaf7 vmhba2:1:2 On

Disk vmhba0:1:3 /dev/sdi (906465MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeaf7 vmhba0:1:3 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeaf7 vmhba2:1:3 On

Disk vmhba0:2:0 /dev/sdj (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeab4 vmhba0:2:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeab4 vmhba2:2:0 On

Disk vmhba0:2:1 /dev/sdk (381293MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeab4 vmhba0:2:1 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeab4 vmhba2:2:1 On

Disk vmhba0:3:0 /dev/sdl (1280000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:3:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:3:0 On

Disk vmhba0:3:1 /dev/sdm (200000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:3:1 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:3:1 On

Disk vmhba0:3:2 /dev/sdn (562586MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:3:2 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:3:2 On

Disk vmhba0:3:3 /dev/sdo (381293MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:3:3 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:3:3 On

Disk vmhba1:0:0 /dev/sda (69618MB) has 1 paths and policy of Fixed

Local 2:3.0 vmhba1:0:0 On active preferred

esxcfg-mpath -l -- Host B:

Disk vmhba0:0:0 /dev/sda (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeaf7 vmhba0:0:0 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeaf7 vmhba2:0:0 On

Disk vmhba0:0:1 /dev/sdb (1086746MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeaf7 vmhba0:0:1 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeaf7 vmhba2:0:1 On

Disk vmhba0:0:2 /dev/sdc (135000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeaf7 vmhba0:0:2 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeaf7 vmhba2:0:2 On

Disk vmhba0:0:3 /dev/sdd (60000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeaf7 vmhba0:0:3 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeaf7 vmhba2:0:3 On

Disk vmhba0:1:0 /dev/sde (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeaf7 vmhba0:1:0 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeaf7 vmhba2:1:0 On

Disk vmhba0:1:1 /dev/sdf (135000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeaf7 vmhba0:1:1 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeaf7 vmhba2:1:1 On

Disk vmhba0:1:2 /dev/sdg (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeaf7 vmhba0:1:2 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeaf7 vmhba2:1:2 On

Disk vmhba0:1:3 /dev/sdh (906465MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeaf7 vmhba0:1:3 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeaf7 vmhba2:1:3 On

Disk vmhba0:2:0 /dev/sdi (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeab4 vmhba0:2:0 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeab4 vmhba2:2:0 On

Disk vmhba0:2:1 /dev/sdj (381293MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeab4 vmhba0:2:1 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeab4 vmhba2:2:1 On

Disk vmhba0:3:0 /dev/sdk (1280000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeab4 vmhba0:3:0 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeab4 vmhba2:3:0 On

Disk vmhba0:3:1 /dev/sdl (200000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeab4 vmhba0:3:1 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeab4 vmhba2:3:1 On

Disk vmhba0:3:2 /dev/sdm (562586MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeab4 vmhba0:3:2 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeab4 vmhba2:3:2 On

Disk vmhba0:3:3 /dev/sdn (381293MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0236aeab4 vmhba0:3:3 On active preferred

FC 6:1.0 210000e08b812209<->220000d0236aeab4 vmhba2:3:3 On

Disk vmhba1:0:0 /dev/sdo (69618MB) has 1 paths and policy of Fixed

Local 2:3.0 vmhba1:0:0 On active preferred

espsgroup
Contributor

So, it looks like I'm missing AS01-LD01-LUN01 and AS02-LD03-LUN00 on Host A, and I'm missing AS02-LD00-LUN00 on Host B.

I have narrowed it down to these culprit volumes:

Host A:

[root@adcvm03 volumes]# vmkfstools -P 46a3adc9-00a9a2c4-8c68-00144f1ff1a4

VMFS-3.21 file system spanning 1 partitions.

File system label (if any): AS02-LD00-LUN00

Mode: public

Capacity 1048508891136 (999936 file blocks * 1048576), 929720958976 (886651 blocks) avail

UUID: 46a3adc9-00a9a2c4-8c68-00144f1ff1a4

Partitions spanned:

vmhba0:2:0:1

Host B:

[root@adcvm04 volumes]# vmkfstools -P 46a3ae2c-3ab5b673-4ec2-00144f1ff93c

VMFS-3.21 file system spanning 1 partitions.

File system label (if any): AS01-LD01-LUN01

Mode: public

Capacity 1139508510720 (1086720 file blocks * 1048576), 977773002752 (932477 blocks) avail

UUID: 46a3ae2c-3ab5b673-4ec2-00144f1ff93c

Partitions spanned:

vmhba0:0:1:1

[root@adcvm04 volumes]# vmkfstools -P 46a3ae32-29686110-40a7-00144f1ff93c

VMFS-3.21 file system spanning 1 partitions.

File system label (if any): AS02-LD03-LUN00

Mode: public

Capacity 399700393984 (381184 file blocks * 1048576), 184290377728 (175753 blocks) avail

UUID: 46a3ae32-29686110-40a7-00144f1ff93c

Partitions spanned:

vmhba0:2:1:1

Does anyone know how to examine a VMFS volume via the vmhbaX:X:X designator? VMware is just not liking the above volumes on the opposite hosts. Obviously all volumes are correctly visible but NOT ON BOTH HOSTS.
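The closest thing I've found so far is to map the vmhba name back to the Linux device and then poke at it read-only, roughly like this (a sketch with the standard ESX 3.x tools; the device and label are from my Host A):

# map vmhbaC:T:L names to /dev/sd* and, with -m, to mounted VMFS volumes
esxcfg-vmhbadevs -m
# read-only look at the partition table behind vmhba0:2:0
fdisk -l /dev/sdj
# and query the VMFS header by label where it is mounted
vmkfstools -P /vmfs/volumes/AS02-LD00-LUN00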

AGH#@(@#$(#@

What a way to spend Sunday. Smiley Sad

Jeff

GBromage
Expert

If you run fdisk on the /dev/sd(x) devices for the affected LUNs, not changing anything but just printing the partition table, what does it say?
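Something along these lines, read-only (a sketch; substitute the device behind the affected LUN):

# print the partition table without touching it
fdisk -l /dev/sdj
# or interactively: fdisk /dev/sdj, then 'p' to print and 'q' to quit without writing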

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!
espsgroup
Contributor

On Host A:

Disk vmhba0:2:0 /dev/sdj (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b80f644<->210000d0237aeab4 vmhba0:2:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeab4 vmhba2:2:0 On

--

fdisk:

Disk /dev/sdj: 1048.5 GB, 1048576000000 bytes

255 heads, 63 sectors/track, 127482 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System

/dev/sdj1 1 127482 1023999101 fb Unknown

On Host B:

Disk vmhba0:2:0 /dev/sdi (1000000MB) has 2 paths and policy of Fixed

FC 2:1.0 210000e08b803b45<->210000d0237aeab4 vmhba0:2:0 On active preferred

FC 6:1.0 210000e08b812209<->220000d0237aeab4 vmhba2:2:0 On

--

fdisk:

Disk /dev/sdi: 1048.5 GB, 1048576000000 bytes

255 heads, 63 sectors/track, 127482 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System

/dev/sdi1 1 127482 1023999101 fb Unknown

The other two are the same. Partition table seems the same on both. That doesn't mean something in the partition isn't corrupt, I guess.

GBromage
Expert

"The other two are the same. Partition table seems the same on both. That doesn't mean something in the partition isn't corrupt, I guess."

Yes. And the good news is, that (probably) means that the data on the missing LUN is still intact.

You mentioned that you tried setting the LVM.EnableResignature flag. Have you tried setting LVM.DisallowSnapshotLUN to 0? You may need to reboot the host after setting it.
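The esxcfg-advcfg route would be roughly this (a sketch; same settings, just via the supported tool rather than /proc):

# check the current values
esxcfg-advcfg -g /LVM/DisallowSnapshotLUN
esxcfg-advcfg -g /LVM/EnableResignature
# allow the 'snapshot' LUNs to be mounted without a resignature
esxcfg-advcfg -s 0 /LVM/DisallowSnapshotLUN
# then rescan (or reboot) and re-check /vmfs/volumes
esxcfg-rescan vmhba0
esxcfg-rescan vmhba2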

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!
espsgroup
Contributor

I just changed DisallowSnapshotLUN to 0, rebooted, and still no change on Host B. Host A hasn't come up yet.

Jeff

GBromage
Expert

Do you have enough spare space on the SAN to create a new LUN of a similar size?

Just wondering if it would work to create a new LUN, format it and make it visible to both servers, then copy the files from one of the non-working LUNs onto the new one. Then blow away and recreate the old LUN.
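If you go that route, the per-VM copy would look roughly like this (a sketch; 'NEWLUN' and 'myvm' are placeholders for your volume label and VM folder):

# run from the host that can still see the old volume
mkdir /vmfs/volumes/NEWLUN/myvm
# clone each virtual disk onto the new VMFS volume, preserving the VMDK format
vmkfstools -i /vmfs/volumes/AS02-LD00-LUN00/myvm/myvm.vmdk /vmfs/volumes/NEWLUN/myvm/myvm.vmdk
# copy the config and other small files with plain cp
cp /vmfs/volumes/AS02-LD00-LUN00/myvm/*.vmx /vmfs/volumes/NEWLUN/myvm/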

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!
espsgroup
Contributor

Now that is certainly an idea. The Infortrend does have LUN copy capabilities and I have plenty of free space.

I will keep that in mind. VMware support is working on this one now (we paid). Smiley Happy

GBromage
Expert

OK then - hope you get it all sorted.

If VMWare can't fix it, and you do have to do a re-copy, I'll let you know where to send the cheque..... Smiley Wink

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!
Rumple
Virtuoso

Have you tried rescanning each individual HBA on each host, just for kicks and giggles (from within VC and from the command line), just to see what happens?
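From the command line that would be something like this (just a sketch, using the adapter names from earlier in the thread):

# rescan each FC HBA, then list what got mounted
esxcfg-rescan vmhba0
esxcfg-rescan vmhba2
ls -l /vmfs/volumes/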

espsgroup
Contributor

I just wanted to post that VMware helped me get back up and running, but I still haven't figured out my issue 100%.

I am suspicious of the setting "Max Concurrent Host-Lun Connections" on the Infortrend. The default for this setting is 4, which is what my systems are set to. I have 4 servers, each with two HBAs, connecting to both arrays.

I've been seeing SCSI Reservation issues in the logs and this might explain those. I ended up having to just reboot everything including the SAN to get everything up and going again.
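For anyone curious, the conflicts are easy to spot from the service console (a quick sketch):

# count reservation conflicts logged by the vmkernel
grep -ci "reservation conflict" /var/log/vmkernel
# watch them live while rebooting or rescanning
tail -f /var/log/vmkernel | grep -i reservation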

One thing that ticks me off is that VMware has no information on Infortrend's gear. I asked the Infortrend support person if they were going to pursue getting it certified, and they said that they had been "looking into it" but had no further information.

They also just informed me that they had to remove LUN Filtering/Masking from all of their US products shipped after June. Apparently there must be a lawsuit or something pending. Basically, if you update the firmware on your Infortrend now you will break your configuration if you had previously been using LUN Filtering. You have to do it on the switch or at the HBA level now. That really sucks!

Products outside of the US are not affected.

I wonder which storage company it is they are supposedly infringing on?

Jeff

espsgroup
Contributor

Ugh!

I just got another maintenance window to clean some of this up.

I changed Max Concurrent Host-Lun Connections to 32 on each Infortrend. I also upped the Maximum Queued I/O Count to 1024.

I'm still seeing SCSI Reservation Errors like these:

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:43.593 cpu2:1034)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:43.593 cpu2:1034)WARNING: SCSI: 7916: status SCSI reservation conflict, rstatus #c0de01 for vmhba0:2:0. residual R 919, CR 0, ER 3

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:43.593 cpu2:1034)WARNING: FS3: 2484: reservation error: SCSI reservation conflict

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:43.593 cpu2:1034)WARNING: FS3: 2919: Failed with bad0022

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:48.588 cpu2:1034)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:48.588 cpu2:1034)WARNING: SCSI: 7916: status SCSI reservation conflict, rstatus #c0de01 for vmhba0:2:1. residual R 919, CR 0, ER 3

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:48.588 cpu2:1034)WARNING: FS3: 2484: reservation error: SCSI reservation conflict

Aug 10 23:06:24 adcvm02 vmkernel: 0:00:01:48.588 cpu2:1034)WARNING: FS3: 2919: Failed with bad0022

I saw on a Xiotech page that running 'esxcfg-module -s ql2xlogintimeout=5 qla2300_707' to lower the QLogic HBA login timeout should help with this issue, but it did not.
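For the record, this is roughly what I did (the verify and persist steps are my reading of how ESX 3.x handles module options, so treat it as a sketch and double-check before relying on it):

# set the QLogic login timeout option on the qla2300 driver
esxcfg-module -s 'ql2xlogintimeout=5' qla2300_707
# confirm the option string the module will load with
esxcfg-module -g qla2300_707
# rebuild the boot configuration so the option sticks across reboots
esxcfg-boot -b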

I'm re-opening the case with Infortrend, but their support was a little baffled before. Maybe I can get someone better next time.

In the meantime, I now have to figure out how to get all of my volumes and VMs back up.

Argh!@(#*

christianZ
Champion

This problem looks like path thrashing (check e.g. this thread:

http://www.vmware.com/community/thread.jspa;jsessionid=5A79B5695E07EEE0879DD2EDC474931C?messageID=22...

As I remember, I configured the paths with MRU, but that was an older model (we had problems with Fixed).

What does Infortrend say about your config (MRU or Fixed)?

christianZ
Champion

I would be interested to hear whether you could resolve your problems - we will be installing an Infortrend system too.

christianZ
Champion

Any feedback??

espsgroup
Contributor

I wanted to post an update on this issue. I've been swamped with all kinds of other things and this one has come up for me again.

We still have a sort-of working VMware configuration. Right now I have two Sunfire X4200 servers dual-pathed to an Infortrend A24F-R2224 with the latest firmware.

We are still running ESX 3.0.1 but I'm going to upgrade as soon as I have a long enough maintenance window.

What happens is that I don't have access to all LUNs if I reboot my VMware servers. It's like some of them just disappear on reboot, with no explanation. I got nowhere with both VMware support and Infortrend support and just gave up.

I am not convinced that VMotion or ESX in general work correctly with the Infortrends. Each time I reboot I lose access to my VMFS volumes and end up having to remove and re-register all guests. I have to do a server flip-flop because one server won't be able to see the volumes that the other can.
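At least the re-registering can be scripted from the service console, roughly like this (a sketch; the volume label is one of ours and 'myvm' is a placeholder):

# drop the stale registration, then re-register the VM from its .vmx on the VMFS volume
vmware-cmd -s unregister /vmfs/volumes/AS02-LD00-LUN00/myvm/myvm.vmx
vmware-cmd -s register /vmfs/volumes/AS02-LD00-LUN00/myvm/myvm.vmx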

At one time we had two Infortrends shared with 4 hosts, but I have since zoned them separately just to avoid confusion.

I have read about others who have Infortrends working with VMotion and failover, but we have not been successful and all of our gear is in production so testing is limited.

I'm really hoping an upgrade to 3.0.2 or 3.5 will help my issues, but who knows. We're considering Sun 2540 arrays for an upgrade since I am very familiar with Engenio technology and they seem cheap and fast. We were quoted $50K for a 36x300GB 15K SAS unit with full redundancy. I am also interested in EqualLogic since Dell just bought them, but we are not an iSCSI shop (yet).

espsgroup
Contributor

I noticed this in vmkernel:

Feb 7 16:48:35 adcvm03 vmkernel: 69:07:58:13.302 cpu2:1026)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 7 16:48:35 adcvm03 vmkernel: 69:07:58:13.302 cpu2:1026)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 7 16:48:35 adcvm03 vmkernel: 69:07:58:13.302 cpu2:1040)SCSI: 8040: vmhba0:0:0:0 Retry (busy)

Feb 7 16:49:07 adcvm03 vmkernel: 69:07:58:45.328 cpu0:1073)<6>scsi(2:0:0): qla2x00_status_entry No more QUEUE FULL retries..

Feb 7 16:49:07 adcvm03 vmkernel: 69:07:58:45.328 cpu0:1073)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 7 16:49:07 adcvm03 vmkernel: 69:07:58:45.328 cpu0:1073)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 7 16:49:07 adcvm03 vmkernel: 69:07:58:45.328 cpu2:1040)SCSI: 8040: vmhba0:0:0:0 Retry (busy)

Feb 7 16:49:39 adcvm03 vmkernel: 69:07:59:17.356 cpu2:1026)<6>scsi(2:0:1): qla2x00_status_entry No more QUEUE FULL retries..

Feb 7 16:49:39 adcvm03 vmkernel: 69:07:59:17.356 cpu2:1026)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 7 16:49:39 adcvm03 vmkernel: 69:07:59:17.356 cpu2:1026)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 7 16:49:39 adcvm03 vmkernel: 69:07:59:17.356 cpu0:1040)SCSI: 8040: vmhba0:0:1:0 Retry (busy)

All paths are set to MRU:

Disk vmhba0:0:0 /dev/sdb (1000000MB) has 2 paths and policy of Most Recently Used

FC 2:1.0 210000e08b80f644<->210000d0237aeab4 vmhba0:0:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeab4 vmhba2:0:0 On

Disk vmhba0:0:1 /dev/sdc (381293MB) has 2 paths and policy of Most Recently Used

FC 2:1.0 210000e08b80f644<->210000d0237aeab4 vmhba0:0:1 On active preferred

FC 6:1.0 210000e08b801345<->220000d0237aeab4 vmhba2:0:1 On

Disk vmhba0:1:0 /dev/sdd (1280000MB) has 2 paths and policy of Most Recently Used

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:1:0 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:1:0 On

Disk vmhba0:1:1 /dev/sde (200000MB) has 2 paths and policy of Most Recently Used

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:1:1 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:1:1 On

Disk vmhba0:1:2 /dev/sdf (562586MB) has 2 paths and policy of Most Recently Used

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:1:2 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:1:2 On

Disk vmhba0:1:3 /dev/sdg (381293MB) has 2 paths and policy of Most Recently Used

FC 2:1.0 210000e08b80f644<->210000d0236aeab4 vmhba0:1:3 On active preferred

FC 6:1.0 210000e08b801345<->220000d0236aeab4 vmhba2:1:3 On
