VMware Cloud Community
rob_nixon
Contributor

SAN storage redundancy with VMware Fault Tolerance

We are building a new data center, with the idea of using VMware Fault Tolerance to create a zero-downtime virtual environment spanning two data centers.  The goal is to have VMs continue running even after loss of access to an entire data center, including its SAN storage units.

I am looking for a way for a vSphere 4 virtual infrastructure to write to replicated storage in two different data centers (with low-latency connections between them).  We can do this with host servers and networking, but I have not found anything in the Fault Tolerance documentation that mentions anything other than "shared storage".  We have experienced failures of individual SAN storage units (including total loss of hundreds of VMs) and are hoping to eliminate the storage unit as a single point of failure.

Has anyone done this, or know how it might be done?

12 Replies
lowteck
Enthusiast

Wow, complete SAN failure?

Most modern SANs have dual storage processors, dual NICs, dual power supplies, and dual UPSes.

That said, I've never read nor heard of FT with 2 SANs.

But there are surely some enterprise SAN solutions that can present a LUN to VMware that is actually replicated across 2 separate physical devices.

Hate to see the invoice, though.

low

rob_nixon
Contributor

We have a virtualized SAN storage system with a lot of redundancy, including multipathing with multiple HBAs, fiber switches, and storage virtualization hardware.  The one component that is not redundant is the storage unit itself (the big box full of disk drives).  The big failure we had was a fairly bizarre combination of a hardware design flaw (which caused loss of power to the box), a firmware bug, and human error.

Bottom line is, whatever is in front of the storage, if that storage unit becomes unavailable, ESX is hosed.

idle-jam
Immortal

VMware FT just creates two instances of the VM, but they still sit on a single shared datastore. FT also only supports 1-vCPU workloads, and it doubles the compute footprint, since the mirrored secondary consumes resources too.

With the right bandwidth (1 Gbps with very, very low latency) you could look at HP P4000 (LeftHand) Network RAID or EMC VPLEX, which do "FT" of SAN storage.

MarekZdrojewski

Hi,

To clarify what idle-jam is talking about, you could watch this video from HP and VMware on YouTube.

- hth

Regards.

| Blog: https://defaultreasoning.com | Twitter: @MarekDotZ |
bulletprooffool
Champion

Theoretically... with unlimited budget, bandwidth and processing power... you could use a cluster of some sort (EMC Cluster Enabler etc.) to manage failover of the storage, then mount that clustered storage as an NFS datastore accessible to both ESX hosts, then run VMware FT on top of this - or try LeftHand Networks?

Not sure if anyone has ever tried this... but you'll still have some delay between the clustered storage going down and becoming available again... but you could try...

Zero-downtime failover is tricky at a minimum, and hundreds of VMs would need MANY hosts, as VMware recommends no more than 4 FT clients per host.

I'd love to have the hardware to build a lab to test this theory, but it would not be cost efficient.

If these are really mission-critical and zero downtime is the target, building a few big clusters to host all your applications may be the lower-downtime method.

VMware FT is a great tool... but it brings a storage dependency with it.

One day I will virtualise myself . . .
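The host-count arithmetic behind the warning above can be made concrete. A rough back-of-the-envelope sketch (the 4-FT-VMs-per-host figure comes from the post above; doubling for secondaries and the single spare host are illustrative assumptions):

```python
import math

def ft_hosts_needed(ft_vms: int, ft_per_host: int = 4, spare_hosts: int = 1) -> int:
    """Estimate hosts required to run a given number of FT-protected VMs.

    Each FT VM also runs a mirrored secondary instance, so the effective
    footprint is doubled before applying the per-host FT limit.
    spare_hosts adds headroom so a host failure leaves room to restart
    secondaries (an assumption, not a VMware-documented rule).
    """
    instances = ft_vms * 2                      # primary + secondary per FT VM
    hosts = math.ceil(instances / ft_per_host)  # per-host FT instance limit
    return hosts + spare_hosts

print(ft_hosts_needed(200))  # 200 FT VMs -> 101 hosts
```

At that scale the "hate to see the invoice" remark applies to the compute side as much as the storage side.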
rob_nixon
Contributor

That video shows exactly what we want to do.  Very impressive. I didn't know about the HP P4000 multi-site SAN systems.   Clustering the SAN across multiple sites is what makes this work.

idle-jam
Immortal

With such solutions, the latency between the two sites has to be very low (less than 5 ms if possible), over a 1 Gbps link or faster.
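For a feel of where that budget goes, here is a rough latency model for one synchronously mirrored write (the 5 ms and 1 Gbps figures come from this thread; the 64 KiB write size and 0.5 ms processing overhead are illustrative assumptions):

```python
def sync_write_latency_ms(distance_km: float, write_kib: float = 64,
                          link_gbps: float = 1.0, overhead_ms: float = 0.5) -> float:
    """Rough round-trip cost of one synchronously mirrored write.

    Propagation: light in fiber covers ~200 km/ms, i.e. ~0.005 ms/km,
    doubled for the acknowledgement trip back.
    Serialization: payload bits divided by the link rate.
    overhead_ms: switch/array processing fudge factor (an assumption).
    """
    propagation_ms = 2 * distance_km * 0.005
    serialization_ms = (write_kib * 1024 * 8) / (link_gbps * 1e9) * 1000
    return propagation_ms + serialization_ms + overhead_ms

# Sites a few hundred metres apart: propagation is negligible, and
# serializing 64 KiB at 1 Gbps (~0.5 ms) dominates.
print(round(sync_write_latency_ms(0.3), 2))   # ~1.03 ms
# At 80 km, propagation alone adds another ~0.8 ms per write.
```

This is why campus-distance sites fit synchronous mirroring comfortably, while longer hauls usually push designs toward async replication.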

rob_nixon
Contributor

We are fortunate that our second data center (under construction) is just a few hundred yards away and will have 10 gigabit Ethernet connectivity.  I expect to see sub-1ms latency between the two once everything is in full operation.  We are also using Cisco Nexus switches, which make it possible to span an IP subnet across data centers, so that an ESX cluster can also span data centers.

The video shows how the fault tolerant ESX cluster can operate across datacenters with HP P4000 multi-site SAN storage.  Does anyone know whether IBM or EMC has capability to do that same thing?

jslarouche
Enthusiast

Look into SVC and getting two IBM XIVs or two HP XP arrays.

rob_nixon
Contributor

   "Look into SVC and getting two IBM XIVs or two HP XP arrays."

jslarouche,

Do you know anyone who has two SVCs with two XIVs clustered across two separate data centers?  We are using SVCs with XIV storage units and would be interested in doing that.  I am looking for someone who has actually done this in real data centers, not just PowerPoint slides.

jslarouche
Enthusiast

We haven't procured these units yet. We are still waiting on budget approvals to come back from the powers above. From what my SAN guys tell me, we would use a new set of MPRs with a 1 Gb dedicated pipe, doing async replication to a new XIV at our lukewarm remote DR site, which is about 80 km away. It's possible; it will just cost a lot of $$$$ to implement. Fingers crossed though!

Question for you: how are the IBM XIVs and SVC technology holding up in your environment? Pretty solid?  These HP EVAs are a bunch of POS devices, two ticking time bombs waiting to go off again. We've had 4 major outages, each taking about 16 hours to recover from. That's what happens when all your eggs are in one basket. Not a great feeling to come into work knowing that all our VMs can go down at any time. It's not sitting well with management right now and has stalled everything in our VMware environment.

rob_nixon
Contributor

Our SAN storage environment is all IBM, with SVCs in front of either DS8000 or, more recently, XIV storage units.  About 1 petabyte of total storage.  We have had various issues with this storage system.  The latest was a bizarre incident with one of the XIVs, involving a design flaw combined with human error and a firmware bug.  We ended up losing 53 terabytes of data, including total loss of 200 VMs.  It took weeks to recover.

None of these systems is perfect.  Anything can, and eventually will, fail. Even top-of-line enterprise storage systems.

Our storage team is looking at replicating data across multiple storage units in separate data centers.  I'm hoping to leverage VMware Fault Tolerance with multi-site SAN storage to create a near-zero downtime environment for highest-criticality VMs.
