Re: VM Disk Speed? or Resource Constraint? or?

TheRealJason · ‎07-09-2008

Hi, after running many successful P2Vs on most of my infrastructure, I made it through the list to our main file server. It was previously on an IBM x345 with dual hyperthreaded processors and 3GB of RAM. I have given the VM 2 vCPUs (the host has 2 quad-core 2.33GHz processors) and 2GB of RAM. The disk is on a CX-700, attached via fiber. I took the existing LUNs and created RDMs to attach them to the VM. It is Win 2k3/SP2. Since the conversion, the backups are running for approx 18 hrs, where they previously finished in about 9. We are using TSM for backups, and we are backing up a disk that has about 3 million files. The amount of data backed up on a normal day is anywhere from 8-12GB. The backup agent usually uses about 900MB of RAM throughout the process. It seems to me like it is the inspection times that are taking so much longer, but I can't be sure. The server doesn't really look like it is working that hard; it just seems to take a lot more time to go through the files.
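For a rough sense of scale, the figures above can be turned into an average time per file. This is only a back-of-the-envelope sketch in Python: it assumes the whole backup window is spent walking all 3 million files (ignoring data transfer time), and the function name is made up for illustration.

```python
def per_file_ms(hours, files):
    """Average wall-clock milliseconds per file for one backup run."""
    return hours * 3600 * 1000 / files


FILES = 3_000_000  # "about 3 million files" from the post

before_ms = per_file_ms(9, FILES)   # physical server, ~9 hour backup
after_ms = per_file_ms(18, FILES)   # virtual machine, ~18 hour backup

print(f"physical: {before_ms:.1f} ms/file, virtual: {after_ms:.1f} ms/file")
```

Under that assumption the run goes from roughly 11 ms to roughly 22 ms per file, which is the kind of small per-operation delay that would not show up as a pegged disk in the charts.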

Does anyone have suggestions on what I can look at to help track it down? I know that disk access is throttled, but I'm not sure to what level. The disk does not appear to be pegged from the VI client, but maybe the way it is accessing it doesn't show properly on the intervals that are displayed?

Any thoughts at all?

Thanks,

Jason

BenConrad · ‎07-09-2008

If you P2V'ed a Windows 2003 box you probably ended up with a BusLogic SCSI driver. Switch to the LSILogic and you should see a tremendous improvement in speed.

Here is a snippet from an email I sent after changing from BusLogic -> LSILogic:

Subject: RE: serverXXX poor performance.

Correction:

After I removed the VMware snapshot, additional performance was gained, as seen below:

Configuration                    IOPS    MB/s    Latency (ms)
Before (BusLogic)                 268     8.3    238
After (LSILogic with snap)       1442    45      22.15
After (LSILogic without snap)    1908    59.6    16.7
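Those numbers can be cross-checked with a little arithmetic. The sketch below (Python, with names chosen only for illustration) derives the average request size from MB/s and IOPS, plus the relative IOPS gain. It assumes 1 MB = 1024 KB, which the email does not state.

```python
# (IOPS, MB/s, latency in ms) as reported in the email
results = {
    "BusLogic": (268, 8.3, 238.0),
    "LSILogic with snap": (1442, 45.0, 22.15),
    "LSILogic without snap": (1908, 59.6, 16.7),
}


def avg_io_kb(iops, mb_per_s):
    """Average size of one I/O request in KB (assumes 1 MB = 1024 KB)."""
    return mb_per_s * 1024 / iops


def iops_gain(name, baseline="BusLogic"):
    """How many times more IOPS a configuration achieved than the baseline."""
    return results[name][0] / results[baseline][0]


for name, (iops, mbps, _latency) in results.items():
    print(f"{name}: {avg_io_kb(iops, mbps):.1f} KB per I/O, "
          f"{iops_gain(name):.1f}x the BusLogic IOPS")
```

All three rows come out at roughly 32 KB per I/O, so the test workload looks the same between runs, and the final configuration delivers about 7 times the BusLogic IOPS.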

TheRealJason · ‎07-09-2008

Ahh, interesting. Thanks for the speedy response, I will give this a shot when I next get a chance to reboot.

Should I expect any difficulties in just shutting it down, and making the change, then bringing it back up? Any gotchas or anything?

BenConrad · ‎07-09-2008

Shut down, take a snapshot, change to LSILogic, boot, remove snapshot. The OS should do a good job switching it over.

We now require this step in our P2V documentation.

kastlr · ‎07-09-2008

Hi Jason,

As you're using the same disks as before, it shouldn't be caused by the array.

But now that you have moved the machine into your ESX environment, you're accessing your storage LUNs via shared components, namely your HBAs.

So you should check the following:

- proper settings for the HBA queue size http://www.vmware.com/pdf/vi3_san_design_deploy.pdf

- the block size your TSM backup uses, and adjust the LSI driver settings to match http://kb.vmware.com/kb/9645697

And because you're using the original disks, you have a misaligned partition in use, which has a performance impact.

If possible, I would recommend the following.

1.) Assign new empty disks to your W2k3 SP2 VM

2.) Create an aligned NTFS partition on these disks using the MS diskpart tool.

3.) Follow the instructions published here http://theether.net/kb/100028 which explain how to copy large NTFS volumes without fragmentation.

4.) Adjust the drive letters for the newly added disks

5.) Remove the original disks
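Before going through the steps above, it can be worth confirming whether the existing partition really is misaligned. The check is simple arithmetic on the partition's starting offset, sketched here in Python. The 64 KB boundary is an assumption (verify the stripe element size configured on your array), and the function name is hypothetical. Sector 63 is the default starting sector Windows Server 2003 uses when it creates a partition without an explicit alignment.

```python
SECTOR_BYTES = 512  # classic sector size; an assumption for this sketch


def is_aligned(start_sector, boundary_bytes=64 * 1024, sector_bytes=SECTOR_BYTES):
    """Return True if the partition's starting byte offset falls on the boundary."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


# Default Windows 2003 layout: partition starts at sector 63 (32256 bytes)
print(is_aligned(63))    # False

# Partition created with a 64 KB offset: starts at sector 128 (65536 bytes)
print(is_aligned(128))   # True
```

If the starting offset already falls on the boundary, steps 1 and 2 bring no alignment benefit and only the defragmentation effect of the copy remains.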

And don't forget to check whether the performance degradation is caused by the network, which is now also a shared resource.

Hope this helps a bit.

Ralf

Greetings from Germany. (CEST)

TheRealJason · ‎07-09-2008

As far as the queue depth goes, it does not appear to be an issue when I check the disk portion of esxtop, so I don't think I should change that right off the bat, but I will definitely keep it in mind.

I will give the block size change a go after the change to the LSI driver, as it sounds like something that should be done anyway.

The partition was aligned during the initial implementation of the CX-700, and the data has not moved from there. Are you suggesting that the P2V process would somehow de-align the partition on the disks?

I kind of blew off the network as being a potential issue, but I will look further into that as well.

Thanks,

Jason

kastlr · ‎07-09-2008

Hi Jason,

AFAIK, the P2V process doesn't de-align the partition.

The benefit of the documented procedure using robocopy would be that the file and directory structure of your disks gets placed at the beginning of the NTFS partition.

Browsing the NTFS filesystem would then be much faster, so your users AND your backup should benefit from such a cleanup task.

Regards

Ralf


TheRealJason · ‎08-25-2008

I have made the driver changes, and also given the Data LUN a dedicated adapter. While performance for users is still just fine, the backups are still running for hours on end.

The VirtualCenter charts don't seem to indicate any one resource being maxed out during the backup times, and it seems as if the bottleneck is in the file "processing" portion, which would be inspecting the files for changes.

I have not modified the HBA settings yet, because I do not see them being maxed out. Any suggestions on this?

Thanks,

Jason
