VMware Cloud Community
MJMVCIX
Enthusiast

vSAN performance diagnostics reports: "The vSAN cache may not be sized correctly"

Hi All, 

Issue: 

[Image: Screenshot 2024-04-10 143446 - Copy.png]

We have a 3-node vSAN 7.0.3 hybrid cluster. Each 2U host has 18 disks arranged in 3 disk groups, each with 1 x 745 GB SAS SSD cache device and 5 x 2.4 TB SAS HDD capacity disks.

The focus is to determine whether the cache is sufficient. Across the 3 nodes, the 9 disk groups provide 9 x 745 GB SAS SSD cache drives (6.7 TB of cache) and 45 x 2.4 TB SAS HDD capacity drives. With 56.5 TB used, this equates to a cache-to-used-capacity ratio of approximately 12%, against the guidance of 10%.
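The ratio quoted above can be checked with a few lines of arithmetic. This is only a sketch using the figures from the post: decimal units (1 TB = 1000 GB) are assumed, and the 70/30 read cache to write buffer split is the commonly documented default for hybrid disk groups, included here as an assumption and not as something measured on this cluster.

```python
# Figures taken from the post; decimal units (1 TB = 1000 GB) are assumed.
HOSTS = 3
DISK_GROUPS_PER_HOST = 3
CACHE_GB_PER_DISK_GROUP = 745
USED_CAPACITY_TB = 56.5
READ_CACHE_FRACTION = 0.70  # assumption: hybrid vSAN default read cache share


def total_cache_tb(hosts, dgs_per_host, cache_gb):
    """Total cache tier size across the cluster, in TB."""
    return hosts * dgs_per_host * cache_gb / 1000


def cache_ratio_pct(cache_tb, used_tb):
    """Cache size as a percentage of used capacity."""
    return 100 * cache_tb / used_tb


cache_tb = total_cache_tb(HOSTS, DISK_GROUPS_PER_HOST, CACHE_GB_PER_DISK_GROUP)
ratio = cache_ratio_pct(cache_tb, USED_CAPACITY_TB)
read_cache_tb = cache_tb * READ_CACHE_FRACTION

print(f"Total cache: {cache_tb:.3f} TB")
print(f"Cache to used capacity: {ratio:.1f}%")
print(f"Read cache portion (assumed 70%): {read_cache_tb:.2f} TB")
```

With these inputs the ratio comes out just under 12%, which matches the figure in the post. If the 70% assumption holds, the portion actually available for read caching is smaller than the raw cache total.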

The performance charts show a spike in read latency, as shown below.

[Image: Screenshot 2024-04-10 134650 - Copy.png]

[Image: Screenshot 2024-04-10 142634 - Copy (2).png]

[Image: Screenshot 2024-04-10 143040 - Copy.png]

The goal of vSAN is to have a 90% cache hit rate. A cache hit is when a read request is served from the read cache. Conversely, a cache miss is when the block has to be retrieved from the capacity tier. Since the capacity tier uses magnetic disks, a miss incurs additional read latency. Looking at the read cache hit rate charts for the 9 disk groups, it does not look like it is reaching 90% very often, so are there a lot of cache misses?
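To make the relationship between hit rate and read latency concrete, here is a minimal sketch of the usual weighted-average model. The latency figures (0.2 ms for an SSD cache hit, 10 ms for an HDD miss) are illustrative assumptions, not values measured on this cluster, and the function names are made up for the example.

```python
def hit_rate_pct(hits, misses):
    """Read cache hit rate as a percentage of all reads."""
    total = hits + misses
    if total == 0:
        return 0.0
    return 100 * hits / total


def expected_read_latency_ms(hit_rate, cache_ms, capacity_ms):
    """Average read latency: hits served from cache, misses from capacity tier."""
    h = hit_rate / 100
    return h * cache_ms + (1 - h) * capacity_ms


# Assumed, illustrative device latencies (not taken from the thread).
SSD_HIT_MS = 0.2
HDD_MISS_MS = 10.0

for rate in (90, 70, 50):
    latency = expected_read_latency_ms(rate, SSD_HIT_MS, HDD_MISS_MS)
    print(f"hit rate {rate}% -> ~{latency:.2f} ms average read latency")
```

Under these assumed numbers, dropping from a 90% to a 70% hit rate more than doubles the average read latency, which is why a low hit rate on a hybrid cluster shows up so clearly in read-heavy batch jobs.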

So would the cluster benefit from an additional disk group (DG) per host, and therefore more cache?

Or would an SPBM cache reservation be advisable for the affected VMs? There are 2 VMs that run the main LOB applications and complete daily batch jobs; these used to take 10 hours and now take 12 hours to complete, so this is a review to see whether the cache is struggling.

Thanks

5 Replies
depping
Leadership

I would just create a different policy for the impacted VMs and give it a read cache reservation, but only if the performance is lower than the customer/app owner expects.

MJMVCIX
Enthusiast

@depping thanks. Yes, this was one of my plans; however, initially I want to determine whether the cache in the cluster is sufficient or not.

From what I can see, "Evictions" have been renamed to "Removals" from version 7.x onwards.

I can't find anything definitive on:

1. How to determine whether the cache is sufficient. I know the guidance is that vSAN aims for a 90% cache hit rate; however, the graph I am seeing is difficult to interpret because it keeps going up and down. Is that OK or not? The "Removals" metric also looks very active.

2. Guidance on removals, for example: if you see XX removals more than 5 times in a 24-hour period, add more cache.
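In the absence of a published threshold, one way to turn a noisy hit rate graph into something comparable over time is to summarize exported samples against the 90% target. The sketch below assumes the samples have already been exported as plain percentages; the sample values and the `summarize_hit_rate` helper are hypothetical, and the output is a descriptive summary, not an official sizing rule.

```python
from statistics import mean

TARGET_HIT_RATE_PCT = 90.0  # target quoted in this thread


def summarize_hit_rate(samples, target=TARGET_HIT_RATE_PCT):
    """Summarize read cache hit rate samples (percentages) against a target."""
    if not samples:
        raise ValueError("no samples to summarize")
    below = [s for s in samples if s < target]
    return {
        "mean": mean(samples),
        "min": min(samples),
        "pct_below_target": 100 * len(below) / len(samples),
    }


# Hypothetical samples for one disk group over a batch window.
samples = [95.0, 92.0, 60.0, 45.0, 88.0, 97.0, 30.0, 91.0]
summary = summarize_hit_rate(samples)

print(f"mean hit rate: {summary['mean']:.2f}%")
print(f"lowest sample: {summary['min']:.1f}%")
print(f"samples below target: {summary['pct_below_target']:.0f}%")
```

Running the same summary per disk group, and separately for the batch window and for the rest of the day, would show whether the dips line up with the jobs that have slowed down.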

[Image: Screenshot 2024-04-11 134152 - Copy.png]

depping
Leadership

The way I look at these things is fairly basic: are people complaining about the performance? If not, then it is good enough 🙂

MJMVCIX
Enthusiast

Morning, 

Yes, they are complaining that the processes are taking longer and longer to complete. Compared with the initial baseline from when the cluster was deployed, the process now takes 2 hours longer to complete.

Of course this could be down to the individual VM's software, an application issue, the database having grown, etc., and we can look into that. For now, however, I want to start with the foundation of all this, the vSAN cluster and its cache, to see how that is getting on.

depping
Leadership

You could, as discussed, increase the read cache (reservation) for those VMs, provided it is a limited number of VMs that would be able to benefit from this. Considering the read cache hit rate is relatively low, using this policy capability could improve performance, assuming of course that the performance is lagging because of slower reads.
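As a rough way to size such a reservation: the flash read cache reservation policy rule is expressed as a percentage of the logical size of the VM disk object, so the reserved amount can be estimated up front. The sketch below uses hypothetical VMDK names, sizes, and a hypothetical 5% reservation. It also assumes 70% of each 745 GB cache device is read cache, and it ignores component placement across disk groups, so treat the result as an order-of-magnitude check only.

```python
def reserved_read_cache_gb(vmdk_gb, reservation_pct):
    """Read cache reserved for one VMDK, given a percentage of its logical size."""
    return vmdk_gb * reservation_pct / 100


# Hypothetical logical VMDK sizes (GB) for the two LOB VMs.
vmdks_gb = {"lob-vm-1": 2000, "lob-vm-2": 1500}
RESERVATION_PCT = 5.0  # hypothetical policy value

# Cluster read cache, assuming 70% of each cache device is read cache.
CLUSTER_READ_CACHE_GB = 9 * 745 * 0.70

total_reserved_gb = sum(
    reserved_read_cache_gb(size, RESERVATION_PCT) for size in vmdks_gb.values()
)
share_pct = 100 * total_reserved_gb / CLUSTER_READ_CACHE_GB

print(f"Total reserved read cache: {total_reserved_gb:.0f} GB")
print(f"Share of cluster read cache: {share_pct:.1f}%")
```

Reserved cache is no longer shared with other VMs, so it is worth checking that the reserved share stays small relative to the cluster total before applying the policy.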
