VMware Cloud Community
ManivelR
Hot Shot

3-node VMware cluster memory usage reports more than 90%.

Hi All,


A detailed explanation is given below.
1) We are running a 3-node vSAN cluster.
2) Each node has 765 GB of physical memory, so 3 * 765 GB = 2295 GB of memory in the cluster.

Only one customer's VMs are on it, 18 VMs in total; their summary is given below:
15 VMs * 128 GB of memory = 1920 GB
2 VMs * 16 GB = 32 GB of memory
1 VM * 32 GB of memory
In total, these 18 VMs have 1984 GB of configured virtual memory.


Internal management VMs run on the same cluster; there are 9 of them (VC, vSAN FS, LB, etc.), and their combined configured virtual memory is 62 GB.


Total: 1984 GB + 62 GB = 2046 GB of configured memory (customer VMs + internal management VMs).


vSAN itself takes some memory for caching. As per the vSAN monitor stats, each node uses about 50 GB of memory, so roughly 150 GB is in use across the 3 nodes.

Total: 1984 GB (configured) + 62 GB (configured) + 150 GB (current vSAN usage) = 2196 GB of memory.
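
A quick back-of-the-envelope check of these figures (a minimal Python sketch; the per-node 50 GB vSAN figure is taken from the monitor stats above):

```python
# Back-of-the-envelope check of the figures above (all values in GB).
cluster_physical = 3 * 765                  # 2295 GB of physical memory in the cluster

customer_vms = 15 * 128 + 2 * 16 + 1 * 32   # 1984 GB configured across the 18 customer VMs
management_vms = 62                         # configured memory of the 9 management VMs
vsan_overhead = 3 * 50                      # ~150 GB currently used by vSAN per the monitor stats

configured_total = customer_vms + management_vms   # 2046 GB
with_vsan = configured_total + vsan_overhead       # 2196 GB

print(configured_total, with_vsan, cluster_physical)
print(f"configured + vSAN vs physical: {with_vsan / cluster_physical:.0%}")  # ~96%
```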


My questions:
1) Overall, the VMs' active memory usage is around 100 GB out of 2295 GB of cluster memory, which is only about 5% usage. Why, then, does the cluster report more than 90% usage? Note: we are using a resource pool, and no reservations are set (on each VM > Edit Settings).
2) The three nodes report memory usage of 95%, 93% and 89%. Why are they reporting this?

(screenshot attached: ManivelR_2-1690267877539.png)

 


3) In esxtop, memory ballooning is shown as enabled (MCTL: Y) and it is in a high state.

(screenshot attached: ManivelR_0-1690267629310.png)

4) As the customer VMs run RHEL 8, we have installed open-vm-tools rather than VMware Tools.

5) We created a dedicated resource pool for this customer and enabled expandable reservation on it.

(screenshot attached: ManivelR_1-1690267814053.png)

 

Could this issue be happening because of open-vm-tools or the expandable reservation setting on the resource pool?

 

Can someone please share their thoughts?

I'm not able to understand it clearly.

 

Thanks,

Manivel

 

 

10 Replies
Kinnison
Commander

Hi,


As far as I understand, yours is an often-debated topic. In your other post @Tibmeister already gave you some valid explanations, but perhaps these documents, some a little dated but still good, could help you further with your doubts:

https://kb.vmware.com/s/article/1002604
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/perf-vsphere-memory_mana...
https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-esxi-vcenter-80-resource-management-guide.pdf


Regards,
Ferdinando

pashnal
Enthusiast

Hi @ManivelR 

 

To make this simple: the memory usage of the cluster will always be equal to the memory allocated to the VMs in the cluster, because ESXi will hand out however much you have allocated to the VMs. The same does not apply to CPU; CPU allocation is real-time, so there you see the actual usage of the cluster.

Never overcommit your memory, as ESXi will start reclaiming memory, which will impact your VMs.

Hope this helps . 

 

Thanks , 

Pramod Ashnal  

 

Kinnison
Commander

Hi,


I would not be so absolute.


If you sum up the "granted" and "consumed" memory statistics for a set of virtual machines and compare the result with the percentage of memory reported as consumed by the host, the math doesn't add up, at least not according to what a vCenter object may tell us here or there. CPU usage may well be reported in real time, but how "realistic" it is seems to me a whole other story: according to my vCenter, I have several virtual machines whose CPU usage is, literally, "0 MHz used", which IMHO is, just as literally, nonsense.


Regards,
Ferdinando

ManivelR
Hot Shot

Thanks both for your response.

I'm still not clear on this and haven't found a clue yet.

Kinnison
Commander

Hi,


Please read those documents, they contain answers to your questions.


Regards,
Ferdinando

 

 

 

 

Mortuza1
Contributor

Higher memory utilization: with memory overcommitment, ESX ensures that host memory is consumed by active guest memory as much as possible. Typically, some virtual machines may be lightly loaded compared to others. Their memory may be used infrequently, so for much of the time their memory will sit idle. Memory overcommitment allows the hypervisor to use memory reclamation techniques to take the inactive or unused host physical memory away from the idle virtual machines and give it to other virtual machines that will actively use it.

Tibmeister
Expert

vSAN really changes the entire conversation, because while core vSAN services may only use 50 GB on each host, during operation it may use more due to network congestion, controller saturation, etc. If it's on a 10Gb network, pretty much expect 40% of your host RAM to be consumed by vSAN, and yes, it will take RAM from VMs and force ballooning to occur, because if it doesn't, the VMs will be I/O starved and not function anyway.

You also have the memory used by other ESXi services, and having 15 VMs with 128 GB of RAM each is a hefty load for the host to schedule and keep aligned with the NUMA node configuration if you have more than one socket per host.

When I do capacity planning, I shave off the host requirements before doing any VM calculations. So, you have 768 GB in each host (765 GB reported). Using the reported size, take 40% off the top, then take another 10% off to cover ancillary services. So, 765 GB - 40% = 459 GB; subtracting the additional 10% and rounding down gives you 413 GB of usable RAM on each host. Multiply that by 3 and you get 1239 GB of usable RAM in the cluster.
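
Purely as an illustration, here is that rule of thumb as a small Python sketch (the 40% and 10% figures are the assumptions stated above, not an official VMware formula):

```python
# Capacity-planning rule of thumb from the paragraph above (assumed percentages).
reported_per_host_gb = 765
vsan_share = 0.40        # assume up to 40% of host RAM for vSAN on a 10Gb network
ancillary_share = 0.10   # assume another 10% for other ESXi services

usable_per_host_gb = int(reported_per_host_gb * (1 - vsan_share) * (1 - ancillary_share))
usable_cluster_gb = 3 * usable_per_host_gb

print(usable_per_host_gb)             # 413 GB usable per host
print(usable_cluster_gb)              # 1239 GB usable in the 3-node cluster
print(1984 + 62 > usable_cluster_gb)  # True: configured VM memory exceeds this budget
```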

Now, these calculations have never failed me, and the highest I've pushed a cluster, with no RAM overcommit, is 85% of RAM. I adjust alarms to 90% warning and 95% error, which is more than reasonable on hosts this large.

So, as you can see, you don't have enough available RAM to run the given workload, and the fact that ballooning is occurring reinforces that.

Now, do you really need 128 GB of RAM on those VMs? Probably not. If the 95th percentile utilization over 45 or 90 days is under 70%, the VM has more RAM than it can use and you need to right-size it. A VM running at 75-85% RAM utilization over that timeframe and statistic is perfectly sized from a RAM perspective.
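
A minimal sketch of that right-sizing check, assuming you export per-VM memory utilization samples (as a fraction of configured RAM) from your monitoring tool; the sample data here is made up:

```python
# Right-sizing check from the paragraph above: look at the 95th percentile of
# memory utilization over the sampling window (45 or 90 days of samples).
from statistics import quantiles

# Hypothetical utilization samples for one VM (fraction of configured RAM in use).
samples = [0.42, 0.55, 0.48, 0.61, 0.39, 0.58, 0.52, 0.64, 0.45, 0.67]

p95 = quantiles(samples, n=100)[94]   # 95th percentile
if p95 < 0.70:
    print(f"95th percentile {p95:.0%}: VM is oversized, consider reducing its RAM")
elif 0.75 <= p95 <= 0.85:
    print(f"95th percentile {p95:.0%}: VM is well sized")
else:
    print(f"95th percentile {p95:.0%}: review sizing")
```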

Yes, you will receive pushback on reducing a VM below the "vendor recommendations", but the vendor isn't paying for your hardware, or for the waste of it, and you need to make the financial case to your business that sizing the VM for the actual workload, rather than a theoretical test case, is the more fiscally responsible choice.

ManivelR
Hot Shot

Thanks so much Tibmeister and everyone for your valuable responses.

Hi Tibmeister/All,

 

This is the VMware engineer's response:

On checking, the customer has disabled ballooning in the guest operating system.

Conclusion from the VMware engineer:

Enable VM ballooning to allow the OS to better offload unused memory to the hypervisor for load balancing of resources.

Thank you,

Manivel RR

 

Tibmeister
Expert

The balloon driver can cause some interesting results, and all it really does is create a virtual page in one guest's RAM for another guest to use. That is not something I want in a secure environment. It also makes the VM "donating" the RAM look much more heavily utilized unless the monitoring tool is balloon-aware, which most aren't, so it will lead to some false conclusions.

ansarabass
Enthusiast

Hi Manivel,

Your detailed breakdown helps a lot in understanding your current situation. Let's break this down a bit.

Memory Overcommitment: VMware ESXi allows overcommitment of physical memory. When you sum the virtual memory (configured) of all VMs, it doesn't necessarily mean that amount is actively in use. However, the memory usage stats reflect both active and overhead memory. ESXi would provide memory to VMs based on their demand and not necessarily what's configured. This means even if VMs are only using 5% of their configured memory, ESXi might still allocate more than that due to memory overhead or other factors.

Memory Ballooning: The fact that ballooning is in a high state indicates there's contention. Ballooning is a mechanism where the hypervisor reclaims memory from VMs (through a balloon driver) when there's a memory shortage. This usually happens when there's an actual or perceived memory pressure on the host.

Resource Pools & Expandable Reservation: The expandable reservation on the resource pool means that if a VM within the pool needs more resources than are currently reserved, it can borrow from the parent pool. This can lead to scenarios where VMs in one resource pool are using more memory than you might expect. However, I don't think this is the core of your problem if no reservation is set.

Open-VM Tools vs VMware Tools: While open-VM tools are recommended for Linux VMs like RHEL, they should handle memory management functions quite similarly to the original VMware Tools. I doubt this is the root of your problem, though it's always good to ensure they are updated.

VSAN Memory Usage: You mentioned VSAN is using 50 GB of memory per node, which is significant. While this is expected as VSAN requires memory for things like caching, it adds to the overall memory usage.

I'd recommend the following steps:

Deep Dive with ESXTOP: Check for other memory metrics like swap rate, compression, etc. High swap rates can indicate memory contention.

Check Allocated Memory: The 'Consumed' memory metric in vCenter will tell you how much memory the VM is currently using, including overhead, not just what's active.

VM Memory Metrics: Dive deeper into each VM's memory usage in vCenter. Look for metrics like 'active', 'consumed', 'overhead', and 'shared'.

Cluster-Wide Settings: Ensure there aren't any cluster-wide memory settings or reservations that might be causing unexpected behaviors.

Lastly, consider reaching out to VMware support. They can provide deeper insights by analyzing your logs and metrics directly.
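
If it helps, here is a minimal pyVmomi sketch for the 'VM Memory Metrics' step above, pulling the active, consumed and ballooned figures per VM from vCenter's quick stats (the vCenter hostname and credentials are placeholders):

```python
# Sketch: list active / consumed / ballooned memory per VM via pyVmomi quick stats.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()         # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.example.com",  # placeholder vCenter and credentials
                  user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    qs = vm.summary.quickStats                 # values are reported in MB
    print(f"{vm.name}: active={qs.guestMemoryUsage} MB, "
          f"consumed={qs.hostMemoryUsage} MB, ballooned={qs.balloonedMemory} MB")

view.Destroy()
Disconnect(si)
```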

Hope this helps clarify things a bit!