Hi folks
Usually, when I troubleshoot vMotion issues, I check the following logs on both participating ESXi hosts:
/var/log/vpxa.log
/var/log/hostd.log
/var/log/vmkernel.log
And sometimes, when needed, also the vCenter Server log:
/var/log/vmware/vpxd/vpxd.log
And the VM-related log file: vmware.log
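To hunt through those files, I usually filter for migration-related keywords. This is only a sketch (the keyword list and paths are my own choices, not anything official); the self-contained example below runs the same filter against two sample lines standing in for a real vpxa.log:

```shell
# On a real host I would run something like:
#   zgrep -ihE 'vmotion|migrate|nfc' /var/log/vpxa.log /var/log/vpxa.*.gz
# (zgrep also covers the rotated .gz logs.)
# Here, sample lines stand in for the real log file:
cat > /tmp/vpxa.sample <<'EOF'
info vpxa[2100426] [Originator@6876 sub=Libs opID=lro-19853-1bb016c2] Optimize socket buffer size.
info vpxa[2100000] [Originator@6876 sub=NfcManager] Copy operation completed successfully
EOF
# Case-insensitive extended-regex filter for migration-related entries:
grep -iE 'vmotion|migrate|nfc' /tmp/vpxa.sample
```

Only the NfcManager line matches; the same pattern applied to hostd.log and vmkernel.log narrows things down quickly.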
Today I wanted to troubleshoot a very slow cold xvMotion in an environment where the migration goes from a vSphere 6.7 U3 vSAN-enabled cluster to a vSphere 6.5 U2 SAN-backed storage array cluster, with very low observed bandwidth. The process succeeds but takes far too much time to complete (that's why I'm troubleshooting).
The problem is that there is not a single log entry related to the running vMotion process, neither in the log files described above nor in the archived (.gz) log files.
I checked the host log level, which was set to its default (info), changed it to verbose, restarted the hostd service, and started another vMotion process - still the same result: no vMotion-related log entries. The hosts are running fine and other log entries are being written, just nothing regarding the vMotion process.
Am I missing something? Has anyone seen the same issue/behavior? Is this related to the cold xvMotion?
Kind regards,
MtheG
Hi
Do we notice recent logging happening in these logs, or are they outdated?
Does it log events if you do a regular vMotion (host to host) to another ESXi host in the cluster? Please also try to migrate this VM to another host and check if it logs events during a Storage vMotion.
Hi ashilkrishnan
Do we notice recent logging happening in these logs, or are they outdated?
Yes, there are recent log entries.
Does it log events if you do a regular vMotion (host to host) to another ESXi host in the cluster?
Not really, only a few hostd entries about the state change of the affected VM.
Please also try to migrate this VM to another host and check if it logs events during a Storage vMotion.
Done, same behavior.
Normally I would expect something like this:
2020-10-16T11:56:12.912Z cpu39:3479990)Migrate: vm 3479991: 3885: Setting VMOTION info: Source ts = 32100428654216515, src ip = <aaa.aaa.aaa.aaa> dest ip = <bbb.bbb.bbb.bbb> Dest wid = 3474032 using SHARED swap, encrypted
2020-10-16T11:56:12.913Z cpu28:3480948)MigrateNet: 1751: 32100428654216515 S: Successfully bound connection to vmknic vmk1 - 'aaa.aaa.aaa.aaa'
2020-10-16T11:56:12.913Z cpu28:3480948)MigrateNet: 1751: 32100428654216515 S: Successfully bound connection to vmknic vmk1 - 'aaa.aaa.aaa.aaa'
2020-10-16T11:56:12.913Z cpu28:3480948)VMotionUtil: 5199: 32100428654216515 S: Stream connection 1 added.
2020-10-16T11:56:26.111Z cpu2:3479991)VMotion: 5367: 32100428654216515 S: Another pre-copy iteration needed with 176070 pages left to send (prev2 8388608, prev 8388608, pages dirtied by pass through device 0, network bandwidth ~1229.474 MB/s, 13478% t2d)
2020-10-16T11:56:26.599Z cpu2:3479991)VMotion: 5277: 32100428654216515 S: Stopping pre-copy: only 13404 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~1441.895 MB/s, 1262% t2d)
2020-10-16T11:56:27.029Z cpu25:3480948)VMotionSend: 5095: 32100428654216515 S: Sent all modified pages to destination (network bandwidth ~1425.207 MB/s)
But that's not the case - why?
Are these entries only written during a hot vMotion, i.e. are they missing because I am running a cold xvMotion?
Kind regards,
MtheG
I played around a little, analyzed a ton of log files, and got some understanding of the observed behavior.
First of all, we need to clarify that a cold xvMotion is not a hot xvMotion! There is a "little-big" difference under the hood: cold xvMotion uses the Network File Copy (NFC) protocol, whereas hot xvMotion uses the vMotion protocol.
As stated in the VMware documentation:
Cold Migration
Moving a powered off or suspended virtual machine to a new host. Optionally, you can relocate configuration and disk files for powered off or suspended virtual machines to new storage locations. You can also use cold migration to move virtual machines from one virtual switch to another, and from one data center to another. You can perform cold migration manually or you can schedule a task.
Hot Migration
Moving a powered on virtual machine to a new host. Optionally, you can also move the virtual machine disks or folder to a different datastore. Hot migration is also called live migration or vMotion. With vMotion, you migrate the virtual machine without any interruption in its availability.
From a network stack perspective, when a cold migration is triggered the traffic goes via the Management-enabled VMkernel interface (if no Provisioning TCP/IP stack is configured). That should not be a big deal if you use a 10GbE or faster network on the Management subnet.
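To see which path the cold migration traffic will take, it helps to check which TCP/IP stacks exist on the host and which VMkernel interfaces are bound to them. A diagnostic sketch from the ESXi shell (assuming these esxcli namespaces are available on your build):

```shell
# List the TCP/IP stacks on the host (default, vmotion,
# vSphereProvisioning if configured, ...):
esxcli network ip netstack list

# List the VMkernel interfaces; the output shows which
# netstack instance each vmk interface belongs to:
esxcli network ip interface list
```

If no interface sits on the Provisioning stack, NFC traffic for a cold migration will ride on the Management interface, as described above.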
So I started to look for NFC-related log entries and found something in vpxa.log:
info vpxa[2100002] [Originator@6876 sub=vpxNfcServer] PROXY connection to NFC: found session ticket:
info vpxa[2100426] [Originator@6876 sub=vpxLro opID=lro-19853-1bb016c2] [VpxLRO] -- BEGIN lro-19853 -- -- VpxNfcServerLro --
info vpxa[2100426] [Originator@6876 sub=Libs opID=lro-19853-1bb016c2] Optimize socket buffer size.
info vpxa[2100426] [Originator@6876 sub=Libs opID=lro-19853-1bb016c2] NfcNet_OptimizeSocketBuffer: Setting buffer size using RTT=0 ms BANDWIDTH=134000 bytes/ms BUFFER_SIZE=40KB
info vpxa[2100426] [Originator@6876 sub=vpxNfcServer opID=lro-19853-1bb016c2] Successfully initialized nfc callback for a write to the socket to be invoked on a separate thread
info vpxa[2100426] [Originator@6876 sub=vpxNfcServer opID=lro-19853-1bb016c2] Plugin started
info vpxa[2100426] [Originator@6876 sub=vpxLro opID=lro-19853-1bb016c2] [VpxLRO] -- FINISH lro-19853
--
info vpxa[2100000] [Originator@6876 sub=vpxLro opID=kg6l9mc7-3894059-auto-2bgoc-h5:70248532-f8-01-ea] [VpxLRO] -- BEGIN task-4279 -- nfcManager -- nfc.NfcManager.copy -- 52e1dd90-fb4c-1216-83ac-cd8068dd55f1
info vpxa[2100000] [Originator@6876 sub=Libs opID=kg6l9mc7-3894059-auto-2bgoc-h5:70248532-f8-01-ea] Optimize socket buffer size.
info vpxa[2100000] [Originator@6876 sub=Libs opID=kg6l9mc7-3894059-auto-2bgoc-h5:70248532-f8-01-ea] NfcNet_OptimizeSocketBuffer: Setting buffer size using RTT=11 ms BANDWIDTH=134000 bytes/ms BUFFER_SIZE=1512KB
info vpxa[2100000] [Originator@6876 sub=NfcManager opID=kg6l9mc7-3894059-auto-2bgoc-h5:70248532-f8-01-ea] Pre-processing copy spec {source-ip}:ds:///vmfs/volumes/5f7850e9-d90af766-a301-92a9fca0002c/{vm-folder}.vmdk -> {destination-ip}:ds:///vmfs/volumes/vsan:52dbd9c07edeecfa-24e791177ca19853/d1d48f5f-a820-a6a9-5cc8-6a2167900002/{vm-folder}.vmdk
info vpxa[2100000] [Originator@6876 sub=NfcManager opID=kg6l9mc7-3894059-auto-2bgoc-h5:70248532-f8-01-ea] Copy operation completed successfully
info vpxa[2100000] [Originator@6876 sub=vpxLro opID=kg6l9mc7-3894059-auto-2bgoc-h5:70248532-f8-01-ea] [VpxLRO] -- FINISH task-4279
What a wonder - there are log entries after all!
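As a side note, the NfcNet_OptimizeSocketBuffer line can be sanity-checked. My assumption (not documented anywhere I know of) is that the buffer size roughly tracks the bandwidth-delay product of the connection:

```shell
# Bandwidth-delay product from the logged values:
# BANDWIDTH = 134000 bytes/ms, RTT = 11 ms
awk 'BEGIN { bdp = 134000 * 11; printf "BDP = %d bytes (~%d KiB)\n", bdp, bdp/1024 }'
```

That comes out at 1,474,000 bytes (~1439 KiB), in the same ballpark as the logged BUFFER_SIZE=1512KB; the difference is presumably rounding or headroom added by the optimizer. For the RTT=0 case the code apparently falls back to a small 40KB floor.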
Kind regards,
MtheG