VMware Networking Community
lerf2
Contributor

Unable to connect to the VMkernel end-point (port 2222)

Hi,

I have run into an issue that appears after rebooting a fully configured vCenter / ESXi / NSX Manager environment together with the service VM.

After the reboot, the connection between the service VM and the ESXi VMkernel end-point fails.

The only workaround is to remove the service VM from ESXi and then click Resolve in NSX Manager to redeploy the service VM.

After that, the connection can be established.

As far as I know, for a network service insertion service, once everything is configured in NSX Manager and the service VM has loaded dvfilterklib and created the shared memory device /dev/dvfilterk_shm, the DVFilter_Init function should be able to communicate with ESXi through VMCI.

What I do not understand is why, after rebooting vCenter / ESXi / NSX Manager and the service VM, the VMCI connection between the service VM and ESXi is broken and cannot be recovered.

Below is my environment:

vCenter 6.0.0, 3634794

ESXi 6.0.0, 3620759

NSX 6.2.3, 3979471

==== error log from sample program ====

DVFilterlib.0:4832: Opened the shared memory device /dev/dvfilterk_shm_0 : 4

DVFilterlib.0:4909: Attempting connect to VMKernel using vSockets (port 2222)

DVFilterlib.0:4914: Unable to connect to the VMKernel end-point (port 2222)

DVFilter_Init failed, returned 0xa

DVFilter_Init errno(104): Connection reset by peer

====
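For anyone triaging a log like the one above, here is a minimal Python sketch that pulls out the stages the log reports and the errno. The function name `summarize` and the small errno table are illustrative assumptions and not part of any VMware tool; the errno numbers are the Linux values, which matches the "Connection reset by peer" text shown for 104.

```python
import re

# The sample log from this post, with the blank lines removed.
SAMPLE_LOG = """\
DVFilterlib.0:4832: Opened the shared memory device /dev/dvfilterk_shm_0 : 4
DVFilterlib.0:4909: Attempting connect to VMKernel using vSockets (port 2222)
DVFilterlib.0:4914: Unable to connect to the VMKernel end-point (port 2222)
DVFilter_Init failed, returned 0xa
DVFilter_Init errno(104): Connection reset by peer
"""

# Linux errno values for the most common connect() failures (assumption:
# the service VM is a Linux guest).
LINUX_ERRNO = {104: "ECONNRESET", 110: "ETIMEDOUT", 111: "ECONNREFUSED"}


def summarize(log_text):
    """Return which stages the log shows and the reported errno, if any."""
    summary = {"shm_opened": False, "connect_failed": False,
               "port": None, "errno": None, "errno_name": None}
    for line in log_text.splitlines():
        if "Opened the shared memory device" in line:
            summary["shm_opened"] = True
        match = re.search(r"Unable to connect .*\(port (\d+)\)", line)
        if match:
            summary["connect_failed"] = True
            summary["port"] = int(match.group(1))
        match = re.search(r"errno\((\d+)\)", line)
        if match:
            code = int(match.group(1))
            summary["errno"] = code
            summary["errno_name"] = LINUX_ERRNO.get(code)
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Read this way, the log says the shared memory device was opened on the guest side, and the failure only shows up at the vSockets connect step, where the peer reset the connection.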

Does anyone know what might be the root cause for this issue?

Are there any other checkpoints that might help with troubleshooting?

Thanks!

5 Replies
bayupw
Leadership

Hi, NSX 6.2.3 has some known issues, related to DFW among other things.

In fact, the NSX for vSphere 6.2.3 release has been pulled from distribution so you should not be able to download it anymore.

NSX for vSphere Field Advisory – July 2016 Edition - Support Insider - VMware Blogs

"NSX for vSphere 6.2.3 has an issue that can affect both new NSX customers as well as customers upgrading from previous versions of NSX. The NSX for vSphere 6.2.3 release has been pulled from distribution. The current version available is NSX for vSphere 6.2.2, which is the VMware minimum recommended release.  Refer to KB 2144295. VMware is actively working towards releasing the next version to replace NSX for vSphere 6.2.3 *"

If it is a new deployment and an upgrade is feasible, I would suggest upgrading to NSX 6.2.4 or 6.2.5.

Bayu Wibowo | VCIX6-DCV/NV
Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
https://github.com/bayupw/PowerNSX-Scripts
https://nz.linkedin.com/in/bayupw | twitter @bayupw
lerf2
Contributor

Hi Bayu,

Thanks for your suggestion. We have upgraded NSX Manager to 6.2.5, but the connection failure between the service VM and the ESXi kernel is still there.

Are there any other checkpoints we can look into?

From NSX Manager installation status, DFW and service deployments are all green.

On ESXi, vsipioctl getfilters shows that the network filter policy is attached.

But I cannot understand why the VMCI channel between ESXi and the service VM is not ready.

Are there any services or processes in ESXi that are related to the VMCI channel for dvfilter?

cnrz
Expert

The following points may be helpful:

Is it related only to the Service VM, or does the dFW on the host also not work properly?

Also, does it affect a single ESXi host or other ESXi hosts as well? Deployment of the Service VM may be done automatically by the management software of the ecosystem partner.

Troubleshooting from the Service VM side may also be helpful for the specific ecosystem. For example, if it is Palo Alto, troubleshooting on the Palo Alto side may help.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10108...

There are two types of VMCI communication:

  • Datagrams: connectionless – similar to UDP
  • Queue Pairs: connection oriented – similar to TCP

VMCI provides socket APIs similar to those already used for TCP/UDP applications. IP addresses are replaced with VMCI ID numbers. For example, it is possible to port netperf to use VMCI sockets instead of TCP/UDP.
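To illustrate the addressing model only, here is a rough Python sketch that attempts a stream connection to a (context ID, port) pair instead of an (IP, port) pair. It uses the Linux `AF_VSOCK` family exposed by Python's `socket` module. Whether the DVFilter channel in a given SVM is reachable this way, and which context ID it listens on, are assumptions; the DVFilter library itself uses VMware's own vSockets interface. The helper name `probe_vsock` is hypothetical.

```python
import socket


def probe_vsock(cid, port, timeout=2.0):
    """Try a stream connection to (cid, port); return (ok, detail)."""
    family = getattr(socket, "AF_VSOCK", None)
    if family is None:
        # Non-Linux platform or a Python build without vsock support.
        return (False, "AF_VSOCK is not available in this Python build")
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # The address is a (context ID, port) tuple, not (IP, port).
            sock.connect((cid, port))
        return (True, "connected")
    except OSError as exc:
        # Covers refused, reset, timed-out and unsupported-family errors.
        return (False, f"{type(exc).__name__}: {exc}")


if __name__ == "__main__":
    # Port 2222 is the one named in the error log; the context ID is an
    # assumption (2 is the well-known "host" ID in the Linux vsock headers).
    host_cid = getattr(socket, "VMADDR_CID_HOST", 2)
    print(probe_vsock(host_cid, 2222))
```

A probe like this does not replace DVFilter_Init, but the returned detail string may help tell a refused or reset connection apart from a missing transport.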

Since VMCI works through API calls instead of TCP/IP, vMotion of the Service VM should be disabled, as migrating it may also create problems, although this may not be related to your issue:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=21414...

A Service VM (SVM) provides the plumbing required for special applications in workload virtual machines to access IO, generally networking, before the IO leaves the virtual machine through conventional means. (For example, through the virtual NIC).

Examples of such workload VMs are:

  • NSX Guest Introspection VM
  • McAfee IDS/IPS/Firewall
  • Palo Alto Networks Firewall
  • Symantec IDS/IPS/Firewall

The specialized plumbing effectively pins the virtual machine to the ESXi host. Therefore, the virtual machine is deployed in a 1:1 relationship to the ESXi hosts in the cluster. When the SVM is migrated, the plumbing cleanup is not handled correctly, which causes the issue to occur.
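The 1:1 pinning described above can be checked mechanically once you have an inventory of which SVM runs where. Below is a small Python sketch of that check; the function name, the SVM names, and the host names are all hypothetical, and how the two mappings are obtained (for example from a vCenter export) is left open.

```python
from collections import defaultdict


def check_svm_placement(svm_hosts, expected_hosts):
    """Compare current SVM placement with the intended 1:1 pinning.

    svm_hosts maps SVM name -> host it currently runs on.
    expected_hosts maps SVM name -> host it was deployed for.
    Returns a list of human-readable problems (empty means OK).
    """
    problems = []
    per_host = defaultdict(list)
    for svm, host in svm_hosts.items():
        per_host[host].append(svm)
        expected = expected_hosts.get(svm)
        if expected is not None and expected != host:
            problems.append(
                f"{svm} runs on {host} but was deployed for {expected}")
    # Every host that should have an SVM must have exactly one.
    for host in sorted(set(expected_hosts.values())):
        count = len(per_host.get(host, []))
        if count != 1:
            problems.append(f"{host} has {count} SVMs, expected exactly 1")
    return problems


if __name__ == "__main__":
    expected = {"svm-a": "esx01", "svm-b": "esx02"}
    current = {"svm-a": "esx01", "svm-b": "esx01"}  # svm-b was migrated
    for problem in check_svm_placement(current, expected):
        print(problem)
```

An SVM that shows up on the wrong host after a shutdown or reboot would be a hint that it was migrated and that the plumbing described above may be stale.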

Regards,

lerf2
Contributor

Hi canero,

Thank you very much! You pointed out something very important for me: the Service VM might be migrated during the vCenter / ESXi shutdown!

I will confirm whether the symptom goes away once the SVM is excluded from vMotion.

However, I would still like to know more about how the SVM communicates with the ESXi kernel over VMCI.

In this case, we can see from the SVM side that something is wrong with the VMCI communication to the ESXi kernel.

But it is hard to diagnose without experience and good ideas (such as the vMotion point you mentioned)!

Is there any way to check this from logs or other troubleshooting commands?

Guessing what might be wrong from all the limitations and environment settings is hard for an inexperienced engineer.


cnrz
Expert

The deployed SVM version is also important with respect to the NSX and ESXi versions, as each ecosystem partner such as Palo Alto, Trend Micro, Fortinet, or F5 may have tested, and may require, specific versions for compatibility. Are both versions compatible? Sometimes even downgrading from the latest version to a stable, tested lower version may help. Configuration steps may also differ slightly between versions and types of integration (Load Balancer, Firewall, IPS, AV).

Troubleshooting may differ according to which integration is being configured, but the general logic (if dFW in Slot 2 is OK without the SVM deployed) is to use other slots for sending the traffic to the SVM and to check the traffic before it is delivered to the dVS.

Troubleshooting the NetX lib, dvfilterklib and VMCI in detail would be very helpful, but how to do it is a good question.
