This is a new implementation.
I currently have two UAGs deployed to production, one on version 3.1 and one on 3.2. They are behind a NetScaler load balancer.
After a few days the UAGs stop accepting connections on 443. I have to reboot them every night or the problem happens 100% of the time. For now I'm keeping one disabled on standby in case the other breaks during the workday. When they break, port 4172 remains open, so existing connections survive; it's only new connection attempts that fail.
I have an open case with VMware, but they've turned us over to Citrix support. I wish they would actually want to know what is causing this, since obviously something is breaking their UAG. That's a passive-aggressive remark in case you're reading, VMware.
We have 50 users, yet I see hundreds of stale connections on the UAG. We are not being DoS'ed, as confirmed by our network team.
Citrix NetScaler Load Balancer: 192.24.16.172
UAG: 192.24.17.184
The Citrix NetScaler load balancer is configured to perform a health check per VMware's recommended method, using GET /favicon.ico.
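For reference, a favicon-based health-check monitor along these lines can be defined from the NetScaler CLI. This is only a sketch: the monitor and service names are made up, and the 30-second interval / 3 retries are illustrative values, not pulled from our config.

```shell
# Illustrative NetScaler CLI sketch; mon_uag_favicon and svc_uag1 are
# hypothetical names. Probes HTTPS GET /favicon.ico and expects a 200.
add lb monitor mon_uag_favicon HTTP -respCode 200 -httpRequest "GET /favicon.ico" -secure YES -interval 30 -retries 3
bind service svc_uag1 -monitorName mon_uag_favicon
```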
On the UAG:
netstat shows hundreds of these close_wait connections:
tcp 1 0 192.24.17.184:6443 192.24.16.172:46864 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:29408 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:65027 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:16839 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:45761 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:44743 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:9926 CLOSE_WAIT
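A quick way to quantify this is to tally connections by TCP state with awk. The sample data below is a stand-in so the pipeline is self-contained; in practice you would pipe the output of netstat -ant instead of echoing a sample.

```shell
# Tally connections by state (last field of each netstat line).
# "sample" is hypothetical data; replace `echo "$sample"` with `netstat -ant`.
sample='tcp 1 0 192.24.17.184:6443 192.24.16.172:46864 CLOSE_WAIT
tcp 1 0 192.24.17.184:6443 192.24.16.172:29408 CLOSE_WAIT
tcp 0 0 192.24.17.184:443 192.24.16.172:50312 ESTABLISHED'
echo "$sample" | awk '{count[$NF]++} END {for (s in count) print s, count[s]}'
```

On a broken UAG this makes the CLOSE_WAIT pile-up on 6443 obvious at a glance.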
On the UAG:
Hundreds of these in /opt/vmware/gateway/logs/SecurityGateway_blah_blah_
2018-01-14T04:42:45.017+00:00> LVL:error : [C: 192.24.16.172:58952] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
2018-01-14T04:42:47.187+00:00> LVL:error : [C: 192.24.16.172:24632] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
2018-01-14T04:42:50.017+00:00> LVL:error : [C: 192.24.16.172:39938] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
2018-01-14T04:42:52.187+00:00> LVL:error : [C: 192.24.16.172:3371] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
2018-01-14T04:42:55.017+00:00> LVL:error : [C: 192.24.16.172:42301] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
2018-01-14T04:42:57.188+00:00> LVL:error : [C: 192.24.16.172:47881] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
2018-01-14T04:43:00.017+00:00> LVL:error : [C: 192.24.16.172:28791] *** SSIGServer::SSL handshake failure: End of file (2) error:00000002:lib(0):func(0):system lib
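Counting these failures per minute makes the polling cadence visible (note the roughly 2.5-second spacing above). The sample lines below stand in for the real SecurityGateway_* logs under /opt/vmware/gateway/logs/; in practice you would grep those files directly.

```shell
# Tally SSL handshake failures per minute (first 16 chars = YYYY-MM-DDTHH:MM).
# "log" is a hypothetical sample; grep the SecurityGateway_* files in practice.
log='2018-01-14T04:42:45.017+00:00> LVL:error : [C: 192.24.16.172:58952] *** SSIGServer::SSL handshake failure
2018-01-14T04:42:47.187+00:00> LVL:error : [C: 192.24.16.172:24632] *** SSIGServer::SSL handshake failure
2018-01-14T04:43:00.017+00:00> LVL:error : [C: 192.24.16.172:28791] *** SSIGServer::SSL handshake failure'
echo "$log" | grep 'SSL handshake failure' | cut -c1-16 | sort | uniq -c
```

A steady dozen-plus failures every minute, all from the load balancer's IP, is what pointed at the health check rather than real clients.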
I greatly dislike when I find a forum post with no answer so I will answer what the final solution to this was.
I had done some digging in the UAG console and noticed the messages below. From what I gathered, the UAG has a built-in mechanism that protects it from DDoS-type attacks, and our Citrix NetScaler load balancer health check was triggering it. Essentially, the UAG thought the load balancer was attacking it, so the DosPreventionHandler kicked in and shut down the listener. Port 4172 (PCoIP) remained open and existing users stayed connected, but port 443 stopped accepting new connections. When I spoke to VMware support, they confirmed my suspicion.
The workaround is to set the settings below to 0 in the UAG.
Jrod,
I was running into the exact same issue (3 UAGs in prod behind a NetScaler ADC). Glad I found your post; good investigation on your end.
-Nick
Can confirm that you still have to do this in UAG 2009. We had been doing it on our other UAG 3.9 appliances, but missed the setting in our initial 2009 deployment and ran right into this. Annoying little thing to troubleshoot.
Do you happen to have the instructions used on the UAG to get to the screenshot below showing the DoSPreventionHandler was running? I'm trying to prove out whether we are having the same issue, but VMware support doesn't currently know how to check what you showed below.
This issue is 4 years old; the UAG has changed a bit, and you may not see this. If you have UAG stability issues, you should open an SR.
-Log into the UAG admin web UI. Click the Select button for Configure Manually, scroll down to the bottom, and click the download button next to the option labeled Log Archive. After a brief delay, the log bundle will be downloaded to the browser's default download location. After extracting the .zip archive, the esmanager logs will be in the first folder. The current log file is named esmanager.log; older files that have been rotated out are named esmanager.log.1, .2, .3, etc. You can view these files with any text editor, such as Notepad++.
-Alternatively, you can log into the command-line shell of the UAG appliance directly and view the log by executing: more /opt/vmware/gateway/logs/esmanager.log. To grep for the relevant entries directly, execute: grep -i DoSPreventionHandler /opt/vmware/gateway/logs/esmanager.log | less. This displays the output one page at a time; press the spacebar to move forward page by page.
If you configure your load balancer health check per our documentation, this will not happen.
A polling interval of 30 seconds is recommended. Your logs indicate that you are polling every 5 seconds; that is why you have so many stale sessions.
https://kb.vmware.com/s/article/56636
With the default polling interval of 30 seconds, the Response Timeout would be 91 seconds. Example calculation: (30 * 3) + 1 = 91 seconds.
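That calculation (interval times retries, plus one second) can be checked with shell arithmetic. The variable names here are mine, and the retry count of 3 is taken from the example above:

```shell
interval=30   # seconds between health-check probes
retries=3     # failed probes before the appliance is marked down (from the example)
# Response Timeout = (interval * retries) + 1
echo $(( interval * retries + 1 ))   # prints 91
```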
A load balancer monitors the health of each Unified Access Gateway appliance by periodically sending an HTTPS GET /favicon.ico request, for example https://uag1.myco-dmz.com/favicon.ico. This monitoring is configured on the load balancer. It performs this HTTPS GET and expects an "HTTP/1.1 200 OK" response from Unified Access Gateway to know that it is healthy. If it gets a response other than "HTTP/1.1 200 OK", or does not get any response, it will mark the particular Unified Access Gateway appliance as down.