VMware Horizon Community
nzorn
Expert

View 5.3 - Another task is already in progress

View 5.3

ESXi 5.5 - 1331820

vCSA 5.5.0.5201 Build 1476389

Non-persistent linked clones (refreshed on logoff)

The next problem I'm having: after we upgraded to View 5.3, ESXi 5.5, and vCenter 5.5, I'm seeing desktops get stuck, and the only resolution is to reboot the ESXi host.  This has happened to us 4 times in the past month, and the ESXi servers usually won't complete the reboot on their own, so we usually have to manually reset the server.

We've also tried using the "esxcli vm process kill" command to shut down the desktops, with no luck.  These stuck desktops aren't accessible either: they don't show an IP address, VMware Tools isn't running, and nothing shows up in the console window.
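For reference, the kill sequence we ran from the host shell looked roughly like this (the World ID below is just an example; the real one comes from the list command):

    # Find the World ID of the stuck desktop
    esxcli vm process list

    # Try a soft kill first, then escalate (12345 is an example World ID)
    esxcli vm process kill --type=soft --world-id=12345
    esxcli vm process kill --type=hard --world-id=12345
    esxcli vm process kill --type=force --world-id=12345

In our case none of the kill attempts cleared the VM.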

VMware sent me this:  "The PR regarding the Power Off is currently open with no updated information. One of the steps taken was to increase the RAM on the Connection Server and reinstall the Connection Server. There is no additional next steps after this first action item."

Our Connection Servers have 4vCPU and 10GB RAM, so I doubt that is the problem.

33 Replies
llacas
Contributor

We are seeing the same issue. Before, with ESXi 5.1 and View 5.2, if a VM got stuck I would go into the server console and kill the VM with esxtop, then go into View and remove the desktop.

Ever since the upgrade to ESXi 5.5 (we ran View 5.2 with ESXi 5.5 for a few months) we've experienced this problem, and we're still having it now with View 5.3. The VM can no longer be killed with esxtop. And like you, only a host reboot will do it. And yes, our servers hang on shutdown as well; we power cycle once we see the host disconnect from vCenter.
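For the record, the esxtop kill sequence that used to work for us on 5.1 went roughly like this (from memory; the World ID is whatever esxtop shows for the stuck VM):

    esxtop       # from the host shell
    c            # CPU view (usually the default panel)
    k            # kill; esxtop prompts for a World ID (WID)
    <WID>        # type the stuck VM's World ID and confirm

Since the 5.5 upgrade the VM just keeps running after that, same as with esxcli.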

I've opened a case with VMware this morning. Hopefully they can solve this issue.

One question for you: are you running an NVIDIA GRID card (K1 or K2) and/or a Teradici APEX 2800 card in your servers?

Thanks!

Luc

nzorn
Expert

I am running the Teradici APEX 2800 cards in my servers.  We did not have these cards prior to View 5.3 and ESXi 5.5.  They are running Windows 7 x64 desktops with 2vCPU and 3GB RAM.  I've uploaded gigs of logs to VMware already.

Can you post your case number so I can pass that on to VMware saying you are experiencing the exact same problem?  I think they are going to give me a new case number, and I'll post that once I get it.

llacas
Contributor

OK, well, you're ahead of me since I have yet to talk to anyone from VMware. Here's my support request #: 14451891403.

You can tell them that the other thing we did was upgrade the Teradici APEX driver to 2.3.2 for ESXi 5.5 compatibility and then to 2.3.3 for View 5.3. We did not have this issue before upgrading to ESXi 5.5 and both those drivers. All 16 servers are experiencing this issue. All of them have dual APEX 2800 cards in them and either dual K1 or K2.
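If you want to compare driver builds, you can see exactly which APEX VIB a host is running from the host shell (the grep pattern is just my guess at how the package is named, so adjust it if nothing matches):

    # List installed VIBs and filter for the Teradici / APEX module
    esxcli software vib list | grep -i -e tera -e apex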

I'm thinking it could be related to the APEX, since not many people seem to be experiencing this issue, judging from the forum posts anyway.

If you could pass along your support request # as well, I would appreciate it. Thanks!

JackMac4
Enthusiast

What does your Composer queue look like? Have you looked at the SVI_TASK_STATE and SVI_REQUESTS tables? It sounds like Composer might be getting jammed up. Also, 10GB is the _minimum_ we recommend these days. Upping the memory can't hurt, but at face value I don't think it's your problem here. How many concurrent tasks are you configured to send? The default?

---- Jack McMichael | Sr. Systems Engineer VMware End User Computing Contact me on Twitter @jackwmc4

nzorn
Expert

What's the best way to check the Composer queue?  I doubt it has much in it.  I have not looked at the SQL database tables; what should I be looking for?

Also my connection servers are only using 2.22GB of RAM out of the 10GB.

JackMac4
Enthusiast

I believe there's a KB article, but look for any entries in the SVI_REQUESTS and SVI_TASK_STATE tables; this will tell you what's queued up. It's also possible that you're overwhelming vSphere if you've configured your concurrent tasks too high.
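A quick way to peek is a couple of read-only queries against the Composer database. Treat this as a sketch: the database name (ViewComposer) and the dbo schema are assumptions, so substitute whatever yours is called, and swap -E for SQL auth if you don't use Windows auth:

    REM Count queued Composer requests and task-state rows (read-only)
    sqlcmd -S localhost -d ViewComposer -E -Q "SELECT COUNT(*) AS requests FROM dbo.SVI_REQUESTS"
    sqlcmd -S localhost -d ViewComposer -E -Q "SELECT COUNT(*) AS task_states FROM dbo.SVI_TASK_STATE"

If rows pile up in those tables and never drain, that's a good sign Composer is jammed.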

---- Jack McMichael | Sr. Systems Engineer VMware End User Computing Contact me on Twitter @jackwmc4

nzorn
Expert

They are set to the defaults, and there is no way I'm overloading the system since I have 3 sites with only around 50 desktops per site.  Each of the 3 sites has its own View Connection Server, vCenter, and Composer running locally.

llacas
Contributor

And for us, this problem started right after the upgrade from ESXi 5.1 to 5.5. Before this, we could just kill a hung VM with esxtop. Our system is also far from overwhelmed, and we are using the defaults as well.

The common factor between me and nzorn is the APEX 2800 cards in the servers. I'm still waiting on VMware to contact me.

JackMac4
Enthusiast

Yeah, that's interesting; then it definitely shouldn't be an overload issue. Could be the APEX cards, I suppose. You could reach out to Teradici directly, or via VMware Support. That sounds like it might be your best bet.

---- Jack McMichael | Sr. Systems Engineer VMware End User Computing Contact me on Twitter @jackwmc4

nzorn
Expert

Quick question for both of you: are you seeing any screen flicker on the desktops since the upgrade? Some of my users have been reporting it, and I'm trying to figure out whether it's related.
gmtx
Hot Shot

Add me to the list. I had a VM turn into a vampire (couldn't be killed) a few weeks ago. Had VMware support look at it on Tuesday night (14451551203) during a maintenance window, and they tried all the same things I had tried, with no success. Only a host reset (after it hung on shutdown) fixed it. I rebuilt my pool from scratch, and Wednesday morning another VM hung. What's odd is that it's the same name as the first one that hung, out of 65 VMs. What are the odds?

I have another case open today (14452465603) and I'm awaiting a callback. I guess I'm glad to see it's not just me, but it's an ugly failure mode as you can't remove a pool until all the VMs are shut down, and when one of these turns into a vampire it appears only a hard host reboot gets it into a powered-off state where View can remove it.

I never had this issue until I upgraded to 5.5. I added Apex cards to my servers on Tuesday night, but I had the first vampire before they were installed so I don't think it's an Apex thing.

Geoff

llacas
Contributor

nzorn, no, I am not seeing this. At least I haven't heard. I'll ask around to see if anyone is seeing it, but with 800+ users per day connecting to the system, I'm sure I would have heard by now.

Geoff, so did the server hang on shutdown and you had to do a cold reset? We've always had the APEX cards in the servers since the View 5.1 beta, so maybe it's not related. But this started with the upgrade to ESXi 5.5 and the Teradici 2.3.3 driver.

Almost talked with VMware today. They never called back after I asked to be contacted by phone. Hopefully we can look into the problem tomorrow.

Luc

gmtx
Hot Shot

Yes, it hung shortly after disconnecting from vCenter. I waited about 10 minutes to see if it would eventually shut down by itself, but no luck, so I did a warm reset via the DRAC on my Dell server to recover.

As I mentioned, I had one of these failures before the APEX boards/drivers were ever installed on the host, so I don't think it's an APEX issue. What is odd is that it's the same host and same VM name that's hung again. Have you seen a pattern like this - same VM name hanging?

VMware support is calling me back later today. I've asked them to look at this thread and the other open SRs to see if there's been any progress made on the issue.

BTW, I'm also not hearing of any screen flicker issues from my users, and they're pretty quick to let me know if something's going on.

Geoff

gmtx
Hot Shot

VMware is telling me this may be a known issue: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=207339...

Do you have sparse disks (space reclamation) enabled? I do on my new pool where the hung VM from Wednesday lives, but I just don't remember if it was on for the pool where the first VM hung. I suspect it was though as it defaults to being on when you create a pool.
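If you want to check what a given linked clone is actually using, one way is to look at the delta disk descriptors on the datastore from the host shell. The paths below are examples, and my assumption is that the space-efficient format shows up with an SEsparse-style createType while the older redo-log deltas show vmfsSparse, so verify against the KB:

    # Print the createType line from each descriptor .vmdk for the VM
    grep createType /vmfs/volumes/<datastore>/<vm-folder>/*.vmdk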

Geoff

llacas
Contributor

Wow! That does seem like the problem.  It's a known issue that did not exist before version 5.5, and the solution is to disable a feature and recompose all my pools? Over 1,000 desktops? Wow, not impressed at all. I'll see what they say tomorrow.

Thx for posting this.

Luc

nzorn
Expert

So even though my "Reclaim VM disk space" box is not checked in the pool it is still enabled?

gmtx
Hot Shot

That's a really good question. Based on the KB article, where you have to make ADAM changes to turn it off at a global or pool level, it sure seems like seSparse disks are created based on the ADAM settings and not the reclaim setting in the UI.

It does appear that you can affect the actual reclaim process, though. Yesterday morning I was seeing all kinds of reclaim errors in the View event log, so I edited my pools and turned off space reclaim. That stopped the errors, but it seems the underlying disk type that View creates when it builds/recomposes a pool (seSparse, the type that is causing our issue) is a function of those ADAM settings and not anything that's affected through the UI. It'd be nice to know for certain, and I hope to hear more from VMware support today.
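For anyone who hasn't made an ADAM change before, the edit is done with ADSI Edit on a Connection Server against the View LDAP instance. Rough outline only; the exact object and attribute to modify come from the KB article, so treat anything not spelled out there as a placeholder:

    1. On a Connection Server, run ADSI Edit and connect to:
         Connection point:  dc=vdi, dc=vmware, dc=int
         Computer:          localhost:389
    2. Browse to the object the KB calls out (the global entry, or the per-pool
       entry under OU=Server Groups for a single pool).
    3. Change the attribute the KB names for seSparse disk creation, then
       recompose the pool so newly created clones pick up the setting.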

Geoff

llacas
Contributor

Well, I already have at least 2 pools that have space reclaim unchecked, and I still get hung VMs in those pools. So yes, it's the way the VM is created that makes a difference, not just the reclamation of space. I suggest changing the global setting: until VMware fixes the issue, all pools need to be created without seSparse disks, so you might as well do it in one place instead of forgetting to change it when you create a new pool. That's how I take it anyway.

Still waiting to talk to VMware... sigh...

gmtx
Hot Shot

I made the global ADAM change too, and I'll recompose all my pools this weekend to put the change into effect (right after I reboot that host with the stuck VM, unfortunately).

Geoff
