The express patches have been posted. This thread is long.
Please post technical experiences here and non-technical feedback here. --JohnTroyer
Hi all,
We've just encountered a serious bug with our ESX cluster - serious enough that I thought I should post about it here as a prior warning for others running ESX 3.5 Update 2.
The VMware tech support person we spoke to wouldn't 100% confirm whether this was / would be affecting all ESX 3.5u2 installs, but he strongly hinted that it was widespread. For others' sake I hope I'm wrong and it's limited.
The bug:
Starting this morning, we could neither power on nor VMotion any of our Virtual Machines. The VI Client threw the error "A general system error occurred: Internal Error".
Further digging led us to messages like this one in /var/log/vmware/hostd.log, and in the log file for any virtual machine we tried to power on or VMotion:
Aug 12 10:40:10.792: vmx| This product has expired.
Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.
Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
A call to tech support confirmed this as a known problem with a temporary workaround.
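For anyone wanting to check quickly whether their logs show the same message, here is a minimal sketch. The function name find_expired_logs is made up for illustration, and the vmware.log glob in the usage comment is an assumption about where your VM logs live, so adjust the paths to your own setup.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): print the names of the given
# log files that contain the expiry message quoted in this thread.
find_expired_logs() {
    # -l: print only the names of matching files
    # -F: treat the pattern as a fixed string, not a regex
    grep -lF "This product has expired" "$@" 2>/dev/null
}

# Example call (paths are assumptions; the VM log location depends on your datastore layout):
#   find_expired_logs /var/log/vmware/hostd.log /vmfs/volumes/*/*/vmware.log
```

If nothing is printed, none of the files you passed contain the message.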
The work-around:
Turn off NTP (if you're using it), and then manually set the date of all ESX 3.5u2 hosts back to the 10th of August. This can be done either through the VI Client (Host -> Configuration -> Time Configuration) or by typing date -s "08/10/2008" at the Service Console command line on the ESX hosts.
As soon as the date was reset to the 10th - problem solved.
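The Service Console route could be sketched roughly as below. This is only a sketch: it assumes ntpd is controlled with "service ntpd" on your hosts, and the run / apply_workaround helpers and the APPLY variable are made up for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command unless APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Hypothetical helper wrapping the two workaround steps from the post above.
apply_workaround() {
    # Assumption: ntpd is managed via the "service" command on the Service Console.
    run service ntpd stop
    # Date string exactly as given in the post (MM/DD/YYYY).
    run date -s "08/10/2008"
}
```

Calling apply_workaround on its own just lists the two commands; run it with APPLY=1 as root on each ESX 3.5u2 host only once you are happy with what it prints.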
Note that running VMs were operating fine; this only seems to affect initial VM power-on (including from suspended state) and VMotion.
So, it sounds like a serious licensing bug has crept into 3.5u2. Further testing shows that the problem begins as soon as the date hits 12th August - 10th is fine, 11th is fine, 12th and the problem appears.
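A small sketch for checking whether a host's clock is already past the cut-off. The function name is_affected_date is made up, and treating every date from 12th August 2008 onwards as affected is an assumption extrapolated from the three dates tested above.

```shell
#!/bin/sh
# Hypothetical check: succeeds if the given date (YYYYMMDD) is on or after
# 12 August 2008. With no argument, the host's current date is used.
is_affected_date() {
    d="${1:-$(date +%Y%m%d)}"
    [ "$d" -ge 20080812 ]
}

# Example:
#   is_affected_date && echo "host date is on or after the 12th"
```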
There wasn't any real reference to similar problems in the forums as far as I could see, but it's quite possible we're seeing this before most of the rest of the world as we're in Australia, and therefore the date here ticked over to the 12th "before" those in Europe, America, etc.
Hope this helps others... took us a couple of hours to get this far - at least we can power on VMs again though!
Cheers,
Matt Kilham
Message was edited by: JohnTroyer to add new thread links.
@sisi: Nope, no VMotion, and I don't think HA will be able to bring VMs up on a U2 ESX after a failover. Just don't move until the patch is available.
This problem would never have been caught in testing environments.
Or do you change the date on your Test ESX servers?
Might as well add testing for license problems to your normal testing routine. You know management is going to insist after today....
Yeah, like this kind of thing never happens with an MS product. Give me a break. You're ready to go with a Xen copy, a standalone virtualization platform because of this?
U2 was released on 7/25 and many of you threw it on production systems that quickly? That's your bad really. I usually wait at least three months before moving it into production farms and hammer away at it in the lab first.
That's just (or should be) standard deployment or upgrade policy no matter who or what products you are talking about.
Looks like someone left in a huge piece of auto-expiry beta/test code. That's like a surgeon accidentally leaving in a pair of scissors: it's incredibly stupid and all manner of embarrassing. End of story.
"We did our testing, checked the forums and thought after going through our dev and test that we would upgrade a couple of the production servers, it just happened that the programmed end date for the licensing was 12th August not some other date in the future that may have caught a lot more other people!"
True, but in this case, if you have a longer timeline before moving code out to production, you didn't get hit with this. If the date was October or November, sure, even my customers would be impacted. But all I'm saying is I'm glad I have a longer window for updating.
So if you turn back time, how compliant does that make you for SOX and auditing purposes? Just concerned that we would be unable to turn back time due to these issues.
Yeah, the logs are fine after I migrated it using VCB to VMware Server. I did a full repair and sync and that sorted it, so I had the basics up.
I usually wait around 8 hours to make sure all replicas (naming contexts) are in good shape.
I'm seeing it in my lab systems which were updated to U2 about a week after it was released so in this case, my slowness worked to my client's advantage...:-)
"it's incredibly stupid and all manner of embarrassing. End of story."
It is that indeed.
You're right. This issue should not be the reason to change to another VM-OS. This can happen to every software company.
But: you can wait as long as you want, three months, six months, one year. You can never be sure you won't hit the same kind of trouble; it might happen the day after your update ...
Wow. This is a pretty major blunder, although the workaround is fairly simple as long as you stop the OS on the virtual from coming up before changing the time. My bigger concern is that the fix for this absolutely had better be installable without having to shut down any virtuals. We would normally just VMotion them to a different host, apply the maintenance, and VMotion them back, but since VMotion is also broken, it's looking like downtime is going to be needed, which I'm sure is just not an option for some of you. We've had VMware in house since v1.x and honestly this is the first major problem we've encountered. Sure, there have been glitches along the evolution of the product, but for the most part it's been a pretty smooth ride. Previously I read somewhere that VMware was going to notify everyone who has downloaded U2 about this bug. I have U2 installed on 5 of our 10 ESX boxes, which I did after testing on our test ESX box. I have not received any notification warning me about this bug at all. Anyone else? Are they just waiting for others to get hit by the bug?! That, to me, is an even bigger problem.
Good thing I'm on holiday this week! But even so, I got calls from customers asking what had happened. Not a good position for VMware to be in; this could easily have been avoided. The problem is that too many updates are being released too often at the moment: only a few weeks ago we were deploying ESX 3.5, then Up 1 came along and then Up 2... Release after release to satisfy VMware Marketing for the new features... SVMotion, DPM, Live cloning etc... Good thing none of my customers are using update 2 in production yet...
I started installing this update last week, 3 hosts were fine (of course), upgraded two more hosts this morning, and got this fault. Never imagined it could be a bug.....
I am now in deep doodoo...
So far I've only had one client suffer from this. This was the only client that insisted on installing U2 against my advice. But agreed, there really isn't a way to test for this. Now that it's happened, there will be.
Folks,
If we choose to stop ntpd and change system date, we need to know whether the VMs on that host (and in general) are synching their tools to the host.
The prospect of double-clicking on loads of VMware Tools icons didn't really give me the Pot Noodle Horn, so I've taken the liberty of knocking up a cheeky bit of PowerShell to achieve this.
So long as you have the Windows Powershell exe installed, and the VI toolkit (this worked fine on the Beta), the following should output a list of all VMs in VC with their Display Name, IP, Host (important due to host-centric changes to ntpd and system date), and current tools sync status.
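# List every VM in VC with its name, guest IP, guest hostname and tools time-sync setting.
# Note: Guest.hostName is the hostname reported by the guest OS, not the ESX host the VM runs on.
get-vm | % { get-view $_.ID } | Select-Object Name,
  @{ Name="IPaddress"; Expression={$_.Guest.ipaddress} },
  @{ Name="Hostname"; Expression={$_.Guest.hostName} },
  @{ Name="ToolsSyncTimeWithHost"; Expression={$_.Config.tools.SyncTimeWithHost} } |
  out-file -filepath c:\psoutput.txt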
Please note that I make no warranties about effectiveness, blah, blah, but it works fine for me. Even if your company policy is not to tick the timesync box in VMware tools, there are bound to be a couple of VMs that squeak through the net.
Hope this helps a few people...
piglet
Hello everyone,
I just did a test in our environment and everything works fine. We are running 4 ESX 3.5u2 servers. The only difference is that I'm still running the licensing server from our original installation, version 2.5.
Cheers
Francisco
I'm sure this is a result of them going freeware, sucks for those of us who pay for it.
This is pretty massive, CNN worthy even. Luckily the only machine we rebooted is just a monitoring machine.
Tomorrow by noon is crazy talk, this needs 100 guys working on it to get it out within an hour. We had to do this update because of a licensing problem with OEM copies of ESXi, where you can't autostart your VMs upon server boot because it says you don't have the right license. It seems they have been having a lot of problems with licensing lately...
>> Not to be argumentative, but 36hrs is way unreasonable to make customers wait.
>> ESX is supposed to be an enterprise product. Enterprise products usually have 4hr
>> SLA's. No one expects vmware to fix, recompile, and distribute ESX patches in
>> under 4hrs...but there is a huge gap between 4hrs and 36.
+1. It's also not too much to expect that sales staff and SEs would start calling and emailing their enterprise customers who may be affected. Hoping that customers might check the community and might check the knowledgebase (which is down, BTW) is the kind of stuff you'd expect from a mom-n-pop software company.
VMware, if you want to be a big league player, you gotta play like you're in the big leagues.
If memory serves me, wasn't there another problem last year with ESX or another product and it had to be fixed the next day?
I agree, you are never going to get bug-free software (there's always a "known issues" section), but I've found (and hey, based on the response, I'm one of the few who does this) that by waiting at least 2 to 3 months (all the while testing), you can get a handle on most gotchas, but certainly not everything. But had the date been further out, as has been said, then more would have been impacted, including me. It was just fortunate for me it was the 12th of August and not October.