So here's the breakdown.
Our vCenter environment is not functioning. It spikes to 100% CPU and just stops working. I've tried all the usual support things, and nothing is wrong with the database server (all the other databases on that server are fine). It's the vCenter process, vpxd.exe, that is at 100%. I should also point out that I've been using vSphere through versions 3-5 and I'm a current VCP holder (going to take my VCAP soon).
This is preventing all backups from running, as they need vCenter to initiate them. I logged a ticket with VMware in the morning two days ago. I get a call shortly after and the tech starts to do his thing. I tend to let techs try what they think is right. I've worked phone tech support and I remember those asshole admins who think they know everything, yet had to dial support. After a few hours he is no closer to the problem, but it's his time to go home, so he transfers me to a new tech in another timezone. This tech of course has to start over, because in IT the other guy is obviously an idiot. He suggests we reinstall vCenter. I allow it because I honestly don't know what the hell is wrong. Nothing is solved.
Fast forward another 2 hours and it's this guy's time to go home. I get transferred yet again, this time to India (I think) (around the world in one phone call?). This guy quickly realizes he needs help and involves a senior engineer. The senior engineer checks all the things I checked before calling (but I understand) and then suggests we reinstall vCenter. I'm hesitant because we have done this before. He explains that the last guy probably didn't uninstall it properly. I relent and we reinstall vCenter. This brings no relief. He continues working and eventually settles on the idea that the problem is a new ESXi host we added to the cluster a few days ago. We can't remove it, because we can't log in to vCenter. He gets another engineer and we go to the database and SQL out the ESXi host. At first CPU usage drops, but then the problem comes right back. At this point it's almost 3am my time and I've been on the call for well over 15 hours. I suggest we shelve it until the morning; obviously they need time to work on the issue. They download some logs and dmp files from vCenter and we call it quits. I go get a burger at Steak 'n Shake. It was pretty gross, and it was the only food I'd had since 11am. Bedtime is around 4:30am.
I get into work about 9am the next day (yesterday), the day of my wedding anniversary. vCenter is still down and my boss is not terribly happy. I call VMware and they begin their work again around 10am. New tech, as the old tech can't be reached. He again explains that everyone else must be an idiot and begins the process anew. He decides it must also be a host issue and we go through a tedious process of rebooting all the ESXi hosts. To my surprise (and I'm pretty sure it's unrelated), vCenter starts to run slightly better, even though the CPU usage stays pegged around 80-90%. He then suspects it's a storage issue and brings in a storage expert. They work on the system until about 5pm (my wife brought me lunch so I could eat!) and nothing is resolved. They download the log bundle for vCenter and all hosts and leave me so they can analyse the issue.
The result is I still can't run backups, I still cannot manage my VMs through vCenter, and I still have no idea what is causing it. I suspect today will be more of the same.
To the support representatives' credit, they have been very polite and have tried very hard to fix my problem. I really appreciate their hard work. At this point I could have rebuilt the server. The reason I haven't is that I don't want this problem to pop up again. I want to know the root cause.
In any case, I made that post simply because I needed to vent to someone and my phone was handy.