Archive

Archive for the ‘Case Post Mortems’ Category

Real-life experience performing a rollback from vRA 6.1 to 6.0.1

March 10th, 2015 No comments

Yesterday I had to deal with a deployment issue in vRealize Automation that included one of the weirdest corner cases I have seen so far.

All deployments of VMs that did not use any vRealize Orchestrator workflow stubs were successful. But let’s be honest, that is pretty boring and doesn’t cover most customers’ use cases: most customers will be using the provided stubs to step into the machine life cycle and modify the deployment and destruction of VMs with workflows and scripts.

Read more…

Help, my VM is hung/unresponsive – what to collect for GSS and why we need it.

March 9th, 2015 No comments


This came up on Twitter today. While that approach might result in a fast turnaround for the immediate issue, it won’t show you the root cause of the hang/unresponsive VM, and let’s be honest, that is normally what we are interested in, right? (There are exceptions: if your mission-critical, revenue-generating system is in that state, you probably want to restart it as fast as possible and then go back to the whiteboard to redesign that single point of failure as soon as possible.)

Read more…

Upgrade to vCenter Server 5.5 failing due to service misconfiguration

January 2nd, 2015 5 comments

Today was a rather productive day in terms of resolved cases. It sure helps to be part of the frontline again during times of low staffing: finally some easy stuff to deal with again 🙂

Read more…

Categories: Case Post Mortems Tags:

vRealize Automation 6.2 stuck indefinitely on additional tenant login

December 27th, 2014 No comments

A while ago I had a customer who was unable to log in to an additionally created tenant in vCloud Automation Center 6.0.1. The same issue can be reproduced in vRealize Automation as well. Logins to the default tenant succeeded without any problems, and a very similar test environment showed no login issues at all.

Read more…

Intermittent vMotion failures on ESXi 5.5

August 2nd, 2014 1 comment

A little while ago I had to work on a vMotion issue that showed some really interesting behaviour.

The customer described the issue as vMotions getting stuck at 5, 13 and 91%.

Read more…

Categories: Case Post Mortems Tags:

Slow vCenter Server 5.5 logins for domain users while administrator@vsphere.local logs in fast

July 14th, 2014 No comments

Chris Wahl had an interesting issue a couple of days ago on Twitter.

So how would you go about troubleshooting this, given that several components are involved and any of them could be the actual bottleneck? It could be an issue on the client side, in the Web Client / C# client itself, in vCenter Server, the Inventory Service, SSO, or the Active Directory infrastructure itself.

Read more…

Categories: Case Post Mortems, SSO Tags:

Virtual Machine hardware version influencing QueryPerformanceCounter behaviour on reboot aka Nagios showing wrong system uptimes

June 28th, 2014 No comments

There is already a KB article in the VMware knowledge base covering a QueryPerformanceCounter issue that affects Windows XP and Windows Server 2003:

QueryPerformanceCounter behaves improperly inside a virtual machine when /usepmtimer is used with some Windows HALs

Last week a colleague of mine had another customer express concerns about how this counter is handled on his Windows Server 2008 R2 VMs. They noticed that under certain circumstances the counter would no longer reset when performing an in-guest reboot. For a lot of customers this would generally not be a concern, but the monitoring solution the customer had implemented on their site relied on this counter to calculate the system uptime. So if the counter was not reset after a reboot, misleading information would be shown.
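To illustrate why that matters, here is a minimal sketch of how such a tool might derive an uptime from the counter. This is not the customer's actual monitoring code, just Python calling the Windows APIs directly, and it relies on the same assumption the monitoring solution made, namely that the counter restarts at zero on boot.

# Minimal sketch (Windows only): derive an "uptime" from QueryPerformanceCounter
# the way a monitoring tool might. The whole point is the assumption that the
# counter restarts at zero on boot, which is exactly what broke here.
import ctypes

counter = ctypes.c_int64()
frequency = ctypes.c_int64()

ctypes.windll.kernel32.QueryPerformanceCounter(ctypes.byref(counter))      # ticks since the counter started
ctypes.windll.kernel32.QueryPerformanceFrequency(ctypes.byref(frequency))  # ticks per second

print("Assumed system uptime: %.0f seconds" % (counter.value / float(frequency.value)))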

Read more…

Categories: Case Post Mortems Tags:

vCenter performance suffering due to feature usage

June 15th, 2014 No comments

Last week I handled a customer case in which vCenter Server would crash randomly, and even when it was operational it would use most of the system’s CPU.

The vpxd.log showed the following error, which indicates that the system ran out of usable RAM for vCenter Server.

2014-06-10T22:41:47.127+02:00 [11100 panic 'Default' opID=task-internal-1-62fd32e8-17]
–>
–> Panic: Unable to allocate memory
–> Backtrace:
–> backtrace[00] rip 000000018018b6aa
–> backtrace[01] rip 0000000180105288
–> backtrace[02] rip 000000018010668e
–> backtrace[03] rip 00000001801a1c4c
–> backtrace[04] rip 00000001801a1d9c
–> backtrace[05] rip 000000018009e466
–> backtrace[06] rip 000000018009e499
–> backtrace[07] rip 0000000073b710af
–> backtrace[08] rip 0000000073bbcb9f
–> backtrace[09] rip 0000000073bbcc87
–> backtrace[10] rip 0000000140c85348
–> backtrace[11] rip 00000001410fa8d2
–> backtrace[12] rip 0000000140fad26b
–> backtrace[13] rip 0000000140fadb12
–> backtrace[14] rip 00000001401175db
–> backtrace[15] rip 0000000000830453
–> backtrace[16] rip 00000000003e980c
–> backtrace[17] rip 00000000004293ff
–> backtrace[18] rip 00000001404185dc
–> backtrace[19] rip 0000000140a71a40
–> backtrace[20] rip 0000000140a72533
–> backtrace[21] rip 00000001408b22fe
–> backtrace[22] rip 00000001408b2350
–> backtrace[23] rip 0000000141073882
–> backtrace[24] rip 00000001410b2c4c
–> backtrace[25] rip 0000000140fc385a
–> backtrace[26] rip 00000001410893cb
–> backtrace[27] rip 00000001404257bf
–> backtrace[28] rip 0000000140429174
–> backtrace[29] rip 0000000140437874
–> backtrace[30] rip 00000001404402f5
–> backtrace[31] rip 0000000180153334
–> backtrace[32] rip 00000001801533c1
–> backtrace[33] rip 0000000180157d0c
–> backtrace[34] rip 00000001801595ca
–> backtrace[35] rip 0000000180159b0c
–> backtrace[36] rip 0000000180159ca8
–> backtrace[37] rip 00000001801548a3
–> backtrace[38] rip 00000001801558f3
–> backtrace[39] rip 0000000180155c33
–> backtrace[40] rip 000000018019db2a
–> backtrace[41] rip 0000000073b72fdf
–> backtrace[42] rip 0000000073b73080
–> backtrace[43] rip 00000000778059ed
–> backtrace[44] rip 000000007793ba01
–>
2014-06-10T22:41:48.180+02:00 [11100 panic 'Default' opID=task-internal-1-62fd32e8-17] (Log recursion level 2) Unable to allocate memory


So we gathered two extended log bundles between service crashes to see whether the private memory of vpxd grew, and indeed within 45 minutes the memory usage doubled from 1.3 GB to 2.6 GB.
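If you want to watch that growth live instead of comparing log bundles, a simple process-memory sampler is enough. The following is a rough Python sketch using the third-party psutil module; the one-minute interval is just an assumption for illustration, run it on the vCenter Server machine itself.

# Rough sketch: sample the vpxd.exe memory footprint once a minute to confirm
# steady growth. Requires the psutil module.
import time
import psutil

def find_vpxd():
    for proc in psutil.process_iter(['name']):
        if proc.info['name'] and proc.info['name'].lower() == 'vpxd.exe':
            return proc
    return None

vpxd = find_vpxd()
if vpxd is None:
    raise SystemExit("vpxd.exe is not running")

while True:
    mem = vpxd.memory_info()
    # On Windows, rss is roughly the working set and vms the committed bytes.
    print("%s  working set: %.1f MB  committed: %.1f MB"
          % (time.strftime("%H:%M:%S"), mem.rss / 1024.0 / 1024, mem.vms / 1024.0 / 1024))
    time.sleep(60)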

The inventory was medium-sized:

–> /InventoryStats/ManagedEntityStats/Clusters/total 16
–> /InventoryStats/ManagedEntityStats/Datacenters/total 1
–> /InventoryStats/ManagedEntityStats/Datastores/total 113
–> /InventoryStats/ManagedEntityStats/NumHosts/total 47
–> /InventoryStats/ManagedEntityStats/NumPoweredOnVms/total 1028
–> /InventoryStats/ManagedEntityStats/NumVirtualMachines/total 1055
–> /InventoryStats/ManagedEntityStats/ResourcePools/total 23

The VM had the following resource allocation:

Processor(s): 4 Processor(s) Installed.
[01]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[02]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[03]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[04]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz

Total Physical Memory: 16.383 MB
Available Physical Memory: 8.941 MB
Virtual Memory: Max Size: 24.574 MB
Virtual Memory: Available: 16.059 MB
Virtual Memory: In Use: 8.515 MB

So the VM was configured with 4 cores and 16 GB of RAM. For vCenter Server 5.5 this was more than enough to meet the minimum requirements. So what was actually happening?

Let’s have a look at the system utilization for memory and CPU.

[Graph: vpxd memory utilization over time]

We can see that memory usage is extremely spiky during the first half of the logs we have, while during normal operation it should stay at around 2 GB at most. After about 20 hours the customer implemented a change I recommended, and the memory utilization instantly stabilized. The CPU graph looks similar.

[Graph: vpxd CPU utilization over time]

And again, as soon as the suggested change was implemented, CPU utilization dropped below 20% on average.

Going through the profiler logs we could see that the system spent a lot of RAM on cluster operations.

/ScoreboardStats/ObjectTypes/Class=’ClusterRecommendationsTracker::RecVisitInfo’/Count/total 0.704939
/ScoreboardStats/ObjectTypes/Class=’Vim::Cluster::Recommendation’/Count/total 0.684619

Looking at the system with Process Explorer we could see that two threads were hogging the CPU, but the stack was not usable due to the lack of symbols; the topmost function shown was a thread exit.

After loading the dump into a debugger we can actually verify this.

Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Machine Name:
Debug session time: Thu Jun 12 14:35:57.000 2014 (GMT+1)
System Uptime: 0 days 1:09:39.037
Process Uptime: 0 days 1:04:00.000
……………………………………………………….
……………………………………………………….
………………………………….
Loading unloaded module list
….
ntdll!ZwWaitForSingleObject+0xa:
00000000`76da05fa c3 ret
0:000> !runaway
User Mode Time
Thread Time
216:1710 0 days 0:57:15.922
744:1f00 0 days 0:53:48.487
2:10d4 0 days 0:00:13.868
693:1e34 0 days 0:00:07.768

We can see here that threads 216 and 744 had been running for 57 and 53 minutes respectively, out of a total process uptime of little over an hour. This is definitely not normal.

Looking at the actual thread stack, we can be pretty certain that something within vpxd.exe is indeed the culprit.

0:216> ~744 k
Child-SP RetAddr Call Site
00000000`406bd048 00000000`721692dd msvcp90!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::compare+0x93
00000000`406bd050 00000001`404aa1af msvcp90!std::operator<<char,std::char_traits<char>,std::allocator<char> >+0x29
00000000`406bd090 00000001`404aa2cc vpxd+0x12ba1af
00000000`406bd150 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd210 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd2d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd390 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd450 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd510 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd5d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd690 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd750 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd810 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd8d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd990 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bda50 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdb10 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdbd0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdc90 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdd50 00000001`404aa2cc vpxd+0x12ba2cc

At this point I was already pretty certain that we had a DRS calculation issue at hand, based on the evidence found in the profiler logs above. Loading private symbols confirmed the same in the actual stack traces of both threads.

DRS logs all of its calculations into so-called drmDumps. These files are usually a couple of kilobytes at most, but it turned out that for two clusters these dump files contained more than 8 MB of pure calculation text.
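A quick way to spot a cluster whose calculations have ballooned is to sum up the drmdump sizes per cluster directly on the vCenter Server. Below is a small Python sketch; the path is what I would expect on a default Windows vCenter Server 5.5 install, so treat it as an assumption and adjust it for your environment.

# Sketch: report the total drmdump size per cluster directory so oversized
# DRS dumps stand out. The path is an assumed default for Windows vCenter 5.5.
import os

DRMDUMP_DIR = r"C:\ProgramData\VMware\VMware VirtualCenter\Logs\drmdump"

for cluster_dir in sorted(os.listdir(DRMDUMP_DIR)):
    path = os.path.join(DRMDUMP_DIR, cluster_dir)
    if not os.path.isdir(path):
        continue
    total_bytes = sum(os.path.getsize(os.path.join(root, name))
                      for root, _, files in os.walk(path)
                      for name in files)
    print("%-40s %8.1f MB" % (cluster_dir, total_bytes / 1024.0 / 1024))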

The customer had implemented several new DRS rules on a very large cluster containing over 500 VMs. The calculations DRS now performed every 5 minutes were killing the system, even though it met the minimum requirements.

The change the customer implemented was to switch DRS to manual mode, which instantly stabilized the system.
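For reference, the same change can also be scripted if it needs to be applied to several clusters at once. Below is a hedged pyVmomi sketch; the vCenter address, credentials and cluster name are placeholders, not the customer's environment.

# Sketch using pyVmomi: set a cluster's DRS automation level to manual.
# Connection details and the cluster name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "BigCluster")

    spec = vim.cluster.ConfigSpecEx(
        drsConfig=vim.cluster.DrsConfigInfo(
            defaultVmBehavior=vim.cluster.DrsConfigInfo.DrsBehavior.manual))
    cluster.ReconfigureComputeResource_Task(spec, modify=True)
finally:
    Disconnect(si)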

From there the customer basically has two choices:

1) Add more hardware to the system.
2) Keep the rules to an absolute minimum, as more (and more complex) rules will have an impact on vCenter Server performance.

Categories: Case Post Mortems Tags:

Patching or Upgrading ESXi fails with “insufficient memory” error

May 10th, 2014 No comments

A couple of days ago I helped out a junior engineer who had a customer on the line whose hosts could not be scanned, patched or upgraded using Update Manager 5.5. Scanning the hosts simply produced the following error.

[Screenshot: error shown when scanning the hosts]

The same could be seen in the recent tasks and events.

[Screenshot: the same error in the recent tasks and events]

When trying to simply scan and patch the host against the default VMware-provided baselines, we got the following.

[Screenshot: scan and patch failure against the default baselines]

The customer had already uploaded a log bundle of the ESXi host, and we could not see anything suspicious in /var/log/esxupdate.log. But the vua.log contained an important hint.

[Screenshot: vua.log excerpt]

INFO:root:Running /sbin/esxcli system visorfs ramdisk add -M 332 -m 332 -n upgradescratch -t /upgrade_scratch -p 01777

We can see that root tries to create and mount a new ramdisk. But why does that fail? The customer also could not connect to the host using PuTTY, getting “Access denied” errors. In the end the solution was rather simple, but it does not seem to be very well documented as a caveat of one of the major security features of ESXi.

[Screenshot: lockdown mode setting on the host]

At first I thought this might be a permissions issue on the host, with vpxuser not having the full root rights it is officially documented to have. But taking a more careful look at the log messages, Update Manager does not seem to be using vpxuser at all; it tries to access the host as “root”, which is indeed blocked from logging in when lockdown mode is enabled.

Disabling lockdown mode immediately solved the issue for the customer.
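If the DCUI or the Web Client are not handy, lockdown mode can also be toggled through the vSphere API. Here is a hedged pyVmomi sketch; the vCenter address, credentials and host name are placeholders, and the adminDisabled check is an assumption about how 5.x hosts report lockdown mode.

# Sketch using pyVmomi: take an ESXi host out of lockdown mode via vCenter.
# vCenter address, credentials and the host name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == "esxi01.example.com")

    # adminDisabled is assumed to reflect lockdown mode on 5.x hosts
    if host.config.adminDisabled:
        host.ExitLockdownMode()
        print("Lockdown mode disabled on %s" % host.name)
    else:
        print("%s is not in lockdown mode" % host.name)
finally:
    Disconnect(si)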


Categories: Case Post Mortems Tags: