
vCenter performance suffering due to feature usage

Last week I handled a customer case in which vCenter Server would crash randomly, and even when vCenter Server was operational it would use most of the system's CPU.

The vpxd.log showed the following error, which indicates that the system ran out of usable RAM for vCenter Server:

2014-06-10T22:41:47.127+02:00 [11100 panic 'Default' opID=task-internal-1-62fd32e8-17]
-->
--> Panic: Unable to allocate memory
--> Backtrace:
--> backtrace[00] rip 000000018018b6aa
--> backtrace[01] rip 0000000180105288
--> backtrace[02] rip 000000018010668e
--> backtrace[03] rip 00000001801a1c4c
--> backtrace[04] rip 00000001801a1d9c
--> backtrace[05] rip 000000018009e466
--> backtrace[06] rip 000000018009e499
--> backtrace[07] rip 0000000073b710af
--> backtrace[08] rip 0000000073bbcb9f
--> backtrace[09] rip 0000000073bbcc87
--> backtrace[10] rip 0000000140c85348
--> backtrace[11] rip 00000001410fa8d2
--> backtrace[12] rip 0000000140fad26b
--> backtrace[13] rip 0000000140fadb12
--> backtrace[14] rip 00000001401175db
--> backtrace[15] rip 0000000000830453
--> backtrace[16] rip 00000000003e980c
--> backtrace[17] rip 00000000004293ff
--> backtrace[18] rip 00000001404185dc
--> backtrace[19] rip 0000000140a71a40
--> backtrace[20] rip 0000000140a72533
--> backtrace[21] rip 00000001408b22fe
--> backtrace[22] rip 00000001408b2350
--> backtrace[23] rip 0000000141073882
--> backtrace[24] rip 00000001410b2c4c
--> backtrace[25] rip 0000000140fc385a
--> backtrace[26] rip 00000001410893cb
--> backtrace[27] rip 00000001404257bf
--> backtrace[28] rip 0000000140429174
--> backtrace[29] rip 0000000140437874
--> backtrace[30] rip 00000001404402f5
--> backtrace[31] rip 0000000180153334
--> backtrace[32] rip 00000001801533c1
--> backtrace[33] rip 0000000180157d0c
--> backtrace[34] rip 00000001801595ca
--> backtrace[35] rip 0000000180159b0c
--> backtrace[36] rip 0000000180159ca8
--> backtrace[37] rip 00000001801548a3
--> backtrace[38] rip 00000001801558f3
--> backtrace[39] rip 0000000180155c33
--> backtrace[40] rip 000000018019db2a
--> backtrace[41] rip 0000000073b72fdf
--> backtrace[42] rip 0000000073b73080
--> backtrace[43] rip 00000000778059ed
--> backtrace[44] rip 000000007793ba01
-->
2014-06-10T22:41:48.180+02:00 [11100 panic 'Default' opID=task-internal-1-62fd32e8-17] (Log recursion level 2) Unable to allocate memory

So we gathered two extended log bundles between service crashes to see whether the private memory of vpxd was growing, and indeed within 45 minutes the memory usage doubled from 1.3 GB to 2.6 GB.
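
If you want to track this kind of growth yourself between crashes, a quick option is to sample the memory usage of vpxd.exe periodically. Below is a minimal Python sketch using psutil; the output file name and interval are arbitrary choices, and the working set it reports is only a rough stand-in for the private memory counter you would read in the log bundle or Process Explorer.

# Rough monitoring sketch: sample the memory usage of vpxd.exe once a minute
# and append it to a CSV file. psutil's rss (the working set on Windows) is
# only a proxy for private bytes, but the growth trend is what matters here.
import csv
import time
from datetime import datetime

import psutil

def vpxd_rss_mb():
    """Return the current working set of vpxd.exe in MB, or None if it is not running."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] and proc.info["name"].lower() == "vpxd.exe":
            try:
                return proc.memory_info().rss / (1024 * 1024)
            except psutil.NoSuchProcess:
                return None
    return None

with open("vpxd_memory.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        writer.writerow([datetime.now().isoformat(), vpxd_rss_mb()])
        f.flush()
        time.sleep(60)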

The inventory was medium-sized:

--> /InventoryStats/ManagedEntityStats/Clusters/total 16
--> /InventoryStats/ManagedEntityStats/Datacenters/total 1
--> /InventoryStats/ManagedEntityStats/Datastores/total 113
--> /InventoryStats/ManagedEntityStats/NumHosts/total 47
--> /InventoryStats/ManagedEntityStats/NumPoweredOnVms/total 1028
--> /InventoryStats/ManagedEntityStats/NumVirtualMachines/total 1055
--> /InventoryStats/ManagedEntityStats/ResourcePools/total 23
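
These counts came straight out of the vpxd profiler log, but roughly the same numbers can be pulled from the API. Here is a small pyvmomi sketch; the hostname and credentials are placeholders, and it obviously assumes vCenter is responsive enough to answer.

# Count the main inventory object types via the vSphere API with pyvmomi.
# Hostname and credentials below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

def count(vim_type):
    """Return the number of inventory objects of the given managed object type."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim_type], True)
    try:
        return len(view.view)
    finally:
        view.Destroy()

for label, vim_type in [("Clusters", vim.ClusterComputeResource),
                        ("Hosts", vim.HostSystem),
                        ("VMs", vim.VirtualMachine),
                        ("Datastores", vim.Datastore)]:
    print(f"{label}: {count(vim_type)}")

Disconnect(si)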

The VM had the following resource allocation:

Processor(s): 4 Processor(s) Installed.
[01]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[02]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[03]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[04]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz

Total Physical Memory: 16.383 MB
Available Physical Memory: 8.941 MB
Virtual Memory: Max Size: 24.574 MB
Virtual Memory: Available: 16.059 MB
Virtual Memory: In Use: 8.515 MB

So the VM was configured with 4 cores and 16 GB of RAM. For vCenter Server 5.5 this was more than enough to meet the minimum requirements. So what was actually happening?

Let’s have a look at the system utilization for memory and CPU.

[Image: vpxd memory utilization graph (vpxd_crash1)]

We can see that memory usage is extremely spiky during the first half of the logs we have, while it should stay at around 2 GB at most during normal operation. After about 20 hours the customer implemented a change that I recommended, and the memory utilization instantly stabilized. The CPU graph looks similar.

[Image: vpxd CPU utilization graph (vpxd_crash2)]

And again, as soon as the suggested change was implemented, CPU utilization dropped below 20% on average.

Going through the profiler logs we could see that a large share of the memory was being spent on cluster operations.

/ScoreboardStats/ObjectTypes/Class='ClusterRecommendationsTracker::RecVisitInfo'/Count/total 0.704939
/ScoreboardStats/ObjectTypes/Class='Vim::Cluster::Recommendation'/Count/total 0.684619
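
The scoreboard counters in the vpxd profiler logs make it fairly easy to rank which object classes dominate. Here is a rough sketch of how one could sort them; the profiler log file name is an assumption, so point it at whichever vpxd-profiler file your log bundle contains.

# Rank /ScoreboardStats/ObjectTypes counters from a vpxd profiler log so the
# dominant object classes stand out. The file name below is an assumption.
import re
from pathlib import Path

PATTERN = re.compile(r"(/ScoreboardStats/ObjectTypes/Class='[^']+'/Count/total)\s+([\d.]+)")

counters = {}
log_file = Path("vpxd-profiler-1.log")  # adjust to the file in your log bundle
for line in log_file.read_text(errors="ignore").splitlines():
    match = PATTERN.search(line)
    if match:
        key, value = match.group(1), float(match.group(2))
        counters[key] = max(counters.get(key, 0.0), value)

# Print the ten largest counters.
for key, value in sorted(counters.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{value:10.6f}  {key}")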

Looking at the system with Process Explorer, we could see that two threads were hogging the CPU, but their stacks were not readable due to the lack of symbols; the topmost resolvable function was just a thread exit.

After loading the dump into a debugger, we can verify this.

Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Machine Name:
Debug session time: Thu Jun 12 14:35:57.000 2014 (GMT+1)
System Uptime: 0 days 1:09:39.037
Process Uptime: 0 days 1:04:00.000
……………………………………………………….
……………………………………………………….
………………………………….
Loading unloaded module list
….
ntdll!ZwWaitForSingleObject+0xa:
00000000`76da05fa c3 ret
0:000> !runaway
User Mode Time
Thread Time
216:1710 0 days 0:57:15.922
744:1f00 0 days 0:53:48.487
2:10d4 0 days 0:00:13.868
693:1e34 0 days 0:00:07.768

We can see here that threads 216 and 744 had consumed 57 and 53 minutes of CPU time respectively, out of a process uptime of just over an hour (64 minutes), so each of those threads was burning close to a full core for the entire lifetime of the process. This is definitely not normal.

Looking at the actual thread stack, we can be pretty certain that something within vpxd.exe is indeed the culprit.

0:216> ~744 k
Child-SP RetAddr Call Site
00000000`406bd048 00000000`721692dd msvcp90!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::compare+0x93
00000000`406bd050 00000001`404aa1af msvcp90!std::operator<<char,std::char_traits<char>,std::allocator<char> >+0x29
00000000`406bd090 00000001`404aa2cc vpxd+0x12ba1af
00000000`406bd150 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd210 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd2d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd390 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd450 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd510 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd5d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd690 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd750 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd810 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd8d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd990 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bda50 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdb10 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdbd0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdc90 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdd50 00000001`404aa2cc vpxd+0x12ba2cc

At this point I was already pretty certain, from the evidence in the profiler logs shown above, that we were dealing with a DRS calculation issue. Loading private symbols revealed the same in the actual stack traces of both threads.

DRS logs all of its calculations into so-called drmdumps. These files are usually a couple of kilobytes at most, but it turned out that for two clusters these dump files contained more than 8 MB of pure calculation text.
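
A quick way to check whether your own environment leans in the same direction is to look at the size of the drmdump files on the vCenter Server. Below is a small sketch; the base path is what I would expect for a Windows vCenter Server 5.5 install, so verify it for your own version and setup.

# Report the largest DRS dump files (drmdumps). The base path is an assumption
# for a Windows vCenter Server 5.5 install; adjust it for your environment.
from pathlib import Path

DRMDUMP_DIR = Path(r"C:\ProgramData\VMware\VMware VirtualCenter\Logs\drmdump")

sizes = [(dump.stat().st_size, dump)
         for dump in DRMDUMP_DIR.rglob("*") if dump.is_file()]

# Print the twenty largest dump files, biggest first.
for size, dump in sorted(sizes, reverse=True)[:20]:
    print(f"{size / (1024 * 1024):8.1f} MB  {dump.relative_to(DRMDUMP_DIR)}")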

The customer had implemented several new DRS rules on a very large cluster containing over 500 VMs. The calculations DRS now had to perform every 5 minutes were killing the system, even though it met the minimum requirements.

The change the customer implemented was to switch DRS to manual mode, which instantly stabilized the system.
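
For reference, the same change can be scripted. Here is a minimal pyvmomi sketch that sets the DRS automation level of a cluster to manual; the connection details and the cluster name are placeholders.

# Switch DRS on a cluster to manual mode via the vSphere API with pyvmomi.
# Connection details and the cluster name below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Prod-Cluster-01")
view.Destroy()

# Valid values for defaultVmBehavior are manual, partiallyAutomated and fullyAutomated.
spec = vim.cluster.ConfigSpecEx(
    drsConfig=vim.cluster.DrsConfigInfo(enabled=True, defaultVmBehavior="manual"))

WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
Disconnect(si)

Keep in mind that in manual mode DRS only generates recommendations; placement and load-balancing moves have to be approved by an administrator.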

From there the customer basically has two choices:

1) Add more hardware to the system.
2) Try to keep the rules to an absolute minimum, as more rules and more complex rules will have an impact on vCenter Server performance (a sketch for reviewing the current rules follows below).
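
To make option 2 easier to act on, it helps to get an overview of which clusters carry how many rules. Here is a small pyvmomi sketch that lists them; it expects the RetrieveContent() result of an already connected ServiceInstance, as shown in the earlier examples.

# List the DRS rules configured on each cluster so the rule count is easy to review.
# Expects 'content' from an already connected ServiceInstance (si.RetrieveContent()).
from pyVmomi import vim

def print_drs_rules(content):
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    try:
        for cluster in view.view:
            rules = cluster.configurationEx.rule or []
            print(f"{cluster.name}: {len(rules)} DRS rule(s)")
            for rule in rules:
                print(f"  - {rule.name} (enabled: {rule.enabled})")
    finally:
        view.Destroy()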
