Archive

Archive for June, 2014

Virtual Machine hardware version influencing QueryPerformanceCounter behaviour on reboot aka Nagios showing wrong system uptimes

June 28th, 2014 No comments

There is already a KB article on QueryPerformanceCounter in the VMware knowledge base, covering Windows XP and Windows Server 2003:

QueryPerformanceCounter behaves improperly inside a virtual machine when /usepmtimer is used with some Windows HALs

Last week a colleague of mine had another customer express concerns about how this counter is handled on their Windows Server 2008 R2 VMs. They noticed that under certain circumstances the counter would no longer reset when performing an in-guest reboot. For a lot of customers this would generally not be a concern, but the monitoring solution the customer was using, as implemented on their site, relied on this counter to calculate the system uptime. So if the counter was not reset after a reboot, misleading uptime values would be shown.
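
To illustrate the failure mode, here is a minimal sketch of deriving an "uptime" value from QueryPerformanceCounter. This is not the customer's actual monitoring check (I have not seen its implementation), just the general approach the article describes:

// Minimal sketch: deriving "system uptime" from QueryPerformanceCounter.
// Illustration only -- not the customer's actual monitoring plugin.
#include <windows.h>
#include <iostream>

int main() {
    LARGE_INTEGER frequency, counter;
    QueryPerformanceFrequency(&frequency); // ticks per second
    QueryPerformanceCounter(&counter);     // ticks since the counter started

    // This only equals the real uptime if the counter was reset at boot.
    // If the virtual hardware keeps the counter running across an
    // in-guest reboot, the derived "uptime" keeps growing instead.
    double uptimeSeconds =
        static_cast<double>(counter.QuadPart) / frequency.QuadPart;
    std::cout << "Derived uptime: " << uptimeSeconds << " seconds\n";
    return 0;
}

As long as the counter resets at boot this tracks the OS-reported uptime closely; in the affected configuration it keeps counting across reboots, which is exactly what the monitoring tripped over.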

Read more…

Categories: Case Post Mortems Tags:

SSO deployment options or how to achieve single pane of glass using the VCSA 5.5

June 22nd, 2014 6 comments

And another twitter question leading up to a blog post.

— Andrew Dauncey (@daunce_) 22 June 2014


Actually I do not, so why not write one myself then. Those who attended the SSO GSS webinar or have seen me present at VMUGs or customer days will already know what I am talking about at a high level; this post will fill in the details.

Read more…

Categories: SSO Tags:

vCenter performance suffering due to feature usage

June 15th, 2014 No comments

Last week I handled a customer case in which vCenter Server would crash randomly, and even while it was operational it would use most of the system's CPU.

The vpxd.log showed the following error, which indicates that the system ran out of usable RAM for vCenter Server.

2014-06-10T22:41:47.127+02:00 [11100 panic 'Default' opID=task-internal-1-62fd32e8-17]
-->
--> Panic: Unable to allocate memory
--> Backtrace:
--> backtrace[00] rip 000000018018b6aa
--> backtrace[01] rip 0000000180105288
--> backtrace[02] rip 000000018010668e
--> backtrace[03] rip 00000001801a1c4c
--> backtrace[04] rip 00000001801a1d9c
--> backtrace[05] rip 000000018009e466
--> backtrace[06] rip 000000018009e499
--> backtrace[07] rip 0000000073b710af
--> backtrace[08] rip 0000000073bbcb9f
--> backtrace[09] rip 0000000073bbcc87
--> backtrace[10] rip 0000000140c85348
--> backtrace[11] rip 00000001410fa8d2
--> backtrace[12] rip 0000000140fad26b
--> backtrace[13] rip 0000000140fadb12
--> backtrace[14] rip 00000001401175db
--> backtrace[15] rip 0000000000830453
--> backtrace[16] rip 00000000003e980c
--> backtrace[17] rip 00000000004293ff
--> backtrace[18] rip 00000001404185dc
--> backtrace[19] rip 0000000140a71a40
--> backtrace[20] rip 0000000140a72533
--> backtrace[21] rip 00000001408b22fe
--> backtrace[22] rip 00000001408b2350
--> backtrace[23] rip 0000000141073882
--> backtrace[24] rip 00000001410b2c4c
--> backtrace[25] rip 0000000140fc385a
--> backtrace[26] rip 00000001410893cb
--> backtrace[27] rip 00000001404257bf
--> backtrace[28] rip 0000000140429174
--> backtrace[29] rip 0000000140437874
--> backtrace[30] rip 00000001404402f5
--> backtrace[31] rip 0000000180153334
--> backtrace[32] rip 00000001801533c1
--> backtrace[33] rip 0000000180157d0c
--> backtrace[34] rip 00000001801595ca
--> backtrace[35] rip 0000000180159b0c
--> backtrace[36] rip 0000000180159ca8
--> backtrace[37] rip 00000001801548a3
--> backtrace[38] rip 00000001801558f3
--> backtrace[39] rip 0000000180155c33
--> backtrace[40] rip 000000018019db2a
--> backtrace[41] rip 0000000073b72fdf
--> backtrace[42] rip 0000000073b73080
--> backtrace[43] rip 00000000778059ed
--> backtrace[44] rip 000000007793ba01
-->
2014-06-10T22:41:48.180+02:00 [11100 panic 'Default' opID=task-internal-1-62fd32e8-17] (Log recursion level 2) Unable to allocate memory

So we gathered two extended log bundles between service crashes to see whether the private memory of the vpxd process grew, and indeed it did: within 45 minutes the memory usage doubled from 1.3 GB to 2.6 GB.
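
We read these numbers out of the log bundles, but if you want to spot-check this live on the vCenter box, a hedged sketch of sampling a process's private bytes via the Win32 API follows; the PID below is purely illustrative:

// Sketch: sampling a process's private bytes via psapi.
// We got our numbers from the extended log bundles; this is just a way
// to watch the growth live. The PID is a hypothetical vpxd.exe PID.
#include <windows.h>
#include <psapi.h>
#include <iostream>

#pragma comment(lib, "psapi.lib")

int main() {
    DWORD pid = 11100; // hypothetical vpxd.exe PID -- look it up first
    HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                                 FALSE, pid);
    if (!process) return 1;

    PROCESS_MEMORY_COUNTERS_EX pmc = {};
    if (GetProcessMemoryInfo(process,
                             reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&pmc),
                             sizeof(pmc))) {
        std::cout << "Private bytes: "
                  << pmc.PrivateUsage / (1024 * 1024) << " MB\n";
    }
    CloseHandle(process);
    return 0;
}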

The inventory was medium-sized:

--> /InventoryStats/ManagedEntityStats/Clusters/total 16
--> /InventoryStats/ManagedEntityStats/Datacenters/total 1
--> /InventoryStats/ManagedEntityStats/Datastores/total 113
--> /InventoryStats/ManagedEntityStats/NumHosts/total 47
--> /InventoryStats/ManagedEntityStats/NumPoweredOnVms/total 1028
--> /InventoryStats/ManagedEntityStats/NumVirtualMachines/total 1055
--> /InventoryStats/ManagedEntityStats/ResourcePools/total 23

The VM had the following resource allocation:

Processor(s): 4 Processor(s) Installed.
[01]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[02]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[03]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz
[04]: Intel64 Family 6 Model 37 Stepping 1 GenuineIntel ~2261 Mhz

Total Physical Memory: 16.383 MB
Available Physical Memory: 8.941 MB
Virtual Memory: Max Size: 24.574 MB
Virtual Memory: Available: 16.059 MB
Virtual Memory: In Use: 8.515 MB

So the VM was configured with 4 cores and 16 GB of RAM. For vCenter Server 5.5 this was more than enough to meet the minimum requirements. So what was actually happening?

Let’s have a look at the system utilization for memory and CPU.

[vpxd_crash1: vpxd memory utilization graph]

We can see that memory usage is extremely spiky during the first half of the log timespan we have, whereas it should stay around 2 GB at most during normal operation. After about 20 hours the customer implemented a change that I recommended, and memory utilization instantly stabilized. The CPU graph looks similar.

[vpxd_crash2: vpxd CPU utilization graph]

And again, as soon as the suggested change was implemented, CPU utilization dropped below 20% on average.

Going through the profiler logs we could see that the system spent a lot of RAM on cluster operations.

/ScoreboardStats/ObjectTypes/Class=’ClusterRecommendationsTracker::RecVisitInfo’/Count/total 0.704939
/ScoreboardStats/ObjectTypes/Class=’Vim::Cluster::Recommendation’/Count/total 0.684619

Looking at the system using Process Explorer, we could see that two threads were hogging the CPU, but the stack was not usable due to the lack of symbols; the topmost visible function was just a thread exit.

After loading the dump into a debugger we can actually verify this.

Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Machine Name:
Debug session time: Thu Jun 12 14:35:57.000 2014 (GMT+1)
System Uptime: 0 days 1:09:39.037
Process Uptime: 0 days 1:04:00.000
……………………………………………………….
……………………………………………………….
………………………………….
Loading unloaded module list
….
ntdll!ZwWaitForSingleObject+0xa:
00000000`76da05fa c3 ret
0:000> !runaway
User Mode Time
Thread Time
216:1710 0 days 0:57:15.922
744:1f00 0 days 0:53:48.487
2:10d4 0 days 0:00:13.868
693:1e34 0 days 0:00:07.768

We can see here that threads 216 and 744 consumed 57 and 53 minutes of CPU time respectively, out of a total process uptime of little over an hour. This is definitely not normal.
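
For reference, the same per-thread user time that !runaway reports can be pulled programmatically with GetThreadTimes; a small sketch (the PID is again illustrative):

// Sketch: per-thread user CPU time, the same data !runaway shows.
// Enumerates the threads of one process via Toolhelp32 and queries
// GetThreadTimes for each. The PID is a hypothetical vpxd.exe PID.
#include <windows.h>
#include <tlhelp32.h>
#include <iostream>

int main() {
    DWORD pid = 11100; // hypothetical vpxd.exe PID
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE) return 1;

    THREADENTRY32 te = { sizeof(te) };
    for (BOOL more = Thread32First(snap, &te); more;
         more = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid) continue;

        HANDLE thread = OpenThread(THREAD_QUERY_INFORMATION, FALSE,
                                   te.th32ThreadID);
        if (!thread) continue;

        FILETIME creation, exitTime, kernel, user;
        if (GetThreadTimes(thread, &creation, &exitTime, &kernel, &user)) {
            // FILETIME counts 100-nanosecond ticks.
            ULONGLONG userTicks =
                (static_cast<ULONGLONG>(user.dwHighDateTime) << 32) |
                user.dwLowDateTime;
            std::cout << "TID " << te.th32ThreadID << " user CPU: "
                      << userTicks / 10000000.0 << " s\n";
        }
        CloseHandle(thread);
    }
    CloseHandle(snap);
    return 0;
}

Note that !runaway prints the debugger's thread index followed by the thread ID in hex (e.g. 216:1710), while this sketch prints decimal thread IDs.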

Looking at the actual thread stack, we can be pretty certain that something within vpxd.exe is indeed the culprit.

0:216> ~744 k
Child-SP RetAddr Call Site
00000000`406bd048 00000000`721692dd msvcp90!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::compare+0x93
00000000`406bd050 00000001`404aa1af msvcp90!std::operator<<char,std::char_traits<char>,std::allocator<char> >+0x29
00000000`406bd090 00000001`404aa2cc vpxd+0x12ba1af
00000000`406bd150 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd210 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd2d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd390 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd450 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd510 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd5d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd690 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd750 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd810 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd8d0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bd990 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bda50 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdb10 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdbd0 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdc90 00000001`404aa2cc vpxd+0x12ba2cc
00000000`406bdd50 00000001`404aa2cc vpxd+0x12ba2cc

At this point I was already pretty certain, from the evidence in the profiler logs shown above, that we had a DRS calculation issue at hand. Loading private symbols confirmed this in the actual stack traces of both threads.

DRS logs all of its calculations into so-called drmDumps. These files are usually a couple of kilobytes at most, but it turned out that for two clusters these dump files contained more than 8 MB of pure calculation text.
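
A quick way to spot such outliers is to scan the drmDump directory for oversized files; a sketch follows (the path is an assumption for a default Windows vCenter 5.5 install, so verify it on your own setup):

// Sketch: flag drmDump files that are far above the usual few kilobytes.
// The directory below is an assumed default location -- adjust as needed.
#include <cstdint>
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    const fs::path drmDumpDir =
        "C:/ProgramData/VMware/VMware VirtualCenter/Logs/drmdump"; // assumed
    const std::uintmax_t threshold = 1024 * 1024; // anything over 1 MB is suspicious

    for (const auto& entry : fs::recursive_directory_iterator(drmDumpDir)) {
        if (entry.is_regular_file() && entry.file_size() > threshold) {
            std::cout << entry.path().string() << " : "
                      << entry.file_size() / 1024 << " KB\n";
        }
    }
    return 0;
}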

The customer had implemented several new DRS rules on a very large cluster containing over 500 VMs. The calculations DRS now performed every 5 minutes were killing the system, even though it met the minimum requirements.

The change the customer implemented was to set DRS to manual mode, which instantly stabilized the system.

From there the customer basically has two choices:

1) Add more hardware to the system.
2) Keep the rules to an absolute minimum, as more numerous and more complex rules will impact vCenter Server performance.

Categories: Case Post Mortems Tags:

Changing the host name of vCenter Server 5.5 without reinstalling

June 1st, 2014 4 comments

And another twitter request that led to some experimenting with what you can do without relying on the installer.

Challenge accepted. The victim VC will be my own lab environment VC, so this had better work in the end. Jason provided additional information that makes this procedure possible: the vCenter Server was a self-contained instance without any dependencies from the upper stack.

Disclaimer: This is not a tested or officially approved procedure by VMware by any means. Use at your own risk; don't come to me complaining if everything breaks, and DO A BACKUP! The proper way would be a reinstall of the binary packages against the existing databases.

Read more…

Categories: Homelab, SSL, SSO Tags: