
Help my VM is hung/unresponsive – what to collect for GSS and why we need it.


This came up on Twitter today. Simply restarting a hung VM might give you a fast turnaround for the immediate issue, but it won't show you the root cause of the hang, and let's be honest, normally that is what we are interested in, right? (There will be exceptions: if your mission-critical, revenue-generating system is in that state, you probably want to restart it as fast as possible and then go back to the whiteboard and redesign that single point of failure as soon as possible.)

So what would you do if you actually wanted to troubleshoot this? GSS will recommend collecting the following information (a consolidated sketch of the commands follows after the list). Note that depending on the browser you are using, the parameters below might be rendered with a single dash; that is wrong, it is always a double dash.

  1. Try to ping the VM. Note down the result.
  2. Put the cluster's DRS into manual mode (do not turn DRS off, especially in versions pre 5.5 and if you have Resource Pools configured) to avoid the VM moving around while we do the data collection. (In large and active clusters you can choose to just disable DRS recommendations for the affected VM.)
  3. Open an SSH session (for example with PuTTY) to the host and run the following command:
    vm-support --listvms
  4. Get the vmx file path from the above command and run the following command (assuming UpgradeVC was the affected VM):
    vm-support --performance --manifests="HungVM:Coredump_VM HungVM:Suspend_VM" --groups="Fault Hardware Logs Network Storage System Userworld VirtualMachines" --vm="/vmfs/volumes/54059b9f-6f770d82-be3b-cf4bbc6bf10/UpgradeVC/UpgradeVC.vmx" -w /vmfs/volumes/datastore_enough_space
  5. This will generate a log bundle containing a performance snapshot, extended stats for CPU calls within the guest OS, call stacks for the VMM and a full memory dump of the VM.
  6. Set DRS back to fully automated.
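
For convenience, here is a hedged, consolidated sketch of steps 1 to 5 as they would look from the ESXi shell. The VM name, vmx path and destination datastore are taken from the example above, so substitute your own; the free-space check is my own addition to make sure the dump actually fits.

# Step 1: check basic reachability of the guest and note down the result
ping <ip-of-the-vm>

# Step 3: list the registered VMs and their vmx paths
vm-support --listvms

# Extra check: make sure the destination datastore has enough free space for the memory dump
df -h /vmfs/volumes/datastore_enough_space

# Step 4: collect the performance snapshot, VMM backtraces and the full memory dump of the VM
vm-support --performance --manifests="HungVM:Coredump_VM HungVM:Suspend_VM" --groups="Fault Hardware Logs Network Storage System Userworld VirtualMachines" --vm="/vmfs/volumes/54059b9f-6f770d82-be3b-cf4bbc6bf10/UpgradeVC/UpgradeVC.vmx" -w /vmfs/volumes/datastore_enough_space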

Please take note that depending on the amount of RAM configured for the VM, the full memory dump can take quite a while. I have seen this process take up to 2 hours in total, so be patient when collecting those logs.

What does each of these steps allow GSS to do? The performance log bundle enables us to see if the vCPUs of the VM are still processing instructions. We also see if the amount of active RAM is changing (meaning new pages are being touched or not) and whether the VM is still issuing network or storage IO. This helps narrow down the actual nature of the issue quite a bit. Just because a VM no longer responds in the GUI doesn't mean the whole VM is completely frozen; it could just be locked up from within and no longer updating its draw state, which is why step number 1 is to ping the VM. Network processing is done in kernel space on most operating systems, meaning it is usually independent of any user world issues (except if the VM is a single-core VM and that CPU is completely locked by a user world process, in which case it won't respond to pings either).
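
If you want a quick look at this yourself while the bundle is still being generated, esxtop in batch mode captures CPU, memory, network and storage counters for everything running on the host; the interval, sample count and output path below are just an example:

# capture 12 samples at a 5 second interval into a CSV for later review
esxtop -b -d 5 -n 12 > /vmfs/volumes/datastore_enough_space/esxtop-hung-vm.csv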

If the VM is actually still issuing IO or touching RAM pages, then we can most likely conclude that the hypervisor is NOT the root cause of the problem. In my experience that is the case most of the time (the only time I have personally seen this not to be true was a now fixed issue in the vSCSI layer causing VM hangs due to a race condition).

As the issue at that point is most likely to be found within the guest OS, some further questions will need to be answered:

  1. Is only one VM affected, or are there more?
  2. Do the affected VMs have anything in common (operating system family, same installed software, all cloned from the same template, etc.)? A quick way to check this per host is shown after this list.
  3. When did the issues start happening? (Having proper change management in place is crucial here.)
  4. Is there any pattern that can be determined from the hangs?
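
For question 2, a quick per-host way to spot common traits is the inventory listing below; it shows the configured guest OS and the vmx path for every registered VM, which is often enough to notice that all affected VMs share an OS family or came from the same template (in larger environments you would script this across all hosts instead):

# list vmid, name, datastore, vmx file, configured guest OS and virtual hardware version per VM
vim-cmd vmsvc/getallvms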

The extended CPU stats in combination with the memory dump of the VM can help narrow down what is actually happening inside the guest. Are we looping through a certain piece of code or are we completely idle? The first case is most likely a software bug in one specific piece of software, which is then more or less easy to identify. The latter usually means we have some sort of deadlock inside the guest OS, which is a bit harder to narrow down.

If it comes to that point, GSS will also most likely ask for Microsoft or any other relevant third party to be involved, as public symbols only go so far in a dump analysis.

The VMM (Virtual Machine Monitor) dumps and backtraces that are collected give additional information on the process responsible for scheduling and executing the VM's instructions on the actual hardware. A problem there could also cause a hang, but that is extremely rare; most of the time when I had to analyze hangs, it came down to some user world process inside the guest OS, with a clear bias towards Windows systems (this might be down to limited exposure though, I am not saying Linux is a better operating system!).

So what does such an analysis look like in practice? Let's walk through one specific example that led to the following KB being published:

Virtual machines running on VMware ESXi 5.5 with vShield endpoint activated get into a hung state during snapshot operations

The customer reported that VM 'xyz' was hung and that several VMs had been affected during the course of the week. After collecting all the necessary information, some general information gathering from the logs was done.

[528]-> grep mem *.vmx
memsize = "6144"
sched.mem.minsize = "0"
sched.mem.shares = "normal"

[529]-> grep guestOS *.vmx
guestOS = "windows7srv-64"

[531]-> cat ../../../../commands/vmware_-vl.txt
VMware ESXi 5.5.0 build-2143827
VMware ESXi 5.5.0 Update 2

What do these commands tell us? We directly know which guest operating system was configured at creation time of the VM (it does not need to match the actually installed OS, but usually it will), the amount of RAM the VM was configured with (useful for knowing how long the coffee breaks between debugger commands will be) and the ESXi version the VM was running on (always helpful for researching known issues).
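
While you are at it, a couple more greps against the same vmx file can round off the picture; the two keys below (vCPU count and virtual hardware version) are standard vmx entries, shown here just as an example, and note that numvcpus is only written when more than one vCPU is configured:

grep -E "numvcpus|virtualHW.version" *.vmx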

Next we are going to look for the advanced CPU instrumentation that we put into the log files.

[532]-> grep VMSAM vmware*.log | awk '{print $2" "$6}' | sort | uniq -c | sort -n | tail -n 5
15 vcpu-0| rip=0xfffff80001c00f6c
38 vcpu-0| rip=0x7fef4057f00
235 vcpu-0| rip=0xfffff80001c0617e
1855 vcpu-0| rip=0xfffff88002f7e9c2
2147 vcpu-1| rip=0xfffff88002f7e9c2

What do we see here? This command greps for the instruction pointer addresses (these are put in as additional information by the procedure mentioned above; this is not default logging), then sorts them by occurrence, per vCPU. This gives us an overview of what code was executed in the VM while the log bundle was being collected. We can see that we have a dual vCPU VM here and that one code location (rip=0xfffff88002f7e9c2) was hit particularly often. This usually indicates idle looping, and we are going to confirm that in just a second.

The next step is to see what happened in the vmware.log before the VM was suspended.

2014-11-13T13:12:51.014Z| vcpu-0| I120: Guest: vsep: AUDIT: VFileAuditSvmConnectivity : Lost connectivity to SVM, irpStatus: 0xc0000001
2014-11-13T14:02:10.778Z| vmx| I120: OvhdMem OvhdUser_SVMLowMem  : 3 3 | 0 0 0
2014-11-13T14:02:10.778Z| vmx| I120: OvhdMem OvhdMon_SVMIDT  : 2 2 | 0 0 0
2014-11-13T14:02:10.787Z| vmx| I120: SVMotion_PowerOff: Not running Storage vMotion. Nothing to do
2014-11-13T14:02:10.792Z| vmx| I120: SVMotion_PowerOff: Not running Storage vMotion. Nothing to do
2014-11-13T14:02:09.916Z| vcpu-0| I120: SUSPEND: Completed suspend: ‘Operation completed successfully’ (0)

It looks like we lost connectivity to the vShield Endpoint security VM (SVM). A snapshot consolidation was also visible shortly before that.
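
If you want to pull those events out of the log yourself, a grep along the following lines works; the exact wording of the consolidation messages differs between versions, so treat the second pattern as an assumption and adjust it to what your vmware.log actually contains:

grep -iE "Lost connectivity to SVM|consolidate" vmware*.log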

The difference between the connectivity loss and the actual suspend is roughly 49 minutes; this will become important later on.

So let’s fire up Windbg and actually look at the memory dump.

-----------------------------

Microsoft (R) Windows Debugger Version 6.3.9600.17237 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [F:\Cases\VincentHang\memory.dmp]
Kernel Complete Dump File: Full address space is available
************* Symbol Path validation summary **************
Response Time (ms) Location
Deferred
srv*C:\WinDbgSymbols*http://msdl.microsoft.com/download/symbols
Symbol search path is:
srv*C:\WinDbgSymbols*http://msdl.microsoft.com/download/symbols;srv*C:\WinDbgSymbols
Executable search path is:
Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 7601.18409.amd64fre.win7sp1_gdr.140303-2144
Machine Name:
Kernel base = 0xfffff800`01613000 PsLoadedModuleList = 0xfffff800`01856890
Debug session time: Thu Nov 13 14:01:34.001 2014 (UTC + 0:00)
System Uptime: 3 days 1:08:26.593
Loading Kernel Symbols

We can see that the operating system is Windows Server 2008 R2 SP1 (it shares kernel version 7601 with Windows 7, and the product type shown is Server). We can also see that the system had been running for about 3 days before it experienced the issue.

As a lot of patches had been issued for the vShield Endpoint drivers around that time and the customer was using vShield Endpoint, I issued the following commands to verify they were not on an affected version.

0: kd> !LMI vnetflt
0: kd> !LMI vsepflt

This prints output about build numbers etc. that could then be checked internally. The installed drivers were not affected by any known issues though.

As the next step I wanted to see the number of CPUs seen by the guest and what kind of CPUs the customer used (I always do this and the next step, no matter what the customer's actual issue is).

0: kd> !cpuinfo
CP F/M/S Manufacturer MHz PRCB Signature MSR 8B Signature Features
0 6,37,1 GenuineIntel 2666 0000001500000000 21193ffe
1 6,37,1 GenuineIntel 2666 0000001500000000 21193ffe
Cached Update Signature 0000001500000000
Initial Update Signature 0000001500000000

Intel CPUs, and no known hangs with those at that point in time (some might still remember the BSOD issue on E5 v2 CPUs, for example, where that kind of information can come in handy).
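
As a side note, the same CPU information can be cross-checked from the host side; a minimal example, assuming you have shell access to the ESXi host:

# show family, model and stepping of the physical CPUs on the host
esxcli hardware cpu list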

Next up we are going to look at the processor control block (PRCB) information that is kept for each CPU in the OS; it gives us some general information plus the running thread on each CPU.

0: kd> !prcb 0
PRCB for Processor 0 at fffff80001803e80:
Current IRQL -- 0
Threads-- Current fffff80001811cc0 Next 0000000000000000 Idle fffff80001811cc0
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 0174f54f
Times -- Dpc 00000272 Interrupt 000004dd
Kernel 00ff05a7 User 00021cff

0: kd> !prcb 1
PRCB for Processor 1 at fffff880009bf180:
Current IRQL -- 0
Threads-- Current fffff880009c9f40 Next 0000000000000000 Idle fffff880009c9f40
Processor Index 1 Number (0, 1) GroupSetMember 2
Interrupt Count -- 015c14ff
Times -- Dpc 00000e09 Interrupt 00000306
Kernel 00ffd52c User 00014d72

So it looks like both CPUs are running the idle thread, which indicates that we have an internal lockup in the guest OS itself causing the unresponsiveness. The same information can be gathered using the following command:

0: kd> !running -it

System Processors: (0000000000000003)
Idle Processors: (0000000000000003) (0000000000000000) (0000000000000000)
(0000000000000000)

Prcbs Current (pri) Next (pri) Idle

0 fffff80001803e80 fffff80001811cc0 ( 0)
fffff80001811cc0 …………….

Child-SP RetAddr Call Site
fffff800`0151ac98 fffff800`01691709 intelppm!C1Halt+0x2
fffff800`0151aca0 fffff800`0168089c nt!PoIdle+0x52a
fffff800`0151ad80 00000000`00000000 nt!KiIdleLoop+0x2c

1 fffff880009bf180 fffff880009c9f40 ( 0)
fffff880009c9f40 …………….

Child-SP RetAddr Call Site
fffff880`009d2c98 fffff800`01691709 intelppm!C1Halt+0x2
fffff880`009d2ca0 fffff800`0168089c nt!PoIdle+0x52a
fffff880`009d2d80 00000000`00000000 nt!KiIdleLoop+0x2c

Our most-hit instruction pointer is also part of the idle loop; we can verify that by simply using the u (unassemble) command with the address.

0: kd> u 0xfffff88002f7e9c2
intelppm!C1Halt+0x2:
fffff880`02f7e9c2 c3 ret
fffff880`02f7e9c3 cc int 3
fffff880`02f7e9c4 cc int 3
fffff880`02f7e9c5 cc int 3
fffff880`02f7e9c6 cc int 3
fffff880`02f7e9c7 cc int 3
fffff880`02f7e9c8 cc int 3
fffff880`02f7e9c9 0f1f8000000000 nop dword ptr [rax]

The next steps included an analysis of all locks (using the !locks and !locks -v commands) and of the general stack traces from all processes. This led to the following finding:

More than 10 threads involving vsepflt were blocked with the following stack trace (and other, similar stack traces).

[fffffa8008550060 check_logfiles]
fe4.000bc0 fffffa80088664f0 fefedd7d Blocked nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForSingleObject+0x19f
nt!ExpWaitForResource+0xae
nt!ExAcquireResourceExclusiveLite+0x14f

vsepflt!SomeFunctionNameIAmNotAllowedToDisclose+0x58
vsepflt!OtherFunctionName+0x106
fltmgr!FltpPerformPreCallbacks+0x2f7
fltmgr!FltpPassThrough+0x2d9
fltmgr!FltpDispatch+0xb7
nt!IopCloseFile+0x11f
nt!ObpDecrementHandleCount+0xb4
nt!ObpCloseHandleTableEntry+0xb1
nt!ObpCloseHandle+0x94
nt!KiSystemServiceCopyEnd+0x13
+0x772313aa

This driver actually communicates via the vsock interface with the SVM on the host, the same SVM to which we saw the connectivity loss. So let's look at the vsock call stacks as well.
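
On the host side you can also get a quick view of which endpoint filters are attached to a VM with summarize-dvfilter; look for the affected VM's name and its vsepflt-related slots in the output (the exact slot names vary with the vShield/NSX version in use):

# list the dvfilter agents on the host and the filters attached to each VM
summarize-dvfilter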

0: kd> !stacks 2 vsock
Proc.Thread .Thread Ticks ThreadState Blocker
[fffff80001812180 Idle]
[fffffa80054ee9e0 System]
4.0000c8 fffffa8005779b50 ffffff7f Blocked nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForMultipleObjects+0x272
vsock!NotInterestingFunction+0x120
nt!PspSystemThreadStartup+0x5a
nt!KxStartSystemThread+0x16
4.0000d0 fffffa8005795040 ff01c71e Blocked nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForSingleObject+0x19f
vsock!SomeWaitFunction+0x77
vsock!ConnectionFunction+0x33d
vsepflt!DoesntMatter01+0x5e
vsepflt!DoesntMatter02+0x114

We can see the vsock connection function is in a blocked thread. So let’s get some more info on it.

0: kd> !thread fffffa8005795040
THREAD fffffa8005795040 Cid 0004.00d0 Teb: 0000000000000000 Win32Thread:
0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffffa80088a3bb0 SynchronizationEvent
Not impersonating
DeviceMap fffff8a000008aa0
Owning Process fffffa80054ee9e0 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 16660706 Ticks: 190916 (0:00:49:43.062)
Context Switch Count 831738 IdealProcessor: 0
UserTime 00:00:00.000
KernelTime 00:00:02.781
Win32 Start Address vsepflt!SomeFileProcessingFunction(0xfffff88001392398)
Stack Init fffff88002163db0 Current fffff88002163850
Base fffff88002164000 Limit fffff8800215e000 Call 0
Priority 8 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2

Interesting: the thread has actually been in a waiting state since we lost the connection to the SVM, causing the VM to hang. The wait time matches what we saw in the vmware.log since the disconnect. In the end the issue turned out to be a race condition in the VMCI communication that could happen in some cases. A vMotion or a suspend/resume of the VM works around the issue, and a fix has since been released as well:

VMware ESXi 5.5, Patch Release ESXi550-201412001
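
If you hit a similar situation before the patch is applied and a vMotion is not an option, the suspend/resume workaround can also be driven from the ESXi shell; a minimal sketch, assuming you look up the VM ID first (the VM name below is a placeholder):

# find the VM ID of the affected VM
vim-cmd vmsvc/getallvms | grep "<vm-name>"

# suspend the VM and power it back on (powering on a suspended VM resumes it)
vim-cmd vmsvc/power.suspend <vmid>
vim-cmd vmsvc/power.on <vmid>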

Hopefully this gives you some insight into how GSS approaches guest VM hangs and why the extended information is needed (without the call stacks and the memory dump, the analysis here would have been impossible).
