
Intermittent vMotion failures on ESXi 5.5

August 2nd, 2014

A little while ago I had to work on a vMotion issue that showed some really interesting behaviour.

The customer described the issue as vMotions getting stuck at 5, 13 and 91%.

The 5% mark is easily explained: the vMotion has been kicked off as a task but is scheduled to run a bit later because other vMotions are already running. 13% and 91% are curious though. Those are basically the switch-over points for starting the copy process, and for re-registering the VM and cleaning up once the copy process is done.

If there was a general vMotion issue, one would expect it either not to kick off at all and sit at 13% every time, or to hang at 91% every time, but not a mix of both. Furthermore, the issue only surfaced when more than just a couple of vMotions were triggered, and it affected different VMs each time the process was repeated. A VM that would not vMotion in one run (and could then be vMotioned standalone without any issues) would go through the batch job fine the next time, while a different set of VMs would fail.

We reproduced the issue to find the underlying cause, as this behaviour is not really documented in any KB. During our test, "ExampleVM" failed to vMotion and timed out after a while; that was when we collected the ESXi host logs from both the source and the destination host.

Below I will show the analysis process, which can be taken as a general, step-by-step approach to tackling vMotion problems.

1. The logs will be copied over to a Linux server for analysis (a Windows box works as well if you install the native Linux commands used here via Cygwin, or alternatives like BareGrep, plus a shell interpreter that can run the reconstruct.sh script to reassemble the log files). Commands entered in the shell session are printed in bold.

2. Extract the logs and then run the ./reconstruct.sh script inside the bundle to re-assemble the .frag files that are created whenever a file in the log bundle is larger than 1 MB. A minimal sketch of these first two steps follows below.
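The bundle file name here is just an example matching this case; any vm-support bundle follows the same pattern:

# extract the support bundle, then stitch all *.frag pieces back into whole logs
tar -xzf esx-esx1.example.com-2014-05-28--18.25.tgz
cd esx-esx1.example.com-2014-05-28--18.25
./reconstruct.sh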

3. We need to locate the VM in the logs to be able to access the vmware.log files later.

[515]-> find . -name vmInventory.xml -exec grep -H ExampleVM {} \;

./esx-esx1.example.com-2014-05-28--18.25/etc/vmware/hostd/vmInventory.xml: <vmxCfgPath>/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx</vmxCfgPath>
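Since the inventory gives us the VM's directory, the vmware.log files inside the bundle can be located the same way; a quick sketch (the path pattern is just an assumption based on the output above, and rotated logs are named vmware-1.log, vmware-2.log and so on):

find . -path '*ExampleVM*' -name 'vmware*.log'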

4. Determine source and destination host

source host: esx-esx1.example.com-2014-05-28--18.25
destination host: esx-esx2.example.com-2014-05-28--18.29

5. Determine failure time stamp

time stamp: 2014-05-28 8:00 – 13:00

If the above information cannot be retrieved from the Tasks/Events view, or someone simply did not look at the clock, it can be found either in the vmware.log of the VM, which clearly states which host the VM booted on, or via the state changes for that VM in the hostd.log of both hosts. In a big environment these logs rotate very fast and the same action can end up spread across separate files, so it is useful to concatenate all the hostd.log, vmkernel.log and vpxa.log files into one large file each, as sketched below.
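A minimal sketch of that concatenation, assuming the usual ESXi layout where rotated logs are gzipped with a numeric suffix (the higher the number, the older the file):

cd var/log
for f in $(ls -r hostd.*.gz); do zcat "$f"; done > hostd.all   # oldest rotation first
cat hostd.log >> hostd.all                                     # current log last
# repeat for vmkernel.log and vpxa.log to build vmkernel.all and vpxa.all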

[525]-> grep EMI var/log/hostd.all | grep ExampleVM.vmx on esx-esx1.example.com

2014-05-09T16:15:31.191Z [34981B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=A5FF7F4F-000000D3-78-c5-a6 user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
2014-05-09T16:15:43.387Z [362C4B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx'] State Transition (VM_STATE_EMIGRATING -> VM_STATE_OFF)
2014-05-23T15:57:40.840Z [3EB48B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=0965D0B1-000000C1-d8-e8-9f user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
2014-05-23T15:58:07.475Z [3EB07B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx'] State Transition (VM_STATE_EMIGRATING -> VM_STATE_OFF)
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
2014-05-28T10:47:40.476Z [FFD4A5C0 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] State Transition (VM_STATE_EMIGRATING -> VM_STATE_ON)

[511]-> grep IMMI var/log/hostd.all | grep ExampleVM.vmx on esx-esx2.example.com

2014-05-09T16:15:31.591Z [78840B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=A5FF7F4F-000000D3-78-c5-f5 user=vpxuser] State Transition (VM_STATE_OFF -> VM_STATE_IMMIGRATING)
2014-05-09T16:15:38.813Z [FFE40B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx'] State Transition (VM_STATE_IMMIGRATING -> VM_STATE_ON)
2014-05-23T15:57:41.213Z [FFC2AB70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=0965D0B1-000000C1-d8-e8-a4 user=vpxuser] State Transition (VM_STATE_OFF -> VM_STATE_IMMIGRATING)
2014-05-23T15:58:00.021Z [56203B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx'] State Transition (VM_STATE_IMMIGRATING -> VM_STATE_ON)

We can clearly see two vMotions, with esx-esx1.example.com being the host the VM is emigrating from (and therefore the source host) and esx-esx2.example.com being the host trying to immigrate the VM (and therefore the destination host). There is one more emigration attempt on the source host that is never seen by the destination host at all, even though in the vSphere Client we definitely kicked off the vMotions to that specific host, so this seems to be our exact failure window (2014-05-28T10:43:40 to 10:47).

6. After gathering the crucial information, the configuration should ALWAYS be checked first; nothing is more frustrating than spending an hour on tedious log review when someone simply forgot to tag the proper vmkernel adapter for vMotion traffic.

[503]-> cat commands/esxcfg-vmknic_-l.txt on both source and destination

[Source]

Interface Port Group/DVPort IP Family IP Address Netmask Broadcast MAC Address MTU
vmk0 Management IPv4 192.168.0.11 255.255.255.0 192.168.0.255 MA:CA:DD:RE:SS:01 1500
vmk1 vmotion IPv4 10.0.0.11 255.255.255.0 10.0.0.255 00:50:56:61:31:6f 1500

[Destination]

Interface Port Group/DVPort IP Family IP Address Netmask Broadcast MAC Address MTU
vmk0 Management IPv4 192.168.0.12 255.255.255.0 192.168.0.255 MA:CA:DD:RE:SS:02 1500
vmk1 vmotion IPv4 10.0.0.12 255.255.255.0 10.0.0.255 00:50:56:6c:b2:86 1500

It was confirmed by the customer that the vMotion traffic was indeed tagged on vmk1. We can see that a completely separate subnet is being used and that only one NIC, on a completely separate vSwitch, was configured for vMotion.

We tested IP connectivity using vmkping in both directions on the hosts during a remote session, and the pings succeeded without any issues. The "-I" parameter forces the ping out of a specific vmkernel interface instead of the default vmk0 interface.

vmkping -I vmk1 10.0.0.12 on esx-esx1.example.com

vmkping -I vmk1 10.0.0.11 on esx-esx2.example.com

The NIC used for vMotion did show somewhat outdated driver and firmware versions that posed a risk of a known corruption issue, but nothing that should affect vMotion in such a manner.

MTU size is also configured to 1500 across the whole environment, so we should not be experiencing any packet fragmentation issues (which can also be verified by adding -s with a packet size to the vmkping commands above; see the sketch below).
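A quick sketch of that verification: 1472 bytes of ICMP payload plus 28 bytes of headers exactly fills a 1500 byte MTU, and -d sets the don't-fragment bit, so the second ping should fail if the path really is limited to 1500.

vmkping -I vmk1 -d -s 1472 10.0.0.12   # must succeed at MTU 1500
vmkping -I vmk1 -d -s 1473 10.0.0.12   # one byte too many, should fail with DF set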

7. If the configuration looks good, it is back into the logs. I usually start with either the vmware.log or the hostd.log.

From the source (emigrating) host logs we can get an operation ID to follow; a very similar operation ID will be used on the destination host as well, where only the last few hex characters differ. On the source we can see opID=B41380B2-0000017F-c7-81-50, which means we can simply grep for B41380B2-0000017F-c7-81 in both the source and destination hostd.log to see if any errors directly jump out at us.

[Source]

[528]-> grep B41380B2-0000017F-c7 var/log/hostd.all

2014-05-28T10:42:22.430Z [FFDADB70 verbose 'Default' opID=B41380B2-0000017F-c7-61 user=vpxuser] AdapterServer: target='vim.VirtualMachine:8', method='refreshStorageInfo'
2014-05-28T10:42:22.460Z [55A81B70 verbose 'Hbrsvc' opID=B41380B2-0000017F-c7-61 user=vpxuser] Replicator: VmFileProviderCallback VM (id=8)
2014-05-28T10:43:40.471Z [55CD3B70 verbose 'Default' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] AdapterServer: target='vim.host.VMotionManager:ha-vmotionmgr', method='prepareSourceEx'
2014-05-28T10:43:40.472Z [55580B70 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] PrepareSourceEx [1401273742842651], VM = '8'
2014-05-28T10:43:40.472Z [55580B70 info 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionEntry: migrateType = 1
2014-05-28T10:43:40.472Z [55580B70 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlCreate: Created 1401273820.33
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare (1401273742842651): Sending 'to' srcIp=10.0.0.8 dstIp=10.0.0.9, type=1, encrypted=false, remoteThumbprint=2D:3B:F9:70:F6:E5:78:AD:95:C0:FC:77:2A:9A:9F:26:9F:4A:7E:49
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare: srcLoggingIp=
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare: dstLoggingIp=
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare: srcMgmtIp=192.168.0.139
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare: dstMgmtIp=192.168.0.141
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare: dstVmDirPath=/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare: dstVmFileName=ExampleVM.vmx
2014-05-28T10:43:40.473Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] SetVmdbVirtualDeviceChange: deviceChange # 0
2014-05-28T10:43:40.473Z [55CD3B70 verbose 'Default' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] AdapterServer: target='vmodl.query.PropertyCollector:ha-property-collector', method='createFilter'
2014-05-28T10:43:40.474Z [55580B70 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Migration changed state from NONE to BEGIN
2014-05-28T10:43:40.474Z [55580B70 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Completed scheduling PrepareSourceEx [1401273742842651].
2014-05-28T10:47:40.475Z [FFD4A5C0 warning 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] TimeoutCb: Expired
2014-05-28T10:47:40.475Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Migration changed state from BEGIN to DONE
2014-05-28T10:47:40.475Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Finish called
2014-05-28T10:47:40.476Z [FFD4A5C0 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] State Transition (VM_STATE_EMIGRATING -> VM_STATE_ON)
2014-05-28T10:47:40.476Z [FFD4A5C0 info 'Hostsvc' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Decremented SIOC Injector Flag2
2014-05-28T10:47:40.476Z [FFD4A5C0 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotion cleanup completed
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlEnd: Wrote active=0
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlEnd: Finished journal
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] [JrnlFilePersistenceProvider::DeleteJournal] Deleting /var/lib/vmware/hostd/journal/1401273820.33.
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlDelete: Deleted journal
2014-05-28T11:13:40.497Z [55580B70 verbose 'Default' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] AdapterServer: target='vmodl.query.PropertyCollector.Filter:session[857fc2d4-0e21-ade2-16ab-e14896e63b14]520e20cd-da6e-c9eb-5854-088dfc7f4913', method='destroy'
2014-05-28T11:13:41.180Z [FFD4A5C0 verbose 'Default' opID=B41380B2-0000017F-c7-81-1c user=vpxuser] AdapterServer: target='vim.host.VMotionManager:ha-vmotionmgr', method='completeSource'
2014-05-28T11:13:41.180Z [FFD4A5C0 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-1c user=vpxuser] CompleteSource [1401273742842651]

Not much to see here: no errors, just the odd behaviour of the VM transitioning back to state ON again. But we did get the vMotion ID (1401273742842651) for the VM, and we will use it to find out more later on.

[Destination]

[513]-> grep B41380B2-0000017F-c7-81 var/log/hostd.all

2014-05-28T11:13:41.328Z [53E81B70 verbose 'Default' opID=B41380B2-0000017F-c7-81-39 user=vpxuser] AdapterServer: target='vim.host.VMotionManager:ha-vmotionmgr', method='completeDestination'
2014-05-28T11:13:41.329Z [528C2B70 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-39 user=vpxuser] CompleteDestination [1401273742842651]
2014-05-28T11:13:41.329Z [528C2B70 warning 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-39 user=vpxuser] CompleteDestination: Map entry not found

Basically nothing to see here apart from the aftermath of the timeout. Yet at least we can tell that something happened and that the hosts did communicate with each other, further ruling out networking related issues.

So let’s look for that vMotion ID now in both hostd.log files.

[Source]

[529]-> grep 1401273742842651 var/log/hostd.all

2014-05-28T10:43:40.472Z [55580B70 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] PrepareSourceEx [1401273742842651], VM = '8'
2014-05-28T10:43:40.472Z [55580B70 info 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionEntry: migrateType = 1
2014-05-28T10:43:40.472Z [55580B70 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlCreate: Created 1401273820.33
2014-05-28T10:43:40.472Z [55580B70 info 'Vmsvc.vm:/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] VMotionPrepare (1401273742842651): Sending 'to' srcIp=10.0.0.8 dstIp=10.0.0.9, type=1, encrypted=false, remoteThumbprint=2D:3B:F9:70:F6:E5:78:AD:95:C0:FC:77:2A:9A:9F:26:9F:4A:7E:49
2014-05-28T10:43:40.474Z [55580B70 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Migration changed state from NONE to BEGIN
2014-05-28T10:43:40.474Z [55580B70 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Completed scheduling PrepareSourceEx [1401273742842651].
2014-05-28T10:47:40.475Z [FFD4A5C0 warning 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] TimeoutCb: Expired
2014-05-28T10:47:40.475Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Migration changed state from BEGIN to DONE
2014-05-28T10:47:40.475Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] Finish called
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlEnd: Wrote active=0
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlEnd: Finished journal
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] [JrnlFilePersistenceProvider::DeleteJournal] Deleting /var/lib/vmware/hostd/journal/1401273820.33.
2014-05-28T10:47:40.476Z [FFD4A5C0 verbose 'Vcsvc.VMotionSrc (1401273742842651)' opID=B41380B2-0000017F-c7-81-50 user=vpxuser] JrnlDelete: Deleted journal
2014-05-28T11:13:41.180Z [FFD4A5C0 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-1c user=vpxuser] CompleteSource [1401273742842651]

We do see the same timeout again.

[Destination]

[512]-> grep 1401273742842651 var/log/hostd.all

2014-05-28T11:13:41.329Z [528C2B70 info 'Vcsvc.VMotion' opID=B41380B2-0000017F-c7-81-39 user=vpxuser] CompleteDestination [1401273742842651]

Nothing usable again.

So let's have a look at the vmware.log next.

[547]-> grep 1401273742842651 vmware.log
2014-05-28T10:43:40.484Z| vmx| I120: Received migrate 'to' request for mid id 1401273742842651, src ip <10.0.0.8>, dst ip <10.0.0.9>(invalidate source config).
2014-05-28T10:43:40.494Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.0.0.8> dstIp=<10.0.0.9> mid=1401273742842651 uuid=31323436-3630-5a43-3234-313130484c43 priority=high
2014-05-28T10:43:40.494Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to emigrating for migrate 'to' mid 1401273742842651
2014-05-28T10:54:27.822Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to failure for migrate 'to' mid 1401273742842651

We do see a failure in here, so manually jumping into that log might give us some more information. (Note that mid=4fa7377acdb1b in the GetParam line below is simply the same migration ID, 1401273742842651, in hex.)

[548]-> less vmware.log

2014-05-28T10:43:40.473Z| vmx| I120: VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##1_d1/op/=to
2014-05-28T10:43:40.483Z| vmx| I120: Could not identify IP address family of in/srcLogIp:
2014-05-28T10:43:40.483Z| vmx| I120: Could not identify IP address family of in/dstLogIp:
2014-05-28T10:43:40.484Z| vmx| I120: VMXVmdbVmVmxMigrateGetParam: type: 1 srcIp=<10.0.0.8> dstIp=<10.0.0.9> mid=4fa7377acdb1b uuid=31323436-3630-5a43-3234-313130484c43 priority=high checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no latencyAware=no diskOpFile=
2014-05-28T10:43:40.484Z| vmx| I120: VMXVmdbVmVmxMigrateGetParam: type 1 unsharedSwap 0 memMinToTransfer 0 cpuMinToTransfer 0 numDisks 0 numStreamIps 1
2014-05-28T10:43:40.484Z| vmx| I120: Received migrate 'to' request for mid id 1401273742842651, src ip <10.0.0.8>, dst ip <10.0.0.9>(invalidate source config).
2014-05-28T10:43:40.494Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.0.0.8> dstIp=<10.0.0.9> mid=1401273742842651 uuid=31323436-3630-5a43-3234-313130484c43 priority=high
2014-05-28T10:43:40.494Z| vmx| I120: MigrateSetState: Transitioning from state 0 to 1.
2014-05-28T10:43:40.494Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to emigrating for migrate 'to' mid 1401273742842651
2014-05-28T10:54:27.822Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to failure for migrate 'to' mid 1401273742842651
2014-05-28T10:54:27.832Z| vmx| I120: MigrateSetStateFinished: type=1 new state=5
2014-05-28T10:54:27.832Z| vmx| I120: MigrateSetState: Transitioning from state 1 to 5.
2014-05-28T10:54:27.832Z| vmx| I120: Migrate_SetFailureMsgList: switching to new log file.
2014-05-28T10:54:27.832Z| vmx| I120: Migrate_SetFailureMsgList: Now in new log file.
2014-05-28T11:05:14.674Z| vmx| I120: FileIO_AtomicTempFile: Failed to unlink temporary file, errno: 5
2014-05-28T11:05:14.681Z| vmx| I120: FileIO_AtomicTempFile: Failed to create temporary file, The file already exists (3). errno: 17
2014-05-28T11:05:14.702Z| vmx| I120: Msg_Reset:
2014-05-28T11:05:14.702Z| vmx| I120: [msg.configdb.failcreatetemp] An error occurred while creating temporary file for /vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx: The file already exists
2014-05-28T11:05:14.702Z| vmx| I120: ----------------------------------------
2014-05-28T11:05:14.702Z| vmx| W110: Migrate: Failed to write out config file.
2014-05-28T11:05:14.702Z| vmx| I120: [msg.migrate.expired] Timed out waiting for migration start request.
2014-05-28T11:05:14.702Z| vmx| I120: Migrate: cleaning up migration state.
2014-05-28T11:05:14.702Z| vmx| I120: MigrateSetState: Transitioning from state 5 to 0.
2014-05-28T11:05:14.702Z| vmx| I120: Migrate: Final status reported to VMDB.

Interesting: there is a roughly 10 minute jump, and then we can see some file level failures.

Let’s see what we can find by poking around in the vpxa.log on the source for that time frame before the 10 minute jump happened.

[549]-> grep 2014-05-28T10:43:40 var/log/vpxa.all

2014-05-28T10:43:40.469Z [FFF6AB70 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.reason, task-20835
2014-05-28T10:43:40.469Z [FFF6AB70 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ADD: recentTask["task-20835"], vpxaTaskManager
2014-05-28T10:43:40.469Z [FF99E1A0 verbose 'PropertyProvider'] RecordOp ASSIGN: info.startTime, task-20835
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.state, task-20835
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.cancelable, task-20835
2014-05-28T10:43:40.470Z [FF99E1A0 info 'commonvpxLro' opID=B41380B2-0000017F-c7-81-50] [VpxLRO] -- BEGIN task-20835 -- -- vim.host.VMotionManager.prepareSourceEx -- f8198450-c858-e8c6-42fe-4bd172e8bcb1
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaDatastoreContext' opID=B41380B2-0000017F-c7-81-50] [VpxaDatastoreContext] Resolved URL ds:///vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx to localPath /vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaDatastoreContext' opID=B41380B2-0000017F-c7-81-50] [VpxaDatastoreContext] Resolved URL ds:///vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ to localPath /vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [MIGRATE] (1401273742842651) PrepareSourceEx start
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [MIGRATE] (1401273742842651) srcIp 10.0.0.8
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [MIGRATE] (1401273742842651) dstIp 10.0.0.9
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [MIGRATE] (1401273742842651) VM moId vm-9
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [MIGRATE] (1401273742842651) srcLoggingIp
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [MIGRATE] (1401273742842651) dstLoggingIp
2014-05-28T10:43:40.470Z [FF99E1A0 info 'vpxavpxaMoVMotion' opID=B41380B2-0000017F-c7-81-50] [VpxaMoVMotion::PrepareSourceEx] invoke IsFtFailingOverVm on VM [/vmfs/volumes/DSUUID1-many-hex-chars/ExampleVM/ExampleVM.vmx].
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'halservices' opID=B41380B2-0000017F-c7-81-50] [VpxaHalVmLocker] VM 8 locked successfully.
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] [TaskInfoPublisher::IncListeners] host:[] Return after successful IncListeners (listeners = 5, _connectionGen = 152)
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] [TaskInfoListener::TaskInfoListener] constructed. Connection number = 152
2014-05-28T10:43:40.474Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] [TaskInfoPublisher::IncListeners] host:[] Return after successful IncListeners (listeners = 6, _connectionGen = 152)
2014-05-28T10:43:40.474Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] [TaskInfoListener::TaskInfoListener] copy-constructed. Connection number = 152
2014-05-28T10:43:40.474Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] TaskInfoChannel created for haTask--vim.host.VMotionManager.prepareSourceEx-335288733
2014-05-28T10:43:40.474Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] [TaskInfoPublisher::AddChannel] host:[] Channel (haTask--vim.host.VMotionManager.prepareSourceEx-335288733) added for task: haTask--vim.host.VMotionManager.prepareSourceEx-335288733
2014-05-28T10:43:40.474Z [FF99E1A0 verbose 'commonvpxTaskInfo' opID=B41380B2-0000017F-c7-81-50] [TaskInfoChannel::GetTaskInfo] task: haTask--vim.host.VMotionManager.prepareSourceEx-335288733 setup for async notification
2014-05-28T10:43:40.474Z [FFFEEB70 verbose 'VpxaHalCnxHostagent' opID=WFU-87c25eb7] [WaitForUpdatesDone] Received callback
2014-05-28T10:43:40.474Z [FFFEEB70 verbose 'VpxaHalCnxHostagent' opID=WFU-87c25eb7] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 194503 to 194504 (at 194503)
2014-05-28T10:43:40.474Z [FFFEEB70 verbose 'commonvpxTaskInfo' opID=WFU-87c25eb7] [TaskInfoChannel::SetTaskInfo] task: haTask--vim.host.VMotionManager.prepareSourceEx-335288733 for task: haTask--vim.host.VMotionManager.prepareSourceEx-335288733

Nothing too major in here that would help our cause. We did get the task ID in the first line though (task-20835), so let's have a quick look for that as well before we move on to the last step.

[571]-> grep task-20835 var/log/vpxa.all

2014-05-28T10:43:40.469Z [FFF6AB70 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.reason, task-20835
2014-05-28T10:43:40.469Z [FFF6AB70 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ADD: recentTask["task-20835"], vpxaTaskManager
2014-05-28T10:43:40.469Z [FF99E1A0 verbose 'PropertyProvider'] RecordOp ASSIGN: info.startTime, task-20835
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.state, task-20835
2014-05-28T10:43:40.470Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.cancelable, task-20835
2014-05-28T10:43:40.470Z [FF99E1A0 info 'commonvpxLro' opID=B41380B2-0000017F-c7-81-50] [VpxLRO] -- BEGIN task-20835 -- -- vim.host.VMotionManager.prepareSourceEx -- f8198450-c858-e8c6-42fe-4bd172e8bcb1
2014-05-28T11:13:40.496Z [FFF6AB70 verbose 'commonvpxLro'] [VpxLRO] Dispatching Error Handler Functor for haTask--vim.host.VMotionManager.prepareSourceEx-335288733 which aborted due to from task-20835
2014-05-28T11:13:40.507Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.state, task-20835
2014-05-28T11:13:40.507Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.cancelable, task-20835
2014-05-28T11:13:40.507Z [FF99E1A0 verbose 'PropertyProvider' opID=B41380B2-0000017F-c7-81-50] RecordOp ASSIGN: info.error, task-20835
2014-05-28T11:13:40.507Z [FF99E1A0 info 'commonvpxLro' opID=B41380B2-0000017F-c7-81-50] [VpxLRO] -- FINISH task-20835 -- -- vim.host.VMotionManager.prepareSourceEx --
2014-05-28T11:13:40.507Z [FF99E1A0 info 'Default' opID=B41380B2-0000017F-c7-81-50] [VpxLRO] -- ERROR task-20835 -- -- vim.host.VMotionManager.prepareSourceEx: vim.fault.Timedout:
2014-05-28T11:13:40.508Z [FF99E1A0 verbose 'PropertyProvider'] RecordOp ASSIGN: info.completeTime, task-20835
2014-05-28T11:23:40.517Z [FFF28B70 verbose 'PropertyProvider'] RecordOp REMOVE: recentTask["task-20835"], vpxaTaskManager

Another timeout confirmation, but that's about it.

Let's recap what we have found so far. We do not seem to be facing any networking related issues, the configuration is spot on, and yet we get intermittent timeouts during periods of sustained load. I also had the customer run a continuous vmkernel ping just to see whether we drop any packets when the link is loaded with lots of vMotions, but that was not the case either.

Our last resort is the vmkernel.log, and there are indeed some interesting findings in there.

[Source]

[513]-> grep "Not free" var/log/vmkernel.all | grep ExampleDataStore | grep 2014-05-28T1[01] | wc -l
202

Interesting, file locking issues! Let’s see if they are relevant.

[514]-> grep "Not free" var/log/vmkernel.all | grep ExampleDataStore | grep 2014-05-28T1[01] | head

2014-05-28T10:43:44.500Z cpu8:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:43:51.435Z cpu11:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:43:58.219Z cpu11:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:04.555Z cpu11:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:11.280Z cpu1:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:17.706Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:24.035Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:30.544Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:37.026Z cpu1:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T10:44:43.072Z cpu1:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:

[515]-> grep "Not free" var/log/vmkernel.all | grep ExampleDataStore | grep 2014-05-28T1[01] | tail

2014-05-28T11:04:28.608Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:04:35.124Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:04:41.370Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:04:48.329Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:04:55.157Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:05:01.221Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:05:08.068Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:05:14.674Z cpu2:65452)DLX: 4262: vol 'ExampleDataStore', lock at 607275008: [Req mode: 1] Not free:
2014-05-28T11:14:03.311Z cpu7:34471)DLX: 4262: vol 'ExampleDataStore', lock at 226365440: [Req mode: 1] Not free:
2014-05-28T11:24:43.072Z cpu9:34471)DLX: 4262: vol 'ExampleDataStore', lock at 226365440: [Req mode: 1] Not free:

Spot on, they are: those timestamps line up with what we saw in the vmware.log (the last "Not free" at 11:05:14.674 matches the FileIO_AtomicTempFile failure exactly), and they cover our timeout frame nearly perfectly as well. Furthermore, we know that the VM that failed actually resides on "ExampleDataStore", and we can verify this from the logs too.

[514]-> grep DSUUID1-many-hex-chars commands/localcli_*

commands/localcli_storage-filesystem-list.txt:/vmfs/volumes/DSUUID1-many-hex-chars ExampleDataStore DSUUID1-many-hex-chars true VMFS-5

What can we do to determine who holds the lock? The vmkernel.log can tell us. Every time a host tries to access a file on which another host holds an exclusive lock, the status of that lock gets posted to the vmkernel.log. The output looks pretty similar to the "vmkfstools -D file" output shown below.

[root@whiteboxfb:/vmfs/volumes/5399bfd0-4846d1f2-8c06-902b34a05cf0/ad01] vmkfstools -D ad01.vmx
Lock [type 10c00001 offset 102289408 v 834, hb offset 3321856
gen 639, mode 1, owner 53da6538-32d1e07e-8fda-902b34a05cf0 mtime 272111
num 0 gblnum 0 gblgen 0 gblbrk 0]

We can see the mode of the lock (1 in this example, which means it is an exclusive lock) and also the owner of the lock. The important part about the owner is that it shows us the MAC address of the host holding the lock, as this is the last part of the owner string, so it can be extracted from the vmkernel.log with a bit of grep magic. The MAC addresses for this example can be found further up in the vmknic status overview as well.

[534]-> grep mode var/log/vmkernel.all | grep -A 1 ExampleDataStore | grep -A 1 2014-05-28T1[01] | grep owner | awk '{print $6}' | awk -F- '{print $NF}' | sort | uniq -c

404 MACADDRESS02

Interesting: despite not receiving any vMotion traffic at all, the destination host already held the lock on the file. What was even more interesting is that the destination host logs showed the exact opposite. Both hosts were competing over the log file. So this turns out to be a storage issue and not a networking issue after all.

The customer's storage array and HBAs were on the HCL. They also did not see any contention issues on the array or the fabric for that matter, and yet vMotion would fail (and, as we found out later, LUNs had also been disappearing and reappearing).

In the end it turned out to be a faulty driver; once it was replaced, the issue was instantly gone.
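If you want to compare driver and firmware versions against the vendor's release notes before replacing anything, the standard esxcli/vmkload_mod tooling can pull them from a live host; a sketch, where the module and VIB names are only examples:

esxcli storage core adapter list            # which HBA uses which driver module
vmkload_mod -s lpfc820 | grep -i version    # details of the loaded module (module name is an example)
esxcli software vib list | grep -i lpfc     # the installed driver VIB and its version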
