Real-life experience performing a rollback from vRA 6.1 to 6.0.1
I had to deal with a deployment issue in vRealize Automation yesterday that included the weirdest corner case I have seen so far.
All deployments for VMs using no vRealize Orchestrator workflow stubs were successful. But let’s be honest, that is pretty boring and doesn’t cover most customers’ use cases: most customers will be using the provided stubs to actually step into the machine life cycle and modify deployment and destruction of VMs using workflows and scripts.
This particular customer hit 3 issues in total that directly affected deployments, plus 1 issue that could have done so if specific workflows had been used. I will walk through all 4 of them. It is important to note that a rollback from vRealize Automation 6.1 to 6.0.1 had been performed, including a restore from snapshot and a database restore. The same had been done for Orchestrator as well.
Issue 1: Workflows were aborting with a permissions error.
This was caused by the rollback of Orchestrator and missing configuration of file-level permissions. Orchestrator cannot simply access the underlying file system; it needs explicit configuration for that. The configuration file to change is called “js-io-rights.conf”, and the whole procedure is outlined in the following blog post on vcoteam.info.
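For orientation, the file uses one rule per line: a `+` or `-` to grant or deny, the rights (`r`, `w`, `x`), then the path. The sketch below is illustrative only and the paths are assumptions; follow the vcoteam.info procedure and your appliance’s defaults for the real contents:

```
-rwx /
+rwx /var/run/vco
+rx /etc/vco/app-server/security
```

The first rule denies everything by default; the later, more specific rules then grant back only what the workflows actually need on the file system.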
Issue 2: The vCenter package and all related workflows were gone after the restore.
This issue happens when an external database is being utilized: all built-in plugin workflows disappear, and so do the associated packages. This is actually called out in the Orchestrator documentation.
If you change the Orchestrator database after configuring and installing the plug-ins, you must click the Reset current version link on the Troubleshooting tab. This operation deletes the install_directory\app-server\conf\plugins\_VSOPluginInstallationVersion.xml file, which contains information about the version of the plug-ins already installed, and forces plug-in reinstallation.
Issue 3: Database time stamps causing provisioning failures.
This has to be the weirdest corner case I have yet come across on the product. When trying to deploy customized blueprints that would call out to Orchestrator to run some workflows, we got back an exception in the workflow referring to a null pointer for the VM id. Investigation of the actual IaaS plugin for vRealize Automation showed an IllegalArgument exception on the Java side.
So what is happening here? The plugin is failing to get the VM we want to use in the workflow, which makes the workflow fail. But why is the plugin failing in the first place? If we look a little closer at the exception, we can see that it seems to have trouble parsing the time stamp. Yet 2:31 am doesn’t look suspicious. If you read the rest, though, you might notice what is throwing it off: the time stamp cannot be translated properly due to a time zone transition. What happened on March 8th 2015 in the US? The daylight saving time change. The US went straight from March 8th 2am to March 8th 3am, so this time stamp is indeed illegal in US time zones. So where is this time stamp coming from?
The answer is the IaaS database.
WHERE VMCreationDate > '2015-03-08 02:00:00'
OR VMDeleteDate > '2015-03-08 02:00:00'
OR LastPowerOffDate > '2015-03-08 02:00:00'
OR LastPowerOnDate > '2015-03-08 02:00:00'
OR RecCreationTime > '2015-03-08 02:00:00'
OR RecUpdateTime > '2015-03-08 02:00:00'
ORDER BY RecUpdateTime;
VirtualMachineName RecUpdateTime VMDNSName
Template1-VCAC 2015-03-08 02:31:02.020 Template1-VCAC
Template2-VCAC_dev 2015-03-08 02:31:02.037 Template2-VCAC_dev
Template3-VCAC 2015-03-08 02:31:02.060 Template3-VCAC
It looks like the RecUpdateTime column is the culprit here for the customer’s templates. Further investigation showed that this time stamp gets updated during inventory data collection. The customer had issued a manual data collection at 1:56 am; during its run, the collection updated the time stamps in the database as it got to those VMs. That wall-clock time is valid at the customer’s local site in GMT, but the servers are located in the US and pull data in the EST time zone, which is what caused the issue. After fixing a vSphere Agent issue on the side to make data collection work again, the actual fix for the customer was simply to run another inventory data collection to get the time stamps updated in the database. This is an absolute corner case, and I guess the customer will start playing the lottery again after hitting this issue.
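The “illegal” time stamp is easy to reproduce outside the product. The sketch below (plain Python, not the plugin’s actual Java code) takes the stored wall-clock time and round-trips it through UTC in the US Eastern zone; a time that falls into the DST “spring forward” gap cannot survive the trip, because 2:31 am simply never happened on that date:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")

# The wall-clock value stored in the IaaS database for the templates.
stamp = datetime(2015, 3, 8, 2, 31, 2)

# Attach the Eastern zone and round-trip through UTC. A wall-clock time
# inside the DST gap gets pushed forward to the first valid instant.
roundtrip = stamp.replace(tzinfo=eastern).astimezone(ZoneInfo("UTC")).astimezone(eastern)

print(stamp, "->", roundtrip.replace(tzinfo=None))
# 2015-03-08 02:31:02 -> 2015-03-08 03:31:02
```

The same wall-clock value is perfectly fine in GMT, where no transition happened that night, which is exactly why the data collection on the customer’s side wrote it without complaint.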
Issue 4: Guest Agent not able to run properly due to permission issues
After having fixed issue number 3 we were pretty confident that deployments would work as they used to before. Unfortunately, we were wrong. During the course of troubleshooting the guest agent had been reinstalled as well, which the customer uses for customization and software installation inside the guest operating system. When examining the failed system, we could see that the script that was supposed to run did not start at all. Eventually the build process would time out and destroy the VM. What was curious is that the script did start once we logged into an interactive session on the guest.
The underlying issue here is a safety mechanism built into the Windows operating system. If a zip file is downloaded from an untrusted source, you can unpack the contents just fine and nothing seems to be wrong. But items that run batch jobs or scripts will be blocked by the operating system, and human intervention is needed. To prevent this you can unblock a single downloaded file using the properties dialog. If the files are stored in a zip archive, though, you need to unblock the zip archive before extracting the contents and working with them.
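Under the hood, the “blocked” flag is an NTFS alternate data stream named Zone.Identifier that Windows attaches to downloaded files; unblocking just removes that stream. A minimal sketch (the function name and behavior outside Windows are my own assumptions, not something the agent does):

```python
import os
import sys


def unblock(path: str) -> bool:
    """Remove the Zone.Identifier alternate data stream that marks a
    downloaded file as blocked. Sketch only; effective on Windows/NTFS."""
    if sys.platform != "win32":
        return False  # the marker stream only exists on NTFS under Windows
    try:
        # The marker lives as an alternate data stream next to the file.
        os.remove(path + ":Zone.Identifier")
        return True
    except FileNotFoundError:
        return False  # the file was never marked as blocked
```

Note that this must be done on the zip archive before extraction; files extracted from a still-blocked archive inherit the marker.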
In this case we uninstalled the agent service, deleted the whole directory, redownloaded the zip file, unblocked it, extracted the contents, and finally reinstalled the agent and made sure it ran properly and downloaded the correct certificate. For good measure we then stopped the service and deleted the certificate. This way the customer will not need to worry about a stale certificate not being replaced when they change expiring certificates on the Manager Service machine.
Even though a rollback was performed and the databases had been restored, we ran into several issues that stopped deployments. So when an upgrade is being done, document every configuration change and every customization being made, etc., to avoid such scenarios if possible.
It would also be a good idea to schedule all data collections to happen after 4 am, so that you avoid weird time stamps in the database.