While attempting to update the NTP server configuration in a VCF 4.3 deployment where only the management domain had been deployed, I ran into an error in SDDC Manager. The error message was very generic and didn’t help much with troubleshooting:
Message: “Error occurred while fetching details of entities”
Remediation Message: “Please make sure that commonsvcs service is up and running”
After logging into SDDC Manager via SSH, I checked the status of the services and confirmed that the commonsvcs service was online.
Next, I went looking in the log files for more information. In the /var/log/vmware/vcf/operationsmanager/operationsmanager.log file I found the task to update the NTP settings and saw that it was failing at the validation steps. The error message again wasn’t very helpful:
“Error validating Dns Server.”
I was trying to update NTP, not DNS. I even checked I hadn’t submitted the wrong API call just in case.
After a day of troubleshooting I found that I had two underlying issues in the environment causing the failure:
- During the management domain bring-up I had a failure due to a password issue and had to restart the process in Cloud Builder. When this happened, it did not create the service user accounts for the ESXi hosts in SDDC Manager, even though they existed on the ESXi hosts. This meant that when SDDC Manager tried to connect to the ESXi hosts to validate that the new NTP servers were reachable, it could not find any credentials to log in with. This is a known issue in VCF 4.1 and above, and the resolution is documented in KB83837.
- We had started trying to commission a host for our workload domain and had hit an issue with a firewall port that had been missed. This caused the task to fail in SDDC Manager; however, the ESXi host was still listed in the database in an error state. I discovered this when I noticed in the operationsmanager.log file that 5 entities of type ESXi were being checked (not shown in the screenshot above), even though we only had 4 ESXi hosts in the management domain. I then queried the SDDC Manager database using the command
psql -h localhost -U postgres -d platform -c "select * from host"
This returned five host entries, with one of the fqdn values matching the workload domain host and showing a status of Error. This host entry was also missing a number of values in the database table, so it seemed that when SDDC Manager tried to connect to the host it couldn’t gather all of the information it needed to make the connection.
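To spot a stale entry like this more quickly, the query output can be checked with a short script. The following is a minimal Python sketch, not an official tool. It assumes the query has been narrowed to two columns and run with psql's unaligned, tuples-only output (for example `psql -h localhost -U postgres -d platform -A -t -c "select fqdn, status from host"`), that the column names `fqdn` and `status` match what the table shows, and that a bad entry either has a status of Error (in any letter case) or an empty field. The function name, the host names and the ACTIVE status value in the sample are made up for illustration.

```python
def find_problem_hosts(psql_output, expected_count=None):
    """Parse 'fqdn|status' lines and return (problem_hosts, count_mismatch).

    problem_hosts  : list of (fqdn, status) tuples that look unhealthy
    count_mismatch : True when expected_count is given and differs from
                     the number of rows returned by the query
    """
    rows = []
    for line in psql_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the psql output
        fields = [f.strip() for f in line.split("|")]
        fqdn = fields[0] if len(fields) > 0 else ""
        status = fields[1] if len(fields) > 1 else ""
        rows.append((fqdn, status))

    # Assumption: an unhealthy row has status "Error" or a missing value.
    problems = [
        (fqdn, status)
        for fqdn, status in rows
        if not fqdn or not status or status.lower() == "error"
    ]
    mismatch = expected_count is not None and len(rows) != expected_count
    return problems, mismatch


if __name__ == "__main__":
    # Invented sample data: four management hosts plus one failed commission.
    sample = (
        "esx01.example.local|ACTIVE\n"
        "esx02.example.local|ACTIVE\n"
        "esx03.example.local|ACTIVE\n"
        "esx04.example.local|ACTIVE\n"
        "wld-esx01.example.local|ERROR\n"
    )
    problems, mismatch = find_problem_hosts(sample, expected_count=4)
    print("Problem hosts:", problems)
    print("Host count differs from expected:", mismatch)
```

Passing the number of hosts you expect (4 in my case) makes the extra database entry stand out even if its status value is not the one assumed here.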
The resolution was to follow the steps in KB83837 to recreate the ESXi service accounts first. Then, once the firewall issue was resolved, I was able to complete the host commission task in SDDC Manager. After both of these were done, the NTP server update completed successfully. Simply resolving the ESXi service accounts issue was not sufficient.
As a side note, when I validated the NTP update spec in the SDDC Manager Developer Center prior to resolving the issue, the validation also failed with a generic error:
“Execution status of the validation FAILED” with the resultStatus listed as “unknown”.
Now that I have a successful validation to compare it against, I can see that in the failed validation the array of ValidationCheck objects only contained a few results, such as the NTP server addresses. For a successful validation, that array should contain an entry for each component within VCF that will be updated, e.g. ESXi hosts, vCenters, SDDC Manager and NSX-T Managers.
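To make that comparison less manual, the validation response can be flattened and checked for the components it covers. The sketch below is a rough Python illustration under several assumptions. It expects the response as a dictionary with a `validationChecks` array whose entries carry `description`, `resultStatus` and optionally `nestedValidationChecks`. Those field names are my understanding of the Validation object, so verify them against your own response. It also assumes that each check's description mentions the component it covers, which may not hold in every release. The function names, host names and descriptions in the sample are invented.

```python
def flatten_checks(checks):
    """Return a flat list of (description, resultStatus), nested checks included."""
    flat = []
    for check in checks or []:
        flat.append((check.get("description", ""), check.get("resultStatus", "")))
        # Assumption: nested checks live under "nestedValidationChecks".
        flat.extend(flatten_checks(check.get("nestedValidationChecks")))
    return flat


def missing_components(validation, expected_names):
    """Return the expected component names that no check description mentions."""
    descriptions = [
        desc.lower()
        for desc, _ in flatten_checks(validation.get("validationChecks"))
    ]
    return [
        name
        for name in expected_names
        if not any(name.lower() in desc for desc in descriptions)
    ]


if __name__ == "__main__":
    # Invented example of a failed validation that only covers the NTP server.
    failed_validation = {
        "executionStatus": "FAILED",
        "resultStatus": "UNKNOWN",
        "validationChecks": [
            {"description": "Check ntp1.example.local", "resultStatus": "SUCCEEDED"},
        ],
    }
    expected = ["ntp1.example.local", "esx01.example.local", "vcenter01.example.local"]
    for description, result in flatten_checks(failed_validation["validationChecks"]):
        print(result, description)
    print("Not covered:", missing_components(failed_validation, expected))
```

A non-empty "Not covered" list would have pointed me at the ESXi hosts much sooner than the generic FAILED status did.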