Description-
TKG 2.3
The customer had vCenter downtime for security patching. After the patching was completed, clusters could no longer be created or deleted. Looking at the management cluster, it appeared to be in an error state. New cluster creation was stuck in the CreateStalled state, and clusters for which deletion had been triggered were stuck in the deleting state.
The management cluster did not recover automatically after vCenter came back online. It remained in this state for more than 2 days, until the workaround was applied.
Workaround applied
kubectl rollout restart deployment capv-controller-manager -n capv-system
Running this command appears to have resolved the issue: the stuck clusters were cleaned up and the management cluster returned to a green state.
vCenter may be patched frequently to address security issues, so manually running the above workaround every time is not feasible. The expectation is that once vCenter is back online, the system should recover automatically.
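For anyone repeating the workaround, the restart can be wrapped so that it also waits for the new controller pod to become ready. This is only a sketch of the manual step, not a fix for the underlying issue. The function name restart_capv and the 300-second timeout are assumptions made for illustration; the deployment name and namespace are the ones from the command above.

```shell
#!/bin/sh
# Sketch of the manual workaround: restart the CAPV controller and wait
# until the rollout finishes. restart_capv is a hypothetical helper name.

NAMESPACE="capv-system"
DEPLOYMENT="capv-controller-manager"
TIMEOUT="300s"   # assumed value; adjust for the environment

restart_capv() {
    # Trigger a rolling restart of the controller deployment.
    kubectl rollout restart deployment "$DEPLOYMENT" -n "$NAMESPACE" || return 1
    # Block until the restarted deployment reports ready, or time out.
    kubectl rollout status deployment "$DEPLOYMENT" -n "$NAMESPACE" --timeout="$TIMEOUT"
}

# Invoke it with the kubectl context set to the management cluster:
#   restart_capv
```

After the rollout completes, the stuck clusters still need to be checked by hand (for example with kubectl get clusters -A on the management cluster) to confirm that creation and deletion have resumed.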
Expectation:
1. Provide an automatic fix for this issue. The system should recover automatically once vCenter comes back online, since vCenter patching will be a common activity.
2. Provide a reference to documentation on best practices to follow for TKG when vCenter, or vSphere as a whole, is going to have downtime. (We could not find any related known issues documented in our release notes.)
Comments