xref. #3192
xref. #4322
xref. #4812
We've encountered a number of interrelated issues with the graceful Envoy shutdown workflow that I want to capture in one place.
The first issue goes something like the following:
- Something triggers Kubernetes to terminate/restart the shutdown-manager sidecar container in an Envoy pod. We've seen definite evidence of its liveness probes failing due to timeouts, which could be caused by the container not having any resource requests and getting CPU-throttled.
- When the shutdown-manager container is terminated, its preStop hook is triggered, which runs the `contour envoy shutdown` command. This command tells the associated Envoy to start draining all Listeners, including the Listener that is the target of the Envoy container's readiness probe. So now, Envoy is draining HTTP/S connections as well as reporting unready.
- The shutdown-manager's preStop hook eventually returns once enough connections have been closed, and the shutdown-manager container restarts successfully.
- The Envoy container is stuck indefinitely in a "draining" state, failing its readiness probe, but never restarting because there is no liveness probe or anything else to trigger a restart.
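To make the sequence above concrete, here is a toy Go model of it. This is only a sketch and not Contour or Kubernetes code: the `pod` type, its fields, and the `restartShutdownManager` and `kubeletSync` functions are all hypothetical names invented for illustration, and the liveness probe behavior (a probe that fails while Envoy is draining) is an assumption.

```go
package main

import "fmt"

// pod is a hypothetical, simplified model of an Envoy pod and its
// shutdown-manager sidecar. It is not a Contour or Kubernetes type.
type pod struct {
	envoyDraining           bool // Envoy was told to drain its Listeners
	envoyHasLivenessProbe   bool // assumption: such a probe fails while draining
	envoyRestarts           int
	shutdownManagerRestarts int
}

// envoyReady mirrors the readiness probe: the probe's target Listener is
// drained along with everything else, so a draining Envoy reports unready.
func (p *pod) envoyReady() bool { return !p.envoyDraining }

// restartShutdownManager models Kubernetes restarting only the sidecar.
// The sidecar's preStop hook runs the shutdown command, which drains Envoy.
func (p *pod) restartShutdownManager() {
	p.envoyDraining = true
	p.shutdownManagerRestarts++
}

// kubeletSync models one kubelet pass. A failing readiness probe never
// restarts a container; only a failing liveness probe does.
func (p *pod) kubeletSync() {
	if p.envoyDraining && p.envoyHasLivenessProbe {
		p.envoyDraining = false // a fresh Envoy process is no longer draining
		p.envoyRestarts++
	}
}

func main() {
	p := &pod{} // today's setup: no liveness probe on Envoy
	p.restartShutdownManager()
	for i := 0; i < 3; i++ {
		p.kubeletSync()
	}
	fmt.Printf("ready=%v envoyRestarts=%d\n", p.envoyReady(), p.envoyRestarts)
}
```

With no liveness probe in the model, no number of sync passes brings Envoy back to ready, which matches the stuck state described in the last step.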
A few thoughts about this issue:
- adding resource requests to the shutdown-manager (and really, all containers) may reduce how often this issue occurs, since requests should ensure the container(s) get the CPU they need to answer their probes in time and avoid being restarted
- adding a liveness probe to the Envoy container may help avoid it getting permanently stuck by restarting it when it's not healthy, as long as its implementation plays nicely with the whole graceful shutdown workflow
- a restart of the shutdown-manager sidecar should probably not be triggering a drain of Envoy listeners. That should only be triggered by the Envoy container itself terminating/restarting.
- GHSA-mjp8-x484-pm3r needs to be kept in mind as we potentially modify this workflow
Secondly, #4322 describes issues with draining nodes caused by the emptyDir volume that lets the shutdown-manager and envoy containers communicate (via a UDS for the Envoy admin interface, and via a file that signals when the Listener drain is done). I won't repeat that discussion here; I'm just linking it for reference.