shutdown-manager issues

xref. #3192 
xref. #4322 
xref. #4812 

We've encountered a number of interrelated issues with the graceful Envoy shutdown workflow that I want to capture in one place.

The first issue goes something like the following:
1. _Something_ triggers Kubernetes to terminate/restart the shutdown-manager sidecar container in an Envoy pod. We've seen definite evidence of its liveness probes failing due to timeouts, which could be caused by the container not having any resource requests and getting starved of CPU when the node is under contention.
2. When the shutdown-manager container is terminated, its preStop hook is triggered, which runs the `contour envoy shutdown` command. This command tells the associated Envoy to start draining all Listeners, including the Listener that is the target of the Envoy container's readiness probe. So now, Envoy is draining HTTP/S connections as well as reporting unready.
3. The shutdown-manager's preStop hook eventually returns once enough connections have been closed, and the shutdown-manager container restarts successfully.
4. The Envoy container is stuck indefinitely in a "draining" state, failing its readiness probe, but never restarting because there is no liveness probe or anything else to trigger a restart.

A few thoughts about this issue:
- adding resource requests to the shutdown-manager (and really, all containers) may help minimize the occurrence of this issue since they should ensure the container(s) get the CPU they need to avoid being restarted
- adding a liveness probe to the Envoy container may help avoid it getting permanently stuck by restarting it when it's not healthy, as long as its implementation plays nicely with the whole graceful shutdown workflow
- a restart of the shutdown-manager sidecar should probably not be triggering a drain of Envoy listeners. That should _only_ be triggered by the Envoy container itself terminating/restarting.
- https://github.com/projectcontour/contour/security/advisories/GHSA-mjp8-x484-pm3r needs to be kept in mind as we potentially modify this workflow
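
To make the first thought above concrete, here is a rough sketch of what resource requests on the two containers could look like in the Envoy pod spec. The container names follow the ones used in this issue; the CPU and memory values are placeholders I made up for illustration, not measured or recommended numbers, and would need to be tuned from observed usage.

```yaml
# Fragment of the Envoy pod spec (illustrative only).
# All request values below are assumed placeholders, not recommendations.
containers:
  - name: shutdown-manager
    resources:
      requests:
        cpu: 25m       # assumed value; enough that probe handling is not starved
        memory: 32Mi   # assumed value
  - name: envoy
    resources:
      requests:
        cpu: 100m      # assumed value
        memory: 128Mi  # assumed value
```

Requests only guarantee a share of CPU under contention, so this would reduce how often the liveness probe times out rather than address the underlying problem of a sidecar restart triggering a drain.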

Secondly, #4322 describes issues with draining nodes due to the emptyDir that is used to allow the shutdown-manager and envoy containers to communicate (via UDS for the envoy admin interface, and by file to communicate when the Listener drain is done). I won't repeat that discussion here, just linking it for reference.

shutdown-manager issues #4851

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

shutdown-manager issues #4851

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions