shutdown-manager issues #4851

Description

@skriss

xref. #3192
xref. #4322
xref. #4812

We've encountered a number of interrelated issues with the graceful Envoy shutdown workflow that I want to capture in one place.

The first issue goes something like the following:

  1. Something triggers Kubernetes to terminate/restart the shutdown-manager sidecar container in an Envoy pod. We've seen definite evidence of its liveness probes failing due to timeouts, which could be caused by the container not having any resource requests and getting CPU-throttled.
  2. When the shutdown-manager container is terminated, its preStop hook is triggered, which runs the contour envoy shutdown command. This command tells the associated Envoy to start draining all Listeners, including the Listener that is the target of the Envoy container's readiness probe. So now, Envoy is draining HTTP/S connections as well as reporting unready.
  3. The shutdown-manager's preStop hook eventually returns once enough connections have been closed, and the shutdown-manager container restarts successfully.
  4. The Envoy container is stuck indefinitely in a "draining" state, failing its readiness probe, but never restarting because there is no liveness probe or anything else to trigger a restart.
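
The sequence above can be sketched as a toy state model. This is only an illustration, not Contour or kubelet code: the class, field, and function names are hypothetical, and the model assumes that a liveness probe (if one existed) would fail whenever Envoy is draining.

```python
from dataclasses import dataclass


@dataclass
class EnvoyPod:
    """Toy model of an Envoy pod with a shutdown-manager sidecar (hypothetical names)."""

    envoy_draining: bool = False
    envoy_restarts: int = 0
    sidecar_restarts: int = 0
    envoy_has_liveness_probe: bool = False  # the issue describes a pod without one

    def envoy_ready(self) -> bool:
        # The readiness Listener is drained along with the others,
        # so a draining Envoy reports unready.
        return not self.envoy_draining

    def restart_sidecar(self) -> None:
        # Steps 1-3: the sidecar's preStop hook runs the shutdown command,
        # which starts the Envoy drain, and then the sidecar comes back up.
        self.envoy_draining = True
        self.sidecar_restarts += 1

    def kubelet_tick(self) -> None:
        # Step 4: a failing readiness probe never restarts a container.
        # Only a failing liveness probe does (assumed here to fail while draining).
        if self.envoy_has_liveness_probe and self.envoy_draining:
            self.envoy_draining = False
            self.envoy_restarts += 1


def simulate(has_liveness_probe: bool, ticks: int = 5) -> EnvoyPod:
    """Restart the sidecar once, then let the kubelet run for a few ticks."""
    pod = EnvoyPod(envoy_has_liveness_probe=has_liveness_probe)
    pod.restart_sidecar()
    for _ in range(ticks):
        pod.kubelet_tick()
    return pod


if __name__ == "__main__":
    for probe in (False, True):
        pod = simulate(has_liveness_probe=probe)
        print(f"liveness probe={probe}: ready={pod.envoy_ready()}, "
              f"envoy restarts={pod.envoy_restarts}")
```

Without a liveness probe, the modelled Envoy stays unready no matter how many ticks pass, which is the stuck state described in step 4.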

A few thoughts about this issue:

  • adding resource requests to the shutdown-manager (and really, all containers) may reduce how often this issue occurs, since requests should ensure the container(s) get the CPU they need to avoid being restarted
  • adding a liveness probe to the Envoy container may help avoid it getting permanently stuck by restarting it when it's not healthy, as long as its implementation plays nicely with the whole graceful shutdown workflow
  • a restart of the shutdown-manager sidecar should probably not trigger a drain of Envoy listeners. A drain should only be triggered when the Envoy container itself is terminating/restarting.
  • GHSA-mjp8-x484-pm3r needs to be kept in mind as we potentially modify this workflow
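
As a rough sketch of the first two thoughts, the helpers below add resource requests and an HTTP liveness probe to container specs expressed as plain dictionaries. The field names (`resources.requests`, `livenessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`) are standard Kubernetes container fields, but every value here is an assumption for illustration: the request sizes, the probe path and port, and the 300-second drain window are placeholders, not recommended or current Contour settings.

```python
import json


def with_resource_requests(container: dict, cpu: str, memory: str) -> dict:
    """Return a copy of a container spec with CPU/memory requests set."""
    updated = dict(container)
    resources = dict(updated.get("resources", {}))
    requests = dict(resources.get("requests", {}))
    requests.update({"cpu": cpu, "memory": memory})
    resources["requests"] = requests
    updated["resources"] = resources
    return updated


def with_http_liveness_probe(container: dict, path: str, port: int,
                             period_seconds: int, failure_threshold: int) -> dict:
    """Return a copy of a container spec with an httpGet liveness probe."""
    updated = dict(container)
    updated["livenessProbe"] = {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }
    return updated


def seconds_until_restart(period_seconds: int, failure_threshold: int) -> int:
    # The kubelet restarts a container after failure_threshold consecutive
    # liveness failures, so this is the approximate time a probe can keep failing.
    return period_seconds * failure_threshold


def probe_tolerates_drain(period_seconds: int, failure_threshold: int,
                          drain_seconds: int) -> bool:
    # If the probe fails while Envoy drains, it must keep failing for longer
    # than a legitimate drain takes, or it would cut a graceful shutdown short.
    return seconds_until_restart(period_seconds, failure_threshold) > drain_seconds


if __name__ == "__main__":
    # Placeholder values, chosen only to make the example concrete.
    shutdown_manager = with_resource_requests(
        {"name": "shutdown-manager"}, cpu="25m", memory="50Mi")
    envoy = with_http_liveness_probe(
        {"name": "envoy"}, path="/ready", port=8002,
        period_seconds=10, failure_threshold=6)
    print(json.dumps([shutdown_manager, envoy], indent=2))
    print(probe_tolerates_drain(10, 6, drain_seconds=300))
```

The `probe_tolerates_drain` check is one way to express "plays nicely with the graceful shutdown workflow": a probe that fails during a drain would need a restart budget longer than the drain itself.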

Second, #4322 describes problems with draining nodes caused by the emptyDir volume that lets the shutdown-manager and envoy containers communicate (via a Unix domain socket for the Envoy admin interface, and via a file that signals when the Listener drain is done). I won't repeat that discussion here; I'm just linking it for reference.
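
For readers unfamiliar with the file-based signal mentioned above, here is a minimal sketch of the pattern: one side creates a marker file in a shared directory, and the other side polls for it with a timeout. It is not Contour's implementation. The function names and the `drain-done` file name are hypothetical, and a temporary directory stands in for the emptyDir volume.

```python
import os
import tempfile
import time


def signal_drain_done(path: str) -> None:
    """Writer side: create an empty marker file in the shared directory."""
    with open(path, "w"):
        pass


def wait_for_drain_done(path: str, timeout: float, interval: float = 0.05) -> bool:
    """Reader side: poll until the marker file exists or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # The temporary directory stands in for the volume shared by both containers.
    with tempfile.TemporaryDirectory() as shared_dir:
        marker = os.path.join(shared_dir, "drain-done")  # hypothetical file name
        print("before signal:", wait_for_drain_done(marker, timeout=0.2))
        signal_drain_done(marker)
        print("after signal:", wait_for_drain_done(marker, timeout=0.2))
```

Because both sides depend on a shared directory, this pattern only works while the two containers can mount the same volume, which is why the choice of volume type matters for the discussion in #4322.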

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Status
Done
