Get hot threads on lagging nodes by DaveCTurner · Pull Request #78879 · elastic/elasticsearch

DaveCTurner · 2021-10-08T15:53:20Z

When the master removes a node from the cluster with reason `lagging`
it's often germane to ask what the node was doing that prevented it from
applying cluster states quickly enough. Frequently it's waiting on slow
IO, or else doing something very expensive on the applier thread. This
commit adds an automatic call to the hot threads API on the lagging node
so that the master can log extra detail about what it's busy doing if
`DEBUG` logging is enabled.
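
The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `LaggingNodeDebugSketch`, its constructor parameters, and the log wording are all hypothetical stand-ins, and the real change goes through the transport client rather than a plain `Function`. It only shows the shape of the idea: skip the remote call unless debug logging is on, and log a failure rather than propagate it.

```java
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for the master-side logic; none of these names are Elasticsearch classes.
final class LaggingNodeDebugSketch {
    private final boolean debugEnabled;                       // assumed: whether DEBUG logging is on
    private final Function<String, String> hotThreadsFetcher; // assumed: node id -> hot threads report
    private final Consumer<String> debugLog;                  // assumed: sink for DEBUG messages

    LaggingNodeDebugSketch(boolean debugEnabled, Function<String, String> hotThreadsFetcher, Consumer<String> debugLog) {
        this.debugEnabled = debugEnabled;
        this.hotThreadsFetcher = hotThreadsFetcher;
        this.debugLog = debugLog;
    }

    // Called when a node has failed to apply the expected cluster state version in time.
    void onNodeLagging(String nodeId, long appliedVersion, long expectedVersion) {
        if (!debugEnabled) {
            return; // no extra request to the lagging node unless someone will read the result
        }
        try {
            String hotThreads = hotThreadsFetcher.apply(nodeId);
            debugLog.accept("hot threads from node [" + nodeId + "] lagging at version [" + appliedVersion
                + "] behind version [" + expectedVersion + "]:\n" + hotThreads);
        } catch (RuntimeException e) {
            // the node is being removed anyway, so a failed diagnostic call is only worth a log line
            debugLog.accept("failed to get hot threads from node [" + nodeId + "]: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        LaggingNodeDebugSketch sketch = new LaggingNodeDebugSketch(
            true,
            nodeId -> "<hot threads report for " + nodeId + ">", // placeholder, not real API output
            System.out::println
        );
        sketch.onNodeLagging("node-1", 41, 42);
    }
}
```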

elasticmachine · 2021-10-11T08:36:10Z

Pinging @elastic/es-distributed (Team:Distributed)

henningandersen

LGTM.

henningandersen · 2021-10-13T06:43:47Z

+                            threadContext.markAsSystemContext();
+                            client.execute(
+                                NodesHotThreadsAction.INSTANCE,
+                                new NodesHotThreadsRequest(node).threads(9999),


I find 9999 a bit high, perhaps just 100?

DaveCTurner · 2021-10-13

I think 100 is too low: if the generic pool is wedged (cf. #68468) then that's already 128 threads. I reduced it to 500. Each thread yields a few kB of data, say ≤10kB, so that should add up to at most 5MB.
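
The size estimate in the reply above is simple enough to check by hand, but a short sketch makes the comparison between the candidate limits explicit. The 10kB-per-thread figure is the rough upper bound assumed in the discussion, not a measured value, and `HotThreadsSizeEstimate` is a name made up for this example.

```java
// Back-of-the-envelope bound on the hot threads response size.
final class HotThreadsSizeEstimate {
    // Assumed upper bound per thread, taken from the discussion ("say <= 10kB"); not measured.
    static final long MAX_BYTES_PER_THREAD = 10_000L;

    // Worst-case response size in bytes if every requested thread hits the assumed bound.
    static long maxResponseBytes(int threads) {
        return threads * MAX_BYTES_PER_THREAD;
    }

    public static void main(String[] args) {
        // 100 was the reviewer's suggestion, 128 the generic pool size mentioned above,
        // 500 the value settled on, and 9999 the original value in the diff.
        for (int threads : new int[] {100, 128, 500, 9999}) {
            System.out.println(threads + " threads -> at most " + maxResponseBytes(threads) + " bytes");
        }
    }
}
```

Under that assumption, 500 threads are bounded by 5,000,000 bytes (5MB), whereas the original 9999 would allow roughly 100MB.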

henningandersen · 2021-10-13T06:54:53Z

+                debugListener.onFailure(new NullPointerException("client"));
+            } else {
+                // we're removing the node from the cluster so we need to keep the connection open for the hot threads request
+                transportService.connectToNode(node, new ActionListener<>() {


I think I would find it more intuitive to locate this inside LagDetector, letting it have access to transportService and client to allow it to handle the logging on its own.

DaveCTurner · 2021-10-13

It kinda messes up the tests to put it directly in LagDetector since we'd have to provide a client and a transport service everywhere, but I moved the functionality to its own class in 4c91c95.
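
The trade-off in this exchange can be sketched roughly as follows. This is an illustration of the general pattern only, not the code in 4c91c95: `LagDetectorSketch`, its nested `Listener` interface, and `checkForLag` are hypothetical names. The detector depends on a small callback, so tests can pass a trivial lambda while production code would supply an implementation that holds the client and transport service.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification: the detector knows nothing about clients or transport services.
final class LagDetectorSketch {

    // Assumed callback; a production implementation would own the client and transport service.
    interface Listener {
        void onLagDetected(String nodeId, long appliedVersion, long expectedVersion);
    }

    private final Listener listener;

    LagDetectorSketch(Listener listener) {
        this.listener = listener;
    }

    // Returns true, and notifies the listener, if the node has not applied the expected version.
    boolean checkForLag(String nodeId, long appliedVersion, long expectedVersion) {
        if (appliedVersion >= expectedVersion) {
            return false;
        }
        listener.onLagDetected(nodeId, appliedVersion, expectedVersion);
        return true;
    }

    public static void main(String[] args) {
        // In a test, a recording lambda is enough; no client or transport service is needed.
        List<String> lagging = new ArrayList<>();
        LagDetectorSketch detector = new LagDetectorSketch((nodeId, applied, expected) -> lagging.add(nodeId));
        detector.checkForLag("node-1", 41, 42); // behind: listener is called
        detector.checkForLag("node-2", 42, 42); // up to date: listener is not called
        System.out.println("lagging nodes: " + lagging);
    }
}
```

With this split, only the code that wires up the real listener needs the heavier dependencies, which is one way to avoid threading them through every test.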

DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.0 labels Oct 8, 2021

DaveCTurner force-pushed the 2021-10-08-get-hot-threads-on-lag branch from 182935f to 0099751 Compare October 9, 2021 09:42

DaveCTurner marked this pull request as ready for review October 11, 2021 08:36

elasticmachine added the Team:Distributed Meta label for distributed team. label Oct 11, 2021

DaveCTurner requested a review from henningandersen October 11, 2021 08:36

DaveCTurner mentioned this pull request Oct 11, 2021

Node left forever due to unnormal lagging. #78240

Closed

henningandersen approved these changes Oct 13, 2021

DaveCTurner added 4 commits October 13, 2021 13:12

Merge branch 'master' into 2021-10-08-get-hot-threads-on-lag

d42d733

Move logging behaviour into (inner class of) LagDetector

4c91c95

Unused

96344d2

Merge branch 'master' into 2021-10-08-get-hot-threads-on-lag

3c0eec3

DaveCTurner merged commit 75c7cce into elastic:master Oct 14, 2021

DaveCTurner deleted the 2021-10-08-get-hot-threads-on-lag branch October 14, 2021 09:30

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

DaveCTurner merged 5 commits into elastic:master from DaveCTurner:2021-10-08-get-hot-threads-on-lag