Get hot threads on lagging nodes #78879
When the master removes a node from the cluster with reason `lagging`, it's often germane to ask what the node was doing that prevented it from applying cluster states quickly enough. Frequently it's waiting on slow IO, or else doing something very expensive on the applier thread. This commit adds an automatic call to the hot threads API on the lagging node so that, if `DEBUG` logging is enabled, the master can log extra detail about what the node was busy doing.
Force-pushed from 182935f to 0099751.
Pinging @elastic/es-distributed (Team:Distributed)
    threadContext.markAsSystemContext();
    client.execute(
        NodesHotThreadsAction.INSTANCE,
        new NodesHotThreadsRequest(node).threads(9999),
I find 9999 a bit high, perhaps just 100?
I think 100 is too low: if the generic pool is wedged (cf #68468) then that's already 128 threads. I reduced it to 500. Each thread yields a few kB of data, say ≤10kB, so that should add up to at most 5MB.
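The size bound above can be checked with a quick back-of-envelope calculation. This is only a sketch: the class and method names are hypothetical, and the 10kB per-thread figure is the rough assumption from the comment above (taken as 10,000 bytes), not a measured value.

```java
// Back-of-envelope bound on the size of a hot threads response.
// Hypothetical helper for illustration; not part of the PR.
final class HotThreadsSizeEstimate {

    /** Upper bound in bytes for a given thread cap and an assumed per-thread size. */
    static long maxResponseBytes(int maxThreads, long assumedBytesPerThread) {
        return Math.multiplyExact((long) maxThreads, assumedBytesPerThread);
    }

    public static void main(String[] args) {
        long perThread = 10_000L; // assumption: at most ~10kB of output per thread

        // Cap of 500 threads: 5,000,000 bytes, i.e. about 5MB.
        System.out.println(maxResponseBytes(500, perThread));

        // Original cap of 9999 threads: 99,990,000 bytes, i.e. about 100MB.
        System.out.println(maxResponseBytes(9999, perThread));
    }
}
```

Under the same assumption, the original cap of 9999 threads would allow a response of roughly 100MB, which is why a lower cap that still covers a fully wedged 128-thread pool seems like a reasonable compromise.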
        debugListener.onFailure(new NullPointerException("client"));
    } else {
        // we're removing the node from the cluster so we need to keep the connection open for the hot threads request
        transportService.connectToNode(node, new ActionListener<>() {
I think I would find it more intuitive to locate this inside LagDetector, letting it have access to transportService and client to allow it to handle the logging on its own.
It kinda messes up the tests to put it directly in LagDetector since we'd have to provide a client and a transport service everywhere, but I moved the functionality to its own class in 4c91c95.
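The trade-off discussed here can be illustrated with a minimal sketch. Every name below is hypothetical and none of it is taken from the PR: the detector depends only on a small listener interface, and a separate class would own the client and transport service. Here that class is replaced by a stand-in that simply records what it would have requested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; these are not the classes from the PR.
interface LagListener {
    void onLagDetected(String nodeId, long appliedVersion, long expectedVersion);
}

// The detector knows only about the listener, so tests can construct it
// without providing a client or a transport service.
final class SimpleLagDetector {
    private final LagListener listener;

    SimpleLagDetector(LagListener listener) {
        this.listener = listener;
    }

    /** Notifies the listener and returns true if the node has not applied the expected version. */
    boolean check(String nodeId, long appliedVersion, long expectedVersion) {
        if (appliedVersion < expectedVersion) {
            listener.onLagDetected(nodeId, appliedVersion, expectedVersion);
            return true;
        }
        return false;
    }
}

// Stand-in for the class that would hold the client and transport service
// and fetch hot threads; this version only records the requests it would make.
final class RecordingHotThreadsLogger implements LagListener {
    final List<String> requests = new ArrayList<>();

    @Override
    public void onLagDetected(String nodeId, long appliedVersion, long expectedVersion) {
        requests.add("hot threads for " + nodeId
            + " (applied " + appliedVersion + ", expected " + expectedVersion + ")");
    }
}
```

With this shape, existing detector tests could pass a no-op listener, and only the tests for the hot threads logging itself would need the heavier dependencies.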