
Get hot threads on lagging nodes #78879

Merged
DaveCTurner merged 5 commits into elastic:master from DaveCTurner:2021-10-08-get-hot-threads-on-lag on Oct 14, 2021

Conversation

@DaveCTurner

Member

When the master removes a node from the cluster with reason `lagging`
it's often germane to ask what the node was doing that prevented it from
applying cluster states quickly enough. Frequently it's waiting on slow
IO, or else doing something very expensive on the applier thread. This
commit adds an automatic call to the hot threads API on the lagging node
so that the master can log extra detail about what it's busy doing if
`DEBUG` logging is enabled.
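
The behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `LaggingNodeDiagnostics`, `onNodeRemovedForLagging`, and the `hotThreadsFetcher` function are hypothetical names, and `java.util.logging` at `FINE` stands in for Elasticsearch's `DEBUG` level.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: only ask a lagging node for hot threads when debug logging is on. */
final class LaggingNodeDiagnostics {
    private final Logger logger;
    // Hypothetical stand-in for the hot threads API call: node id -> eventual report text.
    private final Function<String, CompletableFuture<String>> hotThreadsFetcher;

    LaggingNodeDiagnostics(Logger logger, Function<String, CompletableFuture<String>> hotThreadsFetcher) {
        this.logger = logger;
        this.hotThreadsFetcher = hotThreadsFetcher;
    }

    /** Called when a node is removed for lagging; returns true if a request was sent. */
    boolean onNodeRemovedForLagging(String nodeId) {
        if (!logger.isLoggable(Level.FINE)) {
            // Nobody would read the result, so skip the remote call entirely.
            return false;
        }
        hotThreadsFetcher.apply(nodeId).whenComplete((report, failure) -> {
            if (failure != null) {
                logger.log(Level.FINE, "failed to get hot threads from lagging node " + nodeId, failure);
            } else {
                logger.fine("hot threads from lagging node " + nodeId + ":\n" + report);
            }
        });
        return true;
    }
}
```

The point of checking the log level first is that the extra request costs nothing on clusters that have not opted in to the more detailed logging.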

@DaveCTurner DaveCTurner added the >enhancement, :Distributed/Cluster Coordination, v8.0.0, and v7.16.0 labels Oct 8, 2021
@DaveCTurner DaveCTurner force-pushed the 2021-10-08-get-hot-threads-on-lag branch from 182935f to 0099751 Compare October 9, 2021 09:42
@DaveCTurner DaveCTurner marked this pull request as ready for review October 11, 2021 08:36
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Oct 11, 2021
@elasticmachine

Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen henningandersen left a comment

Contributor

LGTM.

threadContext.markAsSystemContext();
client.execute(
    NodesHotThreadsAction.INSTANCE,
    new NodesHotThreadsRequest(node).threads(9999),

Contributor

I find 9999 a bit high, perhaps just 100?

Member Author

I think 100 is too low: if the generic pool is wedged (cf. #68468) then that's already 128 threads. I reduced it to 500. Each thread yields a few kB of data, say ≤10kB, so that should add up to at most 5MB.
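
The size estimate in this comment is simple arithmetic, sketched below. The class and method names are hypothetical, and the 10 kB per-thread figure is the rough upper bound quoted in the comment (taken here as 10,000 bytes), not a measured value.

```java
/** Hypothetical helper for the back-of-envelope bound on hot threads response size. */
final class HotThreadsSizeEstimate {
    /** Upper bound in bytes: at most maxThreads threads, each at most bytesPerThread. */
    static long worstCaseBytes(int maxThreads, long bytesPerThread) {
        return Math.multiplyExact((long) maxThreads, bytesPerThread);
    }

    public static void main(String[] args) {
        long assumedBytesPerThread = 10_000L; // "say <=10kB" from the comment above
        System.out.println("limit 500:  " + worstCaseBytes(500, assumedBytesPerThread) + " bytes");
        System.out.println("limit 9999: " + worstCaseBytes(9999, assumedBytesPerThread) + " bytes");
    }
}
```

Under the same assumption, the original limit of 9999 would allow roughly twenty times as much data as the limit of 500.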

Comment thread server/src/main/java/org/elasticsearch/cluster/coordination/LagDetector.java Outdated
    debugListener.onFailure(new NullPointerException("client"));
} else {
    // we're removing the node from the cluster so we need to keep the connection open for the hot threads request
    transportService.connectToNode(node, new ActionListener<>() {

Contributor

I think I would find it more intuitive to locate this inside LagDetector, letting it have access to transportService and client to allow it to handle the logging on its own.

Member Author

It kinda messes up the tests to put it directly in LagDetector since we'd have to provide a client and a transport service everywhere, but I moved the functionality to its own class in 4c91c95.
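
The trade-off discussed here can be illustrated with a small sketch. It assumes a design in which the detector only knows about a callback interface, so tests can pass a trivial listener instead of a client and a transport service. `LagListener`, `SketchLagDetector`, and their methods are hypothetical names for illustration and are not the classes from commit 4c91c95.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical callback invoked for each node found to be lagging. */
interface LagListener {
    void onLagDetected(String nodeId, long appliedVersion, long expectedVersion);
}

/** Simplified stand-in for a lag detector; it has no client or transport dependency. */
final class SketchLagDetector {
    private final LagListener listener;
    private final Map<String, Long> appliedVersions = new HashMap<>();

    SketchLagDetector(LagListener listener) {
        this.listener = listener;
    }

    /** Records the highest cluster state version a node has reported applying. */
    void setAppliedVersion(String nodeId, long version) {
        appliedVersions.merge(nodeId, version, Math::max);
    }

    /** Notifies the listener about every tracked node that is behind expectedVersion. */
    void checkForLag(long expectedVersion) {
        appliedVersions.forEach((nodeId, applied) -> {
            if (applied < expectedVersion) {
                listener.onLagDetected(nodeId, applied, expectedVersion);
            }
        });
    }
}
```

With this shape, the hot threads logging would live in one `LagListener` implementation that owns the client and transport service, while detector tests can use a lambda that just records which nodes were reported.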

@DaveCTurner DaveCTurner merged commit 75c7cce into elastic:master Oct 14, 2021
@DaveCTurner DaveCTurner deleted the 2021-10-08-get-hot-threads-on-lag branch October 14, 2021 09:30
DaveCTurner added a commit that referenced this pull request Oct 14, 2021

Labels

:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection.)
>enhancement
Team:Distributed (Meta label for distributed team.)
v7.16.0
v8.0.0-beta1


4 participants