Skip to content

Improve logging in LeaderChecker#78883

Merged
DaveCTurner merged 1 commit into
elastic:masterfrom
DaveCTurner:2021-10-09-leader-checker-better-logging
Oct 15, 2021
Merged

Improve logging in LeaderChecker#78883
DaveCTurner merged 1 commit into
elastic:masterfrom
DaveCTurner:2021-10-09-leader-checker-better-logging

Conversation

@DaveCTurner

Copy link
Copy Markdown
Member

Today if the LeaderChecker decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

[instance-0000000006] master node [...] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
    at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
    at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
    at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
...
    at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

  • We use DiscoveryNode#toString which is far too chatty.
  • There's basically nothing useful in these stack traces.
  • It's easy to miss the RemoteTransportException in the middle.
  • It's also easy to miss the root cause below it.
  • We say the master node failed which sounds very bad but, well, you
    know, that's just like, uh, our opinion. The master node is often
    fine, it just rejected our checks for some reason.
  • Reports of unstable clusters include these messages because they're
    noisy and look important, but don't include the more informative ones
    from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

  • We use DiscoveryNode#descriptionWithoutAttributes throughout.
  • We suppress the full stack traces unless DEBUG logging is on.
  • The LeaderChecker now provides the message to be logged, rather than
    putting all the details into an exception that wraps around the root
    cause.
  • The message describes the root cause rather than just saying that the
    "master node failed"
  • We distinguish timeouts from rejections and report the count of each.
  • The message guides towards checking the master node logs too.

@DaveCTurner DaveCTurner added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.0 labels Oct 9, 2021
Today if the `LeaderChecker` decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

    [instance-0000000006] master node [...] failed, restarting discovery
    org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
        ...
        at java.lang.Thread.run(Thread.java:831) [?:?]
    Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
    Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
        at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
        at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

- We use `DiscoveryNode#toString` which is far too chatty.
- There's basically nothing useful in these stack traces.
- It's easy to miss the `RemoteTransportException` in the middle.
- It's also easy to miss the root cause below it.
- We say the master node failed which sounds very bad but, well, you
  know, that's just like, uh, our opinion. The master node is often
  fine, it just rejected our checks for some reason.
- Reports of unstable clusters include these messages because they're
  noisy and look important, but don't include the more informative ones
  from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

- We use `DiscoveryNode#descriptionWithoutAttributes` throughout.
- We suppress the full stack traces unless `DEBUG` logging is on.
- The `LeaderChecker` now provides the message to be logged, rather than
  putting all the details into an exception that wraps around the root
  cause.
- The message describes the root cause rather than just saying that the
  "master node failed"
- We distinguish timeouts from rejections and report the count of each.
- The message guides towards checking the master node logs too.
@DaveCTurner DaveCTurner force-pushed the 2021-10-09-leader-checker-better-logging branch from babf188 to 29ba8ad Compare October 9, 2021 09:49
@DaveCTurner DaveCTurner marked this pull request as ready for review October 11, 2021 12:07
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Oct 11, 2021
@elasticmachine

Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

private class CheckScheduler implements Releasable {

private final AtomicBoolean isClosed = new AtomicBoolean();
private final AtomicLong failureCountSinceLastSuccess = new AtomicLong();

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually need an AtomicLong here, there's only ever one request in flight so everything is properly synchronised already.

@fcofdez fcofdez self-requested a review October 14, 2021 09:55

@fcofdez fcofdez left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@DaveCTurner DaveCTurner merged commit 83d96e4 into elastic:master Oct 15, 2021
@DaveCTurner DaveCTurner deleted the 2021-10-09-leader-checker-better-logging branch October 15, 2021 09:09
DaveCTurner added a commit that referenced this pull request Oct 15, 2021
Today if the `LeaderChecker` decides it's time to restart discovery then
we log a verbose and confusing message that looks something like this:

    [instance-0000000006] master node [...] failed, restarting discovery
    org.elasticsearch.ElasticsearchException: node [...] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.14.1.jar:7.14.1]
        ...
        at java.lang.Thread.run(Thread.java:831) [?:?]
    Caused by: org.elasticsearch.transport.RemoteTransportException: [...][internal:coordination/fault_detection/leader_check]
    Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [...] has been removed from the cluster
        at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:181) ~[elasticsearch-7.14.1.jar:7.14.1]
    ...
        at java.lang.Thread.run(Thread.java:831) ~[?:?]

There's quite a few problems with this:

- We use `DiscoveryNode#toString` which is far too chatty.
- There's basically nothing useful in these stack traces.
- It's easy to miss the `RemoteTransportException` in the middle.
- It's also easy to miss the root cause below it.
- We say the master node failed which sounds very bad but, well, you
  know, that's just like, uh, our opinion. The master node is often
  fine, it just rejected our checks for some reason.
- Reports of unstable clusters include these messages because they're
  noisy and look important, but don't include the more informative ones
  from the master because the master logs look quieter.

This commit reworks the logging in this area to avoid these problems:

- We use `DiscoveryNode#descriptionWithoutAttributes` throughout.
- We suppress the full stack traces unless `DEBUG` logging is on.
- The `LeaderChecker` now provides the message to be logged, rather than
  putting all the details into an exception that wraps around the root
  cause.
- The message describes the root cause rather than just saying that the
  "master node failed"
- We distinguish timeouts from rejections and report the count of each.
- The message guides towards checking the master node logs too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team. v7.16.0 v8.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants