Node left forever due to abnormal lagging. #78240

**Elasticsearch version** (`bin/elasticsearch --version`): 7.10.1

**JVM version** (`java -version`): jre1.8.0_181

**Description of the problem including expected versus actual behavior**:

In a 116-node cluster, one data node took more than 1.5 minutes to apply a cluster state, so the master removed it with reason `lagging`.
However, the removed node had no resource bottleneck. After being removed it could not rejoin the cluster, and we had to restart it manually.

**Provide logs (if relevant)**:
```
[2021-09-23T16:29:49,305][INFO ][o.e.c.c.C.CoordinatorPublication] [1582719808097406009] after [9.9s] publication of cluster state version [11476] is still waiting for {1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99} [SENT_APPLY_COMMIT]
[2021-09-23T16:30:07,084][WARN ][o.e.c.InternalClusterInfoService] [1582719808097406009] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-09-23T16:30:09,334][WARN ][o.e.c.c.C.CoordinatorPublication] [1582719808097406009] after [29.9s] publication of cluster state version [11476] is still waiting for {1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99} [SENT_APPLY_COMMIT]
[2021-09-23T16:30:57,457][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [1582719808097406009] collector [index_recovery] timed out when collecting data
[2021-09-23T16:31:07,127][WARN ][o.e.c.InternalClusterInfoService] [1582719808097406009] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-09-23T16:31:38,199][WARN ][o.e.t.TransportService   ] [1582719808097406009] Received response for a request that has timed out, sent [10484ms] ago, timed out [601ms] ago, action [internal:coordination/fault_detection/follower_check], node [{1582719808097408909}{w-IYskHMTxihzfvNrWLpTQ}{9UukJD0uQ6KyokPpzqiHbA}{9.55.179.211}{9.55.179.211:24686}{cdhilrstw}{ml.machine_memory=202335322112, rack=271935, xpack.installed=true, set=141, transform.node=true, ip=9.55.179.211, temperature=hot, ml.max_open_jobs=20, region=99}], id [445547577]
[2021-09-23T16:31:39,334][WARN ][o.e.c.c.LagDetector      ] [1582719808097406009] node [{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99}] is lagging at cluster state version [11475], although publication of cluster state version [11476] completed [1.5m] ago
[2021-09-23T16:31:39,667][INFO ][o.e.c.s.MasterService    ] [1582719808097406009] node-left[{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99} reason: lagging], term: 1, version: 11477, delta: removed {{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99}}
[2021-09-23T16:31:40,322][INFO ][o.e.c.s.ClusterApplierService] [1582719808097406009] removed {{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99}}, term: 1, version: 11477, reason: Publication{term=1, version=11477}
[2021-09-23T16:31:40,367][INFO ][o.e.c.r.DelayedAllocationService] [1582719808097406009] scheduling reroute for delayed shards in [4.9m] (170 delayed shards)
```
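
The timing in these master logs looks consistent with the default timeouts rather than with custom settings. The sketch below is only a sanity check on the timestamps: it assumes `cluster.publish.timeout` is at its default of 30s and `cluster.follower_lag.timeout` is at its default of 90s (we have not confirmed the effective values on this cluster), and that the lag check fires once both have elapsed after the publication starts. The `[9.9s]` figure in the log is rounded, so the result is approximate.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S,%f"

# Timestamps copied from the master log lines in this report.
waiting_after_9_9s = datetime.strptime("2021-09-23T16:29:49,305", FMT)
lag_detected = datetime.strptime("2021-09-23T16:31:39,334", FMT)

# Assumed defaults, not values read from the cluster:
PUBLISH_TIMEOUT = timedelta(seconds=30)       # cluster.publish.timeout
FOLLOWER_LAG_TIMEOUT = timedelta(seconds=90)  # cluster.follower_lag.timeout

# The first log line says the publication had been running for about 9.9s.
publication_start = waiting_after_9_9s - timedelta(seconds=9.9)

# Publication times out first, then the lag timeout starts counting.
expected_lag_check = publication_start + PUBLISH_TIMEOUT + FOLLOWER_LAG_TIMEOUT

drift_seconds = abs((lag_detected - expected_lag_check).total_seconds())
print(f"expected lag check at {expected_lag_check.time()}, "
      f"logged at {lag_detected.time()}, drift {drift_seconds:.3f}s")
```

If the drift is well under a second, the removal happened at the moment the default lag timeout would expire, which suggests the node never acknowledged applying version 11476 within that window.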

After the node was removed, we could still get the cluster health through its HTTP port, but it reported only 115 nodes, without itself. The node seems to get stuck in some abnormal state. The issue is not easy to reproduce, but we have hit it several times in different clusters.
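
A minimal sketch of how this stuck state could be spotted from outside, assuming the node still answers HTTP requests as described above. It asks the node for its own name (`GET /`) and for the node names in its view of the cluster (`GET /_cat/nodes?format=json&h=name`), then checks whether the node is missing from its own view. The function names and the base URL are hypothetical, and authentication/TLS are left out.

```python
import json
from urllib.request import urlopen


def get_json(url, timeout=5.0):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def is_missing_from_own_view(local_name, cat_nodes_rows):
    """True if the node's own name is absent from the node list it reports."""
    return local_name not in {row["name"] for row in cat_nodes_rows}


def check_node(base_url, timeout=5.0):
    """Query one node over HTTP and report whether it lists itself.

    base_url is a placeholder such as "http://host:9200".
    """
    local_name = get_json(base_url + "/", timeout)["name"]
    rows = get_json(base_url + "/_cat/nodes?format=json&h=name", timeout)
    return is_missing_from_own_view(local_name, rows)
```

Running `check_node` against each data node's HTTP port would flag a node in the state we observed, since it returned a 115-node view that did not include itself.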
