In a 116-node cluster, one data node took more than 1.5 minutes to apply a cluster state, so the master removed it for lagging.
However, the removed node showed no resource bottleneck. After being removed, it could not rejoin the cluster, and we had to restart it manually.
[2021-09-23T16:29:49,305][INFO ][o.e.c.c.C.CoordinatorPublication] [1582719808097406009] after [9.9s] publication of cluster state version [11476] is still waiting for {1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99} [SENT_APPLY_COMMIT]
[2021-09-23T16:30:07,084][WARN ][o.e.c.InternalClusterInfoService] [1582719808097406009] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-09-23T16:30:09,334][WARN ][o.e.c.c.C.CoordinatorPublication] [1582719808097406009] after [29.9s] publication of cluster state version [11476] is still waiting for {1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99} [SENT_APPLY_COMMIT]
[2021-09-23T16:30:57,457][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [1582719808097406009] collector [index_recovery] timed out when collecting data
[2021-09-23T16:31:07,127][WARN ][o.e.c.InternalClusterInfoService] [1582719808097406009] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-09-23T16:31:38,199][WARN ][o.e.t.TransportService ] [1582719808097406009] Received response for a request that has timed out, sent [10484ms] ago, timed out [601ms] ago, action [internal:coordination/fault_detection/follower_check], node [{1582719808097408909}{w-IYskHMTxihzfvNrWLpTQ}{9UukJD0uQ6KyokPpzqiHbA}{9.55.179.211}{9.55.179.211:24686}{cdhilrstw}{ml.machine_memory=202335322112, rack=271935, xpack.installed=true, set=141, transform.node=true, ip=9.55.179.211, temperature=hot, ml.max_open_jobs=20, region=99}], id [445547577]
[2021-09-23T16:31:39,334][WARN ][o.e.c.c.LagDetector ] [1582719808097406009] node [{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99}] is lagging at cluster state version [11475], although publication of cluster state version [11476] completed [1.5m] ago
[2021-09-23T16:31:39,667][INFO ][o.e.c.s.MasterService ] [1582719808097406009] node-left[{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99} reason: lagging], term: 1, version: 11477, delta: removed {{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99}}
[2021-09-23T16:31:40,322][INFO ][o.e.c.s.ClusterApplierService] [1582719808097406009] removed {{1582719808097407309}{jLlw0dX_SoCV0kXfVw26pA}{9UArXzwaTpOVlnbj2EtGPw}{9.55.180.36}{9.55.180.36:28394}{cdhilrstw}{ml.machine_memory=202335322112, rack=271943, xpack.installed=true, set=141, transform.node=true, ip=9.55.180.36, temperature=hot, ml.max_open_jobs=20, region=99}}, term: 1, version: 11477, reason: Publication{term=1, version=11477}
[2021-09-23T16:31:40,367][INFO ][o.e.c.r.DelayedAllocationService] [1582719808097406009] scheduling reroute for delayed shards in [4.9m] (170 delayed shards)
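To find other occurrences in master logs, the LagDetector warning above can be matched with a small script. This is only a sketch: it assumes each log entry is on a single line and that the message wording is the same as in the 7.10.1 log shown here. The names `LAG_RE` and `parse_lag_warning` are hypothetical helpers, not part of Elasticsearch.

```python
import re

# Pattern based on the LagDetector WARN message quoted in this report.
# Assumption: the wording is stable across the versions being searched.
LAG_RE = re.compile(
    r"is lagging at cluster state version \[(\d+)\], "
    r"although publication of cluster state\s+version \[(\d+)\] "
    r"completed \[([^\]]+)\] ago"
)


def parse_lag_warning(line):
    """Return the versions and elapsed time from a lag warning, or None."""
    match = LAG_RE.search(line)
    if match is None:
        return None
    return {
        "applied_version": int(match.group(1)),    # version the node is stuck at
        "published_version": int(match.group(2)),  # version the master published
        "elapsed": match.group(3),                 # e.g. "1.5m", kept as text
    }
```

Running it over each line of a master log and keeping the non-None results gives a list of lag events that can be compared across clusters.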
After the node was removed, we could still get cluster health through its HTTP port, but it reported only 115 nodes, not including itself. The node seems to have entered an abnormal state. The issue is not easy to reproduce, but we have hit it several times in different clusters.
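The symptom described above (a node that still answers HTTP but is absent from the node list it reports) can be checked with a short script. This is a minimal sketch, not a definitive diagnostic: it assumes the node is reachable over plain HTTP without authentication, the base URL is a placeholder, and `fetch_json` and `node_missing_from_own_view` are hypothetical helper names. It uses the root endpoint (`GET /`) for the node's own name and `GET /_cluster/state/nodes?local=true` for the node list as that node sees it.

```python
import json
from urllib.request import urlopen


def fetch_json(base_url, path, timeout=10):
    """GET base_url + path and decode the JSON body."""
    with urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
        return json.load(resp)


def node_missing_from_own_view(root_info, state_nodes):
    """True if the node's own name is absent from the node list it reports.

    root_info:   response of GET /  (contains "name")
    state_nodes: response of GET /_cluster/state/nodes  (contains "nodes")
    """
    own_name = root_info["name"]
    known_names = {n.get("name") for n in state_nodes.get("nodes", {}).values()}
    return own_name not in known_names


def check_node(base_url):
    """Query one node's HTTP port and report whether it lists itself."""
    root_info = fetch_json(base_url, "/")
    # local=true asks for the cluster state held by the queried node itself.
    state_nodes = fetch_json(base_url, "/_cluster/state/nodes?local=true")
    return node_missing_from_own_view(root_info, state_nodes)
```

For example, `check_node("http://<node-host>:<http-port>")` returning `True` would correspond to the state we observed, where the removed node reported 115 nodes without itself.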
Elasticsearch version (bin/elasticsearch --version): 7.10.1
JVM version (java -version): jre1.8.0_181