Proposal
Edge workloads may have an unstable internet connection or power source. For an unstable connection we can use the disconnect block to handle the condition. However, when a client node restarts because of unstable power and the internet connection is lost at the same time, the existing Nomad allocation cannot be started, because the node has not yet successfully registered with or reconnected to the Nomad server.
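For reference, here is a minimal sketch of the kind of group configuration this refers to. The group name, task name, driver, and datacenter match the output and logs in this issue; the job name, the `lost_after` duration, the `replace` and `reconcile` settings, and the command path are assumptions for illustration only, not the actual job file.

```hcl
# Sketch only: every value not shown in the logs of this issue is assumed.
job "edge-app" { # hypothetical job name
  datacenters = ["edge"]

  group "edge-app" {
    # Keep the allocation on the disconnected client instead of treating the
    # node as down and rescheduling the work elsewhere.
    disconnect {
      lost_after = "6h"            # assumed tolerance for a disconnect
      replace    = false           # do not place a replacement allocation
      reconcile  = "keep_original" # prefer the original alloc on reconnect
    }

    task "edge-app" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/edge-app" # hypothetical path
      }
    }
  }
}
```

With a block like this, the server side behaves as shown in the output here (node `disconnected`, allocation `unknown`), but it does not change what the client does after a reboot while the server is still unreachable.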
The client node restarts because it lost power, and at the same time the internet connection is also lost
The disconnect block is working
The client node is marked as disconnected rather than down
ID        Node Pool  DC    Name             Class   Drain  Eligibility  Status
6723722e  default    edge  ip-172-31-5-240  <none>  false  eligible     disconnected
The allocation is marked as unknown
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
edb17146  6723722e  edge-app    17       run      unknown  6m13s ago  3m34s ago
The client node tries to connect to the server after losing power and internet
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:19.369Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app type=Received msg="Task received by client" failed=false
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:19.369Z [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=edb17146-2c06-49f6-d8db-27408e9e6a69/edge-app/891c72c5
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:19.369Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app error="failed to reattach to executor: error creating rpc client for executor plugin: Reattachment process not found" task_id=edb17146-2c06-49f6-d8db-27408e9e6a69/edge-app/891c72c5
Dec 08 15:15:19 ip-172-31-5-240 systemd[1]: Started nomad.service - Nomad.
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:19.369Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app type="Failed Restoring Task" msg="failed to restore task; will not run until server is contacted" failed=false
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:19.373Z [INFO] client: started client: node_id=6723722e-0ec8-b6ca-99b3-9f86a562bbb7
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:19.381Z [INFO] client.alloc_runner.task_runner: task failed to restore; waiting to contact server before restarting: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:29.348Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 123.x.x.x:4647: i/o timeout" rpc=Node.Register server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:29.349Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 123.x.x.x:4647: i/o timeout" rpc=Node.Register server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:29.349Z [ERROR] client: error registering: error="rpc error: failed to get conn: dial tcp 123.x.x.x:4647: i/o timeout"
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:29.349Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:29.349Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]: 2024-12-08T15:15:29.349Z [ERROR] client: error querying node allocations: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
At this point the existing allocation is not running.
I think it would be a good thing if Nomad were able to run the previous allocation under the conditions above.
Use-cases
Edge workloads with an unstable internet connection and power source.
Attempted Solutions
Hi @rahadiangg! This is a known issue in the disconnect block implementation. See #15783. I'm going to keep this issue as a better description of the problem and mark it for roadmapping.
Doesn't the schedule stanza cover this scenario? The allocs will be started by the client even if it's isolated (as long as there are no external dependencies such as service discovery or secrets)