Allow edge workload run after client node restart & lost internet connection #24629

Open
rahadiangg opened this issue Dec 8, 2024 · 2 comments

Comments


rahadiangg commented Dec 8, 2024

Proposal

Edge workloads may have an unstable internet connection or power source. For an unstable connection, we can use the disconnect block to handle this condition. However, when a client node restarts because of unstable power and the internet connection is lost at the same time, the existing Nomad allocation cannot be started because the node has not successfully registered/reconnected to the Nomad server.

Example:

  1. Submit job
job "edge-app" {
  
  region = "global"

  group "edge-app" {

    disconnect {
      lost_after = "6h"
      replace    = false
      reconcile  = "best_score"
    }

    task "edge-app" {
      driver = "raw_exec"
      config {
        command = "local/my-app"
      }
      artifact {
        source = "https://url-to-binary-app/my-app"
      }
    }
  }
}
  2. The client node restarts because it lost its power source, and at the same time the internet connection is also lost
  3. The disconnect block works as expected

The client node is marked as disconnected rather than down

ID        Node Pool  DC    Name             Class   Drain  Eligibility  Status
6723722e  default    edge  ip-172-31-5-240  <none>  false  eligible     disconnected

The allocation is marked as unknown

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
edb17146  6723722e  edge-app    17       run      unknown   6m13s ago   3m34s ago
  4. The client node tries to connect to the server after losing power and internet
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:19.369Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app type=Received msg="Task received by client" failed=false
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:19.369Z [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=edb17146-2c06-49f6-d8db-27408e9e6a69/edge-app/891c72c5
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:19.369Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app error="failed to reattach to executor: error creating rpc client for executor plugin: Reattachment process not found" task_id=edb17146-2c06-49f6-d8db-27408e9e6a69/edge-app/891c72c5
Dec 08 15:15:19 ip-172-31-5-240 systemd[1]: Started nomad.service - Nomad.
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:19.369Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app type="Failed Restoring Task" msg="failed to restore task; will not run until server is contacted" failed=false
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:19.373Z [INFO]  client: started client: node_id=6723722e-0ec8-b6ca-99b3-9f86a562bbb7
Dec 08 15:15:19 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:19.381Z [INFO]  client.alloc_runner.task_runner: task failed to restore; waiting to contact server before restarting: alloc_id=edb17146-2c06-49f6-d8db-27408e9e6a69 task=edge-app
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:29.348Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 123.x.x.x:4647: i/o timeout" rpc=Node.Register server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:29.349Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 123.x.x.x:4647: i/o timeout" rpc=Node.Register server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:29.349Z [ERROR] client: error registering: error="rpc error: failed to get conn: dial tcp 123.x.x.x:4647: i/o timeout"
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:29.349Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:29.349Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=123.x.x.x:4647
Dec 08 15:15:29 ip-172-31-5-240 nomad[529]:     2024-12-08T15:15:29.349Z [ERROR] client: error querying node allocations: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
  5. At this point the existing allocation is not running

I think it would be good if Nomad were able to run the previous allocation under the conditions above.
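
To make the request concrete, the behaviour seen in the logs and the behaviour proposed here can be sketched as a small decision function. This is only an illustrative model in Python, not Nomad's actual code: `RestoreInput`, `restore_decision`, and the `allow_offline_restart` flag are hypothetical names, and bounding an offline restart by `lost_after` is an assumption made for illustration, not something confirmed in this issue.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RestoreInput:
    """Hypothetical snapshot of what a restarted client knows about one task."""
    reattached: bool             # did the client reattach to the old executor?
    server_contacted: bool       # has the client reached a server since restart?
    has_disconnect_block: bool   # does the task group define a disconnect block?
    lost_after: timedelta        # disconnect.lost_after from the job spec
    offline_for: timedelta       # time since the last successful server contact


def restore_decision(state: RestoreInput, allow_offline_restart: bool = False) -> str:
    """Return "running", "restart", or "wait" for a task after a client restart.

    With allow_offline_restart=False this follows the log messages in this
    issue: a task that fails to restore waits until the server is contacted.
    With allow_offline_restart=True it models the proposed behaviour, which is
    an assumption and not an existing Nomad option.
    """
    if state.reattached:
        return "running"
    if state.server_contacted:
        # The server can confirm whether the allocation should still run.
        return "restart"
    if (allow_offline_restart and state.has_disconnect_block
            and state.offline_for < state.lost_after):
        # Proposed: restart locally while still inside the lost_after window.
        return "restart"
    return "wait"


# The scenario from this issue: power loss killed the executor and the
# server is unreachable after the reboot.
scenario = RestoreInput(
    reattached=False,
    server_contacted=False,
    has_disconnect_block=True,
    lost_after=timedelta(hours=6),
    offline_for=timedelta(minutes=6),
)

print(restore_decision(scenario))                              # current behaviour
print(restore_decision(scenario, allow_offline_restart=True))  # proposed behaviour
```

In this model the first call corresponds to the "will not run until server is contacted" task event above, and the second shows the allocation being restarted while the node is still isolated.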

Use-cases

Edge workloads with an unstable internet connection and power source.

Attempted Solutions

Member

tgross commented Dec 9, 2024

Hi @rahadiangg! This is a known issue in the disconnect block implementation. See #15783. I'm going to keep this issue as a better description of the problem and mark it for roadmapping.

Contributor

sofixa commented Dec 13, 2024

Doesn't the schedule stanza cover this scenario? The allocs will be started by the client even if it's isolated (as long as there are no external dependencies such as service discovery or secrets)

Projects
Status: Needs Roadmapping

3 participants