Client state store persisted with incorrect allocation state #24699

Open
mismithhisler opened this issue Dec 17, 2024 · 0 comments

Nomad version

1.9.3

Operating system and Environment details

Ubuntu 24.04.1

Issue

Both the periodic state snapshots and the state save performed during Nomad client process shutdown persist an incorrect allocation state.

Reproduction steps

Run a simple job like the one listed below. Stop the job, then shut down the Nomad process. Run nomad operator client-state /path/to/statestore | jq | grep "ClientStatus" and the result is "running" even though the job has completed.
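As an alternative to the grep, the check can be scripted. The following is a minimal sketch, assuming only that the command above emits JSON in which each allocation carries a "ClientStatus" key somewhere in the document; the exact nesting is not assumed, so the walk is recursive. The function names find_client_statuses and has_running are hypothetical helpers and not part of Nomad.

```python
import json


def find_client_statuses(node):
    """Recursively collect every value stored under a "ClientStatus" key.

    The layout of the client-state output is not assumed; any dict or list
    nesting is walked in document order.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ClientStatus":
                found.append(value)
            else:
                found.extend(find_client_statuses(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_client_statuses(item))
    return found


def has_running(json_text):
    """Return True if any persisted allocation is still marked "running"."""
    return "running" in find_client_statuses(json.loads(json_text))
```

Saving the command's output to a file and passing its contents to has_running gives a yes/no answer for the stopped job; True after the job was stopped matches the behavior described in this issue.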

Optionally, add a log line in the allocRunner's PersistState method that prints the alloc's ClientStatus and see that it is incorrect. (It gets persisted as pending roughly 30 seconds after being marked healthy.)

Expected Result

The ClientStatus should be "complete"

Actual Result

The ClientStatus shows running

Job file (if appropriate)

job "test" {
  type = "service"

  group "echo" {

    task "hello" {
      driver = "exec"

      config {
        command = "sleep"
        args    = ["5000s"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

Output of nomad job status

ID    Type     Priority  Status          Submit Date
test  service  50        dead (stopped)  2024-12-17T13:06:23-05:00

Logs:

    2024-12-17T13:06:28.909-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 task=hello type=Terminated msg="Exit Code: 137, Signal: 9" failed=false
    2024-12-17T13:06:28.918-0500 [DEBUG] client.driver_mgr.exec.executor.stdio: received EOF, stopping recv loop: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 driver=exec task_name=hello err="rpc error: code = Unavailable desc = error reading from server: EOF"
    2024-12-17T13:06:28.929-0500 [INFO]  client.driver_mgr.exec.executor: plugin process exited: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 driver=exec task_name=hello plugin=/usr/local/bin/nomad id=31528
    2024-12-17T13:06:28.929-0500 [DEBUG] client.driver_mgr.exec.executor: plugin exited: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 driver=exec task_name=hello
    2024-12-17T13:06:28.930-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 task=hello type=Killed msg="Task successfully killed" failed=false
    2024-12-17T13:06:28.938-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.stdio: received EOF, stopping recv loop: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 task=hello err="rpc error: code = Unavailable desc = error reading from server: EOF"
    2024-12-17T13:06:28.941-0500 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 task=hello plugin=/usr/local/bin/nomad id=31520
    2024-12-17T13:06:28.941-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 task=hello
    2024-12-17T13:06:28.942-0500 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 task=hello
    2024-12-17T13:06:28.942-0500 [INFO]  client.gc: marking allocation for GC: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0
    2024-12-17T13:06:29.223-0500 [DEBUG] client: updated allocations: index=8495 total=1 pulled=0 filtered=1
    2024-12-17T13:06:29.225-0500 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
    2024-12-17T13:06:29.227-0500 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
^C==> Caught signal: interrupt
    2024-12-17T13:06:32.341-0500 [INFO]  agent: requesting shutdown
    2024-12-17T13:06:32.341-0500 [INFO]  client: shutting down
    2024-12-17T13:06:32.347-0500 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
    2024-12-17T13:06:32.348-0500 [INFO]  client.plugin: plugin manager finished: plugin-type=device
    2024-12-17T13:06:32.349-0500 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
    2024-12-17T13:06:32.370-0500 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
    2024-12-17T13:06:32.371-0500 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
    2024-12-17T13:06:32.371-0500 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
    2024-12-17T13:06:32.371-0500 [DEBUG] client.server_mgr: shutting down
    2024-12-17T13:06:32.372-0500 [INFO]  client.alloc_runner: Persisting ClientStatus to state store: alloc_id=419af277-aaf1-83cf-24a1-74c60a4b1cc0 EXTRA_VALUE_AT_END=running