Problem
The validator process occasionally has an irregular termination, either from a panic when some assumption is clearly violated (e.g. unwrapping a `None`) or when we detect a bad scenario and bail out with `std::process::exit()`.

The `exit()` calls are typically accompanied by an `ERROR` log for visibility; panics also get transcribed into the log file. However, we have observed some instances where the induced teardown causes a segfault in another thread, specifically in threads that access rocksdb.

The working theory is that rocksdb background threads are still running when the teardown starts, and those threads can end up operating on state that has been ripped out from under them. More information and the findings that led to this understanding can be found in #25941.

Granted, the validator is already going down in this scenario; however, the segfault on top obscures the underlying problem and makes discovering the root cause of the validator failure harder, especially for operators who may not be as intimate with the codebase.
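For concreteness, the two termination paths described above look roughly like this (illustrative only; the function, condition, and log message are made up, not actual validator code):

```rust
fn termination_paths(maybe_slot: Option<u64>) -> u64 {
    // Path 1: a violated assumption panics, e.g. unwrapping a None.
    let slot = maybe_slot.unwrap();

    // Path 2: a detected bad scenario is logged and the process exits.
    // std::process::exit() terminates immediately without unwinding,
    // so destructors (and panic hooks) never run on this path.
    if slot == 0 {
        log::error!("detected an unrecoverable state, shutting down");
        std::process::exit(1);
    }
    slot
}
```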
Proposed Solution
Make it such that all threads are stopped on a panic to avoid the subsequent segfault. A custom panic hook might allow us to cancel this background work before teardown starts:
solana/validator/src/main.rs, lines 1587 to 1589 (at commit e1d38c8)
We would still want to flush metrics, but we could add extra logic to be executed when the validator panics and then call the metrics panic hook. As hypothesized in the other GH issue linked above, the method below may be what we're looking for:
https://docs.rs/rocksdb/latest/rocksdb/struct.DBCommon.html#method.cancel_all_background_work
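As a rough illustration, the hook could chain onto whatever hook is already installed and ask rocksdb to wind down first. This is only a sketch under the assumption that we can get an `Arc` handle to the blockstore's `rocksdb::DB` at installation time; `install_shutdown_panic_hook` and the wiring are hypothetical, not existing validator code:

```rust
use std::{panic, sync::Arc};

// Hypothetical helper; the real change would need to thread the blockstore's
// rocksdb::DB handle (or a callback that reaches it) into main.rs.
fn install_shutdown_panic_hook(db: Arc<rocksdb::DB>) {
    // Keep whatever hook is already installed (e.g. the metrics panic hook
    // referenced above) so metrics still get flushed.
    let previous_hook = panic::take_hook();
    panic::set_hook(Box::new(move |panic_info| {
        // Ask rocksdb to stop compactions/flushes and wait for its background
        // threads to drain before teardown proceeds.
        db.cancel_all_background_work(/* wait: */ true);
        // Then defer to the pre-existing hook.
        previous_hook(panic_info);
    }));
}
```

Passing `wait = true` seems to matter here: it blocks until the background work has actually stopped, which is the property we need before the rest of teardown rips state away.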
For testing, we might be able to reproduce this with ledger-tool by starting some long-running task that continually reads (e.g. analyze-storage) in a separate thread and then explicitly panicking. That might be quicker and easier than spinning up a full validator.
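A rough standalone sketch of that idea, not wired into ledger-tool itself; the scratch path, key, and timing are placeholders, and whether this minimal version actually reproduces the segfault is exactly what we would be testing:

```rust
use std::{sync::Arc, thread, time::Duration};

fn main() {
    // Open a scratch rocksdb instance to stand in for the blockstore.
    let db = Arc::new(rocksdb::DB::open_default("/tmp/panic-repro-db").expect("open db"));

    // Simulate a long running reader (analogous to analyze-storage) on a
    // separate thread that keeps rocksdb busy.
    let reader = Arc::clone(&db);
    thread::spawn(move || loop {
        let _ = reader.get(b"some-key");
    });

    // Let the reader get going, then panic on the main thread and observe
    // whether teardown misbehaves with and without the proposed hook installed.
    thread::sleep(Duration::from_secs(1));
    panic!("simulated validator panic");
}
```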