
Irregular Validator teardown could be cleaner #29740

Open
steviez opened this issue Jan 17, 2023 · 2 comments

Comments


steviez commented Jan 17, 2023

Problem

The validator process occasionally has an irregular termination, either from a panic when some assumption is clearly violated (i.e., unwrapping a None) or when we detect a bad scenario and bail out with std::process::exit().

The exit() calls are typically accompanied by an ERROR log for visibility; panics are also transcribed into the log file. However, we have observed instances where the induced teardown causes a segfault in another thread, specifically in threads that access rocksdb.

The working theory is that there are rocksdb background threads that are still running, and when the teardown starts, those threads could be operating on state that has been ripped out from under them. Some more information and findings that led us to this understanding can be found in #25941.

Granted, the validator is already going down in this scenario; however, the segfault on top obscures the underlying problem and makes discovering the root cause of the failure harder, especially for operators who may not be as familiar with the codebase.

Proposed Solution

Make it such that all threads are stopped on a panic to avoid the subsequent segfault. A custom panic hook might allow us to cancel this background work before teardown starts:

solana/validator/src/main.rs

Lines 1587 to 1589 in e1d38c8

solana_metrics::set_panic_hook("validator", {
    let version = format!("{solana_version:?}");
    Some(version)
});
We would still want to flush metrics, but we could add extra logic to be executed when the validator panics, and then call the metrics panic hook. As hypothesized in the other GH issue I linked, the method below may be what we're looking for:
https://docs.rs/rocksdb/latest/rocksdb/struct.DBCommon.html#method.cancel_all_background_work


steviez commented Jan 17, 2023

rocksdb is a C++ library that we use via an FFI + Rust wrapper. We might want to double-check how the FFI is configured with respect to handling panics. Potentially useful reading (or at least a primer):
https://doc.rust-lang.org/nomicon/ffi.html


steviez commented Jan 17, 2023

For testing, we might be able to reproduce this with ledger-tool by starting some long-running task that continually reads (i.e., analyze-storage) in a separate thread and then explicitly panicking. That might be quicker / easier than spinning up a full validator.

@github-actions bot added the stale label ([bot only] results in auto-close after a week) on Jan 18, 2024
@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 25, 2024
@steviez removed the stale label on Jan 25, 2024
@steviez reopened this on Jan 25, 2024