
Irregular Validator teardown could be cleaner #29740

Open
steviez opened this issue Jan 17, 2023 · 2 comments

Comments


steviez commented Jan 17, 2023

Problem

The validator process occasionally has an irregular termination, either from a panic when some assumption is clearly violated (i.e., unwrapping a None) or when we detect a bad scenario and bail out with std::process::exit().

The exit() calls are typically accompanied by an ERROR log for visibility; panics are also transcribed into the log file. However, we have observed instances where the induced teardown causes a segfault in another thread, specifically in threads that access rocksdb.

The working theory is that there are rocksdb background threads that are still running, and when the teardown starts, those threads could be operating on state that has been ripped out from under them. Some more information and findings that led us to this understanding can be found in #25941.

Granted, the validator is already going down in this scenario; however, the segfault on top obscures the underlying problem and makes discovering the root cause of the failure harder, especially for operators who may not be as familiar with the codebase.

Proposed Solution

Make it such that all threads are stopped on a panic to avoid the subsequent segfault. A custom panic hook might allow us to cancel this background work before teardown starts:

solana/validator/src/main.rs

Lines 1587 to 1589 in e1d38c8

solana_metrics::set_panic_hook("validator", {
    let version = format!("{solana_version:?}");
    Some(version)
});
We would still want to flush metrics, but we could add extra logic to be executed when the validator panics, and then call the metrics panic hook. As hypothesized in the other GH issue I linked, the method below may be what we're looking for:
https://docs.rs/rocksdb/latest/rocksdb/struct.DBCommon.html#method.cancel_all_background_work


steviez commented Jan 17, 2023

rocksdb is a C++ library that we use via an FFI + Rust wrapper. We might want to double-check how the FFI is configured with respect to handling panics. Potentially useful reading (or at least a primer):
https://doc.rust-lang.org/nomicon/ffi.html


steviez commented Jan 17, 2023

For testing, we might be able to reproduce this with ledger-tool by starting some long-running task that continually reads (i.e., analyze-storage) in a separate thread and then explicitly panicking. That might be quicker / easier than spinning up a full validator.

@github-actions bot added the stale label ([bot only] results in auto-close after a week) on Jan 18, 2024
@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 25, 2024
@steviez removed the stale label on Jan 25, 2024
@steviez reopened this on Jan 25, 2024