Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub Notif Stack Overflow Error #29032

Open
tracy-codes opened this issue Dec 2, 2022 · 6 comments
Open

Sub Notif Stack Overflow Error #29032

tracy-codes opened this issue Dec 2, 2022 · 6 comments

Comments

@tracy-codes
Copy link

tracy-codes commented Dec 2, 2022

Hi all, we're reporting a stack overflow error that we've encountered that causes the solana service to restart. This is seemingly random as we aren't able to reproduce with any specific load tests. The nodes are either under heavy load or they are under next-to-no load when this error happens.

The specific error we see in our logs is:

thread 'sol-sub-notif-1' has overflowed its stack
fatal runtime error: stack overflow

Below are some logs just before and right after the error (you can see the validator service just restarts itself).

[2022-12-01T20:54:37.053660080Z INFO  solana_metrics::metrics] datapoint: slot_repeated_writes same_account_writes=1542i same_account_writes_size=3228349i unique_account_writes_size=13681353i size=13681353i
[2022-12-01T20:54:37.054121482Z INFO  solana_metrics::metrics] datapoint: accounts_cache_size num_roots=19i num_slots=50i total_unique_writes_size=650953463i total_size=650935557i
[2022-12-01T20:54:37.054129302Z INFO  solana_metrics::metrics] datapoint: bank_frozen slot=164339890i hash="9gPgcQ86TxbCPfimMLKuhN9ZWYHyPq16MG5j3SSMiXyy"
[2022-12-01T20:54:37.054149073Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 164339890,
            root: 164339858,
            slot_state_update: BankFrozen(BankFrozenState { frozen_hash: 9gPgcQ86TxbCPfimMLKuhN9ZWYHyPq16MG5j3SSMiXyy, cluster_confirmed_hash: None, is_slot_duplicate: false })
[2022-12-01T20:54:37.057489235Z INFO  solana_metrics::metrics] datapoint: replay-slot-stats slot=164339890i fetch_entries_time=3757i fetch_entries_fail_time=0i entry_poh_verification_time=29866i entry_transaction_verification_time=44746i replay_time=221192i replay_total_elapsed=340060i total_entries=225i total_shreds=902i check_us=1585i load_us=66451i execute_us=188441i store_us=176934i update_stakes_cache_us=45396i bank_complete_time_us=39480i total_batches_len=184i num_execute_batches=42i execute_details_serialize_us=37936i execute_details_create_vm_us=12418i execute_details_execute_inner_us=50785i execute_details_deserialize_us=12366i execute_details_get_or_create_executor_us=0i execute_details_changed_account_count=20i execute_details_total_account_count=10108i execute_details_total_data_size=63251563i execute_details_data_size_changed=6724150i execute_details_create_executor_register_syscalls_us=0i execute_details_create_executor_load_elf_us=0i execute_details_create_executor_verify_code_us=0i execute_details_create_executor_jit_compile_us=0i execute_accessories_feature_set_clone_us=0i execute_accessories_compute_budget_process_transaction_us=37i execute_accessories_get_executors_us=1098i execute_accessories_process_message_us=175152i execute_accessories_update_executors_us=76i execute_accessories_process_instructions_total_us=171354i execute_accessories_process_instructions_verify_caller_us=0i execute_accessories_process_instructions_process_executable_chain_us=156146i execute_accessories_process_instructions_verify_callee_us=3919i
[2022-12-01T20:54:37.057512986Z INFO  solana_metrics::metrics] datapoint: per_program_timings slot=164339890i pubkey="all" execute_us=171354i accumulated_units=2379710i count=2631i errored_units=1185654i errored_count=26i
[2022-12-01T20:54:37.079622106Z INFO  solana_metrics::metrics] datapoint: compute_bank_stats-best_slot computed_slot=164339890i overall_best_slot=164339890i overall_best_hash="9gPgcQ86TxbCPfimMLKuhN9ZWYHyPq16MG5j3SSMiXyy" elapsed=1422i
[2022-12-01T20:54:37.079623836Z INFO  solana_core::replay_stage] 2ugTCAsMuuhqSXSWtnNF3x97RrUiTiNoz5mXno55bLeX slot_weight: 164339890 0 0 164339889
[2022-12-01T20:54:37.079644006Z INFO  solana_metrics::metrics] datapoint: bank_weight slot=164339890i weight="0"
[2022-12-01T20:54:37.079662977Z INFO  solana_core::replay_stage] validator fork confirmed 164339888 1400ms
[2022-12-01T20:54:37.079670627Z INFO  solana_metrics::metrics] datapoint: validator-confirmation duration_ms=1400i
[2022-12-01T20:54:37.079670797Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 164339888,
            root: 164339858,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: CEG7DhUsdy8UPp65ejEJbx2h6Js7e82v43xR71MhVhv1, bank_status: Frozen(CEG7DhUsdy8UPp65ejEJbx2h6Js7e82v43xR71MhVhv1) })
[2022-12-01T20:54:37.079686627Z INFO  solana_core::replay_stage] voting: 164339890 0
[2022-12-01T20:54:37.079705708Z INFO  solana_metrics::metrics] datapoint: tower-vote latest=164339890i root=164339859i
[2022-12-01T20:54:37.079823191Z INFO  solana_metrics::metrics] datapoint: bank-forks_set_root elapsed_ms=0i slot=164339859i total_parent_banks=2i total_banks=35i total_squash_cache_ms=0i total_squash_accounts_ms=0i total_squash_accounts_index_ms=0i total_squash_accounts_cache_ms=0i total_squash_accounts_store_ms=0i total_snapshot_ms=0i tx_count=258i prune_non_rooted_ms=0i drop_parent_banks_ms=0i prune_slots_ms=0i prune_remove_ms=0i dropped_banks_len=0i accounts_data_len=76187830196i
[2022-12-01T20:54:37.079852162Z INFO  solana_core::replay_stage] new root 164339859
[2022-12-01T20:54:37.079862292Z INFO  solana_core::replay_stage] vote bank: Some((164339890, SameFork)) reset bank: 164339890
[2022-12-01T20:54:37.079870692Z INFO  solana_metrics::metrics] datapoint: blocks_produced num_blocks_on_fork=127200i num_dropped_blocks_on_fork=5632i
[2022-12-01T20:54:37.079875402Z INFO  solana_poh::poh_recorder] reset poh from: 8oEzReziNvFEAR7YmwRzZKNBURDffosNzS1w4zXGCrag,10517753027,164339889 to: 8oEzReziNvFEAR7YmwRzZKNBURDffosNzS1w4zXGCrag,164339890
[2022-12-01T20:54:37.079882982Z INFO  solana_core::replay_stage] 8jbtYnuhsW4QREqYofGg5NTivN38zrF7LLAdjwQcBoBm reset PoH to tick 10517753024 (within slot 164339890). I am not in the leader schedule yet
[2022-12-01T20:54:37.080507308Z INFO  solana_core::replay_stage] new fork:164339891 parent:164339890 root:164339859
[2022-12-01T20:54:37.081124973Z INFO  solana_metrics::metrics] datapoint: rpc_subscriptions source="gossip" num_account_subscriptions=0i num_account_pubkeys_notified=0i num_logs_subscriptions=0i num_logs_notified=0i num_program_subscriptions=2010i num_programs_notified=0i num_signature_subscriptions=0i num_signatures_notified=0i notifications_time=41850i
[2022-12-01T20:54:37.082048716Z INFO  solana_metrics::metrics] datapoint: bank-timestamp get_timestamp_estimate_us=1239i
[2022-12-01T20:54:37.082060217Z INFO  solana_metrics::metrics] datapoint: bank-timestamp-correction slot=164339891i from_genesis=1650104896i corrected=1669928077i ancestor_timestamp=1669928076i
[2022-12-01T20:54:37.082305773Z INFO  solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot=164339891i block_height=148751352i parent_slot=164339890i bank_rc_creation_us=2i total_elapsed_us=1781i status_cache_rc_us=0i fee_components_us=1i blockhash_queue_us=9i stakes_cache_us=1i epoch_stakes_time_us=0i builtin_programs_us=0i rewards_pool_pubkeys_us=0i cached_executors_us=32i transaction_debug_keys_us=0i transaction_log_collector_config_us=0i feature_set_us=0i ancestors_us=13i update_epoch_us=0i update_sysvars_us=1484i fill_sysvar_cache_us=223i
[2022-12-01T20:54:37.082326803Z INFO  solana_metrics::metrics] datapoint: bank-executor-cache-stats slot=164339890i hits=428i misses=0i evictions=0i insertions=0i replacements=0i one_hit_wonders=0i

thread 'sol-sub-notif-1' has overflowed its stack
fatal runtime error: stack overflow
[2022-12-01T20:55:38.332685015Z INFO  solana_validator] solana-validator 1.13.5 (src:91384943; feat:1365939126)
[2022-12-01T20:55:38.332726026Z INFO  solana_validator] Starting validator with: ArgsOs {

Even with debug flags enabled, we do not get any other info from this error aside from those two lines. edit: wrong debug flag for this. We are going to enable the proper ones for websockets.

This error causes nodes to restart at random and it's difficult to contain, causing RPC cluster instabilities.

@tracy-codes
Copy link
Author

We are going to restart our nodes with solana_rpc::rpc_subscriptions=debug added to the RUST_LOG environment variable. Will report back with our findings once the error is caught again.

@tracy-codes
Copy link
Author

We have caught this error a few times in the last 24 hours. Below are logs from each overflow with solana_rpc::rpc_subscriptions=debug added.

Overflow 1:

[2022-12-03T08:25:09.216293420Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 164630436,
            root: 164630399,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: 6qU113fbseLFGXkseX6q8tt8ihJ8m4Rs6p4R48YJpUQk, bank_status: Frozen(6qU113fbseLFGXkseX6q8tt8ihJ8m4Rs6p4R48YJpUQk) })
[2022-12-03T08:25:09.216433444Z INFO  solana_core::heaviest_subtree_fork_choice] marking fork starting at: (164630436, 6qU113fbseLFGXkseX6q8tt8ihJ8m4Rs6p4R48YJpUQk) valid candidate
[2022-12-03T08:25:09.216435544Z INFO  solana_metrics::metrics] datapoint: duplicate_confirmed_slot slot=164630436i duplicate_confirmed_hash="6qU113fbseLFGXkseX6q8tt8ihJ8m4Rs6p4R48YJpUQk" my_hash="6qU113fbseLFGXkseX6q8tt8ihJ8m4Rs6p4R48YJpUQk"
[2022-12-03T08:25:09.216484905Z DEBUG solana_rpc::rpc_subscriptions] notified(gossip): accounts: 0 / 0 logs: 17 / 2 programs: 1047 / 0 signatures: 0 / 0
[2022-12-03T08:25:09.216508146Z INFO  solana_metrics::metrics] datapoint: rpc_subscriptions source="gossip" num_account_subscriptions=0i num_account_pubkeys_notified=0i num_logs_subscriptions=17i num_logs_notified=2i num_program_subscriptions=1047i num_programs_notified=0i num_signature_subscriptions=0i num_signatures_notified=0i notifications_time=24884i
[2022-12-03T08:25:09.217452951Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=252i
[2022-12-03T08:25:09.218583161Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=683i
[2022-12-03T08:25:09.218617372Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=504i
[2022-12-03T08:25:09.218631543Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=314i
[2022-12-03T08:25:09.218643503Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=182i
[2022-12-03T08:25:09.218661454Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=100i
[2022-12-03T08:25:09.219040824Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=28i
[2022-12-03T08:25:09.219219258Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=30i
[2022-12-03T08:25:09.219403783Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=27i
[2022-12-03T08:25:09.222625220Z DEBUG solana_rpc::rpc_subscriptions] notified(bank): accounts: 1890 / 9 logs: 0 / 0 programs: 0 / 0 signatures: 3 / 0
[2022-12-03T08:25:09.222670711Z INFO  solana_metrics::metrics] datapoint: rpc_subscriptions source="bank" num_account_subscriptions=1890i num_account_pubkeys_notified=9i num_logs_subscriptions=0i num_logs_notified=0i num_program_subscriptions=0i num_programs_notified=0i num_signature_subscriptions=3i num_signatures_notified=0i notifications_time=5912i

thread 'sol-sub-notif-9' has overflowed its stack
fatal runtime error: stack overflow
[2022-12-03T08:26:13.072229984Z INFO  solana_validator] solana-validator 1.13.5 (src:91384943; feat:1365939126)
[2022-12-03T08:26:13.072271916Z INFO  solana_validator] Starting validator with: ArgsOs {

Overflow 2:

[2022-12-03T10:27:59.625373852Z DEBUG solana_rpc::rpc_subscriptions] notified(gossip): accounts: 0 / 0 logs: 0 / 0 programs: 1512 / 1 signatures: 0 / 0
[2022-12-03T10:27:59.625413743Z INFO  solana_metrics::metrics] datapoint: rpc_subscriptions source="gossip" num_account_subscriptions=0i num_account_pubkeys_notified=0i num_logs_subscriptions=0i num_logs_notified=0i num_program_subscriptions=1512i num_programs_notified=1i num_signature_subscriptions=0i num_signatures_notified=0i notifications_time=259025i
[2022-12-03T10:27:59.633817292Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=399i slot=164647226i last_index=714i num_repaired=0i num_recovered=312i
[2022-12-03T10:27:59.675456318Z INFO  solana_metrics::metrics] datapoint: program_id_index_stats num_secondary_keys=4647i num_inner_keys=232898041i num_reverse_index_keys=232664574i
[2022-12-03T10:27:59.676062655Z INFO  solana_metrics::metrics] datapoint: accounts_db_store_timings hash_accounts=0i store_accounts=33420i update_index=332102i handle_reclaims=0i append_accounts=0i find_storage=0i num_accounts=15855i total_data=49531638i read_only_accounts_cache_entries=24026i read_only_accounts_cache_data_size=398118171i read_only_accounts_cache_hits=25758i read_only_accounts_cache_misses=47i read_only_accounts_cache_evicts=0i calc_stored_meta_us=8139i
[2022-12-03T10:27:59.676078395Z INFO  solana_metrics::metrics] datapoint: accounts_db_store_timings2 recycle_store_count=0i current_recycle_store_count=1000i current_recycle_store_bytes=1334362112i create_store_count=0i store_get_slot_store=0i store_find_existing=0i dropped_stores=0i
[2022-12-03T10:27:59.698318742Z INFO  solana_metrics::metrics] datapoint: optimistic_slot_elapsed average_elapsed_ms=341i
[2022-12-03T10:27:59.698370983Z INFO  solana_metrics::metrics] datapoint: optimistic_slot slot=164647225i
[2022-12-03T10:27:59.716364044Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster()
            slot: 164647225,
            root: 164647194,
            slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: J5KJJGeZUXAkTGt1Biz8zzsNSkY35YxwSxotQ24eodhB, bank_status: Frozen(J5KJJGeZUXAkTGt1Biz8zzsNSkY35YxwSxotQ24eodhB) })
[2022-12-03T10:27:59.716427976Z INFO  solana_core::heaviest_subtree_fork_choice] marking fork starting at: (164647225, J5KJJGeZUXAkTGt1Biz8zzsNSkY35YxwSxotQ24eodhB) valid candidate
[2022-12-03T10:27:59.716660652Z INFO  solana_metrics::metrics] datapoint: duplicate_confirmed_slot slot=164647225i duplicate_confirmed_hash="J5KJJGeZUXAkTGt1Biz8zzsNSkY35YxwSxotQ24eodhB" my_hash="J5KJJGeZUXAkTGt1Biz8zzsNSkY35YxwSxotQ24eodhB"
[2022-12-03T10:27:59.747501583Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=39i
[2022-12-03T10:27:59.747535114Z INFO  solana_metrics::metrics] datapoint: pubsub_notifications created_to_queue_time_us=22i
[2022-12-03T10:27:59.812826616Z DEBUG solana_rpc::rpc_subscriptions] notified(gossip): accounts: 0 / 0 logs: 0 / 0 programs: 1512 / 1 signatures: 0 / 0
[2022-12-03T10:27:59.812880387Z INFO  solana_metrics::metrics] datapoint: rpc_subscriptions source="gossip" num_account_subscriptions=0i num_account_pubkeys_notified=0i num_logs_subscriptions=0i num_logs_notified=0i num_program_subscriptions=1512i num_programs_notified=1i num_signature_subscriptions=0i num_signatures_notified=0i notifications_time=186940i

thread 'sol-sub-notif-8[2022-12-03T10:27:59.824133924Z INFO  solana_metrics::metrics] datapoint: poh-service ticks=182i hashes=2297984i elapsed_us=352292i total_sleep_us=786963i total_tick_time_us=75i total_lock_time_us=1589i total_hash_time_us=180578i total_record_time_us=0i total_send_record_result_us=0i
' has overflowed its stack
fatal runtime error: stack overflow
[2022-12-03T10:27:59.833734386Z INFO  solana_metrics::metrics] datapoint: memory-stats total=540707250176i swap_total=838860796i free_percent=3.4979552772491216 used_bytes=350485004288i avail_percent=35.18026544420899 buffers_percent=0.0789206121909961 cached_percent=33.00244758284926 swap_free_percent=62.97974664201615
[2022-12-03T10:28:52.578606555Z INFO  solana_validator] solana-validator 1.13.5 (src:91384943; feat:1365939126)
[2022-12-03T10:28:52.578660846Z INFO  solana_validator] Starting validator with: ArgsOs

Full validator startup flags are (with some private info redacted):

"--identity",
"redacted",
"--entrypoint",
"entrypoint.mainnet-beta.solana.com:8001",
"--entrypoint",
"entrypoint2.mainnet-beta.solana.com:8001",
"--entrypoint",
"entrypoint3.mainnet-beta.solana.com:8001",
"--entrypoint",
"entrypoint4.mainnet-beta.solana.com:8001",
"--entrypoint",
"entrypoint5.mainnet-beta.solana.com:8001",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--known-validator",
"redacted",
"--rpc-port",
"redacted",
"--dynamic-port-range",
"redacted",
"--no-port-check",
"--gossip-port",
"redacted",
"--no-untrusted-rpc",
"--no-voting",
"--private-rpc",
"--rpc-bind-address",
"0.0.0.0",
"--enable-cpi-and-log-storage",
"--enable-rpc-transaction-history",
"--enable-rpc-bigtable-ledger-storage",
"--account-index",
"program-id",
"--account-index",
"spl-token-owner",
"--account-index",
"spl-token-mint",
"--no-duplicate-instance-check",
"--wal-recovery-mode",
"skip_any_corrupted_record",
"--vote-account",
"redacted",
"--log",
"redacted",
"--accounts",
"/mt/accounts/solana-accounts",
"--ledger",
"/mt/ledger/validator-ledger",
"--limit-ledger-size",
"--rpc-send-default-max-retries",
"30",
"--rpc-send-service-max-retries",
"30",
"--rpc-send-retry-ms",
"1000",
"--full-rpc-api",
"--account-index-exclude-key",
"kinXdEcpDQeHPEuQnqmUgtYykqKGVFq6CeVX5iAHJq6",
"--account-index-exclude-key",
"metaqbxxUerdq28cj1RbAWkYQm3ybzjb6a8bt518x1s",
"--tpu-use-quic",

@tracy-codes
Copy link
Author

We restarted this same node with 8mb allocated to each thread for this process. We went from experiencing this crash every 2-ish hours to 2 days of consistent uptime.

I don't know what to do regarding how we can profile to get the stack usage of these threads in realtime or at the time of the crashes, but the solve here definitely seems to be that the stack size of these threads needs to be higher than the default (which I believe is 2mb).

@tracy-codes
Copy link
Author

Since this has sat for 2 weeks without any eyes on it, I've opened a PR with a patch that has fixed this issue for our network.

@McSim85
Copy link

McSim85 commented Dec 7, 2023

Hey @tracy-codes
Time to update the issue 😂
#34303

@McSim85
Copy link

McSim85 commented Dec 7, 2023

btw, do you apply patch diff every release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants