solana-validator crash due to thread 'solRpcNotifyXX' has overflowed its stack #34303

McSim85 · 2023-12-01T17:17:14Z

Problem

Hello team,
We are running into an issue where solana-validator crashes multiple times per day.

[2023-12-01T13:32:39.607535276Z INFO  solana_core::replay_stage] 3D6kqT9XhhkPvzfpJF9MSRaCQ4ytGTafFGa8nnnov9Gs reset PoH to tick 14934761152 (within slot 233355642). I am not in the leader schedule yet
[2023-12-01T13:32:39.607691482Z INFO  solana_core::replay_stage] new fork:233355643 parent:233355642 root:233355611

thread 'solRpcNotify03' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-01T13:33:41.175847339Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
[2023-12-01T13:33:41.175949663Z INFO  solana_validator] Starting validator with: ArgsOs

We updated to v1.16.18/20 and the problem still occurs; we first saw it on v1.14.X.
The error is always the same; the only difference is the number in the thread name solRpcNotifyXX.

 grep overflowed /home/solana/log/solana-validator.log*
/home/solana/log/solana-validator.log:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log:thread 'solRpcNotify09' has overflowed its stack
/home/solana/log/solana-validator.log:thread 'solRpcNotify16' has overflowed its stack
/home/solana/log/solana-validator.log.1:thread 'solRpcNotify15' has overflowed its stack
/home/solana/log/solana-validator.log.2:thread 'solRpcNotify01' has overflowed its stack
/home/solana/log/solana-validator.log.3:thread 'solRpcNotify04' has overflowed its stack
/home/solana/log/solana-validator.log.5:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log.5:thread 'solRpcNotify05' has overflowed its stack
/home/solana/log/solana-validator.log.6:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log.7:thread 'solRpcNotify19' has overflowed its stack
/home/solana/log/solana-validator.log.7:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log.7:thread 'solRpcNotify17' has overflowed its stack
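
To see whether particular notifier threads crash more often than others, the matches can be tallied per thread name. This is a minimal sketch: the function name `count_overflows` is hypothetical, and it assumes the log paths and message text shown above.

```shell
# Count stack-overflow crashes per thread name across the given log files.
# Usage (paths as in this report): count_overflows /home/solana/log/solana-validator.log*
count_overflows() {
  # -h drops the file-name prefix so the same thread name aggregates across files
  grep -h "has overflowed its stack" "$@" \
    | sed -E "s/.*thread '([^']+)'.*/\1/" \
    | sort \
    | uniq -c \
    | awk '{print $2, $1}' \
    | sort -k2,2nr -k1,1
}
```

Each output line is a thread name followed by its crash count, most frequent first. In the listing above, solRpcNotify03 appears four times and every other thread once.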

Most of the time the validator crashes, but occasionally only the websocket thread dies.

The validator is configured as a plain non-voting RPC node that serves both HTTP and WS traffic:

exec /home/solana/.local/share/solana/install/active_release/bin/solana-validator \
  --identity /home/solana/identity.json \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --ledger /home/solana/ledger \
  --accounts /mnt/accounts/accountsdb \
  --snapshots /mnt/snapshots \
  --log /home/solana/log/solana-validator.log \
  --gossip-port 8001 \
  --rpc-port 8899 \
  --rpc-bind-address 0.0.0.0 \
  --dynamic-port-range 8002-8102 \
  --wal-recovery-mode skip_any_corrupted_record \
  --private-rpc \
  --no-port-check \
  --enable-extended-tx-metadata-storage \
  --enable-rpc-transaction-history \
  --rpc-pubsub-enable-block-subscription \
  --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --known-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
  --limit-ledger-size 1009504738 \
  --no-voting \
  --only-known-rpc \
  --halt-on-known-validators-accounts-hash-mismatch \
  --account-index program-id spl-token-owner spl-token-mint \
  --full-rpc-api

The number of WS connections is below 1000 on average.
The server hardware is generous: NVMe drives, 1 TB of RAM, and an AMD EPYC 7443P 24-Core Processor.
The crashes happen at seemingly random times.
I tried to reproduce the problem by running benchmarks with 10000 different subscriptions, but without success.

I would appreciate any advice and will be happy to provide more logs if needed.


McSim85 · 2023-12-05T11:55:08Z

I'm going to enable RUST_BACKTRACE=full and add RUST_LOG=solana_rpc::rpc_subscriptions=debug
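
A minimal sketch of how these two variables might be exported in the start script before the `exec` line. It assumes the validator's logger accepts the usual comma-separated `target=level` filter syntax of Rust's `env_logger`. The `solana=info` default is an assumption added here to keep the rest of the output quiet; it is not part of this report.

```shell
# Hypothetical additions to the validator start script, placed before `exec`.
# Request full backtraces from the Rust runtime.
export RUST_BACKTRACE=full
# Debug output only for the subscription module; other targets whose names
# start with "solana" stay at info.
export RUST_LOG="solana=info,solana_rpc::rpc_subscriptions=debug"
```

Scoping the filter to one module keeps the log volume far lower than a global `RUST_LOG=debug`.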

McSim85 · 2023-12-06T21:00:16Z

I enabled this

RUST_LOG=debug
RUST_BACKTRACE=full

but unfortunately the crash output does not tell me much:

[2023-12-06T17:07:05.722038912Z DEBUG solana_ledger::sigverify_shreds] CPU SHRED ECDSA for 1
[2023-12-06T17:07:05.722371835Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 848us sent: 4 total: 4 total_bytes: 2247
[2023-12-06T17:07:05.724577840Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 291us sent: 7 total: 7 total_bytes: 4731
[2023-12-06T17:07:05.728418948Z DEBUG quinn::connection] drive; id=1
[2023-12-06T17:07:05.729191038Z DEBUG hyper::proto::h1::io] flushed 5487812 bytes

thread 'solRpcNotify17' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-06T17:08:59.650750600Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
[2023-12-06T17:08:59.650777531Z INFO  solana_validator] Starting validator with: ArgsOs {

McSim85 · 2023-12-06T21:02:06Z

Another crash:

[2023-12-06T17:07:05.721580725Z DEBUG solana_ledger::sigverify_shreds] CPU SHRED ECDSA for 11
[2023-12-06T17:07:05.722038912Z DEBUG solana_ledger::sigverify_shreds] CPU SHRED ECDSA for 1
[2023-12-06T17:07:05.722371835Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 848us sent: 4 total: 4 total_bytes: 2247
[2023-12-06T17:07:05.724577840Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 291us sent: 7 total: 7 total_bytes: 4731
[2023-12-06T17:07:05.728418948Z DEBUG quinn::connection] drive; id=1
[2023-12-06T17:07:05.729191038Z DEBUG hyper::proto::h1::io] flushed 5487812 bytes

thread 'solRpcNotify17' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-06T17:08:59.650750600Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
[2023-12-06T17:08:59.650777531Z INFO  solana_validator] Starting validator with: ArgsOs

McSim85 · 2023-12-06T21:03:48Z

Actually, these two lines are the same right before the crash:

[2023-12-06T17:07:05.728418948Z DEBUG quinn::connection] drive; id=1
[2023-12-06T17:07:05.729191038Z DEBUG hyper::proto::h1::io] flushed 5487812 bytes
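
To compare what precedes each crash across all rotated logs, the lines leading up to every overflow message can be pulled out in one pass. This is a rough sketch: the function name `pre_crash_context` is hypothetical, and it relies only on the standard `grep -B` option.

```shell
# Print the N log lines that precede each stack-overflow message.
# Usage (paths as in this report): pre_crash_context 5 /home/solana/log/solana-validator.log*
pre_crash_context() {
  local n="$1"
  shift
  # -B N prints N lines of leading context; -h omits file-name prefixes.
  # grep separates non-adjacent matches with a "--" line.
  grep -h -B "$n" "has overflowed its stack" "$@"
}
```

If the same module keeps showing up in the context of several crashes, that narrows down which log target is worth raising to debug level.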

McSim85 · 2023-12-06T23:47:11Z

One more crash, but this time without the "flushed X bytes" line:

[2023-12-06T23:20:32.578621578Z DEBUG quinn::connection] drive; id=1
[2023-12-06T23:20:32.581085277Z DEBUG solana_rpc::rpc_pubsub_service] new client (10.31.1.14:47114)
[2023-12-06T23:20:32.581272034Z DEBUG solana_rpc::rpc_subscription_tracker] Total existing subscriptions: 2729
[2023-12-06T23:20:32.581297645Z DEBUG rpc] Response: {"jsonrpc":"2.0","result":10200,"id":1}.

thread 'solRpcNotify06' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-06T23:22:16.313985866Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)

McSim85 added the community label (Community contribution) on Dec 1, 2023

This was referenced on Dec 7, 2023:

Increase ws thread stack size #29318 (Closed)

Sub Notif Stack Overflow Error #29032 (Open)
