solana-validator crash due to thread 'solRpcNotifyXX' has overflowed its stack #34303

Open
McSim85 opened this issue Dec 1, 2023 · 5 comments
Labels
community Community contribution

McSim85 commented Dec 1, 2023

Problem

Hello team,
We are running into an issue where solana-validator crashes multiple times per day.

[2023-12-01T13:32:39.607535276Z INFO  solana_core::replay_stage] 3D6kqT9XhhkPvzfpJF9MSRaCQ4ytGTafFGa8nnnov9Gs reset PoH to tick 14934761152 (within slot 233355642). I am not in the leader schedule yet
[2023-12-01T13:32:39.607691482Z INFO  solana_core::replay_stage] new fork:233355643 parent:233355642 root:233355611

thread 'solRpcNotify03' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-01T13:33:41.175847339Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
[2023-12-01T13:33:41.175949663Z INFO  solana_validator] Starting validator with: ArgsOs

We updated to v1.16.18/20 and the problem still occurs; we first saw it on v1.14.X.
The error is always the same; the only difference is the number in the solRpcNotifyXX thread name.

 grep overflowed /home/solana/log/solana-validator.log*
/home/solana/log/solana-validator.log:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log:thread 'solRpcNotify09' has overflowed its stack
/home/solana/log/solana-validator.log:thread 'solRpcNotify16' has overflowed its stack
/home/solana/log/solana-validator.log.1:thread 'solRpcNotify15' has overflowed its stack
/home/solana/log/solana-validator.log.2:thread 'solRpcNotify01' has overflowed its stack
/home/solana/log/solana-validator.log.3:thread 'solRpcNotify04' has overflowed its stack
/home/solana/log/solana-validator.log.5:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log.5:thread 'solRpcNotify05' has overflowed its stack
/home/solana/log/solana-validator.log.6:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log.7:thread 'solRpcNotify19' has overflowed its stack
/home/solana/log/solana-validator.log.7:thread 'solRpcNotify03' has overflowed its stack
/home/solana/log/solana-validator.log.7:thread 'solRpcNotify17' has overflowed its stack
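
To see whether any particular notifier thread crashes more often than the others, the grep above can be turned into a per-thread tally. This is a minimal sketch; the function name `count_overflows` is made up for illustration, and it assumes the crash line always has the exact form shown in the logs above.

```shell
#!/usr/bin/env bash
# Tally stack-overflow crashes per thread name across the given log files.
# `count_overflows` is a hypothetical helper name, not part of any Solana tooling.
count_overflows() {
  # -h drops the filename prefix so the same thread name collapses across files;
  # -o prints only the matching part of each line.
  grep -ho "thread '[^']*' has overflowed its stack" "$@" \
    | sed "s/^thread '\([^']*\)'.*/\1/" \
    | sort \
    | uniq -c \
    | awk '{print $1 "\t" $2}' \
    | sort -k1,1nr -k2,2
}
```

Usage: `count_overflows /home/solana/log/solana-validator.log*` prints one line per thread, with the crash count first and the most frequent thread at the top.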

Most of the time the validator crashes, but occasionally only the websocket thread dies.

The validator is a vanilla non-voting RPC node that serves both HTTP and WS traffic:

exec /home/solana/.local/share/solana/install/active_release/bin/solana-validator \
  --identity /home/solana/identity.json \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --ledger /home/solana/ledger \
  --accounts /mnt/accounts/accountsdb \
  --snapshots /mnt/snapshots \
  --log /home/solana/log/solana-validator.log \
  --gossip-port 8001 \
  --rpc-port 8899 \
  --rpc-bind-address 0.0.0.0 \
  --dynamic-port-range 8002-8102 \
  --wal-recovery-mode skip_any_corrupted_record \
  --private-rpc \
  --no-port-check \
  --enable-extended-tx-metadata-storage \
  --enable-rpc-transaction-history \
  --rpc-pubsub-enable-block-subscription \
  --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --known-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
  --limit-ledger-size 1009504738 \
  --no-voting \
  --only-known-rpc \
  --halt-on-known-validators-accounts-hash-mismatch \
  --account-index program-id spl-token-owner spl-token-mint \
  --full-rpc-api

The number of WS connections is below 1000 on average.
The server hardware is generous: NVMe drives, 1 TB RAM, and an AMD EPYC 7443P 24-Core Processor.
The crashes happen at seemingly random times.
I tried to reproduce the issue by running benchmarks with 10000 different subscriptions, but without success.

I would appreciate any advice and will be happy to provide more logs if needed.

McSim85 added the community (Community contribution) label Dec 1, 2023

McSim85 commented Dec 5, 2023

Gonna enable RUST_BACKTRACE=full and add RUST_LOG=solana_rpc::rpc_subscriptions=debug

McSim85 commented Dec 6, 2023

I enabled this

RUST_LOG=debug
RUST_BACKTRACE=full

but the crash does not tell me much, unfortunately

[2023-12-06T17:07:05.722038912Z DEBUG solana_ledger::sigverify_shreds] CPU SHRED ECDSA for 1
[2023-12-06T17:07:05.722371835Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 848us sent: 4 total: 4 total_bytes: 2247
[2023-12-06T17:07:05.724577840Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 291us sent: 7 total: 7 total_bytes: 4731
[2023-12-06T17:07:05.728418948Z DEBUG quinn::connection] drive; id=1
[2023-12-06T17:07:05.729191038Z DEBUG hyper::proto::h1::io] flushed 5487812 bytes

thread 'solRpcNotify17' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-06T17:08:59.650750600Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
[2023-12-06T17:08:59.650777531Z INFO  solana_validator] Starting validator with: ArgsOs {
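
The "has overflowed its stack" message comes from the Rust runtime's abort path rather than from a panic, so RUST_BACKTRACE does not add a backtrace to it, and a global RUST_LOG=debug mostly surfaces unrelated gossip and QUIC lines. One option is to narrow the filter to the RPC crate. The snippet below is only a sketch: the filter value is an assumption chosen to cover the `solana_rpc::...` targets, not a setting recommended by the maintainers.

```shell
#!/usr/bin/env bash
# Sketch: set the logging environment before the existing `exec solana-validator ...` line.
# The filter value is an assumption, not an officially recommended setting.

# Default every target to info and raise only the solana_rpc crate to debug.
# (env_logger-style syntax: comma-separated `target=level`; a bare level is the default.)
export RUST_LOG="info,solana_rpc=debug"

# Kept from the original attempt; it affects panics, not the stack-overflow abort.
export RUST_BACKTRACE=full
```

Targets are matched by prefix, so `solana_rpc=debug` should also cover submodules such as `solana_rpc::rpc_pubsub_service`.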

McSim85 commented Dec 6, 2023

another crash

[2023-12-06T17:07:05.721580725Z DEBUG solana_ledger::sigverify_shreds] CPU SHRED ECDSA for 11
[2023-12-06T17:07:05.722038912Z DEBUG solana_ledger::sigverify_shreds] CPU SHRED ECDSA for 1
[2023-12-06T17:07:05.722371835Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 848us sent: 4 total: 4 total_bytes: 2247
[2023-12-06T17:07:05.724577840Z DEBUG solana_gossip::cluster_info] handle_pull_requests: handle_pull_requests took 291us sent: 7 total: 7 total_bytes: 4731
[2023-12-06T17:07:05.728418948Z DEBUG quinn::connection] drive; id=1
[2023-12-06T17:07:05.729191038Z DEBUG hyper::proto::h1::io] flushed 5487812 bytes

thread 'solRpcNotify17' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-06T17:08:59.650750600Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
[2023-12-06T17:08:59.650777531Z INFO  solana_validator] Starting validator with: ArgsOs

McSim85 commented Dec 6, 2023

Actually, these two lines are the same before the crash:

[2023-12-06T17:07:05.728418948Z DEBUG quinn::connection] drive; id=1
[2023-12-06T17:07:05.729191038Z DEBUG hyper::proto::h1::io] flushed 5487812 bytes
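
Comparing the last few log lines before each crash can be done in one pass over all rotated logs instead of by hand. A minimal sketch, assuming GNU or BSD grep; `context_before_overflow` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the N log lines that precede every stack-overflow message.
# `context_before_overflow` is a made-up name used only for this illustration.
context_before_overflow() {
  local n="$1"   # number of preceding lines to show
  shift          # remaining arguments are the log files
  # -B prints N lines of leading context; -h omits filename prefixes.
  grep -h -B "$n" "has overflowed its stack" "$@"
}
```

Usage: `context_before_overflow 5 /home/solana/log/solana-validator.log*`. When a file contains several crashes, grep separates the groups with a `--` line.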

McSim85 commented Dec 6, 2023

One more crash, but this time without the "flushed X bytes" line:

[2023-12-06T23:20:32.578621578Z DEBUG quinn::connection] drive; id=1
[2023-12-06T23:20:32.581085277Z DEBUG solana_rpc::rpc_pubsub_service] new client (10.31.1.14:47114)
[2023-12-06T23:20:32.581272034Z DEBUG solana_rpc::rpc_subscription_tracker] Total existing subscriptions: 2729
[2023-12-06T23:20:32.581297645Z DEBUG rpc] Response: {"jsonrpc":"2.0","result":10200,"id":1}.

thread 'solRpcNotify06' has overflowed its stack
fatal runtime error: stack overflow
[2023-12-06T23:22:16.313985866Z INFO  solana_validator] solana-validator 1.16.18 (src:612616be; feat:4033350765, client:SolanaLabs)
