Sub Notif Stack Overflow Error #29032
Comments
We are going to restart our nodes with
We have caught this error a few times in the last 24 hours. Below are the logs from each overflow.
Overflow 1:
Overflow 2:
Full validator startup flags are (with some private info redacted):
We restarted this same node with 8 MB allocated to each thread for this process, and went from hitting this crash roughly every two hours to two days of consistent uptime. I don't know how we can profile these threads to get their stack usage in real time or at the moment of the crash, but the fix here definitely seems to be raising the stack size of these threads above the default (which I believe is 2 MB).
Since this has sat for two weeks without any eyes on it, I've opened a PR with a patch that fixes this issue for our network.
Hey @tracy-codes
Btw, do you apply the patch diff on every release?
Hi all, we're reporting a stack overflow error we've encountered that causes the solana service to restart. It is seemingly random, as we aren't able to reproduce it with any specific load tests; the nodes are sometimes under heavy load and sometimes under next to no load when the error happens.
The specific error we see in our logs is:
Below are some logs just before and right after the error (you can see the validator service just restarts itself).
Even with debug flags enabled, we do not get any other info from this error aside from those two lines.

edit: wrong debug flag for this. We are going to enable the proper ones for websockets.

This error causes nodes to restart at random and is difficult to contain, causing RPC cluster instability.