~1% of nodes are repairing almost every shred in almost every slot #33166
Prior to the first restart on 8/25, the validator was receiving some turbine shreds and some incoming repair requests. Following the restart, no incoming turbine shreds or repair requests are seen.
Looking at the gossip spy data, I see a bunch of the above nodes broadcasting localhost as their IP address.
switch between turbine and repair quite frequently without a node restart. I checked some logs and they indeed send repair requests from the same IP address, so it is very weird why they stop receiving shreds from turbine. Also no
For staked nodes we always include them in the tree even if we don't have their gossip contact-info.
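That inclusion rule can be sketched roughly as follows. This is a simplified illustration, not solana-labs code: `retransmit_peers`, the stake map, and the addresses are all hypothetical. The point is that a staked node with no known ContactInfo still occupies a tree slot, but no shreds can actually be delivered to it.

```python
# Minimal sketch (not actual validator code) of keeping staked nodes in the
# retransmit tree even when their gossip ContactInfo is unknown.
def retransmit_peers(stakes: dict, contact_info: dict) -> list:
    """Return (pubkey, maybe_address) entries for the turbine tree.

    stakes: pubkey -> stake amount; contact_info: pubkey -> tvu address.
    Staked nodes are always included; unstaked nodes only if gossip knows them.
    """
    peers = []
    for pubkey in stakes:
        # Address may be None if ContactInfo never propagated in gossip.
        peers.append((pubkey, contact_info.get(pubkey)))
    for pubkey, addr in contact_info.items():
        if pubkey not in stakes:
            peers.append((pubkey, addr))
    # Sort by stake, descending, so high-stake nodes sit near the root.
    peers.sort(key=lambda p: stakes.get(p[0], 0), reverse=True)
    return peers

peers = retransmit_peers(
    stakes={"A": 100, "B": 50},
    contact_info={"A": "1.2.3.4:8001", "C": "5.6.7.8:8001"},
)
# "B" is staked but has no known address: it is in the tree yet unreachable,
# so it would have to repair everything it should have received via turbine.
```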
From a 3rd validator's logs, I can see they are sending repair responses to the same IP address, so it seems like the IP address is OK. Might be some issue at the tvu port.
Yeah, it is all UDP. No QUIC code on the v1.14 or v1.16 branch.
Really interesting data point that I don't know what to make of... The periods that This means the next time we should see
Looks like both of these nodes are in 59713-RU-Kursk.
I sent some random packets to the TVU socket and the repair socket of the below nodes, and none show up in the metrics. Quite odd that even the metrics from their repair socket do not show a spike from the random packets, even though they are able to receive repair responses on that socket.
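A probe of that sort could look roughly like the snippet below; the helper name is hypothetical and the host/port are placeholders, not real validators:

```python
# Fire a few random UDP datagrams at a node's TVU/repair port, then check
# whether its packet-count metrics move. Target host/port are placeholders.
import os
import socket

def send_random_packets(host: str, port: int, count: int = 5, size: int = 64) -> int:
    """Send `count` random UDP datagrams of `size` bytes; return total bytes sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    total = 0
    try:
        for _ in range(count):
            total += sock.sendto(os.urandom(size), (host, port))
    finally:
        sock.close()
    return total
```

Since UDP is connectionless, a lack of errors here says nothing about delivery; the only signal is whether the receiving node's metrics register the packets.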
This could be due to NAT: incoming packets from a remote address are only permitted following an outgoing packet to that address.
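A tiny model of that NAT behavior illustrates why repair responses get through while unsolicited probes and turbine shreds are dropped (class and method names are purely illustrative):

```python
# Model of address-dependent NAT filtering: a mapping is created by an
# outbound packet, and only then are inbound packets from that remote allowed.
class SymmetricNat:
    def __init__(self):
        self.allowed = set()  # remote (ip, port) pairs we have sent to

    def outbound(self, remote):
        self.allowed.add(remote)

    def inbound_permitted(self, remote) -> bool:
        return remote in self.allowed

nat = SymmetricNat()
assert not nat.inbound_permitted(("1.2.3.4", 8001))  # random probe: dropped
nat.outbound(("1.2.3.4", 8001))                      # repair request goes out
assert nat.inbound_permitted(("1.2.3.4", 8001))      # repair response passes
```

Turbine shreds arrive unsolicited from peers the node never sent to, so under this filtering they are dropped just like the random probe packets.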
Two of the nodes:
have identified issues in their firewall configuration and have updated the firewall to open the respective ports/protocols.
We realized that we hadn't updated our firewall config to open the following ports:
Apologies for the trouble.
Problem

Based on the shred_insert_is_full metric from mainnet-beta, it seems like ~1% of nodes are repairing almost every shred in every slot.

Proposed Solution

ContactInfo is not propagated in gossip, but why?

In the time interval (7907 slots), the offending nodes seem to be:

Note that the majority of the above nodes are staked, in particular
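One way to flag such nodes from per-node counters over a slot window could be sketched like this; the counter field names are hypothetical, not the actual shred_insert_is_full schema:

```python
# Hedged sketch: flag nodes whose repaired/total shred ratio over a slot
# window exceeds a threshold. Field names here are illustrative only.
def offending_nodes(counters: dict, threshold: float = 0.9) -> list:
    """counters: pubkey -> {"repaired": int, "total": int} over some window."""
    out = []
    for pubkey, c in counters.items():
        if c["total"] > 0 and c["repaired"] / c["total"] >= threshold:
            out.append(pubkey)
    return sorted(out)

flagged = offending_nodes({
    "good": {"repaired": 10, "total": 1000},
    "bad": {"repaired": 990, "total": 1000},
})
# Only "bad" crosses the 90% repair-ratio threshold.
```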
cc @jbiseda @bw-solana