Debug why nodes have to repair shreds despite large erasure batches #28638
Comments
I'm observing on the order of 100 shred repair requests being issued per second, but the vast majority of these never get inserted into the blockstore because they already exist (having propagated through turbine or been recovered eventually). Only ~1 repair per second actually gets inserted into the blockstore. As an experiment, I increased the … Note that @steviez collected a similar set of data in #28634 that shows repaired shreds colliding with turbine/recovery. I believe this answers a large part of the question posed by this issue. What is still not clear to me is whether the remaining successful repairs are absolutely necessary (or whether turbine/recovery would have gotten those shreds into the blockstore eventually), and whether the number makes sense given the expected packet loss.
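A minimal sketch of how that inserted-vs-redundant split can be tallied (the struct and method names are hypothetical, not the actual blockstore API):

```rust
// Illustrative only: the names here are hypothetical, not the real blockstore
// API; the point is just how "redundant" repairs are counted.
#[derive(Default)]
struct RepairInsertStats {
    inserted: u64,  // repaired shred was still missing and got written
    redundant: u64, // shred already present via turbine or erasure recovery
}

impl RepairInsertStats {
    fn record(&mut self, already_present: bool) {
        if already_present {
            self.redundant += 1;
        } else {
            self.inserted += 1;
        }
    }

    /// Fraction of repair traffic that turned out to be unnecessary.
    fn redundant_ratio(&self) -> f64 {
        let total = self.inserted + self.redundant;
        if total == 0 { 0.0 } else { self.redundant as f64 / total as f64 }
    }
}

fn main() {
    let mut stats = RepairInsertStats::default();
    // ~100 repairs per second observed, ~1 per second actually inserted.
    (0..100).for_each(|i| stats.record(i != 0));
    assert!(stats.redundant_ratio() > 0.98);
}
```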
Two actions that can be taken to cut down on repair-related network traffic:
Are the stats gathered in #28638 (comment) with the current 100ms?
Within a "batch"? Not "shred", right?
Any idea how far we can push this before it starts negatively impacting the node?
They are gathered with …
Good catch. I think I meant to say "erasure set" here.
Not completely sure. 200 seems to cut down the noise by an order of magnitude without impacting slot times. I previously saw a further repair-request reduction when moving to 400, but I didn't capture block insertion times for that run. I actually don't hate the current wait mechanism: it uses a reference time for each shred as an offset from the first shred received time plus some arbitrary turbine delay time.
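A rough sketch of that wait mechanism as described (the constants and names here are illustrative, not the actual repair-service code):

```rust
use std::time::{Duration, Instant};

// Illustrative values only; the thread discusses raising the defer threshold
// from the current 100ms to 200ms (and experimenting with 400ms).
const DEFER_REPAIR_THRESHOLD: Duration = Duration::from_millis(200);
const ASSUMED_TURBINE_DELAY: Duration = Duration::from_millis(50);

/// Each shred gets a reference time derived from when the slot's first shred
/// arrived plus an assumed turbine propagation delay; repair for that shred
/// is deferred until the reference time plus the threshold has elapsed.
fn should_request_repair(first_shred_received: Instant, now: Instant) -> bool {
    let reference_time = first_shred_received + ASSUMED_TURBINE_DELAY;
    now >= reference_time + DEFER_REPAIR_THRESHOLD
}

fn main() {
    let first_shred_received = Instant::now();
    // Immediately after the first shred arrives, repair should be deferred.
    assert!(!should_request_repair(first_shred_received, Instant::now()));
}
```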
This is another idea we can try.
The idea is that each node has somewhat unique network connectivity (latency/throughput/jitter), and this dynamic adjustment would allow it to monitor and tune its repairs accordingly. Thoughts? @behzadnouri, @bw-solana
Makes sense to me. We should try it out on a few nodes. What's a good target redundant repair shred rate? 50%?
I can hack up something and try it on the monogon cluster first.
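For illustration, a feedback loop of that kind might look roughly like this (all names and constants are made up; it only sketches the dynamic-adjustment idea, clamped to hard bounds of the kind raised further down the thread):

```rust
use std::time::Duration;

// Hypothetical constants for sketching purposes only.
const MIN_DEFER: Duration = Duration::from_millis(100);
const MAX_DEFER: Duration = Duration::from_millis(400);
const TARGET_REDUNDANT_RATE: f64 = 0.5; // e.g. the 50% target floated above
const STEP: Duration = Duration::from_millis(25);

/// Nudge the defer threshold up when too many repairs turn out redundant,
/// and back down when almost all repairs are actually needed.
fn adjust_defer_threshold(current: Duration, redundant_rate: f64) -> Duration {
    let proposed = if redundant_rate > TARGET_REDUNDANT_RATE {
        current.saturating_add(STEP)
    } else {
        current.saturating_sub(STEP)
    };
    proposed.clamp(MIN_DEFER, MAX_DEFER)
}

fn main() {
    // 80% of repairs redundant -> back off; 10% redundant -> repair sooner.
    assert_eq!(adjust_defer_threshold(Duration::from_millis(200), 0.8),
               Duration::from_millis(225));
    assert_eq!(adjust_defer_threshold(Duration::from_millis(200), 0.1),
               Duration::from_millis(175));
}
```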
I am not against trying this out and seeing if it results in an improvement or any interesting insights. However, please keep in mind that with these kinds of dynamic adjustments it tends to become very difficult to predict behavior or to debug when things break, especially when there are a lot of other dynamic components or moving parts they interact with.
Noted. Would keeping the dynamism in check with upper/lower bounds be a way forward?
Sure, let's run some experiments and evaluate. Thank you.
I think it would be helpful if the plot shows the total number of repair shreds as well.
Because repair shreds have a trailing unique nonce, I believe the deduper in sigverify-stage does not filter them out (aside from the false positive rate of the deduper). |
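A small sketch of why a whole-packet deduper would not catch these (illustrative only, not the actual sigverify-stage code): two repair responses carrying the same shred but different trailing nonces hash to different keys.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of a packet-level deduper keyed on the full payload bytes.
fn packet_key(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let shred_payload = vec![0xAAu8; 8]; // identical shred content in both packets
    let mut repair_a = shred_payload.clone();
    repair_a.extend_from_slice(&1u32.to_le_bytes()); // trailing nonce 1
    let mut repair_b = shred_payload;
    repair_b.extend_from_slice(&2u32.to_le_bytes()); // trailing nonce 2

    // Different nonces => different keys, so a whole-packet deduper keeps both
    // copies even though they carry the same shred.
    assert_ne!(packet_key(&repair_a), packet_key(&repair_b));
}
```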
I'll take a look. Thanks @behzadnouri |
This seems in line with results we've seen in the past. One thing I wonder about the duplicate repaired shreds is how many fall into each of these cases:
(2) is the one that really has me thinking. I'm wondering if we should limit outstanding repair requests to what we need to kick things over to recovery.
(2) is a fair point, since erasure recovery can fill in the missing shreds for which repairs were already requested. Not sure if there's any clever way to detect it, though. I don't think we store any recovery-related information in the blockstore today.
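For reference, the idea in (2) of only requesting enough shreds to make the erasure set recoverable might look roughly like this (hypothetical names; assumes a 32:32 Reed-Solomon set where any 32 of the 64 shreds suffice for recovery):

```rust
/// Sketch of the "only request enough to trigger recovery" idea from (2).
/// Names and layout are illustrative only.
const SHREDS_NEEDED_FOR_RECOVERY: usize = 32;

fn num_repairs_to_request(shreds_received: usize, num_missing: usize) -> usize {
    let still_needed = SHREDS_NEEDED_FOR_RECOVERY.saturating_sub(shreds_received);
    // Never request more than what is actually missing from the set.
    still_needed.min(num_missing)
}

fn main() {
    // With 30 of 64 shreds on hand and 34 missing, 2 repairs suffice to make
    // the set recoverable, instead of requesting all 34.
    assert_eq!(num_repairs_to_request(30, 34), 2);
}
```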
I am guessing this is still significantly underestimating the number of redundant repair requests, or something else is going on that we don't yet understand. According to this binomial distribution calculator: https://stattrek.com/online-calculator/binomial …
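A sketch of the kind of binomial estimate such a calculator produces, under an assumed model of independent per-shred loss (the loss rate and the independence assumption are mine, not from the thread):

```rust
/// A 32:32 erasure set is unrecoverable only if fewer than 32 of its 64
/// shreds arrive; with independent per-shred loss probability `p`, that is
/// P(X < 32) for X ~ Binomial(64, 1 - p).
fn prob_set_unrecoverable(p: f64) -> f64 {
    let n: u64 = 64;
    let needed: u64 = 32;
    (0..needed)
        .map(|k| binomial(n, k) * (1.0 - p).powi(k as i32) * p.powi((n - k) as i32))
        .sum::<f64>()
}

/// Binomial coefficient C(n, k) computed in floating point.
fn binomial(n: u64, k: u64) -> f64 {
    (0..k).fold(1.0, |acc, i| acc * (n - i) as f64 / (i + 1) as f64)
}

fn main() {
    // Even with 10% independent shred loss, losing 33+ shreds of a 64-shred
    // set is vanishingly unlikely, so nonzero repair rates need explaining.
    println!("P(unrecoverable | p = 0.10) = {:e}", prob_set_unrecoverable(0.10));
}
```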
I'll do some more digging. A few points to note:
What is the topology of the monogon cluster? Are the nodes geographically distributed?
Problem
Shreds are propagated using 32:32 data:code erasure batches.
Despite that, even at times when delinquent stake is pretty low and the cluster is working fine, we still observe a small yet non-zero rate of repair shreds on the cluster.
Similarly, on a small GCE cluster, repair metrics are still non-zero.
One possibility is that repair is too aggressive, and some repaired shreds are ultimately received from Turbine anyway or recovered through erasure codes.
Proposed Solution
Debug why nodes still repair shreds even when delinquent stake is low and the cluster is not overloaded.