High transaction latency mainnet-beta Nov 3 2023 #34107
Comments
hmm so something in
What's the 7-4 deep slots?
The 7-4 deep votes on the tower. We have the constraint that our 8th-deep slot/vote needs to be OC in order to vote, but we could probably mitigate lockout times by adding another constraint such as "the 4th-deep slot/vote on the tower needs to reach the fork-switch threshold (38%) in order to keep voting". This treats the symptom rather than the true problems (delayed broadcast, non-deterministic fork selection), but given the number of underlying causes that have led to this same problem (competing leader blocks --> partition --> large lockouts --> longer latency), treating the symptom to stop the bleeding seems worthwhile.
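A minimal sketch of the extra threshold check described above. The constant names, depths, and the `can_keep_voting` helper are illustrative assumptions, not the actual implementation (see #34120 for the real change); stake fractions are modeled as plain `f64` values indexed by tower depth.

```rust
// Hypothetical sketch of a shallower vote stake threshold constraint.
// Assumption: `stake_at_depth[d]` is the fraction of total stake observed
// voting on the slot `d + 1` deep in this validator's tower.
const SWITCH_FORK_THRESHOLD: f64 = 0.38; // 38% fork-switch threshold
const DEEP_THRESHOLD_DEPTH: usize = 8; // existing 8th-deep constraint
const SHALLOW_THRESHOLD_DEPTH: usize = 4; // proposed 4th-deep constraint

fn can_keep_voting(stake_at_depth: &[f64], deep_threshold: f64) -> bool {
    // Existing constraint: the 8th-deep vote must meet the deep threshold.
    let deep_ok = stake_at_depth
        .get(DEEP_THRESHOLD_DEPTH - 1)
        .map_or(true, |&s| s >= deep_threshold);
    // Proposed extra constraint: the 4th-deep vote must have reached the
    // fork-switch threshold (38%) before we keep voting.
    let shallow_ok = stake_at_depth
        .get(SHALLOW_THRESHOLD_DEPTH - 1)
        .map_or(true, |&s| s >= SWITCH_FORK_THRESHOLD);
    deep_ok && shallow_ok
}

fn main() {
    // Tower whose 4th-deep slot has only 30% stake: stop voting early,
    // limiting how much lockout we accumulate on a minority fork.
    let lagging = [0.9, 0.8, 0.6, 0.30, 0.9, 0.9, 0.9, 0.7];
    // Healthy tower: every depth clears both thresholds.
    let healthy = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.7];
    println!("{}", can_keep_voting(&lagging, 0.67)); // false
    println!("{}", can_keep_voting(&healthy, 0.67)); // true
}
```

The shallower check fires sooner than the 8th-deep one, so a validator stops stacking lockouts four slots earlier when its fork fails to attract stake.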
Having a dedicated thread that produces the leader block also makes sense. It can just receive the latest reset information from before.
First cut at adding another vote stake threshold constraint: #34120
I think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. The latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders.
#34120 has been merged into master and should help mitigate the negative impacts of these events. So far it has not been backported and would be part of a v1.18 release.
Problem
High transaction latency occurred due to heavy forking and large tower distance. This appeared to be caused by a leader's delayed slot, which took over 1 second.
Start of discord analysis about event:
https://discord.com/channels/428295358100013066/1020131815362666496/1169988874441859163
Metrics of delayed leader:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20as%20%22total_elapsed_us%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20as%20%22compute_bank_stats_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20as%20%22start_leader_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20as%20%22generate_new_bank_forks_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20as%20%22replay_active_banks_elapsed%22%2C%20max%28%22process_popular_pruned_forks_elapsed%22%29%20as%20%22process_popular_pruned_forks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20as%20%22replay_blockstore_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20as%20%22collect_frozen_banks_elapsed%22%2C%20max%28%22select_vote_and_reset_forks_elapsed%22%29%20as%20%22select_vote_and_reset_forks_elapsed%22%2C%20max%28%22reset_bank_elapsed%22%29%20as%20%22reset_bank_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20as%20%22voting_elapsed%22%2C%20max%28%22select_forks_elapsed%22%29%20as%20%22select_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%281ms%29%20FILL%28none%29
Leader schedule and block production around the event:
A potential problem appears to be high vote latency in the replay stage, which could have affected `maybe_leader_start`.
Related to #18308
Proposed Solution