
High transaction latency mainnet-beta Nov 3 2023 #34107

Open
sakridge opened this issue Nov 16, 2023 · 9 comments
Labels
consensus Issues related to consensus bugs in the validator

Comments

@sakridge
Member

sakridge commented Nov 16, 2023

Problem

High transaction latency occurred due to heavy forking and large tower distance. This appears to have been caused by a leader's slot being delayed by over 1 second.

Start of the Discord analysis of the event:
https://discord.com/channels/428295358100013066/1020131815362666496/1169988874441859163

Metrics of delayed leader:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20as%20%22total_elapsed_us%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20as%20%22compute_bank_stats_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20as%20%22start_leader_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20as%20%22generate_new_bank_forks_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20as%20%22replay_active_banks_elapsed%22%2C%20max%28%22process_popular_pruned_forks_elapsed%22%29%20as%20%22process_popular_pruned_forks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20as%20%22replay_blockstore_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20as%20%22collect_frozen_banks_elapsed%22%2C%20max%28%22select_vote_and_reset_forks_elapsed%22%29%20as%20%22select_vote_and_reset_forks_elapsed%22%2C%20max%28%22reset_bank_elapsed%22%29%20as%20%22reset_bank_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20as%20%22voting_elapsed%22%2C%20max%28%22select_forks_elapsed%22%29%20as%20%22select_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%281ms%29%20FILL%28none%29

Leader schedule and block production around the event:

  227740016       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740017       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740018       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740019       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740020       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740021       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740022       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740023       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749  <== delayed block
  227740024       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740025       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740026       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740027       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740028       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740029       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740030       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740031       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740032       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740033       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740034       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740035       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740036       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740037       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740038       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740039       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740040       56dfPEcmfA7dJLq2T7CV3WCVCrhnPBDR4ukshnRgWFCn SKIPPED
(chart: high-tower-distance)

The potential problem seems to be high vote latency in the replay stage, which could have affected maybe_leader_start.
(screenshot: 2023-11-15 at 10:56:05 PM)

Related to #18308

Proposed Solution

  • Dedicated thread for starting leader?
  • Add a stricter threshold check (sliding observed stake on the 7-4 deep slots), which would reduce the lockout on these minor forks
  • Add the shred requirement from MCL - only vote early on blocks that have a high probability of being observed by the rest of the cluster
  • Rework the fork choice tie break to prefer later slots - penalizes blocks that are sent out late (a rough sketch follows this list)
  • Ingest gossip votes in fork choice (complex) - helps in situations where the next couple of leaders are on the minor fork
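
To make the tie-break bullet concrete, here is a minimal, hypothetical sketch of a stake comparison whose tie break prefers the later slot. ForkCandidate and stake_weight are invented names for illustration, not the validator's actual fork choice types, and the real heaviest-subtree fork choice is considerably more involved.

```rust
use std::cmp::Ordering;

// Invented stand-in for a fork tip under consideration.
#[derive(Debug)]
struct ForkCandidate {
    slot: u64,
    stake_weight: u64, // stake observed voting for this fork's tip
}

/// Tie break that favors the earlier slot when stake is equal.
fn cmp_prefer_earlier(a: &ForkCandidate, b: &ForkCandidate) -> Ordering {
    a.stake_weight
        .cmp(&b.stake_weight)
        .then(b.slot.cmp(&a.slot)) // lower slot wins the tie
}

/// Proposed flavor: favor the later slot when stake is equal, so a block that
/// was broadcast late (and therefore sits at an older slot) loses the tie.
fn cmp_prefer_later(a: &ForkCandidate, b: &ForkCandidate) -> Ordering {
    a.stake_weight
        .cmp(&b.stake_weight)
        .then(a.slot.cmp(&b.slot)) // higher slot wins the tie
}

fn main() {
    let minor = ForkCandidate { slot: 227_740_023, stake_weight: 10 };
    let major = ForkCandidate { slot: 227_740_032, stake_weight: 10 };
    let earlier = std::cmp::max_by(&minor, &major, |a, b| cmp_prefer_earlier(a, b));
    let later = std::cmp::max_by(&minor, &major, |a, b| cmp_prefer_later(a, b));
    println!("earlier-preferring tie break picks {}", earlier.slot);
    println!("later-preferring tie break picks {}", later.slot);
}
```

The intent, per the bullet above, is that a leader who broadcasts late ends up losing ties against newer slots instead of winning them.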
@sakridge
Member Author

sakridge commented Nov 16, 2023

cc @carllin @bw-solana @AshwinSekar

@sakridge sakridge changed the title from "High transaction latency mainnet-beta Nov. 3 2023" to "High transaction latency mainnet-beta Nov 3 2023" on Nov 16, 2023
@carllin
Contributor

carllin commented Nov 16, 2023

hmm so something in handle_votable_bank is taking a long time, thus delaying their slot which in turn races with the blocks produced by future leaders?

handle_votable_bank seems pretty tame at first look; the vote push to gossip and the gossip lock are handled in a different thread in push_vote. Gotta check whether these voting delay spikes happen frequently or if this was a rare occurrence

Add a stricter threshold check (sliding observed stake on the 7-4 deep slots), which would reduce the lockout on these minor forks

What are the 7-4 deep slots?

@bw-solana
Contributor

bw-solana commented Nov 16, 2023

What are the 7-4 deep slots?

The 7-4 deep votes on the tower. We have the constraint that our 8th-deep slot/vote needs to be OC in order to vote, but we could probably mitigate lockout times by adding another constraint such as "the 4th-deep slot/vote on the tower needs to reach the fork switch threshold (38%) in order to keep voting"

This is treating the symptom rather than the true problems (delayed broadcast, non-deterministic fork selection), but given the number of underlying causes we've had that lead to this same chain (competing leader blocks --> partition --> large lockouts --> longer latency), treating the symptom to mitigate the bleeding seems worthwhile
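
For illustration only, here is a rough sketch of what that additional constraint could look like, assuming hypothetical TowerSlot/observed_stake bookkeeping rather than the real tower and vote-stake tracking code (the actual first cut is #34120, linked further down).

```rust
// Sketch of the extra voting constraint described above: keep the existing
// rule that the 8th-deep tower slot must be optimistically confirmed, and
// additionally require the 4th-deep slot to have reached the 38% fork switch
// threshold before voting again. All names here are illustrative.

const SWITCH_FORK_THRESHOLD: f64 = 0.38; // fork switch threshold from the comment above
const OPTIMISTIC_CONFIRMATION_THRESHOLD: f64 = 2.0 / 3.0; // assumed OC threshold

struct TowerSlot {
    slot: u64,
    observed_stake: u64, // stake observed voting on this slot or its descendants
}

fn meets_depth_threshold(tower: &[TowerSlot], depth: usize, total_stake: u64, threshold: f64) -> bool {
    // tower[0] is the most recent vote; index depth - 1 is the Nth-deep vote.
    match tower.get(depth - 1) {
        Some(ts) => (ts.observed_stake as f64 / total_stake as f64) >= threshold,
        None => true, // tower not deep enough yet; nothing to enforce
    }
}

fn can_vote(tower: &[TowerSlot], total_stake: u64) -> bool {
    // Existing constraint: 8th-deep vote must be optimistically confirmed.
    let oc_ok = meets_depth_threshold(tower, 8, total_stake, OPTIMISTIC_CONFIRMATION_THRESHOLD);
    // Proposed additional constraint: 4th-deep vote must reach the switch threshold.
    let switch_ok = meets_depth_threshold(tower, 4, total_stake, SWITCH_FORK_THRESHOLD);
    oc_ok && switch_ok
}

fn main() {
    let total_stake = 100;
    // Toy tower: the 8th-deep slot is optimistically confirmed (70%), but the
    // 4th-deep slot has only 30% observed stake, so the extra check blocks voting.
    let tower: Vec<TowerSlot> = (0..8u64)
        .map(|i| TowerSlot {
            slot: 227_740_023 - i,
            observed_stake: if i >= 6 { 70 } else { 30 },
        })
        .collect();
    println!("deepest slot {} -> can vote: {}", tower[7].slot, can_vote(&tower, total_stake));
}
```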

@carllin
Contributor

carllin commented Nov 16, 2023

Having a dedicated thread that produces the leader block also makes sense. It can just receive the latest reset information from before handle_votable_bank. I remember discussing some nasty traps with having block production start asynchronously back in the day, though. I don't remember those traps, which makes me extra scared; time to dig up some ghosts 👻
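
As a toy sketch of the shape being discussed (not the actual replay stage), a dedicated thread could receive the latest reset information over a channel before handle_votable_bank runs and start the leader slot on its own schedule. ResetInfo, try_start_leader, and leader_start_service are invented names for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

// Hypothetical message carrying the reset information the leader thread needs.
#[derive(Debug, Clone)]
struct ResetInfo {
    reset_slot: u64,       // bank the leader should build on
    next_leader_slot: u64, // slot this node is scheduled to produce
}

// Placeholder for the real PoH reset / bank creation work.
fn try_start_leader(info: &ResetInfo) {
    println!("starting leader slot {} on top of {}", info.next_leader_slot, info.reset_slot);
}

// Dedicated service: always act on the most recent reset info, dropping stale ones.
fn leader_start_service(receiver: Receiver<ResetInfo>) {
    while let Ok(mut info) = receiver.recv() {
        while let Ok(newer) = receiver.try_recv() {
            info = newer;
        }
        try_start_leader(&info);
    }
}

fn main() {
    let (sender, receiver): (Sender<ResetInfo>, Receiver<ResetInfo>) = channel();
    let handle = thread::spawn(move || leader_start_service(receiver));

    // Replay-loop stand-in: publish reset info first, then do the (possibly slow)
    // voting work without blocking leader start.
    sender
        .send(ResetInfo { reset_slot: 227_740_019, next_leader_slot: 227_740_020 })
        .unwrap();
    thread::sleep(Duration::from_millis(50)); // pretend handle_votable_bank is slow here

    drop(sender); // close the channel so the service thread exits
    handle.join().unwrap();
}
```

The hard part, as noted above, is the set of races involved in starting block production asynchronously, which a sketch like this does not address.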

@sakridge
Member Author

compute_bank_stats was also really high during that second (~230ms):

(screenshot: 2023-11-16 at 5:56:01 PM)

@bw-solana
Contributor

First cut at adding another vote stake threshold constraint: #34120

@t-nelson
Contributor

i think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. the latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders

@AshwinSekar
Contributor

wen and I dug into this too late, and unfortunately the original incident has already been removed from metrics.

We analyzed long replay times from the past two weeks:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20mean%28%22total_elapsed_us%22%29%20AS%20%22mean_total_elapsed_us%22%2C%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20GROUP%20BY%20time%281d%29%2C%20host_id#

We investigated validators that have competent hardware (< 2M daily average replay-loop-timing-stats.total_elapsed_us) but a long tail (> 10M daily max replay-loop-timing-stats.total_elapsed_us).
We spot-checked a handful of these to figure out the root cause:

The original leader 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749 had a couple of the previously mentioned 1700113043695000.00 spikes in the past two weeks. Apart from those anomalies, the rest of the high total_elapsed_us readings were attributable to generate_new_bank_forks (epoch boundary) or replay_active_banks (CU intensive) timings:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20AS%20%22max_collect_frozen_banks_elapsed%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20AS%20%22max_compute_bank_stats_elapsed%22%2C%20max%28%22compute_slot_stats_elapsed%22%29%20AS%20%22max_compute_slot_stats_elapsed%22%2C%20max%28%22heaviest_fork_failures_elapsed%22%29%20AS%20%22max_heaviest_fork_failures_elapsed%22%2C%20max%28%22repair_correct_slots_elapsed%22%29%20AS%20%22max_repair_correct_slots_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20AS%20%22max_replay_active_banks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20AS%20%22max_replay_blockstore_us%22%2C%20max%28%22reset_bank_elapsed%22%29%20AS%20%22max_reset_bank_elapsed%22%2C%20max%28%22retransmit_not_propagated_elapsed%22%29%20AS%20%22max_retransmit_not_propagated_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20AS%20%22max_start_leader_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20AS%20%22max_voting_elapsed%22%2C%20max%28%22wait_receive_elapsed%22%29%20AS%20%22max_wait_receive_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20AS%20%22max_generate_new_bank_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%2810s%29

During this investigation we were unable to find any incidents of high voting_elapsed counts.

I'm inclined to believe that this specific incident was a one-off. If more recent anomalies pop up, please let me know.

i think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. the latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders

100% agree, we will never be able to get to 0 forks, so we should strive to make minor fork resolution as robust as possible. However, if there is a cluster-wide issue causing delayed broadcasts, we should investigate it. Based on the data above, I don't believe that is the case.

I have created #34281 for the strange > 1 quadrillion total_elapsed_us reports.

@bw-solana
Contributor

#34120 has been merged into master and should help mitigate the negative impacts of these events. So far, it has not been backported and would be part of a v1.18 release.

@wen-coding wen-coding added the consensus Issues related to consensus bugs in the validator label Jan 31, 2024