
High transaction latency mainnet-beta Nov 3 2023 #34107

Open
sakridge opened this issue Nov 16, 2023 · 9 comments
Labels
consensus Issues related to consensus bugs in the validator

Comments

@sakridge
Member

sakridge commented Nov 16, 2023

Problem

High transaction latency occurred due to heavy forking and large tower distance. This appears to have been caused by a leader's slot being delayed by over 1 second.

Start of the Discord analysis of the event:
https://discord.com/channels/428295358100013066/1020131815362666496/1169988874441859163

Metrics of delayed leader:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20as%20%22total_elapsed_us%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20as%20%22compute_bank_stats_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20as%20%22start_leader_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20as%20%22generate_new_bank_forks_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20as%20%22replay_active_banks_elapsed%22%2C%20max%28%22process_popular_pruned_forks_elapsed%22%29%20as%20%22process_popular_pruned_forks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20as%20%22replay_blockstore_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20as%20%22collect_frozen_banks_elapsed%22%2C%20max%28%22select_vote_and_reset_forks_elapsed%22%29%20as%20%22select_vote_and_reset_forks_elapsed%22%2C%20max%28%22reset_bank_elapsed%22%29%20as%20%22reset_bank_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20as%20%22voting_elapsed%22%2C%20max%28%22select_forks_elapsed%22%29%20as%20%22select_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%281ms%29%20FILL%28none%29

Leader schedule and block production around the event:

  227740016       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740017       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740018       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740019       C1ocKDYMCm2ooWptMMnpd5VEB2Nx4UMJgRuYofysyzcA
  227740020       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740021       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740022       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749
  227740023       9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749  <== delayed block
  227740024       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740025       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740026       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740027       UPSCQNqdbiaqrQou9X9y8mr43ZHzvoNpKC26Mo7GubF  SKIPPED
  227740028       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740029       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740030       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740031       FzPoMxzQJRH87NVy1YJF8BqjrygzLLHz8MYjEZPhQHnL SKIPPED
  227740032       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740033       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740034       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740035       7zAHbRxEQaNjKnQMjFm7j8LebHSGfzsQDdm2ZpUNPa7G
  227740036       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740037       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740038       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740039       3RXKQBRv7xKTQeNdLSPhCiD4QcUfxEQ12rtgUkMf5LnS
  227740040       56dfPEcmfA7dJLq2T7CV3WCVCrhnPBDR4ukshnRgWFCn SKIPPED
(chart: high-tower-distance)

The potential problem seems to be high vote latency in the replay stage, which could have affected maybe_leader_start.
(screenshot: 2023-11-15 at 10:56:05 PM)

Related to #18308

Proposed Solution

  • Dedicated thread for starting leader?
  • Add a stricter threshold check (sliding observed stake on the 7-4 deep slots), which would reduce the lockout on these minor forks
  • Add the shred requirement from MCL - only vote early on blocks that have a high probability of being observed by the rest of the cluster
  • Rework the fork choice tie break to prefer later slots - penalizes blocks that are sent out late (a rough sketch follows this list)
  • Ingest gossip votes in fork choice (complex) - helps in situations where the next couple of leaders are on the minor fork
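
To make the tie-break bullet concrete, here is a minimal, hypothetical sketch of a stake comparison whose tie break prefers the later slot. ForkCandidate and stake_weight are invented names for illustration, not the validator's actual fork choice types, and the real heaviest-subtree fork choice is considerably more involved.

```rust
use std::cmp::Ordering;

// Invented stand-in for a fork tip under consideration.
#[derive(Debug)]
struct ForkCandidate {
    slot: u64,
    stake_weight: u64, // stake observed voting for this fork's tip
}

/// Tie break that favors the earlier slot when stake is equal.
fn cmp_prefer_earlier(a: &ForkCandidate, b: &ForkCandidate) -> Ordering {
    a.stake_weight
        .cmp(&b.stake_weight)
        .then(b.slot.cmp(&a.slot)) // lower slot wins the tie
}

/// Proposed flavor: favor the later slot when stake is equal, so a block that
/// was broadcast late (and therefore sits at an older slot) loses the tie.
fn cmp_prefer_later(a: &ForkCandidate, b: &ForkCandidate) -> Ordering {
    a.stake_weight
        .cmp(&b.stake_weight)
        .then(a.slot.cmp(&b.slot)) // higher slot wins the tie
}

fn main() {
    let minor = ForkCandidate { slot: 227_740_023, stake_weight: 10 };
    let major = ForkCandidate { slot: 227_740_032, stake_weight: 10 };
    let earlier = std::cmp::max_by(&minor, &major, |a, b| cmp_prefer_earlier(a, b));
    let later = std::cmp::max_by(&minor, &major, |a, b| cmp_prefer_later(a, b));
    println!("earlier-preferring tie break picks {}", earlier.slot);
    println!("later-preferring tie break picks {}", later.slot);
}
```

The intent, per the bullet above, is that a leader who broadcasts late ends up losing ties against newer slots instead of winning them.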
@sakridge
Member Author

sakridge commented Nov 16, 2023

cc @carllin @bw-solana @AshwinSekar

@sakridge sakridge changed the title from "High transaction latency mainnet-beta Nov. 3 2023" to "High transaction latency mainnet-beta Nov 3 2023" on Nov 16, 2023
@carllin
Contributor

carllin commented Nov 16, 2023

hmm so something in handle_votable_bank is taking a long time, thus delaying their slot which in turn races with the blocks produced by future leaders?

handle_votable_bank seems pretty tame at first look; the vote push to gossip and the gossip lock are handled in a different thread in push_vote. Gotta check whether these voting delay spikes happen frequently or if this was a rare occurrence

Add a stricter threshold check (sliding observed stake on the 7-4 deep slots), which would reduce the lockout on these minor forks

What are the 7-4 deep slots?

@bw-solana
Contributor

bw-solana commented Nov 16, 2023

What are the 7-4 deep slots?

The 7-4 deep votes on the tower. We have the constraint that our 8th-deep slot/vote needs to be OC in order to vote, but we could probably mitigate lockout times by adding another constraint such as "the 4th-deep slot/vote on the tower needs to reach the fork switch threshold (38%) in order to keep voting"

This is treating the symptom rather than the true problems (delayed broadcast, non-deterministic fork selection), but given the number of underlying causes we've had that lead to this same chain (competing leader blocks --> partition --> large lockouts --> longer latency), treating the symptom to mitigate the bleeding seems worthwhile
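
For illustration only, here is a rough sketch of what that additional constraint could look like, assuming hypothetical TowerSlot/observed_stake bookkeeping rather than the real tower and vote-stake tracking code (the actual first cut is #34120, linked further down).

```rust
// Sketch of the extra voting constraint described above: keep the existing
// rule that the 8th-deep tower slot must be optimistically confirmed, and
// additionally require the 4th-deep slot to have reached the 38% fork switch
// threshold before voting again. All names here are illustrative.

const SWITCH_FORK_THRESHOLD: f64 = 0.38; // fork switch threshold from the comment above
const OPTIMISTIC_CONFIRMATION_THRESHOLD: f64 = 2.0 / 3.0; // assumed OC threshold

struct TowerSlot {
    slot: u64,
    observed_stake: u64, // stake observed voting on this slot or its descendants
}

fn meets_depth_threshold(tower: &[TowerSlot], depth: usize, total_stake: u64, threshold: f64) -> bool {
    // tower[0] is the most recent vote; index depth - 1 is the Nth-deep vote.
    match tower.get(depth - 1) {
        Some(ts) => (ts.observed_stake as f64 / total_stake as f64) >= threshold,
        None => true, // tower not deep enough yet; nothing to enforce
    }
}

fn can_vote(tower: &[TowerSlot], total_stake: u64) -> bool {
    // Existing constraint: 8th-deep vote must be optimistically confirmed.
    let oc_ok = meets_depth_threshold(tower, 8, total_stake, OPTIMISTIC_CONFIRMATION_THRESHOLD);
    // Proposed additional constraint: 4th-deep vote must reach the switch threshold.
    let switch_ok = meets_depth_threshold(tower, 4, total_stake, SWITCH_FORK_THRESHOLD);
    oc_ok && switch_ok
}

fn main() {
    let total_stake = 100;
    // Toy tower: the 8th-deep slot is optimistically confirmed (70%), but the
    // 4th-deep slot has only 30% observed stake, so the extra check blocks voting.
    let tower: Vec<TowerSlot> = (0..8u64)
        .map(|i| TowerSlot {
            slot: 227_740_023 - i,
            observed_stake: if i >= 6 { 70 } else { 30 },
        })
        .collect();
    println!("deepest slot {} -> can vote: {}", tower[7].slot, can_vote(&tower, total_stake));
}
```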

@carllin
Contributor

carllin commented Nov 16, 2023

Having a dedicated thread that produces the leader block also makes sense. It can just receive the latest reset information from before handle_votable_bank. I remember discussing some nasty traps with having block production start asynchronously back in the day, though. I don't remember those traps, which makes me extra scared; time to dig up some ghosts 👻
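
As a toy sketch of the shape being discussed (not the actual replay stage), a dedicated thread could receive the latest reset information over a channel before handle_votable_bank runs and start the leader slot on its own schedule. ResetInfo, try_start_leader, and leader_start_service are invented names for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

// Hypothetical message carrying the reset information the leader thread needs.
#[derive(Debug, Clone)]
struct ResetInfo {
    reset_slot: u64,       // bank the leader should build on
    next_leader_slot: u64, // slot this node is scheduled to produce
}

// Placeholder for the real PoH reset / bank creation work.
fn try_start_leader(info: &ResetInfo) {
    println!("starting leader slot {} on top of {}", info.next_leader_slot, info.reset_slot);
}

// Dedicated service: always act on the most recent reset info, dropping stale ones.
fn leader_start_service(receiver: Receiver<ResetInfo>) {
    while let Ok(mut info) = receiver.recv() {
        while let Ok(newer) = receiver.try_recv() {
            info = newer;
        }
        try_start_leader(&info);
    }
}

fn main() {
    let (sender, receiver): (Sender<ResetInfo>, Receiver<ResetInfo>) = channel();
    let handle = thread::spawn(move || leader_start_service(receiver));

    // Replay-loop stand-in: publish reset info first, then do the (possibly slow)
    // voting work without blocking leader start.
    sender
        .send(ResetInfo { reset_slot: 227_740_019, next_leader_slot: 227_740_020 })
        .unwrap();
    thread::sleep(Duration::from_millis(50)); // pretend handle_votable_bank is slow here

    drop(sender); // close the channel so the service thread exits
    handle.join().unwrap();
}
```

The hard part, as noted above, is the set of races involved in starting block production asynchronously, which a sketch like this does not address.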

@sakridge
Member Author

compute_bank_stats was also really high during that second (~230ms):

(screenshot: 2023-11-16 at 5:56:01 PM)

@bw-solana
Contributor

First cut at adding another vote stake threshold constraint: #34120

@t-nelson
Contributor

i think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. the latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders

@AshwinSekar
Contributor

wen and I dug into this too late, and unfortunately the original incident has already been removed from metrics.

We analyzed long replay times from the past two weeks:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20mean%28%22total_elapsed_us%22%29%20AS%20%22mean_total_elapsed_us%22%2C%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20GROUP%20BY%20time%281d%29%2C%20host_id#

We investigated validators that have competent hardware (< 2M daily average replay-loop-timing-stats.total_elapsed_us) but a long tail (> 10M daily max replay-loop-timing-stats.total_elapsed_us).
We spot-checked a handful of these to figure out the root cause:

The original leader 9kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749 had a couple of the previously mentioned 1700113043695000.00 spikes in the past two weeks. Apart from those anomalies, the rest of the high total_elapsed_us readings were attributable to generate_new_bank_forks (epoch boundary) or replay_active_banks (CU intensive) timings:
https://metrics.solana.com:8888/sources/14/chronograf/data-explorer?query=SELECT%20max%28%22total_elapsed_us%22%29%20AS%20%22max_total_elapsed_us%22%2C%20max%28%22collect_frozen_banks_elapsed%22%29%20AS%20%22max_collect_frozen_banks_elapsed%22%2C%20max%28%22compute_bank_stats_elapsed%22%29%20AS%20%22max_compute_bank_stats_elapsed%22%2C%20max%28%22compute_slot_stats_elapsed%22%29%20AS%20%22max_compute_slot_stats_elapsed%22%2C%20max%28%22heaviest_fork_failures_elapsed%22%29%20AS%20%22max_heaviest_fork_failures_elapsed%22%2C%20max%28%22repair_correct_slots_elapsed%22%29%20AS%20%22max_repair_correct_slots_elapsed%22%2C%20max%28%22replay_active_banks_elapsed%22%29%20AS%20%22max_replay_active_banks_elapsed%22%2C%20max%28%22replay_blockstore_us%22%29%20AS%20%22max_replay_blockstore_us%22%2C%20max%28%22reset_bank_elapsed%22%29%20AS%20%22max_reset_bank_elapsed%22%2C%20max%28%22retransmit_not_propagated_elapsed%22%29%20AS%20%22max_retransmit_not_propagated_elapsed%22%2C%20max%28%22start_leader_elapsed%22%29%20AS%20%22max_start_leader_elapsed%22%2C%20max%28%22voting_elapsed%22%29%20AS%20%22max_voting_elapsed%22%2C%20max%28%22wait_receive_elapsed%22%29%20AS%20%22max_wait_receive_elapsed%22%2C%20max%28%22generate_new_bank_forks_elapsed%22%29%20AS%20%22max_generate_new_bank_forks_elapsed%22%20FROM%20%22mainnet-beta%22.%22autogen%22.%22replay-loop-timing-stats%22%20WHERE%20time%20%3E%20%3AdashboardTime%3A%20AND%20time%20%3C%20%3AupperDashboardTime%3A%20AND%20%22host_id%22%3D%279kyyeVup4tK8NFHoMF98tCH657mvjk45Qs29ZXFfn749%27%20GROUP%20BY%20time%2810s%29

During this investigation we were unable to find any incidents of high voting_elapsed counts.

I'm inclined to believe that this specific incident was a one-off. If more recent anomalies pop up, please let me know.

i think we should prioritize being resilient to delayed broadcast rather than necessarily trying to fix it. the latency + throughput hit is much preferable to the alternatives in the presence of non-conforming leaders

100% agree, we will never be able to get to 0 forks, so we should strive to make minor fork resolution as robust as possible. However, if there is a cluster-wide issue causing delayed broadcasts, we should investigate it. Based on the data above, I don't believe that is the case.

I have created #34281 for the strange > 1 quadrillion total_elapsed_us reports.

@bw-solana
Contributor

#34120 has been merged into master and should help mitigate the negative impacts of these events. So far, it has not been backported and would be part of a v1.18 release.

@wen-coding wen-coding added the consensus Issues related to consensus bugs in the validator label Jan 31, 2024