## Deadlock Scenario

I've seen a deadlock arise a few times on one of my nodes, primarily on testnet3, presumably triggered more often due to the existence of block storms.

Here's a sample goroutine dump:

What we see here is that we had a combination of many peers connecting and disconnecting. When a peer connects, we go to notify the server to update the various bookkeeping information it stores (btcd/server.go, lines 1783 to 1784 in 67b8efd):

```go
// Signal the sync manager this peer is a new sync candidate.
s.syncManager.NewPeer(sp.Peer)
```

The server then goes to update the `SyncManager`, to see if we need to pick this peer as a sync node or not, while also setting up some bookkeeping information (btcd/netsync/manager.go, lines 455 to 478 in 67b8efd):

```go
	// Start syncing by choosing the best candidate if needed.
	if isSyncCandidate && sm.syncPeer == nil {
		sm.startSync()
	}
}
```

While one of these requests is pending, it's possible that the `SyncManager` is notified of a new block. When receiving a new block, the sync manager will ask the chain to process it (btcd/netsync/manager.go, lines 743 to 746 in 67b8efd):

```mermaid
graph TD
    A[Server: RelayInventory]
    B[Server: relayTransactions]
    C[Server: AnnounceNewTransactions]
    D[SyncManager: handleBlockchainNotification]
    E[BlockChain: sendNotification]
    F[BlockChain: connectBlock]
    G[BlockChain: connectBestChain]
    H[BlockChain: maybeAcceptBlock]
    I[BlockChain: processOrphans]
    J[BlockChain: ProcessBlock]
    K[SyncManager: handleBlockMsg]
    L[SyncManager: blockHandler]
    D --> C
    C --> B
    B --> A
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    K --> L
    A --> L
    M[Server: peerDoneHandler] --> A
    N[Server: AddPeer] --> A
    O[SyncManager: NewPeer] --> A
```
Indirectly during processing, a new event is emitted to notify the server that there's a new block (btcd/netsync/manager.go, line 1699 in 67b8efd).
Herein lies our deadlock: peer -> server -> sync manager -> server -> peer.
This halts all new incoming peer handling, and also blocks some RPC calls that want to call into the `SyncManager` to query state.

## Resolution Paths
One easy path comes to mind: can we just make the call from the server to the sync manager async? So:
The state in the sync manager will be cleaned up once the `peerDoneHandler` exits (btcd/server.go, line 2215 in 67b8efd).