## Deadlock Scenario

I've seen a deadlock arise a few times on one of my nodes, primarily on testnet3, presumably triggered more often due to the existence of block storms.

Here's a sample goroutine dump:

What we see here is that we had a combination of many peers connecting and disconnecting. When a peer connects, we go to notify the server to update the various bookkeeping information it stores (btcd/server.go, lines 1783 to 1784 in 67b8efd):

```go
// Signal the sync manager this peer is a new sync candidate.
s.syncManager.NewPeer(sp.Peer)
```

The server then goes to update the `SyncManager`, to see if we need to pick this peer as a sync node or not, while also setting up some bookkeeping information (btcd/netsync/manager.go, lines 455 to 478 in 67b8efd):

```go
	// Start syncing by choosing the best candidate if needed.
	if isSyncCandidate && sm.syncPeer == nil {
		sm.startSync()
	}
}
```

While one of these requests is pending, it's possible that the `SyncManager` is notified of a new block. When receiving a new block, the sync manager will ask the chain to process it (btcd/netsync/manager.go, lines 743 to 746 in 67b8efd):

```mermaid
graph TD
    A[Server: RelayInventory]
    B[Server: relayTransactions]
    C[Server: AnnounceNewTransactions]
    D[SyncManager: handleBlockchainNotification]
    E[BlockChain: sendNotification]
    F[BlockChain: connectBlock]
    G[BlockChain: connectBestChain]
    H[BlockChain: maybeAcceptBlock]
    I[BlockChain: processOrphans]
    J[BlockChain: ProcessBlock]
    K[SyncManager: handleBlockMsg]
    L[SyncManager: blockHandler]
    D --> C
    C --> B
    B --> A
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    K --> L
    A --> L
    M[Server: peerDoneHandler] --> A
    N[Server: AddPeer] --> A
    O[SyncManager: NewPeer] --> A
```
Indirectly during processing, a new event is emitted to notify the server that there's a new block (btcd/netsync/manager.go, line 1699 in 67b8efd).
Herein lies our deadlock: peer -> server -> sync manager -> server -> peer.
This halts all new incoming peer handling, and also blocks some RPC calls that want to call into the `SyncManager` to query state.

## Resolution Paths
One easy path comes to mind: can we just make the call from the server to the sync manager async? So:
The state in the sync manager will be cleaned up once the `peerDoneHandler` exits (btcd/server.go, line 2215 in 67b8efd).