RFC: Block persistence changes to support headers-first sync #1048
Thanks for taking the time to write all this up. I know we've pretty much discussed this already between various PRs and IRC, but it's good to have it all in one place. I think moving the management of the block index into […]

One of the considerations of removing the headers from the block row is that the header fetching functions in database will also need to either be changed to load them off disk, or perhaps even just removed completely since […]

With this change, it should also be possible to completely remove the current height<->hash mapping buckets which only contain entries for the main chain. The code is already almost able to do this thanks to the recent change of having all block nodes in memory and a […]

This next bit is some of what I posted on the aforementioned PR, but I think it's instructive to include it in this issue for reference purposes as well.

One of the things I've been periodically working on is implementing a utxo cache, which will massively speed up IBD and block validation times at tip; however, one of the challenges there is that some parts of […]

This proposal would help simplify that work tremendously. However, there is another important consideration that I didn't see mentioned here, and that is making sure the block index is implemented such that it only needs to be periodically flushed to the database. Either that, or some type of reconciliation code that can detect differences and replay utxos to ensure the on-disk state of the utxo set and the best chain are in agreement. Currently, that isn't a problem because it's all updated atomically as each block is connected and disconnected. However, that will no longer be the case with all of these changes + caches.

While it's probably a bit premature to discuss now, as I think it would need to be a multi-stage process if it's something we even would want to do, looking even further into the future, I highly suspect the model of the […]
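A minimal sketch of the "periodic flush" idea above, assuming btcd's `database` interfaces. The `dirtyState` type, the `flush` helper, the bucket names, and the row encodings are all hypothetical; the only point being illustrated is that the block index and utxo set updates land in a single atomic transaction:

```go
import "github.com/btcsuite/btcd/database"

// dirtyState bundles everything accumulated in memory since the last flush:
// modified block index entries and spent/created utxo entries.
type dirtyState struct {
	blockIndexRows map[string][]byte // serialized header + validation state
	utxoRows       map[string][]byte // serialized utxo entries keyed by outpoint
}

// flush writes both sets of rows inside one database transaction so the
// on-disk utxo set can never disagree with the on-disk block index about
// which blocks have been connected.
func flush(db database.DB, state *dirtyState) error {
	return db.Update(func(dbTx database.Tx) error {
		meta := dbTx.Metadata()
		idxBucket := meta.Bucket([]byte("blockheaderidx")) // hypothetical bucket
		utxoBucket := meta.Bucket([]byte("utxoset"))       // name may differ
		for k, v := range state.blockIndexRows {
			if err := idxBucket.Put([]byte(k), v); err != nil {
				return err
			}
		}
		for k, v := range state.utxoRows {
			if err := utxoBucket.Put([]byte(k), v); err != nil {
				return err
			}
		}
		return nil
	})
}
```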
This all makes sense except the part about the block index: "but the block index is an area where this is still problematic, in large part because it's managed by database and not blockchain." What does this refer to? The […]
The issue is essentially that the utxoset and the block index (currently it's more specifically the best chain) have to be consistent with one another. When introducing a utxo cache, it necessarily means the state of the in-memory representation of the utxoset diverges until it eventually gets flushed. That also implies that the block index needs to be atomically updated during that same flush (or more specifically the best chain, but with this proposal, the connection status of each block will be a bit in the validation state, so it will apply to the modified block index nodes too). Right now, because the block index is managed by […]

EDIT: Alternatively, it would be possible to save the hash of what the utxo set believes is the best chain and compare that to the best chain according to the block index, then "replay" everything to get them back in sync. However, that seems like it might be quite easy to get out of sync and certainly would increase startup times in the event of an unclean shutdown.
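A rough sketch of that reconciliation alternative, again assuming btcd's `database` and `chainhash` packages. The metadata key name and helper names are made up; the idea is simply to persist, in the same transaction that flushes the utxo rows, the hash of the block the flushed utxo set corresponds to:

```go
import (
	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/btcsuite/btcd/database"
)

var utxoSetTipKeyName = []byte("utxosettip") // hypothetical metadata key

// dbPutUtxoSetTip records which block the on-disk utxo set is consistent
// with. It must be written in the same transaction that flushes the utxos.
func dbPutUtxoSetTip(dbTx database.Tx, hash *chainhash.Hash) error {
	return dbTx.Metadata().Put(utxoSetTipKeyName, hash[:])
}

// dbFetchUtxoSetTip returns the recorded hash, or nil if it was never set.
func dbFetchUtxoSetTip(dbTx database.Tx) *chainhash.Hash {
	serialized := dbTx.Metadata().Get(utxoSetTipKeyName)
	if len(serialized) != chainhash.HashSize {
		return nil
	}
	var hash chainhash.Hash
	copy(hash[:], serialized)
	return &hash
}
```

At startup, if the recorded hash differs from the block index tip, the blocks between the two would have to be replayed against the utxo set, which is the extra recovery work (and startup cost after an unclean shutdown) mentioned above.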
This describes some options for implementing headers-first syncing and how it relates to block persistence. The new `blockIndex` and `chainView` structures in the `blockchain` package go a long way towards enabling this, but there are some more possible changes that would make it more efficient.

First of all, the block index is currently only initialized with nodes from the main chain, partly due to how block data is structured in buckets. Instead, we should efficiently load all stored headers into the block index on init. If we ensure that we only store connected headers that satisfy PoW and checkpoint checks, there is very little risk of populating the in-memory index with a ton of invalid or orphaned blocks. As demonstrated in #1034, having a bucket that stores all block headers keyed by `<uint32(height)><hash>` allows the block index to be populated very quickly.
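To make the key layout concrete, here is a rough sketch assuming btcd's `database`, `wire`, and `chainhash` packages. The bucket name and function names are made up, and the value is shown as just the serialized header (the validation-state byte discussed below would be appended to it):

```go
import (
	"bytes"
	"encoding/binary"

	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/btcsuite/btcd/database"
	"github.com/btcsuite/btcd/wire"
)

var blockHeaderIdxBucketName = []byte("blockheaderidx") // hypothetical bucket

// blockHeaderIdxKey builds the proposed key: <uint32(height)><hash>. A
// big-endian height prefix means iterating the bucket visits headers in
// height order.
func blockHeaderIdxKey(height uint32, hash *chainhash.Hash) []byte {
	key := make([]byte, 4+chainhash.HashSize)
	binary.BigEndian.PutUint32(key[0:4], height)
	copy(key[4:], hash[:])
	return key
}

// loadHeaders walks the bucket in key (and therefore height) order, handing
// each deserialized header to the callback; this is what lets the in-memory
// block index be rebuilt quickly on init.
func loadHeaders(dbTx database.Tx, cb func(height uint32, header *wire.BlockHeader) error) error {
	bucket := dbTx.Metadata().Bucket(blockHeaderIdxBucketName)
	return bucket.ForEach(func(k, v []byte) error {
		var header wire.BlockHeader
		if err := header.Deserialize(bytes.NewReader(v)); err != nil {
			return err
		}
		return cb(binary.BigEndian.Uint32(k[0:4]), &header)
	})
}
```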
Each `blockNode` in the index should store a bitfield representing its validation state. To support headers-first syncing, we should store with each node: if we have the block data in ffldb, if the block was fully validated, if the block failed validation, and if the block is a descendant of another invalid block. More bits may be used in the future; for example, if pruning is implemented, we may want to store whether block info has been pruned from the spend journal. This bit field can be stored alongside the header.

While it is possible to implement headers-first sync with the headers only in memory, they would have to be resynced if the node restarts during IBD. Instead, it seems preferable to store headers independently of block data.
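For illustration, one possible encoding of that per-node validation state as a single byte of flags (the names here are illustrative, not an existing API):

```go
// blockStatus is a bitfield representing the validation state of a block
// node; it fits in a single byte stored alongside the header.
type blockStatus byte

const (
	// statusDataStored indicates the full block data is available in ffldb.
	statusDataStored blockStatus = 1 << iota

	// statusValid indicates the block has been fully validated.
	statusValid

	// statusValidateFailed indicates the block failed validation.
	statusValidateFailed

	// statusInvalidAncestor indicates an ancestor of the block failed
	// validation, so the block itself can never be valid.
	statusInvalidAncestor

	// The remaining bits are reserved for future use, e.g. a flag noting
	// that the block's spend journal entry has been pruned.
)

// HaveData returns whether the block's data is stored locally, independent
// of its validation state.
func (s blockStatus) HaveData() bool {
	return s&statusDataStored != 0
}

// KnownInvalid returns whether the block or one of its ancestors is known to
// have failed validation.
func (s blockStatus) KnownInvalid() bool {
	return s&(statusValidateFailed|statusInvalidAncestor) != 0
}
```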
I propose using ffldb to only store/fetch full blocks, and creating a bucket with headers & validation state managed in `chainio.go` (a rough sketch of a possible row format is included at the end of this description). To save storage space, we would drop the block headers from the block index bucket in LevelDB, and each row in that bucket would just have the block location. A one-time migration is necessary to populate the new bucket and drop header data from the current block index bucket.

I'm sure there are some complications I am not aware of, so I appreciate any feedback.
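For concreteness, a rough sketch of what a row in the proposed header/state bucket could look like: the 80-byte serialized header followed by a single validation-state byte, reusing the illustrative `blockStatus` type from the earlier sketch. The function names are made up:

```go
import (
	"bytes"
	"fmt"

	"github.com/btcsuite/btcd/wire"
)

// serializeBlockIndexRow encodes a header plus its validation state as
// <80-byte header><1-byte status>.
func serializeBlockIndexRow(header *wire.BlockHeader, status blockStatus) ([]byte, error) {
	w := bytes.NewBuffer(make([]byte, 0, wire.MaxBlockHeaderPayload+1))
	if err := header.Serialize(w); err != nil {
		return nil, err
	}
	if err := w.WriteByte(byte(status)); err != nil {
		return nil, err
	}
	return w.Bytes(), nil
}

// deserializeBlockIndexRow is the inverse: it splits a row back into the
// header and the trailing status byte.
func deserializeBlockIndexRow(row []byte) (*wire.BlockHeader, blockStatus, error) {
	if len(row) < wire.MaxBlockHeaderPayload+1 {
		return nil, 0, fmt.Errorf("block index row too short: %d bytes", len(row))
	}
	var header wire.BlockHeader
	if err := header.Deserialize(bytes.NewReader(row)); err != nil {
		return nil, 0, err
	}
	return &header, blockStatus(row[wire.MaxBlockHeaderPayload]), nil
}
```

With the headers moved into this bucket, the existing block index rows would then only need to carry the block location, as described above.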