Notice
This document is for a development version of Ceph.
RocksDB Config Reference
Note
As of the Tentacle release, two Ceph services use RocksDB: Monitors and OSDs. This document focuses on RocksDB as used by OSDs.
RocksDB caching
RocksDB caching works by keeping parts of .sst files in a block cache.
For more details, see the Block-Cache wiki.
Ceph implements its own flavor of block cache.
See the source code for more details.
This custom implementation lets the RocksDB block cache, the BlueStore metadata cache,
and the BlueStore data cache compete for the same available memory.
Cache sharding
Like the default RocksDB block cache, the Ceph block cache is sharded. The number of shards is set by a configuration option. The purpose of sharding is to reduce contention when multiple threads access the cache.
- rocksdb_cache_shard_bits
  Specifies the number of shards by designating the number of significant bits in hash keys. For example, 4 bits yields 16 shards.
  - type: int
  - runtime updatable: false
  - default: 4
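The mapping from shard bits to shard count is a power of two. The short sketch below spells it out; the helper name `shard_count` is made up for illustration and is not part of Ceph.

```python
def shard_count(shard_bits: int) -> int:
    """Number of cache shards selected by `shard_bits` significant hash bits."""
    if shard_bits < 0:
        raise ValueError("shard_bits must be non-negative")
    return 1 << shard_bits  # 2 ** shard_bits


# Print the shard count for each setting up to the default of 4 bits.
for bits in range(0, 5):
    print(f"rocksdb_cache_shard_bits={bits} -> {shard_count(bits)} shards")
```

Each step down in `rocksdb_cache_shard_bits` halves the number of shards.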
Perf counters
Ceph RocksDB cache operations are tracked by performance counters.
In the default configuration BlueStore creates two block caches:
O for onodes, and default for everything else.
The sections created in performance counters are named rocksdb-cache-O and
rocksdb-cache-default.
ceph tell osd.0 perf dump rocksdb-cache-O
"rocksdb-cache-O": {
"capacity": 134217728,
"usage": 134182832,
"pinned": 0,
"elems": 24502,
"inserts": 25806978,
"lookups": 150436987,
"hits": 124629911,
"misses": 25807076
}
The values capacity, usage, pinned, and elems reflect the current state of the cache.
The values inserts, lookups, hits, and misses are counters that are incremented on each corresponding event.
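As an illustration of how these counters can be read, the following sketch computes a hit ratio and a fill ratio from the sample values above. It assumes the full command output is a JSON object keyed by section name (the outer braces are added here), and the helper name `cache_summary` is hypothetical.

```python
import json

# Sample values copied from the perf dump output above. The outer braces are
# an assumption about the shape of the complete JSON document.
PERF_DUMP = """
{
  "rocksdb-cache-O": {
    "capacity": 134217728,
    "usage": 134182832,
    "pinned": 0,
    "elems": 24502,
    "inserts": 25806978,
    "lookups": 150436987,
    "hits": 124629911,
    "misses": 25807076
  }
}
"""


def cache_summary(dump: dict, section: str) -> dict:
    """Derive simple ratios from one rocksdb-cache-* counter section."""
    c = dump[section]
    lookups = c["lookups"]
    capacity = c["capacity"]
    return {
        "hit_ratio": c["hits"] / lookups if lookups else 0.0,
        "fill_ratio": c["usage"] / capacity if capacity else 0.0,
        # In this sample every lookup is counted as either a hit or a miss.
        "consistent": c["hits"] + c["misses"] == lookups,
    }


summary = cache_summary(json.loads(PERF_DUMP), "rocksdb-cache-O")
print(summary)
```

For this sample the hit ratio is roughly 0.83 and the cache is almost completely full. Because the event counters accumulate, a ratio computed this way covers the whole period since the counters were last zero.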
Admin commands
Performance counters show only an aggregate summary, but each cache shard keeps its own statistics. To list RocksDB onode block cache details for each shard, run this admin socket command:
ceph tell osd.0 rocksdb show cache O
shard  capacity     usage  pinned  elems  inserts  lookups    hits  misses
0      13631488  11076400       0   2099   136987   822679  685923  136756
1      13631488  11549712       0   2043   133359   571500  438383  133117
2      13631488  11060608       0   2232   135076   908468  773313  135155
3      13631488  11166896       0   2269   134006   427070  293147  133923
4      13631488  11117984       0   2297   133367   700242  567318  132924
5      13631488  11306672       0   2155   137501  1130135  991810  138325
6      13631488  11506512       0   2353   134515   662792  528514  134278
7      13631488  11093856       0   2316   135348   718971  583421  135550
8      13631488  11660624       0   2424   137363  1092043  954248  137795
9      13631488  10962000       0   2561   131982   431702  300467  131235
10     13631488  11379392       0   1916   134543   477118  342854  134264
11     13631488  11294272       0   2555   134508   512393  378337  134056
12     13631488  11277136       0   2079   137312  1131571  993692  137879
13     13631488  10887776       0   2543   134001   567073  432903  134170
14     13631488  10986528       0   2394   133288   584452  451018  133434
15     13631488  11954464       0   2456   134615   708285  573374  134911
Counters that support clearing can be reset to zero by running the following command:
ceph tell osd.0 rocksdb reset cache O
Optimum shard count
In most cases, the 16 shards defined by rocksdb_cache_shard_bits=4 are a good choice.
Large OSDs can easily hold millions of objects and therefore millions of keys
that encode onode metadata. The number of keys is not a problem in itself,
but it causes RocksDB to create very large index blocks.
During leveled compaction, RocksDB merges two levels, so two index blocks are
needed at the same time.
A problem appears if both index blocks belong to the same shard
and their combined size exceeds the capacity of that shard.
The two index blocks cannot fit in the shard simultaneously, so each access to one index block
evicts the other from the cache.
This constant thrashing persists until the current RocksDB compaction step finishes.
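The eviction pattern described above can be reproduced with a toy model. The sketch below is a simplified byte-capacity LRU shard written for illustration only; it is not Ceph's cache implementation, and the class name, function name, and block sizes are hypothetical. The shard capacity is the per-shard value from the sample output earlier in this document.

```python
from collections import OrderedDict


class LruShard:
    """Toy byte-capacity LRU shard (illustrative, not Ceph's actual code)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.usage = 0
        self.entries = OrderedDict()  # key -> size, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, key: str, size: int) -> None:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        # Evict least recently used entries until the new block fits.
        while self.entries and self.usage + size > self.capacity:
            _, evicted_size = self.entries.popitem(last=False)
            self.usage -= evicted_size
        if size <= self.capacity:
            self.entries[key] = size
            self.usage += size


def alternate(capacity: int, block_size: int, rounds: int = 1000):
    """Alternate between two index blocks, as a compaction step might."""
    shard = LruShard(capacity)
    for _ in range(rounds):
        shard.access("index-A", block_size)
        shard.access("index-B", block_size)
    return shard.hits, shard.misses


SHARD_CAPACITY = 13631488  # per-shard capacity from the sample output

print(alternate(SHARD_CAPACITY, 6 * 1024 * 1024))  # both fit: (1998, 2)
print(alternate(SHARD_CAPACITY, 8 * 1024 * 1024))  # too large together: (0, 2000)
```

In the second case every access is a miss, because loading one block always evicts the other.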
Thrashing detection
ceph tell osd.0 rocksdb show cache O
shard  capacity     usage  pinned  elems  inserts  lookups    hits  misses
...
2      13631488  11060608       0   2232   135076   908468  773313  135155
3      13631488   8166896       0      1   134006   427070  293147  133923
4      13631488  11117984       0   2297   133367   700242  567318  132924
...
In this output, shard 3 holds a single element while its neighbors hold more than 2,000, so it is likely that shard 3 is evicting constantly. To verify, reset the counters to zero
and observe the ratio of misses to hits for that shard.
In addition, the perf counter bluefs.read_bytes will increase very quickly, because RocksDB
reads the same index blocks over and over again.
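A rough way to spot such a shard automatically is to compare each shard's element count with the median. The sketch below assumes the command output is whitespace-separated text with the header shown above. The function names and the 1% threshold are arbitrary choices for illustration, and a flagged shard is only a candidate that still needs the verification described above.

```python
from statistics import median

# Rows copied from the sample output above, including the elided "..." lines.
SAMPLE = """\
shard capacity usage pinned elems inserts lookups hits misses
...
2 13631488 11060608 0 2232 135076 908468 773313 135155
3 13631488 8166896 0 1 134006 427070 293147 133923
4 13631488 11117984 0 2297 133367 700242 567318 132924
...
"""


def parse_shards(text: str) -> list:
    """Parse the per-shard table into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header) or not all(f.isdigit() for f in fields):
            continue  # skip elided or malformed rows
        rows.append(dict(zip(header, map(int, fields))))
    return rows


def suspect_shards(rows: list, elems_fraction: float = 0.01) -> list:
    """Return shard ids whose element count is far below the median."""
    typical = median(r["elems"] for r in rows)
    return [r["shard"] for r in rows if r["elems"] < typical * elems_fraction]


suspects = suspect_shards(parse_shards(SAMPLE))
print(suspects)  # [3]
```

A very low element count combined with substantial usage suggests that a few very large blocks occupy the shard, which matches the index-block scenario described earlier.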
Mitigation
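This is a workaround rather than a fix: reduce rocksdb_cache_shard_bits.
With fewer shards, each shard has more capacity, which makes it more likely that both index blocks fit at once.
Because the option is not runtime updatable, the OSD must be restarted for the change to take effect.
This will have a slight negative effect on baseline RocksDB performance,
as fewer shards means more opportunity for lock contention.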
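As a rough guide for choosing a lower value, the sketch below finds the largest shard-bit count at which a single shard could hold a given number of bytes. It assumes the total cache capacity is split evenly across shards. The function name and the 8 MiB index block size are hypothetical, and the total capacity is derived from the per-shard value in the sample output.

```python
def max_shard_bits(total_capacity: int, required_bytes: int, max_bits: int = 4):
    """Largest shard-bit count whose per-shard capacity holds required_bytes.

    Assumes capacity is divided evenly across 2**bits shards. Returns None
    if even a single shard (0 bits) would be too small.
    """
    for bits in range(max_bits, -1, -1):
        if total_capacity // (1 << bits) >= required_bytes:
            return bits
    return None


TOTAL = 16 * 13631488         # 16 shards of the size shown in the sample output
NEEDED = 2 * 8 * 1024 * 1024  # hypothetical: two 8 MiB index blocks

print(max_shard_bits(TOTAL, NEEDED))  # 3
```

This treats the result as an upper bound only: other cached blocks also need room in the shard, so a lower value may still be required in practice.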