Notice

This document is for a development version of Ceph.

RocksDB Config Reference

Note

As of the Tentacle release, two Ceph services use RocksDB: Monitors and OSDs. This document focuses on OSD’s RocksDB.

RocksDB caching

RocksDB caching is based on preserving parts of .sst files in block cache. For more details, see the Block-Cache wiki. Ceph implements its own flavor of block cache. See the source code for more details. This custom implementation brings together RocksDB block cache, BlueStore metadata cache, and BlueStore data cache to compete for available memory.

Cache sharding

As the default RocksDB block cache, Ceph block cache is sharded. Sharding is controlled by configuration. The purpose of sharding is to streamline multi-threaded access.

rocksdb_cache_shard_bits

Specifies the number of shards by designating the number of significant bits in hash keys. 4 bits -> 16 shards.

type:

int

runtime updatable:

false

default:

4

Perf counters

Ceph RocksDB cache operations are tracked by performance counters. In the default configuration BlueStore creates two block caches: O for onodes, and default for everything else. The sections created in performance counters are named rocksdb-cache-O and rocksdb-cache-default.

ceph tell osd.0 perf dump rocksdb-cache-O
"rocksdb-cache-O": {
  "capacity": 134217728,
  "usage": 134182832,
  "pinned": 0,
  "elems": 24502,
  "inserts": 25806978,
  "lookups": 150436987,
  "hits": 124629911,
  "misses": 25807076
}

Values capacity, usage, pinned and elems reflect the current state of the cache. Values inserts, lookups, hits and misses are increased on relevant event.

Admin commands

Performance counters show a brief summation, but each cache shard has its own stats. To list RocksDB onode block cache details for each shard, run an admin socket command:

ceph tell osd.0 rocksdb show cache O
shard  capacity     usage   pinned   elems  inserts  lookups     hits   misses
    0  13631488  11076400        0    2099   136987   822679   685923   136756
    1  13631488  11549712        0    2043   133359   571500   438383   133117
    2  13631488  11060608        0    2232   135076   908468   773313   135155
    3  13631488  11166896        0    2269   134006   427070   293147   133923
    4  13631488  11117984        0    2297   133367   700242   567318   132924
    5  13631488  11306672        0    2155   137501  1130135   991810   138325
    6  13631488  11506512        0    2353   134515   662792   528514   134278
    7  13631488  11093856        0    2316   135348   718971   583421   135550
    8  13631488  11660624        0    2424   137363  1092043   954248   137795
    9  13631488  10962000        0    2561   131982   431702   300467   131235
   10  13631488  11379392        0    1916   134543   477118   342854   134264
   11  13631488  11294272        0    2555   134508   512393   378337   134056
   12  13631488  11277136        0    2079   137312  1131571   993692   137879
   13  13631488  10887776        0    2543   134001   567073   432903   134170
   14  13631488  10986528        0    2394   133288   584452   451018   133434
   15  13631488  11954464        0    2456   134615   708285   573374   134911

Counters that support clearing can be reset to zero by running a command:

ceph tell osd.0 rocksdb reset cache O

Optimum shard count

In most cases 16 shards as defined by rocksdb_cache_shard_bits=4 is a good choice. Large OSDs can easily accommodate millions of objects and thus having millions of keys to encode onode metadata. While the number of keys is not directly a problem, it causes RocksDB to create very large index blocks. During leveled compaction RocksDB merges two levels. Two index blocks are needed at the same time. A problem appears if both index blocks belong to the same shard and their total size is more than capacity of the shard. Both index blocks cannot fit in the shard simultaneously and access to index block causes the other block to be evicted from the cache. The constant thrashing persists until the current RocksDB compaction step finishes.

Thrashing detection

ceph tell osd.0 rocksdb show cache O
shard  capacity     usage   pinned   elems  inserts  lookups     hits   misses
  ...
    2  13631488  11060608        0    2232   135076   908468   773313   135155
    3  13631488   8166896        0       1   134006   427070   293147   133923
    4  13631488  11117984        0    2297   133367   700242   567318   132924
  ...

It is likely that shard 3 is doing constant eviction. To verify, reset counters to zero and observe the relation between misses and hits. Also, the perf counter for bluefs.read_bytes will be increasing very fast as RocksDB is reading the same index blocks over and over again.

Mitigation

More like a workaround. Reduce rocksdb_cache_shard_bits. This will have a slight negative effect on the baseline RocksDB performance, as fewer shards means more opportunity for lock contention.

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.