Poor jemalloc performance with zeroed allocations leading to TLB shootdown #27275

alessandrod · 2022-08-19T21:43:48Z

Problem

While profiling a branch including all the patches needed to bring direct account mapping with abiv1, I noticed a very large number of TLB flushes and page faults caused by the program runtime. Initially I feared that the direct mapping changes were somehow causing the issue, but I've now observed that the problem can happen on master as well. Direct mapping does seem to make it worse, most likely by making the program runtime threads a lot faster (the irony!).

The problem is the following:

It looks like jemalloc always force-purges zeroed extents immediately, instead of implementing two-phase release like it does for non-zeroed allocations. Two-phase cleanup reduces the overhead of allocating/deallocating memory, at the expense of retaining a bit more memory during the decay period. Furthermore, jemalloc purges zeroed extents using madvise(MADV_DONTNEED), which requires a TLB flush, and with our allocation sizes that means a full TLB flush (the theory being that doing a full flush is faster than flushing the individual page entries).

Since we run the program runtime inside rayon, we have a bunch of threads constantly flushing TLBs, and therefore end up in a by-the-book TLB shootdown (https://web.njit.edu/~dingxn/papers/ispa20.pdf).

To confirm that the shootdown is caused by the interaction between the rayon thread pool and jemalloc (the default glibc allocator doesn't exhibit the problem), I've written a minimal test case which mimics the CallFrame allocation we do in the program runtime: https://gist.github.com/alessandrod/a80788429873a4b9caa6aa53a82e0b2b
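The two non-slab strategies being compared can be sketched roughly as follows. This is not the gist's code: `Frame`, `FRAME_SIZE` and `churn` are hypothetical names, the 64 KiB size is an assumption chosen only to exceed the default per-thread cache limit mentioned later, and plain `std::thread` stands in for the rayon pool. It uses whatever global allocator the binary is built with, so it would only exercise the jemalloc behaviour when jemalloc is installed as the global allocator.

```rust
use std::alloc::{alloc, alloc_zeroed, dealloc, Layout};

/// Hypothetical frame size for illustration (64 KiB); not taken from the runtime.
const FRAME_SIZE: usize = 64 * 1024;

/// A heap buffer released on drop; stands in for one call frame.
struct Frame {
    ptr: *mut u8,
    layout: Layout,
}

impl Frame {
    /// "calloc" strategy: ask the allocator for already-zeroed memory.
    fn calloc(size: usize) -> Frame {
        assert!(size > 0, "zero-sized allocations are not supported here");
        let layout = Layout::from_size_align(size, 16).expect("valid layout");
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Frame { ptr, layout }
    }

    /// "malloc_memset" strategy: allocate, then zero every byte by hand,
    /// which touches (and pages in) the whole allocation.
    fn malloc_memset(size: usize) -> Frame {
        assert!(size > 0, "zero-sized allocations are not supported here");
        let layout = Layout::from_size_align(size, 16).expect("valid layout");
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        unsafe { std::ptr::write_bytes(ptr, 0, size) };
        Frame { ptr, layout }
    }

    fn as_slice(&self) -> &[u8] {
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }
}

impl Drop for Frame {
    fn drop(&mut self) {
        // Each frame goes back to the allocator individually.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}

/// Allocates and frees `iters` frames on each of `threads` threads and
/// returns the total number of bytes allocated.
fn churn(threads: usize, iters: usize, make: fn(usize) -> Frame) -> usize {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                s.spawn(move || {
                    let mut bytes = 0;
                    for _ in 0..iters {
                        let frame = make(FRAME_SIZE);
                        bytes += std::hint::black_box(frame.as_slice()).len();
                    }
                    bytes
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum()
    })
}
```

Calling `churn(64, 10_000, Frame::calloc)` and `churn(64, 10_000, Frame::malloc_memset)` under a profiler would be one way to compare the two paths; the thread and iteration counts here are arbitrary.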

Here are perf numbers on a 64 vCPU gcloud VM:

$ hyperfine -i -L alloc malloc_memset,calloc,calloc_slab  'target/release/examples/mem {alloc}'
Benchmark 1: target/release/examples/mem malloc_memset
  Time (mean ± σ):     122.5 ms ±  22.5 ms    [User: 1566.1 ms, System: 113.2 ms]
  Range (min … max):    59.3 ms … 176.6 ms    23 runs
 
Benchmark 2: target/release/examples/mem calloc
  Time (mean ± σ):     260.2 ms ±  28.7 ms    [User: 370.3 ms, System: 5734.7 ms]
  Range (min … max):   207.0 ms … 293.6 ms    10 runs
 
Benchmark 3: target/release/examples/mem calloc_slab
  Time (mean ± σ):      94.5 ms ±  10.0 ms    [User: 85.6 ms, System: 237.9 ms]
  Range (min … max):    64.8 ms … 123.1 ms    28 runs
 
Summary
  'target/release/examples/mem calloc_slab' ran
    1.30 ± 0.27 times faster than 'target/release/examples/mem malloc_memset'
    2.75 ± 0.42 times faster than 'target/release/examples/mem calloc'

You can see that calloc is far slower than malloc_memset (260.2 ms vs 122.5 ms mean, with 5734.7 ms vs 113.2 ms of system time), even though malloc_memset causes nearly twice as many page faults, since it pages in the whole allocation to zero it.

calloc_slab works around the problem by pre-allocating large zeroed extents and then purging them in one go, so there is only one TLB flush when the whole slab is deallocated. This confirms that the problem is caused by releasing many small calloc allocations. I've prototyped this for the program runtime, with one slab per transaction execution. Unfortunately, since we don't have a hard max number of instructions that can be executed per transaction, the slab needs to be quite large, and while it improves perf, it also increases peak virtual memory usage significantly (although actual paged-in memory stays lower than with malloc_memset).
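A minimal sketch of the slab idea, assuming fixed-size frames and a known upper bound on how many are needed. `ZeroedSlab`, `push_frame` and `frame_mut` are hypothetical names, not the prototype's API. It relies on `vec![0u8; n]`, which in current Rust standard library implementations typically requests zeroed memory from the allocator rather than writing zeros itself; treat that as an assumption. Frames are only ever bumped, never handed back, so nothing is released until the slab itself is dropped.

```rust
/// One large zeroed region that hands out fixed-size frames by bumping a counter.
/// The whole region is returned to the allocator in a single deallocation on drop.
struct ZeroedSlab {
    memory: Vec<u8>,
    frame_size: usize,
    used: usize,
}

impl ZeroedSlab {
    /// Reserves room for `max_frames` frames of `frame_size` bytes each.
    fn new(frame_size: usize, max_frames: usize) -> Self {
        assert!(frame_size > 0, "frame size must be non-zero");
        let total = frame_size
            .checked_mul(max_frames)
            .expect("slab size overflows usize");
        ZeroedSlab {
            memory: vec![0u8; total],
            frame_size,
            used: 0,
        }
    }

    /// Number of frames the slab can hold in total.
    fn capacity(&self) -> usize {
        self.memory.len() / self.frame_size
    }

    /// Claims the next untouched (still zeroed) frame and returns its index,
    /// or `None` once the slab is exhausted.
    fn push_frame(&mut self) -> Option<usize> {
        if self.used == self.capacity() {
            return None;
        }
        let index = self.used;
        self.used += 1;
        Some(index)
    }

    /// Mutable view of a previously claimed frame.
    fn frame_mut(&mut self, index: usize) -> &mut [u8] {
        assert!(index < self.used, "frame was never claimed");
        let start = index * self.frame_size;
        &mut self.memory[start..start + self.frame_size]
    }
}
```

The exhaustion case (`push_frame` returning `None`) is where the sizing problem described above shows up: without a hard bound on frames per transaction, `max_frames` has to be chosen generously.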

Jemalloc implements two levels of caching: a small lock-free, per-thread cache and then larger arenas shared among threads. Turns out one way to avoid this particular issue is to make sure that the allocation fits in the per-thread cache (default is 32k, here I bumped it to 256k):

$ MALLOC_CONF=tcache_max:262144 hyperfine -i -L alloc malloc_memset,calloc,calloc_slab  'target/release/examples/mem {alloc}'
Benchmark 1: target/release/examples/mem malloc_memset
  Time (mean ± σ):     131.8 ms ±   9.1 ms    [User: 1346.1 ms, System: 113.6 ms]
  Range (min … max):   119.6 ms … 149.5 ms    22 runs
 
Benchmark 2: target/release/examples/mem calloc
  Time (mean ± σ):     135.6 ms ±   8.5 ms    [User: 1404.5 ms, System: 127.1 ms]
  Range (min … max):   124.7 ms … 154.3 ms    21 runs
 
Benchmark 3: target/release/examples/mem calloc_slab
  Time (mean ± σ):     100.5 ms ±   8.6 ms    [User: 104.2 ms, System: 308.4 ms]
  Range (min … max):    88.0 ms … 132.5 ms    30 runs
 
Summary
  'target/release/examples/mem calloc_slab' ran
    1.31 ± 0.14 times faster than 'target/release/examples/mem malloc_memset'
    1.35 ± 0.14 times faster than 'target/release/examples/mem calloc'

Proposed Solution

Has anyone looked into tuning jemalloc for the validator? This issue aside, I see that there's quite a bit of memory churn, so I'm tempted to fix this issue (and possibly more) by running the jemalloc profiler and making sure that more allocations get cached.


alessandrod · 2022-08-19T22:25:10Z

Btw for the lols: if you look at the stack trace, there's a _rjem_je_ehooks_default_zero_impl callback. Great! I thought I'd implement my own callback and make it not purge so often. Then I found this: https://github.com/jemalloc/jemalloc/blob/deb8e62a837b6dd303128a544501a7dc9677e47a/include/jemalloc/internal/ehooks.h#L367

ryoqun · 2022-08-20T07:51:17Z

hehe, nice finding.

i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

alessandrod · 2022-08-23T08:50:34Z

> i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

I thought about that and it'd be fairly easy to implement. Max frame size is fixed and CPIs are nested in the host stack too so we don't even need alloca. But it would merge the SBF stack with the host stack, which from a security perspective isn't worth the tradeoff I think.

ryoqun · 2024-05-15T06:20:22Z

> i think we can use alloca or equivalent with increased pthread stack size? After all, cpis are like normal function calls in terms of its temporary Vecs lifetime.

after almost 2 years, i finally got my hands on this: anza-xyz#1364

sakridge added the validator label Oct 21, 2022

github-actions bot added the stale label Oct 23, 2023

github-actions bot closed this as not planned Oct 31, 2023

alessandrod reopened this Nov 9, 2023

behzadnouri removed the stale label Nov 9, 2023

This was referenced May 10, 2024

Use mimalloc in attempt to reduce mem alloc perf. oddities anza-xyz/agave#1250 (Open)

Use tls-zeroed-aligned-memory anza-xyz/agave#1364 (Draft)

alessandrod mentioned this issue Jun 10, 2024

bpf_loader: use an explicit thread-local pool for stack and heap memory anza-xyz/agave#1370 (Merged)

ksolana mentioned this issue Aug 7, 2024

Upgrade tikv-jemallocator to 0.6 anza-xyz/agave#2396 (Open)
