Create a sha-256 hash of the shard request cache key by Bubbad · Pull Request #74877 · elastic/elasticsearch

Bubbad · 2021-07-02T08:25:24Z

This PR reduces the memory used by cache keys in the request cache.
Instead of using the entire shard search request JSON body as the key, we now hash the JSON body
using SHA-256. Every cache key then uses 32 bytes of heap, regardless of how large the JSON was,
which lowers the heap usage per cache entry.
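
As a rough illustration of the idea (not the PR's actual code), the sketch below uses the JDK's `java.security.MessageDigest` to show that a SHA-256 key is 32 bytes regardless of the request size. The class name `CacheKeySketch` and the sample request bodies are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not the Elasticsearch implementation.
class CacheKeySketch {

    // Returns the SHA-256 digest of the serialized request, which is always 32 bytes.
    static byte[] hashedKey(byte[] serializedRequest) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(serializedRequest);
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] small = "{\"query\":{\"match_all\":{}}}".getBytes(StandardCharsets.UTF_8);
        byte[] large = ("{\"query\":{\"terms\":{\"id\":[" + "1,".repeat(100_000) + "1]}}}")
            .getBytes(StandardCharsets.UTF_8);
        // Both keys have the same length even though the inputs differ greatly in size.
        System.out.println(small.length + " bytes -> " + hashedKey(small).length + "-byte key");
        System.out.println(large.length + " bytes -> " + hashedKey(large).length + "-byte key");
    }
}
```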

I think SHA-256 should be enough to avoid hash collisions in practice, but I'm definitely open to discussion if you think something else is better suited.

Closes #74061

elasticmachine · 2021-07-05T09:57:10Z

Pinging @elastic/es-search (Team:Search)

romseygeek · 2021-07-05T10:11:58Z

Thanks @Bubbad! I'd like to get @ywangd to have a look at this, given that he's just added some changes around security and the request cache.

ywangd · 2021-07-05T12:31:42Z

In theory, the larger the input space becomes, the higher the chance of a hash collision. With the entire source and possible DLS/FLS configurations as part of the cache key, the input space is indeed quite large for a 256-bit number. I also got the impression that the current cache key calculation favors safety over efficiency when enabling the request cache for DLS/FLS. I suspect there might be a (legacy?) reason, which I don't know, for why we didn't opt for efficiency, e.g. by hashing some of the values or removing the requestCache parameter from the cache key.

That said, even though the theoretical input space is large, I don't think any practical use case would be exhaustive enough to cause a collision. All cacheable searches (per shard) are likely in the order of thousands, tens of thousands, or even millions, and these numbers are far too low for a SHA-256 collision. So overall I think it is fine to use the SHA-256 hash as the cache key. Ping @tvernum for awareness.

ywangd · 2021-07-05T12:32:43Z

There is no need to copyBytes if all we need is the hash.

Bubbad · 2021-07-05

Pushed a commit without copying bytes now.

Bubbad · 2021-07-05T13:17:25Z

Thanks for your review @ywangd. Just to make sure I understood you: given that we think SHA-256 is enough to avoid collisions, do you think it's fine to merge this, even with your suspicion that there might have been a reason for not already hashing the cache key?

tvernum · 2021-07-06T03:59:13Z

SHA-256 should be fine. The risk we're mostly worried about is a birthday attack.
Assuming a perfectly balanced hash output and no mistakes in my calculations (neither of which are technically true), you would get a 1% chance of a collision with 4.8 × 10³⁷ hashes.
No node can hold that many keys, but even if it could, you would need more than 100 quintillion searches per second (that is, 100 billion billion) to be able to generate that number of hashes before the sun dies (10 billion years).
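
The figures above can be reproduced with the standard birthday-bound approximation n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N = 2^256 is the number of possible digests and p is the target collision probability. The sketch below is a back-of-the-envelope check, not part of the PR; the class name `BirthdayBound` and the 365.25-day year are assumptions.

```java
// Hypothetical helper for checking the collision estimate; not part of Elasticsearch.
class BirthdayBound {

    // Approximate number of uniformly random hashes needed to reach
    // collision probability p in an output space of 2^bits values.
    static double hashesForCollisionProbability(int bits, double p) {
        double space = Math.pow(2.0, bits);
        // -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for small p.
        return Math.sqrt(2.0 * space * -Math.log1p(-p));
    }

    public static void main(String[] args) {
        double n = hashesForCollisionProbability(256, 0.01);
        double seconds = 10e9 * 365.25 * 24 * 3600; // 10 billion years, in seconds
        System.out.printf("hashes for a 1%% collision chance: %.2e%n", n);
        System.out.printf("required searches per second:     %.2e%n", n / seconds);
    }
}
```

Under these assumptions the first figure comes out at roughly 4.8 × 10³⁷ and the second at roughly 1.5 × 10²⁰, consistent with "more than 100 quintillion" searches per second.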

That said, a hash collision would be pretty terrible.
The alternative is to hash the query (and the Differentiator, but more on that below), but not all the request data.

It's the query body (technically, source) that is variable size and can cause cache blowout (and maybe the alias filter too?). We can solve the original problem that large queries cause large cache entries by hashing that, with essentially zero risk of collision.

The same issue applies to the differentiator - it could also be quite large, so we can also change the DLS differentiator to hash the array it generates.
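
A minimal sketch of the alternative described above, assuming a simplified key with only two small fields. The class `PartialHashKey` and its fields `clusterAlias` and `shardId` are hypothetical stand-ins, not the real shard request structure: the small fields are stored verbatim and compared exactly, and only the variable-size source is replaced by its SHA-256 digest.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Objects;

// Hypothetical key layout for illustration only.
final class PartialHashKey {
    private final String clusterAlias; // small, kept as-is
    private final int shardId;         // small, kept as-is
    private final byte[] sourceHash;   // 32 bytes, whatever the source size

    PartialHashKey(String clusterAlias, int shardId, byte[] source) {
        this.clusterAlias = clusterAlias;
        this.shardId = shardId;
        this.sourceHash = sha256(source);
    }

    private static byte[] sha256(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PartialHashKey)) return false;
        PartialHashKey other = (PartialHashKey) o;
        // Small fields are compared exactly; only the source is compared by digest.
        return shardId == other.shardId
            && Objects.equals(clusterAlias, other.clusterAlias)
            && Arrays.equals(sourceHash, other.sourceHash);
    }

    @Override
    public int hashCode() {
        return 31 * Objects.hash(clusterAlias, shardId) + Arrays.hashCode(sourceHash);
    }
}
```

With a layout like this, two keys can only collide if two different sources produce the same digest while all the verbatim fields also match, so the collision risk is confined to the hashed part rather than eliminated.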

Bubbad · 2021-07-06T08:29:51Z

Hi @tvernum, thanks for your reply.

Yeah, collisions shouldn't be able to happen in practice using SHA-256, but as you say, if we can guarantee no collisions that would be even better.

In your alternative solution, to only hash the source and the differentiator-specific stuff, I'm not sure I see how that would guarantee uniqueness of the cache key. Ignoring the differentiator for now: if I interpret you correctly, we should only hash the source here and keep the others as is. Wouldn't you still end up with a theoretical chance of collision? If everything other than the source in the request data is the same (same cluster alias, same aliasFilters, etc.), then in theory you can still have collisions where two different sources (query bodies) generate the same hash.

Or am I missing something here?

costin · 2021-07-06T11:05:30Z

Any reason why a cryptographic hashing function was picked? For cache purposes alone, a non-cryptographic hashing function (like xxhash) should yield better results across the board.

Bubbad · 2021-07-06T11:38:09Z

@costin The only reasons were that it's a well-known, solid technology and that it's already used in several places in ES. But I'm definitely up for changing it if needed. Do you think performance will be an issue here?

tvernum · 2021-07-07T07:26:59Z

For the new differentiator (used for result caching when DLS/FLS is active), we need a hash that is resistant to collision and preimage attacks. I don't know of any non-cryptographic hash functions that can provide that.

We could hash the query and differentiator with different hashing algorithms, so that the performance cost is only paid when DLS/FLS are involved, but I don't think we can use xxhash (or similar) for anything that incorporates the differentiator.

costin · 2021-07-07T12:31:44Z

> I don't know of any non-cryptographic hash functions that can provide that.

👍

Bubbad · 2021-07-16T08:17:21Z

@tvernum Ping about my question from 2021-07-06 above. (Sorry if you've just been busy.)

tvernum · 2021-07-26T06:13:56Z

My point is that most of the data is small, so hashing introduces a tiny collision risk for a small gain in memory consumption.

The query is potentially large, so the improvement in memory consumption is more significant. That changes the tradeoff - the overall value is greater, and can justify the collision risk.

I don't actually have an opinion here - it's not my area of code.
I think someone from @elastic/es-search needs to make the call on this, I'm just presenting risks & options.

romseygeek · 2021-09-08T14:09:05Z

Sorry for the radio silence @Bubbad, life and holidays intervened here. I think we're good with this implementation, and it should give us a nice memory saving for large queries. I'll run the test suite and see if that's happy with it as well.

@elasticmachine test this please

Bubbad · 2021-09-09T15:14:11Z

Sure, no problem! That's great news! I saw that the build failed; I think I had rebased onto a somewhat broken commit. I've now rebased onto current master, so please run it again and hopefully it works better.

romseygeek · 2021-09-09T15:17:17Z

@elasticmachine test this please

romseygeek · 2021-09-10T08:01:55Z

It looks like we have some genuine test failures here - are you able to reproduce and investigate them @Bubbad?

Bubbad · 2021-09-10T09:46:20Z

It seems that the first PR fix, which removed the .copyBytes() call, broke the tests. Apparently some BytesReference implementations don't support the .array() call. I've now pushed a fix that iterates the BytesReference instead, which should be supported by all implementations. Can you please review that commit?

Running ./gradlew check locally now succeeds for all tests except some org.elasticsearch.node.NodeTests tests, which seem to fail on the master branch for me as well, so I guess that's just something on my computer. If you think the commit looks good, I think we can try it out on your Jenkins again.
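
The failure mode described above has a close analogue in the JDK, which the following sketch uses as a stand-in (it does not use Elasticsearch's `BytesReference`): a direct `ByteBuffer` has no accessible backing array, so `array()` is unsupported, while feeding the buffer itself to the digest works for both heap and direct buffers. The class name `DigestWithoutArray` is made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// JDK-only analogy; not the code from this PR.
class DigestWithoutArray {

    // Hashes the remaining bytes of any ByteBuffer without calling array().
    static byte[] sha256(ByteBuffer buffer) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // duplicate() shares the content but has its own position,
            // so the caller's buffer is left untouched.
            digest.update(buffer.duplicate());
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "{\"query\":{\"match_all\":{}}}".getBytes(StandardCharsets.UTF_8);
        ByteBuffer heap = ByteBuffer.wrap(data);
        ByteBuffer direct = ByteBuffer.allocateDirect(data.length);
        direct.put(data);
        direct.flip();

        // A direct buffer reports no backing array, so array() cannot be relied on.
        System.out.println("direct.hasArray(): " + direct.hasArray());
        System.out.println("same digest: " + MessageDigest.isEqual(sha256(heap), sha256(direct)));
    }
}
```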

romseygeek · 2021-09-10T10:24:12Z

@elasticmachine test this please

romseygeek · 2021-09-10T13:43:18Z

@elasticmachine update branch

romseygeek · 2021-09-10T13:43:36Z

@elasticmachine ok to test

romseygeek · 2021-09-10T13:46:33Z

-            // copy it over since we don't want to share the thread-local bytes in #scratch
-            return out.copyBytes();
+
+            return new BytesArray(getHashedCacheKey(out.bytes()));


I think we can use MessageDigests.digest(out.bytes(), MessageDigests.sha256()) here rather than adding a new method?

Bubbad · 2021-09-10

Ah, I didn't see that that function already existed. Pushed a fix now!

romseygeek

LGTM, thanks for your patience on this @Bubbad

We currently use the plaintext body of a shard request as the key to the request cache. This has the disadvantage that very large requests can quickly fill up the cache due to the size of their keys. With this commit, we instead use a sha-256 hash of the shard request as the cache key, which will use a constant (and much smaller) number of bytes.

elasticsearchmachine added the v8.0.0 and external-contributor (pull request authored by a developer outside the Elasticsearch team) labels Jul 2, 2021

Bubbad mentioned this pull request Jul 2, 2021

Reduce memory usage of shard request cache keys #74061

Closed

Bubbad changed the title ~~Create a sha-256 hash of the cache key if possible~~ Create a sha-256 hash of the shard request cache key if possible Jul 2, 2021

Bubbad changed the title ~~Create a sha-256 hash of the shard request cache key if possible~~ Create a sha-256 hash of the shard request cache key Jul 2, 2021

Bubbad force-pushed the cache_hash branch 7 times, most recently from bdd7d8b to 1a291d9 Compare July 2, 2021 13:56

romseygeek added the :Search/Search DO NOT USE DEPRECATED - DO NOT USE label Jul 5, 2021

elasticmachine added the Team:Search DEPRECATED - DO NOT USE label Jul 5, 2021

ywangd reviewed Jul 5, 2021

Bubbad added 2 commits September 9, 2021 17:11

Create a sha-256 hash of the cache key

693b77b

Review fix - Dont copy bytes

7089aad

Bubbad force-pushed the cache_hash branch from e915835 to 7089aad Compare September 9, 2021 15:11

Fixed broken tests by calculating the cache key in a way that all BytesReference implementations support

03ebb3d

romseygeek added the v7.16.0 label Sep 10, 2021

Merge branch 'master' into cache_hash

afccdd1

romseygeek reviewed Sep 10, 2021

Use MessageDigests.digest instead of custom function

37e3799

Bubbad force-pushed the cache_hash branch from 786fcc2 to 37e3799 Compare September 10, 2021 15:48

romseygeek approved these changes Sep 13, 2021

romseygeek added the auto-backport-and-merge label Sep 13, 2021

romseygeek merged commit 553e8dc into elastic:master Sep 13, 2021

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

danhermann added the >enhancement label Dec 3, 2021

9 participants
