Skip to content

Create a sha-256 hash of the shard request cache key#74877

Merged
romseygeek merged 5 commits into
elastic:masterfrom
Bubbad:cache_hash
Sep 13, 2021
Merged

Create a sha-256 hash of the shard request cache key#74877
romseygeek merged 5 commits into
elastic:masterfrom
Bubbad:cache_hash

Conversation

@Bubbad

@Bubbad Bubbad commented Jul 2, 2021

Copy link
Copy Markdown
Contributor

This PR optimizes the memory usage for cache keys in the request_cache.
Instead of using the entire shard search request JSON body as a key, we will now hash the json body
using SHA-256. This makes every cache key use 32 bytes of heap, instead of however large the JSON was,
which will lower the heap usage per entry in the cache.

I think SHA-256 should be enough to avoid the chance of having hash collisions, but I'm definitely open for discussion here if you think something else is better suited.

Closes #74061

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 2, 2021
@Bubbad Bubbad changed the title Create a sha-256 hash of the cache key if possible Create a sha-256 hash of the shard request cache key if possible Jul 2, 2021
@Bubbad Bubbad changed the title Create a sha-256 hash of the shard request cache key if possible Create a sha-256 hash of the shard request cache key Jul 2, 2021
@Bubbad Bubbad force-pushed the cache_hash branch 7 times, most recently from bdd7d8b to 1a291d9 Compare July 2, 2021 13:56
@romseygeek romseygeek added the :Search/Search DO NOT USE DEPRECATED - DO NOT USE label Jul 5, 2021
@elasticmachine elasticmachine added the Team:Search DEPRECATED - DO NOT USE label Jul 5, 2021
@elasticmachine

Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search (Team:Search)

@romseygeek

Copy link
Copy Markdown
Contributor

Thanks @Bubbad! I'd like to get @ywangd to have a look at this, given that he's just added some changes around security and the request cache.

@ywangd

ywangd commented Jul 5, 2021

Copy link
Copy Markdown
Member

In theory, the larger the input space becomes, the higher chance the hash collision has. With the entire source and possible DLS/FLS configurations as part of the cache key, the input space is indeed quite large for a 256 bit number. I also got the impression that the current cache key calculation is safety over efficiency when working enabling request cache for DLS/FLS. I suspected there might be a (legacy?) reason (which I don't know) for why we didn't opt for efficiency, e.g. hashing some of the values, or removing the requestCache parameter from the cache key.

That said, even though the theoretical input space is large, I don't think any practical use cases would be exhaustive enough to cause a collision. It's likely all cachable searches (per shard) are in the order of thousands, ten thousands, or even millions and these numbers too low for a sha-256 collision. So overall I think it is find to use the sha-256 hash as the cache key. Ping @tvernum for awareness.

Comment on lines 410 to 411

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to copyBytes if all we need is the hash.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a commit without copying bytes now

@Bubbad

Bubbad commented Jul 5, 2021

Copy link
Copy Markdown
Contributor Author

Thanks for your review @ywangd. Just to make sure I understood you; Given that we think sha256 is enough to avoid collisions, do you think it's fine to merge this, even with your suspicion that there might've been a reason for not already hashing the cache key?

@tvernum

tvernum commented Jul 6, 2021

Copy link
Copy Markdown
Contributor

SHA-256 should be fine. The risk we're mostly worried about is a birthday attack.
Assuming a perfectly balanced hash output and no mistakes in my calculations (neither of which are technically true), you would get a 1% chance of a collision with 4.8 × 1037 hashes.
No node can hold that many keys, but even if it could, you would need more than 100 quintillion searches per second (that is 100 billion billion) in order to be able generate that number of hashes before the sun dies (10 billion years).

That said, a hash collision would be pretty terrible.
The alternative is to hash the query (and the Differentiator, but more on that below), but not all the request data.

It's the query body (technically, source) that is variable size and can cause cache blowout (and maybe the alias filter too?). We can solve the original problem that large queries cause large cache entries by hashing that, with essentially zero risk of collision.

The same issue applies to the differentiator - it could also be quite large, so we can also change the DLS differentiator to hash the array it generates.

@Bubbad

Bubbad commented Jul 6, 2021

Copy link
Copy Markdown
Contributor Author

Hi @tvernum , thanks for your reply.

Yeah collisions shouldn't be able to happen in practice using sha-256, but as you say, if we can guarantee no collisions that would be even better.

In your alternative solution, to only hash the source and the differentiator specific stuff, I'm not sure I see how that would guarantee uniqueness of the cache key. Ignoring the differentiator for now; If I interpret you correctly, then we should only hash the source here and keep the others as is. Wouldn't you still end up with a theoretical chance of collision? If everything other than the source in the request data is the same, (same cluster alias, same aliasFilters etc), then in theory you can still have collisions where 2 different sources (query bodies) generate the same hash.

Or am I missing something here?

@costin

costin commented Jul 6, 2021

Copy link
Copy Markdown
Member

Any reason why a cryptographic hashing function was picked? For cache purposes alone a hashing function (like xxhash) should yield better results across the board.

@Bubbad

Bubbad commented Jul 6, 2021

Copy link
Copy Markdown
Contributor Author

@costin Only reasons were it being a well known and good technology and that it's already being used in different places in ES. But I'm definitely up for changing it if needed. Do you think performance will be an issue here?

@tvernum

tvernum commented Jul 7, 2021

Copy link
Copy Markdown
Contributor

For the new differentiator (used for result caching when DLS/FLS is active), we need a hash that is resistant to collision & preimage attacks. I don't know of any non-cryptographic hashes functions that can provide that.

We could hash the query and differentiator with different hashing algorithms, so that the performance cost is only paid when DLS/FLS are involved, but I don't think we can use xxhash (or similar) for anything that incorporates the differentiator.

@costin

costin commented Jul 7, 2021

Copy link
Copy Markdown
Member

I don't know of any non-cryptographic hashes functions that can provide that.

👍

@Bubbad

Bubbad commented Jul 16, 2021

Copy link
Copy Markdown
Contributor Author

Hi @tvernum , thanks for your reply.

Yeah collisions shouldn't be able to happen in practice using sha-256, but as you say, if we can guarantee no collisions that would be even better.

In your alternative solution, to only hash the source and the differentiator specific stuff, I'm not sure I see how that would guarantee uniqueness of the cache key. Ignoring the differentiator for now; If I interpret you correctly, then we should only hash the source here and keep the others as is. Wouldn't you still end up with a theoretical chance of collision? If everything other than the source in the request data is the same, (same cluster alias, same aliasFilters etc), then in theory you can still have collisions where 2 different sources (query bodies) generate the same hash.

Or am I missing something here?

@tvernum Ping about this question. (Sorry if you've just been buzy)

@tvernum

tvernum commented Jul 26, 2021

Copy link
Copy Markdown
Contributor

My point is that most of the data is small, so hashing introduces a tiny collision risk for a small gain in memory consumption.

The query is potentially large, so the improvement in memory consumption is more significant. That changes the tradeoff - the overall value is greater, and can justify the collision risk.

I don't actually have an opinion here - it's not my area of code.
I think someone from @elastic/es-search needs to make the call on this, I'm just presenting risks & options.

@romseygeek

Copy link
Copy Markdown
Contributor

Sorry for the radio silence @Bubbad, life and holidays intervened here. I think we're good with this implementation, and it should give us a nice memory saving for large queries. I'll run the test suite and see if that's happy with it as well.

@elasticmachine test this please

@Bubbad

Bubbad commented Sep 9, 2021

Copy link
Copy Markdown
Contributor Author

Sorry for the radio silence @Bubbad, life and holidays intervened here. I think we're good with this implementation, and it should give us a nice memory saving for large queries. I'll run the test suite and see if that's happy with it as well.

@elasticmachine test this please

Sure, no problem! That's great news! I saw the build failed, I think I had it rebased on a somewhat broken commit. Rebased it on current master now, so please run it again now and hopefully it should work better.

@romseygeek

Copy link
Copy Markdown
Contributor

@elasticmachine test this please

@romseygeek

Copy link
Copy Markdown
Contributor

It looks like we have some genuine test failures here - are you able to reproduce and investigate them @Bubbad?

@Bubbad

Bubbad commented Sep 10, 2021

Copy link
Copy Markdown
Contributor Author

It looks like we have some genuine test failures here - are you able to reproduce and investigate them @Bubbad?

Seems that the first pr fix to remove the .copyBytes() call broke the tests. Apparently some BytesReference implementations doesn't support the .array() call. I've pushed a fix now that instead iterates the BytesReference, which should be supported by all implementations. Can you please give that commit a review?

Running ./gradlew check locally now succeeds with all tests except for some org.elasticsearch.node.NodeTests tests, which seems to fail in the master branch for me as well, so I guess that's just something on my computer. If you think the commit looks good I think we can try it out on your jenkins again.

@romseygeek

Copy link
Copy Markdown
Contributor

@elasticmachine test this please

@romseygeek

Copy link
Copy Markdown
Contributor

@elasticmachine update branch

@romseygeek

Copy link
Copy Markdown
Contributor

@elasticmachine ok to test

// copy it over since we don't want to share the thread-local bytes in #scratch
return out.copyBytes();

return new BytesArray(getHashedCacheKey(out.bytes()));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use MessageDigests.digest(out.bytes(), MessageDigests.sha256()) here rather than adding a new method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't see that that function already existed. Pushed a fix now!

@romseygeek romseygeek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your patience on this @Bubbad

@romseygeek romseygeek merged commit 553e8dc into elastic:master Sep 13, 2021
romseygeek pushed a commit that referenced this pull request Sep 13, 2021
We currently use the plaintext body of a shard request as the key to the 
request cache.  This has the disadvantage that very large requests can
quickly fill up the cache due to the size of their keys.  With this commit, 
we instead use a sha-256 hash of the shard request as the cache key, 
which will use a constant (and much smaller) number of bytes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team :Search/Search DO NOT USE DEPRECATED - DO NOT USE Team:Search DEPRECATED - DO NOT USE v7.16.0 v8.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce memory usage of shard request cache keys

9 participants