Create a sha-256 hash of the shard request cache key #74877
bdd7d8b to
1a291d9
|
Pinging @elastic/es-search (Team:Search) |
|
In theory, the larger the input space becomes, the higher the chance of a hash collision. With the entire That said, even though the theoretical input space is large, I don't think any practical use case would be exhaustive enough to cause a collision. The number of cacheable searches (per shard) is likely in the thousands, tens of thousands, or even millions, and these numbers are too low for a sha-256 collision. So overall I think it is fine to use the sha-256 hash as the cache key. Ping @tvernum for awareness. |
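For a rough sense of scale, the birthday bound behind this argument can be sketched as below. This is only an approximation, assuming the hash output behaves like a uniformly random value of the given width; `BirthdayBound` and `collisionProbability` are hypothetical names for illustration and are not part of Elasticsearch.

```java
final class BirthdayBound {
    /**
     * Approximate probability that at least two of n uniformly random keys
     * collide in a space of 2^bits values: n * (n - 1) / 2 / 2^bits.
     * The approximation is only meaningful while the result is far below 1.
     */
    static double collisionProbability(double n, int bits) {
        double pairs = n * (n - 1) / 2.0;      // number of distinct key pairs
        return pairs * Math.pow(2.0, -bits);   // chance that any one pair collides
    }

    public static void main(String[] args) {
        // One million cached entries per shard, hashed to 256 bits.
        System.out.println(collisionProbability(1e6, 256));
        // The same number of entries with a 64-bit hash, for comparison.
        System.out.println(collisionProbability(1e6, 64));
    }
}
```

With a million entries the 256-bit estimate is many orders of magnitude below the 64-bit one, which is the gap the comment above relies on. The estimate covers accidental collisions only, not deliberately constructed inputs.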
There is no need to copyBytes if all we need is the hash.
Pushed a commit without copying bytes now
|
Thanks for your review @ywangd. Just to make sure I understood you; Given that we think sha256 is enough to avoid collisions, do you think it's fine to merge this, even with your suspicion that there might've been a reason for not already hashing the cache key? |
|
SHA-256 should be fine. The risk we're mostly worried about is a birthday attack. That said, a hash collision would be pretty terrible. It's the query body (technically, The same issue applies to the differentiator - it could also be quite large, so we can also change the DLS differentiator to hash the array it generates. |
|
Hi @tvernum , thanks for your reply. Yeah collisions shouldn't be able to happen in practice using sha-256, but as you say, if we can guarantee no collisions that would be even better. In your alternative solution, to only hash the source and the differentiator specific stuff, I'm not sure I see how that would guarantee uniqueness of the cache key. Ignoring the differentiator for now; If I interpret you correctly, then we should only hash the source here and keep the others as is. Wouldn't you still end up with a theoretical chance of collision? If everything other than the source in the request data is the same, (same cluster alias, same aliasFilters etc), then in theory you can still have collisions where 2 different sources (query bodies) generate the same hash. Or am I missing something here? |
|
Any reason why a cryptographic hashing function was picked? For cache purposes alone a hashing function (like xxhash) should yield better results across the board. |
|
@costin Only reasons were it being a well known and good technology and that it's already being used in different places in ES. But I'm definitely up for changing it if needed. Do you think performance will be an issue here? |
|
For the new differentiator (used for result caching when DLS/FLS is active), we need a hash that is resistant to collision and preimage attacks. I don't know of any non-cryptographic hash functions that can provide that. We could hash the query and differentiator with different hashing algorithms, so that the performance cost is only paid when DLS/FLS are involved, but I don't think we can use xxhash (or similar) for anything that incorporates the differentiator. |
👍 |
@tvernum Ping about this question. (Sorry if you've just been busy) |
|
My point is that most of the data is small, so hashing introduces a tiny collision risk for a small gain in memory consumption. The query is potentially large, so the improvement in memory consumption is more significant. That changes the tradeoff - the overall value is greater, and can justify the collision risk. I don't actually have an opinion here - it's not my area of code. |
|
Sorry for the radio silence @Bubbad, life and holidays intervened here. I think we're good with this implementation, and it should give us a nice memory saving for large queries. I'll run the test suite and see if that's happy with it as well. @elasticmachine test this please |
Sure, no problem! That's great news! I saw the build failed; I think I had it rebased on a somewhat broken commit. I've rebased it on current master now, so please run it again and hopefully it will work better. |
|
@elasticmachine test this please |
|
It looks like we have some genuine test failures here - are you able to reproduce and investigate them @Bubbad? |
…esReference implementations support
Seems that the first pr fix to remove the Running |
|
@elasticmachine test this please |
|
@elasticmachine update branch |
|
@elasticmachine ok to test |
| // copy it over since we don't want to share the thread-local bytes in #scratch | ||
| return out.copyBytes(); | ||
|
|
||
| return new BytesArray(getHashedCacheKey(out.bytes())); |
I think we can use MessageDigests.digest(out.bytes(), MessageDigests.sha256()) here rather than adding a new method?
Ah, didn't see that that function already existed. Pushed a fix now!
romseygeek
left a comment
LGTM, thanks for your patience on this @Bubbad
We currently use the plaintext body of a shard request as the key to the request cache. This has the disadvantage that very large requests can quickly fill up the cache due to the size of their keys. With this commit, we instead use a sha-256 hash of the shard request as the cache key, which will use a constant (and much smaller) number of bytes.
This PR optimizes the memory usage for cache keys in the request_cache.
Instead of using the entire shard search request JSON body as a key, we will now hash the JSON body
using SHA-256. This makes every cache key use 32 bytes of heap, instead of however large the JSON was,
which will lower the heap usage per entry in the cache.
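
The idea can be sketched with the JDK alone. This is a minimal illustration, not the code in this PR: it assumes the serialized request is available as a plain `byte[]`, and `CacheKeySketch` and `hashedCacheKey` are hypothetical names. The PR itself works on Elasticsearch's own byte types.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class CacheKeySketch {
    /**
     * Returns the SHA-256 digest of the serialized request.
     * The result is always 32 bytes, regardless of the input size.
     */
    static byte[] hashedCacheKey(byte[] serializedRequest) {
        try {
            // A fresh MessageDigest per call avoids sharing mutable state across threads.
            return MessageDigest.getInstance("SHA-256").digest(serializedRequest);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

A request body of a few bytes and one of several megabytes both produce a 32-byte key, so the key's contribution to each cache entry no longer grows with the request.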
I think SHA-256 should be enough to avoid hash collisions in practice, but I'm definitely open to discussion here if you think something else is better suited.
Closes #74061