Interrupt aggregation reduce phase if the search task is cancelled#71714
Conversation
9af99f3 to
f412f45
Compare
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
imotov
left a comment
There was a problem hiding this comment.
It would be great to add an integration test for it and figure out a more robust way of passing the search task into aggregations. I also renamed the PR to better reflect what it is actually trying to achieve.
| private final NamedWriteableRegistry namedWriteableRegistry; | ||
| private final Function<SearchRequest, InternalAggregation.ReduceContextBuilder> requestToAggReduceContextBuilder; | ||
|
|
||
| private SearchTask searchTask; |
There was a problem hiding this comment.
SearchPhaseController is a singleton that is created here, so this instance of the searchTask is going to be shared by all searches in the node leading to a race condition.
f286295 to
bc34d99
Compare
|
I don't think it is a good idea to rely on a side-effect of somebody calling createTask and I definitely prefer not to stash task on the request. This implementation also doesn't always work since we override createTask when we do async search. I would rather figure out how to make searchTask available in the aggregation layer properly. But before we jump into implementation I think we should start with a proper test for it. |
|
Sounds good, I'll work on the integration test first then. |
|
@danielwhsu it looks like are reusing the same script that is used in search and it fails on init because it is executed in a completely different context and has no idea where to find fieldsLookup that it uses for logging. So, it throws an exception there and never blocks. When it times out on waiting on block it is typically a good idea to try reading the searchResponse instead, it frequently contains useful errors. While I was at it, I fixed a few other issues that you would have ran into next. I think we have a failing test now. Time to fix it! |
|
Thanks for the help! I'll look into piping the task into the reduce phase. |
e35a866 to
8eb554d
Compare
8eb554d to
b20d038
Compare
|
@imotov I've updated this PR with a solution that passes the unit tests. The main changes are:
I went with this approach because when the |
|
@elasticmachine test this please |
| return true; | ||
| }); | ||
| return Map.of( | ||
| SCRIPT_NAME, this::searchBlockScript, |
| .setQuery(matchAllQuery()) | ||
| .addAggregation( | ||
| new TermsAggregationBuilder("test_agg") | ||
| .script(new Script( |
There was a problem hiding this comment.
Could you use a regular terms agg? If not probably worth a comment why. I spent a little while trying to figure out if the terms script created a second block or something before I went and read it.
| * the maximum number of buckets allowed in a response | ||
| */ | ||
| public void consumeBucketsAndMaybeBreak(int size) { | ||
| if (isCanceled.get()) { |
There was a problem hiding this comment.
It's worth adding a comment that this is a volatile read. Do we want to do that on every consume? We don't check the memory breaker on every consume so maybe we should do this either?
I see that AtomicBoolean has a bunch of fun new ways to fetch the value without memory effects - but only in java 9, so, sadly, probably not for us yet.
There was a problem hiding this comment.
My reasoning here was that this happens in the reduce context and as you said, it's AtomicBoolean.get(). Do you think it will have noticeable overhead here? I can probably move inside MultiBucketConsumer and call it every few buckets if you think it makes sense.
There was a problem hiding this comment.
I'm not really sure! Its the kind of location that could suffer from it but I don't know how much without benchmarks. I'd feel quite comfortable without benchmarks if it were in the same 1024 calls thing, but it wouldn't interrupt as immediately. I expect the checks are much lighter than the circuit breaker checks but heavier than the general bucket building stuff.
There was a problem hiding this comment.
Thanks! I will try to run some benchmarks to see what's going on there.
|
@elasticmachine test this please |
|
@elasticmachine test this please |
|
@elasticmachine update branch |
|
@imotov just checking, is this still something we want to pursue? |
Definitely. I just don't have time to address @nik9000's concerns in #71714 (comment) at the moment due to other pressing work. |
|
@elasticmachine test this please |
|
I have updated the existing benchmark to mimic the checking for cancellation and ran it with both checks for cancellation and the checks commented out. The results are pretty much identical within expected margin of error: Without Cancellation Check: With Cancellation Check: |
| final TaskId parentTaskId = task.taskInfo(clusterService.localNode().getId(), false).getTaskId(); | ||
| ccsRemoteReduce(parentTaskId, rewritten, localIndices, remoteClusterIndices, timeProvider, | ||
| searchService.aggReduceContextBuilder(rewritten), | ||
| searchService.aggReduceContextBuilder(((CancellableTask) task)::isCancelled, rewritten), |
There was a problem hiding this comment.
Do you need this cast any more now that you've changed the signature?
|
@elasticmachine update branch |
|
@elasticmachine test this please |
|
@elasticmachine test this please |
…lastic#71714) This change raises a TaskCancelledException to stop the search query if it is detected that the SearchTask has been cancelled during the reduce phase. Issue: elastic#71021 Co-authored-by: Daniel Hsu <daniel.hsu7@gmail.com> Co-authored-by: Igor Motov <igor@motovs.org>
This change raises a TaskCancelledException to stop the search query if it is detected that the SearchTask has been cancelled during the reduce phase.
Issue: #71021