[Transform] Reduce indexes to query based on checkpoints#75839
Pinging @elastic/ml-core (Team:ML)
TL/DR I think I need to explain the idea for part 2, reducing calls for The change collector already collects meta information about the source indexes, e.g. it gets Assume for checkpoint In a nutshell the checkpoint for The checkpoint of If we diff the 2, we can remove [ We therefore can resolve the pattern |
benwtrent left a comment
I think the idea is sound and will definitely help resiliency. I don't think it will help much on the performance front.
I would like to add the names of the PIT and the search request to the logging messages in some fashion.
Additionally, I am slightly concerned that building the checkpoints themselves will still fail, as they have to get the stats from all indices. That said, I am not 100% sure how that index stats request differs from searching them.
case APPLY_RESULTS:
    buildUpdateQuery(sourceBuilder);
    break;
return new Tuple<>("apply_results", buildQueryToUpdateDestinationIndex());
I really like this. Naming them like this makes debugging in the future much nicer!!!
@elasticmachine update branch
For the record, and for others reading this issue: this is tracked and followed up in #75780. An index stats call still requires a network call, so we can't avoid that. However, index stats is answered from the in-memory state, whereas a search, whether or not it returns a match, potentially causes disk IO. The index stats call is done exactly once per checkpoint, while the search gets executed several times. It's all baby steps and involves a lot of "it depends".
Continuous transforms reduce the amount of data to query by detecting what has changed since the last checkpoint. This information is used to inject queries that narrow the scope. The query is sent to all configured indices. This change reduces the number of indexes to call using checkpoint information. The number of network calls goes down, which, in addition to improving performance, reduces the probability of a failure. This change mainly helps transforms of type latest; pivot transforms require additional changes planned for later.
💚 Backport successful
… (#76968) Continuous transforms reduce the amount of data to query by detecting what has changed since the last checkpoint. This information is used to inject queries that narrow the scope. The query is sent to all configured indices. This change reduces the number of indexes to call using checkpoint information. The number of network calls goes down, which, in addition to improving performance, reduces the probability of a failure. This change mainly helps transforms of type latest; pivot transforms require additional changes planned for later. Backport of #75839
…ints When every index that a transform is configured to search has remained completely unchanged between checkpoints, the transform should not do a search at all. Following elastic#75839, there was a problem where all indices being unchanged between checkpoints could cause an empty list of indices to be searched, which Elasticsearch treats as meaning _all_ indices. This change should prevent that from happening in future. Fixes elastic#77137
…ints (#77204) When every index that a transform is configured to search has remained completely unchanged between checkpoints, the transform should not do a search at all. Following #75839, there was a problem where all indices being unchanged between checkpoints could cause an empty list of indices to be searched, which Elasticsearch treats as meaning _all_ indices. This change should prevent that from happening in future. Fixes #77137
…ints (#77204) (#77245) When every index that a transform is configured to search has remained completely unchanged between checkpoints, the transform should not do a search at all. Following #75839, there was a problem where all indices being unchanged between checkpoints could cause an empty list of indices to be searched, which Elasticsearch treats as meaning _all_ indices. This change should prevent that from happening in future. Fixes #77137
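The guard described in these commits can be illustrated with a small sketch. It is not the code from the fix: `SearchTargets` and `indicesToSearch` are hypothetical names, and the only behavior taken from the discussion is that an empty list of indices must never reach a search request, because Elasticsearch would interpret it as all indices.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the transform code base.
final class SearchTargets {

    // Returns the indices to search, or Optional.empty() when nothing changed.
    // An empty Optional tells the caller to skip the search entirely instead of
    // sending an empty index list, which would be treated as "all indices".
    static Optional<String[]> indicesToSearch(List<String> changedIndices) {
        if (changedIndices.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(changedIndices.toArray(new String[0]));
    }

    public static void main(String[] args) {
        // Nothing changed between checkpoints: no search should be issued.
        System.out.println(indicesToSearch(List.of()).isPresent());
        // One index changed: search only that index.
        System.out.println(String.join(",", indicesToSearch(List.of("index-2")).get()));
    }
}
```

Returning an `Optional` rather than an empty array is one possible design choice here: it forces the caller to handle the "no search" case explicitly.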
Disable the optimization of index calls introduced in elastic#75839, as it can produce wrong results. See elastic#77329 for the follow-up. Relates elastic#77329. Fixes elastic#77310
Continuous transforms reduce the amount of data to query by detecting what has changed
since the last checkpoint. This information is then used to inject queries that narrow the
scope. The query is sent to all configured indices. This change reduces the number of indexes
to call using checkpoint information. This reduces not only the number of network calls but also
the probability of a failure, which is more likely to happen in large heterogeneous
clusters (hot/warm/cold architecture).
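As a rough illustration of the idea, the following sketch compares two checkpoints and keeps only the indices whose checkpoint information differs. It rests on assumptions: a checkpoint is modeled here as a plain map from index name to an array of per-shard checkpoint values, and `ChangedIndices` and `changedSince` are hypothetical names, not the transform implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the transform code base.
final class ChangedIndices {

    // Each map goes from index name to the per-shard checkpoint values
    // recorded for that index (an assumed representation).
    static List<String> changedSince(Map<String, long[]> last, Map<String, long[]> next) {
        List<String> changed = new ArrayList<>();
        // Iterate in sorted order so the result is deterministic.
        for (String index : new TreeSet<>(next.keySet())) {
            long[] before = last.get(index);
            long[] after = next.get(index);
            // A new index, or one whose shard checkpoints moved, needs to be queried.
            if (before == null || !Arrays.equals(before, after)) {
                changed.add(index);
            }
        }
        // Indices present only in the last checkpoint are dropped: nothing to search.
        return changed;
    }

    public static void main(String[] args) {
        Map<String, long[]> last = Map.of(
            "index-1", new long[] {10, 12},
            "index-2", new long[] {7});
        Map<String, long[]> next = Map.of(
            "index-1", new long[] {10, 12},
            "index-2", new long[] {9},
            "index-3", new long[] {1});
        System.out.println(changedSince(last, next)); // prints [index-2, index-3]
    }
}
```

In this example `index-1` is unchanged and would be left out of the follow-up search, which is where the saving in network calls would come from.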