Skip to content

[Transform] Reduce indexes to query based on checkpoints#75839

Merged
hendrikmuhs merged 4 commits into
elastic:masterfrom
hendrikmuhs:transform-pit-reduce-indexcalls
Aug 26, 2021
Merged

[Transform] Reduce indexes to query based on checkpoints#75839
hendrikmuhs merged 4 commits into
elastic:masterfrom
hendrikmuhs:transform-pit-reduce-indexcalls

Conversation

@hendrikmuhs

Copy link
Copy Markdown

Continuous transform reduce the amount of data to query for by detecting what has been changed
since the last checkpoint. This information is than used to inject queries that narrow the
scope. The query is send to all configured indices. This change reduces the indexes to call
using checkpoint information. This reduces not only the number of network calls, but also
reduces the probability of a failure, which is more likely to happen in large heterogeneous
clusters (hot/warm/cold architecture).

@hendrikmuhs hendrikmuhs added the :ml/Transform Transform label Jul 29, 2021
@hendrikmuhs hendrikmuhs force-pushed the transform-pit-reduce-indexcalls branch from fe08512 to e1d9e99 Compare July 29, 2021 13:30
@hendrikmuhs hendrikmuhs force-pushed the transform-pit-reduce-indexcalls branch from 4ada87f to d619baa Compare August 5, 2021 14:46
@hendrikmuhs hendrikmuhs force-pushed the transform-pit-reduce-indexcalls branch from d619baa to 9dbffea Compare August 25, 2021 10:14
@hendrikmuhs hendrikmuhs marked this pull request as ready for review August 25, 2021 11:50
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Aug 25, 2021
@elasticmachine

Copy link
Copy Markdown
Collaborator

Pinging @elastic/ml-core (Team:ML)

@hendrikmuhs

Copy link
Copy Markdown
Author

TL/DR

I think I need to explain the idea for part 2, reducing calls for pivot:

The change collector already collects meta information about the source indexes, e.g. it gets max(@timestamp). I want to keep this information and store it in the state document. Imagine we keep the last 5 checkpoints (whether its 5 or something higher needs to be tested):

_transform.cp.42.@timestamp.max: 999
_transform.cp.43.@timestamp.max: 1022
_transform.cp.44.@timestamp.max: 1080
_transform.cp.45.@timestamp.max: 1129
_transform.cp.46.@timestamp.max: 1535

Assume for checkpoint 47 we detected in the change collector that @timestamp has a range min: 1092, max: 1693. The above lookup table tells us, we can skip every index that hasn't changed after checkpoint 44. Therefore we call getChangedIndices(44, 47) which gives us the indices with changes.

In a nutshell the checkpoint for 44 could look like this:

index-0001: 1220,1220,10135,1015,10111
index-0002: 305,15134,13531,1352,15533
...
index-0099: 10, 10, 10, 10, 10
index-0101: 20, 30, 30, 30, 30
index-0102: 40, 40, 40, 40, 40

The checkpoint of 47 might be:

index-0001: 1220,1220,10135,1015,10111
index-0002: 305,15134,13531,1352,15533
...
index-0099: 10, 10, 10, 10, 10
index-0101: 23, 32, 42, 32, 60
index-0102: 40, 50, 70, 60, 50
index-0103: 10, 10, 10, 10, 10

If we diff the 2, we can remove [index-0001, ... index-0099], only these 3 indexes are of interest:

index-0101: 23, 32, 42, 32, 60
index-0102: 40, 50, 70, 60, 50
index-0103: 10, 10, 10, 10, 10

We therefore can resolve the pattern index-* to only call the shards of those 3 indexes instead of calling all of them.

@benwtrent benwtrent left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is sound and will definitely help resiliency. I don't think it will help much in the performance front.

I would like to add the name for the PIT and searchrequest to the logging messages in some fashion.

Additionally, I am slightly concerned that building the checkpoints themselves will still fail as they have to get the stats from all indices. Though, I am not 100% sure how that index stats request differs from searching them.

case APPLY_RESULTS:
buildUpdateQuery(sourceBuilder);
break;
return new Tuple<>("apply_results", buildQueryToUpdateDestinationIndex());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this. Naming them like this makes debugging in the future much nicer!!!

@hendrikmuhs

Copy link
Copy Markdown
Author

@elasticmachine update branch

@hendrikmuhs hendrikmuhs added the auto-backport Automatically create backport pull requests when merged label Aug 26, 2021
@hendrikmuhs

Copy link
Copy Markdown
Author

Additionally, I am slightly concerned that building the checkpoints themselves will still fail as they have to get the stats from all indices. Though, I am not 100% sure how that index stats request differs from searching them.

For the record / others reading this issue: This is tracked and followed up in #75780.

An index stats call still requires a network call, so we won't get away with this. However index stats is answered from the in-memory state, a search - whether it returns a match or not - potentially causes disk IO. The index stats is done exactly once per checkpoint, search gets executed several times.

It's all baby steps and involves a lot "it depends".

@hendrikmuhs hendrikmuhs merged commit 4974a7c into elastic:master Aug 26, 2021
@hendrikmuhs hendrikmuhs deleted the transform-pit-reduce-indexcalls branch August 26, 2021 07:08
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this pull request Aug 26, 2021
Continuous transform reduce the amount of data to query for by detecting what has been changed
since the last checkpoint. This information is used to inject queries that narrow the scope.
The query is send to all configured indices. This change reduces the indexes to call
using checkpoint information. The number of network calls go down which in addition to performance
reduces the probability of a failure.

This change mainly helps the transforms of type latest, pivot transform require additional
changes planned for later.
@elasticsearchmachine

Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
7.x

hendrikmuhs pushed a commit that referenced this pull request Aug 26, 2021
… (#76968)

Continuous transform reduce the amount of data to query for by detecting what has been changed
since the last checkpoint. This information is used to inject queries that narrow the scope.
The query is send to all configured indices. This change reduces the indexes to call
using checkpoint information. The number of network calls go down which in addition to performance
reduces the probability of a failure.

This change mainly helps the transforms of type latest, pivot transform require additional
changes planned for later.

backport #75839
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 2, 2021
…ints

When every index that a transform is configured to search has
remained completely unchanged between checkpoints the transform
should not do a search at all.

Following elastic#75839 there was a problem where the scenario of all
indices being unchanged between checkpoints could cause an empty
list of indices to be searched, which Elasticsearch treats as
meaning _all_ indices. This change should prevent that happening
in future.

Fixes elastic#77137
droberts195 added a commit that referenced this pull request Sep 3, 2021
…ints (#77204)

When every index that a transform is configured to search has
remained completely unchanged between checkpoints the transform
should not do a search at all.

Following #75839 there was a problem where the scenario of all
indices being unchanged between checkpoints could cause an empty
list of indices to be searched, which Elasticsearch treats as
meaning _all_ indices. This change should prevent that happening
in future.

Fixes #77137
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 3, 2021
…ints (elastic#77204)

When every index that a transform is configured to search has
remained completely unchanged between checkpoints the transform
should not do a search at all.

Following elastic#75839 there was a problem where the scenario of all
indices being unchanged between checkpoints could cause an empty
list of indices to be searched, which Elasticsearch treats as
meaning _all_ indices. This change should prevent that happening
in future.

Fixes elastic#77137
elasticsearchmachine pushed a commit that referenced this pull request Sep 3, 2021
…ints (#77204) (#77245)

When every index that a transform is configured to search has
remained completely unchanged between checkpoints the transform
should not do a search at all.

Following #75839 there was a problem where the scenario of all
indices being unchanged between checkpoints could cause an empty
list of indices to be searched, which Elasticsearch treats as
meaning _all_ indices. This change should prevent that happening
in future.

Fixes #77137
hendrikmuhs pushed a commit that referenced this pull request Sep 7, 2021
disable optimization of index calls introduced in #75839 as it can create wrong results.
See #77329 for follow up

relates #77329
fixes #77310
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this pull request Sep 7, 2021
disable optimization of index calls introduced in elastic#75839 as it can create wrong results.
See elastic#77329 for follow up

relates elastic#77329
fixes elastic#77310
elasticsearchmachine pushed a commit that referenced this pull request Sep 7, 2021
disable optimization of index calls introduced in #75839 as it can create wrong results.
See #77329 for follow up

relates #77329
fixes #77310
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >enhancement :ml/Transform Transform Team:ML Meta label for the ML team v7.16.0 v8.0.0-alpha2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants