
ref(aci): use enqueue time to query Snuba in delayed workflow #93882


Merged: 7 commits merged into master on Jun 24, 2025

Conversation

cathteng (Member):

For a slow alert condition (e.g. "> 10 events in the past 1 minute"), we always use the time the query is made as the point to start looking backwards from. This is not optimal when:

  • A task needs to be rerun
  • There is a backlog in the delayed workflow queue
  • Because we flush the buffer every minute, there is an edge case when the time window is 1 minute: events come in at 11:00:02 AM, are dispatched in a task at ~11:01:00 AM, and the query runs at 11:01:03 AM or later, so those events fall outside the one-minute window and aren't picked up in the result

A better solution would be to use the time the event is enqueued into the delayed workflow buffer!

However, we also batch queries such that all alerts that end up making the same queries are grouped together, and we run a single query for multiple groups. For example:

  • Two events for two different groups come in at different times within the same minute, and both are processed by alert A
  • We will make a single Snuba query for alert A covering both groups

In the above case, we need to decide which enqueue time to use: we take the latest enqueue time across the groups, since querying backwards from the latest timestamp covers the window for every group in the batched Snuba query.
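As a rough illustration (the times and names below are invented for this example, not taken from the PR), the change amounts to ending the query window at the latest enqueue time in the batch instead of at query time:

from datetime import datetime, timedelta

# Illustrative only: enqueue times for two groups batched into one Snuba query.
enqueue_times = [
    datetime(2025, 6, 18, 11, 0, 2),   # group 1's event enqueued
    datetime(2025, 6, 18, 11, 0, 40),  # group 2's event enqueued
]
end = max(enqueue_times)             # latest enqueue time covers every group
start = end - timedelta(minutes=1)   # "> 10 events in the past 1 minute"
# Querying [start, end] still counts the 11:00:02 event even if the task
# itself does not run until 11:01:03 or later.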


sentry-io bot commented Jun 18, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: src/sentry/workflow_engine/processors/workflow.py

Function: evaluate_workflows_action_filters
Unhandled Issue: SoftTimeLimitExceeded: SoftTimeLimitExceeded() se...
Event Count: 3


@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jun 18, 2025
@cathteng cathteng requested a review from kcons June 18, 2025 22:49
@cathteng cathteng marked this pull request as ready for review June 18, 2025 22:49
@cathteng cathteng requested a review from a team as a code owner June 18, 2025 22:50
"""
Use the latest timestamp for a set of group IDs with the same Snuba query.
We will query backwards in time from this point.
"""
if self.timestamp is None or (timestamp is not None and timestamp > self.timestamp):
kcons (Member):
if timestamp is not None:
    self.timestamp = timestamp if self.timestamp is None else max(timestamp, self.timestamp)

perhaps.

cathteng (Member, Author):
sgtm



@dataclass
class TimeAndGroups:
kcons (Member):
Is BulkQueryParameters a more accurate name? Or perhaps GroupQueryParameters? There's a distinction between the unique queries and what we're associating them with that I'm not sure I'm capturing accurately, but TimeAndGroups strikes me as a bit too literal.

def dcg_to_timestamp(self) -> dict[int, datetime | None]:
    """
    Uses the latest timestamp each DataConditionGroup was enqueued with.
    All groups enqueued for a DataConditionGroup will have the same query, hence the same max timestamp.
kcons (Member):
I feel like I should know what this means, but I don't.
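One plausible reading, sketched here with assumed names (dcg_to_time_and_groups is hypothetical, not from the PR): each DataConditionGroup id maps to its batched query state, whose timestamp already holds the latest enqueue time for its groups.

def dcg_to_timestamp(self) -> dict[int, datetime | None]:
    # Hypothetical sketch: surface the per-DCG max enqueue timestamp.
    return {
        dcg_id: time_and_groups.timestamp
        for dcg_id, time_and_groups in self.dcg_to_time_and_groups.items()
    }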

handler = unique_condition.handler()
group_ids = time_and_groups.group_ids
kcons (Member):
another option is to make groups be a dict[GroupId, datetime | None] and
do time = max(ts for ts in groups.values() if ts, default=current_time).
Stores a bit more data, but lets the summarizing happen where it is being forced, which has a certain appeal.
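A sketch of that alternative, assuming the GroupId alias and dataclass shape used elsewhere in this thread:

from dataclasses import dataclass, field
from datetime import datetime

GroupId = int  # assumption: an int alias, as in the surrounding code

@dataclass
class TimeAndGroups:
    # Per-group enqueue times; summarizing happens only where the value is forced.
    groups: dict[GroupId, datetime | None] = field(default_factory=dict)

    def query_end(self, current_time: datetime) -> datetime:
        # Latest enqueue time across groups, falling back to "now".
        return max((ts for ts in self.groups.values() if ts), default=current_time)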

cathteng (Member, Author):
I need to refactor dcg_to_timestamp for this; if it comes to it, we can do a refactor

@@ -75,6 +77,7 @@ class DelayedWorkflowItem:
    delayed_conditions: list[DataCondition]
    event: GroupEvent
    source: WorkflowDataConditionGroupType
    timestamp: datetime
kcons (Member):
worth explaining what this timestamp is and what it should correspond to. Or, rename the field to make the comment pointless.

{
    "event_id": self.event.event_id,
    "occurrence_id": self.event.occurrence_id,
    "timestamp": self.timestamp,
kcons (Member):
TIL we can dumps a datetime.
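For context (a generic illustration, not Sentry's actual serialization path): json.dumps rejects a raw datetime unless a default hook is supplied, so something along these lines has to happen for this payload to serialize:

import json
from datetime import datetime, timezone

payload = {"timestamp": datetime(2025, 6, 18, tzinfo=timezone.utc)}
# json.dumps(payload) raises TypeError; a default hook stringifies it:
json.dumps(payload, default=str)  # '{"timestamp": "2025-06-18 00:00:00+00:00"}'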

@@ -79,6 +79,7 @@
class EventInstance(BaseModel):
    event_id: str
    occurrence_id: str | None = None
    timestamp: datetime | None = None
kcons (Member):
You should probably add a test that requires us to parse this value correctly from the expected format. I strongly suspect we don't.

kcons (Member) left a comment:
Need test to verify we can parse it and pydantic won't freak out.
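A minimal sketch of the requested test, assuming pydantic's standard coercion of ISO 8601 strings to datetime (the test name and values are illustrative):

from datetime import datetime, timezone
from pydantic import BaseModel

class EventInstance(BaseModel):
    event_id: str
    occurrence_id: str | None = None
    timestamp: datetime | None = None

def test_event_instance_parses_timestamp() -> None:
    # Simulates the serialized form coming back out of the buffer.
    raw = {"event_id": "abc123", "timestamp": "2025-06-18T22:49:00+00:00"}
    parsed = EventInstance(**raw)
    assert parsed.timestamp == datetime(2025, 6, 18, 22, 49, tzinfo=timezone.utc)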


codecov bot commented Jun 18, 2025

Codecov Report

Attention: Patch coverage is 86.11111% with 10 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines:
  ...try/workflow_engine/processors/delayed_workflow.py: Patch 65.51%, 10 lines missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #93882       +/-   ##
===========================================
+ Coverage   37.94%   81.96%   +44.01%     
===========================================
  Files        9784    10351      +567     
  Lines      553763   598883    +45120     
  Branches    23268    23267        -1     
===========================================
+ Hits       210143   490884   +280741     
+ Misses     343151   106950   -236201     
- Partials      469     1049      +580     

cathteng added 4 commits June 23, 2025 13:09

@cathteng cathteng force-pushed the cathy/aci/delayed-workflow-start-time branch from b304ccd to fb91cb3 Compare June 23, 2025 20:09
@cathteng cathteng requested a review from kcons June 23, 2025 20:10
group_ids: set[GroupId] = field(default_factory=set)
timestamp: datetime | None = None

def update_timestamp(self, timestamp: datetime | None) -> None:
kcons (Member):
Seems simpler and safer to have update(self, group_id: GroupId, timestamp: datetime | None).
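A sketch of that combined method, using the fields shown above; folding both updates into one call means a group can't be added without its enqueue time being considered:

def update(self, group_id: GroupId, timestamp: datetime | None) -> None:
    # Record the group and fold its enqueue time into the running maximum.
    self.group_ids.add(group_id)
    if timestamp is not None:
        self.timestamp = timestamp if self.timestamp is None else max(timestamp, self.timestamp)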

@@ -74,6 +76,9 @@ class DelayedWorkflowItem:
    delayed_conditions: list[DataCondition]
    event: GroupEvent
    source: WorkflowDataConditionGroupType
    timestamp: (
        datetime  # time the item was created for enqueue. used in delayed workflow Snuba query
    )
kcons (Member):
maybe

# Used to pick the end of the time window in snuba querying. 
# Should be close to when fast conditions were evaluated to try to be consistent.

What you have is fine, though.

@cathteng cathteng enabled auto-merge (squash) June 24, 2025 17:12
@cathteng cathteng merged commit 203d323 into master Jun 24, 2025
64 checks passed
@cathteng cathteng deleted the cathy/aci/delayed-workflow-start-time branch June 24, 2025 17:14
@github-actions github-actions bot locked and limited conversation to collaborators Jul 10, 2025