Contributor

@can-anyscale can-anyscale commented Aug 7, 2025

This is part of a series of PRs to support JobEvent in the oneevent framework. The full effort will include adding the JobEvent schema, introducing a generic interface for exporting different types of events to the Event Aggregator, and implementing the necessary integration logic.


In this PR, we improve the isolation of the event exporting job from other GCS components by using separate threads for periodic execution and gRPC request handling.
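
For readers unfamiliar with the pattern, the isolation described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `DedicatedExecutor` and `Post` are hypothetical names standing in for a dedicated I/O context, whereas the PR itself works with Ray's `instrumented_io_context`.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for a dedicated I/O context: one worker thread drains
// a queue of posted tasks, so export work never runs on the caller's thread.
class DedicatedExecutor {
 public:
  DedicatedExecutor() : worker_([this] { Run(); }) {}

  ~DedicatedExecutor() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopped_ = true;
    }
    cv_.notify_one();
    worker_.join();  // Tasks already queued are drained before the thread exits.
  }

  // Enqueue a task and return immediately.
  void Post(std::function<void()> task) {
    assert(task && "task must be callable");
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Only reachable once stopped_ is set.
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // Run outside the lock so Post() is never blocked by a slow task.
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopped_ = false;
  std::thread worker_;  // Declared last so the members above exist before Run() starts.
};
```

A caller on the main thread would call `Post` with the export work and continue; a slow export then delays only later tasks on the worker thread, not the caller.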

Test:

  • CI

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @can-anyscale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've refactored the job event recording mechanism to utilize a dedicated thread within the GCS server. This change aims to prevent job event processing from blocking other critical GCS server operations, thereby improving overall system responsiveness and stability. The core idea is to isolate this specific task to ensure it doesn't contend for resources with more time-sensitive operations.

Highlights

  • Dedicated Thread for Job Events: The RayJobEventRecorder now operates on its own dedicated I/O context (ray_event_io_context), isolating job event recording from other critical GCS server operations. This is a performance optimization.
  • Flexible Event Recorder Initialization: The RayEventRecorderBase class gains a new constructor that allows it to internally manage its EventAggregatorClient and ClientCallManager, simplifying its integration by only requiring an I/O context and a dashboard agent port.
  • GCS Server Integration: The GCS server's initialization of RayJobEventRecorder has been updated to leverage this new constructor, ensuring job events are handled on their dedicated thread.
  • Build and Test Updates: The build system and unit tests have been adjusted to reflect these architectural changes, ensuring proper compilation and testing of the new event recording mechanism.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the job event recording logic to use a dedicated thread, which is a good improvement for decoupling and performance. The changes are mostly correct, but I've found a critical issue where a member variable is not initialized, which would lead to a crash. I've also pointed out a medium-severity issue with member declaration order that should be fixed to improve code quality and prevent future bugs. Please address these points.

@can-anyscale can-anyscale force-pushed the can-1ev03 branch 2 times, most recently from fa02b1e to 623d909 Compare August 7, 2025 23:18
@can-anyscale can-anyscale marked this pull request as ready for review August 7, 2025 23:19
@can-anyscale can-anyscale requested a review from a team as a code owner August 7, 2025 23:19
@can-anyscale can-anyscale added the go label (add ONLY when ready to merge, run all tests) Aug 8, 2025
@edoakes
Collaborator

edoakes commented Aug 8, 2025

@can-anyscale can you provide a little more context on the motivation for this? What is the work that the ray_job_event_recorder does and why do we have reason to think it has enough overhead to warrant running on its own thread?

The concurrency model for the GCS needs a holistic overhaul, so want to make sure we move in roughly the right direction with whatever we do here.

@can-anyscale
Contributor Author

@edoakes: great point, my intention was more for correctness than performance; I chatted with @MengjinYan about this too, and I'll make a post on the team channel

@can-anyscale can-anyscale force-pushed the can-1ev03 branch 2 times, most recently from 187f7cf to 652d574 Compare August 14, 2025 21:25
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Aug 15, 2025
@can-anyscale can-anyscale force-pushed the can-1ev03 branch 3 times, most recently from 4f72608 to bf607da Compare August 15, 2025 18:58
"task_io_context",
"pubsub_io_context",
"ray_syncer_io_context",
"ray_event_io_context"};
Collaborator

Suggested change
- "ray_event_io_context"};
+ "event_export_io_context"};

we should call it what it is directly. "ray event" is not really meaningful inside the codebase

namespace ray {
namespace telemetry {

template <typename TEventData>
Collaborator

let's please not call everything RayEvent -- it's inside the ray codebase, there's no need to call it Ray. else we should rename everything ("RayMetrics", "RayCoreWorker", "RayGCS", ...)

this bugs me with the "ray syncer" already

Collaborator

I'd also suggest naming the directory something else since "telemetry" already has a pretty specific meaning (the usage data we collect from ray clusters by default)

event_export?

Contributor Author

ah got you, this directory includes the open-telemetry stuff as well; maybe i can just rename it to observability

Comment on lines 54 to 57
protected:
RayEventRecorderBase(
std::unique_ptr<rpc::EventAggregatorClient> event_aggregator_client,
instrumented_io_context &io_service);
Collaborator

let's always dependency-inject the client instead of adding a private constructor
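
As an illustration of the suggestion above, a dependency-injected recorder might look roughly like the sketch below. All names here (`EventClient`, `EventRecorder`, `FakeEventClient`, `Record`, `Flush`) are hypothetical stand-ins chosen for the example; they are not the classes in this PR, and `EventClient` only approximates the role of `rpc::EventAggregatorClient`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface standing in for rpc::EventAggregatorClient.
class EventClient {
 public:
  virtual ~EventClient() = default;
  virtual void Send(const std::vector<std::string> &events) = 0;
};

// The recorder borrows a client owned by its caller; it never constructs one.
class EventRecorder {
 public:
  explicit EventRecorder(EventClient &client) : client_(client) {}

  void Record(std::string event) { buffer_.push_back(std::move(event)); }

  // Sends all buffered events as one batch, then clears the buffer.
  void Flush() {
    if (buffer_.empty()) return;
    client_.Send(buffer_);
    buffer_.clear();
  }

 private:
  EventClient &client_;
  std::vector<std::string> buffer_;
};

// A test double, possible only because the client is injected.
class FakeEventClient : public EventClient {
 public:
  void Send(const std::vector<std::string> &events) override {
    assert(!events.empty());  // The recorder should never send an empty batch.
    sent.insert(sent.end(), events.begin(), events.end());
    ++num_calls;
  }

  std::vector<std::string> sent;
  int num_calls = 0;
};
```

With a single public constructor that takes the client, production code and tests build the recorder the same way, and no second constructor is needed to create the client internally.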


private:
rpc::EventAggregatorClient &event_aggregator_client_;
std::unique_ptr<rpc::ClientCallManager> client_call_manager_;
Collaborator

why do we need a client call manager? this will spin up extra grpc threads.

should be reused globally from whatever component we're in

the logic to do that can exist wherever we construct the gRPC client and dependency inject it
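
A rough sketch of the wiring this comment describes: one call manager owned by the enclosing component and shared by every client built there. `CallManager` and `AggregatorClient` are hypothetical stand-ins, and the assumption that each manager owns its own polling threads is taken from the comment above, not from the real `rpc::ClientCallManager` implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a call manager: each instance owns polling threads.
class CallManager {
 public:
  explicit CallManager(int num_threads) {
    assert(num_threads > 0);
    for (int i = 0; i < num_threads; ++i) {
      pollers_.emplace_back([this] { WaitForShutdown(); });
    }
  }

  ~CallManager() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
    for (auto &t : pollers_) t.join();
  }

  std::size_t NumThreads() const { return pollers_.size(); }

 private:
  // Placeholder for a polling loop: block until the manager shuts down.
  void WaitForShutdown() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return shutdown_; });
  }

  std::mutex mu_;
  std::condition_variable cv_;
  bool shutdown_ = false;
  std::vector<std::thread> pollers_;
};

// Hypothetical client that borrows the component-wide manager.
class AggregatorClient {
 public:
  explicit AggregatorClient(CallManager &manager) : manager_(manager) {}
  const CallManager &manager() const { return manager_; }

 private:
  CallManager &manager_;
};
```

In this arrangement the component constructs one `CallManager`, passes it to each client it creates, and injects those clients where they are needed, so adding another client adds no threads.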

Comment on lines 29 to 30
template <typename TEventData>
RayEventRecorderBase<TEventData>::RayEventRecorderBase(
Collaborator

@edoakes edoakes Aug 15, 2025

why do we need a base class for this? my understanding is the event data is all of type rpc::events::RayEventsData

so we should be able to use a single concrete event export client

can-anyscale added a commit that referenced this pull request Sep 11, 2025
This is part of a series of PRs to support JobEvent in the oneevent
framework. The full effort will include adding the JobEvent schema,
introducing a generic interface for exporting different types of events
to the Event Aggregator, and implementing the necessary integration
logic.

------------------

In this PR, we improve the isolation of the event exporting job from
other GCS components by using separate threads for periodic execution
and gRPC request handling.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
can-anyscale added 29 more commits that referenced this pull request, each with the same commit message as above: 2 on Sep 11, 10 on Sep 12, 6 on Sep 16, 5 on Sep 18, 3 on Sep 30, and 3 on Oct 14, 2025.

Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

4 participants