Add better support for metric data types (TSDB) #74660

#### Phase 0 - Inception
- [x] Obtain schemas annotated with dimensions and metrics from the Metrics team (small) @nik9000 
- [x] Prototyping Lucene Data Pull Mechanism (medium) @imotov
- [x] Prototyping Data Pull Mechanism in Elasticsearch @imotov

#### Phase 1 - Mappings
- [x] Add `time_series_dimension` mapping parameter to fields 
  - [X] #74450 @csoulios
  - [X] #74939 @csoulios
  - [X] #78012 @csoulios
- [x] #76766 @csoulios
- [X] #78790 @imotov
- [x] #79136 @weizijun

#### Phase 2 - Ingest
- [x] Dimension-based tsid generator
  - [x] #77154 @nik9000 
  - [x] Add TSDB-specific tests 
    - [X] #78208 @nik9000 
    - [X] #78042 @nik9000 
    - [X] #78038 @nik9000 
    - [X] #78034 @nik9000 
    - [X] #78028 @nik9000 
    - [X] #78022 @nik9000 
  - [x] #80276 @csoulios
  - [x] #81382 @csoulios ([prototype](https://github.com/elastic/elasticsearch/pull/75638/files#diff-4c1601543024812997c062c3b5e2dfc1aae96707d9aec8bba2c00e0b0a69c86d)) 
  - [x] #81998 @csoulios 

- [x] Routing
  - [x] #77211 @nik9000 
  - [x] #77731 @nik9000 
  - [x] #79384 @nik9000
  - [x] #79520 @nik9000
  - [x] #81125 @csoulios 
  - [x] #79826 (Speed up xcontent filtering) @weizijun 
  - [x] Have a good hard look at the switch statement in `BulkOperation`. Maybe we can make this simpler. 
    - [X] #79394 @nik9000
    - [X] #79472 @nik9000
    - [x] #80624
  - [x] #81436 @csoulios 
  - [x] Test and fix get-by-id #82633 (See linked issue for greater description of sub-points)
    - [x] Initial implementation of `_id` for tsid (#82633)
    - [x] Generate better error messages when `_id` is automatically generated (#84903, #84962)
    - [x] Improve error messages on version conflict to include `_tsid` and `@timestamp` (#84957)
    - [x] Size
      - [x] Investigate flipping `@timestamp` component of the `_id` from little endian to big endian. That *should* mean there are more common prefixes. #85008 cuts the size of the inverted index for `_id` by 37%. That's not a lot of the index in total, but it sure does feel good for such a small change.
    - [x] Misc
      - [x] Test TSDB's `_id` in `RecoverySourceHandlerTests.java` and `EngineTests.java` #84996, #85055
      - [x] Make it possible to modify `@timestamp` or dimensions in reindex #86647 + #86704
      - [x] Test `_id` with the security `create_doc` privilege. Can a user with `create_doc` (only) ingest new TSDB docs? Does `create_doc` prevent a user from overwriting an existing TSDB doc? (`create_doc` relies on the `OpType` of the `IndexRequest`, which is [automatically set to `CREATE`](https://github.com/elastic/elasticsearch/blob/a2bc4854b562782a8a66eeacb23246d4d21a2b01/server/src/main/java/org/elasticsearch/rest/action/document/RestIndexAction.java#L83) for docs with auto-generated ids) #86638
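
The endianness item under "Size" above can be illustrated with a small sketch. It assumes the `@timestamp` component is an 8-byte epoch-millisecond value and uses two hypothetical timestamps one second apart. It does not reproduce the actual `_id` layout from #85008; it only shows why big-endian encoding tends to give nearby timestamps a longer shared prefix.

```python
def encode(ts_millis: int, byteorder: str) -> bytes:
    """Encode a non-negative epoch-millis timestamp as 8 bytes."""
    return ts_millis.to_bytes(8, byteorder)


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two encodings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two hypothetical timestamps one second apart (illustration only).
t1 = 1_650_000_000_000
t2 = t1 + 1_000

big = common_prefix_len(encode(t1, "big"), encode(t2, "big"))
little = common_prefix_len(encode(t1, "little"), encode(t2, "little"))
print(big, little)  # prints "6 0"
```

With big endian the slowly changing high-order bytes come first, so consecutive timestamps share them; with little endian the fast-changing low-order byte comes first and the shared prefix is usually empty.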
    
- [x] Handling Time Boundaries 
  - [x] #78291 (Added `start_time`, `end_time` index settings ) @weizijun
  - [x] Make time boundaries required in tsdb indices @weizijun https://github.com/elastic/elasticsearch/pull/81146
  - [x] Replace hard check for index_mode=TIME_SERIES with bounds checking on start and end time @nik9000 https://github.com/elastic/elasticsearch/pull/81263
  - [x] Tests for nanosecond-precision timestamps just beyond the limit
  - [x] #82079 @martijnvg
  - [x] Automated update of index time boundaries on index rollover @martijnvg
  - [x] #83517 (@martijnvg )
  - [x] Adjust the get data stream API to include `index_mode` and, for each backing index, the start and end time if the data stream is tsdb. #83518
  - [x] Automatically skip shards of backing indices with time ranges (based on `index.time_series.start_time` and `index.time_series.end_time` index settings) that don't match with the `@timestamp` range in a search request. #85162 (@martijnvg)
- [x] Other tasks
  - [x] #79826 @weizijun
  - [x] Compile a standard data set for comparative speed and space benchmarking (@nik9000) https://github.com/elastic/rally-tracks/pull/222
  - [x] #82238 @imotov 
  - [x] Rewrite tsdb benchmark to use time series data streams with an ILM policy instead of indexing into a regular index. @martijnvg 
   - [x] Figure out how to parse source only once for determining the right backing index and index routing. #84046 @martijnvg 
   - [x] Implement migrating existing data streams to data streams with time series index mode. #83520  @martijnvg 
   - [x] Reconsider how time series data streams are enabled in templates. @martijnvg The current `index_mode` setting isn't good enough. It requires additional config to be specified elsewhere (the `time_series_dimension` attribute in mappings and `index.routing_path` in index settings), and it doesn't allow the data stream tsdb features (routing based on the `@timestamp` field) to be enabled without enabling the index-level tsdb features.
   - [x] A template will create time series data stream if `index.mode` setting is set to `time_series`.
   - [x] Autogenerate `index.routing_path` index setting if not defined in composable index template that creates a tsdb data stream. All mapped fields of type `keyword` and `time_series_dimension` enabled will be included in the generated `index.routing_path` index setting. #86790 (@martijnvg)
   ~~- [ ] The `index.routing_path` index setting generation doesn't kick in when index.mode and dimension fields are defined in component templates. (@martijnvg).~~
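
The `index.routing_path` autogeneration described in the items above can be sketched as a walk over the mapping's `properties`. This is a simplified model, not the Elasticsearch implementation: `generate_routing_path` is a hypothetical helper, the mapping is made-up sample data, and only `keyword` fields with `time_series_dimension` enabled are collected, as the task description states.

```python
from typing import Any, Dict, List


def generate_routing_path(properties: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect full paths of keyword fields marked as time series dimensions."""
    paths: List[str] = []
    for name, field in properties.items():
        full_name = prefix + name
        if "properties" in field:
            # Object field: recurse into its sub-fields.
            paths.extend(generate_routing_path(field["properties"], full_name + "."))
        elif field.get("type") == "keyword" and field.get("time_series_dimension") is True:
            paths.append(full_name)
    return sorted(paths)


# Hypothetical mapping properties for illustration.
mapping = {
    "@timestamp": {"type": "date"},
    "host": {"properties": {"name": {"type": "keyword", "time_series_dimension": True}}},
    "pod": {"type": "keyword", "time_series_dimension": True},
    "cpu": {"type": "double"},
    "note": {"type": "keyword"},
}
routing_path = generate_routing_path(mapping)
print(routing_path)  # prints "['host.name', 'pod']"
```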

#### Phase 2.1 Ingest follow ups
~~- [ ] Build the `_id` from dimension values~~
~~- [ ] Investigate moving timestamp to the front of the `_id` to automatically get an optimization on `_id` searches. Not sure if worth it - but possible. #84928 could be an alternative~~
- [x] Bring back something in the spirit of the append-only optimization but that works for tsdb. That'd greatly improve write performance. #84771 is a partial prototype
- [x] We store the `_id` in lucene stored fields. We could regenerate it from the `_source` or from doc values for the `@timestamp` and the `_tsid`. That'd save some bytes per document.
- [ ] Move `IndexRequest#autoGenerateId`? It's a bit spooky where it is but I don't like it any other place.
- [ ] Improve error messages in `_update_by_query` when modifying the dimensions or `@timestamp`
- [ ] On translog replay and recovery and replicas we regenerate the `_id` and assert that it matches the `_id` from the primary. Should we? Probably. Let's make sure.
- [x] Add tsdb benchmarks to the nightlies
~~- [ ] Document best practices for using dimensions-based ID generator including how to use this with component templates~~

#### Phase 3.1 QL storage API (Postponed)
- [x] Create simple time series reader
  - [X] #79197 @nik9000
  - [x] #79691 @imotov
~~- [ ] Reimplement QL storage API for TSDB database (depends on completion of Phase 2 and 3.2) (Postponed)~~

#### Phase 3.2 - Search MVP
Plans for time series support in the `_search` API are superseded by plans for this in ES|QL.
- [x] Distributed nested delayed execution framework 
  - [x] #82129 @imotov
  - [x] #83492 @imotov
  - [x] #85011 @imotov 
- [ ] Treating data stream/index as a dimension
~~- [ ] Aggregation results filtering~~
~~- [ ] Retrieve the last value for a time series metric within a parent bucket~~
- [x] Time series aggregation
- [x] Rate Function
~~- [ ] Add a new histogram field subtype to support Prometheus-style histograms~~
~~- [ ] #85523~~
~~- [ ] Should the _tsid agg return doc_counts by default?~~
~~- [ ] #90423~~

#### Phase 3.3 - Rollup / Downsampling
- [x] #85708 @csoulios 
  - Extract rollup configuration (dimensions, metrics) from index mapping
  - Create rollup index (settings and mapping)
  - Traverse source index using `TimeSeriesIndexSearcher`, compute rollup docs and add them to the rollup index
  - Finalize action: publish index metadata, modify data stream, clean up temp index
- [x] #87269 @csoulios
  - Use the updated rollup config
  - Revisit validations before invoking rollup process
- [x] #90029 @csoulios 
- [x] Query downsampled indices, add validations for:
  - [x] #89252 @salvatore-campagna 
  - Intervals: `fixed_interval` vs `calendar_interval`
  - `time_zone`
  - `date_histogram` resolution
- [x] Field Caps API
  - [x] #87849 @csoulios
  - [x]  #88695 @csoulios 
    - Expose information about if a field belongs to only time-series indices when querying multiple indices 
    - Shorten the response when some indices don't map fields with the same time series parameter - right now it's a list of indices, which is nice, but Kibana only needs to know if the list is non-empty
- [ ] Misc
     - [x] #87554 @csoulios
     - [x] #87929  @salvatore-campagna 
     - [x] Make rollup task cancellable #88496 @weizijun 
     - [x] #88534  @salvatore-campagna 
     - [ ] Support text field labels
     - [x] #88818 @salvatore-campagna 
     - [x] Handle rollup failures
     - [x] Update tsdb rally track to add benchmarks for downsampling https://github.com/elastic/rally-tracks/pull/316 @salvatore-campagna
     - [x] #90226 @salvatore-campagna 
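
The traverse-and-compute step of the rollup action listed at the top of this phase can be sketched as grouping documents by dimensions and fixed-interval bucket. This is a minimal illustration under stated assumptions: `downsample` is a hypothetical function, documents are plain dicts with an epoch-millis `@timestamp`, and the `min`/`max`/`sum`/`value_count` summary is an assumed shape for a gauge-like metric, not taken from this issue.

```python
from collections import defaultdict


def downsample(docs, dimensions, metric, interval_ms):
    """Group docs by (dimension values, time bucket) and summarise one metric."""
    buckets = defaultdict(list)
    for doc in docs:
        series_key = tuple(doc[d] for d in dimensions)
        # Round the timestamp down to the start of its fixed interval.
        bucket_start = doc["@timestamp"] - doc["@timestamp"] % interval_ms
        buckets[(series_key, bucket_start)].append(doc[metric])

    rollup_docs = []
    for (series_key, bucket_start), values in sorted(buckets.items()):
        rollup_doc = dict(zip(dimensions, series_key))
        rollup_doc["@timestamp"] = bucket_start
        rollup_doc[metric] = {
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "value_count": len(values),
        }
        rollup_docs.append(rollup_doc)
    return rollup_docs


# Hypothetical source documents: one series, three samples, 60s interval.
source = [
    {"host": "a", "@timestamp": 0, "cpu": 1},
    {"host": "a", "@timestamp": 30_000, "cpu": 3},
    {"host": "a", "@timestamp": 60_000, "cpu": 5},
]
rolled = downsample(source, ["host"], "cpu", 60_000)
```

Here the first two samples fall into the bucket starting at `0` and the third into the bucket starting at `60000`, so three source docs become two rollup docs.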

#### Phase 3.4 - TSID aggs (superseded by tsdb in ES|QL)
~~ - [ ] Update  min, max, sum, avg pipeline aggs for intermediate result filtering optimization ~~
~~ - [ ] Sliding window aggregation ~~
~~ - [ ] A way to filter to windows *within* the sliding window. Like "measurements taken in the last 30 seconds of the window". ~~
~~ - [ ] Open transform issue for newly added time series aggs ~~
~~ - [ ] Benchmarks for the tsid agg ~~

#### Phase 3.5 - Downsampling follow ups
  - [ ] Handling histograms
  - [ ] SQL support for downsampling

#### Phase 4.0 - Compression
- [x] Synthetic `_source` @nik9000 #86603 
- [ ] Optimization of merge policies (#87684)
- [x] Deltas of deltas compression
- [x] What about sequence number?
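
For readers unfamiliar with the "deltas of deltas" item above, here is a generic sketch of the technique on integer timestamps. It illustrates the general idea only and is not the encoding used in Elasticsearch or Lucene; the function names and sample values are hypothetical.

```python
def delta_of_delta_encode(values):
    """Store the first value, then the change in delta between neighbours."""
    if not values:
        return []
    encoded = [values[0]]
    previous, previous_delta = values[0], 0
    for value in values[1:]:
        delta = value - previous
        encoded.append(delta - previous_delta)
        previous, previous_delta = value, delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    values = [encoded[0]]
    delta = 0
    for delta_of_delta in encoded[1:]:
        delta += delta_of_delta
        values.append(values[-1] + delta)
    return values


# Hypothetical timestamps sampled roughly every 10 units.
timestamps = [1000, 1010, 1020, 1030, 1045]
encoded = delta_of_delta_encode(timestamps)
print(encoded)  # prints "[1000, 10, 0, 0, 5]"
```

Regularly spaced samples turn into runs of zeros and small numbers, which need far fewer bits than the raw values.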

#### Phase 5.0 - Follow-ups and Nice-to-have-s
- [ ] Default the setting's value to all of the keyword dimensions
- [ ] Support shard splitting on time_series indices
- [ ] Make an object or interface for `_id`'s values. Right now it's a `String` that we encode with `Uid.encodeId`. That was reasonable. Maybe it still is. But it feels complex, especially for tsdb, whose `_id` is always some bytes. Encoding it also wastes a byte about 1/128 of the time. It's a common prefix byte so this is probably not really an issue. But still. This is a big change but it'd make ES easier to read. It probably wouldn't really improve the storage though.
- [ ] Figure out how to specify tsdb settings in component templates. For example, `index.routing_path` can be specified in a composable index template if the data stream template's `index_mode` is set to `time_series`. But if this setting is specified in a component template, then it is also required to set the `index.mode` index setting. This feels backwards. @martijnvg 
- [ ] In order to retrieve the routing values (defined in `index.routing_path`), the source needs to be parsed on the coordinating node. However, if an ingest pipeline is executed, the source of the document is parsed a second time. Ideally the routing values should be extracted when ingest is performed, similar to how the `@timestamp` field is already retrieved from a document during pipeline execution.
- [ ] In order to determine the backing index a document should be routed to, a timestamp is parsed into an `Instant`. The format being used is `strict_date_optional_time_nanos||strict_date_optional_time||epoch_millis`. This allows the regular date format, the date nanos format, and epoch millis defined as a string. We can optimise the date parsing if we know the exact format being used. For example, if the data stream has a parameter that indicates the exact date format, we can optimise parsing by using only one of `strict_date_optional_time_nanos`, `strict_date_optional_time` or `epoch_millis`.
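
The timestamp-parsing item above contrasts a fallback chain of formats with a single known format. A rough sketch of that difference follows. It is a simplified model, not Elasticsearch's date parsing: the names are hypothetical, only UTC strings ending in `Z` are handled, and one ISO parser stands in for both `strict_date_optional_time_nanos` and `strict_date_optional_time`.

```python
import calendar
import re
import time

_ISO = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,9}))?Z$")


def parse_iso(text: str) -> int:
    """Parse a UTC ISO-8601 datetime (up to 9 fraction digits) to epoch nanos."""
    match = _ISO.match(text)
    if match is None:
        raise ValueError(f"not an ISO timestamp: {text!r}")
    seconds = calendar.timegm(time.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    fraction = (match.group(2) or "").ljust(9, "0")
    return seconds * 1_000_000_000 + int(fraction)


def parse_epoch_millis(text: str) -> int:
    """Parse epoch milliseconds given as a string to epoch nanos."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not epoch millis: {text!r}")
    return int(text) * 1_000_000


DEFAULT_CHAIN = (parse_iso, parse_epoch_millis)


def parse_timestamp(text: str, parsers=DEFAULT_CHAIN) -> int:
    """Try each parser in order; a known format means a one-element chain."""
    for parser in parsers:
        try:
            return parser(text)
        except ValueError:
            continue
    raise ValueError(f"no parser accepted {text!r}")


# Fallback chain: the ISO attempt fails first, then epoch millis succeeds.
from_millis = parse_timestamp("1650000000000")
# Known format: only one parser is ever tried.
from_iso = parse_timestamp("2022-04-15T05:20:00Z", parsers=(parse_epoch_millis,)[:0] + (parse_iso,))
```

When the format is known up front, the chain shrinks to a single parser and the failed attempts (and their exception handling) disappear from the hot path.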




