Task level rollup of metrics #1413

Open
@agershman

Is your feature request related to a problem? Please describe.

The metrics currently offered are at task-and-worker granularity. In a highly scaled environment, and with the default histogram buckets, this can lead to an explosion in the cardinality of metrics ingested by Prometheus. In many cases worker-level granularity is not needed to understand the overall pattern of how each task type is performing. For that reason, I'd like to see how open this project is to adding metrics that are rolled up at the task level.
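To put rough numbers on the cardinality concern, here is a back-of-the-envelope sketch. The task count, worker count, and bucket count are made-up inputs, not measurements from any deployment, and the bucket count depends on the client library's defaults. It assumes the usual histogram exposition of one series per bucket (including `+Inf`) plus `_sum` and `_count` for every label combination, and the worst case where every worker runs every task type. Some clients also emit a `_created` series, which is ignored here.

```python
def histogram_series(task_types, workers, finite_buckets):
    """Worst-case number of time series exposed by one histogram metric.

    Each label combination exposes one series per finite bucket, one for
    the +Inf bucket, and one each for _sum and _count.
    """
    series_per_combination = finite_buckets + 1 + 2
    return task_types * workers * series_per_combination


# Hypothetical deployment: 50 task types, 200 workers, 14 finite buckets.
with_worker_label = histogram_series(50, 200, 14)
# Rolling up to the task level is the same as having a single "worker".
task_level_only = histogram_series(50, 1, 14)

print(with_worker_label)  # 170000
print(task_level_only)    # 850
```

Under these assumptions the worker label multiplies the series count by the number of workers, for a single histogram.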

Unfortunately, relabeling in Prometheus is not a valid solution: dropping the worker label would violate the constraint that all samples in a given scrape must be distinct. Series that differ only by worker would end up with identical label sets, and that collision is a no-go. Additionally, aggregating the metrics after ingestion doesn't solve the resource cost of high-cardinality metrics at ingestion time.
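The collision can be illustrated with a small simulation. This is not Prometheus code, just a sketch that models a scrape as a list of `(name, labels, value)` tuples; the metric name `task_runtime_seconds_count` and the worker names are hypothetical.

```python
from collections import Counter


def drop_label(samples, label):
    """Return the samples with one label removed, as a labeldrop rule would."""
    return [
        (name, {k: v for k, v in labels.items() if k != label}, value)
        for name, labels, value in samples
    ]


def duplicate_series(samples):
    """Return every (name, label set) that appears more than once."""
    counts = Counter(
        (name, tuple(sorted(labels.items()))) for name, labels, _ in samples
    )
    return [series for series, n in counts.items() if n > 1]


# Two workers reporting the same task: distinct only because of "worker".
scrape = [
    ("task_runtime_seconds_count", {"task": "send_email", "worker": "w1"}, 10),
    ("task_runtime_seconds_count", {"task": "send_email", "worker": "w2"}, 7),
]

collisions_before = duplicate_series(scrape)
collisions_after = duplicate_series(drop_label(scrape, "worker"))
```

Before the drop there are no duplicates. After it, both samples share the same name and label set, and nothing in the relabeling step says how their values should be combined.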

Describe the solution you'd like

The proposed solution is to add new metric instruments alongside the existing ones, but without the worker label. At every existing call point where those instruments are incremented, set, or observed, we'd do the same for these task-level metrics. In short: keep what we have, preserving backwards compatibility, and add a second set of metrics that aren't worker-specific. I would leave the worker-specific metrics, such as the number of workers online, alone. This would target only the task-oriented metrics.
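A minimal sketch of the dual-recording idea follows. It does not use this project's code or a real Prometheus client: `LabeledCounter` is a tiny in-memory stand-in for a labeled counter, and the metric names and the `record_event` helper are hypothetical. The actual change would apply the same pattern to the real instruments.

```python
from collections import defaultdict


class LabeledCounter:
    """In-memory stand-in for a labeled counter (illustrative only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Require exactly the declared label names.
        if set(labels) != set(self.label_names):
            raise ValueError(f"{self.name} expects labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += amount


# Existing worker-level instrument, left unchanged for backwards compatibility.
events_by_worker = LabeledCounter("events_total", ["worker", "task"])
# Hypothetical new task-level instrument without the worker label.
events_by_task = LabeledCounter("task_events_total", ["task"])


def record_event(worker, task):
    """Update both instruments from a single call point."""
    events_by_worker.inc(worker=worker, task=task)
    events_by_task.inc(task=task)


for worker in ("w1", "w2", "w3"):
    record_event(worker, "send_email")
record_event("w1", "resize_image")

print(len(events_by_worker.values))  # 4 worker-level series
print(len(events_by_task.values))    # 2 task-level series
```

Users who only need the task-level view could then drop the worker-level metrics by name at scrape time, which avoids the label collision because whole series are removed rather than a single label.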

I'm happy to send a PR for this change but first wanted to gauge whether it would be accepted.
