Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CICD metrics #1681

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

christophe-kamphaus-jemmic
Copy link
Contributor

Fixes #1600

Changes

This PR adds metrics for CICD systems and related attributes.

Merge requirement checklist

docs/attributes-registry/cicd.md Outdated Show resolved Hide resolved
docs/cicd/cicd-metrics.md Outdated Show resolved Hide resolved

---

`cicd.worker.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many complex CI/CDs the workers are categorized by other attributes, such as the platform they run (ubuntu, windows) the tooling they have installed (java, go) or other attributes.

I lean towards leaving this attribute open (without pre-defined values). Also consider class as an alternative name to reflect its intention to group the workers into categories.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed this attribute to cicd.worker.class in 26d3064.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now actually removed cicd.worker.class in b332638.
See the discussion #1681 (comment) for reference.

@pablochacin
Copy link

Here we are introducing the concept of worker, but it seem disconnected from other elements of the domain.

For instance, pipelines are executed in workers. Shouldn't the metrics related to pipelines be somehow reference the workers? I'm not quite sure how this could be done. Just adding the worker type to the pipeline seems insufficient.

Maybe we should consider a worker id attribute. It seems useful to, for example, identify all errors in a pipeline comes from a faulty worker.

@christophe-kamphaus-jemmic
Copy link
Contributor Author

For instance, pipelines are executed in workers. Shouldn't the metrics related to pipelines be somehow reference the workers? I'm not quite sure how this could be done. Just adding the worker type to the pipeline seems insufficient.

Maybe we should consider a worker id attribute. It seems useful to, for example, identify all errors in a pipeline comes from a faulty worker.

I did consider this, however we could easily run into cardinality issues if we add such an attribute to metrics (eg. if dynamic workers are used). If we add a cicd.worker.id attribute to metrics it will have to be opt-in and with a warning about the cardinatity.

For CICD span conventions we should definitely keep this in mind. For example https://github.com/jenkinsci/opentelemetry-plugin emits span with an attribute jenkins.computer.name.

Also at some point I would like to be able to link host or container metrics to a pipeline run despite any cardinality issues as an opt-in feature: #1184 For this it makes sense to use a cicd.worker.id attribute as well.

I think we should discuss this further on a separate Github issue.

model/cicd/metrics.yaml Outdated Show resolved Hide resolved
model/cicd/metrics.yaml Outdated Show resolved Hide resolved
@@ -75,3 +75,78 @@ groups:
brief: >
The type of the task within a pipeline.
examples: ["build", "test", "deploy"]
- id: cicd.pipeline.result
type:
members:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should be consistent with names here.

Like:

  • succeeded
  • failed
  • errored
  • timed-out
  • cancelled
  • skipped

Or

  • success
  • failure
  • error
  • timeout
  • skip
  • cancel

similar verbiage to what is in CDEvents for parity https://github.com/cdevents/spec/blob/v0.4.1/core.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go with the 2nd option and rename
skipped → skip
cancelled → cancel

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second option seems to mix style (verbs, nouns).

Since this describes an event that has already happened, in my opinion the past tense makes more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename done in 81c4407

Good point @mrueg that we should have either all verbs or all nouns.

I think we should go with nouns since it most directly attributes a value to the result attribute. For example:
"The pipeline result is a success."
"The pipeline result is a timeout."

The CDEvents spec also went with nouns.

Skip is both a verb and a noun.
The only item that needs to be renamed to become a noun is
cancel → cancellation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in adf29b4

model/cicd/metrics.yaml Show resolved Hide resolved
model/cicd/metrics.yaml Outdated Show resolved Hide resolved
model/cicd/metrics.yaml Outdated Show resolved Hide resolved
model/cicd/metrics.yaml Show resolved Hide resolved
- id: metric.cicd.errors
type: metric
metric_name: cicd.errors
brief: 'The number of errors in the controller of the CICD system.'
Copy link
Contributor

@lmolkova lmolkova Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cicd.errors name seems too broad - it seems it would cover all kinds of errors in all CI/CD system parts and then it stops being practical.

The first time I read it, I assumed it to count pipeline run errors, but it's probably something else.

Wonder if we could limit the scope of this metric to something reasonable. E.g. should it be cicd.controller.errors? I could envision we'd want to differentiate per component, e.g. have a cicd.scheduler.errors, etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to count the errors from the CICD system (controller, scheduler, agent) in this metric and not errors from the pipeline run.

It is indeed good to be able to distinguish the different system components as well as errors from the pipeline runs.
Adriel also made a good point that it would be good to be able to count pipeline run errors separately from just the cicd.pipeline.run.duration with result=error because a single pipeline run might encounter several different errors (some recoverable and others not).

Most likely this will be a derived metric, eg. by using a span metrics connector on the pipeline run spans or count connector on the controller logs.

I will define the following metrics to cover these:
cicd.pipeline.run.errors for errors encountered as part of the pipeline run execution
cicd.system.errors for errrors encountered in CICD system components with an attribute component (eg. scheduler, agent, controller, …)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fc48dd2, 0be8d4d and 996e809.

model/cicd/registry.yaml Outdated Show resolved Hide resolved
model/cicd/metrics.yaml Outdated Show resolved Hide resolved
model/cicd/metrics.yaml Outdated Show resolved Hide resolved
model/cicd/metrics.yaml Show resolved Hide resolved
model/cicd/metrics.yaml Outdated Show resolved Hide resolved
- id: metric.cicd.errors
type: metric
metric_name: cicd.errors
brief: 'The number of errors in the controller of the CICD system.'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to count the errors from the CICD system (controller, scheduler, agent) in this metric and not errors from the pipeline run.

It is indeed good to be able to distinguish the different system components as well as errors from the pipeline runs.
Adriel also made a good point that it would be good to be able to count pipeline run errors separately from just the cicd.pipeline.run.duration with result=error because a single pipeline run might encounter several different errors (some recoverable and others not).

Most likely this will be a derived metric, eg. by using a span metrics connector on the pipeline run spans or count connector on the controller logs.

I will define the following metrics to cover these:
cicd.pipeline.run.errors for errors encountered as part of the pipeline run execution
cicd.system.errors for errrors encountered in CICD system components with an attribute component (eg. scheduler, agent, controller, …)

model/cicd/registry.yaml Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Needs More Approval
Development

Successfully merging this pull request may close these issues.

Add metrics for CICD job queues
8 participants