-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CICD metrics #1681
base: main
Are you sure you want to change the base?
Add CICD metrics #1681
Conversation
docs/attributes-registry/cicd.md
Outdated
|
||
--- | ||
|
||
`cicd.worker.type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many complex CI/CDs the workers are categorized by other attributes, such as the platform they run (ubuntu
, windows
) the tooling they have installed (java
, go
) or other attributes.
I lean towards leaving this attribute open (without pre-defined values). Also consider class
as an alternative name to reflect its intention to group the workers into categories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed this attribute to cicd.worker.class
in 26d3064.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now actually removed cicd.worker.class
in b332638.
See the discussion #1681 (comment) for reference.
Here we are introducing the concept of worker, but it seem disconnected from other elements of the domain. For instance, pipelines are executed in workers. Shouldn't the metrics related to pipelines be somehow reference the workers? I'm not quite sure how this could be done. Just adding the worker type to the pipeline seems insufficient. Maybe we should consider a |
I did consider this, however we could easily run into cardinality issues if we add such an attribute to metrics (eg. if dynamic workers are used). If we add a For CICD span conventions we should definitely keep this in mind. For example https://github.com/jenkinsci/opentelemetry-plugin emits span with an attribute Also at some point I would like to be able to link host or container metrics to a pipeline run despite any cardinality issues as an opt-in feature: #1184 For this it makes sense to use a cicd.worker.id attribute as well. I think we should discuss this further on a separate Github issue. |
@@ -75,3 +75,78 @@ groups: | |||
brief: > | |||
The type of the task within a pipeline. | |||
examples: ["build", "test", "deploy"] | |||
- id: cicd.pipeline.result | |||
type: | |||
members: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should be consistent with names here.
Like:
- succeeded
- failed
- errored
- timed-out
- cancelled
- skipped
Or
- success
- failure
- error
- timeout
- skip
- cancel
similar verbiage to what is in CDEvents for parity https://github.com/cdevents/spec/blob/v0.4.1/core.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will go with the 2nd option and rename
skipped → skip
cancelled → cancel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second option seems to mix style (verbs, nouns).
Since this describes an event that has already happened, in my opinion the past tense makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename done in 81c4407
Good point @mrueg that we should have either all verbs or all nouns.
I think we should go with nouns since it most directly attributes a value to the result attribute. For example:
"The pipeline result is a success."
"The pipeline result is a timeout."
The CDEvents spec also went with nouns.
Skip is both a verb and a noun.
The only item that needs to be renamed to become a noun is
cancel → cancellation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in adf29b4
model/cicd/metrics.yaml
Outdated
- id: metric.cicd.errors | ||
type: metric | ||
metric_name: cicd.errors | ||
brief: 'The number of errors in the controller of the CICD system.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cicd.errors
name seems too broad - it seems it would cover all kinds of errors in all CI/CD system parts and then it stops being practical.
The first time I read it, I assumed it to count pipeline run errors, but it's probably something else.
Wonder if we could limit the scope of this metric to something reasonable. E.g. should it be cicd.controller.errors
? I could envision we'd want to differentiate per component, e.g. have a cicd.scheduler.errors
, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended to count the errors from the CICD system (controller, scheduler, agent) in this metric and not errors from the pipeline run.
It is indeed good to be able to distinguish the different system components as well as errors from the pipeline runs.
Adriel also made a good point that it would be good to be able to count pipeline run errors separately from just the cicd.pipeline.run.duration with result=error because a single pipeline run might encounter several different errors (some recoverable and others not).
Most likely this will be a derived metric, eg. by using a span metrics connector on the pipeline run spans or count connector on the controller logs.
I will define the following metrics to cover these:
cicd.pipeline.run.errors
for errors encountered as part of the pipeline run execution
cicd.system.errors
for errrors encountered in CICD system components with an attribute component (eg. scheduler, agent, controller, …)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model/cicd/metrics.yaml
Outdated
- id: metric.cicd.errors | ||
type: metric | ||
metric_name: cicd.errors | ||
brief: 'The number of errors in the controller of the CICD system.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended to count the errors from the CICD system (controller, scheduler, agent) in this metric and not errors from the pipeline run.
It is indeed good to be able to distinguish the different system components as well as errors from the pipeline runs.
Adriel also made a good point that it would be good to be able to count pipeline run errors separately from just the cicd.pipeline.run.duration with result=error because a single pipeline run might encounter several different errors (some recoverable and others not).
Most likely this will be a derived metric, eg. by using a span metrics connector on the pipeline run spans or count connector on the controller logs.
I will define the following metrics to cover these:
cicd.pipeline.run.errors
for errors encountered as part of the pipeline run execution
cicd.system.errors
for errrors encountered in CICD system components with an attribute component (eg. scheduler, agent, controller, …)
This ensures that all result values are nouns.
Fixes #1600
Changes
This PR adds metrics for CICD systems and related attributes.
Merge requirement checklist
[chore]