Add metrics for CICD job queues #1600
CICD job queue metrics should allow monitoring the health of a CICD system following the Four Golden Signals (Latency, Traffic, Errors, Saturation). These signals should be metrics specific to CICD systems and vendor agnostic.

Existing metrics
Jenkins pipeline metrics: https://github.com/jenkinsci/prometheus-plugin/blob/master/docs/metrics/index.md
https://github.com/jenkinsci/opentelemetry-plugin/blob/main/docs/monitoring-metrics.md
Argo Workflows metrics: https://argo-workflows.readthedocs.io/en/latest/metrics/
Tekton queue metrics: https://tekton.dev/vault/pipelines-v0.59.x-lts/metrics/
ArgoCD sync metrics: https://argo-cd.readthedocs.io/en/latest/operator-manual/metrics/
Flux: https://fluxcd.io/flux/monitoring/metrics/
Keptn: https://github.com/keptn/lifecycle-toolkit/blob/main/dashboards/grafana/import/

Discussion
❓ Is there other prior art (metrics for existing CICD systems) that we should consider?
❓ Can we combine metrics for reconciliation-based solutions (Argo Workflows, Flux, Tekton, ArgoCD) with queue-based solutions (Jenkins)?
Ideas for metrics from #1111:
What metrics map to the 4 golden signals?
Latency: queue latency, reconciliation loop duration, pipeline run duration

Proposed metrics
Notes from SemCon meeting 2024-12-09
Josh (paraphrased): For saturation we usually have 2 metrics: an availability/total/limit count and a utilization (percentage used of the limit).
Are all proposed metrics captured by the different CICD systems?
Christophe:
Liudmila:
Josh:
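
As an illustration of the two saturation metrics Josh describes in these notes, here is a minimal Python sketch. The `WorkerPool` type, its field names, and the example numbers are hypothetical and not taken from any CICD system or from the proposed conventions. They only show how a limit count and a utilization value relate to each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    """Hypothetical snapshot of a CICD worker pool (illustration only)."""
    limit: int   # total number of workers the system may run at once
    in_use: int  # workers currently executing pipeline runs


def utilization(pool: WorkerPool) -> float:
    """Return the fraction of the limit that is currently in use."""
    if pool.limit <= 0:
        raise ValueError("limit must be positive to compute utilization")
    return pool.in_use / pool.limit


# Assumed example values: 15 of 20 workers are busy.
pool = WorkerPool(limit=20, in_use=15)
print(pool.limit)         # first metric: the limit count
print(utilization(pool))  # second metric: utilization as a fraction
```

A system without a fixed limit cannot compute the second value this way, which is one reason such a metric could only be recommended rather than required.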
Notes from CICD meeting 2024-12-12
Queue latency is meant to track the time a pipeline run spends between being triggered and starting to execute.
gantt
todayMarker off
dateFormat X
axisFormat
time queued :a1, 0, 2d
pipeline run duration :after a1 , 8d
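
The split shown in the chart above, time queued followed by pipeline run duration, can be sketched in Python as below. This is only a sketch under stated assumptions: the `PipelineRun` type, its field names, and the helper functions are hypothetical, and the timestamps are arbitrary time units chosen to mirror the 2 and 8 in the chart.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical record of one pipeline run (illustration only)."""
    pipeline_name: str
    triggered_at: float  # when the run was triggered
    started_at: float    # when the run started executing
    finished_at: float   # when the run finished


def queue_latency(run: PipelineRun) -> float:
    """Time spent between being triggered and starting to execute."""
    return run.started_at - run.triggered_at


def run_duration(run: PipelineRun) -> float:
    """Time spent executing, excluding the time queued."""
    return run.finished_at - run.started_at


def max_queue_latency_by_pipeline(runs: Iterable[PipelineRun]) -> Dict[str, float]:
    """Group runs by pipeline name and keep the worst queue latency of each."""
    worst: Dict[str, float] = {}
    for run in runs:
        latency = queue_latency(run)
        if latency > worst.get(run.pipeline_name, float("-inf")):
            worst[run.pipeline_name] = latency
    return worst


runs = [
    PipelineRun("build", triggered_at=0.0, started_at=2.0, finished_at=10.0),
    PipelineRun("build", triggered_at=5.0, started_at=6.0, finished_at=9.0),
    PipelineRun("deploy", triggered_at=1.0, started_at=7.0, finished_at=8.0),
]
print(queue_latency(runs[0]), run_duration(runs[0]))
print(max_queue_latency_by_pipeline(runs))
```

Grouping by pipeline name makes it visible when one pipeline's runs wait much longer than others, although that grouping presupposes the system exposes a pipeline name at all.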

It might also be good to track queue latency per pipeline, because some pipeline runs might remain stuck longer in the queue waiting for an available worker.

Regarding the pipeline name attribute, Argo Workflows does not currently have that concept. There are open issues in its project to work on it.

Saturation is difficult for some systems to compute:
Even for autoscaling systems, it might still be good to count momentary utilization:

It might be good to have an attribute worker.type for distinguishing between vm, pod, …

The saturation metric should be recommended, but not mandatory.

Regarding the metric name: should it be pipeline.run instead of pipeline_run, to be consistent with other currently defined attribute names?

Next steps
I will prepare a PR taking this feedback into account.

Overview
Add metric conventions for pipeline (CICD) job queues. Stems from #1111