Add metrics for CICD job queues #1600

Closed
adrielp opened this issue Nov 21, 2024 · 3 comments · Fixed by #1681

@adrielp

adrielp commented Nov 21, 2024

Overview

Add metric conventions for pipeline (CICD) job queues. Stems from #1111

@christophe-kamphaus-jemmic

CICD job queue metrics should allow monitoring the health of a CICD system following the Four Golden Signals (Latency, Traffic, Errors, Saturation). These signals should be metrics specific to CICD systems and vendor agnostic.

Existing metrics

Jenkins pipeline metrics

https://github.com/jenkinsci/prometheus-plugin/blob/master/docs/metrics/index.md

default_jenkins_executors_available	Shows how many Jenkins Executors are available	gauge
default_jenkins_executors_busy	Shows how many Jenkins Executors are busy	gauge
default_jenkins_executors_connecting	Shows how many Jenkins Executors are connecting	gauge
default_jenkins_executors_defined	Shows how many Jenkins Executors are defined	gauge
default_jenkins_executors_idle	Shows how many Jenkins Executors are idle	gauge
default_jenkins_executors_online	Shows how many Jenkins Executors are online	gauge
default_jenkins_executors_queue_length	Shows the number of items that can run but are waiting on a free executor	gauge
default_jenkins_nodes_online	Shows Nodes online status	gauge
default_jenkins_builds_duration_milliseconds_summary	Summary of Jenkins build times in milliseconds by Job	summary
default_jenkins_builds_success_build_count	Successful build count	counter
default_jenkins_builds_failed_build_count	Failed build count	counter
default_jenkins_builds_unstable_build_count	Unstable build count	counter
default_jenkins_builds_total_build_count	Total build count (excluding not_built statuses)	counter
default_jenkins_builds_aborted_build_count	Aborted build count	counter
default_jenkins_builds_health_score	Health score of a job	gauge
default_jenkins_builds_available_builds_count	Gauge which indicates how many builds are available for the given job	gauge
default_jenkins_builds_discard_active	Gauge which indicates if the build discard feature is active for the job.	gauge
default_jenkins_builds_running_build_duration_milliseconds	Gauge which indicates the runtime of the current build.	gauge
default_jenkins_builds_last_build_result_ordinal	Build status of a job (last build) (0=SUCCESS,1=UNSTABLE,2=FAILURE,3=NOT_BUILT,4=ABORTED)	gauge
default_jenkins_builds_last_build_result	Build status of a job as a boolean value (1 or 0). <br/>Where 1 stands for the build status SUCCESS or UNSTABLE and 0 for the build states FAILURE,NOT_BUILT or ABORTED	gauge
default_jenkins_builds_last_build_duration_milliseconds	Build times in milliseconds of last build	gauge
default_jenkins_builds_last_build_start_time_milliseconds	Last build start timestamp in milliseconds	gauge
default_jenkins_builds_last_build_tests_total	Number of total tests during the last build	gauge
default_jenkins_builds_last_last_build_tests_skipped	Number of skipped tests during the last build	gauge
default_jenkins_builds_last_build_tests_failing	Number of failing tests during the last build	gauge
default_jenkins_builds_last_stage_result_ordinal	Status ordinal of a stage in a pipeline (0=NOT_EXECUTED,1=ABORTED,2=SUCCESS,3=IN_PROGRESS,4=PAUSED_PENDING_INPUT,5=FAILED,6=UNSTABLE)	gauge
default_jenkins_builds_last_stage_duration_milliseconds_summary	Summary of Jenkins build times by Job and Stage in the last build	summary
default_jenkins_builds_last_logfile_size_bytes	Gauge which shows the log file size in bytes.

https://github.com/jenkinsci/opentelemetry-plugin/blob/main/docs/monitoring-metrics.md

ci.pipeline.run.duration Duration of runs
ci.pipeline.run.active Gauge of active jobs
ci.pipeline.run.launched Job launched
ci.pipeline.run.started Job started
ci.pipeline.run.completed Job completed
ci.pipeline.run.aborted Job aborted
ci.pipeline.run.success Job successful
ci.pipeline.run.failed Job failed
jenkins.executor.available
jenkins.executor.busy
jenkins.executor.idle
jenkins.executor.online
jenkins.executor.connecting
jenkins.executor.defined
jenkins.executor.queue
jenkins.queue.waiting Number of tasks in the queue with the status 'buildable' or 'pending' (see Queue#getUnblockedItems())
jenkins.queue.blocked Number of blocked tasks in the queue. Note that waiting for an executor to be available is not a reason to be counted as blocked. (see QueueListener#onEnterBlocked() - QueueListener#onLeaveBlocked())
jenkins.queue.buildable Number of tasks in the queue with the status 'buildable' or 'pending' (see Queue#getBuildableItems())
jenkins.queue.left Total count of tasks that have been processed (see QueueListener#onLeft())
jenkins.queue.time_spent_millis Total time spent in queue by the tasks that have been processed (see QueueListener#onLeft() and Item#getInQueueSince())
jenkins.agents.total Number of agents
jenkins.agents.online Number of online agents
jenkins.agents.offline Number of offline agents
jenkins.agents.launch.failure Number of failed launched agents
jenkins.cloud.agents.completed Number of provisioned cloud agents
jenkins.cloud.agents.launch.failure Number of failed cloud agents
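
The two cumulative values jenkins.queue.left and jenkins.queue.time_spent_millis listed above can be combined into an average queue wait over an interval. A minimal sketch, assuming both values are sampled at the same two instants; the function name and the sample values are hypothetical and not part of the plugin.

```python
def mean_queue_wait_ms(left_start, left_end, spent_ms_start, spent_ms_end):
    """Average time in queue (ms) for tasks that left the queue between two samples.

    Returns None when no task left the queue in the interval
    (or when the counters were reset between the samples).
    """
    tasks = left_end - left_start
    if tasks <= 0:
        return None
    return (spent_ms_end - spent_ms_start) / tasks


# Hypothetical samples taken one minute apart.
print(mean_queue_wait_ms(100, 110, 50_000, 80_000))
```

This is the kind of derived value that a dedicated queue latency histogram would make available directly, including its distribution rather than only the mean.
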

Argo Workflows metrics

https://argo-workflows.readthedocs.io/en/latest/metrics/

gauge A gauge of the number of workflows currently in the cluster in each phase.
operation_duration_seconds A histogram of durations of operations. An operation is a single workflow reconciliation loop within the workflow-controller.
pods_gauge A gauge of the number of workflow created pods currently in the cluster in each phase.
pod_missing A counter of pods that were not seen, for example because they were deleted by Kubernetes. You should only see this under high load.
pod_pending_count A counter of pods that have been seen in the Pending state.
pods_total_count A gauge of the number of pods which have entered each phase and then observed by the controller. 
queue_adds_count A counter of additions to the work queues inside the controller.
queue_depth_gauge A gauge of the current depth of the queues.
queue_duration A histogram of the time events in the queues are taking to be processed.
queue_latency A histogram of the time events in the queues are taking before they are processed.
queue_retries A counter of the number of times a message has been retried in the queue
queue_unfinished_work A gauge of the number of queue items that have not been processed yet.
total_count A counter of workflows that have entered each phase for tracking them through their life-cycle, by namespace.
workers_busy_count A count of queue workers that are busy.
workflow_condition A gauge of the number of workflows with different conditions. This will tell you the number of workflows with running pods.

Tekton queue metrics

https://tekton.dev/vault/pipelines-v0.59.x-lts/metrics/

tekton_pipelines_controller_pipelinerun_duration_seconds_[bucket, sum, count]	Histogram
tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_[bucket, sum, count]	Histogram
tekton_pipelines_controller_pipelinerun_total	Counter
tekton_pipelines_controller_running_pipelineruns	Gauge
tekton_pipelines_controller_taskrun_duration_seconds_[bucket, sum, count]	Histogram
tekton_pipelines_controller_taskrun_total	Counter
tekton_pipelines_controller_running_taskruns	Gauge
tekton_pipelines_controller_running_taskruns_throttled_by_quota	Gauge
tekton_pipelines_controller_running_taskruns_throttled_by_node	Gauge
tekton_pipelines_controller_client_latency_[bucket, sum, count]	Histogram

ArgoCD sync metrics

https://argo-cd.readthedocs.io/en/latest/operator-manual/metrics/

argocd_app_k8s_request_total counter Number of Kubernetes requests executed during application reconciliation
argocd_app_reconcile histogram Application reconciliation performance in seconds.
argocd_app_sync_total counter Counter for application sync history

Flux

https://fluxcd.io/flux/monitoring/metrics/

Resource reconciliation duration metrics:
gotk_reconcile_duration_seconds_bucket{kind, name, namespace, le}
gotk_reconcile_duration_seconds_sum{kind, name, namespace}
gotk_reconcile_duration_seconds_count{kind, name, namespace}
controller_runtime_reconcile_total{controller, result}

Keptn

https://github.com/keptn/lifecycle-toolkit/blob/main/dashboards/grafana/import/

keptn_app_active
keptn_app_count_total
keptn_app_deploymentinterval
keptn_app_deploymentduration
keptn_promotion_count_total
keptn_task_active
keptn_task_count
keptn_task_duration_bucket
keptn_evaluation_active
keptn_evaluation_count_total
keptn_evaluation_duration_bucket
keptn_deployment_active
keptn_deployment_count
keptn_deployment_duration_bucket
keptn_deployment_deploymentduration
keptn_deployment_app_previousversion
keptn_deployment_app_version
keptn_deployment_workload_previousversion
keptn_deployment_workload_version
keptn_deployment_deploymentinterval

Discussion

❓Is there other prior art (metrics for existing CICD systems) that we should consider?

❓Can we combine metrics for reconciliation-based solutions (Argo Workflows, Flux, Tekton, ArgoCD) with those for queue-based solutions (Jenkins)?

Ideas for metrics from #1111:

  • duration of pipelineRuns (by status, pipeline)
  • count of pipelineRuns (by status, pipeline)
  • count of agents
  • queue length of pending pipelineRuns
  • duration for how long a pipelineRun is in the queue before starting execution

What metrics map to the 4 golden signals?

Latency: queue latency, reconciliation loop duration, pipeline run duration
Traffic: pipeline run count, queue length
Errors: pipeline run count (failed vs total), controller error count
Saturation: agent/worker count (busy vs total)

Proposed metrics

cicd.queue.latency histogram
cicd.queue.length updowncounter
cicd.pipeline_run.duration { result=success|failed|aborted, cicd.pipeline.name }  <-- histogram or other type?
cicd.pipeline_run.count counter { result=success|failed|aborted, cicd.pipeline.name }
cicd.error.count counter { error.type }
cicd.worker.count updowncounter { state=busy|idle }
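
How the proposed instruments would move over the life of one pipeline run can be sketched with a plain in-memory model. This is not the OpenTelemetry API: the class `CicdMetrics` and its methods are hypothetical, and only the metric names and attributes in the comments come from the proposal above.

```python
from collections import Counter


class CicdMetrics:
    """Hypothetical in-memory stand-in for the proposed instruments."""

    def __init__(self, workers):
        self.queue_length = 0                      # cicd.queue.length (updowncounter)
        self.worker_count = Counter(idle=workers)  # cicd.worker.count { state }
        self.run_count = Counter()                 # cicd.pipeline_run.count { result, cicd.pipeline.name }

    def enqueue(self):
        # A run was triggered and is waiting for a worker.
        self.queue_length += 1

    def start(self):
        # A run leaves the queue and occupies one worker.
        self.queue_length -= 1
        self.worker_count["idle"] -= 1
        self.worker_count["busy"] += 1

    def finish(self, pipeline, result):
        # The worker is released and the finished run is counted.
        self.worker_count["busy"] -= 1
        self.worker_count["idle"] += 1
        self.run_count[(pipeline, result)] += 1


m = CicdMetrics(workers=2)
m.enqueue()
m.enqueue()
m.start()
print(m.queue_length, dict(m.worker_count))
m.finish("build", "success")
print(dict(m.run_count))
```

The updowncounters go up and down with the current state, while the run counter only ever increases, which is why the proposal uses different instrument types for them.
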

@christophe-kamphaus-jemmic
Copy link
Contributor

Notes from SemCon meeting 2024-12-09

Josh (paraphrased):
Three of the golden signals (latency, traffic, errors) can come from one histogram metric.
That histogram has a count of everything, e.g. a count of the number of requests that happen. That is your traffic.
Errors are counted through a label on that histogram that says whether the request succeeded or failed.
Latency is the distribution of the histogram.
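
Josh's point can be illustrated with a toy histogram: one instrument that records a duration together with a result attribute yields traffic, errors, and latency. A minimal sketch in plain Python, not tied to any metrics SDK; the bucket bounds, the `result` values, and the sample durations are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

BOUNDS = [1.0, 5.0, 30.0]  # upper bucket bounds in seconds (assumed)


class Histogram:
    def __init__(self):
        # One list of bucket counts per attribute value; the last bucket is +Inf.
        self.buckets = defaultdict(lambda: [0] * (len(BOUNDS) + 1))

    def record(self, seconds, result):
        # bisect_left picks the first bucket whose upper bound is >= seconds.
        self.buckets[result][bisect_left(BOUNDS, seconds)] += 1

    def count(self, result=None):
        rows = self.buckets.values() if result is None else [self.buckets[result]]
        return sum(sum(row) for row in rows)

    def distribution(self):
        # Bucket counts summed over all attribute values.
        return [sum(col) for col in zip(*self.buckets.values())]


h = Histogram()
for seconds, result in [(0.4, "success"), (3.2, "success"), (12.0, "failure"), (45.0, "success")]:
    h.record(seconds, result)

traffic = h.count()                          # total number of recorded runs
error_ratio = h.count("failure") / traffic   # share of runs labelled as failed
latency = h.distribution()                   # counts per bucket
print(traffic, error_ratio, latency)
```
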

For saturation we usually have 2 metrics: an availability/total/limit count and a utilization (percentage used of the limit).

Are all proposed metrics captured by the different CICD systems?
Can existing metrics be mapped to the proposed metrics?

Christophe:
I tried to map existing metrics to common ones, but it's very difficult because most systems are reporting different things.
In the end I proposed mostly new metrics that would cover the general class of needs shared across all CICD systems.

Liudmila:
For databases or messaging we did not even try to adapt existing metrics to ideal metrics.
It's a great approach to define something brand new that OTel or native instrumentation might be able to implement.

Josh:
The questions for the CICD group are:
Can we combine metrics for the different CICD systems?
Can we implement the proposed metrics consistently between the different systems?
Do they cover the same set of use cases and have similar ways of solving the use cases?
E.g., do I take the same action in all CICD systems when addressing a pipeline duration spike, and do I understand what the problem is? Probably yes, but that might not be true.
For count of agents or queue length, you might do something different, but maybe you don't.
When I look at this I see a really good set of metrics for your baseline.

@christophe-kamphaus-jemmic
Copy link
Contributor

Notes from CICD meeting 2024-12-12

Queue latency is meant to track the time a pipeline run spends from having been triggered to when it starts executing.
Pipeline run duration tracks the time the run spends from execution start to when it has finished execution.

gantt
    todayMarker off
    dateFormat X
    axisFormat  
    time queued           :a1, 0, 2d
    pipeline run duration :after a1 , 8d

It might also be good to track queue latency per pipeline, because some pipeline runs might remain stuck longer in the queue waiting for an available worker.
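
Both intervals, and the per-pipeline view of queue latency, can be derived from three timestamps per run. A minimal sketch, assuming each run records when it was triggered, when it started executing, and when it finished; the `Run` type, its field names, and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    pipeline: str
    triggered: float  # seconds since an arbitrary epoch
    started: float
    finished: float

    @property
    def queue_latency(self):
        # Time from trigger until execution starts.
        return self.started - self.triggered

    @property
    def duration(self):
        # Time from execution start until the run has finished.
        return self.finished - self.started


def max_queue_latency_by_pipeline(runs):
    """Longest observed queue latency for each pipeline name."""
    worst = defaultdict(float)
    for run in runs:
        worst[run.pipeline] = max(worst[run.pipeline], run.queue_latency)
    return dict(worst)


runs = [
    Run("build", triggered=0, started=2, finished=10),
    Run("build", triggered=5, started=6, finished=9),
    Run("deploy", triggered=1, started=31, finished=40),
]
print(runs[0].queue_latency, runs[0].duration)
print(max_queue_latency_by_pipeline(runs))
```

Grouping by pipeline makes a case like the hypothetical "deploy" run visible, which would be hidden in a single system-wide average.
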

Regarding the pipeline name attribute, Argo Workflows does not currently have that concept. There are open issues in its project to work on it.

Saturation is difficult to compute for some systems:
How would you track utilization of a k8s cluster if it's auto-scaling?
Also nodeSelectors might restrict pipeline runs from running on certain nodes.

Even for autoscaling systems, it might still be good to track momentary utilization:

  • remaining stuck for a long time at 100% might indicate an autoscaling issue
  • having low utilization might indicate an issue with scale down
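
The two warning signs above could be checked against a series of momentary utilization samples. A rough sketch, assuming utilization is computed as busy workers divided by total workers at each sample; the thresholds, the window length, and the function name are hypothetical.

```python
def classify_utilization(samples, window=3, high=1.0, low=0.2):
    """Flag sustained extremes in a list of (busy, total) worker samples.

    Returns "stuck_high" if the last `window` samples are all at or above
    `high`, "stuck_low" if they are all at or below `low`, otherwise "ok".
    """
    recent = [busy / total for busy, total in samples[-window:] if total > 0]
    if len(recent) < window:
        return "ok"  # not enough data to judge
    if all(u >= high for u in recent):
        return "stuck_high"  # may indicate a scale-up issue
    if all(u <= low for u in recent):
        return "stuck_low"   # may indicate a scale-down issue
    return "ok"


print(classify_utilization([(4, 4), (6, 6), (8, 8)]))
print(classify_utilization([(1, 10), (1, 10), (2, 10)]))
print(classify_utilization([(4, 4), (3, 6), (6, 6)]))
```

In the first example the total grows from 4 to 8 workers, yet every sample is fully busy, which is the "stuck at 100%" pattern even though autoscaling is adding capacity.
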

It might be good to have an attribute worker.type for distinguishing between vm, pod, …

The saturation metric should be recommended, but not mandatory.

Regarding metric name: should it be pipeline.run instead of pipeline_run to be consistent with other currently defined attribute names?
For the metric description, take a look at how HTTP durations are defined.

Next steps

I will prepare a PR taking this feedback into account.
