-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design proposal: Descheduler support #258
Conversation
Skipping CI for Draft Pull Request. |
43d6da5
to
b333d9a
Compare
@RobertKrawitz Could you please have a look? |
@ingvagabund Could you please have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, this alone will not solve the descheduler problem; this is just one piece of the puzzle. Correct?
@@ -0,0 +1,105 @@ | |||
# Overview | |||
Deschedulers are Kubernetes components that act as load balancers for compute nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's one use for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your feedback. Could you please provide additional details that would improve this definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/kubernetes-sigs/descheduler:
* Some nodes are under or over utilized.
* The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node affinity requirements are not satisfied any more.
* Some nodes failed and their pods moved to other nodes.
* New nodes are added to clusters.```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use singular here. A cluster is expected to have only a single descheduler installed/running/active.
b333d9a
to
839cd46
Compare
Thank you for the feedback @RobertKrawitz. AFAIU from my discussions with @stu-gott, the agreement was:
|
The descheduler itself needs to respect these new annotations. It isn't clear to me from the description whether it does. |
Yes, the descheduler needs to respect the new annotations. |
Exactly -- without that, the solution is not complete. |
1. kubevirt/kubevirt | ||
2. kubevirt/hyperconverged-cluster-operator | ||
|
||
# Design |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please mention the current and expected descheduler workflow? Something between the following lines:
Current workflow:
- iterate through pods (the descheduling framework performs some filtering to reduce the set of pods to process)
- sort the pods (the descheduling framework allows to implemented custom sorters as well)
- iterate through filtered and sorted pods and perform eviction until either all such pods are evicted or a limit is reached (e.g. pods per node, pods per namespace). If the eviction is successful, increase counter of evicted pods. If not, move to another pod.
Expected workflow: extend the current workflow with:
- before evicting the pods sort them based on
descheduler.alpha.kubernetes.io/request-evict-only
anddescheduler.alpha.kubernetes.io/eviction-in-progress
annotations to prefer annotated pods first - if
descheduler.alpha.kubernetes.io/eviction-in-progress
annotation is present, do not invoke the eviction API, only increase the pod counter - if
descheduler.alpha.kubernetes.io/request-evict-only
annotation is present, invoke the eviction API.
Now about the eviction API error:
- 200 return code should not happen for annotated pods given "request-evict-only" is expected
- 500 return code means something went wrong => do not increase the pod counter
- 429 return code means eviction is scheduled => increase the pod counter.
Note: descheduler.alpha.kubernetes.io/request-evict-only
annotation is needed to tell the descheduler to interpret the eviction error differently. descheduler.alpha.kubernetes.io/eviction-in-progress
to acknowledge the previous eviction request and keep the counters up to date in case the descheduler is restarted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevertheless, given the validating webhook is s custom implementation and there's no upstream eviction API equivalent for "eviction request confirmed" return code the descheduler will need to keep the feature as alpha until there's an upstream/community accepted solution for it. Temporarily, the descheduler can either make the conclusion on the request based on the custom error code or parsing the error message text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these steps should go into the descheduler's requirement document which should be the counterpart to this document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, very interesting.
The "Design" section is missing some more details, including a flow of what happens, or what is expected to happen after each step.
At the moment, it only covers what the VMI controller needs to do, without details about other parts that have been mentioned in the overview.
I am also not that sure which components are under KueVirt, which ones already exists and what needs to change.
Generally speaking, I think we again try to use webhooks instead of using policies that the original feature better provide. We have a similar situation with evictions... it has policies which possibly could replace the need for custom webhooks. Has this been considered?
2. Enable cluster admins to introduce a descheduler to an existing cluster with KubeVirt deployed. | ||
|
||
## Non Goals | ||
The descheduler should respect the two annotations included in this document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An annotation
sounds like an implementation detail which is not clear at this stage.
Perhaps you should describe it in a different way. E.g. existing desch... to be aware of VM workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased it.
- Cluster administrator | ||
|
||
## User Stories | ||
- As a cluster admin running a K8s cluster with a descheduler, I want to deploy and use KubeVirt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and load balance my VMs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my new understanding, the descheduler does more than load balance.
I need to update the part describing the descheduler.
|
||
## Repos | ||
1. kubevirt/kubevirt | ||
2. kubevirt/hyperconverged-cluster-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it relevant to this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An enabler (feature-gate) should be added to have support for the feature in deployments which use it to manage KubeVirt.
f9363d4
to
6fb474c
Compare
Currently, when the criteria for VMI evacuation are met [1]: 1. The eviction admitter marks the VMI for evacuation. 2. The eviction admitter approves the eviction request. 3. kube-api denies the eviction request because the virt-launcher is protected by a PDB [2]. When a descheduler[3] tries to evict a virt-launcher pod owned by a VMI that could be evacuated, it receives a denial for the eviction request and cannot understand that a VMI evacuation was triggered in the background. In order to support descheduler integration [4], deny the eviction request with a spacial message when the VMI is marked for evacuation. This is done in order to: 1. Signal the descheduler that an evacuation was triggered although the eviction was denied. 2. Use the existing K8s eviction API. In this scenario, kube-api will not check whether the virt-launcher is protected by a PDB. Note: In case the Eviction admitter is down - kube-api will treat the eviction request as if it was approved by the admitter, and will continue to check whether a PDB protects the virt-launcher. In this case the VMI will not be marked for evacuation and the eviction initiator receive a regular denial response because of the PDB. [1] kubevirt#11286 [2] https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/#how-api-initiated-eviction-works [3] https://github.com/kubernetes-sigs/descheduler [4] kubevirt/community#258 Signed-off-by: Orel Misan <[email protected]>
Currently, when the criteria for VMI evacuation are met [1]: 1. The eviction admitter marks the VMI for evacuation. 2. The eviction admitter approves the eviction request. 3. kube-api denies the eviction request because the virt-launcher is protected by a PDB [2]. When a descheduler[3] tries to evict a virt-launcher pod owned by a VMI that could be evacuated, it receives a denial for the eviction request and cannot understand that a VMI evacuation was triggered in the background. In order to support descheduler integration [4], deny the eviction request with a spacial message when the VMI is marked for evacuation. This is done in order to: 1. Signal the descheduler that an evacuation was triggered although the eviction was denied. 2. Use the existing K8s eviction API. In this scenario, kube-api will not check whether the virt-launcher is protected by a PDB. Note: In case the Eviction admitter is down - kube-api will treat the eviction request as if it was approved by the admitter, and will continue to check whether a PDB protects the virt-launcher. In this case the VMI will not be marked for evacuation and the eviction initiator receive a regular denial response because of the PDB. [1] kubevirt#11286 [2] https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/#how-api-initiated-eviction-works [3] https://github.com/kubernetes-sigs/descheduler [4] kubevirt/community#258 Signed-off-by: Orel Misan <[email protected]>
Currently, when the criteria for VMI evacuation are met [1]: 1. The eviction admitter marks the VMI for evacuation. 2. The eviction admitter approves the eviction request. 3. kube-api denies the eviction request because the virt-launcher is protected by a PDB [2]. When a descheduler[3] tries to evict a virt-launcher pod owned by a VMI that could be evacuated, it receives a denial for the eviction request and cannot understand that a VMI evacuation was triggered in the background. In order to support descheduler integration [4], deny the eviction request with a spacial message when the VMI is marked for evacuation. This is done in order to: 1. Signal the descheduler that an evacuation was triggered although the eviction was denied. 2. Use the existing K8s eviction API. In this scenario, kube-api will not check whether the virt-launcher is protected by a PDB. Note: In case the Eviction admitter is down - kube-api will treat the eviction request as if it was approved by the admitter, and will continue to check whether a PDB protects the virt-launcher. In this case the VMI will not be marked for evacuation and the eviction initiator receive a regular denial response because of the PDB. [1] kubevirt#11286 [2] https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/#how-api-initiated-eviction-works [3] https://github.com/kubernetes-sigs/descheduler [4] kubevirt/community#258 Signed-off-by: Orel Misan <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
|-----------------------|-------------------|-------------------------------|-------------------------|----------------------------------| | ||
| None | N/A | True | True | 200 - Eviction granted | | ||
| LiveMigrate | True | True | False | 429 - Eviction blocked by PDB | | ||
| LiveMigrate | False | False | False | 429 - Eviction denied by webhook | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orelmisan @fossedihelm @ingvagabund
It seems strange to me that we return the same response for VMIs that won't be evicted as we do for those that will be evicted. According to this response, a 429 response means that the launcher pod will be counted as an evicted pod. This is odd because, in cases where the VMI is not migratable, the launcher pod won't be evicted and therefore shouldn't be counted as such.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The returned structure has an error code and message fields.
For cases where the VMI is evacuated, the message field is populated with a dedicated message.
Please see:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this, but the launcher would still be counted by the descheduler in that case, even though it won't get evicted if I understood correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following kubevirt/kubevirt#11532 The descheduler can understand whether a virt-launcher pod will be evacuated or not - based on the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope it does. It sounds odd to me to rely on an error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior to that PR, the descheduler couldn't tell if a virt-launcher pod was evacuated in the background.
It only saw error code 429, due to the PDBs protecting the pods and assumed the pods will not move.
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/request-evict-only` annotation to virt-launcher pods. This annotation should be added to already running pods too. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
Add the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation to virt-launcher pods during evacuation migration./ This annotation indicates pods whose eviction was initiated by an external component. ref: kubevirt/community#258 Signed-off-by: fossedihelm <[email protected]>
What this PR does / why we need it:
Add a design proposal to support running KubeVirt in a cluster that also has a descheduler.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
PR kubevirt/kubevirt#11286 describes the current way KubeVirt handles eviction requests.
Descheduler proposal: kubernetes-sigs/descheduler#1354
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note: