Descheduler Study Notes

1. Rebalancing Strategies

As of v0.21.0, Descheduler implements nine strategies for rebalancing resources in a Kubernetes cluster.

RemoveDuplicates
LowNodeUtilization
HighNodeUtilization
RemovePodsViolatingInterPodAntiAffinity
RemovePodsViolatingNodeAffinity
RemovePodsViolatingNodeTaints
RemovePodsViolatingTopologySpreadConstraint
RemovePodsHavingTooManyRestarts
PodLifeTime

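These strategies share several top-level settings:

nodeSelector: a node label selector that limits which nodes Descheduler processes, that is, which nodes pods may be evicted from.
evictLocalStoragePods: when true, pods that use local storage volumes may be evicted. Defaults to false.
evictSystemCriticalPods: when true, pods of any priority may be evicted, including system-critical pods such as those in the kube-system namespace. This is dangerous and not recommended. Defaults to false.
ignorePvcPods: when true, pods that mount PVC volumes are not evicted. Defaults to false, meaning pods with PVCs may be evicted.
maxNoOfPodsToEvictPerNode: the maximum number of pods that may be evicted from each node, counted as the total across all strategies.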

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
nodeSelector: prod=dev
evictLocalStoragePods: true
evictSystemCriticalPods: true
maxNoOfPodsToEvictPerNode: 40  # caps evictions per node so that one run cannot evict too many pods and destabilize the cluster
ignorePvcPods: false
strategies:
  ...

1.1. RemoveDuplicates

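This strategy ensures that at most one pod belonging to the same controller (ReplicaSet, ReplicationController, StatefulSet, or Job; RS is used as the example below) runs on any given node. If several pods of the same RS end up on one node, the extra pods are evicted and only one is left running there. A typical scenario: some nodes go offline because of a hardware fault or another unexpected failure, and their pods are rescheduled onto the remaining nodes, which may leave several pods of the same RS on a single node. Once the failed nodes come back online, this strategy evicts the surplus pods so that the scheduler can place them on the recovered nodes.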
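The rule described above can be illustrated with a minimal Python sketch. It is not the Descheduler implementation; the pod dictionaries and their keys (`name`, `node`, `owner`) are hypothetical and exist only for this illustration. The first pod seen for each (node, owner) pair is kept, and the rest are reported as eviction candidates.

```python
def find_duplicates(pods):
    """Return names of pods to evict so each (node, owner) pair keeps one pod.

    `pods` is a list of dicts with hypothetical keys: name, node, owner.
    """
    seen = set()
    to_evict = []
    for pod in pods:
        key = (pod["node"], pod["owner"])
        if key in seen:
            # Another pod of the same owner already runs on this node.
            to_evict.append(pod["name"])
        else:
            seen.add(key)
    return to_evict


pods = [
    {"name": "web-1", "node": "node-a", "owner": "ReplicaSet/web"},
    {"name": "web-2", "node": "node-a", "owner": "ReplicaSet/web"},
    {"name": "web-3", "node": "node-b", "owner": "ReplicaSet/web"},
    {"name": "db-1", "node": "node-a", "owner": "StatefulSet/db"},
]
print(find_duplicates(pods))  # ['web-2']
```

Only web-2 is reported, because web-1 already represents ReplicaSet/web on node-a, while web-3 and db-1 are the only pods of their owner on their node.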

Parameters:

Optional parameter: excludeOwnerKinds, a list. Pods whose ownerReferences contain any of the listed kinds are not evicted. Note that pods created by a Deployment are still evicted by this strategy; to protect them, list ReplicaSet rather than Deployment, because such pods are owned by a ReplicaSet.

Name	Type
excludeOwnerKinds	list(string)

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: true
     params:
       removeDuplicates:
         excludeOwnerKinds:
         - "ReplicaSet"

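Filter parameters:

Name	Type	Description
namespaces	list	Supports two filtering modes, include and exclude
thresholdPriority	int	Sets the eviction priority threshold directly as a number
thresholdPriorityClassName	string	Sets the threshold through a Kubernetes PriorityClass; an error is raised if the PriorityClass does not exist in the cluster
nodeFit	bool	When true, a pod that meets the eviction criteria is evicted only if another node can run it; otherwise it is left in place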

1.2. LowNodeUtilization

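This strategy finds underutilized nodes in the cluster and evicts pods from overutilized nodes so that they are eventually recreated on the underutilized ones. Its parameters are configured under nodeResourceUtilizationThresholds; the two key ones are thresholds and targetThresholds.

thresholds: resource thresholds (cpu, memory, number of pods, gpu, or any other countable resource) that determine which nodes are underutilized. A node whose usage is below the threshold for every listed resource is considered underutilized. CPU, memory, and similar usage is computed from the pods' Kubernetes request values.
targetThresholds: determines which nodes are overutilized and therefore eviction sources. A node whose usage is above the target threshold for any listed resource is considered overutilized, and its pods may be evicted.

A node whose usage lies between thresholds and targetThresholds is considered appropriately utilized, and its pods are not evicted. Eviction always flows from overutilized nodes, with the pods eventually recreated on underutilized nodes. If there are no overutilized nodes or no underutilized nodes, the strategy evicts nothing.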
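As an illustration of the classification just described, here is a minimal Python sketch. It is not the Descheduler source; the function name and the shape of the usage maps are assumptions made for this example. Usage values stand for request-based percentages of node capacity.

```python
def classify_node(usage, thresholds, target_thresholds):
    """Classify a node from its request-based usage percentages.

    All three arguments map a resource name to a percentage in [0, 100].
    """
    # Underutilized: below the threshold for every listed resource.
    if all(usage[name] < limit for name, limit in thresholds.items()):
        return "underutilized"
    # Overutilized: above the target threshold for any listed resource.
    if any(usage[name] > limit for name, limit in target_thresholds.items()):
        return "overutilized"
    return "appropriately utilized"


thresholds = {"cpu": 20, "memory": 20, "pods": 20}
target_thresholds = {"cpu": 50, "memory": 50, "pods": 50}

idle = {"cpu": 10, "memory": 15, "pods": 5}
busy = {"cpu": 70, "memory": 30, "pods": 30}
normal = {"cpu": 30, "memory": 30, "pods": 30}

print(classify_node(idle, thresholds, target_thresholds))    # underutilized
print(classify_node(busy, thresholds, target_thresholds))    # overutilized
print(classify_node(normal, thresholds, target_thresholds))  # appropriately utilized
```

Pods would be evicted only from nodes classified as overutilized, and only while at least one underutilized node exists to receive them.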

Parameters:

Common parameters:

Name	Type	Description
thresholds	map[string]int	Thresholds that define the boundary of an underutilized node
targetThresholds	map[string]int	Thresholds that define the boundary of an overutilized node
numberOfNodes	int	Activation threshold intended for large clusters: when non-zero, the strategy runs only if the number of underutilized nodes is greater than numberOfNodes. Defaults to 0

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
nodeFit	bool	Same as above

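Notes:

Descheduler natively supports three resource types: cpu, memory, and number of pods. A resource type that is not specified defaults to 100%, which prevents an underutilized node from being miscounted as overutilized.
Descheduler also supports optional extended resources, such as the GPU count nvidia.com/gpu. An extended resource is included in a node's utilization only if it is specified.
thresholds and targetThresholds must not be empty, and each resource type must be specified in both or in neither.
For the same resource, the value in thresholds must be less than or equal to the value in targetThresholds.
Values in thresholds and targetThresholds are percentages and must lie within [0, 100].
numberOfNodes gates the strategy: eviction starts only when the number of underutilized nodes is greater than numberOfNodes. The default is 0.
The strategy moves pods from overutilized nodes to underutilized nodes, which evens out resource usage across the cluster.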
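The constraints in these notes can be checked mechanically. The following Python sketch is an illustration only: the function names and error messages are invented for this example and do not come from Descheduler. It validates a thresholds/targetThresholds pair and models the numberOfNodes gate.

```python
def validate_thresholds(thresholds, target_thresholds):
    """Return a list of problems; an empty list means the pair is valid."""
    if not thresholds or not target_thresholds:
        return ["thresholds and targetThresholds must not be empty"]
    if set(thresholds) != set(target_thresholds):
        return ["both maps must name exactly the same resources"]
    problems = []
    for name, low in thresholds.items():
        high = target_thresholds[name]
        if not (0 <= low <= 100 and 0 <= high <= 100):
            problems.append(f"{name}: values must lie within [0, 100]")
        elif low > high:
            problems.append(f"{name}: thresholds must not exceed targetThresholds")
    return problems


def strategy_active(underutilized_count, number_of_nodes=0):
    """The strategy runs only when underutilized nodes outnumber numberOfNodes."""
    return underutilized_count > number_of_nodes


print(validate_thresholds({"cpu": 20}, {"cpu": 50}))                # []
print(validate_thresholds({"cpu": 60}, {"cpu": 50}))                # one problem
print(validate_thresholds({"cpu": 20}, {"cpu": 50, "memory": 50}))  # one problem
print(strategy_active(3, number_of_nodes=5))                        # False
```

With the default numberOfNodes of 0, a single underutilized node is enough to activate the strategy, while zero underutilized nodes never is.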

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu" : 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu" : 50
          "memory": 50
          "pods": 50

1.3. HighNodeUtilization

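This strategy evicts pods from underutilized nodes so that they can be packed onto more heavily utilized nodes. It must be used together with the MostRequestedPriority scoring strategy of the Kubernetes scheduler, which gives highly utilized nodes a higher score when the evicted pods are rescheduled. Its parameters are configured under nodeResourceUtilizationThresholds.

The thresholds parameter determines which nodes are underutilized. Supported resource types are cpu, memory, number of pods, and extended resources. A node whose requested usage (based on Kubernetes request values) is below the threshold for every listed resource is considered underutilized. If usage of any one resource is above its threshold, the node is considered appropriately utilized and its pods are not evicted.

Note that pods are evicted from underutilized nodes and recreated on suitable, more heavily utilized nodes. If there are no underutilized nodes or no highly utilized nodes, the strategy evicts nothing.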
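A minimal Python sketch of how this strategy separates eviction sources from the remaining nodes is shown below. The function names and the node map are hypothetical and serve only as an illustration; the real implementation lives in Descheduler itself.

```python
def is_underutilized(usage, thresholds):
    """True when usage is below the threshold for every listed resource."""
    return all(usage[name] < limit for name, limit in thresholds.items())


def split_nodes(nodes, thresholds):
    """Split node names into (eviction sources, remaining nodes)."""
    sources = sorted(n for n, u in nodes.items() if is_underutilized(u, thresholds))
    targets = sorted(n for n in nodes if n not in sources)
    return sources, targets


thresholds = {"cpu": 20, "memory": 20, "pods": 20}
nodes = {
    "node-a": {"cpu": 10, "memory": 10, "pods": 5},
    "node-b": {"cpu": 60, "memory": 15, "pods": 10},  # cpu is above its threshold
    "node-c": {"cpu": 5, "memory": 18, "pods": 19},
}

sources, targets = split_nodes(nodes, thresholds)
print(sources)  # ['node-a', 'node-c']
print(targets)  # ['node-b']
```

If either list came back empty, nothing would be evicted, which matches the rule stated above.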

Parameters:

Common parameters:

Name	Type	Description
thresholds	map[string]int	Same as above: thresholds that define the boundary of an underutilized node
numberOfNodes	int	Same as above: activation threshold for the strategy in large clusters

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
nodeFit	bool	Same as above

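Notes:

Like LowNodeUtilization, this strategy natively supports cpu, memory, and number of pods. A resource type that is not specified defaults to 100%.
Extended resources (for example the GPU resource nvidia.com/gpu) are also supported; an extended resource that is not specified does not take part in the node utilization calculation.
thresholds must not be empty, and its values must lie within [0, 100].
numberOfNodes gates the strategy: eviction starts only when the number of underutilized nodes is greater than numberOfNodes. The default is 0.
The strategy moves pods from underutilized nodes to more heavily utilized nodes, so busy nodes become busier and idle nodes become emptier. This is the opposite of LowNodeUtilization.

Example: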

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
           "pods": 20

1.4. RemovePodsViolatingInterPodAntiAffinity

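This strategy ensures that pods on the same node do not violate inter-pod anti-affinity rules; pods that do are evicted. For example, if podA, podB, and podC run on the same node and podA has an anti-affinity relationship with podB and podC, the strategy evicts podA so that podB and podC can keep running on that node.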

Parameters:

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
namespaces	list	Namespaces the strategy operates on
labelSelector	LabelSelector	Standard Kubernetes label selector (see section 2.3)
nodeFit	bool	Same as above

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
     enabled: true

1.5. RemovePodsViolatingNodeAffinity

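This strategy evicts pods that violate node affinity from their node. In Kubernetes, node affinity can be declared with requiredDuringSchedulingIgnoredDuringExecution, which means the scheduler must satisfy the rule when placing the pod, but the rule is ignored while the pod is running, so the pod is not evicted if the rule stops holding. For example, podA initially satisfies the rule and is scheduled onto a node; some time later the node no longer satisfies podA's affinity rule, yet stock Kubernetes leaves podA where it is. With this strategy enabled, Descheduler evicts pods that declare requiredDuringSchedulingIgnoredDuringExecution node affinity and whose node no longer satisfies it.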

Parameters:

Common parameters:

Name	Type	Description
nodeAffinityType	list(string)	The node affinity types whose violation triggers eviction

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
namespaces	list	Two modes, include and exclude
labelSelector	LabelSelector	Standard Kubernetes label selector
nodeFit	bool	Same as above

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"

1.6. RemovePodsViolatingNodeTaints

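This strategy evicts pods that violate a node's NoSchedule taints. For example, podA carries a toleration for the taint key=value:NoSchedule and is scheduled onto a node with that taint, where it runs normally. If the node's taints are later updated or removed and podA's tolerations no longer satisfy them, stock Kubernetes does not evict the running pod. With this strategy enabled, Descheduler evicts podA.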
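The check behind this strategy can be sketched in Python as follows. This is a simplified illustration, not the Descheduler code: the matching logic is a reduced version of Kubernetes toleration semantics (operators Equal and Exists only), and the dictionary shapes are assumptions chosen to resemble the pod and node specs.

```python
def tolerates(toleration, taint):
    """Simplified toleration matching for one taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        key = toleration.get("key")
        # An empty key with Exists matches every taint key.
        return not key or key == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def violates_no_schedule(pod_tolerations, node_taints):
    """True when some NoSchedule taint on the node is not tolerated by the pod."""
    for taint in node_taints:
        if taint["effect"] != "NoSchedule":
            continue
        if not any(tolerates(t, taint) for t in pod_tolerations):
            return True
    return False


tolerations = [{"key": "key", "operator": "Equal", "value": "value", "effect": "NoSchedule"}]
original = [{"key": "key", "value": "value", "effect": "NoSchedule"}]
updated = [{"key": "key", "value": "other", "effect": "NoSchedule"}]

print(violates_no_schedule(tolerations, original))  # False
print(violates_no_schedule(tolerations, updated))   # True
```

After the taint value changes from value to other, the pod's toleration no longer matches, so the pod becomes an eviction candidate.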

Parameters:

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
namespaces	list	Same as above
labelSelector	LabelSelector	Same as above
nodeFit	bool	Same as above

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true

1.7. RemovePodsViolatingTopologySpreadConstraint

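This strategy evicts pods that violate topology spread constraints, so that pods are balanced across the topology domains of the cluster. This allows finer-grained placement and helps with disaster recovery and high availability. topologySpreadConstraints control how pods are distributed over nodes and balance pod counts across domains. The feature was introduced in Kubernetes v1.16 and became beta, enabled by default, in v1.18; see the official Kubernetes v1.18 documentation for details. The strategy therefore requires Kubernetes v1.18 or later.

Notes:

By default the strategy handles only hard constraints. To handle soft constraints as well, set includeSoftConstraints to true.
The labelSelector parameter is not used when balancing topology domains; it applies only during eviction, when deciding which pods may be evicted.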
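To make "balanced across domains" concrete, the sketch below computes the skew of a pod distribution in Python. It is an illustration under stated assumptions: skew is taken as the difference between the most and least populated domains, and it is compared with maxSkew, the field of a Kubernetes topologySpreadConstraint. The function names are invented here, and how Descheduler chooses which pods to evict is not modeled.

```python
from collections import Counter


def domain_skew(pod_domains, all_domains):
    """Difference between the most and the least populated topology domain."""
    counts = Counter({domain: 0 for domain in all_domains})
    counts.update(pod_domains)  # one increment per pod
    return max(counts.values()) - min(counts.values())


def violates_spread(pod_domains, all_domains, max_skew):
    return domain_skew(pod_domains, all_domains) > max_skew


zones = ["zone-a", "zone-b", "zone-c"]
placement = ["zone-a", "zone-a", "zone-a", "zone-b"]  # zone of each matching pod

print(domain_skew(placement, zones))                  # 3
print(violates_spread(placement, zones, max_skew=1))  # True
```

Here zone-a holds three pods and zone-c none, so the skew of 3 exceeds a maxSkew of 1 and the constraint is violated.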

Parameters:

Common parameters:

Name	Type	Description
includeSoftConstraints	bool	Disabled by default; set to `true` to enable

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
namespaces	list	Same as above
labelSelector	LabelSelector	Same as above
nodeFit	bool	Same as above

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
     enabled: true
     params:
       includeSoftConstraints: false

1.8. RemovePodsHavingTooManyRestarts

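This strategy evicts pods that have restarted too many times. Restarts have many causes: failing health checks, a volume that cannot be mounted, or a problem with the node the pod runs on. In such cases the strategy evicts the pod so that it is recreated.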

Parameters:

Common parameters:

Name	Type	Description
podRestartThreshold	int	Restart-count threshold; a pod whose restart count exceeds it is evicted
includingInitContainers	bool	Whether restarts of init containers are included in the pod's restart count

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
namespaces	list	Same as above
labelSelector	LabelSelector	Same as above
nodeFit	bool	Same as above

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
     enabled: true
     params:
       podsHavingTooManyRestarts:
         podRestartThreshold: 100
         includingInitContainers: true
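The effect of podRestartThreshold and includingInitContainers can be sketched in Python as below. The field names containerStatuses, initContainerStatuses, and restartCount come from the Kubernetes pod status; the function names are hypothetical. The comparison uses "greater than", following the wording of these notes; whether the boundary is inclusive should be checked against the Descheduler source.

```python
def total_restarts(pod_status, including_init_containers):
    """Sum container restart counts, optionally adding init containers."""
    restarts = sum(c["restartCount"] for c in pod_status.get("containerStatuses", []))
    if including_init_containers:
        restarts += sum(
            c["restartCount"] for c in pod_status.get("initContainerStatuses", [])
        )
    return restarts


def too_many_restarts(pod_status, pod_restart_threshold, including_init_containers):
    # Assumption: strictly greater than the threshold, as the text states.
    return total_restarts(pod_status, including_init_containers) > pod_restart_threshold


status = {
    "containerStatuses": [{"restartCount": 60}],
    "initContainerStatuses": [{"restartCount": 50}],
}

print(too_many_restarts(status, 100, including_init_containers=True))   # True
print(too_many_restarts(status, 100, including_init_containers=False))  # False
```

The same pod crosses the threshold of 100 only when the 50 init container restarts are counted.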

1.9. PodLifeTime

This strategy evicts pods that have been around for too long: a pod is evicted when its age exceeds the configured threshold. Only pods in the Running or Pending phase can currently be evicted.

Parameters:

Common parameters:

Name	Type	Description
maxPodLifeTimeSeconds	int	Age threshold in seconds; pods older than this are evicted
podStatusPhases	list	Phases of pods that may be evicted; only Running and Pending are currently supported

Filter parameters:

Name	Type	Description
thresholdPriority	int	Same as above
thresholdPriorityClassName	string	Same as above
namespaces	list	Same as above
labelSelector	LabelSelector	Same as above

Example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
       podLifeTime:
         maxPodLifeTimeSeconds: 86400
         podStatusPhases:
         - "Pending"
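The policy above evicts Pending pods older than one day. A minimal Python sketch of that decision follows; the function name and the pod dictionary keys (`phase`, `creation_timestamp`) are assumptions for this illustration, not Descheduler code.

```python
from datetime import datetime, timedelta, timezone


def exceeds_lifetime(pod, max_pod_lifetime_seconds,
                     pod_status_phases=("Running", "Pending"), now=None):
    """True when the pod is in an eligible phase and older than the limit."""
    now = now or datetime.now(timezone.utc)
    if pod["phase"] not in pod_status_phases:
        return False
    age_seconds = (now - pod["creation_timestamp"]).total_seconds()
    return age_seconds > max_pod_lifetime_seconds


now = datetime(2021, 6, 2, tzinfo=timezone.utc)
old_pending = {"phase": "Pending", "creation_timestamp": now - timedelta(days=2)}
young_pending = {"phase": "Pending", "creation_timestamp": now - timedelta(hours=1)}
old_running = {"phase": "Running", "creation_timestamp": now - timedelta(days=2)}

print(exceeds_lifetime(old_pending, 86400, ("Pending",), now))    # True
print(exceeds_lifetime(young_pending, 86400, ("Pending",), now))  # False
print(exceeds_lifetime(old_running, 86400, ("Pending",), now))    # False
```

The two-day-old Running pod is skipped because podStatusPhases lists only Pending.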

2. Pod Filter Parameters in Detail

2.1. Namespace Filtering

Strategies can use the namespaces parameter to decide in which namespaces they take effect. There are two modes: include applies the strategy only to the listed namespaces, and exclude applies it to every namespace except the listed ones. The following strategies support namespace filtering:

PodLifeTime
RemovePodsHavingTooManyRestarts
RemovePodsViolatingNodeTaints
RemovePodsViolatingNodeAffinity
RemovePodsViolatingInterPodAntiAffinity
RemoveDuplicates
RemovePodsViolatingTopologySpreadConstraint

For example:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
        podLifeTime:
          maxPodLifeTimeSeconds: 86400
        namespaces:
          include:
          - "namespace1"
          - "namespace2"

Here the PodLifeTime strategy takes effect only in namespace1 and namespace2.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
     enabled: true
     params:
        podLifeTime:
          maxPodLifeTimeSeconds: 86400
        namespaces:
          exclude:
          - "namespace1"
          - "namespace2"

Here the PodLifeTime strategy takes effect in every namespace except namespace1 and namespace2.
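The two filtering modes can be summarized in a short Python sketch. The function name is hypothetical, and rejecting a policy that sets both include and exclude is an assumption made for this example rather than something stated in these notes.

```python
def namespace_allowed(namespace, include=None, exclude=None):
    """Decide whether a strategy applies to pods in `namespace`."""
    if include and exclude:
        # Assumption: the two modes are mutually exclusive.
        raise ValueError("use either include or exclude, not both")
    if include:
        return namespace in include
    if exclude:
        return namespace not in exclude
    return True  # no filter configured: every namespace is eligible


print(namespace_allowed("namespace1", include=["namespace1", "namespace2"]))  # True
print(namespace_allowed("default", include=["namespace1", "namespace2"]))     # False
print(namespace_allowed("namespace1", exclude=["namespace1", "namespace2"]))  # False
print(namespace_allowed("default", exclude=["namespace1", "namespace2"]))     # True
```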

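2.2. Priority Filtering

Every strategy can be given a priority filter, expressed as a priority threshold: only pods whose priority is lower than the threshold can be evicted. By default the threshold is the value of the system-cluster-critical PriorityClass. There are two ways to set it:

thresholdPriority: an int that sets the threshold directly.
thresholdPriorityClassName: the name of a Kubernetes PriorityClass, whose value is used as the threshold. If the named PriorityClass does not exist in the cluster, an error is raised.
Note: only one of the two may be used; a strategy cannot specify both.
Note: if evictSystemCriticalPods is true, system-critical pods are evicted and every priority filter is ignored.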
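How the threshold is resolved and applied can be sketched in Python as follows. This is an illustration, not Descheduler code: the function names and the `priority_classes` map (standing in for the PriorityClass objects of a cluster) are hypothetical. The value 2000000000 is the priority of the built-in system-cluster-critical PriorityClass.

```python
SYSTEM_CLUSTER_CRITICAL = 2_000_000_000  # value of the built-in PriorityClass


def resolve_threshold(threshold_priority=None, threshold_priority_class_name=None,
                      priority_classes=None):
    """Return the numeric threshold for a strategy's priority filter."""
    if threshold_priority is not None and threshold_priority_class_name is not None:
        raise ValueError("set only one of thresholdPriority and thresholdPriorityClassName")
    if threshold_priority_class_name is not None:
        classes = priority_classes or {}
        if threshold_priority_class_name not in classes:
            raise ValueError(f"PriorityClass {threshold_priority_class_name!r} not found")
        return classes[threshold_priority_class_name]
    if threshold_priority is not None:
        return threshold_priority
    return SYSTEM_CLUSTER_CRITICAL


def passes_priority_filter(pod_priority, threshold):
    """Only pods with a priority below the threshold can be evicted."""
    return pod_priority < threshold


threshold = resolve_threshold(threshold_priority=10000)
print(passes_priority_filter(500, threshold))    # True
print(passes_priority_filter(10000, threshold))  # False
print(resolve_threshold())                       # 2000000000
```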

Example

Using thresholdPriority:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
        podLifeTime:
          maxPodLifeTimeSeconds: 86400
        thresholdPriority: 10000

Using thresholdPriorityClassName:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
        podLifeTime:
          maxPodLifeTimeSeconds: 86400
        thresholdPriorityClassName: "priorityclass1"

2.3. Label Filtering

The following strategies support the standard Kubernetes label selector (labelSelector) for choosing which pods to evict.

PodLifeTime
RemovePodsHavingTooManyRestarts
RemovePodsViolatingNodeTaints
RemovePodsViolatingNodeAffinity
RemovePodsViolatingInterPodAntiAffinity
RemovePodsViolatingTopologySpreadConstraint

Example

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      labelSelector:
        matchLabels:
          component: redis
        matchExpressions:
          - {key: tier, operator: In, values: [cache]}
          - {key: environment, operator: NotIn, values: [dev]}

2.4. Node Fit Filtering

The following strategies support the nodeFit filter, which refines eviction decisions.

RemoveDuplicates
LowNodeUtilization
HighNodeUtilization
RemovePodsViolatingInterPodAntiAffinity
RemovePodsViolatingNodeAffinity
RemovePodsViolatingNodeTaints
RemovePodsViolatingTopologySpreadConstraint
RemovePodsHavingTooManyRestarts

If nodeFit is true, Descheduler checks, before evicting a pod that meets the eviction criteria, whether the pod could be rescheduled onto another node. If it cannot, the pod is not evicted. The following criteria are currently considered:

The nodeSelector on the pod
The tolerations on the pod and the taints on the other nodes
The nodeAffinity on the pod
Whether any of the other nodes is marked unschedulable

Note: nodeFit filtering is based on the pod's current spec, not on that of its owner. If the owner (for example an RC) was modified but the pod itself was not, nodeFit evaluates the pod's outdated spec. Descheduler accepts this because it is a best-effort rebalancing mechanism. To keep pods in sync with their owner, use a Deployment instead of an RC: a Deployment rolls spec changes out to its pods automatically, so Descheduler sees up-to-date information.
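A reduced Python sketch of the nodeFit check is given below. It covers only three of the four criteria (nodeSelector, taints versus tolerations, and the unschedulable flag); nodeAffinity is left out to keep the example short, and taints are matched by key and effect only. All names and dictionary shapes are assumptions for this illustration.

```python
def node_fits(pod, node):
    """Simplified fit check: unschedulable flag, nodeSelector, and taints."""
    if node.get("unschedulable"):
        return False
    labels = node.get("labels", {})
    for key, value in pod.get("nodeSelector", {}).items():
        if labels.get(key) != value:
            return False
    tolerated = {(t["key"], t["effect"]) for t in pod.get("tolerations", [])}
    for taint in node.get("taints", []):
        if (taint["key"], taint["effect"]) not in tolerated:
            return False
    return True


def fits_any_other_node(pod, nodes):
    """True when at least one node other than the pod's current one fits."""
    return any(node_fits(pod, n) for n in nodes if n["name"] != pod["node"])


pod = {"node": "node-a", "nodeSelector": {"disk": "ssd"}, "tolerations": []}
nodes = [
    {"name": "node-a", "labels": {"disk": "ssd"}},
    {"name": "node-b", "labels": {"disk": "hdd"}},
    {"name": "node-c", "labels": {"disk": "ssd"}, "unschedulable": True},
]

print(fits_any_other_node(pod, nodes))  # False
nodes.append({"name": "node-d", "labels": {"disk": "ssd"}})
print(fits_any_other_node(pod, nodes))  # True
```

With nodeFit enabled, the pod would be left alone in the first case, because neither node-b nor node-c could take it.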

3. Pod Eviction Rules

Descheduler obeys the following rules when evicting a pod from a node:

Critical pods: pods whose priorityClassName is system-cluster-critical or system-node-critical are never evicted unless evictSystemCriticalPods is set to true.
Static pods, mirror pods, and standalone pods that are not managed by a ReplicationController, ReplicaSet (Deployment), StatefulSet, or Job are never evicted, because nothing would recreate them.
Pods managed by a DaemonSet are never evicted.
Pods that use local storage are not evicted unless evictLocalStoragePods is set to true.
Pods that use PVCs can be evicted by default; set ignorePvcPods to true to protect them.
In LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity, pods are evicted in order of priority, lowest first. Among pods of equal priority, eviction follows the Kubernetes QoS class: BestEffort first, then Burstable, then Guaranteed.
A pod annotated with descheduler.alpha.kubernetes.io/evict is eligible for eviction. Users can apply this annotation to select specific pods for eviction.
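The rules above can be condensed into a small Python sketch. It is a simplified model, not the Descheduler implementation: the pod dictionary keys (`owner_kind`, `critical`, `local_storage`, `pvc`, `priority`, `qos`) are hypothetical, and treating the annotation as overriding the other checks is an assumption of this example.

```python
EVICT_ANNOTATION = "descheduler.alpha.kubernetes.io/evict"
QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


def is_evictable(pod, evict_local_storage_pods=False,
                 evict_system_critical_pods=False, ignore_pvc_pods=False):
    """Apply the eviction rules to one pod (simplified)."""
    if EVICT_ANNOTATION in pod.get("annotations", {}):
        return True  # assumption: the annotation overrides the checks below
    if pod.get("owner_kind") is None:
        return False  # static, mirror, or standalone pod: nothing recreates it
    if pod["owner_kind"] == "DaemonSet":
        return False
    if pod.get("critical") and not evict_system_critical_pods:
        return False
    if pod.get("local_storage") and not evict_local_storage_pods:
        return False
    if pod.get("pvc") and ignore_pvc_pods:
        return False
    return True


def eviction_order(pods):
    """Lowest priority first; ties broken by QoS class."""
    return sorted(pods, key=lambda p: (p["priority"], QOS_ORDER[p["qos"]]))


pods = [
    {"name": "a", "owner_kind": "ReplicaSet", "priority": 100, "qos": "Guaranteed"},
    {"name": "b", "owner_kind": "ReplicaSet", "priority": 100, "qos": "BestEffort"},
    {"name": "c", "owner_kind": "ReplicaSet", "priority": 10, "qos": "Burstable"},
    {"name": "d", "owner_kind": "DaemonSet", "priority": 10, "qos": "BestEffort"},
]

candidates = [p for p in pods if is_evictable(p)]
print([p["name"] for p in eviction_order(candidates)])  # ['c', 'b', 'a']
```

Pod d is filtered out as a DaemonSet pod; c goes first because of its lower priority, and b precedes a because BestEffort is evicted before Guaranteed.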

4. Miscellaneous

With --v=4 or higher, the Descheduler log includes the reason why each non-evictable pod cannot be evicted.
Eviction metrics are served at https://localhost:10258/metrics. The address can be changed with --binding-address and the HTTPS secure port with --secure-port.

Name	Type	Description
build_info	gauge	constant 1
pods_evicted	CounterVec	total number of pods evicted

References

https://github.com/kubernetes-sigs/descheduler
