Taints

A taint is a marker applied to a node that prevents Pods which do not meet its conditions from being scheduled onto that node.

Taint Structure

Each taint consists of three parts:

  • Key: the name that identifies the taint
  • Value: the taint's value
  • Effect: the taint effect, which determines scheduling behavior

Effect             Meaning
NoSchedule         New Pods are not scheduled (existing Pods are unaffected)
PreferNoSchedule   The scheduler tries to avoid the node (not enforced)
NoExecute          New Pods are not scheduled, and running Pods without a matching toleration are evicted (tolerationSeconds allows delayed eviction)

Command Examples

# Add a taint
kubectl taint nodes node1 gpu=true:NoSchedule

# Remove a taint
kubectl taint nodes node1 gpu=true:NoSchedule-

# View taints
root@k8s-master:~# kubectl describe nodes k8s-master | grep Taints
Taints: node-role.kubernetes.io/control-plane:NoSchedule

# Note: node-role.kubernetes.io/control-plane:NoSchedule
# has only a key and an effect; the value is empty

Use Cases

  • Isolate nodes with dedicated hardware (e.g. GPUs, high-performance storage)
  • Evict business Pods from a node before maintenance (see the sketch after this list)
  • Protect nodes holding sensitive data (only Pods that tolerate the taint can schedule there)
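
For example, draining business Pods off a node for maintenance can be done with a NoExecute taint; a minimal sketch, where the node name and the maintenance key are illustrative, not from the source:

# Evicts every Pod on node1 that does not tolerate this taint
kubectl taint nodes node1 maintenance=true:NoExecute

# Remove the taint once maintenance is finished
kubectl taint nodes node1 maintenance=true:NoExecute-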

Tolerations

A toleration is an attribute defined on a Pod that lets the Pod ignore matching node taints and thus be scheduled onto those nodes.

Toleration Configuration

# pod.spec.
tolerations:
- key: "gpu"                # Key of the taint to match
  operator: "Equal"         # Operator: Equal (exact match) or Exists (key presence is enough)
  value: "true"             # Value of the taint to match (required when operator=Equal)
  effect: "NoSchedule"      # Effect of the taint to match
  tolerationSeconds: 3600   # How long (in seconds) to tolerate a NoExecute taint before eviction

Key Rules

  • A Pod can define multiple tolerations. To be scheduled onto a node, the Pod must tolerate every NoSchedule taint on that node; any untolerated NoSchedule taint blocks scheduling.

  • With operator: Exists, no value is specified (only the presence of the key is checked).
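
As a sketch, a toleration using Exists that matches any gpu taint regardless of its value (the key and effect here are illustrative):

tolerations:
- key: "gpu"
  operator: "Exists"     # matches gpu=<anything>:NoSchedule
  effect: "NoSchedule"   # omit effect as well to match gpu taints of any effect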

Typical Scenarios

  • Allow AI training jobs to be scheduled onto GPU nodes
  • Let system components (e.g. kube-proxy) tolerate the control-plane node taint
  • Temporarily tolerate NoExecute taints during maintenance windows

Affinity

Affinity comes in two kinds: it either steers a Pod toward nodes that satisfy certain rules, or places it relative to other Pods.

Node Affinity

Controls how Pods are matched to nodes.

Node affinity relies on node labels:

# Create a label (to update an existing label, add --overwrite)
root@k8s-master:~# kubectl label nodes k8s-node-1 k8s.io/role=node
node/k8s-node-1 labeled

# View labels
root@k8s-master:~# kubectl get nodes k8s-node-1 --show-labels
NAME         STATUS   ROLES    AGE   VERSION    LABELS
k8s-node-1   Ready    <none>   18d   v1.29.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node-1,kubernetes.io/os=linux

# Delete a label
root@k8s-master:~# kubectl label nodes k8s-node-1 k8s.io/role-
node/k8s-node-1 unlabeled
  • Hard affinity (required): conditions that must be satisfied
# pod.spec.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values: [ssd]
  • Soft affinity (preferred): preferred but not mandatory
# pod.spec.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100        # priority weight (1-100)
      preference:
        matchExpressions:
        - key: zone
          operator: In
          values: [us-east]

Pod Affinity and Anti-Affinity

  • Affinity (podAffinity): schedule Pods into the same topology domain (e.g. the same node or availability zone)
  • Anti-affinity (podAntiAffinity): keep Pods out of the same topology domain (improves availability)
# pod.spec.affinity.podAffinity
# pod.spec.affinity.podAntiAffinity

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: database
    topologyKey: kubernetes.io/hostname   # isolate per node
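
The podAffinity form is symmetrical; as a sketch, this co-locates a Pod in the same zone as Pods labeled app: web (the label and topology key are illustrative):

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: web
    topologyKey: topology.kubernetes.io/zone   # co-locate within the same zone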

Example

apiVersion: v1
kind: Pod
metadata:
  name: toleration-test
spec:
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Equal
    value: ''
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - k8s-master
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command:   # equivalent to ["sh", "-c", "sleep 1000"]
    - sh
    - -c
    - sleep 1000



root@k8s-master:~# kubectl get pods -o wide
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
toleration-test   1/1     Running   0          7s    10.244.0.47   k8s-master   <none>           <none>

HPA

By default, Kubernetes horizontal autoscaling (HPA) only supports scaling on CPU and memory.

To scale on other metrics through the metrics APIs, you need a metrics collection system such as Prometheus. However, the metrics Prometheus collects are not directly compatible with the Kubernetes API, so a middleware is required: Prometheus Adapter.

Kubernetes API server metrics API <-> Prometheus Adapter <-> Prometheus metrics API

  • HPA v1: supports CPU and memory only
  • HPA v2: supports custom metrics for autoscaling

Installing Metrics-server

K8S 1.29.2 metrics
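
Once metrics-server is up, a quick sanity check (standard commands, assuming a working install):

# The resource metrics API should respond
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl top nodes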

Installing Prometheus Adapter

wsq1203/prom-k8s (github.com)

git clone https://github.com/wsq1203/prom-k8s.git
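
After deploying the adapter, verify that the custom metrics API is being served (a standard check, assuming a working install):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | head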

HPA Examples

Based on CPU and memory

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multi-metrics-hpa          # HPA name
  namespace: default               # namespace (change as needed)
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment               # target kind (Deployment/StatefulSet supported)
    name: your-app                 # target Deployment name
  minReplicas: 2                   # minimum replicas (>= 2 recommended to avoid a single point of failure)
  maxReplicas: 10                  # maximum replicas (caps resource consumption)
  metrics:                         # multiple trigger conditions
  - type: Resource                 # resource metric
    resource:
      name: cpu                    # CPU metric
      target:
        type: Utilization          # utilization mode
        averageUtilization: 70     # scale out when CPU utilization exceeds 70%
  - type: Resource                 # memory metric
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80     # scale out when memory utilization exceeds 80%
  behavior:                        # scaling behavior control (avoids thrashing)
    scaleDown:                     # scale-down policy
      stabilizationWindowSeconds: 300   # scale-down stabilization window (default 5 minutes)
      policies:
      - type: Percent
        value: 10                  # remove at most 10% of replicas per period
        periodSeconds: 60          # policy evaluation period (required by the API)
    scaleUp:                       # scale-up policy
      stabilizationWindowSeconds: 60    # scale-up stabilization window (default 0 seconds)
      policies:
      - type: Percent
        value: 100                 # add at most 100% more replicas per period (fast response)
        periodSeconds: 60          # policy evaluation period (required by the API)
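
To try it out (assuming the manifest is saved as multi-metrics-hpa.yaml and a Deployment named your-app exists):

kubectl apply -f multi-metrics-hpa.yaml
kubectl get hpa multi-metrics-hpa --watch    # watch TARGETS and REPLICAS change under load
kubectl describe hpa multi-metrics-hpa       # shows current metric values and scaling events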

Based on requests per second (metric exposed via the adapter's ConfigMap rules)

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
metadata:
  name: metrics-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metrics-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 5
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120
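
For http_requests_per_second to exist as a Pod metric, the Prometheus Adapter needs a rule that turns a Prometheus counter into a per-second rate. A minimal sketch of such a rule (the series name http_requests_total and the 2m rate window are assumptions; the real rule lives in the adapter's ConfigMap, e.g. in the repository cloned above):

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # counter assumed to be exported by the app
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"        # exposes http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'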