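Introduction

Kubernetes (K8s for short) is the leading container orchestration platform and has become an essential part of modern cloud architecture. As clusters grow, real-time visibility into their state becomes a central concern for operations teams. This article walks through practical methods for monitoring a K8s cluster in real time, so that operators can keep the cluster stable and respond to problems quickly.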

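Why K8s Cluster Monitoring Matters

1. System Stability

Real-time monitoring lets operators detect anomalies in the cluster, such as node failures or resource shortages, as soon as they occur and take corrective action before stability is affected.

2. Performance Optimization

Monitoring data shows how the cluster is actually behaving, which makes it possible to identify performance bottlenecks, tune resource allocation, and improve overall cluster performance.

3. Troubleshooting

When something goes wrong, real-time monitoring data helps operators locate the root cause quickly and shortens recovery time.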

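Real-Time Monitoring Methods for K8s Clusters

1. Prometheus-Based Monitoring

Prometheus is an open-source monitoring and alerting tool that integrates well with K8s. The steps below show how to use it to monitor a K8s cluster: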

1.1 Installing Prometheus

    # Download Prometheus
    wget https://github.com/prometheus/prometheus/releases/download/v2.25.0/prometheus-2.25.0.linux-amd64.tar.gz

    # Extract the archive and start Prometheus
    tar -xvf prometheus-2.25.0.linux-amd64.tar.gz
    cd prometheus-2.25.0.linux-amd64
    ./prometheus --config.file=prometheus.yml

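1.2 Configuring Prometheus

Edit prometheus.yml and add the following. The CA certificate and token paths are the ones mounted into a pod from its ServiceAccount, so this configuration assumes Prometheus runs inside the cluster:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'kubernetes-apiserver'
        kubernetes_sd_configs:
          # The API server is published through the "kubernetes" endpoints object,
          # so discover endpoints rather than pods.
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          # Keep only the https port of the default/kubernetes service.
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https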
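Beyond the API server, a common next step is a job that discovers application pods. The fragment below is a sketch of one widely used approach, which keeps only pods that carry a prometheus.io/scrape: "true" annotation. The annotation names are a community convention rather than something Prometheus itself defines, and the job name kubernetes-pods is an assumption made for illustration. The entry would sit under scrape_configs in prometheus.yml.

```yaml
# Hypothetical additional entry for scrape_configs in prometheus.yml.
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Scrape only pods annotated with prometheus.io/scrape: "true".
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Honor an optional prometheus.io/path annotation.
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      regex: (.+)
      target_label: __metrics_path__
    # Honor an optional prometheus.io/port annotation.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    # Carry the namespace and pod name onto every scraped series.
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod
```

Filtering on an annotation keeps the target list opt-in, which tends to matter on large clusters where most pods expose no metrics endpoint at all.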

1.3 Installing Prometheus Operator

    # Download the Prometheus Operator v0.45.0 source archive
    wget https://github.com/prometheus-operator/prometheus-operator/archive/refs/tags/v0.45.0.tar.gz -O prometheus-operator-0.45.0.tar.gz

    # Extract it and deploy the operator from bundle.yaml in the repository root,
    # which combines the CRDs, the RBAC objects, and the operator Deployment
    tar -xvf prometheus-operator-0.45.0.tar.gz
    cd prometheus-operator-0.45.0
    kubectl apply -f bundle.yaml
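Once the operator is running, scrape targets are usually declared as ServiceMonitor objects rather than edited into prometheus.yml by hand. A minimal sketch, assuming a Service labeled app: node-exporter that exposes a port named metrics (the object name, both labels, and the port name are hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter          # hypothetical name
  namespace: default
  labels:
    team: ops                  # hypothetical label, matched by a Prometheus resource's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: node-exporter       # assumes the target Service carries this label
  endpoints:
    - port: metrics            # the name of the Service port, not its number
      interval: 15s
```

The operator watches these objects and regenerates the Prometheus scrape configuration whenever one is added or changed.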

1.4 Creating Monitoring Targets

Create a ConfigMap named k8s-node-exporter that holds the scrape configuration for node-exporter:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: k8s-node-exporter
    data:
      k8s-node-exporter.yml: |
        global:
          scrape_interval: 15s
        scrape_configs:
          - job_name: 'k8s-node-exporter'
            static_configs:
              - targets:
                  - 'localhost:9100'
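A static localhost:9100 target only resolves if node-exporter is listening on the same host as Prometheus. On a cluster, node-exporter is more commonly run on every node as a DaemonSet. The manifest below is a sketch of that arrangement, not part of the original setup; the image tag, the object names, and the app: node-exporter label are assumptions.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter          # hypothetical name
  namespace: default
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true        # serve :9100 on the node's own address
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.1.2   # assumed tag
          args:
            - --path.rootfs=/host   # read host filesystem stats through the mount below
          ports:
            - name: metrics
              containerPort: 9100
          volumeMounts:
            - name: root
              mountPath: /host
              readOnly: true
      volumes:
        - name: root
          hostPath:
            path: /
```

With one exporter per node, the static target list would normally give way to node-based service discovery so that new nodes are picked up automatically.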

Create a ConfigMap named k8s-pod-exporter that holds the scrape configuration for pod-exporter:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: k8s-pod-exporter
    data:
      k8s-pod-exporter.yml: |
        global:
          scrape_interval: 15s
        scrape_configs:
          - job_name: 'k8s-pod-exporter'
            static_configs:
              - targets:
                  - 'localhost:9100'
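The container_* metrics that describe pod resource usage are exposed by cAdvisor, which is built into the kubelet, rather than by a process on localhost:9100. The fragment below sketches a scrape job that collects them from every node. It assumes Prometheus runs in-cluster and that its ServiceAccount is allowed to read nodes/metrics; the job name reuses k8s-pod-exporter so that rules filtering on that job label still match.

```yaml
# Hypothetical entry for scrape_configs in prometheus.yml.
- job_name: 'k8s-pod-exporter'
  scheme: https
  metrics_path: /metrics/cadvisor
  kubernetes_sd_configs:
    - role: node               # one target per node, addressed at the kubelet port
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # Some clusters serve self-signed kubelet certificates; in that case
    # insecure_skip_verify: true is needed here instead of the CA check.
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    # Copy node labels onto the scraped series.
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
```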

Create a ServiceAccount named k8s-service-account, the identity that the permissions below are granted to:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: k8s-service-account
      namespace: default

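Create a ClusterRole named k8s-clusterrole that grants read-only access to the resources Prometheus needs to discover and scrape:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: k8s-clusterrole
    rules:
      - apiGroups:
          - ""
        resources:
          - nodes
          - nodes/status
          - nodes/metrics
          - services
          - endpoints
          - pods
          - configmaps
        verbs:
          - get
          - list
          - watch
      - apiGroups:
          - "extensions"
          - "networking.k8s.io"   # Ingress lives in this group on current Kubernetes versions
        resources:
          - ingresses
        verbs:
          - get
          - list
          - watch
      - apiGroups:
          - "apps"
        resources:
          - deployments
          - replicasets
          - statefulsets
        verbs:
          - get
          - list
          - watch
      # Allows scraping the API server's own /metrics endpoint.
      - nonResourceURLs:
          - /metrics
        verbs:
          - get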

Create a ClusterRoleBinding named k8s-clusterrolebinding that binds the ClusterRole to the ServiceAccount:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: k8s-clusterrolebinding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: k8s-clusterrole
    subjects:
      - kind: ServiceAccount
        name: k8s-service-account
        namespace: default
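The RBAC objects take effect only once a Prometheus instance runs under the bound ServiceAccount. With Prometheus Operator that instance is declared as a Prometheus custom resource. A minimal sketch, in which the resource name k8s and the memory request are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                    # hypothetical name
  namespace: default           # must match the namespace of the ClusterRoleBinding subject
spec:
  replicas: 1
  serviceAccountName: k8s-service-account
  serviceMonitorSelector: {}   # empty selector: pick up every ServiceMonitor in this namespace
  ruleSelector: {}             # empty selector: pick up every PrometheusRule in this namespace
  resources:
    requests:
      memory: 400Mi            # assumed starting point; size to the cluster
```

An empty selector matches everything, whereas omitting the field selects nothing, so the two `{}` values are deliberate.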

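Create a ConfigMap named k8s-prometheus-rules that holds the Prometheus alerting rules. HighCPUUsage fires when average container CPU usage exceeds 80% of one core, and HighMemoryUsage fires when a container uses more than 80% of its memory limit:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: k8s-prometheus-rules
    data:
      prometheus-rules.yml: |
        groups:
          - name: k8s-rules
            rules:
              - alert: HighCPUUsage
                # rate() of the CPU counter is cores in use; * 100 expresses it as a percentage of one core.
                expr: avg by (job) (rate(container_cpu_usage_seconds_total{job="k8s-pod-exporter"}[5m])) * 100 > 80
                for: 1m
                labels:
                  severity: "page"
              - alert: HighMemoryUsage
                # container_memory_usage_bytes is a gauge, so compare it with the limit instead of taking a rate.
                # The "> 0" filter drops containers that have no memory limit set.
                expr: container_memory_usage_bytes{job="k8s-pod-exporter"} / (container_spec_memory_limit_bytes{job="k8s-pod-exporter"} > 0) * 100 > 80
                for: 1m
                labels:
                  severity: "page"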
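When Prometheus is managed by Prometheus Operator, rules are normally delivered as PrometheusRule objects, which the operator validates and mounts into Prometheus, rather than as hand-made ConfigMaps. The sketch below shows the shape of such an object with a simple target-down alert. The alert itself, the object name, and the role: alert-rules label are assumptions for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-rules              # hypothetical name
  namespace: default
  labels:
    role: alert-rules          # hypothetical label; must satisfy the Prometheus resource's ruleSelector
spec:
  groups:
    - name: k8s-rules
      rules:
        - alert: TargetDown
          # up is 0 whenever the most recent scrape of a target failed.
          expr: up == 0
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```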

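Create a ConfigMap named k8s-prometheus-alertmanager that holds the Alertmanager global settings and routing tree. Both alerts are routed to the default receiver, which has to be defined in the same alertmanager.yml for Alertmanager to load the file:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: k8s-prometheus-alertmanager
    data:
      alertmanager.yml: |
        global:
          resolve_timeout: 5m
        route:
          receiver: 'default'
          group_by: ['alertname']
          repeat_interval: 1h
          routes:
            - match:
                alertname: HighCPUUsage
              receiver: 'default'
            - match:
                alertname: HighMemoryUsage
              receiver: 'default'
        receivers:
          # Placeholder so that the file loads; it sends no notifications by itself.
          - name: 'default'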

Create a ConfigMap named k8s-prometheus-alertmanager-receiver that defines the default receiver and sends notifications to an email address. Alertmanager reads a single configuration file, so the routing tree is repeated alongside the receiver:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: k8s-prometheus-alertmanager-receiver
    data:
      alertmanager-receiver.yml: |
        receivers:
          - name: 'default'
            # Email delivery also needs SMTP settings (smtp_smarthost, smtp_from) under global.
            email_configs:
              - to: 'admin@example.com'
        route:
          receiver: 'default'
          group_by: ['alertname']
          repeat_interval: 1h
          routes:
            - match:
                alertname: HighCPUUsage
              receiver: 'default'
            - match:
                alertname: HighMemoryUsage
              receiver: 'default'
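An email_configs entry cannot deliver anything until Alertmanager knows which SMTP server to use. The fragment below sketches the global section that would accompany the receiver in the same alertmanager.yml. Every host name, address, and credential in it is a placeholder.

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'          # placeholder host and port
  smtp_from: 'alertmanager@example.com'           # placeholder sender address
  smtp_auth_username: 'alertmanager@example.com'  # placeholder account
  smtp_auth_password: 'REPLACE_ME'                # placeholder; keep real credentials in a Secret
  smtp_require_tls: true
```

Settings placed under global apply to every email receiver, and an individual email_configs entry can still override them.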

The remaining Alertmanager ConfigMaps carry the same two payloads under different names and data keys:

k8s-prometheus-alertmanager-route (key alertmanager-route.yml): the routing tree only.
k8s-prometheus-alertmanager-route-receiver (key alertmanager-route-receiver.yml): the default email receiver plus the routing tree.
k8s-prometheus-alertmanager-route-templates (key alertmanager-route-templates.yml): the routing tree only.
k8s-prometheus-alertmanager-route-templates-receiver (key alertmanager-route-templates-receiver.yml): the default email receiver plus the routing tree.
k8s-prometheus-alertmanager-route-templates-receiver-email (key alertmanager-route-templates-receiver-email.yml): the default email receiver plus the routing tree.

Their contents match the routing tree in k8s-prometheus-alertmanager and the receiver definition in k8s-prometheus-alertmanager-receiver. Because Alertmanager loads one configuration file, a single ConfigMap that carries both sections is enough in practice.

Create a ConfigMap named k8s-prometheus-alertmanager-route-templates-receiver-email-html that gives the email receiver an HTML body. The template uses the fields Alertmanager exposes to notification templates: the group-level .Status, .GroupLabels, .CommonLabels, and .CommonAnnotations, and for each entry in .Alerts its .Labels, .Annotations, .StartsAt, .EndsAt, .GeneratorURL, and .Fingerprint:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: k8s-prometheus-alertmanager-route-templates-receiver-email-html
    data:
      alertmanager-route-templates-receiver-email-html.yml: |
        # Merge this receiver into alertmanager.yml; it is not a complete configuration by itself.
        receivers:
          - name: 'default'
            email_configs:
              - to: 'admin@example.com'
                html: |
                  <html>
                    <head>
                      <title>Alertmanager Notification</title>
                    </head>
                    <body>
                      <h1>Alertmanager Notification</h1>
                      <p>Status: {{ .Status }}</p>
                      <p>Group Labels: {{ range .GroupLabels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}</p>
                      <p>Common Labels: {{ range .CommonLabels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}</p>
                      <p>Common Annotations: {{ range .CommonAnnotations.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}</p>
                      <h2>Alerts:</h2>
                      <ul>
                      {{- range .Alerts }}
                        <li>
                          <p>Alert: {{ .Labels.alertname }}</p>
                          <p>Severity: {{ .Labels.severity }}</p>
                          <p>Status: {{ .Status }}</p>
                          <p>Starts at: {{ .StartsAt }}</p>
                          {{- if eq .Status "resolved" }}
                          <p>Ends at: {{ .EndsAt }}</p>
                          {{- end }}
                          <p>Generator URL: {{ .GeneratorURL }}</p>
                          <p>Fingerprint: {{ .Fingerprint }}</p>
                          <p>Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}</p>
                          <p>Annotations: {{ range .Annotations.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}</p>
                        </li>
                      {{- end }}
                      </ul>
                    </body>
                  </html>