
k8s Cluster Monitoring -- Prometheus Operator

2019.01.22

Monitoring is indispensable for any system: what state the system is in right now, and whether resources need adjusting compared with a week ago, all require a monitoring system for reference. For k8s, the commonly used open-source monitoring options boil down to the following:

  1. Heapster + InfluxDB (deprecated)
  2. Prometheus
  3. Open-Falcon
  4. Telegraf + InfluxDB

Moreover, k8s has quite a few endpoints that need to be monitored, as the following table shows (k8s 1.14.0):

Component             Metrics endpoint (port/path)
API Server            6443/metrics
Controller Manager    10257/metrics
Kube Scheduler        10259/metrics
Etcd                  2379/metrics
Kubelet API           10250/metrics
cAdvisor              10250/metrics/cadvisor
Node (node_exporter)  9100/metrics

So for monitoring the cluster comprehensively, I personally feel Prometheus is the best fit, but with plain Prometheus every rule change requires a restart to take effect. Cloud native also favors the declarative over the imperative, and so Prometheus Operator was born.

Install

My recommended approach is to apply the pile of YAML manifests from the repo; if you are comfortable with helm, that works too.

git clone https://github.com/coreos/prometheus-operator
# apply the bundled manifests (the path may differ between releases)
kubectl apply -f prometheus-operator/contrib/kube-prometheus/manifests/
kubectl get crd
NAME                                    CREATED AT
alertmanagers.monitoring.coreos.com     2018-09-12T02:13:19Z
prometheuses.monitoring.coreos.com      2018-09-12T02:13:19Z
prometheusrules.monitoring.coreos.com   2018-09-12T02:13:20Z
servicemonitors.monitoring.coreos.com   2018-09-12T02:13:20Z

This creates four CRDs: Alertmanager instances, Prometheus instances, alerting rules, and ServiceMonitors describing what Prometheus should scrape.
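To get a feel for the declarative style, here is a minimal sketch of a Prometheus custom resource (the names and values are illustrative, not taken from the official manifests). The Operator watches such objects, regenerates the Prometheus configuration, and reloads it, so changes no longer require a manual restart:

cat <<EOF | kubectl create -f -
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 1
  serviceAccountName: prometheus-k8s
  # an empty selector picks up every ServiceMonitor in the namespace
  serviceMonitorSelector: {}
  resources:
    requests:
      memory: 400Mi
EOF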

Metrics

API-Server

No extra setup is needed; its metrics can be scraped directly.

# the API server itself
kubectl get --raw="/metrics"
# via the API server proxy: Kubelet
kubectl get --raw="/api/v1/nodes/${HOSTNAME}/proxy/metrics"
# via the API server proxy: cAdvisor
kubectl get --raw="/api/v1/nodes/${HOSTNAME}/proxy/metrics/cadvisor"
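A couple of illustrative queries against these metrics; note that the names are version dependent (around 1.14, apiserver_request_count was being renamed to apiserver_request_total, and the latency histograms moved from microseconds to seconds):

# API server request rate, by verb (use apiserver_request_total on newer clusters)
sum by (verb) (rate(apiserver_request_count[5m]))
# 99th percentile request latency (microseconds with the pre-1.14 metric names)
histogram_quantile(0.99, sum by (le) (rate(apiserver_request_latencies_bucket[5m])))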

Controller Manager && Scheduler

These components need matching Services added so that Prometheus can discover their metrics, as shown below:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager-prometheus
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
---    
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler-prometheus
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
EOF
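Note that 10252 and 10251 are the insecure HTTP ports; if your cluster only exposes the secure ports from the table above (10257/10259), adjust the port numbers and switch the ServiceMonitor to scheme: https. With the Services in place, each component still needs a ServiceMonitor to be picked up; a minimal sketch for the controller manager, mirroring the labels of the Service above:

cat <<EOF | kubectl create -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-controller-manager
  namespace: monitoring
  labels:
    k8s-app: kube-controller-manager
spec:
  jobLabel: k8s-app
  endpoints:
  - port: http-metrics
    interval: 30s
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
  namespaceSelector:
    matchNames:
    - kube-system
EOF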

Etcd

If your Etcd cluster has TLS enabled, monitoring it takes slightly more work. The following is adapted from https://www.qikqiak.com/post/prometheus-operator-monitor-etcd/.

Once you are successfully scraping Etcd's metrics, you can import the Grafana dashboard with ID 3070 to visualize them.

# Create a secret containing the Etcd client certificates
kubectl -n monitoring create secret generic etcd-certs \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --from-file=/etc/kubernetes/pki/etcd/ca.crt
# Mount the certificates into Prometheus by referencing the secret
# in the spec of the Prometheus custom resource:
  replicas: 1
  secrets:
  - etcd-certs
# Create the corresponding ServiceMonitor
cat <<EOF | kubectl create -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
# Create the Service for Etcd
cat <<EOF | kubectl create -f -    
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  selector:
    component: etcd
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
EOF    
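Once the target is up, two standard Etcd metrics make a quick health check:

# 1 when this member has a leader
etcd_server_has_leader
# rate of committed Raft proposals
rate(etcd_server_proposals_committed_total[5m])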

Kubelet

This splits into two sets of metrics, the Kubelet itself and cAdvisor, both served by the Kubelet:

# Kubelet
curl --cacert /var/lib/kubelet/pki/kubelet.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key  https://127.0.0.1:10250/metrics
# Cadvisor
curl --cacert /var/lib/kubelet/pki/kubelet.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key  https://127.0.0.1:10250/metrics/cadvisor
# Summary
curl --cacert /var/lib/kubelet/pki/kubelet.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key  https://ubuntu-bionic:10250/stats/summary
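If the Kubelet runs with webhook authentication enabled, an in-cluster client can use a ServiceAccount bearer token instead of client certificates. A sketch, assuming it runs inside a pod whose ServiceAccount is allowed to access the nodes/metrics resource (-k skips server certificate verification for brevity only):

curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://127.0.0.1:10250/metrics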

The former mainly covers pods, CNI, volumes, and so on; the latter covers CPU, memory, network traffic, etc. inside the pods.

# CPU usage of all containers
sum by (container_name) (rate(container_cpu_usage_seconds_total{}[1m]))
# memory usage of all containers (a gauge, so no rate())
sum by (container_name) (container_memory_usage_bytes{})
# bytes received by all containers, busiest pods first
sort_desc(sum by (pod_name) (rate(container_network_receive_bytes_total[1m])))

kube-state-metrics

As a complement to the monitoring setup, kube-state-metrics polls the Kubernetes API and turns structured information about your Kubernetes objects into metrics. These are some of the questions kube-state-metrics can answer (see the sample queries after the list):

  • How many replicas did I schedule? How many are currently available?
  • How many pods are running/stopped/terminated?
  • How many times has a pod restarted?
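A rough sketch of how these questions map to queries, using standard kube-state-metrics series:

# scheduled vs. currently available replicas, per Deployment
kube_deployment_spec_replicas
kube_deployment_status_replicas_available
# pod counts per phase (Running, Pending, Succeeded, Failed, Unknown)
sum by (phase) (kube_pod_status_phase)
# restart count per pod
sum by (pod) (kube_pod_container_status_restarts_total)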

In general, the model watches Kubernetes API objects and converts their state into metrics. It requires Kubernetes 1.2+, though be warned that metric names and labels are unstable and may change.

Some Grafana dashboard IDs that pair well with kube-state-metrics:

  • 6615 Kubernetes: DaemonSet (Prometheus)
  • 5303 Kubernetes Deployment (Prometheus)
  • 741 Kubernetes Deployment metrics

CoreDNS

As of v0.29, CoreDNS is scraped automatically right after installation; import dashboard 5926 to see the CoreDNS monitoring data.
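For a quick look without the dashboard, a couple of illustrative queries (CoreDNS later renamed its metrics; older releases use the *_count_total form shown here):

# DNS request rate
sum(rate(coredns_dns_request_count_total[1m]))
# cache hit rate
sum(rate(coredns_cache_hits_total[1m]))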

Node Exporter

For node-level monitoring, a Node Exporter launched as a DaemonSet keeps things simple; import Grafana dashboard ID 1860 afterwards to get the Node Exporter Full view.

# predict whether the disk will run out of space within 24h
predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0
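node:node_filesystem_avail: is a kube-prometheus recording rule; if your setup does not define it, the same idea against the raw node_exporter series (metric names as of node_exporter 0.16+) looks roughly like this:

# a negative predicted value means the filesystem fills up within 24h
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 3600 * 24) < 0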

Ref

  • Kubernetes监控系列(一):Kubernetes监控开源工具基本介绍以及如何使用Sysdig进行监控: http://dockone.io/article/4052
  • Prometheus监控实践:Kubernetes集群监控: https://blog.frognew.com/2017/12/using-prometheus-to-monitor-kubernetes.html
  • Get Kubernetes Cluster Metrics with Prometheus in 5 Minutes: https://akomljen.com/get-kubernetes-cluster-metrics-with-prometheus-in-5-minutes/
  • k8s与监控–解读prometheus监控kubernetes的配置文件: https://segmentfault.com/a/1190000013230914