Monitoring is indispensable for any system: you need it to see what the system looks like right now, and to decide, say, whether resources need adjusting compared with a week ago. For Kubernetes, the common open-source monitoring options are:
- Heapster + InfluxDB (deprecated)
- Prometheus
- Open-Falcon
- Telegraf+InfluxDB
And Kubernetes exposes quite a few targets that need monitoring, as the table below shows (Kubernetes 1.14.0):
Component | Metrics endpoint (port/path) |
---|---|
API Server | 6443/metrics |
Controller Manager | 10257/metrics |
Kube Scheduler | 10259/metrics |
Etcd | 2379/metrics |
Kubelet API | 10250/metrics |
Cadvisor | 10250/metrics/cadvisor |
Node | 9100/metrics |
So, to monitor the cluster comprehensively, I personally find Prometheus the best fit. Plain Prometheus has a drawback, though: every time you change a rule you have to reload it. And since cloud native favors declarative over imperative configuration, Prometheus Operator was born.
Install
My preferred way is to apply the pile of YAML manifests from the repo; if you are comfortable with Helm, that works too.
```bash
git clone https://github.com/coreos/prometheus-operator
cd prometheus-operator
# bundle.yaml ships at the repo root and installs the Operator plus its CRDs
kubectl apply -f bundle.yaml
kubectl get crd
NAME                                    CREATED AT
alertmanagers.monitoring.coreos.com     2018-09-12T02:13:19Z
prometheuses.monitoring.coreos.com      2018-09-12T02:13:19Z
prometheusrules.monitoring.coreos.com   2018-09-12T02:13:20Z
servicemonitors.monitoring.coreos.com   2018-09-12T02:13:20Z
```
This creates four CRDs: Alertmanager (Alertmanager instances), Prometheus (Prometheus instances), PrometheusRule (alerting and recording rules), and ServiceMonitor (the targets Prometheus should scrape).
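To see how these pieces fit together, here is a minimal sketch: a Prometheus resource declares, via a label selector, which ServiceMonitors it watches, and the Operator generates the scrape configuration from them. The names `k8s`, `example-app`, and the `team: frontend` label below are hypothetical, not part of the stock manifests.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s               # hypothetical instance name
  namespace: monitoring
spec:
  replicas: 1
  serviceAccountName: prometheus-k8s
  # every ServiceMonitor matching this selector becomes scrape config
  serviceMonitorSelector:
    matchLabels:
      team: frontend      # hypothetical label
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    team: frontend        # picked up by the selector above
spec:
  selector:
    matchLabels:
      app: example-app    # Services carrying this label get scraped
  endpoints:
  - port: web             # named port on the Service
```

With this model, changing a rule or a target is just another `kubectl apply`; the Operator takes care of reloading Prometheus.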
Metrics
API-Server
No special setup is required; its metrics can be scraped directly.
APIServer Self
```bash
kubectl get --raw="/metrics"
```
APIServer Proxy
```bash
# Kubelet
kubectl get --raw="/api/v1/nodes/${HOSTNAME}/proxy/metrics"
# Cadvisor
kubectl get --raw="/api/v1/nodes/${HOSTNAME}/proxy/metrics/cadvisor"
```
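When Prometheus runs inside the cluster, the API server is usually scraped through the built-in `kubernetes` Service in the `default` namespace. A ServiceMonitor sketch along the lines of what kube-prometheus ships (the token and CA paths are the in-pod service-account defaults):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver
  namespace: monitoring
  labels:
    k8s-app: apiserver
spec:
  jobLabel: component
  endpoints:
  - port: https
    scheme: https
    interval: 30s
    # in-pod defaults for the service-account token and cluster CA
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
  selector:
    matchLabels:
      component: apiserver
      provider: kubernetes
  namespaceSelector:
    matchNames:
    - default
```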
Controller Manager && Scheduler
You need to add the corresponding Services so that Prometheus can discover their metrics endpoints, as shown below. (Ports 10252 and 10251 are the insecure HTTP metrics ports of the controller manager and scheduler in 1.14, the counterparts of the secure 10257/10259 listed in the table above.)
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager-prometheus
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler-prometheus
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
EOF
```
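A Service alone only creates endpoints; the Operator scrapes through ServiceMonitors. kube-prometheus ships ServiceMonitors selecting these `k8s-app` labels already; if you are not using it, a minimal sketch matching the Services just created could look like this:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-controller-manager
  namespace: monitoring
  labels:
    k8s-app: kube-controller-manager
spec:
  jobLabel: k8s-app
  endpoints:
  - port: http-metrics   # matches the port name in the Service above
    interval: 30s
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
  namespaceSelector:
    matchNames:
    - kube-system
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-scheduler
  namespace: monitoring
  labels:
    k8s-app: kube-scheduler
spec:
  jobLabel: k8s-app
  endpoints:
  - port: http-metrics
    interval: 30s
  selector:
    matchLabels:
      k8s-app: kube-scheduler
  namespaceSelector:
    matchNames:
    - kube-system
```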
Etcd
If your Etcd cluster has TLS enabled, monitoring it takes a little more work. The steps below are adapted from https://www.qikqiak.com/post/prometheus-operator-monitor-etcd/
Once Etcd's metrics are being scraped successfully, you can import the Grafana dashboard with ID 3070 to visualize them.
```bash
# create a secret holding the Etcd client certificates
kubectl -n monitoring create secret generic etcd-certs \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --from-file=/etc/kubernetes/pki/etcd/ca.crt
```
Then mount the certificates into Prometheus by listing the secret in the Prometheus custom resource (with kube-prometheus the resource is typically named `k8s`, so `kubectl -n monitoring edit prometheus k8s`); secrets listed here get mounted under `/etc/prometheus/secrets/<name>/`:
```yaml
# fragment of the Prometheus resource's .spec
replicas: 1
secrets:
- etcd-certs
```
```bash
# create the corresponding ServiceMonitor
cat <<EOF | kubectl create -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
```
```bash
# create the Etcd Service
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  selector:
    component: etcd
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
EOF
```
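A quick way to verify the plumbing, assuming kubeadm gave the etcd static pods the `component: etcd` label used by the selector above:

```bash
# the headless Service should now expose the etcd member address(es)
kubectl -n kube-system get endpoints etcd-k8s
# spot-check the metrics endpoint directly (adjust cert paths to your cluster)
curl --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
     https://127.0.0.1:2379/metrics | head
```

Once the target shows up in the Prometheus UI, `etcd_server_has_leader` should return 1 for every member.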
Kubelet
There are two sets of metrics here: the Kubelet's own and cAdvisor's (plus a stats summary):
```bash
# Kubelet
curl --cacert /var/lib/kubelet/pki/kubelet.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key https://127.0.0.1:10250/metrics
# Cadvisor
curl --cacert /var/lib/kubelet/pki/kubelet.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key https://127.0.0.1:10250/metrics/cadvisor
# Summary (ubuntu-bionic is this node's hostname)
curl --cacert /var/lib/kubelet/pki/kubelet.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key https://ubuntu-bionic:10250/stats/summary
```
The former mainly covers the Kubelet itself: pod lifecycle, CNI, volume operations, and so on; the latter covers per-container CPU, memory, network traffic, and the like.
```
# CPU usage of all containers
sum by (container_name) (rate(container_cpu_usage_seconds_total{}[1m]))
# memory usage of all containers (a gauge, so no rate())
sum by (container_name) (container_memory_usage_bytes{})
# bytes received by each pod, highest first
sort_desc(sum by (pod_name) (rate(container_network_receive_bytes_total[1m])))
```
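With Prometheus Operator, both endpoints are typically scraped through one ServiceMonitor with two endpoint entries, roughly the way kube-prometheus wires it. This sketch assumes the `kubelet` Service with label `k8s-app: kubelet` that the Operator can generate (its `--kubelet-service` flag); names may differ in your setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: monitoring
  labels:
    k8s-app: kubelet
spec:
  jobLabel: k8s-app
  endpoints:
  - port: https-metrics        # Kubelet's own metrics
    scheme: https
    interval: 30s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  - port: https-metrics        # cAdvisor on the same port, different path
    scheme: https
    path: /metrics/cadvisor
    interval: 30s
    honorLabels: true          # keep the pod/container labels from cAdvisor
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: kubelet
  namespaceSelector:
    matchNames:
    - kube-system
```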
Kube State Metrics
As a complement to the setup above, kube-state-metrics polls the Kubernetes API and turns structured information about Kubernetes objects into metrics. Here are some of the questions kube-state-metrics can answer (see the example queries after this list):
- How many replicas did I schedule, and how many are currently available?
- How many pods are running/stopped/terminated?
- How many times has this pod restarted?
In general, the model is to take Kubernetes objects and events and convert them into metrics. It requires Kubernetes 1.2+; be warned, though, that metric names and labels are not stable and may change.
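As a sketch, the three questions above map onto kube-state-metrics series roughly like this (standard metric names from the 1.x series, which, as noted, may change):

```
# scheduled vs. available replicas per deployment
kube_deployment_spec_replicas
kube_deployment_status_replicas_available
# number of pods per phase (Running, Pending, Failed, ...)
sum by (phase) (kube_pod_status_phase)
# container restart counts, highest first
sort_desc(sum by (pod) (kube_pod_container_status_restarts_total))
```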
Matching Grafana dashboard IDs:
- 6615 Kubernetes: DaemonSet (Prometheus)
- 5303 Kubernetes Deployment (Prometheus)
- 741 Kubernetes Deployment metrics
CoreDNS
As of prometheus-operator v0.29, CoreDNS is scraped automatically once installation completes; import the Grafana dashboard with ID 5926 to view the CoreDNS monitoring data.
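If you want to poke at it by hand, a couple of query sketches (metric names from the CoreDNS versions current around Kubernetes 1.14; newer CoreDNS releases renamed some of these):

```
# overall DNS request rate handled by CoreDNS
sum(rate(coredns_dns_request_count_total[1m]))
# cache hit ratio
sum(rate(coredns_cache_hits_total[1m]))
  / (sum(rate(coredns_cache_hits_total[1m])) + sum(rate(coredns_cache_misses_total[1m])))
```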
Node Exporter
For node-level monitoring, node-exporter started as a DaemonSet does the job and is fairly simple; afterwards, import Grafana dashboard ID 1860 to get the Node Exporter Full view.
```
# predict whether a filesystem will run out of space within 24h
# (node:node_filesystem_avail: is a recording rule shipped with kube-prometheus)
predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0
```
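To turn this into an alert the declarative way, wrap it in a PrometheusRule resource. A minimal sketch; the labels that make the Operator pick the rule up must match your Prometheus resource's `ruleSelector` (`prometheus: k8s` and `role: alert-rules` are the kube-prometheus defaults):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-full-prediction
  namespace: monitoring
  labels:
    prometheus: k8s       # must match the Prometheus resource's ruleSelector
    role: alert-rules
spec:
  groups:
  - name: node.rules
    rules:
    - alert: NodeDiskRunningFull
      expr: predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0
      for: 30m
      labels:
        severity: warning
      annotations:
        message: 'Filesystem on {{ $labels.instance }} is predicted to fill up within 24h.'
```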
Ref
- [Kubernetes监控系列(一):Kubernetes监控开源工具基本介绍以及如何使用Sysdig进行监控](http://dockone.io/article/4052)
- [Prometheus监控实践:Kubernetes集群监控](https://blog.frognew.com/2017/12/using-prometheus-to-monitor-kubernetes.html)
- [Get Kubernetes Cluster Metrics with Prometheus in 5 Minutes](https://akomljen.com/get-kubernetes-cluster-metrics-with-prometheus-in-5-minutes/)
- k8s与监控–解读prometheus监控kubernetes的配置文件