On Tencent Cloud, the event logs of TKE and EKS clusters are only retained for one hour by default. When a service runs into problems we often need historical events to troubleshoot, and one hour of history makes that very inconvenient. Tencent Cloud can ship cluster events to CLS out of the box, but CLS is billed separately, and many people are more used to querying logs with Elasticsearch.
Below we use the open-source eventrouter to collect the event logs into Elasticsearch and then query them with Kibana.
eventrouter introduction: https://github.com/heptiolabs/eventrouter
eventrouter uses the List-Watch mechanism to receive the cluster's events in real time and push them to different sinks. The persistence scheme here is to have eventrouter write the events it receives to a log file, run a filebeat sidecar container in the same pod to collect that file and ship it to Elasticsearch, and finally search the logs through Kibana.
The deployment below is done on a TKE cluster; the same steps work on an EKS cluster.
1. Deploy Elasticsearch
Create the Elasticsearch cluster from the YAML below.
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
meta.helm.sh/release-name: weixnie-es-test
meta.helm.sh/release-namespace: weixnie
labels:
app: elasticsearch-master
app.kubernetes.io/managed-by: Helm
chart: elasticsearch
heritage: Helm
release: weixnie-es-test
name: elasticsearch-master
namespace: weixnie
spec:
podManagementPolicy: Parallel
replicas: 3
revisionHistoryLimit: 10
selector:
matchLabels:
app: elasticsearch-master
serviceName: elasticsearch-master-headless
template:
metadata:
labels:
app: elasticsearch-master
chart: elasticsearch
heritage: Helm
release: weixnie-es-test
name: elasticsearch-master
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- elasticsearch-master
topologyKey: kubernetes.io/hostname
containers:
- env:
- name: node.name
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: cluster.initial_master_nodes
value: elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,
- name: discovery.seed_hosts
value: elasticsearch-master-headless
- name: cluster.name
value: elasticsearch
- name: network.host
value: 0.0.0.0
- name: ES_JAVA_OPTS
value: -Xmx1g -Xms1g
- name: node.data
value: "true"
- name: node.ingest
value: "true"
- name: node.master
value: "true"
image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
imagePullPolicy: IfNotPresent
name: elasticsearch
ports:
- containerPort: 9200
name: http
protocol: TCP
- containerPort: 9300
name: transport
protocol: TCP
readinessProbe:
exec:
command:
- sh
- -c
- |
#!/usr/bin/env bash -e
# If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
# Once it has started only check that the node itself is responding
START_FILE=/tmp/.es_start_file
http () {
local path="${1}"
if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
else
BASIC_AUTH=''
fi
curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
}
if [ -f "${START_FILE}" ]; then
echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
http "/_cluster/health?timeout=0s"
else
echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
touch ${START_FILE}
exit 0
else
echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
exit 1
fi
fi
failureThreshold: 3
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 3
timeoutSeconds: 5
resources: {}
securityContext:
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 1000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /usr/share/elasticsearch/data
name: elasticsearch-master
dnsPolicy: ClusterFirst
initContainers:
- command:
- sysctl
- -w
- vm.max_map_count=262144
image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
imagePullPolicy: IfNotPresent
name: configure-sysctl
resources: {}
securityContext:
privileged: true
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1000
runAsUser: 1000
terminationGracePeriodSeconds: 120
updateStrategy:
type: RollingUpdate
volumeClaimTemplates:
- apiVersion: v1
kind: PersistentVolumeClaim
metadata:
creationTimestamp: null
name: elasticsearch-master
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 30Gi
volumeMode: Filesystem
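The StatefulSet by itself is not enough: discovery.seed_hosts points at a headless Service named elasticsearch-master-headless, and filebeat and Kibana later connect to elasticsearch-master:9200. If you install through the Helm chart these Services are created for you; if you apply the StatefulSet directly, a minimal sketch of the two Services (names, labels and namespace taken from the manifest above) could look like this:
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master-headless
  namespace: weixnie
spec:
  clusterIP: None                  # headless Service used by discovery.seed_hosts
  publishNotReadyAddresses: true   # let nodes discover each other before they are ready
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master
  namespace: weixnie
spec:
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300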
2. Deploy eventrouter
Create eventrouter and configure filebeat. Here filebeat writes directly to Elasticsearch; if you want to ship to Kafka first and then forward the data to Elasticsearch, you can add a Logstash in between.
apiVersion: v1
kind: ServiceAccount
metadata:
name: eventrouter
namespace: weixnie
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: eventrouter
rules:
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: eventrouter
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: eventrouter
subjects:
- kind: ServiceAccount
name: eventrouter
namespace: weixnie
---
apiVersion: v1
data:
config.json: |-
{
"sink": "glog"
}
kind: ConfigMap
metadata:
name: eventrouter-cm
namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: eventrouter
namespace: weixnie
labels:
app: eventrouter
spec:
replicas: 1
selector:
matchLabels:
app: eventrouter
template:
metadata:
labels:
app: eventrouter
tier: control-plane-addons
spec:
containers:
- name: kube-eventrouter
image: baiyongjie/eventrouter:v0.2
imagePullPolicy: IfNotPresent
command:
- "/bin/sh"
args:
- "-c"
- "/eventrouter -v 3 -log_dir /data/log/eventrouter"
volumeMounts:
- name: config-volume
mountPath: /etc/eventrouter
- name: log-path
mountPath: /data/log/eventrouter
- name: filebeat
image: elastic/filebeat:7.6.2
command:
- "/bin/sh"
args:
- "-c"
- "filebeat -c /etc/filebeat/filebeat.yml"
volumeMounts:
- name: filebeat-config
mountPath: /etc/filebeat/
- name: log-path
mountPath: /data/log/eventrouter
serviceAccount: eventrouter
volumes:
- name: config-volume
configMap:
name: eventrouter-cm
- name: filebeat-config
configMap:
name: filebeat-config
- name: log-path
emptyDir: {}
---
apiVersion: v1
data:
filebeat.yml: |-
filebeat.inputs:
- type: log
enabled: true
paths:
- "/data/log/eventrouter/*"
    setup.template.name: "tke-event" # name of the new index template
    setup.template.pattern: "tke-event-*" # which indices the template applies to, i.e. all indices starting with tke-event
    setup.template.enabled: false # disable the default template settings
    setup.template.overwrite: true # overwrite with the template configured here
    setup.ilm.enabled: false # index lifecycle management (ILM) is on by default and forces the index name to filebeat-*, so turn it off here
output.elasticsearch:
hosts: ['elasticsearch-master:9200']
index: "tke-event-%{+yyyy.MM.dd}"
kind: ConfigMap
metadata:
name: filebeat-config
namespace: weixnie
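Assuming the eventrouter and filebeat manifests above are saved in a single file, for example eventrouter.yaml (the file name is arbitrary), apply it and make sure both containers in the pod come up before checking the indices:
kubectl apply -f eventrouter.yaml
kubectl -n weixnie get pod -l app=eventrouter
# check the filebeat sidecar for errors connecting to Elasticsearch
kubectl -n weixnie logs deploy/eventrouter -c filebeat --tail=20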
To check whether the logs are being collected, look at whether the Elasticsearch indices are created; if the index exists, collection is working.
[root@VM-55-14-tlinux ~]# curl 10.55.254.57:9200/_cat/indices
green open .kibana_task_manager_1 31GLIGOZRSWaLvCD9Qi6pw 1 1 2 0 68kb 34kb
green open .apm-agent-configuration kWHztrKkRJG0QNAQuNc5_A 1 1 0 0 566b 283b
green open ilm-history-1-000001 rAcye5j4SCqp_mcL3r3q2g 1 1 18 0 50.6kb 25.3kb
green open tke-event-2022.04.30 R4R1MOJiSuGCczWsSu2bVA 1 1 390 0 590.3kb 281.3kb
green open .kibana_1 NveB_wCWTkqKVqadI2DNjw 1 1 10 1 351.9kb 175.9kb
3. Deploy Kibana
To make the logs easier to search, deploy a Kibana instance for querying the event logs.
apiVersion: v1
data:
kibana.yml: |
elasticsearch.hosts: http://elasticsearch-master:9200
server.host: "0"
server.name: kibana
kind: ConfigMap
metadata:
labels:
app: kibana
name: kibana
namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: kibana
name: kibana
namespace: weixnie
spec:
replicas: 1
selector:
matchLabels:
app: kibana
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
labels:
app: kibana
spec:
containers:
- image: kibana:7.6.2
imagePullPolicy: IfNotPresent
name: kibana
ports:
- containerPort: 5601
name: kibana
protocol: TCP
securityContext:
privileged: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /usr/share/kibana/config/kibana.yml
name: kibana
subPath: kibana.yml
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: kibana
name: kibana
---
apiVersion: v1
kind: Service
metadata:
labels:
app: kibana
name: kibana
namespace: weixnie
spec:
ports:
- name: 5601-5601-tcp
port: 5601
protocol: TCP
targetPort: 5601
selector:
app: kibana
sessionAffinity: None
type: ClusterIP
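The Service is of type ClusterIP and is only reachable from inside the cluster. For a quick look you can port-forward it to your local machine; this is just a convenience sketch, the Ingress below is the longer-term option:
# forward local port 5601 to the kibana Service, then open http://127.0.0.1:5601
kubectl -n weixnie port-forward svc/kibana 5601:5601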
If nginx-ingress is installed in the cluster, you can expose Kibana through an Ingress so that it can be accessed via a domain name.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
annotations:
kubernetes.io/ingress.class: nginx-intranet
name: kibana-ingress
namespace: weixnie
spec:
rules:
- host: kibana.tke.niewx.cn
http:
paths:
- backend:
serviceName: kibana
servicePort: 5601
path: /
pathType: ImplementationSpecific
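The ingress class nginx-intranet suggests an internal nginx-ingress controller, so the domain has to resolve inside your own network. For a quick test you can simply point it at the controller's load balancer address on the client machine; the IP below is only a placeholder:
# /etc/hosts entry on the client machine, replace the IP with your nginx-ingress LB address
10.0.0.10  kibana.tke.niewx.cn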
4. Test searching for events
Log in to Kibana.
Then create an index pattern. The indices written by filebeat all start with tke-event, so creating a tke-event-* index pattern in Kibana is enough.
Now delete a test pod to generate some events and check whether they can be found in Kibana.
[niewx@VM-0-4-centos ~]$ k delete pod nginx-6ccd9d7969-f4rfj
pod "nginx-6ccd9d7969-f4rfj" deleted
[niewx@VM-0-4-centos ~]$ k get pod | grep nginx
nginx-6ccd9d7969-fbz9d 1/1 Running 0 23s
[niewx@VM-0-4-centos ~]$ k describe pod nginx-6ccd9d7969-fbz9d | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58s default-scheduler Successfully assigned weixnie/nginx-6ccd9d7969-fbz9d to 172.16.22.23
Normal Pulling 58s kubelet Pulling image "nginx:latest"
Normal Pulled 55s kubelet Successfully pulled image "nginx:latest"
Normal Created 55s kubelet Created container nginx
Normal Started 55s kubelet Started container nginx
The same events can be retrieved in Kibana, which means the event logs have been persisted to Elasticsearch successfully.
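Filebeat's log input stores each raw eventrouter line in the message field by default, so in Kibana's Discover view (with the tke-event-* index pattern selected) you can narrow the results down to a single workload with a simple query; the pod name below is just the one from the test above:
message : "nginx-6ccd9d7969-fbz9d"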
5. Clean up Elasticsearch indices on a schedule
The event logs are stored in Elasticsearch and each day's events are written to a new index. If there are many events, keeping them for too long can easily fill the disk, so we write a small script and run it with a CronJob to clean up old indices periodically.
The cleanup script clean-es-indices.sh takes two arguments: the first is how many days back the index to delete is, the second is the Elasticsearch host address. Also note the date format in the script: my index names use the +%Y.%m.%d format, so the script does too; if your format is different, adjust the script and rebuild the image.
#!/bin/bash
day=$1
es_host=$2
DATA=`date -d "${day} days ago" +%Y.%m.%d`
echo "start cleaning index for $DATA"
# current time, used in the log message
time=`date`
# delete the index from ${day} days ago
curl -XGET "http://${es_host}:9200/_cat/indices/?v"|grep $DATA
if [ $? == 0 ];then
  curl -XDELETE "http://${es_host}:9200/*-${DATA}"
  echo "cleaned index for $DATA at $time"
else
  echo "no index for $DATA to clean"
fi
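Before baking the script into an image you can dry-run it from any machine or pod that can reach Elasticsearch; the arguments here (3 days back, elasticsearch-master as the host) just mirror the CronJob further below:
sh clean-es-indices.sh 3 elasticsearch-master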
Write a Dockerfile to bake the script into an image; the Dockerfile is as follows.
FROM centos:7
COPY clean-es-indices.sh /
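Build and push the image to your own registry; the repository path below is a placeholder, replace it with your own:
docker build -t ccr.ccs.tencentyun.com/<your-namespace>/clean-es-indices:latest .
docker push ccr.ccs.tencentyun.com/<your-namespace>/clean-es-indices:latest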
If you have no Docker environment to build with, you can use the image I have already built: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
Now create a CronJob from this image.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
labels:
k8s-app: clean-es-indices
qcloud-app: clean-es-indices
name: clean-es-indices
namespace: weixnie
spec:
concurrencyPolicy: Allow
failedJobsHistoryLimit: 1
jobTemplate:
spec:
completions: 1
parallelism: 1
template:
metadata:
labels:
k8s-app: clean-es-indices
qcloud-app: clean-es-indices
spec:
containers:
- args:
- sh -x /clean-es-indices.sh 3 elasticsearch-master
command:
- sh
- -c
image: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
imagePullPolicy: Always
name: clean-es-indices
resources: {}
securityContext:
privileged: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: qcloudregistrykey
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
schedule: 0 */23 * * *
successfulJobsHistoryLimit: 3
suspend: false
The cron expression 0 */23 * * * fires at minute 0 of hours 0 and 23 each day, so the cleanup runs roughly daily. The arguments in the start command are 3 and elasticsearch-master: indices from 3 days ago are deleted, and because Elasticsearch and the CronJob live in the same namespace, it is reached directly through the service name.
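Once the CronJob exists you can check it and trigger a run immediately instead of waiting for the next schedule; the Job name test-clean is arbitrary:
kubectl -n weixnie get cronjob clean-es-indices
# run the cleanup once right now by creating a Job from the CronJob
kubectl -n weixnie create job --from=cronjob/clean-es-indices test-clean
kubectl -n weixnie logs job/test-clean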