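Event logs on Tencent Cloud TKE and EKS clusters are retained for only one hour by default. When a service misbehaves, you often need historical events to troubleshoot it, and a one-hour window makes that very inconvenient. Tencent Cloud supports shipping cluster events to CLS out of the box, but CLS is a paid service, and many people prefer to query logs with Elasticsearch.
This article uses the open-source eventrouter to ship events to Elasticsearch and then queries them through Kibana.
eventrouter project page: https://github.com/heptiolabs/eventrouter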

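eventrouter uses the List-Watch mechanism to receive events from the Kubernetes cluster in real time and pushes them to a configurable sink. In the persistence setup used here, eventrouter writes the events to a log file, a filebeat sidecar container in the same pod collects that file and writes it to Elasticsearch, and Kibana is then used to search the logs stored there.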

Let's walk through the deployment. It is done on a TKE cluster here; an EKS cluster is deployed in the same way.

1. Deploy Elasticsearch

Create the Elasticsearch cluster with the following YAML:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: weixnie-es-test
    meta.helm.sh/release-namespace: weixnie
  labels:
    app: elasticsearch-master
    app.kubernetes.io/managed-by: Helm
    chart: elasticsearch
    heritage: Helm
    release: weixnie-es-test
  name: elasticsearch-master
  namespace: weixnie
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: elasticsearch-master
  serviceName: elasticsearch-master-headless
  template:
    metadata:
      labels:
        app: elasticsearch-master
        chart: elasticsearch
        heritage: Helm
        release: weixnie-es-test
      name: elasticsearch-master
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - elasticsearch-master
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: node.name
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,
        - name: discovery.seed_hosts
          value: elasticsearch-master-headless
        - name: cluster.name
          value: elasticsearch
        - name: network.host
          value: 0.0.0.0
        - name: ES_JAVA_OPTS
          value: -Xmx1g -Xms1g
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "true"
        - name: node.master
          value: "true"
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: http
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file

              http () {
                  local path="${1}"
                  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                    BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                  else
                    BASIC_AUTH=''
                  fi
                  curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }

              if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
                  http "/_cluster/health?timeout=0s"
              else
                  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                  fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-master
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: configure-sysctl
        resources: {}
        securityContext:
          privileged: true
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      terminationGracePeriodSeconds: 120
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
      volumeMode: Filesystem
    status:
      phase: Pending
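
The StatefulSet above sets serviceName and discovery.seed_hosts to elasticsearch-master-headless, and the filebeat and Kibana configurations later in this article connect to elasticsearch-master:9200, but neither Service is shown. The labels suggest the StatefulSet was exported from a Helm release, which would normally create both. If they do not exist in your cluster, the following sketch writes a minimal pair; the file name es-services.yaml and the exact port list are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Sketch: write the two Services that the manifests in this article refer to.
# Assumes they are not already present in the weixnie namespace.
cat > es-services.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master-headless
  namespace: weixnie
spec:
  clusterIP: None                  # headless: resolves to the individual pod IPs
  publishNotReadyAddresses: true   # let nodes find each other before they pass readiness
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master
  namespace: weixnie
spec:
  type: ClusterIP
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
EOF

echo "wrote es-services.yaml; apply it with: kubectl apply -f es-services.yaml"
```

The script only writes the file, so it can be reviewed before anything is applied to the cluster.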

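2. Deploy eventrouter

Create eventrouter and configure filebeat. Here filebeat writes directly to Elasticsearch; if you would rather ship the logs to Kafka first and then forward them to Elasticsearch, you can add Logstash in between.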

apiVersion: v1
kind: ServiceAccount
metadata:
  name: eventrouter 
  namespace: weixnie
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eventrouter 
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eventrouter 
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eventrouter
subjects:
- kind: ServiceAccount
  name: eventrouter
  namespace: weixnie
---
apiVersion: v1
data:
  config.json: |- 
    {
      "sink": "glog"
    }
kind: ConfigMap
metadata:
  name: eventrouter-cm
  namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventrouter
  namespace: weixnie
  labels:
    app: eventrouter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eventrouter
  template:
    metadata:
      labels:
        app: eventrouter
        tier: control-plane-addons
    spec:
      containers:
        - name: kube-eventrouter
          image: baiyongjie/eventrouter:v0.2
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "/eventrouter -v 3 -log_dir /data/log/eventrouter"
          volumeMounts:
          - name: config-volume
            mountPath: /etc/eventrouter
          - name: log-path
            mountPath: /data/log/eventrouter
        - name: filebeat
          image: elastic/filebeat:7.6.2
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "filebeat -c /etc/filebeat/filebeat.yml"
          volumeMounts:
          - name: filebeat-config
            mountPath: /etc/filebeat/
          - name: log-path
            mountPath: /data/log/eventrouter
      serviceAccount: eventrouter
      volumes:
        - name: config-volume
          configMap:
            name: eventrouter-cm
        - name: filebeat-config
          configMap:
            name: filebeat-config
        - name: log-path
          emptyDir: {}

---
apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
      - type: log
        enabled: true
        paths:
          - "/data/log/eventrouter/*"

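    setup.template.name: "tke-event"      # name of the custom index template
    setup.template.pattern: "tke-event-*" # indices the template applies to: everything starting with tke-event-
    setup.template.enabled: false         # do not load the default index template
    setup.template.overwrite: true        # overwrite an existing template when one is loaded
    setup.ilm.enabled: false              # ILM is on by default and, while on, the custom index name below is ignored in favor of filebeat-*; turn it off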

    output.elasticsearch:
      hosts: ['elasticsearch-master:9200']
      index: "tke-event-%{+yyyy.MM.dd}"
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: weixnie

To check that collection works, list the Elasticsearch indices. If a tke-event index has been created, events are being collected correctly.

[root@VM-55-14-tlinux ~]# curl 10.55.254.57:9200/_cat/indices
green open .kibana_task_manager_1           31GLIGOZRSWaLvCD9Qi6pw 1 1    2 0    68kb    34kb
green open .apm-agent-configuration         kWHztrKkRJG0QNAQuNc5_A 1 1    0 0    566b    283b
green open ilm-history-1-000001             rAcye5j4SCqp_mcL3r3q2g 1 1   18 0  50.6kb  25.3kb
green open tke-event-2022.04.30             R4R1MOJiSuGCczWsSu2bVA 1 1  390 0 590.3kb 281.3kb
green open .kibana_1                        NveB_wCWTkqKVqadI2DNjw 1 1   10 1 351.9kb 175.9kb
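
The same check can be scripted. The sketch below reads `_cat/indices` output on standard input and succeeds when a tke-event-YYYY.MM.DD index is listed; the function name has_event_index is hypothetical, and the demo runs on a line copied from the output above rather than on a live cluster.

```shell
#!/bin/sh
# has_event_index: read `_cat/indices` output on stdin and succeed if any
# index named tke-event-YYYY.MM.DD is listed. Hypothetical helper name.
has_event_index() {
  grep -Eq '(^|[[:space:]])tke-event-[0-9]{4}\.[0-9]{2}\.[0-9]{2}([[:space:]]|$)'
}

# Demo on a sample line; against a live cluster, pipe
# `curl -s <es-address>:9200/_cat/indices` into the function instead.
sample='green open tke-event-2022.04.30             R4R1MOJiSuGCczWsSu2bVA 1 1  390 0 590.3kb 281.3kb'
if printf '%s\n' "$sample" | has_event_index; then
  echo "event index found"
else
  echo "event index missing"
fi
```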

3. Deploy Kibana

To make searching easier, deploy a Kibana instance for querying the event logs.

apiVersion: v1
data:
  kibana.yml: |
    elasticsearch.hosts: http://elasticsearch-master:9200
    server.host: "0"
    server.name: kibana
kind: ConfigMap
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - image: kibana:7.6.2
        imagePullPolicy: IfNotPresent
        name: kibana
        ports:
        - containerPort: 5601
          name: kibana
          protocol: TCP
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/kibana/config/kibana.yml
          name: kibana
          subPath: kibana.yml
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kibana
        name: kibana
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  ports:
  - name: 5601-5601-tcp
    port: 5601
    protocol: TCP
    targetPort: 5601
  selector:
    app: kibana
  sessionAffinity: None
  type: ClusterIP

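If nginx-ingress is installed in the cluster, you can expose Kibana under a domain name through an Ingress. The manifest below uses networking.k8s.io/v1beta1, which Kubernetes 1.22 and later no longer serve; on those versions use networking.k8s.io/v1, where the backend is written as service.name and service.port.number.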

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx-intranet
  name: kibana-ingress
  namespace: weixnie
spec:
  rules:
  - host: kibana.tke.niewx.cn
    http:
      paths:
      - backend:
          serviceName: kibana
          servicePort: 5601
        path: /
        pathType: ImplementationSpecific

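4. Test event retrieval

Log in to Kibana.

Then create an index pattern. The index names configured in filebeat all start with tke-event, so create a tke-event-* index pattern in Kibana.

Now delete a test pod to generate some events and check whether they can be found in Kibana.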

[niewx@VM-0-4-centos ~]$ k delete pod nginx-6ccd9d7969-f4rfj
pod "nginx-6ccd9d7969-f4rfj" deleted
[niewx@VM-0-4-centos ~]$ k get pod | grep nginx
nginx-6ccd9d7969-fbz9d            1/1     Running       0          23s
[niewx@VM-0-4-centos ~]$ k describe pod nginx-6ccd9d7969-fbz9d | grep -A 10 Events
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  58s   default-scheduler  Successfully assigned weixnie/nginx-6ccd9d7969-fbz9d to 172.16.22.23
  Normal  Pulling    58s   kubelet            Pulling image "nginx:latest"
  Normal  Pulled     55s   kubelet            Successfully pulled image "nginx:latest"
  Normal  Created    55s   kubelet            Created container nginx
  Normal  Started    55s   kubelet            Started container nginx

The events can be found in Kibana, which shows that the event logs are being persisted to Elasticsearch.
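
The result can also be cross-checked without Kibana by querying Elasticsearch directly. Filebeat's log input stores each raw line in the `message` field, so a URI search on that field for the pod name should return the events. The sketch below only builds the URL; the helper name event_search_url is hypothetical, and the search text is assumed to need no URL-encoding (pod names do not).

```shell
#!/bin/sh
# event_search_url HOST TEXT: print a URI-search URL that looks for TEXT as a
# phrase in the `message` field of all tke-event-* indices.
event_search_url() {
  printf 'http://%s:9200/tke-event-*/_search?q=message:%%22%s%%22&size=5&pretty\n' "$1" "$2"
}

event_search_url elasticsearch-master nginx-6ccd9d7969-fbz9d
# Run the query from a pod inside the cluster, for example:
#   curl -s "$(event_search_url elasticsearch-master nginx-6ccd9d7969-fbz9d)"
```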

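5. Clean up Elasticsearch indices on a schedule

The events are stored in Elasticsearch, with each day's events written to a separate index. If there are many events, keeping them for too long can easily fill the disk, so we write a script and run it from a CronJob to clean up old indices regularly.

The cleanup script, clean-es-indices.sh, takes two arguments: the first is how many days back the index to delete lies, and the second is the Elasticsearch host. Because the job runs daily, deleting the index dated exactly N days ago on each run keeps the most recent N days of data. Also note the date format in the script: my index names use +%Y.%m.%d, so the script uses the same format. If your index names use a different date format, modify the script accordingly and rebuild the image.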

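#!/bin/bash
# Usage: clean-es-indices.sh <days> <es_host>

day=$1
es_host=$2

# Date suffix of the indices to delete; must match the index naming (%Y.%m.%d here)
DATA=$(date -d "${day} days ago" +%Y.%m.%d)

echo "Start cleaning indices for $DATA"

# Current time, used in the log message
time=$(date)

# Delete the indices dated N days ago, if any exist
if curl -s -XGET "http://${es_host}:9200/_cat/indices/?v" | grep -F "$DATA"; then
  curl -XDELETE "http://${es_host}:9200/*-${DATA}"
  echo "Cleaned indices for $DATA at $time"
else
  echo "No indices for $DATA to clean"
fi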
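
Note that the wildcard in the script above deletes every index whose name ends in that date, not only the tke-event ones. If the Elasticsearch cluster is shared with other daily indices, a narrower variant can build the exact index name first. The following is a sketch under stated assumptions: it requires GNU date (for `-d`), assumes the tke-event-%Y.%m.%d naming configured in filebeat, uses hypothetical helper names, and only prints the delete command instead of running it.

```shell
#!/bin/bash
# Sketch only: assumes GNU date and the tke-event-%Y.%m.%d index naming.
# index_for_days_ago and delete_command are hypothetical helper names.

# Print the name of the index that was written N days ago.
index_for_days_ago() {
  local days=$1
  echo "tke-event-$(date -d "${days} days ago" +%Y.%m.%d)"
}

# Print (do not run) the curl command that would delete that index.
delete_command() {
  local days=$1 es_host=$2
  echo "curl -XDELETE http://${es_host}:9200/$(index_for_days_ago "$days")"
}

delete_command 3 elasticsearch-master
```

Printing the command first makes it easy to confirm the computed name against the `_cat/indices` listing before deleting anything.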

Write a Dockerfile to package the script into an image:

FROM centos:7
COPY clean-es-indices.sh /

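If you have no Docker environment to build with, you can use the image I have already built: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest

Next, create a CronJob from this image. The manifest uses batch/v1beta1, which Kubernetes 1.25 and later no longer serve; on Kubernetes 1.21 and later use batch/v1 instead.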

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    k8s-app: clean-es-indices
    qcloud-app: clean-es-indices
  name: clean-es-indices
  namespace: weixnie
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      completions: 1
      parallelism: 1
      template:
        metadata:
          labels:
            k8s-app: clean-es-indices
            qcloud-app: clean-es-indices
        spec:
          containers:
          - args:
            - sh -x /clean-es-indices.sh 3 elasticsearch-master
            command:
            - sh
            - -c
            image: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
            imagePullPolicy: Always
            name: clean-es-indices
            resources: {}
            securityContext:
              privileged: false
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: qcloudregistrykey
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
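  schedule: 0 0 * * *
  successfulJobsHistoryLimit: 3
  suspend: false

The CronJob runs once a day at 00:00. A schedule such as `0 */23 * * *` is not equivalent: a step in the hour field restarts from 0 every day, so `*/23` matches hours 0 and 23 and the job would run twice a day. The startup arguments here are 3 and elasticsearch-master: the job cleans the index from 3 days ago, and because Elasticsearch and the CronJob are in the same namespace, Elasticsearch is reached directly by its Service name.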
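
When choosing a schedule, it helps to enumerate which hours a step expression in the hour field actually matches. Steps count from 0 and restart every day, so they do not mean "every N hours" unless N divides 24. A small sketch, with the hypothetical helper name hours_for_step:

```shell
#!/bin/sh
# hours_for_step N: print the hours (0-23) matched by "*/N" in a cron hour field.
# Hypothetical helper for illustration only.
hours_for_step() {
  seq 0 "$1" 23 | tr '\n' ' ' | sed 's/ $//'
}

echo "*/23 -> $(hours_for_step 23)"   # prints: */23 -> 0 23
echo "*/6  -> $(hours_for_step 6)"    # prints: */6  -> 0 6 12 18
```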
