On Tencent Cloud, TKE and EKS clusters keep event logs for only one hour by default. When a service runs into trouble, we often need historical events to troubleshoot, and a one-hour window makes that very inconvenient. Tencent Cloud can ship cluster events to CLS out of the box, but CLS is billed, and many people are used to querying logs with Elasticsearch.
Below we use the open-source eventrouter to collect the event logs into Elasticsearch, then query them with Kibana.
eventrouter introduction: https://github.com/heptiolabs/eventrouter
eventrouter uses the List-Watch mechanism to pick up the cluster's events in real time and push them to different sinks. The persistence scheme here: eventrouter writes the events it captures to a log file, a filebeat sidecar deployed in the same pod tails that file and ships the logs to ES, and finally we search the logs through Kibana.
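With the glog sink configured below, each event is written to the log file as a JSON line, roughly of this shape (an abbreviated, illustrative record, not verbatim output):

{"verb":"ADDED","event":{"metadata":{"name":"nginx-6ccd9d7969-fbz9d.16e8...","namespace":"weixnie"},"reason":"Scheduled","message":"Successfully assigned weixnie/nginx-6ccd9d7969-fbz9d to 172.16.22.23","type":"Normal"}}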
Now let's deploy it step by step. This walkthrough uses a TKE cluster; an EKS cluster is deployed the same way.
1. Deploy Elasticsearch
Create the ES cluster from the following YAML:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: weixnie-es-test
    meta.helm.sh/release-namespace: weixnie
  labels:
    app: elasticsearch-master
    app.kubernetes.io/managed-by: Helm
    chart: elasticsearch
    heritage: Helm
    release: weixnie-es-test
  name: elasticsearch-master
  namespace: weixnie
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: elasticsearch-master
  serviceName: elasticsearch-master-headless
  template:
    metadata:
      labels:
        app: elasticsearch-master
        chart: elasticsearch
        heritage: Helm
        release: weixnie-es-test
      name: elasticsearch-master
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - elasticsearch-master
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: node.name
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,
        - name: discovery.seed_hosts
          value: elasticsearch-master-headless
        - name: cluster.name
          value: elasticsearch
        - name: network.host
          value: 0.0.0.0
        - name: ES_JAVA_OPTS
          value: -Xmx1g -Xms1g
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "true"
        - name: node.master
          value: "true"
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: http
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file
              http () {
                local path="${1}"
                if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                  BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                else
                  BASIC_AUTH=''
                fi
                curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }
              if [ -f "${START_FILE}" ]; then
                echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
                http "/_cluster/health?timeout=0s"
              else
                echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                  touch ${START_FILE}
                  exit 0
                else
                  echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                  exit 1
                fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-master
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: configure-sysctl
        resources: {}
        securityContext:
          privileged: true
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      terminationGracePeriodSeconds: 120
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
      volumeMode: Filesystem
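Note that the StatefulSet relies on two Services that the Helm chart normally creates alongside it: the headless elasticsearch-master-headless used for node discovery, and the elasticsearch-master ClusterIP Service that filebeat and Kibana talk to on port 9200. If you apply the exported manifest directly instead of installing the chart, you also need something like this sketch (ports and selectors are assumptions matching the labels above):

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master
  namespace: weixnie
  labels:
    app: elasticsearch-master
spec:
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
    targetPort: 9200
  - name: transport
    port: 9300
    targetPort: 9300
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master-headless
  namespace: weixnie
  labels:
    app: elasticsearch-master
spec:
  clusterIP: None
  publishNotReadyAddresses: true   # allow discovery before nodes pass readiness
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300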
2. Deploy eventrouter
Create eventrouter and configure filebeat. Here filebeat writes straight to ES; if you would rather ship to Kafka first and forward to ES from there, you can add a Logstash in between (see the sketch after the manifest below).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: eventrouter
  namespace: weixnie
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: eventrouter
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: eventrouter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eventrouter
subjects:
- kind: ServiceAccount
  name: eventrouter
  namespace: weixnie
---
apiVersion: v1
data:
  config.json: |-
    {
      "sink": "glog"
    }
kind: ConfigMap
metadata:
  name: eventrouter-cm
  namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventrouter
  namespace: weixnie
  labels:
    app: eventrouter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eventrouter
  template:
    metadata:
      labels:
        app: eventrouter
        tier: control-plane-addons
    spec:
      containers:
      - name: kube-eventrouter
        image: baiyongjie/eventrouter:v0.2
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "/eventrouter -v 3 -log_dir /data/log/eventrouter"
        volumeMounts:
        - name: config-volume
          mountPath: /etc/eventrouter
        - name: log-path
          mountPath: /data/log/eventrouter
      - name: filebeat
        image: elastic/filebeat:7.6.2
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "filebeat -c /etc/filebeat/filebeat.yml"
        volumeMounts:
        - name: filebeat-config
          mountPath: /etc/filebeat/
        - name: log-path
          mountPath: /data/log/eventrouter
      serviceAccount: eventrouter
      volumes:
      - name: config-volume
        configMap:
          name: eventrouter-cm
      - name: filebeat-config
        configMap:
          name: filebeat-config
      - name: log-path
        emptyDir: {}
---
apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: log
      enabled: true
      paths:
        - "/data/log/eventrouter/*"
    setup.template.name: "tke-event"        # name of the custom index template
    setup.template.pattern: "tke-event-*"   # indices the template applies to, i.e. everything starting with tke-event
    setup.template.enabled: false           # do not load the default template
    setup.template.overwrite: true          # overwrite the template if it already exists
    setup.ilm.enabled: false                # ILM is on by default and forces filebeat-* index names; disable it to use a custom index
    output.elasticsearch:
      hosts: ['elasticsearch-master:9200']
      index: "tke-event-%{+yyyy.MM.dd}"
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: weixnie
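For the Kafka variant mentioned above, only the filebeat side needs its output swapped; replace output.elasticsearch with something like this sketch (the broker address and topic name are hypothetical, and the Logstash pipeline that consumes the topic and writes to ES is not shown):

output.kafka:
  hosts: ["kafka-0.kafka-headless.weixnie:9092"]   # hypothetical broker address
  topic: "tke-event"                               # hypothetical topic name
  required_acks: 1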
To verify collection is working, check whether the ES indices are created properly; if the tke-event index appears, the logs are being collected:
[root@VM-55-14-tlinux ~]# curl 10.55.254.57:9200/_cat/indices
green open .kibana_task_manager_1 31GLIGOZRSWaLvCD9Qi6pw 1 1 2 0 68kb 34kb
green open .apm-agent-configuration kWHztrKkRJG0QNAQuNc5_A 1 1 0 0 566b 283b
green open ilm-history-1-000001 rAcye5j4SCqp_mcL3r3q2g 1 1 18 0 50.6kb 25.3kb
green open tke-event-2022.04.30 R4R1MOJiSuGCczWsSu2bVA 1 1 390 0 590.3kb 281.3kb
green open .kibana_1 NveB_wCWTkqKVqadI2DNjw 1 1 10 1 351.9kb 175.9kb
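The 10.55.254.57 here is the ES address reachable from my node (presumably the ClusterIP of the elasticsearch-master Service); you can look up the one in your cluster with:

kubectl get svc elasticsearch-master -n weixnie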
3. Deploy Kibana
To make searching easier, create a Kibana instance for querying the event logs:
apiVersion: v1
data:
  kibana.yml: |
    elasticsearch.hosts: http://elasticsearch-master:9200
    server.host: "0"
    server.name: kibana
kind: ConfigMap
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - image: kibana:7.6.2
        imagePullPolicy: IfNotPresent
        name: kibana
        ports:
        - containerPort: 5601
          name: kibana
          protocol: TCP
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/kibana/config/kibana.yml
          name: kibana
          subPath: kibana.yml
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kibana
        name: kibana
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  ports:
  - name: 5601-5601-tcp
    port: 5601
    protocol: TCP
    targetPort: 5601
  selector:
    app: kibana
  sessionAffinity: None
  type: ClusterIP
If nginx-ingress is installed in the cluster, you can expose Kibana through an Ingress and access it by domain name:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx-intranet
  name: kibana-ingress
  namespace: weixnie
spec:
  rules:
  - host: kibana.tke.niewx.cn
    http:
      paths:
      - backend:
          serviceName: kibana
          servicePort: 5601
        path: /
        pathType: ImplementationSpecific
4. Test searching events
Log in to Kibana (see below for a quick way in if you skipped the Ingress).
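A port-forward is the quickest way to reach Kibana without the Ingress, assuming the kibana Service above:

kubectl port-forward svc/kibana 5601:5601 -n weixnie
# then open http://127.0.0.1:5601 in a browser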
Then create an index pattern. The indices filebeat writes all start with tke-event, so a tke-event-* index pattern in Kibana is enough; it can also be created via the API, as sketched below.
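A sketch of creating the same index pattern through Kibana's saved objects API instead of the UI, assuming Kibana is reachable on 127.0.0.1:5601 via the port-forward above:

curl -X POST "http://127.0.0.1:5601/api/saved_objects/index-pattern/tke-event" \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{"attributes": {"title": "tke-event-*", "timeFieldName": "@timestamp"}}'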
Now delete a test pod to generate some events and check whether they can be found in Kibana:
[niewx@VM-0-4-centos ~]$ k delete pod nginx-6ccd9d7969-f4rfj
pod "nginx-6ccd9d7969-f4rfj" deleted
[niewx@VM-0-4-centos ~]$ k get pod | grep nginx
nginx-6ccd9d7969-fbz9d 1/1 Running 0 23s
[niewx@VM-0-4-centos ~]$ k describe pod nginx-6ccd9d7969-fbz9d | grep -A 10 Events
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  58s   default-scheduler  Successfully assigned weixnie/nginx-6ccd9d7969-fbz9d to 172.16.22.23
  Normal  Pulling    58s   kubelet            Pulling image "nginx:latest"
  Normal  Pulled     55s   kubelet            Successfully pulled image "nginx:latest"
  Normal  Created    55s   kubelet            Created container nginx
  Normal  Started    55s   kubelet            Started container nginx
If these events can be found in Kibana, the event logs are being persisted to ES successfully.
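Because the filebeat input above does not decode JSON, each eventrouter line lands as plain text in the message field, so a full-text search on that field is the simplest check, e.g. this KQL query in Discover:

message : "nginx-6ccd9d7969"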
5. Clean up ES indices on a schedule
The event logs live in ES, and each day's events go into a new index. If there are many events, keeping them for too long can easily fill the disk, so we can write a small script and run it from a CronJob to clean up old indices periodically.
The cleanup script clean-es-indices.sh takes two arguments: the first is a number of days N, and the script deletes the indices dated exactly N days ago (run once a day, this effectively keeps about N days of indices); the second is the ES host address. Also note the date format in the script: my index names use +%Y.%m.%d, which is what the script assumes, so if your date format differs, adjust the script and rebuild the image.
#!/bin/bash
# Usage: clean-es-indices.sh <days> <es_host>
#   $1: delete the indices dated this many days ago
#   $2: Elasticsearch host (Service name or IP)
day=$1
es_host=$2
# Date suffix of the indices to delete; must match the +%Y.%m.%d format in the filebeat index name
DATA=`date -d "${day} days ago" +%Y.%m.%d`
echo "Start cleaning indices for ${DATA}"
# Current time, for the log message
time=`date`
# Delete the indices dated ${day} days ago, if any exist
# (note: the wildcard matches every index ending in -${DATA})
curl -XGET "http://${es_host}:9200/_cat/indices/?v" | grep $DATA
if [ $? -eq 0 ]; then
    curl -XDELETE "http://${es_host}:9200/*-${DATA}"
    echo "Cleaned indices for ${DATA} at ${time}"
else
    echo "No indices dated ${DATA} to clean"
fi
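It is worth running the script by hand once before wiring it into a CronJob, e.g. to delete the indices dated three days ago via the elasticsearch-master Service:

bash clean-es-indices.sh 3 elasticsearch-master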
Write a Dockerfile to bake the script into an image:
FROM centos:7
COPY clean-es-indices.sh /
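Build and push it to your own registry (the registry path below is a placeholder):

docker build -t ccr.ccs.tencentyun.com/<your-namespace>/clean-es-indices:latest .
docker push ccr.ccs.tencentyun.com/<your-namespace>/clean-es-indices:latest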
If you have no Docker environment to build with, you can use the image I have already built: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
Now create a CronJob from this image:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    k8s-app: clean-es-indices
    qcloud-app: clean-es-indices
  name: clean-es-indices
  namespace: weixnie
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      completions: 1
      parallelism: 1
      template:
        metadata:
          labels:
            k8s-app: clean-es-indices
            qcloud-app: clean-es-indices
        spec:
          containers:
          - args:
            - sh -x /clean-es-indices.sh 3 elasticsearch-master
            command:
            - sh
            - -c
            image: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
            imagePullPolicy: Always
            name: clean-es-indices
            resources: {}
            securityContext:
              privileged: false
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: qcloudregistrykey
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: "0 */23 * * *"
  successfulJobsHistoryLimit: 3
  suspend: false
The schedule here, 0 */23 * * *, fires at minute 0 of every hour divisible by 23, i.e. at 00:00 and 23:00 each day, which is roughly once a day (use 0 0 * * * if you want exactly once a day). For the script arguments I pass 3 and elasticsearch-master: indices dated three days ago are cleaned, and since the CronJob runs in the same namespace as ES, the Service name alone is enough to reach it.
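To verify the cleanup without waiting for the schedule, you can trigger a one-off Job from the CronJob (the test job name is arbitrary):

kubectl create job --from=cronjob/clean-es-indices clean-es-manual-test -n weixnie
kubectl logs -l job-name=clean-es-manual-test -n weixnie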