At my company, I built a Kubernetes cluster with Terraform and set up a logging pipeline with EFK (Elasticsearch, Fluent-bit, Kibana).
The cluster runs on AWS EKS and Elasticsearch on Amazon OpenSearch Service (ES 7.10); Fluent-bit is installed separately as a DaemonSet via the Helm chart:
https://artifacthub.io/packages/helm/fluent/fluent-bit
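For reference, the install looks roughly like this (the `logging` namespace and the values file name are only illustrative):

```bash
# Add the official Fluent Bit chart repo and install the chart as a DaemonSet
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit \
  --namespace logging --create-namespace \
  -f fluent-bit-values.yaml
```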
The Fluent-bit config (Helm values) is structured as follows:
    config:
      inputs: |
        [INPUT]
            Name              tail
            Exclude_Path      /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*, /var/log/containers/aws-load-balancer*, /var/log/containers/cert-manager*, /var/log/containers/coredns*, /var/log/containers/ebs-csi*, /var/log/containers/grafana*, /var/log/containers/karpenter*, /var/log/containers/metrics-server*, /var/log/containers/prometheus*
            Path              /var/log/containers/*.log
            Tag               kube.*
            Mem_Buf_Limit     50MB
            multiline.parser  python-custom

      outputs: |
        [OUTPUT]
            Name          es
            Match         kube.*
            Host          ${es_domain_name}
            Port          443
            TLS           On
            AWS_Auth      On
            AWS_Region    ${region}
            Retry_Limit   6
            Replace_Dots  On
            Trace_Error   On

      filters: |
        [FILTER]
            Name                 kubernetes
            Match                kube.*
            Merge_Log            On
            Merge_Log_Key        log_processed
            Keep_Log             Off
            K8S-Logging.Parser   On
            K8S-Logging.Exclude  On

      customParsers: |
        [MULTILINE_PARSER]
            name           python-custom
            type           regex
            flush_timeout  1000
            rule           "start_state"  "/stderr F(.*)/"  "cont"
            rule           "cont"         "/stderr F(.*)/"  "cont"
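For context, the `python-custom` multiline parser is there to stitch multi-line Python tracebacks back into one record: both rules match CRI-formatted `stderr F` lines, so consecutive lines like the following stay together (the timestamps and traceback below are made up for illustration):

```
2023-10-16T09:48:55.123456789+09:00 stderr F Traceback (most recent call last):
2023-10-16T09:48:55.123456789+09:00 stderr F   File "app.py", line 42, in <module>
2023-10-16T09:48:55.123456789+09:00 stderr F ValueError: something went wrong
```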
Overall this worked fine, but one day, after a certain situation, I noticed that every log collected by Fluent-bit was being indexed with the same timestamp.
Previously, each Pod's log output was recorded with an accurate timestamp, but after a Pod is reloaded, its logs keep being recorded with a single fixed @timestamp. Because new logs keep accumulating at that past point in time, Kibana shows the following.
The logs do reach ES in real time, but since they are all stamped with the same past time, time-based queries no longer work properly.
    // Unlike the timestamps in the actual application output, the @timestamp recorded by Fluent-bit stays the same.
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:48:55...[2023-10-16 09:48:55]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:48:55...[2023-10-16 09:48:55]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:53:55...[2023-10-16 09:53:55]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:58:03...[2023-10-16 09:58:03]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:58:03...[2023-10-16 09:58:03]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:58:03...[2023-10-16 09:58:03]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:58:55...[2023-10-16 09:58:55]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T09:58:55...[2023-10-16 09:58:55]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T10:03:03...[2023-10-16 10:03:03]...
    Oct 16, 2023 @ 16:13:36.142    2023-10-16T10:03:55...[2023-10-16 10:03:55]...
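The application's own timestamps (right-hand column above) are correct; one way to double-check the raw container log lines is `kubectl logs --timestamps`, e.g. (pod and namespace names are the ones from the listings below):

```bash
# Print the container runtime's own timestamps for the most recent lines
kubectl logs backend-api-deployment-5d9fc55496-8mppx \
  -n backend-dev --timestamps --tail=10
```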
The problem occurs whenever the Pods of a service are updated and reloaded (a normal rollout, sketched below). Fluent-bit does emit an inotify_fs_add log for the newly created Pod, and I could not find any other error logs.
Pod Reloading
    │ backend-api-deployment-5d9fc55496-8mppx  ●  1/1  0  Running
    │ backend-api-deployment-57d68dd4f5-djn8x  ●  1/1  0  Terminating
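The rollout itself is nothing special; whether triggered by a CI image update or a manual restart, it amounts to something like this:

```bash
# Roll the deployment; a new Pod starts and the old one goes to Terminating
kubectl rollout restart deployment/backend-api-deployment -n backend-dev
```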
Fluent-bit Log
    │ Fluent Bit v2.1.8
    │ * Copyright (C) 2015-2022 The Fluent Bit Authors
    │ * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
    │ * https://fluentbit.io
    │
    │ [2023/10/16 16:05:00] [ info] [fluent bit] version=2.1.8, commit=1d83649441, pid=1
    │ [2023/10/16 16:05:00] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
    │ [2023/10/16 16:05:00] [ info] [cmetrics] version=0.6.3
    │ [2023/10/16 16:05:00] [ info] [ctraces ] version=0.3.1
    │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] initializing
    │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
    │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] multiline core started
    │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
    │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] token updated
    │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
    │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
    │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
    │ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #0 started
    │ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #1 started
    │ [2023/10/16 16:05:00] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
    │ [2023/10/16 16:05:00] [ info] [sp] stream processor started
    │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=41916664 watch_fd=1 name=/var/log/containers/backend-api-deployment-57d68dd4f5-djn8x_backend-dev_backend-api-f950104ab9f300586803316ad7beef5c9506db9fceeb4af0dc0575b3b9bacfd7.log
    │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=17488499 watch_fd=2 name=/var/log/containers/backend-api-deployment-6dc5b97645-tqwxh_backend-staging_backend-api-23f14f74af6cc163f233499ba92a2746cb2f263c9379ba53fc47cd3d76e00bbb.log
    ...
    │ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response:
    │ 429 Too Many Requests /_bulk
    │
    │ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.870529031.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)
    │ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response:
    │ 429 Too Many Requests /_bulk
    │
    │ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.276173730.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
    │ [2023/10/17 04:49:07] [ info] [engine] flush chunk '1-1697518140.870529031.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.0 (out_id=0)
    │ [2023/10/17 04:49:08] [ info] [engine] flush chunk '1-1697518140.276173730.flb' succeeded at retry 1: task_id=0, input=tail.0 > output=es.0 (out_id=0)
    │ [2023/10/17 05:08:50] [ info] [input:tail:tail.0] inotify_fs_add(): inode=37547666 watch_fd=8 name=/var/log/containers/backend-api-deployment-5d9fc55496-8mppx_backend-dev_backend-api-839c128d6227f797722496a992ca9149b95c366997c2b1ec28ceb65490582e3b.log
    │ [2023/10/17 05:08:52] [ info] [filter:kubernetes:kubernetes.0] token updated
    │ [2023/10/17 05:09:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=41916664 watch_fd=1
    │ [2023/10/17 05:09:37] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=17488499 watch_fd=2
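Beyond the container log above, the built-in monitoring endpoint (the http_server listening on port 2020 in the log) can also be queried for per-plugin input/output counters; something along these lines (the Fluent-bit pod name and namespace below are placeholders):

```bash
# Forward the monitoring port of one Fluent-bit Pod and dump its counters
kubectl -n logging port-forward pod/fluent-bit-xxxxx 2020:2020 &
curl -s http://127.0.0.1:2020/api/v1/metrics | jq .
curl -s http://127.0.0.1:2020/api/v1/uptime
```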
What's interesting is that if I manually kill all the fluent-bit DaemonSet Pods and let them be recreated through self-healing, logs are temporarily collected with correct timestamps again (roughly as below).
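The "manual kill" is just deleting the DaemonSet Pods and letting Kubernetes recreate them; roughly (namespace and label selector here reflect the Helm chart's defaults and are only an example):

```bash
# Delete every Fluent-bit Pod; the DaemonSet recreates them immediately
kubectl delete pod -n logging -l app.kubernetes.io/name=fluent-bit
```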
But as soon as a service Pod is updated and reloaded again, the same problem reappears.
What causes this problem, and how can I fix it? Any recommended Fluent-bit settings or other advice would also be welcome.
Thank you!