Use 50% Discount for all private proxies!
Unlimited possibilities with ExtraProxies

Fluent-bit in Kubernetes cannot input timestamp data properly

At my company, I built a K8s cluster with Terraform and configured a logging system with EFK (Elasticsearch, Fluent-bit, Kibana).

k8s and Elasticsearch use AWS’s EKS and Opensearch Servcie (ES 7.10), and Fluent-bit was installed separately as a Daemonset through Helm.

https://artifacthub.io/packages/helm/fluent/fluent-bit

In the case of Fluent-bit Config, it is structured as follows.

config:    inputs: |      [INPUT]          Name tail          Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*, / var/log/containers/aws-load-balancer*, /var/log/containers/cert-manager*, /var/log/containers/coredns*, /var/log/containers/ebs-csi*, /var/ log/containers/grafana*, /var/log/containers/karpenter*, /var/log/containers/metrics-server*, /var/log/containers/prometheus*          Path /var/log/containers/*.log          Tag kube.*          Mem_Buf_Limit 50MB          multiline.parser python-custom     outputs: |      [OUTPUT]          Name es          Match kube.*          Host $  {es_domain_name}          Port 443          TLS On          AWS_Auth On          AWS_Region $  {region}          Retry_Limit 6          Replace_Dots On          Trace_Error On     filters: |      [FILTER]          Name kubernetes          Match kube.*          Merge_Log On          Merge_Log_Key log_processed          Keep_LogOff          K8S-Logging.Parser On          K8S-Logging.Exclude On     customParsers: |      [MULTILINE_PARSER]          name python-custom          type regex          flush_timeout 1000          rule "start_state" "/stderr F(.*)/" "cont"          rule "cont" "/stderr F(.*)/" "cont" 

Overall it worked fine, but one day, after a certain situation, I noticed that all the logs collected by Fluent-bit were all pointing to the same timestamp.

Previously, the log output from the Pod was recorded with an accurate timestamp, but after the Pod is reloaded, the log continues to be recorded with a fixed @timestamp, and since logs are accumulated in the past time zone, the following phenomenon appears when viewed in kibana.

Logs are all transmitted to ES in real time, but since they are all collected at a fixed past time, proper queries cannot be executed.

enter image description here

// Unlike the timestamp output in the actual application, the timestamp recorded in Fluent-bit remains the same. Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:48:55...[2023-10-16 09:48:55]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:48:55...[2023-10-16 09:48:55]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:53:55...[2023-10-16 09:53:55]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:55...[2023-10-16 09:58:55]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:55...[2023-10-16 09:58:55]...  Oct 16, 2023 @ 16:13:36.142 2023-10-16T10:03:03...[2023-10-16 10:03:03]... Oct 16, 2023 @ 16:13:36.142 2023-10-16T10:03:55...[2023-10-16 10:03:55]...  

This problem occurs when updating and reloading Pods running for each service. I confirmed that the inotify_fs_add log was output to Fluent-bit for the newly created Pod, and no other error logs could be confirmed.

Pod Reloading

│ backend-api-deployment-5d9fc55496-8mppx                            ●  1/1          0 Running │ backend-api-deployment-57d68dd4f5-djn8x                            ●  1/1          0 Terminating 

Fluent-bit Log

│ Fluent Bit v2.1.8 │ * Copyright (C) 2015-2022 The Fluent Bit Authors │ * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd │ * https://fluentbit.io │ │ [2023/10/16 16:05:00] [ info] [fluent bit] version=2.1.8, commit=1d83649441, pid=1 │ [2023/10/16 16:05:00] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128 │ [2023/10/16 16:05:00] [ info] [cmetrics] version=0.6.3 │ [2023/10/16 16:05:00] [ info] [ctraces ] version=0.3.1 │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] initializing │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only) │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] multiline core started │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443 │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0]  token updated │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] local POD info OK │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server... │ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] connectivity OK │ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #0 started │ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #1 started │ [2023/10/16 16:05:00] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020 │ [2023/10/16 16:05:00] [ info] [sp] stream processor started │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=41916664 watch_fd=1 name=/var/log/containers/backend-api-deployment-57d68dd4f5-djn8x_backend-dev_backend-api-f950104ab9f300586803316ad7beef5c9506db9fceeb4af0dc0575b3b9bacfd7.log │ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=17488499 watch_fd=2 name=/var/log/containers/backend-api-deployment-6dc5b97645-tqwxh_backend-staging_backend-api-23f14f74af6cc163f233499ba92a2746cb2f263c9379ba53fc47cd3d76e00bbb.log ... │ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response: │ 429 Too Many Requests /_bulk │ │ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.870529031.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0) │ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response: │ 429 Too Many Requests /_bulk │ │ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.276173730.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0) │ [2023/10/17 04:49:07] [ info] [engine] flush chunk '1-1697518140.870529031.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.0 (out_id=0) │ [2023/10/17 04:49:08] [ info] [engine] flush chunk '1-1697518140.276173730.flb' succeeded at retry 1: task_id=0, input=tail.0 > output=es.0 (out_id=0) │ [2023/10/17 05:08:50] [ info] [input:tail:tail.0] inotify_fs_add(): inode=37547666 watch_fd=8 name=/var/log/containers/backend-api-deployment-5d9fc55496-8mppx_backend-dev_backend-api-839c128d6227f797722496a992ca9149b95c366997c2b1ec28ceb65490582e3b.log │ [2023/10/17 05:08:52] [ info] [filter:kubernetes:kubernetes.0]  token updated │ [2023/10/17 05:09:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=41916664 watch_fd=1 │ [2023/10/17 05:09:37] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=17488499 watch_fd=2 

What’s interesting is that if all fluent-bit Daemon Sets are manually killed and revived through self-healing, they will all temporarily be collected normally again.

After that, if you update and reload the Service Pod again, the same problem occurs again.

What causes this problem? How can I fix it? And if there are any recommended Fluent-bit settings, additional advice is welcome.

thank you!

Related Posts