[00:16:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[00:19:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[00:39:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[00:43:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[01:13:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[01:23:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[02:58:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[03:09:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[03:39:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[03:47:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[03:57:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[03:58:13] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[04:03:13] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[04:12:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[04:57:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[04:58:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[05:23:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[05:29:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[05:34:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[05:38:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[05:48:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[05:49:13] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[06:19:13] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[06:19:28] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[07:43:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[07:53:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[07:58:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[08:05:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[08:26:38] I've silenced the alert for benthos-mw-accesslog-sampler; known issue.
[15:22:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:07:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:43] ^ taking a look.
[17:37:24] It seems like it failed after receiving invalid JSON.
[18:02:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:25] FIRING: SystemdUnitFailed: statograph_post.service on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:33] I'm still investigating that alert.
[18:09:35] hi folks - did something happen involving benthos@mw_accesslog_metrics.service on centrallog1002 a bit after 15:00 UTC today?
[18:10:21] seeing no data for mediawiki_http_requests_duration_count, which IIRC comes from there
[18:10:38] denisse: wondering if this might be related to what you're investigating ^
[18:10:54] IIRC statograph uses that metric for one of the public graphs
[18:12:31] swfrench-wmf: The times correlate. Statograph seems to fail after receiving unparseable JSON, but I haven't found the root cause yet.
[18:12:42] https://gerrit.wikimedia.org/g/operations/puppet/+/e6882ff2f81698dd2a79710bf6c386c6ba8321dc/hieradata/common/profile/statograph.yaml#24
[18:12:48] denisse: it queries Prometheus for JSON, so that seems likely
[18:12:56] I don't think it handles the case where it gets an empty response or an error response well
[18:13:08] do you have a full log handy?
[18:13:17] cdanis: The ID matches what I can see in the logs.
[18:13:21] Yes, I'll share it in a moment.
[18:15:08] Status page ID: lyfcttm2lhw4 https://phabricator.wikimedia.org/P66673
[18:15:41] `Jul 16 15:12:29 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1` <- from the systemd journal
[18:15:46] those are the very last logs
[18:16:29] "Out of range float values are not JSON compliant" ... I'm guessing that means a NaN or something
[18:17:33] yeah, the current value returned by PromQL (at least in Grafana Explore for the same query) is NaN
[18:17:33] Could it be that the wiki response time returned a float?
[18:17:48] it always returns a float, but NaN is not valid JSON ;)
[18:18:25] so, I suspect that benthos misbehavior is the root cause here, and the statograph errors are just a knock-on effect
[18:18:48] https://phabricator.wikimedia.org/T369256 <- could it be that benthos doesn't like something about the kafka changes earlier today?
[18:19:11] that was a fair bit earlier
[18:24:29] hm
[18:25:57] swfrench-wmf: those messages are logged very often and are a red herring
[18:25:59] Jul 16 15:12:28 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="coordinator was not returned in broker response" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
[18:26:01] Jul 16 15:12:23 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="unable to dial: dial tcp: lookup kafka-logging1004.eqiad.wmnet: operation was canceled" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
[18:26:19] I think the actual issue is that it couldn't reach Kafka for some reason
[18:26:39] That seems reasonable.
[18:27:04] I can connect to all of those hosts/ports from centrallog1002, though
[18:27:55] the *first* message was
[18:27:56] Jul 16 15:11:59 centrallog1002 benthos@mw_accesslog_metrics[2184438]: level=error msg="default commit failed" @service=benthos err="the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists" group=benthos-mw-accesslog-metrics label=accesslog path=root.input
[18:27:58] cdanis: interesting, 15:12 is ~ exactly when the metrics stopped
[18:28:02] yeah, so, there it is
[18:28:13] and then we saw i/o timeouts, and ... it never tried again?
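To make the NaN theory above concrete: the Prometheus HTTP API encodes sample values as JSON strings, so a query whose current value is NaN parses cleanly on the client, but re-serializing that float under strict JSON rules fails with the very message statograph logged. Below is a minimal Python sketch of that failure mode; the response body is trimmed to the relevant fields, and the `allow_nan=False` serialization step is an assumption about what statograph (or one of its libraries) does, not its actual code.

```python
# Minimal sketch (not statograph's real code) of how a Prometheus query whose
# result is NaN can end in "Out of range float values are not JSON compliant".
import json

# Prometheus returns sample values as strings, so a NaN result is representable
# in the API response even though a bare NaN literal is not valid JSON.
api_response = '{"status":"success","data":{"result":[{"value":[1721142749,"NaN"]}]}}'

parsed = json.loads(api_response)
value = float(parsed["data"]["result"][0]["value"][1])  # float("NaN") parses fine

# Re-serializing that float for a downstream API with strict JSON enabled raises
# a ValueError whose message matches the one quoted from the statograph logs.
try:
    json.dumps({"timestamp": 1721142749, "value": value}, allow_nan=False)
except ValueError as err:
    print(err)  # Out of range float values are not JSON compliant
```

If that is indeed the failure mode, skipping or dropping NaN samples before posting would let statograph degrade gracefully; either way, as noted above, the statograph errors are a knock-on effect of benthos no longer producing the metric.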
[18:28:17] ah, there we go - I just didn't query early enough
[18:28:40] Should I restart benthos?
[18:29:14] denisse: one moment
[18:29:30] 💙cdanis@centrallog1002.eqiad.wmnet ~ 🕝☕ sudo strace -p 2184438
[18:29:32] strace: Process 2184438 attached
[18:29:34] futex(0x6c8da08, FUTEX_WAIT_PRIVATE, 0, NULL
[18:29:36] heh it's waiting forever on something
[18:29:38] perfect
[18:29:59] I love those emojis, are they part of your shell?
[18:30:24] https://github.com/cdanis/dotfiles/blob/master/zsh/.zshfunc/prompt_cdanis1_setup :)
[18:30:31] the clock is accurate to half an hour ;)
[18:30:48] Nice. :)
[18:31:30] ok swfrench-wmf denisse I think we should send that benthos pid SIGQUIT, which I think will make it dump stack traces
[18:31:45] I'm gonna go out on a limb and guess it's been stuck in that futex call for hours now
[18:32:01] Sure, I'll send the SIGQUIT.
[18:33:15] mmm delicious stack traces
[18:33:43] I've sent the SIGQUIT signal: sudo kill -3 2184438
[18:33:59] yep, stacks are in the journal
[18:34:46] Thanks so much for your help debugging the issue! <3
[18:35:42] https://phabricator.wikimedia.org/P66676
[18:35:48] these are the two stack traces matching kafka
[18:36:03] metrics!
[18:36:34] the first one says "200 minutes" which, if that is how long that goroutine has been sleeping,
[18:36:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[18:36:45] that's about 15:12 :)
[18:37:06] so, I think we've found a new benthos bug, where franz-kafka (or however benthos embeds it) can get stuck this way (hooray)
[18:37:06] That's a lot of time!
[18:37:40] Concurrency is complicated.
[18:40:25] cdanis: Do you think we should report this bug upstream?
[18:40:55] it's worth taking a quick look to see if there are similar reports upstream
[18:41:20] I commented on T369256 with my findings but I have to get back to my actual work now :)
[18:41:20] T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic - https://phabricator.wikimedia.org/T369256
[18:41:42] cdanis: Thanks for your help debugging!
[18:42:48] I forgot it's no longer called Benthos but redpanda-connect. 🥲
[18:43:11] they're making more progress on the renaming than Splunk did
[20:01:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
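A note on the `kill -3` step above: it works because the Go runtime's default SIGQUIT handler prints every goroutine's stack and then exits, which is why the traces showed up in the benthos journal. A Python daemon gets no such behaviour for free; the closest stdlib equivalent is `faulthandler`. The sketch below is for a hypothetical, unrelated Python service (not benthos or statograph), in case the same trick is ever needed there.

```python
# Hypothetical Python daemon showing the "dump all stacks on SIGQUIT" trick.
# Go programs get this from the runtime for free (which is what `kill -3`
# exploited above); in Python the stdlib faulthandler module provides it.
import faulthandler
import signal
import threading
import time


def worker() -> None:
    # Stand-in for a consumer loop that could wedge on a lock or a socket.
    while True:
        time.sleep(60)


if __name__ == "__main__":
    # On SIGQUIT, write every thread's traceback to stderr and keep running
    # (unlike Go's default handler, which dumps goroutines and then exits).
    faulthandler.register(signal.SIGQUIT, all_threads=True)

    threading.Thread(target=worker, name="consumer", daemon=True).start()
    while True:
        time.sleep(3600)
```

Usage is the same as above: `kill -QUIT <pid>` (i.e. `kill -3`) and read the thread tracebacks from stderr or the journal; unlike Go, `faulthandler` leaves the process running after the dump.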