[05:04:32] <_joe_> urandom: what advantages do you expect from using http/2 ?
[05:04:51] <_joe_> because I'd expect the opposite of advantages, generally
[05:05:26] <_joe_> http 2 in the datacenter is more or less harmful, unless you decide to really push it in terms of support on both the server and the client
[12:43:41] will be rolling out the NTP change shortly. affects 2169 hosts so will let Puppet roll it out gradually
[12:44:09] I am one of the people on call, so if it breaks, I keep the pieces :]
[12:55:28] <_joe_> ack
[12:55:35] * _joe_ silences the pager
[12:56:06] _joe_: what happened to positive vibes only! I see those stickers around
[12:56:07] :P
[12:56:42] <_joe_> sukhe: the pager off is good vibes
[12:56:45] <_joe_> for me at least
[12:58:37] * kamila_ has saved up all the positive vibes from debugging haunted hardware and has them ready
[12:59:04] kamila_: anything related to hardware debugging is a multiplier, so 5x
[12:59:47] I have _allll_ the positive vibes
[13:14:32] _joe_: yeah, I guess I was thinking "more compact" and "less blocking", but I doubt the former would register here, and I wasn't thinking about the concurrency model (or lack thereof) in a mw request
[13:15:18] I can't get over how many session requests are stacked into a single mw request
[13:16:01] ~4ms sounds great until you string together a half dozen or more of them
[13:16:03] <_joe_> don't get me wrong, that is 100% true
[13:16:21] <_joe_> at least we don't make calls from /every/ request
[13:16:27] <_joe_> even ones without a session cookie
[13:16:47] that would be game over :)
[13:28:10] As a heads up: in 10-15 minutes (if there are no objections), I'll upgrade Cassandra on the sessionstore cluster. I canary'd one node yesterday, and this version is well tested elsewhere, so I expect no issues. It's sessions though, so just in case... :)
[13:30:09] * kamila_ shall prepare their good vibes
[13:33:26] as FYI for the oncallers, I am draining + restarting stuff on ganeti1029.eqiad.wmnet (glibc updates)
[13:33:45] noted, all
[13:38:20] <_joe_> urandom: we used to call sessionstore 2x per request even without a session cookie
[13:38:24] <_joe_> that was fixed :)
[13:38:40] <_joe_> it was IIRC some extension doing funky stuff with deprecated interfaces
[13:41:17] Wow
[14:37:09] any kafka person around to help with debugging T367076 (benthos consumer group lag)? I'm not very familiar with kafka and I don't see anything obvious
[14:37:10] T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076
[14:49:16] * claime summons brouberol
[14:49:17] :p
[14:51:38] I'm familiar enough with kafka, but not really with benthos. We can have a look though
[14:51:42] let me catch up
[14:54:59] on what cluster should the topic be? I'm not seeing it on either kafka-main or jumbo
[14:55:40] it's plainly visible in the grafana dashboard, nvm
[14:57:03] I spot checked both the mediawiki.httpd.accesslog and __consumer_offsets topics, both look fine
[14:58:39] thank you brouberol <3
[14:59:16] for sure!
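(For reference, a minimal sketch of one way such a topic spot check can be done from a host with access to the logging cluster. It assumes kcat is installed; the broker name is taken from the lsof output further down, the TLS settings are illustrative, and this is not necessarily what was actually run.)

```
# sample the last 5 messages per partition and exit, without joining a consumer group
kcat -C -b kafka-logging1003.eqiad.wmnet:9093 -X security.protocol=SSL \
  -t mediawiki.httpd.accesslog -o -5 -e

# list partitions/leaders for the topic to confirm it looks healthy
kcat -L -b kafka-logging1003.eqiad.wmnet:9093 -X security.protocol=SSL \
  -t mediawiki.httpd.accesslog
```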
[14:59:25] `kafka consumer-groups --bootstrap-server $TOPICMAPPR_KAFKA_ADDR --describe --group benthos-mw-accesslog-metrics` <--- you can indeed see lag building up
[14:59:29] I don't see anything obvious on the benthos side, though I could tell it to tell me more things, that's going to be the next step if a next step turns out to be needed
[14:59:56] (it has neither debug logging nor granular metrics rn, so I'll enable those when I get desperate)
[15:00:41] I was going to say "it's weird that *all* consumer group members are stuck" but it's all managed by a single benthos instance
[15:00:50] <_joe_> yep
[15:01:03] CONSUMER-ID HOST CLIENT-ID
[15:01:03] kgo-178ca9f2-1dc1-4848-a2e2-58ab7081021f /2620:0:861:102:10:64:16:86 kgo
[15:01:05] yeah
[15:01:06] for all partitions
[15:01:31] <_joe_> kamila_: I would suppose one possible issue is benthos can't process things fast enough from each topic partition
[15:01:32] so, naive questions first, I guess. Has this behavior survived a benthos restart?
[15:01:58] that is a fair question, I'm not sure, will try that
[15:02:02] (in a meeting rn)
[15:02:02] that's one possibility. I can spot check the traffic on this topic over time, see if a sudden increase lines up with lag building up
[15:02:45] I'd be surprised if it's a volume thing, it's not impossible (volume has been steadily increasing) but it doesn't look like that at first sight
[15:03:43] the traffic seems to be steady https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad+prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=mediawiki.httpd.accesslog&from=1717299153548&to=1718033612971
[15:05:50] what about the host load over time?
[15:06:35] what's odd to me is that we see a slight change in offset commit frequency lining up with the lag buildup in https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad+prometheus%2Fops&orgId=1&var-topic=All&var-consumer_group=benthos-mw-accesslog-metrics&from=1717400805348&to=1717471005478
[15:07:17] well, it's not odd per se, sorry, but it's interesting. It means that benthos is struggling somehow, but it's not a hard error, like a connection timeout
[15:08:21] sorry, reading now. I don't own those benthos instances but there should be internal metrics about processing lag, if that's the case
[15:08:42] the host seems pretty hot
[15:08:42] brouberol@centrallog1002:~$ uptime
[15:08:42] 15:08:23 up 22 days, 2:43, 1 user, load average: 26.56, 26.87, 25.91
[15:09:01] is there a loadavg over time metric I could have a look at for this host?
[15:09:27] <_joe_> yes
[15:09:35] <_joe_> "host overview" dashboard
[15:09:39] https://grafana.wikimedia.org/goto/N6w8vtwSR?orgId=1
[15:09:56] thanks
[15:09:57] benthos 2381537 986 0.5 5110500 383668 ? Rsl Jun24 36478:56 /usr/bin/benthos --watcher --config /etc/benthos/webrequest_live.yaml
[15:09:57] benthos 3370840 279 0.4 4937052 275476 ? Ssl Jun10 64207:06 /usr/bin/benthos --watcher --config /etc/benthos/mw_accesslog_metrics.yaml
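(A hedged aside: the kind of commands that can map each benthos instance to its port and its CPU time, assuming the instances listen locally and that /etc/benthos/*.env carries the PORT assignment, as the output further down suggests; anything beyond the names already shown in the log is illustrative.)

```
# which benthos process listens on which port (needs root to resolve PIDs)
sudo ss -ltnp | grep benthos

# which env file declares which port
grep -H PORT /etc/benthos/*.env

# cumulative CPU time per instance, to see which config is the hot one
ps -o pid,etime,time,pcpu,args -C benthos
```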
[15:09:59] this could match the timing (sorry, I don't know what that instance is from the port)
[15:10:20] the CPU usage by benthos is Too Damn High ™️
[15:10:40] sorry, do not consider that panel I posted
[15:10:45] ack
[15:10:48] it's clearly the wrong one
[15:10:54] _joe_ thanks, looking
[15:11:36] brouberol: which port is benthos using on those instances (if you still have a shell on those hosts)
[15:12:08] (meanwhile I'm merging changes to benthos but on a completely unrelated instance)
[15:12:30] the host seems pretty steady over time in terms of load https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=centrallog1002&var-datasource=thanos&var-cluster=syslog&from=1717400805348&to=1717471005478
[15:12:35] fabfur checking
[15:13:22] I'm seeing
[15:13:22] benthos 3370840 benthos 18u IPv6 90922507 0t0 TCP centrallog1002.eqiad.wmnet:41826->kafka-logging1003.eqiad.wmnet:9093 (ESTABLISHED)
[15:13:47] nvm, sorry, that'd be for producing, whereas benthos is a consumer
[15:14:32] fabfur@centrallog1002:~$ cat /etc/benthos/*.env | grep PORT
[15:14:32] PORT=4152
[15:14:32] PORT=4151
[15:14:41] ok so those two instances
[15:15:51] don't see any particular issue on the dashboard for these two instances
[15:16:35] my changes **should** be totally unrelated (limited to ulsfo, and the messages are sent to a kafka topic that is experimental, not the one that this benthos instance consumes)
[15:17:15] anyway I reverted the change (unfortunately it didn't fix the behavior)
[15:17:48] let's see if this has an impact, in which case we would've found a very very strange correlation
[15:19:22] I wonder if the first occurrence of the `level=error msg="Value interpolation error: expected number value, got null"` logs lines up with the lag increase
[15:19:43] on centrallog1002 ?
[15:19:47] yep
[15:20:14] I'm seeing these as part of the `benthos@mw_accesslog_metrics.service` logs
[15:21:02] no changes in log format I assume...
[15:22:20] Is there an opensearch dashboard for systemd/journald logs?
[15:22:32] or are the logs local to the host?
[15:24:04] (sorry, don't know this)
[15:25:48] they don't line up: Jun 22 22:18:03 centrallog1002 benthos@mw_accesslog_metrics[3370840]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
[15:25:55] this one is from 4d ago
[15:27:47] so, the benthos CG members are able to commit (verified by seeing different CURRENT-OFFSET values when describing the consumer group), the topic is seeing the ~same amount of traffic, and the host is loaded, but that seems to be consistent over time
[15:29:03] this started in the first days of June, correct?
[15:29:42] June 3rd
[15:30:17] and the lag has mostly been going up ever since
[15:34:18] when the lag was briefly going down, you can see an increase in the offset commit rate, but that does not seem to align with any change in host load, or any saturation metric that I can see
[15:42:38] given that the lag is currently sitting at 600M messages, I have several tests in mind
[15:44:53] a) what happens if you restart the service?
[15:44:53] b) what happens if we reset the CG offset to `earliest`? I wonder if we're impacted by a certain type of message that is maybe more expensive to deserialize
[15:44:53] c) same question with `latest`, if earliest does not help. This would reset the CG to the latest available message, leaving a large gap of unconsumed messages (which can always be backfilled as long as the data is in Kafka)
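(A sketch of how the offset-reset options (b)/(c) could be carried out with the stock Kafka tooling; this assumes the `kafka` wrapper used above passes arguments through to kafka-consumer-groups.sh, and note that the group must have no active members — i.e. benthos stopped — for the reset to be accepted.)

```
# stop the consumer first; offset resets are rejected while the group has active members
sudo systemctl stop benthos@mw_accesslog_metrics.service

# preview the reset (swap --to-latest for --to-earliest for option b),
# then replace --dry-run with --execute to apply it
kafka consumer-groups --bootstrap-server "$TOPICMAPPR_KAFKA_ADDR" \
  --group benthos-mw-accesslog-metrics \
  --topic mediawiki.httpd.accesslog \
  --reset-offsets --to-latest --dry-run

sudo systemctl start benthos@mw_accesslog_metrics.service
```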
[15:45:42] this all depends on how critical the data is, but I'm assuming that these are doable tests, as the lag has been building up for 3 weeks and nobody screamed fire
[15:47:18] I also wonder if we can see a jump in the consume request time around that time
[15:49:38] mmmm really don't know how to help here, sorry...
[15:51:33] Looking at the systemd services on centrallog, I can't seem to find a prometheus job that would scrape metrics from benthos. I'd be interested in seeing metrics reflecting the consume fetch rate over time
[15:52:17] if somehow benthos can't fetch data fast enough, then a lag can build up. That's usually a reflection of kafka itself being loaded. That being said, I'm not seeing any load on the cluster itself.
[15:52:48] benthos exports its own metrics (internal and from processed logs) without needing a dedicated prom exporter
[15:53:25] is there a thanos/anything page listing all metrics exported by benthos?
[15:54:52] or maybe easier: what's the port it uses to expose the metrics? I can curl and have a look
[15:55:58] http://localhost:4152/metrics did the trick
[15:56:07] https://docs.redpanda.com/redpanda-connect/components/metrics/about/ this is what's exported, the port should be the same
[15:56:16] 4151 and 4152
[15:58:51] I think these are the same metrics in the dashboard I linked above
[16:02:26] sadly, they don't expose the metrics from their kafka client
[16:02:38] I have to step out, baby duty
[16:26:58] tysm b.rouberol, I'll do my homework and report back <3
[18:09:11] I want to get a +1 for running this
[18:09:21] etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 rmdir /conftool/v1/pools/$site/dnsbox/ntp
[18:09:32] preceding this change is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/7c25b4c732bbacebc6567ecff48b03e1e9524e31%5E%21/#F2
[18:10:04] which removed the values but not the directory itself, and I want to do it right now
[18:10:11] will wait for a +1 since it's etcd :)
[18:10:16] sukhe: +1
[18:10:21] thanks cdanis
[18:11:46] Error: 110: The request requires user authentication (Insufficient credentials) [0]
[18:12:10] gotta be real root?
[18:12:14] I remember this one I think, I have to pass --username root
[18:12:45] let me see
[18:13:12] sukhe: make sure you have a login shell
[18:13:16] you need root's environment
[18:14:19] so that + --username root using the password in .etcdrc
[18:15:26] if your HOME=/root it should just work I think
[18:15:45] cdanis: I didn't debug further, but it didn't, and passing --username root did
[18:16:18] ah okay
[18:16:28] probably just me forgetting
[18:16:32] sounds good thanks :)
[18:16:43] I logged it for the future since I think volans|off pointed it out once
[18:21:02] hmmm ... that sounds like something conftool-sync should ideally have cleaned up. if there's no task for it already, I'll open one.
[18:22:19] swfrench-wmf: oh really? I assumed the keys get cleaned up but not the directories
[18:22:37] if that's the case, thanks! the full key was /conftool/v1/pools/eqiad/dnsbox/ntp
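(Putting the pieces of the above exchange together, a minimal sketch of the cleanup sequence as discussed — assuming the etcd v2 API, that a root login shell sets HOME=/root so the password in /root/.etcdrc is used, and that etcdctl would otherwise prompt for it. This mirrors what was described, not a verbatim transcript of what was run.)

```
# get root's environment via a login shell so ~/.etcdrc is picked up
sudo -i

# confirm the directory is actually empty before removing it
etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 --username root \
  ls /conftool/v1/pools/eqiad/dnsbox/ntp

# remove the now-empty directory
etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 --username root \
  rmdir /conftool/v1/pools/eqiad/dnsbox/ntp
```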
[18:23:13] right, no, I just mean that it would be nice conftool-sync also cleaned empty directories :)
[18:23:20] yep
[18:23:20] *nice if
[18:23:28] I am very afraid of doing this rmdir thing :)
[18:24:45] yeah, it would be ideal if folks didn't have to ever perform mutating operations with etcdctl directly
[18:26:33] the fun part here is that when conftool switches to the v3 API, this should go away (no directories in v3, it's a flat keyspace ... yes, that also opens a can of worms about key structure / hierarchy)
[18:40:27] [FYI] I'm going to bump the severity of the KubernetesDeploymentUnavailableReplicas alert so it logs to -operations (T366932). based on the past 2w of history, I don't expect this to be noisy, but will keep an eye on it regardless.
[18:40:28] T366932: Alerting on under-scaled deployments - https://phabricator.wikimedia.org/T366932
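(As an aside on the v3 API point above: a purely illustrative sketch of why the empty-directory problem disappears there — in v2, directories are real objects that linger until removed; in v3, the keyspace is flat and "directories" are just key prefixes. The endpoint and paths reuse the ones from the conversation.)

```
# v2 API: directories are first-class, so an empty one lingers until it is rmdir'd
etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls --recursive /conftool/v1/pools/eqiad/dnsbox

# v3 API: a flat keyspace queried by prefix; once the keys are gone, nothing remains
ETCDCTL_API=3 etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 \
  get --prefix --keys-only /conftool/v1/pools/eqiad/dnsbox/
```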