[08:00:33] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:45:58] serviceops, Prod-Kubernetes, Data-Platform-SRE (2024.02.12 - 2024.03.03), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Gehel)
[10:46:31] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[10:57:51] serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (hnowlan)
[11:29:26] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (kostajh)
[11:29:47] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (kostajh)
[11:52:28] serviceops, EventStreams, Prod-Kubernetes, Kubernetes: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Clement_Goubert) Deployed, we'll see how the memory consumption evolves. I agree the data above is a strong indicator of a memory leak...
[11:52:59] serviceops, EventStreams, Prod-Kubernetes, Kubernetes: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Clement_Goubert) Open→In progress p:Triage→Low
[11:53:06] serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (Clement_Goubert)
[12:35:34] serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (Clement_Goubert)
[12:35:49] serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (Clement_Goubert)
[12:36:03] serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (Clement_Goubert) Open→In progress p:Triage→Low
[12:43:54] serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (Clement_Goubert) I just saw @akosiaris already bumped it 200MiB last week. If that new naive increase isn't...
[12:44:20] serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (akosiaris) Already bumped by 200Mi in a9f958e50e5f5f4a8 ( T266216 ). I think we 'll need instead to dig a b...
[13:27:52] serviceops, CirrusSearch, Data-Platform-SRE, Discovery-Search: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Gehel) a:brouberol→None
[13:48:17] serviceops, SRE Observability, Patch-For-Review, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (fgiunchedi) The patch above does essentially that, i.e. match `SystemdUnitFailed` semantics to what...
[14:51:41] serviceops, SRE Observability, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (fgiunchedi) Open→Resolved a:fgiunchedi Since we're back to Icinga semantics in terms of waiting before alerti...
[14:57:10] serviceops, SRE Observability, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (Clement_Goubert) Thanks!
[15:10:19] serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) I can't seem to access the idrac remotely. Is it okay if I power down the server at this time?
[15:16:36] serviceops, Data-Engineering, Wikidata, Wikidata-Termbox, and 3 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) >>! In T355685#9491036, @akosiaris wrote: >>>! In T355685#9491033, @Lucas_Werkmeister_WMDE wrote: >>>>! In T355685#9490969, @akosi...
[15:21:19] serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan) >>! In T355333#9529271, @Jhancock.wm wrote: > I can't seem to access the idrac remotely. Is it okay if I power down the server at this time?
I had some weirdness wh...
[15:43:34] serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted it. stayed up this time. should have repl...
[15:45:06] serviceops, Thumbor: Consider moving to haproxy ingress for Thumbor workers - https://phabricator.wikimedia.org/T357145 (hnowlan)
[15:57:40] serviceops, Thumbor: Error accessing File:KlimtDieJungfrau.jpg after it was included on the enwiki Main Page - https://phabricator.wikimedia.org/T354858 (hnowlan) Open→Resolved a:hnowlan This may have been a question of resources, this image is now rendering after some recent bumps.
[16:05:31] serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (EBernhardson) I haven't managed to track down where the `Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic`...
[16:06:09] serviceops, Thumbor, Kubernetes: Consider moving to haproxy ingress for Thumbor workers - https://phabricator.wikimedia.org/T357145 (JMeybohm) I would actually love if we could try to reproduce what we do with haproxy with istio ingressgateway before introducing another ingress controller (but tbh I...
[16:48:39] serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (hnowlan) This is kinda verging on a UBN for us as we go into the weekend because it's causing a lot of spam and it'll hide other err...
[17:38:16] jayme: not a bad idea as regards ^ I believe we can use envoy's connection pooling to accomplish similar to what haproxy does - might be just as easy to research adding support for that to thumbor's workflow as implementing another ingress. I think we can use circuit breaking to get the same behaviours around queue limits etc, although I know less about how the connection pooling handles
[17:38:22] pending requests
[17:38:35] seems they very deliberately avoid calling pending requests queuing
[17:49:07] serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (EBernhardson) If we need them silenced, best bet is probably to re-enable the writes for these wikis. Can be done with a `mediawiki-...
[18:10:05] serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 (CDanis) All pods on k8s-aux-eqiad restarted, thanks @akosiaris for the script.
[23:47:38] serviceops, Data-Engineering, EventStreams, Prod-Kubernetes, and 2 others: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Ottomata) > wondering about the stream connection duration IIRC, varnish(?) sets a http timeout of something like...
[23:55:53] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey)