[08:00:33] <wikibugs>	 serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:45:58] <wikibugs>	 serviceops, Prod-Kubernetes, Data-Platform-SRE (2024.02.12 - 2024.03.03), Kubernetes, Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (Gehel)
[10:46:31] <wikibugs>	 serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[10:57:51] <wikibugs>	 serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (hnowlan)
[11:29:26] <wikibugs>	 serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (kostajh)
[11:29:47] <wikibugs>	 serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (kostajh)
[11:52:28] <wikibugs>	 serviceops, EventStreams, Prod-Kubernetes, Kubernetes: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Clement_Goubert) Deployed, we'll see how the memory consumption evolves. I agree the data above is a strong indicator of a memory leak...
[11:52:59] <wikibugs>	 serviceops, EventStreams, Prod-Kubernetes, Kubernetes: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Clement_Goubert) Open→In progress p:Triage→Low
[11:53:06] <wikibugs>	 serviceops, Prod-Kubernetes, observability, Kubernetes, Patch-For-Review: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (Clement_Goubert)
[12:35:34] <wikibugs>	 serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (Clement_Goubert)
[12:35:49] <wikibugs>	 serviceops, Prod-Kubernetes, observability, Kubernetes: Increase visibility of container/pod ressource exhaustion - https://phabricator.wikimedia.org/T266216 (Clement_Goubert)
[12:36:03] <wikibugs>	 serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (Clement_Goubert) Open→In progress p:Triage→Low
[12:43:54] <wikibugs>	 serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (Clement_Goubert) I just saw @akosiaris already bumped it 200MiB last week. If that new naive increase isn't...
[12:44:20] <wikibugs>	 serviceops, Add-Link, Growth-Team, Prod-Kubernetes, Kubernetes: linkrecommendation-internal regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357122 (akosiaris) Already bumped by 200Mi in a9f958e50e5f5f4a8 (T266216). I think we'll need instead to dig a b...
[13:27:52] <wikibugs>	 serviceops, CirrusSearch, Data-Platform-SRE, Discovery-Search: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (Gehel) a:brouberol→None
[13:48:17] <wikibugs>	 serviceops, SRE Observability, Patch-For-Review, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (fgiunchedi) The patch above does essentially that, i.e. match `SystemdUnitFailed` semantics to what...
[14:51:41] <wikibugs>	 serviceops, SRE Observability, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (fgiunchedi) Open→Resolved a:fgiunchedi Since we're back to Icinga semantics in terms of waiting before alerti...
[14:57:10] <wikibugs>	 serviceops, SRE Observability, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (Clement_Goubert) Thanks!
[15:10:19] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) I can't seem to access the idrac remotely. Is it okay if I power down the server at this time?
[15:16:36] <wikibugs>	 serviceops, Data-Engineering, Wikidata, Wikidata-Termbox, and 3 others: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 (akosiaris) >>! In T355685#9491036, @akosiaris wrote: >>>! In T355685#9491033, @Lucas_Werkmeister_WMDE wrote: >>>>! In T355685#9490969, @akosi...
[15:21:19] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan) >>! In T355333#9529271, @Jhancock.wm wrote: > I can't seem to access the idrac remotely. Is it okay if I power down the server at this time? I had some weirdness wh...
[15:43:34] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted it. stayed up this time. should have repl...
[15:45:06] <wikibugs>	 serviceops, Thumbor: Consider moving to haproxy ingress for Thumbor workers - https://phabricator.wikimedia.org/T357145 (hnowlan)
[15:57:40] <wikibugs>	 serviceops, Thumbor: Error accessing File:KlimtDieJungfrau.jpg after it was included on the enwiki Main Page - https://phabricator.wikimedia.org/T354858 (hnowlan) Open→Resolved a:hnowlan This may have been a question of resources, this image is now rendering after some recent bumps.
[16:05:31] <wikibugs>	 serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (EBernhardson) I haven't managed to track down where the `Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic`...
[16:06:09] <wikibugs>	 serviceops, Thumbor, Kubernetes: Consider moving to haproxy ingress for Thumbor workers - https://phabricator.wikimedia.org/T357145 (JMeybohm) I would actually love if we could try to reproduce what we do with haproxy with istio ingressgateway before introducing another ingress controller (but tbh I...
[16:48:39] <wikibugs>	 serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (hnowlan) This is kinda verging on a UBN for us as we go into the weekend because it's causing a lot of spam and it'll hide other err...
[17:38:16] <hnowlan>	 jayme: not a bad idea as regards ^ I believe we can use envoy's connection pooling to accomplish something similar to what haproxy does - might be just as easy to research adding support for that to thumbor's workflow as implementing another ingress. I think we can use circuit breaking to get the same behaviours around queue limits etc, although I know less about how the connection pooling handles pending requests
[17:38:35] <hnowlan>	 seems they very deliberately avoid calling pending requests "queuing"
[17:49:07] <wikibugs>	 serviceops, CirrusSearch, Discovery-Search (Current work): High level of backend errors for CirrusSearch jobs in jobrunners - https://phabricator.wikimedia.org/T356526 (EBernhardson) If we need them silenced, best bet is probably to re-enable the writes for these wikis. Can be done with a `mediawiki-...
[18:10:05] <wikibugs>	 serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 (CDanis) All pods on k8s-aux-eqiad restarted, thanks @akosiaris for the script.
[23:47:38] <wikibugs>	 serviceops, Data-Engineering, EventStreams, Prod-Kubernetes, and 2 others: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Ottomata) > wondering about the stream connection duration IIRC, varnish(?) sets a http timeout of something like...
[23:55:53] <wikibugs>	 serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Sbailey)