[00:10:14] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:04] <wikibugs>	 (03PS24) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147)
[00:22:24] <wikibugs>	 (03CR) 10Fabfur: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur)
[00:24:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626939 (10phaultfinder)
[00:38:45] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692
[00:38:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 (owner: 10TrainBranchBot)
[00:39:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626951 (10phaultfinder)
[00:50:22] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 (owner: 10TrainBranchBot)
[00:54:58] <wikibugs>	 (03CR) 10Ssingh: "Looks good, mostly questions/nits and no hard blockers IMO." [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[01:08:39] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694
[01:08:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 (owner: 10TrainBranchBot)
[01:27:40] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 (owner: 10TrainBranchBot)
[01:44:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626974 (10phaultfinder)
[02:11:10] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:09:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627060 (10phaultfinder)
[04:19:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627118 (10phaultfinder)
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:11] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126707
[05:14:28] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126708
[05:28:00] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:26] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:29:18] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:08] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:16] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:50] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:56:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0600)
[06:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627163 (10phaultfinder)
[06:21:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:36:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:40:52] <wikibugs>	 (03PS10) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639
[06:49:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:49:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627185 (10phaultfinder)
[07:08:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:19:25] <wikibugs>	 (03PS11) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639
[07:20:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto)
[07:26:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1037.eqiad.wmnet
[07:33:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Add astein to authorised Icinga users [puppet] - 10https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186)
[07:38:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add astein to authorised Icinga users [puppet] - 10https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186) (owner: 10Muehlenhoff)
[07:41:15] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639
[07:44:56] <wikibugs>	 (03PS13) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639
[07:45:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] sqlite: require sqlite::package in 'file' db resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: 10Filippo Giunchedi)
[07:50:51] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10627205 (10MoritzMuehlenhoff) 05Open→03Resolved @AStein-WMF You should now be able to log into...
[07:51:46] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto)
[07:55:27] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10627212 (10Marostegui) 05Open→03Resolved Everything looks good, thank you!
[07:56:25] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede)
[07:57:56] <wikibugs>	 (03CR) 10Elukey: [C:03+1] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[07:58:19] <wikibugs>	 (03CR) 10Elukey: [C:03+1] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[07:58:54] <wikibugs>	 (03CR) 10Elukey: [C:03+1] services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[08:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800).
[08:00:05] <jouncebot>	 _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800)
[08:00:44] <hashar>	 o/
[08:01:38] <_joe_>	 hashar: do we need to run the train now?
[08:01:51] <_joe_>	 it's strange to have such a superposition
[08:02:01] <wikibugs>	 (03PS1) 10Slyngshede: Revert "data.yaml temporaily remove SSH key for user" [puppet] - 10https://gerrit.wikimedia.org/r/1126910
[08:02:26] <_joe_>	 hashar: asking because otherwise I'll merge my changes
[08:02:51] <hashar>	 I have a ton of mediawiki config change to push
[08:02:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2235].codfw.wmnet,db[1176,1217,1228].eqiad.wmnet with reason: m5 master switch T388500
[08:02:55] <stashbot>	 T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500
[08:03:04] <hashar>	 the train windows overlap because of daylight saving time confusion
[08:03:15] <wikibugs>	 (03Merged) 10jenkins-bot: Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede)
[08:03:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[08:03:23] <hashar>	 its tied to Pacific time zone when really it should be tied to Europe :)
[08:03:31] <hashar>	 jouncebot: refresh
[08:03:32] <jouncebot>	 I refreshed my knowledge about deployments.
[08:03:35] <hashar>	 jouncebot: nowandnext
[08:03:35] <jouncebot>	 For the next 0 hour(s) and 56 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800)
[08:03:35] <jouncebot>	 In 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[08:03:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete custom partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1126570 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[08:03:47] <_joe_>	 hashar: I suspected something like that
[08:03:48] <_joe_>	 :D
[08:03:50] <_joe_>	 thanks
[08:03:54] <_joe_>	 can I proceed then?
[08:04:08] <hashar>	 for what?
[08:05:07] <hashar>	 I am deploying the patches from https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results
[08:06:45] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1228 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500)
[08:06:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete custom Partman recipes for labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955)
[08:08:49] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: make sure wait-puppet runs as root [puppet] - 10https://gerrit.wikimedia.org/r/1126912
[08:08:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: improve puppetserver git bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1126913
[08:09:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "\o/ \o/ \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[08:09:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar)
[08:09:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar)
[08:09:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar)
[08:09:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar)
[08:09:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar)
[08:09:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar)
[08:09:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125210 (owner: 10Reedy)
[08:10:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: make sure wait-puppet runs as root [puppet] - 10https://gerrit.wikimedia.org/r/1126912 (owner: 10Filippo Giunchedi)
[08:10:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve puppetserver git bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1126913 (owner: 10Filippo Giunchedi)
[08:10:24] <_joe_>	 hashar: uhm wait
[08:10:31] <wikibugs>	 (03Merged) 10jenkins-bot: Remove obsolete $wgAllowMicrodataAttributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar)
[08:10:33] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar)
[08:10:37] <wikibugs>	 (03Merged) 10jenkins-bot: Remove obsolete CirrusSearch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar)
[08:10:39] <wikibugs>	 (03Merged) 10jenkins-bot: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar)
[08:10:40] <_joe_>	 so you're backporting patches that weren't in the schedule before?
[08:10:41] <wikibugs>	 (03Merged) 10jenkins-bot: Remove Cognate legacy settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar)
[08:11:04] <_joe_>	 I'd have liked to discuss it
[08:11:19] <wikibugs>	 (03Merged) 10jenkins-bot: Remove obsolete $wgFlowMaintenanceMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar)
[08:11:20] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings.php: Remove unused NavigationTiming config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125210 (owner: 10Reedy)
[08:11:34] <hashar>	 they are all noop cleanup patches, we pushed some of those out of window on thursday
[08:11:54] <hashar>	 I have considered pushing them on Friday but moved that to Monday instead and forgot I had an appointment
[08:12:19] <_joe_>	 hashar: that's not the point, I had a deployment scheduled, I was verifying a few details about one of the patches before proceeding, you just moved in front of me. It's not really cool, but ok, I'll wait
[08:12:20] <logmsgbot>	 !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMaintenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]]
[08:12:21] <hashar>	 I went lazy and did not schedule them yesterday since the tuesday morning window was empty yesterday and it is often empty
[08:12:26] <stashbot>	 T207407: Remove legacy Database search integration of ArticlePlaceholder - https://phabricator.wikimedia.org/T207407
[08:12:26] <stashbot>	 T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526
[08:12:43] <_joe_>	 hashar: ping me when you're done
[08:13:35] <hashar>	 ah I see
[08:13:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Review Broadcom's storcli binary - https://phabricator.wikimedia.org/T388628 (10elukey) 03NEW
[08:13:55] <hashar>	 I guess next time I will schedule those so you are not caught off guard last minute
[08:14:15] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10627285 (10elukey) Opened T388628 to verify if we can use/import storcli in our apt repo.
[08:16:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 (owner: 10Slyngshede)
[08:16:10] <wikibugs>	 (03PS3) 10Filippo Giunchedi: pontoon: add Host / Filter [puppet] - 10https://gerrit.wikimedia.org/r/1126044
[08:16:17] <_joe_>	 hashar: it's about waiting in queue appropriately, you know, civil coexistence and mutual respect. "sorry" was the appropriate response here. In any case, let's move past this before I get even more upset :)
[08:16:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet
[08:16:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1037.eqiad.wmnet
[08:16:24] <logmsgbot>	 !log hashar@deploy2002 reedy, hashar: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMaintenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:16:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add Host / Filter [puppet] - 10https://gerrit.wikimedia.org/r/1126044 (owner: 10Filippo Giunchedi)
[08:17:22] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Revert "data.yaml temporaily remove SSH key for user" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 (owner: 10Slyngshede)
[08:19:14] <logmsgbot>	 !log hashar@deploy2002 reedy, hashar: Continuing with sync
[08:21:17] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "LGTM Added already-resolved comments. I grepped for db1176 and its ipaddr across other dbproxy* files without finding it." [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: 10Marostegui)
[08:22:04] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: 10Marostegui)
[08:24:21] <marostegui>	 !log Failover m5 from db1176 to db1228 - T388500
[08:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:26] <stashbot>	 T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500
[08:25:09] <wikibugs>	 (03PS2) 10Hashar: Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095
[08:25:20] <wikibugs>	 (03CR) 10Cyndywikime: [C:03+1] Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große)
[08:25:26] <logmsgbot>	 !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMaintenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] (duration: 13m 06s)
[08:25:30] <stashbot>	 T207407: Remove legacy Database search integration of ArticlePlaceholder - https://phabricator.wikimedia.org/T207407
[08:25:30] <stashbot>	 T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526
[08:25:47] <wikibugs>	 (03PS4) 10Filippo Giunchedi: pontoon: refactor host filtering with Host / HostFilter [puppet] - 10https://gerrit.wikimedia.org/r/1126045
[08:26:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor host filtering with Host / HostFilter [puppet] - 10https://gerrit.wikimedia.org/r/1126045 (owner: 10Filippo Giunchedi)
[08:26:30] <wikibugs>	 (03CR) 10Hashar: "My patch went to conflict with I775d9ec67f662ff3f30c097dd828833af86a29fe by @reedy@wikimedia.org . It also removed a duplicate `wfLoadExte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar)
[08:26:42] <hashar>	 checking logs after the full depoy
[08:26:43] <hashar>	 deploy
[08:27:20] <wikibugs>	 (03PS1) 10Marostegui: db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126918
[08:27:59] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126918 (owner: 10Marostegui)
[08:28:06] <hashar>	 _joe_: it looks all good.  And sorry next time I will add them all to the schedule instead of assuming that nobody else would use the window
[08:28:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1176.eqiad.wmnet
[08:28:28] <_joe_>	 hashar: I was even pinged here...
[08:28:31] <_joe_>	 anyways, ok
[08:28:44] <_joe_>	 proceeding
[08:29:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto)
[08:30:15] <wikibugs>	 (03Merged) 10jenkins-bot: noc/wiki.php: allow showing a single variable in json format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto)
[08:30:46] <logmsgbot>	 !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]]
[08:31:16] <wikibugs>	 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629 (10fgiunchedi) 03NEW
[08:32:23] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047
[08:32:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 (owner: 10Filippo Giunchedi)
[08:32:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1176.eqiad.wmnet
[08:33:33] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 (owner: 10Filippo Giunchedi)
[08:33:52] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:33:54] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Continuing with sync
[08:34:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: improve error messages and new-stack cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1126914
[08:34:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve error messages and new-stack cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1126914 (owner: 10Filippo Giunchedi)
[08:37:08] <wikibugs>	 (03PS4) 10Filippo Giunchedi: pontoon: integration tests [puppet] - 10https://gerrit.wikimedia.org/r/1126046
[08:37:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1037.eqiad.wmnet with OS bookworm
[08:38:10] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627344 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm
[08:39:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: integration tests [puppet] - 10https://gerrit.wikimedia.org/r/1126046 (owner: 10Filippo Giunchedi)
[08:40:20] <logmsgbot>	 !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] (duration: 09m 34s)
[08:41:00] <_joe_>	 proceeding with the second patch. it will have some small changes happen to things we're running
[08:41:03] <wikibugs>	 (03PS1) 10Brouberol: mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378)
[08:41:09] <wikibugs>	 (03PS1) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378)
[08:41:12] <wikibugs>	 (03PS1) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378)
[08:41:17] <wikibugs>	 (03PS8) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388)
[08:41:19] <wikibugs>	 (03PS1) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378)
[08:42:41] <wikibugs>	 (03PS1) 10Slyngshede: IDM: Switch to host running 0.1.7 [dns] - 10https://gerrit.wikimedia.org/r/1126924
[08:43:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto)
[08:45:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2230.codfw.wmnet,db1125.eqiad.wmnet with reason: Maintenance
[08:45:50] <wikibugs>	 (03CR) 10ArielGlenn: [C:03+1] "Thanks for this, looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert)
[08:45:58] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1176.eqiad.wmnet with reason: Maintenance
[08:46:17] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto)
[08:47:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1126924 (owner: 10Slyngshede)
[08:48:09] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] IDM: Switch to host running 0.1.7 [dns] - 10https://gerrit.wikimedia.org/r/1126924 (owner: 10Slyngshede)
[08:48:27] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:50:34] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[08:52:45] <logmsgbot>	 !log oblivian@deploy2002 Started scap sync-world: Updating k8s chart
[08:53:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:55:05] <logmsgbot>	 !log oblivian@deploy2002 Finished scap sync-world: Updating k8s chart (duration: 03m 42s)
[08:56:50] <_joe_>	 uh what's going on with mw-jobrunner?
[08:57:54] <jynus>	 checking
[08:58:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:58:19] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Move db1176 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630)
[08:58:23] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM! Thanks for the addition! I've left some questions and a couple of non-blocking nits. I'll leave to traffic the final approval." [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[08:58:29] <wikibugs>	 (03CR) 10Vgutierrez: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur)
[08:58:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Move db1176 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630) (owner: 10Marostegui)
[08:59:29] <jynus>	 saturation since 8:38
[08:59:49] <jynus>	 2 slowdowns before that, normally due to deploys
[09:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[09:00:21] <hashar>	 ^ train will be run tonight
[09:03:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage
[09:04:23] <jynus>	 _joe_: something points to something having happened at 8:39, but I believe your deploy was after that?
[09:06:01] <jynus>	 latency increased at 8:21
[09:06:18] <jynus>	 https://grafana.wikimedia.org/goto/3rEC1ShHR?orgId=1
[09:06:52] <jynus>	 my guess would be at hashar's deployment
[09:07:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage
[09:07:14] <_joe_>	 jynus: yes, it's "organic"
[09:07:21] <_joe_>	 and tbh ok if jobrunners are running hot
[09:07:31] <_joe_>	 as long as it's just "hot" and not "failing"
[09:07:34] <hashar>	 hmm
[09:07:44] <jynus>	 just fyi, hashar
[09:07:59] <hashar>	 all the patches I have pushed are removing unused mediawiki configs and all have been reviewed as doing just that afaik
[09:08:15] <jynus>	 there seems to be extra load since 8:20
[09:08:28] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi)
[09:08:29] <hashar>	 but I am not ruling out it might have caused some cascading effect somewhere! 
[09:08:30] <_joe_>	 which would square up with hashar's deployment
[09:08:52] <_joe_>	 take a look at jobs frequency, I can't spend time on this right now sorry
[09:08:59] <jynus>	 let me try to find out what the extra work is being spent on
[09:10:05] <wikibugs>	 (03Merged) 10jenkins-bot: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi)
[09:10:05] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm)
[09:10:48] <jynus>	 there is extra parsoidCacheprewarm, but that doesn't line up with the 8:20 timestamp
[09:10:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1125.eqiad.wmnet
[09:11:31] <jynus>	 the spikes that line up are refreshlinks
[09:11:45] <jynus>	 but they are not ongoing
[09:11:47] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092)
[09:12:27] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092) (owner: 10Marostegui)
[09:13:48] <wikibugs>	 (03CR) 10Vgutierrez: varnish: add log filters to slowquery logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[09:13:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] idm: Add approval rule for airflow-search-ops in production [puppet] - 10https://gerrit.wikimedia.org/r/1123665 (owner: 10Muehlenhoff)
[09:14:16] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looking good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall)
[09:16:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[09:17:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: fix datahub connection host values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126655 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[09:17:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:18:53] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:19:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232)
[09:26:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1037.eqiad.wmnet with OS bookworm
[09:27:05] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm completed: - ganeti103...
[09:32:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:33:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:33:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:33:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1125.eqiad.wmnet
[09:33:28] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627649 (10Marostegui) a:05Marostegui→03None
[09:40:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:42:54] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:44:23] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627683 (10Marostegui) Ready for #dc-ops
[09:44:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[09:45:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:45:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[09:48:45] <jynus>	 sadly I believe the alert will return after deployment is done
[09:50:13] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[09:53:37] <Emperor>	 !log fio testing on ms-be2088 T384003
[09:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] <stashbot>	 T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003
[09:55:28] <wikibugs>	 (03CR) 10Btullis: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[09:56:17] <wikibugs>	 (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[09:57:03] <wikibugs>	 (03CR) 10Btullis: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[09:57:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:57:58] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[09:58:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000)
[10:00:15] <hashar>	 what
[10:00:24] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:00:45] <hashar>	 yeah so that is bugged for sure :)
[10:00:51] <wikibugs>	 (03CR) 10Marostegui: "We should also remove the master role from its yaml. It can be done here or in a separate patch" [puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo)
[10:00:58] <hashar>	 timezones are hard
[10:01:15] <hashar>	 jouncebot: refresh
[10:01:15] <jouncebot>	 I refreshed my knowledge about deployments.
[10:01:18] <hashar>	 jouncebot: now
[10:01:18] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[10:01:19] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000)
[10:01:38] <wikibugs>	 (03PS2) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378)
[10:01:38] <wikibugs>	 (03PS2) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378)
[10:01:38] <wikibugs>	 (03PS2) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378)
[10:01:59] <hashar>	 I will tie it to UTC 
[10:02:16] <wikibugs>	 (03PS1) 10Slyngshede: P:debmonitor::server remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480)
[10:02:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene)
[10:02:57] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:03:21] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5058/console" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[10:03:22] <hashar>	 jouncebot: refresh
[10:03:23] <jouncebot>	 I refreshed my knowledge about deployments.
[10:03:27] <hashar>	 jouncebot: now
[10:03:27] <jouncebot>	 For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900)
[10:03:27] <jouncebot>	 For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000)
[10:04:03] <hashar>	 oh because the train window is two hours long!
[10:04:42] <wikibugs>	 (03PS8) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[10:05:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede)
[10:05:43] <effie>	 hashar: I have a window now, ok to proceed ?
[10:05:56] <hashar>	 yeah there is no train this morning
[10:05:58] <hashar>	 it will run tonight
[10:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:07:16] <wikibugs>	 (03PS9) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[10:07:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[10:07:46] <wikibugs>	 (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[10:07:47] <wikibugs>	 (03PS8) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845)
[10:08:05] <wikibugs>	 (03CR) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[10:08:10] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:08:26] <wikibugs>	 (03PS4) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845)
[10:10:09] <wikibugs>	 (03PS3) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378)
[10:10:09] <wikibugs>	 (03PS3) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378)
[10:10:09] <wikibugs>	 (03PS3) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378)
[10:13:40] <moritzm>	 !log installing systemd bugfix updates from Bookworm point release
[10:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:50] <wikibugs>	 (03CR) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[10:14:20] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:14:28] <jynus>	 !log removing backup1002, backup2002 dump user on es6,es7 T387892 
[10:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:31] <stashbot>	 T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[10:15:24] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:17:01] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[10:17:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet
[10:18:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C
[10:19:58] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C
[10:24:27] <wikibugs>	 (03PS1) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429)
[10:25:44] <wikibugs>	 (03PS1) 10Hashar: Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004)
[10:25:44] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm)
[10:26:26] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[10:27:08] <wikibugs>	 (03CR) 10Hashar: "This is part of removing obsolete settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar)
[10:27:55] <wikibugs>	 (03PS1) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944
[10:28:28] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro)
[10:30:25] <wikibugs>	 (03PS2) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390)
[10:31:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (owner: 10Ayounsi)
[10:31:53] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1  (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[10:33:22] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1  (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[10:33:29] <wikibugs>	 (03PS10) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[10:34:39] <wikibugs>	 (03CR) 10David Caro: "Tested in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro)
[10:35:46] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro)
[10:36:16] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:36:18] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:36:42] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm)
[10:37:59] <wikibugs>	 (03PS2) 10David Caro: clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501)
[10:38:05] <wikibugs>	 (03PS11) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[10:38:25] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:39:00] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627787 (10MoritzMuehlenhoff)
[10:39:15] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5059/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro)
[10:41:09] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mediawiki::maintenance: Add backfill_localaccounts periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert)
[10:41:15] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro)
[10:42:21] <jynus>	 !log removing backup1002, backup2002 dbbackups user @ m1 T387892
[10:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:25] <stashbot>	 T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[10:43:37] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:44:18] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm)
[10:44:20] <wikibugs>	 (03CR) 10Elukey: "Aaron: I double checked the staging cpu/memory saturation graphs and around the time of your deploy I see a bump:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[10:44:24] <logmsgbot>	 !log jiji@deploy2002 Started scap sync-world: (T383845) mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1
[10:44:26] <wikibugs>	 (03CR) 10Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[10:44:27] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[10:44:31] <wikibugs>	 (03CR) 10Elukey: services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[10:47:04] <wikibugs>	 (03PS2) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641)
[10:47:43] <wikibugs>	 (03PS12) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[10:48:01] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:48:26] <jynus>	 job runner seems happy again
[10:50:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:50:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:51:18] <effie>	 let's wait a little bit
[10:51:49] <wikibugs>	 (03CR) 10Jcrespo: [C:04-1] "No worries. Now that I understood the assigment, I will rethink this." [puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo)
[10:51:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:52:07] <volans>	 !incidents
[10:52:07] <sirenbot>	 5724 (ACKED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[10:52:07] <sirenbot>	 5726 (ACKED)  ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad)
[10:52:11] <jynus>	 acked
[10:52:13] <hnowlan>	 mw-api-int rps are way down 
[10:52:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 20.69s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:52:28] <volans>	 checking
[10:52:34] <effie>	 volans: I am deploying
[10:52:35] <jynus>	 was there a deploy ongoing?
[10:52:37] <kamila_>	 effie is moving it to php 8.1
[10:52:48] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225
[10:52:50] <jynus>	 should we revert or continue?
[10:52:57] <kamila_>	 effie: ^
[10:53:05] <wikibugs>	 (03CR) 10D3r1ck01: [C:03+1] Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar)
[10:53:06] <effie>	 I am mid scap 
[10:53:09] <effie>	 scap is not done 
[10:53:14] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[10:53:14] <volans>	 [for later] the link to the runbook of the page has no content
[10:53:21] <jynus>	 api seems down
[10:53:28] <hnowlan>	 job insertion rate is way down also 
[10:53:28] <effie>	 scap is going to rollback most likely 
[10:53:41] <volans>	 did it work on canary?
[10:53:47] <jynus>	 ok, then let's give it a minute
[10:53:51] <jinxer-wm>	 FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:54:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[10:54:14] <jynus>	 latency and http errors skyrocketed
[10:54:21] <volans>	 https://grafana.wikimedia.org/d/aSiSoKoSk/mw-parsoid?orgId=1 looks pretty bad
[10:54:21] <effie>	 volans: I see database errors on mw
[10:54:32] <effie>	 [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server
[10:54:37] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:54:40] <effie>	 let's go to -sre
[10:54:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627861 (10phaultfinder)
[10:54:43] <hnowlan>	 parsoid serving a lot of 500s
[10:55:12] <jynus>	 es overload
[10:55:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:55:15] <jinxer-wm>	 FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:55:24] <jynus>	 this is parsoid going crazy overloading content dbs
[10:55:25] <wikibugs>	 (CR) Ayounsi: [C:+2] Split the cloudsw alerts to their own files [alerts] - https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[10:55:25] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126949
[10:55:32] <effie>	 please, let's move the conversation to -sre
[10:55:42] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079)
[10:55:49] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126951
[10:56:01] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079)
[10:56:37] <wikibugs>	 (Merged) jenkins-bot: Split the cloudsw alerts to their own files [alerts] - https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[10:56:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10627894 (cmooney)
[10:56:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:57:16] <jinxer-wm>	 FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 37.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:57:26] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:57:38] <logmsgbot>	 !log jiji@deploy2002 scap failed: <KeyError> 'production' (scap version: 4.140.0) (duration: 13m 54s)
[10:58:51] <jinxer-wm>	 FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:00:01] <wikibugs>	 (CR) Btullis: "Removing the +1 because we are discussing another way to achieve this." [deployment-charts] - https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[11:00:05] <jouncebot>	 mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100).
[11:00:15] <jinxer-wm>	 RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 23.44% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:02:16] <jinxer-wm>	 RESOLVED: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:03:51] <jinxer-wm>	 RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy   - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:04:11] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1126563 (owner: PipelineBot)
[11:04:18] <wikibugs>	 (CR) JMeybohm: [C:+2] global_config: Add kubernetesVersion for each environment/cluster [puppet] - https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: JMeybohm)
[11:05:30] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1126563 (owner: PipelineBot)
[11:05:37] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:05:55] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:07:20] <wikibugs>	 (PS9) JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388)
[11:07:46] <jinxer-wm>	 FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:08:43] <wikibugs>	 (CR) JMeybohm: k8s::client: Allow for install of all kubectl versions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[11:08:46] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[11:09:11] <wikibugs>	 (PS1) Stevemunene: hdfs: create dummy keytabs for new hosts [labs/private] - https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512)
[11:09:15] <effie>	 jouncebot: now
[11:09:15] <jouncebot>	 For the next 0 hour(s) and 50 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100)
[11:09:36] <wikibugs>	 (PS13) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[11:09:37] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:10:30] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:11:18] <wikibugs>	 (PS1) Superpes15: [enwiki] Throttle exemption for event [mediawiki-config] - https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637)
[11:11:26] <Emperor>	 !log fio testing on ms-be2088 while resetting controller T384003
[11:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:30] <stashbot>	 T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003
[11:11:39] <wikibugs>	 (PS1) Stevemunene: hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512)
[11:11:41] <wikibugs>	 (PS1) Stevemunene: hdfs: Assign the right role to new hdfs workers 1[187-208] [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512)
[11:12:26] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:13:13] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[11:13:42] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:14:37] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:15:30] <jinxer-wm>	 FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:15:56] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:16:15] <wikibugs>	 (PS1) Vgutierrez: cumin: Add liberica aliases per DC [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369)
[11:16:26] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:16:46] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:17:14] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:17:46] <jinxer-wm>	 FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:18:27] <vgutierrez>	 !log reimage lvs6003 as a liberica instance - T384477
[11:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:30] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[11:18:32] <wikibugs>	 (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1125430 (owner: PipelineBot)
[11:19:02] <wikibugs>	 (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs6003 as liberica [puppet] - https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:20:30] <jinxer-wm>	 FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy