[00:10:14] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:22:04] (03PS24) 10Fabfur: haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [00:22:24] (03CR) 10Fabfur: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [00:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626939 (10phaultfinder) [00:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 [00:38:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 (owner: 10TrainBranchBot) [00:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626951 (10phaultfinder) [00:50:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1126692 (owner: 10TrainBranchBot) [00:54:58] (03CR) 10Ssingh: "Looks good, mostly questions/nits and no hard blockers IMO." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [01:08:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 [01:08:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 (owner: 10TrainBranchBot) [01:27:40] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1126694 (owner: 10TrainBranchBot) [01:44:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10626974 (10phaultfinder) [02:11:10] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:09:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627060 (10phaultfinder) [04:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627118 (10phaultfinder) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:11] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126707 [05:14:28] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126708 [05:28:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:28:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:29:18] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:08] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:26 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:16] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0600) [06:06:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627163 (10phaultfinder) [06:21:10] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:36:10] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: 
PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:40:52] (03PS10) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [06:49:00] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627185 (10phaultfinder) [07:08:00] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:19:25] (03PS11) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [07:20:34] (03CR) 10CI reject: [V:04-1] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [07:26:30] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1037.eqiad.wmnet [07:33:38] (03PS1) 10Muehlenhoff: Add astein to authorised Icinga users [puppet] - 10https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186) [07:38:32] (03CR) 10Muehlenhoff: [C:03+2] Add astein to authorised Icinga users [puppet] - 10https://gerrit.wikimedia.org/r/1126907 (https://phabricator.wikimedia.org/T388186) (owner: 10Muehlenhoff) [07:41:15] (03PS12) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [07:44:56] (03PS13) 
10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [07:45:57] (03CR) 10Filippo Giunchedi: [C:03+2] sqlite: require sqlite::package in 'file' db resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126425 (https://phabricator.wikimedia.org/T387112) (owner: 10Filippo Giunchedi) [07:50:51] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10627205 (10MoritzMuehlenhoff) 05Open→03Resolved @AStein-WMF You should now be able to log into... [07:51:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [07:55:27] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10627212 (10Marostegui) 05Open→03Resolved Everything looks good, thank you! 
[07:56:25] (03CR) 10Slyngshede: [C:03+2] Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede) [07:57:56] (03CR) 10Elukey: [C:03+1] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [07:58:19] (03CR) 10Elukey: [C:03+1] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [07:58:54] (03CR) 10Elukey: [C:03+1] services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [08:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800). [08:00:05] _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800) [08:00:44] o/ [08:01:38] <_joe_> hashar: do we need to run the train now?
[08:01:51] <_joe_> it's strange to have such a superposition [08:02:01] (03PS1) 10Slyngshede: Revert "data.yaml temporaily remove SSH key for user" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 [08:02:26] <_joe_> hashar: asking because otherwise I'll merge my changes [08:02:51] I have a ton of mediawiki config changes to push [08:02:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2235].codfw.wmnet,db[1176,1217,1228].eqiad.wmnet with reason: m5 master switch T388500 [08:02:55] T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500 [08:03:04] the train windows overlap because of daylight saving time confusion [08:03:15] (03Merged) 10jenkins-bot: Release v0.1.7 [software/bitu] - 10https://gerrit.wikimedia.org/r/1126531 (owner: 10Slyngshede) [08:03:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [08:03:23] it's tied to the Pacific time zone when really it should be tied to Europe :) [08:03:31] jouncebot: refresh [08:03:32] I refreshed my knowledge about deployments. [08:03:35] jouncebot: nowandnext [08:03:35] For the next 0 hour(s) and 56 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0800) [08:03:35] In 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [08:03:38] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete custom partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1126570 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:03:47] <_joe_> hashar: I suspected something like that [08:03:48] <_joe_> :D [08:03:50] <_joe_> thanks [08:03:54] <_joe_> can I proceed then? [08:04:08] for what?
[08:05:07] I am deploying the patches from https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results [08:06:45] (03PS1) 10Marostegui: mariadb: Promote db1228 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) [08:06:54] (03PS1) 10Muehlenhoff: Remove obsolete custom Partman recipes for labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) [08:08:49] (03PS1) 10Filippo Giunchedi: pontoon: make sure wait-puppet runs as root [puppet] - 10https://gerrit.wikimedia.org/r/1126912 [08:08:53] (03PS1) 10Filippo Giunchedi: pontoon: improve puppetserver git bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1126913 [08:09:20] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/ \o/ \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:09:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar) [08:09:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [08:09:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [08:09:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [08:09:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [08:09:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar) [08:09:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125210 (owner: 10Reedy) [08:10:15] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: make sure wait-puppet runs as root [puppet] - 10https://gerrit.wikimedia.org/r/1126912 (owner: 10Filippo Giunchedi) [08:10:23] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve puppetserver git bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1126913 (owner: 10Filippo Giunchedi) [08:10:24] <_joe_> hashar: uhm wait [08:10:31] (03Merged) 10jenkins-bot: Remove obsolete $wgAllowMicrodataAttributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125372 (owner: 10Hashar) [08:10:33] (03Merged) 10jenkins-bot: Remove wgArticlePlaceholderSearchIntegrationBackend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125374 (https://phabricator.wikimedia.org/T207407) (owner: 10Hashar) [08:10:37] (03Merged) 10jenkins-bot: Remove obsolete CirrusSearch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [08:10:39] (03Merged) 10jenkins-bot: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [08:10:40] <_joe_> so you're backporting patches that weren't in the schedule before? 
[08:10:41] (03Merged) 10jenkins-bot: Remove Cognate legacy settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [08:11:04] <_joe_> I'd have liked to discuss it [08:11:19] (03Merged) 10jenkins-bot: Remove obsolete $wgFlowMaintenanceMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar) [08:11:20] (03Merged) 10jenkins-bot: InitialiseSettings.php: Remove unused NavigationTiming config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125210 (owner: 10Reedy) [08:11:34] they are all noop cleanup patches, we pushed some of those out of window on thursday [08:11:54] I have considered pushing them on Friday but moved that to Monday instead and forgot I had an appointment [08:12:19] <_joe_> hashar: that's not the point, I had a deployment scheduled, I was verifying a few details about one of the patches before proceeding, you just moved in front of me. It's not really cool, but ok, I'll wait [08:12:20] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMain [08:12:20] tenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] [08:12:21] I went lazy and did not schedule them yesterday since the tuesday morning window was empty yesterday and it is often empty [08:12:26] T207407: Remove legacy Database search integration of ArticlePlaceholder - https://phabricator.wikimedia.org/T207407 [08:12:26] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:12:43] <_joe_> hashar: ping me when you're done 
[08:13:35] ah I see [08:13:38] 06SRE, 06Infrastructure-Foundations: Review Broadcom's storcli binary - https://phabricator.wikimedia.org/T388628 (10elukey) 03NEW [08:13:55] I guess next time I will schedule those so you are not caught off guard last minute [08:14:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10627285 (10elukey) Opened T388628 to verify if we can use/import storcli in our apt repo. [08:16:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 (owner: 10Slyngshede) [08:16:10] (03PS3) 10Filippo Giunchedi: pontoon: add Host / Filter [puppet] - 10https://gerrit.wikimedia.org/r/1126044 [08:16:17] <_joe_> hashar: it's about waiting in queue appropriately, you know, civil coexistence and mutual respect. "sorry" was the appropriate response here. In any case, let's move past this before I get even more upset :) [08:16:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [08:16:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1037.eqiad.wmnet [08:16:24] !log hashar@deploy2002 reedy, hashar: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMaintenanceMod [08:16:24] e]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:16:56] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add Host / Filter [puppet] - 10https://gerrit.wikimedia.org/r/1126044 
(owner: 10Filippo Giunchedi) [08:17:22] (03CR) 10Slyngshede: [C:03+2] Revert "data.yaml temporaily remove SSH key for user" [puppet] - 10https://gerrit.wikimedia.org/r/1126910 (owner: 10Slyngshede) [08:19:14] !log hashar@deploy2002 reedy, hashar: Continuing with sync [08:21:17] (03CR) 10Federico Ceratto: [C:03+1] "LGTM Added already-resolved comments. I grepped for db1176 and its ipaddr across other dbproxy* files without finding it." [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: 10Marostegui) [08:22:04] (03CR) 10Marostegui: [C:03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1126916 (https://phabricator.wikimedia.org/T388500) (owner: 10Marostegui) [08:24:21] !log Failover m5 from db1176 to db1228 - T388500 [08:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:26] T388500: Switchover m5 master db1176 -> db1228 - https://phabricator.wikimedia.org/T388500 [08:25:09] (03PS2) 10Hashar: Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 [08:25:20] (03CR) 10Cyndywikime: [C:03+1] Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [08:25:26] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125372|Remove obsolete $wgAllowMicrodataAttributes]], [[gerrit:1125374|Remove wgArticlePlaceholderSearchIntegrationBackend (T207407)]], [[gerrit:1125091|Remove obsolete CirrusSearch config]], [[gerrit:1125092|Fix wgCirrusSearchSimilarityProfiles]], [[gerrit:1125097|Remove Cognate legacy settings (T348526)]], [[gerrit:1125124|Remove obsolete $wgFlowMai [08:25:26] ntenanceMode]], [[gerrit:1125210|InitialiseSettings.php: Remove unused NavigationTiming config]] (duration: 13m 06s) [08:25:30] T207407: Remove legacy Database search integration of 
ArticlePlaceholder - https://phabricator.wikimedia.org/T207407 [08:25:30] T348526: [COG] [TECH] Migrate Cognate to use a virtual database domain - https://phabricator.wikimedia.org/T348526 [08:25:47] (03PS4) 10Filippo Giunchedi: pontoon: refactor host filtering with Host / HostFilter [puppet] - 10https://gerrit.wikimedia.org/r/1126045 [08:26:15] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: refactor host filtering with Host / HostFilter [puppet] - 10https://gerrit.wikimedia.org/r/1126045 (owner: 10Filippo Giunchedi) [08:26:30] (03CR) 10Hashar: "My patch went to conflict with I775d9ec67f662ff3f30c097dd828833af86a29fe by @reedy@wikimedia.org . It also removed a duplicate `wfLoadExte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar) [08:26:42] checking logs after the full depoy [08:26:43] deploy [08:27:20] (03PS1) 10Marostegui: db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126918 [08:27:59] (03CR) 10Marostegui: [C:03+2] db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126918 (owner: 10Marostegui) [08:28:06] _joe_: it looks all good. And sorry next time I will add them all to the schedule instead of assuming that nobody else would use the window [08:28:15] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1176.eqiad.wmnet [08:28:28] <_joe_> hashar: I was even pinged here... 
[08:28:31] <_joe_> anyways, ok [08:28:44] <_joe_> proceeding [08:29:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [08:30:15] (03Merged) 10jenkins-bot: noc/wiki.php: allow showing a single variable in json format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123388 (owner: 10Giuseppe Lavagetto) [08:30:46] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] [08:31:16] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629 (10fgiunchedi) 03NEW [08:32:23] (03PS2) 10Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 [08:32:26] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 (owner: 10Filippo Giunchedi) [08:32:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1176.eqiad.wmnet [08:33:33] (03Abandoned) 10Filippo Giunchedi: pontoon: add --no-prompt, remove user_confirmation [puppet] - 10https://gerrit.wikimedia.org/r/1126047 (owner: 10Filippo Giunchedi) [08:33:52] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:33:54] !log oblivian@deploy2002 oblivian: Continuing with sync [08:34:22] (03PS2) 10Filippo Giunchedi: pontoon: improve error messages and new-stack cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1126914 [08:34:45] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve error messages and new-stack cleanup 
[puppet] - 10https://gerrit.wikimedia.org/r/1126914 (owner: 10Filippo Giunchedi) [08:37:08] (03PS4) 10Filippo Giunchedi: pontoon: integration tests [puppet] - 10https://gerrit.wikimedia.org/r/1126046 [08:37:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1037.eqiad.wmnet with OS bookworm [08:38:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627344 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm [08:39:28] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: integration tests [puppet] - 10https://gerrit.wikimedia.org/r/1126046 (owner: 10Filippo Giunchedi) [08:40:20] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123388|noc/wiki.php: allow showing a single variable in json format]] (duration: 09m 34s) [08:41:00] <_joe_> proceeding with the second patch. it will have some small changes happen to things we're running [08:41:03] (03PS1) 10Brouberol: mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) [08:41:09] (03PS1) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [08:41:12] (03PS1) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [08:41:17] (03PS8) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) [08:41:19] (03PS1) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [08:42:41] (03PS1) 10Slyngshede: IDM: Switch to host running 0.1.7 [dns] - 10https://gerrit.wikimedia.org/r/1126924 [08:43:43] (03CR) 10Giuseppe Lavagetto: [C:03+2] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [08:45:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2230.codfw.wmnet,db1125.eqiad.wmnet with reason: Maintenance [08:45:50] (03CR) 10ArielGlenn: [C:03+1] "Thanks for this, looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert) [08:45:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1176.eqiad.wmnet with reason: Maintenance [08:46:17] (03Merged) 10jenkins-bot: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [08:47:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1126924 (owner: 10Slyngshede) [08:48:09] (03CR) 10Slyngshede: [C:03+2] IDM: Switch to host running 0.1.7 [dns] - 10https://gerrit.wikimedia.org/r/1126924 (owner: 10Slyngshede) [08:48:27] !log slyngshede@dns1004 START - running authdns-update [08:50:34] !log slyngshede@dns1004 END - running authdns-update [08:52:45] !log oblivian@deploy2002 Started scap sync-world: Updating k8s chart [08:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:55:05] !log oblivian@deploy2002 
Finished scap sync-world: Updating k8s chart (duration: 03m 42s) [08:56:50] <_joe_> uh what's going on with mw-jobrunner? [08:57:54] checking [08:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:58:19] (03PS1) 10Marostegui: mariadb: Move db1176 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630) [08:58:23] (03CR) 10Volans: [C:03+1] "LGTM! Thanks for the addition! I've left some questions and a couple of non-blocking nits. I'll leave to traffic the final approval." [cookbooks] - 10https://gerrit.wikimedia.org/r/1126491 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [08:58:29] (03CR) 10Vgutierrez: haproxy: certificate check script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [08:58:51] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1176 to test-s4 [puppet] - 10https://gerrit.wikimedia.org/r/1126927 (https://phabricator.wikimedia.org/T388630) (owner: 10Marostegui) [08:59:29] saturation since 8:38 [08:59:49] 2 slowdowns before that, normally due to deploys [09:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [09:00:21] ^ train will be run tonight [09:03:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage [09:04:23] _joe_: something points to something having happened at 8:39, but I believe your deploy was after that?
[09:06:01] latency increased at 8:21 [09:06:18] https://grafana.wikimedia.org/goto/3rEC1ShHR?orgId=1 [09:06:52] my guess would be at hashar's deployment [09:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1037.eqiad.wmnet with reason: host reimage [09:07:14] <_joe_> jynus: yes, it's "organic" [09:07:21] <_joe_> and tbh ok if jobrunners are running hot [09:07:31] <_joe_> as long as it's just "hot" and not "failing" [09:07:34] hmm [09:07:44] just fyi, hashar [09:07:59] all the patches I have pushed are removing unused mediawiki configs and all have been reviewed as doing just that afaik [09:08:15] there seems to be extra load since 8:20 [09:08:28] (03CR) 10Ayounsi: [C:03+2] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [09:08:29] but I am not ruling out it might have caused some cascading effect somewhere! [09:08:30] <_joe_> which would square up with hashar's deployment [09:08:52] <_joe_> take a look at jobs frequency, I can't spend time on this right now sorry [09:08:59] let me try to find out what the extra work is being spent on [09:10:05] (03Merged) 10jenkins-bot: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [09:10:05] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [09:10:48] there is extra parsoidCacheprewarm, but that doesn't line up with the 8:20 timestamp [09:10:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db1125.eqiad.wmnet [09:11:31] the spikes that line up are refreshlinks [09:11:45] but they are not ongoing [09:11:47] (03PS1) 10Marostegui: mariadb: Decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092) [09:12:27] (03CR) 10Marostegui: [C:03+2] 
mariadb: Decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1126931 (https://phabricator.wikimedia.org/T357092) (owner: 10Marostegui) [09:13:48] (03CR) 10Vgutierrez: varnish: add log filters to slowquery logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1126647 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [09:13:57] (03CR) 10Muehlenhoff: [C:03+2] idm: Add approval rule for airflow-search-ops in production [puppet] - 10https://gerrit.wikimedia.org/r/1123665 (owner: 10Muehlenhoff) [09:14:16] (03CR) 10Vgutierrez: [C:03+1] "looking good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1126646 (https://phabricator.wikimedia.org/T388597) (owner: 10BCornwall) [09:16:25] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [09:17:28] (03CR) 10Brouberol: [C:03+2] airflow: fix datahub connection host values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126655 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:17:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:18:53] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:19:40] (03PS1) 10Filippo Giunchedi: prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) [09:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1037.eqiad.wmnet with OS bookworm [09:27:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to 
Bookworm - https://phabricator.wikimedia.org/T382507#10627627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1037.eqiad.wmnet with OS bookworm completed: - ganeti103... [09:32:37] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:33:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1125.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:33:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:33:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1125.eqiad.wmnet [09:33:28] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627649 (10Marostegui) a:05Marostegui→03None [09:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:42:54] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:44:23] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10627683 (10Marostegui) Ready for 
#dc-ops [09:44:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [09:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:45:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [09:48:45] sadly I believe the alert will return after deployment is done [09:50:13] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: restore deployment to the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126919 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:53:37] !log fio testing on ms-be2088 T384003 [09:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:41] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003 [09:55:28] (03CR) 10Btullis: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:56:17] (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:57:03] (03CR) 10Btullis: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:57:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:57:58] (03CR) 10Btullis: [C:03+1] airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [09:58:12] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:00:15] what [10:00:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:00:45] yeah so that is bugged for sure :) [10:00:51] (03CR) 10Marostegui: "We should also remove the master role from its yaml. It can be done here or in a separate patch" [puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [10:00:58] timezones are hard [10:01:15] jouncebot: refresh [10:01:15] I refreshed my knowledge about deployments. 
[10:01:18] jouncebot: now [10:01:18] For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:01:19] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:01:38] (03PS2) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [10:01:38] (03PS2) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [10:01:38] (03PS2) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [10:01:59] I will tie it to UTC [10:02:16] (03PS1) 10Slyngshede: P:debmonitor::server remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) [10:02:34] (03CR) 10Brouberol: [C:03+1] Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [10:02:57] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:21] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5058/console" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 
10Slyngshede) [10:03:22] jouncebot: refresh [10:03:23] I refreshed my knowledge about deployments. [10:03:27] jouncebot: now [10:03:27] For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T0900) [10:03:27] For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1000) [10:04:03] oh because the train window is two hours long! [10:04:42] (03PS8) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:05:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126937 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:05:43] hashar: I have a window now, ok to proceed ? [10:05:56] yeah there is no train this morning [10:05:58] it will run tonight [10:06:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:07:16] (03PS9) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:07:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [10:07:46] (03CR) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:07:47] (03PS8) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 
(https://phabricator.wikimedia.org/T383845) [10:08:05] (03CR) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:08:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:08:26] (03PS4) 10Effie Mouzeli: hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) [10:10:09] (03PS3) 10Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) [10:10:09] (03PS3) 10Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) [10:10:09] (03PS3) 10Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) [10:13:40] !log installing systemd bugfix updates from Bookworm point release [10:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:50] (03CR) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): switch all pods to -main (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125423 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:14:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:14:28] !log removing backup1002, backup2002 dump user on es6,es7 T387892 [10:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:14:31] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:15:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:01] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: switch 100% mw-(api-int|parsoid|jobrunner) to PHP 8.1 (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1126607 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:17:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [10:18:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [10:19:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1037.eqiad.wmnet to cluster eqiad and group C [10:24:27] (03PS1) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) [10:25:44] (03PS1) 10Hashar: Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) [10:25:44] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:26:26] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [10:27:08] (03CR) 10Hashar: "This is part of removing obsolete settings https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [10:27:55] (03PS1) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 [10:28:28] (03CR) 10David Caro: [V:03+1 C:03+2] cloudceph: enable qos in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1126597 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:30:25] (03PS2) 10JMeybohm: global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) [10:31:33] (03CR) 10Filippo Giunchedi: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (owner: 10Ayounsi) [10:31:53] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:22] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 (2/2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126650 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:33:29] (03PS10) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:34:39] (03CR) 10David Caro: "Tested in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:35:46] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:36:16] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:36:18] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:36:42] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [10:37:59] (03PS2) 10David Caro: clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) [10:38:05] (03PS11) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:38:25] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:39:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10627787 (10MoritzMuehlenhoff) [10:39:15] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5059/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126618 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:41:09] (03CR) 10Clément Goubert: [C:03+2] mediawiki::maintenance: Add backfill_localaccounts periodic jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126603 (https://phabricator.wikimedia.org/T388564) (owner: 10Clément Goubert) [10:41:15] (03CR) 10David Caro: [V:03+1 C:03+2] clouceph: enable qos in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1126618 
(https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [10:42:21] !log removing backup1002, backup2002 dbbackups user @ m1 T387892 [10:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:25] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [10:43:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:44:18] (03CR) 10Kamila Součková: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [10:44:20] (03CR) 10Elukey: "Aaron: I double checked the staging cpu/memory saturation graphs and around the time of your deploy I see a bump:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:44:24] !log jiji@deploy2002 Started scap sync-world: (T383845) mw-(api-int|parsoid|jobrunner): switch all releases to PHP 8.1 [10:44:26] (03CR) 10Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:44:27] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [10:44:31] (03CR) 10Elukey: services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [10:47:04] (03PS2) 10Ayounsi: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) [10:47:43] (03PS12) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 
10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [10:48:01] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:48:26] job runner seems happy again [10:50:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:51:18] lets wait a little bit [10:51:49] (03CR) 10Jcrespo: [C:04-1] "No worries. Now that I understood the assigment, I will rethink this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [10:51:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:07] !incidents [10:52:07] 5724 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [10:52:07] 5726 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [10:52:11] acked [10:52:13] mw-api-int rps are way down [10:52:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 20.69s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:52:28] checking [10:52:34] volans: I am deploying [10:52:35] was there a deploy ongoing? [10:52:37] effie is moving it to php 8.1 [10:52:48] (03PS12) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [10:52:50] should we revert or continue? 
[10:52:57] effie: ^ [10:53:05] (03CR) 10D3r1ck01: [C:03+1] Remove obsolete $wgChronologyProtectorStash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126942 (https://phabricator.wikimedia.org/T336004) (owner: 10Hashar) [10:53:06] I am mid scap [10:53:09] scap is not done [10:53:14] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:53:14] [for later] the link to the runbook of the page has no content [10:53:21] api seems down [10:53:28] job insertion rate is way down also [10:53:28] scap is going to rollback most likely [10:53:41] did it work on canary? [10:53:47] ok, then let's give it a minute [10:53:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:54:12] (03CR) 10Cathal Mooney: [C:03+1] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:54:14] latency http errors skyrocketed [10:54:21] https://grafana.wikimedia.org/d/aSiSoKoSk/mw-parsoid?orgId=1 looks pretty bad [10:54:21] volans: I see database errors on mw [10:54:32] [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server [10:54:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:40] lets go to -sre [10:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10627861 (10phaultfinder) [10:54:43] parsoid serving a lot of 500s [10:55:12] es overload [10:55:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is 
elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:55:15] FIRING: [4x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:55:24] this is parsoid going crazy overloading content dbs [10:55:25] (03CR) 10Ayounsi: [C:03+2] Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:55:25] (03PS1) 10Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126949 [10:55:32] please lets move the conversation to -sre, [10:55:42] (03PS1) 10Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1126950 (https://phabricator.wikimedia.org/T369079) [10:55:49] (03PS1) 10Lucas Werkmeister (WMDE): Improve SPARQL query construction in SparqlHelper [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126951 [10:56:01] (03PS1) 10Lucas Werkmeister (WMDE): Replace distinct-values SPARQL queries [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126952 (https://phabricator.wikimedia.org/T369079) [10:56:37] (03Merged) 10jenkins-bot: Split the cloudsw alerts to their own files [alerts] - 10https://gerrit.wikimedia.org/r/1126944 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:56:39] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS 
to alertmanager - https://phabricator.wikimedia.org/T384052#10627894 (10cmooney) [10:56:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:16] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 37.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:57:26] RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:57:38] !log jiji@deploy2002 scap failed: 'production' (scap version: 4.140.0) (duration: 13m 54s) [10:58:51] FIRING: [3x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:00:01] (03CR) 10Btullis: "Removing the +1 because we are discussing another way to achieve this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100). 
[11:00:15] RESOLVED: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 23.44% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:02:16] RESOLVED: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:03:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:04:11] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 (owner: 10PipelineBot) [11:04:18] (03CR) 10JMeybohm: [C:03+2] global_config: Add kubernetesVersion for each environment/cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126940 (https://phabricator.wikimedia.org/T388390) (owner: 10JMeybohm) [11:05:30] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126563 (owner: 10PipelineBot) [11:05:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:05:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1091.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:07:20] (03PS9) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) [11:07:46] FIRING: [6x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:08:43] (03CR) 10JMeybohm: k8s::client: Allow for 
install of all kubectl versions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[11:08:46] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115803 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[11:09:11] (PS1) Stevemunene: hdfs: create dummy keytabs for new hosts [labs/private] - https://gerrit.wikimedia.org/r/1126955 (https://phabricator.wikimedia.org/T388512)
[11:09:15] jouncebot: now
[11:09:15] For the next 0 hour(s) and 50 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250312T1100)
[11:09:36] (PS13) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[11:09:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:10:30] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:11:18] (PS1) Superpes15: [enwiki] Throttle exemption for event [mediawiki-config] - https://gerrit.wikimedia.org/r/1126956 (https://phabricator.wikimedia.org/T388637)
[11:11:26] !log fio testing on ms-be2088 while resetting controller T384003
[11:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:30] T384003: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003
[11:11:39] (PS1) Stevemunene: hdfs: Add new worker hosts1[187-208] to net_topology [puppet] - https://gerrit.wikimedia.org/r/1126957 (https://phabricator.wikimedia.org/T388512)
[11:11:41] (PS1) Stevemunene: hdfs: Assign the right role to new hdfs workers 1[187-208] [puppet] - https://gerrit.wikimedia.org/r/1126958 (https://phabricator.wikimedia.org/T388512)
[11:12:26] RESOLVED: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:13:13] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[11:13:42] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:14:37] FIRING: [3x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:15:30] FIRING: [5x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:15:56] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:16:15] (PS1) Vgutierrez: cumin: Add liberica aliases per DC [puppet] - https://gerrit.wikimedia.org/r/1126959 (https://phabricator.wikimedia.org/T388369)
[11:16:26] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:16:46] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:17:14] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:17:46] FIRING: [4x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/canary (k8s) 4.716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:18:27] !log reimage lvs6003 as a liberica instance - T384477
[11:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:30] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[11:18:32] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1125430 (owner: PipelineBot)
[11:19:02] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs6003 as liberica [puppet] - https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:20:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy