[00:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [00:09:15] RECOVERY - snapshot of s4 in codfw on backupmon1001 is OK: Last snapshot for s4 at codfw (db2239) taken on 2025-03-17 22:42:31 (1785 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:10:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:11:51] !log very late UTC deploys done [00:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:05] !log zabe@mwmaint2002:~$ cat group0.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/afl_text_table_deletedump/{} --sleep 0.3" # T381599 [00:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:09] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [00:15:44] (03PS1) 10Dzahn: mailman: list sync, add option to mail changes to an admin [puppet] - 10https://gerrit.wikimedia.org/r/1128564 (https://phabricator.wikimedia.org/T351202) [00:16:07] (03CR) 10CI reject: [V:04-1] mailman: list sync, add option to mail changes to an admin [puppet] - 10https://gerrit.wikimedia.org/r/1128564 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [00:16:45] (03PS1) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) [00:17:09] (03CR) 10CI reject: [V:04-1] community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 
(https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:21:37] (03PS2) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) [00:22:48] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10644791 (10Dzahn) a:05MoritzMuehlenhoff→03AStein-WMF This is a known issue many of us have ran into before. As Moritz des... [00:23:34] (03Abandoned) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:24:53] PROBLEM - Restbase root url on restbase2033 is CRITICAL: connect to address 10.192.32.174 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [00:25:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644793 (10phaultfinder) [00:25:48] (03CR) 10Cwhite: [C:03+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1128471/5099/" [puppet] - 10https://gerrit.wikimedia.org/r/1128471 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [00:30:14] (03CR) 10Dwisehaupt: "Tried many different ways to get this to work with virtual users, but kept stumbling across bits that didn't work. 
Going to stick with loc" [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:32:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [00:34:17] (03CR) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:38:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128577 [00:38:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128577 (owner: 10TrainBranchBot) [00:39:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm [00:39:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644812 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm [00:50:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1257.eqiad.wmnet with reason: host reimage [00:51:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128577 (owner: 10TrainBranchBot) [00:54:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1257.eqiad.wmnet with reason: host reimage [01:08:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1128581 
[01:08:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1128581 (owner: 10TrainBranchBot) [01:13:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:13:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:13:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1257.eqiad.wmnet with OS bookworm [01:14:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm completed: - db1257 (**WARN**) - Removed... [01:14:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644872 (10Jclark-ctr) [01:14:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644873 (10Jclark-ctr) 05Open→03Resolved [01:24:53] RECOVERY - Restbase root url on restbase2033 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/RESTBase [01:28:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1128581 (owner: 10TrainBranchBot) [01:32:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:42:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:50:14] "Our servers are 
currently under maintenance or experiencing a technical problem. Please try again in a few minutes." [01:50:15] Hmm [01:50:20] "Original error: upstream connect error or disconnect/reset before headers. reset reason: connection termination " [01:58:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [01:58:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0200) [02:04:21] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644924 (10phaultfinder) [02:06:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:07:17] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2050 [02:07:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2050 [02:08:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.21 [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128584 (https://phabricator.wikimedia.org/T386216) [02:08:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.21 [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128584 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [02:14:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [02:14:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10644938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [02:20:20] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.21 [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128584 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [02:25:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm [02:25:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10644945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err... [02:26:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10644951 (10Jhancock.wm) [02:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:37:11] FIRING: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [02:42:11] RESOLVED: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [02:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0300) [03:02:25] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128586 (https://phabricator.wikimedia.org/T386216) [03:02:26] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128586 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [03:03:17] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128586 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [03:03:44] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.21 refs T386216 [03:03:47] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [03:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0400) [04:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [04:06:15] !log
mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.18 (duration: 06m 13s) [04:14:38] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:15:33] RECOVERY - OpenSearch unassigned shard check - 9200 on relforge1004 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [04:18:09] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2239) taken on 2025-03-18 01:25:28 (1155 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:51:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10645089 (10phaultfinder) [05:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS 
OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:50:23] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1003 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10645147 (10phaultfinder) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0600) [06:00:04] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0600). [06:19:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10645188 (10Marostegui) Thank you! 
[06:23:58] (03PS1) 10Marostegui: s4-pager.sql: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1128745 [06:26:47] (03CR) 10Marostegui: [C:03+2] s4-pager.sql: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1128745 (owner: 10Marostegui) [06:27:20] (03Merged) 10jenkins-bot: s4-pager.sql: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1128745 (owner: 10Marostegui) [06:28:43] (03PS1) 10Marostegui: valid_section.pp: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) [06:29:44] (03CR) 10Marostegui: "Amir, I am not sure if the section where I've added is correct, let me know if you want it to be there or in the metadata section." [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [06:37:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:40:41] !log Shifted UTC morning backport windows by an hour to take in account daylight saving time difference between USA and Europe [06:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:50:17] jouncebot: refresh [06:50:18] I refreshed my knowledge about deployments. 
[06:50:22] jouncebot: nowandnext [06:50:22] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0600) [06:50:23] In 0 hour(s) and 9 minute(s): UTC morning backport window (legacy daylight saving time confusion) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0700) [06:50:39] stupid daylight saving time [06:53:02] the MediaWiki infrastructure (UTC early) is happening at 6UTC [06:53:43] but in the calendar it is tied to PST and thus is marked at 11pm [06:54:07] it thus shows up in the calendar as happening Yesterday (which is correct from the point of view of PST) [07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window (legacy daylight saving time confusion) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:36] o/ [07:00:39] pfff [07:00:41] jouncebot: refresh [07:00:42] I refreshed my knowledge about deployments. 
[07:00:44] jouncebot: now [07:00:44] For the next 0 hour(s) and 59 minute(s): UTC morning backport window (legacy daylight saving time confusion) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0700) [07:00:54] well somehow it missed it [07:03:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) (owner: 10Jon Harald Søby) [07:04:33] (03Merged) 10jenkins-bot: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) (owner: 10Jon Harald Søby) [07:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [07:05:30] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1127458|Add Portal namespace to kaawiki (T388158)]] [07:05:34] T388158: Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [07:08:54] (03PS13) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [07:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10645256 (10phaultfinder) [07:15:55] of course httpbb tests failed due to mwdebug servers timing out :-( [07:16:25] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [07:16:31] * hashar tries again [07:16:43] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1127458|Add Portal namespace to kaawiki (T388158)]] [07:16:47] T388158: 
Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [07:17:22] (03CR) 10Federico Ceratto: [C:03+1] "Is modules/profile/files/dbbackups/valid_sections.txt to be updated as well?" [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [07:29:08] hmm it happens again [07:29:11] * hashar retries [07:30:26] well they are broken [07:32:58] 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10645290 (10Aklapper) [07:36:26] (03PS1) 10Michael Große: Growth: enable new way of refreshing LinkRecommendations for pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) [07:37:04] I keep forgetting that the window moved forward by an hour during confusion time [07:37:45] Is anyone deploying? If so, I have a config change for it. But it can also wait until the next window [07:37:54] The change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128777 [07:38:33] * MichaelG_WMF reads up [07:39:18] hashar: is something broken with the deployments in general? [07:39:21] yeah [07:39:26] I am going to file an unbreak now [07:39:34] gotcha [07:39:38] the window was set at 8am CET [07:39:42] cause the script is broken [07:39:57] I kept it as is in case the person that scheduled the change would show up [07:40:05] and copy-pasted it for 9:00 (aka in 20 minutes) [07:40:12] but I have hit a wall which is that the debug servers are 503ing [07:40:29] meh [07:40:37] but thank you for looking into it!
[07:40:41] so essentially we can't deploy :/ [07:41:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [07:46:06] (03CR) 10Klausman: role::ml_k8s::worker: move ml-serve2001 to containerd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:46:14] (03CR) 10Klausman: [C:03+1] role::ml_k8s::staging::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128462 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:46:49] (03CR) 10Klausman: [C:03+1] role::ml_k8s: extend nrpe_check_disk_options to allow containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128461 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:49:11] filed as T389169 and I have made it an unbreak now [07:49:12] T389169: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [07:49:19] MichaelG_WMF: I don't know what is broken :-( [07:49:52] not my area of expertise either, but I'll have a look at the task nonetheless :) [07:51:08] (03PS1) 10Hashar: Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128778 (https://phabricator.wikimedia.org/T388158) [07:51:42] (03CR) 10Hashar: [C:03+2] "Self merging since that was never deployed."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128778 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [07:52:22] (03CR) 10Muehlenhoff: installserver: set puppetserver2004 for UEFI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [07:52:34] (03Merged) 10jenkins-bot: Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128778 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [07:53:02] hashar: would be nice to see the full HTML that we got for those test failures. Are we sure it is a timeout? [07:54:42] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@e7be149]: hotfix for webrequest DAGs end_dates for k8s migration [07:54:59] (03PS1) 10Hashar: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128781 (https://phabricator.wikimedia.org/T388158) [07:56:05] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072) (owner: 10Cwhite) [07:56:15] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@e7be149]: hotfix for webrequest DAGs end_dates for k8s migration (duration: 02m 09s) [07:56:51] (03PS1) 10Muehlenhoff: Remove access for swagoel [puppet] - 10https://gerrit.wikimedia.org/r/1128782 [07:57:02] MichaelG_WMF: I am not sure, but we had some occurrences of the mwdebug servers timing out previously [07:57:16] in this case I don't know what the exact cause is, hence why I have filed a new task [07:57:19] I'll investigate [07:58:43] (03PS1) 10Slyngshede: data.yaml: Offboarding jebe [puppet] - 10https://gerrit.wikimedia.org/r/1128783 [07:59:49] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Swagoel out of all services on: 1293 hosts [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0800). [08:00:05] MichaelG_WMF and hashar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:26] you can self-serve? [08:00:32] Let me know if you need/want help [08:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [08:01:07] Amir1: deployment is broken with open UBN: https://phabricator.wikimedia.org/T389169 [08:01:10] the debug servers are apparently broken [08:01:24] (03CR) 10Muehlenhoff: [C:03+1] "Patch looks good, but these mails are for the end of the date, so let's not merge yet." [puppet] - 10https://gerrit.wikimedia.org/r/1128783 (owner: 10Slyngshede) [08:02:36] I had it yesterday too, I "r"ed until it passed [08:02:38] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Swagoel out of all services on: 949 hosts [08:03:05] (03CR) 10Muehlenhoff: [C:03+2] Remove access for swagoel [puppet] - 10https://gerrit.wikimedia.org/r/1128782 (owner: 10Muehlenhoff) [08:05:19] hi, i'm here 👋 [08:05:41] my name fell off the deployments table for some reason [08:06:38] Jhs: hello!!! [08:06:53] Jhs: I have tried to deploy the kaawiki Portal namespace earlier today (an hour ago) [08:07:04] ah, ok, nice [08:07:06] but that failed due to an unrelated reason: our deployment system has an ongoing issue [08:07:27] the https://wikitech.wikimedia.org/wiki/Deployments page has some issue due to USA having already moved to summer time [08:07:32] while Europe is still in winter time [08:07:35] aha [08:07:51] yeah, i'm the one who reported the bug that led Brian to discover that :P [08:07:58] so the window was scheduled one hour ago. 
I fixed it by copy-pasting it to now and tried to deploy at the original time (an hour ago) in case you showed up [08:08:29] and well something is broken somewhere in our infra so I have reverted your configuration change and sent it back for review/pending [08:08:41] 👍 that's fine of course [08:09:01] we can try again at the next backport window, or tomorrow morning :) [08:09:25] (03PS1) 10Brouberol: Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) [08:10:17] hashar, sure. i'll move it on [[Deployments]] [08:10:49] Jhs: the new change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128781 [08:11:47] (03PS1) 10Brouberol: Remove airflow-analytics from the ATS and cache configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128786 (https://phabricator.wikimedia.org/T389172) [08:11:48] (03PS1) 10Brouberol: Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) [08:11:50] (03PS1) 10Brouberol: Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) [08:12:12] hashar, great, thanks [08:14:15] (03PS1) 10Brouberol: Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) [08:14:17] (03PS1) 10Brouberol: Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 (https://phabricator.wikimedia.org/T389172) [08:14:19] (03PS1) 10Brouberol: Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) [08:14:49] hashar: httpbb's appserver/test_main.yaml continues to fail right now against mwdebug1001 and mwdebug1002. Is this expected?
Or has a rollback happened? [08:16:45] (03PS1) 10Filippo Giunchedi: logstash: read k8s-mw topics as needed [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) [08:17:27] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:15] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [08:19:49] (03CR) 10Filippo Giunchedi: [V:03+1] "The topics are not live yet, they will with I5020574a8936" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [08:20:35] (03CR) 10Filippo Giunchedi: "LGTM, depends on If5960807bd" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [08:21:43] (03CR) 10Btullis: [C:03+1] Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:01] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:05] (03CR) 10Stevemunene: [C:03+1] Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:21] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 
(https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:30] (03CR) 10Stevemunene: [C:03+1] Remove airflow-analytics from the ATS and cache configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128786 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:41] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:52] (03CR) 10Stevemunene: [C:03+1] Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:23:00] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:23:24] (03CR) 10Btullis: [C:03+1] Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:23:31] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:24:10] ah I have found it [08:24:11] `UnexpectedValueException: Invalid server index #` [08:27:48] hashar: ? [08:29:16] akosiaris: ? [08:29:22] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:29:27] hashar: I mean, what have you found? [08:29:34] the train is broken [08:29:37] https://phabricator.wikimedia.org/T389169 [08:29:47] httpbb hits some 500 when querying the mwdebug servers [08:29:55] and apparently this time it is really an error in MediaWiki! 
[08:30:18] I know, I am debugging the same task, thanks for adding that info, that is what I was asking [08:30:27] * akosiaris just refreshed [08:30:27] so the trace is for Wikibase, that goes through the parser cache and SqlBagOStuff [08:30:39] (03CR) 10Brouberol: [C:03+2] Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:30:43] which apparently is not happy about some config: `Invalid server index #` [08:30:52] !log brouberol@dns1004 START - running authdns-update [08:30:53] which smells like something is borked in operations/mediawiki-config [08:31:27] akosiaris: yeah sorry for the delay, I was busy digging in logstash / copy pasting to the task etc :) [08:32:19] (03CR) 10Brouberol: [C:03+2] Remove airflow-analytics from the ATS and cache configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128786 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:32:20] sometimes I regret not being in a real office, I would have moved with my laptop to the SRE open space and loudly screamed "We have an emergency! Wikis are broken!! I have croissants!" [08:32:33] np, good job on correlating with the logstash stacktrace. 
I was at the apache logs level and was about to move to logstash when you pasted that log line [08:33:01] \o/ [08:33:02] !log brouberol@dns1004 END - running authdns-update [08:33:11] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:33:18] so I think it is an issue with wmf.21 since the tests wikis got promoted over night [08:33:21] I'll look for a repro [08:33:22] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:34:00] I am super happy to find out httpbb did catch an issue [08:34:12] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:34:17] for what is worth, httpbb no longer complains right now [08:34:36] so this has some aspects of a heisenbug at least [08:34:55] OH [08:35:14] cause the wiki is on wmf.20 [08:35:16] and it went through a phase of first no longer complaining about the P13344 page and now it no longer complains about Main_Page either [08:35:19] !log mlitn@deploy2002 Started deploy [airflow-dags/platform_eng@e7be149]: (no justification provided) [08:35:24] and the patch to promote them to wmf.21 did not get deployed due to the test failing [08:35:33] I am still trying to figure out but that is my assumption right now [08:35:47] deploy2002:~$ httpbb /srv/deployment/httpbb-tests/appserver/test_main.yaml --hosts=mwdebug[1001,1002].eqiad.wmnet is how I was running it and seeing first both failures, then only 1 and now none [08:35:53] so if you were to run the httpbb tests manually they still hit wmf.20 
[08:36:01] !log mlitn@deploy2002 Finished deploy [airflow-dags/platform_eng@e7be149]: (no justification provided) (duration: 00m 46s) [08:36:13] I noticed Special:Version was showing wmf.20 which was confusing me ( https://test.wikidata.org/wiki/Special:Version ) [08:36:16] but that never got promoted [08:36:21] I think that is the explanation [08:36:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:36:47] I will write a summary [08:37:09] hashar: this is why I asked above, 10:14:49 hashar: httpbb's appserver/test_main.yaml continues to fail right now against mwdebug1001 and mwdebug1002. Is this expected? Or has a rollback happened? [08:37:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:37:17] I wanted to clear out that possibility [08:37:31] then mwdebug servers are still on wmf.21 ? [08:37:38] err s/still/already/ [08:38:09] * akosiaris double checks [08:39:09] hashar: they should be on 21, that happened last night and that doesn't get rolled back AFAIK [08:39:16] [08:39:20] yup, they are in wmf.21 [08:39:24] https://www.irccloud.com/pastebin/LzqvikZ4/ [08:39:26] good! [08:39:37] and right now, for whatever reasons, httpbb no longer complains about either of those 2 pages [08:40:20] deploy2002:~$ curl -s -4 --connect-to test.wikidata.org:443:10.64.32.123:443 https://test.wikidata.org/wiki/Special:Version | grep 'wmf\.' | head -1 [08:40:20] [08:40:30] that's ^ my super quick way of checking fwiw [08:40:37] the IP is mwdebug1001 fwiw [08:42:03] so whatever the heisenbug is, it looks like it vanishes at some point after a deployment. Which also is consistent with Amir's comment above that they just hit 'r' a few times yesterday and it finally worked. [08:42:23] so that happened previously? 
[08:42:33] 🤷 [08:42:36] :) [08:42:58] I just saw the bug today and responded, but I am totally oblivious to what happened yesterday [08:43:15] it is given a null shardIndex [08:43:30] and with SqlBagOStuff / ParserCache I am tempted to invoke Amir1 :-] [08:44:13] akosiaris: the patch I deployed was for portals fully html assets [08:44:55] Amir1: so you just stepped on the mine, it triggered and then it decided to let you leave after a couple of times pressing 'r' ? [08:45:00] might be the data redundancy patch having a bug [08:45:17] lol, just saw throw new UnexpectedValueException( "Invalid server index #$shardIndex" ); [08:45:17] I think that is a bug in I80da12396858ee4fc58ae257f6c154b3050df696 yeah [08:45:22] if this is only on wmf.21 [08:45:24] it's literally a null index [08:45:39] I thought # was a number or something but the variable is indeed null, lol [08:45:48] yesterday, wmf.21 wasn't even cut :D [08:45:50] and PHP shows it as an empty string [08:45:54] yup [08:45:56] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1029.eqiad.wmnet with reason: remove from cluster for reimage [08:46:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10645487 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fbeb54b5-2eb9-44e3-bebb-3ffb0c131169) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [08:46:15] it is very likely a bug in the data redundancy patch [08:46:21] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1119745 [08:46:33] give me a bit and I'll figure out the root cause [08:46:37] if it happened before wmf.20 that means it is in a patch before that? [08:47:04] then logstash would have it [08:47:57] why would it be before wmf.20? 
[08:48:02] the patch I linked should be only in wmf.21 so what I had last night might be a different issue [08:48:13] I think there are actually two issues [08:48:21] I am referring to Amir having to repeat the httpbb tests yesterday [08:48:27] but yeah that might have been another issue [08:49:12] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:50:17] I'd revert https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1119745 :) [08:50:33] hashar: give me an hour and I'll fix it [08:50:44] but my bet is `array_slice()` returning an empty array and thus there is no index [08:52:35] is wmf.21 on mwdebug? so I can test things with eval.php? [08:52:43] should be yes [08:53:00] cool [08:53:04] the patch that bumps to wmf.21 failed deployment but scap does not roll back [08:54:03] jnuche: it is most probably https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1119745 . There might be an easy quick fix, else it can be reverted to resume the train [08:54:32] hashar: yep, I'm listening in, thank you everyone for taking a look [08:54:33] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1029 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1128412 (owner: 10Muehlenhoff) [08:56:53] well I am taking a break [08:57:17] I have long finished my breakfast but I am still in my pajamas and it is probably not healthy :b [08:57:43] hashar: go for it, thanks again, you made my morning a bit less stressful :) [08:58:34] I wonder whether httpbb could show the exception or request id [08:58:44] the exception id is certainly somewhere in the HTML payload [08:58:58] anyway, this is endless! [08:59:08] I'll be back in roughly half an hour [09:00:05] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0900). [09:00:09] jnuche: I am happy to have relieved some stress!! :-] [09:00:22] (03PS1) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [09:00:31] and I am very happy that httpbb caught the issue [09:00:57] (03CR) 10CI reject: [V:04-1] prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [09:01:04] yeah, it's good to see those tests in action [09:01:06] (03CR) 10Slyngshede: "recheck" [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [09:01:46] train window just started, noting here again train is currently blocked by T389169 [09:01:46] T389169: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [09:07:20] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1029.eqiad.wmnet [09:09:01] found the issue [09:09:07] fix will be coming shortly [09:09:09] I hate php [09:10:39] (03PS1) 10Slyngshede: Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 [09:11:37] (03PS1) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:13:01] (03PS2) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:13:32] yup, tested in mwdebug2001 and it fixes the issues [09:14:21] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 
(https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:18:26] (03CR) 10Brouberol: [C:03+2] Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:19:30] (03PS3) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:19:30] hashar: jnuche: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1128802 are you comfortable with merging this or should I find someone to review it? [09:21:16] (03CR) 10CI reject: [V:04-1] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [09:21:16] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:21:43] (you can also backport it to wmf.21 and merge it, I take care of master) [09:23:21] Amir1: that looks good to me, gonna create the backport for 21 and deploy it [09:23:29] Thanks! 
[09:23:41] (03PS4) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:23:59] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:24:27] (03PS1) 10Jaime Nuche: objectcache: Re-number array keys in SqlBagOStuff [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128803 (https://phabricator.wikimedia.org/T389169) [09:24:32] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [09:24:44] (03CR) 10Effie Mouzeli: [C:03+1] mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [09:25:17] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:26:02] (03CR) 10Volans: "Sorry, this one got lost in the backlog. Reply inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [09:26:15] 06SRE, 10Bitu, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10645573 (10Arendpieter) 05Open→03Resolved [09:26:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128803 (https://phabricator.wikimedia.org/T389169) (owner: 10Jaime Nuche) [09:29:17] (03Merged) 10jenkins-bot: Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:30:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:49] (03CR) 10CI reject: [V:04-1] Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 (owner: 10Slyngshede) [09:30:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:31:22] (03CR) 10Ladsgroup: [C:03+1] valid_section.pp: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [09:35:18] (03PS2) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [09:35:26] (03PS5) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:36:31] (03CR) 
10Marostegui: [C:03+2] valid_section.pp: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [09:36:42] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:37:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [09:39:05] (03Merged) 10jenkins-bot: objectcache: Re-number array keys in SqlBagOStuff [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128803 (https://phabricator.wikimedia.org/T389169) (owner: 10Jaime Nuche) [09:39:34] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1128803|objectcache: Re-number array keys in SqlBagOStuff (T389169)]] [09:39:38] T389169: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [09:39:59] (03PS6) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:41:12] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:44:51] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1128803|objectcache: Re-number array keys in SqlBagOStuff (T389169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:44:55] T389169: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [09:45:05] !log jnuche@deploy2002 jnuche: Continuing with sync [09:45:15] 06SRE, 10Bitu, 10CAS-SSO, 
06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10645637 (10MoritzMuehlenhoff) 05Resolved→03Open This isn't resolved? [09:45:52] success: [09:45:56] https://www.irccloud.com/pastebin/YggQfkGC/ [09:46:15] Amir1: thanks once more! :) [09:48:25] (03CR) 10Zoe: [C:03+1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128381 (owner: 10PipelineBot) [09:49:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [09:49:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1029.eqiad.wmnet [09:50:22] Amir1: well it is missing a PHPUnit test to cover the issue :b [09:50:37] Amir1: thank you for the quick fix! [09:51:27] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki: add rewrite for rt.wikimedia.org to wikitech page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [09:52:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:53:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:53:36] jnuche: I have closed the blocker task [09:53:41] yeah, i try to add a regression test a bit later [09:53:43] akosiaris: Amir1: thank you very much! 
[09:53:54] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:54:02] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:54:09] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:55:19] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 106304 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [09:55:36] (03PS3) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [09:56:12] \o/ [09:57:48] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [09:57:58] <_joe_> moscovium [09:58:17] <_joe_> I loved when we used such names for servers, it was funnier [09:58:20] the 2010s called and so on [09:58:27] I liked our star era [09:58:32] <_joe_> me too [09:58:38] <_joe_> the codfw beginnings [09:58:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1029.eqiad.wmnet with OS bookworm [09:59:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10645695 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1029.eqiad.wmnet with OS bookworm [09:59:11] my 
newest devices follow that pattern now. And there are so many stars, I am not running out of names anytime soon [09:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:59:26] yeah, it felt so sciency and so nerdy. I miss that :( [10:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:05:31] (03PS7) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [10:05:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:05:45] hmmm, the backport ran into a timeout while doing the deploy to prod K8s. It's rolling back right now. 
Once it's done I'm going to try to roll out the regular train [10:05:52] hopefully the timeout is a one-off [10:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:06:23] (03PS4) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [10:06:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:07:23] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [10:11:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:14:11] (03CR) 10Btullis: [C:03+2] mediawiki: Update the dumps job template to support write access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127879 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [10:15:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage [10:16:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:16:37] jnuche: yeah, last night even before branch cut, I had a lot of these too [10:16:54] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128817 
(https://phabricator.wikimedia.org/T386216) [10:16:55] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128817 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [10:17:08] (03Merged) 10jenkins-bot: mediawiki: Update the dumps job template to support write access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127879 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [10:17:39] (03CR) 10Filippo Giunchedi: [V:03+1] "Please note that the actual topic change will be deployed next week" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [10:17:47] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128817 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [10:19:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage [10:20:21] (03PS1) 10Ladsgroup: Bump thumbnail steps to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128819 (https://phabricator.wikimedia.org/T360589) [10:21:04] (03PS1) 10Gkyziridis: inference-services: edit-check GPU version deployment on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1128820 (https://phabricator.wikimedia.org/T386100) [10:21:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:21:56] (03CR) 10Slyngshede: "recheck" [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [10:25:19] (03PS5) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [10:25:20] (03PS1) 10Vgutierrez: liberica: Expose config file age metrics [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) [10:25:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [10:26:05] (03CR) 10CI reject: [V:04-1] liberica: Expose config file age metrics [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [10:29:02] (03PS2) 10Vgutierrez: liberica: Expose config file age metrics [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) [10:29:14] (03CR) 10Hnowlan: [C:03+2] mw-(web|api-ext): scale up in anticipation of switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127859 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:30:12] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [10:30:44] (03Merged) 10jenkins-bot: mw-(web|api-ext): scale up in anticipation of switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127859 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:33:41] jouncebot: nowandnext [10:33:41] For the next 0 hour(s) and 
26 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0900) [10:33:41] In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1100) [10:34:27] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.21 refs T386216 [10:34:30] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [10:34:40] !log hnowlan@cumin1002 START - Cookbook sre.discovery.datacenter [10:34:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [10:34:51] ^ just a status run [10:36:20] (03CR) 10Sergio Gimeno: [C:03+1] Growth: enable new way of refreshing LinkRecommendations for pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [10:36:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1029.eqiad.wmnet with OS bookworm [10:36:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10645836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1029.eqiad.wmnet with OS bookworm completed: - ganeti102... [10:37:02] ok, train worked, things seem to be back to normal, phew [10:37:25] There seems to be another issue [10:37:26] https://phabricator.wikimedia.org/T389182 [10:38:31] (03PS2) 10Slyngshede: Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 [10:39:11] claime: that happened while the deployment was running, could it be transient? is it still happening? [10:39:19] jnuche: Just reran it, same error [10:40:20] not sure what that code is doing, do you think it warrants a rollback? 
[10:41:37] I don't think it warrants a rollback, but it needs to be fixed before we roll forwards to new wikis I think [10:43:25] I've tagged the platform team on task since from what I gathered for T341555 it's one of their jobs [10:43:26] T341555: Implement periodic maintenance scripts for mw-on-k8s - https://phabricator.wikimedia.org/T341555 [10:43:53] claime: ack, triaged as blocker [10:44:03] ty [10:46:05] (03PS1) 10Cyndywikime: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) [10:46:51] (03CR) 10Ilias Sarantopoulos: inference-services: edit-check GPU version deployment on staging. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128820 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [10:52:00] if things are stable, can I deploy something quickly? 🥺 [10:52:26] Amir1: yep [10:52:53] awesome. Thanks. I have to deploy a patch every day until it reaches 100% (5% every day I can deploy) [10:53:10] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128819 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:53:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128819 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:54:11] (03Merged) 10jenkins-bot: Bump thumbnail steps to 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128819 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:54:42] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1128819|Bump thumbnail steps to 20% (T360589)]] [10:54:46] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:55:19] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 105799 MB (3% 
inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [10:56:40] (03PS2) 10Gkyziridis: inference-services: edit-check GPU version deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128820 (https://phabricator.wikimedia.org/T386100) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1100) [11:00:18] (03CR) 10Hnowlan: [C:03+2] switchdc: delete Job objects for mw-cron due to library support [cookbooks] - 10https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [11:00:42] (03PS3) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [11:01:42] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1128819|Bump thumbnail steps to 20% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:01:47] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:03:55] (03CR) 10Dr0ptp4kt: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: 10Btullis) [11:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [11:04:45] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:05:17] (03PS27) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:05:29] 07Puppet, 06SRE: 
puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10645982 (10MoritzMuehlenhoff) I think I have a trail: I noticed that this occurs... [11:06:06] (03PS4) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [11:06:14] (03CR) 10Federico Ceratto: "I rolled back to PS18, I'm going to review the pending comments and update the PR" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:06:59] (03CR) 10CI reject: [V:04-1] switchdc: delete Job objects for mw-cron due to library support [cookbooks] - 10https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [11:10:27] (03PS5) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [11:12:09] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:13:07] (03PS6) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [11:16:37] (03PS1) 10Hnowlan: elasticsearch: fix line to unblock CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1128836 [11:18:44] (03PS7) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [11:19:38] FIRING: [2x] SystemdUnitFailed: mediawiki_job_startupregistrystats-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T388236#10646061 (10phaultfinder) [11:20:19] sigh, due to one error, the deployment is rolling back [11:20:33] https://www.irccloud.com/pastebin/PlAw8DL5/ [11:21:16] (03PS28) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:21:45] (03PS2) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 [11:24:16] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the fix!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1128836 (owner: 10Hnowlan) [11:24:16] (03CR) 10Federico Ceratto: "In patchset https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1080718/27..28 based on the CR comments I moved the usage example docs" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:24:41] (03CR) 10Hnowlan: [C:03+2] elasticsearch: fix line to unblock CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1128836 (owner: 10Hnowlan) [11:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 10.71% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:26:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [11:26:58] (03PS29) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:27:18] (03PS7) 10Hnowlan: switchdc: delete Job objects for mw-cron due to library support [cookbooks] - 10https://gerrit.wikimedia.org/r/1127878 
(https://phabricator.wikimedia.org/T385155) [11:27:32] (03PS1) 10Muehlenhoff: Fix cloudelastic Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1128837 [11:28:41] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1128819|Bump thumbnail steps to 20% (T360589)]] [11:28:45] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:30:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:30:46] (03Merged) 10jenkins-bot: elasticsearch: fix line to unblock CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1128836 (owner: 10Hnowlan) [11:32:26] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [11:33:10] (03PS1) 10Hnowlan: thumbor: bump replicas in anticipation of increased traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128840 (https://phabricator.wikimedia.org/T360589) [11:33:29] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:34:14] (03CR) 10Ladsgroup: [C:03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1128840 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [11:34:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [11:35:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1128819|Bump thumbnail steps to 20% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:35:10] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:35:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:35:55] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:36:19] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128841 [11:36:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable SUL3 logins on group 0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [11:36:45] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10646127 (10MoritzMuehlenhoff) Changelog for 17.0.14: https://mail.openjdk.org/pip... 
[11:38:54] (03CR) 10CI reject: [V:04-1] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [11:40:36] <_joe_> jouncebot: next [11:40:36] In 0 hour(s) and 19 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1200) [11:40:42] <_joe_> jouncebot: now [11:40:42] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1100) [11:43:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A [11:43:41] _joe_: I'm deploying something now [11:43:44] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1128401 (https://phabricator.wikimedia.org/T388419) (owner: 10Ayounsi) [11:43:47] but it's stuck in 94% for the second time [11:44:01] <_joe_> Amir1: did you report this to serviceops? [11:44:03] https://www.irccloud.com/pastebin/PlAw8DL5/ [11:44:11] <_joe_> Amir1: what does kubectl tells you? [11:44:18] (03CR) 10Hnowlan: [C:03+2] thumbor: bump replicas in anticipation of increased traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128840 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [11:44:33] https://www.irccloud.com/pastebin/AJ7BF1Hj/ [11:44:49] wikifunctions is failing [11:45:24] <_joe_> always wikifunctions? 
[11:45:34] <_joe_> I suspect it has to do with the changes scott merged last night [11:45:35] the second time hasn't errored yet [11:45:47] (03Merged) 10jenkins-bot: thumbor: bump replicas in anticipation of increased traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128840 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [11:46:12] once it errors, I will report [11:46:14] <_joe_> hnowlan: please let's understand what's blocking amir before increasing replicas [11:46:21] yep, looking into that atm [11:46:22] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128819|Bump thumbnail steps to 20% (T360589)]] (duration: 17m 41s) [11:46:26] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:46:36] hnowlan: aren't you ooo? I can check wf [11:46:39] actually, it finished but it took really long time [11:46:47] <_joe_> yeah that's not great [11:47:04] 11:45:43 Finished sync-prod-k8s (duration: 08m 51s) [11:47:09] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1029.eqiad.wmnet to cluster eqiad and group A [11:48:56] <_joe_> 8 minutes isn't so terrible though [11:49:11] <_joe_> oh wait not overall, just that step [11:49:15] <_joe_> yeah that's bad [11:49:32] <_joe_> in any case, my change is a noop I'm going to merge it now [11:49:50] (03CR) 10Giuseppe Lavagetto: [C:03+2] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [11:50:04] It took 5 minutes for mw-api-ext, and 8 for mw-wikifunctions [11:50:12] So there's something fishy [11:50:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:51:02] (03Merged) 10jenkins-bot: mediawiki-common: 
introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [11:51:04] yeah, in total it was "(duration: 17m 41s)" [11:51:29] could it be that we only switched wf to php8.1 yesterday ? [11:52:10] <_joe_> effie: yes that's what I thought [11:52:15] replicas: 6 [11:52:17] strategy: [11:52:18] <_joe_> if we had to pull images from scratch [11:52:19] rollingUpdate: [11:52:21] maxSurge: 3% [11:52:23] maxUnavailable: 6% [11:52:24] <_joe_> claime: uhhh [11:52:25] type: RollingUpdate [11:52:27] yeah... [11:52:31] I had that sort of long durations as we were rolling out for the other ones, so [11:52:46] (03CR) 10Muehlenhoff: [V:03+2] "The thumbor-pipeline-test needs to be updated to catch up with the new OpenJPEG, which leads to some new rendering output:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [11:52:51] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [11:53:04] It takes 5 minutes to roll out the big mw releases [11:53:09] (mw-web, mw-api-ext, etc.) 
[11:53:15] looking at scap logs in logstash [11:53:24] But it takes 8 minutes for mw-wf [11:53:39] claime: now yes, when we first started rolling out the -next releases [11:53:41] And I bet it's because of the rollingUpdate settings [11:53:52] it was taking too long to pull images from scratch [11:54:19] Yeah because the images weren't on the hosts, but now they are, and mw-wf uses the same image [11:55:24] 15m Normal Pulled pod/mw-wikifunctions.eqiad.main-7959bdf596-4nvjm Successfully pulled image "docker-registry.discovery.wmnet/restricted/medi [11:55:26] awiki-multiversion:2025-03-18-112908-publish" in 2m4.968487989s [11:55:42] wait [11:55:45] but it's basically doing pod by pod, taking 2 minutes each time [11:56:23] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [11:56:28] hey why's that not 8.1 [11:56:30] wtf [11:56:33] yes [11:56:54] I was checking scotts last patch and it looks ok [11:56:59] (03CR) 10Cathal Mooney: [C:03+2] Support setting custom arp-policer on CR interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1127592 (https://phabricator.wikimedia.org/T384774) (owner: 10Cathal Mooney) [11:57:10] (03PS15) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [11:57:25] effie: patch link? 
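Note on the rollingUpdate values pasted above (replicas: 6, maxSurge: 3%, maxUnavailable: 6%): the sketch below shows one way to see why the rollout would proceed pod by pod. It assumes the rounding rules documented for Kubernetes Deployments (a maxSurge percentage rounds up, a maxUnavailable percentage rounds down, and both cannot be zero at once). The function name resolve_rolling_update is hypothetical, and the 200-replica comparison is an invented figure for contrast, not a value taken from this log.

```python
def resolve_rolling_update(replicas, max_surge_pct, max_unavailable_pct):
    """Turn percentage-based rollingUpdate settings into absolute pod counts.

    Assumption: follows the documented Kubernetes Deployment behaviour,
    where maxSurge rounds up and maxUnavailable rounds down.
    """
    # Integer arithmetic avoids floating-point surprises.
    surge = -(-replicas * max_surge_pct // 100)          # ceiling division
    unavailable = replicas * max_unavailable_pct // 100  # floor division
    # If both resolve to zero, the rollout could never make progress,
    # so one unavailable pod is allowed.
    if surge == 0 and unavailable == 0:
        unavailable = 1
    return surge, unavailable


# Values pasted in the channel for mw-wikifunctions.
small = resolve_rolling_update(replicas=6, max_surge_pct=3, max_unavailable_pct=6)
print("6 replicas   -> surge, unavailable =", small)

# Hypothetical larger release, for contrast only.
large = resolve_rolling_update(replicas=200, max_surge_pct=3, max_unavailable_pct=6)
print("200 replicas -> surge, unavailable =", large)
```

With only 6 replicas, the percentages resolve to one extra pod and zero unavailable pods, so each new pod has to become ready before the next old one is replaced. That would be consistent with the "pod by pod" observation, though the sketch says nothing about how long each step takes.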
[11:57:50] (03Merged) 10jenkins-bot: Support setting custom arp-policer on CR interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1127592 (https://phabricator.wikimedia.org/T384774) (owner: 10Cathal Mooney) [11:57:53] claime: let's go to -sre [11:57:54] (03PS30) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:58:27] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1200) [12:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [12:02:18] (03CR) 10Fabfur: [C:03+1] liberica,hiera: Add IPv6 endpoints for prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/1127907 (https://phabricator.wikimedia.org/T379238) (owner: 10Vgutierrez) [12:05:35] 10ops-magru, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10646249 (10cmooney) 05Open→03Resolved Router stable and config added to automation templates, closing task.
[12:09:31] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128841 (owner: 10PipelineBot) [12:11:08] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128841 (owner: 10PipelineBot) [12:12:27] FIRING: [2x] SystemdUnitFailed: mediawiki_job_startupregistrystats-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:46] (03CR) 10Elukey: [V:03+1] role::ml_k8s::worker: move ml-serve2001 to containerd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [12:16:04] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:16:47] (03PS2) 10Elukey: installserver: set puppetserver2004 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) [12:17:27] RESOLVED: [2x] SystemdUnitFailed: mediawiki_job_startupregistrystats-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:01] (03PS3) 10Elukey: installserver: set puppetserver2004 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) [12:19:20] (03Abandoned) 10Effie Mouzeli: Revert "mw-api-int: bump replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127511 (owner: 10Effie Mouzeli) [12:19:45] (03CR) 10Elukey: installserver: set puppetserver2004 for UEFI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [12:20:00] (03PS1) 10Clément Goubert: 
mw-wikifunctions: Relax rollingUpdate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128847 [12:20:05] (03CR) 10Elukey: [C:03+2] admin_ng: set request and limits the same for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128465 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [12:21:05] (03PS1) 10Hnowlan: admin_ng: increase requests and limits for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) [12:21:05] (03CR) 10Federico Ceratto: "The linting is passing - ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [12:22:22] (03PS2) 10Hnowlan: admin_ng: increase requests and limits for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) [12:22:44] (03PS3) 10Hnowlan: admin_ng: increase requests and limits for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) [12:24:14] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:24:44] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:25:42] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:25:57] (03CR) 10Ladsgroup: [C:03+2] changeprop-jobqueue: Bump categorymembership job concurrancy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128482 (owner: 10Ladsgroup) [12:26:10] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:26:39] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:26:51] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:27:24] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:27:53] (03Merged) 10jenkins-bot: 
changeprop-jobqueue: Bump categorymembership job concurrancy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128482 (owner: 10Ladsgroup) [12:28:09] (03CR) 10Clément Goubert: [C:03+1] admin_ng: increase requests and limits for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [12:28:33] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:28:51] (03CR) 10Volans: [C:04-1] "I don't think it will work as is due to some type mismatch. I've left some suggestions inline to fix them." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [12:29:21] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:29:49] (03PS2) 10Clément Goubert: mw-wikifunctions: Relax rollingUpdate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128847 [12:30:30] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:30:38] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:30:47] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:30:53] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:31:02] !log ladsgroup@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:31:15] !log ladsgroup@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:32:12] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:33:11] !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:33:32] !log ladsgroup@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:34:08] (03CR) 10Effie Mouzeli: [C:03+1] admin_ng: increase requests and limits 
for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [12:34:09] (03CR) 10Filippo Giunchedi: prometheus: Add node_file_age (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [12:34:16] !log ladsgroup@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:35:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10646333 (10MoritzMuehlenhoff) [12:35:33] !log rebalance ganeti eqiad/A following reimages T382507 [12:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:36] T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 [12:38:26] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [12:38:42] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [12:38:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10646338 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done [12:39:08] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:39:44] !log installing freetype2 security updates [12:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:03] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[12:40:46] (03CR) 10Hnowlan: [C:03+2] admin_ng: increase requests and limits for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [12:46:14] (03Merged) 10jenkins-bot: admin_ng: increase requests and limits for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128848 (https://phabricator.wikimedia.org/T360589) (owner: 10Hnowlan) [12:50:15] (03PS2) 10Gergő Tisza: Enable SUL3 logins on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) [12:50:26] (03CR) 10Gergő Tisza: Enable SUL3 logins on group 0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [12:52:56] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable SUL3 logins on group 0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [12:53:20] (03PS1) 10Filippo Giunchedi: pontoon: update and clarify instructions [puppet] - 10https://gerrit.wikimedia.org/r/1128857 [12:59:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1300). [13:00:05] tgr and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] oh, right, daylight confusion time [13:00:25] (*more* confusing this year because it applies to some deployment windows but not others…) [13:00:35] (03PS6) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [13:00:36] (03PS3) 10Vgutierrez: liberica: Expose config file age metrics [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) [13:00:44] I’m around but would like to go have lunch pretty soon, so if anyone else can deploy… [13:00:44] o/ [13:00:53] (03CR) 10Ilias Sarantopoulos: [C:03+1] inference-services: edit-check GPU version deployment on staging. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128820 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [13:01:00] ah, I forgot to move one of my changes from the morning window that didn't happen to this one [13:01:09] I can deploy [13:01:11] * Jhs is present [13:01:27] could we deploy this change too: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128777 [13:01:40] (03CR) 10Vgutierrez: prometheus: Add node_file_age (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [13:02:58] MichaelG_WMF: please add it to the wiki page [13:03:04] jouncebot: nowandnext [13:03:05] For the next 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1300) [13:03:05] In 0 hour(s) and 56 minute(s): Datacentre switchover: Services and Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1400) [13:03:08] thanks! 
will do [13:03:38] I’ll start with Jhs as it’s a volunteer patch :) [13:03:54] (03PS2) 10Lucas Werkmeister (WMDE): Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128781 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [13:04:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128781 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [13:04:52] (03PS1) 10Hashar: Hiera: enable deep merge lookup option for abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) [13:05:07] tgr_: Wiki page updated ✅ [13:05:13] (03Merged) 10jenkins-bot: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128781 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [13:05:42] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1128781|Add Portal namespace to kaawiki (T388158)]] [13:05:46] T388158: Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [13:06:02] (03PS2) 10Hashar: Hiera: enable deep merge lookup option for abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) [13:06:34] (03PS3) 10Hashar: Hiera: enable deep merge lookup option for abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) [13:11:27] Lucas_WMDE: I can deploy my patch at the end, it might be time-consuming to test. 
[13:11:29] (03PS2) 10Filippo Giunchedi: pontoon: update and clarify instructions [puppet] - 10https://gerrit.wikimedia.org/r/1128857 [13:11:34] thanks [13:11:41] Lucas_WMDE, the namespace looks correct on kaawiki now, but pages have "disappeared" since (presumably) namespaceDupes hasn't been run yet [13:12:08] makes sense (but also you’re testing way too early, scap hadn’t even synced to all k8s servers yet when you wrote that message :P) [13:12:25] zoom zoom [13:12:28] tgr_: would you mind deploying MichaelG_WMF’s change too? [13:12:42] !log lucaswerkmeister-wmde@deploy2002 hashar, lucaswerkmeister-wmde: Backport for [[gerrit:1128781|Add Portal namespace to kaawiki (T388158)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:12:46] T388158: Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [13:12:50] Jhs: now :P [13:12:56] Lucas_WMDE: sure, will do [13:13:29] great, thank you! [13:13:33] (03PS8) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [13:14:05] thank you from me as well :) [13:14:25] (mine is a maintenance-script only change again, nothing to test here) [13:15:08] Lucas_WMDE, still looks correct, if you needed confirmation ;) [13:15:12] !log lucaswerkmeister-wmde@deploy2002 hashar, lucaswerkmeister-wmde: Continuing with sync [13:15:14] yay ^^ [13:16:05] (03CR) 10CI reject: [V:04-1] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [13:17:17] (03PS9) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [13:19:34] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, neat!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [13:19:59] (03PS10) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [13:23:27] Lucas_WMDE, has the script started yet? [13:23:31] (03PS11) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [13:23:39] no, scap is still running [13:23:46] ok 👍 [13:23:50] (03CR) 10Gkyziridis: [C:03+2] inference-services: edit-check GPU version deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128820 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [13:23:56] (03CR) 10Vgutierrez: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [13:25:14] (03Merged) 10jenkins-bot: inference-services: edit-check GPU version deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128820 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [13:26:26] (03CR) 10Elukey: [C:03+2] role::ml_k8s: extend nrpe_check_disk_options to allow containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128461 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [13:27:02] k8s deployment progress is chilling out at 94% for a bit [13:27:14] (03CR) 10Bking: [C:03+2] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson) [13:27:15] (03PS12) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [13:27:16] I think this was mentioned in here before, something about wikifunctionswiki being the last remaining wiki on PHP 7.4? 
[13:27:21] hopefully it’ll sort itself out [13:27:46] (03PS2) 10Elukey: role::ml_k8s::staging::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128462 (https://phabricator.wikimedia.org/T387854) [13:28:07] (cc claime / effie just in case, I think you had ideas for improving that but I don’t know if those were supposed to be deployed already) [13:28:42] (03CR) 10Elukey: [C:03+2] installserver: set puppetserver2004 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [13:30:46] noooo red output [13:31:06] (03PS13) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [13:31:12] “rolling back to prior state” :( [13:31:29] helm “timed out waiting for the condition” [13:32:16] also cc _joe_ and Amir1 from the earlier discussion [13:32:28] (03PS1) 10Hnowlan: tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) [13:32:53] Lucas_WMDE: check -sre, I think wikifunctions is not functioning [13:33:12] (03CR) 10CI reject: [V:04-1] tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) (owner: 10Hnowlan) [13:33:56] well, apparently things are not as stable as they’re supposed to be [13:34:12] (03PS2) 10Hnowlan: tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) [13:34:14] seems like effectively we can’t deploy config changes or backports at the moment [13:34:37] o/ [13:35:28] we had an issue this morning with a mediawiki/core patch that caused wikidata to die.
But that got resolved [13:36:07] * hnowlan looking [13:36:17] (03CR) 10Andrew Bogott: [C:03+2] keystone.conf: update oidc comment section to reflect changes in ID mapping [puppet] - 10https://gerrit.wikimedia.org/r/1125499 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [13:37:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:37:15] Lucas_WMDE: was the timeout for codfw? [13:37:25] I just filed T389203 for the issue [13:37:25] T389203: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203 [13:37:29] with full scap output in https://phabricator.wikimedia.org/P74241 [13:37:30] 06SRE: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203 (10Lucas_Werkmeister_WMDE) 03NEW [13:37:43] (03CR) 10CI reject: [V:04-1] tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) (owner: 10Hnowlan) [13:37:48] hnowlan: looks like it was eqiad [13:37:50] but I’m not sure [13:38:01] is that paste restricted? [13:38:07] yes [13:38:09] need access? [13:38:13] yes please [13:38:16] (I’m not sure who it’s restricted to ^^) [13:38:25] added, I hope [13:38:31] thanks [13:38:33] you can add WMF-NDA :) [13:38:46] as subscribers or tag? [13:39:03] I’m wary of emailing all of them 😅 [13:39:13] :))) [13:39:23] I think that is in the view permissions [13:39:52] better now? 
[13:39:57] (03CR) 10Hnowlan: [C:03+2] mw-wikifunctions: Relax rollingUpdate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128847 (owner: 10Clément Goubert) [13:40:01] changed visibility from subscribers to wmf-nda [13:40:09] Lucas_WMDE: sorry I missed the ping [13:40:19] scap rollback is now at 94% again btw [13:40:23] ^ this patch will hopefully improve things [13:40:41] (03CR) 10Effie Mouzeli: [C:03+1] mw-wikifunctions: Relax rollingUpdate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128847 (owner: 10Clément Goubert) [13:40:53] (03PS3) 10Stevemunene: Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) [13:41:23] (03Merged) 10jenkins-bot: mw-wikifunctions: Relax rollingUpdate strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128847 (owner: 10Clément Goubert) [13:41:35] ok scap exited now [13:41:49] (03PS1) 10Herron: jaeger: add aux-k8s-codfw environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128868 (https://phabricator.wikimedia.org/T381417) [13:41:50] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128870 [13:41:51] (03PS1) 10Herron: jaeger: hooks: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128869 (https://phabricator.wikimedia.org/T381417) [13:41:55] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128870 (owner: 10Lucas Werkmeister (WMDE)) [13:42:03] I’ll just git pull ^ that once it’s merged [13:42:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at codfw: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:42:18] and then leave the rest to someone else, I really need to take a lunch break ^^ 
[13:42:44] (03Merged) 10jenkins-bot: Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128870 (owner: 10Lucas Werkmeister (WMDE)) [13:42:55] (03CR) 10Elukey: [C:03+1] jaeger: add aux-k8s-codfw environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128868 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:43:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:43:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:43:38] (03CR) 10Elukey: [C:03+1] jaeger: hooks: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128869 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:43:38] !log pulled d2fa9d6821 / Ieab79b7eb1 to /src/mediawiki-staging on deploy2002 to bring config back in sync with deployed state (due to failed deployment, T389203) [13:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:42] T389203: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203 [13:44:04] sorry Jhs, we’ll have to try that again another time :( [13:44:22] (it’s a good thing I didn’t run the maintenance script while the deployment was ongoing 😬) [13:44:43] (03CR) 10Herron: [C:03+2] jaeger: add aux-k8s-codfw environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128868 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:45:04] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [13:45:42] * Lucas_WMDE done deploying [13:45:52] tgr_: if you want to try deploying with that puppet change in place… [13:46:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:46:21] !log brouberol@deploy2002 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:46:23] (03Merged) 10jenkins-bot: jaeger: add aux-k8s-codfw environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128868 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:46:34] sorry, not puppet, deployment-charts ^^ [13:46:43] (03CR) 10Herron: [C:03+2] jaeger: hooks: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128869 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:46:45] sure [13:46:59] can redeploy the portal change now, right? [13:47:04] deploy will hopefully be unblocked [13:47:25] jouncebot: nowandnext [13:47:25] For the next 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1300) [13:47:25] In 0 hour(s) and 12 minute(s): Datacentre switchover: Services and Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1400) [13:47:30] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at codfw: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:48:05] (03Merged) 10jenkins-bot: jaeger: hooks: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128869 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [13:49:28] (03PS1) 10Gergő Tisza: Revert^2 "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128873 [13:49:32] Lucas_WMDE, no worries [13:50:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128873 (owner: 10Gergő Tisza) [13:50:24] (03CR) 10Jasmine: [C:03+1] debug: reorder debug backends for eqiad switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127072 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [13:50:25] !log bounce mtail on centrallog2002 
[13:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:11] (03Merged) 10jenkins-bot: Revert^2 "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128873 (owner: 10Gergő Tisza) [13:51:41] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128873|Revert^2 "Add Portal namespace to kaawiki"]] [13:52:30] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at codfw: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:54:45] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at codfw: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:55:45] (03CR) 10Elukey: [C:03+1] interactive: notify when waiting for input (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [13:56:15] (03PS1) 10Bking: cloudelastic: add newest hosts as master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1128874 (https://phabricator.wikimedia.org/T387904) [13:56:49] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128874 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [13:57:10] !log tgr@deploy2002 tgr: Backport for [[gerrit:1128873|Revert^2 "Add Portal namespace to kaawiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:58:33] !log tgr@deploy2002 tgr: Continuing with sync [14:00:04] hnowlan and jasmine_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Datacentre switchover: Services and Traffic . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1400). 
[14:00:57] I will of course hold for the deploy [14:01:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10646749 (10phaultfinder) [14:01:37] where is the fun then [14:01:38] :D [14:01:48] (03PS3) 10Hnowlan: tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) [14:02:02] hnowlan: we have two more patches to go other than the one that's in progress. Should we reschedule them or will there be enough time? [14:02:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:04:10] (03CR) 10DCausse: [C:03+1] cloudelastic: add newest hosts as master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1128874 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:04:33] tgr_: It'd be best if they could be rescheduled as I'd like to keep on schedule, but if they're critical we can hold. 
[14:05:13] mine is not critical [14:05:37] ok will reschedule [14:07:48] (03PS1) 10Muehlenhoff: Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 [14:07:59] (03PS2) 10Muehlenhoff: Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 [14:08:26] (03PS14) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [14:08:27] (03CR) 10CI reject: [V:04-1] Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 (owner: 10Muehlenhoff) [14:08:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [14:12:23] thanks! [14:12:26] (03CR) 10Bking: [C:03+2] cloudelastic: add newest hosts as master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1128874 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:13:40] Please note that is not not about deploying; it is about the possibility of them going wrong and spending a lot of time reverting & debugging or disrupting the graphs and hiding issue for the current schedule [14:13:49] *it is not [14:14:31] we are still blocked on the wikifunctions error, it seems [14:15:30] Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. 
[14:17:44] I'll do a revert, wait for scap to sync it to the canaries, then abort and hand over [14:18:06] since scap is rolling back the production k8s updates, AIUI that should be sufficient [14:19:06] (03PS3) 10Muehlenhoff: Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 [14:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:20:09] (03PS4) 10Muehlenhoff: Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 [14:20:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) (owner: 10Hnowlan) [14:21:17] (03CR) 10Hnowlan: [C:03+2] tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) (owner: 10Hnowlan) [14:21:37] (03PS1) 10TrainBranchBot: Revert "Revert^2 "Add Portal namespace to kaawiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128876 [14:21:38] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as Ib97625f3b8e1cf7947f76c7373d19cb4c9b0776f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128873 (owner: 10Gergő Tisza) [14:22:04] Jhs: please reschedule that too [14:22:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128876 (owner: 10TrainBranchBot) [14:22:21] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:23:20] (03Merged) 10jenkins-bot: 
Revert "Revert^2 "Add Portal namespace to kaawiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128876 (owner: 10TrainBranchBot) [14:23:49] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128876|Revert "Revert^2 "Add Portal namespace to kaawiki""]] [14:24:00] tgr_, sure [14:24:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 14.29% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:24:50] (03Merged) 10jenkins-bot: tests: drop ssim threshold of png->jpg https loader test [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1128865 (https://phabricator.wikimedia.org/T38579) (owner: 10Hnowlan) [14:24:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:54] (03CR) 10Hnowlan: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:25:06] (03CR) 10Bking: [C:03+2] Fix cloudelastic Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1128837 (owner: 10Muehlenhoff) [14:25:14] 06SRE: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203#10646883 (10Tgr) {d133f031467a571c9b02d36e7b29d1523fc3b249} was deployed but didn't help - I got the same error: {P74244} [14:25:48] (03CR) 10Andrew Bogott: [C:03+1] Remove obsolete custom Partman recipes for labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1126917 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:26:31] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [14:26:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [14:27:13] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete custom Partman recipes for labvirt* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1126917 
(https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:28:48] (03PS1) 10Jdlrobson: Disable donation LINK on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128877 (https://phabricator.wikimedia.org/T387768) [14:28:50] !log tgr@deploy2002 trainbranchbot, tgr: Backport for [[gerrit:1128876|Revert "Revert^2 "Add Portal namespace to kaawiki""]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:57] !log tgr@deploy2002 Sync cancelled. [14:29:05] (03PS1) 10Scott French: mw-wikifunctions: temporarily extend helm timeout to 20m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128878 [14:29:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:30] (03CR) 10Hnowlan: [C:03+1] mw-wikifunctions: temporarily extend helm timeout to 20m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128878 (owner: 10Scott French) [14:29:45] hnowlan: done. (per above, merged a revert patch and run scap until the testserver phase only. AIUI production should still be in its pre-deploy-window condition.) 
[14:29:47] (03PS1) 10Alexandros Kosiaris: rt: Switch ATS to mw-web [puppet] - 10https://gerrit.wikimedia.org/r/1128879 (https://phabricator.wikimedia.org/T385777) [14:30:17] (03CR) 10CI reject: [V:04-1] rt: Switch ATS to mw-web [puppet] - 10https://gerrit.wikimedia.org/r/1128879 (https://phabricator.wikimedia.org/T385777) (owner: 10Alexandros Kosiaris) [14:30:20] (03PS16) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [14:30:29] tgr_: ack, thank you [14:31:40] (03PS2) 10Alexandros Kosiaris: rt: Switch ATS to mw-web [puppet] - 10https://gerrit.wikimedia.org/r/1128879 (https://phabricator.wikimedia.org/T385777) [14:31:57] (03CR) 10Scott French: [C:03+2] mw-wikifunctions: temporarily extend helm timeout to 20m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128878 (owner: 10Scott French) [14:32:12] 06SRE: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203#10646906 (10Tgr) (ran `scap backport --revert` and stopped after it synced to testservers so everything should be back to original condition) [14:32:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [14:33:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [14:34:28] (03Merged) 10jenkins-bot: switchdc: delete Job objects for mw-cron due to library support [cookbooks] - 10https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:35:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:35:18] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add Puppet fact to determine the boot method - 
https://phabricator.wikimedia.org/T389217 (10MoritzMuehlenhoff) 03NEW [14:35:40] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add Puppet fact to determine the boot method - https://phabricator.wikimedia.org/T389217#10646930 (10MoritzMuehlenhoff) [14:36:25] (03CR) 10Alexandros Kosiaris: [C:03+2] rt: Switch ATS to mw-web [puppet] - 10https://gerrit.wikimedia.org/r/1128879 (https://phabricator.wikimedia.org/T385777) (owner: 10Alexandros Kosiaris) [14:36:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:55] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:37:12] (03Merged) 10jenkins-bot: mw-wikifunctions: temporarily extend helm timeout to 20m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128878 (owner: 10Scott French) [14:37:45] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [14:37:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [14:38:36] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [14:39:05] (03CR) 10Vgutierrez: [C:03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1128821 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [14:40:39] we will be proceeding with the traffic and services switchover now. Please avoid making any major changes and let us know if you notice any major issues [14:41:27] akosiaris: Should I add https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094531 to a puppet window or something like that? 
[14:41:45] !log hnowlan@cumin2002 START - Cookbook sre.dns.admin DNS admin: depool site codfw [reason: no reason specified, no task ID specified] [14:42:02] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool site codfw [reason: no reason specified, no task ID specified] [14:42:08] (03CR) 10Alexandros Kosiaris: "Cool, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:42:14] ^ aborted, not a fail [14:42:27] ok :) [14:42:40] !log hnowlan@cumin2002 START - Cookbook sre.dns.admin DNS admin: depool site codfw [reason: Datacentre switchover, T387444] [14:42:43] T387444: MoveComms support for March 2025 Datacentre switchover - https://phabricator.wikimedia.org/T387444 [14:42:44] dancy: niah, I 'd rather do it tomorrow early EU morning long with enough time to debug things if they break [14:42:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site codfw [reason: Datacentre switchover, T387444] [14:43:02] s/long// [14:43:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:43:15] (03PS1) 10Bking: opensearch: use full paths for binaries [puppet] - 10https://gerrit.wikimedia.org/r/1128880 (https://phabricator.wikimedia.org/T386868) [14:43:46] (03PS2) 10Bking: opensearch: use full paths for binaries [puppet] - 10https://gerrit.wikimedia.org/r/1128880 (https://phabricator.wikimedia.org/T386868) [14:44:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:11] akosiaris: OK! 
[14:44:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:31] !log hnowlan@cumin2002 START - Cookbook sre.discovery.datacenter [14:45:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [14:47:14] stashbot: test [14:47:55] hnowlan: fyi your last !log didn’t make it to https://sal.toolforge.org/ yet AFAICT :S [14:47:59] ah, and there’s the stashbot quit… [14:48:23] * Lucas_WMDE tries to restart stashbot [14:48:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1128880 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [14:48:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:49:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:49:06] Lucas_WMDE: ah, thanks for the heads-up. Thankfully not a majorly important one (compared to the one before!) [14:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:23] (03CR) 10Bking: [C:03+2] opensearch: use full paths for binaries [puppet] - 10https://gerrit.wikimedia.org/r/1128880 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [14:49:25] phew :D [14:53:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10647007 (10Jhancock.wm) having an issue with getting the last 4 provisioned. 
I run the provisioning script but it times out on the redfish call. Retrieving the... [14:56:35] !log bking@logstash1033 running puppet agent to confirm that CR 1128880 didn't cause problems T386868 [14:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:39] T386868: Port Sudachi to OpenSearch 1.x - https://phabricator.wikimedia.org/T386868 [14:58:11] things are starting to level off in the CDN https://grafana.wikimedia.org/goto/STZhBs2NR?orgId=1 [14:58:55] inflatador: o/ if you are not fixing really urgent things please hold-off merging and changing production, we are doing the first part of the switchover [14:59:45] +1 [15:00:05] hnowlan and jasmine_: I, the Bot under the Fountain, call upon thee, The Deployer, to do Datacentre switchover: Services and Traffic deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1400). [15:00:05] jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1500). [15:00:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:00:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:01:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:01:09] brouberol: o/ switchover in progress, please hold off to change production unless really necessary :) [15:01:19] oops, sorry, my bad [15:01:39] hnowlan: yeah, I vaguely recall waiting 2 or 3 TTLs and saw a similar stabilization of residual traffic. 
this is probably about as good as it'll get for now [15:01:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudelastic1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:02:09] yes, 20-30 mins is as good as it gets in some ways. beyond that, you really can't do anything about recursors that don't respect the TTL [15:02:47] could be worse :) [15:02:50] (03PS15) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [15:02:53] I'll move ahead with services [15:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [15:05:01] !log hnowlan@cumin2002 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T385155 [15:05:05] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [15:06:20] FIRING: [2x] ProbeDown: Service vrts1003:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts1003:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:29] (03PS1) 10Bking: opensearch: symlink sudachi dir instead of dic file [puppet] - 10https://gerrit.wikimedia.org/r/1128884 (https://phabricator.wikimedia.org/T386868) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:20] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128884 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [15:07:58] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on vrts1003.eqiad.wmnet with reason: debugging T389079 [15:08:01] T389079: VRT Logons are delayed - https://phabricator.wikimedia.org/T389079 [15:08:57] (03PS3) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 [15:09:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10647052 (10MoritzMuehlenhoff) JFTR, I started a patch to add a Partman config with EFI, so we should be good to use UEFI with these servers eventually once reviewed... [15:10:09] (03PS2) 10Bking: opensearch: symlink sudachi dir instead of dic file [puppet] - 10https://gerrit.wikimedia.org/r/1128884 (https://phabricator.wikimedia.org/T386868) [15:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:10:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:11:42] The PHPFPMTooBusy error can be ignored [15:11:46] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128884 (https://phabricator.wikimedia.org/T386868) 
(owner: 10Bking) [15:11:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on cloudelastic1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:11:49] ack [15:11:53] as a whole the cluster is ok [15:12:07] mw-web-ro switching now [15:12:09] it's just canaries being overloaded, probably it took some bigger jobs [15:12:36] main is at 47% so [15:12:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 457594384 and 43 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:13:57] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 225408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:15:43] (03PS16) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [15:15:49] (03CR) 10Ebernhardson: [C:03+1] opensearch: symlink sudachi dir instead of dic file [puppet] - 10https://gerrit.wikimedia.org/r/1128884 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [15:16:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on cloudelastic1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:18:07] (03PS1) 10Brouberol: 
an-druid: allow k8s pods to hit the coordinator API [puppet] - 10https://gerrit.wikimedia.org/r/1128889 (https://phabricator.wikimedia.org/T386282) [15:18:27] (03PS2) 10Brouberol: an-druid: allow k8s pods to hit the coordinator API [puppet] - 10https://gerrit.wikimedia.org/r/1128889 (https://phabricator.wikimedia.org/T386282) [15:18:35] (03PS1) 10Arnaudb: vrts: denylist a sender [puppet] - 10https://gerrit.wikimedia.org/r/1128888 (https://phabricator.wikimedia.org/T389079) [15:18:57] (03CR) 10Bking: [C:03+2] opensearch: symlink sudachi dir instead of dic file [puppet] - 10https://gerrit.wikimedia.org/r/1128884 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [15:19:12] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5101/co" [puppet] - 10https://gerrit.wikimedia.org/r/1128889 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:20:21] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1128889 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:20:52] (03PS17) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [15:21:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on cloudelastic1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:22:12] hnowlan: mw-*-ro services look ok [15:22:24] (03PS1) 10Brouberol: airflow: grant an-druid access to the analytics profiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128890 (https://phabricator.wikimedia.org/T386282) [15:22:37] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10647144 (10TheDJ) [15:22:56] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10647147 (10TheDJ) [15:23:08] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1128890 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:26:27] (03PS1) 10Muehlenhoff: osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) [15:26:43] (03PS18) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [15:26:48] RESOLVED: [3x] PuppetZeroResources: Puppet has failed generate resources on cloudelastic1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:26:49] (03CR) 10CI reject: [V:04-1] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [15:27:12] claime: cool :) eqiad mw-web itself is a little spicy as far as saturation goes, but I will follow up after the switch [15:27:31] hnowlan: the graph is misleading, alert threshold is at 75% not 60 [15:27:48] ah, d'oh [15:27:51] I guess we didn't update grafana x) [15:28:04] (03CR) 10Dzahn: [C:03+1] nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - 10https://gerrit.wikimedia.org/r/1128338 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [15:28:36] (03CR) 10CI reject: [V:04-1] osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:28:52] hnowlan: updated [15:28:54] (03CR) 10JHathaway: [C:03+1] vrts: denylist a sender [puppet] - 10https://gerrit.wikimedia.org/r/1128888 (https://phabricator.wikimedia.org/T389079) (owner: 10Arnaudb) [15:28:57] (03PS1) 10Vgutierrez: liberica: Allow configuring UDP services [puppet] - 10https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210) [15:29:11] thanks! 
[15:29:15] (PS19) Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - https://gerrit.wikimedia.org/r/1105018
[15:31:53] (CR) Fabfur: pontoon: update and clarify instructions (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1128857 (owner: Filippo Giunchedi)
[15:32:27] (PS20) Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - https://gerrit.wikimedia.org/r/1105018
[15:34:42] (PS1) Hnowlan: switchdc: clarify inputs for moving active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/1128895 (https://phabricator.wikimedia.org/T385155)
[15:34:44] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Datacenter Switchover - T385155
[15:34:48] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155
[15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:04] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1128895 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[15:44:47] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210) (owner: Vgutierrez)
[15:45:19] traffic/services switchover is complete for today, go about your business!
[15:45:33] \m/ [15:46:24] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:49:07] (03PS1) 10Brouberol: Fix settings deserializing by adjusing the indices [dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) [15:49:12] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [15:49:56] (03PS2) 10Brouberol: Fix settings deserializing by adjusing the indices [dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) [15:50:47] (03CR) 10Brouberol: [C:03+2] airflow: grant an-druid access to the analytics profiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128890 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:51:04] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [15:51:09] (03CR) 10Brouberol: [V:03+1 C:03+2] an-druid: allow k8s pods to hit the coordinator API [puppet] - 10https://gerrit.wikimedia.org/r/1128889 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [15:51:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:56:46] (03PS3) 10Filippo Giunchedi: pontoon: update and clarify instructions [puppet] - 10https://gerrit.wikimedia.org/r/1128857 [15:56:50] !log Silenced PHPFPMTooBusy for release=canary for 6d - T389224 [15:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:55] T389224: Align mw-on-k8s alerts with capacity pools - https://phabricator.wikimedia.org/T389224 [15:57:11] (03CR) 10Filippo Giunchedi: "Thank you for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1128857 (owner: 10Filippo Giunchedi) [15:57:27] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:47] (03CR) 10Arnaudb: [C:03+2] vrts: denylist a sender [puppet] - 10https://gerrit.wikimedia.org/r/1128888 (https://phabricator.wikimedia.org/T389079) (owner: 10Arnaudb) [15:59:12] brouberol: ^^ local changes in airflow_common [15:59:38] yep, I ran git checkout -f 10s ago [15:59:45] sorry about that [15:59:46] ack, i'll just restart the timer [15:59:51] tx [15:59:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10647405 (10elukey) @Jhancock.wm I was able to run the cookbook for 2047 but I guess it is the one that you've set the BMC IP manually, so I moved to 2048. I was abl... [16:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:46] (CR) Ssingh: [C:+1] "Looks good, nothing stands out to be a cause of concern." [puppet] - https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210) (owner: Vgutierrez)
[16:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[16:01:28] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply
[16:02:27] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:42] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply
[16:02:44] (CR) Elukey: [C:+2] service: move kartotherian-k8s-ssl fully on k8s [puppet] - https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[16:02:57] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 901659872 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:03:57] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51344 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:06:14] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply
[16:06:36] (PS1) Elukey: Revert "service: move kartotherian-k8s-ssl fully on k8s" [puppet] - https://gerrit.wikimedia.org/r/1128903
[16:07:49] (CR) Ssingh: [C:+1] Revert "service: move kartotherian-k8s-ssl fully on k8s" [puppet] - https://gerrit.wikimedia.org/r/1128903 (owner: Elukey)
[16:08:01] (03CR) 10Elukey: [C:03+2] Revert "service: move kartotherian-k8s-ssl fully on k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1128903 (owner: 10Elukey) [16:08:49] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:08:49] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:09:50] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: bug fix [16:09:55] !log arnaudb@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on phabricator.wikimedia.org with reason: bug fix [16:09:57] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:10:32] !log arnaudb@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on phab1004.eqiad.wmnet with reason: debugging T389079 [16:10:35] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1.*,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:10:36] T389079: VRT Logons are delayed - https://phabricator.wikimedia.org/T389079 [16:10:44] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: bugfix [16:10:47] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker2.*,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:11:34] !log brennen@deploy2002 Started deploy [phabricator/deployment@8884125]: deploy phab2002 for T389220 [16:11:38] T389220: Deploy Phabricator/Phorge 2025-03-18 - https://phabricator.wikimedia.org/T389220 [16:11:41] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kartotherian-k8s-ssl.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:11:55] <_joe_> lol elukey [16:12:00] <_joe_> you killed confd [16:12:04] !log brennen@deploy2002 Finished deploy [phabricator/deployment@8884125]: deploy phab2002 for T389220 (duration: 00m 29s) [16:12:24] !log brennen@deploy2002 Started deploy [phabricator/deployment@8884125]: deploy phab1004 for T389220 [16:13:08] oh no [16:13:16] !log brennen@deploy2002 Finished deploy [phabricator/deployment@8884125]: deploy phab1004 for T389220 (duration: 00m 52s) [16:13:27] (03PS1) 10Elukey: service: set kartotherian's lvs conftool config to Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1128905 [16:13:49] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:13:49] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:14:15] :( [16:14:42] checking configmaster2001 [16:14:48] (03CR) 10Ssingh: [C:03+1] service: set kartotherian's lvs conftool config to Wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1128905 (owner: 10Elukey) [16:14:55] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:15:26] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10647485 (10RobH) > After changing the router side of the qsfp and fiber port back to solid green. > > Can you test it on your side? > > For information, you no longer ha... 
[16:16:27] _joe_ IIUC it was just temporary because I am stupid, but after the rollback it is good (namely, the err files can be removed)
[16:16:38] <_joe_> elukey: yes
[16:16:43] <_joe_> remove them
[16:17:35] !log removed kartotherian-related confd error files from config-master2001 - related to a maintenance issue
[16:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:40] all right should be recovered in a bit even in here
[16:21:01] (CR) Elukey: [C:+2] service: set kartotherian's lvs conftool config to Wikikube [puppet] - https://gerrit.wikimedia.org/r/1128905 (owner: Elukey)
[16:21:41] RESOLVED: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kartotherian-k8s-ssl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:21:51] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply
[16:21:53] (CR) Brouberol: [V:+1 C:+2] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[16:24:12] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply
[16:24:12] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10647578 (MatthewVernon) Open→Resolved @Jhancock.wm host seems good now - no resets reported since the reimage this time yesterday. Thanks for all your work on this...
[16:25:28] !log disabled puppet on A:cp for T388147 [16:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:32] T388147: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147 [16:27:20] (03CR) 10Fabfur: [C:03+2] haproxy: certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [16:27:20] !log disable puppet on lvs low traffic hosts in eqiad/codfw to restart pybal (kartotherian svc change) [16:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:59] !log enabled puppet and depooled cp4038 [16:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:13] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet [16:29:27] !log restart pybal on lvs2014 [16:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Degraded RAID due to failed sdy on ms-be2075 - https://phabricator.wikimedia.org/T383530#10647608 (10Jhancock.wm) 05Open→03Resolved [16:32:59] (03PS1) 10Brouberol: aiflow-research: set a temporary network policy to egress to an-laucher1002:8600 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128906 (https://phabricator.wikimedia.org/T386282) [16:33:17] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [16:33:22] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [16:33:37] (03PS1) 10MVernon: swift: remove ms-be2075 from rings [puppet] - 10https://gerrit.wikimedia.org/r/1128907 (https://phabricator.wikimedia.org/T354872) [16:33:39] (03PS1) 10MVernon: swift: re-add ms-be2075 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1128908 (https://phabricator.wikimedia.org/T354872) [16:33:44] !log restart pybal on lvs2013 
(kartotherian's svc change) [16:33:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:magru and A:cp for 9.2.9-1wm1 [16:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:13] !log brett@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=1) Rolling upgrade/restart of Apache Traffic Server on A:magru and A:cp for 9.2.9-1wm1 [16:37:09] !log enabled puppet on A:cp (T388147) [16:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:13] T388147: HAProxy service should not start if TLS material is invalid - https://phabricator.wikimedia.org/T388147 [16:37:39] !log T389226 Ran mwscript-k8s --comment="T389226" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bnwikivoyage --logwiki=metawiki 'Arafatuniofdhaka' 'আরাফাত হোসেন ভূঁইয়া' [16:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:42] T389226: Unblock stuck global rename of আরাফাত_হোসেন_ভূঁইয়া, AndreasKemper - https://phabricator.wikimedia.org/T389226 [16:37:55] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10647639 (10cmooney) As discussed - somewhat clutching at straws at this point - we're gonna try moving the link/optic from port 48 to port 49 on the switch side. I've recon... 
[16:38:00] !log T389226 Ran mwscript-k8s --comment="T389226" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=mediawikiwiki --logwiki=metawiki 'Schwarze Feder' 'AndreasKemper' [16:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:32] !log repooling cp4038 (T388147) [16:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:40] !log restart pybal on lvs1020 and lvs1019 to pick up kartotherian svc changes [16:38:41] (03CR) 10Andrea Denisse: [C:03+2] grafana: Normalize user fields and validate input in LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [16:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp70[02-16].magru.wmnet} and A:cp for 9.2.9-1wm1 [16:39:53] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:39:57] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:40:38] expected... that's elukey :) [16:41:29] (03PS1) 10David Caro: tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 [16:43:17] (03CR) 10David Caro: [V:03+1] "Tested in tools:" [puppet] - 10https://gerrit.wikimedia.org/r/1128909 (owner: 10David Caro) [16:43:35] (03CR) 10CI reject: [V:04-1] tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 (owner: 10David Caro) [16:44:15] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1128909 (owner: 10David Caro) [16:45:32] (03PS1) 10Elukey: conftool-data: remove unsed/stale configs for kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1128910 [16:45:46] (03CR) 10David Caro: [V:03+1] "The test errors seem unrelated :/" [puppet] - 10https://gerrit.wikimedia.org/r/1128909 (owner: 10David Caro) [16:46:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:46:09] (03PS2) 10David Caro: tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 [16:46:35] (03CR) 10Ssingh: [C:03+1] conftool-data: remove unsed/stale configs for kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1128910 (owner: 10Elukey) [16:46:35] huuh [16:47:00] (03CR) 10Elukey: [C:03+2] conftool-data: remove unsed/stale configs for kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1128910 (owner: 10Elukey) [16:48:17] (03CR) 10CI reject: [V:04-1] tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 (owner: 10David Caro) [16:48:30] edit rate looks fine [16:48:45] just a spike of session loss during 5 or so minutes [16:50:03] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:50:32] (03PS2) 10Elukey: service: set kartotherian and kartotherian-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128344 (https://phabricator.wikimedia.org/T389042) [16:50:58] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - 
https://phabricator.wikimedia.org/T389071#10647692 (10cmooney) It's been moved to port 49 now, but switch is still reporting no TX light on the second lane: ` Mar 18 16:42:37 asw1-b12-drmrs fpc0 qsfp-0/0/49 plugged... [16:51:03] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:51:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:51:18] (03PS3) 10David Caro: tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 [16:51:18] (03PS1) 10David Caro: profile::liberica: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1128912 [16:51:36] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet [16:52:10] (03CR) 10CI reject: [V:04-1] profile::liberica: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1128912 (owner: 10David Caro) [16:53:10] (03PS1) 10Giuseppe Lavagetto: CA: add timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 [16:53:12] (03PS2) 10David Caro: profile::liberica: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1128912 [16:53:12] (03PS4) 10David Caro: tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 [16:56:36] (03CR) 10David Caro: [C:03+2] tools::prometheus: add apache ssl module [puppet] - 10https://gerrit.wikimedia.org/r/1128909 (owner: 10David Caro) [16:57:19] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1128912 (owner: 10David Caro) [16:57:41] (03CR) 10David Caro: [C:03+2] profile::liberica: fix tests [puppet] - 10https://gerrit.wikimedia.org/r/1128912 (owner: 10David Caro) [17:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1700). [17:00:13] o/ [17:00:36] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver2004.codfw.wmnet with OS bookworm [17:00:58] James_F: FYI, I'll get started on switching mw-wikifunctions to PHP 8.1 shortly [17:01:04] Sure. [17:01:20] (03CR) 10Scott French: "Thank you all for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:01:24] Was just looking at manual mw-debug-next page outputs to see if there are any issues I notice. [17:01:25] (03CR) 10Scott French: [C:03+2] hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:02:51] (03PS2) 10Scott French: mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) [17:04:03] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [17:04:49] (03CR) 10Scott French: [C:03+2] mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:05:15] !log move traffic off cr1-drms to allow for pic reset / port reconfiguration T389071 [17:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:19] T389071: cr1-drmrs to asw1-b12-drmrs link down - 
https://phabricator.wikimedia.org/T389071 [17:05:53] (03CR) 10Jcrespo: [C:03+1] swift: remove ms-be2075 from rings [puppet] - 10https://gerrit.wikimedia.org/r/1128907 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [17:05:58] (03CR) 10Jcrespo: [C:03+1] swift: re-add ms-be2075 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1128908 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [17:06:37] (03Merged) 10jenkins-bot: mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:08:08] (03PS1) 10Elukey: sre.hosts.provision: retrieve Supermicro's BMC firmware after DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1128914 (https://phabricator.wikimedia.org/T384838) [17:08:38] (03CR) 10Brouberol: [C:03+2] aiflow-research: set a temporary network policy to egress to an-laucher1002:8600 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128906 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [17:09:34] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:11:13] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:11:51] (03Abandoned) 10Ahmon Dancy: coredump.conf: Disable compression [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [17:11:53] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [17:12:07] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [17:12:21] !log swfrench@deploy2002 Started scap sync-world: Switch mw-wikifuncions to 
PHP 8.1 - T383845 [17:12:25] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:13:05] James_F: starting now - just had to spot check a couple of things due to the lingering diffs from the earlier deployment timeouts [17:13:08] Ack. [17:13:13] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:13:36] (03CR) 10Vgutierrez: "thx for taking care of this <3" [puppet] - 10https://gerrit.wikimedia.org/r/1128912 (owner: 10David Caro) [17:14:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10647822 (10elukey) It turned out to be my fault! I sent a fix (https://gerrit.wikimedia.org/r/1128914), once it gets merged the provision cook... [17:15:10] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:16:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10647833 (10Jhancock.wm) all good. Thanks for your help! 
(i just assumed it was me and i missed something) [17:16:44] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1128914 (https://phabricator.wikimedia.org/T384838) (owner: 10Elukey) [17:17:52] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: retrieve Supermicro's BMC firmware after DHCP [cookbooks] - 10https://gerrit.wikimedia.org/r/1128914 (https://phabricator.wikimedia.org/T384838) (owner: 10Elukey) [17:19:42] FIRING: [3x] CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [17:20:37] !log swfrench@deploy2002 Finished scap sync-world: Switch mw-wikifuncions to PHP 8.1 - T383845 (duration: 11m 46s) [17:20:42] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:21:13] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:22:14] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [17:23:54] (03PS1) 10Brouberol: Prevent mysql passwords from being logged to stdout [dumps] - 10https://gerrit.wikimedia.org/r/1128918 (https://phabricator.wikimedia.org/T388378) [17:24:10] James_F: alright, that'll do it. no obvious issues so far - no errors, latency returning to normal post-deploy, no new / concerning errors in logstash AFAICS. [17:24:34] if there's anything you would like to kick the tires on, that would be greatly appreciated [17:24:37] Agreed, it looks good to me too. 
[17:24:42] FIRING: [3x] CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [17:25:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [17:25:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [17:26:33] Function calls, special pages, etc. all seem fine, and logs look good. [17:26:37] swfrench-wmf: Thank you! [17:26:37] !log resetting PIC 0 on cr1-drmrs (QSFP ports) to move link from port 1 to port 3 T389071 [17:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:41] T389071: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 [17:26:47] James_F: great! thank you as well :) [17:26:57] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:27:46] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [17:28:19] (03PS1) 10Scott French: Revert "mw-wikifunctions: temporarily extend helm timeout to 20m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128919 [17:29:18] swfrench-wmf: Next up, PHP 8.3, right? ;-) [17:30:08] (03PS2) 10Jelto: wcqs: proxy requests to query qui to new wikikube endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) [17:30:14] * swfrench-wmf can hear it coming, off in the distance [17:30:18] :) [17:30:39] Hopefully it'll be a lot less of a heave than from 7.4.
[17:31:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:10] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:31:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10647931 (10phaultfinder) [17:32:01] agreed, yeah - not to jinx things, but ideally it should be a lot smoother [17:32:38] (03PS2) 10DLynch: Enable VisualEditor EditCheck multi-check a/b test on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127945 (https://phabricator.wikimedia.org/T384372) [17:32:38] (03PS1) 10DLynch: Enable VisualEditor EditCheck multi-check a/b test on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128921 (https://phabricator.wikimedia.org/T384372) [17:32:59] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:06] And PHP 8.5 won't get released until Thanksgiving, so we have six months to get semi-current.. 
[17:33:13] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:33:45] (03PS1) 10DLynch: Edit check: set up the multi-check a/b test [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128922 (https://phabricator.wikimedia.org/T384372) [17:34:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128922 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [17:34:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127945 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [17:34:49] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:40:29] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [17:40:37] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:42:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:44:48] (03CR) 10Scott French: [C:03+2] Revert "mw-wikifunctions: temporarily extend helm timeout to 20m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128919 (owner: 10Scott French) [17:46:13] (03Merged) 10jenkins-bot: Revert "mw-wikifunctions: temporarily extend helm timeout to 20m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128919 (owner: 10Scott French) [17:47:19] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:47:25] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:09] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.90 ms [17:48:19] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.72 ms [17:49:13] RECOVERY - 
Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:49:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:49] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:19] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.72 ms [17:53:36] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [17:54:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236 (10MatthewVernon) 03NEW [17:54:30] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [17:54:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236#10648088 (10MatthewVernon) p:05Triage→03Medium [17:55:22] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10648092 (10andrea.denisse) Hi, while running Puppet on the alert hosts I noticed... 
[17:55:46] (03CR) 10Btullis: [C:03+2] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: 10Btullis) [17:57:03] (03Merged) 10jenkins-bot: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: 10Btullis) [17:57:04] (03CR) 10Btullis: [C:03+1] "Good stuff, thanks." [dumps] - 10https://gerrit.wikimedia.org/r/1128918 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [17:58:19] (03CR) 10Btullis: [C:03+1] "Thank you." [dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [17:58:45] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:59:05] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:09] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.08 ms [17:59:17] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.77 ms [18:00:05] jnuche and jeena: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T1800). 
[18:03:36] (03PS2) 10Dzahn: add k8s ingress service aliases for jaeger in codfw [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) [18:04:12] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [18:04:39] (03CR) 10Dzahn: "thanks Alexandros:)" [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [18:09:14] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10648248 (10RobH) Ticket updated to move the link to router port 3. [18:11:01] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1128319 (https://phabricator.wikimedia.org/T388632) (owner: 10Filippo Giunchedi) [18:13:00] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Upgrade cr1-drmrs JunOS [18:13:06] (03PS1) 10Herron: aux-k8s-codfw: populate network subnet and constants [puppet] - 10https://gerrit.wikimedia.org/r/1128926 (https://phabricator.wikimedia.org/T381417) [18:13:26] !log reboot cr1-drmrs to update JunOS (router is drained of traffic) T364092 [18:13:28] (03CR) 10CI reject: [V:04-1] aux-k8s-codfw: populate network subnet and constants [puppet] - 10https://gerrit.wikimedia.org/r/1128926 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [18:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:29] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [18:15:52] (03CR) 10Ladsgroup: "ah thanks. Sorry for this. Should I deploy it now?" 
[dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [18:21:44] (03PS3) 10Herron: aux-k8s-codfw: populate network subnet and constants [puppet] - 10https://gerrit.wikimedia.org/r/1128926 (https://phabricator.wikimedia.org/T381417) [18:27:41] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:28:12] ^^ this is ok, due to cr1-drmrs reboot [18:29:50] (03CR) 10Elukey: [C:03+1] aux-k8s-codfw: populate network subnet and constants [puppet] - 10https://gerrit.wikimedia.org/r/1128926 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [18:34:41] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:35:17] !log re-enabling cr1-drmrs external circuits after upgrade [18:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:04] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10648385 (10cmooney) [18:40:37] (03PS2) 10Brouberol: Prevent mysql passwords from being logged to stdout [dumps] - 10https://gerrit.wikimedia.org/r/1128918 (https://phabricator.wikimedia.org/T388378) [18:41:31] (03CR) 10Brouberol: "@Ladsgroup@gmail.com no worries, we'll take care of this tomorrow. We're actively working on the dumps v1 atm." 
[dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [18:42:39] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp70[02-16].magru.wmnet} and A:cp for 9.2.9-1wm1 [18:50:17] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:drmrs and A:cp for 9.2.9-1wm1 [18:57:28] (03PS1) 10BCornwall: Set ulsfo varnish-frontend to version 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128931 (https://phabricator.wikimedia.org/T378737) [18:58:16] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5102/console" [puppet] - 10https://gerrit.wikimedia.org/r/1128931 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:00:47] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5103/console" [puppet] - 10https://gerrit.wikimedia.org/r/1128931 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:03:54] !log Deploying Refinery at 37a2ddf for c1126977 / T388654 tlwikisource to allowlist [19:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:58] T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654 [19:04:16] !log dr0ptp4kt@deploy2002 Started deploy [analytics/refinery@37a2ddf]: Regular analytics weekly train [analytics/refinery@37a2ddfc] [19:06:36] !log dr0ptp4kt@deploy2002 Finished deploy [analytics/refinery@37a2ddf]: Regular analytics weekly train [analytics/refinery@37a2ddfc] (duration: 02m 20s) [19:07:04] !log dr0ptp4kt@deploy2002 Started deploy [analytics/refinery@37a2ddf] (thin): Regular analytics weekly train THIN [analytics/refinery@37a2ddfc] [19:07:17] (03CR) 
10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128931 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:07:55] !log dr0ptp4kt@deploy2002 Finished deploy [analytics/refinery@37a2ddf] (thin): Regular analytics weekly train THIN [analytics/refinery@37a2ddfc] (duration: 00m 50s) [19:08:19] !log dr0ptp4kt@deploy2002 Started deploy [analytics/refinery@37a2ddf] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@37a2ddfc] [19:08:54] (03CR) 10Ssingh: [C:03+1] "🚢 it!" [puppet] - 10https://gerrit.wikimedia.org/r/1128931 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:08:57] !log dr0ptp4kt@deploy2002 Finished deploy [analytics/refinery@37a2ddf] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@37a2ddfc] (duration: 00m 38s) [19:12:57] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:15:24] (03CR) 10Herron: [C:03+2] aux-k8s-codfw: populate network subnet and constants [puppet] - 10https://gerrit.wikimedia.org/r/1128926 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [19:21:53] !log root@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [19:22:20] !log root@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [19:26:02] (03PS1) 10Herron: aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1128937 (https://phabricator.wikimedia.org/T381417) [19:29:12] RESOLVED: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - 
https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [19:41:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:42:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:43:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:43:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:44:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:44:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:13] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor) [19:48:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:48:23] (03PS5) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor) [19:50:03] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:51:04] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor) [19:51:37] !log Deployed refinery using scap, then deployed onto hdfs (concludes Deploying Refinery at 37a2ddf for c1126977 / T388654 tlwikisource to allowlist) [19:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:41] T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654 [19:53:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:53:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [19:53:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:53:55] (03CR) 10BCornwall: [V:03+1 C:03+2] Set ulsfo varnish-frontend to version 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128931 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:54:25] !log Upgrading remaining ulsfo cache nodes to Varnish 7 (T378737) [19:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:29] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [19:58:27] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T2000). 
[20:00:05] tgr, kemayo, and MichaelG_WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:40] * MichaelG_WMF is here and hopes that things will be more stable now :) [20:00:52] o/ [20:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [20:01:21] o/ [20:01:53] I can deploy [20:02:39] Kemayo: can the two patches go together? [20:02:51] tgr_: That'd be best. [20:03:07] The backport won't do anything without the config change. [20:03:12] I'll throw in the Growth patch too since that doesn't need testing [20:03:47] (03PS1) 10Alexandros Kosiaris: rt: Add port to ATS replacement rule [puppet] - 10https://gerrit.wikimedia.org/r/1128940 (https://phabricator.wikimedia.org/T385777) [20:03:55] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp1107 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:04:02] Just a heads up web will be using the deploy window after this and we'll need it free pretty much top of the hour due to a scheduling constraint [20:04:34] thanks for the heads up [20:04:41] I'll try to remind closer to 2 pacific, but I have a meeting - if we can try to make sure it doesn't go over by much that would be great (seeing there's a backport in the mix) [20:04:44] (03CR) 10Alexandros Kosiaris: [C:03+2] rt: Add port to ATS replacement rule [puppet] - 10https://gerrit.wikimedia.org/r/1128940 (https://phabricator.wikimedia.org/T385777) (owner: 10Alexandros Kosiaris) [20:04:52] Ofc!! 
Sorry for any inconvenience [20:04:55] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp1107 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:05:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128922 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [20:05:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127945 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [20:05:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [20:05:44] backports don't make that much of a difference these days [20:05:51] well, depends on the repo [20:06:24] but the CI caching thing made backport merges a lot faster for most repos [20:06:25] 🤞🤞🤞 [20:06:25] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [20:06:35] Absent something going wrong, or the gate jobs being backed up, I'd expect all my stuff to be done with before :30. 
[20:06:43] yeah no that speedup has been amazing [20:08:19] (03Merged) 10jenkins-bot: Enable VisualEditor EditCheck multi-check a/b test on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127945 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [20:08:21] (03Merged) 10jenkins-bot: Edit check: set up the multi-check a/b test [extensions/VisualEditor] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128922 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [20:08:25] (03Merged) 10jenkins-bot: Growth: enable new way of refreshing LinkRecommendations for pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [20:08:58] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128922|Edit check: set up the multi-check a/b test (T384372)]], [[gerrit:1127945|Enable VisualEditor EditCheck multi-check a/b test on test2wiki (T384372)]], [[gerrit:1128777|Growth: enable new way of refreshing LinkRecommendations for pilots (T386250)]] [20:09:04] T384372: Deploy config change to start the Multi-Reference Check A/B Test - https://phabricator.wikimedia.org/T384372 [20:09:04] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [20:10:29] PROBLEM - Disk space on an-druid1004 is CRITICAL: DISK CRITICAL - free space: /srv 105203 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1004&var-datasource=eqiad+prometheus/ops [20:14:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:15:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:16:07] !log tgr@deploy2002 migr, kemayo, tgr: 
Backport for [[gerrit:1128922|Edit check: set up the multi-check a/b test (T384372)]], [[gerrit:1127945|Enable VisualEditor EditCheck multi-check a/b test on test2wiki (T384372)]], [[gerrit:1128777|Growth: enable new way of refreshing LinkRecommendations for pilots (T386250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:12] T384372: Deploy config change to start the Multi-Reference Check A/B Test - https://phabricator.wikimedia.org/T384372 [20:16:12] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [20:16:15] (03PS1) 10JHathaway: Revert "puppetmaster: remove use of deprecated method in logstash.rb" [puppet] - 10https://gerrit.wikimedia.org/r/1128941 [20:18:04] tgr_: Confirmed that it looks good to me, you can deploy. [20:18:15] !log tgr@deploy2002 migr, kemayo, tgr: Continuing with sync [20:25:34] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128922|Edit check: set up the multi-check a/b test (T384372)]], [[gerrit:1127945|Enable VisualEditor EditCheck multi-check a/b test on test2wiki (T384372)]], [[gerrit:1128777|Growth: enable new way of refreshing LinkRecommendations for pilots (T386250)]] (duration: 16m 36s) [20:25:40] T384372: Deploy config change to start the Multi-Reference Check A/B Test - https://phabricator.wikimedia.org/T384372 [20:25:40] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [20:26:07] (03PS2) 10JHathaway: Revert "puppetmaster: remove use of deprecated method in logstash.rb" [puppet] - 10https://gerrit.wikimedia.org/r/1128941 [20:26:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [20:26:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - 
https://phabricator.wikimedia.org/T388837#10648722 (10VRiley-WMF) a:03VRiley-WMF [20:27:16] (03Merged) 10jenkins-bot: Enable SUL3 logins on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [20:27:45] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127954|Enable SUL3 logins on group 0 (T384153)]] [20:27:49] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:27:51] (03PS1) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [20:27:52] (03PS1) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [20:27:54] (03PS1) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [20:28:20] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [20:30:17] (03CR) 10JHathaway: [C:03+2] Revert "puppetmaster: remove use of deprecated method in logstash.rb" [puppet] - 10https://gerrit.wikimedia.org/r/1128941 (owner: 10JHathaway) [20:34:03] !log tgr@deploy2002 tgr: Backport for [[gerrit:1127954|Enable SUL3 logins on group 0 (T384153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:34:08] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:35:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:38:09] (03PS2) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [20:38:09] (03PS2) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [20:38:09] (03PS2) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [20:38:09] (03PS1) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [20:38:10] (03PS1) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [20:38:41] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [20:44:31] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10648788 (10jhathaway) 05Open→03Resolved a:03jhathaway Unfortunately thi... 
[20:45:05] !log tgr@deploy2002 tgr: Continuing with sync [20:45:34] (03PS3) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [20:45:34] (03PS2) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [20:45:34] (03PS3) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [20:45:35] (03PS3) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [20:45:36] (03PS2) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [20:46:04] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [20:48:23] (03PS4) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [20:48:23] (03PS3) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [20:48:23] (03PS4) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [20:48:24] (03PS4) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [20:48:25] (03PS3) 10Andrew Bogott: cloud-vps vms: remove 
definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [20:48:54] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [20:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:51:37] (03PS5) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [20:51:37] (03PS4) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [20:51:37] (03PS5) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [20:51:37] (03PS5) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [20:51:38] (03PS4) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [20:52:09] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [20:52:36] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127954|Enable SUL3 logins on group 0 (T384153)]] (duration: 24m 51s) [20:52:41] T384153: SUL3 Phase 3: All existing user login on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384153 [20:52:43] 
!log Upgrading cp4038 to Varnish 7 (T378737) [20:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:48] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [20:52:48] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet [20:53:51] PROBLEM - Restbase root url on restbase1041 is CRITICAL: connect to address 10.64.48.40 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [20:54:19] (03PS2) 10Bking: sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1128536 (owner: 10Ryan Kemper) [20:54:31] (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1128536 (owner: 10Ryan Kemper) [20:55:35] !log late UTC deploys done [20:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:59] and no T389203 this time [20:55:59] T389203: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203 [20:57:58] (03PS6) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [20:57:58] (03PS5) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [20:57:58] (03PS6) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [20:57:59] (03PS6) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [20:58:00] (03PS5) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 
10https://gerrit.wikimedia.org/r/1128948 [20:58:24] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [20:59:52] 06SRE: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203#10648841 (10Tgr) The evening deploys went fine so I guess this is fixed? (By {b22a0fa1772150731701c667790d7d7f4b5fe88f} maybe?) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T2100) [21:01:26] 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#10648852 (10jhathaway) @jcrespo is it possible this is correlated with a ferm refresh from a puppet run? In your last example the fer... [21:01:47] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [21:02:34] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:drmrs and A:cp for 9.2.9-1wm1 [21:03:19] okay let's get things going [21:03:27] Jdlrobson: are you here or in slack? [21:04:14] toyofuku: here's good! [21:04:18] yayy okay [21:04:22] confirming the patch [21:04:30] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128877 [21:04:31] ? [21:05:12] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10648887 (10VRiley-WMF) Grabbed a report from the unit and uploaded it to dell. 
Service Request Number: 207190954 awaiting response for diagnosis and the course of action [21:05:48] Jdlrobson: sorry [21:05:54] I'm not the most used to IRC [21:06:53] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/RESTBase [21:08:47] RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/RESTBase [21:09:29] toyofuku: yep [21:09:59] Leaving a comment on the patch but I think it needs a lil tweak [21:10:12] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet [21:10:31] (03CR) 10Stoyofuku-wmf: Disable donation LINK on Catalan Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128877 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [21:11:12] (since I am playing the role of both reviewer and deployer I will note for the chat that I have asked myself to hold the deploy) [21:12:48] (03PS7) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [21:12:48] (03PS6) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [21:12:48] (03PS7) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [21:12:49] (03PS7) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [21:12:50] (03PS6) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [21:13:19] (03CR) 
10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [21:13:57] (03CR) 10Stoyofuku-wmf: [C:03+1] "We talked via DMs and are disabling this everywhere intentionally for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128877 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [21:14:19] I have asked myself to proceed with the deploy [21:14:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128877 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [21:15:50] (03Merged) 10jenkins-bot: Disable donation LINK on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128877 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [21:16:18] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1128877|Disable donation LINK on Catalan Wikipedia (T387768)]] [21:16:22] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [21:17:15] (03CR) 10Ebernhardson: [C:03+1] wcqs: proxy requests to query qui to new wikikube endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) (owner: 10Jelto) [21:17:59] (03CR) 10Bking: [C:03+2] wcqs: proxy requests to query qui to new wikikube endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1118074 (https://phabricator.wikimedia.org/T381909) (owner: 10Jelto) [21:18:17] (03PS8) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) [21:18:17] (03PS7) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) [21:18:17] (03PS8) 
10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) [21:18:18] (03PS8) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [21:18:19] (03PS7) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [21:19:16] thcipriani: dduvall: please distribute my thanks as is applicable for the success caching project everything is SO FAST now [21:19:38] toyofuku: yay! [21:19:54] toyofuku: verified on debug! [21:19:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [21:20:14] So fast that Jon was able to verify before we finished deploying to test servers!! [21:20:20] dduvall: also big +1.. i've been noticing it already today! [21:20:37] https://usercontent.irccloud-cdn.com/file/cujR3SaN/Screenshot%202025-03-18%20at%202.20.32%E2%80%AFPM.png [21:21:08] Jdlrobson: w00t! [21:21:16] !log toyofuku@deploy2002 toyofuku, jdlrobson: Backport for [[gerrit:1128877|Disable donation LINK on Catalan Wikipedia (T387768)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:19] !log toyofuku@deploy2002 toyofuku, jdlrobson: Continuing with sync [21:21:24] moving swiftly along [21:27:57] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10649030 (10jhathaway) @fgiunchedi should we consider this issue resolved, since the arming step for keyholder is manual, if I understand correctly? 
[21:28:54] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128877|Disable donation LINK on Catalan Wikipedia (T387768)]] (duration: 12m 35s) [21:28:59] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [21:29:57] Jdlrobson: all done! [21:30:12] and right before your 2:30 end time ☺️ [21:33:46] wahoo [21:33:49] thanks toyofuku [21:34:17] 🫡 [21:34:23] (03PS9) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) [21:34:23] (03PS8) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [21:34:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [21:34:41] !log web deploy window done [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:48] sorry I forget to do that [21:36:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10649071 (10phaultfinder) [21:38:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10649090 (10VRiley-WMF) kafka-main1001, kafka-main1002, kafka-main1003, kafka-main1004 have been... 
[21:39:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10649096 (10VRiley-WMF) [21:42:27] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:46:36] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10649196 (10VRiley-WMF) They have written back and asked that they would like to have the NIC reseated and the cables as well - and perform a Flea power drain prior to turning it on. Is there a specific downtime t... 
[21:46:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:48:15] !log Upgrading cp4039 to Varnish 7 (T378737) [21:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:18] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [21:48:22] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4039.ulsfo.wmnet [21:48:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:48:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:49:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:53:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:54:39] (03PS1) 10Ebernhardson: query_service gui: Allow proxy to k8s miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1128987 (https://phabricator.wikimedia.org/T381909) [21:54:51] !log dancy@deploy2002 Installing scap version "4.141.2" for 2 host(s) [21:55:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:55:13] (03PS2) 10Ebernhardson: query_service gui: Allow proxy to k8s miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1128987 (https://phabricator.wikimedia.org/T381909) [21:55:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128987 (https://phabricator.wikimedia.org/T381909) (owner: 10Ebernhardson) [21:55:42] 
(03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128987 (https://phabricator.wikimedia.org/T381909) (owner: 10Ebernhardson) [21:56:39] !log dancy@deploy2002 Installation of scap version "4.141.2" completed for 2 hosts [21:58:02] (03CR) 10Bking: [C:03+2] query_service gui: Allow proxy to k8s miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1128987 (https://phabricator.wikimedia.org/T381909) (owner: 10Ebernhardson) [22:01:31] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4039.ulsfo.wmnet [22:02:41] (03CR) 10JHathaway: [C:03+1] Create EFI-enabled Partman recipe for Ganeti in core sites [puppet] - 10https://gerrit.wikimedia.org/r/1128875 (owner: 10Muehlenhoff) [22:10:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10649335 (10Jhancock.wm) [22:23:03] (03PS1) 10Dzahn: conftool-data: add codesearch service to discovery objects [puppet] - 10https://gerrit.wikimedia.org/r/1128988 (https://phabricator.wikimedia.org/T268199) [22:30:52] !log Upgrading cp4040 to Varnish 7 (T378737) [22:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:56] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [22:31:02] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4040.ulsfo.wmnet [22:36:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1001, dbprov1002 - https://phabricator.wikimedia.org/T383871#10649381 (10Jclark-ctr) Decom script not run. 
ran script [22:36:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:38:15] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet [22:39:38] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:58] (03PS1) 10Dzahn: servicecatalog: add codesearch in state service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128989 (https://phabricator.wikimedia.org/T268199) [22:43:24] (03CR) 10CI reject: [V:04-1] servicecatalog: add codesearch in state service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128989 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [22:45:31] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:46:02] (03CR) 10Andrew Bogott: [C:03+2] Openstack: add new files for openstack version 'dalmation' [puppet] - 10https://gerrit.wikimedia.org/r/1128943 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [22:46:09] (03CR) 10Andrew Bogott: [C:03+2] nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1128947 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [22:46:14] (03CR) 10Andrew Bogott: [C:03+2] glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1128944 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [22:47:43] (03PS2) 10Dzahn: servicecatalog: add codesearch in state service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128989 (https://phabricator.wikimedia.org/T268199) [22:50:04] !log jclark@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002" [22:50:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002" [22:50:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:52:07] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:54:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:55:54] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1112 [22:56:19] PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2794 MB (3% inode=96%): /tmp 2794 MB (3% inode=96%): /var/tmp 2794 MB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [22:57:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1112 [22:57:26] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1111 [22:58:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1111 [22:59:50] !log Upgrading cp4041 to Varnish 7 (T378737) [22:59:52] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4041.ulsfo.wmnet [22:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:54] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [23:00:12] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [23:01:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye [23:01:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10649468 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw... [23:03:44] (03PS3) 10Dzahn: servicecatalog: add codesearch in state service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128989 (https://phabricator.wikimedia.org/T268199) [23:04:14] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet [23:13:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:13:07] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:13:30] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:13:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:14:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:14:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:16:24] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:17:33] !log jclark@cumin1002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2089.codfw.wmnet with reason: host reimage [23:20:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10649536 (10Jclark-ctr) [23:22:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2089.codfw.wmnet with reason: host reimage [23:31:50] (03CR) 10CDobbins: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [23:32:35] (03CR) 10CDobbins: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [23:35:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:38:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10649666 (10Jclark-ctr) elastic1111 , elastic1111 updated username and password to normal wiki password Both fail provisioning [23:39:47] jouncebot: nowandnext [23:39:47] No deployments scheduled for the next 6 hour(s) and 20 minute(s) [23:39:47] In 6 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T0600) [23:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:49:17] (03CR) 10Cwhite: [C:03+2] logstash: add ids to plugins where missing [puppet] - 10https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072) (owner: 10Cwhite)