[00:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [00:09:15] RECOVERY - snapshot of s4 in codfw on backupmon1001 is OK: Last snapshot for s4 at codfw (db2239) taken on 2025-03-17 22:42:31 (1785 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:10:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:11:51] !log very late UTC deploys done [00:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:05] !log zabe@mwmaint2002:~$ cat group0.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/afl_text_table_deletedump/{} --sleep 0.3" # T381599 [00:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:09] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [00:15:44] (03PS1) 10Dzahn: mailman: list sync, add option to mail changes to an admin [puppet] - 10https://gerrit.wikimedia.org/r/1128564 (https://phabricator.wikimedia.org/T351202) [00:16:07] (03CR) 10CI reject: [V:04-1] mailman: list sync, add option to mail changes to an admin [puppet] - 10https://gerrit.wikimedia.org/r/1128564 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [00:16:45] (03PS1) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) [00:17:09] (03CR) 10CI reject: [V:04-1] community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 
(https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:21:37] (03PS2) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) [00:22:48] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10644791 (10Dzahn) a:05MoritzMuehlenhoff→03AStein-WMF This is a known issue many of us have ran into before. As Moritz des... [00:23:34] (03Abandoned) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:24:53] PROBLEM - Restbase root url on restbase2033 is CRITICAL: connect to address 10.192.32.174 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [00:25:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644793 (10phaultfinder) [00:25:48] (03CR) 10Cwhite: [C:03+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1128471/5099/" [puppet] - 10https://gerrit.wikimedia.org/r/1128471 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [00:30:14] (03CR) 10Dwisehaupt: "Tried many different ways to get this to work with virtual users, but kept stumbling across bits that didn't work. 
Going to stick with loc" [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:32:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [00:34:17] (03CR) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [00:38:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128577 [00:38:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128577 (owner: 10TrainBranchBot) [00:39:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm [00:39:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644812 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm [00:50:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1257.eqiad.wmnet with reason: host reimage [00:51:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128577 (owner: 10TrainBranchBot) [00:54:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1257.eqiad.wmnet with reason: host reimage [01:08:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1128581 
[01:08:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1128581 (owner: 10TrainBranchBot) [01:13:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:13:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:13:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1257.eqiad.wmnet with OS bookworm [01:14:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm completed: - db1257 (**WARN**) - Removed... [01:14:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644872 (10Jclark-ctr) [01:14:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644873 (10Jclark-ctr) 05Open→03Resolved [01:24:53] RECOVERY - Restbase root url on restbase2033 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/RESTBase [01:28:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1128581 (owner: 10TrainBranchBot) [01:32:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:42:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:50:14] "Our servers are 
currently under maintenance or experiencing a technical problem. Please try again in a few minutes." [01:50:15] Hmm [01:50:20] "Original error: upstream connect error or disconnect/reset before headers. reset reason: connection termination " [01:58:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [01:58:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0200) [02:04:21] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644924 (10phaultfinder) [02:06:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:07:17] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2050 [02:07:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2050 [02:08:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.21 [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128584 (https://phabricator.wikimedia.org/T386216) [02:08:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.21 [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128584 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [02:14:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [02:14:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10644938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [02:20:20] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.21 [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128584 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [02:25:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm [02:25:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10644945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err... [02:26:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10644951 (10Jhancock.wm) [02:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:37:11] FIRING: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [02:42:11] RESOLVED: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [02:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0300) [03:02:25] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128586 (https://phabricator.wikimedia.org/T386216) [03:02:26] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128586 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [03:03:17] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128586 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [03:03:44] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.21 refs T386216 [03:03:47] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [03:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0400) [04:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [04:06:15] !log 
mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.18 (duration: 06m 13s) [04:14:38] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:15:33] RECOVERY - OpenSearch unassigned shard check - 9200 on relforge1004 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [04:18:09] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2239) taken on 2025-03-18 01:25:28 (1155 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:51:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10645089 (10phaultfinder) [05:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS 
OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:50:23] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1003 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10645147 (10phaultfinder) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0600) [06:00:04] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0600). [06:19:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10645188 (10Marostegui) Thank you! 
[06:23:58] (03PS1) 10Marostegui: s4-pager.sql: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1128745 [06:26:47] (03CR) 10Marostegui: [C:03+2] s4-pager.sql: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1128745 (owner: 10Marostegui) [06:27:20] (03Merged) 10jenkins-bot: s4-pager.sql: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1128745 (owner: 10Marostegui) [06:28:43] (03PS1) 10Marostegui: valid_section.pp: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) [06:29:44] (03CR) 10Marostegui: "Amir, I am not sure if the section where I've added is correct, let me know if you want it to be there or in the metadata section." [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [06:37:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:40:41] !log Shifted UTC morning backport windows by an hour to take in account daylight saving time difference between USA and Europe [06:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [06:50:17] jouncebot: refresh [06:50:18] I refreshed my knowledge about deployments. 
[06:50:22] jouncebot: nowandnext [06:50:22] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0600) [06:50:23] In 0 hour(s) and 9 minute(s): UTC morning backport window (legacy daylight saving time confusion) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0700) [06:50:39] stupid daylight saving time [06:53:02] the MediaWiki infrastructure (UTC early) is happening at 6UTC [06:53:43] but in the calendar it is tied to PST and thus is marked at 11pm [06:54:07] it thus shows up in the calendar as happening Yesterday (which is correct from the point of view of PST) [07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window (legacy daylight saving time confusion) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:36] o/ [07:00:39] pfff [07:00:41] jouncebot: refresh [07:00:42] I refreshed my knowledge about deployments. 
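The calendar confusion described at 06:53 to 06:54 above (a 06:00 UTC window showing up as 11pm the previous day) can be checked with a short sketch. It assumes fixed offsets for 2025-03-18, when the USA was already on daylight saving time (PDT, UTC-7) and Europe was still on CET (UTC+1). The helper name `describe_window` is hypothetical and is not part of jouncebot or the Deployments calendar tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets that hold on 2025-03-18 only. The USA had already
# switched to daylight saving time (PDT, UTC-7); Europe was still on CET (UTC+1).
PDT = timezone(timedelta(hours=-7), "PDT")
CET = timezone(timedelta(hours=1), "CET")


def describe_window(start_utc: datetime) -> dict:
    """Render one deployment window start in UTC, US Pacific and Central European time."""
    fmt = "%Y-%m-%d %H:%M"
    return {
        "utc": start_utc.strftime(fmt),
        "pacific": start_utc.astimezone(PDT).strftime(fmt),
        "cet": start_utc.astimezone(CET).strftime(fmt),
    }


# The "MediaWiki infrastructure (UTC early)" window starts at 06:00 UTC.
window = describe_window(datetime(2025, 3, 18, 6, 0, tzinfo=timezone.utc))
print(window)
```

In Pacific time the window falls on the previous calendar day, which matches the remark that it is listed under "Yesterday".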
[07:00:44] jouncebot: now [07:00:44] For the next 0 hour(s) and 59 minute(s): UTC morning backport window (legacy daylight saving time confusion) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0700) [07:00:54] well somehow it missed it [07:03:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) (owner: 10Jon Harald Søby) [07:04:33] (03Merged) 10jenkins-bot: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127458 (https://phabricator.wikimedia.org/T388158) (owner: 10Jon Harald Søby) [07:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [07:05:30] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1127458|Add Portal namespace to kaawiki (T388158)]] [07:05:34] T388158: Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [07:08:54] (03PS13) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [07:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10645256 (10phaultfinder) [07:15:55] of course httpbb tests failed due to mwdebug servers timing out :-( [07:16:25] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [07:16:31] * hashar tries again [07:16:43] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1127458|Add Portal namespace to kaawiki (T388158)]] [07:16:47] T388158: 
Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [07:17:22] (03CR) 10Federico Ceratto: [C:03+1] "Is modules/profile/files/dbbackups/valid_sections.txt to be updated as well?" [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [07:29:08] hmm it happens again [07:29:11] * hashar retries [07:30:26] well they are broken [07:32:58] 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10645290 (10Aklapper) [07:36:26] (03PS1) 10Michael Große: Growth: enable new way of refreshing LinkRecommendations for pilots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) [07:37:04] I keep forgetting that the window moved forward by an hour during confusion time [07:37:45] Is anyone deploying? If so, I have a config change for it. But it can also wait until the next window [07:37:54] The change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128777 [07:38:33] * MichaelG_WMF reads up [07:39:18] hashar: is something broken with the deployments in general? [07:39:21] yeah [07:39:26] I am going to file an unbreak now [07:39:34] gotcha [07:39:38] the window was set at 8am CET [07:39:42] cause the script is broken [07:39:57] I kept it as is in case the person that scheduled the change would show up [07:40:05] and copy pasted it for 9:00 (aka in 20 minutes) [07:40:12] but I have hit a wall which is that the debug servers are 503ing [07:40:29] meh [07:40:37] but thank you for looking into it!
[07:40:41] so essentially we can't deploy :/ [07:41:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128777 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [07:46:06] (03CR) 10Klausman: role::ml_k8s::worker: move ml-serve2001 to containerd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:46:14] (03CR) 10Klausman: [C:03+1] role::ml_k8s::staging::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128462 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:46:49] (03CR) 10Klausman: [C:03+1] role::ml_k8s: extend nrpe_check_disk_options to allow containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128461 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:49:11] filed as T389169 and I have made it an unbreak now [07:49:12] T389169: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [07:49:19] MichaelG_WMF: I don't know what is broken :-( [07:49:52] not my area of expertise either, but I'll have a look at the task nonetheless :) [07:51:08] (03PS1) 10Hashar: Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128778 (https://phabricator.wikimedia.org/T388158) [07:51:42] (03CR) 10Hashar: [C:03+2] "Self merging since that was never deployed."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128778 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [07:52:22] (03CR) 10Muehlenhoff: installserver: set puppetserver2004 for UEFI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [07:52:34] (03Merged) 10jenkins-bot: Revert "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128778 (https://phabricator.wikimedia.org/T388158) (owner: 10Hashar) [07:53:02] hashar: would be nice to see the full HTML that we got for those test-failures. Are we sure it is a timeout? [07:54:42] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@e7be149]: hotfix for webrequest DAGs end_dates for k8s migration [07:54:59] (03PS1) 10Hashar: Add Portal namespace to kaawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128781 (https://phabricator.wikimedia.org/T388158) [07:56:05] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072) (owner: 10Cwhite) [07:56:15] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@e7be149]: hotfix for webrequest DAGs end_dates for k8s migration (duration: 02m 09s) [07:56:51] (03PS1) 10Muehlenhoff: Remove access for swagoel [puppet] - 10https://gerrit.wikimedia.org/r/1128782 [07:57:02] MichaelG_WMF: I am not sure, but we had some occurrences of the mwdebug servers timing out previously [07:57:16] in this case I don't know what the exact cause is, hence why I have filed a new task [07:57:19] I'll investigate [07:58:43] (03PS1) 10Slyngshede: data.yaml: Offboarding jebe [puppet] - 10https://gerrit.wikimedia.org/r/1128783 [07:59:49] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Swagoel out of all services on: 1293 hosts [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0800). [08:00:05] MichaelG_WMF and hashar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:26] you can self-serve? [08:00:32] Let me know if you need/want help [08:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [08:01:07] Amir1: deployment is broken with open UBN: https://phabricator.wikimedia.org/T389169 [08:01:10] the debug servers are apparently broken [08:01:24] (03CR) 10Muehlenhoff: [C:03+1] "Patch looks good, but these mails are for the end of the date, so let's not merge yet." [puppet] - 10https://gerrit.wikimedia.org/r/1128783 (owner: 10Slyngshede) [08:02:36] I had it yesterday too, I "r"ed until it passed [08:02:38] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Swagoel out of all services on: 949 hosts [08:03:05] (03CR) 10Muehlenhoff: [C:03+2] Remove access for swagoel [puppet] - 10https://gerrit.wikimedia.org/r/1128782 (owner: 10Muehlenhoff) [08:05:19] hi, i'm here 👋 [08:05:41] my name fell off the deployments table for some reason [08:06:38] Jhs: hello!!! [08:06:53] Jhs: I have tried to deploy the kaawiki Portal namespace earlier today (an hour ago) [08:07:04] ah, ok, nice [08:07:06] but that failed due to an unrelated reason: our deployment system has an ongoing issue [08:07:27] the https://wikitech.wikimedia.org/wiki/Deployments page has some issue due to USA having already moved to summer time [08:07:32] while Europe is still in winter time [08:07:35] aha [08:07:51] yeah, i'm the one who reported the bug that led Brian to discover that :P [08:07:58] so the window was scheduled one hour ago. 
I fixed it by copy pasting it to now and tried to deploy at the original time (an hour ago) in case you showed up [08:08:29] and well something is broken somewhere in our infra so I have reverted your configuration change and sent it back for review/pending [08:08:41] 👍 that's fine of course [08:09:01] we can try again at the next backport window, or tomorrow morning :) [08:09:25] (03PS1) 10Brouberol: Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) [08:10:17] hashar, sure. i'll move it on [[Deployments]] [08:10:49] Jhs: the new change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128781 [08:11:47] (03PS1) 10Brouberol: Remove airflow-analytics from the ATS and cache configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128786 (https://phabricator.wikimedia.org/T389172) [08:11:48] (03PS1) 10Brouberol: Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) [08:11:50] (03PS1) 10Brouberol: Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) [08:12:12] hashar, great, thanks [08:14:15] (03PS1) 10Brouberol: Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) [08:14:17] (03PS1) 10Brouberol: Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 (https://phabricator.wikimedia.org/T389172) [08:14:19] (03PS1) 10Brouberol: Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) [08:14:49] hashar: httpbb's appserver/test_main.yaml continues to fail right now against mwdebug1001 and mwdebug1002. Is this expected?
Or has a rollback happened? [08:16:45] (03PS1) 10Filippo Giunchedi: logstash: read k8s-mw topics as needed [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) [08:17:27] FIRING: SystemdUnitFailed: mediawiki_job_startupregistrystats-testwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:15] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [08:19:49] (03CR) 10Filippo Giunchedi: [V:03+1] "The topics are not live yet, they will with I5020574a8936" [puppet] - 10https://gerrit.wikimedia.org/r/1128793 (https://phabricator.wikimedia.org/T384335) (owner: 10Filippo Giunchedi) [08:20:35] (03CR) 10Filippo Giunchedi: "LGTM, depends on If5960807bd" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [08:21:43] (03CR) 10Btullis: [C:03+1] Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:01] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:05] (03CR) 10Stevemunene: [C:03+1] Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:21] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 
(https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:30] (03CR) 10Stevemunene: [C:03+1] Remove airflow-analytics from the ATS and cache configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128786 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:41] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:22:52] (03CR) 10Stevemunene: [C:03+1] Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:23:00] (03CR) 10Btullis: [C:03+1] Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:23:24] (03CR) 10Btullis: [C:03+1] Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:23:31] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:24:10] ah I have found it [08:24:11] `UnexpectedValueException: Invalid server index #` [08:27:48] hashar: ? [08:29:16] akosiaris: ? [08:29:22] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:29:27] hashar: I mean, what have you found? [08:29:34] the train is broken [08:29:37] https://phabricator.wikimedia.org/T389169 [08:29:47] httpbb hits some 500 when querying the mwdebug servers [08:29:55] and apparently this time it is really an error in MediaWiki! 
[08:30:18] I know, I am debugging the same task, thanks for adding that info, that is what I was asking [08:30:27] * akosiaris just refreshed [08:30:27] so the trace is for Wikibase, that goes through the parser cache and SqlBagOStuff [08:30:39] (03CR) 10Brouberol: [C:03+2] Drop airflow-analytics DNS records [dns] - 10https://gerrit.wikimedia.org/r/1128785 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:30:43] which apparently is not happy about some config: `Invalid server index #` [08:30:52] !log brouberol@dns1004 START - running authdns-update [08:30:53] which smells like something is borked in operations/mediawiki-config [08:31:27] akosiaris: yeah sorry for the delay, I was busy digging in logstash / copy pasting to the task etc :) [08:32:19] (03CR) 10Brouberol: [C:03+2] Remove airflow-analytics from the ATS and cache configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128786 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:32:20] sometimes I regret not being in a real office, I would have moved with my laptop to the SRE open space and loudly screamed "We have an emergency! Wikis are broken!! I have croissants!" [08:32:33] np, good job on correlating with the logstash stacktrace. 
I was at the apache logs level and was about to move to logstash when you pasted that log line [08:33:01] \o/ [08:33:02] !log brouberol@dns1004 END - running authdns-update [08:33:11] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:33:18] so I think it is an issue with wmf.21 since the tests wikis got promoted over night [08:33:21] I'll look for a repro [08:33:22] (03CR) 10Stevemunene: [C:03+1] Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [08:34:00] I am super happy to find out httpbb did catch an issue [08:34:12] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:34:17] for what is worth, httpbb no longer complains right now [08:34:36] so this has some aspects of a heisenbug at least [08:34:55] OH [08:35:14] cause the wiki is on wmf.20 [08:35:16] and it went through a phase of first no longer complaining about the P13344 page and now it no longer complains about Main_Page either [08:35:19] !log mlitn@deploy2002 Started deploy [airflow-dags/platform_eng@e7be149]: (no justification provided) [08:35:24] and the patch to promote them to wmf.21 did not get deployed due to the test failing [08:35:33] I am still trying to figure out but that is my assumption right now [08:35:47] deploy2002:~$ httpbb /srv/deployment/httpbb-tests/appserver/test_main.yaml --hosts=mwdebug[1001,1002].eqiad.wmnet is how I was running it and seeing first both failures, then only 1 and now none [08:35:53] so if you were to run the httpbb tests manually they still hit wmf.20 
[08:36:01] !log mlitn@deploy2002 Finished deploy [airflow-dags/platform_eng@e7be149]: (no justification provided) (duration: 00m 46s) [08:36:13] I noticed Special:Version was showing wmf.20 which was confusing me ( https://test.wikidata.org/wiki/Special:Version ) [08:36:16] but that never got promoted [08:36:21] I think that is the explanation [08:36:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:36:47] I will write a summary [08:37:09] hashar: this is why I asked above, 10:14:49 hashar: httpbb's appserver/test_main.yaml continues to fail right now against mwdebug1001 and mwdebug1002. Is this expected? Or has a rollback happened? [08:37:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:37:17] I wanted to clear out that possibility [08:37:31] then mwdebug servers are still on wmf.21 ? [08:37:38] err s/still/already/ [08:38:09] * akosiaris double checks [08:39:09] hashar: they should be on 21, that happened last night and that doesn't get rolled back AFAIK [08:39:16] [08:39:20] yup, they are in wmf.21 [08:39:24] https://www.irccloud.com/pastebin/LzqvikZ4/ [08:39:26] good! [08:39:37] and right now, for whatever reasons, httpbb no longer complains about either of those 2 pages [08:40:20] deploy2002:~$ curl -s -4 --connect-to test.wikidata.org:443:10.64.32.123:443 https://test.wikidata.org/wiki/Special:Version | grep 'wmf\.' | head -1 [08:40:20] [08:40:30] that's ^ my super quick way of checking fwiw [08:40:37] the IP is mwdebug1001 fwiw [08:42:03] so whatever the heisenbug is, it looks like it vanishes at some point after a deployment. Which also is consistent with Amir's comment above that they just hit 'r' a few times yesterday and it finally worked. [08:42:23] so that happened previously? 
[08:42:33] 🤷 [08:42:36] :) [08:42:58] I just saw the bug today and responded, but I am totally oblivious to what happened yesterday [08:43:15] it is given a null shardindex [08:43:30] and with SqlBagOStuff / ParserCache I am tempted to invoke Amir1 :-] [08:44:13] akosiaris: the patch I deployed was for portals fully html assets [08:44:55] Amir1: so you just stepped on the mine, it triggered and then it decided to let you leave after a couple of times pressing 'r' ? [08:45:00] might be the data redundancy patch having a bug [08:45:17] lol, just saw throw new UnexpectedValueException( "Invalid server index #$shardIndex" ); [08:45:17] I think that is bug in I80da12396858ee4fc58ae257f6c154b3050df696 yeah [08:45:22] if this is only on wmf.21 [08:45:24] it's literally a null index [08:45:39] I thought # was a number or something but the variable is indeed null, lol [08:45:48] yesterday, wmf.21 wasn't even cut :D [08:45:50] and PHP shows it as an empty string [08:45:54] yup [08:45:56] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1029.eqiad.wmnet with reason: remove from cluster for reimage [08:46:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10645487 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fbeb54b5-2eb9-44e3-bebb-3ffb0c131169) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [08:46:15] it is very likely a bug in the data redundancy patch [08:46:21] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1119745 [08:46:33] give me a bit and I figure out the root cause [08:46:37] if it happened before wmf.20 that means it is in a patch before that? [08:47:04] then logstash would have it [08:47:57] why would it be before wmf.20? 
[08:48:02] the patch I linked should be only in wmf.21, so what I had last night might be a different issue [08:48:13] I think there are actually two issues [08:48:21] I am referring to Amir having to repeat the httpbb tests yesterday? [08:48:27] but yeah that might have been another issue [08:49:12] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:50:17] I'd revert https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1119745 :) [08:50:33] hashar: give me an hour and I'll fix it [08:50:44] but my bet is `array_slice()` returning an empty array and thus there is no index [08:52:35] is wmf.21 on mwdebug? so I can test things with eval.php? [08:52:43] should be yes [08:53:00] cool [08:53:04] the patch that bumps to wmf.21 failed deployment but scap does not roll back [08:54:03] jnuche: it is most probably https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1119745 . There might be an easy quick fix, else it can be reverted to resume the train [08:54:32] hashar: yep, I'm listening in, thank you everyone for taking a look [08:54:33] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1029 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1128412 (owner: 10Muehlenhoff) [08:56:53] well I am taking a break [08:57:17] I have long finished my breakfast but I am still in my pajamas and it is probably not healthy :b [08:57:43] hashar: go for it, thanks again, you made my morning a bit less stressful :) [08:58:34] I wonder whether httpbb could show the exception or request id [08:58:44] the exception id is certainly somewhere in the HTML payload [08:58:58] anyway, this is endless! [08:59:08] I'll be back in roughly half an hour [09:00:05] jnuche and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250318T0900). [09:00:09] jnuche: I am happy to have relieved some stress!! :-] [09:00:22] (03PS1) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [09:00:31] and I am very happy that httpbb caught the issue [09:00:57] (03CR) 10CI reject: [V:04-1] prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) (owner: 10Vgutierrez) [09:01:04] yeah, it's good to see those tests in action [09:01:06] (03CR) 10Slyngshede: "recheck" [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [09:01:46] train window just started, noting here again that the train is currently blocked by T389169 [09:01:46] T389169: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [09:07:20] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1029.eqiad.wmnet [09:09:01] found the issue [09:09:07] fix will be coming shortly [09:09:09] I hate php [09:10:39] (03PS1) 10Slyngshede: Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 [09:11:37] (03PS1) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:13:01] (03PS2) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:13:32] yup, tested in mwdebug2001 and it fixes the issues [09:14:21] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 
(https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:18:26] (03CR) 10Brouberol: [C:03+2] Remove airflow-analytics from the IDP configuration [puppet] - 10https://gerrit.wikimedia.org/r/1128787 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:19:30] (03PS3) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:19:30] hashar: jnuche: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1128802 are you comfortable with merging this or should I find someone to review it? [09:21:16] (03CR) 10CI reject: [V:04-1] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [09:21:16] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:21:43] (you can also backport it to wmf.21 and merge it, I take care of master) [09:23:21] Amir1: that looks good to me, gonna create the backport for 21 and deploy it [09:23:29] Thanks! 
[09:23:41] (03PS4) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:23:59] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:24:27] (03PS1) 10Jaime Nuche: objectcache: Re-number array keys in SqlBagOStuff [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128803 (https://phabricator.wikimedia.org/T389169) [09:24:32] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [09:24:44] (03CR) 10Effie Mouzeli: [C:03+1] mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [09:25:17] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:26:02] (03CR) 10Volans: "Sorry, this one got lost in the backlog. Reply inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [09:26:15] 06SRE, 10Bitu, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10645573 (10Arendpieter) 05Open→03Resolved [09:26:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128803 (https://phabricator.wikimedia.org/T389169) (owner: 10Jaime Nuche) [09:29:17] (03Merged) 10jenkins-bot: Remove the airflow-analytics namespace from the operators tenant ns list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128789 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:30:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:49] (03CR) 10CI reject: [V:04-1] Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 (owner: 10Slyngshede) [09:30:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:31:22] (03CR) 10Ladsgroup: [C:03+1] valid_section.pp: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [09:35:18] (03PS2) 10Vgutierrez: prometheus: Add node_file_age [puppet] - 10https://gerrit.wikimedia.org/r/1128798 (https://phabricator.wikimedia.org/T389175) [09:35:26] (03PS5) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:36:31] (03CR) 
10Marostegui: [C:03+2] valid_section.pp: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1128746 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [09:36:42] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:37:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [09:39:05] (03Merged) 10jenkins-bot: objectcache: Re-number array keys in SqlBagOStuff [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1128803 (https://phabricator.wikimedia.org/T389169) (owner: 10Jaime Nuche) [09:39:34] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1128803|objectcache: Re-number array keys in SqlBagOStuff (T389169)]] [09:39:38] T389169: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [09:39:59] (03PS6) 10Stevemunene: hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) [09:41:12] (03CR) 10CI reject: [V:04-1] hdfs: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1128800 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [09:44:51] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1128803|objectcache: Re-number array keys in SqlBagOStuff (T389169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:44:55] T389169: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 [09:45:05] !log jnuche@deploy2002 jnuche: Continuing with sync [09:45:15] 06SRE, 10Bitu, 10CAS-SSO, 
06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10645637 (10MoritzMuehlenhoff) 05Resolved→03Open This isn't resolved? [09:45:52] success: [09:45:56] https://www.irccloud.com/pastebin/YggQfkGC/ [09:46:15] Amir1: thanks once more! :) [09:48:25] (03CR) 10Zoe: [C:03+1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128381 (owner: 10PipelineBot) [09:49:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [09:49:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1029.eqiad.wmnet [09:50:22] Amir1: well it is missing a PHPUnit test to cover the issue :b [09:50:37] Amir1: thank you for the quick fix! [09:51:27] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki: add rewrite for rt.wikimedia.org to wikitech page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [09:52:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:53:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:53:36] jnuche: I have closed the blocker task [09:53:41] yeah, i try to add a regression test a bit later [09:53:43] akosiaris: Amir1: thank you very much! 
[09:53:54] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics deployemnt helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128790 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:54:02] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128791 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:54:09] (03CR) 10Brouberol: [C:03+2] Remove the airflow-analytics kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1128788 (https://phabricator.wikimedia.org/T389172) (owner: 10Brouberol) [09:55:19] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 106304 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [09:55:36]