[00:08:41] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1133269
[00:08:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1133269 (owner: TrainBranchBot)
[00:26:02] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1133269 (owner: TrainBranchBot)
[00:35:55] (PS1) Ssingh: P:pybal: alert sooner if pybal.conf was changed [puppet] - https://gerrit.wikimedia.org/r/1133271
[00:36:39] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5193/co" [puppet] - https://gerrit.wikimedia.org/r/1133271 (owner: Ssingh)
[00:38:57] SRE, serviceops: mwscript-cleanup.service failure - https://phabricator.wikimedia.org/T390790#10701976 (RLazarus) a:RLazarus
[00:42:52] (PS1) RLazarus: mwscript_cleanup: Add mediawiki-common to excluded releases [puppet] - https://gerrit.wikimedia.org/r/1133275 (https://phabricator.wikimedia.org/T390790)
[01:38:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:23:37] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:29:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702038 (phaultfinder)
[02:39:31] (CR) Scott French: [C:+1] mwscript_cleanup: Add mediawiki-common to excluded releases [puppet] - https://gerrit.wikimedia.org/r/1133275 (https://phabricator.wikimedia.org/T390790) (owner: RLazarus)
[02:53:37] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:17:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:37:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:37:48] FIRING: PuppetFailure: Puppet has failed on cirrussearch2055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:41] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702171 (phaultfinder)
[04:39:49] (CR) Giuseppe Lavagetto: [C:+1] php8.1: Rebuild to update Debian packages [docker-images/production-images] - https://gerrit.wikimedia.org/r/1133229 (owner: Scott French)
[04:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 5.769% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:48:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:50:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 20.55s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 1.351% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:53:35] FIRING: [9x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:53:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:55:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 7.144s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:58:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:02:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:04:30] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:05:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 16.54s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:30] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:10:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 6.128s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:40:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[05:45:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[05:53:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T0600)
[06:23:37] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:45:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:50:55] jouncebot: next
[06:50:55] In 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T0700)
[06:51:28] (CR) Slyngshede: [C:+2] Fix removal of Gerrit json prefix [software/bitu] - https://gerrit.wikimedia.org/r/1131991 (owner: Hashar)
[06:51:49] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet
[06:52:30] !log reboot registry2004 - already done for 2005 yesterday to debug a logging issue, to keep the codfw in the same state - T390251
[06:54:13] (Merged) jenkins-bot: Fix removal of Gerrit json prefix [software/bitu] - https://gerrit.wikimedia.org/r/1131991 (owner: Hashar)
[06:54:38] (CR) Slyngshede: [C:+2] Simplify invocation of clients integrations [software/bitu] - https://gerrit.wikimedia.org/r/1131460 (owner: Hashar)
[06:55:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet
[06:57:20] (Merged) jenkins-bot: Simplify invocation of clients integrations [software/bitu] - https://gerrit.wikimedia.org/r/1131460 (owner: Hashar)
[06:57:52] (PS2) Muehlenhoff: Add service record for puppetserver2004 [dns] - https://gerrit.wikimedia.org/r/1130073 (https://phabricator.wikimedia.org/T381274)
[06:59:37] (CR) Muehlenhoff: [C:+2] Add service record for puppetserver2004 [dns] - https://gerrit.wikimedia.org/r/1130073 (https://phabricator.wikimedia.org/T381274) (owner: Muehlenhoff)
[06:59:50] !log jmm@dns1004 START - running authdns-update
[07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:21] (CR) Slyngshede: [C:+2] Fix handling of status code in Gerrit integration [software/bitu] - https://gerrit.wikimedia.org/r/1131471 (owner: Hashar)
[07:02:09] !log jmm@dns1004 END - running authdns-update
[07:03:01] (Merged) jenkins-bot: Fix handling of status code in Gerrit integration [software/bitu] - https://gerrit.wikimedia.org/r/1131471 (owner: Hashar)
[07:03:31] (CR) Slyngshede: [C:+2] Add a basic test for user_block in LDAP [software/bitu] - https://gerrit.wikimedia.org/r/1132019 (owner: Hashar)
[07:06:12] (Merged) jenkins-bot: Add a basic test for user_block in LDAP [software/bitu] - https://gerrit.wikimedia.org/r/1132019 (owner: Hashar)
[07:07:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:12:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:15:17] (PS1) Muehlenhoff: Revert "Add service record for puppetserver2004" [dns] - https://gerrit.wikimedia.org/r/1133306
[07:15:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:15:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:16:10] !log jmm@dns1004 START - running authdns-update
[07:17:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:17:48] RESOLVED: PuppetFailure: Puppet has failed on cirrussearch2055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:18:07] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10702317 (MoritzMuehlenhoff) I had to revert the addition, it caused Puppet failures like the following: ` 09:09:16 err Error while ev...
[07:18:25] !log jmm@dns1004 END - running authdns-update
[07:19:56] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:20:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:20:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:22:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:24:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702364 (phaultfinder)
[07:29:22] !log depool cp7001 to fix stale ocsp alert (T384227)
[07:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:25] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[07:30:20] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:33:49] (PS1) Elukey: ml-services: add seccomp profile to editquality-reverted in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1133315 (https://phabricator.wikimedia.org/T369493)
[07:34:16] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2014.*,lvs1020.*} and A:lvs
[07:36:15] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2014.*,lvs1020.*} and A:lvs
[07:39:28] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:39:32] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:40:21] SRE, Infrastructure-Foundations, LDAP: Extend LDAP group cross check - https://phabricator.wikimedia.org/T390817 (MoritzMuehlenhoff) NEW
[07:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:46:10] (PS1) Jelto: gitlab_runner: increase job output_limit to 20MB [puppet] - https://gerrit.wikimedia.org/r/1133316 (https://phabricator.wikimedia.org/T390816)
[07:46:47] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs1019.*} and A:lvs
[07:47:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs2013.*,lvs1019.*} and A:lvs
[07:48:52] (PS2) Elukey: sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[07:49:22] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:49:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:50:02] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5194/co" [puppet] - https://gerrit.wikimedia.org/r/1133316 (https://phabricator.wikimedia.org/T390816) (owner: Jelto)
[07:54:25] (PS3) Elukey: sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[07:57:36] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:57:40] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:00:05] dancy and andre: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T0800)
[08:00:20] (PS1) Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - https://gerrit.wikimedia.org/r/1133317
[08:01:31] ops-eqiad, DC-Ops: Fix "changeme" cable labels - https://phabricator.wikimedia.org/T390818 (ayounsi) NEW
[08:04:28] (PS1) Alexandros Kosiaris: mw-wikifunctions: Add an extra SAN [deployment-charts] - https://gerrit.wikimedia.org/r/1133318 (https://phabricator.wikimedia.org/T384944)
[08:05:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:06:14] (PS8) Tiziano Fogli: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283)
[08:07:50] topranks: I haven't had any coffee yet but that Colt iface going down in esams seems unexpected?
[08:08:47] I also lack coffee, and yep that doesn’t sound good let me see
[08:09:50] The Colt thing always confuses me, Lumen sold their EU business to them if I remember right
[08:11:06] yeah port is down, traffic is coming over the GTT VPLS
[08:11:13] should be ok until it restores
[08:13:16] (PS4) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:15:23] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:15:27] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:15:29] (PS2) Alexandros Kosiaris: mw-wikifunctions: Add an extra SAN [deployment-charts] - https://gerrit.wikimedia.org/r/1133318 (https://phabricator.wikimedia.org/T384944)
[08:16:04] (PS1) Slyngshede: Wikimedia:jobs check if attribute exists [software/bitu] - https://gerrit.wikimedia.org/r/1133322
[08:18:08] !log repooled cp7001 (T384227)
[08:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:11] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[08:19:54] (CR) CI reject: [V:-1] WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey)
[08:19:58] (CR) Slyngshede: [C:+2] Wikimedia:jobs check if attribute exists [software/bitu] - https://gerrit.wikimedia.org/r/1133322 (owner: Slyngshede)
[08:20:35] (CR) Ilias Sarantopoulos: [C:+1] ml-services: add seccomp profile to editquality-reverted in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1133315 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[08:21:02] (CR) Elukey: [C:+2] ml-services: add seccomp profile to editquality-reverted in codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1133315 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[08:22:39] (Merged) jenkins-bot: Wikimedia:jobs check if attribute exists [software/bitu] - https://gerrit.wikimedia.org/r/1133322 (owner: Slyngshede)
[08:22:55] (PS5) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:23:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:23:21] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:24:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702512 (phaultfinder)
[08:24:39] (PS6) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:25:22] (CR) Filippo Giunchedi: [C:+1] auth_metrics: add recording rules for grafana widgets [puppet] - https://gerrit.wikimedia.org/r/1133170 (https://phabricator.wikimedia.org/T390672) (owner: Tiziano Fogli)
[08:26:12] (PS1) Muehlenhoff: Add a canonical file to track sensitive groups [puppet] - https://gerrit.wikimedia.org/r/1133325
[08:26:28] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:26:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:27:29] (PS3) Muehlenhoff: Create insetup role for Data Platform with nftables and merge DE/Search roles [puppet] - https://gerrit.wikimedia.org/r/1132422 (https://phabricator.wikimedia.org/T389825)
[08:27:33] (PS7) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:28:21] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:28:25] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:29:51] (PS8) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:30:49] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:30:53] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:31:18] !log trunk sandbox vlan to eqiad row B ganeti - T385560
[08:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:20] T385560: Create RIPE Atlas anchors VMs - https://phabricator.wikimedia.org/T385560
[08:31:57] (PS9) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:32:09] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:32:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet
[08:32:27] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:33:05] (PS1) Muehlenhoff: Switch ganeti6001 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133326
[08:34:16] (CR) Alexandros Kosiaris: [C:+2] mw-wikifunctions: Add an extra SAN [deployment-charts] - https://gerrit.wikimedia.org/r/1133318 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[08:34:41] (PS1) Slyngshede: Block/Unblock template error [software/bitu] - https://gerrit.wikimedia.org/r/1133327
[08:36:02] (PS10) Elukey: WIP - sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[08:36:17] (CR) Muehlenhoff: [C:+2] Revert "Add service record for puppetserver2004" [dns] - https://gerrit.wikimedia.org/r/1133306 (owner: Muehlenhoff)
[08:36:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:36:35] !log jmm@dns1004 START - running authdns-update
[08:36:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:37:04] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10702558 (ops-monitoring-bot) Draining ganeti6001.drmrs.wmnet of running VMs
[08:37:53] (CR) Muehlenhoff: [C:+2] Switch ganeti6001 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133326 (owner: Muehlenhoff)
[08:37:57] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:38:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:38:54] !log jmm@dns1004 END - running authdns-update
[08:39:59] (Merged) jenkins-bot: mw-wikifunctions: Add an extra SAN [deployment-charts] - https://gerrit.wikimedia.org/r/1133318 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[08:40:55] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:41:03] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:41:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[08:41:30] (CR) Slyngshede: [C:+2] Block/Unblock template error [software/bitu] - https://gerrit.wikimedia.org/r/1133327 (owner: Slyngshede)
[08:41:51] moritzm: I see a lot of puppet failures, same error as before
[08:43:53] (Merged) jenkins-bot: Block/Unblock template error [software/bitu] - https://gerrit.wikimedia.org/r/1133327 (owner: Slyngshede)
[08:44:13] (PS1) Ayounsi: gNMIc set retry to 1 minute [puppet] - https://gerrit.wikimedia.org/r/1133328 (https://phabricator.wikimedia.org/T388641)
[08:45:16] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133328 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[08:45:16] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:45:34] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:46:10] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:46:39] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:47:15] volans: I deployed the revert a few minutes ago, the initial one was incomplete, should recover now
[08:47:29] got it
[08:47:30] (PS3) Cyndywikime: Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566)
[08:47:33] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:47:37] I see the failed ones going down
[08:47:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[08:47:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet
[08:47:52] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:47:53] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [08:48:01] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [08:48:10] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [08:48:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [08:48:25] FIRING: SystemdUnitFailed: nic-saturation-exporter.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:28] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [08:48:35] FIRING: [9x] ProbeDown: Service ganeti6001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:50:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [08:50:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:50:45] (PS11) Elukey: sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) [08:50:48] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702622 (phaultfinder) [08:51:58] (CR) Elukey: "Tested with ipv6 enabled/disabled via WebUI, all worked as expected." 
[cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey) [08:53:25] RESOLVED: SystemdUnitFailed: nic-saturation-exporter.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:24] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:56:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet [08:56:03] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti6001.drmrs.wmnet [08:56:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [08:57:04] SRE, Infrastructure-Foundations, netops, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10702638 (aborrero) >>! In T389958#10683594, @cmooney wrote: > @aborrero @taavi one thing we could maybe try, if we wanted to make progress sooner (i.e. with... 
[08:57:09] (PS1) Muehlenhoff: Switch ganeti6002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133331 [09:02:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:03:00] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey) [09:03:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on elastic2065:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:04:05] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1133325 (owner: Muehlenhoff) [09:04:08] (CR) Muehlenhoff: [C:+2] Switch ganeti6002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133331 (owner: Muehlenhoff) [09:05:48] FIRING: PuppetFailure: Puppet has failed on cirrussearch2055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:05:52] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10702692 (Krinkle) The two theories (increase in bots opening the login page, vs SUL3 bug part of rollout) may not be as separate. What if: * Bots are inde... 
[09:06:38] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10702693 (Clement_Goubert) Yeah sure, fine by me, at least it's the last in the range so easy to keep it separated :D [09:07:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [09:09:35] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702695 (phaultfinder) [09:10:27] (PS9) Tiziano Fogli: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) [09:12:03] (CR) CI reject: [V:-1] perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: Tiziano Fogli) [09:12:38] (PS1) Alexandros Kosiaris: Add a .gitmessage file [dns] - https://gerrit.wikimedia.org/r/1133334 [09:12:38] (PS1) Alexandros Kosiaris: Add wikifunctions-ingress-ro records [dns] - https://gerrit.wikimedia.org/r/1133335 (https://phabricator.wikimedia.org/T384944) [09:12:45] RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:14:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [09:14:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [09:14:53] (CR) Alexandros Kosiaris: [C:+2] Add a .gitmessage file [dns] - https://gerrit.wikimedia.org/r/1133334 (owner: Alexandros Kosiaris) [09:14:56] (CR) Alexandros 
Kosiaris: [C:+2] Add wikifunctions-ingress-ro records [dns] - https://gerrit.wikimedia.org/r/1133335 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris) [09:16:27] (CR) Elukey: [C:+2] sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey) [09:16:45] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [09:16:47] !log akosiaris@dns1004 START - running authdns-update [09:17:02] !log failover ganeti masters in drmrs to ganeti6001/6002 [09:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:14] (PS1) Brouberol: airflow: upgrade airflow to 2.10.5 [deployment-charts] - https://gerrit.wikimedia.org/r/1133338 (https://phabricator.wikimedia.org/T390575) [09:18:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on elastic2065:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:18:53] !log create mw-wikifunctions-ingress.discovery.wmnet and .svc records to facilitate the migration to ingress [09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:02] (PS10) Tiziano Fogli: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) [09:19:28] !log akosiaris@dns1004 END - running authdns-update [09:20:40] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702752 (phaultfinder) [09:21:23] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [09:21:47] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 
'revscoring-editquality-reverted' for release 'main' . [09:23:42] !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mr1-ulsfo with reason: reboot [09:24:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet [09:24:27] !log rebooting mr1-ulsfo - T390052 [09:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:29] T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 [09:25:02] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1132583 (owner: Slyngshede) [09:26:05] (CR) Slyngshede: [C:+2] Permission log: Improve speed of permission log [software/bitu] - https://gerrit.wikimedia.org/r/1132583 (owner: Slyngshede) [09:27:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [09:27:48] (PS1) Elukey: Revert "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1133341 [09:27:49] (CR) Btullis: [C:+1] "Nice." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1133338 (https://phabricator.wikimedia.org/T390575) (owner: Brouberol) [09:28:03] (CR) Elukey: [V:+2 C:+2] Revert "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1133341 (owner: Elukey) [09:28:35] FIRING: [9x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:28:44] (Merged) jenkins-bot: Permission log: Improve speed of permission log [software/bitu] - https://gerrit.wikimedia.org/r/1132583 (owner: Slyngshede) [09:29:23] (CR) Brouberol: [C:+2] airflow: upgrade airflow to 2.10.5 [deployment-charts] - https://gerrit.wikimedia.org/r/1133338 (https://phabricator.wikimedia.org/T390575) (owner: Brouberol) [09:29:41] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:31:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:32:30] FIRING: [3x] Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:32:48] (PS1) Alexandros Kosiaris: cache::backend: Switch mw-wikifunctions to ingress [puppet] - https://gerrit.wikimedia.org/r/1133343 (https://phabricator.wikimedia.org/T384944) [09:33:52] (CR) Effie Mouzeli: [C:+1] php8.1: Rebuild to update Debian packages [docker-images/production-images] - https://gerrit.wikimedia.org/r/1133229 (owner: Scott French) [09:34:19] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet [09:34:28] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [09:35:21] (CR) Alexandros Kosiaris: [C:+2] cache::backend: Switch mw-wikifunctions to ingress [puppet] - https://gerrit.wikimedia.org/r/1133343 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris) [09:35:54] (PS1) Volans: CHANGELOG: add changelogs for release v0.8.0 [software/homer] - https://gerrit.wikimedia.org/r/1133344 [09:36:30] (PS1) Marostegui: instances.yaml: Add db1257 [puppet] - https://gerrit.wikimedia.org/r/1133345 (https://phabricator.wikimedia.org/T381475) [09:36:34] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v0.8.0 [software/homer] - https://gerrit.wikimedia.org/r/1133344 (owner: Volans) [09:36:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:37:30] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:38:20] (CR) Marostegui: [C:+2] instances.yaml: Add db1257 [puppet] - https://gerrit.wikimedia.org/r/1133345 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui) [09:40:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [09:40:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet [09:41:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [09:41:08] (PS1) Muehlenhoff: Switch ganeti6003 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133346 [09:41:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1257 to dbctl depooled T381475', diff saved to https://phabricator.wikimedia.org/P74555 and previous config saved to /var/cache/conftool/dbconfig/20250402-094109-marostegui.json [09:41:13] T381475: Productionize x3 hosts - https://phabricator.wikimedia.org/T381475 [09:41:58] (PS1) Marostegui: db1257: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1133347 (https://phabricator.wikimedia.org/T381475) [09:42:05] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.8.0 [software/homer] - https://gerrit.wikimedia.org/r/1133344 (owner: Volans) [09:42:42] (CR) Marostegui: [C:+2] db1257: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1133347 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui) [09:44:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74556 and previous config saved to /var/cache/conftool/dbconfig/20250402-094428-root.json [09:44:35] 
ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10702905 (phaultfinder) [09:45:13] (PS1) Marostegui: db2243: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1133348 (https://phabricator.wikimedia.org/T381475) [09:45:43] (CR) Marostegui: [C:+2] db2243: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1133348 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui) [09:46:13] (CR) Muehlenhoff: [C:+2] Switch ganeti6003 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133346 (owner: Muehlenhoff) [09:48:41] (PS1) Marostegui: instances.yaml: Add db2243 [puppet] - https://gerrit.wikimedia.org/r/1133349 (https://phabricator.wikimedia.org/T381475) [09:49:11] (CR) Marostegui: [C:+2] instances.yaml: Add db2243 [puppet] - https://gerrit.wikimedia.org/r/1133349 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui) [09:52:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2243 to dbctl depooled T381475', diff saved to https://phabricator.wikimedia.org/P74557 and previous config saved to /var/cache/conftool/dbconfig/20250402-095213-marostegui.json [09:52:16] T381475: Productionize x3 hosts - https://phabricator.wikimedia.org/T381475 [09:52:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [09:52:30] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:53:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:27] (PS1) Alexandros Kosiaris: Revert "cache::backend: Switch mw-wikifunctions to 
ingress" [puppet] - https://gerrit.wikimedia.org/r/1133351 [09:54:33] (CR) Alexandros Kosiaris: [V:+2 C:+2] Revert "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133351 (owner: Alexandros Kosiaris) [09:55:01] (CR) Jelto: [C:+2] wikidata-query-builder: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1133122 (https://phabricator.wikimedia.org/T350793) (owner: Jelto) [09:55:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74558 and previous config saved to /var/cache/conftool/dbconfig/20250402-095538-root.json [09:56:35] (Merged) jenkins-bot: wikidata-query-builder: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1133122 (https://phabricator.wikimedia.org/T350793) (owner: Jelto) [09:57:30] RESOLVED: Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [09:58:39] (PS5) Arnaudb: gerrit: switchover to gerrit2002 [dns] - https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833) [09:59:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [09:59:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet [09:59:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [09:59:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P74559 and previous config saved to /var/cache/conftool/dbconfig/20250402-095933-root.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1000) 
[10:01:35] (CR) Ayounsi: [C:+1] "nice!" [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli) [10:01:55] (PS1) Muehlenhoff: Switch ganeti6004 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133353 [10:03:19] (CR) Muehlenhoff: [C:+2] Switch ganeti6004 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133353 (owner: Muehlenhoff) [10:03:35] FIRING: [9x] ProbeDown: Service ganeti6003:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:06:18] SRE, Data-Platform-SRE, Infrastructure-Foundations, netops: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044#10702957 (ayounsi) [10:09:11] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [10:09:28] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [10:09:38] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [10:10:19] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [10:10:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P74560 and previous config saved to /var/cache/conftool/dbconfig/20250402-101044-root.json [10:12:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [10:12:09] jouncebot: nowandnext [10:12:09] For the next 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1000) [10:12:09] In 0 hour(s) and 47 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1100) 
[10:12:45] it doesn't look like serviceops is using the window. Deploying something then :D [10:13:11] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [10:13:43] (PS1) Ladsgroup: Bump thumbnail steps to 60% [mediawiki-config] - https://gerrit.wikimedia.org/r/1133354 (https://phabricator.wikimedia.org/T360589) [10:13:45] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [10:14:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P74561 and previous config saved to /var/cache/conftool/dbconfig/20250402-101439-root.json [10:15:41] https://logstash.wikimedia.org/ is down :/ [10:15:48] ah no that might be the idp [10:18:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [10:18:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [10:18:37] FIRING: [9x] ProbeDown: Service ganeti6004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [10:21:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [10:23:37] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 5%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P74563 and previous config saved to /var/cache/conftool/dbconfig/20250402-102549-root.json [10:28:35] FIRING: [10x] ProbeDown: Service ganeti6004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74564 and previous config saved to /var/cache/conftool/dbconfig/20250402-102944-root.json [10:30:45] (CR) Muehlenhoff: [C:+2] Double conntrack table size on KDC hosts [puppet] - https://gerrit.wikimedia.org/r/1128406 (owner: Muehlenhoff) [10:31:59] logstash/idp works again for me [10:32:27] (CR) Btullis: [C:+1] "Looks good to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/1132422 (https://phabricator.wikimedia.org/T389825) (owner: Muehlenhoff) [10:33:13] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10703075 (Tgr) >>! In T390514#10702692, @Krinkle wrote: > There does seem to be a +100% doubling [in 2024 August], but that's not big enough to be our smoki... 
[10:33:35] FIRING: [10x] ProbeDown: Service ganeti6004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:35:11] (PS1) Alexandros Kosiaris: mw-wikifuctions: Add main FQDNS in tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1133356 (https://phabricator.wikimedia.org/T384944) [10:35:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:40:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:40:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74566 and previous config saved to /var/cache/conftool/dbconfig/20250402-104055-root.json [10:42:08] SRE, Infrastructure-Foundations, netops, IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703121 (cmooney) >>! In T389958#10702638, @aborrero wrote: > Yes, lets try with the static routes. Thanks! Thanks Arturo - can we arrange a window for thi... 
[10:42:19] (CR) Muehlenhoff: [C:+2] Create insetup role for Data Platform with nftables and merge DE/Search roles [puppet] - https://gerrit.wikimedia.org/r/1132422 (https://phabricator.wikimedia.org/T389825) (owner: Muehlenhoff) [10:44:20] (CR) Alexandros Kosiaris: [C:+2] mw-wikifuctions: Add main FQDNS in tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1133356 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris) [10:44:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P74567 and previous config saved to /var/cache/conftool/dbconfig/20250402-104450-root.json [10:44:54] (PS1) Slyngshede: Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357 [10:45:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:49:34] (Merged) jenkins-bot: mw-wikifuctions: Add main FQDNS in tlsExtraSANs [deployment-charts] - https://gerrit.wikimedia.org/r/1133356 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris) [10:50:17] (PS1) Muehlenhoff: Create insetup role for Data Persistence with nftables and rename existing one [puppet] - https://gerrit.wikimedia.org/r/1133359 (https://phabricator.wikimedia.org/T389825) [10:50:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:50:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps 
wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:51:45] jouncebot: nowandnext [10:51:45] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1000) [10:51:45] In 0 hour(s) and 8 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1100) [10:52:08] (CR) Ladsgroup: [C:+2] Bump thumbnail steps to 60% [mediawiki-config] - https://gerrit.wikimedia.org/r/1133354 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [10:52:53] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133354 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [10:52:59] (Merged) jenkins-bot: Bump thumbnail steps to 60% [mediawiki-config] - https://gerrit.wikimedia.org/r/1133354 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [10:53:40] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133354|Bump thumbnail steps to 60% (T360589)]] [10:53:43] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P74568 and previous config saved to /var/cache/conftool/dbconfig/20250402-105601-root.json [10:58:35] (PS11) Phedenskog: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: Tiziano Fogli) [10:59:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74569 and previous config saved 
to /var/cache/conftool/dbconfig/20250402-105956-root.json [11:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1100). [11:00:14] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1133354|Bump thumbnail steps to 60% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:00:16] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:00:17] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd [11:00:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10703150 (phaultfinder) [11:01:48] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:02:41] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:03:10] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:03:14] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:03:34] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:03:37] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:03:55] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:03:57] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:04:05] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:05:05] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132993 (owner: PipelineBot) [11:05:14] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127791 (owner: PipelineBot) [11:05:19] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1123468 (owner: PipelineBot) [11:06:27] (PS12) Phedenskog: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: Tiziano Fogli) [11:06:35] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132993 (owner: PipelineBot) [11:07:00] SRE, Wikidata, Wikimedia-Site-requests, Patch-For-Review, Wikidata Integration in Wikimedia projects (Kanban Board): Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10703178 (jijiki) removing #serviceops, please re-add is there is something we c... [11:07:25] (CR) Marostegui: "We need to downtime the hosts (usually 60 minutes is enough) and we don't have to remove it after everything has done. 
We can simply let i" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [11:08:52] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133354|Bump thumbnail steps to 60% (T360589)]] (duration: 15m 11s) [11:08:54] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:09:21] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [11:09:26] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet [11:10:57] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [11:11:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74570 and previous config saved to /var/cache/conftool/dbconfig/20250402-111106-root.json [11:14:02] (03PS1) 10Alexandros Kosiaris: Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 [11:14:26] (03CR) 10CI reject: [V:04-1] Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 (owner: 10Alexandros Kosiaris) [11:14:50] (03PS2) 10Alexandros Kosiaris: Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 [11:15:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P74571 and previous config saved to /var/cache/conftool/dbconfig/20250402-111501-root.json [11:15:14] (03CR) 10CI reject: [V:04-1] Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 (owner: 10Alexandros Kosiaris) [11:15:26] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet 
[11:15:29] (03PS3) 10Alexandros Kosiaris: Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 [11:15:53] (03CR) 10CI reject: [V:04-1] Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 (owner: 10Alexandros Kosiaris) [11:16:09] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:16:13] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [11:16:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [11:16:34] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:17:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:17:13] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:17:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:17:45] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:18:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:18:49] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:19:16] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:19:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:19:26] (03PS4) 10Alexandros Kosiaris: Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 [11:20:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [11:20:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply 
[11:20:48] (03PS1) 10Ilias Sarantopoulos: ml-services: enable inference batching for requests in edit-check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133364 (https://phabricator.wikimedia.org/T386100) [11:20:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 (owner: 10Slyngshede) [11:20:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [11:21:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [11:22:01] (03CR) 10Marostegui: [C:03+1] "From my side this is ok. I think Jaime should also be aware as this involve backups hosts. Although given this is just hosts not being use" [puppet] - 10https://gerrit.wikimedia.org/r/1133359 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [11:22:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:22:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:25:44] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert^2 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133363 (owner: 10Alexandros Kosiaris) [11:26:02] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703215 (10aborrero) >>! In T389958#10703121, @cmooney wrote: >>>! In T389958#10702638, @aborrero wrote: >> Yes, lets try with the static routes. Thanks! > >... 
[11:26:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P74572 and previous config saved to /var/cache/conftool/dbconfig/20250402-112611-root.json [11:26:33] (03PS1) 10Slyngshede: Permission: Prevent request of unconfigured permission [software/bitu] - 10https://gerrit.wikimedia.org/r/1133365 (https://phabricator.wikimedia.org/T390837) [11:27:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [11:30:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P74573 and previous config saved to /var/cache/conftool/dbconfig/20250402-113007-root.json [11:40:04] !log restart varnish on cp6016 - T390846 [11:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:07] T390846: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846 [11:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:41:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P74574 and previous config saved to /var/cache/conftool/dbconfig/20250402-114117-root.json [11:44:24] !log securely erase certificates from A:cp-magru and provide symlink for acmecerts (T384227) [11:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:27] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [11:45:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 50%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P74575 and previous config saved to /var/cache/conftool/dbconfig/20250402-114512-root.json [11:47:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [11:54:32] (03CR) 10Kevin Bazira: [C:03+1] ml-services: enable inference batching for requests in edit-check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133364 (https://phabricator.wikimedia.org/T386100) (owner: 10Ilias Sarantopoulos) [11:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10703309 (10phaultfinder) [11:55:00] (03PS2) 10Jelto: trafficserver: switch querybuilder scholarly and main to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1133120 (https://phabricator.wikimedia.org/T350793) [11:56:07] (03CR) 10Jelto: "sure, I removed `query` in patchset 2. Let's test this with `query-main` and `query-scholarly` first." [puppet] - 10https://gerrit.wikimedia.org/r/1133120 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:56:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74576 and previous config saved to /var/cache/conftool/dbconfig/20250402-115623-root.json [11:58:01] (03PS13) 10Filippo Giunchedi: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [11:58:33] (03PS14) 10Filippo Giunchedi: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [11:59:09] (03CR) 10Filippo Giunchedi: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [11:59:41] (03PS1) 10Stevemunene: 
airflow-platform-eng: set up all services except systemd [puppet] - 10https://gerrit.wikimedia.org/r/1133367 (https://phabricator.wikimedia.org/T380624) [11:59:52] (03CR) 10CI reject: [V:04-1] perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [12:00:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P74577 and previous config saved to /var/cache/conftool/dbconfig/20250402-120018-root.json [12:01:05] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10703334 (10aborrero) [12:01:11] (03PS15) 10Filippo Giunchedi: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [12:01:16] (03CR) 10Filippo Giunchedi: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [12:04:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet [12:06:22] (03CR) 10Hashar: [C:04-1] Release version 0.1.9 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 (owner: 10Slyngshede) [12:06:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet [12:08:01] (03CR) 10Seanleong-wmde: "Actual patch to increase Wiki EntityAccessLimit to 500 other than commons." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (owner: 10Seanleong-wmde) [12:08:57] (03PS2) 10Slyngshede: Release version 0.1.9 [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 [12:09:10] (03CR) 10Slyngshede: Release version 0.1.9 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 (owner: 10Slyngshede) [12:09:19] (03CR) 10Muehlenhoff: "There's no impact to the existing backup insetup hosts, they are already using Ferm and will continue to do so. As such, I'll go ahead and" [puppet] - 10https://gerrit.wikimedia.org/r/1133359 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [12:09:21] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for Data Persistence with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133359 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [12:10:29] (03PS1) 10Volans: Release v0.8.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133370 [12:10:39] (03PS2) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) [12:11:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd [12:11:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P74578 and previous config saved to /var/cache/conftool/dbconfig/20250402-121128-root.json [12:11:35] (03CR) 10CI reject: [V:04-1] Release v0.8.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133370 (owner: 10Volans) [12:12:44] jouncebot: now and next [12:12:45] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [12:13:46] (03PS1) 10Alexandros Kosiaris: mw-wikifunctions-ingress: Switch to k8s-ingress-wikikube-rw [dns] - 10https://gerrit.wikimedia.org/r/1133372 (https://phabricator.wikimedia.org/T384944) [12:14:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cephosd2001.codfw.wmnet [12:15:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74579 and previous config saved to /var/cache/conftool/dbconfig/20250402-121524-root.json [12:15:58] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions-ingress: Switch to k8s-ingress-wikikube-rw [dns] - 10https://gerrit.wikimedia.org/r/1133372 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:16:13] (03CR) 10Hashar: Release version 0.1.9 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 (owner: 10Slyngshede) [12:16:21] !log akosiaris@dns1004 START - running authdns-update [12:16:46] (03PS3) 10Hashar: Release version 0.1.9 [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 (owner: 10Slyngshede) [12:17:01] (03CR) 10Hashar: [C:03+1] Release version 0.1.9 (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1133357 (owner: 10Slyngshede) 
[12:18:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10703399 (10Jhancock.wm) 05Open→03Resolved i'm gonna mark this task as resolved but i'll keep worker2331 on my list to check back on once in... [12:18:41] !log akosiaris@dns1004 END - running authdns-update [12:18:42] (03PS1) 10Muehlenhoff: Create insetup role for Traffic with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133374 (https://phabricator.wikimedia.org/T389825) [12:19:18] (03PS2) 10Volans: Release v0.8.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133370 [12:19:51] (03CR) 10Tiziano Fogli: [C:03+2] auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133170 (https://phabricator.wikimedia.org/T390672) (owner: 10Tiziano Fogli) [12:20:38] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/bitu] - 10https://gerrit.wikimedia.org/r/1133375 [12:23:38] (03PS2) 10Muehlenhoff: failover eqiad urldownloader for security update [dns] - 10https://gerrit.wikimedia.org/r/1133108 [12:23:48] (03CR) 10Filippo Giunchedi: [C:03+1] perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [12:24:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd2001.codfw.wmnet [12:24:21] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: move profile::acme_chief::certificates to profile [puppet] - 10https://gerrit.wikimedia.org/r/1131270 (owner: 10Filippo Giunchedi) [12:26:05] (03CR) 10Muehlenhoff: [C:03+2] failover eqiad urldownloader for security update [dns] - 10https://gerrit.wikimedia.org/r/1133108 (owner: 10Muehlenhoff) [12:26:22] !log jmm@dns1004 START - running authdns-update [12:26:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 
(re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74580 and previous config saved to /var/cache/conftool/dbconfig/20250402-122634-root.json [12:27:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cephosd2002.codfw.wmnet [12:28:42] !log jmm@dns1004 END - running authdns-update [12:30:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74581 and previous config saved to /var/cache/conftool/dbconfig/20250402-123029-root.json [12:34:30] (03CR) 10Jelto: [C:03+2] trafficserver: switch querybuilder scholarly and main to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1133120 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:36:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd2002.codfw.wmnet [12:37:11] (03PS1) 10Volans: cookbook: improve -r/--reason help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133382 [12:38:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [12:40:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cephosd2003.codfw.wmnet [12:40:36] jelto: are you planning to apply the querybuilder trafficserver change manually or should we just wait for 30 minutes before testing it? 
[12:40:42] (ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133120 / T350793) [12:40:43] T350793: move query.wikidata.org to kubernetes - https://phabricator.wikimedia.org/T350793 [12:41:27] Lucas_WMDE: I'd just wait for 30 minutes [12:41:31] ok :) [12:41:35] just checking [12:41:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2243 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74582 and previous config saved to /var/cache/conftool/dbconfig/20250402-124139-root.json [12:43:07] Lucas_WMDE: but it seems the new load balancer mapping is live already (istio instead of apache answers). Unfortunately it returns a 404. Let me quickly check what's going on; otherwise we can roll back [12:43:09] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [12:44:25] ok [12:45:00] (03PS1) 10Vgutierrez: varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) [12:45:23] (03PS3) 10Btullis: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [12:45:23] (03CR) 10CI reject: [V:04-1] varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [12:45:51] (03CR) 10Elukey: [C:03+1] cookbook: improve -r/--reason help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133382 (owner: 10Volans) [12:45:55] (03PS4) 10Btullis: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [12:46:02] (03PS6) 
10Filippo Giunchedi: pontoon: provide acme-chief compat/shim [puppet] - 10https://gerrit.wikimedia.org/r/1131271 [12:46:19] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: provide acme-chief compat/shim [puppet] - 10https://gerrit.wikimedia.org/r/1131271 (owner: 10Filippo Giunchedi) [12:46:29] (03PS5) 10Btullis: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [12:47:31] (03CR) 10Btullis: [C:03+1] airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [12:47:33] (03PS2) 10Vgutierrez: varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) [12:48:20] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:48:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10703487 (10phaultfinder) [12:49:03] (03PS1) 10Klausman: admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) [12:49:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd2003.codfw.wmnet [12:50:32] (03PS1) 10Filippo Giunchedi: pontoon: enable acme-chief in o11y-phi [puppet] - 10https://gerrit.wikimedia.org/r/1133387 [12:50:54] (03PS1) 10Jelto: wikidata-query-gui: add query-main and query-scholarly to querybuilder hosts [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1133388 (https://phabricator.wikimedia.org/T350793) [12:51:03] (03CR) 10Elukey: [C:03+1] "I am reasonably sure this may not be enough, a complete pod roll restart is needed so the envoy sidecar gets updated. But that can happen " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) (owner: 10Klausman) [12:51:51] (03CR) 10Jelto: "let's see if that fixes the 404 for querybuilder, otherwise I'll roll back to the legacy miscweb config again." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133388 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:53:54] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: enable acme-chief in o11y-phi [puppet] - 10https://gerrit.wikimedia.org/r/1133387 (owner: 10Filippo Giunchedi) [12:54:00] (03CR) 10Elukey: "@brouberol@wikimedia.org lemme know if it makes sense!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) (owner: 10Elukey) [12:54:08] (03CR) 10Volans: [C:03+2] cookbook: improve -r/--reason help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133382 (owner: 10Volans) [12:54:30] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add query-main and query-scholarly to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133388 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:54:39] (03CR) 10Hashar: "Feel free to merge at anytime, I don't have +2 / puppet-merge access :)" [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: 10Hashar) [12:55:12] (03CR) 10Brouberol: [C:03+1] "LGTM! If Tegola is using an official kafka client, the DNS should be resolved into the set of IPs, which should give you more resiliency." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) (owner: 10Elukey) [12:55:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:56:01] (03Merged) 10jenkins-bot: wikidata-query-gui: add query-main and query-scholarly to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133388 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:56:27] (03PS1) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [12:57:42] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:57:47] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:57:57] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [12:58:03] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [12:58:15] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [12:58:21] (03CR) 10Hashar: "Added Timo for information, I will deploy it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133081 (owner: 10Hashar) [12:58:22] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [12:58:44] (03CR) 10Stevemunene: [C:03+2] airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [12:58:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133081 (owner: 10Hashar) [12:59:03] jouncebot: refresh [12:59:04] I refreshed my knowledge about deployments. [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1300). [13:00:04] jakob_WMDE and hashar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:15] (03Merged) 10jenkins-bot: airflow-product-eng: migrate scheduler and db to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [13:00:27] o/ [13:00:30] o/ [13:00:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:01:17] I can deploy [13:01:25] unless hashar wants to do both? 
[13:01:37] mine is a noop [13:01:38] https://gerrit.wikimedia.org/r/c/1133081/ [13:01:48] so it can roll together with jakob_WMDE patch imho [13:02:53] jakob_WMDE: just to clarify, is your change expected to have an effect already? [13:02:56] or is it just preparation? [13:03:13] just preparation, should have no visible effect [13:03:22] ok [13:03:35] (03Merged) 10jenkins-bot: cookbook: improve -r/--reason help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133382 (owner: 10Volans) [13:03:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [13:03:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133081 (owner: 10Hashar) [13:03:41] deploying both together then [13:04:18] \o/ [13:04:29] (03Merged) 10jenkins-bot: Configure virtual terms db for wikidata prod & test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [13:04:30] (03Merged) 10jenkins-bot: Use wikidata familly in $wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133081 (owner: 10Hashar) [13:04:40] James_F: actually, should I interrupt my deployment? 
sounds like you have something important going on [13:04:51] (meh, I was hoping that message would go out before the config changes merged on gerrit) [13:04:52] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1131353|Configure virtual terms db for wikidata prod & test (T389190)]], [[gerrit:1133081|Use wikidata familly in $wgCirrusSearchSimilarityProfile]] [13:04:55] T389190: Deploy new term store config to PROD - https://phabricator.wikimedia.org/T389190 [13:05:14] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855 (10isarantopoulos) 03NEW [13:05:48] FIRING: PuppetFailure: Puppet has failed on cirrussearch2055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:07:41] (03PS1) 10Alexandros Kosiaris: Revert^3 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133390 [13:08:40] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert^3 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133390 (owner: 10Alexandros Kosiaris) [13:11:20] !log lucaswerkmeister-wmde@deploy1003 jakob, hashar, lucaswerkmeister-wmde: Backport for [[gerrit:1131353|Configure virtual terms db for wikidata prod & test (T389190)]], [[gerrit:1133081|Use wikidata familly in $wgCirrusSearchSimilarityProfile]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:23] T389190: Deploy new term store config to PROD - https://phabricator.wikimedia.org/T389190 [13:11:35] anything to test? just that stuff isn’t broken? [13:12:01] yeah, just that stuff isn't broken [13:12:22] getting labels on wikidata.org e.g. via REST API, showing labels e.g. 
in links on clients [13:12:25] $wgVirtualDomainsMapping['virtual-wikibase-terms'] seems to be set correctly in `mwscript shell` fwiw [13:12:32] (tested wikidatawiki, testwikidatawiki, testwiki, enwiki) [13:12:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10703602 (10phaultfinder) [13:13:07] (03PS1) 10Andrew Bogott: Opensack: Upgrade eqiad1 to version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1133393 (https://phabricator.wikimedia.org/T381499) [13:13:29] hm, gerrit is hanging for me… [13:13:40] ok, it worked in a new tab [13:14:05] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:14:13] (03PS2) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:14:27] `mwscript eval testwikidatawiki <<< 'var_dump( $wgCirrusSearchSimilarityProfile )'` (and a few other wikis) also behaves as expected [13:14:29] !log lucaswerkmeister-wmde@deploy1003 jakob, hashar, lucaswerkmeister-wmde: Continuing with sync [13:14:47] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:15:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:15:50] Lucas_WMDE: thank you for the verification! [13:16:57] (03PS2) 10DCausse: search: update WDQS update lag SLI/SLO queries [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 [13:17:09] (03CR) 10Bking: [C:03+1] "I was gonna suggest that we check if the cert and key match, but Puppet doesn't do that either. Thus, no reason to impose it here." 
[puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) (owner: 10Ebernhardson) [13:17:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133393 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [13:18:08] (03PS1) 10Jelto: Revert "trafficserver: switch querybuilder scholarly and main to wikikube" [puppet] - 10https://gerrit.wikimedia.org/r/1133395 (https://phabricator.wikimedia.org/T350793) [13:18:21] Lucas_WMDE: Sorry, no, go ahead, it's been reverted. [13:18:27] ok [13:18:38] (I only saw afterwards that the change linked as the issue was in puppet anyway) [13:18:46] Yeah. [13:19:05] !log installing gnutls28 security updates on Bookworm [13:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:57] (03CR) 10Jelto: [C:03+2] Revert "trafficserver: switch querybuilder scholarly and main to wikikube" [puppet] - 10https://gerrit.wikimedia.org/r/1133395 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:20:40] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: set up all services except systemd [puppet] - 10https://gerrit.wikimedia.org/r/1133367 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [13:20:58] (03CR) 10Ssingh: [C:03+1] "looks good, one nit to link to the changelog." [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [13:21:48] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131353|Configure virtual terms db for wikidata prod & test (T389190)]], [[gerrit:1133081|Use wikidata familly in $wgCirrusSearchSimilarityProfile]] (duration: 16m 55s) [13:21:50] T389190: Deploy new term store config to PROD - https://phabricator.wikimedia.org/T389190 [13:23:20] (03CR) 10Ssingh: [C:03+1] "Nice and thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1133316 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto) [13:24:40] (03PS1) 10Stevemunene: airflow-platform-eng: remove db values used for db import [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133397 (https://phabricator.wikimedia.org/T380618) [13:24:56] !log UTC afternoon backport+config window done [13:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10703646 (10Jhancock.wm) @Marostegui is it safe to swap in the original disk? [13:24:59] * Lucas_WMDE done deploying [13:25:14] (03CR) 10Btullis: [C:03+1] airflow-platform-eng: remove db values used for db import [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133397 (https://phabricator.wikimedia.org/T380618) (owner: 10Stevemunene) [13:25:27] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10703647 (10Marostegui) Go for it @Jhancock.wm [13:25:36] thanks Lucas_WMDE! 
[13:25:43] np :) [13:26:54] (03CR) 10Stevemunene: [C:03+2] airflow-platform-eng: remove db values used for db import [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133397 (https://phabricator.wikimedia.org/T380618) (owner: 10Stevemunene) [13:26:56] (03PS3) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:28:02] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2316:9290 - https://phabricator.wikimedia.org/T390769#10703651 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated [13:28:15] (03Merged) 10jenkins-bot: airflow-platform-eng: remove db values used for db import [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133397 (https://phabricator.wikimedia.org/T380618) (owner: 10Stevemunene) [13:28:38] (03PS1) 10Ayounsi: MR: rollback gNMI [homer/public] - 10https://gerrit.wikimedia.org/r/1133398 (https://phabricator.wikimedia.org/T390052) [13:30:37] (03CR) 10Ssingh: [C:03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1133374 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:31:09] (03PS4) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:31:46] (03PS2) 10Ssingh: P:pybal: alert sooner if pybal.conf was changed [puppet] - 10https://gerrit.wikimedia.org/r/1133271 [13:32:50] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5195/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133271 (owner: 10Ssingh) [13:33:39] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:35:50] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [13:36:41] (03CR) 10Ayounsi: [C:03+1] Release v0.8.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133370 (owner: 10Volans) [13:37:05] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox [13:37:38] (03CR) 10Volans: [C:03+2] Release v0.8.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133370 (owner: 10Volans) [13:37:40] !log depool cp3066 for debugging T390854 [13:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:43] T390854: Partial mw-wikifunctions outage; 404s on load.php and others? 
- https://phabricator.wikimedia.org/T390854 [13:37:59] (03CR) 10Fabfur: [C:03+1] "looks fair to me" [puppet] - 10https://gerrit.wikimedia.org/r/1133271 (owner: 10Ssingh) [13:38:39] (03CR) 10Ssingh: [V:03+1 C:03+2] P:pybal: alert sooner if pybal.conf was changed [puppet] - 10https://gerrit.wikimedia.org/r/1133271 (owner: 10Ssingh) [13:40:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [13:41:13] (03PS5) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:41:20] (03CR) 10Ayounsi: [C:03+1] Add prepend-as-out variable for each site always [homer/public] - 10https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [13:41:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [13:41:49] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:42:39] FIRING: CoreBGPDown: Core BGP session down between cr1-eqiad and pfw3-eqiad (208.80.154.201) - group Fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Fundraising&var-bgp_neighbor=pfw3-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-0/2/0 {#4026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:43:09] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in ulsfo [puppet] - 
10https://gerrit.wikimedia.org/r/1133405 (https://phabricator.wikimedia.org/T384227) [13:43:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [13:43:16] (03CR) 10Ayounsi: prepend_as_out: switch outbound policy rather than modify existing (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [13:43:35] FIRING: [9x] ProbeDown: Service ganeti1040:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:50] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:44:14] (03PS6) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:45:45] (03CR) 10CI reject: [V:04-1] services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [13:45:49] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863 (10MoritzMuehlenhoff) 03NEW [13:46:05] (03PS1) 10Muehlenhoff: Setup the new KDC with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133406 (https://phabricator.wikimedia.org/T390863) [13:46:37] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133405 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:48:28] (03PS16) 10Tiziano Fogli: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) [13:49:42] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [13:49:43] !log ayounsi@cumin1002 START - 
Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [13:51:33] (03PS7) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:52:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [13:52:15] (03PS3) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 [13:52:44] (03CR) 10CI reject: [V:04-1] services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [13:52:59] (03PS4) 10Seanleong-wmde: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) [13:53:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:35] FIRING: [11x] ProbeDown: Service ganeti1040:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:56] (03PS8) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:55:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [13:55:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10703855 (10phaultfinder) [13:56:06] (03PS17) 10Tiziano Fogli: perf/navtiming: migrate alerts from grafana to 
alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) [13:56:28] (03PS2) 10Muehlenhoff: Setup the new KDC with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133406 (https://phabricator.wikimedia.org/T390863) [13:58:35] FIRING: [11x] ProbeDown: Service ganeti1040:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:29] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for Traffic with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133374 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:59:52] (03CR) 10Volans: [C:03+2] Include base_paths when initialising the plugin class [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133133 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1400) [14:00:19] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-03-19-125950 to 2025-04-02-130409 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133413 (https://phabricator.wikimedia.org/T367005) [14:00:23] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-25-145119 to 2025-04-02-124609 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133414 (https://phabricator.wikimedia.org/T367005) [14:00:51] !log volans@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.8.0 - volans@cumin1002 [14:00:59] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - 
https://phabricator.wikimedia.org/T390855#10703895 (10isarantopoulos) [14:01:01] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:01:19] !log upgrading homer to version 0.8.0 to cumin hosts [14:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:47] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-03-19-125950 to 2025-04-02-130409 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133413 (https://phabricator.wikimedia.org/T367005) (owner: 10Jforrester) [14:02:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-0/2/0 {#4026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:03:04] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10703909 (10Jhancock.wm) completed [14:03:17] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-03-19-125950 to 2025-04-02-130409 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133413 (https://phabricator.wikimedia.org/T367005) (owner: 10Jforrester) [14:03:52] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:03:58] (03CR) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [14:03:59] 10ops-eqiad, 06DC-Ops: Inbound errors on interface cr2-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-7/2/0 {#4027}) - https://phabricator.wikimedia.org/T390869 (10phaultfinder) 03NEW [14:04:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:04:44] !log
jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:05:10] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:05:42] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:06:05] !log volans@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.8.0 - volans@cumin1002 [14:06:12] (03PS4) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) [14:06:31] (03CR) 10Cathal Mooney: prepend_as_out: switch outbound policy rather than modify existing (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1130093 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [14:06:31] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:06:37] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:07:16] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:08:08] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-03-25-145119 to 2025-04-02-124609 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133414 (https://phabricator.wikimedia.org/T367005) (owner: 10Jforrester) [14:09:27] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-03-25-145119 to 2025-04-02-124609 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133414 (https://phabricator.wikimedia.org/T367005) (owner: 10Jforrester) [14:10:17] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:45] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:11:29] !log jforrester@deploy1003 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [14:11:58] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:12:01] (03PS1) 10Btullis: Enable the airflow-platform-eng instance to gitsync the analytics DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133420 (https://phabricator.wikimedia.org/T380618) [14:12:10] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:12:43] (03CR) 10Brouberol: [C:03+1] Enable the airflow-platform-eng instance to gitsync the analytics DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133420 (https://phabricator.wikimedia.org/T380618) (owner: 10Btullis) [14:12:57] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:13:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet [14:13:44] (03CR) 10Btullis: [C:03+2] Enable the airflow-platform-eng instance to gitsync the analytics DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133420 (https://phabricator.wikimedia.org/T380618) (owner: 10Btullis) [14:14:35] (03CR) 10Stevemunene: [C:03+2] airflow-platform-eng: set up all services except systemd [puppet] - 10https://gerrit.wikimedia.org/r/1133367 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [14:15:18] (03Merged) 10jenkins-bot: Enable the airflow-platform-eng instance to gitsync the analytics DAGs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133420 (https://phabricator.wikimedia.org/T380618) (owner: 10Btullis) [14:15:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet [14:17:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:18:26] Done with WF window, in case anyone wants to take the rest. 
[14:18:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [14:18:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10704096 (10phaultfinder) [14:19:26] <_joe_> uh what was this page? [14:19:30] <_joe_> !incidents [14:19:31] 5939 (UNACKED) Host pfw1-eqiad - PING - Packet loss = 100% [14:19:31] 5931 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [14:19:31] 5930 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:19:31] 5929 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:19:32] 5927 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-d-eqiad.mgmt.eqiad.wmnet) [14:19:32] 5926 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [14:19:32] 5925 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet) [14:19:33] 5923 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [14:19:33] 5924 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [14:19:47] <_joe_> slyngs: I'm in a meeting, can you take a look? 
[14:20:03] Sure [14:20:12] looking also [14:20:17] <_joe_> !ack 5939 [14:20:18] 5939 (ACKED) Host pfw1-eqiad - PING - Packet loss = 100% [14:20:22] <_joe_> thanks folks <3 [14:20:56] 06SRE, 10Observability-Metrics: Statograph referencing empty/nonexisting metrics goes unnoticed - https://phabricator.wikimedia.org/T390520#10704115 (10lmata) [14:21:18] 06SRE, 10Observability-Metrics: Statograph referencing empty/nonexisting metrics goes unnoticed - https://phabricator.wikimedia.org/T390520#10704118 (10lmata) p:05Triage→03Low [14:21:32] (03PS1) 10Muehlenhoff: Create insetup role for WMCS with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) [14:22:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and pfw3-eqiad (208.80.154.201) - group Fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:23:17] (03CR) 10MVernon: "Hi," [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:23:37] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:56] (03PS9) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [14:25:47] topranks: The pfw3-eqiad alert, is that related to the Lumens fault we got earlier? 
[14:25:58] (03CR) 10Andrew Bogott: [C:03+2] Opensack: Upgrade eqiad1 to version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1133393 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [14:26:01] (03CR) 10CI reject: [V:04-1] services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [14:26:02] no it's something internal [14:26:24] it's also perhaps a misconfig on our side; the one link failing shouldn't break anything [14:27:11] well other than both links are reporting as down [14:27:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1133365 (https://phabricator.wikimedia.org/T390837) (owner: 10Slyngshede) [14:27:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-0/2/0 {#4026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:28:19] The pfw1-eqiad also reported 100% packet loss [14:30:33] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10704204 (10Marostegui) Thanks @Jhancock.wm The disk was detected as bad but I've made it good, removed the config and started to rebuild ` root@db2243:/home/marostegui# ./storc... [14:31:33] (03CR) 10FNegri: Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [14:31:50] (03PS10) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [14:33:08] jhathaway / topranks if pfw1-eqiad is down, would it make sense that BGP sessions break?
[14:33:09] (03CR) 10CI reject: [V:04-1] services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [14:33:37] see -sre [14:33:53] taavi: Thanks [14:35:32] (03CR) 10Muehlenhoff: Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [14:35:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox [14:37:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10704257 (10phaultfinder) [14:41:30] (03CR) 10Clément Goubert: [C:03+2] "Merging as is, we can always change the routing later." [puppet] - 10https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [14:43:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe2003.codfw.wmnet with OS bookworm [14:43:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10704274 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm executed with errors: - apus-... 
[14:43:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host apus-fe2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:43:56] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [14:47:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and pfw3-eqiad (208.80.154.201) - group Fundraising - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:47:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-0/2/0 {#4026}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:49:22] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [14:49:50] (03PS1) 10Bking: cirrussearch: fix s3-related variable paths [puppet] - 10https://gerrit.wikimedia.org/r/1133429 (https://phabricator.wikimedia.org/T388610) [14:53:05] (03CR) 10Scott French: "Thanks, Reuven!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1131351 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [14:53:16] (03CR) 10Scott French: [C:03+2] deployment_server: Default to PHP 8.1 in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1131351 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [14:53:25] (03PS11) 10Elukey: services: enable ingress for Kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [14:55:23] (03CR) 10Brennen Bearnes: [C:03+1] gitlab_runner: increase job output_limit to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1133316 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto) [14:55:45] (03CR) 10DCausse: [C:03+1] cirrussearch: fix s3-related variable paths [puppet] - 10https://gerrit.wikimedia.org/r/1133429 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:57:09] (03PS6) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [14:57:25] (03PS3) 10Vgutierrez: varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) [14:57:52] (03CR) 10Ssingh: [C:03+2] varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [14:57:56] er [14:58:01] (03CR) 10Ssingh: [C:03+1] varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [14:58:39] (03CR) 10Federico Ceratto: "Thanks for the review. I added more logic around pre-pool-in checks." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:00:05] arnaudb, hashar, and thcipriani: Deploy window Gerrit switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1500) [15:00:20] yup [15:00:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10704390 (10BTullis) [15:00:47] (03CR) 10Bking: [C:03+2] cirrussearch: fix s3-related variable paths [puppet] - 10https://gerrit.wikimedia.org/r/1133429 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:01:34] (03CR) 10Marostegui: upgrade.py: Depool, repool, update Phabricator (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:03:42] jouncebot: next [15:03:42] In 1 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1700) [15:04:23] (03PS6) 10Arnaudb: gerrit: switchover to gerrit2002 [dns] - 10https://gerrit.wikimedia.org/r/1128818 (https://phabricator.wikimedia.org/T387833) [15:04:36] (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:05:51] * thcipriani commences rooting for gerrit switchover success [15:06:07] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1133275 (https://phabricator.wikimedia.org/T390790) (owner: 10RLazarus) [15:06:41] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:12] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:07:22] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [15:09:00] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:21] Heads up: in 5min Gerrit will experience a brief (~10min) downtime during maintenance, as we test the refreshed host switchover procedure - details at https://phabricator.wikimedia.org/T387833 [15:10:27] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:15:53] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on gerrit1003.wikimedia.org with reason: maintenance [15:16:08] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: maintenance [15:16:46] (03CR) 10Arnaudb: [C:03+2] gerrit: switchover to gerrit2002 [dns] - 10https://gerrit.wikimedia.org/r/1128818 
(https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:17:05] (03CR) 10Arnaudb: [C:03+2] gerrit: switchover to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/1128814 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:17:24] (03PS1) 10Gehel: feat(HDFS free space alert): add an alert at 20% free space [alerts] - 10https://gerrit.wikimedia.org/r/1133433 (https://phabricator.wikimedia.org/T390875) [15:19:16] !log arnaudb@dns1004 START - running authdns-update [15:21:49] i'm assuming the gerrit outage is transient? [15:21:50] 06SRE, 06serviceops, 13Patch-For-Review: mwscript-cleanup.service failure - https://phabricator.wikimedia.org/T390790#10704587 (10RLazarus) 05Open→03Resolved ` Apr 02 15:20:03 deploy1003 systemd[1]: Starting Remove lingering Helm releases from completed maintenance scripts.... Apr 02 15:20:04 deploy1... [15:22:12] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:59] cscott: gerrit switchover in progress [15:26:17] (buried in scrollback, but also on deploy calendar :)) [15:26:42] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:12] FIRING: [6x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:18] sorry I posted on sre-collab the tracking [15:27:26] cscott: it should be back in a few moments [15:27:35] dns is merging atm [15:28:47] !log 
arnaudb@dns1004 END - running authdns-update [15:29:01] merged, will proceed to merge puppet as well and all should be in order soon [15:29:21] Command '['git', 'fetch']' returned non-zero exit status 128. [15:29:21] Fetching new commits from: https://gerrit.wikimedia.org/r/operations/puppet [15:29:22] failed to run `git fetch` [15:29:22] Command '['git', 'fetch']' returned non-zero exit status 128. [15:29:22] Problems merging production [15:29:29] well, if the dns allows me [15:29:45] (its ok now) [15:30:07] ✅ [15:31:42] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:32:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [15:32:12] FIRING: [6x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:00] FIRING: [6x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:42] RESOLVED: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:49] i'm getting 'fatal: remote error: Service not enabled' when trying to submit a patch to gerrit, is that expected from the switchover? 
[15:36:51] we're having issues with the new server taking up its role, debugging [15:37:12] RESOLVED: [6x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:01] working around the issue [15:40:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10704653 (10phaultfinder) [15:41:00] arnaudb: if gerrit is down, updating the dns repo is not possible of course. (if you are facing that issue, "well if the dns allows me") [15:41:13] issue has been fixed [15:41:16] we can edit zone files through cumin and then reload gdnsd, please ping me if that is required [15:41:19] ok great! [15:41:22] it was a systemd unit caveat, puppet has missed an update [15:41:37] (gerrit is back) [15:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:25] thank you all for your patience [15:42:55] FIRING: [5x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:53] (03CR) 10JHathaway: [C:03+2] Hiera: enable deep merge lookup option for abuse_networks [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: 10Hashar) [15:45:58] (03PS2) 10Majavah: dynamicproxy: Add dependency on acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/1133448 [15:46:05] arnaudb: nice job! 
[15:46:38] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5198/console" [puppet] - 10https://gerrit.wikimedia.org/r/1133448 (owner: 10Majavah) [15:47:25] for the record https://wikitech.wikimedia.org/wiki/DNS#Update_DNS_if_gerrit_or_DNS_are_down_(on_an_emergency_only) [15:47:27] thanks, it was a rocky ride, hopefully the next (which should occur soon-ish) one should be easier :) [15:47:44] yeah volans that was one of my tabs, thanks for the link though :D [15:47:55] RESOLVED: [2x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:09] (03CR) 10Vgutierrez: [C:03+2] varnish: Use vcl.list JSON output on reload-vcl.py [puppet] - 10https://gerrit.wikimedia.org/r/1133385 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [15:49:42] (03PS3) 10Scott French: Profile::Mediawiki_deployment: add mw_kind fields [puppet] - 10https://gerrit.wikimedia.org/r/1131058 (https://phabricator.wikimedia.org/T389499) [15:49:50] (03PS3) 10Scott French: hieradata: adopt mw_kind in mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1131059 (https://phabricator.wikimedia.org/T389499) [15:49:50] (03CR) 10Scott French: "Thanks for the review!"
[puppet] - 10https://gerrit.wikimedia.org/r/1131059 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [15:49:52] (03PS3) 10Scott French: Profile::Mediawiki_deployment: remove deprecated debug field [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) [15:52:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [15:54:04] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1131059 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [15:54:26] (03CR) 10Ssingh: [C:03+1] "Looks good. [This is on me]: We should fix copying around the SANs and alias them, otherwise it's a recipe for disaster." [puppet] - 10https://gerrit.wikimedia.org/r/1133405 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:54:55] (03CR) 10Brouberol: [C:03+1] "LG!" [alerts] - 10https://gerrit.wikimedia.org/r/1133433 (https://phabricator.wikimedia.org/T390875) (owner: 10Gehel) [15:54:57] (03CR) 10Brouberol: [C:03+2] feat(HDFS free space alert): add an alert at 20% free space [alerts] - 10https://gerrit.wikimedia.org/r/1133433 (https://phabricator.wikimedia.org/T390875) (owner: 10Gehel) [15:59:16] (03PS1) 10Vgutierrez: varnish: Use full VCL name on vcl.discard cmd [puppet] - 10https://gerrit.wikimedia.org/r/1133454 (https://phabricator.wikimedia.org/T390846) [16:00:43] (03CR) 10Ssingh: [C:03+1] varnish: Use full VCL name on vcl.discard cmd [puppet] - 10https://gerrit.wikimedia.org/r/1133454 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [16:00:44] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10704731 (10Marostegui) 05Open→03Resolved The RAID finished rebuilding ` VD LIST : ======= -------------------------------------------------------------- DG/VD TYPE St... 
[16:00:48] (03CR) 10Vgutierrez: [C:03+2] varnish: Use full VCL name on vcl.discard cmd [puppet] - 10https://gerrit.wikimedia.org/r/1133454 (https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [16:03:44] jouncebot: nowandnext [16:03:44] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [16:03:44] In 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1700) [16:06:01] FYI, I'm going to be merging some puppet patches that will require a follow-on scap deployment (which should be quick, as there should be no image build) [16:06:06] (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: add mw_kind fields [puppet] - 10https://gerrit.wikimedia.org/r/1131058 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [16:06:09] (03CR) 10Bernard Wang: Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [16:07:42] (03CR) 10Scott French: [C:03+2] hieradata: adopt mw_kind in mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1131059 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [16:08:46] (03CR) 10Bernard Wang: Stream registration for article summaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [16:10:34] !log run-puppet-agent on deploy1003 to pick up mediawiki-deployments.yaml changes - T389499 [16:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:37] T389499: Refactor scap's kubernetes DeploymentsConfig to support selection of image kinds - https://phabricator.wikimedia.org/T389499 [16:14:04] (03CR) 10Clément Goubert: [C:03+1] Profile::Mediawiki_deployment: make debug Optional [puppet] - 10https://gerrit.wikimedia.org/r/1133466 
(https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [16:15:57] (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: make debug Optional [puppet] - 10https://gerrit.wikimedia.org/r/1133466 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [16:18:17] (03PS2) 10Hnowlan: api-gateway: use rest-gateway for wikifeeds calls to restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132585 (https://phabricator.wikimedia.org/T390317) [16:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10704840 (10phaultfinder) [16:20:27] (03PS3) 10Hnowlan: api-gateway: use rest-gateway for wikifeeds calls to restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132585 (https://phabricator.wikimedia.org/T390317) [16:21:35] jouncebot: nowandnext [16:21:35] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [16:21:35] In 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1700) [16:22:50] Reedy: FYI, I have a scap deployment just now spinning up [16:23:00] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up change in mediawiki-deployments.yaml - T389499 [16:23:03] T389499: Refactor scap's kubernetes DeploymentsConfig to support selection of image kinds - https://phabricator.wikimedia.org/T389499 [16:23:07] Cheers [16:23:25] _should_ be a quick one, as there should be no changes to the image [16:23:40] !log reload varnish on text@drmrs to discard stale VCLs - T390846 [16:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:42] T390846: Increased number of connections to ATS on single_backend=false DCs after varnish 7 upgrade - https://phabricator.wikimedia.org/T390846 [16:24:10] !log swfrench@deploy1003 swfrench: Deployment to pick up change in mediawiki-deployments.yaml - 
T389499 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:25:02] !log swfrench@deploy1003 swfrench: Continuing with sync [16:25:14] (03PS1) 10Reedy: EmailAuth: Allow forceEmailAuth test check without extension dependencies [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133468 (https://phabricator.wikimedia.org/T390437) [16:25:34] (03PS1) 10Reedy: EmailAuth: Add tests for EmailAuthRequireToken handler [extensions/WikimediaEvents] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133469 (https://phabricator.wikimedia.org/T390437) [16:25:49] (03PS1) 10Reedy: EmailAuth: Add tests for EmailAuthRequireToken handler [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133470 (https://phabricator.wikimedia.org/T390437) [16:26:01] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up change in mediawiki-deployments.yaml - T389499 (duration: 03m 21s) [16:26:28] Reedy: all yours [16:26:33] (03PS1) 10Reedy: EmailAuthHooks: Exclude bot users from email auth check [extensions/WikimediaEvents] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133471 (https://phabricator.wikimedia.org/T390662) [16:26:51] (03PS1) 10Reedy: EmailAuthHooks: Exclude bot users from email auth check [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133472 (https://phabricator.wikimedia.org/T390662) [16:27:02] Thanks [16:27:09] (03CR) 10Reedy: [C:03+2] EmailAuthHooks: Exclude bot users from email auth check [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133472 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:27:13] (03CR) 10Reedy: [C:03+2] EmailAuth: Add tests for EmailAuthRequireToken handler [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133470 (https://phabricator.wikimedia.org/T390437) (owner: 10Reedy) [16:27:17] (03CR) 10Reedy: [C:03+2] EmailAuth: 
Allow forceEmailAuth test check without extension dependencies [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133468 (https://phabricator.wikimedia.org/T390437) (owner: 10Reedy) [16:27:31] (03CR) 10Reedy: [C:03+2] EmailAuthHooks: Exclude bot users from email auth check [extensions/WikimediaEvents] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133471 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:27:35] (03CR) 10Reedy: [C:03+2] EmailAuth: Add tests for EmailAuthRequireToken handler [extensions/WikimediaEvents] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133469 (https://phabricator.wikimedia.org/T390437) (owner: 10Reedy) [16:27:42] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10704893 (10RobH) Despite my asking a few times for the engineer's info and stating we have to enter a security and escort ticket over 24 hours in advance of their arrival, they did not send me the info... 
[16:27:50] !log reload varnish on text@codfw to discard stale VCLs - T390846 [16:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:30] (03PS4) 10Scott French: Profile::Mediawiki_deployment: remove deprecated debug field [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) [16:30:57] (03PS1) 10Reedy: i18n: Add no email variant of login-message [extensions/EmailAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133473 (https://phabricator.wikimedia.org/T390780) [16:31:04] (03PS1) 10Reedy: i18n: Add no email variant of login-message [extensions/EmailAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133474 (https://phabricator.wikimedia.org/T390780) [16:31:33] (03PS1) 10Reedy: i18n: Add a help message to the login flow [extensions/EmailAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133475 (https://phabricator.wikimedia.org/T390662) [16:31:47] (03PS1) 10Reedy: i18n: Add a help message to the login flow [extensions/EmailAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133476 (https://phabricator.wikimedia.org/T390662) [16:34:00] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:34:53] (03CR) 10Reedy: [C:03+2] i18n: Add a help message to the login flow [extensions/EmailAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133475 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:34:55] (03CR) 10Reedy: [C:03+2] i18n: Add no email variant of login-message [extensions/EmailAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133473 (https://phabricator.wikimedia.org/T390780) (owner: 10Reedy) [16:34:59] (03CR) 10Reedy: [C:03+2] i18n: Add no email 
variant of login-message [extensions/EmailAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133474 (https://phabricator.wikimedia.org/T390780) (owner: 10Reedy) [16:35:03] (03CR) 10Reedy: [C:03+2] i18n: Add a help message to the login flow [extensions/EmailAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133476 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:35:15] (03PS1) 10Vgutierrez: varnish: do not attempt to discard VCLs with labels > 0 [puppet] - 10https://gerrit.wikimedia.org/r/1133477 (https://phabricator.wikimedia.org/T390846) [16:36:05] (03Merged) 10jenkins-bot: EmailAuth: Allow forceEmailAuth test check without extension dependencies [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133468 (https://phabricator.wikimedia.org/T390437) (owner: 10Reedy) [16:36:07] (03Merged) 10jenkins-bot: EmailAuth: Add tests for EmailAuthRequireToken handler [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133470 (https://phabricator.wikimedia.org/T390437) (owner: 10Reedy) [16:36:49] (03Merged) 10jenkins-bot: EmailAuthHooks: Exclude bot users from email auth check [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133472 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:36:51] (03Merged) 10jenkins-bot: EmailAuth: Add tests for EmailAuthRequireToken handler [extensions/WikimediaEvents] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133469 (https://phabricator.wikimedia.org/T390437) (owner: 10Reedy) [16:36:53] (03Merged) 10jenkins-bot: EmailAuthHooks: Exclude bot users from email auth check [extensions/WikimediaEvents] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133471 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:38:27] (03CR) 10Vgutierrez: [C:03+2] varnish: do not attempt to discard VCLs with labels > 0 [puppet] - 10https://gerrit.wikimedia.org/r/1133477 
(https://phabricator.wikimedia.org/T390846) (owner: 10Vgutierrez) [16:39:24] (03Merged) 10jenkins-bot: i18n: Add no email variant of login-message [extensions/EmailAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133473 (https://phabricator.wikimedia.org/T390780) (owner: 10Reedy) [16:39:26] (03Merged) 10jenkins-bot: i18n: Add a help message to the login flow [extensions/EmailAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133475 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:39:28] (03Merged) 10jenkins-bot: i18n: Add no email variant of login-message [extensions/EmailAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133474 (https://phabricator.wikimedia.org/T390780) (owner: 10Reedy) [16:39:30] (03Merged) 10jenkins-bot: i18n: Add a help message to the login flow [extensions/EmailAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133476 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:39:47] (03PS1) 10Reedy: EmailAuth: Add override for emailauth-login-help [extensions/WikimediaMessages] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133478 (https://phabricator.wikimedia.org/T390662) [16:39:53] (03CR) 10Reedy: [C:03+2] EmailAuth: Add override for emailauth-login-help [extensions/WikimediaMessages] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133478 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:40:00] (03PS1) 10Reedy: EmailAuth: Add override for emailauth-login-help [extensions/WikimediaMessages] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133479 (https://phabricator.wikimedia.org/T390662) [16:40:07] (03CR) 10Reedy: [C:03+2] EmailAuth: Add override for emailauth-login-help [extensions/WikimediaMessages] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133479 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:51:05] (03Merged) 10jenkins-bot: EmailAuth: Add override for emailauth-login-help 
[extensions/WikimediaMessages] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133478 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:51:07] (03Merged) 10jenkins-bot: EmailAuth: Add override for emailauth-login-help [extensions/WikimediaMessages] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133479 (https://phabricator.wikimedia.org/T390662) (owner: 10Reedy) [16:52:07] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133468|EmailAuth: Allow forceEmailAuth test check without extension dependencies (T390437)]], [[gerrit:1133470|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[gerrit:1133472|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133469|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[ger [16:52:07] rit:1133471|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133473|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133474|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133475|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133476|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133478|EmailAuth: Add override for email [16:52:08] auth-login-help (T390662)]], [[gerrit:1133479|EmailAuth: Add override for emailauth-login-help (T390662)]] [16:52:11] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [16:52:11] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [16:52:12] T390780: Mask mailaddress during login that triggers EmailAuth - https://phabricator.wikimedia.org/T390780 [16:52:43] (03PS1) 10Bking: cirrussearch: use correct mandatory plugins [puppet] - 10https://gerrit.wikimedia.org/r/1133481 (https://phabricator.wikimedia.org/T388610) [16:53:37] (03PS1) 10Btullis: mediawiki-dumps-legacy: render a test config 
file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) [16:54:07] (03PS2) 10Btullis: mediawiki-dumps-legacy: render a test config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) [17:00:04] swfrench-wmf: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1700) [17:01:53] o/ [17:02:57] 16:53:21 Started build-and-push-container-images [17:02:57] 16:53:21 K8s images build/push output redirected to /home/reedy/scap-image-build-and-push-log [17:02:59] * Reedy is waiting [17:04:05] it's been pushing for ~8 minutes [17:04:14] indeed, yeah [17:05:06] (03CR) 10Bking: [C:03+2] cirrussearch: use correct mandatory plugins [puppet] - 10https://gerrit.wikimedia.org/r/1133481 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:05:09] `Apr 02 17:04:22 deploy1003 dockerd[1161]: time="2025-04-02T17:04:22.219902145Z" level=error msg="Upload failed, retrying: blob upload unknown"` [17:05:22] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1133481 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:07:11] fun times... [17:07:12] FIRING: PuppetFailure: Puppet has failed on cirrussearch2055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:08:33] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cirrussearch2055.codfw.wmnet with reason: adding net-new role [17:10:58] it does seem stuck [17:12:19] swfrench-wmf: Anything I can do to kick it from my side? 
[17:13:03] !log reloading varnish-frontend on A:cp and not A:cp-text_drmrs and not A:cp-text_codfw [17:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:10] Reedy: yeah, the uploads are retrying internally [17:13:18] I'd just let it sit for a bit [17:13:22] alas [17:13:36] * dancy eyes. [17:13:55] `Apr 02 17:05:14 deploy1003 dockerd[1161]: time="2025-04-02T17:05:14.400003411Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"` [17:14:14] ... so maybe not [17:15:10] definitely a scap related bug then if it's all just stopped [17:15:15] dancy: are we blaming you today? :P [17:15:39] Scap is ultimately calling `docker push`. [17:15:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10705172 (10phaultfinder) [17:15:46] trying to sort out what's up on the registry side [17:17:49] the 500 responses are (on the registry side), e.g. `err.detail="timeout expired while waiting for segments of /docker/registry/v2/repositories/restricted/mediawiki-multiversion/_uploads/d94d5000-a275-4f6a-b2c4-0a5144833f53/data to show up" err.message="unknown error"` [17:17:59] !log updating Cassandra/sessionstore `gc_grace_seconds` to 259200 (from 864000) [17:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:08] hmm.. That's in the swift.go part of the registry code. [17:19:09] oh, suddenly we have life again [17:19:24] 16:56:50 [mediawiki-publish] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-04-02-165321-publish [17:19:24] 17:19:00 [mediawiki-publish] docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-04-02-165321-publish [17:21:32] alright, so ... 
let's see what this brings once nodes actually start pulling these image =/ [17:21:35] *images [17:21:47] (03CR) 10Hnowlan: [C:03+2] api-gateway: use rest-gateway for wikifeeds calls to restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132585 (https://phabricator.wikimedia.org/T390317) (owner: 10Hnowlan) [17:23:15] (03Merged) 10jenkins-bot: api-gateway: use rest-gateway for wikifeeds calls to restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132585 (https://phabricator.wikimedia.org/T390317) (owner: 10Hnowlan) [17:24:14] dancy: Reedy: bad blob [17:24:20] swfrench-wmf: rude [17:24:28] * dancy shakes a fist at .. something [17:24:32] haven't we had a few of those recently? :( [17:24:51] Reedy: https://phabricator.wikimedia.org/T390251 [17:24:58] 692649729470c45170a3e7d6a93b50d7ae11255f4e89092020b65863dfb29c5e [17:25:48] 3c5cdca1daa864a401e2c17066aa2cdc373d15d10270f24216c1446e4ea1f478 [17:25:49] !log starting `nodetool garbagecollect` on sessionstore1004 [17:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:31] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [17:27:41] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [17:29:09] I'm pulling these down (curl) onto the deployment host to check them [17:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10705226 (10phaultfinder) [17:30:13] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [17:30:34] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [17:30:51] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [17:31:10] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [17:32:12] (03PS1) 10Bking: cirrussearch: add cirrussearch2055 to 
existing elastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/1133487 (https://phabricator.wikimedia.org/T388610) [17:32:21] well, this is definitely broken [17:32:22] 17:31:52 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. [17:32:22] 17:31:52 Stdout/stderr follows: [17:32:22] 17:31:52 skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-misc/eqiad.yaml" [17:32:22] skipping missing values file matching "values-main.yaml" [17:32:22] Comparing release=main, chart=wmf-stable/mediawiki, namespace=mw-misc [17:32:24] mw-misc, mw-misc.eqiad.main, Deployment (apps) has changed: [17:32:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133487 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:32:53] Reedy: yeah, these will time out / fail [17:33:14] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: add cirrussearch2055 to existing elastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/1133487 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:33:19] so, this will probably succeed if we retry with a scap sync-world [17:33:19] (03CR) 10Bking: [C:03+2] cirrussearch: add cirrussearch2055 to existing elastic cluster [puppet] - 10https://gerrit.wikimedia.org/r/1133487 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:33:39] ... 
but just hold on for a bit while I verify the registry is serving correct blobs now [17:33:56] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:34:16] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:34:25] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker2113:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:29] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:35:11] dancy: I realize this will slow things down a bit, but what if we single-track pushes in build-images.py et al.? [17:35:46] I think that's a reasonable experiment. I suspect that parallel pushes is what is triggering this problem. [17:36:00] I'll discuss with @dduvall [17:36:49] sounds good - it's not a solution, but should be a fairly straightforward experiment [17:37:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:37:32] dancy: Reedy: registry2004 and 2005 are serving correct content for both blobs now [17:38:45] dancy: can you think of anything you'd like to check before we attempt to push this through with a sync-world? [17:39:06] Nothing at this time. [17:39:20] sounds good - I'll follow up on the tast [17:39:21] (03PS2) 10SBassett: OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) [17:39:22] *task [17:39:53] Reedy: You can re-run scap backport. 
[17:40:12] (03CR) 10CI reject: [V:04-1] OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [17:40:23] (03CR) 10SBassett: OATHAuth: Mark checkuser and suppress as requiring 2FA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [17:40:52] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:41:01] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:41:31] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:41:31] (03PS3) 10SBassett: OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) [17:41:53] !incidents [17:41:53] 5939 (ACKED) Host pfw1-eqiad - PING - Packet loss = 100% [17:41:53] 5942 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [17:41:53] 5943 (ACKED) [2x] Primary inbound port utilisation over 80% (paged) network noc () [17:41:54] 5931 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [17:41:54] 5930 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [17:41:54] 5929 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [17:41:54] 5927 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-d-eqiad.mgmt.eqiad.wmnet) [17:42:06] (03CR) 10SBassett: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) 
[17:42:30] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:43:03] (03CR) 10Bartosz Dziewoński: [C:04-1] "No longer needed after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1130751" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [17:43:36] (03CR) 10SBassett: [C:04-2] OATHAuth: Mark checkuser and suppress as requiring 2FA (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [17:46:31] RESOLVED: [2x] Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:47:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:47:32] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133468|EmailAuth: Allow forceEmailAuth test check without extension dependencies (T390437)]], [[gerrit:1133470|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[gerrit:1133472|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133469|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[ger [17:47:32] rit:1133471|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133473|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133474|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133475|i18n: Add a help 
message to the login flow (T390662)]], [[gerrit:1133476|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133478|EmailAuth: Add override for email [17:47:32] auth-login-help (T390662)]], [[gerrit:1133479|EmailAuth: Add override for emailauth-login-help (T390662)]] [17:47:36] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [17:47:37] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [17:47:37] T390780: Mask mailaddress during login that triggers EmailAuth - https://phabricator.wikimedia.org/T390780 [17:51:11] (03PS1) 10JHathaway: efi: add efi fact to facter [puppet] - 10https://gerrit.wikimedia.org/r/1133491 (https://phabricator.wikimedia.org/T389217) [17:51:31] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133491 (https://phabricator.wikimedia.org/T389217) (owner: 10JHathaway) [17:54:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:57:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2003.codfw.wmnet with OS bookworm [17:58:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10705387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm [17:59:00] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:05] dancy and andre: MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T1800). Please do the needful. [18:00:08] !log reedy@deploy1003 reedy: Backport for [[gerrit:1133468|EmailAuth: Allow forceEmailAuth test check without extension dependencies (T390437)]], [[gerrit:1133470|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[gerrit:1133472|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133469|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[gerrit:1133471|EmailA [18:00:08] uthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133473|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133474|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133475|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133476|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133478|EmailAuth: Add override for emailauth-login-help (T [18:00:08] 390662)]], [[gerrit:1133479|EmailAuth: Add override for emailauth-login-help (T390662)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:00:12] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [18:00:12] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [18:00:13] T390780: Mask mailaddress during login that triggers EmailAuth - https://phabricator.wikimedia.org/T390780 [18:00:15] o/ [18:00:24] !log reedy@deploy1003 reedy: Continuing with sync [18:01:17] hopefully this doesnt take much longer... 
[18:01:26] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10705407 (10phaultfinder) [18:11:19] (03PS1) 10Ebernhardson: search: Transparently retry 503 errors [puppet] - 10https://gerrit.wikimedia.org/r/1133495 (https://phabricator.wikimedia.org/T390612) [18:11:46] (03PS2) 10Ebernhardson: search: Transparently retry 503 errors [puppet] - 10https://gerrit.wikimedia.org/r/1133495 (https://phabricator.wikimedia.org/T390612) [18:13:10] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133495 (https://phabricator.wikimedia.org/T390612) (owner: 10Ebernhardson) [18:15:00] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133468|EmailAuth: Allow forceEmailAuth test check without extension dependencies (T390437)]], [[gerrit:1133470|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[gerrit:1133472|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133469|EmailAuth: Add tests for EmailAuthRequireToken handler (T390437)]], [[ge [18:15:00] rrit:1133471|EmailAuthHooks: Exclude bot users from email auth check (T390662)]], [[gerrit:1133473|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133474|i18n: Add no email variant of login-message (T390780)]], [[gerrit:1133475|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133476|i18n: Add a help message to the login flow (T390662)]], [[gerrit:1133478|EmailAuth: Add override for emai [18:15:00] lauth-login-help (T390662)]], [[gerrit:1133479|EmailAuth: Add override for emailauth-login-help (T390662)]] (duration: 27m 28s) [18:15:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10705422 (10phaultfinder) [18:16:56] Reedy: OK for me to roll the train? [18:18:03] dancy: I think so. Thanks! [18:18:09] Thx. 
Proceeding. [18:18:40] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133496 (https://phabricator.wikimedia.org/T386218) [18:18:42] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133496 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot) [18:19:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10705427 (10Jhancock.wm) Error while setting up RAID │ An unexpected error occurred while setting up a preseeded RAID │ │ confi... [18:19:31] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133496 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot) [18:20:17] (03PS1) 10Andrew Bogott: Horizon: update eqiad1 docker version [puppet] - 10https://gerrit.wikimedia.org/r/1133497 (https://phabricator.wikimedia.org/T380531) [18:31:56] (03CR) 10Andrew Bogott: [C:03+2] Horizon: update eqiad1 docker version [puppet] - 10https://gerrit.wikimedia.org/r/1133497 (https://phabricator.wikimedia.org/T380531) (owner: 10Andrew Bogott) [18:32:36] swfrench-wmf: Please make sure to update https://phabricator.wikimedia.org/T390251 with the data from today. [18:33:03] dancy: yup, thanks! writing a wall of text - just taking a while to piece it all together :) [18:33:12] Awesome. Thanks! 
[18:33:47] (03PS1) 10D3r1ck01: SUL3: Fix user ID mismatch during login (immediately after creation) [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133500 (https://phabricator.wikimedia.org/T388177) [18:34:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10705566 (10phaultfinder) [18:35:43] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db1125.eqiad.wmnet - https://phabricator.wikimedia.org/T357092#10705573 (10Jhancock.wm) a:03VRiley-WMF [18:35:48] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.23 refs T386218 [18:35:50] T386218: 1.44.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T386218 [18:36:01] (03CR) 10Michael Große: "Sorry, I missed this ping in time. As far as I can tell, this should be fine. We have not actually used these until now, so we might as we" [puppet] - 10https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [18:36:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133500 (https://phabricator.wikimedia.org/T388177) (owner: 10D3r1ck01) [18:39:00] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:39:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:42:12] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:43:25] (03PS1) 10Bartosz Dziewoński: SUL3: Fix user ID mismatch during login (immediately after creation) [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133502 (https://phabricator.wikimedia.org/T388177) [18:43:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133502 (https://phabricator.wikimedia.org/T388177) (owner: 10Bartosz Dziewoński) [18:44:02] (03PS1) 10Bartosz Dziewoński: Remove redundant WaitConditionLoop from CentralAuthTokenManager [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133503 [18:44:10] (03PS1) 10Bartosz Dziewoński: Remove redundant WaitConditionLoop from CentralAuthTokenManager [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133504 [18:45:01] (03CR) 10D3r1ck01: "Thank you Matmarex, I forgot that the train has already gone out for this week. Will close the other backport." 
[extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133502 (https://phabricator.wikimedia.org/T388177) (owner: 10Bartosz Dziewoński) [18:45:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133503 (owner: 10Bartosz Dziewoński) [18:45:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133504 (owner: 10Bartosz Dziewoński) [18:45:50] (03Abandoned) 10D3r1ck01: SUL3: Fix user ID mismatch during login (immediately after creation) [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133500 (https://phabricator.wikimedia.org/T388177) (owner: 10D3r1ck01) [18:47:10] (03CR) 10Bartosz Dziewoński: "wmf.22 is still live on group2, we should probably backport to both versions." [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133502 (https://phabricator.wikimedia.org/T388177) (owner: 10Bartosz Dziewoński) [18:48:14] (03Restored) 10D3r1ck01: SUL3: Fix user ID mismatch during login (immediately after creation) [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133500 (https://phabricator.wikimedia.org/T388177) (owner: 10D3r1ck01) [18:48:39] (03CR) 10D3r1ck01: "group2 still on .22. We need on both branches." 
[extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133500 (https://phabricator.wikimedia.org/T388177) (owner: 10D3r1ck01) [18:49:04] jouncebot: next [18:49:04] In 1 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T2000) [18:49:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:16] will anyone be around to deploy the evening window? [18:52:52] I can do them [18:54:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe2003.codfw.wmnet with OS bookworm [18:54:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10705648 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm executed with errors: - apus-... [18:55:03] thanks Reedy [18:55:15] (03CR) 10Brouberol: [C:03+1] "Good idea!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [19:08:29] (03CR) 10Bking: [C:03+2] "Looks reasonable, based on similar entries in the same file." 
[puppet] - 10https://gerrit.wikimedia.org/r/1133495 (https://phabricator.wikimedia.org/T390612) (owner: 10Ebernhardson) [19:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:27] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10705711 (10Papaul) @Jhancock.wm that error looks to me that the server is missing an entry in partman. Have you checked it the server has a partman recipe? [19:17:57] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1133506 [19:18:01] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133507 [19:18:06] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133508 [19:30:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10705767 (10phaultfinder) [19:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10705766 (10phaultfinder) [19:32:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [19:34:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:46] (03CR) 10Pppery: "(I'm not qualified to comment on the .ech domains - 
they seem to be related to T205378 but I'm not sure what specifically they were intend" [puppet] - 10https://gerrit.wikimedia.org/r/1133507 (owner: 10Ncmonitor) [19:41:13] sukhe: where should the "ehc" domains actually point to? [19:41:22] ech [19:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:52] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133507 (owner: 10Ncmonitor) [19:44:40] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133507 (owner: 10Ncmonitor) [19:45:04] (03CR) 10Dzahn: "the automated change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133507 still picked up these ech domains as if they weren't add" [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [20:12:43] (03Merged) 10jenkins-bot: Remove redundant WaitConditionLoop from CentralAuthTokenManager [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133503 (owner: 10Bartosz Dziewoński) [20:12:52] (03Merged) 10jenkins-bot: Remove redundant WaitConditionLoop from CentralAuthTokenManager [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133504 (owner: 10Bartosz Dziewoński) [20:14:25] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133500|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133502|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133503|Remove redundant WaitConditionLoop from CentralAuthTokenManager]], [[gerrit:1133504|Remove redundant WaitConditionLoop from CentralAuthTokenManager]] 
[20:14:28] T388177: Wikimedia\NormalizedException\NormalizedException: User ID: {centralUserId} mismatch with {storedUserId} for user: {username} - https://phabricator.wikimedia.org/T388177 [20:16:09] (03CR) 10Dzahn: [C:03+2] mailman: list sync, add option to mail changes to an admin [puppet] - 10https://gerrit.wikimedia.org/r/1128564 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [20:16:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:18:28] Dreamy_Jazz: I'm trying to catch up on the recent filter blob changes but I'm lacking a lot of context. Is the blob cache meant to be by wiki, by filter, or both? [20:19:23] The impression I get from reading the code is that it should be per-wiki, right? [20:19:35] (03CR) 10Dzahn: [C:03+1] "+jinxer-wm> FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed -" [puppet] - 10https://gerrit.wikimedia.org/r/1133507 (owner: 10Ncmonitor) [20:19:44] Daimona: Thanks for making a patch for the issue. I think it should be per-wiki, but also take into account the protected variables that were used [20:20:01] Right. I meant in addition to what protected variables were used. [20:20:33] Okay, the cache is not per-filter because we wanted to avoid creating too many variable dumps [20:20:38] brett: the "widespread puppet failure" alert seems to have been caused by the ncredir change [20:21:06] The only need was to make the var dumps unique if the filter log contains protected variables. [20:21:06] it's missing a source on acmechief [20:21:19] Right, makes sense. I spent a couple minutes writing the code, now give me 10 minutes to write the commit message and I'll push it :P [20:21:34] Thanks. I'll be ready to review when you push. 
[20:21:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:21:56] I was also trying to figure out if and how this could've resulted in wrong data being stored in prod databases, but I still haven't wrapped my head around that. [20:22:01] swfrench-wmf: dancy k8s seems to be being slow again... [20:22:16] * dancy cries [20:22:23] (03CR) 10Dzahn: [C:03+1] "Could not evaluate: Could not retrieve information from environment production source(s) puppet://acmechief1002.eqiad.wmnet/acmedata/non-c" [puppet] - 10https://gerrit.wikimedia.org/r/1133507 (owner: 10Ncmonitor) [20:22:24] I don't think it should, because the code exited when the logs were attempted to be created. [20:22:34] 20:16:13 K8s deployment progress: 0% (ok: 0; fail: 0; left: 12) [20:22:36] 6 mins so far [20:22:39] I guess missing data is present, but probably not wrong data [20:22:43] * swfrench-wmf is looking [20:23:26] I see mediawiki-next-tls-proxy crash loop backoff in mw-debug eqiad [20:23:32] :( [20:24:23] yay? [20:24:27] haha [20:24:31] in that's a different problem ... 
[20:24:33] looking [20:24:34] Indeed [20:25:50] `[2025-04-02 20:22:02.923][1][critical][main] [source/server/server.cc:118] error initializing configuration '/etc/envoy/envoy.yaml': Unable to parse JSON as proto (INVALID_ARGUMENT: ...` [20:25:53] swfrench-wmf: hmm, envoy config error [20:25:55] ha yep [20:25:57] FIRING: [3x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:58] alright, something borked envoy config [20:26:10] * swfrench-wmf off to puppet history [20:26:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:26:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133495 ? [20:27:08] Dreamy_Jazz: that's when a global filter exists but no local filter with that ID exists, right? I was worried about the scenario where both a local and a global filter exist, and we could pick the wrong one. [20:27:12] FIRING: [9x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:18] looking at the puppet failures [20:27:29] swfrench-wmf: maybe, but it says "invalid value 503 for type TYPE_STRING" so I think we're looking for a patch that says `retry_on: 503` when it should say `retry_on: "503"` [20:27:37] in which case I think that patch is okay [20:27:37] Another upside.. caught at testservers! 
[20:27:44] not positive though, still checking [20:27:47] FIRING: [3x] HelmReleaseBadStatus: Helm release mw-misc/main on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:28:10] Daimona: Oh I see. I could see that being an issue. [20:28:11] (Also, wikiholism level: writing "likewiki" instead of "likewise" in the commit message. D'oh.) [20:28:14] (03PS1) 10Gergő Tisza: Enable EmailAuth enforcement on group 0/1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133532 (https://phabricator.wikimedia.org/T390437) [20:28:25] Reedy: You might as well control-C (just once) [20:28:34] I guess fix first and see if we find any broken entries later? [20:28:58] Any damage done in terms of wrong data will be already present as far as I can see [20:29:00] FIRING: [9x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:11] swfrench-wmf: wait except I guess that's hiera and maybe the Python config builder is outputting the wrong thing from it [20:29:30] definitely suspicious timing anyway [20:29:45] (03PS2) 10Gergő Tisza: Enable EmailAuth enforcement on group 0/1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133532 (https://phabricator.wikimedia.org/T390662) [20:29:55] It should be pretty obvious when things are broken because the variable dump would just not exist [20:29:58] Yeah, my question rn is more along the lines of: how do we even find the broken entries? After I push my patch I need to take a step back and re-read everything because I'm getting confused. 
[20:30:08] (03CR) 10Gergő Tisza: "Superseded by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133532 I think" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132996 (https://phabricator.wikimedia.org/T390662) (owner: 10Kosta Harlan) [20:30:19] swfrench-wmf et al, I need to leave for about an hour. Good luck to you all! [20:30:24] I think we would look for any log entry with a blobstore entry that does not exist on the wiki [20:30:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133532 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [20:30:49] As the wrong global flag is only used for the generating of the variable dump and not the log entry itself [20:30:57] RESOLVED: [3x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:31:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:32:01] Reedy: I suspect scap has exited by now? [20:32:05] puppet is unhappy in a few places [20:32:11] (03PS1) 10BCornwall: acme-chief: Switch redirect entry types [puppet] - 10https://gerrit.wikimedia.org/r/1133534 [20:32:13] swfrench-wmf: I ctrl+c'd it [20:32:14] Patch is up. I think you're right, but I'll re-read what I've done now that the dust has settled. 
[20:32:47] RESOLVED: [6x] HelmReleaseBadStatus: Helm release mw-misc/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:33:21] There could be a collision with an existing blob store address, but in that case it may be hard to determine if the blob store looks correct [20:33:22] dancy: ack, thanks! [20:33:27] Reedy: cool, thank you [20:33:33] swfrench-wmf: so that's built at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/services_proxy/envoy_service_listener_af_common.yaml.erb#70, I would have thought to_json does the right thing here? but maybe it's taking the string 503 and returning an int 503 [20:33:36] I'll brb (grabbing a drink) [20:33:43] rzl: in the rendered yaml, it's definitely an int [20:33:48] yeah [20:34:05] okay so that's a bug in our config templates, triggered by that puppet patch (which is itself perfectly fine) [20:34:10] what's the backport window status? waiting for repairs? [20:34:14] exactly, yeah [20:34:22] roll back the search patch, unblock the push, fix the templates on our own time? [20:34:26] tgr_: yeah, we're working to unstick [20:34:30] SGTM [20:34:34] ebernhardson: happen to be around? 
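[editor's note] The exchange above narrows the envoy failure down to a type mismatch: the error quoted in the log reports "invalid value 503 for type TYPE_STRING", and the rendered config is said to contain an integer 503 where a string was expected. The following is only a minimal Python sketch of that class of bug, not the actual puppet ERB template or the config builder. The function names `render_retry_on` and `is_valid_string_field` are hypothetical and exist purely for illustration.

```python
import json


def render_retry_on(value):
    """Serialize a value as a JSON string, even if it arrived as an int.

    Hypothetical helper: coercing with str() before serializing keeps a
    numeric-looking setting such as 503 quoted in the output.
    """
    return json.dumps(str(value))


def is_valid_string_field(rendered):
    """Mimic a strictly string-typed field: accept only JSON strings."""
    return isinstance(json.loads(rendered), str)


# Serializing the integer directly yields an unquoted number, which a
# strictly typed string field would reject.
naive = json.dumps(503)

# Coercing first yields a quoted string.
coerced = render_retry_on(503)

if __name__ == "__main__":
    print(naive, is_valid_string_field(naive))
    print(coerced, is_valid_string_field(coerced))
```

Whether the real fix belongs in the template, in the hiera data, or in the config builder is not settled in the log; the sketch only shows why an unquoted value and a quoted value behave differently once a consumer enforces a string type.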
[20:35:01] (I'm reverting) [20:35:35] sorry trying to catch up on this issue, does it also affect the broken puppetry on ncredir in drmrs, or is that a separate issue [20:35:48] (03PS1) 10RLazarus: Revert "search: Transparently retry 503 errors" [puppet] - 10https://gerrit.wikimedia.org/r/1133537 [20:36:02] jhathaway: the thing swfrench-wmf and I are looking at is unrelated to the puppet failures [20:36:13] nod okay, thanks [20:36:17] (03CR) 10CI reject: [V:04-1] Revert "search: Transparently retry 503 errors" [puppet] - 10https://gerrit.wikimedia.org/r/1133537 (owner: 10RLazarus) [20:36:21] and I believe Dreamy_Jazz/Daimona are investigating a third unrelated thing [20:36:34] Yeah. A train blocker that we will need to backport for. [20:36:36] Yeah sorry, we're on the new train blocker. [20:36:48] rzl: yup, i see your revert. That's ok [20:37:10] (03PS2) 10RLazarus: Revert "search: Transparently retry 503 errors" [puppet] - 10https://gerrit.wikimedia.org/r/1133537 [20:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [20:37:22] ebernhardson: thanks :) sorry for the inconvenience [20:37:50] (03CR) 10Ssingh: [C:03+1] acme-chief: Switch redirect entry types [puppet] - 10https://gerrit.wikimedia.org/r/1133534 (owner: 10BCornwall) [20:37:56] (03CR) 10BCornwall: [C:03+2] acme-chief: Switch redirect entry types [puppet] - 10https://gerrit.wikimedia.org/r/1133534 (owner: 10BCornwall) [20:38:15] swfrench-wmf: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133537 for your rubber stamp when able [20:38:16] (03CR) 10Scott French: [C:03+1] Revert "search: Transparently retry 503 errors" [puppet] - 10https://gerrit.wikimedia.org/r/1133537 (owner: 10RLazarus) [20:38:19] lol thanks [20:38:28] (03CR)
10RLazarus: [C:03+2] Revert "search: Transparently retry 503 errors" [puppet] - 10https://gerrit.wikimedia.org/r/1133537 (owner: 10RLazarus) [20:39:11] patiently awaiting puppet lock [20:39:11] (03PS1) 10Dreamy Jazz: AbuseLogger: properly distinguish between global filters and central DB [extensions/AbuseFilter] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133538 (https://phabricator.wikimedia.org/T390904) [20:39:38] (03CR) 10Dzahn: [C:03+1] acme-chief: Switch redirect entry types [puppet] - 10https://gerrit.wikimedia.org/r/1133534 (owner: 10BCornwall) [20:40:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10705961 (10phaultfinder) [20:41:23] I reached the point where I need to think out loud / dump my thoughts. So I started https://etherpad.wikimedia.org/p/a2lD3H3AZ5uZhIC3uCWC with my thoughts on how to identify broken entries [20:41:33] (03CR) 10BCornwall: [C:03+2] "That was an issue with If662a96a91e1b61b8bf368f43a6dd56f8f7cd5c9 which has been addressed in Ife42906ee9d322adad8a89879462476a2d7944a8" [puppet] - 10https://gerrit.wikimedia.org/r/1133507 (owner: 10Ncmonitor) [20:43:30] Just FYI - there are two committed changes in private/PrivateSettings.php right now, related to the security incident.  They should be fine to just go out with the next sync-world.  (I’m keeping an eye on logs for them). [20:44:49] rzl: how goes it with puppet? anything I can help with? 
[20:44:58] just waiting, puppet on deploy hosts is still slow [20:45:06] jhathaway: ncredir puppet breakage is related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133534 and other [20:45:25] ah, yes indeed they are [20:45:26] =/ [20:45:49] and, done [20:46:07] aaand helmfile diffs clean [20:46:12] sweet [20:46:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:46:57] Reedy: think you're unblocked, sorry about that [20:47:03] well, except for the image bumps that were abandoned by scap for stages after testservers, but such is ... those will get fixed momentarily [20:47:03] thanks mutante, brett shall we revert? [20:47:11] heh [20:47:13] np [20:47:51] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133500|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133502|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133503|Remove redundant WaitConditionLoop from CentralAuthTokenManager]], [[gerrit:1133504|Remove redundant WaitConditionLoop from CentralAuthTokenManager]] [20:47:54] T388177: Wikimedia\NormalizedException\NormalizedException: User ID: {centralUserId} mismatch with {storedUserId} for user: {username} - https://phabricator.wikimedia.org/T388177 [20:48:00] jhathaway: No [20:48:02] Wait, isn't my patch still wrong?
[20:48:08] it's been fixed [20:48:10] swfrench-wmf: meanwhile I'm actually not convinced that "503" is a valid retry policy, digging through the envoy source to confirm [20:48:13] ah never mind, I see the errors cleared, thanks [20:48:51] https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#config-route-v3-retrypolicy links to https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on which lists "5xx" but doesn't specify you can have a list of specific status codes [20:48:58] (03CR) 10Dzahn: [C:03+2] "noop for now. just added the option to do this. enabling later." [puppet] - 10https://gerrit.wikimedia.org/r/1128564 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [20:49:03] "retry_on: gateway-error" might DTRT though [20:49:07] ebernhardson: ^ too [20:49:14] Dreamy_Jazz: I think it is indeed still wrong. For global filters, it should pass $isGlobalFilter=true unconditionally. But $data['afl_global'] is always false because it's from the perspective of the central wiki. [20:49:36] rzl: heh, yeah I was just looking at that [20:49:51] Oh still wrong in the patch I've just merged? [20:50:12] yeah, https://github.com/envoyproxy/envoy/blob/4cd251072172cbacb5addbd0945d428db862c689/source/common/router/retry_state_impl.cc#L190 [20:50:23] Yes :/ I think so, at least [20:50:44] okay, it probably shouldn't have broken in that specific way, but [20:50:49] I removed your +2 out of an abundance of caution. And for once in my life, I'm happy that selenium is so slow that the patch still hasn't merged. [20:50:55] :D [20:51:15] I'm writing some notes in that etherpad too [20:51:22] I realized as I was writing in the etherpad. `$isGlobalFilter` is using the wrong perspective. [20:51:44] Ah I see. [20:52:09] oh, or there's retriable_status_codes too [20:52:10] rzl: so, the erb template for bare-metal hosts does quote 503, while our mesh chart module does not.
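To make the `retry_on` question above concrete, here is a minimal sketch of the kind of pre-merge check that would flag both an unquoted `503` (rendered as an int) and a quoted `"503"` (a string, but not a named policy). It is only an illustration: `check_retry_policy` and `KNOWN_RETRY_ON` are hypothetical names, the policy list is an assumed subset taken from the Envoy router docs linked above and should be checked against the deployed Envoy version, and the rule that `retriable_status_codes` only takes effect when `retry_on` includes `retriable-status-codes` is likewise an assumption from those docs.

```python
# Assumed subset of Envoy's documented x-envoy-retry-on policy names.
# Verify against the Envoy version actually in use before relying on it.
KNOWN_RETRY_ON = frozenset({
    "5xx",
    "gateway-error",
    "reset",
    "connect-failure",
    "retriable-4xx",
    "refused-stream",
    "retriable-status-codes",
})


def check_retry_policy(policy):
    """Return a list of human-readable problems found in a retry policy dict."""
    problems = []

    retry_on = policy.get("retry_on", "")
    if not isinstance(retry_on, str):
        # e.g. a template rendered `retry_on: 503` without quotes
        problems.append(
            f"retry_on should be a string, got {type(retry_on).__name__}"
        )
        names = []
    else:
        # retry_on is a comma-separated list of policy names
        names = [n.strip() for n in retry_on.split(",") if n.strip()]

    for name in names:
        if name not in KNOWN_RETRY_ON:
            problems.append(f"unknown retry_on policy: {name!r}")

    codes = policy.get("retriable_status_codes", [])
    for code in codes:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(code, bool) or not isinstance(code, int):
            problems.append(
                f"retriable_status_codes entry should be an int: {code!r}"
            )
    if codes and "retriable-status-codes" not in names:
        problems.append(
            "retriable_status_codes is set but retry_on lacks "
            "'retriable-status-codes'"
        )

    return problems


if __name__ == "__main__":
    for candidate in (
        {"retry_on": 503},
        {"retry_on": "503"},
        {"retry_on": "gateway-error"},
        {"retry_on": "retriable-status-codes", "retriable_status_codes": [503]},
    ):
        print(candidate, "->", check_retry_policy(candidate) or "ok")
```

A check like this only covers the retry policy fragment; it is not a substitute for validating the full rendered config with Envoy itself.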
[20:52:21] Surely that was the case then before my patch? [20:52:29] !log reedy@deploy1003 d3r1ck01, matmarex, reedy: Backport for [[gerrit:1133500|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133502|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133503|Remove redundant WaitConditionLoop from CentralAuthTokenManager]], [[gerrit:1133504|Remove redundant WaitConditionLoop from CentralAuthTokenManager] [20:52:29] ] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:52:33] In my defense, I did say it was a subtle mistake. So subtle that it was followed by at least 2 subtly wrong explanations of the mistake itself. [20:52:36] rzl: _but_ as you point out, 503 might not be supported at all [20:52:39] MatmaRex: Anything you actually want to test? [20:52:46] yeah, confirmed it is not [20:52:49] Before your patch we did not need to know if a filter was global, because we did not grab the Filter object. [20:52:55] Reedy: not really [20:53:01] We only needed to know if the log was going to the local or foreign DB. [20:53:05] np, need to wait for some other testing :) [20:53:08] if we want to set `retriable_status_codes: [503]` we can do that, and in that case it's a repeated uint32 so no casting shenanigans [20:53:11] i can do some smoke tests [20:53:22] rzl: nice [20:53:36] I'll throw all this in a note on T390612 [20:53:37] T390612: Search requests failing due to connection closure - https://phabricator.wikimedia.org/T390612 [20:53:52] I'll add some more thoughts to the etherpad [20:54:06] And I'll make yet another patchset. Hopefully the good one this time. [20:54:31] I'll update the pad, but this means log entries in meta should be unaffected. [20:54:45] swfrench-wmf: and then, check me but I don't think there's anything actionable for serviceops here? 
we could fail more gracefully on bad retry_on inputs but I'm not inclined to consider that super high priority [20:55:08] I guess in a pie-in-the-sky way it'd be nice to run a full envoy config validation in CI, but [20:56:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:58:11] rzl: yeah, if we're operating on the assumption that `retry_on` is a string-ish enum of the documented policies, then there's no action here. I am a little puzzled as to why envoys on bare metal aren't barfing at `retry_on: "503"` [20:58:25] jinxer-wm claims that but there isn't a single drmrs host in that link [20:58:51] rzl: ... it might just be that that's an "accepted" structurally valid (but not semantically) config, though [21:00:00] yeah that's interesting [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T2100) [21:00:27] !log reedy@deploy1003 d3r1ck01, matmarex, reedy: Continuing with sync [21:01:35] OK, it's time to run some queries. [21:01:35] been a long time since I worked on this, but I think at startup Envoy only verifies the proto is correct -- and since this is a string not an enum (for whatever reason) maybe it would only fail when it went to actually make a retry decision [21:01:45] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10706031 (10Jclark-ctr) @BTullis can you confirm name an-druid1005 hostname is already in use.
I have added servers to netbox as an-druid1006/7 currently [21:01:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:02:16] and maybe *that's* because this is behavior that "normally" comes from an HTTP header, which we can override with a config value but it's still treated internally as header-ish, meaning improper values could come in at any time [21:02:17] rzl: yeah, that's what I'm thinking [21:02:57] I'm late for a meeting with our dynamic envoy config expert so I'll let her know :D [21:03:04] hehe [21:03:07] enjoy [21:03:20] (03PS1) 10Bking: cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133551 (https://phabricator.wikimedia.org/T388610) [21:03:24] (03CR) 10Ryan Kemper: [C:03+2] search: update WDQS update lag SLI/SLO queries [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 (owner: 10DCausse) [21:05:57] (03CR) 10RLazarus: [C:03+2] "Correcting my revert message: "503" isn't actually a valid retry_on for Envoy. 
More at https://phabricator.wikimedia.org/T390612#10706034" [puppet] - 10https://gerrit.wikimedia.org/r/1133537 (owner: 10RLazarus) [21:06:04] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] search: update WDQS update lag SLI/SLO queries [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 (owner: 10DCausse) [21:07:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10706057 (10VRiley-WMF) replaced drives in an-worker1185 and 1186 [21:07:55] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133500|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133502|SUL3: Fix user ID mismatch during login (immediately after creation) (T388177)]], [[gerrit:1133503|Remove redundant WaitConditionLoop from CentralAuthTokenManager]], [[gerrit:1133504|Remove redundant WaitConditionLoop from CentralAuthTokenManager]] [21:07:55] (duration: 20m 03s) [21:07:58] T388177: Wikimedia\NormalizedException\NormalizedException: User ID: {centralUserId} mismatch with {storedUserId} for user: {username} - https://phabricator.wikimedia.org/T388177 [21:08:44] (03Abandoned) 10Kosta Harlan: EmailAuth: Enable "enforce" mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132996 (https://phabricator.wikimedia.org/T390662) (owner: 10Kosta Harlan) [21:09:27] thanks Reedy [21:10:15] (03PS2) 10Dreamy Jazz: AbuseLogger: properly distinguish between global filters and central DB [extensions/AbuseFilter] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133538 (https://phabricator.wikimedia.org/T390904) [21:10:48] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706069 (10phaultfinder) [21:12:01] Daimona: With the etherpad comments, would you agree that the risk of incorrect data in the variable dump is low to none? 
I think the fix should still be backported, but I don't think there needs to be any cleanup. [21:13:04] jouncebot: nowandnext [21:13:04] For the next 0 hour(s) and 46 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T2100) [21:13:04] In 0 hour(s) and 46 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T2200) [21:13:17] WF aren't using their window [21:13:37] (03CR) 10Reedy: [C:03+2] Enable EmailAuth enforcement on group 0/1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133532 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [21:13:59] Thanks. Daimona, I can backport your fix to unblock the train once Reedy has finished their backports. [21:14:01] I still need to read the second scenario [21:14:17] But I'm feeling quite confident, yeah. [21:14:27] (03Merged) 10jenkins-bot: Enable EmailAuth enforcement on group 0/1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133532 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [21:14:29] Sure. Okay. 
Summary is that the second scenario would cause the status quo before the patch [21:15:28] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133532|Enable EmailAuth enforcement on group 0/1 (T390662)]] [21:15:31] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [21:16:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:17:05] (03PS1) 10Ssingh: Release 9.2.10-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) [21:17:28] (03CR) 10CI reject: [V:04-1] Release 9.2.10-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [21:17:35] (03PS2) 10Ssingh: Release 9.2.10-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) [21:20:06] (03PS4) 10SBassett: OATHAuth: Mark checkuser, suppress and bureaucrat as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) [21:21:46] !log reedy@deploy1003 reedy, tgr: Backport for [[gerrit:1133532|Enable EmailAuth enforcement on group 0/1 (T390662)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:49] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [21:23:21] Dreamy_Jazz: okay, I read that part. I still don't have context about the recent changes but I trust your conclusions. I'm going to get a list of affected filters just to be 100% sure. [21:23:47] Sure. Thanks for investigating this. 
[21:23:55] !log reedy@deploy1003 reedy, tgr: Continuing with sync [21:27:48] (03PS1) 10Ryan Kemper: wdqs-update-lag: don't count wdqs-categories lag [puppet] - 10https://gerrit.wikimedia.org/r/1133554 [21:28:03] (03PS2) 10Ryan Kemper: wdqs-update-lag: don't count wdqs-categories lag [puppet] - 10https://gerrit.wikimedia.org/r/1133554 [21:31:11] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133532|Enable EmailAuth enforcement on group 0/1 (T390662)]] (duration: 15m 42s) [21:31:14] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [21:31:59] Anyone else in the queue to deploy? If not, I'll go now. [21:32:48] I think you're good dancy [21:32:50] ffs Dreamy_Jazz [21:33:01] Thanks! [21:33:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133538 (https://phabricator.wikimedia.org/T390904) (owner: 10Dreamy Jazz) [21:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706133 (10phaultfinder) [21:34:38] Success cache is so nice! 
[21:34:39] (03Merged) 10jenkins-bot: AbuseLogger: properly distinguish between global filters and central DB [extensions/AbuseFilter] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133538 (https://phabricator.wikimedia.org/T390904) (owner: 10Dreamy Jazz) [21:35:02] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1133538|AbuseLogger: properly distinguish between global filters and central DB (T390904)]] [21:35:05] T390904: MediaWiki\Extension\AbuseFilter\Filter\FilterNotFoundException: Filter 222 does not exist - https://phabricator.wikimedia.org/T390904 [21:35:40] !log starting `nodetool garbagecollect` on Cassandra/sessionstore1005 [21:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:10] !log starting `nodetool garbagecollect` on Cassandra/sessionstore2004 [21:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:34] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1133538|AbuseLogger: properly distinguish between global filters and central DB (T390904)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:41:36] T390904: MediaWiki\Extension\AbuseFilter\Filter\FilterNotFoundException: Filter 222 does not exist - https://phabricator.wikimedia.org/T390904 [21:44:49] (03PS1) 10Ryan Kemper: ElevatedMaxLagWDQS: operate only on wdqs traffic [alerts] - 10https://gerrit.wikimedia.org/r/1133556 [21:47:39] Trying to test at the moment, finding hard to trigger the original issue [21:51:34] Sorry, something came up, then I had a few consecutive brainfarts with the query, but now I'm querying the filters. Could you test with https://meta.wikimedia.org/wiki/Special:AbuseFilter/222 ? 
[21:52:36] I'm finding it hard to test with that filter given that I don't want to edit on a non test wiki [21:52:44] and test.wikipedia.org has the filter with ID 222 [21:52:49] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:52:52] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [21:52:54] So I can't see that the exception is gone. [21:53:00] I'm just going to proceed [21:53:05] As it doesn't break anything it seems [21:53:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:53:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:53:10] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [21:53:16] Ah you're right [21:53:17] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:53:29] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:53:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:53:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:53:43]
What about test2wiki? [21:54:02] Can't seem to create pages in mainspace [21:54:14] I think it's set to autoconfirmed and above [21:54:19] Which would exclude testing with filter 222 [21:54:35] rip [21:54:53] I think a lack of logs will help verify that the fix has worked [21:55:02] *error logs [21:55:24] Yeah I think so. [21:55:25] I was trying to find another filter to test with [21:55:42] But I can do that now and see if the fix is in place [21:55:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [21:57:48] Found that filter 241 would work for the testing. [21:59:20] Appears to no longer cause an exception based on my testing. [21:59:53] i.e. triggering a global filter where the local ID does not exist causes the log entry to still be created [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T2200) [22:00:21] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133538|AbuseLogger: properly distinguish between global filters and central DB (T390904)]] (duration: 25m 19s) [22:00:24] T390904: MediaWiki\Extension\AbuseFilter\Filter\FilterNotFoundException: Filter 222 does not exist - https://phabricator.wikimedia.org/T390904 [22:00:29] Done with my deplots. [22:00:32] *deploys [22:01:01] Great. And I'm almost done looking for affected filters.
[22:01:42] !log Import ncmonitor 1.3.3 into bookworm-wikimedia [22:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:43] (03CR) 10Gergő Tisza: [C:03+1] OATHAuth: Mark checkuser, suppress and bureaucrat as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [22:03:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:04:24] And I can confirm that no filters are affected (added to the pad). So that's all I guess! [22:04:48] (03CR) 10Reedy: OATHAuth: Mark checkuser, suppress and bureaucrat as requiring 2FA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [22:07:17] I've closed the task, thanks Dreamy_Jazz for help with debugging and the deployment! [22:07:41] Np and thanks too! 
[22:08:16] Ohhhh right, the NormalizedException stuff [22:08:21] Lemme make a patch real quick [22:08:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:11:02] <3 [22:12:36] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133562 [22:12:40] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133563 [22:18:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:18:50] (03PS1) 10JHathaway: puppetserver: revert private repo settings [puppet] - 10https://gerrit.wikimedia.org/r/1133564 (https://phabricator.wikimedia.org/T385995) [22:20:10] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133562 (owner: 10Ncmonitor) [22:25:26] puppet failures are me, investigating.. [22:33:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:34:09] (03PS1) 10Andrew Bogott: nova vendor data: don't upgrade packages during cloud-init. [puppet] - 10https://gerrit.wikimedia.org/r/1133571 (https://phabricator.wikimedia.org/T390822) [22:35:33] (03CR) 10Andrew Bogott: [C:03+2] nova vendor data: don't upgrade packages during cloud-init. 
[puppet] - 10https://gerrit.wikimedia.org/r/1133571 (https://phabricator.wikimedia.org/T390822) (owner: 10Andrew Bogott) [22:38:03] !log puppet private repo changes completed, T385995 [22:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:12] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:43:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:50:26] jhathaway: I think my last patch https://gerrit.wikimedia.org/r/1133571 fell through the cracks and didn't get merged on puppetservers. Were you doing puppetserver maintenance right when I merged? 
[23:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:17:29] (03PS1) 10Tim Starling: Temporarily disable Lua profiler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133575 (https://phabricator.wikimedia.org/T389734) [23:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706352 (10phaultfinder) [23:26:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2060:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2060 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:28:14] !log starting `nodetool garbagecollect` on Cassandra/sessionstore2005 [23:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:48] !log starting `nodetool garbagecollect` on Cassandra/sessionstore1006 [23:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [23:40:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1133579 [23:40:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1133579 (owner: 10TrainBranchBot) [23:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:50:54] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1133579 (owner: 10TrainBranchBot) [23:55:35] (03PS1) 10C. Scott Ananian: Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property [core] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133581 (https://phabricator.wikimedia.org/T390420) [23:56:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133581 (https://phabricator.wikimedia.org/T390420) (owner: 10C. Scott Ananian) [23:57:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [23:58:12] (03CR) 10BryanDavis: [C:03+1] Temporarily disable Lua profiler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133575 (https://phabricator.wikimedia.org/T389734) (owner: 10Tim Starling)