[00:06:26] RESOLVED: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:03] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [00:37:28] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [00:38:14] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [00:38:39] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [01:00:50] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:12:25] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 35s) [01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:42:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:43:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:09:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:31:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:31:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:41:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:42:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T0600) [06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [06:35:58] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11188113 (10Joe) 05Open→03Resolved I will tentatively close this task for now. [06:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:25] FIRING: [2x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:04] maps2012 is expected, I'll down it and 2013/2014 [06:46:25] FIRING: [3x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:19] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps[2012-2014].codfw.wmnet with reason: in setup [06:50:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [06:54:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [06:58:00] 06SRE, 10Hiddenparma, 06Traffic: Better mapping of requests coming from datacenters/clouds - 
https://phabricator.wikimedia.org/T400120#11188127 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [07:00:04] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T0700). [07:00:04] sergi0: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] o/ [07:00:24] I can self-deploy [07:01:17] sergi0: how much time do you estimate? [07:01:41] 3 min or less, it's a labs only change [07:02:02] is it ok? [07:02:29] yes thanks! [07:02:57] !log upgrading Envoy on debmonitor T403663 [07:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:02] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [07:05:56] @effie all done [07:06:03] thank you [07:07:18] thanks, syncing my patch now [07:09:27] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1189038|hCaptcha: Enable on phase 1 wikis (T402366)]] [07:09:33] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:11:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:11:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:13:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2011.codfw.wmnet with OS bookworm [07:15:25] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1189038|hCaptcha: Enable on phase 1 wikis (T402366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:15:29] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:16:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.756 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:16:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.927 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:24:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2012.codfw.wmnet with OS bookworm [07:25:11] !log kharlan@deploy1003 kharlan: Continuing with sync [07:30:35] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189038|hCaptcha: Enable on phase 1 wikis (T402366)]] (duration: 21m 08s) [07:30:40] T402366: hCaptcha account creation trial deployment tracker - https://phabricator.wikimedia.org/T402366 [07:32:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:34:10] FIRING: BFDdown: BFD session down between cr2-eqiad and 208.80.154.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:36:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:36:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:37:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/2 (Transport: cr2-codfw:xe-0/1/1:1 (Lumen, 442550293) {#5249}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:39:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and 208.80.154.215 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:41:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:41:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11188195 (10elukey) For cp2050 I keep getting this: ` GET https://10.193.3.234/redfish/v1/TaskService/TaskMonitors/JID_580944559377 returned HTTP 400 Response... 
[07:43:37] done syncing [07:43:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage [07:43:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:49:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage [07:49:32] elukey@cumin1003 provision (PID 2932716) is awaiting input [07:51:36] jouncebot: next [07:51:36] In 2 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [07:52:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:53:10] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:56:52] doing another backport [08:03:11] this is me, ignore v [08:05:30] PROBLEM - MariaDB read only db_inventory #page on db2185 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.11.13-MariaDB-log, Uptime 570363s, event_scheduler: True, 100.34 QPS, connection latency: 0.030052s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:05:40] This is a test, ignore ^ [08:06:05] <_joe_> jynus: test successful? 
:) [08:06:23] yes [08:06:30] RECOVERY - MariaDB read only db_inventory #page on db2185 is OK: Version 10.11.13-MariaDB-log, Uptime 570423s, read_only: True, event_scheduler: True, 98.51 QPS, connection latency: 0.029999s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2012.codfw.wmnet with OS bookworm [08:10:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2013.codfw.wmnet with OS bookworm [08:13:03] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:15:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:22:46] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1189108|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189107|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189106|hCaptcha: Remove non-existent message]], [[gerrit:1189105|hCaptcha: Remove non-existent message]] [08:22:52] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [08:27:58] !log upgrading Envoy on IDM hosts T403663 [08:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:02] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [08:28:42] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1189108|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189107|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189106|hCaptcha: Remove non-existent message]], [[gerrit:1189105|hCaptcha: Remove non-existent message]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [08:28:46] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [08:28:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage [08:30:32] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:30:49] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [08:30:54] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:31:07] !log kharlan@deploy1003 kharlan: Continuing with sync [08:33:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage [08:33:30] !log restart pybal on lvs1019/lvs2013/lvs2014 to clear out alert [08:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:28] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189108|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189107|hCaptcha: Track events via Prometheus (T402767)]], [[gerrit:1189106|hCaptcha: Remove non-existent message]], [[gerrit:1189105|hCaptcha: Remove non-existent message]] (duration: 13m 41s) [08:36:32] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [08:37:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:38:54] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [08:40:16] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [08:42:18] !log upgrading Envoy on deployment hosts T403663 [08:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:23] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [08:43:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:44:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:50:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11188395 (10elukey) cp2051 worked, cp2052 showed the issue, cp2053 worked. 
[08:50:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:52:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2013.codfw.wmnet with OS bookworm [08:53:29] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:54:14] (03PS2) 10Muehlenhoff: Remove obsolete ganeti_init.sh script [puppet] - 10https://gerrit.wikimedia.org/r/1189117 [08:56:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:57:52] (03CR) 10Stevemunene: [C:03+1] deployment_server: restore service private files ownership [puppet] - 10https://gerrit.wikimedia.org/r/1188795 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [08:59:40] jouncebot: nowandnext [08:59:40] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [08:59:41] In 1 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [09:00:45] (03CR) 10Ladsgroup: [C:03+1] instances.yaml: add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:01:20] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: add es2049 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1188769 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [09:02:54] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis 
set policy GRACEFUL_RESTART [09:05:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:06:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:06:49] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:06:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:11:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1206 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83384 and previous config saved to /var/cache/conftool/dbconfig/20250917-091137-ladsgroup.json [09:11:43] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [09:14:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:15:09] (03PS1) 10Elukey: WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - 10https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:16:26] (03CR) 10Brouberol: [V:03+1 C:03+2] 
deployment_server: restore service private files ownership [puppet] - 10https://gerrit.wikimedia.org/r/1188795 (https://phabricator.wikimedia.org/T404068) (owner: 10Brouberol) [09:17:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps2014.codfw.wmnet with OS bookworm [09:17:10] (03PS2) 10Elukey: WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - 10https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:17:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:18:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:19:19] (03CR) 10Elukey: [C:03+1] Re-add maps2011 [puppet] - 10https://gerrit.wikimedia.org/r/1189110 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:19:25] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:19:33] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:19:53] (03PS1) 10Cathal Mooney: Revert "cephosd: un-set bird bgp neighbors rather than override for each host" [puppet] - 10https://gerrit.wikimedia.org/r/1189119 [09:20:21] (03CR) 10CI reject: [V:04-1] Revert "cephosd: un-set bird bgp neighbors rather than override for each host" [puppet] - 10https://gerrit.wikimedia.org/r/1189119 (owner: 10Cathal Mooney) [09:20:33] !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-codfw [09:21:56] RECOVERY - Kafka broker TLS certificate validity on kafka-main2006 is OK: SSL OK - Certificate kafka-main2006.codfw.wmnet valid until 2026-08-23 08:25:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:23:00] !log elukey@cumin1003 END (FAIL) - 
Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:24:15] (03PS1) 10Hnowlan: (api|rest)-gateway: move Via header definition to response [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189121 (https://phabricator.wikimedia.org/T401396) [09:24:19] (03PS2) 10Cathal Mooney: Cephosd: revert to manually setting up peering IPs [puppet] - 10https://gerrit.wikimedia.org/r/1189119 [09:24:23] (03CR) 10CI reject: [V:04-1] (api|rest)-gateway: move Via header definition to response [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189121 (https://phabricator.wikimedia.org/T401396) (owner: 10Hnowlan) [09:24:34] (03CR) 10CI reject: [V:04-1] WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - 10https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [09:25:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:26:12] (03PS2) 10Hnowlan: (api|rest)-gateway: move Via header definition to response [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189121 (https://phabricator.wikimedia.org/T401396) [09:26:46] (03CR) 10Muehlenhoff: [C:03+2] imposm-initial-import: Fix check whether imposm is running [puppet] - 10https://gerrit.wikimedia.org/r/1188801 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:26:53] (03PS2) 10Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) [09:27:44] (03CR) 10Marco Fossati: ReaderExperiments' ImageBrowsing stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [09:27:56] RECOVERY - Kafka broker TLS certificate validity on kafka-main2008 is OK: SSL OK - Certificate kafka-main2008.codfw.wmnet valid until 2026-08-23 08:30:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:28:44] !log mass deleting watchlist of bots with > 50K watchlist rows (T404808) [09:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:48] T404808: Clean up large bots watchlists in all wikis - https://phabricator.wikimedia.org/T404808 [09:29:38] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:33:34] (03PS3) 10Elukey: WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - 10https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:33:34] RECOVERY - Kafka broker TLS certificate validity on kafka-main2009 is OK: SSL OK - Certificate kafka-main2009.codfw.wmnet valid until 2026-08-23 08:27:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:33:56] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:35:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add es2049', diff saved to https://phabricator.wikimedia.org/P83385 and previous config saved to /var/cache/conftool/dbconfig/20250917-093550-fceratto.json [09:36:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [09:36:13] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:37:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1251 from api group of s1 (T403966)', diff saved to https://phabricator.wikimedia.org/P83386 and previous config saved to /var/cache/conftool/dbconfig/20250917-093718-ladsgroup.json [09:37:22] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [09:38:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2049 slowly with 10 steps - Pooling in new host [09:39:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [09:40:09] RECOVERY - Kafka broker TLS certificate validity on kafka-main2010 is OK: SSL OK - Certificate kafka-main2010.codfw.wmnet valid until 2026-08-23 08:37:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:40:18] (03CR) 10CI reject: [V:04-1] WIP: sre.hosts.provision: check attributes after rebooting [cookbooks] - 10https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [09:41:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1184 (s1 candidate master) from api group of s1 (T403966)', diff saved to 
https://phabricator.wikimedia.org/P83388 and previous config saved to /var/cache/conftool/dbconfig/20250917-094124-ladsgroup.json [09:41:43] RECOVERY - Kafka broker TLS certificate validity on kafka-main2007 is OK: SSL OK - Certificate kafka-main2007.codfw.wmnet valid until 2026-08-23 08:27:00 +0000 (expires in 339 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [09:42:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [09:42:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:42:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:43:11] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826 (Joe) NEW [09:43:53] jouncebot: nowandnext [09:43:53] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [09:43:53] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [09:50:33] (PS4) Elukey: sre.hosts.provision: check attributes after rebooting [cookbooks] - https://gerrit.wikimedia.org/r/1189118 (https://phabricator.wikimedia.org/T394357) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:52:22] (CR) Ladsgroup: [C:+1] preseed.yaml: Remove es2050 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [09:52:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:53:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:54:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:21] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:57:23] elukey@cumin1003 provision (PID 2947963) is awaiting input [09:57:32] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:57:32] (PS3) Arnaudb: Revert^4 "mailman: add a local disk cache" [puppet] - https://gerrit.wikimedia.org/r/1188798 (https://phabricator.wikimedia.org/T353891) [09:59:05] 
!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2014.codfw.wmnet with OS bookworm [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [10:01:46] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review, and 2 others: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11188609 (ABran-WMF) https://gerrit.wikimedia.org/r/1188798 creates a 1GB local disk cache that should help with thos... [10:01:52] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [10:02:53] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [10:03:23] hi everybody, going to deploy mobileapps as part of the MW infra window (upgrading the statsd sidecar only) [10:04:37] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: sync [10:05:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:05:18] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync [10:06:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [10:07:35] (CR) Muehlenhoff: [C:+2] Re-add maps2011 [puppet] - https://gerrit.wikimedia.org/r/1189110 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [10:09:26] jouncebot: nowandnext [10:09:26] For the next 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1000) [10:09:26] In 0 hour(s) and 50 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100) [10:10:03] elukey: Can I deploy an operations/mediawiki-config change while you are doing that or would you prefer not? [10:11:37] (PS1) Dreamy Jazz: Deploy suggested investigations to testwiki and test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) [10:12:00] Dreamy_Jazz: o/ already done, the only thing that worries me a little is https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=panel-13&from=now-6h&to=now&timezone=utc [10:12:29] Yeah, that isn't ideal [10:12:35] it is trending down, let's see if it settles [10:12:36] (CR) Federico Ceratto: [C:+2] preseed.yaml: Remove es2050 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1188738 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [10:14:33] (CR) Mszwarc: [C:+1] Deploy suggested investigations to testwiki and test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) (owner: Dreamy Jazz) [10:16:15] !log installing openjpeg2 security updates [10:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:18:02] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:18:53] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) 
(owner: Dreamy Jazz) [10:19:46] (Merged) jenkins-bot: Deploy suggested investigations to testwiki and test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1189125 (https://phabricator.wikimedia.org/T404830) (owner: Dreamy Jazz) [10:20:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:20:12] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] [10:20:17] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:22:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1235 from api group of s1 (T403966)', diff saved to https://phabricator.wikimedia.org/P83391 and previous config saved to /var/cache/conftool/dbconfig/20250917-102225-ladsgroup.json [10:22:31] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:24:42] Dreamy_Jazz: green light [10:25:28] Thanks. I was seeing a trend down to 0, so started already [10:26:08] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:26:13] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:27:58] !log dreamyjazz@deploy1003 Sync cancelled. 
[10:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:08] (PS1) Dreamy Jazz: Set virtual domain mapping for virtual-checkuser [mediawiki-config] - https://gerrit.wikimedia.org/r/1189126 (https://phabricator.wikimedia.org/T404830) [10:30:40] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189126 (https://phabricator.wikimedia.org/T404830) (owner: Dreamy Jazz) [10:30:42] (CR) Hnowlan: [C:+1] switchdc: call delete_collection_namespaced_cron_job if available [cookbooks] - https://gerrit.wikimedia.org/r/1187544 (https://phabricator.wikimedia.org/T399891) (owner: Jasmine) [10:31:31] (Merged) jenkins-bot: Set virtual domain mapping for virtual-checkuser [mediawiki-config] - https://gerrit.wikimedia.org/r/1189126 (https://phabricator.wikimedia.org/T404830) (owner: Dreamy Jazz) [10:31:58] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189126|Set virtual domain mapping for virtual-checkuser (T404830)]], [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] [10:32:02] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:33:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1169 from api group of s1 (T403966)', diff saved to https://phabricator.wikimedia.org/P83393 and previous config saved to /var/cache/conftool/dbconfig/20250917-103306-ladsgroup.json [10:33:11] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:37:39] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189126|Set virtual domain mapping for virtual-checkuser (T404830)]], [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:37:43] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:40:42] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11188841 (elukey) Updated the BMC and the firmware, that seems still in progress. 
Will check later on :) [10:42:37] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [10:46:41] !log trigger full OSM import on maps2011 T381565 [10:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:45] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [10:47:02] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1186461 (owner: PipelineBot) [10:47:56] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189126|Set virtual domain mapping for virtual-checkuser (T404830)]], [[gerrit:1189125|Deploy suggested investigations to testwiki and test2wiki (T404830)]] (duration: 15m 58s) [10:48:00] T404830: Deploy suggested investigations to testwiki and test2wiki - https://phabricator.wikimedia.org/T404830 [10:48:27] (PS5) Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [10:48:30] (CR) Brouberol: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene) [10:48:42] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1186461 (owner: PipelineBot) [10:49:32] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11188866 (SLyngshede-WMF) Personally I don't love the private repository with Puppet code inside it, as it hides a lot of information. I get that this is the idea, but it mak... 
[10:50:31] (CR) CI reject: [V:-1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene) [10:51:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s1 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83395 and previous config saved to /var/cache/conftool/dbconfig/20250917-105102-ladsgroup.json [10:51:07] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:54:46] (PS6) Brouberol: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: Stevemunene) [10:55:54] (PS3) Stevemunene: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) [10:57:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Rebalance s3 in codfw (T403966)', diff saved to https://phabricator.wikimedia.org/P83397 and previous config saved to /var/cache/conftool/dbconfig/20250917-105709-ladsgroup.json [10:57:15] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [10:57:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:58:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:59:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1196 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83398 and previous config saved to 
/var/cache/conftool/dbconfig/20250917-105946-ladsgroup.json [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100). nyaa~ [11:00:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:00:31] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:00:36] SRE, Hiddenparma, Traffic: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11188888 (MoritzMuehlenhoff) It's worth mentioning that starting next quarter we'll start work on moving the user data currently defined in data.yaml to a private repository,... [11:02:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:02:37] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:02:41] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [11:03:01] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:06:37] ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11188900 (elukey) Done up to cp2058, all good (excluding cp2056 as requested). Next steps: - Upgrade firmwares - Check why the cookbook didn't run on cp2052... 
[11:08:05] elukey@cumin1003 reimage (PID 2954571) is awaiting input [11:08:32] (PS1) Jelto: ceph: add module to sync a bucket locally [puppet] - https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) [11:09:17] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11188903 (elukey) Tried provision and then reimage, this time I clearly noticed a PXE/HTTP boot request but it ended up in the OS booting (it was quick and... [11:09:50] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:10:16] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:12:39] (PS1) Federico Ceratto: es2050.yaml, site.pp: Prepare es2050 [puppet] - https://gerrit.wikimedia.org/r/1189131 (https://phabricator.wikimedia.org/T402859) [11:13:31] (PS1) Hnowlan: trafficserver: multi-dc: use client request host rather than remap [puppet] - https://gerrit.wikimedia.org/r/1189132 (https://phabricator.wikimedia.org/T401396) [11:15:26] ops-eqiad, SRE, DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11188939 (Jclark-ctr) Updated the Cable IDs for cables run on ssw1-d1-eqiad in NetBox. @VRiley-WMF, please update the ssw1-d8-eqiad Cable IDs when you have a chance. 
[11:15:58] (CR) Ladsgroup: [C:+1] es2050.yaml, site.pp: Prepare es2050 [puppet] - https://gerrit.wikimedia.org/r/1189131 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [11:16:42] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS bookworm [11:18:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1167 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83400 and previous config saved to /var/cache/conftool/dbconfig/20250917-111858-ladsgroup.json [11:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:19:04] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [11:20:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db2152 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83401 and previous config saved to /var/cache/conftool/dbconfig/20250917-112010-ladsgroup.json [11:20:51] (PS1) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) [11:26:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:26:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:30:13] (PS7) Stevemunene: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) [11:32:45] jouncebot: nowandnext [11:32:45] For the next 0 hour(s) and 
27 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100) [11:32:45] In 0 hour(s) and 27 minute(s): Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1200) [11:33:26] SRE, collaboration-services, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11189010 (ABran-WMF) cache is around 100MB and the UI is slowing down again [11:33:55] (PS1) Dreamy Jazz: SI: Load ext.checkUser.styles on Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) [11:34:05] (CR) Dreamy Jazz: [C:+2] SI: Load ext.checkUser.styles on Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) (owner: Dreamy Jazz) [11:34:22] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) (owner: Dreamy Jazz) [11:36:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:02] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:39:29] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:45:33] (Merged) jenkins-bot: SI: Load ext.checkUser.styles on Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189136 (https://phabricator.wikimedia.org/T404712) (owner: Dreamy Jazz) [11:46:00] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189136|SI: Load ext.checkUser.styles on Special:SuggestedInvestigations (T404712)]] [11:46:06] T404712: Suggested investigations: Subtitle links are not rendered correctly when UserInfoCard / IP reveal is not enabled - https://phabricator.wikimedia.org/T404712 [11:51:39] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189136|SI: Load ext.checkUser.styles on Special:SuggestedInvestigations (T404712)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:51:44] T404712: Suggested investigations: Subtitle links are not rendered correctly when UserInfoCard / IP reveal is not enabled - https://phabricator.wikimedia.org/T404712 [11:52:23] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [11:52:26] (PS3) Klausman: team-ml: Add alert for outdated admin_ng config [alerts] - https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) [11:54:12] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2049 slowly with 10 steps - Pooling in new host [11:55:03] (CR) Klausman: [V:+2 C:+2] team-ml: Add alert for outdated admin_ng config [alerts] - https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) (owner: Klausman) [11:56:40] (Merged) jenkins-bot: team-ml: Add alert for outdated admin_ng config [alerts] - https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) (owner: Klausman) [11:57:30] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189136|SI: Load ext.checkUser.styles on Special:SuggestedInvestigations (T404712)]] (duration: 11m 29s) [11:57:34] T404712: Suggested investigations: Subtitle links are not rendered correctly when UserInfoCard / IP reveal is not enabled - https://phabricator.wikimedia.org/T404712 [11:57:39] (CR) Muehlenhoff: [C:+2] Assign failoid role to failoid2003 [puppet] - https://gerrit.wikimedia.org/r/1183100 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff) [11:58:12] (PS1) Dreamy Jazz: Use correct DB domain in SuggestedInvestigationsCaseLookupService [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189141 (https://phabricator.wikimedia.org/T404846) [11:58:15] jouncebot: nowandnext [11:58:15] For the next 0 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1100) [11:58:15] In 0 hour(s) and 1 minute(s): 
Create new table for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1200) [11:58:46] daimona: Do you mind if I am deploying during your window? [11:59:03] (PS1) Slyngshede: P:idp remove NDA group access from Netbox [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) [11:59:11] No problem, I'm probably just going to take a couple minutes to create the table [11:59:16] Thanks [11:59:42] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189141 (https://phabricator.wikimedia.org/T404846) (owner: Dreamy Jazz) [12:00:05] Daimona: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Create new table for the CampaignEvents extension deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1200). [12:01:50] (PS3) Slyngshede: Permissions: Prevent duplicate permission requests [software/bitu] - https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) [12:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:03:52] !log Creating new tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T400719 [12:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:57] T400719: Create database structure to store edit-to-event associations - https://phabricator.wikimedia.org/T400719 [12:04:18] (PS1) Muehlenhoff: Move my non-FIDO SSH key to buster_ssh_keys [puppet] - https://gerrit.wikimedia.org/r/1189145 [12:05:02] 
(CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: Slyngshede) [12:05:08] (CR) Federico Ceratto: [C:+2] es2050.yaml, site.pp: Prepare es2050 [puppet] - https://gerrit.wikimedia.org/r/1189131 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [12:06:07] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1189142 (https://phabricator.wikimedia.org/T404494) (owner: Slyngshede) [12:09:58] Done with my DB stuff. [12:10:54] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1188875 (owner: PipelineBot) [12:11:11] (Merged) jenkins-bot: Use correct DB domain in SuggestedInvestigationsCaseLookupService [extensions/CheckUser] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189141 (https://phabricator.wikimedia.org/T404846) (owner: Dreamy Jazz) [12:11:35] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1189141|Use correct DB domain in SuggestedInvestigationsCaseLookupService (T404846)]] [12:11:40] T404846: Suggested Investigations: SuggestedInvestigationsCaseLookupService uses the wrong database connection - https://phabricator.wikimedia.org/T404846 [12:12:38] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1188875 (owner: PipelineBot) [12:12:50] (PS2) Muehlenhoff: Failover failoid in codfw [puppet] - https://gerrit.wikimedia.org/r/1183101 (https://phabricator.wikimedia.org/T402406) [12:14:56] (PS1) Federico Ceratto: instances.yaml: add es2050 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1189148 (https://phabricator.wikimedia.org/T402859) [12:15:55] (CR) Brouberol: [C:+1] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1187806 
(https://phabricator.wikimedia.org/T404433) (owner: Stevemunene) [12:17:22] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1189141|Use correct DB domain in SuggestedInvestigationsCaseLookupService (T404846)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:17:28] T404846: Suggested Investigations: SuggestedInvestigationsCaseLookupService uses the wrong database connection - https://phabricator.wikimedia.org/T404846 [12:18:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:20:02] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:25:13] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189141|Use correct DB domain in SuggestedInvestigationsCaseLookupService (T404846)]] (duration: 13m 37s) [12:25:18] T404846: Suggested Investigations: SuggestedInvestigationsCaseLookupService uses the wrong database connection - https://phabricator.wikimedia.org/T404846 [12:27:29] (PS1) Muehlenhoff: Remove obsolete configuration options from SSH type [puppet] - https://gerrit.wikimedia.org/r/1189149 [12:28:27] (PS2) Muehlenhoff: Remove obsolete configuration options from SSH type [puppet] - https://gerrit.wikimedia.org/r/1189149 [12:30:17] FIRING: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:30:32] (CR) Slyngshede: Permissions: Prevent duplicate permission requests (1 comment) 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [12:30:41] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for es2050.codfw.wmnet [12:30:43] (03CR) 10Slyngshede: [C:03+2] Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [12:31:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11189225 (10MoritzMuehlenhoff) [12:33:13] (03Merged) 10jenkins-bot: Permissions: Prevent duplicate permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1188719 (https://phabricator.wikimedia.org/T403691) (owner: 10Slyngshede) [12:33:50] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1189145 (owner: 10Muehlenhoff) [12:34:01] fceratto@cumin1002 upgrade (PID 2721453) is awaiting input [12:35:13] (03CR) 10Brouberol: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:35:17] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:31] (03CR) 10A smart kitten: "@phuedx@wikimedia.org are we okay to create a revert of this revert, to hopefully be deployed at some point in the future?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188281 (owner: 10Phuedx) [12:38:04] (03CR) 10Brouberol: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:38:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11189253 (10MoritzMuehlenhoff) [12:38:38] (03CR) 10Muehlenhoff: [C:03+2] Move my non-FIDO SSH key to buster_ssh_keys [puppet] - 10https://gerrit.wikimedia.org/r/1189145 (owner: 10Muehlenhoff) [12:38:45] (03CR) 10Brouberol: dse-k8s:Enable CSI and the Ceph CSI plugin on dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188754 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:41:02] (03CR) 10Brouberol: Add a dummy Ceph user keys for the cephcsi plugin to use (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: 10Stevemunene) [12:45:54] fceratto@cumin1002 upgrade (PID 2721453) is awaiting input [12:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:49:13] (03CR) 10David Caro: [C:03+1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [12:49:31] (03CR) 10Stevemunene: [C:03+2] dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [12:50:02] (03CR) 10David Caro: [C:03+1] ceph: Drop buster support (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [12:51:12] (03CR) 10Filippo Giunchedi: [C:03+1] Remove obsolete configuration options from SSH type [puppet] - 10https://gerrit.wikimedia.org/r/1189149 (owner: 10Muehlenhoff) [12:51:30] (03CR) 10Cathal Mooney: [C:03+2] Cephosd: revert to manually setting up peering IPs [puppet] - 10https://gerrit.wikimedia.org/r/1189119 (owner: 10Cathal Mooney) [12:52:35] PROBLEM - statsv Varnishkafka log producer on cp7007 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:53:35] RECOVERY - statsv Varnishkafka log producer on cp7007 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:54:08] (03PS1) 10CDanis: admin: cdanis ssh: clean up old keys + add new fido2 key [puppet] - 10https://gerrit.wikimedia.org/r/1189153 [12:54:39] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:59:29] (03CR) 10Arnaudb: [C:03+1] "I've added a question inline, otherwise: looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:15] o/ [13:00:23] yup, I don’t see anything in the calendar either [13:00:46] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178858 (owner: 10PipelineBot) [13:00:49] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179669 (owner: 10PipelineBot) [13:00:52] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1184706 (owner: 10PipelineBot) [13:00:55] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1185992 (owner: 10PipelineBot) [13:01:00] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187854 (owner: 10PipelineBot) [13:01:24] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2050.codfw.wmnet [13:01:43] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2027.codfw.wmnet onto es2050.codfw.wmnet [13:01:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002 [13:03:02] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002 [13:03:05] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-09-09-171717 to 2025-09-16-190551 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189155 (https://phabricator.wikimedia.org/T399323) [13:03:07] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-08-191243 to 2025-09-16-134119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189156 (https://phabricator.wikimedia.org/T397956) [13:03:21] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it 
to es2050.codfw.wmnet - fceratto@cumin1002 [13:03:30] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002 [13:03:58] (03PS2) 10Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - 10https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) [13:04:01] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone_es (exit_code=99) of es2027.codfw.wmnet onto es2050.codfw.wmnet [13:04:23] (03PS3) 10Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 [13:04:46] (03PS2) 10Jelto: ceph: add module to sync a bucket locally [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) [13:06:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:06:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:08:14] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6972/co" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:08:41] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:09:29] RECOVERY - BFD status on 
lsw1-c2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:10:14] (03PS3) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 [13:10:17] (03CR) 10Jelto: [V:03+1] ceph: add module to sync a bucket locally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:10:40] (03Merged) 10jenkins-bot: dse-k8s: Define helmfiles for echoserver in dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187806 (https://phabricator.wikimedia.org/T404433) (owner: 10Stevemunene) [13:10:43] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:10:56] (03CR) 10CI reject: [V:04-1] ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [13:11:34] (03PS4) 10Majavah: ceph: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188826 [13:12:41] (03CR) 10Majavah: [C:03+2] ceph: Drop buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188826 (owner: 10Majavah) [13:13:43] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:13:43] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1232 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83407 and previous config saved to /var/cache/conftool/dbconfig/20250917-131718-ladsgroup.json [13:17:23] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [13:22:03] RECOVERY - BFD status on lsw1-d2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:23:45] RECOVERY - BFD status on lsw1-a7-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:27:59] (03CR) 10Arnaudb: [C:03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1189120 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:29:37] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool for cloning [13:29:48] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2027 - Depool for cloning [13:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:34:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [13:34:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es2027 T402859', diff saved to https://phabricator.wikimedia.org/P83408 and previous config saved to /var/cache/conftool/dbconfig/20250917-133454-fceratto.json [13:34:59] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [13:35:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2027.codfw.wmnet onto es2050.codfw.wmnet [13:35:22] !log 
fceratto@cumin1002 START - Cookbook sre.mysql.depool es2027 - Depool es2027.codfw.wmnet to then clone it to es2050.codfw.wmnet - fceratto@cumin1002 [13:36:55] (03PS1) 10Muehlenhoff: Apply installserver role to install1005 [puppet] - 10https://gerrit.wikimedia.org/r/1189169 (https://phabricator.wikimedia.org/T396487) [13:36:57] (03PS1) 10Muehlenhoff: Update DHCP server in eqiad to install1005 [puppet] - 10https://gerrit.wikimedia.org/r/1189170 (https://phabricator.wikimedia.org/T396487) [13:36:59] (03PS1) 10Muehlenhoff: Update the proxies used by cloudcumin to install1005 [puppet] - 10https://gerrit.wikimedia.org/r/1189171 (https://phabricator.wikimedia.org/T396487) [13:37:42] (03PS1) 10Muehlenhoff: Point webproxy in eqiad to install1005 [dns] - 10https://gerrit.wikimedia.org/r/1189173 (https://phabricator.wikimedia.org/T396487) [13:38:04] (03PS2) 10Jforrester: Enable Wikifunctions client mode on Wiktionaries, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172047 (https://phabricator.wikimedia.org/T397401) [13:38:25] fceratto@cumin1002 clone_es (PID 2838644) is awaiting input [13:38:27] (03CR) 10CI reject: [V:04-1] Enable Wikifunctions client mode on Wiktionaries, Part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172047 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [13:39:35] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11189472 (10elukey) I retried again, "No media present" :( [13:40:03] (03PS2) 10Majavah: P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 [13:40:07] elukey@cumin1003 reimage (PID 2969566) is awaiting input [13:40:32] (03CR) 10CI reject: [V:04-1] P:wmcs::metricsinfra: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188828 (owner: 10Majavah) [15:27:42] (03Merged) 10jenkins-bot: switchdc/mediawiki: remove references to mw-maint hosts 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1187997 (https://phabricator.wikimedia.org/T404538) (owner: 10Jasmine) [15:31:19] (03Merged) 10jenkins-bot: InstallPreConfigured: Allow subclasses to skip tasks [core] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1189199 (owner: 10Zabe) [15:34:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:57] (03Merged) 10jenkins-bot: InstallPreConfigured: Allow subclasses to skip tasks [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189197 (owner: 10Zabe) [15:35:00] (03Merged) 10jenkins-bot: addWiki: Stop populating the interwiki table on new wikis [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189196 (owner: 10Zabe) [15:35:01] (03Merged) 10jenkins-bot: addWiki: Stop populating the interwiki table on new wikis [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1189198 (owner: 10Zabe) [15:35:44] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1189198|addWiki: Stop populating the interwiki table on new wikis]], [[gerrit:1189196|addWiki: Stop populating the interwiki table on new wikis]], [[gerrit:1189199|InstallPreConfigured: Allow subclasses to skip tasks]], [[gerrit:1189197|InstallPreConfigured: Allow subclasses to skip tasks]] [15:41:22] !log zabe@deploy1003 zabe: Backport for [[gerrit:1189198|addWiki: Stop populating the interwiki table on new wikis]], [[gerrit:1189196|addWiki: Stop populating the interwiki table on new wikis]], [[gerrit:1189199|InstallPreConfigured: Allow subclasses to skip tasks]], [[gerrit:1189197|InstallPreConfigured: Allow subclasses to skip tasks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes [15:41:22] can now be verified there. [15:41:52] !log zabe@deploy1003 zabe: Continuing with sync [15:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:46:48] (03PS3) 10Btullis: Absent the resources in statistics::product_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) (owner: 10Bearloga) [15:46:48] (03PS2) 10Btullis: Remove the manifests for the absented product_analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/1189204 (https://phabricator.wikimedia.org/T404639) [15:47:06] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189198|addWiki: Stop populating the interwiki table on new wikis]], [[gerrit:1189196|addWiki: Stop populating the interwiki table on new wikis]], [[gerrit:1189199|InstallPreConfigured: Allow subclasses to skip tasks]], [[gerrit:1189197|InstallPreConfigured: Allow subclasses to skip tasks]] (duration: 11m 22s) [15:51:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:54:26] (03CR) 10JMeybohm: [C:03+1] "I must admit that I did not try to run the script, but from reading it it seems awesome and maybe even a tiny bit better then deploy_all.s" [puppet] - 10https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211) (owner: 10RLazarus) [15:55:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:00:04] jasmine_: gettimeofday() says it's time for [DC Switchover Live-test]. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1600) [16:00:48] 10ops-eqiad, 06SRE, 06DC-Ops: Rack C2 Network Relocation Tracking - https://phabricator.wikimedia.org/T404872#11190456 (10RobH) [16:00:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11190457 (10RobH) [16:01:12] jouncebot: nowandnext [16:01:12] For the next 1 hour(s) and 58 minute(s): [DC Switchover Live-test] (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1600) [16:01:12] In 1 hour(s) and 58 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1800) [16:01:17] 10ops-eqiad, 06SRE, 06DC-Ops: Rack C2 Network Relocation Tracking - https://phabricator.wikimedia.org/T404872#11190462 (10RobH) 05Open→03Invalid Going to try out this migration on a per sub-team task basis rather than per rack. this task is unlinked and now invalid. 
[16:03:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11190471 (10RobH) [16:07:47] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:10:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11190545 (10elukey) [16:10:08] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11190548 (10elukey) [16:10:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11190551 (10RobH) [16:10:40] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11190552 (10elukey) [16:13:46] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:22:07] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) (owner: 10Bearloga) [16:39:07] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on cawiki, hewiki, itwiki (group1) [puppet] - 10https://gerrit.wikimedia.org/r/1185116 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas 
anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:49:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:54:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:54:32] (03CR) 10CDanis: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1180712 (https://phabricator.wikimedia.org/T398161) (owner: 10Giuseppe Lavagetto) [16:54:45] 06SRE, 10envoy, 06serviceops, 10Data-Platform-SRE (2025.09.05 - 2025.09.26): Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11190745 (10bking) [17:00:59] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [17:06:42] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for hueitan - https://phabricator.wikimedia.org/T404681#11190780 (10colewhite) Hi @hueitan! We need approval from your manager on this ticket to move this forward. [17:07:38] (03PS1) 10Cwhite: admin: add huei tan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1189257 (https://phabricator.wikimedia.org/T404681) [17:11:17] (03CR) 10BCornwall: [C:03+1] Point webproxy in eqiad to install1005 [dns] - 10https://gerrit.wikimedia.org/r/1189173 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [17:12:55] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11190794 (10colewhite) [17:15:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11190797 (10colewhite) Requesting approval from @thcipriani. Does everything look in order to you? 
[17:18:12] (03PS1) 10Cwhite: admin: add mszwarc to deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1189259 (https://phabricator.wikimedia.org/T404697) [17:21:09] 06SRE, 06collaboration-services, 10envoy, 06serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11190821 (10Dzahn) [17:23:15] !log upgrading envoyproxy on releases* hosts T403663 [17:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:20] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [17:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:24:26] !log upgrading envoyproxy on doc* hosts T403663 [17:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:45] (03PS1) 10TChin: [eventgate-*] Bump to v1.22.0 (service-utils) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189260 (https://phabricator.wikimedia.org/T403169) [17:24:59] (03CR) 10Scott French: [C:03+2] shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:26:38] (03Merged) 10jenkins-bot: shellbox-video: upgrade to PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1188819 (https://phabricator.wikimedia.org/T403284) (owner: 10Scott French) [17:29:00] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:29:08] !log upgrading envoyproxy on zuul* hosts T403663 [17:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:12] T403663: Upgrade Envoy to v1.29.12 - 
https://phabricator.wikimedia.org/T403663 [17:29:22] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:32:39] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:33:19] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:33:22] (03PS2) 10Esanders: Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) [17:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:35:57] !log upgrading envoyproxy on planet* and people* hosts T403663 [17:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:02] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [17:36:23] (03PS1) 10Andrew Bogott: Horizon: remove requirement for openstack_auth.plugin.wmtotp.WmtotpPlugin [puppet] - 10https://gerrit.wikimedia.org/r/1189262 [17:36:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:37:21] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:37:41] !log migrated shellbox-video to PHP 8.3 - T403284 [17:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:45] T403284: Migrate production Shellbox services to PHP 8.3 - https://phabricator.wikimedia.org/T403284 [17:39:02] (03PS1) 10Andrew Bogott: Update codfw1dev horizon release, again [puppet] - 10https://gerrit.wikimedia.org/r/1189265 [17:39:28] (03CR) 10Andrew 
Bogott: [C:03+2] Horizon: remove requirement for openstack_auth.plugin.wmtotp.WmtotpPlugin [puppet] - 10https://gerrit.wikimedia.org/r/1189262 (owner: 10Andrew Bogott) [17:39:43] (03CR) 10Andrew Bogott: [C:03+2] Update codfw1dev horizon release, again [puppet] - 10https://gerrit.wikimedia.org/r/1189265 (owner: 10Andrew Bogott) [17:40:54] (03CR) 10Cwhite: [C:03+2] "PCC NOOP: https://puppet-compiler.wmflabs.org/output/1188900/6976/" [puppet] - 10https://gerrit.wikimedia.org/r/1188900 (owner: 10Cwhite) [17:45:49] (03CR) 10Jforrester: "Yes, I was just busy with security things last week when I planned to deploy. If you have a spare half hour, please feel free!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129894 (https://phabricator.wikimedia.org/T362317) (owner: 10Jforrester) [17:45:56] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1236 [17:46:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1236 [17:49:07] (03PS1) 10Cmelo: Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) [17:49:54] (03PS4) 10Btullis: Absent the resources in statistics::product_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1188437 (https://phabricator.wikimedia.org/T404639) (owner: 10Bearloga) [17:49:54] (03PS3) 10Btullis: Remove the manifests for the absented product_analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/1189204 (https://phabricator.wikimedia.org/T404639) [17:50:34] (03CR) 10CI reject: [V:04-1] Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) (owner: 10Cmelo) [17:51:03] jouncebot nowandnext [17:51:03] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [17:51:03] In 0 hour(s) and 8 
minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1800) [17:51:58] (03PS2) 10Cmelo: Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) [17:52:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) (owner: 10Esanders) [17:53:07] (03CR) 10Kimberly Sarabia: [C:03+1] ReaderExperiments' ImageBrowsing stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187413 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [17:59:27] (03PS3) 10Cmelo: Release CampaignEvents extension to Wikimedia Commons - Sept 18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) [17:59:32] (03PS1) 10Phuedx: EventStreamConfig: Enable experiment enrollment hoisting for MinT for Wiki Readers stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189269 [18:00:05] jeena and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1800). 
[18:00:34] o/ [18:00:40] o/ [18:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:05:19] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189270 (https://phabricator.wikimedia.org/T396380) [18:05:21] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189270 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [18:06:11] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189270 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [18:07:52] !log upgrading envoyproxy on etherpad* and stewards* hosts T403663 [18:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:57] T403663: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663 [18:08:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189267 (https://phabricator.wikimedia.org/T403667) (owner: 10Cmelo) [18:11:52] (03PS1) 10Andrew Bogott: codfw1dev horizon: update docker version [puppet] - 10https://gerrit.wikimedia.org/r/1189278 [18:12:44] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev horizon: update docker version [puppet] - 10https://gerrit.wikimedia.org/r/1189278 (owner: 10Andrew Bogott) [18:15:42] (03PS1) 10Btullis: Fix the webhook TLS configuration for the spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189279 (https://phabricator.wikimedia.org/T318712) 
[18:17:06] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.19 refs T396380 [18:17:10] T396380: 1.45.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T396380 [18:17:23] (03PS2) 10Joal: Add resource preemption to Hadoop Yarn scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) [18:17:50] (03CR) 10CI reject: [V:04-1] Add resource preemption to Hadoop Yarn scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) (owner: 10Joal) [18:19:35] (03PS3) 10Joal: Add resource preemption to Hadoop Yarn scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) [18:21:41] (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1188901/6977/" [puppet] - 10https://gerrit.wikimedia.org/r/1188901 (owner: 10Cwhite) [18:22:51] (03CR) 10Joal: "I finally got it right 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1189221 (https://phabricator.wikimedia.org/T404871) (owner: 10Joal) [18:25:35] PROBLEM - Host ms-be2088 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:08] jeena: filed https://phabricator.wikimedia.org/T404902 [18:26:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189269 (owner: 10Phuedx) [18:26:56] thanks dduvall [18:27:11] ms-be2008 is my fault [18:27:26] accidentally powered off the wrong host, should be coming back up now [18:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:28] (03CR) 10Ottomata: [C:03+1] [eventgate-*] Bump to 
v1.22.0 (service-utils) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189260 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [18:29:39] RECOVERY - Host ms-be2088 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [18:36:09] 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11191128 (10Ottomata) [18:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:59] jouncebot: nowandnext [18:36:59] For the next 1 hour(s) and 23 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T1800) [18:36:59] In 1 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T2000) [18:37:37] jeena dduvall could you please let me know when the train deploy is done, so I could do a backport to wmf.18 and wmf.19? 
[18:38:44] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:40:41] (03CR) 10Bking: [C:03+1] Fix the webhook TLS configuration for the spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189279 (https://phabricator.wikimedia.org/T318712) (owner: 10Btullis) [18:41:14] kostajh: deploy is finished [18:42:02] (03PS2) 10Kosta Harlan: hCaptcha: Log open events to Prometheus [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189194 (https://phabricator.wikimedia.org/T402767) [18:42:13] (03PS2) 10Kosta Harlan: hCaptcha: Log open events to Prometheus [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1189191 (https://phabricator.wikimedia.org/T402767) [18:42:25] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189194 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [18:42:25] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1189191 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [18:43:38] (03PS1) 10Eric Gardner: Add ReaderExperiments extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) [18:45:31] (03CR) 10Eric Gardner: [C:04-2] "Not to be merged until the ReaderExperiments repo has been included in branching for 2 consecutive weeks. 
Assuming no trains get derailed," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [18:48:44] RESOLVED: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:53:17] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:54:00] jeena: thanks! [18:54:19] yw! [18:55:26] (03Merged) 10jenkins-bot: hCaptcha: Log open events to Prometheus [extensions/ConfirmEdit] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1189194 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [18:55:27] (03Merged) 10jenkins-bot: hCaptcha: Log open events to Prometheus [extensions/ConfirmEdit] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1189191 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [18:55:57] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1189194|hCaptcha: Log open events to Prometheus (T402767)]], [[gerrit:1189191|hCaptcha: Log open events to Prometheus (T402767)]] [18:56:02] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [18:57:50] (03PS1) 10Eric Gardner: Deploy ReaderExperiments to Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) [18:58:46] (03CR) 10CI reject: [V:04-1] Deploy ReaderExperiments to Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [19:01:12] (03PS2) 10Eric Gardner: Deploy ReaderExperiments to Beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) [19:01:33] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1189194|hCaptcha: Log open events to Prometheus (T402767)]], [[gerrit:1189191|hCaptcha: Log open events to Prometheus (T402767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:01:37] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [19:02:15] !log kharlan@deploy1003 kharlan: Continuing with sync [19:04:14] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:07:25] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189194|hCaptcha: Log open events to Prometheus (T402767)]], [[gerrit:1189191|hCaptcha: Log open events to Prometheus (T402767)]] (duration: 11m 28s) [19:07:30] T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767 [19:09:17] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Fri 03 Oct 2025 07:09:00 PM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [19:13:52] (03PS1) 10Eric Gardner: Enable ReaderExperiments on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) [19:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:19:38] (03PS1) 10Eric Gardner: Load ReaderExperiments extension in CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) [19:38:41] (03PS1) 10Andrew Bogott: codfw1dev: horizon package bump again [puppet] - 10https://gerrit.wikimedia.org/r/1189301 [19:40:19] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: horizon package bump again [puppet] - 10https://gerrit.wikimedia.org/r/1189301 (owner: 10Andrew Bogott) [19:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:53:36] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for ericmill - https://phabricator.wikimedia.org/T404903#11191435 (10Aklapper) @EMill-WMF: Hi, which exact docs are you following which made you create this task? Thanks :) [19:55:33] (03CR) 10RLazarus: [C:03+2] "Thanks! 
I test-drove it for a round of Envoy upgrades, and found some UX improvements I'd like to make, but it Works Good™ so I'll merge a" [puppet] - 10https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211) (owner: 10RLazarus) [19:55:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:57:19] (03CR) 10TChin: [C:03+2] [eventgate-*] Bump to v1.22.0 (service-utils) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189260 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [19:58:19] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:59:17] FIRING: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:17] (03Merged) 10jenkins-bot: [eventgate-*] Bump to v1.22.0 (service-utils) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189260 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T2000). [20:00:05] edsanders and phuedx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] o/ [20:00:11] o/ [20:00:56] phuedx: i can take care of deploying your change if you like [20:01:07] cjming: Thanks <3 [20:01:09] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bookworm [20:01:41] I can deploy my own [20:01:53] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: PDU IP added to eqiad - vriley@cumin1003" [20:01:56] cool [20:02:16] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: PDU IP added to eqiad - vriley@cumin1003" [20:02:16] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) (owner: 10Esanders) [20:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:03:48] (03Merged) 10jenkins-bot: Enable Flow in read-only mode on wikis using LiquidThreads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188759 (https://phabricator.wikimedia.org/T404687) (owner: 10Esanders) [20:04:13] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1188759|Enable Flow in read-only mode on wikis using LiquidThreads (T404687)]] [20:04:17] RESOLVED: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:18] T404687: Deploy Flow (read-only) to wikis still using LQT - https://phabricator.wikimedia.org/T404687 [20:07:16] (03CR) 10Andrew Bogott: [C:03+1] backy2: Drop buster support [puppet] - 10https://gerrit.wikimedia.org/r/1188825 (owner: 10Majavah) [20:08:21] !log esanders@deploy1003 esanders: Backport for [[gerrit:1188759|Enable Flow in read-only mode on wikis using LiquidThreads (T404687)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:10:09] !log esanders@deploy1003 esanders: Continuing with sync [20:12:00] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1021.eqiad.wmnet with OS bookworm [20:12:27] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bookworm [20:15:23] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188759|Enable Flow in read-only mode on wikis using LiquidThreads (T404687)]] (duration: 11m 10s) [20:15:28] T404687: Deploy Flow (read-only) to wikis still using LQT - https://phabricator.wikimedia.org/T404687 [20:15:56] i'll proceed with phuedx's patch [20:16:17] (03PS2) 10Phuedx: EventStreamConfig: Enable experiment enrollment hoisting for MinT for Wiki Readers stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189269 [20:17:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189269 (owner: 10Phuedx) [20:18:31] (03Merged) 10jenkins-bot: EventStreamConfig: Enable experiment enrollment hoisting for MinT for Wiki Readers stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189269 (owner: 10Phuedx) [20:18:57] !log cjming@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1189269|EventStreamConfig: Enable experiment enrollment hoisting for MinT for Wiki Readers stream]] [20:19:54] (03PS1) 10Bking: opensearch-operator: fix pod security settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189320 (https://phabricator.wikimedia.org/T362978) [20:24:56] !log cjming@deploy1003 cjming, phuedx: Backport for [[gerrit:1189269|EventStreamConfig: Enable experiment enrollment hoisting for MinT for Wiki Readers stream]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:25:20] !log cjming@deploy1003 cjming, phuedx: Continuing with sync [20:27:32] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:30:14] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:30:34] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189269|EventStreamConfig: Enable experiment enrollment hoisting for MinT for Wiki Readers stream]] (duration: 11m 37s) [20:35:34] !log end of UTC late backport window [20:35:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11191571 (10thcipriani) Approved! @mszwarc please ensure you're familiar with [[https://wikitech.wikimedia.org/wiki/Backport_windows#Backport_Team_members'_roles,_respons... [20:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:07] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11191594 (10BCornwall) 05Open→03Resolved @Alchimista Thanks for getting back to me! Sorry for the delay, I was enjoying the one-two punch of being out and then... 
[20:49:41] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [20:49:48] 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11191638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [20:49:51] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [20:49:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11191639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [20:54:25] (03CR) 10JHathaway: [C:03+1] Remove obsolete configuration options from SSH type [puppet] - 10https://gerrit.wikimedia.org/r/1189149 (owner: 10Muehlenhoff) [20:56:04] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [20:58:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:58:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T2100) [21:09:46] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11191677 (10BCornwall) [21:14:03] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bookworm [21:19:29] 10ops-codfw, 06SRE, 06DC-Ops: 
Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11191729 (10Jhancock.wm) [28/50, retrying in 84.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title... [21:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:19:55] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11191736 (10Jhancock.wm) didn't pxe. will check nic in the morning. [21:20:12] 06SRE, 10Beta-Cluster-Infrastructure, 06Data-Persistence, 06Traffic: ATS isn't caching documents in deployment-cache-upload07 - https://phabricator.wikimedia.org/T322575#11191738 (10bd808) 05Open→03Declined deployment-cache-text07 was replaced by deployment-cache-text08. `lang=shell-session,lines=1... 
[21:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:24:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:26:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11191785 (10VRiley-WMF) Worked with @Jhancock.wm to assist with E10 E11 E12 for consol ports. E13 and E14 currently not powered up. Will need to run the green network cables for these PDUs [21:31:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:34:44] (03PS11) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [21:36:38] 
RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.874 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.930 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:35] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:40:15] (03PS2) 10Aaron Schulz: [DNM] Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) [21:40:40] (03CR) 10CI reject: [V:04-1] [DNM] Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [21:40:50] (03PS3) 10Aaron Schulz: [DNM] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) [21:45:38] PROBLEM - mysqld processes on es2027 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:46:02] PROBLEM - MariaDB read only es3 on es2027 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [21:51:32] 06SRE, 10Beta-Cluster-Infrastructure, 06Infrastructure-Foundations, 10Puppet-Core, and 2 others: Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#11191921 (10bd808) 05Open→03Declined Making the custom `role` function used in production w... 
[21:51:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:51:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:55:48] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bookworm [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250917T2200) [22:00:36] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker2003.codfw.wmnet with OS bookworm [22:00:48] 10ops-codfw, 06SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11191957 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host dse-k8s-worker2003.codfw.wmnet with OS bookworm executed with errors: - dse-k8s-worker... [22:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:06:46] 06SRE, 10Beta-Cluster-Infrastructure, 06Infrastructure-Foundations, 10Puppet-Core, and 2 others: Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034#11191978 (10Dzahn) I believe the following 2 steps should result in the same thing we would get... 
[22:10:08] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [22:10:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11191983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm executed with errors: -... [22:11:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.228 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.352 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:18:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11192024 (10colewhite) [22:19:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1192 and db2166 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83413 and previous config saved to /var/cache/conftool/dbconfig/20250917-221924-ladsgroup.json [22:19:29] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966 [22:19:30] (03CR) 10Cwhite: [C:03+2] admin: add mszwarc to deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1189259 (https://phabricator.wikimedia.org/T404697) (owner: 10Cwhite) [22:21:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Remove db1209 and db2163 from api group (T403966)', diff saved to https://phabricator.wikimedia.org/P83414 and previous config saved to /var/cache/conftool/dbconfig/20250917-222107-ladsgroup.json [22:21:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for mszwarc - https://phabricator.wikimedia.org/T404697#11192035 (10colewhite) 05Open→03Resolved 
p:05Triage→03Medium a:03colewhite The group membership change has been deployed. Please feel free to reopen if you... [22:21:46] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1021.eqiad.wmnet with reason: host reimage [22:22:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Bump weight of db1167 in general group (T403966)', diff saved to https://phabricator.wikimedia.org/P83415 and previous config saved to /var/cache/conftool/dbconfig/20250917-222207-ladsgroup.json [22:22:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:24:45] (03PS1) 10Andrew Bogott: Cloudcephosd1021 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189370 [22:25:32] (03CR) 10Andrew Bogott: [C:03+2] Cloudcephosd1021 -> bookworm/reef [puppet] - 10https://gerrit.wikimedia.org/r/1189370 (owner: 10Andrew Bogott) [22:28:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1021.eqiad.wmnet with reason: host reimage [22:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1185120 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [22:32:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:33:46] (Merged) jenkins-bot: Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1185120 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle) [22:34:10] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1185120|Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) (T403510)]] [22:34:14] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [22:36:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:37:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:40:46] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1185120|Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [22:40:51] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [22:41:33] !log krinkle@deploy1003 krinkle: Continuing with sync [22:42:31] SRE: Update Wikitech "Search Console Data" doc to align with current ITS-first request process - https://phabricator.wikimedia.org/T404927 (NBaca-WMF) NEW [22:43:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1021.eqiad.wmnet with OS bookworm [22:43:17] andrew@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [22:43:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:21] SRE, Beta-Cluster-Infrastructure, Traffic: Rename deployment-cache-(text|upload)0x to deployment-cp0x - https://phabricator.wikimedia.org/T280393#11192124 (bd808) >>! In T280393#7163610, @taavi wrote: > One more issue: given cloud vps does not have per-role hiera keys, we need to rely on instance pre...
[22:46:49] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1185120|Disable wmgUseMdotRouting on cawiki, hewiki, itwiki (group1) (T403510)]] (duration: 12m 39s) [22:46:53] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510 [22:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:49:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:49:34] SRE, Beta-Cluster-Infrastructure, Traffic, Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825#11192132 (bd808) [22:51:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:54:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:58:47] SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize infrastructure differences between
Beta Cluster and production - https://phabricator.wikimedia.org/T87220#11192146 (bd808) Open→Declined Yuvi had a very strong opinion here, but `*All* other differences h... [23:04:26] (PS1) DLynch: Paste check: log when a paste check would have been shown if enabled [extensions/VisualEditor] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189374 (https://phabricator.wikimedia.org/T402460) [23:05:34] SRE: Update Wikitech "Search Console Data" doc to align with current ITS-first request process - https://phabricator.wikimedia.org/T404927#11192154 (NBaca-WMF) I'd recommend as part of these changes that we deprecate or repurpose #search-console-access-request and either add a new tag or otherwise use the ge... [23:05:42] (PS6) Dzahn: zuul: add systemd service for nodepool [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [23:05:55] (CR) CI reject: [V:-1] zuul: add systemd service for nodepool [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn) [23:07:15] (PS7) Dzahn: zuul: add systemd service for nodepool [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [23:08:17] (PS8) Dzahn: zuul: add systemd service for nodepool [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [23:08:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:11:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater -
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:12:51] SRE, collaboration-services, envoy, serviceops: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11192161 (Dzahn) Open→In progress [23:13:36] (CR) Dzahn: [V:-1 C:-1] "Duplicate declaration: Service[docker] is already declared" [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn) [23:14:50] (PS9) Dzahn: zuul: add systemd service for nodepool [puppet] - https://gerrit.wikimedia.org/r/1180187 (https://phabricator.wikimedia.org/T401614) [23:18:00] My patch's tests are done, and there seem to be no objections, so I will go ahead. [23:19:00] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:22:42] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189374 (https://phabricator.wikimedia.org/T402460) (owner: DLynch) [23:24:18] (Merged) jenkins-bot: Paste check: log when a paste check would have been shown if enabled [extensions/VisualEditor] (wmf/1.45.0-wmf.19) - https://gerrit.wikimedia.org/r/1189374 (https://phabricator.wikimedia.org/T402460) (owner: DLynch) [23:24:48] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1189374|Paste check: log when a paste check would have been shown if enabled (T402460)]] [23:24:53] T402460: Add instrumentation to VEFU to detect when content is pasted within an edit session - https://phabricator.wikimedia.org/T402460 [23:31:09] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1189374|Paste check: log when a paste
check would have been shown if enabled (T402460)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:31:13] T402460: Add instrumentation to VEFU to detect when content is pasted within an edit session - https://phabricator.wikimedia.org/T402460 [23:32:07] !log kemayo@deploy1003 kemayo: Continuing with sync [23:37:26] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1189374|Paste check: log when a paste check would have been shown if enabled (T402460)]] (duration: 12m 38s) [23:37:31] T402460: Add instrumentation to VEFU to detect when content is pasted within an edit session - https://phabricator.wikimedia.org/T402460 [23:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1189375 [23:38:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1189375 (owner: TrainBranchBot) [23:44:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:51:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1189375 (owner: TrainBranchBot) [23:56:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown