[00:08:33] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1161078 [00:08:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1161078 (owner: TrainBranchBot) [00:11:25] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:12:07] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:12:29] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4051 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:13:09] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:14:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1108 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:14:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:15:49] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:16:17] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1104 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled 
OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:17:25] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:17:29] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:17:41] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1113 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:20:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4038 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:20:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2040 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:20:41] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2036 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:21:33] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1114 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:22:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:23:43] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP 
response https://wikitech.wikimedia.org/wiki/HTTPS [00:25:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:25:11] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1103 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:26:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:26:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:26:09] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:26:37] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1105 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:27:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2042 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:27:07] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:27:25] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4047 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS 
[00:28:03] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1110 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:28:17] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:29:13] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1111 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:29:21] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4046 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:30:03] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1102 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:30:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:30:39] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:31:09] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4048 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:31:21] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:31:21] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on 
cp4041 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:32:07] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:32:13] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:33:27] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4040 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:34:25] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1115 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:34:41] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1100 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:35:37] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1109 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:35:37] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:35:51] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1112 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:36:25] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4049 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:37:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:37:17] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1107 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:38:39] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2041 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:39:39] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1101 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:40:55] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4045 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [00:49:50] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397386 (phaultfinder) NEW [00:50:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1161078 (owner: TrainBranchBot) [01:06:41] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/adc96fa92c4576fd1d55056bda08bfc99ab0a4cc07015c81f7840ab837be82a7/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:24:49] ops-eqiad, DC-Ops: 
Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397386#10930318 (phaultfinder) [01:26:41] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:39:57] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397386#10930322 (phaultfinder) [01:59:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:05:49] (PS1) Dzahn: zuul::executor: add zuul user and nodepool ssh private key [puppet] - https://gerrit.wikimedia.org/r/1161090 (https://phabricator.wikimedia.org/T395938) [02:12:44] (PS1) Dzahn: secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) [02:15:43] (CR) Dzahn: [V:+2 C:+2] secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [02:15:51] (PS2) Dzahn: secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) [02:15:55] (CR) Dzahn: [V:+2] secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [02:22:36] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/labs/private/+/1161093" [puppet] - https://gerrit.wikimedia.org/r/1161090 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [02:24:18] (PS2) Dzahn: 
zuul::executor: add zuul user and nodepool ssh private key [puppet] - https://gerrit.wikimedia.org/r/1161090 (https://phabricator.wikimedia.org/T395938) [02:25:23] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1161090/6023/zuul1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1161090 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [02:34:58] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397386#10930339 (phaultfinder) [03:14:58] (PS1) Andrew Bogott: Openstack [service_user] config: use internal endpoint for service users [puppet] - https://gerrit.wikimedia.org/r/1161113 (https://phabricator.wikimedia.org/T330759) [03:15:00] (PS1) Andrew Bogott: Openstack [keystone_authtoken]: remove auth_url setting [puppet] - https://gerrit.wikimedia.org/r/1161114 (https://phabricator.wikimedia.org/T330759) [03:15:01] (PS1) Andrew Bogott: cinder: use 'cinder' service user rather than 'novaadmin' [puppet] - https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) [03:15:17] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [03:18:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:21:37] (PS1) Andrew Bogott: Comment back in cinder ldap passwords [labs/private] - https://gerrit.wikimedia.org/r/1161116 [03:22:02] (CR) Andrew Bogott: [V:+2 C:+2] Comment back in cinder ldap passwords [labs/private] - https://gerrit.wikimedia.org/r/1161116 (owner: Andrew Bogott) [03:22:57] (PS2) Andrew Bogott: cinder: use 'cinder' service 
user rather than 'novaadmin'. [puppet] - https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) [03:23:04] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [03:30:21] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161113 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [03:30:28] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161114 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [03:42:01] (PS2) Andrew Bogott: Openstack [service_user] config: use internal endpoint for service users [puppet] - https://gerrit.wikimedia.org/r/1161113 (https://phabricator.wikimedia.org/T330759) [03:42:01] (PS2) Andrew Bogott: Openstack [keystone_authtoken]: remove auth_url setting [puppet] - https://gerrit.wikimedia.org/r/1161114 (https://phabricator.wikimedia.org/T330759) [03:42:01] (PS3) Andrew Bogott: cinder: use 'cinder' service user rather than 'novaadmin' [puppet] - https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) [03:42:18] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161113 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [03:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:59:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:20:19] (CR) Andrew Bogott: [C:+2] Openstack [service_user] config: use internal endpoint for service 
users [puppet] - https://gerrit.wikimedia.org/r/1161113 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [04:20:31] (CR) Andrew Bogott: [C:+2] Openstack [keystone_authtoken]: remove auth_url setting [puppet] - https://gerrit.wikimedia.org/r/1161114 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [04:22:17] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (0 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [04:24:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:26:17] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:27:17] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:28:35] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:29:04] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:29:35] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:30:03] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:30:17] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:31:03] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:31:17] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:33:37] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:33:38] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:33:56] (PS1) Andrew Bogott: Revert "Openstack [keystone_authtoken]: remove auth_url setting" [puppet] - https://gerrit.wikimedia.org/r/1161148 [04:33:58] (PS1) Andrew Bogott: Revert "Openstack [service_user] config: use internal endpoint f..." 
[puppet] - https://gerrit.wikimedia.org/r/1161149 [04:34:03] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:34:04] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:34:37] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:34:37] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:34:42] (Abandoned) Andrew Bogott: Revert "Openstack [service_user] config: use internal endpoint f..." 
[puppet] - https://gerrit.wikimedia.org/r/1161149 (owner: Andrew Bogott) [04:34:53] (CR) Andrew Bogott: [C:+2] Revert "Openstack [keystone_authtoken]: remove auth_url setting" [puppet] - https://gerrit.wikimedia.org/r/1161148 (owner: Andrew Bogott) [04:35:03] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:36:23] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:37:23] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:59:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:01:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:01:48] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:05:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [05:06:42] FIRING: JobUnavailable: Reduced availability 
for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:07:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc2 T378715', diff saved to https://phabricator.wikimedia.org/P78392 and previous config saved to /var/cache/conftool/dbconfig/20250619-050725-root.json [05:07:31] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [05:08:45] !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on pc1012.eqiad.wmnet with reason: Maintenance [05:09:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance [05:24:24] (PS1) Marostegui: db2186: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1161179 (https://phabricator.wikimedia.org/T397279) [05:26:39] (PS1) KartikMistry: Enable the Contribute menu in Egyptian Arabic, Igbo, and Uzbek [mediawiki-config] - https://gerrit.wikimedia.org/r/1161182 [05:27:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1161182 (owner: KartikMistry) [05:34:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [05:34:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T396130)', diff saved to https://phabricator.wikimedia.org/P78393 and previous config saved to /var/cache/conftool/dbconfig/20250619-053433-marostegui.json [05:34:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - 
https://phabricator.wikimedia.org/T396130 [05:38:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2186.codfw.wmnet with reason: Maintenance [05:38:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2186', diff saved to https://phabricator.wikimedia.org/P78394 and previous config saved to /var/cache/conftool/dbconfig/20250619-053826-root.json [05:40:09] (CR) Marostegui: [C:+2] db2186: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1161179 (https://phabricator.wikimedia.org/T397279) (owner: Marostegui) [05:44:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78395 and previous config saved to /var/cache/conftool/dbconfig/20250619-054418-root.json [05:50:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:55:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T396130)', diff saved to https://phabricator.wikimedia.org/P78396 and previous config saved to /var/cache/conftool/dbconfig/20250619-055522-marostegui.json [05:55:28] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:55:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:59:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78397 and previous config saved to /var/cache/conftool/dbconfig/20250619-055924-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0600) [06:00:05] marostegui, Amir1, and 
federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0600). [06:04:35] <_joe_> criung [06:04:38] <_joe_> *cringe [06:04:47] <_joe_> jouncebot: cringe [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:10:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P78398 and previous config saved to /var/cache/conftool/dbconfig/20250619-061030-marostegui.json [06:14:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78399 and previous config saved to /var/cache/conftool/dbconfig/20250619-061430-root.json [06:19:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:21:32] (PS1) Giuseppe Lavagetto: New deployment, including new api endpoints [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1161205 [06:24:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown 
[06:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P78400 and previous config saved to /var/cache/conftool/dbconfig/20250619-062537-marostegui.json [06:26:05] PROBLEM - Hadoop DataNode on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [06:27:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:29:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78401 and previous config saved to /var/cache/conftool/dbconfig/20250619-062936-root.json [06:37:59] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1149-1153].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 10 [06:38:05] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10930469 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0e9f957a-66ba-4353-b48... [06:38:23] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1175.eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 10 [06:38:35] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10930470 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9df86b9e-4cff-46c4-970... 
[06:39:16] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1154.eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 9
[06:39:22] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10930471 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=df7b706d-9914-413e-aa5e-5dc80159cf57) set b...
[06:39:42] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1176.eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 9
[06:39:52] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10930472 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=bc400f30-590d-486b-89e0-ab54c7fac73e) set b...
[06:40:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T396130)', diff saved to https://phabricator.wikimedia.org/P78402 and previous config saved to /var/cache/conftool/dbconfig/20250619-064045-marostegui.json
[06:40:50] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:41:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[06:41:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T396130)', diff saved to https://phabricator.wikimedia.org/P78403 and previous config saved to /var/cache/conftool/dbconfig/20250619-064108-marostegui.json
[06:41:24] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10930478 (Stevemunene) the hosts are finally done draining and are listed as decommissioned {F62...
[06:41:45] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10930479 (Stevemunene)
[06:42:20] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10930480 (Marostegui) Thank you - from puppet side this host is ready to be installed and reimaged.
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0700). nyaa~
[07:00:05] georgekyz and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:28] I am ready for deploy
[07:00:46] here
[07:00:54] ping me when done georgekyz
[07:01:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T396130)', diff saved to https://phabricator.wikimedia.org/P78404 and previous config saved to /var/cache/conftool/dbconfig/20250619-070146-marostegui.json
[07:01:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[07:02:04] starting
[07:02:14] (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[07:03:01] (Merged) jenkins-bot: ores-extension: enable extension with revertrisk filter for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1160797 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[07:04:04] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1160797|ores-extension: enable extension with revertrisk filter for azwiki (T395824)]]
[07:04:08] T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824
[07:04:30] (PS1) Muehlenhoff: Make ganeti2047/ganeti2048 Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1161335 (https://phabricator.wikimedia.org/T396590)
[07:04:50] !log installing edk2 security updates
[07:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:28] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1160797|ores-extension: enable extension with revertrisk filter for azwiki (T395824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:07:54] (CR) Muehlenhoff: [C:+2] Make ganeti2047/ganeti2048 Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1161335 (https://phabricator.wikimedia.org/T396590) (owner: Muehlenhoff)
[07:08:59] !log gkyziridis@deploy1003 gkyziridis: Continuing with sync
[07:12:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:15:25] (PS3) Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845)
[07:15:55] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1160797|ores-extension: enable extension with revertrisk filter for azwiki (T395824)]] (duration: 11m 50s)
[07:16:00] T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824
[07:16:32] I finished with the deployment, feel free to proceed.
[07:16:38] thnx for being around
[07:16:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:16:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P78405 and previous config saved to /var/cache/conftool/dbconfig/20250619-071654-marostegui.json
[07:17:34] (CR) CI reject: [V:-1] Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[07:18:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet
[07:18:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:30] thanks georgekyz
[07:18:58] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1161182 (owner: KartikMistry)
[07:19:44] (Merged) jenkins-bot: Enable the Contribute menu in Egyptian Arabic, Igbo, and Uzbek [mediawiki-config] - https://gerrit.wikimedia.org/r/1161182 (owner: KartikMistry)
[07:19:55] (PS1) Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - https://gerrit.wikimedia.org/r/1161337
[07:20:12] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1161182|Enable the Contribute menu in Egyptian Arabic, Igbo, and Uzbek]]
[07:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.27s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:21:22] (CR) Brouberol: [C:+1] Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) (owner: Btullis)
[07:21:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:22:13] (CR) Brouberol: Airflow: Use a python value for the xcom_sidecar resource settings (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: Btullis)
[07:22:28] !log kartik@deploy1003 kartik: Backport for [[gerrit:1161182|Enable the Contribute menu in Egyptian Arabic, Igbo, and Uzbek]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:24:22] !log kartik@deploy1003 kartik: Continuing with sync [07:25:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2233].codfw.wmnet,db[1164,1217,1250].eqiad.wmnet with reason: Primary switchover m2 T397182 [07:25:05] T397182: Switchover m2 master db1164 -> db1250 - https://phabricator.wikimedia.org/T397182 [07:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.27s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:25:19] !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300 [07:25:24] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300 [07:25:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet [07:25:52] (03CR) 10CI reject: [V:04-1] sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [07:26:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:26:53] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300 [07:26:58] FIRING: [18x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:44] (03PS4) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) [07:28:03] (03PS1) 10Marostegui: mariadb: Promote db1250 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1161338 (https://phabricator.wikimedia.org/T397182) [07:29:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet [07:29:43] (03PS2) 10Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 [07:29:43] (03PS1) 10Ayounsi: tox: remove python 3.9 and 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1161342 [07:30:07] (03PS3) 10KartikMistry: WIP: machinetranslation: Use s3 storage for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) [07:30:22] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1250 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1161338 (https://phabricator.wikimedia.org/T397182) (owner: 10Marostegui) [07:31:11] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161182|Enable the Contribute menu in Egyptian Arabic, Igbo, and Uzbek]] (duration: 10m 59s) [07:31:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2159', diff saved to https://phabricator.wikimedia.org/P78407 and previous config saved to /var/cache/conftool/dbconfig/20250619-073201-marostegui.json [07:32:54] (03CR) 10Brouberol: Airflow analytics-test: Optimization for LocalExecutors (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [07:33:52] !log Failover m2 from db1164 to db1250 - T397182 [07:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:56] T397182: Switchover m2 master db1164 -> db1250 - https://phabricator.wikimedia.org/T397182 [07:35:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.698s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:36:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet [07:36:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:36:48] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:37:45] !log just started es read only backup regeneration T387892 [07:37:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:49] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [07:37:59] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1161379 [07:39:02] (03CR) 10Marostegui: [C:03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1161379 (owner: 10Marostegui) [07:39:16] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2047.codfw.wmnet to cluster codfw and group B [07:39:57] (03CR) 10Brouberol: Airflow analytics-test: Optimization for LocalExecutors (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [07:40:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.602s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:41:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2047.codfw.wmnet to cluster codfw and group B [07:41:39] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2048.codfw.wmnet to cluster codfw and group B [07:41:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:41:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db[1164,1217].eqiad.wmnet with reason: Maintenance [07:42:18] (03CR) 10Jelto: [C:03+2] gitlab-runner: upgrade default image to bookworm on Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/1160120 (https://phabricator.wikimedia.org/T384595) (owner: 10Jelto) [07:44:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.48s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:45:21] PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:45:25] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:46:27] (03PS1) 10Marostegui: mariadb: Move db1164 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/1161381 (https://phabricator.wikimedia.org/T397397) [07:46:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T396130)', diff saved to https://phabricator.wikimedia.org/P78409 and previous config saved to /var/cache/conftool/dbconfig/20250619-074708-marostegui.json [07:47:13] T396130: Add afl_ip_hex column and 
afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:47:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [07:47:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T396130)', diff saved to https://phabricator.wikimedia.org/P78410 and previous config saved to /var/cache/conftool/dbconfig/20250619-074731-marostegui.json [07:47:59] (03PS1) 10Elukey: admin: allow dcops to use perccli and storcli via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1161382 (https://phabricator.wikimedia.org/T395939) [07:47:59] haproxy alerts are expected [07:49:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.793s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:50:18] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10930719 (10elukey) @MatthewVernon would it be ok to start the upload of the new tiles to Swift, while we are removing... 
[07:50:23] (CR) Vgutierrez: [C:+1] "looks good & varnish tests are happy, please update https://wikitech.wikimedia.org/wiki/X-Analytics" [puppet] - https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: Krinkle)
[07:50:23] (CR) Marostegui: [C:+2] mariadb: Move db1164 to m5 [puppet] - https://gerrit.wikimedia.org/r/1161381 (https://phabricator.wikimedia.org/T397397) (owner: Marostegui)
[07:50:30] !log installing glib2.0 security updates
[07:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:01] (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[07:51:43] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:51:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:55:04] (PS1) Marostegui: db1179: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1161383 (https://phabricator.wikimedia.org/T397279)
[07:55:32] (CR) MVernon: [C:+2] thanos: add new backends, remove old ones gone from rings [puppet] - https://gerrit.wikimedia.org/r/1160855 (https://phabricator.wikimedia.org/T391352) (owner: MVernon)
[07:55:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P78411 and previous config saved to /var/cache/conftool/dbconfig/20250619-075548-root.json
[07:56:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:56:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:56:48] FIRING: [14x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:59:40] (03CR) 10Marostegui: [C:03+2] db1179: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161383 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui) [07:59:51] (03CR) 10MVernon: [C:03+2] thanos: add new nodes to ring, drain old ones [puppet] - 10https://gerrit.wikimedia.org/r/1160856 (https://phabricator.wikimedia.org/T392908) (owner: 10MVernon) [08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0800) [08:00:13] o/ [08:00:53] (03PS1) 10Muehlenhoff: Add Joanna to Bitu account managers [puppet] - 10https://gerrit.wikimedia.org/r/1161389 [08:01:35] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161390 (https://phabricator.wikimedia.org/T392176) [08:01:36] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161390 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [08:01:43] FIRING: [9x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:02:26] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161390 (https://phabricator.wikimedia.org/T392176) (owner: 10TrainBranchBot) [08:03:03] 06SRE, 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10930800 (10elukey) >>! In T391852#10927063, @elukey wrote: >>> * The success SLO seems not taken into account so far from o... [08:04:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10930825 (10Stevemunene) the hosts are finally done draining and are listed as decommissioned {F62386905} disabled pup... 
[08:04:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10930837 (10Stevemunene) [08:05:28] (03PS5) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) [08:05:36] (03CR) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:06:43] RESOLVED: [8x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:07:21] !log installing python-tornado security updates [08:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:03] (03PS1) 10Marostegui: db1164: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161395 (https://phabricator.wikimedia.org/T397397) [08:08:06] (03PS6) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) [08:08:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T396130)', diff saved to https://phabricator.wikimedia.org/P78412 and previous config saved to /var/cache/conftool/dbconfig/20250619-080812-marostegui.json [08:08:18] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:08:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P78413 and previous config saved to /var/cache/conftool/dbconfig/20250619-080820-root.json [08:10:17] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: create new SLO dashboard via Pyrra - https://phabricator.wikimedia.org/T394057#10930936 (10DSantamaria) [08:10:22] !log mvernon@cumin1003 START - Cookbook sre.hosts.decommission for hosts thanos-be[1001-1004].eqiad.wmnet [08:10:37] jouncebot: now and next [08:10:37] For the next 1 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0800) [08:10:50] (03PS1) 10Vgutierrez: haproxy: Disable OCSP monitoring for LE unified cert [puppet] - 10https://gerrit.wikimedia.org/r/1161397 (https://phabricator.wikimedia.org/T370821) [08:11:02] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:11:08] (03PS2) 10Filippo Giunchedi: thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318) [08:11:14] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] thanos: enable snappy compression for grpc in query [puppet] - 10https://gerrit.wikimedia.org/r/1160749 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:11:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161397 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [08:11:41] (03CR) 10Marostegui: [C:03+2] db1164: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161395 (https://phabricator.wikimedia.org/T397397) (owner: 10Marostegui) [08:11:59] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.6 refs T392176 [08:12:03] T392176: 1.45.0-wmf.6 deployment 
blockers - https://phabricator.wikimedia.org/T392176 [08:13:21] RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:13:25] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:14:20] logs are quiet [08:14:37] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10930949 (10elukey) >>! In T396584#10930868, @MatthewVernon wrote: > @elukey yes, that should be fine to start upload -... [08:17:04] !log Ran fixStuckGlobalRename.php for T397384 T397219 T397218 [08:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:13] T397384: Unblock stuck global rename of Mr. PowerUp98 - https://phabricator.wikimedia.org/T397384 [08:17:13] T397219: Unblock stuck global rename of CyberLife070 - https://phabricator.wikimedia.org/T397219 [08:17:13] T397218: Unblock stuck global rename of Renamed user fc26ace47276834fd507d19dab11aed6 - https://phabricator.wikimedia.org/T397218 [08:21:58] (03CR) 10Muehlenhoff: [C:03+2] mediawiki/memcached: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156669 (owner: 10Muehlenhoff) [08:23:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P78414 and previous config saved to /var/cache/conftool/dbconfig/20250619-082320-marostegui.json [08:23:26] (03PS1) 10Slyngshede: Lock RQ dependency at 1.16.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1161409 (https://phabricator.wikimedia.org/T397300) [08:23:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78415 and previous config saved to 
/var/cache/conftool/dbconfig/20250619-082326-root.json [08:23:33] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:25:08] !log mvernon@cumin1003 START - Cookbook sre.dns.netbox [08:28:37] !log akosiaris@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [08:29:15] !log mvernon@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-be[1001-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1003" [08:29:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:29:57] jmm@cumin1003 addnode (PID 2229380) is awaiting input [08:31:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2048.codfw.wmnet to cluster codfw and group B [08:31:42] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: force query-frontend query stats [puppet] - 10https://gerrit.wikimedia.org/r/1160748 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [08:31:48] !log installing modsecurity-apache security updates [08:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:19] mvernon@cumin1003 decommission (PID 2232385) is awaiting input [08:33:03] !log mvernon@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-be[1001-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1003" [08:33:03] !log 
mvernon@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:33:04] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thanos-be[1001-1004].eqiad.wmnet [08:33:15] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10931066 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1003 for hosts: `thanos-be[1001-1004].eqiad.wmnet` - thanos-be1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Ale... [08:33:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-be100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T397414 (10MatthewVernon) 03NEW [08:34:19] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10931085 (10MatthewVernon) [08:35:30] (03CR) 10Btullis: [C:03+2] Prepare for renaming kafka-stretch200[1-2] to dse-k8s-worker200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1160888 (https://phabricator.wikimedia.org/T353789) (owner: 10Btullis) [08:37:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be109[2-5] - https://phabricator.wikimedia.org/T393104#10931104 (10MatthewVernon) @Jclark-ctr I've just put in T397414 to decommission (amongst others) thanos-be1003 in `C4` and thanos-be1004 in `D7`; could th... 
[08:38:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P78417 and previous config saved to /var/cache/conftool/dbconfig/20250619-083827-marostegui.json [08:38:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78418 and previous config saved to /var/cache/conftool/dbconfig/20250619-083832-root.json [08:39:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:42:44] (03CR) 10Marostegui: Add switchover cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [08:43:41] (03CR) 10Marostegui: "I'd like to see if we can have a deeper review by someone else" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [08:44:57] (03CR) 10Fabfur: [C:03+1] haproxy: Disable OCSP monitoring for LE unified cert [puppet] - 10https://gerrit.wikimedia.org/r/1161397 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [08:46:46] (03CR) 10Vgutierrez: [C:03+2] haproxy: Disable OCSP monitoring for LE unified cert [puppet] - 10https://gerrit.wikimedia.org/r/1161397 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [08:47:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2235.codfw.wmnet with reason: Maintenance [08:47:36] (03PS1) 10Marostegui: db2235: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161433 (https://phabricator.wikimedia.org/T397412) [08:48:08] (03CR) 10Marostegui: [C:03+2] db2235: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161433 (https://phabricator.wikimedia.org/T397412) (owner: 
10Marostegui) [08:48:23] (03PS1) 10Vgutierrez: hiera: Switch lvs4009 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1161434 (https://phabricator.wikimedia.org/T396561) [08:48:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161434 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:50:13] PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2235.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2235.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:50:21] ^ expected [08:50:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2160,2235].codfw.wmnet with reason: Maintenance [08:50:43] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [08:51:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10931171 (10ops-monitoring-bot) Draining ganeti2021.codfw.wmnet of running VMs [08:51:43] jouncebot: nowandnext [08:51:43] For the next 1 hour(s) and 8 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0800) [08:51:43] In 1 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1000) [08:52:00] 06SRE, 06Traffic, 13Patch-For-Review: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10931172 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez [08:52:32] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from kafka-stretch2001 to 
dse-k8s-worker2001 [08:52:53] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [08:52:55] (03CR) 10Jcrespo: [C:03+2] "Thanks, I did some of those afterwards (apparently, if I rename roles or create new ones they default to puppet5 :-(, which created some m" [puppet] - 10https://gerrit.wikimedia.org/r/1160691 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [08:53:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T396130)', diff saved to https://phabricator.wikimedia.org/P78419 and previous config saved to /var/cache/conftool/dbconfig/20250619-085334-marostegui.json [08:53:40] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:53:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78420 and previous config saved to /var/cache/conftool/dbconfig/20250619-085344-root.json [08:53:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [08:53:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T396130)', diff saved to https://phabricator.wikimedia.org/P78421 and previous config saved to /var/cache/conftool/dbconfig/20250619-085357-marostegui.json [08:55:06] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [08:55:32] (03CR) 10Urbanecm: [C:03+2] changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [08:55:58] (03PS7) 10Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [08:56:05] !log jmm@cumin1003 
START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [08:56:13] RECOVERY - MariaDB Replica IO: m5 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:56:40] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch2001 to dse-k8s-worker2001 - btullis@cumin1003" [08:57:07] (03Merged) 10jenkins-bot: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [08:58:13] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10931199 (10MatthewVernon) @elukey You could delete each container in parallel (in a separate tmux/screen window or wha... 
[08:58:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch2001 to dse-k8s-worker2001 - btullis@cumin1003" [08:58:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:37] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker2001 on all recursors [08:58:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker2001 on all recursors [08:58:40] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2001 [08:58:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2001 [08:58:55] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [08:58:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T397164 [08:59:04] T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164 [08:59:08] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2042 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:08] jmm@cumin1003 drain-node (PID 2238338) is awaiting input [08:59:10] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2030 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:10] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2027 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 
23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:10] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2031 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:12] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1103 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:12] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4037 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:18] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:24] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [08:59:26] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4047 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:26] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1110 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kafka-stretch2001 to dse-k8s-worker2001 [08:59:34] RECOVERY - HAProxy HTTPS 
wikipedia.org ECDSA on cp1114 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:38] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1105 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:44] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2034 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:44] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4043 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:59:59] !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [09:00:49] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from kafka-stretch2002 to dse-k8s-worker2002 [09:01:08] !log urbanecm@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [09:01:10] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:01:16] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [09:01:38] (03CR) 10Brouberol: Airflow analytics-test: Optimization for LocalExecutors (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [09:02:12] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [09:06:46] btullis@cumin1003 rename (PID 2238782) is awaiting input [09:08:52] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host krb2002.codfw.wmnet [09:09:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10931246 (10ops-monitoring-bot) Draining ganeti2021.codfw.wmnet of running VMs [09:10:39] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch2002 to dse-k8s-worker2002 - btullis@cumin1003" [09:11:49] (03CR) 10Btullis: Airflow: Use a python value for the xcom_sidecar resource settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [09:12:43] (03PS1) 10Urbanecm: changeprop beta: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161443 (https://phabricator.wikimedia.org/T394958) [09:13:44] btullis@cumin1003 rename (PID 2238782) is awaiting input [09:13:53] (03PS8) 10Btullis: Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [09:14:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kafka-stretch2002 to dse-k8s-worker2002 - btullis@cumin1003" [09:14:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:15] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker2002 on all recursors [09:14:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker2002 on all recursors [09:14:19] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2002 [09:14:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host krb2002.codfw.wmnet [09:14:48] (03CR) 10Michael Große: [C:03+1] "Confirming reducing the reenqueue delay for this job to 30 minutes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161443 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [09:15:08] (03CR) 10Urbanecm: [C:03+2] changeprop beta: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161443 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [09:15:22] (03PS1) 10Marostegui: db2196: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161444 (https://phabricator.wikimedia.org/T397279) [09:15:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2196', diff saved to https://phabricator.wikimedia.org/P78422 and previous config saved to /var/cache/conftool/dbconfig/20250619-091532-root.json [09:15:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T396130)', diff saved to https://phabricator.wikimedia.org/P78423 and previous config saved to /var/cache/conftool/dbconfig/20250619-091539-marostegui.json [09:15:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:15:52] (03PS9) 10Btullis: Airflow: Use a python value for the xcom_sidecar resource settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [09:16:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2196.codfw.wmnet with reason: Maintenance [09:16:16] (03CR) 10Marostegui: [C:03+2] db2196: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161444 (https://phabricator.wikimedia.org/T397279) (owner: 10Marostegui) [09:16:20] (03PS10) 10Btullis: Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 
(https://phabricator.wikimedia.org/T396197) [09:16:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2002 [09:16:53] (03Merged) 10jenkins-bot: changeprop beta: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161443 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [09:17:19] (03PS1) 10Elukey: sre.hosts.provision: add another special Supermicro set of EFI settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1161445 (https://phabricator.wikimedia.org/T397415) [09:17:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kafka-stretch2002 to dse-k8s-worker2002 [09:17:57] (03PS11) 10Btullis: Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [09:19:12] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:19:15] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:19:21] (03CR) 10CI reject: [V:04-1] Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [09:19:43] (03PS12) 10Btullis: Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [09:21:50] (03PS2) 10Elukey: sre.hosts.provision: add another special Supermicro set of EFI settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1161445 (https://phabricator.wikimedia.org/T397415) [09:22:08] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host 
aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:22:11] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:23:17] (03PS13) 10Btullis: Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) [09:24:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:24:48] (03PS1) 10Effie Mouzeli: Add wikikube-worker-exp to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161447 [09:25:11] (03CR) 10Clément Goubert: "UX issue, otherwise LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [09:25:34] (03PS3) 10Elukey: sre.hosts.provision: add another special Supermicro set of EFI settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1161445 (https://phabricator.wikimedia.org/T397415) [09:25:44] (03CR) 10Brouberol: [C:03+1] Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [09:25:51] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:25:54] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:26:39] (03CR) 10Cathal Mooney: [C:03+1] "LGTM tested here works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [09:27:05] (03CR) 10Giuseppe Lavagetto: "One main question about EXCLUDED_SERVICES, everything else can be fixed later/when we move the code to spicerack." [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [09:27:10] (03CR) 10Ladsgroup: "Wanna add collation table too or you want to add it later?" [puppet] - 10https://gerrit.wikimedia.org/r/1160178 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [09:28:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78424 and previous config saved to /var/cache/conftool/dbconfig/20250619-092801-root.json [09:28:12] (03PS4) 10Elukey: sre.hosts.provision: add another special Supermicro set of EFI settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1161445 (https://phabricator.wikimedia.org/T397415) [09:28:16] (03PS1) 10Effie Mouzeli: refactor server hostgroup matching [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161448 [09:28:21] (03CR) 10Btullis: [C:03+2] Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [09:28:23] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2196 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1161449 (https://phabricator.wikimedia.org/T397419) [09:28:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:28:57] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4045 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [09:29:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1108 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2040 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2028 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2035 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:09] (03CR) 10Zabe: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117909 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1160178 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [09:29:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1112 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2033 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2037 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4038 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1111 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:13] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4042 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [09:29:15] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4048 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:17] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1104 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:17] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1107 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4041 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4046 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1115 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:27] RECOVERY - HAProxy HTTPS 
wikipedia.org ECDSA on cp2038 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2032 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4049 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4040 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:29] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2039 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:29] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4051 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:33] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1102 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1109 is OK: SSL OK - Certificate *.wikipedia.org 
contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4050 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1101 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:39] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2041 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1113 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1100 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2036 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 
2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:29:48] (03PS1) 10Esanders: Deploy mobile insert menu to remaining top 20 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161450 (https://phabricator.wikimedia.org/T388591) [09:29:49] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2029 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:30:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161382 (https://phabricator.wikimedia.org/T395939) (owner: 10Elukey) [09:30:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2001.codfw.wmnet with OS bookworm [09:30:36] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host dse-k8s-worker2001 [09:30:41] (03Merged) 10jenkins-bot: Airflow: Render the xcom_sidecar resource settings correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160938 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [09:30:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P78425 and previous config saved to /var/cache/conftool/dbconfig/20250619-093047-marostegui.json [09:31:06] (03CR) 10Ladsgroup: "oh I have memory of goldfish. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1160178 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [09:31:10] (03PS2) 10Zabe: filtered_tables: Add new categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1160178 (https://phabricator.wikimedia.org/T299951) [09:31:22] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:32:12] (03CR) 10Ladsgroup: [C:03+2] filtered_tables: Add new categorylinks columns [puppet] - 10https://gerrit.wikimedia.org/r/1160178 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [09:32:29] (03CR) 10Hnowlan: [C:03+2] mobileapps: bump replicas significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160773 (owner: 10Hnowlan) [09:33:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:33:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [09:34:00] (03Merged) 10jenkins-bot: mobileapps: bump replicas significantly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160773 (owner: 10Hnowlan) [09:36:23] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [09:36:32] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [09:36:52] btullis@cumin1003 reimage (PID 2242387) is awaiting input [09:37:47] (03PS1) 10Btullis: Arflow: add quotes around the xcom_sidecar image value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) [09:38:25] (03CR) 10Cathal Mooney: [C:03+1] Add wikikube-worker-exp to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161447 (owner: 10Effie Mouzeli) [09:39:11] (03CR) 10Brouberol: Arflow: add quotes around the xcom_sidecar image value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 
(https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [09:39:44] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [09:39:48] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker2001 - btullis@cumin1003" [09:39:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker2001 - btullis@cumin1003" [09:39:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:39:53] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker2001.codfw.wmnet 126.32.192.10.in-addr.arpa 6.2.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:39:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker2001.codfw.wmnet 126.32.192.10.in-addr.arpa 6.2.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:39:56] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2001 [09:40:18] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] New deployment, including new api endpoints [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1161205 (owner: 10Giuseppe Lavagetto) [09:40:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2001 [09:40:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host dse-k8s-worker2001 [09:40:36] <_joe_> jouncebot: nowandnext [09:40:36] For the next 0 hour(s) and 19 minute(s): MediaWiki train - Utc-7 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T0800) [09:40:36] In 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1000) [09:43:00] (03CR) 10Cathal Mooney: [C:03+2] Add wikikube-worker-exp to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161447 (owner: 10Effie Mouzeli) [09:43:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78426 and previous config saved to /var/cache/conftool/dbconfig/20250619-094306-root.json [09:43:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:44:13] (03CR) 10Slyngshede: [C:03+1] "Looks good, but does it make sense to be notified about permissions without being allowed to approved them, in this case?" [puppet] - 10https://gerrit.wikimedia.org/r/1161389 (owner: 10Muehlenhoff) [09:44:25] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "New api endpoints for the requestctl client - oblivian@cumin1003" [09:44:28] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: New api endpoints for the requestctl client - oblivian@cumin1003 [09:44:51] (03PS6) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [09:45:00] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: New api endpoints for the requestctl client - oblivian@cumin1003 [09:45:01] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts 
with reason: "New api endpoints for the requestctl client - oblivian@cumin1003" [09:45:22] 06SRE, 06collaboration-services, 10observability, 13Patch-For-Review: create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#10931339 (10fgiunchedi) Reporting the discussion from yesterday's o11y team meeting: thank you @Dzahn for kic... [09:45:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P78427 and previous config saved to /var/cache/conftool/dbconfig/20250619-094554-marostegui.json [09:47:03] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [09:47:45] (03PS7) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [09:49:32] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10931365 (10elukey) @VPuffetMichel Hi! Is there anybody that can follow up on this task while David is afk? We... 
[09:50:49] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [09:51:47] (03PS6) 10Jgiannelos: RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) [09:52:17] (03PS7) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) [09:52:21] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Homer release to add wikikube-worker-exp - cmooney@cumin1003 [09:52:38] (03CR) 10Aqu: Airflow analytics-test: Optimization for LocalExecutors (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [09:53:10] (03CR) 10Jgiannelos: RB sunset: Configure claim TTL for PCS related endpoints (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [09:53:59] (03CR) 10CI reject: [V:04-1] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [09:54:48] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Homer release to add wikikube-worker-exp - cmooney@cumin1003 [09:55:39] (03PS2) 10Btullis: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) [09:57:05] !log btullis@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on dse-k8s-worker2001.codfw.wmnet with reason: host reimage [09:58:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78428 and previous config saved to /var/cache/conftool/dbconfig/20250619-095811-root.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1000) [10:00:05] jayme, Raine, and claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [10:01:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T396130)', diff saved to https://phabricator.wikimedia.org/P78429 and previous config saved to /var/cache/conftool/dbconfig/20250619-100102-marostegui.json [10:01:07] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:01:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [10:01:26] (03CR) 10Hnowlan: [C:03+1] RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [10:01:38] (03PS8) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [10:02:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2001.codfw.wmnet with reason: host reimage [10:04:06] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2006.codfw.wmnet with OS bookworm [10:05:31] (03PS1) 
10Marostegui: control-mariadb-10.6-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1161459 (https://phabricator.wikimedia.org/T397425) [10:05:52] (03CR) 10Muehlenhoff: "That doesn't make any sense ofc :-) I updated the patch" [puppet] - 10https://gerrit.wikimedia.org/r/1161389 (owner: 10Muehlenhoff) [10:06:23] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.6-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1161459 (https://phabricator.wikimedia.org/T397425) (owner: 10Marostegui) [10:06:51] (03Merged) 10jenkins-bot: control-mariadb-10.6-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1161459 (https://phabricator.wikimedia.org/T397425) (owner: 10Marostegui) [10:07:47] PROBLEM - librenms.wikimedia.org requires authentication on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:08:19] PROBLEM - librenms.wikimedia.org tls expiry on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:08:28] uh... expected ^^? 
[10:08:29] (03PS2) 10Effie Mouzeli: refactor server hostgroup matching [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1161448 [10:08:31] PROBLEM - SSH on netmon1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:08:37] (03CR) 10Ayounsi: [C:03+1] sre.hosts.provision: add another special Supermicro set of EFI settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1161445 (https://phabricator.wikimedia.org/T397415) (owner: 10Elukey) [10:09:49] mmhh not expected no, checking [10:09:53] (03PS2) 10Muehlenhoff: Add Joanna to Bitu account managers [puppet] - 10https://gerrit.wikimedia.org/r/1161389 [10:09:55] (03CR) 10Jgiannelos: [C:03+2] RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [10:11:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:11:44] (03Merged) 10jenkins-bot: RB sunset: Configure claim TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160897 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [10:12:08] well, the host is wedged [10:12:16] netmon1003 ? 
[10:12:24] yeah serial console unresponsive [10:12:25] yes, I'll kick it [10:12:27] though it pings [10:12:30] yep +1 [10:12:38] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:12:45] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:12:46] !log powercycle netmon1003 [10:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] (03PS3) 10Btullis: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) [10:13:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78430 and previous config saved to /var/cache/conftool/dbconfig/20250619-101317-root.json [10:13:29] librenms-syslog was using 60G of memory, I'd say that's not nominal [10:14:24] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:14:25] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:14:34] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:14:42] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:14:48] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:15:15] godog: yeah that doesn't sound right at all [10:15:17] RECOVERY - librenms.wikimedia.org tls expiry on netmon1003 is OK: OK - Certificate librenms.wikimedia.org will expire on Mon 15 Sep 2025 05:00:30 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:15:19] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [10:15:26] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: add another special Supermicro set of EFI settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1161445 (https://phabricator.wikimedia.org/T397415) (owner: 10Elukey) [10:15:37] RECOVERY - librenms.wikimedia.org requires authentication on netmon1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:15:43] topranks: indeed, we're back [10:15:57] (03CR) 10Ayounsi: [C:03+1] "overall lgtm, maybe add a comment so we revisit it when doing the netbox 4.3 upgrade?" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1161409 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [10:16:02] but yeah a substantial memory leak as far as I can see [10:16:34] https://grafana.wikimedia.org/goto/NI-Da4PNR?orgId=1 [10:16:39] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage [10:16:42] RESOLVED: [8x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:16:43] !log akosiaris@cumin1003 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage [10:17:11] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:17:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance [10:18:05] RECOVERY - SSH on netmon1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 
2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:18:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:18:30] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:19:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2001.codfw.wmnet with OS bookworm [10:19:25] (03PS1) 10Jgiannelos: changeprop: Claim ttl should be numerical [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161460 [10:19:33] (03CR) 10CI reject: [V:04-1] changeprop: Claim ttl should be numerical [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161460 (owner: 10Jgiannelos) [10:19:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker2002.codfw.wmnet with OS bookworm [10:19:46] !log depool / restart / repool ms-fe1009 [some idle timeouts] [10:19:47] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host dse-k8s-worker2002 [10:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:53] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:20:15] (03PS2) 10Jgiannelos: changeprop: Claim ttl should be numerical [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161460 [10:20:49] !log dropping searchindex table in itwiki (T397367) [10:20:52] (03PS1) 10Giuseppe Lavagetto: Bugfix for api tokens loading [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1161462 [10:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:54] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [10:22:12] 06SRE, 06Infrastructure-Foundations, 06SRE Observability: librenms-syslog leaks memory - https://phabricator.wikimedia.org/T397427 (10fgiunchedi) 03NEW [10:22:24] (03CR) 10Giuseppe Lavagetto: 
[V:03+2 C:03+2] Bugfix for api tokens loading [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1161462 (owner: 10Giuseppe Lavagetto) [10:23:06] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:23:08] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix api token loading - oblivian@cumin1003" [10:23:10] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix api token loading - oblivian@cumin1003 [10:23:42] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix api token loading - oblivian@cumin1003 [10:23:43] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix api token loading - oblivian@cumin1003" [10:23:46] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:23:53] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:23:58] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:24:32] (03PS4) 10Btullis: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) [10:25:07] (03CR) 10Hnowlan: [C:03+1] changeprop: Claim ttl should be numerical [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161460 (owner: 10Jgiannelos) [10:25:14] 06SRE, 06Infrastructure-Foundations, 06SRE Observability: librenms-syslog leaks memory - https://phabricator.wikimedia.org/T397427#10931480 (10fgiunchedi) In addition to the sawtooth pattern, atlas-exporter has begun using ~6G of memory {F62389274} [10:25:24] !log btullis@cumin1003 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker2002 - btullis@cumin1003" [10:25:25] (03PS9) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [10:25:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker2002 - btullis@cumin1003" [10:25:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:25:29] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker2002.codfw.wmnet 86.48.192.10.in-addr.arpa 6.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:25:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker2002.codfw.wmnet 86.48.192.10.in-addr.arpa 6.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:25:32] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker2002 [10:25:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker2002 [10:25:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host dse-k8s-worker2002 [10:27:01] (03CR) 10Jgiannelos: [C:03+2] changeprop: Claim ttl should be numerical [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161460 (owner: 10Jgiannelos) [10:27:10] (03PS5) 10Btullis: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) [10:28:14] (03PS10) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 
10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [10:28:42] (03Merged) 10jenkins-bot: changeprop: Claim ttl should be numerical [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161460 (owner: 10Jgiannelos) [10:28:56] !log installing postgresql-13 security updates [10:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:30:26] (03PS6) 10Btullis: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) [10:31:15] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:31:29] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:31:32] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/codfw: maintenance [10:32:16] !log installing twisted security updates [10:32:17] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:24] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2006.codfw.wmnet with OS bookworm [10:33:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:33:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance [10:34:01] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Depooling db2208 (T396130)', diff saved to https://phabricator.wikimedia.org/P78431 and previous config saved to /var/cache/conftool/dbconfig/20250619-103400-marostegui.json [10:34:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:35:21] (03PS4) 10Giuseppe Lavagetto: requestctl: switch CLI from native client to API client [puppet] - 10https://gerrit.wikimedia.org/r/1160476 [10:35:21] (03PS1) 10Giuseppe Lavagetto: hiddenparma: fix proxying of api [puppet] - 10https://gerrit.wikimedia.org/r/1161465 [10:38:49] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10931509 (10jijiki) 05Open→03Resolved p:05Triage→03Medium [10:39:51] (03CR) 10Clément Goubert: [C:03+1] hiddenparma: fix proxying of api [puppet] - 10https://gerrit.wikimedia.org/r/1161465 (owner: 10Giuseppe Lavagetto) [10:39:53] !log installing Django security updates [10:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:48] (03CR) 10Giuseppe Lavagetto: [C:03+2] hiddenparma: fix proxying of api [puppet] - 10https://gerrit.wikimedia.org/r/1161465 (owner: 10Giuseppe Lavagetto) [10:41:35] (03PS2) 10Giuseppe Lavagetto: hiddenparma: fix proxying of api [puppet] - 10https://gerrit.wikimedia.org/r/1161465 [10:41:40] (03CR) 10Joely Rooke WMDE: [C:03+1] Create feature flags for resolving Wikibase item labels on Watchlist. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [10:42:39] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:42:39] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker2002.codfw.wmnet with reason: host reimage [10:43:43] (03PS2) 10Slyngshede: Lock RQ dependency at 1.16.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1161409 (https://phabricator.wikimedia.org/T397300) [10:43:46] (03CR) 10Btullis: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [10:44:02] (03CR) 10Brouberol: [C:03+1] zookeeper: remove an-conf100[1-3] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [10:44:40] (03CR) 10Giuseppe Lavagetto: [C:03+2] hiddenparma: fix proxying of api [puppet] - 10https://gerrit.wikimedia.org/r/1161465 (owner: 10Giuseppe Lavagetto) [10:45:16] (03CR) 10Slyngshede: [V:03+2 C:03+2] Lock RQ dependency at 1.16.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1161409 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [10:45:25] (03CR) 10Ayounsi: [C:03+1] Lock RQ dependency at 1.16.2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1161409 (https://phabricator.wikimedia.org/T397300) (owner: 10Slyngshede) [10:45:56] (03PS1) 10Hnowlan: admin_ng: increase changeprop resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161467 (https://phabricator.wikimedia.org/T397072) [10:46:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/codfw: maintenance [10:46:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker2002.codfw.wmnet with reason: host reimage [10:46:52] 
(03PS1) 10Vgutierrez: liberica: Report CPUs handling NIC queues [puppet] - 10https://gerrit.wikimedia.org/r/1161468 (https://phabricator.wikimedia.org/T397303) [10:47:12] !log slyngshede@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300 [10:47:17] T397300: Upgrade Netbox to version 4.0.11 - https://phabricator.wikimedia.org/T397300 [10:47:40] (03PS1) 10Effie Mouzeli: admin.yaml: allow deployers to run mw-experimental-mediawiki-image-update [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) [10:48:27] (03PS2) 10Effie Mouzeli: admin.yaml: allow deployers to run mw-experimental-mediawiki-image-update [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) [10:48:39] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [10:48:52] (03PS3) 10Vgutierrez: prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) [10:48:52] (03PS2) 10Vgutierrez: liberica: Report CPUs handling NIC queues [puppet] - 10https://gerrit.wikimedia.org/r/1161468 (https://phabricator.wikimedia.org/T397303) [10:49:08] (03CR) 10Stevemunene: [C:03+2] replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [10:49:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161468 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [10:50:42] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.11 to netbox-next - slyngshede@cumin1002 - T397300 [10:52:55] (03PS1) 
10Giuseppe Lavagetto: hiddenparma: more fixes to the proxying logic [puppet] - 10https://gerrit.wikimedia.org/r/1161471 [10:53:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T396130)', diff saved to https://phabricator.wikimedia.org/P78432 and previous config saved to /var/cache/conftool/dbconfig/20250619-105347-marostegui.json [10:53:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:55:13] (03CR) 10Giuseppe Lavagetto: [C:03+2] hiddenparma: more fixes to the proxying logic [puppet] - 10https://gerrit.wikimedia.org/r/1161471 (owner: 10Giuseppe Lavagetto) [10:55:35] (03PS3) 10Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) [10:57:42] (03PS7) 10Brouberol: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [10:58:16] (03PS1) 10Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161474 (https://phabricator.wikimedia.org/T374922) [11:02:31] (03PS8) 10Brouberol: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [11:02:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker2002.codfw.wmnet with OS bookworm [11:02:50] (03CR) 10Brouberol: Arflow: fix issues with rendering YAML values to airflow config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [11:04:26] (03PS1) 10Vgutierrez: cacheproxy: Report CPUs assigned to NIC queues [puppet] - 
10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) [11:04:31] (03PS1) 10Effie Mouzeli: mediawiki_experimental: add motd [puppet] - 10https://gerrit.wikimedia.org/r/1161477 (https://phabricator.wikimedia.org/T396767) [11:05:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [11:07:36] (03PS3) 10Muehlenhoff: Install the Puppet 7 agent in d-i for trixie as well [puppet] - 10https://gerrit.wikimedia.org/r/1145209 (https://phabricator.wikimedia.org/T392790) [11:08:30] (03CR) 10Muehlenhoff: [C:03+2] Install the Puppet 7 agent in d-i for trixie as well [puppet] - 10https://gerrit.wikimedia.org/r/1145209 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [11:08:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P78433 and previous config saved to /var/cache/conftool/dbconfig/20250619-110854-marostegui.json [11:09:21] (03PS4) 10Muehlenhoff: Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) [11:15:36] (03CR) 10Clément Goubert: [C:03+1] admin.yaml: allow deployers to run mw-experimental-mediawiki-image-update [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [11:18:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:05] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161389 (owner: 10Muehlenhoff) [11:23:56] (03CR) 10Muehlenhoff: [C:03+2] Add Joanna to Bitu account managers [puppet] - 10https://gerrit.wikimedia.org/r/1161389 (owner: 
10Muehlenhoff) [11:24:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P78434 and previous config saved to /var/cache/conftool/dbconfig/20250619-112401-marostegui.json [11:34:12] (03CR) 10Brouberol: [C:03+1] replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161474 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [11:34:52] (03CR) 10Stevemunene: [C:03+2] replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161474 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [11:35:30] (03CR) 10Brouberol: [C:03+1] Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [11:36:42] (03Merged) 10jenkins-bot: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161474 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [11:36:49] (03CR) 10Btullis: [C:03+2] Arflow: fix issues with rendering YAML values to airflow config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [11:39:08] (03Merged) 10jenkins-bot: Arflow: fix issues with rendering YAML values to airflow config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161453 (https://phabricator.wikimedia.org/T396197) (owner: 10Btullis) [11:39:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T396130)', diff saved to https://phabricator.wikimedia.org/P78435 and previous config saved to /var/cache/conftool/dbconfig/20250619-113908-marostegui.json [11:39:14] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 
[11:39:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance [11:39:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T396130)', diff saved to https://phabricator.wikimedia.org/P78436 and previous config saved to /var/cache/conftool/dbconfig/20250619-113931-marostegui.json [11:41:11] (03PS11) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) [11:41:31] (03PS1) 10Btullis: Cleanup htmldumps role ready for decommisioning htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) [11:42:22] (03CR) 10Clément Goubert: [C:03+1] admin_ng: increase changeprop resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161467 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [11:42:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:43:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:44:10] (03PS2) 10Btullis: Cleanup htmldumps role ready for decommisioning htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) [11:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:46:23] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/codfw: maintenance [11:47:54] (03PS3) 10Btullis: Cleanup htmldumps role ready for decommisioning htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) [11:48:17] (03PS4) 10KartikMistry: machinetranslation: Use S3 storage for 
production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159696 (https://phabricator.wikimedia.org/T335491) [11:48:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6029/co" [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: 10Btullis) [11:52:27] (03PS1) 10Hnowlan: service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) [11:56:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T396130)', diff saved to https://phabricator.wikimedia.org/P78437 and previous config saved to /var/cache/conftool/dbconfig/20250619-115626-marostegui.json [11:56:31] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:57:58] (03PS5) 10LD: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) [11:59:37] (03CR) 10Clément Goubert: [C:03+1] service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) (owner: 10Hnowlan) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1200) [12:01:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in codfw/codfw: maintenance [12:03:00] (03CR) 10Hnowlan: [C:03+2] admin_ng: increase changeprop resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161467 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [12:04:09] (03CR) 10Btullis: [V:03+1] "Answered on ticket. 
We feel that the risk is low and we will just be experimenting with correlating operational metrics with analytics met" [puppet] - 10https://gerrit.wikimedia.org/r/1156823 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [12:05:09] (03CR) 10Btullis: [C:03+2] Remove obsolete analytics_cluster::postgresql role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1155720 (https://phabricator.wikimedia.org/T395557) (owner: 10Btullis) [12:06:56] (03CR) 10Btullis: [C:03+1] hadoop: remove check_procs based alerts in favor of SystemdUnitFailed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159385 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [12:08:44] (03CR) 10Muehlenhoff: [C:03+2] Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [12:08:52] !log jmm@dns1004 START - running authdns-update [12:09:00] (03CR) 10Filippo Giunchedi: [C:03+1] "SGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1156823 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [12:09:48] !log jmm@dns1004 END - running authdns-update [12:09:48] (03Merged) 10jenkins-bot: admin_ng: increase changeprop resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161467 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [12:11:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P78438 and previous config saved to /var/cache/conftool/dbconfig/20250619-121133-marostegui.json [12:11:46] !log hnowlan@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:12:38] !log hnowlan@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:12:51] !log hnowlan@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[12:13:12] (03CR) 10Muehlenhoff: "Looks good, two comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: 10Btullis) [12:14:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161477 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [12:14:27] !log hnowlan@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:14:33] (03PS1) 10Filippo Giunchedi: librenms: bandaid for librenms-syslog memory leak [puppet] - 10https://gerrit.wikimedia.org/r/1161491 (https://phabricator.wikimedia.org/T397427) [12:14:52] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161490 (https://phabricator.wikimedia.org/T392420) [12:16:54] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:17:30] (03CR) 10Dima koushha: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161490 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [12:17:54] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:19:05] (03CR) 10Jakob: [C:03+2] "deploying" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161490 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [12:19:14] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:19:31] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:20:57] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161490 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [12:21:14] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:21:29] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:21:47] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [12:22:04] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [12:22:25] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:22:48] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:23:35] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [12:23:49] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [12:24:06] (03PS4) 10Btullis: Cleanup htmldumps role ready for decommisioning htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) [12:24:24] (03CR) 10Btullis: Cleanup htmldumps role ready for decommisioning htmldumper1001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: 10Btullis) [12:24:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6030/co" [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: 10Btullis) [12:25:33] (03PS1) 10Jakob: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161501 [12:25:51] (03PS1) 10Muehlenhoff: Failover irc.w.o to irc2003 [dns] - 10https://gerrit.wikimedia.org/r/1161502 [12:26:02] 
(03CR) 10Dima koushha: [C:03+1] Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161501 (owner: 10Jakob) [12:26:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P78440 and previous config saved to /var/cache/conftool/dbconfig/20250619-122640-marostegui.json [12:26:55] (03CR) 10Jakob: [C:03+2] Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161501 (owner: 10Jakob) [12:28:38] (03Merged) 10jenkins-bot: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161501 (owner: 10Jakob) [12:29:02] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:29:10] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [12:29:24] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [12:29:39] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:30:11] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [12:30:20] (03CR) 10Brouberol: [C:03+1] Cleanup htmldumps role ready for decommisioning htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: 10Btullis) [12:30:27] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [12:31:02] (03CR) 10Effie Mouzeli: [C:03+1] service: remove ProxyFetch checks for kartotherian, thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1161485 (https://phabricator.wikimedia.org/T397148) (owner: 10Hnowlan) [12:31:29] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:31:30] (03CR) 10Btullis: [V:03+1 C:03+2] Cleanup htmldumps role ready for decommisioning 
htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: 10Btullis) [12:31:40] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:34:00] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts htmldumper1001.eqiad.wmnet [12:34:26] (03PS4) 10Filippo Giunchedi: thanos: add option for series limits to store [puppet] - 10https://gerrit.wikimedia.org/r/1160670 (https://phabricator.wikimedia.org/T394318) [12:34:26] (03PS4) 10Filippo Giunchedi: hieradata: set thanos sidecar and store series limits [puppet] - 10https://gerrit.wikimedia.org/r/1160750 (https://phabricator.wikimedia.org/T394318) [12:34:26] (03PS1) 10Filippo Giunchedi: thanos: split query-frontend logs into their own file [puppet] - 10https://gerrit.wikimedia.org/r/1161505 (https://phabricator.wikimedia.org/T394318) [12:35:06] (03CR) 10Filippo Giunchedi: [C:03+2] hadoop: remove check_procs based alerts in favor of SystemdUnitFailed [puppet] - 10https://gerrit.wikimedia.org/r/1159385 (https://phabricator.wikimedia.org/T357099) (owner: 10Filippo Giunchedi) [12:38:46] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl: switch CLI from native client to API client [puppet] - 10https://gerrit.wikimedia.org/r/1160476 (owner: 10Giuseppe Lavagetto) [12:39:46] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:41:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T396130)', diff saved to https://phabricator.wikimedia.org/P78441 and previous config saved to /var/cache/conftool/dbconfig/20250619-124148-marostegui.json [12:41:53] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:41:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2186,2196].codfw.wmnet with reason: Maintenance [12:42:03] !log marostegui@cumin1002 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2221.codfw.wmnet with reason: Maintenance [12:42:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T396130)', diff saved to https://phabricator.wikimedia.org/P78442 and previous config saved to /var/cache/conftool/dbconfig/20250619-124210-marostegui.json [12:42:14] (03PS1) 10Lucas Werkmeister (WMDE): Enable ScopedTypeaheadSearch on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) [12:43:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) (owner: 10Lucas Werkmeister (WMDE)) [12:44:14] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: htmldumper1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:44:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: htmldumper1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:44:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:44:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts htmldumper1001.eqiad.wmnet [12:48:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1222.eqiad.wmnet with reason: Maintenance [12:48:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [12:51:34] (03CR) 10Cathal Mooney: [C:03+1] "Worth doing for now yep" 
[puppet] - 10https://gerrit.wikimedia.org/r/1161491 (https://phabricator.wikimedia.org/T397427) (owner: 10Filippo Giunchedi) [12:51:37] (03PS1) 10Muehlenhoff: Switch mc-wf1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1161508 [12:54:50] (03CR) 10Ssingh: [C:03+1] hiera: Switch lvs4009 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1161434 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [12:56:58] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): decommission htmldumper1001.eqiad.wmnet - https://phabricator.wikimedia.org/T397434#10932048 (10BTullis) a:05BTullis→03None [12:57:25] (03CR) 10Filippo Giunchedi: [C:03+2] librenms: bandaid for librenms-syslog memory leak [puppet] - 10https://gerrit.wikimedia.org/r/1161491 (https://phabricator.wikimedia.org/T397427) (owner: 10Filippo Giunchedi) [12:59:46] (03PS3) 10Effie Mouzeli: admin.yaml: allow deployers to run mw-experimental-mediawiki-image-update [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) [13:00:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T396130)', diff saved to https://phabricator.wikimedia.org/P78443 and previous config saved to /var/cache/conftool/dbconfig/20250619-130003-marostegui.json [13:00:10] (03PS1) 10Muehlenhoff: memcached::instance: Remove support for Ferm-only syntax [puppet] - 10https://gerrit.wikimedia.org/r/1161511 [13:00:11] (03CR) 10Effie Mouzeli: [C:03+2] admin.yaml: allow deployers to run mw-experimental-mediawiki-image-update (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161469 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [13:00:54] PROBLEM - SSH on es2045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:01:06] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1161502 (owner: 10Muehlenhoff) 
[13:01:14] jouncebot? [13:01:22] jouncebot: now [13:01:22] For the next 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1300) [13:01:36] did it miss the beginning of the window? ovO [13:01:38] * o_O [13:01:48] anyway, I can deploy my patch [13:02:07] (03PS2) 10Effie Mouzeli: mediawiki_experimental: add motd [puppet] - 10https://gerrit.wikimedia.org/r/1161477 (https://phabricator.wikimedia.org/T396767) [13:02:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) (owner: 10Lucas Werkmeister (WMDE)) [13:02:37] Don't think so [13:02:57] PROBLEM - Host es2045 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:03:03] aha [13:03:15] (03CR) 10Lucas Werkmeister (WMDE): [C:04-2] Enable ScopedTypeaheadSearch on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) (owner: 10Lucas Werkmeister (WMDE)) [13:03:21] <_joe_> !incidents [13:03:22] 6375 (UNACKED) Host es2045 (paged) [13:03:22] something going on? [13:03:22] 6365 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [13:03:28] <_joe_> !ack 6375 [13:03:28] 6375 (ACKED) Host es2045 (paged) [13:03:41] <_joe_> federico3: is es2045 expected? [13:03:43] (I canceled the deploy) [13:03:55] Lucas_WMDE: you can deploy [13:04:05] <_joe_> Lucas_WMDE: wait just a sec until we confirm this was expected [13:04:10] ok _joe_ [13:04:20] <_joe_> marostegui: I gather it was? 
[13:04:31] _joe_: no it is not [13:04:36] there were also reports of Toolforge downtime in -cloud and I couldn’t load https://wm-bot.wmcloud.org/browser/index.php?display=%23wikimedia-operations so I thought something might be up with the network [13:04:46] but that might not affect production [13:04:49] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:04:56] <_joe_> marostegui: are you taking a look? [13:05:05] _joe_: federico3 is doing so [13:05:14] <_joe_> ok, ping me if you need me [13:05:21] port is up on the switch, but the host doesn't ping [13:05:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es2045', diff saved to https://phabricator.wikimedia.org/P78444 and previous config saved to /var/cache/conftool/dbconfig/20250619-130528-fceratto.json [13:05:37] <_joe_> Lucas_WMDE: I'm not a network engineer, but as long as wikipedia is reachable, I would bet it's not network issues [13:05:42] depooled [13:05:52] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs4009.ulsfo.wmnet with reason: switching to katran [13:06:02] fair enough but at that point I hadn’t yet checked if wikipedia was reachable ^^ [13:06:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4009.ulsfo.wmnet} and A:liberica (T396561) [13:06:13] I just wanted to quickly abort the deploy first, just in case [13:06:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4009.ulsfo.wmnet} and A:liberica (T396561) [13:06:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:06:42] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs4009 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1161434 (https://phabricator.wikimedia.org/T396561) (owner: 
10Vgutierrez) [13:06:50] _joe_: winning friends with netops :) [13:07:22] we lost stashbot btw [13:07:23] (03PS3) 10Tiziano Fogli: monitoring services: add migration task T384830 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155240 (https://phabricator.wikimedia.org/T395443) [13:07:24] tbf it's laggy [13:07:40] so I should probably not deploy on that basis alone [13:07:45] (03PS3) 10Effie Mouzeli: mediawiki_experimental: add motd [puppet] - 10https://gerrit.wikimedia.org/r/1161477 (https://phabricator.wikimedia.org/T396767) [13:07:48] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T384830 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155240 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:08:03] (fyi vgutierrez your !log’s are going to /dev/null because stashbot is down) [13:08:32] ack [13:08:40] federico3: fwiw es2045 unresponsive on serial console, system probably needs a kick [13:09:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on es2045.codfw.wmnet with reason: Host went down [13:09:52] (03PS1) 10Giuseppe Lavagetto: requestctl_client: allow using sudo [puppet] - 10https://gerrit.wikimedia.org/r/1161512 [13:10:01] it's even laggy on backporting: I did backport something for this window but it's no longer available; see [11:57:58] (03PS5) 10LD: frwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161478 (https://phabricator.wikimedia.org/T397063) [13:10:14] (03PS1) 10Tiziano Fogli: systemd::service: parameterize to support migration_task [puppet] - 10https://gerrit.wikimedia.org/r/1155618 (https://phabricator.wikimedia.org/T395443) [13:10:40] !incidents [13:10:40] 6375 (ACKED) Host es2045 (paged) [13:10:40] 6365 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [13:10:42] something went wrong for some reason 
[13:10:48] (03PS3) 10Tiziano Fogli: monitoring services: add migration task T370530 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155625 (https://phabricator.wikimedia.org/T395443) [13:11:01] (03CR) 10Muehlenhoff: [C:03+2] Failover irc.w.o to irc2003 [dns] - 10https://gerrit.wikimedia.org/r/1161502 (owner: 10Muehlenhoff) [13:11:02] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T370530 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155625 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:11:13] !log jmm@dns1004 START - running authdns-update [13:11:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:12:09] !log jmm@dns1004 END - running authdns-update [13:12:21] LD: did you schedule your config change for the backport+config window? I don’t see it on the calendar [13:13:09] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl_client: allow using sudo [puppet] - 10https://gerrit.wikimedia.org/r/1161512 (owner: 10Giuseppe Lavagetto) [13:13:15] I did, it was logged here at 11:57, but yeah it's not on the calendar for some reason [13:13:36] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1300). [13:13:36] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:13:58] good morning jouncebot [13:15:04] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:15:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P78445 and previous config saved to /var/cache/conftool/dbconfig/20250619-131510-marostegui.json [13:15:21] topranks: thanks [13:15:31] LD: you need to schedule your config change for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1300 (e.g. via the schedule-deployments tool), it doesn’t happen automatically [13:16:38] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:17:11] (03PS15) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [13:17:29] Lucas_WMDE did it through https://schedule-deployment.toolforge.org/backport/1161478 but I guess it stopped saving the action at some point [13:18:15] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki_experimental: add motd [puppet] - 10https://gerrit.wikimedia.org/r/1161477 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [13:18:18] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs4009.ulsfo.wmnet [13:18:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4009.ulsfo.wmnet [13:18:25] seems like it, I don’t see change 1161478 on the wiki page or in its recent history [13:18:57] marostegui / _joe_: are we okay to deploy by now or should I still hold? [13:19:07] stashbot seems to be back at least, so that wouldn’t be a blocker anymore [13:19:07] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[13:19:09] Lucas_WMDE: it is ok from my side [13:19:33] ok, then I’ll start again with my config change [13:19:45] <_joe_> Lucas_WMDE: sorry, yes [13:19:48] (03PS1) 10Jelto: make kubectl-completion alternative entry dependent on kubectl [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) [13:19:49] not yet sure if I’ll feel confident to deploy LD’s, but we’ll see [13:19:52] thanks! [13:19:53] (03PS1) 10Vgutierrez: hiera: Repool lvs4009 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1161514 (https://phabricator.wikimedia.org/T396561) [13:20:08] (03CR) 10Lucas Werkmeister (WMDE): "lifting deploy block, should be okay again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) (owner: 10Lucas Werkmeister (WMDE)) [13:20:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161514 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:20:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) (owner: 10Lucas Werkmeister (WMDE)) [13:20:37] (03CR) 10Jelto: "Is this still needed for v1.23? 
otherwise I can replicate that for v1.31" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [13:21:10] (03Merged) 10jenkins-bot: Enable ScopedTypeaheadSearch on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161506 (https://phabricator.wikimedia.org/T394670) (owner: 10Lucas Werkmeister (WMDE)) [13:21:24] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1161506|Enable ScopedTypeaheadSearch on Wikidata (T394670)]] [13:21:30] T394670: Enable Scoped Typeahead Search on Wikidata - https://phabricator.wikimedia.org/T394670 [13:21:34] (03CR) 10Ssingh: [C:03+1] hiera: Repool lvs4009 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1161514 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:21:51] o_O my spiderpig is broken [13:21:52] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs4009 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1161514 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:22:05] I get {"code":"needauth","message":"Please log in first","url":"https://idp.wikimedia.org/login?service=https%3A%2F%2Fspiderpig.wikimedia.org%2Fapi%2Flogin%3Fnext%3Dhttps%253A%252F%252Fspiderpig.wikimedia.org%252F"} [13:22:17] even on the front page [13:22:42] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [13:22:48] I’m logged in on idp at least… [13:23:09] 500 Internal Server Error on idm when I try to log in there [13:23:15] help :( [13:23:26] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1161506|Enable ScopedTypeaheadSearch on Wikidata (T394670)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:23:30] … [13:23:33] well, I can test my change, but [13:23:34] I registered the backport manually but it's fine for me to reschedule a window [13:23:49] might be fixed at some point [13:24:47] Lucas_WMDE: works for me, that's weird. delete the cookie maybe? [13:25:03] I’ll try [13:25:06] on which domain? :'D [13:25:35] ok on idm my only cookies are GeoIP apparently [13:25:44] I’ll try logging out of idp [13:26:04] Lucas_WMDE: I can take over to finish your deploy if you want [13:26:21] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4009.ulsfo.wmnet} and A:liberica (T396561) [13:26:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [13:26:26] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [13:26:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:26:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4009.ulsfo.wmnet} and A:liberica (T396561) [13:26:54] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:27:19] RECOVERY - Host es2045 #page is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [13:27:27] ok my spiderpig is totally messed up right now and I can’t even figure out how [13:27:32] tried to delete cookies already [13:27:38] claime: please do, AFAICT the feature mostly works [13:27:44] RECOVERY - SSH on es2045 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:27:46] ok, proceeding [13:27:48] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:27:49] (there might be a minor issue we’ll want to follow 
up on but nothing deploy blocking) [13:30:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P78446 and previous config saved to /var/cache/conftool/dbconfig/20250619-133018-marostegui.json [13:30:45] * Lucas_WMDE installing obs to try to get a screen recording of spiderpig [13:31:32] !log repool lvs4009 (upload) using katran - T396561 [13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [13:32:07] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:32:12] oh, now it wants a spiderpig-otp again [13:32:36] Lucas_WMDE: first step to livestreaming deploys on twitch, let's go [13:32:43] claime: `ssh lucaswerkmeister-wmde@ scap spiderpig-otp` would appear to be missing a hostname btw :P [13:32:51] huh [13:32:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161511 (owner: 10Muehlenhoff) [13:33:01] I had the otp prompt earlier and it showed deploy1003 as it should [13:33:04] huh [13:33:14] so that's weird [13:33:21] I remember it working at some point but that would’ve been weeks ago for me [13:33:46] (03PS1) 10Vgutierrez: hiera: Switch lvs4008 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1161518 (https://phabricator.wikimedia.org/T396561) [13:33:52] Lucas_WMDE: I had it less than 10 minutes ago when I tried to reproduce your issue :/ [13:33:58] o_O now it tells me `ssh @ scap spiderpig-otpz [13:34:00] * `ssh @ scap spiderpig-otp` [13:34:11] so something is still messed up [13:34:11] lolsob [13:34:36] you'll end up with `ssh @` [13:34:37] * Lucas_WMDE tries a private window [13:34:45] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161506|Enable 
ScopedTypeaheadSearch on Wikidata (T394670)]] (duration: 13m 20s) [13:34:50] T394670: Enable Scoped Typeahead Search on Wikidata - https://phabricator.wikimedia.org/T394670 [13:34:58] ok there it shows the hostname [13:35:17] ok in a private window everything works [13:35:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161518 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:35:25] who knows what0s messed up in my main session [13:35:26] * what’s [13:37:14] LD, your change should at least have a +1 before I deploy it. Dreamy_Jazz you around? [13:37:48] (03CR) 10Ssingh: [C:03+1] hiera: Switch lvs4008 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1161518 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:38:07] the only cookies firefox shows me are GeoIP, mjx.menu, NetworkProbeLimit, WMF-Last-Access and WMF-Uniq [13:38:11] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs4008.ulsfo.wmnet with reason: switching to katran [13:38:16] so I don’t know what I would even need to delete [13:38:45] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4008.ulsfo.wmnet} and A:liberica (T396561) [13:38:50] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [13:38:52] * Lucas_WMDE tries deleting the three things in local storage just in case [13:38:57] Do you have errors in the web dev console or something? 
[13:39:45] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4008.ulsfo.wmnet} and A:liberica (T396561) [13:39:56] many [13:40:01] claime ideally true but core-permissions can be +1 by the deployer as long there's no syntax error and there's consensus ; anyway I'm not sure it will be deployed today as long the bug runs [13:40:08] such as Error: Auth failed for call to /api/monitoring/logs/mediawiki: Please log in first [13:41:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161508 (owner: 10Muehlenhoff) [13:42:22] jouncebot: nowandnext [13:42:22] For the next 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1300) [13:42:22] In 0 hour(s) and 47 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1430) [13:42:23] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:45:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T396130)', diff saved to https://phabricator.wikimedia.org/P78447 and previous config saved to /var/cache/conftool/dbconfig/20250619-134525-marostegui.json [13:45:31] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:45:40] Yeah I'd rather have someone from TSP +1 or deploy it tbh, as I'm not aware of that policy [13:45:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2222.codfw.wmnet with reason: Maintenance [13:45:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T396130)', diff saved to https://phabricator.wikimedia.org/P78448 and previous config saved to 
/var/cache/conftool/dbconfig/20250619-134548-marostegui.json [13:46:33] claime all's good, I CC'ed DJ [13:46:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:47:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161511 (owner: 10Muehlenhoff) [13:47:54] (03CR) 10JMeybohm: "Not actually required but still nice to have. I think we will have 1.23 around for some time still" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [13:47:58] I gotta go, it can be deployed at anytime - no test needed - but I brb tonight if you prefer waiting [13:53:33] (03PS1) 10Jakob: Don't cache i18n json files in WDQS UI [puppet] - 10https://gerrit.wikimedia.org/r/1161517 (https://phabricator.wikimedia.org/T397452) [13:54:34] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2021.codfw.wmnet with reason: remove for decom [13:56:07] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.11-1wm1_amd64.changes: T397456 [13:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:12] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [13:56:44] filed https://phabricator.wikimedia.org/T397457 for my SpiderPig woes FTR [13:56:54] !log UTC afternoon backport+config window done [13:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:05] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp4037*} and A:cp - 9.2.11 upgrade (T397456) [13:57:31] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum3003 [puppet] - 10https://gerrit.wikimedia.org/r/1159498 (owner: 10Ssingh) [13:57:39] (03CR) 10Ssingh: [V:03+2 C:03+2] hiera: set do_ech to false for durum3003 
[puppet] - 10https://gerrit.wikimedia.org/r/1159498 (owner: 10Ssingh) [13:57:51] (03PS2) 10Ssingh: hiera: set do_ech to false for durum3003 [puppet] - 10https://gerrit.wikimedia.org/r/1159498 [13:58:49] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum3003 [puppet] - 10https://gerrit.wikimedia.org/r/1159498 (owner: 10Ssingh) [13:59:04] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161511 (owner: 10Muehlenhoff) [13:59:07] (03PS1) 10Jelto: make kubectl-completion alternative entry dependent on kubectl (v1.31) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161526 (https://phabricator.wikimedia.org/T387548) [13:59:50] (03CR) 10Jelto: "ack! v1.31 change I56b85e7f5cb8676cbe1704cf09cd30064abd7cca" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [13:59:56] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum3003.esams.wmnet with OS bookworm [14:01:36] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp4037*} and A:cp - 9.2.11 upgrade (T397456) [14:01:42] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [14:03:22] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:03:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T396130)', diff saved to https://phabricator.wikimedia.org/P78449 and previous config saved to /var/cache/conftool/dbconfig/20250619-140334-marostegui.json [14:03:38] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs4008.ulsfo.wmnet [14:03:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4008.ulsfo.wmnet [14:03:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - 
https://phabricator.wikimedia.org/T396130 [14:04:06] BFD on asw1-by27-esams expected [14:04:21] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp7002*} and A:cp - 9.2.11 upgrade (T397456) [14:09:01] (03PS1) 10Vgutierrez: hiera: Repool lvs4008 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1161528 (https://phabricator.wikimedia.org/T396561) [14:09:13] (03CR) 10Ssingh: [C:03+1] hiera: Repool lvs4008 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1161528 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:09:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp7002*} and A:cp - 9.2.11 upgrade (T397456) [14:09:27] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [14:09:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161528 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:12:10] (03PS2) 10Ssingh: hiera: durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) [14:16:50] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs4008 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1161528 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:18:16] (03CR) 10Ssingh: [C:03+1] prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:18:20] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 13 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:18:24] (03CR) 10Jakob: "Hello @jhathaway@wikimedia.org and @mmuhlenhoff@wikimedia.org! 
I'm adding you as reviewers since you're listed as the deployers for the u" [puppet] - 10https://gerrit.wikimedia.org/r/1161517 (https://phabricator.wikimedia.org/T397452) (owner: 10Jakob) [14:18:26] (03CR) 10Ssingh: [C:03+1] liberica: Report CPUs handling NIC queues [puppet] - 10https://gerrit.wikimedia.org/r/1161468 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:18:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P78450 and previous config saved to /var/cache/conftool/dbconfig/20250619-141841-marostegui.json [14:19:44] (03CR) 10Ssingh: [C:03+1] cacheproxy: Report CPUs assigned to NIC queues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:20:57] (03CR) 10Vgutierrez: cacheproxy: Report CPUs assigned to NIC queues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:21:37] (03CR) 10Ssingh: [C:03+1] cacheproxy: Report CPUs assigned to NIC queues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161476 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:22:24] (03PS1) 10Jgiannelos: RB sunset: Configure claim root TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161531 [14:22:40] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs4008.ulsfo.wmnet} and A:liberica (T396561) [14:22:45] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [14:22:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs4008.ulsfo.wmnet} and A:liberica (T396561) [14:23:16] !log repool lvs4008 (text) using katran - T396561 [14:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:23:37] (03CR) 10Muehlenhoff: "Thanks for the reviews, much appreciated! I'll update the patch to incorporate all feedback" [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [14:23:40] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [14:26:58] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [14:28:29] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1430) [14:33:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P78451 and previous config saved to /var/cache/conftool/dbconfig/20250619-143348-marostegui.json [14:36:05] !log installing twitter-bootstrap3 security updates [14:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:07] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [14:41:16] !log removing ganeti2021 from codfw cluster for decom T396590 [14:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:21] T396590: Move ganeti2045-ganeti2050 into production and decom 
ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590 [14:41:22] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:42:12] (03CR) 10Vgutierrez: [C:03+2] prometheus: Add NIC queue CPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/1160764 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:43:15] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [14:44:22] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:44:23] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:44:27] (03CR) 10Vgutierrez: [C:03+2] liberica: Report CPUs handling NIC queues [puppet] - 10https://gerrit.wikimedia.org/r/1161468 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:45:07] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-staging-etcd2003.codfw.wmnet to plain [14:45:39] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:46:10] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-staging-etcd2003.codfw.wmnet to plain [14:46:11] (03PS1) 10Alexandros Kosiaris: calico: Switch default-deny to using services instead of ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161533 (https://phabricator.wikimedia.org/T397341) [14:46:22] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3003.esams.wmnet with OS bookworm [14:46:50] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum3004 [puppet] - 10https://gerrit.wikimedia.org/r/1159499 (owner: 10Ssingh) [14:46:56] (03PS2) 10Ssingh: hiera: set do_ech to false for durum3004 [puppet] - 10https://gerrit.wikimedia.org/r/1159499 [14:46:58] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-staging-etcd2001.codfw.wmnet to plain [14:47:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-staging-etcd2001.codfw.wmnet to plain [14:48:06] (03PS5) 10Seanleong-wmde: Create feature flags to resolve Wikibase item labels on the Watchlist. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [14:48:24] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum3004 [puppet] - 10https://gerrit.wikimedia.org/r/1159499 (owner: 10Ssingh) [14:48:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10932449 (10MoritzMuehlenhoff) [14:48:50] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum3004.esams.wmnet with OS bookworm [14:48:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T396130)', diff saved to https://phabricator.wikimedia.org/P78452 and previous config saved to /var/cache/conftool/dbconfig/20250619-144855-marostegui.json [14:49:01] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:49:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [14:50:49] (03PS1) 10Vgutierrez: liberica: Fix nic-queue-cpu outfile path [puppet] - 10https://gerrit.wikimedia.org/r/1161534 (https://phabricator.wikimedia.org/T397303) [14:51:28] (03PS1) 10Alexandros Kosiaris: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 [14:51:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161534 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:51:59] (03CR) 10Ssingh: [C:03+1] "Sorry for missing it in the review" [puppet] - 10https://gerrit.wikimedia.org/r/1161534 (https://phabricator.wikimedia.org/T397303) (owner: 
10Vgutierrez) [14:52:30] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:18] (03CR) 10Vgutierrez: [C:03+2] liberica: Fix nic-queue-cpu outfile path [puppet] - 10https://gerrit.wikimedia.org/r/1161534 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [14:54:33] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2021.codfw.wmnet [14:55:08] PROBLEM - ganeti-confd running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:55:08] PROBLEM - ganeti-noded running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:56:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2218.codfw.wmnet with reason: Maintenance [14:56:50] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2021.codfw.wmnet with reason: remove for decom [14:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:58:17] (03CR) 10CI reject: [V:04-1] calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 (owner: 10Alexandros Kosiaris) [15:05:00] (03CR) 10Clément Goubert: [C:03+1] calico: Switch default-deny to using services instead of ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161533 (https://phabricator.wikimedia.org/T397341) (owner: 10Alexandros Kosiaris) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:13:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:13:47] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3004.esams.wmnet with reason: host reimage [15:13:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T396130)', diff saved to https://phabricator.wikimedia.org/P78453 and previous config saved to /var/cache/conftool/dbconfig/20250619-151402-marostegui.json [15:14:07] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:14:59] (03CR) 10Hnowlan: [C:03+1] "Could you add a note to values.yaml to note that the claim_ttl also sets root ttl please? Otherwise lgtm." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1161531 (owner: 10Jgiannelos) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:07] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3004.esams.wmnet with reason: host reimage [15:17:36] (03PS2) 10Jgiannelos: RB sunset: Configure claim root TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161531 [15:18:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:18:47] (03PS3) 10Jgiannelos: RB sunset: Configure claim root TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161531 [15:19:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T396130)', diff saved to https://phabricator.wikimedia.org/P78454 and previous config saved to /var/cache/conftool/dbconfig/20250619-151907-marostegui.json [15:19:12] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:20:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:21:49] (03CR) 10Jgiannelos: [C:03+2] RB sunset: Configure claim root TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161531 (owner: 10Jgiannelos) [15:22:55] jouncebot: nowandnext [15:22:55] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [15:22:55] In 0 hour(s) and 37 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1600) [15:23:32] (03Merged) 10jenkins-bot: RB sunset: Configure claim root TTL for PCS related endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161531 (owner: 10Jgiannelos) [15:26:19] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:26:38] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:26:56] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:27:32] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:30:28] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:30:34] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm [15:30:47] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:31:11] (03PS2) 10Alexandros Kosiaris: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 [15:31:23] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:31:27] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:31:38] (03PS3) 10Alexandros Kosiaris: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 [15:34:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to 
https://phabricator.wikimedia.org/P78455 and previous config saved to /var/cache/conftool/dbconfig/20250619-153414-marostegui.json [15:35:29] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:36:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3004.esams.wmnet with OS bookworm [15:37:34] (03CR) 10Btullis: [V:03+1 C:03+2] Presto: Add a prometheus connector pointing to thanos [puppet] - 10https://gerrit.wikimedia.org/r/1156823 (https://phabricator.wikimedia.org/T347430) (owner: 10Btullis) [15:43:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:43:49] (03PS3) 10Ssingh: hiera: durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) [15:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:46:10] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:48:04] (03PS1) 10Gmodena: services: mw-page-content-change : raise JobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161554 (https://phabricator.wikimedia.org/T397336) [15:48:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:49:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P78456 and previous config saved to /var/cache/conftool/dbconfig/20250619-154921-marostegui.json [15:49:40] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6033/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1156355 
(https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:50:12] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2012 [15:50:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2012 [15:51:21] (03PS4) 10Ssingh: hiera: durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) [15:52:34] (03PS5) 10Ssingh: hiera: durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) [15:53:36] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2012 [15:53:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2012 [15:53:51] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:55:22] (03PS1) 10Hnowlan: changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) [15:56:31] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:37] (03CR) 10CI reject: [V:04-1] changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [15:56:38] (03CR) 10Vgutierrez: [C:03+1] hiera: durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:58:17] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6035/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:58:50] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: 
durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:59:22] (03PS2) 10Hnowlan: changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) [15:59:32] 06SRE, 06Traffic, 06Traffic-Icebox, 07HTTPS, and 2 others: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378#10932632 (10ssingh) The ECH experiment has been reverted as of today. [16:00:05] jhathaway and moritzm: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1600). [16:00:05] jakob_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:44] (03CR) 10CI reject: [V:04-1] changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [16:03:05] (03PS3) 10Hnowlan: changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) [16:03:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:04:08] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [16:04:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T396130)', diff saved to https://phabricator.wikimedia.org/P78457 and previous config saved to /var/cache/conftool/dbconfig/20250619-160429-marostegui.json 
[16:04:34] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:04:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:04:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T396130)', diff saved to https://phabricator.wikimedia.org/P78458 and previous config saved to /var/cache/conftool/dbconfig/20250619-160451-marostegui.json [16:04:52] o/ Jake_Park [16:05:09] jhathaway: did you mean to ping me? :) [16:05:31] yup, you have a patch out for the puppet window? [16:05:39] yes! [16:05:46] (03PS1) 10Cparle: Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) [16:05:48] shall I merge? [16:06:27] yes please! [16:06:32] (03CR) 10JHathaway: [C:03+2] Don't cache i18n json files in WDQS UI [puppet] - 10https://gerrit.wikimedia.org/r/1161517 (https://phabricator.wikimedia.org/T397452) (owner: 10Jakob) [16:07:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [16:07:54] (03CR) 10CI reject: [V:04-1] Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) (owner: 10Cparle) [16:08:06] jakob_WMDE: merged [16:08:09] can you test? [16:08:17] (03PS4) 10Hnowlan: changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) [16:08:20] which hosts does this apply to? 
[16:08:45] I can test it on query.wikidata.org [16:09:38] jouncebot: nowandnext [16:09:38] For the next 0 hour(s) and 50 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1600) [16:09:38] In 0 hour(s) and 50 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1700) [16:09:38] In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1700) [16:10:13] Anyone mind if I do a security deploy? [16:10:32] Or should I wait until the puppet window is finished [16:11:43] I think you can go ahead Dreamy_Jazz [16:11:49] Thanks [16:13:47] jhathaway: am I supposed to see the effect of the change already? [16:17:19] (03PS1) 10Vgutierrez: prometheus::node_nic_queue_cpu: Define script once [puppet] - 10https://gerrit.wikimedia.org/r/1161566 (https://phabricator.wikimedia.org/T397303) [16:19:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161566 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [16:21:56] (03PS2) 10Cparle: Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) [16:22:23] (03CR) 10Ssingh: [C:03+1] prometheus::node_nic_queue_cpu: Define script once [puppet] - 10https://gerrit.wikimedia.org/r/1161566 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [16:22:29] (03CR) 10Vgutierrez: [C:03+2] prometheus::node_nic_queue_cpu: Define script once [puppet] - 10https://gerrit.wikimedia.org/r/1161566 (https://phabricator.wikimedia.org/T397303) (owner: 10Vgutierrez) [16:23:59] !log dreamyjazz Deployed security patch for T396750 [16:24:54] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm [16:26:39] !log
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T396130)', diff saved to https://phabricator.wikimedia.org/P78459 and previous config saved to /var/cache/conftool/dbconfig/20250619-162639-marostegui.json [16:26:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:27:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.153s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:29:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:34] jhathaway: did the puppet run for my change happen already? 
[16:32:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.241s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:32:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:33:28] !log dreamyjazz Deployed security patch for T396750 [16:39:58] jakob_WMDE: should be rolled out now [16:40:09] * jakob_WMDE checks [16:40:49] ugh, looks like it didn't work. [16:41:17] `curl -sI https://query.wikidata.org/i18n/en.json | grep -i cache-control` still says "cache-control: max-age=3600, must-revalidate". we want it to be "cache-control: no-cache". [16:41:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P78460 and previous config saved to /var/cache/conftool/dbconfig/20250619-164146-marostegui.json [16:43:19] jakob_WMDE: yeah I'm getting the same result [16:43:35] !log dreamyjazz Deployed security patch for T397196 [16:44:02] jhathaway: do you spot anything wrong with the patch? I must admit I couldn't figure out an easy way to test this [16:44:51] nothing obvious, jakob_WMDE [16:45:02] :/ [16:48:18] jhathaway: I'm seeing some manual apache reloading happened after a similar change in T301461#8017646. is that something we would need to do here as well? 
[16:48:19] T301461: Investigate cache issues after WDQS UI deployments - https://phabricator.wikimedia.org/T301461 [16:49:09] hmm, could be [16:50:25] hmm, reloaded, but no change [16:51:17] the query service UI also recently switched to the kubernetes infrastructure. I don't know how exactly this setup works and what might need to be poked... [16:51:26] ah [16:51:55] (03CR) 10Jgiannelos: [C:03+1] changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [16:52:57] !log dreamyjazz Deployed security patch for T397196 [16:54:34] One more security deploy to do (currently in progress) and then I'll be done [16:56:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P78461 and previous config saved to /var/cache/conftool/dbconfig/20250619-165653-marostegui.json [16:59:00] (03CR) 10Hnowlan: [C:03+2] changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [17:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1700). 
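Note on the 16:41:17 check: the `curl -sI ... | grep -i cache-control` comparison can be repeated offline against saved header output. The following is only a rough sketch, not anything run in the channel. The function names `cache_control` and `is_no_cache` and the `HTTP/2 200` status line in the sample strings are assumptions made for illustration; the two `cache-control` values are the ones quoted in the log.

```python
from typing import Optional


def cache_control(raw_headers: str) -> Optional[str]:
    """Return the Cache-Control value from raw response headers, or None."""
    for line in raw_headers.splitlines():
        # Header lines look like "name: value"; the status line has no colon.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "cache-control":
            return value.strip()
    return None


def is_no_cache(raw_headers: str) -> bool:
    """True if the Cache-Control header carries the no-cache directive."""
    value = cache_control(raw_headers)
    if value is None:
        return False
    directives = {d.strip().lower() for d in value.split(",")}
    return "no-cache" in directives


# Header value reported in the channel (status line is an assumed placeholder).
observed = "HTTP/2 200\r\ncache-control: max-age=3600, must-revalidate\r\n"
# Header value the patch was meant to produce.
wanted = "HTTP/2 200\r\ncache-control: no-cache\r\n"

print(is_no_cache(observed), is_no_cache(wanted))
```

Feeding it the saved output of `curl -sI` for the i18n URL after each attempted fix (puppet run, apache reload) would give a yes/no answer without eyeballing the header each time.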
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1700) [17:00:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:01:11] (03Merged) 10jenkins-bot: changeprop: pcs concurrency config at values level, bump native transcludes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161559 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [17:02:22] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.821s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:02:25] (03CR) 10Majavah: [C:03+1] memcached::instance: Remove support for Ferm-only syntax [puppet] - 10https://gerrit.wikimedia.org/r/1161511 (owner: 10Muehlenhoff) [17:02:58] jhathaway> 0/ [17:03:10] !log dreamyjazz Deployed security patch for T397088 [17:03:23] Finished deploying security patches [17:04:35] jhathaway: I'm still unsure whether the change itself is the problem, or if it's something else that's off. can you think of anything else to try? [17:05:25] otherwise I think I'll have to try this again tomorrow (or next week, probably) and make sure I understand this setup better. 
getting late here :) [17:07:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.874s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:08:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [17:08:49] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [17:09:25] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [17:09:42] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [17:10:12] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [17:10:25] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [17:12:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T396130)', diff saved to https://phabricator.wikimedia.org/P78462 and previous config saved to /var/cache/conftool/dbconfig/20250619-171201-marostegui.json [17:12:06] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:12:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:13:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:19:39] (03PS1) 10Stevemunene: blunderbuss: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161573 (https://phabricator.wikimedia.org/T374922) [17:28:35] FIRING: [2x] 
ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:29:44] (03PS1) 10Ssingh: prometheus: check if collector file is already defined [puppet] - 10https://gerrit.wikimedia.org/r/1161576 (https://phabricator.wikimedia.org/T397303) [17:30:26] (03CR) 10Vgutierrez: [C:03+1] "L8 is melted here apparently 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1161576 (https://phabricator.wikimedia.org/T397303) (owner: 10Ssingh) [17:30:47] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6036/co" [puppet] - 10https://gerrit.wikimedia.org/r/1161576 (https://phabricator.wikimedia.org/T397303) (owner: 10Ssingh) [17:31:08] (03CR) 10Ssingh: [V:03+1 C:03+2] prometheus: check if collector file is already defined [puppet] - 10https://gerrit.wikimedia.org/r/1161576 (https://phabricator.wikimedia.org/T397303) (owner: 10Ssingh) [17:31:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [17:31:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T396130)', diff saved to https://phabricator.wikimedia.org/P78464 and previous config saved to /var/cache/conftool/dbconfig/20250619-173142-marostegui.json [17:31:48] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:32:22] !log forcing agent run on A:liberica-drmrs to merge CR 1161576 [17:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] (03PS1) 10MusikAnimal: tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) [17:37:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P78465 and previous config saved to /var/cache/conftool/dbconfig/20250619-173737-marostegui.json [17:37:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:39:18] (03PS1) 10Jhancock.wm: Adding build2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1161579 (https://phabricator.wikimedia.org/T393015) [17:42:19] jakob_WMDE: sounds good, happy to help you next week [17:42:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:42:53] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161579 (https://phabricator.wikimedia.org/T393015) (owner: 10Jhancock.wm) [17:43:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:44:11] (03CR) 10Jhancock.wm: [C:03+2] Adding build2003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1161579 (https://phabricator.wikimedia.org/T393015) (owner: 10Jhancock.wm) [17:44:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:46:13] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:49:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:37] (03PS2) 10Msz2001: Set category collation to `uca-pl-u-kn` for plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161581 (https://phabricator.wikimedia.org/T397466) [17:52:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P78466 and previous config saved to /var/cache/conftool/dbconfig/20250619-175244-marostegui.json [17:53:15] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:53:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:53:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:53:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:54:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161581 (https://phabricator.wikimedia.org/T397466) (owner: 10Msz2001) [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T1800) [18:07:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to 
https://phabricator.wikimedia.org/P78467 and previous config saved to /var/cache/conftool/dbconfig/20250619-180751-marostegui.json [18:22:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T396130)', diff saved to https://phabricator.wikimedia.org/P78468 and previous config saved to /var/cache/conftool/dbconfig/20250619-182258-marostegui.json [18:23:04] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:23:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T396130)', diff saved to https://phabricator.wikimedia.org/P78469 and previous config saved to /var/cache/conftool/dbconfig/20250619-182320-marostegui.json [18:28:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T396130)', diff saved to https://phabricator.wikimedia.org/P78470 and previous config saved to /var/cache/conftool/dbconfig/20250619-182817-marostegui.json [18:28:27] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:28:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:15] (03CR) 10LD: [C:03+1] Set category collation to `uca-pl-u-kn` for plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161581 (https://phabricator.wikimedia.org/T397466) (owner: 10Msz2001) [18:43:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P78471 and previous config saved to 
/var/cache/conftool/dbconfig/20250619-184325-marostegui.json [18:48:34] (03PS1) 10Bartosz Dziewoński: PageChangeEmissionTest: order move events by kind. [extensions/EventBus] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161588 (https://phabricator.wikimedia.org/T397087) [18:50:58] (03CR) 10Brouberol: [C:03+1] blunderbuss: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161573 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [18:53:47] (03PS8) 10Brouberol: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [18:58:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P78472 and previous config saved to /var/cache/conftool/dbconfig/20250619-185832-marostegui.json [18:59:46] (03PS2) 10Bartosz Dziewoński: DomainEvents: Constant repeating notifications [extensions/Echo] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161589 (https://phabricator.wikimedia.org/T397103) [19:00:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/EventBus] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161588 (https://phabricator.wikimedia.org/T397087) (owner: 10Bartosz Dziewoński) [19:01:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/Echo] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161589 (https://phabricator.wikimedia.org/T397103) (owner: 10Bartosz Dziewoński) [19:03:02] (03CR) 10Gmodena: [C:03+1] PageChangeEmissionTest: order move events by kind. 
[extensions/EventBus] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161588 (https://phabricator.wikimedia.org/T397087) (owner: 10Bartosz Dziewoński) [19:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T396130)', diff saved to https://phabricator.wikimedia.org/P78473 and previous config saved to /var/cache/conftool/dbconfig/20250619-191339-marostegui.json [19:13:45] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [19:13:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [19:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T396130)', diff saved to https://phabricator.wikimedia.org/P78474 and previous config saved to /var/cache/conftool/dbconfig/20250619-191401-marostegui.json [19:18:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T396130)', diff saved to https://phabricator.wikimedia.org/P78475 and previous config saved to /var/cache/conftool/dbconfig/20250619-191848-marostegui.json [19:18:54] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [19:19:00] (03PS1) 10Brouberol: airflow: restore the AIRFLOW_KERBEROS_HOSTNAME variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161592 [19:21:54] (03CR) 10Brouberol: [C:03+2] airflow: restore the AIRFLOW_KERBEROS_HOSTNAME variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161592 (owner: 10Brouberol) [19:23:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:24:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:25:23] !log brouberol@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:33:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:33:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P78476 and previous config saved to /var/cache/conftool/dbconfig/20250619-193355-marostegui.json [19:35:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [19:35:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [19:36:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [19:36:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [19:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:49:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P78477 and previous config saved to /var/cache/conftool/dbconfig/20250619-194902-marostegui.json [19:51:57] jouncebot: next [19:51:57] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T2000) [19:59:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC 
late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T2000). [20:00:04] msz2001, MatmaRex, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] Hi! Can someone deploy for me this patch, please? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1161581 [20:01:25] hi [20:01:32] (i'm not a deployer) [20:01:32] \o [20:01:41] I can deploy [20:02:01] o/ [20:03:19] starting with the config patches [20:03:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [20:03:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161581 (https://phabricator.wikimedia.org/T397466) (owner: 10Msz2001) [20:03:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T396130)', diff saved to https://phabricator.wikimedia.org/P78478 and previous config saved to /var/cache/conftool/dbconfig/20250619-200409-marostegui.json [20:04:11] (03Merged) 10jenkins-bot: Configure instrument for CheckUser - UserInfoCard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159626 (https://phabricator.wikimedia.org/T386440) (owner: 10Mimurawil) [20:04:14] (03Merged) 10jenkins-bot: Set category collation to `uca-pl-u-kn` for plwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161581 (https://phabricator.wikimedia.org/T397466) (owner: 10Msz2001) [20:04:15] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index 
to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [20:04:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [20:04:31] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1159626|Configure instrument for CheckUser - UserInfoCard (T386440)]], [[gerrit:1161581|Set category collation to `uca-pl-u-kn` for plwikiquote (T397466)]] [20:04:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T396130)', diff saved to https://phabricator.wikimedia.org/P78479 and previous config saved to /var/cache/conftool/dbconfig/20250619-200432-marostegui.json [20:04:38] T386440: UserInfoCard: Instrument the feature - https://phabricator.wikimedia.org/T386440 [20:04:38] T397466: Add 'plwikiquote' => 'uca-pl-u-kn' in wgCategoryCollation in InitialiseSettings.php - https://phabricator.wikimedia.org/T397466 [20:04:41] MatmaRex: are you able to verify https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/1161589 when I sync it? [20:04:46] FYI, my backports can go out together, and i don't have anything to test on mwdebug [20:04:51] ok [20:06:43] !log kharlan@deploy1003 kharlan, msz2001, mimurawil: Backport for [[gerrit:1159626|Configure instrument for CheckUser - UserInfoCard (T386440)]], [[gerrit:1161581|Set category collation to `uca-pl-u-kn` for plwikiquote (T397466)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:46] Msz2001: are you able to verify your patch on plwikiquote? 
[20:08:00] Yes, works as intended [20:08:08] !log kharlan@deploy1003 kharlan, msz2001, mimurawil: Continuing with sync [20:08:12] cool [20:09:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T396130)', diff saved to https://phabricator.wikimedia.org/P78480 and previous config saved to /var/cache/conftool/dbconfig/20250619-200921-marostegui.json [20:09:26] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [20:11:30] (03CR) 10Krinkle: "Done. https://wikitech.wikimedia.org/w/index.php?title=X-Analytics&diff=2315515&oldid=2162232" [puppet] - 10https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: 10Krinkle) [20:15:02] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159626|Configure instrument for CheckUser - UserInfoCard (T386440)]], [[gerrit:1161581|Set category collation to `uca-pl-u-kn` for plwikiquote (T397466)]] (duration: 10m 30s) [20:15:10] T386440: UserInfoCard: Instrument the feature - https://phabricator.wikimedia.org/T386440 [20:15:10] T397466: Add 'plwikiquote' => 'uca-pl-u-kn' in wgCategoryCollation in InitialiseSettings.php - https://phabricator.wikimedia.org/T397466 [20:15:34] Thanks! [20:15:40] no problem! [20:16:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/EventBus] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161588 (https://phabricator.wikimedia.org/T397087) (owner: 10Bartosz Dziewoński) [20:16:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/Echo] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161589 (https://phabricator.wikimedia.org/T397103) (owner: 10Bartosz Dziewoński) [20:17:07] (03Merged) 10jenkins-bot: PageChangeEmissionTest: order move events by kind. 
[extensions/EventBus] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161588 (https://phabricator.wikimedia.org/T397087) (owner: 10Bartosz Dziewoński) [20:24:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P78481 and previous config saved to /var/cache/conftool/dbconfig/20250619-202428-marostegui.json [20:26:08] (03Merged) 10jenkins-bot: DomainEvents: Constant repeating notifications [extensions/Echo] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161589 (https://phabricator.wikimedia.org/T397103) (owner: 10Bartosz Dziewoński) [20:26:26] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1161588|PageChangeEmissionTest: order move events by kind. (T397087)]], [[gerrit:1161589|DomainEvents: Constant repeating notifications (T397103)]] [20:26:33] T397087: phpunit\integration\PageChangeEmissionTest::testPageMove with data set "Valid move with redirect" ('SourcePageA', 'DestinationPageA', true, 3) - https://phabricator.wikimedia.org/T397087 [20:26:33] T397103: Constant repeating notifications - https://phabricator.wikimedia.org/T397103 [20:27:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:28:32] !log kharlan@deploy1003 kharlan, matmarex: Backport for [[gerrit:1161588|PageChangeEmissionTest: order move events by kind. (T397087)]], [[gerrit:1161589|DomainEvents: Constant repeating notifications (T397103)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:28:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:29:42] !log kharlan@deploy1003 kharlan, matmarex: Continuing with sync [20:29:50] kostajh: mwdebug looks good, i don't have anything specific to test [20:29:56] sync [20:29:59] *syncing [20:32:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.317s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:33:51] (03PS1) 10Jforrester: ApiQueryZFunctionReference: Return an actual empty array instead of [false] [extensions/WikiLambda] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161622 (https://phabricator.wikimedia.org/T396978) [20:36:35] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161588|PageChangeEmissionTest: order move events by kind. 
(T397087)]], [[gerrit:1161589|DomainEvents: Constant repeating notifications (T397103)]] (duration: 10m 08s) [20:36:41] T397087: phpunit\integration\PageChangeEmissionTest::testPageMove with data set "Valid move with redirect" ('SourcePageA', 'DestinationPageA', true, 3) - https://phabricator.wikimedia.org/T397087 [20:36:41] T397103: Constant repeating notifications - https://phabricator.wikimedia.org/T397103 [20:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P78482 and previous config saved to /var/cache/conftool/dbconfig/20250619-203935-marostegui.json [20:40:21] ok, all done [20:40:41] !log UTC late deploys done [20:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:54] thanks for the party [20:41:13] :D [20:43:37] thanks kostajh [20:47:12] Hi! 30 mins ago, I've had a change of category collation deployed to plwikiquote. Apparently, for a full effect on existing pages, it requires updateCollation.php maintenance script to be invoked. Can I ask someone to do it? [20:50:32] oops. 
kostajh: ^ [20:54:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T396130)', diff saved to https://phabricator.wikimedia.org/P78483 and previous config saved to /var/cache/conftool/dbconfig/20250619-205443-marostegui.json [20:54:48] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [20:54:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [20:55:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T396130)', diff saved to https://phabricator.wikimedia.org/P78484 and previous config saved to /var/cache/conftool/dbconfig/20250619-205505-marostegui.json [20:55:11] if kosta isn't around any more, you can schedule this for the next deployment window next week… nothing terrible will happen if we don't run the script, but the order in categories will increasingly get jumbled up as the pages are edited, until the script run happens [20:56:38] okay [20:59:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T396130)', diff saved to https://phabricator.wikimedia.org/P78485 and previous config saved to /var/cache/conftool/dbconfig/20250619-205955-marostegui.json [21:00:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250619T2100) [21:00:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:04] subtasked it as T397483 [21:10:05] T397483: plwikiquote: run updateCollation.php - 
https://phabricator.wikimedia.org/T397483 [21:13:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:15:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P78486 and previous config saved to /var/cache/conftool/dbconfig/20250619-211502-marostegui.json [21:24:12] LD MatmaRex I can run the script [21:24:46] updateCollation.php on plwikiquote, no args/options? [21:26:12] I see various invocations here https://sal.toolforge.org/production?p=0&q=updateCollation.php&d= [21:26:34] kostajh: i think so, but i'm checking if the script changed since i last saw it [21:28:19] kostajh: yeah, that's right. you can add --previous-collation=uppercase, it's supposed to be an optimization, i'm not sure if it makes a difference [21:28:34] kostajh: the script should take maybe 10 minutes, given the size of the wiki [21:28:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:28:59] (or at least in that order of magnitude, it varies a lot) [21:29:47] hmm. Any risk in running this? [21:30:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P78487 and previous config saved to /var/cache/conftool/dbconfig/20250619-213010-marostegui.json [21:32:40] MatmaRex: --dry-run says 103,479 rows would be updated [21:33:38] which is all the rows in `categorylinks` [21:33:58] that seems correct [21:34:27] they all have to be recomputed for the new collation. 
it's a pretty routine script [21:35:47] MatmaRex: ok, I will run `mwscript-k8s -f --comment="T397483" -- updateCollation.php --wiki=plwikiquote --previous-collation=uppercase` [21:35:47] T397483: plwikiquote: run updateCollation.php - https://phabricator.wikimedia.org/T397483 [21:35:59] sounds ok? [21:36:27] sounds ok [21:36:40] running it [21:39:07] done [21:39:37] !log mwscript-k8s -f --comment="T397483" -- updateCollation.php --wiki=plwikiquote --previous-collation=uppercase [21:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:23] MatmaRex: thanks for your help [21:41:14] np [21:42:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:45:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T396130)', diff saved to https://phabricator.wikimedia.org/P78488 and previous config saved to /var/cache/conftool/dbconfig/20250619-214517-marostegui.json [21:45:24] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [21:45:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [21:45:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T396130)', diff saved to https://phabricator.wikimedia.org/P78489 and previous config saved to /var/cache/conftool/dbconfig/20250619-214540-marostegui.json [21:57:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:00:03] (03PS1) 10Andrea Denisse: grafana: Disable dashboard sync for a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1161628 (https://phabricator.wikimedia.org/T397442) [22:03:18] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1227 (T396130)', diff saved to https://phabricator.wikimedia.org/P78490 and previous config saved to /var/cache/conftool/dbconfig/20250619-220317-marostegui.json [22:03:22] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [22:12:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:18:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P78491 and previous config saved to /var/cache/conftool/dbconfig/20250619-221824-marostegui.json [22:28:29] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P78492 and previous config saved to /var/cache/conftool/dbconfig/20250619-223332-marostegui.json [22:37:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:48:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T396130)', diff saved to https://phabricator.wikimedia.org/P78493 and previous config saved to /var/cache/conftool/dbconfig/20250619-224839-marostegui.json [22:48:46] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [22:48:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1253.eqiad.wmnet with reason: 
Maintenance [22:49:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T396130)', diff saved to https://phabricator.wikimedia.org/P78494 and previous config saved to /var/cache/conftool/dbconfig/20250619-224901-marostegui.json [22:49:36] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161660 [22:50:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161661 [22:54:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T396130)', diff saved to https://phabricator.wikimedia.org/P78495 and previous config saved to /var/cache/conftool/dbconfig/20250619-225448-marostegui.json [22:54:53] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:00:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P78496 and previous config saved to /var/cache/conftool/dbconfig/20250619-230955-marostegui.json [23:11:30] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10933319 (10Krinkle) [23:25:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P78497 and previous config saved to /var/cache/conftool/dbconfig/20250619-232502-marostegui.json [23:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161681 [23:38:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161681 (owner: 10TrainBranchBot) [23:40:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T396130)', diff saved to https://phabricator.wikimedia.org/P78498 and previous config saved to /var/cache/conftool/dbconfig/20250619-234009-marostegui.json [23:40:15] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:40:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [23:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:51:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1161681 (owner: 10TrainBranchBot) [23:56:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn