[00:08:32] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178991
[00:08:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178991 (owner: TrainBranchBot)
[00:22:10] (PS1) Dzahn: httpbb: minor changes to test file for new zuul [puppet] - https://gerrit.wikimedia.org/r/1178994
[00:22:25] (CR) CI reject: [V:-1] httpbb: minor changes to test file for new zuul [puppet] - https://gerrit.wikimedia.org/r/1178994 (owner: Dzahn)
[00:30:53] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178991 (owner: TrainBranchBot)
[00:35:17] FIRING: ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:40:17] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:51] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:09:52] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrading to Java 11.0.28 - eevans@cumin1002
[01:12:42] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 50s)
[01:17:18] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1046.eqiad.wmnet with OS bullseye
[01:23:55] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bullseye
[01:29:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088297 (phaultfinder)
[01:31:04] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1046.eqiad.wmnet with OS bullseye
[01:36:40] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bullseye
[01:53:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:58:29] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1046.eqiad.wmnet with reason: host reimage
[01:59:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T399249)', diff saved to https://phabricator.wikimedia.org/P81363 and previous config saved to /var/cache/conftool/dbconfig/20250815-015904-fceratto.json
[01:59:09] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[02:03:18] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11088325 (Samwilson) Oh that's good to know, thanks! I was starting to wonder if something like that was going on, but then found [[https://gitlab.wikimedia.org/toolforge-repo...
[02:03:30] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1046.eqiad.wmnet with reason: host reimage
[02:03:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage
[02:04:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:07:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage
[02:09:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:14:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P81364 and previous config saved to /var/cache/conftool/dbconfig/20250815-021412-fceratto.json
[02:22:07] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1046.eqiad.wmnet with OS bullseye
[02:25:46] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1047.eqiad.wmnet with OS bullseye
[02:29:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P81365 and previous config saved to /var/cache/conftool/dbconfig/20250815-022919-fceratto.json
[02:35:36] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[02:40:54] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrading to Java 11.0.28 - eevans@cumin1002
[02:44:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T399249)', diff saved to https://phabricator.wikimedia.org/P81366 and previous config saved to /var/cache/conftool/dbconfig/20250815-024426-fceratto.json
[02:44:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[02:44:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2237.codfw.wmnet with reason: Maintenance
[02:44:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T399249)', diff saved to https://phabricator.wikimedia.org/P81367 and previous config saved to /var/cache/conftool/dbconfig/20250815-024449-fceratto.json
[02:54:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088364 (phaultfinder)
[03:01:43] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[03:03:00] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[03:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:25:13] andrew@cumin2002 reimage (PID 125559) is awaiting input
[03:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:44:49] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[03:45:36] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[03:49:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:49:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:51:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 6.679 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:51:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.368 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:54:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:54:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:55:52] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.024 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:55:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.093 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:58:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:58:58] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:00:10] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088400 (phaultfinder)
[04:01:52] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.792 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:01:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:08:12] andrew@cumin2002 reimage (PID 145629) is awaiting input
[04:12:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:46] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:14:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:15:10] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088405 (phaultfinder)
[04:17:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:39:51] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[04:40:32] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:48:00] PROBLEM - Disk space on prometheus1007 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 12420MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1007&var-datasource=eqiad+prometheus/ops
[04:50:12] ops-codfw, DC-Ops: Alert for device ps1-c7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401982 (phaultfinder) NEW
[05:07:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:08:36] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:12:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:10] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088465 (phaultfinder)
[05:49:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088476 (phaultfinder)
[05:54:32] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250815T0600)
[06:08:00] PROBLEM - Disk space on prometheus1007 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 12387MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1007&var-datasource=eqiad+prometheus/ops
[06:13:36] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:16:36] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:42] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:40:02] (PS6) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621)
[06:40:03] (PS6) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621)
[06:40:03] (PS4) Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621)
[06:40:03] (PS2) Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058)
[06:41:20] (CR) CI reject: [V:-1] varnish: refactor inclusion of requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250815T0700)
[07:02:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1013.eqiad.wmnet
[07:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:09:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1013.eqiad.wmnet
[07:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:15:07] ops-eqiad, SRE, DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11088507 (klausman) Any time next week during my usual waking hours (0800-1800 UTC) should be doable. A few notes about the specific hosts: ml-lab1001/2 are just that lab mac...
[07:24:56] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088511 (phaultfinder)
[07:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:54:57] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088549 (phaultfinder)
[08:04:28] (CR) Majavah: [C:+2] hieradata: Update Cloud VPS NTP servers [puppet] - https://gerrit.wikimedia.org/r/1178871 (https://phabricator.wikimedia.org/T401848) (owner: Majavah)
[08:19:56] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088563 (phaultfinder)
[08:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:29:29] (PS1) Majavah: hieradata: Add new Puppet ENC servers [puppet] - https://gerrit.wikimedia.org/r/1179101 (https://phabricator.wikimedia.org/T401986)
[08:32:11] (PS1) Majavah: openstack: encapi: Support Trixie hosts [puppet] - https://gerrit.wikimedia.org/r/1179103 (https://phabricator.wikimedia.org/T401986)
[08:32:59] (CR) Majavah: [C:+2] hieradata: Add new Puppet ENC servers [puppet] - https://gerrit.wikimedia.org/r/1179101 (https://phabricator.wikimedia.org/T401986) (owner: Majavah)
[08:33:05] (CR) Majavah: [C:+2] openstack: encapi: Support Trixie hosts [puppet] - https://gerrit.wikimedia.org/r/1179103 (https://phabricator.wikimedia.org/T401986) (owner: Majavah)
[08:40:32] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:40:54] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:44] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:03:07] ^ the build2002 error might re-appear, that's the Java 8 forward port build for Bookworm, the test suite is insane and brings the VM almost to its knees...
[09:09:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088655 (phaultfinder)
[09:10:49] !log update python3-flask-keystone in trixie-wikimedia T401986
[09:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:53] T401986: Refresh Cloud VPS Puppet ENC servers to run on Trixie and enable IPv6 - https://phabricator.wikimedia.org/T401986
[09:14:11] (PS1) Majavah: openstack: encapi: Listen on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1179105 (https://phabricator.wikimedia.org/T401986)
[09:16:23] (CR) Majavah: [C:+2] openstack: encapi: Listen on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1179105 (https://phabricator.wikimedia.org/T401986) (owner: Majavah)
[09:17:33] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:20:17] FIRING: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:33] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GoogleNewsSitemap/+/1178944 could probably use an emergency deploy (cc slyngs, moritzm as SREs on call) later today, though I wouldn’t mind someone else taking a look first (CC esp. Amir1, zabe)
[09:22:45] fixes a (user-facing, I believe) production error on two wikinewses
[09:23:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:24:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:24:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:25:02] (I can take care of deployment once merged and emergency deployment approved)
[09:25:17] FIRING: [6x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:27:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:30:17] FIRING: [10x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:33:36] FIRING: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:54] (PS3) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[09:35:19] (CR) CI reject: [V:-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[09:40:43] Lucas_WMDE, zabe: I don't have a grasp of the finer details of the patch, but if it's an UBN and you both think it looks fine, then please go ahead (or wait for Amir to voice an opinion given that you also tagged him)
[09:40:52] fine either way
[09:45:24] (PS4) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[09:45:32] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089033 (phaultfinder)
[09:45:45] (CR) CI reject: [V:-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[09:45:57] zabe: since the fix should be easily testable, I’d say let’s go ahead with the backport now
[09:45:58] wdyt?
[09:46:13] sounds good
[09:46:15] (PS5) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[09:46:20] leave it open on master for a little bit, but fix it on wmf.14 already
[09:46:48] (CR) CI reject: [V:-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[09:47:34] (PS1) Zabe: Migrate overlooked query to categorylinks read new [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179108 (https://phabricator.wikimedia.org/T401951)
[09:48:48] (CR) Lucas Werkmeister (WMDE): [C:+1] "IMHO Okay to deploy per my review on the master branch (but I’m leaving the actual CR+2 to the `scap backport`)." [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179108 (https://phabricator.wikimedia.org/T401951) (owner: Zabe)
[09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:53:23] reproducible by visiting https://zh.wikinews.org/wiki/Special:%E6%96%B0%E9%97%BB%E8%AE%A2%E9%98%85
[09:53:38] Lucas_WMDE: Should I deploy or are you doing it?
[09:54:32] zabe: I thought you would do it
[09:54:37] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:54:37] I can also do it if you prefer
[09:54:53] No it is fine, just wanted to make sure we are not clashing
[09:54:56] ok
[09:56:04] (CR) TrainBranchBot: [C:+2] "Approved by zabe@deploy1003 using scap backport" [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179108 (https://phabricator.wikimedia.org/T401951) (owner: Zabe)
[09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:22] (Merged) jenkins-bot: Migrate overlooked query to categorylinks read new [extensions/GoogleNewsSitemap] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179108 (https://phabricator.wikimedia.org/T401951) (owner: Zabe)
[09:57:54] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1179108|Migrate overlooked query to categorylinks read new (T401951)]]
[09:57:58] T401951: Error: Call to a member function getText() on null - https://phabricator.wikimedia.org/T401951
[09:59:44] !log uploaded openjdk-8 8u462-ga-1 to bookworm (backport of latest Java 8 security fixes)
[09:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:49] !log zabe@deploy1003 zabe: Backport for [[gerrit:1179108|Migrate overlooked query to categorylinks read new (T401951)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:59:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089078 (phaultfinder)
[10:00:44] visiting the above link no longer fatals when using mwdebug
[10:00:55] !log zabe@deploy1003 zabe: Continuing with sync
[10:03:36] RESOLVED: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:12] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179108|Migrate overlooked query to categorylinks read new (T401951)]] (duration: 08m 17s)
[10:06:16] T401951: Error: Call to a member function getText() on null - https://phabricator.wikimedia.org/T401951
[10:07:32] \o/
[10:07:54] (PS1) Muehlenhoff: Bump the version numbers for Java 8/17 images based on latest security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1179110
[10:09:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T399249)', diff saved to https://phabricator.wikimedia.org/P81368 and previous config saved to /var/cache/conftool/dbconfig/20250815-100901-fceratto.json
[10:09:03] logstash is also looking better now
[10:09:05] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:11:21] fantastic, thanks for fixing
[10:24:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P81369 and previous config saved to /var/cache/conftool/dbconfig/20250815-102408-fceratto.json
[10:30:09] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089178 (phaultfinder)
[10:39:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P81370 and previous config saved to /var/cache/conftool/dbconfig/20250815-103917-fceratto.json
[10:44:46] (CR) Ladsgroup: "I like it, I will check whether this is going to impact watchlist or not." [mediawiki-config] - https://gerrit.wikimedia.org/r/1178990 (https://phabricator.wikimedia.org/T399455) (owner: Zabe)
[10:48:19] (PS1) Muehlenhoff: Update Cumin aliases to handle the transition to routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179111 (https://phabricator.wikimedia.org/T394263)
[10:48:58] (PS2) Muehlenhoff: Update Cumin aliases to handle the transition to routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179111 (https://phabricator.wikimedia.org/T394263)
[10:51:24] (CR) Cathal Mooney: [C:+1] Bump the version numbers for Java 8/17 images based on latest security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1179110 (owner: Muehlenhoff)
[10:51:44] PROBLEM - Disk space on wikikube-worker-exp1001 is CRITICAL: DISK CRITICAL - /tmp/nerdctl-cp-2679370730 is not accessible: No such file or directory https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=wikikube-worker-exp1001&var-datasource=eqiad+prometheus/ops
[10:54:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T399249)', diff saved to https://phabricator.wikimedia.org/P81371 and previous config saved to /var/cache/conftool/dbconfig/20250815-105424-fceratto.json
[10:54:29] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:54:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2239.codfw.wmnet with reason: Maintenance
[10:56:13] (CR) Muehlenhoff: [V:+2 C:+2] Bump the version numbers for Java 8/17 images based on latest security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1179110 (owner: Muehlenhoff)
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250815T0700)
[11:00:05] jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250815T1100).
[11:02:14] PROBLEM - Disk space on prometheus1008 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 11652MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1008&var-datasource=eqiad+prometheus/ops
[11:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:15:14] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089301 (phaultfinder)
[11:21:54] (CR) Ladsgroup: "We should announce this in cloud announce. I can take care of that." [puppet] - https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: Zabe)
[11:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[11:39:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089382 (phaultfinder)
[11:47:41] (PS6) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[11:48:00] PROBLEM - Disk space on prometheus1007 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 12431MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1007&var-datasource=eqiad+prometheus/ops
[11:48:58] (CR) Ladsgroup: [C:+1] "Okay, this won't affect watchlist since the preference for it is set differently via getDefaultDaysPreferenceName" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178990 (https://phabricator.wikimedia.org/T399455) (owner: Zabe)
[11:56:30] ops-eqiad, SRE, Data-Persistence, DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11089413 (Ladsgroup) >>! In T401966#11087932, @RobH wrote: > @Marostegui: Would you be able to advise on behalf of #data-persistence a schedule for updating...
[12:00:18] (03PS1) 10Majavah: P:mariadb::cloudinfra: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1179119 [12:01:03] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6587/co" [puppet] - 10https://gerrit.wikimedia.org/r/1179119 (owner: 10Majavah) [12:02:45] (03PS1) 10Huei Tan: Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers Page visit instrumentation for experiment by Language and Product Localization team. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) [12:05:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11089430 (10Ladsgroup) >>! In T400198#11087058, @VRiley-WMF wrote: > @Marostegui This will be for the install of all 9 of these servers? 1049 - 1057? The ticket only lists 1049... [12:05:12] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089431 (10phaultfinder) [12:07:33] (03PS1) 10Majavah: openstack: puppet: Set user-agent for ENC client script [puppet] - 10https://gerrit.wikimedia.org/r/1179121 [12:10:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11089444 (10MatthewVernon) @RobH I'm concerned that some of these hosts are relatively new (e.g. thanos-be1009 and ms-be1095 were purch... 
[12:13:33] (03PS3) 10Huji: Enable electionclerk user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) [12:15:13] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401982#11089451 (10Jhancock.wm) removing thresholds for alerts and merging with T401634. action plan detailed in that ticket. [12:16:16] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11089454 (10Jhancock.wm) [12:16:17] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401982#11089457 (10Jhancock.wm) →14Duplicate dup:03T401634 [12:17:12] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11089458 (10Jhancock.wm) [12:17:25] (03PS1) 10Majavah: hieradata: Move ENC Git updater job to enc-4 [puppet] - 10https://gerrit.wikimedia.org/r/1179126 (https://phabricator.wikimedia.org/T401986) [12:17:27] (03PS1) 10Majavah: hieradata: Remove old ENC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1179127 (https://phabricator.wikimedia.org/T401986) [12:18:57] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11089462 (10Jhancock.wm) C 7 needs a server removed. there is an excellent candidate. There is still one frack server in this rack. Will need to coordinate with that team to get it moved i... 
[12:44:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089532 (10phaultfinder) [12:53:36] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:21] (03PS1) 10Slyngshede: P:cache::haproxy add ASN lookup function [puppet] - 10https://gerrit.wikimedia.org/r/1179136 (https://phabricator.wikimedia.org/T398161) [12:55:11] (03CR) 10Majavah: [C:03+2] hieradata: Move ENC Git updater job to enc-4 [puppet] - 10https://gerrit.wikimedia.org/r/1179126 (https://phabricator.wikimedia.org/T401986) (owner: 10Majavah) [12:59:51] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089560 (10phaultfinder) [13:19:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089656 (10phaultfinder) [13:25:15] (03PS1) 10Wangombe: Make MT limit 80% on Welch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179141 (https://phabricator.wikimedia.org/T385482) [13:26:51] 06SRE: Add known-client-ingestion-source objects an logic - https://phabricator.wikimedia.org/T402014 (10JMeybohm) 03NEW [13:30:17] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:17] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:17] RESOLVED: ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:00:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:01:06] (03PS3) 10Hnowlan: rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) [14:19:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089794 (10phaultfinder) [14:20:41] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11089795 (10Joe) >>! In T400119#11086977, @bd808 wrote: >>>! 
In T400119#11084530, @Samwilson wrote: >> ~~Will GitLab CI be excluded from this policy?~~ > > I know you added the... [14:23:08] (03CR) 10KartikMistry: [C:03+1] "We can schedule this for Monday if that works." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179141 (https://phabricator.wikimedia.org/T385482) (owner: 10Wangombe) [14:42:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:50:04] (03CR) 10Phuedx: [C:03+1] Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers Page visit instrumentation for experiment by Langu (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [14:50:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:51:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:51:58] FIRING: 
ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 1 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:56:03] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11089855 (10Papaul) [14:56:58] RESOLVED: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:02:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11089870 (10Jhancock.wm) [15:02:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11089871 (10Jhancock.wm) a:05Marostegui→03Jhancock.wm [15:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:08:36] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11089877 (10phaultfinder) [15:20:30] PROBLEM - Host ml-lab1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:23] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11089894 (10RobH) @volans, Can you advise what we can run to check a specific host, or how you generated the full output of P77792? I'm going to start working on running the provision script... [15:23:37] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:00] RECOVERY - Host ml-lab1001 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:31:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11089941 (10RobH) We're currently attempting to figure out the scope 
and fix for this, but the first few would 100% be DC Ops figuring... [15:39:03] (03PS4) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [15:39:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11089971 (10MatthewVernon) @RobH OK, fair enough! ms-fe1013 can be simply depooled whenever and repooled when you're done - either pic... [15:40:42] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:44:28] (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:01:29] (03PS2) 10Dzahn: httpbb: minor changes to test file for new zuul [puppet] - 10https://gerrit.wikimedia.org/r/1178994 [16:02:46] (03CR) 10Dzahn: [C:03+2] httpbb: minor changes to test file for new zuul [puppet] - 10https://gerrit.wikimedia.org/r/1178994 (owner: 10Dzahn) [16:12:11] (03PS1) 10DLynch: Avoid error when switching to source editing [extensions/DiscussionTools] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179158 (https://phabricator.wikimedia.org/T402024) [16:17:45] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/1179158 -- context is T402024, are SRE okay with a deployment? (cc: thcipriani / jeena / hashar). I already have someone to deploy (me). 
[16:17:46] T402024: Minerva: TypeError: this.toolbar.getTarget(...).switchToWikitextEditor is not a function - https://phabricator.wikimedia.org/T402024 [16:21:37] !log ammarpad@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki 'Band of the Falcon' 'Griffith the Hero' # T401997 [16:21:42] T401997: Unblock stuck global rename of Griffith the Hero - https://phabricator.wikimedia.org/T401997 [16:22:23] 06SRE, 10LDAP-Access-Requests: Superset / LDAP access - https://phabricator.wikimedia.org/T402022#11090144 (10Dzahn) You currently have deployment access. So shell access to deployment servers and ability to deploy MediaWiki changes. You also have the LDAP group "wmf" as a staff member which grants a bunch of... [16:23:08] 06SRE, 10SRE-Access-Requests: Superset / LDAP access - https://phabricator.wikimedia.org/T402022#11090148 (10Dzahn) [16:23:58] Kemayo: fine by me, arnoldokoth and/or bblack as SREs on-call, are you OK with am emergency deploy per https://wikitech.wikimedia.org/wiki/Deployments/Emergencies#step-by-step [16:34:43] thcipriani: Eeerm. Not very confident making that call. Maybe rzl: swfrench-wmf: ? [16:36:34] arnoldokoth: definitely constitutes an emergency, as the oncaller it just means you'll be around to jump in if needed :) [16:38:26] +1 to what rzl said ^ [16:39:56] I wouldn't expect the patch to cause any problems, absent something going wrong with the deploy process itself. [16:40:32] Kemayo: looks like you're clear (also, don't jinx deployment :)) [16:40:53] Nothing can stop this deployment now, for I am invincible. [16:41:01] Perfect. 
[16:41:02] that's the spirit [16:41:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179158 (https://phabricator.wikimedia.org/T402024) (owner: 10DLynch) [16:41:22] well, it was probably a mistake to say that, but at least now nothing *else* can possibly go wrong [16:42:40] (03Merged) 10jenkins-bot: Avoid error when switching to source editing [extensions/DiscussionTools] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179158 (https://phabricator.wikimedia.org/T402024) (owner: 10DLynch) [16:42:57] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1179158|Avoid error when switching to source editing (T402024)]] [16:43:01] T402024: Minerva: TypeError: this.toolbar.getTarget(...).switchToWikitextEditor is not a function - https://phabricator.wikimedia.org/T402024 [16:44:59] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1179158|Avoid error when switching to source editing (T402024)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:45:43] !log kemayo@deploy1003 kemayo: Continuing with sync [16:46:48] rzl: Ack. And yes, happy to help if anything goes sideways. [16:50:58] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179158|Avoid error when switching to source editing (T402024)]] (duration: 08m 01s) [16:51:03] T402024: Minerva: TypeError: this.toolbar.getTarget(...).switchToWikitextEditor is not a function - https://phabricator.wikimedia.org/T402024 [16:51:42] Okay, it's all done with no (obvious) errors. [16:52:49] thanks Kemayo [16:53:04] Thanks for helping out! 
[16:54:32] FIRING: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:07] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090346 (10phaultfinder) [17:25:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:26:50] (03CR) 10BPirkle: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [17:29:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090421 (10phaultfinder) [17:30:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:31:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2240.codfw.wmnet with reason: Maintenance [17:32:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T399249)', diff saved to https://phabricator.wikimedia.org/P81373 and previous config saved to /var/cache/conftool/dbconfig/20250815-173159-fceratto.json [17:32:04] T399249: Add cl_timestamp_id index to categorylinks table - 
https://phabricator.wikimedia.org/T399249 [17:35:10] 10ops-codfw, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043 (10phaultfinder) 03NEW [17:43:51] (03PS5) 10Huji: Enable electionclerk user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) [17:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090499 (10phaultfinder) [17:45:34] (03CR) 10Huji: Enable electionclerk user group on fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [17:45:43] (03CR) 10Huji: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [17:52:46] (03PS1) 10Cwhite: k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) [17:52:48] (03PS1) 10Cwhite: cirrussearch: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) [17:54:28] (03CR) 10CI reject: [V:04-1] cirrussearch: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [17:54:41] (03CR) 10CI reject: [V:04-1] k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [18:05:26] (03PS2) 10Cwhite: k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) [18:05:26] (03PS2) 10Cwhite: cirrussearch: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) 
[18:09:44] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:15:45] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:18:30] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:18:37] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:44] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:34:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090561 (10phaultfinder) [18:41:59] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11090571 (10VRiley-WMF) This has been corrected. Loose cable. [18:42:04] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969#11090572 (10VRiley-WMF) 05Open→03Resolved [18:43:30] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11090576 (10VRiley-WMF) Researching how to obtain a replacment. 
[18:44:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11090578 (10VRiley-WMF) Perfect, thank you! Let us know when the next device is ready [18:46:29] jhancock@cumin1002 provision (PID 3927815) is awaiting input [18:46:42] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:51:17] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:52:27] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:54:22] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:54:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:05:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090657 (10phaultfinder) [19:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:29:01] (03PS1) 10Dzahn: zuul::main: create notepool sysuser and config template [puppet] - 10https://gerrit.wikimedia.org/r/1179217 (https://phabricator.wikimedia.org/T400850) [19:29:28] (03CR) 10CI reject: [V:04-1] zuul::main: create notepool sysuser and config template [puppet] - 10https://gerrit.wikimedia.org/r/1179217 (https://phabricator.wikimedia.org/T400850) (owner: 10Dzahn) [19:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:30:42] (03PS2) 10Dzahn: zuul::main: create notepool sysuser and config template [puppet] - 10https://gerrit.wikimedia.org/r/1179217 (https://phabricator.wikimedia.org/T400850) [19:34:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090701 (10phaultfinder) [19:35:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST 
secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:40:30] (03PS1) 10Andrew Bogott: osd.yaml: add entries for cloudcephosd1042, cloudcephosd104[67] [puppet] - 10https://gerrit.wikimedia.org/r/1179218 (https://phabricator.wikimedia.org/T401693) [19:42:01] (03CR) 10Andrew Bogott: [C:03+2] osd.yaml: add entries for cloudcephosd1042, cloudcephosd104[67] [puppet] - 10https://gerrit.wikimedia.org/r/1179218 (https://phabricator.wikimedia.org/T401693) (owner: 10Andrew Bogott) [19:42:13] PROBLEM - Disk space on prometheus1008 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s-dse 12431MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1008&var-datasource=eqiad+prometheus/ops [19:47:54] (03PS3) 10Dzahn: zuul::main: create notepool sysuser and config [puppet] - 10https://gerrit.wikimedia.org/r/1179217 (https://phabricator.wikimedia.org/T400850) [19:53:17] (03PS4) 10Dzahn: zuul::main: create nodepool sysuser and config [puppet] - 10https://gerrit.wikimedia.org/r/1179217 (https://phabricator.wikimedia.org/T400850) [19:56:05] (03PS1) 10Dzahn: add fake profile::zuul::main::nodepool::user_token [labs/private] - 10https://gerrit.wikimedia.org/r/1179219 (https://phabricator.wikimedia.org/T400850) [19:56:22] (03CR) 10Dzahn: [V:03+2 C:03+2] add fake profile::zuul::main::nodepool::user_token [labs/private] - 10https://gerrit.wikimedia.org/r/1179219 (https://phabricator.wikimedia.org/T400850) (owner: 10Dzahn) [19:58:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1179217/6590/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1179217 (https://phabricator.wikimedia.org/T400850) (owner: 
10Dzahn) [19:59:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090773 (10phaultfinder) [20:02:37] (03PS2) 10RLazarus: pyrra: Add Wikifunctions backend API combined latency-availability [puppet] - 10https://gerrit.wikimedia.org/r/1178627 (https://phabricator.wikimedia.org/T394057) [20:04:40] (03CR) 10RLazarus: [C:03+2] pyrra: Add Wikifunctions backend API combined latency-availability (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178627 (https://phabricator.wikimedia.org/T394057) (owner: 10RLazarus) [20:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:24:33] (03CR) 10Novem Linguae: [C:03+1] Enable electionclerk user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [20:24:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090845 (10phaultfinder) [20:30:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:32:09] (03PS1) 10Cwhite: logstash: remove udp in error alerts [alerts] - 10https://gerrit.wikimedia.org/r/1179221 [20:39:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T401938#11090873 (10phaultfinder) [20:49:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:50:27] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cirrussearch2089.mgmt:22 - https://phabricator.wikimedia.org/T399943#11090885 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @bking the replacement board came in today. It has been replaced and the server is up and updated. [20:54:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:03:02] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11090897 (10VRiley-WMF) @Jhancock.wm helped with E9 pdu 1. Set up the management cable, will need to work on the network cable. Thank you Jenn! 
[21:04:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11090901 (10phaultfinder) [21:13:14] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:24:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:31:14] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:36:32] (03PS4) 10Bking: Introduce opensearch-operator-crds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) [21:43:27] (03PS5) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [21:43:32] RECOVERY - Disk space on an-druid1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [21:44:48] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:45:02] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11091008 (10phaultfinder) [21:47:40] RECOVERY - Disk space on 
an-druid1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1005&var-datasource=eqiad+prometheus/ops [21:49:21] (03PS6) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [21:50:50] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:54:06] (03PS1) 10Cwhite: logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 [21:55:22] (03PS7) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [21:59:34] (03CR) 10Bking: "Starting with patchset 7, I've started using upstream chart 2.7.0 instead of 2.8.0 . Why? I'm running into [[ https://github.com/opensearc" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [22:04:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:18:18] Hey all - have an updated private security mitigation I’d like to get out soon, unless there are objections. 
[22:19:32] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:20:52] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host es2050.codfw.wmnet with OS bookworm [22:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:21:05] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11091035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host es2050.codfw.wmnet with OS bookworm [22:22:44] !log Remove log debug file from host - T383309 [22:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:48] T383309: rsyslog receiver on centrallog hosts misplaces some log host entries - https://phabricator.wikimedia.org/T383309 [22:27:00] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [22:30:13] !log dzahn@dns1004 START - running authdns-update [22:30:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:31:07] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2051-7 to codfw - jhancock@cumin1003" 
[22:31:13] !log dzahn@dns1004 END - running authdns-update [22:31:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2051-7 to codfw - jhancock@cumin1003" [22:31:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:31:39] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2051 [22:31:48] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2051 [22:31:53] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2052 [22:32:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2052 [22:32:10] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2053 [22:32:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2053 [22:32:24] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2054 [22:32:35] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2054 [22:32:41] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2055 [22:32:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2055 [22:33:06] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2056 [22:33:17] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2056 [22:33:20] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2057 [22:33:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
es2057 [22:33:48] !log Deployed updated security mitigation for T401266 [22:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:08] sbassett: late reply, no objection. Covered by https://wikitech.wikimedia.org/wiki/Deployments/Emergencies#step-by-step (noting that it is late on a Friday, unsure what SREs are around ... other than folks running cookbooks :)) [22:34:33] er...https://wikitech.wikimedia.org/wiki/Deployments/Emergencies#Reasons_for_an_emergency_deploy I mean [22:34:56] thcipriani: I'm around. [22:36:22] denisse: great, nothing specific needed, just the comfort of your presence (per the Emergency deployment policy) [22:36:42] jhancock@cumin1003 provision (PID 2199461) is awaiting input [22:37:00] tx, update seems stable [22:37:45] (03PS1) 10Andrea Denisse: centrallog: Remove unused debug logging config [puppet] - 10https://gerrit.wikimedia.org/r/1179228 (https://phabricator.wikimedia.org/T383309) [22:37:45] (03CR) 10Andrea Denisse: "Hi folks, this code is already absent from the hosts, this patch just removes it from Puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/1179228 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [22:39:11] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2050.codfw.wmnet with reason: host reimage [22:42:55] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2050.codfw.wmnet with reason: host reimage [22:47:21] (03PS1) 10Dzahn: zuul: add nodepool proxy URL variable [puppet] - 10https://gerrit.wikimedia.org/r/1179230 (https://phabricator.wikimedia.org/T400850) [22:47:36] (03PS2) 10Dzahn: zuul: add nodepool proxy URL variable [puppet] - 10https://gerrit.wikimedia.org/r/1179230 (https://phabricator.wikimedia.org/T400850) [22:47:57] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11091045 (10RLazarus) First SLO is up! The [[ https://slo.wikimedia.org/objectives?expr=%7B__name__=%22wikifunctions... [22:48:05] (03CR) 10CI reject: [V:04-1] zuul: add nodepool proxy URL variable [puppet] - 10https://gerrit.wikimedia.org/r/1179230 (https://phabricator.wikimedia.org/T400850) (owner: 10Dzahn) [22:48:59] (03CR) 10Andrea Denisse: [C:03+1] "lgtm, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178883 (https://phabricator.wikimedia.org/T381665) (owner: 10Tiziano Fogli) [22:49:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:50:48] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:51:14] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:51:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:52:10] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:52:18] (03PS3) 10Dzahn: zuul: add nodepool proxy URL variable [puppet] - 10https://gerrit.wikimedia.org/r/1179230 (https://phabricator.wikimedia.org/T400850) [22:52:43] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:53:22] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:53:54] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:54:51] (03CR) 10Dzahn: [C:03+2] zuul: add nodepool proxy URL variable [puppet] - 
10https://gerrit.wikimedia.org/r/1179230 (https://phabricator.wikimedia.org/T400850) (owner: 10Dzahn) [22:54:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:57:39] PROBLEM - Disk space on an-druid1005 is CRITICAL: DISK CRITICAL - free space: /srv 106384 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1005&var-datasource=eqiad+prometheus/ops [22:57:46] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11091048 (10Jhancock.wm) [22:58:11] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [22:58:16] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [22:58:17] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2050.codfw.wmnet with OS bookworm [22:58:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11091049 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host es2050.codfw.wmnet with OS bookworm completed: - es2050 (**WARN**)... 
[22:58:54] (03PS1) 10Dzahn: zuul::main: add mode 0550 to file holding a secret [puppet] - 10https://gerrit.wikimedia.org/r/1179231 (https://phabricator.wikimedia.org/T400850) [22:59:09] (03CR) 10Dzahn: [C:03+2] zuul::main: add mode 0550 to file holding a secret [puppet] - 10https://gerrit.wikimedia.org/r/1179231 (https://phabricator.wikimedia.org/T400850) (owner: 10Dzahn) [23:03:01] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:04:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:04:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:05:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:06:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED 
[23:06:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:13:33] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 103025 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [23:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:29:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11091058 (10Jhancock.wm) [23:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179234 [23:38:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179234 (owner: 10TrainBranchBot) [23:43:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:50:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179234 (owner: 10TrainBranchBot) [23:51:41] !log jhancock@cumin1003 START - 
Cookbook sre.hosts.provision for host es2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:52:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:52:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:53:07] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:53:35] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:54:17] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:54:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host es2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:56:11] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2052.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:56:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2051.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:56:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2053.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:57:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2054.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and 
with Dell SCP reboot policy FORCED [23:57:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2055.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:58:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2056.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:58:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2057.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED