[00:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:50:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:56] netbox, netops, DNS, Infrastructure-Foundations, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11792933 (ayounsi) What would be a good day to alert about those ? Or even better, not even need an alert ?
[09:50:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:26] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:33] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11793409 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5c0f0433-7743-49c9-b411-8f120a9f337d) set by ayounsi@cumin1003 for 1:00:00 on 3 host(s)...
[13:30:44] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11793801 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3c9a80b3-71a3-420a-bae4-d8cf79e5188e) set by ayounsi@cumin1003 for 0:30:00 on 3 host(s)...
[13:53:20] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11793947 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b81e6903-f6db-4d47-bfbb-a5bff02b24fa) set by ayounsi@cumin1003 for 2:00:00 on 3 host(s)...
[14:31:38] CAS-SSO, Infrastructure-Foundations: CAS login page overflows on iOS Safari (iPhone 16e) - https://phabricator.wikimedia.org/T422203#11794193 (SLyngshede-WMF) p:Triage→Medium a:SLyngshede-WMF
[14:31:42] CAS-SSO, Infrastructure-Foundations: CAS login page overflows on iOS Safari (iPhone 16e) - https://phabricator.wikimedia.org/T422203#11794195 (LSobanski) p:Medium→Low a:SLyngshede-WMF→None
[14:32:48] netops, Infrastructure-Foundations, SRE: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11794201 (LSobanski) p:Triage→Low
[14:32:50] netops, Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11794213 (ayounsi) p:Triage→Medium
[14:38:30] topranks: o/
[14:39:05] I am seeing a decrease in memcached errors in the past 3/4 days: https://phabricator.wikimedia.org/T420223#11794128
[14:39:39] And I checked the top 4 pods ending up in errors, their corresponding wikikube worker is in the private c/d old vlan
[14:40:33] could it be that Arzhel moving wikikube workers to the new vlans is the source of the improvement?
[14:40:34] somehow
[14:42:42] not sure - did he move some hosts? maybe as a test? I know clem is moving wikikube-worker1273 right now
[14:43:03] I'm moving one
[14:43:14] I don't think those arzhel moved are in prod
[14:44:04] but could it be that depooling the ones that were resolved the issue?
[14:44:09] """resolved"""
[14:44:26] My plan is to move one host to the new vlan, reintegrate it to prod and see if it gets the issue agian
[14:44:28] again*
[14:48:06] we depooled them a while ago, not sure though if Effie depooled more during the past days
[14:49:07] Apparently there's a host on the new vlan already
[14:49:15] We should check if it has errors or not
[14:55:20] It wasn't pooled, but I just pooled it, we'll see when workloads are deployed on it if anything happens
[14:55:25] wikikube-worker1347.eqiad.wmnet
[14:57:40] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11794425 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=918af1bd-6f87-4ec9-acb7-39622f38db7c) set by ayounsi@cumin1003 for 0:30:00 on 3 host(s)...
[15:55:51] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525 (cmooney) NEW p:Triage→Medium
[15:56:01] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11794809 (cmooney)
[15:56:07] netops, Infrastructure-Foundations, Patch-For-Review: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450#11794810 (cmooney)
[16:47:16] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11795258 (cmooney)
[16:49:27] netops, Infrastructure-Foundations, SRE: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11795284 (cmooney)