[00:15:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:20:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:25:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:30:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:45:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[00:50:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:00:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:10:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5018:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[01:20:40] RESOLVED: [3x] VarnishHighThreadCount: Varnish's thread count on cp5018:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:47:31] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow - https://phabricator.wikimedia.org/T420189#11736789 (ABran-WMF) Outcome of the change: - `AsyncRequestWorkerFactor...
[10:17:44] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow - https://phabricator.wikimedia.org/T420189#11737488 (ABran-WMF) @hashar suggested aligning Gerrit's stack to what's...
[11:15:46] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11737711 (hashar) @ABran-WMF and I had a long debugging session last Thursday which I have s...
[11:19:19] FIRING: SystemdUnitFailed: bird.service on hcaptcha-proxy4004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:57] Traffic, collaboration-services, Gerrit, Release-Engineering-Team: gerrit: Add envoy in Gerrit's stack - https://phabricator.wikimedia.org/T420909 (ABran-WMF) NEW
[11:27:04] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Add envoy in Gerrit's stack - https://phabricator.wikimedia.org/T420909#11737751 (ABran-WMF) Open→In progress p:Triage→Medium
[11:27:59] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow - https://phabricator.wikimedia.org/T420189#11737757 (ABran-WMF) follow up task: {T420909}
[11:44:19] RESOLVED: SystemdUnitFailed: bird.service on hcaptcha-proxy4004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:11] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use - https://phabricator.wikimedia.org/T417998#11738124 (ABran-WMF) Stalled→Resolved
[14:41:56] Traffic, Liberica, Prod-Kubernetes, Data-Platform-SRE (2026-03-06 - 2026-03-27), Kubernetes: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437#11738612 (BTullis) We are going to wait until the dust has settled slightly on {T414484} before implementing...
[14:45:41] Traffic, Liberica, Prod-Kubernetes, Data-Platform-SRE (2026-03-06 - 2026-03-27), Kubernetes: Migrate DSE k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420437#11738630 (BTullis) p:Triage→Medium
[14:54:28] Traffic, Infrastructure-Foundations, Liberica, Prod-Kubernetes, Kubernetes: Migrate AUX k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420439#11738683 (elukey) p:Triage→Medium
[15:09:09] netops, Infrastructure-Foundations, SRE: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP - https://phabricator.wikimedia.org/T420820#11738741 (cmooney) Open→Resolved a:cmooney Ok this should no longer be an issue after updating the `wikimedia6` prefix...
[15:26:17] Traffic, cloud-services-team, Data-Services, Datasets-General-or-Unknown, Patch-For-Review: Move dumps.wikimedia.org HTTP service behind CDN edge - https://phabricator.wikimedia.org/T306550#11738842 (xcollazo) > The option of telling the few consumers of the rsync to change the hostname/IP th...
[16:06:30] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: gerrit: Add envoy in Gerrit's stack - https://phabricator.wikimedia.org/T420909#11739157 (Dzahn) Agreed, envoy is the standard around here to terminate TLS and we do need TLS termination between ATS and the s...
[17:38:46] netops, Infrastructure-Foundations, SRE: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975 (cmooney) NEW p:Triage→Medium
[17:40:26] netops, Infrastructure-Foundations, SRE: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975#11739860 (cmooney)
[18:05:38] Traffic, cloud-services-team, Data-Services, Datasets-General-or-Unknown, Patch-For-Review: Move dumps.wikimedia.org HTTP service behind CDN edge - https://phabricator.wikimedia.org/T306550#11739990 (HCoplin-WMF) > the new File Export system, which is now in production running jointly with th...
[18:48:54] netops, Infrastructure-Foundations, SRE: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740224 (ssingh) Thanks for all the work here @cmooney and for mentioning this, something that I had most certainly overlooked at least. I will think a bit...
[19:18:46] netops, Infrastructure-Foundations, SRE: Anycast services - depool strategy in terms of BGP routing - https://phabricator.wikimedia.org/T420821#11740353 (cmooney) Thanks @ssingh. I think a cookbook that takes down doh and durum simultaneously at a site (I assume by changing bird?) would solve this p...
[19:54:19] FIRING: SystemdUnitFailed: anycast-healthchecker.service on hcaptcha-proxy4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:59:19] FIRING: [2x] SystemdUnitFailed: anycast-healthchecker.service on hcaptcha-proxy4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:19] FIRING: [2x] SystemdUnitFailed: anycast-healthchecker.service on hcaptcha-proxy4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:29:19] RESOLVED: [2x] SystemdUnitFailed: anycast-healthchecker.service on hcaptcha-proxy4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:34:33] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11740860 (CDobbins)
[20:43:30] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11740925 (CDobbins)
[21:03:45] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11740999 (CDobbins)