[01:24:59] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11700466 (BCornwall)
[01:25:36] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700467 (BCornwall) Open→Resolved Thank you for all your work, rob. I was able to reimage and all seems well now. I'll re-open this if anything changes.
[02:17:40] [non-urgent] just noticed a handful of `WidespreadPuppetFailure: Puppet has failed in ulsfo` alerts that have scrolled by in -operations this afternoon, but from puppetboard I *think* it's just durum400[34] tripping the alert threshold (T418993). wanted to flag it here in case that's not expected in their current phase of turnup.
[02:17:41] T418993: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993
[07:09:19] netops, Infrastructure-Foundations, Observability-Logging: ~5k/logs/sec from netdev - https://phabricator.wikimedia.org/T412143#11700757 (ayounsi) > Resolved-In > junos:23.4R1 junos:23.4R2 junos:24.1R1
[08:19:22] swfrench-wmf: indeed. these are downtimed and insetup, but there's some mysterious Bird error we haven't yet narrowed down
[09:32:16] this was fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1251010
[09:39:16] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11701269 (ayounsi)
[10:05:41] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11701359 (tappof)
[10:54:04] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11701471 (ayounsi) I think the factory reset helped. I then temporarily copied the TLS config from asw1-22, and ran the TLS cookbook and we're all good. So now...
[11:01:58] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 (Fabfur) NEW
[11:02:16] Traffic: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825#11701503 (Fabfur) Open→In progress p:Triage→Medium
[11:41:15] Traffic, Data-Platform-SRE: Prevent HaproxykafkaNoMessages alerts from being generated due to standard maintenance operations - https://phabricator.wikimedia.org/T419829 (BTullis) NEW
[12:28:49] Traffic, ServiceOps-Services-Oids, ServiceOps new (Next quarter), WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11701776 (Raine)
[12:44:21] Traffic, MW-1.46-notes (1.46.0-wmf.19; 2026-03-10), OKR-Work, Patch-For-Review, Test Kitchen (Experiment Platform Sprint 20): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11701818 (Sfaci) @ssingh Next Monday, Ma...
[13:43:54] can someone review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1251081 please? this unbreaks CI for the cache hosts (like found here: https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/24728/console)
[13:47:52] +1
[13:55:10] thanks
[14:15:29] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11702362 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf71bad1-aeb4-4596-b577-d88e4e171aab) set by ayounsi@cumin1003 for 0:30:00 on 24 host(s) and their servi...
[14:21:59] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11702397 (ayounsi) BGP bounce done by running those 2 commands "at the same time": ` tools network-instance default protocols bgp neighbor 10.64.128.17 reset-peer tools network-in...
[14:27:01] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11702422 (ayounsi) Open→Resolved All servers have been repooled.
[14:39:11] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419854 (tappof) NEW
[14:39:37] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419855 (tappof) NEW
[14:39:53] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419856 (tappof) NEW
[14:40:23] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr1-esams:9804) - https://phabricator.wikimedia.org/T419857 (tappof) NEW
[14:40:36] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419858 (tappof) NEW
[14:40:54] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T419859 (tappof) NEW
[15:10:00] Traffic, MW-1.46-notes (1.46.0-wmf.19; 2026-03-10), OKR-Work, Patch-For-Review, Test Kitchen (Experiment Platform Sprint 21): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11702728 (KReid-WMF)
[15:10:35] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 3 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11702738 (malberts) Not sure where exactly to comment, but the commit that was backported to REL 1.43 for this issue, is calling a...
[15:17:26] Traffic, SRE: Startup failure for Bird on new durum hosts - https://phabricator.wikimedia.org/T419868 (MoritzMuehlenhoff) NEW
[15:32:35] Traffic, Liberica: Provide a cookbook that validates IPIP/IP6IP6 capabilities on a given realserver - https://phabricator.wikimedia.org/T419873 (Vgutierrez) NEW
[15:32:46] Traffic, Liberica: Provide a cookbook that validates IPIP/IP6IP6 capabilities on a given realserver - https://phabricator.wikimedia.org/T419873#11702908 (Vgutierrez) p:Triage→Medium
[15:36:26] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11702933 (RobH) Please note the maint window for this offline host is 2026-03-13 @ 07:00 AM Singapore / which is 5PM Thursday evening for me. I'll be online to remotely supervise the swap and attem...
[16:57:57] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11703357 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4037.ulsfo.wmnet with OS trixie
[16:59:17] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11703383 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS trixie
[17:00:43] FIRING: HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3069&viewPanel=panel-19 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[17:02:21] Traffic, decommission-hardware: Decommission codfw cp hosts - https://phabricator.wikimedia.org/T419753#11703391 (BCornwall)
[17:05:43] FIRING: [5x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[17:10:43] FIRING: [5x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[17:15:43] RESOLVED: [5x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[17:36:04] Traffic, cloud-services-team, SRE Observability: Move wikimediastatus.net 301 to ncredir - https://phabricator.wikimedia.org/T419887 (colewhite) NEW
[17:36:30] Traffic, cloud-services-team, SRE Observability: Move wikimediastatus.net 301 to ncredir - https://phabricator.wikimedia.org/T419887#11703499 (colewhite)
[17:40:31] Traffic, ServiceOps new, ServiceOps-Services-Oids, WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11703522 (MLechvien-WMF)
[17:41:02] Traffic, decommission-hardware: Decommission codfw cp hosts cp2027-cp2042 - https://phabricator.wikimedia.org/T419753#11703531 (BCornwall)
[17:41:35] Traffic, cloud-services-team, SRE Observability, Patch-For-Review: Move wikimediastatus.net 301 to ncredir - https://phabricator.wikimedia.org/T419887#11703537 (Aklapper)
[17:41:47] Traffic, decommission-hardware: Decommission codfw cp hosts cp2027-cp2040 - https://phabricator.wikimedia.org/T419753#11703551 (BCornwall)
[17:41:55] Traffic, cloud-services-team, SRE Observability, Patch-For-Review: Move wikimediastatus.net 301 to ncredir - https://phabricator.wikimedia.org/T419887#11703554 (Aklapper)
[17:49:45] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11703568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4037.ulsfo.wmnet with OS trixie completed: - cp4037 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pup...
[17:59:19] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11703592 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4039.ulsfo.wmnet with OS trixie
[18:09:53] FIRING: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:20:22] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11703676 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS trixie executed with errors: - cp4038 (**FAIL**) - Downtimed on Icinga/Alertmanager - D...
[18:59:53] FIRING: [2x] ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:14:53] RESOLVED: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:18:23] FIRING: ErrorBudgetBurn: varnish-combined codfw - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:40:41] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704549 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS trixie completed: - cp4038 (**PASS**) - Removed from Puppet and PuppetDB if present and d...
[21:43:37] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704561 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4041.ulsfo.wmnet with OS trixie
[21:58:45] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704596 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4040.ulsfo.wmnet with OS trixie
[22:22:19] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704619 (BCornwall)
[22:24:00] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704621 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4042.ulsfo.wmnet with OS trixie
[22:38:21] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4041.ulsfo.wmnet with OS trixie completed: - cp4041 (**PASS**) - Removed from Puppet and PuppetDB if present and d...
[22:42:47] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4043.ulsfo.wmnet with OS trixie
[22:45:13] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704694 (CDobbins)
[23:07:46] netops, Infrastructure-Foundations, SRE: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11704748 (Papaul) Last update from Nokia today ` The following was added as a limitation under release notes: Management Release:25.10.2 Section:...
[23:19:41] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704774 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4040.ulsfo.wmnet with OS trixie executed with errors: - cp4040 (**FAIL**) - Downtimed on Icinga/Alertmanager - D...
[23:21:36] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4040.ulsfo.wmnet with OS trixie
[23:44:58] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704839 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4042.ulsfo.wmnet with OS trixie executed with errors: - cp4042 (**FAIL**) - Downtimed on Icinga/Alertmanager - D...
[23:50:56] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11704870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4042.ulsfo.wmnet with OS trixie