[08:21:24] topranks: uh...
[08:21:28] topranks: nice catch :)
[08:23:26] fixed :)
[08:33:44] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10581732 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs7002.magru.wmnet with OS bookworm
[09:19:07] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10581848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs7002.magru.wmnet with OS bookworm completed: - lvs7002 (**PASS**) - Downtimed on...
[09:41:22] Traffic: liberica control plane hangs if it fails to get an etcd endpoint - https://phabricator.wikimedia.org/T387278 (Vgutierrez) NEW
[09:41:37] Traffic: liberica control plane hangs if it fails to get an etcd endpoint - https://phabricator.wikimedia.org/T387278#10581886 (Vgutierrez)
[09:41:55] Traffic: liberica control plane hangs if it fails to get an etcd endpoint - https://phabricator.wikimedia.org/T387278#10581889 (Vgutierrez) p:Triage→Medium
[09:58:56] netops, Infrastructure-Foundations: BGP peers with missing descriptions - https://phabricator.wikimedia.org/T387220#10581935 (cmooney) Open→Resolved a:cmooney I had a quick look and added these.
[10:17:49] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10581998 (cmooney) >>! In T384731#10579181, @ayounsi wrote: >>> And what happens if peer_descr is mis...
[10:18:18] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10581999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1002 for host lvs7001.magru.wmnet with OS bookworm
[10:56:31] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287 (ayounsi) NEW
[11:01:17] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582145 (ayounsi) I forked the discussion to {T387287} and {T387288} as that task was becoming more...
[11:01:35] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10582150 (ayounsi)
[11:01:38] netops, Infrastructure-Foundations, observability, Prod-Kubernetes, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582148 (ayounsi)
[11:04:24] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10582160 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1002 for host lvs7001.magru.wmnet with OS bookworm completed: - lvs7001 (**PASS**) - Downtimed on...
[11:22:25] Traffic, Patch-For-Review: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477#10582181 (Vgutierrez)
[11:25:56] Feb 26 11:21:08 lvs7001 libericad[4975]: time=2025-02-26T11:21:08.193Z level=INFO msg="Removed BGP communities"
[11:25:56] Feb 26 11:21:08 lvs7001 libericad[4975]: time=2025-02-26T11:21:08.193Z level=INFO msg="triggering BGP soft reset" peer=10.140.0.1
[11:26:03] now liberica is more verbose when reconfiguring BGP
[11:26:29] that was the repooling of lvs7001 as primary for high-traffic1@magru
[13:56:43] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10582948 (ayounsi) Another question is how to name those new metrics ? One suggestion, to stay generic as well, is to do something like `gnmi_bgp_...
[14:03:48] netops, Infrastructure-Foundations, observability: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10582973 (fgiunchedi) Thank you for kickstarting this @ayounsi! I think I like `remote_instance` though don't feel strongly. re: `:0` in `instanc...
[14:11:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[14:21:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[14:26:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[14:36:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[14:54:37] Traffic: Remove katran blockers for low-traffic non-k8s based services - https://phabricator.wikimedia.org/T373020#10583285 (Vgutierrez)
[14:55:14] netops, Infrastructure-Foundations, observability, Patch-For-Review: Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583319 (cmooney) >>! In T387287#10582948, @ayounsi wrote: > Another question is how to name those new metrics ? > One sugg...
[15:12:39] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583404 (lmata)
[15:33:36] hello traffic friends - any objections if I were to deploy an ATS Lua config change in a bit? (same procedure as usual: stop puppet, test on a single host, roll out)
[15:34:20] swfrench-wmf: go for it :)
[15:34:56] awesome, thanks!
[15:36:16] Traffic: backport gobgp 3.33 from trixie - https://phabricator.wikimedia.org/T386687#10583495 (Vgutierrez) Open→Resolved
[15:47:37] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583538 (cmooney) @fgiunchedi I wonder if you might have any ideas on this. Our routers and our switches are exporting timestamps with different number of digits: ` gn...
[15:55:48] sukhe: actually starting now :)
[15:57:53] :)
[16:01:08] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.02.10 - 2025.02.28): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10583616 (xcollazo) @cmooney, should we move forward with this patch sometime soon?
[16:03:15] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583630 (ayounsi) > I'm wondering what the benefit is to having the additi...
[16:14:19] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review, SRE Observability (FY2024/2025-Q3): Prometheus: attach host's BGP/interface remote side metrics - https://phabricator.wikimedia.org/T387287#10583672 (cmooney) >>! In T387287#10583630, @ayounsi wrote: > No strong fee...
[16:34:42] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583733 (cmooney) My robot friend suggested this which works to adjust the result of the promql to the right units: ` gnmi_bgp_neighbor_last_established{instance="$devi...
[16:56:28] Traffic: Remove katran blockers for low-traffic non-k8s based services - https://phabricator.wikimedia.org/T373020#10583907 (Vgutierrez)
[16:56:51] Traffic: Remove katran blockers for low-traffic non-k8s based services - https://phabricator.wikimedia.org/T373020#10583909 (Vgutierrez)
[17:49:40] Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10584289 (akosiaris) To add some more information on the `rp_filter` setting, apparently starting with calico version `3.23.0` (we r...
[19:10:56] Traffic, SRE, Wikimedia-production-error: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584517 (Aklapper)
[19:12:02] Traffic, SRE: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584521 (Aklapper)
[19:21:04] Traffic, SRE: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584529 (ssingh) @Grand-Duc: Hi, does this still persist for you? Or has it resolved?
[21:46:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[21:51:24] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[22:17:25] FIRING: SystemdUnitCrashLoop: varnishmtail@internal.service crashloop on cp5023:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:27:25] RESOLVED: SystemdUnitCrashLoop: varnishmtail@internal.service crashloop on cp5023:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[22:57:11] Traffic: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093#10585224 (BCornwall) We just had this crash again on cp5023 after a large spike in eqsin traffic.
[23:09:20] Traffic, MediaWiki-Engineering, serviceops, Patch-For-Review, and 2 others: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395#10585251 (Scott_French) As of earlier today (Wednesday) we're back to serving a portion of traffic on PHP 8.1. While we're not fu...