[08:26:25] FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:31] Traffic, Liberica: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228#10917013 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=59c5d762-0532-415b-82b0-0ec6b72d478a) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r...
[08:41:29] moritzm: ncredir7003 doesn't seem to be in a healthy state
[08:44:21] what specifically? it looks all fine in Icinga
[08:45:02] https://alerts.wikimedia.org/?q=alertname%3DProbeDown&q=team%3Dsre&q=%40receiver%3Ddefault
[08:45:20] that on one hand, and healthchecks are failing
[08:50:53] moritzm: does routed ganeti have some FW rules applied?
[08:51:59] vgutierrez: not for the routed traffic
[08:52:10] XioNoX: what about IPIP encapsulation ranges?
[08:52:35] https://www.irccloud.com/pastebin/xY7h8CsJ/
[08:52:48] I'm seeing connection attempts but no responses at all
[08:52:59] looking
[08:53:12] so I'd be inclined to think that ganeti is blocking traffic from 172.16.0.0/12
[08:54:09] and that'd explain the alert that we have with ncredir@magru
[08:54:19] let me depool ncredir7003
[08:55:17] ack
[08:55:33] there's no obvious error on the service level from looking at logs
[08:56:12] instance itself looks healthy FWIW
[09:00:37] `ganeti7002:~$ sudo tcpdump -i eno12399np0 host 10.140.0.13` doesn't see any traffic towards ncredir (but sees some to other VMs, like prometheus or bast), but pings between lvs7001 and ncredir work
[09:02:10] XioNoX: that's expected
[09:02:19] especially in the scenario I just described
[09:02:29] inbound traffic from lvs7001 is IPIP encapsulated
[09:02:46] and IPIP multiqueue optimized, so source IP address gets randomized
[09:02:51] ahh right..
[09:02:59] see the tcpdump paste I posted here
[09:05:01] hmmm wait
[09:05:11] ok, so IPIP traffic makes it to ncredir: `ncredir7003:~$ sudo tcpdump net 172.16.0.0/12`
[09:05:15] (nah, I just spotted the iptables rule on ncredir7003 as well)
[09:05:58] so what's happening with the responses?
[09:06:31] ncredir7003 is sending responses
[09:06:39] 09:06:25.812391 IP ncredir-lb.magru.wikimedia.org.https > lvs7001.magru.wmnet.60128: Flags [S.], seq 870613634, ack 3432636855, win 43440, options [mss 1440,sackOK,TS val 1752342733 ecr 862978466,nop,wscale 9], length 0
[09:06:47] so the host itself is properly configured
[09:08:23] it's visible all the way to the tap0 interface on ganeti7002, so yeah something is eating the reply
[09:08:49] as they're not visible on eno12399np0
[09:11:04] I'm wondering if it's not some kind of rpf
[09:15:17] I could track the traffic with `pwru` but the impact on ganeti7002 CPU load would be significant
[09:16:12] too bad linux doesn't have counters for that
[09:17:34] but I'm sure that's the issue, the reply comes from 195.200.68.226
[09:17:48] of course, that's the VIP
[09:17:48] while ganeti7002 only has `10.140.2.3 dev tap0`
[09:17:59] let me try to manually fix it
[09:21:07] vgutierrez: looks good now?
[09:21:37] nope
[09:25:12] XioNoX: oh...
[09:25:21] XioNoX: IPv6 is healthy though
[09:27:34] meaning IPv6 always worked
[09:27:39] I disabled rp_filter on both the VM facing and externally facing interfaces of ganeti7002
[09:32:43] Traffic, Liberica: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10917499 (Vgutierrez)
[09:33:34] vgutierrez: ok, it's good now
[09:33:36] yes
[09:33:39] what did you change?
[09:33:55] Jun 16 09:33:14 lvs7001 libericad[3825331]: time=2025-06-16T09:33:14.210Z level=INFO msg="detected healthcheck state change" service=ncredir-httpslb_443 hostname=ncredir7003.magru.wmnet address=10.140.2.3 healthcheck_name=HTTPCheck healthcheck_id=1905204800 healthcheck_result_old=false healthcheck_result=true
[09:34:02] disabling rp_filter on each interface wasn't enough, I had to do `net.ipv4.conf.all.rp_filter=0` as well...
[09:34:08] of course
[09:34:13] it's the most restrictive of both
[09:34:48] why wasn't IPv6 affected?
[09:35:34] glad to see that switching to katran on lvs7001 detected the issue as healthchecks follow the same path as prod traffic btw
[09:37:43] no idea, `net.ipv6.conf.all.rp_filter` doesn't exist, maybe it's not implemented?
[09:38:22] quick google search seems to confirm that
[10:01:10] vgutierrez, moritzm, ready for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159390
[10:01:26] I didn't know about `profile::base::enable_rp_filter`, quite convenient :)
[10:01:58] I had to implement that when I needed to disable rp_filter on realservers
[10:04:15] lgtm
[10:05:13] these are only read on boot, did you already set them on 7001/7002 during debugging? otherwise let's simply backfill with sysctl -w and avoid a reboot
[10:07:22] moritzm: good idea, all done
[10:07:54] great, thx
[10:08:45] moritzm: next time you create a VM, can you check that its matching tap* interface has rp_filter disabled?
[10:08:51] (or just ping me and I can check)
[10:09:54] will do!
[10:10:31] I'll turn ganeti7003 into a routed node next, then I'll flip the remaining VMs running on non-DRBD to DRBD and then I'll add ncredir7004
[10:53:50] Traffic, MW-on-K8s, serviceops, SRE, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10917731 (Clement_Goubert)
[11:23:02] o/ I have another restbase migration URL (second last API to be migrated, hopefully!) if it would suit to roll a change out now https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156813
[11:23:21] these are fairly rarely used
[12:09:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:14:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:34:28] Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10918072 (ihurbain) Note: I actually do not know how these are generated, so it's plausible t...
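Note on the 09:34 rp_filter exchange: the kernel's ip-sysctl documentation says source validation on an interface uses the maximum of `net.ipv4.conf.all.rp_filter` and `net.ipv4.conf.<iface>.rp_filter`, which fits the observation that zeroing only the per-interface values was not enough. Below is a minimal Python sketch of that rule together with a simplified strict-mode check. The function names, the route table (in particular the default route via `eno12399np0`), and the starting `all` value of 1 are illustrative assumptions, not values read from ganeti7002.

```python
import ipaddress
from typing import Dict, Optional


def effective_rp_filter(all_value: int, iface_value: int) -> int:
    """Mode applied on an interface: max of the 'all' and per-interface sysctls."""
    return max(all_value, iface_value)


def reverse_path_device(src: str, routes: Dict[str, str]) -> Optional[str]:
    """Longest-prefix match: which device would be used to route back to src."""
    addr = ipaddress.ip_address(src)
    best_len, best_dev = -1, None
    for prefix, dev in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_dev = net.prefixlen, dev
    return best_dev


def rpf_accepts(src: str, in_iface: str, routes: Dict[str, str],
                all_value: int, iface_value: int) -> bool:
    """Simplified model of rp_filter: 0 = off, 1 = strict, 2 = loose."""
    mode = effective_rp_filter(all_value, iface_value)
    dev = reverse_path_device(src, routes)
    if mode == 0:
        return True
    if mode == 1:
        return dev == in_iface  # reverse path must use the arrival interface
    return dev is not None      # loose: any route back to the source will do


# Hypothetical routing table: only the /32 via tap0 is mentioned in the log,
# the default route is an assumption made for this illustration.
ROUTES = {"10.140.2.3/32": "tap0", "0.0.0.0/0": "eno12399np0"}
VIP = "195.200.68.226"  # source address of the replies seen on tap0

# Per-interface value set to 0 while 'all' is still 1: strict mode still applies.
print(rpf_accepts(VIP, "tap0", ROUTES, all_value=1, iface_value=0))
# Both values set to 0: the reply is accepted.
print(rpf_accepts(VIP, "tap0", ROUTES, all_value=0, iface_value=0))
```

IPv4 is the only family with an `rp_filter` sysctl. On IPv6, reverse-path filtering is normally done in netfilter instead, which is consistent with IPv6 having kept working throughout.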
[12:36:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:41:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:44:16] hnowlan: looks good
[12:48:13] vgutierrez: thanks - I'll wait for the backport window to finish before rolling out.
[13:31:08] Traffic, Liberica: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036 (Vgutierrez) NEW
[13:48:11] FIRING: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:50:12] FIRING: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:51:42] FIRING: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:52:18] Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10918633 (akosiaris) >>! In T393557#10918072, @ihurbain wrote: > Note: I actually do not know...
[14:00:22] Traffic, Patch-For-Review: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131#10918656 (Vgutierrez) In progress→Resolved a:Vgutierrez
[14:38:28] RESOLVED: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:38:28] FIRING: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:39:35] RESOLVED: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:40:55] Traffic, Liberica: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036#10918888 (Vgutierrez) Open→Resolved
[14:43:35] RESOLVED: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:51:40] Traffic, Liberica: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 (Vgutierrez) NEW
[14:56:58] Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10918979 (hnowlan) There is a larger chunk of work involved in deprecating and replacing API...
[15:43:56] Traffic, collaboration-services, Gerrit, Patch-For-Review, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10919284 (Jelto) a:Jelto
[16:04:20] Traffic, Liberica: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053#10919410 (Vgutierrez) p:Triage→High
[17:15:03] Traffic, Liberica: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053#10919800 (Vgutierrez) eBPF pinning doesn't seem to be working as expected, this could be a side effect of hardening liberica units: ` vgutierrez@lvs1013:~$ sudo -u liberica touch /sys/fs/b...
[17:16:36] Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10919804 (BCornwall) p:Triage→Medium
[18:20:10] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920023 (BCornwall) Okay, so we're ready to reimage lvs1016 but it appears that the mgmt interface isn't reachable. Could dcops look into this, please?
[18:22:05] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920042 (BCornwall)
[20:07:00] brett: it looks like brief addition of lvs1016 to high-traffic might have led to some comment-only dither in pybal.conf on at least lvs1020
[20:07:19] oh hmm
[20:07:38] just flagging it due to the resulting `PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 [...]` alert
[20:07:38] :(
[20:08:11] ah yeah, thanks for catching that
[20:08:32] oh yeah, that makes sense
[20:08:43] I'll go ahead and restart pybal then
[20:08:48] brett: simple restart should be enough. it should be safe since it was only fired after this, thus no cascading changes.
[20:08:58] thanks and swfrench-wmf too!
[20:09:18] no problem! easy to sort based on the puppet-agent journal
[20:09:28] 1017 too
[20:09:38] still a WARN but will change to CRIT eventually
[20:09:43] ack
[20:09:56] ah, yeah I'd not checked the other one yet
[20:33:23] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920481 (VRiley-WMF) @BCornwall Hey there, thanks for letting us know. I did replace the cable and it seems to respond to ping. Would you be able to check again? It seems to...
[20:41:12] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920518 (BCornwall) @VRiley-WMF Thanks for the quick response! I've not been able to ping the mgmt interface (10.65.0.75) from lvs1017, cumin1002, and cumin2002. It's timin...
[20:57:23] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920555 (VRiley-WMF) Okay, I found the problem (I pinged the incorrect IP) I set the IP address on the iDRAC to the one listed in netbox. I just tested out the ping and it s...
[21:52:04] Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920717 (BCornwall) Thank you!