[08:26:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_liberica_cp_checks.service on lvs7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:31] <wikibugs>	 Traffic, Liberica: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228#10917013 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=59c5d762-0532-415b-82b0-0ec6b72d478a) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with r...
[08:41:29] <vgutierrez>	 moritzm: ncredir7003 doesn't seem to be in a healthy state
[08:44:21] <moritzm>	 what specifically? it looks all fine in Icinga
[08:45:02] <vgutierrez>	 https://alerts.wikimedia.org/?q=alertname%3DProbeDown&q=team%3Dsre&q=%40receiver%3Ddefault
[08:45:20] <vgutierrez>	 that on one hand, and healthchecks are failing
[08:50:53] <vgutierrez>	 moritzm: does routed ganeti have some FW rules applied?
[08:51:59] <XioNoX>	 vgutierrez: not for the routed traffic
[08:52:10] <vgutierrez>	 XioNoX: what about IPIP encapsulation ranges?
[08:52:35] <vgutierrez>	 https://www.irccloud.com/pastebin/xY7h8CsJ/
[08:52:48] <vgutierrez>	 I'm seeing connection attempts but no responses at all
[08:52:59] <XioNoX>	 looking
[08:53:12] <vgutierrez>	 so I'd be inclined to think that ganeti is blocking traffic from 172.16.0.0/12
[08:54:09] <vgutierrez>	 and that'd explain the alert that we have with ncredir@magru
[08:54:19] <vgutierrez>	 let me depool ncredir7003
[08:55:17] <moritzm>	 ack
[08:55:33] <moritzm>	 there's no obvious error on the service level from looking at logs
[08:56:12] <vgutierrez>	 instance itself looks healthy FWIW
[09:00:37] <XioNoX>	 `ganeti7002:~$ sudo tcpdump -i eno12399np0 host 10.140.0.13` doesn't see any traffic towards ncredir (but sees some to other VMs, like prometheus or bast), but pings between lvs7001 and ncredir work
[09:02:10] <vgutierrez>	 XioNoX: that's expected
[09:02:19] <vgutierrez>	 especially in the scenario I just described
[09:02:29] <vgutierrez>	 inbound traffic from lvs7001 is IPIP encapsulated
[09:02:46] <vgutierrez>	 and IPIP multiqueue optimized, so source IP address gets randomized
[09:02:51] <XioNoX>	 ahh right..
[09:02:59] <vgutierrez>	 see the tcpdump paste I posted here
[09:05:01] <vgutierrez>	 hmmm wait
[09:05:11] <XioNoX>	 ok, so IPIP traffic makes it to ncredir: `ncredir7003:~$ sudo tcpdump net 172.16.0.0/12`
[09:05:15] <vgutierrez>	 (nah, I just spotted the iptables rule on ncredir7003 as well)
[09:05:58] <vgutierrez>	 so what's happening with the responses?
[09:06:31] <vgutierrez>	 ncredir7003 is sending responses
[09:06:39] <vgutierrez>	 09:06:25.812391 IP ncredir-lb.magru.wikimedia.org.https > lvs7001.magru.wmnet.60128: Flags [S.], seq 870613634, ack 3432636855, win 43440, options [mss 1440,sackOK,TS val 1752342733 ecr 862978466,nop,wscale 9], length 0
[09:06:47] <vgutierrez>	 so the host itself is properly configured
[09:08:23] <XioNoX>	 it's visible all the way to the tap0 interface on ganeti7002, so yeah something is eating the reply
[09:08:49] <XioNoX>	 as they're not visible on eno12399np0
[09:11:04] <XioNoX>	 I'm wondering if it's not some kind of rpf
[09:15:17] <vgutierrez>	 I could track the traffic with `pwru` but the impact on ganeti7002 CPU load would be significant
[09:16:12] <XioNoX>	 too bad linux doesn't have counters for that
[09:17:34] <XioNoX>	 but I'm sure that's the issue, the reply comes from 195.200.68.226
[09:17:48] <vgutierrez>	 of course, that's the VIP
[09:17:48] <XioNoX>	 while ganeti7002 only has `10.140.2.3 dev tap0`
[09:17:59] <XioNoX>	 let me try to manually fix it
[09:21:07] <XioNoX>	 vgutierrez: looks good now ?
[09:21:37] <vgutierrez>	 nope
[09:25:12] <vgutierrez>	 XioNoX: oh...
[09:25:21] <vgutierrez>	 XioNoX: IPv6 is healthy though
[09:27:34] <vgutierrez>	 meaning IPv6 always worked
[09:27:39] <XioNoX>	 I disabled rp_filter on both the VM facing and externally facing interfaces of ganeti7002
[09:32:43] <wikibugs>	 Traffic, Liberica: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10917499 (Vgutierrez)
[09:33:34] <XioNoX>	 vgutierrez: ok, it's good now
[09:33:36] <vgutierrez>	 yes
[09:33:39] <vgutierrez>	 what did you change?
[09:33:55] <vgutierrez>	 Jun 16 09:33:14 lvs7001 libericad[3825331]: time=2025-06-16T09:33:14.210Z level=INFO msg="detected healthcheck state change" service=ncredir-httpslb_443 hostname=ncredir7003.magru.wmnet address=10.140.2.3 healthcheck_name=HTTPCheck healthcheck_id=1905204800 healthcheck_result_old=false healthcheck_result=true
[09:34:02] <XioNoX>	 disabling rp_filter on each interface wasn't enough, I had to do `net.ipv4.conf.all.rp_filter=0` as well...
[09:34:08] <vgutierrez>	 of course
[09:34:13] <vgutierrez>	 it's the most restrictive of both
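Note on the "most restrictive of both" remark above: for IPv4 source validation, Linux applies the maximum of `net.ipv4.conf.all.rp_filter` and `net.ipv4.conf.<iface>.rp_filter` to each interface. A minimal sketch of that rule follows. The function name and the sample values are hypothetical and are not taken from ganeti7002.

```python
def effective_rp_filter(conf_all: int, conf_iface: int) -> int:
    """Return the rp_filter mode the kernel applies to an interface.

    Linux uses the maximum of net.ipv4.conf.all.rp_filter and
    net.ipv4.conf.<iface>.rp_filter (0 = off, 1 = strict, 2 = loose).
    """
    return max(conf_all, conf_iface)


if __name__ == "__main__":
    # Hypothetical values: the interface is set to 0 while "all" is still 1,
    # so strict filtering remains in effect on that interface.
    print(effective_rp_filter(conf_all=1, conf_iface=0))
    # Only once "all" is 0 as well is source validation really off.
    print(effective_rp_filter(conf_all=0, conf_iface=0))
```

This is why clearing the per-interface values alone did not help until `net.ipv4.conf.all.rp_filter=0` was set too.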
[09:34:48] <vgutierrez>	 why wasn't IPv6 affected?
[09:35:34] <vgutierrez>	 glad to see that switching to katran on lvs7001 detected the issue as healthchecks follow the same path as prod traffic btw
[09:37:43] <XioNoX>	 no idea, `net.ipv6.conf.all.rp_filter` doesn't exist, maybe it's not implemented?
[09:38:22] <vgutierrez>	 quick google search seems to confirm that
[10:01:10] <XioNoX>	 vgutierrez, moritzm, ready for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159390
[10:01:26] <XioNoX>	 I didn't know about `profile::base::enable_rp_filter` quite convenient :)
[10:01:58] <vgutierrez>	 I had to implement that when I needed to disable rp_filter on realservers
[10:04:15] <moritzm>	 lgtm
[10:05:13] <moritzm>	 these are only read on boot, did you already set them on 7001/7002 during debugging? otherwise let's simply backfill with sysctl -w and avoid a reboot
[10:07:22] <XioNoX>	 moritzm: good idea, all done
[10:07:54] <moritzm>	 great, thx
[10:08:45] <XioNoX>	 moritzm: next time you create a VM, can you check that its matching tap* interface has rp_filter disabled?
[10:08:51] <XioNoX>	 (or just ping me and I can check)
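Note: the tap* check requested above could be scripted roughly as follows. This is only a sketch. It assumes the standard `/proc/sys/net/ipv4/conf` layout on Linux, and the function name `interfaces_with_rp_filter` is hypothetical. It reports the effective mode, meaning the maximum of the `all` value and the per-interface value.

```python
from pathlib import Path


def interfaces_with_rp_filter(conf_dir="/proc/sys/net/ipv4/conf", pattern="tap*"):
    """Return {iface: effective_mode} for matching interfaces where rp_filter is still on."""
    base = Path(conf_dir)
    # The "all" value is combined with every per-interface value.
    conf_all = int((base / "all" / "rp_filter").read_text().strip())
    enabled = {}
    for entry in sorted(base.glob(pattern)):
        rp_file = entry / "rp_filter"
        if not rp_file.is_file():
            continue
        # Effective mode is the maximum of the two settings.
        mode = max(conf_all, int(rp_file.read_text().strip()))
        if mode != 0:
            enabled[entry.name] = mode
    return enabled
```

Calling `interfaces_with_rp_filter()` on a Ganeti node would return an empty dict when every tap* interface effectively has rp_filter disabled.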
[10:09:54] <moritzm>	 will do!
[10:10:31] <moritzm>	 I'll turn ganeti7003 into a routed node next, then I'll flip the remaining VMs running on non-DRBD to DRBD and then I'll add ncredir7004
[10:53:50] <wikibugs>	 Traffic, MW-on-K8s, serviceops, SRE, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10917731 (Clement_Goubert)
[11:23:02] <hnowlan>	 o/ I have another restbase migration URL (second last API to be migrated, hopefully!) if it would suit to roll a change out now https://gerrit.wikimedia.org/r/c/operations/puppet/+/1156813
[11:23:21] <hnowlan>	 these are fairly rarely used 
[12:09:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:14:00] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:34:28] <wikibugs>	 Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10918072 (ihurbain) Note: I actually do not know how these are generated, so it's plausible t...
[12:36:00] <jinxer-wm>	 FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:41:00] <jinxer-wm>	 RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh7003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=magru&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:44:16] <vgutierrez>	 hnowlan: looks good
[12:48:13] <hnowlan>	 vgutierrez: thanks - I'll wait for the backport window to finish before rolling out. 
[13:31:08] <wikibugs>	 Traffic, Liberica: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036 (Vgutierrez) NEW
[13:48:11] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: haproxy-combined <no value> - https://slo.wikimedia.org/?search=haproxy-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:50:12] <jinxer-wm>	 FIRING: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:51:42] <jinxer-wm>	 FIRING: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:52:18] <wikibugs>	 Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10918633 (akosiaris) >>! In T393557#10918072, @ihurbain wrote: > Note: I actually do not know...
[14:00:22] <wikibugs>	 Traffic, Patch-For-Review: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131#10918656 (Vgutierrez) In progress→Resolved a:Vgutierrez
[14:38:28] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:38:28] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: haproxy-combined <no value> - https://slo.wikimedia.org/?search=haproxy-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:39:35] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: varnish-combined magru - https://slo.wikimedia.org/?search=varnish-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:40:55] <wikibugs>	 Traffic, Liberica: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036#10918888 (Vgutierrez) Open→Resolved
[14:43:35] <jinxer-wm>	 RESOLVED: [2x] SLOMetricAbsent: haproxy-combined <no value> - https://slo.wikimedia.org/?search=haproxy-combined   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:51:40] <wikibugs>	 Traffic, Liberica: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053 (Vgutierrez) NEW
[14:56:58] <wikibugs>	 Traffic, RESTBase, RESTBase Sunsetting, serviceops, and 2 others: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557#10918979 (hnowlan) There is a larger chunk of work involved in deprecating and replacing API...
[15:43:56] <wikibugs>	 Traffic, collaboration-services, Gerrit, Patch-For-Review, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10919284 (Jelto) a:Jelto
[16:04:20] <wikibugs>	 Traffic, Liberica: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053#10919410 (Vgutierrez) p:Triage→High
[17:15:03] <wikibugs>	 Traffic, Liberica: seamless upgrade triggers dropped packets with katran - https://phabricator.wikimedia.org/T397053#10919800 (Vgutierrez) eBPF pinning doesn't seem to be working as expected, this could be a side effect of hardening liberica units: ` vgutierrez@lvs1013:~$ sudo -u liberica touch /sys/fs/b...
[17:16:36] <wikibugs>	 Traffic: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10919804 (BCornwall) p:Triage→Medium
[18:20:10] <wikibugs>	 Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920023 (BCornwall) Okay, so we're ready to reimage lvs1016 but it appears that the mgmt interface isn't reachable. Could dcops look into this, please?
[18:22:05] <wikibugs>	 Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920042 (BCornwall)
[20:07:00] <swfrench-wmf>	 brett: it looks like brief addition of lvs1016 to high-traffic might have led to some comment-only dither in pybal.conf on at least lvs1020
[20:07:19] <sukhe>	 oh hmm
[20:07:38] <swfrench-wmf>	 just flagging it due to the resulting `PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 [...]` alert 
[20:07:38] <brett>	 :(
[20:08:11] <brett>	 ah yeah, thanks for catching that
[20:08:32] <sukhe>	 oh yeah, that makes sense
[20:08:43] <brett>	 I'll go ahead and restart pybal then
[20:08:48] <sukhe>	 brett: simple restart should be enough. it should be safe since it was only fired after this, thus no cascading changes.
[20:08:58] <sukhe>	 thanks and swfrench-wmf too!
[20:09:18] <swfrench-wmf>	 no problem! easy to sort based on the puppet-agent journal
[20:09:28] <sukhe>	 1017 too
[20:09:38] <sukhe>	 still a WARN but will change to CRIT eventually
[20:09:43] <brett>	 ack
[20:09:56] <swfrench-wmf>	 ah, yeah I'd not checked the other one yet
[20:33:23] <wikibugs>	 Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920481 (VRiley-WMF) @BCornwall Hey there, thanks for letting us know. I did replace the cable and it seems to respond to ping. Would you be able to check again? It seems to...
[20:41:12] <wikibugs>	 Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920518 (BCornwall) @VRiley-WMF Thanks for the quick response! I've not been able to ping the mgmt interface (10.65.0.75)  from lvs1017, cumin1002, and cumin2002. It's timin...
[20:57:23] <wikibugs>	 Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920555 (VRiley-WMF) Okay, I found the problem (I pinged the incorrect IP) I set the IP address on the iDRAC to the one listed in netbox. I just tested out the ping and it s...
[21:52:04] <wikibugs>	 Traffic, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920717 (BCornwall) Thank you!