[02:30:34] 10Domains, 06Traffic, 06SRE, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10616926 (10BCornwall) Sadly, I have also been unable to get ahold of Thomas.
[08:41:17] 06Traffic: Provide cookbook(s) to operate liberica - https://phabricator.wikimedia.org/T388369 (10Vgutierrez) 03NEW
[08:41:27] 06Traffic: Provide cookbook(s) to operate liberica - https://phabricator.wikimedia.org/T388369#10617315 (10Vgutierrez) p:05Triage→03Medium
[09:34:38] 10netops, 06Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10617502 (10ayounsi) 05Open→03Resolved a:03Papaul nice !
[11:04:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.25:80 @ prometheus2007 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:11:20] ^^ see -observability, they are having some issues with that instance
[11:19:15] ChrisDobbins901_: ^^ BTW that's your alert working as expected ;P
[11:19:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.25:80 @ prometheus2007 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:35:11] topranks: I was checking a BGP status alert on cr2-eqsin
[11:37:03] vgutierrez: ok, I missed that one
[11:37:05] from today?
[11:37:33] topranks: and I see BGP sessions there flapping for durum500[12] at 11:21:27 (and :28 for IPv6)
[11:37:48] and earlier today for netflow5002
[11:38:58] netflow I'm not so sure about
[11:39:05] for some reason the durum hosts flap quite a bit
[11:39:12] I'm not sure what the pattern is, other than the durum hosts
[11:39:23] the doh hosts don't flap, which are also VMs with similar config
[11:39:54] I don't see anything weird on https://grafana.wikimedia.org/d/WvigL8WGz/wikidough?orgId=1&from=now-3h&to=now
[11:40:47] hmm those are doh hosts...
[11:40:55] naming is fricking hard obviously
[11:41:09] yeah.... I should probably open a task on it, myself and Sukhbir have looked at it before
[11:41:13] always flaps and then comes back
[11:41:35] ok.. durum is just check.wikimedia-dns.org.
[11:41:36] and we don't spend enough time on it cos it's durum... but I noticed it's definitely a pattern when working on some of the BGP dashboards recently
[11:41:39] yeah
[11:48:04] vgutierrez: anyway thanks for the heads up, I'll open a task now that we've got better stats to confirm this is a regular problem
[11:48:24] but durum is not really a big worry, it's anycast anyway and they're not all flapping at once
[11:53:04] well.. eqsin flapped at the same second
[12:13:14] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397#10618243 (10JAllemandou)
[12:22:00] 06Traffic, 06MediaWiki-Engineering, 06serviceops, 07Upstream, 07Wikimedia-production-error: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395#10618275 (10jijiki) 05Open→03Resolved a:03jijiki We are marking this as resolved, if you reckon there is someth...
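One way to confirm the flap timestamps from the VM side, as discussed above, is to look at bird's own logs and session state on an affected host. A rough sketch, assuming the "bird" systemd unit shown later in this log and generic log wording (the exact messages may differ):

    # Sketch only: pull recent BGP state changes from bird's journal on an affected VM
    sudo journalctl -u bird --since "today" | grep -Ei 'bgp|state changed|went (up|down)'

    # Current session state and time of last change, as reported by the bird CLI
    sudo birdc show protocols all

Correlating those timestamps against puppet runs and CPU spikes is essentially what the dashboards below were used for.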
[12:58:58] topranks: if you remember we had some flapping on the eqiad doh hosts as well; that seems to have gone away without us doing anything about it
[12:59:11] all these hosts are running the same version of bird so it can't be that either
[12:59:24] the only difference is that in the above, netflow and durum are on the private VLAN
[12:59:36] but then doh hosts in eqiad were also flapping and they are on the public one
[13:05:47] 06Traffic, 06Commons, 06serviceops, 10Wikimedia-Site-requests: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177#10618520 (10jijiki) 05Open→03Resolved a:03jijiki Closing this as many things have changed since then, ie we have various kinds of ratelimits in...
[13:07:13] anyway, I will look at this again too. we really don't care about a single durum host flapping but the ECH experiment is going to run on those and hence it might be worthwhile
[13:41:58] sukhe: hmm ok yeah
[13:42:04] I made a temp dashboard to look at this:
[13:42:05] https://grafana.wikimedia.org/goto/Z9o99ihNg?orgId=1
[13:42:27] if you toggle between 'doh' and 'durum' you can see the issue is with the durum VMs
[13:42:43] I have no idea why though - they have largely the same bird config right?
[13:43:03] topranks: fancy dashboard
[13:43:24] and yeah, it's almost the same config, same bird version, same OS (and therefore kernel version)
[13:43:27] well it doesn't work for eqiad but I know why (labels seem a little different in the measurements)
[13:43:29] yeah
[13:43:48] the only difference being the public/private VLAN one -- not sure how relevant but it's there
[13:43:53] I wonder if it's something to do with how little the durum hosts are used? Maybe the hypervisor is putting them to "sleep" (technical term there lol)
[13:44:05] shouldn't matter but I guess we need to consider everything
[13:45:26] also even though they don't flap at once there is a strange correlation in time, even across DCs
[13:45:46] look at it for the last 24h for example
[13:45:48] I have very limited metrics from those hosts but we can check resource usage. negligible if anything but I don't see how that would result in the BGP session flap, unless there is something in the bird config
[13:45:50] topranks: could it fit puppet run times?
[13:45:52] they all flap shortly after 11am
[13:45:57] maybe some resource issues?
[13:46:09] XioNoX: I was gonna say is there some cronjob or something they do that uses a lot of resources periodically?
[13:46:13] yeah
[13:46:23] yeah there is definitely a pattern
[13:46:32] I am about to head to a meeting but I will check later. thanks for the dashboard!
[13:46:45] cool we can discuss later thanks
[13:47:19] https://grafana.wikimedia.org/goto/Bk3VjmhHg?orgId=1
[13:47:26] you can see the spikes here as well for example, on the CPU usage
[13:47:31] and in other dashboards
[13:51:28] bear in mind we have BFD running here. so the routers/switches need a keepalive every 300ms or they will tear down the session. which is fairly aggressive for a VM.
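To put the 300ms keepalive point in context: the negotiated BFD timers can be inspected from the host side with the bird CLI. A sketch, assuming BIRD 2 with the bfd protocol configured on these VMs (implied by the discussion but not shown here):

    # Sketch: inspect BFD session state and timers from the VM (BIRD 2 CLI syntax assumed)
    sudo birdc show bfd sessions
    # If the receive/transmit intervals are around 300ms with a small multiplier,
    # even a brief stall on the VM (e.g. during a heavy puppet run) is enough for
    # the router to declare the neighbour down and tear the BGP session with it.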
but obviously doh are fine
[13:56:15] I think I see it, need to confirm once I come back from the meeting
[13:56:22] the timing matches with:
[13:56:34] https://puppetboard.wikimedia.org/report/durum5001.eqsin.wmnet/ad29422fc3c18a996a533a80bb553071a0d56530
[13:56:41] > prime256v1.ocsp]/content
[13:56:44] > Scheduling refresh of Service[nginx]
[13:56:58] which then:
[13:56:58] sukhe@durum5001:~$ systemctl list-dependencies nginx --reverse
[13:56:58] nginx.service
[13:56:58] ● ├─anycast-healthchecker.service
[13:57:09] sukhe@durum5001:~$ systemctl list-dependencies anycast-healthchecker.service --reverse
[13:57:12] anycast-healthchecker.service
[13:57:15] ● ├─bird.service
[13:58:19] and that causes the session flap but brb now
[14:17:01] ok I am quite sure this is it. I should have been using nginx-reload here but I am using the service restart instead, which is wrong. this should not affect DoH since we use dnsdist there and it has an explicit capture for the reload.
[14:17:22] this doesn't explain the flapping in eqiad we saw but let's think about that later and fix this first to see if it helps
[14:23:36] for clarity what is happening here is that because I erroneously put restart on nginx when acme-chief has a resource updated, it restarts nginx, which restarts anycast-healthchecker and which restarts bird, because we have all these systemd dependencies on purpose
[14:23:52] and that causes the BGP session to flap, when bird restarts
[14:24:37] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126059 this should fix it
[14:44:58] sukhe: haha nice find!
[14:45:37] makes total sense, hopefully that'll be the end of this one now :)
[14:46:15] topranks: let this be a lesson that it's not always the network or the DNS :P
[14:46:32] I don't understand
[14:46:36] it's never the network :P
[14:46:40] :D
[15:45:33] 06Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10619536 (10BCornwall)
[16:55:34] 06Traffic: Gather site pooled/depooled information for Grafana - https://phabricator.wikimedia.org/T376876#10619936 (10Fabfur) a:05Fabfur→03CDobbins
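For reference, one way to verify the fix on a durum host once the change above lands is to check that a certificate-driven nginx reload no longer bounces bird. A sketch, using the unit names from the systemctl output earlier in the log:

    # Sketch: confirm that a cert refresh no longer restarts bird
    # 1. note bird's start time before touching anything
    systemctl show bird.service -p ActiveEnterTimestamp

    # 2. reload nginx the way the corrected puppet resource should (reload, not restart)
    sudo systemctl reload nginx

    # 3. bird's start time should be unchanged; a restart instead would have propagated
    #    through anycast-healthchecker to bird and flapped the BGP session, as described above
    systemctl show bird.service -p ActiveEnterTimestamp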