[02:30:34] 10Domains, 06Traffic, 06SRE, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10616926 (10BCornwall) Sadly, I have also been unable to get ahold of Thomas.
[08:41:17] 06Traffic: Provide cookbook(s) to operate liberica - https://phabricator.wikimedia.org/T388369 (10Vgutierrez) 03NEW
[08:41:27] 06Traffic: Provide cookbook(s) to operate liberica - https://phabricator.wikimedia.org/T388369#10617315 (10Vgutierrez) p:05Triage→03Medium
[09:34:38] 10netops, 06Infrastructure-Foundations: Different BFD settings on direct connected links - https://phabricator.wikimedia.org/T387773#10617502 (10ayounsi) 05Open→03Resolved a:03Papaul nice !
[11:04:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.25:80 @ prometheus2007 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:11:20] ^^ see -observability, they are having some issues with that instance
[11:19:15] ChrisDobbins901_: ^^ BTW that's your alert working as expected ;P
[11:19:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.25:80 @ prometheus2007 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:35:11] topranks: I was checking a BGP status alert on cr2-eqsin
[11:37:03] vgutierrez: ok, I missed that one
[11:37:05] from today?
[11:37:33] topranks: and I see BGP sessions there flapping for durum500[12] at 11:21:27 (and :28 for IPv6)
[11:37:48] and earlier today for netflow5002
[11:38:58] netflow I'm not so sure about
[11:39:05] for some reason the durum hosts flap quite a bit
[11:39:12] I'm not sure what the pattern is, other than the durum hosts
[11:39:23] the doh hosts don't flap, which are also VMs with similar config
[11:39:54] I don't see anything weird on https://grafana.wikimedia.org/d/WvigL8WGz/wikidough?orgId=1&from=now-3h&to=now
[11:40:47] hmm those are doh hosts...
[11:40:55] naming is fricking hard obviously
[11:41:09] yeah.... I should probably open a task on it, myself and Sukhbir have looked at it before
[11:41:13] always flaps and then comes back
[11:41:35] ok.. durum is just check.wikimedia-dns.org.
[11:41:36] and we don't spend enough time on it cos it's durum... but I noticed it's definitely a pattern when working on some of the BGP dashboards recently
[11:41:39] yeah
[11:48:04] vgutierrez: anyway thanks for the heads up, I'll open a task now that we've got better stats to confirm this is a regular problem
[11:48:24] but durum is not really a big worry, it's anycast anyway and they're not all flapping at once
[11:53:04] well.. eqsin flapped at the same second
[12:13:14] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration: Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field - https://phabricator.wikimedia.org/T388397#10618243 (10JAllemandou)
[12:22:00] 06Traffic, 06MediaWiki-Engineering, 06serviceops, 07Upstream, 07Wikimedia-production-error: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395#10618275 (10jijiki) 05Open→03Resolved a:03jijiki We are marking this as resolved, if you reckon there is someth...
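One way to confirm the flap timestamps from the VM side, as discussed above, is to look at bird's own logs and session state on an affected host. A rough sketch, assuming the "bird" systemd unit shown later in this log and generic log wording (the exact messages may differ):

    # Sketch only: pull recent BGP state changes from bird's journal on an affected VM
    sudo journalctl -u bird --since "today" | grep -Ei 'bgp|state changed|went (up|down)'

    # Current session state and time of last change, as reported by the bird CLI
    sudo birdc show protocols all

Correlating those timestamps against puppet runs and CPU spikes is essentially what the dashboards below were used for.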
[12:58:58] topranks: if you remember we had some flapping on the eqiad doh hosts as well; that seems to have gone away without us doing anything about it
[12:59:11] all these hosts are running the same version of bird so it can't be that either
[12:59:24] the only difference is that in the above, netflow and durum are on the private VLAN
[12:59:36] but then doh hosts in eqiad were also flapping and they are on the public one
[13:05:47] 06Traffic, 06Commons, 06serviceops, 10Wikimedia-Site-requests: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177#10618520 (10jijiki) 05Open→03Resolved a:03jijiki Closing this as many things have changed since then, ie we have various kinds of ratelimits in...
[13:07:13] anyway, I will look at this again too. we really don't care about a single durum host flapping but the ECH experiment is going to run on those and hence it might be worthwhile
[13:41:58] sukhe: hmm ok yeah
[13:42:04] I made a temp dashboard to look at this:
[13:42:05] https://grafana.wikimedia.org/goto/Z9o99ihNg?orgId=1
[13:42:27] if you toggle between 'doh' and 'durum' you can see the issue is with the durum VMs
[13:42:43] I have no idea why though - they have largely the same bird config right?
[13:43:03] topranks: fancy dashboard
[13:43:24] and yeah, it's almost the same config, same bird version, same OS (and therefore kernel version)
[13:43:27] well it doesn't work for eqiad but I know why (labels seem a little different in the measurements)
[13:43:29] yeah
[13:43:48] the only difference being the public/private VLAN one -- not sure how relevant but it's there
[13:43:53] I wonder if it's something to do with how little the durum hosts are used? Maybe the hypervisor is putting them to "sleep" (technical term there lol)
[13:44:05] shouldn't matter but I guess we need to consider everything
[13:45:26] also even though they don't flap at once there is a strange correlation in time, even across DCs
[13:45:46] look at it for the last 24h for example
[13:45:48] I have very limited metrics from those hosts but we can check resource usage. negligible if anything but I don't see how that would result in the BGP session flap, unless there is something in the bird config
[13:45:50] topranks: could it fit puppet run times?
[13:45:52] they all flap shortly after 11am
[13:45:57] maybe some resource issues?
[13:46:09] XioNoX: I was gonna say is there some cronjob or something they do that uses a lot of resources periodically?
[13:46:13] yeah
[13:46:23] yeah there is definitely a pattern
[13:46:32] I am about to head to a meeting but I will check later. thanks for the dashboard!
[13:46:45] cool we can discuss later thanks
[13:47:19] https://grafana.wikimedia.org/goto/Bk3VjmhHg?orgId=1
[13:47:26] you can see the spikes here as well for example, on the CPU usage
[13:47:31] and in other dashboards
[13:51:28] bear in mind we have BFD running here. so the routers/switches need a keepalive every 300ms or they will tear down the session. which is fairly aggressive for a VM.
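To put the 300ms keepalive point in context: the negotiated BFD timers can be inspected from the host side with the bird CLI. A sketch, assuming BIRD 2 with the bfd protocol configured on these VMs (implied by the discussion but not shown here):

    # Sketch: inspect BFD session state and timers from the VM (BIRD 2 CLI syntax assumed)
    sudo birdc show bfd sessions
    # If the receive/transmit intervals are around 300ms with a small multiplier,
    # even a brief stall on the VM (e.g. during a heavy puppet run) is enough for
    # the router to declare the neighbour down and tear the BGP session with it.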
but obviously doh are fine
[13:56:15] I think I see it, need to confirm once I come back from the meeting
[13:56:22] the timing matches with:
[13:56:34] https://puppetboard.wikimedia.org/report/durum5001.eqsin.wmnet/ad29422fc3c18a996a533a80bb553071a0d56530
[13:56:41] > prime256v1.ocsp]/content
[13:56:44] > Scheduling refresh of Service[nginx]
[13:56:58] which then:
[13:56:58] sukhe@durum5001:~$ systemctl list-dependencies nginx --reverse
[13:56:58] nginx.service
[13:56:58] ● ├─anycast-healthchecker.service
[13:57:09] sukhe@durum5001:~$ systemctl list-dependencies anycast-healthchecker.service --reverse
[13:57:12] anycast-healthchecker.service
[13:57:15] ● ├─bird.service
[13:58:19] and that causes the session flap but brb now
[14:17:01] ok I am quite sure this is it. I should have been using nginx-reload here but I am using the service restart instead, which is wrong. this should not affect DoH since we use dnsdist there and it has an explicit capture for the reload.
[14:17:22] this doesn't explain the flapping in eqiad we saw but let's think about that later and fix this first to see if it helps
[14:23:36] for clarity what is happening here is that because I erroneously put restart on nginx when acme-chief has a resource updated, it restarts nginx, which restarts anycast-healthchecker and which restarts bird, because we have all these systemd dependencies on purpose
[14:23:52] and that causes the BGP session to flap, when bird restarts
[14:24:37] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126059 this should fix it
[14:44:58] sukhe: haha nice find!
[14:45:37] makes total sense, hopefully that'll be the end of this one now :)
[14:46:15] topranks: let this be a lesson that it's not always the network or the DNS :P
[14:46:32] I don't understand
[14:46:36] it's never the network :P
[14:46:40] :D
[15:45:33] 06Traffic: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035#10619536 (10BCornwall)
[16:55:34] 06Traffic: Gather site pooled/depooled information for Grafana - https://phabricator.wikimedia.org/T376876#10619936 (10Fabfur) a:05Fabfur→03CDobbins
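For reference, one way to verify the fix on a durum host once the change above lands is to check that a certificate-driven nginx reload no longer bounces bird. A sketch, using the unit names from the systemctl output earlier in the log:

    # Sketch: confirm that a cert refresh no longer restarts bird
    # 1. note bird's start time before touching anything
    systemctl show bird.service -p ActiveEnterTimestamp

    # 2. reload nginx the way the corrected puppet resource should (reload, not restart)
    sudo systemctl reload nginx

    # 3. bird's start time should be unchanged; a restart instead would have propagated
    #    through anycast-healthchecker to bird and flapped the BGP session, as described above
    systemctl show bird.service -p ActiveEnterTimestamp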