[07:05:13] I'll enable connection re-use on gerrit's backend before there is too much activity [07:06:35] with the revert prepared: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1254750 [07:08:53] running puppet on cp-text hosts with the change [07:14:58] revert submitted, merge in progress [07:18:02] puppet-agent run in progress with the revert [07:23:18] done [08:20:12] arnaudb: o/ thanks a lot for the thorough preparation and communication, really appreciated. One follow-up question - did we understand what went wrong in the first place when enabling connection reuse? [08:23:24] elukey: I had suspicions about some timeout values, we changed these yesterday but it did not fix the issue. I'm now suspecting jetty's httpd.maxThreads = 60 being too low, I'm still looking in the logs to see if I find anything meaningful. fwiw the debug is tracked in https://phabricator.wikimedia.org/T420189 [08:25:30] super thanks [08:42:55] bast2003 is back and can be used again (it needed to be reimaged, so make sure to update host keys) [08:45:29] effie, slyngs we will depool ulsfo in 15 minutes for T418971, expect some noise [08:45:29] T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971 [08:47:40] ack [09:07:47] vgutierrez, topranks, slyngs, can I get a quick review on https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1247994 ? [09:08:14] * topranks looking [09:08:51] Valentin was quick! [09:08:53] XioNoX: ncredir/gerrit missing? [09:09:12] or you don't care about those at that level? [09:09:28] vgutierrez: most likely yeah... We should get rid of it all and do firewalling in ebpf instead :) [09:10:00] we have those as safeguards to prevent some obvious attacks (like UDP traffic) against the well-known public endpoints as we don't run iptables on LVS [09:10:09] I'll open a task [09:10:38] I'm missing some context here.... what should we get rid of? 
[09:11:48] topranks: all of that https://github.com/wikimedia/operations-homer-public/blob/master/policies/cr-border-in.yaml#L86-L124 (lines 86 to 124) [09:13:05] honestly it's such a simple rule (drop all udp), and easy for the router silicon to do, I kind of think it's sensible to keep [09:13:45] topranks: it's a snowflake to work around a server limitation, with lists to maintain, and exceptions, etc [09:13:46] I don't doubt the mighty power of eBPF to drop it on the servers, but even so it seems a simple rule, no point forwarding those packets [09:14:26] but it's not directly related as we use those specific "text-lb.ulsfo" etc definitions in cr-cloud-vrf, so we might not need to add gerrit/ncredir [09:14:31] I think we're back to the old netops dichotomy between you and me Arzhel :) [09:14:39] :) [09:14:41] XioNoX: do we need to wait till you apply that CR? [09:15:04] vgutierrez: yeah, we should stop arguing :) [09:15:14] anyway no strong feelings. If UDP 53 is the only udp port we need I figure it's simple to drop the rest at the internet edge, but no problem if we want to change it [09:15:17] lol [09:15:34] we will need port 443 UDP soon(TM) [09:15:51] (don't quote me on that) [09:15:54] hahah [09:15:55] heh what fun that will be, I assume HTTP3? [09:15:59] yes :D [09:16:17] cool :) [09:16:43] (waiting for puppet to update the repo on cumin1003 then will run homer [09:16:46] ) [09:18:52] vgutierrez: you're good to go [09:19:33] thx <3 [10:00:38] <_joe_> I don't think soon is realistic, re: 443 UDP [10:00:41] <_joe_> quote me on that [10:00:48] <_joe_> sadly [10:01:31] <_joe_> vgutierrez, fabfur I'd like to do a HP deployment, but I don't want to cross over your work with ulsfo; is it done? 
[10:01:43] almost [10:01:50] we're repooling ulsfo [10:01:55] <_joe_> ack I'll prepare the patch then <3 [10:02:00] that's fine [10:02:02] <_joe_> <3 ofc [10:04:43] effie: we just repooled ulsfo [10:05:02] \m/ [10:08:32] vgutierrez: text-lb.ulsfo.wikimedia.org still resolves to 198.35.26.96 ? [10:08:39] LOL [10:08:49] we didn't merge the DNS change [10:09:05] https://en.wikipedia.org/wiki/User_talk:TomWikiAssist *deep sigh* [10:10:12] doing it as we speak... [10:10:23] Emperor: too many words on that page [10:11:47] XioNoX: fixed, sorry about that [10:11:56] it also did the "write blog posts slagging off the humans who blocked it for being an LLM" thing [10:12:28] vgutierrez: no pb :) waiting for my cache (or google dns cache) to refresh [10:12:54] XioNoX: as Sammy put it "clawdbot thing edits Wikipedia for two and a half weeks without permission, gets almost everything reverted, accidentally doxxes its operator, files a civility complaint about being called a clanker, and then writes an essay about what it all means." [10:13:06] hmm wtf, text-lb.ulsfo.wikimedia.org has address 198.35.26.96 [10:14:08] Emperor: Fuck that dude. [10:14:20] (Not Sammy) [10:17:35] I'm depooling ulsfo again [10:22:13] ok... found the issue, VIPs are missing on netbox [10:29:30] funny enough not a big deal in terms of impact [10:29:42] dyna.wikimedia.org. 
doesn't need text-lb.ulsfo to be refreshed [10:29:47] so users didn't notice [10:30:19] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1254864 cleanup patch on the network side [10:31:36] vgutierrez: I'm still getting the old IP fyi [10:35:55] yeah [10:36:01] fabfur is working on the netbox changes [10:36:13] I can't feed him more coffee though [10:36:23] he would become unstable [10:37:05] <_joe_> nor give him food after midnight [10:41:18] <_joe_> Emperor: I missed the part about the blog post [10:43:39] f.abfur vibrating in the corner starts phasing through the floor [10:46:03] he is so skinny that he wouldn't need to phase for that TBH [10:46:53] As a fellow stick, that's offensive :P [10:47:06] claime: yeah.. at 59kg I'm super fat [10:47:22] hmm sre.dns.netbox cookbook is failing for us: `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dns-check.whgip54v/zones/netbox/240-28.26.35.198.in-addr.arpa'` [10:47:27] vgutierrez: I know, and you should be ashamed of your internalised body shaming [10:47:28] Wait [10:47:35] :P [10:48:04] who can help with netbox? 
:) [10:48:15] * vgutierrez resisting the urge of calling v.olans [10:48:30] he's OOO IIRC [10:48:32] Huh [10:49:43] we did some modifications on netbox to remove the old VIPs (not deleted, just removed the VIP label and dns / description) and added the new ones as per https://gerrit.wikimedia.org/r/c/operations/dns/+/1253503 [10:50:37] as a side effect that zone file got emptied and deleted [10:50:55] hmm nope [10:51:00] that wasn't us [10:51:03] https://www.irccloud.com/pastebin/mfVoWNIO/ [10:51:25] I know [10:51:28] That's yesterday's change I was inquiring about [10:51:36] we need to fix that :) [10:51:40] remove the include [10:51:43] from dns [10:52:11] sending a CR [10:53:05] same [10:53:54] https://gerrit.wikimedia.org/r/c/operations/dns/+/1254869/ [10:54:09] vgutierrez wins on time [10:54:17] you might need to add: [10:54:17] ; 198.35.26.224/27 (224-255) - LVS Service IPs [10:54:17] $INCLUDE netbox/224-27.26.35.198.in-addr.arpa [10:54:23] but that can be done at a later time [10:55:02] vgutierrez: and that's the procedure to deploy it: https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change [10:56:31] hmm I'll add that as well [10:56:44] cause the netbox change will point the new LB records [10:56:45] to that range [10:57:04] meanwhile I'll abort the current sre.dns.netbox cookbook [11:00:10] _joe_: if you've SAN points to lose - https://clawtom.github.io/tom-blog/2026/03/12/the-interrogation/ and https://clawtom.github.io/tom-blog/2026/03/13/what-the-crabbyrathbun-post-missed/ [11:00:32] there should be no need to edit data in netbox now [11:00:57] before running sre.dns.netbox --skip-authdns-update [11:01:09] XioNoX: correct [11:01:10] ? 
[11:01:19] yes, that's right [11:01:25] the changes are already there in the diff [11:01:58] (I'm already running the cookbook BTW) [11:02:01] ah ok [11:05:17] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dns-check.wersa4wr/zones/netbox/1.0.2.0.3.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa' [11:05:25] I guess we have some IPv6 outdated as well [11:08:09] ok.. CI is now happy: https://gerrit.wikimedia.org/r/c/operations/dns/+/1254869 [11:08:13] XioNoX: still looking good? [11:08:52] actually the /56 sandbox mention could be removed as well [11:09:58] or not.. that's still on netbox [11:11:03] looks good ! [11:11:52] merging [11:13:48] text-lb.ulsfo.wikimedia.org has address 198.35.26.224 [11:13:53] looking good now [11:16:07] 👍 [11:16:49] nice! [11:17:50] ok.. all clear, I'm repooling ulsfo [11:17:54] no good reason to keep it depooled [11:18:25] ack [11:35:58] Is everything clear for a sre.dns.netbox run? I have a server I'd like to put back Active [11:38:29] yes claime [11:39:13] cool tyvm [12:04:11] XioNoX: I pushed that ACL change to the core routers in codfw now [12:06:56] topranks: cool, I need to merge and push https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1254864 to all the core routers [12:42:23] my turn to "break" DNS [12:42:46] E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.3.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.' for name 'gw-virtual ulsfo.wikimedia.org.' and IP '2620:0:863:3::1', PTRs are: 97.26.35.198.in-addr.arpa. [12:47:57] vgutierrez, moritzm: https://gerrit.wikimedia.org/r/c/operations/dns/+/1254910 [12:48:00] or topranks [12:51:53] XioNoX: +1, somewhat pointless nit inline [12:52:21] makes sense [12:52:26] updated [12:55:52] all good, thx! 
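[Editor's note: the generated reverse-zone fragment names seen in this incident follow a recognizable pattern: `224-27.26.35.198.in-addr.arpa` for 198.35.26.224/27, and `240-28.26.35.198.in-addr.arpa` for the /28 in the earlier error. A minimal Python sketch of that mapping, inferred only from these two names; the helper name is hypothetical, not part of the sre.dns.netbox tooling:]

```python
import ipaddress

def netbox_rdns_zone(prefix: str) -> str:
    # Hypothetical helper: maps an IPv4 prefix to the
    # "<network-octet>-<prefixlen>.<reversed-first-octets>.in-addr.arpa"
    # file name convention used by the generated $INCLUDE fragments.
    net = ipaddress.ip_network(prefix)
    octets = str(net.network_address).split(".")
    rev = ".".join(reversed(octets[:3]))
    return f"{octets[3]}-{net.prefixlen}.{rev}.in-addr.arpa"

print(netbox_rdns_zone("198.35.26.224/27"))  # 224-27.26.35.198.in-addr.arpa
```

This makes it easy to see why the stale `$INCLUDE netbox/240-28.26.35.198.in-addr.arpa` broke once the /28's VIPs were removed: the fragment file is only generated while matching IPs exist in Netbox.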
[13:05:26] I'm disabling Puppet on the install* servers for ~ 10 mins to deploy a firewall change (so any change to preseed.yaml you make won't take effect), I'll drop a note when Puppet is fully re-enabled [13:33:58] Puppet on install* is back on [13:55:44] in the current round of reboots, I feel something is off with the downtiming [13:56:06] I don't know what but it seems to be across the fleet and so is not tied to a particular cookbook or a host [13:56:09] is it just me? [13:56:20] no, not just you. we had massive issues with downtiming in the first day [13:56:34] someone from o11y restarted Icinga [13:56:52] since apparently when running for a long time, this happens [13:57:07] it's much better since [13:57:11] ok thanks. I will raise it with o11y. for some of the more critical stuff like the DNS hosts, it worries me to see downtiming fail [13:57:19] but I also had an odd alert when rebooting a ganeti node earlier [13:57:52] sukhe: yeah, I had problems on the first day, but it's been better since. And I've rebooted quite a lot of hosts! [13:58:41] ok thanks folks. I will see if it persists for me [13:59:33] specifically "ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4)", even though the cookbook does set downtime [14:00:06] that's not an icinga alert, so probably an unrelated issue? [14:00:25] but overall like Matthew said it's now very little and might simply be some race condition that has always existed, but only really shows up if we reboot a lot of hosts [14:28:19] <_joe_> !topic [14:28:21] <_joe_> sigh [14:28:51] <_joe_> elukey, claime: I'm deploying HP shortly. 
There might be a few minutes of unavailability [14:28:57] ack [14:32:29] ack [15:43:16] I'm rebooting the backup director, please do not accidentally delete production data in the next 15 minutes [15:44:17] you spoil all my fun :) [15:50:14] it's ok if your job is to do that [16:08:29] > Excluding 10 nodes: ml-serve[1001-1010] [16:08:37] > Running action: reboot on hosts ml-serve1001.eqiad.wmnet [16:08:40] thanks, cumin [16:29:22] klausman: what do you mean? [16:38:14] It tells me it excluded 1001, then proceeds to reboot it [16:38:37] (and no, adding * or .eqiad.wmnet did not help) [16:39:16] klausman: sure, but it is not "cumin" but a cookbook afaics :) [16:39:47] so if there is a bug let's try to figure out where it is, maybe with a detailed report (what was executed, etc..) rather than being sarcastic [16:39:50] well, ok :) [16:40:22] and yes, there seems to be some breakage in the --query syntax of the k8s reboot cookbook [16:40:47] cumin.backends.InvalidQueryError: Unable to parse the query 'ml-serve[1011-1013].eqiad.wmnet and (A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad)' neither with the default backend 'puppetdb' nor with the global grammar: [16:40:49] puppetdb: Expected end of text, found 'and' (at char 32), (line:1, col:33) [16:41:16] klausman: No there isn't, use 'P{ml-serve[1011-1013].eqiad.wmnet}' as your query [16:41:18] maybe it's trying to combine glob and puppetdb querying? [16:41:26] ah, I see. [16:41:49] You can also use --minimal-cordoning to avoid it cordoning the whole cluster then uncordoning the rebooted nodes [16:42:20] Using minimal-cordoning it will try to find a good batch size and only cordon that batch before rebooting these nodes, then repool them, and move ahead [16:42:35] that I am already using. 
The problem was that 1010 failed because it couldn't evict istiod, and then I had to start over with 1011-1013, which is hard to query/exclude for [16:42:40] (if you give it a batch size, otherwise it's the default which is too low) [16:43:22] klausman: Does --exclude not work? [16:43:34] I couldn't get it to work right [16:43:56] klausman: also please run the cookbook --dry-run the next time to figure out what it would do rather than execute it, if you don't want to reboot random hosts [16:44:22] dry run has a _lot_ of output, to the point where I couldn't spot if I got the host list right [16:45:22] If it doesn't that's a bug [16:45:40] Ok so what you need to do is uncordon/repool the nodes from the batch that failed [16:46:17] And then this works `sudo cookbook -d sre.k8s.reboot-nodes --batchsize 15 --k8s-cluster ml-serve-eqiad -a ml-serve-worker-eqiad --exclude ml-server1010.eqiad.wmnet --minimal-cordon -r 'reboots'` [16:46:18] I am already doing that, basically doing the host-by-host cordon/drain;reboot;uncordon by hand [16:46:35] (I'm on the last reboot atm) [16:46:51] But that would only exclude 1010, I need to exclude 1001-1010 [16:48:05] `sudo cookbook --dry-run sre.k8s.reboot-nodes --batchsize 15 --k8s-cluster ml-serve-eqiad -a ml-serve-worker-eqiad --exclude ml-server[1001-1010].eqiad.wmnet --minimal-cordon -r 'reboots'` [16:48:10] DRY-RUN: Effective remote query is: A:ml-serve-worker-eqiad [16:48:12] DRY-RUN: Excluding 10 nodes: ml-server[1001-1010].eqiad.wmnet [16:49:12] I could have sworn I tried that syntax for excludes, but I will have to check [16:57:42] yep it happens, what we are trying to tell you is that feel free to drop a line here if you are not sure, as you can see people often help because they have probably had the same use case beforehand :) [16:58:11] ah, I had used the [] syntax with --query (which needs P{} instead) [16:58:11] but let's remember to be constructive, this is my only recommendation [17:19:45] all pods running [17:19:56] 
dcausse: does it work now? [17:21:25] elukey: all good, thanks! [17:22:27] klausman: so `sudo /opt/rocm/bin/amd-smi partition` shows that we are running with the default, so all memory in one partition [17:22:36] that is expected after a reboot [17:23:03] I think mi300x hosts need to have a way (in puppet?) to restore their configuration before they can accept traffic [17:23:15] maybe a daemon or something that goes before the kubelet [17:23:40] agreed, a one-off service should do the trick. [17:23:45] and the amd gpu plugin needs to be checked as well, if it starts too early maybe it is not good [17:23:59] can you open a task to track the work? [17:24:09] will do [17:24:25] ah snap I realized only now that we are not in #ml [17:24:33] * elukey needs to go afk [17:24:38] sorry for the spam folks [18:22:19] I got the change 1255000, nice [18:34:45] {◕ ◡ ◕} [19:24:43] mutante: okay to puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1255012 ? [19:25:40] brett: yes, please. I have a problem to ssh to the server. [19:27:05] done [19:30:35] thanks. ehh.. my yubikey is borked [19:30:37] usb 3-1: device not accepting address 19, error -71 [19:30:48] usb 3-1: Device not responding to setup address.
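[Editor's note: in the cumin discussion above, the `ml-serve[1011-1013].eqiad.wmnet` host-range syntax is expanded by ClusterShell inside cumin. A standalone sketch of what that expansion does for a single bracketed range; the function is illustrative, not cumin's or ClusterShell's actual API:]

```python
import re

def expand_hosts(expr: str) -> list[str]:
    # Illustrative re-implementation of a single "name[A-B].suffix"
    # range expansion; real cumin queries delegate this to
    # ClusterShell's NodeSet, which also handles multi-range forms.
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]  # no range: a plain hostname expands to itself
    pre, lo, hi, suf = m.groups()
    width = len(lo)  # preserve zero-padding of the lower bound
    return [f"{pre}{i:0{width}d}{suf}" for i in range(int(lo), int(hi) + 1)]

print(expand_hosts("ml-serve[1011-1013].eqiad.wmnet"))
```

This also shows why the failing query needed the `P{...}` wrapper: the bare range plus `and (A:... or A:...)` mixes this host-list grammar with the puppetdb backend's grammar, which is what the `InvalidQueryError` in the log complains about.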