[06:49:54] good morning
[08:58:27] tools-k8s-worker-nfs-74 has D processes and seems to be getting worse
[08:59:09] I'll reboot it
[09:21:57] thanks!
[09:22:15] topranks: hey, I'm seeing a consistent amount of pings lost between ceph nodes
[09:22:17] https://usercontent.irccloud-cdn.com/file/BYqsITI3/image.png
[09:22:32] is that expected now?
[09:22:51] not 100% sure what I'm looking at, can you give me an example of two nodes?
[09:23:00] no, it shouldn't be dropping pings
[09:23:24] ahh, yep, nm, it's pings lost to 1029, a node that's taken out temporarily
[09:24:23] oh ok np
[09:25:26] we have a small process on each osd node doing pings to the rest of the nodes (1/min or so) and collecting the lost ones
[09:25:34] for both jumbo and non-jumbo
[09:46:16] dcaro: good morning, I was chatting with Francesco as I'm making a new spicerack release that has a couple of breaking changes, and I'm happy to help with patches for the couple of things needed.
[09:46:57] volans: okok, what are the breaking changes? Any new goodies we can use?
[09:47:12] when I started looking I realized you have quite some wrapper for icinga (wrap_with_sudo_icinga), and I think you might be happy as we can probably remove it now that icinga_master_host is becoming a method
[09:47:31] new goodies for the default argument parser mainly, full changelog at https://doc.wikimedia.org/spicerack/master/release.html#v10-0-0-2025-03-31
[09:47:34] oh nice
[09:48:11] the wrapper on icinga is to be able to use sudo, yep (as we ssh with our users when running on laptops)
[09:48:24] I didn't know you had to do all that just because it was a property, we could have probably changed that back in the day.
[09:49:04] I can't see the changelog entry you linked
[09:49:14] last one is 2025-02-25
[09:49:46] cdn cache?
[09:50:01] maybe :), hard refresh did not help, I guess next thing is just waiting xd
[09:50:09] weird
[09:50:31] jenkins does the push job, I have no idea how it's set up behind that
[09:51:18] back to the source... https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/CHANGELOG.rst
[09:52:56] I also did not see the entry on first load, but a hard refresh worked for me
[10:00:16] after the meeting we can decide what to do there, I can also hold the release on pypi if that helps
[10:41:18] dcaro: let me know when you're around for ^^^
[10:48:00] arturo: I just noticed that kyverno had some error during the last upgrade, the helm chart is in status "failed"
[10:48:25] it looks like it's running correctly, but apparently the post-upgrade hook failed
[10:48:59] "sudo helm list -n kyverno" and "sudo helm history kyverno -n kyverno" to see that
[10:55:00] in other news, I'm going to kickstart the k8s upgrade in toolsbeta
[10:56:50] following the procedure at https://docs.google.com/document/d/1Sh-VC5lAL0FZvvP7eGs6UGjLk-yMZ3oiTDd9nv8J7PE/edit?tab=t.0#heading=h.trre9z3ki4o
[11:03:46] dhinus: ack
[11:10:29] dhinus: 🎉 yep, kyverno failed the last time, I don't remember the details, I think that one of the deployments took too long to start or similar, but it started in the end
[11:10:46] dcaro: ack thanks
[11:10:48] further deploys did not change anything, so it did not record it
[11:10:53] (but now it does :) )
[11:10:58] right!
[11:12:10] volans: I'm going for lunch now, I'll be back in ~1h but I have meetings until ~17:30, so I think it might be better to talk async?
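A rough sketch of the kind of per-OSD-node ping check described around 09:25 (one probe per minute to each peer, once with a normal packet and once with a jumbo-frame-sized one). The peer list, output file, and packet sizes below are illustrative assumptions, not the actual WMCS implementation:

```bash
#!/bin/bash
# Hypothetical sketch: ping every peer OSD node once per minute and record losses,
# for both standard and jumbo-frame-sized packets. PEERS and STATS are placeholders.
PEERS="cloudcephosd1001 cloudcephosd1002 cloudcephosd1003"
STATS=/var/tmp/ping_loss_stats

while true; do
    for peer in $PEERS; do
        # Standard probe (default 56-byte payload)
        ping -c 1 -W 2 "$peer" >/dev/null 2>&1 \
            || echo "$(date -Is) $peer normal lost" >> "$STATS"
        # Jumbo probe: 8972-byte payload + 28 bytes of headers = 9000-byte frame,
        # with "don't fragment" set so it fails if jumbo frames break on the path
        ping -c 1 -W 2 -M do -s 8972 "$peer" >/dev/null 2>&1 \
            || echo "$(date -Is) $peer jumbo lost" >> "$STATS"
    done
    sleep 60
done
```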
[11:12:32] feel free to send a patch, as long as it allows running on the laptops I'm ok with it
[11:16:56] dcaro: ack, but what I'm missing is where you patch the spicerack instance right now (the normal way, not the special hack for icinga)
[11:42:35] first toolsbeta control node upgraded, tests running fine
[11:51:34] ack, let me know if you want any help
[11:53:50] volans: we do it on the fly, only when needed, like https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/wmcs_libs/alerts.py#39
[12:09:22] all control nodes updated, I'll do the worker nodes after lunch
[12:10:46] awesome :)
[12:14:31] dhinus: when did the k8s upgrade procedure move from an etherpad to a private google doc?
[12:15:24] taavi: that's just temporary, I will move it back to the wiki as soon as this upgrade is finished :)
[12:15:52] I'm rewriting & expanding the whole procedure as I do it
[12:16:54] moved the gdoc from private to public in the meantime!
[13:00:54] starting with the worker upgrades
[13:03:12] tracking task is T390212
[13:03:15] T390212: Upgrade "toolsbeta" cluster to k8s 1.29.15 - https://phabricator.wikimedia.org/T390212
[13:07:18] ack
[13:29:22] toolsbeta-test-k8s-worker-nfs-9 is failing to drain and I'm not sure why
[13:30:18] there's a stuck harbor container, I'll try manually stopping it
[13:31:19] maintain-harbor actually, pod name test-deploy-5db854865c-9kxrl
[13:31:43] "kubectl delete pod" worked
[13:33:20] hmm, no, it was recreated on the same node
[13:33:21] interesting
[13:33:28] what's it doing
[13:33:29] ?
[13:42:33] i requested a PyPI organization for toolforge when toolforge-weld was initially created in 2023, and apparently it's finally been approved
[13:43:22] sent some invites to the org so that i'm not the only one. i'm going to see about moving packages from individual ownership to the org
[13:44:03] \o/ I'll try to add the packages that are around
[13:54:33] <_joe_> hi, there's a bot from Toolforge which is misbehaving, logging in multiple times per second
[13:54:46] <_joe_> what's the standard way to reach out? Should I open a phab task?
[13:55:59] <_joe_> for context, this https://logstash.wikimedia.org/goto/0e61dda0e58f9e4a7c65eeb302175221 and I suspect it was the cause of the sessionstore issue this weekend
[13:56:25] _joe_: T389887 or a new one?
[13:56:26] T389887: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887
[13:56:57] <_joe_> taavi: no, a new one, the problem is they're doing 10 logins per second
[13:57:07] <_joe_> and each one is a new session AIUI
[13:58:41] at least that has a helpful user agent pointing directly to a specific tool
[13:58:44] want me to stop it?
[13:59:06] <_joe_> taavi: I would first of all want to reach out to them about it
[13:59:16] <_joe_> we're not in an emergency right now
[14:00:59] i see
[14:01:58] _joe_: so the user-agent identifies this to be https://toolsadmin.wikimedia.org/tools/id/hrwp, there's an automatic per-tool email address (tools.<toolname>@toolforge.org) which redirects to the maintainers that you could use, or you could reach out to the maintainers on-wiki
[14:02:55] <_joe_> ack, I'll reach out via email
[14:49:03] <_joe_> I reached out, if we don't get any response by Thursday, we might need to stop the bot temporarily
[15:03:59] _joe_: I think we would be OK to stop the bot in order to get the maintainers' attention. We have done that a couple of times.
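For the stuck drain on toolsbeta-test-k8s-worker-nfs-9 discussed above, a rough sketch of the usual sequence; the node name comes from the log, but the flags and the cordon-first approach are a generic assumption, not necessarily the exact commands that were run:

```bash
# Hypothetical sketch for a drain blocked by a Deployment-managed pod.
NODE=toolsbeta-test-k8s-worker-nfs-9

# Cordon explicitly, so a recreated pod cannot be scheduled back onto this node
kubectl cordon "$NODE"

# Drain: evicts Deployment/ReplicaSet pods so they get recreated on other nodes
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s

# Deleting a Deployment-managed pod by hand just makes the ReplicaSet spin up a
# replacement; unless the node is cordoned (and nothing like a node selector or
# taint toleration pins the pod there), the replacement can land on the same node.
kubectl get pods -A -o wide --field-selector spec.nodeName="$NODE"
```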
[15:30:33] dhinus, dcaro: as I've agreed with David, I'm releasing spicerack 10.0.0 on apt/pypi, but for the next few days refrain from upgrading. I think I have a quick fix that I'll send shortly and a way forward in the next few days.
[15:32:03] volans: ack, thanks!
[16:05:28] volans: 🚀!
[16:06:06] I've sent the immediate patch to gerrit and added you both, lmk what you think
[16:58:42] andrewbogott: can you take care of T390134 whenever we have the drive set up? No need to re-introduce the node, just turn it on and check that the drive works, and leave it out of the cluster for now
[16:58:42] T390134: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134
[16:58:59] yep!
[16:59:20] thanks!
[16:59:30] can you tell me more about 'check that the drive works'?
[16:59:43] Do you just mean, check that it shows up in lsblk?
[16:59:53] just that the OS finds it and it's accessible (my idea is to do a bunch of fio-style tests against the raw drive, so feel free to run one)
[17:00:07] essentially, yep :)
[17:00:44] ok!
[17:01:02] I'm clocking out, cya in a couple weeks!
[17:01:35] hope everything goes well! See you soon
[17:58:09] After running for most of March with no network issues at all, gitlab-account-approval has crashed 70 times in the last 4 days. It looks like the problem this time around is more likely gerrit and gitlab being overloaded than the prior DNS/routing problems
[17:59:02] crap
[17:59:45] the SRE weekly notes say 'Gerrit switchover Wed April 2nd - 15:00–16:00 UTC' so it might be worth waiting until after that before diving in
[18:00:43] (tbh I don't know exactly what that is but it sounds important)
[18:01:43] there are 2 gerrit servers, one in eqiad and one in codfw, and they are flipping the primary/secondary roles this week. Follow-up to the DC switch, basically.
[18:03:08] is it possible that too much cross-dc traffic is causing a slowdown? Seems unlikely...
[18:11:59] it is more likely a new wave of crawlers data-mining our code forges
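A rough sketch of the "fio kind of tests" against the raw drive mentioned for T390134 above; the device path is a placeholder and the particular job parameters are illustrative assumptions, not a prescribed test plan:

```bash
# Hypothetical throughput tests for the new drive. /dev/sdX is a placeholder --
# double-check it with lsblk first, since writing to the raw block device destroys
# its contents (fine for a brand-new drive that is out of the ceph cluster).
DEV=/dev/sdX

# Sequential write throughput
fio --name=seqwrite --filename="$DEV" --rw=write --bs=4M --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting

# Sequential read throughput
fio --name=seqread --filename="$DEV" --rw=read --bs=4M --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting

# Random 4k mixed read/write IOPS, closer to what a ceph OSD actually does
fio --name=randrw --filename="$DEV" --rw=randrw --rwmixread=70 --bs=4k --iodepth=64 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting
```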