[06:49:54] good morning
[08:58:27] tools-k8s-worker-nfs-74 has D processes and seems to be getting worse
[08:59:09] I'll reboot it
[09:21:57] thanks!
[09:22:15] topranks: hey, I'm seeing a consistent amount of pings lost between ceph nodes
[09:22:17] https://usercontent.irccloud-cdn.com/file/BYqsITI3/image.png
[09:22:32] is that expected now?
[09:22:51] not 100% sure what I'm looking at, can you give me an example of two nodes?
[09:23:00] no, it shouldn't be dropping pings
[09:23:24] ahh, yep, nm, it's pings lost to 1029, a node that's taken out temporarily
[09:24:23] oh ok np
[09:25:26] we have a small process on each osd node doing pings to the rest of the nodes (1/min or so) and collecting the lost ones
[09:25:34] for both jumbo and non-jumbo
[09:46:16] dcaro: good morning, I was chatting with Francesco as I'm making a new spicerack release that has a couple of breaking changes, and I'm happy to help with patches for the couple of things needed.
[09:46:57] volans: okok, what are the breaking changes? Any new goodies we can use?
[09:47:12] when I started looking I realized you have quite some wrapper for icinga (wrap_with_sudo_icinga), and I think you might be happy as we can probably remove it now that icinga_master_host is becoming a method
[09:47:31] new goodies for the default argument parser mainly, full changelog at https://doc.wikimedia.org/spicerack/master/release.html#v10-0-0-2025-03-31
[09:47:34] oh nice
[09:48:11] the wrapper on icinga is to be able to use sudo, yep (as we ssh with our users when running on laptops)
[09:48:24] I didn't know you had to do all that just because it was a property, we could have probably changed that back in the day.
[09:49:04] I can't see the changelog entry you linked
[09:49:14] last one is 2025-02-25
[09:49:46] cdn cache?
[09:50:01] maybe :), hard refresh did not help, I guess next thing is just waiting xd
[09:50:09] weird
[09:50:31] jenkins does the push job, I have no idea how it's set up behind that
[09:51:18] back to the source... https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/CHANGELOG.rst
[09:52:56] I also did not see the entry on first load, but a hard refresh worked for me
[10:00:16] after the meeting we can decide what to do there, I can also hold the release on pypi if that helps
[10:41:18] dcaro: let me know when you're around for ^^^
[10:48:00] arturo: I just noticed that kyverno had some error during the last upgrade, the helm chart is in status "failed"
[10:48:25] it looks like it's running correctly, but apparently the post-upgrade hook failed
[10:48:59] "sudo helm list -n kyverno" and "sudo helm history kyverno -n kyverno" to see that
[10:55:00] in other news, I'm going to kickstart the k8s upgrade in toolsbeta
[10:56:50] following the procedure at https://docs.google.com/document/d/1Sh-VC5lAL0FZvvP7eGs6UGjLk-yMZ3oiTDd9nv8J7PE/edit?tab=t.0#heading=h.trre9z3ki4o
[11:03:46] dhinus: ack
[11:10:29] dhinus: 🎉 yep, kyverno failed the last time, I don't remember the details, I think that one of the deployments took too long to start or similar, but it started in the end
[11:10:46] dcaro: ack thanks
[11:10:48] further deploys did not change anything, so it did not record it
[11:10:53] (but now it does :) )
[11:10:58] right!
[11:12:10] volans: I'm going for lunch now, I'll be back in ~1h but I have meetings until ~17:30, so I think it might be better to talk async?
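A rough sketch of the kind of per-OSD-node ping check described around 09:25 (one probe per minute to each peer, once with a normal packet and once with a jumbo-frame-sized one). The peer list, output file, and packet sizes below are illustrative assumptions, not the actual WMCS implementation:

```bash
#!/bin/bash
# Hypothetical sketch: ping every peer OSD node once per minute and record losses,
# for both standard and jumbo-frame-sized packets. PEERS and STATS are placeholders.
PEERS="cloudcephosd1001 cloudcephosd1002 cloudcephosd1003"
STATS=/var/tmp/ping_loss_stats

while true; do
    for peer in $PEERS; do
        # Standard probe (default 56-byte payload)
        ping -c 1 -W 2 "$peer" >/dev/null 2>&1 \
            || echo "$(date -Is) $peer normal lost" >> "$STATS"
        # Jumbo probe: 8972-byte payload + 28 bytes of headers = 9000-byte frame,
        # with "don't fragment" set so it fails if jumbo frames break on the path
        ping -c 1 -W 2 -M do -s 8972 "$peer" >/dev/null 2>&1 \
            || echo "$(date -Is) $peer jumbo lost" >> "$STATS"
    done
    sleep 60
done
```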
[11:12:32] feel free to send a patch, as long as it allows running on the laptops I'm ok with it
[11:16:56] dcaro: ack, but what I'm missing is where you patch the spicerack instance right now (the normal way, not the special hack for icinga)
[11:42:35] first toolsbeta control node upgraded, tests running fine
[11:51:34] ack, let me know if you want any help
[11:53:50] volans: we do it on the fly, only when needed, like https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/wmcs_libs/alerts.py#39
[12:09:22] all control nodes updated, I'll do the worker nodes after lunch
[12:10:46] awesome :)
[12:14:31] dhinus: when did the k8s upgrade procedure move from an etherpad to a private google doc?
[12:15:24] taavi: that's just temporary, I will move it back to the wiki as soon as this upgrade is finished :)
[12:15:52] I'm rewriting & expanding the whole procedure as I do it
[12:16:54] moved the gdoc from private to public in the meantime!
[13:00:54] starting with the worker upgrades
[13:03:12] tracking task is T390212
[13:03:15] T390212: Upgrade "toolsbeta" cluster to k8s 1.29.15 - https://phabricator.wikimedia.org/T390212
[13:07:18] ack
[13:29:22] toolsbeta-test-k8s-worker-nfs-9 is failing to drain and I'm not sure why
[13:30:18] there's a stuck harbor container, I'll try manually stopping it
[13:31:19] maintain-harbor actually, pod name test-deploy-5db854865c-9kxrl
[13:31:43] "kubectl delete pod" worked
[13:33:20] hmm, no, it was recreated on the same node
[13:33:21] interesting
[13:33:28] what's it doing
[13:33:29] ?
[13:42:33] i requested a PyPI organization for toolforge when toolforge-weld was initially created in 2023, and apparently it's finally been approved
[13:43:22] sent some invites to the org so that i'm not the only one. i'm going to see about moving packages from individual ownership to the org
[13:44:03] \o/ I'll try to add the packages that are around
[13:54:33] <_joe_> hi, there's a bot from Toolforge which is misbehaving, logging in multiple times per second
[13:54:46] <_joe_> what's the standard way to reach out? Should I open a phab task?
[13:55:59] <_joe_> for context, this https://logstash.wikimedia.org/goto/0e61dda0e58f9e4a7c65eeb302175221 and I suspect it was the cause of the sessionstore issue this weekend
[13:56:25] _joe_: T389887 or a new one?
[13:56:26] T389887: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887
[13:56:57] <_joe_> taavi: no, a new one, the problem is they're doing 10 logins per second
[13:57:07] <_joe_> and each one is a new session AIUI
[13:58:41] at least that has a helpful user agent pointing directly to a specific tool
[13:58:44] want me to stop it?
[13:59:06] <_joe_> taavi: I would first of all want to reach out to them about it
[13:59:16] <_joe_> we're not in an emergency right now
[14:00:59] i see
[14:01:58] _joe_: so the user-agent identifies this to be https://toolsadmin.wikimedia.org/tools/id/hrwp, there's an automatic per-tool email address (tools.<toolname>@toolforge.org) which redirects to the maintainers that you could use, or you could reach out to the maintainers on-wiki
[14:02:55] <_joe_> ack, I'll reach out via email
[14:49:03] <_joe_> I reached out, if we don't get any response by Thursday, we might need to stop the bot temporarily
[15:03:59] _joe_: I think we would be OK to stop the bot in order to get the maintainers' attention. We have done that a couple of times.
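For the stuck drain on toolsbeta-test-k8s-worker-nfs-9 discussed above, a rough sketch of the usual sequence; the node name comes from the log, but the flags and the cordon-first approach are a generic assumption, not necessarily the exact commands that were run:

```bash
# Hypothetical sketch for a drain blocked by a Deployment-managed pod.
NODE=toolsbeta-test-k8s-worker-nfs-9

# Cordon explicitly, so a recreated pod cannot be scheduled back onto this node
kubectl cordon "$NODE"

# Drain: evicts Deployment/ReplicaSet pods so they get recreated on other nodes
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s

# Deleting a Deployment-managed pod by hand just makes the ReplicaSet spin up a
# replacement; unless the node is cordoned (and nothing like a node selector or
# taint toleration pins the pod there), the replacement can land on the same node.
kubectl get pods -A -o wide --field-selector spec.nodeName="$NODE"
```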
[15:30:33] dhinus, dcaro: as I've agreed with David, I'm releasing spicerack 10.0.0 on apt/pypi, but for the next few days refrain from upgrading. I think I have a quick fix that I'll send shortly and a way forward in the next few days.
[15:32:03] volans: ack, thanks!
[16:05:28] volans: 🚀!
[16:06:06] I've sent the immediate patch to gerrit and added you both, lmk what you think
[16:58:42] andrewbogott: can you take care of T390134 whenever we have the drive set up? No need to re-introduce the node, just turn it on and check that the drive works, and leave it out of the cluster for now
[16:58:42] T390134: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134
[16:58:59] yep!
[16:59:20] thanks!
[16:59:30] can you tell me more about 'check that the drive works'?
[16:59:43] Do you just mean, check that it shows up in lsblk?
[16:59:53] just that the OS finds it and it's accessible (my idea is to do a bunch of fio-style tests against the raw drive, so feel free to run one)
[17:00:07] essentially, yep :)
[17:00:44] ok!
[17:01:02] I'm clocking out, cya in a couple weeks!
[17:01:35] hope everything goes well! See you soon
[17:58:09] After running for most of March with no network issues at all, gitlab-account-approval has crashed 70 times in the last 4 days. It looks like the problem this time around is more likely gerrit and gitlab being overloaded than the prior DNS/routing problems
[17:59:02] crap
[17:59:45] the SRE weekly notes say 'Gerrit switchover Wed April 2nd - 15:00–16:00 UTC' so it might be worth waiting until after that before diving in
[18:00:43] (tbh I don't know exactly what that is but it sounds important)
[18:01:43] there are 2 gerrit servers, one in eqiad and one in codfw, and they are flipping the primary/secondary roles this week. Follow-up to the DC switch, basically.
[18:03:08] is it possible that too much cross-dc traffic is causing a slowdown? Seems unlikely...
[18:11:59] it is more likely a new wave of crawlers data-mining our code forges
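A rough sketch of the "fio kind of tests" against the raw drive mentioned for T390134 above; the device path is a placeholder and the particular job parameters are illustrative assumptions, not a prescribed test plan:

```bash
# Hypothetical throughput tests for the new drive. /dev/sdX is a placeholder --
# double-check it with lsblk first, since writing to the raw block device destroys
# its contents (fine for a brand-new drive that is out of the ceph cluster).
DEV=/dev/sdX

# Sequential write throughput
fio --name=seqwrite --filename="$DEV" --rw=write --bs=4M --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting

# Sequential read throughput
fio --name=seqread --filename="$DEV" --rw=read --bs=4M --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting

# Random 4k mixed read/write IOPS, closer to what a ceph OSD actually does
fio --name=randrw --filename="$DEV" --rw=randrw --rwmixread=70 --bs=4k --iodepth=64 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting
```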