[07:11:24] greetings
[07:58:42] morning!
[08:01:15] quick review matching total requests quota to the limits in k8s for users https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/74
[08:03:44] LGTM
[08:06:01] thanks!
[08:58:59] wrote https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1186443 as it was not available as a cookbook
[09:04:07] LGTM
[09:05:00] dcaro: not for that review of course, but how much work would it be to take an fqdn as input?
[09:05:32] in common opts I guess; similarly for vm_console I normally have an fqdn handy, it would be useful to be able to plug that into cookbooks
[09:17:35] should not be hard, yep, the key point is getting the cluster from the domain, though iirc we have a function for that
[09:30:42] at some point I was thinking of 'generalizing' that and making it so all cookbooks allow using the full fqdn, or cluster+vm-name
[09:33:47] yeah that'd be a great solution
[09:34:42] dcaro: re: detection of stuck bastions, where can I find the checks?
[09:35:02] and I'll file a task re: fqdn
[09:37:59] this is annoying for the options though, it seems you can't have [--fqdn | (--vm-name + --cluster)]
[09:38:01] https://github.com/python/cpython/issues/101337
[09:38:19] T404052 for fqdn
[09:38:19] T404052: Add fqdn input to instance-related wmcs cookbooks - https://phabricator.wikimedia.org/T404052
[09:38:34] argh (ah ah) that's a bummer re: issue 101337
[09:38:52] though we could totally do that in code
[09:39:19] something like that should be ok `usage: cookbook [GLOBAL_ARGS] wmcs.vps.instance.force_reboot [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg] [--cluster-name {eqiad1,codfw1dev}] (--vm-name VM_NAME | --fqdn FQDN)`
[09:39:35] +1
[09:39:45] and just ignore the cluster name if not passed
[09:44:22] just played a bit, +1 for doing it in a different patch, generalized for all cookbooks that need to target a vm
[09:44:38] it's not hard, but not trivial either
[09:44:43] SGTM! thank you for taking a quick look
[09:45:39] dcaro: I'm going to file a followup to T404047 to alert on stuck bastions, where can I find the current checks and alerts?
[09:45:40] T404047: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047
[09:46:23] oh yes sorry, so there's a couple of places, one is the toolforge alerts repo https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts
[09:47:01] and another potential place is the metricsinfra DB (those are cloudvps-level alerts, and iirc there's an ssh-able alert by default, though it might be using a non-nfs user for tools as it's generic for all cloudvps vms)
[09:47:13] https://wikitech.wikimedia.org/wiki/Metricsinfra
[09:47:31] ack, thank you!
[09:47:34] there are also specific alerts for toolforge on the metricsinfra side
[09:47:56] note that they use different prometheus instances, tools has its own and metricsinfra has another, so the stats will be available only in one of them
[09:48:39] that's good to keep in mind, what about "toolschecker" checks? sorry I don't have more information, though I remember sth like that from the icinga migration
[09:51:50] yep, toolschecker is a service running in tools that does some checks of its own, those are defined in the production labs prometheus I think
[09:52:05] https://gerrit.wikimedia.org/g/operations/puppet/+/764f56a112b4c42114d28bcd082dc36887be9fc0/modules/icinga/manifests/monitor/toollabs.pp
[09:52:50] ideally we would move most if not all of those to prometheus checks (e.g. redis, etc.)
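(A minimal sketch of the "--fqdn | --vm-name + --cluster-name" idea discussed around 09:37-09:44 above: argparse cannot express a mutually exclusive choice between one flag and a *group* of flags, per cpython issue 101337, so the constraint is enforced after parsing. The option names mirror the usage string pasted at 09:39:19, but `cluster_from_domain` and the cookbook wiring are hypothetical, not the real wmcs-cookbooks API.)

```python
import argparse


def cluster_from_domain(domain: str) -> str:
    # Hypothetical helper: the real cookbooks have their own function to map
    # an instance domain to an OpenStack cluster.
    return "codfw1dev" if "codfw1dev" in domain else "eqiad1"


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="wmcs.vps.instance.force_reboot")
    parser.add_argument("--project")
    parser.add_argument("--task-id")
    parser.add_argument("--cluster-name", choices=["eqiad1", "codfw1dev"])
    # argparse can only make --vm-name and --fqdn mutually exclusive; the
    # "--vm-name needs --cluster-name" half is checked by hand below.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--vm-name")
    target.add_argument("--fqdn")
    args = parser.parse_args(argv)

    if args.fqdn:
        host, _, domain = args.fqdn.partition(".")
        if not domain:
            parser.error("--fqdn must be a fully qualified name")
        # As suggested at 09:39:45, any --cluster-name passed alongside
        # --fqdn is simply overridden by what the domain implies.
        args.vm_name = host
        args.cluster_name = cluster_from_domain(domain)
    elif not args.cluster_name:
        parser.error("--cluster-name is required when using --vm-name")
    return args
```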
[09:53:13] we thought of using some of the 'sample' tools to gather that information instead
[09:54:00] T313030
[09:54:01] T313030: [toolforge.infra] Replace Toolschecker alerts with Prometheus based ones - https://phabricator.wikimedia.org/T313030
[09:54:08] nice, thank you
[09:54:20] it got put on the back burner though, and we focused on push to deploy
[09:56:00] T404054 filed
[09:56:01] T404054: Improve detection of failing ssh to toolforge bastions - https://phabricator.wikimedia.org/T404054
[09:56:56] 👍
[11:52:36] mmhh disable_tool is failing on tools-nfs-2, known/expected?
[11:52:45] Sep 9 05:06:13 tools-nfs-2 disable_tool.py[318863]: pymysql.err.OperationalError: (1045, "Access denied for user 's56226'@'172.16.2.206' (using password: YES)")
[11:54:34] I'll file a task
[11:56:06] that should not be expected I think
[11:57:37] paws is down it seems, looking
[11:59:33] ouch, let me know if I can help with anything in debugging/troubleshooting
[12:00:22] two of the nodes are in `NotReady`
[12:00:30] https://www.irccloud.com/pastebin/C5QqXtVW/
[12:02:18] `│ DiskPressure Unknown Tue, 09 Sep 2025 11:30:21 +0000 Tue, 09 Sep 2025 11:33:48 +0000 NodeStatusUnknown Kubelet stopped posting node status.`
[12:04:47] I'll reboot them
[12:05:36] ok
[12:06:28] did we end up getting ssh access to those nodes? or not yet?
[12:07:34] this would be an instance where it'd be easier to use the vm name xd
[12:08:00] indeed
[12:19:06] that seemed to do the trick this time :/
[12:19:10] not sure what was wrong though
[12:58:07] created T404076 to keep track, but I don't think there's much more to add right now
[12:58:08] T404076: [paws] 2025-09-09 unexpected downtime - https://phabricator.wikimedia.org/T404076
[13:01:22] ack
[13:09:11] the quincy test OSD (cloudcephosd1016) has been happy for 20+ hours so I'm going to move the rest of the OSDs to quincy now.
[13:16:14] did you just reboot it?
[13:16:42] memory usage dropped https://grafana-rw.wikimedia.org/d/000000377/host-overview?from=now-24h&orgId=1&refresh=5m&timezone=utc&to=now&var-cluster=wmcs&var-datasource=000000026&var-server=cloudcephosd1016
[13:16:51] it looks ok, it does not show the weird disk patterns we saw before
[13:16:58] and it seems to use way less memory
[13:17:18] I'm running the upgrade-all cookbook so it started with that one, needlessly
[13:17:29] xd
[13:17:30] ack
[13:17:38] It does look like less memory, although if you look at the history of a new node (e.g. 1049) you can see that memory use ramps up verrrrrry slowly
[13:17:42] so 1016 might just not be there yet
[13:18:16] But if they all cap out at that low RAM usage I wonder if we can tune them to use more? Like, could they do more buffering and be more efficient with rebalancing?
[13:19:31] yes, it's a setting we have, iirc we are setting it to 8G per osd
[13:19:43] we might have to tweak it for different types of hosts though
[13:20:33] yeah
[14:01:46] did someone fix the wikireplica lag alert? :)
[14:01:54] I'll look at why it's lagging
[14:03:04] ah it's an icinga alert, maybe that one always worked, but I don't think it fired last week
[14:06:39] I don't remember seeing it last week, no
[14:09:52] it doesn't send emails apparently, so I cannot confirm that it fired. Maybe in the icinga UI I can see the history.
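(A rough sketch of the kind of manual check behind the wikireplica lag investigation above: look at replication lag and at long-idle threads that could be holding a lock in front of an ALTER TABLE. This is not the actual wikireplicas tooling; the hostname, account, and thresholds are placeholders, and the account would need REPLICATION CLIENT / PROCESS privileges.)

```python
import pymysql

# Placeholder connection details, not the real clouddb credentials.
conn = pymysql.connect(
    host="clouddb1015.example.wmnet",
    user="watchdog",
    password="not-a-real-password",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    # "SHOW REPLICA STATUS" on newer MariaDB versions.
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone() or {}
    print("Seconds_Behind_Master:", status.get("Seconds_Behind_Master"))
    print("Slave_SQL_Running_State:", status.get("Slave_SQL_Running_State"))

    # Look for idle ("Sleep") connections that have been open for a long time;
    # combined with an ALTER TABLE waiting for a metadata lock they can stall
    # replication, the compound issue described above.
    cur.execute("SHOW PROCESSLIST")
    for row in cur.fetchall():
        if row["Command"] == "Sleep" and row["Time"] > 300:
            print("long-idle thread:", row["Id"], row["User"], row["Time"], "s")

conn.close()
```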
[14:10:26] even funnier, the replication was blocked by a thread in "Sleep", and only for about 5 mins, so it doesn't explain the multi-hour lag
[14:10:32] I will open a task to track it
[14:11:06] again it was a compound issue: a locking thread PLUS an ALTER TABLE that requires a table lock
[14:22:15] hmmm... tools object storage has doubled in size in a week
[14:22:18] Raymond_Ndibe: ^
[14:25:31] I just manually ran a gc cleanup from the harbor UI
[14:25:41] (and configured it to get scheduled daily)
[14:28:45] task about the wikireplicas lag: T404090
[14:28:46] T404090: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090
[14:28:48] uh wow `1422 blob(s) and 493 manifest(s) deleted, 25.07GiB space freed up`
[15:53:43] dhinus, dcaro do you have any reason to suspect I'd have issues using the stock debian ceph builds (possibly in combination with ceph-provided packages)?
[15:54:59] not really, though it might mess with the config files or something similar
[15:56:00] it will produce apt warnings for sure, but that doesn't much worry me
[15:57:23] that's if it puts them in the same paths (which I'm guessing it will do?)
[15:57:35] a dpkg -L might show, just to double check
[15:58:17] you can also just try and see if it works, though puppet is also doing stuff to the paths, so it might be good to cross check just in case
[16:00:04] I will try with a mon in codfw1dev; it's easy to revert if needed.
[16:01:06] 👍
[16:02:44] andrewbogott: I don't have specific reasons, but I wonder if they are packaged in the same way as the packages we are currently using
[16:03:19] or if they would need changes/updates to our puppet config
[16:04:00] yeah, it's unlikely but possible
[16:04:17] I think they're definitely worth trying
[16:20:15] * dcaro off
[16:20:21] cya tomorrow
[16:52:41] * dhinus off
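(A quick sketch of the `dpkg -L` cross-check suggested at 15:57 in the ceph packaging discussion: list where an installed package puts its files under a few prefixes, so the Debian-provided and ceph.com-provided builds can be compared path by path against what puppet manages. The package name is just an example of whatever is installed on the host.)

```python
import subprocess


def package_paths(package: str, prefixes=("/etc/", "/var/lib/", "/usr/lib/")):
    """Return the files an installed package ships under the given prefixes."""
    out = subprocess.run(
        ["dpkg", "-L", package], capture_output=True, text=True, check=True
    ).stdout
    return sorted(p for p in out.splitlines() if p.startswith(prefixes))


if __name__ == "__main__":
    # Example package name; run against whichever ceph packages are installed.
    for path in package_paths("ceph-common"):
        print(path)
```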