[07:11:24] greetings
[07:58:42] morning!
[08:01:15] quick review matching total requests quota to the limits in k8s for users https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/74
[08:03:44] LGTM
[08:06:01] thanks!
[08:58:59] wrote https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1186443 as it was not available as a cookbook
[09:04:07] LGTM
[09:05:00] dcaro: not for that review of course, but how much work would it be to take an fqdn as input?
[09:05:32] in common opts I guess; similarly for vm_console I normally have an fqdn handy, it would be useful to be able to plug that into cookbooks
[09:17:35] should not be hard, yep, the key point is getting the cluster from the domain, though iirc we have a function for that
[09:30:42] at some point I was thinking of 'generalizing' that and making it so all cookbooks allow using the full fqdn, or cluster+vm-name
[09:33:47] yeah that'd be a great solution
[09:34:42] dcaro: re: detection of stuck bastions, where can I find the checks?
[09:35:02] and I'll file a task re: fqdn
[09:37:59] this is annoying for the options though, it seems you can't have [--fqdn | (--vm-name + --cluster)]
[09:38:01] https://github.com/python/cpython/issues/101337
[09:38:19] T404052 for fqdn
[09:38:19] T404052: Add fqdn input to instance-related wmcs cookbooks - https://phabricator.wikimedia.org/T404052
[09:38:34] argh (ah ah) that's a bummer re: issue 101337
[09:38:52] though we could totally do that in code
[09:39:19] something like that should be ok `usage: cookbook [GLOBAL_ARGS] wmcs.vps.instance.force_reboot [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg] [--cluster-name {eqiad1,codfw1dev}] (--vm-name VM_NAME | --fqdn FQDN)`
[09:39:35] +1
[09:39:45] and just ignore the cluster name if not passed
[09:44:22] just played a bit, +1 for doing it in a different patch, generalized for all cookbooks that need to target a vm
[09:44:38] it's not hard, but not trivial either
[09:44:43] SGTM! thank you for taking a quick look
[09:45:39] dcaro: I'm going to file a followup to T404047 to alert on stuck bastions, where can I find the current checks and alerts?
[09:45:40] T404047: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047
[09:46:23] oh yes sorry, so there's a couple of places, one is the toolforge alerts repo https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts
[09:47:01] and another potential place is the metricsinfra DB (those are cloudvps-level alerts, and iirc there's an ssh-able alert by default, though it might be using a non-nfs user for tools as it's generic for all cloudvps vms)
[09:47:13] https://wikitech.wikimedia.org/wiki/Metricsinfra
[09:47:31] ack, thank you!
[09:47:34] there are also specific alerts for toolforge on the metricsinfra side
[09:47:56] note that they use different prometheus instances, tools has its own and metricsinfra has another, so the stats will be available only in one of them
[09:48:39] that's good to keep in mind, what about "toolschecker" checks? sorry I don't have more information, though I remember sth like that from the icinga migration
[09:51:50] yep, toolschecker is a service running in tools that does some checks of its own, those are defined in the production labs prometheus I think
[09:52:05] https://gerrit.wikimedia.org/g/operations/puppet/+/764f56a112b4c42114d28bcd082dc36887be9fc0/modules/icinga/manifests/monitor/toollabs.pp
[09:52:50] ideally we would move most if not all of those to prometheus checks (e.g. redis, etc.)
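(A minimal sketch of the "--fqdn | --vm-name + --cluster-name" idea discussed around 09:37-09:44 above: argparse cannot express a mutually exclusive choice between one flag and a *group* of flags, per cpython issue 101337, so the constraint is enforced after parsing. The option names mirror the usage string pasted at 09:39:19, but `cluster_from_domain` and the cookbook wiring are hypothetical, not the real wmcs-cookbooks API.)

```python
import argparse


def cluster_from_domain(domain: str) -> str:
    # Hypothetical helper: the real cookbooks have their own function to map
    # an instance domain to an OpenStack cluster.
    return "codfw1dev" if "codfw1dev" in domain else "eqiad1"


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="wmcs.vps.instance.force_reboot")
    parser.add_argument("--project")
    parser.add_argument("--task-id")
    parser.add_argument("--cluster-name", choices=["eqiad1", "codfw1dev"])
    # argparse can only make --vm-name and --fqdn mutually exclusive; the
    # "--vm-name needs --cluster-name" half is checked by hand below.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--vm-name")
    target.add_argument("--fqdn")
    args = parser.parse_args(argv)

    if args.fqdn:
        host, _, domain = args.fqdn.partition(".")
        if not domain:
            parser.error("--fqdn must be a fully qualified name")
        # As suggested at 09:39:45, any --cluster-name passed alongside
        # --fqdn is simply overridden by what the domain implies.
        args.vm_name = host
        args.cluster_name = cluster_from_domain(domain)
    elif not args.cluster_name:
        parser.error("--cluster-name is required when using --vm-name")
    return args
```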
[09:53:13] we thought of using some of the 'sample' tools to gather that information instead
[09:54:00] T313030
[09:54:01] T313030: [toolforge.infra] Replace Toolschecker alerts with Prometheus based ones - https://phabricator.wikimedia.org/T313030
[09:54:08] nice, thank you
[09:54:20] it got put on the back burner though, and we focused on push to deploy
[09:56:00] T404054 filed
[09:56:01] T404054: Improve detection of failing ssh to toolforge bastions - https://phabricator.wikimedia.org/T404054
[09:56:56] 👍
[11:52:36] mmhh disable_tool is failing on tools-nfs-2, known/expected?
[11:52:45] Sep 9 05:06:13 tools-nfs-2 disable_tool.py[318863]: pymysql.err.OperationalError: (1045, "Access denied for user 's56226'@'172.16.2.206' (using password: YES)")
[11:54:34] I'll file a task
[11:56:06] that should not be expected I think
[11:57:37] paws is down it seems, looking
[11:59:33] ouch, let me know if I can help with anything in debugging/troubleshooting
[12:00:22] two of the nodes are in `NotReady`
[12:00:30] https://www.irccloud.com/pastebin/C5QqXtVW/
[12:02:18] `│ DiskPressure Unknown Tue, 09 Sep 2025 11:30:21 +0000 Tue, 09 Sep 2025 11:33:48 +0000 NodeStatusUnknown Kubelet stopped posting node status.`
[12:04:47] I'll reboot them
[12:05:36] ok
[12:06:28] did we end up getting ssh access to those nodes? or not yet?
[12:07:34] this would be an instance where it'd be easier to use the vm name xd
[12:08:00] indeed
[12:19:06] that seemed to do the trick this time :/
[12:19:10] not sure what was wrong though
[12:58:07] created T404076 to keep track, but I don't think there's much more to add right now
[12:58:08] T404076: [paws] 2025-09-09 unexpected downtime - https://phabricator.wikimedia.org/T404076
[13:01:22] ack
[13:09:11] the quincy test OSD (cloudcephosd1016) has been happy for 20+ hours so I'm going to move the rest of the OSDs to quincy now.
[13:16:14] did you just reboot it?
[13:16:42] memory usage dropped https://grafana-rw.wikimedia.org/d/000000377/host-overview?from=now-24h&orgId=1&refresh=5m&timezone=utc&to=now&var-cluster=wmcs&var-datasource=000000026&var-server=cloudcephosd1016
[13:16:51] it looks ok, it does not show the weird disk patterns we saw before
[13:16:58] and it seems to use way less memory
[13:17:18] I'm running the upgrade-all cookbook so it started with that one, needlessly
[13:17:29] xd
[13:17:30] ack
[13:17:38] It does look like less memory, although if you look at the history of a new node (e.g. 1049) you can see that memory use ramps up verrrrrry slowly
[13:17:42] so 1016 might just not be there yet
[13:18:16] But if they all cap out at that low RAM usage I wonder if we can tune them to use more? Like, could they do more buffering and be more efficient with rebalancing?
[13:19:31] yes, it's a setting we have, iirc we are setting it to 8G per osd
[13:19:43] we might have to tweak it for different types of hosts though
[13:20:33] yeah
[14:01:46] did someone fix the wikireplica lag alert? :)
[14:01:54] I'll look at why it's lagging
[14:03:04] ah it's an icinga alert, maybe that one always worked, but I don't think it fired last week
[14:06:39] I don't remember seeing it last week, no
[14:09:52] it doesn't send emails apparently, so I cannot confirm that it fired. Maybe in the icinga UI I can see the history.
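(A rough sketch of the kind of manual check behind the wikireplica lag investigation above: look at replication lag and at long-idle threads that could be holding a lock in front of an ALTER TABLE. This is not the actual wikireplicas tooling; the hostname, account, and thresholds are placeholders, and the account would need REPLICATION CLIENT / PROCESS privileges.)

```python
import pymysql

# Placeholder connection details, not the real clouddb credentials.
conn = pymysql.connect(
    host="clouddb1015.example.wmnet",
    user="watchdog",
    password="not-a-real-password",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    # "SHOW REPLICA STATUS" on newer MariaDB versions.
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone() or {}
    print("Seconds_Behind_Master:", status.get("Seconds_Behind_Master"))
    print("Slave_SQL_Running_State:", status.get("Slave_SQL_Running_State"))

    # Look for idle ("Sleep") connections that have been open for a long time;
    # combined with an ALTER TABLE waiting for a metadata lock they can stall
    # replication, the compound issue described above.
    cur.execute("SHOW PROCESSLIST")
    for row in cur.fetchall():
        if row["Command"] == "Sleep" and row["Time"] > 300:
            print("long-idle thread:", row["Id"], row["User"], row["Time"], "s")

conn.close()
```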
[14:10:26] even funnier, the replication was blocked by a thread in "Sleep", and only for about 5 mins, so it doesn't explain the multi-hour lag
[14:10:32] I will open a task to track it
[14:11:06] again it was a compound issue: a locking thread PLUS an ALTER TABLE that requires a table lock
[14:22:15] hmmm... tools object storage has doubled in size in a week
[14:22:18] Raymond_Ndibe: ^
[14:25:31] I just manually ran a gc cleanup from the harbor UI
[14:25:41] (and configured it to get scheduled daily)
[14:28:45] task about the wikireplicas lag: T404090
[14:28:46] T404090: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090
[14:28:48] uh wow `1422 blob(s) and 493 manifest(s) deleted, 25.07GiB space freed up`
[15:53:43] dhinus, dcaro do you have any reason to suspect I'd have issues using the stock debian ceph builds (possibly in combination with ceph-provided packages)?
[15:54:59] not really, though it might mess with the config files or something similar
[15:56:00] it will produce apt warnings for sure, but that doesn't much worry me
[15:57:23] that's if it puts them in the same paths (which I'm guessing it will do?)
[15:57:35] a dpkg -L might show, just to double check
[15:58:17] you can also just try and see if it works, though puppet is also doing stuff to the paths, so it might be good to cross check just in case
[16:00:04] I will try with a mon in codfw1dev; it's easy to revert if needed.
[16:01:06] 👍
[16:02:44] andrewbogott: I don't have specific reasons, but I wonder if they are packaged in the same way as the packages we are currently using
[16:03:19] or if they would need changes/updates to our puppet config
[16:04:00] yeah, it's unlikely but possible
[16:04:17] I think they're definitely worth trying
[16:20:15] * dcaro off
[16:20:21] cya tomorrow
[16:52:41] * dhinus off
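(A quick sketch of the `dpkg -L` cross-check suggested at 15:57 in the ceph packaging discussion: list where an installed package puts its files under a few prefixes, so the Debian-provided and ceph.com-provided builds can be compared path by path against what puppet manages. The package name is just an example of whatever is installed on the host.)

```python
import subprocess


def package_paths(package: str, prefixes=("/etc/", "/var/lib/", "/usr/lib/")):
    """Return the files an installed package ships under the given prefixes."""
    out = subprocess.run(
        ["dpkg", "-L", package], capture_output=True, text=True, check=True
    ).stdout
    return sorted(p for p in out.splitlines() if p.startswith(prefixes))


if __name__ == "__main__":
    # Example package name; run against whichever ceph packages are installed.
    for path in package_paths("ceph-common"):
        print(path)
```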