[01:39:51] * bd808 off
[10:42:42] I need to run an errand, be back in a bit
[11:36:14] the connectivity issues to wikis from toolforge are not only on the new workers (containerd), so it seems broader
[11:37:31] let me know if I can be helpful with that
[11:51:09] so, this traffic from the tools-k8s-worker, what route does it take to get to the wikis? There's no nat (the ip returned by the wiki on x-client-ip is the one on the VM), so it goes to the cloudgw (cloudinstances2b-gw.svc.eqiad.wmflabs), and from there 185.15.56.233, what is that one?
[11:51:37] vlan1107.cloudgw1001.eqiad1.wikimediacloud.org
[11:51:50] the nat is skipped because of the dmz_cidr mechanism
[11:52:38] cloudinstances2b-gw.svc.eqiad.wmflabs is on the cloudnet nodes, not cloudgw
[11:52:56] I guess the PTR record for that address should be generated from netbox, and I missed something back in the day
[11:53:18] no, that's just T341338 which I have not gotten into yet
[11:53:18] T341338: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338
[11:53:56] ok
[12:26:43] * dcaro lunch
[13:52:18] arturo: are nftables sets table-specific?
[13:52:27] taavi: yes
[13:55:17] what is your opinion on python's ruff vs black/isort/flake, etc?
[13:56:44] ruff seems interesting, but so far I have not found any time/reason to test it out
[13:56:47] I'm ok with ruff
[13:56:58] it would simplify CI/setup a bit
[13:57:23] I'm up for trying it (iirc @blancadesal proposed doing so before, but we never got time to do it yet)
[13:58:59] ok
[15:02:47] btullis: are you coming to the WMCS/DataEng meeting?
[15:04:58] dhinus: I will not be able to join
[15:08:02] I won't be joining either
[15:34:49] Great meeting, thanks. +1 - would meet again.
[15:36:50] please leave a review on trustpilot;
[15:36:56] ;D
[16:02:13] did anyone start the k8s reboot cookbook already?
[16:02:47] I did not
[16:03:04] I did not (we would benefit from a locking mechanism like prod has xd)
[16:03:12] I will start the cookbook
[16:03:20] please file a task for the locking mechanism if we do not have one already
[16:04:17] I think we do not
[16:04:34] dhinus: ^ do you remember if we have a task?
[16:05:01] the not-very-sophisticated thing I'm doing to find nfs victims is
[16:05:03] sudo cumin -t 30 'O{*}' "ls /data/project"
[16:05:08] And then looking for timeouts
[16:05:12] Everything in toolforge is clear already
[16:05:41] we may need to run the reboot cookbook anyway, to clean up dead procs?
[16:05:43] (not just /data/project but similar nfs mount points)
[16:05:52] andrewbogott: that only handles the case where the host is having issues, the procs that got stuck during the outages will still be stuck
[16:06:14] yes, I don't think I know how to detect those
[16:06:15] I don't think we have a task for cookbook locking
[16:06:16] (as in, the nfs mount recovers so new processes work, but the procs that were using it do not)
[16:06:26] https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&var-cluster_datasource=prometheus-tools&from=now-30m&to=now
[16:06:32] ^ the D state
[16:06:46] (sustained D state, intermittent is ok)
[16:07:34] what happened to ceph?
[16:08:29] it's not clear, slow queries
[16:08:58] andrewbogott asked about magnus tools still on the grid. Per https://grid-deprecation.toolforge.org/u/magnus he is down to 4. But two of those (listeria and mix-n-match) are in his top 10 most popular/used tools in the community.
[16:09:33] listeria is migrating already
[16:09:43] ok! I was sort of hoping that 100 out of the 140 doomed tools were his so we'd only have one person to nag :)
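For illustration, a minimal local version of the NFS check d.caro describes above (the cumin `ls /data/project` run, looking for timeouts). The mount point list here is an assumption rather than the real worker fstab, and a truly wedged hard NFS mount can leave even the child `ls` unkillable, which is why the cumin-over-SSH variant with its own timeout is the more robust way to run this fleet-wide:

```python
#!/usr/bin/env python3
"""Check whether a few NFS mount points respond to `ls` within a timeout.

A local sketch of the cumin check quoted above; the MOUNTS list is an assumption.
"""
import subprocess

MOUNTS = ["/data/project", "/data/scratch", "/home"]  # assumed Toolforge NFS mounts


def mount_responds(path: str, timeout: int = 30) -> bool:
    """Return True if `ls <path>` finishes within `timeout` seconds."""
    try:
        subprocess.run(
            ["ls", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # TimeoutExpired means the mount is hanging; note that if the child ls is
        # stuck in D state the kill may not take effect and the check can linger.
        return False


if __name__ == "__main__":
    for mount in MOUNTS:
        print(f"{mount}: {'ok' if mount_responds(mount) else 'HUNG or missing'}")
```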
[16:10:23] (as in, listeria has some stuff on k8s, I was helping him with some connectivity issues to the wikis on that one today)
[16:10:41] dcaro: looking at that grafana board... what does it mean that the nfs server has a huge number? Is that just 1:1 with the stuck processes on the other hosts?
[16:11:18] and: is there a way to selectively kick just the stuck processes or is it best to just reboot the host entirely?
[16:11:23] andrewbogott: so it means that NFS was having IO issues (probably ceph + extra load from all the clients trying to reconnect)
[16:12:09] I think the question I'm asking is "Do I also need to reboot the nfs server?" Obviously that will make everything else much worse
[16:12:15] andrewbogott: reboot the host, the stuck processes are "unkickable" of sorts (uninterruptible), that could be changed with nfs options but might create corruption
[16:12:34] ok. I will start rebooting the high-number hosts on that graph
[16:12:47] andrewbogott: no, not the nfs server, as it's using ceph, not NFS, there's no 'mount' of sorts
[16:13:07] andrewbogott: the nfs already went down no?
[16:13:56] I don't think it went down all the way, was just unresponsive and then caught back up
[16:14:19] andrewbogott: just to clarify: you don't need to do anything at the moment, t.aavi is running the rolling reboot cookbook
[16:14:31] oh, ok then :)
[16:14:41] I missed some scrollback
[16:16:14] I think the way to interpret the graph that d.caro posted is: the NFS server had a spike in D procs, then it recovered. It has only 1 at the moment, which is just fine.
[16:17:36] changing topics, would it be cool to have a way to install NIX packages on a buildservice image?
[16:19:12] NIX?
[16:19:31] this nix? https://nixos.org/
[16:19:37] yes
[16:20:14] I was... looking at an old cross-section of that graph and now that I'm looking at the recent data everything makes sense again :)
[16:21:11] btullis: sorry about standing you up for the meeting, I was having multiple system failures on my end
[16:22:08] legoktm has two bots still running on the grid: apersonbot and legobot. It looks like dcaro poked him yesterday on phab tasks for both.
[16:22:38] yes :( should get to it this weekend - I have one week left right?
[16:23:44] legoktm: yep, you can ask for an extension if you are working on it ;), but 1 extra month is the hard limit
[16:24:08] arturo: https://www.qovery.com/blog/my-feedback-about-nixpacks-an-alternative-to-buildpacks/
[16:24:09] ixd
[16:24:25] legoktm: do it earlier if you want a good review in the next performance review cycle
[16:24:36] ahaha
[16:24:45] I just realized that our grid kill date next week is also #ilovefs day -- https://fsfe.org/activities/ilovefs/index.en.html
[16:24:49] ty, will get it done soon(TM) :)
[16:27:13] about the ceph side of things, it seems only one osd was slow, osd.247, which is hosted on one of the affected hosts, cloudcephosd1031 (on F4), looking into the host logs
[16:27:15] dcaro: thanks for the link. I think I had in mind a simple `nix install ` instead of replacing the whole stack
[16:27:44] since NIX can install as non-root, it may be better than apt in that sense
[16:27:45] arturo: I found that trying to find a buildpack for nix (should be fairly easy to get one)
[16:28:09] arturo: any specific advantage on toolforge side? (the apt one is a bit messy, might be a good replacement)
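As a side note on the "sustained D state" panel discussed above, the underlying signal is just the count of processes in uninterruptible sleep. A minimal local approximation, reading /proc directly (not the dashboard's actual query):

```python
#!/usr/bin/env python3
"""List processes currently in uninterruptible sleep ('D' state).

A rough local stand-in for the D-state graph mentioned above.
"""
from __future__ import annotations

import os


def d_state_procs() -> list[tuple[int, str]]:
    """Return (pid, comm) for every process whose state is 'D'."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # comm is wrapped in parentheses and may contain spaces, so parse around the last ')'
        state = stat[stat.rindex(")") + 2 :].split()[0]
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck


if __name__ == "__main__":
    procs = d_state_procs()
    print(f"{len(procs)} process(es) in D state")
    for pid, comm in procs:
        print(f"  {pid}\t{comm}")
```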
[16:28:39] feels weird though using both competitor technologies at the same time :S
[16:28:40] dcaro: nothing specific, semantics-wise it would be the same: get a list of random packages and install them on a Docker container image
[16:28:53] btw the cookbook is taking 5+ minutes to reboot each worker, which is awfully high
[16:29:08] The heritage tool is stuck looking for a solution to being able to start/stop other jobs from a job. This is I think also the case for a couple of other tools (zoomviewer?)
[16:30:01] bd808: yep, so far they can use a direct http request, but we have no "nice" way to let them use it T356377
[16:30:02] T356377: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377
[16:30:29] (we are working on consolidating the client, that might help)
[16:30:47] bd808: there is definitely margin for improvement on our side regarding that. I think the first step would be to migrate the jobs-api to OpenAPI, and publish proper docs
[16:31:12] :nod:
[16:31:13] i.e., T356523. I plan to work on that soon
[16:31:14] T356523: toolforge: introduce OpenAPI to jobs framework - https://phabricator.wikimedia.org/T356523
[16:37:06] on the ceph osd node, there's not much info on what might have caused the slow operations, they just start piling up at some point
[16:37:13] Feb 07 15:34:15 cloudcephosd1031 ceph-osd[2093]: 2024-02-07T15:34:15.845+0000 7f8e7dbe4700 0 log_channel(cluster) log [WRN] : slow request osd_op(client.753547715.0:323285638 3.778 3:1eeb720f:::rbd_data.29f568dc22b916.00000000000049b4:head [read 503808~4096] snapc 0=[] ondisk+read+known_if_redirected e45617378) initiated 2024-02-07T15:33:45.003417+0000 currently delayed
[16:39:41] nothing special on the network side of things (lost pings, network latency, switches saturated, ...) https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1
[16:40:03] dcaro: the disk seems to have a number of problems
[16:40:05] Device: /dev/sdc [SAT], 1712 Offline uncorrectable sectors
[16:40:30] check `sudo journalctl | grep smartd`
[16:40:43] yep, that's the task we have been trying to work on to replace all of those: https://phabricator.wikimedia.org/T348716
[16:41:04] you can see the increase in that number over the last 30 days here https://grafana-rw.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?forceLogin=true&orgId=1
[16:41:08] (at the bottom)
[16:41:25] wait no, wrong task xd
[16:41:54] this https://phabricator.wikimedia.org/T348643
[16:42:13] there's a whole batch of servers affected
[16:42:21] wow
[16:42:23] https://usercontent.irccloud-cdn.com/file/2zcd7ypu/image.png
[16:43:09] We started with "please replace all these drives," Dell countered with "have you tried upgrading the firmware?" and the negotiations have continued from there
[16:43:34] hm, guess I shouldn't name vendors in a public IRC channel, oops
[16:43:36] ok, I think this started right before I went on the interstellar trip
[16:43:55] bd808: which phabricator project should I add to bug reports for wikimedia/slimapp?
[16:47:31] andrewbogott: No worries :-)
[16:47:47] taavi: well... none at the moment. The prod projects that used it have been archived. I probably should archive the ancient slim wrapper too. It's used by SAL and bash still however.
[16:48:02] yes, and tool-admin-web
[16:48:12] ugh. right. I was trying to move that to buildpacks but failed because `ErrorException: rtrim(): Passing null to parameter #1 ($string) of type string is deprecated`. https://phabricator.wikimedia.org/T356892.
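For reference, the `sudo journalctl | grep smartd` check mentioned above can be summarised per device. A small sketch, assuming smartd logs under that syslog identifier, that the journal is readable, and that the message format matches the "Offline uncorrectable sectors" line quoted in the chat:

```python
#!/usr/bin/env python3
"""Report the latest offline-uncorrectable-sector count per device from the journal.

Sketch of the smartd check above; the log-line pattern is based on the line quoted in chat.
"""
from __future__ import annotations

import re
import subprocess

LINE_RE = re.compile(
    r"Device: (?P<dev>\S+) \[SAT\], (?P<count>\d+) Offline uncorrectable sectors"
)


def uncorrectable_sectors() -> dict[str, int]:
    """Map device path -> most recent offline-uncorrectable-sector count."""
    out = subprocess.run(
        ["journalctl", "--no-pager", "-t", "smartd"],  # assumed identifier; grep works too
        capture_output=True, text=True, check=True,
    ).stdout
    counts: dict[str, int] = {}
    for line in out.splitlines():
        m = LINE_RE.search(line)
        if m:
            counts[m.group("dev")] = int(m.group("count"))  # later lines overwrite earlier ones
    return counts


if __name__ == "__main__":
    for dev, count in sorted(uncorrectable_sectors().items()):
        flag = "  <-- failing?" if count > 0 else ""
        print(f"{dev}: {count} offline uncorrectable sectors{flag}")
```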
[16:50:51] and that's a slim issue. ok. The "fix" is probably to rewrite admin-web. slimapp is based on Slim 2.x and Slim is currently 4.x upstream.
[16:52:48] If we think keeping the admin tool in PHP is important then we need to find a new framework. Or we could just port it all to flask I think
[16:53:43] honestly I'm tempted to replace it with something that just redirects people to toolsadmin/toolhub/wikitech
[16:55:18] harej wrote a task about figuring out how to just use Striker as the landing page. Toolhub is a better list of tools and the rest there is really just links to other tools at this point I think?
[16:57:07] There may be old links floating around the web that use the complicated wiki redirect things that Coren built into the original, but they are not likely to be workflow critical for anything.
[17:00:00] andrewbogott: do you have any tips on how to debug a Cinder snapshot that I cannot delete from Horizon? I get "scheduled for deletion" but I'm not sure if it's true
[17:01:02] dhinus: it's likely an interaction with the backup service somehow, might require manually purging snaps in rbd
[17:01:09] I think there are some docs, let me look...
[17:01:20] the same volume also has a backup snap
[17:01:41] This is approximately the docs for that issue: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_cinder_volumes#Snapshot_stuck_in_'deleting'_state
[17:01:46] the one I'm trying to delete has "protected: true", which I think was added when I created a volume off the snapshot. But I've since removed that volume.
[17:01:47] but I can also have a look
[17:02:23] the snap I want to delete is tools-db-2-snap in the tools project
[17:02:27] I opened an upstream issue about this but the devs (semi-reasonably) thought this was our weird use case rather than a real bug
[17:03:08] oh that cinder command is useful, let me try it
[17:03:23] so far I tried rbd snap ls eqiad1-cinder/volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869
[17:03:32] which shows 2 snapshots for that volume
[17:05:41] openstack reports the snapshot as "available"
[17:06:09] and does not show the backup snapshot at all, only my manual snap
[17:07:56] let me see if I can delete it from the CLI
[17:08:26] openstack won't know about backup things at all
[17:08:57] We probably need some special tooling for hunting/killing the whole tree of snapshots for a given cinder volume
[17:09:46] hmm but rbd does not show any children for that snapshot
[17:10:00] I created the snapshot yesterday, added one child, then removed the child
[17:10:15] * arturo offline
[17:10:22] the backup snap is a different snap on the same volume, it should be in a different branch of the tree
[17:11:10] I couldn't find an explanation of what actually happens when you delete a snap: does RBD merge two files? I think it's copy-on-write
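A read-only sketch of the "hunt the whole snapshot tree" tooling mused about above: for a given cinder volume it lists the RBD snapshots and any clones hanging off them. The pool name and volume UUID come from the chat; the JSON handling assumes a reasonably recent rbd CLI, and the script only prints, it deletes nothing:

```python
#!/usr/bin/env python3
"""List RBD snapshots (and their children) for a cinder volume's backing image.

Sketch only; run on a host with ceph admin credentials.
"""
import json
import subprocess

POOL = "eqiad1-cinder"  # pool name taken from the chat above


def rbd_json(*args: str):
    """Run an rbd subcommand with JSON output and return the parsed result."""
    out = subprocess.run(
        ["rbd", *args, "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


def describe_volume(volume_uuid: str) -> None:
    image = f"{POOL}/volume-{volume_uuid}"
    for snap in rbd_json("snap", "ls", image):
        name = snap["name"]
        protected = snap.get("protected", "unknown")
        children = rbd_json("children", f"{image}@{name}")
        print(f"{image}@{name} protected={protected} children={children or 'none'}")


if __name__ == "__main__":
    # UUID from the tools-db-2-snap discussion above
    describe_volume("e25dae8a-803a-4b62-aa0c-bdf6ff481869")
```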
[17:11:24] I added some questions for dell about the ceph disks here: T348643, if anyone has more questions please add them too
[17:11:25] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[17:11:50] dhinus: I imagine that if you delete a snap that itself has snaps, either that would be impossible or terrible things would happen
[17:12:05] and if you delete a snap that's a 'leaf' then no merging would be needed, right?
[17:12:17] I kinda think it's still needed, but I might be wrong
[17:12:27] I don't understand
[17:12:32] other snapshotting systems work by creating a "diff" file when you create a snap
[17:12:37] If image A has snap B...
[17:12:45] so the original file remains "frozen", and subsequent writes go to "diff-from-snap"
[17:12:45] ohhh I see
[17:12:54] but maybe RBD works differently
[17:13:16] that's the opposite way from how I was imagining but of course you could do it either way
[17:13:22] and your way would use less disk space
[17:13:27] I'm just not sure
[17:13:41] * andrewbogott reads some docs
[17:14:12] I found this page but it doesn't really explain https://docs.ceph.com/en/quincy/rbd/rbd-snapshot/
[17:14:17] we might be about to have another nfs failure
[17:14:26] :/
[17:14:28] oops, nope, just a brief spike
[17:14:36] health is ok
[17:14:37] I'll stop issuing "snapshot delete" commands just in case they're related :)
[17:14:43] hahahah
[17:15:02] I was just seeing the spike on https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&var-cluster_datasource=prometheus-tools&from=now-30m&to=now
[17:16:19] ah, ok, maybe there was slowness, ceph takes some time to call it 'slow ops'
[17:17:38] dhinus: I think you can configure it on rbd, it can either create a full copy, or a copy-on-write snapshot as you mention I think
[17:17:42] dhinus: lots of good docs there about images based off of snapshots but nothing about what the snapshot itself is
[17:18:05] you can also force a 'full' snapshot from a copy-on-write one iirc (then it does the copy of data + diff merging)
[17:18:20] I'm pretty sure it's now doing copy-on-write as it's very quick to create
[17:18:26] I suspect it might be slow to delete
[17:19:15] https://www.oreilly.com/library/view/mastering-ceph/9781785888786/ad8428cd-0a60-4310-b562-ddccf55cca66.xhtml
[17:20:32] sounds reasonable yes
[17:20:38] I'm also finding some people saying that "snaptrim" can slow down the cluster
[17:21:04] https://forum.proxmox.com/threads/ceph-snaptrim-causing-perforamnce-impact-on-whole-cluster-since-update.132543/
[17:21:09] so maybe it really is deleting snaps that's causing the spike?
[17:21:11] "my whole system gehts slowed down when i delete snapshots"
[17:22:11] I can't really understand the phrase 'wait for the specified number of settings between the trimming' but that's at least intended to help with the slowdown (by making trimming even slower)
[17:24:23] I think it might be a typo s/settings/seconds/
[17:24:44] LOL
[17:25:13] in the meantime, the snapshot is still marked as "available" so I'm not even sure if the deletion/trimming process started or not
[17:26:03] the k8s worker reboot cookbook is failing on 'Something happened while rebooting host tools-k8s-worker-nfs-9, trying a hard rebooting the instance' and I don't have time to look at it now
[17:26:54] taavi: does that mean we have to start the whole reboot cycle over again? Or do you have a paste of the remaining hosts that need reboots?
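To make the copy-on-write reasoning above concrete, here is a toy model of the general idea only, not how RBD actually stores snapshots: taking a snapshot copies nothing up front, overwriting a block preserves the old contents for existing snapshots, and deleting a snapshot just discards those preserved blocks (reclaiming that space is roughly the background "snaptrim" work discussed above):

```python
"""Toy copy-on-write snapshot model; an illustration only, not RBD's real on-disk format."""
from __future__ import annotations


class ToyImage:
    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}            # live image data
        self.snaps: dict[str, dict[int, bytes]] = {}  # per-snapshot preserved blocks

    def snapshot(self, name: str) -> None:
        self.snaps[name] = {}  # cheap: nothing is copied yet

    def write(self, block: int, data: bytes) -> None:
        old = self.blocks.get(block)
        if old is not None:
            for preserved in self.snaps.values():
                # copy-on-write: keep the pre-overwrite contents for any snapshot
                # that has not preserved this block already
                preserved.setdefault(block, old)
        self.blocks[block] = data

    def read_snapshot(self, name: str, block: int):
        # a snapshot's view is its preserved blocks, falling back to the live image
        return self.snaps[name].get(block, self.blocks.get(block))

    def delete_snapshot(self, name: str) -> None:
        # no merging of live data; the preserved blocks are simply dropped,
        # and reclaiming that space lazily is the snaptrim-style cleanup
        del self.snaps[name]


if __name__ == "__main__":
    img = ToyImage()
    img.write(0, b"v1")
    img.snapshot("tools-db-2-snap")  # name reused from the chat, purely illustrative
    img.write(0, b"v2")
    assert img.read_snapshot("tools-db-2-snap", 0) == b"v1"
    img.delete_snapshot("tools-db-2-snap")  # quick bookkeeping, space reclaimed lazily
```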
[17:27:30] https://www.irccloud.com/pastebin/lC5LOKyn/
[17:27:41] It did reboot. So if you can just ignore that message you should :)
[17:29:07] the cookbook crashed on some of the first (last) buster workers, I tried to retry it and now it's throwing that on the first node
[17:29:15] the -nfs- workers have been restarted, the rest have not
[17:29:21] and -nfs-9 is probably cordoned
[17:31:32] dhinus: these are the settings we have enabled
[17:31:36] https://www.irccloud.com/pastebin/JyRU1hKP/
[17:31:38] (all defaults)
[17:31:49] ceph config show-with-defaults osd.0 will show
[17:31:54] (it's quite long)
[17:33:00] thanks. I'm checking the logs and I see "snaptrim" in the ceph-mgr logs, but the time is earlier than my first deletion attempt
[17:33:19] I wonder if backups are also triggering snaptrim
[17:34:28] they should I think, but the diffs should be quite small (the duration of the backup)
[17:36:40] we can try setting the osd_snap_trim_seconds
[17:36:53] *osd_snap_trim_sleep
[17:37:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/998483
[17:38:32] Hm, is btullis also using that template now?
[17:38:54] isn't _ssd only for ssd drives?
[17:39:08] yes
[17:39:14] which is what we have, I think?
[17:39:23] anyway I updated the description to clarify
[17:39:26] I'd use the non-backend-specific one
[17:39:31] (affects all)
[17:39:41] good point about the usage though, let me check
[17:39:50] * andrewbogott quotes self
[17:39:54] "The default is 0.0 for ssds and 5.0 for hdds. This patch increases
[17:39:54] the delay for ssds."
[17:40:17] If we change the general setting it will reduce the time limit for hdds, which seems bad
[17:40:50] ah, good point
[17:41:47] I hate that the line limit in commit messages means I can't add urls...
[17:43:17] smartctl shows the disk model, and it's SSD, you're right. I was confusing it with NVMe
[17:43:35] oh, the ssd/hdd there refers to how the device is detected by ceph
[17:43:40] (it can be manually changed)
[17:44:04] as shown by eph osd df
[17:44:09] *ceph osd df for example
[17:45:12] ok, so we probably want to wait for ben to +1 before merging
[17:45:26] and then I'll need advice about how to make the change live on our osds
[17:46:49] taavi: I'm happy to take over the rolling reboot thing if you need to go. Your plan is just to re-run it as many times as it takes? That's presumed harmless?
[17:47:29] dhinus: this isn't necessarily actually fixing your initial problem though! I'm about up to my max stack depth, if you want to make me a task about deleting that snap I can look sometime soon
[17:47:30] andrewbogott: no, as I said it now constantly fails on the very first node and you likely need to figure out why
[17:47:34] andrewbogott: that'd be `ceph config set osd ` on any mon/mgr node
[17:48:34] taavi: understood, but my question is: if I get it to run halfway through and then have to fix another issue, it's fine to just start over at the beginning?
[17:48:46] thanks dcaro, I added that to the patch so I'll remember when merging
[17:48:52] ack
[17:49:09] I need to go, feel free to ping me on telegram if you need anything, I will not be far from a laptop if needed
[17:49:14] ok! I am assuming that all of this snapshot cleanup/performance stuff can wait a day or two
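A small sketch of how the runtime override mentioned above could be applied and verified from a mon/mgr node. The target value here is a placeholder (the real number is whatever the gerrit patch sets), and this is a runtime override on top of whatever puppet manages:

```python
#!/usr/bin/env python3
"""Apply and verify a runtime ceph config override for the snap trim sleep.

Sketch only; TARGET is a placeholder and should match the puppet/gerrit change above.
"""
import subprocess

OPTION = "osd_snap_trim_sleep_ssd"  # per the discussion, only the ssd-specific knob
TARGET = "5.0"                      # placeholder: seconds to sleep between snap trim ops


def ceph(*args: str) -> str:
    """Run a ceph CLI subcommand and return its stdout."""
    return subprocess.run(
        ["ceph", *args], capture_output=True, text=True, check=True
    ).stdout.strip()


if __name__ == "__main__":
    print(f"current {OPTION}: {ceph('config', 'get', 'osd', OPTION)}")
    ceph("config", "set", "osd", OPTION, TARGET)
    print(f"new {OPTION}: {ceph('config', 'get', 'osd', OPTION)}")
    # to see the effective per-daemon value, defaults included:
    #   ceph config show-with-defaults osd.0 | grep snap_trim
```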
[17:49:30] andrewbogott: that's totally fine, but will take forever
[17:49:51] yep
[17:50:09] andrewbogott: you can add what you find here - https://phabricator.wikimedia.org/T334240 I'll continue working on it tomorrow with logs and such
[17:50:15] and I think we have several nodes that are refusing to stop things, which is causing user-visible service outages, e.g. wikibugs did not come back after I restarted it
[17:50:31] and lucaswerkmeister seems to be having some issues restarting things too
[17:51:55] taavi: is there an alternative to restarting the cookbook from the beginning?
[17:52:09] you can give it a list of nodes manually
[17:52:11] taavi: as in they are failing to reboot? or to drain? (if a container is in D status kubelet might fail to drain it)
[17:52:36] but again, right now we have rebooted like less than ~10 nodes out of the ~70 total, so I would focus on getting it running properly first
[17:52:53] dcaro: containers that are refusing to stop due to stuck processes most likely
[17:53:17] I have not looked, since this is the same thing that always happens when NFS is even a bit unhappy (and here it was stuck for ~15m)
[17:55:10] taavi, can you paste the exact command you're running?
[17:58:12] sudo cookbook wmcs.toolforge.k8s.reboot --cluster-name tools --all-workers
[17:58:25] 'k
[17:58:28] does the cookbook have a '--just-reboot'? (we could add that, not draining them should not be that bad I think as arturo said, worst case some processes are not able to clean up, though I suspect none is actually doing much)
[17:59:04] sorry but I want to do something else this evening than to babysit a cookbook to resolve a user-facing outage on one of our core services that I've been trying to get other people to learn how it works for several years without any success
[17:59:39] taavi: you're fine, if you need to go then just go! I'll do reboots
[18:03:32] andrewbogott: sorry I had to pick up a phone call. there's no rush to delete that snapshot, I'll make a task
[18:03:41] great
[18:43:04] andrewbogott: this is the task about the snapshot issues T356904
[18:43:05] T356904: [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904
[18:43:13] thx
[18:50:53] thanks!
[18:51:06] now I'm really off xd
[18:51:11] cya tomorrow!
[23:45:57] * bd808 off