[06:57:35] greetings
[08:17:11] I'd like to tackle the trixie nfs server for tools/toolsbeta at T404584, anything I should know and/or to be aware of ?
[08:17:11] T404584: Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584
[08:18:26] godog: the usual VIP failover procedure will not work because the new server will be in the dualstack network but the old server in the legacy vlan
[08:18:57] ah! great point
[08:19:38] then we're looking at changing the VIP address
[08:41:39] morning
[08:42:28] godog: bonus points for making the NFS server available over both v4 and v6
[08:42:40] heheh
[08:43:19] and directly win the championship if you replace nfs entirely :D
[08:43:38] * godog unsubscribe
[08:44:18] *replace with something nicer
[08:44:42] :D
[08:50:40] ok I'm going to attempt https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Create_an_NFS_server#Create_a_replacement_server_for_an_existing_service for toolsbeta nfs
[08:50:48] please stand by for hideous breakage
[08:51:28] `Be careful! On 2024-03-26 the cookbook formatted the filesystem of the toolsbeta-nfs volume requiring a restore from the backups. gerrit:1014543 should prevent that from happening, but the fix has not yet been properly tested.`
[08:51:31] :/
[08:52:07] so make sure to have a backup xd
[08:52:56] heh I was hoping that note refers to migrate_service but maybe not ?
[08:53:16] I'm doing add_server first that is
[08:53:28] but yes good point re: backup, happy to look at docs on how to do that
[08:54:38] that comment is not very encouraging
[08:55:05] sorry xd
[08:55:32] as I understand, the process would be create the new server to replace the current one, and then run the migration cookbooks to move it over
[08:56:04] but I'm not very familiar with these cookbooks, so might be reading it wrong
[08:58:41] we can do a trial also in some other project, create the nfs server and try the migration (it would be in the newer vlan from scratch though, so only the volume migration side would be tested)
[08:59:01] i'm fairly sure the cookbooks have an assumption that there's at most one nfs server per project
[08:59:21] we can use testlabs
[08:59:39] there's testlabs-nfs-1 there already, though not sure if it's functional
[09:00:22] vm is up and running
[09:00:55] mmhh ok I'll check testlabs
[09:01:20] taavi: which cookbooks are you thinking of ?
[09:01:34] that might help test the cross-vlan migration too (if testlabs-nfs-1 works)
[09:01:51] godog: wmcs.nfs.*
[09:01:52] https://www.irccloud.com/pastebin/c5uR2E6i/
[09:02:20] got it, ok
[09:02:50] can't say I'm not confused if the nfs cookbooks dealing with migration assume there's at most one nfs server per project
[09:05:25] I'm guessing one active nfs
[09:05:34] (vip/volume attachments/etc.)
[09:06:46] yes, that's what I meant
[09:07:01] doh of course that's way more reasonable, thank you
[11:54:54] the findings and a possible strategy so far https://phabricator.wikimedia.org/T404584#11184317
[12:18:07] toolsbeta build log tailing is not working, looking
[12:19:23] godog: that will force all clients to reboot? or will they pick up the DNS change correctly?
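A minimal sketch (not from the log) of how one could answer that question on a client, assuming the toolsbeta service hostname used elsewhere in this conversation; everything else is generic:

```bash
# On an NFS client (e.g. a k8s worker): see which server address the mount is
# actually pinned to (the addr=... option) vs. what the service name resolves to.
findmnt -t nfs4 -o TARGET,SOURCE,OPTIONS
dig +short toolsbeta-nfs.svc.toolsbeta.eqiad1.wikimedia.cloud
# If addr= still points at the old server after the DNS/VIP change, the client
# is holding on to the old mount and will need a remount or a reboot.
```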
[12:20:23] mount shows the address is connected to, not sure if that's helpful
[12:20:26] `toolsbeta-nfs.svc.toolsbeta.eqiad1.wikimedia.cloud:/srv/toolsbeta/misc/shared/toolsbeta/home on /mnt/nfs/nfs-01-toolsbeta-home type nfs4 (rw,noatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.18.135,local_lock=none,addr=172.16.1.238)`
[12:22:15] I'm asking because it might mean downtime until we reboot workers
[12:22:24] dcaro: yeah I don't know yet how clients will react, sth to test for sure
[12:23:01] but yes either way there will be some downtime for sure
[12:26:40] 👍 we should then announce and plan a window when we have an idea of the process
[12:29:06] yes indeed, how much in advance are maint windows for toolforge announced generally ?
[12:29:56] depends, usually one week at least
[12:30:36] ok! just out of curiosity really, I don't know how far we are from actually doing tools
[12:40:30] hopefully not very far, and it will fix the current issues :)
[12:41:52] heheh I appreciate your optimism, I don't have any evidence that upgrading to trixie will fix the current issues btw though it is something we'll have to do anyways eventually
[12:42:57] and if it is indeed sth wrong with client/server interaction then we'll have more up to date info
[12:46:55] hmm.... there's something going on with kubectl logs in toolsbeta, requests time out
[12:46:57] https://www.irccloud.com/pastebin/LBUelc7U/
[12:47:14] it's only that worker (other seem to pass)
[12:47:24] so maybe that worker is misbehaving somehow
[12:48:33] could it be a network/vlan thing?
[12:48:55] hmm... but it was working yesterday, so probably not?
[12:50:25] hmm... there's supiciously no calico filters in iptables for that worker
[12:50:30] https://www.irccloud.com/pastebin/U3l48e9y/
[12:50:39] I'll reboot
[12:50:48] unless someone wants to spend some time debugging?
[12:51:39] +1 to reboot
[12:57:51] I'm getting NFS ads now already ;_;
[12:58:16] hahahahahaha
[12:58:50] https://www.quobyte.com for the curious, I had to check
[12:59:46] it got me with the graph
[12:59:48] https://usercontent.irccloud-cdn.com/file/LZ3y0aze/image.png
[13:00:13] totally not AI generated
[13:08:34] hmm.... toolsbeta is doing weird things :/
[13:11:48] I seem unable to retrieve logs from any alloy pods except the one in worker-nfs-10
[13:14:02] restarting one of the pods made it show logs now :/
[13:17:39] getting errors like
[13:17:40] https://www.irccloud.com/pastebin/wmlo7k1P/
[13:17:49] though it's ignoring it, so probably ok?
[13:19:27] even recreating the pods, worker-nfs-5 is not streaming any logs :/, that one has something specially wrong
[13:19:39] (kubectl logs I mean, as in pod lags)
[13:19:40] *logs
[13:21:59] I'll drain completely and reboot
[13:25:12] quick reviwe for https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1187762?
[13:28:03] LGTM
[13:29:39] hmm... that was not enough
[13:29:40] https://www.irccloud.com/pastebin/gmrgYI57/
[13:29:55] let me fully stop, and then start, so it recreates the domain in the hypervisor
[13:36:37] that worker is cursed, even the force restart did not get the logs working
[13:36:45] https://www.irccloud.com/pastebin/9Og1Gl9A/
[13:36:53] I'll delete and create a new one
[13:51:43] andrewbogott: when you get a chance, I'm seeking feedback on https://phabricator.wikimedia.org/T404584#11184317
[13:51:59] context is upgrading nfs servers to trixie
[13:52:20] ok! will look
[13:53:09] cheers
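A rough sketch (not from the log) of the checks behind the worker-nfs-5 debugging above: `kubectl logs` is streamed via the kubelet on port 10250, so a broken path to the node shows up as exactly this kind of timeout, and a node with almost no calico iptables chains matches the "no calico filters" observation. The node name below is hypothetical:

```bash
# Hypothetical node name; substitute the real misbehaving worker.
NODE=worker-nfs-5
NODE_IP=$(kubectl get node "$NODE" \
  -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
# kubectl logs is served by the kubelet on 10250; check basic reachability.
timeout 5 bash -c "cat < /dev/null > /dev/tcp/${NODE_IP}/10250" \
  && echo "kubelet port reachable" || echo "kubelet port NOT reachable"

# On the worker itself: a healthy calico node carries many cali-* chains;
# a count near zero matches what was pasted above.
sudo iptables-save | grep -c -- ':cali-'
```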
[13:54:32] hmpf... now it's failing to add a k8s node xd
[13:54:34] https://www.irccloud.com/pastebin/ZZtlBe6M/
[13:55:46] which distro? bookworm?
[13:56:48] let me check (used the cookbook without passing it)
[13:57:22] debian 12 yes
[13:58:37] til about `/etc/os-release`, I'll stop with `/etc/debian_version`
[13:58:54] somehow kubelet is not getting installed on that host??
[13:58:57] it seems kubelet is not installed
[13:58:58] yep
[13:59:20] it's there in the repos
[13:59:22] https://www.irccloud.com/pastebin/gFXlopTF/
[13:59:24] looking
[13:59:46] well it's not on the list of packages in kubeadm::core
[14:00:02] was it pulled as a dependency of kubeadm in the past or something like that?
[14:00:02] maybe an ordering issue?
[14:00:09] (it fails to start the service)
[14:00:18] oh, it's **not** on the list
[14:00:24] might be :/
[14:00:41] want to send a patch to add it or should I?
[14:00:49] please do, I'll +1 xd
[14:01:08] I wanted to add also a few more workers to the tools cluster today, as it has triggered the cpu alert a few times in the last couple days
[14:02:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188788
[14:02:32] yeah, the 1.29 kubeadm package had a Depends: kubelet, that's no longer there
[14:02:40] no longer there on 1.30, I mean
[14:03:37] hmm, maybe we should add a "create a new worker node" to the list of things after upgrading
[14:07:06] Notice: /Stage[main]/Kubeadm::Core/Package[kubelet]/ensure: created
[14:07:07] there we go
[14:08:10] okok, not sure if it's worth trying to manually continue, or better to remove and start from scratch
[14:08:15] thanks btw :)
[14:08:32] I think I'll start from scratch
[14:08:33] probably the latter
[14:08:36] yep
[14:12:33] bd808: I'm think that your fix https://phabricator.wikimedia.org/P83350 got the build working by stopping the installation of static resources entirely -- it fixed the broken dependencies by having there not be any dependencies at all :/ For instance, if I search in /opt/lib/python/site-packages/horizon there are 0 .html files. Is that true in your build as well?
[14:12:39] it reused the name :/, I hope there's no issues 🤞
[14:13:12] as long as you removed it with the cookbook (or removed the puppet certs manually) it Should be fine
[14:14:43] I did use the cookbooks yep
[14:24:54] btw. any opinions on increasing the size of the workers? I think we agreed to do so some time ago, but afaik we have not yet done it
[14:25:04] I can ask today in the monthly too xd
[14:57:19] hmpf... toolsbeta is not out of the woods yet
[14:57:27] I'm getting random timeouts from the api itself
[14:57:36] https://www.irccloud.com/pastebin/caz1LKuV/
[14:57:44] https://www.irccloud.com/pastebin/sU9eoXwP/
[15:12:59] andrewbogott: starting my day with a wall of meetings, but I will try to peek at things. I actually would not expect running a manage.py action to place files in a system library directory.
[15:13:34] hm, true...
[15:13:42] I guess that is why you were using that root to run the script before?
[15:14:40] So... hm
[15:14:58] I guess we still have the issue of pbr not creating the manifests correctly.
[15:15:14] And then manage.py needs to /find/ the files.
[15:15:47] anyway -- you should pay attention to your meetings, I'll keep digging
[15:24:44] I'm rebuilding one of the ingress nodes in toolsbeta too, it seems stuck :/v
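Circling back to the kubeadm/kubelet dependency change discussed above (around 14:02): a generic sketch (not from the log) of how to confirm it from a worker, assuming the packages come from the project's apt repo; version strings will vary:

```bash
# Confirm whether the available kubeadm build still declares a Depends: on
# kubelet (the 1.29 packages did, the 1.30 ones apparently do not).
apt-cache show kubeadm | grep -E '^(Version|Depends):'

# And verify kubelet actually got installed and is running on the new worker,
# now that puppet's kubeadm::core has to pull it in explicitly.
dpkg -l kubelet && systemctl is-active kubelet
```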
[15:24:53] taavi@tools-sgebastion-10:~ $ sudo shutdown tools-sgebastion-10
[15:24:53] W: molly-guard: SSH session detected!
Please type in hostname of the machine to shutdown: tools-sgebastion-10
[15:24:53] Failed to parse time specification: tools-sgebastion-10
[15:25:04] anyone know why exactly that would fail with "Failed to parse time specification"??
[15:25:22] sounds like a fun rabbit hole xd
[15:25:45] ah, because the correct command is `sudo shutdown now`
[15:25:47] derp
[15:25:55] hahahahah
[15:26:05] I'm an idiot
[15:26:22] * dcaro empathizes
[15:29:11] 32 files changed, 9 insertions(+), 1430 deletions(-)
[15:30:45] \o/
[15:31:14] taavi: hehe I always use `poweroff` for that reason
[15:32:09] * taavi goes hunting puppet.git for wmcs-specific buster references to remove
[15:44:36] for review: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1188831/
[15:46:39] +1d :)
[15:47:03] btw, this means we can make all the client packages build on trixie only
[15:47:13] extra 🎉!
[15:48:46] T404733
[15:48:47] T404733: Update Toolforge client packages to build on Trixie only - https://phabricator.wikimedia.org/T404733
[15:49:04] nice :)
[16:44:17] now bash history strikes again, and I accidentally shut down tools-bastion-14
[17:00:48] dcaro, regarding https://phabricator.wikimedia.org/T348643#10170362, do you recall if 1025 ever got put back together again?
[17:03:43] andrewbogott: still removed I think https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/eqiad/profile/cloudceph/osd.yaml#151
[17:03:58] great, I'll continue to ignore
[17:05:54] might be good to ping someone though, we should get those back
[17:09:23] * andrewbogott nods
[17:18:01] andrewbogott: asked in the task, we'll see
[17:18:11] thx
[17:37:26] * dcaro off
[17:37:28] cya tomorrow!
[23:09:48] andrewbogott: I'm back to trying the full installer script after proving both inside and outside the container that a "simple" `pip install ./horizon` does collect the templates in the wheel and unpacks them later. This is all so weird.
[23:12:00] does it at least fail for you in the same way as it does for me? I was starting to think the bug was just "it works except when Andrew does it"
[23:13:56] I think I did recreate it, but I did not verify the reproduction today.
[23:14:17] I did just make it appear to work...
[23:14:37] `find /opt/lib/python/site-packages/horizon/templates -type f -name '*.html' -print | wc -l` 85
[23:15:34] that's more than 0! Was that with a fresh install, or with re-running things in the existing container?
[23:15:49] I *think* the only thing I have changed that could be material is removing `--use-pep517` from pip_install
[23:16:27] that's interesting since I added it in the first place to get pbr behaving...
[23:16:32] that was from a manual run of bin/installpanels.sh inside an empty container
[23:16:32] but other things have changed since then! I'll try it
[23:17:19] oh. that works for me too
[23:17:30] it's only when I do the fresh build that it misbehaves, I think
[23:17:38] although I haven't been taking notes so maybe I'm misremembering that test.
[23:17:52] but anyway, I'm trying a fresh one without use-pep517 so you don't need to run that experiment
[23:19:23] my 'empty container' is the setup from https://phabricator.wikimedia.org/P83350#334471, so a container with the build deps in place, but which has never had bin/installpanels.sh run.
[23:21:36] ok, that's different from what I tried
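A sketch of the kind of check being described above, assuming the horizon source sits in ./horizon and the /opt venv path quoted in the chat; the wheel filename pattern is a guess:

```bash
# Build the wheel the way pip would and look inside it for the Django templates
# that keep going missing.
pip wheel --no-deps -w /tmp/wheels ./horizon
unzip -l /tmp/wheels/horizon-*.whl | grep -c 'templates/.*\.html'

# Compare with what actually landed after the full install script ran:
find /opt/lib/python/site-packages/horizon/templates -type f -name '*.html' | wc -l
```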
[23:40:45] I figured out that something like `--use-pep517` is default behavior for pip in the modern age. PEP 517 was the introduction of [build-system] definition in pyproject.toml.
[23:43:36] yeah, it ought to be
[23:43:38] https://www.irccloud.com/pastebin/epoRoU2Y/
[23:44:53] Now I'm re-running bin/installpanels.sh in the container with my cwd as /srv/app
[23:46:53] I was in /srv/app when I ran bin/installpanels.sh, but I think that's where cwd would be when blubber runs it too?
[23:52:15] I assumed so, but we could add an explicit cd to the script easily enough
[23:59:20] hm, nope, removing things in /opt and re-running I still don't get html files
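One hypothesis worth ruling out here, not something anyone stated in the log: pbr derives its file manifest from `git ls-files`, so a source tree copied into the build container without its .git directory, or an image without git installed, can yield a wheel that silently drops non-Python files such as the templates. A quick check, assuming the source lives under /srv/app/horizon:

```bash
# Hypothesis check only: does the build environment have git, and is the
# horizon tree actually a git checkout that pbr can enumerate?
command -v git || echo "git is not available in the build environment"
git -C /srv/app/horizon rev-parse --is-inside-work-tree 2>/dev/null \
  || echo "/srv/app/horizon is not a git checkout"
git -C /srv/app/horizon ls-files '*.html' | wc -l   # pbr only ships what git knows about
```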