[00:18:01] * bd808 off
[08:36:08] morning
[08:39:12] morning
[09:09:34] o/
[09:10:36] morning
[09:25:45] o/
[09:52:38] I finally got annoyed enough about all the alert flaps when adding/removing instances that I finished my spicerack patch for metricsinfra alertmanager support https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1014099
[09:54:28] woot woot!
[09:58:48] we can probably do the same for the other cookbooks where we reboot hosts?
[10:01:42] something like https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/95cc3d1cb70bb0933451ccd7a2a83f89f2080e3d%5E%21/#F0 will already work for hardware using the wikiprod alertmanager, we should adopt that everywhere
[10:01:55] that patch will also add support for downtiming VMs in the metricsinfra alertmanager
[10:07:54] yup I might have a go at copying that approach to the cookbooks rebooting cloudnets/cloudcontrols/etc.
[12:45:33] andrewbogott: do you expect the cookbooks to just work for T360419? if so, should I go for bullseye or for bookworm directly?
[12:45:34] T360419: Upgrade toolsbeta-nfs to Debian Bullseye/Bookworm - https://phabricator.wikimedia.org/T360419
[12:58:59] taavi: They worked last time I tried them! And should work with bookworm.
[13:00:23] taavi: you're looking at https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Create_an_NFS_server#Create_a_replacement_server_for_an_existing_service ?
[13:01:20] yep
[13:03:47] andrewbogott: the header in https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/blob/master/disable_tool.py does not state the GPL version, did you have a specific one you prefer? GPL-3.0-only or GPL-3.0-or-later?
[13:04:49] Huh, I do not have a strong opinion but would be inclined to 3-or-later
[13:05:25] I can adjust
[13:06:37] sure, and please also add a copy of the text itself to the repo
[13:21:09] the cookbook failed due to a puppet certificate error :(
[13:23:46] hm, those nfs servers don't need any secrets do they?
[13:23:55] So probably we can override to use the central puppetmaster
[13:23:56] * andrewbogott looks
[13:24:42] taavi: try now?
[13:24:57] I added puppetmaster: puppet to the nfs hiera prefix
[13:25:24] that will break puppet certs on the existing nfs server but you're about to delete that one anyway
[13:25:31] ah, I "fixed" that by running the refresh certs cookbook and https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1014506
[13:25:47] since it was like the last step of the cookbook that failed
[13:25:55] that'll also do it
[13:26:05] in that case I'll remove my change so it doesn't break again on next run :)
[13:31:35] Exception: Server toolsbeta-nfs-3.toolsbeta.eqiad1.wikimedia.cloud must have profile::wmcs::nfs::standalone::volumes: ['toolsbeta']
[13:33:52] ok, apparently fixed
[13:34:32] or not
[13:35:45] now 'Exception: service ip name mismatch. Expected toolsbeta, found toolsbeta-nfs'
[13:36:04] is this for the creation or the failover?
[13:36:15] failover
[13:37:20] so there's a disagreement about --prefix probably...
[13:38:03] probably the initial setup was with prefix 'toolsbeta' rather than 'toolsbeta-nfs'
[13:38:17] the initial run was with:
[13:38:17] $ sudo cookbook wmcs.nfs.add_server --project toolsbeta --prefix toolsbeta-nfs --image debian-12.0-bookworm --network lan-flat-cloudinstances2b toolsbeta-nfs
[13:38:44] it then complained about the volume, so I manually edited `profile::wmcs::nfs::standalone::volumes` from 'toolsbeta-nfs' to 'toolsbeta'
[13:39:53] ok. It may be that this server is old and predates some cookbook standardization. But...
[13:40:22] if it were me I'd start over and create the server anew with --prefix toolsbeta
[13:40:28] in case that's embedded someplace wrong
[13:40:37] ok, let's try that
[13:40:38] but probably you are ahead of me already, I'm just now looking at the cookbook code
[13:40:52] let me get a few cookbook patches ready first
[13:44:55] hm, I don't love this, I see everything in the existing server as using toolsbeta-nfs.
[13:45:04] So I don't really understand why it didn't work for you the first time.
[13:45:12] andrewbogott: please review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014511/ (I swear that is directly related)
[13:45:23] But let me know when you're ready and I'll watch :)
[13:50:25] ok, I'm ready
[13:50:32] so I'll remove the old -3 first?
[13:51:09] yeah
[13:51:22] so I think I see the confusion, when creating the host you can specify the volume name and the prefix separately.
[13:51:33] So now my question is, in the migration script does it assume they're the same?
[13:52:12] This is bad because I'm converging on making the exact hack that you did before
[13:52:15] what migration cli did you ue?
[13:52:17] *use?
[13:52:25] $ sudo cookbook wmcs.nfs.migrate_service --project toolsbeta --from-host-id a4291ef9-d2ef-4301-9a17-9df66c23f125 --to-host-id 856bd127-3355-4f33-9832-ff754322088a
[13:52:45] right, so it doesn't take a prefix, it just gets it from hiera
[13:52:48] me reads a bit more
[13:53:03] so how do you want me to run the add_server this time? last time was --prefix toolsbeta-nfs and volume toolsbeta-nfs. --prefix toolsbeta-nfs and volume toolsbeta?
[13:53:31] I'm surprised it lets you specify --volume at all
[13:53:34] what's the full line you used before?
[13:53:55] oh wait, I'm no longer surprised
[13:53:56] 15:38:17 $ sudo cookbook wmcs.nfs.add_server --project toolsbeta --prefix toolsbeta-nfs --image debian-12.0-bookworm --network lan-flat-cloudinstances2b toolsbeta-nfs
[13:54:31] yeah, let's try that with --volume toolsbeta
[13:54:48] ok, doing
[13:54:50] (I'm also pretty sure that I should just remove that --volume arg and force it to ==prefix but it might be too late for that)
[13:55:37] and I forgot to run it with test-cookbook for the patches I want. so aborting and retrying
[13:56:56] running for real this time
[14:10:30] * arturo food
[14:10:39] andrewbogott: new VM created, trying the migrate script now
[14:10:59] * andrewbogott is not super optimistic
[14:11:53] Exception: service ip name mismatch. Expected toolsbeta, found toolsbeta-nfs
[14:12:31] At least it's consistent!
[14:12:35] * andrewbogott reads the cookbook again
[14:14:36] that test just looks wrong to me. Let's see...
[14:18:44] taavi: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1014521
[14:20:16] can you rebase that on top of https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1014520?
[14:20:49] done
[14:22:21] ok, let me give that a try
[14:28:59] * taavi curses the yaml-wrapped-in-YAML wmcs-enc-cli get_prefix_hiera interface
[14:29:06] andrewbogott: now it's doing something
[14:29:41] great! I wonder how/if it ever worked before
[14:31:42] so it finished successfully
[14:32:27] toolsbeta-bastion-6 says this:
[14:32:28] Could not chdir to home directory /home/taavi: Stale file handle
[14:32:28] -bash: /home/taavi/.bash_profile: Stale file handle
[14:32:33] I'll reboot it and let's see what happens
[14:33:06] that sounds like nfs hiccup
[14:34:45] we're migrating the toolsbeta NFS service, so very likely
[14:35:03] andrewbogott: taavi@toolsbeta-bastion-6:~$ sudo mount -a
[14:35:03] mount.nfs: mounting toolsbeta-nfs.svc.toolsbeta.eqiad1.wikimedia.cloud:/srv/toolsbeta/misc/shared/toolsbeta/project/ failed, reason given by server: No such file or directory
[14:35:03] mount.nfs: mounting toolsbeta-nfs.svc.toolsbeta.eqiad1.wikimedia.cloud:/srv/toolsbeta/misc/shared/toolsbeta/home/ failed, reason given by server: No such file or directory
[14:35:13] the mounted volume is empty
[14:35:18] on nfs-3 that is
[14:35:26] uh, there is no mounted volume on -nfs-3
[14:35:41] forget that, there was and it is empty
[14:36:41] andrewbogott: https://phabricator.wikimedia.org/P58925 did the migration script format all data on that volume?
[14:40:46] hmm I see two toolsbeta-nfs volumes in https://horizon.wikimedia.org/project/volumes/
[14:42:12] andrewbogott: any clue what's going on here?
[14:53:28] I saw the two volumes and was also confused
[14:53:37] but I don't know much, let me look
[15:00:01] taavi: do you have reason to think that this volume contained things, before?
[15:01:13] andrewbogott: I don't see where else /home and /data/project on toolsbeta would have been stored before
[15:01:41] ok
[15:01:47] I confirmed that the other volume is also empty
[15:02:35] I have the feeling we're about to test our backup system :/
[15:04:57] taavi: is the output from prepare-cinder-volume in the cookbook logs anywhere?
[15:10:15] andrewbogott: 16:36:41 andrewbogott: https://phabricator.wikimedia.org/P58925 did the migration script format all data on that volume?
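The question above, whether the migration formatted a volume that already held data, points at a guard of roughly this shape: look for an existing filesystem in `lsblk` output before any format is allowed. This is only a minimal sketch. It assumes the JSON layout printed by `lsblk --json -o NAME,FSTYPE`; the function names `find_fstype` and `safe_to_format` are hypothetical, and nothing here reflects how wmcs-prepare-cinder-volume or the cookbooks are actually implemented.

```python
import json
from typing import Optional


def find_fstype(lsblk_json: str, device: str) -> Optional[str]:
    """Return the filesystem type lsblk reports for `device`, or None.

    `lsblk_json` is the text printed by `lsblk --json -o NAME,FSTYPE`.
    Raises LookupError when the device is not listed at all.
    """

    def walk(nodes):
        # Partitions are nested under their parent disk in "children".
        for node in nodes:
            if node.get("name") == device:
                return node
            found = walk(node.get("children", []))
            if found is not None:
                return found
        return None

    node = walk(json.loads(lsblk_json).get("blockdevices", []))
    if node is None:
        raise LookupError(f"device {device!r} not found in lsblk output")
    return node.get("fstype")


def safe_to_format(lsblk_json: str, device: str) -> bool:
    """Hypothetical guard: allow formatting only when no filesystem is detected."""
    return find_fstype(lsblk_json, device) is None
```

A caller would capture the `lsblk` output itself (for example with `subprocess.run`) and refuse to format whenever `safe_to_format` returns False. The sketch only inspects the named device, so a disk whose partitions hold filesystems would need an extra check on its children.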
[15:10:45] welp
[15:10:51] can you tell what the call was to prepare-cinder-volume?
[15:11:28] it should be in the spicerack logs, one second
[15:12:27] 2024-03-26 14:29:12,945 taavi 2083597 [INFO] Executing commands [cumin.transports.Command('sudo -i wmcs-prepare-cinder-volume --device sdb --options \'rw,nofail,x-systemd.device-timeout=2s,noatime,data=ordered\' --mountpoint /srv/toolsbeta --force', ok_codes=[0])] on '1' hosts: toolsbeta-nfs-3.toolsbeta.eqiad1.wikimedia.cloud
[15:12:27] 2024-03-26 14:29:17,992 taavi 2083597 [INFO] Completed command 'sudo -i wmcs-prepare-cinder-volume --device sdb --options 'rw,nofail,x-systemd.device-timeout=2s,noatime,data=ordered' --mountpoint /srv/toolsbeta --force'
[15:14:41] So either this is wrong
[15:14:44] https://www.irccloud.com/pastebin/3JrIwTim/
[15:14:57] Or there's some logic error causing that to not get set in the right place
[15:15:34] https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances says wmcs-prepare-cinder-volume 'can not be used to reattach formatted volumes or move volumes between instances.', are you saying that's not true?
[15:17:44] Yes, it should do a perfectly good job of mounting and setting up fstab for existing volumes
[15:20:33] When I run in interactive mode it says...
[15:20:37] https://www.irccloud.com/pastebin/9dZuc8Aj/
[15:20:45] which is correct...
[15:20:51] is that with the volume mounted already?
[15:21:24] no
[15:21:39] that's if I remove the existing fstab entry
[15:21:50] does it then get formatted if you tell it to mount it?
[15:22:20] no
[15:22:29] but also right now it's failing with '/srv/toolsbeta: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.' which I've never seen before
[15:22:44] I mean, the mount is failing
[15:22:48] hmm
[15:22:50] which at least tells us it isn't formatting
[15:23:18] maybe let's start from trying to restore the data from the backups, let's focus on why it disappeared after that
[15:23:41] ok
[15:23:45] I'll do the restore
[15:23:51] ok
[15:54:07] taavi: the restore is done and the volume is mounted on nfs-3
[15:54:26] So maybe we should just go forward with getting that working, and then set up a test case to see if we can reproduce the format
[15:55:46] oh, the k8s db operators is by Jerome Petazzoni, he was one of the main docker devels iirc
[15:55:49] hm, no service IP though
[15:56:22] andrewbogott: toolsbeta-nfs-3 looks fine to me on a first glance
[15:57:18] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014543/
[15:58:26] and https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1014544/
[15:58:56] so someplace there's an orphaned IP which we need to move to the new host
[15:59:20] I see 172.16.1.238/32 in use on toolsbeta-nfs-3, is that not it?
[15:59:50] oh, maybe so
[16:00:16] I was expecting to see it in horizon but I guess that only shows floating IPs
[16:00:29] the few VMs I spot-checked seem to have recovered fine
[16:00:30] so other than the scary mystery I guess we're done...
[16:00:56] yeah
[16:01:04] I'm glad you're doing this to discover issues that I haven't seen... but I also wish I knew how to reproduce what you saw
[16:01:30] the only theory I have is that it takes a second for the OS to notice the existing filesystem, and the cookbook ran wmcs-prepare-cinder-volume before that happened
[16:01:51] or there's some change which makes lsblk in bookworm not recognize a file system from buster
[16:02:16] I guess those safety patches you put in will prevent destruction going forward
[16:02:26] So maybe that's the best we can do for now
[16:03:07] I added a warning https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Create_an_NFS_server#NFS_service_failover
[16:03:15] did you find the backup restore docs?
[16:03:45] I didn't but I made a note to write some.
[16:04:01] It's 95% the same as instance restore
[16:06:07] The buster list is getting shorter... slowly
[16:06:56] the toolforge list is getting at a stage where most of that is no longer simple instance replacements
[16:07:42] arturo: do you have opinions about https://phabricator.wikimedia.org/T311914#9654353?
[16:08:44] taavi: not a strong one. In the past I've considered just dropping aptly and using the main wikiland reprepro directly
[16:09:12] on the other hand aptly feels way simpler than reprepro
[16:09:26] if you have a strong opinion, you should implement it :-P
[16:10:17] taavi: I can probably do the etcd nodes unless you've already started.
[16:10:43] unless there's a reason they're buster?
[16:10:49] so taavi maybe my strong opinion is: let's try reusing the wikiland reprepro?
[16:11:31] dcaro: yep, I went to that talk because I remembered seeing a talk by Jerome a few years ago that was quite good
[16:11:53] my random input is that we got aptly because it was the thing Yuvi picked, probably because he found some docs that worked for setting it up. We certainly didn't have any DDs on the team at the time (and probably committed many sins against debs in general)
[16:12:06] I don't think he's been involved with Docker recently, but he was one of the original devs
[16:12:54] taavi: I'm about to head out for the day (on what is almost certainly a fool's errand). Anything else I can do/say before I go?
[16:13:31] andrewbogott: the rest of my nfs cookbooks could use reviews, but that's not urgent
[16:14:39] bd808: ok, if our DD says I can migrate to reprepro then I'll do it :-)
[16:15:02] i do want to keep a separate toolforge server where upload rights are not tied to global root, though
[16:15:02] ok
[16:16:07] taavi: if builds are uploaded via CI, that would no longer matter?
[16:17:13] In that case I suspect what matters is who triggers that build, but I can ask to be sure
[16:17:45] the gatekeeping would be merging to main and triggering the CI build
[16:47:04] how do I tell pre-commit to not complain about the whitespace in https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/15 which markdown uses to indicate a forced line break without a paragraph break?
[17:18:03] taavi: adding a `--markdown-linebreak-ext md` argument to the check might do it? -- https://github.com/pre-commit/pre-commit-hooks/blob/main/pre_commit_hooks/trailing_whitespace_fixer.py#L52
[17:34:46] * arturo offline
[17:35:09] bd808: it did, thanks!
[17:38:17] "Read the Source!" in ObiWan's voice ;)
[17:38:40] Such a horrible nerd answer though
[17:38:40] I also now see that mentioned on https://github.com/pre-commit/pre-commit-hooks?tab=readme-ov-file#trailing-whitespace :-)
[17:40:42] I should probably run around and add that bit of config to the pre-commit things I've been setting up. I wonder if there is a reasonable way to make a "bundle" of pre-commit config to include in projects?
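For context on why the `--markdown-linebreak-ext md` suggestion above helped: with that option the trailing-whitespace check treats two trailing spaces on a Markdown line as a hard line break and leaves them in place. The following is a simplified illustration of that behaviour, not the hook's actual source; the function name `strip_trailing_whitespace` and its `is_markdown` flag are hypothetical.

```python
def strip_trailing_whitespace(line: str, is_markdown: bool) -> str:
    """Strip trailing whitespace, keeping a Markdown hard line break.

    Simplified sketch of the behaviour described above; assumes the caller
    has already decided whether the file counts as Markdown.
    """
    # Split off the line ending so it can be restored untouched.
    if line.endswith("\r\n"):
        body, eol = line[:-2], "\r\n"
    elif line.endswith("\n"):
        body, eol = line[:-1], "\n"
    else:
        body, eol = line, ""

    # Two or more trailing spaces on a non-blank Markdown line mean
    # "forced line break": normalise them to exactly two spaces.
    if is_markdown and body.strip() and body.endswith("  "):
        return body[:-2].rstrip() + "  " + eol

    # Everything else loses its trailing whitespace.
    return body.rstrip() + eol
```

With `is_markdown=False` the same line is stripped completely, which is the complaint pre-commit raised before the argument was added.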
[17:45:38] btw, you might be interested in this: https://phabricator.wikimedia.org/P58920 -- I'm going through things that were primarily authored by me and relicensing things as (A)GPL
[17:54:50] I really want to like the AGPL, but the lack of clarity on how it would deal with embargoed security patches has kept me away from it historically.
[18:06:15] * bd808 lunch