[00:00:17] the hacks are to the wmf.4 files, so yeah they won't interfere with a wmf.3 rollback at all.
[00:00:50] right, sorry, lots of meetings today have made my brain into squashed peas :)
[00:00:52] thanks
[00:01:28] now I will update the ticket with the options and my squashed peas brain.
[00:01:39] what could go wrong
[00:01:58] the worst-case breakage does mess with SAL logging so that might affect urgency estimates
[00:03:23] * bd808 takes off because he missed his afternoon dog walk and the sun is going down
[11:18:07] * taavi paged
[11:27:14] again toolsdb :(
[11:27:54] yeah :(
[11:31:53] I'm trying to see if I can find something in the slow query logs
[14:41:22] andrewbogott: do we have any docs on draining and undraining a cloudvirt? I'm not finding any
[14:43:09] no but we have a cookbook which is better :)
[14:43:47] nice, I just found it
[14:44:08] will it start scheduling things automatically after a reimage?
[14:45:43] no, you'll need to undrain (which I think there's also a cookbook for?)
[14:46:08] dhinus: in case you want reading homework, here is how pooling/depooling works: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Host_aggregates
[14:46:23] The cookbook removes the host from the pool before draining
[14:46:42] I don't think there's an undrain cookbook :/
[14:46:51] OK, I'll look
[14:47:12] cloudvirt1025 is the lowest-numbered and also happens to have only 1 running VM, so looks like a good candidate?
[14:47:19] There are 6 hosts that use local storage instead; they need to be handled differently. They are cloudvirt-wdqsxxxx and cloudvirtlocalxxxx
[14:47:53] hm one vm is suspicious, let's see if it's due for decom https://docs.google.com/spreadsheets/d/1FkBT8BJfN7t0r9ZhE1NLGheNJ5KCRQNfljQicb8CW30/edit#gid=38509438
[14:48:10] it's a canary, indeed
[14:48:49] looks like it's due to be decom'd this year but I don't know for sure if the replacements are installed yet
[14:48:57] so yeah, good for practice in any case!
[14:49:35] I'll try running the drain cookbook on that one
[14:49:52] I bet wmcs.openstack.cloudvirt.safe_reboot depools, drains, reboots, repools. Let's look
[14:50:27] "This includes putting in maintenance, draining, and unsetting maintenance."
[14:50:34] although I guess reboot isn't quite what you want
[14:51:17] dhinus: ok, I think you can use 'drain' and then 'wmcs.openstack.cloudvirt.unset_maintenance' after the reimage.
[14:51:55] and 'wmcs.openstack.cloudvirt.lib.ensure_canary' to get canaries on the newly reimaged hosts
[14:52:11] but it will make your life easier if you manually delete each canary before reimaging, otherwise it'll be orphaned in the openstack DB
[14:52:23] (starting to sound like we need docs, doesn't it?)
[14:52:53] yes :D
[14:53:21] I'm thinking maybe we could have a new page "reimaging OpenStack" or "upgrading Debian on OpenStack"
[14:53:29] with a section for each type
[14:53:38] (cloudcontrol, cloudvirt, etc.)
[14:55:58] yeah, I guess we need that. Although in theory the process will be 100% different by the next Debian release
[14:56:38] haha true, but at least we know what we did last time :)
[14:56:46] and it could be useful for a one-off reimage
[14:56:49] without a version upgrade
[14:56:50] yeah
[14:57:02] or even adding new hosts to the cluster
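A rough sketch of the drain / reimage / repool sequence discussed above, as it might be run from a cumin host. Only the cookbook names wmcs.openstack.cloudvirt.unset_maintenance and wmcs.openstack.cloudvirt.lib.ensure_canary are taken from the conversation; the drain cookbook name, the reimage invocation, and every flag shown are assumptions to verify against each cookbook's --help before use.

    # Assumed per-host sequence; check cookbook names and flags with --help first.
    HOST=cloudvirt1025
    sudo cookbook wmcs.openstack.cloudvirt.drain --fqdn "${HOST}.eqiad.wmnet"             # depool + migrate VMs off (name/flag assumed)
    # delete the host's canary VM (Horizon or `openstack server delete`) so it isn't orphaned
    sudo cookbook sre.hosts.reimage --os bookworm "${HOST}"                               # reimage to bookworm (flags assumed)
    sudo cookbook wmcs.openstack.cloudvirt.unset_maintenance --fqdn "${HOST}.eqiad.wmnet" # repool the host (flag assumed)
    sudo cookbook wmcs.openstack.cloudvirt.lib.ensure_canary                              # recreate canary VMs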
[14:57:41] so the cookbook worked, the canary is still there, should I delete it from horizon?
[15:00:51] yep
[15:01:06] I suppose it's pretty quick to drain a host that doesn't have anything to drain
[15:01:40] hmmm it's gone from horizon, but still present in "virsh list"
[15:01:44] Are you planning to write the first draft of the docs or hoping I'll do that? As of today you have the most recent experience :p
[15:01:51] I can do it yeah
[15:01:58] might take a minute for virsh to catch up
[15:02:07] and also doesn't matter since... you're about to clobber the heck out of virsh
[15:02:10] I will start that doc later today or tomorrow
[15:02:32] true, I was just curious to understand why they show different things
[15:02:42] virsh says "running"
[15:02:49] yeah, I'm surprised too
[15:03:01] maybe see what /var/log/nova/nova-compute.log says it thinks is happening
[15:03:51] 0 bytes :D
[15:04:36] hm
[15:04:39] 1025?
[15:04:47] yes
[15:05:16] I'm restarting it to see what it thinks
[15:05:20] good idea
[15:06:28] # openstack server list --all-projects --host cloudvirt1025
[15:06:28] +--------------------------------------+--------------+--------+----------------------------------------+----------------------------------------------+-----------------------+
[15:06:28] | ID                                   | Name         | Status | Networks                               | Image                                        | Flavor                |
[15:06:28] +--------------------------------------+--------------+--------+----------------------------------------+----------------------------------------------+-----------------------+
[15:06:28] | 2a1c22ac-6ed1-458b-a59b-26ad8e320021 | canary1025-2 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.253 | debian-11.0-bullseye (deprecated 2023-01-12) | g3.cores1.ram1.disk20 |
[15:06:29] +--------------------------------------+--------------+--------+----------------------------------------+----------------------------------------------+-----------------------+
[15:06:45] I think the log was empty because nothing had happened since the last rotation
[15:06:58] oh wait it's still there in horizon
[15:07:02] so my delete did not work
[15:07:07] So maybe something is busted that prevented deletion?
[15:07:21] (I was confused by horizon having multiple pages as always)
[15:07:48] I guess try deleting it again?
[15:08:00] and we'll see if it gives an error message or something
[15:08:02] trying
[15:08:12] "Scheduled deletion of Instance: canary1025-2"
[15:08:18] nova-compute got the deletion message, disconnected the networking
[15:08:38] second time lucky? :)
[15:08:39] so... it worked that time and we've learned nothing
[15:08:52] but there will be a lot more chances to test
[15:09:07] yes :P
[15:09:25] ok, I'm going to eat breakfast, back in a few
[15:09:27] so next step is reimage right?
[15:09:50] Once you're convinced the process works you can do 2-3 cloudvirts at once, we should have the space for that much shuffling.
[15:09:57] yep makes sense
[15:10:57] I will start the reimage and also go offline for an errand... I will work til later to make up the time :)
[16:29:54] taavi: I'm going to miss and/or be late to our checkin, will follow up with you in ~20
[16:35:35] andrewbogott: ok, just ping me when you're available
[16:49:30] I'm going to go eat something, be back later
[17:48:20] so I'm back and the reimage of cloudvirt1025 has completed, with a warning on "ensure kvm processes are running"
[17:48:50] dhinus: that should go away once you run the canary vm creation cookbook
[17:49:06] I'll try that now!
[17:52:15] canary created, and the alert is gone
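For reference, the checks traded above, roughly as they would be typed. The openstack commands need admin credentials, so running them from a cloudcontrol is an assumption; the instance ID is the canary from the listing above.

    # From a host with OpenStack admin credentials (assumed to be a cloudcontrol):
    openstack server list --all-projects --host cloudvirt1025        # what nova thinks is running on the host
    openstack server delete 2a1c22ac-6ed1-458b-a59b-26ad8e320021     # delete the canary by ID instead of through Horizon

    # On the cloudvirt itself:
    sudo virsh list --all                                            # what libvirt is actually running
    sudo tail -n 50 /var/log/nova/nova-compute.log                   # see what nova-compute thinks is happening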
[17:53:49] looking at the alerts, I noticed there's also a wikireplica lag alert, does anybody know something about it?
[18:01:49] dhinus: That was discussed this morning in -data-persistence, I think it's now recovering
[18:02:04] Also the clouddb puppet alerts are due to work in progress that Ben is doing, I'm about to ack
[18:03:05] thanks
[18:03:49] back to cloudvirts, I did some cumin-foo to understand the status of the cloudvirts, and there are 6 hosts with only 1 VM (I assume the canary), and 2 hosts with only 2. all others have more than 10
[18:04:44] is that expected? I imagined VMs would distribute more or less evenly
[18:04:57] How many of the weirdos are cloudvirtlocal or cloudvirt-wdqs?
[18:05:46] (Also, you can check the host aggregate settings to see which things are depooled, might be some are depooled by accident. Try the first command on https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Host_aggregates )
[18:07:25] (6) cloudvirt[1025-1027,1051,1059-1060].eqiad.wmnet
[18:07:40] these are the ones where virsh list only lists one VM
[18:07:43] ok
[18:07:52] so I would guess that I left 1025-1027 drained because they're due for decom
[18:08:25] and I also usually keep a couple of high-number virts empty so we're guaranteed lifeboats if a different cloudvirt crashes (although that's less important now that we have ceph)
[18:08:49] I'm listing aggregates in the meantime
[18:08:51] So I expect that's all you're seeing. Let's leave 1026-1027 empty and decom them rather than upgrading, unless you're enjoying having test hosts.
[18:09:23] I guess I can reimage them in parallel to others, as I don't need to drain
[18:09:45] yeah, up to you. I'm also happy to write decom tickets this afternoon.
[18:09:51] how do I check if the other ones (1051, 1059, 1060) are actually "depooled" or if something is wrong?
[18:10:05] that's the host aggregate thing again
[18:10:29] that reminds me, 1051 crashed a while ago and we never properly followed up, https://phabricator.wikimedia.org/T349109
[18:10:51] oh yeah, I was about to link that
[18:10:55] I expect that's why it's depooled
[18:11:11] right, and I see that the ones with a "spare" aggregate are 1057, 1059, 1060, 1061
[18:11:21] hm
[18:11:24] this is imperfect
[18:11:39] if something is marked as 'spare' and also 'ceph' then that's silly and it should be removed from 'spare'
[18:11:58] I don't see any with both
[18:12:03] ok
[18:12:34] In any case we def don't need four empty reserve cloudvirts
[18:12:49] dhinus: am I answering your question or just making this worse?
[18:13:05] no it's much clearer now
[18:13:33] remaining question is why 1057 and 1061 are "spare" but have 2 VMs instead of 1 :D
[18:14:24] 1057 looks like a cookbook misfire, two canaries
[18:14:44] 1061 contains 'mwoffliner4' which iirc is a pain to migrate
[18:14:50] but I guess we're going to need to do it :)
[18:14:52] ouch
[18:15:14] I knew there would be some weird edge case :D
[18:15:25] It's just because it's too big and too busy to sync in a small amount of time. We can adjust the migration algorithm to be more tolerant of sync issues I think, or stop the service on it briefly
[18:15:34] We should probably figure out how to make it no longer an edge case
[18:16:53] is there a way to print the aggregate from the cloudvirt itself? or is it only possible from the cloudcontrol?
[18:17:10] * dhinus was thinking of some cumin to group the aggregates more nicely
[18:17:10] cloudvirts don't know, it's only the scheduler that knows
[18:17:26] makes sense
[18:17:31] although we could install novaobserver creds and novaclient on the cloudvirts
[18:17:38] nah not worth it
[18:17:39] and then do api calls from the cloudvirts if that helps
[18:17:50] we can write a cookbook that lists the aggregates nicely
[18:17:54] sure
[18:18:13] I was just annoyed that the for loop gives a long and not very readable list
[18:20:28] you can also print a given aggregate
[18:20:40] but 'what aggregates is this host in' is not as simple to answer
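A minimal sketch of the "list the aggregates nicely" idea using only the stock openstack CLI, assuming admin credentials on a cloudcontrol and aggregate names without spaces; filtering the output also answers "what aggregates is this host in" from the other direction.

    # Print each aggregate followed by its member hosts, one aggregate per line.
    # Pipe the output through `grep cloudvirt1059` to see which aggregates a given host is in.
    for agg in $(openstack aggregate list -f value -c Name); do
        echo "${agg}: $(openstack aggregate show "${agg}" -f value -c hosts)"
    done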
[18:36:13] I recapped the situation in T345811
[18:36:20] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[18:36:45] if you agree, I think it's safe to reimage cloudvirt[1025-1027,1051,1059-1060].eqiad.wmnet all at once
[18:36:51] after deleting the canaries from horizon
[18:37:04] yep, agree
[18:37:29] well and 1057 as well, because it's the one with two canaries, but nothing to worry about
[18:37:40] 1061 needs special treatment because of mwoffliner4
[18:37:53] At least historically some cloudvirts have required me to hit enter on the console during the partman phase. You might get lucky but if something seems like it's hanging that's the first thing to check.
[18:39:40] thanks, the first one did not require any intervention but good to know
[18:43:39] any tip on running many reimages at once? do you just create multiple tmux sessions or is there a smarter way?
[18:48:15] dhinus: I just do it with multiple sessions yeah, not aware of any other way to approach it
[18:49:27] I'll send a feature request to volans :P
[19:01:28] I managed to create 6 windows in the same tmux session
[19:03:24] https://usercontent.irccloud-cdn.com/file/sfhejeYc/reimage_6x.png
[19:05:12] hm, better eyesight than me
[19:06:51] hahah not ideal but at least I can detach and reattach quickly to all of them :D
[19:08:52] I'll be back in a while to check they have completed, and run the ensure_canary cookbook
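One way to script the "six windows in one tmux session" approach, as a sketch: the host list is the one from the discussion, but the reimage command line inside the loop is an assumption and should be replaced with whatever you actually run.

    # Start a detached session and open one window per reimage, so everything
    # can be detached and reattached together.
    tmux new-session -d -s reimages
    for host in cloudvirt1025 cloudvirt1026 cloudvirt1027 cloudvirt1051 cloudvirt1059 cloudvirt1060; do
        # the reimage invocation below is assumed -- substitute the real command and flags
        tmux new-window -t reimages -n "${host}" "sudo cookbook sre.hosts.reimage --os bookworm ${host}"
    done
    tmux attach -t reimages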
[19:13:45] dhinus: i'm getting paged for lack of canary VMs, please silence them in the future
[19:18:17] also how did those escalate to me? shouldn't those have paged US folks first?
[19:20:06] they did page me, sorry, I should've ack'd
[19:20:17] I think this is still the problem where cookbooks can't actually downtime icinga
[19:21:32] hmm. reimage is run from cumin, not from cloudcumin, so I don't think that's it
[19:21:40] oh...
[19:22:04] yeah, I suppose it's that after the host is done reimaging the cookbook re-enables alerts. But of course the reimage cookbook doesn't manage canaries.
[19:22:29] So it's hard to silence this ahead of time because reimage (I think) resets all the alerts.
[19:22:53] We could just make that kvm alert non-paging, although I added it years ago in response to a real problem
[19:24:58] * bd808 lunch
[19:28:14] taavi: sorry, I didn't think they would page, as I triggered the same alert earlier when I reimaged 1025. andrewbogott did you get a page a few hours ago?
[19:28:42] I think so
[19:29:02] but I think you created the canary before it escalated so I was sort of assuming that would happen again
[19:30:23] I think the new set of alerts was caused by me deleting the canaries in horizon, and not by the cookbook
[19:30:49] the reimage cookbook probably resolved the alerts because it adds a downtime?
[19:32:07] the correct procedure is probably: 1. set a 30-minute downtime 2. delete the canary in horizon 3. start the reimage cookbook
[19:32:29] I think it's after the host comes back up that the alerts fire? But I'm not sure. alerts were at 12:42
[19:33:05] I think the cookbook sees there is a warning in Icinga and does not remove the downtime
[19:33:08] And you started the reimage at... oh you're right
[19:33:12] but it's a 2-hour downtime so it eventually expires
[19:34:56] so in the procedure, you also need step 4. run the ensure_canary cookbook (within 2 hours from the start of the reimage)
[19:35:15] I will test this procedure on the next cloudvirt I reimage
[19:35:33] Frustrating because it means you can't start things at bedtime
[19:36:29] now I see two new alerts that I don't understand, only on cloudvirt1060
[19:37:20] icinga or alert manager?
[19:37:21] oh the downtime failed only for that host :/
[19:37:28] icinga + alert manager
[19:37:31] END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[19:38:21] I acked
[19:38:48] not clear why it failed to downtime
[19:40:12] and it looks like the reimage cookbook downtimes the host _twice_, before and after the first reboot
[19:55:35] all 6 reimages have completed. I'm now running the ensure_canary cookbook
[19:57:45] I was hoping I could run it on all cloudvirts and it would only create the 6 that were missing... but it looks like it wants to recreate some of the existing ones too
[19:59:10] INFO: cloudvirtlocal1001 has changes: Would delete canarylocal1001-1 ; Would create a new VM ;
[20:00:55] Rook: do you by chance have a trove instance in the 'superset' project that you can't delete?
[20:01:10] I just figured out how to solve the same issue in a different project, looks like the last straggler might be there
[20:01:38] dhinus: I've never found that canary cookbook to work the way I was expecting but I also haven't tried to fix it
[20:02:07] but it's harmless to rotate canaries as long as we don't get multiples
[20:02:18] yes I'll let it complete, it's not too slow
[20:13:15] the cookbook completed and all alerts have cleared!
[20:13:58] I'll call it a day :)
[20:17:12] * dhinus off
[20:49:27] andrewbogott: probably! Indeed it shouldn't be using trove any longer, though it still shows 3 instances. Which I lack the courage to delete on an afternoon before a holiday.
[20:49:37] For the greater mystery, why didn't irc beep at me...
[21:52:45] Rook: I'll just delete the one that's obviously broken, will leave the working ones for you to delete on a Monday
[21:54:00] Sounds good
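Pulling together the per-cloudvirt alert-handling procedure sketched above (19:32-19:34), to sit around the drain/repool steps sketched earlier. Only the step order, the 30-minute downtime, and the "within 2 hours" timing come from the conversation; the sre.hosts.downtime and sre.hosts.reimage flags shown are assumptions to double-check with --help.

    # 1. downtime the host so the canary alert doesn't page (flags assumed)
    sudo cookbook sre.hosts.downtime --minutes 30 -r "pre-reimage canary deletion" cloudvirt1059.eqiad.wmnet
    # 2. delete the host's canary VM (Horizon, or `openstack server delete <id>`)
    # 3. start the reimage (flags assumed); the reimage cookbook adds its own ~2-hour downtime
    sudo cookbook sre.hosts.reimage --os bookworm cloudvirt1059
    # 4. recreate the canary within 2 hours of starting the reimage, before that downtime expires
    sudo cookbook wmcs.openstack.cloudvirt.lib.ensure_canary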