[10:41:31] taavi: I was planning to start upgrading openstack eqiad to bookworm today. the list of hosts is here T345811
[10:41:31] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[10:43:24] dhinus: ok, do you need something specific from me or is that just a general heads up in case something goes wrong and I get paged?
[10:43:46] general heads up because you told me to let you know before starting :)
[10:44:15] I don't remember if there was anything special you wanted to check or do
[10:44:18] ah
[10:44:20] me neither
[10:44:23] :D
[10:44:25] which hosts do you plan to start with?
[10:44:51] good question: I think cloudcontrol should be a safe choice? and cloudvirts are the only ones that need draining?
[10:46:42] hmm, maybe. didn't galera have some clustering issues when you upgraded codfw1dev?
[10:50:48] * dhinus checks notes
[10:58:03] one galera issue was fixed by this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/955841/
[10:58:57] but there were some clustering issues https://phabricator.wikimedia.org/T345810#9153935
[10:59:50] so I should upgrade mariadb before reimaging
[11:00:30] I might wait for a.ndrew to be online as he did that in codfw
[13:08:24] dhinus, he isn't expected to be around much this week, but you can try pinging him directly
[13:10:45] thanks balloons I will ping him :)
[14:09:33] I'm here for a bit!
[14:10:54] dhinus: I don't remember a ton beyond my notes on that task but my recollection is that you can just do 'apt install mariadb=' and it pretty much takes care of itself.
[14:11:09] Of course you always need to do one node at a time and let things settle down afterwards.
[14:13:36] thanks!
[14:14:46] I'm trying to move some appointments around so might be around for a while if I'm lucky. If I'm less lucky, only here for about an hour.
[14:40:32] ok, great, I'm now unscheduled until 3:30pm my time (which is well after you're done for the day)
[14:43:59] is it possible to upgrade a trove instance to a newer datastore version?
[14:46:35] nvm, just a restart seemed to be enough to work around that particular mariadb bug
[14:52:01] taavi: for future reference: in theory it's possible because Trove will just reattach the cinder volume to an upgraded db instance/container. Largely untested though.
[14:52:14] I didn't find an option in the UI for that
[14:53:02] I might have it disabled or it might be cli only... I was starting to sort that out on T349651 but now I probably need to set that aside for a bit
[14:53:02] T349651: Support Trove + Swift integration - https://phabricator.wikimedia.org/T349651
[14:53:29] Well, I guess I think of that task as a generic 'see what else trove can do'. I don't know that swift is actually involved in upgrades.
[15:16:13] Is there anything in openstack that would allow for k8s to have a ReadWriteMany volume? If I'm reading https://docs.openstack.org/cinder/zed/configuration/block-storage/block-storage-overview.html correctly cinder won't do this. Perhaps there is something else?
[15:18:30] In theory cinder supports 'multiattach' which I think has one writer and multiple readers (or maybe just multiple read-only connections? I can't remember). I spent a while trying to get it working and failed but I still believe it's possible.
[15:18:59] If ReadWriteMany is a specific different thing then that would probably involve ceph trickery and bypassing cinder entirely.
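The "one node at a time and let things settle down" advice above boils down to checking Galera's wsrep status between nodes. A minimal sketch, assuming shell access to a cloudcontrol host and a root-authenticated mysql client; these are standard Galera status variables, not commands taken from the log:

    # After upgrading mariadb on one node, confirm the cluster has re-synced
    # before moving on to the next node.
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_%';" \
        | grep -E 'wsrep_cluster_size|wsrep_cluster_status|wsrep_local_state_comment'
    # Healthy output for a three-node cluster: wsrep_cluster_size = 3,
    # wsrep_cluster_status = Primary, wsrep_local_state_comment = Synced.

Once every node reports Synced and the expected cluster size, it should be safe to continue the rolling upgrade.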
[15:21:34] hmm I'm trying to figure out the right mariadb-server package to install in bullseye
[15:22:08] andrewbogott do you remember which one you used? did you copy the bookworm .deb?
[15:23:45] I kind of think it was already available. Let's see...
[15:25:21] Hm, nope. I'm looking to see if I kept notes anyplace
[15:27:31] sorry, I should clarify: I definitely installed the same version as on bookworm. I'm trying to remember where I got the package from.
[15:27:51] probably I just temporarily added the bookworm repo
[15:30:26] btw dhinus the call is happening after all
[16:39:33] I'm attempting the in-place mariadb-server upgrade. andrewbogott are you around in case anything breaks?
[16:39:46] I am
[16:41:33] taavi: easy review https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/969817/2
[16:42:05] done
[16:42:08] thx
[16:46:49] hmmm I added the bookworm line to sources.list, but apt install mariadb-server is saying it's already on the latest version
[16:47:06] I tried the same commands on a vanilla bullseye docker image and it worked as expected
[16:47:19] so I'm not sure why it's not working on cloudcontrol1007
[16:47:45] did you remember to apt update?
[16:47:47] yes
[16:47:55] which host?
[16:47:56] maybe it's because there's the extra thirdparty repo
[16:48:01] cloudcontrol1007
[16:48:58] yeah, seems like the third-party repo is taking priority, https://phabricator.wikimedia.org/P53066
[16:49:07] I think you can try deleting /etc/apt/preferences.d/apt_pin_openstack_db_galera.pref to remove the pin on that third-party repo
[16:49:12] right because of the pinning
[16:50:14] removed that file, not enough
[16:51:04] that's odd
[16:56:16] maybe cat /etc/apt/preferences.d/wikimedia.pref ?
[16:56:55] I didn't mean to include the "cat" :P
[16:57:07] oh, that'd do it indeed
[16:57:07] dhinus: remind me how to prevent 'ssh: connect to host alert1001.wikimedia.org port 22: Connection timed out' on cloudcumins? Some local code hack right?
[16:57:32] yes, I'm currently running those cookbooks from my laptop to avoid the issue
[16:57:34] dhinus: can you just do mariadb-server= and force it?
[16:57:48] dhinus: supposing I don't want to run it on my laptop...
[16:57:55] you could comment out the alert code in cloudcumin, but then you won't get the downtime
[16:58:07] hmmmm
[16:58:09] at the moment there's no way to downtime from cloudcumins, there's an open task
[16:58:14] maybe worth it
[16:58:21] where's the alert code?
[16:58:31] you could downtime manually from icinga/alertmanager to avoid alerts and pages
[16:59:06] the alert code is a line "downtime_something" in the cookbook itself, usually at the start of the Runner
[16:59:12] ok
[16:59:19] and a corresponding uptime_something at the end
[17:03:31] using "apt install mariadb-server=1:10.11.4-1~deb12u1" fails with "unmet dependencies"
[17:04:33] hm
[17:06:07] removing "/etc/apt/preferences.d/wikimedia.pref" works, but then it wants to update too many things
[17:07:56] or maybe not that much: "20 upgraded, 8 newly installed, 4 to remove and 840 not upgraded."
[17:08:04] That sounds right to me
[17:08:06] but I'm worried it wants to update binutils as well
[17:08:17] and libc-bin
[17:08:37] I could easily believe that's a dep for maria though
[17:08:41] isn't it written in c?
[17:10:22] in my docker tests, binutils is not upgraded
[17:11:21] and it only upgrades 8 packages in total
[17:11:31] but it's a clean bullseye without any extra package
[17:11:41] did you add the repo for debian's bookworm or for the wmf bookworm repo?
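Pieced together from the exchange above, the in-place attempt on cloudcontrol1007 amounted to roughly the following. The repository line and the file it goes into are assumptions (the log only says "I added the bookworm line to sources.list"), and the attempt was reverted the same day, so treat this as a sketch of what was tried rather than a working recipe:

    # Assumption: point apt at the WMF bookworm suite (exact line not in the log).
    echo 'deb http://apt.wikimedia.org/wikimedia bookworm-wikimedia main' \
        > /etc/apt/sources.list.d/bookworm-wikimedia.list
    # Drop the pins that keep mariadb-server on the bullseye/third-party version.
    rm /etc/apt/preferences.d/apt_pin_openstack_db_galera.pref
    rm /etc/apt/preferences.d/wikimedia.pref   # caution: this unpins far more than mariadb
    apt update
    # Install the exact version shipped on the bookworm hosts.
    apt install mariadb-server=1:10.11.4-1~deb12u1

The catch, as the log notes, is that removing wikimedia.pref also lets apt pull in unrelated packages (binutils, libc-bin, ...), which is why the change was rolled back rather than applied.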
[17:11:52] wmf
[17:12:04] literally the same exact commands
[17:12:45] but cloudcontrol1007 has many extra packages installed compared to the docker test, so it makes sense that the output is different
[17:13:44] I think it's OK to risk it on one host and then see what you get.
[17:14:17] are you doing 'apt get install' or 'apt get upgrade'?
[17:14:56] "apt update && apt install mariadb-server"
[17:16:24] in that case I definitely think you should risk it :)
[17:16:47] I have to log off shortly so I will only do it if you can continue working on this :P
[17:17:06] otherwise I can continue tomorrow morning
[17:17:16] hm... maybe wait and do it first thing tomorrow your time, so you can move ahead with the reimaging and get things into a consistent state.
[17:17:23] makes sense
[17:17:33] and worst case we can debug it when you wake up :P
[17:17:36] sorry I didn't take better notes on this when I did it in dallas :(
[17:17:39] yep!
[17:18:02] no prob and thanks, it's already a good thing to know the general plan!
[17:21:38] I wrote an update here https://phabricator.wikimedia.org/T345811#9292480
[17:22:21] * andrewbogott going afk while ceph rebalances
[17:25:53] I reverted the changes on cloudcontrol1007 and enabled puppet
[17:27:26] run-puppet-agent passed with no significant changes
[17:27:42] * dhinus off for today
[17:42:44] tools-db has gone OOM twice now, separated by 24 hours.
[17:43:00] I restarted it and set it r/w but this needs investigation to figure out what's ruining it every day
[17:44:17] I'll open a bug when I'm back
[21:00:45] there's a task already, I started to investigate today but I didn't find much
[21:01:05] T349695
[21:01:06] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[21:02:09] I tried to spot some patterns in the Grafana dashboard but I couldn't find any
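For the tools-db OOM kills (T349695), a hypothetical first-pass triage, not what was actually run per the log, would be to confirm the kernel OOM killer is targeting the MariaDB process and to compare its configured buffer pool with the host's available memory:

    # Confirm the OOM killer hit mariadb and when.
    journalctl -k --since "2 days ago" | grep -iE 'out of memory|oom'
    # Compare the InnoDB buffer pool size with what the host actually has.
    sudo mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;"
    free -h

If the buffer pool plus per-connection memory leaves little headroom, a daily batch job or a spike in connections could plausibly push the process over the limit; correlating the kill times with the Grafana connection and memory graphs would be the next step.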