[08:02:21] dcaro: I'm starting the harbor upgrade, will ping you if I need help!
[08:02:43] blancadesal: 👍
[08:04:10] dcaro: do we need to do anything additional in terms of backups beyond what we did on toolsbeta?
[08:04:45] No, I think we are good
[08:09:57] dcaro: and we don't need to do the manual migration of harbor.yml right? puppet will do that when we merge the patch?
[08:10:34] correct, the manual steps are the ./prepare, and docker-compose down/up
[08:10:44] (well, and merging that patch before xd)
[08:15:15] but then do we even need to download the new release this time?
[08:20:36] dcaro: I don't think I can merge the patch myself
[08:20:47] oops, let me do that then
[08:21:36] thanks
[08:21:43] merged on prod, let me do a sync on the tools puppetmaster just to be sure
[08:21:58] how do you do that?
[08:22:15] (that is `root@tools-puppetmaster-02:~# systemctl start puppet-git-sync-upstream`)
[08:22:30] to force the tools puppetmaster to pull the merged changes
[08:22:58] looks ok
[08:23:00] https://www.irccloud.com/pastebin/kJbdM52v/
[08:23:04] 👍
[08:23:41] cool. now I need to force a puppet run on the harbor instance?
[08:25:13] yep
[08:31:06] blancadesal: I got a message asking me to approve a mail from you to cloud-announce
[08:31:22] dcaro: please do :)
[08:32:37] can you remove the moderation thingy altogether?
[08:32:59] blancadesal: I think next messages will be auto-accepted (if I configured it right xd), btw. you were an owner of the list already
[08:33:16] interesting!
[08:39:01] up and running? https://usercontent.irccloud-cdn.com/file/QzsiwRhZ/image.png
[08:42:21] looks ok, yes
[08:42:44] 🎉
[08:43:58] running a build to test now
[08:44:37] https://www.irccloud.com/pastebin/yWLFGIKe/
[08:44:55] should we be worried about this?
[08:48:53] I don't think so, but we can look at it
[08:50:29] building/pulling went ok
[08:51:52] which command gave you this, docker-compose?
[08:52:14] yes
[08:53:38] that's ok yes
[08:53:48] you can still see logs in docker logs
[08:54:28] oh, wait, there's two errors there
[08:55:12] `harbor-jobservice exited with code 2` and nginx also
[08:55:16] but they are running now
[08:56:50] where do you see this?
[08:57:28] in the paste you passed
[08:57:46] ah yes
[08:58:06] seems ok now
[08:58:24] hmm, there was a moment when the harbor-core had trouble getting to the database
[08:58:27] too many connections
[08:58:44] 2023-10-25T08:51:04Z [ERROR] [/lib/http/error.go:57]: {"errors":[{"code":"UNKNOWN","message":"unknown: failed to connect to `host=5ujoynvlt5c.svc.trove.eqiad1.wikimedia.cloud database=harbor`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))"}]}
[08:58:50] I think that happened in the past too
[08:59:56] should be ok as long as it does not happen often
[09:01:49] where are the logs?
[09:01:59] docker logs harbor-core
[09:02:22] all I get from there is the WARNING: no logs are available with the 'syslog' log driver message
[09:03:04] ah no, I was doing docker-compose logs
[09:03:09] docker logs works
[09:04:19] can we declare the upgrade done?
[09:06:23] I think so yes :)
[09:07:23] 👍
[09:09:50] congrats on the smooth upgrade!
[09:10:48] hello :) I have a bricked instance (can't ssh to it)
[09:11:56] I think the issue is `sssd-nss` started exiting with code 70/SOFTWARE then 3/NOTIMPLEMENTED
[09:12:07] and after a few retries systemd marked the unit as a failure
[09:12:34] dcaro: thanks for your help!
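For reference, the manual part of the upgrade discussed above boils down to roughly the following shell sketch. The puppetmaster sync, ./prepare and docker-compose steps are the ones named in the log; forcing the puppet run is shown with the stock `puppet agent --test` (the host may have its own wrapper), and $HARBOR_DIR is a placeholder since the install path is not given in the conversation.

    # on tools-puppetmaster-02: force a pull of the merged puppet change
    systemctl start puppet-git-sync-upstream

    # on the harbor instance: run puppet so it rewrites harbor.yml,
    # then regenerate the config and bounce the stack
    puppet agent --test
    cd "$HARBOR_DIR"        # placeholder, real install path not confirmed in the log
    ./prepare
    docker-compose down
    docker-compose up -d

    # check the result; with the syslog log driver, use `docker logs` per container
    # rather than `docker-compose logs`
    docker ps
    docker logs harbor-core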
[09:12:43] I'll update the docs now
[09:12:52] hashar: which VM is it?
[09:13:23] integration-agent-docker-1057.integration.eqiad1.wikimedia.cloud, I have restarted `sssd-nss.socket`
[09:13:46] that fixed it
[09:14:19] the big unknown is why the service bails out with exit code 70 then 3
[09:14:21] blancadesal: just added some sections and some info, I was adding more stuff too, but will wait for you to finish up
[09:14:51] hashar: let me take a quick look, but I have not seen that before
[09:15:05] I pasted a bunch of info on https://phabricator.wikimedia.org/T349681
[09:15:17] nice, I was going to ask you to open one xd
[09:17:33] I have posted one last comment with my fringe theory :]
[09:21:31] dcaro: I've reenabled puppet on toolsbeta-harbor
[09:21:37] dcaro: anyway I managed to recover access to the instance. Then maybe it is worth looking at what went wrong, I guess somehow LDAP failed and that caused that sssd thing to bail out
[09:22:50] hashar: I think that's correct, not sure why it started failing, I have seen ldap sometimes taking time to reply and making sssd fail, but it usually affects more than one VM
[09:23:20] Unable to register to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoReply]: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[09:23:22] there's an open issue https://github.com/SSSD/sssd/issues/6219 that seems similar
[09:25:23] dcaro: so we can blame it on some corner case upstream issue / cosmic ray?
[09:25:41] I guess a reboot of the instance would have fixed it (by restarting the sssd-nss service)
[09:25:47] so I am willing to mark it resolved
[09:25:47] :)
[09:26:31] I would not spend a lot of time on it yet, but if it happens again we should definitely investigate more
[09:27:19] I agree
[09:27:27] I have marked it fixed
[09:27:40] thank you!
[09:27:51] (and sorry for the interruption in the Harbor upgrade)
[09:39:22] dcaro: is there anything more we want to add to the harbor docs before declaring T349313 done?
[09:39:24] T349313: [tbs][harbor] Improve Harbor admin docs - https://phabricator.wikimedia.org/T349313
[09:40:28] I want to add a couple more things, like mention that it runs on VPS using a standard cloudvps proxy (managed in horizon), and some links to puppet code/harbor install page
[09:40:52] looking for reviews on: https://gerrit.wikimedia.org/r/c/operations/puppet/+/967875/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/968618/
[09:43:17] blancadesal: done :)
[09:44:27] much better than what we had :)
[10:35:50] toolsbeta harbor is crashing when trying to pull some images
[10:35:52] https://www.irccloud.com/pastebin/vSDOCXH5/
[10:38:58] hmm, our cookbooks repo seems to expect spicerack 8.x but cloudcumins are on 7.x
[10:40:43] oh, I did nothing and now it works (toolsbeta-harbor pull)
[10:41:32] and the normal cumin hosts are on spicerack 8.x. dhinus do you know how it's supposed to be kept up to date?
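A minimal sketch of the sssd-nss recovery hashar describes above, assuming standard systemd tooling on the affected VM. Only the socket restart is what was actually run; the surrounding inspection commands are an assumed way to see the exit-code 70/3 failures and to clear the failed state.

    # see why systemd gave up on the unit (exit code 70/SOFTWARE, then 3/NOTIMPLEMENTED)
    systemctl status sssd-nss.service sssd-nss.socket
    journalctl -u sssd-nss.service --since today

    # what actually fixed it in this case
    systemctl restart sssd-nss.socket

    # if systemd still reports the service as failed after the socket restart
    systemctl reset-failed sssd-nss.service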
[10:48:27] * dcaro lunch
[10:53:09] taavi: yes, updates are manual, volans is taking care of updating cumin hosts, and wmcs is responsible for updating cloudcumin when/if we want to have the latest version
[10:54:55] the last time I upgraded the package with "apt update && apt install spicerack"
[10:59:54] dhinus: ok, I'll update the hosts
[11:01:58] thanks
[12:56:51] I just realized this morning I spammed #-cloud with admin-related messages, the reason is that I accidentally swapped the two channels in my IRC client channel list, so I assumed I was typing in #-admin :D
[13:49:50] uhhh, the Python steering council has approved https://peps.python.org/pep-0703/
[13:50:04] (optional gil-less execution)
[13:50:54] will take a bit to implement though (and they don't seem sure if it will succeed)
[13:58:35] ceph is having slow ops again, this time it's cloudcephosd1005:/dev/sdi
[13:58:52] (osd 38)
[14:03:29] restarted the osd to force ceph to move the pgs to another osd
[14:03:33] that seemed to have helped
[14:03:47] there might be a hiccup on the NFS toolforges
[14:31:34] In Horizon, jnuche is trying to navigate to compute -> instances -> instance -> interfaces and it's showing the loading spinner forever. Same thing when I'm trying it. All other tabs seem to work just fine
[14:32:20] (on any project)
[14:33:11] that's been buggy for a while iirc. so far our answer to complex network needs has been "ask an admin to do it via the cli"
[14:47:10] interesting xd
[15:07:02] andrewbogott: in terms of upstream internet access, when discussing the cloudlb/cloud-private project the plan was that hosts could use the web proxy to access those. I don't recall objections being raised at the time that this would be problematic.
[15:07:18] Indeed as the cloudcontrols are now behind cloudlb (which has the public IP), they are not able to access the internet directly.
[15:07:59] The web proxy should work fine for apt, pypi or docker repos though. If you have problems getting it going, let us know and we can have a look
[15:21:47] topranks: and the http proxy thing is basically relying on clients observing $HTTP_PROXY env var right?
[15:21:58] that really /should/ work, but needs more investigation
[15:25:17] yeah it requires client support, so yes one way is to set the HTTP_PROXY env var and if they respect that it should be fine
[15:25:53] Things like pip and curl also have command-line flags to specify to use a proxy, things like apt are more system-wide and need it configured in the conf files in /etc/apt if not taking it from a system-wide parameter
[15:56:40] andrewbogott: am I right in thinking that we don't care about rack diversity for cloudvirt-wdqs?
[15:57:29] I think that's right.
[15:57:44] As far as I know they don't ever serve external traffic and there certainly aren't failover options between them.
[16:00:00] that matches my thinking, thanks
[16:12:22] possibly best not to put them all in the one rack?
[16:12:42] I assume in a planned maintenance, VM instances can be migrated between them?
[16:12:55] (somewhat related I opened a task regarding their 1G connections - https://phabricator.wikimedia.org/T349735)
[17:32:17] * dcaro off
[18:59:31] topranks: (much later) the cloudvirt-wdqs hosts are weird, they use local VM storage on spinning rust. So we don't support live migration for that team, they just get downtime when we have maintenance.
[19:01:19] andrewbogott: ok thanks for the explainer
[19:02:02] I keep asking them if I can scrap those servers but they keep thinking they're about to use them (while not actually using them) :)
[19:05:14] hmm ok
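Circling back to the web proxy discussion from 15:07–15:25: what topranks describes would look roughly like the sketch below. The proxy address is a placeholder, not the real cloud web proxy, and the apt snippet is the usual Acquire configuration rather than anything confirmed in the log.

    # placeholder address -- substitute the actual web proxy for the deployment
    export http_proxy="http://webproxy.example.wikimedia.cloud:8080"
    export https_proxy="$http_proxy"
    export HTTP_PROXY="$http_proxy" HTTPS_PROXY="$http_proxy"   # some clients only read the uppercase form

    # pip and curl can also take an explicit flag instead of the env vars
    curl --proxy "$http_proxy" https://pypi.org/
    pip install --proxy "$http_proxy" some-package

    # apt is configured system-wide instead, e.g. in /etc/apt/apt.conf.d/80proxy:
    #   Acquire::http::Proxy "http://webproxy.example.wikimedia.cloud:8080";
    #   Acquire::https::Proxy "http://webproxy.example.wikimedia.cloud:8080";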