[00:23:19] cool. thanks t.aavi
[08:48:58] morning
[09:09:21] o/
[09:45:48] I am swapping the active metricsinfra haproxy host
[09:48:48] ack
[10:31:20] I'm thinking of merging the servers that do redirects for toolserver.org and tools.wmflabs.org into one. the only problem is that toolserver.org runs apache and tools.wmflabs.org runs nginx
[10:45:16] taavi: +1 to that idea
[10:47:14] sounds good yes
[11:12:08] hmm, there's a bump in the number of stuck processes on tools nfs-workers 56, 21 and 22 between 03-04 and 03-05, but there isn't much in SAL. does anyone remember if anything happened then? (I'm looking to see if there are any issues while rebalancing ceph, none so far)
[11:18:46] I don't remember anything specific
[11:31:14] * dcaro paged, harbor is down
[11:32:27] was that temporary?
[11:33:39] taavi: might that be related to your work on prometheus?
[11:33:46] (metricsinfra and such)
[11:33:59] probably not
[11:34:08] I don't see alerts on alertmanager
[11:34:20] I see tools-prometheus-6 also went down a while ago
[11:34:36] I thought that might have been you restarting for some reason
[11:34:39] looking
[11:34:55] nope
[11:35:43] ceph is not rebalancing (it has been stable for a bit, I stopped the rebalance to fix some location hook)
[11:35:53] and I can access harbor via the UI and ssh
[11:36:34] I think the real issue is/was with tools-prometheus-6
[11:37:42] yep, got the resolution emails now. I don't seem to be able to ssh to tools-prometheus-6 though, console works
[11:37:43] looking
[11:40:08] Mar 22 11:39:37 tools-prometheus-6 sshd[778362]: pam_sss(sshd:account): Access denied for user dcaro: 4 (System error)
[11:40:15] interesting, it did find my key, but then failed
[11:41:07] I can ssh as root, can any of you ssh as your users?
[11:42:03] not as `taavi`, yes as `root`. so probably an sssd issue? (those are often-ish caused by the server overloading)
[11:42:26] probably yes, though it has been a while
[11:42:30] since the last issue
[14:38:27] hopefully a config file like this is not too cursed an approach: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013523
[14:38:57] (I took the list of allowed tools in the nginx .lua file and generated the apache vhost based on that)
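To illustrate the general shape of what the 14:38 change describes, here is a minimal sketch of generating an Apache vhost fragment from a plain list of allowed tool names. It is not taken from the linked Gerrit patch: the input file name, the redirect target, and the exact rules are made-up placeholders, and the real redirect semantics live in the patch and the nginx .lua file it was derived from.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: emit an Apache vhost fragment that redirects
tools.wmflabs.org/<tool>/... to <tool>.toolforge.org/... for a fixed list
of allowed tools. File names and redirect targets are hypothetical."""

TEMPLATE = """<VirtualHost *:80>
    ServerName tools.wmflabs.org
{rules}
    # Anything not in the allow-list gets a generic landing page (hypothetical).
    RedirectMatch 302 ^/.*$ https://toolforge.org/
</VirtualHost>
"""

def vhost_from_allowlist(tools):
    # One permanent redirect per allowed tool, preserving any path suffix.
    rules = "\n".join(
        f'    RedirectMatch 301 "^/{tool}(/.*)?$" "https://{tool}.toolforge.org$1"'
        for tool in sorted(tools)
    )
    return TEMPLATE.format(rules=rules)

if __name__ == "__main__":
    # Hypothetical input: one tool name per line, e.g. extracted from the nginx .lua list.
    with open("allowed-tools.txt") as f:
        tools = [line.strip() for line in f if line.strip()]
    print(vhost_from_allowlist(tools))
```

In the real setup the list would presumably be rendered through Puppet templating rather than a standalone script; the snippet just shows the list-to-vhost idea.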
[14:43:58] * andrewbogott feeling intimidated by https://os-deprecation.toolforge.org/
[14:45:45] andrewbogott: there's a script to mass-file tasks for each project included in the codebase of that tool
[14:46:38] cool! At the moment I'm not so much intimidated by all the bug filing as by all the things on that list that our team is responsible for. I'll probably start chipping away at that.
[14:47:33] andrewbogott: we will be done before deployment-prep, so you can always feel good about that.
[14:47:49] good and bad both!
[14:48:15] * bd808 may end up nerd sniped into leading the beta cluster migration this time around
[14:49:13] My fear is that the 'remove buster from deployment-prep' project is == 'replace deployment-prep app servers with k8s'
[14:49:48] And that second thing will result in another incorrect assertion that deployment-prep is unnecessary and will be shut down 'soon'
[14:52:58] It would be sweet to get beta using a k8s cluster for MediaWiki, but I find that unlikely in the near term. I think the more likely progression will end up being: new bullseye vms (may/june) -> MW in a container running under podman (august/september) -> MW on k8s (just before heat death of universe)
[14:54:06] the container bits are likely to fall out of work towards the group-1 concept next fiscal
[14:54:09] I'm discussing this with tyler in -releng and he's saying roughly that
[14:55:10] fwiw, the prod mediawiki appserver code /can/ run on bullseye. we do that for wikitech for example
[14:56:20] yeah. the only reason for buster is to pretend that beta running the same version of PHP as prod helps find bugs
[14:56:48] which... maybe it does, but that is more practically the job of CI
[14:57:25] beta is not likely to surface at-scale PHP crashing bugs in an actionable way
[14:59:12] are you all just using the debian-packaged php for bullseye?
[15:00:01] I don't think so
[15:00:29] *** 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf11u1 1001
[15:00:30] 1001 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/component/php74 amd64 Packages
[15:00:39] 2:76+wmf1~bullseye1
[15:02:02] everything php on cloudweb1003 has a "wmf" something package version.
[15:02:58] yes, that's all coming from https://apt-browser.toolforge.org/bullseye-wikimedia/component/php74/
[15:03:05] oh, there's already a wmf php package, that was one worry I had about this (since prod appservers aren't on bullseye)
[15:03:59] there may be other appserver packages missing (requires investigation) but good to know at least php is there :)
[17:05:08] taavi: You may have an opinion about this :) I'm building a fresh acme-chief VM in cloudinfra and thinking about the migration from old to new. Would you sync files between the old and new beforehand, or just let everything regenerate and churn?
[17:12:24] andrewbogott: I would migrate the data by using the sync mechanism built into the acme-chief puppet classes
[17:12:54] ok, I'll read the code :)
[17:13:41] iirc it's just a matter of changing some hiera keys
[17:13:52] ah, is that the active/passive host thing?
[17:13:56] yeah
[17:14:01] cool
[17:15:14] * andrewbogott tries it
[17:18:09] fyi, ceph is still rebalancing from adding cloudcephosd1034 to the cluster (2 osd daemons at a time, trying to avoid overloading the switches)
[17:18:58] on the plus side, bringing a node in and out fully works with rack HA (and the cookbooks are more than tested)
[17:19:08] cya
[17:20:32] It would be very nice to have a proper SAN backplane for the Ceph cluster, but I'm glad to see y'all figuring out how to make things work
[18:00:26] * bd808 lunch
[22:16:48] bd808 (or anyone else still around) can I get a last-minute +1 for T360823?
[22:16:48] T360823: Increase instance and volume quota in devtools project for puppetmaster upgrade - https://phabricator.wikimedia.org/T360823
[22:34:03] andrewbogott: done
[22:34:35] thx
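Circling back to the ceph rebalance mentioned at 17:18: the real workflow is driven by the WMCS cookbooks, but as a rough sketch of the idea (throttle backfill so that newly added OSDs don't saturate the switches, then poll until the placement groups settle), something like the snippet below could work. The osd_max_backfills value, the polling interval, and the settled check are assumptions for illustration, not taken from the actual cookbooks; the ceph CLI subcommands themselves are standard Ceph.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: cap Ceph backfill concurrency while new OSDs
are being added, then poll until no PGs are backfilling or recovering.
Not the actual WMCS cookbook; values and checks are assumptions."""
import json
import subprocess
import time

def ceph(*args):
    # Run a ceph CLI command and return its stdout.
    return subprocess.run(("ceph",) + args, check=True,
                          capture_output=True, text=True).stdout

def throttle_backfill(max_backfills=1):
    # Limit concurrent backfills per OSD so recovery traffic stays modest.
    ceph("config", "set", "osd", "osd_max_backfills", str(max_backfills))

def wait_until_settled(poll_seconds=60):
    # Poll PG states until nothing is backfilling or recovering any more.
    while True:
        status = json.loads(ceph("status", "--format", "json"))
        states = {s["state_name"]: s["count"]
                  for s in status["pgmap"].get("pgs_by_state", [])}
        busy = sum(count for name, count in states.items()
                   if "backfill" in name or "recover" in name)
        if busy == 0:
            print("rebalance finished:", states)
            return
        print(f"{busy} PGs still backfilling/recovering, waiting...")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    throttle_backfill(max_backfills=1)
    wait_until_settled()
```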