[10:12:14] !log tools.sal webservice restart
[10:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[14:35:03] hi folks. is there a known toolforge outage? SAL is down for example and there are some user reports as well
[14:36:05] sukhe: there was maintenance on the toolforge NFS server a few moments ago, you might be seeing fallout from that
[14:36:46] taavi: thanks
[14:37:07] sukhe: we're running a script to reboot everything (as is usual practice after NFS maintenance) so everything should recover on its own as that completes
[15:44:57] !log toolsbeta deploy toolforge-webservice 0.103.7 (T362050)
[15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[15:45:00] T362050: toolforge: review pod templates for PSP replacement - https://phabricator.wikimedia.org/T362050
[15:45:01] !log tools deploy toolforge-webservice 0.103.7 (T362050)
[15:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:01:17] hashar: I'd like to migrate VMs in the 'integration' project to the new network setup. That will result in each host being briefly stopped and started, one at a time. Is that something the cluster can survive automatically?
[16:07:05] andrewbogott: Jenkins would reattach them but I guess the ongoing jobs would die
[16:07:28] ok -- is there something I can do to make things more graceful?
[16:07:33] hmm
[16:08:13] well I hesitate between:
[16:08:17] A) "too bad for you, just `recheck` to run CI again"
[16:08:57] B) mark the Jenkins agents offline, which prevents them from running jobs, migrate the VM, bring the agents back online
[16:09:38] I can do B via the jenkins web ui, right?
[16:09:42] yeah
[16:09:44] hopefully
[16:09:50] OK, I'll try that first
[16:10:05] the issue with A is that you might end up breaking a long-running job during a backport window or something like that
[16:10:10] it is a bit disruptive for devs
[16:10:20] so I'd prefer to go with the graceful one, which is to put the node offline
[16:11:14] as an SRE, you can grant yourself Jenkins administrative rights by adding yourself to the `ciadmin` ldap group
[16:11:17] (if not already a member)
[16:11:28] the list of agents is on https://integration.wikimedia.org/ci/computer/
[16:11:46] they are named based on their WMCS instance hostname
[16:12:20] ok, I think I've depooled 44 and 46, does that look right to you?
[16:12:26] ah
[16:12:41] yes
[16:12:48] do note 1046 still has jobs running
[16:13:04] they show up in the sidebar but should be complete soon
[16:13:17] oops, it was empty when I went to click the button :)
[16:13:22] hehe
[16:13:28] * andrewbogott migrates 1044
[16:13:29] well it is ok
[16:13:36] marking it offline still lets the ongoing jobs complete
[16:13:49] it is just that no more jobs will be scheduled on that instance
[16:14:00] * andrewbogott nods
[16:14:15] and given CI seems rather quiet today, feel free to put a batch of them offline
[16:14:35] this way CI / devs are not going to notice
[16:15:33] integration-cumin has a keyholder which would need to be rearmed, which I can do easily
[16:16:27] ok, I'll do that one next
[16:19:29] hashar: integration-cumin is back up, you can rearm
[16:47:32] andrewbogott: thanks!
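Option B above (drain, migrate, re-enable) can also be driven from the command line via the Jenkins REST API rather than the web UI, which helps when depooling a batch of agents. A minimal sketch, with assumptions: the agent name is a placeholder, `$JENKINS_USER`/`$JENKINS_TOKEN` stand for an account and API token that already carry the administrative rights mentioned above, and while `toggleOffline` is a stock Jenkins endpoint it is worth double-checking against this particular instance.

```
# Drain a Jenkins agent before migrating its VM, then bring it back afterwards.
# JENKINS_USER / JENKINS_TOKEN: account + API token with admin rights (assumption).
JENKINS=https://integration.wikimedia.org/ci
AGENT=integration-agent-docker-1044   # placeholder agent name

# Mark the agent offline: running jobs finish, but no new jobs are scheduled on it.
curl -s -X POST -u "$JENKINS_USER:$JENKINS_TOKEN" \
  "$JENKINS/computer/$AGENT/toggleOffline?offlineMessage=VM+network+migration"

# ... stop/start the VM for the network migration here ...

# The same endpoint toggles the agent back online once the migration is done.
curl -s -X POST -u "$JENKINS_USER:$JENKINS_TOKEN" \
  "$JENKINS/computer/$AGENT/toggleOffline"
```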
[16:47:55] !log theprotonade@tools-bastion-13 tools.matchandsplit ./matchandsplit/scripts/toolforge-deploy-new-version.sh
[16:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.matchandsplit/SAL
[17:04:37] !log theprotonade@tools-bastion-13 tools.matchandsplit ./matchandsplit/scripts/toolforge-deploy-new-version.sh
[17:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.matchandsplit/SAL
[17:21:08] hashar: I've now migrated all integration hosts except for the pkgbuilder ones which are on Buster and covered by T367534
[17:21:10] T367534: Cloud VPS "integration" project Buster deprecation - https://phabricator.wikimedia.org/T367534
[18:39:45] !log soda@tools-bastion-13 tools.matchandsplit Test
[18:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.matchandsplit/SAL
[19:06:36] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge build delete bridgebot-buildpacks-pipelinerun-knkch bridgebot-buildpacks-pipelinerun-lvm2d bridgebot-buildpacks-pipelinerun-k4ddk bridgebot-buildpacks-pipelinerun-ghk8n # clear some storage space to let a new build fit in the quota
[19:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[19:09:23] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge build delete bridgebot-buildpacks-pipelinerun-bxnml # clear out more space, hopefully this [built from ref work/bd808/try-some-hacks in toolforge-repos/bridgebot-matterbridge] was no longer needed
[19:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[19:09:50] hmph, quota doesn’t actually seem to go down by much
[19:10:54] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge build delete bridgebot-buildpacks-pipelinerun-jtzrd # clear out failed build
[19:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[19:12:19] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot toolforge jobs restart bridgebot # *without* new image; bot had stopped bridging cloud IRC and telegram for no obvious reason and hopefully this fixes it
[19:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[19:12:47] okay some bridging is happening again
[19:14:54] Is stashbot running ?
[19:14:56] (note for the Telegram side: the messages that were just bridged are in fact several hours old, the bridge had a hiccup)
[19:15:00] (I think they were also bridged out of order)
[19:20:58] !log lucaswerkmeister@tools-bastion-13 tools.stashbot ./bin/stashbot.sh restart # quit IRC a minute ago, following earlier reports of unspecified issues
[19:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[19:22:28] I think I need some help with bridgebot’s build service quota… it’s reportedly still at 92% capacity despite me having deleted five out of the seven builds it used to have
[19:22:34] so it feels like it’s still counting deleted builds against the quota
[19:23:08] but I don’t really want to risk a full `toolforge build clean` because I have no idea how hard the second image would be to rebuild, I’ve never done that one and only have the vaguest idea what it even does
[19:24:11] I’ve started another build just in case the act of trying to upload it will “wake up” the quota tracker
[19:27:42] didn’t work 😔
[19:36:04] bd808: ^
[19:55:05] @lucaswerkmeister: I have a good idea how to free up a bunch of quota with T366970, so I will put that on my list of things for today.
[19:55:06] T366970: Replace tool-bridgebot/znc container with tool-containers/bnc container - https://phabricator.wikimedia.org/T366970
[19:55:29] sounds promising, thanks!
[20:05:19] Hi, is Toolforge down? I can't seem to get citation bot to work again
[20:06:09] https://iabot.toolforge.org/ doesn't load for me, not sure if it's a toolforge problem or it just won't work anymore.
[20:06:25] it's more than that
[20:07:33] Sorry, can you clarify that?
[20:08:41] we seem to be having some kind of general network overload, lots of random things are failing to respond. Stand by :/
[20:09:31] Ah, I see what mutante means: Wikipedia isn't even loading now for me, it's really laggy. Ok, thanks!
[20:10:27] Myrealnamm: major outage
[20:10:30] not just cloud
[20:10:49] DBs are having issues
[20:19:37] Myrealnamm: I think the network storm is over, are things working for you now?
[20:21:42] yes
[20:21:56] phabricator is back
[20:21:58] Thank you!
[20:22:05] Myrealnamm: should be ok
[20:22:44] Funny thing was I was editing Wikipedia and adding a source, it kept showing a 502 or 501 error and finally I published it, but the inline citations didn't show so I had to redo it. A little frustrating
[20:33:57] bd808: FWIW I still feel like there’s probably a bug in the build service somewhere
[20:34:42] my general assumption would be that a tool should never *have* to run `toolforge build clean`, because that means you can’t restart the tool (/ bot / whatever) until you’ve rebuilt the image again (and you’ve got to hope that the build still works)
[20:34:55] but right now it seems like the quota won’t go down without it
[20:35:08] (though I suppose so far we also don’t know that the quota will go down *with* it ^^)
[20:37:22] lucaswerkmeister: this is starting to sound like something you should make a phab ticket for
[20:37:30] sure, can do
[20:40:03] (I was hoping someone in here might know what to do directly ^^ but I’ll file the task now)
[20:42:53] @lucaswerkmeister: I think the build service right now mostly assumes that tools will only have one container to keep track of and I keep violating that assumption with my tools.
;)
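For anyone hitting the same wall: the manual cleanup described above amounts to enumerating the tool's builds and deleting the failed or superseded ones while keeping the image the running jobs were built from, with `toolforge build clean` only as a last resort. A rough sketch under assumptions: `toolforge build list` is assumed to be the listing subcommand in the installed builds client (check `toolforge build --help` on a bastion), and the pipeline-run names below are placeholders.

```
# Run as the tool account on a Toolforge bastion, e.g. after `become bridgebot`.
# Enumerate this tool's builds; failed and superseded ones are cleanup candidates.
toolforge build list

# Delete individual builds by pipeline-run name (placeholder names below);
# several names can be passed at once, as in the !log entries above.
toolforge build delete bridgebot-buildpacks-pipelinerun-aaaaa \
                       bridgebot-buildpacks-pipelinerun-bbbbb

# Last resort only: removes *every* build artifact for the tool, so each image
# has to be rebuilt before the tool can be restarted.
# toolforge build clean
```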
[20:43:26] filed at T368317 (cc andrewbogott bd808)
[20:43:27] T368317: bridgebot tool build service quota not going down - https://phabricator.wikimedia.org/T368317
[20:43:32] The 1GiB default quota sounds big but turns out to not be so big in practice because of the size of the base images
[20:43:56] (also FTR “not going down” reminds me of https://bash.toolforge.org/quip/FQ-SNYwBGiVuUzOdg7vQ)
[20:44:54] bd808: TBF I got the impression that this is at least to some extent encouraged
[20:45:23] like, my assumption for the recent --port feature would be that it expects one tool with at least two images, frontend (webservice) and backend (continuous job with --port)
[20:45:36] unless you’re using the containers tool for the backend I guess
[20:46:05] (or unless webservice is meant to become based on continuous jobs? I think I’ve heard murmuring in that direction somewhere)
[20:46:12] --port is a pre-condition to completely getting rid of `webservice` more than anything else.
[20:47:15] T348755
[20:47:15] T348755: [jobs-api,webservice] Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755
[20:47:47] nice, I didn’t hallucinate it
[20:48:47] see also T362051, wherein we are sort of reinventing Helm
[20:48:49] T362051: [components-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051
[20:54:29] heh
[20:54:37] * lucaswerkmeister has only interacted with helm stuff others set up
[20:56:28] !log tools rebooting tools-k8s-worker-nfs-36; it has lots of stuck processes which somehow didn't get unstuck when we did the post-nfs-migration reboots.
[20:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:26:31] Hello dear people, I haven't dealt with wmcloud for several years now, and I am being spammed with puppet alerts from instances I am the admin of. Can a seasoned admin help me pass on ownership, please?
[22:27:27] I remember andrewbogott and bd808 used to be experts on this matter, but who knows what changed over the last few years
[22:33:07] matanya: what specifically are you asking us to do? Find you new maintainers to hand projects off to?
[22:34:36] * bd808 doesn't know how to get rid of his own semi-abandoned side projects
[22:34:38] I guess so. First, I'd like to stop the puppet spamming; second, I'd like to see if I can help with upgrading from buster, and then step down
[22:35:24] Or just step down, and then world peace will come, who knows ...
[22:35:59] matanya: which of these projects? -- https://openstack-browser.toolforge.org/user/matanya
[22:36:13] I'm sort of assuming video?
[22:36:27] I think so
[22:36:54] I think bastion is too important for me to touch, not sure what tools and deployment-prep do anymore
[22:37:11] you shouldn't get any puppet spam from tools.
[22:37:14] I do remember what video does, so it is probably that
[22:37:47] Chico is the only other real admin left there for the video2commons backend
[22:38:04] chico took over some years ago, not sure if he is still in charge
[22:38:16] /me waves
[22:38:23] https://openstack-browser.toolforge.org/project/video -- you and chico are it.
[22:38:47] Do you know if it even works?
[22:39:23] V2c?
Kind of, works most of the time is what I aim for
[22:40:10] T367599 is relevant if looking for help
[22:40:10] T367599: Request to join video project - https://phabricator.wikimedia.org/T367599
[22:40:17] Haven't had a lot of time in the last 6 months with a new baby though, it's probably working less than most of the time right now.
[22:40:33] An upgrade from buster will break it all?
[22:41:11] JJMC89: nice find!
[22:41:29] Y'all should take Don-vip up on his offer :)
[22:42:10] Puppet is non-functional.
[22:42:10] Buster upgrade is doable, nothing is blocking it but putting in the work. Is the buster upgrade in place or a new machine?
[22:42:12] definitely!
[22:42:27] should be in place
[22:42:42] my only worry is redis compatibility
[22:42:58] My plan last year was to move it all to k8s, haven't had the time and probably won't for a couple of months.
[22:43:02] @chicocvenancio: ideally to new instances. Our tracking automation will still think you are on the old OS if you just do an in-place upgrade.
[22:43:40] I doubt a new instance will work oob
[22:43:52] but willing to give it a shot
[22:44:12] If I still remembered how to log in to horizon
[22:44:29] (If it is still done this way)
[22:44:37] Redis is somewhat broken and my go-to fix has been to add memory to it in place... It's what gives me the most pause.
[22:45:01] and bd808 feel free to remove me as an admin from all those projects
[22:45:32] matanya: that will instantly stop your ability to help with migrations if I do
[22:45:46] sounds like a feature, not a bug
[22:45:56] I would appreciate a phab task for tracking
[22:46:02] doing
[22:46:08] Recreating video machines should work and I've done it somewhat recently. Redis is a bit more worrisome.
[22:47:23] also, hello matanya! Long time no talk :)
[22:48:29] https://phabricator.wikimedia.org/T368330
[22:48:38] Yes, Hello! :)
[22:49:06] I have been really busy irl in the last few years, sorry for my absence
[22:50:30] eh, folks get busy and drift away. It's normal. It is always nice when they drop by again to visit or help later. :)
[22:52:20] We should thank ssapaty for pinging me to handle the buster migration, which reminded me this stuff even exists
[22:54:28] I do admit seeing the names of the people in chat is warming my heart
[22:59:22] !log video Removed matanya's "member" right per T368330
[22:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL
[22:59:25] T368330: Remove matanya as an admin from VPS projects - https://phabricator.wikimedia.org/T368330
[23:00:06] * bd808 still finds the member/reader role names confusing
[23:01:15] Hmm. Trying to do `webservice shell` to work around T360488, and now it's giving a strange error: Error from server (Forbidden): pods "shell-1719270016" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.containers[0].securityContext.procMount: Invalid value: "DefaultProcMount": ProcMountType is not allowed]
[23:01:16] T360488: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488
[23:01:58] !log deployment-prep Removed matanya's "reader" right per T368330
[23:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[23:03:51] anomie: hmmm... that's new to me too. It sounds like it is related to the stuff in T279110 that a.rturo has been working on.
[23:03:51] T279110: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279110
[23:03:58] anomie: I can mostly reproduce that in another tool with `webservice perl5.36 shell`, though the error message is a bit longer then
[23:04:44] OK, was nice passing by, sending you my love and appreciation, have a great night, and see you around in the wikiverse
[23:05:13] matanya: :wave: come back any time :)
[23:05:24] Also I seem to have a job stuck in a failed state, `toolforge-jobs delete anomiebot-200` doesn't seem to delete it, so I can't restart it.
[23:07:18] anomie: you can try `kubectl delete job anomiebot-200` and see if that does anything different
[23:07:39] That seems to have worked, thanks
[23:12:12] I think `webservice shell` is messed up for everyone right now. Probably as an unintended side effect of T362050?
[23:12:13] T362050: toolforge: review pod templates for PSP replacement - https://phabricator.wikimedia.org/T362050
[23:14:06] Looks like it still works from login-buster.toolforge.org 🤷
[23:18:03] anomie: that points solidly to T362050 and the newest build of toolforge-webservice then.
[23:18:04] T362050: toolforge: review pod templates for PSP replacement - https://phabricator.wikimedia.org/T362050
[23:18:24] I left a note for a.rturo on the task
[23:24:35] Hi all! I'm trying to restart a toolforge webservice that I've updated a bit in the last couple of months and am running into a strange error. When I run `webservice --backend=kubernetes python3.9 shell` I get the following error:
[23:24:35] ```
[23:24:36] Error from server (Forbidden): pods "shell-1719271349" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.containers[0].securityContext.procMount: Invalid value: "DefaultProcMount": ProcMountType is not allowed spec.containers[0].securityContext.runAsUser: Invalid value: 55751: must be in the ranges: [{61312 61312}]
[23:24:36] spec.containers[0].securityContext: Invalid value: []int64{55751}: group 55751 must be in the ranges: [{61312 61312}] spec.containers[0].securityContext.procMount: Invalid value: "DefaultProcMount": ProcMountType is not allowed]
[23:24:37] ```
[23:24:49] HTriedman: https://phabricator.wikimedia.org/T362050#9919714
[23:25:06] We just found out that you can run it from login-buster.toolforge.org
[23:27:00] !status `webservice shell` broken (T362050)
[23:27:03] T362050: toolforge: review pod templates for PSP replacement - https://phabricator.wikimedia.org/T362050
[23:30:16] "You are listed as the owner of the following projects that are running Buster[0]:
[23:30:16] deployment-prep,integration,tools"
[23:30:37] I didn't know I had special rights on tools
[23:30:39] the tools one is unexpected I'd guess
[23:30:48] Yeah
[23:31:32] It didn't list bastion so I guess it's not listing everything I'm a member of
[23:31:41] Or maybe there's no buster there
[23:31:51] Krinkle: is that a new email from Komla? Maybe his script is grabbing something strange?
[23:32:07] Yeah from 1h ago
[23:32:52] komla: ^ Krinkle reports your latest buster nag email claiming he can fix things in tools that he probably cannot.
[23:33:45] Let me check
[23:34:12] I pulled the admins using ldapsearch.
[23:34:48] ldapsearch won't tell you a Cloud VPS admin vs normal user, just that someone is a project member
[23:34:59] https://openstack-browser.toolforge.org/project/tools
[23:35:08] I'm not listed as admin there.
[23:35:21] * Krinkle fades back into the bushes.
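A footnote on the stuck anomiebot-200 job from earlier in this exchange: when `toolforge-jobs delete` leaves a failed job behind, the fallback bd808 suggested is to inspect and remove the underlying Kubernetes Job object directly. A small sketch, assuming the standard Toolforge setup where the tool account's kubectl already points at the tool's own namespace; the job name is a placeholder.

```
# Run as the tool account on a Toolforge bastion (kubectl is preconfigured
# for the tool's own Kubernetes namespace).
kubectl get jobs                      # see which Job objects still exist
kubectl describe job anomiebot-200    # placeholder name; shows why its pods failed
kubectl delete job anomiebot-200      # remove the stuck Job so it can be recreated
```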
[23:35:24] :bd808 oh ok
[23:35:29] Thx for checking it out :)
[23:36:04] komla: you have to ask Keystone directly about the "member" (admin) vs "reader" (user) roles
[23:36:30] so… did that just email all the toolforge “viewers” (“everyone”)?
[23:36:42] :bd808 noted!
[23:37:15] hm, maybe not, AFAICT my WMDE email didn’t receive anything
[23:37:52] :lucaswerkmeister: hmmm, it would seem
[23:38:30] (FWIW, I got the message on my private email, but that’s fair enough given that I’m a toolforge root ^^)
[23:42:10] :lucaswerkmeister it is one mail per recipient so I guess that's all you'll get.
[23:43:21] but my WMDE email is on a completely separate account
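To close out bd808's point about asking Keystone rather than LDAP: the "member" (admin) vs "reader" (user) distinction lives in Keystone role assignments, which the OpenStack CLI can query directly. A sketch under assumptions: it presumes admin-level OpenStack credentials are already loaded in the environment, and 'video' is only an example project name.

```
# List which Keystone role each account holds on a project;
# "member" is the admin-ish role here, "reader" the plain-user role.
openstack role assignment list --project video --names

# Narrow it down to just the project admins.
openstack role assignment list --project video --role member --names
```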