[08:40:07] !log quarry rebooting worker-04 due to being unable to ssh to it (things started segfaulting, then too much work for irq)
[08:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[11:00:41] codesearch seems to be down?
[11:01:52] confirmed
[11:06:19] Maybe a webservice restart would fix it?
[11:06:43] hauskatze: the VM is down too
[11:06:54] 11:36:04 (InstanceDown) firing: Project codesearch instance codesearch8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:07:21] yeah, that explains it
[11:09:05] hauskatze: has any task been filed?
[11:09:31] I have not RhinosF1 - Maybe Lucas_WMDE did?
[11:10:35] nope
[11:15:42] I'll file one
[11:16:58] https://phabricator.wikimedia.org/T312207
[11:23:51] thanks
[11:25:21] ty
[11:25:22] bbl
[11:49:40] Lucas_WMDE: Amir1 is fixing
[11:49:59] which should be worrying
[11:52:00] Amir1: thank you for being amazing as always
[11:52:28] ^^ thanks for the kind words
[11:53:40] Amir1: I need to pick your brain over how to make actor work faster though
[11:56:08] We'll be on about 1.49 before it finishes for Miraheze if not
[11:58:03] set the --sleep to zero?
[11:59:07] Amir1: will that impact anything else
[11:59:14] It's still gonna be fairly slow though
[11:59:53] if that won't help, I'm not sure anything would
[12:00:55] thanks all!
[12:18:54] Amir1: our other issue is we create a lot of wikis a day
[12:19:18] And if it runs for a month (which would be a miracle) it's running a stale list
[12:19:29] So we've got to keep running it again to catch up
[12:19:39] Can the script tell if it's already run
[12:19:54] can't you make new wikis use the new schema from the beginning?
[12:21:52] I think renames might be an issue
[12:21:58] Recreates will
[12:22:04] But actually that's fine
[12:24:19] Amir1: what's the default batch size
[12:45:26] Amir1: ^ were you able to find why it did not work? (I found one VM today misbehaving too, no ssh, things started segfaulting and then dummy interrupts)
[12:46:52] Nope. Tbh I didn't check. I have so much to do today that I cared for only codesearch being up
[12:47:12] okok, np
[12:47:38] I'll keep an eye out in case there's something going on on a host or similar
[15:13:54] as far as I can see, the only way to use a custom container image is by getting a dedicated labs project, since I can't do it via tools registry?
[15:17:27] currently we don't support custom container images, what is your specific need? (we provide some base images for common languages)
[15:25:58] originally I ran stuff in my house because last time I checked - registry did not have node 16
[15:26:46] that reminds me that I should deploy T310821
[15:26:47] T310821: toolforge: Provide a node 16 image - https://phabricator.wikimedia.org/T310821
[15:27:23] E_TOOMUCHSTUFF
[15:27:46] for now I see it has 16-slim? in it but I'm not sure if I want to do all the dockerfile-ing again to deal with less package I get on toolforge
[15:29:14] s/package/image
[15:35:06] s[_]: we are working on a service to allow building images using buildpacks, that would allow you to just put your code url and it should pick up all the node/npm stuff, would that be enough? Or you have other dependencies too? (/me being curious)
[15:37:00] buildpacks as in?
[15:37:07] the one seen in heroku?
[15:37:39] similar yes, we would have some made by us, google and heroku ones available (still TBD though)
[15:39:46] the goal is to make all that transparent to you, so you just pass your code and it will guess the build process by default, with the option to tweak how that's done if needed
[15:40:05] yeah that probably should be sufficient for my use case, though according to my own dictionary Wikimedia's TBD means a few years away at the very least
[15:40:57] hopefully not, but yep, it's been around already for a year so there's some work already done xd
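
A buildpack-based build like the one sketched above typically needs no Dockerfile at all: the build service inspects the repository and guesses how to build it. As a rough illustration, and assuming the planned Toolforge build service detects Node projects the way Heroku-style Node buildpacks do (the log only says the design is still TBD), a tool's repository would mainly need a package.json declaring its runtime and entry point:

    # Hypothetical repository setup for a Node 16 tool; the file and fields
    # below are ordinary npm conventions, not anything confirmed for the
    # future Toolforge build service.
    cat > package.json <<'EOF'
    {
      "name": "example-tool",
      "engines": { "node": "16.x" },
      "scripts": { "start": "node server.js" },
      "dependencies": {}
    }
    EOF

A Node buildpack would then usually run npm install and use the start script as the web entry point, which is the "pick up all the node/npm stuff" behaviour described above.
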
[17:10:56] !log tools.global-search Hard stop & start cycle to reregister with frontend proxy (T312246)
[17:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.global-search/SAL
[17:11:27] kubernetes tools don't register with the front proxy
[17:12:48] did you test all tools? user report?
[17:14:35] no, reference/correction to the !log above
[17:19:07] oh, so it did not reregister?
[17:20:29] probably just the pod restart itself which fixed things
[17:31:17] taavi: hmmm... you're right. I guess that 503 was from the ingress rather than the front proxy
[17:31:17] hihi, related ^ - https://global-search.toolforge.org/ and https://refill.toolforge.org/ now having intermittent 503
[17:31:17] (so guessing this is less tool-related and more toolforge related)
[17:31:17] yeah, looking into it
[17:31:17] ty <3
[17:31:17] another random data point is that https://os-deprecation.toolforge.org/ is loading very, very slowly and it is a static HTML webservice, so mostly constrained by NFS access to read pre-built files.
[17:35:51] TheresNoTime: still seeing any issues?
[17:36:06] https://os-deprecation.toolforge.org/ is proper fast again, so hopefully whatever was hiccuping has recovered.
[17:36:17] all looks okay here :)
[17:36:53] we are having some ceph issues
[17:56:54] bd808: it looks like there may be more weirdness with global-search. I tried to update all the PHP packages, and it tested fine on global-search-test.toolforge.org (now redirects), but on the main global-search I can't seem to get it to use the php7.4 image
[17:57:21] just now I tried stopping the webservice, which says it was successful. Then a `kubectl get pods` returns a huge stacktrace
[17:57:33] now I can't seem to run any `webservice` command
[17:57:40] :/
[18:01:24] musikanimal: what is the error you see?
[18:02:03] `fatal error: newosproc` followed by a bunch of noise
[18:02:27] looks like golang errors
[18:02:49] hmmm, maybe os resources?
[18:02:55] what bastion are you in?
[18:03:06] tools-sgebastion-10
[18:04:07] musikanimal: what command are you running? (a webservice status worked for me)
[18:04:25] `webservice --backend=kubernetes php7.4 start` worked and it's running again, but `webservice shell` fails with the same `fatal error: newosproc`
[18:04:47] yeah, and `webservice status` shows it is still on php7.3
[18:05:20] those huge kubectl stack traces usually mean you're near the user quota of processes, go likes to spawn threads :/
[18:05:23] try logging in on the other bastion
[18:05:35] or close some shells or otherwise reduce the number of running processes if possible
[18:05:40] I was able to get the shell running
[18:05:45] but yep, might be that
[18:05:55] (newosproc points that way)
[18:06:19] aha! that worked, thanks lucaswerkmeister
[18:06:23] I did have another session open
[18:06:33] I will try switching to php7.4 again
[18:06:43] I think it might have been this
[18:06:45] [Wed Jul 6 17:54:16 2022] cgroup: fork rejected by pids controller in /user.slice/user-11106.slice/session-48531.scope
[18:06:53] (these kinds of errors are also how I got into the habit of running `exec become` instead of `become`, it saves one process ^^)
[18:07:42] yep, that was it
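
The diagnosis above (a Go runtime "fatal error: newosproc" plus the "fork rejected by pids controller" kernel line) points at a per-user process limit on the bastion rather than at the tool itself. A minimal sketch of how to check and work around it, assuming the limit is enforced per user as that cgroup message suggests; "my-tool" is a placeholder tool name:

    # Count your own processes on the bastion when kubectl or webservice
    # starts dying with "fatal error: newosproc":
    ps -u "$USER" --no-headers | wc -l

    # Close spare SSH sessions, or replace the login shell rather than
    # forking a new one, which saves one process per login:
    exec become my-tool

Closing spare shells, as suggested above, helps for the same reason: every open session counts against the same per-user limit on that host.
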
[18:07:45] it's still refusing to use php7.4, but at least the webservice is up and running
[18:10:33] surprisingly without updating the packages, everything is still running fine with no errors. I guess the new translations that I pulled must be there too (static files), so from a user-facing perspective all is well. It's weird it won't let me use php7.4, but not a huge deal (for now)
[18:10:55] what do you mean by 'doesn't let me use php7.4'? any errors?
[18:11:19] no errors. `webservice --backend=kubernetes php7.4 start` uses a php7.3 image
[18:11:34] this worked fine on global-search-test
[18:11:48] even after completely stopping it?
[18:11:56] yeah
[18:12:00] weird
[18:12:24] let me have a look
[18:13:02] okay, thanks
[18:13:03] tools.global-search@tools-sgebastion-11:~$ kubectl describe pod | grep Image
[18:13:03] Image: docker-registry.tools.wmflabs.org/toolforge-php74-sssd-web:latest
[18:13:19] so the pod itself uses the correct image, just `webservice` reporting it wrong
[18:13:55] it's reading the wrong value from service.template
[18:14:44] if I comment it out, then it works fine
[18:14:45] ah, it's working now! I don't know if you did anything, but `php -v` is correct when run inside the pod
[18:14:55] I see
[18:15:50] am I safe to use `webservice shell` or are you still testing things? I just want to run `composer install`
[18:15:57] mind filing a phab task and tagging it against #toolforge? seems like a real bug in webservice
[18:16:06] sure, not touching anything anymore
[18:16:51] I can file a task, sure. I don't have any repro steps, but I can copy our convo here
[18:17:27] sounds good
[18:29:04] https://phabricator.wikimedia.org/T312278
[18:29:12] thanks again to everyone here for the help! :)
[19:24:07] musikanimal: I had made a $HOME/webservice.template in the global-search tool that is setting a default php7.3 backend. In theory cli args should override that, but it sounds like that may not be happening as expected.
[19:25:05] * bd808 sees this was figured out and filed as a bug
[19:26:11] ah that makes sense! I at least understand what the bug is now, hehe
[19:50:43] !log tools.stashbot Update to 3e22675 (sal: Skip "Logged the message" replies in response to logmsgbot)
[19:51:28] !log tools.stashbot Update to 3e22675 (sal: Skip "Logged the message" replies in response to logmsgbot)
[19:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[19:55:53] hm, I find those replies useful in -cloud at least…
[19:56:04] (maybe not so much in -operations where the wikitech link will always be the same)
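
To make the earlier service.template discussion (T312278) concrete: webservice can read default options from a template file in the tool's home directory, and a stale default there is what kept the CLI reporting the old php7.3 type even though the pod already ran the php7.4 image. The log refers to the file as both service.template and $HOME/webservice.template; the sketch below is only a guess at its contents, with key names assumed from webservice's template support rather than quoted from the log:

    # Inspect the tool's template file; the expected contents shown as
    # comments are an illustrative assumption, not copied from the log:
    cat "$HOME/service.template"
    # backend: kubernetes
    # type: php7.3

In theory an explicit invocation such as `webservice --backend=kubernetes php7.4 start` should override the template, and the fact that this did not happen cleanly is the bug filed as T312278.
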