[06:09:46] Hey all, I was just wondering: is there a slow_query_log for Toolforge user dbs? A quick google doesn't bring up any results. Also interested in any recommendations for profiling/optimizing Toolforge DB queries. Thanks!
[07:08:21] you could try https://sql-optimizer.toolforge.org
[07:46:55] nice tool
[09:40:40] Sorry to bother, bridgebot is double-posting again
[09:44:53] !log lucaswerkmeister@tools-bastion-13 tools.bridgebot Double IRC messages to other bridges
[09:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[09:48:07] !log tools update pywikibot script image to v9.1.0 T363132
[09:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:48:10] T363132: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T363132
[10:29:49] Any chance there is a way I can see the output of healthcheck runs?
[10:35:18] @sohom_datta: I believe successful probes aren't recorded anywhere. With `kubectl describe` you may see failed ones.
[10:45:23] Do healthchecks not use launcher?
[10:58:27] I don't know what launcher is in this context
[11:07:43] My healthcheck is set to be the command "ping", with a custom "ping" command defined in the Procfile
[11:08:51] But the healthcheck seems to be running `exec /bin/sh -c ping` instead of `/bin/sh -c launcher ping` (which it does for other Procfile commands)
[11:19:20] arturo: If you have some time, could you take a look at link-dispenser? The crawljob process keeps crashlooping with an exit code of 137 after adding a healthcheck script
[11:20:27] I initially thought it was because of the launcher ping issue, but it still crashloops after fixing that
[11:20:54] !log bsadowski1@tools-bastion-13 tools.stewardbots Restarted StewardBot/SULWatcher because of a connection loss
[11:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[11:21:05] :o
[11:41:00] sohom_datta what's the tool?
[11:41:23] The tool should be link-dispenser
[11:41:35] let me give it a look
[11:41:55] I'm just reading about this `launcher` thing https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Procfile
[11:42:15] @sohom_datta, I don't think you need to have the actual `launcher` keyword in your command
[11:42:20] it will be added internally by the system
[11:43:19] i.e., I believe your job's command should move from `launcher crawljob` to `crawljob`, and the healthcheck from `launcher ping` to `ping`
[11:44:29] it shouldn't hurt either way, though; launcher is capable of running nested
[11:45:17] is launcher responsible for going to the Procfile and executing the required entry?
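(To make the Procfile/launcher relationship above concrete, a minimal sketch follows; it is not link-dispenser's actual configuration: the entry names, script paths, and image tag are illustrative, and the `--health-check-script` flag is the same one used in the test later in this log.)

```bash
# Hypothetical Procfile: at build time each entry becomes its own launcher-wrapped
# executable, so jobs and healthchecks can reference the entry name directly.
cat > Procfile <<'EOF'
crawljob: ./scripts/start_celery.sh
ping: ./scripts/ping_celery.sh
EOF

# The job then refers to the entry names; prefixing "launcher" is redundant but
# harmless, since launcher can run nested. The image tag format is an assumption.
toolforge jobs run --continuous \
    --image tool-link-dispenser/tool-link-dispenser:latest \
    --command crawljob \
    --health-check-script ping \
    crawljob
```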
[11:45:37] the job defines
[11:45:38] | Health check: | script: launcher ping
[11:45:39] not really, it's responsible for setting the environment variables to change paths and such
[11:45:55] but k8s executes
[11:45:56] Startup probe failed: command "/bin/sh -c launcher ping" timed out
[11:46:09] and in the Procfile there is this
[11:46:10] https://gitlab.wikimedia.org/toolforge-repos/link-dispenser/-/blob/main/Procfile?ref_type=heads#L3
[11:46:44] there's one binary for each Procfile entry generated at build time
[11:46:46] https://www.irccloud.com/pastebin/Z2lRs23O/
[11:47:08] that does something similar to 'launcher'
[11:47:15] ok
[11:47:45] so `launcher ./scripts/ping_celery.sh` and `ping` should be equivalent in this case
[11:47:53] `launcher ping` should also be ok
[11:47:56] (and the same)
[11:48:04] ok
[11:48:17] it seems it was a timeout
[11:48:18] Warning Unhealthy 4m49s (x887 over 34m) kubelet Startup probe failed: command "/bin/sh -c launcher ping" timed out
[11:48:24] Is there a chance that it picks the unix utility instead?
[11:49:14] sohom_datta I don't think so; as it is, it should be picking up the Procfile entry. You can change the name to something that does not collide if you want (e.g. if you want to use the original ping, you'll have to pass the full path)
[11:50:03] I see that the check connects to celery (does a `celery ping`)
[11:50:48] how is it configured?
[11:51:55] The celery config should be pretty simple, it's run using https://gitlab.wikimedia.org/toolforge-repos/link-dispenser/-/blob/main/scripts/start_celery.sh?ref_type=heads
[11:52:57] hmm, I'm not sure what envvars are defined inside the liveness check, maybe `NOTDEV` is not defined there, let me check
[11:54:39] Yeah, NOTDEV switches the Redis URL from a local setup to the Wikimedia endpoint
[11:55:41] I'm thinking that it might be trying to connect to localhost and just timing out
[12:01:41] so I did a test, and I think that the healthcheck should be getting all the environment variables set properly
[12:02:31] local.tf-test@lima-kilo:~$ cat healthcheck.sh
[12:02:32] #!/bin/bash
[12:02:32] env > environment.out
[12:02:32] local.tf-test@lima-kilo:~$ toolforge jobs run --continuous --command 'while true; do sleep 10; date; done' --health-check-script '$TOOL_DATA_DIR/healthcheck.sh' --mount=all --image python3.11 test
[12:02:32] local.tf-test@lima-kilo:~$ cat environment.out
[12:02:33] ...
[12:02:33] TOOL_DATA_DIR=/data/project/tf-test <- this one is set by us
[12:02:54] hmpf, that should have been a paste, sorry
[12:04:27] running the healthcheck directly in the pod on k8s works too:
[12:04:39] https://www.irccloud.com/pastebin/b34oIB12/%60
[12:10:17] hmm, I think probes might have different connectivity than the pod itself
[12:13:28] nah, I was able to curl without issues
[12:20:26] BTW this account has apparently several builds recorded
[12:21:11] tools.link-dispenser@tools-bastion-12:~$ toolforge build list | wc -l
[12:21:11] 6
[12:21:30] It's for the same container, do you think it's running on an outdated build?
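(On the "outdated build" question: every successful build overwrites the same :latest tag, so one way to answer it is to compare the newest build against the image digest the pod is actually running. A rough sketch; the pod name is just an example, and the jsonpath is plain kubectl rather than anything Toolforge-specific.)

```bash
# Rough sketch: list the builds, find the running pod, and read the image digest
# the pod was actually started from.
toolforge build list                         # newest successful build is what :latest points at
kubectl get pods                             # find the pod name, e.g. crawljob-5687c95b7-g4hpf
kubectl get pod crawljob-5687c95b7-g4hpf \
    -o jsonpath='{.status.containerStatuses[0].imageID}'   # digest the pod is running
```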
[12:21:34] not sure if this is expected, or whether there should be some kind of cleanup
[12:22:05] @sohom_datta the destination is always tagged :latest, so I think it's fine
[12:22:22] I was asking more from the system point of view, whether the system should clean up the old ones
[12:23:59] It might be because there's a bunch of replicas just chilling around as well https://grafana.wmcloud.org/d/TJuKfnt4z/kubernetes-namespace?orgId=1&var-cluster=prometheus-tools&var-namespace=tool-link-dispenser and I don't know how to purge/remove them
[12:25:04] crawljob-5687c95b7-g4hpf should be the only one in use right now
[12:26:24] you also have a webservice, no?
[12:26:53] the replicasets are created from the deployment objects by Kubernetes. I think they are harmless
[12:27:30] Yes, the webservices should be the link-dispenser-*-* ones
[12:27:30] but if you really really wanted to get rid of them, you could stop the webservice and see if they go away with the deployment resource
[12:29:04] arturo: about the number of builds, I think we keep the last 4 failed ones, and 2 successful ones for reference
[12:29:34] they all push to the same image (:latest), so only the last successful build is in the image (it gets overwritten)
[12:29:58] ok, I think there are currently 5, all of them successful
[12:31:22] I'm planning to leave the replicasets alone unless Kubernetes tells me something about it
[12:32:47] hmm, I think that the timeout for health probes might be too aggressive (1s)
[12:33:09] Sohom_Datta: yep, replicasets are generated automatically by k8s, should be ok to let them be
[12:33:43] the liveness probe takes ~2s
[12:33:56] hmm, I thought we had increased that at some point
[12:35:29] oh, I think it's the default from k8s
[12:35:37] we should increase that a bit
[12:38:00] It would be nice to be able to manually set it in the yaml
[12:38:17] (Also maybe the frequency as well)
[12:39:11] yep, manually changed it to 3s in the deployment object and it's running now, I'll send a patch
[12:39:56] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/79 <- this should do the trick
[12:39:58] I think we could try to make the system smarter, rather than exposing more options to users. Otherwise, eventually, we will end up with the full range of k8s options exposed via the different toolforge interfaces
[12:41:00] dcaro: LGTM
[12:54:27] !log anticomposite@tools-bastion-13 tools.stewardbots ./stewardbots/StewardBot/manage.sh restart # SULWatcher3 not coming back up
[12:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[13:04:11] Sohom_Datta: any new deployment you make now should have the new timeout values :)
[13:04:16] (just deployed the patch)
[13:04:48] !log anticomposite@tools-bastion-13 tools.stewardbots SULWatcher/manage.sh restart # SULWatcher3 disconnected
[13:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[13:10:33] It appears to be holding on for now 🎉
[13:10:58] Thank you so much for debugging and taking a look :)
[13:12:20] !log anticomposite@tools-bastion-13 tools.stewardbots deploy ed1afde
[13:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[21:03:53] bridgebot is going to be down for a little while as I convert it over to using buildservice images and the jobs service instead of its current hand-built Deployments. If all goes well y'all shouldn't really notice anything different when I'm done. If it goes badly, well then I guess I'm not done yet. ;)
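(For reference, a rough sketch of the kind of setup bd808 describes here: a buildservice image plus a continuous job replacing the hand-built Deployment. The repository URL, Procfile entry name, and image tag are guesses, not the actual bridgebot configuration.)

```bash
# Guessed shape of the migration: build an image from the tool's git repo, then
# run it as a continuous job managed by the jobs framework.
toolforge build start https://gitlab.wikimedia.org/toolforge-repos/bridgebot
toolforge jobs run --continuous \
    --image tool-bridgebot/tool-bridgebot:latest \
    --command bridgebot \
    bridgebot
```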
[21:05:37] !log bd808@tools-bastion-12 tools.bridgebot Switched from legacy system to buildservice and jobs configuration (T363028)
[21:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[21:14:47] grrr... there is a bug somewhere in my matterbridge config file, but the error message is pretty much just that. :/
[21:15:06] :(
[21:21:29] bd808: bridgebot.toml has an unclosed quote on line 194, I believe (account="irc.lib)
[21:21:37] (and apologies if you already debugged past that point by now)
[21:22:02] lucaswerkmeister: yeah. no worries. I just found that myself :)
[21:22:07] ok :)
[21:22:55] * bd808 live hacks that
[21:23:38] I think the bridge should be up now...
[21:24:10] irc->telegram seems to be working (even catching up on some old messages), telegram->irc not yet
[21:24:39] oh hey the bridge’s still alive?
[21:24:45] aha there it goes
[21:24:52] laaaaag
[21:25:01] test test
[21:25:15] hmmmm... only one way?
[21:25:37] ok. I'll patch the repo, rebuild, and restart, and then we can start wondering if it will keep working :)
[21:31:05] !log bd808@tools-bastion-12 tools.bridgebot built new image from f4022bd9 (T363028)
[21:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[21:32:26] !log bd808@tools-bastion-12 tools.bridgebot Started `bridgebot` job (T363028)
[21:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[21:36:22] Ok, the bridges seem to all be up and stable at this point. I hope the lag before was just because the `webservice shell` I was debugging from needed a bit more RAM and CPU (which it should have now)
[23:37:20] \o/ \o/ \o/
[23:37:25] sounds awesome, thanks a lot!
[23:38:37] I see there’s a service.template file in the tool’s home dir, does that do anything? (I don’t see a running webservice and IIUC www/static/ is served by tools-static instead)
[23:39:09] it makes running `webservice shell` easier, but that's all
[23:40:35] I'm going to test the potential message doubling fix later in my evening by forcing ZNC to connect to a new upstream server.
[23:40:52] First I need a snack though :)
[23:43:39] ah, I see ^^
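(A footnote on the bridgebot.toml unclosed-quote hunt above: a quick syntax check like the following would surface that kind of typo before a restart. This is a sketch that assumes a Python 3.11+ interpreter is available, since it uses the stdlib tomllib module.)

```bash
# Parse the config with Python's stdlib TOML parser; a syntax error such as an
# unclosed quote raises a TOMLDecodeError that reports the offending line.
python3 -c 'import sys, tomllib; tomllib.load(open(sys.argv[1], "rb")); print("ok")' bridgebot.toml
```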