[01:29:30] !log bd808@tools-bastion-14 tools.schedule-deployment Built new image from 801a3f49 (T409536, T411128)
[01:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.schedule-deployment/SAL
[01:30:23] !log bd808@tools-bastion-14 tools.schedule-deployment Restart to pick up new container image (T409536, T411128)
[01:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.schedule-deployment/SAL
[11:25:39] !log wikiqlever Adjusted the ports and started the qlever server
[11:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikiqlever/SAL
[11:26:43] !log wikiqlever service available from https://qlever-backend-demo1.wmcloud.org/ (backend) pointing to qlever1 port 7019 https://qlever-ui-demo1.wmcloud.org (frontend) pointing to qlever1 port 8176
[11:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikiqlever/SAL
[15:22:52] Any updates? (re @Yetkin: Is there any mechanism to automatically delete pods after [x] days? I just realized that there were a few pods running for over ...)
[15:26:20] @yetkin if it was not a one-off and you again find shell pods that you have to delete manually, can you please open a bug report in phab?
[15:27:59] how did you create those pods? with "webservice shell" or with kubectl or else?
[16:53:17] I always create my pods using the "webservice shell" command (re @wmtelegram_bot: how did you create those pods? with "webservice shell" or with kubectl or else?)
[16:54:32] Also, those pods were eating up my memory quota. Is there any way to get notified when the available memory quota drops to a certain amount?
[16:55:00] @yetkin no, we don't have that type of notification, but it could be an interesting feature request :)
[17:03:18] @yetkin can you please open a phab task describing the "webservice shell" issue, where pods are never deleted? it's something that we can look into, but I don't think there is a quick fix
[17:04:23] in the meantime, if you encounter this problem often, you could maybe write a script that runs "kubectl get pods" and cleans up any leftover shell pods (a rough sketch follows below)
[17:07:30] I wonder if anyone would object if we put a hard limit on the lifetime of a `webservice shell` pod, given that they’re only supposed to be used interactively
[17:07:37] (insert https://xkcd.com/1172/ reference here)
[17:08:07] though judging by https://github.com/kubernetes-sigs/descheduler#podlifetime / https://github.com/ptagr/pod-reaper / https://github.com/nuetoban/pod-lifetime-limiter, it sounds like there’s no built-in way to do that, and adding a component to our k8s setup is probably not worth the trouble
[17:09:31] This looks good (re @wmtelegram_bot: I wonder if anyone would object if we put a hard limit on the lifetime of a `webservice shell` pod, given tha...)
[18:27:36] !log soda@tools-bastion-15 tools.yapping-sodium soda built and uploaded a new version
[18:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yapping-sodium/SAL
[20:19:13] Hello, I ran this command:
[20:19:14] toolforge-jobs restart k8s-20251126.collapsible-option-to-navbox-doc
[20:19:14] and got this result:
[20:19:15] ERROR: TjfCliError: Failed to create a job, likely an internal bug in the jobs framework.
[20:19:15] ERROR: Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu
[20:19:16] Can someone help me resolve this?
[20:21:25] Oh, maybe it's a memory issue.
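
The cleanup script suggested at [17:04:23] could look something like the sketch below. This is a hypothetical illustration, not a supported Toolforge feature: it assumes that leftover `webservice shell` pods can be recognised by having "shell" in their name and that anything older than two days is safe to delete. Check what `kubectl get pods` actually shows for your tool and adjust the filter and cutoff first, then run it by hand once before scheduling it from the tool account.

```python
#!/usr/bin/env python3
"""Delete long-running `webservice shell` pods (hypothetical sketch)."""
import datetime
import json
import subprocess

MAX_AGE = datetime.timedelta(days=2)  # arbitrary cutoff; pick your own

# `kubectl get pods -o json` gives machine-readable output, including each
# pod's creation timestamp, which is easier to handle than the AGE column.
pods = json.loads(
    subprocess.check_output(["kubectl", "get", "pods", "-o", "json"])
)["items"]

now = datetime.datetime.now(datetime.timezone.utc)
for pod in pods:
    name = pod["metadata"]["name"]
    if "shell" not in name:  # assumption: only interactive shell pods match
        continue
    created = datetime.datetime.fromisoformat(
        pod["metadata"]["creationTimestamp"].replace("Z", "+00:00")
    )
    if now - created > MAX_AGE:
        print(f"Deleting leftover shell pod {name} (created {created})")
        subprocess.run(["kubectl", "delete", "pod", name], check=True)
```
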
[20:21:26] Could we improve the error message to help developers better understand where the problem is occurring?
[20:21:26] toolforge-jobs quota
[20:21:27] Running jobs                                  Used    Limit
[20:21:27] --------------------------------------------  ------  -------
[20:21:28] Total running jobs at once (Kubernetes pods)  7       16
[20:21:28] Running one-off and cron jobs                 15      15
[20:21:29] CPU                                           1.75    16.0
[20:21:29] Memory                                        31.0Gi  32.0Gi
[20:21:30] Per-job limits    Used    Limit
[20:21:30] ----------------  ------  -------
[20:21:31] CPU                       3.0
[20:21:31] Memory                    8.0Gi
[20:21:32] Job definitions                           Used    Limit
[20:21:32] ----------------------------------------  ------  -------
[20:21:33] Cron jobs                                 108     128
[20:21:33] Continuous jobs (including web services)  3       16
[20:22:00] that’s a lot of messages, please use a pastebin for long output (e.g. https://paste.toolforge.org/)
[20:22:39] which tool is this in?
[20:22:53] cewbot
[20:22:55] what lucas says, and please run the restart command with `toolforge-jobs --debug`
[20:24:28] toolforge-jobs --debug restart k8s-20251126.collapsible-option-to-navbox-doc
[20:24:28] [2026-01-09 20:23:37] [config.py] DEBUG: Unable to find config file /etc/toolforge/jobs-cli.yaml, skipping
[20:24:29] [2026-01-09 20:23:37] [config.py] DEBUG: Updating config from /etc/toolforge/common.yaml
[20:24:29] [2026-01-09 20:23:37] [config.py] DEBUG: Unable to find config file /data/project/cewbot/.toolforge.yaml, skipping
[20:24:30] [2026-01-09 20:23:37] [config.py] DEBUG: Unable to find config file /data/project/cewbot/.config/toolforge.yaml, skipping
[20:24:30] [2026-01-09 20:23:37] [config.py] DEBUG: Unable to find config file $XDG_CONFIG_HOME/toolforge.yaml, skipping
[20:24:31] [2026-01-09 20:23:37] [cli.py] DEBUG: session configuration generated correctly
[20:24:31] [2026-01-09 20:23:37] [connectionpool.py] DEBUG: Starting new HTTPS connection (1): api.svc.tools.eqiad1.wikimedia.cloud:30003
[20:24:32] [2026-01-09 20:23:37] [connectionpool.py] DEBUG: https://api.svc.tools.eqiad1.wikimedia.cloud:30003 "POST /jobs/v1/tool/cewbot/jobs/k8s-20251126.collapsible-option-to-navbox-doc/restart HTTP/1.1" 500 83
[20:24:32] [2026-01-09 20:23:37] [cli.py] ERROR: TjfCliError: Failed to create a job, likely an internal bug in the jobs framework.
[20:24:33] [2026-01-09 20:23:37] [cli.py] ERROR: Failed to create a job, likely an internal bug in the jobs framework.
[20:24:33] Traceback (most recent call last):
[20:24:34]   File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 136, in _make_request
[20:24:34]     response.raise_for_status()
[20:24:35]     ~~~~~~~~~~~~~~~~~~~~~~~~~^^
[20:24:35]   File "/usr/lib/python3/dist-packages/requests/models.py", line 1024, in raise_for_status
[20:24:36]     raise HTTPError(http_error_msg, response=self)
[20:24:36] requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cewbot/jobs/k8s-20251126.collapsible-option-to-navbox-doc/restart
[20:24:47] [2026-01-09 20:23:38] [errors.py] ERROR: Some additional context for the issue follows:
[20:24:48] [2026-01-09 20:23:38] [errors.py] ERROR: messages = {"error": ["Failed to create a job, likely an internal bug in the jobs framework."]}
[20:24:48] [2026-01-09 20:23:38] [cli.py] ERROR: Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu
[20:24:56] please don’t paste EVEN MORE output directly into IRC /o\
[20:25:52] sorry
[20:26:27] but it is indeed bad error handling of the "Running one-off and cron jobs" quota which you are hitting
[20:26:31] * taavi files task
[20:26:46] anyway, I think it’s a quote issue. you’re already at the limit (15) for count/jobs.batch, and requests.memory + limits.memory are also very close to the limit (probably close enough that another job won’t fit); a simple self-serve quota check is sketched at the end of this log
[20:26:52] s/quote/quota/
[20:26:55] thanks taavi
[20:34:31] T414229
[20:34:32] T414229: jobs-api does not properly handle quota errors when restarting a job - https://phabricator.wikimedia.org/T414229
[20:39:44] taavi, thank you
[21:57:02] !log urbanecm@tools-bastion-15 tools.stewardbots-legacy Deploy 02d1ed7
[21:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots-legacy/SAL
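
On the earlier question about getting notified when the memory quota runs low ([16:54:32]) and the quota exhaustion behind the failed restart above, a tool could approximate its own warning until something official exists. The sketch below is a hypothetical watchdog, not a supported interface: it shells out to `toolforge-jobs quota` and assumes the output format shown above (a `Memory <used>Gi <limit>Gi` row in the running-jobs table), which may change at any time. Run as a daily cron job, it would at least leave a warning in the job's log when memory usage gets close to the limit.

```python
#!/usr/bin/env python3
"""Warn when the tool's memory quota is nearly exhausted (hypothetical sketch)."""
import re
import subprocess

THRESHOLD = 0.9  # warn at 90% usage; adjust to taste

output = subprocess.check_output(["toolforge-jobs", "quota"], text=True)

for line in output.splitlines():
    # Match the running-jobs memory row, e.g. "Memory   31.0Gi  32.0Gi".
    # (The per-job "Memory" row only lists a limit, so it will not match.)
    match = re.match(r"Memory\s+([\d.]+)Gi\s+([\d.]+)Gi\s*$", line)
    if not match:
        continue
    used, limit = float(match.group(1)), float(match.group(2))
    if used / limit >= THRESHOLD:
        print(f"WARNING: memory quota nearly full: {used}Gi of {limit}Gi in use")
    break
```

The same pattern would work for the job-count rows; sending the warning by email to the tool's maintainers is left out here rather than guessing at the right mail setup.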