[01:53:27] any elasticsearch experts here?
[02:37:48] https://dontasktoask.com/
[07:04:15] 👍 (re @wmtelegram_bot: https://dontasktoask.com/)
[09:20:45] !log toolsbeta reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
[09:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:11:55] !log toolsbeta created a grid exec node toolsbeta-sgeexec-10-5.toolsbeta.eqiad1.wikimedia.cloud - cookbook ran by arturo@nostromo
[11:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[11:23:02] !log dashiki adding Reedy as projectadmin
[11:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dashiki/SAL
[13:46:20] My job seems to have been killed before it finished. Is there a max time for pods?
[13:46:46] which tool?
[13:48:07] https://postimg.cc/NKdvGchW
[13:48:17] itemsubjector tool account
[13:48:45] it ran for 65000 seconds and then had "Killed" in the .err log
[13:49:16] I can split it up into shorter tasks and run them sequentially if needed.
[13:57:54] I'm not (quickly) seeing anything in our infra limiting the lifetime for kubernetes jobs
[14:01:41] ok, I'll break up the task to keep it under 50,000 seconds and we will see if that works better. Thanks for taking a look 😃 (re @wmtelegram_bot: I'm not (quickly) seeing anything in our infra limiting the lifetime for kubernetes jobs)
[14:05:53] I can't seem to start new jobs :/
[14:07:32] do you get some error message?
[17:00:39] !admin I have a problem with a VPS project
[17:01:08] I have a huge list of users (project members) and I am not really sure where they come from
[17:01:51] ok, I am sorry, I was looking at the bastion project members... LOL
[17:01:54] my bad
[17:07:29] CristianCantoro: :) yea, it's normal to see all the users depending on how you look, because LDAP is used as a backend for all the VPSes
[17:07:46] then Horizon handles who can actually log in where, based on project
[17:35:18] mutante: yeah, sorry, I panicked for a moment
[17:40:13] CristianCantoro: no problem, it's good when server admins are paranoid
[20:35:12] nope, I just tried again and the only thing I get is "Failed" when running $toolforge-jobs list (re @wmtelegram_bot: do you get some error message?)
[20:35:38] I tried $kubectl get events but the output does not show any errors
[20:37:38] https://paste.debian.net/1226785/
[20:38:17] I get neither a .out nor a .err file for the job in the homedir, so it's hard to debug further
[20:42:48] which exact toolforge-jobs command did you try?
[20:48:07] toolforge-jobs run job$1 --image tf-python39 --command "~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r"
[20:48:17] the setup script is simple:
[20:48:46] $ cat ../setup.sh
[20:48:46] # adjust this if necessary
[20:48:48] cd ~/itemsubjector
[20:48:49] pip install -r requirements.txt
[20:49:11] it worked flawlessly until yesterday...
[20:50:15] the $1 is because it is run in a script that I invoke with the number of the job, like this: $ ./create_kubernettes_job_and_watch_the_log.sh 76 (re @DennisPriskorn: toolforge-jobs run job$1 --image tf-python39 --command "~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r")
[20:50:53] I have been running jobs for the last few months with no issues; as you can see, this is number 76 now
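
[editor's note] The wrapper script itself is never pasted in the log. Based on the toolforge-jobs command quoted at 20:48:07 and the ./create_kubernettes_job_and_watch_the_log.sh 76 invocation at 20:50:15, it might look roughly like the sketch below. This is a hypothetical reconstruction, not the user's actual script; in particular, following job$N.out / job$N.err in the tool's home directory assumes that is where toolforge-jobs writes the job output, which the log does not confirm.

    #!/bin/bash
    # create_kubernettes_job_and_watch_the_log.sh -- hypothetical reconstruction
    set -euo pipefail
    N="$1"   # job number, e.g. 76

    # submit the job under the Python 3.9 image: run the setup script, then itemsubjector
    toolforge-jobs run "job$N" --image tf-python39 \
        --command "~/setup.sh && python3 ~/itemsubjector/itemsubjector.py -r"

    # follow the output/error files as they appear (-F keeps retrying until they exist;
    # the file names are assumed, not taken from the log)
    tail -F "$HOME/job$N.out" "$HOME/job$N.err"
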
[21:08:16] I just tested running the itemsubjector.py script on the bastion for a few seconds and it works fine. So something is probably wrong with Kubernetes.
[21:08:16] It killed my long-running job this morning, and now all jobs I start fail with no output.
[22:44:51] /create_kubernettes_job_and_watch_the_log
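
[editor's note] To dig further into the failing jobs, a few commands could be run from the tool account; this is a sketch only. "job76" is assumed from the wrapper invocation quoted at 20:50:15, and the exact output of the event and pod listings depends on the cluster state at the time.

    toolforge-jobs list                           # high-level status of all jobs
    toolforge-jobs show job76                     # status and configuration of a single job
    kubectl get pods                              # pods backing the jobs and their phase
    kubectl describe pod <pod-name>               # per-pod events (OOMKilled, quota, image pull, ...)
    kubectl get events --sort-by=.lastTimestamp   # namespace events, newest last

For what it's worth, a bare "Killed" in the .err log after ~65000 seconds usually means the process received SIGKILL, for example from the kernel OOM killer when a memory limit is hit, rather than a job time limit.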