[00:31:40] tgr_: a floating ip would be needed, or we would need to set up something new/custom with the shared proxy. Floating ip would be easier.
[00:38:31] Thanks! Filed T315198 about it.
[00:38:32] T315198: Request floating IP for matrix Cloud VPS project - https://phabricator.wikimedia.org/T315198
[15:39:49] Hello.
[15:40:02] I've got a stuck job and it has no queue name.
[15:40:16] hi! which tool?
[15:42:16] dbreps is the "become" tool.
[15:42:41] "qstat -xml" shows the pending job.
[15:42:45] But with no queue name.
[15:42:58] So I can't SSH to a host to kill it manually like the docs say.
[15:43:20] It's also been stuck for like four days. I'm wondering if there's an easy way to add a timeout.
[15:43:34] tools.dbreps
[15:43:34] Eqw
[15:43:34] 2022-08-11T10:00:12
[15:43:34]
[15:47:36] It looks like there are 2 jobs in the Eqw state for that tool and both failed to queue due to a transient LDAP lookup error.
[15:48:07] Penelope: have you tried to `qdel` those yourself yet?
[15:48:55] !log tools.bridgebot Double IRC messages to other bridges
[15:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[15:48:59] I have not. I didn't see that under https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stuck_jobs
[15:49:05] But I see it in the section above now!!
[15:49:49] qdel worked, thank you!!
[15:49:52] awesome
[15:50:21] Any way to add a timeout to jsub? Or do people wrap in `timeout`?
[15:50:30] legoktm is using this:
[15:50:33] @hourly jsub -once -quiet -mem 2G -N rusty /data/project/dbreps/src/database-reports/target/release/dbreps2
[15:50:58] Which then iterates through and runs reports that need updating. But if the "rusty" job gets stuck, the reports back up.
[15:51:33] I see `timeout` is installed so maybe wrapping in that is easiest?
[15:52:00] the issue with both of these was that the submission failed from some kind of LDAP lookup hiccup.
I don't think there is any way to tell grid engine "try to submit this, but if submission fails then automatically delete the failed submission tombstone after N minutes"
[15:52:56] the tombstone would have kept additional jobs from firing because of the `-once` flag on the submission
[15:53:54] -l h_rt= ?
[15:54:04] I wonder if that would help.
[15:54:08] From https://bioinformatics.mdc-berlin.de/intro2UnixandSGE/sun_grid_engine_for_beginners/how_to_submit_a_job_using_qsub.html and such.
[15:54:48] I constantly forget about "qstat -xml" btw and I find the truncated job names in regular "qstat" output infuriating.
[15:55:07] that would stop a job that was running and kept running, but I don't think it would do anything to change an Eqw failure
[15:55:25] Gotta set up another job engine that watches the o.g. ;-)
[15:55:27] everything about grid engine is infuriating ;)
[15:55:36] Hah, for sure.
[15:55:53] At least the XML output is verbose.
[15:55:59] Thanks for the qdel tip.
[15:57:45] Penelope: I think we should be able to move this over to kubernetes (`toolforge-jobs`), which hopefully lets us ignore all the grid engine issues
[15:57:57] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework is our beta quality replacement for grid engine with Kubernetes as the backend runtime. It can take over for many grid jobs today, but not all yet. Jobs that need multiple runtime languages (like PHP & Ruby at the same time) or external binaries/libraries (like image processing things) are not supported yet.
[15:58:20] yep, that
[15:58:25] I know Kubernetes fairly well from my professional life, so no objection from me.
[15:58:32] Tho it also has dumb behavior.
[15:58:43] I'm pretty sure that this is all pure Rust so it should work fine in the `standalone` container
[15:58:44] Where like you gotta specify "-o json" to get useful info.
[15:58:57] Err, "-o yaml", whatever.
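Editorial sketch, not part of the original conversation: the discussion above notes that neither `timeout` nor `-l h_rt=` would clear a submission stuck in Eqw, and that the `-once` flag then blocks later runs until someone runs `qdel`. A rough watchdog along the lines of the "another job engine that watches" joke might look like the following. The function name `eqw_job_ids` is hypothetical, and the `JB_job_number` and `state` element names are an assumption about what `qstat -xml` emits on the grid; check the real output before relying on this.

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -xml` style output on stdin and print the
# job number of every job whose state is Eqw.
# Assumes each job lists <JB_job_number> before <state>, one element per line.
eqw_job_ids() {
  awk '
    /<JB_job_number>/ { gsub(/[^0-9]/, ""); id = $0 }        # remember the latest job number
    /<state>Eqw<\/state>/ && id != "" { print id; id = "" }  # emit it when that job is in Eqw
  '
}

# Demo on made-up sample data (not real grid output).
eqw_job_ids <<'EOF'
<job_list state="pending">
  <JB_job_number>1001</JB_job_number>
  <state>Eqw</state>
</job_list>
<job_list state="running">
  <JB_job_number>1002</JB_job_number>
  <state>r</state>
</job_list>
EOF

# On the grid this could be wired up roughly as (untested assumption, GNU xargs):
#   qstat -xml | eqw_job_ids | xargs -r qdel
```

A job that starts but then hangs is a separate problem; for that case, wrapping the command in coreutils `timeout` inside the `jsub` line, as suggested in the chat, is probably the simpler fix.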
[16:19:13] SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2622)'))': /hosts/clouddumps1001.wikimedia.org/update
[16:22:22] bd808 just had the same issue with one of my tasks (stuck in Eqw state)
[16:22:26] Had to be deleted manually
[17:24:57] * dcaro off
[22:50:02] I need help with my communication
[23:43:27] * Platonides suggests as a first step, that he should stay longer in the channel