[00:01:50] AmandaNP: https://whois-referral.toolforge.org/ is alive again.
[00:02:26] thank you
[00:04:37] !log tools Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud (T335543)
[00:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:04:37] T335543: Pods getting stuck in "Terminating" status - https://phabricator.wikimedia.org/T335543
[00:07:08] !log tools Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon (T335543)
[00:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:09:41] !log tools `kubectl uncordon tools-k8s-worker-67` (T335543)
[00:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:09:44] T335543: Pods getting stuck in "Terminating" status - https://phabricator.wikimedia.org/T335543
[07:20:07] !log tools rebooting tools-sgegrid-shadow due to stale nfs mount
[07:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:27:06] !log tools rebooting tools-sgeweblight-10-28 (T335336)
[08:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:27:14] T335336: [toolschecker] jobs mtime check is flapping - https://phabricator.wikimedia.org/T335336
[09:44:38] /vi@wmtelegram_bot
[09:45:07] /vi@wmtelegram_bot
[10:37:17] !log tools.wdmm deployed ec0d05bdc3 (commonswiki vi)
[10:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wdmm/SAL
[15:01:46] !log tools force reboot tools-k8s-worker-79, unresponsive
[15:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:08:37] Some really weird issue in that my Rust bot process is sometimes running forever doing nothing now that I moved to toolforge-jobs
[16:08:41] it never had this issue on the grid
[16:08:52] But...
| Status: | Running for 2d4h7m |
[16:09:55] - command: /data/project/dbreps/src/database-reports/target/release/dbreps2
[16:09:55] + command: /usr/bin/timeout 59m /data/project/dbreps/src/database-reports/target/release/dbreps2
[16:10:29] it could be that kubernetes somehow lost track of the status of the process
[16:10:51] I guess we could be a bit more cloud-native and introduce some kind of liveness checks into toolforge jobs
[16:14:35] legoktm: T335592
[16:14:36] T335592: Toolforge jobs: consider having a way for jobs to report their liveness status to kubernetes - https://phabricator.wikimedia.org/T335592
[16:14:48] hmm
[16:15:00] this is the second time it's happened in less than a week, I would have expected more people to complain if it was that frequent of an issue?
[16:15:37] it's possible this is something related to this specific bot, my other Rust bot (significantly less complex) has been on toolforge-jobs for months now with no issue
[16:19:09] could we attach a debugger or similar to figure out what it's stuck on?
[20:22:21] !log tools.speedpatrolling deployed 3cb8ab732d (update dependencies, Flask/Werkzeug 2.3)
[20:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.speedpatrolling/SAL
[22:56:00] Does Quarry work with the Toolforge user DBs?
[22:56:10] Assuming the DB in question is a “_p”
[23:01:07] hm, I doubt it
[23:01:47] it’s not documented and the source code doesn’t look like it either (especially `quarry/web/replica.py`), but also that `git grep tools` doesn’t find much that looks relevant, and nothing for `tools.db`)