[07:28:23] morning
[08:30:50] !log harbor on tools down for upgrade (T346241)
[08:30:51] blancadesal: Unknown project "harbor"
[08:30:52] T346241: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241
[08:31:53] !log tools taking harbor down for upgrade (T346241)
[08:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:08:53] !log tools harbor up again and upgraded from 2.5 to 2.9 (T346241)
[09:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:08:57] T346241: Upgrade harbor from 2.5 to 2.9 - https://phabricator.wikimedia.org/T346241
[09:59:35] * dhinus paged [FIRING:1] ToolsToolsDBWritableState tools (page wmcs)
[09:59:56] let's see if it's the same issue as yesterday
[10:00:25] hmm I cannot ssh
[10:01:36] fnegri@tools-db-1.tools.eqiad1.wikimedia.cloud: Permission denied (publickey,hostbased).
[10:03:01] virsh console works and I'm in the VM
[10:03:38] mariadb is up and running
[10:05:51] oh, there was an issue before on another VM where sssd started failing
[10:06:04] and made ldap-based auth not work (e.g. ssh with regular users)
[10:08:22] looking at the sssd-nss service mentioned in -cloud
[10:08:43] the task was T349681
[10:08:44] T349681: Ssh / user issue with integration-agent-docker-1057.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T349681
[10:08:57] status is failed but I cannot "systemctl restart"
[10:09:47] 'systemctl start sssd-nss.socket' mentioned in the phab worked
[10:10:03] I can ssh again
[10:10:39] you can also try first resetting the failed counter (`systemctl reset-failed sssd-nss`)
[10:12:27] I can connect to MariaDB from the Toolforge bastion, but the alert is still firing
[10:13:25] read_only is set to on
[10:14:18] mariadb crashed and restarted automatically, but in read-only mode
[10:15:39] I've disabled read-only mode now
[10:15:44] I think that's usual behavior, to allow an admin to double-check
[10:16:06] yes it is
[10:16:19] I wonder why it crashed though, it seems independent from the SSH issue
[10:16:45] the only log line is "systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL"
[10:16:54] the alerts have no runbook though, I think anything with a page should have one
[10:17:41] yes, I was planning to write one after the crash yesterday
[10:17:48] [Wed Oct 25 09:48:25 2023] Out of memory: Killed process 1853830 (mysqld) total-vm:67422936kB, anon-rss:64695060kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:128052kB oom_score_adj:-600
[10:17:51] from dmesg
[10:18:05] that matches also the suspicion for ssd failing
[10:18:08] *sssd
[10:18:14] I will open a single phab for both crashes, the reason could be the same even if the log messages are different
[10:18:34] yesterday it logged an oom error, today maybe it crashed before it could log it
[10:18:41] https://github.com/SSSD/sssd/issues/6219 <- upstream issue on sssd getting killed (by its own watchdog though) and then failing to start by itself
[10:19:34] thanks for the runbook :)
[10:21:32] we could try tweaking the oom killer affinity to keep mysqld from getting killed first
[10:21:47] (though I suspect nothing else really takes much memory xd)
[10:22:30] from the logs it seems the next candidate would be rsyslogd, with a mere 21M...
hehehehe, not a lot
[10:25:28] this looks interesting also: https://mariadb.com/kb/en/mariadb-memory-allocation/
[10:32:04] I created T349695
[10:32:05] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[10:50:30] Runbook first draft: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState
[11:00:28] !log admin update cloudcumins to spicerack 8.x
[11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:03:15] I updated the prometheus config for the alert with a link to the runbook
[11:03:23] * dhinus lunch
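Condensing the ToolsDB recovery steps from the exchange above into a rough sketch, assuming a root shell on the ToolsDB VM (via virsh console or ssh); the systemctl commands are the ones quoted in the conversation, the MariaDB and dmesg commands are standard but were not pasted verbatim here:

    # sssd-nss failed and took LDAP-backed ssh logins with it; clear the failure state and start it again
    systemctl reset-failed sssd-nss
    systemctl start sssd-nss.socket
    # after the crash MariaDB came back read-only; confirm, then re-enable writes once the data looks sane
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'read_only';"
    mysql -e "SET GLOBAL read_only = OFF;"
    # check whether the OOM killer was the cause of the crash
    dmesg -T | grep -i 'out of memory'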
[12:35:04] hello. New to the modern tool hosting infrastructure. I'm trying to set up a web service for a tool in Ruby using kubernetes. Following the instructions, I ran (as my tool)
[12:35:04] $ webservice --backend kubernetes ruby3.1 start ./start.sh
[12:35:06] (start.sh starts the Ruby server with the specified $PORT)
[12:35:07]
[12:35:09] and got:
[12:35:10] Starting webservice...............
[12:35:12] and then nothing. I don't see anything in the logs directory, don't see a running ruby process, and don't know how to proceed.
[12:36:16] hi. which tool?
[12:37:01] luthor
[12:40:16] `webservice logs` reveals an error message, `/data/project/luthor/start.sh: No such file or directory`. the webservice process starts in the tool home directory, so I think you want to replace `./start.sh` with `./luthor/start.sh`.
[12:41:12] ah! thank you.
[12:41:13] (Why is that error message not output anywhere?)
[12:42:43] not sure. displaying it is a great idea, I will file a task.
[12:43:19] abartov: you might want to try using the newer buildservice https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[12:43:37] especially if you are used to other paas like heroku or gce
[12:46:34] taavi: anything showing now in the error log? I still don't see the tool running, and nothing in the tool's own logs.
[12:47:04] abartov: you can run `toolforge webservice logs` yourself to check I think
[12:47:13] are you saying you don't see anything when running `webservice logs`? I see a rails-related error
[12:50:07] oh, thank you! Yes, now I see it. Okay, thanks.
[12:51:31] So, I don't know anything about kubernetes — how do I ensure that the runtime environment has the packages I need? I have installed them (the Ruby gems) from the command line and my server runs from the tool's own command line, but I guess the kubernetes environment is a different container?
[12:58:08] yeah, it runs in a separate container. I'm not very familiar with ruby, but I believe there's something like `bundle exec` which will run the command you want with bundler-installed dependencies. `webservice ruby3.1 shell` gives you a shell inside a container that lets you troubleshoot that
[12:59:53] the build service that david linked is an option too. in that case, instead of mounting the file system from NFS like now, you give it a git repository and it will create a custom container with the dependencies installed and available. it's newer so it has more rough edges as of now, but it gives you more flexibility with installing dependencies etc and it should be more reliable long-term
[13:11:11] excellent, thank you. The shell is what I needed. (re @wmtelegram_bot: yeah, it runs in a separate container. I'm not very familiar with ruby, but I believe there's something like `bundle exe...)
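For the dependency question above, a minimal sketch of what the start script could look like, assuming a Rails-style app; only `bundle exec`, `$PORT` and the `./luthor/start.sh` path come from the conversation, the server command and bind address are assumptions:

    #!/bin/bash
    # ./luthor/start.sh -- webservice launches this from the tool's home directory
    set -euo pipefail
    cd "$(dirname "$0")"
    # bundle exec runs the server with the gems installed from the Gemfile; those gems
    # need to have been installed with the container's Ruby, e.g. inside `webservice ruby3.1 shell`
    exec bundle exec rails server -b 0.0.0.0 -p "$PORT"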
[15:12:51] is there some cpu-time limitation killing processes in that shell? I am installing a gem that has C++ extensions (a CSS parser), and after compiling for a while, it is just KILLed by a signal.
[15:43:03] so I managed to get past this by adding memory temporarily, and it compiled fine.
[15:43:46] A new issue: Running 'webservice ruby3.1 shell' now does nothing — I am returned to my bastion shell with no message at all. Previous runs of the shell did work and put me in the ruby3.1 shell. How can I debug this?
[15:49:46] abartov: I see two shell pods running, do you have maybe multiple tabs open?
[15:50:01] (or a tmux/screen session with those)
[15:58:49] I don't! Can you kill them please?
[15:59:20] ack
[16:00:08] it's interesting though that they got stuck, we probably want to allow users to clean them up, I'll open a task for it
[16:00:20] yes, thank you.
[16:00:34] Also, is there an equivalent of a ps(1) command I could have used to detect it myself?
[16:01:01] hmm, still can't get into the ruby3.1 shell.
[16:01:11] kubectl get pods
[16:01:27] kubectl delete pod to delete the shell pods
[16:03:09] thank you!
[16:05:07] killed one, but am now getting:
[16:05:07]
[16:05:09] tools.luthor@tools-sgebastion-10:~/luthor$ webservice ruby3.1 shell
[16:05:10] Error attaching, falling back to logs: unable to upgrade connection: container shell-1698249873 not found in pod shell-1698249873_tool-luthor
[16:05:29] and:
[16:05:30] tools.luthor@tools-sgebastion-10:~/luthor$ kubectl get pods
[16:05:31] NAME                      READY   STATUS             RESTARTS      AGE
[16:05:33] luthor-67f6bdc96f-9mknf   0/1     CrashLoopBackOff   4 (23s ago)   115s
[16:11:38] abartov: getting into a meeting sorry, I can try to help later, please open a task if you don't find a solution for us to check and keep track
[16:12:46] thanks. I will appreciate help when you're done with your meeting. This CrashLoopBackOff with a rising RESTARTS count sounds problematic.
[16:13:31] deleting the pod name that is in CrashLoopBackOff claims to succeed, but instantly starts another one.
[16:13:45] (with a different hash in the name)
[16:14:19] yes, that pod is the one created by the webservice start, you can webservice stop to keep it from respawning
[16:21:20] thanks! (Only) after stopping it, I was able to start a ruby3.1 shell.
[16:21:55] hmm... interesting
[17:14:05] !log tools.wdrc chmod +x /data/project/wdrc/public_html/api.php # debug T349687
[17:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wdrc/SAL
[17:15:08] !log tools.wdrc chmod -x /data/project/wdrc/public_html/api.php # my bad (cc T349687)
[17:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wdrc/SAL
[17:15:22] !log tools.wdrc chmod g+w /data/project/wdrc/public_html/api.php # debug T349687
[17:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wdrc/SAL
[17:59:49] !log tools.lexeme-forms deployed cdb1d34e11 (Werkzeug 3.0.1)
[17:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[18:05:25] apparently my pod lexeme-forms-7956595b8f-dzf29 (lexeme-forms tool) can’t be killed o_O
[18:05:27] cannot stop container: de0203a61de9e5ec569d9b44c562a1bc0c689ba1d55fff81878648c5ad4fc663: tried to kill container, but did not receive an exit event
[18:05:43] doesn’t really bother me (the new pod is up and running), but probably not ideal in the longer term?
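A short recap of the pod-troubleshooting commands from the exchange above, run as the tool user on the bastion; the pod names are the ones seen in this conversation and will differ for other tools, and `kubectl logs` was not mentioned above but is a common next step for a CrashLoopBackOff:

    kubectl get pods                        # the ps(1) equivalent: lists the tool's pods, including stuck shell pods
    kubectl delete pod shell-1698249873     # remove a stuck interactive shell pod
    kubectl logs luthor-67f6bdc96f-9mknf    # inspect why the webservice pod keeps crashing
    webservice stop                         # the webservice pod is respawned by its deployment, so stop it this way rather than deleting it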
[18:11:27] !log tools.wd-image-positions deployed f7a426b0de (Werkzeug 3.0.1)
[18:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-image-positions/SAL
[18:16:33] !log tools.speedpatrolling deployed f018478b2a (Werkzeug 3.0.1)
[18:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.speedpatrolling/SAL
[18:22:38] !log tools.quickcategories deployed d817ca5f60 (Werkzeug 3.0.1)
[18:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[18:27:34] !log tools.pagepile-visual-filter deployed db35589388 (Werkzeug 3.0.1)
[18:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.pagepile-visual-filter/SAL
[18:33:09] !log tools.ranker deployed 3bd9718d9a (Werkzeug 3.0.1)
[18:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ranker/SAL
[18:35:53] !log tools.translate-link deployed c744b4852f (Werkzeug 3.0.1)
[18:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.translate-link/SAL
[19:24:27] has anyone encountered this node error?
[19:24:27] "Error: error:0308010C:digital envelope routines::unsupported"
[19:24:28]
[19:24:30] Apparently to do with the openssl fix post Node 16.x. I see we have Node 18.x in the tool shell. The Internet tells me that the lazy way to make this go away is to run with "legacy openssl", but that sounds... *too* lazy. Does anyone have any other pointers?
[19:24:31]
[19:24:33] (I don't know almost anything about node and am trying not to touch it with a 10-foot pole.)
[20:42:45] okay, for now I *did* use "legacy openssl", sigh. It's just for precompilation stuff anyway, so no actual vulnerability.
[20:59:14] !log tools.wd-image-positions deployed 7aec0c0b10 (use Codex 1.0.0)
[20:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-image-positions/SAL
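For the Node error discussed at 19:24, "legacy openssl" usually refers to Node's --openssl-legacy-provider switch, which re-enables the hash algorithms that OpenSSL 3 (used by Node 17+) removed from the default provider. A sketch under that assumption; the asset-precompilation command is only an assumed example of the "precompilation stuff" mentioned above:

    # hypothetical build invocation; NODE_OPTIONS carries the actual workaround being discussed
    NODE_OPTIONS=--openssl-legacy-provider bundle exec rake assets:precompile

The longer-term fix is usually to upgrade the JavaScript build tooling (typically webpack) to a version that no longer relies on those removed algorithms.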