[00:37:15] !log admin updated nameservers for codfw1dev instances via 'openstack subnet set --dns-nameserver etc.'
[00:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[07:51:47] !log admin restart neutron-linuxbridge-agent.service on cloudvirt1034 T309732
[07:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[07:51:50] T309732: New SGE nodes can't talk to the grid engine master - https://phabricator.wikimedia.org/T309732
[08:08:18] taavi: oh, I think I saw that agent down the other day, but something else was on fire so I forgot about it, sorry
[10:16:57] !log tools publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653
[10:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:17:00] T277653: Toolforge: add Debian Buster to the grid and eliminate Debian Stretch - https://phabricator.wikimedia.org/T277653
[11:17:36] !log tools publish jobutils 1.44 that updates the grid default from stretch to buster T277653
[11:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:17:39] T277653: Toolforge: add Debian Buster to the grid and eliminate Debian Stretch - https://phabricator.wikimedia.org/T277653
[11:20:27] and now we wait a few moments for most of the remaining stretch jobs to move to buster
[11:36:35] !log tools refresh volume-admission-controller certs (T308402)
[11:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:36:38] T308402: toolforge: Refresh certs that are not controlled by kubeadm (mid 2022 edition) - https://phabricator.wikimedia.org/T308402
[11:42:33] !log tools refresh ingress-admission-controller certs (T308402)
[11:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:42:36] T308402: toolforge: Refresh certs that are not controlled by kubeadm (mid 2022 edition) -
https://phabricator.wikimedia.org/T308402
[11:47:27] !log tools refresh registry-admission-controller certs (T308402)
[11:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:03:48] !log tools refresh prometheus certs (T308402)
[12:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:03:51] T308402: toolforge: Refresh certs that are not controlled by kubeadm (mid 2022 edition) - https://phabricator.wikimedia.org/T308402
[12:52:24] Hmm … someone has changed /usr/bin/sql today
[12:52:27] $ sql de
[12:52:36] mysql: unknown variable 'defaults-file=/data/project/persondata/replica.my.cnf'
[12:53:10] -rwxr-xr-x 1 root root 3898 Jun 2 10:57 /usr/bin/sql
[12:59:51] Wurgl: indeed.. fixing, give me a second
[13:10:58] should work properly now, sorry about that
[13:12:00] It is still called experimental data processing ;^)
[13:12:01] Thanks
[13:46:58] I log in to dev-buster.toolforge.org, become the tool account, and then try to use sql local:
[13:46:58] tools.ru-monuments@tools-sgebastion-11:~$ sql local
[13:47:00] ERROR 2005 (HY000): Unknown MySQL server host 'tools.db.svc.eqiad.wikimedia.cloud' (-3)
[13:47:01] Any suggestions?
[13:50:46] if you are trying to create a database for use with your tool, you want to use the command "sql tools"
[13:52:59] hmm, doesn't work either when I just tried it
[13:55:15] @avsolov: looks like a bug introduced in some updates I deployed today.. I'll deploy a fix, give me a moment
[14:04:50] and the fix is now live
[14:12:18] thank you! it works now!
[17:02:27] !log paws scaling db-proxy to zero T309794
[17:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[17:02:30] T309794: Remove db-proxy - https://phabricator.wikimedia.org/T309794
[20:27:23] Hi everyone, it seems that since a few hours ago pdftex/pdflatex hasn't been working when invoked from web code, is this known?
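The "unknown variable 'defaults-file=...'" error earlier in the log is the classic symptom of that option reaching mysql without its `--` prefix, or after other options: mysql only honours `--defaults-file` when it is the first option on the command line. A minimal sketch of the command a wrapper like `/usr/bin/sql` has to build; `build_mysql_cmd`, the database name, and the argument handling are illustrative, not the actual wrapper code:

```shell
#!/bin/sh
# Sketch only: mysql accepts --defaults-file solely as the FIRST
# option and with its "--" prefix; passed any other way it fails with
# "unknown variable 'defaults-file=...'" as seen above.
# build_mysql_cmd is a hypothetical helper, not the real wrapper.
build_mysql_cmd() {
    cnf="$1"; host="$2"; db="$3"
    printf '%s\n' "mysql --defaults-file=$cnf -h $host $db"
}
build_mysql_cmd /data/project/persondata/replica.my.cnf \
    tools.db.svc.eqiad.wikimedia.cloud s12345__mydb
```

The "sql tools" advice later in the log amounts to the same invocation with the tool's own `replica.my.cnf` and the ToolsDB host.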
[20:27:56] (in a Toolforge web tool, that is)
[20:30:17] jem: My best guess is that you are seeing this in a cronjob and that it was caused by https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/CUWV6ML7NBLST2XE57BWYM6MV2FVQYOR/
[20:30:53] the default job queue was set to Debian Buster earlier today
[20:31:02] bd808: It's not a cronjob but that could be it
[20:31:24] I remember it, and it's in my comments that Kubernetes caused that problem
[20:32:14] * bd808 reads "when invoked from web code" and nods
[20:32:23] The point is that my web tools were migrated a few weeks ago
[20:33:00] This very morning things were ok
[20:33:05] did you set the -release on the jsub made from the web at that time too?
[20:34:11] The last command was "webservice --backend gridengine --release buster restart"
[20:34:11] the default OS type for jsub-started things was Stretch until 10:16 UTC today. now it is buster
[20:34:41] Ok, so it's not a coincidence
[20:34:51] ok. and the webservice invokes pdftex directly?
[20:35:16] No, it's invoked with shell_exec in the php code
[20:35:46] I've seen that it outputs just one or two lines and stops
[20:36:01] The same pdflatex works ok from the shell
[20:36:26] *nod* but a shell_exec of pdftex, not a shell_exec of jsub that then calls pdftex?
[20:37:10] Just shell_exec of pdftex
[20:37:44] with `--release buster` on the webservice you should be getting the same version of pdftex when running on the grid as you would get on the login.toolforge.org bastion.
[20:38:03] emphasis on "should", I guess, if you are seeing different behavior
[20:38:43] For sure it's different
[20:39:22] Anyway I will try with -release=stretch for the moment
[20:40:34] But obviously I will need some kind of permanent solution
[20:43:58] Didn't work either :/
[20:44:49] bd808: Any "fast" idea? I need to generate an image to be posted on Facebook at midnight
[20:45:20] jem: what's the tool name?
I have a bit of time to poke around
[20:46:00] Thanks, bd808
[20:46:01] It's jembot
[20:46:23] Do you need the exact URL of the sub-tool?
[20:46:33] yeah, that would be helpful
[20:46:41] Ok
[20:46:56] https://jembot.toolforge.org/ef/
[20:47:34] When you select an article and follow the steps, you should finally get an image
[20:49:04] You can create new ones for testing for Saturday the 4th, for example
[20:52:50] bd808: I'm having dinner in a few minutes but I'll be checking from my tablet
[20:52:53] And thanks again
[20:53:46] lots of confusing state in this tool right now jem. $HOME/service.manifest is in a state that would indicate that the tool is down. But it is running. qstat shows the running version to be in 'dr' deleted state (but it is obviously running). The job that is in dr state was started on 2022-04-09 and is running on a Buster node at least.
[20:54:04] Ugh
[20:54:16] Can you clean things up?
[20:54:45] yeah. I can force delete and then get the webservice running in a way that the state files make sense.
[20:54:51] I just stopped and started, but it's true that the answers were confusing
[20:54:55] Ok
[20:55:55] !log tools.jembot Force deleted stuck webservice job
[20:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jembot/SAL
[20:56:20] ummm.... and yet it is still running?
[21:00:00] very confused right now about where the webservice is actually running. the grid has lost track, but the front proxy apparently has not.
[21:00:29] :(
[21:04:50] !log tools.jembot Investigating webservice behavior
[21:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jembot/SAL
[21:04:53] I stopped and started from tools-sgebastion-10, if that helps
[21:13:29] this is so weird. usually if the front proxy gets confused like this it is possible to fix by starting and then stopping a grid webservice. The start should register a new backend ip/port and then stopping removes it again.
I've done that twice now and the front proxy is still pretty obviously pointed at the "lost" job
[21:13:38] * bd808 will keep poking
[21:14:44] * jem supports mentally
[21:33:15] !log tools.jembot Found orphan grid job by fetching host and port from front proxy redis. Killed related processes on tools-sgeweblight-10-4
[21:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jembot/SAL
[21:36:40] hello! there are two jobs on the `eranbot` account that appear to be stuck in the deletion state. Could someone force delete those? The job IDs are 9791260 and 9791582
[21:36:48] this is maddening... I'm now wondering if all grid engine webservices are messed up. I don't think the grid jobs are talking to the front proxy to register/unregister things
[21:38:19] !log tools.eranbot sudo qdel -f 9791260
[21:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.eranbot/SAL
[21:38:28] ah, I see you all are already talking about this issue
[21:38:43] !log tools.eranbot sudo qdel -f 9791582
[21:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.eranbot/SAL
[21:38:50] thanks!
[21:39:53] taavi: if you happen to be around, I could use some help reasoning about why jembot's webservice is being so confusing.
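The orphan-job hunt logged at 21:33:15 works because the front proxy keeps its routing table in redis independently of the grid, so the proxy still "knows" where a job the grid has lost is listening. A hedged sketch of that lookup: the `prefix:<tool>` hash name is an assumption about the dynamicproxy schema, and the backend URL below is a made-up example; the only real logic shown is splitting the stored value into the host and port to go kill processes on:

```shell
#!/bin/sh
# On the active proxy one would fetch the registered backend, e.g.:
#   backend=$(redis-cli hget prefix:jembot /)   # hash name is assumed
# Here we parse a hypothetical value the way the !log entry describes.
backend='http://172.16.0.87:40123'
hostport=${backend#http://}   # drop the scheme
host=${hostport%%:*}          # grid node IP to ssh to
port=${hostport##*:}          # port the stray webservice listens on
echo "$host $port"
```

With the host and port in hand, the stray processes on that node (tools-sgeweblight-10-4 in this case) can be found and killed even though the grid master no longer tracks the job.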
[21:41:56] !log tools Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
[21:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:48:40] puppet is taking approximately forever to run on tools-sgebastion-11 :/
[21:55:21] !log tools Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
[21:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:56:36] !log tools Removed legacy "active_proxy_host" hiera setting
[21:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:06:03] * jem is back from dinner
[22:09:25] jem: I haven't given up, but things are still mysterious. I found and killed the rogue webservice. I have not yet been able to get a new webservice to start, however. The job submits but gets stuck in queue wait state.
[22:09:49] :(
[22:10:03] Sorry for the troubles
[22:10:34] If I can help somehow just ping me
[22:10:49] I have been able to run other jobs (simple tests with `jsub date` and `jsub lsb_release -a`), so the whole grid is not broken. But maybe the webservice nodes are... still trying to figure out what's stopping the scheduler
[22:11:43] Ok, so probably a solution for the webservice nodes would also be a solution for me
[22:12:36] Because you said webservice should behave the same as the shell
[22:13:34] I am so out of practice at troubleshooting grid engine. It's been years.
[22:13:42] :(
[22:14:11] I stuck with the grid because of that very problem and because /home wasn't available
[22:14:40] But I think I've already linked or copied what I needed from /home
[22:15:00] yeah, I know your tool is one that needs us to make custom docker images work before it can move to kubernetes.
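Both the eranbot request and the stuck-webservice diagnosis above come down to spotting jobs wedged in `dr` (deletion requested, still running) or `qw` (queued, waiting) state. A sketch of that filter over `qstat` output; the sample lines are invented, but in gridengine's plain `qstat` listing the state sits in the fifth column:

```shell
#!/bin/sh
# Hypothetical qstat lines (job-ID  prio  name  user  state  date);
# against a live grid this would be:  qstat -u '*' | awk '...'
sample='9791260 0.25 lighttpd-j tools.eranbot dr 04/09/2022
9791582 0.25 crontab-er tools.eranbot dr 04/09/2022
9801234 0.30 lighttpd-j tools.jembot qw 06/02/2022'
printf '%s\n' "$sample" | awk '$5 == "dr" || $5 == "qw" { print $1, $5 }'
```

A job stuck in `dr` can then be removed with `sudo qdel -f <job-ID>`, exactly as in the !log entries for tools.eranbot above.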
[22:15:16] Sorry about that
[22:15:22] not your fault :)
[22:15:46] :) Anyway, I'm open to your guidance in order to make things easier for everyone
[22:24:45] * bd808 figures out that tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud is sick somehow
[22:26:39] !log tools Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
[22:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:27:26] Hitting it with the reboot stick seemed faster than trying to clean up manually
[22:29:16] Always that reboot stick :)
[22:41:06] "restored operation by resetting system to known good state"
[22:43:55] Hmmmm, that doesn't look good :(
[22:59:45] !log tools.fountain-test Stopped webservice which was stuck in queue wait state
[22:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fountain-test/SAL
[23:00:34] jem: the short version is that the problem is not just with your tool. The webservice buster job grid is a mess.
[23:00:54] Ok, bd808
[23:01:26] So I guess it's not fixable in the next minutes
[23:01:33] and I have a hunch that the problem you were having with pdftex was related to grid nodes being overloaded with "lost" jobs like yours
[23:01:42] Hmmm
[23:02:05] So I should retry now?
[23:02:41] (As you say "was" and not "is related")
[23:02:58] jem: you could try forcing your stuff to run on --release stretch I guess. Buster is actively busted and I'm still trying to understand why.
[23:03:11] Ok
[23:11:01] (And now I'm having problems with my wifi...)
[23:18:00] Ok, it was Firefox, not the wifi
[23:18:56] T309821 is the mess that is my notes so far
[23:18:56] T309821: Buster webservice grid went BOOM!
- https://phabricator.wikimedia.org/T309821
[23:23:33] And now a 504 on my whole tool :(
[23:26:25] jem: your tool's webservice job is still getting stuck in "qw" state (waiting to be given a place to run)
[23:26:39] Ugh
[23:26:56] So what should I do?
[23:27:49] jem: balloons and I are still trying to figure that out
[23:27:55] Ok
[23:28:19] I'll wait
[23:40:20] Anyway, as midnight UTC approaches, I'll prepare an emergency solution
[23:42:13] Edits via carrier pigeon