[02:53:46] Hi, is it possible to perform git operations with toolforge-jobs? It says git command not found even if I've specified the path…
[07:16:23] Guest62: I used the php74 image to work around it, that one happens to include git
[08:01:09] heads up, about to conduct network operations on Cloud VPS that could affect all cloud egress/ingress traffic
[08:20:33] !log admin [codfw1dev] trying with python3-dnspython 2.2.0-2 installed by hand on cloudcontrol servers (T305157)
[08:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:20:36] T305157: Openstack Wallaby on Debian 11 Bullseye problems because eventlet and dnspython - https://phabricator.wikimedia.org/T305157
[08:24:13] !log admin [codfw1dev] trying with python3-dnspython 2.2.0-2 installed by hand on cloudvirt2003-dev (T305157)
[08:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:42:23] !log admin [codfw1dev] trying with python3-eventlet 0.30.2-5 installed by hand on cloudcontrol servers (T305157)
[08:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:42:26] T305157: Openstack Wallaby on Debian 11 Bullseye problems because eventlet and dnspython - https://phabricator.wikimedia.org/T305157
[08:45:18] !log admin [codfw1dev] trying with python3-eventlet 0.30.2-5 installed by hand on cloudvirt2003-dev (T305157)
[08:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:47:05] !log admin T304598 failover cloudgw1002 (now standby) into cloudgw1001 (now active)
[08:58:56] !log admin [codfw1dev] downgraded python3-dnspython to the standard 2.0.0-1 on cloudcontrol servers; it is not part of the problem, apparently (T305157)
[09:02:41] is that the version from before the tests we did last time? (there was a reproducer of the error, and it was happening there for sure)
[09:04:13] wait, dnspython? the issue was with eventlet, no? (or the fix was on the eventlet side, iirc)
[09:05:33] anyhow, nm, /me is out of context
[09:07:22] T305487 is happening again, if someone could prod the bridgebot
[09:07:47] (the double-message issue. I thought stashbot would mention the task title)
[09:08:01] dcaro: I've tried upgrading/downgrading both. The conclusion I've reached is that python3-eventlet is the problem and it needs upgrading to `0.30.2-5`
[09:08:18] ack
[09:12:16] !log admin [codfw1dev] installing python3-eventlet 0.30.2-5~bpo11+1 on all required servers (cloudvirt, cloudnet, cloudcontrol) (T305157)
[09:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:12:20] T305157: Openstack Wallaby on Debian 11 Bullseye problems because eventlet and dnspython - https://phabricator.wikimedia.org/T305157
[14:02:26] !log tools.bridgebot Restarting for duplicate IRC messages. (T305487)
[14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[14:13:52] bd808: if there's more than one IRC side running and causing the dupe messages, could you poll the process count and make it restart itself?
[14:14:51] RhinosF1: that's not how the code works, and it's deeper surgery than I would do, but I did send what I think would fix it upstream.
[14:15:24] bd808: I wondered how long upstream would take, though
[15:25:31] All my jobs on Toolforge are being dropped. The message is "job dropped because of user limitations". How do I determine what this is about?
[15:26:46] GreenC: which tool?
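A quick way to see what a limit like that is hitting is to inspect the tool's own grid queue. A minimal sketch only, assuming the (since-deprecated) grid engine tooling on a Toolforge bastion; the tool name is the one given just below, and the awk check assumes the default qstat layout where the job state is the fifth column:

```
# Become the tool, then look at its grid queue; state "r" = running, "qw" = waiting.
become botwikiawk           # tool name taken from this conversation
qstat                       # one row per job, with a state column
qstat | awk '$5 == "r"'     # just the running jobs
qstat | awk '$5 == "qw"'    # just the jobs stuck waiting for a free slot
```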
[15:28:37] botwikiawk
[15:31:59] GreenC: looks like you're being hit by the 'active jobs' limit described in https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Concurrency_limits
[15:34:13] agreed. `qstat` shows 16 running jobs for that tool and then 13 more in "qw" (queue wait) state
[15:35:05] but... GreenC those 16 running jobs are all continuous, so the 13 pending ones will never get their turn.
[15:35:36] I guess 2 of them are normal tasks...
[15:46:00] I probably hit the 16-job speed limit and new jobs are backing up into a traffic jam.
[15:46:55] Is it possible to request more slots?
[15:47:06] ooo
[15:49:22] "16 active jobs simultaneously allowed per tool user"... I guess that means I can create new tools and each will have 16 slots
[15:54:59] GreenC: yes, splitting into multiple tools is an option. You could also think about how many of those actually need to be continuous and whether some could be run periodically instead (which often depends on what the bot is doing and how time-sensitive that work is)
[15:58:04] bd808: that's great, I'll split some off. They are periodic, but run in a continuous mode, so if Toolforge crashes or whatever, it will auto-restart the job and the app can pick up where it left off until the task is complete.
[16:44:34] Hello, world! I have an error… what's wrong?
[16:44:35] Failed to launch the browser process!
[16:44:37] /data/project/iluvatarbot/www/js/node_modules/puppeteer/.local-chromium/linux-970485/chrome-linux/chrome: error while loading shared libraries: libgobject-2.0.so.0: cannot open shared object file: No such file or directory
[16:44:47] libgobject not installed?
[18:53:40] !log tools.notwikilambda updated to PluggableAuth 6.0 (T299934)
[18:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.notwikilambda/SAL
[19:08:59] arturo, are you doing maintenance currently, or is there something else going on? (see -feed)
[19:10:45] AntiComposite: pretty sure he isn't
[19:10:52] he said he'd stopped a while ago
[19:13:32] balloons: ^ you are apparently online, cloudvirt1016 has gone down
[19:14:11] Rook, are you around?
[19:15:13] yes
[19:16:09] can you have a look?
[19:16:27] I'm not clear on what to look at; is -feed a channel, or something above in this channel?
[19:16:41] Rook: #wikimedia-cloud-feed
[19:16:48] It relays icinga and alertmanager
[19:16:49] Oh, it is a channel
[19:16:57] And gerrit / phab
[19:17:17] Rook: 20:02:41 PROBLEM - Host cloudvirt1016 is DOWN: PING CRITICAL - Packet loss = 100%
[19:17:25] That's the most important part
[19:17:56] Then a bunch of alerts on https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:20:56] !log tools.notwikilambda resumed automatic updates of PluggableAuth (T299934)
[19:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.notwikilambda/SAL
[19:26:41] $ ssh cvn-app9.cvn.eqiad1.wikimedia.cloud : connect failed: No route to host
[19:26:47] I guess this is related to the outage?
[19:28:10] Krinkle: yep
[19:29:52] a k8s worker in tools & toolsbeta, cvn-app9, and cloud-puppetmaster-03 are affected, for anyone else wondering
[19:30:59] Looks like Rook is working on it, but the reboot attempt failed
[19:37:00] Yeah, had to get into my ghost account to get the IPMI password; got that, trying that route now
[19:39:15] okay
[19:40:46] I'm getting an error on Horizon when trying to view an unreachable but supposedly running instance. I guess that's the same issue.
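Returning to the Puppeteer error above: a quick way to confirm which shared libraries the bundled Chromium binary cannot find is to run ldd against it. A minimal sketch, using the path quoted in the error; actually fixing it still means running in an image or environment that ships those libraries (libgobject-2.0.so.0 normally comes from Debian's libglib2.0-0 package, and headless Chromium usually needs several more beyond that):

```
# List every shared library the bundled Chromium cannot resolve.
ldd /data/project/iluvatarbot/www/js/node_modules/puppeteer/.local-chromium/linux-970485/chrome-linux/chrome \
  | grep 'not found'
```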
[19:41:56] ragesoss: which VM
[19:42:36] https://horizon.wikimedia.org/project/instances/74abbd53-6efb-4b64-9341-fc7507208fdd/
[19:42:51] aka p-and-e-dashboard-database in the globaleducation project
[19:43:17] also https://horizon.wikimedia.org/project/instances/5654c71c-3c14-4c9a-aa07-ba4d39c8a1e1/
[19:43:24] aka p-and-e-dashboard-sidekiq
[19:44:02] actually, I get errors for any of the instances on globaleducation, even though I can reach at least one of them via ssh.
[19:44:04] -database is 1016
[19:44:35] -sidekiq doesn't look to be on cloudvirt1016 though, ragesoss
[19:44:54] See https://openstack-browser.toolforge.org/project/globaleducation under hypervisor
[19:45:05] yeah, I can reach -sidekiq via ssh.
[19:45:17] we've got ssh
[19:45:27] but I get the 'Something went wrong!' screen on Horizon for -sidekiq and all the others.
[19:45:29] yeah, it just came back up according to icinga
[19:46:15] yeah, the database instance is back up, as outreachdashboard.wmflabs.org just started working again.
[19:46:23] great!
[19:46:25] Looks like services aren't failing on cloudvirt1016 any more. ragesoss, did your VM come back?
[19:46:48] He just said it did
[19:46:54] Yep, awesome!
[19:46:54] Rook: yeah, things are working now... but I'm still hitting the errors on Horizon
[19:47:01] Not sure about the Horizon stuff though
[19:47:44] There's a lot of recoveries showing; looks like puppet is being slow to come back, Rook
[19:48:33] giving it a few minutes to see if they come back on their own. I'm seeing some errors in Horizon as well
[19:49:07] Yep, I expect puppet will start working whenever puppet next runs
[19:50:05] And I just got a completely unrelated page from Miraheze icinga, this is gonna be a fun night
[19:51:36] !log tools.stewardbots `./SULWatcher/manage.sh restart` all bots disconnected
[19:51:37] The Horizon alert I was seeing seems to have vanished
[19:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[19:51:41] (unrelated)
[20:00:15] Rook: Horizon seems to be working for viewing my instances now.
[20:00:26] 👍
[20:28:16] !log tools.notwikilambda deployed f6e6dc49e1 (allow function-evaluator startup up to 20 minutes)
[20:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.notwikilambda/SAL
[20:59:35] Hi, I'm trying to migrate a jsub cronjob to toolforge-jobs and I've run into a strange error. Can anybody help?
[21:01:23] oops, I think I got it. The problem was an underscore in the job name
[21:05:10] Joutbis: did that underscore end up giving you a strange error message?
[21:13:20] yep. First it said "HTTP 422: likely wrong schedule time. k8s JSON: ....". It looked innocent enough, but just in case I removed the schedule option
[21:13:57] And without a --schedule, it said "HTTP 422: likely an internal bug. k8s JSON: .... "
[21:14:16] But changing the underscore to a hyphen nailed it.
[21:17:41] that sounds like it should be reported as a bug though :)
[21:17:47] to produce a nicer error message
[21:18:06] or to accept underscores
[21:24:46] I guess, yeah ^^
[21:24:59] or... both! :)
[21:25:51] The 422 status code is interesting too. That's "Unprocessable Entity" from the WebDAV RFC
[21:26:37] So, should I report it, or does somebody from WMCS want to run with it?
[21:27:13] Joutbis: if you have time to make a bug report, that would be ideal. You've seen the messages and know what inputs it took to create the errors.
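For reference, the kind of jsub-to-toolforge-jobs migration being discussed tends to look roughly like the sketch below. The job name, tool name, script path, and schedule are made-up placeholders, the image name is the one quoted later in this log, and the key constraint from this conversation is that job names must stick to lowercase letters, digits, hyphens, and dots (no underscores):

```
# Old grid-engine style crontab entry (illustrative placeholder):
#   0 3 * * * /usr/bin/jsub -N fer-efem /data/project/mytool/fer_efem.sh
# Roughly equivalent scheduled job under the Toolforge jobs framework:
toolforge-jobs run fer-efem \
  --command /data/project/mytool/fer_efem.sh \
  --image tf-bullseye-std \
  --schedule "0 3 * * *"
```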
[21:30:11] OK, will do
[21:40:44] Done: https://phabricator.wikimedia.org/T305592
[21:52:21] hmm, but when I try `toolforge-jobs run _ --command true --image tf-bullseye-std` I get a much nicer error:
[21:52:22] ERROR: unable to create job: "ERROR: job name doesn't match regex [a-z0-9]([-a-z0-9]*[a-z0-9])?([.][a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
[22:10:26] Maybe a string like fer_efem gets parsed into two different strings separated by the underscore and then everything goes haywire
[22:10:30] Dunno
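The regex quoted in that error looks like the standard Kubernetes DNS-1123 subdomain name pattern, which would suggest the name is simply failing validation rather than being split at the underscore. As a sketch, a candidate job name can be pre-checked against it before submitting; the example name is the hypothetical one from the message above:

```
# Test whether a candidate job name matches the pattern from the error message.
name="fer_efem"   # hypothetical; try "fer-efem" to see it pass
if printf '%s\n' "$name" \
    | grep -Eqx '[a-z0-9]([-a-z0-9]*[a-z0-9])?([.][a-z0-9]([-a-z0-9]*[a-z0-9])?)*'; then
  echo "ok: $name is an acceptable job name"
else
  echo "invalid: $name does not match the allowed pattern"
fi
```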