[09:24:58] !log admin rebooting cloudcontrol1004 for T291446
[09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:25:02] T291446: cloudcontrol1004 galera crash - https://phabricator.wikimedia.org/T291446
[10:07:43] !log admin cloudcontrol1004 apparently healthy T291446
[10:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:07:48] T291446: cloudcontrol1004 galera crash - https://phabricator.wikimedia.org/T291446
[11:34:48] !log tools disabling pod preset controller T279106
[11:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:34:53] T279106: Establish replacement for PodPresets in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279106
[13:01:42] !log tools publish jobutils and misctools 0.43 T286072
[13:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:01:48] T286072: No tab completion for `become` on dev-buster.toolforge.org - https://phabricator.wikimedia.org/T286072
[13:11:55] \o/
[15:45:32] majavah: around the time that you disabled the pod preset controller all of my k8s jobs stopped working
[15:45:39] uhm
[15:45:49] looking
[15:48:14] ah, this looks like the jobs framework is doing something unexpected and my replacement volume admission controller does not like that
[15:48:18] sorry about that, will fix
[16:13:31] !log toolsbeta testing volume-admission fix for containers with some volumes mounted
[16:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[16:19:45] !log tools deploy volume-admission fix for containers with some volumes mounted
[16:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:22:00] JJMC89: fixed now, sorry about that, my fault that I only tested it with `webservice`-created pods
[16:39:08] majavah: Thanks. Things look back to normal.
[17:23:21] !log quarry added stopped status T289349
[17:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[17:23:25] T289349: add stop status to quarry - https://phabricator.wikimedia.org/T289349
[20:24:54] wm-bb is broken once again and tripped a threshold that knocked a bunch of bridges off
[20:25:12] I am talking to serverops
[20:25:21] do you know the IP our bots come from?
[20:25:28] 185.15.56.1
[20:25:32] thanks, forwarding
[20:26:47] 20:25 <@A_Dragon> they should be able to reconnect
[20:26:47] 20:25 -!- Moneca [~Moneca@host-87-5-102-118.retail.telecomitalia.it] has joined #libera
[20:27:01] glguy: "already fixed" they say
[20:27:12] well, wm-bb is broken
[20:27:14] but the ban is lifted
[20:27:19] yea, the ban part
[20:28:06] bd808 ping
[20:32:33] glguy: I don't have time to look too much into it, but stopped it to avoid causing more damage
[20:33:29] thanks
[20:34:11] glguy: is there anything on our side to reduce the effects of one misbehaving bot on others?
[20:35:14] I thought that was part of the point of using SASL + voicing them
[20:35:18] not all sharing the same outgoing IP would have helped, but no idea if that is realistic to change
[20:41:53] majavah, identd is the best way to differentiate different users behind the same IP
[20:43:16] legoktm, voicing helps with making it clear that a particular connection is allowed to be sending to a particular channel
[20:43:29] wm-bb's rate limiting is broken and it's sending commands faster than the server will accept them
[20:44:26] and then it immediately reconnects and does that again, over and over
[20:45:17] gotcha
[20:45:34] seems like matterbridge has some delay we can configure https://github.com/42wim/matterbridge/blob/1f365c716eae44b64dc5bdace5cb70441d7eb4c2/matterbridge.toml.sample#L72
[20:46:53] installing identd should be reasonably simple, just install one of the packages, like gidentd or another
[20:46:57] in theory we have ident running but I don't think it survives our NAT thingy AIUI
[20:47:00] normally that should just work
[20:47:04] aha
[20:48:14] it'll probably not like the NAT, plus I don't really think we can trust any other project than tools or bastion to report accurate data
[20:48:45] nearly all of these bots are in tools though
[20:49:37] oidentd claims it can be configured to work correctly, https://oidentd.janikrabe.com/nat/introduction
[20:51:46] I could technically just make a patch to try it out, but doing it without being able to hack on it on codfw1dev would be painful
[20:56:52] is there a test VPS project we can try it in?
[20:59:09] I'd imagine replicating the NAT gateway inside Cloud VPS itself might be difficult if not impossible
[21:00:25] errr, right nvm. I was thinking the identd would run in the tools/VPS project, not as part of the NAT/openstack
[21:02:19] you'd still need a component outside the cloud realm to know which VM to forward the requests to
[21:04:14] in theory codfw1dev is a good place to develop/test it, but it's still in the production realm and I don't have enough access
[21:28:16] hello Cloud Services! Community Tech's wikiwho.wmflabs.org decided to die today, no idea why... https://phabricator.wikimedia.org/T291886 I created a new instance and it's working great. I went to delete the old DNS proxy and point it to the new instance, and I'm getting the error `RecordSet belongs in a child zone: wikiwho.wmflabs.org.` Any ideas?
[21:29:44] I have https://wikiwho2.wmflabs.org/ which shows the new instance is working. So now it's a matter of getting the old hostname to point to it, but OpenStack isn't cooperating
[21:37:12] maybe a regression with https://phabricator.wikimedia.org/T131367 or https://phabricator.wikimedia.org/T260388 ?
[21:38:34] if we have to use a different hostname, we can, but that also requires a new update to the Who Wrote That browser extension, which means 1-2 days of downtime for our users
[21:39:10] (because of the review process for browser extensions)
[21:43:09] oh, it's because of the new WikiWho project that was just created! https://phabricator.wikimedia.org/T290768
[21:43:38] we're not quite ready to attempt migrating the WikiWho backend, but for now I'll just create the proxy instance on that project
[21:50:13] majavah, legoktm: we have oidentd running, but in practice it just doesn't work. See T151704 for a bunch of things I've tried in the past. And T278584 for the only thing I actually think will help.
[21:50:13] T151704: Libera Chat may throttle bot connections from tools - https://phabricator.wikimedia.org/T151704
[21:50:14] T278584: Promote use of SASL for Cloud VPS/Toolforge hosted Libera.chat / Freenode IRC bots - https://phabricator.wikimedia.org/T278584
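For context on the matterbridge delay mentioned at 20:45:34: a minimal sketch of the relevant IRC-gateway settings from matterbridge's sample config. The section name, server, and values below are assumptions, not wm-bb's actual configuration.

```toml
# Hypothetical bridge section; wm-bb's real section name and values may differ.
[irc.libera]
Server = "irc.libera.chat:6697"
UseTLS = true
UseSASL = true
Nick = "wm-bb"
# Milliseconds to wait between messages sent to the IRC server.
MessageDelay = 1300
# Maximum number of messages to hold in the queue while throttled;
# further messages are dropped instead of flooding the server.
MessageQueue = 30
```

Raising MessageDelay (or otherwise slowing the send rate) is the knob that would keep a misbehaving bridge under the server's flood limits instead of triggering repeated disconnects.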
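On the `RecordSet belongs in a child zone: wikiwho.wmflabs.org.` error (21:28:16) and the diagnosis at 21:43:09: one hedged way to confirm it with the OpenStack Designate CLI, assuming the designate plugin for python-openstackclient and sufficient privileges. The zone name comes from the chat; everything else is illustrative.

```
# List zones across projects to see whether a child zone named
# wikiwho.wmflabs.org. now exists (created along with the new WikiWho project).
openstack zone list --all-projects | grep wikiwho

# If it does, records for that name must be managed as recordsets of the
# child zone (from the project that owns it), not as records of wmflabs.org.
openstack recordset list wikiwho.wmflabs.org.
```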
[21:53:43] irc<->telegram bridge check from telegram side
[22:01:57] * bd808 realizes that all the detailed notes on ident madness for Cloud VPS/Toolforge are on security tickets and sighs
[22:07:50] bd808: worth trying gidentd instead of oidentd? I see "
[22:07:52] Can't get oidentd to work behind NAT"
[22:08:13] or stuff like "arkku/aidentd: An IDENT daemon for Linux with NAT ... - GitHub"
[22:17:29] mutante: it's a deep problem and really it all falls apart once you get to the Toolforge Kubernetes cluster. That's yet another address translation layer to pass, and ultimately it would require a special ident processing sidecar in the pod for every bot that is running on Kubernetes.
[22:18:04] it's a rabbit hole too deep to be worth crawling down into anymore, in my opinion.
[22:20:23] bd808: ooooh, I see. yes, understood!
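For the record, the oidentd-behind-NAT approach discussed above (20:49:37 onward) would look roughly like the sketch below on the NAT gateway. This follows the linked oidentd documentation, but it was never deployed here, the address range and identity are made up, and, as bd808 notes, it still would not cover bots running inside the Kubernetes cluster.

```
# /etc/oidentd_masq.conf on the NAT gateway: static replies for masqueraded
# connections, one "host[/mask]  identity  os" entry per line.
# The range and identity below are illustrative only.
172.16.0.0/21   cloud-vps-bot   UNIX

# Start oidentd with masquerading support enabled (the --masq switch per the
# oidentd manual; verify the exact flag name for the packaged version).
oidentd --masq
```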