[09:24:10] hi, almost ten years ago this commit was added which allows limiting the SSH key exchange to older protocols: https://github.com/wikimedia/operations-puppet/commit/47f5c988c12815ae45dd0ea0545fd3e61bacd3b5
[09:24:35] the parameter is very confusingly named
[09:25:41] for production this was only enabled for gerrit until recently
[09:26:38] when running a cleanup patch through PCC I noticed that Toolforge still sets this on a per-project level in Horizon
[09:27:52] all SSH clients should be recent enough, after all a decade has passed. Any objection to removing the Horizon setting so that the defaults from the SSH profile apply, like for the rest of Cloud VPS and production?
[09:31:17] moritzm: SGTM, cc arturo
[09:35:36] LGTM moritzm, thanks
[09:37:02] ack, thanks! I'll make the change now with a note to SAL, if there are any complaints we can revisit (or rather help users download a current SSH client maybe)
[09:38:23] good spot :)
[10:28:44] dhinus: the start-devenv script failed for me for unknown reasons, bootstrapping the kind cluster
[10:31:49] arturo: it worked for me, I did run it a few times yesterday
[10:32:13] but I have another issue: T385082
[10:32:13] T385082: [lima-kilo] some containers are not restarting when restarting the VM - https://phabricator.wikimedia.org/T385082
[10:32:26] yeah, I was trying to replicate that one, when I found this problem
[10:32:30] okok
[10:32:39] apparently, the haproxy is not finding the backends
[10:32:54] can you reproduce your problem? maybe it was a one-off
[10:33:03] yes, I can reproduce it
[10:33:22] interesting
[10:33:24] https://www.irccloud.com/pastebin/Kky7VL54/
[10:33:42] let me see if I need to disable some local firewalling on my laptop
[10:34:04] wait, suddenly one node became available
[10:34:05] [WARNING] 029/103309 (43) : Server kube-apiservers/toolforge-control-plane is UP, reason: Layer7 check passed, code: 200, check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[10:34:10] hmm the SSL handshake error is similar to mine
[10:35:11] this warning might be related: " config : missing timeouts for frontend 'controlPlane'."
[10:35:19] "While not properly invalid, you will certainly encounter various problems"
[10:35:23] I like that error message :D
[10:35:41] heh
[10:36:24] my guess is that some dependency was updated and modified the default behavior of something
[10:36:30] but I'm not sure what the something could be :P
[11:00:05] somewhat similar behavior, but this is a very old issue that has been fixed: https://github.com/kubernetes-sigs/kind/issues/588
[11:10:49] ok, apparently I was able to bootstrap the cluster when not using HA mode or the cache
[11:18:55] I was also thinking those 2 things could interfere
[11:19:05] I just found some interesting logs: https://phabricator.wikimedia.org/T385082#10507944
[11:23:06] ok, start-devenv finished on my system
[11:25:24] can you try "limactl stop" followed by "limactl start"?
[11:25:59] I'm trying on my machine with --no-cache now
[11:27:33] dhinus: yes, it worked!
[11:34:10] that's with --no-cache and --no-ha ?
[11:35:46] I'm trying with all combinations of those options
[11:37:18] with --no-cache only, I still see the issue
[11:39:54] yes, with --no-cache and --no-ha
[11:39:58] I wrote a message in the ticket
[11:41:03] I think reusing the container cached disk may have unintended consequences. What if the embedded config for haproxy just doesn't work across rebuilds? It may make sense, because things like IP addresses and ports may change
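(Editorial note: one rough way to test the "stale embedded haproxy config" hypothesis above is to compare the backend addresses baked into the kind load balancer's haproxy.cfg with the address a node container actually has after a restart. This is only a sketch: it assumes the kind cluster is named "toolforge" (matching the toolforge-control-plane server in the haproxy log above), that kind's default container names, haproxy config path, and "kind" docker network apply, and that docker is the runtime inside the lima VM; lima-kilo may differ on any of these.)

    # haproxy config generated at cluster-creation time, node IPs hardcoded in the backends
    limactl shell <vm-name> docker exec toolforge-external-load-balancer cat /usr/local/etc/haproxy/haproxy.cfg
    # address the control-plane container actually got after the restart
    limactl shell <vm-name> docker inspect -f '{{.NetworkSettings.Networks.kind.IPAddress}}' toolforge-control-plane

(If the two disagree after a "limactl stop" / "limactl start", haproxy keeps probing the old addresses, which would explain the SSL handshake failures until the config is regenerated.)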
[11:49:11] I noticed this issue one time, but it was a bit tricky to reproduce, probably because I don't shut down my machine that much. Let me look at the task
[13:28:39] I've seen that (or a very similar) issue with HA, yep; my guess is IPs change between reboots and the config for haproxy is hardcoded somehow (did not follow through, just used no-ha as it makes it faster and uses less resources, and I did not need to do HA tests so far)
[13:48:52] A power outage in my house shut down my HDMI monitor. When power recovered, the content was upside down. What kind of logic is that?
[13:55:50] LOL
[14:08:08] I found a few more related discussions upstream, the most relevant is probably https://github.com/kubernetes-sigs/kind/issues/1689
[14:11:06] dhinus: for your consideration https://gerrit.wikimedia.org/r/c/operations/puppet/+/1115391
[14:11:35] arturo: ha, I had something very similar in mind :)
[14:19:38] left a comment
[14:19:39] topranks: hello! is there a reason https://netbox.wikimedia.org/ipam/ip-addresses/18886/ (cloud-private v6 for cloudsw-b1-codfw) isn't assigned to the interface yet?
[14:20:20] taavi: just that I've been real busy and haven't been able to get back to it
[14:20:21] sorry
[14:20:33] I'll get it all tidied up over the next week or two
[14:21:25] no worries, I'm the one trying to rush here :P
[15:04:48] dhinus: are you doing cloudvirt reboots, or have you already done them?
[15:08:01] I did in the past, but I haven't done any for this ronud
[15:08:18] I saw you mentioned an issue with migration, is that fixed?
[15:08:30] s/ronud/round/
[15:13:24] dhinus: it's not fixed, so best if you hold off on cloudvirts until I figure out what's happening
[15:13:39] ack
[15:13:42] I will start with clouddbs
[15:14:19] arturo: do you want to do cloudnets? T384946
[15:29:14] dhinus: ok!
[15:29:30] thanks
[15:32:06] dhinus: T384946 is the ticket, right?
[15:33:44] yep
[15:34:14] ok
[16:04:44] andrewbogott: re https://wikitech.wikimedia.org/w/index.php?diff=2265347 -- why do we have /data/project/.shared/cache? I thought that's what /data/scratch is for
[16:05:24] taavi: it might be that that user just created it themselves -- I've been waiting for someone to explain to me where it came from :)
[16:05:32] I will see if I can nudge that user over to scratch
[16:55:23] arturo: I don't think this is the cause of the issue, but I'm reviewing all the live migration settings for cloudvirts. Should cloudvirts still be using cloudvirt1111.eqiad.wmnet addresses for cloudvirt<->cloudvirt communication or should that happen on a .private address these days?
[16:56:54] andrewbogott: the .private address is technically more correct, I guess
[16:57:08] i'm pretty sure i made a task to move it to the private one a while ago
[16:57:35] it's an easy change, lmk if you find the task, taavi
[17:07:45] andrewbogott: T355145
[17:07:45] T355145: Use cloud-private and cfssl certs for instance live migrations - https://phabricator.wikimedia.org/T355145
[17:08:03] iirc the hard part is that currently it uses puppet certs which don't have that name
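(Editorial note: for context on the migration-address question, nova-compute has a [libvirt] option, live_migration_inbound_addr, that sets the address or hostname a host uses as the target for incoming live-migration traffic. A minimal sketch of pointing it at a cloud-private name follows; the hostname is a placeholder invented for illustration, whether the cloudvirt puppetization actually uses this option is not confirmed here, and per T355145 the TLS certificate names would also have to match, so this alone is not the whole change.)

    [libvirt]
    # advertise the cloud-private address for incoming live migrations
    # instead of the production realm name (cloudvirt1111.eqiad.wmnet);
    # the hostname below is a placeholder, not the real record
    live_migration_inbound_addr = cloudvirt1111.private.eqiad.wikimedia.cloud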
[18:35:27] andrewbogott: the discovery that T380384 is also blocking creating new tools if the requested name matches any SUL account is making that bug a bigger annoyance for the community. It might be worth y'all thinking again about prioritization.
[18:35:27] T380384: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384
[18:36:10] ok! I guess that was always the case but only for conflicts with wikitech names vs. sul names?
[18:36:55] yeah. it was a feature before the SUL migration
[18:37:24] it kept people from making "myname" tools
[18:38:06] But also nobody thought about this consequence at all before forcing the SUL migration
[18:38:32] Wikitech becoming a SUL wiki changed a lot of assumptions in Striker
[18:39:37] and sadly nobody made following up on bugs, like the ones after the move away from LDAP and to Kubernetes, a hypothesis
[18:43:27] I'm not sure I understand why the name conflict with wikitech names was bad but the name conflict with sul names is OK?
[18:45:46] andrewbogott: The problem is the other way around, and the big deal is that there are multiple orders of magnitude more SUL accounts, making it very difficult to find a valid tool name now.
[18:46:17] There are 77,420,207 SUL accounts and ~30,000 Developer accounts
[18:46:59] Sorry, I need you to walk me through this. Was there ever an actual danger associated with a tool being named after a wikitech account name?
[18:47:12] Or was it just a side-effect of general name-vetting?
[18:51:53] There were names we wanted to block and we did that using https://wikitech.wikimedia.org/wiki/MediaWiki:Titleblacklist via the "can I create this account" Action API lookup.
[18:52:22] it was a side effect that this method also blocked making a "BryanDavis" tool
[18:52:58] but now we have lost the MediaWiki:Titleblacklist benefit and added 77M name collisions to the side effect
[18:53:52] OK, so it was never important to actually prevent the toolname vs username general case. That's what was confusing me.
[19:01:48] yeah, it was a not-too-horrible way to nudge people away from making old-school tools just named after themselves (xtools, etc.) but it was not a hard requirement.
[19:03:01] but now you can't make a "prototyper.toolforge.org" or "splice.toolforge.org" because of SUL name collisions and there is no workaround for that like there is for the Developer account creation.
[19:04:03] yep, makes sense
[23:52:29] stashbot, still there?
[23:52:29] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[23:53:42] For whoever shows up first tomorrow: alert manager is telling me that maintain-kubeusers is down but when I tail the logs everything looks just fine. Either I don't understand how to read the logs, or this is an alerting error, or it was ephemeral. In any case... I didn't do anything but look at the log.
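(Editorial note, going back to the Striker discussion above: the "can I create this account" lookup bd808 describes can be done with the Action API's list=users query, whose cancreate property reports whether a missing username could be registered and why not, e.g. a MediaWiki:Titleblacklist hit. The sketch below uses meta.wikimedia.org and a made-up name purely as an example and is not necessarily the exact call Striker issues.)

    # check whether a proposed name is free to register as an account
    curl 'https://meta.wikimedia.org/w/api.php?action=query&list=users&ususers=SomeProposedToolName&usprop=cancreate&format=json'

(A name that already belongs to a SUL account simply comes back as a registered user, which is the collision case described above; a missing but blocked name comes back with a cancreateerror explaining the block.)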