[01:13:42] Hello topranks and mutante, after further testing, the least disruptive and simplest approach is to create the '.ssh' directory using Puppet. It needs '700' permissions and user:group 'rancid:rancid'. Once the directory is created, the 'rancid-differ' service and invoking 'rancid' manually work as expected.
[04:50:49] Hello team, is it possible to apply the Puppet changes of a specific Gerrit commit if that patch has not been merged into the 'production' branch?
[05:05:33] denisse|m: what do you mean?
[08:12:16] denisse|m: not really; depending on what exactly you want to test there are a multitude of options: pcc, cloud project, pontoon, bolt. Ping me when you get online and I can chat more.
[08:32:30] elukey: ping, can I merge your patch on puppet?
[08:33:51] dcaro: o/ for the cloud private repo right? +1
[08:34:16] elukey: yes :), merging
[08:34:24] thanks :)
[09:15:41] XioNoX: the network cookbook is really great! Nice work
[09:16:22] elukey: thanks! you can see it in action there https://phabricator.wikimedia.org/T314978
[09:30:05] <3
[09:55:38] I have been waiting for ~20 mins for helm-lint, sigh. https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1036/ seems a 'little' slow
[11:15:58] topranks: hey, we are having issues trying to get the new Ceph node up; it seems it's failing to send/receive heartbeats. Do you have some time to help investigate? nc to the ip/ports works though, so it's more of a "confirming that the network is ok" thing
[11:19:38] dcaro: yeah no probs
[11:19:52] nc to the ports would seem to suggest the network is ok
[11:20:02] I validated the jumbo frames were working there yesterday
[11:20:13] Are there two nodes in particular that you can point me to?
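The fix described at 01:13 might look like the sketch below in Puppet. The directory path is an assumption (adjust to wherever the rancid user's home actually lives); only the mode and ownership come from the log.

```puppet
# Hypothetical path for the rancid user's home directory.
file { '/var/lib/rancid/.ssh':
  ensure => directory,
  owner  => 'rancid',
  group  => 'rancid',
  mode   => '0700',
}
```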
[11:21:16] cloudcephosd1024 (old) and cloudcephosd1025 (new)
[11:21:21] ok
[11:27:18] dcaro, dhinus: can't see any particular signs of a problem
[11:27:28] devices are exchanging traffic, TCP sessions look ok
[11:27:33] Jumbo ping works
[11:28:59] interfaces are configured with the same MTU, same settings; routes are ok
[11:29:33] dcaro: when you say it's failing to send/receive heartbeats, what protocol/ports are they?
[11:32:24] heartbeat_check: no reply from 10.64.20.20:6810
[11:32:46] (this is a log line from cloudcephosd1025)
[11:33:34] sorry, yeah, just seen that
[11:33:50] So this traffic is over the production realm networks rather than the cluster (192.168.x) network
[11:34:01] yep
[11:35:27] Shouldn't be an issue
[11:35:40] the weird thing is that we see TCP traffic going on that port from and to that host, so things are going around, but not sure why it complains about not getting heartbeats
[11:35:45] iptables on both sides have rules that allow it (port range), and counters show packets accepted
[11:36:46] https://phabricator.wikimedia.org/P32359
[11:41:50] I'm kind of stumped here tbh, you can see traffic being sent and received on the ports it has in the logs saying "no reply from"
[11:42:29] yep
[11:42:35] it's a long shot xd
[11:43:29] thanks for checking 👍
[11:50:22] np - the network has changed and the log suggests some kind of network issue
[11:50:32] so it kinda makes sense to look there - but I'm not seeing anything
[11:51:11] hmm... I'm noticing now that they contact for the heartbeat on both the cluster network (192.168.*) and the public network (10.64.*)
[11:52:25] ooohhh... could it be a routing thing? (maybe at the app level) as in receiving the heartbeat on one IP, and sending the reply on the other?
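The quick connectivity check mentioned in the log ("nc to the ip/ports works") can be sketched in Python. As the conversation shows, this only proves TCP connectivity and does not rule out an application-level heartbeat problem; the address and port in the usage comment are the placeholders from the log line, not values you should expect to resolve.

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (like `nc -z`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example, using the address from the heartbeat_check log line:
# port_open("10.64.20.20", 6810)
```

Note this mirrors what `nc -z` tells you: the TCP handshake completes, nothing more.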
[11:55:05] dcaro: it could, but it seems unlikely, in that if the packet is received on the public network from a 10.x source, it would surely reply to the source IP, which would go back over the same network
[11:56:09] unless there is some Ceph weirdness
[11:57:07] The routing tables are correct on both hosts, so there is no OS-level thing that would cause 192.168.x traffic to go out via the public network or vice versa
[11:57:32] 👍
[11:59:28] I see the Ceph health shows "slow ops". Pings are in nanosecond range though, so I don't think network delays are a factor.
[12:00:49] I'll leave it there for now; if there is anything else I can check, let me know
[12:02:18] okok, thanks!
[20:38:26] if mailman3 upstream insists on config files with ":" in the file names (does it?) then that breaks cloning operations/puppet in general for Windows users who want our puppet repo. :p https://phabricator.wikimedia.org/T314698
[20:42:43] huh, so it does
[20:43:11] I found a workaround for them by using WSL on Windows 11, from further down on https://stackoverflow.com/questions/5991805/how-do-i-clone-files-with-colons-in-the-filename
[20:43:26] but maybe it can be fixed on the mailman side, dunno
[20:43:35] we could also drop `recurse => remote` and do the renaming in file resources -- definitely makes the Puppet more verbose than it needs to be, but the clone would work
[20:43:53] that is true. yes, it is because we use recurse
[20:43:59] and just drop the files in there
[20:44:06] it's actually the nicest fix I guess
[20:44:28] except then whoever wanted that has to define every single file in Puppet, and we wanted to avoid that
[20:45:09] yeah
[20:46:04] there's no way to .each over the contents of puppet:///modules/mailman3/templates/, is there?
[20:46:19] I almost hope there isn't, that smells pretty bad
[20:49:07] it's 29 files
[20:57:53] it's actually hardcoded in mailman3.
upstream docs just define those like magic words ("supported template names") for templates that are defaults [22:29:59] FWIW, I don't think that windows filesystem issues should be a major concern in our Puppet codebase. Similarly I would not be worried about things that break in a git clone when using a case-insensitive file system. [22:43:30] I wouldn't go huge distances out of our way for Windows compatibility, but certainly at least if it were easy it'd be nice to do it [22:44:23] I wouldn't go to more than, say, 10x the effort of installing WSL :P [22:59:19] rzl, mutante: sorry I'm behind on that task. I actually moved the files into a Debian package, but never finished deploying it [23:01:33] duped and commented at https://phabricator.wikimedia.org/T282308#8147210 [23:02:23] ahh, neat
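Since the template names are hardcoded in mailman3 anyway, one middle ground between `recurse => remote` and 29 hand-written file resources is iterating over a hardcoded list of names. A sketch, assuming the repo stores files with ':' replaced by '-' so the clone works on Windows; the template names, repo layout, and destination path here are illustrative, not taken from the actual module:

```puppet
# Illustrative subset of mailman3's hardcoded "supported template names".
$template_names = [
  'list:member:regular:footer',
  'list:user:notice:welcome',
]

$template_names.each |String $name| {
  # The repo copy substitutes '-' for ':' so git clone works on Windows;
  # the resource restores the real name on disk.
  $repo_name = regsubst($name, ':', '-', 'G')
  file { "/etc/mailman3/templates/${name}.txt":
    ensure => file,
    source => "puppet:///modules/mailman3/templates/${repo_name}.txt",
  }
}
```

This keeps the manifest to one loop rather than one resource per file, at the cost of duplicating the upstream name list in Puppet.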