[06:53:51] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (ayounsi) [clinic duty] tagging the teams I think are relevant to this task, please change the tags as needed
[09:57:22] netops, Infrastructure-Foundations, SRE: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (jbond) > , as we can drop to a regular shell and specify the MAC code manually: FYI you can also use the .ssh/config file whic...
[10:22:08] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (jbond) I thought I would bring my response here. > Setting skip_acked will also skip recheck_failed_services() Regardless of if we call `...
[10:35:54] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (jbond) >>! In T319277#8307039, @jbond wrote: > I thought I would bring my response here. > >> Setting skip_acked will also skip recheck_f...
[11:12:47] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (jbond)
[11:20:51] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (Volans) I understand your concerns > Regardless of if we call wait_for_optimal(True) or wait_for_optimal(False) we should always call rec...
[11:46:37] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (cmooney) Thanks for tracking all this John. As you know most of our hosts just have a single interface with single unica...
[11:52:00] moritzm: I remain stumped by what's happening on ganeti4004/4008
[11:52:23] ifup just doesn't seem to be trying to create the bridge devices, and then fails trying to add IPs to them
[11:52:41] If I manually create everything as needed with "ip" everything works fine
[11:53:11] so doesn't seem to be any underlying kernel issue with adding what we need, but ifup is not doing it right
[11:53:20] I can't see why it'd be acting any different to ganeti4003
[11:53:41] perhaps jbond might have some ideas I'm pretty stumped at this stage :(
[11:56:40] yeah, it is super strange, I had been WTFing at it for quite a bit this morning...
[11:56:46] netops, Infrastructure-Foundations, SRE: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (cmooney) > AFAIK this configures the ssh daemon to accept connections using this protocol (possibly also configures outbound c...
[11:57:24] I'm wondering if it's some kind of race, in the sense that the shell kicks off some step which is still pending/async on the kernel level
[11:58:05] maybe the BCM57414 in the Gen15 server is slightly different from the BCM57412 in ganeti4003
[11:58:40] yeah, I was trying to work out if "ifup" was trying to add the bridge device, and that failed, or if it was simply not trying to create it at all
[11:58:55] but I couldn't really debug it to that level
[11:59:36] as you say maybe it tries to create the device, but it's still pending at kernel level, and thus next command (to add IPs to it) fails
[11:59:53] but the device never gets created (with or without IPs) so not sure if that makes sense
[12:02:53] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (jbond) >>! In T234207#8307389, @cmooney wrote: > I'm not sure if this task is the best place to discuss this but I'm of t...
[12:03:24] topranks: just about to grab some food but will take a look when back
[12:10:52] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (MoritzMuehlenhoff) >>! In T234207#8307389, @cmooney wrote: > Thanks for tracking all this John. > > So for instance we c...
[12:13:57] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (ayounsi)
[12:20:15] topranks: I think I found the issue, adding "bridge_hw enp175s0f0np0" to the private interface makes the network setup succeed
[12:21:18] found it when digging around and ran into https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=980505
[12:22:46] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (cmooney) >>! In T234207#8307423, @jbond wrote: > Perhaps from the netbox PoV but from any new (networkd) module should su...
[12:24:00] moritzm: oh wow nice!
[12:30:23] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (jbond) >>! In T234207#8307431, @MoritzMuehlenhoff wrote: >>>! In T234207#8307389, @cmooney wrote: >> Thanks for tracking...
[12:36:30] cool guessing no need for me to look now
[12:36:54] lucky you
[12:37:13] indeed
[12:50:09] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (SLyngshede-WMF) Open→In progress
[12:53:30] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (SLyngshede-WMF) a:SLyngshede-WMF
[12:54:00] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (SLyngshede-WMF) Note to myself: Check if this is still an issue, and if yes, are we still working on it.
[12:55:02] Puppet, puppet-compiler, Cloud-VPS, Infrastructure-Foundations, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (SLyngshede-WMF) a:SLyngshede-WMF
[12:55:42] Puppet, Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (SLyngshede-WMF) a:SLyngshede-WMF
[12:56:33] slyngs: did you mean to claim that task? FYI it should be fixed, as such I was going to resolve it (now you pinged it)
[12:57:01] Joanna asked me to check if we can close it :-)
[12:57:13] slyngs: ahh then you please close it :-)
[12:57:43] Which one was it? I grabbed a couple :-)
[12:57:52] I'll ping the task
[12:58:20] netbox, Infrastructure-Foundations: Add git-local-changes check for netbox-extras - https://phabricator.wikimedia.org/T250288 (SLyngshede-WMF) a:SLyngshede-WMF
[12:58:40] Puppet, Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (SLyngshede-WMF) a:SLyngshede-WMF
[12:59:22] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) In progress→Resolved This should be resolved now, I'll tentatively close it, thanks for the ping and please re-open if there are sti...
[13:00:34] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (cmooney) > One thing I forgot to highlight is that there is currently a bit of a chicken/egg issue of using interface_auto...
[13:02:27] Puppet, puppet-compiler, Cloud-VPS, Infrastructure-Foundations, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (jbond) > add an option to test on one random (or possibly hardcoded) host from both the cloud and wmcs environments This is sti...
[13:02:37] Puppet, puppet-compiler, Cloud-VPS, Infrastructure-Foundations, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (jbond)
[13:07:01] jbond: Thank you
[13:09:05] no probs
[13:14:42] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (jbond) > The original idea was that we don't want to ignore ack'ed alerts blindly I'm not sure this was the original idea going from the ta...
[13:19:39] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (Jclark-ctr) Verified Netbox Thanks
[13:19:48] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (Jclark-ctr) Open→Resolved
[13:19:57] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (Jclark-ctr)
[13:29:19] jbond: unfortunately that fix didn't work I think
[13:29:45] I think I may have confused moritz as I'd manually enabled the links prior to him adding the config line, so made it look like it worked
[13:30:15] To sum up the issue on ganeti4008: ifup fails, crashes out trying to add IPs to the 'private' bridge device
[13:31:10] https://phabricator.wikimedia.org/P35399
[13:31:16] or we fixed one totally unrelated and there's still something else
[13:31:32] Reason for that is fairly clear, it can't add an IP to the interface cos the device doesn't exist
[13:31:53] Why it is not properly creating the device prior to that is what we can't explain
[13:32:00] still, after having poked at this for multiple hours each, maybe we're both missing something really obvious, so a fresh look at it would be great
[13:32:18] sorry john, your luck ran out apparently
[13:32:22] ganeti4003 is very similar and working.
[13:32:37] Slightly different kernel and slightly different NIC, but if the stuff is done manually with "ip" command all works
[13:32:48] which suggests the kernel and NIC are ok and it's something with the ifup scripting
[13:33:43] Puppet, Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (jbond) I'm not sure if I ever looked at this task, however I do notice that I have an old closed PR for stdlib which seems related...
[13:34:09] volans: lol
[13:34:32] topranks: moritzm: ack, I'll take a look in a sec
[13:55:29] netops, Infrastructure-Foundations: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ayounsi) p:Triage→Low
[14:06:26] topranks: moritzm: you are both going to kick yourselves.
[14:06:37] lol
[14:06:43] * volans grabs popcorn
[14:06:53] is it a typo?
[14:06:57] fix was apt-get update; sudo apt-get install bridge-utils ; reboot
[14:07:08] what?
[14:07:10] ahahahahah
[14:07:13] wow
[14:07:13] haha
[14:07:23] was missing bridge-utils so no way to create the bridge interface
[14:07:24] was bridge-utils missing? or just older version?
[14:07:28] #missing
[14:07:30] ffs
[14:08:11] nice find :)
[14:08:13] well lucky we asked, and only tied 2 of the team up for half a day and not the full week!
[14:08:32] gnarf
[14:08:37] hehe :)
[14:08:49] * topranks staring at the strace of ifup didn't even spot it :)
[14:09:13] bridge-utils only gets installed via role(ganeti) and it's actually still in role(insetup)....
[14:11:06] and thanks for proving my "after having poked at this for multiple hours each, maybe we're both missing something really obvious, so a fresh look at it would be great" right :-)
[14:12:42] always happy to help with that ;)
[14:12:58] Thanks John it was really getting to me :P
[14:13:19] Even worse I even ran "brctl" at one stage and so I knew it was missing, and never made the link
[14:13:43] no problem, it happens that you can't see the wood for the trees sometimes
[14:14:07] moritzm: was this condition because you were manually testing the network stuff with the new kernel?
[14:14:20] * moritzm hands https://i.pinimg.com/originals/c3/2d/63/c32d63ad2baab40dc3e7c9c4fc61cc29.jpg to topranks and himself
[14:14:23] like normally puppet will install bridge-utils and then modify the interfaces file?
[14:14:50] moritzm: so proud... always wanted one of those :D
[14:14:57] tbh I'm not sure, there is also ./modules/ganeti/files/ganeti_init.sh which is at play at some point with ganeti
[14:16:00] topranks: yeah, I was testing the new kernel and was under the firm impression I had already merged the role change
[14:16:11] and also, bridge-utils adds the scripts in /etc/network/if-pre-up.d/ once installed
[14:16:24] prior to that nothing is actually run to try to make the bridge
[14:16:33] but I'm also adding servers in parallel in eqiad, so my mind tricked me there
[14:16:36] hence me staring at it screaming "it's like it's not even trying to make it"....
[14:17:19] but not finding any error / log of what was going wrong
[14:23:18] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (ayounsi) For physical servers we indeed need to keep the whole lifecycle/provisioning process in mind (racking/provisioni...
[14:25:08] netops, Infrastructure-Foundations: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (cmooney) Agreed we should add it to the CRs, no reason I can think of not to. Also I'll think about it in terms of the l3_switch template consolidation. They should get the same...
[14:30:56] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (cmooney) > Which means being able to map the real world interface to the logical one, from previous conversations it's o...
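Editor's note on the ganeti4008 bridge issue above (14:06 to 14:17): the failure was silent because the bridge_* options in the interfaces file are only acted on by the hook scripts that bridge-utils installs under /etc/network/if-pre-up.d/. The Python sketch below is one possible pre-flight check for that situation, not something used in the conversation. The exact hook filename (`bridge`) and the function names are assumptions made for illustration, and the stanza parsing is deliberately simplified.

```python
from pathlib import Path

# Assumption: Debian's bridge-utils installs its ifupdown hook at this path.
BRIDGE_HOOK = Path("/etc/network/if-pre-up.d/bridge")
INTERFACES = Path("/etc/network/interfaces")

# Keywords that end the current iface stanza in an interfaces(5) file.
STANZA_BREAKERS = ("auto", "source", "source-directory", "mapping")


def bridge_ifaces(interfaces_text: str) -> list[str]:
    """Return the iface stanzas that use bridge_* options (hypothetical helper)."""
    found: list[str] = []
    current = None
    for raw in interfaces_text.splitlines():
        words = raw.strip().split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        if words[0] == "iface" and len(words) > 1:
            current = words[1]
        elif words[0] in STANZA_BREAKERS or words[0].startswith("allow-"):
            current = None
        elif current and words[0].startswith("bridge_") and current not in found:
            found.append(current)
    return found


def uncreatable_bridges(interfaces_text: str, hook_present: bool) -> list[str]:
    """Bridges that ifup would never try to create because the hook is missing."""
    return [] if hook_present else bridge_ifaces(interfaces_text)


if __name__ == "__main__" and INTERFACES.exists():
    for name in uncreatable_bridges(INTERFACES.read_text(), BRIDGE_HOOK.exists()):
        print(f"{name}: bridge_* options set but {BRIDGE_HOOK} is missing; "
              "is bridge-utils installed?")
```

Run on a host before `ifup`, it prints one line per bridge stanza that has no hook to create it, which is the condition that took half a day to spot here.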
[19:11:04] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney)
[19:11:26] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (cmooney)
[19:11:34] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney) Open→In progress p:Triage→High
[19:11:48] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney)
[19:15:32] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney)
[19:16:32] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) a:Jclark-ctr→None
[19:29:10] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney)
[23:00:11] CAS-SSO, GitLab, Infrastructure-Foundations, SRE: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808)