[07:59:28] SRE-tools, Infrastructure-Foundations, Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (SLyngshede-WMF) a:SLyngshede-WMF
[09:08:59] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) Reboot completed successfully, currently router not showing any alarms: ` root@re0.cr2-esams> show system alarms No alarms currently active `...
[09:31:19] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) >>! In T319184#8288137, @cmooney wrote: > [..] > Anyway thought I'd mention just in case you weren't aware. Thanks, double checking this now....
[09:46:02] jbond: I noticed something odd there while testing the puppetdb import script in netbox
[09:46:14] nothing important, more of a curiosity tbh
[09:47:01] basically puppetdb is listing an interface 'private:0' for ganeti1027, despite that interface not existing on the box itself
[09:47:36] you any idea why that might be?
[09:48:30] SRE-tools, Infrastructure-Foundations, Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (SLyngshede-WMF) In spicerack we'll add a "skip_acked=False" to the wait_for_optimal and "acked" properties to HostStatus and HostsStatus datatypes. When skip_a...
[09:50:51] topranks: you can run sudo facter -p networking on the host
[09:50:57] the script gets the data from there
[09:52:47] volans: thanks... I should have thought of that
[09:53:10] shows there too as you might expect. anyway no biggie just has me scratching my head (it doesn't happen on say ganeti1026 or others)
[09:54:17] topranks: ganeti1026 only has one ipv4 address on the private interface but ganeti1027 has two
[09:55:14] ah!
[09:55:45] indeed yes see it now, different IP.
[09:55:57] i think this must be an ipv4 facter specific thing. its always been the default to have multiple ipv6 addresses on the same interface and facter shows this correctly by grouping additional ip addresses in the bindings block e.g. https://phabricator.wikimedia.org/P35371
[09:56:08] although "bindings" is an array, and has multiple v6 entries, so not sure why they don't just add multiple under the same interface.
[09:56:18] however historically additional ipv4 addresses were on a sub interface e.g. private:1
[09:56:22] I think that ganeti1027 is the current master
[09:56:26] and hence has the ganeti01.svc.eqiad.wmnet.
[09:56:28] assigned
[09:56:46] volans: ah yes I see makes sense.
[09:56:50] this is probably a logic issue in how facter groups ipv4 vs ipv6 addresses due to some history related to the above
[09:57:18] odd behaviour in a way but my curiosity is satisfied. thanks!
[09:57:31] yes id say its a facter bug
[09:57:37] does it need a fix in the netbox script?
[09:57:47] to circumvent the facter but
[09:57:52] *bug
[09:58:43] fyi we get the same behaviour in the most recent version of facter
[09:59:58] volans: we could fix it / work around it there yes
[10:00:09] unsure where else we have multiple v4 addresses.
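The facter behaviour discussed above (a second IPv4 address surfacing as a separate 'private:0' entry rather than as another item under the parent's "bindings") could be worked around on the consuming side roughly as follows. This is only a sketch under stated assumptions, not the actual netbox import script: the function name fold_alias_interfaces is hypothetical, the input is assumed to be the "interfaces" mapping from `facter -p networking`, and the sample addresses are documentation-range placeholders, not real host data.

```python
from copy import deepcopy


def fold_alias_interfaces(interfaces):
    """Merge alias entries such as 'private:0' into their parent interface.

    Hypothetical helper: `interfaces` is assumed to look like the
    "interfaces" mapping of facter's networking fact, where each entry
    may carry a "bindings" list of IPv4 address dicts.
    """
    merged = {}
    # First pass: copy the real interfaces (names without ':').
    for name, data in interfaces.items():
        if ":" not in name:
            merged[name] = deepcopy(data)
    # Second pass: append each alias's IPv4 bindings to its parent.
    for name, data in interfaces.items():
        if ":" not in name:
            continue
        parent = name.split(":", 1)[0]
        target = merged.setdefault(parent, {})
        bindings = target.setdefault("bindings", [])
        for binding in data.get("bindings", []):
            if binding not in bindings:
                bindings.append(deepcopy(binding))
    return merged


# Made-up sample data mimicking the shape seen on a host with two
# IPv4 addresses on one interface.
sample = {
    "private": {
        "bindings": [{"address": "192.0.2.10", "netmask": "255.255.255.0"}],
        "bindings6": [{"address": "2001:db8::10"}, {"address": "fe80::1"}],
    },
    "private:0": {
        "bindings": [{"address": "192.0.2.20", "netmask": "255.255.255.0"}],
    },
}

folded = fold_alias_interfaces(sample)
print(sorted(folded))
print([b["address"] for b in folded["private"]["bindings"]])
```

Folding only normalises the data so that IPv4 looks like IPv6 does today; whether a given extra address should be imported at all is a separate decision for the caller.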
[10:00:27] checking puppetdb for lvs1016 it doesn't seem to record the additional IPs on "lo" at all
[10:00:38] just to be clear, that IP should not be attached in netbox
[10:00:48] as it's a VIP that floats between ganeti hosts in the same cluster
[10:01:01] and so would be outdated info
[10:01:03] yeah exactly it doesn't make sense to record it there
[10:01:06] same as the gerrit/gitlab IPs
[10:01:14] but it must exist in netbox
[10:01:21] script right now is adding the interface but not the IP:
[10:01:22] https://netbox-next.wikimedia.org/dcim/devices/3632/interfaces/
[10:01:44] https://netbox.wikimedia.org/ipam/ip-addresses/4472/
[10:02:31] the only problem I see in the script output
[10:02:42] is that it ignores it AFAIK
[10:05:35] is that a problem? it's by design no?
[10:09:39] volans: I guess perhaps it could add the IP if it finds it, just not attach to interface in netbox.
[10:09:55] yes that's what I meant. I don't recall all the corner cases, but I think it should mention it even if already exists
[10:09:59] but slippery road, maybe a report to check that IPs discovered by puppet are recorded in netbox?
[10:10:00] or maybe not, not sure
[10:11:27] yeah it's tricky
[10:11:57] remember that some of the import puppetdb code was also designed during the transition phase to netbox
[10:12:16] some of that logic can go now probably and just assume netbox is the source of truth
[10:15:48] yeah... I think it's best rolled into the work John was talking about re: systemd-networkd, and the wider issue of what is the source of truth for IPs on systems that have multiple.
[10:16:07] the edge-cases right now are fairly minor I think so no major worry
[10:16:48] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) Plan of action: General overview before/after. Red: deactivated/removed. Green: activated/added. {F35550079} We're...
[10:20:35] +1
[10:26:26] when we have moved to systemd-networkd we can also more easily filter out addresses which should not be imported (e.g. by means of adding Description= to same stanza which makes the import script filter them out)
[11:31:30] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond)
[11:31:36] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb postgress: Improve postgress standby server - https://phabricator.wikimedia.org/T313217 (jbond) Open→Resolved a:jbond puppetdb has now been migrated to use replication slots
[12:39:12] netops, Infrastructure-Foundations, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero)
[12:53:24] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero)
[13:14:45] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) I don't think it's true to say the VRRP is over VXLAN here, the VRRP...
[13:36:36] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (Jclark-ctr) cableid c220756659 fpc2 - fpc8.
[14:11:31] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (aborrero)
[14:30:54] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) Ok yeah I see what is going on. Cloudnet1005 is running VXLAN over U...
[15:17:18] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Row C got moved to the new linecards with no issues, but moving cr1<->row D caused an outage. As row C cleanup, @Jclark-ctr can you rem...
[15:24:59] Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (dancy)
[15:28:03] Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (dancy)
[15:29:10] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) @aborrero thanks. Reading briefly through the docs I have a better u...
[15:36:18] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (aborrero) >>! In T319539#8291916, @cmooney wrote: > I gather the hypervisor ho...
[15:45:16] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (cmooney) > But we do have keepalived running on cloudgw servers. So we may wan...
[15:47:14] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[16:05:21] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (ayounsi) This has been completed smoothly! I deleted the following VC cables from Netbox: 0315 0316 0317 0318 0320 Please...
[16:06:58] netops, Infrastructure-Foundations, SRE, SRE-OnFire, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (ayounsi) Open→Resolved a:ayounsi Sub-task completed successfully nothing more to do here.
[16:27:42] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[16:40:09] er, Netbox regression/bug, I can't set this interface to be part of the ae2 LAG... https://netbox.wikimedia.org/dcim/interfaces/27155/ because the LAG is on a different VC member (it used to be possible), I guess next option is to try nbshell
[16:42:31] looks like it worked, easier than expected
[16:45:55] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Also looks like the optic or fiber needs to be replaced, error rate is high: https://librenms.wikimedia.org/device/device=162/tab=port/p...
[16:48:54] lol
[16:50:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (cmooney) Diff if the above patch is merged (running from my laptop with updated template): ` Changes for 8 devices: ['c...
[18:49:35] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (Jclark-ctr) Can this be changed at any time? I will work on netbox updates when not in data center