[00:31:42] Mail, Infrastructure-Foundations, SRE, Znuny: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (HMarcus) Hi all - fundraising@wikimedia.org is a delegated inbox in our domain. Meaning it acts like a normal user account, but we have granted delegate access to...
[00:35:17] Mail, Infrastructure-Foundations, SRE, Znuny: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) Thank you for the detailed and quick response @HMarcus I'll leave the personal access part to fundraising but I can confirm that where it says James Alexand...
[00:35:59] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn)
[00:36:23] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) added fundraising-tech-ops
[08:20:31] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (akosiaris) The problem described by the task (that is forwards from VRTS to donate@ failing) has been resolved. There was a configuration...
[09:03:55] How is that an RFO?! "We could see that there was no light coming from your equipment in Carrollton. Once that was resolved, we had to change the configuration from our end in order to restore the traffic. Thank you for the cooperation in this case."
[09:16:39] lol
[09:44:08] hmm yeah.
[09:44:17] Taken at face value there is either active equipment between us and whatever device they were looking at, or our optic was somehow dodgy but recovered itself (seems unlikely?)
[09:45:00] maybe they just rfo.sh | mail
[09:45:08] which code is more or less
[09:45:28] echo "There was a problem, the problem was resolved.
Thank you for your cooperation."
[09:46:18] lol. That script is 80% of their big automation push, but it won't be ready till 2030 I think.
[09:47:43] But in seriousness I'm inclined to believe it's more real, as it matches the things they said while it was down.
[09:48:06] We changed the optic our side now, if that had anything to do with it. I'd love to know what "configuration change" they did to "restore traffic"
[09:48:14] that's probably the crux of it.
[09:54:18] yeah, that's the part I don't like
[09:55:09] they said they lost link from us, fair, that happens. But that they have to do config change to put the link back in service is not normal
[10:05:08] Yeah 100% that's not normal, and the level of info is a joke.
[10:46:15] topranks, XioNoX: any of you around for some debugging why a VM lost its connectivity?
[10:46:29] moritzm: sure yeah
[10:46:32] sounds like great fun :)
[10:46:40] damn too slow
[10:46:45] hahaha
[10:46:48] this went up 30 mins ago: PROBLEM - Host urldownloader2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:47:06] the active urldownloader instance for codfw is 2002, so this has no immediate impact
[10:48:02] urldownloader2001.wikimedia.org kvm debootstrap+default ganeti2027.codfw.wmnet running 1.0G row_A
[10:48:02] I've connected to the console of urldownloader2001 (via sudo gnt-instance console urldownloader2001.wikimedia.org from ganeti2019.codfw.wmnet)
[10:48:26] but the on host IP configuration looks all fine and matches what's in Netbox
[10:49:06] moritzm: do you see anything inbound if you run tcpdump ?
[10:49:18] What's the MAC address on the VM int if you do "ip -br link show" Moritz?
[10:50:28] https://paste.debian.net/1223686/
[10:50:37] seeing only some ARP requests
[10:50:47] MAC is in the paste above
[10:50:55] and the output of a few seconds of tcpdump
[10:54:16] yeah the ARPs are going out to the switch alright
[10:54:19] 10:54:01.479417 aa:00:00:6c:78:cd > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 2001, p 0, ethertype ARP, Request who-has 208.80.153.1 tell 208.80.153.24, length 28
[10:55:03] The switch isn't configured for the public vlan though
[10:55:08] https://netbox.wikimedia.org/dcim/interfaces/23111/
[10:55:16] ^^ we need to add the vlan here and run homer
[10:55:31] good catch!
[10:55:54] * volans out of context
[10:55:58] but why was it working before?
[10:56:16] volans: maybe moved to a different ganeti node?
[10:56:35] hopefully that's it cos otherwise it's definitely a head scratcher.
[10:57:13] yeah: because as part of the migrations needed for the reimages this is the first node with a public IP which ended up on ganeti2027 which was only added to the cluster recently
[10:57:30] ack good we have an explanation
[10:57:54] I added it to the cluster last Thu: https://phabricator.wikimedia.org/T294139#7559247
[10:59:09] topranks: could you please also check if the same might apply to the switches which are connected to ganeti2025, ganeti2026 and ganeti2028? these are also new or in the process of getting added to the codfw Ganeti cluster
[10:59:41] yep no probs, just fixing the port for ganeti2027 I'll check the others then.
[10:59:43] same is needed for https://netbox.wikimedia.org/dcim/interfaces/23114/ haven't checked the others
[10:59:58] and great catch, I would have continued to blame the OS for another hour if I hadn't asked here :-)
[11:00:31] having helped with the lvs1020 install the other day all that was fresh in my mind :)
[11:00:37] cool, thanks! out for an errand, back in 30m
[11:02:04] cool, urldownloader2001 is responding to pings now btw.
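Editor's note: the diagnosis above came from reading the VLAN tag in one tcpdump line and noticing that the switch port did not carry that VLAN. A minimal sketch of that check follows, assuming tcpdump output with link-level headers in the format quoted at 10:54:19. The function names and the `port_vlans` set are hypothetical illustrations; they are not part of Netbox, homer, or any tooling mentioned in the log.

```python
import re

# Matches the link-level header tcpdump prints for an 802.1Q-tagged frame,
# in the format quoted in the log. Only the fields needed here are captured.
TAGGED_FRAME = re.compile(
    r"(?P<src>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) > "
    r"(?P<dst>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}), "
    r"ethertype 802\.1Q \(0x8100\), length \d+: "
    r"vlan (?P<vlan>\d+),"
)


def parse_tagged_frame(line):
    """Return source MAC, destination MAC and VLAN ID, or None if no match."""
    match = TAGGED_FRAME.search(line)
    if match is None:
        return None
    return {
        "src": match.group("src"),
        "dst": match.group("dst"),
        "vlan": int(match.group("vlan")),
    }


def vlan_missing_on_port(line, port_vlans):
    """True if the frame is tagged with a VLAN that is not in port_vlans.

    port_vlans is a hypothetical set of VLAN IDs configured on the switch
    port, entered by hand for this sketch.
    """
    frame = parse_tagged_frame(line)
    return frame is not None and frame["vlan"] not in port_vlans


# The ARP request quoted in the log.
SAMPLE = (
    "10:54:01.479417 aa:00:00:6c:78:cd > ff:ff:ff:ff:ff:ff, "
    "ethertype 802.1Q (0x8100), length 46: vlan 2001, p 0, ethertype ARP, "
    "Request who-has 208.80.153.1 tell 208.80.153.24, length 28"
)

if __name__ == "__main__":
    print(parse_tagged_frame(SAMPLE))
    # The VLAN ID 100 below is an invented placeholder for the port config.
    print(vlan_missing_on_port(SAMPLE, {100}))
```

With a port that lacks VLAN 2001 the check reports the mismatch; once 2001 is in the set it does not.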
[11:06:08] moritzm: I've changed it for the port XioNoX mentioned above (ganeti2028) now also, the other two were already set up correctly.
[11:28:05] thanks!
[13:06:46] cdanis, topranks, I pushed a temporary fix for the ural.ru issue
[13:07:21] https://www.irccloud.com/pastebin/udczmDOc/
[13:12:48] XioNoX: Good stuff, I'm not sure I'm up to speed on the overall issue?
[13:13:04] and I'd bet they were going through CF before, hiding the issue
[13:13:20] Despite that I notice this second route looking at a random prefix:
[13:13:22] topranks: email to NOC
[13:13:25] https://www.irccloud.com/pastebin/a9Pci7jU/
[13:14:05] So I wonder if perhaps a ".*" at the start of your as-path regex might avoid taking that path even if the first AS is different?
[13:14:15] and I disabled CF not long ago in esams (see -sre-private)
[13:14:22] Assuming there is a problem on path between AS31500 and 35815 we are trying to avoid.
[13:14:44] topranks: it's just meant to be a test to work around that one issue
[13:15:05] I need to send an email to 31500
[13:21:05] Ok good stuff. What symptoms did you observe with that path? Traceroute dying out or similar?
[13:23:28] yeah it doesn't show any hop
[13:24:00] Hmm yeah that's what I got trying to reproduce, although not 100% on testing this way
[13:24:06] https://www.irccloud.com/pastebin/P6QFeCwM/
[13:26:59] topranks: I'm testing it form bast3005
[13:27:01] from*
[13:33:20] issue is quite visible there https://w.wiki/4ZNR
[13:34:50] https://w.wiki/4ZNT splitting by AS path
[13:36:20] Seems to me that AS31500 peer 80.249.209.157 is unreachable on that LAN... don't see any ARP for it and can't ping.
[13:36:29] Our own direct session with them is down.
[13:36:42] But we're learning routes from AMS-IX route servers with that as next hop.
[13:37:08] Any we have selected as best and are using just get nothing back in traceroute, which makes sense as we've no ARP for the next-hop
[13:37:54] Perhaps we should change that as-path depref to "^31500 .*"
[13:45:19] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (krobinson) Thanks @akosiaris - this part is indeed solved! I also agree that it makes sense to hand these all over to ITS, if that is...
[13:58:22] Ok AMS-IX route-servers have stopped announcing prefixes from AS31500 it seems.
[13:59:08] So should be back to normal. From what I could tell approx 636 prefixes across 282 ASNs were affected: https://phabricator.wikimedia.org/P18238
[14:03:46] Looks like issue started about 23:45 UTC yesterday: https://w.wiki/4ZP6
[14:31:45] Was another email about it from a different carrier, I responded.
[14:32:04] All seems good now anyway, can see traffic picking up to some affected destination ASNs: https://w.wiki/4ZPY
[14:45:40] nice!
[15:33:48] CAS-SSO, Infrastructure-Foundations: Deploy IDP test application to production - https://phabricator.wikimedia.org/T297889 (MoritzMuehlenhoff)
[15:33:58] CAS-SSO, Infrastructure-Foundations: Deploy IDP test application to production - https://phabricator.wikimedia.org/T297889 (MoritzMuehlenhoff) p:Triage→Low
[15:41:03] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro)
[16:33:15] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro) I wrote a small script to grep operations-puppet and cloud-instance-puppet with the class names pending above, and got this: {F34886637} The ones tha...
[16:47:23] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (jbond) > I wrote a small script to grep operations-puppet and cloud-instance-puppet Did you also see `utils/audit.py` in the puppet repo would be good to m...
[16:51:39] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (dcaro) >>! In T272559#7575754, @jbond wrote: >> I wrote a small script to grep operations-puppet and cloud-instance-puppet > > Did you also see `utils/audi...
[16:57:25] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (jbond) > Will take a look, I thought that one only checked puppetdb it also parses https://openstack-browser.toolforge.org/puppetclass/ although does so with...
[17:50:18] Mail, Infrastructure-Foundations, SRE, observability, and 2 others: large MX queues should page - https://phabricator.wikimedia.org/T297144 (herron) Open→Resolved a:herron I know the task description says "threshold to be determined" but calling more attention to the current check wou...
[21:26:36] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) Thank you very much @akosiaris and @krobinson would love to move those over to ITS as it's part of an epic task (to move all the al...
[21:44:16] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) Open→Resolved a:Dzahn optimistically calling resolved based on previous comments
[21:51:49] Mail, Infrastructure-Foundations, SRE, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn)
[21:52:18] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn)
[21:53:55] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) >>! In T297307#7574634, @akosiaris wrote: > I am inclined to resolve this task, but I think there might be a followup action item...
[21:55:47] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) P.S. Since you are all here. There is also open ticket T252932 which is called "Forwarding or alias for fundraising@" and you can...
[22:04:23] Mail, Infrastructure-Foundations, SRE, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) a:Dzahn→None