[09:47:48] so basically OSPF prefixes are not impacted across sites
[09:48:08] and eqiad-codfw bgp seems fine
[09:48:23] it's eqiad-ulsfo/eqsin that's acting up
[09:48:29] bgp only
[09:48:40] indeed I can traceroute eqsin hosts from eqiad
[09:49:00] error rate on ats-be in eqsin is looking good
[09:49:18] I don't know if ulsfo or eqsin hosts need to reach eqiad LVS VIPs
[09:49:21] yeah I can see it
[09:49:29] https://www.irccloud.com/pastebin/owJzoGQM/
[09:49:30] but it should be minimal
[09:51:11] XioNoX: for a/p services I guess so
[09:51:24] Think it may be better
[09:51:25] but I can reach both
[09:51:30] cmooney@re0.cr1-eqiad> traceroute 103.102.166.240 source 185.212.145.2 no-resolve wait 1
[09:51:30] traceroute to 103.102.166.240 (103.102.166.240) from 185.212.145.2, 30 hops max, 52 byte packets
[09:51:30] 1 208.80.153.221 30.684 ms 32.510 ms 31.288 ms
[09:51:30] 2 103.102.166.138 247.374 ms 247.788 ms 247.250 ms
[09:51:30] 3 103.102.166.240 247.098 ms 247.058 ms 247.086 ms
[09:51:43] topranks: confirmed from alert1001, seems to work
[09:51:52] recoveries coming in
[09:52:03] what did you change?
[09:52:35] I'll check with Jgreen if there was any impact within Fundraising when he gets online
[09:52:59] Most recently for recovery I cleared the iBGP session between cr1-eqiad and cr2-eqiad.
[09:53:35] What caused the issue was removing "protocols bgp group Confed_eqiad metric-out minimum-igp" on both those routers.
[09:54:03] Re-adding that as first response to the reports did not seem to resolve the problem.
[09:54:18] just to understand - the impact was limited to eqiad hosts trying to contact LVS IPs in eqsin/ulsfo (hence icinga complaining), but no real user impact (ats-be ulsfo/eqsin -> eqiad for example unaffected, no routing loop)
[09:54:36] indeed
[09:54:49] ack thanks
[09:55:42] just to be clear on this. eqsin/ulsfo hosts could contact eqiad LVS IPs just fine?
[09:56:00] akosiaris: this I'm less sure
[09:56:15] when I did try it was also self-resolving, it did work but it might have been because it had already recovered
[09:56:17] but it's very possible that yes
[09:56:20] cr2-eqsin seemed to get to LVS IP in eqiad without any problem when I first checked.
[09:56:32] not 100% sure so, when I tried a discovery address it went to codfw
[09:56:42] that followed my initial revert, but prior to hard clear of session.
[09:57:00] topranks: the new metric might just get added with new BGP updates, so good call on the clear
[09:57:10] I am still not fully clear why this happened.
[09:57:26] <_joe_> akosiaris: yes they could
[09:57:45] <_joe_> I did navigate enwiki from eqsin as a logged in user during the outage
[09:57:58] topranks: you will have to try harder to earn your t-shit :)
[09:58:03] er,
[09:58:06] t-shirt :)
[09:58:13] what! but this was my big chance!
[09:58:17] sorry folks.
[09:58:55] side bonus, we checked that monitoring works
[09:58:59] _joe_: thanks for doing that and for confirmation.
[10:00:29] heh. you fixed it before i could wake up and log in. just waiting for icinga/victorops/splunk to clear for us.
[10:01:22] dwisehaupt: was there any impact on fundraising?
[10:03:02] looks like it is mostly ttl exceeded alerts for codfw hosts. checking one other thing.
[10:08:11] yeah. we're seeing a little knock on from syslog backup, but nothing front facing.
[10:09:48] OK thanks for confirming.
[10:10:03] cool!
[10:10:19] I believe I understand what happened and it makes sense.
[10:10:39] This is far from the best way to learn how BGP confederations differ from eBGP.
[10:15:00] <_joe_> I respectfully disagree. Now we're sure you'll never forget
[10:17:33] on the upside, from the fundraising side there were no active banners or email sends going on so it didn't affect that.
[10:21:57] ok. i'm headed back to sleep. have a good morning/day/night all.
[10:24:27] d.wisehaupt: many thanks, apologies for the disturbance.
[11:01:02] elukey: can I do something to get https://gerrit.wikimedia.org/r/c/operations/puppet/+/736753 moving?
[11:02:46] majavah: I think that we need to scope the work a little bit more in the task, I have it in my TODO list and I hope to dedicate some time before the end of week (I'll ping you in case)
[11:09:44] the incidents in VO didn't auto-resolve (T264016), I'll do that now
[11:09:44] T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016
[11:10:42] volans: ^ since you ack'd
[11:11:07] godog: ack, do we know why that happens? IIRC it's not the first time
[11:13:17] volans: yeah not the first time, but I'm not sure why that happens
[11:13:34] I'll bring the task up to the team meeting though
[11:13:52] do the alert and recovery emails by any chance have different texts?
[11:14:40] so that VO can't match one with the other
[11:14:51] could be that for sure yeah
[12:22:21] btw, looking at NEL data I agree re: no real user impact on the earlier iBGP issue
[13:42:12] In 20 minutes we will switch m5 master
[13:56:47] * andrewbogott is here
[13:58:03] andrewbogott: <3
[18:49:57] minRTT https://ripe83.ripe.net/archives/video/633/ cdan.is shared a blog post about this previously, but for those too lazy to read here is a presentation ;)
[19:02:55] is it possible to get a puppet spec test (for example the one for profile::puppet_compiler in https://gerrit.wikimedia.org/r/c/operations/puppet/+/740915/3) to use the wmcs hiera lookup order instead of the production one? I don't want to add defaults to hieradata/profile/ for a profile that's designed to be used in cloud only
[19:07:42] majavah: currently it's not that simple to do that, could you raise a task and link it to either T285539 or possibly T289668. me and dcaro are looking at some of these issues this Q so will take a look
[19:07:42] T285539: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539
[19:07:43] T289668: Add more rspec test to the puppet code - https://phabricator.wikimedia.org/T289668
[19:08:17] sure! what would you suggest for that patch in the short term?
[19:08:39] also i'll take a look at the specific issue in the CR but i would say that the quick fix would be to place a default under hieradata/common/profile/
[19:08:45] :D see above
[19:08:53] ok, ty
[19:11:27] T296327
[19:11:27] T296327: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327
[19:11:44] great thanks