[07:19:56] good morning early birds, can I get a review on https://gerrit.wikimedia.org/r/c/operations/dns/+/1091130 ?
[07:27:46] XioNoX: +1d
[07:27:51] thx!
[10:59:06] brouberol: the GitLab project repos/data-engineering/blunderbuss is not properly configured for the Trusted Runners (I guess it has been added manually?). See diff here https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/jobs/396744.
[10:59:06] To persist the configuration the project has to be added here: https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/blob/main/projects.json?ref_type=heads
[11:01:13] This repo is managed by Aleksandar Mastilovic (I'm not 100% sure of his IRC nick, if any). Would you mind reaching out to him on slack? Thanks!
[11:28:42] it's amastilovic, but last seen on irc 2024-06-18 (in a wmopbot channel at least)
[15:19:57] stevemunene: FYI there are 2 reimage cookbooks on cumin1002 started by your user on Oct. 31st waiting for user input
[15:39:46] Ack, thanks volans, the reimages were taken care of by someone else in the team.
[15:58:37] Hello folks, not sure if this is the right channel, but https://noc.wikimedia.org/conf seems to be down.
[15:58:59] wfm
[15:59:03] xcollazo: works for me
[16:03:34] Hmm.. my process from inside eqiad did time out a couple times but it is working again. Thanks for checking.
[16:10:51] Actually the issue continues. Example endpoint: https://noc.wikimedia.org/conf/dblists/testwikis.dblist
[16:10:51] Response: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure
[16:12:54] xcollazo: there is ongoing maintenance in eqiad, but it only started 15min ago
[16:14:00] that tracks, thanks XioNoX. I'll retry my process later.
[16:19:01] hmm
[16:19:10] I am a bit worried about this
[16:19:30] the reason noc is working for some and not others is because it is dyna
[16:19:37] like it is failing for me here
[16:20:25] sukhe@cumin1002:~$ dig mw-misc.discovery.wmnet +short
[16:20:25] k8s-ingress-wikikube-ro.discovery.wmnet.
[16:20:25] 10.2.2.70
[16:20:36] shouldn't this have been depooled for eqiad?
[16:20:42] swfrench-wmf: ^ context
[16:20:58] not if eqiad was only depooled for edge traffic, no
[16:21:33] if services are still pooled - e.g., mw-web, mw-api-ext, etc. serving RO traffic - then eqiad is very much alive / serving
[16:21:50] e.g., RO traffic from Europe will go to eqiad
[16:22:03] uh
[16:24:04] how broken is eqiad right now and how much longer?
[16:24:12] ^ this, yeah
[16:24:43] looking for signs of impact, and I'm not seeing anything, but I'm also not entirely sure where to be looking
[16:24:53] what devices are we operating on?
[16:24:56] if we are talking about the severity of "broken", then that means the CRs are down and so is edge traffic
[16:25:02] papaul: ^ where are we with this?
[16:25:44] just cr1 though right now, cr2 would follow this
[16:25:48] > START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,cr1-eqiad.mgmt with reason: router upgrade
[16:25:48] if both CRs are down then-- okay
[16:25:56] sukhe: yes just cr1 for now
[16:26:28] got it, so we're SPoF on a single CR
[16:26:34] yes
[16:27:14] I don't understand why noc.wm.o is broken some of the time even in codfw
[16:27:32] I'm seeing production errors in logstash for mw requests to https://zu.wiktionary.org/wiki/api/v2/displays that are routed to the `mw-web` servergroup. Is that expected?
[16:27:52] oh nvm, this isn't even valid I think?
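For reference, a minimal sketch of the check behind the dig above: where a discovery service currently resolves, and whether a given DC is pooled for it in conftool. The confctl selector follows the standard conftool "discovery" object syntax; treat the exact object names (dnsdisc=mw-misc) as assumptions to verify rather than as the canonical record names.

    # where does the service resolve right now?
    dig +short mw-misc.discovery.wmnet
    # is eqiad pooled for it? (read-only query of the conftool discovery object)
    sudo confctl --object-type discovery select 'dnsdisc=mw-misc,name=eqiad' get

If the dig answer still points at eqiad while the discovery object says pooled=false, the usual suspect is DNS TTL lag rather than a failed depool.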
[16:28:06] okay it's not actually broken in codfw, but it is sometimes very slow
[16:28:14] curl -v https://noc.wikimedia.org/conf/dblists/testwikis.dblist --connect-to noc.wikimedia.org:443:text-lb.codfw.wikimedia.org:443
[16:28:15] wikitech timing out here for a bit
[16:32:25] alright, zooming out, agreed that if the CRs are being upgraded, even one at a time, we should depool for edge traffic (since we're naively down 1/2 of edge capacity). for services, I *think* this is tenable to be SPoF on a single CR - i.e., (1) our DC network architecture is such that we'll just rely on the other one and (2) our WAN architecture will still be okay for transport (i.e., eqiad <> esams)
[16:32:36] XioNoX: is that correct^
[16:32:57] swfrench-wmf: I think we're about to learn something interesting
[16:33:11] the down hosts are a concern
[16:33:13] that shouldn't be happening
[16:33:23] agreed, yeah
[16:33:35] sukhe: a single cr router maintenance at a time should NOT be causing dozens of hosts to fail pings
[16:33:37] something is wrong
[16:33:41] I am not sure this was part of the deal
[16:33:43] yes
[16:34:02] and whatever is wrong probably also explains the earlier request failures and high latencies
[16:34:25] sukhe: swfrench-wmf: can we depool eqiad for services?
[16:34:25] recoveries
[16:34:29] papaul: do we know what happened here?
[16:34:37] it's always a risk though, with physical work, that some other power/network gets disturbed
[16:34:38] cdanis: I would very much like to suggest that, given cr2 will follow
[16:34:44] you have to account for that risk
[16:35:00] I'm prepping for depooling
[16:35:05] +1
[16:35:28] sukhe: that happened when switching re0 to master
[16:35:30] looking
[16:36:23] can we hold on the cr2 work until the depool is done?
[16:36:26] and is the cr1 work done now?
[16:36:36] cdanis: still going
[16:36:36] papaul: ^
[16:36:38] ack
[16:36:49] to what degree is eqiad transitioned to L3 ToR? just E+F?
[16:37:00] I think so
[16:37:17] can someone grab the list of hosts that were down and check them against Netbox for their row
[16:37:43] I do seem to remember that switching the routing engine on junos is disruptive to the BGP sessions
[16:38:50] first couple I checked are both row E, checking others
[16:39:03] If nothing else, we currently assume we can operate with only one CR, so if that's not the case, it's worth noting
[16:39:19] :)
[16:39:52] bblack: yes just E and F
[16:39:58] well, we technically thought we didn't even need to depool eqiad for edge traffic, so yeah (the actual depool was just a precaution), we need to do a retrospective
[16:40:13] [as part of this upgrade, that is]
[16:40:15] so far everything that died is row E
[16:40:34] current schedule likely means we don't get to C & D until next October (before the next dc switchover is not really achievable, which puts us out till after that)
[16:41:00] wait what.. sorry, catching up, thought this was just curiosity... row E is offline?
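A hypothetical way to script the "check them against Netbox for their row" step, using the standard Netbox REST API; the token variable, the host list, and the jq field path are illustrative and should be checked against the local Netbox version before relying on the output.

    hosts="db1190"   # extend with the other hosts that alerted
    for h in $hosts; do
      printf '%s: ' "$h"
      # the dcim/devices endpoint returns the rack as a nested object on each device
      curl -s -H "Authorization: Token $NETBOX_TOKEN" \
        "https://netbox.wikimedia.org/api/dcim/devices/?name=$h" \
        | jq -r '.results[0].rack.name // "no rack found"'
    done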
[16:41:14] I'm working on the depool, but there's something odd going on with the cookbook - I think we might have broken something with a recent change
[16:41:15] topranks:
[16:41:21] E and F went down
[16:41:26] all E1
[16:41:35] topranks: at the very least a bunch of hosts on row E lost connectivity when the RE was failed over on cr1
[16:41:50] it seems like that whole list of down/recover hosts was specifically in rack E1
[16:42:01] we are currently working on a services depool of eqiad before we continue with doing anything on cr2
[16:42:07] bblack: very very interesting
[16:42:11] let me know when it's okay to repool db1190, which paged
[16:42:32] the switch in E1 is responding
[16:42:36] cmooney@re0.cr2-eqiad> ping 10.64.130.1
[16:42:37] PING 10.64.130.1 (10.64.130.1): 56 data bytes
[16:42:37] 64 bytes from 10.64.130.1: icmp_seq=0 ttl=63 time=2.926 ms
[16:42:37] 64 bytes from 10.64.130.1: icmp_seq=1 ttl=63 time=13.311 ms
[16:43:32] topranks: the E1 ToR has bgp sessions with both CRs, right?
[16:43:36] https://www.irccloud.com/pastebin/8ANZ4mid/
[16:44:01] cdanis: no (for budget reasons mostly) each spine switch is connected to only one CR
[16:44:26] oh
[16:44:30] so all traffic to rows E/F is currently routing via cr2-eqiad over its link to ssw1-f1-eqiad
[16:44:38] alright, I was able to use test-cookbook to pull the cookbook at an older known-good CR
[16:44:42] does anyone have an example of something that is actually down
[16:44:51] not at present -- these were flaps
[16:44:52] not anymore I don't think
[16:44:53] but long flaps
[16:44:55] topranks: not anymore, no
[16:45:00] ok
[16:45:03] like, "Icinga alerted" long
[16:45:06] so ~minutes
[16:45:10] well if you just kill the router that will happen
[16:45:17] the CRs have full BGP tables
[16:45:41] by icinga in IRC, it was ~16:32 -> 16:34
[16:45:41] when the IBGP session from cr1 to cr2 suddenly goes down it takes a *long* time to re-process the _millions_ of routes that suddenly were withdrawn
[16:45:45] it's minutes to reconverge? and the reconvergence blocks external traffic?
[16:45:45] yeah
[16:45:48] s/external/internal/
[16:45:51] so that's expected
[16:46:08] is there a smooth way to do it?
[16:46:12] they're just bgp routes, the IBGP between CRs doesn't discriminate
[16:46:20] bblack: yeah it can be done gracefully
[16:46:23] can we do something like graceful shutdown or multi-next-hop?
[16:46:33] various ways.
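The "various ways" boil down to the two approaches used later in this log, shown here only as a sketch in Junos set-command form; the policy name is the one quoted further down, and the exact terms are specific to these CRs.

    # 1) de-pref what the CR learns from the spine layer, so the other CR gradually takes over
    set policy-options policy-statement BGP_Infra_In then local-preference 20
    # 2) or advertise GRACEFUL_SHUTDOWN (RFC 8326, community 65535:0) so peers that honour it
    #    drop local-pref to 0 for our routes while the session is still up
    set protocols bgp graceful-shutdown sender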
[16:47:11] for ultra-graceful you can adjust CR1 policy in advance to de-pref the routes it's getting from the spine layer, so they are still working and CR2 will slowly start preferring what it is learning directly for everything
[16:47:18] (sorry, I guess it's actually called "BGP Additional Paths")
[16:47:33] but even if you just shut the port connecting the spine switch, the ~1000 route updates generated would cause a much shorter blip (seconds)
[16:47:42] cdanis: it's not really addpath
[16:47:47] cr2 has all the routes from the get-go
[16:47:50] it's just not using them all
[16:48:12] it's the time to re-process every route in the table and decide what the new best path should be and push that to the asic that you're waiting on
[16:48:15] right
[16:48:30] so we need to do that for our internal IP space much before we actually do it for the external connections with the millions of prefixes
[16:48:37] Tbh I think there is too much reliance on "depool the site, after which disruption is fine" here
[16:48:52] so it sounds like, for the immediate very short term of finishing this maintenance, we can have papaul offline the internal interfaces on cr2 one by one?
[16:49:02] cdanis: we can also do things more gracefully for external stuff, though it's not quite as easy
[16:49:04] topranks: +1 and we're paying for it now :)
[16:49:37] but yes, if we shut down the spine peerings in advance, or de-pref them, we'd have no outage
[16:50:01] topranks: is this part of the runbook that papaul is following?
[16:50:04] Running `sudo test-cookbook -c 1077111 sre.discovery.datacenter depool eqiad --reason "Network maintenance"`
[16:50:50] swfrench-wmf: thanks
[16:51:43] oh goodness right ... test-cookbook prohibits running as root
[16:51:45] sukhe: it is not, I can add something there
[16:52:11] topranks: can you come up with something quickly that will work for specifically cr2-eqiad's configuration perhaps 😅
[16:52:13] topranks: ok, to be clear, the context was that perhaps we can try this when edge/services are depooled (which should be now)
[16:52:16] so that we know it works
[16:52:24] sukhe: it's almost now
[16:52:28] definitely some blame has to fall on us (netops) here; while it makes perfect sense for this to happen, this new scenario needed to be documented
[16:52:43] it's okay, it's not about blame right now
[16:52:46] <_joe_> so to be clear *WE ARE NOT* going to perform this maintenance in codfw now
[16:52:59] no of course not
[16:53:02] <_joe_> not until we've switched mediawiki over
[16:53:05] _joe_: indeed not.
[16:53:12] <_joe_> ok :)
[16:53:13] let's get eqiad depooled for services and finish up
[16:53:13] I think that's understood but yeah good to be clear (that's for March)
[16:53:19] although that said - in general - I think we may be better off performing these "as if the site was live"
[16:53:35] I am going to manually depool RO services via conftool, starting with mw services
[16:53:35] topranks: totally agreed but I'd like to focus on the tactical rn and not the strategic
[16:53:36] rather than saying "once we depool we can do it with a bang"
[16:53:39] swfrench-wmf: thank you
[16:54:50] FTR I think it would also be fine to do grosser things to make it work (like manually check out an old rev of the cookbooks repo on a cumin host)
[16:55:32] papaul: what's the status of cr1?
[16:55:48] cdanis: rebooting re1 now
[16:56:13] swfrench-wmf: test-cookbook does run sudo itself, what's the issue?
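A sketch of the manual conftool route mentioned just above for the read-only mediawiki services; the dnsdisc names (mw-web-ro, mw-api-ext-ro) are examples only, and the real set of records should be taken from conftool itself rather than from this snippet.

    # depool eqiad for a couple of RO discovery records (example names, regex selector)
    sudo confctl --object-type discovery select 'dnsdisc=(mw-web-ro|mw-api-ext-ro),name=eqiad' set/pooled=false
    # then confirm what is still pooled in eqiad
    sudo confctl --object-type discovery select 'name=eqiad' get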
[16:56:13] fwiw I've applied the policy changes on CR1 and updated the runbook
[16:56:28] volans: oh, seriously???
[16:56:30] yes
[16:56:37] okay, that makes my life way easier, thank you!
[16:56:39] it sets it all up in your home
[16:56:45] then runs the cookbook as usual
[16:57:01] (action item for later is to make the error message clearer maybe)
[16:58:21] it does say to run it as your user and that's how you're supposed to do it; besides that, feel free to edit the files in place on the checkout in an emergency
[16:58:35] it's python, we get that advantage so we can use it when needed
[16:58:39] <_joe_> later.
[16:59:01] if you tell me what you are trying to do I can help
[16:59:54] we're depooling services now
[17:00:12] volans: thank you very much, but I think we're in an "improving" state
[17:00:39] cdanis: upgrade done on cr1, now just checking all is good there
[17:00:55] thanks
[17:01:01] https://docs.google.com/document/d/1NABhhfYqBBFbRXDr--UO_VK7RtZDKwQ4AQKjBE0-M50/edit
[17:01:11] capacity-wise, are we still able to run on a single core router, also in eqiad/codfw?
[17:01:14] topranks: I moved your edit to the "if eqiad/codfw" section, next to the pfw drain instructions
[17:01:32] cdanis: <3
[17:01:36] question_mark: yeah
[17:01:42] XioNoX: can you please link the runbook here?
[17:01:55] sukhe: https://wikitech.wikimedia.org/wiki/Juniper_router_upgrade#
[17:01:58] thank you
[17:01:59] XioNoX: it's probably a good idea in general, at the L3 switch pops anyway
[17:02:51] good; then testing that occasionally with (graceful) maintenance even with edge traffic is not a bad thing I think
[17:03:12] topranks: good point, almost forgot that we have L3 switches in POPs too
[17:03:29] <_joe_> going back to the current issue, I think I had suggested to depool the application layer
[17:03:38] <_joe_> before maintenance started
[17:03:53] <_joe_> have we assessed the impact?
[17:03:58] swfrench-wmf: how's the depooling going?
[17:04:06] _joe_: no but I would love for someone to look at that in a detailed way
[17:04:16] cdanis: 50% done
[17:04:17] _joe_: we had a random set of hosts unreachable for ~minutes in eqiad
[17:04:48] anecdotally we saw errors, or long-running requests -- even from the codfw applayer, sometimes
[17:04:53] <_joe_> yes and ofc that caused problems at random for users in eqsin, eqiad, drmrs and esams
[17:05:04] like I was having latency issues fetching noc.wm.o via codfw edge
[17:05:05] <_joe_> and sometimes from codfw ofc because we use dns
[17:05:15] cdanis: that 50% includes all (RO) external traffic serving for mw
[17:05:16] <_joe_> uh that is strange
[17:05:23] swfrench-wmf: :D
[17:05:28] _joe_: yes
[17:05:44] uhh is mw-misc pooled in codfw?
[17:06:37] topranks: what do you think of adding the switch group to the "Disable the peers" section as well?
[17:06:54] cdanis: it uses the discovery record for k8s ingress, which should be pooled in codfw
[17:07:21] it is
[17:07:23] I just checked
[17:07:26] but mw-misc is very very slow in codfw
[17:07:39] that is quite odd ...
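One way to narrow down the mw-misc slowness noted above is to pin the request to a specific edge site and look at curl's timing breakdown; the --connect-to trick is the same one used earlier in this log, and the -w variables are standard curl.

    curl -so /dev/null \
      -w 'connect=%{time_connect}s  ttfb=%{time_starttransfer}s  total=%{time_total}s\n' \
      --connect-to noc.wikimedia.org:443:text-lb.codfw.wikimedia.org:443 \
      https://noc.wikimedia.org/conf/dblists/testwikis.dblist

A fast connect with a slow time-to-first-byte points behind the edge (ingress/app layer) rather than at the CDN itself.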
[17:08:05] XioNoX: I think the de-pref is better overall; shutting the peer means routes are dropped immediately, then we have to wait on convergence to install the other ones in their place
[17:08:17] with de-pref there is still a wait, but at all times there is a working path
[17:08:23] topranks: I mean in addition to the de-pref
[17:08:45] sure, it's safe to do once the de-pref has taken effect
[17:08:59] but I'm not sure it's needed
[17:09:18] we should probably de-pref the switch side too, however, for the return path, even though convergence there is quite quick
[17:11:10] topranks: that should be done with the graceful-shutdown command above, but it needs more testing
[17:11:42] actually it should (in theory) be the only thing needed for BGP locally
[17:11:43] Amir1: given that the applayer is basically depooled in eqiad, I think you can repool db1190
[17:11:55] yeah agreed
[17:11:57] ack
[17:12:04] as graceful-shutdown should be sent to the switches and the other CR
[17:15:20] do we need anything on the far side to work with it? or is it just default in newer junos that it'll pick up on the graceful shutdown community?
[17:15:55] looking at the outbound policy on the CRs, it's messier to adjust the policy to increase MED for all the terms doing it that way
[17:16:28] it's enabled by default on the receiver side since 19.1 https://www.juniper.net/documentation/us/en/software/junos/cli-reference/topics/ref/statement/graceful-shutdown-edit-protocols-bgp.html
[17:16:38] that's cool
[17:16:51] not sure what local-pref it sets by default though
[17:19:35] topranks: taking a random host cr3-ulsfo> show bgp neighbor 198.35.26.193 (that's cr4): Options:
[17:20:06] mw-web and mw-api-ext are running "hot" in codfw but alright for the moment (we decided to hold on scaling down after the switchover) - I'll keep an eye out and potentially scale up a bit if needed
[17:20:06] Graceful Shutdown Receiver local-preference: 0
[17:20:58] alright, we're done depooling all the *eligible* services from eqiad - there are some notable exceptions (e.g., swift) [0] but all in all the app layer is largely out
[17:20:58] [0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/discovery/datacenter.py#27
[17:21:26] that'll be lower than everything
[17:21:39] there is a question, however, about unicast BGP vs. EVPN-learnt routes
[17:21:45] so, stated differently, we're in a state where the blast radius of something going wrong in eqiad is vastly reduced, but not eliminated
[17:22:41] thank you swfrench-wmf
[17:22:54] XioNoX: topranks: do you have an early version of a graceful shutdown command for cr2 you might like to test now-ish?
[17:23:33] cdanis: yeah I guess we could give it a go
[17:23:38] cdanis: `set protocols bgp graceful-shutdown sender`
[17:23:44] XioNoX: don't tell me, tell papaul :D
[17:24:51] cdanis: all good on your end to start cr2?
[17:25:04] all done on cr1
[17:25:55] papaul: I have no objection, but maybe let's try what XioNoX just posted (if that's all that's necessary?)
[17:27:26] ok so cr1 is upgraded? and the plan is to now upgrade cr2?
[17:27:34] topranks: yes, and the applayer is as depooled as it can be
[17:27:35] topranks: yes
[17:27:47] if we can quickly do a graceful-shutdown thing, then let's
[17:27:50] but otherwise let's just go
[17:27:54] IMO
[17:27:56] let's just hang on a sec
[17:28:12] topranks: let me know when to start
[17:28:16] so on CR1 we still have this line in play:
[17:28:21] set policy-options policy-statement BGP_Infra_In then local-preference 20
[17:28:34] can you remove it
[17:28:36] understandable, the addition and removal of that was only added to the runbook in the past hour
[17:28:37] yes
[17:28:54] but let's move slowly through this in general
[17:29:24] papaul: all the "Cleanup" steps are done on cr1?
[17:29:29] https://wikitech.wikimedia.org/wiki/Juniper_router_upgrade#Cleanup
[17:29:46] XioNoX: 2 more steps
[17:30:00] topranks: I see that you added the rollback for the switches, nice
[17:30:26] I did what now?
[17:30:47] sorry, removing the local-pref 20 for BGP_Infra_in
[17:30:50] I need to step away from keys for an appointment, can someone take IC or should we just declare this to no longer be an incident?
[17:30:53] topranks: missing words :) you added the rollback steps to the Cleanup section of the doc :)
[17:31:07] yeah
[17:31:17] should we try the graceful shutdown?
[17:31:24] cdanis: me, taking over
[17:31:29] I suspect because of EVPN vs unicast BGP it won't work perfectly on the switch side
[17:31:47] cdanis: we can do that but I think we want to see how this works out for cr2?
[17:31:51] regardless, taking over as IC
[17:32:13] topranks: +1 for testing it, worst case it doesn't do anything
[17:32:17] but I think shutting the session is the right thing to do, and cos the leafs are doing ECMP to spines it should be an instant switch once a WITHDRAW is received for a spine on session shut
[17:32:47] papaul: I see you have the "graceful-shutdown sender" config prepped on cr2, you can go ahead and commit that
[17:33:19] topranks: am waiting for you to give me the green light so i can start
[17:33:51] I can give you the green light to commit the staged config on cr2-eqiad
[17:33:55] but hold it there
[17:33:59] to
[17:34:01] ok
[17:35:23] topranks: it's working on the cr1 side, all the prefixes learned from cr2 have a local-pref switched to 0
[17:36:19] ok yeah that's good
[17:37:06] an example for v6, all the prefixes with cr2 as next hop have a local-pref of 0 https://www.irccloud.com/pastebin/a6HDqSz4/
[17:38:41] cool
[17:38:50] it's also working on the evpn switches
[17:40:40] well.... it's partially working
[17:40:48] which is a little odd
[17:40:49] ah?
[17:41:32] scrap that...
[17:41:36] it is fully working it seems
[17:41:56] however it looks like - this makes sense - the CR takes a long time to add the community and send an UPDATE for every route
[17:42:09] so when initially checking I saw some with local-pref 0, some with default 100
[17:42:25] but they are all showing local-pref 0 now
[17:42:40] our lesson should be "wait 15 mins" after adding the graceful-shutdown command I think
[17:42:59] same in my paste above; looking closer, for example 2001:320:237::/48 was still through cr2 while now it's through eqord
[17:43:08] the *good* news is that ssw1-f1-eqiad is preferring the EVPN route it learns from ssw1-e1-eqiad over the local-pref 0 unicast routes from cr2-eqiad
[17:43:15] and has withdrawn its announcement to the leafs.
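For reference, the kind of spot check being quoted around here can be run directly on a router; the prefix and neighbor address below are the ones mentioned earlier in the log, and exact output field names may vary a little between Junos versions.

    # did routes from the drained peer drop to local-pref 0, and do they carry the community?
    show route 2001:320:237::/48 detail | match "Localpref|Communities"
    # does this neighbor apply GRACEFUL_SHUTDOWN on receive, and with what local-pref?
    show bgp neighbor 198.35.26.193 | match "Graceful Shutdown"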
[17:43:17] leaf previous:
[17:43:29] https://www.irccloud.com/pastebin/vtwKPsit/
[17:43:35] vs now:
[17:43:39] https://www.irccloud.com/pastebin/SstbSoL2/
[17:44:41] nice!
[17:45:18] that command is going to make all future upgrades much smoother and easier
[17:48:43] yeah it should be part of the runbook
[17:48:52] topranks: it is :)
[17:49:06] but wasn't fully tested since its addition
[17:49:12] well yeah, but let's remove the "needs testing" seeing as we did
[17:49:31] for the external peers we should still do other things, no guarantee they'll respect it
[17:49:38] topranks: and add the ~15min "just in case" buffer
[17:49:45] yeah exactly
[17:49:54] speaking of, it's been 15 minutes -- should we proceed now?
[17:51:12] papaul: one question - did you add that "graceful-shutdown" command on cr1 before the start also?
[17:51:21] we can see that the link between cr2 and ssw1-f1 is properly drained now
[17:54:12] doc updated
[17:55:20] yeah.... so that command also causes the router to set local-pref to "0" for all routes it LEARNS?
[17:55:26] as well as add the community to what it sends?
[17:55:27] nice
[17:55:42] kind of mad we took this long as an industry to come up with it
[17:56:10] so where are we with cr2 now? asking with my IC hat on
[17:56:26] I think we can proceed
[17:56:46] papaul: ^
[17:56:48] XioNoX: I think we can remove that line I added to set local-pref to 20 in the infra-in policy?
[17:57:00] the graceful-shutdown command addition does it anyway
[17:57:01] doesn't seem to impact inbound routes https://www.irccloud.com/pastebin/wP528cko/
[17:57:11] topranks: sgtm
[17:57:24] I think "receive-protocol" shows before policy is applied
[17:58:06] topranks: not on cr1 looking at learned routes from cr2, or am I misunderstanding?
[17:59:41] 1.0.4.0/22 on cr2 still goes through cr1, but it makes sense to me
[18:00:07] anyway, we can discuss the details later :)
[18:00:36] papaul, sukhe, topranks, +1 to continue the runbook steps
[18:00:53] yeah - and its direct path via Telia has local-pref 0 on it: https://phabricator.wikimedia.org/P71048
[18:01:00] +1 from me also
[18:01:09] XioNoX: running upgrade on re1
[18:04:15] one thing I would caution is to wait till the graphs catch up
[18:04:36] ok
[18:04:41] topranks: "Communities: 1299:35000 14907:4 graceful-shutdown unknown iana opaque 0x4300:0x0:0x1"
[18:05:03] the vrrp change is there but I still see a lot of traffic on et-1/1/3 connected to asw2-d-eqiad
[18:05:57] topranks: I'm wondering if Telia doesn't do something like add graceful shutdown to the prefixes it sends us if it learns prefixes with graceful shutdown from us?
[18:06:47] papaul: ^ see this if you missed it, topranks wants to wait till you see that traffic has dropped down
[18:07:05] perhaps... but I see this with the routes from the spine switches too (which is where I first noticed it)
[18:07:49] https://phabricator.wikimedia.org/P71051
[18:08:31] topranks: so is that a yes for you to proceed then? [IC hat :)]
[18:08:34] sukhe: just adding the image now, i have not done any reboot yet
[18:08:55] proceed with the steps - one of which is to make sure the router interfaces look right
[18:11:27] XioNoX: I'm wondering what the 2.7Gb inbound from asw2-d-eqiad is here:
[18:11:36] https://www.irccloud.com/pastebin/lyWvClnG/
[18:13:11] looking
[18:13:26] XioNoX: topranks: let me know about to reboot re1
[18:13:35] ok now it's dropped down.....
[18:13:51] possibly counters not updating properly due to the router image install going on?
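When the monitoring counters lag during an image install, as suspected above, a quick check on the box itself avoids guessing from stale graphs; the interface name is the one from the discussion above, and both commands are standard Junos operational mode.

    # current input/output rates on the link towards asw2-d-eqiad
    show interfaces et-1/1/3 | match rate
    # or watch the counters update live for a few seconds
    monitor interface et-1/1/3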
[18:14:05] gnmic stats were dead for ~10 mins too
[18:14:37] yeah, from good old snmp, traffic moved over the other link around 18:50UTC https://librenms.wikimedia.org/graphs/to=1731607800/id=30357/type=port_bits/from=1731586200/
[18:14:51] https://librenms.wikimedia.org/graphs/to=1731607800/id=30358/type=port_bits/from=1731586200/
[18:15:21] I see what you mean about gnmi
[18:15:38] my cli show commands were slow to answer, so that's possible
[18:15:42] eh... good old snmp is dead there
[18:16:07] the last measurement it has is 5.84Gb out
[18:16:28] I expect we lost a few polls, similar to gnmi, with the router image install
[18:18:44] we still have quite a bit of traffic coming in from cr1-eqiad
[18:19:03] I'm guessing that may be going out to other transits? Following OSPF routes rather than BGP?
[18:21:46] ok I see changing the ospf costs is in the next steps
[18:22:12] so we can proceed with those I think?
[18:23:20] topranks: so that's 3.3 right? papaul, which step are you on?
[18:23:24] let's sync once before continuing
[18:23:45] sukhe: waiting for re1 to come back online
[18:24:21] sukhe: we need to be on the same page before i switch re0 to backup
[18:24:38] topranks: yeah I was looking at the transport links, it's more than I was expecting between cr1 and cr2
[18:24:56] I think it's just that they haven't been drained
[18:25:17] sukhe: yes the steps in 3 need to be done
[18:25:42] topranks: what step in 3?
[18:25:57] > Adjust OSPF metrics
[18:27:57] topranks: which metrics are we adjusting on which interface, because the runbook just says "adjust ospf metrics"
[18:28:31] all of them - it can be done by changing the circuits to 'drained' in netbox and running homer
[18:28:36] but given other changes, don't do that now
[18:29:13] ok let me know when all is good because i am ready to switch re0 to backup
[18:29:31] it can be done by running it as a diff and cherry-picking the relevant changes
[18:29:44] I can take care of it
[18:30:00] XioNoX: can we add that to the runbook too please, thanks
[18:32:06] set protocols ospf area 0.0.0.0 interface ae1.402 metric 30000
[18:32:06] set protocols ospf area 0.0.0.0 interface xe-1/0/1:0.0 metric 30000
[18:32:06] set protocols ospf area 0.0.0.0 interface xe-1/0/1:3.0 metric 30000
[18:32:06] set protocols ospf area 0.0.0.0 interface xe-3/2/1.0 metric 30000
[18:32:06] set protocols ospf area 0.0.0.0 interface xe-3/2/2.0 metric 30000
[18:32:07] set protocols ospf area 0.0.0.0 interface xe-3/2/4.0 metric 30000
[18:32:07] set protocols ospf3 area 0.0.0.0 interface ae1.402 metric 30000
[18:32:08] set protocols ospf3 area 0.0.0.0 interface xe-1/0/1:0.0 metric 30000
[18:32:08] set protocols ospf3 area 0.0.0.0 interface xe-1/0/1:3.0 metric 30000
[18:32:09] set protocols ospf3 area 0.0.0.0 interface xe-3/2/1.0 metric 30000
[18:32:09] set protocols ospf3 area 0.0.0.0 interface xe-3/2/2.0 metric 30000
[18:32:10] set protocols ospf3 area 0.0.0.0 interface xe-3/2/4.0 metric 30000
[18:32:11] set protocols ospf3 area 0.0.0.0 interface ae1.402 metric 30000
[18:32:11] set protocols ospf3 area 0.0.0.0 interface xe-1/0/1:0.0 metric 30000
[18:32:12] set protocols ospf3 area 0.0.0.0 interface xe-1/0/1:3.0 metric 30000
[18:32:12] set protocols ospf3 area 0.0.0.0 interface xe-3/2/1.0 metric 30000
[18:32:25] sry
[18:32:29] https://www.irccloud.com/pastebin/4Im2Cqqc/
[18:32:48] topranks: go for it then :)
[18:32:54] ok will do
[18:33:17] ok done
[18:33:27] now the critical part - perhaps we need to highlight it more in the runbook
[18:33:38] "Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network"
[18:33:59] topranks: we need to update the ospf metric on the remote side too
[18:35:05] true true
[18:35:29] done for esams
[18:35:56] I'll do eqord and magru
[18:37:33] done for drmrs
[18:40:27] cr2-codfw to go, I did eqord and magru
[18:41:37] topranks: it's not doing any traffic as it's the backup link https://librenms.wikimedia.org/device/device=2/tab=port/port=11592/
[18:43:31] cool
[18:43:38] graphs showing no traffic through cr2-eqiad now
[18:43:43] so I think it's ok for the reboot
[18:43:55] topranks: not reboot yet
[18:43:56] I need to step away to watch Ireland probably lose at football as usual
[18:43:58] yeah total 0: https://librenms.wikimedia.org/graphs/to=1731613200/device=2/type=device_bits/from=1731598800/
[18:44:09] topranks: :)
[18:44:57] XioNoX: tomorrow - or before you go - let's go through that runbook and work out what safeguards to add
[18:45:20] topranks: yep, and steps to detail more
[18:46:00] XioNoX: topranks: thanks for the new steps
[18:46:27] sukhe: not to put time pressure on things at all, but there's a train window starting in 15m wherein mediawiki is supposed to roll to group2
[18:46:27] I'm guessing we should probably hold on that both from the perspective of "too many things happening at once" and from the standpoint of possible disruption in eqiad causing issues for the rollout. thoughts?
[18:46:45] +1
[18:46:59] at least my two cents for waiting :)
[18:47:07] papaul: you can resume
[18:47:35] I see it's ongoing, cool
[18:47:37] XioNoX: switching re0 to backup now that re1 is done
[18:47:58] brennen: FYI ^ we'll keep you posted
[18:50:53] swfrench-wmf: ack, thanks. will hold until further notice.
[18:56:42] papaul: let me know once it's back to re0
[18:57:03] XioNoX: ok
[19:10:27] XioNoX: re0 is back up, switching now to master
[19:11:34] alright
[19:16:26] XioNoX: re0 is back to master
[19:16:46] papaul: cool, no issues during the upgrade?
[19:17:10] XioNoX: no
[19:17:35] papaul: nice!
[19:18:05] XioNoX: rolling back the changes now
[19:18:14] papaul: great
[19:18:30] I will wait for your approval to repool edge traffic and then swfrench-wmf will repool services after that
[19:18:37] sukhe: sure
[19:18:37] I'm taking care of the remote sides for ospf
[19:18:43] ok
[19:23:35] XioNoX: all good on my end
[19:24:06] great, waiting for homer
[19:24:33] XioNoX: cr1 vrrp are still all as master
[19:25:40] sukhe: did we add any broad silences that need to be cancelled before repooling?
[19:25:59] swfrench-wmf: good question, only one, for pybal failing BGP sessions
[19:26:02] I will remove those when we are ready
[19:26:10] papaul: that's expected, we don't force it one way or the other iirc
[19:26:12] awesome, thanks
[19:26:44] XioNoX: ok
[19:28:47] ospf back to normal
[19:30:49] sukhe: all good for me
[19:30:59] ok
[19:31:20] let's monitor it for a few more minutes but it lgtm too
[19:31:30] removing alerts for now
[19:31:40] silenced alerts I mean
[19:32:43] papaul: you have removed other downtimes?
[19:32:51] sukhe: yes icinga
[19:32:56] ok looks good
[19:35:54] thank you all!
[19:36:19] XioNoX: sukhe: all good on my end, going to get some food
[19:36:29] papaul: thanks!
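The services repool that follows shortly presumably mirrors the depool cookbook run earlier in the day; a sketch of what that invocation might look like, with the action name and reason string assumed rather than taken from the log.

    # repool the eligible services in eqiad once edge traffic has been stable for a while
    sudo cookbook sre.discovery.datacenter pool eqiad --reason "cr1/cr2-eqiad upgrade complete"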
[19:36:30] enjoy it
[19:36:48] brennen: I think you should be good to go on the train then
[19:36:52] ok looks good, repooling eqiad, no outstanding alerts
[19:37:31] done
[19:39:44] let's perhaps give the edge-traffic repool 10m or so to let any potential issues shake out
[19:39:54] yep
[19:40:03] after that, I'll start the repool cookbook, which will take ~20m to execute
[19:40:23] cdanis, swfrench-wmf: ok on train or should i wait ~10?
[19:41:00] brennen: I would say wait at least 5
[19:41:27] not because of the TTL but if something is really broken, it will give us an idea
[19:41:37] looks all good so far
[19:41:37] sounds good, i'll start around 19:50 UTC
[19:41:40] brennen: how long does the train usually take? just trying to get a sense of when scap will actually update eqiad
[19:42:31] this ~should~ be a relatively quick deploy.
[19:42:54] cool, I'll let you go first before I start repooling services
[19:43:33] (a deployment will add a bit of noise to monitoring how mediawiki is behaving on repool if they happen around the same time)
[19:44:40] brennen: please feel free to proceed
[19:44:46] since swfrench-wmf will wait for you
[19:44:52] swfrench-wmf: sounds good?
[19:44:59] SGTM
[19:45:07] sorry about the delay brennen
[19:45:11] brennen: please let me know when you're done
[19:45:23] will do.
[19:46:05] thanks!
[19:47:33] Folks, I had reported before that NOC endpoints like https://noc.wikimedia.org/conf/dblists/testwikis.dblist were failing. Catching up on the conversation, it seems like the maintenance is done.
[19:47:33] But I can still see https://noc.wikimedia.org/conf/dblists/testwikis.dblist time out sporadically, or take a long time to respond. Is this still expected?
[19:49:45] xcollazo: that does indeed seem like a different issue
[19:50:13] it was sporadically slow to respond even when I sent my own traffic to codfw, both during and after the incident
[19:51:30] cdanis: thanks. I will open a phab. do you know what tag would be appropriate?
[19:51:50] I guess Serviceops
[19:59:58] swfrench-wmf: done and logs seem fairly stable, please go ahead.
[20:00:15] brennen: great, thank you
[20:00:20] sukhe: repooling services
[20:01:06] thanks and gl!
[20:02:53] Opened https://phabricator.wikimedia.org/T379968 for the https://noc.wikimedia.org issues.
[20:25:47] sukhe: all done - large mediawiki services repooled since a bit before 20:10 and things appear stable
[20:28:19] thanks!