[07:31:44] gerrit needs to restart to apply a config change, it should be quick
[08:15:22] btullis: hello, I see dse-k8s-worker1012 in netbox status planned, is that expected/wanted? I got an unexpected homer diff about removing that host as a bgp peer though it does seem active
[08:18:36] also can I get access to https://phabricator.wikimedia.org/T414787 ?
[08:25:52] if you use bast2003.wikimedia.org, please temporarily change to bast1004.wikimedia.org, there's a CPU error, I've opened https://phabricator.wikimedia.org/T420320 for DC ops to look into it
[08:31:52] Hello. What's the process for getting a Puppet change reviewed and merged? I submitted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1249932 to remove an old periodic job and I'm unsure how to proceed
[08:33:42] phuedx: couple of ways, https://wikitech.wikimedia.org/wiki/Puppet_request_window and/or poke the relevant SRE team, service ops in this case
[08:33:56] <3 Thanks
[08:34:13] sure np
[08:41:23] https://translate.kagi.com/?from=en&to=LinkedIn+speak x)
[08:44:53] hahaha
[08:46:57] I’m writing this with a heavy heart to share that bast2003 is currently facing some unexpected downtime. While this presents a significant challenge, it’s also an opportunity for us to lean into resilience, embrace the pivot, and focus on building back even stronger. Stay tuned as we navigate this journey of growth and optimization. #Resilience #TechUpdate #GrowthMindset #Innovation
[08:47:27] Deeply saddened to hear the news regarding bast2003. Sending my thoughts and support to the entire team during this difficult time.
[08:48:58] Reflecting on the current state of digital communication. 📧 Is it just me, or has the volume of daily correspondence become a significant challenge for productivity? In today’s fast-paced corporate landscape, navigating a saturated inbox can feel overwhelming. However, it’s also an opportunity to refine our focus and prioritize high-impact interactions. 🚀 How are you optimizing your workflow to stay agile amidst the noise? Let’s discuss below! 👇 #Productivity #TimeManagement #DigitalTransformation #Leadership #Focus
[08:49:23] "I walked my dog" takes epic proportions
[08:49:26] arnaudb: what have you done to us :D
[08:49:30] :D
[09:35:29] btullis: I'm putting dse-k8s-worker1012 back to status active in netbox since the host seems up and with the correct role applied
[09:36:15] godog: Super, thanks.
[09:36:43] btullis: np, also can I get access to https://phabricator.wikimedia.org/T414787 if you can?
[09:37:34] topranks: I'm running homer on cr* eqiad and I see the ospf metric change, is that you? safe to proceed ?
[09:41:28] godog: let me look, what are you trying to change on the CRs?
[09:41:49] topranks: I'm attempting to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1253574
[09:42:02] I've made a few manual changes to affect traffic priority, so might be easier for me to manually add the bits for you too than roll it all back
[09:42:26] topranks: ok no worries, change can wait for you to be done
[09:42:31] godog: can you just say 'no' / cancel that?
[09:42:41] topranks: yes done, I'll stand by
[09:42:42] I'll take a look at the diff and push it for you
[09:42:45] cool thanks gimme 5 min
[09:42:54] ok no worries
[09:46:56] godog: I'm looking at this change and don't think it's correct actually
[09:47:15] these are the last lines in the resulting acl:
[09:47:20] https://www.irccloud.com/pastebin/yOLRaJHu/
[09:47:36] so your rule is going after the "term default then accept" allow all
[09:47:56] if the traffic is currently blocked it's probably due to the deny-from-instances rule before that, and would still be blocked with this
[09:48:28] godog: actually ignore me - the way I'd done the comparison affected the order
[09:48:45] lol ok
[09:49:15] to be clear I'm partially reverting https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970275/3/policies/cr-cloud-vrf.yaml to allow cumin/cloudcumin return traffic
[09:50:12] godog: is this urgent? can I look at it this afternoon?
[09:50:22] topranks: yeah totally, not urgent
[09:50:30] looking at the task and the patch I'm trying to work out exactly what is broken and why
[09:50:52] I've no particular objection to this going in, but the flow seems to be different than I'd expect
[09:51:17] godog: ok thanks, I'll take a look this afternoon when I've finished the current maintenance
[09:51:28] sure np, thank you topranks
[09:57:54] godog: I have added you to T414787 now.
[09:58:30] btullis: while you are there I see some bgp changes for dse-k8s hosts on those CRs, I will also push when I am done
[09:58:33] change of IP for dse-k8s-worker1012
[09:58:48] topranks: Yes please.
[09:59:49] btullis: cheers!
[10:00:08] Interestingly, I ran the provisioning cookbook yesterday with the `--homer` switch. As in `cookbook sre.hosts.provision --no-dhcp --no-users --legacy --homer dse-k8s-worker1012`
[10:01:00] So I wasn't expecting any outstanding BGP changes. It's good that you spotted them.
[10:01:47] btullis: I'll take a look at that, we might have some error in the cookbook
[10:02:05] it should try to determine if a given host is peering with the CRs (on row-wide vlan) or with its top-of-rack switch (per-rack vlan)
[10:02:29] perhaps there is some race condition related to when Netbox is updated that might be causing it to make a mistake here.. not sure
[10:05:20] !log disabling VRRP for et-1/0/5 sub-interfaces on cr1-eqiad T420180
[10:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:24] T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment - https://phabricator.wikimedia.org/T420180
[10:20:35] btullis: dse-k8s-ctrl2001 just paged, are you around?
[10:21:13] could be related to what cathal is working on ?
[10:21:30] Yes, I am around.
[10:21:31] effie: it ought not to be
[10:21:37] my work is in eqiad
[10:21:49] oh just noticed it was codfw, sorry
[10:22:11] keep doing your St Patrick's day work,
[10:22:27] dse-k8s-ctrl2001 is pinging from esams so it may be fixed / false alarm
[10:22:43] and a very Happy St. Patrick's Day to you too :)
[10:23:03] Hmph, I can't reach bast2003
[10:24:15] Yeah, I see some latency here https://grafana.wikimedia.org/goto/dfga474adwmpsd?orgId=1 but no sustained errors. Thanks for the heads-up.
[10:24:36] Emperor: see backscroll and https://phabricator.wikimedia.org/T420320
[10:24:58] ah, thanks, I'll tweak my ssh config
[10:51:33] fabfur: just migrate all comms to toots in the form of haikus
[10:53:54] arnaudb: Do you want me to run all my on-call handover messages through this? Because that's what's gonna happen, I'm warning you/
[10:54:28] \o/ I'm considering using it for commit messages
[11:01:26] godno.gif
[11:14:09] haiku: A single crash / reduces your computer / to a simple stone
[11:16:29] There once was a document here \ Now it is lost, do you hear? \ Our rm's too brisk \ we erased all the disk \ and the SRE's have gone for a beer
[11:17:24] not only a haiku, but also kinda gives sea shanty vibes as well if you add some backing beats
[11:18:44] limericks are surprisingly hard to keep WS
[12:32:15] godog, btullis: I've pushed those changes on cr1-eqiad and cr2-eqiad now
[12:35:14] topranks: Great! Thanks. For the record, I also had to run `homer 'asw2-a*'` to get it to disable the old 1 Gbps port. I think that this may have been missed by the `sre.network.configure-switch-interfaces` and the `sre.hosts.provision --homer` cookbooks. Bit of an edge case, switching the NICs, but worth knowing about. I have a few more to do in rows A&B, so I'll let you know if it's reproducible.
[12:37:57] btullis: yes the cookbooks all act on what they see in Netbox. So it won't know there is config for an old port that needs to be removed
[12:38:35] regarding the provision cookbook it only configures the switch, it will never do the CRs. so it makes sense to me the CR IP change didn't get pushed by it, but again it's another gap in the workflow we can hardly expect SREs to anticipate
[12:40:24] yeah I do the cleanup in the background almost daily (based on the homer daily diff email) it's no big deal, and won't happen once we've fully migrated to the new design
[12:45:14] Ack, thanks both. I'll try not to leave too much cruft behind as well.
[13:43:10] topranks: sweet thank you!
[14:40:17] Phabricator needs a short restart at 15:00 UTC (in 20 minutes)
[14:51:18] IF (or anyone with PKI experience), I have a CR for adding a new CFSSL profile, if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1251117
[15:04:42] Phabricator maintenance finished
[15:08:21] thanks!
[16:42:36] topranks: I am running sre.dns.netbox after putting a server in Failed and I have removal of vrrp-gw-1221.ulsfo.wikimedia.org. et-1-0-1-1221.cr3-ulsfo.wikimedia.org. et-1-0-1-1221.cr4-ulsfo.wikimedia.org. Expected?
[16:59:59] XioNoX: ^ any idea?
[17:03:55] Ok well I'm aborting the run
[17:15:33] 17:11:25 +icinga-wm │ RECOVERY - Host bast2003 is U
[17:15:37] IT HAS RISEN
[17:48:24] claime: not 100%, it’s no harm if they are removed I think
[17:48:41] papaul: do you know what the situation is with these?
[17:49:45] claime the server is dead
[17:49:52] we are working on it
[17:50:00] topranks: ^
[17:50:44] papaul: I was asking about the ulsfo dns record change, you know about that?
[17:52:11] topranks: yes that is the ulsfo sandbox vlan we removed
[17:52:22] forgot to run DNS cookbook
[17:52:41] claime: it is ok thank you
[17:55:28] FYI, I've made some changes to mw-web and mw-api-int, affecting how container shutdown is handled on pod termination (tl;dr - envoy now drains inbound traffic).
[17:55:28] * this change has already been piloted in their respective canaries, so we have some confidence that it does the right thing, and I'll be monitoring throughout the day.
[17:55:28] * however, in the unlikely event that an issue is escalated to us that rhymes with "unexpected 5xx errors from these services, correlated with deployments" please let me know.
[17:55:28] I'll be following up soon with more details and rollback instructions in T364245.
[17:55:28] T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis - https://phabricator.wikimedia.org/T364245
[18:08:41] swfrench-wmf: very cool :)