[10:21:27] dhinus: you around?
[10:24:32] arturo: yep
[10:25:01] I think I'm ready for a review of the tofu-infra flavor refactor
[10:26:21] nice, looking!
[10:26:33] mmm, wait, double checking a few last minute things
[10:26:39] was it more tricky than expected to get it right?
[10:26:46] some move {} blocks may not be working as expected
[10:27:24] I had to change the abstraction a couple times because of how flavors are stored in the openstack API, and how access is managed
[10:27:38] we cannot have module.project["whatever"].flavor
[10:28:19] so we have module.project["admin"].flavor, and access permissions from there to other projects
[10:29:09] ouch, are only admins supposed to modify flavors?
[10:29:29] so if you are admin in project foobar, you cannot modify the project flavors?
[10:29:33] in Cloud VPS yes, flavors are not self-service
[10:29:42] I see
[10:30:37] so I'll ping you in a few minutes, I'm looking into a mystery now
[10:30:53] ok!
[10:30:57] thanks
[10:35:52] dhinus: I think the first one is ready https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163
[10:38:01] this is not a strict noop, because as part of the refactor, we are no longer defining some of the flavors that were only useful in eqiad1, but were defined nonetheless for codfw1dev
[10:39:16] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 however is a noop (the eqiad1 one)
[10:41:14] ack
[10:42:24] we have a meeting in a few mins, I will complete the review after that :)
[10:42:52] cool, thanks
[11:33:50] topranks: hey
[11:33:53] arturo: how are we looking
[11:33:55] heh
[11:33:59] all good here
[11:35:28] let me know when we are good to begin
[11:39:13] getting out of a meeting
[11:39:25] ok
[11:39:49] topranks: I'm here now
[11:40:19] no rush, let me know when you're comfortable with me starting, if you need to check anything etc
[11:40:31] let me run the network tests
[11:41:31] dhinus: do you know anything about this alert?
[11:41:33] https://usercontent.irccloud-cdn.com/file/oPyshEzq/image.png
[11:43:22] no, I was looking at it during the meeting
[11:44:01] I can access dumps files from toolforge via NFS however
[11:44:37] topranks: other than that weird alert that we are now investigating, all other network bits are just fine now. You may proceed
[11:45:09] arturo: ok, proceeding cautiously
[11:45:20] I'm unable to access any path at https://checker.tools.wmflabs.org
[11:45:27] statics added on v4
[11:45:31] so I'm actually surprised more alerts are not firing :)
[11:45:35] ok but that's separate right?
[11:45:48] topranks: yep this started a couple hours ago
[11:45:49] dhinus: maybe the checker itself is down?
[11:45:59] arturo: I'll try restarting it
[11:46:03] thanks
[11:46:18] all my checks are ok after first statics added
[11:46:34] ok
[11:46:53] system is using the static routes in preference to the BGP ones
[11:47:07] *switch - cloudsw1-f4-eqiad to be precise
[11:47:11] https://www.irccloud.com/pastebin/0WM4xmsw/
[11:47:32] I'll add the equivalent for IPv6 and the cloud vrf IPv4
[11:47:41] ok
[11:47:48] arturo: fnegri@tools-checker-5.tools.eqiad1.wikimedia.cloud: Permission denied (publickey,hostbased).
[11:47:51] can you ssh to that VM?
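(A minimal sketch of the checks being discussed here, not taken from the log: the hostname is the one quoted above, and the reboot is the openstack CLI equivalent of rebooting it from Horizon; it assumes openstack credentials scoped to the tools project are already set up in the environment.)
    # the regular user key was refused above, so try root instead
    ssh root@tools-checker-5.tools.eqiad1.wikimedia.cloud 'uptime; systemctl --failed'
    # if ssh is completely dead, soft-reboot the instance from the CLI instead of Horizon
    openstack server reboot tools-checker-5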
[11:48:01] let me see
[11:48:04] if not, I'll reboot it from Horizon
[11:48:26] v6 in prod realm - all ok
[11:48:30] dhinus: yes, using root@
[11:48:51] right, didn't try that
[11:49:53] just now I lost SSH access
[11:49:53] topranks: I think that is your change
[11:49:53] ok
[11:49:53] rolled back
[11:49:56] arturo: I'll reboot that VM and let's see what happens
[11:50:05] how in the fuck
[11:50:16] dhinus: ack
[11:50:44] topranks: did you identify the problematic change?
[11:50:49] no
[11:50:57] want to try the last step again?
[11:51:13] what is the last change you did anyway?
[11:51:16] :-P
[11:52:22] last thing I changed was adding static default routes in the cloud vrf on cloudsw1-f4-eqiad
[11:52:26] but tbh I'm super nervous now
[11:52:37] if that can break things.... we are in a really odd position
[11:52:42] what did you lose SSH to?
[11:53:07] I lost ssh to all Cloud VPS VMs I was connected to
[11:53:11] FWIW none of the checks I was running failed, switches had full comms in all address fams internal, external etc cloud and prod
[11:53:21] maybe we lost the routing to the public IPv4, which is what the bastions use
[11:53:36] from the cloudgw?
[11:53:48] yeah via cloudgw
[11:53:59] cloudsw <-> cloudgw anyway, somehow
[11:54:43] topranks: I say we try again, and let it break for a few minutes so you can run a few checks?
[11:54:55] ok
[11:55:05] ok
[11:55:24] give me a moment to double check some things
[11:55:30] sure
[11:55:30] and I can re-try
[11:55:46] arturo: found why the alert is failing, No such file or directory: '/public/dumps/public/enwiki/20250103/status.html'
[11:55:59] it can access dumps but is looking for that specific file and is not finding it
[11:56:01] dhinus: so the file is gone from the dumps
[11:56:12] I'll try to point to a diff file
[11:56:21] dhinus: thanks, good find
[11:57:48] arturo: I know what it was, the statics propagated to BGP when they shouldn't have, conflicted with the default route the CRs were sending
[11:58:02] topranks: great
[11:58:17] I've tightened the policy now and will re-try
[11:58:26] ack
[12:03:36] arturo: I raised T390955 because the fix is not obvious
[12:03:36] T390955: toolschecker looks for a nonexistent file in dumps - https://phabricator.wikimedia.org/T390955
[12:03:45] ok
[12:04:08] * dhinus lunch
[12:04:54] topranks: so far so good, no?
[12:05:04] I've not done it yet
[12:05:12] oh, ok
[12:05:13] setting up a few more monitors first
[12:05:19] ok
[12:05:39] ok I am pulling the trigger now
[12:05:45] ack
[12:07:42] prod routes re-added
[12:07:46] they were fine before though
[12:08:50] adding v6 now
[12:08:58] so far so good, as far as I can tell
[12:09:23] seems ok
[12:09:30] gonna do the one that broke it before now
[12:09:34] (cloud ipv4)
[12:09:41] 🚢 🇮🇹
[12:10:49] seems ok
[12:10:58] same here
[12:11:02] boat italy ?
[12:11:07] ship it
[12:11:08] ship it haha
[12:11:21] :-)
[12:12:30] I'll wait a few and proceed
[12:12:35] sounds good
[12:19:08] ok I'm gonna proceed
[12:19:14] ack
[12:20:01] applying step 5 now (ibgp on cloudsw1-c8)
[12:20:47] done
[12:20:48] ok
[12:23:57] seems to be ok?
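(The eventual fix for T390955 is not worked out in the log; a date-agnostic version of the failing check might look roughly like this, using the path from the error above and assuming it runs on a host with the dumps NFS mount, e.g. a Toolforge bastion.)
    # pick the newest dated enwiki dump directory instead of hardcoding 20250103
    latest=$(ls -d /public/dumps/public/enwiki/2*/ 2>/dev/null | sort | tail -n 1)
    if [ -n "$latest" ] && [ -f "${latest}status.html" ]; then
        echo "OK: ${latest}status.html"
    else
        echo "FAIL: no enwiki dump with a status.html found"
    fi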
[12:24:04] yes, all good
[12:24:07] all checks seem fine for me
[12:24:13] ok I am proceeding to the next step
[12:24:26] ok
[12:24:37] this will hopefully trigger the bgp problem we had last week - but not affect traffic as the static routes are now in place
[12:24:47] but it is the more delicate step for sure
[12:26:06] arturo: ok I am pulling the trigger NOW
[12:26:22] ok
[12:27:58] traffic seems unaffected but we do have the bgp route withdrawal similar to last week
[12:28:04] arturo: how are things looking?
[12:28:10] topranks: seems fine here
[12:28:14] ok
[12:29:50] ok yeah looks ok
[12:30:27] and I can see the reason cloudsw1-f4 is rejecting the routes, leave it with me to see if I can work out why (cloudsw1-c8 is sending them as if they were EBGP when it should be iBGP)
[12:32:45] ok
[12:33:54] still ok?
[12:34:46] yup
[12:35:10] yeah seems fine
[12:35:51] the TL;DR is that the use of the IBGP ASN in the cloud vrf routing-instance is causing the switch as a whole to consider that ASN as "its own"
[12:36:10] it is then rejecting the routes in the default table with that AS as it thinks they are "looped"
[12:36:35] it's either a bug or nonsensical behaviour tbh, but anyway at least it's clear what it's doing
[12:37:22] "bug" is actually the wrong word, it's not a coding mistake
[12:37:23] more like the logic is poorly implemented as it should consider each vrf separately
[12:38:06] ok
[12:43:20] we have a ceph warning, looking
[12:44:05] slow osd heartbeats
[12:50:26] topranks: is it possible that some cross-switch routing is not working as expected?
[12:50:48] it's possible, I'm not sure why that would be exactly
[12:51:04] also it's more likely that some routing would be broken than that it would affect latency
[12:51:26] topranks: yeah, then broken. Ceph is reporting some cross-link failures
[12:52:40] can you give me an example?
[12:53:01] https://www.irccloud.com/pastebin/D7pBhYeb/
[12:53:39] I'm trying to find any relevant grafana dashboard which could possibly reflect/support this claim
[12:53:49] IP addresses would be ideal
[12:53:53] usage on the ceph hosts is normal
[12:53:58] there have been no changes in ~20 mins
[12:54:02] network-wise
[12:54:10] 2 second pings are high
[12:54:16] but again - routing won't affect latency
[12:54:33] topranks: we are back to HEALTH OK
[12:54:47] so, maybe this was just during the change, or the earlier mini-outage
[12:55:24] running a loop, suddenly we went from warning to OK https://www.irccloud.com/pastebin/cKadfwwY/
[12:55:53] topranks: I think we are good. Any remaining changes to apply?
[12:57:46] arturo: so yeah ignoring this ceph thing I've been able to draw some conclusions and was considering what to do next
[12:58:08] What is clear is JunOS won't allow the IBGP to co-exist with the EBGP config between the same switches
[12:58:27] I'm not sure if it's a bug or just a limitation, but it considers the IBGP AS its own everywhere
[12:58:50] it won't allow this to be separate in the VRF vs main table, which tbh is a bit mad cos I'm sure some people need that
[12:59:24] For us - it's not a blocker in general - this "dual state" is only needed for the current migration
[12:59:38] we don't need this config in place long term it's just a step in the reconfiguration of the switches
[12:59:42] ok
[13:00:22] so.... what I would propose to do is to - with the current static-routes in place which are working as fallback - reconfigure all the BGP sessions on F4 to IBGP
[13:21:34] should I just do ==8.5.0?
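(The loop dhinus mentions at 12:55 is only linked as a pastebin; a minimal version, assuming it is run on one of the ceph mon hosts, might be:)
    # crude watch for the "slow osd heartbeats" warning seen at 12:44
    while true; do
        echo "--- $(date -Is)"
        sudo ceph health detail | head -n 15
        sleep 10
    done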
[13:31:40] andrewbogott: I think 9.1.3
[13:31:58] the version that breaks it I think is 10.x
[13:38:28] arturo: I am fed and back so I will proceed?
[13:44:18] dhinus: or we merge the fix patch
[13:44:25] that is in gerrit, as you want
[13:44:47] I didn't have time to look at it, taavi volans if you think it's ok I'm fine with merging
[13:45:16] if any of you can test it, it would be great
[13:46:49] andrewbogott: tofuinfratest is failing to create both trove and magnum things, can you have a look?
[13:47:32] andrewbogott: this might be useful https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed
[13:47:39] there's no error, just timing out apparently
[13:47:42] Yep, in a bit
[13:47:47] andrewbogott: sure, no rush
[13:48:09] might be worth trying manually creating a new trove db and a new magnum cluster before even looking at tofu
[13:49:35] volans: I don't have the bandwidth today, maybe tomorrow
[13:51:08] then a pin in your setup.py is the quickest fix
[13:59:14] topranks: yes, you may proceed. But also, don't we have a network meeting now?
[14:03:54] arturo: oh shit I'd forgotten that.. again....
[14:04:01] I've nothing prepared, maybe we'll skip it?
[14:04:05] sure!
[14:04:11] no problem
[14:04:21] I also prefer getting this network change completed today
[14:04:36] ok
[14:05:24] so one-by-one I'll take down the BGP sessions from cloudsw1-c8 to cloudsw1-f4, all routing is on the statics so it is ok
[14:05:49] I'll start with the prod-realm IPv6 as we have the least relying on it
[14:07:16] seems ok so far
[14:14:32] topranks:
[14:14:33] https://usercontent.irccloud-cdn.com/file/z7KSNSEh/image.png
[14:14:42] I guess that's expected
[14:14:57] oh ok bfd status
[14:14:59] yes that's fine
[14:15:34] we couldn't downtime the switches, but perhaps better not to in case there are relevant alerts
[14:15:46] ok
[14:21:25] ok I'm gonna proceed with IPv4
[14:21:31] topranks: ack
[14:23:51] seems ok so far
[14:26:32] Ar zh el would be glad to see his new bgp alerts are working well :)
[14:26:38] :-)
[14:27:05] I see everything green so far
[14:30:07] there seems to be some sort of blip with stats collection from cloudsw1-f4
[14:30:18] it seems only to be affecting the gnmi metrics
[14:30:27] actual traffic is ok
[14:30:44] the metrics are collected over the separate oob management network - so this is something else
[14:30:54] ok
[14:31:18] fyi - seems like it missed some metrics, they are back again for the most recent few mins
[14:31:27] unrelated anyway
[14:31:34] ok
[14:31:52] i'll give it another few just to make sure they settle down as we need to see those
[14:32:56] sure
[14:33:30] dhinus: want to review https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 now?
[14:42:39] arturo: ok those stats are working fine, not sure why we had those gaps
[14:42:46] ok
[14:42:49] I'm gonna shut the last session to F4, for cloud vrf
[14:43:00] ok
[14:43:33] done
[14:51:48] topranks: sorry, got distracted. Everything seems fine here.
[14:52:11] yeah I am slowly going through them
[15:29:57] topranks: are we done? I'm about to disconnect
[15:30:39] not yet, go ahead though, the riskiest stuff is done
[15:30:42] is andrew around?
[15:31:25] I am!
watching the staff meeting at the moment
[15:31:35] sure, as am I
[15:31:55] like I say the risky bit is done, but be aware there are some changes going on and ping me if you get any sniff of an issue
[15:40:55] hey cloud team, is https://phabricator.wikimedia.org/T390987 the best way to handle an unresponsive tool maintainer?
[15:45:23] well a duplicate task is generally never helpful
[15:46:44] taavi: that's fair
[15:47:40] but i'm generally not sure if i understand the problem there. if i understand correctly, some blocked users are trying to edit but the edits are not going through because they are blocked? is there an actual, concrete problem they are causing apart from some hypothetical about resource usage?
[15:48:21] since if the only real problem is that it's DoSing other users of the tool then imho that's up to the tool maintainers to decide, and shutting down the tool wouldn't fix the problem
[15:48:52] ye it's a bit confusing
[15:48:59] doesn't help that the tool is down anyway
[20:41:59] topranks: are you here and/or doing things? I'm restarting services but that shouldn't have caused VMs to go down
[20:41:59] andrewbogott: yeah so I just finished the works, or so I thought
[20:41:59] I've reverted the last change, which was about 10 mins ago, to remove the static routes and revert to bgp
[20:41:59] ok. I'm getting 1,000,000 alerts about VMs being down
[20:41:59] seems like maybe that didn't work as expected, my own checks were all ok but I can see there is a hit on throughput
[20:41:59] the one VM I have access to is up and working ok
[20:41:59] ceph was just alerting about slow ops
[20:41:59] 21:28:23 sudo: PAM account management error: Authentication service cannot retrieve authentication info
[20:41:59] 21:28:23 sudo: a password is required
[20:41:59] I /think/ that's unrelated, although that alert had been flapping since network updates this morning
[20:41:59] I think the VM outage is something else
[20:41:59] [WRN] OSD_DOWN: 141 osds down
[20:41:59] that seems not good
[20:41:59] is there any example VM IP you can point me to?
[20:41:59] working on it
[20:41:59] at least the number is going down
[20:41:59] but those OSDs seem like a good place to start
[20:41:59] let's just give it a bit of time
[20:41:59] 99 now
[20:41:59] still at 99
[20:41:59] tools-bastion-13 on cloudvirt1060 is unreachable
[20:41:59] ceph being upset will cause VMs to be unreachable
[20:41:59] andrewbogott: do you know the fqdn or IP of tools-bastion-13?
[20:41:59] taavi: with my root key it should work though
[20:41:59] cloudvirt1060 seems to have normal network access
[20:41:59] andrewbogott: no it won't if the VM's disk is stuck
[20:41:59] tools-bastion-13.tools.eqiad1.wikimedia.cloud
[20:41:59] ok thanks
[20:41:59] taavi: yeah, true
[20:41:59] i took an example osd that's reported down and the systemd service unit for it is inactive
[20:41:59] again, whatever side effects with VMs you are seeing are all explained by these ceph issues where ceph thinks half the cluster is offline
[20:41:59] yeah, there are definitely ceph issues. Last time that happened cloudvirts went offline too but maybe this time it's just ceph
[20:41:59] !log admin taavi@cloudcephosd1008 ~ $ sudo systemctl start ceph-osd@11.service
[20:41:59] dcaro we could probably use a hand if you're conscious
[20:41:59] it's pingable from cloudnet
[20:41:59] root@cloudnet1006:~# ping 172.16.1.16
[20:41:59] PING 172.16.1.16 (172.16.1.16) 56(84) bytes of data.
[20:41:59] 64 bytes from 172.16.1.16: icmp_seq=1 ttl=64 time=2.21 ms
[20:41:59] 64 bytes from 172.16.1.16: icmp_seq=2 ttl=64 time=0.482 ms
[20:41:59] seeing an issue reaching the dns from cloudnet though
[20:41:59] https://www.irccloud.com/pastebin/PlR78zHF/
[20:41:59] ceph reporting 177 osds offline
[20:41:59] Wow
[20:41:59] Coming to a laptop
[20:41:59] dcaro: sorry to ping you during your break. This is very likely due to network changes but we could use help diagnosing the specifics
[20:41:59] (and, probably, recovering)
[20:41:59] * dcaro on a laptop
[20:41:59] manually starting the stopped osd service units seems to make ceph think they are up again
[20:41:59] that's interesting. topranks you reverted something? Such that maybe restarting things will make a recovery?
[20:41:59] i'm tempted to just cumin 'systemctl status ceph-osd.target' on all osd hosts
[20:41:59] andrewbogott: yeah I reverted the very last change I made which was done just prior to this
[20:41:59] there was a traffic issue at some point yes
[20:41:59] most racks have normal traffic again, e4 still looks funny
[20:41:59] * taavi does
[20:41:59] taavi: as long as we don't restart any that are already up... that seems fine
[20:41:59] not sure if applications need a restart, if they don't recover from an interruption then yes
[20:41:59] jumbos were lost too
[20:41:59] they should be able to realize they are up eventually by themselves though
[20:41:59] the restart might speed it up
[20:41:59] it's recovering fast
[20:41:59] huh, from cumin: ssh: connect to host cloudcephosd1029.eqiad.wmnet port 22: Connection timed out
[20:42:00] yeah, starting the services really helped
[20:42:00] it was sticking at 177 until taavi did whatever he did
[20:42:03] 3% degraded
[20:42:06] now rapidly improving
[20:42:09] warning now
[20:42:10] no, it was stuck at 99 until I did that
[20:42:14] 1029 is out of the pool
[20:42:26] (it's waiting for a disk replacement)
[20:42:31] and ceph isn't reporting missing data anymore, which is nice
[20:42:40] it should rebalance back
[20:42:53] now it's just complaining about slow heartbeats
[20:43:03] andrewbogott: is there some known issue with 1029?
[20:43:14] those might take a bit to go out, the numbers are the highest for the last period
[20:43:21] topranks is discovering that I have set deadly tripwires throughout our infrastructure just to demoralize him
[20:43:26] taavi: 1029 is offline on purpose
[20:44:02] ok
[20:44:22] so... think we're good? We'll need to restart nfs clients but let's hold off on that for a bit.
[20:44:29] i think we're good
[20:44:54] i'm going to go back to looking at trains
[20:44:56] there are also some alerts going on
[20:45:13] thanks taavi
[20:45:33] thanks! /me back to painkillers
[20:46:21] thank you for appearing dcaro, I'll clean up what I can
[20:46:50] np, ping me again if you need extra hands, our shifts are a bit unbalanced lately
[20:47:00] ok!
[20:47:06] dcaro: sorry to have caused you to be disturbed :(
[20:47:13] topranks: do you still see routing issues anywhere?
[20:47:13] hope you're ok otherwise
[20:47:33] np, everything good, recovering quickly (painkillers help xd)
[20:47:33] no... I've changed nothing else; for one check I was using the wrong source IP, so human error, it was ok
[20:47:41] oh great, ok
[20:47:49] traffic levels have returned to normal on all switches, I'm not sure why e4 seemed to take longer
[20:48:20] I'll start cleaning up then. And I guess you can debate with arturo tomorrow about why our switches chose to betray you yet again
[20:49:27] * bd808 cleans 100+ alerts from his inbox :)
[20:49:56] it's convenient to blame the switches but I think the culprit is closer to home
[20:50:13] yeah, thunderbird just shows '99+' which feels like a rebuke
[20:53:22] I reset rabbitmq so when the alerts started firing I assumed it was that... until things started firing that have nothing to do with rabbit
[20:53:35] "Huh, I've never seen rabbitmq take down a VM before" etc.
[21:15:20] * arturo briefly online, clearly late to the party, so back to bed
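(For reference, the fleet-wide check and restart taavi describes during the outage, cumin 'systemctl status ceph-osd.target' on all osd hosts, might look roughly like this; the host selector and the choice to start the whole target are assumptions, not taken from the log.)
    # see which hosts have the OSD target not active (run from a cumin host)
    sudo cumin 'cloudcephosd1*.eqiad.wmnet' 'systemctl is-active ceph-osd.target'
    # start the target, which pulls in every enabled ceph-osd@N.service on the host
    sudo cumin 'cloudcephosd1*.eqiad.wmnet' 'systemctl start ceph-osd.target'
    # then watch the cluster catch up from a mon host
    sudo ceph -s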