[12:07:49] dcaro: FYI I created these two session proposals for the hackathon T390165 T390164
[12:07:49] T390165: [Session] What happens when you type .toolforge.org - https://phabricator.wikimedia.org/T390165
[12:07:49] T390164: [Session] Cloud VPS and IPv6 - https://phabricator.wikimedia.org/T390164
[12:08:27] thanks!
[12:25:50] topranks: hey! you ready for the network operation? :-P
[12:26:04] arturo: yep sure am!!
[12:26:14] you guys don't need that cloud services thing you built right???
[12:26:18] :P
[12:26:25] don't worry I'm confident it'll be ok
[12:26:30] heh
[12:26:49] it seems we need it today... tomorrow who knows
[12:28:26] ok well we'll be careful
[12:29:55] also, be mindful of our beloved hidden cryptominers. We don't want them losing money
[12:29:57] hehe
[12:30:24] they may leave a poor review on reddit xD
[12:32:31] haha yeah we don't want to get on the bad side of them
[12:32:58] I can already hear someone telling me to enjoy staying poor :P
[12:35:36] ok, all monitoring and network tests are green at the moment, you may proceed with changes when you want
[12:36:05] arturo: ok config is all staged, I am about to commit thanks
[12:37:06] ack
[12:37:52] arturo: ok done, OSPF is up everything seems ok
[12:37:59] ok
[12:38:02] my test pings are still working from my instance
[12:38:09] dns working
[12:38:15] I'm doing some tests, but everything seems fine
[12:38:43] including toolforge
[12:39:37] ceph seems happy as well
[12:39:51] ok, well this is where I'm gonna pause for now and take a few mins just to make sure
[12:40:32] ok, sounds good to me
[12:53:12] arturo: ok still looks ok to me what do you think?
[12:53:17] FYI https://phabricator.wikimedia.org/T389958#10682746
[12:53:39] all seems green to me
[12:54:32] have you reached the same level of changes compared to last friday?
[12:55:23] For IPv6 within the cloud-vrf, yes
[12:55:53] I guess the difference is on Friday I added the same to the prod network, and enabled it on v4 ints too
[12:56:22] if we have problems with the routing config I want to find them here in the cloud vrf for v6 - which currently is not being used
[12:56:47] ok
[12:57:17] do you propose then that we enable IPv6 on the openstack network?
[12:57:39] not yet we need to complete the rest of the work
[12:57:45] ok
[12:57:48] which we can do now during the window
[12:58:15] next step is to enable BGP but not exchange any info: https://phabricator.wikimedia.org/P74465
[12:58:43] (and btw this is being added manually, next quarter we are productionizing this setup for the prod network, so I will be templating it up then)
[12:59:38] the final step here will be to adjust the IBGP import/export policy to announce the required networks, if the sessions are stable for a few mins and cause no issues
[12:59:54] ok
[13:03:03] alright I'll proceed with the IBGP config
[13:06:46] how is that going?
[13:07:08] just added the config
[13:16:44] got an alert
[13:16:44] ok alert went away
[13:16:44] what was the alert?
[13:16:44] unable to get NRPE status
[13:16:50] another alert about cloudcontrol galera size mismatch
[13:16:50] I cannot ssh to cloudvirt1060
[13:17:04] that's an odd one, like an icinga check
[13:17:04] wtf
[13:17:04] plenty of alerts now about haproxy being offline
[13:17:04] https://usercontent.irccloud-cdn.com/file/OHYoV9MK/image.png
[13:17:04] I cannot ssh to cloudvirts
[13:17:04] lost my ssh session to cloudcephmon1004
[13:17:04] network tests failing
[13:17:04] topranks: I would say try to quickly check what's happening, and if not obvious, then immediate revert. We seem to be offline
[13:17:04] ok
[13:17:04] * dhinus paged [14x] CloudVirtDown wmcs wmcs (node eqiad)
[13:17:04] dhinus: known
[13:17:04] topranks: please revert now
[13:17:04] reverted
[13:17:04] what the actual fuck
[13:17:04] and I can ping where I couldn't
[13:17:04] I can ssh to stuff now
[13:17:04] there is quite literally no logical explanation
[13:17:04] ceph in warning status
[13:17:04] we may have a follow up toolforge outage because NFS
[13:18:42] got 10 osds down, seems to be recovering
[13:19:13] plenty of osd slow ops being reported, but they are recovering
[13:19:13] 2 osds down
[13:19:57] 1 osd down
[13:22:07] all osd UP now, but
[13:22:19] Reduced data availability: 2529 pgs inactive, 14 pgs down, 2333 pgs peering, 934 pgs stale
[13:22:19] Degraded data redundancy: 5295020/78794922 objects degraded (6.720%), 1682 pgs degraded, 1696 pgs undersized
[13:22:24] this seems concerning
[13:24:34] ok, ceph catching up now
[13:25:42] I'm not all the way awake but I can help if there's anything to be done. It seems like right now we can just wait for recovery?
[13:26:14] andrewbogott: yes. Do we want to declare there was an incident?
[13:26:36] ceph seems to be recovering on its own
[13:26:48] Probably not necessary if it's over and understood.
[13:26:50] tools k8s seems to be recovering as well
[13:27:34] yeah I think we can call this an incident
[13:27:39] throughput graphs went to 0
[13:28:06] I'm fine either way
[13:28:18] andrewbogott: new alert about rabbitmq partition. Could you please check the health?
[13:28:25] yep
[13:30:22] each rabbit node says it can't talk to the others. We think network connectivity is restored, right?
[13:30:58] andrewbogott: I assume so, yes
[13:31:20] but don't rule out some inter-switch connectivity is missing
[13:31:20] ok, I'll dig a bit deeper. it might be a leftover complaint from before
[13:31:40] I'm also noticing tools k8s may have a hard time with the api-server
[13:31:52] topranks: could you please check cloudsw to cloudsw connectivity?
[13:32:25] ceph slow ops still haven't gone down to zero, although it's only listing mons and not osds in the list of daemons with slow ops?
[13:32:48] kyverno is struggling in toolforge:
[13:32:51] aborrero@tools-k8s-control-7:~$ sudo -i kubectl get policy -A | grep "Not ready yet" | wc -l
[13:32:51] 3512
[13:32:52] aborrero@tools-k8s-control-7:~$ sudo -i kubectl get policy -A | grep -v "Not ready yet" | wc -l
[13:32:52] 7
[13:33:27] taavi: true
[13:33:34] here's an example of something I would expect to work but which isn't:
[13:33:37] root@cloudrabbit1001:~# telnet cloudrabbit1002.eqiad.wmnet 4369
[13:33:48] andrewbogott: does ping work?
[13:34:12] yes, ping works
[13:34:32] what about the wikimedia.cloud address?
[13:34:45] oh, good point
[13:35:03] seems ok
[13:35:03] https://phabricator.wikimedia.org/P74469
[13:35:28] yeah, telnet cloudrabbit1002.private.eqiad.wikimedia.cloud 5672 works
[13:35:32] so, false alarm
[13:35:37] I'll just restart the nodes and see if they cheer up
[13:37:08] tools kyverno still not catching up, I'll restart it
[13:38:01] rabbit reports that a node is up and also that it can't talk to it. How do you know it's up then, rabbit?
[13:38:26] ok, restarting a kyverno pod seems to have helped with bringing the policy resources into READY state
[13:38:40] slowly, as we have 4k of them
[13:39:01] andrewbogott: you may want to fully restart rabbitmq?
[13:39:16] yeah, I am, just complaining about it being silly
[13:39:58] it is silly! and very sensitive to all network stuff
[13:41:42] dhinus: do you remember how to run toolforge functional tests on the platform itself?
[13:42:17] I'm going to restart openstack services now in case they're upset from rabbit being unreachable
[13:43:44] taavi: if I'm understanding cephmon logs correctly, the slow ops are stalled clients that could not write to disks, example:
[13:43:52] Mar 27 13:42:36 cloudcephmon1004 ceph-mon[17970]: osd.52 osd.52 53911 : slow request osd_op(client.1173606840.0:44565 3.62c 3:346f07ef:::rbd_header.2ae5c1b7a4b88d:head [watch ping cookie 140140018530512] snapc 0=[] ondisk+write+known_if_redirected e67095622) initiated 2025-03-27T13:10:40.124044+0000 currently delayed
[13:44:24] rbd clients, so virtual machines
[13:45:29] or maybe even hypervisors
[13:45:42] arturo: from a tools-bastion IIRC
[13:45:45] not sure if ceph mon sees VMs or hypervisors as clients
[13:46:06] you have to manually clone toolforge-deploy
[13:46:07] hypervisors i guess
[13:46:08] I guess it should be hypervisor clients
[13:46:41] well it seems like the slow op number is slowly going down, so unless something's actively broken we can just let ceph process things in the background
[13:47:16] 5400 slow ops, oldest one blocked for 455 sec,
[13:47:23] let's see if the numbers go down
[13:48:27] the 'blocked for' # is going up, the number of ops is going up and down intermittently but I think trending down
[13:48:29] nova-compute alerts are rabbit/nova-conductor timeouts it seems
[13:48:31] very gradually
[13:48:42] taavi: the restarts i'm running now should help with that
[13:48:51] nova I mean
[13:49:30] ok, ceph just cleared up the worst of the PGs so I think it's recovering on its own
[13:50:27] yeah
[13:51:57] kyverno seems fully recovered now
[13:52:10] oh! ceph just switched to health_ok
[13:52:20] good job ceph :)
[13:52:47] yeah, slow but steady
[13:53:29] I'm declaring the emergency to be over
[13:54:53] sgtm
[13:55:54] I still see some "nova-compute proc maximum" alerts that are paging
[13:56:08] maybe they're over now
[13:56:15] they should be recovering soon
[13:57:08] cumin is still steadily restarting things
[13:57:27] all pages have now auto-resolved in victorops
[13:57:51] kernel errors reported on cloudcephmon1006
[13:57:53] errors are:
[13:57:56] Mar 27 13:07:21 cloudcephmon1005 kernel: TCP: request_sock_TCP: Possible SYN flooding on port 3300. Sending cookies. Check SNMP counters.
[13:57:56] Mar 27 13:08:13 cloudcephmon1005 kernel: TCP: request_sock_TCP: Possible SYN flooding on port 9283. Sending cookies. Check SNMP counters.
[13:58:19] which is a legit error, and the result of ceph DDoS'ing itself
[13:58:25] so I'll just ACK the alert
[13:59:09] oops, sorry, wrong cephmon
[13:59:24] in cloudcephmon1006 the error is the oomkiller killing the ceph-mgr process
[13:59:37] that's not great
[14:00:00] seems concerning, but also it has been restarted by systemd, so all is back into normal state now
[14:00:04] I usually reboot a server after the oomkiller fires since who knows what else it killed?
[14:00:09] unless you can tell that it only killed the one thing
[14:00:55] well, the oomkiller should report its actions
[14:01:04] true, ok
[14:01:11] but yes, a reboot to start with a clean slate is also interesting
[14:01:16] are these k8s-haproxy alerts well understood?
[14:01:49] k8s-haproxy is probably the tail end of the instability, but everything seems to be up
[14:02:32] * arturo offline now to pick up child, be back later
[14:04:10] andrewbogott: looking into the k8s-haproxy alerts
[14:04:16] great, thanks.
[14:04:16] komla: you're on clinic duty this week, right? I see several openstack resource requests, are you on top of those?
[14:04:23] andrewbogott: they're still flapping
[14:05:07] something is still messed up with openstack so I'm still looking at that... lmk if I can help with the haproxy things
[14:05:29] haproxy alerts are checking this url https://admin.toolforge.org/healthz
[14:05:44] and that URL is sometimes hanging from my machine as well
[14:06:03] * dhinus reads https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy
[14:06:13] "When this alert fires it is probably because the tool has crashed. Restart on a toolforge bastion with:"
[14:06:17] * dhinus tries that
[14:08:16] bah, there's still split brain of some sort. Guess I'll go back to the start and rebuild the rabbit cluster
[14:08:58] I think the restart fixed it. I love good runbooks :)
[14:09:39] now there's a more cryptic "No Puppet resources found on instance tools-k8s-worker-nfs-21"
[14:11:41] andrewbogott: new "nova-compute proc minimum" pages a few secs ago
[14:11:57] yeah, I'm resetting everything again. Will be noisy
[14:12:21] ack
[14:13:39] anything i can help with?
[14:15:22] not for me, thanks
[14:16:20] I mean, unless you want to reimplement rabbitmq in something other than erlang
[14:24:41] I don't understand how nova decides whether to crash or to wait when it can't talk to rabbit...
[14:34:19] i'm rebooting some tools k8s nodes that are showing up as unknown on `kubectl sudo top node`
[14:42:46] I need some breakfast before the meeting. I /think/ things are working now?
[14:43:38] alerts are definitely looking better now, I think it's just some toolforge nodes that are stuck but taavi's reboots should help
[14:51:04] andrewbogott: Yes, I am on clinic duty. I've been looking at account and vps requests. Let me check the openstack one
[14:53:07] also provisioning a new non-nfs worker since it seems like more tools are slowly migrating to buildpacks without nfs :-)
[14:58:47] progress!
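
[A minimal sketch of the kind of recovery polling described in the exchange above, assuming shell access to a host with a ceph admin keyring (e.g. one of the cloudcephmon hosts). The grep patterns and the 30-second interval are arbitrary choices, not taken from the log.]

#!/bin/bash
# Minimal sketch: poll cluster state until ceph reports HEALTH_OK again.
# Assumes a ceph admin keyring is available on this host; interval is arbitrary.
set -u
while true; do
    date -u +%FT%TZ
    # Summarised health, OSD counts, PG states and any slow-ops / degraded warnings
    ceph -s | grep -E 'health:|osds:|pgs:|slow ops|degraded|undersized' || true
    if ceph health | grep -q HEALTH_OK; then
        echo "cluster back to HEALTH_OK"
        break
    fi
    sleep 30
done

[This surfaces the same numbers the Grafana dashboards and alerts show; the loop is just a quick way to tail them from a terminal during the incident.]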
[14:59:31] guys really sorry about this mess :(
[14:59:43] my head is sore from banging against the wall here if that makes you feel better
[15:00:09] I did manage to get a little info on what broke, but the why is still confounding me
[15:00:09] https://phabricator.wikimedia.org/T389958#10683425
[15:02:14] the TL;DR is the switches in E4/F4 decided to not accept any BGP routes (regardless of version, or in the prod or cloud realm) when the new IPv6 sessions were added in the cloud vrf
[15:02:58] it's likely either a bug or some esoteric constraint in the platform or software version I didn't account for
[15:03:01] I'll keep digging
[15:03:12] I think we'll have to replicate outside production before we can try this again
[15:21:14] we've already made the change in codfw1dev right?
[15:21:43] (that would support the sw version theory)
[15:21:47] yes, but it's a different setup
[15:21:55] only 1 sw for example
[15:22:01] * andrewbogott nods
[15:23:55] andrewbogott: yeah the main difference is only one switch
[15:24:02] so there is no routing at all, no OSPF, no BGP etc
[15:25:40] the switches in E4/F4 are on a different software version than those we have elsewhere though I note
[15:26:26] may or may not be relevant but glad you raised it
[15:30:54] if there are no objections, I will start the toolforge k8s upgrade //in toolsbeta//
[15:31:59] tracking task: T362868
[15:32:00] T362868: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868
[15:32:48] topranks: #hugops Sorry the network rebelled against your authority.
[15:33:36] bd808: hahaha they can have a habit of doing so at times
[15:33:50] I won't let it have the last laugh though, it'll live to regret it :P
[15:43:46] hmm, the freshly rebooted k8s workers don't seem to be able to mount the project nfs shares
[15:44:00] and tools-nfs has a bunch of D procs which is very worrying
[15:44:17] Message from syslogd@tools-nfs-2 at Mar 27 15:43:51 ...
[15:44:17] kernel:[6292265.902160] watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/u32:11:4160997]
[15:44:18] huh
[15:45:24] andrewbogott: dhinus ^ have you seen that before? i have not :/
[15:45:51] I don't think I saw it before, but I wonder if sometimes it might have been logged when there were D processes?
[15:46:06] I rebooted a few k8s workers without checking all the logs but trusting the alert
[15:46:45] the worrying thing is that this is on the nfs server, not on the workers
[15:46:59] ah I see!
[15:47:15] is it the server that got messed up during the outage maybe?
[15:47:23] i've no idea
[15:47:43] can we try restarting the server? I don't know what the expected impact is
[15:47:52] taavi: I'd say we reboot the NFS server then all the k8s workers
[15:48:56] can any of you try that? i need to run in a moment
[15:49:06] yes I can do that
[15:49:25] thanks!
[15:49:45] do you think we should announce this invasive operation?
[15:49:55] an email won't hurt
[15:49:58] I'll send an email
[15:49:58] I can send it
[15:50:03] ok I'll let you do it :)
[15:50:08] maybe just reuse the same thread from today
[15:50:15] ok, thanks, please do, and yes, reuse the thread
[15:50:19] I will ack
[15:51:31] I'm not 100% sure that the nfs server daemon will come up after reboot so keep an eye on that, you may need to manually start it or force a puppet run.
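
[A rough sketch of the post-reboot checks being described here: force a puppet run so the service VIP is re-added, then confirm the NFS daemon and exports are back. The unit and command names assume a standard Debian nfs-kernel-server layout and are not taken from the actual host.]

#!/bin/bash
# Rough sketch of post-reboot checks on the tools NFS server VM (tools-nfs-2 in
# this incident). Assumes a standard Debian nfs-kernel-server setup.
set -u
# Force a puppet run so the service VIP is re-added to the interface;
# exit code 2 from puppet just means changes were applied, so don't treat it as failure.
puppet agent --test || true
# Confirm the VIP is actually present on an interface
ip -br addr show
# Confirm the NFS server daemon came back and is exporting the shares
systemctl status nfs-server --no-pager
showmount -e localhost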
[15:51:40] before rebooting the NFS server, I'll check the hypervisor it is running on
[15:51:45] * andrewbogott can't remember specifics but thinks there's something vaguely hands-on about rebooting an nfs server
[15:52:08] T390210
[15:52:08] T390210: Jobs in "rv" Toolforge tool can't be started - https://phabricator.wikimedia.org/T390210
[15:52:42] * bd808 is busy trying to figure out what's wrong in beta cluster
[15:53:09] hypervisor cloudvirt1041 seems fine running the tools nfs server, so I will reboot the V
[15:53:11] VM
[15:53:11] email sent
[15:53:47] you will need a forced puppet run immediately after restart so that the VIP is added
[15:53:53] ok
[15:53:53] bd808: do you think beta issues are also related to the cloudvps outage earlier today?
[15:54:12] * dcaro back
[15:54:15] atyhditf
[15:54:16] oops, I started the k8s worker reboot before the NFS server was back online, so canceling it
[15:54:17] ...
[15:54:20] anything I can help with?
[15:55:03] dcaro: maybe prepare to run `aborrero@cloudcumin1001:~ 14s 97 $ sudo cookbook wmcs.toolforge.k8s.reboot --cluster-name tools --all-nfs-workers` on my signal
[15:55:22] okok, I'll open a tmux on cloudcumin
[15:56:37] NFS server VM not coming back online, I'll force reboot
[15:57:03] okok
[15:57:17] aborrero@cloudcontrol1006:~ 2s 2 $ sudo wmcs-openstack server reboot --hard 1cb7639b-18be-455d-92df-cd9402fe0f1f
[15:57:18] that nfs is tools nfs? or dump?
[15:57:24] tools's
[15:57:36] puppet is failing on the workers complaining about dumps
[15:57:47] wait no,
[15:57:56] labstore, but the projects
[15:57:59] nm, +1
[15:58:01] it should be complaining about the tools one
[15:58:11] nfs server back online, running puppet
[15:58:51] dcaro: hit the reboot cookbook
[15:58:56] ack
[15:59:01] :-) go go go
[15:59:06] running
[16:01:13] dhinus: they could be, but I have no evidence of value yet either way
[16:02:33] it's going to take a while
[16:03:05] dcaro: I assume plenty of pods would be stuck
[16:03:18] is there an option to run the cookbook in some kind of "force" mode?
[16:03:44] quite a few yep https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&from=now-30m&to=now&viewPanel=2
[16:03:53] most should recover after restarting though
[16:04:05] (by themselves I mean)
[16:05:07] on the plus side, now we are certain there are no active connections remaining to clouddumps1001 :D
[16:05:19] (at least from tools workers)
[16:05:37] \e/
[16:16:09] thanks everyone for showing up during this operationally intense day. Really appreciated! I'm glad you are around.
[16:19:09] beta cluster's problem was a dead varnish instance. I kind of doubt that was from the network stuff, but I guess it is possible
[16:25:50] some days it's a victory just getting things back to the way they were when you woke up
[16:25:54] (most days?)
[16:28:07] bd808: I hope you saw the email about updating the opentofu network name, you are most likely affected
[16:35:00] arturo: I did see the email. I just looked and I guess I need to change https://gitlab.wikimedia.org/cloudvps-repos/deployment-prep/tofu-provisioning/-/blob/main/magnum.tf?ref_type=heads#L29 ?
[16:35:31] bd808: yes
[16:37:11] arturo: did you see my comment on the paws pull request?
[16:37:26] taavi: no, sorry, very busy few days
[16:37:39] do you have a link at hand?
[16:39:39] https://github.com/toolforge/paws/pull/485#pullrequestreview-2716715210
[16:40:44] thx
[16:49:29] should I care about 'Service tools-static-15:80 has failed probes' or is that something that's in the process of recovering?
[16:51:52] I don't think we have a cookbook for that one, andrewbogott please hit that one by hand (reboot)
[16:52:32] ok!
[16:56:45] arturo, taavi: FYI tofu plans updating the network name as an in-place update, but then it goes BOOM! when you try to apply it -- https://gitlab.wikimedia.org/cloudvps-repos/deployment-prep/tofu-provisioning/-/jobs/472791
[16:58:05] that sounds like an opentofu provider bug
[16:58:19] well no, if you are not tracking the cluster itself in the repo? are you?
[16:58:54] you are https://gitlab.wikimedia.org/cloudvps-repos/deployment-prep/tofu-provisioning/-/blob/main/magnum.tf?ref_type=heads#L8
[16:59:23] yeah, that's the cluster
[17:00:06] so the openstack API prevents modifications to the template if there are instances
[17:00:09] the provider I guess doesn't understand that the backend won't let you change a cluster template that is in use?
[17:00:20] yeah
[17:00:48] how messy is it to redeploy that cluster? :-(
[17:00:52] I'm sorry
[17:01:13] that one is "easy" in that nothing is using it. PAWS will be the messy one
[17:01:57] Vivian used to say the re-deploy was very clean
[17:02:39] they also didn't ever care if folks' workloads were interrupted
[17:02:45] correct
[17:02:59] so "clean" as long as you aren't a user I guess
[17:03:07] fair
[17:03:15] xd
[17:03:46] maybe you can create a new cluster and migrate the load? (not sure what the current setup/load type is)
[17:04:07] the network has been renamed already, so there is no way back now I'm afraid. I guess we can delay the PAWS redeploy until there are other changes to justify a full redeploy?
[17:05:13] I need to go offline now, happy to chat more about this tomorrow
[17:11:15] andrewbogott: https://bash.toolforge.org/quip/wiWV2JUBvg159pQrsA-M
[17:12:16] :(
[17:12:43] https://bash.toolforge.org/search?q=andrewbogott has some gems :)
[17:14:08] I'm tempted to propose a session for chicago when we go through the top ones
[17:14:16] (not just from andrew of course :P)
[17:14:42] maybe more suited for the next sre summit
[17:15:08] dhinus: that's basically https://office.wikimedia.org/wiki/User:MSchottlender_(WMF)/Dramatic_reading and it is awesome every time
[17:15:14] hahaha
[17:15:52] obvs I was not the first one to have that idea!
[17:16:17] bd808: about the PAWS tofu change, I wonder if we could hack the tfstate file as taavi did for tofu-provisioning
[17:17:42] ¯\_(ツ)_/¯ I'm not sure what the consequences are. If the name only needs to change for the next k8s cluster rebuild then yeah maybe that would work.
[17:18:38] for my deployment-prep one I will just rebuild it all later today I think
[17:20:15] sgtm
[17:22:15] more D-state processes in tools-k8s-worker-nfs-54
[17:23:05] but the alert disappeared before I could look at them :)
[17:24:12] I'll log off for the day. thanks everyone for working on recovering things after the outage!
[17:24:35] the reboots are still ongoing xd
[17:24:42] now just reached 57
[17:24:50] ah ok!
[17:24:55] (it takes a bit after the reboot for the stat to go down enough for the alert to clear)
[17:24:55] that explains it :)
[17:25:10] * dhinus offline
[17:25:45] I don't see many workers stuck though, I think I might be better off stopping the all-workers reboot and just rebooting the 2 that show issues
[17:26:51] cya!
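
[For reference, the "hack the tfstate file" idea floated above might look roughly like the sketch below: pull the state, rewrite the network name on the Magnum cluster template resource, and push it back so the provider never attempts the in-place update that the API rejects. The resource type, attribute name and placeholder network name are assumptions about the provider schema, not taken from the tofu-provisioning repo.]

#!/bin/bash
# Very rough sketch of editing the network name directly in the OpenTofu state
# instead of letting the provider attempt an update that Magnum refuses.
# Resource type and attribute name are assumptions; keep the backup file around.
set -eu
NEW_NET="NEW-NETWORK-NAME"   # placeholder, not the real network name
tofu state pull > state-backup.json
jq --arg net "$NEW_NET" '
  (.resources[]
   | select(.type == "openstack_containerinfra_clustertemplate_v1")
   | .instances[].attributes.fixed_network) |= $net
  | .serial += 1' state-backup.json > state-edited.json
tofu state push state-edited.json
tofu plan   # should no longer propose changing the template in-place

[Whether this is actually safe depends on what else references the template; the full rebuild path chosen for deployment-prep sidesteps the question entirely.]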
[18:48:30] * dcaro off
[18:48:33] cya on monday!
[18:58:25] interesting one guys.
[18:59:09] before all the fun I caused later, we had the link from c8 to d5 hit max capacity briefly at ~11:43 UTC
[18:59:14] https://grafana.wikimedia.org/goto/o3uQSToNR
[18:59:57] Looks like QoS worked well. We dropped nothing in the high-prio class (osd heartbeats), and dropped proportionally more of the OSD replication traffic than the rest.
[21:20:54] That was probably me taking a ceph osd out of the pool, happy to see QoS doing its job :)
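
[A possible follow-up to the link-saturation observation above: when draining an OSD, backfill traffic can be throttled so the replication class does not briefly fill an inter-switch link. The sketch below is only illustrative; the OSD id is a placeholder, the values are not tuned, and on releases using the mClock scheduler these knobs are ignored unless recovery overrides are explicitly enabled.]

#!/bin/bash
# Illustrative sketch: limit per-OSD backfill/recovery concurrency before
# taking an OSD out of the pool, then watch the resulting data movement.
set -eu
OSD_ID=123   # placeholder OSD id, not taken from the log
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph osd out "${OSD_ID}"
# Watch recovery/backfill progress and client impact
ceph -s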