[06:44:15] greetings
[06:45:49] getting ready to start on T417393
[06:45:50] T417393: Carry out controlled network switch down tests in cloud - https://phabricator.wikimedia.org/T417393
[06:48:21] XioNoX topranks I'll be shutting down two ports to test the last two hosts in isolation and then cloudsw1-c8-eqiad is ready for you
[06:54:05] godog: I think only e4 and f4 need an upgrade
[06:54:13] about to go to daycare, back in 25min or so
[06:54:19] XioNoX: oh ok! even better
[07:18:43] wait when were the virts drained for that upgrade?
[07:20:04] my understanding is that we're not touching e4/f4 today
[07:20:27] I thought c8 also needed an upgrade, hence the ping to netops
[07:21:40] ok so with cloudrabbit1001 off the network, trying to connect from cloudcontrol1011 results in 'no route to host'
[07:21:59] whereas from cloudcontrol1007 and 1006 the connection actually times out, no icmp back
[07:24:24] on 1011 oslo seems to actually have given up on trying to talk to rabbitmq01, 1006 and 1007 not so much
[07:25:52] I'm ready to bring back cloudrabbit1001 unless someone wants to take a look ?
[07:29:11] how much do you love breaking things? :-P
[07:29:45] lol
[07:30:16] let's say I'd rather have me be the one that breaks them rather than chance
[07:30:27] <3
[07:31:34] I'll give it five minutes to recover then moving on to breaking cloudcontrol1011
[07:36:33] ok doing cloudcontrol1011
[07:37:06] godog: apologies I got my times mixed up, thought you were starting an hour later
[07:37:21] cloudsw1-c8 needs an upgrade so we'll take the opportunity if we can!
[07:37:57] topranks: heh you actually got the times right and I mixed them up :|
[07:38:27] haha well no worries, we are still good to proceed, are we?
[07:38:57] topranks: yes, checking how things are now with cloudcontrol1011 down then will ping you for the upgrade
[07:39:35] godog: ah actually it seems I am mistaken
[07:39:56] cloudsw1-c8-eqiad is on JunOS 21.4. We are mostly using 23.4 everywhere
[07:40:30] however as it is older hardware (qfx5100) the latest recommended stable for it is 21.4
[07:41:50] win (?)
[07:42:41] they no longer produce updated software releases for it
[07:44:56] End of support is 7/1/2027 (I guess they mean July 2027?) so we need to buy a replacement for next FY
[07:46:12] somehow my brain thinks that we only just got those switches, but it's indeed been some time now
[07:51:53] I do sometimes wonder how we got that particular model when we did, I think we were quite unfortunate with the timing in terms of when the successor (QFX5120 - Trident 3 based) was released
[07:54:18] ok I can't see anything obviously and badly broken with cloudcontrol1011 down
[07:54:30] bringing it back shortly
[07:58:28] morning
[08:05:41] greetings
[08:06:03] ok with cloudcontrol1011 back and no actual c8 upgrade to do I think we're done
[08:06:18] unless someone wants to poke at c8 while we're at it?
[08:08:07] though I don't especially cherish the company of a switch in my day to day, I don't wish them any harm either
[08:11:07] (that's to say, I'll pass on the poking opportunity)
[08:11:23] topranks: what I've observed with cloudrabbit1001 down is that cloudcontrol1011 in the same rack got 'no route to host' back when trying to connect, whereas cloudcontrol1006 and 1007 didn't get any icmp back, does that check out ?
[08:11:26] dcaro: lol
[08:11:40] no
[08:11:45] can I investigate now?
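The difference godog reports (an immediate 'no route to host' in-rack vs. a silent hang cross-rack) can be reproduced with a quick probe from each cloudcontrol. A minimal sketch, assuming openbsd netcat is available and using cloudrabbit1001's cloud-private address as it appears later in this log:

    # run from each cloudcontrol while cloudrabbit1001's switch port is shut;
    # 5671 is rabbitmq's TLS port
    time nc -vz -w 10 2a02:ec80:a000:201::17 5671
    # same rack  -> fails immediately with "No route to host" (EHOSTUNREACH)
    # cross rack -> no response at all; nc only gives up when -w 10 expires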
[08:12:17] topranks: yes, I put all interfaces back though feel free to shut cloudrabbit1001
[08:12:23] what are the cloudcontrols trying to connect to and getting that response for?
[08:12:41] so to recreate I can just shut the port to cloudrabbit1001?
[08:13:00] godog: you have the config locked on cloudsw1-c8
[08:13:01] yes that's xe-0/0/21
[08:13:09] topranks: ok I got out
[08:13:24] and what are the cloudcontrols trying to connect to exactly? a specific IP/hostname?
[08:13:42] I'm gonna shut cloudrabbit1001 port now unless any objection?
[08:13:43] cloudcontrol1007 trying to connect to cloudrabbit1001 on 5671 tcp
[08:13:54] sure sgtm, I'll update the task too
[08:15:22] godog: it's got a few IPs, and is listening on all of them on TCP 5671
[08:15:52] I'm gonna assume you're talking about connectivity on the cloud-private vrf to 2a02:ec80:a000:201::17 ?
[08:16:00] existing 5671 connections are on that it seems
[08:16:17] topranks: yes that's correct, you can see the icmp back to cloudcontrol1011 now but not e.g. cloudcontrol1006
[08:18:03] ah ok
[08:18:10] and this isn't a VIP, this is just the cloudrabbit IP itself?
[08:18:14] correct
[08:18:18] i.e. it's supposed to not work when it's down?
[08:18:28] yes exactly
[08:18:34] yeah it's not an ICMP back
[08:18:41] cloudcontrol1011 is in the same rack/vlan
[08:19:15] so to connect from 2a02:ec80:a000:201::25 (cloudcontrol1011) to 2a02:ec80:a000:201::17 (cloudrabbit1001) it will do normal neighbor discovery (the v6 equivalent of arp)
[08:19:49] when those neighbor discovery probes time out and it can't find the MAC of the other local host, subsequent connections to the IP cause the kernel to generate a "destination unreachable" response back
[08:19:59] but it's all happening locally on cloudcontrol1011 afaict
[08:20:29] topranks: got it, thank you, that explains it
[08:20:44] morning
[08:20:55] yeah, the other two blindly keep sending the packets out to their default gw (the switch), which then don't arrive, but that causes no such feedback
[08:21:02] all ok anyway I think thanks!
[08:21:09] I've re-enabled the cloudrabbit1001 port now
[08:21:14] topranks: ack, thank you
[08:22:44] topranks: did I get it right that in the case of port/host down, cross-rack/vlan traffic is blackholed ?
[08:22:51] i.e. working as intended
[08:24:07] everything looks ok yeah. in terms of the cross rack traffic with say cloudrabbit1001 down the remote switch will still see the routes for the subnets for the remote rack - so traffic will be sent to the right destination switch. But once it gets there the switch (cloudsw1-c8) won't have a neighbor entry for the specific destination IP, and the packet gets silently dropped
[08:24:32] all that is as expected
[08:26:06] cheers, ok definitely needs some adjusting on the openstack side to cater for this case
[08:27:06] what is the case exactly? a single host with a single IP going down we expect the IP to be unreachable right?
[08:27:16] is there a service running on cloudrabbit that needs to be highly available?
[08:29:10] yes all expected, what I saw is openstack getting "stuck" on retrying that same host (cloudrabbit1001) instead of failing over to other hosts
[08:29:32] ok gotcha
[08:29:32] which happened in the case of cloudcontrol1011 where openstack got 'no route to host'
[08:29:50] but not in the blackhole case, I suspect some settings need adjustment
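topranks' explanation can also be observed from the hosts themselves. A small sketch using plain iproute2; the interface name in the sample output is a guess:

    # on cloudcontrol1011 (same vlan): the ND entry for cloudrabbit1001 goes
    # to FAILED once the solicitations time out, and from then on the kernel
    # answers new connects locally with EHOSTUNREACH ("no route to host")
    ip -6 neigh show | grep 2a02:ec80:a000:201::17
    # 2a02:ec80:a000:201::17 dev vlan1151 FAILED    <- illustrative output

    # on cloudcontrol1006/1007 (other racks) no local ND is involved: the SYNs
    # go to the default gateway and are silently dropped at cloudsw1-c8, so
    # the existing connections just sit there retransmitting
    ss -tno dst '[2a02:ec80:a000:201::17]'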
[08:29:59] so it needs to time out quicker / try the others... or we need some other sort of HA mechanism with a VIP or load-balancer or something
[08:31:26] yes essentially give up / time out quicker, re: ha/vip my understanding is that the rabbit client already has explicit support for multiple hosts to do automatic failover client side
[08:31:57] I'll be updating the task with these findings, though I think we're done for today
[08:40:54] ok cool.
[11:45:24] mmhh designate on cloudcontrol is still unhappy, I'm taking a look
[12:14:04] ok four zones in status 'PENDING' from wmcs-openstack zone list --all-projects | grep PENDING
[12:23:45] and now two zones, I'm guessing updates are going through somewhat
[12:26:53] let me know if you need help digging into the dns stuff, though andrewbogot.t is the expert
[12:28:48] ack, thank you dcaro yeah I'm basically wondering if it is good to set the two zones back to active, 16.172.in-addr.arpa. and 1.0.0.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa.
[12:34:11] mmhh possibly the reason why nova-fullstack is also failing, investigating
[12:46:57] ok new fullstack tests won't be performed atm since a few pending VMs 'fullstackd*' are in 'admin-monitoring'
[12:49:02] or at least that's what it looks like to my untrained eye
[13:03:08] I'm out of ideas tbh, not sure how to verify that putting those zones back to active is safe
[13:20:26] current status seems to be that designate api is fine for a while latency-wise, then climbs; a restart of designate services then temporarily fixes it
[13:20:34] taavi@runko:~ $ curl -I --connect-to ::dumps-lb.eqiad.wikimedia.org https://dumps.wikimedia.org
[13:20:34] HTTP/1.1 200 OK
[13:21:40] neato
[13:22:28] godog: there's a cookbook to clean up fullstack VMs iirc, looking
[13:23:05] had one minor hiccup, which is that apparently LOAD_BALANCER_HEALTH_CHECKS does not include all the health check endpoints. fixed that temporarily at least with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268942
[13:23:07] Raymond_Ndibe: dhinus can you review https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/87 ? it's currently affecting all builds with the new buildpacks too
[13:23:49] dcaro: I can look in a moment
[13:25:06] hm... maybe I just dreamt about that cookbook?
[13:25:51] heh I'm not sure either
[13:26:26] there's no mention in the runbook either, probably a dream xd https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedLeakedVMs
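To keep an eye on the stuck zones without re-running the list by hand, something along these lines works (a sketch reusing the wmcs-openstack wrapper from above; -f/-c are the standard openstackclient output flags):

    # anything not ACTIVE is still waiting for designate and pdns to agree
    watch -n 30 "wmcs-openstack zone list --all-projects -f value -c name -c status | grep -v ACTIVE"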
[13:27:22] godog: can you open a task for the designate issues? to keep track and such (if there's none yet)
[13:27:52] dcaro: yes indeed, will do now
[13:28:49] designate-api on cloudcontrol1007 seems to be timing out trying to get messages from rabbit
[13:28:55] 2026-04-08 13:28:16.121 1218386 ERROR designate.api.middleware oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 989f3289ba3b4dea9494cecde5b35446
[13:30:09] yes all other hosts too
[13:31:01] we can shrink down to one designate-api on one node I think, that will reduce the debugging and make sure requests go to just one
[13:32:37] {{done}} T422646
[13:32:37] T422646: Designate API timing out - https://phabricator.wikimedia.org/T422646
[13:32:53] good idea just one node
[13:33:52] I'll try leaving only designate up on cloudcontrol1006
[13:34:20] same for designate-worker, I think one is enough
[13:34:35] maybe pdns is misbehaving though
[13:35:22] the orchestrator though seems to be dns-controller https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate#/media/File:Wmcs-dns.png
[13:35:39] s/controller/central
[13:36:34] indeed
[13:37:22] hmm. it kinda restarts every hour?
[13:37:49] kinda yeah
[13:39:06] can I help with designate debugging?
[13:39:30] probably :)
[13:39:39] yes all ideas are welcome
[13:39:59] * andrewbogott finished reading the backscroll
[13:40:09] current status is: I left designate services running only on cloudcontrol1006
[13:40:20] Is the goal right now to get things working, or to understand what's happening?
[13:40:29] what exactly do you think is broken?
[13:40:39] it seems that the designate-api is timing out waiting for responses (probably from central?), designate-worker seems to have some PENDING zones (probably central again)
[13:40:49] central does not complain
[13:41:04] there's some leaks piling up I think (godog please correct/add anything xd)
[13:41:06] andrewbogott: ideally the latter, if we can't then the former
[13:41:45] taavi: the designate api periodically gets into a loop with timeouts waiting for replies, there's two zones pending iirc
[13:41:58] are /all/ designate services stopped on all other cloudcontrols?
[13:42:01] a restart "fixes" it for a while, then the timeouts start appearing again
[13:42:24] godog: a restart of all the services or of just the api service?
[13:42:54] andrewbogott: all services, and yes to your question re: all services
[13:43:05] being stopped on all hosts but 1006
[13:43:23] regarding fixing things: it's likely that stopping all the designate services everywhere, counting to 5, and restarting them, will sort things as then no service will have expectations about any particular queue messages that are failing to arrive. But that's not especially diagnostic.
[13:43:43] Because of course a lost queue message shouldn't cause any long-term problems anyway
[13:44:37] I'm ok to test that theory of stopping designate everywhere, wait and restart
[13:44:58] and is it generally safe to put pending zones back to active ?
[13:45:09] ok. um, I forgot, this will also include stopping memcached.
[13:45:35] you shouldn't have to do anything with a pending zone, it should recover as soon as designate and pdns agree on the state
[13:46:15] ok thank you, re: memcached we'll also see keystone forgetting sessions, am I remembering correctly ?
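A quick way to gauge how widespread the MessagingTimeout errors above are, sketched with plain grep over the designate logs (the log path comes up later in this log; exact filenames may differ):

    # count the oslo reply timeouts per service file on a cloudcontrol
    for f in /var/log/designate/*.log; do
        printf '%s: %s\n' "$f" "$(grep -c MessagingTimeout "$f")"
    done
    # and watch new ones arrive while the api is wedged
    tail -f /var/log/designate/designate-api.log | grep --line-buffered 'Timed out waiting for a reply'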
[13:46:36] yeah :/
[13:46:48] ok
[13:47:04] will stop designate everywhere and memcached
[13:47:38] from my history in codfw1dev: "systemctl stop designate-api.service designate-central.service designate-mdns.service designate-sink.service designate-worker.service designate-producer.service memcached.service" <- not the most efficient, keystrokewise
[13:47:57] systemctl takes wildcards!
[13:48:11] lol yes that's what I did: systemctl stop designate-*
[13:48:21] well that's a lot easier
[13:48:24] xd
[13:49:29] ok starting memcache back up then designate
[13:51:26] just to make sure I'm caught up -- if things are still broken we should see rabbit errors in designate-api.log after a few minutes?
[13:51:28] ah globbing with 'systemctl start' doesn't really work fwiw
[13:51:47] you could cheat and run puppet to start all the services :F
[13:51:49] puppet knows what to start, if you're patient
[13:51:54] probably slower, but less typing at least
[13:51:55] :D
[13:52:06] andrewbogott: could be in the order of half an hour or so IME after timeouts showed up
[13:52:14] also yes I just ran puppet :D
[13:52:16] oh, dang
[13:52:32] ok designate up only on cloudcontrol1006
[13:52:47] love this log file
[13:52:50] -rw-r--r-- 1 designate designate 5508717 Apr 8 13:52 '.log'
[13:52:55] /var/log/designate
[13:53:15] lol
[13:53:41] f"{mod}.log"
[13:53:54] https://www.irccloud.com/pastebin/X2FiXX1t/
[13:53:59] ^ is that an improvement?
[13:54:18] I'd say it is
[13:54:19] i hope that file name doesn't take untrusted input from anywhere
[13:54:51] the famous '../../etc/passwd' syslog tag
[13:56:40] I think this is an example of the thing I said a couple weeks ago about "I'm worried that designate services on different cloudcontrols aren't actually redundant but rather interdependent"
[13:57:03] The fact that I haven't dug deep into that doesn't speak well of my curiosity as an engineer, but there's no time like the present. I think we can likely in codfw1dev with some experimentation
[13:58:20] * andrewbogott watching fullstackd-20260408135542 nervously
[13:58:38] *likely reproduce it
[13:59:31] that would mean things are way more complicated :/ (and instead of increasing availability, actually reducing it)
[14:00:01] heh, does bouncing memcached require restarting openstack services too? I see a bunch of latencies climbing on https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency
[14:00:01] yeah
[14:00:16] it shouldn't...
[14:01:01] dcaro: yep, would be bad. But the fact that designate services use memcached 'for coordination' worries me that they don't have good failure cases when they can't coordinate with their buddies anymore.
[14:01:46] yeah it shouldn't though the latency climb does line up
[14:01:48] godog: if memcached was used like a cache I would expect latencies to go up and then back down once those caches have been filled. but given this is openstack ...
[14:02:00] taavi: lol heh
[14:02:21] ok I'll bounce things on cloudcontrol1006
[14:02:34] wait, I want to see if latency recovers
[14:02:45] and also want to see if this fullstackd test completes without interruptions...
[14:02:53] unless you see actual failures elsewhere?
[14:03:24] sorry too late I bounced nova on cloudcontrol1006, and I'll stop now
[14:03:24] yeah, ok fullstackd-20260408135542 completed and cleaned up. So that means lots of things (including creating a new DNS record) are working
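For the record, the reset applied here boils down to the following sketch (run-puppet-agent is the usual WMF wrapper; the count-to-5 is andrewbogott's suggestion above):

    # on every cloudcontrol: stop the whole designate stack plus memcached
    systemctl stop 'designate-*' memcached
    sleep 5
    # globs don't help with 'systemctl start' (patterns only match units
    # that are already loaded), so let puppet bring everything back instead
    run-puppet-agent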
[14:03:41] no worries, we can test the memcached thing on another occasion
[14:04:04] It seems like we have a fix now, but not a theory for the fix.
[14:04:22] Unless designate somewhere has a 1-hour timeout set for rabbit messages
[14:05:45] heh my money is on something to do with memcache, it was the only component I hadn't touched yet
[14:08:23] yeah
[14:08:45] I have a vague recollection that we have a recent-ish SRE hire who used to work on designate. Recognize any of these names?
[14:08:49] https://www.irccloud.com/pastebin/KfSZOS1c/
[14:09:02] oops that is a lot of names
[14:10:02] yes I see your name
[14:10:24] Oh, Federico!
[14:13:24] I have fixed a couple of designate bugs but that was a couple of levels of complexity ago.
[14:13:43] ok without pending zones and nova-fullstack passing I'm relatively confident to re-enable puppet and thus designate on the other cloudcontrols
[14:13:51] sounds good to me.
[14:14:00] Was designate the only service that suffered from the outage test?
[14:16:08] andrewbogott: that didn't self-recover once things were back, yes
[14:16:31] that's not terrible I guess
[14:16:42] mind if I clear out the dns-less fullstack VMs?
[14:16:47] go for it
[14:17:46] oslo didn't react well to traffic to cloudrabbit being just blackholed and didn't select another server afaict, though it did recover once the server came back
[14:18:11] admittedly that's the host just dropping from the network, in a graceful reboot it would have worked
[14:18:58] some of that might be tunable...
[14:19:03] https://www.irccloud.com/pastebin/9ZPMCQv1/
[14:19:31] that '180' is an awful lot of heartbeats although apparently I had a reason for increasing it when I increased it
[14:20:16] wait, it's not number of heartbeats, it's...
[14:20:26] surely that's not in seconds?
[14:20:31] looking for reviews of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268978 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268952
[14:20:36] ok thank you, I left more details on https://phabricator.wikimedia.org/T417393#11798114 too
[14:22:55] it is seconds. godog were you seeing services take more than 3 minutes without flipping to a different server?
[14:24:40] andrewbogott: yes around 20m I think
[14:25:07] you can see the different reactions from oslo in https://phabricator.wikimedia.org/T417393#11798114
[14:25:11] ok, so that's probably not something we can tune our way out of :/
[14:26:04] indeed, if it is the same problem as https://bugs.launchpad.net/oslo.messaging/+bug/2096926
[14:27:37] howdy federico3!
[14:27:54] hello
[14:28:39] godog just did a failover test with our openstack services, and designate behaved very poorly. Basically once one of our three nodes went down, the remaining nodes were never able to get over it, complaining about lost rabbitmq messages forever and refusing to do their job.
[14:28:46] The solution seems to have been to clear memcached
[14:29:06] I know that memcache is involved in 'coordination' between the services but I don't know much more than that. Does that behavior sound familiar or expected to you?
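The pastebin contents aren't preserved, but the '180' being discussed is presumably oslo.messaging's heartbeat_timeout_threshold, which is indeed in seconds. A sketch of where to look, with purely illustrative values (the multi-host transport_url is what the client-side failover mentioned earlier in the day relies on):

    # /etc/designate/designate.conf (illustrative, not the actual config)
    # [DEFAULT]
    # transport_url = rabbit://designate:***@cloudrabbit1001:5671,designate:***@cloudrabbit1002:5671/designate
    # [oslo_messaging_rabbit]
    # heartbeat_timeout_threshold = 180   # seconds, i.e. rabbit declared dead after ~3 min
    grep -En 'heartbeat|transport_url' /etc/designate/designate.conf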
[14:29:23] (also godog may correct my understanding of the misbehavior)
[14:29:48] yes all checks out
[14:30:45] * andrewbogott waits to hear the sound of federico3 leaving the room, closing the door, never coming back
[14:32:47] oh boy
[14:35:42] lol
[14:35:53] (disclaimer: i'm ooto today and on a train, also haven't touched Designate in maybe 10 years) but yes it has a bunch of services with different DBs and caching to communicate between components
[14:36:04] looks like we're back, I gotta go though might read scrollback later
[14:36:36] oh sorry I thought you had worked with it more recently.
[14:37:50] In that case I'll try to ask a more general question: would you expect us to be able to run a cluster of 3 active/active/active designate services without incident? The other openstack APIs are made to be roughly stateless so you can run them in a cluster but maybe designate simply isn't
[14:38:08] it might have changed in the meantime but investigating "stuck" services was tricky, restarts would be needed
[14:39:55] at the time it wasn't primarily built for a highly dynamic environment like k8s with nodes appearing and disappearing often
[14:40:35] ok
[14:41:34] noticing stuck services and restarting them has been my standard practice for several years but godog has higher standards :)
[14:41:51] We will investigate more another day but I will keep my expectations low.
[14:42:19] but, for context, most of openstack at the time was not designed for a "chaos monkey" type of environment
[14:42:39] yeah
[14:43:07] let me look at the docs for 10 mins
[14:43:42] For the other services we actually benefit from having multiple services. Whereas with designate it feels like more of a twin-engine problem.
[14:44:08] but I'm hoping it's just a misconfig on my part
[14:45:45] when you say 3 nodes you mean 3 hosts with the full designate stack running on each?
[14:45:50] correct
[14:46:16] and memcached+mcrouter shared between them
[14:52:04] Designate could natively use multiple hosts for workers, etc, but it seems you have independent stacks sharing just a memcached? and the mysql instance?
[14:52:42] yes, also sharing mysql
[14:53:07] so it shouldn't really be independent stacks, just 3 of each service sharing state
[14:53:53] * andrewbogott re-reading https://docs.openstack.org/designate/stein/admin/ha.html
[14:54:03] oh wait that's from many years ago
[14:54:09] which component is complaining? I guess Central, because worker, producer and api are meant to be active/active
[14:54:29] but central is not active/active
[14:55:43] now I'm looking at https://docs.openstack.org/designate/latest/admin/ha.html
[14:55:49] seems to show central being active/active
[14:56:01] I believe the actual error messages were showing up in the api service
[14:56:30] but designate as a whole was generally broken, leaving zones in a 'pending' state
[14:57:00] it's just the api failing on the node that was restarted? has it been restarted again?
[14:57:00] From that chart (and my memory) it seems like the likely issue is the -producer service since it's the only one that has additional lock handling between instances
[14:57:30] restarting the api service didn't fix things, what did fix things was stopping all services and clearing memcached.
[14:58:36] (of course if the -producer service is locked, why would the api throw errors? that doesn't really make sense so maybe there were two problems)
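That -producer lock handling goes through tooz, which in a setup like this would be pointed at memcached; a way to poke at it, sketched with an assumed backend_url value and the default memcached port:

    # where the coordination backend is configured (value illustrative)
    grep -A2 '^\[coordination\]' /etc/designate/designate.conf
    # backend_url = memcached://localhost:11211

    # peek at what is sitting in memcached; stale tooz locks would live here
    # and survive designate restarts, which fits "clearing memcached fixed it"
    printf 'stats items\r\nquit\r\n' | nc -w 2 localhost 11211 | head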
[14:59:12] yes, stale locks in tooz are a thing
[14:59:25] yeah
[14:59:46] you see zones that are stuck between pending and active I guess?
[14:59:54] correct
[15:00:40] were the nodes restarted cleanly or was the plug just pulled perhaps? Look at the timeout configuration in tooz if they do not recover
[15:01:07] ok, that's a good place to start
[15:01:14] restarted with systemctl
[15:01:34] Given that you're off today I don't think you should spend much time thinking about this, I mostly wanted to know if you had an immediate gut "don't do it that way!" reaction to our setup
[15:04:25] no worries, I can take a look at the logs if you have everything collected in one place, also if the priority is to get it fixed you can shut down everything gracefully and restart everything as an initial step, but that does not clear persistent logs
[15:04:32] i mean persistent locks
[15:04:56] oh, the issue is resolved for now. Would just like our failsafe setup to actually be failsafe :)
[15:04:58] yet the timeout should have kicked in after minutes at most
[15:05:31] some logs are at https://phabricator.wikimedia.org/T417393#11798114
[15:05:37] but probably not the interesting ones
[15:06:14] (not on logstash?)
[15:06:16] (I am also in a meeting now so may not be fully responsive here)
[15:06:24] probably also there! Let me look...
[15:06:55] ok if there's no urgency we can get back to it e.g. tomorrow, I'll ping you if that works for you
[15:07:21] the logstash dashboard is here but you'll have to tune it to show designate https://logstash.wikimedia.org/app/dashboards#/view/3ef008b0-c871-11eb-ad54-8bb5fcb640c0?_g=h@e78830b&_a=h@95e12ff
[15:07:49] definitely not urgent
[15:07:54] thank you for consulting!
[15:15:26] you are welcome!
[15:15:59] federico3: thanks! 👍
[15:18:40] (I didn't do anything)
[15:18:57] gave your time and attention so far ;)
[15:37:39] We have an access request for toolforge elasticsearch access which is not something I've seen before. anyone have a doc link for how to grant?
[15:44:14] saw it not long ago, looking
[15:44:37] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Granting_a_tool_write_access_to_Elasticsearch
[15:44:40] andrewbogott: ^
[15:44:51] ty!
[15:46:27] oh, dcaro, in the meantime will you +1 https://phabricator.wikimedia.org/T422462 ?
[15:49:24] 👀
[15:51:42] it's ~1/3 of the free space for elastic (~1/5 of the total), should be ok though maybe discuss in the daily tomorrow
[15:52:21] maybe there's some tests that the user can do before using all that space to see if it fits their use case?
[15:53:10] (as elasticsearch is not really a DB, but a search index, so if they are trying to get a DB it might not be the right thing to use)
[15:54:23] the elastic /srv partitions are big and empty enough that I would not worry about that at the moment
[16:09:06] if there's only one replica that's ok yep, but for 3 it would mean using ~100T of data in a ~400T cluster
[16:09:23] andrewbogott: I see you acked the Tofuinfratest alert, have you checked why it's failing?
[16:09:25] I mean, we have space, just want to make sure we are ready to stress test it xd
[16:09:51] btw I added a quick command to the alert runbook that I use to see the output of the last cronjob run
[16:10:12] dhinus: it was failing due to the trove quota leak. I fixed that and now it's failing due to a magnum failure which I take to be... intermittent? But I haven't looked at that part.
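On the capacity question: the per-node headroom is easy to sanity-check before granting, e.g. via the _cat API (a sketch, assuming the elasticsearch nodes answer locally on the standard port):

    # every replica stores a full extra copy of the index, so
    # requested_size * number_of_copies has to fit within disk.avail
    curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.indices,disk.avail,disk.percent'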
[16:11:20] it was intermittent last week, but is failing constantly this week...
[16:14:22] ok, I may or may not have another look
[16:32:44] I can also take a look tomorrow but wanted to check with you first
[16:33:11] in case you knew something more :)
[16:37:08] so far I do not
[18:13:02] * dcaro off
[18:13:04] cya tomorrow!