[07:20:26] <godog>	 greetings
[07:48:17] <godog>	 I'll be going ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1261374 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1261366 shortly, first in codfw then eqiad
[08:01:08] <dcaro>	 morning! ack
[08:38:08] <dhinus>	 morning
[09:42:51] <dcaro>	 godog: how is it going? I see some alerts up for a bit, I'm guessing you are still deploying stuff?
[09:43:38] <godog>	 dcaro: yes first part is done (rabbitmqctl cli) and I'm about to do the second part
[09:43:45] <godog>	 i.e. transient rabbitmq queues for openstack
[09:44:04] <godog>	 alerts should be going back to quiet
[09:44:28] <dcaro>	 ack, designate seems to be having more trouble than the others for some reason
[09:45:20] <godog>	 indeed, I'll deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1261374 and then restart openstack services
[09:47:19] <dcaro>	 oh, I think we might be having some issues hitting different prometheus backends or something (the graphs change if I refresh, between two different versions)
[09:49:05] <godog>	 ah yes I saw the prometheus rolling reboots messages, likely related to that
[09:54:47] <godog>	 ok change deployed I'll be resetting rabbit state and restart openstack
[09:54:58] <dcaro>	 designate seems to be up again :)
[09:55:41] <godog>	 \o/
[09:56:50] <dcaro>	 oh, there's a bunch of neutron alerts
[09:56:52] <dcaro>	 from cloudvirts
[09:57:34] <godog>	 mmhh ok thank you, will rebuild rabbit and restart those 
[09:59:51] <dcaro>	 ack, let me know if you you need any help, they seems to fail with auth issues `Failed to consume message from queue: (0, 0): (403) ACCESS_REFUSED`
[10:01:03] <godog>	 dcaro: ack! thank you, will do! yes rebuild in progress atm
[10:05:40] <godog>	 ok rebuild done, restarting puppet and openstack services
[10:08:31] <dcaro>	 things are getting online \o/
[10:09:55] <godog>	 \o/
[10:11:26] <dcaro>	 it seems toolforge prometheus is down (service unavailable when loading dashboards), will look in a bit if nobody gets to it before
[10:11:59] <godog>	 I just did neutron on cloudnet btw, that might be it (i.e. network)
[10:21:16] <godog>	 ok puppet re-enabled, restarting openstack services one more time
[10:31:18] <godog>	 still not out of the woods btw, investigating stack traces for e.g. heat
[10:31:41] <taavi>	 prometheus might be a side effect of the istio overload last week, I filed T421416 then and am trying to find a moment to poke that a bit more
[10:31:42] <stashbot>	 T421416: Alert on Prometheus instability / unexpected restarts - https://phabricator.wikimedia.org/T421416
[10:43:56] <godog>	 ack
[10:47:06] <dcaro>	 neutron agents seem to be failing to connect to rabbit
[10:47:11] <dcaro>	                                                                   2026-03-30 10:46:53.443 3681266 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_oskenapp oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 after inf tries: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'auto_delete' for exchange 
[10:47:11] <dcaro>	 'neutron-vo-Port-1.10_fanout' in vhost '/': received 'false' but current is 'true'
[10:47:33] <dcaro>	 seems an issue with queues config
[10:47:57] <dcaro>	 xd, `after inf tries` yep, sure
[10:48:29] <dcaro>	 probably retrying every in 0 seconds
[10:48:35] * dcaro stops joking
[10:50:34] <godog>	 lol yeah I got another puppet run going and then will reset rabbit again, this time should be the last
[10:55:13] <godog>	 dcaro: what host was that ?
[11:00:57] <dcaro>	 Cloudvirt1069
[11:01:53] <godog>	 ack thank you
[11:12:06] <topranks>	 folks I've just realised I've a meeting clash this wednesday for our network sync meeting.  
[11:12:24] <topranks>	 any change we can do it an hour earlier?  Or perhaps push it back a week?
[11:13:08] <godog>	 either option would work for me FWIW
[11:13:59] <dcaro>	 Works for me too
[11:14:50] <taavi>	 same
[11:15:13] <topranks>	 thanks guys, I sent a proposal to do it an hour earlier, I think andrew is the owner so let's see what he thinks 
[11:15:14] <topranks>	 cheers
[11:24:54] <godog>	 ok restarts and resets done, I'm looking at dashboards/alerts
[11:43:54] <dcaro>	 things seem to be working ok now :)
[11:44:46] <godog>	 yeah! just updated https://phabricator.wikimedia.org/T421054#11763845 with a few classic queues left I'm investigating
[11:46:33] <dcaro>	 prometheus in toolforge seems to be in a OOM loop, getting killed when starting up
[11:46:58] <godog>	 poor prometheus :(
[11:48:15] <godog>	 ok neutron-l3-agent.service doesn't get restarted by wmcs.openstack.restart_openstack do you know if it safe to do so at any time ?
[11:48:33] <taavi>	 it will cause a failover from one cloudnet to the other
[11:49:08] <taavi>	 so one at a time only, and preferrabily starting from the passive one so we only trigger one of those
[11:49:47] <godog>	 ack ok thank you taavi 
[11:50:09] <godog>	 I'll !log before I do it
[11:52:49] <taavi>	 thanks
[11:53:09] <taavi>	 and fwiw we tend to log those in -cloud with `!log admin` instead of in -operations
[11:53:25] <godog>	 ah got it, will do
[11:58:16] <dcaro>	 hmm... any ideas besides removing the WAL to get prometheus toolforge running?
[12:03:10] <godog>	 IIRC yes removing the wal is what I did in the past when replaying it was causing trouble, not optimal but heh
[12:10:59] <dcaro>	 okok, done, back up and running, we lost some data though (not critical)
[12:11:10] <godog>	 taavi: I need to briefly pause both l3-agent, delete its exchanges and queues and then restart them both
[12:11:29] <godog>	 i.e. brief network interruption
[12:13:53] <taavi>	 :(
[12:14:17] <taavi>	 i'm leaning towards doing that on a scheduled window
[12:16:43] <godog>	 that's fair taavi yeah, I'll be sending a cloud-announce window announce for tomorrow EU morning
[12:30:10] <godog>	 we won't have cloudnet redudancy until then though I think that's acceptable
[12:32:35] <dcaro>	 +1 from me, it's not a long time
[12:54:44] <volans>	 I have a new "beta" build of cumin that has all the major changes and the proxy support for openstack. Would it be ok for you if I install it on cloudcumin2001 and try it there? it's usually unused AFAIK
[12:56:24] <taavi>	 yes
[12:57:32] <dcaro>	 volans: 👍
[13:04:17] <volans>	 great, thx, I have the current deb handy for revert :)
[13:37:15] <taavi>	 hmh, do our prometheus instances not monitor each other?
[13:37:54] <taavi>	 (that job is seemingly hardcoded all the way in prometheus::server)
[15:04:41] <taavi>	 andrewbogott: so T421025 reminded me about one more thing - didn't we choose to move those public records under $PROJECT_NAME.wmcloud.org instead of $PROJECT_ID.wmcloud.org?
[15:04:41] <stashbot>	 T421025: Add PTR record for azwikimedia - https://phabricator.wikimedia.org/T421025
[15:05:37] <andrewbogott>	 hm, good question. I don't know if we decided to but we probably should...
[15:06:04] <andrewbogott>	 well, it depends on who/what we think those automatic entries are for.
[15:06:33] <taavi>	 also looking at https://openstack-browser.toolforge.org/project/azwikimedia we only provision the project name variant for new projects, so that's one more argument for doing it with project name
[15:08:32] <andrewbogott>	 let me make a task to cover all this...
[15:08:54] <taavi>	 ty
[15:09:55] <taavi>	 and for the original task, I think it's best to have some kind of config to override the auto-generated records for that script. the same thing would be useful for some mail servers too
[15:10:05] <taavi>	 some other mail servers, like toolforge ones, that is
[15:10:22] <andrewbogott>	 should maybe be two tasks rather than one, but T421739
[15:10:23] <stashbot>	 T421739: Improvements to auto-generated floating ip ptr records - https://phabricator.wikimedia.org/T421739
[15:10:48] <andrewbogott>	 could the auto-generation script just do something like "if existing record doesn't match regex, ignore"?
[15:12:33] <taavi>	 how would the script separate manually managed records and records formerly managed by that script that need to be cleaned up? (remember that it also handles Designate-managed A records, so 185.15.56.4 has pointers for both instance-tools-bastion-14.tools.wmcloud.org. and dev.toolforge.org.)
[15:25:50] <andrewbogott>	 taavi: actually now that I think of it I think there's literally a flag in designate to distinguish script-created vs human-created recordsets.  So it may be as simple as checking that, I'll see.
[15:26:01] <andrewbogott>	 well, would have to backfill that flag
[15:26:36] <taavi>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/designate/wmcs-dns-floating-ip-updater.py#32
[15:26:42] <taavi>	 or we could just check if that description matches :P
[15:26:50] <andrewbogott>	 yeah
[15:27:11] <taavi>	 and if we still want these to be tracked in git, could do it with the tofu repo and tooling
[15:28:42] <taavi>	 so yeah, let's go with that approach
[15:41:35] <andrewbogott>	 do you think that tofu + wmcs-dns-floating-ip-updater will be able to cooperate on a shared zone? Or will tofu just wipe out anything that's not tofu-manager there?  (I would expect 'wipe out' but maybe you know better)
[15:42:45] <taavi>	 we already manage those zones and have some records in them in tofu https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/resources/eqiad1-r/cloudinfra/dns.tf?ref_type=heads#L16
[15:43:07] <taavi>	 basically tofu if you create something with tofu and then remove it from there, it'll remove it, but if something is created outside tofu it won't touch it
[15:43:59] <taavi>	 oh, I forgot to ask earlier: any outgoing updates for the SRE meeting?
[15:44:18] <andrewbogott>	 not from me
[15:44:24] <taavi>	 ^ godog: volans
[15:44:37] <godog>	 not for me, thank you taavi 
[15:45:18] <andrewbogott>	 hmmm my instinct is to fear tofu but you're right it clearly isn't wiping things out there
[15:50:18] * volans interview...
[15:50:48] <volans>	 ah no, no specific update
[15:50:49] <volans>	 thx
[17:56:32] * dcaro off