[07:20:26] greetings
[07:48:17] I'll be going ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1261374 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1261366 shortly, first in codfw then eqiad
[08:01:08] morning! ack
[08:38:08] morning
[09:42:51] godog: how is it going? I've seen some alerts up for a bit, I'm guessing you are still deploying stuff?
[09:43:38] dcaro: yes, the first part is done (rabbitmqctl cli) and I'm about to do the second part
[09:43:45] i.e. transient rabbitmq queues for openstack
[09:44:04] alerts should be going back to quiet
[09:44:28] ack, designate seems to be having more trouble than the others for some reason
[09:45:20] indeed, I'll deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1261374 and then restart openstack services
[09:47:19] oh, I think we might be having some issues hitting different prometheus backends or something (the graphs change between two different versions when I refresh)
[09:49:05] ah yes, I saw the prometheus rolling reboot messages, likely related to that
[09:54:47] ok, change deployed, I'll be resetting rabbit state and restarting openstack
[09:54:58] designate seems to be up again :)
[09:55:41] \o/
[09:56:50] oh, there's a bunch of neutron alerts
[09:56:52] from cloudvirts
[09:57:34] mmhh ok thank you, will rebuild rabbit and restart those
[09:59:51] ack, let me know if you need any help, they seem to fail with auth issues `Failed to consume message from queue: (0, 0): (403) ACCESS_REFUSED`
[10:01:03] dcaro: ack! thank you, will do! yes, rebuild in progress atm
[10:05:40] ok, rebuild done, restarting puppet and openstack services
[10:08:31] things are coming back online \o/
[10:09:55] \o/
[10:11:26] it seems toolforge prometheus is down (service unavailable when loading dashboards), will look in a bit if nobody gets to it before then
[10:11:59] I just did neutron on cloudnet btw, that might be it (i.e. network)
[10:21:16] ok, puppet re-enabled, restarting openstack services one more time
[10:31:18] still not out of the woods btw, investigating stack traces for e.g. heat
[10:31:41] prometheus might be a side effect of the istio overload last week, I filed T421416 then and am trying to find a moment to poke at that a bit more
[10:31:42] T421416: Alert on Prometheus instability / unexpected restarts - https://phabricator.wikimedia.org/T421416
[10:43:56] ack
[10:47:06] neutron agents seem to be failing to connect to rabbit
[10:47:11] 2026-03-30 10:46:53.443 3681266 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_oskenapp oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 after inf tries: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'auto_delete' for exchange
[10:47:11] 'neutron-vo-Port-1.10_fanout' in vhost '/': received 'false' but current is 'true'
[10:47:33] seems like an issue with the queue config
[10:47:57] xd, `after inf tries` yep, sure
[10:48:29] probably retrying every 0 seconds
[10:48:35] * dcaro stops joking
[10:50:34] lol yeah, I've got another puppet run going and then will reset rabbit again, this time should be the last
[10:55:13] dcaro: what host was that?
[11:00:57] cloudvirt1069
[11:01:53] ack, thank you
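(Editor's note: the 406 above is the broker rejecting an exchange re-declaration whose arguments differ from what the broker already has recorded, which is why a full rabbit state reset clears it. A minimal sketch of how that error surfaces, using the pika client for illustration; the host, credentials, and plain-TCP connection are placeholders, not the actual cloud rabbit setup:)

```python
import pika

# Placeholder connection details; the real brokers listen with TLS on 5671
# and use per-service credentials.
params = pika.ConnectionParameters(host="rabbitmq.example.org")
channel = pika.BlockingConnection(params).channel()

try:
    # Re-declaring an existing exchange with a different auto_delete value is
    # what produces "(406) PRECONDITION_FAILED - inequivalent arg
    # 'auto_delete'": the broker has auto_delete=True stored for this
    # exchange, while the client (here) asks for auto_delete=False.
    channel.exchange_declare(
        exchange="neutron-vo-Port-1.10_fanout",
        exchange_type="fanout",
        auto_delete=False,
    )
except pika.exceptions.ChannelClosedByBroker as exc:
    print(f"declare rejected: {exc.reply_code} {exc.reply_text}")
```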
[11:12:06] folks, I've just realised I have a meeting clash this wednesday for our network sync meeting.
[11:12:24] any chance we can do it an hour earlier? Or perhaps push it back a week?
[11:13:08] either option would work for me FWIW
[11:13:59] works for me too
[11:14:50] same
[11:15:13] thanks guys, I sent a proposal to do it an hour earlier, I think andrew is the owner so let's see what he thinks
[11:15:14] cheers
[11:24:54] ok, restarts and resets done, I'm looking at dashboards/alerts
[11:43:54] things seem to be working ok now :)
[11:44:46] yeah! just updated https://phabricator.wikimedia.org/T421054#11763845 with a few classic queues left that I'm investigating
[11:46:33] prometheus in toolforge seems to be in an OOM loop, getting killed when starting up
[11:46:58] poor prometheus :(
[11:48:15] ok, neutron-l3-agent.service doesn't get restarted by wmcs.openstack.restart_openstack, do you know if it is safe to do so at any time?
[11:48:33] it will cause a failover from one cloudnet to the other
[11:49:08] so one at a time only, and preferably starting from the passive one so we only trigger one of those
[11:49:47] ack, ok, thank you taavi
[11:50:09] I'll !log before I do it
[11:52:49] thanks
[11:53:09] and fwiw we tend to log those in -cloud with `!log admin` instead of in -operations
[11:53:25] ah, got it, will do
[11:58:16] hmm... any ideas besides removing the WAL to get prometheus toolforge running?
[12:03:10] IIRC yes, removing the WAL is what I did in the past when replaying it was causing trouble, not optimal but heh
[12:10:59] okok, done, back up and running, we lost some data though (not critical)
[12:11:10] taavi: I need to briefly pause both l3-agents, delete their exchanges and queues and then restart them both
[12:11:29] i.e. a brief network interruption
[12:13:53] :(
[12:14:17] i'm leaning towards doing that in a scheduled window
[12:16:43] that's fair taavi, yeah, I'll send a window announcement to cloud-announce for tomorrow EU morning
[12:30:10] we won't have cloudnet redundancy until then, though I think that's acceptable
[12:32:35] +1 from me, it's not a long time
[12:54:44] I have a new "beta" build of cumin that has all the major changes and the proxy support for openstack. Would it be ok for you if I install it on cloudcumin2001 and try it there? it's usually unused AFAIK
[12:56:24] yes
[12:57:32] volans: 👍
[13:04:17] great, thx, I have the current deb handy for revert :)
[13:37:15] hmh, do our prometheus instances not monitor each other?
[13:37:54] (that job is seemingly hardcoded all the way in prometheus::server)
[15:04:41] andrewbogott: so T421025 reminded me about one more thing - didn't we choose to move those public records under $PROJECT_NAME.wmcloud.org instead of $PROJECT_ID.wmcloud.org?
[15:04:41] T421025: Add PTR record for azwikimedia - https://phabricator.wikimedia.org/T421025
[15:05:37] hm, good question. I don't know if we decided to, but we probably should...
[15:06:04] well, it depends on who/what we think those automatic entries are for.
[15:06:33] also, looking at https://openstack-browser.toolforge.org/project/azwikimedia we only provision the project name variant for new projects, so that's one more argument for doing it with the project name
[15:08:32] let me make a task to cover all this...
[15:08:54] ty
[15:09:55] and for the original task, I think it's best to have some kind of config to override the auto-generated records for that script. the same thing would be useful for some mail servers too
[15:10:05] some other mail servers, like toolforge ones, that is
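(Editor's note: a minimal sketch of what such an override hook for the floating-IP PTR updater could look like, assuming a small operator-maintained YAML file; the file path, key name, and function names are all hypothetical, nothing like this exists in the current script:)

```python
import ipaddress

import yaml  # PyYAML, assumed available on the host running the updater

# Hypothetical path for an operator-maintained override list.
OVERRIDES_PATH = "/etc/wmcs-dns-floating-ip-updater/overrides.yaml"


def load_manual_ips(path=OVERRIDES_PATH):
    """Return the set of floating IPs whose PTR records are managed by hand."""
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    return {ipaddress.ip_address(ip) for ip in data.get("manual_ptr_ips", [])}


def should_skip(floating_ip, manual_ips):
    # The updater would consult this before creating, updating, or deleting
    # a PTR recordset for the given floating IP.
    return ipaddress.ip_address(floating_ip) in manual_ips
```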
[15:10:22] should maybe be two tasks rather than one, but T421739
[15:10:23] T421739: Improvements to auto-generated floating ip ptr records - https://phabricator.wikimedia.org/T421739
[15:10:48] could the auto-generation script just do something like "if existing record doesn't match regex, ignore"?
[15:12:33] how would the script separate manually managed records from records formerly managed by that script that need to be cleaned up? (remember that it also handles Designate-managed A records, so 185.15.56.4 has pointers for both instance-tools-bastion-14.tools.wmcloud.org. and dev.toolforge.org.)
[15:25:50] taavi: actually, now that I think of it, there's literally a flag in designate to distinguish script-created vs human-created recordsets. So it may be as simple as checking that, I'll see.
[15:26:01] well, we would have to backfill that flag
[15:26:36] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/designate/wmcs-dns-floating-ip-updater.py#32
[15:26:42] or we could just check if that description matches :P
[15:26:50] yeah
[15:27:11] and if we still want these to be tracked in git, we could do it with the tofu repo and tooling
[15:28:42] so yeah, let's go with that approach
[15:41:35] do you think that tofu + wmcs-dns-floating-ip-updater will be able to cooperate on a shared zone? Or will tofu just wipe out anything that's not tofu-managed there? (I would expect 'wipe out' but maybe you know better)
[15:42:45] we already manage those zones and have some records in them in tofu https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/resources/eqiad1-r/cloudinfra/dns.tf?ref_type=heads#L16
[15:43:07] basically, with tofu, if you create something and then remove it from the config, tofu will remove it, but if something is created outside tofu it won't touch it
[15:43:59] oh, I forgot to ask earlier: any outgoing updates for the SRE meeting?
[15:44:18] not from me
[15:44:24] ^ godog: volans
[15:44:37] not for me, thank you taavi
[15:45:18] hmmm, my instinct is to fear tofu but you're right, it clearly isn't wiping things out there
[15:50:18] * volans interview...
[15:50:48] ah no, no specific update
[15:50:49] thx
[17:56:32] * dcaro off
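(Editor's note: to close out the description-matching idea from ~15:26, a minimal sketch of how the updater could limit cleanup to its own recordsets; the marker string and function names here are assumptions, not the actual constant from wmcs-dns-floating-ip-updater.py:)

```python
# Assumed marker; the real script sets its own description string (see the
# gitiles link above, around line 32 of wmcs-dns-floating-ip-updater.py).
MANAGED_DESCRIPTION = "Auto-managed by wmcs-dns-floating-ip-updater"


def is_script_managed(recordset: dict) -> bool:
    """True if a Designate recordset carries the updater's marker description."""
    return recordset.get("description") == MANAGED_DESCRIPTION


def stale_managed_recordsets(recordsets, expected_names):
    """Recordsets the script created that no longer match a live floating IP.

    Manually created records (no marker) are left alone, which addresses the
    cleanup question above: match on the marker, not on the record name.
    """
    return [
        rs for rs in recordsets
        if is_script_managed(rs) and rs["name"] not in expected_names
    ]
```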