[00:27:40] taavi: I'm lost in the haproxy docs. I have request logging enabled for neutron-api (in theory) but see nothing
[00:33:58] best I can tell all the traffic is internal, so this might not be a dos, or it might be a self-inflicted dos
[01:45:13] Ok, quick wrap-up: It was not a denial of service. Neutron was in a split-brained state which meant it timed out on many operations, causing its wsgi queue to fill up so it /looked/ like a DOS in the logs. After trying many other things I did a full reset of rabbitmq and a restart of all services, and now things seem normal.
[07:58:36] morning. I'm going to reimage cloudweb1003 shortly, I'll take a backup of /home beforehand
[08:54:48] ack
[09:00:50] morning
[09:13:12] o/
[09:14:50] is anyone planning to do anything about that toolsdb replag alert? I don't like how we have collectively decided to ignore most of them because they're so frequent and unactionable
[09:18:21] my DB maintenance kung-fu is very low in general
[09:19:44] taavi: I'll be looking into it
[09:20:02] as I did the last time (when I wrote the cookbook to check the stuck query, and see if it's a valid one or something else is going on)
[09:20:38] previous: https://phabricator.wikimedia.org/T355411
[09:21:02] just need a few minutes to have some coffee
[09:21:13] ☕
[09:24:30] just opened the new task T357264, feel free to add anything you find out
[09:24:31] T357264: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264
[10:15:01] dcaro: to confirm, this is all automated now right? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy#updating-component-versions
[10:15:57] or is it only for some of the components?
[10:17:05] blancadesal: I think all of them (the ones that generate a chart, that are under toolforge-deploy) are automated, yes
[10:17:27] ok, I'll update the readme then
[10:17:28] like https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/193 is created for you
[10:18:07] ack, might still be good to keep the notes in case CI stops working for whatever reason (or we want to force a manual update)
[10:19:15] ok
[11:09:40] * dcaro lunch
[12:09:44] topranks: hi, for T341338 I'm planning to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/998401 soon, after that we will need to update the reverse zone delegations to point to ns0/1/2.wikimedia.org
[12:09:44] T341338: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338
[13:39:58] andrewbogott: cloudcontrol2004-dev is alerting for "Some flavors are not assigned to aggregates: g3.cores2.ram4.disk20", do we need to do something about that?
[13:55:42] * arturo food time
[16:06:56] in case you are curious, this is what the openapi-generator at https://openapi-generator.tech/ generates for the jobs-api: https://gitlab.wikimedia.org/aborrero/jobs-api-gen
[17:00:57] no sync meeting I'm guessing? andrewbogott anything specific about yesterday's outage?
[17:07:53] arturo: did you try using the official one? https://github.com/OpenAPITools/openapi-generator
[17:10:04] oh, wait, it's the same xd
[17:10:51] and it's not official
[17:11:56] dcaro: I don't think there's anything to tell other than what I put in IRC -- "it turned out to be rabbitmq"
[17:12:06] ack
[17:12:23] then /me out :)
[17:12:25] cya
[17:20:31] taavi: I will look at that flavors vs aggregates thing. It's an easy fix one way or another
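
(A minimal sketch of the kind of "easy fix" referred to above, assuming the alert means the flavor is missing the aggregate_instance_extra_specs property that the nova scheduler matches against host-aggregate metadata; the sibling flavor name and the ceph=true key/value are placeholders, not taken from this log.)

  # Inspect the aggregates and a comparable flavor to see which property is expected.
  openstack aggregate list --long
  openstack flavor show g3.cores1.ram2.disk20 -c properties
  # Tag the unassigned flavor so AggregateInstanceExtraSpecsFilter can place it;
  # copy the exact key/value used by the other g3.* flavors in this deployment.
  openstack flavor set --property aggregate_instance_extra_specs:ceph=true g3.cores2.ram4.disk20
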
[17:22:22] andrewbogott: I was thinking the other day that rabbitmq is the last mostly mysterious piece of our openstack deployment. By that I mean that it is the one thing that, as far as I know, we have never had a local expert really dig into tuning and improving (other than Chase and Brooke doing some short-term heroics). Does that sound right to you?
[17:23:14] Yeah, mostly right. Although it has been much more reliable since I switched it to persistent queues.
[17:23:27] I think there are visualization tools that we could/should add
[17:25:29] the main problem I've had with it lately is that it doesn't seem to notice when it has split-brain; each individual node is totally happy.
[17:26:00] I wonder if there's a good way to have each do a health check and then /compare/ those health checks to see if they agree about what's happening...
[17:26:15] total throughput has historically been a problem too, correct? Like when nodepool was around and flooding it?
[17:26:28] "tools.gitlab-account-approval@tools-sgebastion-11:~$ toolforge build start https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval; KeyError: 'name' Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu"
[17:26:41] buildservice is mad at my tool? or all tools?
[17:27:46] bd808: I don't think we've had throughput issues in quite a while
[17:28:40] I assumed that the vague "openstack is too slow" stuff from the people working on the Catalyst project was throughput related, but maybe it's just about puppet?
[17:29:05] ^ Raymond_Ndibe: seems like you released a new builds-cli version earlier today but forgot to update it on tools-sgebastion-11 (which I just did)
[17:29:07] bd808: try now?
[17:29:41] taavi: looks like it's building now. thanks
[17:30:30] that error message was totally useless for figuring out the problem. Is there a logging place that you looked at?
[17:30:32] bd808: no idea, until someone with catalyst actually talks to me about their issues I'm trying to ignore it
[17:30:51] andrewbogott: fair enough
[17:30:53] no, but I just looked at the commit log of the most likely culprit
[17:31:06] you should be able to pass --debug to get a better stack trace
[17:31:29] there's a cookbook that does the copy from toolsbeta repos to tools, maybe I should extend that to also roll the package out everywhere to avoid that kind of split-brain where some nodes have been upgraded and some have not
[17:32:37] that seems like a nice toil reduction if there is a reasonable way to target the bastions with the install command
[17:33:17] i forget if the wmcs cookbooks can access toolforge puppetdb, but if they do that's a relatively easy way to figure out where it's installed
[17:36:15] I'm also not sure if they can
[18:53:55] taavi: next time you're going to delete old worker nodes can you save them for me to delete? I'm still chasing down a designate-sink corner case.
[19:22:23] increment the "mysterious filesystem behavior on the NFS server" counter for T357340
[19:22:23] T357340: rm'ing a specific file on NFS hangs on (dev|login).toolforge.org - https://phabricator.wikimedia.org/T357340
[19:22:49] * bd808 lunch
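
(A sketch of the cross-node comparison floated at 17:26:00, assuming three rabbitmq cluster members reachable over ssh; the hostnames are placeholders. Each member reports its own view of the cluster, and a split-brain shows up as those views disagreeing even while every individual node considers itself healthy.)

  #!/bin/bash
  set -u
  nodes=(rabbit01.example.wmcloud.org rabbit02.example.wmcloud.org rabbit03.example.wmcloud.org)
  views=()
  for node in "${nodes[@]}"; do
      # Ask this member for its view of running nodes and partitions (RabbitMQ >= 3.8).
      view=$(ssh "$node" sudo rabbitmqctl cluster_status --formatter json \
          | jq -S -c '{running: (.running_nodes | sort), partitions: .partitions}')
      echo "$node: $view"
      views+=("$view")
  done
  # All members should report the same view; disagreement is the split-brain signature.
  if [ "$(printf '%s\n' "${views[@]}" | sort -u | wc -l)" -eq 1 ]; then
      echo "OK: all nodes agree on cluster membership"
  else
      echo "WARNING: nodes disagree about the cluster -- possible split-brain"
  fi
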
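(A sketch of the PuppetDB lookup mentioned at 17:33:17, assuming the cookbook can reach the Toolforge project-local PuppetDB over its query API; the PuppetDB URL and the package resource title are placeholders, not taken from this log.)

  # List the hosts that declare the CLI package, i.e. the hosts a new release must reach.
  curl -sG 'http://puppetdb.example.wmcloud.org:8080/pdb/query/v4/resources' \
      --data-urlencode 'query=["and", ["=", "type", "Package"], ["=", "title", "toolforge-builds-cli"]]' \
      | jq -r '.[].certname' | sort -u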