[00:27:40] taavi: I'm lost in the haproxy docs. I have request logging enabled for neutron-api (in theory) but see nothing
[00:33:58] best I can tell all the traffic is internal, so this might not be a dos, or it might be a self-inflicted dos
[01:45:13] Ok, quick wrap-up: It was not a denial of service. Neutron was in a split-brained state which meant it timed out on many operations, causing its wsgi queue to fill up so it /looked/ like a DOS in the logs. After trying many other things I did a full reset of rabbitmq and a restart of all services, and now things seem normal.
[07:58:36] morning. I'm going to reimage cloudweb1003 shortly, I'll take a backup of /home beforehand
[08:54:48] ack
[09:00:50] morning
[09:13:12] o/
[09:14:50] is anyone planning to do anything about that toolsdb replag alert? I don't like how we have collectively decided to ignore most of them because they're so frequent and unactionable
[09:18:21] my DB maintenance kung-fu is very low in general
[09:19:44] taavi: I'll be looking into it
[09:20:02] as I did the last time (when I wrote the cookbook to check the stuck query, and see if it's a valid one or something else is going on)
[09:20:38] previous: https://phabricator.wikimedia.org/T355411
[09:21:02] just need a few minutes to have some coffee
[09:21:13] ☕
[09:24:30] just opened the new task T357264, feel free to add anything you find out
[09:24:31] T357264: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264
[10:15:01] dcaro: to confirm, this is all automated now right? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy#updating-component-versions
[10:15:57] or is it only for some of the components?
[10:17:05] blancadesal: I think all of them (the ones that generate a chart, that are under toolforge-deploy) are automated, yes
[10:17:27] ok, I'll update the readme then
[10:17:28] like https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/193 is created for you
[10:18:07] ack, might still be good to keep the notes in case CI stops working for whatever reason (or we want to force a manual update)
[10:19:15] ok
[11:09:40] * dcaro lunch
[12:09:44] topranks: hi, for T341338 I'm planning to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/998401 soon, after that we will need to update the reverse zone delegations to point to ns0/1/2.wikimedia.org
[12:09:44] T341338: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338
[13:39:58] andrewbogott: cloudcontrol2004-dev is alerting for "Some flavors are not assigned to aggregates: g3.cores2.ram4.disk20", do we need to do something about that?
[13:55:42] * arturo food time
[16:06:56] in case you are curious, this is what the openapi-generator at https://openapi-generator.tech/ generates for the jobs-api: https://gitlab.wikimedia.org/aborrero/jobs-api-gen
[17:00:57] no sync meeting I'm guessing? andrewbogott anything specific about yesterday's outage?
[17:07:53] arturo: did you try using the official one? https://github.com/OpenAPITools/openapi-generator
[17:10:04] oh, wait, it's the same xd
[17:10:51] and it's not official
[17:11:56] dcaro: I don't think there's anything to tell other than what I put in IRC -- "it turned out to be rabbitmq"
[17:12:06] ack
[17:12:23] then /me out :)
[17:12:25] cya
[17:20:31] taavi: I will look at that flavors vs aggregates thing. It's an easy fix one way or another
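
(A minimal sketch of the kind of "easy fix" referred to above, assuming the alert means the flavor is missing the aggregate_instance_extra_specs property that the nova scheduler matches against host-aggregate metadata; the sibling flavor name and the ceph=true key/value are placeholders, not taken from this log.)

  # Inspect the aggregates and a comparable flavor to see which property is expected.
  openstack aggregate list --long
  openstack flavor show g3.cores1.ram2.disk20 -c properties
  # Tag the unassigned flavor so AggregateInstanceExtraSpecsFilter can place it;
  # copy the exact key/value used by the other g3.* flavors in this deployment.
  openstack flavor set --property aggregate_instance_extra_specs:ceph=true g3.cores2.ram4.disk20
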
[17:22:22] andrewbogott: I was thinking the other day that rabbitmq is the last mostly mysterious piece of our openstack deployment. By that I mean that it is the one thing that, as far as I know, we have never had a local expert really dig into tuning and improving (other than Chase and Brooke doing some short-term heroics). Does that sound right to you?
[17:23:14] Yeah, mostly right. Although it has been much more reliable since I switched it to persistent queues.
[17:23:27] I think there are visualization tools that we could/should add
[17:25:29] the main problem I've had with it lately is that it doesn't seem to notice when it has split-brain; each individual node is totally happy.
[17:26:00] I wonder if there's a good way to have each do a health check and then /compare/ those health checks to see if they agree about what's happening...
[17:26:15] total throughput has historically been a problem too, correct? Like when nodepool was around and flooding it?
[17:26:28] "tools.gitlab-account-approval@tools-sgebastion-11:~$ toolforge build start https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval; KeyError: 'name' Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu"
[17:26:41] buildservice is mad at my tool? or all tools?
[17:27:46] bd808: I don't think we've had throughput issues in quite a while
[17:28:40] I assumed that the vague "openstack is too slow" stuff from the people working on the Catalyst project was throughput related, but maybe it's just about puppet?
[17:29:05] ^ Raymond_Ndibe: seems like you released a new builds-cli version earlier today but forgot to update it on tools-sgebastion-11 (which I just did)
[17:29:07] bd808: try now?
[17:29:41] taavi: looks like it's building now. thanks
[17:30:30] that error message was totally useless for figuring out the problem. Is there a logging place that you looked at?
[17:30:32] bd808: no idea, until someone with catalyst actually talks to me about their issues I'm trying to ignore it
[17:30:51] andrewbogott: fair enough
[17:30:53] no, but I just looked at the commit log of the most likely culprit
[17:31:06] you should be able to pass --debug to get a better stack trace
[17:31:29] there's a cookbook that does the copy from toolsbeta repos to tools, maybe I should extend that to also roll the package out everywhere to avoid that kind of split-brain where some nodes have been upgraded and some have not
[17:32:37] that seems like a nice toil reduction if there is a reasonable way to target the bastions with the install command
[17:33:17] i forget if the wmcs cookbooks can access toolforge puppetdb, but if they do that's a relatively easy way to figure out where it's installed
[17:36:15] I'm also not sure if they can
[18:53:55] taavi: next time you're going to delete old worker nodes can you save them for me to delete? I'm still chasing down a designate-sink corner case.
[19:22:23] increment the "mysterious filesystem behavior on the NFS server" counter for T357340
[19:22:23] T357340: rm'ing a specific file on NFS hangs on (dev|login).toolforge.org - https://phabricator.wikimedia.org/T357340
[19:22:49] * bd808 lunch
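
(A sketch of the cross-node comparison floated at 17:26:00, assuming three rabbitmq cluster members reachable over ssh; the hostnames are placeholders. Each member reports its own view of the cluster, and a split-brain shows up as those views disagreeing even while every individual node considers itself healthy.)

  #!/bin/bash
  set -u
  nodes=(rabbit01.example.wmcloud.org rabbit02.example.wmcloud.org rabbit03.example.wmcloud.org)
  views=()
  for node in "${nodes[@]}"; do
      # Ask this member for its view of running nodes and partitions (RabbitMQ >= 3.8).
      view=$(ssh "$node" sudo rabbitmqctl cluster_status --formatter json \
          | jq -S -c '{running: (.running_nodes | sort), partitions: .partitions}')
      echo "$node: $view"
      views+=("$view")
  done
  # All members should report the same view; disagreement is the split-brain signature.
  if [ "$(printf '%s\n' "${views[@]}" | sort -u | wc -l)" -eq 1 ]; then
      echo "OK: all nodes agree on cluster membership"
  else
      echo "WARNING: nodes disagree about the cluster -- possible split-brain"
  fi
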
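(A sketch of the PuppetDB lookup mentioned at 17:33:17, assuming the cookbook can reach the Toolforge project-local PuppetDB over its query API; the PuppetDB URL and the package resource title are placeholders, not taken from this log.)

  # List the hosts that declare the CLI package, i.e. the hosts a new release must reach.
  curl -sG 'http://puppetdb.example.wmcloud.org:8080/pdb/query/v4/resources' \
      --data-urlencode 'query=["and", ["=", "type", "Package"], ["=", "title", "toolforge-builds-cli"]]' \
      | jq -r '.[].certname' | sort -u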