[09:16:52] mjqrnxng¨
[09:17:02] xd, that's me shifted one key on the keyboard...
[09:17:05] morning!
[09:17:22] I'm looking into the openstack api alert
[09:18:32] it seems that it failed to connect to the DB, though the logs are old (journal rotated)
[09:18:38] just restarted the nova-api-metadata service
[09:27:46] morning :)
[09:28:46] opened T383203 to keep track, though the restart seemed to be enough, and I'm having issues finding logs... will jump to the puppet errors that are popping up
[09:28:47] T383203: [openstack] 2025-01-08 nova-api-metadata.service down on cloudcontrol1005 - https://phabricator.wikimedia.org/T383203
[09:29:33] seems to be another missing cloud.yaml entry (prometheus::default_web_instance)
[14:15:51] hmm, the tools nfs server went away for tools-static
[14:15:55] https://www.irccloud.com/pastebin/hyLuwLvi/
[14:17:11] tested a couple of workers and they look ok
[14:17:44] this one does not though https://usercontent.irccloud-cdn.com/file/U0acHsAY/image.png
[14:17:58] I was taking one nfs node out for reimaging, might have been that
[14:18:20] ceph says everything is ok :/
[14:18:39] (rebalance ongoing, but everything up and running)
[14:21:13] yep, nfs is misbehaving
[14:25:42] I'll start by rebooting worker 70, to see if it comes back up ok
[14:25:53] if it does, I'll start rebooting all the nodes that are currently having issues
[14:26:11] I've stopped the draining of cloudcephosd1012 (it's half-drained), will figure out later how to resume it
[14:28:52] tools-static-15 came up ok after the reboot
[14:29:33] so whatever was making nfs misbehave is no longer happening (or not for the same files)
[14:32:25] I'm starting to suspect that the issue is not so much on the disk side (accessing stuff from the nfs VM itself seems ok) but on the network side: maybe when ceph rebalances there's a hiccup in the network (there's definitely a spike) and that throws nfs off or something
[14:41:19] that's a pain. Maybe we can configure nfs to be less sensitive?
[14:41:43] It turns out to be hard to google about hosting nfs on ceph, because everything comes back about cephfs
[14:44:12] yep, it's a weird setup that adds a bunch of layers (and thus complexity) in the middle
[14:54:14] I'm trying to debug a bit before restarting all the nfs workers that are stuck. I've noticed that they report a few errors in `nfsiostat` (<~10), but I'm not finding any stats about it in prometheus, might be interesting to have them
[14:57:52] I think I can safely say though that the nfs issue happens momentarily, as in, there's a hiccup for a couple of minutes (or less) that gets the nfs clients stuck, and then the hiccup goes away while the clients remain stuck
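
A minimal sketch of what a per-mount NFS health probe for prometheus could look like, following up on the missing nfsiostat-style stats mentioned at 14:54. It is only an illustration, assuming a node-exporter textfile collector; the mount point and the collector directory below are placeholders, not the actual Toolforge paths.

    #!/usr/bin/env python3
    """Probe an NFS mount and expose its health as a node-exporter textfile metric.

    Sketch only: the mount point and the textfile-collector directory are
    assumptions; adjust them to the actual host layout.
    """
    import os
    import subprocess
    import time

    MOUNTS = ["/mnt/nfs/tools"]                  # hypothetical mount point
    TEXTFILE_DIR = "/var/lib/prometheus/node.d"  # assumed collector directory
    TIMEOUT = 10  # seconds before we declare the mount stuck


    def probe(mount: str) -> int:
        """Return 1 if a filesystem stat on the mount completes in time, else 0."""
        try:
            # `stat -f` only needs a statfs() call, so it hangs exactly when the
            # client is stuck on the server rather than on local disk.
            # (Caveat: if the child ends up in D state, killing it after the
            # timeout can itself block; good enough for a sketch.)
            subprocess.run(["stat", "-f", mount], check=True, timeout=TIMEOUT,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return 1
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            return 0


    def main() -> None:
        lines = []
        for mount in MOUNTS:
            start = time.monotonic()
            healthy = probe(mount)
            elapsed = time.monotonic() - start
            lines.append(f'nfs_mount_healthy{{mount="{mount}"}} {healthy}')
            lines.append(f'nfs_mount_probe_seconds{{mount="{mount}"}} {elapsed:.3f}')
        # Write atomically so node-exporter never reads a half-written file.
        tmp = f"{TEXTFILE_DIR}/nfs_health.prom.tmp"
        with open(tmp, "w") as f:
            f.write("\n".join(lines) + "\n")
        os.replace(tmp, f"{TEXTFILE_DIR}/nfs_health.prom")


    if __name__ == "__main__":
        main()

Run from cron or a systemd timer every minute or so, the resulting nfs_mount_healthy series would have made the 14:57 observation (hiccup passes, clients stay stuck) directly visible in prometheus.
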
[15:00:16] We always have the option of trying native nfs support on ceph and skipping the server-on-a-VM layer. It doesn't sound like that would necessarily help in this case though, it might be more of a client config thing
[15:01:14] maybe, given that the nfs ceph uses is a different implementation, makes it not so clear to me that it would not help
[15:01:27] (really weird phrasing xd)
[15:02:00] that would mean though that toolforge VMs have direct access to bare metal ceph osd and mon nodes
[15:03:31] oh right, we can't actually do that without some kind of intermediary
[15:14:55] started a task to add some debugging and such: T383238
[15:14:55] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238
[16:39:32] I was just about to tell andrewbogott that my gitlab-account-approval had been uncharacteristically stable for the last couple of weeks, and then I got 2 failure emails. :/
[16:41:19] yeah, I remain convinced that it's not load-dependent, even though I wish it were
[16:41:27] the failures are specifically in dns?
[16:42:14] Looking now. They may have just been k8s node draining?
[16:42:56] oh yeah, could be, since we are definitely doing that right now
[16:42:57] `Exit code was '255'.` -- I think that is k8s killing the running task
[16:47:00] Yeah, I just got one of those too.
[16:47:43] So... does that suggest that the failures /are/ load-related? (Since I moved 'integration' to its own recursor I've been looking for less-noisy-neighbor effects elsewhere)
[16:48:49] (integration has had 0 failures since the move, six days ago)
[17:02:45] andrewbogott: I can say that my email logs have errors on 12/19, 12/21, 12/22, 12/24, 12/25, 12/26, 12/27, 12/28, 12/30, 12/31 and then today. What this proves or disproves, ¯\_(ツ)_/¯
[17:05:23] that does kind of suggest that things maybe got better on the 2nd
[17:05:30] or that coincidences happen. One or the other!
[17:53:26] topranks: I am impatient to try out something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058612 -- is the next step just to merge it and see what happens, or are there intervening things to try? (e.g. has this been tried in codfw1dev already?)
[17:54:32] andrewbogott: do you want to give it a try tomorrow??
[17:55:01] I'm up for it in European timezones if needed
[17:55:05] should just be a matter of merging, but I need to sign off soon and we need to monitor it afterwards for a while
[17:55:47] if 'give it a try' is the next step, then sure! Maybe tomorrow around 14:30 UTC?
[17:56:02] yep, sounds good
[17:56:09] or dcaro, if you want to schedule something earlier than that, that's fine with me, but I won't be up
[17:56:43] I'm ok with both options, up to topranks whichever works better for you :)
[17:57:08] I sent an invite
[17:57:47] (hopefully I did my tz math correctly)
[18:02:05] 👍 thanks
[18:06:46] blancadesal (or whoever): can I get a +1 on T383251 and T383252? I'm happy to make the adjustments.
[18:06:47] T383251: Request increased quota for wikitextexp Cloud VPS project - https://phabricator.wikimedia.org/T383251
[18:06:48] T383252: Bump up quota for wikitexexp to let us spin up a more powerful test server - https://phabricator.wikimedia.org/T383252
[18:07:47] two quota requests for the same project?
[18:08:13] taavi: one is temporary for a rebuild
[18:08:23] ah i see
[18:08:25] the other is permanent. I asked for separate tickets so we'd know how much to revert
[18:09:25] blancadesal: I got what I needed, ignore ping :)
[18:14:37] * dcaro off
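
One way to narrow down whether the gitlab-account-approval failures discussed above are momentary resolver hiccups or a steady trickle is a long-running resolution probe. A minimal sketch, assuming nothing beyond the system resolver configured in /etc/resolv.conf; the hostname and probe interval are arbitrary choices.

    #!/usr/bin/env python3
    """Log every failed resolution of a hostname, with timing, to see whether
    failures cluster (a momentary recursor/network hiccup) or arrive steadily.

    Sketch only: the hostname and interval are arbitrary, and the script just
    uses whatever resolver the host is configured with.
    """
    import socket
    import time
    from datetime import datetime, timezone

    HOSTNAME = "commons.wikimedia.org"
    INTERVAL = 5  # seconds between probes


    def main() -> None:
        attempts = 0
        failures = 0
        while True:
            attempts += 1
            start = time.monotonic()
            try:
                socket.getaddrinfo(HOSTNAME, 443)
            except socket.gaierror as err:
                failures += 1
                now = datetime.now(timezone.utc).isoformat(timespec="seconds")
                elapsed = time.monotonic() - start
                print(f"{now} FAIL after {elapsed:.2f}s "
                      f"({failures}/{attempts}): {err}", flush=True)
            time.sleep(INTERVAL)


    if __name__ == "__main__":
        main()

Running one copy inside a project still on the shared recursor and one inside 'integration' would give a direct before/after comparison for the noisy-neighbor theory.
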
[18:37:25] hashar: how have the dns failures on deployment-prep been? Still a steady trickle, or did they by chance stop happening on the 2nd?
[20:07:44] andrewbogott: hi, I haven't checked. For deployment-prep you can check on https://beta-logs.wmcloud.org/ -- I listed on the task where the credentials can be found: https://phabricator.wikimedia.org/T374830#10227787
[20:07:54] I have saved them in my browser so I don't have to look them up again :)
[20:08:02] oh thanks
[20:08:20] given:
[20:08:38] 1) logstash on beta can be down for whatever reason, but on the front page you should see events
[20:08:46] 2) I can't remember the search query I used. Probably "Could not resolve host"
[20:09:08] then stretch it over 7 days or so to catch resolution issues
[20:09:29] they came from attempts to resolve commons.wikimedia.org
[20:09:32] due to InstantCommons
[20:09:47] (a feature in MediaWiki that attempts to fetch images from commons.wikimedia.org when the requested image can't be found locally)
[20:10:43] besides that I haven't looked at the logs this year
[20:11:06] but I think it is the network failing us
[20:11:08] so many broken refs to deployment-imagescaler03.deployment-prep.eqiad.wmflabs
[20:11:29] yeah, that would be an instance that got deleted without puppet or mediawiki-config being updated
[20:11:37] or maybe leftover jobs that will repeat infinitely until purged
[20:12:07] yup, that instance is gone
[20:12:13] deployment-imagescaler04.deployment-prep.eqiad1.wikimedia.cloud.yaml
[20:12:15] is the new one
[20:14:20] nothing shows up at a quick glance, so I don't know
[20:14:27] ok, yeah, still a steady drumbeat of "Could not resolve host: commons.wikimedia.org"
[20:14:58] If you can produce a test that demonstrates this is the network and not dns, please share! I am basically stumped.
[20:16:23] oh
[20:16:26] I have no idea really
[20:16:26] but
[20:16:40] we had the issue back in October/November and that got resolved when the cloudgw got switched
[20:16:53] there was some network kernel issue there
[20:17:06] or maybe the NIC driver was causing the kernel to emit issues due to some queues being full/dropping, whatever
[20:17:11] yeah, a network issue definitely /could/ cause symptoms like this, but we can't find any other symptoms at the moment
[20:17:48] I think that was https://phabricator.wikimedia.org/T376589
[20:18:45] with the temporary dns server on integration, I'd expect the failures to have vanished
[20:19:06] then that does not help much if the root cause is powerdns having trouble keeping up
[20:19:31] yeah. We're going to replace both of those cloudgw hosts out of superstition, but I don't expect that to fix this particular issue.
[20:19:46] hardware routers and ASICs!
[20:19:49] Right, that's why I was asking -- if it was a load issue then removing integration from the load should have helped with deployment-prep
[20:19:52] but it sees not
[20:19:54] *seems not
[20:20:13] also, the recursor has a ton of metrics ( https://docs.powerdns.com/recursor/metrics.html#gathered-information )
[20:20:21] a link I have added at the top of https://grafana.wikimedia.org/d/000000044/pdns-recursor-stats?orgId=1
[20:20:31] but I could not find any metric that looked like something useful to us :/
[20:22:24] yeah, the recursor metrics always look happy. I spiked it up to 2x traffic and saw no increase in failures.
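
The "Could not resolve host" search over 7 days suggested at 20:08-20:09 could also be scripted against the logstash search API rather than clicked through in the UI. A rough sketch only: the endpoint path, index pattern, and authentication handling are assumptions and would need checking against the actual beta-logs setup (credentials per T374830).

    #!/usr/bin/env python3
    """Count "Could not resolve host" events over the last 7 days via the
    Elasticsearch/OpenSearch _count API behind the beta logstash instance.

    Sketch only: BASE_URL, INDEX, and the (omitted) credentials handling are
    assumptions about how beta-logs.wmcloud.org is exposed.
    """
    import json
    import urllib.request

    BASE_URL = "https://beta-logs.wmcloud.org"  # assumed to proxy the search API
    INDEX = "logstash-*"                        # assumed index pattern
    QUERY = {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"message": "Could not resolve host"}},
                    {"range": {"@timestamp": {"gte": "now-7d"}}},
                ]
            }
        }
    }


    def main() -> None:
        req = urllib.request.Request(
            f"{BASE_URL}/{INDEX}/_count",
            data=json.dumps(QUERY).encode(),
            headers={"Content-Type": "application/json"},
            # Basic auth or a session cookie would go here; omitted on purpose.
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        print(f"resolution failures in the last 7 days: {body.get('count')}")


    if __name__ == "__main__":
        main()

Bucketing the same query by day (a date_histogram aggregation instead of _count) would answer the "did they stop on the 2nd?" question directly.
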
[20:47:43] which leads me to suspect the network layer
[20:48:08] and I have no idea how to interpret the kernel trace at https://phabricator.wikimedia.org/T376589#10205754
[20:48:56] or the very suspicious [Mon Oct 7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
[20:49:42] but that reboot did solve the issues we were encountering at the time
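
Since the bnxt_en "TX timeout detected" message above was only noticed after the fact, a small watcher over the kernel log could surface the next occurrence as it happens. A rough sketch, assuming journalctl access and plain substring matching; the patterns beyond the quoted message are guesses, not known bnxt_en wording.

    #!/usr/bin/env python3
    """Count NIC trouble indicators (TX timeouts, reset tasks) in the kernel log
    for the last day, e.g. to correlate with DNS/NFS hiccup windows.

    Sketch only: the extra patterns are guesses to extend as needed.
    """
    import subprocess

    PATTERNS = [
        "TX timeout detected",   # from the bnxt_en message quoted above
        "starting reset task",
        "transmit queue",        # generic "transmit queue ... timed out" wording
    ]


    def main() -> None:
        # -k limits output to kernel messages; --since keeps the scan cheap.
        out = subprocess.run(
            ["journalctl", "-k", "--since", "-1d", "--no-pager", "-o", "short-iso"],
            capture_output=True, text=True, check=True,
        ).stdout
        hits = [line for line in out.splitlines()
                if any(p in line for p in PATTERNS)]
        print(f"{len(hits)} suspicious kernel log lines in the last day")
        for line in hits:
            print(line)


    if __name__ == "__main__":
        main()

Running it on the cloudgw hosts and comparing the hit timestamps against the resolution-failure emails would at least confirm or rule out the NIC-reset correlation.
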