[07:59:22] what's up with the clouddumps1002 io alerts?
[08:00:17] morning
[08:01:19] o/
[08:05:00] there's been a spike on write operations on 1002 for the last ~5.5h
[08:16:52] oh, not write no, read xd
[08:17:01] (too many colors in the graph, they start repeating)
[08:17:04] anyhow, there's a
[08:17:11] few universities doing an rsync from it
[08:17:56] I think it's ok for now, if it keeps too high (say until tomorrow), or it starts failing we can try to slow down things
[08:18:23] that alert has never been great (most of the time we don't really act on it)
[08:53:11] quick review -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142520 updating toolsbeta prometheus certs
[08:54:27] lgtm
[08:54:37] (assuming you have the private key handy somewhere)
[08:55:02] yep, uploaded to the puppetserver already
[11:25:36] review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142572?
[11:26:27] +1d
[11:27:09] thanks!
[11:35:51] no, that wasn't it :(
[12:02:56] super weirdly the same issue happens when using :close() instead of :set_keepalive()
[12:11:34] looks like an interesting rabbit hole :)
[12:12:33] yeah
[12:12:43] i'm starting to be convinced that this is an nginx bug
[12:54:57] andrewbogott: at which point in the debian release process can we spin up a trixie prerelease image in codfw1dev? would be great to check if the nginx version in trixie is doing the same thing
[12:58:04] given that there's a few of us on pto, I'm thinking of moving the toolforge service check-in to next week (that way it also avoids overlapping with the monthly meeting), does anyone prefer to do it today?
[12:58:47] moving it sgtm
[12:59:14] taavi: looks like there are already dailies at https://cloud.debian.org/images/cloud/trixie/daily/ so I'll see if I can build one now. If that fails you can always just use a raw upstream image.
[13:25:13] taavi: cloud-init seems to not work properly in those dailies but it works well enough to inject a key. debian-13-raw-1 in 'testlabs' and the key is added for user 'debian'
[13:31:28] next fix attempt that seems more promising than the first one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142598/
[13:32:27] so it's a lua version difference
[13:33:15] * andrewbogott lacks context but that change looks harmless at worst
[13:34:32] i have that applied as a local hack on proxy-5 and so far i haven't seen that error happening
[13:34:48] while previously it was pretty well reproducible by doing a hard refresh on quarry.wmcloud.org when on a v6 connection
[13:35:31] xd, it was already merged, +1d anyhow
[13:35:32] so i'll merge and try re-enabling v6 for tools-static to generate more traffic
[13:35:53] does it only happen on v6?
[13:36:23] only v6 traffic is being pushed to the new bookworm proxies for now
[13:36:34] ah okok
[13:36:41] since i need to backfill security group things before flipping the switch for v4
[13:43:05] so far the fix is looking very good
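A minimal sketch of the :close() vs :set_keepalive() choice being debugged above, assuming an OpenResty cosocket-based client; lua-resty-redis is used here purely as an example, and the host, port, key name and pool parameters are made up for illustration. This is not the actual proxy code from the Gerrit changes linked in the log.

```lua
-- Illustrative only: look something up in Redis from nginx/Lua, then either
-- return the connection to the keepalive pool or close it outright.
local redis = require "resty.redis"

local red = redis:new()
red:set_timeout(1000)  -- 1s timeout for connect/send/read

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    ngx.log(ngx.ERR, "redis connect failed: ", err)
    return ngx.exit(502)
end

local backend, err = red:get("frontend:" .. ngx.var.http_host)
if not backend or backend == ngx.null then
    ngx.log(ngx.ERR, "no backend found: ", err)
end

-- set_keepalive() hands the connection back to the per-worker pool
-- (10s idle timeout, up to 100 pooled connections); the alternative,
-- red:close(), tears it down. Per the log above, the error reproduced
-- with either call before the fix was applied.
local ok, err = red:set_keepalive(10000, 100)
if not ok then
    ngx.log(ngx.ERR, "failed to set keepalive: ", err)
    red:close()
end
```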
[14:22:16] andrewbogott: to answer your question I'm not 100% sure what the normal allocation workflow for those cloud-private IPs is
[14:22:23] certainly they get added to Netbox:
[14:22:24] https://netbox.wikimedia.org/ipam/prefixes/657/ip-addresses/
[14:22:42] but I'm not sure exactly what configures them on the host side, it may be in puppet/hiera
[14:22:49] I suspect taavi might be able to advise us here
[14:23:07] is the networking (in terms of vlans on ports) correct on the switches for those hosts you mentioned?
[14:25:31] I don't know. I was assuming so since they used to be cloudcontrols but let's look...
[14:26:44] which hosts is this about?
[14:27:11] cloudrabbit200[123]-dev, recently renamed from cloudcontrol200[789]-dev
[14:27:25] they do not have the rabbit puppet role applied to them yet
[14:28:19] ok, so the old names already had cloud-private addresses allocated https://netbox.wikimedia.org/ipam/prefixes/657/ip-addresses/
[14:29:00] that's good, so let's see if I can edit those records...
[14:29:02] i think updating the dns names (and running the netbox dns cookbook) should be enough for that, the netbox puppetdb import script will take care of the rest
[14:29:02] i think updating the dns names (and running the netbox dns cookbook) should be enough for that, the netbox puppetdb import script will take care of the rest
[14:29:04] oops
[14:30:21] doubly correct :)
[14:31:03] ...is that 'sre.dns.netbox'?
[14:31:07] the netbox puppetdb import script will attach them to the hosts properly, but that is just for information and as you can see hasn't happened with all of them
[14:31:17] andrewbogott: yes that's the one
[14:31:37] with no args, right?
[14:31:46] * andrewbogott is not looking forward to breaking DNS foundation-wide
[14:31:51] yep no args
[14:31:59] ok, running
[14:32:04] it'll prompt you with the diff anyway, which should make sense (i.e. changing the names you modified)
[14:33:10] also I checked the switch ports for those hosts, they are correct
[14:36:00] it seems to be doing the thing
[14:37:48] ftr there was another pending change elasticsearch->cirrussearch
[14:42:06] probably you can ask Brian King about that, but I know they are moving and renaming those hosts so probably fine
[14:42:21] I generally say "yes" if it looks like regular work happening in parallel, with say a given host
[14:42:38] the thing to watch out for is that it's not deleting "en.wikipedia.org" or something :)
[14:53:51] yeah, I know they're renaming things so it seemed unsurprising
[14:56:03] dcaro: one thing I forgot to mention in the checkin: I cleared out cloudcephosd2004-dev from the pool and re-added it and now it seems to work fine. It is much bigger than the other osds in codfw1dev so balancing may be a bit silly until we get some more large ones there. T392366
[14:56:04] T392366: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366
[14:56:25] andrewbogott: thanks!
[14:56:38] chuckonwu: we don't have security groups on tofu-provisioning yet, right?
[14:56:50] I also drained and decom'd cloudcephosd100[123] over the weekend
[14:57:44] dcaro, we don't
[14:57:45] I saw that :), I was 'snooping' while at the hackathon, thanks too, it took a while
[14:58:12] chuckonwu: okok, I think we are missing the bastion security group for toolsbeta, I'll add it (that's why taavi was unable to ssh directly, only through a bastion)
[14:59:07] 👍
[15:01:51] works for me now, sshing without proxy :)
[15:02:00] *jumphost
[15:02:17] (and of course, the last 4 runs of the functional tests now pass...)
[15:37:53] that keystone alert is just me restarting things, it will clear on next check