[07:59:22] what's up with the clouddumps1002 io alerts?
[08:00:17] morning
[08:01:19] o/
[08:05:00] there's been a spike on write operations on 1002 for the last ~5.5h
[08:16:52] oh, not write no, read xd
[08:17:01] (too many colors in the graph, they start repeating)
[08:17:04] anyhow, there's a
[08:17:11] few universities doing an rsync from it
[08:17:56] I think it's ok for now, if it keeps too high (say until tomorrow), or it starts failing we can try to slow down things
[08:18:23] that alert has never been great (most of the time we don't really act on it)
[08:53:11] quick review -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142520 updating toolsbeta prometheus certs
[08:54:27] lgtm
[08:54:37] (assuming you have the private key handy somewhere)
[08:55:02] yep, uploaded to the puppetserver already
[11:25:36] review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142572?
[11:26:27] +1d
[11:27:09] thanks!
[11:35:51] no, that wasn't it :(
[12:02:56] super weirdly the same issue happens when using :close() instead of :set_keepalive()
[12:11:34] looks like an interesting rabbit hole :)
[12:12:33] yeah
[12:12:43] i'm starting to be convinced that this is an nginx bug
[12:54:57] andrewbogott: at which point in the debian release process can we spin up a trixie prerelease image in codfw1dev? would be great to check if the nginx version in trixie is doing the same thing
[12:58:04] given that there's a few of us on pto, I'm thinking of moving the toolforge service check-in to next week (that way it also avoids overlapping with the monthly meeting), does anyone prefer to do it today?
[12:58:47] moving it sgtm
[12:59:14] taavi: looks like there are already dailies at https://cloud.debian.org/images/cloud/trixie/daily/ so I'll see if I can build one now. If that fails you can always just use a raw upstream image.
[13:25:13] taavi: cloud-init seems to not work properly in those dailies but it works well enough to inject a key. debian-13-raw-1 in 'testlabs' and the key is added for user 'debian'
[13:31:28] next fix attempt that seems more promising than the first one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1142598/
[13:32:27] so it's a lua version difference
[13:33:15] * andrewbogott lacks context but that change looks harmless at worst
[13:34:32] i have that applied as a local hack on proxy-5 and so far i haven't seen that error happening
[13:34:48] while previously it was pretty well reproducible by doing a hard refresh on quarry.wmcloud.org when on a v6 connection
[13:35:31] xd, it was already merged, +1d anyhow
[13:35:32] so i'll merge and try re-enabling v6 for tools-static to generate more traffic
[13:35:53] does it only happen on v6?
[13:36:23] only v6 traffic is being pushed to the new bookworm proxies for now
[13:36:34] ah okok
[13:36:41] since i need to backfill security group things before flipping the switch for v4
[13:43:05] so far the fix is looking very good
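A minimal sketch of the :close() vs :set_keepalive() choice being debugged above, assuming an OpenResty cosocket-based client; lua-resty-redis is used here purely as an example, and the host, port, key name and pool parameters are made up for illustration. This is not the actual proxy code from the Gerrit changes linked in the log.

```lua
-- Illustrative only: look something up in Redis from nginx/Lua, then either
-- return the connection to the keepalive pool or close it outright.
local redis = require "resty.redis"

local red = redis:new()
red:set_timeout(1000)  -- 1s timeout for connect/send/read

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    ngx.log(ngx.ERR, "redis connect failed: ", err)
    return ngx.exit(502)
end

local backend, err = red:get("frontend:" .. ngx.var.http_host)
if not backend or backend == ngx.null then
    ngx.log(ngx.ERR, "no backend found: ", err)
end

-- set_keepalive() hands the connection back to the per-worker pool
-- (10s idle timeout, up to 100 pooled connections); the alternative,
-- red:close(), tears it down. Per the log above, the error reproduced
-- with either call before the fix was applied.
local ok, err = red:set_keepalive(10000, 100)
if not ok then
    ngx.log(ngx.ERR, "failed to set keepalive: ", err)
    red:close()
end
```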
[14:22:16] andrewbogott: to answer your question I'm not 100% sure what the normal allocation workflow for those cloud-private IPs is
[14:22:23] certainly they get added to Netbox:
[14:22:24] https://netbox.wikimedia.org/ipam/prefixes/657/ip-addresses/
[14:22:42] but I'm not sure exactly what configures them on the host side, it may be in puppet/hiera
[14:22:49] I suspect taavi might be able to advise us here
[14:23:07] is the networking (in terms of vlans on ports) correct on the switches for those hosts you mentioned?
[14:25:31] I don't know. I was assuming so since they used to be cloudcontrols but let's look...
[14:26:44] which hosts is this about?
[14:27:11] cloudrabbit200[123]-dev, recently renamed from cloudcontrol200[789]-dev
[14:27:25] they do not have the rabbit puppet role applied to them yet
[14:28:19] ok, so the old names already had cloud-private addresses allocated https://netbox.wikimedia.org/ipam/prefixes/657/ip-addresses/
[14:29:00] that's good, so let's see if I can edit those records...
[14:29:02] i think updating the dns names (and running the netbox dns cookbook) should be enough for that, the netbox puppetdb import script will take care of the rest
[14:29:02] i think updating the dns names (and running the netbox dns cookbook) should be enough for that, the netbox puppetdb import script will take care of the rest
[14:29:04] oops
[14:30:21] doubly correct :)
[14:31:03] ...is that 'sre.dns.netbox'?
[14:31:07] the netbox puppetdb import script will attach them to the hosts properly, but that is just for information and as you can see hasn't happened with all of them
[14:31:17] andrewbogott: yes that's the one
[14:31:37] with no args, right?
[14:31:46] * andrewbogott is not looking forward to breaking DNS foundation-wide
[14:31:51] yep no args
[14:31:59] ok, running
[14:32:04] it'll prompt you with the diff anyway, which should make sense (i.e. changing the names you modified)
[14:33:10] also I checked the switch ports for those hosts, they are correct
[14:36:00] it seems to be doing the thing
[14:37:48] ftr there was another pending change elasticsearch->cirrussearch
[14:42:06] probably you can ask Brian King about that, but I know they are moving and renaming those hosts so probably fine
[14:42:21] I generally say "yes" if it looks like regular work happening in parallel, with say a given host
[14:42:38] the thing to watch out for is that it's not deleting "en.wikipedia.org" or something :)
[14:53:51] yeah, I know they're renaming things so it seemed unsurprising
[14:56:03] dcaro: one thing I forgot to mention in the checkin: I cleared out cloudcephosd2004-dev from the pool and re-added it and now it seems to work fine. It is much bigger than the other osds in codfw1dev so balancing may be a bit silly until we get some more large ones there. T392366
[14:56:04] T392366: Service implementation for cloudcephosd2004-dev - https://phabricator.wikimedia.org/T392366
[14:56:25] andrewbogott: thanks!
[14:56:38] chuckonwu: we don't have security groups on tofu-provisioning yet, right?
[14:56:50] I also drained and decom'd cloudcephosd100[123] over the weekend
[14:57:44] dcaro, we don't
[14:57:45] I saw that :), I was 'snooping' while at the hackathon, thanks too, it took a while
[14:58:12] chuckonwu: okok, I think we are missing the bastion security group for toolsbeta, I'll add it (that's why taavi was unable to ssh directly, only through a bastion)
[14:59:07] 👍
[15:01:51] works for me now, sshing without proxy :)
[15:02:00] *jumphost
[15:02:17] (and of course, the last 4 runs of the functional tests now pass...)
[15:37:53] that keystone alert is just me restarting things, it will clear on next check