[07:57:53] <dcaro>	 morning
[07:59:14] <moritzm>	 good morning, I've removed Arturo's access from production systems (SSH access, pwstore, privileged LDAP/Phab groups, Icinga config, VictorOps)
[07:59:39] <moritzm>	 you only need to check on WMCS-specific things whether there
[07:59:47] <moritzm>	 is something that should get removed
[08:00:14] <moritzm>	 I'm sure you've all sorted it out already, just wanted to drop a note about prod access :-)
[08:01:01] <taavi>	 moritzm: morning, we did that yesterday evening
[08:01:03] <dcaro>	 thanks moritzm
[08:02:44] <moritzm>	 great, thanks
[10:23:53] <dcaro>	 I'm going to make some paging tests, I'm oncall so it should page me, but just in case something goes wrong you have been warned :)
[10:29:24] <dcaro>	 hmpf... I have to reinstall victorops on my phone
[10:29:55] <dcaro>	 good news is that it worked :)
[10:30:20] <dcaro>	 https://usercontent.irccloud-cdn.com/file/fe7t9PqE/image.png
[10:36:59] <dcaro>	 I've acknowledged it in victorops, and acked it on karma/alertmanager, not sure what the flow should be though :/
[13:33:05] <Rook>	 Could I get a review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/965514 ?
[13:50:47] <dcaro>	 got to take down the toolsbeta-harbor service for a few minutes, some steps for MRs CI on gitlab might fail in between
[13:51:04] <dcaro>	 Rook: can you add more context to the change? or you are just looking for puppet syntax review?
[13:51:31] <Rook>	 Pretty much just a syntax review. I'm trying to keep puppet from overwriting the git repo
[13:52:26] <taavi>	 should puppet instead pull stuff from the new repo? it seems strange to not have it manage the repo clone, but still have it manage the pip install inside &c
[13:53:20] <Rook>	 It could pull from the github repo. The goal would be to remove puppet entirely, though for now the git bits are the ones causing a problem
[13:54:59] <dcaro>	 what handles the repo clone?
[13:55:44] <Rook>	 If quarry continues to be worked on the repo would be pulled into the bastion where it would be used to deploy to k8s. Puppet isn't needed at that point.
[13:56:14] <Rook>	 Though for now quarry is still running on vms, so it is pulling the wrong repo, overwriting the right one
[14:04:18] <dcaro>	 does that mean that the venv and all that is not needed either?
[14:04:34] <dcaro>	 wait, so it's currently running on VMs, or on k8s?
[14:04:48] <dcaro>	 fyi. toolsbeta-harbor is back online
[14:06:24] <Rook>	 It's still on VMs, some people have been patching it so I've been moving it towards k8s
[14:42:26] <dcaro>	 oh, we are getting slow ops again
[14:47:21] <taavi>	 anything I can do?
[14:50:42] <taavi>	 looks like the impact was smaller this  time?
[14:51:03] <taavi>	 also noticed that the alert changes from yesterday did not merge yet, that's now done
[14:52:57] <dcaro>	 I'm trying to find out if there's anything weird going on with those hard drives, one is from the pool of the ones that we found with errors (1024), but the other is not (1014)
[14:53:19] <dcaro>	 the alert already cleared though
[15:05:15] <andrewbogott>	 taavi: when working with object storage did you run into any file size limits? In codfw1dev I'm getting an error with files  bigger than ~250Mb, haven't tried yet in eqiad1
[15:06:01] <taavi>	 andrewbogott: no, but I also didn't test it with particularly big files
[15:06:10] <andrewbogott>	 ok, I'll keep testing
[15:06:26] <taavi>	 are you getting an error from rados or some proxy in between?
[15:10:09] <andrewbogott>	 I don't know yet, Horizon is just saying 'there was an error'.  for a minute I thought it was going to cut off at a nice round 256 but no such luck
[15:12:44] <dcaro>	 did you substract the tcp header? xd
[15:13:18] <dcaro>	 (always got me in tests)
[15:13:31] <taavi>	 I wouldn't be surprised if that was horizon doing some soft of buffering, instead of an issue with the api itself
[15:13:57] <dcaro>	 agree
[15:25:28] <andrewbogott>	 dcaro: 260Mb works, which would be a lot of header.
[15:25:38] <andrewbogott>	 I'll try this w/out Horizon and see what I can see
[15:58:07] <andrewbogott>	 dhinus: no need to upgrade Horizon as part of the update unless you /really/ want to. The apis are pretty stable from version to version so Horizon tends to work for n +/- 2
[15:58:38] <andrewbogott>	 so if you've upgraded cloudvirts/cloudnets/cloudservices/cloudcontrols I think you're done.
[15:59:44] <dhinus>	 yep they're all done. I think I need to upgrade cloudbackup in eqiad though
[16:00:27] <andrewbogott>	 it's not running any openstack services anymore as far as I know
[16:00:35] <andrewbogott>	 used to have cinder-backup but now that's all done with backy2
[16:00:40] <dhinus>	 ah ok
[16:01:00] <dhinus>	 I was checking the list in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade#Upgrading_Openstack_API_services
[16:01:11] * dcaro off
[16:01:13] <andrewbogott>	 dhinus: so you got a normal-ish experience with the services and cloudvirts nodes, where the cookbook actually worked and there wasn't a ton of followup
[16:01:53] <andrewbogott>	 I removed the section about cloudbackup nodes :)
[16:03:40] <dhinus>	 yep, I just had to fix a small mysql issue (https://gerrit.wikimedia.org/r/c/operations/puppet/+/964858/)
[16:04:42] <dhinus>	 and one cloudvirt had hardware problems (T348531)
[16:04:42] <stashbot>	 T348531: hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531
[16:04:58] <dhinus>	 apart from that, the cookbooks worked fine
[16:08:55] <andrewbogott>	 dhinus: ok, so the next step is just to wait a week or two and see what surprising bugs show up in codfw1dev.
[16:09:01] <andrewbogott>	 Then we can start upgrading eqiad1
[16:09:11] <andrewbogott>	 I'm glad things got smoother
[16:09:25] <dhinus>	 sounds good!
[16:17:15] <dhinus>	 hmm I think I spotted something wrong, cloudservices still contain "openstack-zed-bookworm.sources" instead of "openstack-antelope-bookworm.sources"
[16:18:02] <dhinus>	 in /etc/apt/sources.list.d
[16:49:26] <dhinus>	 looking at /var/lib/puppet/state/classes.txt I see cloudservices2004-dev has "openstack::serverpackages::zed::bookworm" instead of ::antelope
[16:57:45] <dhinus>	 I finally found what's the issue, there's an override in hieradata/hosts/cloudservices2004-dev.yaml
[17:00:21] <andrewbogott>	 that's probably left over from the upgrade from x to z
[17:05:10] <dhinus>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/965546
[17:05:39] <dhinus>	 I think we can simply remove the override and inherit the value?
[17:05:48] <dhinus>	 running a PCC
[17:06:37] <dhinus>	 seems to work
[17:13:09] <andrewbogott>	 I guess you'll need to re-run the upgrade script after that
[17:13:50] <dhinus>	 yep
[17:17:48] <dhinus>	 I'm logging off now, tomorrow I'll re-run the upgrade_openstack_node for those 2 hosts
[17:23:17] * andrewbogott waves
[18:33:24] <blancadesal>	 andrew: both "members" and "viewers" in an openstack project can ssh to any instance of the project, right? Is there any way to limit this at the project-level?
[18:34:38] <taavi>	 blancadesal: yes, and yes
[18:35:54] <taavi>	 the simplest way is to set the `profile::ldap::client::labs::restricted_to` hiera key to an another group, and for more complex stuff you can write your own access rules via puppet
[19:00:22] <blancadesal>	 is this flexible enough to give one subset of users access to an arbitrary subset of instances, but not to others?