[08:12:01] morning
[08:15:45] morning
[09:06:11] is there a way in Puppet to test the version of a package? something like "debian::codename::eq", but for a single package not the distro
[09:06:47] I'm not sure I understand
[09:08:32] what do you need that for?
[09:10:27] mariadb changed a privilege name since version 10.5
[09:10:46] so I need to use the new name only on bookworm, but I wondered if I could query the package version instead
[09:11:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/964858
[09:13:57] dhinus: that's exactly the pattern in use here. Puppet has no notion of installed package versions unless you feed that info via hiera or similar. The patch looks like the right approach to me
[09:23:20] * arturo having laptop problems
[09:23:22] thanks. can I get a +1? PCC is looking good
[09:23:55] ("Hosts:auto" in PCC is so satisfying when it works :P)
[09:24:11] dhinus: +1d
[09:24:28] cheers
[09:25:13] dcaro: there are even more tools with broken database credentials :( https://phabricator.wikimedia.org/T348502
[09:32:45] taavi: want me to look?
[09:34:16] dcaro: if you have any ideas, I would appreciate it
[09:34:29] taavi: looking :), do you have any suspicions?
[09:55:07] taavi: it seems it's a wrong command from the user side
[09:55:34] ah sigh, thanks for looking
[10:29:40] do you all have something you want me to work/do in my last hours here @ WMCS? any task, patch or similar?
[10:31:33] can you finish the grid decom please?
[10:31:36] * taavi hides
[10:31:45] :-)
[10:54:49] +1 for the grid decom xd, just pull the plug on your way out
[10:55:25] * dcaro lunch
[11:05:52] something weird is happening in cloudvirt2004-dev, I see some VMs running, but Puppet is failing with "Read-only file system @ rb_sysopen - /var/lib/puppet/state/agent_catalog_run.lock"
[11:06:46] and "less /var/log/puppet.log" returns "-bash: /usr/bin/less: Input/output error"
[11:33:37] draining the host also fails, I'll try rebooting it
[11:34:05] that's on the cloudvirt? or the vms inside of it?
[11:34:12] (I'm guessing the cloudvirt)
[11:34:21] cloudvirt
[11:34:27] I can't even reboot it
[11:36:00] maybe hard reboot from the mgmt interface?
[11:36:13] feels like the hard drive gave up :/
[11:36:25] is it a single drive?
[11:36:52] * dhinus having lunch now, back in a bit :)
[11:46:34] it should be raid
[11:48:49] lots of nrpe unreachable alerts popping up!
[11:49:21] check_dpkg or something else?
[11:49:23] only for dpkg check
[11:49:46] yes, probably the check script is failing, and nagios does not understand the output
[11:49:53] that's being removed / moved to prometheus in https://phabricator.wikimedia.org/T332764
[11:50:00] aaah, okok
[11:58:36] arturo, dcaro: in the catalyst-k3s scenario we discussed yesterday where each environment runs on its own vm and the user is able to ssh into it, remind me why we considered also having a general ssh bastion?
[11:59:03] blancadesal: in that scenario there's no bastion needed
[11:59:11] (the vm itself acts as bastion)
[11:59:26] well, there is a need for a floating IP somewhere. The bastion is the only VM that has a floating IP and thus the only one reachable from the internet
[11:59:49] but that's the same bastion anyone uses for VM instances no?
[11:59:57] (not an extra bastion, like toolforge has)
[12:00:28] * arturo nods
[12:01:16] so there needs to be one for the sake of assigning a floating ip to it, as the env-vms are all ephemeral?
[12:02:46] not really: you need one, but that's already there (the one offered by cloudVPS, that is used to ssh to any VM). It's needed because the VMs themselves don't have public IPs, so you can't reach them from the internet
[12:04:03] to be clear, that bastion is there by default?
[12:05:13] yes, the docs are a bit hidden though, at the end of this section, on how to ssh to an instance: https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_Instances#Working_with_Cloud_VPS_Instances
[12:05:44] thanks
[12:06:31] oh, this is better https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Accessing_Cloud_VPS_instances
[12:08:00] yeah, the general bastion could be enough
[12:08:10] if the ssh key is offered via the catalyst UI
[12:08:19] you copy-paste to your terminal
[12:08:32] and that's all from the access control POV
[12:09:21] we also talked about the option of splitting out the "management" bits to their own Cloud VPS project, meaning that this second project would have to create/halt/destroy VMs on the "VM runner" project via the Openstack API. Does this make things more complicated from a networking/access control POV?
[12:11:31] Same, I'd say. That one though might enable them to control access to the VMs using the project access list in horizon/openstack/ldap, instead of having to create an ssh key or password per VM
[12:12:40] how would that work?
[12:14:54] they would have to add the users to the openstack project, and then with some puppet magic (like toolforge does) set up the proper access on the given hosts/instances
[12:15:31] it would be more complicated imo
[12:16:37] I don't have all of the context, but I'm a bit confused
[12:17:53] if each catalyst "env" has its own cloud vps project, why do you need any special ssh keys or such to log in to the hosts? you can add the people who need access to the cloud vps project, and utilize the default cloud vps puppetization that lets project members (and viewers in new openstack terms) log in to the instances in that project
[12:19:06] we were discussing the other day the idea of instantiating golden image k3s/minikube VMs for catalyst to deploy their stuff
[12:19:22] then share the ssh key for that VM using the catalyst UI for users to ssh to them and be able to use kubectl
[12:19:44] all this can be done in a single Cloud VPS project, I think
[12:20:54] another option, the last one blancadesal was asking about, was to have catalyst itself in one project, and have the VMs created in another project, where we set up the access by giving project rights (or one project per team)
[12:22:50] arturo: is there anything missing on https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/47#32b9b28f71cde4f8fc775568cea4f6ceba670187 ?
[12:23:32] dcaro: I don't think so. I'm pretty happy with the code, but I didn't check why the CI is failing
[12:23:44] oh, publish step
[12:24:18] it needs a manual rebase too, it seems
[12:24:28] I can do that
[12:36:53] I raised T348531 for the HDD errors in cloudvirt2004-dev
[12:36:54] T348531: HDD failure in cloudvirt2004-dev - https://phabricator.wikimedia.org/T348531
[12:37:41] dcaro: rebased
[12:38:13] dhinus: thanks
[12:38:26] dhinus: I added #ops-codfw too
[12:38:39] I believe you need to add #ops-site to ensure a prompt reply
[12:39:06] RhinosF1: thanks, I always forget!
[12:39:28] where do I find the HDD/RAID config for that specific host? I don't see it in Netbox
[12:41:43] dhinus: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook#HDD_&_SSD_Failures
[12:42:01] which says cat /proc/mdstat
[12:42:48] "md0 : active raid1 sdb2[0](F) sda2[1]"
[12:43:57] dhinus: next, run sudo mdadm --detail /dev/md0
[12:44:15] paste both into the task
[12:44:57] done
[12:45:49] I'm puzzled by the fact mdadm shows only one drive as "faulty"; only one drive failing should not give I/O errors, I think?
[12:46:03] no idea
[12:46:14] check if the template from https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ has anything not in the task
[12:49:14] yep, raid should be able to work with one hard drive down
[12:50:00] renamed the task to follow the template there, and marked as failed in netbox
[13:03:27] Raymond_Ndibe: ping, toolforge meeting?
[13:04:24] about to ping you dcaro. Can you send me an invite with rndibe@wikimedia.org?
[13:04:50] my -ctr email is stuck in an invite sign-in loop in Okta
[13:05:00] dcaro:
[13:05:10] Raymond_Ndibe: done
[13:05:12] \o/
[13:17:25] thanks
[15:14:26] dcaro: I wanted to add a wiki link to the google re:work book you mentioned, but for some reason that URL now redirects to a Japanese-only page :D https://rework.withgoogle.com/
[15:15:05] hahahah
[15:15:11] true, it does not work anymore!
[15:19:47] it was there in February
[15:19:48] https://web.archive.org/web/20230225140257/https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/introduction/
[15:21:25] it seems the change happened between Sep 22nd and Oct 9th
[15:24:20] oh, the info is still all there though
[15:25:31] https://rework.withgoogle.com/jp/guides/understanding-team-effectiveness
[15:25:51] there's a language selection at the bottom, but changing to English redirects to the splash page in Japanese xd
[15:27:37] very weird LOL, maybe just a bug?
[15:28:30] your team will be more effective if they speak Japanese is what I'm getting out of that
[15:29:07] hahahahaha, yep, that might be it :)
[15:34:58] dhinus: you have a reply from DC-Ops
[15:43:13] thanks, replied!
[16:00:50] * arturo offline
[16:17:25] dhinus: for the cookbooks patch, it seems like some dependency got upgraded or similar and the code now is incompatible with it, I can take a look tomorrow if you want
[16:17:28] * dcaro off
[16:41:38] hi all, I'm curious: how does one change the wmcs base images?
[16:42:20] I'd like to move the puppet 5.5.22-2+deb12u3 package out of main, but this would mean adding component/puppet5 to the wmcs images
[16:44:06] jbond: we have a script (wmcs-image-create) that takes an upstream debian cloud image and builds a puppetized image out of it
[16:45:01] but I suspect puppet is installed via cloud-init, so maybe the thing to modify is the cloud-init vendordata file
[16:45:25] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/templates/nova/vendordata.txt.erb#105
[16:46:42] taavi: thanks, I'll take a look at that, although I see production puppet also installs this from main, so it may need a bit more thought/refactoring
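
For reference on that last thread: a minimal, hedged sketch of what the production-puppet side of the refactor could look like if the package does move to component/puppet5. It assumes the apt::package_from_component helper that operations/puppet provides for installing packages from a non-main component; the resource title and parameter names below are illustrative and worth double-checking against the apt module before use.

    # Hedged sketch, not an actual patch: install the puppet package from the
    # new apt component instead of main. Title and parameters are illustrative.
    apt::package_from_component { 'puppet5':
        component => 'component/puppet5',
        packages  => ['puppet'],
    }

This only covers hosts after their first agent run; the wmcs base images would still need the matching apt source available at image-build or cloud-init time (via wmcs-image-create or the vendordata template taavi linked), since puppet has to be installable before puppet itself can manage anything.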