[08:03:05] CAS-SSO, Infrastructure-Foundations: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748 (MoritzMuehlenhoff)
[08:04:26] CAS-SSO, Infrastructure-Foundations: Move CAS to Java 17 - https://phabricator.wikimedia.org/T357749 (MoritzMuehlenhoff)
[12:31:43] in the public homer repo, policies/cr-labs.yaml enables a puppetserver_group, where is the membership of puppetserver_group defined? can't find it in the private or public Homer repos and searching for it in Netbox also yields no results
[12:35:33] moritzm: it's automatically generated from all netbox devices with that name prefix
[12:36:36] there's a script https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/ to refresh those definitions
[12:43:21] ah, thanks!
[12:43:58] I'll run that (I noticed that some of the codfw1dev cloud test servers failed to connect to our new puppetserver2003)
[12:44:49] yeah it's come up before
[12:45:32] I wonder should we create a cronjob/systemd-timer or something to execute it nightly?
[12:47:17] if we complement it with some status output (e.g. mailing a diff if there is one like we do for public hosts diff-scan), that sounds useful to me
[12:47:40] for now I'll add a note to the Puppet docs on wikitech
[12:49:10] moritzm: I see your run failed actually
[12:49:24] JobTimeoutException: Task exceeded maximum timeout value (300 seconds)
[12:49:55] I initially ran it as dry-run only because I was curious what happens under the hood
[12:49:56] do you want to run it again? or I will try?
if it keeps doing the same I'll try to work out what the options are
[12:50:00] currently re-trying for real
[12:50:04] ok
[12:50:33] this is what it does:
[12:50:34] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/capirca.py
[12:52:16] thx
[12:55:19] the non-dry-run also failed, hitting a timeout: https://paste.debian.net/hidden/aae7408c/
[13:37:02] yeah that's getting problematic
[13:37:35] we probably went over a certain threshold in terms of hosts or data in the DB
[13:37:42] hopefully the netbox upgrade will help here
[13:37:53] usually running it again will fix it
[13:37:59] the panacea for all sins
[13:38:00] :D
[13:38:47] it will fix lots of issues I'm sure, not sure how many new ones it will bring :)
[13:38:51] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: spicerack.redfish needs to know about Jobs as well as Tasks - https://phabricator.wikimedia.org/T357764 (Volans) p:Triage→Medium
[14:20:48] Hi. unless I'm missing something, the sre.hardware.upgrade-firmware cookbook is documented as rebooting the host, but if I do an idrac-only update (-c idrac) it seems to not in fact do so. Is that expected?
[14:21:51] idrac updates don't need to reboot the host
[14:21:54] only the bmc
[14:22:30] Ah, that wasn't entirely clear; I don't suppose there's a "reboot it anyway" option?
[14:29:27] why would you :D
[14:29:57] because the raid-deletion job doesn't work after a firmware update unless the host is rebooted first.
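Editorial sketch for the nightly-refresh idea raised at [12:45:32] and [12:47:17]: one way to produce the "diff if there is one" status output is to compare the previous and current group definitions and only report when they differ. Everything here is an assumption for illustration: the snapshot shape (a dict of group name to host list), the function names, and every host name other than puppetserver2003. It does not reflect how capirca.GetHosts stores or generates its data.

```python
import difflib


def render_definitions(groups):
    """Render {group_name: [hosts]} as sorted, stable text lines."""
    lines = []
    for name in sorted(groups):
        lines.append(f"{name}:")
        for host in sorted(groups[name]):
            lines.append(f"  {host}")
    return lines


def definitions_diff(old, new):
    """Return a unified diff between two snapshots, or '' if they are identical."""
    diff = difflib.unified_diff(
        render_definitions(old),
        render_definitions(new),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical snapshots; only puppetserver2003 is named in the log.
previous = {"puppetserver_group": ["puppetserver1001", "puppetserver2001"]}
current = {
    "puppetserver_group": ["puppetserver1001", "puppetserver2001", "puppetserver2003"]
}

report = definitions_diff(previous, current)
if report:
    # A nightly job would mail this text instead of printing it.
    print(report)
```

Sorting both groups and hosts before diffing keeps the output stable, so a report is only produced when membership actually changes.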
[14:30:09] (no, I don't know why, I just observe that this is the case)
[14:30:50] I can stick it into the convert_disks cookbook myself, it just feels less DRY
[14:31:24] in case of existing hosts it does call sre.hosts.reboot-single
[14:31:42] not in all cases (this is an existing host)
[14:32:14] no, what I mean is that when it does the reboot (bios or driver upgrade), it just calls the sre.hosts.reboot-single
[14:32:25] oh I see what you mean, sorry
[14:32:38] so if you need a reboot you can just add that
[14:34:14] that said we could add an option to force a call to self._reboot() with idrac updates too
[15:54:26] volans: hi, how do you deploy debmonitor nowadays? I am wondering whether operations/software/debmonitor/deploy Gerrit repo is still relevant? ;)
[15:55:03] was recently migrated to deb package
[15:55:15] so that repo will be archivable I think
[15:55:42] oh a debian package of course
[15:55:52] since you have the perms to do so :)
[15:55:58] this is like in the last 2 weeks
[15:56:02] I guess operations/software/netbox-deploy would be similar if not already?
[15:56:13] no, that's still valid and will stay that way
[15:56:40] btw jhathaway moritzm godog -- I think https://gerrit.wikimedia.org/r/1004164 should fix the pcc failures happening for the idp hosts
[15:57:33] volans: netbox-deploy I think eventually we will have to redo it cause it is rather large, though it is not causing immediate troubles ;)
[15:57:36] thanks for the confirmations!
[15:58:05] cdanis: gah, my bad! thank you
[15:58:15] hashar: we can totally "reset" it if needed
[15:58:26] np godog! easy thing to miss and it's not like there's any automated checking of hosts breaking in pcc
[15:58:52] volans: yes eventually, but there is zero pressure to do it any time soon it is fine keeping it as is
[16:00:07] basically we don't care of the history of artifacts/ that is surely what's making the size large
[16:05:17] and we can offload them to LFS :)
[16:10:43] cdanis: thanks, +1ed
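Editorial sketch for the firmware thread ([14:20:48] to [14:34:14]): the suggestion was an option that forces a reboot even for iDRAC-only updates. The snippet below only illustrates that decision logic. The `--force-reboot` flag, the function names, and the exact component list are hypothetical and are not options of the real sre.hardware.upgrade-firmware cookbook; only `-c idrac` and the "bios or driver upgrade" reboot behaviour come from the log.

```python
import argparse

# Per the discussion, BIOS or driver upgrades already reboot the host,
# while iDRAC-only updates restart just the BMC.
REBOOTING_COMPONENTS = {"bios", "driver"}


def build_parser():
    """Build a parser with a hypothetical --force-reboot option."""
    parser = argparse.ArgumentParser(
        description="Hypothetical firmware upgrade options (illustration only)"
    )
    parser.add_argument(
        "-c", "--component", choices=["idrac", "bios", "driver"], required=True
    )
    # Hypothetical flag: not part of the real cookbook.
    parser.add_argument(
        "--force-reboot",
        action="store_true",
        help="reboot the host even if the component update does not require it",
    )
    return parser


def should_reboot(component, force_reboot):
    """Reboot when the component requires it or when the caller forces it."""
    return force_reboot or component in REBOOTING_COMPONENTS


args = build_parser().parse_args(["-c", "idrac", "--force-reboot"])
if should_reboot(args.component, args.force_reboot):
    # The real cookbook would call its reboot step here.
    print(f"reboot host after {args.component} update")
```

Keeping the decision in one small function would let a caller such as a disk-conversion workflow request the reboot without duplicating the reboot logic itself.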