[03:23:29] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:08:29] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:03:29] (SystemdUnitFailed) firing: (5) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:30] (SystemdUnitFailed) firing: (5) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:02] CAS-SSO, Infrastructure-Foundations, SRE, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (Jelto) Maybe it makes sense to create a dedicated task to discuss the general usage and policies for developer account naming? The...
[12:03:29] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:30] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:20:22] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, Patch-For-Review: Add support for knams as PoP in tooling and automation - https://phabricator.wikimedia.org/T340465 (Volans) Open→Resolved All changes required have been merged, if anything else comes up later we can re-open this...
[13:51:43] til: https://aerleon.readthedocs.io/en/latest/faq/ capirca was a typo, so they named the new project with a typo as well :)
[13:52:57] :facepalm:
[13:55:30] amazing
[14:03:14] haha
[14:03:30] what's the actual Battlestar Galactica planet called?
[14:04:11] Aerelon
[14:04:16] https://galactica.fandom.com/wiki/Aerilon
[14:04:28] yeah I had to look it up
[14:04:50] not like it would have been any easier to use
[14:04:54] interesting, in the list of planets they have it written one way
[14:05:02] the detailed page uses another spelling
[14:05:58] volans: probably not written by a native from the planet
[14:06:47] lol
[14:14:15] netbox, DC-Ops, Infrastructure-Foundations: Netbox device's platform field inconsistency - https://phabricator.wikimedia.org/T336623 (Volans) Open→Resolved a:Volans Given the consensus I've run this code in netbox nbshell: `lang=python >>> import uuid >>> request_id = uuid.uuid4() >>> us...
[14:44:26] XioNoX: heads-up, though nothing that should affect the current deployment but I reverted https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/e802654ef9a837d6b5b6ac869ae8900a72b8def8
[14:44:58] ran into some issues with a new bookworm host, where validate_cmd fails so it doesn't create the file because the check fails and thus a circular problem
[14:45:11] not sure why we never observed this before but I will dig later, for now I just removed validate_cmd
[14:45:21] which is not a big deal but yeah, thought you should know
[14:45:39] sukhe: noted, thanks
[14:45:40] I will most likely fix this in the deb so that we install the dummy/default conf file to the location of the validate_cmd
[14:46:50] sukhe: not sure I fully understand
[14:47:38] XioNoX: validate_cmd is trying to call --check but the file does not exist, so it fails
[14:50:05] sukhe: can we do "validate_cmd => '/usr/bin/anycast-healthchecker -f % --check'" ?
[14:50:27] basically tell puppet to check the new version of the file
[14:50:36] (if I understand the doc correctly)
[14:53:42] yeah
[14:53:42] Invalid configuration: /etc/anycast-healthchecker.conf configuration file either isn't readable or doesn't exist
[14:53:52] -f would be this but it was still failing
[14:54:48] the only thing that I am truly confused about is that nothing should have changed between bullseye and bookworm to cause this
[14:55:07] and I did recdns reimages, where the same thing is called, so not sure why this would matter
[14:55:37] sukhe: maybe we added the validate stuff after anycast-hc was deployed everywhere?
[14:55:43] ah right
[14:56:16] "-f would be this but it was still failing" I think longer term it's a cleaner fix than a dummy file
[14:56:24] to add -f % etc
[14:56:59] I worry that the current validate_cmd doesn't prevent a mistake from being rolled out
[14:57:17] if it validates the current live file and not the new one
[14:57:55] XioNoX: maybe I am misunderstanding something but -f % simply would mean /etc/anycast-healthchecker.conf, which is what it fails on above in the puppet run
[14:58:12] the default is also that, do we gain something extra by specifying -f % here?
[14:58:25] -f % would replace the % with a temp path
[14:59:03] looking at https://www.puppetcookbook.com/posts/validate-configfiles-before-deployment.html
[14:59:26] ah, I don't see it in the official documentation
[14:59:31] interesting, ok we can certainly try that
[14:59:41] I reimaged durum6001, can do that for durum6002
[14:59:49] but still doesn't explain why the recent recdns reimages worked just fine :D
[15:00:19] https://github.com/puppetlabs/puppetlabs-stdlib/blob/main/lib/puppet/parser/functions/validate_cmd.rb
[15:00:25] taking a % as a placeholder for the file path (will default to the end).
[15:01:01] so without the -f % it probably runs something like `/usr/bin/anycast-healthchecker --check /etc/myfutureconfig.conf`
[15:01:45] XioNoX: https://puppetboard.wikimedia.org/report/durum6001.drmrs.wmnet/12975b76afcbdf67802785034f65641ea0877c8f
[15:01:52] Execution of '/usr/bin/anycast-healthchecker --check' returned 1:
[15:01:55] Invalid configuration: /etc/anycast-healthchecker.conf configuration file either isn't readable or doesn't exist
[15:02:00] so it seems to be reading the correct file at least
[15:02:30] I don't think so
[15:02:54] I think it should read something like `/etc/anycast-healthchecker-tmpXXXX.conf`
[15:03:18] randomly generated string by puppet, before replacing the final one
[15:03:50] here it's like it's not even trying to append the file path to the end of the validate_cmd
[15:04:00] maybe it's a puppet change between versions?
[15:05:12] https://www.puppet.com/docs/puppet/8/types/file.html#file-attribute-validate_cmd
[15:05:19] Note that if a validation command requires a % as part of its text, you can specify a different placeholder token with the validate_replacement attribute.
[15:06:02] or maybe a simple puppet ordering issue
[15:06:11] because well none of this explains the recent recdns reimages :]
[15:06:14] which went just fine
[15:06:17] and where we do call this
[15:06:33] in a meeting but will come back to this
[15:06:38] will put -f % first to see
[15:06:56] yeah no idea either about the recdns stuff :)
[15:07:20] anyway the good news is that durum worked, so that sorted out the bookworm deps for anycast :)
[15:07:27] doing durum6002 now to see
[15:07:33] sweet!
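Editor's note on the exchange above: the `%` placeholder behavior being debated can be sketched in a few lines of Python. This is only an illustration under an assumption, namely that the file resource substitutes the token and does not append the path when the token is absent (which is consistent with the puppetboard output quoted above, where the executed command was just `/usr/bin/anycast-healthchecker --check`). The helper name and the temp path are hypothetical, not Puppet internals.

```python
import shlex


def build_validate_command(validate_cmd: str, temp_path: str, token: str = "%") -> list[str]:
    """Hypothetical helper: replace the placeholder token with the temp file path.

    Assumption: when the token is absent, nothing is appended, so the
    command never sees the new (not yet installed) file.
    """
    return shlex.split(validate_cmd.replace(token, temp_path))


# Made-up temp path, standing in for the file staged before the final replace.
tmp = "/etc/anycast-healthchecker.conf.tmp1234"

# No token: the command runs as written and falls back to its default config path.
without_token = build_validate_command("/usr/bin/anycast-healthchecker --check", tmp)

# With token: the staged file is what gets validated.
with_token = build_validate_command("/usr/bin/anycast-healthchecker -f % --check", tmp)

print(without_token)
print(with_token)
```

Under that assumption, the first form validates whatever already lives at the default path (or fails when nothing is there yet, as on a fresh host), while the second form validates the new content before it is rolled out.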
[15:12:28] XioNoX: https://gerrit.wikimedia.org/r/c/operations/puppet/+/946958/
[15:13:01] if durum6002 also fails on this now, means something else changed in between
[15:13:10] sukhe: +1
[15:34:32] netbox, Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (jijiki) >>! In T341843#9073121, @ayounsi wrote: > Not sure if it's a Netbox or Redis issue :( > There we can see that there are constantly 2 blocked cl...
[15:49:51] CAS-SSO, Infrastructure-Foundations, SRE, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808) >>! In T320390#9076268, @Jelto wrote: > As far as I understand login and registration of new accounts works fine and the co...
[16:08:30] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:27] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:01:54] Packaging, Infrastructure-Foundations, Thumbor, Wikimedia-SVG-rendering: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Izno)