[07:01:39] XioNoX: pm?
[07:04:18] RhinosF1: what's up?
[07:05:52] moritzm, topranks: wrt the failed decom last week, my suggestion would have been to just retry the decom cookbook. It's meant to be idempotent, so re-running it right away shouldn't be a problem and should take care of any missed step in case of a transient failure
[07:06:16] it's correct that a 255 exit from clustershell usually means an ssh connection failed or was interrupted
[07:09:07] yeah, we'll see when flerovium is ready to be taken down
[07:10:29] ack, lmk if you want me to debug it
[07:12:17] ack
[08:06:48] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10fgiunchedi) Looks like the alert is working as expected: https://alerts.wikimedia.org/?q=%40sta...
[08:21:28] XioNoX, topranks: when you have a chance could you re-arm the homer key on cumin2002 please?
[08:22:22] 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10fgiunchedi)
[08:23:45] (SystemdUnitCrashLoop) firing: routinator.service crashloop on rpki2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:28:21] Sep 11 08:27:55 rpki2002 routinator[1667211]: Fatal: failed to open file /var/lib/routinator/repository/rrdp/rrdp.ripe.net/21d6592469dbe79feb2922562764fd193170f173229298b9a4443ffb5c282000/tmp/rpki.ripe.net/repository/DEFAULT/ws5Z0a-DDJS6jqEc-P7G3ZwDuRc.cer: No space left on device (os error 28)
[08:28:21] Sep 11 08:27:56 rpki2002 routinator[1667211]: Fatal error. Exiting.
[08:28:42] sad_trombone.wav
[08:28:57] it's tmpfs https://phabricator.wikimedia.org/T300955
[08:30:10] they have 6GB of RAM now, correct?
[08:30:27] tmpfs 4.0G 2.3G 1.8G 56% /var/lib/routinator/repository
[08:30:28] how much would they need? we're going towards the high end of RAM for a VM
[08:30:37] they have enough, it seems
[08:30:52] probably not during the update, it might double the data
[08:30:58] "but giving it 4GB will allow ample margin for future growth"
[08:31:02] and then remove the old one
[08:31:24] unless we/it are/is not cleaning some old cruft
[08:31:34] yeah
[08:31:39] how do I clear that cleanly?
[08:31:47] just delete everything under that path?
[08:31:58] or is there something specific to tmpfs?
[08:32:21] it's a filesystem
[08:32:57] so rm -rf works just fine?
[08:33:02] yep
[08:33:45] (SystemdUnitCrashLoop) resolved: routinator.service crashloop on rpki2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:34:13] alright, restarting and monitoring it
[08:35:16] slyngs: FYI idp-test1002 runs a git pull of cas-overlay-template at every puppet run, making a change on every run
[08:37:18] that's expected, that's the place where we build the cas debs (and the cas-overlay-template repo is the source for it)
[08:37:35] but I doubt it changes every time
[08:37:42] the repo
[08:37:54] fully initialized: tmpfs 4.0G 2.1G 2.0G 51% /var/lib/routinator/repository
[08:38:16] XioNoX: I bet it doubles when updating the data, so needing 4.2...
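For reference, the cleanup agreed on above amounts to something like the following. A sketch: the unit name routinator.service and the mount point come from the pasted log, but the exact sequence and flags are assumptions.

```
# Stop routinator, clear the tmpfs-backed cache, restart, and watch it
# re-sync. tmpfs behaves like any other filesystem, so rm -rf is enough.
sudo systemctl stop routinator.service
sudo rm -rf /var/lib/routinator/repository/*
df -h /var/lib/routinator/repository    # confirm the space was freed
sudo systemctl start routinator.service
sudo journalctl -fu routinator.service  # watch for further ENOSPC errors
```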
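On the exit-code remark at [07:06:16]: ssh reserves status 255 for its own errors, and clustershell reports that per-node return code, which is why 255 usually means the connection failed rather than the remote command. A minimal way to see this, using a deliberately unresolvable hostname:

```
# ssh exits 255 on its own failures (resolution, refused connection,
# timeout, interruption), distinct from the remote command's status.
ssh -o ConnectTimeout=2 unreachable.invalid true
echo "ssh exit: $?"   # prints 255 on a connection failure
```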
[08:39:22] moritzm: cumin has git::clone with ensure => 'latest' but doesn't make a change at every puppet run, only if there are new commits to pull
[08:39:38] I didn't check what the idp does in puppet
[08:40:45] Is it git or the rsync that causes the change?
[08:41:01] We do a git clone and then an rsync
[08:41:17] moritzm: fyi - https://github.com/NLnetLabs/routinator/issues/880
[08:41:40] slyngs:
[08:41:44] https://puppetboard.wikimedia.org/report/idp-test1002.wikimedia.org/eef7cb484c2263303c8c5f877b55159b7e94c72c
[08:41:48] volans: keyholder armed
[08:41:57] thx!
[08:43:31] volans: I'll take a look, because we do the same with the IDM and deployments and that doesn't cause a change on every run
[08:43:33] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:49] volans: https://github.com/NLnetLabs/routinator/issues/889
[08:51:55] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) In terms of the LVS connections from rows C and D, when we move from old switches to new ones we need to land those on the Spines rather t...
[08:53:03] XioNoX: ack, let's see what they say
[08:55:53] moritzm: on a semi-related note, sretest1001 is making a change at every run, something related to ferm it seems. By any chance are you doing some tests there?
[08:55:56] https://puppetboard.wikimedia.org/report/sretest1001.eqiad.wmnet/6bb8004d8c9799f4a25b0104effecf65f758210a
[08:56:36] for the ferm/nftables migration
[08:56:55] yeah, this is currently being moved to nftables, WIP
[08:57:07] XioNoX: oh, nice
[08:58:24] ack thx
[09:00:12] slyngs: ah, I just noticed that if there are local modifications then the git pull is attempted at every run
[09:00:18] maybe that's the case
[09:01:12] Yeah, I think it's the build process. So maybe doing a checkout in /srv/cas-checkout, then notifying and copying the files to the build directory, would avoid the issue
[09:03:10] possibly, the check of git::clone is:
[09:03:10] unless => "${git} fetch --tags --prune --prune-tags && ${git} diff --quiet ${ref_to_check}"
[09:05:40] Okay, so it adds to the debian/changelog and that's causing the problem
[09:06:10] we can just reset the local checkout, let me see what the diff is
[09:06:35] That will work too
[09:07:56] fixed, double-checking with a Puppet run
[09:12:41] thx!
[09:13:33] (SystemdUnitFailed) resolved: netbox_ganeti_esams01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:36] uh? netbox_ganeti_esams01_sync.service is failing on netbox1002 with requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='ganeti01.svc.esams.wmnet', port=5080): Max retries exceeded with url: /2/groups?bulk=1
[09:13:50] it resolved
[09:14:07] we applied a new sandbox VLAN on esams01 earlier
[09:14:25] for the Atlas probes running in a Ganeti VM instead of the appliances
[09:14:55] mmmh the unit is still failed on the host
[09:15:08] and $ telnet 10.80.1.18 443
[09:15:10] Trying 10.80.1.18...
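Stepping back to the change-on-every-run issue: the git::clone guard quoted at [09:03:10] behaves roughly as below. A sketch reconstructed from the quoted unless condition; the checkout path is hypothetical and HEAD stands in for ${ref_to_check}.

```
# The exec's `unless` only skips the pull when BOTH commands succeed:
cd /srv/cas-overlay-template          # hypothetical checkout path
git fetch --tags --prune --prune-tags \
  && git diff --quiet HEAD            # non-zero if the working tree is dirty
# A locally modified file (here the debian/changelog written by the
# build) makes `git diff --quiet` fail, so the pull runs on every
# puppet run even with no new upstream commits. Resetting the local
# modifications, as done above, makes the guard pass again:
git checkout -- debian/changelog      # or: git reset --hard HEAD
```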
[09:18:33] (SystemdUnitFailed) firing: netbox_ganeti_esams01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:33] at first sight it doesn't seem to me that netbox1002 is able to reach ganeti01.svc.esams.wmnet at the moment (sorry, the above paste was with the wrong port, I tried 5080 too)
[09:20:15] and the check for ganeti RAPI is firing too, so it's not related to netbox specifically
[09:20:31] https://alerts.wikimedia.org/?q=alertname%3DHTTPS%20Ganeti%20RAPI%20esams
[09:59:49] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) Actually, after having experimented with supporting both OpenSearch and Elasticsearch in spicerack with local experiments, we've decided to put a pin...
[10:22:15] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) 05Open→03Declined
[10:22:43] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10Volans) @brouberol thanks for the summary and update! Curator is the dependency that mostly creates issues and I think it would be great if we will plan for a pa...
[11:03:27] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) 05Open→03Declined I merged the patch above and cleaned up the SRE cookbooks from cloudcumin[1-2]...
[11:10:53] volans: fixed, this needed a restart of the Ganeti daemons on the master node
[11:11:39] oh, too bad :( thanks for fixing and sorry for the trouble
[11:11:53] was it related to the earlier change?
[11:13:33] (SystemdUnitFailed) resolved: netbox_ganeti_esams01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:42] yeah, this was needed as a followup to adding the new VLAN, I'll add a note to the wikitech ganeti docs later
[11:17:31] ack thx
[11:31:49] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) 05In progress→03Resolved a:03jbond >>! In T345909#9155302, @fgiunchedi wrote: > Looks like the alert is...
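For the record, checking RAPI reachability and applying the fix described in this incident would look something like the following. A sketch: the endpoint and port come from the ConnectTimeout traceback, but the systemd unit name is an assumption and may differ per setup.

```
# Check whether the Ganeti RAPI endpoint answers at all (port 5080,
# from the traceback pasted earlier):
nc -vz -w 5 ganeti01.svc.esams.wmnet 5080
curl -sk --max-time 5 "https://ganeti01.svc.esams.wmnet:5080/2/groups?bulk=1"
# The fix: restart the Ganeti daemons on the master node (on Debian
# they are typically wrapped in a single ganeti.service; assumption):
sudo systemctl restart ganeti.service
sudo gnt-cluster verify   # confirm the cluster is healthy afterwards
```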
[11:32:44] volans: https://github.com/NLnetLabs/routinator/issues/889#issuecomment-1713692906
[11:33:09] moritzm: https://github.com/NLnetLabs/routinator/issues/880#issuecomment-1713695995 :(
[11:36:03] XioNoX: we should be able to add nr_inodes=2M to fstab
[11:46:30] jbond: of course it's not tested, but I sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/956411
[13:01:53] 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10jbond) 05Open→03In progress p:05Triage→03Medium
[14:11:26] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10colewhite) Linking my comment here for visibility: T345337#9150551
[14:51:27] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) I'm running into a similar issue while reimaging `cloudnet2005-dev.codfw.wmnet` to Bookworm. ` fnegri@cumin1001:~$ sudo cookbook sre.hosts.re...
[15:25:42] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[16:22:44] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) Ignore my previous comment, this turned out to be a one-off issue with the reimage cookbook. Restarting the cookbook a second time, it worked...
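The nr_inodes bump suggested at [11:36:03] would look roughly like the fstab entry below. A sketch: the options other than nr_inodes are assumptions based on the earlier df output, and the actual change went through puppet in the gerrit patch linked above.

```
# /etc/fstab entry for the routinator cache: tmpfs defaults to a fairly
# low inode limit, so raise it alongside the 4G size cap.
tmpfs /var/lib/routinator/repository tmpfs size=4g,nr_inodes=2m 0 0
```

Remounting should apply the new options without a reboot, e.g. `sudo mount -o remount /var/lib/routinator/repository`.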