[07:01:39] XioNoX: pm?
[07:04:18] RhinosF1: what's up?
[07:05:52] moritzm, topranks: wrt the failed decom last week, my suggestion would have been to just retry the decom cookbook. It's meant to be idempotent, so re-running it right away shouldn't be a problem and should take care of any missed step in case of a transient failure
[07:06:16] it's correct that a 255 exit from clustershell usually means an ssh connection failed or was interrupted
[07:09:07] yeah, we'll see when flerovium is ready to be taken down
[07:10:29] ack, lmk if you want me to debug it
[07:12:17] ack
[08:06:48] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10fgiunchedi) Looks like the alert is working as expected: https://alerts.wikimedia.org/?q=%40sta...
[08:21:28] XioNoX, topranks: when you have a chance could you re-arm the homer key on cumin2002 please?
[08:22:22] 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10fgiunchedi)
[08:23:45] (SystemdUnitCrashLoop) firing: routinator.service crashloop on rpki2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:28:21] Sep 11 08:27:55 rpki2002 routinator[1667211]: Fatal: failed to open file /var/lib/routinator/repository/rrdp/rrdp.ripe.net/21d6592469dbe79feb2922562764fd193170f173229298b9a4443ffb5c282000/tmp/rpki.ripe.net/repository/DEFAULT/ws5Z0a-DDJS6jqEc-P7G3ZwDuRc.cer: No space left on device (os error 28)
[08:28:21] Sep 11 08:27:56 rpki2002 routinator[1667211]: Fatal error. Exiting.
[08:28:42] sad_trombone.wav
[08:28:57] it's tmpfs https://phabricator.wikimedia.org/T300955
[08:30:10] they have 6GB of RAM now, correct?
[08:30:27] tmpfs 4.0G 2.3G 1.8G 56% /var/lib/routinator/repository
[08:30:28] how much would they need? we're going towards the high end of RAM for a VM
[08:30:37] they have enough, it seems
[08:30:52] probably not during the update, it might double the data
[08:30:58] "but giving it 4GB will allow ample margin for future growth"
[08:31:02] and then remove the old one
[08:31:24] unless we/it are/is not cleaning some old cruft
[08:31:34] yeah
[08:31:39] how do I clear that cleanly?
[08:31:47] just delete everything under that path?
[08:31:58] or is there something specific to tmpfs?
[08:32:21] it's a filesystem
[08:32:57] so rm -rf works just fine?
[08:33:02] yep
[08:33:45] (SystemdUnitCrashLoop) resolved: routinator.service crashloop on rpki2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:34:13] alright, restarting and monitoring it
[08:35:16] slyngs: FYI idp-test1002 runs a git pull of cas-overlay-template at every puppet run, making a change on every run
[08:37:18] that's expected, that's the place where we build the cas debs (and the cas-overlay-template repo is the source for it)
[08:37:35] but I doubt it changes every time
[08:37:42] the repo
[08:37:54] fully initialized: tmpfs 4.0G 2.1G 2.0G 51% /var/lib/routinator/repository
[08:38:16] XioNoX: I bet it doubles when updating the data, so needing 4.2...
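For reference, the cleanup agreed on above amounts to something like the following. A sketch: the unit name routinator.service and the mount point come from the pasted log, but the exact sequence and flags are assumptions.

```
# Stop routinator, clear the tmpfs-backed cache, restart, and watch it
# re-sync. tmpfs behaves like any other filesystem, so rm -rf is enough.
sudo systemctl stop routinator.service
sudo rm -rf /var/lib/routinator/repository/*
df -h /var/lib/routinator/repository    # confirm the space was freed
sudo systemctl start routinator.service
sudo journalctl -fu routinator.service  # watch for further ENOSPC errors
```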
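On the exit-code remark at [07:06:16]: ssh reserves status 255 for its own errors, and clustershell reports that per-node return code, which is why 255 usually means the connection failed rather than the remote command. A minimal way to see this, using a deliberately unresolvable hostname:

```
# ssh exits 255 on its own failures (resolution, refused connection,
# timeout, interruption), distinct from the remote command's status.
ssh -o ConnectTimeout=2 unreachable.invalid true
echo "ssh exit: $?"   # prints 255 on a connection failure
```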
[08:39:22] moritzm: cumin has git::clone with ensure => 'latest' but doesn't make a change at every puppet run, only if there are new commits to pull
[08:39:38] I didn't check what the idp does in puppet
[08:40:45] Is it git or the rsync that causes the change?
[08:41:01] We do a git clone and then an rsync
[08:41:17] moritzm: fyi - https://github.com/NLnetLabs/routinator/issues/880
[08:41:40] slyngs:
[08:41:44] https://puppetboard.wikimedia.org/report/idp-test1002.wikimedia.org/eef7cb484c2263303c8c5f877b55159b7e94c72c
[08:41:48] volans: keyholder armed
[08:41:57] thx!
[08:43:31] volans: I'll take a look, because we do the same with the IDM and deployments and that doesn't cause a change on every run
[08:43:33] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:49] volans: https://github.com/NLnetLabs/routinator/issues/889
[08:51:55] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) In terms of the LVS connections from rows C and D, when we move from old switches to new ones we need to land those on the Spines rather t...
[08:53:03] XioNoX: ack, let's see what they say
[08:55:53] moritzm: on a semi-related note, sretest1001 is making a change at every run, something related to ferm it seems. By any chance are you doing some tests there?
[08:55:56] https://puppetboard.wikimedia.org/report/sretest1001.eqiad.wmnet/6bb8004d8c9799f4a25b0104effecf65f758210a
[08:56:36] for the ferm/nftables migration
[08:56:55] yeah, this is currently being moved to nftables, WIP
[08:57:07] XioNoX: oh, nice
[08:58:24] ack thx
[09:00:12] slyngs: ah, I just noticed that if there are local modifications then the git pull is attempted at every run
[09:00:18] maybe that's the case
[09:01:12] Yeah, I think it's the build process. So maybe doing a checkout in /srv/cas-checkout, then notifying and copying the files to the build directory, would avoid the issue
[09:03:10] possibly, the check of git::clone is:
[09:03:10] unless => "${git} fetch --tags --prune --prune-tags && ${git} diff --quiet ${ref_to_check}"
[09:05:40] Okay, so it adds to the debian/changelog and that's causing the problem
[09:06:10] we can just reset the local checkout, let me see what the diff is
[09:06:35] That will work too
[09:07:56] fixed, double-checking with a Puppet run
[09:12:41] thx!
[09:13:33] (SystemdUnitFailed) resolved: netbox_ganeti_esams01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:36] uh? netbox_ganeti_esams01_sync.service is failing on netbox1002 with requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='ganeti01.svc.esams.wmnet', port=5080): Max retries exceeded with url: /2/groups?bulk=1
[09:13:50] it resolved
[09:14:07] we applied a new sandbox VLAN on esams01 earlier
[09:14:25] for the Atlas probes running in a Ganeti VM instead of the appliances
[09:14:55] mmmh the unit is still failed on the host
[09:15:08] and $ telnet 10.80.1.18 443
[09:15:10] Trying 10.80.1.18...
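Stepping back to the change-on-every-run issue: the git::clone guard quoted at [09:03:10] behaves roughly as below. A sketch reconstructed from the quoted unless condition; the checkout path is hypothetical and HEAD stands in for ${ref_to_check}.

```
# The exec's `unless` only skips the pull when BOTH commands succeed:
cd /srv/cas-overlay-template          # hypothetical checkout path
git fetch --tags --prune --prune-tags \
  && git diff --quiet HEAD            # non-zero if the working tree is dirty
# A locally modified file (here the debian/changelog written by the
# build) makes `git diff --quiet` fail, so the pull runs on every
# puppet run even with no new upstream commits. Resetting the local
# modifications, as done above, makes the guard pass again:
git checkout -- debian/changelog      # or: git reset --hard HEAD
```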
[09:18:33] (SystemdUnitFailed) firing: netbox_ganeti_esams01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:33] at first sight it doesn't seem to me that netbox1002 is able to reach ganeti01.svc.esams.wmnet at the moment (sorry, the above paste was with the wrong port, I tried 5080 too)
[09:20:15] and the check for ganeti RAPI is firing too, so it's not related to netbox specifically
[09:20:31] https://alerts.wikimedia.org/?q=alertname%3DHTTPS%20Ganeti%20RAPI%20esams
[09:59:49] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) Actually, after having experimented with supporting both OpenSearch and Elasticsearch in spicerack with local experiments, we've decided to put a pin...
[10:22:15] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) 05Open→03Declined
[10:22:43] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10Volans) @brouberol thanks for the summary and update! Curator is the dependency that mostly creates issues and I think it would be great if we will plan for a pa...
[11:03:27] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) 05Open→03Declined I merged the patch above and cleaned up the SRE cookbooks from cloudcumin[1-2]...
[11:10:53] volans: fixed, this needed a restart of the Ganeti daemons on the master node
[11:11:39] oh, too bad :( thanks for fixing and sorry for the trouble
[11:11:53] was it related to the earlier change?
[11:13:33] (SystemdUnitFailed) resolved: netbox_ganeti_esams01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:42] yeah, this was needed as a followup to adding the new VLAN, I'll add a note to the wikitech ganeti docs later
[11:17:31] ack thx
[11:31:49] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) 05In progress→03Resolved a:03jbond >>! In T345909#9155302, @fgiunchedi wrote: > Looks like the alert is...
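For the record, checking RAPI reachability and applying the fix described in this incident would look something like the following. A sketch: the endpoint and port come from the ConnectTimeout traceback, but the systemd unit name is an assumption and may differ per setup.

```
# Check whether the Ganeti RAPI endpoint answers at all (port 5080,
# from the traceback pasted earlier):
nc -vz -w 5 ganeti01.svc.esams.wmnet 5080
curl -sk --max-time 5 "https://ganeti01.svc.esams.wmnet:5080/2/groups?bulk=1"
# The fix: restart the Ganeti daemons on the master node (on Debian
# they are typically wrapped in a single ganeti.service; assumption):
sudo systemctl restart ganeti.service
sudo gnt-cluster verify   # confirm the cluster is healthy afterwards
```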
[11:32:44] volans: https://github.com/NLnetLabs/routinator/issues/889#issuecomment-1713692906
[11:33:09] moritzm: https://github.com/NLnetLabs/routinator/issues/880#issuecomment-1713695995 :(
[11:36:03] XioNoX: we should be able to add nr_inodes=2M to fstab
[11:46:30] jbond: of course it's not tested, but I sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/956411
[13:01:53] 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10jbond) 05Open→03In progress p:05Triage→03Medium
[14:11:26] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10colewhite) Linking my comment here for visibility: T345337#9150551
[14:51:27] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) I'm running into a similar issue while reimaging `cloudnet2005-dev.codfw.wmnet` to Bookworm. ` fnegri@cumin1001:~$ sudo cookbook sre.hosts.re...
[15:25:42] 10netops, 10Cloud-VPS, 10Infrastructure-Foundations, 10SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[16:22:44] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) Ignore my previous comment, this turned out to be a one-off issue with the reimage cookbook. Restarting the cookbook a second time, it worked...
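The nr_inodes bump suggested at [11:36:03] would look roughly like the fstab entry below. A sketch: the options other than nr_inodes are assumptions based on the earlier df output, and the actual change went through puppet in the gerrit patch linked above.

```
# /etc/fstab entry for the routinator cache: tmpfs defaults to a fairly
# low inode limit, so raise it alongside the 4G size cap.
tmpfs /var/lib/routinator/repository tmpfs size=4g,nr_inodes=2m 0 0
```

Remounting should apply the new options without a reboot, e.g. `sudo mount -o remount /var/lib/routinator/repository`.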