[02:04:21] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:39] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:49:22] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:43:19] slyngs: thanksss [06:48:32] elukey: good morning ! thinking about it, with Netbox 4 it's not possible to git cherry pick a pending script/report CR [06:51:39] Possible other options : 1/ scp the file to /srv/netbox/customscripts, 2/ merge it in the dev branch and add it as a new "data source" in Netbox-next [06:52:37] 3/ use the "file upload" in the "add script" page (probably the easiest option) [06:57:51] XioNoX: o/ ack thanks! I think I'll brutally copy the file manually [06:57:55] in -next of course [06:58:18] elukey: try the file upload, it's probably better/cleaner? [06:58:26] but also less tested :) [06:58:35] ahhhhh you want me to test it! [06:58:39] right ok I'll do it :) [06:59:09] elukey: we're good to try a new round of Netbox upgrade btw [06:59:31] let me know if you have some time to assist this morning [07:00:10] elukey: also don't forget to update your local repo to pickup https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1058208 [07:00:14] XioNoX: this morning I am a bit busy, would it be ok in the afternoon? [07:01:00] sure, ideally before dcops starts working [07:33:50] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10030695 (10ayounsi) a:03Papaul [07:54:22] Hi folks. I'd like to have s3cmd installed on the cumin nodes, please [useful for testing S3 services e.g. thanos-swift and the new apus Ceph cluster]. Is that OK? Do you have a preferred way of getting this done? [07:56:05] Hi! I don't have anything against it, but I don't have a solid grasp of what is the policy for cumin nodes. If it is a matter of adding credentials + s3cmd via a custom profile, it should be fine [07:56:10] Cc: volans: --^ [07:58:17] I wasn't planning on installing credentials (just making them in my ~ as needed), I'd just like the CLI tool available [07:59:38] (yes, I could just stick the binary in ~, but that seems hacky) [08:02:08] Is it worth a task or documentation on wikitech? [08:09:56] Emperor: my preference would be to have a tool available for everybody, to avoid a proliferation of ~/.something in various home dirs (easier to miss right perms etc..) 
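(For concreteness, here is a rough sketch of the sort of check being described: driving s3cmd from a short Python script to confirm an S3 endpoint answers and that an object written via one DC's endpoint can be read back via the other. The endpoint names, bucket and config path are placeholders rather than real service names, and replication may need a short wait before the read-back succeeds.)

```python
# Illustrative sketch only: endpoints, bucket and config path are placeholders.
import subprocess

CONFIG = "/etc/s3cmd/test.cfg"        # hypothetical rendered credentials file
EQIAD = "apus.svc.eqiad.wmnet"        # placeholder DC-specific endpoints
CODFW = "apus.svc.codfw.wmnet"
BUCKET = "s3://connectivity-test"     # placeholder bucket


def s3cmd(endpoint, *args):
    """Run s3cmd against a given endpoint and return its stdout."""
    cmd = ["s3cmd", f"--config={CONFIG}", f"--host={endpoint}",
           f"--host-bucket={endpoint}", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Basic reachability/auth check against each endpoint.
for endpoint in (EQIAD, CODFW):
    print(endpoint, s3cmd(endpoint, "ls"))

# Write via one DC's endpoint, read back via the other to exercise replication
# (in practice a retry/wait may be needed before the object appears remotely).
with open("/tmp/probe.txt", "w") as fh:
    fh.write("probe\n")
s3cmd(EQIAD, "put", "/tmp/probe.txt", f"{BUCKET}/probe.txt")
print(s3cmd(CODFW, "get", "--force", f"{BUCKET}/probe.txt", "/tmp/probe-back.txt"))
```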
[08:16:13] If you'd rather I put in a phab ticket for "please install s3cmd on cumin nodes" I can do so [08:17:38] Essentially, I want to do some testing of the new apus cluster's S3 api, and would like the s3cmd tool available to do so, ideally with minimal faff :) [08:23:41] elukey: I've commented in -sre already :) [08:23:58] for the tool, as for the credentials, having puppet install them would be ideal indeed [08:37:19] is there an existing profile/class that could have one more package added to its install_packages list? A whole new one for one package feels like overkill... [08:39:14] given the multi-role nature of the cumin hosts, their role is a list of included profiles, one for each different use case, see role::cluster::management [08:57:36] this is sounding like a lot of hassle, and I should just stick the binary in ~ for now unless I end up needing a bunch of credentials &c permanently available [09:03:21] Emperor: sorry but no, cumin hosts should not have personal venvs or software installed randomly at all! And all deb packages should be installed via puppet. A puppet profile for what you need takes 5 minutes to write. If this is purely testing it should be done on the test hosts of the service and when ready be properly set up for production. [09:45:20] Emperor: +1 to what volans said [10:15:13] like this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058575 [10:30:02] Emperor: I personally don't agree with having s3cmd without credentials for everybody to use, since you can easily run it from stat10XX nodes for example [10:30:32] I think that cumin nodes should allow people with use cases like "I want to fetch data from thanos swift" to not worry about where to get credentials [10:30:47] otherwise cumin nodes become "test" nodes, and they shouldn't be [10:30:53] at least this is my view :) [10:31:13] if your requirement is just to have s3cmd available, maybe stat10xx could be good enough? [10:32:17] Be ready for a Netbox 4.1 release with new UI as soon as we upgrade to 4.0.x - https://github.com/netbox-community/netbox/issues/16907 [10:34:05] what's the timeline XioNoX ? if it's close it might be wise to pile the upgrades to reduce the adapt-to-new-ui trauma for everyone? [10:34:27] I fully agree with what Luca said above wrt the s3 client [10:34:32] I don't know anything about the statxxx nodes, but there aren't any in codfw, right? [10:34:39] volans: timeline is the day after we upgrade, can't escape it [10:35:02] volans: but also I prefer to minimize the upgrade trauma rather than the new UI trauma, sorry [10:35:17] that ship has sailed already [10:35:20] sorry :D [10:35:46] without wishing to grumble too much, if "no, you can't use cumin nodes for test utilities" is policy, where can I use testing utilities from in both DCs? Given I have S3 endpoints in both and need to check both work [10:36:14] [I thought I'd started out with the "because I need to test things" requirement, too?] [10:37:25] do they need to have special permission or acls in the network? [10:38:26] Emperor: nope only eqiad, this is the only downside (I thought that cross-dc with TLS was an option) [10:39:18] anyway, the cumin nodes can be used for tests, but in a more structured way - I was just suggesting to render the s3-cmd credentials somewhere under /etc [10:39:19] volans: no, just to be able to reach the apus endpoints (port 443) [10:39:31] we do the same on stat100x nodes btw, for example to access thanos swift for ML [10:39:44] (lunch, will read in a bit) [10:39:53] is there a test cumin host?
[10:40:04] I don't currently have any general-purpose apus account credentials (and may never do so) [10:40:29] XioNoX: ofc not :D [10:43:01] and I want clients in both DCs (since the inter-DC replication is one of the things I want to test) [10:52:14] topranks: netbox 4.1 sneak peek: https://github.com/netbox-community/netbox/issues/7025 [10:53:21] (FTR, if we end up with some permanent set of apus credentials it would be useful to make generally available, templating them out would be sensible) [11:18:21] XioNoX: interesting feature! [11:18:30] could definitely see how it'd be useful, although it requires some thought [11:20:22] yeah, what's redundant and at which scale and to do what with it. It could be useful for example with the maintenance email parser :) To alert if 2 redundant links are going to go down [11:22:22] yeah that's the perfect use-case for it [12:10:39] Emperor: okok thanks for giving us more details.. Just one question - using s3cmd from stat10xx nodes (so eqiad only) is not a viable option because of the cross dc calls? [12:17:45] elukey: indeed [12:18:29] particularly one of the things I want to test (as well as using the DC-explicit hostnames) is r/w to/from the discovery record from both DCs [12:18:39] okok got it [12:25:18] Emperor: from puppet it seems that role::mediabackup::worker could fit your use case, it has s3cmd in both dcs and it should be owned by DP [12:25:19] elukey: Netbox DB converted and imported, running the deploy cookbook on 1003/2003 [12:25:26] ack! [12:25:54] elukey: can you change the discovery record? [12:26:14] should we do it after the deploy cookbook finishes and we verify that all is good? [12:27:02] elukey: ah, OK, yes, I think I can make that work. I'll abandon my CR [12:27:48] Emperor: if they don't work, let's revisit the cumin option, we can add s3cmd for a while and unblock you (but we'll need to find a better and more permanent solution in the long run) [12:28:41] elukey: sure, I can do some small testing with my tunnel, but we will soon need in-production testing [12:28:59] deployed on 2003, one small error (which surprisingly doesn't happen on -dev) [12:29:27] XioNoX: ack let's do it, and flip only after the testing [12:29:39] even if it is little etc.. it will give us some confidence [12:31:18] are you saying something could go wrong? :) [12:31:36] netbox 4 doesn't like us :D [12:34:49] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10031670 (10ops-monitoring-bot) Deployed netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275 [12:37:46] deploy done, now the testing [12:42:43] elukey: script works :) [12:42:50] \o/ [12:43:27] ready for prime time? [12:44:16] elukey: yeah, we're better than where we were last time [12:44:38] we can flip discovery while I test all the other ones [12:45:08] ok proceeding [12:47:03] puppet running on dns nodes [12:50:54] all the important scripts and reports have been tested successfully [12:54:22] once discovery is deployed, we can release the new homer [12:54:37] should be done [12:55:15] each refresh I alternate between the two :) [12:55:23] so I guess a few more minutes [12:58:27] works for me now [12:58:33] same! [13:00:16] * elukey meeting, will read in a bit [13:01:44] also seems ok for me :) [13:06:58] great, now it's homer's turn to have issues releasing its new version...
[13:07:00] "Your build configuration is incomplete and previously worked by accident!" [13:08:44] I know I shouldn't laugh..... [13:09:02] don't worry I'm also laughing :) [13:09:06] lol [13:09:33] if I can help let me know, I expect it's beyond me if you can't figure it out though [13:09:51] https://www.irccloud.com/pastebin/rF5xlt66/ [13:10:04] so it's something to do with the docker env used to build the wheels [13:10:12] I also should have done that sooner but totally forgot [13:10:45] last time I had to do this I hit some issues with the docker env too, permission things I think [13:11:11] I hate to say it but this is possibly the most bettercallvolans thing I've ever seen [13:11:16] no prizes for guessing who sorted it out for me [13:11:19] haha yeah :) [13:12:00] Can you upgrade setuptools? [13:12:05] eh, I think I solved it, but dunno if the fix is legit or not [13:13:15] You can just apply this https://usercontent.irccloud-cdn.com/file/FYqi6gOJ/works.jpg [13:13:29] if someone can review https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1058600 [13:13:41] the fix is the tiny change in Dockerfile.build [13:25:28] back! [13:27:18] homer 0.7.0 deployed but there are issues [13:30:02] two more breaking changes that either were not documented or I didn't see [13:30:52] one is that when querying the interfaces from the master node of a virtual-chassis, it doesn't return the interfaces of the other members anymore (eg. https://netbox.wikimedia.org/dcim/interfaces/?device_id=614 ) [13:32:07] the other one is how data coming back from scripts is structured, I think [13:32:11] https://www.irccloud.com/pastebin/kJmDY8yQ/ [13:35:30] the 2nd one seems like a bug in the library, dunno if someone can confirm or not [13:38:18] it seems likely, url_path being a binary object and not a str [13:40:12] we could add some logging to https://github.com/netbox-community/pynetbox/blob/master/pynetbox/core/response.py#L429 [13:40:21] I have a fix for the first issue at least, working on it [13:51:20] XioNoX: what homer command did you run for https://www.irccloud.com/pastebin/kJmDY8yQ/ ? [13:51:35] I added a hacky print statement, I want to see if we can check something [13:52:59] elukey: `homer cr3-ulsfo* diff` on cumin2002 for example [13:55:46] INFO:homer:Generating configuration for cr3-ulsfo.wikimedia.org [13:55:49] PATH: /api/extras/scripts/1/ of type <class 'str'> [13:55:52] PATH: /api/core/jobs/53166/ of type <class 'str'> [13:55:54] PATH: /api/users/users/6/ of type <class 'str'> [13:55:57] PATH: b'' of type <class 'bytes'> [13:56:04] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1058609 this fixes the first issue (tested manually) [13:57:01] URL: None URL_PATH: b'' of type <class 'bytes'> [13:57:06] ok yeah something strange [13:57:25] _endpoint_from_url is called with an empty pat [13:57:28] *path [13:58:22] checking a quick fix [13:58:54] Changes for 1 devices: ['cr3-ulsfo.wikimedia.org'] [13:58:54] # No diff [13:58:54] --------------- [13:58:54] INFO:homer:Homer run completed successfully on 1 devices: ['cr3-ulsfo.wikimedia.org'] [13:59:00] nice! [13:59:26] I was looking at pynetbox and netbox changes/issues but nothing has been reported at least [13:59:32] sort of, the issue is in pynetbox, we cannot really patch it right?
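(A sketch of one way to work around the first breaking change — the virtual-chassis master no longer returning the other members' interfaces — is to query each member device explicitly. This only illustrates the idea and is not the actual change in the homer CR above; the Netbox URL and token are placeholders.)

```python
# Illustration only -- not the actual homer fix. URL and token are placeholders.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="XXXX")


def all_vc_interfaces(device_name):
    """Return a device's interfaces, including all virtual-chassis members."""
    device = nb.dcim.devices.get(name=device_name)
    if device.virtual_chassis is None:
        return list(nb.dcim.interfaces.filter(device_id=device.id))
    # Netbox 4 no longer folds the other members' interfaces into a query on
    # the master's device_id, so gather every member of the chassis explicitly.
    members = nb.dcim.devices.filter(virtual_chassis_id=device.virtual_chassis.id)
    interfaces = []
    for member in members:
        interfaces.extend(nb.dcim.interfaces.filter(device_id=member.id))
    return interfaces
```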
[13:59:50] 10netops, 06Infrastructure-Foundations, 06SRE: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501 (10cmooney) 03NEW p:05Triage→03Low [14:00:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:43] not cleanly, but they're reactive upstream, so might be worth keeping the local workaround [14:01:24] until we can rebuild the wheels with the updated version [14:02:29] ok so lemme file a patch so we can reason about it [14:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:27] so on cumin2002 homer is working well on all the devices I tested [14:09:23] next would be to deploy the cookbook changes once we're good with Homer [14:09:38] checking another thing in the code [14:09:43] no rush [14:11:06] XioNoX: I am going to re-break the venv removing my patch, I need to add another logging in another piece of the code [14:11:18] no pb [14:12:59] interesting.. so in the stacktrace, __init__ calls self._endpoint_from_url(values["url"]) [14:13:06] and in the "breaking" case, this is values [14:13:07] Values: {'obj': None, 'url': None, 'time': '2024-07-31T12:41:33.339876+00:00', 'status': 'success', 'message': 'Generated successfully, see the output tab for result.'} [14:15:25] no idea how to interpret that [14:19:09] https://github.com/elukey/pynetbox/commit/155e2050ae09b538f85c0217307e81af8446f7ee [14:19:16] this is the current fix [14:19:17] from there https://netbox.wikimedia.org/api/extras/scripts/1/ it's like the log line, and not the actual result [14:19:21] better than the one before [14:19:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:41] XioNoX: do you mind to re-test homer on cumin2002? 
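(Roughly, the shape of the workaround linked above: tolerate result objects whose "url" is None instead of letting _endpoint_from_url() choke on an empty path. This is a paraphrase for illustration, not the literal patch.)

```python
# Rough paraphrase of the idea behind the workaround, not the literal patch.

def endpoint_from_values(record, values):
    """Derive the endpoint for a pynetbox Record, skipping url-less objects.

    Script/job results in Netbox 4 can come back with 'url': None (as in the
    values dict pasted above), and calling record._endpoint_from_url(None or "")
    crashes while parsing the empty path.
    """
    url = values.get("url") if values else None
    if not url:
        return None
    return record._endpoint_from_url(url)
```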
[14:19:43] elukey: that makes sense [14:20:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:17] elukey: all good on all 4 ulsfo devices (no diff) [14:22:57] https://github.com/netbox-community/pynetbox/pull/632 [14:23:32] I am a little hesitant to leave homer as is with live patching [14:23:46] but I guess we can live with it for a bit, if you are confident that upstream will answer [14:25:17] yeah, their last release was a month ago, and nobody other than us will do the next homer release [14:26:35] elukey: next is https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1050445/ commit message still says not tested, but it was tested as much as possible (not all were fully testable) [14:26:54] my worry is that we re-deploy the venv and we forget about the fix, but we can live with it, not ideal but a complete rollback is worse at this point [14:27:27] even if that happens, we will quickly remember where the fix is :) [14:27:58] XioNoX: at this point we need to merge and possibly dry-run those, to have a vague sense of safety [14:28:04] (the new cookbooks I mean) [14:28:08] elukey: yeah, that's the plan [14:28:19] let's do it [14:30:01] elukey: running puppet on cumin2002 to pick up the new cookbooks, and I disabled puppet on 1002 to roll back more easily if needed [14:31:23] okok [14:32:16] I also need to run the provision cookbook, will test one [14:33:27] running now [14:34:32] so far decom' works fine [14:37:02] all good for decom in dry-run [14:38:49] provision looks good too (still in progress) [14:39:02] running reimage as we speak [14:40:11] back in 10 mins [14:41:18] re-image failed with "RuntimeError: New OS is bullseye but bookworm was requested" [14:41:25] so probably not related [14:43:42] and now (when setting --os bullseye) "spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are:" [14:49:18] elukey: for when you're back, one small bug: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1058618 [14:50:24] provision worked! [14:50:28] nice! [14:51:14] +1ed [14:51:35] elukey: not urgent but also small addition to that one https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1056989/2..3 [14:53:29] why do we need to sudo in there? [14:53:53] elukey: probably don't, just to run it as that other user [14:54:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:38] XioNoX: if it generates a permission issue, let's change it, otherwise I think we can skip it, no? [14:54:55] elukey: `runuser` is the command needed?
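(For reference, runuser takes the form `runuser -u <user> -- <command>`. A hedged sketch of how a cookbook might use it so the copied custom-script file ends up owned by the service user rather than root — the Cumin alias, user name and paths are assumptions, not the actual cookbook code:)

```python
# Sketch only: alias, user and paths are assumptions, not the real cookbook code.
from spicerack import Spicerack


def copy_custom_script(spicerack: Spicerack, filename: str) -> None:
    """Copy a Netbox custom script into place as the service user via runuser."""
    netbox_hosts = spicerack.remote().query("A:netbox")  # hypothetical Cumin alias
    netbox_hosts.run_sync(
        f"runuser -u netbox -- cp /tmp/{filename} /srv/netbox/customscripts/{filename}"
    )
```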
[14:55:04] elukey: yeah it will cause permissions issues [14:55:33] when updating the scripts using the UI, it will complain if the file is owned by root [14:55:42] makes sense then okok [14:56:34] re: runuser, never seen it but it seems so yes [14:56:55] ah TIL /usr/sbin/runuser on cumin nodes [14:57:11] XioNoX: --^ [14:57:49] yeah I saw it in some cookbooks, that's the only reason I'm mentioning it :) [14:58:04] let's use it [14:58:15] CR updated [14:59:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:34] sync netbox hiera works fine [15:00:42] it doesn't crash, but it tries to remove all the VMs now [15:02:17] not great :D [15:03:09] who needs VMs anyway? [15:03:11] * topranks hides [15:03:35] how does it work? Is the cookbook committing to the puppet repo? [15:04:02] okok it uses reposync [15:04:38] elukey: yeah it runs the "VM_LIST_GQL" query with the parameters ['active', 'failed'] [15:09:34] If I manually run the query the hosts that the cookbook wants to remove still show up https://usercontent.irccloud-cdn.com/file/YJ99hCiJ/Screenshot%202024-07-31%20at%2017-08-47%20GraphiQL%20NetBox.png [15:09:54] ohhh [15:09:58] https://www.irccloud.com/pastebin/i6QVUvcH/ [15:10:15] but it's not capitalized in the output [15:10:48] we really need to invest time into a complete testing pipeline for netbox [15:11:19] elukey: 100% agreed :) [15:11:34] I updated the cookbook manually and it fixes the issue [15:11:52] I'm also wondering why there is this condition, while we filter them ahead of time [15:13:04] does seem to be useless given it's in the query already [15:14:27] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1058625 [15:14:48] elukey ^ topranks ^ [15:15:36] there are lots of thing we need to do after the upgrade to clean it all up properly, and document it [15:15:38] XioNoX: the comment is a bit dense to read, what do you mean with it? [15:16:10] elukey: a reminder that the if right under might be useless [15:16:14] XioNoX: might it be better to have if host['status'].lower() == [15:16:19] lgtm anyway [15:16:56] topranks: yeah, I went for the easiest fix for now :) [15:17:07] XioNoX: let's write in like "check if the 'if' block below is useless as we etc.. [15:17:13] *it [15:17:21] SURE [15:17:29] damn I need .lower() myself :) [15:17:30] +1 [15:17:47] * elukey scared by topranks shouting :D [15:18:08] hahaha [15:18:26] topranks: line 343 there is already a 'status': host['status'].lower() and that's what we do for baremetal too [15:18:40] so at least the issue won't show up later down the pipeline [15:19:22] ah [15:19:23] cool [15:20:00] elukey: CR updated [15:20:21] lgtm [15:20:38] waiting for CR then will deploy [15:20:51] and then re-enable puppet on cumin1002 [15:21:13] I added it to the pile of things to do after the upgrade :) [15:23:14] anything left after that? :) [15:23:58] elukey: did you do your venv modification on both cumin hosts for homer ? [15:23:58] announcement ? [15:24:18] answer to the "anything left?" btw ^ [15:24:28] XioNoX: nope, lemme do it [15:24:59] elukey: can you do that one too ? [15:25:01] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1058609 [15:25:32] it's in the homer_plugins path in the venv [15:25:49] but this one is in our control, we should create a new release [15:25:52] no? 
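(A small sketch of the status-case issue discussed above: the GraphQL query in Netbox 4 returns status values in a different case than the cookbook compared against, so both the comparison and anything written out from host['status'] need to be normalised with .lower(). Hostnames below are made up.)

```python
# Sketch of the case normalisation discussed above; hostnames are made up.
WANTED_STATUSES = {"active", "failed"}


def keep_host(host):
    """Keep a VM only if its status matches, regardless of GraphQL casing."""
    return host["status"].lower() in WANTED_STATUSES


vms = [
    {"name": "vm1001", "status": "ACTIVE"},            # Netbox 4 GraphQL casing
    {"name": "vm2001", "status": "active"},            # what the old code expected
    {"name": "vm3001", "status": "DECOMMISSIONING"},
]
print([vm["name"] for vm in vms if keep_host(vm)])      # ['vm1001', 'vm2001']
```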
[15:26:09] elukey: we can, I didn't want to risk breaking your venv change [15:26:16] elukey: but yeah I can deploy it [15:26:59] XioNoX: no problem I'll re-apply it [15:27:05] one hack is already enough :D [15:27:10] elukey: fair :) [15:29:14] deploying [15:30:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:31:26] elukey: deployed and tested, all yours [15:32:37] fixing [15:34:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:49] sync-netbox-hiera is all good now [15:35:03] fix applied on both cumin nodes [15:35:16] and tested [15:35:49] puppet running on cumin1002 [15:36:16] then we can give the green light to dcops and everyone else, while still being careful [15:36:26] +1 [15:36:35] alright done [15:38:55] elukey: thanks for the help! that was a wild upgrade... [15:39:12] definitely.. [15:39:40] in the next few days let's capture in a task what we need to build a testing pipeline [15:39:48] so it is fresh in our memory [15:40:06] yeah exactly [15:40:41] and let's upgrade more frequently so there aren't 1000 breaking changes [15:41:01] dcl test environment? :) [15:49:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:08] also end-to-end tests, like homer/etc..
[15:57:25] it's all doable with VMs [15:57:31] upgrading more frequently would be nice, but even a minor can break everything [15:58:38] to be clear, it was all already doable with netbox-next [16:09:10] clearly not all, but lots of it yeah [16:15:16] * topranks silently cries as he closes his last tab with the Netbox 3 UI [16:24:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:13] folks as FYI, after a chat with Riccardo and Papaul we decided to modify a little late_command.sh - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058641 [16:40:48] it should hopefully avoid some race conditions where reimage wants to use puppet 7 on a host, but puppet 5 gets deployed (and fails for various reasons ) [16:59:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:40] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:44:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:45:40] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:40] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:40] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:19] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10033579 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d8033fb3-d4d1-4e37-8764-0a7625abbe34) set by ayounsi@cumin1002 for 5 days, 0:00:00 on 2 host(s) and their...