[02:04:21] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:39] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:49:22] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:43:19] slyngs: thanksss [06:48:32] elukey: good morning ! thinking about it, with Netbox 4 it's not possible to git cherry pick a pending script/report CR [06:51:39] Possible other options : 1/ scp the file to /srv/netbox/customscripts, 2/ merge it in the dev branch and add it as a new "data source" in Netbox-next [06:52:37] 3/ use the "file upload" in the "add script" page (probably the easiest option) [06:57:51] XioNoX: o/ ack thanks! I think I'll brutally copy the file manually [06:57:55] in -next of course [06:58:18] elukey: try the file upload, it's probably better/cleaner? [06:58:26] but also less tested :) [06:58:35] ahhhhh you want me to test it! [06:58:39] right ok I'll do it :) [06:59:09] elukey: we're good to try a new round of Netbox upgrade btw [06:59:31] let me know if you have some time to assist this morning [07:00:10] elukey: also don't forget to update your local repo to pickup https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1058208 [07:00:14] XioNoX: this morning I am a bit busy, would it be ok in the afternoon? [07:01:00] sure, ideally before dcops starts working [07:33:50] 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10030695 (10ayounsi) a:03Papaul [07:54:22] Hi folks. I'd like to have s3cmd installed on the cumin nodes, please [useful for testing S3 services e.g. thanos-swift and the new apus Ceph cluster]. Is that OK? Do you have a preferred way of getting this done? [07:56:05] Hi! I don't have anything against it, but I don't have a solid grasp of what is the policy for cumin nodes. If it is a matter of adding credentials + s3cmd via a custom profile, it should be fine [07:56:10] Cc: volans: --^ [07:58:17] I wasn't planning on installing credentials (just making them in my ~ as needed), I'd just like the CLI tool available [07:59:38] (yes, I could just stick the binary in ~, but that seems hacky) [08:02:08] Is it worth a task or documentation on wikitech? [08:09:56] Emperor: my preference would be to have a tool available for everybody, to avoid a proliferation of ~/.something in various home dirs (easier to miss right perms etc..) 
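(For concreteness, here is a rough sketch of the sort of check being described: driving s3cmd from a short Python script to confirm an S3 endpoint answers and that an object written via one DC's endpoint can be read back via the other. The endpoint names, bucket and config path are placeholders rather than real service names, and replication may need a short wait before the read-back succeeds.)

```python
# Illustrative sketch only: endpoints, bucket and config path are placeholders.
import subprocess

CONFIG = "/etc/s3cmd/test.cfg"        # hypothetical rendered credentials file
EQIAD = "apus.svc.eqiad.wmnet"        # placeholder DC-specific endpoints
CODFW = "apus.svc.codfw.wmnet"
BUCKET = "s3://connectivity-test"     # placeholder bucket


def s3cmd(endpoint, *args):
    """Run s3cmd against a given endpoint and return its stdout."""
    cmd = ["s3cmd", f"--config={CONFIG}", f"--host={endpoint}",
           f"--host-bucket={endpoint}", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Basic reachability/auth check against each endpoint.
for endpoint in (EQIAD, CODFW):
    print(endpoint, s3cmd(endpoint, "ls"))

# Write via one DC's endpoint, read back via the other to exercise replication
# (in practice a retry/wait may be needed before the object appears remotely).
with open("/tmp/probe.txt", "w") as fh:
    fh.write("probe\n")
s3cmd(EQIAD, "put", "/tmp/probe.txt", f"{BUCKET}/probe.txt")
print(s3cmd(CODFW, "get", "--force", f"{BUCKET}/probe.txt", "/tmp/probe-back.txt"))
```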
[08:16:13] If you'd rather I put in a phab ticket for "please install s3cmd on cumin nodes" I can do so [08:17:38] Essentially, I want to do some testing of the new apus cluster's S3 api, and would like the s3cmd tool available to do so, ideally with minimal faff :) [08:23:41] elukey: I've commented in -sre already :) [08:23:58] for the tool, as for the credentials, having puppet install them would be ideal indeed [08:37:19] is there an existing profile/class that could have one more package added to its install_packages list? A whole new one for one package feels like overkill... [08:39:14] given the multi-role nature of the cumin hosts, their role is a list of included profiles, one for each different use case, see role::cluster::management [08:57:36] this is sounding like a lot of hassle, and I should just stick the binary in ~ for now unless I end up needing a bunch of credentials &c permanently available [09:03:21] Emperor: sorry but no, cumin hosts should not have personal venvs or software installed randomly at all! And all deb packages should be installed via puppet. A puppet profile for what you need takes 5 minutes to write. If this is purely testing it should be done on the test hosts of the service and when ready be properly set up for production. [09:45:20] Emperor: +1 to what volans said [10:15:13] like this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058575 [10:30:02] Emperor: I personally don't agree with having s3cmd without credentials for everybody to use, since you can easily run it from stat10XX nodes for example [10:30:32] I think that cumin nodes should allow people with use cases like "I want to fetch data from thanos swift" to not worry about where to get credentials [10:30:47] otherwise cumin nodes become "test" nodes, and they shouldn't be [10:30:53] at least this is my view :) [10:31:13] if your requirement is just to have s3cmd available, maybe stat10xx could be good enough? [10:32:17] Be ready for a Netbox 4.1 release with new UI as soon as we upgrade to 4.0.x - https://github.com/netbox-community/netbox/issues/16907 [10:34:05] what's the timeline XioNoX ? if it's close it might be wise to pile the upgrades to reduce the adapt-to-new-ui trauma for everyone? [10:34:27] I fully agree with what Luca said above wrt the s3 client [10:34:32] I don't know anything about the statxxx nodes, but there aren't any in codfw, right? [10:34:39] volans: timeline is the day after we upgrade, can't escape it [10:35:02] volans: but also I prefer to minimize the upgrade trauma rather than the new UI trauma, sorry [10:35:17] that ship has sailed already [10:35:20] sorry :D [10:35:46] without wishing to grumble too much, if "no, you can't use cumin nodes for test utilities" is policy, where can I use testing utilities from in both DCs? Given I have S3 endpoints in both and need to check both work [10:36:14] [I thought I'd started out with the "because I need to test things" requirement, too?] [10:37:25] do they need to have special permission or acls in the network? [10:38:26] Emperor: nope only eqiad, this is the only downside (I thought that cross-dc with TLS was an option) [10:39:18] anyway, the cumin nodes can be used for tests, but in a more structured way - I was just suggesting to render the s3-cmd credentials somewhere under /etc [10:39:19] volans: no, just to be able to reach the apus endpoints (port 443) [10:39:31] we do the same on stat100x nodes btw, for example to access thanos swift for ML [10:39:44] (lunch, will read in a bit) [10:39:53] is there a test cumin host?
[10:40:04] I don't currently have any general-purpose apus account credentials (and may never do so) [10:40:29] XioNoX: ofc not :D [10:43:01] and I want clients in both DCs (since the inter-DC replication is one of the things I want to test) [10:52:14] topranks: netbox 4.1 sneak peek: https://github.com/netbox-community/netbox/issues/7025 [10:53:21] (FTR, if we end up with some permanent set of apus credentials it would be useful to make generally available, templating them out would be sensible) [11:18:21] XioNoX: interesting feature! [11:18:30] could definitely see how it'd be useful, although it requires some thought [11:20:22] yeah, what's redundant and at which scale and to do what with it. It could be useful for example with the maintenance email parser :) To alert if 2 redundant links are going to go down [11:22:22] yeah that's the perfect use-case for it [12:10:39] Emperor: okok thanks for giving us more details.. Just one question - using s3cmd from stat10xx nodes (so eqiad only) is not a viable option because of the cross dc calls? [12:17:45] elukey: indeed [12:18:29] particularly one of the things I want to test (as well as using the DC-explicit hostnames) is r/w to/from the discovery record from both DCs [12:18:39] okok got it [12:25:18] Emperor: from puppet it seems that role::mediabackup::worker could fit your use case, it has s3cmd in both dcs and it should be owned by DP [12:25:19] elukey: Netbox DB converted and imported, running the deploy cookbook on 1003/2003 [12:25:26] ack! [12:25:54] elukey: can you change the discovery record? [12:26:14] should we do it after the deploy cookbook finishes and we verify that all is good? [12:27:02] elukey: ah, OK, yes, I think I can make that work. I'll abandon my CR [12:27:48] Emperor: if they don't work, let's revisit the cumin option, we can add s3cmd for a while and unblock you (but we'll need to find a better and more permanent solution in the long run) [12:28:41] elukey: sure, I can do some small testing with my tunnel, but we will soon need in-production testing [12:28:59] deployed on 2003, one small error (which surprisingly doesn't happen on -dev) [12:29:27] XioNoX: ack let's do it, and flip only after the testing [12:29:39] even if it is little etc.. it will give us some confidence [12:31:18] are you saying something could go wrong? :) [12:31:36] netbox 4 doesn't like us :D [12:34:49] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10031670 (10ops-monitoring-bot) Deployed netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275 [12:37:46] deploy done, now the testing [12:42:43] elukey: script works :) [12:42:50] \o/ [12:43:27] ready for prime time? [12:44:16] elukey: yeah, we're better than where we were last time [12:44:38] we can flip discovery while I test all the other ones [12:45:08] ok proceeding [12:47:03] puppet running on dns nodes [12:50:54] all the important scripts and reports have been tested successfully [12:54:22] once discovery is deployed, we can release the new homer [12:54:37] should be done [12:55:15] each refresh I alternate between the two :) [12:55:23] so I guess a few more minutes [12:58:27] works for me now [12:58:33] same! [13:00:16] * elukey meeting, will read in a bit [13:01:44] also seems ok for me :) [13:06:58] great, now it's homer's turn to have issues releasing its new version...
[13:07:00] "Your build configuration is incomplete and previously worked by accident!" [13:08:44] I know I shouldn't laugh..... [13:09:02] don't worry I'm also laughing :) [13:09:06] lol [13:09:33] if I can help let me know, I expect it's beyond me if you can't figure it out though [13:09:51] https://www.irccloud.com/pastebin/rF5xlt66/ [13:10:04] so it's something to do with the docker env used to build the wheels [13:10:12] I also should have done that sooner but totally forgot [13:10:45] last time I had to do this I hit some issues with the docker env too, permission things I think [13:11:11] I hate to say it but this is possibly the most bettercallvolans thing I've ever seen [13:11:16] no prizes for guessing who sorted it out for me [13:11:19] haha yeah :) [13:12:00] Can you upgrade setuptools? [13:12:05] eh, I think I solved it, but dunno if the fix is legit or not [13:13:15] You can just apply this https://usercontent.irccloud-cdn.com/file/FYqi6gOJ/works.jpg [13:13:29] if someone can review https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1058600 [13:13:41] the fix is the tiny change in Dockerfile.build [13:25:28] back! [13:27:18] homer 0.7.0 deployed but there are issues [13:30:02] two more breaking changes that either were not documented or I didn't see [13:30:52] one is that when querying the interfaces from the master node of a virtual-chassis, it doesn't return the interfaces of the other members anymore (eg. https://netbox.wikimedia.org/dcim/interfaces/?device_id=614 ) [13:32:07] the other one is how data coming back from scripts is structured, I think [13:32:11] https://www.irccloud.com/pastebin/kJmDY8yQ/ [13:35:30] the 2nd one seems like a bug in the library, dunno if someone can confirm or not [13:38:18] it seems likely, url_path being a binary object and not a str [13:40:12] we could add some logging to https://github.com/netbox-community/pynetbox/blob/master/pynetbox/core/response.py#L429 [13:40:21] I have a fix for the first issue at least, working on it [13:51:20] XioNoX: what homer command did you run for https://www.irccloud.com/pastebin/kJmDY8yQ/ ? [13:51:35] I added a hacky print statement, I want to see if we can check something [13:52:59] elukey: `homer cr3-ulsfo* diff` on cumin2002 for example [13:55:46] INFO:homer:Generating configuration for cr3-ulsfo.wikimedia.org [13:55:49] PATH: /api/extras/scripts/1/ of type <class 'str'> [13:55:52] PATH: /api/core/jobs/53166/ of type <class 'str'> [13:55:54] PATH: /api/users/users/6/ of type <class 'str'> [13:55:57] PATH: b'' of type <class 'bytes'> [13:56:04] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1058609 this fixes the first issue (tested manually) [13:57:01] URL: None URL_PATH: b'' of type <class 'bytes'> [13:57:06] ok yeah something strange [13:57:25] _endpoint_from_url is called with an empty pat [13:57:28] *path [13:58:22] checking a quick fix [13:58:54] Changes for 1 devices: ['cr3-ulsfo.wikimedia.org'] [13:58:54] # No diff [13:58:54] --------------- [13:58:54] INFO:homer:Homer run completed successfully on 1 devices: ['cr3-ulsfo.wikimedia.org'] [13:59:00] nice! [13:59:26] I was looking at pynetbox and netbox changes/issues but nothing has been reported at least [13:59:32] sort of, the issue is in pynetbox, we cannot really patch it right?
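(A sketch of one way to work around the first breaking change — the virtual-chassis master no longer returning the other members' interfaces — is to query each member device explicitly. This only illustrates the idea and is not the actual change in the homer CR above; the Netbox URL and token are placeholders.)

```python
# Illustration only -- not the actual homer fix. URL and token are placeholders.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="XXXX")


def all_vc_interfaces(device_name):
    """Return a device's interfaces, including all virtual-chassis members."""
    device = nb.dcim.devices.get(name=device_name)
    if device.virtual_chassis is None:
        return list(nb.dcim.interfaces.filter(device_id=device.id))
    # Netbox 4 no longer folds the other members' interfaces into a query on
    # the master's device_id, so gather every member of the chassis explicitly.
    members = nb.dcim.devices.filter(virtual_chassis_id=device.virtual_chassis.id)
    interfaces = []
    for member in members:
        interfaces.extend(nb.dcim.interfaces.filter(device_id=member.id))
    return interfaces
```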
[13:59:50] 10netops, 06Infrastructure-Foundations, 06SRE: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501 (10cmooney) 03NEW p:05Triage→03Low [14:00:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:43] not cleanly, but they're reactive upstream, so might be worth keeping the local workaround [14:01:24] until we can rebuild the wheels with the updated version [14:02:29] ok so lemme file a patch so we can reason about it [14:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:27] so on cumin2002 homer is working well on all the devices I tested [14:09:23] next would be to deploy the cookbook changes once we're good with Homer [14:09:38] checking another thing in the code [14:09:43] no rush [14:11:06] XioNoX: I am going to re-break the venv removing my patch, I need to add another logging in another piece of the code [14:11:18] no pb [14:12:59] interesting.. so in the stacktrace, __init__ calls self._endpoint_from_url(values["url"]) [14:13:06] and in the "breaking" case, this is values [14:13:07] Values: {'obj': None, 'url': None, 'time': '2024-07-31T12:41:33.339876+00:00', 'status': 'success', 'message': 'Generated successfully, see the output tab for result.'} [14:15:25] no idea how to interpret that [14:19:09] https://github.com/elukey/pynetbox/commit/155e2050ae09b538f85c0217307e81af8446f7ee [14:19:16] this is the current fix [14:19:17] from there https://netbox.wikimedia.org/api/extras/scripts/1/ it's like the log line, and not the actual result [14:19:21] better than the one before [14:19:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:41] XioNoX: do you mind to re-test homer on cumin2002? 
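(Roughly, the shape of the workaround linked above: tolerate result objects whose "url" is None instead of letting _endpoint_from_url() choke on an empty path. This is a paraphrase for illustration, not the literal patch.)

```python
# Rough paraphrase of the idea behind the workaround, not the literal patch.

def endpoint_from_values(record, values):
    """Derive the endpoint for a pynetbox Record, skipping url-less objects.

    Script/job results in Netbox 4 can come back with 'url': None (as in the
    values dict pasted above), and calling record._endpoint_from_url(None or "")
    crashes while parsing the empty path.
    """
    url = values.get("url") if values else None
    if not url:
        return None
    return record._endpoint_from_url(url)
```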
[14:19:43] elukey: that makes sense [14:20:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:17] elukey: all good on all 4 ulsfo devices (no diff) [14:22:57] https://github.com/netbox-community/pynetbox/pull/632 [14:23:32] I am a little hesitant to leave homer as is with live patching [14:23:46] but I guess we can live with it for a bit, if you are confident that upstream will answer [14:25:17] yeah, their last release was a month ago, and nobody other than us will do the next homer release [14:26:35] elukey: next is https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1050445/ commit message still says not tested, but it was tested as much as possible (not all were fully testable) [14:26:54] my worry is that we re-deploy the venv and we forget about the fix, but we can live with it, not ideal but a complete rollback is worse at this point [14:27:27] even if that happens, we will quickly remember where the fix is :) [14:27:58] XioNoX: at this point we need to merge and possibly dry-run those, to have a vague sense of safety [14:28:04] (the new cookbooks I mean) [14:28:08] elukey: yeah, that's the plan [14:28:19] let's do it [14:30:01] elukey: running puppet on cumin2002 to pick up the new cookbooks, and I disabled puppet on 1002 to roll back more easily if needed [14:31:23] okok [14:32:16] I also need to run the provision cookbook, will test one [14:33:27] running now [14:34:32] so far decom' works fine [14:37:02] all good for decom in dry-run [14:38:49] provision looks good too (still in progress) [14:39:02] running reimage as we speak [14:40:11] back in 10 mins [14:41:18] re-image failed with "RuntimeError: New OS is bullseye but bookworm was requested" [14:41:25] so probably not related [14:43:42] and now (when setting --os bullseye) "spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are:" [14:49:18] elukey: for when you're back, one small bug: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1058618 [14:50:24] provision worked! [14:50:28] nice! [14:51:14] +1ed [14:51:35] elukey: not urgent but also small addition to that one https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1056989/2..3 [14:53:29] why do we need to sudo in there? [14:53:53] elukey: probably don't, just to run it as that other user [14:54:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:38] XioNoX: if it generates a permission issue, let's change it, otherwise I think we can skip it, no? [14:54:55] elukey: `runuser` is the command needed?
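(For reference, runuser takes the form `runuser -u <user> -- <command>`. A hedged sketch of how a cookbook might use it so the copied custom-script file ends up owned by the service user rather than root — the Cumin alias, user name and paths are assumptions, not the actual cookbook code:)

```python
# Sketch only: alias, user and paths are assumptions, not the real cookbook code.
from spicerack import Spicerack


def copy_custom_script(spicerack: Spicerack, filename: str) -> None:
    """Copy a Netbox custom script into place as the service user via runuser."""
    netbox_hosts = spicerack.remote().query("A:netbox")  # hypothetical Cumin alias
    netbox_hosts.run_sync(
        f"runuser -u netbox -- cp /tmp/{filename} /srv/netbox/customscripts/{filename}"
    )
```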
[14:55:04] elukey: yeah it will cause permissions issues [14:55:33] when updating the scripts using the UI, it will complain if the file is owned by root [14:55:42] makes sense then okok [14:56:34] re: runuser, never seen it but it seems so yes [14:56:55] ah TIL /usr/sbin/runuser on cumin nodes [14:57:11] XioNoX: --^ [14:57:49] yeah I saw it in some cookbooks, that's the only reason I'm mentioning it :) [14:58:04] let's use it [14:58:15] CR updated [14:59:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:34] sync netbox hiera works fine [15:00:42] it doesn't crash, but it tries to remove all the VMs now [15:02:17] not great :D [15:03:09] who needs VMs anyway? [15:03:11] * topranks hides [15:03:35] how does it work? Is the cookbook committing to the puppet repo? [15:04:02] okok it uses reposync [15:04:38] elukey: yeah it runs the "VM_LIST_GQL" query with the parameters ['active', 'failed'] [15:09:34] If I manually run the query the hosts that the cookbook wants to remove still show up https://usercontent.irccloud-cdn.com/file/YJ99hCiJ/Screenshot%202024-07-31%20at%2017-08-47%20GraphiQL%20NetBox.png [15:09:54] ohhh [15:09:58] https://www.irccloud.com/pastebin/i6QVUvcH/ [15:10:15] but it's not capitalized in the output [15:10:48] we really need to invest time into a complete testing pipeline for netbox [15:11:19] elukey: 100% agreed :) [15:11:34] I updated the cookbook manually and it fixes the issue [15:11:52] I'm also wondering why there is this condition, while we filter them ahead of time [15:13:04] does seem to be useless given it's in the query already [15:14:27] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1058625 [15:14:48] elukey ^ topranks ^ [15:15:36] there are lots of thing we need to do after the upgrade to clean it all up properly, and document it [15:15:38] XioNoX: the comment is a bit dense to read, what do you mean with it? [15:16:10] elukey: a reminder that the if right under might be useless [15:16:14] XioNoX: might it be better to have if host['status'].lower() == [15:16:19] lgtm anyway [15:16:56] topranks: yeah, I went for the easiest fix for now :) [15:17:07] XioNoX: let's write in like "check if the 'if' block below is useless as we etc.. [15:17:13] *it [15:17:21] SURE [15:17:29] damn I need .lower() myself :) [15:17:30] +1 [15:17:47] * elukey scared by topranks shouting :D [15:18:08] hahaha [15:18:26] topranks: line 343 there is already a 'status': host['status'].lower() and that's what we do for baremetal too [15:18:40] so at least the issue won't show up later down the pipeline [15:19:22] ah [15:19:23] cool [15:20:00] elukey: CR updated [15:20:21] lgtm [15:20:38] waiting for CR then will deploy [15:20:51] and then re-enable puppet on cumin1002 [15:21:13] I added it to the pile of things to do after the upgrade :) [15:23:14] anything left after that? :) [15:23:58] elukey: did you do your venv modification on both cumin hosts for homer ? [15:23:58] announcement ? [15:24:18] answer to the "anything left?" btw ^ [15:24:28] XioNoX: nope, lemme do it [15:24:59] elukey: can you do that one too ? [15:25:01] https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1058609 [15:25:32] it's in the homer_plugins path in the venv [15:25:49] but this one is in our control, we should create a new release [15:25:52] no? 
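(A small sketch of the status-case issue discussed above: the GraphQL query in Netbox 4 returns status values in a different case than the cookbook compared against, so both the comparison and anything written out from host['status'] need to be normalised with .lower(). Hostnames below are made up.)

```python
# Sketch of the case normalisation discussed above; hostnames are made up.
WANTED_STATUSES = {"active", "failed"}


def keep_host(host):
    """Keep a VM only if its status matches, regardless of GraphQL casing."""
    return host["status"].lower() in WANTED_STATUSES


vms = [
    {"name": "vm1001", "status": "ACTIVE"},            # Netbox 4 GraphQL casing
    {"name": "vm2001", "status": "active"},            # what the old code expected
    {"name": "vm3001", "status": "DECOMMISSIONING"},
]
print([vm["name"] for vm in vms if keep_host(vm)])      # ['vm1001', 'vm2001']
```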
[15:26:09] elukey: we can, I didn't want to risk breaking your venv change [15:26:16] elukey: but yeah I can deploy it [15:26:59] XioNoX: no problem I'll re-apply it [15:27:05] one hack is already enough :D [15:27:10] elukey: fair :) [15:29:14] deploying [15:30:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:31:26] elukey: deployed and tested, all yours [15:32:37] fixing [15:34:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:49] sync-netbox-hiera is all good now [15:35:03] fix applied on both cumin nodes [15:35:16] and tested [15:35:49] puppet running on cumin1002 [15:36:16] then we can give the green light to dcops and everyone else, while still being careful [15:36:26] +1 [15:36:35] alright done [15:38:55] elukey: thanks for the help! that was a wild upgrade... [15:39:12] definitely.. [15:39:40] in the next few days let's capture in a task what we need to build a testing pipeline [15:39:48] so it is fresh in our memory [15:40:06] yeah exactly [15:40:41] and let's upgrade more frequently so there aren't 1000 breaking changes [15:41:01] dcl test environment? :) [15:49:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:08] also end-to-end tests, like homer/etc..
[15:57:25] it's all doable with VMs [15:57:31] upgrading more frequently would be nice, but even a minor can break everything [15:58:38] to be clear, it was all already doable with netbox-next [16:09:10] clearly not all, but lots of it yeah [16:15:16] * topranks silently cries as he closes his last tab with the Netbox 3 UI [16:24:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:13] folks as FYI, after a chat with Riccardo and Papaul we decided to modify a little late_command.sh - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058641 [16:40:48] it should hopefully avoid some race conditions where reimage wants to use puppet 7 on a host, but puppet 5 gets deployed (and fails for various reasons ) [16:59:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:40] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:44:22] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:45:40] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:40] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:40] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:19] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#10033579 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d8033fb3-d4d1-4e37-8764-0a7625abbe34) set by ayounsi@cumin1002 for 5 days, 0:00:00 on 2 host(s) and their...