[00:00:37] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10005374 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=23e26d8b-bf98-4528-9f4f-f796eb123261) set by cmooney@cumin1002 for 0:15:00 on 1 host(s) and th... [00:02:19] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10005377 (ops-monitoring-bot) VM netflow2003.codfw.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [00:24:20] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:20] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:42:08] if anybody has time for a ferm change on puppetmasters: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055964 [09:34:24] elukey: +1 with a small nit [10:00:49] elukey: so what's left of netbox is 1/ load a fresh DB, 2/ pointing CDN to the netbox 4 servers, 3/ test it [10:02:17] if the tests are good, 4/ deploy the cookbooks 5/ deploy homer, 6/ puppetize the new redis DB [10:07:58] should we puppetize the new redis config as one of first steps? [10:08:36] elukey: if we have to rollback again that means much more things to juggle with [10:09:22] XioNoX: then I am not getting the new setting - IIUC it was only for the [1,2]003 servers, to avoid using the same redis instances as the netbox 3 nodes [10:09:39] elukey: yeah exactly [10:09:47] then why should it be rolled back?
[10:10:13] elukey: if we need to rollback to netbox 3 longer term [10:10:27] okok but hopefully not :) [10:10:31] nuclear option :) [10:10:46] anyway, ideally I'd love to see puppet running fine on all netbox nodes, then we switch the CDN [10:11:00] without custom changes etc.. [10:11:09] it is very difficult to track what's going on otherwise [10:11:17] elukey: that's not really doable without a big refactor [10:11:42] okok because of yesterday's patches, yes [10:11:54] something to keep in mind for the next upgrade [10:12:05] at this point let's proceed with the fresh db [10:12:31] I can take care of the CDN afterwards [10:13:53] cool, on it [10:15:32] CDN revert: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056130 [10:15:59] then I'll run puppet on dns nodes as yesterday [10:18:03] doing the netbox 3.2 -> 3.7 DB conversion [10:23:22] elukey: importing the 3.7 DB in netbox 4 [10:26:42] ack [10:27:11] elukey: DB done: you can proceed with the CDN [10:31:16] doing it [10:33:08] puppet in progress on the dns nodes [10:38:11] all reports are running fine, so a clear progress compared to yesterday :) [10:38:23] <_joe_> hi folks, who can I ask for a review of a change to .puppet-lint.rc in operations/puppet? [10:40:22] _joe_ o/ I think that all the experts are out/afk, I can try to take a look but my understanding is limited :D (also Jesse is another good reviewer) [10:40:28] _joe_: feel free to link it here, but maybe Jesse ? [10:41:09] <_joe_> I can still ask jbond as he's still here :D [10:41:25] <_joe_> jbond: :* I'm joking ofc [10:41:28] lol [10:42:20] XioNoX: dns nodes updated, but I still see netbox3 [10:42:33] _joe_: i don't mind if you want to drop the link [10:42:47] <3 [10:43:14] hellooooo jbond! [10:43:20] <_joe_> jbond: lol no I would never do that I just wanted to say hi :) [10:43:33] <_joe_> and I found an excuse, I hope you're doing well [10:44:15] _joe_: will that change when used enforce 4 spaces on existing touched files?
I'm just afraid it will create some noise in code reviews as IIRC we use a mixed value right now [10:44:35] hi _joe_ elukey and all, yes im doing well thanks. summer is here, the pool is open and the wine is pouring (well not now its not even 13:00) ;) [10:44:39] elukey: it's good now [10:44:48] ahahahah glad to hear [10:44:56] jbond: hellooooo!!! [10:45:15] <_joe_> volans: it won't [10:45:24] <_joe_> volans: we don't use the strict_indent check [10:45:32] not in CI, in the editor I mean [10:45:39] <_joe_> and IIRC we don't use a mixed value, ever. [10:45:51] mixed I mean we have files that use 2 spaces [10:46:01] <_joe_> we've always used 4 spaces as indentation [10:46:11] <_joe_> you mean external vendors? [10:46:16] im not sure 4 was ever enforced [10:46:35] random example modules/raid/manifests/megaraid.pp [10:46:42] git grep '^ [a-z]' | grep "\.pp" [10:46:49] <_joe_> volans: so you're saying it's an even better idea? [10:46:57] if you conver them first yes [10:47:01] *convert [10:47:02] :D [10:47:22] <_joe_> it's really not a requirement [10:47:38] <_joe_> but good grief there's files with mixed indentation ofc [10:47:51] told ya :D [10:48:14] <_joe_> volans: so the counterargument is that I already had this in my editor config [10:48:24] <_joe_> but now the puppet extension wants to use puppet-lint.rc [10:49:51] jbond: hey! [10:50:11] I am quite jealous of you and your pool, lots of water where I am but unfortunately it's coming from the sky :P [10:50:17] XioNoX: lemme know if you need any help [10:50:32] I'm totally for forcing 4 spaces btw, just that we should do a pass of "black for puppet" first maybe to start clean [10:52:36] topranks: thanks, although i have to fight the paper wasps at the moment ;) [10:52:54] elukey: so reports are fine, capirca is fine too [10:53:08] but for some reason the other scripts don't have the "run" button... [10:53:17] fyi puppet-lint has a --fix option these days.
but may need to run it without the wmf style guide [11:01:23] <_joe_> the patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056131 [11:03:05] <_joe_> it won't work though :) [11:03:37] <_joe_> because we fix puppet-lint at 2.4.2 [11:04:39] :D [11:05:24] ack [11:06:12] volans, elukey, I think the issue is with the Importer in `class ImportPuppetDB(Script, Importer)` (and similar) [11:06:36] what's the error? [11:07:13] volans: no error as it, but the "run" button doesn't show up in https://netbox.wikimedia.org/extras/scripts/ [11:07:35] was it showing up yesterday? [11:08:05] volans: nah, but it's showing up in -next [11:08:16] wut? [11:08:24] what's different between the two? [11:08:54] volans: I remember seeing that I *think* `from _common import Importer` is now from the point of view of the src/ root or something like that [11:09:26] I mean between next and prod [11:09:37] that's what I'm trying to figure out [11:10:02] there is one thing wrong in both [11:10:07] there is no name of the script [11:10:15] or description [11:10:20] for import_server_facts [11:10:23] for example [11:10:53] sorry I've been busy with DP stuff this morning, but I just don't understand one thing [11:10:55] right, and when I try to run it it says "No module named '_common'" [11:11:12] given we had the two hosts ready, didn't we test everything was working there before attempting the migration again? [11:11:45] I tested most of it, but looks like I missed that one [11:11:47] maybe they changed the way they load the scripts [11:12:01] let me look at netbox code [11:15:09] might not be a quick fix, we might have to rethink a bit the scripts that use that mechanism [11:15:15] XioNoX: what if we revert the CDN settings, and keep testing with a local tunnel or something? [11:15:18] is it possible?
[11:15:34] so we unblock people in need to use netbox 3, and we keep testing [11:17:11] +1 [11:17:56] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056134/ [11:17:59] for the revert [11:18:44] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006345 (cmooney) So, we hit a bit of a speed-bump in codfw with the gnmic stats once the new switches were made live there. We now have 36 active gnmic subscriptions... [11:19:06] I'm so stupid... https://github.com/netbox-community/netbox/issues/16698 [11:19:07] ... [11:19:28] so you found the issue and opened the task LOL [11:22:21] I totally forgot about that.... [11:23:24] XioNoX: what I have to do to get the No module named '_common' [11:23:27] error? [11:24:05] +1ed the revert btw, I've lunch coming up in few and I guess we don't have a clear win solution right away [11:24:14] yeah agreed [11:27:28] rollback in progress [11:27:38] going to lunch after it finishes [11:28:36] thx [11:28:40] volans: no idea [11:30:16] at this point I would think about doing a quick puppet refactoring to allow both netbox 3/4 to live with puppet enabled [11:30:29] and then we keep testing [11:30:33] yeah agreed too [11:30:38] since it may take a while to narrow down all bugs :( [11:31:06] I can help in case, if you want to send an initial code review I can work on it as well [11:31:36] I'm going to take a break first :) I'm quite tired of it [11:32:31] definitely, going afk for a bit too :) I didn't mean "now", but in the next 1/2 days :) [11:36:10] XioNoX: rollback done! [11:36:13] <3 [11:36:28] * elukey lunch! [11:39:52] `added status: needs owner` for the upstream bug report, so that won't be fixed anytime soon [12:30:28] that's a shame, it doesn't sound like something that would be hard for "someone who knows what they're doing" [12:30:38] I guess we can use the work-around you detail in the bug report?
[12:31:56] yeah, hopefully the last wrinkle :) [12:32:20] XioNoX: separately I was looking at memory usage for our netflow* VMs [12:32:50] topranks: yeah saw the task, +1 to bump it everywhere [12:32:51] the ones at the POPs are ok (a bit hot but alright), netflow1002 is swapping a bit though, I might increase from 2 to 3GB RAM? [12:33:09] or bump it everywhere, it probably wouldn't hurt [12:33:30] up to you [12:34:02] I'll check the ganeti resources if it looks like there are lots (codfw had plenty) I'll bump them all to 3GB for consistency [12:34:21] I still prefer bumping the ram on the netflow hosts rather than creating a new VM for gnmic [12:45:36] Ok, yeah I've no strong preference. I know the gnmic devs are quite keen on k8s and their distributed model but not sure we need to go there. [12:47:58] could be interesting to experiment with it, the tradeoff is an additional dependency in the chain and possible complexity, but it's "just" monitoring (k8s won't break if gnmic breaks) [12:57:04] yeah agreed [13:11:50] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006695 (ops-monitoring-bot) VM netflow1002.eqiad.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:16:50] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006699 (ops-monitoring-bot) VM netflow3003.esams.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:24:10] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006727 (ops-monitoring-bot) VM netflow4002.ulsfo.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:24:27] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006728
(ops-monitoring-bot) VM netflow5002.eqsin.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:30:48] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006766 (ops-monitoring-bot) VM netflow6001.drmrs.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:33:50] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006779 (cmooney) In Eqiad our netflow VM was also running a little hot, and swapping to disk. I've now increased the resources for it and also the other netflow VMs i... [13:34:50] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10006783 (ops-monitoring-bot) VM netflow7001.magru.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM [13:37:47] netops, Infrastructure-Foundations, SRE: Set Leaf switches in Codfw rows C & D to active and make new vlans live - https://phabricator.wikimedia.org/T370629#10006786 (cmooney) Open→Resolved All actions complete. @papaul, @Jhancock.wm please note that after this change if running the netb... [13:46:39] elukey, what do you think is cleaner to puppetize netbox 4 vs. 3 ? a variable netbox4=True ? A variable netbox_version=4 ? Matching on the hostname ?
Knowing that it's temporary [13:47:49] XioNoX: even if it is temporary, it should be clean - I'd say that having a "version" selector would be the best option, or even a netbox4 one [13:48:08] in the profile I mean, so that we can tune it easily via hiera [13:48:11] yeah agreed on making it clean :) [14:46:25] netops, Infrastructure-Foundations, SRE, Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10007060 (ssingh) On `dns6001`, we have anycast-hc 0.9.8 running with the patch to change the logging level to WARN for when a service is dow... [14:49:00] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007067 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=85a0a04b-e091-4107-9bc3-7c9ca22300c8) se... [14:57:07] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007073 (MatthewVernon) @cmooney Swift (ms-be) and Ceph (moss-be) ready when you are. [15:01:15] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007080 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=71f4229e-483c-4848-9bc3-6926b62b02ae) se... [15:01:45] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007081 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=18d9056a-9166-4006-b516-a07496523fd2) se...
[15:21:36] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007228 (cmooney) Upgrade complete, things look ok network wise and all host are back pinging again. Thanks all f... [15:26:44] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007276 (MatthewVernon) Both Ceph and Swift back to normal, thanks. [16:23:32] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10007599 (cmooney) Open→Resolved [16:24:27] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#10007601 (cmooney) Open→Resolved [17:10:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:59:18] Hey folks! Per wikitech it sounds like you're the good folks who own cloudflare-related stuff [19:01:18] I'm just doing a spot of research to fill out https://phabricator.wikimedia.org/T370808 – I think we're likely to want to register our Citoid and zotero translation server services as "friendly bots" with cloudflare.
I think it's likely Peter will give the go-ahead, and I'm wondering who might have the authority to implement that who I could then coordinate with and/or dump the ticket on [19:24:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:57:14] netops, Infrastructure-Foundations, SRE: Add data to automation for new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#10008422 (cmooney) Open→Resolved [20:14:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:15:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:13] SRE-tools, Infrastructure-Foundations, SRE: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#10008570 (Kappakayala) Is there an update on this? We have a new team member joining us and this will be super helpful as we onboard them. [20:30:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:20] RESOLVED:
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:55:06] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10008870 (Ladsgroup) I'm repooling the replicas now. [21:55:38] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:45:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:36:54] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:39:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed