[01:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:39] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:39] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:32] moritzm: Is there a specific reason for ferm.service to not have Restart=always, or for ferm status to not return an error if systemctl status ferm.service shows the unit as failed?
[09:57:25] (by ferm status I mean /usr/local/sbin/ferm-status)
[09:58:18] I can make a patch for either of those; I just want to solve that race condition problem. I'm getting tired of restarting ferm.service manually because of the race condition described in https://phabricator.wikimedia.org/T354855
[10:03:49] adding Restart=always will not be reliable; ferm doesn't use a proper systemd unit, but only shells out to /etc/init.d/ferm under the hood
[10:04:22] ferm-status's main purpose is to validate that the rules loaded are in line with the intended rules
[10:04:51] we can probably extend it to also bail out if ferm is unavailable
[10:05:13] but I'm not very familiar with it, this was all John's doing
[10:05:24] but happy to review any proposed patch ofc!
[10:05:43] the alternative is to create a modules/toil class
[10:06:07] which detects whether ferm went down with the kubeproxy race and then restarts it
[10:13:52] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899#9571421 (ayounsi)
[10:13:58] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275#9571424 (ayounsi)
[10:15:00] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899#9488963 (ayounsi) I think we should "just" put it on the list of things to check after the Netbox upgrade. This behavior seems like a bug, and might have...
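A minimal sketch of the extension floated at 10:04:51 (making the status check bail out when ferm.service itself is down), assuming ferm-status is, or calls out to, a small Python helper; the function name is hypothetical and the existing rule comparison is elided:

    #!/usr/bin/env python3
    # Hypothetical sketch only: fail the ferm status check when ferm.service
    # itself is not active, before comparing loaded vs. intended rules.
    import subprocess
    import sys

    def ferm_service_active() -> bool:
        # `systemctl is-active --quiet` exits non-zero unless the unit is active.
        return subprocess.run(
            ["systemctl", "is-active", "--quiet", "ferm.service"]
        ).returncode == 0

    def main() -> int:
        if not ferm_service_active():
            print("ferm.service is not active", file=sys.stderr)
            return 1
        # ... the existing intended-vs-loaded rule validation would go here ...
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The idea is that a non-zero exit from the status command lets Puppet's ensure => running kick in and restart the unit.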
[11:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:34] moritzm: So basically the ferm service's ensure => running never applies in cases where ferm.service is failed, because it uses ferm-status as the status check command, which only checks the rules, not the actual service status
[11:58:15] So I think the "right" course of action would be for me to add a systemctl status ferm.service check to ferm-status, so it errors out if the service is failed and Puppet actually restarts the service if it isn't available
[12:00:28] ack, based on the high-level reading I did of ferm-status that sounds good to me
[12:17:19] netops, Infrastructure-Foundations, cloud-services-team, User-aborrero: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543#9571774 (aborrero) In {T346947}, in https://gerrit.wikimedia.org/r/c/operations/homer/public/+/973769/comments/dedcd277_a07c883b @cm...
[13:28:24] XioNoX, topranks: could either of you please check https://phabricator.wikimedia.org/T353525#9571917 ?
[13:28:48] * topranks looking
[13:28:53] moritzm: correct
[13:29:16] yep
[13:33:08] thanks
[13:43:50] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9571963 (MoritzMuehlenhoff)
[13:44:05] the capirca script again hit the 5 min timeout, I opened https://phabricator.wikimedia.org/T358339
[13:45:13] seems like a dupe of https://phabricator.wikimedia.org/T341843
[13:47:57] 5 min is already way too much for what we're asking of it
[13:48:12] not sure if bumping https://docs.netbox.dev/en/stable/configuration/miscellaneous/#rq_default_timeout would help here
[13:53:00] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9572002 (ayounsi) https://netbox.wikimedia.org/admin/extras/jobresult/?name=capirca.GetHosts&o=-3.1.2 Not a good track record. {F42064024} We can bump the timeout, but the sc...
[13:53:58] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9572007 (ayounsi) p:Triage→High
[13:56:17] I wanted to check if some of the routers had received the results (or IOW if the change was partially deployed), but it seems not: https://paste.debian.net/hidden/0aa2a936/
[13:57:04] moritzm: eh, maybe I broke something earlier, Homer isn't generating the configs there
[13:57:07] let me have a look
[13:57:13] Homer fetches the output of the script
[13:57:27] as the script failed, the latest output is not valid
[13:57:42] ah ok
[13:58:01] yeah, but my hypothesis was that it had maybe managed to complete some of them and then a subsequent run of the Netbox script might fix the remaining ones
[13:58:20] I'll try to bump the timeout just in case
[14:02:03] yeah and now it completed in 1 min...
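For reference, the timeout bump mentioned at 13:48:12 and 13:58:20 is a single setting in Netbox's configuration.py; RQ_DEFAULT_TIMEOUT is the documented parameter (in seconds, default 300), and the 600 below is only an illustrative value, not whatever was actually deployed:

    # netbox/netbox/configuration.py (illustrative value, not the deployed one)
    # Default is 300 seconds; raising it only buys headroom, it does not fix
    # whatever makes the capirca.GetHosts job hang in the first place.
    RQ_DEFAULT_TIMEOUT = 600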
[14:02:24] looking back at the history they always took 1 or 2 min max
[14:02:24] yeah, I think it's getting locked up by something when it fails
[14:02:38] rather than it "working ok" but just taking a little longer
[14:02:48] yeah
[14:02:53] not sure what's best :)
[14:03:43] hopefully something to put in the pile of things that will magically work after the Netbox upgrade :)
[14:04:01] moritzm: you should be good to run homer
[14:04:28] confirmed, "homer diff" appears to connect fine now
[14:04:57] feel free to merge T358339 into T341843 if you believe it's the same root cause
[14:04:58] T358339: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339
[14:04:58] T341843: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843
[14:05:37] probably the same root cause, but it's fine to have 2 tasks so we can investigate them individually if the upgrade doesn't fix it
[14:06:01] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9572027 (ayounsi)
[14:06:06] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275#9572028 (ayounsi)
[14:22:57] netops, Infrastructure-Foundations, SRE: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9572046 (cmooney) Open→Resolved
[14:23:05] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9572047 (cmooney)
[14:23:45] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9476920 (cmooney)
[14:25:07] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9572048 (cmooney) Open→Resolved a: cmooney - closing, thanks for the help!
[14:33:04] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: wmf_auto_restart_cron.service failing in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9572080 (taavi)
[15:48:27] Mail, Infrastructure-Foundations, SRE: Integrations tests - https://phabricator.wikimedia.org/T358355#9572381 (jhathaway)
[15:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:48:40] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:41] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed