[01:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:39] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:39] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:32] moritzm: Is there a specific reason for ferm.service to not have Restart=always, or for ferm status to not return an error if systemctl status ferm.service shows the unit as failed?
[09:57:25] (by ferm status I mean /usr/local/sbin/ferm-status)
[09:58:18] I can make a patch for either of those; I just want to solve that race condition problem. I'm getting tired of restarting ferm.service manually because of the race condition described in https://phabricator.wikimedia.org/T354855
[10:03:49] adding Restart=always will not be reliable; ferm doesn't use a proper systemd unit, but only shells out to /etc/init.d/ferm under the hood
[10:04:22] ferm-status's main purpose is to validate that the rules loaded are in line with the intended rules
[10:04:51] we can probably extend it to also bail out if ferm is unavailable
[10:05:13] but I'm not very familiar with it, this was all John's doing
[10:05:24] but happy to review any proposed patch ofc!
[10:05:43] the alternative is to create a modules/toil class
[10:06:07] which detects whether ferm went down with the kubeproxy race and then restarts it
[10:13:52] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899#9571421 (ayounsi)
[10:13:58] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275#9571424 (ayounsi)
[10:15:00] netbox, Infrastructure-Foundations: Netbox MoveServersUplinks script doesn't handle trunked ports correctly - https://phabricator.wikimedia.org/T355899#9488963 (ayounsi) I think we should "just" put it on the list of things to check after the Netbox upgrade. This behavior seems like a bug, and might have...
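A minimal sketch of the extension floated at 10:04:51 (making the status check bail out when ferm.service itself is down), assuming ferm-status is, or calls out to, a small Python helper; the function name is hypothetical and the existing rule comparison is elided:

    #!/usr/bin/env python3
    # Hypothetical sketch only: fail the ferm status check when ferm.service
    # itself is not active, before comparing loaded vs. intended rules.
    import subprocess
    import sys

    def ferm_service_active() -> bool:
        # `systemctl is-active --quiet` exits non-zero unless the unit is active.
        return subprocess.run(
            ["systemctl", "is-active", "--quiet", "ferm.service"]
        ).returncode == 0

    def main() -> int:
        if not ferm_service_active():
            print("ferm.service is not active", file=sys.stderr)
            return 1
        # ... the existing intended-vs-loaded rule validation would go here ...
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The idea is that a non-zero exit from the status command lets Puppet's ensure => running kick in and restart the unit.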
[11:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:34] moritzm: So basically the ferm service's ensure => running never applies in cases where ferm.service is failed, because it uses ferm-status as the status check command, which only checks the rules, not the actual service status
[11:58:15] So I think the "right" course of action would be for me to add a systemctl status ferm.service check to ferm-status, so it errors out if the service is failed and Puppet actually restarts the service if it isn't available
[12:00:28] ack, based on the high-level reading I did of ferm-status that sounds good to me
[12:17:19] netops, Infrastructure-Foundations, cloud-services-team, User-aborrero: clouddb: evaluate moving them into cloud-private - https://phabricator.wikimedia.org/T357543#9571774 (aborrero) In {T346947}, in https://gerrit.wikimedia.org/r/c/operations/homer/public/+/973769/comments/dedcd277_a07c883b @cm...
[13:28:24] XioNoX, topranks: could either of you please check https://phabricator.wikimedia.org/T353525#9571917 ?
[13:28:48] * topranks looking
[13:28:53] moritzm: correct
[13:29:16] yep
[13:33:08] thanks
[13:43:50] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9571963 (MoritzMuehlenhoff)
[13:44:05] the capirca script again hit the 5 min timeout, I opened https://phabricator.wikimedia.org/T358339
[13:45:13] seems like a dupe of https://phabricator.wikimedia.org/T341843
[13:47:57] 5 min is already way too much for what we're asking of it
[13:48:12] not sure if bumping https://docs.netbox.dev/en/stable/configuration/miscellaneous/#rq_default_timeout would help here
[13:53:00] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9572002 (ayounsi) https://netbox.wikimedia.org/admin/extras/jobresult/?name=capirca.GetHosts&o=-3.1.2 Not a good track record. {F42064024} We can bump the timeout, but the sc...
[13:53:58] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9572007 (ayounsi) p:Triage→High
[13:56:17] I wanted to check if some of the routers had received the results (or IOW if the change was partially deployed), but it seems not: https://paste.debian.net/hidden/0aa2a936/
[13:57:04] moritzm: eh, maybe I broke something earlier, Homer isn't generating the configs there
[13:57:07] let me have a look
[13:57:13] Homer fetches the output of the script
[13:57:27] as the script failed, the latest output is not valid
[13:57:42] ah ok
[13:58:01] yeah, but my hypothesis was that it had maybe managed to complete some of them and then a subsequent run of the Netbox script might fix the remaining ones
[13:58:20] I'll try to bump the timeout just in case
[14:02:03] yeah and now it completed in 1 min...
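For reference, the timeout bump mentioned at 13:48:12 and 13:58:20 is a single setting in Netbox's configuration.py; RQ_DEFAULT_TIMEOUT is the documented parameter (in seconds, default 300), and the 600 below is only an illustrative value, not whatever was actually deployed:

    # netbox/netbox/configuration.py (illustrative value, not the deployed one)
    # Default is 300 seconds; raising it only buys headroom, it does not fix
    # whatever makes the capirca.GetHosts job hang in the first place.
    RQ_DEFAULT_TIMEOUT = 600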
[14:02:24] looking back at the history they always took 1 or 2 min max
[14:02:24] yeah, I think it's getting locked up by something when it fails
[14:02:38] rather than it "working ok" but just taking a little longer
[14:02:48] yeah
[14:02:53] not sure what's best :)
[14:03:43] hopefully something to put in the pile of things that will magically work after the Netbox upgrade :)
[14:04:01] moritzm: you should be good to run homer
[14:04:28] confirmed, "homer diff" appears to connect fine now
[14:04:57] feel free to merge T358339 into T341843 if you believe it's the same root cause
[14:04:58] T358339: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339
[14:04:58] T341843: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843
[14:05:37] probably the same root cause, but it's fine to have 2 tasks so we can investigate them individually if the upgrade doesn't fix it
[14:06:01] netbox, Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#9572027 (ayounsi)
[14:06:06] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275#9572028 (ayounsi)
[14:22:57] netops, Infrastructure-Foundations, SRE: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9572046 (cmooney) Open→Resolved
[14:23:05] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9572047 (cmooney)
[14:23:45] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9476920 (cmooney)
[14:25:07] netops, Infrastructure-Foundations, SRE, SRE-swift-storage, ops-codfw: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9572048 (cmooney) Open→Resolved a: cmooney - closing, thanks for the help!
[14:33:04] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: wmf_auto_restart_cron.service failing in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9572080 (taavi)
[15:48:27] Mail, Infrastructure-Foundations, SRE: Integrations tests - https://phabricator.wikimedia.org/T358355#9572381 (jhathaway)
[15:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:48:40] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:41] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed