[00:01:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:49:37] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:52] <wikibugs>	 netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) Open→Resolved The support contract is different on the old vs. new licensing, so we need to be able to verify that the proper support is appl...
[06:54:16] <wikibugs>	 netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) Resolved→Open Re-opening as the LibreNMS report needs to be updated to handle those discrepancies.
[07:09:29] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (MoritzMuehlenhoff) >>! In T335879#9173531, @Volans wrote: > This leave us just with two options: > * catch the exception in the cookbooks...
[07:16:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:30] <XioNoX>	 moritzm: https://netbox.wikimedia.org/extras/reports/results/5017462/ "furud (WMF6587) Device is in PuppetDB but is Offline in Netbox (should be Active or Failed)"
[07:32:36] <moritzm>	 that's strange, per https://phabricator.wikimedia.org/T345867#9151787 the decom cookbook set it to Decommissioning
[07:33:24] <XioNoX>	 moritzm: yeah the issue is that it didn't get removed from PuppetDB
[07:34:23] <XioNoX>	 it's now offline as DCops ran the offline cookbook after you https://netbox.wikimedia.org/dcim/devices/1425/changelog/
[07:34:41] <moritzm>	 I can re-run the decom cookbook, in some edge cases the puppetdb removal seems to fail, I've seen that before
[07:42:28] <moritzm>	 I ran puppet node clean/deactivate for furud, should clear it up
[07:48:47] <XioNoX>	 thx!
[07:51:20] <XioNoX>	 volans: does this verify it has been removed? or retry? https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/decommission.py#L398
[07:53:25] <volans>	 XioNoX: runs puppet node clean $host and puppet node deactivate $host via cumin and raises if non-zero exit code
[07:53:53] <volans>	 *but*
[07:54:38] <volans>	 if the puppet timer runs before the shutdown it's possible it gets re-added to puppetdb
[07:54:56] <volans>	 I don't recall if we try to disable puppet
[07:55:01] <volans>	 if not we can add it
[07:55:06] <volans>	 but it would be best-effort
[07:55:16] <volans>	 as the decom must be able to run also if the host is not reachable
[08:01:29] <moritzm>	 it happens from time to time, we have an older task about this: https://phabricator.wikimedia.org/T206448
[08:13:52] <volans>	 mmmh re-checking the code
[08:14:10] <volans>	 we already do the puppet removal after the poweroff and after waiting for 20s
[08:15:48] <volans>	 and we use ['chassis', 'power', 'off'] on physical hw, so it's like pulling the plug
[08:16:06] <volans>	 moritzm: is the host still up by any chance?
[08:16:54] <volans>	 if so we can add a check that the host is actually off after the power off I guess
[08:17:00] <moritzm>	 no, it was unracked pretty quickly since the server along with the storage arrays attached to it consumed half a rack
[08:17:15] <volans>	 got it
[08:17:40] <moritzm>	 that was the snowflake server for which I had also seen that connection error to the switch
[08:18:06] <moritzm>	 A lot of things have changed since T206448 was created
[08:18:07] <stashbot>	 T206448: Decommission script race condition - https://phabricator.wikimedia.org/T206448
[08:18:32] <moritzm>	 if this happens again (I mostly only notice these indirectly by security updates failing) I can update the task
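The ordering discussed above (best-effort puppet disable, power off, confirm the host is really off, and only then remove it from PuppetDB) could be sketched roughly as follows. This is not the actual decommission cookbook code: `power_off`, `is_powered_on`, `remove_from_puppetdb` and `disable_puppet` are hypothetical callables standing in for the chassis power off, a power status check, the `puppet node clean`/`puppet node deactivate` step, and a puppet disable. The number of checks is an assumption; the 20s delay mirrors the wait mentioned in the conversation.

```python
import time
from typing import Callable, List, Optional


class HostStillUpError(Exception):
    """Raised when the host still reports as powered on after all checks."""


def wait_until_off(is_powered_on: Callable[[], bool], attempts: int = 5,
                   delay: float = 20.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until is_powered_on() is False; return how many checks were needed."""
    for attempt in range(1, attempts + 1):
        if not is_powered_on():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise HostStillUpError(f"host still powered on after {attempts} checks")


def decommission_puppet_steps(
    host: str,
    power_off: Callable[[str], None],
    is_powered_on: Callable[[str], bool],
    remove_from_puppetdb: Callable[[str], None],
    disable_puppet: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run the steps in order and return a log of what happened.

    All callables are hypothetical placeholders, not real Spicerack APIs.
    """
    steps: List[str] = []
    if disable_puppet is not None:
        # Best-effort only: the decom must also work if the host is unreachable.
        try:
            disable_puppet(host)
            steps.append("puppet disabled")
        except Exception as exc:
            steps.append(f"puppet disable skipped: {exc}")
    power_off(host)
    steps.append("powered off")
    # Refuse to touch PuppetDB until the host is confirmed off, so a late
    # puppet run cannot re-add it after the removal.
    checks = wait_until_off(lambda: is_powered_on(host), sleep=sleep)
    steps.append(f"confirmed off after {checks} check(s)")
    remove_from_puppetdb(host)
    steps.append("removed from PuppetDB")
    return steps
```

If the host never reports as off, `HostStillUpError` is raised before the PuppetDB removal, so the failure is visible in the cookbook run instead of surfacing later through a Netbox report or failing security updates.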
[11:16:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:56] <wikibugs>	 netbox, Infrastructure-Foundations, Patch-For-Review: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) Open→Resolved All good now.
[12:12:21] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) Open→Resolved a:cmooney Closing. If we want to do it on EVPN/VXLAN devices we can revisit in future.
[13:29:12] <wikibugs>	 netops, Infrastructure-Foundations, Observability-Metrics: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 (fgiunchedi)
[13:29:33] <wikibugs>	 netops, Infrastructure-Foundations, Observability-Metrics, SRE: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 (fgiunchedi) Open→Resolved a:fgiunchedi Opened {T346759} for followups, this is done
[13:55:15] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (Volans) Sure, but wmflib is a general purpose library and shouldn't make that assumption. So I'd rather do that via a parameter so that th...
[14:01:53] <wikibugs>	 Packaging, Cloud-VPS, Infrastructure-Foundations, serviceops, cloud-services-team (FY2023/2024-Q1): Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762 (fnegri)
[14:09:05] <wikibugs>	 Packaging, Cloud-VPS, Infrastructure-Foundations, serviceops, cloud-services-team (FY2023/2024-Q1): Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762 (fnegri) Open→In progress p:Triage→High
[14:30:47] <wikibugs>	 CAS-SSO, Data-Platform-SRE, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) We successfully implemented OIDC on production datahub and auth/login seems to be working great. However there are some challenges with the user jour...
[14:43:06] <wikibugs>	 netops, Infrastructure-Foundations: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 (ayounsi) p:Triage→High
[15:17:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:13] <jinxer-wm>	 (DiskSpace) firing: Disk space krb1001:9100:/ 1.713% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:22:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:24:13] <jinxer-wm>	 (DiskSpace) resolved: Disk space krb1001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:27:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:37] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed