[00:01:31] (SystemdUnitFailed) firing: (6) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:49:37] (SystemdUnitFailed) firing: (6) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:31] (SystemdUnitFailed) firing: (6) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:52] netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) Open→Resolved The support contract is different on the old vs. new licensing, so we need to be able to verify that the proper support is appl...
[06:54:16] netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) Resolved→Open Re-opening as the LibreNMS report needs to be updated to handle those discrepancies.
[07:09:29] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (MoritzMuehlenhoff) >>! In T335879#9173531, @Volans wrote: > This leave us just with two options: > * catch the exception in the cookbooks...
[07:16:31] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:30] moritzm: https://netbox.wikimedia.org/extras/reports/results/5017462/ "furud (WMF6587) Device is in PuppetDB but is Offline in Netbox (should be Active or Failed)"
[07:32:36] that's strange, per https://phabricator.wikimedia.org/T345867#9151787 the decom cookbook set it to Decommissioning
[07:33:24] moritzm: yeah the issue is that it didn't get removed from PuppetDB
[07:34:23] it's now offline as DCops ran the offline cookbook after you https://netbox.wikimedia.org/dcim/devices/1425/changelog/
[07:34:41] I can re-run the decom cookbook, in some edge cases the puppetdb removal seems to fail, I've seen that before
[07:42:28] I ran puppet node clean/deactivate for furud, should clear it up
[07:48:47] thx!
[07:51:20] volans: does this verify it has been removed? or retries? https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/decommission.py#L398
[07:53:25] XioNoX: runs puppet node clean $host and puppet node deactivate $host via cumin and raises if non-zero exit code
[07:53:53] *but*
[07:54:38] if the puppet timer runs before the shutdown it's possible it gets re-added to puppetdb
[07:54:56] I don't recall if we try to disable puppet
[07:55:01] if not we can add it
[07:55:06] but it would be best-effort
[07:55:16] as the decom must be able to run also if the host is not reachable
[08:01:29] it happens from time to time, we have an older task about this: https://phabricator.wikimedia.org/T206448
[08:13:52] mmmh re-checking the code
[08:14:10] we already do the puppet removal after the poweroff and after waiting for 20s
[08:15:48] and we use ['chassis', 'power', 'off'] on physical hw, so it's like pulling the plug
[08:16:06] moritzm: is the host still up by any chance?
[08:16:54] if so we can add a check that the host is actually off after the power off I guess
[08:17:00] no, it was unracked pretty quickly since the server along with the storage arrays attached to it consumed half a rack
[08:17:15] got it
[08:17:40] that was the snowflake server for which I had also seen that connection error to the switch
[08:18:06] A lot of things have changed since T206448 was created
[08:18:07] T206448: Decommission script race condition - https://phabricator.wikimedia.org/T206448
[08:18:32] if this happens again (I mostly only notice these indirectly by security updates failing) I can update the task
[11:16:32] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:53:56] netbox, Infrastructure-Foundations, Patch-For-Review: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) Open→Resolved All good now.
[12:12:21] netops, Infrastructure-Foundations, SRE: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) Open→Resolved a:cmooney Closing. If we want to do it on EVPN/VXLAN devices we can revisit in future.
[13:29:12] netops, Infrastructure-Foundations, Observability-Metrics: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 (fgiunchedi)
[13:29:33] netops, Infrastructure-Foundations, Observability-Metrics, SRE: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 (fgiunchedi) Open→Resolved a:fgiunchedi Opened {T346759} for followups, this is done
[13:55:15] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (Volans) Sure, but wmflib is a general purpose library and shouldn't make that assumption. So I'd rather do that via a parameter so that th...
[14:01:53] Packaging, Cloud-VPS, Infrastructure-Foundations, serviceops, cloud-services-team (FY2023/2024-Q1): Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762 (fnegri)
[14:09:05] Packaging, Cloud-VPS, Infrastructure-Foundations, serviceops, cloud-services-team (FY2023/2024-Q1): Package mcrouter for Debian Bookworm - https://phabricator.wikimedia.org/T346762 (fnegri) Open→In progress p:Triage→High
[14:30:47] CAS-SSO, Data-Platform-SRE, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene) We successfully implemented OIDC on production datahub and auth/login seems to be working great. However there are some challenges with the user jour...
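Editor's note on the decommission discussion (07:51 to 08:18): the ordering described there (power off, wait about 20s, then run puppet node clean and puppet node deactivate, raising on a non-zero exit code) plus the proposed "check that the host is actually off" step could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual cookbook code: the function name, the DecommissionError class, and the power_off, is_powered_off, and run_command callables are all hypothetical stand-ins for whatever the real cookbook uses for power management and remote execution.

```python
import time
from typing import Callable, Sequence


class DecommissionError(Exception):
    """Raised when a step of this sketched cleanup fails (hypothetical name)."""


def puppet_cleanup_after_poweroff(
    host: str,
    power_off: Callable[[str], None],
    is_powered_off: Callable[[str], bool],
    run_command: Callable[[Sequence[str]], int],
    wait_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Power off a host, confirm it is off, then remove it from PuppetDB.

    All callables are injected placeholders (assumptions), so the ordering
    can be exercised without touching real hardware or a real Puppet server.
    """
    # Step 1: power the host off and wait, mirroring the 20s wait in the log.
    power_off(host)
    sleep(wait_seconds)

    # Step 2 (the check proposed in the chat): refuse to continue if the host
    # is still up, since a Puppet run could re-add it to PuppetDB afterwards.
    if not is_powered_off(host):
        raise DecommissionError(f"{host} is still powered on after power off")

    # Step 3: run both removal commands and raise on a non-zero exit code.
    for action in ("clean", "deactivate"):
        command = ["puppet", "node", action, host]
        exit_code = run_command(command)
        if exit_code != 0:
            raise DecommissionError(
                f"'{' '.join(command)}' exited with code {exit_code}"
            )
```

Passing the power and command helpers in as parameters keeps the sketch testable with fakes; in a real cookbook the power-state check would probably need to be skippable, since the log notes that the decom must also work when the host is not reachable.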
[14:43:06] netops, Infrastructure-Foundations: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 (ayounsi) p:Triage→High
[15:17:36] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:45] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:36] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:13] (DiskSpace) firing: Disk space krb1001:9100:/ 1.713% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:22:36] (SystemdUnitFailed) firing: (8) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:24:13] (DiskSpace) resolved: Disk space krb1001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:27:36] (SystemdUnitFailed) firing: (8) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:37] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed