[00:24:36] <jinxer-wm> FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:24:37] <jinxer-wm> FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[02:05:26] <jinxer-wm> FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:36] <jinxer-wm> FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:24:37] <jinxer-wm> FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[06:05:26] <jinxer-wm> FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:26] <jinxer-wm> RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:36] <jinxer-wm> FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[08:24:37] <jinxer-wm> FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[08:44:25] <jinxer-wm> FIRING: SystemdUnitFailed: krb5-kdc.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:25] <jinxer-wm> FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:25] <jinxer-wm> FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:14] <moritzm> ^ krb1002 is in setup
[10:21:48] <jinxer-wm> FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:39:25] <jinxer-wm> FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:36] <jinxer-wm> FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:24:37] <jinxer-wm> FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[12:27:25] <jinxer-wm> FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[12:44:25] <jinxer-wm> FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:56] <federico3> I'm seeing a TLS cert error at https://puppetboard.wikimedia.org/report/db1178.eqiad.wmnet/2648b4e15246c9ba5bf24ad499312c438f1f2045 when db1178.eqiad.wmnet is trying to reach https://puppetserver1001.eqiad.wmnet:8140/puppet/... - curl-ing from the same host is not triggering cert errors at the moment - was there a transient issue with the cert perhaps?
[13:09:25] <jinxer-wm> FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:25] <jinxer-wm> FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:48] <jinxer-wm> FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:23:10] <federico3> elukey: perhaps you know about the error above?
[14:24:58] <elukey> federico3: I can check, but never seen it.. has the host being reimage or similar recently?
[14:26:52] <federico3> elukey: it's been up a long time but rebooted 10 days ago
[14:27:20] <elukey> mmmm it seems to me that somehow puppetserver1001 may be in trouble responding for some reason, and db1178 gets the failures
[14:27:34] <elukey> it is not consistent with all the puppet runs afaics
[14:28:19] <elukey> nothing out of the ordinary from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetserver1001&var-datasource=thanos&var-cluster=puppet&from=now-24h&to=now
[14:28:40] <federico3> curl 'https://puppetserver1001.eqiad.wmnet:8140/' is successful from a cert PoV (getting 404)
[14:30:43] <federico3> the first cert error seems to appear at 2025-04-17T00
[14:30:58] <jhathaway> I'm happy to take a look as well federico3
[14:31:16] <elukey> Hey Jesse o/
[14:31:20] <elukey> please go ahead :)
[14:31:24] <jhathaway> nod
[14:31:27] <jhathaway> will do!
[14:31:54] <elukey> I am wondering if this happens for other hosts as well, may be a sign of puppetserver reaching max capacity?
[14:32:07] <jhathaway> could be
[14:32:10] <federico3> (could it be that the certs has been rotated and the host is using an old CA cert?)
[14:36:07] <elukey> we are not rotating certs that frequently IIRC, and I'd expect errors to happen only once in a while
[14:36:14] <elukey> this one seems more consistent
[14:44:31] <federico3> this is the hourly error count https://phabricator.wikimedia.org/P75448
[14:44:41] <jhathaway> it appears to be only and issue with puppetserver1001.eqiad.wmnet
[14:44:44] <jhathaway> *an
[14:45:08] <jhathaway> with 1002 & 1003 it works fine
[15:30:44] <wikibugs> netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10765458 (Jgreen) Invalid→Resolved
[16:24:36] <jinxer-wm> FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:24:37] <jinxer-wm> FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[16:27:25] <jinxer-wm> FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[16:34:25] <jinxer-wm> FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:37:25] <jinxer-wm> RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[17:49:37] <federico3> jhathaway: should i open a task to track this?
[17:50:26] <jhathaway> sure, I think I have the cause figured out, but there are a few broken pieces
[18:21:48] <jinxer-wm> FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:23:12] <wikibugs> Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627 (jhathaway) NEW
[19:23:57] <wikibugs> Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766249 (jhathaway) p:Triage→Medium a:jhathaway
[19:24:43] <wikibugs> Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766252 (jhathaway) Only occurs on puppetserver1001.eqiad.wmnet, cert was revoked on April 14th: ` puppetserver-2025-04-14.0.log.gz:2025-04-14T07:26:35.169Z INFO [qtp1905171892-12616218] [p.p.certificate-authority] Rev...
[19:26:06] <wikibugs> Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628 (jhathaway) NEW
[19:26:18] <wikibugs> Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628#10766268 (jhathaway) p:Triage→Medium
[19:30:19] <wikibugs> Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629 (jhathaway) NEW
[19:30:27] <wikibugs> Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10766295 (jhathaway) p:Triage→Medium
[20:24:36] <jinxer-wm> FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[20:24:37] <jinxer-wm> FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[20:34:25] <jinxer-wm> FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:25] <jinxer-wm> FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:54:25] <jinxer-wm> FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:50] <wikibugs> Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637 (jhathaway) NEW
[21:21:25] <wikibugs> Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766678 (jhathaway) p:Triage→Medium
[21:42:28] <wikibugs> Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766729 (jhathaway) Open→Resolved I opened subtasks for the issues discovered when looking at this issue, the server certificate itself has been regenerated, however why the cert was revoked in the first plac...
[21:42:54] <wikibugs> Puppet: Non-ca puppetservers do not check the CA certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766732 (jhathaway)
[22:21:48] <jinxer-wm> FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure