[00:24:36] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[00:24:37] <jinxer-wm>	 FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[02:05:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:36] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[04:24:37] <jinxer-wm>	 FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[06:05:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:36] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[08:24:37] <jinxer-wm>	 FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[08:44:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: krb5-kdc.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:14] <moritzm>	 ^ krb1002 is in setup
[10:21:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:39:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:36] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:24:37] <jinxer-wm>	 FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[12:27:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[12:44:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:56] <federico3>	 I'm seeing a TLS cert error at https://puppetboard.wikimedia.org/report/db1178.eqiad.wmnet/2648b4e15246c9ba5bf24ad499312c438f1f2045 when db1178.eqiad.wmnet is trying to reach  https://puppetserver1001.eqiad.wmnet:8140/puppet/...     -  curl-ing from the same host is not triggering cert errors at the moment - was there a transient issue with the cert perhaps?
[13:09:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:23:10] <federico3>	 elukey:  perhaps you know about the error above? 
[14:24:58] <elukey>	 federico3: I can check, but I've never seen it.. has the host been reimaged or similar recently?
[14:26:52] <federico3>	 elukey: it's been up a long time but rebooted 10 days ago
[14:27:20] <elukey>	 mmmm it seems to me that somehow puppetserver1001 may be in trouble responding for some reason, and db1178 gets the failures
[14:27:34] <elukey>	 it is not consistent with all the puppet runs afaics
[14:28:19] <elukey>	 nothing out of the ordinary from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetserver1001&var-datasource=thanos&var-cluster=puppet&from=now-24h&to=now
[14:28:40] <federico3>	 curl 'https://puppetserver1001.eqiad.wmnet:8140/' is successful from a cert PoV (getting 404)
[14:30:43] <federico3>	 the first cert error seems to appear at 2025-04-17T00
[14:30:58] <jhathaway>	 I'm happy to take a look as well federico3 
[14:31:16] <elukey>	 Hey Jesse o/
[14:31:20] <elukey>	 please go ahead :)
[14:31:24] <jhathaway>	 nod
[14:31:27] <jhathaway>	 will do!
[14:31:54] <elukey>	 I am wondering if this happens for other hosts as well, may be a sign of puppetserver reaching max capacity?
[14:32:07] <jhathaway>	 could be
[14:32:10] <federico3>	 (could it be that the certs have been rotated and the host is using an old CA cert?)
[14:36:07] <elukey>	 we are not rotating certs that frequently IIRC, and I'd expect errors to happen only once in a while
[14:36:14] <elukey>	 this one seems more consistent
[14:44:31] <federico3>	 this is the hourly error count https://phabricator.wikimedia.org/P75448
[14:44:41] <jhathaway>	 it appears to be only and issue with puppetserver1001.eqiad.wmnet
[14:44:44] <jhathaway>	 *an
[14:45:08] <jhathaway>	 with 1002 & 1003 it works fine
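The hourly error count federico3 links above (P75448) is not reproduced in the channel. A minimal sketch of how such a count could be produced is below, assuming log lines that start with an ISO-style timestamp such as `2025-04-17T00:...`. The marker string, the function name and the sample lines are hypothetical and are not taken from the actual puppet logs.

```python
import re
from collections import Counter

# Hypothetical marker: the real error text is not quoted in the channel.
ERROR_MARKER = "certificate verify failed"

# Capture the date plus the hour, e.g. "2025-04-17T00".
HOUR_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")


def hourly_error_counts(lines, marker=ERROR_MARKER):
    """Count lines containing `marker`, grouped by the hour in their timestamp."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = HOUR_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    # Sort by hour so the output reads chronologically.
    return dict(sorted(counts.items()))


# Made-up sample lines, for illustration only.
sample = [
    "2025-04-17T00:12:03Z ERROR certificate verify failed",
    "2025-04-17T00:42:10Z ERROR certificate verify failed",
    "2025-04-17T01:12:05Z INFO run ok",
    "2025-04-17T01:42:07Z ERROR certificate verify failed",
]

for hour, count in hourly_error_counts(sample).items():
    print(hour, count)
```

Running the same count separately against each puppetserver's logs would be one way to confirm that the errors are limited to puppetserver1001, as jhathaway reports.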
[15:30:44] <wikibugs>	 10netops, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10765458 (10Jgreen) 05Invalid→03Resolved
[16:24:36] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:24:37] <jinxer-wm>	 FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[16:27:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[16:34:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:37:25] <jinxer-wm>	 RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[17:49:37] <federico3>	 jhathaway: should I open a task to track this?
[17:50:26] <jhathaway>	 sure, I think I have the cause figured out, but there are a few broken pieces
[18:21:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:23:12] <wikibugs>	 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627 (10jhathaway) 03NEW
[19:23:57] <wikibugs>	 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766249 (10jhathaway) p:05Triage→03Medium a:03jhathaway
[19:24:43] <wikibugs>	 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766252 (10jhathaway) Only occurs on puppetserver1001.eqiad.wmnet, cert was revoked on April 14th:  ` puppetserver-2025-04-14.0.log.gz:2025-04-14T07:26:35.169Z INFO  [qtp1905171892-12616218] [p.p.certificate-authority] Rev...
[19:26:06] <wikibugs>	 07Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628 (10jhathaway) 03NEW
[19:26:18] <wikibugs>	 07Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628#10766268 (10jhathaway) p:05Triage→03Medium
[19:30:19] <wikibugs>	 07Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629 (10jhathaway) 03NEW
[19:30:27] <wikibugs>	 07Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10766295 (10jhathaway) p:05Triage→03Medium
[20:24:36] <jinxer-wm>	 FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[20:24:37] <jinxer-wm>	 FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[20:34:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:54:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:50] <wikibugs>	 07Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637 (10jhathaway) 03NEW
[21:21:25] <wikibugs>	 07Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766678 (10jhathaway) p:05Triage→03Medium
[21:42:28] <wikibugs>	 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766729 (10jhathaway) 05Open→03Resolved I opened subtasks for the issues discovered when looking at this issue, the server certificate itself has been regenerated, however why the cert was revoked in the first plac...
[21:42:54] <wikibugs>	 07Puppet: Non-ca puppetservers do not check the CA certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766732 (10jhathaway)
[22:21:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure