[09:41:56] Emperor: not sure if you saw it, but thanos-fe1001:9100 has been alerting for the weekend
[09:42:01] (SystemdUnitFailed) firing: (3) swift_dispersion_stats.service Failed on thanos-fe1001:9100
[09:49:59] sigh, thanks.
[09:51:22] Hm, cert fail, and it's a puppet7 host
[09:53:48] moritzm: thanos-fe1001 is one of the nodes you migrated to puppet7 and now swift_dispersion_stats is failing because of cert errors, are these likely related? 'urllib.error.URLError: '
[09:54:48] swift-dispersion-report fails with a lengthy backtrace ending thus
[09:59:13] moritzm: swift-dispersion-report is failing on all the thanos frontends with the certificate verify failed error
[09:59:55] (ms-fe1014 is the p7 ms frontend where we don't see that problem, so maybe it's a consequence of having moved the backends too? I don't know, but I have seen signs of TLS funkiness around the p7 migration)
[10:25:42] openssl s_client -connect -showcerts /dev/null
[10:26:01] ^-- from a thanos-frontend this gets Verification error: unable to verify the first certificate
[10:26:11] whereas e.g. cumin1001 can verify this OK
[10:27:28] that IP is thanos-swift.svc.eqiad.wmnet (and strace suggests to me that's the cert verification that's breaking swift-dispersion-report)
[12:59:31] marostegui: were you able to test the two db servers and if so do you have some roles i can try migrating?
[12:59:43] jbond: which two db servers?
[13:00:12] you mentioned db1124 and db1133 last week
[13:00:19] meeting
[13:00:34] ok no probs gonna grab some food in the mean time :)
[13:20:24] jbond: All the ones we did last week went fine
[13:20:28] Can you do pc1014 too?
[13:22:54] marostegui: sure, I'll do pc1014 now
[13:35:09] marostegui: pc1014 is migrated
[13:35:49] Cool I'll do some testing and next we should probably go for a DC master
[13:36:12] ack sgtm just tell me what and when :)
[13:36:24] We just got an alert for stats.service
[13:36:29] On pc1014
[13:36:42] jbond: ^
[13:36:47] ack looking
[13:36:55] Puppet agent stats service that is
[13:37:58] marostegui: do you mean puppet-agent-timer.timer?
[13:38:07] * jbond i don't see any failed units
[13:38:29] 14:34:49 (SystemdUnitFailed) firing: (4) prometheus_puppet_agent_stats.service Failed on pc1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:33] That's what we got
[13:39:35] marostegui: ack, you can ignore that; there is a race condition, patch is already out. also looks like it's already cleared
[13:39:56] Ah cool
[16:23:16] Opened T351653 re the TLS sadness on thanos-fe*
[16:23:17] T351653: thanos internal TLS failure after puppet 7 update - https://phabricator.wikimedia.org/T351653
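
Editor's note: the `openssl s_client` check at 10:25 can be approximated from Python, the runtime that raised the `urllib.error.URLError` in swift-dispersion-report. What follows is a minimal sketch, not the tooling used in the log. The helper names `make_context` and `check_tls` are hypothetical, and the host, port and CA bundle path are left as parameters because the log does not record them. Passing `cafile=None` uses the host's default trust store; passing a bundle path trusts only that bundle, which may help compare a failing host against a working one.

```python
import socket
import ssl


def make_context(cafile=None):
    """Build a verifying client context.

    cafile=None loads the system default trust store; a path loads only
    that CA bundle (hypothetical parameter, not taken from the log).
    """
    return ssl.create_default_context(cafile=cafile)


def check_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a TLS handshake and return (ok, detail).

    ok is True when the certificate chain and hostname verify; detail is
    the negotiated TLS version on success or a failure reason otherwise.
    """
    ctx = make_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # Must be caught before OSError, which it ultimately subclasses.
        # verify_message carries OpenSSL's reason string for the failure.
        return False, f"verify failed ({exc.verify_code}): {exc.verify_message}"
    except OSError as exc:
        # Refused connections, timeouts, DNS errors, other TLS errors.
        return False, f"connection error: {exc}"
```

Running `check_tls(host, port)` on a thanos frontend and on a host that verifies fine (cumin1001 in the log), then again with an explicit `cafile`, would show whether the difference lies in the trust store. The log itself does not establish that it does.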