[08:42:42] is somebody working on the general puppet failure?
[08:43:37] servers had issues reaching puppetserver1003.eqiad.wmnet:8140
[08:49:11] server doesn't seem particularly stressed: https://grafana.wikimedia.org/goto/Rqrp_U1NR?orgId=1
[08:49:17] (puppetserver1003)
[08:49:30] but errors are happening even within eqiad
[08:53:01] puppetserver1003 is logging things like '2025-04-25T08:51:27.423Z ERROR [qtp929272992-24290334] [p.r.core] Internal Server Error: org.eclipse.jetty.io.EofException: Early EOF'
[08:53:38] yep... something is definitely off with puppetserver1003
[08:54:28] https://librenms.wikimedia.org/graphs/to=1745571000/id=30779/type=port_bits/from=1745484600/ shows a fairly sharp drop in traffic
[08:54:35] I noticed that too
[08:56:30] first such log is at 2025-04-24T23:50:01.267Z
[08:56:54] checking https://grafana.wikimedia.org/goto/k-Bu_U1Ng?orgId=1 it looks like even prometheus has issues scraping metrics from puppetserver1003
[08:57:10] look at how data points are missing (and it doesn't happen with other puppetservers in eqiad)
[09:01:17] probably unrelated, but puppetserver1002 is showing this on motd (puppetserver needs restarting, check /run/puppetserver/restart_required)
[09:01:26] 1001 and 1003 aren't
[09:02:52] hnowlan: assuming you're on-call, we would need some coordination here :)
[09:03:30] o/ I'll start a doc?
[09:03:50] yep.. especially for follow-up actions
[09:04:11] p50 HTTP response time on puppetserver1003 is over 5 minutes now and I'm not seeing any kind of alert being triggered
[09:05:08] I think I'll restart puppetserver on puppetserver1003 unless you have any objections
[09:05:18] sgtm
[09:06:39] created, backfilling now https://docs.google.com/document/d/1z8TA9gu0hOKZonYfppFx9AuYglczTEbb0dfxuu3bTZM/edit?tab=t.0
[09:06:50] given puppet clients are stuck waiting on slow responses from puppetserver1003, dunno how effective disabling puppet fleet-wide first would be
[09:07:08] let me try with cumin...
[09:10:26] (ongoing... as soon as it gets stuck I'll restart puppetserver@puppetserver1003)
[09:11:30] ack
[09:14:41] failures on disable-puppet are starting to come back.. probably hosts with puppet still running hitting timeouts at this point
[09:18:00] I see the first error at 22:50 yesterday
[09:18:24] 2025-04-25T09:18:03.098Z INFO [async-dispatch-2] [p.s.m.master-service] Puppet Server has successfully started and is now ready to handle requests
[09:21:09] puppetserver1003 looks good now.. I'll start re-enabling puppet fleet-wide
[09:21:33] nice
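
(The remediation above boils down to: disable puppet agents fleet-wide, restart puppetserver on the stuck host, then re-enable agents once it responds again. A rough sketch of that sequence with cumin follows; the host selection syntax, batch size and the exact arguments to the disable-puppet/enable-puppet wrappers are assumptions for illustration, not a verbatim record of what was run during this incident.)

  # Sketch only: aliases, batch size and wrapper arguments are assumed, not copied from the incident.
  # 1. Stop new agent runs fleet-wide so they stop piling up on the slow server.
  sudo cumin -b 20 'A:all' 'disable-puppet "puppetserver1003 unresponsive - incident"'

  # 2. Restart the affected puppetserver.
  sudo cumin 'puppetserver1003.eqiad.wmnet' 'systemctl restart puppetserver'

  # 3. Check it is serving requests again before letting agents back in.
  sudo cumin 'puppetserver1003.eqiad.wmnet' 'journalctl -u puppetserver --since "10 min ago" | tail -n 20'

  # 4. Re-enable puppet fleet-wide, using the same message passed when disabling.
  sudo cumin -b 20 'A:all' 'enable-puppet "puppetserver1003 unresponsive - incident"'

(As [09:06:50] and [09:14:41] note, the disable step itself can fail while agents are blocked waiting on the slow server, so some disable-puppet runs time out and have to be retried.)
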
[09:24:53] I think this is a ripple effect of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138904
[09:25:26] at least the time the patch was merged aligns with the time from /run/puppetserver/restart_required
[09:25:26] HTTP requests increasing and response metrics still healthy
[09:25:42] yeah.. I noticed that commit
[09:26:20] I'll leave a comment on https://phabricator.wikimedia.org/T357900 so that we prioritise this
[09:27:38] logs still healthy and data on https://grafana.wikimedia.org/goto/cpcB9UJNR?orgId=1 looking good
[09:28:07] we definitely need to add some alerts on compile time and HTTP response time getting too high
[09:28:15] I'm restarting puppetserver on 1002 as well (and then, with some splay, the others, to avoid triggering the alert for too many puppet failures), otherwise we'll run into the same error on the other servers over the weekend
[09:28:18] indeed, noted
[09:30:25] great, thanks taavi, hnowlan and moritzm <3
[09:30:37] mysteriously puppetserver on 2001-2003 doesn't need a restart, no occurrence of /run/puppetserver/restart_required there
[09:30:51] thanks vgutierrez!
[09:31:38] moritzm: do you have any idea why that change might have had such a big impact? it seems like a fairly innocuous change
[09:33:13] I'm not really sure. The restart_required logic is needed in the first place for JRE upgrades:
[09:33:24] since puppetserver implements ruby using jruby
[09:33:42] every new version of OpenJDK subtly changes the JITed jruby
[09:34:17] so when a new JRE is rolled out, new requests that puppetserver accepts create processes using the new JRE, while existing processes keep using the old one
[09:34:40] we avoid this by always combining the JRE update with an immediate puppetserver restart
[09:35:24] now, why this happened on puppetserver100[23] (and not on puppetserver2*) I have no idea at this point
[09:35:48] thanks for that - very strange
[09:35:58] the patch was needed to fix the CA sync; it used an incorrect syntax, so the timer never kicked in
[09:37:40] but why this caused the current errors is not clear to me; once we fix T357900 we should learn immediately about the needed restarts and it should not cause any issues
[09:37:41] T357900: Alert on necessary puppetserver restarts - https://phabricator.wikimedia.org/T357900
[14:12:23] apologies, pretty sure the issue on puppetserver1003 was a side effect of me testing CRLs, for this bug https://phabricator.wikimedia.org/T392637
[14:13:07] well I actually tested that change on puppetserver1002, so that doesn't line up
[14:13:57] but I did restart puppetserver on 1003 yesterday to see if it would pick up the CRL, which it didn't
[14:20:14] I'll add some notes to the doc
[14:24:16] (not urgent/work) I got nerdsniped about T392692 - any suggestions on which phab tags would cover a task/patch which changes CSS in a template in puppet? :')
[14:24:17] T392692: Update the errorpage template to use flex - https://phabricator.wikimedia.org/T392692
[14:27:06] TheresNoTime: T383062 and T240794 (via git log on the file) suggest SRE and Traffic fwiw
[14:27:07] T383062: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062
[14:27:07] T240794: /sec-warning page: please add an HTML comment that is more easily visible to API and transport-level inspection/debugging - https://phabricator.wikimedia.org/T240794
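
(On the T357900 follow-up: the goal is to find out about a pending puppetserver restart as soon as /run/puppetserver/restart_required appears, rather than via motd days later. One possible shape for that check, assuming a node_exporter textfile collector is available - the metric name and output path here are illustrative, not the fix that was eventually implemented:)

  # Illustrative sketch: export a gauge that is 1 while a puppetserver restart is pending,
  # so Prometheus/Alertmanager can alert on it. Output path and metric name are assumptions.
  set -eu
  OUTFILE=/var/lib/prometheus/node.d/puppetserver_restart.prom

  if [ -e /run/puppetserver/restart_required ]; then
      value=1
  else
      value=0
  fi

  # Write atomically so the scraper never sees a half-written file.
  {
      echo '# HELP puppetserver_restart_required 1 if puppetserver needs a restart to pick up changes'
      echo '# TYPE puppetserver_restart_required gauge'
      echo "puppetserver_restart_required ${value}"
  } > "${OUTFILE}.$$"
  mv "${OUTFILE}.$$" "${OUTFILE}"

(A rule alerting when that gauge stays at 1, plus the compile-time and HTTP response-time alerts mentioned at [09:28:07], would have surfaced the pending restart from the 1138904 merge well before agents started timing out.)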