[10:40:43] <_joe_> puppetservers seem to be down
[10:40:48] <_joe_> moritzm volans ^^
[10:41:02] <_joe_> I see this error
[10:41:05] <_joe_> 2024-02-19T10:15:33.774813+00:00 lists2001 puppet-agent[706550]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Exception while executing '/srv/puppet_code/environments/production/utils/get_config7.sh': Cannot run program "/srv/puppet_code/environments/production/utils/get_config7.sh" (in directory "."): error=0, Failed to exec spawn helper: pid: 3826355, signal: 11
[10:41:07] <_joe_> on node lists2001.codfw.wmnet
[10:43:11] looking
[10:43:25] from puppetboard many are failing with the same reason
[10:43:38] <_joe_> oh yeah sorry
[10:43:43] <_joe_> lists2001 was just an example
[10:43:58] the script seems to exist and work as expected on all the puppetservers https://phabricator.wikimedia.org/P57003
[10:44:11] I don't see recent changes in the git log for it
[10:44:26] not sure if it is the case here, but there could be some fallout if some alerts depend on metrics
[10:45:09] <_joe_> jynus: I thought about that too, but I can't find a reason why thanos being down would make puppetserver fail
[10:45:39] <_joe_> signal 11 is SIGSEV, right?
[10:45:43] it works again now
[10:45:50] <_joe_> *G
[10:46:01] <_joe_> so I guess memory pressure?
[10:46:18] it seems to be starting to resolve now
[10:46:57] mmh, it failed again from my test host the first time, worked on the second one
[10:47:25] there was a spike in memory used at 10:36
[10:47:45] although the issue seems to have started earlier
[10:48:44] looking at SAL, the timing matches moritzm updating gnutls28 on bookworm and me reimaging cloudweb1004
[10:50:20] segmentation fault?
[10:50:49] it started at 9:10
[10:51:31] it's not really recovering, still 33% of hosts with no resources
[10:51:46] yeah
[10:51:48] bunch of segfaults from jspawnhelper on puppetserver1001
[10:52:11] starting at 09:06:27
[10:52:24] stopped about 3 minutes ago
[10:52:27] puppetserver needs restarting check /run/puppetserver/restart_required
[10:52:32] could it be that?
[10:53:00] same timing for puppetserver2001
[10:53:03] but the file content says since Tue Feb 13 01:52:22 PM UTC 2024
[10:53:08] so not today
[10:53:21] but it cannot be the gnutls28, it is failing on bullseye hosts
[10:54:05] @9:13 on puppetserver1002: libgnutls-dane0=3.7.9-2+deb12u2 libgnutls30=3.7.9-2+deb12u2 were upgraded
[10:54:12] it's still segfaulting all over the place
[10:54:39] I'll restart puppetserver on puppetserver1001?
[10:54:45] moritzm: are you doing anything related right now? I guess we could try to restart or reboot one and see
[10:55:12] restarting puppetserver.service seems to me like a reasonable thing to try
[10:55:25] Just to confirm, I see no upgrade in the apt log for my bullseye host with the same puppet error
[10:55:39] jynus: it's not about the client
[10:55:45] on the server
[10:55:49] ah!
[10:55:54] the error is server-side
[10:56:21] I was misled because the error said "on node backup2006.codfw.wmnet"
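A rough sketch of how the jspawnhelper segfaults described above could be confirmed across the fleet; the journalctl time window and grep pattern are illustrative assumptions, not the exact commands used during the incident:

```
# Count kernel-logged segfaults from jspawnhelper on every puppetserver since
# the suspected start time (filter and timestamp are illustrative):
sudo cumin 'A:puppetserver' "journalctl -k --since '2024-02-19 09:00' | grep -c 'jspawnhelper.*segfault'"
```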
[10:56:48] claime: go ahead and restart one, can't be worse than this ;)
[10:56:57] volans: restarted on puppetserver1001
[10:57:16] looks ok so far
[10:57:40] I see runs coming through, no segfaults
[10:57:47] another run worked now for me, but I will check on other hosts
[10:58:00] jynus: that would be random
[10:58:04] until we restart all of them
[10:58:15] yeah, another failure
[10:58:21] yeah, queuing up a restart of all of them, I see no segfaults on 1001
[10:58:23] just wait until we fix it please
[10:58:39] claime: +1 roll restart, checking if we have a cookbook by any chance
[10:59:08] not that I can see :/
[10:59:19] sudo cumin -b 1 A:puppet7 'systemctl restart puppetserver.service && sleep 10'
[10:59:21] ?
[10:59:33] no
[10:59:36] that's the alias for clients
[10:59:38] on p7
[10:59:39] ah
[10:59:45] puppetserver
[10:59:46] A:puppetserver
[10:59:53] yeah
[10:59:59] why sleep 10?
[11:00:01] use the cumin one
[11:00:08] -s 10
[11:00:40] thx
[11:01:14] claime: let us know when it's completed
[11:01:18] sure
[11:04:57] https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-5m&to=now&viewPanel=2&refresh=30s slowly recovering for now
[11:06:05] run 4/6 done
[11:08:13] volans: done
[11:08:25] thx
[11:08:45] claime: should we force a run with https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ?
[11:08:49] So I guess don't update gnutls without restarting puppetserver?
[11:09:02] volans: yeah, probably, b5?
[11:09:04] b10?
[11:09:32] meh, the doc says b20, going with that
[11:09:36] on the old puppetservers we could also handle a bigger -b than the -b20 reported there
[11:09:51] but now with the split puppetmaster/puppetserver setup I'm not sure, -b20 should be ok
[11:10:04] running b20
[11:10:19] OK to proceed on 2179 hosts? lol
[11:10:31] here we go
[11:10:47] 🔥
[11:10:54] don't worry, most will be skipped :D
[11:11:09] I know :D
[11:11:17] I think --failed-only will be executed on all hosts, but puppet actually runs only on those with a failed run?
[11:11:20] sorry, was in a meeting, checking backscroll
[11:11:35] jelto: correct
[11:12:12] ok! we are currently at 23% of puppet agents failed with no resources
[11:12:13] I'm still seeing 'Error: Connection to https://puppetserver2003.codfw.wmnet:8140/puppet/v3 failed, trying next route: Request to https://puppetserver2003.codfw.wmnet:8140/puppet/v3 timed out connect operation after 60.001 seconds'
[11:13:23] taavi: yeah, there are some agent failures
[11:13:32] we'll see how it looks once the cumin run is done
[11:14:01] they should be caught by the --failed-only
[11:14:16] there are still a handful of codfw cloud ones which might fail for puppetserver2003 specifically
[11:14:28] what is special about 2003?
[11:14:51] the Netbox script which generated the homer term to allow access to the puppet master port timed out
[11:14:56] ah
[11:15:19] but that is specific to a handful of cloud test hosts
[11:16:13] claime: the derivative of the metric looks good, should go back to normal in the end :D
[11:16:37] https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-30m&to=now&viewPanel=2&refresh=30s
[11:17:29] moritzm: looks like that script finished on Friday and the output looks good. let me try running homer
[11:18:56] ah, maybe it worked in a later attempt? thanks
[11:20:46] the diff on cr*-codfw is removing debmonitor1002/2002 and replacing puppetmaster2003 with puppetserver2003, which looks good to me
[11:21:21] taavi: lgtm
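For reference, the remediation agreed on above roughly amounts to the following two cumin invocations; this is a reconstruction from the conversation (aliases and flags as discussed in the channel), not a verbatim copy of what was run:

```
# 1. Rolling restart of the Puppet servers, one host per batch, 10s between batches:
sudo cumin -b 1 -s 10 'A:puppetserver' 'systemctl restart puppetserver.service'

# 2. Re-run the agent only where the last run failed, 20 hosts per batch;
#    -p 95 tolerates up to 5% failures so a few broken hosts don't abort the run:
sudo cumin -b 20 -p 95 'A:puppet7' 'run-puppet-agent -q --failed-only'
```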
[11:21:32] the segfault is puzzling, get_config7 only spawns git, I'll poke at it a little deeper
[11:22:10] moritzm: 2 things to keep in mind: the puppetservers had been saying they needed a restart for a few days
[11:22:21] and it seems to have started with this morning's package upgrade
[11:22:59] deployed
[11:24:23] this seems to be some internal mechanism of puppetserver, debdeploy flagged no needed restarts for libgnutls for the puppetserver hosts per se, which makes sense given puppetserver is written in Java/Clojure
[11:24:27] will have a closer look
[11:26:22] yeah, there was this message in the motd: puppetserver needs restarting check /run/puppetserver/restart_required
[11:26:44] and the content of the file gave Tue Feb 13 01:52:22 PM UTC 2024 as the date since which the restart had been pending
[11:26:49] should we alert on this?
[11:27:06] under 10% failed puppet agent runs
[11:27:35] and there was no package update on Feb 13 on that host
[11:27:36] * Emperor deletes piles of emails about lack of puppet resources
[11:28:01] why do you have emails? :D
[11:28:52] volans: for which puppetserver was that?
[11:29:20] puppetserver1002
[11:29:58] volans: sre-observability emails data-persistence about a lot of things ATM :-/
[11:33:06] 0.3% (7/2179) of nodes failed to execute command 'run-puppet-agent -q --failed-only': an-worker1088.eqiad.wmnet,cloudgw2003-dev.codfw.wmnet,cloudnet2005-dev.codfw.wmnet,cloudvirt1032.eqiad.wmnet,kubestagemaster2002.codfw.wmnet,ncmonitor1001.eqiad.wmnet,sretest2005.codfw.wmnet
[11:33:37] yay! some hosts might have already been failing for different reasons
[11:33:42] we always have a few outliers
[11:36:37] /run/puppetserver/restart_required is internal to our puppetisation: a notify of puppetserver.service calls the exec that adds the /run/puppetserver/restart_required state file, so we should indeed alert on this
[11:37:09] there are quite a number of puppetserver config files which could trigger this
[11:41:21] the change to ca.conf.epp included in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002385 triggered the "puppetserver needs restarting check /run/puppetserver/restart_required" note in the motd
[11:41:44] got it
[11:42:00] I'll open a task to alert on this
[11:42:12] thx
[11:44:42] puppet agent failures look fine again, below 1%
[11:44:56] yeah, I'm manually running the stragglers
[11:46:18] great, thx
[11:46:32] claime: you don't "have" to do it, they might have been failing previously for various reasons unrelated to this
[11:48:01] volans: the reason I'm doing it is that some hosts showing up as failed in Prometheus didn't fail the run via cumin, for some reason
[11:48:07] So I want to make sure there isn't another issue
[11:49:05] you ran it with -p 95, so it allowed a 5% failure rate, exactly to let the command run to the end and not stop because a few hosts fail, as we always have some failures
[11:49:11] yep
[11:49:13] for broken disks or WIP or hosts being set up
[11:49:45] What I'm saying is that, for example, kubernetes2015 was still showing up as failed in Prometheus
[11:49:56] But it didn't show up in the list of nodes that failed to execute the run
[11:50:12] https://puppetboard.wikimedia.org/node/kubernetes2015.codfw.wmnet
[11:50:17] it succeeded in the last 30m
[11:50:35] yes, I just ran it...
[11:51:30] at :47 or at :24?
[11:51:36] it was already ok at :24
[11:51:36] 47
[11:51:45] not sure why prometheus wasn't happy
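A hypothetical spot-check of a single straggler such as kubernetes2015, as discussed above; the exact commands used are not in the log, so this is only a sketch built from the host name and wrapper flags mentioned earlier:

```
# Re-run the agent on one straggler and confirm it now reports cleanly:
sudo cumin 'kubernetes2015.codfw.wmnet' 'run-puppet-agent -q --failed-only'
# Then check the latest report in Puppetboard:
#   https://puppetboard.wikimedia.org/node/kubernetes2015.codfw.wmnet
```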
[11:52:12] I'm going to remove the silence
[11:52:16] +1
[11:52:45] ack
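The alert discussed at 11:26 and 11:42 could boil down to a file-existence check on each puppetserver; a minimal sketch only, since the check's name, its wiring into monitoring, and the output format are all assumptions:

```
#!/bin/bash
# Minimal check: warn if a puppetserver restart has been pending (sketch only).
if [ -f /run/puppetserver/restart_required ]; then
    echo "WARNING: puppetserver restart pending since $(cat /run/puppetserver/restart_required)"
    exit 1
fi
echo "OK: no puppetserver restart pending"
exit 0
```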