[10:12:11] hey folks
[10:12:27] on deploy1002 puppet has been failing since an hour ago, and the error is weird
[10:12:30] Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Systemd::Timer::Job[httpbb_kubernetes_hourly]:
[10:12:55] the only relevant thing that I see is https://gerrit.wikimedia.org/r/c/operations/puppet/+/868212 but it was merged two days ag
[10:12:58] *ago
[10:16:43] ok that's my fault elukey
[10:16:44] checking
[10:17:02] yeah, was going to say https://github.com/wikimedia/operations-puppet/commit/dfb1ae3003f29c09b12470a924cc96e7813d30ca
[10:17:12] 'xactly
[10:17:31] ahhh I lost that commit!
[10:17:36] okok makes sense
[10:17:52] Make PCC happy but elukey unhappy
[10:17:53] :)
[10:18:08] XioNoX: with claime? No way
[10:18:14] elukey: <3
[10:18:27] I'll fix it pronto
[10:18:59] the only rivalry between French and Italias is for soccer and wine
[10:19:07] *Italians
[10:19:19] the rest doesn't matter :D
[10:19:26] That's because Italy sucks at rugby btw :p
[10:19:52] claime: I was being polite and you ruined it, I am going to send you half wikilove this time :D
[10:19:59] elukey: lmao
[10:20:04] hahahaha
[10:28:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/889764 < fix
[10:33:13] elukey: merged, applied, good to go.
[10:35:38] super thanks!
[10:44:25] jelto: while checking another puppet issue, I stumbled upon a puppet issue with planet1002.eqiad.wmnet, fix up here https://gerrit.wikimedia.org/r/c/operations/puppet/+/889767/
[10:45:39] claime: thanks a lot, I'll take a look in a sec
[10:45:46] no problem
[10:47:22] Hi 👋... (full message at )
[11:44:28] ~/45
[16:40:01] XioNoX: can you think of anything that would make networking sketchy for a sessionstore host after a reboot? To be specific, something that would make it sketchy for a period of 15 minutes after a reboot (https://phabricator.wikimedia.org/T327954#8620319)?
[16:40:50] have you checked if that aligns with puppet runs?
[16:41:18] volans: I have not.
[16:41:27] but I can.
[16:41:52] volans: but isn't puppet run immediately after a reboot?
[16:42:00] I just wonder if after a reboot you need 2 puppet runs for some reason and so the @reboot + the next one
[16:42:29] (and yes it is run at reboot)
[16:47:23] volans: yeah, the timing is suspicious
[16:47:29] (looking at the logs)
[16:47:35] if it's less than 24h ago
[16:47:40] or you can repro
[16:47:49] the changes are in puppetboard
[16:47:52] with seconds
[16:47:54] or local syslog
[16:48:08] yeah, i looked at the local logs
[16:48:32] (*within seconds)
[17:04:00] btullis: what's the status of kafka-stretch2002? it's sending one email to root@ every day since Jan 10
[17:14:48] volans: not quite sure Puppet runs line up well here after all. The first one seemed very close, the second one is off by 4 minutes (Puppet only ran 4 minutes after the node righted itself).
[17:15:05] ack
[17:15:21] but the 2nd run has any changes that might be related to network?
[17:15:46] "unchanged"
[17:16:06] then hardly at fault :)
[17:16:18] I'm grasping for anything here
[17:17:17] the node has connectivity, clients can connect, you can open an ssh session, other nodes see the rebooted node's heartbeats, but it misses heartbeats from other nodes
[17:17:46] and 15 minutes later (+/- seconds) it just automagically starts working
[17:18:00] and only after a reboot... restarting the service seems OK
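
A minimal sketch of the cross-check discussed above (line up puppet runs with the recovery time via puppetboard, which has per-second timestamps, or local syslog). It assumes the standard puppet-agent syslog tag and default Debian log paths; the exact unit and log names on the host may differ.

    # When did the host come back up?
    last reboot | head -n 3
    # When did puppet-agent finish each run after the boot?
    grep -h puppet-agent /var/log/syslog /var/log/syslog.1 | grep 'Applied catalog'
    # Lines mentioning Stage[main] between two "Applied catalog" entries are
    # resources that actually changed; a run with none of them is what
    # puppetboard reports as "unchanged".
    grep -h puppet-agent /var/log/syslog /var/log/syslog.1 | grep 'Stage\[main\]'

If the recovery timestamp falls on an "unchanged" run, as it apparently does here, the run is a coincidence rather than a cause.
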
[17:18:32] misses heartbeats from other nodes... but never *all* of the other nodes
[17:19:07] oh, and restarting the service while it's in this state does nothing... 15 minutes have to pass
[17:22:26] Did you try a tcpdump to see if the heartbeats at least arrive to the host?
[17:27:39] No, I was focusing on ensuring I had the circumstances down pat to reproduce, and only started to put some of these things together afterward
[17:28:41] could it be a negative cache that when rebooting the host is down and the other clients cache that fact for 15m?
[17:28:43] at which point "damn, I wish I'd grabbed a tcpdump" was pretty high on my list of regrets
[17:28:57] eheheh
[17:29:16] volans: I don't think so, as far as they are concerned it *is* up and healthy
[17:30:13] whatever the situation, it's asymmetric, the rebootee's view of the topology doesn't match the rest of the cluster, including an accurate state of that node
[17:30:57] the logs suggest it's not receiving the gossip heartbeat from those other nodes
[17:31:18] did ferm start ok at boot?
[17:31:56] yeah, that was one of the other items on my list of regrets, after putting the environment back together
[17:32:07] but there are no errors logged
[17:32:30] /var/log/syslog.1:Feb 15 20:27:08 sessionstore2001 ferm[835]: Starting Firewall: fermError in /etc/ferm/conf.d/10_cassandra-cql line 10:
[17:32:31] and you can ssh in, and clients can connect to Cassandra (if they couldn't, there'd be no errors)
[17:32:39] !!??
[17:32:57] that's on sessionstore2001, not sure if at the time of the reboot
[17:34:23] in that case it looks like a temporary failure to resolve sessionstore1001.eqiad.wmnet
[17:35:36] it is at the time of reboot (one of them)
[17:36:14] oh, that's why I missed it... I went looking on a subsequent reboot
[17:36:34] yeah, and that rule is for tcp/9042, which would prevent client connections
[17:37:01] *but* usually when ferm fails it fails to apply all rules
[17:37:22] and IIRC that's all open unless we changed something recently~is
[17:37:24] *~ish
[17:37:46] all open?
[17:38:02] -J ALLOW and not -J DROP
[17:39:05] input defaults to DROP
[17:39:24] Then...
[17:41:12] I think that that's only after ferm runs
[17:41:19] but I might be wrong
[17:42:25] either way, that one error seems exceptional, ferm succeeded on the other examples of this
[17:42:37] (so probably not the culprit)
[17:43:10] ack
[17:46:17] ack
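
A rough capture plan for the next reproduction, covering both of the "regrets" mentioned above (the tcpdump and the ferm check). The interface name is a placeholder, and tcp/7000 is assumed as the inter-node gossip port because that is Cassandra's stock default; the actual port on these hosts may differ.

    # 1. Are the other nodes' gossip heartbeats reaching the host at all?
    tcpdump -ni eno1 'tcp port 7000' -w /tmp/gossip-after-reboot.pcap
    # 2. Did ferm come up cleanly on this boot, or did a rule (e.g. the
    #    tcp/9042 one in 10_cassandra-cql) fail to resolve at boot time?
    journalctl -b -u ferm --no-pager
    # 3. What is the effective INPUT policy and rule set right now?
    #    "-P INPUT DROP" with missing ACCEPT rules would make the
    #    asymmetric reachability less mysterious.
    iptables -S INPUT | head

If the pcap shows the peers' gossip packets arriving while Cassandra still logs missing heartbeats, the problem sits above the network layer; if they never arrive, the firewall or routing side becomes the prime suspect.
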