[13:00:25] hello folks, there seem to be a lot of exceptions for MW
[13:00:36] most of them are related to "Could not enqueue jobs"
[13:00:50] https://logstash.wikimedia.org/goto/3a82a96e1d70a0b52db7a9b21b30327a
[13:00:59] started ~15 mins ago
[13:03:02] volans, slyngs --^
[13:04:39] elukey: thanks for the heads-up
[13:05:09] any related deployments?
[13:05:24] or changes to k8s pods
[13:05:29] I haven't found anything so far
[13:05:35] I think we did have one deployment today, but I don't know what was in it
[13:06:45] it started ~20 mins ago afaics
[13:06:54] jayme: could be related to your helm deploys?
[13:07:33] the time matches but those are for staging
[13:07:39] uhm...no. I'm just doing staging
[13:07:39] agree
[13:07:43] There is also a database repooling
[13:07:49] but the time was suspicious enough
[13:07:50] hnowlan: ?
[13:08:22] this maybe https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017819
[13:09:53] there's also a (MediaWikiHighErrorRate) firing: "Elevated rate of MediaWiki errors - kube-mw-api-int" alert that is probably related
[13:10:40] the error seems to point to the job runners but it is very high level
[13:10:41] /rpc/RunSingleJob.php JobQueueError: Could not enqueue jobs
[13:10:57] yes, it doesn't give much detail
[13:11:27] and the RED dashboard for them is flat mmm
[13:11:33] https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&var-site=All&var-deployment=mw-jobrunner&var-method=GET&var-code=200&var-handler=static&var-service=mediawiki
[13:12:46] for mw-api-int it has values
[13:12:58] and shows the same pattern of 500s
[13:13:17] started 12:47-48
[13:13:58] the metrics for the jobrunner in that dashboard are clearly wrong
[13:14:01] all flat
[13:15:45] there is an elevated amount of recordlint jobs since ~9:20
[13:18:45] jayme: do we have a way to see how the jobrunners are doing? I don't find anything in the RED dashboard, but I am probably missing where to check now
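For context on the error being chased here: as discussed further down, enqueueing a job on this setup means MediaWiki submitting the job as an event to eventgate through its local envoy sidecar, and a failure anywhere on that path surfaces as the JobQueueError above. The following is a minimal sketch of that shape only; the sidecar address, port, endpoint path, and stream naming are illustrative assumptions, not the production configuration.

```python
"""Hedged sketch: enqueue a job event via an eventgate-style HTTP intake.

The sidecar address, port, endpoint path, and stream naming below are
illustrative assumptions, not the production MediaWiki/eventgate setup.
"""
import requests


class JobQueueError(Exception):
    """Raised when a job event cannot be delivered to the intake service."""


def enqueue_job(job_type: str, params: dict,
                intake: str = "http://localhost:6004/v1/events") -> None:
    # One event per job; eventgate-style intakes accept a JSON array of events.
    event = {
        "meta": {"stream": "mediawiki.job." + job_type},  # assumed stream naming
        "type": job_type,
        "params": params,
    }
    try:
        resp = requests.post(intake, json=[event], timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Roughly the point at which MediaWiki would log
        # "JobQueueError: Could not enqueue jobs".
        raise JobQueueError(f"Could not enqueue {job_type} job") from exc


if __name__ == "__main__":
    enqueue_job("recordlint", {"page_id": 12345})  # hypothetical job payload
```

Because the error is raised on the client side of that POST, it shows up in MediaWiki's logs even when the receiving service's own dashboards look quiet, which is the pattern the discussion below keeps running into.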
[13:19:04] also, errors seem to have reduced a lot
[13:19:36] elukey: I'd assume https://grafana-rw.wikimedia.org/d/MVDqnbOVk/mw-jobrunner?orgId=1
[13:20:09] ah TIL, thanks
[13:20:20] but "Could not enqueue jobs" should be decoupled I guess
[13:20:37] as in: that should be mw submitting jobs to eventgate
[13:21:18] right yes
[13:21:59] nothing really visible on eventgate's main dashboard afaics
[13:22:04] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=30s&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All
[13:22:15] agreed
[13:22:44] but a bunch of retries here https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=All
[13:24:53] maybe related to the ferm alerts
[13:25:02] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=ferm
[13:25:18] both nodes run an eventgate-main pod
[13:26:21] I merged the change to reload the k8s ipv6 range more or less when the issue started, but the deploy wasn't of course instant, so not sure if I caused some ferm weirdness for a brief while
[13:27:02] zero errors now
[13:27:02] sorry, was afk - I merged https://gerrit.wikimedia.org/r/1017819 earlier which will increase job run duration on jobrunners
[13:27:04] Could be: Failed to run /usr/sbin/ip6tables-legacy-restore
[13:27:16] should not have had any impact on wikikube though, elukey
[13:27:51] yeah
[13:28:20] the other thing that I can think of is a network blip related to some row or specific subset of nodes, but I haven't checked logstash yet
[13:29:50] https://grafana.wikimedia.org/goto/a-tYEk-Sg?orgId=1
[13:29:56] eventgate connection issues again
[13:30:24] yeah, pasted that some minutes ago. It pretty much matches the timeframe
[13:30:32] oh, heh, sorry
[13:30:40] np :)
[13:31:15] that's why I was suspecting ferm. Maybe it terminated persistent connections to a couple of eventgate pods
[13:31:56] We can try restarting ferm on kubernetes1029, that should be running anyway. I'll give it a go
[13:32:14] I just did so on mw1485
[13:32:33] Cool, running normally on the Kubernetes host as well now
[13:32:44] although I think puppet was faster than me
[13:32:51] By the way, that ferm issue is outstanding and doesn't have a clear path to resolution, because ferm does not include the iptables flag to wait for the xtables lock to be released
[13:33:08] And kubernetes is very frequently holding that lock
[13:33:32] https://phabricator.wikimedia.org/T354855
[13:34:33] I've marked it resolved because we band-aided it with a puppet restart, but it would appear it's insufficient?
[13:37:18] It could be helpful to know what kind of job failed to be enqueued in the mw logs, not only the stacktrace
[13:37:30] almost sure that it is super difficult to do
[13:38:04] but it would point us in the right direction, like: from the envoy errors I see eventgate-analytics having troubles with its sidecar, not main
[13:38:41] Yeah, I was looking at the logs and "Unable to deliver all events" is probably correct, but which ones
[13:39:17] no, my bad, I see now that there are errors for main as well
[13:40:27] hnowlan: have you experienced the eventgate connection issues before? Because I don't see errors in the main eventgate dashboard
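The ferm failure mode mentioned around 13:32 (T354855) is that ferm shells out to ip(6)tables-legacy-restore without the flag that waits for the xtables lock, which Kubernetes components frequently hold, so rule reloads can fail outright. Below is a rough sketch of the two usual workarounds, assuming the installed iptables supports the -w/--wait option on the restore commands; the rules path and retry parameters are hypothetical examples, not what ferm or puppet actually use here.

```python
"""Hedged sketch of working around the xtables lock race described above.

Assumes the installed iptables version supports -w/--wait on the
*-restore commands; the rules path and retry parameters are placeholders.
"""
import subprocess
import time

RULES_FILE = "/etc/ferm/ip6tables.rules"  # hypothetical path


def restore_with_wait(rules_path: str = RULES_FILE) -> None:
    """Let iptables itself block until the xtables lock is free."""
    with open(rules_path, "rb") as rules:
        # -w waits for the lock instead of failing immediately when
        # another process (e.g. a Kubernetes component) holds it.
        subprocess.run(
            ["/usr/sbin/ip6tables-legacy-restore", "-w"],
            stdin=rules,
            check=True,
        )


def restore_with_retries(rules_path: str = RULES_FILE,
                         attempts: int = 5, delay: float = 2.0) -> None:
    """Fallback when -w is unavailable: retry while the lock is held."""
    for attempt in range(1, attempts + 1):
        with open(rules_path, "rb") as rules:
            result = subprocess.run(
                ["/usr/sbin/ip6tables-legacy-restore"],
                stdin=rules,
                capture_output=True,
            )
        if result.returncode == 0:
            return
        if attempt < attempts:
            time.sleep(delay)  # back off and hope the lock holder is done
    raise RuntimeError(f"ip6tables-legacy-restore kept failing: {result.stderr!r}")
```

The puppet-triggered ferm restart used during the incident is effectively the retry variant done by hand, which is why the band-aid works but the underlying race remains open.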
[13:41:05] elukey: yeah, it's happened many times https://phabricator.wikimedia.org/T249745
[13:41:10] but it never manifests in eventgate itself
[13:41:14] I think eventgate won't notice in that case
[13:41:29] as envoy on the client side will retry
[13:45:11] mmm wait, I am not following - IIUC MW reports "Could not enqueue jobs" because something failed when calling eventgate
[13:45:54] yeah, the mw envoy reports an error back to mw in that case
[13:46:14] but I *think* the failed requests never reached the envoy of eventgate
[13:46:35] sure, but I'd expect to see some http errors reported by nodejs/eventgate in the dashboard
[13:47:11] but those would only appear if the requests reached eventgate
[13:47:30] ah okok, you mean the envoy in front of node doing tls termination
[13:47:31] okok
[13:47:44] yes
[13:48:18] maybe it could be nice to add those graphs to the eventgate dashboard, if they are not there already
[13:49:17] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=eventgate-analytics&var-destination=eventgate-analytics-external&var-destination=eventgate-main&from=1712580304370&to=1712583300656&viewPanel=30
[13:50:31] +1, makes some sort of sense now
[13:53:23] claime: I'm not sure this really is related to the ferm alerts. There are enough eventgate pods really - even if the two nodes were down. And I guess we would see *a lot* more issues in that case
[14:01:38] I've added a comment to https://phabricator.wikimedia.org/T249745
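Since the thread concludes that these failures are only visible in the client-side envoy telemetry (retries and connection failures from the MediaWiki sidecars) and never in eventgate's own service metrics, one way to watch for a recurrence is to query those client-side counters directly. The sketch below goes through the Prometheus HTTP API; the Prometheus URL and the label names (app, envoy_cluster_name) are assumptions about the local setup, while envoy_cluster_upstream_rq_retry and envoy_cluster_upstream_cx_connect_fail are standard Envoy cluster stats.

```python
"""Hedged sketch: check client-side envoy counters via the Prometheus HTTP API.

The Prometheus URL and label names are assumptions about the local setup;
the two metric names are standard Envoy cluster statistics.
"""
import requests

PROMETHEUS = "http://prometheus.example.internal/api/v1/query"  # hypothetical

QUERIES = {
    "retries": 'sum(rate(envoy_cluster_upstream_rq_retry{'
               'app="mediawiki",envoy_cluster_name=~"eventgate.*"}[5m]))',
    "connect_failures": 'sum(rate(envoy_cluster_upstream_cx_connect_fail{'
                        'app="mediawiki",envoy_cluster_name=~"eventgate.*"}[5m]))',
}


def check_client_side_errors() -> dict:
    """Return current per-second rates of retries and connection failures."""
    rates = {}
    for name, promql in QUERIES.items():
        resp = requests.get(PROMETHEUS, params={"query": promql}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        # An empty result just means no matching series, i.e. a zero rate --
        # which is exactly how the eventgate dashboard looked during the incident.
        rates[name] = float(results[0]["value"][1]) if results else 0.0
    return rates


if __name__ == "__main__":
    for metric, rate in check_client_side_errors().items():
        print(f"{metric}: {rate:.2f}/s")
```

Adding panels based on these client-side counters to the eventgate dashboard, as suggested at 13:48, would make this failure mode visible without having to know to look at the envoy telemetry dashboard separately.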