[13:00:25] hello folks, there seem to be a lot of exceptions for MW
[13:00:36] most of them are related to "Could not enqueue jobs"
[13:00:50] https://logstash.wikimedia.org/goto/3a82a96e1d70a0b52db7a9b21b30327a
[13:00:59] started ~15 mins ago
[13:03:02] volans, slyngs --^
[13:04:39] elukey: thanks for the heads-up
[13:05:09] any related deployments?
[13:05:24] or changes to k8s pods
[13:05:29] I haven't found anything so far
[13:05:35] I think we did have one deployment today, but I don't know what was in it
[13:06:45] it started ~20 mins ago afaics
[13:06:54] jayme: could be related to your helm deploys?
[13:07:33] the time matches but those are for staging
[13:07:39] uhm...no. I'm just doing staging
[13:07:39] agree
[13:07:43] There is also a database repooling
[13:07:49] but the time was suspicious enough
[13:07:50] hnowlan: ?
[13:08:22] this maybe https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017819
[13:09:53] there's also a (MediaWikiHighErrorRate) firing: "Elevated rate of MediaWiki errors - kube-mw-api-int" alert that is probably related
[13:10:40] the error seems to point to the job runners but it is very high level
[13:10:41] /rpc/RunSingleJob.php JobQueueError: Could not enqueue jobs
[13:10:57] yes, it doesn't give much detail
[13:11:27] and the RED dashboard for them is flat mmm
[13:11:33] https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&var-site=All&var-deployment=mw-jobrunner&var-method=GET&var-code=200&var-handler=static&var-service=mediawiki
[13:12:46] for mw-api-int it has values
[13:12:58] and shows the same pattern of 500s
[13:13:17] started 12:47-48
[13:13:58] the metrics for the jobrunner in that dashboard are clearly wrong
[13:14:01] all flat
[13:15:45] there is an elevated amount of recordlint jobs since ~9:20
[13:18:45] jayme: do we have a way to see how the jobrunners are doing? I don't find anything in the RED dashboard, but I am probably missing where to check now
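For context on the error being chased here: as discussed further down, enqueueing a job on this setup means MediaWiki submitting the job as an event to eventgate through its local envoy sidecar, and a failure anywhere on that path surfaces as the JobQueueError above. The following is a minimal sketch of that shape only; the sidecar address, port, endpoint path, and stream naming are illustrative assumptions, not the production configuration.

```python
"""Hedged sketch: enqueue a job event via an eventgate-style HTTP intake.

The sidecar address, port, endpoint path, and stream naming below are
illustrative assumptions, not the production MediaWiki/eventgate setup.
"""
import requests


class JobQueueError(Exception):
    """Raised when a job event cannot be delivered to the intake service."""


def enqueue_job(job_type: str, params: dict,
                intake: str = "http://localhost:6004/v1/events") -> None:
    # One event per job; eventgate-style intakes accept a JSON array of events.
    event = {
        "meta": {"stream": "mediawiki.job." + job_type},  # assumed stream naming
        "type": job_type,
        "params": params,
    }
    try:
        resp = requests.post(intake, json=[event], timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Roughly the point at which MediaWiki would log
        # "JobQueueError: Could not enqueue jobs".
        raise JobQueueError(f"Could not enqueue {job_type} job") from exc


if __name__ == "__main__":
    enqueue_job("recordlint", {"page_id": 12345})  # hypothetical job payload
```

Because the error is raised on the client side of that POST, it shows up in MediaWiki's logs even when the receiving service's own dashboards look quiet, which is the pattern the discussion below keeps running into.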
[13:19:04] also, errors seem to have reduced a lot
[13:19:36] elukey: I'd assume https://grafana-rw.wikimedia.org/d/MVDqnbOVk/mw-jobrunner?orgId=1
[13:20:09] ah TIL, thanks
[13:20:20] but "Could not enqueue jobs" should be decoupled I guess
[13:20:37] as in: that should be mw submitting jobs to eventgate
[13:21:18] right yes
[13:21:59] nothing really visible on eventgate's main dashboard afaics
[13:22:04] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=30s&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All
[13:22:15] agreed
[13:22:44] but a bunch of retries here https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=All
[13:24:53] maybe related to the ferm alerts
[13:25:02] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=ferm
[13:25:18] both nodes run an eventgate-main pod
[13:26:21] I merged the change to reload the k8s ipv6 range more or less when the issue started, but the deploy wasn't of course instant, so not sure if I caused some ferm weirdness for a brief while
[13:27:02] zero errors now
[13:27:02] sorry, was afk - I merged https://gerrit.wikimedia.org/r/1017819 earlier which will increase job run duration on jobrunners
[13:27:04] Could be: Failed to run /usr/sbin/ip6tables-legacy-restore
[13:27:16] should not have had any impact on wikikube though, elukey
[13:27:51] yeah
[13:28:20] the other thing that I can think of is a network blip related to some row or specific subset of nodes, but I haven't checked logstash yet
[13:29:50] https://grafana.wikimedia.org/goto/a-tYEk-Sg?orgId=1
[13:29:56] eventgate connection issues again
[13:30:24] yeah, pasted that some minutes ago. It pretty much matches the timeframe
[13:30:32] oh, heh, sorry
[13:30:40] np :)
[13:31:15] that's why I was suspecting ferm. Maybe it terminated persistent connections to a couple of eventgate pods
[13:31:56] We can try restarting ferm on kubernetes1029, that should be running anyway. I'll give it a go
[13:32:14] I just did so on mw1485
[13:32:33] Cool, running normally on the Kubernetes host as well now
[13:32:44] although I think puppet was faster than me
[13:32:51] By the way, that ferm issue is outstanding and doesn't have a clear path to resolution, because ferm does not include the iptables flag to wait for the xtables lock to be released
[13:33:08] And kubernetes is very frequently holding that lock
[13:33:32] https://phabricator.wikimedia.org/T354855
[13:34:33] I've marked it resolved because we band-aided it with a puppet restart, but it would appear it's insufficient?
[13:37:18] It could be helpful to know what kind of job failed to be enqueued in the mw logs, not only the stacktrace
[13:37:30] almost sure that it is super difficult to do
[13:38:04] but it would point us in the right direction, like: from the envoy errors I see eventgate-analytics having troubles with its sidecar, not main
[13:38:41] Yeah, I was looking at the logs and "Unable to deliver all events" is probably correct, but which ones
[13:39:17] no, my bad, I see now that there are errors for main as well
[13:40:27] hnowlan: have you experienced the eventgate connection issues before? Because I don't see errors in the main eventgate dashboard
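The ferm failure mode mentioned around 13:32 (T354855) is that ferm shells out to ip(6)tables-legacy-restore without the flag that waits for the xtables lock, which Kubernetes components frequently hold, so rule reloads can fail outright. Below is a rough sketch of the two usual workarounds, assuming the installed iptables supports the -w/--wait option on the restore commands; the rules path and retry parameters are hypothetical examples, not what ferm or puppet actually use here.

```python
"""Hedged sketch of working around the xtables lock race described above.

Assumes the installed iptables version supports -w/--wait on the
*-restore commands; the rules path and retry parameters are placeholders.
"""
import subprocess
import time

RULES_FILE = "/etc/ferm/ip6tables.rules"  # hypothetical path


def restore_with_wait(rules_path: str = RULES_FILE) -> None:
    """Let iptables itself block until the xtables lock is free."""
    with open(rules_path, "rb") as rules:
        # -w waits for the lock instead of failing immediately when
        # another process (e.g. a Kubernetes component) holds it.
        subprocess.run(
            ["/usr/sbin/ip6tables-legacy-restore", "-w"],
            stdin=rules,
            check=True,
        )


def restore_with_retries(rules_path: str = RULES_FILE,
                         attempts: int = 5, delay: float = 2.0) -> None:
    """Fallback when -w is unavailable: retry while the lock is held."""
    for attempt in range(1, attempts + 1):
        with open(rules_path, "rb") as rules:
            result = subprocess.run(
                ["/usr/sbin/ip6tables-legacy-restore"],
                stdin=rules,
                capture_output=True,
            )
        if result.returncode == 0:
            return
        if attempt < attempts:
            time.sleep(delay)  # back off and hope the lock holder is done
    raise RuntimeError(f"ip6tables-legacy-restore kept failing: {result.stderr!r}")
```

The puppet-triggered ferm restart used during the incident is effectively the retry variant done by hand, which is why the band-aid works but the underlying race remains open.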
[13:41:05] elukey: yeah, it's happened many times https://phabricator.wikimedia.org/T249745
[13:41:10] but it never manifests in eventgate itself
[13:41:14] I think eventgate won't notice in that case
[13:41:29] as envoy on the client side will retry
[13:45:11] mmm wait, I am not following - IIUC MW reports "Could not enqueue jobs" because something failed when calling eventgate
[13:45:54] yeah, the mw envoy reports an error back to mw in that case
[13:46:14] but I *think* the failed requests never reached the envoy of eventgate
[13:46:35] sure, but I'd expect to see some http errors reported by nodejs/eventgate in the dashboard
[13:47:11] but those would only appear if the requests reached eventgate
[13:47:30] ah okok, you mean the envoy in front of node doing tls termination
[13:47:31] okok
[13:47:44] yes
[13:48:18] maybe it could be nice to add those graphs to the eventgate dashboard, if they are not there already
[13:49:17] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=eventgate-analytics&var-destination=eventgate-analytics-external&var-destination=eventgate-main&from=1712580304370&to=1712583300656&viewPanel=30
[13:50:31] +1, makes some sort of sense now
[13:53:23] claime: I'm not sure this really is related to the ferm alerts. There are enough eventgate pods really - even if the two nodes were down. And I guess we would see *a lot* more issues in that case
[14:01:38] I've added a comment to https://phabricator.wikimedia.org/T249745
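Since the thread concludes that these failures are only visible in the client-side envoy telemetry (retries and connection failures from the MediaWiki sidecars) and never in eventgate's own service metrics, one way to watch for a recurrence is to query those client-side counters directly. The sketch below goes through the Prometheus HTTP API; the Prometheus URL and the label names (app, envoy_cluster_name) are assumptions about the local setup, while envoy_cluster_upstream_rq_retry and envoy_cluster_upstream_cx_connect_fail are standard Envoy cluster stats.

```python
"""Hedged sketch: check client-side envoy counters via the Prometheus HTTP API.

The Prometheus URL and label names are assumptions about the local setup;
the two metric names are standard Envoy cluster statistics.
"""
import requests

PROMETHEUS = "http://prometheus.example.internal/api/v1/query"  # hypothetical

QUERIES = {
    "retries": 'sum(rate(envoy_cluster_upstream_rq_retry{'
               'app="mediawiki",envoy_cluster_name=~"eventgate.*"}[5m]))',
    "connect_failures": 'sum(rate(envoy_cluster_upstream_cx_connect_fail{'
                        'app="mediawiki",envoy_cluster_name=~"eventgate.*"}[5m]))',
}


def check_client_side_errors() -> dict:
    """Return current per-second rates of retries and connection failures."""
    rates = {}
    for name, promql in QUERIES.items():
        resp = requests.get(PROMETHEUS, params={"query": promql}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        # An empty result just means no matching series, i.e. a zero rate --
        # which is exactly how the eventgate dashboard looked during the incident.
        rates[name] = float(results[0]["value"][1]) if results else 0.0
    return rates


if __name__ == "__main__":
    for metric, rate in check_client_side_errors().items():
        print(f"{metric}: {rate:.2f}/s")
```

Adding panels based on these client-side counters to the eventgate dashboard, as suggested at 13:48, would make this failure mode visible without having to know to look at the envoy telemetry dashboard separately.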