[08:59:01] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8454404, @Ottomata wrote: >> I will test and see what happens to a running Flink app when I take the opera...
[11:10:13] serviceops, Content-Transform-Team-WIP, Maps: Enable traffic mirroring from codfw to eqiad - https://phabricator.wikimedia.org/T324459 (Jgiannelos) Open→Resolved
[11:10:16] serviceops, Content-Transform-Team-WIP, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (Jgiannelos)
[12:28:20] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:52:20] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > cert-manager in our cluster Ah! Do we have a cert-manager that will work with this webhook as is? Meaning we could ins...
[14:18:53] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Jclark-ctr) Open→Resolved removed power supply and reseated; error has cleared
[14:20:12] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Clement_Goubert) Thanks so much!
[14:21:54] do the .fixtures in a chart get tested, or do they just get evaluated for valid yaml?
[14:23:19] ottomata: the .fixtures are used to actually render the chart and validate it in CI
[14:23:37] is there a way I can provide expected template output?
[14:23:51] no
[14:23:55] rats :)
[14:24:29] we just use the fixtures to have all if-guards rendered, and all the resulting yaml is validated via kubeconform
[14:24:43] i've got a local values.yaml file i'm using for testing and developing, but it's not necessarily testing a feature flag. any reason not to put it in the chart's .fixtures?
[14:27:29] Not really if it renders to a valid chart I think
[14:27:33] No. Not really. But every .fixture will result in one helm template | kubeconform run
[14:27:59] so if there is no need for it, you would just extend CI runtime :)
[14:37:00] hm, okay.
[14:43:21] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > what the webhook actually does Responses from Flink mailing list: > webhooks in general are optional components of the...
[14:49:22] ottomata: interesting answer from the mailing list. They don't even mention the mutation the webhook does :D
[14:50:35] since we're going to validate the FlinkDeployment against the CRD spec prior to submitting it to the cluster anyway (in CI), I guess we don't gain anything from the validation phase
[14:56:46] jayme: okay cool. i'll leave in support for it then, but disable installing it in the admin_ng helmfile values
[14:57:25] sounds good
[15:01:19] jayme: would it be better to use the built-in FlinkDeployment ingress support https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/operations/ingress/
[15:01:22] or to use our templates somehow?
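For reference, the operator's built-in ingress support mentioned above is driven by an `ingress` block on the FlinkDeployment spec. A minimal sketch along the lines of the linked docs might look like the following; the name, hostname, ingress class and annotation are placeholders, not anything configured in our clusters:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-flink-app                # placeholder name
spec:
  ingress:
    # {{name}} and {{namespace}} are substituted by the operator per deployment
    template: "flink-ui.example.org/{{namespace}}/{{name}}(/|$)(.*)"
    className: "nginx"              # placeholder ingress class
    annotations:
      nginx.ingress.kubernetes.io/rewrite-target: "/$2"
```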
[15:01:51] btw. they cut 1.3 today :-p
[15:01:57] oh nice!
[15:02:16] i saw the release branch in github when I was poking around in there, and thought one might be coming up soon :)
[15:09:57] regarding ingress: we (as in wikikube) support the default ingress API, but I have not tested it very much as we're using the istio ingress CRDs. Might be possible to just use it but I would need to check in detail
[15:11:54] but AIUI there is no full templating support but more or less only this "template:" string that gets converted to an ingress object.
[15:12:32] it might be easier to leverage the ingress stuff - plus we would stick with what we have already
[15:14:20] okay, will try our stuff first then...any hints? it's just for admin access to the Flink UI, i don't think we need e.g. a discovery endpoint (although, maybe that would be nice?) or...do we?
[15:15:08] and we could craft a generic/shared hostname for all flinks so new flink deployments would not need DNS changes to be reachable (see https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#More_complex_setups)
[15:15:16] oo yeah
[15:15:18] oh boy docs...reading
[15:15:38] not many docs, I kept it sparse so you can stick to guessing :D
[15:16:03] Hmm did we do anything that could cause a sharp increase in api_appserver calls? https://grafana.wikimedia.org/goto/kuJxo3F4z?orgId=1
[15:16:13] (I may be paranoid because 1st oncall)
[15:17:20] I'm not finding anything particular in RED
[15:18:07] hmm
[15:18:22] did you check SAL?
[15:19:11] Spike's almost an hour after the end of the train deploy
[15:19:32] So I don't think it's that
[15:19:33] promising :)
[15:19:47] Ah wait
[15:22:08] * jayme's superpower
[15:24:48] I see an increase in GET requests but it doesn't seem linked to the train actually, it's coincidentally ramping up
[15:25:19] https://grafana.wikimedia.org/goto/R50Z0qFVk?orgId=1
[15:25:29] It's quite a sharp increase if we look at 7 days
[15:28:29] true indeed
[15:33:14] https://superset.wikimedia.org/r/2187
[15:34:59] the surge in req/s is rather small compared to the load this seems to put on the workers
[15:35:07] Yeah
[15:35:26] I think it's a red herring
[15:35:44] eheh, that url can't be opened. "Request Line is too large (4186 > 4094) "
[15:36:21] if the req/s is a red herring I would still look towards the deployment as load might creep up slowly
[15:39:21] Yeah, but I'd expect to see a ramp from around 13h10/13h15 until now
[15:39:30] (in active/idle)
[15:39:43] https://grafana.wikimedia.org/goto/XZujA3FVk?orgId=1
[15:40:24] but it's stable for 1h and then dips suddenly
[15:46:34] jayme: recommendations for testing this ingress/istio stuff in minikube?
[15:46:39] is that even possible?
[15:47:34] claime: indeed...or maybe some caches running out after ~60min
[15:47:50] i guess i have to install istio?
[15:48:32] ottomata: you can install istio in minikube. Replicating our exact setup might be a bit tricky, though
[15:49:05] i just want something that will work with our ingress templates
[15:49:24] our istio setup is in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/custom_deploy.d/istio/
[15:49:28] reading custom_deploy yeah...
[15:49:29] okay
[15:49:36] yeah.
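For local testing, something along these lines might be a starting point: an IstioOperator manifest that enables only the ingress gateway. This is a guess at a minimal minikube setup, not the configuration from custom_deploy.d/istio/ linked above:

```yaml
# Hypothetical minikube starting point - the authoritative config lives in
# custom_deploy.d/istio/ (linked above); the istioctl version should match the
# istio version pinned there.
# Install with something like: istioctl install -y -f local-istio.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: empty                    # no sidecar injection, no egress gateway
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true               # only the ingress gateway
```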
[15:50:09] I think the tricky part is not to enable all the istio stuff as we only have the istio-ingressgateway enabled
[15:50:18] plus the cert-manager things ofc
[15:50:36] i will put that advice into my pocket until I know what it means :p
[15:51:18] istio is a full blown service mesh which we completely disable apart from the parts that do ingress
[15:51:52] right so in main/config.yaml egress things are disabled
[15:52:33] yes - and all the meshy sidecar things are disabled
[15:56:18] claime: I think the situation gets worse - and unfortunately I'm not super experienced with mw debugging :/ but I would guess Amir1 is :)
[15:56:30] oof, okay, i need to build istioctl somehow...figuring out which version
[15:56:31] then install it
[15:56:35] but i'm on macos... hmm
[15:56:37] jayme: Yeah, I'm starting to not like this either
[15:56:41] what terrible thing have I done
[15:56:44] * Amir1 reads up
[15:56:52] I don't think it's your fault
[15:56:55] [09.12.22 16:50] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:57:05] We're getting worker exhaustion on api_appserver
[15:57:06] claime: everything is my fault even if proven otherwise
[15:57:25] ottomata: uh...I was about to say "we have debian packages" :D
[15:57:28] okay, let's look at the flamegraph
[15:57:49] https://performance.wikimedia.org/arclamp/svgs/hourly/
[15:58:11] likely some perf regression
[15:58:26] hehe, i'm reading the how to deploy section of custom_deploy.d/istio/README.md
[15:59:40] it seems eventbus is at fault here
[15:59:42] https://performance.wikimedia.org/arclamp/svgs/hourly/2022-12-09_14.excimer-wall.api.svgz
[15:59:49] compare with https://performance.wikimedia.org/arclamp/svgs/hourly/2022-12-09_10.excimer-wall.api.svgz
[16:00:25] eventbus you say? :)
[16:00:41] sorry, also reading a little backscroll then. is this since a deploy?
[16:00:49] i'm not aware of any recent eventbus changes
[16:00:52] it seems wmf.13
[16:00:58] It started ~1h after the train for wmf.13
[16:00:59] oo via monologhandler
[16:01:20] now 30% of run time is just trying to talk to eventbus it seems
[16:01:30] i think the monolog handler eventbus is only really used for sending api request logs, and maybe some cirrussearch request logs, as events
[16:01:40] if there are a LOT more of those types of requests, there will be more events
[16:04:26] we only got a bump of ~10% in requests in eqiad, stable in codfw
[16:04:46] ok maybe a little more
[16:04:56] the throttled logs are back as well https://grafana-rw.wikimedia.org/d/000000561/logstash?from=1670529261000&orgId=1&to=now&viewPanel=45
[16:05:46] i don't see a spike in events for those streams i suspected
[16:05:57] https://grafana.wikimedia.org/goto/2njpx3F4z?orgId=1
[16:06:09] that seems unrelated at least, the target modules logs should show up in load.php requests not api.php
[16:07:19] unless load.php logs so much it overwhelms eventgate and then api.php logs can't reach?
[16:08:27] I can make it sample the logs but can't promise it would fix this issue
[16:11:38] Amir1: is there any way to see the flame stack that queued the DeferredUpdate? I guess not eh?
[16:12:40] if we can figure out what request type it is, we can replay it in mwdebug with xhgui and you'd have the full callgraph with details
[16:13:14] request type meaning which api action?
[16:14:31] yeah
[16:15:04] if it's a simple GET, then something like just the request url would be enough
[16:19:13] i don't see anything suspicious on the eventgate end, no extra requests, no weird latencies, no kafka issues. Amir1, reading the flame graph, the fact that the EventBus::send method is so wide means that the time is specifically spent there or below it (in curl), not above it, right?
[16:19:41] Yup
[16:20:05] The curl is wide
[16:20:16] So it's stuck in the http call
[16:21:17] Amir1: ottomata: this is interesting: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=All&from=now-6h&to=now
[16:21:58] cdanis: retries?
[16:22:27] retries, connection fail rate as well
[16:22:28] timeouts too
[16:23:19] "query processing would load too many samples into memory in query execution" has to be one of my least favorite messages
[16:23:31] oof
[16:24:57] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All
[16:25:09] eqiad p99 on POSTs went from ~10ms to ~46ms
[16:25:28] ya looking there too, it does seem like there has been a gradual increase in throughput here over time...is it possible we just hit some threshold?
[16:25:30] and need more pods?
[16:25:40] dunno what would cause this to spike like this
[16:27:32] pod cpu throttling? https://grafana.wikimedia.org/goto/v8OYaqK4k?orgId=1
[16:28:09] considering bumping replicas, objections?
[16:28:10] Oh, yep, good find
[16:28:25] sgtm
[16:28:30] same, go
[16:29:15] it is interesting it is only some of the pods
[16:29:24] the increase in logs caused by mobile load.php (T324723) might have been a contributing factor I think
[16:29:49] but that's going to be reverted soon
[16:29:52] Look at the network usage cdanis
[16:30:03] They all take a bump before starting throttling
[16:30:16] claime: yeah, but I was expecting it to be more uniformly distributed
[16:30:23] I imagine if it keeps going, as it load balances, it'd throttle more evenly
[16:30:34] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866612
[16:30:40] But it's been balancing for 2 hours and that didn't happen
[16:31:15] I am guessing the imbalance is just connection reuse and bad luck
[16:31:19] C'mon jenkins
[16:31:30] Ah no I didn't need jenkins lol
[16:31:47] And here's the page
[16:31:51] given the phpfpm saturation page let's get that rolled out expeditiously :D
[16:32:25] skipping jenkins
[16:33:01] deploying shortly...
[16:33:19] ottomata: You taking the helmfile apply?
[16:34:26] yes
[16:34:29] doing now
[16:34:54] hm FYI there is a chart version bump being applied too! the template looks mostly the same, tls config checksum changed tho.
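Background on the throttling/replicas reasoning above: a pod gets CFS-throttled once it tries to use more CPU than its configured limit, so adding replicas spreads the same request volume across more pods and keeps each one under its limit. A hypothetical values sketch; the key names and numbers are illustrative, not the actual eventgate-analytics settings or the change in the linked patch:

```yaml
# Illustrative only - not the real eventgate-analytics values or key names.
replicas: 10          # bumped so per-pod CPU usage stays below the limit
resources:
  requests:
    cpu: 500m
  limits:
    cpu: "1"          # CFS throttling kicks in when a pod tries to exceed this
```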
[16:35:12] applied in eqiad
[16:36:33] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&from=now-30m&to=now&refresh=30s
[16:36:43] 📉
[16:36:55] sweet
[16:37:05] Noice
[16:37:09] Thanks guys <3
[16:37:27] well, hm
[16:37:32] a bunch of the older pods are still pretty heavily throttled
[16:38:04] the uneven distribution might be a result of envoy<->envoy connections being kept open
[16:38:21] where's our horizontal auto scaling?!
[16:38:25] ;)
[16:38:26] jayme: yeah, in traffic land there's a maximum number of requests allowed through the same connection (order of 1000 i think)
[16:38:32] not sure if the same is done in envoy
[16:38:43] that *should* be the case there as well IIRC
[16:39:03] we can kill the pods and let the deployments recreate them?
[16:39:17] It would respread the connections
[16:39:27] ottomata: tbqh, having used both before, I'd much rather have per-instance utilization-based load balancing rather than autoscaling :)
[16:39:45] claime: let's wait and see first
[16:39:49] the old pods should be terminated
[16:39:53] yeah, np
[16:39:55] i think that is just old grafana stuff
[16:39:57] at least 2 of the pods are no longer throttling
[16:40:12] old pods should disappear from the dashboard, I think...
[16:40:14] ah
[16:40:23] OH yeah
[16:40:29] Because of the chart bump
[16:40:34] yeah looks like ottomata is right
[16:40:35] yup all pods have been recreated
[16:40:57] oh, claime if it weren't for the chart version bump, increasing replicas would have just spun up 10 new pods?
[16:41:01] without taking down the old ones?
[16:41:03] I think
[16:41:04] Not sure
[16:41:07] jayme: ?
[16:41:33] yes, absolutely
[16:42:11] cool
[16:42:25] hmm "max_requests_per_connection: 1000" we configure per default for the service-proxy stuff ( cdanis )
[16:42:28] (deploying in staging for consistency...)
[16:42:37] It would scale up the replicaset right?
[16:42:57] jayme: hmmm :/
[16:43:45] https://grafana.wikimedia.org/goto/PSZ-fqKVk?orgId=1
[16:43:46] claime: yep
[16:43:50] I think it worked
[16:44:06] phew!
[16:44:38] ah...yeah. We have a one time retry for eventgate-analytics in case of 5xx configured
[16:44:50] %worker in active state still a little high compared to baseline, but not by much
[16:44:59] so that probably ends up in a maximum of 2k requests per connection
[16:45:05] so, eventgate-analytics is def the busiest of the eventgates. i know the search team uses the cirrussearch request events to improve the search indexes, but i don't really know what the api-request logging is used for.
[16:45:12] every once in a while it's useful i guess...?
[16:45:15] (~22% baseline, 30% rn)
[16:46:06] So I managed to actually get a page during my 1st oncall week
[16:46:28] nice one!
[16:46:29] That I could have avoided if I'd called on Amir for help debugging earlier when I started seeing contention
[16:46:48] Lesson learned :P
[16:47:09] you noticed the contention because you were unpromptedly looking at dashboards, though
[16:47:13] you'll always find something weird that way ;)
[16:47:16] cdanis: Nah
[16:47:36] I'd had the crit flapping but not paging for an hour or so
[16:47:45] ahhh
[16:47:50] strange, I didn't get the page
[16:47:54] And I was trying to find out what happened
[16:48:09] Amir1: I acked it immediately so maybe with push latency?
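To make the envoy settings discussed above concrete (the per-default max_requests_per_connection and the single 5xx retry for eventgate-analytics), a heavily simplified sketch, not the actual service-proxy template, might look like:

```yaml
# Heavily simplified - field placement in the real service-proxy config differs.
clusters:
  - name: eventgate-analytics
    # each upstream connection is recycled after 1000 requests; the one-time
    # retry below roughly doubles the effective count, per the discussion above
    max_requests_per_connection: 1000
routes:
  - match: { prefix: "/" }
    route:
      cluster: eventgate-analytics
      retry_policy:
        retry_on: "5xx"   # a failing request is sent at most twice
        num_retries: 1
```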
[16:48:26] but using flamegraph was literally one of two/three slides of the debugging chapter in my mw presentation in Prague
[16:48:35] I have to start taking quizzes from now on
[16:48:40] yes but I couldn't attend
[16:48:47] Amir1: you weren't supposed to
[16:48:53] And flamegraph isn't in the checklist :P
[16:49:00] it was just arnold and cole scheduled for right now
[16:49:21] cdanis: aah, ok, makes sense, I am oncall for the EU time but this one was a bit late
[16:50:20] ottomata: you mind logging the replica bump numbers (and reason) to SAL?
[16:50:37] claime: fair, both of my sessions are actually being added to the onboarding videos though
[16:50:45] Amir1: Cool!
[16:51:54] and we probably need to follow up somewhere about the increased message volume - or did I miss something?
[16:52:35] jayme: https://phabricator.wikimedia.org/T320518#8455510
[16:54:36] Amir1: ah, I understood from the following comments that that is expected to be already fixed. That's why I pointed that graph out again earlier :)
[16:55:20] It'll get better but not be fixed, the revert is not deployed yet
[16:55:28] went from 3x to 2x :D
[16:55:43] ok. I see. That wasn't clear to me :)
[16:56:04] "drop rate has settled to a 2x increase since the deployment." .. is pretty clear, though :)
[16:58:14] So the chain of causality is: more mobile load.php logging => eventgate-analytics congestion => api_appservers spend too long trying/retrying to log, so active% shoots up, leading to worker starvation
[16:58:49] I gtg - have a nice rest of the day/weekend all
[16:58:58] jayme: have a nice weekend!
[16:58:59] claime: (that's what I understood fwiw)
[17:00:05] FWIW, I just +2'ed the revert so it'll settle by the next train, don't forget to reduce the number of pods next week
[17:00:17] (next Friday to be exact)
[17:05:24] Welp, I'll be off then
[17:39:46] ah, just saw the note about SAL, sorry, thanks claime
[17:42:36] i wonder if there is a way to make eventgate configurably throttleable, especially for the hasty=true endpoint
[17:42:56] just fail the http request early if too many requests?
[17:43:02] could the envoy stuff handle that maybe?
[17:43:21] eventgate-main is probably the only one we'd care about not doing that for.
[17:43:28] all the others can drop if they are busy
[18:26:45] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Q for @dcausse and @gmodena. I've thus far been making Flink logs go only to the console in ECS format. The console log...
[18:31:33] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @JMeybohm it turns out that Flink's native k8s integration [[ https://nightlies.apache.org/flink/flink-docs-master/docs/de...
[19:07:37] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (daniel)
[22:50:34] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Okay @JMeybohm, I'm ready for a first pass review of the [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts...
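On the 17:42 question about making eventgate shed load early: envoy ships a local rate limit HTTP filter that rejects requests with a 429 once a token bucket is exhausted, which is one way the "envoy stuff" could handle it. A rough sketch follows; the numbers and placement are purely illustrative, and this is not something currently configured for eventgate or the service-proxy:

```yaml
# Illustrative sketch only - not part of the current eventgate/service-proxy config.
http_filters:
  - name: envoy.filters.http.local_ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
      stat_prefix: eventgate_local_ratelimit
      token_bucket:
        max_tokens: 1000          # burst size (placeholder)
        tokens_per_fill: 1000     # sustained rate per fill_interval (placeholder)
        fill_interval: 1s
      filter_enabled:
        default_value: { numerator: 100, denominator: HUNDRED }
      filter_enforced:
        default_value: { numerator: 100, denominator: HUNDRED }
```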