[06:53:49] good morning folks
[07:55:45] started two separate benthos instances for editquality, one hitting ml-serve-eqiad and the other ml-serve-codfw
[07:56:01] will let them run for a bit, to measure performance
[08:19:31] it seems that sometimes a query to the mwapi stalls and takes seconds, then all the other ones pile up to a maximum of ~30s
[08:19:42] then back to less-than-a-second responses
[08:20:42] I am wondering if there is a blocking call that we missed, causing this
[08:21:45] \o
[08:21:48] Morning
[08:22:07] Are we maybe hitting a throttling/limit thing on MWAPI?
[08:22:07] 10Machine-Learning-Team, 10Data-Engineering, 10observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (10fgiunchedi) (FYI) For this task and the work in {T314981} we now have Debian packages for Benthos available for Buster and Bullseye
[08:24:55] need to run some errands, bbl!
[08:27:43] ttyl :)
[09:13:34] back
[09:13:56] klausman: morning! What kind of throttling? On their side?
[09:15:22] Yeah, basically MWAPI slowing us down for some reason
[09:17:46] no throttling comes to mind for api-ro atm
[09:18:41] Then I figure the blockage must be local, running out of something like a thread pool or connection pool. Are we confident it's the model being slow, not benthos itself?
[09:22:00] not sure if it is the model itself or the async api calls, or maybe a combination of both
[09:22:26] we run the model's score blocking function inside the tornado/asyncio loop
[09:22:41] IIRC you already have setups for more than one kind of model, right? Is there any difference between different models?
[09:22:45] It doesn't seem to be Benthos afaics
[09:23:47] I am adding more metrics to the istio sidecar dashboard, failures seem to happen for different models
[09:24:04] (with different traffic flows)
[09:24:06] Hmm.
[09:24:35] Do we currently have any circuit-breaking in the sidecar that might be triggered?
[09:27:39] we do have it, but it is set to a very high limit, 100 rps/pod
[09:27:55] and we are not hitting it.. but so far I didn't see metrics for it
[09:27:56] yeah, seems unlikely then.
[09:30:14] I am going to try to assign more threads to the benthos processing pipelines, to check whether it is something related to benthos head-of-line blocking
[09:31:52] started the new tests
[09:32:21] the dashboard is improving:
[09:32:22] https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?from=now-6h&orgId=1&to=now&var-backend=api-ro.discovery.wmnet&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&var-response_code=All
[09:32:39] we can also switch and check eventgate-main.d.wmnet, or all services at once
[09:33:16] will see if we can add the circuit-breaking metrics without adding a ton of spam to prometheus
[09:35:40] :+1:
[09:39:15] I see now that we have a 5s timeout for api-ro.discovery.wmnet
[09:39:20] too tight probably
[09:39:30] (in the istio sidecar settings, destination rule)
[09:42:37] What do you suggest instead? 1s? 2s?
[09:44:21] ah no no, increasing it
[09:46:26] we also have https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1 to check
[09:51:50] aiko: o/ if you want to see some config examples for Benthos, check in my home dir on stat1004 (subdir benthos)
[09:55:18] elukey: any idea what the "DC" and "UC" bits on the latency by backend graphs mean? The graph definition says "response flags", but I dunno what they expand to
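
(A minimal Python sketch of the failure mode being discussed: a blocking score call executed directly on the tornado/asyncio loop serializes every in-flight request, while offloading it to an executor keeps the loop responsive. score() and its 0.5s sleep are illustrative stand-ins, not the actual kserve/Lift Wing code.)

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    def score(rev_id):
        # hypothetical stand-in for the CPU-bound/blocking model call
        time.sleep(0.5)
        return {"rev_id": rev_id, "goodfaith": 0.97}

    async def predict_blocking(rev_id):
        # runs score() directly on the event loop: every other coroutine waits
        return score(rev_id)

    async def predict_offloaded(rev_id, executor):
        # hands the blocking call to a worker thread; the loop stays responsive
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(executor, score, rev_id)

    async def main():
        executor = ThreadPoolExecutor(max_workers=8)

        t0 = time.monotonic()
        await asyncio.gather(*(predict_blocking(i) for i in range(8)))
        print(f"blocking:  {time.monotonic() - t0:.1f}s")  # ~4s, fully serialized

        t0 = time.monotonic()
        await asyncio.gather(*(predict_offloaded(i, executor) for i in range(8)))
        print(f"offloaded: {time.monotonic() - t0:.1f}s")  # ~0.5s, runs in parallel

    asyncio.run(main())
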
[09:55:46] maybe "uncached" for UC, but what would DC be then...
[09:56:34] https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#config-access-log-format-response-flags <- found an explanation.
[09:56:43] UC = Upstream connection termination in addition to 503 response code.
[09:56:49] DC = Downstream connection termination.
[09:57:23] So basically the difference between the model or the API closing the connection
[09:59:26] there is a legend in the top-left corner (credits to Janis, I copied the main istio dashboard :D)
[10:00:28] the main issue is that I don't see horrible latencies in the istio-sidecar dashboard for api-ro etc..
[10:00:35] only for the main model servers
[10:01:08] same thing for eventgate-main
[10:01:44] ooh, I completely missed that info panel :D
[10:03:04] re: latency only on the model side: I think that would indicate some saturation/starvation of resources on the model side, not Istio or upstream. Though the number of connections with 0/DC seems a bit odd.
[10:09:46] from https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?from=now-6h&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-pod=All I don't see a lot of resource usage (at least CPU/memory wise)
[10:10:48] we could set https://docs.python.org/3/using/cmdline.html#envvar-PYTHONASYNCIODEBUG on one isvc (like wikidata and enwiki)
[10:10:53] and see what shows up in the logs
[10:11:06] (this is how I found the DNS latency issue at the time)
[10:11:11] but it may become noisy
[10:14:05] these are logs from a pod before the 500s
[10:14:07] [I 221017 08:54:57 web:2243] 200 POST /v1/models/srwiki-goodfaith:predict (127.0.0.1) 128.14ms
[10:14:10] [I 221017 08:56:02 web:2243] 200 POST /v1/models/srwiki-goodfaith:predict (127.0.0.1) 192.54ms
[10:14:13] [I 221017 08:58:05 web:2243] 200 POST /v1/models/srwiki-goodfaith:predict (127.0.0.1) 147.97ms
[10:14:16] [I 221017 08:59:32 web:2243] 200 POST /v1/models/srwiki-goodfaith:predict (127.0.0.1) 669.75ms
[10:14:19] [I 221017 09:05:31 web:2243] 200 POST /v1/models/srwiki-goodfaith:predict (127.0.0.1) 2408.60ms
[10:17:49] ok I've set PYTHONASYNCIODEBUG: "1" on the enwiki-goodfaith isvc in ml-serve-codfw
[10:17:58] let's see if we get anything good out of it
[10:22:31] elukey: ack! thanks :)
[10:33:34] I'll leave the debug logging going, so far nothing really clear stands out
[10:33:52] (if you want to check, enwiki-goodfaith logs on ml-serve-codfw)
[10:37:00] going afk for lunch! (the benthos tests are still running)
[11:16:38] <- lunch
[13:36:28] Good morning all!
[13:39:34] morning!
[13:52:20] Quick question: how many boxes do we have in mlserve and staging?
[13:54:42] kubernetes workers (so excluding the control plane VMs): 8 for each ml-serve cluster, 2 for staging
[13:55:00] what about with the control plane?
[13:55:11] I don't think finance cares about control plane vs not
[13:56:43] those are VMs, so not really baremetal hosts, but it is 5 for each cluster (prod and staging don't differ in this case)
[13:56:54] oh okay then I don't care
[13:56:59] cool thanks elukey
[13:57:03] :)
[13:57:12] I think they just care about actual physical boxes
[14:37:35] hello! this is your periodic reminder about the wikilabels databases
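
(For reference, a small sketch of what asyncio's debug mode, which PYTHONASYNCIODEBUG=1 switches on, surfaces: any callback or task step that holds the event loop longer than loop.slow_callback_duration gets logged as a warning. The blocking sleep here is an illustrative stand-in for a slow model call, not the isvc code.)

    import asyncio
    import logging
    import time

    logging.basicConfig(level=logging.DEBUG)

    async def slow_step():
        # blocking call inside a coroutine: it holds the event loop for 0.3s
        time.sleep(0.3)

    async def main():
        loop = asyncio.get_running_loop()
        # 0.1s is the default threshold; lower it to catch shorter stalls
        loop.slow_callback_duration = 0.1
        await slow_step()
        # debug mode then logs something like:
        #   Executing <Task ... coro=<main() ...>> took 0.300 seconds

    # debug=True is equivalent to running the process with PYTHONASYNCIODEBUG=1
    asyncio.run(main(), debug=True)
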
[14:37:54] 10Machine-Learning-Team, 10Data-Engineering, 10observability, 10Event-Platform Value Stream (Sprint 03): Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (10Ottomata)
[14:38:12] 10Machine-Learning-Team, 10Data-Engineering, 10observability, 10Event-Platform Value Stream (Sprint 03): Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (10lbowmaker)
[14:38:12] I hope to get to that this week :)
[16:14:13] One more question for folks: what is the current k8s version? What version would we need to install kubeflow? What are the features in the new version that we currently lack?
[16:17:31] chrisalbon: we run 1.16, which is not supported by upstream anymore, and we are targeting 1.23, which is reasonably recent (without going for the latest and greatest)
[16:18:10] Knative/Istio/Kubeflow all dropped support for 1.16
[16:18:26] we are basically running our stack on an unsupported platform (more or less)
[16:18:34] I quit
[16:18:46] hahahah
[16:19:11] And the k8s version update involves Service Ops because it's the same k8s version across the tech dept?
[16:21:09] yes correct
[16:21:20] great thanks
[16:21:22] we are using the same baseline for k8s
[16:21:26] yeah
[16:21:41] and sharing components etc.. (like Istio)
[16:21:59] what are all the components we share?
[16:22:03] or at least some of them
[16:22:08] (I'm writing a doc)
[16:24:31] so CoreDNS and Calico are the two components that provide DNS and networking/IP allocation, Istio for service mesh / ingress gateway, the k8s control plane, the k8s kubelet/kube-proxy/etc.. (basically all the components of a cluster)
[16:24:48] and all the Helm/yaml configurations that come with the above mess, of course :D
[16:30:01] awesome thanks!
[16:30:08] no more questions tonight I promise
[16:46:07] ahahah anytime
[16:46:08] ---
[16:46:14] I've read this article today: https://skipperkongen.dk/2016/09/09/easy-parallel-http-requests-with-python-and-asyncio/
[16:46:24] kserve uses a ThreadPoolExecutor IIUC
[16:46:56] https://github.com/kserve/kserve/blob/release-0.8/python/kserve/kserve/model_server.py#L130
[16:47:09] and in our pods we have 5 threads
[16:47:17] maybe we need more?
[16:47:51] I still need to figure out how the ioloop uses the threadpool executor, if one is specified
[16:52:58] but yeah it feels as if the 5 workers that we set (automatically) are all busy and the rest has to wait
[16:53:26] elukey: I think dgit is a great idea for if/when we need to patch upstream (referring to the talk in the SRE meeting just now)
[16:54:13] klausman: yeah it looks very promising
[16:54:28] going afk for tonight folks! Have a nice rest of the day :)
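
(A minimal sketch of the head-of-line effect described above: with a default ThreadPoolExecutor of 5 workers, the number mentioned in the chat, a burst of 10 blocking calls is served in two waves, while doubling the workers roughly halves the wall time. Whether kserve wires its executor into the ioloop exactly like this is an assumption; score() is a hypothetical stand-in.)

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    def score(i):
        # hypothetical stand-in for a blocking model call
        time.sleep(1.0)
        return i

    async def burst(loop, n):
        # None means "use the loop's default executor"
        return await asyncio.gather(*(loop.run_in_executor(None, score, i) for i in range(n)))

    async def main():
        loop = asyncio.get_running_loop()

        loop.set_default_executor(ThreadPoolExecutor(max_workers=5))
        t0 = time.monotonic()
        await burst(loop, 10)  # 10 requests vs 5 workers: two waves, ~2s
        print(f"5 workers:  {time.monotonic() - t0:.1f}s")

        loop.set_default_executor(ThreadPoolExecutor(max_workers=10))
        t0 = time.monotonic()
        await burst(loop, 10)  # enough workers for the whole burst: ~1s
        print(f"10 workers: {time.monotonic() - t0:.1f}s")

    asyncio.run(main())
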
[16:55:11] \o1
[16:55:16] Will head out now as well
[17:25:26] 10Lift-Wing, 10Machine-Learning-Team, 10Epic, 10Research (FY2022-23-Research-October-December): Create a language agnostic model to predict reverts on Wikipedia - https://phabricator.wikimedia.org/T314385 (10leila)
[17:27:01] 10Lift-Wing, 10Machine-Learning-Team, 10Research (FY2022-23-Research-October-December): Create a language agnostic model to predict reverts on Wikipedia - https://phabricator.wikimedia.org/T314385 (10leila)
[19:10:49] 10Machine-Learning-Team, 10ORES, 10MediaWiki-Core-Preferences, 10Moderator-Tools-Team (Kanban): When ORES quality filters are selected in mobile web, entries should be highlighted - https://phabricator.wikimedia.org/T314026 (10Jdlrobson) The highlighting code works fine on desktop Minerva so you can rule o...