[06:23:13] hi folks morning!
[06:23:34] I am rolling out the changes to split traffic between istio gateways more consistently
[06:23:44] istio + knative changes, tested in staging last week
[06:24:30] the issue at the moment is that we have two sets of istio gateways: one for kserve and one for other services (like ores-legacy etc..), and I thought traffic was split between them
[06:24:55] but due to some setting they were sharing traffic, and the new changes fix this behavior
[06:33:46] all right this bit is done
[06:33:57] next step is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940518
[06:34:05] since knative deployments are very slow
[06:34:27] and I figured out the metrics disappearance issue, fixed in --^
[06:35:13] (already applied manually to all clusters so we now have metrics)
[06:38:10] klausman: o/ I realized that in https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving -> Autoscaler we have metrics about requested/terminating pods
[06:39:08] for example I can see that we periodically terminate enwiki-drafttopic
[06:47:49] and https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md#serverlessservices-sks gives more info on what the activator does for us
[06:48:38] very interesting: if we check "kubectl get sks -n revertrisk" we can see if the activator is in "Proxy" mode (namely, if it handles the traffic) or not
[06:50:10] in theory with https://knative.dev/docs/serving/load-balancing/target-burst-capacity/ we could think about having the activator working only when we scale from zero (so basically just a buffer)
[06:50:31] * elukey bbiab
[06:58:13] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (kevinbazira) Thank you for digging into this, @dancy! Enabling users to push to the docker-registry images with layers > 2GB (~4.21GB in our case)...
[07:30:48] kevinbazira: o/ I commented in --^, I think that we may need to think about other solutions for the embeddings file on the recommendation api's image
[07:31:06] I am not 100% sure that serviceops will like having layers so big :(
[07:33:44] knative deployments completed
[07:34:05] Hello!
[07:34:40] kalimera :)
[07:40:09] isaranto: o/
[07:40:10] elukey: o/ I think you meant to say you commented on T288198 instead of T342084. Nonetheless, thank you for providing RelEng with more context on our use case(s). Waiting to hear their suggestions on the way forward.
[07:47:36] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 60 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (karapayneWMDE)
[07:51:52] kevinbazira: yes the related task, but the subject is the same :)
[07:52:45] my aim was to say that we may need to come up with alternatives, not just waiting on releng's, because they are waiting on what service ops decides
[07:53:08] and I am pretty sure that they are not happy to raise the limit :)
[07:54:14] Oh sure, thanks elukey. We can discuss possible solutions 👍
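(A quick way to check the activator mode and the burst-capacity knob mentioned above; illustrative commands, the exact output columns and whether the key is set at all depend on the Knative version and chart.)

    # Is the activator on the data path for the revertrisk namespace?
    # MODE shows "Proxy" when the activator handles traffic, "Serve" when
    # requests go straight to the revision pods.
    kubectl get sks -n revertrisk

    # Cluster-wide target-burst-capacity lives in the config-autoscaler
    # ConfigMap in knative-serving; if the default is in use, the value may
    # only appear in the commented "_example" block (e.g. the "211" sample).
    kubectl -n knative-serving get cm config-autoscaler -o yaml | grep target-burst-capacity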
[07:59:36] elukey: mornin! also: nice catch re: autoscaler
[08:42:46] klausman: if you have time https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940865 + next
[08:43:05] Looking
[08:43:06] If you are ok I'd test 4 activators before raising their memory
[08:43:23] SGTM
[08:43:26] and I also noticed that we run a single autoscaler only
[08:43:42] if you have time I can merge + deploy and you can run your test, so we see how it goes
[08:44:07] we have also knative metrics now etc..
[08:44:35] Do we know why that metric gap last week happened?
[08:44:41] (2x LGTM'd)
[08:45:33] klausman: yes it was in the first code review, for some reason the knative pods are now showing fewer labels, and the selectors in the network policy for the prometheus port were set with some (extra) knative specific ones
[08:45:44] port blocked -> no metrics
[08:45:47] aah, ok
[08:45:55] no idea why
[08:51:38] also knative's deployment now takes around 1:30 mins, that is a bit but we have a lot of pods that need to bootstrap, sync, etc..
[08:51:42] before it was longer
[08:52:19] So it's not just fast, but a lot faster given the add'l pods?
[08:53:33] before the first change the liveness/readiness probe started after 120 seconds, so the deployment took at least 2 mins + the time for pods to bootstrap etc..
[08:53:46] for some reason the webhook takes time to do leader election etc.. (I found a bug about it)
[08:54:08] I tweaked the settings a little so the first probe starts after 30s, and it retries 12 times every 10s
[08:54:13] and this seems to work fine
[08:54:38] Neat!
[08:55:22] I've been meaning to look into the whole liveness/readiness probes and whether our numbers for them were "right", but it fell by the wayside
[09:00:41] I know the LGTM just now was pro forma, but it makes the change more visible in my gerrit dashboard :)
[09:01:20] thanks! it is good anyway, please ping me anytime something feels off
[09:01:29] aye
[09:01:54] I'll keep a watch on `kubectl get pods -n knative-serving -o wide` during testing (on top of the usual watch for the RR pods)
[09:03:57] https://phabricator.wikimedia.org/F37148205 <- Like so :)
[09:06:03] klausman: nice! Standup now :)
[09:06:08] oh, right :)
[09:21:47] Machine-Learning-Team: Migrate to LW docs - https://phabricator.wikimedia.org/T342523 (isarantopoulos)
[09:23:56] Machine-Learning-Team: Documentation to migrate an ORES request to LW - https://phabricator.wikimedia.org/T342523 (isarantopoulos)
[09:31:09] Machine-Learning-Team, Documentation: Documentation to migrate an ORES request to LW - https://phabricator.wikimedia.org/T342523 (Aklapper)
[09:35:14] klausman: all set if you want to test the 4 activators
[09:35:52] ack, will take a quick look over my code to make sure it does what I think it does, then run a 5m test with 50 threads
[09:36:30] I am trying to understand why we don't see all expected logs in logstash
[09:36:36] in the istio gateway board I mean
[09:40:58] starting test run
[09:41:36] Seeing add'l rr-la pods. Startup was on the order of 2-3 seconds, at most
[09:42:08] So far, no errors
[09:45:59] from https://grafana-rw.wikimedia.org/d/c6GYmqdnz/knative-serving?forceLogin&forceLogin&from=now-3h&orgId=1&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=revertrisk&viewPanel=9 I see ~120 rps to each activator for language agnostic
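(A minimal sketch of the probe tuning described above, assuming the standard Kubernetes probe fields; the endpoint and port are hypothetical, the real values are in the deployment-charts change linked earlier.)

    # Start probing the knative pods sooner and keep retrying, instead of
    # waiting a fixed 120s before the first check.
    readinessProbe:
      httpGet:
        path: /healthz         # hypothetical endpoint; the chart defines the real one
        port: 8443             # hypothetical port
      initialDelaySeconds: 30  # first probe after 30s instead of 120s
      periodSeconds: 10        # retry every 10s
      failureThreshold: 12     # up to 12 attempts before the pod is marked unready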
[09:47:17] Test completed. 0 errors, 360qps total
[09:47:32] (~108k requests)
[09:47:49] I'll try another run with 100 threads (doubling), again for 5m
[09:47:54] super
[09:48:08] 0 errors is very nice :)
[09:48:18] maybe we finally have something working
[09:48:24] yep! It has happened in the past, but never with "cold" clusters
[09:49:18] typical activator RAM usage was 400-450Mi during the run
[09:50:18] klausman: IIRC you are not going through the API gateway right? If so let's schedule a test going through it, so we see how bad it becomes
[09:50:38] yes, that should be relatively simple
[09:50:52] wow, 3 activators sustaining ~230 rps each
[09:50:57] I'm hitting the eqiad-specific endpoint atm (https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict)
[09:51:05] super
[09:51:21] Still 0 errors
[09:52:36] Mem usage of the activators is now 530M, which is getting close to the limit
[09:53:10] 0 errors, total of 681qps \o/
[09:53:26] 204268 requests
[09:54:31] wow nice :)
[09:55:00] how close did we get to the memory limit?
[09:55:19] 543M vs 600M
[09:55:34] (or 545, but that's the neighborhood)
[09:58:40] okok, for the moment we are probably ok, what do you think? We can keep it monitored
[09:59:07] yeah, especially since we know that 503s may be ooms
[09:59:22] currently trying to get the API GW call right, mostly getting 400s :D
[09:59:47] :D
[10:00:04] https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-language-agnostic:predict should be right, no?
[10:03:18] ah, got it. Will now run 50 threads for 5m
[10:03:26] ok :)
[10:03:31] isaranto: I opened https://github.com/dennistobar/serobot/pull/6
[10:03:48] the problem was that I was still setting the Host: header, which API GW does not like.
[10:04:00] the bot owner reverted the check_risk function (instead of checkORES) because multi-lingual was behaving poorly
[10:04:10] lemme know if it makes sense
[10:05:59] elukey: this is great, nice work!
[10:06:10] elukey: getting 429s now (rate limit)
[10:06:45] klausman: ah yes you are doing unauthenticated right?
[10:07:00] you have 50k/hour IIRC
[10:07:16] No, I am using my personal token
[10:07:35] with 100k/hour?
[10:07:43] 50k/h
[10:07:57] then it is the same as unauthenticated, did you reach the limit?
[10:08:16] I'll have to see once the run completes
[10:08:50] Code   #      %
[10:08:53] ----------------------
[10:08:54] 429:   8568   14.63
[10:08:56] 200:   49992  85.37
[10:08:58] Tot:   58560
[10:09:08] yep :)
[10:09:13] So yes, I got almost 50k requests in before it limited me
[10:10:30] looks good
[10:10:46] so far it seems that the activators were the problem
[10:10:53] Hmmm. I just moved my token to the WME tier and reset it, but still getting limited to 50k
[10:11:13] this is not good
[10:11:18] Dunno if the reset on the hour must happen before it's 100k.
[10:11:54] One of the activators got to 586Mi mem usage
[10:12:05] And another to 596
[10:12:20] I think a _little_ mem increase might be useful
[10:12:21] If the token id is the same then you are banned for the hour I think, unless we explicitly remove it from the api-gateway's redis
[10:12:30] klausman: +1 yes, let's bump it to be safe
[10:12:40] I'll modify my change for that
[10:13:27] The current change says 1.2Gi mem limit, wdyt?
[10:13:54] I'd lower it a little, maybe 700/800MB?
[10:14:03] yeah, 800M sounds fine
[10:14:20] do we want req=limit or half?
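(A minimal sketch of the resource bump being discussed, as it would render into the activator Deployment; only the memory numbers come from the conversation, the cpu values are illustrative.)

    # activator container resources
    resources:
      requests:
        memory: 800Mi   # requests == limits, so the activator has the memory guaranteed
        cpu: "1"        # illustrative; cpu was not part of the discussion
      limits:
        memory: 800Mi   # bumped from the 600Mi that the load test nearly exhausted
        cpu: "1"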
[10:15:13] let's do =, I need to better understand the req vs limit in detail, but so far IIUC limit is not always applied but only for bursts
[10:15:20] ack
[10:15:23] it is better to have the activators fully geared up
[10:15:30] there is also the option of turning them off
[10:15:37] we should probably run some tests klausman
[10:15:45] it is a single setting that we can apply manually
[10:15:57] https://knative.dev/docs/serving/load-balancing/target-burst-capacity/#setting-the-target-burst-capacity
[10:16:22] we set it to zero and the activator is removed from the request path (not in proxy mode anymore)
[10:17:18] (or we can also tune the current value)
[10:18:42] How did we arrive at 211?
[10:18:51] (mem change has been updated)
[10:20:30] klausman: we don't set anything, so it should pick up the default
[10:20:45] seems 200
[10:20:48] charts/knative-serving/templates/core.yaml has target-burst-capacity: "211"
[10:21:02] that's the example config
[10:21:15] Ah, I see.
[10:21:40] (also, rr-la pods are downscaling now)
[10:30:12] going afk for lunch + errand, ttl!
[10:30:14] * isaranto lunch
[10:31:50] elukey: will merge and deploy now (once CI is happy)
[10:48:52] Ok, all merged and new limit is visible on the dashboards
[10:48:57] * klausman lunch as well
[12:52:13] klausman: nice!
[12:53:33] klausman: if you have time do you mind running a load test for enwiki-goodfaith?
[12:53:45] basically same format, to see how we perform there
[12:55:06] I will have to cobble the request together, but yes. I'll do a disco/pybal direct test first, don't want to eat all my quota :)
[12:55:46] yes yes we can skip the api gateway for this test
[12:55:56] it is just to see if we have to tune anything over there
[12:56:11] we also have way fewer pods so be gentle :D
[13:00:13] (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [research/recommendation-api] - https://gerrit.wikimedia.org/r/940889 (owner: L10n-bot)
[13:00:19] With one thread, for 10s, I get ~11.5qps, 0 errors
[13:01:55] 30 threads, 20s, 106qps, 0 errors
[13:04:14] 60t: 141qps
[13:05:24] Yeah, I think ~140qps is the saturation point
[13:08:22] ok so IIRC we have ~10 pods for enwiki goodfaith
[13:08:33] each one can sustain ~14qps in theory
[13:08:41] with the current settings etc..
[13:09:10] yeah, looks good. I'm looking at one small discrepancy (seeing status code 0 in some dashboards)
[13:10:58] ah, nvm that was me messing with curl until I got the URL/request right
[13:40:58] Let's go get rid of ores!
[13:48:14] chrisalbon_: or is it ores that will get rid of us! :D
[13:48:50] Fight to the death
[13:54:17] isaranto: something interesting - I turned the ores-legacy's staging tls proxy logging to debug
[13:54:30] and I saw connect timeouts for the 503s
[13:55:13] o/ chrisalbon_:
[13:55:31] Hey!
[13:56:14] elukey: u mean all this might just have been fastapi timeouts?
[13:58:00] isaranto: sorry I was answering in #serviceops, I wanted to tell you that the envoy's cluster config (basically the conn pool to lift wing) has by default a conn timeout of 0.250s
[13:58:04] that is very tight
[13:58:23] so when we have too many connections from ores-legacy to lift wing, we risk hitting it
[13:58:28] I'm checking a stackoverflow thread that says fastapi default timeout is 60s
[13:58:28] getting the 503s etc..
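(For reference, the Envoy knob being described is the per-cluster connect_timeout; a minimal sketch of what the relevant piece of the TLS-proxy config looks like, with an illustrative cluster name rather than the actual ores-legacy chart values.)

    static_resources:
      clusters:
      - name: lift_wing_upstream       # hypothetical name for the Lift Wing upstream
        type: STRICT_DNS
        connect_timeout: 0.25s         # the tight default mentioned above; raising it is one possible mitigation
        load_assignment:
          cluster_name: lift_wing_upstream
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: inference.svc.eqiad.wmnet
                    port_value: 30443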
[13:58:36] ack
[13:58:36] nono fastapi is fine in this case
[13:58:41] it is the envoy proxy sigh
[13:59:23] isaranto: I was wondering if we could support having multiple rev-ids at once in our model servers
[13:59:37] but probably it wouldn't make things great
[14:01:37] we probably can (and we should)
[14:02:09] since mw-api supports multiple revids we would be good to go, but I have to double check with the revscoring code as well
[14:07:08] isaranto: then let's open a task, and then somebody can pick it up in the coming days. WDYT?
[14:08:01] Machine-Learning-Team, Patch-For-Review: use wikiID in inference name on LW for revscoring models - https://phabricator.wikimedia.org/T342266 (isarantopoulos) Changes to reflect the above patches have been made in the API docs https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_goodfait...
[14:08:22] elukey: I think I have now reached the point where the tool is overengineered: https://phabricator.wikimedia.org/P49678 But I felt a histogram was worth the 15m of extra coding and testing :)
[14:09:04] elukey: sounds good. I'll create the task
[14:09:40] klausman: great work! Can you report your findings in the task?
[14:09:45] Will do
[14:29:19] Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (klausman) Luca and I have gotten to the bottom of where the 503s come from. It was ultimately caused by the autoscaling activators being underprovisioned (both rep...
[14:38:24] Going afk to run an errand for 30-50'
[14:38:35] *30-40 :)
[14:58:41] isaranto: (for when you are back) - another road that we could pursue is to support up to X rev-ids in the query string
[14:58:47] like 10
[14:58:57] and return a 400 when there are more
[14:59:08] explaining to batch on the client side etc..
[14:59:14] or to use Lift Wing directly
[15:04:01] elukey: random thought: do we know how big the ORES Redis cache backing store is?
[15:04:24] klausman: IIRC 16G
[15:04:29] but we have multiple instances
[15:04:44] ack.
[15:14:48] Machine-Learning-Team: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (elukey) Another round of debug today :) I turned the envoy instance in the ores-legacy staging's tls proxy to log at DEBUG level, and I discovered that the 503s are connect timeouts to...
[15:14:55] isaranto: left my thoughts on --^
[15:15:24] klausman: lemme know what you think about --^ when you have a moment
[15:16:16] going afk for groceries, ttl!
[15:21:41] elukey: should I add it to the ticket or braindump here?
[15:32:51] So I think we should do #3 (use ORES cache in o-l) only if it's relatively easy to do. Both o-l and the cache are not things we want to maintain for long. #1 sounds a bit more sustainable, but would need some parallelism, I think. Hard to say what the LW-side impact of that would be for wikis that have only a handful of replicas. #2 seems the least credible, though *if* it works, it would be
[15:32:53] most convenient for the users. Overall, my preferred solution is #0: stop supporting request batching beyond what HTTP itself allows for (pipelining/reusing TCP conns).
[15:33:16] But for all of these, knowing how frequent and important batched requests are is a crucial piece of data we need.
[15:42:15] thanks!
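(A minimal sketch of the "small batch with a hard cap" idea floated above, assuming a FastAPI-style endpoint with an ORES-like pipe-separated revids parameter; the route, cap, and response shape are illustrative, not the actual ores-legacy/Lift Wing API.)

    from fastapi import FastAPI, HTTPException

    app = FastAPI()
    MAX_REVIDS = 10  # illustrative cap; larger batches get a 400 with guidance

    @app.get("/scores/{context}")
    async def get_scores(context: str, revids: str):
        # ORES-style batching passes rev-ids as a pipe-separated query parameter.
        rev_ids = [r for r in revids.split("|") if r]
        if len(rev_ids) > MAX_REVIDS:
            raise HTTPException(
                status_code=400,
                detail=(
                    f"Too many rev_ids ({len(rev_ids)} > {MAX_REVIDS}). "
                    "Please batch on the client side or call Lift Wing directly."
                ),
            )
        # ...fan the (small) batch out to Lift Wing and assemble the response...
        return {"context": context, "rev_ids": rev_ids}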
[15:42:48] I think that batching so many reqs is probably something that few use, but it needs to be verified
[15:43:15] I love the fact that https://fastapi.tiangolo.com/async/#is-concurrency-better-than-parallelism mentions "machine learning" as a use case that doesn't fit :D
[15:43:22] (they are right)
[15:46:51] Lift-Wing, Machine-Learning-Team: Enable batch inference for revscoring models - https://phabricator.wikimedia.org/T342555 (isarantopoulos)
[15:47:05] yeah, the typical latency of ML req/resp is just not exactly Web-levels of responsive
[15:47:34] the main issue is the cpu bound part though
[15:47:39] that blocks the event loop
[15:48:16] The only way out of that would be to make the client and server do this entirely asynchronously, but that is not how REST works.
[15:48:47] let's keep the real options on the table :)
[15:49:00] A man can dream
[15:49:02] I agree with Tobias on this one, to forbid these requests. On the other hand batched inference is something we ought to provide, as it is a common use case/thing to do
[15:50:22] isaranto: can you check how many requests are hitting ores like this? And if there are bot UAs etc.. then we can decide if we want to forbid totally or not
[15:50:41] I think that we can support little batches, maybe not 100 rev-ids at once :)
[15:50:57] Yeah, I think the magic number there might be closer to 10 :)
[15:51:19] supporting batches will require, in my opinion, that we have streams that warm up a cache (in cassandra?) like ores does with redis
[15:51:30] it is the only way to avoid problems
[15:51:35] there was no UA, but the huge request specified on the task is a really frequent one
[15:51:55] frequent ==> how many calls / hour ?
[15:52:04] more or less
[15:52:15] I like to be cryptic :D
[15:52:17] if we deny it we should add a message explaining why etc..
[15:52:31] lemme check as I don't remember (I remember sth in thousands)
[16:25:18] going afk folks, have a nice rest of the day!
[16:25:22] o/
[16:32:51] \o
[16:44:50] I had a bit of difficulty getting what I wanted from logstash, so I'll need some help on that tomorrow
[16:44:55] logging off o/
[17:08:09] Thanks for all the work all
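(A minimal sketch of the event-loop point made above: CPU-bound model scoring inside an `async def` handler stalls every other request, while offloading it to a worker thread keeps the loop responsive. Function names and timings are illustrative, not the actual model-server code.)

    import asyncio
    import time

    from fastapi import FastAPI

    app = FastAPI()

    def score_revision(rev_id: int) -> float:
        # stand-in for CPU-heavy feature extraction + model inference
        time.sleep(0.5)
        return 0.42

    @app.get("/blocking/{rev_id}")
    async def blocking(rev_id: int):
        # runs directly on the event loop: every concurrent request waits behind this call
        return {"score": score_revision(rev_id)}

    @app.get("/offloaded/{rev_id}")
    async def offloaded(rev_id: int):
        # pushes the CPU-bound work to the default executor so the loop stays free
        score = await asyncio.get_running_loop().run_in_executor(None, score_revision, rev_id)
        return {"score": score}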