[03:07:44] (LiftWingServiceErrorRate) firing: ...
[03:07:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:12:44] (LiftWingServiceErrorRate) resolved: ...
[04:12:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:51:23] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9739156 (kevinbazira) Hi @Isaac thank you for following up on this. The Content Translation recommendation API is now live in LiftWing production. It can...
[07:05:46] Good morning!
[07:36:25] Machine-Learning-Team: Support building and running of articletopic-outlink model-server via Makefile - https://phabricator.wikimedia.org/T360177#9739233 (kevinbazira)
[07:37:32] Machine-Learning-Team: Support building and running of logo-detection model-server via Makefile - https://phabricator.wikimedia.org/T363294 (kevinbazira) NEW
[08:01:30] It feels like Mordor over here https://www.aljazeera.com/gallery/2024/4/24/athens-turns-orange-under-north-african-dust-clouds
[08:01:32] :D
[08:04:29] or Dune
[08:36:11] Walk without rhythm!
[08:36:14] Also, morning!
[08:39:14] morning! you're right!
[08:39:42] Machine-Learning-Team, Cassandra: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` - https://phabricator.wikimedia.org/T362649#9739402 (klausman)
[08:51:27] morning!
[08:52:01] (PS1) Ilias Sarantopoulos: revertrisk: support all wikis [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023809
[08:52:15] morning Aiko!
[08:52:44] (LiftWingServiceErrorRate) firing: ...
[08:52:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:57:42] (CR) CI reject: [V: -1] revertrisk: support all wikis [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023809 (owner: Ilias Sarantopoulos)
[09:01:09] isaranto: regarding the patch above, what do you think about removing the check_supported_wikis completely?
[09:02:41] because muniza said they consider stopping storing the list of supported wikis in the model
[09:03:28] I'd prefer to at least log a message whenever a non-natively supported wiki is used, so that we can use it for debugging/monitoring
[09:04:16] but in the long run we should remove it. For now we'd like to see if everything goes well etc.
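(Editor's note: a minimal sketch of the "log instead of reject" idea discussed above. The helper name and the NATIVELY_SUPPORTED_WIKIS set are illustrative only — the real list ships with the revert-risk model and the actual check_supported_wikis code may look different.)

```python
import logging

# Hypothetical example set; the real list is packaged with the model artifacts.
NATIVELY_SUPPORTED_WIKIS = {"enwiki", "frwiki", "viwiki"}


def note_unsupported_wiki(lang: str, rev_id: int) -> None:
    """Log (but do not reject) requests for wikis the model was not trained on.

    Keeping a log line here preserves a debugging/monitoring trail while the
    hard check is being phased out.
    """
    wiki_db = f"{lang}wiki"
    if wiki_db not in NATIVELY_SUPPORTED_WIKIS:
        logging.info(
            "Request for non-natively supported wiki %s (rev_id=%s); serving prediction anyway.",
            wiki_db,
            rev_id,
        )
```

In this sketch, preprocess() would call the helper right after parsing the payload, so the behaviour change is purely observational.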
[09:05:09] ok sounds good to me
[09:07:45] (PS2) Ilias Sarantopoulos: revertrisk: support all wikis [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023809 (https://phabricator.wikimedia.org/T363203)
[09:14:42] actually I asked Muniza and Diego since they have more context. If there is no issue then we could remove all checks altogether. For the above patch to work, the restrictions from knowledge-integrity will have to be lifted first
[09:19:36] ack
[09:22:44] (LiftWingServiceErrorRate) resolved: ...
[09:22:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:38:12] ^^^ this resolved by itself, but I saved the container logs. It does look a lot like the ruwiki failure (super-slow backend leading to client DCs), but I got nothing concrete yet
[09:42:27] ack!
[09:43:15] where did u save the logs? curious to check the requests to see if it is revision-related or just an mwapi issue
[09:43:36] https://grafana.wikimedia.org/goto/3vpmAdBIg?orgId=1 for graphs on the matter
[09:50:30] isaranto: my homedir on deploy1002, in viwiki-20240424. Let me make sure the permissions are open enough
[09:51:21] isaranto: yeah, /home/klausman/viwiki-20240424 (and the files in it) should be readable for you
[09:55:34] * klausman lunch
[09:57:40] Danke schön!
[10:12:59] * isaranto lunch!
[10:34:24] Machine-Learning-Team, Moderator-Tools-Team, PageTriage: Detection and flagging of articles that are AI/LLM-generated - https://phabricator.wikimedia.org/T330346#9739810 (Tgr) StackExchange used the number of client-side drafts (basically a function of typing speed, since they save post drafts period...
[12:17:56] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9740157 (isarantopoulos)
[12:17:57] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9740156 (isarantopoulos)
[12:18:44] (LiftWingServiceErrorRate) firing: ...
[12:18:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:31:54] I'm going to deploy the latest changes for revscoring-editquality reverted, because right now we don't log the payload
[12:32:02] going to ml-staging first
[12:33:26] the thing is that the logs show horrible response times, ~200-300 seconds, while grafana says otherwise https://grafana-rw.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-component=All&var-model_name=viwiki-reverted&from=now-6h&to=now (p99 tops at 10 seconds)
[12:33:34] (PS1) Kevin Bazira: Makefile: add support for logo-detection [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023532 (https://phabricator.wikimedia.org/T363294)
[12:33:36] isaranto: ack.
[12:33:59] I had hoped this time around I was quick enough to get the start of the bad queries, but no such luck. Trying logstash
[12:39:00] I am seeing stuff like: INFO:root:Function get_revscoring_extractor_cache took 265.9564 seconds to execute.
[12:39:07] and: 2024-04-24 11:59:59.656 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 265.8229219999921
[12:41:34] when someone has some time https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023846
[12:41:48] (CR) Kevin Bazira: "To make reviewing easier, here are the commands I used to test the logo-detection model-server build:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023532 (https://phabricator.wikimedia.org/T363294) (owner: Kevin Bazira)
[12:42:07] yes Tobias, these are the extreme latencies I also saw. They go up to 600 seconds
[12:42:45] So still +1'd
[12:43:13] Danke!
[12:43:32] the "so still" was half a though I hand't finished :D
[12:43:37] though*
[12:43:42] thought**
[12:44:30] I note that that change is for all services in staging, and eq-rv everywhere, I presume that's intentional
[12:47:32] yes! so that we log the payload for all of them
[12:48:35] well, for all in staging and eq-rv everywhere
[12:48:44] (LiftWingServiceErrorRate) resolved: ...
[12:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:48:51] I see the change is merged, are you deploying it?
[12:49:20] yes, I just did revscoring-editquality-reverted in staging
[12:50:10] am I missing or misreading something, or is Grafana not showing correct numbers? There are predict steps taking 200+ seconds but the dashboard shows up to 10 seconds
[12:50:45] which dashboard?
[12:51:05] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?
[12:51:21] the viwiki reverted one https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-component=All&var-model_name=viwiki-reverted&from=now-6h&to=now
[12:51:21] if so: note that the issue went away a minute or three ago, so
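(Editor's note: the "Function get_revscoring_extractor_cache took N seconds to execute" lines quoted above look like the output of a simple timing wrapper. Below is a minimal sketch of such a decorator, assuming nothing about the repo's actual helper — the decorator name and the placeholder function body are illustrative.)

```python
import functools
import logging
import time


def log_execution_time(func):
    """Log how long the wrapped callable took, mirroring the
    'Function X took N seconds to execute.' lines seen in the container logs."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            logging.info("Function %s took %.4f seconds to execute.", func.__name__, elapsed)

    return wrapper


@log_execution_time
def get_revscoring_extractor_cache(rev_id: int):
    # Placeholder for the real call that fetches revision data from the MW API
    # and runs revscoring feature extraction.
    ...
```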
[12:52:10] mercelisv: o/
[12:53:04] That 10s looks like a hard limit, maybe the metrics don't go beyond that
[12:53:05] you can just go ahead and edit the current pull request if you want (submit a new commit to the same branch). https://github.com/wikimedia/liftwing-python/pull/2
[12:53:14] no need to create another one
[12:55:25] klausman: I don't see any Y-axis limits on Grafana, so it doesn't seem to be like that. Unless there is some other config that overrides it (but I think it would show in the UI)
[12:55:53] No, I meant that the monitoring data shows values >10 as 10
[12:56:34] or it's a 10s timeout on the client side that is visible on that dash, but we never cancel our own backend call?
[12:56:52] aha, that would make sense
[12:57:18] That would still be something for us to fix, if it's not too deep into the bowels of revscoring
[12:57:19] I'm thinking we should have a timeout configured in our app and we use that everywhere in the requests etc.
[12:57:28] Agreed
[12:57:44] so at least we set the timeout in one place, I mean
[12:58:42] deploying changes to prod now
[12:58:54] ack, seeing the restarts
[13:00:23] restarts complete
[13:02:27] So far, no traffic on the BE
[13:08:00] Lift-Wing, Machine-Learning-Team: [httpbb] fix failing httpbb test in production enwiki-articletopic - https://phabricator.wikimedia.org/T363334 (isarantopoulos) NEW
[13:09:03] hello folks
[13:09:09] ohai Luca
[13:11:05] re: timeout - I suspect that the high number of seconds is all CPU time, not calls to api-ro or similar
[13:11:20] so stopping that from completing is not easy in my opinion
[13:11:33] ack
[13:11:45] hey Luca!
[13:12:32] yeah, see https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-pod=viwiki-reverted-predictor-default-00014-deployment-798758f48vpf&var-pod=viwiki-reverted-predictor-default-00015-deployment-6ccc88bzfcm2&var-container=All&from=now-6h&to=now
[13:12:47] cpu completely saturated
[13:13:05] the kserve-container's cpu, I mean
[13:13:17] yes!
[13:13:52] Could that be caused by constantly making stack traces and writing them to the logs? In the olden days of Python 2.2, stack traces were super expensive to create
[13:14:27] I think it is all due to revscoring, when it needs to do feature extraction :(
[13:14:53] also notice how autoscaling created one more instance, that probably saved a little
[13:14:55] Another oddity is that the viwiki rs-eq-rv service seems to have traffic that is a mere 1 qps
[13:15:00] ok, so either we fix the code (if there is an issue) or increase cpu limits?
[13:15:37] if we find a way to repro, I think there may be some rev-ids leading to a lot of cpu time spent by revscoring
[13:16:47] but higher cpu limits are probably not useful in our case - with asyncio we use only one thread/cpu, and if one request uses cpu time without returning control to the ioloop, everything stalls
[13:16:58] including parsing/returning/etc. of other requests
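(Editor's note: to make the point about the single-threaded ioloop concrete — if feature extraction is CPU-bound, one mitigation (the "MP revscoring" idea mentioned just below) is to push that work onto a process pool so the event loop keeps serving other requests. This is a rough, self-contained sketch, not how the inference service is currently written; names and the simulated workload are made up.)

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

# One pool per worker; its size would have to be matched to the pod's cpu limit.
_executor = ProcessPoolExecutor(max_workers=2)


def extract_features(rev_id: int) -> dict:
    """Stand-in for CPU-heavy revscoring feature extraction."""
    total = sum(i * i for i in range(10_000_000))  # simulate CPU-bound work
    return {"rev_id": rev_id, "dummy_feature": total}


async def preprocess(payload: dict) -> dict:
    # run_in_executor hands the CPU-bound work to another process, so the
    # asyncio event loop stays free to parse and answer other requests.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, extract_features, payload["rev_id"])


async def main():
    results = await asyncio.gather(*(preprocess({"rev_id": r}) for r in (1, 2, 3)))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```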
[13:17:49] so when we see those 200+ second preprocess timings, it may mean that during that timeframe everything is stalled
[13:18:13] using MP for revscoring may help, and in that case higher cpu limits would be needed
[13:18:17] does it make sense?
[13:18:29] fine to say "no Luca please shut up" :D
[13:18:57] yes, it does make sense
[13:19:10] I agree that the proper fix would be to fix the code instead of just throwing more resources at it
[13:19:24] looking at the code, there is a GET request made within that function. So we can investigate further what is going on
[13:19:35] I worry about it being a bit of a rabbithole in code we don't really want to support too much longer
[13:19:49] perhaps some more logging within the function would let us know where time is being spent exactly
[13:20:49] the key would be to find a reliable repro / rev-id, after that it would be easier to debug locally
[13:21:06] if of course there is a rev-id kind/type that causes this
[13:21:11] Agreed. Hopefully with the rollout we did today, we'll cover the next occurrence
[13:22:00] something also to keep in mind - we raised the ruwiki-damaging min instances to 4 and the issue was mitigated (not completely solved)
[13:22:30] and note that the issue was coming from the same function
[13:22:41] the explanation that I give to myself is that there are more pods with free cpu to use, hence less chance for clients to get to stalled pods
[13:22:46] yep
[13:22:52] also kudos Tobias, as your alert helped a ton
[13:23:00] 🎉
[13:24:26] elukey: do you have any idea why Grafana shows up to 10s when we have many occurrences of predict that take more than 200 seconds?
[13:24:26] https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-component=All&var-model_name=viwiki-reverted&from=now-6h&to=now
[13:25:04] isaranto: I think the best way is to check what the prometheus metric exporter returns
[13:25:09] just asking if there is any Grafana config related that I haven't spotted
[13:25:10] ack
[13:25:37] because it may be that the max prometheus bucket set for those metrics is 10s
[13:26:13] isaranto: do you know how to do it? I can explain if you want
[13:28:44] I was searching for access to the pushgateway UI or sth. are they available through Thanos?
[13:28:55] elukey: please do tell me :D
[13:29:18] lemme check one thing
[13:32:43] isaranto: I checked the metric name in grafana (you can select a panel -> inspect -> JSON), and it is request_preprocess_seconds_bucket. In Thanos I am checking the metrics, and afaics if I isolate a specific pod the last value for the "le" label before "+Inf" is 10.0
[13:32:54] le shows the buckets defined for the prometheus metric
[13:33:15] so it is how we suspected, the granularity of the metric shows up to 10s, nothing more
[13:33:25] it kinda makes sense, it is not expected to have 200s+ :D
[13:33:46] ok, clear then
[13:33:51] thanks!
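(Editor's note: for reference, this is how the 10s ceiling comes about with prometheus_client-style histograms — observations larger than the biggest explicit bucket only increment the implicit "+Inf" bucket, so a 200+ second preprocess is indistinguishable from ">10s" and quantiles derived from the buckets top out at 10s. The bucket values below are illustrative, not necessarily what kserve configures for request_preprocess_seconds.)

```python
from prometheus_client import Histogram, generate_latest

# Illustrative buckets; per the discussion, the metric's last finite "le" bucket is 10.0.
REQUEST_PREPROCESS_SECONDS = Histogram(
    "request_preprocess_seconds",
    "Preprocess latency in seconds",
    buckets=(0.1, 0.5, 1.0, 2.5, 5.0, 10.0),
)

# A 265s observation lands only in the implicit +Inf bucket; histogram_quantile()
# over these buckets can therefore never report more than 10s.
REQUEST_PREPROCESS_SECONDS.observe(265.9)

print(generate_latest().decode())
```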
[13:41:46] Lift-Wing, Machine-Learning-Team, revscoring: [revscoring] Investigate increased latencies in get_revscoring_extractor_cache function - https://phabricator.wikimedia.org/T363336 (isarantopoulos) NEW
[13:41:55] I created a task to investigate this
[13:41:55] the other way is to use nsenter or similar on the node where the container runs, and use curl localhost:port/metrics to get the metrics
[13:42:07] but it needs sudo, so me or Tobias need to run it
[14:45:20] isaranto: I am thinking to add something like https://github.com/kserve/kserve/blob/master/python/kserve/kserve/model.py#L158 to our kserve logging for the json payload
[14:45:42] so we associate the same req-id with preprocess_ms etc. and the json payload
[14:47:16] if we have the headers, don't recall
[14:47:37] I am digging into ruwiki's logs, I found some slow requests
[14:47:50] but it is difficult to pinpoint the new log with the preprocess_ms one
[14:48:24] yes, that would make sense. I as well don't recall what headers we have available
[14:48:42] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9740744 (Isaac) Ahh this is great news @kevinbazira ! @KartikMistry is there any reason from the Content Translation side why we can't switch over to the...
[15:25:37] (PS1) Elukey: revscoring_model: log request_id alongside with inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023877 (https://phabricator.wikimedia.org/T362663)
[15:28:46] in theory --^ should do the trick
[15:29:47] (CR) Klausman: [C: +1] revscoring_model: log request_id alongside with inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023877 (https://phabricator.wikimedia.org/T362663) (owner: Elukey)
[15:30:04] nice catch of those two ruwiki revision ids
[15:30:48] Both of them are replacements of flags on proposed-for-deletion articles. What a bizarre combo
[15:31:11] ah, it's the same article, even
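(Editor's note: for context on the patch under review here — the idea borrowed from kserve's own request logging (the model.py link above) is to read the per-request id from the incoming headers and emit it next to the JSON payload, so the line can be joined with the kserve.trace preprocess/predict timings. A hedged sketch of the approach only: the class name, header name "x-request-id", and the exact log format are assumptions, not necessarily what the merged change does.)

```python
import json
import logging
from typing import Any, Dict, Optional


class RevscoringModel:  # illustrative stand-in for the real model-server class
    async def preprocess(
        self, inputs: Dict[str, Any], headers: Optional[Dict[str, str]] = None
    ) -> Dict[str, Any]:
        # Assumes kserve passes the request headers to preprocess(); falling back
        # to "N/A" keeps the log line usable if the header is missing.
        request_id = (headers or {}).get("x-request-id", "N/A")
        logging.info(
            "JSON payload for the request-id %s: %s", request_id, json.dumps(inputs)
        )
        # ... revscoring feature extraction etc. would follow ...
        return inputs
```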
[15:31:11] (CR) Ilias Sarantopoulos: [C: +1] revscoring_model: log request_id alongside with inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023877 (https://phabricator.wikimedia.org/T362663) (owner: Elukey)
[15:37:02] (CR) Elukey: [C: +2] revscoring_model: log request_id alongside with inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023877 (https://phabricator.wikimedia.org/T362663) (owner: Elukey)
[15:37:14] I didn't find yet veeery long ones though
[15:37:45] (Merged) jenkins-bot: revscoring_model: log request_id alongside with inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023877 (https://phabricator.wikimedia.org/T362663) (owner: Elukey)
[15:50:30] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023880 to update the damaging isvcs
[15:50:42] if you are ok, I am planning to test in staging and then deploy to prod
[15:52:57] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9741067 (isarantopoulos)
[15:53:00] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9741068 (isarantopoulos)
[15:53:16] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9741070 (isarantopoulos)
[15:53:19] Lift-Wing, Machine-Learning-Team, Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9741069 (isarantopoulos)
[15:53:31] thanksss
[15:53:39] or test on prod and then deploy to staging :P
[15:54:09] deal :D
[15:57:24] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9741128 (isarantopoulos) **Update**: We have Mistral-7b-instruct hosted on ml-staging that uses a CPU and is using the pytorch base image that w...
[15:58:24] I'm stepping afk folks, have a nice evening and cu tomorrow!
[15:58:29] o/
[16:00:42] INFO:root:JSON payload for the request-id cbfe027b-0c0b-469f-ab44-ecff959926b6: b'{"rev_id": 153825175}'
[16:01:09] all good, proceeding with prod
[16:05:56] and done
[16:13:00] now it is way better to find slow rev-ids :)
[16:27:18] nice work!
[16:45:57] going afk for the evening, have a nice rest of the day folks!
[18:28:45] Machine-Learning-Team, Patch-For-Review: Unsupported lang error for some wiki for revertrisk-language-agnostic calls - https://phabricator.wikimedia.org/T363203#9741958 (CodeReviewBot) mnz updated https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/37 feat(featureset): fall...