[07:44:05] hello folks
[07:47:37] so the readiness probe settings are not applied sigh
[07:58:42] o/
[08:05:25] kalimera (good morning) :)
[08:06:13] Did u figure out why it wasn't applied?
[08:06:32] Cause the structure seemed correct according to the k8s API..
[08:15:13] no idea, I think that maybe it has to do with the Go code for the kserve controller
[08:15:20] the readiness probe atm is this one
[08:15:42] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[08:15:56] that is a mixture of istio (port 15020) and knative (queue-proxy)
[08:16:05] so maybe there is some bit missing in the code
[08:19:54] Ack..
[08:22:03] the new knative revision is not created when the isvc is changed
[08:22:04] mmmmm
[08:24:14] same for the liveness probe
[08:29:50] I'm going afk for 30-40'
[08:55:10] opened https://github.com/kserve/kserve/issues/2994
[09:14:43] Morning!
[09:15:00] For those, like me, who occasionally code offline, here's a neat tool: https://zealdocs.org
[09:15:28] It's an offline-mode coding (etc.) docs browser that ships with documentation for a whole bunch of languages/frameworks/libs
[09:36:10] elukey: so how is changeprop deployed? :)
[09:37:23] same as the other services
[09:37:40] check the diff carefully before syncing
[09:37:47] Alrighty.
[09:37:52] and alert hugh/kamila
[09:38:07] What's the easiest way to post-push verify that everything is working?
[09:38:40] I usually check the logs and the grafana dashboard
[09:40:15] https://grafana.wikimedia.org/d/000000201/deprecated-change-propagation?orgId=1&refresh=1m is marked as deprecated, but I presume it's still useful?
[09:53:07] klausman: https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m
[09:53:25] Ah, thx
[09:53:52] So should I update https://wikitech.wikimedia.org/wiki/Changeprop#How_to_monitor_it with that link?
[10:02:12] in general yes, if you see stale docs feel free to fix them
[10:08:50] done & done
[10:09:04] wasn't quite sure if in this case sending people to the old dashboard might've still been useful
[10:12:24] isaranto: I found a way to tweak the readiness probe for falcon, but I see that it takes more than 300s to bootstrap
[10:12:37] I've rolled out openssl updates on ores*, could you roll-restart it (used by celery/uwsgi)?
[10:13:18] moritzm: we love ORES!
[10:13:21] sure :)
[10:13:34] klausman: me eqiad and you codfw?
[10:13:41] elukey: the readiness probe is per container, right?
[10:14:01] I mean different for the storage initializer and the kserve-container
[10:14:16] yes yes, the storage init works fine
[10:14:22] the kserve-container gets stuck in
[10:14:22] Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
is it trying to download from the internet?
[10:15:14] ack :-9
[10:15:59] no, it is loading from the directory exactly like bloom does
[10:16:49] weird
[10:17:06] Machine-Learning-Team, Research, Section-Level-Image-Suggestions, Section-Topics, Structured-Data-Backlog: Let the model that learns section alignments consume section topics output - https://phabricator.wikimedia.org/T331968 (mfossati)
[10:17:09] changeprop update has been pushed to codfw, will keep an eye on it and let it soak before proceeding
[10:17:20] ack
[10:17:24] very strange that I see
[10:17:25] Warning Unhealthy 4m44s (x3 over 4m45s) kubelet Readiness probe failed: Get "http://10.67.17.253:15021/healthz/ready": dial tcp 10.67.17.253:15021: connect: connection refused
[10:17:52] at this point this is different from the one that I thought I had set?
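Context for the probe confusion above: the 15020/app-health probe and the 15021 failure are both istio artifacts. The sidecar rewrites HTTP probes so the kubelet queries istio's agent on port 15020, which proxies the check to the container's declared probe under /app-health/<container>/readyz, while port 15021 /healthz/ready is the istio-proxy's own readiness endpoint. Below is a minimal sketch of what the declared per-container probe plausibly looks like before the rewrite, using the kubernetes Python client; the path and port are assumptions, not taken from the kserve controller code.

    from kubernetes import client

    # Declared probe on the queue-proxy container, mirroring the rendered
    # settings pasted above (delay=0s timeout=1s period=10s #success=1 #failure=3).
    queue_proxy_probe = client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/", port=8012),  # assumed knative user port
        initial_delay_seconds=0,
        timeout_seconds=1,
        period_seconds=10,
        success_threshold=1,
        failure_threshold=3,
    )

    # With istio probe rewriting enabled, the kubelet never calls this URL
    # directly: it hits http://:15020/app-health/queue-proxy/readyz (istio's
    # fixed naming scheme, whatever the declared path was), and the sidecar
    # agent forwards the request to the probe defined here.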
[10:19:10] Machine-Learning-Team, Research, Section-Level-Image-Suggestions, Section-Topics, Structured-Data-Backlog: Let the model that learns section alignments consume section topics output - https://phabricator.wikimedia.org/T331968 (mfossati)
[10:21:13] ah yes of course, the istio proxy
[10:21:16] * elukey sigh
[10:22:51] Machine-Learning-Team, Research-Backlog, Section-Level-Image-Suggestions, Structured-Data-Backlog: [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (mfossati)
[10:29:07] elukey: so with the codfw changeprop change deployed, should we be seeing requests in the codfw pod logs?
[10:29:25] (because I don't)
[10:30:56] klausman: codfw is the inactive dc, so changeprop shouldn't work in there (except for some use cases). The logs emit only errors, so if you don't see anything it should be good
[10:31:56] Alright, I shall continue with eqiad then
[10:35:07] yes, but please ping hugh/kamila first
[10:35:16] already done
[10:35:23] (before codfw of course
[10:35:25] )
[10:36:13] and here come the requests to articletopic in eqiad
[10:38:18] nice :)
[10:53:11] klausman: thanks!! I see some traffic hitting the outlink models
[10:54:08] yep. it's enough to validate that it increased, but not enough to make me worry :)
[10:54:16] but I just realised I did something stupid :(
[10:55:01] in the changeprop config, I filtered only enwiki events
[10:55:59] I should have removed that (that was for testing in staging)
[10:56:09] so that in prod we get all wikis' events
[10:57:09] I thought that was deliberate, starting slowly :)
[10:58:48] I'll file a patch :)
[11:00:46] going afk for lunch! ml-serve-eqiad's experimental state is still messed up, will check later on
[11:06:58] (CR) Ladsgroup: feat: hardcode threshold calls to switch to Lift Wing (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[11:07:57] (CR) Ladsgroup: [C: +2] "I'm going to merge this and then we can test this in beta cluster" [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[11:13:11] <- lunch as well
[11:16:43] (Merged) jenkins-bot: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[12:01:07] (CR) DCausse: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[12:04:12] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 63 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (TheresNoTime)
[12:16:46] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou) We now can see some traffic hitting the outlink model server on LiftWing! https://grafana.wikimedia.org/d/zsdYRV7Vk/isti...
[12:41:00] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou)
[12:42:43] {"error":"OutOfMemoryError : HIP out of memory. Tried to allocate 80.00 MiB (GPU 0; 15.98 GiB total capacity; 15.87 GiB already allocated; 70.00 MiB free; 15.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF"}
[12:42:48] isaranto: --^
[12:43:00] I was able to make falcon-7b start but...
[12:44:13] nice!
[12:45:30] I can take a look tomorrow. probably this means we are running out of GPU memory though
[12:46:03] yeah, and also the exception needs to be handled with a cleanup or something, since subsequent calls all fail
[12:46:21] trying to delete the pod and start a new one, maybe with a smaller response it works
[12:46:28] the difference is 10 MiB
[12:47:36] did this happen when you made a call or when the model was loaded? As I understand it, it happened during a request
[12:48:20] according to the memory profiling I have posted on phab, double the amount of memory is being used during generation of samples (inference), so I need to work on that front
[12:48:22] (CR) Ottomata: events: propagate the event time with the dt field (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[12:49:25] I'll also try to use 8-bit integers instead of 16-bit floats, which will cut the model size in half (but also hurt the quality of predictions). I tried that a couple of days ago locally but I was getting errors
[12:50:13] isaranto: during the request yes, result_length: 50
[12:50:36] ok then it makes sense. Great job making the pod start though!!
[12:50:46] I also had to bump the ram to 40G
[12:50:53] because of OOM etc..
[12:51:16] Is the RAM usage spiky, i.e. just that high on startup?
[12:51:58] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) We were able to run falcon-7b llm on an AMD GPU on Lift Wing, but sadly the GPU's memory is not enough: ` {"error":"OutOfMemoryError : HIP out of memory...
[12:55:39] Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (elukey)
[12:56:00] klausman: not sure
[12:56:07] I haven't checked everything yet
[12:56:22] let's see
[12:57:28] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=experimental&var-pod=falcon-7b-instruct-gpu-predictor-default-00001-deployment-5m4lc&var-container=All
[12:57:36] looks stable, not 40G but around 30
[13:01:05] Phew. That's chunky
[13:01:15] Wonder what the pod actually does with all that memory
[13:02:32] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (isarantopoulos) Most probably this is related to the memory usage while generating new samples (i.e. running inference). According to some memory profiling I ha...
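The OOM message at 12:42 suggests one mitigation itself (the allocator's max_split_size_mb knob) and the chat adds another (8-bit weights). A hedged sketch of both follows, assuming a transformers-based model server; the model path is a placeholder, not the real storage-initializer directory.

    import os

    # Allocator hint from the error text: on ROCm the variable is
    # PYTORCH_HIP_ALLOC_CONF (CUDA builds use PYTORCH_CUDA_ALLOC_CONF).
    # It must be set before torch first initializes the GPU.
    os.environ["PYTORCH_HIP_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch
    from transformers import AutoModelForCausalLM

    model_dir = "/mnt/models/falcon-7b-instruct"  # hypothetical path

    # ~14 GB of float16 weights for a 7B model leaves almost no headroom on a
    # 16 GiB GPU once inference-time activations are added on top, matching
    # the "15.87 GiB already allocated; 70.00 MiB free" numbers above.
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
        local_files_only=True,   # load the local checkpoint shards, no download
        trust_remote_code=True,  # falcon shipped custom modeling code at the time
    )

    # The 8-bit idea from 12:49 would roughly halve that again, at some quality
    # cost -- but load_in_8bit goes through bitsandbytes, which had no ROCm
    # support at the time, plausibly the local errors mentioned:
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_dir, load_in_8bit=True, device_map="auto", local_files_only=True)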
[13:08:06] (PS1) Elukey: llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583)
[13:31:41] moritzm: ores workers restarted
[13:37:56] ack, thx
[13:39:05] Machine-Learning-Team, Patch-For-Review, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) Challenges found so far: * Due to how Knative works (that we use to manage deployments etc..) a pod is deleted only when its new v...
[13:49:04] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (elukey) We reviewed the task as a team, and we decided to postpone any decision to the next ML/Research sync to better understand th...
[13:50:32] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (Samwalton9) >>! In T337501#8934952, @elukey wrote: > We reviewed the task as a team, and we decided to postpone any decision to the...
[13:56:35] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (elukey) Sure definitely! We (as ML) are still very far from having some automated environment to train models, for the next fiscal y...
[14:03:31] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (elukey) @kevinbazira I have a generic question about the python repo, nothing urgent but I'd like to know your thoughts. We are working on Fast API for ores-legacy, and most of the team is...
[14:18:04] * elukey afk for a bit!
[14:18:52] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (diego) The easiest option I can think about, would be to have an app (toolforge, wmfcloud), that allows (certain) users to provide s...
[14:27:02] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (kevinbazira) @elukey migrating the recommendation-api codebase from flask to fastapi is a good idea. However, this would be equivalent to rebuilding the entire project which I don't think i...
[14:40:47] (CR) Klausman: llm: add clean up steps when GPU errors are raised (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[15:13:19] Machine-Learning-Team, API Platform: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 (klausman) After some experimenting, the state of how rate limits for API tokens, the API gateway and Lift Wing currently a...
[15:16:04] Machine-Learning-Team, API Platform: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 (klausman) I discussed the above questions with Luca today, and I think for now we can proceed with telling WME to start ex...
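The change at 13:08 only names the idea; here is a minimal sketch of what such clean-up steps could look like, not the actual contents of https://gerrit.wikimedia.org/r/930622.

    import gc
    import torch

    def generate_with_cleanup(model, input_ids, **gen_kwargs):
        """Run generation, releasing cached GPU memory if we hit an OOM."""
        try:
            return model.generate(input_ids, **gen_kwargs)
        except torch.cuda.OutOfMemoryError:
            # Drop dead references, then return the allocator's cached-but-unused
            # blocks to the driver, so the next request doesn't inherit a fully
            # reserved GPU ("the exception needs to be handled with a cleanup",
            # 12:46). empty_cache() does not touch live tensors, so the loaded
            # model weights stay on the GPU -- the concern raised in the review
            # at 15:34 below. On ROCm builds torch.cuda is backed by HIP.
            gc.collect()
            torch.cuda.empty_cache()
            raise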
[15:25:57] (CR) Elukey: llm: add clean up steps when GPU errors are raised (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[15:33:08] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (elukey) @kevinbazira yep I agree, but we'd need to create a lot of scaffolding in deployment-charts to run Flask, to then migrate to Fast API, so extra work will be needed anyway. What I wo...
[15:34:19] (CR) Ilias Sarantopoulos: [C: +1] "Emptying the cache is good for debugging purposes however I hope it doesn't mess up with the model that is loaded on GPU." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[15:43:24] (CR) Elukey: llm: add clean up steps when GPU errors are raised (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[16:13:51] (CR) Ilias Sarantopoulos: [C: +1] llm: add clean up steps when GPU errors are raised (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[16:16:33] (PS2) Elukey: llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583)
[16:16:44] (CR) Elukey: llm: add clean up steps when GPU errors are raised (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[16:16:54] (CR) Elukey: llm: add clean up steps when GPU errors are raised (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[16:18:40] going afk folks o/
[16:20:39] (CR) Elukey: [C: +2] llm: add clean up steps when GPU errors are raised [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930622 (https://phabricator.wikimedia.org/T334583) (owner: Elukey)
[16:21:22] going afk as well!
[16:29:06] (PS4) DCausse: events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648)
[16:29:08] (PS1) DCausse: events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648)
[16:43:20] (CR) AikoChou: [C: +1] "Thanks for the commit! LGTM :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929690 (owner: DCausse)
[17:05:44] (PS5) DCausse: events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648)
[17:05:46] (PS2) DCausse: events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648)
[17:23:41] (PS3) DCausse: events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648)
[17:25:23] (CR) CI reject: [V: -1] events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[17:44:31] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (Samwalton9) >>! In T337501#8935117, @diego wrote: > The easiest option I can think about, would be to have an app (toolforge, wmfclo...
[19:12:01] (CR) DCausse: events: propagate the event time with the dt field (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[22:27:58] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (dancy)