[07:51:06] hello folks! [07:55:44] Good morning! o/ [08:45:18] 10Machine-Learning-Team, 10Patch-For-Review: Enable local runs for article-descriptions model - https://phabricator.wikimedia.org/T351940 (10isarantopoulos) Current status is that I'm able to start the model server but I'm getting an error when making a request. ` 2023-11-24 21:04:56.674 uvicorn.error ERROR:... [09:03:33] 10Machine-Learning-Team, 10Patch-For-Review: Enable local runs for article-descriptions model - https://phabricator.wikimedia.org/T351940 (10isarantopoulos) [09:35:24] 10Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (10isarantopoulos) Results for drafttopic kserve 0.11.2 ` wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: e... [09:36:06] elukey: I can upgrade drafttopic model server so that we can also check pyrra. is it ok if I do it now? [09:39:17] also I was thinking of adding a label to all revscoring pods with the image tag that they are using in order to make it easier to see which image is deployed. We now get the checksum (sha256) on kubectl describe pod which isn't that straightforward. Any other ideas or objections to this? [10:10:26] 10Machine-Learning-Team: Establish a standard load testing procedure - https://phabricator.wikimedia.org/T348850 (10isarantopoulos) [[ https://github.com/locustio/locust | Locust ]] seems like a nice tool for this kind of work (Leaving this here for future reference) [10:24:15] reporting the same for draftquality and articletopic (ready to deploy) [10:39:19] sorry I was in an interview, looks good! [10:39:41] klausman: o/ if you are around, do you have a moment to check what's happening to ml-serve2007? [10:49:40] no worries! I'm ready to deploy all of the revscoring servers. results look good (and in some cases even better, which is weird). 
I'll report in the task [10:49:55] isaranto: let's wait a sec, one prod node is down [10:50:01] yep! [10:50:03] not a big deal but lemme check what's wrong [10:50:28] until now my assumption is that articlequality results are worse in staging because of multiprocessing enabled. I'll run a test before and after deployment to verify this is the case [10:56:48] ack! [10:58:14] Multi-bit memory errors detected on a memory device at location(s) DIMM_A2. [10:58:30] and I think it happened before [10:58:35] anyway, powercycled, let's see [11:00:05] 🤞 [11:00:05] yeah I see https://phabricator.wikimedia.org/P44898 [11:00:22] so I think it is safe to keep going, but maybe a follow up with dcops is good [11:01:03] ack [11:01:11] so we have the same issue since Feb? [11:01:29] or does it recur from time to time? [11:01:43] it happened two times, I don't find other tasks but I recall the hostname [11:02:07] it may be a faulty bank of ram that once in a while causes issues [11:02:43] isaranto: green light [11:02:51] elukey: wdyt about the label for the image tag I wrote above ?--^ [11:03:02] adding a label to the pod [11:03:54] could be an option yes, maybe knative offers something [11:04:13] check just to be sure, otherwise we can add it [11:05:14] was just thinking to add it manually in the chart/helmfile extracting it from values.yaml. But I'll check if there is some other way through knative [11:15:08] oOo some values weren't updated (articletopic, articlequality) - I accidentally left them out in a previous patch. 
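The pod-label idea discussed above could look something like this in a chart template; the label key and the values path are illustrative assumptions, not the actual deployment-charts layout:

```yaml
# Illustrative sketch only: surface the deployed image tag as a pod label.
# The label key and the .Values path below are assumptions, not the real
# chart's field names.
metadata:
  labels:
    app.kubernetes.io/version: {{ .Values.inference.image_tag | quote }}
```

With such a label in place, `kubectl get pods -L app.kubernetes.io/version` would show the tag directly instead of requiring a describe to read the sha256 digest.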
So I'll rerun some tests before upgrading - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/977605 [11:22:12] tried deploying the article-descriptions model-server in the experimental namespace and it has run into a `CrashLoopBackOff` issue and the old pod is still up and running: [11:22:12] ``` [11:22:12] kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/experimental$ kube_env experimental ml-staging-codfw [11:22:12] kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/experimental$ kubectl get pods [11:22:12] NAME READY STATUS RESTARTS AGE [11:22:13] article-descriptions-predictor-default-00002-deployment-d86zggf 3/3 Running 0 3d1h [11:22:13] article-descriptions-predictor-default-00003-deployment-86xrwgk 1/3 CrashLoopBackOff 11 (3m28s ago) 37m [11:22:14] revertrisk-wikidata-predictor-default-00012-deployment-7b62cfjr 3/3 Running 0 3d19h [11:22:14] ``` [11:23:31] o/ kevin! I'm checking [11:23:41] okok [11:25:57] isaranto, kevinbazira - let's try to debug together in here [11:26:22] for example, what are the steps to take to verify what's happening in a crashloop [11:26:43] kubect describe $pod-name -n experimental is surely a start [11:26:47] *kubectl [11:26:51] to look for errors etc.. [11:26:51] sure! so first thing I'd check would be the pod description and events in the namespace [11:27:06] kevinbazira: do you want to check --^ ? [11:27:47] yes, I've checked: [11:27:47] ``` [11:27:47] kubectl describe pod article-descriptions-predictor-default-00003-deployment-86xrwgk [11:27:47] ``` [11:31:23] anything useful? [11:32:17] (checking as well) [11:32:34] also kubectl get events -n experimental is useful [11:34:43] Containers for `istio-validation`, `storage-initializer`, and `istio-proxy` started well. 
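The crashloop-debugging steps walked through above, gathered in one place (pod name is the one from the log; the `kubectl logs --previous` step is an extra suggestion, not something run in the session):

```
# Container states, last termination reason, and probe failures:
kubectl describe pod article-descriptions-predictor-default-00003-deployment-86xrwgk -n experimental

# Recent namespace events (scheduling, probes, restarts):
kubectl get events -n experimental --sort-by=.lastTimestamp

# Logs from the previous, crashed instance of the failing container:
kubectl logs article-descriptions-predictor-default-00003-deployment-86xrwgk -c kserve-container -n experimental --previous
```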
[11:34:43] The `kserve-container` also started and the following Warnings were thrown: [11:34:43] ``` [11:34:43] Readiness probe failed: HTTP probe failed with statuscode: 503 [11:34:43] Back-off restarting failed container [11:34:43] ``` [11:35:22] yes perfect, this is the issue, it fails the readiness probe [11:35:47] so something happens when kserve-container bootstraps, and it never reaches the point where it starts answering health checks [11:37:09] kevinbazira: another useful thing in kubectl describe is the following (section kserve-container): [11:37:12] State: Waiting [11:37:14] Reason: CrashLoopBackOff [11:37:17] Last State: Terminated [11:37:19] Reason: OOMKilled [11:40:21] Tobias bumped the memory allowance manually, do you recall by how much? [11:40:27] I can try to do the same so we see if it works [11:40:37] (it was manual, so after the deployment it got overridden) [11:41:32] bumped to 12G [11:41:33] If I remember correctly he had mentioned 8Gi [11:41:43] okok let's try 12G [11:41:46] ah snap ok, well worst case we reduce :) [11:41:55] sure sure thanks! [11:42:07] any doubts about the debug above? [11:42:41] OOMKilled is indeed the issue. I see 4G as requests and limit in the container which is not enough [11:43:36] I think it needs at least 9-10GB at the moment and once we update the way it loads the model to use less ram it may go down to 7-8 [11:44:00] we could have a higher limit (say 10/12G) and Request set to 7/8 [11:44:16] kevinbazira: ok pod is up! [11:44:53] Great! thanks let me check [11:52:10] (03PS1) 10Ilias Sarantopoulos: article-descriptions: low memory usage on load [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977630 [11:53:40] I haven't tested the above patch but this is the way to go. 
Otherwise we're just throwing away resources (explanation in commit msg and docs) [11:58:32] ok let me test it on the ml-sandbox [12:02:16] 10Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (10isarantopoulos) I ran load testing for all revscoring model servers comparing staging (version 0.11.2) with production (0.11.1). All servers brought similar results with the exception of articlequality which w... [12:11:41] going afk for lunch! [12:21:54] 10Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (10isarantopoulos) The above makes sense (answering to myself 😛 ) as codfw in production is not idle but constantly gets traffic from enwiki. Running a load test on eqiad verified this: ` wrk -c 1 -t 1 --... [12:22:44] I'm having a discussion with myself on the phab task above. Just reporting here that all is going well with the upgrade. I still have goodfaith and reverted to go, which I'll do after lunch [12:22:49] * isaranto lunch o clock! [12:28:54] isaranto: I tested the `low_cpu_mem_usage` parameter on the ml-sandbox with `accelerate` installed and the model loaded fast. Thank you for the suggestion! [12:30:09] (03CR) 10Kevin Bazira: [C: 03+1] "I tested the `low_cpu_mem_usage` parameter on the ml-sandbox with `accelerate` installed and the model loaded fast. Thank you for the sugge" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977630 (owner: 10Ilias Sarantopoulos) [13:21:40] elukey: taking a look at ml-serve2007 now, sorry for the delay [13:22:24] Mh, rebooted 2h20m ago. Was that you? [13:22:54] ah, nvm, you already solved it. 
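The sizing discussed above (request roughly what the server needs at steady state, leave headroom in the limit so model loading doesn't get OOMKilled) would look something like this in the values file; the exact keys depend on the chart, so treat it as a sketch:

```yaml
# Sketch using the numbers from the discussion; keys are illustrative.
resources:
  requests:
    memory: 8Gi   # steady-state need once the low-memory load patch lands
  limits:
    memory: 12Gi  # headroom for the model-loading peak
```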
[13:28:02] (03CR) 10Ilias Sarantopoulos: [C: 03+2] article-descriptions: low memory usage on load [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977630 (owner: 10Ilias Sarantopoulos) [13:28:48] (03Merged) 10jenkins-bot: article-descriptions: low memory usage on load [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977630 (owner: 10Ilias Sarantopoulos) [13:34:13] (03PS1) 10Ilias Sarantopoulos: ci: test debian bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977676 [13:35:03] (03CR) 10Ilias Sarantopoulos: "This is just a test to check if images are built properly. Will abandon this patch but record the findings in a task for us to have for fu" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977676 (owner: 10Ilias Sarantopoulos) [13:37:45] (03PS2) 10Ilias Sarantopoulos: ci: test debian bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977676 [13:46:02] 10Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (10isarantopoulos) [13:52:14] (03CR) 10CI reject: [V: 04-1] ci: test debian bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977676 (owner: 10Ilias Sarantopoulos) [13:58:16] Good morning! [14:06:06] I am back! [14:07:40] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (10isarantopoulos) Upgraded the server and ran some load tests. Results are in line with past values ` wrk -c 4 -t 2 --timeout 3s -s revertrisk.lua https:/... [14:24:15] chrisalbon: o/ [14:24:35] I have so many slack messages [14:24:43] Whyyyyyyyyyyyy [14:25:34] Hey elukey! [14:30:56] hey Chris! [14:31:38] Morning, Chris [14:34:33] Good news, we are getting an intern [14:34:58] nice! 
[14:35:20] Bad news (for me), last intern search had 15,000 applications [14:38:12] 🎉 sounds awesome! [14:41:16] (the part that we're getting an intern) [14:50:08] 10Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (10isarantopoulos) The following model servers have been upgraded to kserve 0.11.2 [x] revscoring [x] langid [x] llm [x] revertrisk-language-agnostic [x] revertrisk-wikidata Also: [x] Ran load tests and... [14:50:45] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (10isarantopoulos) 05Open→03Resolved [14:50:47] 10Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213 (10isarantopoulos) [14:51:56] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (10isarantopoulos) Deployed the model server so that it has the latest image. It is still running kserve 0.10. Status is the same and we are waiting for a new c... [14:54:18] ✅ finally done with kserve updates! [14:57:59] \o/ [14:58:02] nicely done isaranto [14:58:38] the catboost folks told me that they are investigating some buffer underflow related (maybe) to my merge request, I think it may take more time sigh [14:58:53] not sure if we want to attempt to build the python package by ourselves and test it [14:59:58] one thing that we could do is to verify what version of catboost we run [15:00:49] they'll release a new minor (hopefully) containing the cgroups v2 code, so I'd verify with Research if upgrading is ok [15:00:54] yes, I read it earlier as I was going through the tasks. I don't think it is worth it to do this manual work (upgrading it ourselves). 
especially since the two dependent model servers are not used anywhere (readability, revertrisk-multilingual) [15:01:04] (03PS1) 10Kevin Bazira: article-descriptions: fix AsyncSession host header [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977241 (https://phabricator.wikimedia.org/T351940) [15:02:21] kevinbazira: this is the same issue I was getting --^ [15:02:55] do you have any more context? it is super weird, right? [15:03:33] 10Machine-Learning-Team, 10Patch-For-Review: Enable local runs for article-descriptions model - https://phabricator.wikimedia.org/T351940 (10kevinbazira) I discovered what was causing this issue and pushed a patch for it [[ https://gerrit.wikimedia.org/r/977241 | here ]]. Essentially the first code snippet bel... [15:03:42] isranto: yes it is. I've shared here: https://phabricator.wikimedia.org/T351940#9359437 [15:04:09] *saranto:--^ [15:04:23] isaranto:--^ [15:04:45] yes, thanks for sharing! we need to dig a bit deeper though as it seems that something isn't working right [15:19:08] isaranto: ack! (re: catboost) [15:19:24] so KI uses 1.1.1 afaics [15:20:23] ack [15:20:45] same for readability [15:21:02] and pypi's last version is 1.2.2 [15:21:09] elukey: how do the slo dashboards look after the deployments? I took a look but I can't really tell [15:21:42] isaranto: bad :( [15:22:02] lemme give you a link [15:22:30] isaranto: https://w.wiki/8Hoz [15:22:40] see the Prometheus graph on the left? [15:22:48] Under "Requests" [15:22:53] it has a huge spike [15:23:10] the main issue is https://www.robustperception.io/rate-then-sum-never-sum-then-rate/ [15:23:27] good that we verified it :D [15:23:33] I need to go back to the drawing board [15:26:24] ook. so the deployment did that? are timestamps in utc or browser time? (seems like the latter) [15:27:44] I thought UTC, but if you click on prometheus links you should be able to see the query in the thanos ui [15:27:46] ah nevermind, I understood now. 
The way the calculation is done shows this spike (ilias explains out loud) [15:27:54] yeah exactly :( [15:28:06] there are too many metrics so we need to aggregate them before pyrra [15:28:22] but this causes some troubles since pyrra applied increase() (similar to rate()) behind the scenes [15:29:52] ok, verified with thanos. unfortunately they are in local browser time (nothing to do now , just noting) [15:41:52] elukey: I've been doing Prometheus monitoring for 8+ years, and the rate/sum thing still bites me several times a year XD [15:44:07] I kinda knew it didn't work, but it showed some good graphs so part of me was secretly hopeful :D [15:56:05] 10Machine-Learning-Team, 10Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (10calbon) p:05Triage→03Medium [15:56:29] 10Machine-Learning-Team, 10Goal: Goal: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (10calbon) p:05Triage→03Medium [15:56:32] 10Machine-Learning-Team, 10Goal: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (10calbon) p:05Triage→03High [15:56:36] 10Machine-Learning-Team, 10Goal: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (10calbon) p:05Triage→03Medium [15:56:41] 10Machine-Learning-Team, 10Goal: Goal: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time. - https://phabricator.wikimedia.org/T348154 (10calbon) p:05Triage→03Medium [15:57:01] 10Machine-Learning-Team, 10Goal: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (10calbon) p:05High→03Medium [16:01:31] klausman: if you have time (not urgent), can you follow up on the lift wing expansion tasks? 
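The rate-then-sum pitfall linked above can be shown with a toy model of Prometheus counter semantics (a simplification of `increase()`/`rate()`, not the real PromQL engine): when one series has a counter reset (e.g. a pod restart), summing the raw counters before taking the rate corrupts the aggregate, while taking the rate per series and then summing stays correct.

```python
# Minimal model of Prometheus increase(): a sample lower than its
# predecessor is treated as a counter reset (counter restarted from zero).

def increase(samples):
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

pod_a = [0, 10, 20, 30]    # healthy counter
pod_b = [100, 110, 0, 10]  # pod restarted: counter reset after t=1

# Correct: apply increase() per series, then sum the results.
rate_then_sum = increase(pod_a) + increase(pod_b)  # 30 + 20 = 50

# Wrong: sum the raw counters, then apply increase() to the sum.
summed = [a + b for a, b in zip(pod_a, pod_b)]     # [100, 120, 20, 40]
sum_then_rate = increase(summed)                   # 60 -- overcounts by 10

print(rate_then_sum, sum_then_rate)  # prints: 50.0 60.0
```

The drop in the summed series looks like a reset of the whole aggregate, so the naive query books the pre-drop value as extra increase; with many series behind one recording rule, such spikes show up exactly as in the dashboard above.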
I am a little worried about the status, seems all stalling [16:01:48] (they need to be reviewed, not sure if something is pending from us or not) [16:01:59] I think we stopped due to the size for the GPUs [16:02:34] Ack, will do [16:14:35] (03CR) 10Ilias Sarantopoulos: [C: 03+1] article-descriptions: fix AsyncSession host header [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977241 (https://phabricator.wikimedia.org/T351940) (owner: 10Kevin Bazira) [16:15:05] kevinbazira: I'm solving the same issue in the local run patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/976670 [16:16:25] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977241 (https://phabricator.wikimedia.org/T351940) (owner: 10Kevin Bazira) [16:16:44] made it work, but we need to follow up with mwapi on why the headers can't be set. aiko may have (maybe, maybe not :)) more context as I see the same code in revertrisk https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/revert-risk-model/model-server/model.py#83 [16:17:13] (03Merged) 10jenkins-bot: article-descriptions: fix AsyncSession host header [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977241 (https://phabricator.wikimedia.org/T351940) (owner: 10Kevin Bazira) [16:17:16] we need to do some refactoring though as there are too many hardcoded values [16:17:46] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "I've tested this locally, not sure if it works with api-ro though" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/977241 (https://phabricator.wikimedia.org/T351940) (owner: 10Kevin Bazira) [16:18:12] isaranto: sure sure. 
she wrote this class: https://github.com/mediawiki-utilities/python-mwapi/blame/master/mwapi/async_session.py#L55 [16:18:50] ah TIL, didn't know we wrote that. nice! [16:19:15] we'll need to report the bug then :0 [16:19:34] going afk folks, have a nice rest of evening/day [16:20:04] enjoy your evening isaranto! o/ [16:21:09] o/ [16:24:20] 10Machine-Learning-Team, 10observability, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2): Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (10elukey) Also found https://github.com/istio/istio/issues/38841, I need to verify if the bug is still a concern or no... [16:44:26] 10Machine-Learning-Team, 10observability, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2): Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (10elukey) Example of a random metric from Lift Wing: ` istio_requests_total{app="istio-ingressgateway", chart="gatewa... [18:01:28] going afk folks! [18:01:32] have a nice rest of the day [18:24:54] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144 (10Sgs) >>! In T308144#9323129, @Trizek-WMF wrote: > We will work on this task at the beginning of 2024. I thought we were aiming to e... [18:34:44] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144 (10Trizek-WMF) We have to make proper community engagement, which is not doable at the moment as I'm working on {T346108}. [19:50:31] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Growth-Team, 10Wikipedia-Android-App-Backlog, 10Patch-For-Review: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (10JTannerWMF)
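For reference, the Host-header pattern behind the AsyncSession patch above: Lift Wing services call an internal endpoint but must carry the public wiki's Host header so the gateway routes the request to the right wiki. mwapi/aiohttp specifics aside, the mechanics can be sketched with the stdlib alone (the endpoint and wiki below are illustrative, not the exact values from the patch):

```python
# Sketch only: the real code uses mwapi's AsyncSession (aiohttp underneath),
# whose default-header handling is what the patch had to work around.
import urllib.request

# Connect to an internal entry point, but present the public wiki's Host
# header so the request is routed to the intended wiki.
endpoint = "https://api-ro.discovery.wmnet/w/api.php"  # illustrative
req = urllib.request.Request(endpoint, headers={"Host": "en.wikipedia.org"})

print(req.full_url)            # https://api-ro.discovery.wmnet/w/api.php
print(req.get_header("Host"))  # en.wikipedia.org
```

If a client library silently overwrites a caller-supplied Host header with one derived from the URL, this routing pattern breaks, which is the bug worth reporting upstream.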