[07:33:51] \o/
[07:34:05] nice work aiko :)
[07:34:37] I got some comments in my pull request for catboost (finally), hope to have it fixed as well soon-ish
[07:57:37] (CR) Kevin Bazira: [C: +1] "Besides the merge conflict, the rest LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633) (owner: Ilias Sarantopoulos)
[08:17:45] artificial-intelligence, Structured-Data-Backlog: Implement NSFW image classifier using Open NSFW - https://phabricator.wikimedia.org/T214201 (Aklapper) a: Harshineesriram→None @Harshineesriram: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I...
[08:20:43] (PS2) Elukey: Upgrade model servers to kserve 0.11.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633) (owner: Ilias Sarantopoulos)
[08:20:51] (CR) Elukey: [C: +1] Upgrade model servers to kserve 0.11.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633) (owner: Ilias Sarantopoulos)
[08:23:57] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Aklapper) a: Psingh07→None @Psingh07: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_man...
[08:23:58] klausman, kevinbazira o/ - IIUC for rec-api-ng we still need to expose the service via the api-gateway - is that something you have planned for the next weeks?
[08:24:12] yes.
[08:24:20] also: good morning :)
[08:25:41] morning!
[08:26:53] about the rec-api-ng, my idea re: apigw was to have a working/testable service (via discovery) first, and then add the apigw config
[08:27:23] I have admittedly been a bit OOTL regarding the rec-api-ng, but plan on syncing with Kevin (and you) about where it's at
[08:33:39] no no, I am not here :D
[08:33:51] I was just reviewing tasks
[08:34:00] anything that you and kevin decide is good for me
[08:34:10] I wanted to ask what the status was :)
[08:34:11] Hey folks! o/
[08:34:18] Morning, Ilias!
[10:08:00] (PS1) Ilias Sarantopoulos: change default config values to support local/patchdemo deployments [extensions/ORES] - https://gerrit.wikimedia.org/r/976157
[10:16:33] (CR) Ilias Sarantopoulos: "Have a first version of the patch ready for review!" [extensions/ORES] - https://gerrit.wikimedia.org/r/971547 (https://phabricator.wikimedia.org/T348298) (owner: Ilias Sarantopoulos)
[10:17:11] I have the patch ready for adding revertrisk to the extension! https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/971547
[10:17:16] at least the first version of it
[10:18:38] wow!
[10:19:27] does it require any changes to the db schema?
[10:19:37] or will those be handled transparently?
[10:19:54] nope. no changes required
[10:20:02] very nice
[10:20:15] at least that's how I've planned it. don't know what others will think
[10:20:45] we chose to manipulate the revertrisk response and transform it to match the schema of the response of the ORES models
[10:21:20] Amir1: o/ it is that time of the year where I try to bribe you to review my php code <3
[10:23:26] isaranto: makes sense for the moment, it seems the easiest
[10:23:55] maybe let's think about what to do if more use cases come (after RR LA I mean)
[10:24:19] yy. It is not ideal nor does it scale, but at the moment it is all I can do
[10:26:27] That would require a rewrite of some parts of the extension for sure
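The transformation described above can be pictured with a short sketch: reshape the flat Revert Risk response into the nested wiki → scores → revision → model structure that ORES responses use. This is an illustration based on the publicly documented response formats, not code from the patch itself; the registered model name and exact field names are assumptions.

```python
# Illustrative sketch only: field names follow the public Lift Wing
# Revert Risk and ORES response formats; the model name used for
# registration is an assumption, not taken from the actual patch.


def revertrisk_to_ores_schema(rr: dict, model_name: str = "revertrisklanguageagnostic") -> dict:
    """Reshape a Revert Risk prediction into an ORES-style score map."""
    output = rr["output"]
    return {
        rr["wiki_db"]: {
            "scores": {
                str(rr["revision_id"]): {
                    model_name: {
                        "score": {
                            "prediction": output["prediction"],
                            "probability": output["probabilities"],
                        }
                    }
                }
            }
        }
    }


# Example input, per the public API docs:
# {"wiki_db": "enwiki", "revision_id": 12345,
#  "output": {"prediction": False, "probabilities": {"true": 0.1, "false": 0.9}}}
```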
[10:50:17] Machine-Learning-Team, MediaWiki-extensions-ORES: Update ORES extension configuration - https://phabricator.wikimedia.org/T351703 (isarantopoulos)
[10:51:06] I added the above task to track some config changes I made as side changes to the revertrisk patch (though they have nothing to do with it)
[10:51:58] (PS2) Ilias Sarantopoulos: change default config values to support local/patchdemo deployments [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703)
[10:55:24] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Update ORES extension configuration - https://phabricator.wikimedia.org/T351703 (isarantopoulos) p: Triage→Medium
[11:00:41] (CR) Ilias Sarantopoulos: [C: +2] Upgrade model servers to kserve 0.11.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633) (owner: Ilias Sarantopoulos)
[11:14:38] (Merged) jenkins-bot: Upgrade model servers to kserve 0.11.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633) (owner: Ilias Sarantopoulos)
[11:29:49] * elukey lunch!
[11:50:14] * isaranto afk for lunch!
[12:06:26] * klausman lunch and errands
[12:35:57] isaranto: xD sure
[12:42:59] it's a black friday offer: 3 patches for the price of one :D
[13:42:25] kevinbazira, klausman - I left a comment in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/975929, I am curious about the specs used - any particular reason to use 2 cpus in there?
[13:42:30] (and 4G of memory)
[13:43:23] I think that's mostly copy&paste from the RR service above
[13:45:57] it is fine for testing, but let's be mindful with resource usage (my suggestion)
[13:46:26] I see that we use more CPUs for revert risk as well, but I am not 100% sure if we need them
[13:49:53] Ack. We could update the change to move to 1cpu/2Gi, or leave it and keep an eye on behavior, then decide resources once we move out of experimental?
[13:50:44] fine to keep it, but I'd test with fewer resources before graduating to production (as part of load tests etc.)
[13:51:01] ack
[13:51:17] basically as part of https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing
[13:53:09] Should I add a blurb about resource considerations to step 3 there?
[13:54:17] sure
[13:54:32] there is some mention in step 4
[13:54:43] maybe you can also expand it to make it clearer
[13:56:33] No, I think the existing step 4 is fine. I just want to mention resources earlier. Y'know, for people like me who never read step X before having done X-1 ;)
[13:56:55] Pushing the above change to staging/experimental now
[13:59:15] Ok, one container is being OOMKilled, investigating
[14:03:05] Not sure what's going on, it never lives long enough for me to see logs
[14:05:20] I bet that it is due to the memory consumption while loading the model
[14:05:42] it is easy to check on localhost if this is the case
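A minimal way to do that local check, assuming the model files are already on disk (the path below is a placeholder): load the model the same way the server does and compare the process's peak RSS before and after. On Linux, ru_maxrss is reported in KiB.

```python
# Minimal sketch for checking peak memory during model load on localhost.
# The model path is a placeholder; ru_maxrss is in KiB on Linux.
import resource

import torch


def peak_rss_mib() -> float:
    """Peak resident set size of this process, in MiB."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


print(f"peak RSS before load: {peak_rss_mib():.0f} MiB")
state_dict = torch.load("/mnt/models/pytorch_model.bin", map_location="cpu")
print(f"peak RSS after load:  {peak_rss_mib():.0f} MiB")
```

If the after-load peak is several times the on-disk size, startup OOMKills would be explained by transient load-time allocations rather than steady-state usage.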
[14:06:10] we don't have a ton of memory to spare in staging, but you can try to increase memory limits manually
[14:06:15] via kubectl edit I mean
[14:06:29] that's what I was about to do
[14:07:11] currently figuring out the cmdline syntax for that
[14:08:44] (thank the creator of bash completion!)
[14:08:57] Ok, pushed to 6Gi, we'll see if that works
[14:09:57] Nope, not enough
[14:12:28] Wow, even 8Gi is not enough
[14:13:26] It's weird that in the graphs the gap between limit and usage is enormous, the usage never getting near the limit. Must be one big bunch of allocations in one go.
[14:13:51] oh nvm, it rockets to 14G of usage early on
[14:14:49] The model on S3 is only 2Gi
[14:15:02] kevinbazira: any idea why the service is so memory hungry on startup?
[14:18:40] klausman: testing the image on the ml-sandbox shows it used about 4Gi using docker stats:
```
CONTAINER ID   NAME              CPU %   MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O     PIDS
027e48595afe   hopeful_faraday   0.00%   4.028GiB / 35.36GiB   11.39%   679MB / 2.12MB   0B / 36.9kB   16
```
[14:19:09] I just tried giving it 16Gi, that seems to work, the graphs on Grafana show peak usage just under 9.3Gi
[14:19:44] Working set, that is, "used" is 10.8Gi
[14:21:30] https://phabricator.wikimedia.org/F41523036
[14:23:55] elukey: I now have a bunch of old crashlooping deployments in experimental, how do I remove those?
[14:27:11] ah, they went away by themselves
[14:27:35] And there was another container restart, even with 16Gi
[14:28:56] klausman: I see 3 pods for the article-descriptions isvc, I was hoping to see 1.
```
kevinbazira@deploy2002:~$ kube_env experimental ml-staging-codfw
kevinbazira@deploy2002:~$ kubectl get pods
NAME                                                              READY   STATUS             RESTARTS      AGE
article-descriptions-predictor-default-00002-deployment-65kcbjq   1/3     CrashLoopBackOff   3 (26s ago)   3m15s
article-descriptions-predictor-default-00003-deployment-6dsbtfg   1/3     OOMKilled          3 (53s ago)   3m10s
article-descriptions-predictor-default-00004-deployment-77mddw2   2/3     Running            1 (15s ago)   9m40s
revertrisk-wikidata-predictor-default-00011-deployment-65fz6hpp   3/3     Running            0             29d
```
[14:29:39] I only see one (with 2 started containers)
[14:30:12] The restart was caused by the container not becoming responsive in time
[14:30:19] okok
[14:30:56] kubectl get event -n experimental --field-selector involvedObject.name=article-descriptions-predictor-default-00004-deployment-77mddw2 will show the events around a pod
[14:31:03] 4m35s Warning Unhealthy pod/article-descriptions-predictor-default-00004-deployment-77mddw2 Readiness probe failed: Get "http://10.194.61.163:15020/app-health/queue-proxy/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[14:32:01] Seeing that the container received ~15M/s of data on the network, could this just be the model download being slow?
[14:33:11] when experimenting with LLMs we had to increase the readiness probe to allow for longer bootstrap timings
[14:33:38] Yeah, I suspect that might be it. Is that timeout part of the helmfile?
[14:34:08] proxy.istio.io/config: '{"readiness.status.sidecar.istio.io/periodSeconds": "600"}'
[14:34:14] this-^
[14:34:29] Or is that just the Istio sidecars?
[14:35:22] ah no, that is how often it is probed
[14:35:37] stepping back for a second - which container is causing the issue?
[14:35:50] It's taking too long to respond to the readiness probe
[14:36:07] yes, but what is the readiness probe probing?
[14:36:17] I don't understand the question
[14:37:03] It's trying to load http://10.194.61.163:15020/app-health/queue-proxy/readyz and that times out.
[14:37:41] so either the queue proxy is dead, or it can't reach the kserve container
[14:38:26] Since the queue proxy logs have a lot of this:
[14:38:28] aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
[14:38:46] I would say kserve is not starting up completely before the health check times out.
[14:39:22] ok, I asked the above question since we need to know 1) which container has the failing readiness probe and 2) how to configure it
[14:39:37] kubectl describe pod etc. has the various readiness probes
[14:40:04] the queue-proxy one, as you already mentioned, has
[14:40:05] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[14:40:27] but it is not the only one
[14:40:45] speaking of describe...
[14:41:12] https://phabricator.wikimedia.org/P53677
[14:41:32] Looks like the kserve/isvc is trying to download things from Huggingface
[14:41:47] Well, maybe, the message is ambiguous
[14:42:08] or maybe we are missing something
[14:42:13] kevinbazira: --^
[14:42:25] Yeah, either it's trying to talk to HF, or the local dir is missing
[14:42:34] anyway, for the readiness probe, always check which container is having the problem
[14:42:39] for isvcs, we have two readiness probes
[14:42:41] 1) queue proxy
[14:42:42] 2) istio
[14:42:59] and they have different env variables/configs to set them
[14:43:40] Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
[14:44:15] This reads to me as: start probing immediately, per-probe timeout is 1s, probe every 10s, after 3 timeouts (33s), consider the service dead
[14:44:32] yes
[14:44:38] any single probe success means everything is great
[14:46:37] ok so in the utils.py code we have
[14:46:39] BERT_PATH = "bert-base-multilingual-uncased"
[14:47:56] and
[14:47:57] tokenizer_bert = BertTokenizer.from_pretrained(BERT_PATH)
[14:48:39] running docker exec foo ls -R | grep -i bert (or bert-base) yields nothing
[14:49:19] Where exactly in the container the file/dir is expected, I don't know, but I suspect it's /srv/article-descriptions/model-server/
[14:49:34] I think it wants an absolute path
[14:49:56] an absolute path would be best, yes, but I don't think the file/dir is in there at all
[14:50:51] There are other bert files, like `configuration_visual_bert.py`, so I'm pretty sure I am not looking at the wrong container
[14:50:56] o/
[14:51:09] the storage initializer downloads the s3 files to /mnt/models
[14:52:03] so BERT_PATH is probably something from the vps/cloud instance
[14:53:00] this is something we missed in the review then. When using from_pretrained we should specify that we are using a local dir, otherwise the pod will try to connect to hugging face, which will fail
[14:53:04] similar to this https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/llm/model-server/model.py#41
[14:53:33] isaranto: we use from_pretrained
[14:53:50] yes, I saw it
[14:53:52] but IIUC it wants an absolute path, and we set BERT_PATH to something that is not /mnt/models/etc..
[14:54:11] so we download everything in the storage initializer and then the kserve container just loads everything from disk and doesn't fetch anything
[14:54:46] in theory yes, but from the error msg it seems that it doesn't find this path
[14:54:49] we need to also set local_files_only=True, as in the link I sent above in the llm class use case
[14:54:49] BERT_PATH = "bert-base-multilingual-uncased"
[14:55:29] makes sense
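For reference, the fix being discussed would look roughly like the following, assuming the BERT files end up under /mnt/models alongside the rest of the model (the exact directory name is an assumption): point from_pretrained at the absolute local path and set local_files_only=True so it never reaches out to huggingface.co.

```python
# Sketch of the discussed fix, not the final patch. The directory name
# under /mnt/models is an assumption; adjust to wherever the storage
# initializer actually places the BERT files.
from transformers import BertTokenizer

BERT_PATH = "/mnt/models/bert-base-multilingual-uncased"

tokenizer_bert = BertTokenizer.from_pretrained(
    BERT_PATH,              # absolute path inside the container
    local_files_only=True,  # fail fast instead of contacting huggingface.co
)
```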
[14:55:50] There still remains the question why the file isn't there at all
[14:56:49] # docker exec b24f1d43f118 ls /mnt/models | xargs
[14:56:51] config.json pytorch_model.bin sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json trainer_state.json training_args.bin
[14:57:14] (xargs is only there to make it one line instead of 7)
[14:57:19] it will need to be uploaded to swift manually, if it is not part of the model
[14:58:23] at least that is what I understand. It hasn't been uploaded to swift, so the storage initializer doesn't find it there
[14:59:11] let's talk in the meeting!
[14:59:42] I may be 2' late
[15:11:57] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey) a: elukey
[15:20:52] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey)
[15:25:50] Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (isarantopoulos)
[15:26:16] Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (isarantopoulos) p: Triage→High
[15:26:31] Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (isarantopoulos) p: High→Medium
[15:27:01] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey) p: Triage→Medium
[15:33:40] Machine-Learning-Team: Discuss potential migration - https://phabricator.wikimedia.org/T344010 (klausman) @calbon I think this task is outdated, should we just close it?
[15:45:10] Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring: Add language support for Malay language (ms) - https://phabricator.wikimedia.org/T349968 (elukey) Open→Resolved
[15:54:09] Machine-Learning-Team: Investigate LW Consumer lag alerts - https://phabricator.wikimedia.org/T351735 (isarantopoulos)
[16:02:25] Machine-Learning-Team: Deploy ctranslate2 version of nllb-200 - https://phabricator.wikimedia.org/T351740 (isarantopoulos)
[16:14:52] I tried to rerun the pipeline using "Build with parameters" and provided the appropriate patch, but it fails: https://integration.wikimedia.org/ci/job/inference-services-pipeline-revscoring-publish/
[16:15:12] I think it is safer to retrigger the pipeline through the repo for our production builds
[16:24:50] isaranto: when should the missing build for revscoring have happened? anything registered? Failures etc.?
[16:25:39] nothing seems registered. The last run was after the article-descriptions patch, and the next one should have happened after this patch got merged: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/975814
[16:25:50] the 2 failures registered are the manual ones I triggered
[16:27:47] (PS1) Ilias Sarantopoulos: revscoring: add missing type in function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976270
[16:29:05] Trying to retrigger it now with the above commit
[16:29:37] isaranto: ahh wait, did we update integration/config after the rename? revscoring -> revscoring_model
[16:30:42] I don't recall if there is a reference to the directory naming over there. checking now...
[16:31:32] ah yes, there is
[16:31:56] the trigger happens when something changes in revscoring/*
[16:32:03] exactly, yes
[16:32:09] this is why it didn't trigger
[16:33:35] thanks for this, I had totally forgotten it! noted for future changes
[16:34:07] np!
[16:34:17] I was really puzzled that there was no error registered
[16:34:23] at least we have an explanation :D
[16:35:48] niceeee
[16:36:06] fyi: https://gerrit.wikimedia.org/r/c/integration/config/+/976272
[16:56:34] folks, when you have a moment there is a question from Fabian in #talk-to-machine-learning
[16:57:43] going afk for today, have a nice rest of the day folks!
[17:18:25] aiko: this is the paste from our discussion https://phabricator.wikimedia.org/P53678
[17:19:20] isaranto: <3
[17:33:56] I answered fabian on slack!
[17:34:06] going afk folks, good afternoon/rest of day!
[18:22:07] (CR) Ladsgroup: [C: +1] "I'd be happy to merge this once the mw-config patch is deployed." [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)
[19:45:23] (CR) Kosta Harlan: change default config values to support local/patchdemo deployments (4 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)
[19:46:48] (CR) Kosta Harlan: change default config values to support local/patchdemo deployments (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)
[19:47:09] (CR) Kosta Harlan: change default config values to support local/patchdemo deployments (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)
[19:51:27] (CR) Ladsgroup: [C: +1] change default config values to support local/patchdemo deployments (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)
[19:52:38] (CR) Kosta Harlan: change default config values to support local/patchdemo deployments (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)
[19:53:31] (CR) Kosta Harlan: change default config values to support local/patchdemo deployments (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/976157 (https://phabricator.wikimedia.org/T351703) (owner: Ilias Sarantopoulos)