[09:55:44] hi o/
[09:55:44] I pushed a patch to move the article-descriptions model-server from staging to prod
[09:55:44] please review whenever you get a minute: https://gerrit.wikimedia.org/r/1006194 thanks!
[10:06:27] morning o/
[11:31:36] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog, MW-1.42-notes (1.42.0-wmf.16; 2024-01-30): Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9576023 (kostajh) >>! In T348298#9559718, @Samwalton9-...
[12:14:15] * klausman lunch
[12:14:31] kevinbazira: I +1'd your A-D to prod change in case you missed it :)
[12:15:40] klausman: thanks for the review :)
[12:42:06] Machine-Learning-Team: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9576155 (kevinbazira) Before deploying the article-descriptions model server in prod, I tried running `helmfile -e ml-serve-* diff` for both *eqiad and *codfw and got the e...
[12:44:08] klausman: I've tried running `helmfile -e ml-serve-* diff` for both *eqiad and *codfw.
[12:44:09] this is returning the error: https://phabricator.wikimedia.org/T358467#9576155
[12:44:09] are there missing configs in:
[12:44:09] ```
[12:44:09] time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 3}' -H "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
[12:44:09] ```
[12:44:09] and
[12:44:10] ```
[12:44:10] time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "es", "title": "Madrid", "num_beams": 3}' -H "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
[12:44:11] ```
[12:45:56] ***are there missing configs in:
[12:45:56] ```
[12:45:57] /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
[12:45:57] ```
[12:45:57] and
[12:45:57] ```
[12:45:57] /etc/kubernetes/article-descriptions-deploy-ml-serve-eqiad.config
[12:45:58] ```
[12:56:27] I'll have a look once I'm back home
[12:57:00] great. thanks!
[12:59:39] From a quick look, I think it's just missing secrets in the private repo/pm
[13:05:36] * aiko afk for an appointment
[13:25:32] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006524 This should fix it.
[13:31:21] klausman: thanks. I'll run the diff again once that patch has been merged
[13:44:21] One more thing missing :)
[13:51:36] okok... had just run into another error... was about to share it :)
[13:52:49] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1006528 there we go.
[14:18:15] Machine-Learning-Team, Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9576412 (kevinbazira) After @klausman helped add [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006524/ | secrets ]], [[ https://gerrit.wiki...
[14:19:04] I am struggling to figure out what is missing there. the file permission is a red herring
[14:49:20] Machine-Learning-Team: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9576525 (klausman) I had missed pushing the `admin_ng` change. That is fixed now, so pushing the model server config should work now.
[14:49:35] kevinbazira: I once again had forgotten to push changes %-) Diff works for me now, can you test?
[14:51:19] great! works like a charm! thanks klausman!
[14:56:00] np, and sorry for the extra delay
[15:11:16] klausman: np, it happens :)
[15:11:30] meanwhile, both helmfile diff and sync run without any issues but when I check to see the pods, I get:
[15:11:30] ```
[15:11:31] $ kube_env article-descriptions ml-serve-eqiad
[15:11:31] $ kubectl get pods
[15:11:31] No resources found in article-descriptions namespace.
[15:11:31] ```
[15:12:22] Hmmm
[15:19:29] There's a revision there, but it doesn't trigger a pod creation
[15:28:01] kevinbazira: ah, the CPU request was too high. 16 is not allowed. I tried with 6 and that works. I think this is a cluster- or namespace-level limit we would need to raise
[15:29:05] okoko ... I remember this had to be raised for staging too
[15:33:57] For now I "hot-updated" the services to request 6 cpu in eqiad/codfw, currently working on a patch to make that the same in the repo. We can then test with that setting and decide how much we want the limit to be raised in prod. wdyt?
[15:35:03] sounds good
[16:02:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[16:12:25] kevinbazira: looks like 4Gi is cutting it close. No OOMs, but flying close it it
[16:12:28] to it*
[16:13:37] I got `InfServiceHighMemoryUsage` alerts in the email
[16:13:54] yeah also bot message above
[16:16:34] so with the current config, a request that was taking <3s on staging is now >8s in prod:
[16:16:34] ```
[16:16:34] $ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
[16:16:34] {"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.07287430763244629,"mwapi - first paragraphs (s)":0.27934741973876953,"total network (s)":0.3130209445953369,"model (s)":7.659532070159912,"total (s)":7.9725823402404785},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a
[16:16:34] hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
[16:16:34] real 0m8.049s
[16:16:34] user 0m0.014s
[16:16:35] sys 0m0.001s
[16:16:35] ```
[16:17:09] I suspect that's due to the CPU constraint throttling the parallel fetch of data from rest
[16:23:12] these results are close to what we had in: https://phabricator.wikimedia.org/T353127#9398823
[16:23:12] 4Gi was taking >8s and when we used 8Gi it became <4s
[16:23:12] is it possible to raise the memory constraint in prod?
[16:23:43] It's definitely worth a try
[16:24:39] I presume you're testing in qiad?
[16:24:41] +er
[16:25:54] meant CPU not memory ... sorry I am getting to the end of my day!
[16:27:01] Sure, but I'd prefer to touch a global limit with time "in the day" left, so how about we do that tomorrow morning?
[16:27:57] okok ... we can pick this up tomorrow morning. thank you for your help today
[16:28:41] np. I've silenced the InfServiceHighMemoryUsage alert for 24h
[16:32:40] ack. regarding where I was running the test request. it should have been hitting codfw.
[16:48:20] going afk o/
[16:58:36] \o
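
Note on the 16:16:34 debug response: the `latency` object already splits the request time into network and model parts, so the slowdown can be attributed with a little arithmetic. The following is a minimal sketch, not part of the original conversation. The numbers are copied from the pasted response; the helper name `latency_shares` is hypothetical and chosen only for illustration.

```python
import json

# Latency block copied from the debug response pasted at 16:16:34.
response_fragment = """
{"latency": {"wikidata-info (s)": 0.07287430763244629,
             "mwapi - first paragraphs (s)": 0.27934741973876953,
             "total network (s)": 0.3130209445953369,
             "model (s)": 7.659532070159912,
             "total (s)": 7.9725823402404785}}
"""


def latency_shares(latency):
    """Return the network and model fractions of the total request time.

    Hypothetical helper: the key names match the debug output above,
    but this function is not part of the model server.
    """
    total = latency["total (s)"]
    return {
        "network": latency["total network (s)"] / total,
        "model": latency["model (s)"] / total,
    }


latency = json.loads(response_fragment)["latency"]
shares = latency_shares(latency)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

For this single request the model step accounts for roughly 96% of the total and the network fetches for roughly 4%. That is one data point rather than a benchmark, so it is worth repeating once the CPU limit is raised.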