[07:11:14] hello folks
[07:13:07] Buongiorno!
[07:13:25] kalimera :)
[07:27:45] (PS1) Kevin Bazira: Update pageviews endpoint to match rest-gateway envoy listener [research/recommendation-api] - https://gerrit.wikimedia.org/r/967925 (https://phabricator.wikimedia.org/T348607)
[07:59:22] (CR) Elukey: [C: +1] "Looks good! Just to double check, do we set the 'wikimedia.org' Host header for this use case? (Otherwise it doesn't work)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/967925 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:09:08] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/967925 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:09:55] I just merged the kafka alert!
[08:11:37] (Merged) jenkins-bot: Update pageviews endpoint to match rest-gateway envoy listener [research/recommendation-api] - https://gerrit.wikimedia.org/r/967925 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:28:49] Machine-Learning-Team: Visualize KServe latency metrics in a dashboard - https://phabricator.wikimedia.org/T348456 (elukey) Open→Resolved
[08:29:18] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (elukey) Open→Resolved
[08:29:31] Lift-Wing, Machine-Learning-Team, Patch-For-Review: kserve CORS error - https://phabricator.wikimedia.org/T348511 (elukey) Open→Resolved
[08:51:19] elukey:o/
[08:51:19] thanks for the reviews.
[08:51:19] trying to deploy on staging, the sync command is hanging for a while (way longer than usual)
[08:51:19] so I checked the pods and noticed the new one has been pending for about 7mins:
[08:51:19] ```
[08:51:19] $ kubectl get pods
[08:51:19] NAME                                          READY   STATUS    RESTARTS   AGE
[08:51:20] recommendation-api-ng-main-77cd984648-hmzfr   0/2     Pending   0          7m40s
[08:51:20] recommendation-api-ng-main-856c65f996-xz45q   2/2     Running   0          19h
[08:51:21] ```
[08:55:22] kevinbazira: I see that the pod is gone now. By running `kubectl get events` you can see the events in the namespace and I see `0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate`
[08:55:37] the sync on staging eventually failed with:
[08:55:37] ```
[08:55:37] Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
[08:55:37] ```
[08:57:14] yeah we are at capacity in staging :(
[08:57:25] you can use kubectl describe nodes to see
[08:57:35] the output is verbose, but you can look for
[08:57:35] elukey: we dont have access :(
[08:57:47] Resource   Requests            Limits
[08:57:47] --------   --------            ------
[08:57:47] cpu        64170m (97%)        107400m (163%)
[08:57:47] memory     72204601856 (57%)   93488597504 (75%)
[08:57:47] is there any other way for us to check?
[08:58:06] I think it will be clear when we'll deploy kube-state-metrics
[08:58:13] but serviceops is still working on it
[08:58:18] so atm we have to wait
[08:58:32] isaranto: we could, for the moment, reduce what we deploy in staging
[08:58:37] ok. for now I can remove the second deployment for rr-multilingual that we added with aiko yesterday
[08:58:47] w8 I'm on it
[08:59:21] I noticed that we also have nllb-200-gpu-predictor-default-00001-deployment-59c87c9bf-49fck
[08:59:24] pending
[08:59:58] in experimental several pods are in crashing state
[08:59:59] sigh
[09:01:42] I can remove these deployments as well
[09:02:00] the nllb you mentioned is in codfw prod right?
[09:02:25] it is because of what we were discussing yesterday. since there are no GPUs it wont get scheduled
[09:10:32] isaranto: nope it is in ml-staging-codfw, I think we don't have the override for staging
[09:10:44] experimental is also messed up, I'll try to clean up
[09:12:38] I'll open a task for that. It is because of how our deployments are declared. We discussed it some time ago but left it there
[09:13:06] the fact that it is a dict and not a list iirc
[09:16:07] yes, so we need to override staging
[09:16:19] I think we can patch it for now, and then refactor if we want
[09:16:33] we had lists but it was difficult to override values
[09:17:31] yes yes. I agree with dicts
[09:19:10] +1ed your change, feel free to clean up, thanks!
[09:19:29] we have two more nodes for staging btw, we'll likely get them next Q
[09:34:32] oh well :) my patch didn't do anything cause the values exist in prod. I'm cleaning up the LLMs deployed in experimental since we're not using them and most work will be done in nllb for now. Once we need something we can deploy it
[09:39:34] trying to find a quick way to declare the deployments we don't want in staging
[09:46:40] well no luck with quick hacks
[09:48:31] I removed the deployments from the values.yaml so now they wont appear in the staging manifests https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969068
[10:42:18] tried to deploy rec-api-ng in staging and it still hangs, so yeah let's clean up for the moment
[10:46:21] * elukey lunch!
[10:53:24] I removed the deployments under the experimental ns related to llms so you should be free to go in a bit kevinbazira
[10:54:11] thanks isaranto, I'll let you know how it goes
[11:00:02] rec-api-ng deployment on staging has succeeded
[11:00:02] ```
[11:00:02] $ kubectl get pods
[11:00:02] NAME                                          READY   STATUS    RESTARTS   AGE
[11:00:02] recommendation-api-ng-main-77cd984648-rlpnh   2/2     Running   0          37s
[11:00:03] ```
[11:00:03] testing the uri now...
[11:02:21] the uri works:
[11:02:21] ```
[11:02:22] $ time curl 'https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX'
[11:02:22] [{"pageviews": 52278, "title": "October_2023_Speaker_of_the_United_States_House_of_Representatives_election", "wikidata_id": "Q122928535", "rank": 493.0}, {"pageviews": 27562, "title": "Munawar_Faruqui", "wikidata_id": "Q104979534", "rank": 490.0}, {"pageviews": 13588, "title": "Hollywoodbets", "wikidata_id": "Q46996031", "rank": 488.0}, {"pageviews": 57499, "title": "Ajay_Jadeja", "wikidata_id": "Q2722025", "rank": 482.0},
[11:02:22] {"pageviews": 19888, "title": "Carter_Reum", "wikidata_id": "Q56043532", "rank": 477.0}, {"pageviews": 22712, "title": "Cenk_Uygur", "wikidata_id": "Q19658", "rank": 468.0}, {"pageviews": 31516, "title": "Lokesh_Cinematic_Universe", "wikidata_id": "Q112243071", "rank": 466.0}, {"pageviews": 120151, "title": "Adolis_Garc\u00eda", "wikidata_id": "Q23899679", "rank": 462.0}, {"pageviews": 11140, "title":
[11:02:22] "2023_American_League_Championship_Series", "wikidata_id": "Q122932482", "rank": 448.0}, {"pageviews": 10491, "title": "Murder_of_Grace_Millane", "wikidata_id": "Q59600568", "rank": 447.0}, {"pageviews": 11460, "title": "Terrell_Edmunds", "wikidata_id": "Q47173601", "rank": 443.0}, {"pageviews": 11216, "title": "Kevin_Sumlin", "wikidata_id": "Q6397568", "rank": 431.0}]
[11:02:23] real 0m3.831s
[11:02:23] user 0m0.009s
[11:02:24] sys 0m0.005s
[11:02:24] ```
[11:02:25] going to deploy to prod ...
[11:04:00] Nice!
[11:07:59] deployment to both codfw and eqiad has been completed:
[11:08:00] ```
[11:08:00] $ time curl "https://recommendation-api-ng.discovery.wmnet:31443/types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX"
[11:08:00] [{"pageviews": 43650, "title": "Thalapathy_68", "wikidata_id": "Q122921080", "rank": 499.0}, {"pageviews": 13588, "title": "Hollywoodbets", "wikidata_id": "Q46996031", "rank": 494.0}, {"pageviews": 14737, "title": "Irfan_Pathan", "wikidata_id": "Q1746795", "rank": 487.0}, {"pageviews": 11460, "title": "Terrell_Edmunds", "wikidata_id": "Q47173601", "rank": 463.0}, {"pageviews": 18067, "title": "UFC_295", "wikidata_id": "Q120723030",
[11:08:00] "rank": 460.0}, {"pageviews": 11272, "title": "Mickey_Arthur", "wikidata_id": "Q4711439", "rank": 457.0}, {"pageviews": 18717, "title": "Yaariyan_2", "wikidata_id": "Q114713458", "rank": 456.0}, {"pageviews": 23666, "title": "Heinrich_Klaasen", "wikidata_id": "Q20981646", "rank": 455.0}, {"pageviews": 12151, "title": "Frasier_(2023_TV_series)", "wikidata_id": "Q116636567", "rank": 444.0}, {"pageviews": 11631, "title":
[11:08:00] "2023_Major_League_Baseball_postseason", "wikidata_id": "Q116788481", "rank": 439.0}, {"pageviews": 41912, "title": "South_Asian_river_dolphin", "wikidata_id": "Q950620", "rank": 438.0}, {"pageviews": 23642, "title": "UFC_Rankings", "wikidata_id": "Q39090982", "rank": 429.0}]
[11:08:01] real 0m4.618s
[11:08:01] user 0m0.009s
[11:08:02] sys 0m0.005s
[11:08:02] ```
[11:08:03] thank you isaranto and elukey for helping clear out some space on staging.
[11:17:39] Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (kevinbazira) Folks from wikimedia-analytics notified us that they are migrating the pageviews endpoint as shown in the screenshot below: {F4...
[11:36:27] Machine-Learning-Team, Add-Link, Growth-Team, Serbian-Sites, and 3 others: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (Sgs) a:kostajh→Sgs
[11:45:05] * isaranto lunch!
[12:38:52] Morning all
[12:59:57] So the recommendation model is up?
[13:00:02] Can I try it?
[13:03:19] hey Chris!
[13:04:00] yes it's up!
[13:04:58] You can try the request Kevin posted above which is using the internal endpoint
[13:04:58] `time curl "https://recommendation-api-ng.discovery.wmnet:31443/types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX"`
[13:05:06] Let’s gooo!
[13:06:15] I will admit I was initially against the internal endpoint but it continues to be beneficial in multiple ways
[13:34:43] "real 0m3.937s"
[13:34:44] Do the folks that want this model have any speed requirements?
[13:35:10] I'm okay with slow if they are okay with it.
[13:56:58] chrisalbon: I'm not aware of any requirements but I think that our initial target should be the latency of the previous service as our first goal is to deploy the same service we should aim for same (or better ofc) latency. kevinbazira: do we have any examples from the old one to compare?
[14:01:34] isaranto , chrisalbon: no speed requirements have been shared with us yet.
[14:01:34] the recommendation-api hosted on wmflabs has similar speeds:
[14:01:34] ```
[14:01:34] $ time curl "https://recommend.wmflabs.org/types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX"
[14:01:34] [{"pageviews": 15908, "title": "Jos\u00e9_Leclerc", "wikidata_id": "Q25683039", "rank": 496.0}, {"pageviews": 17965, "title": "Kaala_Paani", "wikidata_id": "Q122866084", "rank": 495.0}, {"pageviews": 11245, "title": "Liton_Das", "wikidata_id": "Q6653212", "rank": 491.0}, {"pageviews": 11946, "title": "Madonna_Sebastian", "wikidata_id": "Q20649862", "rank": 485.0}, {"pageviews": 10930, "title": "Kray_twins", "wikidata_id": "Q1283275",
[14:01:34] "rank": 483.0}, {"pageviews": 15345, "title": "Tasha_Butts", "wikidata_id": "Q7687160", "rank": 481.0}, {"pageviews": 16106, "title": "Gaza\u2013Israel_conflict", "wikidata_id": "Q553184", "rank": 477.0}, {"pageviews": 11627, "title": "Hasbulla", "wikidata_id": "Q113950800", "rank": 464.0}, {"pageviews": 12620, "title": "Marco_Jansen", "wikidata_id": "Q51625991", "rank": 460.0}, {"pageviews": 17459, "title": "Sidney_Powell",
[14:01:34] "wikidata_id": "Q101713828", "rank": 449.0}, {"pageviews": 17852, "title": "Suits_index", "wikidata_id": "Q1469999", "rank": 446.0}, {"pageviews": 11429, "title": "Gadar_2", "wikidata_id": "Q113987357", "rank": 439.0}]
[14:01:35] real 0m4.303s
[14:01:36] user 0m0.000s
[14:01:36] sys 0m0.015s
[14:01:37] ```
[14:01:37] as we were migrating it to LiftWing, we tried optimizing the speed based on Isaac's recommendations as shown here: https://phabricator.wikimedia.org/T347475#9268111
[14:01:52] ah cool
[14:03:26] For this model in particular there is no point investing a ton of time optimizing it, since it is essentially a legacy model that is used by two teams, so the products they've built that use the model are already taking the 4s speed into account in their own architecture choices.
[14:09:29] sounds cool then!
[14:09:35] thanks kevinbazira !
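For reference, the `real` timings pasted in this log for staging, production and the wmflabs instance can be put side by side. This is a minimal Python sketch, assuming the `XmY.YYYs` format printed by the shell's `time`; the function name `parse_real_time` is hypothetical, and each figure is a single sample, so the comparison is only indicative.

```python
import re

# Matches the "XmY.YYYs" duration format printed by the shell's `time`.
_DURATION = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def parse_real_time(value: str) -> float:
    """Convert a duration such as '0m3.831s' into seconds."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# `real` values copied from the curl runs in this log (one sample each).
timings = {
    "staging (LiftWing)": "0m3.831s",
    "prod (LiftWing)": "0m4.618s",
    "wmflabs (legacy)": "0m4.303s",
}

# Print the endpoints from fastest to slowest.
for name, raw in sorted(timings.items(), key=lambda kv: parse_real_time(kv[1])):
    print(f"{name}: {parse_real_time(raw):.3f}s")
```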
[14:58:09] Machine-Learning-Team: Deploy nllb-200 to production - https://phabricator.wikimedia.org/T349163 (isarantopoulos) I'm dumping some load tests that I ran using CPU and GPU services for a fixed input size of ~50words. It is clear that with this raw version of the model the CPU instance is struggling. **CPU**...
[14:59:56] Regarding the increased latency we are experiencing in kserve updates: I think that in order to tackle the issue we need to identify the source problem. Open up a task and start with identifying the dependencies that changed and what is causing this increase. wdyt? cc: aiko elukey
[15:01:12] isaranto: I think that the issue should clear itself when we deploy a fixed version of xgboost, and hopefully catboost as well.. Not entirely sure what changed, but there is a clear issue in those libraries when creating threads in cgroups v2 (That we use)
[15:02:16] sadly https://github.com/catboost/catboost/issues/2518 doesn't seem to be getting attention :(
[15:02:44] elukey: for xgboost models (rr-lang-agnostic) we can push to update xgboost version which was released and test that one. but for catboost models (rr-multilingual and readability) there seems to be a clear issue with performance
[15:02:54] especially rr-multilingual
[15:03:22] sure sure
[15:03:40] I think it is the same root cause, maybe it has different side effects in catboost
[15:03:44] but it is worth checking
[15:04:22] I'll ask in kserve community slack if anyone has experienced any issues
[15:06:24] elukey: could you check the current amount of threads used in rr-multilingual in production?
[15:07:17] is there a way for me to check?
[15:08:21] klausman: ---^ (if you have time can you check? In a meeting :( )
[15:09:26] Machine-Learning-Team, Foundational Technology Requests: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648 (Isaac) Adding another note (on top of T293648#8956259) around improving functionality of the API when we get to that stage: when a seed article is provided (e.g.,...
[15:09:34] checking
[15:10:15] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (Isaac) Excellent thanks @kevinbazira ! I'm also seeing that the semantics for the `&pageviews` is that it defaults to `true` and any other value nulls it out, so that all looks good...
[15:12:08] elukey: would this be visible on grafana (Pod Details)?
[15:14:15] not sure, you can check via ps on one of the nodes running multi lingual
[15:14:24] ml-serve nodes I mean
[15:14:38] ps -eLf | grep $pid should work fine
[15:14:49] ack, found ps -eLf | grep PID | wc -l in the backlog.
[15:14:53] grafana looks clean
[15:15:10] isaranto: I presume this is in codfw?
[15:16:52] both
[15:17:17] klausman: I mean it is for rr-multilingual which is deployed on both sites
[15:17:32] ack
[15:20:35] I've checked two pods, and both seem to have 83 threads, which is oddly consistent
[15:20:56] both were in codfw, now checking eqiad
[15:21:51] one in eqiad has 7
[15:23:14] and another with 7
[15:23:22] thanks Tobias. This is odd!
[15:23:37] isaranto: so I'd say yes, those two containers I checked in eqiad had an elevated number of threads
[15:24:44] so 83 in codfw and 7 in eqiad right?
[15:24:44] do we have rr-ml deployed on all clusters? I thought it was only staging
[15:25:02] isaranto: the one in codfw is kserve 0.11?
[15:25:25] elukey: yes, rr-ml has been in non-staging for a while
[15:25:26] aiko: no all are kserve 0.10
[15:25:43] klausman: yes I am aware, not with kserve 0.11 though
[15:25:59] ah, correct. Checking staging now as well
[15:26:02] aiko: sorry wait lemme check because I saw a commit from 9/10
[15:28:53] all the staging rr-multis have a lot of threads: 83, 116 and 165, for rr-ml-old-redictorm rr-ml-predictor and another which I can't make the name out, something like k8s_kserve-container_revertrisk-multiling[very long ID here]
[15:30:17] so we had all the threads even before, now it seems that xgboost tries to use them more (if we don't set the OMP env variables)
[15:30:30] in theory upgrading to 2.0.1 should remove all the extra threads
[15:30:47] from what I checked, the image used at the time of the last deployment (the one declared in dep-charts) was the kserve 0.10 image
[15:31:20] rr-multilingual is not using xgboost but catboost. :( I suggest we use a higher number of threads as a limit (e.g. 7)
[15:31:37] ah snap ok, I thought it was using both
[15:33:26] klausman: that one is used to test batcher
[15:33:48] ack, thx
[15:35:18] isaranto: mmmm I don't see any omp-related library, one thing that we could do is to run a load test and execute perf to see what/if uses most of the cpu
[15:42:19] one thing that we could do is to figure out how catboost is used for RR multi lingual
[15:42:28] and see if we can set the number of threads
[15:43:13] for example: https://github.com/catboost/catboost/blob/master/catboost/python-package/catboost/helpers.cpp#L322
[15:43:33] if threads are not specified, it invokes the function to get the cpu cores from the cgroup
[15:43:37] that gets the wrong value
[15:45:51] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarant...
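The cgroup point raised at 15:43 (a thread count derived from a CPU count that does not match the container's limit) can be illustrated with a small Python sketch. It assumes cgroups v2, where the CPU limit is exposed as `<quota> <period>` in a `cpu.max` file; the function names and the fallback behaviour are hypothetical and are not catboost's or xgboost's actual logic.

```python
import math
import os
from typing import Optional


def parse_cpu_max(content: str) -> Optional[float]:
    """Return the CPU limit in cores from a cgroups v2 `cpu.max` line.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set (in which case None is returned).
    """
    quota, period = content.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_thread_count(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Pick a thread count from the container limit, else the host core count."""
    try:
        with open(cpu_max_path) as handle:
            limit = parse_cpu_max(handle.read())
    except (OSError, ValueError):
        limit = None  # file missing, unreadable, or not in the expected format
    if limit is None:
        return os.cpu_count() or 1
    # Round fractional limits up, but never go below one thread.
    return max(1, math.ceil(limit))


print(effective_thread_count())
```

A value computed this way could then be passed explicitly to whichever library would otherwise guess its own thread count.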
[15:46:41] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISarant...
[15:46:44] afk for a moment to pick up laundry, back in 15
[15:50:40] A way to do this would be to alter the thread_count after we load the model https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/revertrisk_multilingual/model.py#L347
[15:53:03] although thread_count param is the one used at training
[15:55:17] I'll open a task and we can discuss it over there. for now I will revert the images declared in deployment-charts so that we dont accidentally deploy the new version
[16:04:29] we could also try to set the thread_count param at prediction time https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/revertrisk_multilingual/model.py#L377
[16:04:49] sounds promising --^
[16:04:53] https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier_predict_proba
[16:10:31] yes aiko! that's what I meant by doing it "after loading the model" but didnt phrase it correctly
[16:14:16] (PS1) Ilias Sarantopoulos: readability: bump xgboost version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/969166 (https://phabricator.wikimedia.org/T348664)
[16:16:07] oups wrong model server --^
[16:16:33] nevermind I wanted to do rr-language-agnostic. although it wont pass due to a conflict of requirements
[16:16:45] going afk for today folks, have a nice rest of the day
[16:17:53] o/
[16:19:31] (Abandoned) Ilias Sarantopoulos: readability: bump xgboost version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/969166 (https://phabricator.wikimedia.org/T348664) (owner: Ilias Sarantopoulos)
[16:31:40] Machine-Learning-Team: Increased latencies with Kserve 0.11.1 - https://phabricator.wikimedia.org/T349844 (isarantopoulos)
[16:31:50] Machine-Learning-Team: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (isarantopoulos)
[16:33:05] I created a task about the issue we are discussing so info doesnt get lost and tried to summarize current status. we can continue the discussion over there!
[16:33:21] I'm logging off as well. o/
[16:33:37] isaranto: thanks for opening the task!
[16:33:54] have a nice evening o/
[17:38:23] night all!
[17:54:29] Machine-Learning-Team, Research: Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (MGerlach) Open→Resolved >>! In T334182#9224301, @elukey wrote: > @MGerlach we are done! Let us know if we are good or if anything is missing :) this is great news. I ha...
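The proposal from 16:04 (passing `thread_count` at prediction time) can be sketched in Python as below. Per the CatBoost documentation linked in the log, `predict_proba` accepts a `thread_count` argument; everything else here, including the `predict_with_thread_cap` helper, the `PREDICT_THREADS` environment variable and the stand-in model, is hypothetical and is not the knowledge_integrity code.

```python
import os
from typing import Any, Sequence


def predict_with_thread_cap(model: Any, features: Sequence[Any], default: int = 1) -> Any:
    """Call predict_proba with an explicit thread cap instead of the library default.

    `model` is expected to expose predict_proba(data, thread_count=...),
    as CatBoostClassifier does.
    """
    # PREDICT_THREADS is a made-up variable name for this sketch.
    thread_count = int(os.environ.get("PREDICT_THREADS", default))
    return model.predict_proba(features, thread_count=thread_count)


class FakeModel:
    """Stand-in for a loaded classifier so the sketch runs without catboost."""

    def predict_proba(self, data, thread_count=-1):
        # Record the cap that was requested so it can be inspected afterwards.
        self.seen_thread_count = thread_count
        return [[0.5, 0.5] for _ in data]


model = FakeModel()
scores = predict_with_thread_cap(model, [[1, 2, 3]], default=4)
print(scores, model.seen_thread_count)
```

Reading the cap from the environment would let it be tuned per deployment without rebuilding the image, though whether that fits the chart layout is an open question for the task.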