[07:58:36] o/
[08:00:24] kevinbazira: could you review when you have some time plz? https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/967455
[08:07:07] isaranto: o/
[08:07:22] sure sure, let me have a look
[08:07:54] thank u :)
[08:08:02] if I can help in any way with pageviews lemme know
[08:11:14] (CR) Kevin Bazira: [C: +1] llm: refactor to run locally [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967455 (https://phabricator.wikimedia.org/T349371) (owner: Ilias Sarantopoulos)
[08:26:22] (CR) Ilias Sarantopoulos: [C: +2] llm: refactor to run locally [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967455 (https://phabricator.wikimedia.org/T349371) (owner: Ilias Sarantopoulos)
[08:33:21] thank you for the review!
[08:36:02] (Merged) jenkins-bot: llm: refactor to run locally [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967455 (https://phabricator.wikimedia.org/T349371) (owner: Ilias Sarantopoulos)
[08:48:46] It's party time!
[08:53:52] 🎉
[08:59:20] (PS3) Ilias Sarantopoulos: ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382)
[09:01:07] (PS4) Ilias Sarantopoulos: ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382)
[09:01:18] (PS5) Ilias Sarantopoulos: ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382)
[09:02:07] (CR) Ilias Sarantopoulos: "I made a dummy change to ores-legacy to trigger CI. will remove it afterwards" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) (owner: Ilias Sarantopoulos)
[09:03:25] (CR) CI reject: [V: -1] ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) (owner: Ilias Sarantopoulos)
[09:08:27] (PS6) Ilias Sarantopoulos: ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382)
[09:10:37] (CR) CI reject: [V: -1] ci: update pre-commit hooks versions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) (owner: Ilias Sarantopoulos)
[10:05:36] hello folks
[10:05:43] hello!
[10:09:24] Machine-Learning-Team: Deploy nllb-200 to production - https://phabricator.wikimedia.org/T349163 (isarantopoulos) nllb is live on codfw and eqiad! We'll need to change our helmfiles so that we can allow different deployments for eqiad and codfw. The issue we have now is that the gpu is only on codfw so the g...
[10:10:35] (PS1) Kevin Bazira: Update pageviews endpoint to match rest-gateway envoy listener port [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607)
[10:15:36] Machine-Learning-Team: Refactor inference services repo to allow local runs - https://phabricator.wikimedia.org/T347404 (isarantopoulos) Services in inf-services repo that require refactoring to allow local runs: - ~~Revscoring~~ - ~~Langid~~ - ~~LLM~~ - revertrisk (language agnostic and multilingua...
[10:30:20] (CR) Elukey: Update pageviews endpoint to match rest-gateway envoy listener port (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[10:47:32] Morning all!
[10:47:54] wow Chris did you move to the East coast? :D
[10:49:10] haha, morning!
[10:49:17] Ha. I’m trying to be around more your time!
[10:55:16] * elukey lunch!
[10:55:38] (CR) Kevin Bazira: Update pageviews endpoint to match rest-gateway envoy listener port (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[11:17:48] <3
[11:59:34] * isaranto lunch!
[12:15:49] kevinbazira: o/
[12:16:03] let's sync in here about the rest-gateway change
[12:16:19] so the Pageview API can be accessed in two ways:
[12:16:54] 1) Via Restbase (exposed to external clients etc..), and the URI is something.wikimedia.org/api/rest_v1/metrics/pageviews/etc..
[12:17:12] you can see examples in https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[12:17:51] 2) Via AQS (that is behind Restbase), internal API endpoint only, not exposed to external clients. It has a different URI scheme than Restbase
[12:18:13] We are currently using 2), so /analytics.wikimedia.org/v1/pageviews
[12:18:39] to access the internal endpoint, the k8s envoy proxy uses something like aqs.discovery.wmnet:7272/analytics.wikimedia.org/v1/pageviews
[12:19:01] so very different from what is stated in the Pageview API page on Wikitech
[12:19:39] Hugh and Dan thought that we were hitting the Restbase API, so they said "use the rest-gateway that offers the same API", but in reality we use the AQS one
[12:20:09] so the change that Hugh did is not correct, or better, the URI is wrong, since the example that you added is to fetch data from the AQS endpoint, not the restbase one
[12:20:28] rest-gateway is a new endpoint that is also reachable internally, and the envoy proxy knows about it
[12:21:04] This is why in the change I suggested to use a different URI, since rest-gateway doesn't use the same scheme as the AQS API
[12:21:08] does it make sense?
[12:25:27] elukey: yes, it does. since aqs uses restbase, my understanding was that they use the same endpoint but I couldn't confirm this since there's no way I can test it.
[12:25:28] this is why I had based it on Hnowlan's example.
[12:25:28] thanks for the clarification. let me update it to specifically use the restbase uri.
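Note on the two URI schemes described above (12:16 to 12:21): a minimal Python sketch of how the two prefixes differ. The prefixes and the `aqs.discovery.wmnet:7272` host are the ones quoted in the chat; the helper name `pageviews_url`, the `http://` scheme on the internal host, and the assumption that the same sub-path is valid under both schemes are illustrative only and not confirmed by the log.

```python
# Prefixes as quoted in the chat; everything else here is a hypothetical helper.
RESTBASE_PREFIX = "/api/rest_v1/metrics/pageviews"    # 1) Restbase, externally exposed
AQS_PREFIX = "/analytics.wikimedia.org/v1/pageviews"  # 2) AQS, internal only


def pageviews_url(base: str, scheme: str, subpath: str) -> str:
    """Join a base (scheme://host[:port]) with the prefix of the chosen URI scheme."""
    prefixes = {"restbase": RESTBASE_PREFIX, "aqs": AQS_PREFIX}
    if scheme not in prefixes:
        raise ValueError(f"unknown URI scheme: {scheme!r}")
    # Normalise slashes so callers may pass "base/" or "/subpath".
    return f"{base.rstrip('/')}{prefixes[scheme]}/{subpath.lstrip('/')}"


if __name__ == "__main__":
    subpath = "top/en.wikipedia/all-access/2023/10/23"
    # External Restbase-style URI.
    print(pageviews_url("https://wikimedia.org", "restbase", subpath))
    # Internal AQS-style URI (http:// is an assumption, the chat gives no scheme).
    print(pageviews_url("http://aqs.discovery.wmnet:7272", "aqs", subpath))
```

The point is only that a client configured for one scheme cannot be pointed at the other endpoint by swapping the host and port: the path prefix has to change as well.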
[12:29:46] (PS2) Kevin Bazira: Update pageviews endpoint to match rest-gateway envoy listener port [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607)
[12:31:54] (CR) Elukey: [C: +1] Update pageviews endpoint to match rest-gateway envoy listener port (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[12:32:57] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[12:34:14] (Merged) jenkins-bot: Update pageviews endpoint to match rest-gateway envoy listener port [research/recommendation-api] - https://gerrit.wikimedia.org/r/967921 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[12:49:43] isaranto: if-when you want to go over the Kafka alert together, lmk
[12:52:27] klausman: yes, thanks! I'm available for another 40 minutes
[12:54:45] Ok, do you have a change/patch I can look at?
[12:55:09] it is this one https://gerrit.wikimedia.org/r/c/operations/alerts/+/962056
[12:55:15] Ack, taking a look.
[12:56:31] you can check patchset 23 was a different query, but that one was returning 2 entries for codfw cluster so we want to have a sum of these by cluster
[13:00:41] isaranto: so for the input series, if the alert triggers after 1h, and your test-side eval interval is 1m, wouldn't you need 10000 x60 (or maybe >60)?
[13:02:34] https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ The block here has some examples for repeating measurements
[13:02:48] https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ Direct link
[13:03:08] the test doesn't pass (the alert doesn't trigger) even if I put a really high value e.g. 1000000000000
[13:03:32] I meant the x60 for 60 readings of 10k
[13:06:30] ack! thanks for the link. never used unit tests for alerts before. Now I got it
[13:07:02] elukey: thanks for the reviews.
[13:07:02] I've merged the change and testing on staging shows that the pageviews request fails because of the uri (see logs below):
[13:07:02] ```
[13:07:02] 2023-10-25 13:03:42,715 recommendation.api.external_data.fetcher get():26 INFO -- Request failed: {"url": "http://localhost:6033/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23", "error": "404 Client Error: Not Found for url: http://localhost:6033/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23"}
[13:07:02] ```
[13:07:14] I mean I wasn't aware of the correct annotation for the repeated values in a series
[13:07:34] now I can make it fire! \o/ thanks klausman: I'll ping you for the review
[13:08:06] ack :) glad I could help
[13:14:50] There is an update regarding PEP 703 and no-GIL Python! The proposal has been accepted. https://discuss.python.org/t/pep-703-making-the-global-interpreter-lock-optional-in-cpython-acceptance/37075
[13:17:42] wow!
[13:17:50] klausman: I updated the patch
[13:18:07] I asked a question on the patch (instead of irc I mean)
[13:21:26] kevinbazira: did it work?
[13:23:55] elukey: the uri didn't work on staging, I had shared the logs earlier https://usercontent.irccloud-cdn.com/file/b3qapdop/Screenshot%20from%202023-10-25%2017-21-43.png
[13:25:01] ah lovely didn't see it
[13:25:02] sigh
[13:25:04] lemme check
[13:25:11] okok
[13:29:16] ok so from
[13:29:16] elukey@stat1004:~$ curl https://rest-gateway.discovery.wmnet:4113/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23
[13:29:19] {"httpReason":"Not Found","httpCode":404}
[13:29:57] isaranto: lgtm, should I resolve the open comments or will you do so
[13:30:00] ?
[13:30:35] (I was missing /api/ but the result is the same)
[13:30:56] feel free to resolve them. I don't know what the convention is but I'm used to leaving the reviewer to resolve the comments so that they can check that their suggestions have been implemented
[13:31:14] kevinbazira: weird https://www.wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23
[13:32:11] ??
[13:32:22] it is the restbase endpoint, not returning data
[13:32:31] so maybe we have the wrong URI
[13:33:26] ok so https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23 works
[13:34:02] the one above has the www, ok
[13:34:47] but if I use the internal rest-gateway endpoint, it doesn't
[13:34:56] so I suspect that there is some translation/proxy
[13:39:13] interesting, if it works without the www then the rec-api should work too, since the host header we use doesn't have the www: https://github.com/wikimedia/research-recommendation-api/blob/6acc4c6085ec6c97a1632cb018aa066d42e37f21/recommendation/data/recommendation_liftwing.ini#L11
[13:39:35] kevinbazira: yes yes but
[13:39:40] elukey@stat1004:~$ curl https://rest-gateway.discovery.wmnet:4113/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/23
[13:39:43] {"httpCode":404,"httpReason":"Not Found"}
[13:39:55] we use the internal endpoint, in theory, via the k8s envoy proxy
[13:41:32] I asked in #wikimedia-analytics, maybe there is something that I don't see
[13:42:03] thanks, I am following the conversation in that channel
[14:57:18] klausman: thanks for the help and the review. I asked Herron for a review (Luca had suggested so) and then it will be ready to go!
[14:57:34] ack
[15:03:48] Machine-Learning-Team, Data-Engineering, serviceops: URI to use when hitting the Pageviews API on rest-gateway - https://phabricator.wikimedia.org/T349722 (elukey)
[15:03:59] kevinbazira: opened--^
[15:04:40] once we get an answer we should be done
[15:11:42] elukey: great. thanks!
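Note on the alert unit test discussed earlier (13:00 to 13:07): the "x60" shorthand is the expanding notation that promtool accepts in `input_series` values. As documented for Prometheus rule unit tests, `a+bxn` expands to `a, a+b, ..., a+(n*b)` and `axn` is shorthand for `a+0xn`, so it yields n+1 samples. The Python sketch below only mimics that expansion for plain numeric tokens to make the sample count visible; `expand` is a hypothetical helper, not part of promtool, and the actual alert rule and its `for:` duration are not shown in the log.

```python
import re
from typing import List

# Matches 'axn', 'a+bxn' and 'a-bxn' with plain decimal numbers only.
_EXPANDING = re.compile(r"^(-?\d+(?:\.\d+)?)(?:([+-])(\d+(?:\.\d+)?))?x(\d+)$")


def expand(token: str) -> List[float]:
    """Expand one promtool-style token into its list of sample values."""
    match = _EXPANDING.match(token)
    if match is None:
        raise ValueError(f"not an expanding-notation token: {token!r}")
    start = float(match.group(1))
    sign, step_text, repeats = match.group(2), match.group(3), int(match.group(4))
    step = float(step_text) if step_text is not None else 0.0
    if sign == "-":
        step = -step
    # n repeats after the initial value, so n + 1 samples in total.
    return [start + i * step for i in range(repeats + 1)]


if __name__ == "__main__":
    samples = expand("10000x60")
    # With a 1m interval these samples sit at t = 0m, 1m, ..., 60m.
    print(len(samples), samples[0], samples[-1])
```

With a 1m interval, `10000x60` therefore covers the first hour of the test; whether that is long enough for a given alert depends on the rule's `for:` duration and the `eval_time` used in the test, which is why a larger repeat count was suggested as a possibility.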
[15:24:38] isaranto: rr-multilingual uses catboost too https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/pyproject.toml#L26
[15:25:15] aha! that must be it then
[15:26:40] aiko: I don't know if you already saw it but regarding your work on batching -> https://phabricator.wikimedia.org/T348536#9272866
[15:26:47] you can use a similar service
[15:28:44] regarding catboost luca opened this issue -> https://github.com/catboost/catboost/issues/2518 until then we go with num_of_threads set to 1
[15:29:50] elukey: you mentioned that you were seeing more threads even with OMP_NUM_THREADS set to 1?
[15:29:58] I saw the batcher deployed on staging, thank u!
[15:30:52] isaranto: yes exactly, I think that the code doesn't see the exact number of cgroups cores available (but the host's ones) and it creates too many threads
[15:34:11] isaranto: ok so I'll file a patch with omp_num_threads set to 1
[15:35:12] please add also OMP_THREAD_LIMIT to 1
[15:36:57] I think that we tried but multiple threads were created anyway
[15:37:20] i remember you set it manually and the threads were reduced but were still many
[15:39:50] yep yep
[15:43:04] going afk folks! Have a nice rest of the day
[15:43:15] Machine-Learning-Team, Data-Engineering, serviceops: URI to use when hitting the Pageviews API on rest-gateway - https://phabricator.wikimedia.org/T349722 (hnowlan) Documentation fail on my part - this endpoint requires the host header of "wikimedia.org" be set. This is to force clients at the edge t...
[15:43:51] ciao!
[15:44:12] Machine-Learning-Team, Data-Engineering, serviceops: URI to use when hitting the Pageviews API on rest-gateway - https://phabricator.wikimedia.org/T349722 (elukey) Open→Resolved a:elukey Right this works! ` curl https://rest-gateway.discovery.wmnet:4113/wikimedia.org/v1/metrics/pageviews...
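Note on the thread-count discussion above (15:29 to 15:37): the suspicion is that a library sizes its thread pool from the host's core count rather than from the container's CPU limit. The Python sketch below shows one way to compare the two, assuming a cgroup v2 container where the limit is exposed in `/sys/fs/cgroup/cpu.max`; whether the pods in question use cgroup v2 is not stated in the log, and `parse_cpu_max` and `effective_threads` are hypothetical helper names. It is not a fix for the catboost behaviour described in the chat, where extra threads appeared even with `OMP_NUM_THREADS` set to 1.

```python
import math
import os
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max ('<quota> <period>' or 'max <period>') into a CPU count."""
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None  # no limit set, or unexpected format
    quota, period = int(parts[0]), int(parts[1])
    if period <= 0:
        return None
    return quota / period


def effective_threads(cpu_max_text: Optional[str], host_cpus: int) -> int:
    """Pick a thread count bounded by the cgroup limit, falling back to the host count."""
    limit = parse_cpu_max(cpu_max_text) if cpu_max_text else None
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.ceil(limit)))


if __name__ == "__main__":
    try:
        with open("/sys/fs/cgroup/cpu.max") as handle:  # cgroup v2 path (assumption)
            raw = handle.read()
    except OSError:
        raw = None
    host = os.cpu_count() or 1
    threads = effective_threads(raw, host)
    # OpenMP runtimes read this at initialisation, so set it before importing
    # any library that starts one.
    os.environ.setdefault("OMP_NUM_THREADS", str(threads))
    print(f"host cpus={host}, cgroup-aware threads={threads}")
```

On a host with many cores and a small CPU limit, the gap between the two printed numbers is the kind of mismatch suspected in the chat.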
[15:44:21] kevinbazira: https://phabricator.wikimedia.org/T349722#9280736 we can work on this tomorrow
[15:45:21] elukey: sure sure
[15:47:16] bye luca! have a nice evening
[15:51:34] isaranto: do u think we still need to add the old version of multilingual to compare?
[15:51:41] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/968715
[15:51:42] ^^
[15:52:43] not really. let's proceed like this and test. We need to figure out why too many threads are created and if this is caused by the library itself
[15:54:25] but ok let's do it. The only issue is that it is a tight fit since iirc each namespace has 20Gi limit
[15:56:13] hmm I can remove it (revertrisk-multilingual-old) in the patch
[15:59:45] nono let's leave it and see
[16:00:02] ahh
[16:00:11] :( :)
[16:00:33] ok then I'll add it back haha
[16:01:03] haha sry. feel free to do whatever u want :). What is not clear to me is what it is in kserve 0.11 that causes this
[16:03:35] yeah we need to figure it out !
[16:07:19] deploying the changes
[16:08:06] I'm logging off, more tomorrow! Have a nice evening/day!
[16:08:42] bye Ilias :) you too!
[16:48:47] Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (achou) Here are some load test results after setting `OMP_NUM_THREADS` and `OMP_THREAD_LIMIT` env vars. We found out that reverrisk-multilingual also uses `catboost`, which for so...
[16:55:35] revertrisk-multilingual-old doesn't work due to "Model with name revertrisk-multilingual-old does not exist"
[16:55:45] I will fix it tomorrow
[17:33:41] aiko: it is fine as it is but when you make a request model name will be revertrisk-multilingual but host will be revertrisk-multilingual-old
[17:34:11] (PS1) AikoChou: revert-risk: allow suffixes to revert-risk model_name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/968728
[17:34:59] ohh really?
[17:37:14] ahh ok I tested it it works
[17:43:01] (Abandoned) AikoChou: revert-risk: allow suffixes to revert-risk model_name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/968728 (owner: AikoChou)
[17:44:13] isaranto: thanks for letting me know :D
[17:49:19] going afk!! see u tomorrow
[19:56:42] Machine-Learning-Team, Research, Epic: Develop a ML-based service to predict reverts on Wikipedia(s) - https://phabricator.wikimedia.org/T314384 (diego)