[00:22:25] Machine-Learning-Team: Program & Events dashboard is using the old ORES service - https://phabricator.wikimedia.org/T352934 (calbon)
[01:23:15] Machine-Learning-Team, Education-Program-Dashboard: Program & Events dashboard is using the old ORES service - https://phabricator.wikimedia.org/T352934 (Aklapper)
[01:35:38] Machine-Learning-Team, Education-Program-Dashboard: Program & Events dashboard is using the old ORES service - https://phabricator.wikimedia.org/T352934 (Ragesoss) When was this screenshot from? I deployed the update that switched it to LiftWing about 3 weeks ago, and it shouldn't be hitting the old ORES...
[07:58:11] Good morning!
[08:49:32] Machine-Learning-Team, MediaWiki-extensions-ORES, Technical-Debt: Use expression builder instead of raw SQL in ORES - https://phabricator.wikimedia.org/T350986 (isarantopoulos) p:Triage→Medium
[08:49:37] Machine-Learning-Team: Globally fix ores.wikipedia.org/ui to new legacy domain - https://phabricator.wikimedia.org/T349996 (isarantopoulos) p:Triage→Medium
[08:49:44] Machine-Learning-Team: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 (isarantopoulos) p:Triage→Medium
[08:49:57] Machine-Learning-Team, Patch-For-Review: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (isarantopoulos) p:Triage→Medium
[08:52:01] Machine-Learning-Team: Add a script for running the Revert Risk model server locally - https://phabricator.wikimedia.org/T352689 (isarantopoulos) p:Triage→Medium
[08:52:07] Machine-Learning-Team: Rethink aiohttp's session reuse in the isvc code - https://phabricator.wikimedia.org/T352290 (isarantopoulos) p:Triage→Medium
[08:52:11] Machine-Learning-Team, ORES: Review traffic on ores.wikimedia.org - https://phabricator.wikimedia.org/T352527 (isarantopoulos) p:Triage→Medium
[08:52:17] Machine-Learning-Team: Fix istio gateway's PodDisruptionBudgets for ml-serve - https://phabricator.wikimedia.org/T352400 (isarantopoulos) p:Triage→Medium
[08:52:24] Machine-Learning-Team: Investigate prediction bug in article-descriptions model-server - https://phabricator.wikimedia.org/T352750 (isarantopoulos) p:Triage→Medium
[08:52:34] Machine-Learning-Team, SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (isarantopoulos) p:Triage→Medium
[08:52:39] Machine-Learning-Team: Improving error message for Revertrisk models - https://phabricator.wikimedia.org/T351278 (isarantopoulos) p:Triage→Medium
[08:52:45] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274 (isarantopoulos) p:Triage→Medium
[08:52:53] Machine-Learning-Team, Project-Admins: Create three Phab Projects for Machine Learning: Lift Wing, Pilot Flag, Test Grounds - https://phabricator.wikimedia.org/T264774 (isarantopoulos) p:Triage→Medium
[08:52:59] Lift-Wing, Machine-Learning-Team: Discuss caching strategies for Lift Wing - https://phabricator.wikimedia.org/T349180 (isarantopoulos) p:Triage→Medium
[08:53:10] Machine-Learning-Team, ORES, Beta-Cluster-Infrastructure, PageTriage: Special:NewPagesFeed broken on beta cluster testwiki - https://phabricator.wikimedia.org/T349635 (isarantopoulos) p:Triage→Medium
[08:58:07] --^ tidying up the board a bit
[09:08:40] morning!
[09:20:14] Lift-Wing, Machine-Learning-Team: Investigate increase p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (isarantopoulos)
[09:20:40] Lift-Wing, Machine-Learning-Team: Investigate increase p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (isarantopoulos) p:Triage→Unbreak!
[09:20:46] o/
[09:21:00] I opened a task --^ for the latencies. I'm looking at the logs at the moment
[09:27:15] Lift-Wing, Machine-Learning-Team: Investigate increase p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (isarantopoulos) I found the issue by looking at the logs of one of the pods that has increased latencies (`revertrisk-language-agnostic-predictor-default-00014-deplo5mn2g`)....
[09:32:42] Machine-Learning-Team: Reduce default API response fields for article-descriptions model-server - https://phabricator.wikimedia.org/T352959 (kevinbazira)
[09:34:56] isaranto: very nice, I have no idea why we get a connection for zh-yue.wikipedia.org
[09:35:11] I'd have expected api-ro (with the zh-yue as host header)
[09:35:53] ah also port 443 directly
[09:38:20] (PS1) Kevin Bazira: article-descriptions: filter API response fields by debug flag [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959)
[09:39:50] (CR) CI reject: [V: -1] article-descriptions: filter API response fields by debug flag [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[09:48:32] I rechecked the code, it doesn't make much sense
[09:50:09] (PS2) Kevin Bazira: article-descriptions: filter API response fields by debug flag [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959)
[09:51:32] morning!
[09:53:28] hello :)
[09:58:53] Lift-Wing, Machine-Learning-Team: Investigate increase p99 latencies in ml-serve-eqiad - https://phabricator.wikimedia.org/T352958 (elukey) It is very weird that we see a direct connection to zh-yue.wikimedia.org on port 443, this is what the istio-proxy does: ` kubectl logs revertrisk-language-agnostic...
[10:00:39] hey Aiko!
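For readers following the 09:35 remarks (api-ro was expected, with zh-yue as the host header, rather than a direct connection on port 443), here is a minimal sketch of what such a request might look like. It only builds the request and does not send it. The function name and the query parameters are assumptions for illustration (a generic MediaWiki Action API revision lookup), not the code used by the model server; only the api-ro endpoint, the lang value, and the revision ID come from the log.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Internal read-only API endpoint mentioned in the log.
INTERNAL_API = "http://api-ro.discovery.wmnet"


def build_revision_request(lang: str, rev_id: int) -> Request:
    """Build (but do not send) a revision lookup routed via the internal endpoint.

    Hypothetical helper: the query below is a generic Action API lookup,
    chosen for illustration only.
    """
    query = urlencode({
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "format": "json",
    })
    # The target wiki is selected through the Host header, not through the URL,
    # so the connection itself goes to api-ro over plain HTTP.
    headers = {"Host": f"{lang}.wikipedia.org"}
    return Request(f"{INTERNAL_API}/w/api.php?{query}", headers=headers)


req = build_revision_request("zh-yue", 2059733)
print(req.full_url)
print(req.get_header("Host"))
```

With this shape, a log line showing a connection to zh-yue.wikipedia.org:443 would suggest the base URL was replaced somewhere, which is what the discussion goes on to investigate.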
[10:08:39] also looking into the issue of latency spikes. thanks for all the info so far
[10:12:25] aiko: it is very weird, it seems as if self.wiki_url is not picked up for some reason, ending up using https://zh-yue.wikimedia.org instead of http://api-ro.discovery.wmnet
[10:12:57] but not for all use cases, I tried to hit RR with lang zh-yue and all works
[10:13:56] ah wait there is the revision logged
[10:14:11] that works fine..
[10:14:32] with {"lang": "zh-yue", "rev_id": 2059733}
[10:16:50] elukey: can you try with {"lang": "yue", "rev_id": 2059733}
[10:17:34] cause I saw the log msg is ERROR:root:An error has occurred while fetching info for revision: 2059733 (yue). meaning lang is yue
[10:17:47] that's weird
[10:20:07] aiko: it hangs!
[10:20:32] ah right I see, 2059733 (yue)
[10:21:20] maybe KI's get_current_revision handles it in a separate way?
[10:22:39] when I hit API_GW it hangs but when I run it locally I get {"error":"Unsupported lang: yue."}
[10:23:00] not sure why 'yue' is in the supported languages in revertrisk but it should be 'zh-yue'
[10:23:07] perhaps we haven't deployed sth yet (although I think input validation has been there)
[10:23:48] sure git blame says 10/8/23 so it shouldn't hang
[10:24:22] but it seems that in https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/revision.py#L268 we end up calling the wrong endpoint
[10:24:29] hm I need to download the new binary to check
[10:25:51] yeah we need to figure out why it ended up calling the wrong endpoint
[10:27:00] isaranto: did you use the discovery endpoint for the internal call or the svc.eqiad one?
[10:27:23] in theory most of us are redirected to eqiad, just wanted to cover all things
[10:27:35] Morning!
[10:27:49] I didn't use the internal one, just API-GW
[10:28:06] ah sorry you ran it locally
[10:28:13] lemme check
[10:30:15] I loaded the model locally and now I'm debugging. version2 of the model doesn't throw the language not supported error
[10:30:18] morning :)
[10:30:20] morning Tobias!
[10:31:50] yep it hangs also with the internal endpoint
[10:33:21] ok both yue and zh-yue are in the supported languages. I saw it from the model binary but also the model card
[10:34:21] I'll submit the apigw change for rec-api-ng in a bit (coordinating with Hugh)
[10:40:36] the url yue.wikipedia.org resolves to zh-yue.wikipedia.org as I understand they refer to the same language and wiki https://en.wikipedia.org/wiki/Yue_Chinese. Unless I'm totally wrong
[10:42:54] yes they're the same
[10:44:13] Machine-Learning-Team, SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (elukey) @herron this is the list of changes: * https://gerrit.wikimedia.org/r/c/operations/puppet/+/956841 - Initial creation of the rules for Grizzly, Sep 18...
[10:52:42] I was thinking that we could have some test for all the languages supported. I wonder if there are others as well
[11:00:49] hm further debugging shows that WIKI_URL is picked up. aiko: shall we deploy the latest version so that we know we are checking the current version of the code?
[11:05:50] isaranto: yeah I wanted to ask if I should go ahead to deploy the latest changes
[11:06:18] aiko: go ahead and deploy and let's recheck
[11:06:43] isaranto: okk
[11:06:47] I don't think anything will change but just to be sure
[11:06:49] thanks!
[11:13:40] Machine-Learning-Team, SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (elukey) It is interesting that other istio-based dashboard SLIs have a similar but not exact behavior: https://w.wiki/8QbA {F41570194} In this case the hole...
[11:15:20] isaranto: deployed! both eqiad and codfw
[11:15:34] Machine-Learning-Team, SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (elukey) Reloads on titan1001: ` elukey@titan1001:~$ sudo journalctl -u thanos-rule.service | grep -i 'msg="reload rule files"'| awk '{print $1" "$2" "$3}' Nov...
[11:16:46] Ack
[11:30:27] * elukey lunch!
[11:41:26] * aiko lunch
[11:59:41] * klausman also lunch
[12:10:33] (CR) Ilias Sarantopoulos: "I'm not against using 0/1 but I think a boolean is more intuitive for an API." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[12:18:30] (PS1) Ilias Sarantopoulos: llm: validate json input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981305 (https://phabricator.wikimedia.org/T352834)
[12:19:10] (PS2) Ilias Sarantopoulos: llm: validate json input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981305 (https://phabricator.wikimedia.org/T352834)
[12:19:44] * isaranto afk lunch
[12:44:32] (CR) AikoChou: [C: +1] llm: refactor directory structure to treat as python module. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980429 (owner: Ilias Sarantopoulos)
[12:45:03] (CR) AikoChou: [C: +1] llm: validate json input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981305 (https://phabricator.wikimedia.org/T352834) (owner: Ilias Sarantopoulos)
[13:16:34] (CR) Ilias Sarantopoulos: [C: +2] llm: refactor directory structure to treat as python module. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980429 (owner: Ilias Sarantopoulos)
[13:24:07] (Merged) jenkins-bot: llm: refactor directory structure to treat as python module. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980429 (owner: Ilias Sarantopoulos)
[13:25:06] (PS3) Ilias Sarantopoulos: llm: validate json input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981305 (https://phabricator.wikimedia.org/T352834)
[13:28:58] hey folks, I have just rolled out a change to ores-legacy and rec-api-ng in staging
[13:29:06] to use cert-manager and not old cergen certs
[13:29:14] nothing changes, but if you see issues lemme know
[13:29:58] aye aye, cap'n
[13:33:43] klausman: another important bit
[13:33:52] I am doing this change in puppet private
[13:33:52] - serving.kserve.io/s3-cabundle: "/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt"
[13:33:55] + serving.kserve.io/s3-cabundle: "/usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt"
[13:34:14] the s3-cabundle is the one used by the storage initializer to pull binaries from thanos swift
[13:34:37] And Thanos swift has switched to the new CA scheme already?
[13:35:01] nope
[13:35:16] I guess you mean PKIright?
[13:35:21] *PKI right
[13:35:26] ack!
[13:35:30] yes, brainfart
[13:35:55] nono you are right, that is the wrong file
[13:36:06] I need /etc/ssl/certs/wmf-ca-certificates.crt
[13:36:13] fixing thanks
[13:36:57] - serving.kserve.io/s3-cabundle: "/usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt"
[13:37:00] + serving.kserve.io/s3-cabundle: "/etc/ssl/certs/wmf-ca-certificates.crt"
[13:38:06] we install wmf-certificates, just checked
[13:39:17] (CR) Ilias Sarantopoulos: [C: +2] llm: validate json input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981305 (https://phabricator.wikimedia.org/T352834) (owner: Ilias Sarantopoulos)
[13:42:35] (Merged) jenkins-bot: llm: validate json input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/981305 (https://phabricator.wikimedia.org/T352834) (owner: Ilias Sarantopoulos)
[13:44:00] (CR) Kevin Bazira: article-descriptions: filter API response fields by debug flag (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[13:47:46] I added a message to slack explaining what I did, if you deploy and see the following is ok:
[13:47:51] - serving.kserve.io/s3-cabundle: /usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt
[13:47:54] + serving.kserve.io/s3-cabundle: /etc/ssl/certs/wmf-ca-certificates.crt
[13:48:29] ack, thanks for the heads up Luca!
[13:49:05] aiko: I've thought of a hack that works for yue but I'm trying to find a better way to do so
[13:49:44] also another thing to check is why we see address zh-yue instead of api-ro
[13:50:37] the hack I thought of is to use a config file with a mapping and change the value of the lang variable
[13:52:34] Machine-Learning-Team: Add support for multiple revisions in knowledge-integrity - https://phabricator.wikimedia.org/T352987 (achou)
[14:08:55] (CR) Ilias Sarantopoulos: article-descriptions: filter API response fields by debug flag (5 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[14:17:51] Good morning. Slept in a bit
[14:19:48] isaranto: yes I think it's better we first figure out why we see address zh-yue instead of api-ro before fixing it using a hack
[14:20:18] morning Chris o/
[14:21:04] morning! although it is getting dark here :)
[14:25:16] attempt to enable revertrisk on beta -> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/981337
[14:25:21] I'm excited about this!
[14:27:43] wow!
[14:28:33] 🤞 🤞 🤞
[14:28:36] very cool
[14:28:50] (PS3) Kevin Bazira: article-descriptions: filter API response fields by debug flag [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959)
[14:31:24] (CR) Kevin Bazira: article-descriptions: filter API response fields by debug flag (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[14:34:26] (CR) Ilias Sarantopoulos: [C: +1] "Nice Kevin! Works like a charm!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[14:37:39] (CR) Kevin Bazira: [C: +2] "Thank you for the reviews :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[14:38:27] (Merged) jenkins-bot: article-descriptions: filter API response fields by debug flag [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980833 (https://phabricator.wikimedia.org/T352959) (owner: Kevin Bazira)
[15:01:48] klausman: o/
[15:01:59] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/981342 seems wrong, the port is not right
[15:02:03] please don't deploy
[15:02:35] too late, but also missing /32
[15:02:52] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/981344 is the followup
[15:03:24] yep ok, please change the port as well
[15:03:34] 31443 is the right one, correct?
[15:03:39] yes
[15:04:28] fixed
[15:04:49] +1ed
[15:05:11] merci!
[15:05:33] Revert risk on ores extension!
[15:06:07] I'll be a min or two late for the research meeting
[15:06:14] (well, six or seven)
[15:10:56] Machine-Learning-Team, SRE Observability (FY2023/2024-Q2): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756 (elukey) I tried to compare the various graphs for ores-legacy in https://w.wiki/8QkW I am very puzzled about the increase() one, since I am not sure why befor...
[15:41:32] hey I haven't made any progress on nllb. will likely work on it tomorrow
[15:47:57] Machine-Learning-Team, Patch-For-Review: Test the kserve batcher for Revert Risk LA isvc - https://phabricator.wikimedia.org/T348536 (achou)
[15:47:59] Machine-Learning-Team, Goal: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (achou)
[15:48:12] Machine-Learning-Team: Add support for multiple revisions in knowledge-integrity - https://phabricator.wikimedia.org/T352987 (achou)
[15:48:14] Machine-Learning-Team, Goal: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (achou)
[15:48:48] klausman: and/or elukey: I'd like to pick your brain tomorrow at some point so that we can investigate the issue with revertrisk a bit more
[15:49:01] Sure
[15:49:18] the thing is that I'd like to debug it at least on staging and see what requests are being made
[15:52:42] thanks!
[16:00:54] I am off tomorrow (public holiday) :)
[16:01:24] logging off earlier today, o/
[16:02:06] nice! cu!
[16:12:55] have a lovely weekend Luca o/
[16:24:48] I'm logging off as well for the day! Cu tomorrow!
[16:38:58] \o
[16:39:30] Still working with Hugh on the rec-api-ng thing in the API-GW. We have something working, but there are still details to figure out. I think we can wrap this up by tomorrow
[18:55:12] night all!
[20:17:48] Machine-Learning-Team, Foundational Technology Requests: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648 (leila)
[21:37:27] Machine-Learning-Team: Investigate why model cards receive very little traffic - https://phabricator.wikimedia.org/T353025 (calbon)
[21:38:44] Machine-Learning-Team: Investigate if model cards receive very little traffic - https://phabricator.wikimedia.org/T353025 (calbon)
[21:39:33] Machine-Learning-Team: Investigate how to improve model card integration with existing user flows - https://phabricator.wikimedia.org/T353025 (calbon)
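The mapping hack floated at 13:50 (a config mapping that rewrites the lang value), together with the 10:52 idea of checking every supported language, could be sketched roughly as below. This is an illustration under assumptions, not the fix that was adopted: the names LANG_ALIASES, normalize_lang and find_aliased are hypothetical, and only the yue to zh-yue pair comes from the log.

```python
# Hypothetical alias table; only the yue -> zh-yue pair comes from the log.
LANG_ALIASES = {"yue": "zh-yue"}


def normalize_lang(lang: str) -> str:
    """Return the canonical wiki language code for an incoming lang value."""
    return LANG_ALIASES.get(lang, lang)


def find_aliased(supported_langs):
    """Return the supported codes that would be rewritten by the alias table.

    Running this over the model's full supported-language list is one way to
    spot other codes that behave like yue.
    """
    return sorted(lang for lang in supported_langs if normalize_lang(lang) != lang)


print(normalize_lang("yue"))
print(find_aliased(["en", "yue", "zh-yue"]))
```

As noted at 14:19, a mapping like this would only mask the symptom; it does not explain why the request went to the zh-yue address instead of api-ro.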