[03:57:27] Machine-Learning-Team, ORES: Review traffic on ores.wikimedia.org - https://phabricator.wikimedia.org/T352527 (Aklapper)
[07:24:16] Good morning!
[07:28:10] Machine-Learning-Team: Fix the link recommendation training pipeline - https://phabricator.wikimedia.org/T352525 (kevinbazira) To fix the link recommendation model training pipeline, I followed the steps below: * updated the requirements that were failing * set up both the python3.10 conda env and python3.7 e...
[07:28:35] isaranto: o/ morning
[07:29:22] I am building your article-descriptions patch to test it locally and share the reviews.
[07:42:18] morning Kevin! let me know if anything is not clear
[07:51:27] (Abandoned) Ilias Sarantopoulos: ci: test debian bookworm [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/977676 (owner: Ilias Sarantopoulos)
[08:12:09] hello folks
[08:23:36] o/
[08:28:36] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/967458 (https://phabricator.wikimedia.org/T349382) (owner: Ilias Sarantopoulos)
[08:48:36] isaranto: o/
[08:48:46] do you have time to help me test the istio upgrade?
[08:49:57] yep sure!
[08:50:09] thanks!
[08:50:15] always!
[08:50:27] so I am deploying the new control plane + gateways, and I just depooled eqiad via DNS discovery
[08:50:36] so we should have traffic landing on codfw only
[08:50:50] the upgrade is already in staging of course
[08:51:22] so I am going to upgrade ml-serve-eqiad, and before repooling we should spot-check if the basics are working
[08:51:25] does it sound good?
[08:54:37] ok. so you want me to check ml-serve-eqiad when you upgrade?
[08:54:57] is it happening now?
[08:55:16] isaranto: upgrade, we can check together
[08:55:22] *upgraded
[08:55:35] google meet?
[08:56:25] IRC is fine if you are ok
[08:56:45] I just tested RR ML via inference.svc.eqiad.wmnet:30443
[08:56:59] I'm just going to run all httpbb tests
[08:58:02] also for port 31443
[08:58:03] recommendation-api-ng.svc.eqiad.wmnet:31443
[09:02:00] all tests passed!
[09:02:34] I just checked we don't have rec-api-ng and ores-legacy in there
[09:02:56] yep I was about to say
[09:05:41] isaranto: going to repool eqiad ok?
[09:06:06] go ahead!
[09:06:38] super, going to wait some mins and will do codfw too
[09:06:51] same procedure
[09:10:11] I am getting some errors which seem transient when retrying the tests
[09:12:20] elukey: after multiple retries I still get them now for eqiad
[09:13:03] https://phabricator.wikimedia.org/P54071
[09:13:56] I don't know if they are related. Sometimes I get 3 errors, sometimes 2. On codfw I got 2 errors and then on all subsequent runs they were fine
[09:16:42] isaranto: no idea, it may be something related to https://phabricator.wikimedia.org/T352290
[09:17:11] does it always say "failed to fetch features"?
[09:19:10] yes. now they all passed
[09:20:06] another thing that I remember is that sometimes, until the connection pool between the istio sidecar and the endpoint is established/populated, there may be transient errors
[09:23:57] ack
[09:26:12] codfw depooled!
[09:30:17] does that mean we are hitting eqiad?
[09:30:41] yes
[09:31:15] clear
[09:31:16] the inference.discovery.wmnet record now resolves only to eqiad
[09:31:23] so we can operate on codfw etc..
[09:31:29] (first time that we do it basically)
[09:31:41] just wanted to verify I understand :)
[09:31:53] yes yes I was about to say, please ask all questions that you have :)
[09:34:16] isaranto: upgraded! I ran a couple of tests and it looks good
[09:34:22] do you want to run httpbb?
[09:34:28] on it!
[09:34:30] after that I'll repool and we should be done
[09:34:31] <3
[09:35:59] Machine-Learning-Team, serviceops, Patch-For-Review: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933 (elukey)
[09:39:33] most tests run fine, I'm getting 2 errors for wikidata-itemquality and revertrisk-language-agnostic
[09:39:48] weird
[09:39:51] anyway, repooling
[09:41:12] Thank you for the help :)
[09:41:47] all done! istio upgraded
[09:43:44] * elukey bbiab
[09:44:37] nice! I'll check the tests again in a bit and will open a task if they persist
[10:07:06] (CR) Kevin Bazira: "Thank you for fixing the rest_url issue, Ilias! I've left a few comments about a redundant python path and the python utils directory whic" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[10:14:58] * isaranto afk, early lunch!
[11:01:26] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey) Deleted two Redis passwords in the private puppet repo.
[11:02:29] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey) >>! In T347278#9340995, @klausman wrote: >>>! In T347278#9340974, @elukey wrote: >> @klausman everything should be done, except the work in T349632, lemme know if any...
[11:20:49] Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (klausman) These are the remaining hits in the puppet repo. We need to keep these: ` hieradata/common/profile/kubernetes/deployment_server.yaml 336: ores-legacy: 338: - name: ores-legacy...
[11:25:36] taavi: \o I am chasing ores leftovers in puppet, and there is (some) mention of it in modules/profile/files/toolforge/legacy_redirector.lua. The file itself says it's generated using a tool Arturo wrote/maintained (https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#wmcs-generate-legacy-redirector-list.py). Any idea a) what that file specifically does and b) how to re-run it? the
[11:25:38] docs mention tool-k8s controllers, which seem to not exist (anymore).
[11:42:38] klausman: o/ as FYI I upgraded istio on all our clusters, and also cert-manager in staging only
[11:42:58] thank you!
[11:44:28] * elukey lunch
[11:55:49] ditto
[12:10:16] Morning all!
[12:15:07] Morning Chris!
[12:19:29] morning!
[12:19:59] it is still basically night on Chris' side to be honest :D
[12:27:52] lol yes
[13:12:49] (PS4) Ilias Sarantopoulos: article-descriptions: add helper function for rest gateway url [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940)
[13:13:44] (CR) Ilias Sarantopoulos: "Fixed the wrong import and updated README.md with the missing pythonpath info" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[13:19:42] kevinbazira: I responded in the patch. let's also discuss here if needed
[13:20:42] I ran the whole thing locally BUT one thing I haven't done is run the docker image. loading the model on my M1 takes a million years and I need to revisit my configuration. I remember solving this somehow a couple of months ago
[13:22:56] (CR) Ilias Sarantopoulos: "@kharlan Let me know what you think about the changes I made or if we need to discuss this more. Thanks!" [extensions/ORES] - https://gerrit.wikimedia.org/r/971547 (https://phabricator.wikimedia.org/T348298) (owner: Ilias Sarantopoulos)
[13:23:44] (CR) Kosta Harlan: add revertrisk model to the list of models (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/971547 (https://phabricator.wikimedia.org/T348298) (owner: Ilias Sarantopoulos)
[13:28:53] isaranto: just a sec. I am concluding the load tests then will have a look at the patch
[13:33:54] sure! take your time, just ping me if anything isn't clear
[14:16:41] aiko: o/
[14:17:17] don't know if you saw my messages but w8 before merging the revertrisk patch
[14:17:39] isaranto: yes I saw it
[14:17:45] cool cool
[14:17:52] do you need any help with that?
[14:17:54] isaranto: I'll file a patch
[14:18:04] ack
[14:18:12] to integration/config :)
[14:18:44] awesome. also have a great week :)
[14:20:22] you too! :D
[14:24:25] emailing emailing emailing emailing
[14:26:20] hey folks we have our clusters published to https://grafana-rw.wikimedia.org/d/WG4NjDISk/wip-cluster-status-and-capacity
[14:26:40] still all WIP, some metrics are still not 100% correct, but the final state should be helpful
[14:27:09] we will be able to answer more easily if we are under capacity, what is the largest consumer of memory/cpu, etc..
[14:28:32] looking nice!
[14:28:49] except the memory used, that is definitely not looking nice, but I get it, it's WIP
[14:29:20] chrisalbon: we are good on all clusters, anything that concerns you?
[14:29:54] k8s-mlserve https://usercontent.irccloud-cdn.com/file/y9w7ZBV5/Screenshot%202023-12-04%20at%206.29.30%E2%80%AFAM.png
[14:30:32] it disappeared now
[14:30:41] I think it was a hiccup with the dashboard
[14:30:44] Kamila is fixing the graphs, it was probably wrong
[14:31:04] also the default is for 7 days and we just started collecting, maybe grafana didn't like it
[14:32:05] isaranto: https://gerrit.wikimedia.org/r/c/integration/config/+/979974
[14:32:10] would it be possible to add average total requests per hour? People often ask that question of Lift Wing and I'd love to have an answer
[14:33:44] so much info in the dashboards! looking nice
[14:35:50] chrisalbon: do you mean total cpu/memory requests per hour? IIUC it should be Memory/CPU usage by namespace
[14:36:04] I think the per-hour req/s thing would be something like sum(increase(istio_requests_total{destination_service_namespace=~"revertrisk"}[1h]))
[14:36:09] it is very low though, I think something is wrong
[14:36:14] klausman: that file controls which tools are old enough to get a redirect from tools.wmflabs.org to toolforge.org. do not touch, unless you are removing a tool which has been properly disabled and deleted. I think the current way to generate it is the command in the commit message of https://gerrit.wikimedia.org/r/c/operations/puppet/+/682325
[14:36:46] thank you!
[14:36:54] elukey I wasn't thinking about cpu/memory, I was thinking about api requests
[14:37:44] chrisalbon: ack so the above dashboard is only for memory/cpu, for rps we have https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio
[14:38:07] ah nice!
[14:38:10] taavi: where would that command run?
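Note on the per-hour request count suggested at 14:36:04: `increase()` over a `[1h]` window measures how much each counter series grew during that hour, and `sum()` adds the series together. The arithmetic can be sketched in Python as below. This is only an approximation under stated assumptions: the sample data is made up, `counter_increase` is a hypothetical helper name, and the sketch skips the extrapolation Prometheus applies at the window edges.

```python
from typing import Sequence, Tuple

# One sample is (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Growth of a monotonic counter across the given samples.

    A drop in value is treated as a counter reset (e.g. a pod restart):
    the counter is assumed to have restarted from zero, so the new value
    itself is counted as growth.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Made-up samples for two series (think: two pods) over one hour.
series_a = [(0, 100.0), (1800, 160.0), (3600, 220.0)]  # steady growth
series_b = [(0, 50.0), (1800, 5.0), (3600, 25.0)]      # reset in the middle

# Equivalent in spirit to sum(increase(metric[1h])) over both series.
total = sum(counter_increase(s) for s in (series_a, series_b))
print(total)
```

A suspiciously low result, as reported at 14:36:09, would in this model mean that the label selector matches few series or that the matched counters barely grew.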
[14:38:49] klausman: any cloud vps instance, or other box with the developer account ldap tree tooling installed
[14:38:56] ty
[14:39:29] btw we're at an offsite this week, so expect weird timezones and delayed responses :-P
[14:39:40] Noted :)
[14:40:06] klausman: FYI, I merged your labs-private change to remove ores leftovers
[14:40:14] thanks
[14:40:30] (working on too many changes at the same time ...)
[14:41:41] Machine-Learning-Team, serviceops: Bump istio and Cert Manager Docker images to Bullseye - https://phabricator.wikimedia.org/T351933 (elukey)
[14:45:17] Machine-Learning-Team, Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (klausman) Open→Resolved We're all done here. The to-be-archived repos we'll handle in the separate ticket.
[14:53:53] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (klausman) My proposed big warning for those repos: ` Warning: The ORES infrastructure and Revscoring models are being deprecated by the WMF Machine Learning team, pl...
[14:54:28] ^^^ I added a proposed warning message for deprecated GH repos (ORES, Revscoring) to this bug, please give it a read and comment if you feel something is off or missing.
[15:03:03] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (elukey) Two nits, the rest looks great: > For a transitionary period, the Revscoring from ORES In this case I'd specifically list the model names, since a lot of peo...
[15:10:31] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (klausman) New version: ` Warning: The ORES infrastructure and Revscoring models are being deprecated by the WMF Machine Learning team, please check https://wikitech.w...
[15:26:05] (CR) AikoChou: [C: +1] "LGTM! only one question related to the PYTHONPATH" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[15:27:45] Machine-Learning-Team, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q2): Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey) Two new pilots for Lift Wing: * Latency SLO - [[ https://slo.wikimedia.org/objectives?expr={__name__=%22lif...
[15:29:28] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (elukey) Great, the only missing bit is the "Some Revscoring models" that I believe should contain the list of model names, maybe in a wikitech page is ok as well. I'll...
[15:36:46] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (isarantopoulos) This looks nice! One nit: I'd rephrase the first phrase to `The ORES infrastructure is being deprecated by the WMF` as I wouldn't mention that we are...
[15:37:12] I don't know if you agree with --^ I'm still thinking about how we could phrase it
[15:37:48] Maybe "Deprecate the _use_ of the Revscoring models on ORES"?
[15:38:26] Still not quite great
[15:45:22] (CR) Kevin Bazira: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[15:50:42] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (achou) I would suggest the find-models link to use https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Revscoring_models_(migrated_from_ORES) which is under...
[15:54:18] aiko: good call re: link
[15:55:11] klausman: o/ :D
[16:02:50] Thanks all
[16:06:32] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (klausman) Final version. Unless there are any objections today, I will update the linked PR today and after review, we can merge it tomorrow. I'll then add this text t...
[16:11:13] Machine-Learning-Team: Add a script for running the model server locally - https://phabricator.wikimedia.org/T352689 (achou)
[16:14:43] isaranto: ---^ created a task for adding the local-run script you mentioned last week
[16:15:04] thank you!
[16:15:43] I've been fighting with CI and virtualenvs for some time today trying to run tests and I'm thinking to leave this work for when we move to gitlab
[16:16:27] aiko: also regarding your comment for the PYTHONPATH I'm a bit puzzled as to why it runs but I can investigate more and check
[16:27:09] isaranto: I checked the revertrisk container, there the python/ and model_server/ are at the same level of the directory tree, so I think that's why it doesn't need to be in the PYTHONPATH
[16:28:58] my head is gonna explode :)
[16:29:31] yes because we're accessing it from the top level dir so it can "see" the python dir, but then we have the same in article-desc
[16:35:16] also I forgot the __init__.py in the python dir. I think it is a good practice to explicitly define packages like this and append the top level directory to your PYTHONPATH, that way you can access them from any level in the directory tree
[16:35:39] on the other hand modern python doesn't seem to care that much about __init__.py's
[16:36:39] aiko: is it ok if I resolve your comment for now?
[16:36:56] isaranto: sure! no problem
[16:37:17] (CR) Ilias Sarantopoulos: article-descriptions: add helper function for rest gateway url (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[16:37:22] (CR) Ilias Sarantopoulos: [C: +2] article-descriptions: add helper function for rest gateway url [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[16:38:07] I'll test all these from a clean clone of the repo cause I've built so many images today everything is messed up (first of all in my head)
[16:42:48] Machine-Learning-Team: Deploy ctranslate2 version of nllb-200 - https://phabricator.wikimedia.org/T351740 (isarantopoulos) p:Triage→Medium a:isarantopoulos
[16:47:28] (Merged) jenkins-bot: article-descriptions: add helper function for rest gateway url [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979369 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[16:56:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/980002
[16:56:31] I'm leaving the deployment for tomorrow morning
[16:57:15] ack!
[16:57:57] That's me, I deploy on Friday evening and then sometimes I refuse to deploy on Monday evening :)
[17:00:32] now that we mastered LLMs 😛...
[17:00:49] get ready for LVMs (Large Vision Models) https://arxiv.org/abs/2312.00785
[17:18:04] going afk, have a nice rest of the day folks!
[17:19:06] ciao Luca!
[17:34:00] night elukey!
[17:41:19] (PS1) Ilias Sarantopoulos: nllb: add cpu optimized version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/980015 (https://phabricator.wikimedia.org/T351740)
[17:41:49] going afk as well folks, cu tomorrow!
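Note on the PYTHONPATH and `__init__.py` exchange at 16:27 to 16:35: both points made there can be checked with a small self-contained Python sketch. It assumes a throwaway directory layout with hypothetical names (`demo_utils`, `demo_ns`, `kind`), not the real inference-services tree. A directory with `__init__.py` is a regular package; one without it is still importable as a namespace package (PEP 420), as long as its parent directory is on `sys.path`, which is what PYTHONPATH populates at interpreter startup.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_layout(root: Path) -> None:
    """Create a toy repo with one regular and one namespace package."""
    regular = root / "demo_utils"  # hypothetical name, has __init__.py
    regular.mkdir()
    (regular / "__init__.py").write_text("")
    (regular / "helper.py").write_text("def kind():\n    return 'regular'\n")

    namespace = root / "demo_ns"   # hypothetical name, no __init__.py
    namespace.mkdir()
    (namespace / "helper.py").write_text("def kind():\n    return 'namespace'\n")


def import_from(root: Path, module: str):
    """Import `module` with `root` temporarily on sys.path."""
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module)
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_layout(root)
    regular = import_from(root, "demo_utils.helper")
    namespace = import_from(root, "demo_ns.helper")
    results = (regular.kind(), namespace.kind())
    print(results)
```

Both imports succeed only because the top-level directory is on the path, which matches the observation that the revertrisk container works when run from the top-level dir.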
[17:42:56] have a nice evening Ilias o/
[17:58:53] Machine-Learning-Team, ORES, Growth-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: 'Highlight likely problem edits' preference doesn't select any filters in mobile web - https://phabricator.wikimedia.org/T318683 (Jdlrobson) The web team has not worked on this codebase before either.
[18:00:58] (PS4) AikoChou: revert-risk: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/979315 (https://phabricator.wikimedia.org/T352181)
[19:02:45] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (kevinbazira) The article-descriptions model-server has been deployed in the LiftWing experimental nam...
[20:04:22] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (Isaac) Thanks @kevinbazira ! Awesome to see this working! A bug I uncovered below and then a few thou...
[21:14:00] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (JTannerWMF) As far as latency, our goal in the app for features is 500 milliseconds. I'd say anything...