[06:55:07] (CR) Kevin Bazira: [C:+1] "Thank you for working on this, Bartosz!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[07:06:29] good morning
[07:09:46] Good morning
[07:14:11] Hola!
[07:22:28] hello!
[07:53:07] Machine-Learning-Team, Recommendation-API: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648#10837952 (Nikerabbit) Open→Resolved a:Nikerabbit All subtasks resolved.
[07:57:13] (CR) Ilias Sarantopoulos: ci: Enable import sorting. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[07:58:33] (CR) Ilias Sarantopoulos: "Just a comment for bitsandbytes. Other than that LGTM. Nice work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:01:41] (PS15) Bartosz Wójtowicz: ci: Enable import sorting. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865)
[08:02:37] (CR) Bartosz Wójtowicz: ci: Enable import sorting. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:06:06] (CR) Ilias Sarantopoulos: [C:+1] "Just to note that even the slightest simple changes might cause issues in a production setting. For example a common case in python is if " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:12:20] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10837978 (isarantopoulos) Thank you both! @taavi Thanks for running the script! Does this script need to r...
[08:12:51] (CR) CI reject: [V:-1] ci: Enable import sorting. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:14:05] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10837980 (isarantopoulos)
[08:14:50] elukey: re: https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1147812 - is it possible to use s3cmd for external purposes also? Just thinking quickly without looking at the details :)
[08:15:52] kart_: o/ I am not sure, never used it in that way, I don't think it is possible
[08:16:09] but it should be a matter of using curl/wget or s3cmd, something quick :)
[08:17:40] (CR) Bartosz Wójtowicz: "Thanks for the elaboration! Indeed, I didn't consider that re-sorting imports could cause circular dependency issues. In this case however" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:19:53] elukey: let me create an s3cmd version in another patch.
[08:20:05] We need to test it well.
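A minimal sketch of the import-order hazard raised at [08:06:06] and [08:17:40], assuming isort is the sorter wired into the pre-commit hook; the package and module names below are made up for illustration, not from inference-services. When an ordering like this is intentional, isort's action comments can fence it off so automated re-sorting does not change runtime behaviour:

```python
# Hypothetical package layout: pkg/contrib/__init__.py does
# "from pkg import Engine", so it only imports cleanly *after* Engine has been
# bound into the pkg namespace. Alphabetical sorting would put pkg.contrib
# before pkg.core and raise
# "cannot import name 'Engine' from partially initialized module 'pkg'".

# pkg/__init__.py
# isort: off  -- order is load-bearing because of the circular import above
from pkg.core import Engine      # binds Engine into the pkg namespace first
from pkg.contrib import extras   # contrib re-imports Engine from pkg at import time
# isort: on
```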
[08:20:37] okok perfect, sorry for the -1, I just wanted to make sure that analytics.w.o wasn't used (even temporarily) for prod since it is a single host
[08:20:49] and not really great at handling a lot of bw usage requests :)
[08:21:08] (CR) Bartosz Wójtowicz: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:27:55] elukey: no issue. So far, MinT setup outside Wikimedia is not known.
[08:29:59] elukey: also, is there any way to get credentials for testing for .s3cfg?
[08:30:11] (I can't test otherwise locally)
[08:34:29] On second thought, replacing the entire commands from wget to s3cmd get seems like overkill for the production config part. Let me find another way!
[08:35:05] Hey folks, I have a small question regarding our review/merge process. After submission of a new patchset, the +1s of previous reviewers are automatically removed from gerrit as far as I can see. Does it mean that we should aim to look for re-approvals on the final patchset before merging?
[08:37:03] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10838025 (isarantopoulos) The messages mentioned in the task have been merged in [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES...
[08:38:18] elukey: Maybe one more variable to separate production config and test config would work. Mostly submitting the updated script by EOD.
[08:38:35] oookkkk
[08:39:09] (CR) Gkyziridis: [C:+1] ci: Enable import sorting. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:39:17] and just a BASE_URL update (as we do right now also!)
[08:39:44] I'll ping for the config part when the script is done.
[08:40:09] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10838033 (OKarakaya-WMF) # Prod vs New Pipeline As we are likely to proceed with either prod pipeline or new pipeline, we discuss here the pros an...
[08:41:01] bartosz: I saw that the +1 from Ilias stayed there. In general if everybody had approved then you can go for merging; if you submit more changes and you still need review then you ask for it. But in general if you have two +1s then you can go for merging. It always depends on the patch.
[08:41:19] bartosz: we don't follow any strict rules. Since only a handful of people operate on the inference-services repo we have the convention that the author also does the +2 and merges after getting a review (if multiple ones are required). So if you upload a new patchset to resolve comments by a reviewer you should expect a re-review. But if the reviewer has already given a +1 then it is ok to merge. This all depends on the change. For example updating the body of the commit message with a better description or fixing a typo that everybody had missed before wouldn't require an approval
[08:43:15] isaranto: georgekyz: I see, thank you guys!
[08:43:44] 🙌
[08:45:49] are we going to deploy the new model versions from this patch?
[08:48:13] I mean... the post-merge will generate new versions of images for all models because files are affected by the import sorting or some stylistic changes, but other than that nothing else is changed, right? So, we probably do not need to deploy anything, right?
[08:48:28] I'm wondering about the same... I can also go ahead with the remaining 2 pre-commit patches - changing the line length limit and bumping the target python version for pyupgrade. After those, I could test images on staging and possibly re-deploy if needed?
[08:49:43] we can proceed with the remaining changes and then deploy to staging. Bartosz can go through the process of deploying to staging and running the httpbb tests https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/test/revscoring_manual_tests/#liftwing-revscoring-api-tests
[08:50:31] also now that I saw it: we should move these commands to a place on wikitech or just under a README.md under test/ in the repo, as they don't apply only to revscoring services
[08:52:22] 👍
[08:52:42] sounds good to me!
[08:53:14] re: line length, cause I saw I had missed it. Is this change necessary? Do we really need to use 88?
[08:53:31] please doooon't, it is too short
[08:53:39] 😇
[08:54:02] I think it's the opposite - we're currently using 88 because we were using default black settings. Now, I'd plan to enforce the 120 limit
[08:54:40] aa ok, then +2 from me :)
[08:55:37] oh I thought we were using a bigger one, sorry
[08:55:49] or we can just keep 88 and avoid the changes if it still works for everyone
[08:56:06] (CR) Bartosz Wójtowicz: [C:+2] ci: Enable import sorting. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[08:56:57] morning morning o/
[08:56:57] klausman: elukey: I've addressed all comments on the vllm image patch and successfully tested the image on ml-lab1002.
[08:56:57] whenever you get a minute, please let me know whether we can proceed to merge: https://gerrit.wikimedia.org/r/1146891
[08:56:57] thanks!
[08:57:34] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10838069 (OKarakaya-WMF) Got results for top 10 languages with the new pipeline. New pipeline: implementation in research datasets with some improv...
[09:03:11] kevinbazira: I left two small comments, nothing big, after those and the successful test we can probably merge. One question though - does the build require a ton of memory/cpu/disk-space?
[09:03:21] because it will run on build200X
[09:03:45] isaranto: georgekyz: I'm also happy to keep it at 88, but since we are already over 80, I don't see a reason not to go to 100 or 120. The second change bumping the target python version is also not anything major, but it would change all type annotations from e.g. `typing.List` to `list` (the builtin generics were introduced in Python 3.9) in ~35 files.
[09:06:13] (Merged) jenkins-bot: ci: Enable import sorting. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[09:08:39] I don't have a strong opinion on line-length so I'll leave it up to you folks to decide whatever works best for you and your screens :D
[09:18:14] so I would suggest we just bump the python target-version from 3.7 to 3.11, as it can be seen as more of a tech debt fix, but leave the current line limit if it's been working well so far :)
[09:23:19] (PS1) Bartosz Wójtowicz: ci: Bump Python target-version from 3.7 to 3.11. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1148282 (https://phabricator.wikimedia.org/T393865)
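For context on the `typing.List` → `list` change mentioned at [09:03:45] (the patch above targets 3.11; it is later revised to 3.9 at [12:18:30]): with a 3.9+ target, pyupgrade rewrites the deprecated typing aliases to the builtin generics from PEP 585. A hedged sketch with an illustrative function, not code from the repo:

```python
from typing import Dict, List


# Before: Python 3.7-compatible annotations using the typing aliases.
def score_revisions(rev_ids: List[int]) -> Dict[int, float]:
    return {rev_id: 0.0 for rev_id in rev_ids}


# After pyupgrade with a 3.9+ target: builtin generics (PEP 585), valid only on
# Python >= 3.9, which is why the target stops at 3.9 for services still on it.
def score_revisions_upgraded(rev_ids: list[int]) -> dict[int, float]:
    return {rev_id: 0.0 for rev_id in rev_ids}
```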
[09:26:39] ack!
[09:45:12] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10838228 (isarantopoulos) a:Kgraessle→gkyziridis
[09:58:08] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10838253 (gkyziridis) >>! In T392148#10834187, @gkyziridis wro...
[10:10:35] elukey: thanks for the reviews, I've fixed all comments. I don't have +2 rights on this repo, so you'll help me merge.
[10:10:35] regarding the resources required by the build, IIUC >70GB disk-space and >60GB RAM
[10:10:35] here are the details in a grafana dashboard: https://grafana.wikimedia.org/goto/-WKAdwaNR?orgId=1
[10:12:05] kevinbazira: let's wait for Tobias' final sign off, as I wrote in the code review
[10:12:23] okok
[10:15:38] kevinbazira: wait a sec, does it require a GPU to build?
[10:19:11] no, the build process itself does not require a GPU
[10:20:01] because the grafana dashboard you linked seems GPU-related
[10:21:01] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-3h&to=now&timezone=utc&var-server=ml-lab1002&var-datasource=000000026&var-cluster=misc seems to indicate ~10GB of memory used, if you built around 9:10 UTC
[10:21:18] that is a lot but not that bad
[10:24:51] the docker-pkg-build.log shows start: 2025-05-20 06:25:06 and end: 2025-05-20 08:22:13
[10:25:49] two hours?
[10:26:08] yep
[10:26:15] oh my
[10:26:42] :')
[10:26:48] I guess you are here https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=2025-05-20T06:30:37.112Z&to=2025-05-20T07:56:47.769Z&timezone=utc&var-server=ml-lab1002&var-datasource=000000026&var-cluster=misc
[10:27:05] peak of 40G
[10:27:59] so the code review requires some ground work first kevinbazira, namely ml-lab100x will likely become a host to build/push docker images from
[10:28:12] but it needs to be discussed with SRE
[10:28:58] I'll kick off the discussions with Moritz, but klausman will have to start a formal process etc. so we do things properly, add the necessary security boundaries etc.
[10:29:17] SGTM!
[10:34:37] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10838361 (Ladsgroup) >>! In T382171#10837978, @isarantopoulos wrote: > Thank you both! > @taavi Thanks for...
[10:47:38] * isaranto afk lunch!
[11:20:23] Some of my CI jobs are hitting `No space left on device` errors (e.g. https://integration.wikimedia.org/ci/job/inference-services-pipeline-article-descriptions/233/execution/node/84/log/). Is there anything I can do on my end to clean up the underlying CI machine(s)?
[11:53:50] i think this happens because they are running all together. Perhaps triggering just the one that fails from the jenkins UI could fix it
[11:55:44] I just triggered a rebuild for article descriptions https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-article-descriptions/
[11:58:02] ahh so rerunning a single job is only possible from the jenkins UI, gerrit always triggers the full CI?
[12:01:31] yes, `recheck` will trigger the CI for the current patch
[12:03:19] Machine-Learning-Team, Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10838712 (isarantopoulos)
[12:03:21] Machine-Learning-Team, Patch-For-Review: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10838713 (isarantopoulos)
[12:06:08] isaranto: I see, thanks! Triggering a rebuild within jenkins indeed makes the job successful. However, can I make gerrit understand that it was re-triggered and update the reference to the new successful run?
[12:09:16] I don't know! If someone else knows here that would be useful. I don't know if we can trigger a specific pipeline through gerrit (I doubt it). Lemme take a look. Otherwise we can force merge if we review the patch and also see manually re-trigger the build from jenkins and it succeeds
[12:10:04] * I mean also Jenkins succeeds even if ran manually like we just did (I just read the above sentence and it didn't make any sense)
[12:18:30] (PS2) Bartosz Wójtowicz: ci: Bump Python target-version from 3.7 to 3.9. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1148282 (https://phabricator.wikimedia.org/T393865)
[12:24:24] ^ ended up bumping the target-version to 3.9 instead of 3.11, because some of the models run on python 3.9
[12:31:19] hmm, which are these? revscoring models?
[12:31:32] going to meetings -- will check later!
[12:35:31] isaranto: I found it in the `langid`, `ores-legacy` and `revscoring` models
[13:31:44] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778 (isarantopoulos) NEW
[13:32:40] Machine-Learning-Team: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779 (isarantopoulos) NEW
[13:36:03] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10839271 (achou) Update: - Working on negative samples collection for French, Spanish, Japanese, Portuguese, English for HIL evaluation. - Get examples of good edits on the same...
[14:05:52] Machine-Learning-Team, Goal: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs - https://phabricator.wikimedia.org/T391941#10839391 (kevinbazira) We ran performance benchmarks on the wmf-debian-vllm image, verifying that porting the upstream image...
[14:36:11] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10839526 (isarantopoulos) We will deploy the ORES extension on idwiki without enab...
[15:07:32] Machine-Learning-Team, Patch-For-Review: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10839652 (BWojtowicz-WMF) So far we've merged 2 patches: 1) [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1145888 | Remo...
[15:49:14] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10839941 (BAPerdana-WMF) I will check with the translation and get it done by Thursday (max.)
[16:04:45] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10840000 (Trizek-WMF) a:BAPerdana-WMF
[16:05:57] * isaranto afk
[16:20:10] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10840104 (kevinbazira)