[06:30:15] good morning!
[06:35:27] Good morning
[06:41:30] Machine-Learning-Team, Documentation: [Fix]: Documentation for ORES and MediaWiki Docker - https://phabricator.wikimedia.org/T393876#10828699 (isarantopoulos) I had accidentally removed the install step. We have already tested this with @gkyziridis and everything works so I will mark this as resolved.
[06:42:16] Machine-Learning-Team, Documentation: [Fix]: Documentation for ORES and MediaWiki Docker - https://phabricator.wikimedia.org/T393876#10828703 (isarantopoulos) Open→Resolved
[06:44:03] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10828704 (OKarakaya-WMF) Looking into [this](https://gitlab.wikimedia.org/akhatun/research-mwaddlink/-/merge_requests/7) PR, we have previously tes...
[07:09:00] hello!
[07:49:57] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10828768 (kostajh)
[08:00:10] morning morning o/
[08:00:10] elukey, klausman: I have prepared a WIP patch that adds the wmf-debian-vllm image to the production images repo: https://gerrit.wikimedia.org/r/1146891
[08:00:10] I tried testing this patch using: `docker-pkg -c config.yaml build images/amd/vllm085` and faced the following issues:
[08:00:10] 1. on ml-lab1002, fails to connect to the internet as detailed in: https://phabricator.wikimedia.org/P76252
[08:00:10] 2. on ml-testing, runs out of space as detailed in: https://phabricator.wikimedia.org/P76254
[08:00:10] any ideas on how we can test this patch with `docker-pkg`?
[08:01:35] The "connect to internet" failure is because APT does not respect the proxy env vars.
[08:01:47] We'd have to do the apt.conf thing again
[08:02:18] As for ml-testing, I'll see what I can do
[08:03:08] the version tag is something :D gfx90arocm6.3.1pytorch2.8.0flash-attn2.7.4vllm0.8.5
[08:03:34] that one probably needs a -1 at the end, to respect Debian's format
[08:08:05] As for the external repo use, in the long term, we'd probably want to point that apt at WMF's apt repo and import all non-Debian packages to it
[08:11:00] the AMD/radeon stuff should be easy to be added, even before releasing this image
[08:11:07] Aye
[08:11:29] I don't think we have 6.3 yet, but 6.1 is already there
[08:11:46] but fetching from git is also done by other images on build2001, so there is a config for docker-pkg there that allows external fetches
[08:12:16] The other question is if build2001 has enough disk/tmp space
[08:12:46] 34G free on /
[08:13:40] klausman: check /etc/production-images/config.yaml on build2001 for the proxy config
[08:13:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:13:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[08:13:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:14:04] the build script does stuff like /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml "$@" images \
[08:14:14] so it should be easy to be done on ml-lab if needed
[08:15:10] kevinbazira: on ml-lab1002 you need to add `http_proxy: "http://webproxy.codfw.wmnet:8080"` to the config.yaml that you are using to build
[08:18:03] elukey: I had added the proxy to ~/.config/docker-pkg.yaml as per the `docker-pkg` docs: https://phabricator.wikimedia.org/P76252#306602
[08:18:03] I can also test it with the config.yaml
[08:18:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:18:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[08:18:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:20:20] kevinbazira: the -c config.yaml part takes the priority over your home config, so this is probably why you see the errors
[08:20:29] I don't think that the two configs are merged
[08:20:41] okok... on it
[08:25:49] elukey: ah, that also explains how build2001 can do wmf registry uploads (no, I will not be copying that auth info elsewhere :))
[08:26:08] klausman: re space on build200[12], it is probably worth opening a task to figure out what the needs are for these huge builds, so we can figure out if we have to adjust any spec
[08:26:33] yeah, at least disk space is a bit easier than the memory
[08:27:00] build2002 has ~390GBs for example
[08:27:08] elukey: also, did you see the upstream bug Kevin found where AMD discusses the reasons for the library sizes, and some potential reductions are discussed?
[08:27:22] nope
[08:27:26] https://github.com/ROCm/ROCm/issues/4224
[08:28:52] the tl;dr is that it's a cross product of gpu-models x problem sizes x problem types. There's also quite a bit of bundling. https://github.com/ROCm/ROCm/issues/4224#issuecomment-2583348653 also mentions compression resulting in a 9x (!) smaller hipblaslt
[08:30:05] Said patch was merged upstream already, but I don't think there has been a release
[08:30:38] let's hope something good comes out
[08:31:02] Aye, at a minimum, I am glad AMD acknowledges this as a problem that needs solving
[08:32:10] after adding apt.conf and setting the proxy in config.yaml, I am still getting the same connection error on ml-lab1002:
[08:32:10] ```
[08:32:10] [docker-pkg-build] INFO - W: Failed to fetch http://mirrors.wikimedia.org/debian/dists/bookworm/InRelease Temporary failure resolving 'webproxy.codfw.wmnet'
[08:32:10] ```
[08:32:10] when I run `docker build --network host --build-arg http_proxy=http://webproxy:8080 -t wmf-debian-vllm .` the image builds without issues. how can we set `--network host` when using `docker-pkg`?
[08:49:44] I honestly don't know. Doing some searching
[08:50:28] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10828970 (kevinbazira) >>! In T393474#10828704, @OKarakaya-WMF wrote: > @kevinbazira may help with the [back testing threshold](https://meta.wikim...
[08:55:35] kevinbazira: can you tell me what commands are you running and where?
[08:55:43] I'd like to test them if it is not an issue
[08:56:20] also if you specify the http_proxy stuff in config.yaml you should remove the related build arg in the dockerfile
[08:59:43] elukey: sure sure! on ml-lab1002.eqiad.wmnet, I have this patch: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891
[08:59:44] and I am running `docker-pkg -c config.yaml build images/amd/vllm085`
[09:02:26] kevinbazira: where is the checkout? I tried /home/kevinbazira/WMF_vLLM_image/production-images but it doesn't seem to have it
[09:05:11] elukey: please check again /home/kevinbazira/WMF_vLLM_image/production-images
[09:10:52] I've cleaned it up and removed all the tests
[09:15:42] kevinbazira: now it works, I had to manually add the webproxy's ip address in config.yaml
[09:15:55] not sure why it happens, maybe we are missing an extra setting
[09:16:14] it now fails for some package
[09:16:31] ah no wait sigh
[09:16:33] W: Failed to fetch http://security.debian.org/debian-security/dists/bookworm-security/InRelease Could not connect to 208.80.154.74:8080 (208.80.154.74), connection timed out
[09:18:02] yep, same issue I was facing
[09:21:51] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10829093 (OKarakaya-WMF) updated the single model comment based on the thresholds you've shared @kevinbazira . Thank you!
[09:24:52] trying something else
[09:29:07] ack!
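The "apt.conf thing" mentioned at 08:01 can be sketched as follows. This is a minimal illustration, not the configuration actually used on ml-lab1002: `Acquire::http::Proxy` and `Acquire::https::Proxy` are standard APT options, while the helper name `render_apt_proxy_conf` is hypothetical and the proxy URL is simply the one quoted in the chat.

```python
def render_apt_proxy_conf(proxy_url: str) -> str:
    """Return apt.conf lines that send APT's HTTP and HTTPS fetches through proxy_url."""
    if not proxy_url.startswith(("http://", "https://")):
        raise ValueError(f"proxy URL must start with http:// or https://: {proxy_url!r}")
    return (
        f'Acquire::http::Proxy "{proxy_url}";\n'
        f'Acquire::https::Proxy "{proxy_url}";\n'
    )


# Proxy URL taken from the 08:15 message; where the file is written is left to the caller.
conf = render_apt_proxy_conf("http://webproxy.codfw.wmnet:8080")
print(conf, end="")
```

The rendered text would typically be dropped into a file under /etc/apt/apt.conf.d/ inside the image; the exact file name is a local choice.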
[09:41:58] Machine-Learning-Team, LDAP-Access-Requests, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10829180 (BWojtowicz-WMF) Thank you @BCornwal...
[09:45:22] kevinbazira, klausman found the issue - I needed to add "iptables": false in docker's daemon.json
[09:45:25] now it works
[09:46:10] kevinbazira: I left some changes in your repo, some of them are related to using docker-pkg's specific way to install debs
[09:46:54] elukey: thanks! do we still need `http_proxy: "http://208.80.154.74:8080"` in the config.yaml?
[09:47:20] yes, I changed it now to webproxy.eqiad.wmnet
[09:47:31] sorry I didn't add the changes that I mentioned, only removed ENV variables
[09:47:37] lemme give you an example
[09:47:43] if you want to install a package, you can do
[09:47:55] {{ "packagename1 packagename2 ..." | apt_install }}
[09:48:11] this will take care of all the settings + apt-get update + apt-get install
[09:48:28] so you can remove all boilerplate code and refactor apt installs
[09:48:35] it should also take care of proxying properly
[09:48:54] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10829187 (gkyziridis) >>! In T392148#10827957, @Kgraessle wrot...
[09:50:58] kevinbazira: going to comment directly on the code review
[09:51:19] okok thanks!
[09:52:33] added you and Tobias as a reviewers
[09:53:00] *the
[10:01:48] kevinbazira: done!
[10:02:35] thanks! let me address the comments ...
[10:03:33] some of them, like the user, may require some thinking/testing, but the idea is to avoid running the containers as root
[10:03:38] elukey: good catch!
[10:05:34] Also, agreed re: using WMF's apt repo before merging. I'll work on getting 6.3 added to the repo today
[10:14:42] elukey: thank you for your help <3
[10:15:48] <3
[10:48:52] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10829402 (Reedy)
[12:00:59] (CR) Gkyziridis: [C:+1] "LGTM! Nice initiative to work on this. One minor suggestion only regarding the commit message, since the repo is called `inference-service" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[12:41:28] elukey, kevinbazira: hit a snag with 6.3: it requires libc >=2.38, but bookworm only has 2.36
[12:41:55] (trixie has 2.41)
[12:51:03] weird, how come it works on bookworm? IIUC from Kevin's test it worked
[12:57:51] I don't know. I just made a bookworm-chroot and added the AMD Rocm 6.3 apt repo, and ended up with the above conflict
[12:58:07] ah, hang on.... would this still be jammy.
[12:59:01] my bad, I had copied the sources.list from my private workstation which is trixie, and so it pulled noble instead of jammy
[13:01:25] ohh okok that explains
[13:01:27] good :)
[13:07:24] (CR) Ilias Sarantopoulos: "Nice work Bartosz! Welcome aboard!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[13:37:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146991 is ready for review
[13:37:48] (PS7) Bartosz Wójtowicz: ci: Upgrade pycommit setup to use only ruff. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865)
[13:42:04] (CR) Bartosz Wójtowicz: "Thank you both very much! I've slightly updated the commit message to reflect that it's from the `ci` component." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[13:42:21] Machine-Learning-Team: AI/ML Infrastructure Request: **Accessing topics endpoints at scale** - https://phabricator.wikimedia.org/T392833#10830000 (Isaac) Just adding some quick thoughts of nice-to-haves: * Regarding `Being able to access model outputs at scale would likely unlock additional use cases for li...
[13:44:16] (CR) Ilias Sarantopoulos: [C:+1] "Done" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[13:46:19] (PS8) Bartosz Wójtowicz: ci: Upgrade pre-commit setup to use only ruff. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865)
[13:46:56] (CR) Bartosz Wójtowicz: ci: Upgrade pre-commit setup to use only ruff. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[13:47:26] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10830006 (Kgraessle) >>! In T391103#10806815, @Ladsgroup wrote: > I made a comment on the patch. > > For...
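The daemon.json change reported at 09:45 can be sketched as below. This is a hedged illustration rather than the change made on ml-lab1002: `iptables` is a real Docker daemon option, but the helper name `with_iptables_disabled` and the sample `data-root` value are hypothetical, and the sketch only transforms JSON text without touching any file or restarting the daemon.

```python
import json


def with_iptables_disabled(daemon_json_text: str) -> str:
    """Return daemon.json content with "iptables" set to false, keeping other keys."""
    # An empty or missing daemon.json is treated as an empty settings object.
    settings = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    settings["iptables"] = False
    return json.dumps(settings, indent=2, sort_keys=True) + "\n"


# Hypothetical existing settings; the real file on the host may differ.
updated = with_iptables_disabled('{"data-root": "/srv/docker"}')
print(updated, end="")
```

Merging into the existing settings instead of overwriting the file keeps any other daemon options intact; the Docker daemon has to be restarted before the new setting takes effect.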
[13:52:23] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10830019 (Kgraessle)
[14:06:39] ok, https://apt.wikimedia.org/wikimedia/pool/thirdparty/amd-rocm63/ now exists (and includes the usual fake libpython)
[14:07:34] elukey: one thing of note: rocm-gdb now has a minimum version requirement for libpython3.10, so I had to update the control file field Provides with a version spec (= 3.10.0).
[14:09:27] (Abandoned) Bartosz Wójtowicz: Something being done by Bartosz. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145108 (owner: Bartosz Wójtowicz)
[14:09:54] okok
[14:10:14] please update the docs if needed, so we know
[14:10:29] bartosz: you can now +2 (you should have +2 rights)
[14:13:42] (CR) Bartosz Wójtowicz: [C:+2] "Let's merge the pre-commit upgrades!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[14:17:21] (Merged) jenkins-bot: ci: Upgrade pre-commit setup to use only ruff. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[14:17:55] isaranto: I see, thanks! The pre-commit patch got merged :-)
[14:18:19] congs on your first merge \o/
[14:19:08] wehooo
[14:22:15] kevinbazira: thanks! :D
[14:22:52] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10830125 (isarantopoulos) As you mention ores is already in the `createExtensionTables.php` [[ https://gi...
[14:23:24] georgekyz: could you take a look next week to figure out if we can extract all the thresholds for revertrisk in one go?
[14:23:54] Congrats Bartosz 🎉
[14:23:56] I will help coordinate with the steps required for the deployment(s) and we can sync over this
[14:27:51] ozge_: thanks! 😊
[14:28:41] (PS1) Bartosz Wójtowicz: ci: Enable import sorting and update ruff formatting rules. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865)
[14:32:43] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10830185 (BAPerdana-WMF) Seems correct as my understanding.
[14:32:54] wow, so many changes! --^ bartosz while I appreciate the effort this becomes quite risky (implementing too many changes at once) even if these seem to be just formatting. Is there a way we can limit the number of changes? If not then we should just do a thoroughhhhh review
[14:32:57] thanks!!
[14:33:35] or split it, that would make rollback of self-contained changes easier, if one of them is breaking
[14:35:33] I see that it blew up, I'll try splitting into multiple smaller patches
[14:36:05] yeah that would be even better. one patch for isort, one for line length, one for pyupgrade. It will at least help with the reviews
[14:36:32] that said, thank you once again for jumping in on this and giving our ci/pre-commit some love <3
[14:43:20] I agree, splitting patches for isort, line length and pyupgrade sounds good, will do it this way. And I'm super happy to refresh our pre-commit a little! :-)
[14:43:38] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10830212 (Ladsgroup) >>! In T391103#10830006, @Kgraessle wrote: >>>! In T391103#10806815, @Ladsgroup wrot...
[14:55:01] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10830247 (Kgraessle) >>! In T391103#10830212, @Ladsgroup wrote: >>>! In T391103#10830006, @Kgraessle wrot...
[15:07:01] isaranto: I am on it already! We cannot load all wikis at once in memory, so I am running it in a for loop for each wiki. The script can run once and obtain all thresholds, but it takes some time; I will measure it.
[15:20:18] Cool cool, thanks!
[15:20:34] Going afk folks, have a nice weekend all!
[15:25:15] enjoy
[15:40:44] nice weekend o/
[15:42:22] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10830469 (OKarakaya-WMF) tried mwaddlink pipeline https://airflow-research.wikimedia.org/dags/mwaddlink/grid?tab=logs&task_id=model_model_0&dag_ru...
[15:44:35] Nice weekend!
[15:46:21] have a great weekend!
[18:12:02] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10830957 (ppelberg)
[18:43:06] (CR) Scardenasmolinar: [C:+1] "This looks good to me!" [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[19:31:36] (CR) Scardenasmolinar: [C:+2] PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[20:13:26] (Merged) jenkins-bot: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[22:33:34] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10831663 (fkaelin) > - [prod](https://github.com/wikimedia/research-mwaddlink/blob/main/src/scripts/generate_addlink_model.py) > - model on pro...
[22:49:29] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10831692 (fkaelin) >>! In T393474#10830469, @OKarakaya-WMF wrote: > tried mwaddlink pipeline > > https://airflow-research.wikimedia.org/dags/mwadd...
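The per-wiki loop described at 15:07 can be sketched as follows. This is a rough illustration of the pattern only, not the actual analysis script: `thresholds_per_wiki`, `load_scores` and `pick_threshold` are hypothetical names, the toy scores are invented, and `max` merely stands in for whatever threshold rule the real analysis applies.

```python
from typing import Callable, Dict, Iterable, List


def thresholds_per_wiki(
    wikis: Iterable[str],
    load_scores: Callable[[str], List[float]],
    pick_threshold: Callable[[List[float]], float],
) -> Dict[str, float]:
    """Compute one threshold per wiki, holding a single wiki's scores in memory at a time."""
    results: Dict[str, float] = {}
    for wiki in wikis:
        scores = load_scores(wiki)  # only this wiki's scores are loaded here
        results[wiki] = pick_threshold(scores)
        del scores  # drop the reference before loading the next wiki
    return results


# Toy stand-ins (hypothetical data and rule), used only to show the call shape.
TOY_SCORES = {"enwiki": [0.1, 0.9, 0.5], "idwiki": [0.2, 0.4]}
thresholds = thresholds_per_wiki(TOY_SCORES, TOY_SCORES.__getitem__, max)
print(thresholds)
```

Passing the loader and the threshold rule as callables keeps the loop independent of where the scores come from, which also makes it straightforward to time each wiki separately.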