[06:36:52] Good morning
[06:54:50] mooorning!
[07:00:09] good morning!
[07:13:04] fyi: I've scheduled a backport deployment for now as part of the work for the revertrisk rollout in the ores extension https://wikitech.wikimedia.org/wiki/Deployments#Monday,_May_19
[07:34:04] morning Folks
[07:45:52] morning morning o/
[07:45:55] klausman: o/ using the ROCm packages mirror you shared here: https://apt.wikimedia.org/wikimedia/pool/thirdparty/amd-rocm63/
[07:45:55] I get the error shown here: https://phabricator.wikimedia.org/P76288
[07:45:55] what could I be missing?
[07:48:59] Don't use jammy, but bookworm as the distribution
[07:49:33] The WMF repos follow the Debian scheme, not the Ubuntu one (like upstream ROCm does)
[07:52:05] There may also be an easier way (similar to using `{{ "wget gnupg ca-certificates apt-transport-https" | apt_install }}`) to add additional repos, I'll do some digging
[07:54:00] checking the existing prodimages, it looks like the echo "deb ..." > ... is still the right way
[07:58:30] getting the same error: https://phabricator.wikimedia.org/P76288#306767
[08:00:23] what is the exact repo line you add to /etc/apt?
[08:01:34] It should be "deb http://apt.wikimedia.org/wikimedia bookworm-wikimedia thirdparty/amd-rocm61"
[08:05:48] rocm63 of course
[08:05:56] (I blame a lack of caffeine)
[08:17:21] yep, I've used rocm63 and got: https://phabricator.wikimedia.org/P76288#306768
[08:19:11] ah, I hadn't imported the rocm package since we always installed the subpackages directly (rocm is just a metapackage). Lemme fix that
[08:20:58] okok
[08:21:42] and now I remember why: the rocm package pulls in a lot of stuff we don't really need (like the openmp-sdk)
[08:22:56] So you think we could change the Dockerfile to explicitly install the packages we used to install host-side? It's a long-ish list, but I think it's better than pulling in all the extra deps. wdyt?
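For reference, the repo line discussed above (Debian-style `bookworm-wikimedia` suite instead of Ubuntu's `jammy`, component `thirdparty/amd-rocm63`) can be sketched as a small helper. This is only an illustration: the function name is made up here, and where the line ends up (some file under /etc/apt) is left to the Dockerfile template.

```python
# Minimal sketch: build the one-line APT source entry quoted in the chat.
# The helper name is hypothetical; only the URL, suite and component
# come from the conversation.

def wikimedia_apt_source(component: str,
                         suite: str = "bookworm-wikimedia") -> str:
    """Return a 'deb ...' line for a component of the WMF APT repo."""
    return f"deb http://apt.wikimedia.org/wikimedia {suite} {component}"


if __name__ == "__main__":
    # The component that was eventually used (rocm63, not rocm61).
    print(wikimedia_apt_source("thirdparty/amd-rocm63"))
```

The point of keeping the suite as a parameter with a Debian default is the one made in the chat: the WMF repos follow the Debian naming scheme, so an Ubuntu codename such as jammy will not resolve.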
[08:23:00] cc elukey ^^
[08:24:25] actually, can you give this a shot: replace `rocm` with `rocm-dev` in the package list
[08:26:48] o/
[08:26:59] \o
[08:27:10] totally missed that we install 'rocm' directly, it may explain why the size of the image is so big
[08:27:24] yeah, agreed
[08:28:26] no okok wait a sec, rocm is installed in the build image
[08:28:50] later on I don't think anything is copied over to the final image
[08:29:04] so yeah not ideal, but ok-ish for this use case
[08:29:20] it seems needed to build vllm
[08:29:26] yes it is
[08:29:30] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/amd_rocm/manifests/init.pp#77 fwiw, this is the list of packages we explicitly install on the hosts. Note that e.g. radeontop is of course not needed in this image
[08:29:35] but I guess that the end wheel has its own .so files etc.
[08:29:54] So we need to import the whole enchilada into our thirdparty repo?
[08:30:34] the list that we have now is fine, does it include rocm-dev?
[08:30:49] if so we could try it, and/or a smaller list instead of "rocm"
[08:31:10] the main question mark is what vllm needs, and with a tighter list it is a matter of testing it
[08:31:45] Yes, rocm-dev is in our copy of packages
[08:32:41] it should be sufficient kevinbazira, in theory
[08:33:00] let me test a vllm build
[08:33:46] We may need to add some of the stuff from the init.pp file I mentioned above. Probably anything except radeontop
[08:44:07] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10833711 (isarantopoulos)
[08:49:24] thnx for updating this~~~~~^
[08:51:07] georgekyz: I just added some commands there for reference
[09:28:44] georgekyz: regarding the translations that are missing https://phabricator.wikimedia.org/T382171#10828105
[09:29:40] I do have access to translate messages on translate wiki. Can you make sure you create an account + validate your email? and then we can check the permissions together in the following days
[09:29:57] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10833890 (OKarakaya-WMF) Hello @Michael , @kostajh , @Tgr and @kevinbazira I'd like to confirm my understanding about the current implementatio...
[09:33:33] (PS14) Bartosz Wójtowicz: ci: Enable import sorting. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865)
[09:35:20] isaranto: Done ~~^
[09:36:49] nice!
[09:38:51] Machine-Learning-Team, Patch-For-Review: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10833966 (BWojtowicz-WMF) [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1147008 | Second patch ]] enabling import sortin...
[09:41:54] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10833976 (isarantopoulos) I have translated the above i18n messages on translatewiki using either Google su...
[09:46:37] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10833998 (isarantopoulos)
[09:46:40] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10833997 (isarantopoulos)
[09:46:42] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), MW-1.45-notes (1.45.0-wmf.2; 2025-05-20): PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails - https://phabricator.wikimedia.org/T375280#10834000 (isarantopoulo...
[09:46:49] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10833999 (isarantopoulos)
[09:51:10] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10834013 (isarantopoulos)
[09:51:12] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10834012 (isarantopoulos)
[09:53:05] Machine-Learning-Team, I18n, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455#10834032 (isarantopoulos) I have translated the above i18n messages on translatewiki using either Google suggested messages or MinT. I also cr...
[09:54:50] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10834039 (isarantopoulos)
[09:56:57] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10834042 (isarantopoulos)
[09:56:59] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10834043 (isarantopoulos)
[10:16:49] klausman: vllm won't build without rocm: https://phabricator.wikimedia.org/P76290
[10:16:50] adding rocm-dev fails with: https://phabricator.wikimedia.org/P76290#306790
[10:17:12] hey folks, lemme know what you think about https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1147713 when you have a moment
[10:18:02] kevinbazira: re: rocm-dev, one package is missing from our apt IIUC
[10:18:28] there is the rocprofiler-sdk etc..
[10:18:40] now, we may not need all that stuff though, as Tobias was saying before
[10:19:15] we could choose to list all the packages that we want to install, that is not as readable as "rocm" or "rocm-dev", but probably more efficient
[10:19:27] (like less garbage on our servers etc.. and on the build image)
[10:21:18] kevinbazira: I added the list that Tobias mentioned to the phaste
[10:22:00] basically you should use {{ 'hsa-rocr-dev etc...' | apt_install }} instead of the 'rocm' single package
[10:22:10] thanks! let me add those to the list
[10:22:33] it may not work at first, you'll likely see some errors when building vllm like "oh noesss I am missing this lib!"
[10:22:46] then we'll find the package, add it to the list, rinse build etc..
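The add-a-package-and-rebuild loop described above can be sketched roughly as follows. Everything here is a toy under stated assumptions: `try_build` stands in for "run the vllm build and map each missing library to the package that ships it", and `libfoo-dev` / `libbar-dev` are invented names, not real ROCm dependencies.

```python
# Minimal sketch of iterating a package list until the build stops
# complaining. try_build is a hypothetical callback that returns the set
# of packages the build reported as missing for the given list.
from typing import Callable, Iterable


def find_package_fixed_point(
    initial: Iterable[str],
    try_build: Callable[[list[str]], set[str]],
    max_rounds: int = 20,
) -> list[str]:
    packages = list(dict.fromkeys(initial))  # de-duplicate, keep order
    for _ in range(max_rounds):
        missing = try_build(packages) - set(packages)
        if not missing:
            return packages  # nothing new was requested: done
        packages.extend(sorted(missing))
    raise RuntimeError(f"no stable package list after {max_rounds} rounds")


# Toy stand-in for a real build; the dependency names are made up.
TOY_DEPS = {"hsa-rocr-dev": {"libfoo-dev"}, "libfoo-dev": {"libbar-dev"}}


def fake_build(packages: list[str]) -> set[str]:
    needed: set[str] = set()
    for pkg in packages:
        needed |= TOY_DEPS.get(pkg, set())
    return needed - set(packages)


if __name__ == "__main__":
    print(find_package_fixed_point(["hsa-rocr-dev"], fake_build))
```

In practice each round is a full image build, so the loop is run by hand; the sketch only shows why it terminates once a build asks for nothing that is not already on the list.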
[10:23:04] until we find the fixed point :D
[10:25:48] okok ... on it
[10:40:26] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10834187 (gkyziridis) == RevertRisk Thresholds Analysis for al...
[10:42:43] klausman: I've listed all deps but rocm-dev is still not happy: https://phabricator.wikimedia.org/P76290#306798
[10:43:21] kevinbazira: ah snap sorry rocm-dev is in there, I didn't see it
[10:43:37] yep
[10:43:57] That's odd, when I tested it, that dep on rocprofiler-sdk wasn't there. Either way, a fix is easy, but I'll make extra sure no other deps have crept in
[10:44:45] the full list is good anyway, rocm-dev may not be enough alone
[10:44:51] (brb)
[10:44:53] yeah, I think the -sdk package is new in 6.2 or 6.3
[10:45:01] probably yes
[10:45:27] I'll make a chroot to decidedly test this. It'll take longer than just adding the -sdk, but that way we're sure the depgraph is complete
[10:51:51] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10834236 (isarantopoulos)
[10:52:24] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10834238 (isarantopoulos)
[11:04:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147739 is ready and will add the missing pkgs
[11:05:01] hey team, patch enabling import sorting for inference-services repo is ready to review https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1147008. I wrote a small description to help with the review: https://phabricator.wikimedia.org/T393865#10833965
[11:10:56] elukey: I have a question about pcc and those apt repos. When I run pcc on that patch against aptXXXX, I always get noops, yet somehow, the patch works. Is there some other host that is involved? Or why can't pcc see the change, as it were?
[11:32:52] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10834359 (Michael) >>! In T393474#10833889, @OKarakaya-WMF wrote: > Hello @Michael , @kostajh , @Tgr and @kevinbazira > > I'd like to confirm my u...
[11:39:42] (CR) Gkyziridis: [C:+1] "LGTM in general, I did not parse anything wrong but my concern is that" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[11:50:27] (CR) Bartosz Wójtowicz: "Thanks!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[12:19:30] kevinbazira: merged the change and updated the apt servers, the install of rocm-dev should now work
[12:19:54] thanks! let me test it ...
[12:30:55] (CR) Gkyziridis: [C:+1] "Thank you for taking care on all the above and thnx for reporting in the phabticket. Lets wait for some extra pair of eyes to review and t" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1147008 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[12:58:18] rocm-dev installation is now fixed. thanks Tobias.
[12:58:20] now testing full image with user "somebody" ...
[13:08:06] klausman: so I usually check the change catalog json if nothing shows up (and I expect a change) to see if the file is like I expect. Sometimes, I don't recall when, a noop happens when it shouldn't
[13:08:24] ack, ty
[13:11:32] klausman: when you have a moment: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1147713
[13:11:45] right, saw that a minute ago
[13:12:01] +1'd
[13:12:02] yes yes anytime :)
[13:12:26] at the moment we have https://slo.wikimedia.org/?search=revertrisk
[13:12:39] that is a test, but it seems to be working fine without weird errors
[13:13:24] so the slo.w.org view is a rolling SLO window, meanwhile we also have https://grafana.wikimedia.org/d/ccssRIenz/pyrra-detail?orgId=1&from=2025-03-01T00:00:00.000Z&to=2025-05-31T23:59:59.000Z&timezone=utc&var-prometheus=000000019&var-slo=revertrisk-la-requests&var-site=$__all&refresh=30s
[13:13:42] that is not great atm, but we'll improve it (pyrra creates it by default)
[13:13:54] so ideally, there is a path forward to have slos for all the isvcs
[13:18:26] yeah, completely agreed
[13:19:02] What do you feel is not great about the Pyrra dashboard?
[13:22:15] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10834668 (OKarakaya-WMF) I've reproduced the implementation in research datasets for one wiki (simplewiki) on a notebook 🎉 ` snapshot = "2025-04"...
[13:28:14] the grafana dashboards by default are broken (see the two hidden panels), and there is no mention of the SLI related to the SLO etc..
[13:28:19] it is not that great
[13:28:32] and the Pyrra UI offers only rolling windows
[13:28:50] mmmh, I see your point.
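To illustrate the rolling-versus-fixed window distinction raised above, here is a rough sketch. The 28-day length and the daily request counts are assumptions made up for the example; only the 2025-03-01 to 2025-05-31 range comes from the Grafana link in the chat, and none of this reflects how Pyrra computes its numbers internally.

```python
# Minimal sketch: the same daily counts give different SLI values
# depending on whether the window is rolling or a fixed calendar range.
from datetime import date, timedelta
from typing import Optional

# day -> (good_requests, total_requests); purely illustrative data
Counts = dict[date, tuple[int, int]]


def sli(counts: Counts, start: date, end: date) -> Optional[float]:
    """Fraction of good requests between start and end (inclusive)."""
    good = total = 0
    for day, (g, t) in counts.items():
        if start <= day <= end:
            good += g
            total += t
    return good / total if total else None


def rolling_sli(counts: Counts, today: date, days: int = 28) -> Optional[float]:
    """SLI over the last `days` days ending today (window length assumed)."""
    return sli(counts, today - timedelta(days=days - 1), today)


if __name__ == "__main__":
    counts = {
        date(2025, 3, 1): (90, 100),    # bad day early in the quarter
        date(2025, 5, 18): (100, 100),
        date(2025, 5, 19): (99, 100),
    }
    print(rolling_sli(counts, date(2025, 5, 19)))            # recent days only
    print(sli(counts, date(2025, 3, 1), date(2025, 5, 31)))  # whole range
```

With this toy data the early bad day has already dropped out of the rolling window but still counts against the fixed range, which is why the two views can disagree.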
[13:29:28] we are still working on it :D
[13:55:09] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10834861 (isarantopoulos)
[14:10:18] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: [Epic] Recent Changes ORES Enabled Revert Risk Powered Filters Rollout Plan - https://phabricator.wikimedia.org/T391964#10834936 (isarantopoulos)
[14:19:47] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10834959 (isarantopoulos) I've run the maintenance script to create the tables ` mwscript-k8s --comment="T3...
[14:23:42] bartosz: sorry I haven't had time to review the patch. Will do tomorrow morning!
[14:24:21] re:bitsandbytes there was a multi platform build that made bitsandbytes also available for rocm. I don't know if this has been included in the release but I'll follow up!
[14:28:41] isaranto: no worries, let's follow up on the patch tomorrow! I'll also do some reading on where bitsandbytes stands with rocm support
[14:29:42] * isaranto afk!
[14:38:30] elukey: using amd-pytorch-common base image throws an error: https://phabricator.wikimedia.org/P76308
[14:38:31] what could I be missing?
[14:42:00] kevinbazira: I'd try to build it before starting, it may be needed locally
[14:46:18] trying to build it locally (on ml-lab1002) yields nothing
[14:48:07] checking in a sec
[14:48:11] okok
[14:53:51] kevinbazira: can you try using docker-registry.discovery.wmnet in the config.yaml?
[14:59:42] I've changed the registry uri and got: https://phabricator.wikimedia.org/P76308#306869
[15:02:14] kevinbazira: try to add `ca_bundle: /etc/ssl/certs/ca-certificates.crt` to the config.yaml
[15:05:28] now getting: docker-registry.discovery.wmnet/amd-pytorch-common
[15:05:45] *now getting: https://phabricator.wikimedia.org/P76308#306870
[15:08:13] ok like before
[15:08:50] what command do you use?
[15:11:09] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10835179 (Kgraessle) >>! In T382171#10834959, @isarantopoulos wrote: > I've run the maintenance script to c...
[15:13:54] trying to build the base image locally using `docker-pkg -c config.yaml build images/amd/pytorch-common`, yields nothing.
[15:13:54] building the vllm image that has `FROM {{ "amd-pytorch-common" | image_tag }}` using `docker-pkg -c config.yaml build images/amd/vllm085` throws the error shared
[15:17:32] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10835226 (ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-views run by taavi: Started updati...
[15:18:36] so in theory it doesn't find anything new for pytorch-common, comparing the docker-registry, and it doesn't build
[15:18:44] but then vllm doesn't find the base image
[15:19:08] yep
[15:20:59] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10835251 (ops-monitoring-bot) Cookbook cookbooks.sre.wikireplicas.update-views started by taavi completed:...
[15:22:00] I am trying a couple of settings
[15:22:10] okok
[15:23:53] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10835277 (taavi) >>! In T382171#10834959, @isarantopoulos wrote: > checking on quarry.wmcloud.org I see tha...
[15:24:10] kevinbazira: the file images/amd/vllm085/Dockerfile.template seems a one line only though
[15:24:50] yes, that's because I am testing the base image
[15:25:31] can you restore the old content?
[15:26:39] done
[15:27:24] I am trying to build the image but I get
[15:27:25] WARNING - Ignoring /home/kevinbazira/WMF_vLLM_image/production-images/images/amd/vllm085 since it lacks a Dockerfile.template (builder.py:268)
[15:28:57] it has it, maybe you ran when I was restoring the content:
[15:28:57] ```
[15:28:57] $ ls /home/kevinbazira/WMF_vLLM_image/production-images/images/amd/vllm085
[15:28:57] changelog control Dockerfile.template
[15:28:57] ```
[15:29:28] does it try to build if you try now?
[15:31:01] nope, how about we just use the bookworm base image that was working before?
[15:31:34] this is not the problem, something is going on
[15:31:46] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10835345 (Gehel) Removing Search Platform as it seems that we're not needed.
[15:31:59] to be honest, I have stated here multiple times that ml-lab100X shouldn't be used to build docker images without some work first to make it work and be compatible with our infra
[15:40:09] ohh all right I found it, I needed to run
[15:40:10] docker-pkg -c config.yaml build images/amd/
[15:40:18] kevinbazira: --^
[15:40:33] let's see if it works now
[15:41:01] I added an entry to config.yaml for the amd pytorch image
[15:41:16] (well I can run it, not sure if it is late for you)
[15:41:48] (building)
[15:43:16] it is getting late but I am happy to complete this
[15:44:01] I think it works now
[15:44:08] it got to install package steps
[15:44:24] one note - it is not necessary to have the amd-pytorch-image at the top, only for the final image is fine
[15:44:31] okok... should I test on my end? (don't want to disrupt your process)
[15:44:36] you can use bookworm or python3-bookworm at the top
[15:44:49] sure
[15:44:53] okok
[15:45:14] but we can restart tomorrow, so you enjoy your evening
[15:45:28] it will be there tomorrow morning as well :)
[15:49:37] sure sure, have a good evening o/
[15:49:48] there is a build running ...
[15:50:00] thanks a lot for your help today
[15:54:32] <3
[16:52:59] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10835915 (SSalgaonkar-WMF) >>! In T393474#10834359, @Michael wrote: > Could we improve the existing pipeline first, before we add new features? Whi...
[18:24:20] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10836537 (OKarakaya-WMF) I agree @SSalgaonkar-WMF 💯
[19:26:14] FIRING: SLOMetricAbsent: linkrecommendation-requests - https://slo.wikimedia.org/?search=linkrecommendation-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:31:14] RESOLVED: SLOMetricAbsent: linkrecommendation-requests - https://slo.wikimedia.org/?search=linkrecommendation-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:46:12] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10837405 (Ladsgroup) >>! In T382171#10835277, @taavi wrote: >>>! In T382171#10834959, @isarantopoulos wrote...