[06:08:12] Morning!
[06:36:25] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117#9668032 (isarantopoulos)
[06:38:22] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117#9668034 (isarantopoulos) The following deployments have been removed from ml-staging: - revscoring-articlequality: ruwiki - revscoring-articletopic:...
[06:38:50] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117#9668036 (isarantopoulos) Open→Resolved
[08:00:47] * isaranto afk for 1h
[09:25:08] * isaranto back!
[09:41:50] morning :)
[10:10:27] hey Aiko!
[10:19:51] Machine-Learning-Team, Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9668461 (CodeReviewBot) mfossati merged https://gitlab.wikimedia.org/mfossati/scriptz/-/merge_requests/6 Improve functionality
[11:31:01] Machine-Learning-Team, Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9668801 (CodeReviewBot) kevinbazira opened https://gitlab.wikimedia.org/mfossati/scriptz/-/merge_requests/7 lw_prototype: validate input data
[11:40:52] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9668842 (isarantopoulos) I'm facing issues trying to update huggingfaceserver dependencies to use torch 2.2.1. I've reached a point where I'm blocked because [[ https://git...
[11:57:37] I made an attempt to add a new image version for pytorch base image https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1015297
[11:59:43] I wasn't really sure whether I should create a new image or just a new version/tag for this one. The latter makes more sense in my head, but the naming convention (amd-torch22) kind of confuses things as I want to install version 2.1
[12:00:15] * isaranto lunch
[12:07:37] isaranto: o/ you can clone the pytorch dir into pytorch21 and start from scratch
[12:07:41] more clean
[12:07:51] so we'll have both 2.x minor versions etc..
[12:08:25] I am going to tweak the current base image a little to have the torch package under /opt/lib/python, so hopefully blubber's pip will not reinstall torch if already present
[12:08:30] going to work on it today
[12:08:34] lemme know if you are blocked
[12:09:11] also hello folks :)
[12:09:14] \o
[12:09:29] I've merged the Istio change (and added a small fix you're CC'd on)
[12:09:59] Currently trying to figure out why the kubectl diff does not look the way I expected it to
[12:10:41] hi both!
[12:11:22] elukey: ok, I'll do that then! I'm having some issues with docker-pkg build but I'll try again and ping you folks for questions after lunch
[12:11:41] * isaranto now actually going for lunch
[12:18:11] ack!
[13:03:53] (CR) Kosta Harlan: [C:-1] "overall LGTM, some minor comments inline."
[extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[13:04:52] (CR) Kosta Harlan: [C:-1] Exclude first revision on page from scoring (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[13:19:23] pip is really weird
[13:19:51] IIUC even if you have a package already installed and you use --index-url, pip may still want to re-download everything
[13:21:08] iirc it will download it and then tell you that it already exists. If you have it in the cache all is well, no download happens
[13:21:19] so I guess in our case the cache doesn't exist so...
[13:21:44] didn't think about that. I'll have to remove it then in my work
[13:22:15] btw I'm having an issue with `docker-pkg` https://phabricator.wikimedia.org/P58979 It can't find python3-bookworm image although I have it
[13:22:21] anyone encountered this?
[13:22:34] what's your docker-pkg commandline? it is rather fickle
[13:22:54] `docker-pkg -c config.yaml build images/amd/pytorch21`
[13:23:22] try with `docker-pkg build images/ --select *pytorch*`
[13:23:51] ah
[13:24:06] with "" around pytorch since you are on macos and probably using zsh, we had some issues with Aiko's terminal recently
[13:24:16] Yes, if you give it the full path, it stops seeing images outside of it, even if you have them locally
[13:27:08] isaranto: re pip - okok so it downloads it anyway, but in theory if the same requirement is already deployed locally it shouldn't do anything
[13:27:14] I missed this part, going to keep testing
[13:28:03] it doesn't make a lot of sense to me but if we have to pay this price in CI it shouldn't be a big deal
[13:28:19] but I think that in the base image we'll have to use pip --target /opt/lib/python/site-packages
[13:30:16] ack
[13:31:03] thanks for the docker-pkg help. this seems to be working!
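(Editor's note) On the pip question above: before relying on pip to skip an already-present package, it can help to check what a given `--target` directory really contains. The following is a minimal sketch, not the team's tooling. The helper name `installed_in` and the fake `torch-2.2.1.dist-info` entry are made up for the demo; in the scenario from the chat the directory would presumably be /opt/lib/python/site-packages.

```python
import tempfile
from importlib import metadata
from pathlib import Path


def installed_in(target_dir):
    """Return {name: version} for every distribution found in target_dir.

    Hypothetical helper: it only reads *.dist-info / *.egg-info metadata,
    it does not import the packages themselves.
    """
    found = {}
    for dist in metadata.distributions(path=[str(target_dir)]):
        name = dist.metadata["Name"]
        if name:
            found[name.lower()] = dist.version
    return found


# Demo: a fake dist-info entry stands in for a real torch install.
demo_target = Path(tempfile.mkdtemp())
dist_info = demo_target / "torch-2.2.1.dist-info"
dist_info.mkdir()
(dist_info / "METADATA").write_text(
    "Metadata-Version: 2.1\nName: torch\nVersion: 2.2.1\n"
)

print(installed_in(demo_target))
```

This only reports what is on disk; whether pip then decides to download again is a separate matter, as the rest of the discussion shows.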
[13:31:15] super :)
[13:33:31] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854#9669191 (JArguello-WMF) @prabhat can we remove ourselves from this ticket?
[13:41:23] need to run an errand, be back in ~30 mins
[13:45:12] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854#9669241 (prabhat) @JArguello-WMF Yes, we can.
[13:47:08] Machine-Learning-Team, ORES: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854#9669254 (JArguello-WMF)
[14:21:10] back!
[14:24:03] Machine-Learning-Team: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234 (achou) NEW
[14:26:13] so of course if I add --target /opt/lib/python/site-packages to the pip args in the base image I end up with a torch dir owned by root
[14:30:18] Does the user we want to own them even exist at that point?
[14:32:44] nope, it is created by blubber, I think they call it "somebody"
[14:33:08] could we then tell blubber to chown -R the tree once the user exists?
[14:33:32] not sure, but it would leave the base image half broken
[14:34:09] Hm. maybe something like chmod a+r on the tree (plus maybe a+x for dirs) in the base image generation?
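(Editor's note) The a+r (plus a+x for directories) idea above could look roughly like this in Python. It is a sketch only: the function name `make_world_readable` is made up, and the demo runs against a throwaway temp directory rather than /opt/lib/python. In an actual image build the same effect would more likely come from a single chmod invocation.

```python
import os
import stat
import tempfile


def make_world_readable(root):
    """Add read bits for everyone on files, and read+execute on directories.

    Ownership is left untouched, so a root-owned tree stays root-owned
    but becomes usable by any other user.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        dir_mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        os.chmod(dirpath, dir_mode | 0o555)  # a+rx on directories
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks, chmod would follow them
            file_mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, file_mode | 0o444)  # a+r on files


# Demo on a temp tree with owner-only permissions.
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "torch")
os.mkdir(demo_dir, 0o700)
demo_file = os.path.join(demo_dir, "__init__.py")
with open(demo_file, "w") as handle:
    handle.write("")
os.chmod(demo_file, 0o600)

make_world_readable(demo_root)
print(oct(stat.S_IMODE(os.stat(demo_file).st_mode)))
```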
[14:34:22] it'd still be owned by root, but at least world-usable
[14:34:59] (I may be barking up the wrong tree there, I presume root owner and umask make the tree unusable)
[14:35:36] I need to figure out why it fails for file perms though, in theory pip shouldn't touch old stuff
[14:35:44] unless it wants to update some file
[14:36:13] maybe it has a similar problem as mkdir -p --mode, which will change the mode of already existing directories
[14:37:49] (PS5) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[14:38:18] Machine-Learning-Team: Update and fix locust load testing for revscoring models - https://phabricator.wikimedia.org/T361238 (achou) NEW
[14:38:33] ah no maybe the /opt/lib/python dir is not writable, so when it downloads the wheel from the blubber's pip it fails
[14:38:36] lemme check
[14:39:58] yes I think that is the issue
[14:40:02] (CR) CI reject: [V:-1] Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:40:07] (PS6) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[14:41:01] (CR) Jsn.sherman: "thanks for the review!"
[extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:42:21] (CR) CI reject: [V:-1] Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:44:54] (PS1) AikoChou: locust: fix missing host header for revertrisk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015341 (https://phabricator.wikimedia.org/T361234)
[14:47:15] (PS7) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[14:49:33] (CR) CI reject: [V:-1] Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:52:35] (PS2) AikoChou: locust: fix missing host header for revertrisk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015341 (https://phabricator.wikimedia.org/T361234)
[14:53:48] (PS8) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[15:04:09] (CR) Jsn.sherman: Exclude first/only revision on page from scoring (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[15:14:10] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9669745 (Jhancock.wm) Open→Resolved a:Jhancock.wm new drive has been inserted and the alert has cleared. retu...
[15:26:36] Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9669840 (elukey) @herron something really strange: https://w.wiki/9bMW I compared the recording rule with the actual metric, trying to aggregate with the same...
[15:47:37] elukey: when you have a moment, I have a question about the cassandra role we use, specifically the listen_addresses
[15:57:33] The problem is that for other users of Cassandra at WMF, in order to support running more than one node per machine (cassandra-a, cassandra-b etc), the role assumes additional IPs to be bound on the machine and makes the listen addresses use those IPs.
[15:58:21] Unfortunately, the current version of the networkpolicy-machinery for Cassandra that I added uses just the ml-cache-storage role, which refers to the host IP, not the cassandra-specific one.
[15:59:08] The question now is if we can make a query that refers to those (from e.g. hieradata/role/codfw/ml_cache/storage.yaml) or if we just change our listen_addresses in that file to the host's main IP.
[15:59:50] I _think_ the latter is easier, and it's unlikely that we ever have multiple independent Cass nodes on our machines.
We could then also release the extra addresses
[15:59:52] klausman: I'd suggest reaching out to data-persistence, they may know how to work around this
[16:02:43] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9670034 (jsn.sherman)
[16:04:20] (CR) Ilias Sarantopoulos: [C:+1] python: upgrade aiohttp's version to avoid issues with py3.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015070 (owner: Elukey)
[16:05:34] (CR) Ilias Sarantopoulos: [C:+1] locust: fix missing host header for revertrisk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015341 (https://phabricator.wikimedia.org/T361234) (owner: AikoChou)
[16:12:54] (CR) Elukey: [C:+2] python: upgrade aiohttp's version to avoid issues with py3.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015070 (owner: Elukey)
[16:13:17] oh my python dependency hell
[16:14:05] tell me about it..
[16:17:34] I just found out there is no pytorch-triton-rocm version 2.1.0 for python3.11
[16:17:47] sigh
[16:17:56] we can use bullseye if you want
[16:18:30] I am currently trying to re-create what Blubber does, namely the "somebody" user
[16:18:43] Since I'm using a kserve fork I would prefer to use bookworm since we'll have to go down that road again
[16:21:48] or I can create a fork of the vllm package and try to use torch 2.2.1 so we don't create a new base image
[16:24:04] anything already opened to upstream?
I mean, maybe they are about to release
[16:24:08] py3.11 is not that new
[16:24:14] surely people already asked
[16:25:37] for pytorch-triton there is 2.2.0 for py3.11 but not 2.1.0, which is an indirect dependency of pytorch-rocm (if I'm not mistaken)
[16:26:10] (PS3) Jsn.sherman: update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298)
[16:26:11] I'll ping vllm upstream for support of newer torch versions and try out with a fork if I can update it myself
[16:26:22] (CR) Jsn.sherman: update revertrisk-language-agnostic min & desc (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
[16:37:24] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9670182 (isarantopoulos) There is an open [[ https://github.com/vllm-project/vllm/pull/3442 | Pull Request ]] in vllm repo to upgrade pytorch support to 2.2.1, just leaving...
[16:59:46] Machine-Learning-Team, Patch-For-Review: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234#9670319 (achou) @isarantopoulos do you remember the config values in locust.conf when you ran the revertrisk tests? I can't reproduce the result in [[ https://gerrit.wikimed...
[17:01:10] isaranto: o/ -----^
[17:03:13] 👀
[17:09:23] or that's the result testing via api-gw?
[17:16:16] ok something that I didn't think of before
[17:17:20] pip with --index-url seems to download and try to install torch anyway, and because of how layers in a Docker image work, /opt/lib/python/site-packages/torch is overwritten
[17:17:42] by another layer, where we copy /opt/etc.. from the build dir
[17:17:54] in turn this creates two layers of ~10G each
[17:18:51] Would --exists-action help?
[17:19:19] mno, that's per-file
[17:23:16] Machine-Learning-Team, Patch-For-Review: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234#9670436 (isarantopoulos) The locust.conf was the one that is committed in the repo. I can't recall if anything was different at the moment. However since the host header was...
[17:25:08] maybe the problem is pip, poetry would alleviate this issue
[17:25:47] I have torch 2.2.1+rocm on the base image, if I try to pip install torch 2.2.0 it starts to download everything
[17:25:58] I'd have expected an error that torch is already installed
[17:26:39] Is it maybe seeing the two package names as different because of index-url?
[17:27:08] Not that there'd be an easy way to avoid that :-/
[17:27:14] it says at the end
[17:27:14] WARNING: Target directory /opt/lib/python/site-packages/torch already exists. Specify --upgrade to force replacement.
[17:27:18] bah
[17:27:36] Very clever to only do that after d/l'ing everything
[17:30:44] the main issue is that we COPY /opt/lib/python/site-packages in the Docker image that is built upon the base image
[17:30:56] we COPY from the build one, so another layer is created
[17:31:11] that adds +10G on top of the other ones
[17:31:42] so ideally the solution is not to have torch listed in any requirements.txt
[17:31:47] but not sure about transitive deps
[17:32:15] Yeah, I wish pip had something like --assume-installed=pytorch
[17:33:45] I think that the problem is probably in the approach, because of the COPY - even if we had pip not doing anything weird, we'd create a new layer with all the packages on top of the base image (already having torch)
[17:34:05] and its total size doubles
[17:35:46] sneaky issue
[17:36:15] I'll add what I found in the task, I need to rethink this
[17:37:52] my head is going to explode as well, by trying too many things and messing things up :D
[17:38:39] same here, with a dose of Puppet QL
[17:38:44] aiko: I'll run the load
test in the morning and check
[17:39:03] I have the same code on statbox that I ran so probably I can figure out if something is different
[17:39:48] the other thing that can happen is that if preprocessing takes more time (because of mwapi) we can serve fewer requests (but this will be a transient issue if it happens)
[17:42:27] logging off folks! have a nice evening (and a long weekend for whoever it applies)
[17:43:05] Machine-Learning-Team, Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9670506 (elukey) There is an obstacle with the current approach that I didn't think about. In the current setup, this happens: * We pip install pytorch-rocm in the base image, so a layer is c...
[17:43:11] summarized in --^
[17:43:14] isaranto: o/
[17:43:50] \o
[17:44:14] I am not going to figure out this PQL query today, I guess it has to wait until I'm back
[17:44:21] \o heading out as well
[17:44:56] Happy Easter and/or a nice weekend everyone
[17:50:06] isaranto: ack!
[17:51:22] going afk, have a nice rest of the day folks
[17:53:10] bye Luca and Tobias o/
[20:48:33] I'm trying to import a function to test it in a test class I made, however when I run the tests I get a ModuleNotFound error. I have added an __init__.py file and ran export PYTHONPATH=$PYTHONPATH:. in the terminal and I'm still getting the ModuleNotFound error
[20:48:52] has anyone ever run into this problem before?
[20:49:57] from examples.revertrisk_examples import revert_risk_api_request, this is the line I use to import it and this is the file path: /liftwing-python/examples/revertrisk_examples.py
[20:52:14] Traceback (most recent call last):
[20:52:14]   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/unittest/mock.py", line 1372, in patched
[20:52:15]     with self.decoration_helper(patched,
[20:52:15]   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 137, in __enter__
[20:52:16]     return next(self.gen)
[20:52:16]            ^^^^^^^^^^^^^^
[20:52:17]   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/unittest/mock.py", line 1354, in decoration_helper
[20:52:17]     arg = exit_stack.enter_context(patching)
[20:52:18]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[20:52:18]   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 517, in enter_context
[20:52:19]     result = _enter(cm)
[20:52:19]              ^^^^^^^^^^
[20:52:20]   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/unittest/mock.py", line 1427, in __enter__
[20:52:20]     self.target = self.getter()
[20:52:21]                   ^^^^^^^^^^^^^
[20:52:21]   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pkgutil.py", line 700, in resolve_name
[20:52:22]     mod = importlib.import_module(modname)
[20:52:22]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
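(Editor's note) The traceback is cut off, so the actual cause is not visible in the log. What it does show is that the failure happens inside `unittest.mock` while it imports the module named in a patch target string, not at the `from examples... import` line itself. One common cause, offered here only as a guess, is a patch target that omits the package prefix. The sketch below rebuilds the `examples/revertrisk_examples.py` layout in a temp directory. The function body and the helper `fetch_score` are invented for the demo and are not the real liftwing-python code.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from unittest import mock

# Throwaway project mirroring the layout from the chat (contents are made up).
root = Path(tempfile.mkdtemp())
pkg = root / "examples"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "revertrisk_examples.py").write_text(
    "def fetch_score():\n"
    "    return 'real'\n"
    "\n"
    "def revert_risk_api_request():\n"
    "    return fetch_score()\n"
)

# Equivalent of running the tests from the project root with it on PYTHONPATH.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from examples.revertrisk_examples import revert_risk_api_request

# Full dotted path: mock can import the module and replace the attribute.
with mock.patch("examples.revertrisk_examples.fetch_score", return_value="mocked"):
    patched_result = revert_risk_api_request()

# Target without the package prefix: mock tries to import a top-level module
# called "revertrisk_examples", which does not exist on sys.path.
try:
    with mock.patch("revertrisk_examples.fetch_score"):
        pass
    short_target_error = None
except ModuleNotFoundError as exc:
    short_target_error = exc

print(patched_result, type(short_target_error).__name__)
```

If the patch target is already fully qualified, the next things to check would be the directory the test runner is started from and whether the target's own package has an `__init__.py`.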