[06:08:12] <isaranto>	 Morning!
[06:36:25] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117#9668032 (isarantopoulos)
[06:38:22] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117#9668034 (isarantopoulos) The following deployments have been removed from ml-staging:      - revscoring-articlequality: ruwiki       - revscoring-articletopic:...
[06:38:50] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117#9668036 (isarantopoulos) Open→Resolved
[08:00:47] * isaranto afk for 1h
[09:25:08] * isaranto back!
[09:41:50] <aiko>	 morning :)
[10:10:27] <isaranto>	 hey Aiko!
[10:19:51] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9668461 (CodeReviewBot) mfossati merged https://gitlab.wikimedia.org/mfossati/scriptz/-/merge_requests/6  Improve functionality
[11:31:01] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9668801 (CodeReviewBot) kevinbazira opened https://gitlab.wikimedia.org/mfossati/scriptz/-/merge_requests/7  lw_prototype: validate input data
[11:40:52] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9668842 (isarantopoulos) I'm facing issues trying to update huggingfaceserver dependencies to use torch 2.2.1.  I've reached a point where I'm blocked because [[ https://git...
[11:57:37] <isaranto>	 I made an attempt to add a new image version for the pytorch base image https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1015297
[11:59:43] <isaranto>	 I wasn't really sure whether I should create a new image or just a new version/tag for this one. The latter makes more sense in my head, but the naming convention (amd-torch22) kind of confuses things as I want to install version 2.1
[12:00:15] * isaranto lunch
[12:07:37] <elukey>	 isaranto: o/ you can clone the pytorch dir into pytorch21 and start from scratch
[12:07:41] <elukey>	 more clean
[12:07:51] <elukey>	 so we'll have both 2.x minor versions etc..
[12:08:25] <elukey>	 I am going to tweak the current base image a little to have the torch package under /opt/lib/python, so hopefully blubber's pip will not reinstall torch if already present
[12:08:30] <elukey>	 going to work on it today
[12:08:34] <elukey>	 lemme know if you are blocked
[12:09:11] <elukey>	 also hello folks :)
[12:09:14] <klausman>	 \o
[12:09:29] <klausman>	 I've merged the Istio change (and added a small fix you're CC'd on)
[12:09:59] <klausman>	 Currently trying to figure out why the kubectl diff does not look the way I expected it to
[12:10:41] <isaranto>	 hi both!
[12:11:22] <isaranto>	 elukey: ok, I'll do that then! I'm having some issues with docker-pkg build but I'll try again and ping you folks for questions after lunch
[12:11:41] * isaranto now actually going for lunch
[12:18:11] <elukey>	 ack!
[13:03:53] <wikibugs>	 (CR) Kosta Harlan: [C:-1] "overall LGTM, some minor comments inline." [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[13:04:52] <wikibugs>	 (CR) Kosta Harlan: [C:-1] Exclude first revision on page from scoring (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[13:19:23] <elukey>	 pip is really weird
[13:19:51] <elukey>	 IIUC even if you have a package already installed and you use --index-url, pip may still want to re-download everything
[13:21:08] <isaranto>	 iirc it will download it and then tell you that it already exists. If you have it in the cache all well, no download happens
[13:21:19] <isaranto>	 so I guess in our case the cache doesn't exist so...
[13:21:44] <isaranto>	 didn't think about that. I'll have to remove it then in my work
[13:22:15] <isaranto>	 btw I'm having an issue with `docker-pkg` https://phabricator.wikimedia.org/P58979 It can't find python3-bookworm image although I have it
[13:22:21] <isaranto>	 anyone encountered this?
[13:22:34] <klausman>	 what's your docker-pkg commandline? it is rather fickle
[13:22:54] <isaranto>	 ` docker-pkg -c config.yaml build images/amd/pytorch21`
[13:23:22] <elukey>	 try with `docker-pkg build images/ --select *pytorch*`
[13:23:51] <klausman>	 ah
[13:24:06] <elukey>	 with "" around pytorch since you are on macos and probably using zsh, we had some issues with Aiko's terminal recently
[13:24:16] <klausman>	 Yes, if you give it the full path, it stops seeing images outside of it, even if you have them locally
[13:27:08] <elukey>	 isaranto: re pip - okok so it downloads it anyway, but in theory if the same requirement is already deployed locally it shouldn't do anything
[13:27:14] <elukey>	 I missed this part, going to keep testing
[13:28:03] <elukey>	 it doesn't make a lot of sense to me but if we have to pay this price in CI it shouldn't be a big deal
[13:28:19] <elukey>	 but I think that in the base image we'll have to use pip --target /opt/lib/python/site-packages
[13:30:16] <isaranto>	 ack
[13:31:03] <isaranto>	 thanks for the docker-pkg help. this seems to be working!
[13:31:15] <elukey>	 super :)
[13:33:31] <wikibugs>	 Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854#9669191 (JArguello-WMF) @prabhat can we remove ourselves from this ticket?
[13:41:23] <elukey>	 need to run an errand, be back in ~30 mins
[13:45:12] <wikibugs>	 Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854#9669241 (prabhat) @JArguello-WMF Yes, we can.
[13:47:08] <wikibugs>	 Machine-Learning-Team, ORES: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854#9669254 (JArguello-WMF)
[14:21:10] <elukey>	 back!
[14:24:03] <wikibugs>	 Machine-Learning-Team: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234 (achou) NEW
[14:26:13] <elukey>	 so of course if I add --target /opt/lib/python/site-packages to the pip args in the base image I end up with a torch dir owned by root
[14:30:18] <klausman>	 Does the user we want to own them even exist at that point?
[14:32:44] <elukey>	 nope, it is created by blubber, I think they call it "somebody"
[14:33:08] <klausman>	 could we then tell blubber to chown-R the tree once the user exists?
[14:33:32] <elukey>	 not sure, but it would leave the base image half broken
[14:34:09] <klausman>	 Hm. maybe something like chmod a+r on the tree (plus maybe a+x for dirs) in the base image generation?
[14:34:22] <klausman>	 it'd still be owned by root, but at least world-usable
[14:34:59] <klausman>	 (I may be barking up the wrong tree there, I presume root owner and umask make the tree unusable)
[14:35:36] <elukey>	 I need to figure out why it fails for file perms though, in theory pip shouldn't touch old stuff
[14:35:44] <elukey>	 unless it wants to update some file
[14:36:13] <klausman>	 maybe it has a similar problem as mkdir -p --mode, which will change the mode of already existing directories
[14:37:49] <wikibugs>	 (PS5) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[14:38:18] <wikibugs>	 Machine-Learning-Team: Update and fix locust load testing for revscoring models - https://phabricator.wikimedia.org/T361238 (achou) NEW
[14:38:33] <elukey>	 ah no maybe the /opt/lib/python dir is not writable, so when it downloads the wheel from the blubber's pip it fails
[14:38:36] <elukey>	 lemme check
[14:39:58] <elukey>	 yes I think that is the issue
[14:40:02] <wikibugs>	 (CR) CI reject: [V:-1] Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:40:07] <wikibugs>	 (PS6) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[14:41:01] <wikibugs>	 (CR) Jsn.sherman: "thanks for the review!" [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:42:21] <wikibugs>	 (CR) CI reject: [V:-1] Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:44:54] <wikibugs>	 (PS1) AikoChou: locust: fix missing host header for revertrisk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015341 (https://phabricator.wikimedia.org/T361234)
[14:47:15] <wikibugs>	 (PS7) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[14:49:33] <wikibugs>	 (CR) CI reject: [V:-1] Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[14:52:35] <wikibugs>	 (PS2) AikoChou: locust: fix missing host header for revertrisk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015341 (https://phabricator.wikimedia.org/T361234)
[14:53:48] <wikibugs>	 (PS8) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[15:04:09] <wikibugs>	 (CR) Jsn.sherman: Exclude first/only revision on page from scoring (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[15:14:10] <wikibugs>	 Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9669745 (Jhancock.wm) Open→Resolved a:Jhancock.wm new drive has been inserted and the alert has cleared. retu...
[15:26:36] <wikibugs>	 Machine-Learning-Team, Observability-Metrics: SLO dashboards for Lift Wing showing unexpected values - https://phabricator.wikimedia.org/T359879#9669840 (elukey) @herron something really strange: https://w.wiki/9bMW  I compared the recording rule with the actual metric, trying to aggregate with the same...
[15:47:37] <klausman>	 elukey: when you have a moment, I have a question about the cassandra role we use, specifically the listen_addresses
[15:57:33] <klausman>	 The problem is that for other users of Cassandra at WMF, in order to support running more than one node per machine (cassandra-a, cassandra-b etc), the role assumes additional IPs to be bound on the machine and makes the listen addresses to use that IP.
[15:58:21] <klausman>	 Unfortunately, the current version of the networkpolicy-machinery for Cassandra that I added uses just the ml-cache-storage role, which refers to the host IP, not the cassandra-specific one.
[15:59:08] <klausman>	 The question now is if we can make a query that refers to those (from e.g. hieradata/role/codfw/ml_cache/storage.yaml) or if we just change our listen_addresses in that file to the host's main IP.
[15:59:50] <klausman>	 I _think_ the latter is easier, and it's unlikely that we ever have multiple independent Cass nodes on our machines. We could then also release the extra addresses
[15:59:52] <elukey>	 klausman: I'd suggest reaching out to data-persistence, they may know how to work around this
[16:02:43] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9670034 (jsn.sherman)
[16:04:20] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] python: upgrade aiohttp's version to avoid issues with py3.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015070 (owner: Elukey)
[16:05:34] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] locust: fix missing host header for revertrisk load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015341 (https://phabricator.wikimedia.org/T361234) (owner: AikoChou)
[16:12:54] <wikibugs>	 (CR) Elukey: [C:+2] python: upgrade aiohttp's version to avoid issues with py3.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1015070 (owner: Elukey)
[16:13:17] <isaranto>	 oh my python dependency hell
[16:14:05] <elukey>	 tell me about it..
[16:17:34] <isaranto>	 I just found out there is no pytorch-triton-rocm version 2.1.0 for python3.11
[16:17:47] <elukey>	 sigh
[16:17:56] <elukey>	 we can use bullseye if you want
[16:18:30] <elukey>	 I am currently trying to re-create what Blubber does, namely the "somebody" user
[16:18:43] <isaranto>	 Since I'm using a kserve fork I would prefer to use bookworm since we'll have to go down that road again
[16:21:48] <isaranto>	 or I can create a fork of the vllm package and try to use torch 2.2.1 so we don't create a new base image
[16:24:04] <elukey>	 anything already opened to upstream? I mean, maybe they are about to release
[16:24:08] <elukey>	 py3.11 is not that new
[16:24:14] <elukey>	 surely people already asked
[16:25:37] <isaranto>	 for pytorch-triton there is 2.2.0 for py3.11 but not 2.1.0, which is an indirect dependency of pytorch-rocm (if I'm not mistaken)
[16:26:10] <wikibugs>	 (PS3) Jsn.sherman: update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298)
[16:26:11] <isaranto>	 I'll ping vllm upstream for support of newer torch versions and try out a fork if I can update it myself
[16:26:22] <wikibugs>	 (CR) Jsn.sherman: update revertrisk-language-agnostic min & desc (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
[16:37:24] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9670182 (isarantopoulos) There is an open [[ https://github.com/vllm-project/vllm/pull/3442 | Pull Request  ]] in vllm repo to upgrade pytorch support to 2.2.1, just leaving...
[16:59:46] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234#9670319 (achou) @isarantopoulos do you remember the config values in locust.conf when you ran the revertrisk tests? I can't reproduce the result in [[ https://gerrit.wikimed...
[17:01:10] <aiko>	 isaranto: o/ -----^
[17:03:13] <isaranto>	 👀
[17:09:23] <aiko>	 or that's the result testing via api-gw?
[17:16:16] <elukey>	 ok something that I didn't think of before
[17:17:20] <elukey>	 pip with --index-url seems to download and try to install torch anyway, and because of how layers in a Docker image work, /opt/lib/python/site-packages/torch is overwritten
[17:17:42] <elukey>	 by another layer, where we copy /opt/etc.. from the build dir
[17:17:54] <elukey>	 in turn this creates two layers of ~10G each
[17:18:51] <klausman>	 Would --exists-action help?
[17:19:19] <klausman>	 mno, that's per-file
[17:23:16] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Fix locust load testing for Revert Risk models - https://phabricator.wikimedia.org/T361234#9670436 (isarantopoulos) The locust.conf was the one that is committed in the repo. I can't recall if anything was different at the moment. However since the host header was...
[17:25:08] <elukey>	 maybe the problem is pip, poetry would alleviate this issue
[17:25:47] <elukey>	 I have torch 2.2.1+rocm on the base image, if I try to pip install torch 2.2.0 it starts to download everything
[17:25:58] <elukey>	 I'd have expected an error that torch is already installed
[17:26:39] <klausman>	 Is it maybe seeing the two package names as different because of index-url?
[17:27:08] <klausman>	 Not that there'd be an easy way to avoid that :-/
[17:27:14] <elukey>	 it says at the end
[17:27:14] <elukey>	 WARNING: Target directory /opt/lib/python/site-packages/torch already exists. Specify --upgrade to force replacement.
[17:27:18] <elukey>	 bah
[17:27:36] <klausman>	 Very clever to only do that after d/l'ing everything
[17:30:44] <elukey>	 the main issue is that we COPY /opt/lib/python/site-packages in the Docker image that is built upon the base image
[17:30:56] <elukey>	 we COPY from the build one, so another layer is created
[17:31:11] <elukey>	 that adds +10G on top of the other ones
[17:31:42] <elukey>	 so ideally the solution is not to have torch listed in any requirements.txt
[17:31:47] <elukey>	 but not sure about transitive deps
[17:32:15] <klausman>	 Yeah, I wish pip had something like --assume-installed=pytorch
[17:33:45] <elukey>	 I think that the problem is probably in the approach, because of the COPY - even if we had pip not doing anything weird, we'd create a new layer with all the packages on top of the base image (already having torch)
[17:34:05] <elukey>	 and its total size doubles
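
The doubling described above follows from layers being additive: a later layer that rewrites a path shadows the earlier copy but does not reclaim its space. A toy sketch, using the ~10G torch size from the discussion plus one hypothetical extra package (`kserve` at 1G is an assumption for illustration):

```python
# Toy model: each layer maps a file path to its size in GB.
base_layer = {"/opt/lib/python/site-packages/torch": 10}  # pip install in base image
copy_layer = {
    "/opt/lib/python/site-packages/torch": 10,   # same tree, re-added by COPY
    "/opt/lib/python/site-packages/kserve": 1,   # hypothetical extra package
}

def stored_size(layers):
    """GB stored and shipped: every layer counts in full."""
    return sum(sum(layer.values()) for layer in layers)

def visible_size(layers):
    """GB the running container sees: later layers shadow earlier paths."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return sum(merged.values())

layers = [base_layer, copy_layer]
print(stored_size(layers), visible_size(layers))
```

In this model the image stores 21G while the container only sees 11G of files, which is why keeping torch out of the copied tree matters more than whatever pip decides to download.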
[17:35:46] <elukey>	 sneaky issue
[17:36:15] <elukey>	 I'll add what I found in the task, I need to rethink this
[17:37:52] <isaranto>	 my head is going to explode as well, by trying too many things and messing things up :D
[17:38:39] <klausman>	 same here, with a dose of Puppet QL
[17:38:44] <isaranto>	 aiko: I'll run the load test in the morning and check
[17:39:03] <isaranto>	 I have the same code on statbox that I ran so probably I can figure if something is different
[17:39:48] <isaranto>	 the other thing that can happen is that if preprocessing takes more time (because of mwapi) we can serve fewer requests (but this will be a transient issue if it happens)
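
A rough way to see why slower preprocessing lowers the request rate in a load test: with a fixed number of simulated users that each wait for a reply before sending again, throughput is bounded by users divided by latency. The numbers below are hypothetical, not measurements from the revertrisk tests.

```python
def max_throughput(concurrent_users: int, latency_s: float) -> float:
    """Upper bound on requests/second for a closed-loop load test."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return concurrent_users / latency_s

# Hypothetical figures: 10 users, latency rising from 0.25s to 0.5s.
fast = max_throughput(10, 0.25)
slow = max_throughput(10, 0.5)
print(fast, slow)
```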
[17:42:27] <isaranto>	 logging off folks! have a nice evening (and a long weekend for whoever it applies)
[17:43:05] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9670506 (elukey) There is an obstacle with the current approach that I didn't think about. In the current setup, this happens:  * We pip install pytorch-rocm in the base image, so a layer is c...
[17:43:11] <elukey>	 summarized in --^
[17:43:14] <elukey>	 isaranto: o/
[17:43:50] <klausman>	 \o
[17:44:14] <klausman>	 I am not going to figure out this PQL query today, I guess it has to wait until I'm back
[17:44:21] <klausman>	 \o heading out as well
[17:44:56] <klausman>	 Happy Easter and/or a nice weekend everyone
[17:50:06] <aiko>	 isaranto: ack! 
[17:51:22] <elukey>	 going afk, have a nice rest of the day folks
[17:53:10] <aiko>	 bye Luca and Tobias o/ 
[20:48:33] <mercelisv>	 I'm trying to import a function to test it in a test class I made, however when I run the tests I get a ModuleNotFoundError. I have added an __init__.py file and run export PYTHONPATH=$PYTHONPATH:. in the terminal and I'm still getting the ModuleNotFoundError
[20:48:52] <mercelisv>	 has anyone ever run into this problem before?
[20:49:57] <mercelisv>	 from examples.revertrisk_examples import revert_risk_api_request, this is the line I use to import it and this is the file path: /liftwing-python/examples/revertrisk_examples.py
[20:52:14] <mercelisv>	 Traceback (most recent call last):
[20:52:14] <mercelisv>	   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/unittest/mock.py", line 1372, in patched
[20:52:15] <mercelisv>	     with self.decoration_helper(patched,
[20:52:15] <mercelisv>	   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 137, in __enter__
[20:52:16] <mercelisv>	     return next(self.gen)
[20:52:16] <mercelisv>	            ^^^^^^^^^^^^^^
[20:52:17] <mercelisv>	   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/unittest/mock.py", line 1354, in decoration_helper
[20:52:17] <mercelisv>	     arg = exit_stack.enter_context(patching)
[20:52:18] <mercelisv>	           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[20:52:18] <mercelisv>	   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 517, in enter_context
[20:52:19] <mercelisv>	     result = _enter(cm)
[20:52:19] <mercelisv>	              ^^^^^^^^^^
[20:52:20] <mercelisv>	   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/unittest/mock.py", line 1427, in __enter__
[20:52:20] <mercelisv>	     self.target = self.getter()
[20:52:21] <mercelisv>	                   ^^^^^^^^^^^^^
[20:52:21] <mercelisv>	   File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pkgutil.py", line 700, in resolve_name
[20:52:22] <mercelisv>	     mod = importlib.import_module(modname)
[20:52:22] <mercelisv>	           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
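
Every frame shown above is inside `unittest.mock` and ends in `importlib.import_module(modname)`. That suggests (the traceback is cut off, so this is not confirmed) that the failing import is the target string passed to `patch`, rather than the `from examples.revertrisk_examples import ...` line. A minimal sketch of that failure mode, with `no_such_pkg` as a hypothetical unimportable package and `try_patch` as an illustrative helper:

```python
from unittest.mock import patch

def try_patch(target: str) -> str:
    """Enter a patch for `target` and report what happened."""
    try:
        with patch(target):
            return "patched"
    except ModuleNotFoundError as exc:
        return f"ModuleNotFoundError: {exc.name}"

# The module part of this target is importable, so patching works.
ok = try_patch("os.path.exists")

# Hypothetical target whose module part cannot be imported from sys.path;
# the error only appears when the patch is entered, i.e. when the test runs.
bad = try_patch("no_such_pkg.revertrisk_examples.revert_risk_api_request")

print(ok)
print(bad)
```

If that is the cause, the string given to `@patch(...)` needs a module path that is importable from the directory the tests are run in, just like the regular import.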