[00:09:34] (Merged) jenkins-bot: build: Upgrade mediawiki/mediawiki-codesniffer to v43.0.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/1009935 (owner: Umherirrender)
[07:04:00] Good morning o/
[08:17:58] (CR) Thiemo Kreuz (WMDE): [C: +1] "Code looks good. The table does have a nice index (two even). 😄 Happy to merge this soon." [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[09:17:39] Machine-Learning-Team: Add a util function in python to detect GPU - https://phabricator.wikimedia.org/T359793 (isarantopoulos) NEW
[09:22:26] Machine-Learning-Team: Add a util function in python to detect GPU - https://phabricator.wikimedia.org/T359793#9619134 (isarantopoulos) @achou suggested to use pyopencl ([[ https://github.com/inducer/pyopencl | GitHub ]], [[ https://pypi.org/project/pyopencl/ | PyPI ]]) which seems well supported and promisi...
[09:55:10] ah, this pytorch index page is misbehaving again on rocm5.5
[10:28:20] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9619269 (MunizaA) @kevinbazira this is very helpful, thank you! Please correct me if I'm wrong but I assume the preprocessing time for each payload was calcu...
[10:38:09] morning!
[10:48:44] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9619343 (kevinbazira) @MunizaA, we're happy to hear that the information provided was helpful. For more context, the preprocessing time for each payload was re...
[10:49:18] Morning Aiko!
[11:32:27] * isaranto lunch
[12:02:10] 'ello. :)
[12:12:38] o/ Tobias!
[13:17:17] hi folks!
[13:17:34] Hey Luca!
[13:17:46] ciao, luca!
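The util function proposed in T359793 could be sketched roughly as follows. This is a hypothetical sketch, not the team's implementation: it assumes pyopencl (as @achou suggested), and the function name `has_gpu` is made up here. It degrades gracefully when pyopencl or a working OpenCL driver is absent.

```python
def has_gpu() -> bool:
    """Return True if at least one OpenCL-visible GPU device exists."""
    try:
        import pyopencl as cl
    except ImportError:
        # pyopencl not installed: treat as "no GPU detected".
        return False
    try:
        for platform in cl.get_platforms():
            # device_type=GPU filters out CPU-only OpenCL devices.
            if platform.get_devices(device_type=cl.device_type.GPU):
                return True
    except cl.Error:
        # Missing platforms or driver issues also mean no usable GPU.
        return False
    return False
```

On a ROCm box this should find the AMD GPU through the OpenCL runtime; on a plain CPU host it returns False rather than raising.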
[13:32:50] * klausman late lunch
[13:38:23] So I was checking some old httpbb tests that we removed for ORES
[13:38:53] actually, we removed some tests for ores-beta.wmflabs.org, which made sense
[13:39:28] but I want to reintroduce them for ores-legacy staging, to start with
[13:40:09] any thoughts/objections to it?
[13:40:37] sure, makes sense
[13:40:40] as far as I can see, httpbb can't check for boolean fields yet, so perhaps I'll have to add some functionality there
[14:58:54] the SLO dashboards are now showing weird numbers; I changed the windows for the new quarter but something is off
[15:01:17] I see them now..
[15:02:30] I filed a couple of patches to improve the performance (not yet merged), but even in the Thanos UI the metrics returned are weird
[15:06:12] see https://w.wiki/9S2m
[15:07:04] maybe I am totally missing something, but in the second case, response_code=200 shows ~3000 rps
[15:07:17] sorry, increased requests
[15:07:23] meanwhile in the third they are zero
[15:07:33] (CR) Ilias Sarantopoulos: "This sounds nice! I opened a task so that we can track this work over there https://phabricator.wikimedia.org/T359793" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[15:12:51] elukey: in the above example, is the only difference between the 2nd and 3rd query the response code filter (response_code=~"(2|3|4)..",)?
[15:13:01] just asking in case I am not reading it properly
[15:13:05] correct, yes
[15:14:07] elukey: the other weirdness is that even if you sum by dcs and site, the two sites are identical
[15:14:25] dcs meaning destination canonical service
[15:14:35] (I only now realize it sounds like DCs, as well)
[15:16:38] (CR) AikoChou: "Thanks for creating the task!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[15:20:10] there is something strange going on with Thanos
[15:20:15] I don't see any other explanation
[15:20:29] maybe due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/992415
[15:20:38] Do you think that maybe the prometheus="..." change we talked about last week broke something?
[15:22:17] that is the change above
[15:22:50] up to Jan 24th there was prometheus="k8s-mlserve", after that prometheus="thanos-rule"
[15:23:13] that in theory should be fine; this is why Keith told us in the code review that removing the "prometheus" label was fine (I guess)
[15:23:14] which makes sense, as we discussed, since the origin of the data is a Thanos rule, not some Prometheus it scrapes.
[15:23:31] but the results are weird, and not because of the divisions themselves
[15:23:59] yeah, and site=codfw and site=eqiad should basically never have the same rate/increase (unless they're both 0)
[15:25:14] the issue that I am seeing is worse; take a look at the second and third panels in the short URL above
[15:25:39] in the latter, eqiad or codfw don't matter, they report 0 reqs for HTTP 200
[15:28:17] yeah, and a div by zero would normally result in NaN
[15:28:47] They occasionally are non-0 though (add `[5m]`)
[15:29:17] Not the 200s, however
[15:29:25] with [90] they are surely not zero, so the corresponding SLO/SLIs are totally off
[15:29:32] like 1500%
[15:30:12] At any rate (ha...) the third time series should never be smaller than the second
[15:30:22] exactly
[15:32:55] Mh. Do you think us dropping `prometheus` in the recording rule would help here?
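For reference, the two queries being compared above likely differ only in the status-code filter. A hypothetical reconstruction as a query builder — the metric name `istio_requests_total` and the exact label set are assumptions; only the `response_code` regex, the `destination_canonical_service`/`site` labels, and the window syntax appear in the conversation:

```python
def slo_query(dcs: str, window: str = "5m", only_ok: bool = False) -> str:
    """Build a per-site request-increase query for one destination service.

    only_ok toggles the response_code filter whose presence/absence
    produced the differing results discussed above.
    """
    labels = [f'destination_canonical_service="{dcs}"']
    if only_ok:
        labels.append('response_code=~"(2|3|4).."')
    selector = ", ".join(labels)
    return (
        f"sum by (site) "
        f"(increase(istio_requests_total{{{selector}}}[{window}]))"
    )
```

In principle the unfiltered query is a superset of the filtered one, which is why the third panel reporting zero while the second shows traffic points at a query-evaluation problem rather than missing data.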
[15:35:32] no idea; my main suspicion at the moment is that Thanos somehow returns different results for queries
[15:35:47] but the data is there, the second panel shows it
[15:36:17] adding response_code=~"(2|3|4).." to the third panel makes the data appear
[15:38:11] very luckily, all SREs are in Warsaw right now for the summit :D
[15:38:18] so we'll have to wait
[15:43:40] I've never gotten Prometheus to do something like this before, but then again, I've not used Thanos for recording rules much
[15:48:33] I don't recall anything so weird before my paternity leave
[15:49:58] I wonder if it only affects us, or if we're the first ones to notice.
[15:54:45] no idea :(
[15:54:55] I'll try to report it in #olly
[16:00:28] ok, reported, let's see
[16:00:35] I'll open a task if nobody reads it
[16:21:34] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9621000 (elukey)
[16:21:43] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9621002 (elukey)
[16:28:28] https://tracker.debian.org/pkg/pytorch
[16:28:39] currently only in unstable, but it should be released with Trixie in theory
[16:29:04] it would lock us to a specific version, but we'd get torch + ROCm all bundled in Debian
[16:29:28] a ton of work removed from us
[16:30:59] this is great! I just wish and hope that newer pytorch releases will be provided as well, as we may have issues with LLMs
[16:31:29] for example, if a newer version of pytorch is required by the transformers library
[16:31:32] maybe something in backports, but not really as up to date as we'd want
[16:32:07] one thing that I noticed - following https://lernapparat.de/pytorch-rocm or even the ROCm guide, I don't manage to build a wheel with all the ROCm libs
[16:32:11] packaged inside, I mean
[16:32:30] so I am wondering if building manually would just create a .whl that relies on system libs
[16:32:41] if so, it would be perfect for us
[16:34:17] I wanted to test it on a stat box with a GPU, but building the latest torch requires Python 3.8 :(
[16:34:23] maybe I can try with conda, mmm
[16:35:15] I am using pyenv on the stat box so I can use later Python versions in my virtual environments
[16:35:21] e.g. 3.9, 3.11, etc.
[16:35:31] lemme see if I have a guide handy
[16:35:40] never used pyenv, lemme try
[16:36:47] (PS1) Krinkle: SqlModelLookup.php: Document that empty cache bypasses is intentional [extensions/ORES] - https://gerrit.wikimedia.org/r/1010236 (https://phabricator.wikimedia.org/T184938)
[16:37:00] (CR) Krinkle: [C: +2] Sort model data by model name in SqlModelLookup [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[16:38:34] trying https://www.dwarmstrong.org/pyenv/
[16:38:36] looks nice!
[16:40:00] yep, that would work
[16:40:07] I didn't find "my guide"
[16:40:36] just this: `curl https://pyenv.run | bash`
[16:40:45] I started creating one but never did
[16:41:11] (CR) Krinkle: [C: +2] Sort model data by model name in SqlModelLookup (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[16:41:37] isaranto: I didn't see anything
[16:41:54] I've never seen that command :D
[16:42:48] :D
[16:42:55] haha
[16:43:24] that is why I didn't create the guide
[16:43:42] the guide you found is the standard way
[16:44:02] git clone, I mean
[16:44:48] just saying it explicitly here. I copy-pasted the above message without even reading it
[16:44:51] oh my
[16:45:56] the | bash "guides" are really not trustworthy :D
[16:46:23] we can keep the above for security training, "what NOT to do"
[16:47:23] on Friday I did rm -rf /* on my machine and deleted a bunch of Applications before I noticed what I had done
[16:52:02] :D
[16:52:27] the | bash stuff is dangerous since if you don't check all the commands, it may install anything without you noticing
[16:52:43] I wouldn't recommend it for a test container either
[16:52:52] anyway, your suggestion worked!
[16:52:57] yes yes, I can't delete it from IRC, so :(
[16:53:05] but on the stat boxes the cmake version is old
[16:53:08] so the build stops
[16:53:10] * elukey sighs
[16:56:28] stat1008 is the only one that has the GPU atm (see https://phabricator.wikimedia.org/T358763), and it runs Buster
[16:57:00] I could use ml-staging2001, but I would probably need to install a lot of build deps on it, something that I'd prefer to avoid
[16:57:56] (I'd need a GPU to test the wheel and see if it works)
[17:01:32] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9621112 (elukey) Followed some guides and afaics most of the manual builds end up with a Python wheel that doesn't contain the extra ROCm libs, like the upstream pytorch ones. I guess...
[17:06:12] isaranto: thanks for the TIL about pyenv :)
[17:06:48] I think it is great for managing multiple versions and not messing up your system
[17:07:13] I am wondering how it works with the stdlib though
[17:07:16] at least on Debian
[17:08:29] (Merged) jenkins-bot: Sort model data by model name in SqlModelLookup [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[17:09:40] klausman: if you want to progress the torch build on your system with an AMD GPU, please go ahead; I recall that you had a plan on Friday
[17:10:09] more than happy if you want to try; I keep getting stopped by our infra :D
[17:10:37] going afk for today, folks!
[17:10:43] have a nice rest of the day :)
[17:11:29] ciao Luca!
[17:15:18] Machine-Learning-Team, ORES: Add httpbb tests for ores-legacy - https://phabricator.wikimedia.org/T359871 (isarantopoulos) NEW
[17:22:05] I'm holding the ores-legacy deployment for tomorrow morning, as I was running some tests
[17:22:31] going afk, folks!
[17:36:59] elukey: I poked at it some more this morning, but it eventually failed and I haven't dug into the why yet
[17:37:08] also heading out now.
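For the "test the wheel on a box with a GPU" step discussed above, a quick sanity check might look like this. A hedged sketch, not an official recipe: it relies on the fact that `torch.version.hip` is set only on ROCm builds, and degrades gracefully when torch is not installed at all.

```python
def torch_gpu_report() -> dict:
    """Summarize whether the installed torch is ROCm-enabled and sees a GPU."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "rocm": False, "gpu_visible": False}
    return {
        "installed": True,
        # torch.version.hip is None on CUDA/CPU builds, a version
        # string (e.g. "5.5...") on ROCm builds.
        "rocm": getattr(torch.version, "hip", None) is not None,
        # On ROCm builds, torch.cuda.is_available() also covers AMD GPUs.
        "gpu_visible": torch.cuda.is_available(),
    }
```

Running this inside the virtualenv where the freshly built wheel was installed would distinguish "wrong build" from "GPU not visible to the runtime".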
[17:37:10] \o