[00:09:34] (Merged) jenkins-bot: build: Upgrade mediawiki/mediawiki-codesniffer to v43.0.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/1009935 (owner: Umherirrender)
[07:04:00] Good morning o/
[08:17:58] (CR) Thiemo Kreuz (WMDE): [C: +1] "Code looks good. The table does have a nice index (two even). 😄 Happy to merge this soon." [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[09:17:39] Machine-Learning-Team: Add a util function in python to detect GPU - https://phabricator.wikimedia.org/T359793 (isarantopoulos) NEW
[09:22:26] Machine-Learning-Team: Add a util function in python to detect GPU - https://phabricator.wikimedia.org/T359793#9619134 (isarantopoulos) @achou suggested to use pyopencl ([[ https://github.com/inducer/pyopencl | GitHub ]], [[ https://pypi.org/project/pyopencl/ | PyPI ]]) which seems well supported and promisi...
[09:55:10] ah, this pytorch index page is misbehaving again on rocm5.5
[10:28:20] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9619269 (MunizaA) @kevinbazira this is very helpful, thank you! Please correct me if I'm wrong but I assume the preprocessing time for each payload was calcu...
[10:38:09] morning!
[10:48:44] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9619343 (kevinbazira) @MunizaA, we're happy to hear that the information provided was helpful. For more context, the preprocessing time for each payload was re...
[10:49:18] Morning Aiko!
[11:32:27] * isaranto lunch
[12:02:10] 'ello. :)
[12:12:38] o/ Tobias!
[13:17:17] hi folks!
[13:17:34] Hey Luca!
[13:17:46] ciao, luca!
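The util function proposed in T359793 could be sketched roughly as follows. This is a hypothetical sketch, not the team's implementation: it assumes pyopencl (as @achou suggested), and the function name `has_gpu` is made up here. It degrades gracefully when pyopencl or a working OpenCL driver is absent.

```python
def has_gpu() -> bool:
    """Return True if at least one OpenCL-visible GPU device exists."""
    try:
        import pyopencl as cl
    except ImportError:
        # pyopencl not installed: treat as "no GPU detected".
        return False
    try:
        for platform in cl.get_platforms():
            # device_type=GPU filters out CPU-only OpenCL devices.
            if platform.get_devices(device_type=cl.device_type.GPU):
                return True
    except cl.Error:
        # Missing platforms or driver issues also mean no usable GPU.
        return False
    return False
```

On a ROCm box this should find the AMD GPU through the OpenCL runtime; on a plain CPU host it returns False rather than raising.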
[13:32:50] * klausman late lunch
[13:38:23] So I was checking some old httpbb tests that we removed for ORES
[13:38:53] actually, we removed some tests for ores-beta.wmflabs.org, which made sense
[13:39:28] but I want to reintroduce them for ores-legacy staging, to start with
[13:40:09] any thoughts/objections to it?
[13:40:37] sure, makes sense
[13:40:40] as far as I can see, httpbb can't check for boolean fields yet, so perhaps I'll have to add some functionality there
[14:58:54] the SLO dashboards are now showing weird numbers; I changed the windows for the new quarter but something is off
[15:01:17] I see them now..
[15:02:30] I filed a couple of patches to improve the performance (not yet merged), but even in the Thanos UI the metrics returned are weird
[15:06:12] see https://w.wiki/9S2m
[15:07:04] maybe I am totally missing something, but in the second case, response_code=200 shows ~3000 rps
[15:07:17] sorry, increased requests
[15:07:23] meanwhile in the third they are zero
[15:07:33] (CR) Ilias Sarantopoulos: "This sounds nice! I opened a task so that we can track this work over there https://phabricator.wikimedia.org/T359793" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[15:12:51] elukey: in the above example, is the only difference between the 2nd and 3rd query the response code filter (response_code=~"(2|3|4)..",)?
[15:13:01] just asking in case I am not reading it properly
[15:13:05] correct, yes
[15:14:07] elukey: the other weirdness is that even if you sum by dcs and site, the two sites are identical
[15:14:25] dcs meaning destination canonical service
[15:14:35] (I only now realize it sounds like DCs, as well)
[15:16:38] (CR) AikoChou: "Thanks for creating the task!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[15:20:10] there is something strange going on with Thanos
[15:20:15] I don't see any other explanation
[15:20:29] maybe due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/992415
[15:20:38] Do you think that maybe the prometheus="..." change we talked about last week broke something?
[15:22:17] that is the change above
[15:22:50] up to Jan 24th there was prometheus="k8s-mlserve", after that prometheus="thanos-rule"
[15:23:13] that in theory should be fine; this is why Keith told us in the code review that removing the "prometheus" label was fine (I guess)
[15:23:14] which makes sense, as we discussed, since the origin of the data is a Thanos rule, not some Prometheus it scrapes.
[15:23:31] but the results are weird, and not because of the divisions themselves
[15:23:59] yeah, and site=codfw and site=eqiad should basically never have the same rate/increase (unless they're both 0)
[15:25:14] the issue that I am seeing is worse; take a look at the second and third panels in the short URL above
[15:25:39] in the latter, eqiad or codfw don't matter, they report 0 reqs for HTTP 200
[15:28:17] yeah, and a div by zero would normally result in NaN
[15:28:47] They occasionally are non-0 though (add `[5m]`)
[15:29:17] Not the 200s, however
[15:29:25] with [90] they are surely not zero, so the corresponding SLO/SLIs are totally off
[15:29:32] like 1500%
[15:30:12] At any rate (ha...) the third time series should never be smaller than the second
[15:30:22] exactly
[15:32:55] Mh. Do you think us dropping `prometheus` in the recording rule would help here?
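For reference, the two queries being compared above likely differ only in the status-code filter. A hypothetical reconstruction as a query builder — the metric name `istio_requests_total` and the exact label set are assumptions; only the `response_code` regex, the `destination_canonical_service`/`site` labels, and the window syntax appear in the conversation:

```python
def slo_query(dcs: str, window: str = "5m", only_ok: bool = False) -> str:
    """Build a per-site request-increase query for one destination service.

    only_ok toggles the response_code filter whose presence/absence
    produced the differing results discussed above.
    """
    labels = [f'destination_canonical_service="{dcs}"']
    if only_ok:
        labels.append('response_code=~"(2|3|4).."')
    selector = ", ".join(labels)
    return (
        f"sum by (site) "
        f"(increase(istio_requests_total{{{selector}}}[{window}]))"
    )
```

In principle the unfiltered query is a superset of the filtered one, which is why the third panel reporting zero while the second shows traffic points at a query-evaluation problem rather than missing data.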
[15:35:32] no idea; my main suspicion at the moment is that Thanos somehow returns different results for queries
[15:35:47] but the data is there, the second panel shows it
[15:36:17] adding response_code=~"(2|3|4).." to the third panel makes the data appear
[15:38:11] very luckily, all SREs are in Warsaw right now for the summit :D
[15:38:18] so we'll have to wait
[15:43:40] I've never gotten Prometheus to do something like this before, but then again, I've not used Thanos for recording rules much
[15:48:33] I don't recall anything so weird before my paternity leave
[15:49:58] I wonder if it only affects us, or if we're the first ones to notice.
[15:54:45] no idea :(
[15:54:55] I'll try to report it in #olly
[16:00:28] ok, reported, let's see
[16:00:35] I'll open a task if nobody reads it
[16:21:34] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9621000 (elukey)
[16:21:43] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9621002 (elukey)
[16:28:28] https://tracker.debian.org/pkg/pytorch
[16:28:39] currently only in unstable, but it should be released with Trixie in theory
[16:29:04] it would lock us to a specific version, but we'd get torch + ROCm all bundled in Debian
[16:29:28] a ton of work removed from us
[16:30:59] this is great! I just wish and hope that newer pytorch releases will be provided as well, as we may have issues with LLMs
[16:31:29] for example, if a newer version of pytorch is required by the transformers library
[16:31:32] maybe something in backports, but not really as up to date as we'd want
[16:32:07] one thing that I noticed - following https://lernapparat.de/pytorch-rocm or even the ROCm guide, I don't manage to build a wheel with all the ROCm libs
[16:32:11] packaged inside, I mean
[16:32:30] so I am wondering if building manually would just create a .whl that relies on system libs
[16:32:41] if so, it would be perfect for us
[16:34:17] I wanted to test it on a stat box with a GPU, but building the latest torch requires Python 3.8 :(
[16:34:23] maybe I can try with conda, mmm
[16:35:15] I am using pyenv on the stat box so I can use later Python versions in my virtual environments
[16:35:21] e.g. 3.9, 3.11, etc.
[16:35:31] lemme see if I have a guide handy
[16:35:40] never used pyenv, lemme try
[16:36:47] (PS1) Krinkle: SqlModelLookup.php: Document that empty cache bypasses is intentional [extensions/ORES] - https://gerrit.wikimedia.org/r/1010236 (https://phabricator.wikimedia.org/T184938)
[16:37:00] (CR) Krinkle: [C: +2] Sort model data by model name in SqlModelLookup [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[16:38:34] trying https://www.dwarmstrong.org/pyenv/
[16:38:36] looks nice!
[16:40:00] yep, that would work
[16:40:07] I didn't find "my guide"
[16:40:36] just this: `curl https://pyenv.run | bash`
[16:40:45] I started creating one but never did
[16:41:11] (CR) Krinkle: [C: +2] Sort model data by model name in SqlModelLookup (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[16:41:37] isaranto: I didn't see anything
[16:41:54] I've never seen that command :D
[16:42:48] :D
[16:42:55] haha
[16:43:24] that is why I didn't create the guide
[16:43:42] the guide you found is the standard way
[16:44:02] git clone, I mean
[16:44:48] just saying it explicitly here. I copy-pasted the above message without even reading it
[16:44:51] oh my
[16:45:56] the | bash "guides" are really not trustworthy :D
[16:46:23] we can keep the above for security training, "what NOT to do"
[16:47:23] on Friday I did rm -rf /* on my machine and deleted a bunch of Applications before I noticed what I had done
[16:52:02] :D
[16:52:27] the | bash stuff is dangerous since if you don't check all the commands, it may install anything without you noticing
[16:52:43] I wouldn't recommend it for a test container either
[16:52:52] anyway, your suggestion worked!
[16:52:57] yes yes, I can't delete it from IRC, so :(
[16:53:05] but on the stat boxes the cmake version is old
[16:53:08] so the build stops
[16:53:10] * elukey sighs
[16:56:28] stat1008 is the only one that has the GPU atm (see https://phabricator.wikimedia.org/T358763), and it runs Buster
[16:57:00] I could use ml-staging2001, but I would probably need to install a lot of build deps on it, something that I'd prefer to avoid
[16:57:56] (I'd need a GPU to test the wheel and see if it works)
[17:01:32] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9621112 (elukey) Followed some guides and afaics most of the manual builds end up with a Python wheel that doesn't contain the extra ROCm libs, like the upstream pytorch ones. I guess...
[17:06:12] isaranto: thanks for the TIL about pyenv :)
[17:06:48] I think it is great for managing multiple versions and not messing up your system
[17:07:13] I am wondering how it works with the stdlib though
[17:07:16] at least on Debian
[17:08:29] (Merged) jenkins-bot: Sort model data by model name in SqlModelLookup [extensions/ORES] - https://gerrit.wikimedia.org/r/991932 (owner: Umherirrender)
[17:09:40] klausman: if you want to progress the torch build on your system with an AMD GPU, please go ahead; I recall that you had a plan on Friday
[17:10:09] more than happy if you want to try; I keep getting stopped by our infra :D
[17:10:37] going afk for today, folks!
[17:10:43] have a nice rest of the day :)
[17:11:29] ciao Luca!
[17:15:18] Machine-Learning-Team, ORES: Add httpbb tests for ores-legacy - https://phabricator.wikimedia.org/T359871 (isarantopoulos) NEW
[17:22:05] I'm holding the ores-legacy deployment for tomorrow morning, as I was running some tests
[17:22:31] going afk, folks!
[17:36:59] elukey: I poked at it some more this morning, but it eventually failed and I haven't dug into the why yet
[17:37:08] also heading out now.
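For the "test the wheel on a box with a GPU" step discussed above, a quick sanity check might look like this. A hedged sketch, not an official recipe: it relies on the fact that `torch.version.hip` is set only on ROCm builds, and degrades gracefully when torch is not installed at all.

```python
def torch_gpu_report() -> dict:
    """Summarize whether the installed torch is ROCm-enabled and sees a GPU."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "rocm": False, "gpu_visible": False}
    return {
        "installed": True,
        # torch.version.hip is None on CUDA/CPU builds, a version
        # string (e.g. "5.5...") on ROCm builds.
        "rocm": getattr(torch.version, "hip", None) is not None,
        # On ROCm builds, torch.cuda.is_available() also covers AMD GPUs.
        "gpu_visible": torch.cuda.is_available(),
    }
```

Running this inside the virtualenv where the freshly built wheel was installed would distinguish "wrong build" from "GPU not visible to the runtime".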
[17:37:10] \o