[06:44:04] Machine-Learning-Team: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590 (isarantopoulos) NEW
[06:51:58] (CR) Kevin Bazira: [C:+1] articletopic-outlink-model: Update base image from bullseye to the latest bookworm image. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172620 (https://phabricator.wikimedia.org/T400349) (owner: Gkyziridis)
[07:21:35] (CR) Kevin Bazira: [C:+1] ores-legacy-model: Update base image from bullseye to the latest bookworm image. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172612 (https://phabricator.wikimedia.org/T400348) (owner: Gkyziridis)
[07:23:54] good morning
[07:30:38] good morning!
[07:34:49] (CR) Kevin Bazira: "just like you did in: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1172612/comment/28bf2ce3_c1b85c9a/" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1172597 (https://phabricator.wikimedia.org/T400347) (owner: Gkyziridis)
[07:37:18] Hi team, I’m trying to access edit-check pod logs from Friday, but I’m having some difficulties setting the correct filters on Logstash as I’m not able to select the edit-check namespace :( Would someone have an idea how to access those past logs?
[07:40:54] morning morning o/
[07:41:23] bartosz: let me check ...
[07:41:51] bartosz: https://logstash.wikimedia.org/goto/59219077b0f3fa8dce2109ea35d881da :)
[07:42:18] there is a dashboard called App Kubernetes that can help
[07:42:29] elukey beat me to it. ty! :)
[07:42:48] <3
[07:44:12] Thank you Luca and Kevin! Following the link from Luca I can see it filters by `revision-models` namespace and not `edit-check`, which I can't seem to find
[07:47:27] here is one with edit-check: https://logstash.wikimedia.org/goto/667b9339c15965af5a8cedf28d590069
[07:48:48] Sweet, thank you <3
[07:48:55] np!
[07:49:57] I couldn't find it in the drop-down menus in filters, but I guess one needs to just type it manually?
[07:52:28] yep, it's a search dropdown menu. items appear as you type them, then you can select them
[08:31:34] good morning
[08:54:02] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038323 (elukey) @Jclark-ctr Hi! I tried to provision ml-serve10[13,14] but the BMC seems not reachable, I get connection timeouts if I try. Is there anything extra...
[09:42:42] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038457 (Jclark-ctr) @elukey. That is correct these racks do not have power yet
[09:53:28] Machine-Learning-Team, Research, Research-engineering: Share code between Research & ML teams - https://phabricator.wikimedia.org/T398974#11038514 (Miriam)
[10:25:55] I've created a Phab task for the alert, which fired over the weekend: https://phabricator.wikimedia.org/T400602. The issue was that one of replicas for `reference-need` deployment couldn't schedule for 30mins, most probably due to unavailable resources on our cluster - I remember the exact same thing happening ~a month ago also with `reference-need`. I'll investigate if we could somehow lower the resources used for this service, because our
[10:25:55] requests for it are very high right now.
[10:26:28] Machine-Learning-Team, Essential-Work: Investigate reference-need persistently unavailable replicas alert - https://phabricator.wikimedia.org/T400602 (BWojtowicz-WMF) NEW
[10:33:34] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038654 (elukey)
[10:33:45] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038655 (elukey)
[10:34:16] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038656 (elukey) Perfect! So ml-serve10[12,13] are ready to go, they are running Trixie though.
[10:38:24] bartosz: o/ I have some doubts though, namely why all replicas weren't available? If it was a problem of resources I'd expect new pods to have difficulties to come up, not existing ones.
[10:39:01] ah oh I misread the alert, some new pods were not schedulable
[10:39:14] so the existing ones still served requests etc..
[10:40:53] Yes yes, it was just one new replica that was not schedulable, the remaining ones were happy
[11:46:37] Machine-Learning-Team, EditCheck: Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606 (BWojtowicz-WMF) NEW
[12:22:29] Machine-Learning-Team, Research, Research-engineering: Share code between Research & ML teams - https://phabricator.wikimedia.org/T398974#11038925 (OKarakaya-WMF) This could be interesting for this task: https://gitlab.wikimedia.org/repos/research/research-common
[15:11:21] Machine-Learning-Team, EditCheck: Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11039678 (ppelberg)
[15:12:44] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11039681 (ppelberg)
[15:23:31] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11039740 (Jclark-ctr) Open→Resolved I am closing out this ticket and opening second ticket T400626 for the remaining two servers since power will not be...
[15:32:02] Machine-Learning-Team: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590#11039775 (gkyziridis) Initial try to run [[ https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/main/revert-risk/revert_risk_threshold_analysis_all.py?ref_type...