[01:43:04] Machine-Learning-Team, ORES, FY2023-24-WE 2.1 Typography and palette customizations: Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9853188 (OKJ04)
[01:50:26] Machine-Learning-Team, ORES, FY2023-24-WE 2.1 Typography and palette customizations: Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9853290 (JJMC89)
[06:49:08] Good morning folks!
[08:30:34] morning!
[08:30:57] hello Ilias o/
[08:32:24] Hi Aiko!
[08:32:35] Hello everyone :)
[08:36:06] eswiki and viwiki were firing again over the weekend. I'll check whether it's the same issue we had before
[08:36:31] hi Tobias o/
[08:37:00] o/ Tobias
[08:39:03] aiko: I saw Luca's comment https://phabricator.wikimedia.org/T363336#9853559. I agree that we should start looking into just cutting down the content that is being scored
[08:39:16] if it is the same case all over again (big revisions)
[08:44:52] isaranto: yeah that's a good idea!
[09:24:59] I wonder if that is something we should do in general with data fetched from mwapi. But I also don't know a) what a good value would be and b) how much it would affect prediction accuracy.
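The "cut down the content being scored" idea above can be sketched as a simple cap on the wikitext fetched from mwapi. Everything here is an assumption for illustration: the `MAX_WIKITEXT_CHARS` value and the `cap_content` helper are invented, and as the discussion notes, a real cutoff would have to be chosen by measuring the accuracy impact.

```python
# Hypothetical cap on revision content before feature extraction.
# MAX_WIKITEXT_CHARS is a placeholder; a real value would come from
# measuring the latency/accuracy tradeoff discussed above.
MAX_WIKITEXT_CHARS = 100_000

def cap_content(wikitext: str, limit: int = MAX_WIKITEXT_CHARS) -> tuple[str, bool]:
    """Return (possibly truncated text, whether truncation happened)."""
    if len(wikitext) <= limit:
        return wikitext, False
    # Truncating mid-markup is crude, but it keeps pathologically large
    # revisions from stalling the preprocess() step.
    return wikitext[:limit], True
```

A truncation flag like the one returned here could be logged or surfaced in the response, so clients know a score came from partial content.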
[09:38:27] it is likely that prediction accuracy will indeed be affected, but it would be a tradeoff in order for things to work properly
[09:39:07] Yeah, I guess we'll have to find a sweet spot between accuracy and performance (latency)
[09:43:00] o/
[09:43:39] there could be another road if we don't want to mess too much with revscoring, but it will entail adding more CPUs to the pods
[09:44:09] IIRC the mwapi query that revscoring makes is something that we do in our revscoring code via async calls
[09:44:19] I'd prefer the limiting approach, but it depends on how complex the changes to revscoring would be
[09:44:33] (and the accuracy impact)
[09:45:01] so we have the JSON length, and we could think about a limit: if it is larger than X, we offload the preprocess() call to a process (so using MP revscoring)
[09:45:18] (PS4) AikoChou: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[09:45:27] we could use 2 CPUs for each pod by default, and increase if needed
[09:46:18] we should have a task about using revscoring MP for preprocess
[09:46:30] but we never really explored that possibility
[09:47:09] we also have to figure out if clients wait for 30+ seconds for a reply
[09:47:23] because a lot of istio code 0 may be clients just giving up
[09:47:27] o/ Luca
[09:47:31] hey :)
[09:47:32] we do have this task https://phabricator.wikimedia.org/T349274
[09:47:47] (PS5) AikoChou: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[09:49:09] exactly that one
[09:49:35] Yeah, I am pretty sure the code 0 results are mostly clients giving up (response code DC)
[09:50:33] so we have two roads, maybe we could think about working on both in a spike and see which one is the most flexible
[09:50:40] also
I'd involve research for the revscoring change
[09:52:31] good point! I agree with both approaches
[09:52:51] (CR) AikoChou: "Thanks for the suggestion! I updated the response structure. It looks better in this way. :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[09:55:31] (PS2) Kevin Bazira: article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032731
[09:55:39] (CR) Ilias Sarantopoulos: [C:+1] article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032731 (owner: Kevin Bazira)
[09:57:13] (CR) Klausman: [C:+1] revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[10:06:10] agree! hi luca o/
[10:21:38] * aiko lunch
[10:25:30] ditto!
[10:44:26] * isaranto lunch!
[13:06:29] Good morning all
[13:06:40] Heyo Chris
[13:06:46] Back in the US?
[13:15:31] Hey Chris o/
[13:20:01] (CR) Kevin Bazira: [C:+1] "Super! I've tested the new patch and it works like a charm."
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[13:21:13] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032731 (owner: Kevin Bazira)
[13:21:58] (Merged) jenkins-bot: article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032731 (owner: Kevin Bazira)
[13:23:19] (PS6) AikoChou: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[13:25:43] (CR) AikoChou: [C:+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[13:34:19] (Merged) jenkins-bot: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[13:44:07] (PS1) Kevin Bazira: locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554)
[13:46:43] (CR) CI reject: [V:-1] locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: Kevin Bazira)
[13:52:06] (PS2) Kevin Bazira: locust: use multiple payloads for load testing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554)
[14:39:16] aiko: o/ I tried a slow viwiki rev-id with RR-agnostic, and it is super quick.
I am pretty sure that we do a totally different feature extraction process; do you have any more details?
[14:59:59] I made some changes to the pydantic validation https://github.com/wikimedia/liftwing-python/pull/5
[15:04:01] elukey: as far as I remember revertrisk has very simple logic in feature extraction, so it is fast
[15:06:05] preprocessing is done by the knowledge_integrity package instead of revscoring
[15:08:04] yep yep, I was wondering if it parsed anything content-related for rev-ids
[15:13:30] elukey: o/ it only parses language-agnostic features, e.g. number of headings, media, references, links, etc.
[15:19:08] ack thanks!
[15:27:28] isaranto: +1'd!
[15:28:25] Thanks! Mercelis is going to base his future work on this
[15:28:43] I'm just checking if I can simplify things even further; if not, I'll just merge this one
[15:56:34] added a few more comments to the task
[15:56:45] aiko: I have removed the decorator by adding validation in the parent class. This way, when we add a model, we don't even need to define a request function if the model is simple
[15:57:06] it is closer to what you suggested earlier
[16:01:42] wow that's nice!
[16:02:45] thanks Luca for the comments and the investigation
[16:03:07] <3
[16:03:10] we will discuss it this week to tackle this once and for all...
[16:04:03] yep, I think somebody needs to be assigned to this to find a permanent fix; I think it is affecting a lot of external clients
[16:07:08] ack
[16:07:16] I'm logging off for the evening folks, cu tomorrow!
[16:07:19] o/
[16:23:08] bye Ilias! have a nice evening :)
[18:33:22] heyas chrisalbon or klausman, either of you about for a question on quantity to order on remainder GPU machines?
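As a rough illustration of why the language-agnostic revert-risk model is so much faster than revscoring: counting structural features like headings, media, references, and links needs only cheap pattern matching, with no language-specific parsing. This is a hedged sketch with invented regexes and a made-up function name, not the actual knowledge_integrity implementation:

```python
import re

def count_agnostic_features(wikitext: str) -> dict:
    """Count simple structural features of a revision's wikitext.

    Hypothetical sketch: real patterns in knowledge_integrity may differ.
    """
    return {
        # lines like "== Section ==" at any heading depth
        "headings": len(re.findall(r"(?m)^==+[^=].*?==+\s*$", wikitext)),
        # media inclusions such as [[File:...]] or [[Image:...]]
        "media": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
        # <ref> tags marking citations
        "references": len(re.findall(r"<ref[ >]", wikitext)),
        # any wikilink opening, including the media links above
        "wikilinks": len(re.findall(r"\[\[", wikitext)),
    }
```

Because every count is a single linear scan over the text, runtime stays essentially flat even for large revisions, which matches the "super quick" behavior observed for the slow viwiki rev-id.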
[18:33:50] basically I need to quote and order those other two orders for codfw/eqiad ASAP, and I didn't realize we folded the GPU upgrade budget in, so my old total of 2 in codfw and 4 in eqiad isn't right
[18:34:04] and we have a total of about 9 hosts to order with that budget, and the 1 already on its way to codfw.
[18:34:10] Chris is out sick, let me take a look at my notes
[18:34:16] how should we split up the other 9 between codfw and eqiad
[18:34:18] cool, thank you!
[18:34:22] sorry for the urgency =P
[18:34:32] we keep waiting for the gpu host to arrive but delays, so we just need to order the rest
[18:35:21] So this is about the first order _after_ the initial machine, right?
[18:36:03] Robh do you need it today or can we get you an answer tomorrow morning? I feel like shit but Klausman and I can talk through it tomorrow on the ml team call
[18:36:08] Basically how urgent is urgent
[18:37:31] Mark is out the second half of the week, and my fear is that if we don't get this ordered now it won't land this quarter
[18:37:45] so a day could matter and push this order into next week for approvals
[18:38:10] but if you are sick you are sick, and you don't want to make a bad call if you lack focus.
[18:38:16] Okay then let's figure this out now
[18:38:16] since shipping them site to site isn't free.
[18:38:49] So the notes I had before we folded in the 'upgrade existing hosts with GPUs' were 2 to codfw and 4 to eqiad (not counting the codfw in-flight order for 1 already)
[18:39:00] but now there is remainder budget not for just 6 hosts, but 9 total.
[18:39:10] IIRC it was 3 each for prod in codfw and eqiad, 2 for training/tinkering (eqiad), 1 for staging (codfw), and one machine we hadn't decided on.
[18:39:14] both myself and Willy did the math and compared, so you don't have to worry about the cost part
[18:39:34] just how you wanna divide up the 10 total servers, 1 of which is already ordered and en route to codfw.
[18:39:37] what is the total machine count according to your notes?
[18:39:46] ah, so we agree on 10, good!
[18:40:08] yeah, counting the one already ordered and pending shipment to codfw
[18:40:17] ack
[18:41:05] I think the 3/3/2/1 (prod/prod/train/stage) is solid. What we'd do with the floater I don't know (or if Chris and I agreed to anything)
[18:41:43] Put that one in eqiad. We can use that for training
[18:42:08] So basically 3/3/3/1?
[18:42:19] yes
[18:42:23] SGTM
[18:42:37] What is that broken down between eqiad and codfw
[18:42:54] 3+3 eqiad, 3+1 codfw
[18:43:10] prod/train and prod/staging
[18:43:17] okay cool
[18:43:17] with the staging one being the one in flight
[18:43:28] robh is that what you needed?
[18:44:40] meh, locked up and lost all I said after I agreed on 10 total, 9 yet to quote.
[18:44:55] So 9 machines to go.
[18:45:21] 6 to eqiad (3 prod, 3 training), 3 to codfw (prod)
[18:45:32] perfect thanks klausman
[18:45:51] Thank you both of you
[18:45:53] And one in-flight to codfw
[18:45:58] feel better chrisalbon, sorry to make you work when sick!
[18:46:07] i'll get quotes in asap
[18:46:09] no worries, I can suffer to get this order in
[18:46:30] thanks all
[18:49:51] aye, cap'n. going back to lurking :)
[19:48:03] Machine-Learning-Team, Research: Deployment of model updates - https://phabricator.wikimedia.org/T366528 (fkaelin) NEW
[20:14:33] Machine-Learning-Team, Research, Research-engineering: Deployment of model updates - https://phabricator.wikimedia.org/T366528#9856990 (XiaoXiao-WMF)
[23:05:44] FIRING: LiftWingServiceErrorRate: ...
[23:05:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[23:10:44] RESOLVED: LiftWingServiceErrorRate: ...
[23:10:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
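The alert above counts "non 2/3/400" responses as errors: anything outside 2xx/3xx that also isn't a plain 400 raises the error rate, including istio's code 0 for clients that disconnect before a response (the "clients giving up" discussed earlier). A small sketch of that classification; the helper name is invented, and the real alert is a Prometheus rule over istio metrics rather than application code:

```python
def is_alertable_status(code: int) -> bool:
    """Would this response count toward LiftWingServiceErrorRate?

    Sketch of the alert's "non 2/3/400" condition: 2xx and 3xx are
    successes, and 400 is treated as a plain client error, so none of
    those count; everything else (5xx, 404, and istio's code 0 for
    clients that gave up) does.
    """
    if 200 <= code < 400:  # 2xx success, 3xx redirect
        return False
    if code == 400:  # bad request from the client, excluded by the alert
        return False
    return True
```

Under this reading, a burst of code-0 "client gave up" results is enough to fire the alert even when the backend itself never returned a 5xx.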