[01:43:04] <wikibugs>	 06Machine-Learning-Team, 10ORES, 10FY2023-24-WE 2.1 Typography and palette customizations: Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9853188 (10OKJ04)
[01:50:26] <wikibugs>	 06Machine-Learning-Team, 10ORES, 10FY2023-24-WE 2.1 Typography and palette customizations: Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9853290 (10JJMC89)
[06:49:08] <isaranto>	 Good morning folks!
[08:30:34] <aiko>	 morning!
[08:30:57] <aiko>	 hello Ilias o/
[08:32:24] <isaranto>	 Hi Aiko!
[08:32:35] <klausman>	 Hello everyone :)
[08:36:06] <aiko>	 eswiki and viwiki was firing again on the weekend. I'll check them if they are the same issue we had before
[08:36:31] <aiko>	 hi Tobias o/
[08:37:00] <isaranto>	 o/ Tobias
[08:39:03] <isaranto>	 aiko: I saw Luca's comment https://phabricator.wikimedia.org/T363336#9853559. I agree that we should start looking into just cutting down the content that is being scored
[08:39:16] <isaranto>	 if it is the same case all over again (big revisions)
[08:44:52] <aiko>	 isaranto: yeah that's a good idea!
[09:24:59] <klausman>	 I wonder if that is something we should do in general with data fetched from mwapi. But I also do not know a) what a good value would be and b) how much it would affect prediction accuracy.
[09:38:27] <isaranto>	 it is likely that prediction accuracy will indeed be affected but it would be a tradeoff in order for things to work properly
[09:39:07] <klausman>	 Yeah, I guess we'll have to find a sweet spot between accuracy and performance (latency)
[09:43:00] <elukey>	 o/
[09:43:39] <elukey>	 there could be another road if we don't want to mess too much with revscoring, but it will entail adding more cpus to the pods
[09:44:09] <elukey>	 IIRC the mwapi query that revscoring makes is something that we do in our revscoring code via async calls
[09:44:19] <klausman>	 I'd prefer the limiting approach, but it depends on how complex the changes to revscoring would be
[09:44:33] <klausman>	 (and the accuracy impact)
[09:45:01] <elukey>	 so we have the json length, and we could think about a limit: if it is larger than X, we offload the preprocess() call to a process (so using MP revscoring)
[09:45:18] <wikibugs>	 (03PS4) 10AikoChou: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[09:45:27] <elukey>	 we could use 2 CPUs for each pod by default, and increase if needed
[09:46:18] <elukey>	 we should have a task about using revscoring MP for preprocess
[09:46:30] <elukey>	 but we never really explored that possibility
[09:47:09] <elukey>	 we also have to figure out if clients wait for 30+ seconds for a reply
[09:47:23] <elukey>	 because a lot of istio code 0 may be clients just giving up
[09:47:27] <isaranto>	 o/ Luca
[09:47:31] <elukey>	 hey :)
[09:47:32] <isaranto>	 we do have this task https://phabricator.wikimedia.org/T349274
[09:47:47] <wikibugs>	 (03PS5) 10AikoChou: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[09:49:09] <elukey>	 exactly that one
[09:49:35] <klausman>	 Yeah, I am pretty sure the code 0 results are mostly clients giving up (response code DC)
[09:50:33] <elukey>	 so we have two roads, maybe we could think about working on both in a spike and see which one is the most flexible
[09:50:40] <elukey>	 also I'd involve research for the revscoring change
[09:52:31] <isaranto>	 good point! I agree with both approaches
[09:52:51] <wikibugs>	 (03CR) 10AikoChou: "Thanks for the suggestion! I updated the response structure. It looks better in this way. :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[09:55:31] <wikibugs>	 (03PS2) 10Kevin Bazira: article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032731
[09:55:39] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032731 (owner: 10Kevin Bazira)
[09:57:13] <wikibugs>	 (03CR) 10Klausman: [C:03+1] revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[10:06:10] <aiko>	 agree! hi luca o/
[10:21:38] * aiko lunch
[10:25:30] <klausman>	 ditto!
[10:44:26] * isaranto lunch!
[13:06:29] <chrisalbon>	 Good morning all
[13:06:40] <klausman>	 Heyo Chris
[13:06:46] <klausman>	 Back in the US?
[13:15:31] <isaranto>	 Hey Chris o/
[13:20:01] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] "Super! I've tested the new patch and it works like a charm." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[13:21:13] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032731 (owner: 10Kevin Bazira)
[13:21:58] <wikibugs>	 (03Merged) 10jenkins-bot: article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032731 (owner: 10Kevin Bazira)
[13:23:19] <wikibugs>	 (03PS6) 10AikoChou: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[13:25:43] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[13:34:19] <wikibugs>	 (03Merged) 10jenkins-bot: revertrisk: modify the response type in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[13:44:07] <wikibugs>	 (03PS1) 10Kevin Bazira: locust: use multiple payloads for load testing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554)
[13:46:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] locust: use multiple payloads for load testing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554) (owner: 10Kevin Bazira)
[13:52:06] <wikibugs>	 (03PS2) 10Kevin Bazira: locust: use multiple payloads for load testing [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1038346 (https://phabricator.wikimedia.org/T365554)
[14:39:16] <elukey>	 aiko: o/ I tried a slow viwiki rev-id with RR-agnostic, and it is super quick. I am pretty sure that we do a totally different feature extraction process, do you have any more details?
[14:59:59] <isaranto>	 I made some changes to the pydantic validation https://github.com/wikimedia/liftwing-python/pull/5
[15:04:01] <isaranto>	 elukey: as far as I remember revertrisk has very simple logic in feature extraction so it is fast
[15:06:05] <isaranto>	 preprocessing is done from knowledge_integrity package vs revscoring
[15:08:04] <elukey>	 yep yep I was wondering if it parsed anything content-related for rev-ids
[15:13:30] <aiko>	 elukey: o/ it only parses language-agnostic features e.g. number of headings, media, reference, links, etc
[15:19:08] <elukey>	 ack thanks!
[15:27:28] <aiko>	 isaranto: +1'd!
[15:28:25] <isaranto>	 Danke! Mercelis is going to base his future work on this
[15:28:43] <isaranto>	 I'm just checking if I can simplify things even further , if not I'll just merge this one
[15:56:34] <elukey>	 added a few more comments to the task
[15:56:45] <isaranto>	 aiko: I have removed the decorator by adding validation in the parent class. This way when we add a model we dont even need to define a request function if the model is simple
[15:57:06] <isaranto>	 it is closer to what you suggested earlier 
[16:01:42] <aiko>	 wow that's nice!
[16:02:45] <isaranto>	 thanks Luca for the comments and the investigation
[16:03:07] <elukey>	 <3
[16:03:10] <isaranto>	 we will discuss it this week to tackle this once and for all...
[16:04:03] <elukey>	 yep I think somebody needs to be assigned to this to find a permanent fix, I think it is affecting a lot external clients
[16:07:08] <isaranto>	 ack
[16:07:16] <isaranto>	 I'm logging off for the evening folks, cu tomorrow!
[16:07:19] <elukey>	 o/
[16:23:08] <aiko>	 bye Ilias! have a nice evening :)
[18:33:22] <robh>	 heyas chrisalbon or klausman either of you about for question on quantity to order on remainder GPU machines? 
[18:33:50] <robh>	 basically i need to quote and order those other two orders for codfw/eqiad ASAP and I didnt rrealize we folded the GPU upgrade budget in so my old total of 2 in codfw and 4 in eqiad isn't right
[18:34:04] <robh>	 and we have a total of about 9 hosts to order with that budget, and the 1 already on its way to codfw.
[18:34:10] <klausman>	 Chris is out sick, let me take a look at my notes
[18:34:16] <robh>	 how should we split up the other 9 between codfw and eqiad
[18:34:18] <robh>	 cool, thank you!
[18:34:22] <robh>	 sorry for the urgency =P
[18:34:32] <robh>	 we keep waiting for the gpu host to arrive but delays, so we just need to order hte rest
[18:35:21] <klausman>	 So this is about the first order _after_ the initial machine, right?
[18:36:03] <chrisalbon>	 Robh do you need it today or can we get you an answer tomorrow morning? I feel like shit but Klausman and I can talk through it tomorrow on the ml team call
[18:36:08] <chrisalbon>	 Basically how urgent is urgent
[18:37:31] <robh>	 mar k is out the second half of the week and my fear if we dont get this ordered now it wont land this quarter
[18:37:45] <robh>	 so a day could matter and push this order into next week for approvals
[18:38:10] <robh>	 but if you are sick you are sick and you dont want to make a bad call if you lack focus.
[18:38:16] <chrisalbon>	 Okay then let’s figure this out now
[18:38:16] <robh>	 since shipping them site to site isn't free.
[18:38:49] <robh>	 So the notes I had before we folded in the 'upgrade existing hosts with GPUs' was 2 to codfw and 4 to eqiad (not counting the codfw in flight order for 1 already)
[18:39:00] <robh>	 but now there is remainder budget not for just 6 hosts, but 9 total.
[18:39:10] <klausman>	 IIRC it was 3 each for prod in codfw and eqiad, 2 for training/tinkering (eqiad), 1 for staging (codfw) and one machine we hadn't decided on.
[18:39:14] <robh>	 both myself and wil ly did math and compared so you dont have to worry about the cost part
[18:39:34] <robh>	 just how you wanna divide up the 10 total servers, 1 of which is already ordered and in route to codfw.
[18:39:37] <klausman>	 what is the total machine count according to your notes?
[18:39:46] <klausman>	 ah, so we agree on 10, good!
[18:40:08] <robh>	 yeah, counting the one already ordered and pending shipment to codfw
[18:40:17] <klausman>	 ack
[18:41:05] <klausman>	 I think the 3/3/2/1 (prod/prod/train/stage is solid. WHat we'd do with the floater I don't know (or if Chris and I agreed to anything)
[18:41:43] <chrisalbon>	 Put that one in eqiad. We can use that for training
[18:42:08] <klausman>	 So basically 3/3/3/1?
[18:42:19] <chrisalbon>	 yes
[18:42:23] <klausman>	 SGTM
[18:42:37] <chrisalbon>	 What is that broken down between eqiad and codfw
[18:42:54] <klausman>	 3+3 eqiad 3+1 codfw
[18:43:10] <klausman>	 /prod/train and prod/staging
[18:43:17] <chrisalbon>	 okay cool
[18:43:17] <klausman>	 with the staging one being the one in flight
[18:43:28] <chrisalbon>	 robh is that what you needed?
[18:44:40] <robh>	 meh locked up lost all i said after i agreed 10 total, 9 yet to quote.
[18:44:55] <klausman>	 So 9 machines to go.
[18:45:21] <klausman>	 6 to eqiad (3 prod, 3 training), 3 to codfw (prod)
[18:45:32] <chrisalbon>	 perfect thanks klausman
[18:45:51] <robh>	 Thank you both of you
[18:45:53] <klausman>	 And on ein-flight to codfw
[18:45:58] <robh>	 feel better chrisalbon sorry to make you work when sick!
[18:46:07] <robh>	 i'll get quotes in asap
[18:46:09] <chrisalbon>	 no worries, I can suffer to get this order in
[18:46:30] <chrisalbon>	 thanks all
[18:49:51] <klausman>	 aye, cap'n. going back to lurking :)
[19:48:03] <wikibugs>	 06Machine-Learning-Team, 06Research: Deployment of model updates - https://phabricator.wikimedia.org/T366528 (10fkaelin) 03NEW
[20:14:33] <wikibugs>	 06Machine-Learning-Team, 06Research, 10Research-engineering: Deployment of model updates - https://phabricator.wikimedia.org/T366528#9856990 (10XiaoXiao-WMF)
[23:05:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[23:05:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[23:10:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[23:10:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate