[06:55:07] Lift-Wing, Machine-Learning-Team: Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (elukey) @ACraze I saw something similar as well, +1 it seems a good way to go!
[06:55:32] Lift-Wing, Machine-Learning-Team: Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (elukey)
[06:55:35] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[07:32:03] good morning :)
[08:42:35] kevinbazira: o/ I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/757839/, lemme know what you think about it
[08:43:17] it is related to your change, to show what I was recommending about transformers
[08:43:43] if we spare those two lines then it will be better when multiple models are listed etc..
[08:59:56] thank you for sharing the example, Luca!
[09:00:28] i have fixed the editquality transformer configs.
[09:05:15] kevinbazira: np, thanks for fixing! Do you know how to check the diffs from CI?
[09:05:50] (jenkins leaves a link when posting the +2, that contains the expected diff of your change)
[09:07:10] yep, i've seen it: https://integration.wikimedia.org/ci/job/helm-lint/6593/console
[09:08:24] kevinbazira: there is something not right, namely goodfaith gets the transformer but not the WIKI_HOST variable
[09:08:42] are we adding the transformer only for damaging or for both?
[09:08:59] it should be for both
[09:09:28] ack then the custom_env bits need to be added to goodfaith too
[09:10:02] (look for WIKI_HOST in the CI diff, it is only for damaging)
[09:10:19] alright, let me fix this right now
[09:14:00] i've added the transformer to enwiki-goodfaith as well
[09:26:30] kevinbazira: merged, you can deploy from deploy1002 if you want
[09:26:46] the difference from the last time is that you'll now have to deploy to ml-serve-codfw too
[09:28:55] alright, thanks elukey I'll deploy on Monday.
[11:34:18] * elukey afk! lunch
[14:22:19] folks I added two pie charts to https://logstash.wikimedia.org/app/dashboards#/view/ORES
[14:22:23] for UA and URIs
[14:22:29] 90% of our traffic is precache :D
[14:59:13] the only topic that I see on kafka mentioning ores is mediawiki.job.ORESFetchScoreJob
[15:10:41] I am reading tasks like https://phabricator.wikimedia.org/T201868 and the precache thing may be very weird to understand right now
[15:31:59] Thanks to Amir I understand a little bit more the precache thing
[15:32:07] it is very important for the migration that we'll do
[15:32:34] in the UA breakdown the top 2 UAs are:
[15:32:37] 1) Changeprop
[15:32:40] 2) Mediawiki
[15:32:54] Changeprop only hits us for precaching, nothing more
[15:33:19] Mediawiki does it via the JobQueue (and Changeprop of course, but indirectly)
[15:33:39] and the kafka topic mentioned above is only made of Mediawiki scores
[15:35:13] part of the scores are duplicated, since they are in both
[15:35:17] but not all
[15:35:22] ---
[15:35:37] this means that turning off precache / changeprop would only impact ORES
[15:35:54] and that Lift Wing will need to support the Mediawiki use case
[15:36:05] (it wasn't that clear to me before)
[15:36:51] I would personally try not to replicate the changeprop precache thing in Lift Wing
[15:37:24] we surely need to have a score cache (so that, for example, the same request doesn't end up calling the mw api twice)
[15:37:40] but the precache seems a little overkill
[15:37:55] (it may be needed for ORES, but it shouldn't be for Lift Wing)
[15:38:01] does the above make sense?
[15:57:46] o/
[15:57:46] elukey: thx for digging into the precache stuff
[15:58:30] i agree we may not need it for Lift Wing
[15:58:59] (we'll need a score cache of some sort though)
[16:04:48] yep definitely
[16:14:21] accraze: what do you feel about the logging change in ores?
[16:19:13] i think it looks great!
[16:19:50] im not too worried about log volume tbh
[16:20:12] will approve and merge if its ready
[16:21:39] super thanks :)
[16:21:44] I just created https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/757922
[16:21:49] to use yamlconf==0.2.5
[16:21:55] if it is totally off I'll drop it :
[16:21:56] :)
[16:23:14] niiiiice!
[16:26:39] jenkins will let us know if it works (it seems like it should)
[16:33:29] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality model - https://phabricator.wikimedia.org/T300195 (elukey) The change got merged today so we can proceed with both in my opinion :)
[16:40:56] what is the difference between a precache and a score cache?
[16:41:05] also, good morning!
[16:41:27] good morning!
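A minimal sketch of the score cache idea raised at 15:37:24 (the same request should not end up calling the MW API twice). Everything here is an assumption for illustration: the `ScoreCache` class, the `(wiki, model, rev_id)` key, and the `fake_score` function are hypothetical, and a plain dict stands in for a shared store such as redis. It is not how ORES or Lift Wing implement caching.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cache key: one score per (wiki, model, revision id).
ScoreKey = Tuple[str, str, int]


class ScoreCache:
    """In-memory stand-in for a shared score cache (e.g. redis)."""

    def __init__(self, compute: Callable[[str, str, int], dict]) -> None:
        self._compute = compute
        self._scores: Dict[ScoreKey, dict] = {}

    def get_score(self, wiki: str, model: str, rev_id: int) -> dict:
        key = (wiki, model, rev_id)
        if key not in self._scores:
            # Cache miss: pay the full cost (MW API call + model run) once.
            self._scores[key] = self._compute(wiki, model, rev_id)
        return self._scores[key]


# Records every simulated MW API call so repeated work is visible.
mw_api_calls: List[Tuple[str, int]] = []


def fake_score(wiki: str, model: str, rev_id: int) -> dict:
    """Hypothetical scorer: pretends to fetch the revision and score it."""
    mw_api_calls.append((wiki, rev_id))
    return {"model": model, "rev_id": rev_id, "prediction": False}


cache = ScoreCache(fake_score)
first = cache.get_score("enwiki", "damaging", 12345)
second = cache.get_score("enwiki", "damaging", 12345)  # served from cache
```

In this sketch a precache would simply be something that calls `get_score` for every new revision and discards the result, so that later callers hit the cache.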
[16:41:59] my understanding is that "precache" is a way to force the population of a score cache
[16:42:42] in our case, change-prop is configured (with a hack) to query ORES after every rev-id it sees
[16:43:14] but it doesn't really care about the response, it is just to allow ores to calculate a score and set the result in redis
[16:43:17] (the score cache)
[16:43:58] on the other hand, mediawiki jobs need the score value to populate a table in the mediawiki db and to add the score to a kafka topic
[16:44:11] the two things are completely separate
[16:44:19] we'll need to support the latter for sure
[16:46:06] ORES' traffic is currently 90% for precache (from changeprop)
[16:46:47] ah got it, and this is why we have suspected that ORES is making multiple predictions per revision, one for the precache from changeprop and one from mediawiki to populate the table
[16:46:57] exactly yes
[16:47:13] but mw and changeprop overlap only in some scores
[16:47:32] because the traffic breakdown clearly says that changeprop makes way more requests
[16:47:48] Well, I feel like I should be sad ORES models are used less than the metrics would suggest, but honestly I'm happy to remove a big barely used part of the system
[16:48:54] and yeah, we definitely need to support the mediawiki db table, for example this comment from MS https://www.mediawiki.org/w/index.php?title=Topic:Wousr2ymimxdycnp&topic_showPostId=woux4fhhuvk3pcec#flow-post-woux4fhhuvk3pcec
[16:50:27] what was the motivation behind the precache/changeprop? Like, what is it used for?
[16:53:16] my guess is that some bots need to process a consecutive list of revisions for the same wiki in a row (maybe every X amount of time) and a separate call to ORES for each one, lasting say ~300ms->500ms was considered too much
[16:53:23] ^^^
[16:54:41] I'd argue that a bot is not an interactive use case, it can wait a few ms to get a score
[16:55:12] That makes sense from what I know about it. And that might not be an issue with Lift Wing? Is Lift Wing faster? Or can we redirect any bots to making MW queries?
[16:55:43] for the moment the timings that I see for models (request/response) are along the same line
[16:55:47] 300->500ms
[16:56:25] with the caveat of how many concurrent requests a pod can answer, since once it gets overloaded the latency rises quickly
[16:58:23] but it may be due to the model itself, revscoring based models may not scale more than x
[16:58:35] what we can do is autoscale :D
[17:00:00] yeah im interested in seeing how the autoscale works for revscoring models
[17:00:38] elukey: btw your yamlconf change is merged!
[17:01:49] \o/
[17:02:02] nice one :)
[17:04:04] Three thoughts:
[17:04:05] 1. It is not clear to me that precaching is actually a required feature for users. Most people I've talked to query the MW db. There might be some though.
[17:04:05] 2. The actual long-term solution is probably new, faster models. I have long suspected that 300-->500ms is from I/O, and so there isn't really much on the infrastructure side we can do to improve that. Instead new models that require less I/O will be blazing fast.
[17:04:05] 3. I am worried about saying, "in ORES, thanks to this massively inefficient precache, you can get a prediction in 10ms, but in our awesome new system we just launched, it will take 400ms" so hopefully we can use something like autoscaling to bring the ms down a bit.
[17:11:21] the best that we can do, even with auto-scaling, is to have a single score taking those 300/500ms and not more
[17:11:32] unless the score-cache is hit
[17:19:06] also, one point is that the score-cache (populated by changeprop) covers a big chunk of recent edits
[17:19:13] that likely bots will use
[17:19:37] but any other user will see latencies up to 300/500ms
[17:21:04] Yeah I don’t think autoscaling is going to solve this, the issue is the I/O of the models themselves not any infra issue.
[17:21:28] and 1. is important, I am almost sure that users are bots that don't really care if something takes 10ms or 500ms
[17:21:41] But that is an interesting idea that probably a lot of what people want would be covered by the score cache
[17:21:46] I mean it is of course a nice thing if a model takes one order of magnitude less time to answer
[17:22:04] Yeah. I don’t want to build a precache if there isn’t a need
[17:22:29] it is also true that we can move the changeprop hack to something more modern
[17:22:42] like https://github.com/kserve/website/blob/main/docs/modelserving/kafka/kafka.md
[17:22:55] it requires knative-eventing (that we currently don't deploy)
[17:23:02] but it would be entirely on our side
[17:24:40] Okay cool, so we have options but let’s not build things just to build them
[17:25:35] exactly I have the same thought
[17:25:42] accraze: what do you think?
[17:27:20] same thoughts here, we have options
[17:27:39] the kafka integration would be pretty cool
[17:30:34] folks I will not work on monday, so we'll see each other on Tue :)
[17:30:41] have a good day and weekend!
[17:30:42] Bye!
[17:30:58] have a good weekend elukey!
[17:31:08] o/
[18:58:02] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Editquality Transformer - https://phabricator.wikimedia.org/T298943 (ACraze) @kevinbazira - I noticed an issue with the new transformer. It seems since we need to pass `self.model.features` to the extractor, we will also need to load...
[20:15:11] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality model - https://phabricator.wikimedia.org/T300195 (Halfak) OK I'll pull it in.
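The latency trade-off discussed between 17:04 and 17:21 can be put into rough numbers. This is a back-of-the-envelope sketch, not a measurement: the 10ms (cache hit) and 400ms (uncached score) figures come from the 17:04:05 message, while the hit ratios are purely hypothetical, since the log does not say how often the score cache is hit.

```python
def expected_latency_ms(hit_ratio: float,
                        hit_ms: float = 10.0,
                        miss_ms: float = 400.0) -> float:
    """Average latency when a fraction `hit_ratio` of requests hit the cache.

    hit_ms and miss_ms default to the figures quoted in the discussion;
    hit_ratio is an assumed input, not something reported in the log.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


# Hypothetical hit ratios, only to show how the average moves.
for ratio in (0.0, 0.5, 1.0):
    print(f"hit ratio {ratio:.0%}: {expected_latency_ms(ratio):.0f} ms")
```

Autoscaling does not appear in the formula: as noted at 17:11:21, it can at best keep an uncached score at its base cost under load, while only a cache hit brings an individual request below it.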
[20:18:12] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality model - https://phabricator.wikimedia.org/T300195 (Halfak) Looks like something went wrong with the editquality repo. I got a smudge error. This us...
[20:18:54] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality model - https://phabricator.wikimedia.org/T300195 (Halfak) I should also note that this change includes hiwiki editquality models and the ores loggin...
[20:19:11] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak)
[20:19:46] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak)
[20:20:32] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak)
[20:20:36] Machine-Learning-Team, artificial-intelligence, editquality-modeling, Hindi-Sites: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 (Halfak)
[20:23:31] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks): ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) It looks like the mirroring failure is affecting the ORES...
[20:26:24] (PS3) Halfak: (WIP) nlwiki articlequality, hiwiki editquality, ores observability [services/ores/deploy] - https://gerrit.wikimedia.org/r/755731 (https://phabricator.wikimedia.org/T300195)
[20:27:16] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) I've pushed the changes that I can t...
[21:39:03] Lift-Wing, Machine-Learning-Team (Active Tasks): ML Sandbox Transformer Configuration - https://phabricator.wikimedia.org/T299972 (ACraze) I've been rebuilding the sandbox cluster using the install script with the updated charts for knative and kserve. The KServe stack is able to load with all containers...
[23:06:34] (PS3) Umherirrender: SpecialORESModels: remove unneeded factory method [extensions/ORES] - https://gerrit.wikimedia.org/r/676190 (owner: DannyS712)
[23:08:52] (CR) jerkins-bot: [V: -1] SpecialORESModels: remove unneeded factory method [extensions/ORES] - https://gerrit.wikimedia.org/r/676190 (owner: DannyS712)
[23:18:35] (CR) Umherirrender: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/676190 (owner: DannyS712)
[23:30:45] (PS1) Umherirrender: Convert to abstract schema [extensions/ORES] - https://gerrit.wikimedia.org/r/757989 (https://phabricator.wikimedia.org/T268566)
[23:31:49] (CR) Umherirrender: "check experimental" [extensions/ORES] - https://gerrit.wikimedia.org/r/757989 (https://phabricator.wikimedia.org/T268566) (owner: Umherirrender)
[23:45:44] (CR) Umherirrender: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/676190 (owner: DannyS712)
[23:48:14] (CR) Umherirrender: [C: +2] SpecialORESModels: remove unneeded factory method [extensions/ORES] - https://gerrit.wikimedia.org/r/676190 (owner: DannyS712)
[23:50:20] (Merged) jenkins-bot: SpecialORESModels: remove unneeded factory method [extensions/ORES] - https://gerrit.wikimedia.org/r/676190 (owner: DannyS712)