[05:19:23] (CR) Kevin Bazira: [C: +1] "LGTM too!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856880 (https://phabricator.wikimedia.org/T322006) (owner: Ilias Sarantopoulos)
[05:34:28] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) a: kevinbazira→None
[05:43:49] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (kevinbazira)
[08:59:38] isaranto: o/
[08:59:44] good morning :)
[08:59:57] so on stat1004:/home/elukey/benthos you can find some examples of configs
[09:01:04] there are a few, I have tested some variations of them over time
[09:01:14] elukey: good morning! Thanks, I’ll check them out.
[09:02:10] isaranto: config-editquality-multi-no-events.yaml may be a good one to start with, the workflow it implements is:
[09:02:58] 1) read the kafka topic "eqiad.mediawiki.revision-create", in which we publish events related to "Edits" across wikis
[09:03:27] if you want an idea about the stream, check https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.revision-create
[09:04:12] 2) then we get the rev-id/wiki values from every msg, and we hit Lift Wing
[09:05:01] the config-editquality-multi.yaml config also passes the revision-create event to Lift Wing, which behind the scenes sends an event to kafka as well after creating the score
[09:05:10] Some caveats/notes:
[09:06:15] 1) We are currently sending events to mediawiki.revision-score-test, from all revscoring model servers. It is convenient for testing, but in the future we'll change it. The topic is specified in the "deployment-charts" repo (you can get it from gerrit, I'll explain it when you want)
[09:06:40] 2) We have various kafka clusters in our infrastructure, but the 3 most relevant to us are:
[09:06:43] - kafka main eqiad
[09:06:47] - kafka main codfw
[09:06:50] - kafka jumbo eqiad
[09:07:23] the first two are the ones that mediawiki uses, they are in both eqiad and codfw (where the mediawiki appservers are) and are directly supported by SRE
[09:07:51] they use Mirror Maker to mirror topics between each other, this is why sometimes you'll see topics prefixed with "eqiad." or similar
[09:08:16] The jumbo cluster is eqiad only, and owned by Data Engineering. Most of the topics in the main clusters are mirrored to jumbo as well
[09:08:36] so as you'll see in the config, I am pulling kafka events/msgs from Jumbo directly
[09:09:16] 3) Last but not least - I am using the benthos binary downloaded from upstream's binary distribution, feel free to copy it to your home dir etc..
[09:09:26] you can also build it yourself if you want
[09:10:05] Do you want to get more spam? [YN] :D
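
To make the workflow above concrete, here is a rough Python stand-in for what the Benthos config does; this is a sketch under assumptions, not the actual config-editquality-multi-no-events.yaml. The broker address, Lift Wing URL, Host header, and model name are placeholders, and the event field names (rev_id, database) are assumed from the revision-create schema.

```python
# Minimal sketch of the pipeline described above (NOT the real Benthos config):
# 1) consume revision-create events from Kafka,
# 2) pull out the rev-id and wiki,
# 3) POST them to a Lift Wing model server.
import json

import requests
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "eqiad.mediawiki.revision-create",
    bootstrap_servers=["kafka-jumbo.example:9092"],  # placeholder broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Field names assumed from the mediawiki/revision/create schema.
    wiki, rev_id = event["database"], event["rev_id"]
    resp = requests.post(
        "https://liftwing.example/v1/models/enwiki-goodfaith:predict",  # placeholder URL
        headers={"Host": "enwiki-goodfaith.example"},  # placeholder routing header
        json={"rev_id": rev_id},
        timeout=10,
    )
    print(wiki, rev_id, resp.status_code)
```

Benthos expresses the same three steps declaratively in YAML (roughly: a kafka input, a mapping processor, and an HTTP output), which is why copying one of the configs in /home/elukey/benthos is the easiest starting point.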
[09:12:58] (CR) Ilias Sarantopoulos: [C: +2] ci: add new syntax directive for blubber files [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856880 (https://phabricator.wikimedia.org/T322006) (owner: Ilias Sarantopoulos)
[09:14:13] sigh, I had to restart the ml-serve-codfw apiservers again, some weird LIST call taking ages to complete and/or getting errors
[09:16:12] (PS14) Elukey: Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[09:16:45] (CR) Elukey: Refactor revscoring model servers (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:17:05] (CR) Elukey: Refactor revscoring model servers (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:22:51] (Merged) jenkins-bot: ci: add new syntax directive for blubber files [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856880 (https://phabricator.wikimedia.org/T322006) (owner: Ilias Sarantopoulos)
[09:27:36] elukey: no spam at all! :) Wow, thanks for the instructions/hints
[09:29:11] lemme know if you need more or if you have questions later on when you test etc..
[09:36:20] Morning!
[09:37:12] o/
[09:51:32] elukey: I hope that the perf issues with the k8s API will "just go away" with 1.23, but maybe I am naive. OTOH, dumping time and energy into that problem when 1.23 is (relatively) close is probably not a good use of time/energy (better to put that towards 1.23)
[09:58:13] klausman: I think it may go away when we upgrade knative, so after/during 1.23.. I agree that it doesn't really make sense now, I'll keep restarting when needed
[10:01:00] inb4 we make it a corn job, forget about it and then wonder in 9 months why it keeps restarting regularly :D
[10:01:03] cron*
[10:02:47] ahhaha yes
[10:03:18] What do you mean, "learning from past mistakes"? :D
[11:33:25] * elukey lunch
[12:10:00] same!
[12:51:42] Machine-Learning-Team, ORES, Advanced Mobile Contributions, Growth-Team, and 3 others: 'Highlight likely problem edits' preference doesn't select any filters in mobile web - https://phabricator.wikimedia.org/T318683 (jsn.sherman)
[13:05:06] Machine-Learning-Team: Upgrade the link recommendation algorithm from Spark 2 to Spark 3. - https://phabricator.wikimedia.org/T323493 (kevinbazira)
[14:13:04] aiko: o/
[14:13:22] when you have time, lemme know if https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/856520/ is still missing anything :)
[14:15:47] ok! I'll take a look later
[14:19:16] <3
[14:20:45] isaranto: in the spam I forgot to mention https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar
[14:21:00] so we use istio's sidecars (basically envoy)
[14:21:18] in every pod, and we are able to get metrics about what service is called etc..
[14:21:52] we don't really need the full mesh reports, but for example we care about latencies to api-ro.discovery.wmnet (the internal mediawiki api where we fetch features from)
[14:22:10] so when you test benthos you can check latency metrics in there
[14:22:44] and if you need to check the pod's memory/cpu usa
[14:22:46] *usage
[14:22:47] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?from=now-6h&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-drafttopic&var-pod=All
[14:23:05] there are also more dashboards, but these are the important ones
[14:23:35] (everything is still WIP pre-MVP, so if something doesn't make sense please speak up so we can talk about it :)
[14:32:09] thanks!
[14:37:11] (afk for a little while, my car battery died this morning and I am going to get a replacement)
[15:13:48] (CR) AikoChou: [C: +1] "looks good to me!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[15:15:15] back! I have a new battery :)
[15:17:22] aiko: thanks!
[15:17:43] I am currently testing it with wrk, I see performance improvements with MP across all the models
[15:21:00] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[15:26:38] ah, kserve 0.10 RC0 is out https://github.com/kserve/kserve/releases/tag/v0.10.0-rc0
[15:33:40] :)
[15:36:17] this https://github.com/kserve/kserve/pull/2425 is really great (it is way better than my hack with the decorator to show timings)
[15:39:56] oh, histograms, nice.
[15:48:08] (PS15) Elukey: Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[15:58:29] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) Very nice tests on draftquality with the process pool: ` elukey@wintermute:~/Wikimedia/scratch-dir/model_servers_tests$ wrk -c 5 -t 5 -s inference-draftquality.lua -d 300 --ti...
[15:58:46] (CR) Elukey: [C: +2] Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[16:05:12] (Merged) jenkins-bot: Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[16:08:04] isaranto: o/ so ---^ is going to create new docker images
[16:08:31] and we'll need to deploy them to staging, do some checks and then prod
[16:08:41] do you want to do it? (of course I'll help!)
[16:10:18] of course! I am logging off now. wanna meet in the morning? (if u have time). I also have questions about benthos. I read the docs and started running it on stat4
[16:19:52] isaranto: sure sure, at any time, let's sync tomorrow :)
[16:26:50] going afk earlier as well folks, see you tomorrow!
[16:26:59] have a nice evening/rest-of-the-day
[16:35:25] bye Luca!
[16:55:51] heading out now as well \o
[17:12:58] night all!
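
The wrk numbers and the refactor discussed above revolve around running the revscoring model servers with a process pool (the "MP" improvements). As a hedged illustration only (not the actual patch in https://gerrit.wikimedia.org/r/856520), the general pattern looks like this: CPU-bound scoring is pushed into worker processes so the server's asyncio event loop stays free to accept new requests. The names here (score_revision, Scorer) are made up for the sketch.

```python
# Sketch of the general pattern assumed here: offload CPU-bound model scoring
# to a process pool so the asyncio event loop stays responsive.
import asyncio
from concurrent.futures import ProcessPoolExecutor


def score_revision(rev_id: int) -> dict:
    # Hypothetical stand-in for the CPU-bound feature extraction + scoring.
    return {"rev_id": rev_id, "score": 0.5}


class Scorer:
    def __init__(self, workers: int = 2):
        self.pool = ProcessPoolExecutor(max_workers=workers)

    async def predict(self, rev_id: int) -> dict:
        loop = asyncio.get_running_loop()
        # run_in_executor hands the heavy work to a worker process and
        # returns control to the event loop until the result is ready.
        return await loop.run_in_executor(self.pool, score_revision, rev_id)


async def main():
    scorer = Scorer()
    results = await asyncio.gather(*(scorer.predict(r) for r in (123, 456)))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

One general trade-off of process pools is extra memory per worker, which is one more reason the pod memory/cpu dashboards linked earlier are worth watching after a deployment.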