[08:32:50] <isaranto>	 o/ have a great week start every1 :)
[08:33:52] <elukey>	 morning :D
[08:33:57] <klausman>	 \o
[08:34:02] <elukey>	 was it a prompt for bloom-3b? :D
[08:34:34] <isaranto>	 nope but I can try!
[08:35:04] <isaranto>	 I could try "write a message for wishing everyone a great start of the week"
[08:35:05] <isaranto>	 hehe
[08:35:25] <isaranto>	 so u know it was mine otherwise it would be much better 
[08:56:08] <aiko>	 o/ morning :)
[08:56:33] <klausman>	 'ello :)
[09:05:09] <wikibugs>	 (03PS2) 10AikoChou: events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899)
[09:06:42] <wikibugs>	 (03CR) 10Elukey: events: remove content_slots field from prediction classification event (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[09:12:41] <wikibugs>	 (03CR) 10AikoChou: events: remove content_slots field from prediction classification event (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[09:24:45] <elukey>	 klausman: o/ Do you have time to work on the documentation/test updates for the api gateway tier stuff?
[09:25:03] <klausman>	 yep, that's the plan for today/this week
[09:25:13] <klausman>	 I should really get better at the standup bot hing
[09:25:20] <elukey>	 ack
[09:25:54] <elukey>	 klausman: can you also update the status of the slack thread with enterprise when some change happens so they are aware?
[09:26:08] <klausman>	 will do
[09:48:16] <wikibugs>	 (03PS3) 10AikoChou: events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899)
[10:19:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] events: remove content_slots field from prediction classification event (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[10:27:14] * elukey lunch!
[10:38:39] <wikibugs>	 (03PS1) 10AikoChou: revert-risk: change output schema and add model version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175
[10:44:18] * klausman lunch
[10:51:33] <wikibugs>	 (03CR) 10AikoChou: "output example:" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou)
[10:51:56] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[10:55:47] <wikibugs>	 (03CR) 10AikoChou: [C: 03+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[10:56:19] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[11:02:37] <wikibugs>	 (03Merged) 10jenkins-bot: events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[11:05:07] <wikibugs>	 (03PS42) 10Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - 10https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[11:05:44] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing (032 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[11:12:29] <wikibugs>	 (03PS10) 10Ilias Sarantopoulos: feat: use Lift Wing instead of ORES (2) - with one Scorefetcher [extensions/ORES] - 10https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170)
[12:20:59] <dcausse>	 o/ I looked at the stream populated in eqiad.mediawiki.revision_score_drafttopic, quickly checking it does not seem to have the meta.domain field being set is this on purpose?
[12:48:06] <wikibugs>	 (03CR) 10Klausman: revert-risk: change output schema and add model version (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou)
[13:03:53] <elukey>	 dcausse: o/ no real reason, we can definitely check and add it if it is necessary!
[13:05:27] <dcausse>	 elukey: thanks, good to know, we might need it perhaps, some search components might expect this field to be properly set
[13:07:37] <wikibugs>	 (03CR) 10Elukey: revert-risk: change output schema and add model version (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou)
[13:21:35] <aiko>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/929335 hope this works!
[13:24:22] <elukey>	 aiko: go ahead!
[13:51:57] <wikibugs>	 10Machine-Learning-Team, 10Data-Engineering, 10Event-Platform Value Stream: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker)
[13:57:38] <aiko>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/929342 didn't notice that the template has been changed  :D
[13:59:09] <aiko>	 elukey: thanks for merging :)
[14:01:36] <wikibugs>	 10Machine-Learning-Team, 10Data-Engineering, 10Event-Platform Value Stream: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ottomata) > Design schema for output topic I believe this should be done.  @achou...
[14:21:35] <aiko>	 yayyy I saw the test outlink event being posted to the target Kafka topic!
[14:23:29] <elukey>	 niceeeeeeeeeee
[14:23:31] <wikibugs>	 (03PS11) 10Ilias Sarantopoulos: feat: use Lift Wing instead of ORES (2) - with one Scorefetcher [extensions/ORES] - 10https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170)
[14:23:37] <aiko>	 but weird still see the canary events posted there every 15min
[14:23:40] <klausman>	 nice work!
[14:28:17] <aiko>	 if you run kafkacat -C -t codfw.mediawiki.page_outlink_topic_prediction_change -b kafka-main1001.eqiad.wmnet:9093 -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt on stat100x, you will see
[14:29:03] <elukey>	 so change prop doesn't filter those as we expected?
[14:30:00] <aiko>	 I think they are not from liftwing
[14:30:47] <elukey>	 ah okok, so in theory we may get some of them by the probing that comes from DE's infra
[14:38:28] <aiko>	 ok that makes sense.. I recall that I set the canary_events_enabled to be true here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/ext-EventStreamConfig.php#1373
[14:39:31] <aiko>	 should we set it to false?
[14:42:51] <elukey>	 aiko: probably yes, good catch
[14:44:36] <klausman>	 Do the Kafka events (in general, not just ours) have metadata, e.g. the IP the sent them?
[14:45:07] <elukey>	 I don't think so
[14:47:20] <klausman>	 Hm. Would be useful in cases like this, when debugging. But I guess for high-throughput message queues, you want as little overhead (and extra allocations) as possible.
[14:47:48] <elukey>	 but these meta infos are also a privacy concern, in general
[14:47:57] <klausman>	 Of course.
[14:48:11] <elukey>	 you can add any detail to a json msg sent to kafka, like we do for webrequest (where the ip is there etc..)
[14:49:38] <klausman>	 Yeah, I was about to say that in the debug case, you might as well add the extra data client-side
[14:50:19] <elukey>	 ack
[14:50:37] <elukey>	 so for batching, I can confirm that our model servers would need to change in order to support it
[14:51:56] <elukey>	 torch serve supports it afaics https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
[14:52:42] <elukey>	 and kserve supports it, but we'd need to import its docker image first
[14:56:26] <klausman>	 How big is that?
[14:57:13] <elukey>	 not sure, I am trying to find it first
[14:57:35] <elukey>	 kserve has the concept of ClusterServingRuntime, and there is a kserve-torchserve
[14:57:52] <elukey>	 https://hub.docker.com/r/pytorch/torchserve-kfs/tags
[14:57:59] <elukey>	 the image is around 2/3G
[14:58:27] <klausman>	 ang 6G with GPU support. Though I suspect that's nvidia-only
[14:58:52] <klausman>	 yep, only nv
[14:59:29] <elukey>	 but we cannot use those, so we can build our own with AMD supprot
[14:59:37] <elukey>	 https://github.com/pytorch/serve/blob/master/docker/Dockerfile is the upstream one
[15:00:36] <klausman>	 btw, I played with rocm on my private Linux machine on the weekend. But that GPU is just too old and feeble to do anything useful
[15:01:00] <klausman>	 It's only mildly faster than the (much *much* newer) CPU
[15:06:21] <wikibugs>	 10Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (10kevinbazira)
[15:11:32] <isaranto>	 regarding the inference batcher another thing we ought to support is batch inference (in a single request) which is as common use case as well
[15:13:31] <isaranto>	 with the prepackaged model servers and v2 of inference api it is done like this -> https://kserve.github.io/website/0.10/get_started/first_isvc/#5-perform-inference
[15:14:08] <isaranto>	 in our case it would be done by modifying the custom model servers to accept a list of samples (1 or many)
[15:15:26] <elukey>	 isaranto: it is not as simple as that, because if we run predict sequentially on all "instances" sent in the batch we'll not add any meaningful boost in my opinion
[15:15:50] <elukey>	 the torchserve and triton server may have more logic to better support this use case, especially with gpus
[15:16:32] <klausman>	 parts of the batch (e.g. WMAPI req's) might be more parallelizable than others
[15:17:02] <elukey>	 but what is the use case for non-gpu-related batch requests? Asking thinking out loud
[15:17:05] <isaranto>	 I agree. I'm not saying that one contradicts the other but it is quite common use case to want to get a couple of predictions with the same request
[15:17:06] <elukey>	 is the a benefic?
[15:17:09] <elukey>	 *benefit
[15:17:37] <elukey>	 isaranto: sure sure, but a user could simply call the api X times, rather than passing a complex request etc..
[15:18:14] <klausman>	 I could see a benefit if the processing was short and thus the "establish connection" overhead was a larger part of total request time, but I don't think that's really the case for us (yet)
[15:18:33] <klausman>	 (cf. HTTP pipelining)
[15:19:30] <elukey>	 but then you have all side effects of pipelining, like head of line blocking etc..
[15:19:49] <klausman>	 yeah, as I said, I am not 100% sure it'd be worth it
[15:19:59] <elukey>	 it may be worth it in the GPU context
[15:20:14] <klausman>	 That is true, it's sortof a parallelizable backend
[15:20:54] <wikibugs>	 10Machine-Learning-Team, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Gehel) Removing Search Platform, our work here is done.
[15:22:19] <wikibugs>	 10Machine-Learning-Team, 10Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (10elukey) I tried today to add the batcher functionality in a kserve pod, and the new pod was created nicely (the `agent` docker image that we already have on our docker registry can act also as batcher...
[15:26:34] <elukey>	 added my thoughts to --^, please add more 
[15:30:37] <klausman>	 done
[15:30:41] <wikibugs>	 10Machine-Learning-Team, 10Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (10klausman) A few thoughts:  - parts of the batch (e.g. WMAPI req's) might be more parallelizable than others - if the processing was short and thus the "establish connection" overhead was a larger part...
[16:14:48] <wikibugs>	 10Machine-Learning-Team, 10Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (10isarantopoulos) Some thoughts as well: When needed batch inference offer the following speed up: e.g. it takes roughly the same amount of time to generate predictions for many samples as it takes to d...
[16:47:49] * elukey afk!
[17:04:28] <wikibugs>	 10Machine-Learning-Team, 10Data-Engineering, 10Event-Platform Value Stream: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10lbowmaker)
[21:05:11] <wikibugs>	 10Machine-Learning-Team, 10CirrusSearch, 10Discovery-Search (Current work): Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (10EBernhardson) Looked into this, it looks like progress is being made but it's not quite ready for us to pickup. The event stre...