[08:32:50] o/ have a great week start every1 :)
[08:33:52] morning :D
[08:33:57] \o
[08:34:02] was it a prompt for bloom-3b? :D
[08:34:34] nope but I can try!
[08:35:04] I could try "write a message for wishing everyone a great start of the week"
[08:35:05] hehe
[08:35:25] so u know it was mine otherwise it would be much better
[08:56:08] o/ morning :)
[08:56:33] 'ello :)
[09:05:09] (PS2) AikoChou: events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899)
[09:06:42] (CR) Elukey: events: remove content_slots field from prediction classification event (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[09:12:41] (CR) AikoChou: events: remove content_slots field from prediction classification event (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[09:24:45] klausman: o/ Do you have time to work on the documentation/test updates for the api gateway tier stuff?
[09:25:03] yep, that's the plan for today/this week
[09:25:13] I should really get better at the standup bot thing
[09:25:20] ack
[09:25:54] klausman: can you also update the status of the slack thread with enterprise when some change happens so they are aware?
[09:26:08] will do
[09:48:16] (PS3) AikoChou: events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899)
[10:19:03] (CR) Elukey: [C: +1] events: remove content_slots field from prediction classification event (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[10:27:14] * elukey lunch!
[10:38:39] (PS1) AikoChou: revert-risk: change output schema and add model version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929175
[10:44:18] * klausman lunch
[10:51:33] (CR) AikoChou: "output example:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929175 (owner: AikoChou)
[10:51:56] (CR) Kevin Bazira: [C: +1] events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[10:55:47] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[10:56:19] (CR) Ilias Sarantopoulos: [C: +1] events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[11:02:37] (Merged) jenkins-bot: events: remove content_slots field from prediction classification event [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928583 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[11:05:07] (PS42) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170)
[11:05:44] (CR) Ilias Sarantopoulos: feat: hardcode threshold calls to switch to Lift Wing (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[11:12:29] (PS10) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES (2) - with one Scorefetcher [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170)
[12:20:59] o/ I looked at the stream populated in eqiad.mediawiki.revision_score_drafttopic; quickly checking, it does not seem to have the meta.domain field set. Is this on purpose?
[12:48:06] (CR) Klausman: revert-risk: change output schema and add model version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929175 (owner: AikoChou)
[13:03:53] dcausse: o/ no real reason, we can definitely check and add it if it is necessary!
[13:05:27] elukey: thanks, good to know, we might need it perhaps, some search components might expect this field to be properly set
[13:07:37] (CR) Elukey: revert-risk: change output schema and add model version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929175 (owner: AikoChou)
[13:21:35] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/929335 hope this works!
[13:24:22] aiko: go ahead!
[13:51:57] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker)
[13:57:38] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/929342 didn't notice that the template has been changed :D
[13:59:09] elukey: thanks for merging :)
[14:01:36] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (Ottomata) > Design schema for output topic I believe this should be done. @achou...
[14:21:35] yayyy I saw the test outlink event being posted to the target Kafka topic!
[14:23:29] niceeeeeeeeeee
[14:23:31] (PS11) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES (2) - with one Scorefetcher [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170)
[14:23:37] but weird, I still see the canary events posted there every 15min
[14:23:40] nice work!
[14:28:17] if you run kafkacat -C -t codfw.mediawiki.page_outlink_topic_prediction_change -b kafka-main1001.eqiad.wmnet:9093 -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt on stat100x, you will see
[14:29:03] so change prop doesn't filter those as we expected?
[14:30:00] I think they are not from liftwing
[14:30:47] ah okok, so in theory we may get some of them via the probing that comes from DE's infra
[14:38:28] ok that makes sense.. I recall that I set canary_events_enabled to true here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/ext-EventStreamConfig.php#1373
[14:39:31] should we set it to false?
[14:42:51] aiko: probably yes, good catch
[14:44:36] Do the Kafka events (in general, not just ours) have metadata, e.g. the IP that sent them?
[14:45:07] I don't think so
[14:47:20] Hm. Would be useful in cases like this, when debugging. But I guess for high-throughput message queues, you want as little overhead (and extra allocations) as possible.
[14:47:48] but this meta info is also a privacy concern, in general
[14:47:57] Of course.
[14:48:11] you can add any detail to a JSON msg sent to Kafka, like we do for webrequest (where the IP is there etc..)
[14:49:38] Yeah, I was about to say that in the debug case, you might as well add the extra data client-side
[14:50:19] ack
[14:50:37] so for batching, I can confirm that our model servers would need to change in order to support it
[14:51:56] torch serve supports it afaics https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
[14:52:42] and kserve supports it, but we'd need to import its docker image first
[14:56:26] How big is that?
[14:57:13] not sure, I am trying to find it first
[14:57:35] kserve has the concept of ClusterServingRuntime, and there is a kserve-torchserve
[14:57:52] https://hub.docker.com/r/pytorch/torchserve-kfs/tags
[14:57:59] the image is around 2-3G
[14:58:27] and 6G with GPU support. Though I suspect that's nvidia-only
[14:58:52] yep, only nv
[14:59:29] but we cannot use those, so we can build our own with AMD support
[14:59:37] https://github.com/pytorch/serve/blob/master/docker/Dockerfile is the upstream one
[15:00:36] btw, I played with rocm on my private Linux machine on the weekend.
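The client-side debug metadata idea from the 14:48-14:49 exchange could look roughly like the sketch below: the event envelope and field names (`debug_client_host`, `request_id`) are hypothetical illustrations, not the actual webrequest or Event Platform schema, and the actual produce call to Kafka (e.g. via a Kafka client library) is left out.

```python
import json
import socket
import uuid
from datetime import datetime, timezone

def build_event(payload: dict) -> bytes:
    """Wrap a payload in an envelope with client-side debug metadata.

    Hypothetical sketch: these meta field names are invented for
    illustration and are not part of any WMF event schema.
    """
    event = dict(payload)
    event["meta"] = {
        "dt": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        # Client-side-only debug info; a privacy-sensitive field like
        # this would need to be justified or stripped in production.
        "debug_client_host": socket.gethostname(),
    }
    return json.dumps(event).encode("utf-8")

# The resulting bytes would be the Kafka message value.
msg = build_event({"model_name": "outlink-topic-model", "prediction": ["Culture"]})
decoded = json.loads(msg)
```

Since the broker itself stores no sender identity, anything needed for debugging has to travel inside the message body like this.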
But that GPU is just too old and feeble to do anything useful
[15:01:00] It's only mildly faster than the (much *much* newer) CPU
[15:06:21] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (kevinbazira)
[15:11:32] regarding the inference batcher, another thing we ought to support is batch inference (in a single request), which is a common use case as well
[15:13:31] with the prepackaged model servers and v2 of the inference api it is done like this -> https://kserve.github.io/website/0.10/get_started/first_isvc/#5-perform-inference
[15:14:08] in our case it would be done by modifying the custom model servers to accept a list of samples (1 or many)
[15:15:26] isaranto: it is not as simple as that, because if we run predict sequentially on all "instances" sent in the batch we'll not add any meaningful boost in my opinion
[15:15:50] the torchserve and triton servers may have more logic to better support this use case, especially with gpus
[15:16:32] parts of the batch (e.g. WMAPI req's) might be more parallelizable than others
[15:17:02] but what is the use case for non-gpu-related batch requests? Asking thinking out loud
[15:17:05] I agree. I'm not saying that one contradicts the other, but it is a quite common use case to want to get a couple of predictions with the same request
[15:17:06] is there a benefit?
[15:17:37] isaranto: sure sure, but a user could simply call the api X times, rather than passing a complex request etc..
[15:18:14] I could see a benefit if the processing was short and thus the "establish connection" overhead was a larger part of total request time, but I don't think that's really the case for us (yet)
[15:18:33] (cf. HTTP pipelining)
[15:19:30] but then you have all the side effects of pipelining, like head of line blocking etc..
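The "accept a list of samples (1 or many)" change to the custom model servers (15:14:08) could be sketched as below. The request shape follows the `{"instances": [...]}` style from the linked KServe getting-started example; the scoring logic is a stand-in, not a real Lift Wing model, and the function name is illustrative.

```python
# Hypothetical sketch of a custom model server's predict handler that
# accepts one or many samples in a single request.

def predict(request: dict) -> dict:
    instances = request["instances"]
    # Naive implementation: score each instance sequentially. As noted
    # at 15:15:26, this gives no real speedup over N separate requests
    # unless the underlying model can process the whole batch in one
    # forward pass (e.g. on a GPU).
    predictions = [{"score": len(text)} for text in instances]
    return {"predictions": predictions}

resp = predict({"instances": ["some wiki text", "another revision"]})
```

The point of contention in the chat is exactly the loop above: without a vectorized backend, batching only saves per-request HTTP overhead.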
[15:19:49] yeah, as I said, I am not 100% sure it'd be worth it
[15:19:59] it may be worth it in the GPU context
[15:20:14] That is true, it's sort of a parallelizable backend
[15:20:54] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Gehel) Removing Search Platform, our work here is done.
[15:22:19] Machine-Learning-Team, Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (elukey) I tried today to add the batcher functionality in a kserve pod, and the new pod was created nicely (the `agent` docker image that we already have on our docker registry can act also as batcher...
[15:26:34] added my thoughts to --^, please add more
[15:30:37] done
[15:30:41] Machine-Learning-Team, Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (klausman) A few thoughts: - parts of the batch (e.g. WMAPI req's) might be more parallelizable than others - if the processing was short and thus the "establish connection" overhead was a larger part...
[16:14:48] Machine-Learning-Team, Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (isarantopoulos) Some thoughts as well: When needed, batch inference offers the following speed up: e.g. it takes roughly the same amount of time to generate predictions for many samples as it takes to d...
[16:47:49] * elukey afk!
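The speedup claim in the 16:14:48 comment on T335480 (one batch of N samples costs roughly one forward pass on a vectorized backend) can be made concrete with a back-of-the-envelope model. The cost numbers and batch size below are invented for illustration.

```python
# Illustrative cost model for sequential vs. batched inference on a
# backend (e.g. a GPU) where one forward pass handles a whole batch.

FORWARD_PASS_MS = 50   # hypothetical cost of one model forward pass
MAX_BATCH_SIZE = 32    # hypothetical batcher limit

def sequential_cost_ms(n_samples: int) -> int:
    # One forward pass per sample: cost grows linearly with N.
    return n_samples * FORWARD_PASS_MS

def batched_cost_ms(n_samples: int, batch_size: int = MAX_BATCH_SIZE) -> int:
    # One forward pass per (full or partial) batch.
    n_batches = -(-n_samples // batch_size)  # ceiling division
    return n_batches * FORWARD_PASS_MS

# Under these assumptions, 100 samples cost 5000 ms sequentially but
# only 200 ms in batches of 32 (4 forward passes).
```

This is the GPU case both elukey and klausman flagged; for a CPU model server that loops over samples internally, the two curves collapse into one.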
[17:04:28] Machine-Learning-Team, Data-Engineering, Event-Platform Value Stream: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker)
[21:05:11] Machine-Learning-Team, CirrusSearch, Discovery-Search (Current work): Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (EBernhardson) Looked into this, it looks like progress is being made but it's not quite ready for us to pickup. The event stre...