[07:02:11] Machine-Learning-Team: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) >>! In T320374#8308764, @achou wrote: > @elukey Is this exception raised by a deleted page (`badrevids` error)? Not sure since there are no logs about it, this is why I want to add some use cases t...
[07:58:24] (CR) Elukey: [C: +1] Remove directories and scripts that are not used in production [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841141 (https://phabricator.wikimedia.org/T320494) (owner: AikoChou)
[08:13:26] (PS1) Elukey: extractor_utils: add fetch_features function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374)
[08:14:43] (PS2) Elukey: extractor_utils: add fetch_features function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374)
[09:03:12] found https://www.benthos.dev/docs/components/outputs/fallback
[09:03:17] really interesting
[09:04:31] \o Morning!
[09:04:52] elukey: sounds like that would be a nice way to prevent events getting lost, but I wonder what that fallback would be
[09:05:20] klausman: morning! I think kafka is the best choice
[09:05:36] I am currently using the "file" fallback at the moment that works nicely
[09:05:38] Or is it intended to be used like: "Here's a list of kafka endpoints, the first is a LB/LVS kinda address, and then we list individual nodes?
[09:05:46] (so I can repro rev-ids causing issues etc.)
[09:06:32] klausman: IIUC we use http_client first, and if the http call fails (like HTTP 500) then we push to kafka
[09:06:40] you can specify brokers etc..
[09:06:41] yeah, having local halt-and-dump-on-error is nice.
[09:07:01] Ah yes, that makes sense.
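Editor's sketch of the fallback behaviour discussed above (try http_client first, and only if that call fails hand the message to the next output such as kafka or a file). This is not Benthos code or its API; `send_with_fallback`, `failing_http_client` and `file_output` are hypothetical names used only to illustrate the ordering, and the in-memory list stands in for a real file or Kafka topic.

```python
from typing import Callable, List, Sequence

# A "sender" is any callable that delivers one message or raises on failure.
Sender = Callable[[bytes], None]


def send_with_fallback(message: bytes, outputs: Sequence[Sender]) -> int:
    """Try each output in order and return the index of the first one that succeeds.

    Raises RuntimeError if every output fails, so the message is never
    dropped silently.
    """
    errors: List[Exception] = []
    for index, send in enumerate(outputs):
        try:
            send(message)
            return index
        except Exception as exc:  # any failure moves on to the next output
            errors.append(exc)
    raise RuntimeError(f"all {len(outputs)} outputs failed: {errors}")


# Hypothetical stand-ins for the real outputs mentioned in the chat.
def failing_http_client(message: bytes) -> None:
    """Simulates the model server answering with an HTTP 500."""
    raise ConnectionError("HTTP 500 from model server")


saved_messages: List[bytes] = []


def file_output(message: bytes) -> None:
    """Simulates the "file" fallback by keeping the message in memory."""
    saved_messages.append(message)
```

Calling `send_with_fallback(event, [failing_http_client, file_output])` would leave the event in `saved_messages`, which mirrors the use case mentioned in the chat of keeping failing rev-ids around to reproduce issues later.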
[09:07:17] I am currently processing en|zh wiki messages from revision-create on staging (goodfaith), all works nicely
[09:07:21] there are still some little bugs
[09:07:39] Aren't there always :)
[09:08:00] (PS3) Elukey: extractor_utils: add fetch_features function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374)
[09:08:30] But it's nice to know that we went from nothing but knowing of Benthos to something very credible in such a short amount of time. Speaks well of Benthos (not to knock your brains, Luca :D)
[09:08:44] ahahah yes yes
[09:08:52] it looks really promising
[09:09:09] yesterday Filippo opened https://phabricator.wikimedia.org/T320468
[09:09:34] the patch is easy but istio upstream removed the branch used to build the 1.9.5 release sigh
[09:09:49] Guh. Not even tags of old releases left?
[09:10:16] there is a tag for 1.9.5, but IIRC they created a branch since there was a patch needed on top of it
[09:10:24] there is the 1.9.8 branch though
[09:10:32] that could be usable
[09:10:40] (but we'd need to import istioctl too probably etc..)
[09:10:57] ah, chasing deps, as always
[09:11:13] At least it's statically linked
[09:14:03] (CR) Elukey: "An example of rev-id leading to a TextDeleted response is 1115335945 for enwiki." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:18:16] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) ` output: label: "liftwing" fallback: - http_client: url: 'https://inference-staging.svc.codfw.wmnet:30443/v1/models/${! json("revision_create_event.database")...
[09:22:26] mmm the avg latencies of our models are not great
[09:22:51] How so?
[09:22:57] no idea
[09:23:05] maybe it is staging
[09:23:07] https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?from=now-6h&orgId=1&to=now&var-backend=enwiki-goodfaith-predictor-default-82kxn-private&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revscoring-editquality-goodfaith&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&var-response_code=All
[09:23:13] but p50 is not great
[09:23:33] Oh, ow.
[09:24:19] And that's at just 60qps
[09:24:45] the total traffic above is probably misleading I think, need to fix it
[09:24:55] it is the same as the istio gws dashboard
[09:25:21] the graph below shows some rps, not 60
[09:26:59] well, with even fewer requests, the latencies being that high is much worse. Unless it's a sampling or computation error.
[09:27:42] spot checking in the kserve-container logs I see latencies of seconds, but around 6/7
[09:27:49] (maximum)
[09:28:04] (CR) Klausman: [C: +1] extractor_utils: add fetch_features function (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:28:27] sometimes p99 for api-ro.discovery.wmnet goes up to seconds as well
[09:28:43] but yeah maybe there is something to improve on the model server side
[09:28:43] [mumbles something about DNS]
[09:28:44] we'll see
[09:32:07] (CR) AikoChou: [C: +1] extractor_utils: add fetch_features function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:32:14] (CR) Elukey: extractor_utils: add fetch_features function (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:32:35] thanks aiko and klausman for the review :)
[09:35:51] (CR) Elukey: [C: +2] extractor_utils: add fetch_features function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:43:25] (Merged) jenkins-bot: extractor_utils: add fetch_features function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841855 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:44:53] (PS5) AikoChou: Remove directories and scripts that are not used in production [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841141 (https://phabricator.wikimedia.org/T320494)
[09:45:32] (CR) AikoChou: [C: +2] Remove directories and scripts that are not used in production [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841141 (https://phabricator.wikimedia.org/T320494) (owner: AikoChou)
[09:53:16] (Merged) jenkins-bot: Remove directories and scripts that are not used in production [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841141 (https://phabricator.wikimedia.org/T320494) (owner: AikoChou)
[10:15:26] deployed the new editquality images (manually) on staging, let's see how it goes now
[10:22:24] (PS1) Elukey: draftquality: use the fetch_feature shared funtion [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374)
[10:22:54] (PS2) Elukey: draftquality: use the fetch_features shared funtion [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374)
[10:24:12] (PS3) Elukey: draftquality: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374)
[10:24:46] (PS1) Elukey: topic: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841889 (https://phabricator.wikimedia.org/T320374)
[10:26:22] aaand filed the rest of the code reviews to use the fetch_features shared function
[10:34:14] going afk for lunch, I'll do some errands afterwards so I'll come back a little later
[12:12:25] (CR) AikoChou: [C: +1] draftquality: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[12:12:44] (CR) AikoChou: [C: +1] topic: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841889 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[13:17:36] (CR) Klausman: [C: +1] topic: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841889 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[13:44:55] Morning all
[13:45:07] heyo Chris, how isit?
[13:45:27] all good, how are you?
[13:46:44] Much better.
[13:47:22] Doc says I had a virus of some sort (not Noro, but similar) on the weekend and recovery was delayed because I was running very low on hydration and electrolytes. Got some Swiss equivalent of Gatorade and that worked wonders.
[14:20:13] (CR) Kevin Bazira: [C: +1] draftquality: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:20:55] (CR) Kevin Bazira: [C: +1] topic: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841889 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:21:05] thanks all for the reviews!
[14:32:58] (CR) Elukey: [C: +2] draftquality: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:38:06] (Merged) jenkins-bot: draftquality: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841888 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:41:38] (CR) Elukey: [C: +2] topic: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841889 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:42:07] tested locally the new docker images, all good (draft/topic), +2ed and then I'll send a code change later on to update deployment-charts
[14:48:24] (Merged) jenkins-bot: topic: use the fetch_features shared function [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/841889 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[15:21:10] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[15:21:23] Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) Open→Resolved
[15:23:03] Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) Resolved?!
[15:26:57] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (calbon)
[15:28:08] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (calbon) Stalled→Resolved
[15:42:21] Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) >>! In T301878#8311944, @Ottomata wrote: > Resolved?! Yes sorry we were grooming and I was supposed to add some wrap-...
[15:43:33] Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) Cool! Just curious as to the 'deployed/prod' state of the thing. Sounds like WIPs work, but prod things still not...
[15:48:02] Lift-Wing, Data Engineering Planning, Event-Platform Value Stream, Epic, and 2 others: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) Nono you can now hit Lift Wing with a request carrying a revision-create event and a correspondent revision-score even...
[16:06:31] deployed the new istio images to ml-serve-codfw
[16:06:32] https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=h@94b69b2&_a=h@af9184c
[16:06:39] the logspam seems to have stopped
[16:15:40] rolled out to all clusters
[16:30:41] and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/841947 to update all model servers with the new docker images
[16:31:16] going afk for today folks!
[16:31:21] have a good rest of the day!
[19:22:31] (CR) Thiemo Kreuz (WMDE): "This change is ready for review." [extensions/ORES] - https://gerrit.wikimedia.org/r/836827 (owner: Thiemo Kreuz (WMDE))