[07:15:08] Machine-Learning-Team: Investigate high latencies registered by the ml-serve api control plane - https://phabricator.wikimedia.org/T310073 (elukey) Open→Resolved a:elukey This turned out to be a problem with the DNS control plane, see T318814
[08:10:26] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (kostajh) >>! In T304549#8341630, @Trizek-WMF wrote: > Thank you @kostajh. > > Let's put on hold wikis with too...
[08:14:36] hi folks
[08:14:45] going to roll out the new docker images in a few
[08:14:53] prepping the deployment-charts change
[08:41:02] (PS1) AikoChou: revertrisk: add revertrisk model server and pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594)
[08:46:19] ---^ tested in local docker and ml-sandbox
[08:56:04] super, will review it in a bit
[08:59:16] elukey: I don't use a process pool for this model, because with the async session, I already get decent performance. I tried using a process pool, but the latency wasn't good
[08:59:33] (CR) Hashar: "recheck after CI config deployment https://gerrit.wikimedia.org/r/c/integration/config/+/849480" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[08:59:50] aiko: what do you mean that the latency wasn't good? No improvements?
[09:00:18] elukey: it got worse
[09:00:41] aiko: with wrk/load-test?
[09:00:48] maybe it was only in the sandbox
[09:00:52] it shouldn't really get worse
[09:01:16] elukey: yes it was in sandbox with wrk
[09:02:12] aiko: did you try to change the number of processes? Like reducing it etc..
I am wondering if it is due to cpu throttling
[09:03:20] elukey: the avg latency without process pool was around 120-164ms (#connections from 1 to 20), but the avg latency with process pool increased to 710ms
[09:04:53] elukey: maybe
[09:04:56] (created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/849485/ to update all the docker images)
[09:06:14] elukey: I didn't increase cpu resource to the pod either, so maybe it's cpu throttling
[09:09:35] if lang not in self.model.supported_wikis
[09:09:37] \o/
[09:11:35] :D
[09:12:59] (CR) Elukey: "Really great work! Added a few comments but we are basically ready to go. I'd love more testing related to the usage of the process pool f" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[09:13:15] thanks for the review :)
[09:28:43] rolled out the new docker images to all ml-serve-codfw's good faith model servers
[09:28:51] will re-run benthos to see performancse
[09:28:53] *perfs
[09:37:19] aiko: really interesting https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?from=now-6h&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-pod=All&viewPanel=333
[09:37:27] this is goodfaith wikidata with the process pool
[09:37:38] a lot more cpu usage, but clearly throttling
[09:37:57] the red bar at 4s is probably fixed for all, we have way less
[09:38:11] but so far it seems that more processing is happening
[09:46:08] elukey: how to tell it's throttling? because the value is negative?
[09:54:20] aiko: yeah it is displayed as negative for convenience (see "Throttled" selector in red below the graph in the legen)
[09:54:23] *legend
[09:56:10] Morning \o
[09:56:41] elukey: still no joy with pcc
[09:56:51] Same "No fact file" error as yesterday
[10:02:23] I've tried https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_cloud but the first command fails with no access (conn closed) and the second doesn't seem to help
[10:02:34] taavi: Any idea what I'm missing?
[10:09:21] klausman: you can ask in #wikimedia-cloud for some help
[10:11:40] or possibly #wikimedia-sre
[10:13:58] (PS2) AikoChou: revertrisk: add revertrisk model server and pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594)
[10:15:42] (CR) AikoChou: revertrisk: add revertrisk model server and pipeline (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[10:29:01] (CR) Elukey: revertrisk: add revertrisk model server and pipeline (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[10:31:48] klausman: any luck with pcc?
[10:33:34] Nothing, so far (also was working on some more AWS/NLLB mail back-and-forth :-/)
[10:34:09] I'll ask for advice in wmf-cloud
[10:42:17] * elukey lunch
[10:42:28] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (kostajh) @Trizek-WMF do you want me to enable these wikis (excluding bugwiki and bpywiki) later today, at 13:00...
[11:33:29] <- lunch and a few errands
[12:11:47] (CR) AikoChou: revertrisk: add revertrisk model server and pipeline (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[12:43:56] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (Trizek-WMF) @kostajh, yet, let's turn it on for all wikis, except bugwiki and bpywiki.
[13:33:21] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) Model evaluation has been completed and below are the backtesting results: | | Precision@0.5 | Recall@0.5 |cbk_zamwiki | 0.90 | 0.65 |cdo...
[13:34:21] (CR) Elukey: [C: +1] "Left a comment about the error msg, but after that feel free to merge!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[13:42:38] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (kostajh)
[13:43:23] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (kostajh) >>! In T304549#8345510, @Trizek-WMF wrote: > @kostajh, yet, let's turn it on for all wikis, except bugw...
[14:19:12] (CR) AikoChou: revertrisk: add revertrisk model server and pipeline (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[14:19:33] (PS3) AikoChou: revertrisk: add revertrisk model server and pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594)
[14:22:03] (CR) AikoChou: [C: +2] "Thanks for the review! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[14:28:54] (Merged) jenkins-bot: revertrisk: add revertrisk model server and pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/849478 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[15:53:39] Machine-Learning-Team, Add-Link, Growth-Team, CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (Trizek-WMF)
[15:59:02] Lift-Wing, Machine-Learning-Team: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (achou) The model has been uploaded to Thanos Swift: ` aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/experimental/revertrisk/20221026144108/ 2022-10-26...
[16:32:08] going afk!
Have a nice rest of the day folks
[16:35:26] \o
[17:11:37] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (Trizek-WMF)
[17:11:46] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (Trizek-WMF) bug.wp and bpy.wp are now in {T309263}.
[17:13:10] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (Trizek-WMF) a:Trizek-WMF→None I let QA close the task when done.
[17:18:46] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 (Tgr) We could also just lower the score threshold or minimum link count for wikis which get very few suggestions.
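
Editor's note on the 09:46 question ("how to tell it's throttling?"): besides reading the Grafana panel, one rough way to check is to compare the throttled-period count with the total period count that the Linux CFS bandwidth controller exposes in a container's cgroup `cpu.stat` file. The sketch below only parses text in that format; the function name `throttled_fraction` and the sample numbers are hypothetical and were not taken from the log, and it is an assumption that the dashboard mentioned above is built from equivalent counters.

```python
def throttled_fraction(cpu_stat_text: str) -> float:
    """Return nr_throttled / nr_periods parsed from cgroup cpu.stat text.

    Both cgroup v1 and v2 expose the nr_periods and nr_throttled counters.
    A result close to 1.0 means the container hit its CPU quota in almost
    every enforcement period.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Keep only well-formed "<key> <integer>" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No enforcement periods recorded (for example, no CPU limit set).
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical sample, not real data from the ml-serve pods.
SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 4000000000
"""

if __name__ == "__main__":
    print(f"throttled in {throttled_fraction(SAMPLE):.0%} of periods")
```

In practice the text would come from the pod's own cgroup directory, and the counters are cumulative, so comparing two samples taken some time apart gives a better picture than a single read.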