[06:58:56] <isaranto> o/ good morning!
[07:09:17] <kevinbazira> o/ kalimera
[07:09:29] <kevinbazira> thanks for the review, Ilias!
[07:09:55] <isaranto> o/ kevin
[07:10:01] <kevinbazira> I am going to deploy the model-servers that rely on the updated events module one-by-one
[07:10:12] <isaranto> np I have a patch for you as well https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131323
[07:11:02] <isaranto> I'll submit another one for the api gateway afterwards but for that one we'll need Tobias to deploy
[07:22:16] <kevinbazira> right! I've +1'd the patch.
[07:22:17] <kevinbazira> are there tests currently running on the edit-check endpoint? if so, will both the `edit-check-staging` patch and the APIGW one be deployed at the same time?
[07:29:23] <isaranto> I'll deploy the change for the service now and later today we can deploy the one I just opened for API GW https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1132534
[07:29:39] <isaranto> + I'm opening one now to fix the load tests to match the staging name
[07:33:07] <wikibugs> (PS1) Ilias Sarantopoulos: locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817)
[07:33:10] <isaranto> done!
[07:36:09] <wikibugs> (CR) Kevin Bazira: [C:+1] locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[07:38:27] <wikibugs> (CR) Ilias Sarantopoulos: [C:+2] locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[07:49:24] <kevinbazira> article-country deployed. outlink predictor next: https://gerrit.wikimedia.org/r/1132537
[08:13:30] <isaranto> I've +1. shall we also update the transformer image to have an up2date deployment?
[08:44:01] <kevinbazira> sure sure ... I've updated the patch with the transformer image too
[08:52:08] <isaranto> thanks!
[08:53:18] <wikibugs> (CR) Ilias Sarantopoulos: [V:+2 C:+2] locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[08:53:40] <wikibugs> (PS14) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[10:19:59] * isaranto lunch!
[10:20:33] <wikibugs> Lift-Wing, Machine-Learning-Team, Patch-For-Review: LiftWing model-servers log improper JSON in stderr - https://phabricator.wikimedia.org/T389768#10693087 (kevinbazira)
[10:24:42] <klausman> ditto :)
[10:26:42] <kevinbazira> outlink deployed. will deploy RRLA once the event stream is in prod.
[11:39:33] <wikibugs> (PS15) Ilias Sarantopoulos: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:41:20] <wikibugs> (CR) Ilias Sarantopoulos: "Resolving the previous comments as all have been implemented" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:44:03] <wikibugs> (PS16) Ilias Sarantopoulos: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:44:28] <isaranto> aiko: the above patch is now ready for review. I have tested it as well locally
[12:10:12] <wikibugs> (PS17) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:10:47] <isaranto> klausman: let me know if you can deploy the api gw patch sometime today https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1132534
[12:10:49] <isaranto> thanks!
[12:12:46] <klausman> yeah, I was about to do that :)
[12:13:18] <aiko> isaranto: alright! I'll review it
[12:13:37] <isaranto> great, thank you both!
[12:14:55] <isaranto> I'm following up on an alert we got on saturday for reference-need and I am seeing this chart for a pod that worries me https://grafana.wikimedia.org/goto/4yxH_8THR?orgId=1
[12:15:35] <isaranto> memory usage is increasing which likely indicates that there is a memory leak. this seems consistent in all pods
[12:16:04] <klausman> It seems it did something similar before (go to "2 days") yesterday 9am-noon
[12:16:49] <isaranto> I increased memory limits/requests on saturday as I saw the same thing happening
[12:18:08] <klausman> Think it might be a memory leak?
[12:18:59] <isaranto> this would be my guess. Something we missed when adding multiprocessing to the service
[12:26:49] <isaranto> my assumption is that the process pool isn't managed properly and a process that has died isn't shut down properly so it still occupies memory - which means that we load the model once more in the new process that is spawned
[12:27:03] <isaranto> taking a quick look and opening up a task
[12:37:27] <wikibugs> Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10693471 (isarantopoulos) We are no longer getting 500s as before so the stability has improved BUT the overall latency of the service is stil...
[12:56:09] <wikibugs> Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10693506 (isarantopoulos) There is an increasing memory consumption which ends up in pods getting killed because they get out of memory (OOMKi...
[13:01:57] <klausman> isaranto: APIGW change has been pushed everywhere
[13:02:05] <isaranto> awesome thank you!
[13:04:24] <wikibugs> Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10693521 (isarantopoulos) **request**: ` curl https://api.wikimedia.org/service/lw/inference/v1/models/edit-check-staging:pre...
[13:04:40] <wikibugs> Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10693522 (isarantopoulos) Open→Resolved a:isarantopoulos
[14:28:59] <wikibugs> (CR) AikoChou: [C:+1] "LGTM! Only a few minor issues." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:41:37] <wikibugs> (PS18) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:42:18] <isaranto> aiko: thanks for the review, I updated it, lemme know if it is ok!
[14:42:21] <wikibugs> (CR) CI reject: [V:-1] edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:42:51] <wikibugs> (PS19) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:49:33] <wikibugs> (CR) AikoChou: [C:+1] edit-check: implement for batch prediction (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:57:17] <wikibugs> Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10694215 (isarantopoulos) I have verified the above by looking at a specific pod: 1. Found some BrokenProcessPool [[ https://logstash.wikimed...
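Note on the memory growth discussed above: isaranto's working hypothesis (12:26) is that a dead worker leaves the process pool broken without releasing its memory, so the model is loaded again in a newly spawned process, and the T387019 comment reports BrokenProcessPool errors in the pod logs. The sketch below is not the reference-need service code. It is a minimal Python illustration, using only the standard `concurrent.futures` module, of one way to shut a broken pool down explicitly before replacing it. `ResilientPool`, `factory`, and `max_restarts` are hypothetical names introduced for this example.

```python
from concurrent.futures import Executor
from concurrent.futures.process import BrokenProcessPool
from typing import Any, Callable


class ResilientPool:
    """Hypothetical wrapper: replaces the executor when it reports as broken."""

    def __init__(self, factory: Callable[[], Executor], max_restarts: int = 1):
        self._factory = factory          # builds a fresh executor on demand
        self._max_restarts = max_restarts
        self._pool = factory()
        self.restarts = 0                # how many times the pool was replaced

    def run(self, fn: Callable[..., Any], *args: Any) -> Any:
        """Run fn(*args) in the pool, retrying after a pool replacement."""
        attempts = 0
        while True:
            try:
                # Both submit() and result() can raise BrokenProcessPool.
                return self._pool.submit(fn, *args).result()
            except BrokenProcessPool:
                if attempts >= self._max_restarts:
                    raise
                attempts += 1
                self._restart()

    def _restart(self) -> None:
        # Shut the broken pool down explicitly so its resources are released
        # before a replacement is created (cancel_futures needs Python 3.9+).
        self._pool.shutdown(wait=False, cancel_futures=True)
        self._pool = self._factory()
        self.restarts += 1

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

With a real process pool the factory could be `lambda: ProcessPoolExecutor(max_workers=2)`. Whether an explicit shutdown like this would stop the memory growth seen in Grafana is an assumption and is not verified here.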
[15:12:39] <wikibugs> (CR) Ilias Sarantopoulos: [C:+2] edit-check: implement for batch prediction (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:12] <wikibugs> (CR) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:23] <wikibugs> (PS20) Ilias Sarantopoulos: edit-check: implement batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:37] <wikibugs> (PS21) Ilias Sarantopoulos: edit-check: implement batch requests/prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:42] <wikibugs> (PS22) Ilias Sarantopoulos: edit-check: implement batch requests/predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:47] <wikibugs> (CR) Ilias Sarantopoulos: [C:+2] edit-check: implement batch requests/predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:14:38] <isaranto> thanks for the review Aiko! I fixed the commit msg and merged!
[15:19:34] <wikibugs> (CR) DCausse: "I think this should be ready to go" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[15:19:40] <wikibugs> (PS2) DCausse: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821)
[15:22:40] <wikibugs> (Merged) jenkins-bot: edit-check: implement batch requests/predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:36:13] <wikibugs> (CR) Kevin Bazira: "Thank you for working on this, David. LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[15:36:50] <wikibugs> (PS3) Kevin Bazira: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[15:44:03] <wikibugs> Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10694418 (isarantopoulos) Updated request after batch prediction implementation ` curl https://api.wikimedia.org/servic...
[15:50:48] <wikibugs> (CR) Kevin Bazira: [C:+2] search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[16:00:50] <wikibugs> Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10694499 (Samwalton9-WMF)
[16:01:34] <wikibugs> (Merged) jenkins-bot: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[16:04:19] <isaranto> going afk folks, have a nice evening/rest of day!
[17:23:06] <wikibugs> Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10694988 (Kgraessle) Adding the thresholds we arrived at from the analysis that was complete...
[20:36:03] <wikibugs> Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10695756 (kostajh) >>! In T348298#10694988, @Kgraessle wrote: > Adding the thresholds we arr...