[07:40:46] hello folks
[07:53:42] Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring: Add language support for Malay language (ms) - https://phabricator.wikimedia.org/T349968 (elukey) @Hakimi97 nice! Do you want to keep this task open or should we close? (feel free to contact us on Libera irc a...
[08:12:22] finally https://pypi.org/project/kserve/ has 0.11.2!
[08:59:45] hello! I'll be rolling out model servers today with kserve 0.11.2 then!
[09:06:52] o/
[09:18:38] folks somebody should check the alerts that were fired yesterday
[09:18:42] (not me :)
[09:18:55] I filed a change to improve the dashboard link
[09:19:16] but there seems to be some retry happening in changeprop
[09:19:51] hey I checked them but I haven't found the issue yet
[09:20:03] it wasn't directed to you Ilias :D
[09:20:10] it was more "the team should :)"
[09:20:35] it doesn't seem something super urgent, but we can open a task in case
[09:21:38] got it! I'm just initializing the conversation though :)
[09:22:18] I don't get what the issue is. every time the alert fires it is because of the retry and it is resolved in the next 10-15 minutes
[09:22:46] I'm looking at other jobs at the time to see if they have the same issues
[09:23:26] in theory we shouldn't have retry jobs
[09:23:52] I suspect that Lift Wing is still returning 500s every now and then
[09:57:55] There are also puppet alerts that I am looking at (since it was likely me who broke it)
[10:29:38] going to the dentist in a few (sigh, joyful activity for Monday), will take an early lunch break
[10:30:40] Good luck!
[10:37:30] good luck 🤞
[10:48:12] Ok, Puppet mess un-messed. Will now break, er I mean migrate the codfw machines to v7
[10:48:42] isaranto: unless you want me to help with/look into the Kafka lag alerts
[10:50:17] klausman: go ahead. The kafka lag alerts are not urgent but we need to look into them. I'm looking at LW logs and I'll open a task so it doesn't fall through the cracks
[10:50:30] Roger that
[10:50:33] I mean go ahead and work on codfw :)
[10:52:44] That's how I read it :)
[10:53:21] (PS12) Kevin Bazira: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123)
[11:02:47] etcd v7 migration done, no issues
[11:05:17] (CR) Kevin Bazira: "I have tested this patch and it works well with the new kserve==0.11.2." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[11:16:23] control plane also migrated, now for the last role, k8s workers
[11:36:22] Machine-Learning-Team: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (isarantopoulos)
[11:37:38] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikimedia-production-error: Test failure with WatchedItemQueryServiceExtension::modifyWatchedItemsWithRCInfoQuery() - https://phabricator.wikimedia.org/T222677 (isarantopoulos)
[11:37:41] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikimedia-production-error: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305 (isarantopoulos) p:Unbreak!→Triage
[11:37:57] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, Growth-Team, PageTriage: PageTriage requires ORES to be installed - https://phabricator.wikimedia.org/T200412 (isarantopoulos) p:Unbreak!→Triage
[11:45:19] Machine-Learning-Team, ORES, Beta-Cluster-Infrastructure, Wikimedia-production-error: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T243553 (isarantopoulos)
[11:45:54] I'm tidying up the unsorted column on our board to bring it in a shape that will help facilitate meetings. Done column will be a mess with a lot of old resolved tasks but we'll have just one messy column from the first 5 (or so) columns that we usually look at
[11:46:26] please don't pay attention to the messages coming in in the next 30 minutes (bulk operations are so slow :( )
[11:59:20] * klausman lunch
[11:59:43] * isaranto lunch!
[12:09:02] (CR) Elukey: article-descriptions: add article-descriptions model server (5 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[12:09:57] (CR) Elukey: article-descriptions: add article-descriptions model server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[12:13:24] Machine-Learning-Team, ORES, MediaWiki-Platform-Team, PoolCounter: Poolcounter lock time spike - https://phabricator.wikimedia.org/T205976 (isarantopoulos)
[12:14:29] Machine-Learning-Team, MediaWiki-extensions-ORES, Cognate, MediaWiki-Core-Tests, and 3 others: Can PHPUnit @covers tags cover entire files - https://phabricator.wikimedia.org/T183604 (isarantopoulos)
[12:24:39] Machine-Learning-Team: Store and fetch the recommendation-api embedding from Swift - https://phabricator.wikimedia.org/T343576 (isarantopoulos) Open→Resolved
[12:24:41] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (isarantopoulos)
[12:24:50] Machine-Learning-Team: Store and fetch the recommendation-api embedding from Swift - https://phabricator.wikimedia.org/T343576 (isarantopoulos) Resolved→Open
[12:24:53] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (isarantopoulos)
[12:25:00] Machine-Learning-Team: Adapt the recommendation-api to use float32 preprocessed numpy arrays from swift - https://phabricator.wikimedia.org/T346218 (isarantopoulos) Open→Resolved
[12:25:05] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (isarantopoulos)
[12:25:15] Machine-Learning-Team: Upload recommendation-api preprocessed numpy binaries to Swift - https://phabricator.wikimedia.org/T346411 (isarantopoulos) Open→Resolved
[12:25:17] Machine-Learning-Team: Adapt the recommendation-api to use float32 preprocessed numpy arrays from swift - https://phabricator.wikimedia.org/T346218 (isarantopoulos)
[12:25:25] Machine-Learning-Team: Store and fetch the recommendation-api embedding from Swift - https://phabricator.wikimedia.org/T343576 (isarantopoulos) Open→Resolved
[12:25:27] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (isarantopoulos)
[12:25:36] Machine-Learning-Team: Deploy nllb-200 to production - https://phabricator.wikimedia.org/T349163 (isarantopoulos) Open→Resolved
[12:25:38] Machine-Learning-Team, Goal: Goal: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time. - https://phabricator.wikimedia.org/T348154 (isarantopoulos)
[12:25:46] Machine-Learning-Team: Refactor LLM class and model server to run locally - https://phabricator.wikimedia.org/T349371 (isarantopoulos) Open→Resolved
[12:25:48] Machine-Learning-Team: Refactor inference services repo to allow local runs - https://phabricator.wikimedia.org/T347404 (isarantopoulos)
[12:25:53] Machine-Learning-Team: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (isarantopoulos) Open→Resolved
[12:26:01] Lift-Wing, Machine-Learning-Team: Create llm namespace on Lift Wing - https://phabricator.wikimedia.org/T348661 (isarantopoulos) Open→Resolved
[12:26:04] Machine-Learning-Team, Goal: Goal: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time. - https://phabricator.wikimedia.org/T348154 (isarantopoulos)
[12:58:40] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra - https://phabricator.wikimedia.org/T351390 (fgiunchedi) The other option to consider is to push the "base" recording rules down to Prometheus (off from Thanos rule), that should make it more lightweight for thano...
[13:01:33] isaranto: (when you have a moment) https://gerrit.wikimedia.org/r/c/operations/alerts/+/975736
[13:02:49] thanks <3
[13:09:22] klausman: I changed the patch for the memory alert to only include floats. https://gerrit.wikimedia.org/r/c/operations/alerts/+/963724
[13:09:22] elukey: shall we merge this afterwards or do we want a review from herron as well?
[13:12:28] isaranto: you folks go ahead like I wasn't in the ML team :)
[13:39:07] (PS13) Kevin Bazira: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123)
[13:42:16] (PS1) Ilias Sarantopoulos: Upgrade model servers to kserve 0.11.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633)
[13:52:47] (CR) Kevin Bazira: article-descriptions: add article-descriptions model server (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[13:59:08] (CR) Elukey: article-descriptions: add article-descriptions model server (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[14:05:40] (CR) Ilias Sarantopoulos: article-descriptions: add article-descriptions model server (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[14:25:33] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey)
[14:36:52] the OpenAI mess is entertaining
[14:36:57] :D
[14:39:14] (PS14) Kevin Bazira: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123)
[14:42:29] I was about to say the same.. wow
[14:44:10] now I am wondering how many engineers will move to microsoft
[14:54:08] https://twitter.com/karaswisher/status/1726598360277356775
[14:54:26] if this is true, then most of them 😛
[14:55:41] (CR) Ilias Sarantopoulos: article-descriptions: add article-descriptions model server (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[14:57:36] (CR) Kevin Bazira: article-descriptions: add article-descriptions model server (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[14:59:16] isaranto: wow didn't see it!
[15:09:12] (CR) Elukey: article-descriptions: add article-descriptions model server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[15:22:30] (CR) Ilias Sarantopoulos: [C: +1] "LGTM! Nice work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[15:24:10] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Patch-For-Review: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (isarantopoulos) a:isarantopoulos
[15:24:54] (CR) Kevin Bazira: [C: +2] "Thank you all for the reviews :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[15:35:49] kserve 0.11.2!!!
[15:36:29] openAI v3.0!
[15:37:29] (Merged) jenkins-bot: article-descriptions: add article-descriptions model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[15:37:56] I have no words for the OpenAI thing
[15:40:16] Machine-Learning-Team, Patch-For-Review: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213 (klausman) 0.11.2 was released recently. I will update the images to that version before proceeding.
[16:33:14] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey) After https://gerrit.wikimedia.org/r/c/operations/puppet/+/975808 I see Thanos recording rule for Pyrra taking way less to compute, and all the dash...
[16:52:56] found another ores tool https://github.com/Ironholds/ores
[16:52:58] in R!
[16:53:29] wow
[16:53:36] not very up to date luckily
[16:54:45] it is available in CRAN (R repository) https://cran.r-project.org/web/packages/ores/index.html
[16:55:11] but yes not much to do. it should work with ores-legacy. I'll just leave a comment/issue on GH
[16:58:44] super
[17:00:47] * elukey afk!
[17:00:52] have a good rest of the day folks!
[17:05:44] https://github.com/Ironholds/ores/issues/2
[17:05:49] Ciao Luca!
[17:48:27] Machine-Learning-Team: Update to KServe 0.11 - https://phabricator.wikimedia.org/T337213 (klausman) Images have been built and published: ` `# docker images REPOSITORY TAG IMAGE ID CREATED SIZE d...
[17:48:35] heading out now as well
[17:57:28] night all!
[18:29:01] Machine-Learning-Team, Patch-For-Review: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (achou) Update: The revertrisk-la image (kserve 0.11.1 & knowledge integrity v0.5.0) with model binary v3 has been deployed to staging. I ran some load...
[18:48:27] Machine-Learning-Team, Patch-For-Review: Upgrade model servers to kserve 0.11.2 - https://phabricator.wikimedia.org/T351633 (achou) Just wanted to provide a reference for the revertrisk-wikidata model, which is currently under evaluation and improvement. T343419
[19:10:59] (CR) AikoChou: [C: +1] Upgrade model servers to kserve 0.11.2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/975814 (https://phabricator.wikimedia.org/T351633) (owner: Ilias Sarantopoulos)
[21:29:23] aiko: greaaaat! Happy to see the latency issue seems resolved. I haven't deployed any model server yet with kserve 0.11.2. I'll start rolling them out in the morning.