[05:22:14] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11347913 (kevinbazira) As we prepare to publish the revertrisk-wikidata model-server image...
[06:15:58] (PS1) Kevin Bazira: revertrisk-wikidata: update CI config to publish the model-server image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202378 (https://phabricator.wikimedia.org/T406179)
[06:28:31] (CR) Kevin Bazira: "I built the model-server image locally and the largest layer is ~1.81GB as shown here: https://phabricator.wikimedia.org/T406179#11347913" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202378 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[07:58:29] good morning
[08:13:59] bartosz: o/ re: monitoring label - it is needed by the custom "modules" in deployment-charts, that we use for some scaffolding of our pods. It adds the prometheus scrape label as you pointed out, but afaics our prometheus config is instructed to pull metrics from pods that have the kserve scrape label or the istio scrape label
[08:14:14] so effectively I think that the monitoring: false that we set for staging doesn't really count
[08:14:22] we can remove it for clarity
[08:14:39] lemme know if it makes sense
[09:06:09] elukey: o/ I see, thank you for looking into it! So should we remove the `monitoring.enabled` values altogether from our charts?
[09:15:39] bartosz: if we have it on staging envs I'd say yes, it is not really needed
[09:16:02] morning! :)
[09:21:04] good morning! :)
[09:21:16] elukey: what about production, does it influence anything there?
[09:24:14] in theory no, do we have discrepancies? Like some enabled and some not
[09:31:00] we have it set to true in all prod value files
[09:45:42] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11348670 (OKarakaya-WMF) I've collected current performance rates and counts of the candidate wikis: {F69947228} {F69947238} {F69947248} - en for comparison: {F699472...
[09:47:15] bartosz: that is probably ok, if we set it in values.yaml and not override it for staging we'll have the same metric settings everywhere
[09:47:25] so I'd be in favor of removing only the staging ones for the moment
[09:47:28] does it make sense?
[09:51:29] yess, it makes sense to me!
[09:51:54] super
[10:42:08] (CR) Gkyziridis: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202378 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:49:12] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414 (achou) NEW
[11:00:31] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11348931 (achou) Hi @klausman, I'd love to hear your thoughts on what we need to do to make this Cassandra integration.
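Note on the `monitoring.enabled` exchange (08:13 to 09:51): the reasoning is that a default set once in values.yaml applies everywhere unless an environment file overrides it. A minimal Python sketch of that layering follows. It is a simplified model of how layered values files combine, not the deployment-charts code; the dictionaries and the `merge_values` helper are hypothetical, and only the `monitoring.enabled` key and the true/false settings come from the chat.

```python
from copy import deepcopy


def merge_values(base, override):
    """Recursively merge override into a copy of base (hypothetical helper).

    Nested mappings are merged key by key; any other value in the
    override simply replaces the base value.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Assumed chart default, as suggested in the chat ("set it in values.yaml").
chart_defaults = {"monitoring": {"enabled": True}}

# Assumed staging override before the cleanup, and after removing it.
staging_before = {"monitoring": {"enabled": False}}
staging_after = {}

# Prod value files set it to true, per the chat.
prod_override = {"monitoring": {"enabled": True}}

effective_staging_before = merge_values(chart_defaults, staging_before)
effective_staging_after = merge_values(chart_defaults, staging_after)
effective_prod = merge_values(chart_defaults, prod_override)

print(effective_staging_before, effective_staging_after, effective_prod)
```

With the staging override removed, staging and prod resolve to the same setting in this model, which matches the "same metric settings everywhere" outcome described above.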
[11:09:20] (CR) Kevin Bazira: [C:+2] revertrisk-wikidata: update CI config to publish the model-server image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202378 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[11:18:19] (Merged) jenkins-bot: revertrisk-wikidata: update CI config to publish the model-server image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202378 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[12:09:41] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11349324 (Michael) >>! In T401021#11343856, @achou wrote: > @dcausse Thanks a...
[12:21:46] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11349347 (DPogorzelski-WMF) is Cassandra running on the prod network? if yes it should be reachable at a given address/port with a set of credentials, no?
[12:22:10] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11349349 (DPogorzelski-WMF) for local workflows it might be good to have it in a docker compose
[12:26:51] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11349360 (gkyziridis) | **Wiki** | **Thresho...
[12:28:36] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11349361 (DMburugu)
[12:29:14] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11349364 (DMburugu)
[12:33:29] which team is managing cassandra?
[12:38:16] dpogorzelski: SRE's Data Persistence team
[12:42:25] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11349419 (DMburugu) a:DMburugu
[13:03:06] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11349477 (achou) @DPogorzelski-WMF Yes, Cassandra is on the prod network, and @Eevans should be able to provide more info about this.
[13:08:38] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11349490 (BWojtowicz-WMF) > for local workflows it might be good to have it in a docker compose I agree, will add a local Cassandra docker compose setup in incoming patch adding Cassandra i...
[13:34:23] (PS1) Kevin Bazira: revertrisk-wikidata: improve error handling and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202710 (https://phabricator.wikimedia.org/T406179)
[13:34:33] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11349574 (DPogorzelski-WMF) Egress seems to be disabled but that could be just the default chart value since some services can clearly make egress calls to fetch models, kafka, etc. need to...
[13:42:39] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11349597 (Kgraessle) @Samwalton9-WMF @OTicho...
[13:45:45] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11349604 (Samwalton9-WMF) >>! In T408607#113...
[13:53:44] (CR) Gkyziridis: "Nice logging and error handling Kevin!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202710 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[13:53:48] (CR) Gkyziridis: [C:+1] revertrisk-wikidata: improve error handling and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202710 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[13:59:16] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438 (Kgraessle) NEW
[14:00:07] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11349682 (Kgraessle) a:DMburugu→None
[14:00:46] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11349686 (Kgraessle)
[14:00:55] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11349687 (Kgraessle) p:Triage→High
[14:02:39] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11349692 (Kgraessle) Perfect, I went ahead a...
[14:51:44] FIRING: LiftWingServiceErrorRate: ...
[14:51:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:56:44] RESOLVED: LiftWingServiceErrorRate: ...
[14:56:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:03:29] looking --^
[15:19:02] aiko: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1202665 looks good?
[15:23:38] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11350125 (Eevans) >>! In T409414#11349347, @DPogorzelski-WMF wrote: > is Cassandra running on the prod network? if yes it should be reachable at a given address/port with a set of credential...
[15:49:46] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11350261 (OKarakaya-WMF)
[16:04:10] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11350324 (elukey) On the Lift Wing side, we need to configure two things: 1) Istio routing to be able to handle TCP calls to the cassandra cluster in the istio proxy sidecar, since it acts...
[16:20:23] btw i don't have the +2 capability in gerrit in case i'm expected to be able to +2 on merge requests once approved
[16:20:51] ah yes lemme fix it
[16:26:16] dpogorzelski: can you try now to reload gerrit?
[16:28:47] Machine-Learning-Team, SRE, SRE-Access-Requests: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11350458 (elukey) To keep archives happy: I added the uid to the `ops` ldap group as well!
[16:29:25] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11350459 (Eevans) >>! In T409414#11350324, @elukey wrote: > [ ... ] > > @Eevans Hi! Is there a load balancing endpoint in front of the cassandra nodes, or should we randomly pick one to con...
[16:58:12] works thx :)
[16:58:43] as FYI I added you to the `ops` LDAP group, that behind the scenes is used also to grant perms in gerrit etc..
[16:59:02] so we don't have to add you to multiple groups etc..
[16:59:41] kk
[16:59:47] thx
[17:03:58] dpogorzelski: do you want help in deploying https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1202194 ?
[17:33:19] (going afk, tomorrow :)
[17:35:12] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469 (achou) NEW
[17:35:58] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11350769 (achou)
[17:36:01] Machine-Learning-Team: Create a Revise Tone Task Generator in LiftWing - https://phabricator.wikimedia.org/T408538#11350770 (achou)
[17:42:17] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11350794 (achou)
[17:42:18] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11350793 (achou)
[17:44:59] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11350815 (achou)
[17:46:17] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11350826 (achou)
[17:46:18] Machine-Learning-Team, Cassandra, Goal, OKR-Work: Provision Cassandra + Data Gateway resources for Tone Check - https://phabricator.wikimedia.org/T408129#11350825 (achou)
[17:56:12] hello, I purely guessed this channel name:) is this the right place for the "machinetranslations" service on k8s?
[17:56:48] as in: deployment-charts/helmfile.d/services/machinetranslation
[19:23:18] (PS1) Sbisson: Sort page collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1202781
[20:10:16] (PS1) Sbisson: Don't ignore sitematrix and interwiki map errors [research/recommendation-api] - https://gerrit.wikimedia.org/r/1202795
[20:24:21] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11351437 (Ottomata) I wanted to understand how multi-DC ness relates to all the pieces here. Just writing down what I found: - Kafka jumbo-eqiad is only in eqiad - LiftWing is onl...
[20:28:24] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11351442 (Ottomata)