[06:54:54] good morning!
[06:55:20] good morning :)
[06:59:29] o/ looking for a small patch review updating the articletopic image on staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1192782
[07:16:19] +1!
[07:27:27] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11232115 (BWojtowicz-WMF) > On an somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better...
[07:27:59] isaranto: thank you!
[08:19:07] (PS1) Bartosz Wójtowicz: articletopic: Update locust tests to use both `page_id` and `page_title`. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1192833 (https://phabricator.wikimedia.org/T371021)
[08:24:43] Hey folks, I am free from tasks and I was planning to work on this: https://phabricator.wikimedia.org/T403236
[08:24:43] Is there anything else that is higher priority? Does anyone need any support?
[08:27:06] I'd be looking for a small review updating the locust tests for articletopic - https://gerrit.wikimedia.org/r/1192833 🥺
[08:33:58] o/ georgekyz feel free to do that anytime but it is not that important. could you take a look at https://phabricator.wikimedia.org/T405358 instead?
[08:34:10] let me update the description real quick
[08:35:54] Lift-Wing, Machine-Learning-Team: Calculate tone check model service metrics for fixed calendar window - https://phabricator.wikimedia.org/T405338#11232287 (isarantopoulos) Open→Resolved a:isarantopoulos
[08:41:52] isaranto: Yes sure! I will review the patch first and then will jump to the event_sanitized
[08:42:09] if anybody else needs help you can prioritize that.
[08:54:24] perfect
[08:54:51] Lift-Wing, Machine-Learning-Team: Add LiftWing streams data to event_sanitized (increase data retention) - https://phabricator.wikimedia.org/T405358#11232358 (isarantopoulos)
[08:55:02] Lift-Wing, Machine-Learning-Team: Add LiftWing streams data to event_sanitized (increase data retention) - https://phabricator.wikimedia.org/T405358#11232359 (isarantopoulos) a:gkyziridis
[08:56:32] georgekyz: I updated the description. ping me if you have any questions. If it is straightforward to tackle, please open the appropriate patches and we can continue the discussion from there; otherwise, after you do your first investigation, we can discuss it in today's meeting. thanks!
[09:00:03] (CR) Gkyziridis: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1192833 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[09:08:20] Machine-Learning-Team, Semantic Search: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11232400 (OKarakaya-WMF) I've updated check to a [rubric based approach](https://docs.google.com/spreadsheets/d/1IBVBisx2Ojg0PJvxvOzlYJW4Y_5f2Wp1dOyimpvGEPc/edit?gid=1046284968#gid=1046...
[09:58:10] klausman: o/ ready to test ml-serve1012 as k8s worker - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192856
[09:58:49] :+1:
[10:01:28] FIRING: [3x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:06:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:07:44] ^^^ on it
[10:09:04] Just additional IPs for external services, applying it
[10:12:16] and done
[10:33:38] hello, I've a very small MR in airflow. https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1716
[10:33:38] As we have decided to publish our models to wmf-ml-models, this MR uses a new location.
[10:33:38] /wmf/data/published/wmf-ml-models/addalink/v2
[10:33:38] I'll add all models that are above the release threshold in a separate MR.
[10:33:50] Can you take a look when you have time? @kevinbazira
[10:34:25] ack. looking ...
[10:37:42] +1
[10:38:24] 🙌
[11:01:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[11:06:28] RESOLVED: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[11:33:38] based on the previous communication, I suppose this --^ is also triggered by Tobias' work.
[11:33:55] yes, both the "firing" and the "resolved"
[11:34:17] ack. ty!
[11:49:13] Lift-Wing, Machine-Learning-Team: Add LiftWing streams data to event_sanitized (increase data retention) - https://phabricator.wikimedia.org/T405358#11232847 (isarantopoulos)
[11:49:47] georgekyz: --^ I corrected the task description.
[11:50:36] I mentioned the wrong schema (event_sanitized instead of event)
[11:52:13] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11232863 (isarantopoulos) Since the events that are produced (prediction data) are ingested in a hive table `event.mediawiki...
[11:57:11] isaranto: thank you.
[11:58:27] Machine-Learning-Team, Essential-Work: Update tone-check training pipeline to use Parquet datasets instead of CSV - https://phabricator.wikimedia.org/T406117 (kevinbazira) NEW
[12:11:01] hello, https://analytics.wikimedia.org/published/wmf-ml-models/addalink/v2/jawiki/ moving new addalink models to the new location worked 🎉 I'll create a new MR to deploy the rest of the models that are above the threshold.
[12:11:21] \o/
[12:15:05] deployed by the following airflow pipeline: https://airflow-ml.wikimedia.org/dags/add_a_link_release_prod/grid?dag_run_id=manual__2025-10-01T11%3A41%3A35.122648%2B00%3A00&tab=mapped_tasks&task_id=prod_release
[12:20:58] (PS2) Bartosz Wójtowicz: articletopic: Update locust tests to use both `page_id` and `page_title`. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1192833 (https://phabricator.wikimedia.org/T371021)
[12:22:08] (CR) Bartosz Wójtowicz: articletopic: Update locust tests to use both `page_id` and `page_title`. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1192833 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[12:23:43] (CR) Bartosz Wójtowicz: [C:+2] "Thank you for review @gkyziridis@wikimedia.org <3" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1192833 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[12:24:19] (Merged) jenkins-bot: articletopic: Update locust tests to use both `page_id` and `page_title`. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1192833 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[12:30:54] ozge_: wow nice! \o/
[12:47:13] Lift-Wing, Machine-Learning-Team: Add LiftWing streams data to event_sanitized (increase data retention) - https://phabricator.wikimedia.org/T405358#11233041 (isarantopoulos)
[12:58:07] (CR) Ilias Sarantopoulos: [C:+2] "Goodbye!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1189846 (https://phabricator.wikimedia.org/T405083) (owner: Ilias Sarantopoulos)
[12:58:47] awesome Ozge
[13:06:53] (Merged) jenkins-bot: nsfw: remove blubber images and code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1189846 (https://phabricator.wikimedia.org/T405083) (owner: Ilias Sarantopoulos)
[13:32:33] Machine-Learning-Team, EditCheck, VisualEditor, Editing-team (Tracking), Epic: Expand language coverage for Tone Check - https://phabricator.wikimedia.org/T394448#11233195 (ppelberg)
[13:37:23] FIRING: SLOMetricAbsent: linkrecommendation-requests - https://slo.wikimedia.org/?search=linkrecommendation-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:59:07] Machine-Learning-Team, Patch-For-Review: Add support for K8s 1.23 on Trixie - https://phabricator.wikimedia.org/T405891#11233354 (elukey) I had to copy over some extra packages: * calicoctl * wikimedia-lvs-server * dragonfly-* * nerdctl * crictl Everything seems to work, but the most notable issue is t...
[14:01:59] ml-serve1012 (one of the new big gpu nodes) is currently in experimental state on ml-serve-eqiad
[14:02:03] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11233362 (BWojtowicz-WMF) In this case I also agree that querying directly without Data Gateway would be the best option for...
[14:02:08] so far it seems that the main functionalities are working
[14:02:23] including the amd gpu device plugin
[14:02:38] I've set it as cordoned for the time being, so we don't accidentally schedule pods on it
[14:36:02] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Add LiftWing streams data to event_sanitized (increase data retention) - https://phabricator.wikimedia.org/T405358#11233545 (gkyziridis) ==Update== I will make two independent patches one for each table.
[14:37:23] RESOLVED: SLOMetricAbsent: linkrecommendation-requests - https://slo.wikimedia.org/?search=linkrecommendation-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:54:12] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11233589 (Eevans) >>! In T402984#11233362, @BWojtowicz-WMF wrote: > In this case I also agree that querying directly without...
[18:49:44] FIRING: LiftWingServiceErrorRate: ...
[18:49:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:54:44] RESOLVED: LiftWingServiceErrorRate: ...
[18:54:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[20:03:25] Lift-Wing, Machine-Learning-Team, Wikidata: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179 (diego) NEW
[20:03:44] Lift-Wing, Machine-Learning-Team, Wikidata: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11234935 (diego)
[23:47:57] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11235394 (Eevans) p:Triage→Medium
[23:49:50] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11235400 (Eevans)